Vincent J. Wallace
I’ve been working with computers for a long time. Below are descriptions of some of the things I’ve done, in more detail than a resume allows.
I’ve twice in my career salvaged failed projects:
The first time I saved a project was at Raytheon, in the Captive Line Division, which managed the procurement of the integrated circuits used in the guidance system of the Trident missile. They had paid an outside firm (a rumored $1 million) to develop a computerized inventory control system. When I joined I was told that the code was all done, and it just needed a few “tweaks” to get it up and running. Six months later I was finally able to hand off a working system. Having been with the organization for six months I was pretty sure that the system was completely inadequate for what they wanted to do, and it turned out I was right. The system was almost worthless. Management agreed to let me redo the system from scratch (I think at that point they had given up on ever getting a working system, as the system I had brought up was actually their second attempt at a computerized system). Six months later they were using the system I developed, and they were still using it when I left several years later.
My second work of salvation was when I was working as a consultant for hire at EDS. A currency trading company (Electronic Brokering Services) had hired a French firm (Cap Gemini) to develop a new network monitoring system (based on a product called TeMIP) for their worldwide private network. The system was supposed to monitor 1200 nodes. It rolled over and died when tasked with monitoring 50. It was my job to fix it. It was a tense situation, because a $500 million contract was contingent on the software working. Over the course of two months I rewrote about half the Cap Gemini code. That, in combination with a sizeable investment in faster hardware, got the system working. The whole setup was as ugly as hell, but it worked, and it got the contract signed.
I’m a firm believer in doing portable software. Basically, any code that gets written these days should work with any compiler on any processor on any operating system. It’s not hard, but you do need to have an actual design and to understand what your dependencies are.
When I worked at DEC I did an implementation of their CTERM protocol, which is actually a fairly involved protocol. I did it initially for an OSF based Unix. I then adapted it for a BSD based Unix and SCO Unix. I then ported it to Microsoft Windows and to OS/2 (operating systems not generally considered to be closely related to Unix). Finally I also moved it to VMS, although that case was more unusual because the code was running as part of the VMS kernel, rather than as an application. The total time for all those ports was about a month, and that includes the better part of a week tracking down a bug that ultimately turned out to be faulty hardware.
At Airvana when I rewrote their logging framework I did the bulk of the work under Cygwin on my PC. The target was an embedded system running VxWorks on a PowerPC. Doing the work under Cygwin was a million times faster (OK, maybe only ten times faster, but it felt like a million). The amount of effort needed to have it running in both places was down in the noise. I also made sure the code ran under SunOS on a MIPS CPU. That was not just for verification, because some of the functionality was actually needed there. The logging format was binary, and code was needed to translate the binary logs to readable text – and that work was typically done on Sun systems. Interesting side story – my code ended up not being used on the Sun systems. Management decided the Sun code had to be written in Java (my code was in C++), so they had someone in India duplicate the functionality. Apart from the wasted money (offshore labor is cheaper, but not free) and the maintenance headache of keeping two sets of code in sync, it wouldn’t have been too bad – except the Java code was literally 17 times slower than my C++ code. A file that the Java code would take 65 seconds to process my code would do in 3.8 seconds – same machine, same file. The impact was not just academic. Providers like Verizon analyze enormous numbers of log files to understand what’s going on in their network, and they had to put up with the slow conversion times. Not a way to make a good impression with your customers.
I’ve debugged many problems (most of them not my own), and sometimes you have to be creative. Here are some techniques I’ve used for memory corruption and memory leak issues, and a few other odds and ends:
Modified Interrupt Handler
Typically it’s a big help to know when the memory corruption happens. One way is to check for the presence of the corruption on every clock tick. Sometimes it’s even possible to modify the interrupt code on the fly – lock out interrupts, replace an instruction in the handler with a jump to the check code (which jumps back to the handler when done, after performing the replaced instruction), then re-enable interrupts. The beauty of that is it’s faster than rebuilding and reloading.
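A minimal sketch of the check the tick handler can jump to – the guard region, pattern, and names here are all invented for illustration; in practice you’d be checking whatever structure was actually being stomped on:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical guard region placed next to the data we suspect is being
// corrupted, pre-filled with a known pattern.
static constexpr std::uint32_t kGuardPattern = 0xDEADBEEF;
static std::uint32_t guard_words[4] = {kGuardPattern, kGuardPattern,
                                       kGuardPattern, kGuardPattern};

// Called from the (patched) clock-tick handler.  It's cheap enough to
// run every tick, and it narrows the corruption down to one tick
// interval.
bool guard_intact() {
    for (std::uint32_t w : guard_words)
        if (w != kGuardPattern) return false;
    return true;
}
```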
Modified Task Switch Code
If memory is fine when switching into thread X, but is corrupt when switching out of thread X, then it’s obvious that thread X created the corruption (well, unless an interrupt handler did it).
Customized Malloc/Free
I’ve done a fair number of customized malloc/free implementations over the years. For example, modifying the memory block header to record information about the code that allocated the block.
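As a sketch of the idea (the header layout and names are invented, and a real version would typically capture the caller’s return address automatically rather than taking it as a parameter):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Debug allocator that prepends a header recording who allocated each
// block and how big it is.
struct BlockHeader {
    std::size_t size;      // requested size
    const void* caller;    // identifies the allocating code
    std::uint32_t magic;   // detects header corruption / double free
};
static constexpr std::uint32_t kMagic = 0xA110C8ED;

void* debug_malloc(std::size_t size, const void* caller) {
    auto* hdr = static_cast<BlockHeader*>(
        std::malloc(sizeof(BlockHeader) + size));
    if (!hdr) return nullptr;
    hdr->size = size;
    hdr->caller = caller;
    hdr->magic = kMagic;
    return hdr + 1;  // caller sees the memory just past the header
}

void debug_free(void* p) {
    if (!p) return;
    BlockHeader* hdr = static_cast<BlockHeader*>(p) - 1;
    assert(hdr->magic == kMagic);  // catches corruption and double frees
    hdr->magic = 0;
    std::free(hdr);
}
```

When chasing a leak or a stomped block, you can walk the heap (or just inspect a bad block in the debugger) and immediately see who allocated it.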
Stack Frame Check Summing
How do you debug a nasty stack frame corruption problem? Well, one way is to generate a checksum for a routine’s stack frame before it calls whatever subroutine it’s going to call. And then in each subroutine farther down the call chain you insert extra code in the routine’s epilogue to recalculate the checksum and compare it to the correct value. When they don’t match, you’ve found the subroutine that is doing the corruption.
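In a real system the checksum runs over the actual stack frame via the frame pointer; in this simplified sketch a plain buffer stands in for the frame:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Checksum a region of memory standing in for the caller's stack
// frame.  The caller computes this before descending into subroutines;
// each callee's epilogue recomputes it and compares.  A mismatch
// fingers that callee as the corrupter.
std::uint32_t frame_checksum(const std::uint32_t* frame,
                             std::size_t words) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < words; ++i)
        sum ^= frame[i];  // XOR is cheap; any decent mix would do
    return sum;
}
```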
Object Counting
If you’re doing a true object oriented design (as opposed to just generating unstructured garbage code in C++ rather than in FORTRAN), then of course your system is keeping track of how many of each type of object there are. You simply ask the system to display the counts. When you see there are 2000 foobar objects when there should only be 10, you’ve immediately determined that you have a memory leak, and you know where the leaked memory is going.
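One cheap way to get those counts in C++ is a counting base class that every object type derives from (a sketch; the class and type names are invented):

```cpp
#include <cassert>
#include <cstddef>

// Per-class instance counting: derive each object type from
// Counted<T> and the live count for that type is always available.
template <typename T>
class Counted {
public:
    Counted() { ++count_; }
    Counted(const Counted&) { ++count_; }
    ~Counted() { --count_; }
    static std::size_t live() { return count_; }
private:
    static std::size_t count_;
};
template <typename T> std::size_t Counted<T>::count_ = 0;

// A hypothetical object type from the discussion above.
struct Foobar : Counted<Foobar> {};
```

A CLI command that dumps `live()` for every registered type gives you the leak report described above.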
DOM Reset Problem
I’ve written a little article about one of the oddest bugs I’ve ever worked on. Here’s the link: +++ Are Bees Bugs
Stack Residue Analysis
Gdb and other debuggers are great, and should be used whenever possible, but they don’t give you access to all the information that is available in a core dump. A thread’s stack contains not only its current call chain, but also the “ghosts” of recent activity (what a co-worker used to call the “stack residue”). If you understand how computers actually work you can look beyond the current stack frame to get an idea of what happened in the past. Also, because stack frames are not zeroed out when instantiated, and because not all local variables are always used, even currently active frames often have leftover data from previous call chains, which can sometimes be reconstructed! None of this type of analysis is fun, but sometimes it is necessary.
The following “activities” were all done by me – on my own. That is, I took the initiative to do them on my own, on top of my official job responsibilities.
Boot Up Time
One employer’s (Airvana) products were in the telecom industry, which likes to tout five nines (99.999%) reliability. If you work through the math that means you have less than 6 minutes of allowable downtime per year. Now Airvana’s initial product was taking 15-20 minutes to boot. So even a single reboot would play havoc with the metrics. I took some time to look into the matter. One thing I found was that it was taking about 20 seconds to transfer 6 megabytes of data across a PCI bus. Even weirder, I was the only person who thought that was worthy of note. If you don’t find that number suspicious you need to take the time to estimate how long it should take to get those 6 meg across a typical PCI bus. Anyway, I got the boot time down to about 5 minutes. Not perfect, but much better than before. What did I change? They were using TFTP to download images within the chassis. TFTP uses 512-byte blocks. One change was to use a custom TFTP that used 1500-byte blocks (since this was all internal to a single chassis, compatibility was not an issue). Another change was to switch to transferring compressed images rather than uncompressed images. Another problem was that the disk performance of the chassis was extremely poor. When trying to download the same image to multiple cards the disk block cache would thrash massively, turning poor performance into disastrous performance. The fix was to copy the images into a RAM disk, and download from there. On a personal note, I’ve never been very impressed with the telecom industry’s claims to care so much about reliability, but maybe that’s because I started out my career involved with the guidance systems for nuclear-tipped missiles, where there really was a sense of caring about reliability.
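If you want the estimate: classic 32-bit, 33 MHz PCI has a theoretical peak of about 133 MB/s, so even at an assumed (and deliberately pessimistic) 10% bus efficiency, 6 megabytes should cross in well under a second – nowhere near 20 seconds:

```cpp
#include <cassert>

// Back-of-envelope transfer time: size divided by effective bandwidth.
// 133 MB/s is the theoretical peak of classic 32-bit/33 MHz PCI; the
// efficiency factor is an assumption, not a measurement.
double transfer_seconds(double megabytes, double bus_mb_per_s,
                        double efficiency) {
    return megabytes / (bus_mb_per_s * efficiency);
}
```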
Firmware Documentation
Airvana’s next-to-last product was ATCA based. Now an ATCA chassis is a complex beast. In this case there were 22 different pieces of firmware. Until I got involved, information about the firmware was just kept in random people’s heads, or maybe on the C drives of their individual laptops. I was the first person to actually put together a list of all the firmware (yes, I understand the concept of a list! I’m not being facetious. It’s amazing how many people don’t get the concept). I also set up a corporate wiki, in which I created a page for each piece of firmware (there was also a lot of other stuff in the wiki, not germane to this example). What do you put on such a page? Well, information about how to find what firmware version is in use, how to verify firmware file contents, how to actually do an upgrade, what the different releases were and what changed in each release. Here’s a link to a copy of what one of the pages looked like: +++LINK
Build Times
Airvana for a long time had a very poor build environment (if you changed a low level header file, forget it – it would take all night for a build to complete). I helped speed things up. I tracked down part of the problem to the compiler having to search on the order of 100 different subdirectories for header files. My solution? Create a single directory with links to all the public header files, so most of the time the compiler only had to search two directories (the current directory and the link directory). The result? Build times were cut in half. Yes, it really made that big a difference.
Flash Image Signatures
Airvana’s products used flash memory, and the upgrade software would happily program anything into flash. It would replace the executable mission code with a Microsoft Word document, if that’s what you gave it. I did not think that was sufficiently robust. So I defined a “signature” to be added to the end of the files, modified the build process to add it, and modified the upgrade software to check it. This made it a lot more difficult to accidentally blow away a system by putting in a bad file. As it turns out, tar implementations tend to ignore extra bytes at the end of a file, so it was also possible to add a signature to the end of a release file (Airvana’s releases were done via tar files), and so the upgrade software could also verify that a file called “v6.2” really did contain version “6.2”.
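A sketch of how such a trailing signature can work (the marker bytes and the layout are invented; the real format was different):

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Invented signature layout, reading backward from the end of file:
//   [ ...payload... ][ version string ][ 1-byte length ][ 8-byte marker ]
static const char kMarker[8] = {'F','W','S','I','G','v','0','1'};

void append_signature(std::vector<char>& file, const std::string& version) {
    file.insert(file.end(), version.begin(), version.end());
    file.push_back(static_cast<char>(version.size()));
    file.insert(file.end(), kMarker, kMarker + sizeof(kMarker));
}

// Returns the embedded version string, or "" if no valid signature.
std::string check_signature(const std::vector<char>& file) {
    if (file.size() < sizeof(kMarker) + 1) return "";
    const char* end = file.data() + file.size();
    if (std::memcmp(end - sizeof(kMarker), kMarker, sizeof(kMarker)) != 0)
        return "";
    std::size_t len =
        static_cast<unsigned char>(*(end - sizeof(kMarker) - 1));
    if (file.size() < sizeof(kMarker) + 1 + len) return "";
    return std::string(end - sizeof(kMarker) - 1 - len, len);
}
```

Because everything is appended after the payload, a tar reader still sees a valid archive, while the upgrade software can refuse anything whose signature is missing or whose version doesn’t match the file name.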
Core Dump Functionality
When I joined Airvana, if a system crashed all you got was a dump of the register contents and a stack backtrace of the currently running thread. Sometimes that’s enough, but you’re certainly not going to track down a memory leak with just that information. I was the person who said we needed core dump functionality, and I was the one who figured out how to do it, including implementing the backdoor communication channel from the line cards to the system controller card, so that the exception handler on line cards would have a way to save the core dump data. BTW, this allows me to claim that I saved Airvana from losing its largest customer. How is that? At one point Verizon was so upset with the quality of the product that they threatened to kick Airvana out of their network. As bad as things were, they would have been much worse without the core dump functionality, which at least allowed us to figure out and fix the issues. Would Verizon really have kicked Airvana out? Probably not. But without the core dumps, resolving issues would have been much more problematic, and maybe things would have gotten so bad that Verizon really would have tossed Airvana!
It sometimes makes sense to NOT generate code directly, but to write a framework to generate the code for you:
Inventory Control GUI
For the previously mentioned inventory control system I knew the creation of the user interface screens was going to be an iterative process. I was dealing directly with the people who would be using the software, and they had no compunctions at all about requesting constant tweaks to how things looked. Rather than spending endless hours adjusting the code that generated the screens, I defined a syntax for describing how a screen should look, and wrote a program that would take that description and generate the code that would draw the screen when compiled and run. It made everybody’s life easier. The users were happy that changes could be done quickly, and I avoided hours of drudge work.
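To give the flavor of the approach, here’s a toy version – a one-keyword description syntax and a generator that emits display code. The syntax, the `label` keyword, and the emitted `draw_label` call are all invented for illustration; the original system used its own notation:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Input line:    label <row> <col> <text>
// Emitted code:  draw_label(<row>, <col>, "<text>");
std::string generate_screen_code(const std::string& description) {
    std::istringstream in(description);
    std::ostringstream out;
    std::string keyword;
    while (in >> keyword) {
        int row, col;
        std::string text;
        in >> row >> col;
        std::getline(in, text);                      // rest of line
        if (!text.empty() && text[0] == ' ') text.erase(0, 1);
        if (keyword == "label")
            out << "draw_label(" << row << ", " << col
                << ", \"" << text << "\");\n";
    }
    return out.str();
}
```

When a user wants a field moved, you edit one description line and regenerate, instead of hand-editing screen-drawing code.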
Logging Framework Data Structures
Debugging embedded systems can be painful, especially at a customer site. Verizon is typically not going to let you set up an ICE (in-circuit emulator) at one of their operational cell towers. Personally, I believe in incorporating test & debug capabilities in software right from the get-go. For the previously discussed logging framework all the objects that made up the framework were linked together into a management tree, which resulted in each object having a unique name, for example: /logging/buffer_pool/security. That allowed me to set up the CLI so I could interact with any of the objects at will. A specific goal was to be able to print each object, just as you can when debugging with gdb. To avoid having to write an enormous amount of repetitive grunt code I developed a program whose input was a description of what the data members of the object should be, and whose output was a data structure definition for holding that data, and a metadata structure describing that data. That meant that printing an object’s contents was simply a matter of calling a generic routine, passing it a pointer to the data structure and a pointer to the structure’s metadata. I then had a powerful debug ability that worked anywhere – my system, a QA system, even a customer system – with no special setup.
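A toy version of the struct-plus-metadata output (the names are invented, and this sketch handles only int members, where the real generator covered the full range of member types):

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Metadata describing one member of a generated structure.
struct FieldMeta {
    const char* name;
    std::size_t offset;   // byte offset of the member within the struct
};

// What the generator might emit for a hypothetical "buffer pool"
// object: the data structure itself plus a table describing it.
struct BufferPoolStats {
    int allocated;
    int free_count;
};
static const FieldMeta kBufferPoolMeta[] = {
    {"allocated",  offsetof(BufferPoolStats, allocated)},
    {"free_count", offsetof(BufferPoolStats, free_count)},
};

// One generic routine prints any struct/metadata pair -- no per-object
// print code is written by hand.
std::string print_object(const void* obj, const FieldMeta* meta,
                         std::size_t n) {
    std::ostringstream out;
    const char* base = static_cast<const char*>(obj);
    for (std::size_t i = 0; i < n; ++i)
        out << meta[i].name << " = "
            << *reinterpret_cast<const int*>(base + meta[i].offset)
            << "\n";
    return out.str();
}
```

Hook that routine up to a CLI “print” command keyed by the object’s management-tree name, and every object in the system becomes inspectable in the field.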
I’ve worked with & studied various architectures over the years. That’s part of how you develop your skill set. I once shocked a junior engineer by saying that the most important attribute to have as an engineer is the ability to steal ideas from others. I wasn’t trying to promote industrial espionage. The point was nobody is going to successfully reinvent a field on their own, and you have to be able (and willing) to look at what others have done – hopefully to copy the good and to avoid the bad. Anyway, here are some architectures that have influenced me:
· Unix (of course) – modularity and reuse
· The X Window System – complexity management
· DECnet Phase 5 – object oriented design
I’ve done quite a bit of both refactoring work and sustaining work. The importance of those experiences is in learning what NOT to do. Only somewhat facetiously, I can describe my career as fixing the same bugs over and over and over again – because an awful lot of software developers just keep making the same mistakes over and over and over again.
I was involved in one large scale refactoring effort – at Lucent a team was created to refactor the code base used in Cascade’s ATM/Frame Relay products (Lucent had acquired Cascade via Ascend). In case you don’t know, Cascade was an extremely successful telecom startup. Unfortunately they had let their code base get way out of hand. As an example – one time a change to the Ethernet driver broke the command line parser. That should be impossible, but it really happened. I would estimate that at least 30% of the development budget was effectively going to just keeping the whole mess from imploding. To their credit, the Cascade people realized the mistake they had made, and I was part of the team trying to get the code base under control.
Ultimately the project was cancelled due to the telecom crash of the early 2000s, but it did give me some experience with reworking huge code bases. All my other refactoring work has been solo work, some of which I’ve described above:
· Captive Line Inventory Control
· DEC CTERM Implementation
· EBS Network Monitoring System
· Airvana Logging Framework