The bad old days of debugging
The computer engineering version of “kids, back in our time…”
Disclaimer: Opinions shared in this and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.
I recently read “The Soul of a New Machine” by Tracy Kidder, a book about the design of a new computer called Eagle, undertaken during the 1970s by a small team within the company Data General. I have always liked this genre of books - it’s nice to read about people working together to build something special, and I usually leave at the end feeling inspired. But at the end of this book, I felt something else: I felt grateful.
If you have been an engineer long enough, you will relate to this set of events: You find a bug during testing. You spend hours (maybe days) staring at the screen trying to figure out what is causing that bug - a period of intense frustration for you (and sometimes, the people around you). You finally find the root cause of the bug and experience a (short-lived) feeling of pride, before the next bug hits you. That’s the debugging emotional rollercoaster in a nutshell, and computer engineers go through it on a regular basis.
Like a lot of others, when I’m stuck debugging the same bug for a very long time, I feel like I have the worst job in the world. But looking back at chip design in the 1970s gave me a sense of how much better we all have it today…
Getting your hands dirty
In the era before HDLs and EDA tools, hardware engineering meant working on real hardware. The Eagle computer was not a monolithic piece of silicon like modern SoCs - it was a collection of integrated circuits, spread across seven different “wire-wrapped boards,” with wires in the back connecting different pins. As the author Tracy Kidder puts it, the boards looked like “thin plates, each with one side covered by a profusion of tiny wires. Small cables, flat like tapeworms, ran among the boards. Oh my, there were a lot of wires.” If those words don’t speak to you, here is a picture of one such board. (Source: Wikimedia Commons)
In modern debugging, finding the root cause of a bug is the harder problem; with a setup like this, fixing the bug was equally challenging. When a change was needed, existing wires had to be carefully unwrapped using hollow-tipped tools, and if you managed that, you then had to wrap new wires with the same precision. No other wires were supposed to be disturbed in the process - but they often were, introducing new bugs. As the hardware lead Ed Rasala puts it, the process felt like performing open-heart surgery.
This need to add or remove connections was not always a one-time thing: sometimes, in order to isolate where a bug originated, different blocks in the design needed to be disconnected, and later connected back. The book describes an example involving the Instruction Processor and its associated I-cache - the engineers needed to run multiple experiments, each with one of the two blocks disconnected (requiring them to perform the “open-heart surgery” multiple times). In modern debugging, isolating two such blocks would be as simple as modifying a few lines of RTL code.
Keeping the paper business alive
Today, when engineers say “paperwork,” we rarely mean actual paper. But in the world where Eagle was being designed, paper still had its place.
Part of the debugging process involved understanding how the different blocks in the computer worked. This information was compiled into a two-hundred-page physical document written by the architect Steve Wallach. This document was like the Eagle team’s Bible - in fact, it was divided into multiple chapters, and each chapter started with a famous quote. This all sounds great, until engineers had to search through it for a specific piece of information to help with their debug.
But this was not the worst part of paper. Debugging a wire-wrapped board is vastly different from the way we add signals to a waveform today - a logic analyzer was used to probe different pins, and a snapshot of the signal had to be noted down on paper. This was improved by automatically capturing some signals in the system console - but the output still needed to be printed before the values could be analyzed. Remember, signals change many times each second - so a large number of values had to be printed. Describing the console prints for one of the debugs, the author writes: “Stretched out, the sheet would run across the room and back again several times. You could fit a fairly detailed description of American history from the Civil War to the present on it.” Analyzing this large data dump with no ability to search must have been a nightmare…
Once a fix for a bug had been found, all the console log papers could be cleared out - but the paperwork did not end there. In order to correctly make the fix in all the prototypes, the exact changes in connections had to be marked on a large diagram with the schematic of the computer. This was called an Engineering Change Order, or ECO. By looking at this ECO diagram, engineers working on the different prototypes had to modify the wiring in their prototype each morning before they started debugging other issues. Today, a change is only called an ECO very late in the project, and it is done on a software abstraction of wires called a netlist. Such ECOs are rare - not an everyday occurrence like in the Eagle project.
You can’t spell “Hardware” without “Hard”
Once the Eagle project was finalized, new engineers had to be hired onto the team. When I read about the team’s hiring philosophy, I couldn’t help but think that the debugging challenges highlighted earlier had a lot to do with it.
During the interviews, Carl Alsing, the microcode lead, always presented the project as a tough one that involved long hours - at times even calling it a “suicide mission.” Those long hours were a direct result of the debugging challenges of the time. The team also screened for “a lack of family life” - they believed that someone with a family could not deliver on the intense commitment of the debugging schedule. Higher grades were valued, but not because they signified superior skill or smarts - they were merely seen as another indicator of hard work. In fact, the Eagle team even had a rite of initiation for their new hires, where they made a commitment to “do whatever was necessary for success, including forsaking, if needed, family, hobbies, and friends.”
While good old hard work still has its place in debugging today, the best engineers I have worked with rely more on a methodical approach that makes the most of the tools at their disposal. Today’s designers can prepare in advance for a majority of the debugs they might face - with faster and more accurate simulations, better logging capabilities, and approaches like formal verification. Alsing and co. lacked these tools and had to push engineers through 60-80-hour debugging workweeks; that is no longer a requirement for debugging success.
So, I get my turn to debug at 4 AM?
Let’s say you had all the smarts and energy it takes to debug any issue your chip throws at you. If you walked into the Westborough building to debug Eagle, chances are you would be asked to come back later. That’s because, unlike today’s world of simulation, directly debugging a real chip means, well, you need a real chip. There were only a limited number of prototypes, which meant not everyone could debug the Eagle computer at the same time.
The Eagle team was on a tight timeline. To maximize debugging time, they followed a strict debugging schedule that assigned each engineer specific times throughout the day (and night) for debugging. So, if inspiration struck you over dinner, you would still need to wait for your shift to test your theory.
Although shifts were never more than 8 hours, the nature of debugging meant that most engineers stayed longer. Since debugging is a very personal activity, it is often easier to push yourself to work longer hours than to explain your findings to the next engineer. (Which, by the way, was also “paperwork.”) The book tells many such stories - Jim Guyer, for instance, spent many nights alone in the lab debugging issues in the Instruction Processor. Ken Holberger noted that it was often dark when he got to work and dark when he left - leading him to lose track of the day and time when he was home.
Remember, adding more prototypes was not a sustainable solution - if you had 100 different prototypes of the computer, each ECO would need to be implemented on all 100 of them. There were diminishing returns to scale in 1970s debugging.
Flakey and Bogeyman: The supervillains of the debugging world
In the story, “flakey” and “bogeyman” were terms the engineers used to describe specific types of problems or fears encountered while debugging the Eagle computer. A flakey is a failure that occurs erratically and is often hard to diagnose - examples include loose connections, stray voltages, or, worst of all, a buggy IC from another vendor. The main issue with a flakey is that it is hard to reproduce consistently. The first step in fixing something is getting it to break - and with a flakey, engineers needed to spend additional time reproducing the bug before they could find a way to fix it.
The experienced engineers on the Eagle team had seen a lot of flakeys in their careers, and not all of them were caught before their computers were sold. This led to a fear that there would always be one bug that went unnoticed but caused their machine to stop working - they called this the bogeyman. The Eagle project lead Tom West defines it as “the infinite page fault you didn’t anticipate. The bogeyman is the space your mind can’t comprehend.” The fear of the bogeyman was real - Ed Rasala talks about nights when he would wake up worrying about a bug they were yet to find, feeling like “the bogeyman was in his bedroom.” Like I mentioned in another post, bugs continue to be found in chips even today. However, verification has largely been left-shifted - by starting verification very early, and at different levels of abstraction, the chance of encountering such bugs is greatly reduced.
Long-term tiredness
Nobody likes to make mistakes. But the likelihood of a bug going undetected was much higher in the 1970s than it is today, and this took its toll on the engineers involved.
Jim Veres, who helped design the Instruction Processor and its I-cache, was annoyed that his component was blamed for a lot of bugs. Although he would often prove that the problem lay elsewhere, he said the constant blame put a lot of pressure on him. He felt the block he designed was a “part of him now,” and he didn’t like to see it picked on unfairly.
Jon Blau described “difficulty forming sentences,” a mind that would “go blank,” and the feeling that “pieces of your life get dribbled away” under the internal pressure to hurry and finish his code. When he took over debugging the ALU, he was “terribly excited by it, then very frightened,” leading him to take a week off. Ed Rasala summed this up nicely by calling it a “long-term tiredness.” He said that engineers on the debug schedule felt tired, but not in a traditional way that going home to their loved ones could fix. They were always thinking about the issue they were currently debugging, while worrying about the next bug they might hit. That’s what it was like to debug a computer in the 1970s.
While debugging is still a challenge today, and the tools can always be better, reading “The Soul of a New Machine” gave me a new perspective on what we have. It’s been about 50 years since this saga unfolded. Who knows - 50 years from now, someone might write a similar sorry tale about the way we debug chips today…
By the way, this post only talks about a small part of this book that resonated with me. For a fuller analysis, I highly recommend this post by the Chip Letter.



