Chip Insights

The Art of Architectural Analysis: Utilization, Throughput, Latency

Bharath Suresh — Mon, 30 Mar 2026 00:48:09 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

In this post, we introduce the art of architectural analysis through three key concepts and use them to evaluate the TinyXPU architecture - a systolic-array-based 2D matrix multiplication accelerator we designed to study the evolving trends in custom silicon. The ability to analyze architectures is becoming an increasingly valuable skill, especially in a world of rapidly growing custom architectures. So, whether you’re a software or hardware engineer, you will benefit from reading about these concepts.

If you are new here, this post is part of our ongoing series on custom accelerators. In the first part of this series, we motivated the need for a bridge between a software engineer (Alice) and a new architecture developed by a hardware engineer (Harry).

In the second part, we introduced our TinyXPU project - a practical demonstration of the hardware-software bridge.

While we recommend reading the first two parts, they are not prerequisites, as this post takes a slightly different direction.

Why does Architectural Analysis matter?

So far, we’ve focused on how Alice and Harry can work together more effectively. Using a framework like ONNX, we showed that Alice can now target many different architectures (and many different Harrys) with relative ease.

This naturally leads to the next question: How does Alice decide which Harry to choose?

At first glance, the answer seems simple: Alice would pick the “best architecture” for her algorithm. But what does “best” actually mean?

From Alice’s perspective, it could mean:

I want my algorithm to finish faster
I want to start seeing outputs as early as possible
I want it to fit within a smartwatch, a robot, or a datacenter

From Harry’s perspective, the architecture might be “best” because:

It reduces the number of arithmetic operations or data movement
It runs at a higher clock frequency than competitors
It occupies less chip area than other implementations

Clearly, Alice and Harry are both trying to communicate. But they are speaking completely different languages. If Harry wants to convince Alice to use his architecture, and Alice wants to make the right choice, they need a shared vocabulary for analyzing architectures. This is where understanding concepts in architectural analysis becomes important.

The Cooking Analogy

To build intuition for the ideas in this post, we’ll use a simple analogy: cooking.

More specifically:

The Algorithm: Boiling eggs
The Hardware: A kitchen stove

Boiling eggs is simple, but it can be done in many different ways.

You might cook eggs one at a time or in batches. You might prioritize getting the first egg ready as quickly as possible or finishing all of them as fast as you can.

Similarly, each stove is different. One might have more burners. Another might heat up faster. A third might support a larger pot but heats up slower.

This is not too different from the problem Alice and Harry are trying to solve.

For clarity, all future references to this analogy are displayed as quotes.

Recap: The TinyXPU Terminology

We briefly introduced the parameters used in the TinyXPU project and the matrix sizes in part 2 of our series of posts. Here’s a quick recap of variables that will be used in the rest of the post.

The Input Matrix X has M rows and K columns.
The Weight Matrix W has K rows and N columns.
The PEs are arranged as a 2D array with HW_ROWS rows and HW_COLS columns. In this post, we will only explore the 16×16 configuration.

Concept 1: Hardware Utilization

Quick Takes

What is it?

How much of the hardware is being used to produce useful output.

In our cooking analogy, if your kitchen stove has 4 burners but you only use 3 simultaneously, the utilization is 75%.

Why does it matter?

High utilization means more of the hardware is active at the same time. In general, this implies that more sub-operations (additions, multiplications, data movement) are happening in parallel.

If more burners on your stove are on, it means more eggs are being boiled.

When does it not matter as much?

In general-purpose architectures like CPUs and GPUs, the hardware is almost never fully utilized. In these systems, overall utilization matters less than the utilization of specific components (such as ALUs).

However, an accelerator is designed for a very specific purpose. It is therefore wasteful to build hardware that cannot be effectively utilized. In most accelerators, high utilization is close to a non-negotiable.

Having low utilization is like using a 4-burner stove when you only need to boil one egg.

PE Utilization in TinyXPU

In TinyXPU, the size of the systolic array is fixed in hardware (16×16 in our case), but the matrices we run on it are not. This mismatch is the primary reason utilization becomes an important metric.

The goal is simple: keep as many PEs busy as possible for as long as possible.

In our cooking analogy, this is equivalent to keeping all burners active. If you only have enough eggs for two burners, the remaining burners sit idle, even though the stove could do more work.

Where does underutilization come from?

1. When the workload is smaller than the array

If the weight matrix is smaller than the systolic array, for example fewer than 16 rows or columns, some PEs will not map to useful computation. To maintain correctness, we typically pad the matrix with zeros. These PEs still perform MAC operations, but their outputs are discarded.

This is like turning on burners with empty pots. Heat is being generated, but no eggs are being cooked.

2. When the workload is larger than the array

If the matrix is larger than the array, we process it in tiles. While tiling allows us to handle arbitrarily large matrices, the last tile is often smaller than the array, leading again to underutilization. We will skip the details of tiling in this post, but this edge effect is quite common in practice.
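Both cases can be put into numbers. The following is a minimal sketch, assuming a simple grid tiling in which a K×N weight matrix is cut into HW_ROWS×HW_COLS tiles and every partial tile is zero-padded. The function name `mapped_pe_fraction` is ours for illustration and is not part of TinyXPU. It reports the fraction of PE slots that hold real weights and ignores startup overhead.

```python
import math


def mapped_pe_fraction(k, n, hw_rows=16, hw_cols=16):
    """Fraction of PE slots holding real (non-padding) weights.

    Assumes a K x N weight matrix is split into hw_rows x hw_cols tiles and
    that partial tiles are zero-padded. Timing overhead is not modeled.
    """
    tiles = math.ceil(k / hw_rows) * math.ceil(n / hw_cols)
    return (k * n) / (tiles * hw_rows * hw_cols)


# Smaller than the array: 64 real weights on 256 PEs.
print(mapped_pe_fraction(4, 16))    # 0.25
# Exact fit.
print(mapped_pe_fraction(16, 16))   # 1.0
# Larger than the array: the second tile along K holds only 4 real rows.
print(mapped_pe_fraction(20, 16))   # 0.625
```

The 20×16 case shows the edge effect: one full tile plus one mostly empty tile averages out to 62.5% of the PE slots doing real work.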

Startup overhead: why utilization is not 100%

Even when the workload maps perfectly to the array, utilization is not immediately 100%. This is because not all cycles contribute to useful work. At the start of execution, the systolic array needs a few cycles before all PEs begin performing meaningful computations. Similarly, toward the end, some PEs become idle as the computation completes.

From a utilization perspective, these cycles are overhead. The hardware is active, but not all of it is doing useful work.

Before you can boil eggs, you need to fill the pot and wait for the water to heat up. During this time, the stove is on, but no eggs are being cooked.

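Considering this overhead, we define utilization as the ratio of useful MAC operations to the total number of PE-cycles available (number of PEs × total cycles, including the startup and drain cycles).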

This shows that utilization improves as the workload grows, because the fixed overhead is amortized over more useful work.
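To see the amortization effect in numbers, here is a minimal sketch of a utilization model. It rests on an assumption of ours: that one run takes M + HW_ROWS + N - 1 cycles, which agrees with the 2×2 walkthrough in Part 2 (two input rows finish in 5 cycles) but may differ from the exact cycle count of the TinyXPU RTL. The function name `pe_utilization` is hypothetical.

```python
def pe_utilization(m, k, n, hw_rows=16, hw_cols=16):
    """Useful MACs divided by available PE-cycles (assumed timing model)."""
    if k > hw_rows or n > hw_cols:
        raise ValueError("tiling is not modeled in this sketch")
    useful_macs = m * k * n
    # Assumption: M streaming cycles plus hw_rows + n - 1 fill/drain cycles.
    total_cycles = m + hw_rows + n - 1
    return useful_macs / (hw_rows * hw_cols * total_cycles)


for m in (1, 16, 256, 4096):
    full = pe_utilization(m, 16, 16)
    small = pe_utilization(m, 8, 8)
    print(f"M={m:5d}  16x16: {full:6.1%}  8x8: {small:6.1%}")
```

Under this model the 16×16 workload climbs toward 100% as M grows, while an 8×8 workload can never exceed 25% because only 64 of the 256 PEs hold real weights.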

PE Utilization vs Batch Size

The impact of matrix sizes on PE utilization is shown in this plot:

Key takeaways:

The 16×16 case asymptotically approaches 100% utilization, as the pipeline overhead becomes negligible compared to useful work.
The other shapes (4×16, 8×8, 16×4) all have 64 total weights, and therefore can only utilize 64 out of 256 PEs, which corresponds to a maximum of 25% utilization.
For small values of M, utilization is low across all configurations due to pipeline overhead.
As M increases, all curves improve, but they plateau at different levels depending on how well the workload matches the hardware.

Utilization is a useful analytical metric. It tells us how much of the hardware is actively doing useful work and highlights inefficiencies due to pipeline overhead.

How architects use this metric in real chips

In real chips, utilization is rarely exposed as a single number. Instead, architects infer it using performance counters, stall analysis, and activity factors across compute units. A classic example is the TPU v1 paper, where the authors show that achieving high utilization of the 256×256 systolic array was critical to performance. They analyze how different workloads map to the array and highlight cases where underutilization leads to significant efficiency loss.

Concept 2: Throughput

Quick Takes

What is it?

The number of useful operations completed per unit time.

In our cooking analogy, this is the number of eggs you can boil per hour.

Why does it matter?

Throughput directly determines how fast an algorithm completes. If more operations are finished per unit time, the total execution time is lower.

When does it not matter as much?

In many systems, overall throughput is limited by the slowest component. Even if one part of the system achieves very high throughput, it may not translate into end-to-end performance improvements.

Custom accelerators are typically designed to maximize throughput. However, understanding what limits that throughput is just as important.

From Throughput to the Roofline Model

So far, we have treated throughput as a single number. In reality, it is constrained by two fundamental limits:

How fast the hardware can compute
How fast data can be moved to and from the hardware

The roofline model captures both of these limits in a single diagram, which is why it is a better representation of our throughput analysis.

In our cooking analogy, the number of eggs you can boil per hour is not just determined by how many burners your stove has. It is also limited by how quickly you can bring water, eggs, and utensils to the stove. If you have many burners but can only carry a few eggs at a time, most burners will sit idle. On the other hand, if you can supply eggs very quickly but only have a small stove, the burners themselves become the bottleneck.

TinyXPU Roofline Model

Vertical Axis: Peak Throughput

For a systolic array, if the workload fully utilizes the array, the peak compute throughput is HW_ROWS × HW_COLS MACs per cycle, since every PE completes one MAC in each cycle.

For our 16×16 array, this gives a peak of 256 MACs per cycle. However, if the weight matrix is smaller than the array, only K×N PEs perform useful work. This represents the vertical axis of our roofline model.

Horizontal Axis: Arithmetic intensity

Arithmetic intensity is the number of MACs executed for each byte of data moved to or from memory. For our matrix multiplication example:

Weights are loaded once: K×N bytes (we ignore this as batch size is usually large)
Inputs contribute: M×K bytes
Outputs contribute M×N values, each 4 bytes (we assume output values are 32-bit integers)

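This gives (ignoring the one-time weight load):

Total MACs = M×K×N
Total bytes moved = M×K + 4×M×N

So, the arithmetic intensity (AI) becomes:

AI = (M×K×N) / (M×K + 4×M×N) = (K×N) / (K + 4×N) MACs per byte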
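As a quick numerical check, the sketch below computes arithmetic intensity directly from the byte counts listed above (1-byte inputs, 4-byte outputs, weights ignored). The function name `arithmetic_intensity` and its parameters are ours for illustration.

```python
def arithmetic_intensity(m, k, n, in_bytes=1, out_bytes=4):
    """MACs per byte moved, ignoring the one-time weight load."""
    macs = m * k * n
    bytes_moved = in_bytes * m * k + out_bytes * m * n
    return macs / bytes_moved


for k, n in [(16, 16), (16, 4), (8, 8), (4, 16)]:
    ai = arithmetic_intensity(256, k, n)
    print(f"{k}x{n}: {ai:.2f} MACs/byte")
```

Because M appears in both the numerator and the denominator, it cancels out: under these assumptions the arithmetic intensity depends only on K and N. The 16×4 shape comes out at 2.0 MACs per byte, versus roughly 0.94 for 4×16.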

Roofline Plot

Based on the definitions above, the roofline plot for TinyXPU is shown here:

Each point on the plot represents a specific workload running on the hardware. The horizontal axis is arithmetic intensity and the vertical axis is throughput. The roofline plot also includes two kinds of lines, which are important:

The horizontal line represents the peak compute capability of the hardware
The sloped lines represent memory bandwidth limits
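The two kinds of lines combine into a single estimate: attainable throughput is the smaller of the compute roof and the memory roof. Here is a minimal sketch, assuming 1-byte inputs, 4-byte outputs, and a memory bandwidth expressed in bytes per cycle. The function name and the example bandwidth of 16 bytes per cycle are illustrative choices of ours, not TinyXPU measurements.

```python
def attainable_macs_per_cycle(k, n, bandwidth, hw_rows=16, hw_cols=16):
    """Roofline estimate: min(compute roof, AI * memory bandwidth)."""
    # Compute roof: only the PEs that hold real weights do useful work.
    compute_roof = min(k, hw_rows) * min(n, hw_cols)
    # Arithmetic intensity in MACs per byte (weights ignored).
    ai = (k * n) / (k + 4 * n)
    # Memory roof: bandwidth is given in bytes per cycle.
    memory_roof = ai * bandwidth
    return min(compute_roof, memory_roof)


for k, n in [(16, 16), (16, 4), (8, 8), (4, 16)]:
    print(f"{k}x{n}: {attainable_macs_per_cycle(k, n, bandwidth=16):.1f} MACs/cycle")
```

At 16 bytes per cycle every shape in this list is limited by the memory roof, and the 16×4 shape sustains about twice the throughput of 4×16 even though both hold 64 weights.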

Several insights emerge from this single diagram:

If the weight matrix is smaller than the array, peak compute throughput cannot be reached due to underutilization (this aligns with our earlier analysis of PE utilization).
Among shapes with the same number of weights, taller matrices perform better than wider ones. This happens for two reasons:
- Idle rows waste more hardware than idle columns
- Output traffic scales with N, and outputs are larger in size

The 4×16 shape is more bandwidth-limited than 16×4, since it produces more output data
Most configurations in this example are memory-bound unless bandwidth exceeds 64 bytes per cycle (which is rare in CPU L1 caches as seen from this post on min{power})

Using a roofline plot, it becomes clear how throughput can be maximized:

Move upward by increasing compute efficiency (Larger batch sizes)
Move right by increasing arithmetic intensity (Taller matrices)

By being in the top right, we get the best throughput.

How architects use this metric in real chips

The roofline model is one of the most widely used tools for reasoning about throughput limits, and modern accelerators are frequently characterized with roofline-style analysis. For example, NVIDIA presents roofline-inspired performance characterizations in its architecture whitepapers (such as the one for the NVIDIA A100 architecture), where compute throughput and memory bandwidth limits are analyzed together.

In fact, the roofline plot we obtained for TinyXPU is not too different from the plot shared in the TPU v1 paper:

Source: https://arxiv.org/pdf/1704.04760

Concept 3: Latency

Quick Takes

What is it?

Latency is the time it takes to produce the first useful output after the first input is provided.

In our cooking analogy, this is the time between turning on the stove and having the first boiled egg ready.

Why does it matter?

Latency matters when partial results are useful before the entire computation is complete. For example, modern LLM-based systems stream outputs token by token. As soon as the first token is ready, it is displayed to the user. In such systems, latency directly impacts user experience.

When does it not matter?

Latency matters less when the full output is required before any further computation can proceed. For example, in a classification task such as predicting whether an image contains a cat or a dog, the final decision can only be made after all computations are complete. In such cases, throughput is often the more relevant metric.

Latency in TinyXPU

To analyze latency, we first need to define what we mean by the “first output.” In matrix multiplication, we define the first output as the result corresponding to the first row of the input matrix X.

For readers familiar with LLMs, consider the computation of the query matrix, Q = X × W_Q, where W_Q is the query projection weight matrix.

Each row of X corresponds to the embedding of an input token. The first row of Q therefore corresponds to the first token. The time taken to produce this row is a key component of time to first token (TTFT).

Where does latency come from?

Even if the input matrix X has many rows, the first result that emerges from the systolic array corresponds to the first row. This delay is caused by the time it takes for data to propagate through the array.

In the cooking analogy, this is like the time it takes to heat the water and cook the first egg. Even if you plan to cook many eggs, the first one still takes the same amount of time.

For a weight-stationary systolic array, the latency to first output is:

Essentially, the latency only depends on the number of rows in the hardware array and the number of output columns. It is independent of the batch size M.
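A minimal sketch of this relationship is shown below. It assumes the first-row latency is HW_ROWS + N - 1 cycles. That expression is our assumption, chosen to agree with the 2×2 walkthrough in Part 2 (where the first output row is complete three cycles after the first input enters), so the exact constant in the TinyXPU RTL may differ. The function name `first_row_latency` is hypothetical.

```python
def first_row_latency(n, hw_rows=16):
    """Cycles from the first input to the last element of the first output row.

    Assumed model: partial sums travel through all hw_rows rows, and the
    n output columns finish one cycle apart. Batch size M does not appear.
    """
    return hw_rows + n - 1


for n in (1, 4, 8, 16):
    print(f"N={n:2d}: {first_row_latency(n)} cycles")
```

Note that M is not a parameter at all, which is the point of the paragraph above: streaming more rows does not delay the first one.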

Throughput vs Latency Tradeoff

The plot below shows throughput on the vertical axis and latency on the horizontal axis for different matrix shapes. Each curve corresponds to a fixed number of rows K, while varying the number of columns N. The batch size (M) is fixed to 256.

This plot clearly shows the inherent tradeoff between throughput and latency.

Increasing N increases throughput, since more PEs are active
However, increasing N also increases latency

This means designs with the same total compute capacity (total number of PEs) can have very different latency and throughput characteristics. If you want fast response to the first output, you prefer smaller matrices with lower latency. If you want higher overall throughput, you prefer larger matrices that utilize more of the hardware.
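The tradeoff can be reproduced with a small sweep. This is a sketch under the same assumed timing model as before (first-row latency of HW_ROWS + N - 1 cycles and a total run time of M + HW_ROWS + N - 1 cycles). It is not the script that produced the plot, and the function name `sweep` is ours.

```python
def sweep(k=16, m=256, hw_rows=16, hw_cols=16):
    """Return (N, latency in cycles, throughput in MACs/cycle) for N = 1..hw_cols."""
    rows = []
    for n in range(1, hw_cols + 1):
        latency = hw_rows + n - 1            # assumed first-row latency
        total_cycles = m + hw_rows + n - 1   # assumed total run time
        throughput = m * k * n / total_cycles
        rows.append((n, latency, throughput))
    return rows


for n, latency, throughput in sweep():
    if n in (1, 4, 8, 16):
        print(f"N={n:2d}  latency={latency:2d} cycles  throughput={throughput:6.1f} MACs/cycle")
```

Both columns grow with N in this model: every extra output column adds one cycle before the first row is complete, but it also puts another column of PEs to work.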

In the cooking analogy, this is like choosing between boiling one egg quickly or boiling many eggs at once. Using a larger pot allows you to cook more eggs simultaneously, but it may take longer before the first one is ready.

This throughput-latency tradeoff becomes even more interesting in accelerator designs with pipelined PEs to achieve higher clock frequencies, and in full-system implementations with non-zero instruction and memory latencies - topics we will explore in future posts.

How architects use this metric in real chips

Latency is typically modeled using a combination of analytical models and cycle-accurate simulations that capture pipeline depth and data movement delays. For example, the MLPerf Inference Benchmark covers latency-sensitive inference scenarios, with metrics like time-to-first-token and tail latency.

As you can see from the discussion, architectural analysis is as much art as it is science. It’s more like appreciating a painting than reading a book. An experienced art connoisseur can analyze a painting at a glance. But that ability comes from a shared understanding of color, brushwork, and design principles between the artist and the observer.

This post aims to build a mental model for architectural analysis, using our basic TinyXPU implementation as a concrete example. As we continue to expand TinyXPU by adding activation functions, mapping it to an FPGA, and exploring unorthodox systolic networks, we’ll build on these ideas and introduce new ones along the way.

Subscribe to Chip Insights and min{power} to follow along as we do that.

Mapping Algorithms to Custom Silicon - Part 2

Bharath Suresh — Sun, 15 Mar 2026 18:42:10 GMT

This post is the second in our series on custom hardware accelerators. In part 1, we introduced the hardware-software interface problem that has kept the two domains as separate islands and limited co-innovation. We also introduced two approaches to overcome this gap. If you haven’t done so already, check out that post here:

In this part, we present TinyXPU, an open-source, extensible project we are building in conjunction with this series of posts. In this post, we focus on two pieces:

The hardware architecture of TinyXPU
The software interface that allows it to run ONNX models

The goal of this project is to show what the bridge between software and hardware looks like in practice, and is a stepping stone to the architectural analysis we have planned in upcoming posts.

Why Matrix Multiplication

The first important decision is to pick the right algorithm to study. We picked Matrix Multiplication, primarily because of its widespread use in many of today’s most important software domains. Here are a few representative examples:

Deep Learning: Neural network layers compute weighted sums using matrix multiplications.
Robotics: BLAS, in particular the Level 3 GEMM routine, is used in many robotics libraries
Computer Graphics: Shaders use matrix math for transformations; modern APIs like Vulkan have started exposing this directly.

Matrix multiplication also stresses both compute throughput (number of arithmetic operations completed per second) and memory bandwidth (the number of bits read from memory per second). This helps us identify both compute and memory bottlenecks - a topic we will explore in future posts.

Why build an accelerator from scratch

We are not the first to focus on matrix multiplication, and we certainly won’t be the last. As mentioned in an earlier post, there has been an explosion of “processing units” like TPUs and NPUs that, at their core, accelerate matrix multiplication. In fact, modern GPUs and even CPUs now support matrix multiplication extensions and dedicated units to handle the operation efficiently. There are also a number of “tinyTPU” projects out there with an architecture very similar to the one we will be demonstrating. (Some examples are tiny-tpu-v2/tiny-tpu and Alanma23/tinytinyTPU.) So before we go any further, we wanted to explain why we are doing this from scratch.

Our main motivation is that building from scratch allows us to understand (and hopefully explain) everything we are doing from first principles. As you will see, we are only starting with two simple SystemVerilog files. This way, we do not have to make any assumptions or shoot in the dark to understand architectural decisions. This also allows us the flexibility to scale or modify the architecture for future posts in this series.

It’s not just the hardware architecture that could be confusing. As we mentioned in Part 1, the software interface is messy and still evolving. As this is the first implementation in our series, we wanted to keep a simple but still operational software interface. Setting up this interface is important to obtain performance metrics that will be used for architectural analysis in future posts in this series.

With that motivation, we can now look at the two parts of the TinyXPU project: the hardware architecture and the software interface.

The Hardware Architecture

We built the first version of our matrix multiplication accelerator based on the weight-stationary systolic array architecture: the same one at the heart of Google’s TPUs. While this post will not go into the details of this architecture, here’s a dedicated post that’s got you covered:

Systolic arrays for general robotics, AI, and scientific computing (min{power}, by Avik De)

The Processing Element (PE)

The core block in TinyXPU is called a Processing Element (PE), a term that originated in H. T. Kung’s seminal work in 1982. Here’s the block diagram of our PE (the SystemVerilog implementation can be found in pe.sv):

This PE block executes the following operation:

acc_out = data_in * weight_in + acc_in

However, this operation is executed in two phases:

Weight Loading Phase: In this phase, the value of weight_in is stored in the register weight_r when weight_ld is set to 1.
Multiply Accumulate Phase: Here, the stored value weight_r is multiplied with data_in, and acc_in is added to get the final result. When en is set to 1, this result is sent as the output.

The reason for having these two phases comes from its primary use case in neural networks, where we typically process multiple inputs (called “batches”) using the same weights - so loading the weights once prevents the need to continuously broadcast its value during the computation. This is called a "weight-stationary" architecture, indicating that the weights don't move through the array in the Multiply Accumulate Phase.

It’s also important to highlight that the Multiply Accumulate Phase has a latency of 1 cycle - which means that the output of the PE is ready one cycle after the inputs. This pipeline stage is needed to break the combinational logic path when two PEs are connected together, in order to achieve a high clock frequency. (If you are new to digital design, this post on Static Timing Analysis would help you understand why this matters.)
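The two phases and the one-cycle output register can be captured in a few lines of behavioral Python. This is a sketch of the behavior described above, not a translation of pe.sv. In particular, the `data_out` register (which forwards the input to the neighboring PE) and the choice to let a MAC use the previously stored weight when `weight_ld` and `en` are asserted together are assumptions of ours.

```python
class PE:
    """Behavioral sketch of one weight-stationary processing element."""

    def __init__(self):
        self.weight_r = 0   # stationary weight register
        self.acc_out = 0    # registered MAC result
        self.data_out = 0   # registered copy of data_in for the next PE (assumed)

    def clock(self, data_in=0, acc_in=0, weight_in=0, weight_ld=0, en=0):
        """Advance one clock cycle; outputs change only after this call."""
        if en:
            # Multiply Accumulate Phase: uses the weight stored before this edge.
            self.acc_out = data_in * self.weight_r + acc_in
            self.data_out = data_in
        if weight_ld:
            # Weight Loading Phase: capture weight_in into weight_r.
            self.weight_r = weight_in


pe = PE()
pe.clock(weight_in=3, weight_ld=1)     # load the weight once
pe.clock(data_in=5, acc_in=2, en=1)    # 5 * 3 + 2
print(pe.acc_out)                      # 17
pe.clock(data_in=4, acc_in=0, en=1)    # same weight, new input
print(pe.acc_out)                      # 12
```

The second MAC reuses the stored weight without reloading it, which is the weight-stationary idea in miniature.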

Matrix Multiplication on a PE array

The functionality of a single PE can be achieved using the Arithmetic and Logic Unit (ALU) in most CPUs. So, to really understand the benefit of a PE array, let’s consider a PE array with 2 rows and 2 columns, and map a matrix operation on it. Specifically, we will be implementing this operation:

Our PE array performs matrix multiplication by arranging PEs in a grid where data flows between neighboring elements. Each PE stores one weight value and performs a multiply-accumulate operation every cycle. As input values move across the array, partial sums are passed between PEs until the final output values emerge from the last PE of each column. We depict the cycle-by-cycle dataflow in this diagram:

In the first cycle, X11 is passed as input to the PE with weight W11.
In the second cycle, X11 moves on to the next PE (with weight W12). Simultaneously, X21 and X12 are passed to the first PE column (with weights W11 and W21).
In the third cycle, X21 and X12 move to the second PE column (with weights W12 and W22). Also, X22 is passed to the PE with weight W21. In this cycle, we also have the first output value ready:
```
Y11 = X11 × W11 + X12 × W21
```
Note that the latency (time from first input to first output) is 2 cycles in this example. As the number of rows of the PE array increases, this latency also increases. After the first output, new outputs are ready in every subsequent cycle, as we will see.
In cycle 4, X22 moves to the second PE column. We also get outputs Y21 and Y12, one from each PE column.
Finally, in cycle 5, the last output Y22 is ready in the second PE column.
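The walkthrough above can be replayed with a small cycle-level model. This is a behavioral sketch in Python rather than the TinyXPU RTL: it assumes inputs move left to right along a row of PEs, partial sums move down a column, every PE registers its outputs for one cycle, and row r of the array receives its inputs r cycles later than row 0. The function name `simulate` is ours.

```python
def simulate(X, W):
    """Return {(m, c): (cycle_ready, value)} for Y = X x W on a K x N PE array."""
    M, K, N = len(X), len(W), len(W[0])
    data = [[None] * N for _ in range(K)]   # registered (row_index, value) leaving each PE
    acc = [[0] * N for _ in range(K)]       # registered partial sum leaving each PE
    ready = {}
    cycle = 1
    while len(ready) < M * N:
        # Results registered at the end of the previous cycle are visible now.
        for c in range(N):
            item = data[K - 1][c]
            if item is not None:
                ready[(item[0], c)] = (cycle, acc[K - 1][c])
        new_data = [[None] * N for _ in range(K)]
        new_acc = [[0] * N for _ in range(K)]
        for r in range(K):
            for c in range(N):
                if c == 0:
                    m = cycle - 1 - r       # skewed feed: row r starts r cycles late
                    d_in = (m, X[m][r]) if 0 <= m < M else None
                else:
                    d_in = data[r][c - 1]   # value forwarded by the left neighbor
                a_in = acc[r - 1][c] if r > 0 else 0
                if d_in is not None:
                    new_acc[r][c] = d_in[1] * W[r][c] + a_in
                new_data[r][c] = d_in
        data, acc = new_data, new_acc
        cycle += 1
    return ready


timeline = simulate([[1, 2], [3, 4]], [[5, 6], [7, 8]])
for (m, c), (cycle, value) in sorted(timeline.items(), key=lambda kv: kv[1][0]):
    print(f"cycle {cycle}: Y{m + 1}{c + 1} = {value}")
```

With these example values the model reports Y11 in cycle 3, Y21 and Y12 in cycle 4, and Y22 in cycle 5, matching the five-cycle sequence described above.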

What makes this an accelerator?

The biggest advantage of this PE architecture is that we read the operands only once from memory. In the above example:

Weights W11, W12, W21, W22 are only read in the Weight Loading Phase and are then stored in the PEs
Although Inputs X11, X12, X21, X22 are used twice during the matrix multiplication, they are only read once (For example, after X11 is read in Cycle 1, it is just passed to the next PE in Cycle 2 - it does not have to be read again.)

In a standard CPU, the intermediate values would need to be stored somewhere. It could be stored in a fast register for small matrices, but might need to spill to slower caches and main memory as the matrix size increases.

Our PE array avoids repeatedly fetching operands from memory because values propagate directly between neighboring PEs. This is what makes the PE array architecture great for matrix multiplication. (This statement assumes the PE array size and the matrix sizes are the same. The PE array SystemVerilog implementation in TinyXPU is parameterized to support different numbers of rows and columns, and in our upcoming posts, we will explore the impact of these different configurations on the performance of our chip.)
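To put a rough number on the operand reuse, the sketch below compares operand fetch counts. The baseline is a deliberately pessimistic one that we assume for illustration: a loop that fetches one input and one weight from memory for every MAC, with no reuse through registers or caches. Real CPUs do better than this, so the ratio is an upper bound on the benefit, not a measurement. The function name `operand_fetches` is hypothetical.

```python
def operand_fetches(m, k, n):
    """Return (no-reuse fetch count, PE-array fetch count) for an M x K by K x N product."""
    no_reuse = 2 * m * k * n      # one input and one weight fetched per MAC
    pe_array = k * n + m * k      # every weight and every input fetched exactly once
    return no_reuse, pe_array


for shape in [(2, 2, 2), (256, 16, 16)]:
    no_reuse, pe_array = operand_fetches(*shape)
    print(f"M,K,N={shape}: {no_reuse} vs {pe_array} fetches")
```

For the 2×2 example this is 16 fetches versus 8. The gap widens with size because the no-reuse count grows with M×K×N while the PE array count grows only with the sizes of the two input matrices.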

Finally, although our current implementation is just a PE array, we want to highlight that a real accelerator has other components in its microarchitecture which we are skipping here for the sake of simplicity:

We need to define an Instruction Set Architecture (ISA) and include an instruction decode unit
We need a Unified Buffer (typically SRAM) to hold the matrix data before streaming it into the PE array. (Typically, a DMA controller would stream data from CPU DRAM to the local SRAM.)

As this project evolves, some of these components will be (and must be) included in the design.

The Software Bridge

In the previous section, we described the hardware implementation of our TinyXPU Matrix Multiplier. Hopefully, you are convinced that the PE array is an improvement over using general-purpose CPUs for matrix multiplication. If you are a hardware engineer, many of your projects stop here with this question: how do you connect your hardware with the existing software ecosystem and actually run matrix multiplication on your chip? In part 1, we described two approaches to solve this problem: runtime-level and compiler-level integration. In this section, we will implement runtime-level integration using an ONNX Runtime Execution Provider.

The ONNX Runtime Flow

The flow used in our project is intentionally simple.

First, a small ONNX model containing a matrix multiplication operation is generated using a Python script. An ONNX model represents a computation as a graph, where each node corresponds to an operation such as matrix multiplication. In our implementation, the generated ONNX file includes a MatMulInteger operator, as shown in this code snippet:

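from onnx import TensorProto, helper, numpy_helper

# W_data is the INT8 weight matrix (a NumPy array of shape K×N), assumed to be defined before this snippet.
X = helper.make_tensor_value_info("X", TensorProto.INT8, [None, W_data.shape[0]])
Y = helper.make_tensor_value_info("Y", TensorProto.INT32, [None, W_data.shape[1]])
W_init = numpy_helper.from_array(W_data, name="W")
# A single MatMulInteger node computes Y = X × W with INT32 results.
node = helper.make_node("MatMulInteger", inputs=["X", "W"], outputs=["Y"])
graph = helper.make_graph(
    [node],
    "MatMulInteger_4x4",
    [X],
    [Y],
    initializer=[W_init],
)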
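When testing a graph like this, it helps to know what values the accelerator should return. Below is a minimal pure-Python reference for the arithmetic that MatMulInteger performs on INT8 inputs, assuming the operator's optional zero points are left at their default of 0 unless given. It is a checking aid we are sketching here, not code from the TinyXPU repository, and it does not model INT32 overflow.

```python
def matmul_integer(x, w, x_zero_point=0, w_zero_point=0):
    """Reference INT8 x INT8 -> integer matrix product, with optional zero points."""
    k, n = len(w), len(w[0])
    for row in list(x) + list(w):
        if any(not -128 <= v <= 127 for v in row):
            raise ValueError("values must fit in INT8")
    if any(len(row) != k for row in x):
        raise ValueError("inner dimensions of X and W do not match")
    return [
        [
            sum((row[i] - x_zero_point) * (w[i][j] - w_zero_point) for i in range(k))
            for j in range(n)
        ]
        for row in x
    ]


X = [[1, -2], [3, 4]]
W = [[5, 6], [7, -8]]
print(matmul_integer(X, W))   # [[-9, 22], [43, -14]]
```

Comparing the output tensor returned by the runtime against a reference like this is a simple way to confirm that the dispatch path and the simulated hardware agree.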

Next, a Python driver loads this model using ONNX Runtime and registers the TinyXPU Execution Provider (EP), as shown in this code snippet:

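import onnxruntime as ort
# path_to_dll (the compiled EP library) and tinyxpu_devices (its EP devices) are assumed to be defined earlier in the driver.
ort.register_execution_provider_library("SampleEP", path_to_dll)
session_options = ort.SessionOptions()
session_options.add_provider_for_devices(tinyxpu_devices, {})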

The goal of the EP is to inspect the ONNX model, looking for an eligible operation: in our case, MatMulInteger. (See the GetCapabilityImpl() method in the TinyXPU EP for details.) When the session is instantiated, ONNX Runtime calls this function, which is where we register our ability to execute MatMulInteger.

When the runtime encounters the matrix multiplication operation, it dispatches that operation to the TinyXPU backend (via the ComputeImpl() hook) instead of executing it on the CPU (which is the default option when no match is found). This way, we are able to connect the TinyXPU hardware directly to software without requiring a custom compiler - a major benefit of this approach.

Practical Constraints

Much like the hardware implementation, our current ONNX interface has some constraints that we would like to highlight.

Firstly, instead of exporting a PyTorch/TensorFlow implementation into the ONNX format, we directly use a Python script to generate an ONNX model containing a single MatMulInteger operation. This keeps the software stack minimal while still exercising the same ONNX Runtime execution flow that would be used with real models. PyTorch and TensorFlow both support functions to export to ONNX, so this additional step is fairly trivial. In fact, ONNX is a common intermediate exchange format before going to many hardware APIs including TensorRT for NVIDIA embedded devices, Qualcomm Hexagon, and others.

As mentioned in the previous section, we do not yet have an FPGA or ASIC implementation of our design for this part. That’s where Verilator comes into the picture. The SystemVerilog implementation of the TinyXPU systolic array is compiled into a cycle-accurate C++ model using Verilator. The TinyXPU Execution Provider links directly against this generated C++ model. When ONNX Runtime dispatches a matrix multiplication operation (MatMulInteger) to TinyXPU, the Execution Provider performs three steps (See the ComputeImpl() method in the TinyXPU EP for details):

Read input tensors from ONNX Runtime
Drive the Verilator simulation input signals and toggle the clock
Collect output signals from the simulation

So, in our implementation, the ONNX runtime connects the software world with a cycle-accurate simulation of the hardware, as shown in this diagram below (The “real silicon” version of this diagram was included in part 1):

It’s important to mention that this implementation successfully abstracts the hardware away from the software, irrespective of whether we are running on a Verilator-generated simulation, custom hardware implemented on FPGAs or as an ASIC, or existing processors on an SoC like CPUs, GPUs, or NPUs. This was our stated goal from Part 1.

Part 2 ends here. Although the current implementation is intentionally minimal, it already gives us a complete system: a software stack capable of dispatching matrix multiplication operations to a cycle-accurate simulation of our accelerator. This demonstrates how runtime-level integration works, and more importantly, provides a useful platform for experimentation.

In the next post, we use this framework to analyze the architecture in more detail, exploring how design parameters such as the size of the array compared to the workload and memory bandwidth affect throughput, latency and utilization. Check it out here:

The TinyXPU project will also expand to support activation functions to run complete ONNX ML models that can be deployed on an FPGA. To get notified about all this and more, subscribe to both Chip Insights and min{power}. And if you have some experience taking up similar projects, or have an interesting direction for us to explore, let us know in the comments.

ENIAC and the Workload Problem - Part 1

Bharath Suresh — Mon, 09 Mar 2026 05:31:06 GMT

This is the first post in a series aimed at understanding the ENIAC architecture and the lessons that remain relevant in today’s computing landscape. In this post, I introduce the ENIAC story and explain why it still matters.

The First Workload Problem

On December 7, 1941, the attack on Pearl Harbor became a key moment in computing history. The US, having now entered World War II, tasked the Army’s Ballistic Research Laboratory (BRL) with a project: to improve the accuracy of artillery strikes. This effort led to the creation of artillery firing tables: one of the first large computing workloads.

When I was in school, some of our mathematics and physics exams did not allow the use of calculators. Instead, when complex arithmetic problems were involved, we were allowed to bring in something called log tables: collections of precalculated values used to solve complex arithmetic problems in place of a calculator. (I never understood the point then, but maybe it was all for this moment.) An artillery firing table was essentially the Army version of a log table. Instead of telling you what 10.14 ÷ 2.38 equals, it turns complex ballistic calculations into simple lookups that a gun crew can use to quickly determine the correct elevation or angle before firing at a target.

Unlike arithmetic, which does not change with time, firing tables were different based on the gun type, barrel wear, and environmental conditions. To generate the data for each of these situations, ballistic trajectories had to be solved numerically. Human calculators could not produce these tables with the speed and accuracy the war demanded. BRL used a Differential Analyzer, the state-of-the-art computer of the time, to run these calculations. They brought down the computation time from 3 days to just 15 minutes per trajectory. This was still too slow, and the results were often inaccurate, requiring human verification. The most important computational workload of that time was still waiting for a faster machine.
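To make the idea of a firing table concrete, here is a minimal sketch of the two steps involved: numerically integrating a trajectory, then turning the results into a lookup. This is not the method BRL used. The quadratic drag model, the drag_coeff constant, the muzzle velocity, and all function names are assumptions chosen for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_range(elevation_deg, muzzle_velocity, drag_coeff, dt=0.01):
    """Integrate a 2D trajectory with explicit Euler steps; return range in metres.

    drag_coeff is a hypothetical lumped constant (units 1/m) for a drag force
    proportional to speed squared. Real firing tables used far richer models.
    """
    theta = math.radians(elevation_deg)
    x, y = 0.0, 0.0
    vx = muzzle_velocity * math.cos(theta)
    vy = muzzle_velocity * math.sin(theta)
    while True:
        speed = math.hypot(vx, vy)
        ax = -drag_coeff * speed * vx
        ay = -G - drag_coeff * speed * vy
        x_new = x + vx * dt
        y_new = y + vy * dt
        vx += ax * dt
        vy += ay * dt
        if y_new < 0.0:
            # Interpolate linearly between the last two points to find
            # where the shell crosses ground level.
            frac = y / (y - y_new)
            return x + frac * (x_new - x)
        x, y = x_new, y_new


def build_firing_table(elevations_deg, muzzle_velocity, drag_coeff):
    """Precompute range for each elevation: the expensive step done once."""
    return {e: simulate_range(e, muzzle_velocity, drag_coeff) for e in elevations_deg}


def lookup_elevation(table, target_range):
    """The cheap step a gun crew performs: pick the closest tabulated range."""
    return min(table, key=lambda e: abs(table[e] - target_range))


# Assumed example values: one gun type, one set of conditions.
table = build_firing_table(range(5, 50, 5), muzzle_velocity=450.0, drag_coeff=0.0001)
best_elevation = lookup_elevation(table, target_range=6000.0)
```

Every change in gun type, barrel wear, or weather corresponds to a new set of constants and therefore a whole new table, which is why the workload was so large.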

In 1943, the Army approved a proposal from J. Presper Eckert and John Mauchly at the University of Pennsylvania to build a machine with a completely new architecture. This computer, called the Electronic Numerical Integrator and Computer (ENIAC), had one goal: calculate firing tables at an exponentially faster rate to help with the US’s World War II defense. On September 2, 1945, just two years after the ENIAC project was approved, Japan surrendered, and this officially marked the end of World War II. ENIAC served its purpose and has its place in history as one of the most influential computers.

Well… that’s not quite how the story goes. Sure, ENIAC was commissioned to calculate firing tables, and its architecture actually improved the speed of trajectory calculations, reducing the time taken from 15 minutes on the Differential Analyzer to about 30 seconds. ENIAC also made giant strides to improve the setup time and accuracy when running ballistic calculations. But one detail surprised me: ENIAC was announced complete in February 1946. This means ENIAC was never used to compute a single entry in the artillery firing tables used during World War II. From 1943 to 1945, ENIAC experienced major delays, and also went significantly over its initial budget. ENIAC also consumed on the order of 150-200 kW of power, and sometimes contributed to power outages in the Philadelphia area. When the war ended, ENIAC was just an expensive infrastructure investment whose primary workload was no longer important.

Why ENIAC Still Matters

Despite a lack of demand for ballistic calculations, the ENIAC project continued after the war. In fact, this is where the ENIAC story really starts, and was my motivation to write this post.

ENIAC has lived through what I would call an “interesting architectural life.” It was rushed into existence to handle one specific workload. But that’s not why we remember ENIAC today. The ENIAC architecture is widely regarded as the first modern computing architecture. It was one of the first computers capable of running a program. ENIAC went on to effectively run weather models, nuclear simulations, and scientific computations. These workloads fell well outside the scope of military applications that it was designed for. At one point, the demand for ENIAC was so high that renting ENIAC for two days cost as much as buying a new car.

This period of immense success also came with financial and political pressures on ENIAC to live up to the tag of a “general purpose computer.” That’s when the wheels started to come off, and the architectural deficiencies started to become evident. Eventually, it became clear that the ENIAC architecture could no longer be repurposed to handle emerging workloads, and it had to be retired. The computing industry is less than 100 years old. It sounds like a lot of time, but it takes a long time for a new architecture to get adopted, and even longer for it to be completely abandoned. The ENIAC story is one of the few stories that has come full circle, and has also been well documented. This presents a rare opportunity to analyze decisions that worked, and more importantly, decisions that failed.

There is another reason why ENIAC matters. Fundamentally, ENIAC looks nothing like modern computers. Many aspects of ENIAC are no longer directly relevant today, starting from the fact that ENIAC’s fundamental building block is a vacuum tube, not a transistor. However, since ENIAC essentially gave birth to modern computing, the ENIAC computer is simple enough (by today’s standards) that we can try to understand all aspects of the design. This includes the hardware architecture, software decisions, infrastructure challenges, supply chain issues, and interpersonal dynamics that were involved in the design. These details, in addition to making for a great story, are important to explain both the good and bad decisions that shaped ENIAC. In my opinion, attempting the same exercise with a modern computer, while more relevant, is almost impossible due to the sheer complexity involved.

This is why ENIAC is worth your time. But why now?

The New Infrastructure Buildout

Studying ENIAC is not just an academic exercise. Today, we are racing to build advanced computers that fill warehouses. They consume gigawatts of power, cost tens of billions of dollars to build, and are optimized for a single dominant workload: large-scale LLM training. This is not very different from how ENIAC began.

There are a lot of debates happening around the world about whether this investment is justified. I don’t see a lot of value in these debates because I think we are well past that point - this infrastructure buildout is now inevitable. But if there is one thing that the ENIAC story tells us, it is this: the workload won’t remain the same. When a project attracts large investment, especially from traditionally “non-tech” sources like sovereign funds, it is expected to outlive its first workload. Cities do not build large bridges merely to connect two houses on opposite sides of a river; these projects are funded to enable large-scale economic growth in multiple different ways.

Language models are the first dominant workload of today’s compute infrastructure, and they are expected to keep it fully utilized in the near term. But I am certain we will see new workloads soon. It could be as simple as a different approach to advance AI. Or it could be something well beyond AI - astrophysics, quantum algorithms, genomics. Regardless of which buzzword you choose to believe in, two important questions emerge:

Which compute infrastructure can last the longest before we need a clean reset?
How effectively can we map future workloads onto the infrastructure we are building today?

By studying ENIAC, I hope to build a mental model to help answer these questions. As Nvidia CEO Jensen Huang put it, even if you are not the first one to catch the metaphorical apple that falls from the tree, this mental model can help you become the first person to pick up the apple from the ground.

That concludes Part 1 of this series. If you like what I just said, follow along for upcoming posts in this series, starting with a deep dive on the ENIAC architecture.

Sources:

My primary reference for this study is the book “ENIAC in Action” which goes into an impressive level of detail about many aspects of this computer.

If you know any other resources to help with this project, leave a comment!

Mapping algorithms to custom silicon - Part 1

Bharath Suresh — Mon, 09 Feb 2026 00:15:44 GMT

This is part of a series of posts on custom accelerators. Here’s a link to all parts:

Unless you have been living under a large silicon wafer, you would have heard some version of “custom silicon is the future.” In an earlier post, I attempted the classification of this exploding set of XPUs. While classification would help someone understand how to best match today’s chip with today’s algorithms, the real billion dollar question (literally) is: how can we ensure that tomorrow’s chips are best suited to run tomorrow’s algorithms?

One of the biggest problems in hardware-software co-design is that hardware and software have always existed on separate islands. So, the right way to answer this question is by bringing the two a little closer. In that spirit, this post (and hopefully many more to come) is the result of my collaboration with Avik De. Together in this post, we explore the “messy middle” between algorithms and silicon.

The push and the pull

To make custom silicon work, it needs both a “push” and a “pull”. There is a “push” from the side of the hardware designer (who we will call Harry) to get more developers to use their chip, and a “pull” from the algorithm writer (who we will call Alice) to find the best platform on which to run their software.

The algorithm writer (Alice) wants to be able to frame their algorithm in a high-level language. This algorithm could be anything, ranging from the numerous advancements in machine learning, to algorithms from emerging domains like robotics. Although Alice is confident about the value of her algorithm, she is unaware of the best hardware architecture to run it on. This is the “pull” - Alice needs to be able to find hardware to run her algorithm while staying within her software ecosystem of comfort.

Harry, on the other hand, has a great idea for a power-efficient way to compute matrix products by optimizing the operations, the instruction overhead, or the memory movement. He develops the design and synthesizes it on an FPGA. He knows it’s better than any other hardware architecture out there, but he has no idea how to get people to try it. This is the “push” - Harry wants to get his ideas out there, but needs to bridge it to Alice’s “pull.”

In theory, this looks like a match made in heaven - can’t Alice just take her algorithm and “run it” on Harry’s chip? Well, not quite, because of an idea that has been around as long as computing, but continues to be challenging.

An old analogy

The idea that Alice should be able to “just run” her code on Harry’s hardware is not new. In fact, it is the idea that made general-purpose computing work. A CPU solves this problem using the idea of Code Abstraction.

Say Alice was mapping an algorithm to run on a general purpose CPU - she would simply write the algorithm in a high-level language like C++ or Python, “compile” this code, and run it on an x86 or ARM CPU. The reason why this process is so simple is the emergence of the idea of an Instruction Set Architecture (ISA) - which in simple words is a contract to map high-level languages to certain general purpose hardware. (This post skips the details of ISA and compilers, but if you are interested, check out this older Chip Insights post.)

The mapping between Alice’s algorithm and Harry’s hardware using a standard ISA can be understood using this diagram:

One of the key advantages of specifying an ISA is that it allows software developers to write code that can run on evolving hardware without needing to know the specifics. Say Alice compiled her algorithm using an x86 compiler like GCC. When the next generation of x86 CPU is released, one of the following applies:

In the majority of cases, as the hardware microarchitecture improves, the same binary can benefit from improvements without needing to be rebuilt.
Even if major hardware improvements need to be accompanied by changes to instruction encoding or order (for example, Advanced Vector Extensions (AVX)), this is managed by “the compiler” - Alice simply needs to recompile the same high-level language code with the latest version of the compiler in order to enjoy the benefits of the new and improved hardware.

This separation worked spectacularly well, because:

Hardware designers like Harry innovated on pipelines, caches, branch predictors, and out-of-order execution.
Software designers like Alice largely ignored those details and built software that lived for decades.
Compilers absorbed the complexity at the boundary.

However, custom accelerators break this assumption.

Why isn’t there a standard ISA for custom silicon?

In the early days of CPUs, every CPU provider had their own custom ISA, largely because hardware constraints and software ecosystems were still tightly coupled. Over time, a few winners emerged, leaving us with just 3-4 significant ISAs, each supported by mature compilers and software ecosystems. This convergence was possible because general-purpose programs share a common structure: scalar control flow, pointer-based memory access, and relatively uniform performance characteristics across workloads.

There is a common misconception that custom accelerators would evolve in the same way - that over time, a standard ISA would emerge, and every accelerator would support it. But this idea is fundamentally flawed. To truly understand why a standard ISA cannot exist for custom accelerators, let’s consider a thought experiment. Assume that Harry defines a standard ISA for his accelerators. Let’s try to think about how that ISA would look for an operation like matrix multiplication.

If Alice is using PyTorch, the operation would look something like this:

result = torch.matmul(A, B)

There are certain aspects of matrix A and matrix B that inform Harry how to execute this operation efficiently on his specific hardware. Crucially, these aspects influence not what computation is performed, but how it is scheduled and mapped onto hardware resources. For example, the shapes of A and B can be used to decide:

Tiling strategy: how many chunks should the matrix multiplication be split into, based on the available hardware resources?
Buffering techniques: what data can be stored in the fast, on-chip SRAM, and what should be moved to DRAM?
Data layout: for the specific size of A and B, is column major better, or does a tiled layout reduce memory accesses?

This is a short list that only considers the matrix shape - several other factors, like the data types, sparsity, etc. can be exploited to map this operation efficiently in hardware.
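As a rough illustration of the tiling decision, the sketch below picks a tile size from an assumed on-chip buffer budget and then multiplies the matrices tile by tile. The names pick_tile, tile_count, matmul_tiled, and the sram_words budget are hypothetical; a real accelerator would also weigh layout, data type, and sparsity.

```python
import math


def matmul_naive(A, B):
    """Reference result: plain triple loop over lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def pick_tile(sram_words):
    """Largest T such that three T x T tiles (A, B, C) fit in sram_words.

    Never returns less than 1, even if the budget is too small for that.
    """
    t = 1
    while 3 * (t + 1) ** 2 <= sram_words:
        t += 1
    return t


def tile_count(n, k, m, tile):
    """How many tile-sized chunks an (n x k) by (k x m) product splits into."""
    return math.ceil(n / tile) * math.ceil(k / tile) * math.ceil(m / tile)


def matmul_tiled(A, B, tile):
    """Same result as matmul_naive, computed one tile at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Only this block of A, B, and C needs to be on-chip here.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


# Assumed budget: room for 48 words across the three tiles.
tile = pick_tile(sram_words=48)
chunks = tile_count(3, 5, 2, tile)
```

The computation is identical in both functions. Only the order of memory accesses changes, and the best order depends on the shapes and the hardware, which is exactly the information an ISA contract cannot carry.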

However, Harry needs to define a standard ISA that encompasses all these possibilities - the same ISA must work for a large square matrix with significant sparsity, as well as for matrices with just a single row or column. The compiler alone is of limited help here, because many of the most performance-critical decisions like tiling, buffering, and layout are instance-specific and cannot be encoded directly into an instruction set contract.

If Harry tries to account for all these possibilities within an ISA, he must either:

Over-specify behavior, resulting in an unoptimized, CPU-style ISA and microarchitecture.
Stick to a very low-level abstraction, resulting in bloated hardware - like one large matrix multiplier with huge amounts of memory.

In either case, a standard ISA would strip away the very benefits that customization is meant to provide. So where does that leave Harry and Alice?

The two bridges

Since Harry can’t simply expose a standard AI ISA and call it a day, the question becomes: how does his hardware connect to Alice’s world at all?

Today, there are two practical bridges between them.

1. Runtime-level integration

In the first approach, Alice expresses her model in a standardized, framework-agnostic format. A shared runtime is responsible for executing that model and deciding which parts run on which device. From Alice’s perspective, almost nothing changes. She writes PyTorch or TensorFlow code and exports a model. From Harry’s perspective, the integration surface is narrow and predefined: he implements support for a fixed menu of operations, and the runtime takes care of everything else.

A useful analogy here is a food court appliance. Alice places an order by pointing to items on a menu: “grill this,” “blend that,” “heat this up.” The food court manager (the runtime) decides which appliance handles which step. Harry builds a specialized appliance that performs certain actions very efficiently. If the order matches the appliance’s menu, the runtime sends the work to Harry’s hardware and it shines. If not, the runtime quietly routes those steps to a different device that knows how to handle them (typically a CPU or a GPU).

This is exactly how runtime-level integration works:

The model is a graph of predefined operations
Each operation is stateless and self-contained
Control flow and orchestration live outside the accelerator
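The three properties above can be sketched as a toy dispatcher. This is a simplified model, not the actual partitioning logic of ONNX Runtime. Provider, place_ops, and the tinyxpu name are hypothetical, while MatMul, Add, and Relu are standard ONNX operator names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """A device plus the fixed menu of operations it accelerates."""
    name: str
    supported_ops: frozenset


def place_ops(graph_ops, providers, fallback):
    """Assign each operation to the first provider whose menu lists it.

    graph_ops is the model flattened into execution order. Anything no
    provider claims goes to the fallback device, which is assumed to be
    able to run every operation.
    """
    placement = []
    for op in graph_ops:
        target = next((p for p in providers if op in p.supported_ops), fallback)
        placement.append((op, target.name))
    return placement


# Harry's accelerator only knows matrix multiplication.
tinyxpu = Provider("tinyxpu", frozenset({"MatMul"}))
cpu = Provider("cpu", frozenset())  # fallback, assumed to handle everything

model = ["MatMul", "Add", "Relu", "MatMul"]
plan = place_ops(model, [tinyxpu], cpu)
```

Note that the accelerator never sees the graph. It is only handed individual operations, which is what keeps Harry's integration surface narrow.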

A concrete example of this approach is ONNX Runtime Execution Providers, where hardware vendors accelerate supported operators while delegating unsupported parts of the model to a fallback device. This diagram explains how Alice’s software and Harry’s hardware can talk to each other using ONNX.

This model works well for inference workloads, where computation is a fixed graph of tensor operations. However, because execution and control flow remain outside the accelerator, it limits how much of the hardware structure (custom memory hierarchies, interconnect flow) can be exposed.

2. Compiler-level integration

In the second approach, the bridge moves deeper into the stack. Instead of plugging into a runtime API, Harry integrates at the compiler level, where programs are transformed before execution. Alice still writes high-level code, but now the compiler is responsible for mapping that computation onto the hardware - deciding how loops are formed, how memory is reused, and how execution flows on the device.

The analogy here is a custom-built kitchen.

Instead of ordering from a menu, Alice hands over a recipe. Harry designs the kitchen itself: where ingredients are stored, how cooks move, which steps happen in parallel, and how decisions are made mid-cooking. The recipe is compiled into a precise plan tailored to that kitchen.

This may sound similar to targeting a standard ISA, but the distinction is critical. A standard ISA fixes the instruction set and execution model that all software must target, forcing hardware innovation to happen below that boundary. Compiler-level integration, in contrast, fixes only the meaning of the program. The compiler is free to reshape loops, memory usage, and control flow to match the hardware, without exposing a stable instruction set to the programmer. This allows each accelerator to express its architectural strengths without being constrained by a one-size-fits-all ISA.
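One way to picture "fixing only the meaning of the program" is operator fusion. In the sketch below, the same program (a matrix-vector product, a bias add, and a ReLU) is executed with two different schedules. The function names and the choice of fusion as the example are assumptions for illustration; this is not how any particular compiler is implemented.

```python
def run_unfused(W, x, b):
    """Operator-at-a-time schedule: each step writes a full temporary."""
    t0 = [sum(w * xi for w, xi in zip(row, x)) for row in W]  # MatMul
    t1 = [v + bi for v, bi in zip(t0, b)]                     # Add
    return [max(0, v) for v in t1]                            # Relu


def run_fused(W, x, b):
    """Fused schedule: one pass, no intermediate vectors kept in memory."""
    out = []
    for row, bi in zip(W, b):
        acc = bi  # start the accumulator at the bias
        for w, xi in zip(row, x):
            acc += w * xi
        out.append(max(0, acc))
    return out


# Integer inputs keep the comparison exact. With floating point, the
# reordered additions can differ in rounding, so equality would normally
# be checked up to a tolerance.
W = [[1, -2], [3, 4]]
x = [5, 6]
b = [1, -30]
same = run_unfused(W, x, b) == run_fused(W, x, b)
```

A compiler that owns the whole program can choose between schedules like these per device. A runtime that dispatches one operator at a time is limited to the first form.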

This approach requires much more effort from Harry, because he must:

define how programs are lowered,
expose memory and synchronization primitives,
and implement a runtime that can execute compiled programs.

But in return, Harry can support:

data-dependent control flow,
irregular or sparse computation,
custom memory layouts,
and long-lived, stateful programs.

A representative example of this model is IREE, which provides a compiler, optimizer, and runtime along with a hardware abstraction layer. Instead of mapping individual operators, Harry defines how programs execute on his device. This turns the accelerator from a co-processor that executes individual operations into a programmable compute target that runs complete programs independently, as shown in this diagram.

It’s worth emphasizing that IREE can ingest programs from multiple frontends, including ONNX, before lowering them through its compiler stack. The distinction between runtime-level and compiler-level integration is therefore not about which framework Alice starts from, but where hardware-specific decisions are made.

Conclusion

We’re entering a phase where meaningful gains in AI performance and efficiency are increasingly coming from custom silicon. But the success of that silicon depends less on raw compute and more on where it connects into the software stack.

This post was intentionally a high-level overview of that connection point. The space is complex, fast-moving, and still evolving, but at a distance, two broad patterns are already visible. Runtime-level integration offers a fast path to deployment by fitting new hardware into an existing execution model, while compiler-level integration demands more effort but unlocks far greater control over how computation actually runs. Neither approach is “better” in the abstract - each reflects a different tradeoff between ease of integration and expressive power.

In the next parts of this series, we’ll make these ideas concrete. On Chip Insights, we’ll walk through building a simple runtime-level backend, starting with a CPU implementation and then extending it toward custom hardware on an FPGA. In parallel, on min{power}, we’ll look at the problem from Alice’s side: how different classes of AI and robotics workloads place fundamentally different demands on hardware, and why those differences increasingly matter.

You can find the next part of this series here:

The Accidental Comeback of Verilog

Bharath Suresh — Sat, 17 Jan 2026 05:00:49 GMT

In 2022, a graduate-level chip design course at UCLA gave me a glimpse of what felt like the future. Instead of writing RTL in Verilog, we were required to use a Python framework that generated synthesizable Verilog under the hood. I had already been using Verilog for a few years by this point, so this new workflow felt strange at first. But after a few weeks, something clicked. I started to enjoy it. So much so that I made my final project a head-to-head comparison: plain Verilog versus this Python-based approach across several types of RTL IP. The results seemed clear. Yes, the Verilog implementations had slightly better power, performance, and area (PPA) metrics. But the Python framework won on almost everything else that mattered to an RTL designer: it scaled better, made reuse trivial, and produced RTL that was far easier to read and reason about.

I walked away convinced that this was the direction chip design was headed. Higher-level frameworks would replace handwritten Verilog. That’s where I was ready to place my bets.

Lucky for me, I didn’t have the money to bet - because I was completely wrong.

Verilog Is Fine. What Next?

If you have been reading Chip Insights long enough, you’ll know about my two-part HDL saga documenting the story of how Verilog became what it is today. I ended this story in the mid-2000s, when EDA companies started to push for the widespread adoption of SystemVerilog. From this point forward, the Verilog family (plain Verilog and SystemVerilog) emerged as clear winners in the HDL wars. However, there was a parallel storyline which I did not cover in that post.

The story of HDLs was closely related to the emergence of a new simulation technique called behavioral simulation - as designs got bigger, gate-level simulations became slow and unusable for routine functional checks. Chronologic Simulation (later absorbed into Synopsys) made behavioral simulation in Verilog much faster through the Verilog Compiled Simulator (VCS) - Verilog was first compiled into C, and this C program was then used to run simulations. Great story. The bedrock of modern logic simulation. But that’s not the point here. If you can convert an HDL like Verilog into a high-level language like C, why can’t we start with a high-level language directly?

In 1998, Forte Design Systems came up with a tool called Cynthesizer (a pretty clever name, I must add), which allowed a designer to build synthesizable logic using SystemC. In 2001, Sony became the first company to tape out a chip using this approach, which would later be known as High-Level Synthesis (HLS). As transistor scaling continued in the 2000s, design cost and complexity grew rapidly. Even the most modern HDL of the time, SystemVerilog, was inadequate for human productivity to catch up with Moore’s law. This made HLS a compelling candidate.

There are two primary reasons why HLS was so attractive. The first is obvious: high-level languages provide constructs that HDLs lack, which would boost the productivity of chip designers. The second reason was even more compelling. HLS represented a fundamental “shift left” strategy to move hardware design earlier in the development cycle and closer to software design. In other words, anyone who can code should be able to design the hardware to run their code. Believers of HLS shared the same optimism that I did when I used Python for RTL design. In fact, in the mid 2000s, Professor Jason Cong built the xPilot HLS system in the same UCLA building. xPilot pioneered a lot of algorithmic innovations that made HLS more than just a convenience - the PPA metrics started to improve as well. xPilot became AutoESL Design Technologies, which was acquired by Xilinx (now AMD) in 2011.

So far, this sounds more like a success story of HLS than what my title suggests. So what happened? HLS definitely proved to be useful in certain cases - like rapid prototyping and deployment on FPGAs. (The core technology of xPilot powers Vivado HLS, which is still used widely today.) However, HLS could never fully replace Verilog, because the promise of HLS was too good to be true. Even today, HLS-generated designs often consume more hardware resources and achieve significantly worse clock frequencies compared to handwritten RTL. More importantly, HLS never truly became what it promised. The fundamental mismatch between software-oriented C/C++ languages and hardware structures meant that HLS tools still can’t reliably produce synthesizable designs that integrate well with existing EDA tools which remain overwhelmingly Verilog-centric.

As a compromise, the idea of a Hardware Construction Language (HCL) was born. While HLS tools were intended to automatically infer the best microarchitecture from high-level algorithms, HCLs are high-level language extensions that allow designers to express hardware structure, while simultaneously leveraging powerful software constructs. Chisel, an extension of Scala, was developed at UC Berkeley in 2012 and is one of the most successful examples of an HCL. By the way, the Python framework I mentioned at the start of this post was also an HCL.
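To give a feel for the HCL idea, here is a minimal sketch in which an ordinary Python loop describes hardware structure and emits Verilog. It is not Chisel and not the framework from my course; real HCLs build a circuit representation instead of pasting strings. The function make_delay_line and the module it generates are hypothetical.

```python
def make_delay_line(name, width, stages):
    """Return Verilog source for a `stages`-deep register pipeline.

    The designer still decides the structure (how many registers, how
    they connect); Python only supplies parameters and loops.
    """
    if width < 1 or stages < 1:
        raise ValueError("width and stages must be at least 1")
    lines = [
        f"module {name} (",
        "    input  wire clk,",
        f"    input  wire [{width - 1}:0] din,",
        f"    output wire [{width - 1}:0] dout",
        ");",
    ]
    # One register per pipeline stage.
    for s in range(stages):
        lines.append(f"    reg [{width - 1}:0] stage{s};")
    # Each stage samples the previous one on the rising clock edge.
    lines.append("    always @(posedge clk) begin")
    for s in range(stages):
        src = "din" if s == 0 else f"stage{s - 1}"
        lines.append(f"        stage{s} <= {src};")
    lines.append("    end")
    lines.append(f"    assign dout = stage{stages - 1};")
    lines.append("endmodule")
    return "\n".join(lines)


verilog_source = make_delay_line("delay3", width=8, stages=3)
```

Changing stages from 3 to 30 is a one-character edit here, whereas the handwritten equivalent needs either thirty declarations or a generate block. That is the reuse and scalability benefit in miniature.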

While HLS and HCLs were not perfect, they seemed like early versions of what the future of chip design would look like. HLS algorithmic innovations continue to improve PPA metrics, and a lot of new processor designs are built using frameworks like Chisel. All this while, the Verilog family has not made any major strides. If I had written this post a few years ago, this would be the end of the HDL story. But almost overnight, a new playbook has emerged.

You Can Have Your Cake and Eat It Too

The original promise of HLS and HCLs was never about replacing Verilog for the sake of it. It was about building chips faster: by increasing designer productivity, reuse, and scalability. These benefits came with tradeoffs like worse PPA and painful EDA integration. As they say, there are no free lunches.

Generative AI changes this equation in a fundamental way.

With GenAI-assisted Verilog, you get many of the same benefits that HLS and HCLs aimed to provide, without abandoning the Verilog ecosystem. The productivity gains are the most obvious. Writing boilerplate RTL, parameterizing modules, refactoring interfaces, or instantiating complex microarchitectural patterns are all tasks that LLMs handle remarkably well. What once justified a higher-level language now often collapses into a single prompt. You still end up with Verilog, but you get there faster, with fewer errors, and with much less cognitive overhead.

More interestingly, GenAI quietly revives one of the most compelling arguments for HLS: the “shift left.” While this shift was long promised, HLS never reliably delivered on it. PPA was difficult to estimate accurately at the HLS abstraction, and those estimation errors were often more costly than starting from RTL in the first place.

Generative AI flips this dynamic entirely. Instead of postponing RTL, it accelerates its creation. High-level specifications, performance targets, and even informal design intent can now be translated directly into functional RTL models early in the design cycle. Hardware and software can evolve in parallel, not because RTL is avoided, but because it is cheap to generate, modify, and discard. In other words, GenAI enables a shift left, without a shift away from Verilog.

Why Verilog Was Ready for This Moment

Verilog didn’t reinvent itself for the GenAI era. The GenAI era quietly reinvented itself around Verilog.

First, Verilog sits at the center of the EDA ecosystem. Decades of tool development have gone into taking Verilog as input and squeezing out the best possible PPA. Synthesis, place-and-route, timing closure, power analysis: these flows are deeply optimized for RTL written in Verilog and SystemVerilog. The switching costs are enormous.

Second, in the already limited chip design data available to train large language models, Verilog dominates. This creates a powerful flywheel effect. Verilog is used to train LLMs, which then generate more Verilog, further reinforcing its position as the language of digital design.

Today, we are seeing a combination of these two factors, in the form of closed-loop RTL design agents. Verilog code is both a cycle-accurate and a formally checkable artifact, which can be used to generate accurate reward signals to improve the AI. (I’m not going into details of these systems here, but all signs are pointing towards such RL environments becoming the future.) The implication of this is that GenAI won’t just lead to more Verilog, it will lead to better Verilog. Over time, this creates a compounding effect - Verilog will become the most optimized way to design chips, even if that Verilog is generated by an AI agent.

In the world I just described, HCLs and HLS tools seem irrelevant. If Verilog is cheap to generate, easy to iterate on, and continuously optimized by feedback, the incentive to move away from it fades. Verilog survived long enough to see the light at the end of the tunnel. Now, the future looks brighter than ever.

The Computer Architecture Calendar

Bharath Suresh — Sun, 04 Jan 2026 00:59:07 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

Another disclaimer: The characters and events in this post are fictitious (but the conferences are real). Any resemblance to real people is purely coincidental (though statistically inevitable in my audience). For the record, I have not faithfully followed all of these conferences in the past. This post exists mainly because I’d like to start doing so this year and thought I might as well share ten conferences I think are useful. Unfortunately, I couldn’t resist making the whole thing a little funny, so here we go…

The protagonist of this story is someone I’m sure you have crossed paths with. They forget birthdays, miss tax deadlines, and couldn’t name a single band on tour - but they know exactly when and where every major computer architecture conference is held.

Their calendar isn’t divided into months. It’s divided into trade shows.

This is how a year in that person’s life looks.

CES: Between hope and hype

Our protagonist loves speculation. Branch prediction, speculative execution, prefetching: all the performance gains they deliver are built on confidently guessing what will happen next. So there is no better way to start the year than CES in January. While CES isn’t always about chips, the recent AI accelerator mania has turned it into the perfect launchpad for the year. Bold claims, ambitious roadmaps, and just enough technical detail provide our protagonist with exactly the right mix of signals and buzzwords to carry them through the months ahead.

Why it matters

CES tells you what’s being productized and will therefore shape algorithms and architectures of the future.

How to follow it efficiently

Watch the keynotes. Skim the announcements. Pay attention to products that actually exist, not just the slides describing them.

One thing to watch in 2026

Has anyone built a useful, physical product where AI is solving a genuine problem?

ISSCC: Let’s get real

By February, our protagonist’s speculation bubble goes through a silicon trial by fire. ISSCC is where we stop guessing and start measuring. Statements and projections from CES slides are replaced by die photos, power numbers, and phrases that begin with “fabricated in…”. Each one either validates past optimism or quietly resets expectations.

Why it matters

ISSCC is a good way to understand how performance and power scale between architectural simulations and real silicon.

How to follow it efficiently

Skim broadly to identify the most relevant designs. Then read one or two papers deeply to understand the true state of the art. Watch out for assumptions.

One thing to watch in 2026

How is Moore’s Law really progressing? Are improvements coming from architecture, process nodes, or both?

GTC: Crowning the Starboy

March belongs to the Taylor Swift concert of today’s tech world. Our protagonist once followed GTC for innovations in parallel computing architectures. Now, it’s a coronation ceremony for Nvidia’s platform and its developers. Everyone listens to Jensen Huang carefully, because this is where the industry’s vocabulary for the next twelve months is minted.

Why it matters

GTC now defines the AI narrative and vocabulary that everyone else ends up adopting.

How to follow it efficiently

Watch the keynote. Then pick a handful of genuinely technical sessions. Ignore the excessive press coverage and stock-price astrology.

One thing to watch in 2026

What does Nvidia’s platform roadmap look like for inference - especially in a post-Groq-license world?

Computex: Board yet?

Computex reminds our protagonist that chips do not exist in isolation. This is where architecture turns into motherboards, racks, cooling solutions, and deeply uncomfortable power budgets. It’s less about novelty and more about integration. More recently, it has also become the place where vendors launch their latest SoCs for, you guessed it, AI.

Why it matters

Computex shows what it actually takes to turn a good architecture into a usable product. It’s important to be aware of what’s happening outside the cores.

How to follow it efficiently

Get a high-level view of system architectures for the latest SoCs. Pay attention to reference designs and power, thermal, and memory numbers when they’re provided.

One thing to watch in 2026

Whether the industry finally agrees on what an “AI PC” is, and if any new players enter that space.

ISCA: Sheer elegance

ISCA is where our protagonist feels intellectually nourished. The best ideas in computer architecture are here in their purest form. Some are brilliant. Some are fictional. A few will quietly influence designs a decade later. No product timelines, no market requirements. Just elegant computer architecture.

Why it matters

ISCA sharpens your architectural intuition by offering new ways to think about familiar problems.

How to follow it efficiently

Skim all abstracts. Then invest real time in a few papers that genuinely intrigue you. Ask whether ideas from one domain could transfer to another.

One thing to watch in 2026

How are CPU architectures evolving in the era of accelerated computing?

DAC: Ship faster

All the old-school, hype-free academic computer architecture study leads our protagonist to wonder if AI is just a bubble. DAC arrives at exactly the right time, reassuring them with claims that AI-powered tools can reduce the time to ship a new chip from years to seconds. It also serves as a reminder of the less glamorous innards of chip design, like verification and physical design.

Why it matters

An architecture is only as good as its ability to be built, and more importantly, built on time. DAC shows how semiconductor tooling is evolving to close the gap between ideas and tape-out.

How to follow it efficiently

Skim panels to understand what’s new in tooling. Before getting excited, always check whether these tools actually work on industry-scale designs.

One thing to watch in 2026

If, and how, AI-assisted design flows are being deployed in the industry.

Hot Chips: Industry, minus the hype

As summer draws to a close, our protagonist finds themselves desperately searching for the truth. This leads them to Hot Chips, where architects speak slowly, precisely, and with slides that took their companies months to approve. Tradeoffs are admitted. Constraints are acknowledged. Reality is explained. Our protagonist trusts this conference more than almost any other.

Why it matters

Hot Chips is an industry conference that is unusually light on marketing. The goal is to simply explain what was shipped and why.

How to follow it efficiently

This is worth spending time on. Watch the talks. Read the slides carefully. Take notes.

One thing to watch in 2026

How many meaningfully different AI architectures do we really have?

SEMICON Taiwan: Supply chain magic

By September, the story shifts to the heartland of semiconductor manufacturing. Here, our protagonist listens to conversations about yield, packaging, and process limits. The names of many exhibitors sound unfamiliar, but a closer look reveals them to be critical players in the semiconductor supply chain. It’s a reminder that architecture is important, but only a small part of a vastly more complex world.

Why it matters

Manufacturing constraints increasingly shape architectural decisions. They can’t be ignored.

How to follow it efficiently

Use this as a chance to understand the supply chain better. Process roadmap and packaging sessions are the most valuable.

One thing to watch in 2026

What’s new in the world of advanced packaging and 3D ICs that is actually ready for volume production?

MICRO: Grounding ideas to reality

If ISCA feeds our protagonist’s love for elegant ideas, MICRO satisfies their need to see those ideas survive contact with reality. This is where high-level architectural concepts are dragged into the microarchitecture and explained cycle by cycle. MICRO isn’t for the casual computer architecture enthusiast. The diagrams are denser, the assumptions sharper, and the discussions expect real chip-design experience.

Why it matters

MICRO bridges the gap between architectural ideas and real implementations, exposing the costs, tradeoffs, and complexity that abstractions tend to hide.

How to follow it efficiently

Focus on papers that include detailed evaluations and realistic assumptions. Pay attention to how architectural ideas translate into microarchitectural blocks.

One thing to watch in 2026

Are there credible examples of AI influencing microarchitectural decisions better than a human designer can?

IEDM: A glimpse of the future

The busy computer architecture calendar comes to a close at IEDM. Our protagonist listens to talks about devices, confidently thinking they won’t matter for years. Soon, a quiet fear sets in that, when these devices finally do matter, they might change everything. The thought sends a chill down their spine, and pushes them to search for notes from the one device physics class they took in undergrad.

Why it matters

IEDM defines what architectures may even be possible a decade from now. Following it is a form of long-term career insurance.

How to follow it efficiently

Don’t get stuck on the physics. Assume the devices work, then ask how they would reshape memory hierarchies, compute models, and system architectures.

One thing to watch in 2026

Which emerging technologies, like quantum, neuromorphic, and novel memories, are close to being deployed in real chips?

By the end of the year, our protagonist is carrying a heavy mental load: half-remembered acronyms, conflicting roadmaps, benchmark caveats, packaging buzzwords, and just enough existential dread to stay alert. They’ve learned when to believe, when to squint, and when to politely wait for silicon. And just as they start to feel like they’ve finally made sense of it all, the year resets. New nodes. New models. New claims. Same conferences. So, they clear their calendar, open a fresh notebook, and show up again, because as stressful and confusing as it sounds, somehow, it’s still fun.

Want to follow this year’s conferences with me?

I’ve created a shareable Google Calendar with all of these conference dates pre-filled, for anyone who wants their existential dread to be neatly scheduled. Subscribers will receive a link to this calendar - consider this a gentle nudge to subscribe if you haven’t already.

If there are other conferences that every computer architect should mentally budget for, let me know in the comments. The calendar, like the study of computer architecture, is never truly complete.

The Alphabet Soup of Processors

Bharath Suresh — Mon, 15 Dec 2025 02:00:22 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

CPU. GPU. TPU. NPU. DPU. IPU. LPU.

Every year, the alphabet soup gets richer. But I wonder if it’s adding any value? Over the last two years, GPUs were the unquestioned champions of the AI boom. Right now, TPUs are having a moment as the new architecture that is supposedly better for AI than GPUs. NPUs reliably show up during some product keynotes, often accompanied by “inference” or “low power.” Meanwhile, CPUs still have their audience of PC and smartphone enthusiasts.

At this point, the industry’s obsession with new “PUs” is starting to look uncomfortably familiar. We’ve seen this movie before - with process nodes. Once upon a time, node names conveyed real, comparable information. Then they became branding. “7nm” stopped meaning 7nm. “3nm” stopped being smaller than someone else’s “5nm.” The label survived; the meaning didn’t. I think processor acronyms are heading down the same path. Pick a new letter. Add “accelerator,” “custom,” or “ASIC.” Publish a carefully chosen benchmark. Congrats, you have a new processing unit.

The problem isn’t that TPUs, NPUs, or GPUs aren’t important or innovative. The problem is that the acronym leaves out the essence of the architecture: what it’s actually optimized for, what tradeoffs it makes, and where its real value comes from. If you squint hard enough, almost anything can be called an accelerator. If you zoom out far enough, almost everything looks like a CPU. This does not sit well with me.

Continuing with the theme of my last post, I don’t think the right question is “Is this a GPU or an NPU?” It’s whether we’re even using the right vocabulary to evaluate compute architectures at all. Instead of arguing over letters, we should be asking sharper questions - questions that force clarity instead of reinforcing branding. I think the following four questions are a far more informative way to classify chips.

Question 1. Host relationship: Is your product the boss, a helper, or just another block on the die?

Categories:

Standalone: Boots an OS and owns the system.
Co-processor: Lives behind PCIe or a fabric. Needs a host to schedule work, manage memory, and keep the lights on.
Integrated IP: One of many blocks inside an SoC, visible to software only through drivers or libraries.

Why it matters:

Host relationship determines who controls the system. Standalone processors shape the entire software stack and system architecture. Co-processors live and die by host integration and software orchestration. Integrated IP blocks rarely escape the platform they’re embedded in, no matter how impressive the pure-silicon performance is. Two chips with identical compute units can have wildly different impact depending on whether they’re in charge, or waiting for instructions.

Question 2. Domain coverage: How many different major software domains (at least a million developers) can this architecture serve effectively?

To clarify, I consider the following as major software domains of today.

System software: OS kernels, language runtimes, compilers, CLI tools
User Interfaces: desktop/mobile apps, browsers, light graphics
Data processing: SQL databases, analytics, streaming
Media & graphics: Video encoding/decoding, 2D/3D graphics pipelines
High Performance Computing: simulation, scientific computing, linear algebra
ML/AI workloads: training and inference across model families

Categories:

General purpose: The chip is competitive with, or better than, the incumbent across 3 or more domains
Domain specific: The chip excels only in 1-2 domains

Why it matters:

Domain coverage is the difference between platforms and point solutions. General-purpose architectures benefit from massive software ecosystems, long lifetimes, and constant reuse. Domain-specific chips can deliver spectacular gains, but only as long as the workload stays stable. When you are looking at a domain-specific chip, the domain matters as much as, if not more than, the silicon innovation. (As Jensen Huang puts it, pick “Zero Billion Dollar” markets.)

Question 3. Execution paradigm: What architectural feature delivers throughput?

Categories:

MIMD (Multiple Instruction, Multiple Data): Many independent cores, each running its own control flow. Classic multicore and many-core CPUs live here.
SIMT (Single Instruction, Multiple Threads): Groups of lanes share an instruction stream with per-lane masking. This is the beating heart of modern GPUs.
Custom: Fixed or semi-fixed dataflow architectures designed around specific computation patterns (e.g., systolic arrays, spatial fabrics).

Why it matters:

Execution paradigm reveals what kind of parallelism the architecture is betting on. MIMD favors flexibility and irregular control flow. SIMT thrives on massive data parallelism. Custom paradigms trade generality for efficiency. Once you understand this axis, performance claims stop sounding magical and start looking like predictable outcomes of design choices.
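To make the SIMT description a little more concrete, here is a minimal Python sketch of per-lane masking. It assumes a toy branch (double even values, negate odd ones) and a made-up helper named simt_branch; it is an illustration of the idea, not how any particular GPU implements it.

```python
def simt_branch(values):
    """Toy SIMT execution of: if v is even: v * 2, else: -v.

    All lanes share one instruction stream; a per-lane mask decides
    which lanes commit the result of each issued instruction.
    """
    results = list(values)
    steps = 0  # instructions issued to the whole group of lanes

    # The condition is evaluated for every lane, producing the mask.
    mask = [v % 2 == 0 for v in values]

    # "Then" side: issued once if any lane needs it; only masked-in lanes commit.
    if any(mask):
        steps += 1
        for lane, v in enumerate(values):
            if mask[lane]:
                results[lane] = v * 2

    # "Else" side: issued once if any lane needs it; only masked-out lanes commit.
    if not all(mask):
        steps += 1
        for lane, v in enumerate(values):
            if not mask[lane]:
                results[lane] = -v

    return results, steps


if __name__ == "__main__":
    print(simt_branch([2, 4, 6, 8]))  # uniform branch: only one side is issued
    print(simt_branch([1, 2, 3, 4]))  # divergent branch: both sides are issued
```

When every lane agrees on the branch, only one side is issued. When lanes disagree, the group pays for both sides, which is one way to see why SIMT rewards regular, data-parallel work.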

Question 4. Programmability model: How does a programmer get work done from your chip?

Categories:

Native ISA: General-purpose compilers (C/C++) target it directly.
Kernel-compiled: You write kernels in a language extension like CUDA, OpenCL, HIP, or a domain-specific language.
Graph-compiled: You describe computation as a graph (often via an ML framework like PyTorch or TensorFlow), and a compiler maps it to the hardware. In graph-compiled systems, the graph is the primary abstraction for the programmer, not the kernel. (Graph-compiled systems still use kernels underneath.)
Bitstream configured: You configure the hardware fabric itself. For example, FPGA bitstreams or CGRA configurations.

Why it matters:

Programmability determines who can use the hardware and how fast ecosystems form. Native ISAs scale with developer count. Kernel models reward specialists and expert programmers. Graph-based systems trade flexibility for convenience: they work extremely well when your problem fits the graph, and poorly when it doesn’t. Bitstream-based approaches offer ultimate hardware control, but at the cost of accessibility. Many promising architectures fail not because of silicon, but because the programming model never escapes a niche audience.
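One way to keep the four questions honest is to treat them as fields of a record that every chip has to fill in. The sketch below is only an illustration: the class and function names (ChipProfile, domain_coverage) are made up for this example, and "ExampleXPU" is a fictional chip, not a classification of any real product.

```python
from dataclasses import dataclass
from enum import Enum


class HostRelationship(Enum):
    STANDALONE = "standalone"
    CO_PROCESSOR = "co-processor"
    INTEGRATED_IP = "integrated IP"


class ExecutionParadigm(Enum):
    MIMD = "MIMD"
    SIMT = "SIMT"
    CUSTOM = "custom"


class Programmability(Enum):
    NATIVE_ISA = "native ISA"
    KERNEL_COMPILED = "kernel-compiled"
    GRAPH_COMPILED = "graph-compiled"
    BITSTREAM_CONFIGURED = "bitstream configured"


def domain_coverage(num_domains: int) -> str:
    """Question 2 rule from the post: 3 or more major domains is general purpose."""
    if num_domains < 1:
        raise ValueError("a chip should serve at least one domain")
    return "general purpose" if num_domains >= 3 else "domain specific"


@dataclass(frozen=True)
class ChipProfile:
    name: str
    host: HostRelationship
    domains_served: int  # number of major software domains served well
    paradigm: ExecutionParadigm
    programmability: Programmability

    def summary(self) -> str:
        return (f"{self.name}: {self.host.value}, "
                f"{domain_coverage(self.domains_served)}, "
                f"{self.paradigm.value}, {self.programmability.value}")


if __name__ == "__main__":
    # Fictional chip used purely to exercise the record.
    chip = ChipProfile("ExampleXPU", HostRelationship.CO_PROCESSOR, 1,
                       ExecutionParadigm.CUSTOM, Programmability.GRAPH_COMPILED)
    print(chip.summary())
```

The point of the record is that none of the four fields can be left blank or replaced by a three-letter acronym.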

Taken together, these questions offer a better way to think about modern compute architectures. Once you look at chips through these lenses, debates like GPU vs TPU vs NPU start to feel oddly shallow. Architectures stop being mysterious, and performance claims start to look like the predictable outcomes of very specific tradeoffs. For instance, below is a classification of several chips that are often loosely grouped under the label “accelerators.” This classification forces each architecture to reveal its position along concrete design axes: who controls the system, how broad the user base is, where throughput actually comes from, and how programmers interact with the hardware. Chips that are often lumped together as “accelerators” end up in very different places once you apply these questions, and those differences explain far more about real-world impact, adoption, and longevity than any three-letter acronym ever could.

If you haven’t subscribed yet, here’s a reason to: when you subscribe, you’ll receive the link to the full list with other categories and explanations.

If you have read so far, I’m curious - what other questions would add value to chip architecture classification?

What should I write if AI can write everything?

Bharath Suresh — Sun, 07 Dec 2025 07:17:06 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

It’s been more than two months since I wrote my last post. One of the reasons I took a pause was to deliberate over a question that just wasn’t leaving me: Does a technical newsletter like Chip Insights matter in a world where GenAI keeps getting better? To answer this, I need to first define what “matters” means. The reality is, this newsletter will only ever reach a small fraction of people who are interested in a very niche topic - so traditional content metrics like views or subscribers make no sense. The only true feedback for my work comes from this question: “Would I read this post if it was written by someone else?” With today’s AI tools, that question really becomes: “Is it easier to generate this post on an AI chatbot than to write it?”

Things we tell ourselves to feel better

One of the most common arguments in defense of humans is that “AI makes mistakes.” I don’t think this is a strong argument to hide behind. Sure, chatbots hallucinate and make mistakes, but so do humans - even the most decorated content creators. Like a lot of technologies we have seen in the past, I’m sure the errors will reduce over time, and AI answers could become a reliable source of information. In my opinion, content creators who say their content is better because it’s more accurate are choosing to challenge an ever-improving machine backed by trillions of research dollars. Personally, I think that’s a losing battle.

There is another defense of traditional content which never sat well with me. I’ve seen a lot of arguments about how consuming difficult technical content is “inherently noble”, and AI provides processed answers which are “cheats” and will shrink your brain. In my opinion, difficulty, by itself, should never be a virtue to strive for. Before humans discovered fire, digestion was hard, consuming a lot of energy. Cooking made food easier to process, freeing energy for other activities like thinking, building, and evolving. I’m quite sure AI-generated educational content will have a very similar positive effect.

So before dismissing AI, we should acknowledge the aspects that AI excels at.

Is AI coming for my posts?

Today’s AI is extraordinary at compressing information. It can digest long, messy documents and explain them in simpler terms. Interestingly, this is what most traditional content creation has centered around. (Including some of my posts, I’ll admit.) I think we are getting very close to the point where the returns from such content won’t justify the investment - it will be so much easier to generate such content on demand.

AI is also a fantastic tool to explore content related to a specific topic you are interested in. This has been my favorite use of AI: I can easily get a list of 5-10 sources I want to explore for my research on a topic, bypassing the SEO-engineered, often irrelevant links. In my experience, a few in-depth, engaging sources, along with the AI summary on the topic, are sufficient for my research - I don’t want to jump between multiple links where parts of the information are spread out.

Essentially, if the purpose of a piece of writing is simply to transfer information from one form to another, or one place to another, AI will inevitably outperform humans on speed, breadth, and efficiency.

So where does that leave me?

The point of this exercise was not to say that AI will replace content creation entirely. The concept of generative AI fundamentally cannot displace certain types of content - which is specifically what I want to focus on.

My north star for content creation has been the Acquired podcast. For those who don’t know, the idea behind Acquired is to tell the story of a company, with all of its gory details, in episodes that sometimes last 4 hours. Today’s AI struggles with such analysis, and there are fundamental reasons why it might be this way for a long time. An LLM lacks a “mental model” - an understanding of concepts like time, hierarchies and abstractions. Models also gravitate towards a median of possibilities. While this makes AI excellent at flattening complexity, maintaining engagement sometimes requires expanding complexity into a narrative form. The failure to do this is what makes AI sound “robotic.” While you can get away with a robotic tone if you have to “explain the differences between a CPU and a GPU,” it’s important to keep the audience engaged if you want to “share a deep dive on the evolution of GPU architectures”.

Another problem with AI content is that it is user driven - the value of the answer depends on the question you are asking. AI might have all the answers, but AI isn’t curious. Good content, therefore, should prompt the audience to ask more questions - questions about bottlenecks, architectural shifts, industry narratives, or places where assumptions are beginning to bend. Questions like these generally emerge from judgment, from noticing patterns, and from wondering what others might be missing - something that AI cannot replicate.

This is what I’ve been thinking about during my break: that the real value of technical content isn’t in being a source of information, but in being a source of interpretation. I’m looking forward to returning to writing with these ideas in mind.

Letter to the subscribers - Year 1

Bharath Suresh — Sat, 20 Sep 2025 14:49:56 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

This post is not about chips. It’s not even about AI like that one post was. This is about me.

As I’m writing this, it’s been a year since I wrote my first post on Chip Insights. So, I thought I’d reflect a little and open up about my plans for the future.

Why did I start Chip Insights?

Ideally, this should have been a part of my first post. But I did not have a good answer. Instead, I chose to write about Moore’s law.

Turns out, Moore’s law was actually the perfect way to start, because, like the number of transistors in a chip, ideas compound exponentially. Every new post I have written here has made me an exponentially better thinker. So more than anything, Chip Insights is my training ground; the place where I take half-baked ideas and make them concrete. I learn more from Chip Insights than any of my readers.

Well if that’s my reason, one could ask (and a lot of people have asked) why I post this content publicly. I have found that there is a huge difference in the way I think in private, versus when I need to present my ideas. The vulnerability associated with public content has pushed me to think more deeply than I would have in a private brainstorming session. For instance, I always thought I knew what CPU bitness was, but when I decided to write a post about it, I realized how complex that number truly is.

What do I offer?

I’m aware that in a post addressing my subscribers, the right way to answer the “why” question is to say something altruistic, something like “I want to pass on my learnings to students.” It’s nice when that happens, but I would be lying to myself if I said everything I do is towards that goal. However, I have two ways of thinking that one might consider altruistic.

First, I am convinced that the positioning of computer engineering as a career needs some work. There are a few attributes that computer engineering has acquired over time which do not sit well with me: slow-moving, hard, and risk-averse come to mind. A lot of very talented students are put off by these attributes. The emergence of AI has made semiconductors cool again, and I want to use this opportunity to reframe how the industry is perceived. For example, not a lot of people know how fast-paced the EDA world was - a story I covered here.

There’s another issue. Even if you like this industry and want to pursue your career here, there are too many roadblocks, which eventually kill potential talent. I am not in a position to bring about structural changes to the industry, but one thing I can do is to reduce some anxiety in the process of making it, which is what led to posts like this.

How is Chip Insights different?

I have done some technical writing before Chip Insights - in school, at work, and also authoring research papers. But the advantage of a medium like this is the freedom it provides. I have used this freedom in three ways.

First, I was able to bypass the need to explain concepts from the ground up - as you would do in a typical college program. For example, the right way to understand power consumption in chip design is to first understand exactly how a transistor works. But I have always found this to be a bit restrictive - in the case of chip power, only engineers with visibility to the transistor level contribute ideas, and they are often quite similar. If a software engineer wants to understand what makes a chip power efficient, they would be scared off by all the complex transistor-level terminology and would make no progress. My post on power optimization was an attempt to see how much I could abstract out without being inaccurate - turns out, quite a lot.

Another theme in most of my technical posts is the idea of analogies - comparing chip design concepts to something relatable - to make them stick in your mind a little longer. I know some people, especially the “academic” types, find analogies to be juvenile. But I have actually found that thinking in terms of analogies, and stretching them out as much as possible, really strengthens your understanding of something. For example, I tried that with on-chip networks by comparing them to airline routes in this post.

Finally, there is world building. I think this is the hardest to crack, because building interesting worlds isn’t easy - if it were, The Walt Disney Company wouldn’t be worth $200 billion. But if you can build a world that is compelling enough, then even reading about a pedestrian topic like Static Timing Analysis can be made fun, which is what I tried here.

How has my writing evolved?

I never realized this, but a lot of my early posts were based on topics which had very little scope for subjectivity - for example, this post about pipelining, which walks through what a traditional CPU pipeline has always looked like.

I think a part of me was unsure about how new ideas would be perceived - especially from someone who lacks authority. But as I went forward on this journey, I think my writing has become more assertive. I wrote a post on how AI will influence EDA, which was away from my comfort zone, because, well, nobody knows the answer yet.

A bolder version of this was my more recent post about full-stack chip design, an idea that irked a lot of readers on Hacker News, but something that I believe is a well thought out analysis.

I also expanded to cover aspects beyond technical knowledge, because I think the kind of person you are strongly influences the kind of engineer you would be, and we engineers are very bad at understanding this. I aim to do more of this, but one place where I see this mattering a lot is technical interviews, which prompted me to write this post.

How to predict the future?

With the great power of assertiveness comes a great responsibility - the responsibility of being right. (More often than not, at least.) So how can someone get better at that? The best way I know to predict the future is to learn from the past - that’s the only formula that seems to work. (When AI takes over, I’m pretty sure they’ll tell us the same thing.) Exploring semiconductor history has been one of the most rewarding experiences of this journey, and posts like this will continue to be a big part of Chip Insights.

The other way to predict the future is to learn about great companies and their strategies. Patterns in business repeat themselves - success is about finding the right pattern. I captured my learnings from one of the most successful companies of our time in this post.

So, what’s the future of Chip Insights?

Posting an article (almost) every week for the last year has been a great experience - I learn a lot each time I do it, and hopefully could pass on some of my learnings in the process. But something that I have enjoyed even more than that has been the feeling that writing unlocks in me - every time I plan to write a new article, I want to reinvent myself and try something different. This feeling is powerful, and I have seen that it spreads across all aspects of my life - not just writing. But I am realizing that the best way to reinvent is not always to do it more often…

In chip design companies, a common practice used to innovate better is called the “tick-tock” model. Here, a “tick” project involves incremental improvements that can be completed quickly. But after a few ticks, companies pursue a “tock” project - a big structural change, usually dubbed the next chip generation. In the last year, Chip Insights has seen a lot of ticks that I’ve highlighted in this post, but it’s time for a tock. There are a few new ideas that I would like to pursue, which need me to take a break from writing.

Whether this is the first post of mine you are reading, or you are that one subscriber who subscribed even before I wrote my first ever post (that really happened, I still don’t know why, but I’ll take it), or maybe you came in somewhere along the way, thank you so much for showing that you care.

I urge you to stay subscribed, because the love for writing hasn’t left me. (And it never will.) I’ll be back soon with Chip Insights 2.0, and it will be worth your wait. Until then, I bid adieu.

Understanding on-chip networks

Bharath Suresh — Mon, 25 Aug 2025 04:30:46 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

Most content in computer architecture courses, and also in my newsletter, focuses on how a computing chip “computes” - by exploring how the processor core, memory, I/O, and other parts of the chip work in isolation. But in reality, a large portion of the time and energy consumed by computing chips is a result of the interactions between the different components of the chip (called “nodes”). For instance, moving data from memory and I/O to and from the processor core, or moving data between processor cores. For a long time, these have been assumed to be trivial (and in reality, they were). But with an increasing number of cores and deeper memory hierarchies, efficiently moving data across a chip can make a huge difference to the overall chip performance. This post intends to introduce the concept to absolute beginners. (If you have been reading my newsletter long enough, you should sense an analogy coming…)

Welcome to your new venture - an airline company

The logistics of airlines has always fascinated me - so that’s the vehicle (pun intended) we are going to use to understand on-chip networks. Imagine this: you read all my posts, became a semiconductor legend, made a lot of money, and have now purchased two Airbus A320 planes, and started a new airline company. Your first task is to decide which routes your airline should operate between, and the logistics behind making this happen. Turns out, designing an on-chip network is surprisingly similar…

Going forward in this post, you can consider the following terms to be analogous:

An airport is similar to a node in the chip (could be a processor core, memory, or I/O component)
An airline route can be treated like a chip bus (bus here is not the transport kind, you can think of it as a collection of wires that can transfer high/low voltage.)

Model 1: Point to point connection

If you have two planes, what’s the simplest thing you can do? You can have two exclusive routes that can operate independently, and in parallel - say one route between Los Angeles (LAX) and San Francisco (SFO), and the other between LAX and New York. (JFK.)

Different nodes in a chip can also be connected in the same way - just have a bus connecting different components with each other, and transfer data through these buses. This is called a “bus-based architecture” and although it seems simple, it is one of the most commonly used networking architectures. (Remember, most airlines continue to operate all-year daily flights between two fixed locations - the logistics can’t be simpler than this.) However, this approach comes with its limitations, like:

1. Poor scalability

If you want to increase the number of nodes, you need to increase the number of buses. This makes placement and routing very challenging in the chip. (Do you really want to buy a new plane for each new destination you want to serve?)

2. Limited Flexibility

Having one bus assigned for each connection makes the design very inflexible: Say that during execution, two nodes in your chip need to transfer more data than two other nodes. Both pairs still have similar buses with the same bandwidth. It’s like operating an empty flight from LAX to JFK, even though the LAX-SFO route is overbooked!

As you might imagine, the flexibility problem isn’t too hard to solve - you own the airline, so you can be flexible with the routes. That’s the idea of arbitration.

Model 2: Arbitration

Let’s say you want to add a new destination for your airline to serve - Miami. (MIA.) However, unlike SFO and JFK, travel to Miami is seasonal, peaking during the holidays. In this case, instead of buying a new plane to serve this route, wouldn’t it make more sense to redirect your flights based on the demand? You can “arbitrate” by:

Operating flights from LAX to SFO and JFK during the week
Changing one, or both of the routes to LAX-MIA during the weekends or major holidays

This idea of arbitration can also be used in chips to connect more nodes without adding new buses - by sharing existing buses. Arbitration can be done in many different ways (called arbitration policies), each with its pros and cons. While I’m not going to cover all of them in detail, here are some examples:

Round robin arbitration: Every node gets a turn to use the bus, and then waits for all other nodes to be done before using the bus again.
Daisy chain arbitration: All nodes can request the bus simultaneously, but there is a predefined priority order that decides who gets it. (Something like this: If both SFO and MIA are in demand, always pick SFO)
Collision-based arbitration: These are more advanced, and rely on real time chip behavior. (For example: If there are no bookings for next week’s flights to JFK, cancel them and open bookings to MIA.)
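To make the first two policies a little more concrete, here is a minimal sketch in Python. It is not how a real arbiter is built (that would be hardware logic), and the names `fixed_priority_grant` and `RoundRobinArbiter` are mine. I am also assuming that the round-robin arbiter skips nodes that are not asking for the bus, which is one common variation.

```python
from typing import List, Optional


def fixed_priority_grant(requests: List[bool]) -> Optional[int]:
    """Daisy-chain style: the lowest-numbered requesting node always wins."""
    for node, wants_bus in enumerate(requests):
        if wants_bus:
            return node
    return None


class RoundRobinArbiter:
    """Grants the bus to requesters in rotating order."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.last_granted = num_nodes - 1  # so node 0 is checked first

    def grant(self, requests: List[bool]) -> Optional[int]:
        # Start searching just after the node that was served last.
        for offset in range(1, self.num_nodes + 1):
            node = (self.last_granted + offset) % self.num_nodes
            if requests[node]:
                self.last_granted = node
                return node
        return None


# Nodes 0 (SFO) and 2 (MIA) both want the bus in every cycle.
requests = [True, False, True]
fixed = [fixed_priority_grant(requests) for _ in range(4)]
rr = RoundRobinArbiter(3)
rotating = [rr.grant(requests) for _ in range(4)]
print(fixed)     # [0, 0, 0, 0]
print(rotating)  # [0, 2, 0, 2]
```

With a fixed priority, MIA never gets a turn as long as SFO keeps asking. The rotating version shares the bus evenly between the two.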

As you can guess, arbitration adds complexity to your chip. (In case of your airline company, you can think of this like hiring someone to manage the routes.) But in some cases, arbitration is a great way to reduce the number of buses in the chip. (Say you have two cores connected to a memory, but your chip mostly runs tasks on one of the cores - then the bus between the two cores and the memory can be shared.) However, as your chip becomes complex, the shortcomings of arbitration become evident:

1. Bandwidth is limited

Remember, although we are able to serve three destinations now by arbitrating between two planes, we can still only serve two destinations at a time. Let’s say each flight can hold 150 passengers, the maximum number of unique passengers you can transport each day is 300 each way - which is the same as it would be if you operated as a point-to-point connection. In chip language, the maximum amount of data that can be moved from one node to another at a given time is called the bandwidth. Although arbitration will help manage multiple nodes, the bandwidth is still limited by the number of buses in a bus-based architecture.

2. Single point of failure

Arbitration has another issue. Say you have decided to fly one of your planes from LAX to SFO and back on a Friday, but from LAX to MIA on Saturday. (And the other plane is going to JFK both days.) However, thunderstorms hit SFO on Friday, grounding all flights. This means the flight won’t be back to LAX in time for the trip to MIA. So this flight becomes a single point of failure for two of your routes. While the weather does not affect electrons in your chip, failures of other types are possible - say you are using a round-robin arbiter, but one of your cores runs into an infinite loop - that core does not release the bus and stops the entire chip from executing. That’s the risk with arbitration-based approaches.

Model 3: The crossbar, a.k.a. hub and spoke model

If you have ever flown in one of the major international airlines like Emirates or Qatar Airways, you would have experienced a layover at one of their “hubs” like Dubai or Doha. These airlines operate based on a very popular airline routing model called the hub and spoke model. (Inspired by its resemblance to your bicycle wheel.) Now that you are a seasoned manager of an airline, let’s take inspiration from this model and add a fourth destination to our airline - Seattle. (SEA.) However this time, the flight schedules are adjusted in such a way that LAX becomes the hub:

Two flights leave each morning from SFO and SEA, respectively, and land in LAX at 11 AM
The same two flights depart from LAX at 3 PM - one to JFK, and the other to MIA

With this model, the airline can sell tickets for the following routes:

SFO-LAX, SEA-LAX, LAX-JFK, LAX-MIA with no layover
SFO-JFK, SEA-JFK, SFO-MIA, SEA-MIA with a layover at LAX

As you can see, LAX is the key to making this happen - it’s where the “crossover” of incoming and outgoing passengers happens. In chip networking, we use a similar arrangement called a crossbar. A crossbar is a collection of programmable switches which can decide which inputs need to be routed to which outputs. (Your airline’s staff in LAX will do this check at the boarding gate.) This allows multiple different nodes in the chip to communicate with each other without the need for direct connections between all the nodes.
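If it helps to see the “programmable switches” idea in code, here is a toy model. The `Crossbar` class and its method names are made up for illustration, and I am assuming the simplest possible rule: each output can be driven by only one input at a time.

```python
from typing import Dict


class Crossbar:
    """Toy crossbar: each output port can be driven by at most one input."""

    def __init__(self, num_inputs: int, num_outputs: int) -> None:
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        self.output_to_input: Dict[int, int] = {}

    def connect(self, inp: int, out: int) -> bool:
        """Close the switch at (inp, out); return False if `out` is taken."""
        if not (0 <= inp < self.num_inputs and 0 <= out < self.num_outputs):
            raise ValueError("port out of range")
        if out in self.output_to_input:
            return False
        self.output_to_input[out] = inp
        return True

    def route(self, data_by_input: Dict[int, str]) -> Dict[int, str]:
        """Forward each input's data to the outputs it is switched to."""
        return {
            out: data_by_input[inp]
            for out, inp in self.output_to_input.items()
            if inp in data_by_input
        }


# Inputs: 0 = SFO, 1 = SEA. Outputs: 0 = JFK, 1 = MIA.
xbar = Crossbar(num_inputs=2, num_outputs=2)
ok_a = xbar.connect(0, 1)   # SFO -> MIA
ok_b = xbar.connect(1, 0)   # SEA -> JFK
clash = xbar.connect(0, 0)  # JFK is already driven by SEA
delivered = xbar.route({0: "from SFO", 1: "from SEA"})
print(ok_a, ok_b, clash)  # True True False
print(delivered)          # {1: 'from SFO', 0: 'from SEA'}
```

Both transfers go through in the same step because they use different outputs. The third request is refused because its output is already taken, and handling that refusal is the job of the control logic inside a real crossbar.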

Crossbars effectively address both the shortcomings that we saw with arbitration. First, they are able to ensure that all nodes can communicate with each other simultaneously. (The crossbar is assumed to have an ability to monitor these transactions and handle them correctly.) In our airplane example with a hub, we are theoretically able to transport 600 unique passengers in a day. (300 in the morning from SEA/SFO to LAX, and 300 later to JFK/MIA.) Also, the presence of a crossbar means that each failure is restricted to its own node - if one of the cores in the chip hits an infinite loop, the other cores can still talk to the other nodes through the crossbar.

At first glance, you might think that a crossbar arrangement is very similar to a point to point connection since both cases ensure all nodes can be connected simultaneously. But there’s a subtle difference here - using a crossbar, we are able to achieve the same bandwidth as point-to-point connection, but with fewer bus connections. If a chip has N nodes, (N > 3.) then:

Point-to-point connection would need N*(N-1)/2 buses (one for every pair of nodes)
Crossbar connection needs N buses
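Here is a quick way to see how fast the two counts drift apart. I am assuming that “point-to-point” means one dedicated bidirectional bus for every pair of nodes, so the count is simply the number of node pairs. The function names are mine.

```python
def point_to_point_buses(n: int) -> int:
    """One dedicated bidirectional bus for every pair of nodes."""
    return n * (n - 1) // 2


def crossbar_buses(n: int) -> int:
    """One bus from each node to the central crossbar."""
    return n


for n in (4, 8, 16, 64):
    print(n, point_to_point_buses(n), crossbar_buses(n))
```

At 4 nodes the gap is small (6 buses versus 4), but at 64 nodes it is 2016 versus 64.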

This is the main advantage of a crossbar. However, in order to maintain the full bandwidth, crossbars usually need to store transactions and have complex control mechanisms, which makes the crossbar an additional piece of silicon taking up area and consuming power. Also, crossbars cannot scale indefinitely - each additional node adds some wire latency, and a very large number of nodes could make this intolerable. Remember, if you have 10 nodes in a crossbar, they all contribute to the wire latency - even if you only use a few nodes repeatedly. Transistor scaling also cannot be relied on - reduction in wire latency has not kept up with Moore’s law, which makes the future of crossbars very uncertain. Basically, as your airline business scales to multiple destinations, you need to hire a lot of staff, and reserve many boarding gates - even on days with low occupancy on your flights. Not ideal…

All of the approaches discussed so far were sufficient for the pre-SoC era: with a small core count, and few memory and I/O nodes that needed networking. But as we entered the era of parallel compute with SoCs having more than a thousand cores, interconnects needed to get more sophisticated. A “Southwest Airlines” moment was needed…

Model 4: The “Network”

In 1971, Southwest airlines launched their first flights in Texas - changing the aviation business model forever. Let’s learn from their approach and make our airlines more efficient.

Despite everything we discussed so far, the truth is, point-to-point non-stop flights are the most desirable options. So let’s go back to our first model, and see if we can take it in a different “direction”. (Pun intended.) I mentioned that one of the issues with point to point connections is the flexibility - how can you best manage different bandwidth requirements on each route? We looked at arbitration in model 2, but there is a more efficient approach - operate smaller flights. Smaller flights are cheaper to fly, and easier to sell out, hence giving you more bang for your buck on each route without the need to arbitrate.

The other limitation was the “buy a new flight for each new route” requirement that point-to-point connections require. Smaller flights will also partially address this issue, because you can afford to buy more flights. (and hence operate more routes simultaneously.) But as you expand your destinations, this might still become impractical - so we will remove this requirement from the model, making it pseudo-point-to-point. (i.e. not all connections can be point-to-point, as you will see.)

Let’s incorporate all this flexibility and build a network for our 5 destinations that looks like a ring. Much like the crossbar model, this topology connects all our destinations. (using as few as one flight!)

Point-to-point connections: SEA-SFO, SFO-LAX, LAX-MIA, MIA-JFK, JFK-SEA
Pseudo point-to-point connections (with 1 stop): SEA-LAX, SFO-MIA, LAX-JFK, MIA-SEA, JFK-SFO
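The two lists above can be reproduced with a few lines of Python. This is just a sketch that counts flights around the ring in whichever direction is shorter. The airport order comes from the list of point-to-point connections, and the helper names are mine.

```python
from itertools import combinations

RING = ["SEA", "SFO", "LAX", "MIA", "JFK"]


def ring_hops(src: str, dst: str, ring=RING) -> int:
    """Fewest flights between two airports when flying either way round."""
    n = len(ring)
    clockwise = (ring.index(dst) - ring.index(src)) % n
    return min(clockwise, n - clockwise)


def stops(src: str, dst: str, ring=RING) -> int:
    """Layovers = flights - 1."""
    return ring_hops(src, dst, ring) - 1


for a, b in combinations(RING, 2):
    print(f"{a}-{b}: {stops(a, b)} stop(s)")

worst = max(stops(a, b) for a, b in combinations(RING, 2))
print("worst case:", worst)  # worst case: 1
```

Passing a longer list as `ring` shows how the worst case grows as you keep adding destinations to the same ring.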

Adding a destination to this model is trivial. Say you want to add a new destination, Dallas (DFW), in the same topology, you can add the node like this:

Unlike the crossbar architecture, which adds a strain (more staff, larger number of boarding gates, and so on.) on your “hub” for each new destination, this ring-style approach does not stress any single node. (Remember, in chip terms, this “stress” means wire latency.) However, you might already see a problem here - we have introduced connections with 2 stops which would never exist in a crossbar model. (For example, SFO-MIA.) If such 2-stop routes are undesirable, then a simple connection above can fix that - add a direct route between SFO and MIA.

With just one new route, we are able to maintain access to all our destinations with one stop or fewer. Also, you can pick between one large flight, or multiple small flights - the network can be adjusted accordingly. This is the biggest advantage of the network model - it allows the flexibility to modify the topology based on your resources and latency requirements. Such networks exist in all modern chips, and they are called, simply, a Network on Chip, or NoC. We will look at them more formally in the next section.

An overview of Network on Chip (NoC) architectures

In the previous section, I (hopefully) convinced you that an NoC architecture allows you to move data efficiently and provide maximum flexibility on chips with a large number of nodes. With great flexibility, comes great challenge - and in this section, I want to walk you through three architecture decisions that need to be made, and the trade-offs involved.

1. What should my network look like?

As you saw from the airline analogy, changing which nodes are connected to each other makes a huge difference to the path that needs to be taken to go from one node to the other. In NoC language, each such connection is called a “Link.” By changing the number of links, and the nodes they connect, we get different “Topologies.” We use two key metrics to understand how good a topology is:

Path diversity: How many unique paths can you find between two nodes in your network? (The closest formal metric is “Bisection Bandwidth,” which I think raises more questions than it answers, so I won’t use that term.)
Path distance: The maximum distance between any two nodes in the network, counted in links along the shortest route. (Again, the formal term is “Diameter”; I’m not going to use that.)

As you might expect, adding additional links would increase path diversity, and reduce the path distance. But each link adds hardware complexity. (More wires, and as we will see later, complex routers.) Here’s an example with three simple topologies that highlights this trade-off.

There are more nuances in designing the best topology, but for now, all you need to take away from this section is: Pick a topology that provides sufficient path diversity with the smallest number of links, while also ensuring that your path distance isn’t too much.
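If you want to play with this trade-off yourself, here is a small sketch that builds three topologies of my own choosing (a ring, a 4x4 grid-like “mesh,” and a fully connected network, all with 16 nodes) and reports the link count and path distance for each. It does not measure path diversity, and the helper names are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]


def make_graph(n: int, links: List[tuple]) -> Graph:
    graph: Graph = {node: set() for node in range(n)}
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def ring(n: int) -> Graph:
    return make_graph(n, [(i, (i + 1) % n) for i in range(n)])


def fully_connected(n: int) -> Graph:
    return make_graph(n, [(i, j) for i in range(n) for j in range(i + 1, n)])


def mesh(rows: int, cols: int) -> Graph:
    links = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                links.append((node, node + 1))     # neighbour to the right
            if r + 1 < rows:
                links.append((node, node + cols))  # neighbour below
    return make_graph(rows * cols, links)


def link_count(graph: Graph) -> int:
    return sum(len(neigh) for neigh in graph.values()) // 2


def hops_from(graph: Graph, start: int) -> Dict[int, int]:
    """Breadth-first search: fewest hops from `start` to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def path_distance(graph: Graph) -> int:
    """Largest shortest-path hop count between any two nodes (diameter)."""
    return max(max(hops_from(graph, s).values()) for s in graph)


for name, g in [("ring", ring(16)), ("4x4 mesh", mesh(4, 4)),
                ("fully connected", fully_connected(16))]:
    print(f"{name}: links={link_count(g)}, path distance={path_distance(g)}")
```

For 16 nodes, the ring needs 16 links but has a path distance of 8, the mesh needs 24 links for a path distance of 6, and the fully connected network gets the path distance down to 1 at the cost of 120 links.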

2. How do I split up my data?

It might seem like ages ago, but if you remember from the start of this post, the purpose of on-chip networks is to move data from one point in a chip to another. Due to the complexity of NoC architectures, we have a dedicated unit in each node to manage the sending and receiving of this data - called a Router. So far in this post, we assumed that all the data that we need to send out is sent together. (As a single transaction.) But in reality, it’s not practical to do this - for instance, if you want to move 1 GB worth of data as a single transaction, you would need 8,589,934,592 wires (i.e. one for each bit.) in each link. Instead, it is broken down into multiple pieces. Here’s what they are called:

Message: This is the full data that you want to send from one node to another
Packet: Each message is broken down into multiple packets. All bits in a packet must follow the same path (or route) to reach the destination node.
Flit (Flow Control Digit): Each packet has multiple flits. A router always stores all bits in a flit together.
Phit (Physical Digit): Although a router stores a Flit together, it does not have to be sent out as a single transaction. Each flit can be broken down into multiple Phits, and each Phit must be sent out in a single cycle.
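The hierarchy above is easier to picture with a tiny example. This sketch treats a message as a string of bits and chops it up level by level. The sizes are made-up numbers chosen to keep the output small, and I am ignoring the header information that real packets carry.

```python
from typing import List


def chunk(bits: str, size: int) -> List[str]:
    """Split a bit string into consecutive pieces of at most `size` bits."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]


def split_message(message: str, packet_bits: int, flit_bits: int,
                  phit_bits: int) -> List[List[List[str]]]:
    """Return message -> packets -> flits -> phits as nested lists."""
    return [
        [chunk(flit, phit_bits) for flit in chunk(packet, flit_bits)]
        for packet in chunk(message, packet_bits)
    ]


# Hypothetical sizes: 64-bit message, 32-bit packets, 16-bit flits, 4-bit phits.
message = "10" * 32
packets = split_message(message, packet_bits=32, flit_bits=16, phit_bits=4)

num_packets = len(packets)
flits_per_packet = len(packets[0])
phits_per_flit = len(packets[0][0])
cycles_on_link = sum(len(flit) for packet in packets for flit in packet)

print(num_packets, flits_per_packet, phits_per_flit)  # 2 2 4
print(cycles_on_link)  # 16 (one phit per cycle)
```

So instead of 64 wires carrying the whole message at once, a link with 4 wires carries it over 16 cycles.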

This figure highlights how a message is broken down and transmitted from a sender to a receiver node.

At this point, it might seem like we took a simple idea of moving data and added a lot of complexity to it. But each of these terms has an impact on the network performance.

The size of the phit decides the number of bits transferred on each link in every cycle. So the phit size decides the maximum bandwidth of our network. A higher phit size is desirable, but comes with challenges to place and route the large number of wires. (Remember our discussion about large vs small flights from earlier?)
Flit size is the hardest to optimize. Since the router must store all bits of a flit together, a very large flit size could increase the latency. (Since all phits of a flit must be sent out before the next flit can be routed.) However, if your flit size is too small, then you may have too many flits to send - which could worsen the network congestion. So this needs careful consideration.
- Think of it this way: To reach a physical destination, would you rather take a train, which is infrequent, but fast, or drive your car, which you can do anytime, but carries with it a risk that you would be stuck in traffic?
And finally, the packet size. A smaller packet size means more packets, which means more potential routes to take from the sender node to the receiver node. However, only a topology with sufficient path diversity can really benefit from this - if not, you are just adding to the congestion by having more packets.
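To put rough numbers on these trade-offs, here is a back-of-the-envelope sketch. The clock speed and all the sizes are hypothetical, and the formulas assume the simplest case: one phit crosses a link per clock cycle, with no headers and no congestion.

```python
import math


def link_bandwidth_bits_per_s(phit_bits: int, clock_hz: float) -> float:
    """Peak bandwidth of one link: one phit per clock cycle."""
    return phit_bits * clock_hz


def cycles_per_flit(flit_bits: int, phit_bits: int) -> int:
    """Cycles a link is occupied while one flit is sent phit by phit."""
    return math.ceil(flit_bits / phit_bits)


def flits_per_packet(packet_bits: int, flit_bits: int) -> int:
    return math.ceil(packet_bits / flit_bits)


# Hypothetical numbers: 512-bit packets on a 1 GHz link.
PACKET_BITS = 512
CLOCK_HZ = 1e9

for phit_bits, flit_bits in [(32, 128), (64, 128), (64, 512)]:
    bw = link_bandwidth_bits_per_s(phit_bits, CLOCK_HZ) / 1e9
    print(f"phit={phit_bits}b flit={flit_bits}b: "
          f"{bw:.0f} Gbit/s, "
          f"{cycles_per_flit(flit_bits, phit_bits)} cycles/flit, "
          f"{flits_per_packet(PACKET_BITS, flit_bits)} flits/packet")
```

Doubling the phit size doubles the peak bandwidth and halves the cycles each flit occupies the link. Growing the flit instead cuts the number of flits per packet, but each one holds the link for longer.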

I know I’m doing injustice to this topic with my “hand-wavy” analysis, but the point I really wanted to convey is that the way you split your message has a big role in the eventual performance of your network.

3. What path should I take to reach my destination?

Now that you know what your network looks like, and how your data is split, life’s great - you just need to send the data, right? Well, not quite. You also need to decide which route your packet will take to reach your destination. I’m not going to go into the details here, as this is an introductory post, but I’ll leave you with a few questions to think about:

Do I need to know the full route to reach the receiver node even before my packet leaves the sender node? Or is it sufficient for me to just decide what’s the next best node?
The agreement was that each packet must follow the same path. But does that mean I should wait for all flits in my packet to arrive at each intermediate node? Or can I send out each flit independently on the same path?
What happens if two packets want to go to the same node? Which one should I prioritize?
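As a small taste of the first question, here is a sketch of the “just decide the next best node” option on a ring like the one from our airline example. Each node only looks at where it is and where the packet is going, and picks the neighbour on the shorter way round. This is one simple scheme chosen for illustration, not a survey of routing algorithms, and the function names are mine.

```python
from typing import List


def next_hop(current: int, dest: int, num_nodes: int) -> int:
    """Pick the neighbour on the shorter way round a ring of `num_nodes`."""
    if current == dest:
        return current
    clockwise = (dest - current) % num_nodes
    step = 1 if clockwise <= num_nodes - clockwise else -1
    return (current + step) % num_nodes


def route(src: int, dest: int, num_nodes: int) -> List[int]:
    """Follow next_hop decisions node by node until the packet arrives."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest, num_nodes))
    return path


# Ring order assumed: 0=SEA, 1=SFO, 2=LAX, 3=MIA, 4=JFK.
print(route(0, 2, 5))  # [0, 1, 2]
print(route(0, 3, 5))  # [0, 4, 3]
```

No node ever needs to know the full route. The trade-off is that a purely local decision like this cannot steer around congestion further down the path.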

This topic deserves its own post. (Maybe someday, if your airline business fails, we could explore a job as a traffic controller - a great analogy to understand routing algorithms.)

This is as far as we’ll go in this introduction to on-chip networks. In an era of parallel computing, moving data efficiently is as important as, if not more important than, the computation itself. My biggest learning from doing this research was that to build the best on-chip network architecture, the two extremes of the computing stack must come together:

Algorithms will decide how your data is arranged
The state of semiconductor physics will decide how many wires is too many wires

Irrespective of which side of the stack you are on, I hope this post inspires you to explore this topic further and build the best on-chip networks for the next generation of computers.

The bad old days of debugging

Bharath Suresh — Mon, 18 Aug 2025 03:06:44 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s) and are not investment advice.

I recently read “The Soul of a New Machine” by Tracy Kidder, a book about the design of a new computer called Eagle, undertaken during the 1970s by a small team within the company Data General. I have always liked this genre of books - it’s nice to read about people working together to build something special. I usually leave at the end feeling inspired. But at the end of this book, I felt something else: I felt grateful.

If you have been an engineer long enough, you would relate to this set of events: You find a bug during testing. You spend hours (maybe days) staring at the screen trying to figure out what is causing that bug - a period of intense frustration for you. (and sometimes, the people around you.) You finally find the root cause for the bug, experience a (short-lived) feeling of pride, before the next bug hits you. That’s the debugging emotional rollercoaster in a nutshell, and computer engineers go through it on a regular basis.

Like a lot of others, when I’m stuck debugging the same bug for a very long time, I feel like I have the worst job in the world. But looking back at chip design in the 1970s gave me a sense of how much better we all have it today…

Getting your hands dirty

In the era before HDLs and EDA tools, hardware engineering meant working on real hardware. The Eagle computer was not a monolithic piece of silicon like modern SoCs - it was a collection of integrated circuits, spread across 7 different “wire-wrapped boards” with wires in the back connecting different pins. As the author Tracy Kidder puts it, the boards looked like “thin plates, each with one side covered by a profusion of tiny wires. Small cables, flat like tapeworms, ran among the boards. Oh my, there were a lot of wires.” If those words don’t speak to you, here is a picture of one such board. (Source: Wikimedia Commons)

While in modern debugging, finding the root cause for a bug is the harder problem, a setup like this made fixing the bug equally challenging. When a change was needed, existing wires had to be carefully unwrapped using hollow-tipped tools, and if you were successfully able to do that, you then had to wrap new wires with the same precision. In this process, no other wires should be disturbed - but it does happen often, and results in new bugs. As the hardware lead Ed Rasala puts it, this process feels like one is performing an open-heart surgery.

This need to add or remove connections is not always a one-time thing: sometimes, in order to isolate where the bug originates from, different blocks in the design need to be disconnected, and later connected back. The book talks about an example involving the Instruction Processor and its associated I-cache - the engineers needed to run multiple experiments, each with one of the two blocks disconnected. (needing them to perform the “open-heart surgery” multiple times.) In modern debugging, isolating two such blocks would be as simple as modifying a few lines of RTL code.

Keeping the paper business alive

Today, when engineers say paperwork, we never actually work with paper. But in the world where Eagle was being designed, paper still had its place.

Part of the debugging process involved understanding how different blocks in the computer worked. This information was compiled into a two-hundred page physical document that was written by the architect Steve Wallach. This document was like the Eagle team’s Bible - in fact, it was divided into multiple chapters, and each chapter started with a famous quote. This all sounds great, until engineers had to search through them for a specific piece of information to help with their debug.

But this was not the worst part of paper. Debugging a wire-wrapped board is vastly different from the way we add signals to a waveform today - a device called a logic analyzer was used to probe different pins, and a snapshot of the signal had to be noted down on paper. This was improved by automatically capturing some signals in the system console - but this still needed to be printed before the values could be analyzed. Remember, signals change many times each second - so a large number of values need to be printed. In describing the console prints for one of the debugs, the author writes: “Stretched out, the sheet would run across the room and back again several times. You could fit a fairly detailed description of American history from the Civil War to the present on it.” Analyzing this large data dump with no ability to search must have been a nightmare…

Once a fix for a bug has been found, all the console log papers can be cleared out - but the paperwork does not end there. In order to correctly make this fix in all the prototypes, the exact changes in connections had to be marked on a large diagram with the schematic of the computer. This was called an Engineering Change Order, or ECO. By looking at this ECO diagram, engineers working on different prototypes had to modify the wiring in their prototype each morning before they started debugging other issues. Today, a change is only called an ECO very late in the project, and is done on a software abstraction of wires called a netlist. Such ECOs are rare - not an everyday occurrence like in the Eagle project.

You can’t spell “Hardware” without “Hard”

Once the Eagle project was finalized, new engineers had to be hired to the team. When I read the hiring philosophy in the team, I couldn’t help but think that the challenges of debugging highlighted earlier had a lot to do with it.

During the interviews, Carl Alsing, the microcode lead, always presented the project as a tough one that involved long hours - at times even calling it a “suicide mission.” Those long hours were a direct result of the debugging challenges of the time. The team also screened for “a lack of family life” - they believed that someone with a family could not deliver on the intense commitment of the debugging schedule. Higher grades were valued, but it was not because it signified superior skills or smartness - it was merely seen as another indicator of hard work. In fact, the Eagle team even had a rite of initiation for their new hires, where they made a commitment to "do whatever was necessary for success, including forsaking, if needed, family, hobbies, and friends.”

While good old hard work still has its place in debugging today, the best engineers I have worked with rely more on a methodical approach that makes the most of tools available at their disposal. Today’s designers have the tools to prepare in advance for a majority of the debugs they might face - with faster and more accurate simulations, better logging capabilities, and approaches like formal verification. While Alsing and co. lacked these tools and had to push engineers to work 60-80 hour debugging workweeks, it is not a requirement to be successful at debugging anymore.

So, I get my turn to debug at 4 AM?

Let’s say you had all the smarts and energy it takes to debug any issue that your chip throws. If you walked into the Westborough building to debug Eagle, chances are, you will be asked to come back later. That’s because, unlike today’s world of simulation, directly debugging a real chip means, well, you need a real chip. But there were only a limited number of these prototypes, which meant not everyone could debug the Eagle computer at the same time.

The Eagle team were on a tight timeline. In order to maximize debugging time, they had a debugging schedule that had to be strictly followed, which assigned each engineer specific times throughout the day (and night) for debugging. So, if inspiration strikes you while having dinner, you would still need to wait for your shift to test your theory.

Although shifts were never more than 8 hours, the nature of debugging meant that most engineers stayed longer. Since debugging is a very personal activity, it is often easier to push yourself to work longer hours than to explain your findings to the next engineer. (Which btw, was also “paperwork.”) The book talks about many such stories - Jim Guyer, for instance, spent many nights alone in the lab to debug issues in the Instruction Processor. Ken Holberger noted that it was often dark when he got in, and dark when he left work - leading him to lose track of the day and time when he was home.

Remember, adding more prototypes is not a sustainable solution - if you had 100 different prototypes of the computer, then each ECO would need to be implemented on each of these 100 prototypes. There were diminishing returns to scale in 1970s debugging.

Flakey and Bogeyman: The supervillains of the debugging world

In the story, "flakey" and "bogeyman" were terms used by the engineers working on Eagle to describe specific types of problems or fears encountered during the debugging of the Eagle computer. A flakey refers to a failure that occurs erratically and is often hard to diagnose. Some examples of this include loose connections, stray voltages, or worst of all, a buggy IC from another vendor. The main issue with a flakey is that it is hard to reproduce consistently. The first step in fixing something is getting it to break - and with a flakey, engineers needed to spend additional time reproducing the bug, before finding a way to fix it.

The experienced engineers in the Eagle team had seen a lot of flakeys in their life, and not all of them were caught before their computer was sold. This led to a fear that there would always be one bug that would go unnoticed, but cause their machine to stop working - they called this a bogeyman. The Eagle project lead Tom West defines it as "the infinite page fault you didn’t anticipate. The bogeyman is the space your mind can’t comprehend." The fear of the bogeyman was real - Ed Rasala talks about some nights where he would wake up worrying about a bug they had yet to find, feeling like “the bogeyman was in his bedroom.” Like I mentioned in another post, bugs continue to be seen in chips even today. However, verification has been largely left-shifted now - by starting verification very early, and at different levels of abstraction, the chances of encountering such bugs reduce greatly.

Long-term tiredness

Nobody likes to make mistakes. But the likelihood of not detecting a bug was much higher in the 1970s than it is today. This took its toll on the engineers involved.

Jim Veres, who helped design the Instruction Processor and its I-cache, felt annoyed by the fact that his component was blamed for a lot of bugs. Although he would often prove that the problem was elsewhere, he said that the constant blame placed a lot of pressure on him. He felt like the block he designed was a "part of him now," and he didn't like to see it picked on unfairly.

Jon Blau experienced "difficulty forming sentences," his "mind’ll go blank," and felt "pieces of your life get dribbled away" due to the internal pressure to hurry and finish his code. When he took over debugging the ALU, he was "terribly excited by it, then very frightened," leading him to take a week off. Ed Rasala summed this up nicely, by calling it a “long-term tiredness.” He said that engineers on the debug schedule felt tired, but not in a traditional way that going home to their loved ones could fix. They were always thinking about the current issue they are debugging, while being concerned about the next bug they might hit. That’s what it was like to debug a computer in the 1970s.

While debugging is still a challenge today, and tools can be better, reading “The Soul of a New Machine” gave me a new perspective on what we have today. It’s been about 50 years since this saga unfolded. Who knows, 50 years later, someone might write a sorry tale about the way we debug chips today…

By the way, this post only talks about a small part of this book that resonated with me. For a fuller analysis, I highly recommend this post by the Chip Letter.

The Chip Letter

The Soul of an Old Machine

The psychology of a technical interview

Bharath Suresh — Mon, 04 Aug 2025 06:02:37 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

This post is a little different from what I usually write. I want to preface by saying that I’m not going to talk about the knowledge and skills needed to crack technical interviews - they are very important, but are also specific to the job role you are applying to, and the stage of your career. I compiled some resources for a few such roles in computer engineering in this post, if that’s what you are looking for:

When I had finalized my plans to move to the US for graduate school, I started to proactively apply for internships. Even before I landed here, a recruiter from what was at that time my “dream company” reached out to me, and set up interviews pretty soon. It was perfect - I had the skills and experience for the role, and I had prepared quite well. At least I thought so…

It’s been about 3 years since that interview, but I still remember that day very clearly. I completely froze during the interview. It wasn’t the hardest interview I have given. It wasn’t even the one I was least prepared for. And yet, it was my worst interview. Looking back, it hasn’t made too much of a difference - I’m happy that things panned out the way they did for me. But I’ve always wondered what happened on that day. Was it me? Was it a poorly conducted interview?

Recently, I conducted a few mock interviews for my alma mater. In preparation, I spoke to some experienced interviewers, and also reviewed some accounts of good and bad interviews. Being on the other side gave me more perspective on what makes an interview good - both for the interviewer and interviewee. Surprisingly, it’s less about how knowledgeable each side is - certain behavioral aspects play a key role. This post is centered around these aspects of a technical interview.

Who should read this?

I know that a lot of people reading this would be interviewees, not interviewers. As interviewees, we are conditioned to believe that the fate of an interview lies solely in our hands. I don’t think this is true - it’s just that interviewers aren’t evaluated to the same degree.

A lot of interviewers brag about the fact that most interviewees fail their round. I’ve found this to be quite amusing. Interviewers assume that their interviews are technically advanced and the quality of candidates is low, but that’s hardly ever the truth. If a lot of interviewees struggle to answer your questions, there is something wrong with your interviewing style.

I know this post would reach a lot of interviewees, but I’m writing this in the hope that it also reaches some interviewers.

The six ingredients that make a good interview

1. Perfect your pitch

Very often, interviewees start their interview by introducing themselves. When I was interviewing, I think I handled this question quite poorly - I had a good number of projects on my resume, and I thought the best idea was to make them the stars. I went into the technicalities of my projects very quickly - after all, it was a technical interview. But from my experience on the other side, I realize how pointless this is. As an interviewee, you need to remember that no one cares as much about your projects as you do - so the minor technical details that you find fascinating are not going to impress anyone.

Some of the best candidates I interviewed approached this question very differently - they had a simple but impressive pitch that spoke about themselves. Say you are someone with projects in computer architecture who interned in a government-funded lab. Here’s one example each of a bad and a good pitch:

Bad pitch: Talk about the technical details in each of the projects (basically reading out your resume)
Good pitch: Talk about the importance of computer architecture for the future of your country, and the role you want to play in that

It doesn’t matter if you think your experiences are ordinary - every story is unique, and crafting a good story can completely change the tone of the rest of the interview.

Interviewees usually figure this out over time, but a lot of interviewers forget that they need to pitch too. I can’t tell you the number of times this does not happen. I don’t think it is enough for interviewers to ask an open ended question like “do you have anything you want to ask me.” Interviewers should start the interview by setting expectations - about the interview, the job, and the company. I would go one step further to carefully review the candidate’s resume and mention what aspects would be a great fit for the team.

I have given so many interviews where I have no clue what is expected of me if I join the team, and whether they even care if I join. A short pitch can make a candidate feel special. Taking a few minutes to do this can make the difference between good candidates joining your team, or passing up your offer. Remember that good candidates will always have options, and their decision won’t always be driven by the money offered. Hiring is the most important aspect of a team, and as an interviewer you have an opportunity to attract the best talent.

2. Don’t experiment too much

When I was conducting mock interviews, I experimented with the way I framed questions. (Part of me was already thinking about writing a post like this.) I realized that too much experimentation as an interviewer is not a good thing - it will result in an unfair evaluation. I, and a lot of others I talk to, often discuss how boring the interview format is, and how we would design a very interesting interview when we get the opportunity. But it’s important to understand why some interviews are the way they are: standardization is the best way to ensure fairness across candidates.

If I ask candidate A a standard textbook question, but I reframe the question to mimic a real life scenario when I interview candidate B - candidate B may have the more interesting interview experience, but the vagueness of the format will also make it unfair if they cannot answer the question correctly. So my suggestion is to avoid experimenting too much, unless you are sure you can replicate it across multiple candidates. It is important to remember this as an interviewee too. Answering technical questions is very different from the pitch I mentioned previously - here, uniqueness isn’t rewarded. Sometimes, I have tried to be too cheeky in interviews - using clever analogies, or out of the box terminology to answer basic questions. It has never ended well - in the worst case, I ended up confusing myself and the interviewer. In the best case, I answered one question right. (I don't think creativity gets any brownie points in technical interviews.)

Whether you are an interviewer or an interviewee, follow the advice that Michael Scott gave to Dwight Schrute in “The Office”: Keep It Simple, Stupid.

3. The curse of “the know it all”

This is a problem I noticed while interviewing a lot of smart, knowledgeable, well-prepared candidates. When I asked a question, they made a lot of assumptions about what I might be asking, and went on multiple different tangents. Everything they said was technically correct - the only issue was, it was not what I asked. If I have to evaluate the candidate for the question, I would still have to say that the question wasn’t answered correctly. Therein lies the problem.

Sometimes, interview questions are vague - could be by design, or could be accidental. Even if you know what the answer might be, asking a couple of clarifying questions before jumping in with an answer is a good idea. This could even earn you brownie points - if you actually remind the interviewer of a point they forgot to mention, it is a great sign.

I think the problem of oversharing exists with interviewers too. (Unfortunately, it goes unchecked.) Some interviewers get into the habit of explaining why the interviewee is wrong (and why they are right). I know a lot of interviewers who justify this by saying they want to help the candidate do better in other interviews. I don’t buy this argument; as someone who has been on the receiving end of these explanations multiple times, I can confidently say that none of it registered in that moment. Interviews are too short and stressful to be conducive to learning. All you are doing as an interviewer is wasting valuable time that could be spent evaluating other skills of the candidate. If you really care about the candidate learning something, offer to help the candidate on a call or email after the interviews. (I would gladly take up this opportunity to learn, but no one has made that offer. Some companies have a policy against this, but let’s not get into that.)

4. Sweat the small stuff

Like a lot of interviews today, the mock interviews I conducted were remote. Remote interviews introduce non-uniformity that is often ignored. Not every candidate has the same:

A reliable Internet connection
Clear audio and video
The ability to write and draw something, and share it (preferably on the computer)

I know not everyone has the privilege to invest in the best equipment, but even those who do, sometimes don’t pay enough attention to it. I had a lot of such “tech issues” that completely disrupted the flow of the interview. Even if they aren’t completely disruptive, your interviewer cannot evaluate you if they cannot hear you correctly, or cannot understand what you write or draw. You might think these are minor things that don’t affect the eventual interview outcome, but when I asked experienced interviewers, they told me about some interviews where the lack of audio and video clarity resulted in candidates being passed on.

Those who have watched the 2013 movie “The Internship” can relate to what I am saying - don’t take an interview in a loud public setting like Nick and Billy.

Interviewers are guilty of this too - more often than you think. I remember interviews where after a point, I had no choice but to ask the interviewer to change their headphones because their audio kept breaking. If you are using visuals or text to ask questions, ensure that they are clearly visible. And finally, my biggest pet peeve - please, please, please turn on your video during interviews. There’s nothing worse than answering technical questions from an icon on the screen with scratchy audio.

5. Play the player, not the cards

There’s a saying about the best poker players - their strategy depends not on the cards they hold; it depends on the players they are up against. I think it would really help if interviewers and especially interviewees develop the ability to read the person they are talking to.

In my experience, there are three main types of technical interviewers:

The no-nonsense type: is only focused on your technical knowledge
The curious type: likes to hear about new and interesting projects and ideas
The manager type: is concerned about whether you are a good fit for the team’s needs

If you can make an assessment of the type of interviewer you are talking to, you can shift the tone of the interview to match their expectations - give short but precise answers to the no-nonsense type; talk about interesting projects with the curious type; and show that you have the right skillset to the manager type. This isn’t the easiest skill to master, but will definitely help you leave a better impression.

This skill adds value for an interviewer too - especially if you want to lure a high quality candidate to join your team. Matching the energy of the person you are interviewing will definitely improve the quality of the interview, and provide better insights to make your decision.

6. Don’t be a jerk

I’ve saved the most important point for last. Interviewers and interviewees think of the technical interview as some kind of battle of intellect. In the process, one, or both parties can be jerks.

Interviewers are more notorious for this. I have seen that some interviewers have a tendency to demonstrate their intellectual superiority - in the form of snarky comments, exasperation, and concerning expressions. It could even be something simpler - asking questions like “Are you sure this is right?” when the answer is actually right. (In a vulnerable, emotional state, most interviewees will say “no” to a question like this.) Actions like these in no way contribute to the evaluation of the technical skills of a candidate - they are merely shows of dominance. Since most interviews span multiple rounds, it is possible that a candidate had a bad interview with you, but is still offered the job, and is going to be your future teammate. Do you really want your first interaction to be like this?

If you are an interviewer, always remember that you are the support cast in the interview - it is not your stage to demonstrate anything. Your role is merely to conduct a fair evaluation to help your team hire the right candidate.

In defense of the interviewers, I must say that interviewees can be jerks too. I have heard cases where the interviewer being wrong (which can happen) triggers unpleasant disagreements in the midst of an interview. Even if you are sure you are right, I don’t think having a full blown argument is the right thing to do. Often, candidates are evaluated by the number of questions they answer correctly, so the time you spend arguing is time lost from another question. The better approach would be to table the conversation - you can say that there might be a misunderstanding, and that you would like to visit this question again at the end if time permits. I would not even mind a polite email after the interview with the clarification. (But “polite” is very important here - otherwise it could backfire.)

It’s important to remember that both interviewers and interviewees have the same end goal - to contribute towards the progress of technology. If you stay with this goal long enough, you will cross paths once again. And humans have a tendency to remember negative emotional experiences - a simple technical interview should not be one of those.

So please smile, and have a pleasant technical interview experience.

Confessions of a static timing analysis tool

Bharath Suresh — Mon, 21 Jul 2025 15:11:11 GMT

Hello there. If you are reading this, we have probably already interacted in the past. I was created a long time ago, inspired by a project management tool called Program Evaluation and Review Technique (PERT), that was used to identify the critical path and slack time in projects. I still carry some of these terms with me - from 1966 all the way till today. You may know me today by different names like Synopsys PrimeTime, or Cadence Tempus - does that ring a bell? No? It’s okay, I know I am not as popular as those pretentious synthesis tools that bully me around. But I play an important role in your life, and I’m here to tell you why…

The harsh truth your synthesis tool didn’t tell you

As you may have guessed, what I do is called “Static Timing Analysis”. The cool kids call it STA, so let’s stick to that. You know how you use a modern hardware description language like Verilog to describe your digital design? Well those languages might make your life easy, but tools like me can’t comprehend all the fancy slang you use there. Am I supposed to know what “always @ posedge” means? No thank you. All I understand are the basics:

Inputs and outputs (called “IO”)
Simple logic components, like gates (AND, OR, etc.) and MUXes (called “Cells”)
Storage elements like Flip-Flops and Latches (called “Registers”) and SRAMs (called “Memories”)
Connections between them (called “Nets”)

So before you come asking for my help, you need to convert your fancy HDL design into something I understand called a “Netlist”. (A list of nets, get it?) That’s where the synthesis tools come in - they are just fancy translators who understand your HDL design, and break it down into the simple netlist that I understand. They don’t check whether it is even possible to have the connections you have described - they just want to please you by saying yes to everything you say. No wonder you like them so much.

But let me tell you the harsh truth - not everything you describe is realistic. You might think you are the god of chip design, but in my world, you must bow down to our god - Physics. What you are designing is not software - a chip is a physical entity that works by moving electrons. They’re much faster than anything you’ve ever seen, but they still take time to move from one place to another in the chip. Due to this fact, any logic you add in the chip will also add a delay. Your favorite synthesis tool conveniently left all this out.

Finding out exactly what this delay would be is challenging - remember, at this stage, I am doing this with no knowledge about how the chip is going to look at the very end. But even at this stage, I have something valuable that no one else has - estimated delays of different cells for the technology that the chip would be manufactured in. (In other words, a direct line to Dr. Morris Chang - ever heard of him?) I write down this information in a place called the “Cell Library” - I’m going to need to refer to this often once I get started with my work. But I still haven’t told you what my work is, have I? Before I do that, I will tell you why my job exists in the first place.

Welcome to my synchronized world

Now that you know delays are a thing in the real world, let me introduce you to the idea of synchronization in digital design. Consider this design I once worked on, where there were 5 inputs - A, B, C, D, and E, used to get the final output O. When I saw the netlist, I noticed that this design includes an AND gate, two OR gates, and a MUX, connected like this:

As you can see, I have also added the delays of each cell by checking the cell library. (You’re welcome.) Assuming no input delays, the time taken to observe the correct output O can be obtained by adding all the delays in the path between an input and the output. However, there are multiple different paths between the inputs and the output, each with its own delay:

Path 1: Input A/B → U0 → U1 → U3 → Output O (Delay = 6 ns)
Path 2: Input C → U1 → U3 → Output O (Delay = 4 ns)
Path 3: Input C/D → U2 → U3 → Output O (Delay = 4 ns)
Path 4: Input E → U3 → Output O (Delay = 3 ns)
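
If you want to reproduce these numbers yourself, here is a minimal Python sketch. A word of caution: the per-cell delays (U0 = 2 ns, U1 = U2 = 1 ns, U3 = 3 ns) and the connection table are my assumptions, back-calculated so that the totals match the four paths listed above. The names CELL_DELAY_NS, FANOUT, enumerate_paths and path_delay are made up for illustration.

```python
# Per-cell delays in ns. Assumed values, chosen so that the path totals
# match the four paths listed in the text.
CELL_DELAY_NS = {"U0": 2, "U1": 1, "U2": 1, "U3": 3}

# The netlist as a graph: each node lists the nodes it drives (assumed wiring).
FANOUT = {
    "A": ["U0"], "B": ["U0"], "C": ["U1", "U2"], "D": ["U2"], "E": ["U3"],
    "U0": ["U1"], "U1": ["U3"], "U2": ["U3"], "U3": ["O"],
}


def enumerate_paths(node, target="O"):
    """Yield every path (a list of node names) from node to target."""
    if node == target:
        yield [node]
        return
    for nxt in FANOUT.get(node, []):
        for rest in enumerate_paths(nxt, target):
            yield [node] + rest


def path_delay(path):
    """Sum the delays of the cells on a path (inputs and outputs add none)."""
    return sum(CELL_DELAY_NS.get(name, 0) for name in path)


all_paths = [p for src in "ABCDE" for p in enumerate_paths(src)]
delays = {" -> ".join(p): path_delay(p) for p in all_paths}
for name, delay in delays.items():
    print(f"{name}: {delay} ns")
```

With these assumed numbers, the sketch finds six input-to-output paths, the slowest at 6 ns and the fastest at 3 ns.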

The purpose of any digital design is to find the output when the input changes. Since the delays are different, you don’t exactly know when the output is ready - it could be 3 ns, or 6 ns after the inputs are applied. This makes the design pretty much useless. How do we deal with this problem? In the above example, let’s say that the input can only change once every 10 ns, and the output is also sampled at the same frequency - then, irrespective of the path between input and output, we are guaranteed to get the correct output value, because even the slowest path (6 ns) finishes well within 10 ns. In other words, by restricting when inputs to a design can change, and when outputs from a design are sampled, we can ensure correct execution across many different paths - an idea called synchronization.

In order to maintain synchronization, we need to ensure that the input remains unchanged. But we have no idea where these inputs are coming from, so each input may have its own delays. This means we need to have a way to store the values of the inputs periodically, and use these stored values to calculate the output. This is done using a storage element like a latch or flip-flop - the general term for this is a Register. Let’s add registers R0 and R1 at the input and output of our design, respectively.

Each register also needs to know when to store the next input. (Every 10 ns in our example.) To achieve this, we introduce a new input, and an STA tool’s best friend, the clock signal. (Also known by the nickname clk.) The clock signal periodically changes from 0 to 1, and then back to 0 - each such round trip is called a clock cycle. The time taken for the signal to complete one cycle is called the Clock Period. In a typical register, the input to the register gets stored in the register when the clock signal goes from 0 to 1, called the positive clock edge. This allows us to store the inputs in the register R0 during a positive clock edge, then complete the evaluation of the logic, and store the final output in register R1 during the next positive clock edge.
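
To make the idea of sampling concrete, here is a toy Python model of a positive-edge-triggered register. It is heavily idealized: positive edges are assumed at 0, T, 2T and so on, the register has no delays of its own, and the function name and the input waveform are hypothetical.

```python
def sample_on_posedge(changes, clock_period_ns, num_edges, initial=0):
    """Return (edge_time_ns, stored_value) for each positive clock edge.

    changes: list of (time_ns, value) pairs, the moments the input D changes.
    Idealized model: edges occur at 0, T, 2T, ... and the register captures
    whatever value D had settled to strictly before the edge.
    """
    stored = []
    for edge in range(num_edges):
        edge_time = edge * clock_period_ns
        d = initial
        for time_ns, value in sorted(changes):
            if time_ns < edge_time:
                d = value  # the latest change before this edge wins
        stored.append((edge_time, d))
    return stored


# Hypothetical input: a short pulse between 3 ns and 7 ns, then a rise at 12 ns.
waveform = [(3, 1), (7, 0), (12, 1)]
samples = sample_on_posedge(waveform, clock_period_ns=10, num_edges=3)
print(samples)
```

The pulse between 3 ns and 7 ns never shows up at the register output, because no positive clock edge falls inside it. Only the value present at 10 ns and at 20 ns gets stored.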

Here’s a timing diagram to show how in the above example, input E and the intermediate output O’ are only sampled at the positive clock edge, ensuring that the correct inputs and outputs are seen after every 10 ns. (Inputs A, B, C and D are not changing in this case.)

As you can see, E’ (output of register R0) and O (output of register R1) do not change between clock cycles. This is important, because each of these signals can be used as part of other logic with the assumption that the data won’t change during the clock cycle, which lets you make your designs larger without increasing the delays indefinitely. But this synchronization comes with strings attached, and my job is to ensure you abide by certain rules.

We have needs too, feat. Registers

When I spoke about the role of registers in maintaining synchronization earlier, I might have given you the impression that registers are a gift from the heavens to save all digital designs. Maybe you think registers should be the ones doing the talking here. Let me tell you something, these registers aren’t as generous as you think. While registers help solve the synchronization problem, they introduce new problems in the process.

Remember how I said a register stores data when it sees the positive clock edge - I left out some of the nuances here. There are some registers out there that decide to store data at the negative clock edge instead, but everything that I talk about applies to those registers too - so let’s generalize by calling it a clock event. For the input data to be stored correctly at a clock event, the register enforces some strict rules. (Thanks again to the physics gods.) In my world, we use specific terms to express these “needs.”

The first term is Setup Time, which is the minimum time for which the input data to a register must be stable before a clock event. I’m not going to tell you the complete origin story of setup time here, but just know this - if you change the input to a register too close to a clock event, the register cannot guarantee that it captured the correct value. Think of it like this - you are supposed to turn in your assignment by 10 AM, but you decide to make some last minute updates at 9:59 AM - a register is like that strict professor who says: “Sorry, I cannot promise that your updates would be graded.” Ah, if only registers were a good sport…

Their needs don’t end here - in addition to setup time, registers also have something called Hold Time - the minimum time for which the input data should be held stable after the clock event. This might seem excessive to you - this sounds like a professor who not only wants the assignment submitted by 10 AM, but also wants you to show up to their office, wait for 30 minutes while they review your assignment, and leave only after that. Seems unfair? I get it, but registers rule my world, and they are pretty high maintenance. You have no choice but to abide by their rules.

If you think our nightmare ends there, I’m sorry to burst your bubble. Remember that these registers, like everything else in a chip, work by moving electrons around. That means just like the cells we discussed earlier, they have propagation delays too. Typically, there are two different delays, since the path taken by the clock signal and input data are different. Here’s what we call them:

Clk → Q Delay: The delay between a clock event and the resulting change at the output
D → Q Delay: The delay between a change at the data input and the resulting change at the output

Lot of information, huh? Looks like somebody needs help tracking all this. That’s where I come in. My job is to ensure that all the delays are accounted for, and the registers can guarantee that your design works as expected. The first thing I do is call up my dear friend Dr. Morris Chang (or one of his friends that is going to manufacture this chip) and get the setup time, hold time, and propagation delays of the registers. Remember, I already know the propagation delays of the other cells. With all this information, I’m like a fortune teller that can predict the future of your chip - except that unlike your fortune teller, I’m always right.

Show me the slack!

For me to check if your chip will work correctly, I need to ensure that your design meets all of the registers’ needs. Let’s start by looking at setup time. Use the diagram below to follow along. To start with, I select a register, say R0 from the design, and pick one of its inputs - let’s call it In. At a positive clock edge, R0 stores the data from In. (When I do this analysis, I assume the input register, R0 in this case, does not violate Setup Time. This analysis is done for all registers, so trust me, it will all work out in the end.) Due to the propagation delay in R0, the stored data In’ is only seen after some delay. As I mentioned, a register can have two kinds of delays - but when I’m checking if setup time is met, I pick the larger of the two - because I want your chip to work under all circumstances. Once we get past R0, we see a familiar foe - the combinational delay which I spoke about in an earlier section. Adding both these delays gives us the time it takes to get the correct data for Out’ - the input to register R1. The register R1 expects to store this data at the next positive clock edge - its schedule is decided by the clock period. But as I mentioned earlier, the register does not like last minute changes to its input - so our real deadline is actually a little earlier than the next positive edge, to ensure that the data has been stable for at least the setup time of R1 when that edge arrives. This requirement can be expressed using fancy math operators, and is called the Setup Time Constraint. My job is to check whether this constraint is met:

If YES, that means this path between In and Out “meets setup time”. The extra time you have before the setup time deadline is called Positive Slack.
If NO, the path “violates setup time.” The amount by which your delays exceed the deadline is called Negative Slack.
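
For those who like the fancy math operators, one common way to write this check is sketched below. The symbols are my own shorthand and not taken from any particular tool: T_clk is the clock period, t_prop,max is the larger propagation delay of R0, t_comb,max is the longest combinational delay between R0 and R1, and t_setup is the setup time of R1. It also assumes the clock reaches both registers at the same instant.

```latex
t_{prop,max} + t_{comb,max} \le T_{clk} - t_{setup}
\qquad
\text{Slack}_{setup} = \left(T_{clk} - t_{setup}\right) - \left(t_{prop,max} + t_{comb,max}\right)
```

A positive value of this slack corresponds to the YES case, and a negative value to the NO case.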

Not so fast…

As I said earlier, registers also need data to be held for some time after the clock event (called hold time) before it is safe for the input value to change. In the same scenario I described earlier for setup time, when the data from Out’ is being stored in register R1, a new value of In is being stored in R0, and then passing through the combinational logic to replace the current value at Out’. If this happens too fast, then the value of Out’ changes before the hold time of register R1 has passed - which would result in a Hold Time Violation. This gives me another constraint to check for - called the Hold Time Constraint. (You can think of it like a speeding ticket.) Similar to the setup time check, my goal is to ensure that the chip works under all circumstances - so when I am looking for a hold time violation, I take the smaller of the two propagation delays of register R0. Remember, smaller delays are bad when it comes to hold time.
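
Written with the same kind of shorthand (again my own notation, and again assuming the clock reaches both registers at the same instant), the hold check could look like this: t_prop,min is the smaller propagation delay of R0, t_comb,min is the shortest combinational delay between R0 and R1, and t_hold is the hold time of R1.

```latex
t_{prop,min} + t_{comb,min} \ge t_{hold}
\qquad
\text{Slack}_{hold} = \left(t_{prop,min} + t_{comb,min}\right) - t_{hold}
```

Notice that the clock period does not appear here, which is why slowing down the clock cannot fix a hold violation.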

Showing (off) my work

Now that you know what I do, let me show you how I would actually do it, using the example design I described earlier. In this design, although there are just two registers, there are six different paths between inputs and outputs. (Four of them have unique combinational delays.) For each of these paths, I need to ensure that the setup and hold time constraints are met. Dr. Morris Chang was busy, so I made up some numbers for the registers’ setup time, hold time, and propagation delays. With these numbers:

Paths 2 and 3 meet both setup and hold time conditions - they are good to go
Path 1 fails setup time - it takes 8 ns for data to reach O’, but R1 expects it at 7 ns
Path 4 fails hold time - the path from E to O’ is just 4 ns long, giving R1 insufficient time to store the previous input
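
Here is a minimal Python sketch of the same bookkeeping. The register numbers in it are assumptions, back-derived so that they reproduce the results above: a 10 ns clock period, a 3 ns setup time, register propagation delays of 2 ns (larger) and 1 ns (smaller), and a 5 ns hold time. (Any hold time above 4 ns and up to 5 ns gives the same pass/fail picture.) The function and variable names are made up for illustration.

```python
CLOCK_PERIOD_NS = 10

# Register numbers in ns. Assumed values, back-derived from the stated results.
SETUP_NS = 3
HOLD_NS = 5
PROP_MAX_NS = 2   # larger register propagation delay, used for the setup check
PROP_MIN_NS = 1   # smaller register propagation delay, used for the hold check

# Combinational delay of each path in ns, as listed earlier in the post.
COMB_DELAY_NS = {"Path 1": 6, "Path 2": 4, "Path 3": 4, "Path 4": 3}


def slacks(comb_delay_ns):
    """Return (setup_slack, hold_slack) in ns for one R0 -> R1 path."""
    setup_deadline = CLOCK_PERIOD_NS - SETUP_NS
    setup_slack = setup_deadline - (PROP_MAX_NS + comb_delay_ns)
    hold_slack = (PROP_MIN_NS + comb_delay_ns) - HOLD_NS
    return setup_slack, hold_slack


report = {name: slacks(delay) for name, delay in COMB_DELAY_NS.items()}
for name, (setup_slack, hold_slack) in report.items():
    setup = "meets" if setup_slack >= 0 else "VIOLATES"
    hold = "meets" if hold_slack >= 0 else "VIOLATES"
    print(f"{name}: setup {setup} ({setup_slack:+d} ns), hold {hold} ({hold_slack:+d} ns)")
```

Under these assumptions, Path 1 shows a setup slack of -1 ns (8 ns of delay against a 7 ns deadline), Path 4 shows a hold slack of -1 ns, and Paths 2 and 3 pass both checks.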

This analysis (i.e. Static Timing Analysis) clearly shows that this design does not pass timing. It also tells you exactly which paths are causing violations, and by how much. I have to repeat this analysis for every combination of input and output registers, and every possible path between the inputs and outputs. Even if one out of a million such paths has a setup or hold violation, I will find that for you, and prevent you from making a billion dollar mistake. That’s what makes me special…

Still not convinced of my value in your life? That’s okay, I have some more experiences to share from the real world that might just convince you. So share this story, and stay tuned for future installments of this series.

The most important law every chip designer should know about

Bharath Suresh — Mon, 14 Jul 2025 02:01:54 GMT

Story time

In 1989, a chip design team started work on one of the most ambitious PC processors of the time - one with a superscalar architecture, branch prediction, and a much faster floating point unit. Over the next two years, hundreds of engineers were involved in the design, simulation, and extensive verification, which culminated in a test chip. The testing continued - by running major operating systems with the most popular applications of the time - for an estimated 100 million clock cycles. After all the bugs were resolved in 1993, the final chip was shipped to all the biggest computer OEMs of the time - with a new brand name called “Pentium”. The chip was an instant hit - the reviews were positive, and as a result, Intel’s microprocessor sales doubled in 1993. The new floating point unit impressed - with a 10x improvement over the previous generation of Intel chips. It was another feather in Intel’s cap - extending their dominance in the PC processor market. It seemed like they had done everything right, and nothing could go wrong…

The Pentium chip had millions of users over the years, but one of them has a special place in history. Dr. Thomas Nicely and Intel couldn’t be further apart - he was a Mathematics professor at Lynchburg College in Virginia, away from the hustle and loud marketing that Silicon Valley was known for. About 18 months after the release of the Pentium processor, Dr. Nicely noticed something - the results from his research on twin prime numbers were slightly off. Like a typical academic would, instead of ignoring it as a “random error”, he started to dig deeper - which led him to the conclusion that the Pentium processor was making mistakes when running some floating point division operations. He posted his findings online, and this opened up a can of worms that would cost Intel dearly.

Triaging the bug

Remember the long division method from high school? Microprocessors in the 1990s used a similar method to divide floating point numbers. Although it worked, this method was quite slow - making floating point division one of the slowest arithmetic operations in a microprocessor. To make this operation execute faster, Intel implemented a different approach called the “Sweeney-Robertson-Tocher (SRT) method”. I’ll skip the details here, but for the purposes of this post, all you need to know is that in order to implement SRT in hardware, a lookup table was needed to provide the quotient digit based on different input combinations. It was in this lookup table that the now famous Pentium FDIV bug originated.

In order to implement this lookup table, the Pentium team used a Programmable Logic Array, or PLA. A PLA is a type of Read-Only Memory (ROM), but can be used to store structured data more efficiently in a smaller area - by representing the data as a logic function. In the 1990s, memory was still very expensive, which made PLAs an attractive alternative. The word “programmable” in PLA could be misleading in today’s context - programming a PLA means “fusing”, or allowing current to flow through certain transistors to get the desired logic. This was done permanently, before shipping the chip out to the customer. This transistor fuse mapping (i.e. which connections exist) is very important in deciding what the final output would be. For example, the figure below shows a 3 input, 2 row, 2 output PLA, where different programming (Blue and Green dots) results in different output logic.
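
To make the idea of storing data as a logic function concrete, here is a minimal Python sketch of a 3 input, 2 row, 2 output PLA. It assumes the textbook structure, where each row ANDs together a selection of inputs (or their complements) and each output ORs together a selection of rows. The two programmings, the function name pla_eval, and the data layout are all made up for illustration - they are not the ones from the figure, and certainly not Intel’s table.

```python
def pla_eval(inputs, and_plane, or_plane):
    """Evaluate a small PLA for one combination of input bits.

    and_plane: one dict per row, mapping input index -> required bit (0 or 1).
               Inputs missing from the dict are "don't care" for that row.
    or_plane:  one list per output, naming the rows connected to that output.
    """
    # A row is active only if every connected input has the required value.
    rows = [all(inputs[i] == bit for i, bit in row.items()) for row in and_plane]
    # An output is 1 if any of the rows connected to it is active.
    return [int(any(rows[r] for r in connected)) for connected in or_plane]


# Two hypothetical programmings of the same 3 input, 2 row, 2 output array.
blue = {
    "and_plane": [{0: 1, 1: 1}, {2: 1}],   # row0 = A AND B, row1 = C
    "or_plane": [[0], [0, 1]],             # out0 = row0, out1 = row0 OR row1
}
green = {
    "and_plane": [{0: 1, 2: 0}, {1: 1}],   # row0 = A AND (NOT C), row1 = B
    "or_plane": [[0, 1], [1]],             # out0 = row0 OR row1, out1 = row1
}

bits = (1, 0, 0)  # A=1, B=0, C=0
blue_out = pla_eval(bits, **blue)
green_out = pla_eval(bits, **green)
print(blue_out, green_out)
```

The same input bits give different outputs under the two programmings, which is the whole point: the connections are the data. It also hints at why a handful of wrong connections can stay hidden, since they only matter for the input combinations that happen to exercise them.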

To implement the SRT division method, the Pentium team used a 22 input, 120 row, 2 output PLA - which means there are (22 + 2) × 120 = 2880 potential transistors to fuse. (Of these, only 2048 values were used.) Unfortunately for Intel, 5 of these were programmed incorrectly. Whenever these incorrect values were picked up from the lookup table, the result of the floating point division was incorrect. This was the cause of the FDIV bug in Pentium.

How did this bug go unnoticed?

Looking back, this sounds like a trivial issue that should have been caught much earlier in the testing process. But remember, we are talking about the best microprocessor company of the time - which begs the question - how did they miss this bug? There are a few theories floating around:

Intel’s whitepaper claims this was a clerical error - the C script written by an engineer to load the final transistor mapping into the PLA had an error, which resulted in 5 entries not being loaded in the PLA. The values were tested before loading to the PLA, but the values in the PLA were not checked.
Robert Colwell, architect of the Pentium Pro, mentions in his book “The Pentium Chronicles” that this error was caused by a last minute request (actually, order) from management to shrink the size of the PLA, which made the engineers pursue an optimization that was not properly verified.
Some postmortem studies claim that the engineers misunderstood the SRT method, and applied the wrong rule for lowering thresholds - which means this was not an error in any step in the chip design process - the table was mathematically wrong in the first place.
Aliens manipulated the final chip layout in order to scale back human progress (I’m not kidding. Intel made a movie about this, called “Intel: The Journey Inside” with this plot.)

Irrespective of the reason, it’s interesting that the bug went unnoticed for close to 2 years after the chip was released. (Although Intel claims they were aware of the problem in the summer of 1994, months before Dr. Nicely made it public.) This is because the odds of hitting this bug were extremely low:

The division must access the 5 incorrect values out of the 2048 in the PLA, a 0.24% chance
Intel claimed a typical user would encounter this problem once every 27,000 years. Another way of putting this was that an error would take place once for every 9 billion random divisions.
The error typically shows up in the 9th or 10th decimal digit (at worst, the 4th decimal digit). There are very few applications that require such high precision.
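The figures above can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted in this post; the "divisions per day" result is an implied usage rate derived from them, not a figure stated here by Intel, and the variable names are my own.

```python
# Chance that a lookup lands on one of the bad entries, taking the post's counts.
BAD_ENTRIES = 5
TABLE_ENTRIES = 2048
p_bad_entry = BAD_ENTRIES / TABLE_ENTRIES
print(f"Chance of picking a bad entry: {p_bad_entry:.2%}")

# Intel's two claims: one error per 9 billion random divisions,
# and one error per 27,000 years for a typical user.
DIVISIONS_PER_ERROR = 9e9
YEARS_PER_ERROR = 27_000

# Dividing one by the other gives the usage rate those claims assume.
divisions_per_year = DIVISIONS_PER_ERROR / YEARS_PER_ERROR
divisions_per_day = divisions_per_year / 365
print(f"Implied divisions per day: {divisions_per_day:.0f}")
```

Put differently, the "27,000 years" figure only holds for a user who performs on the order of a thousand floating point divisions a day. A workload that divides far more often, like Dr. Nicely's number theory research, shrinks that window considerably.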

Although a lot of people claimed to have been impacted by this, Dr. Nicely was the only person who noticed the bug in regular use. (All other scenarios seem to be artificially designed to hit the bug.) When it comes to bugs, Intel won the lottery. (if winning the lottery means creating a bug that is being talked about 30 years later.)

Consequences

The bug was clearly not great news for Intel. But in a way, they got lucky - the odds of this bug having any real impact were ridiculously low. (If we go by Intel’s numbers, the odds are lower than that of plane crashes.) Yet, Intel did not get away with this.

Intel’s early response was to shrug it off as an unlikely scenario, which led to a lot of flak. News about Dr. Nicely’s findings spread like wildfire, and even CNN ran a segment on the issue. IBM, who were simultaneously Intel’s biggest customer and competitor (with the PowerPC), claimed that the issue was more common than Intel estimated. AMD ran an advertisement that said their chips “Can actually handle the rigors of complex calculations like division.” - a clear dig at Intel’s bug.

Ultimately, Intel had to issue a public apology and offered a free replacement to anyone with a Pentium processor who reached out to them. This was estimated to have cost Intel about $500 million - and a lot of damage to their reputation.

This is the impact that even a seemingly harmless chip design bug can have.

The most important law in chip design…

This brings me to the title of this post, and the reason why I narrated the story of Intel’s famous FDIV bug. This is not just a story about Intel’s mistakes. Every major chip company in existence has had a similar story to this, and I’m sure every chip company in the future will. There is a lesson to learn from stories like this.

It is in the nature of chip design and computing that even the unlikeliest outcomes can occur, and should therefore be accounted for. Over the years, architectures have changed, workloads have changed, but this fact continues to remain true. Hence, the most important law in chip design is actually Murphy’s law: If something can go wrong, it will go wrong.

We talk a lot about transistor nodes and chip performance, but it is important to remember that the first priority will always be: build a chip without bugs. Maintaining such high standards at the scale of trillions of transistors, each a few nanometers wide, and switching more than a billion times per second, is what makes the chip design industry truly unique.

References

Few facts about the Pentium processor
- https://en.wikipedia.org/wiki/Pentium_(original)
- https://www.tomshardware.com/picturestory/710-history-of-intel-cpus.html
Dr. Thomas Nicely notes on his FDIV bug discovery
- https://faculty.lynchburg.edu/~nicely/pentbug/pentbug.html
About the SRT division method
- https://www.osti.gov/biblio/4157138
A great deep dive into the bug
- https://www.righto.com/2024/12/this-die-photo-of-pentium-shows.html
Explanations for the bug
Media coverage after the bug became mainstream
- AMD’s 1994 advertisement
- https://static.righto.com/images/pentium-fdiv/apology.jpg

How to build a new chip architecture, ft. Nvidia

Bharath Suresh — Sun, 13 Jul 2025 18:06:00 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

Over the last year, I have spent many hours following stories about Nvidia and their meteoric rise. This post is an attempt to consolidate all those learnings. My goal is to look back at Nvidia’s history to try and generalize the principles behind making a new computing architecture succeed.

A few caveats before I get into the post:

This is not a post about business strategy. There are many good resources covering that already. I have tried to stay away from generic advice like “to be successful, build the picks and shovels in a gold rush”.
This post is not forward looking - I’m not claiming that Nvidia has won and everyone should invest in their stock. It is simply a case study in computer architecture. In fact, a lot of my learnings come from the early days of Nvidia

With that out of the way, let me share what I think is key to successfully building a new architecture.

Pick the right domain

Architectural changes cannot happen overnight. So it’s very important to pick the right target domain. There are a few ideal characteristics:

It’s barely possible to execute applications with current architectures.
This means at least one of the following factors:
- It takes too long to run
- It needs very expensive compute resources
- It needs advanced programming skills that only few possess
When Nvidia was founded in 1993, they picked computer graphics. For graphics to be realistic, it needs fast compute. In the 1990s, 3D graphics was only possible using either:
- Expensive workstations from companies like SGI
- Advanced software rendering techniques in games like Doom
These were both not accessible to the average consumer and developer. In addition, 1992 had two key developments that could enable graphics cards:
- The PCI bus protocol was adopted as a standard by all computer manufacturers. This meant anyone could make a new chip, attach it to any computer, and it could work
- Microsoft released Windows 3.1, which was a big jump in operating system graphics
All this made computer graphics a great domain for Nvidia to try and disrupt.
There are general patterns in the workloads that can be exploited
I think the word “general” is key. Let me explain with the example of graphics.
Nvidia recognized, over time, that the key pattern in computer graphics is this: the computation on each pixel is independent of the others. No other pattern is as important. This led them towards a massively parallel architecture that is still programmable.
There were architectures inspired by other approaches too. For example, early graphics chips focused heavily on fixed-function accelerators. Here, the idea was to pick a specific part of the graphics pipeline (like Tessellation, Rasterization, Shading), and make your architecture optimal for these steps. In fact, Nvidia also had a fixed function approach for a long time.
There are two problems with going narrow on architecture decisions:
- Algorithms in your domain can change. Your architecture may not optimally support those changes
- Your architecture can never scale to other similar domains.
If you cannot find a general pattern in your domain, it is a bad idea to build an architecture around this domain. Often, such niche domains get commoditized as part of a bigger architecture - this is what Intel did for audio cards, and even 3D graphics to an extent over the years. But Nvidia survived this because their architecture was sufficiently general by that point.
Solving for this domain should provide big opportunities
This is what Nvidia calls a “zero billion dollar” market. It’s hard to specify exactly which domains would fall in this category, but a good approach I suggest is to ask this question: “If you build it, and they come, what can they do?”
When Nvidia first bet the company on computer graphics, their vision was: If graphics could be made accessible to everyone, it would be the greatest storytelling medium. This was around the time when Jurassic Park was released, and the power of computer graphics was becoming evident.
I also found something else that was interesting in my research - Nvidia had AI applications added to their GTC lineup as early as 2010. Although AI was still in its nascent stages, they were thinking about potential applications like computer vision, speech recognition, and robotics - the promise of a world powered by AI was massive, prompting Nvidia to pursue that domain aggressively.
Establishing an architecture needs a huge time investment - so it is very important that your target domain has huge potential.
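To make the "each pixel is independent" pattern from earlier in this section concrete, here is a minimal Python sketch. The `shade` function and the four-pixel image are hypothetical, and a thread pool is only a stand-in for the idea; it says nothing about how a GPU actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

def shade(pixel):
    """Hypothetical per-pixel operation: brighten an (r, g, b) value, clamped to 255."""
    return tuple(min(255, int(channel * 1.5)) for channel in pixel)

image = [(10, 20, 30), (200, 100, 50), (0, 0, 0), (255, 255, 255)]

# Sequential: one pixel after another.
sequential = [shade(p) for p in image]

# Parallel: the same function handed to a pool of workers. No pixel reads
# another pixel's result, so no locks or ordering constraints are needed.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(shade, image))

print(sequential == parallel)
```

Because `shade` depends only on its own input, the work can be split across any number of workers without changing the result. The sketch also stays programmable: swapping in a different `shade` function requires no change to how the work is distributed.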

Your architecture should fit your target workloads - not the other way around

If you want to build a standardized architecture, that means that other developers will be building software for your architecture. This presents a major challenge to architecture companies. Usually, when there is a change in the algorithms or APIs that your architecture supports, architecture companies are presented with two options:

Build a translation layer through your drivers, or compiler, to support the new workloads on the same architecture
Modify your architecture to better suit the new workloads

The first approach is often easier, and can be done on-the-fly (in the same chip generation). So this is the preferred method. However, a lot of chip companies ignore the second approach completely - if it works, why fix it? There are two stories from Nvidia about this which are interesting to study.

In 2003, Nvidia had established itself as the leader in computer graphics, and was preparing to launch its next chip, NV30. Around the same time, Microsoft had just released the new Direct3D 9 API. NV30 did not fully support Direct3D 9 - so they created software workarounds to support the new software features. This turned out to be a disaster - the NV30 received bad reviews and had heating problems. This was a major learning for Nvidia.

Over the next few years, as Nvidia’s programmable shaders started to gain popularity, many researchers started using them for scientific computing. During the early days, the same GeForce cards used for graphics were being repurposed for scientific computing through some cumbersome programming. Although this was better than anything available at the time, it was still not optimal - much of the GPU architecture was still optimized towards graphics applications.

This time though, Nvidia saw the trend early - instead of keeping the same chip architecture for graphics and scientific computing, they started to build chips that were specifically meant for non-graphics applications, by maximizing the number of parallel floating point computations supported. In 2016, they also introduced FP16 support on their GPUs - because most Deep Learning workloads used FP16. These factors went on to play a massive role in their dominance in the datacenter - which consists mostly of AI workloads.

Have a unified and backward compatible architecture.

Two basic rules of life are: 1) Change is inevitable. 2) Everybody resists change.

To build a standardized architecture, making it backward compatible is key. Having a unified architecture naturally creates a flywheel:

All the libraries developed over the years work on your architecture
Developers are incentivized to write more libraries for your architecture

Getting developers into a new architecture is key - that’s what Intel was able to do in PCs, ARM in mobile (PowerPC and DEC Alpha are examples where this did not happen). A unified architecture is very “developer friendly” - a community (ideally, open-source) starts to build around your architecture. This is exactly what Nvidia achieved with CUDA - since 2006, all Nvidia GPUs are CUDA compatible, giving CUDA developers an install base of about 500 million devices, with about 300 libraries and 600 AI models.

CUDA is Nvidia’s moat, and a lot has been said about this already. For the purposes of this post, I want to talk specifically about some of the challenges with maintaining backward compatibility:

Clunky hardware: To support legacy architecture features, often, additional hardware complexity needs to be maintained. This will result in poorer Power, Performance, Area (PPA) metrics over time
Restricting Innovation: Very often, something new and better cannot be implemented in your architecture because it breaks some legacy constraints.
Expensive Development: As the number of architectural features increases, it needs a bigger workforce, and more knowledge transfer - which costs time and money.

I think over time, every architecture company gets hamstrung by their legacy architectural features - a case in point is Intel, and the x86 architecture (I have an earlier post on ISAs with more details). This makes maintaining backward compatibility one of the most challenging aspects of computer architecture.

So far, Nvidia has managed this well, primarily owing to these factors:

Nvidia’s developers operate at a very high level of abstraction. CUDA by itself is a fairly high-level language (like C). Also, most developers build using CUDA libraries like cuBLAS and cuDNN, which are optimized for Nvidia’s architectures. This gives Nvidia more opportunities to improve their architecture while maintaining backward compatibility. In other words, they can maintain backward compatibility at the developer level, but break compatibility at the microarchitectural level - bridged using their driver and compiler. This differentiates them from Intel, who for the most part were dependent on Windows developers doing a good job at using their architecture.
Their architecture is still fairly general and simple - it is centered around parallel programming, and floating point computations. A lot of issues in Intel’s architecture stemmed from complexities added over the years - like variable length instructions and complex arithmetic.

So far, Nvidia has navigated the backward compatibility challenge well - but their workloads are fairly new, so it will be interesting to see how this continues in the long run.

Build infrastructure to move faster

The nature of workloads keeps changing all the time. One of the most impressive aspects of Nvidia is their ability to pivot when an opportunity presents itself. They started as a graphics card company, pivoted to programmable graphics, and then expanded into high performance computing. Nvidia was able to grab opportunities better than anyone else, because they were able to make big architectural changes quickly.

From the early days, Nvidia was a strong believer in simulation and automation. I think it stems from Jensen’s early days at LSI Logic, the company that pioneered many EDA innovations. (I have covered many such stories in my EDA series). During his time at LSI, Jensen worked on a new chip architecture called “sea-of-gates” - which was a very early version of an FPGA emulator. Emulation would go on to play a big role in the development of RIVA128 - widely regarded as the chip that saved Nvidia.

Traditionally, once the chip was designed, it was sent out to the fab, and a test chip was sent back. This test chip was used to run software, find bugs, and resolve them. There were multiple iterations of this, until all bugs were resolved. Then, a final, large order of chips got “taped out”. This whole process usually took 2 years.

Nvidia’s first two chips - NV1 and NV2 - were poorly received. Nvidia needed to make major architectural changes in a very short time - which forced them to adopt emulation. Using emulation, once the design is ready, it is loaded on to an “Emulator”, which is then used to test software on the chip prototype even before it is manufactured. To do this, Nvidia went to a failing company called Ikos, invested heavily in their emulators (each one cost $1 million!), and essentially bet the company on emulation. It was a cumbersome process, but it worked - RIVA128 made one of the biggest leaps in computer graphics, and made Nvidia the leader in the field.

Although the RIVA128 story was born out of desperation, moving fast then became a part of Nvidia’s culture. Nvidia continues to invest heavily in infrastructure to help them innovate faster than the competition. Nvidia also incorporated a feature called “virtualized objects” into their architecture. In simple words, this was a mini-OS baked into the hardware (called “Resource Manager”) that could be used to emulate certain late hardware features that could not make it in time for the chip production. Although it incurred a minor performance cost, Nvidia adopted this because they greatly valued the ability to move quickly.

Generally at Nvidia, this obsession with efficiency is referred to as the “speed of light” approach, which says: every project must be executed at the fastest possible rate, and all obstacles in the process should be removed. This is an underrated aspect of chip design that a lot of companies neglect, making them slower, and less receptive to new opportunities.

Understand bottlenecks outside your core architecture

If building a computing platform is like building a car, the architecture is like the engine. Even if you have built the fastest engine, your car can’t go fast if you have weak tires, or there is bumper-to-bumper traffic on the road. Although your architecture is just one part of the computing stack, all the end user cares about is: how fast is my workload running? So it’s very important to analyze where the bottlenecks are, and work on managing them better.

One fundamental bottleneck every architect must think about is transistor scaling - i.e. how is Moore’s law progressing at the time you are building your architecture. (I covered Moore’s law in more detail in an earlier post). Nvidia learnt this the hard way.

Nvidia’s first graphics card, the NV1, was designed to render 3D surfaces in 2D using quadrilaterals (4 vertices) instead of triangles (3 vertices). Nvidia had a good reason to do this - memory costs were very high, and using quadrilaterals would allow them to build their chip with a smaller memory, allowing them to price their graphics card competitively. However, what they didn’t see was that Moore’s law started to accelerate around the same time, making memory much cheaper. The software ecosystem at the time standardized on triangles, and their competitors were able to support this using larger memory at comparable prices. Despite having one of the best architectures, Nvidia could not remain competitive because they had a small memory.

Many years later, in the datacenter business, Nvidia knew that their architecture alone could not push them to the top. Not a lot of people know this, but Nvidia was in the cloud business very early - in 2013, they released Nvidia Grid, an early version of their cloud gaming platform, known today as GeForce Now. There is another fact that is not well known - Nvidia had an LLM of their own, long before ChatGPT became popular. In 2019, they built an open-source LLM called Megatron.

Both these experiences taught Nvidia about all the components needed to make the best datacenter servers. As a result, Nvidia has expanded well beyond their GPU architecture in datacenters, to offer:

Very high bandwidth memory using CoWoS 2.5D stacking (very important for LLMs)
Faster networking between cards through NVLink, enabled by their acquisition of Mellanox
A very efficient ARM-based CPU for the datacenter (in fact, they tried to acquire ARM)

This gave them control of a bigger portion of the computing stack, and allowed them to optimize it further. Vertical integration as an architecture company also buys you time - even if Nvidia’s GPU architecture is not the best for a generation, the full solution they provide might still be better than anything on offer. Finally, a full solution is much easier for enterprises to deploy.

Although most architecture companies cannot start here, it’s very important to move in that direction as your company evolves. This is how you can move from a great technology to a great product.

References:

The Nvidia Way, a book about Nvidia’s story by Tae Kim

The era of full-stack chip designers

Bharath Suresh — Sun, 06 Jul 2025 22:08:24 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

A few years back, when I was talking to a student that was interested in both the front-end and back-end stages in chip design, I made a cheeky remark that they should become a “Full Stack Chip Designer”. (I can’t remember who I was talking to, but if you are reading this, this post is dedicated to you!) It was a term I took from software engineering - a full stack engineer is someone that has the skillsets to work on both the front-end and back-end aspects of a software application. When I said “full-stack” in the chip design context, I just meant it as a joke - there are very few people that actively work on all the steps in the RTL-to-GDS chip design flow, especially when designing complex, real-world chips. I must admit I have never done full stack software engineering, but from what I gather, the skills are more transferrable between front-end and back-end software design, compared to chip design. But with AI in the picture, something tells me we might be heading towards the era of full-stack chip designers.

What I mean by “Full-Stack Chip Designer”

I briefly introduced the idea earlier, but let me be more specific here. The typical digital chip design process can be divided into two broad categories:

Front-End, which involves designing microarchitecture specifications and describing the logic using Hardware Description Languages (HDLs) like Verilog. (The resulting “code” is called Register Transfer Level, or RTL.)
Back-End, which involves taking the RTL, and converting it into a layout (called GDS) that a foundry can use to manufacture the chip. This is a multi-step process that uses EDA tools from companies like Cadence and Synopsys.

Typically, teams managing these two aspects work in isolation from each other - the front-end team finalizes the RTL, then hands it off to the back-end team to generate the GDS. There are a few “leads” involved in the back-and-forth between the teams, but their role is to pass feedback to each other’s teams. The leads have a good understanding of both front-end and back-end, but typically do not directly work on implementing either. On the contrary, when I refer to a full-stack chip designer, I mean an individual (or a small team) that can take a design idea through from front-end to back-end - fully implementing all the steps in the process.

Why is Full-Stack Chip Design hard?

There are a lot of smart people in chip design - so if I could think of this idea, you can be sure that a lot of people have thought of it. In fact, a lot of early chip designers were truly full-stack - they worked on all the steps themselves. I can even say that if you have ever taped-out a chip for a project at your university, you are technically a full stack chip designer too. So it’s clearly not an impossible problem to solve - it is just very hard to solve at the scale of today’s commercial chips with billions of transistors. There are two main reasons for this:

1. Back-end tools aren’t intuitive; Neither are HDLs

I think this is a well documented problem. (I have also talked about it in one of my earlier posts.) Due to a large design space, EDA tools are complex to use, and produce reports filled with convoluted terminology. As a result, it is not straightforward for a front-end designer to dive into back-end tools. To make matters worse, bugs in the back-end design process are very hard to spot, and if left unattended, could result in a chip that does not function (a chip re-spin is a billion dollar affair, so no one wants that.)

I also think writing good RTL is an art that takes time to master - unlike high level languages, HDLs are less intuitive, which makes it harder for back-end experts to pick up. All this has meant that back-end designers spend all their time training to be experts in the tools they are working with, while front-end engineers prefer to do what they do best.

2. Front-end and Back-end design happens at different granularities

This aspect is more nuanced and is not seen in small chips or university projects. When working on industry-scale chips, front-end and back-end designers have different priorities. Front-end designers start by becoming experts on a small part of the design. In order to make this happen, all other parts of the design need to be treated like black-boxes. However, for back-end design to be effective, it needs to happen at a bigger granularity - if a chip has two blocks, but the back-end is handled for each block separately, the interactions between the blocks (signals routed across, how often they access hard macros, and so on) would not be captured well - which would result in a low-quality layout. Hence, the goals of a back-end designer are quite different - it is to gain a basic understanding of all parts of the design, by sacrificing depth in a single part.

Essentially, when it comes to understanding the chip, a front-end designer should be a specialist, while a back-end designer should be a generalist. Getting an individual (or team) to do both is a challenge.

What do you gain by being a Full-Stack Chip Designer?

Short answer: Time.

Although front-end and back-end work in isolation, they are working towards the same goal - to build a better chip. However, there is a long feedback loop between the two stages: Typically, once the RTL is delivered by the front-end teams, back-end teams find ways to optimize the layout which need changes in the front-end. The front-end team now needs to distil out the feedback, figure out if the request is even feasible, and if so, implement the change. This takes weeks, or sometimes even months. Projects have multiple iterations like this, which either massively increases chip design timelines, or results in a sub-par chip getting shipped.

Bringing both front-end and back-end under the same umbrella can massively shrink this timeline, by:

Making the iteration time faster (“I know exactly what needs to change for a better layout, and whether that’s possible to do in the RTL”)
Reducing the number of iterations (“When I write RTL, I also know how it is going to impact the layout”)

This time saving is crucial - since the semiconductor industry is highly cyclical, when the chip you build is hot, quickly coming up with better versions is key to dominating your industry.

How can AI help to make this possible?

I spoke about the potential of AI in chip design EDA in an earlier post - I think reading this would add value to this discussion:

In addition to the points mentioned there, I want to specifically talk about the two ways I feel AI can make full-stack chip design a reality:

1. Using AI for Knowledge Transfer

If you look at both the reasons why full-stack is hard in chip design, they boil down to the same thing - knowledge transfer is hard. I think AI, even in its current form, can solve this problem to a good extent.

For instance, it is possible to develop and maintain a database with all the scripts, best practices, and terminologies associated with back-end tools - effectively allowing an expert in front-end design to move ahead with back-end flows. Similarly, if LLM based RTL generation can be solved effectively, that allows a back-end designer to also manage front-end changes.

I also think LLMs are going to play a big role in documentation and teaching - if basic questions about all parts of a design could be answered immediately, then you can have a specialist in one part of the design expand into being a generalist when needed.

2. Automation is coming

Although I haven’t mentioned it so far, having different front-end and back-end teams is also important to maintain realistic working hours - from the conversations I have had, chip designers work more hours than software engineers, especially close to deadlines. So in the current state, even if an individual can manage both front-end and back-end, it’s not practical to have them work on both. This is where I feel AI’s big productivity promise is going to play a role.

I think an agentic RTL-to-GDS flow is coming soon, which would automate a lot of mundane tasks that chip designers must do today. I think this would greatly help with the workload problem, and make the idea of a full stack chip designer practical.

Why economic incentives might make this inevitable?

So far, I have talked about going “full-stack” as an opportunity - if someone wants to do both front-end and back-end, they might be able to do it in the near future and build better chips. This is actually how it works in software engineering today - some developers become full-stack, but others choose to stay with one. But can chip design afford to offer this flexibility?

I can imagine a future where becoming a full-stack chip designer may not just be an option - it could become the norm. To understand why I feel this way, let’s look at the economics of chip design. At a high level, chip design companies spend money on the following:

Payroll for chip designers
Cost to manufacture wafers
EDA tool licenses

With the way things are headed today, manufacturing costs are increasing sharply - each new node is becoming more challenging to implement, and as a result, more expensive. So this is going to start eating into the profit margins of most chip design companies.

The next factor is EDA licenses. There are many AI-first EDA startups coming up today, but they still don’t have access to the Process Design Kit (PDK) from chip manufacturers. (If you don’t know what a PDK is, think of it as a secret recipe that a manufacturer like TSMC provides to EDA vendors like Cadence and Synopsys in order to correctly map chip designs to a layout that can be manufactured.) As long as PDKs remain under tight control, the legacy EDA vendors cannot be replaced. This leaves two possible outcomes:

AI EDA tools add a new agentic layer in chip design
Legacy EDA vendors do all the AI themselves and become more dominant

Either way, I see the cost related to EDA going up. (Not to mention, a lot of chip design companies also need to upgrade compute to handle the new AI workflow.) So to maintain their profit margins, chip design companies might raise their prices - but that may not be sustainable unless they have a very strong moat.

This leads us to the point I’m trying to make - companies need to innovate, and improve productivity with smaller teams. What better way to do this, than with full stack chip designers? (By the way, when I say smaller teams, I don’t necessarily think jobs are going away - on the contrary, I expect shorter design timelines, and a large variety of chips customized for different use cases.)

Some other definitions of full-stack

I took front-end and back-end design as an example here since the analogy fits nicely with the software engineering world. But chip design actually has many more steps, and could see consolidation at other levels too. Here are a few I could think of:

RTL Design and Functional Verification (software engineering has largely consolidated Programming and Verification)
Different aspects of Verification (like Functional, Power, Performance, and so on)
Architecture Simulation and RTL Design (through High-Level-Synthesis)

If you can think of some other aspects in chip design that can be consolidated, I’d love to hear about them in a comment below :)

2025 YC AI Startup School - Round up

Bharath Suresh — Tue, 17 Jun 2025 07:10:43 GMT

If you have been reading this substack, you’ll know that this post is different from the others here:

It’s not specifically about chips
It’s from the present (Not the 1960s like one of my posts)

But AI is at the center of tech today, and plays a big role in the semiconductor industry too. I got an opportunity to attend YC’s AI startup school, and here are some learnings from the speakers.

Here’s the list for you to quickly navigate through:

Opening remarks from Garry Tan, President and CEO at Y Combinator

This is a great time for the technology industry: Intelligence can be accessed using an API
It is also a time for great agency: every industry is changing

Sam Altman, CEO at OpenAI

OpenAI wasn’t built to be big - things that become big usually don’t start that way
Today, there is a product overhang in AI: which means models are progressing faster than applications using them
Vision for GPT-5 and beyond: Multimodal (image, code, video) input and output + much better memory functions
In knowledge work, there is a pattern: Work a few hours, wait for feedback, and repeat. This kind of work is perfect for AI agents
OpenAI’s hardware product vision: A better interface than the smartphone to interact in the real world
- We have only had 2 big computer interface revolutions so far: Mouse, and Touch. The 3rd will come with AI
If the current trend continues, the creation of GPT will be seen by future generations like the invention of the transistor
His hiring philosophy is simple: Hire smart, driven people with a track record of getting things done
- In the next 10 years, small teams with big agency will be the most successful
His two personal interests in the 2010s were AI and Energy. Today, they are interlinked - we are converting energy to intelligence.
Favorite startup advice: Be contrarian, but right (from Peter Thiel)
- It’s very hard to do - When GPT1 came out, Elon Musk said it had 0% chance of success.
- You just got to keep going, it’s going to be tough

Elon Musk, who needs no introduction

He never sets out to build something great - just wants to build something useful
- When the internet was happening, he just wanted to be a part of it. Applied to get a job at Netscape, but couldn’t get it. So he started something on his own.
2008 was his toughest year - SpaceX’s 3rd launch failed, and Tesla was running out of money. At that time, everyone said Elon was an internet guy who shouldn’t try real engineering
It’s important to build truth seeking environments in companies - you cannot fool math and physics
“Don’t aspire to glory, aspire to work”
To build a new AI model, you just need access to three things:
- Compute
- Unique data
- Talented people
Intelligence is very rare - it is possible that we are the only species to possess intelligence. That makes it even more important that we are multi-planetary.
He predicts that in the next 100 years, there will be more humanoid robots than the human population

Satya Nadella, CEO at Microsoft

The ultimate measure of AI shouldn’t be intelligence or AGI - it should be the amount of economic growth that AI can drive
- He doesn’t believe in anthropomorphizing AI (i.e. giving it human traits) - AI is just a tool
Be open minded that the last big algorithm breakthrough in AI is not done - LLMs are not the end, fundamental AI research still matters
AI isn’t going to take away software engineering jobs - instead, the traditional SWE role would change to something like “Forward Deployed Software Engineer” (FDSE), a role pioneered by Palantir
- In the past, typist (someone that uses the typewriter) was a job - now everyone does it. Software engineering will become like that too
The most important factors in AI deployment are going to be: Privacy (for individuals), Security (for organizations), and Sovereignty (for countries)
Microsoft’s breakthrough in Quantum computing is massive - they have finally solved for a stable qubit
- Today, AI is helping make Quantum computing better. In the future, Quantum computing will enable better AI
Access to copilot has been the best intervention ever in the field of education, making it the one domain that Satya is watching out for
One lesson he learnt in his career: Do every job like it’s the greatest job you ever had - don’t wait for your next promotion
His favorite question while hiring: Describe how you managed a project that was going nowhere - a good answer would highlight three key skills:
- Can you solve a problem you are faced with
- Can you bring clarity to an uncertain situation
- Can you enable a team to work together
Advice to anyone building products: Build something that makes you feel empowered

Aravind Srinivas, CEO and Co-Founder of Perplexity AI

Perplexity’s next big bet is the browser - agents are going to be like the different open tabs you have right now
- They have partnerships with a lot of websites to make this work
Triaging and fixing bugs is an important skill, even as a CEO
You can’t strategize your way to success - any smart idea will get copied. You just have to work incredibly hard
Competition is a great thing, because it tells you that something is worth doing
He started Perplexity without a clear idea of what to do - founders are advised against this, but it is important to just start something
Competing with ChatGPT is hard; competing with Google is easy
AI apps have not figured out how to have network effects yet
Perplexity’s profit margins will never be as high as Google - no company will have such profit margins anymore
Whenever he feels like failing, he goes to Elon Musk’s video about failure

Fei-Fei Li, Stanford Researcher, the godmother of AI (created ImageNet)

She started with a simple goal: How to make machines see. The lack of data to solve this led to the ImageNet dataset.
AlexNet was a big AI breakthrough, but it was also a big hardware breakthrough - it was the first time two GPUs were put together to run a workload
A lot of her research draws inspiration from the evolution of the human brain - her new venture World Labs aims to solve the problem of Spatial Intelligence
- Language is purely generative - it does not come from nature. So language alone cannot approximate the world, and get us to AGI
She pursued her early research with newer professors in the field - taking such risks matters
Graduate school is a place where you can be purely driven by curiosity. But if you are running a startup, you won’t have that freedom
She was an immigrant and spent her 20s running a laundromat. Her advice to anyone feeling like a minority: Develop an ability not to over-index on it. Gradient-Descent your way to success.

Andrej Karpathy, Former Director of AI, Tesla

This is the third big shift in software - today’s engineer should be fluent in all three
- SW 1.0: Code (to program computers)
- SW 2.0: Weights (to program neural networks)
- SW 3.0: Prompts (to program LLMs)
LLM Analogy 1: It is like a utility (ex. electricity)
- Building the grid is like training, but instead of serving electricity, it serves intelligence
- Your access is metered (cost per token)
- There are few big providers to switch between
- When an LLM goes down, it feels like a power outage
LLM Analogy 2: It is like a fab
- The capex to train is huge
- Each model has its own secret recipe (like TSMC/Intel do)
- Some users go fabless (use general purpose Nvidia GPUs); others manufacture in-house (like Google TPUs)
LLM Analogy 3: It is like an OS
- There are closed and open ecosystems (like Windows vs Linux)
- Different applications can be built on top of them
- Easy to pirate (Once trained, cloning an LLM is like stealing a CD for Windows)
LLM Analogy 4: They mimic human psychology
- Hallucinations
- Jagged intelligence
- Anterograde amnesia
Unlike most breakthroughs in computing, which started with defense or government contracts (HDLs too, as I covered in one of my posts), LLMs started with consumers - which is something very new for the computing industry
AI will support different levels of autonomy
- Augmentation (like code/image generators)
- Partial autonomy apps - like GitHub Copilot/Cursor (coding), Perplexity (search)
- Full autonomy - we are not there yet
All software is going to at least be partially autonomous: So build interfaces for LLMs, not humans
- For example, product documentation should have markdown in addition to plain text/images - LLMs can access them more easily
- Operator is a great way to control a computer - but it is too expensive to use for everything - so in the near future, we need better LLM interfaces
Karpathy first rode a fully autonomous car in 2013. Yet, we still don’t have full self driving. That’s because there is always a big gap between demo and product
- Don’t think of it as the year of Agents, think of it as the decade of Agents

Andrew Ng, one of the greatest AI educators

Execution speed is one of the strongest predictors of a startup’s success
Today, the biggest opportunity in AI is at the application level - not at the model, cloud or chip level
- Specifically, a new agentic orchestration layer is forming - every application would need this
Vague ideas are always seen as right, but are always wrong. With concrete ideas, you get clear feedback about right or wrong.
The ratio of product managers to engineers will change in the near future
- Today, on average, we have 4 engineers per product manager
- In the future, there would be 2 product managers for each engineer
- So as an engineer today, it is important to have better product instinct
Think of building AI products like building a structure with Legos
- The more blocks (i.e. underlying libraries, models, and so on) that you have, the better your outcome would be
AI will push the speed of building products by 10x, so moats will not exist anymore. Brands will be more defensible in the future.

John Jumper, Distinguished Scientist at Google DeepMind, and Nobel Prize winner

Why did AlphaFold win?
- They had the same public data as everyone
- They used a 128 core TPU to run experiments (which is underwhelming compared to today’s LLMs)
- It was all about ideas from the team - that made the difference
Trust is built through word of mouth - put your work out there and get feedback
To publish papers in academia, you need ideas that work and are also beautiful. In industry, you just need a working idea
To build a low cost AI product: think about how you can reduce the cost of failed ideas
Narrow AI systems will win out eventually (this is different from what Jared Kaplan said)
It’s easy to come up with dogma - instead, be ruthless and empirical

Chelsea Finn, Assistant Professor at Stanford, Co-Founder of Physical Intelligence

In traditional robotics, a robot is trained to function in very specific environments. But the goal of her new company is different - Build a robot to do anything
For LLMs, scale is the most important - more data + GPUs usually means better models. But to build robots, we need the right kind of data - with sufficient diversity
In her talk, she walked us through the steps they followed to train a laundry folding robot
- But none of the steps followed were specific to laundry folding - they just involved a gradual increase in difficulty level - this can apply to any task
The foundational model used to train such general robots is called a Vision Language Model (VLM). It works like this:
- The robot processes user input along with vision input from cameras
- The VLM uses this data to generate language commands describing how the robot should respond
- These language commands are used to control the robot
To make their robots more robust and handle open ended prompts - the same VLMs were used to generate synthetic prompts, and these prompts were used to train the VLM
She believes general purpose robotics will be more successful than purpose built robots, since the foundational VLMs will keep getting better
- This was a lot like what Jared Kaplan from Anthropic mentioned in Day 1 about LLMs

Jared Kaplan, Co-Founder and CSO at Anthropic

The length of tasks that AI can do (measured by the time a human would take to do the same task) is doubling every 7 months - this is like Moore’s law.
Scaling laws will continue to grow - if teams find that scaling laws are failing, it means their training methodology is flawed.
To prepare for an AI future:
- Start building technology that doesn’t work now - by the time you are done, AI would have caught up (similar to chip designers using Moore’s law)
- Use AI to integrate AI - this is the only way to keep up
There are two types of tasks
- Tasks that can be done with 70-80% accuracy - AI already excels at these
- Tasks that need 100% accuracy - this will be solved by future AI
The real value of AI will come in knowledge tasks that need us to put information from different sources together - like Biology
When integrating AI into existing businesses, one needs to think carefully about the bigger picture - for example, when the electric motor was invented, it was not used to make the steam engine better - instead, an electric engine was redesigned.

Varun Mohan, CEO and Co-Founder of Windsurf

Personally, I loved this talk - Varun walked in with no slides, and simply had a candid conversation with the audience.

His first company was Exafunction - which built GPU virtualization software
- It was quite successful and was used by a lot of autonomous vehicle companies
- Their USP was to abstract away underlying hardware architectures - but with Nvidia gaining dominance, they felt this application wasn’t valuable. So they pivoted
They came up with Codeium, a GitHub Copilot alternative. Later, this became Windsurf, an agentic IDE.
His advice to founders: Be irrationally optimistic, but uncompromisingly realistic
To stay ahead of the curve, build products where 50% of the ideas don’t work today - by the time you finish, AI would have caught up and everything will work
The reason startups win over big companies is: startups are desperate, if they fail, the company dies
Strategic moats and switching costs are dying - don’t go by traditional VC advice
There are a lot of companies in the world that are still technology starved - this is your opportunity
When asked how he manages the stresses of being a founder, he said: “I don’t manage it. There is no way to escape it. If you fail, just get up and keep going.”

Francois Chollet, CEO and Co-Founder of Ndea (also the creator of Keras)

If we have tasks that humans do that AI cannot - that means we do not have AGI
Scaling laws will not get us to AGI (Contrary to what Sam Altman and Jared Kaplan said, actually)
There are two types of data abstraction
- 1. Value centric - abstraction in the continuous domain
- 2. Program centric - abstraction in the discrete domain
- Today’s AI like transformers work well in the continuous domain. But for AGI, we need AI that can handle both domains

Suhail Doshi, serial entrepreneur (Mixpanel, Mighty Computing, and Playground AI)

AI will have a second-mover advantage - be confident if you want to build consumer AI applications even today
The world will soon enter mass amateurization: what an expert can do today, will be done by AI soon. With agents, these tasks can be done continuously for years
Don’t just focus on AI applications that give immediate feedback - like Chatbots. Consumers are ready to wait for hours if they get value out of AI
To identify opportunities, think about “what-ifs” in the future. For example:
- What if nobody drives a car
- What if everyone has a personal robot
- What if we can never say what’s real
Recommended reading: https://andrewchen.com/the-next-feature-fallacy-the-fallacy-that-the-next-new-feature-will-suddenly-make-people-use-your-product/

Jordan Fisher, CEO at Standard AI

Everyone says focus is important. But if you are running your own company, you need to focus on a lot of different things. It’s not easy.
If you are building something today, build it assuming AGI is coming in 2 years
There will be a big difference between products built as “AI first”, vs existing products that retrofit AI
There are many open questions about the AI-centric world we are entering
- Will software become a commodity? Do you only need product managers?
- What’s the point of downloading an app if you can generate an app on demand?
- How can users ensure an AI agent is right if everything happens under the hood?
Some companies will have an advantage in the AI era
- Companies with data and data creation opportunities (Meta, Reddit)
- Companies with secret recipes (like TSMC, ASML)

I hope you found my notes to be useful, especially if you were not able to make it to the event. Many of these talks were also recorded, and I urge readers to check them out here: https://events.ycombinator.com/ai-sus

As usual, please share this post with someone that it might benefit. And subscribe to stay tuned for upcoming posts.

Power essentials for a chip architect

Bharath Suresh — Sat, 07 Jun 2025 02:37:32 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

Before going ahead, let me define what I mean by some terms from the title of this post:

Chip: I’m talking specifically about computing chips like a CPU or GPU
Chip Architect: Someone who works on architecture or microarchitecture of a chip
Power: Energy consumed by a chip from a battery or an electrical socket per unit time
Essentials: Only what an architect needs to know - everything else will be abstracted out

When a CPU or GPU is designed, there are three key metrics: Performance, Power and Area (commonly combined in the acronym “PPA”). Of the three, I have always found power to be the least intuitive. In most textbooks and college courses, power consumption is explained through the physics of the transistor (i.e. movement of electrons). While it’s good to know these details, I have always felt that not knowing them shouldn’t stop you from thinking about power. (In the same way that a web designer needn’t know all the details of the cloud in which their website is hosted.)

So, that’s the goal of this post: to go over essential concepts needed to design a low power chip if you are someone who knows a good amount about chip architecture, but very little about power.

Let’s get started.

Part 1: Fundamental ideas in power

When does a chip consume power: The rental car analogy

Before looking at the impact of design decisions on power, it’s important to understand the two ways in which power is consumed in a chip. To make this simple, consider this analogy: Chip power is like the money you spend on a rental car each day.

Let me explain further. Imagine that you are on a trip and have rented a car. The amount of money you would spend on this car daily can be divided into two components:

Daily rental charges (depends on the type of car, rental company, and so on)
Fuel expenses per day (depends on your usage of the car)

Chip power also has two similar components:

Static power: Power that is consumed anytime the chip is on
Dynamic power: Power that is consumed to perform some logic in the chip

As you may have figured, static power is like the daily rental charges you pay: As long as the chip is turned on, even if there is no new task to be run, power is still consumed. On the other hand, dynamic power, like the fuel cost, depends on the number, and type of logic operations that a chip executes.

Why do these two types of power matter?

Usually, when you rent a car, all you care about is the total amount of money you spend. So you might be wondering why we want to split up the power consumption into two different types. The reason is that power alone is not a useful metric. There are two metrics derived from static and dynamic power that are actually important:

Energy consumed

As you know, power is energy per unit time. So the energy consumed by a chip would depend on the amount of time for which static or dynamic power is consumed, in addition to the values themselves. In our rental car analogy, you can think of energy consumed as the total money spent on the rental car during your trip - which includes rental charges for each day of the trip, and fuel expenses incurred every time you drive the car. At a very high level, this is what you need to know:

Static power is consumed for the full duration when the chip is connected to a power source
Dynamic power is consumed only when the chip is actively executing a task
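
If you prefer code to analogies, here is a minimal sketch of the two statements above in Python. It assumes that static and dynamic power are constant averages while they are being drawn, which is a simplification. The function name and all the numbers are made up for illustration.

```python
def energy_joules(static_w, dynamic_w, on_time_s, active_time_s):
    """Energy = static power over the whole on-time + dynamic power over the active time.

    Assumes both power values are constant averages (a simplification).
    """
    if active_time_s > on_time_s:
        raise ValueError("a chip cannot be active for longer than it is on")
    return static_w * on_time_s + dynamic_w * active_time_s


# Hypothetical chip: 0.5 W static, 2 W dynamic, on for 1 hour, busy for 15 minutes
total = energy_joules(static_w=0.5, dynamic_w=2.0, on_time_s=3600, active_time_s=900)
print(total)  # 0.5 * 3600 + 2.0 * 900 = 3600.0 J
```

In rental car terms, the first term is the daily rental charge multiplied by the number of days, and the second term is the fuel cost for the days you actually drive.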

Energy consumed by a chip is an important metric, because:

In battery operated devices, lower energy consumption means a longer battery life without needing to recharge
In plugged devices (connected to an external power source), lower energy consumption means a smaller amount on your electricity bill

If the goal is to reduce the amount of energy consumed, one type of power could be more important than the other. There are two scenarios you may experience:

Scenario 1: When static power is more important

To understand this, let’s take the rental car analogy further. Imagine that your trip is designed in a way that for most of the days of the rental, you will not be driving the car. In this scenario, the daily rental charge is going to be more significant compared to your fuel expense. Like your rental car, a lot of chips today are designed for use-cases with high idle time.

A real-life example of this use case is a personal assistant device like Google Nest or Amazon Echo. Although these devices are plugged into a power source all the time, they spend the majority of this time in an idle state. However, static power is still consumed in this idle state, making it the main contributor towards the energy consumption. Ideally, a chip designed for this use case should prioritize reducing its static power.

Scenario 2: When dynamic power is more important

This is the classic case of a road trip - where you spend most days actually driving your rental car, and in the process spending a lot on the fuel. In real circuits, dynamic power tends to be higher than static power - so although static power is still consumed here, reducing dynamic power becomes the priority. (Ideally, you would reduce both, but as you will see later in the post, there are tradeoffs involved in these design decisions.)

Many datacenters today are designed to maximize utilization of all the available compute all the time - at a chip level, a new task continually needs to be executed. In power terms, this means that dynamic power is being consumed (along with static power) during most of the time when the power supply is connected. Designing a chip that reduces dynamic power consumption becomes valuable in this scenario.
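
To see how the same chip can land in either scenario, here is a rough sketch in Python. It computes what fraction of the total energy comes from static power, given the fraction of time the chip is actively working. The power values and the activity fractions are assumptions I picked for illustration, not measurements from any real device.

```python
def static_share(static_w, dynamic_w, active_fraction):
    """Fraction of total energy that comes from static power.

    active_fraction is the share of on-time spent executing tasks (0.0 to 1.0).
    Energy is computed per unit of on-time, so the on-time itself cancels out.
    """
    static_energy = static_w * 1.0                # drawn for the whole on-time
    dynamic_energy = dynamic_w * active_fraction  # drawn only while active
    return static_energy / (static_energy + dynamic_energy)


# Same hypothetical chip in both scenarios: 0.2 W static, 2 W dynamic
smart_speaker = static_share(0.2, 2.0, active_fraction=0.01)  # idle 99% of the time
datacenter = static_share(0.2, 2.0, active_fraction=0.95)     # busy 95% of the time

print(f"smart speaker: {smart_speaker:.0%} of energy is static")
print(f"datacenter:    {datacenter:.0%} of energy is static")
```

With these made-up numbers, static power accounts for roughly 91% of the energy in the idle-heavy case and roughly 10% in the busy case, even though dynamic power is ten times higher than static power in both.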

Peak Power

While power is usually correlated with energy consumption (sometimes, the two are incorrectly interchanged), there is another metric that is actually more important - Peak Power, which is the maximum power consumed at any point while the chip is on. In other words, it is the power when the sum of static and dynamic power is the highest. Imagine being told that you have a daily allowance for the money you spend on your rental car (including fuel expenses) that you are not allowed to exceed - that’s what Peak Power is.

There is a reason why I said Peak Power is more important than energy consumed - while higher energy consumption would result in shorter battery life or a bigger electricity bill (which are bad, but don’t make the device unusable), Peak Power directly impacts the operation of the chip. If the peak power is high, a few things can go wrong:

Too much heat is produced
Unexpected voltage drops occur
Complex power distribution networks are needed

Each of these could be catastrophic, as Nvidia once learnt the hard way. During the design of their NV30 chip in 2002, Nvidia was forced to increase the clock frequency of the chip in order to be competitive with ATI’s Radeon 9700 PRO. But without them realizing it, this exceeded their Peak Power budget, and resulted in excessive heating. In what was a last minute move, they added a huge dual-slot fan to ensure that the chip was still usable. But this fan was very loud, and as a result, NV30 became the butt of many “hot chip” jokes. (In fact, Nvidia posted a spoof video themselves.)

Peak power is a very interesting topic in itself, with challenges in both estimation, and management techniques. But for the purposes of this post, the main takeaway is this: while there could be many design decisions that are effective at reducing energy consumption, they are only valid if they are under the peak power budget.
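
The takeaway above can be written as a simple filter: discard every design option that exceeds the peak power budget, then pick the lowest-energy option among the rest. Here is a minimal sketch in Python. The `DesignOption` class, the option names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DesignOption:
    name: str
    energy_j: float      # energy for a reference workload
    peak_power_w: float  # highest static + dynamic power at any instant


def pick_design(options, peak_budget_w):
    """Return the lowest-energy option that stays within the peak power budget."""
    valid = [o for o in options if o.peak_power_w <= peak_budget_w]
    if not valid:
        return None  # nothing fits the budget
    return min(valid, key=lambda o: o.energy_j)


options = [
    DesignOption("fast clock, finish early", energy_j=80.0, peak_power_w=12.0),
    DesignOption("moderate clock", energy_j=95.0, peak_power_w=9.0),
    DesignOption("slow clock", energy_j=120.0, peak_power_w=6.0),
]
best = pick_design(options, peak_budget_w=10.0)
print(best.name)  # moderate clock
```

Notice that the option with the lowest energy is rejected here, because it does not fit within the 10 W budget.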

So far, we’ve understood that there are two types of power consumed in a chip - static and dynamic power. Changes in these two types of power have consequences for two important metrics in chip design - Energy consumed and Peak power. Trust me, this is all you need to know about power as a chip architect. Once you are equipped with this knowledge, the next step is to look at some techniques that architects follow to reduce power consumption.

Part 2: Power optimization

If you have read this post so far, you know the basic concepts in chip power, and are ready to put your “low-power chip architect hat” on. When I was reviewing this topic, I found a lot of solutions proposed were either constrained to a very specific design, or were slightly different versions of the same idea (often wrapped in fancy terminology). In order to keep this section simple and clear, I’m proposing a new hierarchy for ideas. For both static and dynamic power, we will classify ideas as:

Principles: the fundamental ideas that can apply to any design
Techniques: a specific implementation of a principle

Techniques are huge in number, and can be implemented at both the architecture and microarchitecture levels. I’ll share a few example techniques of each type for each principle, but I urge the reader to focus more on the principles. Note that whether a technique is good or bad depends on how it impacts other key metrics in the chip - which include area, performance, peak power, and energy consumption. Since these requirements vary based on the end-application for the chip, there is no one good technique.

With these logistics out of the way, let’s start with static power.

Static Power Optimization

As we saw earlier, static power is always consumed when a chip is kept on. A nuance that I want to add here is: each transistor consumes static power independently. This leads us to the two principles to reduce static power consumption.

Principle 1: Reduce Transistor Count

If we go back to our rental car analogy, and consider each transistor as a rental car - then it becomes clear that your daily rental charge increases for each additional rental car you have. So the most obvious way to reduce static power is to reduce the number of transistors in your chip. Since this post is meant for a chip architect, transistor count reduction is the same as area reduction by removing or simplifying big logic blocks in the chip. Here are some such techniques:

Technique 1: Remove Redundancy

Let me use an example to explain the idea of redundancy. Consider executing a CPU instruction to do the following:

Read data from register 1
Read data from register 2
Add them
Store the result in register 3

We read/write register data using a block called the register file. The number of parallel reads that can be executed by the register file is decided by the number of read ports. This means:

If we have 2 read ports, data from registers 1 and 2 can be read in the same cycle
If we have 1 read port, we read data from register 1 in the first cycle, and data from register 2 in the next cycle.

It should be clear that 2 read ports result in better performance than 1 read port. But an additional read port costs additional area, which means more transistors, and higher static power.

In theory, you only need one instance of each type of block in a chip - one adder, one register/memory port, and so on. But redundancy is a common attribute in most chips that is used to improve performance. If reducing static power is your main goal, removing such redundancy can help. Typically, this also reduces the peak power of the chip. However, energy consumption could vary from case to case - it is possible that reducing redundancy results in a huge performance penalty, and as a result, you need to run the chip for longer and consume more energy to do that. Hence, a good understanding of the application is key before making a decision to reduce redundancy.
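
Here is a toy model in Python of the read port example, to show how removing redundancy can lower static power and still increase energy. Everything in it is an assumption made for illustration: each add needs 2 register reads followed by 1 cycle to add and write back, each read port adds a fixed amount of static power, and the clock frequency and power values are invented.

```python
import math


def run_adds(num_adds, read_ports, freq_hz=1e9,
             base_static_w=0.10, static_w_per_port=0.02,
             dynamic_j_per_add=1e-10):
    """Toy model: each add needs 2 register reads, then 1 cycle to add and write back."""
    read_cycles = math.ceil(2 / read_ports)   # 2 cycles with 1 port, 1 cycle with 2 ports
    cycles = num_adds * (read_cycles + 1)
    runtime_s = cycles / freq_hz
    static_w = base_static_w + static_w_per_port * read_ports
    # Static energy scales with runtime; dynamic energy scales with the work done
    energy_j = static_w * runtime_s + dynamic_j_per_add * num_adds
    return static_w, runtime_s, energy_j


for ports in (1, 2):
    static_w, runtime_s, energy_j = run_adds(10**9, read_ports=ports)
    print(f"{ports} read port(s): static {static_w:.2f} W, "
          f"runtime {runtime_s:.1f} s, energy {energy_j:.2f} J")
```

With these numbers, the single-port design has the lower static power (0.12 W vs 0.14 W), but it runs for longer and ends up consuming more energy (about 0.46 J vs 0.38 J). A different set of assumptions could easily flip the result, which is why understanding the application matters.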

Technique 2: Eliminate Pipelines

If you are unaware of what a pipeline is, I recommend one of my earlier posts on CPU pipelines. Each pipeline stage added comes with an area cost - which means additional transistors consuming static power. So by combining different pipeline stages together and reducing the number of pipeline stages, static power can be reduced. (This also reduces dynamic power, but we will look at that later.)

However, pipelining is a common technique to increase the frequency, and hence throughput of the chip. So by removing pipelines, you may save on area, peak power, and even energy consumption in most cases - but it comes at the cost of a significant performance drop.

Technique 3: Simplify Prediction

If you look into any modern computing chip, you will find different “hacks” that are meant to improve execution performance. I’ve put them under a bucket called “prediction” - as these methods build on knowing something about the data or instruction patterns. Two common methods in this bucket are:

Branch/Value prediction
Caching

Although these result in a performance bump, to get better prediction results, you would need more complex hardware blocks. (As an example, you can check the different cache mapping schemes I discussed in an earlier post.) Complex blocks mean more transistors, and more static power. By simplifying these techniques (for example, using a direct mapped cache instead of a set associative cache) we can reduce static power consumption.

Much like reducing redundancy, estimating how this technique would reduce energy is tricky: Recovering from cache misses may result in additional energy consumption - and this may overshadow the reduction from static power savings.
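
To get a feel for the cache side of this tradeoff, here is a small cache simulator sketch in Python. It counts misses for a direct mapped cache and a 2-way set associative cache of the same size. The LRU replacement policy, the tiny cache size, and the access pattern (deliberately chosen to cause conflicts) are all assumptions for illustration; a real workload would behave differently.

```python
from collections import OrderedDict


def count_misses(addresses, num_lines, ways):
    """Count misses in a toy cache with LRU replacement.

    ways=1 is a direct mapped cache; ways>1 is set associative.
    Each address is treated as one cache line (no offset bits).
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        lines = sets[addr % num_sets]
        if addr in lines:
            lines.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[addr] = True
    return misses


# Two addresses that collide in a 4-line direct mapped cache (0 % 4 == 4 % 4)
pattern = [0, 4] * 8
print(count_misses(pattern, num_lines=4, ways=1))  # 16: every access misses
print(count_misses(pattern, num_lines=4, ways=2))  # 2: only the first two accesses miss
```

Each extra miss means a trip to the next level of memory, which costs both time and energy. Whether that cost outweighs the static power saved by the simpler cache depends on how often your application hits patterns like this one.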

Principle 2: Reduce Transistor Voltage

If you have a large group and cannot reduce the number of cars you are renting, the next best option would be to find a rental company that gives you the best deal. In circuit land, this “deal” actually comes from the voltage applied on the transistor. In layman's terms: the transistor is like a resistance - when you apply a higher voltage, more current flows, and this increases the static power. To reduce static power through reduced voltage, apply these coupon codes at your rental car checkout:

Technique 1: Power Gating

What's the best way to reduce the voltage applied to transistors in a chip? How about not applying any voltage? Well, you can’t do that for the whole chip, because that would just make it a block of silicon. That's where power gating comes in - instead of turning off the entire chip, only the unused parts of a chip can be turned off. Since static power is only consumed by the transistors that are connected to a voltage source, power gating effectively reduces the number of active transistors - which reduces the static power consumed.

Power gating is done in real time - which means, as a chip architect, designing the chip in a way that would allow larger blocks to remain off for a longer time, while still maintaining similar performance, becomes the key. It is also important that there is some way to detect if, and when, a certain block would be unused. The complex logic needed to handle power gating introduces some area overhead - still, power gating is one of the most commonly used techniques to reduce static power.

Technique 2: Multi-Voltage Islands

This is another similar idea to power gating - but instead of turning off the voltage for some blocks in real time, a lower voltage is used. Low voltage is different from no voltage, because with low voltage, the logic is still functioning - it is just slower. (I'm skipping the details, but transistors running on lower voltages can only support lower frequencies - this is what makes the chip slower.) So as an architect, you could divide the entire chip into different voltage domains (i.e. logic blocks sharing a voltage source) based on the expected usage, and reduce the voltage for the domains that can run slower. Another way to use the same idea is to use the same chip design to support different applications by changing the voltage applied to each domain.

Irrespective of the implementation, using multiple voltage domains (or islands) is an effective way to lower static power for some parts of the chip (and hence lower the total static power.) If done correctly, this may also not have a major performance impact, although it would increase the chip area.

Technique 3: Dynamic Voltage Scaling

As I mentioned in the previous technique, reducing transistor voltage is possible if the frequency is also reduced. This is the idea that Dynamic Voltage Scaling (DVS) exploits. With DVS, the circuit has the ability to lower the voltage applied (and simultaneously reduce frequency) when slower execution is acceptable - imagine an application that has a burst of heavy processing that happens at once, followed by periods of light processing. The static power can be reduced by lowering the voltage when heavy processing is not needed.

As an architect, identifying opportunities for DVS, and building in the capability to detect when the voltage can be lowered, is key - and often comes with an associated area cost. In most cases, it would also lead to lower performance. But it is commonly used in battery operated devices - when you are running low on battery and hit the “power saving mode” on your smartphone - it’s DVS at work. DVS also impacts dynamic power, which I will cover in the next section.

Dynamic Power Optimization

If you remember our rental car analogy, I mentioned that dynamic power is like the fuel expense incurred each day on your car. When I was first learning to drive a car, I came across many different theories on how to reduce your fuel consumption - from simple concepts like “accelerate/brake gradually”, and “use cruise control on freeways”, to more obscure ideas like “For this car model, maintain tire pressure to be X, drive at speed Y, and have a candid conversation with your engine every Sunday” (I’m kidding of course… or am I?)

My point is, dynamic power is also similar - there are a lot of ideas proposed, with most being too specific to certain microarchitectural implementation or workload - which ultimately results in a lot of confusion. To avoid this, I’m going to try to use the same two tier hierarchy I proposed for static power, and hopefully rid you of some of the confusion.

Principle 1: Reduce active clock cycles

After driving for a few years, here’s what I learnt - if you want to reduce fuel consumption, just turn the engine off. In chip language, this means turning off the clock. (If you want to know about clocks and pipelines, check this post before reading further.) Here are some techniques to reduce active clock cycles.

Technique 1: Clock Gating

This is one of the most popular techniques to reduce chip power. Much like power gating, the goal is to identify blocks that are unused for some period of time, and turn their clocks off. Clock gating inserts logic (called the "gate") into the clock path going to a block of logic - this gate can be selectively turned on or off in real time (typically based on some other logic in the chip).

There are many nuances involved in implementing clock gating that I’m skipping in this post. But as a chip architect, identifying opportunities for clock gating is invaluable. A block is ideally suited for clock gating if:

It can remain off (the technical term is “idle”) for a significant time while the chip is still running
There is a way (ideally an easy way) to identify when the clock for this block needs to be active

Clock gating is a great way to save a lot of energy. (i.e. provide better battery life for your devices) In fact, in most cases, clock gating can be implemented with minimal impact on performance. (There are some cases, especially high frequency designs, where meeting timing with clock gating could be challenging, so take this with a grain of salt.) However, there would be some additional area cost - for both the control, and gating logic.

I want to end this section by talking about the interdependence between power gating and clock gating. It should be obvious that if power gating is implemented, then clock gating becomes redundant. However, power gating is usually more complicated, and incurs a bigger performance penalty - so in many cases, a hierarchical approach is followed:

If a block is idle, turn off the clock first (i.e. clock gating) - save dynamic power
If the block remains idle for very long, pull the plug (literally, i.e. power gating) - save both static and dynamic power
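
The two-step approach above can be sketched as a tiny policy function. This is only a behavioral illustration under my own assumptions: the state names, the `next_state` and `simulate` helpers, and both idle thresholds are hypothetical, and wake-up latency is ignored entirely.

```python
from enum import Enum

class BlockState(Enum):
    ACTIVE = "active"
    CLOCK_GATED = "clock_gated"
    POWER_GATED = "power_gated"

# Hypothetical thresholds, in idle cycles; real values depend on the design.
CLOCK_GATE_AFTER = 2
POWER_GATE_AFTER = 8

def next_state(idle_cycles):
    """Pick a power state from how long the block has been idle."""
    if idle_cycles >= POWER_GATE_AFTER:
        return BlockState.POWER_GATED   # long idle: pull the plug
    if idle_cycles >= CLOCK_GATE_AFTER:
        return BlockState.CLOCK_GATED   # short idle: stop the clock only
    return BlockState.ACTIVE

def simulate(busy_trace):
    """busy_trace: iterable of booleans, True when the block has work."""
    idle_cycles = 0
    states = []
    for busy in busy_trace:
        idle_cycles = 0 if busy else idle_cycles + 1
        states.append(next_state(idle_cycles))
    return states

# One busy cycle, eight idle cycles, then work arrives again.
for state in simulate([True] + [False] * 8 + [True]):
    print(state.value)
```

In a real chip the thresholds would be tuned against the cost of waking the block up again, which is usually much higher for power gating than for clock gating.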

Technique 2: Dynamic Frequency Scaling (DFS)

When I talked about Dynamic Voltage Scaling (DVS) under static power optimization, I mentioned that it comes at the cost of reduced clock frequency. Taken independently, this frequency reduction comes under a technique known as Dynamic Frequency Scaling (DFS). Reducing the frequency means fewer clock cycles are completed in a second, which translates to lower dynamic power. However, when you are reducing the frequency, you are reducing the performance of the chip. But a clever usage of DFS can hide the performance drop from the user of the chip - for example, if your computer is turned on, but you aren’t actively doing something, (like playing a game, or watching a movie) there is no need for the highest performance. Detecting such opportunities is the challenge, and can be done broadly in two ways:

Through the operating system (this is called CPU throttling)
Through some hardware signals (for example, architects can add counters to indicate low usage)

Both methods would cost more transistors to implement, but the energy savings are worth the investment. Although you can lower the frequency without lowering the voltage, DFS and DVS are typically used together to get maximum power savings. This is called Dynamic Voltage and Frequency Scaling, or DVFS.
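
To get a feel for why DFS and DVS pair so well, here is a rough sketch using the common textbook approximation for dynamic power: activity factor × capacitance × voltage² × frequency. That formula is a standard first-order model and is not derived in this post; the function name and all the operating points below are hypothetical.

```python
def dynamic_power_w(activity, capacitance_f, voltage_v, freq_hz):
    """Textbook approximation: P_dyn = activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# Hypothetical operating points (not taken from any real chip).
ACTIVITY, CAP = 0.2, 1e-9
full = dynamic_power_w(ACTIVITY, CAP, voltage_v=1.0, freq_hz=2e9)
dfs = dynamic_power_w(ACTIVITY, CAP, voltage_v=1.0, freq_hz=1e9)   # half the frequency
dvfs = dynamic_power_w(ACTIVITY, CAP, voltage_v=0.8, freq_hz=1e9)  # lower voltage as well

print(f"DFS only: {dfs / full:.0%} of full power")
print(f"DVFS:     {dvfs / full:.0%} of full power")
```

In this model, frequency scales the power linearly, while voltage enters as a square. So once the frequency has been lowered enough to allow a lower voltage, taking that voltage reduction gives a disproportionate extra saving.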

Technique 3: Efficiency Cores

When I first heard about a multi-core CPU, I expected multiple copies of the same core used to execute more than one task simultaneously. But power optimization was built into multi-core when in 2011, ARM introduced their “big.LITTLE” architecture. The idea was to pair high performance (”big”) cores with power-efficient (”LITTLE”) cores to get the best of both worlds. A dynamic task scheduler assigns the appropriate tasks to each type of core, ideally without any slowness that a user can perceive.

This idea has now been adopted by most multi-core CPUs, and is known generally as Performance-Efficiency (PE) architecture. While the Efficiency cores (E-cores) can use any of the power optimization techniques discussed in this post, I have placed this technique here, as the biggest power savings come from the lower clock frequency used by the E-cores.

Although in general, a Performance core has better performance than an Efficiency core, the smaller size of efficiency cores can allow more cores to be packed in the same area - so it is hard to convincingly talk about the performance impact of the multi-core system. Overall, the PE architecture has been quite revolutionary, and can be expected to remain a mainstay for the foreseeable future.

Principle 2: Reduce switching

I hope you aren’t tired of the rental car analogy already, but here I go, one last time. You can think of each switching event like a stop you are making while driving. A stop-start car ride is not going to be fuel efficient - similarly, more switching activity in a chip means more dynamic power. Unlike the previous principle, switching is heavily dependent on the workload and the microarchitecture. To make things easier, I have classified the sources of switching into two categories:

Switching in combinational logic (Think: Stops caused by vehicle traffic)
Switching in sequential logic (Think: Traffic lights, Stop signs, and so on)

I will provide some example techniques for each category, but you will see these are quite “hacky” and won’t be applicable more generally. Let’s start with combinational logic.

Principle 2.1: Reduce switching in combinational logic

This part of my post assumes some digital design knowledge, but if you don’t know what combinational logic is, here’s a layman definition: Any logic that does not use a clock is combinational logic. Some examples are Binary logic gates (like AND, OR, NOT), adders, comparators, and so on. Each time a change on its inputs triggers combinational logic, dynamic power is consumed. Here are some techniques to reduce the triggers.

Technique 1: Remove Redundancy

I also mentioned this under static power optimization. In the context of dynamic power, if you have fewer blocks of a type, (say one multiplier instead of two) it also means that fewer blocks end up consuming dynamic power. An effective way to use this technique in a chip is by identifying logic that can be shared: For example, if you are using two different multipliers to multiply values from the same signal in different parts of the design, why not share the logic? But even an ideal candidate like this may result in routing issues. (what if the two parts of the design are very far away from each other in the final chip layout?) Hence, although peak power and area would be lower, this typically comes with a performance drop, and can potentially consume more energy overall to recover from the lost performance.

Technique 2: Moving Logic Downstream

This title could be vague - so let me jump directly to an example.

Consider three values - A, B and C. Your chip needs to do this:

If A is greater than B, multiply A and C.
Else, multiply B and C

There are two ways to do this:

Option 1
- Multiply A and C
- Multiply B and C
- Then compare the two products - the greater one is your final result
Option 2
- Compare A and B - the greater one is your initial result
- Multiply the result from the previous step with C

It’s obvious that Option 2 is better than Option 1 when it comes to dynamic power. In Option 2, we are doing multiplication in the last stage - as a result, we only need to do the multiplication once. In other words, to reduce dynamic power, move complex logic downstream, so you can limit the number of triggers to this logic.
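
Here is a small software sketch of the two options that counts how often the multiplier gets triggered. `CountingMultiplier` is a hypothetical stand-in for a hardware multiplier, and I am assuming C is not negative so that both options give the same answer.

```python
class CountingMultiplier:
    """Stand-in for a hardware multiplier that counts how often it is triggered."""
    def __init__(self):
        self.activations = 0

    def multiply(self, x, y):
        self.activations += 1
        return x * y

def option_1(a, b, c, mul):
    # Multiply first, compare the two products afterwards.
    prod_a = mul.multiply(a, c)
    prod_b = mul.multiply(b, c)
    return max(prod_a, prod_b)

def option_2(a, b, c, mul):
    # Compare first, multiply only the winner.
    return mul.multiply(max(a, b), c)

mul_1, mul_2 = CountingMultiplier(), CountingMultiplier()
print(option_1(3, 7, 5, mul_1), mul_1.activations)  # 35 2
print(option_2(3, 7, 5, mul_2), mul_2.activations)  # 35 1
```

Same result, half the multiplier activity. In hardware, Option 1 would also need two multipliers (or two passes through one), which is where the area and performance difference comes from.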

This was a very simplistic example to demonstrate this technique - you might wonder why anyone would even implement Option 1 - because even the performance would be better in Option 2. While this is obvious in simple designs, in large designs such minor details often get missed - imagine the values A, B and C coming from completely different modules, passing through different pipe stages, or maybe even being handled by different teams! That’s what makes even simple techniques like this challenging to implement.

Technique 3: Digital Logic Tricks

If you have been working on digital logic long enough, some tricks become evident to you. For example, when I first started writing RTL logic, I used multiplications everywhere - just like you would in a software programming language like Python. But there are a lot of scenarios, where true multiplication is not necessary.

For example, let’s say you want to multiply the values A and B - but you know B is always a power of 2. (i.e. 1, 2, 4, 8 and so on.) Then, instead of multiplication, this logic can be implemented using bit shifts. For example, if A = 5, B = 4, then A*B is the same as 5 << 2. (i.e. shift the binary value of A to the left by 2 places.) In hardware, a shift by a fixed amount is just wires, which means practically zero dynamic power - a far more power efficient implementation than using multipliers.
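
The shift trick is easy to check in software. The sketch below is in Python rather than RTL, and `multiply_by_power_of_two` is a hypothetical helper name; it only illustrates the arithmetic, not how the hardware would be written.

```python
def multiply_by_power_of_two(a, b):
    """Multiply a by b using a left shift; b must be a power of 2."""
    if b <= 0 or (b & (b - 1)) != 0:
        raise ValueError("b must be a positive power of 2")
    shift = b.bit_length() - 1  # log2(b): 4 -> 2, 8 -> 3
    return a << shift

print(multiply_by_power_of_two(5, 4))  # 5 << 2 = 20
```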

Another common trick related to multiplication is to bypass multiplication when you can detect one of the operands to be 0 or 1. We can skip multiplication here too, but remember that detecting 0s or 1s comes with an area overhead. (and may also increase the peak power slightly, to implement the extra logic.) But if the energy saved is significant, it is still worth doing.
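
Here is a minimal sketch of the bypass idea, again in Python. The function name, the `stats` dictionary, and the operand stream are all hypothetical; the point is only to show how many multiplier activations a bypass can avoid for a given stream of operands.

```python
def multiply_with_bypass(a, b, stats):
    """Skip the multiplier when an operand is 0 or 1; count what happened."""
    if a == 0 or b == 0:
        stats["bypassed"] += 1
        return 0
    if a == 1:
        stats["bypassed"] += 1
        return b
    if b == 1:
        stats["bypassed"] += 1
        return a
    stats["multiplied"] += 1   # only this path would trigger the real multiplier
    return a * b

stats = {"bypassed": 0, "multiplied": 0}
pairs = [(0, 9), (1, 7), (6, 1), (3, 4)]  # hypothetical operand stream
results = [multiply_with_bypass(a, b, stats) for a, b in pairs]
print(results, stats)  # [0, 7, 6, 12] {'bypassed': 3, 'multiplied': 1}
```

Whether this pays off depends on how often 0s and 1s actually show up in the workload, since the detection logic itself costs area and some power.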

There are many other tricks like this which a good designer has at their disposal to reduce combinational logic (and consequently the resulting dynamic power).

Principle 2.2: Reduce switching in sequential logic

Sequential logic refers to the structures that use a clock and can maintain their state between clock cycles - the fundamental components are usually a flip-flop or a latch, which can be combined together to create bigger structures like registers, and finite state machines (FSMs). Dynamic power is consumed by sequential logic during a clock event. (this typically happens when the clock signal moves from low voltage to high - called the positive clock edge.) This means that all the techniques under principle 1 will also reduce dynamic power consumed by sequential logic. However, this principle focuses on how to reduce switching while keeping the same clock frequency - in other words, how to switch fewer sequential logic elements in a chip.

Technique 1: Eliminate Pipelines

This was already covered under static power, and the same text applies here. Fewer pipeline stages means fewer flop stages, and as a result, lower dynamic power is dissipated.

Technique 2: Replace Registers with SRAMs

I mentioned that sequential logic is used to maintain state - this means they can be used as memory elements to store some data (bits) during the chip execution. The most common way to store data is using registers - which are collections of 1-bit storage elements like flip-flops or latches. Each time a bit needs to be modified, these elements consume dynamic power - hence, large registers become a hotspot (pun intended) for power consumption in a chip.

Registers are not the only storage - Static Random Access Memories (SRAMs) are an alternative structure that can be used to store data. While the fundamental unit of an SRAM is similar to a register, they are packed together to serve as dense memories. (read as: area reduction.) But SRAM design has gone beyond just area density today. SRAMs are now designed with techniques like sleep modes, voltage scaling, and read/write gating to make them highly power efficient. (Also, in terms of logistics in chip teams, chip architects typically don’t design SRAMs - so as an architect, if you have a low power SRAM solution available from a different team or vendor, this becomes a plug-and-play power optimization technique for you.)

It’s important to note that simply replacing any register with an SRAM will not work - typically, the power benefits of SRAM are significant only for large memories. Also, if you are storing data once, but reading it often, then registers are actually better. (since they only consume dynamic power during the storage phase.) Also, SRAM access is slower than registers, which is likely to have an impact on overall performance. So use this technique cautiously.

Technique 3: Using pointers to avoid toggling

This is a very specific implementation, but I find it to be elegant, so it has earned its place here. (Good interview question too, by the way.) Let’s say you want to implement a First-In-First-Out (FIFO) structure, that can hold 4 elements. Here’s a simple way to implement this structure (called a shift register):

Place 4 registers (R1, R2, R3, R4) back-to-back, with the output of R1 connected to the input of R2, and so on
When a new entry should be added, send it as input to R1, and shift all existing elements one place (so the data held in R1 moves to R2, the data in R2 moves to R3, and so on)
Once full, if you want to read the first (oldest) data, simply read the output of R4

However, the problem with this implementation is that each time a new entry is added, all 4 registers switch. Now imagine the same structure for 1000 entries, with each entry being 1024 bit wide - that’s a lot of dynamic power for each new entry.

A better (power optimal) way to implement the same design, is to have independent registers, along with pointers for each register. In the above example, if we had 4 independent registers, we could track the start, end and next pointer each time a new entry gets added:

First entry - add it to R1, then set:
- start pointer = R1
- end pointer = R1
New entry added - add it to any available register (let’s say R3). Then set
- start pointer = R1
- end pointer = R3
- R1’s next pointer = R3
Similarly, store new entries in any registers, and update start, end and next pointers accordingly.

If you notice, in this implementation, each time a new entry gets added, only one register needs to be updated. (ignoring pointer management, which would have a small area, power, and performance overhead.) This design could easily scale to large FIFOs without a big dynamic power increase.
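
To compare the two FIFO styles, here is a behavioral sketch in Python that counts how many data registers get written on each push. The class names, the `register_writes` counter, and the free list used to find "any available register" are my own assumptions for illustration; pointer updates are not counted, matching the simplification above.

```python
class ShiftFifo:
    """Shift-register FIFO: every push moves every entry one place."""
    def __init__(self, depth):
        self.regs = [None] * depth
        self.register_writes = 0

    def push(self, value):
        # R1 takes the new value, R2 takes R1's old value, and so on.
        self.regs = [value] + self.regs[:-1]
        self.register_writes += len(self.regs)


class PointerFifo:
    """Pointer-based FIFO: data stays put, only pointers are updated."""
    def __init__(self, depth):
        self.regs = [None] * depth
        self.next_ptr = [None] * depth
        self.free = list(range(depth))   # indices of unused registers
        self.start = self.end = None
        self.register_writes = 0

    def push(self, value):
        if not self.free:
            raise OverflowError("FIFO full")
        slot = self.free.pop()           # any available register will do
        self.regs[slot] = value
        self.next_ptr[slot] = None
        self.register_writes += 1        # exactly one data register changes
        if self.start is None:
            self.start = slot
        else:
            self.next_ptr[self.end] = slot
        self.end = slot

    def pop(self):
        if self.start is None:
            raise IndexError("FIFO empty")
        slot = self.start
        value = self.regs[slot]
        self.start = self.next_ptr[slot]
        if self.start is None:
            self.end = None
        self.free.append(slot)
        return value


shift, ptr = ShiftFifo(4), PointerFifo(4)
for v in ["a", "b", "c"]:
    shift.push(v)
    ptr.push(v)
print(shift.register_writes, ptr.register_writes)  # 12 3
print(ptr.pop(), ptr.pop())                        # a b
```

For three pushes into a 4-entry FIFO, the shift version touches 12 registers while the pointer version touches 3. The gap grows with the depth of the FIFO.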

If you have read so far, here’s something to leave you with: a table summarizing the principles and techniques discussed above, and their impacts on metrics. Remember that some metrics depend strongly on the workload being run - the results in the table only provide the general trend.

I want to reiterate something I mentioned earlier - the list of techniques in this post (and I’d argue any post) is not exhaustive - they are merely examples to understand the principles better. I hope you can take some of these ideas to build some incredibly power efficient chips in the future. Or at the very least, pass along your learnings - one way to do that is to share this post with someone you know :)

Evolution of HDLs - Part 2: Keeping up with Moore's Law

Bharath Suresh — Sat, 31 May 2025 01:09:44 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

In the first part of my HDL series, we went through the first 30 years in the evolution of HDLs, leading to two standardized HDLs in the 1990s - VHDL and Verilog. This era was dominated by academic research giving rise to completely novel ways to describe chips. (I recommend checking it out first to fully appreciate this story.)

In this part, we’ll continue our journey over the next 20 years - which saw fewer, but very important evolutions that allowed chip designers to keep up with the increasing complexity resulting from Moore’s law.

Chapter 1: Was Verilog losing momentum?

Although Verilog became an open standard in 1995 and was gaining popularity, chip designers found some shortcomings:

1. “What you see may not be what you get”

If you remember, in part 1, I introduced the idea of “Behavioral Simulation” - where the logic expressed by your HDL code is simulated using a different language like C. The advantage of behavioral simulation is the speed - the alternative (gate level simulation) is significantly slower.

Despite the speed advantage, behavioral simulation had a problem: what you see in simulation could be completely different from the logic that gets synthesized. This was never seen as a problem initially for two reasons:

Verilog code was mainly written by experts (who also understood how the simulator worked)
Designs were smaller, so manual reviews could still catch most bugs

In the late 1990s, this non-determinism started to become a real issue. The Verilog-95 standard (which was the first IEEE standard for Verilog) lacked clear specification in some cases, like:

Defining a sensitivity list in an always block
Implementing signed arithmetic
Blocking/Non-blocking assignments

This was quite a big deal in the chip design industry: detecting such bugs was extremely hard, and when they were found, re-spinning the chip would cost millions of dollars. This made Verilog a risky option, especially considering that the alternative, VHDL, had a more precise syntax with fewer such ambiguities. As a result, VHDL became the preferred option for a lot of industries at that time - especially in safety-critical industries like defense and aerospace. (We can see the remnants of this even today - chip design teams that started around this time continue to use VHDL)
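
To show how simulation and synthesis can disagree, here is a rough Python model of one classic case from the list above: an incomplete sensitivity list. It imagines a Verilog block like `always @(a) y = a & b;` where `b` was left out of the list. The class and function names are hypothetical, and this is a toy model of the behavior, not how a real simulator is implemented.

```python
class SimulatedBlock:
    """Mimics simulating `always @(a) y = a & b;` (b missing from the list)."""
    def __init__(self):
        self.a = self.b = self.y = 0

    def set_input(self, name, value):
        changed = getattr(self, name) != value
        setattr(self, name, value)
        if name == "a" and changed:   # only a change on `a` wakes the block
            self.y = self.a & self.b


def synthesized_gate(a, b):
    """What synthesis would typically build: a plain AND gate."""
    return a & b


sim = SimulatedBlock()
sim.set_input("a", 1)   # block runs: y = 1 & 0 = 0
sim.set_input("b", 1)   # block is not re-evaluated, so y stays 0
print(sim.y, synthesized_gate(sim.a, sim.b))  # 0 1
```

The simulation says the output is 0, while the gate on silicon would output 1. The wildcard sensitivity list added later in Verilog-2001 was aimed at exactly this kind of mistake.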

2. Verilog couldn’t scale

When Verilog was created, designs were simpler - so more focus was given to precise description of hardware structures. Other aspects like readability and scalability of the code were ignored.

For instance, Verilog was created assuming designs with few (less than 10) input/output ports in each module. But over time, modules started to need a large number of ports (often hundreds), and the existing Verilog syntax made port declaration painful.

Another critical aspect missing in Verilog was the idea of replication and conditional logic definition - known today as generate statements. As designs got complex, the ability to replicate certain lines of HDL was needed to ensure that the code size was manageable. The ability to turn on/off certain lines was also helpful to run different experiments without modifying the code each time. VHDL was ahead of the curve - it already supported generate statements to replicate logic.

At this point in the story, if I had to bet on one of these two HDLs, I would have picked VHDL - A department of defense initiative and an IEEE standard, with better language constructs, and fewer chances of errors. In fact, Synopsys, the company that promoted Verilog as the primary HDL for their synthesis tool, decided to launch a VHDL simulator (called Scirocco) in the year 2000, and strongly advocated for the language. The conventional wisdom was that VHDL was going to monopolize the HDL world. Things were not looking great for Verilog.

Chapter 2: Good Artists Copy

Despite clear evidence from experts that VHDL was a better designed language, Verilog did show a few glimpses of its usefulness. Back in 1995, at the Synopsys Users Group (SNUG) meeting, John Cooley hosted an interesting competition. He invited a set of HDL practitioners to create a gate netlist for a synchronous parity generator - with highest clock frequency being the winning metric. But there was a catch - the participants only had 90 minutes. The result of this competition was interesting: Almost all the Verilog designers were able to produce functioning HDL within the given time; while none of the VHDL designers could! (Fun fact: This competition was won by Larry Fiedler, who was a designer at Nvidia)

This competition showed that despite lacking some constructs, describing hardware with Verilog was easier (and hence quicker) than using VHDL. Sensing that both languages had their merits, a joint group called Accellera was formed in the year 2000, as a merger between VHDL International and Open Verilog International. While this merger was meant to take both HDLs forward, it was Verilog that benefited greatly.

I already mentioned the two types of issues with Verilog: non-determinism and a deficiency of language constructs. With the formation of Accellera, the latter could be resolved easily: just copy the missing constructs from VHDL and add them to Verilog. And that’s exactly what happened - A new Verilog standard was published in 2001 (called Verilog-2001) with several new constructs like:

ANSI C inspired port and datatype declarations
Wildcard sensitivity list
Generate statements
Multi-dimensional arrays

Ultimately, a few other inconsistencies were fixed in another minor update in 2005. The release of Verilog-2001 and Verilog-2005 was a big statement from the language designers to its practitioners: Verilog was designed with the user in mind, and feedback from the user would be incorporated to improve the language. This started to swing the HDL wars in favor of Verilog.

The non-determinism problem was more interesting: While some ambiguities were clarified with the release of the two standards, HDL experts still believed (and some still continue to believe today) that VHDL was a more precise language. But as it turns out, this factor was not as significant as the productivity gains that using Verilog provided - especially to the huge number of chip design startups that started to emerge during this time. (Qualcomm: 1985, Broadcom: 1991, Nvidia: 1993, and so on)

Instead of picking the most precise HDL to avoid non-determinism, many of these design houses decided that they would rather pick the most productive HDL, and worry about non-determinism later. To manage the non-determinism, Linting tools were introduced: A Linting tool can infer user intent, and warn against potential cases where the simulation and synthesis result may not match. A lint check could eliminate most of the non-deterministic scenarios, and make Verilog a safe HDL.

So, through collaboration with Accellera, and improved linting tools, Verilog was able to rise from a tough spot. It must be said that during this phase, VHDL stopped growing. (between 1993 and 2008, there was no major update to the VHDL standard.) It is not clear whether this was a decision by Accellera, or classic incumbent arrogance. Either way, once Verilog caught up, VHDL users started to decline. This was the first sign of danger for VHDL. But while all this was happening in the US, a bigger storm was brewing far, far away.

Chapter 3: A New Beginning

At the 1995 Open Verilog International conference, Joe Costello, the then CEO of Cadence, famously called VHDL a “$400 million mistake”, and mentioned that the money could have instead been spent developing a better HDL. While he was likely pandering to the audience, (I mean, it was a Verilog conference) and likely promoting Verilog simulators from Cadence, a few people took his words seriously.

Remember Brunel University from part 1? That was where Phil Moorby and the HILO HDL came out of. As you know, Phil Moorby went on to join Gateway Design Automation in the U.S., which ultimately gave us Verilog. There were two other key personalities at Brunel along with Moorby that I did not mention earlier - Peter Flake, the project lead for HILO, and Simon Davidmann who helped in its development. Davidmann would go on to work at Gateway Design Automation in the 1980s, but he always had his eyes set on something bigger.

As chip designs got complex, verifying them became a challenge - a problem that inspired many verification languages to emerge. But building a simulator that supported these different verification languages was difficult. Recognizing this problem, in 1997, Davidmann founded Co-Design Automation, with Flake as its CTO. Their goal was to come up with a single language that could be used for logic design, verification and system design, and to build a simulator for this language. While their initial plan was to build a completely new language, they ultimately decided that it was better to build on top of an existing HDL. While it is unclear why they picked Verilog over VHDL, I assume their previous experience at Brunel and Gateway Design Automation must have played a role. (Again, funny how small moments in this story have a big impact.) In 1999, they introduced the Superlog language, which took Verilog and extended its capabilities for verification and system design.

Although Accellera, which was formed soon after, was intended to take Verilog and VHDL forward, they were also looking outward, at alternatives like Superlog. Co-Design Automation was happy to have them onboard - they could sell more simulator licenses if Superlog was backed by Accellera. In May 2002, Accellera approved Superlog as an official extension of Verilog - but apparently weren’t big fans of the name. So they decided to call this new extension “Verilog for System Design” - i.e. SystemVerilog.

All the EDA companies started to take notice of this: Maintaining different languages for design and verification was a pain for both the users and the tool vendors, so this Accellera backed common language for design and verification was like Christmas in June! (By the way, it was actually around June when all this was happening) Synopsys acted quickly, and acquired Co-Design Automation for $36 million just a few months after the Accellera announcement. With this acquisition, Synopsys started to strongly advocate for SystemVerilog as the HDL of the future. Accellera continued to improve the language, and ultimately, SystemVerilog was recognized as an IEEE standard in 2005. Although verification features were the key selling point for SystemVerilog, it was also the most complete HDL of this era, with features like:

Constructs to specify intent to allow simulators/synthesis tools to model the RTL accurately
Packages, which are key in managing large projects
Datatypes like int and byte, similar to high level languages like C++
Interfaces/Modports - this allowed one interface to be shared by RTL and Verification teams
Assertions, that helped designers add checks along with the HDL and save future debugging time

It would actually have been easy at this point to position SystemVerilog as its own HDL, making it a competitor to VHDL and Verilog. But by this point, I think the industry had matured beyond these petty fights. Since SystemVerilog was built as an extension of Verilog, it was easy to maintain compatibility. So the designers of the language, and EDA providers, made a key decision to embrace Verilog and its users: All Verilog constructs were supported by default in SystemVerilog (including existing Verilog files, which could be used along with SystemVerilog files in projects).

Essentially, for any chip design team starting at this time, it was hard to look away from SystemVerilog - it was a modern language that was backed by EDA vendors, could support legacy Verilog designs, and could be shared with the Verification teams. As a result, the Verilog ecosystem (SystemVerilog and Verilog) took a strong lead in this era, and this lead is evident even today.

Learnings from this era

In the 1960s-1980s, HDLs were like Tom Hanks in the movie Saving Private Ryan - there were wars, chaos, and a lot of action happening. But the story of HDLs in the 1990s-2000s reminded me of Tom Hanks in Cast Away - life slowed down and got lonely, but in the process, HDLs matured. I have identified two key trends that emerged during this period:

1. HDL code started looking like software

In 1986, the EDA giant Synopsys was born, with a tool that converted HDL code into a netlist, that could then be used to create the physical layout. This step, which is called “synthesis”, had a major impact on how a Hardware Description Language was perceived. If you look back at early HDLs like the Computer Description Language, a HDL was simply a language to describe the design of a chip to someone else. But after synthesis tools were created, a clear analogy with software design started to emerge.

HDLs were seen as High Level Languages (Like C/C++)
The netlist was like assembly code
The synthesis tool was the compiler

This meant that the purpose of a HDL was no longer to describe hardware accurately. Instead, a HDL became the language that a human designer uses to talk to the synthesis tool. As synthesis tools get smarter, the language can get more human friendly without losing precision. Hence, the verbosity and precise constructs that seemed like VHDL’s strength stopped being valuable once synthesis tools improved - making Verilog/SystemVerilog the preferred choice during this era.
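
To make the compiler analogy a bit more concrete, here is a toy sketch in Python. It is not how any real synthesis tool works: the nested-tuple "HDL", the gate names, and the `synthesize` and `simulate` functions are all made up for illustration. The idea is simply that a human-friendly description gets lowered into a flat netlist of gates, the same way a compiler lowers high level code into assembly.

```python
import itertools

# A tiny made-up "HDL": nested tuples such as ("and", "a", ("not", "b")).
# Strings are input signals; tuples are operators applied to sub-expressions.

def synthesize(expr, netlist=None, counter=None):
    """Lower a nested expression into a flat list of gates (a toy netlist).

    Returns (output_net, netlist); each netlist entry is
    (gate_type, output_net, input_nets).
    """
    if netlist is None:
        netlist = []
    if counter is None:
        counter = itertools.count()
    if isinstance(expr, str):            # a primary input needs no gate
        return expr, netlist
    op, *operands = expr
    inputs = []
    for operand in operands:
        net, _ = synthesize(operand, netlist, counter)
        inputs.append(net)
    out = f"n{next(counter)}"            # fresh internal wire name
    netlist.append((op.upper(), out, tuple(inputs)))
    return out, netlist


def simulate(netlist, output, values):
    """Evaluate the netlist gate by gate for the given input values (0 or 1)."""
    gates = {
        "AND": lambda a, b: a & b,
        "OR": lambda a, b: a | b,
        "XOR": lambda a, b: a ^ b,
        "NOT": lambda a: 1 - a,
    }
    nets = dict(values)
    for gate, out, ins in netlist:       # gates are appended in dependency order
        nets[out] = gates[gate](*(nets[i] for i in ins))
    return nets[output]


# "High level" description of the sum bit of a full adder: a XOR b XOR cin
sum_expr = ("xor", ("xor", "a", "b"), "cin")
sum_net, sum_netlist = synthesize(sum_expr)
for gate in sum_netlist:
    print(gate)
print(simulate(sum_netlist, sum_net, {"a": 1, "b": 1, "cin": 1}))
```

The designer only writes `sum_expr`; the wire names and gate ordering in the netlist are the tool's problem. That division of labour is what the analogy above is getting at.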

2. “More transistors, more lines, more problems”

Chip design, especially in the 1990s, was driven by two rules: Moore’s Law and Dennard Scaling. With every new chip generation, more transistors could be packed in a similar sized chip, without consuming extra power. Chip designers took a liking to this idea - Intel went from 275,000 transistors in their 386 processor (in 1985), to 3.1 million transistors in the first Pentium (early 1990s), and a mammoth 42 million transistors in the Pentium 4 (in the early 2000s).
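
As a quick back-of-the-envelope check on those numbers, the sketch below estimates how often the transistor count doubled between these chips. The transistor counts come from the paragraph above; the exact years I plug in for the Pentium (1993) and Pentium 4 (2000) are my assumption, and the calculation assumes steady exponential growth in between.

```python
import math

# (name, year, transistor count). Counts are from the text; the years for the
# Pentium and Pentium 4 are assumed launch years, not stated in the text.
chips = [
    ("386", 1985, 275_000),
    ("Pentium", 1993, 3_100_000),
    ("Pentium 4", 2000, 42_000_000),
]

def doubling_time_years(year0, count0, year1, count1):
    """Years per doubling of transistor count, assuming exponential growth."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings

doubling_times = {}
for (name0, y0, c0), (name1, y1, c1) in zip(chips, chips[1:]):
    years = doubling_time_years(y0, c0, y1, c1)
    doubling_times[(name0, name1)] = years
    print(f"{name0} -> {name1}: {c1 / c0:.1f}x the transistors, "
          f"doubling roughly every {years:.1f} years")
```

With these assumed years, the doubling time works out to roughly 2.3 years for the first jump and 1.9 years for the second, which is in line with the usual "doubling about every two years" reading of Moore’s Law.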

As the transistor density exploded, so did the number of lines of HDL code that needed to be maintained. In a team, this usually means more designers start modifying the HDL code, each with their own styles, preferences and levels of expertise. It was not enough for an HDL to describe hardware well - the value of an HDL also came from syntax that allows:

Better abstraction and scalability
Different modelling granularities
Easy training and readability

SystemVerilog was certainly ahead of VHDL in this respect during this era - but I think major problems in HDL code maintenance are ahead of us. (The amount of chip design happening today is massive, but the ecosystem still hasn’t caught up as I mentioned in my EDA Deep Dive series.)

If you have any programming experience, you know that SystemVerilog is certainly not as human friendly and easy to maintain as high level languages like Python. As a result, HDLs have continued to evolve, and will always continue to do so. Subscribe and stay tuned for upcoming posts where I will explore this further.

References:

https://www.cs.columbia.edu/~sedwards/papers/edwards2004design.pdf (Comparing HDLs in early 2000s)
https://ieeexplore.ieee.org/document/597119 (Birth of SystemC)
https://ieeexplore.ieee.org/document/835166 (Why SystemC)
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=545676 (VHDL vs Verilog)
https://dvcon-proceedings.org/wp-content/uploads/a-tale-of-two-languages-systemverilog-and-systemc.pdf (SystemC vs SystemVerilog)
https://www.sigasi.com/opinion/jan/verilogs-major-flaw/ (About early Verilog issues)
https://trilobyte.com/pdf/golson_clark_snug16.pdf (How Verilog, VHDL and SystemVerilog evolved)
https://danluu.com/verilog-vs-vhdl/ (About the SNUG Verilog vs VHDL competition)

The Computer Engineering Game

Bharath Suresh — Sat, 22 Mar 2025 03:59:31 GMT

Disclaimer: Opinions shared in this, and all my posts are mine, and mine alone. They do not reflect the views of my employer(s), and are not investment advice.

Introduction

As I’m writing this, it’s been about 5 years since I graduated with an undergraduate degree in electrical engineering. Whenever I talk to current students, a part of me goes back to my time in undergrad - I think about the things I did right, things I did wrong, and how everything played out in the end. This post is an attempt to consolidate my reflections, and answer a broad question many students ask: How do I navigate my undergraduate degree to successfully pursue a career in computer engineering?

A few caveats:

When I say “Computer Engineering”, I’m specifically talking about the typical skillsets needed to design a digital computing processor.
Your time in college plays an important role in transitioning you to adult life. This post talks only about one aspect of that transition - your career. This is not a “how to live your life” guide.
This post makes most sense if you are about to start a typical 4 year undergraduate program. If you are in a different situation, you can still use some of these ideas, but modify them according to your situation.

My initial plan was to list out a set of tips, but to make it more fun for me to write (and hopefully more fun for you to read), I have framed your progression as a game, called “The Computer Engineering Game”. Our game has 5 levels. Each level has:

Objectives: What you should achieve at the end of the level
Gameplay: Different ways to achieve your objective
Cheat Codes: Something you can use, if you have the option, to make the level easier (not everyone can use cheat codes, and that’s fine!)
Traps: Things to watch out for as you navigate each level

If you have understood the rules, scroll down to start your journey in the computer engineering game.

Level 1: The branch prediction conundrum

Like every good processor, you need to make a key early decision as you are decoding your career.

Objectives:

This level has a very simple objective - at the end of this level, you need to decide whether you would like to pursue a career in computer engineering, or not. At this stage, you don’t need to know the specifics about what sub-domain within computer engineering you are interested in - it’s a simple ‘Yes’ or ‘No’ to computer engineering.

Gameplay:

I think if you are reading this, you already have some inclination towards computer engineering. I can write a whole post about what makes computer engineering great, but here, I’d like to focus on questions that highlight the realities of working as a computer engineer. These questions may sound grim, but that does not mean computer engineering is a bad career - semiconductors continue to be the vehicle on which all technological progress is built, and computer engineers make this progress happen. So treat these questions as a reality check, instead of something that drives you away from the field.

To decide if computer engineering is right for you, I suggest answering the following questions:

At a high level, do you understand how a processor works?
- If not, that’s totally fine. You are a short video away from saying ‘yes’ to this question. I recommend something like this: How a CPU Works.
- If you have the time, the extended version of this video is a book called But How Do It Know by J Clark Scott.
Does the idea of designing a computing chip (like a processor) excite you?
- If the answer is No, computer engineering is likely not the best choice for you.
Do you like problem solving and debugging?
- Your daily job would involve thinking very deeply about problems. It sounds cool on paper, but advancing in this career needs you to continue to be an excellent problem solver over many decades. There are other careers (and I mean no disrespect to them while saying this), which are primarily based on different skills - like communication, planning, and so on. Computer engineering isn’t like that.
A career in computer engineering is slow moving - you need to spend many years at what you do to feel (and be recognized) as an important contributor. Does that sit well with you?
- I think doing great things takes time in any field, but some others like software engineering have quicker feedback and progress (like quicker promotions, and more opportunities to change jobs). It’s important to understand, and accept this reality about computer engineering jobs today.
Can you tolerate longer project cycles?
- This is a nuanced extension of the previous point. But depending on your role, the product you are working on could take anywhere from 6 months to 5 years to be fully realized. A lot of people are motivated by seeing their work out in the world quickly. I get that, but the complexity of modern semiconductor manufacturing does not allow for this. There is a good chance you may not be in the same role when the product you worked on is actually released.
You might make less money than some of your peers as you are starting your career. Is that an acceptable tradeoff?
- As much as I hate saying this, the honest truth is that if you are a smart, driven individual looking to make the most money, there are better options than computer engineering (fields like software engineering and quantitative finance come to mind). There is a fundamental reason why this might never change - computer engineering companies produce physical products (or sell to others who produce physical products), and it costs money to make these physical products. Having said that, computer engineering salaries have been increasing recently, and continue to be on the higher end of the overall salary distribution among all careers.
- Despite what I said, you also need to ask yourself a different question here: can I be one of the best computer engineers? I think the best computer engineer would make more money than the average in other fields, so if you think you can be one of the best, then money shouldn’t be a problem.

It’s important to take time, gather more information, and answer these questions carefully. If you can say YES to all these questions, computer engineering is a great career choice for you and you can move forward to level 2. If there are some clear NOs, I think you should look at other careers.

Cheat Codes:

Do you have a friend or family member who works as a computer engineer? If yes, ask them questions to understand what you might be getting into. Go to their workplace, and see what they do on a daily basis.
Can you get some part time work in computer engineering? College clubs are one way. Other options could be to get some (likely unpaid) internship, or work in a research lab at your university.

Traps:

While making this decision, separate the “product” from the “job role”. I did say you need to be excited by processors to take up this career, but that’s the least significant question in my list. Unfortunately, we do not have one-person computer engineering teams today - maybe AI can change that, but until then, you only play a role in making a product (especially early in your career). This is different from a solopreneur who can design an app by themselves - there, the line between “product” and “job role” starts to blur.
Don’t take up computer engineering for a specific company, or the money. Companies become irrelevant quickly, and the money comes at the cost of a challenging job. These external factors should be treated as a nice to have, not the basis for your decision.
I mentioned college clubs. While they are an effective way to get hands on experience, they sometimes present a rose-tinted picture. Your experience will be seriously impacted by how you feel about the people in the club. It is unlikely that you will work with the same set, or even same type of people in the future. Take only the work aspect of your college club experience while making a career decision.
It’s understandable if you change your mind later, but don’t rush through this level - it is better to spend more time in Level 1 and make the right decision than to regret your decision in a few years.

Level 2: The compilation grind

Assemble information from different sources and create a binary representation in your head.

Objectives:

Welcome to level 2. Your goal in this level is to understand the fundamental ideas in computer engineering. Broadly, I have defined three main categories which are important:

Electrical Engineering 101
Computer Engineering 101
Computer Science 101

Gameplay:

Let’s look at each category from above in more detail. I’ll provide a “bare-minimum” list of topics to be covered, along with a free, publicly available course as a template for each topic.

Electrical Engineering 101

This includes fundamental topics in electrical engineering - usually these are required courses for any student pursuing an electrical engineering degree. There are two fundamental categories of courses I recommend here:

Circuits 101
- Topics:
  - Basic circuit analysis methods
  - Analog vs Digital circuits
  - How do fundamental circuit components work - Capacitors, Diodes, MOSFETs, Operational Amplifiers, Memory elements, and Filters
- An Example Course: https://ocw.mit.edu/courses/6-002-circuits-and-electronics-spring-2007/pages/lecture-notes/
Signals 101
- Topics:
  - Continuous vs Discrete signals and how to mathematically represent them
  - Manipulating signals - why, and how?
    - Useful operations - Convolution, Filtering, Modulation, Sampling, and Interpolation
    - Transforms to change signal representations (Fourier, Laplace, Z transforms)
  - Systems with feedback
- An Example Course: https://ocw.mit.edu/courses/res-6-007-signals-and-systems-spring-2011/pages/lecture-notes/
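
To give a flavour of why the Signals 101 topics matter for a computer engineer, here is a small Python sketch of sampling and quantization, the two steps that turn a continuous physical value into numbers a digital circuit can store. The function names, the 1 Hz sine wave, the 8 Hz sample rate and the 3-bit resolution are all arbitrary choices for illustration, not taken from the course above.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bits, full_scale=1.0):
    """Sample a continuous-time signal and round each sample to an integer code.

    Codes run from 0 to 2**bits - 1 and cover [-full_scale, +full_scale].
    Returns (codes, step), where step is the spacing between adjacent levels.
    """
    levels = 2 ** bits
    step = 2 * full_scale / (levels - 1)
    codes = []
    for n in range(int(duration_s * sample_rate_hz)):
        t = n / sample_rate_hz                       # sampling: pick discrete time points
        x = max(-full_scale, min(full_scale, signal(t)))
        codes.append(round((x + full_scale) / step)) # quantization: snap to nearest level
    return codes, step

def sine_1hz(t):
    return math.sin(2 * math.pi * 1.0 * t)

codes, step = sample_and_quantize(sine_1hz, duration_s=1.0, sample_rate_hz=8, bits=3)
reconstructed = [c * step - 1.0 for c in codes]      # map codes back to values

for n, (code, value) in enumerate(zip(codes, reconstructed)):
    error = value - sine_1hz(n / 8)
    print(f"sample {n}: code {code:03b}, value {value:+.3f}, error {error:+.3f}")
```

Each reconstructed value is off by at most half a quantization step. Adding bits shrinks that error, and raising the sample rate captures faster changes in the signal; both cost more hardware.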

Computer Engineering 101

I have defined this category to include the basic courses that will tell you more about the building blocks of a modern computer. I’m sharing two reference courses here:

Digital Logic Design
- Topics:
  - Combinational logic blocks
  - Arithmetic blocks
  - Sequential logic blocks
  - State Machines
  - Memory
- An Example Course: https://ocw.mit.edu/courses/6-111-introductory-digital-systems-laboratory-spring-2006/pages/lecture-notes/
Computer Organization
- Topics:
  - How instructions/data move in a computer
  - Data representation in computers - binary/hexadecimal, floating point representations
  - Things computers can do - arithmetic/logic, data movement, conditional execution
  - Computer architecture concepts at a high level - Pipelining, Caching, Memory hierarchy
- An Example Course: https://www.youtube.com/playlist?list=PL-Mfq5QS-s8iUJpNzCOtQKRfpswCrPbiW
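
As a small taste of the data representation topic, the Python sketch below prints the same values in binary and hexadecimal, and pulls apart the sign, exponent and fraction fields of a 32-bit IEEE 754 float. The helper name `float_bits` and the example values are my own picks for illustration.

```python
import struct

value = 202
print(bin(value))            # 0b11001010
print(hex(value))            # 0xca

# Two's complement: the 8-bit pattern used to store -3
neg_three = -3 & 0xFF
print(f"{neg_three:08b}")    # 11111101

def float_bits(x):
    """Return the 32 raw bits of x stored as an IEEE 754 single-precision float."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return raw

raw = float_bits(-0.15625)   # -0.15625 = -1.25 * 2**-3
sign = raw >> 31             # 1 bit
exponent = (raw >> 23) & 0xFF    # 8 bits, stored with a bias of 127
fraction = raw & 0x7FFFFF        # 23 bits after the implicit leading 1

print(f"raw bits : 0x{raw:08X}")
print(f"sign     : {sign}")
print(f"exponent : {exponent} (unbiased {exponent - 127})")
print(f"fraction : 0x{fraction:06X}")

# Rebuild the value from its fields to check the decoding
rebuilt = (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
print(rebuilt)
```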

Computer Science 101

This is an important category, and the one that is most often neglected. (Most universities do not enforce these as mandatory requirements for an electrical engineering degree.) However, a good computer engineer should at least know as much computer science as a CS sophomore. Any CS 101 course would do. (Harvard CS 50 is a popular choice - https://www.youtube.com/playlist?list=PLhQjrBD2T381WAHyx1pq-sBfykqMBI7V4).

The topics covered by such a course would be:

Types of programming languages, and compilation flow
Basic programming constructs (pick any one language here)
Fundamental data structures (Arrays, Stack, Queue, Linked Lists)
Simple algorithms (Search, Sort, Recursion)
Memory allocation (pointers, dynamic memory, segmentation fault)

At the end of this level, you should be able to answer the following questions:

What are the fundamental components in electronic circuits, and how do they work?
How do we represent physical values in digital form? (this is why you need the Signals 101 course!)
What is digital logic, and what are the building blocks of a digital circuit?
How does a processor work? (You might realize I asked this in level 1 as well. In level 2, you should have a much more detailed answer)
Given a problem whose solution I already know, can I write a program that solves it for me?

Cheat Codes:

The name of the game in level 2 is: “Cover maximum number of topics, with minimum overlap”. So depending on the university you are in, or the material you are using, you could optimize this level - for example, you may find a course that combines Digital Logic Design with Computer Organization.
Programming coursework has become a lot more accessible, so you can try to finish your CS 101 topics before you start level 2. (Don’t worry, even if you choose not to pursue computer engineering after level 1, programming is a valuable skill that you will end up using at some point in your career)
Prefer courses that have a lab or project component. Often, these courses involve more work, but as you will see, they will save you time in level 3.

Traps:

What you learn in level 2 is the foundation for everything that comes next in your computer engineering career. Resist the temptation to skip topics or breeze through this level. Any additional time spent here will speed up your upcoming levels.
If you are in a university, following their curriculum is good enough for level 2. You don’t need to do anything fancy here. Introductory courses at universities are usually well designed - there is no need to game the system.
There is a tendency for students pursuing computer engineering to assume that programming is optional, or that they only need conceptual understanding of programming. If you feel this way, please get this thought out of your head. Being a good programmer is now a prerequisite for most computer engineering careers. The earlier you start programming, the more you will do it, and the better you will become. I can’t stress this enough - programming is just as important as anything else you do in this level!

Level 3: From simulation to emulation

Time to load up some real hardware

Objectives:

In Level 3, you are going to put the theoretical knowledge from level 2 into practice. The goal is to identify the type of work you like, while simultaneously earning some CV points. At the end of this level, you should be able to answer the following question: What kind of computer engineering job would I like in the future?

Gameplay:

Remember how I said level 2 is straightforward if you are enrolled at a university? Level 3 is on the other extreme - it is mostly self-driven. There are many types of “real-world” jobs that you can do to achieve the objective of this level. I’ll talk about three common approaches that I have experienced:

Corporate Internships:

In my experience, these are the most sought after, and as a consequence, the hardest to get at this level. However, at major tech companies, especially in the US, I see a lot of students interning very early in their academic career (even right after their first year!). If your goal is to work at one of these companies in the future, this is a perfect opportunity to see what your life would look like (FYI - interns get much better treatment than employees to lure them to join, but you can still assume you would get 80% of what you see.) I don’t have too much to say about this category, except that hiring at this level is fairly random (there isn’t too much to differentiate candidates). So everyone should try for one of these, but it’s not the end of the world if you don’t land one at this level.

Working with a professor:

This is what I would classify as the “sweet spot” for this level. Good professors encourage students to work with them, and be involved in their research group. I also think academic research is perfect for this level, because it gives you an understanding of what’s new in the field - so you pick a more future-proof career. I’ll just throw in a few caveats:

Don’t expect too much attention from the professor that you are working with. All you need from the professor is to introduce you to the other members in the group, and involve you in their discussions. The rest is on you.
- For the goals of this level, you actually don’t need a hands-on advisor. Reach out to other members who have been in the group longer (like PhD students) to get started.
- If you really want the professor’s attention, “show, don’t tell”. Give them something useful, don’t give them more work to do.
Do an actual project as part of your research. Reading papers is important, and that’s how you get started at this level, but you should quickly move on to actually producing some results. The purpose of this level is to get experience working in the field - so don’t just read.

Personal Projects:

After reading the first two categories in this section, this might look modest to you. But this was my main intention for this level. A personal project may not look as good on your resume, but it’s actually more useful than internships for this level, because:

Anyone can do it
You can do multiple such projects
You can choose exactly what you want to work on

My suggestion before embarking on a personal project is to actually talk to someone who has more experience than you - could be a college senior, or just someone who is at the position you want to be. Ask them what kinds of projects would be useful for your goals. This will ensure that your experience is realistic and actually prepares you towards your future career.

If you want to make this a bit more formal, I suggest looking into some open source projects/programs in your field of interest. Unfortunately, computer engineering lacks dedicated programs like software engineering, but you can still find some hardware engineering projects in something like Google Summer Of Code (GSoC). Also, open source EDA is on the rise, and I have shared some resources in an earlier post that would also help in getting started.

I want to stress one thing - irrespective of whether you are working on an established open source project or something you just cooked up yourself, document it with all the details. If you have code, have a well maintained repo on GitHub. I think documenting helps you learn better, but more importantly, it can legitimize your projects - solid documentation seen by the right person could open up a lot of doors in your career.

Finally, here are some questions that you should be able to answer at the end of this level.

Do I like working in a lab, or do I like working in an office?
Do I like the slow grind of research, or the fast pace of corporate life?
What subdomain do I want to explore in more detail? (Spoiler alert, you will lock this down in level 4.)

Cheat Codes:

One of the best ways to be accepted into a professor’s research group is to take a course with them, secure a good grade, and reach out to them as the course is about to end. Even if you don’t have a good grade, give this a try if you liked the course - professors prioritize familiarity over anything else.
If you know someone that can get you some unpaid projects at their workplace, you can take that up. But remember that if you are doing unpaid work at this level, you should be compensated with flexibility - in terms of what you want to work on, and for how long. Use that flexibility to your advantage.
If you had a project in one of your courses in level 2, the easiest way to play level 3 is to just take that project to the next level. But try to expand your knowledge while doing this so that it takes you closer to the objective of this level.

Traps:

A lot of students assume this is an optional level - it is not! I understand that not everyone might get an internship at Google, or be accepted into a professor’s research group, but you can certainly do a project on your own - there are great resources available for free. (Even if you have no ideas, today, you can ask ChatGPT to come up with an idea for you, and tell you exactly how to implement the idea!)
When looking for work opportunities at this level, a lot of students optimize for the wrong things - money, or reputation. Don’t do that. I’m not giving generic advice like “follow your passion”. The reality is that at this level, you don’t have enough leverage to demand either - so it is better for you to focus on the type of work, and bet on the long term.
If you were able to score a corporate internship, that’s fantastic. But remember that it could be a double-edged sword - you might be doing dull, repetitive work that does not push you towards the objectives of this level. I would still suggest taking up corporate internships if they are available, but if you have more than one option, choose wisely.
It’s okay if you want to work on group projects. But it is very important that you get your hands dirty and make significant individual contributions to the project - otherwise, you won’t really know if you like the work, or you just like the end result of your group project.
Don’t undersell yourself - successful applications at this level are mostly a result of confidence. If you have done level 2 right, you have the skillset needed to manage most tasks you would be working on. The bar is very low for positions at this level.

Level 4: Finalizing the floorplan

Place all your blocks at the right place, and start to connect them. The quality of chip you end up with depends on this!

Objectives:

This level has two missions:

You need to decide which sub-domain in hardware engineering you want to pursue.
You need to get proficient with the coursework for that sub-domain.

At the end of this level, you will have the skillsets needed to move forward to a career in your desired sub-domain.

Gameplay:

Level 4 is the longest level in this game. Here, you will be taking up advanced, but niche coursework (If you remember, level 2 had fundamental, but broad courses). At the university level, these courses are usually “electives”, or optional - which means you need to put together the best combination of courses for your needs. Along the way, you also need to decide on the sub-domain you want to start your career in.

What are the different sub-domains?

I have classified all the traditionally useful skills in computer engineering into three broad categories, each with three sub-categories:

Building the chip: This includes roles that are needed to create a semiconductor chip from a concept. I would say this is roughly what is called Digital VLSI Design. It covers the following areas:
- Microarchitecture design: Here, you take some functionality, and come up with the most efficient arrangement of digital logic blocks to implement that functionality. Traditionally, this involves RTL (Register Transfer Level) design using a HDL (Hardware Description Language) like Verilog.
- Physical design: This is where you actually decide where each transistor should be placed, and how they should be connected. Today, this is achieved using EDA (Electronic Design Automation) tools - so the role of a physical design engineer is to understand how to use these EDA tools to get the best possible chip.
- Verification: When you build a chip, it is crucial that you build something that works as expected. This is where verification comes in. Verification happens at many levels in the chip design process - functional verification (called design verification, or DV), performance verification, verification after the chip is manufactured (post-silicon), and so on.
Making the chip usable: When you design a chip, all you have is a fancy looking solid made of silicon and other materials. A chip becomes useful when it comes with some supporting software that makes it capable of running existing applications (like booting an operating system, or running a game). This is where skills in this category come into the picture. I am mentioning three sub-topics to better explain this role:
- Compiler design: If you finished the Computer Science 101 course in level 2, you know what a compiler is. Depending on the type of chip you have designed, you will either need to modify existing compilers to fit your needs, or build a new compiler from scratch.
- Driver design: A driver is a piece of software that ensures that your chip can communicate with the operating system you are building for. This is another key component that most computing chips need.
- Firmware design: Not everything the chip does is governed by the user application (through the compiler) or the operating system (through the driver). There are some tasks that need to be run automatically - for example, when a chip is first booted up, some tasks need to be executed to make sure it works correctly. This is done through a special type of software called firmware.
Analyzing the chip: This includes what we commonly refer to as “Architecture roles” (Not the house building kind :) ). Computer architecture is simultaneously the first and last step in chip design - you analyze how your current chip has done, and what your next chip should do. I have divided architecture roles into three categories:
- Modelling: Come up with a software version of the chip, so that new features can first be experimented there (before you ask someone to “Build the chip”)
- Workload analysis: A workload is a standard piece of software that is meant to demonstrate the operations needed for a particular application. A workload analysis role is needed to ensure that workloads execute efficiently on the current chips, and also identify bottlenecks for future chips to address.
- PPA/Competitive analysis: This is closely related to workload analysis, but here, you focus on Performance, Power, Area (PPA) on your current chip, and try to compare it with what your competitors are doing, to plan for future generations of the chip.
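
To give a feel for what the modelling and workload analysis roles look like in practice, here is a deliberately tiny Python sketch: a software model of a direct-mapped cache, run against two made-up workloads. The function name, the cache parameters and both address traces are assumptions I picked for illustration; real architecture models are far more detailed.

```python
def simulate_direct_mapped_cache(addresses, num_lines=4, line_size=16):
    """Return (hits, misses) for a trace of byte addresses on a direct-mapped cache."""
    tags = [None] * num_lines             # one stored tag per cache line; None = empty
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size         # which memory block the byte belongs to
        index = block % num_lines         # the only cache line this block may occupy
        tag = block // num_lines          # identifies which block currently sits there
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag             # evict whatever was there and fill the line
    return hits, misses

# Workload 1: walk 64 four-byte elements one after another
sequential = [i * 4 for i in range(64)]
# Workload 2: the same number of accesses, but jumping 64 bytes each time
strided = [i * 64 for i in range(64)]

for name, trace in [("sequential", sequential), ("strided", strided)]:
    hits, misses = simulate_direct_mapped_cache(trace)
    print(f"{name}: {hits} hits, {misses} misses, "
          f"hit rate {hits / (hits + misses):.0%}")
```

The sequential workload hits on three out of every four accesses, while the strided one misses every time because each access maps to the same cache line with a different tag. Spotting a bottleneck like this in a model, before anyone is asked to build the chip, is exactly the point of these roles.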

There may be other roles that don’t fit exactly into one of these categories, but broadly, this covers the typical skills you would find in a chip design team.

Suggested Coursework:

I’ll start with courses that I think you should do, irrespective of the sub-domain you are interested in. I’ll use the same format as level 2.

Computer Architecture:
- Topics:
  - Instruction Set Architectures (ISA)
  - Pipelined processors
  - Handling Hazards
  - Out of Order execution
  - Memory hierarchy and caching
- An Example Course: Onur Mutlu’s Computer Architecture lectures - https://www.youtube.com/playlist?list=PL5PHm2jkkXmi5CxxI7b3JCL1TWybTDtKq
VLSI Design Flow:
- Topics:
  - Cover the different steps from RTL to GDS
    - RTL design and simulation
    - Logic synthesis and static timing analysis
    - Floorplanning, placement, routing
    - Clocks and clock distribution networks
    - Verification and testing
  - A simple project covering all these steps would be a nice add-on
- An Example Course: NPTEL VLSI Design Flow: RTL to GDS - https://nptel.ac.in/courses/108106191
Operating Systems:
- Topics:
  - Processes and Threads
  - Synchronization
  - Scheduling
  - File handling
  - Interrupt/Exception handling
  - Virtualization
- An Example Course: CSE 421 by Geoffrey Challen - https://ops-class.org/

Coursework specific to your sub-domain:

Building the chip
- CMOS Digital VLSI Design
  - Topics:
    - What is CMOS design
    - Building combinational CMOS logic circuits
    - Building sequential CMOS logic circuits
    - Designing memory circuits
    - Metrics to analyze different circuits - logical effort, fanout, parasitics, etc
  - An Example Course: IIT Roorkee CMOS VLSI Design - https://www.youtube.com/playlist?list=PLLy_2iUCG87Bdulp9brz9AcvW_TnFCUmM
- VLSI Design Automation (how EDA tools work)
  - Topics:
    - Algorithms to implement different steps in the RTL to GDS flow
      - Synthesis
      - Floorplanning
      - Placement
      - Routing
      - Clock Tree Synthesis
  - An Example Course: IIT Kharagpur VLSI Physical Design - https://www.youtube.com/playlist?list=PLU8VFS-HdvKtKswbcvvA8yVhzleTV7OE8
- Some bonus courses
  - A dedicated RTL design course - https://www.youtube.com/playlist?list=PLwdnzlV3ogoVlY7iVqr-FhWUQEX7JDdiP
  - A course on verification methodologies (UVM/OVM) - https://www.youtube.com/playlist?list=PLBIILfL2t1lnvzw7vF0arlvu36Wj4--D7
  - FPGA and High Level Synthesis - https://www.youtube.com/playlist?list=PLf4U4tpbjjz7x_bsG3sBEuXgVQPZfWJgW
Making the chip usable
- Compilers
  - Topics:
    - Lexical analysis
    - Parsing and Abstract syntax trees
    - Semantics
    - Register allocation
    - Code generation
  - An Example Course: Stanford Compilers - https://www.youtube.com/playlist?list=PLTsf9UeqkRebOYdw4uqSN0ugRShSmHrzH
- Data Structures and Algorithms
  - Topics:
    - Standard DSA topics like different data structures, and their use in algorithms like sorting, searching, etc
  - An Example Course: MIT OCW DSA course - https://ocw.mit.edu/courses/6-006-introduction-to-algorithms-spring-2020/
- Some bonus courses
  - Programming Language Design - https://ocw.mit.edu/courses/6-821-programming-languages-fall-2002/
  - Something related to LLVM/GCC internals - https://www.youtube.com/playlist?list=PLlONLmJCfHTo9WYfsoQvwjsa5ZB6hjOG5
Analyzing the chip
- Parallel computer architecture
  - Topics:
    - Multicore
    - Multithreading
    - Caching in multicore
    - Interconnects and Dataflow between cores
  - An Example Course: CMU 742 - https://www.youtube.com/playlist?list=PL5PHm2jkkXmh4cDkC3s1VBB7-njlgiG5d
- Computer Networks
  - Topics:
    - Packet Switching basics
    - TCP/IP
    - Router design
    - Multicast networks
    - Security and scalability considerations
  - An Example Course: MIT OCW Computer Networks course - https://ocw.mit.edu/courses/6-829-computer-networks-fall-2002/
- Some bonus courses
  - A course on GPUs, or Domain specific (like AI) accelerators - https://www.youtube.com/playlist?list=PLbRMhDVUMngfj_NXI7jqMYLnhcRhRKAGq
  - A course dedicated to memory systems - https://safari.ethz.ch/memory_systems/ACACES2024/doku.php?id=start

How to navigate this level:

So far, I have listed the different sub-domains in computer engineering, and the different courses in each sub-domain. Ideally, if you already know what sub-domain you want to pursue, all you need to do is to take up the coursework for that sub-domain. But often, you are still discovering your interests. I recommend an iterative process like this:

Ask yourself if you know what sub-domain you want to pursue
- If Yes, great, focus on courses in that domain
- If No, take a broad course (or one of the mandatory ones I mentioned), and try to find your interest
Repeat this until you converge at one sub-domain.

Cheat Codes:

All courses you do here should ideally be accompanied by a project. If a course does not include one, make a pseudo project on your own - it could be as simple as a review of state-of-the-art research in the field. This will help in your decision-making process, and also make your life easier in level 5.
Similar to level 2, if your university offers courses that cover the maximum number of topics in the minimum number of courses, that would help you reach your goal more easily.
Use your learnings from level 3 here:
- If you liked, and were successful at the projects you took up in level 3, that’s a good indicator towards the sub-domain you should pursue.
- If you really disliked a project you did, use that signal to eliminate sub-domains

Traps:

I understand if I get some pushback for recommending a VLSI course to someone interested in compilers/firmware, or an OS course to a VLSI engineer - but I think there is value, especially in the long run.
- As someone working on the software side, understanding the VLSI flow will help you make better decisions about hardware vs software tradeoffs (basically, you can answer the question: why is something implemented at the compiler/driver/firmware level, when it can be implemented directly in hardware?)
- As a VLSI engineer, understanding how the operating system works can help you extrapolate how a change in hardware propagates to the application, and can push you to do more impactful work for your organization
This level could become extremely long if you go into “analysis paralysis” mode and cannot decide on the sub-domain you want to take up. Take your time, but remember that they are all equally important aspects in hardware engineering and your goal is largely the same - to build the best computing chip. Think about the type of job you would like doing, and give yourself a hard deadline to take a decision. The sooner you decide, the more time you will have to become more specialized in your sub-domain.
Some of you reading this might actually be high performers that feel they can cover all sub-domains. That’s a great attitude, but I still recommend going deep in one sub-domain at this level. I will talk more about how you can maximize your skills in an upcoming “bonus level”.

Level 5: Making the state transition

As the clock hits “posedge” (i.e. “positive clock edge” in Verilog), it’s time to take the leap.

Objectives:

This level is meant to successfully transition you from a student of hardware engineering, to a practitioner. At the end of this level, you should:

Know what you are going to do next
Have the means to get there

Gameplay:

Most of us don’t play the hardware engineering game for fun - there are typically two outcomes:

Join the hardware engineering industry
Pursue advanced training (like graduate school)

Each of these topics deserves its own post (which I want to compile at some point). Here, I’ll just talk about reasons to pick one over the other, and briefly mention a few ways to be better placed for these opportunities.

What should you do in level 5

Draft a strong CV that is ready to be shared
- I’m not talking about specifics here like 1 page only, action points, etc. The key is “maximum impact in minimum space” - whatever that means to you
- Make sure to link to documentation of the projects you pursued (this is why I mentioned documentation as a key in level 3)
Decide which path you want to pursue in the future
- Reasons why you should take up a job
  - You need financial stability in your life
  - You want to gain some work experience to move to something different (like an MBA, or management roles)
  - You want to see how things are working in the real world
  - You want a break from the academic life
- Reasons why you should go to grad school
  - You feel you want to gain expertise in some niche topic (this should lead you to a PhD)
  - Either the role, or the location is inaccessible without graduate school (this is a common reason why students pursue a Masters in the US)
  - You have some fellowship or funding opportunity that makes grad school financially attractive
  - You want to pursue a career in academia
  - You want to change sub-domains

What should you do in level 5, specifically if you want to take up a job

Prepare for job interviews
It’s not ideal, but there are some aspects of interview preparation that are not covered well through your coursework - so I recommend taking interview preparation as an independent task, and giving it sufficient time and effort.
I have a post that talks more about how to prepare for interviews for some roles here: https://chipinsights.substack.com/p/hardware-engineering-interview-resources
Actively reach out for jobs in your desired sub-domain
Here are some ways:
- Apply directly on job sites of your target company (if you have the right profile, you’ll be surprised how often this works)
- Through your university (see Cheat Codes below)
- Through contacts you made in earlier levels (Internships, Projects, etc)
- Through social media like LinkedIn
  - Recruiters are active here - you can directly reach out to them
  - Some managers post when they have openings
  - Just message strangers (even if people can’t directly hire you, most people would offer you a referral at their companies)
- Attend industry conferences or networking events (this is a bit of a lucky draw, but you may stumble upon the right person)

What should you do in level 5, specifically if you want to go to grad school

Start connecting the dots from level 1 to 4
A key component of any graduate school application is a “Statement of Purpose” - something that explains why you want to pursue graduate school in the first place. In level 5, you need to start thinking about the story - and the way you make it stand out is by combining all the aspects of your journey.
Focus on academic research
Even if you have no experience working with a research group or publishing papers in levels 1 through 4, if you want to go to graduate school, I highly recommend doing that now. Typically, this means working with a research group at a university, and producing at least one of the following artifacts:
- A research paper
- A thesis/dissertation
- A strong letter of recommendation from your supervisor
Maximize your grades
- I have not really spoken about the value of grades in this post - I think it’s a convoluted debate for another day. But grades actually play a very important role in your graduate school application. Ideally, your grades would stay healthy from the start; but even if you are at the end of your program, pushing for some good grades is still useful.
Come up with a list of universities you want to apply to for graduate school
- This depends a lot on what your goals out of grad school are. In summary, it depends on the following factors:
  - Matching your interests/background with that of professors at the university
  - Prospects after grad school (job? academia?)
  - Location/Expenses
Prepare all the pre-requisites for graduate school applications
- Do you need to take aptitude tests like the GRE?
- What do you need in order to apply (such as recommendation letters, or documents from your undergraduate university)?

Cheat Codes:

If you are studying in a reputed university, you can use that to your advantage when looking for jobs, through:
- career fairs or campus placements
- if you have worked closely with a professor who has industry ties, they may also be able to get your profile into the right hands
- Reach out to alumni who are now working at your target companies
If you are still at university while doing this level, apply for an internship first instead of a full-time job: companies are more likely to convert their interns to full-time than to hire from outside.
I think in today’s world, it really helps if you have some kind of “influence” on social media that comes from your projects - so post about your projects, experiences, and learnings on places like LinkedIn or Twitter. Levels 1 to 4 are about grinding when no one is watching, but level 5 is about grabbing eyeballs, so use social media to your advantage.
It is usually easy to transition from an undergraduate to a graduate degree at the same university. You may also be able to pursue an accelerated BS+MS program. This is a good hack to do grad school, but ensure that doing this actually helps you end up where you want to be after grad school.
I believe that if you have done level 1 to 4 right, and you have some financial flexibility, both these options should be realistic for you. In that case, there is nothing wrong in trying for both, and deciding which one to choose based on your options - for example, you might want to take up graduate school only if you are admitted at a specific university, or you might want to take up a job only if it is at a certain company. In cases like this, pursue both paths together and buy yourself more time to decide.

Traps:

Resist the temptation of short term gains. I see a lot of students going through level 1 to 4 perfectly, but ending up with a job in software engineering or finance because it has a 10% higher salary. I completely understand that money is important, but if you like computer engineering and can afford a minor short term financial hit, stay with it - with strong fundamentals, you will have a richer (literally) career in the future.
Be very careful before you take up graduate school - it usually comes with a financial strain, and immigration laws in countries like the US put additional pressure on you. Take up graduate school only when you feel you are truly ready.
- This point is more important if you are also using grad school as an opportunity to change sub-domains (for example, from RTL design to compiler design) - avoid this if there is already a lot of uncertainty in your life.
Remember that applications for a job, or for grad school, usually take place many months before you graduate - so if you want no breaks in your career, you need to act fast on level 5.

Bonus Level: Positive slack optimization

If you “meet timing”, you have the luxury to optimize for power, area and other aspects of your chip

Objectives:

If you actually made it this far (both as a reader, and in your career), give yourself a pat on the back. Going through levels 1 through 5 already prepares you well for a career in computer engineering. I have included this level for the ambitious types who want to push themselves even more.

This level is all about combining time, knowledge, and flexibility, and exploring what you can do with it. In this level, your goal should be to expand your skillsets and build a stronger profile to differentiate yourself from your peers.

Gameplay:

While this level is meant to be flexible, here are a few ways to achieve its objectives:

Explore coursework in the sub-domain that you did not pick

While in level 4, I insisted that you should pick a sub-domain and go deep, this is a chance to explore some other domains in detail as well. For example, if you would like to specialize in workload analysis, and have already secured a job in that domain, this might be a good time to explore RTL design, or compilers. Although this won’t help you immediately in your job, you will need skills in multiple areas if you want to move to an expanded role at an organization, or maybe even build a product of your own someday.

Complete a major project

While you have been doing projects consistently from level 3, it’s very likely that they were small projects targeting a specific skillset. In this level, you have the freedom to take up something bigger. Off the top of my head, here are some ideas:

Work with a professor at your university on a thesis
Build a real chip using Open Source tools (Basically go through all the steps from scratch)
Combine the different projects from your past to build a useful product (this could potentially lead you to starting your own company someday)

Get some (more) work experience

If you have the opportunity, pursue an internship at a company or with a research group. I would recommend going somewhere different from where level 5 is taking you - for example, if you have a job lined up at a company, try to intern at a different company. This is a good way for you to gain different perspectives that could be valuable in the future (for example, if you want to change jobs at some point). Plus, you can make decent money while doing internships at this level.

Take a bet on a new technology

By this point, you will have a good idea about the way computing technology is evolving. As you went through these levels, you will have come across a lot of buzzwords - “Artificial Intelligence”, “Quantum Computing”, or whatever else looks promising when you are reading this. In the bonus level, spend some time to pick one area that you think will be “the next big thing”, and gain some knowledge about the field. This will help you in the future either way - if you were right, you have a first-mover advantage; if you were wrong, you can understand why you were wrong, and how to choose better. If you aspire to be in leadership positions in the future, having “good taste” is important. I wrote a piece on taste on my other blog if you would like to know more about this:

Bharath’s Musings

Why taste matters

"Taste" is something I always thought was subjective - everyone has their own interpretation of what is considered "good taste". My views on this topic have started to change recently, and this is a collection of my thoughts on why I think taste matters…

Develop a voice

As I mentioned in level 5, having influence can open a lot of doors. This level is a good opportunity to start thinking about that. Ask yourself: Is there something that I like doing that offers value to others? This could be anything, like starting a research paper reading group, mentoring students at your university, or producing content online. All of these will start off very small, but if you do it consistently, you could become a very influential voice in the field.

Cheat Codes:

If you have the luxury to take some time off before you transition to the next stage of your career, you can fit in the bonus level there. That way, you can move at a more natural pace and still get ahead of your peers before you make your career transition.
Although I have mentioned this as a level, you can (and should) do the things listed here at different points later in your life. Even if you are in your 40s, and want to take up something listed here, you will still see benefits in your career.

Traps:

Make sure that you have completed level 5 before starting this level. It is possible that this level would change your decision from level 5 (say you built a product and want to start your own company instead of going to graduate school), but that’s not a reason to do this level first. In my experience, time in this level is used most effectively when uncertainty is minimal - and completing level 5 greatly reduces your uncertainty.
Along the same lines, don’t burn out. This level is a luxury. The objectives of this level can be achieved at other points in your career. I have included this for those who are comfortably through the other levels, and are wondering what’s next.
Given flexibility, we all have a tendency to optimize for the short term. Resist that feeling, especially in this level. If you are playing this level, you are already ahead of many others in the short term (1-2 years) - the goal of this level is to be ahead in the long term (10-15 years). Keep that in mind as you plan for this level.

Difficulty Settings

You can play the computer engineering game at different difficulties - you just have to change the time you spend in each level. From my personal experience, and having spoken to other students, here’s a template for three different difficulty levels, assuming a 4 year undergraduate program.

My Walkthrough of the Game

I usually don’t talk about myself on this substack, but for a post like this, I thought it would be unfair not to put myself under the scanner. Remember, a lot of points in this post come from things I did not do - so my path isn’t meant to be a perfect example.

Level 1: The decision to pursue computer engineering

I don’t think I used a structured framework like I recommend in this post. If I’m being completely honest, for a long time I was just following people around me with no clear idea of what I wanted to do.

Things I did right:

I liked electronics, so I knew I wanted to pursue a career in that domain (this is not as specific as I recommend)
I enjoyed fixing issues on a computer we had at home (usually things I broke in the first place) - as it turns out, that was a dress rehearsal for the debugging I do in my current job roles.

Things I did wrong:

For the longest time, I had no idea what a job in computer engineering would look like - in hindsight, I should have spoken to someone about it.
I did not put myself in situations that would have helped me make this decision quicker - I was not involved in any technical clubs at my university, and was not really expanding my knowledge beyond university coursework.

Level 2: Completing basic coursework

I think I went through this phase even before I decided I wanted to pursue computer engineering - this is because all the basic courses were mandatory at my university. (I still recommend completing level 1 before level 2 to get the best results.)

Things I did right:

I kept things simple - I followed the course curriculum without too much experimentation (this usually works for basic coursework)

Things I did wrong:

If I’m being very critical, I would have liked to put more effort into programming during this phase

Level 3: Getting some internship/work experience

Of all the levels, I would say this was my best, and it set me up nicely for what I wanted to do. First, I pursued a summer internship at a research lab (Central Electronics Engineering Research Institute in India), and soon after, I got involved as a student researcher with a professor at my university. Both of these proved to be pivotal.

Things I did right:

My first internship was not in the computer engineering domain - but I was able to talk to others about their projects, and I figured out two things:
- I wanted to pursue computer engineering
- I didn’t want to work in a lab setting (I prefer working on my computer)
I was very productive as a student researcher - I started producing results early, which got everyone in the group more involved, and it led to 3 journal publications in a year. I think it led to two very important outcomes:
- Working in research gave me a better sense of where computing is headed
- My CV got a big boost, which helped with future opportunities

Things I did wrong:

Hindsight is 20/20, but if I had to do it again, my summer internship would have been a corporate internship - that way, I could have covered more aspects of level 3.

Level 4: Going deeper into a sub-domain

If I could do one level from scratch, it would be this one. I don’t think I was very precise about a sub-domain, which resulted in a lack of depth in any one sub-domain.

Things I did right:

I put a lot of effort into the one VLSI and one Computer Architecture course I took up, which went on to save me in the future

Things I did wrong:

I was not prepared for the vastness of computer engineering - I was jumping between different sub-domains, and could never really pick one.
I also think I rushed through this level, especially considering that I was still unsure about the sub-domain I wanted to pursue.

Level 5: Applying to graduate school

I came to the decision to apply to graduate school very early in this level. This gave me enough time to build a profile that maximized my chances.

Things I did right:

I applied for and was accepted to the prestigious DAAD Working Internships in Science and Engineering (WISE), and pursued a summer internship at the Center for Cognitive Interaction Technology in Germany. I published a conference paper based on my research.
Soon after, I completed an undergraduate thesis at one of the best research universities in India (The Indian Institute of Science, Bengaluru). This rounded out a strong research-focused CV, which is what graduate schools prefer.
I managed the rest of the application prerequisites, applied, and got accepted into some of the top graduate programs in Computer Engineering in the US.

Things I did wrong:

I think the areas of interest I spoke about in my statement of purpose were still quite general (this is a direct consequence of not doing level 4 right)
Knowing what I know now about graduate schools in the US, my university choices and application strategy would have been quite different (I’ll save this for another post someday)

Bonus level: That elusive corporate internship

I had some time between level 5 and starting my graduate school, and I really wanted to use this time to intern at a company. As it turned out, this was a very important decision.

My research experience led me to an opportunity at Intel Labs, and I got to work on AI accelerator architectures. (This was in early 2020, long before the ChatGPT/Nvidia moment.) I still consider this one of my best internship experiences, and I gained a lot from it.

I got work experience at a company, which matters a lot when applying for other jobs (Spoiler alert, I needed it immediately)
The good thing about corporate roles, even in research, is that they help you understand what the different sub-domains are. This experience at Intel gave me a clearer idea that I wanted to pursue microarchitecture design.
I ended up with 3 patents in AI architecture, which I’m very proud of.

My Conclusion

As it turns out, when I completed my undergrad (in the year 2020), the world was going through something much bigger than my little computer engineering game. COVID-19 travel restrictions meant that I had to defer my Masters at UCLA. I was lucky to land a job at Xilinx (now AMD), despite making no efforts in this direction in level 5. (The fact that I pushed for the bonus level when I didn’t have to is what made the difference.) I eventually went to graduate school a year later, interned in the summer at Google, and landed a job in GPU microarchitecture at Qualcomm.

Final Thoughts

If you have actually read through this long post and reached here, I want to end by saying this:

You’ll be fine. You might mess up different levels (like I did), or life might throw you an unexpected curveball. In the end, successfully navigating your career boils down to one thing - managing anxiety. The only difference between a successful student and a failed one is that the former either never had to face anxiety (through some form of privilege), or dealt better with anxiety. In my experience, you can handle your anxiety better if:

You know where you are going;
And you are prepared for everything you will face on that path.

I hope this post can help you with both, and make you a successful computer engineer.

This post took a lot of effort to compile. My goal is to keep this post relevant over time through regular updates. So if you are reading this and find this post useful, please share it with someone you know.