For quite some time, I’ve been wanting to learn about GPUs. As someone in ML, I use them a lot, but I was really curious to understand how they actually work under the hood.
These are my notes on GPUs, made while working through Modal's GPU Glossary.
The GPU Glossary is an excellent resource, but it can feel a bit dense for absolute beginners. So I made these notes to simplify things and put them in a more digestible format. At many points, I also used Gemini to help me better understand some concepts.
I’ve covered most of the topics from the glossary, though not all; only the ones I found most relevant to my use case.
btw you can check out more of my work **here**.
CUDA ⇒ Compute Unified Device Architecture
Before CUDA, early GPUs were essentially hardware assembly lines built specifically for rendering 3D graphics: a pipeline with distinct, specialized hardware stages for each part of the process.
Starting with the G80 architecture (GeForce 8800), NVIDIA introduced a "unified" design. They replaced the separate, specialized processing blocks with a large array of identical, flexible processors called Streaming Multiprocessors (SMs).
The main subcomponents of a Streaming Multiprocessor are the CUDA Cores and (on recent GPUs) the Tensor Cores.
An SM can be thought of as the rough equivalent of a core on a CPU: it's the unit that actually executes instructions. The difference is that a single SM can execute far more threads in parallel than a CPU core can.
For now, think of a thread as an extremely lightweight worker that executes a sequence of instructions; we'll take a closer look at threads in an upcoming section.
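To make the idea of threads concrete, here's a minimal CUDA sketch (the kernel name `scale` and the launch sizes are just illustrative, not from the glossary). Every thread runs the same kernel function, but on its own element of the array, so thousands of threads can work in parallel across the GPU's SMs:

```cuda
#include <cstdio>

// Each GPU thread runs this same function (a "kernel"), but on its own
// array element, identified by its unique thread index.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {                                    // guard against extra threads
        data[i] = data[i] * factor;
    }
}

int main() {
    const int n = 1024;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // memory visible to both CPU and GPU
    for (int i = 0; i < n; i++) data[i] = 1.0f;

    // Launch 1024 threads (4 blocks of 256); they run in parallel on the GPU's SMs.
    scale<<<4, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();  // wait for the GPU to finish

    printf("data[0] = %f\n", data[0]);  // prints 2.000000
    cudaFree(data);
    return 0;
}
```

Even this toy launch spawns 1,024 threads at once, whereas a CPU would typically churn through the same loop with a handful of cores.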
Now let’s try to understand the Streaming Multiprocessor.