For quite some time, I’ve been wanting to learn about GPUs. As someone in ML, I use them a lot, but I was really curious to understand how they actually work under the hood.
These are my notes on GPUs, made while working through Modal's GPU Glossary.
The GPU Glossary is an excellent resource, but it can feel a bit dense for absolute beginners. So I made these notes to simplify things and put them in a more digestible format. At many points, I also used Gemini to help me better understand some concepts.
I’ve covered most of the topics from the glossary, though not all; only the ones I found most relevant to my use case.
btw you can check out more of my work **here**.
CUDA ⇒ Compute Unified Device Architecture
Before CUDA, early GPUs were essentially hardware assembly lines built specifically for rendering 3D graphics: a pipeline with distinct, specialized hardware stages for each part of the process.
Starting with the G80 architecture (GeForce 8800), NVIDIA introduced a "unified" design. They replaced the separate, specialized processing blocks with a large array of identical, flexible processors called Streaming Multiprocessors (SMs).
The main subcomponents of a Streaming Multiprocessor are the CUDA Cores and (on recent GPUs) the Tensor Cores.
An SM can be thought of as the rough equivalent of a core on a CPU: it's the unit that actually executes instructions. The difference is that a single SM can execute far more threads in parallel than a CPU core can.
For now, think of a thread as an extremely lightweight worker that executes a sequence of instructions; we'll take a closer look at threads in an upcoming section.
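To make the idea of threads concrete, here's a minimal CUDA sketch (the kernel name `scale` and the launch sizes are just illustrative, not from the glossary). Every thread runs the same kernel function, but on its own element of the array, so thousands of threads can work in parallel across the GPU's SMs:

```cuda
#include <cstdio>

// Each GPU thread runs this same function (a "kernel"), but on its own
// array element, identified by its unique thread index.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {                                    // guard against extra threads
        data[i] = data[i] * factor;
    }
}

int main() {
    const int n = 1024;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // memory visible to both CPU and GPU
    for (int i = 0; i < n; i++) data[i] = 1.0f;

    // Launch 1024 threads (4 blocks of 256); they run in parallel on the GPU's SMs.
    scale<<<4, 256>>>(data, 2.0f, n);
    cudaDeviceSynchronize();  // wait for the GPU to finish

    printf("data[0] = %f\n", data[0]);  // prints 2.000000
    cudaFree(data);
    return 0;
}
```

Even this toy launch spawns 1,024 threads at once, whereas a CPU would typically churn through the same loop with a handful of cores.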
Now let’s try to understand the Streaming Multiprocessor.