CUDA — GPU Memory Architecture

Ashan Priyadarshana
4 min read · Feb 25, 2018

Most desktop and laptop computers consist of a CPU connected to a large amount of system memory, fronted by two or three levels of fully coherent cache. Understanding the basic memory architecture of a system like this is highly beneficial for a programmer who wants to write efficient programs for it. So here we will examine the GPU memory architecture and compare it with a typical CPU architecture.

GTX 960 — image from nvidia.com

A GPU is a hardware device that contains multiple small hardware units called SMs (Streaming Multiprocessors). Each SM can execute many threads concurrently, but these threads are not exactly the same as the threads run by a CPU.

GPU threads are grouped physically, and a physical thread group is called a “warp”, which contains 32 threads. As stated, a GPU contains many SMs, and each SM can execute many threads concurrently. So where do warps (groups of 32 threads) fit into the picture? Imagine an SM that can execute 2048 threads concurrently: that SM can execute 2048/32 = 64 warps concurrently. That is how warps fit into the picture.
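This warp arithmetic can be read straight off the device itself. A minimal sketch (the printed numbers depend on whichever GPU is installed; the 2048-threads-per-SM figure above is just one common value):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // Max resident warps per SM = max resident threads per SM / warp size
    int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("SM count:               %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size:              %d\n", prop.warpSize);
    printf("Max warps per SM:       %d\n", warpsPerSM);
    return 0;
}
```

On a device reporting 2048 max threads per SM and a warp size of 32, `warpsPerSM` comes out to the 64 warps mentioned above.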

How GPU threads differ from CPU threads is that different CPU threads can work on different instructions (say, an addition and a multiplication) concurrently, whereas all 32 threads in a warp can only execute the same instruction concurrently; no two threads in a single warp can operate two…
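This same-instruction, different-data execution style (NVIDIA calls it SIMT) is exactly what a simple CUDA kernel exploits. A minimal sketch, assuming a hypothetical `scale` kernel launched as one block of 64 threads, i.e. two full warps:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Every thread in a warp runs this same instruction stream in lockstep,
// each thread on its own array element (SIMT).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // out-of-range threads simply do nothing
        data[i] *= factor;    // same multiply instruction, different data
}

int main() {
    const int n = 64;         // two warps' worth of elements
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, 64>>>(dev, 2.0f, n);   // one block = two warps of 32 threads
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %.1f\n", host[0]);
    return 0;
}
```

Note the kernel contains no per-thread branching on the hot path: every thread in a warp executes the identical multiply, just at a different index, which is why warps run at full speed here.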

Ashan Priyadarshana

Associate Technical Lead | BSc. Information Technology | MSc. Artificial Intelligence | Founder Programming.lk | GSoC 2017 |