CUDA — GPU Memory Architecture
Most desktop and laptop computers consist of a CPU connected to a large amount of system memory, with two or three levels of fully coherent cache in between. Understanding the basic memory architecture of a system like this is highly beneficial for a programmer who wants to write efficient programs for that particular system. So here we will look at the GPU memory architecture and compare it with a general CPU architecture.
A GPU is a hardware device which contains many small hardware units called SMs (Streaming Multiprocessors). Each SM can execute many threads concurrently, but these threads are not exactly the same as the threads run by a CPU.
These GPU threads are grouped physically. A physical thread group is called a “warp”, and it contains 32 threads. As stated, a GPU contains many SMs, and each SM can execute many threads concurrently. So where do these warps (groups of 32 threads) fit into the picture? Imagine an SM that can execute 2048 threads concurrently. That SM can actually execute 2048 / 32 = 64 warps concurrently, and that is how warps fit into the picture.
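If you want to see these numbers for your own GPU, you can query them at runtime. The short host-side sketch below (assuming device 0 and the CUDA runtime API) prints the SM count, the maximum resident threads per SM, and the warp size, then derives the warps-per-SM figure using the same division as above; the 2048 used in the text is just an example value, and your GPU may report something different.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: query device 0 and derive how many warps an SM can hold.
int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("SMs (multiprocessors)     : %d\n", prop.multiProcessorCount);
    printf("Max threads per SM        : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size                 : %d\n", prop.warpSize);
    printf("Max resident warps per SM : %d\n", warpsPerSM);  // e.g. 2048 / 32 = 64
    return 0;
}
```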
How GPU threads differ from CPU threads is that different CPU threads can work on different instructions (say, an addition and a multiplication) concurrently. But all 32 threads in a warp execute the same instruction concurrently; two threads in a single warp cannot operate on two different instructions at the same time.
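To make this lockstep behaviour concrete, here is a small illustrative kernel (a hypothetical sketch, not taken from any particular codebase) that computes each thread's warp and lane index and then branches on the lane. Because all 32 lanes of a warp share one instruction stream, the hardware runs the two sides of the branch one after the other with the inactive lanes masked off, rather than truly in parallel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread works out which warp it belongs to and its
// lane within that warp, then takes one of two branches based on the lane.
__global__ void warpDemo() {
    int tid    = threadIdx.x;
    int warpId = tid / warpSize;   // warpSize is a built-in constant (32 on current GPUs)
    int laneId = tid % warpSize;

    if (laneId < 16) {
        // Lanes 0-15 of each warp execute this path first...
        printf("warp %d, lane %2d: first half\n", warpId, laneId);
    } else {
        // ...then lanes 16-31 execute this path while the first half sits idle.
        printf("warp %d, lane %2d: second half\n", warpId, laneId);
    }
}

int main() {
    warpDemo<<<1, 64>>>();         // one block of 64 threads = 2 warps
    cudaDeviceSynchronize();       // wait so the device printf output is flushed
    return 0;
}
```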