Squeezing performance out of CUDA
GPUs are capable of extremely fast computation, but that performance only comes with well-crafted code.
The following are the iterations I went through to squeeze performance out of a CUDA kernel for sparse matrix-vector multiplication in CSR format. Testing was done on a Quadro FX 5800 with CUDA 3.2 and involved multiplying a 52000x52000 sparse matrix with about 1% non-zero entries by a dense vector. All timings listed here are imprecise, and only meant to give an idea of relative performance.
Initial (naive) implementation
This is a pretty straightforward implementation, similar to what one would write as a serial CPU implementation. It uses one thread per row, each of which simply computes the dot product of its row with the vector and writes the result back to global memory. This kernel takes about 100 milliseconds to run, several times slower than the equivalent CPU code.
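A minimal sketch of what such a kernel looks like (the names, signature, and float/int CSR layout here are my assumptions, not the original code):

```
// Naive CSR SpMV: one thread per row. rowptr/cols/data are the CSR
// arrays, v is the input vector, out receives the result.
__global__ void spmv_naive(const int *rowptr, const int *cols,
                           const float *data, const float *v,
                           float *out, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows)
        return;

    float sum = 0.0f;
    for (int i = rowptr[row]; i < rowptr[row + 1]; ++i)
        sum += data[i] * v[cols[i]];   // gather from v through cols
    out[row] = sum;
}
```

The problem is the access pattern: thread 0 walks row 0, thread 1 walks row 1, and so on, so adjacent threads read words that are far apart in memory.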
Memory coalescing
Perhaps the most important factor in optimizing CUDA performance is to allow memory coalescing when accessing global memory. This is possible when sequential words of memory are accessed by sequential threads in a warp (thread 0 reads word 0, thread 1 reads word 1, etc.). This implementation achieves that by assigning multiple threads to each row, which access words at an offset equal to their thread index. Each thread's partial sum is stored in per-block shared memory, and those partial sums are then reduced into the final result.
This kernel allows reads from the data and cols vectors to be coalesced. The reduction works without __syncthreads() calls between each addition because the participating threads belong to the same warp and execute each step in lockstep. This is similar to the reduction example in the NVIDIA GPU Computing SDK.
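A sketch of this scheme, assuming one 32-thread warp per row and a warp-synchronous shared-memory reduction (valid on the pre-Volta hardware used here); the names carry over from the hypothetical naive kernel above:

```
#define BLOCK_SIZE 128   // threads per block; must match the launch config

// CSR SpMV with one warp per row. Consecutive lanes read consecutive
// words of data/cols, so those reads coalesce.
__global__ void spmv_warp_per_row(const int *rowptr, const int *cols,
                                  const float *data, const float *v,
                                  float *out, int num_rows)
{
    __shared__ volatile float sums[BLOCK_SIZE];
    int lane = threadIdx.x & 31;                              // index within the warp
    int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one warp per row
    if (row >= num_rows)
        return;

    // Each lane strides through the row 32 words at a time.
    float sum = 0.0f;
    for (int i = rowptr[row] + lane; i < rowptr[row + 1]; i += 32)
        sum += data[i] * v[cols[i]];
    sums[threadIdx.x] = sum;

    // Warp-synchronous tree reduction: the lanes run in lockstep, so no
    // __syncthreads() is needed between steps ("volatile" keeps the
    // intermediate sums out of registers).
    if (lane < 16) sums[threadIdx.x] += sums[threadIdx.x + 16];
    if (lane <  8) sums[threadIdx.x] += sums[threadIdx.x +  8];
    if (lane <  4) sums[threadIdx.x] += sums[threadIdx.x +  4];
    if (lane <  2) sums[threadIdx.x] += sums[threadIdx.x +  2];
    if (lane <  1) sums[threadIdx.x] += sums[threadIdx.x +  1];

    if (lane == 0)
        out[row] = sums[threadIdx.x];   // lane 0 holds the row's dot product
}
```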
This kernel improved performance to about 5 milliseconds.
Texture Memory
Although reads from data and cols are now coalesced, accesses to the v vector are still non-coalesced, since v is indexed through cols and the reads are effectively random. Caching v in texture memory makes these random reads faster, because they go through the texture cache.
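With the texture reference API from this era of CUDA, the change is small (a sketch; v_tex and d_v are hypothetical names):

```
// Texture reference over the input vector.
texture<float, 1, cudaReadModeElementType> v_tex;

// Host side, before launching the kernel:
//     cudaBindTexture(NULL, v_tex, d_v, num_cols * sizeof(float));

// In the kernel, the gather from v becomes a texture fetch:
//     sum += data[i] * tex1Dfetch(v_tex, cols[i]);
```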
Using texture memory increased performance to about 3 milliseconds.
Other potential improvements
I implemented several other tweaks, each of which yielded a small additional speedup. Results may vary:
- Summing more rows per block (more threads per row), while ensuring that occupancy stays high (profile with COMPUTE_PROFILE=1).
- Adjusting the number of blocks run.
- Aligning each row to a cache line achieves a small speedup, since the coalesced reads will also be aligned; a host-side sketch of this padding follows the list.
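Something along these lines, assuming rows are padded with zero-valued entries to a multiple of 16 words (a 64-byte alignment target; the function and its names are hypothetical):

```
#include <vector>

// Build a padded copy of a CSR matrix in which every row starts on a
// 64-byte boundary. Padding entries use column 0 and value 0, so they
// contribute nothing to the dot products.
void pad_csr(const std::vector<int> &rowptr,
             const std::vector<int> &cols,
             const std::vector<float> &data,
             std::vector<int> &prowptr,
             std::vector<int> &pcols,
             std::vector<float> &pdata)
{
    int num_rows = (int)rowptr.size() - 1;
    prowptr.assign(num_rows + 1, 0);
    for (int r = 0; r < num_rows; ++r) {
        int len = rowptr[r + 1] - rowptr[r];
        prowptr[r + 1] = prowptr[r] + ((len + 15) / 16) * 16;  // round up to 16 words
    }
    pcols.assign(prowptr[num_rows], 0);
    pdata.assign(prowptr[num_rows], 0.0f);
    for (int r = 0; r < num_rows; ++r)
        for (int i = 0; i < rowptr[r + 1] - rowptr[r]; ++i) {
            pcols[prowptr[r] + i] = cols[rowptr[r] + i];
            pdata[prowptr[r] + i] = data[rowptr[r] + i];
        }
}
```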
Implementing these further increased performance to about 2 milliseconds.
A final word
If you are in fact writing something similar, existing libraries like cuBLAS and cuSPARSE are likely much faster than anything you could write in a reasonable amount of time. The NVIDIA GPU-Accelerated Libraries page is a good place to start. The excellent CUSP C++ library also has a CSR kernel almost equivalent to this one, and is very convenient for most sparse matrix applications.