GPU Gems 3
http://portal.acm.org/citation.cfm?id=1407436
Chapter 31 Lars Nyland – Fast N-Body Simulation with CUDA
This article presents a parallel technique for an O(N*N) calculation. All of the pairwise calculations in this algorithm are independent, so in principle they can all be executed simultaneously. But doing so naively needs N*N-sized memory space, which cannot be allocated in GPU memory. The authors present a tiled calculation that exploits data reuse to keep the arithmetic units busy, and this calculation comes close to the peak GFLOPS. The tile calculation technique can be adapted to a hypernetwork. * I was able to improve the hypernetwork's performance by 50% using loop fission, loop tiling, loop unrolling and constant memory. The current CUDA version is now 60 times faster than the sequential one, and outperforms the original version by 9000x. 😉
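To make the tiling idea concrete, here is a minimal CPU-side sketch in Python, not the chapter's CUDA kernel. The names `interaction`, `naive_accel`, `tiled_accel`, the tile size and the softening value `eps2` are all assumptions made for illustration. The slice `tile_pos` stands in for the block of bodies that a thread block would load into shared memory once and then reuse for every body `i`, so no N*N table is ever stored.

```python
import math

def interaction(p_i, p_j, m_j, eps2):
    """Acceleration on body i caused by body j (softened, so i == j gives zero)."""
    r = [p_j[k] - p_i[k] for k in range(3)]
    dist2 = r[0] * r[0] + r[1] * r[1] + r[2] * r[2] + eps2
    inv_dist3 = 1.0 / (dist2 * math.sqrt(dist2))
    return [m_j * inv_dist3 * r[k] for k in range(3)]

def naive_accel(pos, mass, eps2=1e-2):
    """Reference all-pairs loop: every body i visits every body j."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            f = interaction(pos[i], pos[j], mass[j], eps2)
            for k in range(3):
                acc[i][k] += f[k]
    return acc

def tiled_accel(pos, mass, tile=4, eps2=1e-2):
    """Same result, but the j bodies are consumed one tile at a time."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for start in range(0, n, tile):
        # Stand-in for loading one tile of bodies into shared memory.
        tile_pos = pos[start:start + tile]
        tile_mass = mass[start:start + tile]
        # On a GPU each i would be a separate thread reusing the same tile.
        for i in range(n):
            for p_j, m_j in zip(tile_pos, tile_mass):
                f = interaction(pos[i], p_j, m_j, eps2)
                for k in range(3):
                    acc[i][k] += f[k]
    return acc
```

Only the positions, the masses and one tile are held at any time, which is the data-reuse point the chapter makes. The loop order is the only thing that changes between the two functions.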
Chapter 32 Scott Le Grand – Broad-Phase Collision Detection with CUDA
blahblah
Chapter 35 Elizabeth Seamans – Fast Virus Signature Matching on the GPU
This paper presents an intrapacket approach to scanning network packets for viruses. The authors show the similarity between network processors and NVIDIA GPUs, and the possibility of using GPUs as network processors. Each network packet is divided into chunks, and one thread processes one chunk. This is a very easy and straightforward parallel approach. The method supports only fixed-length virus signatures. It demonstrates cooperation between the CPU and GPU, but the performance is poor. 🙁
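The chunk-per-thread idea can be sketched on the CPU as follows. This is only an illustration under my own assumptions: `scan_chunk`, `scan_packet`, the chunk size and the plain Python set used for lookups are hypothetical and are not the chapter's actual matching data structure. Each call to `scan_chunk` plays the role of one thread, and it reads a few bytes past its chunk so that a signature straddling a chunk boundary is not missed.

```python
def scan_chunk(packet, signatures, sig_len, start, chunk_size):
    """Work of one 'thread': test every start offset that lies in its chunk."""
    hits = []
    # Offsets must leave room for a full signature inside the packet.
    end = min(start + chunk_size, len(packet) - sig_len + 1)
    for offset in range(start, end):
        # The window may extend past the chunk boundary by sig_len - 1 bytes.
        window = packet[offset:offset + sig_len]
        if window in signatures:
            hits.append((offset, window))
    return hits

def scan_packet(packet, signatures, chunk_size=8):
    """Split the packet into chunks and scan each chunk independently."""
    if not signatures:
        return []
    sig_len = len(next(iter(signatures)))
    # Fixed-length signatures only, as in the approach described above.
    if any(len(s) != sig_len for s in signatures):
        raise ValueError("all signatures must have the same length")
    hits = []
    for start in range(0, len(packet), chunk_size):
        # On a GPU these iterations would run as parallel threads.
        hits.extend(scan_chunk(packet, signatures, sig_len, start, chunk_size))
    return hits
```

Because every chunk is scanned independently, there is no communication between threads, which is why the approach is so easy to parallelize.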
Chapter 37 Lee Howes – Efficient Random Number Generation and Application Using CUDA
This paper shows that a good RNG is very important in Monte Carlo methods. Poor RNG quality can ruin the results of a Monte Carlo application. The paper presents fast parallel RNG methods with extremely good statistical quality, implemented in CUDA.
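A toy illustration of why RNG quality matters, assuming nothing from the chapter itself: the sketch below estimates π by Monte Carlo with two generators. `TinyLCG` is a deliberately bad generator I made up for this note (a linear congruential generator with only 16 states), and `estimate_pi` is a hypothetical helper. The good generator is Python's built-in Mersenne Twister from the `random` module.

```python
import math
import random

class TinyLCG:
    """Deliberately poor generator: x -> (5x + 3) mod 16, so only 16 states."""
    def __init__(self, seed=0):
        self.state = seed

    def random(self):
        self.state = (5 * self.state + 3) % 16
        return self.state / 16.0

def estimate_pi(rng, samples):
    """Fraction of random points in the unit square that fall in the quarter circle."""
    inside = 0
    for _ in range(samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / samples

good = estimate_pi(random.Random(12345), 100000)
bad = estimate_pi(TinyLCG(), 100000)
print("good generator error:", abs(good - math.pi))
print("bad generator error: ", abs(bad - math.pi))
```

The bad generator keeps revisiting the same handful of points, so adding more samples does not bring its estimate any closer to π, while the good generator's error shrinks as expected.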
Chapter 38 Bernard Deschizeaux – Imaging Earth’s Subsurface Using CUDA
A trivial paper. There is nothing to learn. 😥
Chapter 39 Mark Harris – Parallel Prefix Sum (Scan) with CUDA
Scan is a simple and common parallel building block. It can be used in stream compaction, summed-area tables and radix sort. This paper implements in CUDA the techniques that Blelloch proposed in the early '90s. The authors optimized the CUDA source by taking the memory hierarchy into account, for example shared memory bank conflicts. All of the principal algorithms and methods already existed. As Bernard of Chartres said, we are “standing on the shoulders of giants”. =)
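For reference, the work-efficient exclusive scan attributed to Blelloch can be sketched sequentially in Python as below. This is a CPU simulation under my own naming (`blelloch_exclusive_scan` is hypothetical), not the chapter's CUDA code, and it ignores shared memory and bank conflicts entirely. The iterations of each inner `for` loop touch disjoint elements, so on a GPU they would run as parallel threads with a barrier between levels.

```python
def blelloch_exclusive_scan(values):
    """Exclusive prefix sum using an up-sweep (reduce) and a down-sweep."""
    n = len(values)
    # Pad to a power of two so the implicit tree is balanced.
    size = 1
    while size < n:
        size *= 2
    data = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in place, level by level.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2

    return data[:n]

print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))
# -> [0, 3, 4, 11, 11, 15, 16, 22]
```

Stream compaction follows directly from this: scan a list of 0/1 keep flags, and each kept element's scan value is its index in the compacted output.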