Simplification of array access patterns for compiler optimizations http://portal.acm.org/citation.cfm?id=277650.277664 The authors present a novel technique named the LMAD (Linear Memory Access Descriptor). An LMAD consists of stride/span pairs plus a base offset. Analyzing complex array access patterns becomes simpler with LMADs, so compiler optimizations that would otherwise give up on such patterns become applicable.
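To make the idea concrete, here is a minimal sketch of an LMAD-like descriptor (all names and the layout are illustrative assumptions, not the paper's implementation): each dimension carries a (stride, span) pair with span = stride * (trip_count - 1), and the descriptor denotes the offsets base + sum_d k_d * stride_d.

    #include <stdio.h>

    #define MAX_DIMS 4

    typedef struct {
        long base;              /* base offset of the access            */
        int  ndims;             /* number of stride/span pairs          */
        long stride[MAX_DIMS];  /* distance between consecutive touches */
        long span[MAX_DIMS];    /* total extent covered per dimension   */
    } LMAD;

    /* Enumerate every offset a 2-D descriptor touches. */
    static void enumerate2d(const LMAD *d)
    {
        long i, j;
        for (i = 0; i <= d->span[0]; i += d->stride[0])
            for (j = 0; j <= d->span[1]; j += d->stride[1])
                printf("%ld ", d->base + i + j);
        printf("\n");
    }

    int main(void)
    {
        /* A[i][2*j], i in [0,3], j in [0,2], row length 8:
         * outer stride 8 (one row), inner stride 2 (every other column). */
        LMAD a;
        a.base = 0;  a.ndims = 2;
        a.stride[0] = 8;  a.span[0] = 8 * 3;
        a.stride[1] = 2;  a.span[1] = 2 * 2;
        enumerate2d(&a);
        return 0;
    }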
Hubert Nguyen
GPU Gems 3 http://portal.acm.org/citation.cfm?id=1407436 Chapter 31, Lars Nyland – Fast N-Body Simulation with CUDA. This article presents a parallel technique for an O(N*N) calculation. All calculations in this algorithm are independent, so they can all be executed simultaneously, but it requires N*N pairwise force calculations.
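A minimal CUDA sketch of the all-pairs structure, with one thread per body (the chapter's actual kernel additionally tiles bodies through shared memory, which this omits):

    // Each of the N threads accumulates force from all N bodies: O(N*N) total.
    __global__ void bodyForce(const float4 *pos, float3 *acc, int n,
                              float softening2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 pi = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);

        for (int j = 0; j < n; ++j) {
            float4 pj = pos[j];                    // pj.w holds the mass
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float r2 = dx * dx + dy * dy + dz * dz + softening2;
            float invR = rsqrtf(r2);
            float s = pj.w * invR * invR * invR;
            a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
        }
        acc[i] = a;
    }

    // Example launch: bodyForce<<<(n + 255) / 256, 256>>>(d_pos, d_acc, n, 1e-9f);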
John Nickolls
Scalable Parallel Programming with CUDA, Queue Vol.6 Issue 2, ’08 http://portal.acm.org/citation.cfm?id=1365500 Essentially just a programming guide; nothing more.
Shane Ryoo
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, PPoPP’08 http://portal.acm.org/citation.cfm?id=1345206.1345220 This is a very empirical paper. The authors describe the characteristics of the GeForce 8800 and derive optimization principles: run enough threads to hide memory latency, and make efficient use of memory bandwidth (e.g., by staging reused data in shared memory).
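A hedged illustration of the latency-hiding principle, using a generic grid-stride SAXPY kernel rather than anything from the paper:

    // Memory-bound kernel launched with far more threads than cores, so the
    // warp scheduler can overlap one warp's memory stalls with other warps.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        // Grid-stride loop: handles any n and keeps every SM busy.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: 1024 blocks of 256 threads oversubscribe the GPU,
    // giving each SM several resident warps to switch between while loads
    // are in flight.
    // saxpy<<<1024, 256>>>(n, 2.0f, d_x, d_y);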
Muthu Manikandan Baskaran
A compiler framework for optimization of affine loop nests for gpgpus, ICS’08 http://portal.acm.org/citation.cfm?id=1375527.1375562 They describe performance-critical characteristics of CUDA, such as coalescing when accessing global memory and bank conflicts when accessing shared memory. They derived the best-performing code versions through model-driven empirical search.
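Both pitfalls show up in the classic tiled-transpose pattern; this is the standard textbook sketch, not the paper's compiler framework:

    #define TILE 32

    // Global loads and stores are coalesced because consecutive threads
    // touch consecutive addresses; the +1 padding column keeps the threads
    // of a warp on distinct shared-memory banks during the transposed read.
    __global__ void transpose(float *out, const float *in,
                              int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;     // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced
    }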