Scalable Parallel Programming with CUDA, Queue Vol.6 Issue 2, ’08 http://portal.acm.org/citation.cfm?id=1365500 This is essentially a programming guide, nothing more.
Shane Ryoo
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, PPoPP’08 http://portal.acm.org/citation.cfm?id=1345206.1345220 This is a very empirical paper. It describes the characteristics of the GeForce 8800 and a set of optimization principles, such as running enough threads to hide memory latency and memory bandwidth
Compiler Loop Optimization
http://en.wikipedia.org/wiki/Compiler_optimization http://en.wikipedia.org/wiki/Loop_optimization
Muthu Manikandan
A compiler framework for optimization of affine loop nests for gpgpus, ICS’08 http://portal.acm.org/citation.cfm?id=1375527.1375562 They describe CUDA characteristics such as coalescing when accessing global memory and bank conflicts when accessing shared memory. They derived the best
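To make the two memory effects in this entry concrete, here is a minimal sketch of a toy cost model. It is not taken from the paper and it is not a hardware simulator: the warp size, segment size, and bank count are assumed values chosen for illustration (real values differ between GPU generations), and the function names are hypothetical.

```python
# Toy model of coalescing and bank conflicts. All parameters below are
# assumptions for illustration, not figures from the paper.
WARP_SIZE = 32       # threads that issue a memory access together
SEGMENT_BYTES = 128  # size of one global-memory transaction
NUM_BANKS = 32       # shared-memory banks
WORD_BYTES = 4       # width of one word and of one bank


def global_transactions(stride_words):
    """Count distinct segments touched when thread t reads word t * stride_words."""
    addresses = [t * stride_words * WORD_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


def bank_conflict_degree(stride_words):
    """Return the largest number of distinct words that fall into one bank."""
    words_per_bank = {}
    for t in range(WARP_SIZE):
        word = t * stride_words
        words_per_bank.setdefault(word % NUM_BANKS, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        print(stride, global_transactions(stride), bank_conflict_degree(stride))
```

In this model a unit stride touches a single segment and causes no bank conflict, while a stride equal to the bank count sends every thread to the same bank. An odd stride such as 33 avoids shared-memory conflicts but still spreads global accesses over many segments, which is why the two effects have to be considered separately.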
Zheng Wang
Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, PPoPP’09 http://portal.acm.org/citation.cfm?id=1504176.1504189 They extract static code features such as operations, control flow, memory accesses, and binary and bitwise operations using LLVM, and collect data features such as kernel loop counts, L1 dcache miss and
Louis-Noel Pouchet
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time, CGO’07 http://portal.acm.org/citation.cfm?id=1252537 They build polyhedral models from a sequential program using a method that Paul Feautrier proposed in 1992. Many schedules can be derived from the polyhedral model, so they choose only the legal
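The idea of keeping only legal schedules can be illustrated with a small brute-force sketch. This is not the paper's construction: it assumes a hypothetical one-statement loop, a one-dimensional affine schedule theta(i) = c*i + d, and a tiny hand-picked range of coefficients, and it checks legality by testing every dependence directly.

```python
from itertools import product

N = 8  # hypothetical loop bound

# Flow dependences of the toy loop:
#     for i in range(1, N): a[i] = a[i - 1] + 1
# Each pair (src, dst) means iteration dst reads the value written by src.
dependences = [(i - 1, i) for i in range(1, N)]


def is_legal(c, d):
    """theta(i) = c*i + d is legal if each source runs strictly before its target."""
    def theta(i):
        return c * i + d
    return all(theta(dst) - theta(src) >= 1 for src, dst in dependences)


# Candidate coefficients (c, d) drawn from a small assumed search range.
candidates = list(product(range(-2, 3), range(0, 2)))
legal = [(c, d) for c, d in candidates if is_legal(c, d)]

if __name__ == "__main__":
    print(len(candidates), "candidates,", len(legal), "legal:", legal)
```

For this loop the dependence distance is always 1, so a schedule is legal exactly when c >= 1. Schedules with c <= 0 would run a reader before or at the same time as its writer and are discarded.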