Compilers and More: A GPU and Accelerator Programming Model http://www.hpcwire.com/specialfeatures/sc08/features/Compilers_and_More_A_GPU_and_Accelerator_Programming_Model.html Accelerators support two levels of parallelism: an outer fully-parallel doall loop level, and an inner synchronous (SIMD or vector) loop level. The keys to tuning are minimizing and perhaps optimizing
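To make the two levels concrete for myself, here is a minimal sketch in plain serial C. It is not the article's directive syntax; the function name `scale_add_two_level` and the scale-and-add kernel are hypothetical, chosen only for illustration. The comments mark which loop would map to the outer doall level and which to the inner synchronous level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: out[i][j] = alpha * a[i][j] + b[i][j],
 * with all matrices stored row-major as flat arrays. */
void scale_add_two_level(size_t rows, size_t cols, float alpha,
                         const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    /* Outer level: a "doall" loop. Each row is independent of every
     * other row, so iterations could run in any order or fully in
     * parallel (for example, one row per multiprocessor). */
    for (size_t i = 0; i < rows; ++i) {
        /* Inner level: every iteration performs the same operation on
         * adjacent elements, so it could run synchronously as SIMD or
         * vector lanes. */
        for (size_t j = 0; j < cols; ++j) {
            size_t k = i * cols + j;
            out[k] = alpha * a[k] + b[k];
        }
    }
}
```

The loops here run one after another; the point is only the shape of the loop nest that such a programming model looks for.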
Vasily Volkov
Benchmarking GPUs to tune dense linear algebra, SC’08 http://portal.acm.org/citation.cfm?id=1413370.1413402 They characterized GPUs through empirical experiments and adapted older methods used for vector processors to GPUs. Frankly, I cannot understand this paper. 🙁 I will