Compilers and More: A GPU and Accelerator Programming Model

http://www.hpcwire.com/specialfeatures/sc08/features/Compilers_and_More_A_GPU_and_Accelerator_Programming_Model.html

Accelerators support two levels of parallelism: an outer fully parallel doall loop level, and an inner synchronous (SIMD or vector) loop level. The keys to tuning are minimizing (and perhaps optimizing) the data traffic between the host and the accelerator, and selecting a schedule for the parallelism. The first is relatively easy, but the second is a difficult problem. This model does use reasonably sophisticated compiler analysis, but nothing that hasn't been implemented in commercial parallelizing compilers for many years, such as classical data-flow and array region analysis. The compiler would, however, have to search among the possible schedules and select the best one; note to academics: this is still a fertile area for continued research.
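
For concreteness, here is a minimal sketch (mine, not from the article) of how those two levels typically map onto CUDA: the outer doall loop becomes the grid of thread blocks, and the inner synchronous loop becomes the threads within a block. The kernel and array names are made up for illustration.

    // Sequential loop nest being mapped:
    //   for (int i = 0; i < n; ++i)        // outer, fully parallel (doall)
    //     for (int j = 0; j < m; ++j)      // inner, synchronous / vector
    //       c[i*m + j] = a[i*m + j] + b[i*m + j];
    __global__ void add2d(const float *a, const float *b, float *c, int n, int m)
    {
        int i = blockIdx.x;                 // one thread block per outer iteration
        if (i >= n) return;
        for (int j = threadIdx.x; j < m; j += blockDim.x)   // threads cover the inner loop
            c[i*m + j] = a[i*m + j] + b[i*m + j];
    }
    // Launch with one block per outer iteration, e.g. add2d<<<n, 256>>>(da, db, dc, n, m);

Choosing the block size (256 here) and the block-per-iteration mapping is exactly the kind of scheduling decision the compiler would have to search over.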

Compilers and More: Optimizing GPU Kernels

http://www.hpcwire.com/features/Compilers_and_More_Optimizing_GPU_Kernels.html

The article asks how sensitive GPU performance is to the formulation of your kernel, and how much (and what kind of) experimentation you will need to do to optimize it. How much of the optimization process carries over from one GPU to another, or from one generation to the next from the same vendor? The cost of porting a nontrivial application to a GPU is high, though the potential performance is alluring.
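
As an illustration of that sensitivity (my own sketch, not an example from the article), these two CUDA kernels do the same work, scaling an n-by-m row-major matrix, but the first has each thread walk a row, so neighboring threads touch addresses m floats apart, while the second has neighboring threads touch adjacent addresses. The second version's memory accesses are coalesced, and on most GPUs it is substantially faster even though the arithmetic is identical.

    // Version 1: one thread per ROW, loop over columns. At each step,
    // neighboring threads access addresses m floats apart -> uncoalesced.
    __global__ void scale_by_row(const float *a, float *b, int n, int m)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n)
            for (int col = 0; col < m; ++col)
                b[row*m + col] = 2.0f * a[row*m + col];
    }

    // Version 2: one thread per COLUMN, loop over rows. At each step,
    // neighboring threads access adjacent addresses -> coalesced.
    __global__ void scale_by_col(const float *a, float *b, int n, int m)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col < m)
            for (int row = 0; row < n; ++row)
                b[row*m + col] = 2.0f * a[row*m + col];
    }

Which formulation wins, and by how much, depends on the memory system of the particular GPU, which is precisely why the optimization work may not carry over from one chip or generation to the next.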

Compilers and More: Programming GPUs Today

http://www.hpcwire.com/features/Compilers_and_More_Programming_GPUs_Today.html

This article presents a detailed and precise description of what happens when a CUDA kernel runs: data and code are downloaded to the device, and the kernel has been translated and optimized for the target GPU. OpenCL seems to be a step backwards, from a compiler-oriented solution aimed at raising the level of programming to a library-oriented solution aimed at giving low-level control to the programmer. Low-level control is not bad; saying so would be like saying assembly language is bad. It is not bad, but it is only necessary for a very small part of the programming we do today. Many woodworkers prefer hand tools, and they can make beautiful furniture, but the cost is high and productivity is low. We need hand tools, but we will be much more productive with better power tools. Compiler aids for OpenCL will be important!
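
To make the "low-level control" point concrete, here is a minimal host-side sketch (mine, not from the article) of the choreography CUDA requires, and OpenCL requires even more verbosely, just to run one kernel; this is exactly the bookkeeping a compiler-oriented approach would generate for you. The add2d kernel is the hypothetical one sketched earlier.

    #include <cuda_runtime.h>

    __global__ void add2d(const float *a, const float *b, float *c, int n, int m);

    void run_on_gpu(const float *ha, const float *hb, float *hc, int n, int m)
    {
        size_t bytes = (size_t)n * m * sizeof(float);
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);                     // allocate device memory
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // download inputs to the GPU
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
        add2d<<<n, 256>>>(da, db, dc, n, m);                 // launch the translated, optimized kernel
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   // read the result back
        cudaFree(da); cudaFree(db); cudaFree(dc);            // clean up
    }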

Compilers and More: GPU Architecture and Applications

http://www.hpcwire.com/features/Compilers_and_More_GPU_Architecture_and_Applications.html

I think Wolfe is a genius; Wolfe rules! To get high performance, your program needs enough parallelism to keep the thread processors busy (lots of customers). Each of your thread blocks needs enough parallelism to fill all the thread processors in a multiprocessor (big tour groups), and you need at least as many thread blocks as you have multiprocessors (many big tour groups). In addition, you need even more parallelism to keep the multiprocessors busy when they need to switch threads past a long-latency memory operation (many, many big tour groups).
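
A small sketch of that rule of thumb (my own, with made-up numbers): pick a block size big enough to fill a multiprocessor, and check that the problem supplies several blocks per multiprocessor so the hardware has spare work to switch to while other threads wait on memory.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical helper illustrating the advice: big "tour groups" (threads per
    // block) and many of them (blocks per multiprocessor) to hide memory latency.
    void pick_launch_config(int work_items, int *blocks, int *threads)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        *threads = 256;                                      // enough threads to fill a multiprocessor
        *blocks  = (work_items + *threads - 1) / *threads;   // enough blocks to cover the work

        // "Many, many big tour groups": several blocks per multiprocessor, or there
        // is too little parallelism to switch past long-latency memory operations.
        if (*blocks < 8 * prop.multiProcessorCount)
            printf("warning: only %d blocks for %d multiprocessors; the GPU will likely be underutilized\n",
                   *blocks, prop.multiProcessorCount);
    }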

Compilers and More: Parallel Programming Made Easy?

http://www.hpcwire.com/features/Compilers_and_More_Parallel_Programming_Made_Easy.html

Sequential programming is really hard, and parallel programming is a step beyond that. The best we can hope for is to make parallel programming not much harder than sequential programming.

Compilers and More: Are Optimizing Compilers Important?

http://www.hpcwire.com/features/Compilers_and_More_Are_Optimizing_Compilers_Important.html

The SPEC CPU benchmarks have been used to measure processor, system, and compiler performance. Developing compiler optimization techniques targeted specifically at the benchmarks is meaningful.

Compilers and More: Productivity and Compilers

http://www.hpcwire.com/features/17901534.html

Productivity is defined as Speedup / SLOC, where SLOC (source lines of code) estimates the programming effort. Fifty years ago, the Fortran developers had the same goal in mind: delivering the same performance as machine language, with much lower programming effort. Wolfe presents four ways to improve productivity. The first is better hardware; but hardware gains now come as on-chip parallelism rather than higher clock speed, and exploiting that parallelism is still the programmer's job. The second is faster algorithms; but new algorithm development is quite expensive. The third is to use high-performance libraries, but this approach has many restrictions. The last is new languages and good parallel compilers. 😎 Modern compilers can deliver the same performance without requiring programmers to think about low-level details.
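
As a worked illustration of the metric (the numbers are hypothetical, not from the article): suppose a hand-tuned GPU port runs 10 times faster than a sequential baseline of S lines but takes 4 times as many lines, while a directive-based port runs 6 times faster with only 1.2 times the lines. Then

    \text{Productivity} = \frac{\text{Speedup}}{\text{SLOC}}, \qquad
    P_{\text{hand-tuned}} = \frac{10}{4S} = \frac{2.5}{S}, \qquad
    P_{\text{directives}} = \frac{6}{1.2S} = \frac{5}{S}

so by this measure the lower-effort port is the more productive one even though it is slower, which is exactly the trade-off the metric is meant to expose.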

Compilers and More: Accelerating High Performance

http://www.hpcwire.com/topic/processors/Compilers_and_More_Accelerating_High_Performance.html

High performance computing will soon be dominated by accelerator-based systems. The major chip vendors will keep developing multicores rather than accelerators, because HPC is not a high-volume business. Moore's law still works, so on-chip density will keep increasing predictably, and since accelerators are farther behind on that curve, they can enjoy more of that benefit. Therefore, developing parallel compilers for accelerators, especially GPUs, will be an important task. Just do it!

How We Should Program GPGPUs

http://www.linuxjournal.com/article/10216

A programming strategy is always a trade-off between performance and convenience. The Cray Fortran Translator was the first widely used vectorizing compiler. It did an effective job of vectorization and gave performance feedback to the user, and that feedback made users comfortable working within the vectorizable subset of the language. Automatic parallelizing compilers share five common steps: Select Region, Data Analysis, Parallelism Analysis, Parallelism Mapping, and Code Generation. It is better to leave the first step to the programmer. The second step is within the scope of current compiler technology. Traditional vectorizing and parallelizing compiler techniques are mature enough to handle the third step. The fourth and the last are up to you. 🙂

PGI Articles

http://www.pgroup.com/resources/articles.htm

Michael Wolfe