Inoue Takehiko

infinite

Takehiko Inoue and Minoru Furuya are my favorite comic artists. Especially VAGABOND by Inoue is something more than a comic; it tells us about life. I hope Koreans and Japanese get along well with each other.

Michael Wolfe

Compilers and More: A GPU and Accelerator Programming Model

http://www.hpcwire.com/specialfeatures/sc08/features/Compilers_and_More_A_GPU_and_Accelerator_Programming_Model.html

Accelerators support two levels of parallelism: an outer fully parallel doall loop level and an inner synchronous (SIMD or vector) loop level. The keys to tuning are minimizing, and perhaps optimizing, the data traffic between the host and the accelerator, and selecting a schedule for the parallelism. The first is easy, but the second is a difficult problem. This model does use reasonably sophisticated compiler analysis, but nothing that hasn't been implemented in commercial parallelizing compilers for many years, such as classical data flow and array region analysis. But the compiler would have to search among the possible schedules and select the best one; note to academics: this is still a fertile area for continued research.
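
To make the two levels concrete, here is a minimal CUDA sketch of my own (not from the article): the grid of blocks is the outer fully-parallel doall level, the threads within a block are the inner synchronous level, and the cudaMemcpy calls are exactly the host-accelerator data traffic that the tuning has to minimize.

```cuda
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // inner, synchronous level: the threads of one block run in lock-step warps
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // outer doall index
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes); cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);   // data traffic in
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);    // one possible schedule

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);   // data traffic out
    cudaFree(dx); cudaFree(dy); delete[] hx; delete[] hy;
    return 0;
}
```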

Compilers and More: Optimizing GPU Kernels

http://www.hpcwire.com/features/Compilers_and_More_Optimizing_GPU_Kernels.html

The article shows how sensitive GPU performance is to the formulation of your kernel, and how much and what kind of experimentation you'll need to do to optimize it. How much of the optimization process will carry over from one GPU to another, or from one generation to the next from the same vendor? The cost of porting a nontrivial application to a GPU is high, though the potential performance is alluring.
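
As a hypothetical illustration of that sensitivity (my own example, not from the article), the two kernels below compute the same per-row sums of an n x n matrix; only the storage layout differs, yet the second one gets coalesced loads and the first does not.

```cuda
__global__ void row_sums_rowmajor(const float *m, float *sums, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float s = 0.0f;
    for (int col = 0; col < n; ++col)
        s += m[row * n + col];    // threads row, row+1 load addresses n apart: uncoalesced
    sums[row] = s;
}

__global__ void row_sums_colmajor(const float *mT, float *sums, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    float s = 0.0f;
    for (int col = 0; col < n; ++col)
        s += mT[col * n + row];   // threads row, row+1 load adjacent words: coalesced
    sums[row] = s;
}
```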

Compilers and More: Programming GPUs Today

http://www.hpcwire.com/features/Compilers_and_More_Programming_GPUs_Today.html

This article presents a detailed and precise description of what happens when a CUDA kernel runs: download, translate, and optimize. OpenCL seems to be a step backwards, from a compiler-oriented solution aimed at raising the level of programming to a library-oriented solution aimed at giving low-level control to the programmer. Low-level control is not bad; that would be like saying assembly language is bad. It is not bad, but it is only necessary for a very small part of the programming we do today. Many woodworkers prefer hand tools, and they can make beautiful furniture, but the cost is high and productivity is low. We need hand tools, but we will be much more productive with better power tools. Compiler aids for OpenCL will be important!
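
To see the compiler-oriented vs. library-oriented contrast concretely, here is a minimal CUDA launch of my own (not from the article). The chevron syntax is handled entirely by nvcc; in OpenCL the same launch would be spelled out by hand through host API calls such as clCreateProgramWithSource, clBuildProgram, clCreateKernel, clSetKernelArg, and clEnqueueNDRangeKernel.

```cuda
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void launch_scale(float *d_x, float a, int n) {
    // the compiler generates the argument marshalling and the launch call for us
    scale<<<(n + 255) / 256, 256>>>(d_x, a, n);
}
```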

Compilers and More: GPU Architecture and Applications

http://www.hpcwire.com/features/Compilers_and_More_GPU_Architecture_and_Applications.html

I think Wolfe is a genius, Wolfe rules!! To get high performance, your program needs enough parallelism to keep the thread processors busy (lots of customers). Each of your thread blocks needs enough parallelism to fill all the thread processors in a multiprocessor (big tour groups), and you need at least as many thread blocks as you have multiprocessors (many big tour groups). In addition, you need even more parallelism to keep the multiprocessors busy when they need to thread-switch past a long-latency memory operation (many, many big tour groups).
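
A tiny sketch of how I would size a launch with that in mind (hypothetical, not Wolfe's code): query the device and give every multiprocessor several full thread blocks, so there are spare warps to switch to while others wait on memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threads_per_block = 256;                       // a "big tour group"
    int blocks_per_sm     = 4;                         // several groups per multiprocessor
    int blocks = prop.multiProcessorCount * blocks_per_sm;
    int n = blocks * threads_per_block;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    busy_kernel<<<blocks, threads_per_block>>>(d);
    cudaDeviceSynchronize();
    printf("%d SMs -> launched %d blocks of %d threads\n",
           prop.multiProcessorCount, blocks, threads_per_block);
    cudaFree(d);
    return 0;
}
```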

Compilers and More: Parallel Programming Made Easy?

http://www.hpcwire.com/features/Compilers_and_More_Parallel_Programming_Made_Easy.html

Sequential programming is really hard, and parallel programming is a step beyond that. The best we can hope for is to make parallel programming not much harder than sequential programming.

Compilers and More: Are Optimizing Compilers Important?

http://www.hpcwire.com/features/Compilers_and_More_Are_Optimizing_Compilers_Important.html

The SPEC CPU benchmarks have been used to measure processor, system, and compiler performance. Developing compiler optimization techniques targeted specifically at the benchmarks is meaningful.

Compilers and More: Productivity and Compilers

http://www.hpcwire.com/features/17901534.html

Productivity is defined as speedup / SLOC, where SLOC (source lines of code) estimates the programming effort. Fifty years ago the Fortran developers had the same goal in mind: delivering the same performance as machine language with lower programming effort. Wolfe presents four ways to improve productivity. The first is better hardware, but hardware now delivers its benefits as on-chip parallelism rather than improved clock speed, and exploiting that parallelism is still the programmer's job. The second is faster algorithms, but new algorithm development is quite expensive. The third is to use high-performance libraries, but this approach has many restrictions. The last is new languages and good parallel compilers. 😎 Modern compilers can deliver the same performance without requiring programmers to think about low-level details.
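
A toy worked example of that productivity measure, with made-up numbers of my own (not from the article):

```latex
\[
  P = \frac{\text{speedup}}{\text{SLOC}}, \qquad
  P_{\text{serial}} = \frac{1}{100} = 0.01, \qquad
  P_{\text{CUDA}}  = \frac{60}{300} = 0.2
\]
```

So a hypothetical CUDA port that runs 60x faster but triples the line count would still improve this productivity measure twentyfold.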

Compilers and More: Accelerating High Performance

http://www.hpcwire.com/topic/processors/Compilers_and_More_Accelerating_High_Performance.html

High performance computing will soon be dominated by accelerator-based systems. Major chip vendors will keep developing multicores rather than accelerators because HPC is not a high-volume business. Moore's law still works, and on-chip density will keep increasing predictably. Since accelerators are farther behind on that curve, they can enjoy more of that benefit. Therefore developing a parallel compiler for accelerators, especially GPUs, will be an important task. Just do it!

How We Should Program GPGPUs

http://www.linuxjournal.com/article/10216

A programming strategy is always a trade-off between performance and convenience. The Cray Fortran Translator was the first widely used vectorizing compiler. It did an effective job of vectorization and gave performance feedback to the user, and that feedback made users comfortable staying within the vectorizable subset of the language. Automatic parallelizing compilers share five common steps: Select Region, Data Analysis, Parallelism Analysis, Parallelism Mapping, and Code Generation. It is better to leave the first step to the programmer. The second step is within the scope of current compiler technology, and traditional vectorizing and parallelizing techniques are mature enough to handle the third. The fourth and fifth are up to you. 🙂
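
To keep the five steps straight, here is my own hypothetical walk through them on a trivial loop (not the article's example); only step 1, picking the region, really needs the programmer.

```cuda
//   1. Select Region        : the vector-add loop below is the region to offload.
//   2. Data Analysis        : a and b are read, c is written; all n elements must move.
//   3. Parallelism Analysis : no iteration reads what another writes -> fully parallel.
//   4. Parallelism Mapping  : one thread per iteration, 256 threads per block.
//   5. Code Generation      : the kernel plus the launch below.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void run_region(const float *d_a, const float *d_b, float *d_c, int n) {
    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
}
```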

PGI Articles

http://www.pgroup.com/resources/articles.htm

Byoung-Tak Zhang

Teaching an Agent by Playing a Multimodal Memory Game: Challenges for Machine Learners and Human Teachers, AAAI’09

Zhang developed a research platform that implements a cognitive game called the multimodal memory game (MMG) to study machine learning architectures and algorithms for learning in a long-lasting, multimodal, interactive, and dynamic environment. MMG is a real challenge for current machine learning technologies. First, the game is played online and the data arrive as a stream, in contrast with the usual machine learning setting where a fixed set of training data is given. Second, the amount of data is huge; some machine learning algorithms, such as support vector machines, have a practical limit on the number of training examples. Third, the problem involves multimodality, so multiple sources of data must be integrated into coherent knowledge. Fourth, the game involves vision and language problems. They demonstrated that a version of the MMG game can be solved by the hypernetwork learning model.

Hubert Nguyen

GPU Gems 3

http://portal.acm.org/citation.cfm?id=1407436

Chapter 31 Lars Nyland – Fast N-Body Simulation with CUDA
This chapter presents a parallel technique for an O(N*N) calculation. All the interactions in this algorithm are independent, so they can all be executed simultaneously, but a naive approach needs O(N*N) memory, which cannot be allocated in GPU memory. They present a tile calculation that exploits data reuse to keep the arithmetic units busy, and it nearly reaches peak GFLOPS. The tile technique can be adapted to a hypernetwork. * I could improve the hypernetwork by 50% using loop fission, loop tiling, loop unrolling, and constant memory. Now the current CUDA version is 60 times faster than the sequential one, and it outperforms the original version by 9000x. 😉
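
To remember the trick, here is my own simplified reconstruction of the tile calculation (not the chapter's code; it assumes the block size equals TILE and that n is a multiple of TILE):

```cuda
#define TILE 256   // must equal blockDim.x; n is assumed to be a multiple of TILE

__device__ float3 body_body(float4 bi, float4 bj, float3 acc) {
    // accumulate the (softened) gravitational pull of body j on body i; w holds the mass
    float3 r = make_float3(bj.x - bi.x, bj.y - bi.y, bj.z - bi.z);
    float dist2 = r.x * r.x + r.y * r.y + r.z * r.z + 1e-6f;
    float inv = rsqrtf(dist2);
    float s = bj.w * inv * inv * inv;
    acc.x += r.x * s; acc.y += r.y * s; acc.z += r.z * s;
    return acc;
}

__global__ void nbody_forces(const float4 *pos, float3 *accel, int n) {
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;
    float4 my = pos[i];
    float3 acc = make_float3(0.0f, 0.0f, 0.0f);

    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = pos[base + threadIdx.x];   // one coalesced load per tile
        __syncthreads();
        for (int j = 0; j < TILE; ++j)                 // each loaded body reused TILE times
            acc = body_body(my, tile[j], acc);
        __syncthreads();
    }
    accel[i] = acc;
}
```

Each float4 fetched from global memory is used by all TILE threads of the block, so the arithmetic-to-bandwidth ratio goes up by roughly a factor of TILE; that is what keeps the arithmetic units busy.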

Chapter 32 Scott Le Grand – Broad-Phase Collision Detection with CUDA
blahblah

Chapter 35 Elizabeth Seamans – Fast Virus Signature Matching on the GPU
This chapter presents an intra-packet approach to scanning network packets for viruses. They show the similarity between network processors and NVIDIA GPUs, and the possibility of using GPUs as network processors. The network packets are divided into chunks and one thread processes one chunk, a very easy and straightforward parallel approach, but it supports only fixed-length virus signatures. It demonstrates cooperation between the CPU and GPU, but the performance is poor. 🙁
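
A toy version of that one-thread-per-chunk idea (my own sketch, not the chapter's code), matching a single fixed-length signature held in constant memory:

```cuda
#define CHUNK 64
#define SIG_LEN 8

__constant__ unsigned char d_sig[SIG_LEN];   // filled on the host with cudaMemcpyToSymbol

__global__ void scan_chunks(const unsigned char *buf, int buf_len, int *match_pos) {
    // *match_pos must be initialized to INT_MAX (or buf_len) before the launch
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * CHUNK;
    // overlap the chunk end by SIG_LEN-1 bytes so boundary matches are not missed
    int end = min(start + CHUNK + SIG_LEN - 1, buf_len);
    for (int i = start; i + SIG_LEN <= end; ++i) {
        bool hit = true;
        for (int j = 0; j < SIG_LEN; ++j)
            if (buf[i + j] != d_sig[j]) { hit = false; break; }
        if (hit) atomicMin(match_pos, i);    // record the earliest match offset
    }
}
```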

Chapter 37 Lee Howes – Efficient Random Number Generation and Application Using CUDA
This chapter shows that a good RNG is very important in Monte Carlo methods: poor RNG quality can ruin the results of a Monte Carlo application. It presents fast parallel RNG methods with extremely good statistical quality implemented in CUDA.
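
The chapter builds its own generators; as a minimal stand-in to show where RNG quality enters a Monte Carlo code, here is a hypothetical pi estimate using NVIDIA's later cuRAND host API (my own sketch, not the chapter's code; link with -lcurand):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand.h>

__global__ void count_hits(const float *x, const float *y, int n, unsigned int *hits) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] * x[i] + y[i] * y[i] <= 1.0f)
        atomicAdd(hits, 1u);                 // point fell inside the quarter circle
}

int main() {
    const int n = 1 << 22;
    float *dx, *dy; unsigned int *dhits, hits = 0;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dhits, sizeof(unsigned int));
    cudaMemset(dhits, 0, sizeof(unsigned int));

    curandGenerator_t gen;                                // host-side cuRAND API
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, dx, n);                    // uniform points in (0,1]
    curandGenerateUniform(gen, dy, n);

    count_hits<<<(n + 255) / 256, 256>>>(dx, dy, n, dhits);
    cudaMemcpy(&hits, dhits, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("pi ~= %f\n", 4.0 * hits / n);

    curandDestroyGenerator(gen);
    cudaFree(dx); cudaFree(dy); cudaFree(dhits);
    return 0;
}
```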

Chapter 38 Bernard Deschizeaux – Imaging Earth’s Subsurface Using CUDA
A trivial paper. There is nothing to learn. 😥

Chapter 39 Mark Harris – Parallel Prefix Sum (Scan) with CUDA
Scan is a simple and common parallel building block. It can be used in stream compaction, summed-area tables, and radix sort. This chapter implements the techniques Blelloch proposed in the early '90s using CUDA. They optimized the CUDA source by taking the memory hierarchy into account, for example avoiding shared memory bank conflicts. All the principal algorithms and methods already existed; as Bernard of Chartres said, we are "standing on the shoulders of giants". =)
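
As a reminder of what the building block looks like, here is a compact single-block version of the Blelloch work-efficient exclusive scan (my own simplified sketch: one block, power-of-two n, two elements per thread, and without the bank-conflict padding the chapter adds):

```cuda
__global__ void blelloch_scan(float *data, int n) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[2 * tid]     = data[2 * tid];       // each thread loads two elements
    temp[2 * tid + 1] = data[2 * tid + 1];

    int offset = 1;
    for (int d = n >> 1; d > 0; d >>= 1) {   // up-sweep (reduce) phase
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            temp[bi] += temp[ai];
        }
        offset <<= 1;
    }

    if (tid == 0) temp[n - 1] = 0.0f;        // clear the root for an exclusive scan

    for (int d = 1; d < n; d <<= 1) {        // down-sweep phase
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t = temp[ai];
            temp[ai] = temp[bi];
            temp[bi] += t;
        }
    }
    __syncthreads();
    data[2 * tid]     = temp[2 * tid];
    data[2 * tid + 1] = temp[2 * tid + 1];
}
```

Launched as blelloch_scan<<<1, n / 2, n * sizeof(float)>>>(d_data, n); larger arrays need the chapter's multi-block variant that scans the per-block sums.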

Shane Ryoo

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, PPoPP’08

http://portal.acm.org/citation.cfm?id=1345206.1345220

This is a very empirical paper. They introduce the characteristics of the GeForce 8800 and their optimization principles: enough threads to hide memory latency, and reducing memory bandwidth demand using shared and/or constant memory. They ported many benchmarks to CUDA and, moreover, released them on their web site (http://impact.crhc.illinois.edu/parboil.php)!! They showed how to calculate the performance limit in GFLOPS of a given GPU, and how to measure the GFLOPS of my own programs. It's very useful! Finally, they warned readers not to get stuck in a local optimum but to try many approaches to reach the performance limit.
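
A back-of-the-envelope version of that peak-GFLOPS calculation, using the commonly quoted GeForce 8800 GTX numbers (128 streaming processors at a 1.35 GHz shader clock, each doing a multiply-add, i.e. 2 flops, per cycle); the paper's own figure may differ slightly:

```latex
\[
  \text{peak} \approx 128 \times 1.35\,\text{GHz} \times 2\,\tfrac{\text{flops}}{\text{cycle}}
               \approx 345.6\ \text{GFLOP/s}
\]
```

Comparing a kernel's measured GFLOP/s against this bound shows how far from the hardware limit it still is.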

Muthu Manikandan Baskaran

A compiler framework for optimization of affine loop nests for gpgpus, ICS’08

http://portal.acm.org/citation.cfm?id=1375527.1375562

They describe the performance-critical characteristics of CUDA, such as coalescing when accessing global memory and bank conflicts when accessing shared memory. They derive the conditions for best performance and use the polyhedral model to generate efficient parallel code that operates in those modes.
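
A hypothetical illustration of the two access rules the paper builds on (my own example, not the paper's generated code): a tiled matrix transpose stages a tile in shared memory so that both the global read and the global write are coalesced, and pads the tile by one column so that the column-wise shared-memory accesses do not all land in the same bank.

```cuda
#define TILE_DIM 32   // launched with a TILE_DIM x TILE_DIM thread block

__global__ void transpose_tiled(const float *in, float *out, int width, int height) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];    // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;          // transposed block coordinates
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```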

Zheng Wang

Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, PPoPP’09

http://portal.acm.org/citation.cfm?id=1504176.1504189

They extract static code features (operations, control flow, memory accesses, binary and bitwise operations) using LLVM, dynamic data features (kernel loop counts, L1 D-cache miss ratio, branch miss ratio) using the PMU, and a runtime feature, the execution time. These are the inputs to an ANN, and the outputs are the predicted best scheduling policy and the predicted speedup. The problem is that the internals of an ANN are a 'black box'.

Louis-Noel Pouchet

Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time, CGO’07

http://portal.acm.org/citation.cfm?id=1252537

They build a polyhedral model of a sequential program using the method Paul Feautrier proposed in 1992. Many schedules can be derived from the polyhedral model, so they restrict the search to the legal ones. They compare the performance of the candidate schedules using an iterative compilation technique.
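
A toy legality check of my own (not the paper's example): for the loop for (i = 0; i < N; i++) a[i+1] = a[i] + 1; there is a dependence from iteration i to iteration i+1, so a one-dimensional affine schedule theta(i) = a*i + b is legal only if the source runs strictly before the target:

```latex
\[
  \theta(i+1) \ge \theta(i) + 1
  \;\Longleftrightarrow\;
  a(i+1) + b \ge a\,i + b + 1
  \;\Longleftrightarrow\;
  a \ge 1
\]
```

The iterative part of the paper then searches among such legal coefficient choices for the one that actually runs fastest.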