Automatic data layout for distributed-memory machines http://portal.acm.org/citation.cfm?id=291891.291901
Pete Keleher
TreadMarks: distributed shared memory on standard workstations and operating systems http://portal.acm.org/citation.cfm?id=1267084 TreadMarks is a DSM system that exploits lazy release consistency and lazy diff creation. Lazy release consistency postpones the propagation of modifications in the distributed system until the time of the next acquire, and lazy diff creation likewise delays encoding a modified page's changes until another node actually requests them.
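To make the lazy-diff idea concrete, here is a toy, single-process C sketch of the twin-and-diff mechanism (make_twin and create_diff are my own names, not the TreadMarks API; the real system does this per virtual-memory page, triggered by protection faults):

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct diff_entry { int offset; char value; };

static char page[PAGE_SIZE];  /* the shared page */
static char twin[PAGE_SIZE];  /* pristine copy made at the first write fault */

/* First write to the page: save a twin before modifying it. */
static void make_twin(void) { memcpy(twin, page, PAGE_SIZE); }

/* Lazy diff creation: the diff is computed only when some other
 * node actually requests our modifications, not at release time. */
static int create_diff(struct diff_entry *out, int max) {
    int n = 0;
    for (int i = 0; i < PAGE_SIZE && n < max; i++)
        if (page[i] != twin[i]) { out[n].offset = i; out[n].value = page[i]; n++; }
    return n;
}

int main(void) {
    struct diff_entry d[16];
    make_twin();                 /* normally triggered by a write fault */
    page[10] = 'x'; page[20] = 'y';
    int n = create_diff(d, 16);  /* normally triggered by a remote request */
    for (int i = 0; i < n; i++)
        printf("offset %d -> %c\n", d[i].offset, d[i].value);
    return 0;
}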
Yunheung Paek
Simplification of array access patterns for compiler optimizations http://portal.acm.org/citation.cfm?id=277650.277664 The authors present a novel representation called the LMAD (Linear Memory Access Descriptor). An LMAD consists of stride/span pairs plus a base offset, and analyzing complex array access patterns becomes simpler with it, so compiler optimizations that would otherwise give up on such patterns can still be applied.
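As I read the notation, an LMAD with a base offset and per-dimension stride/span pairs describes exactly the set of offsets enumerated by the toy C code below (the code and the example values are mine, not the paper's):

#include <stdio.h>

/* One dimension of an LMAD: accesses move by `stride` elements and
 * cover a total distance of `span` elements (span = stride * (trips - 1)). */
struct dim { long stride; long span; };

/* Enumerate every offset described by an LMAD with the given base
 * offset and dimensions, last dimension varying fastest. */
static void enumerate(long base, const struct dim *d, int ndims) {
    long trips[8], idx[8];
    for (int k = 0; k < ndims; k++) { trips[k] = d[k].span / d[k].stride + 1; idx[k] = 0; }
    for (;;) {
        long off = base;
        for (int k = 0; k < ndims; k++) off += idx[k] * d[k].stride;
        printf("%ld\n", off);
        int k = ndims - 1;                       /* odometer-style increment */
        while (k >= 0 && ++idx[k] == trips[k]) idx[k--] = 0;
        if (k < 0) break;
    }
}

int main(void) {
    /* e.g. A[i][j] over a 10-column array, i = 0..2, j = 0..3:
     * outer dimension stride 10, span 20; inner stride 1, span 3. */
    struct dim d[2] = { {10, 20}, {1, 3} };
    enumerate(0, d, 2);   /* prints 0..3, 10..13, 20..23 */
    return 0;
}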
How to delete all files in a CVS directory
ls -1 > tmp
cat tmp | xargs rm -f
cat tmp | xargs cvs delete
cvs commit -m ""
cvs update
How to remove all except the newest file
shopt -s extglob
rm -f !(`ls -t | head -1`)
How to install Grub from a live Ubuntu cd
$ sudo -i
# grub
grub> find /boot/grub/stage1
(hd0,5)
grub> root (hd0,5)
grub> setup (hd0)
Replace string in file with sed
sed -i 's/old/new/g' filename 😉
How to delete ^M character using vi
In vi, run :%s/^M//g. To type the ^M, hold the Control key and press V, then M (both while holding Control); the ^M will appear. Or just run the file through dos2unix.
Cedric Bastoul
Code Generation in the Polyhedral Model Is Easier Than You Think http://portal.acm.org/citation.cfm?id=1025992
Isaac Gelado
CUBA: an architecture for efficient CPU/co-processor data communication http://portal.acm.org/citation.cfm?id=1375571 This paper presents a hardware-supported double-buffering mechanism that allows the CPU to transfer data for the next invocation while the co-processor is executing the current call.
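CUBA does this in hardware, but the overlap it is after can be sketched in plain CUDA with two buffers and two streams (my illustration of generic double buffering, not the paper's mechanism; assumes `host` points to pinned memory allocated with cudaMallocHost, each chunk is n floats, and the device buffers are pre-allocated):

#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   /* stand-in for real work */
}

void run(const float *host, float *dev_in[2], float *dev_out[2],
         int chunks, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
    for (int c = 0; c < chunks; c++) {
        int b = c & 1;   /* alternate between the two buffers */
        /* stage chunk c in one stream while the other stream computes;
         * stream ordering makes buffer reuse two chunks later safe */
        cudaMemcpyAsync(dev_in[b], host + (size_t)c * n, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        process<<<(n + 255) / 256, 256, 0, s[b]>>>(dev_in[b], dev_out[b], n);
    }
    cudaStreamSynchronize(s[0]); cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
}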
Michael Wolfe
Compilers and More: A GPU and Accelerator Programming Model http://www.hpcwire.com/specialfeatures/sc08/features/Compilers_and_More_A_GPU_and_Accelerator_Programming_Model.html Accelerators support two levels of parallelism: an outer fully-parallel doall loop level, and an inner synchronous (SIMD or vector) loop level. The keys to tuning are minimizing and perhaps optimizing the data movement between the host and the accelerator.
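In CUDA terms (my mapping, consistent with the article's description): the outer doall level is the grid of independent thread blocks, and the inner synchronous level is the threads within a block.

#include <cuda_runtime.h>

/* Outer level: one block per row, no ordering between blocks (doall).
 * Inner level: the block's threads sweep a row in SIMD fashion and
 * could synchronize with __syncthreads() if they needed to. */
__global__ void scale_rows(float *a, int rows, int cols, float s) {
    int row = blockIdx.x;                                 /* fully parallel */
    for (int j = threadIdx.x; j < cols; j += blockDim.x)  /* vector-style */
        a[row * cols + j] *= s;
}

/* launch: scale_rows<<<rows, 256>>>(a, rows, cols, 2.0f); */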
Byoung-Tak Zhang
Teaching an Agent by Playing a Multimodal Memory Game: Challenges for Machine Learners and Human Teachers, AAAI’09 Zhang developed a research platform that implements a cognitive game called the multimodal memory game (MMG) to study machine learning architectures and algorithms for learning from multimodal data.
Hubert Nguyen
GPU Gems 3 http://portal.acm.org/citation.cfm?id=1407436 Chapter 31, Lars Nyland – Fast N-Body Simulation with CUDA. This article presents a parallel technique for an O(N*N) calculation. All calculations in this algorithm are independent of one another, so they can all be executed simultaneously, but it needs N*N force calculations in total.
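A condensed sketch of the chapter's tiled all-pairs kernel, simplified from memory rather than copied from the book (assumes n is a multiple of TILE, a grid of n/TILE blocks of TILE threads, and an arbitrary softening constant):

#include <cuda_runtime.h>

#define TILE 256

/* All-pairs N-body: each thread accumulates the force on one body.
 * Bodies are staged through shared memory one tile at a time so the
 * N*N interactions read each position from DRAM only N/TILE times. */
__global__ void nbody_forces(const float4 *pos, float3 *acc, int n) {
    __shared__ float4 tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* my body */
    float4 pi = pos[i];
    float3 a = {0.f, 0.f, 0.f};
    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = pos[base + threadIdx.x]; /* cooperative load */
        __syncthreads();
        for (int j = 0; j < TILE; j++) {
            float dx = tile[j].x - pi.x, dy = tile[j].y - pi.y, dz = tile[j].z - pi.z;
            float r2 = dx*dx + dy*dy + dz*dz + 1e-9f;  /* softening */
            float inv = rsqrtf(r2);
            float f = tile[j].w * inv * inv * inv;     /* m_j / r^3 */
            a.x += f * dx; a.y += f * dy; a.z += f * dz;
        }
        __syncthreads();
    }
    acc[i] = a;
}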
Vasily Volkov
Benchmarking GPUs to tune dense linear algebra, SC’08 http://portal.acm.org/citation.cfm?id=1413370.1413402 They showed the characteristics of GPUs through empirical experiments and adapted old methods that were used for vector processors to GPUs. Frankly, I cannot understand this paper. 🙁 I will read it again.
John Nickolls
Scalable Parallel Programming with CUDA, Queue Vol.6 Issue 2, ’08 http://portal.acm.org/citation.cfm?id=1365500 It is just the programming guide, no more.
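For what it's worth, the model the article describes reduces to launching a kernel over a grid of thread blocks; a minimal complete example (using unified memory, which postdates the article, purely for brevity):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.f; y[i] = 2.f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.f, x, y);   /* grid of blocks */
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                      /* 4.0 */
    cudaFree(x); cudaFree(y);
    return 0;
}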
Shane Ryoo
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, PPoPP’08 http://portal.acm.org/citation.cfm?id=1345206.1345220 This is a very empirical paper. They introduced the characteristics of the GeForce 8800 and optimization principles: run enough threads to hide memory latency and make efficient use of memory bandwidth.
Christian Lengauer
Loop Parallelization in the Polytope Model, CONCUR’93 http://portal.acm.org/citation.cfm?id=703499 He showed how to use the polytope model to generate parallel code from sequential code, walking through a concrete example.
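The flavor of the result, on a standard example of my choosing rather than his: skewing a doubly-dependent loop nest to wavefronts t = i + j turns the inner loop into a parallel doall, because both sources A[i-1][j] and A[i][j-1] lie on the previous wavefront.

#include <stdio.h>
#define N 8
#define M 8
static int max2(int a, int b) { return a > b ? a : b; }
static int min2(int a, int b) { return a < b ? a : b; }

int main(void) {
    static int A[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) A[i][j] = 1;

    /* Original nest: A[i][j] = A[i-1][j] + A[i][j-1] carries
     * dependences on both loops.  After skewing, iterations on the
     * same wavefront t are independent: the i loop is a doall. */
    for (int t = 2; t <= N + M - 2; t++)                        /* sequential */
        for (int i = max2(1, t - M + 1); i <= min2(N - 1, t - 1); i++) {
            int j = t - i;                                      /* doall over i */
            A[i][j] = A[i-1][j] + A[i][j-1];
        }
    printf("%d\n", A[N-1][M-1]);   /* C(14,7) = 3432 */
    return 0;
}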
Muthu Manikandan
A compiler framework for optimization of affine loop nests for gpgpus, ICS’08 http://portal.acm.org/citation.cfm?id=1375527.1375562 They showed the characteristics of CUDA, such as coalescing when accessing global memory and bank conflicts when accessing shared memory. They derived the best-performing code versions through an empirical search over transformations.
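The global-memory distinction they optimize for, in two toy kernels (mine, not the paper's):

#include <cuda_runtime.h>

/* Coalesced: consecutive threads read consecutive addresses, so a
 * warp's loads merge into a few wide memory transactions. */
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

/* Uncoalesced: consecutive threads stride through memory, so each
 * load may become its own transaction and bandwidth collapses. */
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long)i * stride) % n];
}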
Zheng Wang
Mapping Parallelism to Multi-cores: A Machine Learning Based Approach, PPoPP’09 http://portal.acm.org/citation.cfm?id=1504176.1504189 They extracted static code features like operations, control flow, memory accesses, and binary & bitwise operations using LLVM, and gathered dynamic data features like kernel loop counts and L1 d-cache misses from profiling runs.
Louis-Noel Pouchet
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time, CGO’07 http://portal.acm.org/citation.cfm?id=1252537 They build polyhedral models from a sequential program using the method Paul Feautrier proposed in 1992. Many schedules can be derived from such a model, so they keep only the legal ones, i.e., those that preserve every dependence, and search that space iteratively.
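Roughly, the object being searched in the one-dimensional case is an affine schedule per statement; in my notation (not the paper's exact formulation):

\theta_S(\vec{x}) = \vec{a}_S \cdot \vec{x} + b_S
\theta_T(\vec{y}) \ge \theta_S(\vec{x}) + 1 \quad \text{for every dependence } (\vec{x} \to \vec{y}) \text{ from } S \text{ to } T

The schedule assigns each statement instance a logical time, and legality just says every dependence source executes strictly before its target; the paper explores the space these constraints carve out.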