Accelerating Linpack with CUDA on heterogeneous clusters

http://portal.acm.org/citation.cfm?id=1513895.1513901

The author measures the PCIe bandwidth and the peak GFLOPS of the CPU and the GPU. From these measurements and the input data size, the author estimates the execution time on each device and derives the optimal fraction of the work to assign to each. The author does not overlap execution with data transfer, because on Intel systems with a Front Side Bus (FSB) the memory system cannot supply data to both the PCIe bus and the CPU at maximum speed. On newer Intel systems with QuickPath Interconnect (QPI), this may no longer be the case.
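The split calculation can be sketched as follows. This is a simplified model of my own, not necessarily the formula used in the paper. It assumes a matrix multiply C = A*B of size m x n x k that is split by columns, with a fraction f of the columns going to the GPU. It also assumes the GPU's transfers are not overlapped with its computation, as described above, and that the CPU works on its share at the same time. The function names and the numbers in the usage note are hypothetical.

```python
def split_times(f, m, n, k, gpu_gflops, cpu_gflops, pcie_gbs, bytes_per_elem=8):
    """Return (gpu_time, cpu_time) in seconds when the GPU gets fraction f.

    Assumed model (not taken from the paper):
      - total work is 2*m*n*k floating-point operations;
      - the GPU needs all of A, its slice of B, and its slice of C
        copied in, plus its slice of C copied back out;
      - transfers are NOT overlapped with GPU compute, so they add up.
    """
    flops = 2.0 * m * n * k
    bandwidth = pcie_gbs * 1e9  # bytes per second
    transfer_bytes = bytes_per_elem * (m * k + f * (k * n + 2 * m * n))
    gpu_time = transfer_bytes / bandwidth + f * flops / (gpu_gflops * 1e9)
    cpu_time = (1.0 - f) * flops / (cpu_gflops * 1e9)
    return gpu_time, cpu_time


def optimal_gpu_fraction(m, n, k, gpu_gflops, cpu_gflops, pcie_gbs, bytes_per_elem=8):
    """Fraction f at which the GPU and the CPU finish at the same time."""
    args = (m, n, k, gpu_gflops, cpu_gflops, pcie_gbs, bytes_per_elem)
    gpu_at_0, cpu_at_0 = split_times(0.0, *args)
    gpu_at_1, _ = split_times(1.0, *args)
    # Both times are linear in f:
    #   gpu(f) = gpu_at_0 + (gpu_at_1 - gpu_at_0) * f
    #   cpu(f) = cpu_at_0 * (1 - f)
    # Setting them equal and solving for f gives:
    f = (cpu_at_0 - gpu_at_0) / ((gpu_at_1 - gpu_at_0) + cpu_at_0)
    # Clamp for the cases where one device should take all of the work.
    return min(1.0, max(0.0, f))
```

For example, `optimal_gpu_fraction(4096, 4096, 4096, gpu_gflops=80, cpu_gflops=40, pcie_gbs=5)` gives a fraction somewhat below the 2/3 that the compute rates alone would suggest, because the non-overlapped PCIe transfer time counts against the GPU.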

Massimiliano Fatica
