Sabtu, 15 November 2008

A Mathematical Model of Hardware Prefetching

The time complexity of algorithms has traditionally been studied exclusively in terms of the instruction count. More recently, efforts to construct cache efficient algorithms have focused on adapting asymptotically optimal algorithms to the multiple tiers of the memory hierarchy. An additional hardware feature, though, makes several theoretically optimal cache efficient algorithms less effective in practice. The hardware prefetcher of the Intel Pentium 4 processor allows for bytes to be loaded from main memory into the L2 cache before being explicitly requested. To anticipate desired bytes, the prefetcher keeps track of previous cache misses. Its range is limited to 256 bytes ahead of current accesses and it can operate on up to eight concurrent data streams [1].

Given the magnitude of its allowed speedups, surprisingly little attention has been paid to studying the effects of the prefetcher for various algorithms. In the experiments of Pan et al., standard and cache efficient implementations of algorithms are studied with and without prefetching enabled. The prefetcher almost universally allows for significant improvement. For some of these algorithms, the prefetcher makes standard implementations faster than cache efficient ones, even when the cache efficient methods outperform standard ones without the prefetcher. This disparity implies that speedups delivered from the prefetcher are highly sensitive to memory access patterns. In order to understand these distinctions, a mathematical model of the prefetcher is developed.

0 komentar: