Three ways to fix DRAM’s latency problem
In a brilliant PhD thesis, Understanding and Improving the Latency of DRAM-Based Memory Systems, Kevin K. Chang of CMU tackles the DRAM latency problem and suggests some novel architectural enhancements that substantially improve it.
Kevin breaks the DRAM latency problem into four issues, three of which I’ll summarize here:
Inefficient bulk data movement.
DRAM refresh interference. While part of the DRAM is being refreshed, that part can’t be accessed.
Cell latency variation, due to manufacturing variability.
The fourth issue, the impact of power on latency, is left for the interested reader to investigate.
INEFFICIENT BULK DATA MOVEMENT
Back when memory and storage were costly, data movement was confined to register-sized chunks or, at most, a 512-byte block from disk. But today, with terabytes of storage and gigabytes of memory, and with video and streaming data, bulk data movement is ever more common.
But the architecture of data movement – from memory to CPU over narrow memory busses – hasn’t changed. Mr. Chang’s suggestion? A new, high-bandwidth data path between sub-arrays of memory, using a few isolation transistors to create a wide – 8,192 bits wide – parallel bus between sub-arrays in the same bank of memory.
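To get a feel for what the wider path buys, here is a rough back-of-the-envelope sketch. It only counts how many bus-wide transfers are needed to move a block of data. The 64-bit channel width and the 4 KiB payload are my assumptions for illustration, not figures from the thesis, and the sketch ignores clock rates and DRAM timing entirely.

```python
def transfers_needed(payload_bits: int, bus_width_bits: int) -> int:
    """Number of bus-wide transfers needed to move payload_bits."""
    return -(-payload_bits // bus_width_bits)  # ceiling division


PAGE_BITS = 4096 * 8        # assumption: one 4 KiB page as the payload
CHANNEL_WIDTH_BITS = 64     # assumption: a typical DDR data-bus width
SUBARRAY_LINK_BITS = 8192   # the inter-subarray width quoted above

narrow = transfers_needed(PAGE_BITS, CHANNEL_WIDTH_BITS)
wide = transfers_needed(PAGE_BITS, SUBARRAY_LINK_BITS)
print(f"64-bit channel: {narrow} transfers; 8,192-bit link: {wide} transfers")
```

Under these assumptions the narrow channel needs 512 transfers where the wide link needs 4. The data also never has to leave the memory chip, which is the other half of the appeal.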
DRAM REFRESH INTERFERENCE
DRAM memory cells need to be refreshed to retain data, which is why it’s called Dynamic RAM. DRAM is refreshed one rank at a time rather than all at once, because refreshing everything together would require too much power. While a rank is being refreshed, however, it can’t be accessed, which creates latency.
Refresh latency is getting worse: as chip density increases, more rows need to be refreshed, degrading performance by almost 20 percent on 32Gb chips.
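A simple way to see why longer refreshes hurt is to compare how long each refresh command ties up the memory with how often those commands are issued. The sketch below does that arithmetic. The timing values are illustrative assumptions in the range of common DDR datasheets, not numbers from the thesis, and the resulting fraction is time unavailable, not the performance loss itself.

```python
def refresh_busy_fraction(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time memory is busy refreshing.

    t_rfc_ns:  how long one refresh command takes (assumed value)
    t_refi_ns: interval between refresh commands (assumed value)
    """
    return t_rfc_ns / t_refi_ns


T_REFI_NS = 7800.0  # assumption: 7.8 microseconds between refresh commands

# Hypothetical refresh durations: a baseline, and one twice as long
# to stand in for a denser chip.
for t_rfc_ns in (350.0, 700.0):
    fraction = refresh_busy_fraction(t_rfc_ns, T_REFI_NS)
    print(f"refresh takes {t_rfc_ns:.0f} ns -> busy {fraction:.1%} of the time")
```

Doubling the refresh duration doubles the share of time the memory is unavailable, which is the trend that makes refresh a growing problem as density rises.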
Mr. Chang proposes two mechanisms that hide refresh latency by running refreshes in parallel with memory accesses across banks and subarrays. The first is out-of-order per-bank refresh, which lets the memory controller pick an idle bank to refresh instead of following the usual strict round-robin order. The second is write-refresh parallelization, which overlaps refresh latency with write latency.
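The first idea, choosing which bank to refresh based on how busy it is, can be sketched in a few lines. This is a toy illustration of the selection policy only, not the thesis's actual mechanism: the function names and data layout are hypothetical, and a real controller would also have to guarantee that every bank is refreshed before its deadline.

```python
from typing import Optional, Sequence


def pick_round_robin(next_bank: int, num_banks: int) -> int:
    """Strict round-robin: refresh the next bank regardless of its load."""
    return next_bank % num_banks


def pick_out_of_order(pending: Sequence[int], owed: Sequence[bool]) -> Optional[int]:
    """Among banks that still owe a refresh, pick the least busy one.

    pending[b] is the number of queued requests for bank b (hypothetical model).
    owed[b] is True if bank b has not yet been refreshed in this window.
    Returns None when no bank owes a refresh.
    """
    candidates = [b for b in range(len(pending)) if owed[b]]
    if not candidates:
        return None
    # Fewest queued requests wins; ties go to the lowest bank number.
    return min(candidates, key=lambda b: (pending[b], b))


pending = [3, 0, 5, 0]             # queued requests per bank
owed = [True, False, True, True]   # bank 1 was already refreshed

print("round-robin refreshes bank", pick_round_robin(0, len(pending)))
print("out-of-order refreshes bank", pick_out_of_order(pending, owed))
```

In this example round-robin would refresh bank 0 and stall its three queued requests, while the out-of-order policy picks bank 3, which is idle and still owes a refresh.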