MEMORY ACCESS LATENCIES
Access latencies, in CPU cycles
(Note: old CPUs took multiple cycles to run one instruction
while new CPUs can run multiple instructions per cycle)
CPU           L1 cache   L2 cache    RAM        Disk
386              -          -          2      500000
486              2          -         10     1800000
586              2          -         20     1500000
Pentium II       2         10         35     2400000
Pentium III      2         15         50     6000000
Pentium 4        3         25        200    18000000
Core 2           3         25        200    24000000
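A minimal sketch of how that latency gap shows up in ordinary code: summing the same 2D array row-by-row (sequential, cache-friendly) versus column-by-column (each access hops to a different row array, defeating the cache once the data outgrows it). The matrix size and the class name are assumptions for illustration; the actual ratio depends on your CPU's cache hierarchy.

```java
// Sketch: cache-friendly vs cache-hostile traversal of the same data.
// N = 4096 gives a 64 MB int matrix, well past any L2 cache (assumption).
public class CacheTraversal {
    static final int N = 4096;

    static long sumRowMajor(int[][] m) {
        long s = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];          // walks each row sequentially
        return s;
    }

    static long sumColMajor(int[][] m) {
        long s = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];          // hops to a different row array every access
        return s;
    }

    public static void main(String[] args) {
        int[][] m = new int[N][N];
        for (int[] row : m)
            java.util.Arrays.fill(row, 1);

        long t0 = System.nanoTime();
        long a = sumRowMajor(m);
        long t1 = System.nanoTime();
        long b = sumColMajor(m);
        long t2 = System.nanoTime();

        System.out.printf("row-major: %d ms, col-major: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        if (a != b) throw new AssertionError("sums differ");
    }
}
```

Both loops do identical arithmetic on identical data; only the memory access order differs, which is exactly the cost the table above is measuring.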
NUMA & SMP FRIENDLY APPLICATIONS
- CPUs are fast, communication between CPUs is slow
* Maximize performance by minimizing communication
- Fine-grained locking increases parallelism, but also increases inter-CPU communication!
* Worked great in the 1990s, but no more
- Writing to common data structures invalidates cache lines and increases inter-CPU communication
* Write mostly to thread-local data, read mostly from shared data
* Use NUMA/SMP-friendly runtimes (the JVM, etc.)
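The "write thread-local, read shared" pattern is exactly what the JDK's LongAdder does: writers increment striped per-thread cells so they rarely fight over the same cache line, and only sum() walks all the cells. A plain AtomicLong, by contrast, bounces one cache line between every incrementing CPU. A minimal sketch; the thread and iteration counts are assumptions:

```java
// Sketch: contended shared counter vs striped (write-mostly-local) counter.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

public class CounterContention {
    static final int THREADS = 8, ITERS = 1_000_000;  // assumed sizes

    // Run THREADS threads each calling inc ITERS times; return elapsed ms.
    static long race(Runnable inc) throws InterruptedException {
        Thread[] ts = new Thread[THREADS];
        long t0 = System.nanoTime();
        for (int i = 0; i < THREADS; i++) {
            ts[i] = new Thread(() -> {
                for (int k = 0; k < ITERS; k++) inc.run();
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong shared = new AtomicLong();   // every write contends on one cache line
        LongAdder striped = new LongAdder();    // writes spread across per-thread cells
        System.out.println("AtomicLong: " + race(shared::incrementAndGet) + " ms");
        System.out.println("LongAdder:  " + race(striped::increment) + " ms");
        if (shared.get() != (long) THREADS * ITERS) throw new AssertionError();
        if (striped.sum() != (long) THREADS * ITERS) throw new AssertionError();
    }
}
```

Both counters end at the same total; the difference is purely in how much inter-CPU cache traffic the writes generate.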
And we wonder why cache-aware data structures are so much faster on modern hardware. Look at the nearly order-of-magnitude jump from L1 to L2 cache, and then again from L2 to RAM!
See also: RAM is the new disk (and disk is the new tape).