The first thing that jumps out is how absurdly fast our processors are. Most simple instructions on the Core 2 take one clock cycle to execute, hence a third of a nanosecond at 3.0 GHz. For reference, light only travels ~4 inches (10 cm) in the time taken by a clock cycle. It’s worth keeping this in mind when you’re thinking of optimization - instructions are comically cheap to execute nowadays.
http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait
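The numbers in that quote are easy to check. Here is a quick sketch of the arithmetic, assuming a 3.0 GHz clock and the speed of light in a vacuum; the function names are mine, purely for illustration.

```python
# Back-of-the-envelope check of the clock-cycle figures quoted above.
# Assumption: light in a vacuum; signals in copper or silicon are slower.

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre


def cycle_time_ns(clock_hz):
    """Duration of one clock cycle, in nanoseconds."""
    return 1e9 / clock_hz


def light_distance_cm(clock_hz):
    """Distance light covers during one clock cycle, in centimetres."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100


if __name__ == "__main__":
    hz = 3.0e9  # 3.0 GHz, as in the quote
    print(f"{cycle_time_ns(hz):.3f} ns per cycle")
    print(f"{light_distance_cm(hz):.1f} cm travelled by light per cycle")
```

At 3.0 GHz this works out to about a third of a nanosecond per cycle and just under 10 cm of light travel, which matches the quoted figures.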
Some real good stuff in the reddit comments, including a pointer to Cachegrind, Valgrind's cache and branch-prediction profiler. There's also a pretty good article on LWN about caches, including how best to optimize matrix math for the cache.
Of course, always remember the first two rules of optimization club:
1. You do not optimize before profiling.
2. You DO NOT OPTIMIZE BEFORE PROFILING.
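
In practice, "profile first" can be as simple as the sketch below, which uses Python's standard `cProfile` and `pstats` modules. The functions `slow_sum`, `fast_sum`, `work`, and `profile_report` are made-up stand-ins for illustration, not anything from the linked articles.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Sum of squares 0..n-1 the obvious way: one loop iteration per term."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum(n):
    """Same result via the closed-form formula for the sum of squares."""
    return (n - 1) * n * (2 * n - 1) // 6


def work():
    """Hypothetical workload: which call is worth optimizing?"""
    slow_sum(200_000)
    fast_sum(200_000)


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_report(work))
```

The report should show nearly all of the time going to `slow_sum`, so that is where any optimization effort belongs. For cache and branch-prediction behaviour in native code, Cachegrind plays the same role.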