Where do these performance differences come from? In the Himeno benchmark, the innermost loop in jacobi() consumes 99.7% of the total computation time. I disassembled that part of each executable and found that LLVM 2.2 emits very efficient x86 assembly. It is composed of move, add/sub and mul instructions. These three kinds of instructions can be executed in parallel on a Core2 CPU. (Core2 has an independent load/store unit, an additive FP ALU and a multiply FP ALU.)
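To give a feel for why the hot loop boils down to loads, adds/subs and muls, here is a rough sketch of a Jacobi-style stencil sweep. This is NOT the actual Himeno source: the grid size (NX, NY, NZ), the function name jacobi_sweep, and the plain 7-point averaging are all assumptions made for illustration. The real benchmark uses a larger stencil with per-point coefficient arrays.

```c
/* Hypothetical grid size, chosen small for illustration only. */
#define NX 8
#define NY 8
#define NZ 8

/*
 * One sweep of a simplified 7-point Jacobi stencil (an assumption,
 * not Himeno's real kernel). Reads p, writes relaxed values to out,
 * and returns the sum of squared residuals over the interior points.
 * Boundary points of out are left untouched.
 */
float jacobi_sweep(float p[NX][NY][NZ], float out[NX][NY][NZ], float omega)
{
    const float c = 1.0f / 6.0f;   /* weight for the six neighbours */
    float gosa = 0.0f;             /* accumulated squared residual */

    for (int i = 1; i < NX - 1; i++) {
        for (int j = 1; j < NY - 1; j++) {
            for (int k = 1; k < NZ - 1; k++) {
                /* loads + adds: gather the six neighbours */
                float s0 = p[i + 1][j][k] + p[i - 1][j][k]
                         + p[i][j + 1][k] + p[i][j - 1][k]
                         + p[i][j][k + 1] + p[i][j][k - 1];

                /* mul + sub: residual against the neighbour average */
                float ss = s0 * c - p[i][j][k];

                /* mul + add: accumulate the squared residual */
                gosa += ss * ss;

                /* mul + add + store: relaxed update */
                out[i][j][k] = p[i][j][k] + omega * ss;
            }
        }
    }
    return gosa;
}
```

Every statement in the loop body is a load, an add/sub, a mul or a store, so a compiler that schedules them well can keep all three execution units busy at once.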
Back in the 70's everyone argued that compilers would never produce faster code than hand-tweaked assembly. These days, you have to be awfully damn good at assembly to create faster code than an optimizing compiler can. So gee, I wonder what's going to happen in the future with virtual machines - which "everyone knows" will never be as fast as compiled code. (Especially when everyone has a multi-core CPU. Cuz we all know how great human beings are at thinking in parallel, don't we?)
(Another advantage of VMs - they can optimize code for cache coherency. Which can get you between one and two orders of magnitude increase in performance, if done exactly right.)
NB: Someone suggested that the tests be run again, this time passing "-march=core2" to gcc. That might produce better-optimized code and let gcc beat LLVM.