
Ben Cantrick
  Date: 2008-08-12 12:43
  Subject:   Guess which language is kicking C and C++'s ass at heavy duty computation?
Public
  Tags:  reddit

http://shootout.alioth.debian.org/u64q/benchmark.php?test=spectralnorm&lang=all

Nah, that can't possibly be right! We all know that Java will never be as fast as C. Especially on a quad-core machine. Never happen. Nope.



Ben Cantrick
  User: mackys
  Date: 2008-08-12 19:52 (UTC)
  Subject:   (no subject)
.5k vs 15k of RAM is utterly irrelevant on modern hardware. Hell, it was irrelevant on hardware from ten years ago.



Ben Cantrick
  User: mackys
  Date: 2008-08-12 20:35 (UTC)
  Subject:   (no subject)
You're right, they're in KB.

So it's half a meg vs 15 megs. Still irrelevant.



Alex Belits
  User: abelits
  Date: 2008-08-13 00:34 (UTC)
  Subject:   (no subject)
At this very moment I am sitting in front of a modern computer doing -- guess what? -- fitting a large calculation into 32K L1 cache. Memory footprint is VERY important in real-life computation.
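
(For illustration: "fitting a calculation into L1" usually means tiling the loops so that each block's working set stays under the cache size. The block size, row-major layout and function name below are assumptions made for the sketch, nothing from the commenter's actual code.)

#include <stddef.h>

#define BLOCK 32   /* a 32x32 tile of doubles is 8 KB, leaving room in a
                      32 KB L1 for the slices of x and y it works against */

/* y += A*x with A walked in BLOCK x BLOCK tiles; caller zeroes y first. */
void matvec_tiled(size_t n, const double *A, const double *x, double *y)
{
    for (size_t ii = 0; ii < n; ii += BLOCK)
        for (size_t jj = 0; jj < n; jj += BLOCK)
            for (size_t i = ii; i < ii + BLOCK && i < n; i++) {
                double sum = 0.0;
                for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                    sum += A[i * n + j] * x[j];
                y[i] += sum;
            }
}

The only point of the tiling is that the slice of x a tile touches gets reused across a whole block of rows while it is still resident in L1, instead of the whole x vector being streamed through the cache once per row.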

However, all this pales compared to the result I just got:

$ gcc -pipe -Wall -O3 -fomit-frame-pointer -lm spectralnorm.c -o spectralnorm.gcc_run
$ time ./spectralnorm.gcc_run 5500
1.274224153

real 0m20.740s
user 0m20.438s
sys 0m0.021s
$ gcc -pipe -Wall -O3 -fomit-frame-pointer -lm spectralnorm_optimized.c -o spectralnorm.gcc_run
$ time ./spectralnorm.gcc_run 5500
1.274224153

real 0m6.610s
user 0m6.143s
sys 0m0.269s
$


(test updated for cleaner conditions)

No srsly. I sacrificed even more memory and made the code a bit more blatantly parallelizable.




Ben Cantrick
  User: mackys
  Date: 2008-08-13 07:13 (UTC)
  Subject:   (no subject)
I sacrificed even more memory and made the code a bit more blatantly parallelizable.

I totally believe you can make lightning-fast C code to do many-thread parallelism.

I'm just not convinced that it's worth the trouble, when Java is already going so fast.



Alex Belits
  User: abelits
  Date: 2008-08-13 07:58 (UTC)
  Subject:   (no subject)
In calculations? Absolutely worth the trouble -- Java optimizes very poorly compared to even trivial manual optimization in C done with a known usage scenario in mind. In other situations? There would be no "trouble", because Java will fall over itself fixing the inefficiency of its own data structures in its runtime optimizer, while C and C++ do it at compile time (with or without the programmer's hints).

Please note that the only two things I optimized were storing values that were often re-calculated (which is obvious from the problem itself) and giving the compiler/CPU pipelines hints about possible parallelism; I didn't actually run multiple threads. That by itself produced a more than threefold increase in speed. I didn't even touch truly parallel calculation, use of multiple cores (parallelism in this program only works through multiple pipelines, and SIMD if it's compiled for it), cache optimization, etc. -- the things I have to do when I write "real" number-crunching code.
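
(A sketch of what those two changes could look like in the shootout's spectral-norm kernel, where A(i,j) = 1/((i+j)(i+j+1)/2 + i + 1). The table, helper names and loop structure below are guesses for illustration, not the actual optimized source; compile with -std=c99 or newer for "restrict".)

/* Cache the triangular term (i+j)(i+j+1)/2 that is otherwise recomputed
   for every (i,j) pair, and write the multiply so the compiler/CPU
   pipelines see independent work -- still strictly single-threaded. */
#include <stdlib.h>

static double *tri;   /* tri[k] = k*(k+1)/2, filled once for k < 2n */

static void init_tables(int n)
{
    tri = malloc(sizeof(double) * 2 * n);
    for (int k = 0; k < 2 * n; k++)
        tri[k] = (double)k * (k + 1) / 2.0;
}

/* Av = A*u for the infinite-matrix entries A(i,j) above. */
static void A_times_u(int n, const double *restrict u, double *restrict Av)
{
    for (int i = 0; i < n; i++) {
        double s0 = 0.0, s1 = 0.0;        /* independent partial sums */
        int j;
        for (j = 0; j + 1 < n; j += 2) {  /* adjacent terms don't depend on each other */
            s0 += u[j]     / (tri[i + j]     + i + 1);
            s1 += u[j + 1] / (tri[i + j + 1] + i + 1);
        }
        if (j < n)
            s0 += u[j] / (tri[i + j] + i + 1);
        Av[i] = s0 + s1;
    }
}

The idea is only that a gain of this kind can come purely from removing redundant arithmetic and feeding the pipelines independent operations, at the cost of a small table and with no threads involved.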

So no, I don't see this case -- trivially repeating calculations, a ridiculously large amount of data per iteration -- as evidence that Java is good for number-crunching. I see that it's easy to make Java use features that are more "difficult" to use from other languages (threads/processes to split the calculation between processors/cores), but it loses the more trivial performance-enhancing features -- and won't let me get them back no matter how well I optimize the Java code.

As usually happens with poorly designed "easy to use" technology, it lets you automatically achieve a minimal improvement through very complex methods, then hits an impenetrable wall that prevents any further improvement no matter how much effort is spent overcoming its limitations. I prefer solutions that let me use every mechanism that can be reasonably useful, as long as the required effort grows more slowly than the complexity of the mechanism. I expect to be hit with diminishing returns once I have made something reasonably efficient, not at the very beginning, where the oh-so-smart hidden optimization has already exhausted the expensive methods built into it while cutting me off from the cheap ones it can't use because of its own complexity.



Alex Belits
  User: abelits
  Date: 2008-08-13 07:24 (UTC)
  Subject:   (no subject)
The updated version correctly processes all possible sizes (though they still have to fit into memory for this thing to work).



Alex Belits
  User: abelits
  Date: 2008-08-13 19:51 (UTC)
  Subject:   (no subject)
I have implemented a multithreaded version of the same thing, and on a dual-core box (the same one as in the previous tests) it gave me this:

$ gcc -pipe -Wall -O3 -fomit-frame-pointer -lm spectralnorm_optimized_threads.c -o spectralnorm.gcc_run -lpthread
$ time ./spectralnorm.gcc_run 5500
1.274224153

real 0m4.538s
user 0m8.432s
sys 0m0.328s
$


(again, updated the test)

I admit that this version is actually unreasonably complex for anything but a situation where you absolutely, definitely have to have the best performance you can get from the given hardware. However, one has to admit that a 1.5x performance gain is often worth the complexity.

What is more important is to see this in perspective. The simple optimization, which stayed on one core (so the others can be used for something else) and ate some memory, gave 3 times the performance of the original version. The multithreaded version gives 4.6 times the original version's performance. In the best days of Moore's Law working for raw single-core performance, that would be 3 years and 4 months of hardware development.
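
(For reference, a minimal sketch of how the A*u pass might be split across POSIX threads -- the -lpthread above suggests pthreads. The thread count, struct and function names are assumptions, not the contents of spectralnorm_optimized_threads.c; the transposed pass would be partitioned the same way.)

#include <pthread.h>

#define NTHREADS 2   /* the dual-core box from the tests above */

/* Entry of the shootout's infinite matrix. */
static double eval_A(int i, int j)
{
    return 1.0 / ((i + j) * (i + j + 1) / 2 + i + 1);
}

struct job {
    int n, row_begin, row_end;
    const double *u;
    double *Av;
};

/* Each worker fills a contiguous block of output rows; the rows are
   independent, so no locking is needed. */
static void *worker(void *arg)
{
    struct job *jb = arg;
    for (int i = jb->row_begin; i < jb->row_end; i++) {
        double sum = 0.0;
        for (int j = 0; j < jb->n; j++)
            sum += eval_A(i, j) * jb->u[j];
        jb->Av[i] = sum;
    }
    return NULL;
}

static void A_times_u_mt(int n, const double *u, double *Av)
{
    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    int chunk = (n + NTHREADS - 1) / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        jobs[t] = (struct job){ n, t * chunk,
                                (t + 1) * chunk < n ? (t + 1) * chunk : n,
                                u, Av };
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

The join is the only synchronization, and the real vs. user time in the transcript above (4.5 s wall clock, 8.4 s user) is the pattern you'd expect when both cores are kept busy on work split this way.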




  User: nickhalfasleep
  Date: 2008-08-12 19:23 (UTC)
  Subject:   (no subject)
And imagine getting to the point where you do need to optimize what the optimizing compiler wrote. Or having a real numerical problem to attack and having a 1MB C binary.