"Doing linear scans over a hash is like clubbing someone to death with a loaded Uzi." -Larry Wall
So I'm writing some mail log parsing code for a very high volume customer. They get about 2.8 gigs an hour of mail logs. We thought that we'd have to write the parser in C, because of course interpreted languages couldn't possibly be fast enough to handle that kind of volume...
Long story short: the optimized perl script I have sitting in my homedir now is processing a 6.8 meg sample logfile in 0.745 seconds (+/- 0.01 sec). Doing the math, this is about 33 gigs an hour. Roughly ten times faster than they can generate logs.
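For anyone who wants to check the math, here is a minimal sketch of the conversion. The gigs_per_hour helper is a name made up for illustration, and it assumes 1 gig = 1024 megs, which the figures above don't actually specify. Plugging in the numbers from this post lands close to the figure quoted above.

```perl
use strict;
use warnings;

# Convert "sample size in megs, wall-clock seconds" into gigs per hour.
# Assumption: 1 gig = 1024 megs (the post does not say which convention it uses).
sub gigs_per_hour {
    my ($megs, $seconds) = @_;
    return $megs / $seconds * 3600 / 1024;
}

# Figures taken from the post: 6.8 meg sample, 0.745 second run,
# 2.8 gigs an hour of logs being generated.
my $rate = gigs_per_hour(6.8, 0.745);
printf "%.1f gigs/hour, %.1fx the incoming log volume\n", $rate, $rate / 2.8;
```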
How? Well, it wasn't too horribly difficult, actually. I sat down this morning and, in keeping with my philosophy of "always profile before optimizing, because you do not know where your code is slow", I looked up the docs on the Perl profiler. Why yes, you can profile Perl. The profiler has been a standard part of the Perl distribution since 5.6, actually.
Lessons? Interpreted languages may be fast enough if you're smart. Sometimes regexes can be too slow no matter how much you optimize them. Always profile before you optimize.
Edit: Final version is averaging 0.496 seconds for the test file over five runs. So, 14 megs a second, ~870 megs a minute, 48.8 gigs an hour. To get here, I had to revise my strategy. Having hit rock bottom on tuning, I decided that I simply needed to do less work. I added five or six next if(index($logline, "useless entry identifier") != -1) lines in the while(<>) main loop to skip lines that I knew contained no useful data. (index returns -1 when the substring isn't there, and -1 is true in Perl, so the explicit comparison matters.) That was good for another 30% speedup.
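A rough sketch of that early-skip idea, not the actual script: the marker strings, the sample log lines, and the is_noise and keep_useful names are all invented for illustration. It reads from an in-memory filehandle so it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical markers for lines that carry no useful data; the real
# identifiers are not given in this post.
my @skip_markers = ('useless entry identifier', 'another noise marker');

# True if the line contains any marker. index() returns -1 when the
# substring is absent, so compare against -1 instead of testing truthiness.
sub is_noise {
    my ($logline) = @_;
    for my $marker (@skip_markers) {
        return 1 if index($logline, $marker) != -1;
    }
    return 0;
}

# Read a filehandle and return only the lines worth parsing.
sub keep_useful {
    my ($fh) = @_;
    my @kept;
    while (my $logline = <$fh>) {
        next if is_noise($logline);    # cheap substring check before any real work
        push @kept, $logline;          # stand-in for the real parsing
    }
    return @kept;
}

# Demo on made-up log lines held in memory.
my $sample = join '',
    "host1 delivered msg=1\n",
    "host1 useless entry identifier blah\n",
    "host2 delivered msg=2\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
print for keep_useful($fh);
close $fh;
```

In a hot loop the separate next if lines written inline are likely cheaper than a sub call per line; the helper here is only to keep the sketch readable. Profile before deciding.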