Log in

Regex matching: NFAs vs DFAs. - Adventures in Engineering
The wanderings of a modern ronin.

Ben Cantrick
  Date: 2011-12-05 17:37
  Subject:   Regex matching: NFAs vs DFAs.
  Music:MC Plus+ - The Empty Set

Notice that Perl requires over sixty seconds to match a 29-character string. The other approach, labeled Thompson NFA for reasons that will be explained later, requires twenty microseconds to match the string. That's not a typo. The Perl graph plots time in seconds, while the Thompson NFA graph plots time in microseconds: the Thompson NFA implementation is a million times faster than Perl when running on a miniscule 29-character string. The trends shown in the graph continue: the Thompson NFA handles a 100-character string in under 200 microseconds, while Perl would require over 1015 years. (Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the above graph could have been Python, or PHP, or Ruby, or many other languages. A more detailed graph later in this article presents data for other implementations.)

It may be hard to believe the graphs: perhaps you've used Perl, and it never seemed like regular expression matching was particularly slow. Most of the time, in fact, regular expression matching in Perl is fast enough. As the graph shows, though, it is possible to write so-called "pathological" regular expressions that Perl matches very very slowly. In contrast, there are no regular expressions that are pathological for the Thompson NFA implementation. Seeing the two graphs side by side prompts the question, "why doesn't Perl use the Thompson NFA approach?" It can, it should, and that's what the rest of this article is about.


These aren't used much because the implementation is a little more difficult, and you have to give up some things, like backtracking. Still, if your program is limited by regex matching speed (most aren't, but just in case) you should consider using an NFA regex package, instead of the default DFA regexes that come with most languages/libraries.
Post A Comment | 3 Comments | Share | Link

  User: j_b
  Date: 2011-12-06 03:08 (UTC)
  Subject:   (no subject)
<@Stan> I learned something today.
Reply | Thread | Link

Willow: Candygram for Mongo!
  User: willow_red
  Date: 2011-12-06 16:40 (UTC)
  Subject:   (no subject)
Keyword:Candygram for Mongo!
NFA = Not For Amateurs ?
Reply | Thread | Link

Ben Cantrick
  User: mackys
  Date: 2011-12-07 09:10 (UTC)
  Subject:   (no subject)
Much worse... "Non-Deterministic Finite Automota".

Finite Automata was the class I liked the least in college. Talk about dry...
Reply | Parent | Thread | Link

May 2015