English 中文(简体)
Best practices for regex performance VS sheer iteration
原标题:

I was wondering if there are any general guidelines for when to use regex VS "string".contains("anotherString") and/or other String API calls?

While above given decision for .contains() is trivial (why bother with regex if you can do this in a single call), real life brings more complex choices to make. For example, is it better to do two .contains() calls or a single regex?

My rule of thumb was to always use regex, unless this can be replaced with a single API call. This prevents code against bloating, but is probably not so good from code readability point of view, especially if regex tends to get big.

Another, often overlooked, argument is performance. How do I know how many iterations (as in "Big O") does this regex require? Would it be faster than sheer iteration? Somehow everybody assumes that once regex looks shorter than 5 if statements, it must be quicker. But is this always the case? This is especially relevant if regex cannot be pre-compiled in advance.

最佳回答

The answer (as usual) is that it depends.

In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex s weaknesses. The "OR" in this case doesn t really know what the sub-patterns are trying to do and so can t easily optimize. It ends up doing the equivalent of (in pseudo code):

for( i = 0; i < stringLength; i++ ) {
    if( stringAt pos i starts with "this" )
       found!
    if( stringAt pos i starts with "that" )
       found!
}

There almost isn t a slower way to do it. In this case, two contains() calls will be much faster.

On the other hand, a full match on: ".*this.*|.*that.*" may optimize better.

To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with A or B and ending in g - m ... then use regex.

And then you won t be so worried about a few cycles here and there.

问题回答

RegexBuddy has a built-in regular expressions debugger. It shows how many steps the regular expression engine needed to find a match or to fail to find a match. By using the debugger on strings of different lengths, you can get an idea of the complexity (big O) of the regular expression. If you look up "benchmark" in the index of RegexBuddy s help file you ll get some more tips on how to interpret this.

When judging the performance of a regular expression, it is particularly important to test situations where the regex fails to find a match. It is very easy to write a regular expression that finds its matches in linear time, but fails in exponential time in a situation that I call catastrophic backtracking.

To use your 5 if statements as an example, the regex one|two|three|four|five scans the input string once, doing a little bit of extra work when an o, t, or f is encountered. But 5 if statements checking if the string contains a word will search the entire string 5 times if none of the words can be found. If five occurs at the start of the string, then the regex finds the match instantly, while the first 4 if statements scan the whole string in vain before the 5th if statement finds the match.

It s hard to estimate performance without using a profiler, generally the best strategy is to write what makes the most logical sense and is easier to understand/read. If two .contains() calls are easier to logically understand then that s the better route, the same logic applies if a regex makes more sense.

It s also important to consider that other developers on your team may not have a great understanding of regex. If at a later time in production the use of regex over .contains() (or vice versa) is identified as a bottleneck, try and profile both.

Rule of thumb: Write code to be readable, use a profiler to identify bottlenecks and only then replace the readable code with faster code.

I would strongly suggest that you write the code for both and time it. It s pretty simple to do this and you ll get an answers that is not a generic "rule of thumb" but instead a very specific answer that holds for your problem domain.

Vance Morrison has an excellent post about micro benchmarking, and has a tool that makes it really simple for you to answer questions like this...

http://msdn.microsoft.com/en-us/magazine/cc500596.aspx

If you want my personal "rule of thumb" then it s that RegEx is often slower for this sort of thing, but you should ignore me and measure it yourself :-)

If, for non-performance reasons, you continue to use Regular Expressions then I can really recommend two things. Get a profiler (such as ANTS) and see what your code does in production. Then, get a copy of the Regular Expression Cookbook...

http://www.amazon.co.uk/Regular-Expressions-Cookbook-Jan-Goyvaerts/dp/0596520689/ref=sr_1_1?ie=UTF8&s=books&qid=1259147763&sr=8-1

... as it has loads of tips on speeding up RegEx code. I ve optimized RegEx code by a factor of 10 following tips from this book.





相关问题
Spring Properties File

Hi have this j2ee web application developed using spring framework. I have a problem with rendering mnessages in nihongo characters from the properties file. I tried converting the file to ascii using ...

Logging a global ID in multiple components

I have a system which contains multiple applications connected together using JMS and Spring Integration. Messages get sent along a chain of applications. [App A] -> [App B] -> [App C] We set a ...

Java Library Size

If I m given two Java Libraries in Jar format, 1 having no bells and whistles, and the other having lots of them that will mostly go unused.... my question is: How will the larger, mostly unused ...

How to get the Array Class for a given Class in Java?

I have a Class variable that holds a certain type and I need to get a variable that holds the corresponding array class. The best I could come up with is this: Class arrayOfFooClass = java.lang....

SQLite , Derby vs file system

I m working on a Java desktop application that reads and writes from/to different files. I think a better solution would be to replace the file system by a SQLite database. How hard is it to migrate ...

热门标签