java 优化正则,Java正则表达式库是否针对任何字符进行优化。*？

独狼苏

于 2021-02-24 09:37:31 发布

阅读量95

点赞数

文章标签： java 优化正则

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.

Pattern pattern = Pattern.compile(regex);

But suppose I used a .* to specify any number of characters. So it's basically a wildcard.

Pattern pattern = Pattern.compile(".*");

Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*

解决方案

In your case, I could just use a possessive quantifier to avoid any backtracking:

.*+

The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.

Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster then my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).

If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search that string first and report a failure if it doesn't find a match, without checking the entire regular expression.

Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.

Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match

If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().

Also remember that you can re-use the Matcher object for different input strings by calling the method reset().

Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).

Whenever you are using negated character classes to match something other than something else, use possessive quantifiers: instead of [^a]*a use [^a]*+a.

Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!

Beware of a known bug #5050507 (when the regex Pattern class throws a StackOverflowError), if you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can also sometimes even increase performance.

Instead of lazy dot matching, use tempered greedy token (e.g. (?:(?!something).)*) or unrolling the loop techinque (got downvoted for it today, no idea why).

Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

独狼苏

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

java 优化 正则,Java正则表达式库是否针对任何字符进行优化。*？

java 优化正则,Java正则表达式库是否针对任何字符进行优化。*？