Java regex lazy matching: a Java regular expression matching start/end tags causes a stack overflow

The standard implementation of the Java Pattern class uses recursion to implement many forms of regular expressions (e.g., certain operators, alternation).

This approach causes stack overflow issues with input strings that exceed a (relatively small) length, which may not even be more than 1,000 characters, depending on the regex involved.

A typical example of this is the following regex using alternation to extract a possibly multiline element (named Data) from a surrounding XML string, which has already been supplied:

<Data>(?<data>(?:.|\r|\n)+?)</Data>

The above regex is used with the Matcher.find() method to read the "data" capturing group. It works as expected until the length of the supplied input string exceeds 1,200 characters or so, at which point it causes a stack overflow.

Can the above regex be rewritten to avoid the stack overflow issue?

Solution

Sometimes the regex Pattern class will throw a StackOverflowError. This is a manifestation of the known bug #5050507, which has been in the java.util.regex package since Java 1.4. The bug is here to stay because it has "won't fix" status. This error occurs because the Pattern class compiles a regular expression into a small program which is then executed to find a match. This program is used recursively, and sometimes when too many recursive calls are made this error occurs. See the description of the bug for more details. It seems it's triggered mostly by the use of alternations.

Your regex (that has alternations) is matching any 1+ characters between two tags.

You may either use a lazy dot matching pattern with the Pattern.DOTALL modifier (or the equivalent embedded flag (?s)) that will make the . match newline symbols as well:

(?s)<Data>(?<data>.+?)</Data>
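A minimal sketch of the lazy dot approach follows, assuming the element is literally written as <Data>...</Data>. The class name DotAllDataExtractor, the extract helper, and the sample input are made up for illustration and are not part of the original question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDataExtractor {
    // (?s) turns on DOTALL, so '.' also matches line terminators and the
    // (?:.|\r|\n) alternation is no longer needed.
    private static final Pattern DATA =
            Pattern.compile("(?s)<Data>(?<data>.+?)</Data>");

    // Returns the content of the first Data element, or null if there is none.
    static String extract(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with a multiline element.
        String xml = "<Root>\n<Data>line 1\nline 2</Data>\n</Root>";
        System.out.println(extract(xml));
    }
}
```

Passing Pattern.DOTALL as the second argument to Pattern.compile should behave the same as the embedded (?s) flag used here.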

However, lazy dot matching patterns still consume lots of memory in case of huge inputs. The best way out is to use an unroll-the-loop method:

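<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>

Details:

<Data> - literal text

(?<data> - start of the capturing group "data"

[^<]* - zero or more characters other than <

(?:<(?!/?Data>)[^<]*)* - zero or more sequences of:

<(?!/?Data>) - a < that is not followed with Data> or /Data>

[^<]* - zero or more characters other than <

) - end of the "data" group

</Data> - closing delimiter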
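The unrolled pattern can be tried on an input much longer than the roughly 1,200 characters mentioned in the question. This is only a sketch: the class name UnrolledDataExtractor, the extract helper, and the generated payload are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnrolledDataExtractor {
    // Unrolled loop: runs of non-'<' characters, interleaved with any '<'
    // that does not begin <Data> or </Data>.
    private static final Pattern DATA = Pattern.compile(
            "<Data>(?<data>[^<]*(?:<(?!/?Data>)[^<]*)*)</Data>");

    // Returns the content of the first Data element, or null if there is none.
    static String extract(String xml) {
        Matcher m = DATA.matcher(xml);
        return m.find() ? m.group("data") : null;
    }

    public static void main(String[] args) {
        // Hypothetical multiline payload, tens of thousands of characters long.
        StringBuilder payload = new StringBuilder();
        for (int i = 0; i < 5000; i++) {
            payload.append("line ").append(i).append('\n');
        }
        payload.append("<b>other tags are allowed inside</b>");

        String xml = "<Root><Data>" + payload + "</Data></Root>";
        String data = extract(xml);
        // Reports whether the whole payload was captured.
        System.out.println(data != null && data.contentEquals(payload));
    }
}
```

Note that the outer group still iterates once for every < inside the element, so an input containing a very large number of nested tags is worth testing separately. Unlike the .+? version, this pattern also matches an empty element.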
