[Translation] High Performance JavaScript (016)

Regular Expression Optimization

 

    Incautiously crafted regexes can be a major performance bottleneck (the upcoming section, "Runaway Backtracking," contains several examples showing how severe this can be), but there is a lot you can do to improve regex efficiency. Just because two regexes match the same text doesn't mean they do so at the same speed.

 

    Many factors affect a regex's efficiency. For starters, the text a regex is applied to makes a big difference because regexes spend more time on partial matches than obvious nonmatches. Each browser's regex engine also has different internal optimizations.


 

    Regex optimization is a fairly broad and nuanced topic. There's only so much that can be covered in this section, but what's included should put you well on your way to understanding the kinds of issues that affect regex performance and mastering the art of crafting efficient regexes.


 

    Note that this section assumes you already have some experience with regular expressions and are primarily interested in how to make them faster. If you're new to regular expressions or need to brush up on the basics, numerous resources are available on the Web and in print. Regular Expressions Cookbook (O'Reilly) by Jan Goyvaerts and Steven Levithan (that's me!) is written for people who like to learn by doing, and covers JavaScript and several other programming languages equally.


 

How Regular Expressions Work

 

    In order to use regular expressions efficiently, it's important to understand how they work their magic. The following is a quick rundown of the basic steps a regex goes through:


 

Step 1: Compilation


 

    When you create a regex object (using a regex literal or the RegExp constructor), the browser checks your pattern for errors and then converts it into a native code routine that is used to actually perform matches. If you assign your regex to a variable, you can avoid performing this step more than once for a given pattern.

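    As a minimal sketch of the compilation advice, the regex below is created once and reused across calls. The names DATE_PATTERN, isIsoDate, and isIsoDateSlow are hypothetical, and modern engines may cache compiled patterns on their own, so treat this as a habit rather than a guaranteed speedup.

```javascript
// Created once, when this code is first evaluated.
const DATE_PATTERN = /^\d{4}-\d{2}-\d{2}$/;

function isIsoDate(text) {
  // Reuses the same regex object on every call.
  return DATE_PATTERN.test(text);
}

function isIsoDateSlow(text) {
  // Builds a new RegExp object (and re-parses the pattern) on every call.
  return new RegExp("^\\d{4}-\\d{2}-\\d{2}$").test(text);
}

console.log(isIsoDate("2010-03-15"));     // true
console.log(isIsoDateSlow("2010-03-15")); // true
console.log(isIsoDate("15/03/2010"));     // false
```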

 

Step 2: Setting the starting position


 

    When a regex is put to use, the first step is to determine the position within the target string where the search should start. This is initially the start of the string or the position specified by the regex's lastIndex property, but when returning here from step 4 (due to a failed match attempt), the position is one character after where the last attempt started.

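    A small sketch of how lastIndex sets the starting position. Note that lastIndex is only honored when the regex has the g (global) or y (sticky) flag; the sample string here is made up for illustration.

```javascript
const text = "abc abc abc";
const globalRe = /abc/g;

globalRe.lastIndex = 2;             // ask the search to start at index 2
const hit = globalRe.exec(text);
console.log(hit.index);             // 4
console.log(globalRe.lastIndex);    // 7 (just past the end of the match)

const plainRe = /abc/;
plainRe.lastIndex = 2;              // ignored without the g or y flag
console.log(plainRe.exec(text).index); // 0
```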

 

    Optimizations that browser makers build into their regex engines can help avoid a lot of unnecessary work at this stage by deciding early that certain work can be skipped. For instance, if a regex starts with ^, IE and Chrome can usually determine that a match cannot be found after the start of a string and avoid foolishly searching subsequent positions. Another example is that if all possible matches contain x as the third character, a smart implementation may be able to determine this, quickly search for the next x, and set the starting position two characters back from where it's found (e.g., recent versions of Chrome include this optimization).


 

Step 3: Matching each regex token


 

    Once the regex knows where to start, it steps through the text and the regex pattern. When a particular token fails to match, the regex tries to backtrack to a prior point in the match attempt and follow other possible paths through the regex.


 

Step 4: Success or failure


 

    If a complete match is found at the current position in the string, the regex declares success. If all possible paths through the regex have been attempted but a match was not found, the regex engine goes back to step 2 to try again at the next character in the string. Only after this cycle completes for every character in the string (as well as the position after the last character) and no matches have been found does the regex declare overall failure.

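    The loop through steps 2 to 4 can be imitated in plain JavaScript. This is only a sketch of the idea, not how any engine is implemented: the hypothetical helper findFirst uses the sticky (y) flag so that each attempt is allowed to match only at the chosen starting position.

```javascript
// Hypothetical helper: try a match at each start position in turn.
function findFirst(pattern, text) {
  const flags = pattern.flags.replace(/[gy]/g, "") + "y";
  const anchored = new RegExp(pattern.source, flags);
  for (let start = 0; start <= text.length; start++) {
    anchored.lastIndex = start;         // step 2: set the starting position
    const match = anchored.exec(text);  // step 3: match tokens from here only
    if (match) {
      return { index: start, text: match[0] }; // step 4: success
    }
    // step 4: failure at this position, so go around again
  }
  return null;                          // overall failure
}

console.log(findFirst(/cat/, "concatenate")); // { index: 3, text: "cat" }
console.log(findFirst(/$/, "abc"));           // { index: 3, text: "" }
console.log(findFirst(/dog/, "concatenate")); // null
```

    The second call shows why the position after the last character is also tried: that is the only place where $ can match.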

 

    Keeping this process in mind will help you make informed decisions about the types of issues that affect regex performance. Next up is a deeper look into a key feature of the matching process in step 3: backtracking.


 

Understanding Backtracking

 

    In most modern regex implementations (including those required by JavaScript), backtracking is a fundamental component of the matching process. It's also a big part of what makes regular expressions so expressive and powerful. However, backtracking is computationally expensive and can easily get out of hand if you're not careful. Although backtracking is only part of the overall performance equation, understanding how it works and how to minimize its use is perhaps the most important key to writing efficient regexes. The next few sections therefore cover the topic at some length.


 

    As a regex works its way through a target string, it tests whether a match can be found at each position by stepping through the components in the regex from left to right. For each quantifier and alternation, a decision must be made about how to proceed. With a quantifier (such as *, +?, or {2,}), the regex must decide when to try matching additional characters, and with alternation (via the | operator), it must try one option from those available.


 

    Each time the regex makes such a decision, it remembers the other options to return to later if necessary. If the chosen option is successful, the regex continues through the regex pattern, and if the remainder of the regex is also successful, the match is complete. But if the chosen option can't find a match or anything later in the regex fails, the regex backtracks to the last decision point where untried options remain and chooses one. It continues on like this until a match is found or all possible permutations of the quantifiers and alternation options in the regex have been tried unsuccessfully, at which point it gives up and moves on to start this process all over at the next character in the string.


 

Alternation and backtracking

 

    Here's an example that demonstrates how this process plays out with alternation.


 

/h(ello|appy) hippo/.test("hello there, happy hippo");

 

    This regex matches "hello hippo" or "happy hippo". It starts this test by searching for an h, which it finds immediately as the first character in the target string. Next, the subexpression (ello|appy) provides two ways to proceed. The regex chooses the leftmost option (alternation always works from left to right), and checks whether ello matches the next characters in the string. It does, and the regex is also able to match the following space character. At that point, though, it reaches a dead end because the h in hippo cannot match the t that comes next in the string. The regex can't give up yet, though, because it hasn't tried all of its options, so it backtracks to the last decision point (just after it matched the leading h) and tries to match the second alternation option. That doesn't work, and since there are no more options to try, the regex determines that a match cannot be found starting from the first character in the string and moves on to try again at the second character. It doesn't find an h there, so it keeps advancing (the h in "there" is found along the way, but neither alternative matches after it) until it reaches the 14th character, where it matches the h in "happy". It then steps through the alternatives again. This time ello doesn't match, but after backtracking and trying the second alternative, it's able to continue until it matches the full string "happy hippo" (see Figure 5-4). Success.
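    To check the outcome of this walkthrough, exec can be used instead of test; it reports where the match starts and which alternative ended up in the capturing group.

```javascript
const result = /h(ello|appy) hippo/.exec("hello there, happy hippo");

console.log(result.index); // 13 (zero-based, i.e., the 14th character)
console.log(result[0]);    // "happy hippo"
console.log(result[1]);    // "appy" (the second alternative)
```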

Figure 5-4. Example of backtracking with alternation


 

Repetition and backtracking

 

    This next example shows how backtracking works with repetition quantifiers.


 

var str = "<p>Para 1.</p>" +
          "<img src='smiley.jpg'>" +
          "<p>Para 2.</p>" +
          "<div>Div.</div>";
/<p>.*<\/p>/i.test(str);

 

    Here, the regex starts by matching the three literal characters <p> at the start of the string. Next up is .*. The dot matches any character except line breaks, and the greedy asterisk quantifier repeats it zero or more times, as many times as possible. Since there are no line breaks in the target string, this gobbles up the rest of the string! There's still more to match in the regex pattern, though, so the regex tries to match <. This doesn't work at the end of the string, so the regex backtracks one character at a time, continually trying to match <, until it gets back to the < at the beginning of the </div> tag. It then tries to match \/ (an escaped forward slash), which works, followed by p, which doesn't. The regex backtracks again, repeating this process until it eventually matches the </p> at the end of the second paragraph. The match is returned successfully, spanning from the start of the first paragraph until the end of the last one, which is probably not what you wanted.

 

    You can change the regex to match individual paragraphs by replacing the greedy * quantifier with the lazy (aka nongreedy) *?. Backtracking for lazy quantifiers works in the opposite way. When the regex /<p>.*?<\/p>/ comes to the .*?, it first tries to skip this altogether and move on to matching <\/p>. It does so because *? repeats its preceding element zero or more times, as few times as possible, and the fewest possible times it can repeat is zero. However, when the following < fails to match at this point in the string, the regex backtracks and tries to match the next fewest number of characters: one. It continues backtracking forward like this until the <\/p> that follows the quantifier is able to fully match at the end of the first paragraph.
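    The difference between the two quantifiers is easy to see by printing what each version actually matches against the same target string:

```javascript
const str = "<p>Para 1.</p>" +
            "<img src='smiley.jpg'>" +
            "<p>Para 2.</p>" +
            "<div>Div.</div>";

const greedyMatch = str.match(/<p>.*<\/p>/i)[0];
const lazyMatch = str.match(/<p>.*?<\/p>/i)[0];

console.log(greedyMatch); // runs from the first <p> through the second </p>
console.log(lazyMatch);   // "<p>Para 1.</p>"
```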

 

    You can see that even if there was only one paragraph in the target string and therefore the greedy and lazy versions of this regex were equivalent, they would go about finding their matches differently (see Figure 5-5).



Figure 5-5. Example of backtracking with greedy and lazy quantifiers


 

Runaway Backtracking

 

    When a regular expression stalls your browser for seconds, minutes, or longer, the problem is most likely a bad case of runaway backtracking. To demonstrate the problem, consider the following regex, which is designed to match an entire HTML file. The regex is wrapped across multiple lines in order to fit the page. Unlike most other regex flavors, JavaScript at the time of writing did not have an option to make dots match any character, including line breaks (the s flag was only added later, in ES2018), so this example uses [\s\S] to match any character.

 

/<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head>
[\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/

 

    This regex works fine when matching a suitable HTML string, but it turns ugly when the string is missing one or more required tags. If the </html> tag is missing, for instance, the last [\s\S]*? expands to the end of the string since there is no </html> tag to be found, and then, instead of giving up, the regex sees that each of the previous [\s\S]*? sequences remembered backtracking positions that allow them to expand further. The regex tries expanding the second-to-last [\s\S]*? (using it to match the </body> tag that was previously matched by the literal <\/body> pattern in the regex) and continues to expand it in search of a second </body> tag until the end of the string is reached again. When all of that fails, the third-to-last [\s\S]*? expands to the end of the string, and so on.
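    For experimenting, here is the same regex joined onto one line so that it is a valid literal. The sample documents are invented and deliberately tiny; on inputs this small the failure is still fast, and the cost described above only becomes noticeable as the input grows.

```javascript
const lazyHtml = /<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head>[\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/;

const complete = "<html><head><title>Hi</title></head><body><p>Text</p></body></html>";
const truncated = complete.replace("</html>", ""); // hypothetical broken input

console.log(lazyHtml.test(complete));  // true
console.log(lazyHtml.test(truncated)); // false, after each lazy quantifier has been expanded in turn
```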

 

The solution: Be specific

 

    The way around a problem like this is to be as specific as possible about what characters can be matched between your required delimiters. Take the pattern ".*?", which is intended to match a string delimited by double-quotes. By replacing the overly permissive .*? with the more specific [^"\r\n]*, you remove the possibility that backtracking will force the dot to match a double-quote and expand beyond what was intended.
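    A short illustration of the dot being forced past a closing quote. The trailing ! in both patterns and the sample input are assumptions added for this sketch; they simply give the regex something that fails after the first quoted string, which triggers the backtracking.

```javascript
const input = '"a" "b"!';

const permissive = /".*?"!/;       // the lazy dot can be pushed across quotes
const specific = /"[^"\r\n]*"!/;   // the class can never match a quote

console.log(input.match(permissive)[0]); // '"a" "b"!' (spans both strings)
console.log(input.match(specific)[0]);   // '"b"!'
```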

 

    With the HTML example, this workaround is not as simple. You can't use a negated character class like [^<] in place of [\s\S] because there may be other tags between those you're searching for. However, you can reproduce the effect by repeating a noncapturing group that contains a negative lookahead (blocking the next required tag) and the [\s\S] (any character) metasequence. This ensures that the tags you're looking for fail at every intermediate position, and, more importantly, that the [\s\S] patterns cannot expand beyond where the tags you are blocking via negative lookahead are found. Here's how the regex ends up looking using this approach:

 

/<html>(?:(?!<head>)[\s\S])*<head>(?:(?!<title>)[\s\S])*<title>
(?:(?!<\/title>)[\s\S])*<\/title>(?:(?!<\/head>)[\s\S])*<\/head>
(?:(?!<body>)[\s\S])*<body>(?:(?!<\/body>)[\s\S])*<\/body>
(?:(?!<\/html>)[\s\S])*<\/html>/

 

    Although this removes the potential for runaway backtracking and allows the regex to fail at matching incomplete HTML strings in linear time, it's not going to win any awards for efficiency. Repeating a lookahead for each matched character like this is rather inefficient in its own right and significantly slows down successful matches. This approach works well enough when matching short strings, but since in this case the lookaheads may need to be tested thousands of times in order to match an HTML file, there's another solution that works better. It relies on a little trick, and it's described next.

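    Because the lookahead-based regex is long and repetitive, one way to try it out is to assemble it from its parts. The helper upTo and the sample documents are hypothetical; the generated pattern is meant to be equivalent to the regex shown above.

```javascript
// Hypothetical helper: "any characters that do not start `tag`, then `tag`".
function upTo(tag) {
  return "(?:(?!" + tag + ")[\\s\\S])*" + tag;
}

const requiredTags = ["<head>", "<title>", "</title>", "</head>",
                      "<body>", "</body>", "</html>"];
const strictHtml = new RegExp("<html>" + requiredTags.map(upTo).join(""));

const complete = "<html><head><title>Hi</title></head><body><p>Text</p></body></html>";
const truncated = complete.replace("</html>", "");

console.log(strictHtml.test(complete));  // true
console.log(strictHtml.test(truncated)); // false
```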

 

Emulating atomic groups using lookahead and backreferences

 

    Some regex flavors, including .NET, Java, Oniguruma, PCRE, and Perl, support a feature called atomic grouping. Atomic groups, written as (?>…) where the ellipsis represents any regex pattern, are noncapturing groups with a special twist. As soon as a regex exits an atomic group, any backtracking positions within the group are thrown away. This provides a much better solution to the HTML regex's backtracking problem: if you were to place each [\s\S]*? sequence and its following HTML tag together inside an atomic group, then every time one of the required HTML tags was found, the match thus far would essentially be locked in. If a later part of the regex failed to match, no backtracking positions would be remembered for the quantifiers within the atomic groups, and thus the [\s\S]*? sequences could not attempt to expand beyond what they already matched.

 

    That's great, but JavaScript does not support atomic groups or provide any other feature to eliminate needless backtracking. It turns out, though, that you can emulate atomic groups by exploiting a little-known behavior of lookahead: that lookaheads are atomic groups. The difference is that lookaheads don't consume any characters as part of the overall match; they merely check whether the pattern they contain can be matched at that position. However, you can get around this by wrapping a lookahead's pattern inside a capturing group and adding a backreference to it just outside the lookahead. Here's what this looks like:

 

(?=(pattern to make atomic))\1

 

    This construct is reusable in any pattern where you want to use an atomic group. Just keep in mind that you need to use the appropriate backreference number if your regex contains more than one capturing group.

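    A tiny example makes the locking behavior visible. The patterns and the input "aaab" are chosen only for illustration: an ordinary a+ can give back one a so that the literal ab still matches, while the emulated atomic version cannot.

```javascript
const ordinary = /a+ab/;            // a+ may backtrack and release an "a"
const emulated = /(?=(a+))\1ab/;    // a+ is locked in once the lookahead exits

console.log(ordinary.test("aaab")); // true
console.log(emulated.test("aaab")); // false: no "a" is left over for "ab"
console.log(emulated.test("aaa b aab")); // false for the same reason
```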

 

    Here's how this looks when applied to the HTML regex:


 

/<html>(?=([\s\S]*?<head>))\1(?=([\s\S]*?<title>))\2(?=([\s\S]*?
<\/title>))\3(?=([\s\S]*?<\/head>))\4(?=([\s\S]*?<body>))\5
(?=([\s\S]*?<\/body>))\6[\s\S]*?<\/html>/

 

    Now, if there is no trailing </html> and the last [\s\S]*? expands to the end of the string, the regex immediately fails because there are no backtracking points to return to. Each time the regex finds an intermediate tag and exits a lookahead, it throws away all backtracking positions from within the lookahead. The following backreference simply rematches the literal characters found within the lookahead, making them a part of the actual match.
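    The wrapped regex above is not a valid literal as printed, so here is one way to assemble an equivalent pattern in a runnable form. The variable names and sample documents are assumptions made for this sketch.

```javascript
// Tags that must appear, in order, between <html> and </html>.
const innerTags = ["<head>", "<title>", "</title>", "</head>", "<body>", "</body>"];

// Each tag gets its own emulated atomic group: (?=([\s\S]*?TAG))\N
const atomicParts = innerTags.map(
  (tag, i) => "(?=([\\s\\S]*?" + tag + "))\\" + (i + 1)
);
const atomicHtml = new RegExp("<html>" + atomicParts.join("") + "[\\s\\S]*?</html>");

const complete = "<html><head><title>Hi</title></head><body><p>Text</p></body></html>";
const truncated = complete.replace("</html>", "");

console.log(atomicHtml.test(complete));  // true
console.log(atomicHtml.test(truncated)); // false, without revisiting earlier tags
```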

 

Nested quantifiers and runaway backtracking

 

    So-called nested quantifiers always warrant extra attention and care in order to ensure that you're not creating the potential for runaway backtracking. A quantifier is nested when it occurs within a grouping that is itself repeated by a quantifier (e.g., (x+)*).


 

    Nesting quantifiers is not actually a performance hazard in and of itself. However, if you're not careful, it can easily create a massive number of ways to divide text between the inner and outer quantifiers while attempting to match a string.


 

    As an example, let's say you want to match HTML tags, and you come up with the following regex:


 

/<(?:[^>"']|"[^"]*"|'[^']*')*>/

 

    This is perhaps overly simplistic, as it does not handle all cases of valid and invalid markup correctly, but it might work OK if used to process only snippets of valid HTML. Its advantage over even more naive solutions such as /<[^>]*>/ is that it accounts for > characters that occur within attribute values. It does so using the second and third alternatives in the noncapturing group, which match entire double- and single-quoted attribute values in single steps, allowing all characters except their respective quote type to occur within them.

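    The advantage over the naive pattern shows up as soon as an attribute value contains a > character. The sample tag below is made up for the comparison.

```javascript
const tag = '<a title="a > b" href="#">';

const naive = /<[^>]*>/;
const quoteAware = /<(?:[^>"']|"[^"]*"|'[^']*')*>/;

console.log(tag.match(naive)[0]);      // '<a title="a >' (cut off inside the attribute)
console.log(tag.match(quoteAware)[0]); // the whole tag
```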

 

    So far, there's no risk of runaway backtracking, despite the nested * quantifiers. The second and third alternation options match exactly one quoted string sequence per repetition of the group, so the potential number of backtracking points increases linearly with the length of the target string.


 

    However, look at the first alternative in the noncapturing group: [^>"']. This can match only one character at a time, which seems a little inefficient. You might think it would be better to add a + quantifier at the end of this character class so that more than one suitable character can be matched during each repetition of the group—and at positions within the target string where the regex finds a match—and you'd be right. By matching more than one character at a time, you'd let the regex skip many unnecessary steps on the way to a successful match.


 

    What might not be as readily apparent is the negative consequence such a change could lead to. If the regex matches an opening < character, but there is no following > that would allow the match attempt to complete successfully, runaway backtracking will kick into high gear because of the huge number of ways the new inner quantifier can be combined with the outer quantifier (following the noncapturing group) to match the text that follows <. The regex must try all of these permutations before giving up on the match attempt. Watch out!

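    The effect can be observed with a rough timing sketch. The helper timeTest and the input length of 20 are assumptions; the numbers printed will vary by engine and machine, so no particular timings are claimed here. Increase the length only in small steps, since the work for the risky version can roughly double with each extra character.

```javascript
const safeTag = /<(?:[^>"']|"[^"]*"|'[^']*')*>/;
const riskyTag = /<(?:[^>"']+|"[^"]*"|'[^']*')*>/; // note the added +

// Hypothetical helper: run one test and report the result and elapsed time.
function timeTest(regex, text) {
  const start = Date.now();
  const matched = regex.test(text);
  return { matched, ms: Date.now() - start };
}

const unclosed = "<" + "x".repeat(20); // an opening < with no closing >

console.log(timeTest(safeTag, unclosed));   // matched: false
console.log(timeTest(riskyTag, unclosed));  // matched: false, typically after far more work
console.log(riskyTag.test('<a href="#">')); // true: valid tags still match
```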

 

From bad to worse

 

    For an even more extreme example of nested quantifiers resulting in runaway backtracking, apply the regex /(A+A+)+B/ to a string containing only As. Although this regex would be better written as /AA+B/, for the sake of discussion imagine that the two As represent different patterns that are capable of matching some of the same strings.


 

    When applied to a string composed of 10 As ("AAAAAAAAAA"), the regex starts by using the first A+ to match all 10 characters. The regex then backtracks one character, letting the second A+ match the last one. The grouping then tries to repeat, but since there are no more As and the group's + quantifier has already met its requirement of matching at least once, the regex then looks for the B. It doesn't find it, but it can't give up yet, since there are more paths through the regex that haven't been tried. What if the first A+ matched eight characters and the second matched two? Or if the first matched three characters, the second matched two, and the group repeated twice? How about if during the first repetition of the group, the first A+ matched two characters and the second matched three; then on the second repetition the first matched one and the second matched four? Although to you and me it's obviously silly to think that any amount of backtracking will produce the missing B, the regex will dutifully check all of these futile options and a lot more. The worst-case complexity of this regex is an appalling O(2^n), or two to the nth power, where n is the length of the string. With the 10 As used here, the regex requires 1,024 backtracking steps for the match to fail, and with 20 As, that number explodes to more than a million. Thirty-five As should be enough to hang Chrome, IE, Firefox, and Opera for at least 10 minutes (if not permanently) while they process the more than 34 billion backtracking steps required to invalidate all permutations of the regex. The exception is recent versions of Safari, which are able to detect that the regex is going in circles and quickly abort the match (Safari also imposes a cap of allowed backtracking steps, and aborts match attempts when this is exceeded).
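    The step counts quoted above are simply powers of two, which the sketch below prints next to a rough timing for a few small input lengths. The lengths are kept deliberately small, and the elapsed times depend entirely on the engine, so none are claimed here.

```javascript
const runaway = /(A+A+)+B/;

for (const n of [10, 15, 20]) {
  const input = "A".repeat(n);
  const start = Date.now();
  const matched = runaway.test(input);  // always false: there is no B
  const elapsed = Date.now() - start;
  console.log(n, 2 ** n, matched, elapsed + " ms");
}

console.log(2 ** 10); // 1024
console.log(2 ** 20); // 1048576
console.log(2 ** 35); // 34359738368
```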

 

    The key to preventing this kind of problem is to make sure that two parts of a regex cannot match the same part of a string. For this regex, the fix is to rewrite it as /AA+B/, but the issue may be harder to avoid with complex regexes. Adding an emulated atomic group often works well as a last resort, although other solutions, when possible, will most likely keep your regexes easier to understand. Doing so for this regex looks like /((?=(A+A+))\2)+B/, and completely removes the backtracking problem.
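    Both fixes can be tried against the 35-character input that stalls the original regex. This is a quick check rather than a benchmark; the variable names are made up for the sketch.

```javascript
const rewritten = /AA+B/;
const emulatedAtomic = /((?=(A+A+))\2)+B/;

const longRun = "A".repeat(35); // no B, so neither regex can match

console.log(rewritten.test(longRun));      // false
console.log(emulatedAtomic.test(longRun)); // false
console.log(rewritten.test("AAAAB"));      // true
console.log(emulatedAtomic.test("AAAAB")); // true
```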
