[Translation] High Performance JavaScript (016)

Regular Expression Optimization


    Incautiously crafted regexes can be a major performance bottleneck (the upcoming section, "Runaway Backtracking", contains several examples showing how severe this can be), but there is a lot you can do to improve regex efficiency. Just because two regexes match the same text doesn't mean they do so at the same speed.



    Many factors affect a regex's efficiency. For starters, the text a regex is applied to makes a big difference because regexes spend more time on partial matches than obvious nonmatches. Each browser's regex engine also has different internal optimizations.



    Regex optimization is a fairly broad and nuanced topic. There's only so much that can be covered in this section, but what's included should put you well on your way to understanding the kinds of issues that affect regex performance and mastering the art of crafting efficient regexes.



    Note that this section assumes you already have some experience with regular expressions and are primarily interested in how to make them faster. If you're new to regular expressions or need to brush up on the basics, numerous resources are available on the Web and in print. Regular Expressions Cookbook (O'Reilly) by Jan Goyvaerts and Steven Levithan (that's me!) is written for people who like to learn by doing, and covers JavaScript and several other programming languages equally.



How Regular Expressions Work


    In order to use regular expressions efficiently, it's important to understand how they work their magic. The following is a quick rundown of the basic steps a regex goes through:



Step 1: Compilation



    When you create a regex object (using a regex literal or the RegExp constructor), the browser checks your pattern for errors and then converts it into a native code routine that is used to actually perform matches. If you assign your regex to a variable, you can avoid performing this step more than once for a given pattern.
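    As a rough sketch of the advice above, the following contrasts rebuilding a regex on every iteration with reusing one stored in a variable. The function names and the sample pattern are hypothetical, chosen only for illustration, and how much the reuse actually saves depends on the engine.

```javascript
// Hypothetical example: count the lines that contain a run of digits.

// Creates a new regex object on every iteration of the loop.
function countNumericLinesSlow(lines) {
  var count = 0;
  for (var i = 0; i < lines.length; i++) {
    if (new RegExp("\\d+").test(lines[i])) {
      count++;
    }
  }
  return count;
}

// Creates the regex once, assigns it to a variable, and reuses it.
var digitRun = /\d+/;

function countNumericLines(lines) {
  var count = 0;
  for (var i = 0; i < lines.length; i++) {
    if (digitRun.test(lines[i])) {
      count++;
    }
  }
  return count;
}
```

    Both functions return the same count; only the number of regex objects created differs.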



Step 2: Setting the starting position



    When a regex is put to use, the first step is to determine the position within the target string where the search should start. This is initially the start of the string or the position specified by the regex's lastIndex property, but when returning here from step 4 (due to a failed match attempt), the position is one character after where the last attempt started.
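    The role of lastIndex can be seen in a short sketch. The variable names and the sample string are illustrative assumptions; note that lastIndex is only honored when the regex has the g (or y) flag.

```javascript
var re = /a/g;                 // the g flag makes exec() honor lastIndex
var text = "banana";

re.lastIndex = 2;              // start searching at index 2 (the first "n")
var match = re.exec(text);

var foundAt = match.index;     // 3: the second "a" in "banana"
var nextStart = re.lastIndex;  // 4: just past the end of the match
```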



    Optimizations that browser makers build into their regex engines can help avoid a lot of unnecessary work at this stage by deciding early that certain work can be skipped. For instance, if a regex starts with ^, IE and Chrome can usually determine that a match cannot be found after the start of a string and avoid foolishly searching subsequent positions. Another example is that if all possible matches contain x as the third character, a smart implementation may be able to determine this, quickly search for the next x, and set the starting position two characters back from where it's found (e.g., recent versions of Chrome include this optimization).



Step 3: Matching each regex token



    Once the regex knows where to start, it steps through the text and the regex pattern. When a particular token fails to match, the regex tries to backtrack to a prior point in the match attempt and follow other possible paths through the regex.



Step 4: Success or failure



    If a complete match is found at the current position in the string, the regex declares success. If all possible paths through the regex have been attempted but a match was not found, the regex engine goes back to step 2 to try again at the next character in the string. Only after this cycle completes for every character in the string (as well as the position after the last character) and no matches have been found does the regex declare overall failure.
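    The cycle between steps 2 and 4 can be imitated in user code. This is only a toy model, not how any engine is implemented: it assumes the sticky (y) flag from ES2015, which is not part of the original discussion, and the function name findFirst is hypothetical.

```javascript
// Toy model of the step 2 -> step 4 loop: try a match anchored at each
// starting position in turn, moving one character forward after each failure.
function findFirst(patternSource, text) {
  var anchored = new RegExp(patternSource, "y"); // y: match only at lastIndex
  // "<=" so that the position after the last character is also tried.
  for (var start = 0; start <= text.length; start++) {
    anchored.lastIndex = start;          // step 2: set the starting position
    var m = anchored.exec(text);         // step 3: attempt a match here
    if (m) {                             // step 4: success
      return { index: start, text: m[0], attempts: start + 1 };
    }
  }
  return null;                           // step 4: overall failure
}

var found = findFirst("cat", "concatenate");   // match begins at index 3
var missing = findFirst("dog", "concatenate"); // null
```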



    Keeping this process in mind will help you make informed decisions about the types of issues that affect regex performance. Next up is a deeper look into a key feature of the matching process in step 3: backtracking.



Understanding Backtracking


    In most modern regex implementations (including those required by JavaScript), backtracking is a fundamental component of the matching process. It's also a big part of what makes regular expressions so expressive and powerful. However, backtracking is computationally expensive and can easily get out of hand if you're not careful. Although backtracking is only part of the overall performance equation, understanding how it works and how to minimize its use is perhaps the most important key to writing efficient regexes. The next few sections therefore cover the topic at some length.



    As a regex works its way through a target string, it tests whether a match can be found at each position by stepping through the components in the regex from left to right. For each quantifier and alternation, a decision must be made about how to proceed. With a quantifier (such as *, +?, or {2,}), the regex must decide when to try matching additional characters, and with alternation (via the | operator), it must try one option from those available.



    Each time the regex makes such a decision, it remembers the other options to return to later if necessary. If the chosen option is successful, the regex continues through the regex pattern, and if the remainder of the regex is also successful, the match is complete. But if the chosen option can't find a match or anything later in the regex fails, the regex backtracks to the last decision point where untried options remain and chooses one. It continues on like this until a match is found or all possible permutations of the quantifiers and alternation options in the regex have been tried unsuccessfully, at which point it gives up and moves on to start this process all over at the next character in the string.



Alternation and backtracking


    Here's an example that demonstrates how this process plays out with alternation.



/h(ello|appy) hippo/.test("hello there, happy hippo");


    This regex matches "hello hippo" or "happy hippo". It starts this test by searching for an h, which it finds immediately as the first character in the target string. Next, the subexpression (ello|appy) provides two ways to proceed. The regex chooses the leftmost option (alternation always works from left to right), and checks whether ello matches the next characters in the string. It does, and the regex is also able to match the following space character. At that point, though, it reaches a dead end because the h in hippo cannot match the t that comes next in the string. The regex can't give up yet, though, because it hasn't tried all of its options, so it backtracks to the last decision point (just after it matched the leading h) and tries to match the second alternation option. That doesn't work, and since there are no more options to try, the regex determines that a match cannot be found starting from the first character in the string and moves on to try again at the second character. It doesn't find an h there, so it continues searching until it reaches the 14th character, where it matches the h in "happy". It then steps through the alternatives again. This time ello doesn't match, but after backtracking and trying the second alternative, it's able to continue until it matches the full string "happy hippo" (see Figure 5-4). Success.
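    The outcome of that walk-through can be checked with exec(), which reports where the match begins and which alternative was taken. The variable names are illustrative.

```javascript
var match = /h(ello|appy) hippo/.exec("hello there, happy hippo");

var startIndex = match.index;      // 13, i.e., the 14th character
var fullMatch = match[0];          // "happy hippo"
var chosenAlternative = match[1];  // "appy", the second alternation option
```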


Figure 5-4. Example of backtracking with alternation



Repetition and backtracking


    This next example shows how backtracking works with repetition quantifiers.



var str = "<p>Para 1.</p>" +
          "<img src='smiley.jpg'>" +
          "<p>Para 2.</p>" +
          "<div>Div.</div>";

/<p>.*<\/p>/.test(str);


    Here, the regex starts by matching the three literal characters <p> at the start of the string. Next up is .*. The dot matches any character except line breaks, and the greedy asterisk quantifier repeats it zero or more times, as many times as possible. Since there are no line breaks in the target string, this gobbles up the rest of the string! There's still more to match in the regex pattern, though, so the regex tries to match <. This doesn't work at the end of the string, so the regex backtracks one character at a time, continually trying to match <, until it gets back to the < at the beginning of the </div> tag. It then tries to match \/ (an escaped forward slash), which works, followed by p, which doesn't. The regex backtracks again, repeating this process until it eventually matches the </p> at the end of the second paragraph. The match is returned successfully, spanning from the start of the first paragraph until the end of the last one, which is probably not what you wanted.



    You can change the regex to match individual paragraphs by replacing the greedy * quantifier with the lazy (aka nongreedy) *?. Backtracking for lazy quantifiers works in the opposite way. When the regex /<p>.*?<\/p>/ comes to the .*?, it first tries to skip this altogether and move on to matching <\/p>. It does so because *? repeats its preceding element zero or more times, as few times as possible, and the fewest possible times it can repeat is zero. However, when the following < fails to match at this point in the string, the regex backtracks and tries to match the next fewest number of characters: one. It continues backtracking forward like this until the <\/p> that follows the quantifier is able to fully match at the end of the first paragraph.
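    The difference in results can be seen by running both versions. This sketch assumes a target string with two paragraphs, an image, and a trailing div; the variable names and the use of the g flag to collect every lazy match are illustrative choices.

```javascript
var str = "<p>Para 1.</p>" +
          "<img src='smiley.jpg'>" +
          "<p>Para 2.</p>" +
          "<div>Div.</div>";

// Greedy: .* runs to the end of the string and backtracks to the LAST </p>.
var greedyMatch = str.match(/<p>.*<\/p>/)[0];

// Lazy: .*? expands one character at a time and stops at the FIRST </p>.
// The g flag collects each paragraph as a separate match.
var lazyMatches = str.match(/<p>.*?<\/p>/g);
```

    Here greedyMatch spans from the first <p> through the second </p>, while lazyMatches holds the two paragraphs as separate entries.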



    You can see that even if there was only one paragraph in the target string and therefore the greedy and lazy versions of this regex were equivalent, they would go about finding their matches differently (see Figure 5-5).


Figure 5-5. Example of backtracking with greedy and lazy quantifiers



Runaway Backtracking


    When a regular expression stalls your browser for seconds, minutes, or longer, the problem is most likely a bad case of runaway backtracking. To demonstrate the problem, consider the following regex, which is designed to match an entire HTML file. The regex is wrapped across multiple lines in order to fit the page. Unlike most other regex flavors, JavaScript does not have an option to make dots match any character, including line breaks, so this example uses [\s\S] to match any character.
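    The pattern itself is not reproduced here. A regex of the kind described might look like the sketch below, which assumes the required tags are html, head, title, and body, each separated by a lazy [\s\S]*? sequence. The variable names and the sample documents are hypothetical, and the pattern is kept on one line so that it stays a valid literal.

```javascript
// Hedged reconstruction: required tags separated by lazy "any character" runs.
var htmlFileRegex = /<html>[\s\S]*?<head>[\s\S]*?<title>[\s\S]*?<\/title>[\s\S]*?<\/head>[\s\S]*?<body>[\s\S]*?<\/body>[\s\S]*?<\/html>/;

var completeDoc =
  "<html><head><title>T</title></head><body><p>Hi</p></body></html>";
var missingCloseTag =
  "<html><head><title>T</title></head><body><p>Hi</p></body>";

var matchesComplete = htmlFileRegex.test(completeDoc);    // true
var matchesBroken = htmlFileRegex.test(missingCloseTag);  // false
```

    On a string this short the failed match returns quickly; the cost only becomes noticeable as the input grows.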





    This regex works fine when matching a suitable HTML string, but it turns ugly when the string is missing one or more required tags. If the </html> tag is missing, for instance, the last [\s\S]*? expands to the end of the string since there is no </html> tag to be found, and then, instead of giving up, the regex sees that each of the previous [\s\S]*? sequences remembered backtracking positions that allow them to expand further. The regex tries expanding the second-to-last [\s\S]*? (using it to match the </body> tag that was previously matched by the literal <\/body> pattern in the regex) and continues to expand it in search of a second </body> tag until the end of the string is reached again. When all of that fails, the third-to-last [\s\S]*? expands to the end of the string, and so on.



The solution: Be specific


    The way around a problem like this is to be as specific as possible about what characters can be matched between your required delimiters. Take the pattern ".*?", which is intended to match a string delimited by double-quotes. By replacing the overly permissive .*? with the more specific [^"\r\n]*, you remove the possibility that backtracking will force the dot to match a double-quote and expand beyond what was intended.
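    To see the dot being forced past a double-quote, the sketch below adds a trailing ! to both patterns so that the first quoted string cannot complete the match. The trailing ! and the sample text are assumptions made purely for demonstration.

```javascript
var text = '"a" and "b"!';

// Permissive: after "a" fails to be followed by "!", backtracking makes the
// lazy dot keep expanding, straight through the intervening double-quotes.
var permissive = text.match(/".*?"!/)[0];

// Specific: the negated class cannot cross a double-quote, so the attempt at
// the first quote fails outright and the match is found at "b".
var specific = text.match(/"[^"\r\n]*"!/)[0];
```

    Here permissive is the whole string, while specific is only "b"! with its quotes.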



    With the HTML example, this workaround is not as simple. You can't use a negated character class like [^<] in place of [\s\S] because there may be other tags between those you're searching for. However, you can reproduce the effect by repeating a noncapturing group that contains a negative lookahead (blocking the next required tag) and the [\s\S] (any character) metasequence. This ensures that the tags you're looking for fail at every intermediate position, and, more importantly, that the [\s\S] patterns cannot expand beyond where the tags you are blocking via negative lookahead are found. Here's how the regex ends up looking using this approach:
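    The pattern is not reproduced here, so what follows is a hedged reconstruction of the technique just described, not necessarily the exact regex the author used. It assumes the same html, head, title, and body tags as before; the helper name upTo and the sample documents are hypothetical.

```javascript
// Builds "(?:(?!TAG)[\s\S])*TAG": any characters, as long as TAG does not
// start at the current position, followed by TAG itself.
function upTo(tag) {
  return "(?:(?!" + tag + ")[\\s\\S])*" + tag;
}

var requiredTags = ["<head>", "<title>", "</title>", "</head>",
                    "<body>", "</body>", "</html>"];

var lookaheadHtmlRegex =
  new RegExp("<html>" + requiredTags.map(upTo).join(""));

var goodDoc =
  "<html><head><title>T</title></head><body><p>Hi</p></body></html>";
var badDoc =
  "<html><head><title>T</title></head><body><p>Hi</p></body>";

var goodResult = lookaheadHtmlRegex.test(goodDoc); // true
var badResult = lookaheadHtmlRegex.test(badDoc);   // false
```

    Building the source from a helper keeps the repeated group readable; because the pattern goes through the RegExp constructor, the forward slashes in the closing tags need no escaping.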





    Although this removes the potential for runaway backtracking and allows the regex to fail at matching incomplete HTML strings in linear time, it's not going to win any awards for efficiency. Repeating a lookahead for each matched character like this is rather inefficient in its own right and significantly slows down successful matches. This approach works well enough when matching short strings, but since in this case the lookaheads may need to be tested thousands of times in order to match an HTML file, there's another solution that works better. It relies on a little trick, and it's described next.



Emulating atomic groups using lookahead and backreferences


    Some regex flavors, including .NET, Java, Oniguruma, PCRE, and Perl, support a feature called atomic grouping. Atomic groups (written as (?>…), where the ellipsis represents any regex pattern) are noncapturing groups with a special twist. As soon as a regex exits an atomic group, any backtracking positions within the group are thrown away. This provides a much better solution to the HTML regex's backtracking problem: if you were to place each [\s\S]*? sequence and its following HTML tag together inside an atomic group, then every time one of the required HTML tags was found, the match thus far would essentially be locked in. If a later part of the regex failed to match, no backtracking positions would be remembered for the quantifiers within the atomic groups, and thus the [\s\S]*? sequences could not attempt to expand beyond what they already matched.



    That's great, but JavaScript does not support atomic groups or provide any other feature to eliminate needless backtracking. It turns out, though, that you can emulate atomic groups by exploiting a little-known behavior of lookahead: that lookaheads are atomic groups. The difference is that lookaheads don't consume any characters as part of the overall match; they merely check whether the pattern they contain can be matched at that position. However, you can get around this by wrapping a lookahead's pattern inside a capturing group and adding a backreference to it just outside the lookahead. Here's what this looks like:



(?=(pattern to make atomic))\1


    This construct is reusable in any pattern where you want to use an atomic group. Just keep in mind that you need to use the appropriate backreference number if your regex contains more than one capturing group.
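    A small sketch shows the atomic behavior in isolation. The patterns and variable names are assumptions chosen for illustration: an ordinary a+ can give back a character so that a trailing a still matches, while the emulated atomic version cannot.

```javascript
// Ordinary greedy quantifier: a+ takes "aaa", then backtracks one character
// so that the final "a" in the pattern can match.
var plainResult = /^a+a/.test("aaa");              // true

// Emulated atomic group: the lookahead captures "aaa", \1 consumes it, and
// no backtracking positions remain inside the lookahead, so the final "a"
// has nothing left to match.
var atomicResult = /^(?=(a+))\1a/.test("aaa");     // false

// The construct still matches normally when no backtracking is needed.
var atomicSuccess = /^(?=(a+))\1b/.test("aaab");   // true
```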



    Here's how this looks when applied to the HTML regex:
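    The pattern is not reproduced here. The sketch below is a hedged reconstruction that assumes the same html, head, title, and body tags as the earlier HTML regex: each [\s\S]*? and its following tag are wrapped in a lookahead with a capturing group, followed by the matching backreference. The variable names and sample documents are hypothetical.

```javascript
var innerTags = ["<head>", "<title>", "</title>", "</head>",
                 "<body>", "</body>"];

// For the tag at position i, emit "(?=([\s\S]*?TAG))\N" where N = i + 1,
// so that each backreference points at its own capturing group.
var atomicSource = "<html>" + innerTags.map(function (tag, i) {
  return "(?=([\\s\\S]*?" + tag + "))\\" + (i + 1);
}).join("") + "[\\s\\S]*?</html>";

var atomicHtmlRegex = new RegExp(atomicSource);

var wholeDoc =
  "<html><head><title>T</title></head><body><p>Hi</p></body></html>";
var truncatedDoc =
  "<html><head><title>T</title></head><body><p>Hi</p></body>";

var wholeResult = atomicHtmlRegex.test(wholeDoc);         // true
var truncatedResult = atomicHtmlRegex.test(truncatedDoc); // false
```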





    Now, if there is no trailing </html> and the last [\s\S]*? expands to the end of the string, the regex immediately fails because there are no backtracking points to return to. Each time the regex finds an intermediate tag and exits a lookahead, it throws away all backtracking positions from within the lookahead. The following backreference simply rematches the literal characters found within the lookahead, making them a part of the actual match.



Nested quantifiers and runaway backtracking


    So-called nested quantifiers always warrant extra attention and care in order to ensure that you're not creating the potential for runaway backtracking. A quantifier is nested when it occurs within a grouping that is itself repeated by a quantifier (e.g., (x+)*).



    Nesting quantifiers is not actually a performance hazard in and of itself. However, if you're not careful, it can easily create a massive number of ways to divide text between the inner and outer quantifiers while attempting to match a string.



    As an example, let's say you want to match HTML tags, and you come up with the following regex:
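    The pattern is not reproduced here. Going by the description that follows (a noncapturing group repeated by *, whose alternatives are [^>"'], a double-quoted value, and a single-quoted value), it would look something like the sketch below. Treat it as a reconstruction; the variable names and sample markup are hypothetical.

```javascript
// Reconstructed tag matcher: one ordinary character, or a whole double-quoted
// value, or a whole single-quoted value, repeated until the closing ">".
var tagRegex = /<(?:[^>"']|"[^"]*"|'[^']*')*>/;

// The more naive version stops at the first ">" it sees.
var naiveRegex = /<[^>]*>/;

var html = '<a title="1 > 0" href=\'#\'>link</a>';

var carefulMatch = html.match(tagRegex)[0];  // the entire opening tag
var naiveMatch = html.match(naiveRegex)[0];  // cut off inside the title value
```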





    This is perhaps overly simplistic, as it does not handle all cases of valid and invalid markup correctly, but it might work OK if used to process only snippets of valid HTML. Its advantage over even more naive solutions such as /<[^>]*>/ is that it accounts for > characters that occur within attribute values. It does so using the second and third alternatives in the noncapturing group, which match entire double- and single-quoted attribute values in single steps, allowing all characters except their respective quote type to occur within them.



    So far, there's no risk of runaway backtracking, despite the nested * quantifiers. The second and third alternation options match exactly one quoted string sequence per repetition of the group, so the potential number of backtracking points increases linearly with the length of the target string.



    However, look at the first alternative in the noncapturing group: [^>"']. This can match only one character at a time, which seems a little inefficient. You might think it would be better to add a + quantifier at the end of this character class so that more than one suitable character can be matched during each repetition of the group—and at positions within the target string where the regex finds a match—and you'd be right. By matching more than one character at a time, you'd let the regex skip many unnecessary steps on the way to a successful match.



    What might not be as readily apparent is the negative consequence such a change could lead to. If the regex matches an opening < character, but there is no following > that would allow the match attempt to complete successfully, runaway backtracking will kick into high gear because of the huge number of ways the new inner quantifier can be combined with the outer quantifier (following the noncapturing group) to match the text that follows <. The regex must try all of these permutations before giving up on the match attempt. Watch out!
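    The two variants can be compared side by side. This sketch assumes the tag regex has the shape described above and deliberately keeps the unterminated input short, since the number of paths the second variant must try grows rapidly with its length. All names and sample strings are hypothetical.

```javascript
// Without and with the inner + quantifier on the first alternative.
var safeTag = /<(?:[^>"']|"[^"]*"|'[^']*')*>/;
var riskyTag = /<(?:[^>"']+|"[^"]*"|'[^']*')*>/;

var complete = "<a href='x' title=\"a > b\">";
var unterminated = "<div class=header";  // opening "<" with no closing ">"

// On well-formed input both variants find the same match.
var safeMatch = safeTag.exec(complete)[0];
var riskyMatch = riskyTag.exec(complete)[0];

// On unterminated input both fail, but the risky variant has to try every
// way of dividing the text between the inner + and the outer * first.
var safeFails = !safeTag.test(unterminated);
var riskyFails = !riskyTag.test(unterminated);
```

    Lengthening the unterminated string is the quickest way to observe the slowdown, so increase it cautiously.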



From bad to worse


    For an even more extreme example of nested quantifiers resulting in runaway backtracking, apply the regex /(A+A+)+B/ to a string containing only As. Although this regex would be better written as /AA+B/, for the sake of discussion imagine that the two As represent different patterns that are capable of matching some of the same strings.



    When applied to a string composed of 10 As ("AAAAAAAAAA"), the regex starts by using the first A+ to match all 10 characters. The regex then backtracks one character, letting the second A+ match the last one. The grouping then tries to repeat, but since there are no more As and the group's + quantifier has already met its requirement of matching at least once, the regex then looks for the B. It doesn't find it, but it can't give up yet, since there are more paths through the regex that haven't been tried. What if the first A+ matched eight characters and the second matched two? Or if the first matched three characters, the second matched two, and the group repeated twice? How about if during the first repetition of the group, the first A+ matched two characters and the second matched three; then on the second repetition the first matched one and the second matched four? Although to you and me it's obviously silly to think that any amount of backtracking will produce the missing B, the regex will dutifully check all of these futile options and a lot more. The worst-case complexity of this regex is an appalling O(2^n), or two to the nth power, where n is the length of the string. With the 10 As used here, the regex requires 1,024 backtracking steps for the match to fail, and with 20 As, that number explodes to more than a million. Thirty-five As should be enough to hang Chrome, IE, Firefox, and Opera for at least 10 minutes (if not permanently) while they process the more than 34 billion backtracking steps required to invalidate all permutations of the regex. The exception is recent versions of Safari, which are able to detect that the regex is going in circles and quickly abort the match (Safari also imposes a cap of allowed backtracking steps, and aborts match attempts when this is exceeded).
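    The figures quoted above follow directly from the 2^n estimate. The sketch below only evaluates that formula; it does not count the steps a real engine takes, and the function name is hypothetical.

```javascript
// Evaluates the worst-case estimate of 2^n backtracking steps for n As.
function worstCaseSteps(n) {
  return Math.pow(2, n);
}

var stepsFor10 = worstCaseSteps(10); // 1024
var stepsFor20 = worstCaseSteps(20); // 1048576 (more than a million)
var stepsFor35 = worstCaseSteps(35); // 34359738368 (more than 34 billion)
```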



    The key to preventing this kind of problem is to make sure that two parts of a regex cannot match the same part of a string. For this regex, the fix is to rewrite it as /AA+B/, but the issue may be harder to avoid with complex regexes. Adding an emulated atomic group often works well as a last resort, although other solutions, when possible, will most likely keep your regexes easier to understand. Doing so for this regex looks like /((?=(A+A+))\2)+B/, and completely removes the backtracking problem.
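    Both fixes can be tried against the 35-character input that stalls the original regex. This is a minimal sketch with illustrative variable names; it deliberately never runs /(A+A+)+B/ itself against that input.

```javascript
var thirtyFiveAs = new Array(36).join("A"); // a string of 35 "A" characters

// Rewritten so that no two parts can match the same text.
var rewritten = /AA+B/;

// Emulated atomic group: once the lookahead has matched, its backtracking
// positions are discarded, so the inner A+A+ cannot be re-divided.
var atomic = /((?=(A+A+))\2)+B/;

var rewrittenFails = !rewritten.test(thirtyFiveAs); // true
var atomicFails = !atomic.test(thirtyFiveAs);       // true

// Both still match when the B is actually present.
var rewrittenMatches = rewritten.test("AAAB");      // true
var atomicMatches = atomic.test("AAAB");            // true
```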








