[Translation] High Performance JavaScript (017)

A Note on Benchmarking


    Because a regex's performance can be wildly different depending on the text it's applied to, there's no straightforward way to benchmark regexes against each other. For the best result, you need to benchmark your regexes on test strings of varying lengths that match, don't match, and nearly match.



    That's one reason for this chapter's lengthy backtracking coverage. Without a firm understanding of backtracking, you won't be able to anticipate and identify backtracking-related problems. To help you catch runaway backtracking early, always test your regexes with long strings that contain partial matches. Think about the kinds of strings that your regexes will nearly but not quite match, and include those in your tests.
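    As a rough sketch of that advice, the snippet below times one regex against subjects of several lengths that match, nearly match, and don't match. The helper names (timeRegex, buildSubjects) and the sample pattern are assumptions made for illustration, not part of the original text, and Date.now() timings are only coarse.

```javascript
// Hypothetical helper: elapsed milliseconds for `runs` calls to regex.test(str).
function timeRegex(regex, str, runs) {
    var start = Date.now();
    for (var i = 0; i < runs; i++) {
        regex.lastIndex = 0; // reset so a /g regex starts from the beginning each run
        regex.test(str);
    }
    return Date.now() - start;
}

// Hypothetical helper: subjects of a given filler length for each outcome.
function buildSubjects(length) {
    var filler = new Array(length + 1).join("a"); // `length` copies of "a"
    return {
        match: filler + "<b>x</b>",
        nearMatch: filler + "<b>x</b", // missing only the final ">"
        noMatch: filler
    };
}

var pattern = /<b>[^<]*<\/b>/; // illustrative pattern only
var lengths = [100, 10000, 100000];
var kinds = ["match", "nearMatch", "noMatch"];
var results = [];

for (var i = 0; i < lengths.length; i++) {
    var subjects = buildSubjects(lengths[i]);
    for (var j = 0; j < kinds.length; j++) {
        results.push({
            length: lengths[i],
            kind: kinds[j],
            ms: timeRegex(pattern, subjects[kinds[j]], 50)
        });
    }
}
```

    Comparing the nearMatch rows as the length grows is usually the quickest way to spot a pattern whose failure cost climbs faster than the input does.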



More Ways to Improve Regular Expression Efficiency


    The following are a variety of additional regex efficiency techniques. Several of the points here have already been touched upon during the backtracking discussion.



Focus on failing faster



    Slow regex processing is usually caused by slow failure rather than slow matching. This is compounded by the fact that if you're using a regex to match small parts of a large string, the regex will fail at many more positions than it will succeed. A change that makes a regex match faster but fail slower (e.g., by increasing the number of backtracking steps needed to try all regex permutations) is usually a losing trade.



Start regexes with simple, required tokens



    Ideally, the leading token in a regex should be fast to test and rule out as many obviously nonmatching positions as possible. Good starting tokens for this purpose are anchors (^ or $), specific characters (e.g., x or \u263A), character classes (e.g., [a-z] or shorthands like \d), and word boundaries (\b). If possible, avoid starting regexes with groupings or optional tokens, and avoid top-level alternation such as /one|two/ since that forces the regex to consider multiple leading tokens. Firefox is sensitive to the use of any quantifier on leading tokens, and is better able to optimize, e.g., \s\s* than \s+ or \s{1,}. Other browsers mostly optimize away such differences.



Make quantified patterns and their following token mutually exclusive



    When the characters that adjacent tokens or subexpressions are able to match overlap, the number of ways a regex will try to divide text between them increases. To help avoid this, make your patterns as specific as possible. Don't use ".*?" (which relies on backtracking) when you really mean "[^"\r\n]*".
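    A minimal sketch of the two patterns side by side; the sample string and variable names are assumptions chosen for illustration. Both find the same text here, but they get there differently.

```javascript
var text = 'say "hello" and "bye"';

// Lazy dot: extends the match one character at a time, checking for the
// closing quote after each step.
var lazy = /".*?"/;

// Negated class: the quantified token cannot match the quote that follows it,
// so there is only one way to divide the text between the two tokens.
var exclusive = /"[^"\r\n]*"/;

var lazyMatch = lazy.exec(text)[0];
var exclusiveMatch = exclusive.exec(text)[0];
```

    On this input both variables hold the first quoted string, including its quote marks.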



Reduce the amount and reach of alternation



    Alternation using the | vertical bar may require that all alternation options be tested at every position in a string. You can often reduce the need for alternation by using character classes and optional components, or by pushing the alternation further back into the regex (allowing some match attempts to fail before reaching the alternation). The following table shows examples of these techniques.



    Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When alternation is necessary, put frequently occurring alternatives first if this doesn't affect what the regex matches. Alternation options are attempted from left to right, so the more frequently an option is expected to match, the sooner you want it to be considered.



    Note that Chrome and Firefox perform some of these optimizations automatically, and are therefore less affected by techniques for hand-tuning alternation.
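    The rewrites described above might look like the following sketch. The specific pattern pairs and sample words are illustrative assumptions; the loop only checks that each rewritten pattern accepts and rejects the same samples as the alternation it replaces.

```javascript
var rewrites = [
    // Character class instead of alternation.
    { original: /cat|bat/, rewritten: /[cb]at/, samples: ["cat", "bat", "rat"] },
    // Optional component instead of alternation.
    { original: /red|read/, rewritten: /rea?d/, samples: ["red", "read", "rad"] },
    // Alternation pushed back behind a shared, required first character.
    { original: /red|raw/, rewritten: /r(?:ed|aw)/, samples: ["red", "raw", "row"] }
];

// Returns true when both patterns give the same test() result for every sample.
function sameResults(pair) {
    for (var i = 0; i < pair.samples.length; i++) {
        var sample = pair.samples[i];
        if (pair.original.test(sample) !== pair.rewritten.test(sample)) {
            return false;
        }
    }
    return true;
}

var allEquivalent = true;
for (var i = 0; i < rewrites.length; i++) {
    if (!sameResults(rewrites[i])) {
        allEquivalent = false;
    }
}
```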



Use noncapturing groups



    Capturing groups spend time and memory remembering backreferences and keeping them up to date. If you don't need a backreference, avoid this overhead by using a noncapturing group—i.e., (?:…) instead of (). Some people like to wrap their regexes in a capturing group when they need a backreference to the entire match. This is unnecessary since you can reference full matches via, e.g., element zero in arrays returned by regex.exec() or $& in replacement strings.



    Replacing capturing groups with their noncapturing kin has minimal impact in Firefox, but can make a big difference in other browsers when dealing with long strings.
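    A short sketch of the difference, using a made-up URL pattern and sample string (both are assumptions for illustration). The group is needed only to scope the alternation, so it does not have to capture, and the full match is still available without wrapping the regex in a group.

```javascript
var str = "see http://example.com now";

// Capturing group: the engine also records which scheme matched.
var capturing = /(https?|ftp):\/\/\S+/;

// Noncapturing group: same matching behavior, nothing extra remembered.
var noncapturing = /(?:https?|ftp):\/\/\S+/;

var withCapture = capturing.exec(str);       // full match plus one backreference
var withoutCapture = noncapturing.exec(str); // full match only

var fullMatch = withoutCapture[0];               // element zero is the whole match
var wrapped = str.replace(noncapturing, "<$&>"); // $& inserts the whole match
```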



Capture interesting text to reduce postprocessing



    As a caveat to the last tip, if you need to reference parts of a match, then, by all means, capture those parts and use the backreferences produced. For example, if you're writing code to process the contents of quoted strings matched by a regex, use /"([^"]*)"/ and work with backreference one, rather than using /"[^"]*"/ and manually stripping the quote marks from the result. When used in a loop, this kind of work reduction can save significant time.
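    The quoted-string case could be sketched as follows. The function name quotedContents is hypothetical; the regex is the one given in the text with a g flag added so it can be used in a loop.

```javascript
// Collects the text inside each pair of double quotes, without the quote marks.
function quotedContents(str) {
    var regex = /"([^"]*)"/g,
        contents = [],
        match;
    while ((match = regex.exec(str)) !== null) {
        contents.push(match[1]); // backreference one: the text between the quotes
    }
    return contents;
}

var found = quotedContents('a "one" b "two" c ""');
```

    Each iteration reads the captured text directly, so no slicing or replacing is needed to strip the quote marks afterward.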



Expose required tokens



    In order to help regex engines make smart decisions about how to optimize a search routine, try to make it easy to determine which tokens are required. When tokens are used within subexpressions or alternation, it's harder for regex engines to determine whether they are required, and some won't make the effort to do so. For instance, the regex /^(ab|cd)/ exposes its start-of-string anchor. IE and Chrome see this and prevent the regex from trying to find matches after the start of a string, thereby making this search near instantaneous regardless of string length. However, because the equivalent regex /(^ab|^cd)/ doesn't expose its ^ anchor, IE doesn't apply the same optimization and ends up pointlessly searching for matches at every position in the string.
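    The two regexes from the paragraph above accept exactly the same strings, which a quick check can confirm; the sample strings are assumptions for illustration. Whether an engine actually applies the start-of-string optimization is browser-specific and cannot be observed from this snippet.

```javascript
var exposed = /^(ab|cd)/;   // anchor is visible at the top level
var hidden = /(^ab|^cd)/;   // same anchor, buried inside each alternative

var samples = ["abc", "cde", "xab", "", "ab"];
var agree = true;

for (var i = 0; i < samples.length; i++) {
    if (exposed.test(samples[i]) !== hidden.test(samples[i])) {
        agree = false;
    }
}
```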



Use appropriate quantifiers



    As described in the earlier section "Repetition and backtracking", greedy and lazy quantifiers go about finding matches differently, even when they match the same strings. Using the more appropriate quantifier type (based on the anticipated amount of backtracking) in cases where they are equally correct can significantly improve performance, especially with long strings.



    Lazy quantifiers are particularly slow in Opera 9.x and earlier, but Opera 10 removes this weakness.
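    As a sketch of a case where greedy and lazy quantifiers are equally correct, consider a subject with a single closing tag near the end of a long body. The subject and patterns are assumptions for illustration; the comments describe how a backtracking engine would typically proceed.

```javascript
var longBody = new Array(10001).join("x"); // 10,000 "x" characters
var subject = "<p>" + longBody + "</p>";

// Greedy: consumes to the end of the string, then backtracks a few
// characters until </p> can match.
var greedy = /<p>[\s\S]*<\/p>/;

// Lazy: tries </p> first at every position, extending one character at a
// time across the whole body before it finally matches.
var lazy = /<p>[\s\S]*?<\/p>/;

var greedyMatch = greedy.exec(subject)[0];
var lazyMatch = lazy.exec(subject)[0];
```

    With the closing tag close to the end, the greedy version has less work to do; if the tag sat right after the opening one in a long string, the lazy version would be the better fit.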



Reuse regexes by assigning them to variables



    Assigning regexes to variables lets you avoid repeatedly compiling them. Some people go overboard, using regex caching schemes that aim to avoid ever compiling a given pattern and flag combination more than once. Don't bother; regex compilation is fast, and such schemes likely add more overhead than they evade. The important thing is to avoid repeatedly recompiling regexes within loops. In other words, don't do this:



while (/regex1/.test(str1)) {
    /regex2/.exec(str2); // both literals are evaluated on every iteration
}


Do this instead:



var regex1 = /regex1/,
    regex2 = /regex2/;
while (regex1.test(str1)) {
    regex2.exec(str2); // reuses the regexes created once above
}


Split complex regexes into simpler pieces



    Try to avoid doing too much with a single regex. Complicated search problems that require conditional logic are easier to solve and usually more efficient when broken into two or more regexes, with each regex searching within the matches of the last. Regex monstrosities that do everything in one pattern are difficult to maintain, and are prone to backtracking-related problems.
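    One way to picture "each regex searching within the matches of the last" is a two-step search. The sketch below first finds anchor tags and then looks for an href attribute inside each one; the function name extractHrefs and both patterns are assumptions for illustration and are not meant as a general HTML parser.

```javascript
// Step 1 finds <a ...> opening tags; step 2 searches only within each tag.
function extractHrefs(html) {
    var tagRegex = /<a\b[^>]*>/gi,
        hrefRegex = /\bhref="([^"]*)"/i,
        hrefs = [],
        tag,
        href;
    while ((tag = tagRegex.exec(html)) !== null) {
        href = hrefRegex.exec(tag[0]); // search inside the previous match
        if (href) {
            hrefs.push(href[1]);
        }
    }
    return hrefs;
}

var links = extractHrefs(
    '<a href="/x">x</a> <a name="n">n</a> <A class="c" HREF="/y">y</A>'
);
```

    Each regex stays short and has little room to backtrack, and the condition "only anchors that have an href" lives in ordinary code rather than inside one large pattern.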



When Not to Use Regular Expressions


    When used with care, regexes are very fast. However, they're usually overkill when you are merely searching for literal strings. This is especially true if you know in advance which part of a string you want to test. For instance, if you want to check whether a string ends with a semicolon, you could use something like this:



endsWithSemicolon = /;$/.test(str);


    You might be surprised to learn, though, that none of the big browsers are currently smart enough to realize in advance that this regex can match only at the end of the string. What they end up doing is stepping through the entire string. Each time a semicolon is found, the regex advances to the next token ($), which checks whether the match is at the end of the string. If not, the regex continues searching for a match until it finally makes its way through the entire string. The longer your string (and the more semicolons it contains), the longer this takes.



    In this case, a better approach is to skip all the intermediate steps required by a regex and simply check whether the last character is a semicolon:



endsWithSemicolon = str.charAt(str.length - 1) == ";";


    This is just a bit faster than the regex-based test with small target strings, but, more importantly, the string's length no longer affects the time needed to perform the test.



    This example used the charAt method to read the character at a specific position. The string methods slice, substr, and substring work well when you want to extract and check the value of more than one character at a specific position. Additionally, the indexOf and lastIndexOf methods are great for finding the position of literal strings or checking for their presence. All of these string methods are fast and can help you avoid invoking the overhead of regular expressions when searching for literal strings that don't rely on fancy regex features.
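    A few literal-string checks built on those methods might look like this. The helper names (endsWithText, startsWithText, containsText) are hypothetical and chosen only for this sketch.

```javascript
// True when str ends with suffix; compares only the last suffix.length characters.
function endsWithText(str, suffix) {
    return str.slice(str.length - suffix.length) === suffix;
}

// True when str starts with prefix; compares only the first prefix.length characters.
function startsWithText(str, prefix) {
    return str.slice(0, prefix.length) === prefix;
}

// True when needle occurs anywhere in str.
function containsText(str, needle) {
    return str.indexOf(needle) !== -1;
}

var line = "var total = count * 2;";
var endsWithSemicolon = endsWithText(line, ";");
var isDeclaration = startsWithText(line, "var ");
var hasMultiply = containsText(line, "*");
```

    The first two helpers look only at a fixed-size slice, so their cost depends on the length of the prefix or suffix rather than the length of the string being tested.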

