[翻译]High Performance JavaScript(017)

A Note on Benchmarking  测试基准说明

 

    Because a regex's performance can be wildly different depending on the text it's applied to, there's no straightforward way to benchmark regexes against each other. For the best result, you need to benchmark your regexes on test strings of varying lengths that match, don't match, and nearly match.

    因为正则表达式性能因应用文本不同而产生很大差异,没有简单明了的方法可以测试正则表达式之间的性能差别。为得到最好的结果,你需要在各种字符串上测试你的正则表达式,包括不同长度,能够匹配的,不能匹配的,和近似匹配的。

 

    That's one reason for this chapter's lengthy backtracking coverage. Without a firm understanding of backtracking, you won't be able to anticipate and identify backtracking-related problems. To help you catch runaway backtracking early, always test your regexes with long strings that contain partial matches. Think about the kinds of strings that your regexes will nearly but not quite match, and include those in your tests.

    这也是本章长篇大论回溯的原因之一。如果没有确切理解回溯,就无法预测和确定回溯相关问题。为帮助你尽早发现回溯失控,总是用包含部分匹配的长字符串测试你的正则表达式。针对你的正则表达式构思一些近似但不能完全匹配的字符串,将它们应用在你的测试中。
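
    As a rough sketch of such a test set (the timeRegex and buildCases helpers and the quoted-string pattern are hypothetical, chosen only for illustration), one could time a regex against matching, nonmatching, and nearly matching strings of several lengths:

```javascript
// Hypothetical helper: run regex.test(input) `runs` times, return elapsed ms.
function timeRegex(regex, input, runs) {
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    regex.test(input);
  }
  return Date.now() - start;
}

// Hypothetical helper: build three test strings of roughly 2 * n characters.
function buildCases(n) {
  const filler = "x".repeat(n);
  return {
    match: filler + '"' + filler + '"', // contains a complete quoted string
    noMatch: filler + filler,           // contains no quote at all
    nearMatch: filler + '"' + filler    // opening quote that is never closed
  };
}

// Illustrative pattern under test: a double-quoted string on a single line.
const quotedString = /"[^"\r\n]*"/;

for (const n of [10, 1000, 100000]) {
  const cases = buildCases(n);
  for (const name of Object.keys(cases)) {
    const ms = timeRegex(quotedString, cases[name], 100);
    console.log("length " + cases[name].length + ", " + name + ": " + ms + " ms");
  }
}
```

    Timings from Date.now() are coarse and vary by engine, so the numbers are only useful for comparing the three cases against each other as the length grows.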

 

More Ways to Improve Regular Expression Efficiency  提高正则表达式效率的更多方法

 

    The following are a variety of additional regex efficiency techniques. Several of the points here have already been touched upon during the backtracking discussion.

    下面是一些提高正则表达式效率的技术。几个技术点已经在回溯部分讨论过了。

 

Focus on failing faster

关注如何让匹配更快失败

 

    Slow regex processing is usually caused by slow failure rather than slow matching. This is compounded by the fact that if you're using a regex to match small parts of a large string, the regex will fail at many more positions than it will succeed. A change that makes a regex match faster but fail slower (e.g., by increasing the number of backtracking steps needed to try all regex permutations) is usually a losing trade.

    正则表达式处理慢往往是因为匹配失败过程慢,而不是匹配成功过程慢。如果你使用正则表达式匹配一个很大字符串的一小部分,情况更为严重,正则表达式匹配失败的位置比匹配成功的位置要多得多。如果一个修改使正则表达式匹配更快但失败更慢(例如,通过增加所需的回溯次数去尝试所有分支的排列组合),这通常是一个失败的修改。

 

Start regexes with simple, required tokens

正则表达式以简单的,必需的字元开始

 

    Ideally, the leading token in a regex should be fast to test and rule out as many obviously nonmatching positions as possible. Good starting tokens for this purpose are anchors (^ or $), specific characters (e.g., x or \u263A), character classes (e.g., [a-z] or shorthands like \d), and word boundaries (\b). If possible, avoid starting regexes with groupings or optional tokens, and avoid top-level alternation such as /one|two/ since that forces the regex to consider multiple leading tokens. Firefox is sensitive to the use of any quantifier on leading tokens, and is better able to optimize, e.g., \s\s* than \s+ or \s{1,}. Other browsers mostly optimize away such differences.

    最理想的情况是,一个正则表达式的起始字元应当尽可能快速地测试并排除明显不匹配的位置。用于此目的好的起始字元通常是一个锚(^或$),特定字符(例如x或\u263A),字符类(例如,[a-z]或速记符例如\d),和单词边界(\b)。如果可能的话,避免以分组或可选字元开头,避免顶级分支例如/one|two/,因为它强迫正则表达式识别多种起始字元。Firefox对起始字元中使用的任何量词都很敏感,能够优化得更好,例如,以\s\s*替代\s+或\s{1,}。其他浏览器大多优化掉这些差异。

 

Make quantified patterns and their following token mutually exclusive

编写量词模板,使它们后面的字元互相排斥

 

    When the characters that adjacent tokens or subexpressions are able to match overlap, the number of ways a regex will try to divide text between them increases. To help avoid this, make your patterns as specific as possible. Don't use ".*?" (which relies on backtracking) when you really mean "[^"\r\n]*".

    当相邻字元或子表达式能够匹配的字符相互重叠时,正则表达式尝试在它们之间分解文本的路径数量将增加。为避免此现象,尽量具体化你的模板。当你想表达“[^"\r\n]*”时不要使用“.*?”(依赖回溯)。
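
    A minimal sketch of the difference, using a made-up sample sentence: for ordinary single-line text both patterns find the same quoted string, but the negated class cannot overlap with the closing quote, so there is only one way to divide the text.

```javascript
// Lazy dot: extends one character at a time, retrying the closing quote
// after each step, because "." can also match a quote character.
const lazyDot = /".*?"/;

// Negated class: can never consume a quote, so the quantified token and the
// closing quote that follows it are mutually exclusive.
const negatedClass = /"[^"\r\n]*"/;

const sample = 'He said "hello" and left.';
const lazyResult = lazyDot.exec(sample)[0];
const negatedResult = negatedClass.exec(sample)[0];

console.log(lazyResult);                   // "hello" (including the quotes)
console.log(lazyResult === negatedResult); // true
```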

 

Reduce the amount and reach of alternation

减少分支的数量,缩小它们的范围

 

    Alternation using the | vertical bar may require that all alternation options be tested at every position in a string. You can often reduce the need for alternation by using character classes and optional components, or by pushing the alternation further back into the regex (allowing some match attempts to fail before reaching the alternation). The following table shows examples of these techniques.

    分支使用 | ,竖线,可能要求在字符串的每一个位置上测试所有的分支选项。你通常可通过使用字符类和选项组件减少对分支的需求,或将分支在正则表达式上的位置推后(允许到达分支之前的一些匹配尝试失败)。下表列出这些技术的例子。
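
    As an illustration of these rewrites (the patterns and sample words below are assumptions chosen for this sketch), each alternation can be paired with a form that uses a character class, an optional component, or alternation pushed further back, and the two forms checked for agreement:

```javascript
// Each entry pairs an alternation-based pattern with a rewritten form.
const rewrites = [
  { before: /cat|bat/, after: /[cb]at/ },     // character class
  { before: /red|read/, after: /rea?d/ },     // optional component
  { before: /red|raw/, after: /r(?:ed|aw)/ }  // alternation pushed further back
];

// Sample inputs, including near misses such as "rat" and "reed".
const samples = ["cat", "bat", "rat", "red", "read", "reed", "raw", "row", ""];

// Confirm that every rewrite accepts and rejects the same samples.
const allAgree = rewrites.every(function (pair) {
  return samples.every(function (s) {
    return pair.before.test(s) === pair.after.test(s);
  });
});

console.log(allAgree); // true
```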

 



    Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When alternation is necessary, put frequently occurring alternatives first if this doesn't affect what the regex matches. Alternation options are attempted from left to right, so the more frequently an option is expected to match, the sooner you want it to be considered.

    字符类比分支更快,因为他们使用位向量实现(或其他快速实现)而不是回溯。当分支必不可少时,将常用分支放在最前面,如果这样做不影响正则表达式匹配的话。分支选项从左向右依次尝试,一个选项被匹配上的机会越多,它被检测的速度就越快。

 

    Note that Chrome and Firefox perform some of these optimizations automatically, and are therefore less affected by techniques for hand-tuning alternation.

    注意Chrome和Firefox自动执行这些优化中的某些项目,因此较少受到手工调整的影响。

 

Use noncapturing groups

使用非捕获组

 

    Capturing groups spend time and memory remembering backreferences and keeping them up to date. If you don't need a backreference, avoid this overhead by using a noncapturing group—i.e., (?:…) instead of (). Some people like to wrap their regexes in a capturing group when they need a backreference to the entire match. This is unnecessary since you can reference full matches via, e.g., element zero in arrays returned by regex.exec() or $& in replacement strings.

    捕获组花费时间和内存用于记录后向引用,并保持它们是最新的。如果你不需要一个后向引用,可通过使用非捕获组避免这种开销——例如,(?:…)替代(…)。有些人当他们需要一个完全匹配的后向引用时,喜欢将他们的正则表达式包装在一个捕获组中。这是不必要的,因为你能够通过其他方法引用完全匹配,例如,使用regex.exec()返回数组的第一个元素,或替换字符串中的$&。

 

    Replacing capturing groups with their noncapturing kin has minimal impact in Firefox, but can make a big difference in other browsers when dealing with long strings.

    用非捕获组取代捕获组在Firefox中影响很小,但在其他浏览器上处理长字符串时影响很大。
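
    A short sketch with made-up patterns and input: the noncapturing version still groups the alternation, and the full match remains available as element zero of the exec() result or as $& in a replacement string.

```javascript
const capturing = /(foo|bar)baz/;      // remembers "foo" or "bar" as group 1
const nonCapturing = /(?:foo|bar)baz/; // groups without remembering anything

const input = "xx barbaz yy";

const capturedMatch = capturing.exec(input); // ["barbaz", "bar"]
const plainMatch = nonCapturing.exec(input); // ["barbaz"]

// The full match is element zero; no wrapping capturing group is needed.
console.log(plainMatch[0]); // barbaz

// $& inserts the full match into a replacement string.
const wrapped = input.replace(nonCapturing, "[$&]");
console.log(wrapped); // xx [barbaz] yy
```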

 

Capture interesting text to reduce postprocessing

捕获感兴趣的文字,减少后处理

 

    As a caveat to the last tip, if you need to reference parts of a match, then, by all means, capture those parts and use the backreferences produced. For example, if you're writing code to process the contents of quoted strings matched by a regex, use /"([^"]*)"/ and work with backreference one, rather than using /"[^"]*"/ and manually stripping the quote marks from the result. When used in a loop, this kind of work reduction can save significant time.

    最后一个告诫,如果你需要引用匹配的一部分,应当通过一切手段,捕获那些片断,再使用后向引用处理。例如,如果你写代码处理一个正则表达式所匹配的引号中的字符串内容,使用/"([^"]*)"/然后使用一次后向引用,而不是使用/"[^"]*"/然后从结果中手工剥离引号。当在循环中使用时,减少这方面的工作可以节省大量时间。
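
    A minimal sketch of the two approaches in a loop, using a hypothetical input string: the capturing version reads the contents from backreference one, while the other has to strip the quote marks by hand on every iteration.

```javascript
const text = 'a "one" b "two" c ""';

// Approach 1: capture the contents and read backreference one.
const withCapture = /"([^"]*)"/g;
const captured = [];
let m;
while ((m = withCapture.exec(text)) !== null) {
  captured.push(m[1]);
}

// Approach 2: match the whole quoted string, then strip the quotes manually.
const withoutCapture = /"[^"]*"/g;
const stripped = [];
while ((m = withoutCapture.exec(text)) !== null) {
  stripped.push(m[0].slice(1, -1)); // extra work on every match
}

console.log(captured); // [ 'one', 'two', '' ]
```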

 

Expose required tokens

暴露所需的字元

 

    In order to help regex engines make smart decisions about how to optimize a search routine, try to make it easy to determine which tokens are required. When tokens are used within subexpressions or alternation, it's harder for regex engines to determine whether they are required, and some won't make the effort to do so. For instance, the regex /^(ab|cd)/ exposes its start-of-string anchor. IE and Chrome see this and prevent the regex from trying to find matches after the start of a string, thereby making this search near instantaneous regardless of string length. However, because the equivalent regex /(^ab|^cd)/ doesn't expose its ^ anchor, IE doesn't apply the same optimization and ends up pointlessly searching for matches at every position in the string.

    为帮助正则表达式引擎在如何优化查询例程时做出明智的决策,应尽量简单地判断出那些必需的字元。当字元应用在子表达式或者分支中,正则表达式引擎很难判断他们是不是必需的,有些引擎并不作此方面努力。例如,正则表达式/^(ab|cd)/暴露它的字符串起始锚。IE和Chrome会注意到这一点,并阻止正则表达式尝试查找字符串头端之后的匹配,从而使查找瞬间完成而不管字符串长度。但是,由于等价正则表达式/(^ab|^cd)/不暴露它的^锚,IE无法应用同样的优化,最终无意义地搜索字符串并在每一个位置上匹配。
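
    The two patterns from this paragraph can be checked for equivalence with a few sample strings (the samples are assumptions for this sketch). Both accept and reject the same inputs; only the visibility of the ^ anchor to the engine differs.

```javascript
const exposed = /^(ab|cd)/; // anchor sits at the top level of the pattern
const hidden = /(^ab|^cd)/; // same anchor, buried inside each alternative

const inputs = ["abxx", "cdxx", "xxab", "xxcd", "", "a"];

// Compare the results of both patterns on every sample.
const sameResults = inputs.every(function (s) {
  return exposed.test(s) === hidden.test(s);
});

console.log(sameResults);          // true
console.log(exposed.test("xxab")); // false: "ab" is not at the start
```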

 

Use appropriate quantifiers

使用适当的量词

 

    As described in the earlier section “Repetition and backtracking,” greedy and lazy quantifiers go about finding matches differently, even when they match the same strings. Using the more appropriate quantifier type (based on the anticipated amount of backtracking) in cases where they are equally correct can significantly improve performance, especially with long strings.

    正如前一节《重复和回溯》所讨论过的那样,贪婪量词和懒惰量词即使匹配同样的字符串,其查找匹配过程也是不同的。在确保正确等价的前提下,使用更合适的量词类型(基于预期的回溯次数)可以显著提高性能,尤其在处理长字符串时。

 

    Lazy quantifiers are particularly slow in Opera 9.x and earlier, but Opera 10 removes this weakness.

    懒惰量词在Opera 9.x和更早版本上格外缓慢,但Opera 10消除了这一弱点。
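
    A small sketch, with made-up patterns and inputs, of a case where greedy and lazy quantifiers are equally correct but do different amounts of work. Which one does less depends on where the closing character is expected to sit.

```javascript
const greedy = /a.*z/; // grabs as much as possible, then backtracks
const lazy = /a.*?z/;  // grabs as little as possible, then extends

// "z" at the very end: greedy needs almost no backtracking, while lazy
// extends one character at a time across the whole run of dashes.
const zAtEnd = "a" + "-".repeat(50) + "z";

// "z" right after "a": lazy succeeds immediately, while greedy consumes the
// rest of the string and backtracks all the way to the "z".
const zAtStart = "az" + "-".repeat(50);

const sameAtEnd = greedy.exec(zAtEnd)[0] === lazy.exec(zAtEnd)[0];
const sameAtStart = greedy.exec(zAtStart)[0] === lazy.exec(zAtStart)[0];

console.log(sameAtEnd, sameAtStart); // true true
```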

 

Reuse regexes by assigning them to variables

将正则表达式赋给变量,以重用它们

 

    Assigning regexes to variables lets you avoid repeatedly compiling them. Some people go overboard, using regex caching schemes that aim to avoid ever compiling a given pattern and flag combination more than once. Don't bother; regex compilation is fast, and such schemes likely add more overhead than they evade. The important thing is to avoid repeatedly recompiling regexes within loops. In other words, don't do this:

    将正则表达式赋给变量以避免对它们重新编译。有人做的太过火,使用正则表达式缓存池,以避免对给定的模板和标记组合进行多次编译。不要过分忧虑,正则表达式编译很快,这样的缓存池所增加的负担可能超过他们所避免的。重要的是避免在循环体中重复编译正则表达式。换句话说,不要这样做:

 

while (/regex1/.test(str1)) {
  /regex2/.exec(str2);
  ...
}

 

Do this instead:

替代以如下做法:

 

var regex1 = /regex1/,
    regex2 = /regex2/;
while (regex1.test(str1)) {
  regex2.exec(str2);
  ...
}

 

Split complex regexes into simpler pieces

将复杂的正则表达式拆分为简单的片断

 

    Try to avoid doing too much with a single regex. Complicated search problems that require conditional logic are easier to solve and usually more efficient when broken into two or more regexes, with each regex searching within the matches of the last. Regex monstrosities that do everything in one pattern are difficult to maintain, and are prone to backtracking-related problems.

    尽量避免一个正则表达式做太多的工作。复杂的搜索问题需要条件逻辑,拆分为两个或多个正则表达式更容易解决,通常也更高效,每个正则表达式只在最后的匹配结果中执行查找。在一个模板中完成所有工作的正则表达式怪兽很难维护,而且容易引起回溯相关的问题。
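
    As a hedged sketch of this idea (the markup and both patterns are hypothetical, and this is not a general HTML parser), a first regex finds each opening anchor tag and a second regex searches only within that match for the href value:

```javascript
const html = '<p><a href="/one">1</a> <img src="x.png"> ' +
             '<a class="k" href="/two">2</a></p>';

// Step 1: find each opening <a ...> tag.
const anchorTag = /<a\b[^>]*>/g;

// Step 2: inside a single tag, find the href attribute value.
const hrefAttr = /\bhref="([^"]*)"/;

const links = [];
let tag;
while ((tag = anchorTag.exec(html)) !== null) {
  const href = hrefAttr.exec(tag[0]); // search only within the last match
  if (href !== null) {
    links.push(href[1]);
  }
}

console.log(links); // [ '/one', '/two' ]
```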

 

When Not to Use Regular Expressions  什么时候不应该使用正则表达式

 

    When used with care, regexes are very fast. However, they're usually overkill when you are merely searching for literal strings. This is especially true if you know in advance which part of a string you want to test. For instance, if you want to check whether a string ends with a semicolon, you could use something like this:

    小心使用它,正则表达式是非常快的。然而,当你只是搜索文字字符串时它们经常矫枉过正。尤其当你事先知道了字符串的哪一部分将要被测试时。例如,如果你想检查一个字符串是不是以分号结束,你可以使用:

 

endsWithSemicolon = /;$/.test(str);

 

    You might be surprised to learn, though, that none of the big browsers are currently smart enough to realize in advance that this regex can match only at the end of the string. What they end up doing is stepping through the entire string. Each time a semicolon is found, the regex advances to the next token ($), which checks whether the match is at the end of the string. If not, the regex continues searching for a match until it finally makes its way through the entire string. The longer your string (and the more semicolons it contains), the longer this takes.

    你可能觉得很奇怪,虽说当前没有哪个浏览器聪明到这个程度,能够意识到这个正则表达式只能匹配字符串的末尾。最终它们所做的将是一个一个地测试了整个字符串。每当发现了一个分号,正则表达式就前进到下一个字元($),检查它是否匹配字符串的末尾。如果不是这样的话,正则表达式就继续搜索匹配,直到穿越了整个字符串。字符串的长度越长(包含的分号越多),它占用的时间也越长。

 

    In this case, a better approach is to skip all the intermediate steps required by a regex and simply check whether the last character is a semicolon:

    这种情况下,更好的办法是跳过正则表达式所需的所有中间步骤,简单地检查最后一个字符是不是分号:

 

endsWithSemicolon = str.charAt(str.length - 1) == ";";

 

    This is just a bit faster than the regex-based test with small target strings, but, more importantly, the string's length no longer affects the time needed to perform the test.

    目标字符串很小时,这种方法只比正则表达式快一点,但更重要的是,字符串的长度不再影响执行测试所需要的时间。

 

    This example used the charAt method to read the character at a specific position. The string methods slice, substr, and substring work well when you want to extract and check the value of more than one character at a specific position. Additionally, the indexOf and lastIndexOf methods are great for finding the position of literal strings or checking for their presence. All of these string methods are fast and can help you avoid invoking the overhead of regular expressions when searching for literal strings that don't rely on fancy regex features.

    这个例子使用charAt函数在特定位置上读取字符。字符串函数slice,substr,和substring可用于在特定位置上提取并检查多个字符的值。此外,indexOf和lastIndexOf函数非常适合查找特定字符串的位置,或者判断它们是否存在。所有这些字符串操作函数速度都很快,当您搜索那些不依赖正则表达式复杂特性的文本字符串时,它们有助于您避免正则表达式带来的性能开销。
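
    A brief sketch of these string methods on a hypothetical line of text, each replacing what could otherwise be a simple regex test:

```javascript
const line = "var total = price * qty;";

// Check a single character at a known position.
const endsWithSemi = line.charAt(line.length - 1) === ";";

// Extract and check several characters at a known position.
const startsWithVar = line.slice(0, 4) === "var ";

// Check for the presence of a literal string anywhere.
const mentionsPrice = line.indexOf("price") !== -1;

// Find the position of the last occurrence of a literal string.
const lastEquals = line.lastIndexOf("=");

console.log(endsWithSemi, startsWithVar, mentionsPrice, lastEquals);
```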
