司徒汇编's Column

I am a programmer with lofty ambitions

[Translation] High Performance JavaScript (018)

String Trimming

 

    Removing leading and trailing whitespace from a string is a simple but common task. Although ECMAScript 5 adds a native string trim method (and you should therefore start to see this method in upcoming browsers), JavaScript has not historically included it. For the current browser crop, it's still necessary to implement a trim method yourself or rely on a library that includes it.


 

    Trimming strings is not a common performance bottleneck, but it serves as a decent case study for regex optimization since there are a variety of ways to implement it.


 

Trimming with Regular Expressions

 

    Regular expressions allow you to implement a trim method with very little code, which is important for JavaScript libraries that focus on file size. Probably the best all-around solution is to use two substitutions—one to remove leading whitespace and another to remove trailing whitespace. This keeps things simple and fast, especially with long strings.


 

if (!String.prototype.trim) {
  String.prototype.trim = function() {
    return this.replace(/^\s+/, "").replace(/\s+$/, "");
  };
}
// test the new method...
// tab (\t) and line feed (\n) characters are
// included in the leading whitespace.
var str = " \t\n test string ".trim();
alert(str == "test string");
// alerts "true"

 

    The if block surrounding this code avoids overriding the trim method if it already exists, since native methods are optimized and usually far faster than anything you can implement yourself using a JavaScript function. Subsequent implementations of this example assume that this conditional is in place, though it is not written out each time.


 

    You can give Firefox a performance boost of roughly 35% (less or more depending on the target string's length and content) by replacing /\s+$/ (the second regex) with /\s\s*$/. Although these two regexes are functionally identical, Firefox provides additional optimization for regexes that start with a nonquantified token. In other browsers, the difference is less significant or is optimized differently altogether. However, changing the regex that matches at the beginning of strings to /^\s\s*/ does not produce a measurable difference, because the leading ^ anchor takes care of quickly invalidating nonmatching positions (precluding a slight performance difference from compounding over thousands of match attempts within a long string).
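    As a minimal sketch of the variant just described, the two-substitution approach can be written as a standalone function so it can be tried without touching String.prototype. The name trimTwoPass is an assumption for illustration and does not come from the book.

```javascript
// Hypothetical helper name (not from the book). Same two-substitution
// approach as the first example, but the trailing regex starts with a
// nonquantified \s token followed by \s* instead of using \s+.
function trimTwoPass(str) {
  return str.replace(/^\s+/, "").replace(/\s\s*$/, "");
}

console.log(trimTwoPass(" \t\n test string ") === "test string"); // true
```

    Both trailing patterns require at least one whitespace character before the end of the string, so the result is the same either way; only the engine's internal work may differ.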

 

    Following are several more regex-based trim implementations, which are some of the more common alternatives you might encounter. You can see cross-browser performance numbers for all trim implementations described here in Table 5-2 at the end of this section. There are, in fact, many ways beyond those listed here that you can write a regular expression to help you trim strings, but they are invariably slower (or at least less consistently decent cross-browser) than using two simple substitutions when working with long strings.


 

// trim 2
String.prototype.trim = function() {
  return this.replace(/^\s+|\s+$/g, "");
}

    This is probably the most common solution. It combines the two simple regexes via alternation, and uses the /g (global) flag to replace all matches rather than just the first (it will match twice when its target contains both leading and trailing whitespace). This isn't a terrible approach, but it's slower than using two simple substitutions when working with long strings since the two alternation options need to be tested at every character position.


 

// trim 3
String.prototype.trim = function() {
  return this.replace(/^\s*([\s\S]*?)\s*$/, "$1");
}

    This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference one. By replacing the entire string with backreference one, you're left with a trimmed version of the string.


 

    This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings. After the regex enters the capturing group, the [\s\S] class's lazy *? quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining \s*$ pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.
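    The stepwise behavior described above can be modeled with a short sketch. This is a simplified simulation, not a view into any real regex engine: the function name countLazyAttempts is hypothetical, and it only counts how many times the trailing \s*$ part would be tried as the lazy group grows one character at a time.

```javascript
// Simplified model (an assumption, not engine internals): count how often
// the trailing \s*$ pattern is tried while the lazy [\s\S]*? group expands.
function countLazyAttempts(str) {
  var start = str.match(/^\s*/)[0].length; // position where the group begins
  var tail = /^\s*$/;                      // the remaining pattern after the group
  var attempts = 0;
  for (var len = 0; start + len <= str.length; len++) {
    attempts++;
    if (tail.test(str.slice(start + len))) {
      break; // the rest is all whitespace, so the overall match succeeds
    }
  }
  return attempts;
}

console.log(countLazyAttempts("  ab cd  ")); // 6
```

    In this model the number of attempts is the length of the trimmed text plus one, which is why the cost grows with the overall length of the string rather than with the amount of whitespace.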

 

    Lazy repetition is particularly slow in Opera 9.x and earlier. Consequently, trimming long strings with this method in Opera 9.64 performs about 10 to 100 times slower than in the other big browsers. Opera 10 fixes this longstanding weakness, bringing this method's performance in line with other browsers.


 

// trim 4
String.prototype.trim = function() {
  return this.replace(/^\s*([\s\S]*\S)?\s*$/, "$1");
}

    This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing \S is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier.
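    To see why the optional quantifier matters, here is a small sketch comparing the regex with and without it. Both function names are hypothetical and exist only for this comparison.

```javascript
// Hypothetical names. The first variant omits the trailing ? on the group,
// so a whitespace-only string cannot match and is returned unchanged.
function trimGreedyRequired(str) {
  return str.replace(/^\s*([\s\S]*\S)\s*$/, "$1");
}

// The second variant is the regex from trim 4, with the group made optional.
function trimGreedyOptional(str) {
  return str.replace(/^\s*([\s\S]*\S)?\s*$/, "$1");
}

console.log(trimGreedyRequired("   ") === "   "); // true (no match, nothing replaced)
console.log(trimGreedyOptional("   ") === "");    // true
console.log(trimGreedyOptional("  a b  ") === "a b"); // true
```

    When the optional group does not participate in the match, JavaScript substitutes an empty string for $1, which is what produces the empty result.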

 

    Here, the greedy asterisk in [\s\S]* repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it's able to match the following \S, or until it backtracks to the first character matched within the group (after which it skips the group).

 

    Unless there's more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. In fact, it's so much faster that in IE, Safari, Chrome, and Opera 10, it even beats using two substitutions. That's because those browsers contain special optimization for greedy repetition of character classes that match any character. The regex engine jumps to the end of the string without evaluating intermediate characters (although backtracking positions must still be recorded), and then backtracks as appropriate. Unfortunately, this method is considerably slower in Firefox and Opera 9, so at least for now, using two substitutions still holds up better cross-browser.


 

// trim 5
String.prototype.trim = function() {
  return this.replace(/^\s*(\S*(\s+\S+)*)\s*$/, "$1");
}

    This is a relatively common approach, but there's no good reason to use it since it's consistently one of the slowest of the options shown here, in all browsers. It's similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.


 

    Changing the inner group to a noncapturing group, i.e., changing (\s+\S+) to (?:\s+\S+), helps a bit, slashing roughly 20%–45% off the time needed in Opera, IE, and Chrome, along with much slighter improvements in Safari and Firefox. Still, a noncapturing group can't redeem this implementation. Note that the outer group cannot be converted to a noncapturing group since it is referenced in the replacement string.
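    For reference, the noncapturing variant mentioned above might look like the following sketch. The function name trimWordGroups is an assumption, and the code is written as a standalone function rather than a String.prototype method.

```javascript
// Hypothetical name. Same regex as trim 5, except the inner group is
// noncapturing; the outer group stays capturing because "$1" refers to it.
function trimWordGroups(str) {
  return str.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1");
}

console.log(trimWordGroups("  a b  c ") === "a b  c"); // true
```

    Interior whitespace is preserved because the inner group consumes each run of whitespace only when a word follows it.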

 

Trimming Without Regular Expressions

 

    Although regular expressions are fast, it's worth considering the performance of trimming without their help. Here's one way to do so:


 

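// trim 6
String.prototype.trim = function() {
  var start = 0,
      end = this.length - 1,
      // whitespace list; split into concatenated pieces because a string
      // literal cannot span multiple lines
      ws = " \n\r\t\f\x0b\xa0\u1680\u180e\u2000\u2001\u2002\u2003" +
           "\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f" +
           "\u205f\u3000\ufeff";
  // the start <= end guard ends the loop on empty or whitespace-only strings,
  // where charAt() returns "" and ws.indexOf("") is 0
  while (start <= end && ws.indexOf(this.charAt(start)) > -1) {
    start++;
  }
  while (end > start && ws.indexOf(this.charAt(end)) > -1) {
    end--;
  }
  return this.slice(start, end + 1);
}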

    The ws variable in this code includes all whitespace characters as defined by ECMAScript 5. For efficiency reasons, copying any part of the string is avoided until the trimmed version's start and end positions are known.


 

    It turns out that this smokes the regex competition when there is only a bit of whitespace on the ends of the string. The reason is that although regular expressions are well suited for removing whitespace from the beginning of a string, they're not as fast at trimming from the end of long strings. As noted in the section “When Not to Use Regular Expressions” on page 99, a regex cannot jump to the end of a string without considering characters along the way. However, this implementation does just that, with the second while loop working backward from the end of the string until it finds a nonwhitespace character.


 

    Although this version is not affected by the overall length of the string, it has its own weakness: long leading and trailing whitespace. That's because looping over characters to check whether they are whitespace can't match the efficiency of a regex's optimized search code.


 

A Hybrid Solution

 

    The final approach for this section is to combine a regex's universal efficiency at trimming leading whitespace with the nonregex method's speed at trimming trailing characters.


 

// trim 7
String.prototype.trim = function() {
  var str = this.replace(/^\s+/, ""),
      end = str.length - 1,
      ws = /\s/;
  while (ws.test(str.charAt(end))) {
    end--;
  }
  return str.slice(0, end + 1);
}

    This hybrid method remains insanely fast when trimming only a bit of whitespace, and removes the performance risk of strings with long leading whitespace and whitespace-only strings (although it maintains the weakness for strings with long trailing whitespace). Note that this solution uses a regex in the loop to check whether characters at the end of the string are whitespace. Although using a regex for this adds a bit of performance overhead, it lets you defer the list of whitespace characters to the browser for the sake of brevity and compatibility.

 

    The general trend for all trim methods described here is that overall string length has more impact than the number of characters to be trimmed in regex-based solutions, whereas nonregex solutions that work backward from the end of the string are unaffected by overall string length but more significantly affected by the amount of whitespace to trim. The simplicity of using two regex substitutions provides consistently respectable performance cross-browser with varying string contents and lengths, and therefore it's arguably the best all-around solution. The hybrid solution is exceptionally fast with long strings at the cost of slightly longer code and a weakness in some browsers for long, trailing whitespace. See Table 5-2 for all the gory details.


 

Table 5-2. Cross-browser performance of various trim implementations

a Reported times were generated by trimming a large string (40 KB) 100 times, first with 10 and then 1,000 spaces added to each end.

b Tested without the /\s\s*$/ optimization.

c Tested without the noncapturing group optimization.
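    A rough timing harness in the spirit of footnote a could look like the sketch below. It is not the book's actual test code: the function names, the filler text, and the use of Date.now() are all assumptions, and the numbers it prints will vary by browser and machine.

```javascript
// Two of the implementations from this section as standalone functions
// (names are illustrative, not from the book).
function trimTwoRegexes(str) {
  return str.replace(/^\s+/, "").replace(/\s\s*$/, "");
}

function trimHybrid(str) {
  var s = str.replace(/^\s+/, ""),
      end = s.length - 1,
      ws = /\s/;
  while (ws.test(s.charAt(end))) {
    end--;
  }
  return s.slice(0, end + 1);
}

// Builds a target of roughly 40 KB with `padding` spaces added to each end.
function makeTarget(padding) {
  var body = new Array(8192).join("test ") + "test"; // 40,959 characters
  var pad = new Array(padding + 1).join(" ");
  return pad + body + pad;
}

// Returns the elapsed milliseconds for `runs` calls of trimFn on target.
function timeTrim(trimFn, target, runs) {
  var started = Date.now();
  for (var i = 0; i < runs; i++) {
    trimFn(target);
  }
  return Date.now() - started;
}

var paddings = [10, 1000];
for (var p = 0; p < paddings.length; p++) {
  var target = makeTarget(paddings[p]);
  console.log(paddings[p] + " spaces, two regexes: " +
              timeTrim(trimTwoRegexes, target, 100) + " ms");
  console.log(paddings[p] + " spaces, hybrid: " +
              timeTrim(trimHybrid, target, 100) + " ms");
}
```

    Millisecond timers are coarse, so a single run like this only hints at relative cost; repeating the measurement several times gives a steadier picture.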

 

Summary

 

    Intensive string operations and incautiously crafted regexes can be major performance obstructions, but the advice in this chapter helps you avoid common pitfalls.


 

• When concatenating numerous or large strings, array joining is the only method with reasonable performance in IE7 and earlier.



• If you don't need to worry about IE7 and earlier, array joining is one of the slowest ways to concatenate strings. Use simple + and += operators instead, and avoid unnecessary intermediate strings.



• Backtracking is both a fundamental component of regex matching and a frequent source of regex inefficiency.



• Runaway backtracking can cause a regex that usually finds matches quickly to run slowly or even crash your browser when applied to partially matching strings. Techniques for avoiding this problem include making adjacent tokens mutually exclusive, avoiding nested quantifiers that allow matching the same part of a string more than one way, and eliminating needless backtracking by repurposing the atomic nature of lookahead.



• A variety of techniques exist for improving regex efficiency by helping regexes find matches faster and spend less time considering nonmatching positions (see “More Ways to Improve Regular Expression Efficiency” on page 96).



• Regexes are not always the best tool for the job, especially when you are merely searching for literal strings.


 

• Although there are many ways to trim a string, using two simple regexes (one to remove leading whitespace and another for trailing whitespace) offers a good mix of brevity and cross-browser efficiency with varying string contents and lengths. Looping from the end of the string in search of the first nonwhitespace characters, or combining this technique with regexes in a hybrid approach, offers a good alternative that is less affected by overall string length.

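    The two concatenation styles contrasted in the first two summary points can be sketched as follows. The function names are hypothetical, and which style is faster depends on the browser, as the summary notes.

```javascript
// Hypothetical names. Builds a string with the += operator.
function concatWithOperator(parts) {
  var result = "";
  for (var i = 0; i < parts.length; i++) {
    result += parts[i];
  }
  return result;
}

// Builds the same string by joining the array in one call.
function concatWithJoin(parts) {
  return parts.join("");
}

var parts = ["High ", "Performance ", "JavaScript"];
console.log(concatWithOperator(parts) === concatWithJoin(parts)); // true
```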
