[Translation] High Performance JavaScript (018)

String Trimming


    Removing leading and trailing whitespace from a string is a simple but common task. Although ECMAScript 5 adds a native string trim method (and you should therefore start to see this method in upcoming browsers), JavaScript has not historically included it. For the current browser crop, it's still necessary to implement a trim method yourself or rely on a library that includes it.



    Trimming strings is not a common performance bottleneck, but it serves as a decent case study for regex optimization since there are a variety of ways to implement it.



Trimming with Regular Expressions


    Regular expressions allow you to implement a trim method with very little code, which is important for JavaScript libraries that focus on file size. Probably the best all-around solution is to use two substitutions—one to remove leading whitespace and another to remove trailing whitespace. This keeps things simple and fast, especially with long strings.



if (!String.prototype.trim) {
  String.prototype.trim = function() {
    return this.replace(/^\s+/, "").replace(/\s+$/, "");
  };
}
// test the new method...
// tab (\t) and line feed (\n) characters are
// included in the leading whitespace.
var str = " \t\n test string ".trim();
alert(str == "test string");
// alerts "true"


    The if block surrounding this code avoids overriding the trim method if it already exists, since native methods are optimized and usually far faster than anything you can implement yourself using a JavaScript function. Subsequent implementations of this example assume that this conditional is in place, though it is not written out each time.



    You can give Firefox a performance boost of roughly 35% (less or more depending on the target string's length and content) by replacing /\s+$/ (the second regex) with /\s\s*$/. Although these two regexes are functionally identical, Firefox provides additional optimization for regexes that start with a nonquantified token. In other browsers, the difference is less significant or is optimized differently altogether. However, changing the regex that matches at the beginning of strings to /^\s\s*/ does not produce a measurable difference, because the leading ^ anchor takes care of quickly invalidating nonmatching positions (precluding a slight performance difference from compounding over thousands of match attempts within a long string).
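The claim that the two trailing-whitespace regexes are functionally identical can be spot-checked with a short sketch. The helper names `trimPlus` and `trimExplicit` are hypothetical (they are not from the book), and both are written as standalone functions so that a native String.prototype.trim is left untouched.

```javascript
// Hypothetical helpers for comparison; the names are not from the original text.
function trimPlus(str) {
  return str.replace(/^\s+/, "").replace(/\s+$/, "");
}

function trimExplicit(str) {
  // \s\s* matches exactly what \s+ matches, but starts with a nonquantified token.
  return str.replace(/^\s+/, "").replace(/\s\s*$/, "");
}

// A few sample inputs, including empty and whitespace-only strings.
var equivalenceSamples = ["", "   ", " \t\n test string ", "no padding", "inner  space  "];
var regexesAgree = equivalenceSamples.every(function (s) {
  return trimPlus(s) === trimExplicit(s);
});
```

This only checks that the results agree. Whether the second form is actually faster depends on the engine and has to be measured.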



    Following are several more regex-based trim implementations, which are some of the more common alternatives you might encounter. You can see cross-browser performance numbers for all trim implementations described here in Table 5-2 at the end of this section. There are, in fact, many ways beyond those listed here that you can write a regular expression to help you trim strings, but they are invariably slower (or at least less consistently decent cross-browser) than using two simple substitutions when working with long strings.



// trim 2
String.prototype.trim = function() {
  return this.replace(/^\s+|\s+$/g, "");
};

    This is probably the most common solution. It combines the two simple regexes via alternation, and uses the /g (global) flag to replace all matches rather than just the first (it will match twice when its target contains both leading and trailing whitespace). This isn't a terrible approach, but it's slower than using two simple substitutions when working with long strings since the two alternation options need to be tested at every character position.
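To see the "matches twice" behavior, a replacement callback can count how often the alternation regex matches. This is only an illustrative sketch; `countTrimMatches` is a hypothetical helper, not part of the book's code.

```javascript
// Hypothetical helper: trims with the alternation regex and reports
// how many separate matches the /g flag produced.
function countTrimMatches(str) {
  var count = 0;
  var result = str.replace(/^\s+|\s+$/g, function () {
    count++;      // called once per match
    return "";    // remove the matched whitespace
  });
  return { result: result, count: count };
}

var bothEnds = countTrimMatches("  a b  ");   // leading and trailing whitespace
var trailingOnly = countTrimMatches("a b  "); // trailing whitespace only
```

Whitespace inside the string is never matched, because neither alternative can succeed away from the start or end of the target.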



// trim 3
String.prototype.trim = function() {
  return this.replace(/^\s*([\s\S]*?)\s*$/, "$1");
};

    This regex works by matching the entire string and capturing the sequence from the first to the last nonwhitespace characters (if any) to backreference one. By replacing the entire string with backreference one, you're left with a trimmed version of the string.



    This approach is conceptually simple, but the lazy quantifier inside the capturing group makes the regex do a lot of extra work (i.e., backtracking), and therefore tends to make this option slow with long target strings. After the regex enters the capturing group, the [\s\S] class's lazy *? quantifier requires that it be repeated as few times as possible. Thus, the regex matches one character at a time, stopping after each character to try to match the remaining \s*$ pattern. If that fails because nonwhitespace characters remain somewhere after the current position in the string, the regex matches one more character, updates the backreference, and then tries the remainder of the pattern again.
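The stepping described above can be made concrete with a simplified model. This is not how any regex engine is implemented; `lazyTrimModel` is a hypothetical function that mimics the lazy group growing one character at a time and counts how often the rest of the pattern is retried.

```javascript
// Simplified model (an assumption for illustration, not engine code) of how
// /^\s*([\s\S]*?)\s*$/ proceeds: grow the lazy group by one character,
// then retry the remaining \s*$ pattern.
function lazyTrimModel(str) {
  var start = str.match(/^\s*/)[0].length; // the leading \s* is consumed first
  var attempts = 0;
  for (var len = 0; start + len <= str.length; len++) {
    attempts++;
    // Does the remainder consist of whitespace only, i.e. does \s*$ match here?
    if (/^\s*$/.test(str.slice(start + len))) {
      return { captured: str.slice(start, start + len), attempts: attempts };
    }
  }
  // Not reached: the empty remainder at the end of the string always matches.
  return { captured: "", attempts: attempts };
}

var lazyDemo = lazyTrimModel("  ab cd  ");
```

In this model the number of attempts grows with the length of the text between the leading and trailing whitespace, which is why long targets are the weak spot of this approach.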



    Lazy repetition is particularly slow in Opera 9.x and earlier. Consequently, trimming long strings with this method in Opera 9.64 performs about 10 to 100 times slower than in the other big browsers. Opera 10 fixes this longstanding weakness, bringing this method's performance in line with other browsers.



// trim 4
String.prototype.trim = function() {
  return this.replace(/^\s*([\s\S]*\S)?\s*$/, "$1");
};

    This is similar to the last regex, but it replaces the lazy quantifier with a greedy one for performance reasons. To make sure that the capturing group still only matches up to the last nonwhitespace character, a trailing \S is required. However, since the regex must be able to match whitespace-only strings, the entire capturing group is made optional by adding a trailing question mark quantifier.



    Here, the greedy asterisk in [\s\S]* repeats its any-character pattern to the end of the string. The regex then backtracks one character at a time until it's able to match the following \S, or until it backtracks to the first character matched within the group (after which it skips the group).



    Unless there's more trailing whitespace than other text, this generally ends up being faster than the previous solution that used a lazy quantifier. In fact, it's so much faster that in IE, Safari, Chrome, and Opera 10, it even beats using two substitutions. That's because those browsers contain special optimization for greedy repetition of character classes that match any character. The regex engine jumps to the end of the string without evaluating intermediate characters (although backtracking positions must still be recorded), and then backtracks as appropriate. Unfortunately, this method is considerably slower in Firefox and Opera 9, so at least for now, using two substitutions still holds up better cross-browser.
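The role of the optional group is easiest to see with whitespace-only and empty inputs. The following is a small sketch using a standalone function; `trimGreedy` is a hypothetical name chosen for illustration.

```javascript
// Standalone version of the greedy variant; the function name is hypothetical.
function trimGreedy(str) {
  // The (...)? group is skipped when no nonwhitespace character exists,
  // and "$1" then substitutes an empty string for the unmatched group.
  return str.replace(/^\s*([\s\S]*\S)?\s*$/, "$1");
}

var greedyResults = [
  trimGreedy("  a b  "), // group captures "a b"
  trimGreedy("   "),     // group does not participate
  trimGreedy("")         // group does not participate
];
```

Without the trailing question mark, the regex could not match the second and third inputs at all, and replace would return them unchanged.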



// trim 5
String.prototype.trim = function() {
  return this.replace(/^\s*(\S*(\s+\S+)*)\s*$/, "$1");
};

    This is a relatively common approach, but there's no good reason to use it since it's consistently one of the slowest of the options shown here, in all browsers. It's similar to the last two regexes in that it matches the entire string and replaces it with the part you want to keep, but because the inner group matches only one word at a time, there are a lot of discrete steps the regex must take. The performance hit may be unnoticeable when trimming short strings, but with long strings that contain many words, this regex can become a performance problem.



    Changing the inner group to a noncapturing group, i.e., changing (\s+\S+) to (?:\s+\S+), helps a bit, slashing roughly 20%–45% off the time needed in Opera, IE, and Chrome, along with much slighter improvements in Safari and Firefox. Still, a noncapturing group can't redeem this implementation. Note that the outer group cannot be converted to a noncapturing group since it is referenced in the replacement string.
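As a sketch of that change, here is the word-by-word regex with the inner group made noncapturing. The function name `trimWordsNoncapturing` is hypothetical, and the two exec calls are there only to show that the rewritten regex tracks one capture fewer.

```javascript
// Word-by-word trim with a noncapturing inner group; the name is hypothetical.
function trimWordsNoncapturing(str) {
  // Only the outer group captures, so "$1" still refers to the kept text.
  return str.replace(/^\s*(\S*(?:\s+\S+)*)\s*$/, "$1");
}

// exec returns the full match plus one entry per capturing group.
var capturingEntries = /^\s*(\S*(\s+\S+)*)\s*$/.exec(" a b ").length;
var noncapturingEntries = /^\s*(\S*(?:\s+\S+)*)\s*$/.exec(" a b ").length;

var wordsResult = trimWordsNoncapturing("  foo bar  baz ");
```

The trimmed output is the same either way; the saving comes from the engine not having to record the inner group's match on every repetition.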



Trimming Without Regular Expressions


    Although regular expressions are fast, it's worth considering the performance of trimming without their help. Here's one way to do so:



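// trim 6
String.prototype.trim = function() {
  var start = 0,
      end = this.length - 1,
      // The source cuts this list off after \u2003; the rest is filled in from the ES5 WhiteSpace and LineTerminator definitions.
      ws = " \n\r\t\f\x0b\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000\ufeff";
  // Guard with start <= end: charAt() past the end returns "" and indexOf("") is 0,
  // so an empty or whitespace-only string would otherwise loop forever.
  while (start <= end && ws.indexOf(this.charAt(start)) > -1) {
    start++;
  }
  while (end > start && ws.indexOf(this.charAt(end)) > -1) {
    end--;
  }
  return this.slice(start, end + 1);
};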

    The ws variable in this code includes all whitespace characters as defined by ECMAScript 5. For efficiency reasons, copying any part of the string is avoided until the trimmed version's start and end positions are known.



    It turns out that this smokes the regex competition when there is only a bit of whitespace on the ends of the string. The reason is that although regular expressions are well suited for removing whitespace from the beginning of a string, they're not as fast at trimming from the end of long strings. As noted in the section “When Not to Use Regular Expressions” on page 99, a regex cannot jump to the end of a string without considering characters along the way. However, this implementation does just that, with the second while loop working backward from the end of the string until it finds a nonwhitespace character.



    Although this version is not affected by the overall length of the string, it has its own weakness: long leading and trailing whitespace. That's because looping over characters to check whether they are whitespace can't match the efficiency of a regex's optimized search code.
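One way to see both properties is to count how many characters the two loops inspect. The sketch below is illustrative only: `countCharsExamined` is a hypothetical helper, it uses an abbreviated whitespace list for brevity, and it bounds the first loop so that whitespace-only input terminates.

```javascript
// Hypothetical helper that mirrors the two-loop approach and counts
// how many characters are inspected along the way.
function countCharsExamined(str) {
  var ws = " \n\r\t\f\x0b\xa0"; // abbreviated list, an assumption for this sketch
  var start = 0, end = str.length - 1, examined = 0;
  while (start <= end) {
    examined++;
    if (ws.indexOf(str.charAt(start)) === -1) break; // first nonwhitespace character
    start++;
  }
  while (end > start) {
    examined++;
    if (ws.indexOf(str.charAt(end)) === -1) break; // last nonwhitespace character
    end--;
  }
  return { trimmed: str.slice(start, end + 1), examined: examined };
}

var longBody = new Array(1001).join("x"); // 1,000 nonwhitespace characters
var shortCase = countCharsExamined("  abc  ");
var longCase = countCharsExamined("  " + longBody + "  ");
var paddedCase = countCharsExamined("          abc          ");
```

The short and long targets cost the same number of inspections because only the ends are scanned, while the heavily padded target costs more even though its content is short.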



A Hybrid Solution


    The final approach for this section is to combine a regex's universal efficiency at trimming leading whitespace with the nonregex method's speed at trimming trailing characters.



// trim 7
String.prototype.trim = function() {
  var str = this.replace(/^\s+/, ""),
      end = str.length - 1,
      ws = /\s/;
  while (ws.test(str.charAt(end))) {
    end--;
  }
  return str.slice(0, end + 1);
};

    This hybrid method remains insanely fast when trimming only a bit of whitespace, and removes the performance risk of strings with long leading whitespace and whitespace-only strings (although it maintains the weakness for strings with long trailing whitespace). Note that this solution uses a regex in the loop to check whether characters at the end of the string are whitespace. Although using a regex for this adds a bit of performance overhead, it lets you defer the list of whitespace characters to the browser for the sake of brevity and compatibility.
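The whitespace-only case is worth tracing once. In the standalone sketch below (`hybridTrim` is a hypothetical name), the leading-whitespace regex already reduces such a string to "", so the backward loop has nothing left to scan.

```javascript
// Standalone form of the hybrid approach; the function name is hypothetical.
function hybridTrim(str) {
  var s = str.replace(/^\s+/, ""); // the regex handles leading whitespace
  var end = s.length - 1;
  var ws = /\s/;                   // the engine decides what counts as whitespace
  // charAt(-1) returns "", which \s does not match, so the loop stops for "".
  while (ws.test(s.charAt(end))) {
    end--;
  }
  return s.slice(0, end + 1);
}

var hybridChecks = [
  hybridTrim(" \t\n test string \u00a0"), // no-break space is covered by \s in ES5 engines
  hybridTrim("   "),
  hybridTrim("")
];
```

Because /\s/ is used instead of a hand-written character list, the set of trimmed characters follows whatever the running engine treats as whitespace.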



    The general trend for all trim methods described here is that overall string length has more impact than the number of characters to be trimmed in regex-based solutions, whereas nonregex solutions that work backward from the end of the string are unaffected by overall string length but more significantly affected by the amount of whitespace to trim. The simplicity of using two regex substitutions provides consistently respectable performance cross-browser with varying string contents and lengths, and therefore it's arguably the best all-around solution. The hybrid solution is exceptionally fast with long strings at the cost of slightly longer code and a weakness in some browsers for long, trailing whitespace. See Table 5-2 for all the gory details.



Table 5-2. Cross-browser performance of various trim implementations


a Reported times were generated by trimming a large string (40 KB) 100 times, first with 10 and then 1,000 spaces added to each end.


b Tested without the /\s\s*$/ optimization.

c Tested without the noncapturing group optimization.
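For readers who want to collect their own numbers, the following is a rough harness in the spirit of footnote a: a target of about 40 KB with 10 spaces on each end, trimmed 100 times. It is a sketch under stated assumptions, not the author's test code. `makeTarget`, `timeTrim`, `twoPassTrim`, and `hybridTrimTimed` are hypothetical names, the filler text is arbitrary, one character is treated as one byte, and Date.now() gives only millisecond resolution.

```javascript
// Build a target string: padLength spaces, bodyLength characters of filler, padLength spaces.
function makeTarget(bodyLength, padLength) {
  var body = "";
  while (body.length < bodyLength) {
    body += "lorem ipsum "; // arbitrary filler text
  }
  body = body.slice(0, bodyLength - 1) + "x"; // make sure the body ends on nonwhitespace
  var pad = new Array(padLength + 1).join(" ");
  return pad + body + pad;
}

// Run trimFn on the target `runs` times and report elapsed milliseconds.
function timeTrim(trimFn, target, runs) {
  var started = Date.now();
  var result;
  for (var i = 0; i < runs; i++) {
    result = trimFn(target);
  }
  return { ms: Date.now() - started, result: result };
}

function twoPassTrim(str) {
  return str.replace(/^\s+/, "").replace(/\s\s*$/, "");
}

function hybridTrimTimed(str) {
  var s = str.replace(/^\s+/, ""), end = s.length - 1, ws = /\s/;
  while (ws.test(s.charAt(end))) {
    end--;
  }
  return s.slice(0, end + 1);
}

var benchTarget = makeTarget(40 * 1024, 10);
var benchReport = {
  twoPass: timeTrim(twoPassTrim, benchTarget, 100),
  hybrid: timeTrim(hybridTrimTimed, benchTarget, 100)
};
```

Timings from a harness like this vary with the engine, the machine, and the contents of the target string, so they should be treated as relative indications rather than as a reproduction of Table 5-2.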



Summary


    Intensive string operations and incautiously crafted regexes can be major performance obstructions, but the advice in this chapter helps you avoid common pitfalls.



• When concatenating numerous or large strings, array joining is the only method with reasonable performance in IE7 and earlier.


• If you don't need to worry about IE7 and earlier, array joining is one of the slowest ways to concatenate strings. Use simple + and += operators instead, and avoid unnecessary intermediate strings.


• Backtracking is both a fundamental component of regex matching and a frequent source of regex inefficiency.


• Runaway backtracking can cause a regex that usually finds matches quickly to run slowly or even crash your browser when applied to partially matching strings. Techniques for avoiding this problem include making adjacent tokens mutually exclusive, avoiding nested quantifiers that allow matching the same part of a string more than one way, and eliminating needless backtracking by repurposing the atomic nature of lookahead.


• A variety of techniques exist for improving regex efficiency by helping regexes find matches faster and spend less time considering nonmatching positions (see “More Ways to Improve Regular Expression Efficiency” on page 96).


• Regexes are not always the best tool for the job, especially when you are merely searching for literal strings.



• Although there are many ways to trim a string, using two simple regexes (one to remove leading whitespace and another for trailing whitespace) offers a good mix of brevity and cross-browser efficiency with varying string contents and lengths. Looping from the end of the string in search of the first nonwhitespace characters, or combining this technique with regexes in a hybrid approach, offers a good alternative that is less affected by overall string length.

