The Dangerous encodeURIComponent

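JavaScript's encodeURIComponent() method is widely used. MDN's description of the method notes that it can throw an exception (a URIError, when the string contains a lone surrogate): https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent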

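In the case below, this turned out to be a serious bug in Node.js: http://cnodejs.org/topic/4fd6b7ba839e1e581407aac8
At the time, every app built on connect had to be patched in a hurry. Nobody had realized the problem existed until running into the dreaded 0xDFFF. It really is well hidden!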

Someone on Stack Overflow gave a detailed answer. Here is a rough translation:

Having run into the same problem, @Brett Zamir wrote a loop that scans the entire UCS-2 range to find the cause:

for (var regex = '/[', firstI = null, lastI = null, i = 0; i <= 65535; i++) {
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        if (firstI !== null) {
            if (i === lastI + 1) {
                lastI++;
            }
            else if (firstI === lastI) {
                regex += '\\u' + firstI.toString(16);
                firstI = lastI = i; 
            }
            else {
                regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
                firstI = lastI = i; 
            }
        }
        else {
            firstI = i;
            lastI = i;
        }        
    }
}

if (firstI === lastI) {
    regex += '\\u' + firstI.toString(16);
}
else {
    regex += '\\u' + firstI.toString(16) + '-' + '\\u' + lastI.toString(16);
}
regex += ']/';
alert(regex);  // /[\ud800-\udfff]/

The result comes back quickly: the characters in the range \ud800-\udfff are the problem. Another script verifies this:

for (var i = 0; i <= 65535; i++) {
    // Skip the surrogate range; everything else in the BMP should encode cleanly.
    if (i >= 0xD800 && i <= 0xDFFF) {
        continue;
    }
    try {
        encodeURIComponent(String.fromCharCode(i));
    }
    catch(e) {
        alert(e); // Doesn't alert
    }
}
alert('ok!');

The output above agrees with what MSDN says: apart from surrogates, all of these characters, even "non-characters", are valid Unicode sequences.

Surrogates are the dangerous range found by the scan above. They are split into a high range and a low range.

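With this range in hand, you could simply filter these characters out. But when they appear in pairs (a high surrogate followed by a low surrogate), they are valid: together they encode a character outside the BMP in UTF-16.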

alert(encodeURIComponent('\uD800\uDC00')); // ok
alert(encodeURIComponent('\uD800')); // not ok
alert(encodeURIComponent('\uDC00')); // not ok either

So if you just want to strip every character in this range:

urlPart = urlPart.replace(/[\ud800-\udfff]/g, '');

If you want to strip the invalid characters but keep valid high/low pairs (characters beyond the BMP, which are rarely used), you can do this:

function stripUnmatchedSurrogates (str) {
    return str.replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, '').split('').reverse().join('').replace(/[\uDC00-\uDFFF](?![\uD800-\uDBFF])/g, '').split('').reverse().join('');
}

var urlPart = '\uD801 \uD801\uDC00 \uDC01';
alert(stripUnmatchedSurrogates(urlPart)); // Leaves one valid sequence (a single surrogate pair) between two spaces
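
The same filtering can also be done in a single pass, without reversing the string twice. The following is only a sketch and not part of the original answer; the name stripLoneSurrogates is made up for illustration. The regex tries to match a well-formed pair first and falls back to any single surrogate, and the callback keeps pairs and drops everything else.

```javascript
// Hypothetical helper (not from the original answer): removes unpaired
// surrogates in one pass while keeping well-formed high+low pairs.
function stripLoneSurrogates(str) {
    // No `u` flag, so the regex operates on UTF-16 code units.
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g, function (match) {
        // A two-unit match is a valid pair; a one-unit match is a lone surrogate.
        return match.length === 2 ? match : '';
    });
}

var cleaned = stripLoneSurrogates('\uD801 \uD801\uDC00 \uDC01');
console.log(cleaned === ' \uD801\uDC00 '); // true
console.log(encodeURIComponent(cleaned));  // %20%F0%90%90%80%20
```

Because the pair alternative is listed first, a high surrogate is only treated as lone when the next code unit is not a low surrogate.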

If JavaScript handled this natively, we could skip this ugly patch. Wouldn't that be nice!
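
On the wish for native handling: newer engines do ship something close. ES2024 added String.prototype.toWellFormed(), which replaces lone surrogates with U+FFFD instead of removing them. Below is a minimal sketch, assuming replacement rather than removal is acceptable; the names toWellFormedString and safeEncodeURIComponent are hypothetical, and the fallback regex is only there for engines without the method.

```javascript
// Hypothetical helper: returns a copy of str with lone surrogates
// replaced by U+FFFD, using the native method when it is available.
function toWellFormedString(str) {
    if (typeof String.prototype.toWellFormed === 'function') {
        return str.toWellFormed();
    }
    // Fallback for older engines: do the same substitution by hand.
    return str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g, function (match) {
        return match.length === 2 ? match : '\uFFFD';
    });
}

// Hypothetical wrapper: does not throw on lone surrogates.
function safeEncodeURIComponent(str) {
    return encodeURIComponent(toWellFormedString(str));
}

console.log(safeEncodeURIComponent('a\uDFFFb'));     // a%EF%BF%BDb
console.log(safeEncodeURIComponent('\uD800\uDC00')); // %F0%90%80%80
```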

====== End of translation ======

Addendum:

The high and low ranges mentioned above are in fact reserved ranges in UCS-2. The Unicode character set is divided into many ranges; see http://baike.baidu.com/view/40801.htm

D800-DBFF: High-half zone of UTF-16 (high surrogates)
DC00-DFFF: Low-half zone of UTF-16 (low surrogates)

Reference material on UTF-16: http://zh.wikipedia.org/wiki/UTF-16 , which says:

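The Unicode code space runs from U+0000 to U+10FFFF, with 1,112,064 code points available for mapping characters. The code space can be divided into 17 planes, each containing 2^16 (65,536) code points. The code points of the 17 planes can be written as U+xx0000 through U+xxFFFF, where xx is a hexadecimal value from 00 to 10, for 17 planes in total. The first plane is called the Basic Multilingual Plane (BMP), or Plane 0. The other planes are called Supplementary Planes. Within the BMP, the code points from U+D800 to U+DFFF are permanently reserved and are not mapped to Unicode characters. UTF-16 uses this reserved 0xD800-0xDFFF range to encode the code points of characters in the supplementary planes.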
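
To make the encoding step concrete, here is a small sketch of the standard UTF-16 arithmetic for a supplementary-plane code point: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves that are added to 0xD800 and 0xDC00. The function name toSurrogatePair is invented for illustration.

```javascript
// Illustrative helper: split a supplementary code point
// (U+10000..U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
    if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
        throw new RangeError('not a supplementary-plane code point');
    }
    var offset = codePoint - 0x10000;    // 20-bit value
    var high = 0xD800 + (offset >> 10);  // top 10 bits
    var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
    return [high, low];
}

var pair = toSurrogatePair(0x10400);
console.log(pair[0].toString(16) + ' ' + pair[1].toString(16));        // d801 dc00
console.log(String.fromCharCode(pair[0], pair[1]) === '\uD801\uDC00'); // true
```

U+10400 is the character behind the '\uD801\uDC00' pair used in the test string earlier, which is why that pair encodes without an exception.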

So the problem is now clear.

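Addendum: thanks to @猎隼 for the following addition.
decodeURIComponent() is indeed dangerous too. JavaScript's encodeURIComponent and decodeURIComponent both work with UTF-8. As soon as decodeURIComponent is given a string that was percent-encoded from GBK, it throws an exception, and without a try/catch that kills the Node process.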
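
A defensive wrapper along the following lines avoids that crash. This is a sketch only: safeDecodeURIComponent and its fallback parameter are hypothetical names, and returning a fallback value is just one possible policy (a server might prefer to reject the request instead).

```javascript
// Hypothetical wrapper: returns `fallback` instead of throwing when the
// input is not valid percent-encoded UTF-8.
function safeDecodeURIComponent(encoded, fallback) {
    try {
        return decodeURIComponent(encoded);
    } catch (e) {
        if (e instanceof URIError) {
            return fallback;
        }
        throw e; // unrelated errors should still surface
    }
}

console.log(safeDecodeURIComponent('%E4%B8%AD', null)); // 中 (UTF-8 bytes of the character)
console.log(safeDecodeURIComponent('%D6%D0', null));    // null (GBK bytes of the same character)
```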

References:
http://stackoverflow.com/questions/16868415/encodeuricomponent-throws-an-exception