A small Python script for URL decoding and Chinese character parsing

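I previously wrote a post about decoding Chinese characters in URLs. After reading TL's reply to that post, I realized it had several problems. Other readers may run into the same issues, so here is a follow-up.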

Converting non-default encodings

 
    import urllib

    a = "http://zh.wikipedia.org/wiki/%BD%F0%B6"
    b = "http://zh.wikipedia.org/wiki/%E9%97%A8"
    de = urllib.unquote
    print de(a), de(b)

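This is the code from the earlier post. I had not considered the difference between GBK and UTF-8: I assumed that for Chinese characters without a Unicode marker such as %5Cu, calling unquote was all it took. For an encoding that differs from the "default encoding environment", however, one more step is needed, so the code above cannot decode a correctly.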

TL suggested a fix that can handle a malformed encoding like a (why it is malformed is explained below):

 
    de(a).decode("gbk", "ignore")
    de(b).decode("utf8", "ignore")

Printing these results now shows the Chinese characters.
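For readers on Python 3, where urllib.unquote moved to urllib.parse.unquote, the same idea can be sketched as below. This is only an illustration and not TL's code: it assumes Python 3 and passes the codec through unquote's encoding and errors parameters instead of calling decode on a byte string. The names decoded_a and decoded_b are made up for the example.

```python
from urllib.parse import unquote

a = "http://zh.wikipedia.org/wiki/%BD%F0%B6"  # GBK bytes, the last one dangling
b = "http://zh.wikipedia.org/wiki/%E9%97%A8"  # UTF-8 bytes

# errors="ignore" silently drops any bytes that the codec cannot decode.
decoded_a = unquote(a, encoding="gbk", errors="ignore")
decoded_b = unquote(b, encoding="utf-8", errors="ignore")

print(decoded_a)
print(decoded_b)
```

Note that "ignore" hides the problem rather than fixing it: the dangling byte in a is simply discarded.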

The malformed encoding

That raises another question: why is the "ignore" argument needed? I found that calling decode without it, as below, raises an error.

 
    de(a).decode("gbk")
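The failure can be reproduced and inspected with a small sketch. It assumes Python 3 (urllib.parse.unquote_to_bytes in place of the Python 2 urllib.unquote used in this post); the variable names raw and failure are only for illustration.

```python
from urllib.parse import unquote_to_bytes

raw = unquote_to_bytes("http://zh.wikipedia.org/wiki/%BD%F0%B6")

try:
    raw.decode("gbk")  # strict error handling is the default
    failure = None
except UnicodeDecodeError as exc:
    failure = exc
    # exc.start and exc.end delimit the bytes the codec rejected.
    print(exc.reason, raw[exc.start:exc.end])
```

Looking at the rejected slice is a quick way to see which byte is the troublemaker before reaching for "ignore".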

After checking where a came from in gfwlist, I found I had made a rather basic mistake.

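The fact is: the entry behind a should have been zh.wikipedia.org*%BD%F0%B6%DC%B9%A4%B3%CC. I had wrongly assumed that every Chinese character is encoded as three "percent sign + 2 hexadecimal digits" groups (3 bytes), so I kept only the first 3 bytes, namely "%BD%F0%B6".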

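The problem is that GBK and UTF-8 need different numbers of bytes: GBK encodes a Chinese character in 2 bytes, while UTF-8 uses 3 bytes for most common Chinese characters. Since a is GBK-encoded, decoding one character does not take 3 bytes, and the extra dangling byte was the source of the decode exception. With that byte removed, decoding succeeds: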

 
    import urllib

    a = "http://zh.wikipedia.org/wiki/%BD%F0"     # gbk, 2 bytes per Chinese character
    b = "http://zh.wikipedia.org/wiki/%E9%97%A8"  # utf8, 3 bytes per Chinese character
    de = urllib.unquote
    print de(a).decode("gbk")
    print de(b).decode("utf8")
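The byte counts are easy to check directly. The sketch below assumes Python 3 and uses the full escaped string from gfwlist quoted earlier; the helper name encoded_width is hypothetical.

```python
from urllib.parse import unquote_to_bytes


def encoded_width(char, encoding):
    """Return how many bytes one character occupies in the given encoding."""
    return len(char.encode(encoding))


gbk_width = encoded_width("金", "gbk")     # the character encoded as %BD%F0
utf8_width = encoded_width("门", "utf-8")  # the character encoded as %E9%97%A8

# The complete gfwlist fragment: 8 GBK bytes, i.e. 4 Chinese characters.
full_bytes = unquote_to_bytes("%BD%F0%B6%DC%B9%A4%B3%CC")
full_text = full_bytes.decode("gbk")  # strict decoding works on whole pairs

print(gbk_width, utf8_width, len(full_bytes), len(full_text))
```

Cutting the escaped string at a multiple of 2 bytes would have kept the GBK decode clean; cutting it at 3 bytes split the second character in half.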

Defining a priority order for decoding

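Finally, I copied the function from TL's script that tries several Chinese encodings in priority order, and lowered the minimum number of bytes for a Chinese sequence from 3 to 2. With that, every Chinese string in gfwlist that previously failed to decode now displays correctly.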

 
    import urllib
    import re

    # The pattern of a Chinese character is:
    # % + 2 hexadecimal digits, repeated 2 (gbk) or 3 (utf8) times.
    # I choose 2 as the lower limit of the number of repetitions.
    escaped_run = re.compile(r"((%[0-9A-Fa-f]{2}){2,})")

    def _strdecode(string):
        # Try each encoding in priority order, falling back on failure.
        try:
            return string.decode('utf8')
        except UnicodeDecodeError:
            try:
                return string.decode('gb2312')
            except UnicodeDecodeError:
                try:
                    return string.decode('gbk')
                except UnicodeDecodeError:
                    return string.decode('gb18030')

    with open("gfwcn", "r") as f:
        for escaped_str in f:
            # findall returns a (possibly empty) list, so no None check is needed
            match = escaped_run.findall(escaped_str)
            for trans in match:
                # decode these Chinese characters in priority order
                print _strdecode(urllib.unquote(trans[0])),
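The same priority idea can be written more compactly as a loop over candidate codecs. The following is a rough Python 3 sketch, not TL's original code: the names CANDIDATE_ENCODINGS, ESCAPED_RUN, decode_with_priority and decode_escaped_runs are invented here, and the final "replace" fallback is my own assumption about what to do when every codec fails.

```python
import re
from urllib.parse import unquote_to_bytes

# Tried in this order; the first codec that decodes without error wins.
CANDIDATE_ENCODINGS = ("utf-8", "gb2312", "gbk", "gb18030")

# A run of at least two "%XX" groups, since GBK needs only 2 bytes per character.
ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2}){2,}")


def decode_with_priority(raw):
    """Decode bytes with the first candidate encoding that accepts them."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Assumed fallback: keep going and mark undecodable bytes instead of raising.
    return raw.decode(CANDIDATE_ENCODINGS[-1], "replace")


def decode_escaped_runs(line):
    """Find every percent-escaped run in a line and decode each one."""
    return [decode_with_priority(unquote_to_bytes(run))
            for run in ESCAPED_RUN.findall(line)]


print(decode_escaped_runs("zh.wikipedia.org*%BD%F0%B6%DC%B9%A4%B3%CC"))
print(decode_escaped_runs("http://zh.wikipedia.org/wiki/%E9%97%A8"))
```

Keep in mind that this is guessing: a byte sequence can be valid in more than one encoding, so the order of the candidates decides the result in ambiguous cases.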