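When a search is run through a search engine or an app, the generated HTTP URL usually carries the Chinese keyword percent-encoded in UTF-8 or some other encoding. Only UTF-8 is handled so far; other encodings can be singled out by filtering on character ranges.
The parsing method is as follows: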
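# Python 2 script: reads "phone,host,url" lines from stdin, prints "phone,decoded_url"
import sys
import urllib
# Python 2 only: make UTF-8 the default codec so printing unicode does not fail
reload(sys)
sys.setdefaultencoding('utf8')
en = urllib.quote    # percent-encode (unused here, kept for debugging)
de = urllib.unquote  # percent-decode
for line in sys.stdin:
    line = line.strip()
    # Split into at most 3 fields so commas inside the URL stay intact
    phone, host, url = line.split(',', 2)
    # Undo the percent-encoding, then decode the bytes as UTF-8, dropping invalid bytes
    changeurl = de(url).decode("utf8", "ignore")
    lasturl = u'%s,%s' % (phone, changeurl)
    print lasturl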
# Debug section below
# Note: in both sample URLs the keyword after "w=0_10_" is percent-encoded UTF-8,
# so decoding them as GBK would not recover the keyword.
#a="http://3g.baidu.com/ssid=0/from=0/bd_page_type=1/uid=wiaui_1332346181_5317/pu=sz%401330_227%2Cusm%401/w=0_10_%E7%A6%8F%E5%B7%9E%E7%83%9F%E8%8D%89/t=wap/tc?pn=15&m=0&src=www%2Efztobacco%2Ecom%2Forder%2F/"
# GBK: 2 bytes per Chinese character
#b="http://m.baidu.com/from=1089a/bd_page_type=1/ssid=0/uid=wiaui_1331226364_4966/pu=usm%406%2Csz%401330_320%2Cgt%40500126_coolpad_f800_0_2/w=0_10_%E6%89%8B%E6%9C%BAqq/t=wap/l=1/tc?ref=www_touch&lid=3342572836&tj=w/"
# UTF-8: 3 bytes per Chinese character
#de=urllib.unquote
#en=urllib.quote
#print de(a).decode("gbk")
#print de(b).decode("utf8")
#print de(b).decode("gbk")
#print de(a).decode("utf8")
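The debug lines above try each sample URL with both codecs by hand. As a rough sketch of how that trial could be automated, the Python 3 function below attempts a strict UTF-8 decode first and falls back to GBK. Note that this swaps in a try-and-fall-back heuristic rather than the character-range filtering mentioned at the top, and the name decode_keyword_url and the codec order are assumptions made for illustration, not part of the original script.

```python
from urllib.parse import unquote_to_bytes


def decode_keyword_url(url, encodings=("utf-8", "gbk")):
    """Percent-decode url and return (text, codec_name).

    Hypothetical helper: tries each codec strictly, in order, and
    returns the first result that decodes without error.
    """
    raw = unquote_to_bytes(url)  # undo %XX escapes, keep raw bytes
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # No codec matched: fall back to UTF-8 and drop undecodable bytes,
    # like decode("utf8", "ignore") in the script above.
    return raw.decode("utf-8", "ignore"), None
```

The heuristic is imperfect: some GBK byte sequences also happen to be valid UTF-8, so a short GBK keyword can occasionally be misread. Trying UTF-8 first is still a common choice because random non-UTF-8 bytes rarely pass a strict UTF-8 decode.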
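Problem encountered:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 115-121: ordinal not in range(128)
Fix: Python 2 falls back to the ASCII codec whenever it has to encode a unicode string implicitly, so switch the default encoding to UTF-8: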
import sys
reload(sys)
sys.setdefaultencoding('utf8')
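This workaround exists only in Python 2: Python 3 has no sys.setdefaultencoding, and reload is no longer a builtin. For readers on Python 3, here is a minimal sketch of the same conversion, assuming the same "phone,host,url" input format. The function names convert_line and convert_stream are hypothetical, and unlike the original script this version skips malformed lines instead of raising.

```python
from urllib.parse import unquote


def convert_line(line):
    """Turn one 'phone,host,url' record into 'phone,decoded_url'.

    Returns None when the line does not have three comma-separated fields.
    """
    parts = line.strip().split(",", 2)  # at most 3 fields; commas in the URL survive
    if len(parts) != 3:
        return None
    phone, _host, url = parts
    # unquote() returns str directly; errors="ignore" mirrors the
    # decode("utf8", "ignore") call used in the Python 2 script.
    return "%s,%s" % (phone, unquote(url, encoding="utf-8", errors="ignore"))


def convert_stream(lines):
    """Yield converted records from an iterable of lines, skipping malformed ones."""
    for line in lines:
        converted = convert_line(line)
        if converted is not None:
            yield converted
```

To run it as a filter, iterate convert_stream(sys.stdin) and print each record. If the terminal or pipe is not UTF-8, calling sys.stdout.reconfigure(encoding="utf-8") first (Python 3.7 or later) plays the role that setdefaultencoding played in Python 2.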
References:
Converting Chinese text to URL encoding in Python
http://hi.baidu.com/yobin/blog/item/274e5a82cbeda3aa0cf4d2b9.html
File character encodings
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html