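The requirement: to find out which keywords bring search-engine visitors to the site, the referring URLs have to be decoded. I googled it (I may not get many more chances to use it) and found a post by Cocobear (可可熊): http://cocobear.info/blog/2008/08/11/tool-of-python-url-encode/ as well as several other good links. My thanks to all of them.
I borrowed the decoding approach from there and aimed it at the main search engines used in China. The code is below, kept here as a memento: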
# Python 2 script.
import urllib

# Search-engine URL prefix -> the query parameter that carries the keywords.
searchEngines = {'http://www.google.com': 'q=',
                 'http://www.google.cn': 'q=',
                 'http://www.baidu.com': 'wd=',
                 'http://www.soso.com': 'w=',
                 'http://www.youdao.com': 'q='}

def getQueryString(url):
    """Return the still percent-encoded keyword part of a search URL, or ''."""
    for k, v in searchEngines.items():
        if not url.startswith(k):
            continue
        # Match the parameter only right after '?' or '&', so that 'q='
        # is not mistaken for the tail of 'aq=' or 'oq='.
        for sep in ('?', '&'):
            startIndex = url.find(sep + v)
            if startIndex != -1:
                startIndex += len(sep + v)
                endIndex = url.find('&', startIndex)
                if endIndex == -1:
                    return url[startIndex:]
                return url[startIndex:endIndex]
    return ''

def url2read(s):
    """Percent-decode s and return it as unicode, trying UTF-8 and then GBK."""
    s = urllib.unquote(s)
    try:
        return s.decode('utf-8')
    except UnicodeDecodeError:
        return s.decode('gbk')


if __name__ == "__main__":
    # '%E4%B8%BD%E6%B1%9F' is UTF-8 and '%C0%F6%BD%AD' is GBK; both spell Lijiang.
    s1 = getQueryString('http://www.google.com/search?hl=en&source=hp&q=%E4%B8%BD%E6%B1%9F&aq=f&oq=&aqi=')
    s2 = getQueryString('http://www.baidu.com/s?wd=%C0%F6%BD%AD')
    print url2read(s1)
    print url2read(s2)
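
For anyone on Python 3, where `urllib.unquote` no longer exists, the same idea can be sketched roughly as below. This is only a sketch under my own assumptions, not the original script: the names `SEARCH_PARAMS` and `extract_keyword` are made up for illustration, the lookup is by host name instead of URL prefix, and the UTF-8-then-GBK fallback is a heuristic that can guess wrong for byte sequences that happen to be valid in both encodings.

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Illustrative mapping (not from the original script):
# search-engine host -> name of the query parameter holding the keywords.
SEARCH_PARAMS = {
    'www.google.com': 'q',
    'www.google.cn': 'q',
    'www.baidu.com': 'wd',
    'www.soso.com': 'w',
    'www.youdao.com': 'q',
}


def extract_keyword(url):
    """Return the decoded search keyword in url, or '' if none is found."""
    parts = urlsplit(url)
    param = SEARCH_PARAMS.get(parts.hostname)
    if param is None:
        return ''
    for pair in parts.query.split('&'):
        name, _, value = pair.partition('=')
        if name == param:
            # '+' stands for a space in query strings; unquote_to_bytes
            # does not translate it, so do that first.
            raw = unquote_to_bytes(value.replace('+', ' '))
            try:
                return raw.decode('utf-8')
            except UnicodeDecodeError:
                # Assumption: anything that is not valid UTF-8 is GBK.
                return raw.decode('gbk')
    return ''
```

Called on the two sample URLs from the script above, `extract_keyword` should return 丽江 in both cases, since the Google URL carries the UTF-8 bytes and the Baidu URL carries the GBK bytes for the same two characters. Splitting the query string on `&` and comparing whole parameter names also avoids confusing `q` with `aq` or `oq`.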