Parsing Search Engine Keywords with Python

Latest recommended article published on 2022-11-28 11:10:50

minbing

Category: python  Tags: search engine, Python, Google, HP, Blog

Article link: https://blog.csdn.net/minbing/article/details/83529673

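The requirement: to find out which keywords bring visitors from search engines to the site, the percent-encoded URL has to be decoded. I googled it (in case I won't get the chance to use it any more) and found 可可熊's post: http://cocobear.info/blog/2008/08/11/tool-of-python-url-encode/, along with some other good links. Many thanks to them.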
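As a rough illustration of why the decoding step needs care: the same keyword is percent-encoded differently depending on the character encoding in use. The sketch below assumes Python 3 and uses `urllib.parse.quote` on the keyword 丽江, which is the keyword behind the two example URLs in this post. The variable names are only for illustration.

```python
from urllib.parse import quote, unquote_to_bytes

keyword = '丽江'  # sample keyword, chosen for illustration

# The same text yields a different percent-encoded form per encoding.
utf8_form = quote(keyword, encoding='utf-8')
gbk_form = quote(keyword, encoding='gbk')
print(utf8_form)  # %E4%B8%BD%E6%B1%9F
print(gbk_form)   # %C0%F6%BD%AD

# Decoding therefore goes through raw bytes first, then the right codec.
print(unquote_to_bytes(gbk_form).decode('gbk') == keyword)  # True
```

Because the URL itself does not say which encoding was used, a decoder has to try the likely candidates in turn.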

I borrowed the decoding approach and mainly targeted the major search engines used in China. The code is below, kept as a memento:

import re
from urllib.parse import unquote_to_bytes

# Search engine URL prefix -> query parameter that carries the keyword
searchEngines = {'http://www.google.com': 'q=',
                 'http://www.google.cn': 'q=',
                 'http://www.baidu.com': 'wd=',
                 'http://www.soso.com': 'w=',
                 'http://www.youdao.com': 'q='}

def getQueryString(url):
    """Return the percent-encoded keyword of a known search URL, else ''."""
    queryStr = ''
    for k, v in searchEngines.items():
        if url.startswith(k):
            # Match the parameter only right after '?' or '&', so that 'q='
            # is not mistaken for the tail of 'oq=' or 'aq='.
            m = re.search('[?&]' + re.escape(v) + '([^&#]*)', url)
            if m:
                queryStr = m.group(1)
    return queryStr

def url2read(s):
    """Percent-decode s into readable text, trying UTF-8 first, then GBK."""
    # '+' stands for a space in query strings
    raw = unquote_to_bytes(s.replace('+', ' '))
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume GBK
        return raw.decode('gbk')

if __name__ == "__main__":
#    print(url2read('%C0%F6%BD%AD'))
#    print(url2read('%E4%B8%BD%E6%B1%9F'))
    s1 = getQueryString('http://www.google.com/search?hl=en&source=hp&q=%E4%B8%BD%E6%B1%9F&aq=f&oq=&aqi=')
    s2 = getQueryString('http://www.baidu.com/s?wd=%C0%F6%BD%AD')
    print(url2read(s1))
    print(url2read(s2))
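
For comparison, here is a minimal sketch of the same idea built on the standard library's URL parser instead of string searching, assuming Python 3. The names `ENGINE_PARAMS` and `extract_keyword` are hypothetical and not part of the code above. The table is keyed by host name rather than URL prefix, so `https` URLs are covered as well.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical host -> keyword parameter table, mirroring the one above
ENGINE_PARAMS = {
    'www.google.com': 'q',
    'www.google.cn': 'q',
    'www.baidu.com': 'wd',
    'www.soso.com': 'w',
    'www.youdao.com': 'q',
}

def extract_keyword(url, encodings=('utf-8', 'gbk')):
    """Return the decoded search keyword, or None if it cannot be found."""
    parts = urlsplit(url)
    param = ENGINE_PARAMS.get(parts.hostname)
    if param is None:
        return None
    for enc in encodings:
        try:
            # errors='strict' makes a wrong encoding guess raise instead of
            # silently inserting replacement characters
            values = parse_qs(parts.query, encoding=enc, errors='strict')
        except UnicodeDecodeError:
            continue
        found = values.get(param)
        return found[0] if found else None
    return None

if __name__ == "__main__":
    print(extract_keyword('http://www.google.com/search?hl=en&source=hp&q=%E4%B8%BD%E6%B1%9F&aq=f&oq=&aqi='))
    print(extract_keyword('http://www.baidu.com/s?wd=%C0%F6%BD%AD'))
```

`parse_qs` matches whole parameter names, so `q` cannot be confused with `oq` or `aq`. The encoding order is still a guess: a GBK byte sequence that happens to be valid UTF-8 would be decoded as UTF-8.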