Parsing search keywords from URLs with Python

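Original post, 2017-01-03 17:51:29
I recently started using Python, mainly to analyze website access logs, which involves extracting the search keywords from the URLs recorded in those logs. The task has to deal with the following problems:
        1. Search engines use different query parameters for the keyword. Searching Baidu for '大数据' ("big data") gives
           https://www.baidu.com/s?f=8&rsv_bp=1&rsv_idx=1&word=%E5%A4%A7%E6%95%B0%E6%8D%AE&tn=91483420_s_hao_pg
           where the keyword parameter is word.
           Searching Google for '大数据' gives
            https://www.google.com.hk/?gws_rd=ssl#safe=strict&q=%E5%A4%A7%E6%95%B0%E6%8D%AE
            where the keyword parameter is q.
        2. The same search engine may use different keyword parameters depending on how the search was made. Baidu, for example, may also use wd:
            https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=91483420_s_hao_pg&wd=%E5%A4%A7%E6%95%B0%E6%8D%AE&oq=%E5%A4%A7%E6%95%B0%E6%8D%AE&rsv_pq=eaf35f8e00003566&rsv_t=17aa2xBQlaQi4DN3AX5SdzIaXVFjARA4pfZnag9PfymOMaWmUFdUUgIuqyJNp3ItorvpNPFN3%2FU&rqlang=cn&rsv_enter=1&rsv_sug3=2&rsv_sug1=2&rsv_sug7=101&rsv_sug2=0&inputT=1432&rsv_sug4=1432
        3. Search engines percent-encode the keyword in different character encodings. The Chinese search site Soso, for example, produces
           http://www.soso.cn/search.asp?want=&search=%B4%F3%CA%FD%BE%DD&engine=
           which is encoded in GBK.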
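To make the encoding difference in point 3 concrete, here is a minimal sketch in Python 3 (an assumption; the code in this post targets Python 2). It decodes the two percent-encoded strings from the URLs above with `urllib.parse.unquote`, once with the right encoding and once with the wrong one. The variable names are only for illustration.

```python
from urllib.parse import unquote

UTF8_ENCODED = "%E5%A4%A7%E6%95%B0%E6%8D%AE"  # keyword from the Baidu/Google URLs
GBK_ENCODED = "%B4%F3%CA%FD%BE%DD"            # keyword from the Soso URL

# Decoding each string with the encoding its engine used gives the same text.
utf8_keyword = unquote(UTF8_ENCODED, encoding="utf-8")
gbk_keyword = unquote(GBK_ENCODED, encoding="gbk")

# Decoding the GBK bytes as UTF-8 does not raise here: unquote defaults to
# errors="replace", so undecodable bytes become U+FFFD replacement characters.
wrong = unquote(GBK_ENCODED, encoding="utf-8")

print(utf8_keyword == gbk_keyword)
print("\ufffd" in wrong)
```

Because the wrong encoding fails silently in this form, the encoding has to be either known per engine or detected explicitly.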
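For the first two problems, collect the common search engines and their keyword parameters into a dictionary. The last element of each tuple is the engine's encoding, not a parameter name:
engineList = {
    "Baidu":('wd','word','q','kw','utf8'),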
    "Google":('q','query','keywords','utf8'),
    "Sogou":('query','keyword','utf8'),
    "Chinaso":('q','utf8'),
    "Yahoo":('p','q','utf8'),
    "Soso":('search','q','gb2312'),
    "Youdao":('q','utf8'),
    "Bing":('q','utf8'),
    "Easou":('q','utf8'),
    "360search":('q','utf8'),
    "sm.cn":('q','utf8')
}
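As a rough illustration of how such a table can be used, the sketch below (Python 3, which is an assumption; the post itself uses Python 2) looks up the still percent-encoded keyword for a given engine. `ENGINE_PARAMS` and `find_raw_keyword` are hypothetical names, the table is a trimmed copy without the trailing encoding entry, and `wd` is included for Baidu because the text above names it as a Baidu keyword parameter.

```python
from urllib.parse import urlsplit

# Hypothetical, trimmed copy of the table above: parameter names only,
# with 'wd' included for Baidu and the trailing encoding entry left out.
ENGINE_PARAMS = {
    "Baidu": ("wd", "word", "q", "kw"),
    "Google": ("q", "query", "keywords"),
    "Soso": ("search", "q"),
}

def find_raw_keyword(engine, url):
    """Return (parameter name, percent-encoded value), or None if absent."""
    parts = urlsplit(url)
    names = ENGINE_PARAMS[engine]
    # Google puts its parameters after '#', so scan the fragment as well.
    for chunk in (parts.query, parts.fragment):
        for item in chunk.split("&"):
            name, sep, value = item.partition("=")
            if sep and value and name in names:
                return name, value
    return None

print(find_raw_keyword(
    "Soso", "http://www.soso.cn/search.asp?want=&search=%B4%F3%CA%FD%BE%DD&engine="))
```

The value is deliberately left percent-encoded so that choosing the character encoding stays a separate step.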

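For the third problem, use Python's exception handling to work out which encoding the keyword in the URL uses: try UTF-8 first and fall back to GBK if decoding fails.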

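def decode_keyword(keyword):
    # Python 2: urllib.unquote turns the %XX escapes back into a byte string.
    keyword = urllib.unquote(keyword)
    try:
        # Most engines send UTF-8.
        return keyword.decode('utf-8')
    except UnicodeDecodeError:
        try:
            # Fall back to GBK (used by Soso, for example).
            return keyword.decode('gbk')
        except UnicodeDecodeError:
            # Neither encoding fits: return the raw bytes unchanged.
            return keyword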
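For readers on Python 3, where `urllib.unquote` and `str.decode` no longer exist, the same try-then-fall-back idea might look like the sketch below. It is a port under assumptions, not the code this post was written with: `decode_keyword_py3` and its `encodings` parameter are hypothetical names, and the final lossy fallback is a choice made for the example.

```python
from urllib.parse import unquote_to_bytes

def decode_keyword_py3(raw, encodings=("utf-8", "gbk")):
    """Percent-decode raw, then try each candidate encoding in order."""
    data = unquote_to_bytes(raw)  # %XX escapes -> bytes, no text decoding yet
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # No candidate fits: keep what can be read and mark the rest with U+FFFD.
    return data.decode("utf-8", errors="replace")

print(decode_keyword_py3("%E5%A4%A7%E6%95%B0%E6%8D%AE"))
print(decode_keyword_py3("%B4%F3%CA%FD%BE%DD"))
```

UTF-8 is tried first because it is the stricter test: GBK-encoded Chinese text is rarely valid UTF-8, whereas UTF-8 bytes can sometimes decode under GBK into the wrong characters without raising.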

The complete code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
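# Python 2 only: reload(sys) and sys.setdefaultencoding below set the default encoding used when parsing the log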

import sys
import urllib
from urlparse import urlparse
reload(sys)
sys.setdefaultencoding('utf8')

engineList = {
    "Baidu":('wd','word','q','kw','utf8'),
    "Google":('q','query','keywords','utf8'),
    "Sogou":('query','keyword','utf8'),
    "Chinaso":('q','utf8'),
    "Yahoo":('p','q','utf8'),
    "Soso":('search','q','gb2312'),
    "Youdao":('q','utf8'),
    "Bing":('q','utf8'),
    "Easou":('q','utf8'),
    "360search":('q','utf8'),
    "sm.cn":('q','utf8')
}

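result = {}
def main():
    # Read keyword.txt one URL per line; the loop ends at end of file
    # (the earlier while/readline loop never terminated).
    with open("keyword.txt") as f:
        for line in f:
            line = line.strip()
            if line:
                parseLine(line)

def parseLine(line):
    parseUrl(line,result)
    parseKey(line,result)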

# Hostname substring -> engine name; checked in order, first match wins.
hostList = (
    ("baidu.com", "Baidu"),
    ("google.com", "Google"),
    ("sogou.com", "Sogou"),
    ("chinaso.com", "Chinaso"),
    ("yahoo.com", "Yahoo"),
    ("soso.cn", "Soso"),
    ("youdao.com", "Youdao"),
    ("bing.com", "Bing"),
    ("easou.com", "Easou"),
    ("so.com", "360search"),
    ("sm.cn", "sm.cn"),
)

def parseUrl(line,dict):
    url = urlparse(line)
    searchName = str(url.hostname)
    for host, name in hostList:
        if host in searchName:
            dict['searchname'] = name
            dict['issearch'] = 1
            return
    # Not a known search engine: record the hostname as-is.
    dict['searchname'] = searchName
    dict['issearch'] = -1

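def parseKey(line,dict):
    # Read the parameters from the query string and from the fragment
    # (Google puts q after '#') instead of relying on a fixed position
    # in the split line.
    url = urlparse(line.strip())
    for part in (url.query, url.fragment):
        for l in part.split('&'):
            parseParam(l,dict)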

def parseParam(l, dict):
    # Split "name=value" once; pieces without '=' are skipped.
    linelist = l.split('=', 1)
    if dict['issearch'] == 1 and len(linelist) == 2:
        keys = engineList[dict['searchname']]
        # The last tuple element is the encoding, not a parameter name.
        keywords = keys[:-1]
        if linelist[0] in keywords:
            dict['keyword']=decode_keyword(linelist[1])
            print(dict['keyword'])

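def decode_keyword(keyword):
    # Python 2: urllib.unquote turns the %XX escapes back into a byte string.
    keyword = urllib.unquote(keyword)
    try:
        # Most engines send UTF-8.
        return keyword.decode('utf-8')
    except UnicodeDecodeError:
        try:
            # Fall back to GBK (used by Soso, for example).
            return keyword.decode('gbk')
        except UnicodeDecodeError:
            # Neither encoding fits: return the raw bytes unchanged.
            return keyword
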
if __name__ == "__main__":
    main()

Contents of keyword.txt (sample input, one URL per line):
https://www.baidu.com/s?wd=%E6%AD%A6%E5%BF%A0%E5%81%A5&rsv_spt=1&rsv_iqid=0xbe3298bf0001da8c&issp=1&f=8&rsv_bp=0&rsv_idx=2&ie=utf-8&tn=baiduhome_pg&rsv_enter=1&rsv_sug3=1
https://www.google.com.hk/?gws_rd=ssl#safe=strict&q=%E5%A4%A7%E6%95%B0%E6%8D%AE
https://www.sogou.com/web?query=%E5%A4%A7%E6%95%B0%E6%8D%AE&_asf=www.sogou.com&_ast=&w=01019900&p=40040100&ie=utf8&from=index-nologin&sut=1487&sst0=1482993265601&lkt=0%2C0%2C0
http://www.chinaso.com/search/pagesearch.htm?q=%E5%A4%A7%E6%95%B0%E6%8D%AE
https://search.yahoo.com/search;_ylc=X3oDMTFiN25laTRvBF9TAzIwMjM1MzgwNzUEaXRjAzEEc2VjA3NyY2hfcWEEc2xrA3NyY2h3ZWI-?p=%E5%A4%A7%E6%95%B0%E6%8D%AE&fr=yfp-t&fp=1&toggle=1&cop=mss&ei=UTF-8
http://www.soso.cn/search.asp?want=&search=%B4%F3%CA%FD%BE%DD&engine=   
http://www.youdao.com/search?keyfrom=navindex.normal.searchbox&T1=1482992913989&q=%E5%A4%A7%E6%95%B0%E6%8D%AE
http://cn.bing.com/search?q=%E5%A4%A7%E6%95%B0%E6%8D%AE&go=%E6%8F%90%E4%BA%A4&qs=n&form=QBLH&sp=-1&pq=%E5%A4%A7%E6%95%B0%E6%8D%AE&sc=8-3&sk=&cvid=02BF61C729E04CB39B16D91602092F44
http://i.easou.com/s.m?idx=1&sty=1&q=%E5%A4%A7%E6%95%B0%E6%8D%AE&prefix=100&cid=paw&fr=9.1005.2.2&esid=HeCvH5j3kDA&wver=dsp
https://www.so.com/s?ie=utf-8&shb=1&src=360sou_newhome&q=%E5%A4%A7%E6%95%B0%E6%8D%AE
http://m.sm.cn/s?q=%E5%A4%A7%E6%95%B0%E6%8D%AE&uc_param_str=dnntnwvepffrgibijbprsvdsme&from=smor&safe=1&snum=0
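
One caveat about parseUrl: it recognizes an engine by testing whether a string such as "so.com" occurs anywhere in the hostname, so an unrelated host like also.com would be counted as 360search. A possible refinement, sketched here in Python 3 (an assumption, as are the names `ENGINE_DOMAINS` and `match_engine` and the shortened table), is to compare whole dot-separated labels instead:

```python
from urllib.parse import urlsplit

# Hypothetical, shortened table for the example.
ENGINE_DOMAINS = (
    ("baidu.com", "Baidu"),
    ("google.com", "Google"),
    ("soso.cn", "Soso"),
    ("so.com", "360search"),
    ("sm.cn", "sm.cn"),
)

def match_engine(url):
    """Return the engine whose domain labels occur, in order, in the hostname."""
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    for domain, name in ENGINE_DOMAINS:
        want = domain.split(".")
        n = len(want)
        # Slide a window of n labels over the hostname, so that
        # www.google.com.hk still matches google.com but also.com
        # does not match so.com.
        if any(labels[i:i + n] == want for i in range(len(labels) - n + 1)):
            return name
    return None

print(match_engine("https://www.google.com.hk/?gws_rd=ssl#safe=strict&q=%E5%A4%A7%E6%95%B0%E6%8D%AE"))
print(match_engine("http://also.com/?q=test"))
```

This still accepts hostnames that merely contain the labels somewhere (for example as a subdomain of another site), which is probably acceptable for log statistics but is not a security check.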
