Python Web Scraping: Searching a Page for Keywords to Check My Partner's Admission Result

ShareWow丶

Category: Python Language and Applications. Tags: python, web crawler, web page content search, chardet

Article link: https://blog.csdn.net/sinat_31206523/article/details/81154339

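What this is for

I wrote this script three months ago and am only now writing it up in Markdown. At the time my partner (a real-life one) had just finished the graduate admission interview at a university and was anxiously waiting for the result. The result was due to appear at some point within a time window rather than at a fixed moment, so I wrote this script to keep searching the page for keywords in the background and pop up a window as soon as it found them.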

Without further ado, let's begin.

Environment: Win10 & PyCharm
Python version: Python 3.6.3 :: Anaconda, Inc.

First, fetch the content of the whole page

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Import the packages the script needs
from urllib.request import urlopen
from urllib.error import URLError
import chardet
import re
import threading
import time
from tkinter import messagebox

def getInfo():
    # The page to watch
    url = "https://yjszs.ecnu.edu.cn/system/sslqyx_list.asp"
    try:
        response = urlopen(url)
        # print(response)
    except URLError as err:
        # Fail with a message that describes the actual problem
        raise ConnectionError("Could not open {}".format(url)) from err
    # Read the raw bytes of the page
    html = response.read()
    response.close()
    # Guess the encoding of the bytes
    charset = chardet.detect(html)
    # print(charset)
    # Decode with the detected encoding; fall back to GB2312 if detection fails
    content = html.decode(charset["encoding"] or "GB2312", errors="replace")
    # print(content)
    return content

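getInfo() simply fetches the whole page in the usual way. If any step is unclear, uncomment the print calls to inspect the intermediate values. The rest of this section looks more closely at chardet.

import chardet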

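String encodings are a constant headache, especially when dealing with third-party pages that do not follow the rules. Python has two types for this, str (Unicode text) and bytes, and converts between them with encode() and decode(), but decoding bytes is hard when you do not know which encoding produced them.

To turn bytes of unknown encoding into a str, the encoding has to be "guessed" first. The guess works by collecting the byte patterns characteristic of each encoding and checking the data against them, which gives a high probability of guessing right.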
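To make the idea of guessing concrete, here is a minimal sketch that simply tries a few candidate encodings in order and keeps the first one that decodes without error. This is a much cruder technique than the statistical detection described above: the helper name guess_decode and the candidate list are assumptions made for illustration, and the order matters because bytes from one encoding can decode "successfully" as another and come out as garbage.

```python
def guess_decode(raw, candidates=("utf-8", "gb18030", "euc-jp")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # No candidate fit: decode leniently so the caller still gets a str
    return raw.decode("utf-8", errors="replace"), None


gbk_bytes = "离离原上草，一岁一枯荣".encode("gbk")
text, encoding = guess_decode(gbk_bytes)
print(encoding, text)
```

UTF-8 is tried first because it rejects most non-UTF-8 input outright, which makes it a reasonably safe first guess.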

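Writing such a detector from scratch would be a waste of time and effort, of course. This is where the third-party library chardet comes in: it detects encodings and is simple to use.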

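Installing chardet

If you have Anaconda installed, chardet is already available. Otherwise install it with pip from the command line:

$ pip install chardet

If the installation fails with Permission denied, retry with sudo. Installing it through PyCharm is also convenient.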

Using chardet

Once you have a bytes object you can detect its encoding. With chardet that takes a single line of code:

>>> chardet.detect(b'Hello, world!')
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}

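The detected encoding is ascii. Note the confidence field as well: it says the probability that the detection is correct is 1.0 (that is, 100%).

Now try Chinese text encoded as GBK: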

>>> data = '离离原上草，一岁一枯荣'.encode('gbk')
>>> chardet.detect(data)
{'encoding': 'GB2312', 'confidence': 0.7407407407407407, 'language': 'Chinese'}

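The detected encoding is GB2312. GBK is a superset of GB2312, so the two agree on every character GB2312 defines and the answer is effectively correct for this text. The confidence is about 74%, and the language field reports 'Chinese'.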
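The relationship between the two encodings can be checked with the standard library alone. The sketch below is only an illustration: the sample characters are my own choice, with 龍 picked as a traditional character that GB2312, which covers simplified forms, does not define.

```python
common = "心理"
rare = "龍"  # traditional character, assumed here as an example outside GB2312

# Characters that GB2312 defines get the same bytes in both encodings
same_bytes = common.encode("gb2312") == common.encode("gbk")

# GB2312 cannot encode the rarer character at all
try:
    rare.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# GBK can encode it and decode it back
gbk_round_trip = rare.encode("gbk").decode("gbk") == rare

print(same_bytes, in_gb2312, gbk_round_trip)
```

This is why decoding with "gbk" is usually the safer choice when a detector reports GB2312 for real-world pages.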

Now detect UTF-8 encoded text:

>>> data = '离离原上草，一岁一枯荣'.encode('utf-8')
>>> chardet.detect(data)
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

And Japanese:

>>> data = '最新の主要ニュース'.encode('euc-jp')
>>> chardet.detect(data)
{'encoding': 'EUC-JP', 'confidence': 0.99, 'language': 'Japanese'}

As you can see, detecting an encoding with chardet is simple. Once you know the encoding, decode the bytes to a str and carry on processing.
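The last step, going from a detection result to a str, could look something like the sketch below. It does not call chardet itself; it takes a dict shaped like the results shown in the examples above. The function name decode_detected, the fallback encoding, and the 0.5 confidence threshold are all assumptions for illustration, not part of chardet.

```python
def decode_detected(raw, detection, fallback="utf-8", min_confidence=0.5):
    """Decode raw using a chardet-style result dict, with a fallback encoding."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        # Detection failed or is too unsure: use the fallback instead
        encoding = fallback
    elif encoding.lower() == "gb2312":
        # GBK is a superset of GB2312, so it decodes everything GB2312 can
        encoding = "gbk"
    # errors="replace" keeps one bad byte from aborting the whole decode
    return raw.decode(encoding, errors="replace")


data = "离离原上草，一岁一枯荣".encode("gbk")
detection = {"encoding": "GB2312", "confidence": 0.74, "language": "Chinese"}
print(decode_detected(data, detection))
```

In the script this would sit between chardet.detect(html) and the search step.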

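For the list of encodings chardet can detect, see Supported encodings in the official documentation.

Searching content for keywords

We now have the content of the whole page, so how do we pull out the useful part? This is a string-matching problem, which makes regular expressions the obvious tool. The match here is very specific, so re.search() is enough. The code is as follows: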

def searchInfo(content):
    # The keyword to look for, written as a regular expression
    pattern = r'心理'
    # Find the first place in content where the pattern matches
    searchObj = re.search(pattern, content)
    # print(searchObj)
    return searchObj

Regular expressions are a deep subject that I will not cover here; I may add more if I get the chance. Briefly, from the documentation:

re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
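A short example of the three outcomes described above: a real match, no match, and a zero-length match. The sample string is made up for illustration and is not taken from the actual page.

```python
import re

content = "应用心理学 录取名单"  # made-up sample text

hit = re.search(r"心理", content)    # keyword present
miss = re.search(r"物理", content)   # keyword absent
empty = re.search(r"x*", content)    # pattern that can match nothing at all

print(hit.group(), hit.span())
print(miss)
print(repr(empty.group()), empty.span())
```

A match object is always truthy, even for a zero-length match, so a plain `if result:` test is enough to tell "found" from "not found".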

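Finishing touches

The core is done: the code above finds the keywords when they are there. If they are not there yet, though, the program has to keep looking, and when it does find them it should tell me right away. That takes a little more code. We add a second thread that keeps time and runs a search every so many seconds, and shows a pop-up window when the search succeeds.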

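def trySearch():
    global cnt
    cnt += 1
    result = searchInfo(getInfo())
    if result:
        # Show the matched text rather than the match object itself
        showMsg(result.group())
    else:
        print("Attempt {:0>4}: not found yet.".format(cnt))

def t2():
    # Run a search once per second, forever
    while True:
        trySearch()
        time.sleep(1)

def showMsg(txt):
    messagebox.showinfo('Got it!', txt)

if __name__ == '__main__':
    cnt = 0
    # Pass the function itself; target=t2() would run the loop right here
    # in the main thread and the worker thread would never be created
    t = threading.Thread(target=t2)
    t.start()
    t.join()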
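As a variation on the loop above, here is a sketch of a poller that stops once the keyword is found and can also be stopped from outside. It is not the script's own code: poll_until_found, its parameters, and the fake page list are all hypothetical, and the fetch function stands in for getInfo() so the example runs without a network connection.

```python
import threading


def poll_until_found(fetch, keyword, interval, stop_event, on_found):
    """Call fetch() every `interval` seconds until keyword appears or stop_event is set."""
    attempts = 0
    while not stop_event.is_set():
        attempts += 1
        if keyword in fetch():
            on_found(keyword)
            return attempts
        # wait() pauses like time.sleep() but returns early once stop_event is set
        stop_event.wait(interval)
    return None


# Stand-in for getInfo(): the keyword only shows up on the third fetch
pages = iter(["", "", "心理学 录取名单"])
found = []
stop = threading.Event()
worker = threading.Thread(
    target=poll_until_found,  # the function itself, not poll_until_found(...)
    args=(lambda: next(pages), "心理", 0.01, stop, found.append),
)
worker.start()
worker.join()
print(found)
```

Using an Event for the pause means another thread can call stop.set() to end the loop promptly instead of waiting out a full sleep.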