Crawler in Practice 1: Scraping Jokes from Qiushibaike

Latest recommended article published on 2021-02-12 11:47:32

一不小心写起了代码

Category: Crawlers. Tags: python, crawler, Qiushibaike

Article link: https://blog.csdn.net/hwh1996/article/details/89500480

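This article shows a simple example of scraping jokes from Qiushibaike with Python 3.7 and urllib, with no login required. It covers three steps:

How to fetch the page source
Parsing the page source with regular expressions to extract the jokes
Cleaning the scraped data with replacements and deletions so it is easier to read

0. Full source code

The full source is shown first; the sections after it walk through how the scraper works step by step.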

import urllib.request
import re


class QSBKCrawler:
    User_Agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"

    # Fetch the source of the given page
    def getPageCode(self, pageIndex):
        # 1. Set the request headers
        headers = {"User-Agent": QSBKCrawler.User_Agent}
        # 2. Build the request and open it with the default opener
        request = urllib.request.Request("https://www.qiushibaike.com/text/page/" + str(pageIndex), headers=headers)
        response = urllib.request.urlopen(request)
        # read() returns bytes, so decode them as UTF-8
        return response.read().decode("utf-8")

    # Parse the page source and extract the fields we want
    def getStroies(self, pageCode):
        # 1. Extract the raw matches
        pattern = re.compile('<div.*?author clearfix">.*?<h2>(.*?)</h2>' +  # author
                             '.*?<div.*?content">\n<span>(.*?)</span>' +     # content
                             '.*?<span.*?vote">.*?<i.*?>(.*?)</i>',          # vote count
                             re.S)
        items = re.findall(pattern, pageCode)
        # 2. Clean up the matches (strip newlines and so on)
        pageStroies = []  # list that stores the results
        for item in items:
            removeSpace = re.compile('\n|\n\n|\n\n\n')
            repalceBR = re.compile('<br/>|<br/><br/>')
            # (1) Remove newlines from the author name and the content
            authors = re.sub(removeSpace, "", item[0])
            text = re.sub(removeSpace, "", item[1])
            # (2) Replace <br/> with a newline
            text = re.sub(repalceBR, "\n", text)
            # (3) Append the cleaned fields; pageStroies is a list of lists
            pageStroies.append([authors, text, item[2]])
        return pageStroies


# Run the crawler
crawler = QSBKCrawler()
pageCode = crawler.getPageCode(1)
pageStroies = crawler.getStroies(pageCode)
for p in pageStroies:
    print("Author: {0}  Votes: {1} \n    {2}\n".format(p[0], p[2], p[1]))

I. Fetching the page source

A Python crawler fetches page source through urllib.request. A simple example:

import urllib.request


class QSBKCrawler:
    User_Agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"

    # Fetch the source of the given page
    def getPageCode(self, pageIndex):
        # 1. Set the request headers
        headers = {"User-Agent": QSBKCrawler.User_Agent}
        # 2. Build the request and open it with the default opener
        request = urllib.request.Request("https://www.qiushibaike.com/text/page/" + str(pageIndex), headers=headers)
        response = urllib.request.urlopen(request)
        # Debug print; the bytes must be decoded as UTF-8. read() consumes the
        # response, so enable this in place of the return below, not alongside it.
        # print(response.read().decode("utf-8"))
        return response.read().decode("utf-8")

Signature of urllib.request.Request:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

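Here we only use the url and headers parameters. The data parameter comes up in the next crawler exercise, so it is skipped for now. The remaining parameters are usually left at their defaults; the urllib.request documentation describes them in detail.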
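As a minimal sketch of how url and headers end up on the Request object, the snippet below builds a request without sending it. The URL and User-Agent string are the ones used in this article; nothing here touches the network.

```python
import urllib.request

url = "https://www.qiushibaike.com/text/page/1"
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0"

# Only url and headers are set; data stays None.
request = urllib.request.Request(url, headers={"User-Agent": user_agent})

print(request.full_url)      # the URL that was passed in
print(request.get_method())  # GET, because data is None
# Request stores header names via str.capitalize(), hence "User-agent".
print(request.get_header("User-agent") == user_agent)
```

Passing a non-None data argument would turn the request into a POST, which is why data can be ignored for a simple page fetch.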

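url is the Uniform Resource Locator of the resource. Since we want to scrape jokes from Qiushibaike, let's break down one of its URLs, https://www.qiushibaike.com/text/page/2/:

https is the scheme, that is, HTTP over a secure connection
www.qiushibaike.com/text/ points to Qiushibaike's text jokes section
page/2/ selects the page number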
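The URL breakdown above can be sketched as a small helper. The name build_page_url and the check that pages start at 1 are assumptions made for illustration; they are not part of the original crawler.

```python
BASE_URL = "https://www.qiushibaike.com/text/"


def build_page_url(page_index):
    """Return the URL of one page of text jokes (hypothetical helper)."""
    # Assumption: page numbering starts at 1.
    if page_index < 1:
        raise ValueError("page_index must be 1 or greater")
    return "{0}page/{1}/".format(BASE_URL, page_index)


page_url = build_page_url(2)
print(page_url)
```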

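The headers parameter is mainly used to make the request look like it comes from a browser. To find the value in Firefox, open the site, press F12, click the Network tab, select any request in the list, and look for User-Agent at the bottom of the panel on the right.

At this point all we need to do is construct a urllib.request.Request with the url and headers, then pass it to urllib.request.urlopen to fetch the page source. The response body comes back as bytes, so it is decoded as UTF-8; printing the decoded string shows the page's raw HTML.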

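II. Extracting the jokes with regular expressions

Once we have the source, regular expressions (the re module) are a convenient way to pull out the jokes. Readers unfamiliar with regular expressions can consult the re module documentation first. Let's go through the extraction step by step.

1. Open the page and press F12 to view its source. Find where the jokes sit (which tags contain them)

In the source, each joke starts at a <div class="author clearfix"> tag, which is followed by the author, the joke text, and other details. The author name sits inside a pair of <h2></h2> tags, and the joke text inside a pair of <span></span> tags. So the regular expression starts matching at the <div class...> tag.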

2. Write the regular expression

def getStroies(self, pageCode):
    # 1. Extract the raw matches
    pattern = re.compile('<div.*?author clearfix">.*?<h2>(.*?)</h2>' +  # author
                         '.*?<div.*?content">\n<span>(.*?)</span>' +     # content
                         '.*?<span.*?vote">.*?<i.*?>(.*?)</i>',          # vote count
                         re.S)
    items = re.findall(pattern, pageCode)

Signature of re.findall:

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.

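In short, pattern is the regular expression we wrote and string is the text being searched. findall returns a list of every non-overlapping match of pattern in string; when the pattern contains groups, the list holds the group contents instead of the whole matches.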
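A short sketch of how the return value of re.findall changes with the number of groups. The input string is made up for illustration.

```python
import re

text = "a=1, b=22, c=333"

# No group: each item is the whole match.
no_group = re.findall(r"\d+", text)
# One group: each item is the text captured by that group.
one_group = re.findall(r"(\w)=\d+", text)
# Several groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w)=(\d+)", text)

print(no_group)    # ['1', '22', '333']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```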

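The pattern uses .*? many times. Here is what it means: . matches any character except a newline (with the re.S flag used above, it matches newlines too), and * repeats the preceding item zero or more times, so .* matches any run of characters. Adding ? makes the match non-greedy: it stops at the first position where the rest of the pattern can match. Take the string 10100000010:

1.*1 is greedy: .* takes as many characters as possible, so the match ends at the last 1. Result: 1010000001

1.*?1 is non-greedy: .*? takes as few characters as possible, so the match ends at the first 1 that follows. Result: 101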
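The two results can be checked directly with re.search, as in the sketch below. The last two lines also show the effect of re.S on the dot, using a made-up two-line string.

```python
import re

s = "10100000010"

greedy = re.search(r"1.*1", s).group()
lazy = re.search(r"1.*?1", s).group()
print(greedy)  # 1010000001
print(lazy)    # 101

# Without re.S the dot does not match a newline; with re.S it does.
without_flag = re.search(r"a.b", "a\nb")
with_flag = re.search(r"a.b", "a\nb", re.S)
print(without_flag)               # None
print(with_flag.group() == "a\nb")  # True
```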

For more detail, see the Regular Expression HOWTO in the Python documentation.

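(.*?) is a capture group. Our regular expression has three groups, so when we later iterate over the items, item[0] is the text captured by the first (.*?), item[1] the text captured by the second, and so on.

Note that each item is the result of one successful match of pattern against string. There are usually several matches, and findall stores them in a list. Each item is itself a tuple with one element per group, so three here.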
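To see the three groups in action without fetching the site, the sketch below runs the article's pattern against a small HTML fragment. The fragment is hypothetical: it only imitates the tag layout described above and is not copied from the real page.

```python
import re

# Hypothetical fragment that imitates the layout of one joke.
SAMPLE_HTML = (
    '<div class="author clearfix">\n'
    '<h2>\nAlice\n</h2>\n'
    '</div>\n'
    '<div class="content">\n'
    '<span>\nFirst line<br/>second line\n</span>\n'
    '</div>\n'
    '<span class="stats-vote"><i class="number">42</i></span>\n'
)

pattern = re.compile('<div.*?author clearfix">.*?<h2>(.*?)</h2>' +  # author
                     '.*?<div.*?content">\n<span>(.*?)</span>' +     # content
                     '.*?<span.*?vote">.*?<i.*?>(.*?)</i>',          # vote count
                     re.S)

items = re.findall(pattern, SAMPLE_HTML)
for item in items:
    # Each item is a tuple: (author, content, vote count), still unprocessed.
    print(type(item).__name__, repr(item[0]), repr(item[1]), repr(item[2]))
```

The captured author and content still carry the surrounding newlines and the <br/> tag, which is what the cleanup step deals with.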

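3. Inspect the output and refine it

Run without any post-processing, the code above does read the page data, but the output contains many superfluous newlines, and the HTML line break <br/> should be turned into \n. That brings us to the third and final step: cleaning the data.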

III. Replacing and deleting in the scraped data

        # 2. Clean up the matches (strip newlines and so on)
        pageStroies = []  # list that stores the results
        for item in items:
            removeSpace = re.compile('\n|\n\n|\n\n\n')
            repalceBR = re.compile('<br/>|<br/><br/>')
            # (1) Remove newlines from the author name and the content
            authors = re.sub(removeSpace, "", item[0])
            text = re.sub(removeSpace, "", item[1])
            # (2) Replace <br/> with a newline
            text = re.sub(repalceBR, "\n", text)
            # (3) Append the cleaned fields; pageStroies is a list of lists
            pageStroies.append([authors, text, item[2]])
        return pageStroies

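items is a list holding every match. Each item is a tuple holding the joke's author, text, and vote count. re.sub is the usual tool for replacing or deleting text; its signature is: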

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes of ASCII letters are reserved for future use and treated as errors. Other unknown escapes such as \& are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.

In short, re.sub replaces every substring of string that matches pattern with repl and returns the resulting string. The code above uses it to delete \n and to replace <br/>.
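A brief sketch of re.sub on a made-up string. It shows deletion by substituting the empty string, the effect of count, and that the three-way alternation used in this article behaves the same as a plain \n pattern.

```python
import re

raw = "\nline one\n\n\nline two\n"

# Delete every newline by substituting the empty string.
all_removed = re.sub("\n", "", raw)
# count limits how many matches are replaced (0, the default, means all).
first_removed = re.sub("\n", "", raw, count=1)
# The first alternative already matches each newline one at a time,
# so the longer alternatives never come into play.
with_alternation = re.sub("\n|\n\n|\n\n\n", "", raw)

print(repr(all_removed))                # 'line oneline two'
print(repr(first_removed))              # 'line one\n\n\nline two\n'
print(with_alternation == all_removed)  # True
```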

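Set up the patterns: removeSpace and repalceBR match every \n and <br/>. (The longer alternatives such as \n\n are redundant, because the first alternative already matches each occurrence one at a time.)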

   removeSpace = re.compile('\n|\n\n|\n\n\n')
   repalceBR = re.compile('<br/>|<br/><br/>')

Do the replacements; each call returns the new string

   # (1) Remove newlines from the author name and the content
   authors = re.sub(removeSpace, "", item[0])
   text = re.sub(removeSpace, "", item[1])
   # (2) Replace <br/> with a newline
   text = re.sub(repalceBR, "\n", text)
   pageStroies.append([authors, text, item[2]])
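The cleanup steps can also be collected into one helper and tried on a single item without any network access. This is a sketch under assumptions: the name clean_item, the simplified patterns, and the sample tuple are made up for illustration and are not the article's code.

```python
import re

# Simplified patterns; compiled once instead of on every loop iteration.
REMOVE_NEWLINES = re.compile("\n")
REPLACE_BR = re.compile("<br/>")


def clean_item(item):
    """Return [author, text, votes] with newlines and <br/> tags handled."""
    author = REMOVE_NEWLINES.sub("", item[0])
    text = REMOVE_NEWLINES.sub("", item[1])
    # Turn each HTML line break into a real newline.
    text = REPLACE_BR.sub("\n", text)
    return [author, text, item[2]]


# Hypothetical item, shaped like one tuple returned by re.findall.
sample = ("\nAlice\n", "\nFirst line<br/>second line\n", "42")
cleaned = clean_item(sample)
print("Author: {0}  Votes: {1} \n    {2}\n".format(cleaned[0], cleaned[2], cleaned[1]))
```

Stray newlines are removed before <br/> is converted, so the only newlines left in the text are the ones that came from <br/> tags.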