Python Web Scraping from Beginner to Job-Ready 03: Full-Site Crawling

Full-site crawling is a common scraping approach: it crawls a target site in bulk, which means handling pagination so that the entire site is traversed.

Topics covered in this chapter:

  • Chinese character encoding in web pages
  • Handling pagination to crawl the whole site
  • Extracting functions to reduce duplicated code
  • Exception handling

Handling Chinese Encoding

For this project we use the news section of 手机天堂 (xpgod.com). After analyzing the page source, we write a simple first version of the scraper:

import requests


class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        print(rsp.text)  # print the raw HTML source

Run the code and look at the returned HTML source; the Chinese text shows up as garbled characters:

[Screenshot: console output showing garbled characters]

Open the same news page in Chrome, right-click, and choose "View Page Source":

[Screenshot: page source viewed in Chrome]

Chinese text can be stored in several different encodings; the article 中文编码杂谈 covers them in detail. With a basic understanding of Chinese encodings, we modify the code:

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'  # declare the page's real encoding before reading rsp.text
        print(rsp.text)

Run the code again and the Chinese source is retrieved correctly:

[Screenshot: console output with correctly decoded Chinese text]

When it comes to page encodings, it helps to understand not just the fix but the reason behind it: requests tries to detect a page's encoding in three ways:

  • get_encodings_from_content(): uses predefined regular expressions to find the encoding declared inside the page.
  • get_encoding_from_headers(): reads the encoding from the HTTP Content-Type header; if no charset is set, it defaults to ISO-8859-1.
  • the detect() function from the chardet library (a dependency of requests), which guesses the encoding from the raw bytes.

When requests does not detect the encoding correctly, manually setting the encoding before accessing rsp.text lets the page be parsed properly.
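
As a quick, hedged sketch of those three detection steps (assuming an older requests release where chardet is installed as its dependency; requests.utils exposes the first two helpers):

import chardet
import requests

rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')

# 1. Encoding declared inside the page (e.g. a <meta charset=...> tag), found by regex
print(requests.utils.get_encodings_from_content(rsp.text))

# 2. Encoding taken from the Content-Type header; text/* without charset falls back to ISO-8859-1
print(requests.utils.get_encoding_from_headers(rsp.headers))

# 3. Statistical guess over the raw bytes
print(chardet.detect(rsp.content))

# If detection gets it wrong, override before reading rsp.text
rsp.encoding = 'gbk'
print(rsp.text[:200])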

Handling Pagination

Here we introduce a new parsing library, lxml, installed with pip install lxml. Continuing from the previous tutorial, we first scrape the data on the first listing page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'  # declare the page encoding

        # process the first listing page
        soup = BeautifulSoup(rsp.text, 'lxml')  # a more capable parser library; install with pip install lxml
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)

            # request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()  # contains publish time and author
            infos = info.split('|')  # split the string on '|'
            publish_time = infos[0].split(':')[-1].strip()  # publish time
            author = infos[1].split(':')[-1].strip()  # author
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()  # summary
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()  # article body
            images = []   # image URLs
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
            print(data)

Run the code; the console prints the parsed data. Success!
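
For reference, the split() parsing above assumes the info line has roughly the form 发布时间:... | 作者:...; the values below are made up purely to illustrate the slicing:

info = '发布时间:2021-01-01 | 作者:小编'  # hypothetical example string
infos = info.split('|')                          # ['发布时间:2021-01-01 ', ' 作者:小编']
publish_time = infos[0].split(':')[-1].strip()   # '2021-01-01'
author = infos[1].split(':')[-1].strip()         # '小编'
print(publish_time, author)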

To crawl the whole site we also need to handle pagination so we can collect the news links from every listing page. Start by looking at how the listing-page URLs are structured:

  • Page 1: https://www.xpgod.com/shouji/news/zixun.html
  • Page 2: https://www.xpgod.com/shouji/news/zixun_2.html
  • Page 3: https://www.xpgod.com/shouji/news/zixun_3.html

In most cases, turning the page just means changing the page number in the URL. We first read the maximum page number, then build the URL for each page (unusual sites need special handling), as sketched below.
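
Here is a rough sketch of that URL pattern, with the maximum page number hard-coded as a placeholder for the value we will extract shortly:

max_page = 5  # placeholder; the real value is parsed from the pagination bar below
page_urls = ['https://www.xpgod.com/shouji/news/zixun.html']  # page 1 has no number suffix
for page in range(2, max_page + 1):
    page_urls.append('https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page))
print(page_urls)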

The maximum page number can be found in the page source:

[Screenshot: pagination markup showing the maximum page number]

Continuing with the code:

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'

        # process the first listing page
        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)

            # request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()
            infos = info.split('|')
            publish_time = infos[0].split(':')[-1].strip()
            author = infos[1].split(':')[-1].strip()
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
            images = []
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }

        # pagination
        li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]  # the third-to-last <li> holds the last page number
        max_page = int(li_node.text.strip())  # .text is a string, so convert it to int
        for page in range(2, max_page + 1):  # pages 2 through max_page (range() excludes the stop value)
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)  # string formatting: fill {} with the page number
            print(url)

The URLs of all the listing pages are built successfully:

[Screenshot: console output listing the generated page URLs]

Now repeat the first step for every listing page: collect the news links on each page and scrape every article:

class PhoneHeavenSpider:
    def start(self):
        rsp = requests.get('https://www.xpgod.com/shouji/news/zixun.html')
        rsp.encoding = 'gbk'

        # process the first listing page
        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url = urljoin(rsp.url, href)

            # request the news detail page
            rsp_detail = requests.get(url)
            rsp_detail.encoding = 'gbk'
            soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
            title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
            info = soup_detail.find('div', class_='top_others_lf').text.strip()
            infos = info.split('|')
            publish_time = infos[0].split(':')[-1].strip()
            author = infos[1].split(':')[-1].strip()
            summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
            article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
            images = []
            for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp_detail.url, src)
                images.append(img_url)
            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }

        # pagination
        li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
        max_page = int(li_node.text.strip())
        for page in range(2, max_page + 1):
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
            rsp_index = requests.get(url)
            rsp_index.encoding = 'gbk'

            # process this listing page, the same way as the first page
            soup_index = BeautifulSoup(rsp_index.text, 'lxml')
            for div_node in soup_index.find_all('div', class_='zixun_li_title'):
                a_node = div_node.find('a')
                href = a_node['href']
                url = urljoin(rsp_index.url, href)

                # request the news detail page
                rsp_detail = requests.get(url)
                rsp_detail.encoding = 'gbk'
                soup_detail = BeautifulSoup(rsp_detail.text, 'lxml')
                title = soup_detail.find('div', class_='youxizt_top_title').text.strip()
                info = soup_detail.find('div', class_='top_others_lf').text.strip()
                infos = info.split('|')
                publish_time = infos[0].split(':')[-1].strip()
                author = infos[1].split(':')[-1].strip()
                summary = soup_detail.find('div', class_='zxxq_main_jianjie').text.strip()
                article = soup_detail.find('div', class_='zxxq_main_txt').text.strip()
                images = []
                for node in soup_detail.find('div', class_='zxxq_main_txt').find_all('img'):
                    src = node['src']
                    img_url = urljoin(rsp_detail.url, src)
                    images.append(img_url)
                data = {
                    'title': title,
                    'publish_time': publish_time,
                    'author': author,
                    'summary': summary,
                    'article': article,
                    'images': images,
                }
                print('News from page {}:'.format(page), data)

Run the code and the full-site crawl begins!

[Screenshot: console output during the full crawl]

Extracting Functions

Looking back at the code, the logic that parses page 1 is identical to the logic for the other listing pages, and the code that requests a news detail page appears twice. To avoid this, we move the repeated code into functions and call them instead:

class PhoneHeavenSpider:
    def start(self):
        self.crawl_index(1)

    # crawl a listing page (including the first page)
    def crawl_index(self, page):
        if page == 1:
            url = 'https://www.xpgod.com/shouji/news/zixun.html'
        else:
            url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
        rsp = requests.get(url)
        rsp.encoding = 'gbk'

        soup = BeautifulSoup(rsp.text, 'lxml')
        for div_node in soup.find_all('div', class_='zixun_li_title'):
            a_node = div_node.find('a')
            href = a_node['href']
            url_detail = urljoin(rsp.url, href)
            self.crawl_detail(url_detail, page)

        if page == 1:  # pagination: only the first page needs it
            li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]
            max_page = int(li_node.text.strip())
            for new_page in range(2, max_page + 1):
                self.crawl_index(new_page)

    # crawl a news detail page
    def crawl_detail(self, url, page):
        rsp = requests.get(url)
        rsp.encoding = 'gbk'

        soup = BeautifulSoup(rsp.text, 'lxml')
        title = soup.find('div', class_='youxizt_top_title').text.strip()
        info = soup.find('div', class_='top_others_lf').text.strip()
        infos = info.split('|')
        publish_time = infos[0].split(':')[-1].strip()
        author = infos[1].split(':')[-1].strip()
        summary = soup.find('div', class_='zxxq_main_jianjie').text.strip()
        article = soup.find('div', class_='zxxq_main_txt').text.strip()
        images = []
        for node in soup.find('div', class_='zxxq_main_txt').find_all('img'):
            src = node['src']
            img_url = urljoin(rsp.url, src)
            images.append(img_url)

        data = {
            'title': title,
            'publish_time': publish_time,
            'author': author,
            'summary': summary,
            'article': article,
            'images': images,
        }
        print('News from page {}:'.format(page), data)

The code is much cleaner now. Splitting it into functions by responsibility makes the project better organized, and a future change only has to be made once, inside the function, rather than in every duplicated copy. This is a basic principle of software development: avoid duplicated code.

Exception Handling

During a crawl we can run into many unexpected situations, for example a failed request:

Traceback (most recent call last):
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 63, in <module>

  File "E:/JuniorProject/tutorial/phone_heaven.py", line 8, in start
    self.crawl_index(1)
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 30, in crawl_index
    max_page = int(li_node.text.strip())  # .text is a string, so convert it to int
  File "E:/JuniorProject/tutorial/phone_heaven.py", line 24, in crawl_index
    href = a_node['href']
  File "E:\Python37\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "E:\Python37\lib\site-packages\requests\api.py", line 60, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "E:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "E:\Python37\lib\site-packages\requests\adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.xpgod.com', port=443): Max retries exceeded with url: /shouji/news/17804.html (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001C8D56FB0F0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Large-scale crawls hit many kinds of exceptions:

  • incomplete code logic
  • network request failures
  • parsing errors caused by unusual pages
  • parsing errors caused by dirty data
  • other exceptions

That is why the code needs exception handling. For now we keep it simple and just swallow the exceptions:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class PhoneHeavenSpider:
    def start(self):
        self.crawl_index(1)

    # crawl a listing page (including the first page)
    def crawl_index(self, page):
        try:
            if page == 1:
                url = 'https://www.xpgod.com/shouji/news/zixun.html'
            else:
                url = 'https://www.xpgod.com/shouji/news/zixun_{}.html'.format(page)
            rsp = requests.get(url)
            rsp.encoding = 'gbk'  # declare the page encoding

            soup = BeautifulSoup(rsp.text, 'lxml')  # the more capable lxml parser
            for div_node in soup.find_all('div', class_='zixun_li_title'):
                a_node = div_node.find('a')
                href = a_node['href']
                url_detail = urljoin(rsp.url, href)
                self.crawl_detail(url_detail, page)  # follow each news URL and scrape its data

            if page == 1:  # pagination: only the first page needs it
                li_node = soup.find('ul', class_='fenye_ul').find_all('li')[-3]  # the third-to-last <li> holds the last page number
                max_page = int(li_node.text.strip())  # .text is a string, so convert it to int
                for new_page in range(2, max_page + 1):  # pages 2 through max_page (range() excludes the stop value)
                    self.crawl_index(new_page)  # recurse into crawl_index; pages other than 1 do not recurse further
        except:
            pass

    # crawl a news detail page
    def crawl_detail(self, url, page):
        try:
            rsp = requests.get(url)
            rsp.encoding = 'gbk'

            soup = BeautifulSoup(rsp.text, 'lxml')
            title = soup.find('div', class_='youxizt_top_title').text.strip()
            info = soup.find('div', class_='top_others_lf').text.strip()  # contains publish time and author
            infos = info.split('|')  # split the string on '|'
            publish_time = infos[0].split(':')[-1].strip()  # publish time
            author = infos[1].split(':')[-1].strip()  # author
            summary = soup.find('div', class_='zxxq_main_jianjie').text.strip()  # summary
            article = soup.find('div', class_='zxxq_main_txt').text.strip()  # article body
            images = []  # image URLs
            for node in soup.find('div', class_='zxxq_main_txt').find_all('img'):
                src = node['src']
                img_url = urljoin(rsp.url, src)
                images.append(img_url)

            data = {
                'title': title,
                'publish_time': publish_time,
                'author': author,
                'summary': summary,
                'article': article,
                'images': images,
            }
            print('News from page {}:'.format(page), data)
        except:
            pass


if __name__ == '__main__':
    PhoneHeavenSpider().start()
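
Swallowing every exception with a bare except: pass also hides genuine bugs. As a hedged alternative sketch (not part of the original code), the request could be wrapped in a helper that catches only network-level errors and logs them, so the crawl keeps going but failures stay visible:

import logging

import requests

logging.basicConfig(level=logging.WARNING)


def safe_get(url, encoding='gbk'):
    # Hypothetical helper: fetch a URL and return None instead of raising on network errors.
    try:
        rsp = requests.get(url, timeout=10)
        rsp.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
        rsp.encoding = encoding
        return rsp
    except requests.RequestException as e:
        logging.warning('request failed for %s: %s', url, e)
        return None

crawl_index() and crawl_detail() could then call safe_get(url) and return early on None, while parsing errors (an AttributeError from a missing tag, a ValueError from int()) would be caught separately with a narrower try/except.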

Homework:

  1. 游戏葡萄: full-site crawl

Solution:

GitHub link


Next chapter >> Python Web Scraping from Beginner to Job-Ready 04: Data Storage
