All the News in One Place: Crawling Baidu News with Python AsyncIO

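There are already plenty of articles about asynchronous programming in Python, but few tutorials walk through a complete crawl from start to finish. I recently wrote a crawler for Baidu News and took the opportunity to put Python's AsyncIO into practice, using asynchronous code to speed up crawling.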

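All code in this article was written for Python 3.8. Some APIs differ in earlier versions (asyncio.run and asyncio.create_task, for example, were only added in 3.7), so keep that in mind.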

What is async?

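Asynchronous is defined relative to synchronous. Normally a program runs line by line, and if one line takes a long time, execution stays stuck at that point. For CPU-bound programs (such as numerical computation) that is fine, since the CPU is busy the whole time. For IO-bound programs (such as crawlers), however, most of the time is spent waiting on slow IO (network requests, database operations, and so on). The CPU does nothing while it waits, and that capacity is wasted, which is clearly unacceptable.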
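
To make the cost of waiting concrete, here is a minimal sketch that compares running three simulated requests one after another with running them concurrently. The names fake_request, run_sequentially, run_concurrently, and timed are made up for this illustration, and asyncio.sleep stands in for a real network call.

```python
import asyncio
import time


async def fake_request(delay):
    # asyncio.sleep stands in for a slow network call
    await asyncio.sleep(delay)
    return delay


async def run_sequentially(delays):
    # Each wait must finish before the next one starts
    return [await fake_request(d) for d in delays]


async def run_concurrently(delays):
    # All waits overlap, so the total is close to the longest single wait
    return await asyncio.gather(*(fake_request(d) for d in delays))


def timed(coro):
    # Run a coroutine to completion and report how long it took
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    delays = [0.2, 0.2, 0.2]
    _, sequential_time = timed(run_sequentially(delays))
    _, concurrent_time = timed(run_concurrently(delays))
    print(f'sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s')
```

The concurrent version should finish in roughly the time of one wait, while the sequential version takes about the sum of all three.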

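In the past, high-concurrency IO in Python was usually achieved with multiple threads or multiple processes. That style of programming is hard to get right and error-prone, so it is difficult to produce a working program quickly.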

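Anyone who has used JavaScript knows that JS is also single-threaded, yet it runs IO-bound programs efficiently by handling asynchronous operations through callbacks, Promises, and async/await.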

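Recent Python versions include a library called asyncio (in the standard library since 3.4), and since Python 3.5 the language uses the same async/await keywords as JS to mark asynchronous operations. This brings the way asynchronous code is written closer to synchronous code and makes it easier to follow.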

How does Python achieve high-concurrency IO?

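Python's approach to async is similar to that of JS: both use a model called the event loop. Tasks are placed on the loop. Whenever a task has to wait for IO it hands control back, the loop runs another task that is ready, and the first task is resumed once its IO has completed. The CPU can therefore stay busy instead of repeatedly idling while it waits for IO results.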
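
The switching described above can be observed directly. The following is a small sketch, with a hypothetical worker coroutine and an events list used only to record the order of execution. Both tasks run on one thread, and the loop moves between them at each await.

```python
import asyncio

events = []


async def worker(name, delay):
    events.append(f'{name} start')
    # Awaiting hands control back to the event loop so other tasks can run
    await asyncio.sleep(delay)
    events.append(f'{name} end')


async def main():
    # Both tasks are scheduled on the same loop and the same thread
    task_a = asyncio.create_task(worker('a', 0.2))
    task_b = asyncio.create_task(worker('b', 0.1))
    await task_a
    await task_b


asyncio.run(main())
print(events)  # ['a start', 'b start', 'b end', 'a end']
```

Task b starts before task a has finished, and it ends first because its wait is shorter.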

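In Python, putting the async keyword in front of def declares an asynchronous function. Such a function is a coroutine function, and calling it returns a coroutine, which can be thought of as "a task that can be placed on the event loop".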
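
As a quick illustration (the function add is invented for this example), calling a coroutine function does not run its body. It only creates a coroutine object, which then has to be handed to the event loop.

```python
import asyncio
import inspect


async def add(a, b):
    await asyncio.sleep(0)
    return a + b


coro = add(1, 2)                          # nothing has run yet
is_coroutine = inspect.iscoroutine(coro)  # the call returned a coroutine object
result = asyncio.run(coro)                # the event loop drives it to completion
print(is_coroutine, result)
```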

Building the basic skeleton of the Baidu News crawler

The basic skeleton of the program written for this article is as follows:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import asyncio
import aiohttp
import uvloop

# Use uvloop as the event loop implementation to improve speed
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())


class BaiduNewsCrawler:
    def __init__(self):
        pass

    async def loop_crawl(self):
        print('start')
        await asyncio.sleep(3)
        print('end')

    def run(self):
        try:
            # Start the asynchronous program
            asyncio.run(self.loop_crawl())
        except KeyboardInterrupt:
            print('stopped by yourself!')


if __name__ == '__main__':
    c = BaiduNewsCrawler()
    c.run()

Main features of the program

Creating the database

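The program takes a company name, searches for news about that company, and crawls the results. The database was designed around this, and the code follows.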

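There are two tables. The company table stores each company's information and crawl status, and the news table stores information about news related to a company. The two tables are linked through company_id.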

@staticmethod
def create_db():
    db = sqlite3.connect('stock.db')
    cursor = db.cursor()

    cursor.execute(
        "CREATE TABLE IF NOT EXISTS company (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(50), code VARCHAR(20), current_pn INTEGER, status INTEGER);")
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS news (
          id INTEGER PRIMARY KEY AUTOINCREMENT, 
          company_id INTEGER, 
          title VARCHAR(500), 
          posttime VARCHAR(50), 
          news_ts INTEGER, 
          subsitename VARCHAR(50), 
          url VARCHAR(600) unique, 
          baidu_json TEXT, 
          html_lzma BLOB, 
          status INTEGER, 
          created_ts INTEGER, 
          CONSTRAINT fk_company FOREIGN KEY (company_id) REFERENCES company(id)
        );
        """)
    db.commit()
    return db, cursor
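
To show how the two tables work together, here is a minimal sketch that uses an in-memory database with a trimmed-down set of columns. The company name, headline, and URL are invented sample values. It demonstrates the link through company_id and the effect of the unique constraint on url, which the crawler relies on later to skip duplicates.

```python
import sqlite3

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute(
    'CREATE TABLE company (id INTEGER PRIMARY KEY AUTOINCREMENT, name VARCHAR(50), status INTEGER)')
cursor.execute('''
    CREATE TABLE news (
      id INTEGER PRIMARY KEY AUTOINCREMENT,
      company_id INTEGER,
      title VARCHAR(500),
      url VARCHAR(600) UNIQUE,
      FOREIGN KEY (company_id) REFERENCES company(id)
    )
''')

# Sample rows, invented for this illustration
cursor.execute('INSERT INTO company (name, status) VALUES (?, ?)', ('Example Co', 1))
company_id = cursor.lastrowid
insert_news = 'INSERT INTO news (company_id, title, url) VALUES (?, ?, ?)'
cursor.execute(insert_news, (company_id, 'Example headline', 'https://example.com/a'))

try:
    # Same URL again: the UNIQUE constraint rejects it
    cursor.execute(insert_news, (company_id, 'Same story again', 'https://example.com/a'))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = cursor.execute(
    'SELECT company.name, news.title FROM news JOIN company ON company.id = news.company_id'
).fetchall()
db.commit()
print(rows, duplicate_rejected)
```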

Baidu News crawling logic

Query the database and iterate over the company names

async def loop_crawl(self):
    self.session = aiohttp.ClientSession()

    while True:
        # _workers_max controls how many requests run at the same time
        self.cursor.execute('select * from company where status=1 limit ?', (self._workers_max,))
        company_records = self.cursor.fetchall()
        if not len(company_records):
            break
        task_list = []
        for rec in company_records:
            self.logger.info(f'crawl {rec[1]} pn: {rec[3]}')
            # Build the task list
            task = asyncio.create_task(self.crawl(rec))
            task_list.append(task)
        # Run the current batch of tasks and wait for all of them to finish
        await asyncio.wait(task_list)
    await self.session.close()
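
One detail worth knowing about the loop above: asyncio.wait does not re-raise exceptions from the tasks it waits on. They stay stored on the task objects. The following is a small sketch with a made-up job coroutine (not part of the crawler) that shows how a batch could be inspected after it finishes.

```python
import asyncio


async def job(n):
    await asyncio.sleep(0.01)
    if n == 2:
        raise ValueError(f'job {n} failed')
    return n * 10


async def run_batch(numbers):
    tasks = [asyncio.create_task(job(n)) for n in numbers]
    # Returns once every task is done; task exceptions are not raised here
    await asyncio.wait(tasks)
    results, errors = [], []
    for task in tasks:  # iterate the original list to keep a stable order
        if task.exception() is not None:
            errors.append(str(task.exception()))
        else:
            results.append(task.result())
    return results, errors


results, errors = asyncio.run(run_batch([1, 2, 3]))
print(results, errors)
```

If a silent failure inside crawl would be a problem, checking task.exception() like this (or logging it) after each batch is one option.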

Analyzing the network request

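As most readers know, the Network panel of Chrome's developer tools shows every request the page makes. Right-click the request you want to imitate, choose Copy, then Copy as cURL, and the request details are copied as a single curl command.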

A very handy tool worth introducing here is Convert cURL. It converts a curl command into program code, such as Python, so we do not have to copy the request details into our code line by line.

The Python code this tool generates is based on the requests library, so it needs a few small changes before it works with aiohttp.

A custom request function

To retry failed requests, I added the tenacity library, whose @retry decorator provides the retry behaviour.

So that each request is sent with a different UA (User-Agent), a small utility that returns a random UA is also defined here.

@retry(stop=stop_after_attempt(2), wait=wait_random(min=2, max=5))
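async def fetch(session, url, params, headers, cookies, timeout=9, to_json=False, random_ua=True):
    # Work on a copy so the caller's shared headers dict is not modified
    headers = dict(headers)
    if random_ua or 'User-Agent' not in headers:
        headers['User-Agent'] = RandomUserAgent().random
    # aiohttp 3.x expects a ClientTimeout object (requires `import aiohttp` in this module)
    client_timeout = aiohttp.ClientTimeout(total=timeout)
    async with session.get(url, params=params, cookies=cookies, headers=headers, timeout=client_timeout) as response:
        status = response.status
        if to_json:
            html = await response.json()
        else:
            html = await response.read()
            encoding = response.get_encoding()
            # gbk is a superset of gb2312, so it decodes more pages cleanly
            if encoding == 'gb2312':
                encoding = 'gbk'
            html = html.decode(encoding, errors='ignore')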
        redirected_url = str(response.url)
    return status, html, redirected_url
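
The RandomUserAgent helper is not shown in this article, and the retry logic lives inside tenacity. To make both ideas concrete, here is a standard-library-only sketch. The USER_AGENTS list, this RandomUserAgent class, and the simple_retry decorator are assumptions written for illustration. They are not the author's helper and not tenacity's implementation. In particular, this sketch re-raises the last exception, whereas tenacity by default raises RetryError once the attempts are used up.

```python
import asyncio
import functools
import random

# Hypothetical pool; a real list would hold many more browser strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.0.5 Safari/605.1.15',
]


class RandomUserAgent:
    @property
    def random(self):
        # Return a different User-Agent string on (almost) every access
        return random.choice(USER_AGENTS)


def simple_retry(attempts, min_wait, max_wait):
    """Retry an async function, sleeping a random time between attempts."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: let the caller see the error
                    await asyncio.sleep(random.uniform(min_wait, max_wait))
        return wrapper
    return decorator


calls = []


@simple_retry(attempts=2, min_wait=0.01, max_wait=0.02)
async def flaky():
    # Fails on the first call and succeeds on the second
    calls.append(RandomUserAgent().random)
    if len(calls) < 2:
        raise ConnectionError('simulated network failure')
    return 'ok'


outcome = asyncio.run(flaky())
print(outcome, len(calls))
```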

The crawl function

async def crawl(self, company_record):
    # Make a shallow copy of the global request parameters
    p = params.copy()
    company_id = company_record[0]
    # Override them with the values for the current company
    p['word'] = company_record[1]
    current_pn = company_record[3]
    p['pn'] = str(current_pn)
    try:
        # Try the request
        status, json_resp, redirected_url = await su.fetch(
            self.session,
            'https://m.baidu.com/sf/vsearch',
            params=p,
            headers=headers,
            cookies=cookies,
            to_json=True
        )
    except RetryError:  # The request failed on every attempt: record that in the database
        self.cursor.execute('update company set status=0 where id = ?;', (company_id,))
        self.db.commit()
        return
    else:  # The request completed and returned a response
        if status != 200 or json_resp.get('errno') != 0:
            # Assumption: treat an unexpected response like a failed request,
            # otherwise the main loop would select this company again forever
            self.cursor.execute('update company set status=0 where id = ?;', (company_id,))
            self.db.commit()
            return
        news_list = json_resp['data']['list']
        for item in news_list:
            try:
                # Insert each news item on the current page into the database
                self.cursor.execute("""
                insert into news (company_id, title, posttime, subsitename, url, baidu_json, status, created_ts, news_ts)
                values (?, ?, ?, ?, ?, ?, ?, ?, ?)
                """, (
                    company_id,
                    item['title'],
                    item['posttime'],
                    item['subsitename'],
                    item['url'],
                    json.dumps(item),
                    1,
                    su.get_unix_timestamp(),
                    su.convert_baidu_news_datetime_to_unix_ts(item['posttime']),
                ))
            except sqlite3.IntegrityError:
                # The url column is unique, so a duplicate URL raises an error; just skip it
                print(item['url'], 'duplicate')
                continue
        if not json_resp['data']['hasMore']:
            # All news for this company has been crawled: set its status to 2 and stop
            self.cursor.execute('update company set status=2 where id = ?;', (company_id,))
        else:
            # Move on to the next page
            self.cursor.execute('update company set current_pn=? where id = ?;', (current_pn + 10, company_id))
        self.db.commit()

Building the crawler for individual news pages

The previous sections covered the code for crawling Baidu News. This part looks at how to crawl the individual news pages at scale.

Because the news pages found through Baidu come from many external news sites, the sequential approach used above is not a good fit.

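To keep the load on each site as low as possible (and to avoid being blocked by a site), a URL pool is used here (based on code from Yuanrenxue, with some changes). URLs are stored separately per site, and each round of requests tries to hit different sites instead of concentrating on a single one.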

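First, every URL in the database is pushed into the pool. Performance and memory use were not considered here; everything is simply read in at once.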

def push_to_pool(self, news):
    host = urlparse(news.url).netloc
    if not host or '.' not in host:
        print('try to push_to_pool with bad url:', news.url, ', len of url:', len(news.url))
        return False
    if host in self.waiting:
        if news in self.waiting[host]:
            return True
        self.waiting[host].add(news)
    else:
        self.waiting[host] = {news}
    return True

Then an endless loop keeps pulling URLs out of the pool

def pop(self, size):
    result = []

    # Pick up to `size` distinct hosts at random and take one URL from each
    waiting_len = len(self.waiting)
    sample_len = min(waiting_len, size)
    hosts = random.sample(list(self.waiting), k=sample_len)

    for host in hosts:
        result.append(self.pop_by_host(host))

    # Fewer hosts than requested: top up from random hosts that still have URLs
    counter = size - waiting_len
    while counter > 0 and len(self.waiting):
        host = random.choice(list(self.waiting))
        result.append(self.pop_by_host(host))
        counter -= 1
    return result
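
The pop method relies on pop_by_host, which is not shown in this article. The following self-contained sketch shows how the pieces could fit together. The HostPool class is a simplified stand-in that stores plain URL strings instead of news records, and pop_by_host is an assumed implementation that removes one URL and drops the host once it has none left.

```python
import random
from urllib.parse import urlparse


class HostPool:
    def __init__(self):
        self.waiting = {}  # host -> set of URLs not yet fetched

    def push(self, url):
        host = urlparse(url).netloc
        if not host or '.' not in host:
            return False  # reject anything without a usable host
        self.waiting.setdefault(host, set()).add(url)
        return True

    def pop_by_host(self, host):
        # Assumed behaviour: take one URL and drop the host once it is empty
        url = self.waiting[host].pop()
        if not self.waiting[host]:
            del self.waiting[host]
        return url

    def pop(self, size):
        # One URL from each of up to `size` distinct hosts
        hosts = random.sample(list(self.waiting), k=min(len(self.waiting), size))
        result = [self.pop_by_host(host) for host in hosts]
        # Fewer hosts than requested: top up from hosts that still have URLs
        while len(result) < size and self.waiting:
            result.append(self.pop_by_host(random.choice(list(self.waiting))))
        return result


pool = HostPool()
for u in ['https://a.example.com/1', 'https://a.example.com/2',
          'https://b.example.com/1', 'not a url']:
    pool.push(u)

batch = pool.pop(2)
hosts_in_batch = {urlparse(u).netloc for u in batch}
print(batch)
```

With two hosts in the pool and a batch size of two, each batch contains one URL per host, so no single site receives both requests.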

Requesting the URL

async def crawl(self, news):
    status, html, redirected_url = await su.fetch_without_retry(
        self.session,
        news.url,
        headers=headers,
    )
    self.set_status(news, status, html)

Writing to the database according to the result of the request

def set_status(self, news, status_code, html):
    if news in self.pending:
        self.pending.pop(news)

    if status_code == 200:
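        # Success: compress the HTML and store it in the database.
        # lzma.compress needs bytes, so encode first if html is a decoded str
        html_bytes = html.encode('utf-8') if isinstance(html, str) else html
        html_lzma = lzma.compress(html_bytes)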
        self.cursor.execute('update news set status=2, html_lzma=? where id = ?;', (html_lzma, news.id,))
    elif status_code == 404:
        self.cursor.execute('update news set status=0 where id = ?;', (news.id,))
    elif news in self.failure:
        self.failure[news] += 1
        if self.failure[news] > self.failure_threshold:
            self.cursor.execute('update news set status=0 where id = ?;', (news.id,))
            self.failure.pop(news)
        else:
            self.add(news)
    else:
        self.failure[news] = 1
        self.add(news)
    self.db.commit()
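
For reading the stored pages back later, the compression step has to be reversed. This is a minimal round-trip sketch with an invented sample page; it assumes the HTML was encoded as UTF-8 before compression.

```python
import lzma

# Invented sample page; real pages come from the crawled news sites
html = '<html><body>' + '<p>news body</p>' * 200 + '</body></html>'

raw = html.encode('utf-8')             # lzma.compress expects bytes, not str
compressed = lzma.compress(raw)        # this is what goes into the html_lzma column
restored = lzma.decompress(compressed).decode('utf-8')

print(len(raw), len(compressed), restored == html)
```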