Scraping a novel in Python with a single thread, multiple threads, and async coroutines: as fast as 5 s


中意灬


Tags: python, web scraping, programming languages

Article link: https://blog.csdn.net/qq_55977554/article/details/122521771


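This article scrapes an entire novel in three ways: single-threaded, multi-threaded, and with asynchronous coroutines.
Novel URL: `http://www.doupo321.com/doupocangqiong/`

The page is simple and needs little analysis. All of the content is in the page source, so this is just a multi-level link crawler: first scrape the chapter links from the index page, then follow each link to fetch the text of that chapter.
Because the page source is very regular, we use XPath to match elements; if you are more comfortable with regular expressions or bs4, those work too. Let's start writing the code.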
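Before fetching chapters, the links taken from the index page have to be turned into absolute URLs. The sketch below shows one way to do that with the standard library's `urljoin` instead of stripping slashes by hand. The `hrefs` values are made-up examples of what an `./a/@href` query might return, not values copied from the real page.

```python
from urllib.parse import urljoin

INDEX_URL = "http://www.doupo321.com/doupocangqiong/"

# Hypothetical href values, as they might come out of an XPath "./a/@href" query.
hrefs = [
    "//www.doupo321.com/doupocangqiong/1.html",  # protocol-relative link
    "/doupocangqiong/2.html",                    # root-relative link
    "3.html",                                    # relative to the index page
]

# urljoin resolves every form against the index URL and fills in the scheme.
chapter_urls = [urljoin(INDEX_URL, href) for href in hrefs]

for chapter_url in chapter_urls:
    print(chapter_url)
```

All three forms resolve to a full `http://` URL under the index page, so the crawler does not need to know which form the site uses.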

Single-threaded scraping

# @Time:2022/1/13 12:04
# @Author:中意灬
# @File:斗破2.py
# @ps:tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree
def download(url,title):  # download one chapter and save it as a text file
    resp=requests.get(url)
    resp.encoding='utf-8'
    html=resp.text
    tree=etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破2/{title}.txt',mode='w',encoding='utf-8')as f:
        f.write(body)
def geturl(url):  # fetch the index page and download every chapter it links to
    resp=requests.get(url)
    resp.encoding='utf-8'
    html=resp.text
    tree=etree.HTML(html)
    lis=tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    for li in lis:
        # strip() removes the leading (and any trailing) '/' characters from the href
        href=li.xpath("./a/@href")[0].strip('/')
        href="http://"+href
        title=li.xpath("./a/text()")[0]
        download(href,title)
if __name__ == '__main__':
    os.makedirs('斗破2',exist_ok=True)  # open() fails if the output directory is missing
    url="http://www.doupo321.com/doupocangqiong/"
    t1=time.time()
    geturl(url)
    t2=time.time()
    print("Elapsed:",t2-t1)

Run result: [screenshots not preserved]
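The scripts in this article use each chapter title directly as a file name. A title that happens to contain a character such as `/`, `:` or `?` would make `open()` fail on some systems. Below is a minimal sketch of a sanitizing helper; `safe_filename` is a hypothetical name introduced here, and the sample titles are invented for illustration.

```python
import re

def safe_filename(title, replacement="_"):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"

plain = safe_filename("第一章")                 # non-ASCII text is left untouched
risky = safe_filename("Chapter 1: What/Why?")  # ':', '/' and '?' are replaced
print(plain)
print(risky)
```

In the scripts, the helper would wrap the title where the path is built, for example `f'斗破2/{safe_filename(title)}.txt'`.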

Multi-threaded scraping

# @Time:2022/1/13 11:42
# @Author:中意灬
# @File:斗破1.py
# @ps:tutu qqnum:2117472285
import os
import time
import requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor
def download(url,title):  # download one chapter and save it as a text file
    resp=requests.get(url)
    resp.encoding='utf-8'
    html=resp.text
    tree=etree.HTML(html)
    body = tree.xpath("/html/body/div/div/div[4]/p/text()")
    body = '\n'.join(body)
    with open(f'斗破1/{title}.txt',mode='w',encoding='utf-8')as f:
        f.write(body)
def geturl(url):  # fetch the index page and return the chapter <li> elements
    resp = requests.get(url)
    resp.encoding = 'utf-8'
    html = resp.text
    tree = etree.HTML(html)
    lis = tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    return lis

if __name__ == '__main__':
    os.makedirs('斗破1',exist_ok=True)  # open() fails if the output directory is missing
    url="http://www.doupo321.com/doupocangqiong/"
    t1=time.time()
    lis=geturl(url)
    with ThreadPoolExecutor(1000)as t:  # create a thread pool with up to 1000 worker threads
        for li in lis:
            href = li.xpath("./a/@href")[0].strip('/')
            href = "http://" + href
            title = li.xpath("./a/text()")[0]
            t.submit(download,url=href,title=title)  # the with block waits for all submitted jobs
    t2=time.time()
    print("Elapsed:",t2-t1)

Run result: [screenshots not preserved]
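One thing to be aware of with `submit`: if a worker raises an exception, it is stored in the returned `Future` and never shown unless the result is read. The sketch below keeps the futures and collects failures. `fake_download` is a stand-in that makes no network request, and the URLs and the pool size of 8 are arbitrary assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(url, title):
    """Stand-in for download(); fails for one URL to show the error handling."""
    if url.endswith("/2.html"):
        raise RuntimeError(f"request failed for {url}")
    return title

chapters = [(f"http://example.com/{i}.html", f"Chapter {i}") for i in range(1, 5)]
done, failed = [], []

with ThreadPoolExecutor(max_workers=8) as pool:
    # Keep every Future so exceptions raised in worker threads are not lost.
    futures = {pool.submit(fake_download, url, title): title for url, title in chapters}
    for future in as_completed(futures):
        title = futures[future]
        try:
            done.append(future.result())  # re-raises the worker's exception, if any
        except Exception as exc:
            failed.append((title, str(exc)))

print("downloaded:", sorted(done))
print("failed:", failed)
```

The failed titles could then be retried in a second pass instead of silently leaving gaps in the novel.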

Async coroutine scraping

# @Time:2022/1/13 10:30
# @Author:中意灬
# @File:斗破.py
# @ps:tutu qqnum:2117472285
import os
import requests
import aiohttp
import asyncio
import aiofiles
from lxml import etree
import time
async def download(url,title,session):
    async with session.get(url) as resp:  # async counterpart of resp=requests.get(url)
        html=await resp.text(encoding='utf-8')  # same encoding as the synchronous versions
        tree=etree.HTML(html)
        body=tree.xpath("/html/body/div/div/div[4]/p/text()")
        body='\n'.join(body)
    async with aiofiles.open(f'斗破/{title}.txt',mode='w',encoding='utf-8')as f:  # save the chapter text
        await f.write(body)

async def geturl(url):
    resp=requests.get(url)  # a single blocking request for the index page, before any task starts
    resp.encoding='utf-8'
    html=resp.text
    tree=etree.HTML(html)
    lis=tree.xpath("/html/body/div[1]/div[2]/div[1]/div[3]/div[2]/ul/li")
    tasks=[]
    async with aiohttp.ClientSession() as session:  # one session shared by all chapter requests
        for li in lis:
            href=li.xpath("./a/@href")[0].strip('/')
            href="http://"+href
            title=li.xpath("./a/text()")[0]
            # schedule each chapter download as a concurrent task
            tasks.append(asyncio.create_task(download(href,title,session)))
        await asyncio.wait(tasks)
if __name__ == '__main__':
    os.makedirs('斗破',exist_ok=True)  # the output directory must exist before files are written
    url="http://www.doupo321.com/doupocangqiong/"
    t1=time.time()
    asyncio.run(geturl(url))  # creates the event loop, runs geturl, then closes the loop
    t2=time.time()
    print("Elapsed:",t2-t1)

Run result: [screenshots not preserved]
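The coroutine version starts one task per chapter all at once. If the site or the local machine struggles with that many simultaneous requests, a semaphore is a common way to cap how many run at the same time. The sketch below demonstrates the idea with `asyncio.sleep` standing in for the HTTP request; `fake_fetch`, `crawl`, and the limit of 5 are assumptions for illustration and are not part of the original script.

```python
import asyncio

async def fake_fetch(i, sem, state):
    """Stand-in for an HTTP request; records how many run at the same time."""
    async with sem:  # at most `limit` coroutines are inside this block at once
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["running"] -= 1
        return i

async def crawl(n, limit):
    sem = asyncio.Semaphore(limit)
    state = {"running": 0, "peak": 0}
    # gather returns the results in the order the coroutines were passed in.
    results = await asyncio.gather(*(fake_fetch(i, sem, state) for i in range(n)))
    return results, state["peak"]

results, peak = asyncio.run(crawl(n=20, limit=5))
print(len(results), "tasks finished; peak concurrency:", peak)
```

In the real script, the `async with sem:` block would wrap the `session.get(url)` call inside `download`.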
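Because the files are not sorted, the scraped chapters appear out of order. When writing the crawler you can set the file titles yourself so that the downloaded chapters are more likely to come out in order.
As we can see, the multi-threaded version scraped a novel of more than 1,600 chapters in only 5 seconds, but threads put a relatively heavy load on the system. The async coroutine version is somewhat slower, taking a little over 20 seconds, but its overhead is low, so I recommend the coroutine approach. The single-threaded version is far slower, taking more than 9 minutes for the whole novel, and is not really recommended.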
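One possible way to set the titles, assuming the index page lists the chapters in reading order, is to prefix each file name with a zero-padded position. `numbered_name` is a hypothetical helper and the titles are placeholders.

```python
def numbered_name(index, title, width=4):
    """Prefix the title with a zero-padded index so files sort in chapter order."""
    return f"{index:0{width}d}_{title}.txt"

titles = ["Chapter A", "Chapter B", "Chapter C"]  # order as listed on the index page
names = [numbered_name(i, title) for i, title in enumerate(titles, start=1)]
print(names)
```

Zero padding matters because a plain `10_...` would sort before `2_...` in most file managers.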
