Building a High-Concurrency Crawler with aiohttp (aiohttp + aiomysql)

Latest recommended article published on 2023-10-09 18:47:59

VIP article by hubingshabi

Categories: multitasking, advanced Python programming

Article link: https://blog.csdn.net/hubingshabi/article/details/101074244

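This is an asyncio + aiohttp crawler (aiohttp issues the requests asynchronously) with two additions. The first is deduplication: URLs that have already been crawled during the run are not fetched again. The second is storage, which must also be asynchronous. The blocking pymysql driver is no longer suitable here, so aiomysql is used instead.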
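The deduplication idea can be sketched on its own, before any networking is involved. The snippet below is a minimal illustration, not the author's code: `enqueue`, `fake_fetch`, `crawl`, and the example.com link table are hypothetical names and data, and the loop processes one URL at a time, whereas a real crawler would schedule several fetches concurrently.

```python
import asyncio

seen_urls = set()    # URLs that have already been crawled
waiting_urls = []    # URLs discovered but not yet crawled


def enqueue(url):
    """Queue a URL only if it has not been crawled or queued already."""
    if url in seen_urls or url in waiting_urls:
        return False
    waiting_urls.append(url)
    return True


async def fake_fetch(url):
    """Hypothetical stand-in for an HTTP request: returns the links on a page."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    links = {
        "https://example.com/": ["https://example.com/a", "https://example.com/b"],
        "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    }
    return links.get(url, [])


async def crawl(start_url):
    """Crawl from start_url, visiting every reachable URL exactly once."""
    enqueue(start_url)
    while waiting_urls:
        url = waiting_urls.pop()
        if url in seen_urls:
            continue
        seen_urls.add(url)  # mark as crawled before following its links
        for link in await fake_fetch(url):
            enqueue(link)
    return seen_urls
```

Because membership tests on a set are constant time on average, checking `seen_urls` stays cheap even when the crawl grows to many thousands of URLs.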

# asyncio crawler with deduplication (URLs already crawled are skipped) and asynchronous storage (aiomysql instead of the blocking pymysql)
import asyncio
import re
import time

import aiohttp
import aiomysql
from pyquery import PyQuery

stopping = False
start_url = 'https://cuiqingcai.com/'
waiting_urls = []
seen_urls = set()

async def article_handler(url, session, pool):
    '''
    Extract the page's title and add the URLs found on the page to the waiting_urls list.
    :param url:
    :param session:
    :param pool:
    :return:
    '''
    print('start get url: {}'.format(url))
    # Fetch the article page, then parse it and store the result
    html = await fetch(url, session)
    # This URL has now been crawled, so add it to the seen_urls set
    seen_urls.add(url)
    # extract_urls extracts all URLs from the page; parsing is
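
The listing above calls `fetch(url, session)`, mentions `extract_urls`, and receives a `pool`, but it ends before any of those helpers are defined. The following is a rough sketch of what such helpers might look like, not the author's implementation. It assumes `session` is an `aiohttp.ClientSession` and `pool` is an aiomysql connection pool, both created by the caller. The `extract_urls` signature, the `save_title` helper, and the `article` table with a `title` column are hypothetical. Link extraction uses a plain regular expression here rather than the PyQuery parser that the original imports.

```python
import re
from urllib.parse import urljoin


async def fetch(url, session):
    """Return the page body as text, or None for a non-200 response."""
    async with session.get(url) as resp:
        if resp.status != 200:
            return None
        return await resp.text()


def extract_urls(html, base_url):
    """Collect absolute http(s) links from href attributes (regex sketch)."""
    urls = []
    for href in re.findall(r'href=["\']([^"\']+)["\']', html or ""):
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute.startswith("http") and absolute not in urls:
            urls.append(absolute)
    return urls


async def save_title(pool, title):
    """Insert one title using a connection borrowed from the pool."""
    async with pool.acquire() as conn:
        async with conn.cursor() as cur:
            # Hypothetical table and column; parameters are passed
            # separately so the driver escapes them.
            await cur.execute("INSERT INTO article (title) VALUES (%s)", (title,))
        await conn.commit()  # needed unless the pool was created with autocommit=True
```

Passing `session` and `pool` in as arguments, as `article_handler` does, lets every task share one HTTP connection pool and one database pool instead of opening new connections per URL.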
