Extracting data from a school's official website with aiohttp and asyncio

Latest recommended article published on 2023-03-14 20:39:09

DongXun_Lord


Categories: Async crawling, Async

Article link: https://blog.csdn.net/GMBai/article/details/99698701


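Over the past few days I studied asynchronous programming and how asyncio coroutines work, and today I used that approach to re-crawl a site I had scraped before.
It was a real headache: the synchronous and asynchronous runs took about the same time, which made me wonder whether I had written something wrong,
and the way concurrency is set up is also easy to get wrong. A Baidu search told me that requests calls are still synchronous, so this post uses aiohttp to send the requests asynchronously.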
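
One possible reason the two timings can look alike is that awaiting coroutines one at a time still runs them back to back; the waits only overlap when the coroutines are scheduled together, for instance with `asyncio.gather`. The following is a minimal sketch of that difference, not a measurement of the crawler itself: `fake_request` and its 0.2 s delay are stand-ins invented for illustration.

```python
import asyncio
import time


async def fake_request(i, delay=0.2):
    """Hypothetical stand-in for one HTTP request: it only waits `delay` seconds."""
    await asyncio.sleep(delay)
    return i


async def one_by_one(n):
    # Each await finishes before the next coroutine starts, so the waits add up.
    return [await fake_request(i) for i in range(n)]


async def all_at_once(n):
    # gather schedules every coroutine up front, so the waits overlap.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    begin = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - begin


sequential_result, sequential_time = timed(one_by_one(5))
concurrent_result, concurrent_time = timed(all_at_once(5))
print(f'one by one: {sequential_time:.2f}s, gathered: {concurrent_time:.2f}s')
```

Both variants return the same values in the same order; only the elapsed time differs.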

Note: the site crawled is the news listing of Hebei Finance University: https://www.hbfu.edu.cn/newsList?type=1

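"""
This second crawl is mainly meant to practice carrying captured parameters in a POST request, to practice suppressing warning messages,
and to use asynchronous requests for concurrency so the data is fetched quickly.
time: 2019-08-17 16:13:51
2019-08-17 21:47:00: switched to a new request library, because requests calls are still blocking, so asynchronous requests
via aiohttp are used instead. I have only scratched the surface so far.
1. 26s 2. 28s
"""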
import asyncio, aiohttp
from aiohttp import ClientSession
import time
import logging, datetime

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/74.0.3729.169 Safari/537.36'
}

a = time.time()
print(f'Crawl started: {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')


class HeBei(object):

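    async def get_id(self, url, start):
        """
        Fetch the ids of the detail pages from the captured list request
        :param url: URL of the list endpoint
        :param start: offset of the first record on this page
        :return: list of ids
        """
        data = {
            'start': f'{start}',
            'limit': 20,
            'type': 1,
        }

        async with ClientSession() as session:
            async with session.post(url, headers=headers, data=data,) as resp:
                result = await resp.json()  # resp.json() already parses the JSON body, so no json.loads call is needed
                logging.captureWarnings(True)

                id_list = []
                if result:
                    rows = result.get('rows') or []
                    for row in rows:
                        id = row.get('id')
                        item = f'{id}'
                        id_list.append(item)
                # Always return a list so the caller can iterate even when the page is empty
                return id_list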

    async def get_detail_msg(self, url, id):
        """
        Fetch one detail page
        :param url: URL of the detail endpoint
        :param id: id of the article
        :return: dict with the article fields
        """
        data = {
            'id': f'{id}',
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(url, headers=headers, data=data) as resp:
                result = await resp.json()

                logging.captureWarnings(True)
                if result:
                    content = result.get('content')
                    hit = result.get('hit')
                    title = result.get('title')
                    creat_time = result.get('createDate')
                    content = await self.convert_str(content)  # convert_str is a coroutine function, so await it to get its return value
                    item = {
                        '文章内容': content,  # article content
                        '点击量': hit,  # view count
                        '标题': title,  # title
                        '时间戳': creat_time,  # timestamp
                    }
                    # print(item)
                    return item
    # Strip out some useless markup
    async def convert_str(self, text):
        text =  text.replace('<p', '').replace('</p>', '').replace('<p align="right" style="text-indent:2em;">','').replace('style="text-indent:2em;" align="center">', '').replace('style="text-indent:2em;">', '').\
            replace('style="text-indent:2em;">', '').replace('src', 'image').replace('<img alt=""', '').replace('/>','').replace('align="center"', '').replace(' align="right"', '').replace('\r\n\r\n','').replace('\r\n\t', '').replace('\r\n', '').replace('style="text-indent:2em;">', '').replace('&nbsp;', '').replace('style="text-align:right;text-indent:2em;">', '').replace('alt=""', '').strip()
        return text

    def save_file_csv(self, data):
        """
        Append the data to asyncHebei.csv
        :param data: text to write (the caller passes one JSON-encoded record per line)
        """
        with open('asyncHebei.csv', 'a', encoding='utf-8') as fp:
            fp.write(data)

    async def main(self):
        """
        Fetch the list pages concurrently
        :return: one list of ids per list page
        """
        list_url = 'https://www.hbfu.edu.cn/news/queryListForPage'
        tasks = [self.get_id(list_url, start) for start in range(0, 200, 20)]
        # task = asyncio.ensure_future(hebei.get_id(list_url)) # a single task object
        # return await asyncio.gather(task)
        return await asyncio.gather(*tasks)


if __name__ == '__main__':
    import json

    hebei = HeBei()
    detail_url = 'https://www.hbfu.edu.cn/news/findById'

    loop = asyncio.new_event_loop()  # create the event loop
    results = loop.run_until_complete(hebei.main())
    for ids in results:
        for id in ids:
            # Each run_until_complete call fetches a single detail page, so the detail pages are requested one at a time
            datas = loop.run_until_complete(hebei.get_detail_msg(detail_url, id))
            hebei.save_file_csv(json.dumps(datas, ensure_ascii=False) + "\n")  # save the dict as one JSON line
    loop.close()


b = time.time()
print(f'Crawl finished: {datetime.datetime.now()}, \nelapsed {b-a}s')
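
The chain of `replace` calls in `convert_str` depends on the exact attributes the site happens to emit. As an alternative sketch (a different technique, not the method used in the code above), the standard library's `html.parser` can collect the text between tags and drop the tags themselves. The names `TextExtractor` and `strip_tags` are hypothetical, and unlike `convert_str` this version discards image URLs instead of keeping them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        # convert_charrefs=True decodes entities, e.g. &nbsp; becomes '\xa0'
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return the text content of an HTML fragment without line breaks, tabs, or non-breaking spaces."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flushes any text still buffered by the parser
    text = ''.join(parser.parts)
    # Remove the same kinds of characters the replace chain targets
    return text.translate({ord(c): None for c in '\r\n\t\xa0'}).strip()


print(strip_tags('<p style="text-indent:2em;">河北金融学院&nbsp;</p>\r\n'))
```

Ordinary spaces inside the text are kept, so English words are not glued together.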

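None of the above is very hard, and the site's data is easy to fetch; the difficult parts are running the requests concurrently and using the aiohttp library.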
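
One way the detail requests could be made concurrent as well is to share a single `ClientSession`, cap the number of in-flight requests with a semaphore, and gather them. This is a sketch under assumptions: `fetch_detail`, `crawl_details`, and the limit of 10 are hypothetical, and the endpoint is assumed to accept the same `id` form field as in the code above.

```python
import asyncio


async def fetch_detail(session, url, news_id, sem):
    """POST one id to the detail endpoint and return the decoded JSON."""
    async with sem:  # at most `limit` requests are in flight at once
        async with session.post(url, data={'id': str(news_id)}) as resp:
            return await resp.json()


async def crawl_details(url, ids, limit=10):
    """Fetch all detail pages concurrently over one shared session."""
    # Third-party dependency; imported here so fetch_detail can be used without it
    import aiohttp

    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        jobs = [fetch_detail(session, url, news_id, sem) for news_id in ids]
        # gather returns the results in the same order as `ids`
        return await asyncio.gather(*jobs)
```

With the ids from the list pages collected into a flat list, a call such as `asyncio.run(crawl_details('https://www.hbfu.edu.cn/news/findById', ids))` would request the detail pages in overlapping batches instead of one after another. The semaphore keeps the load on the server bounded.
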
The data is loaded via AJAX, so the requests also have to be captured and analyzed.
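
Once the AJAX request has been captured, replaying it mostly comes down to rebuilding its form fields. As a small sketch, the payloads for the list pages could be generated like this; the field names `start`, `limit`, and `type` are the ones used in the code above, while `list_payloads` and its parameters are hypothetical.

```python
def list_payloads(total, page_size=20, news_type=1):
    """Build one form payload per list page, mirroring the captured fields."""
    return [
        {'start': str(start), 'limit': page_size, 'type': news_type}
        for start in range(0, total, page_size)
    ]


payloads = list_payloads(200)
print(len(payloads), payloads[0], payloads[-1])
```

Each dict can be passed as the `data` argument of `session.post` for the list endpoint.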
