Python crawler async loading: an asyncio-based asynchronous crawler framework for Python

Latest recommended article published on 2024-03-31 23:26:58

weixin_39612038


Tags: Python crawler async loading

aspider

A web scraping micro-framework based on asyncio.

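aspider is a lightweight asynchronous crawler framework built on asyncio. It aims to make single-page crawlers quicker and easier to write, and it uses asynchronous IO to speed crawling up by reducing the time spent waiting on IO.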
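To make the "less time spent on IO" point concrete, here is a minimal standard-library sketch. It does not use aspider at all: `fake_fetch`, `fetch_sequentially`, `fetch_concurrently`, and `timed` are hypothetical helpers, and the network request is simulated with `asyncio.sleep`.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a network request: sleeps instead of doing real IO."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_sequentially(urls):
    # Each request starts only after the previous one has finished.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    # All requests wait at the same time; gather keeps the input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    _, slow = timed(fetch_sequentially(urls))
    _, fast = timed(fetch_concurrently(urls))
    print(f"sequential: {slow:.2f}s, concurrent: {fast:.2f}s")
```

With simulated delays, the sequential version takes roughly the sum of the delays, while the concurrent version takes roughly the longest single delay. Real speedups depend on the network and the target server.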

Introduction

pip install aspider

Item

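For a single page, implementing the Item class defined by the framework is enough to scrape the target data. The snippet below fetches one page with Request: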

import asyncio

from aspider import Request

request = Request("https://news.ycombinator.com/")
response = asyncio.get_event_loop().run_until_complete(request.fetch())

# Output
# [2018-07-25 11:23:42,620]-Request-INFO
#
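The example above drives a coroutine with `get_event_loop().run_until_complete(...)`. On Python 3.7 and later, `asyncio.run` is the more common way to run a single coroutine from synchronous code. The sketch below shows that pattern with a hypothetical `fetch_stub` coroutine standing in for `request.fetch()`; whether aspider's Request behaves identically under `asyncio.run` is not confirmed by the source.

```python
import asyncio


async def fetch_stub(url):
    """Hypothetical stand-in for an awaitable such as request.fetch()."""
    await asyncio.sleep(0)  # yield to the event loop once, as real IO would
    return {"url": url, "status": 200}


def run_once(url):
    # asyncio.run creates a fresh event loop, runs the coroutine, then closes the loop.
    return asyncio.run(fetch_stub(url))


if __name__ == "__main__":
    print(run_once("https://news.ycombinator.com/"))
```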

Spider

When there are many target pages and the crawl needs to go deeper, Spider comes in handy:

import aiofiles

from aspider import AttrField, TextField, Item, Spider


class HackerNewsItem(Item):
    # target_item selects the repeated container element for each story.
    target_item = TextField(css_select='tr.athing')
    title = TextField(css_select='a.storylink')
    url = AttrField(css_select='a.storylink', attr='href')

    async def clean_title(self, value):
        # Hook for post-processing the extracted title; returned unchanged here.
        return value


class HackerNewsSpider(Spider):
    start_urls = ['https://news.ycombinator.com/', 'https://news.ycombinator.com/news?p=2']

    async def parse(self, res):
        items = await HackerNewsItem.get_items(html=res.body)
        for item in items:
            # Append each title to a local file without blocking the event loop.
            async with aiofiles.open('./hacker_news.txt', 'a') as f:
                await f.write(item.title + '\n')


if __name__ == '__main__':
    HackerNewsSpider.start()
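The HackerNewsItem above declares fields as class attributes and an async `clean_title` hook. As a rough illustration of how such a declarative pattern can work, here is a toy standard-library sketch. It is not aspider's implementation: `RegexField`, `MiniItem`, and `StoryItem` are hypothetical names, and regular expressions stand in for CSS selectors.

```python
import asyncio
import re


class RegexField:
    """Hypothetical field: extracts the first regex group from a chunk of HTML."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern, re.S)

    def extract(self, html):
        match = self.pattern.search(html)
        return match.group(1) if match else None


class MiniItem:
    """Toy stand-in for an Item: fills declared fields, then runs clean_<name> hooks."""

    @classmethod
    async def get_item(cls, html):
        item = cls()
        for name, field in vars(cls).items():
            if not isinstance(field, RegexField):
                continue
            value = field.extract(html)
            # Look for an optional async hook named clean_<field name>.
            hook = getattr(item, f"clean_{name}", None)
            if hook is not None:
                value = await hook(value)
            setattr(item, name, value)
        return item


class StoryItem(MiniItem):
    title = RegexField(r'<a class="storylink"[^>]*>(.*?)</a>')
    url = RegexField(r'<a class="storylink" href="([^"]*)"')

    async def clean_title(self, value):
        return value.strip()


if __name__ == "__main__":
    html = '<tr class="athing"><a class="storylink" href="https://example.com">  Hello  </a></tr>'
    story = asyncio.run(StoryItem.get_item(html))
    print(story.title, story.url)
```

The point of the sketch is the shape of the design: extraction rules live in class attributes, and per-field cleanup lives in optional hooks, so a new page type only needs a new subclass.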

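JavaScript loading support

The Request class also works for pages whose content only appears after JavaScript has run, and returns the loaded content. The following example fetches such a page: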

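import asyncio

from aspider import Request

# load_js=True asks the framework to load JavaScript before returning the page.
request = Request("https://www.jianshu.com/", load_js=True)
response = asyncio.get_event_loop().run_until_complete(request.fetch())
print(response.body)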

If you like it, give it a try. The project is on GitHub: aspider
