Another round with the async libraries asyncio/aiohttp: trying GeneralNewsExtractor to scrape news
Stumbling on GeneralNewsExtractor, a news extraction tool
Install and import:
pip install gne
from gne import GeneralNewsExtractor
extractor = GeneralNewsExtractor()
result = extractor.extract(html)  # html is the page source as a string
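Lately I have been using these two async libraries while learning how to fetch content from dynamic web pages, and I happened upon a news extraction tool that turned out to be very convenient: a single call extracts five elements of a news article.
Libraries used: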
from aiohttp import ClientSession
import asyncio
from myheaders import *  # local module, expected to define `header`
from parsel import Selector
from pymongo import MongoClient
from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
client = MongoClient(host='localhost', port=27017)
db = client.yandb
url = 'https://news.163.com/'
headers = header
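The myheaders module is not shown in this post, so the snippet above will not run as copied. A minimal sketch of what such a module might contain, assuming it only needs to expose a dict named header (the User-Agent value here is a placeholder, not the one actually used):

```python
# myheaders.py (hypothetical contents; the real module is not shown in the post)
# `from myheaders import *` would bring `header` into the importing script.
header = {
    # Placeholder browser-like User-Agent; replace with your own.
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/100.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml',
}
```

Importing the name explicitly (from myheaders import header) would make it clearer where header comes from than the wildcard import does.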
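Impressions from the crawl
1. Asynchronous coroutines gave me a real sense of speed.
2. parsel folds CSS, XPath, and regex parsing into one library, which makes grabbing the news links almost effortless.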
html = await resp.text()
sel = Selector(html)
all_url = sel.re(r'div><a href="(.*?)">.*?</a></div>')
titles = sel.re(r'div><a href=".*?">(.*?)</a></div>')
for href, title in zip(all_url, titles):
    task = {'href': href,
            'title': title}
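The two regex passes above can be folded into a single pattern with two capture groups, so each link stays paired with its own title without relying on zip. The sketch below uses the standard-library re module as a stand-in for parsel's Selector.re, and SAMPLE_HTML and extract_tasks are made-up names for illustration; the real news.163.com markup may differ.

```python
import re

# Hypothetical sample markup; the real page is larger and may differ.
SAMPLE_HTML = (
    '<div><a href="https://news.163.com/a.html">First headline</a></div>'
    '<div><a href="https://news.163.com/b.html">Second headline</a></div>'
)

# One pattern, two groups: group 1 is the href, group 2 is the link text.
LINK_RE = re.compile(r'div><a href="(.*?)">(.*?)</a></div>')


def extract_tasks(html):
    """Return a list of {'href': ..., 'title': ...} dicts found in html."""
    return [{'href': href, 'title': title}
            for href, title in LINK_RE.findall(html)]


tasks = extract_tasks(SAMPLE_HTML)
for task in tasks:
    print(task)
```

Because both values come from the same match, a link that one of the two separate patterns happened to miss can no longer shift the titles out of step with the URLs.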
Finally, here is the full code. I would welcome pointers from any experts passing by:
#!/usr/bin/python39
# -*- coding: utf-8 -*-
from aiohttp import ClientSession
import asyncio
from myheaders import *  # local module, expected to define `header`
from parsel import Selector
from pymongo import MongoClient
from gne import GeneralNewsExtractor

extractor = GeneralNewsExtractor()
client = MongoClient(host='localhost', port=27017)
db = client.yandb
url = 'https://news.163.com/'
headers = header


async def get_next_request(session, task):
    """Fetch one article, extract its fields with GNE, and store them."""
    try:
        url = task['href']
        print(url)
        async with session.get(url=url, headers=headers) as resp:
            print(resp.status)
            html = await resp.text()
            result = extractor.extract(html)
            # insert_one() replaces the deprecated Collection.insert().
            # pymongo is synchronous, so this call blocks the event loop.
            if db.tb_163news39.insert_one(result).acknowledged:
                print('Saved', result)
    except Exception as e:
        print(e)


async def main():
    """Fetch the front page, collect article links, and crawl each one."""
    try:
        async with ClientSession() as session:
            async with session.get(url=url, headers=headers) as resp:
                print(resp.status, resp.charset)
                html = await resp.text()
                sel = Selector(html)
                all_url = sel.re(r'div><a href="(.*?)">.*?</a></div>')
                titles = sel.re(r'div><a href=".*?">(.*?)</a></div>')
                for href, title in zip(all_url, titles):
                    task = {'href': href,
                            'title': title}
                    await get_next_request(session, task)
                    print(task)
    except Exception as e:
        print(e)
    await asyncio.sleep(3)


if __name__ == '__main__':
    asyncio.run(main())
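One thing worth noting: the loop in main() awaits each article before starting the next, so the requests still run one at a time. A minimal sketch of running them concurrently with asyncio.gather, capped by a semaphore, is below. Here fake_fetch is a stand-in that only sleeps (it would be replaced by the aiohttp request plus extraction), and MAX_CONCURRENCY, bounded_fetch, and crawl are names made up for this illustration.

```python
import asyncio

MAX_CONCURRENCY = 5  # assumed limit; tune it for the target site


async def fake_fetch(url):
    """Stand-in for the real request and extraction; it only sleeps."""
    await asyncio.sleep(0.05)
    return {'href': url, 'title': 'title for ' + url}


async def bounded_fetch(semaphore, url):
    # The semaphore caps how many fetches are in flight at once.
    async with semaphore:
        try:
            return await fake_fetch(url)
        except Exception as e:
            print(url, e)
            return None


async def crawl(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    # gather() runs the coroutines concurrently and returns results
    # in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(semaphore, u) for u in urls))


if __name__ == '__main__':
    urls = ['https://example.com/news/%d' % i for i in range(10)]
    results = asyncio.run(crawl(urls))
    print(len(results), 'articles fetched')
```

The semaphore keeps the crawl polite: without it, gather() would fire every request at once, which can get a scraper throttled or blocked.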