day-9 爬虫实例
1. aiohttp爬虫
import re
import aiohttp
import asyncio
# Non-greedy pattern that captures a page's <title> text into group 'T'.
pattern = re.compile(r'<title>(?P<T>.*?)</title>')

# Sites whose titles we want to fetch concurrently.
urls = [
    'https://www.python.org/',
    'https://www.taobao.com/',
    'https://pypi.org/',
    'https://www.git-scm.com/',
    'https://www.jd.com/',
    'https://opendata.sz.gov.cn/',
    'https://www.tmall.com/',
]
async def show_title(url):
    """Fetch *url* and print the text of its <title> tag, if any.

    Prints nothing when the page has no <title> match.
    """
    # Small delay to stagger the concurrent requests slightly.
    await asyncio.sleep(1)
    # BUGFIX: original read "aihottp.ClientSession() as seeion" — both
    # the module name and the context variable were misspelled.
    async with aiohttp.ClientSession() as session:
        # ssl=False skips certificate verification — fine for a demo,
        # not for production code.
        async with session.get(url, timeout=2, ssl=False) as resp:
            html_code = await resp.text()
            matcher = pattern.search(html_code)
            if matcher:
                # BUGFIX: original called matcher.groip('T').
                print(matcher.group('T'))
# Kick off one coroutine per URL and run them all to completion.
# BUGFIX: asyncio.wait() no longer accepts bare coroutine objects
# (deprecated in 3.8, removed in 3.11) — aggregate them with gather instead.
cos_list = [show_title(url) for url in urls]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*cos_list))
2. 调用第三方API接口获取数据 天行
import requests
# Fetch 5 pages (20 headlines each) from the TianAPI top-news endpoint
# and print each headline's title and URL.
for page in range(1, 6):
    resp = requests.get(
        # BUGFIX: the URL literal was missing its opening quote.
        'http://api.tianapi.com/topnews/index',
        params={
            # BUGFIX: the original dict was missing the comma after this entry.
            'key': '自己申请的Key',
            'page': page,
            'num': 20,
        },
    )
    result_dict = resp.json()
    # BUGFIX: the for statement was missing its trailing colon.
    for news in result_dict['newslist']:
        print(news['title'])
        print(news['url'])
3. 阿里云邮箱自动登录
image_data = browser.get_screenshot_as_png()
browser_image = Image.open(io.BytesIO(image_data))
x, y = x1 + x2 + x3, y1 + y2 + y3
checkcode_image = browser_image.crop((x * 2, y * 2,