Synchronous crawler
Example: the URLs are fetched in a blocking fashion; the next image download starts only after the previous one has finished. Everything runs in a single thread.
import requests

header = {'User-Agent': 'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)'}
urls = [
    'http://pic.netbian.com/uploads/allimg/210122/195550-1611316550d711.jpg',
    'http://pic.netbian.com/uploads/allimg/180803/084010-15332568107994.jpg',
    'http://pic.netbian.com/uploads/allimg/190415/214606-15553359663cd8.jpg'
]

# Helper: fetch the content at a URL
def get_content(url):
    print('Crawling', url)
    response = requests.get(url=url, headers=header)
    if response.status_code == 200:
        return response.content

def parse_content(content):
    print('Length of response data:', len(content))

for url in urls:
    content = get_content(url)
    parse_content(content)
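To make the cost of blocking concrete, here is a minimal sketch that simulates three blocking downloads with time.sleep standing in for requests.get (the function and URL names are made up): the total time is the sum of the individual times, because each fetch blocks the single thread.

```python
import time

def fetch(url):
    # Stand-in for a blocking requests.get call: each "download" takes 0.1 s.
    time.sleep(0.1)
    return 'content of ' + url

urls = ['url1', 'url2', 'url3']
start = time.time()
results = [fetch(u) for u in urls]
elapsed = time.time() - start
# Sequential blocking calls: total time is roughly the sum, about 0.3 s here.
print(round(elapsed, 1), results)
```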
Asynchronous crawler
Approaches:
- Multi-threading / multi-processing (not recommended):
  - Pro: each blocking operation can get its own thread or process, so blocking operations run asynchronously
  - Con: threads and processes cannot be spawned without limit
- Thread pool / process pool (use in moderation):
  - Pro: lowers the cost of creating and destroying threads or processes, which greatly reduces system overhead
  - Con: the number of threads or processes in the pool is capped
- Single thread + async coroutines (recommended):
  - event_loop: the event loop, essentially an infinite loop; functions can be registered on it, and when certain conditions are met the loop executes them
  - coroutine: a coroutine object, which can be registered on the event loop and will be called by it; a method defined with the async keyword is not executed when called, but instead returns a coroutine object
  - task: a further wrapper around a coroutine object that also tracks the task's state
  - future: represents a task that will run in the future or has not run yet; in practice there is no essential difference from a task
  - async: defines a coroutine
  - await: suspends execution at a blocking call
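The concepts above (coroutine object, task, future) can be sketched with the standard library alone; the function name and URL here are made up for illustration:

```python
import asyncio

async def request(url):
    # async def defines a coroutine function; calling it returns a coroutine object
    return 'response from ' + url

async def main():
    c = request('www.example.com')            # coroutine object; nothing has run yet
    task = asyncio.ensure_future(c)           # wrap it in a Task (a subclass of Future)
    print('done before await:', task.done())  # False: the task has not executed
    result = await task                       # the event loop drives the task to completion
    print('done after await:', task.done())   # True
    return result

result = asyncio.run(main())
```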
Note
Inside coroutines, the aiohttp module must be used instead of requests for asynchronous network requests, because requests.get is synchronous.
Code:
import aiohttp

# Use ClientSession from aiohttp
async def get_page(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url=url) as response:
            # text() returns the response body as a string
            # read() returns the response body as bytes
            # json() returns the response body as a JSON object
            # Note: always use await before reading the response data
            page_text = await response.text()
            print(page_text)
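Since aiohttp is a third-party package that may not be installed, this sketch uses asyncio.sleep as a stand-in for the awaitable network I/O that session.get() performs; it shows how a single event loop overlaps several "requests" (all names and timings here are illustrative assumptions):

```python
import asyncio
import time

async def get_page(url):
    # asyncio.sleep stands in for the awaitable I/O an aiohttp request would do.
    await asyncio.sleep(0.2)
    return 'page text of ' + url

async def main(urls):
    # gather schedules all coroutines on the one event loop concurrently.
    return await asyncio.gather(*(get_page(u) for u in urls))

urls = ['u1', 'u2', 'u3']
start = time.time()
pages = asyncio.run(main(urls))
elapsed = time.time() - start
# The three 0.2 s "requests" overlap, so total time is ~0.2 s, not 0.6 s.
print(round(elapsed, 1), pages)
```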
Instantiating a thread pool object in Python
from multiprocessing.dummy import Pool
# Instantiate a thread pool object
pool = Pool(4)
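A minimal usage sketch of the pool's map interface (the job function here is hypothetical; in the crawler below the job is saving a video):

```python
from multiprocessing.dummy import Pool  # a thread pool, despite the module name

def blocking_job(n):
    # Stand-in for a blocking, time-consuming task such as a download.
    return n * n

pool = Pool(4)
results = pool.map(blocking_job, [1, 2, 3, 4])  # blocks until all jobs finish
pool.close()
pool.join()
print(results)  # [1, 4, 9, 16]
```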
Coroutines:
import asyncio

async def request(url):
    print('Requesting url:', url)

# A function defined with async returns a coroutine object when called
c = request('www.wzc.com')
# Create an event loop object (on Python 3.10+, prefer asyncio.run(c))
loop = asyncio.get_event_loop()
# Register the coroutine object on the event loop and start the loop
loop.run_until_complete(c)
Thread pool principle
A thread pool should handle operations that are both blocking and time-consuming.
Hands-on example
Target site: https://www.pearvideo.com/category_8
Note: everything up to opening an individual video page is fairly basic; once on a video page, the video itself is loaded dynamically, so the data must be fetched through an Ajax request.
When requesting the Ajax data, a Referer must be added to the headers:
post_url = 'https://www.pearvideo.com/videoStatus.jsp'
data = {
    'contId': id_,
    'mrd': str(random.random()),
}
ajax_headers = {
    'User-Agent': random.choice(user_agent_list),
    'Referer': 'https://www.pearvideo.com/video_' + id_
}
response = requests.post(post_url, data, headers=ajax_headers)
Code:
import requests
from lxml import etree
import random
import os
import time
from multiprocessing.dummy import Pool
user_agent_list = [
'Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)',
'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)',
'Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)',
'Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11',
'Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.4.3.4000 Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
]
# Decode the video URL and return the correct video path
def videoUrlDeal(video_url, id_):
    # Build the real URL via string manipulation
    video_true_url = ''
    s_list = str(video_url).split('/')
    for i in range(0, len(s_list)):
        if i < len(s_list) - 1:
            video_true_url += s_list[i] + '/'
        else:
            ss_list = s_list[i].split('-')
            for j in range(0, len(ss_list)):
                if j == 0:
                    video_true_url += 'cont-' + id_ + '-'
                elif j == len(ss_list) - 1:
                    video_true_url += ss_list[j]
                else:
                    video_true_url += ss_list[j] + '-'
    return video_true_url
def testPost(id_):
    post_url = 'https://www.pearvideo.com/videoStatus.jsp'
    data = {
        'contId': id_,
        'mrd': str(random.random()),
    }
    ajax_headers = {
        'User-Agent': random.choice(user_agent_list),
        'Referer': 'https://www.pearvideo.com/video_' + id_
    }
    response = requests.post(post_url, data, headers=ajax_headers)
    page_json = response.json()
    # print(page_json['videoInfo']['videos']['srcUrl'])
    return videoUrlDeal(page_json['videoInfo']['videos']['srcUrl'], id_)
# Save a video to disk
def saveVideo(data):
    true_url = data[0]
    videoTitle = data[1]
    content = requests.get(url=true_url, headers=header).content
    with open('./video/' + videoTitle + '.mp4', 'wb') as fp:
        fp.write(content)
    print(true_url, videoTitle, 'saved successfully')
if __name__ == '__main__':
    # Create a folder to hold all the videos
    if not os.path.exists('./video'):
        os.mkdir('./video')
    header = {'User-Agent': 'Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)'}
    url = 'https://www.pearvideo.com/category_8'
    # Request the category page and parse out each video's detail-page URL and title
    response = requests.get(url=url, headers=header)
    tree = etree.HTML(response.text)
    li_list = tree.xpath('//ul[@class="category-list clearfix"]/li')
    true_url_list = []
    for li in li_list:
        videoTitle = li.xpath('./div[@class="vervideo-bd"]/a/div[@class="vervideo-title"]/text()')[0]
        videoHref = 'https://www.pearvideo.com/' + li.xpath('./div[@class="vervideo-bd"]/a/@href')[0]
        # Request the detail page (this page has since changed)
        # videoText = requests.get(url=videoHref, headers=header).text
        # Resolve the video address via the id taken from the URL
        id_ = videoHref.split('_')[1]
        true_url_list.append((testPost(id_), videoTitle))
    # print(true_url_list)
    # Instantiate a thread pool object and save the videos with multiple threads
    pool = Pool(5)
    pool.map(saveVideo, true_url_list)
    pool.close()
    pool.join()
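Incidentally, the nested loop inside videoUrlDeal just replaces the first dash-separated field of the URL's last path segment with 'cont-<id>'. A hypothetical shorter equivalent using rsplit/split (the sample URL below is made up to match the shape the code expects, not a real pearvideo address):

```python
def video_url_fix(video_url, id_):
    # Split off the last path segment, then replace its first dash-field
    # (the pseudo-random token) with 'cont-<id_>'.
    prefix, last = video_url.rsplit('/', 1)
    rest = last.split('-', 1)[1]
    return prefix + '/cont-' + id_ + '-' + rest

# Example with a made-up URL of the expected shape:
url = 'https://video.pearvideo.com/mp4/adshort/20210122/1611316550-15550000-adpkg-ad_hd.mp4'
print(video_url_fix(url, '1719113'))
# → https://video.pearvideo.com/mp4/adshort/20210122/cont-1719113-15550000-adpkg-ad_hd.mp4
```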
Crawl results: