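Python Web Scraping 02: Improving Crawl Efficiency, Multithreading, Passing Arguments to Threads, Multiprocessing, Thread and Process Pools, Coroutines, Multi-task Async Coroutines, Async Requests with aiohttp, and How Video Sites Work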

心湖中的石子

Category: python. Tags: python, web scraping

Article link: https://blog.csdn.net/weixin_43745804/article/details/120545780

1. Improving crawl efficiency

The options are multithreading, multiprocessing, coroutines, and asynchronous I/O.

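2. Multithreading

A process is the unit of resource allocation; every process has a default main thread.
A thread is the unit of execution.
To use multithreading, import Thread: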

from threading import Thread

1. First way to write a thread

from threading import Thread


def func():
    for index in range(1, 50):
        str2 = 'func' + str(index)
        print(str2)


if __name__ == '__main__':  # only run when this file is the entry point
    thread = Thread(target=func)  # create a thread object and tell it which task to run
    thread.start()  # mark the thread as ready; the scheduler decides when it actually runs
    for item in range(100, 150):
        str1 = 'main' + str(item)
        print(str1)

The main lines and the func lines are printed interleaved, because the two threads run concurrently.
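When the main thread needs the worker's output before it continues, `Thread.join()` blocks until the worker has finished. A minimal sketch, assuming a hypothetical `collect` worker and `run_and_wait` helper that are not part of the original example:

```python
from threading import Thread


def collect(n, out):
    # hypothetical worker: append 0..n-1 to a list shared with the caller
    for index in range(n):
        out.append(index)


def run_and_wait(n):
    out = []  # threads in one process share memory, so the worker can fill this list
    thread = Thread(target=collect, args=(n, out))
    thread.start()
    thread.join()  # block here until the worker thread has finished
    return out


if __name__ == '__main__':
    print(run_and_wait(5))
```

Without the `join()` call, the main thread could read the list before the worker has filled it.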

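2. Second way to write a thread
Define a custom thread class.

from threading import Thread


class MyThread(Thread):
    # override the run method of the parent Thread class
    def run(self):
        for index in range(1, 50):
            print('child thread', index)


if __name__ == '__main__':
    thread = MyThread()
    thread.start()  # start the thread; start() calls run() in the new thread
    for item in range(150, 200):
        print('main thread', item)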

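3. Passing arguments to a thread

Positional arguments are passed through args, normally as a tuple. With a single argument, remember the trailing comma after it; without the comma, ('x') is just a string, not a tuple.

from threading import Thread


def func(name):
    for index in range(1, 50):
        print(name, index)


if __name__ == '__main__':
    thread1 = Thread(target=func, args=('张三丰',))  # one-element tuple: note the trailing comma
    thread2 = Thread(target=func, args=('王力宏',))
    thread1.start()
    thread2.start()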
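`Thread` also accepts a `kwargs` dictionary for named arguments, which can be combined with `args`. A small sketch, assuming a hypothetical `greet` function with extra parameters that the original example does not have:

```python
from threading import Thread


def greet(name, times, out):
    # hypothetical task: record one line per iteration
    for index in range(times):
        out.append(f'{name} {index}')


def run_with_kwargs():
    out = []
    # positional values go in args, named values in kwargs
    thread = Thread(target=greet, args=('worker',), kwargs={'times': 2, 'out': out})
    thread.start()
    thread.join()
    return out


if __name__ == '__main__':
    print(run_with_kwargs())
    print(type(('worker')), type(('worker',)))  # str versus tuple
```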

4. Multiprocessing (relatively resource-intensive)

Import the Process class from the multiprocessing package:

from multiprocessing import Process

def func():
    for index in range(1, 50):
        print('child process', index)


if __name__ == '__main__':
    process = Process(target=func)
    process.start()
    for item in range(150, 200):
        print('main process', item)

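5. Thread pools and process pools

A thread pool is a group of threads that are reused for many tasks.
The threads are created once up front; the user submits tasks to the pool, and the pool takes care of scheduling them onto its threads.
Import the pool executors: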

from concurrent.futures import ThreadPoolExecutor,ProcessPoolExecutor

# import the thread pool and process pool executors
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor


def func(name):
    for index in range(1, 50):
        print(name, index)


if __name__ == '__main__':
    # create a pool of 10 threads, bound to the name threadpool
    with ThreadPoolExecutor(10) as threadpool:
        for item in range(50):
            threadpool.submit(func, name=f'thread {item}')
    # leaving the with block waits until every submitted task has finished
    print('====')
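`submit()` returns a `Future`, so a task's return value can be collected later with `result()`. A minimal sketch, assuming a hypothetical `square` task that stands in for real work such as fetching one page:

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    # hypothetical task standing in for real work
    return n * n


def run_pool(numbers):
    with ThreadPoolExecutor(4) as pool:
        # submit() returns a Future immediately; the task runs on a pool thread
        futures = [pool.submit(square, n) for n in numbers]
        # result() blocks until that particular task has finished
        return [future.result() for future in futures]


if __name__ == '__main__':
    print(run_pool([1, 2, 3]))
```

Because the futures are read in submission order, the results come back in the same order as the inputs, whichever task finishes first.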

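5. Coroutines

During calls such as time.sleep(), input(), or requests.get(), the thread is blocked.
Generally speaking, whenever a program is performing I/O, its thread is blocked.

Coroutines: when the program reaches an I/O operation, it can switch to another task instead of waiting, so a slow operation does not stall the whole program.

At the micro level, tasks are switched one at a time, and the switch is usually triggered by an I/O operation.
At the macro level, what we see is several tasks appearing to run at the same time.
Running multiple tasks this way is asynchronous.
Everything above happens within a single thread.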

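6. Multi-task asynchronous coroutines

Coroutines are written with the async keyword, much like asynchronous code in JavaScript.

# import the asyncio module
import asyncio


# a coroutine function is declared with async def
async def func():
    print('Hello, I am 艾奥雅')


if __name__ == '__main__':
    g = func()  # calling a coroutine function does not run it; it returns a coroutine object
    # print(g)  # <coroutine object func at 0x002431A8>
    asyncio.run(g)  # Hello, I am 艾奥雅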

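Example 2

# import the asyncio module
import asyncio
import time


# coroutine functions are declared with async def
async def func1():
    print('Hello, I am 艾奥雅')
    # time.sleep(3)  # a synchronous call blocks the thread, so no other task can run meanwhile
    await asyncio.sleep(3)  # asynchronous sleep: other tasks can run during these 3 seconds
    print('Hello, I am 艾奥雅')


async def func2():
    print('Hello, I am 王建国')
    await asyncio.sleep(2)
    print('Hello, I am 王建国')


async def func3():
    print('Hello, I am 李雪琴')
    await asyncio.sleep(4)
    print('Hello, I am 李雪琴')


async def func4():
    print('Hello, I am 呼兰')
    await asyncio.sleep(1)
    print('Hello, I am 呼兰')


async def func5():
    print('Hello, I am 徐志胜')
    await asyncio.sleep(2)
    print('Hello, I am 徐志胜')


async def run_all(coros):
    # asyncio.wait() rejects bare coroutine objects since Python 3.11, so wrap each one in a Task
    await asyncio.wait([asyncio.create_task(coro) for coro in coros])


if __name__ == '__main__':
    f1 = func1()  # returns a coroutine object
    f2 = func2()
    f3 = func3()
    f4 = func4()
    f5 = func5()
    task = [f1, f2, f3, f4, f5]  # collect the coroutines in a list
    t1 = time.time()
    # start all the coroutines at once
    asyncio.run(run_all(task))
    t2 = time.time()
    # with time.sleep() instead of asyncio.sleep(): 12.010096311569214, i.e. 12 s of waiting plus about 0.01 s of execution
    print(t2 - t1)  # 4.007: the wait is bounded by the longest sleep (4 s), plus about 0.007 s of execution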
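The timing above follows from the tasks overlapping their waits inside one thread. A small sketch that records the switching order and the thread id, assuming hypothetical `worker`, `demo`, and `run_demo` names and short delays chosen only for illustration:

```python
import asyncio
import threading


async def worker(name, delay, log):
    log.append((name, 'start', threading.get_ident()))
    await asyncio.sleep(delay)  # simulated I/O: control returns to the event loop here
    log.append((name, 'end', threading.get_ident()))


async def demo():
    log = []
    # 'a' starts first but waits longer, so 'b' finishes before it
    await asyncio.gather(worker('a', 0.2, log), worker('b', 0.1, log))
    return log


def run_demo():
    return asyncio.run(demo())


if __name__ == '__main__':
    for entry in run_demo():
        print(entry)
```

Every entry carries the same thread id, and `b` both starts and ends while `a` is still sleeping.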

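Example 3

# import the asyncio module
import asyncio
import time


# coroutine functions are declared with async def
async def func1():
    print('Hello, I am 艾奥雅')
    # time.sleep(3)  # a synchronous call blocks the thread, so no other task can run meanwhile
    await asyncio.sleep(3)  # asynchronous sleep: other tasks can run during these 3 seconds
    print('Hello, I am 艾奥雅')


async def func2():
    print('Hello, I am 王建国')
    await asyncio.sleep(2)
    print('Hello, I am 王建国')


async def func3():
    print('Hello, I am 李雪琴')
    await asyncio.sleep(4)
    print('Hello, I am 李雪琴')


async def func4():
    print('Hello, I am 呼兰')
    await asyncio.sleep(1)
    print('Hello, I am 呼兰')


async def func5():
    print('Hello, I am 徐志胜')
    await asyncio.sleep(2)
    print('Hello, I am 徐志胜')


async def main():
    # First way: await the coroutines one by one (they run sequentially)
    # f1 = func1()  # returns a coroutine object
    # await f1  # await goes in front of the coroutine object and suspends until it finishes
    # f2 = func2()
    # await f2
    # f3 = func3()
    # await f3
    # f4 = func4()
    # await f4
    # f5 = func5()
    # await f5

    # Second way: wrap the coroutines in tasks and wait for all of them
    # (asyncio.wait() rejects bare coroutine objects since Python 3.11)
    tasks = [
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3()),
        asyncio.create_task(func4()),
        asyncio.create_task(func5())
    ]
    await asyncio.wait(tasks)


if __name__ == '__main__':
    t1 = time.time()
    asyncio.run(main())
    t2 = time.time()
    print(t2 - t1)  # 4.0027008056640625 seconds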
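When the coroutines return values, `asyncio.gather()` is a convenient alternative to `asyncio.wait()`: it accepts coroutine objects directly and returns their results as a list. A minimal sketch, assuming a hypothetical `fetch` coroutine that only simulates I/O:

```python
import asyncio


async def fetch(name, delay):
    # hypothetical coroutine: pretend to do I/O, then return a value
    await asyncio.sleep(delay)
    return f'{name} done'


async def main():
    # gather() runs the coroutines concurrently and returns their results
    # in the order they were passed in, not the order they finished
    return await asyncio.gather(fetch('a', 0.2), fetch('b', 0.1))


def run():
    return asyncio.run(main())


if __name__ == '__main__':
    print(run())
```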

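7. Asynchronous HTTP requests with the aiohttp module

To send requests asynchronously, the counterpart of requests is aiohttp, which has to be installed first: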

pip install aiohttp

Example: downloading wallpapers from umei.cc

# import the asynchronous HTTP module
import os

import aiohttp
import asyncio

urls = [
    'http://kr.shanghai-jiuxin.com/file/2020/1031/e9d17d27dfd693d88b232899538144e8.jpg',
    'http://kr.shanghai-jiuxin.com/file/2020/0807/98ec5c7f5d7d0b2d750dd9b5ea834cfc.jpg',
    'http://kr.shanghai-jiuxin.com/file/2020/1031/26b7e178e987be6d914bf8d1af120890.jpg'
]


async def aiodownload(url):
    name = url.rsplit('/', 1)[1]  # split once from the right and take index 1: the file name
    print(name)
    # send the request; aiohttp.ClientSession() takes the place of the synchronous requests calls
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # response received: create the file
            with open('./images/' + name, mode='wb') as fp:
                # reading the body is asynchronous, so it has to be awaited
                fp.write(await resp.content.read())  # the awaited bytes correspond to resp.content in requests
    print(name, 'written to disk')


async def main():
    os.makedirs('./images', exist_ok=True)  # open() fails if the target directory does not exist
    tasks = []
    for url in urls:
        # asyncio.wait() rejects bare coroutine objects since Python 3.11, so create tasks
        tasks.append(asyncio.create_task(aiodownload(url)))
    await asyncio.wait(tasks)


if __name__ == '__main__':
    asyncio.run(main())

Result: the three images are downloaded.

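8. How video sites work

For each video a user uploads, the site:
transcodes it, usually down to lower bitrates
slices it into many small files
keeps an index file that records (1) the playback order of the segments and (2) the path where each segment is stored
stores that index as an M3U file (M3U8 when the text is UTF-8 encoded)

To scrape a video you therefore have to:

1. Find the m3u8 file
2. Use the m3u8 file to download the ts segment files
3. Merge the ts files, by whatever means, into a single MP4 file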
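Step 2 depends on reading the segment list out of the playlist. A minimal sketch, assuming a simple unencrypted media playlist in which every line that does not start with `#` is a segment URI; the `parse_m3u8` name and the example.com URLs are hypothetical:

```python
from urllib.parse import urljoin


def parse_m3u8(text, base_url):
    """Return the segment URLs listed in an m3u8 playlist, in playback order."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        # lines starting with '#' are tags or comments; the rest are segment URIs
        if not line or line.startswith('#'):
            continue
        segments.append(urljoin(base_url, line))  # resolve relative paths against the playlist URL
    return segments


if __name__ == '__main__':
    sample = '#EXTM3U\n#EXTINF:10.0,\nseg0.ts\n#EXTINF:10.0,\nseg1.ts\n#EXT-X-ENDLIST\n'
    for url in parse_m3u8(sample, 'https://example.com/video/index.m3u8'):
        print(url)
```

Real playlists may instead point to further playlists or declare an encryption key, which this sketch does not handle.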

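9. Scraping 91看剧 (simple version)

First fetch the source code of the video page
Extract the m3u8 URL from the source code
Download the m3u8 file
Read the m3u8 file and download the video segments
Merge the segments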
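The first and last steps can be sketched without any network access. This is only an illustration under assumptions: the m3u8 URL is taken to appear literally in the page source, the segments are taken to be already saved locally, and the `find_m3u8_url` and `merge_segments` names are hypothetical.

```python
import re
from pathlib import Path


def find_m3u8_url(page_source):
    # assumption: the URL appears literally in the page source and ends in .m3u8
    match = re.search(r'https?://[^\s"\']+\.m3u8', page_source)
    return match.group(0) if match else None


def merge_segments(segment_paths, output_path):
    # append the downloaded segments, in playlist order, into one file
    with open(output_path, 'wb') as out:
        for path in segment_paths:
            out.write(Path(path).read_bytes())


if __name__ == '__main__':
    page = '<script>var url = "https://example.com/v/index.m3u8";</script>'
    print(find_m3u8_url(page))
```

Plain concatenation yields one continuous ts stream; producing a real MP4 container usually takes an external tool such as ffmpeg.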
