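Concurrent Programming (3): The culprit behind slow Python code, the Global Interpreter Lock (GIL)
Concurrent Programming (4): How to use multithreading: rewriting a crawler with threads and comparing the results
Concurrent Programming (7): The handy thread pool ThreadPoolExecutor
Concurrent Programming (9): Speeding up programs with multiprocessing
Concurrent Programming (12): Launching any program on your computer with subprocess (playing music, extracting archives, automatic downloads, and so on)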
How to create a thread in Python
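# 1. Define the function the thread will run
#    (do_craw stands for the real work; it is not defined in this snippet)
def my_func(a, b):
    do_craw(a, b)

# 2. Create a thread
import threading
t = threading.Thread(target=my_func, args=(100, 200))

# 3. Start the thread
t.start()

# 4. Wait for it to finish
t.join()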
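The four steps above cannot be run as written because do_craw is left undefined. Here is a minimal runnable sketch of the same pattern. The `results` dictionary and the addition inside `my_func` are stand-ins invented for illustration, not part of the original crawler.

```python
import threading

# Hypothetical shared dictionary used only to make the effect visible.
results = {}

# 1. Define the function the thread will run.
def my_func(a, b):
    # Stand-in for do_craw(a, b): record the sum of the arguments.
    results[(a, b)] = a + b

# 2. Create a thread; args must be a tuple of positional arguments.
t = threading.Thread(target=my_func, args=(100, 200))

# 3. Start the thread.
t.start()

# 4. Wait for it to finish; after join() returns, the result is available.
t.join()
print(results)
```

Note that `args` takes a tuple, so a single argument has to be written as `(url,)` with a trailing comma.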
Crawling data the most basic way
# Imports
import requests

# Build the list of URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# Define the crawl function
def craw(url):
    '''
    input: url, the link of the page to crawl
    '''
    # Request the page
    r = requests.get(url)
    # Print the URL and the length of the page text
    print(url, len(r.text))

# Call the crawl function
craw(urls[10])
'''
output:
https://w.cnblogs.com/#p11 70100
'''
Comparing the running time of the single-threaded and multi-threaded versions
# -*- coding: utf-8 -*-
# @Time : 2021-03-20 14:47:02
# @Author : wlq
# @FileName: blog_spider.py
# @Email :rd_wlq@163.com
# Imports
import requests
import threading
import time

# Decorator that prints a function's running time in seconds
# (see https://blog.csdn.net/qq_42546127/article/details/115007989 if anything is unclear)
def get_time(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        print(end_time - start_time)
        # Pass the wrapped function's return value through
        return result
    return wrapper

# Build the list of URLs to crawl
urls = [
    f"https://w.cnblogs.com/#p{page}"
    for page in range(1, 51)
]

# The plain crawl function
def craw(url):
    r = requests.get(url)
    print(url, len(r.text))

# Crawl with a single thread
@get_time
def single_thread():
    print("single_thread begin")
    for url in urls:
        craw(url)
    print("single_thread end")

# Crawl with one thread per URL
@get_time
def multi_thread():
    print("multi_thread begin")
    threads = []
    for url in urls:
        threads.append(
            threading.Thread(target=craw, args=(url,))
        )
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print("multi_thread end")

# Run both versions
if __name__ == '__main__':
    single_thread()
    multi_thread()
'''
output:
single_thread begin
https://w.cnblogs.com/#p1 70100
https://w.cnblogs.com/#p2 70100
https://w.cnblogs.com/#p3 70100
https://w.cnblogs.com/#p4 70100
https://w.cnblogs.com/#p5 70100
https://w.cnblogs.com/#p6 70100
https://w.cnblogs.com/#p7 70100
https://w.cnblogs.com/#p8 70100
https://w.cnblogs.com/#p9 70100
https://w.cnblogs.com/#p10 70100
https://w.cnblogs.com/#p11 70100
https://w.cnblogs.com/#p12 70100
https://w.cnblogs.com/#p13 70100
https://w.cnblogs.com/#p14 70100
https://w.cnblogs.com/#p15 70100
https://w.cnblogs.com/#p16 70100
https://w.cnblogs.com/#p17 70100
https://w.cnblogs.com/#p18 70100
https://w.cnblogs.com/#p19 70100
https://w.cnblogs.com/#p20 70100
https://w.cnblogs.com/#p21 70100
https://w.cnblogs.com/#p22 70100
https://w.cnblogs.com/#p23 70100
https://w.cnblogs.com/#p24 70100
https://w.cnblogs.com/#p25 70100
https://w.cnblogs.com/#p26 70100
https://w.cnblogs.com/#p27 70100
https://w.cnblogs.com/#p28 70100
https://w.cnblogs.com/#p29 70100
https://w.cnblogs.com/#p30 70100
https://w.cnblogs.com/#p31 70100
https://w.cnblogs.com/#p32 70100
https://w.cnblogs.com/#p33 70100
https://w.cnblogs.com/#p34 70100
https://w.cnblogs.com/#p35 70100
https://w.cnblogs.com/#p36 70100
https://w.cnblogs.com/#p37 70100
https://w.cnblogs.com/#p38 70100
https://w.cnblogs.com/#p39 70100
https://w.cnblogs.com/#p40 70100
https://w.cnblogs.com/#p41 70100
https://w.cnblogs.com/#p42 70100
https://w.cnblogs.com/#p43 70100
https://w.cnblogs.com/#p44 70100
https://w.cnblogs.com/#p45 70100
https://w.cnblogs.com/#p46 70100
https://w.cnblogs.com/#p47 70100
https://w.cnblogs.com/#p48 70100
https://w.cnblogs.com/#p49 70100
https://w.cnblogs.com/#p50 70100
single_thread end
9.648617267608643
multi_thread begin
https://w.cnblogs.com/#p7 70100
https://w.cnblogs.com/#p2 70100
https://w.cnblogs.com/#p1https://w.cnblogs.com/#p4 70100 70100
https://w.cnblogs.com/#p6 70100
https://w.cnblogs.com/#p12https://w.cnblogs.com/#p10 70100
https://w.cnblogs.com/#p3https://w.cnblogs.com/#p5 7010070100
https://w.cnblogs.com/#p13 70100
70100https://w.cnblogs.com/#p8 70100
https://w.cnblogs.com/#p11https://w.cnblogs.com/#p9 70100
https://w.cnblogs.com/#p1670100
70100
https://w.cnblogs.com/#p19 70100
https://w.cnblogs.com/#p18https://w.cnblogs.com/#p15 70100
https://w.cnblogs.com/#p17 70100
70100
https://w.cnblogs.com/#p14 70100
https://w.cnblogs.com/#p25 70100
https://w.cnblogs.com/#p24 https://w.cnblogs.com/#p21https://w.cnblogs.com/#p27 70100
70100https://w.cnblogs.com/#p23https://w.cnblogs.com/#p20 70100
70100 70100
https://w.cnblogs.com/#p26 70100
https://w.cnblogs.com/#p30 70100
https://w.cnblogs.com/#p32https://w.cnblogs.com/#p35 70100
https://w.cnblogs.com/#p22https://w.cnblogs.com/#p31 70100
70100
https://w.cnblogs.com/#p33https://w.cnblogs.com/#p37 70100 70100https://w.cnblogs.com/#p29
70100
70100
https://w.cnblogs.com/#p28 70100
https://w.cnblogs.com/#p34https://w.cnblogs.com/#p39 70100
70100
https://w.cnblogs.com/#p36 70100
https://w.cnblogs.com/#p41 70100
https://w.cnblogs.com/#p38 70100
https://w.cnblogs.com/#p40 70100
https://w.cnblogs.com/#p42 70100
https://w.cnblogs.com/#p44https://w.cnblogs.com/#p45 https://w.cnblogs.com/#p46 70100
70100
70100
https://w.cnblogs.com/#p48 70100
https://w.cnblogs.com/#p47 70100
https://w.cnblogs.com/#p43 70100
https://w.cnblogs.com/#p50 70100
https://w.cnblogs.com/#p49 70100
multi_thread end
0.5046284198760986
'''
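The output above shows that the single-threaded version fetches the pages one at a time, in order, and takes 9.648617267608643 seconds. The multi-threaded version finishes the requests in no fixed order (several threads also print at the same moment, which is why some lines are fused together) and takes 0.5046284198760986 seconds.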
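The fused lines in the multi-threaded output are only a display problem: the threads share one standard output, and a likely cause is that print emits the text and the trailing newline as separate writes. One common way to keep each line whole is to guard print with a lock. The sketch below assumes that approach. The names `print_lock`, `report`, and `printed` are made up for illustration, no network request is made, and 70100 is just the length copied from the sample output.

```python
import threading

# Hypothetical lock shared by all threads; only one thread may print at a time.
print_lock = threading.Lock()
# Hypothetical list that records each printed line for inspection.
printed = []

def report(url, length):
    line = f"{url} {length}"
    # Holding the lock keeps another thread from writing in the middle of this line.
    with print_lock:
        print(line)
        printed.append(line)

threads = [
    threading.Thread(target=report, args=(f"https://w.cnblogs.com/#p{page}", 70100))
    for page in range(1, 6)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

The order of the lines can still differ from run to run. The lock only guarantees that each line is printed in one piece.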
Dividing the two running times gives the speedup:
$$multiple = \frac{9.648617267608643}{0.5046284198760986} = 19.120241523411753 \approx 19$$
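A speedup of this size is possible because the crawler spends almost all of its time waiting on the network, and a thread that is blocked waiting does not stop the others from running. The sketch below reproduces the effect without any network access. It is a simulation under stated assumptions: `fake_fetch` is a hypothetical stand-in for requests.get, and time.sleep models the wait for a response.

```python
import threading
import time

def fake_fetch(delay):
    # Hypothetical stand-in for requests.get: sleeping models network wait.
    time.sleep(delay)

def run_sequential(n, delay):
    # Run n simulated requests one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        fake_fetch(delay)
    return time.perf_counter() - start

def run_threaded(n, delay):
    # Run n simulated requests in n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(delay,)) for _ in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

sequential_time = run_sequential(10, 0.05)
threaded_time = run_threaded(10, 0.05)
print(f"speedup: {sequential_time / threaded_time:.1f}x")
```

The exact ratio depends on the machine and, for a real crawler, on the server and the network, so the figure of about 19 above should be read as one measurement rather than a fixed property of multithreading.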