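When learning concurrent programming, the first step is to distinguish concurrency from parallelism.
Concurrency: corresponds to multithreading and coroutines in Python.
At any given moment only one thread or coroutine is running (in CPython this is a consequence of the global interpreter lock); concurrency is achieved by switching between them continuously.
Suited to I/O-heavy work.
Parallelism: corresponds to multiprocessing in Python.
Processes run simultaneously on multiple CPUs (if the machine has them).
Suited to CPU-bound work.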
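To make "switching" concrete, here is a minimal sketch in which two threads take turns whenever one of them waits. The wait is simulated with time.sleep, which is an assumption standing in for real I/O, and the names worker and interleaved_log are hypothetical.

```python
import threading
import time


def worker(name, log, steps=3):
    for step in range(steps):
        log.append((name, step))
        # While this thread waits on "I/O", the other thread gets to run
        time.sleep(0.05)


def interleaved_log():
    """Run two workers and return the order in which their steps were logged."""
    log = []
    threads = [threading.Thread(target=worker, args=(name, log)) for name in ("A", "B")]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


if __name__ == "__main__":
    for name, step in interleaved_log():
        print(name, step)
```

The log mixes steps from A and B instead of listing all of A first, which is the interleaving that makes threads useful for I/O-heavy work.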
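Threads
Python threads wrap the operating system's native threads. The operating system knows about every thread, so it switches between them at appropriate moments. This spares us from scheduling them by hand, but precisely because we cannot control the switching, race conditions arise easily.
Why introduce threads at all?
Single-threaded vs. multithreaded
A series of examples will show the advantage of multithreading. The goal is simple: fetch the content of a few web pages.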
# single.py
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))

def download_all(sites):
    for site in sites:
        download_one(site)

def main():
    sites = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
        'https://www.taobao.com',
        'https://www.douban.com',
        'https://www.jianshu.com',
        'https://account.geekbang.org',
        'https://leetcode-cn.com/',
        'https://www.github.com',
        'https://open.163.com/',
        'https://www.rainymood.com/',
        'https://www.bilibili.com/',
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()
# Output
Read 2443 from https://www.baidu.com
Read 170 from https://www.zhihu.com
Read 143154 from https://www.taobao.com
Read 94136 from https://www.douban.com
Read 583 from https://www.jianshu.com
Read 1524 from https://account.geekbang.org
Read 28899 from https://leetcode-cn.com/
Read 87112 from https://www.github.com
Read 161816 from https://open.163.com/
Read 12244 from https://www.rainymood.com/
Read 70945 from https://www.bilibili.com/
Download 11 sites in 11.405088154002442 seconds
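The single-threaded program is simple and clear, and it prints in order. Each step has to wait for its I/O (sending the request through the network card and receiving the response data) to finish before the next one can start, so the result is slow. If a handful of pages is already this slow, a much larger number would be unbearable.
Next, the multithreaded version: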
import concurrent.futures
import requests
import time

def download_one(site):
    resp = requests.get(site)
    # Measure the response body; a Response object itself has no len()
    print('Read {} from {}'.format(len(resp.content), site))

def download_many(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        executor.map(download_one, sites)

def main():
    sites = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
        'https://www.taobao.com',
        'https://www.douban.com',
        'https://www.jianshu.com',
        'https://account.geekbang.org',
        'https://leetcode-cn.com/',
        'https://www.github.com',
        'https://open.163.com/',
        'https://www.rainymood.com/',
        'https://www.bilibili.com/',
    ]
    start = time.time()
    download_many(sites)
    end = time.time()
    print('Download {} sites in {} seconds'.format(len(sites), end - start))

if __name__ == "__main__":
    main()
# Output
Read 2443 from https://www.baidu.com
Read 1524 from https://account.geekbang.org
Read 583 from https://www.jianshu.com
Read 94193 from https://www.douban.com
Read 143154 from https://www.taobao.com
Read 28899 from https://leetcode-cn.com/
Read 161816 from https://open.163.com/
Read 70354 from https://www.bilibili.com/
Read 12244 from https://www.rainymood.com/
Read 87102 from https://www.github.com
Read 170 from https://www.zhihu.com
Download 11 sites in 5.321507043998281 seconds
The multithreaded version is clearly faster: about 5.3 seconds instead of 11.4.
Let's analyze the multithreading code:
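with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(download_one, sites)
ThreadPoolExecutor creates a thread pool, and the max_workers argument sets how many threads it holds. More threads is not always better; the right number depends on the workload, so it is worth running a few experiments.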
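One way to run such an experiment, as a rough sketch: time the same batch of work with several pool sizes. The requests are simulated with time.sleep here so the numbers do not depend on the network; fake_request and time_with_workers are hypothetical names, and a real measurement would call the actual download function instead.

```python
import concurrent.futures
import time


def fake_request(delay):
    # Hypothetical stand-in for requests.get: the thread only waits
    time.sleep(delay)


def time_with_workers(max_workers, n_tasks=8, delay=0.05):
    """Return the seconds needed to run n_tasks simulated requests."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Consume the iterator so every task is known to have finished
        list(executor.map(fake_request, [delay] * n_tasks))
    return time.perf_counter() - start


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        print("max_workers={}: {:.2f}s".format(workers, time_with_workers(workers)))
```

With simulated waits the time keeps dropping until the pool is as large as the batch; with real downloads the gains usually flatten earlier because bandwidth and the remote servers become the limit.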
executor is a thread pool object:
type(executor)
<class 'concurrent.futures.thread.ThreadPoolExecutor'>
The map method works much like the built-in map function: for every site in sites, it calls download_one concurrently.
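The similarity to the built-in map extends to the results: Executor.map yields return values in the order of the inputs, even when the calls finish in a different order. A small sketch, assuming a hypothetical slow_square function whose larger inputs finish sooner:

```python
import concurrent.futures
import time


def slow_square(n):
    # Larger inputs sleep less, so calls complete in reverse input order
    time.sleep((5 - n) * 0.02)
    return n * n


def squares_builtin(numbers):
    return list(map(slow_square, numbers))


def squares_threaded(numbers):
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        # Results come back in input order, not completion order
        return list(executor.map(slow_square, numbers))


if __name__ == "__main__":
    print(squares_builtin([1, 2, 3, 4]))   # [1, 4, 9, 16]
    print(squares_threaded([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

Anything printed inside the worker function can still appear in any order; only the values handed back by the iterator are ordered.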
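One point that is easy to overlook but important: requests.get is thread-safe, so it remains safe in a multithreaded environment and does not cause a race condition.
A race condition does arise, by contrast, when several threads read and then update the same shared variable without a lock.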
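A minimal sketch of such a race, with a fix next to it. The Counter class and the run helper are hypothetical, and the time.sleep between the read and the write is an assumption added only to make the unlucky thread switch happen reliably; in real code the switch can occur at that point without any sleep.

```python
import concurrent.futures
import threading
import time


class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self):
        current = self.value      # read
        time.sleep(0.01)          # a thread switch happens at the worst moment
        self.value = current + 1  # write back a possibly stale value

    def safe_increment(self):
        with self.lock:           # one thread at a time may read-modify-write
            current = self.value
            time.sleep(0.01)
            self.value = current + 1


def run(method_name, n_threads=5):
    """Call the chosen increment method once from each of n_threads threads."""
    counter = Counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
        for _ in range(n_threads):
            executor.submit(getattr(counter, method_name))
    return counter.value


if __name__ == "__main__":
    print("without lock:", run("unsafe_increment"))  # usually fewer than 5
    print("with lock:   ", run("safe_increment"))    # 5
```

Without the lock, several threads read the same old value and overwrite each other's updates, so increments are lost. The lock serializes the read-modify-write step at the cost of making that step sequential.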
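Second, notice that compared with the single-threaded version, the output order is scrambled. Thread switching is scheduled by the operating system, not by the programmer, so the lines are very likely to appear out of order.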
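Where is the future?
The multithreaded code above implicitly used futures. Huh? What is a future?
Futures are a key component of both the concurrent.futures module and the asyncio module. Each module has a Future class, and the two serve the same purpose: an instance of either class represents a deferred computation that may or may not have completed yet, similar to a Promise object in JavaScript.
Keep one thing in mind: normally you should not create futures yourself; they are meant to be instantiated by the concurrency framework (concurrent.futures or asyncio). The reason is simple: a future represents something that will eventually happen, and the only way to be sure it will happen is for its execution to have been scheduled.
So how is execution scheduled; that is, how does a Future instance come into being?
A Future instance is created only when the work to be scheduled is handed to a subclass of concurrent.futures.Executor.
Now let's look at an example that rewrites the previous version, replacing the Executor.map abstraction with futures obtained explicitly from the executor. Don't worry if it is not fully clear yet; a more detailed explanation follows.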
import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))

def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        to_do = []
        for site in sites:
            future = executor.submit(download_one, site)  # creates a future instance
            to_do.append(future)
        for future in concurrent.futures.as_completed(to_do):
            future.result()

def main():
    sites = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
        'https://www.taobao.com',
        'https://www.douban.com',
        'https://www.jianshu.com',
        'https://account.geekbang.org',
        'https://leetcode-cn.com/',
        'https://www.github.com',
        'https://open.163.com/',
        'https://www.rainymood.com/',
        'https://www.bilibili.com/',
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()
# Output
Read 2443 from https://www.baidu.com
Read 1524 from https://account.geekbang.org
Read 583 from https://www.jianshu.com
Read 94299 from https://www.douban.com
Read 143086 from https://www.taobao.com
Read 161856 from https://open.163.com/
Read 28899 from https://leetcode-cn.com/
Read 70206 from https://www.bilibili.com/
Read 12244 from https://www.rainymood.com/
Read 87112 from https://www.github.com
Read 170 from https://www.zhihu.com
Download 11 sites in 5.302511596004479 seconds
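For the code above, start with the overall idea: create the futures, let the executor schedule them, then collect the results.
Let's briefly discuss how Future instances are created and how they are used.
Executor.submit: takes a callable, schedules it for execution, and returns a future.
Executor.map: first creates a series of futures, then returns an iterator whose __next__ method calls the result method of each future in turn.
First of all, a future has a state: pending, running, or finished (a future can also be cancelled).
Future.done(): does not block; it immediately returns a boolean saying whether the callable linked to this future has finished executing.
Future.add_done_callback: takes a callable, which is invoked once this future has finished.
Future.result: if called after the future has finished, it behaves the same way in both modules' Future classes: it returns the callable's result, or re-raises the exception the callable raised. If the future has not finished, then for a concurrent.futures.Future instance, calling future.result() blocks the caller's thread until a result is available. For that reason it accepts an optional timeout argument; if no result arrives within the given time, it raises TimeoutError.
concurrent.futures.as_completed: takes a list of futures and returns an iterator that yields each future as it finishes.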
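The methods listed above can be tried out in a few lines. This is a sketch under stated assumptions: slow_double, fail, and demo are hypothetical names, and the delays are arbitrary values chosen so that the timeout expires before the task finishes.

```python
import concurrent.futures
import time


def slow_double(n, delay):
    time.sleep(delay)
    return n * 2


def fail():
    raise ValueError("boom")


def demo():
    """Exercise done, add_done_callback and result; return what was observed."""
    events = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        future = executor.submit(slow_double, 21, 0.3)
        events.append(future.done())      # non-blocking; False while still running
        future.add_done_callback(lambda f: events.append("callback"))
        try:
            future.result(timeout=0.05)   # gives up after 0.05 seconds
        except concurrent.futures.TimeoutError:
            events.append("timeout")
        events.append(future.result())    # blocks until the value is ready
        events.append(future.done())      # True once the result exists

        failing = executor.submit(fail)
        try:
            failing.result()              # re-raises the callable's exception
        except ValueError as exc:
            events.append(str(exc))
    return events


if __name__ == "__main__":
    print(demo())
```

The callback runs in the worker thread right after the result is set, so its position in the list relative to the value 42 is not guaranteed.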
Here is the same example again, this time in more detail:
import concurrent.futures
import requests
import time

def download_one(url):
    resp = requests.get(url)
    print('Read {} from {}'.format(len(resp.content), url))

def download_all(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        to_do = []
        for site in sites:
            future = executor.submit(download_one, site)
            print('Scheduled for {} {}'.format(site, future))  # inspect the future's state
            to_do.append(future)
        for future in concurrent.futures.as_completed(to_do):
            print(future.result())  # the linked callable returns nothing, so this prints None
            print(future, '\n')  # inspect the future's state again

def main():
    sites = [
        'https://www.baidu.com',
        'https://www.zhihu.com',
        'https://www.taobao.com',
        'https://www.douban.com',
        'https://www.jianshu.com',
        'https://account.geekbang.org',
        'https://leetcode-cn.com/',
        'https://www.github.com',
        'https://open.163.com/',
        'https://www.rainymood.com/',
        'https://www.bilibili.com/',
    ]
    start_time = time.perf_counter()
    download_all(sites)
    end_time = time.perf_counter()
    print('Download {} sites in {} seconds'.format(len(sites), end_time - start_time))

if __name__ == '__main__':
    main()
# Output
Scheduled for https://www.baidu.com <Future at 0x7efd4e692470 state=running>
Scheduled for https://www.zhihu.com <Future at 0x7efd4d8faf98 state=running>
Scheduled for https://www.taobao.com <Future at 0x7efd4d88d9b0 state=running>
Scheduled for https://www.douban.com <Future at 0x7efd4d88df28 state=running>
Scheduled for https://www.jianshu.com <Future at 0x7efd4d897cf8 state=running>
Scheduled for https://account.geekbang.org <Future at 0x7efd4d8ba5c0 state=pending>
Scheduled for https://leetcode-cn.com/ <Future at 0x7efd4d8ba668 state=pending>
Scheduled for https://www.github.com <Future at 0x7efd4d8ba710 state=pending>
Scheduled for https://open.163.com/ <Future at 0x7efd4d8ba7f0 state=pending>
Scheduled for https://www.rainymood.com/ <Future at 0x7efd4d8ba8d0 state=pending>
Scheduled for https://www.bilibili.com/ <Future at 0x7efd4d8ba9b0 state=pending>
Read 2443 from https://www.baidu.com
None
<Future at 0x7efd4e692470 state=finished returned NoneType>
Read 1524 from https://account.geekbang.org
None
<Future at 0x7efd4d8ba5c0 state=finished returned NoneType>
Read 583 from https://www.jianshu.com
None
<Future at 0x7efd4d897cf8 state=finished returned NoneType>
Read 94299 from https://www.douban.com
None
<Future at 0x7efd4d88df28 state=finished returned NoneType>
Read 28899 from https://leetcode-cn.com/
Read 161856 from https://open.163.com/
None
<Future at 0x7efd4d8ba668 state=finished returned NoneType>
None
<Future at 0x7efd4d8ba7f0 state=finished returned NoneType>
Read 143154 from https://www.taobao.com
None
<Future at 0x7efd4d88d9b0 state=finished returned NoneType>
Read 71284 from https://www.bilibili.com/
None
<Future at 0x7efd4d8ba9b0 state=finished returned NoneType>
Read 12244 from https://www.rainymood.com/
None
<Future at 0x7efd4d8ba8d0 state=finished returned NoneType>
Read 87112 from https://www.github.com
None
<Future at 0x7efd4d8ba710 state=finished returned NoneType>
Read 170 from https://www.zhihu.com
None
<Future at 0x7efd4d8faf98 state=finished returned NoneType>
Download 11 sites in 5.306397142994683 seconds
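Analysis:
First, Executor.submit creates a future, and we can print its state right away. The first five futures are running because max_workers is set to 5; the remaining six stay pending until a worker becomes free.
concurrent.futures.as_completed takes the list of futures and returns an iterator that yields each future once it has finished. Calling future.result() at that point therefore never blocks and always yields the return value of the callable linked to that future. Because download_one returns nothing, that value is None, and the corresponding future's state is finished.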
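If the worker function returns a value instead of printing it, future.result() becomes more useful. A common pattern is to keep a dictionary from each future to the input it was created for. The sketch below uses made-up URLs and page sizes (FAKE_PAGES) and a hypothetical fake_download that sleeps instead of touching the network.

```python
import concurrent.futures
import time

# Hypothetical page sizes standing in for real HTTP responses
FAKE_PAGES = {
    "https://example.com/a": 120,
    "https://example.com/b": 45,
    "https://example.com/c": 300,
}


def fake_download(url):
    time.sleep(0.05)        # simulated network latency
    return FAKE_PAGES[url]  # return the size instead of printing it


def download_all(urls):
    sizes = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        # Remember which URL each future belongs to
        future_to_url = {executor.submit(fake_download, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            sizes[url] = future.result()  # already finished, so this does not block
    return sizes


if __name__ == "__main__":
    for url, size in download_all(list(FAKE_PAGES)).items():
        print("Read {} from {}".format(size, url))
```

Collecting results in the calling thread also keeps all printing in one place, which avoids interleaved output from the workers.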
Next, let's look at how as_completed behaves in practice:
import concurrent.futures
import time

def sleeping(index, n):
    print('{:2d} starts sleep'.format(index))
    time.sleep(n)
    print('{:2d} ends sleep'.format(index))

def main():
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        to_dos = []
        for i, n in enumerate([5, 4, 4, 2, 1]):
            future = executor.submit(sleeping, i, n)
            to_dos.append(future)
        for future in concurrent.futures.as_completed(to_dos):
            future.result()

if __name__ == "__main__":
    start = time.time()
    main()
    end = time.time()
    print('Total Using {}s'.format(end - start))
# Output
0 starts sleep
1 starts sleep
2 starts sleep
1 ends sleep
3 starts sleep
2 ends sleep
4 starts sleep
0 ends sleep
4 ends sleep
3 ends sleep
Total Using 6.006955623626709s
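The total of about 6 seconds can be checked by hand with a simplified model of the pool, assuming that tasks are picked up in submission order and that whichever worker becomes idle first takes the next one. The simulate_pool function below is a hypothetical helper that only does arithmetic; it runs no threads.

```python
import heapq


def simulate_pool(durations, max_workers):
    """Return (start, end) times per task for a pool that serves tasks in order."""
    free_at = [0] * max_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    schedule = []
    for duration in durations:
        start = heapq.heappop(free_at)  # the earliest idle worker takes the next task
        end = start + duration
        heapq.heappush(free_at, end)
        schedule.append((start, end))
    return schedule


if __name__ == "__main__":
    schedule = simulate_pool([5, 4, 4, 2, 1], max_workers=3)
    for index, (start, end) in enumerate(schedule):
        print("task {}: starts at {}s, ends at {}s".format(index, start, end))
    print("total: {}s".format(max(end for _, end in schedule)))  # total: 6s
```

Tasks 0, 1, and 2 start at once; tasks 3 and 4 have to wait until tasks 1 and 2 free their workers at the 4-second mark, and task 3 then finishes last at 6 seconds, which agrees with the output above.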