How to choose between multithreading (Thread), multiprocessing (Process), and coroutines (Coroutine)
1. What are CPU-bound and I/O-bound computations?
CPU-bound: I/O finishes quickly and the CPU spends most of its time on heavy computation. Examples: compression/decompression, encryption/decryption, regular-expression matching.
I/O-bound: the CPU spends most of its time waiting for I/O to complete. Examples: file processing, web crawlers, database read/write programs.
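A minimal sketch contrasting the two workload types (the hash function and sleep duration are illustrative stand-ins, not from the original notes; sleep plays the role of a network or disk round trip):

```python
import hashlib
import time

def cpu_bound_task(data: bytes) -> str:
    # CPU-bound: the time goes into the hashing computation, not waiting.
    return hashlib.sha256(data).hexdigest()

def io_bound_task(seconds: float) -> float:
    # I/O-bound stand-in: the CPU is idle while we wait.
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start

digest = cpu_bound_task(b"x" * 1_000_000)
waited = io_bound_task(0.1)
```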
2. How do multiprocessing, multithreading, and coroutines compare?
Relationship: one process can start multiple threads, and one thread can run multiple coroutines.
Multiprocessing Process (multiprocessing)
Pros: can exploit multiple CPU cores for true parallel computation.
Cons: heaviest resource usage; fewer can be started than threads.
Suited for: CPU-bound computation.
Multithreading Thread (threading)
Pros: lighter than processes; uses fewer resources.
Cons: versus processes: threads can only run concurrently, not in parallel across CPUs (because of the GIL);
versus coroutines: limited in how many can be started, consume memory, and incur thread-switch overhead.
Suited for: I/O-bound computation with a moderate number of simultaneous tasks.
Coroutine (asyncio)
Pros: lowest memory overhead; the largest number can be started.
Cons: limited library support (aiohttp vs requests); more complex code.
Suited for: I/O-bound computation that needs a very large number of concurrent tasks and has async library support.
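Since the notes mention asyncio but show no coroutine code, here is a minimal sketch; asyncio.sleep stands in for real I/O (a real crawler would use aiohttp), and the URLs are placeholders:

```python
import asyncio

async def fetch(url: str) -> str:
    # In real code this would be an aiohttp request; asyncio.sleep
    # simulates the I/O wait without blocking the event loop.
    await asyncio.sleep(0.1)
    return f"fetched {url}"

async def main() -> list:
    urls = [f"https://example.com/p{i}" for i in range(10)]
    # gather runs all 10 coroutines concurrently on one thread,
    # so the total time is ~0.1s rather than ~1s.
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(main())
```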
3. How do you pick the right technique for a task?
Rule of thumb from the comparison above: CPU-bound → multiprocessing; I/O-bound with huge task counts and async library support → asyncio; otherwise → threading.
Why is Python slow? Two main reasons: it is a dynamically typed, interpreted language, and the GIL blocks multi-core parallelism within a single process.
What is the GIL? The Global Interpreter Lock: a mutex in CPython that lets only one thread execute Python bytecode at a time.
Why does the GIL exist? It makes CPython's reference-counting memory management thread-safe without fine-grained locking.
How to work around the GIL's limits? Use multiprocessing for CPU-bound work (each process has its own interpreter and GIL), or C extension libraries that release the GIL; for I/O-bound work, threads still help because the GIL is released during blocking I/O.
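A minimal sketch of the standard GIL workaround for CPU-bound work: hand the computation to worker processes, each with its own interpreter and GIL (the sum function and inputs are illustrative):

```python
from multiprocessing import Pool

def heavy_sum(n: int) -> int:
    # CPU-bound work: in threads this would be serialized by the GIL,
    # but separate processes can each run on their own core.
    return sum(range(n))

if __name__ == "__main__":
    # Four worker processes split the inputs between them.
    with Pool(processes=4) as pool:
        results = pool.map(heavy_sum, [100, 1_000, 10_000])
    print(results)
```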
How to create threads in Python:
# blog_spider.py
import requests

urls = [
    f'https://www.cnblogs.com/#p{i}' for i in range(1, 50)
]

def craw(url):
    res = requests.get(url)
    print(url, len(res.text))
# Compares single-threaded vs multi-threaded crawling.
import threading
import time

import blog_spider

def test_time(func):
    # Simple timing decorator.
    def inner_func():
        start = time.time()
        func()
        end = time.time()
        print(f'elapsed: {end - start}')
    return inner_func

@test_time
def single_thread():
    print('single thread start')
    for url in blog_spider.urls:
        blog_spider.craw(url)
    print('single thread end')

@test_time
def multi_thread():
    print('multi thread start')
    threads = []
    for url in blog_spider.urls:
        threads.append(threading.Thread(target=blog_spider.craw, args=(url,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print('multi thread end')

if __name__ == '__main__':
    single_thread()
    multi_thread()
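The manual start/join bookkeeping above can also be handled by concurrent.futures.ThreadPoolExecutor from the standard library. This self-contained sketch uses time.sleep in place of the real requests.get call so it runs without network access:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def craw(url: str) -> str:
    # Stand-in for requests.get(url): sleep simulates network latency.
    time.sleep(0.05)
    return f"done {url}"

urls = [f"https://www.cnblogs.com/#p{i}" for i in range(1, 11)]

# The pool caps concurrency at 5 threads, handles start/join internally,
# and pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(craw, urls))
```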
A producer-consumer crawler in Python:
# blog_spider.py (extended with a parse step)
import requests
from bs4 import BeautifulSoup

urls = [
    f'https://www.cnblogs.com/#p{i}' for i in range(1, 50)
]

def craw(url):
    res = requests.get(url)
    return res.text

def parse(html):
    # Post titles are <a class="post-item-title"> elements.
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', class_='post-item-title')
    return [(link['href'], link.get_text()) for link in links]

if __name__ == '__main__':
    for result in parse(craw(urls[3])):
        print(result)
# Producer-consumer crawler: 3 craw threads feed 3 parse threads via queues.
import queue
import random
import threading
import time

import blog_spider

def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    # Producer: take a URL, download it, hand the HTML to the parsers.
    while True:
        url = url_queue.get()
        html = blog_spider.craw(url)
        html_queue.put(html)
        print(threading.current_thread().name, f'craw {url}',
              'url_queue.size=', url_queue.qsize())
        time.sleep(random.randint(1, 2))

def do_parse(html_queue: queue.Queue, fout):
    # Consumer: take HTML, extract (href, title) pairs, write them out.
    while True:
        html = html_queue.get()
        results = blog_spider.parse(html)
        for result in results:
            fout.write(str(result) + '\n')
        print(threading.current_thread().name, 'results.size', len(results),
              'html_queue.size=', html_queue.qsize())
        time.sleep(random.randint(1, 2))

if __name__ == '__main__':
    url_queue = queue.Queue()
    html_queue = queue.Queue()
    for url in blog_spider.urls:
        url_queue.put(url)
    for ids in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f'craw{ids}')
        t.start()
    fout = open('02.data.txt', 'w')
    for ids in range(3):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f'parse{ids}')
        t.start()
    # Note: these worker threads loop forever, so the process must be
    # stopped manually.
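Because the worker threads above loop forever, the program never exits cleanly and fout is never closed. A common fix, sketched here on a simplified pipeline without the network calls, is a sentinel value (None) that tells each stage to stop:

```python
import queue
import threading

def producer(url_queue: queue.Queue, html_queue: queue.Queue):
    while True:
        url = url_queue.get()
        if url is None:           # sentinel: no more work
            html_queue.put(None)  # pass the shutdown signal downstream
            break
        html_queue.put(f"<html>{url}</html>")  # stand-in for craw(url)

def consumer(html_queue: queue.Queue, results: list):
    while True:
        html = html_queue.get()
        if html is None:
            break
        results.append(html)

url_queue, html_queue = queue.Queue(), queue.Queue()
for i in range(5):
    url_queue.put(f"page{i}")
url_queue.put(None)  # one sentinel per producer thread

results = []
p = threading.Thread(target=producer, args=(url_queue, html_queue))
c = threading.Thread(target=consumer, args=(html_queue, results))
p.start()
c.start()
p.join()
c.join()  # both threads exit on their own; no manual kill needed
```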
Thread safety in Python and how to ensure it:
# The lock makes the check-then-withdraw sequence atomic; without it,
# both threads could pass the balance check before either deducts.
import threading
import time

lock = threading.Lock()

class Account:
    def __init__(self, amount):
        self.amount = amount

def draw(account, amount):
    with lock:
        if account.amount >= amount:
            # The sleep widens the race window: without the lock, both
            # threads would reach this point and the account would be overdrawn.
            time.sleep(0.1)
            print(threading.current_thread().name, 'start withdrawing')
            account.amount = account.amount - amount
            print(threading.current_thread().name, 'withdrawal succeeded')
            print(threading.current_thread().name, 'balance left:', account.amount)
        else:
            print(threading.current_thread().name, 'withdrawal failed: insufficient balance!')

if __name__ == '__main__':
    account = Account(1000)
    thread1 = threading.Thread(name='thread1', target=draw, args=(account, 800))
    thread2 = threading.Thread(name='thread2', target=draw, args=(account, 800))
    thread1.start()
    thread2.start()
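The same pattern on a shared counter makes the guarantee checkable: the increment is a read-modify-write, so without the lock, concurrent threads can lose updates; with the lock held, the final total is deterministic (a minimal sketch, with illustrative names and counts):

```python
import threading

class Counter:
    def __init__(self):
        self.amount = 0
        self.lock = threading.Lock()

    def deposit(self, times: int):
        for _ in range(times):
            # With the lock held, the read-modify-write is atomic;
            # remove the lock and updates can be lost under contention.
            with self.lock:
                self.amount += 1

counter = Counter()
threads = [threading.Thread(target=counter.deposit, args=(100_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 4 threads x 100_000 increments each -> exactly 400_000 with the lock.
```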
Source: a bilibili video; the link will be added later.