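How a thread pool works
Threads have to be scheduled by the CPU.
- The thread life cycle
Creating a thread requires the system to allocate resources, and terminating a thread requires the system to reclaim them; both take time. A thread pool reuses threads, which removes the overhead of repeatedly creating and terminating them.
A thread pool holds threads that were created ahead of time; they stay idle until a task arrives, run it, and then go back to waiting for the next one.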
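The idea above can be sketched with a queue and a few long-lived threads. This is a minimal illustration, not how ThreadPoolExecutor is implemented; run_in_pool, num_workers and the worker names are hypothetical, and a task that raises would kill its worker in this sketch.

```python
import queue
import threading


def run_in_pool(tasks, num_workers=3):
    """Run zero-argument callables on a fixed set of worker threads.

    Hypothetical helper for illustration; returns (results, worker_names).
    """
    task_queue = queue.Queue()
    results = [None] * len(tasks)
    worker_names = set()
    lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()   # block until a task arrives
            if item is None:          # sentinel: no more work for this worker
                break
            index, task = item
            value = task()
            with lock:
                results[index] = value
                worker_names.add(threading.current_thread().name)

    # The workers are created once, before any task is queued.
    workers = [
        threading.Thread(target=worker, name=f"worker-{i}")
        for i in range(num_workers)
    ]
    for w in workers:
        w.start()

    for index, task in enumerate(tasks):
        task_queue.put((index, task))
    for _ in workers:
        task_queue.put(None)          # one sentinel per worker
    for w in workers:
        w.join()
    return results, worker_names


results, worker_names = run_in_pool([lambda n=n: n * n for n in range(10)])
print(results)
print(len(worker_names), "thread(s) handled", len(results), "tasks")
```

Ten tasks are handled by at most three threads, so no thread is created or destroyed per task.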
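Benefits of using a thread pool
- It saves the time spent creating and terminating large numbers of threads
- The number of threads in the pool is bounded, which keeps the system from creating too many threads, becoming overloaded, and slowing the machine down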
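The bounded thread count can be observed directly with ThreadPoolExecutor. In this small sketch the task function, the sleep duration, and max_workers=3 are arbitrary choices for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def task(n):
    time.sleep(0.01)  # stand-in for real I/O such as a network request
    return threading.current_thread().name


# 20 tasks are submitted, but the pool never runs more than 3 threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    names = list(pool.map(task, range(20)))

distinct = set(names)
print(f"{len(names)} tasks ran on {len(distinct)} distinct thread(s)")
```

However many tasks are queued, the number of distinct thread names stays at or below max_workers.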
ThreadPoolExecutor usage
- Two ways to use it
Usage 1: map
from concurrent.futures import ThreadPoolExecutor, as_completed

# craw is the task function, urls is the full list of inputs
with ThreadPoolExecutor() as pool:
    results = pool.map(craw, urls)
    for result in results:
        print(result)
Usage 2: submit
# htmls is an iterable of (url, html) pairs
with ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(parse, html)
        futures[future] = url
    # Alternative: read the results in submission order
    # for future, url in futures.items():
    #     print(url, future.result())
    # as_completed yields each future as soon as it finishes
    for future in as_completed(futures):
        url = futures[future]
        print(url, future.result())
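The difference between map and submit: map needs all the tasks prepared up front and hands them to the pool in one call, and its results come back in the same order as the inputs. submit hands tasks to the pool one at a time and returns a Future for each; combined with as_completed, the results can be read in the order the tasks finish.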
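The ordering difference is easy to see with tasks that take different amounts of time. In this sketch slow_echo and the delay values are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_echo(delay):
    time.sleep(delay)
    return delay


delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map: all inputs are handed over at once; results follow input order.
    mapped = list(pool.map(slow_echo, delays))

    # submit: one task per call, each call returns a Future immediately.
    futures = [pool.submit(slow_echo, d) for d in delays]
    # as_completed yields futures as they finish, not as they were submitted.
    completed = [f.result() for f in as_completed(futures)]

print(mapped)     # same order as delays
print(completed)  # completion order, usually shortest delay first
```

mapped always matches the order of delays, while the order of completed depends on timing.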
Rewriting the crawler with a thread pool
# -*- coding: utf-8 -*-
"""
Created on Tue May 4 15:24:11 2021
@author: hellohaojun
"""
import threading
import requests
import time
import concurrent.futures
from bs4 import BeautifulSoup
urls = [
    f"https://www.cnblogs.com/#p{page}"
    for page in range(1, 50 + 1)
]


def craw(url):
    # Download one page and return its HTML text
    r = requests.get(url)
    return r.text


def parse(html):
    # Collect (href, title) for every matching link on the page
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post_item-title")
    return [(link["href"], link.get_text()) for link in links]


# craw
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(craw, urls)
    htmls = list(zip(urls, htmls))
    for url, html in htmls:
        print(url, len(html))
print("craw over")

# parse
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(parse, html)
        futures[future] = url
    # for future, url in futures.items():
    #     print(url, future.result())
    for future in concurrent.futures.as_completed(futures):
        url = futures[future]
        print(url, future.result())
Output (excerpt):
https://www.cnblogs.com/#p44 69419
https://www.cnblogs.com/#p45 69419
https://www.cnblogs.com/#p46 69419
https://www.cnblogs.com/#p47 69419
https://www.cnblogs.com/#p48 69419
https://www.cnblogs.com/#p49 69419
https://www.cnblogs.com/#p50 69419
craw over
https://www.cnblogs.com/#p14 []
https://www.cnblogs.com/#p2 []
https://www.cnblogs.com/#p16 []
https://www.cnblogs.com/#p8 []
https://www.cnblogs.com/#p6 []
https://www.cnblogs.com/#p9 []
https://www.cnblogs.com/#p1 []
https://www.cnblogs.com/#p12 []
https://www.cnblogs.com/#p13 []
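Two things stand out in this output: every page has the same length, and every parse result is an empty list. A likely reason for the first is that the part of a URL after # is a fragment, which an HTTP client does not send to the server, so all 50 requests would ask for the same address. The sketch below only inspects the URLs and makes no network request.

```python
from urllib.parse import urldefrag

urls = [f"https://www.cnblogs.com/#p{page}" for page in range(1, 50 + 1)]

# urldefrag splits a URL into the part before '#' and the fragment after it.
without_fragment = {urldefrag(url).url for url in urls}
print(without_fragment)
```

The set contains a single address, which is consistent with the identical lengths above. As for the empty lists, parse returns [] whenever no a element in the downloaded HTML has the class post_item-title, so that class name is worth checking against the page's actual markup.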