Coroutines are an excellent fit for IO-bound work, but CPU-bound work is not their strength; to make full use of the CPU, combine them with multiple processes (and threads). Python's standard library ships its own coroutine support (asyncio), but the gevent library used below is a third-party package and must be installed manually:
pip install gevent
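Since the passage recommends pairing coroutines with extra processes for CPU-bound work, here is a minimal, self-contained sketch of the process side; `cpu_heavy` is a hypothetical stand-in for whatever CPU-bound step (parsing, number crunching) the real crawler would run:

```python
from multiprocessing import Pool

def cpu_heavy(n):
    # hypothetical stand-in for CPU-bound work
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # each worker process runs on its own core, sidestepping the GIL,
    # which is what coroutines alone cannot do for CPU-bound tasks
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [100_000] * 4)
    print(len(results))
```

In a real crawler you would keep the gevent coroutines for the network IO and hand only the heavy post-processing to such a process pool.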
- The code is as follows:
from gevent import monkey
monkey.patch_all()  # patch the standard library so blocking IO yields to gevent
import gevent
from gevent.queue import Queue
import requests

urls = ["URL1","URL2","URL3","URL4","URL5","URL6","URL7","URL8","URL9","URL10"]  # replace with real URLs

work = Queue()  # task queue; items come out in FIFO order
for data_url in urls:
    work.put_nowait(data_url)

def get_each_url_all_page(data_url):
    """Main crawling routine for one URL."""
    result = requests.get(data_url)
    html = result.text
    # remainder of the scraping logic omitted

def crawler():
    # each coroutine keeps pulling URLs until the queue is drained
    while not work.empty():
        data_url = work.get_nowait()
        get_each_url_all_page(data_url)

def main():
    tasks_list = []
    for x in range(10):  # number of coroutines to spawn; 10 here as an example
        task = gevent.spawn(crawler)
        tasks_list.append(task)
    gevent.joinall(tasks_list)

if __name__ == '__main__':
    main()
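For comparison, the same pull-from-a-queue worker pattern can be written with only the standard library's threading and queue modules, which is a useful fallback when gevent is not installed. In this sketch the fetch step is replaced by simply recording the URL, so it runs without any network access:

```python
import queue
import threading

work = queue.Queue()
for data_url in ["URL%d" % i for i in range(1, 11)]:
    work.put(data_url)

processed = []
lock = threading.Lock()

def crawler():
    # drain the queue until empty, mirroring the gevent version
    while True:
        try:
            data_url = work.get_nowait()
        except queue.Empty:
            break
        # a real crawler would fetch data_url here; we just record it
        with lock:
            processed.append(data_url)

threads = [threading.Thread(target=crawler) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(processed))
```

Threads carry more overhead per worker than gevent's coroutines, but the queue-plus-workers structure is identical, so the crawler logic ports over unchanged.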