Section 12 -- Crawler 04: [Processes; Threads; Coroutines]

Article link: https://blog.csdn.net/weixin_42375099/article/details/97245758

Contents

1. Threads
2. Processes
3. Coroutines with Gevent

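Reach for multithreading first; later on you can combine multiprocessing with multithreading (threads running inside each process). Of the three, coroutines are generally the fastest for I/O-bound crawling.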

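1. Threads

1. Introduction

Motivation: the crawlers we wrote earlier were all single-threaded. That is not enough: if one request stalls, the whole program waits on it. To avoid this, we can use multiple threads or multiple processes.

I don't particularly recommend the approach in this part, but it is covered here for completeness. Read on if you are interested; if you are short on time, skip ahead.
How to use it: a crawler uses multiple threads to handle network requests. Worker threads take URLs from the URL queue and save each response in a second queue; other threads read from that queue and write the data to a file.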
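
The flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the author's crawler: `fake_fetch` is a hypothetical stand-in for a real HTTP request, the example.com URLs are placeholders, and results are collected in a list instead of being written to a file.

```python
import threading
from queue import Queue

def fake_fetch(url):
    # Hypothetical stand-in for a real network request
    return 'content of ' + url

def crawl_worker(url_queue, out_queue):
    # Take URLs until a None sentinel arrives, pushing each result to out_queue
    while True:
        url = url_queue.get()
        if url is None:
            break
        out_queue.put(fake_fetch(url))

def writer(out_queue, sink):
    # Drain the result queue until a None sentinel arrives
    while True:
        item = out_queue.get()
        if item is None:
            break
        sink.append(item)

def run_pipeline(urls, n_workers=2):
    url_queue, out_queue, sink = Queue(), Queue(), []
    workers = [threading.Thread(target=crawl_worker, args=(url_queue, out_queue))
               for _ in range(n_workers)]
    w = threading.Thread(target=writer, args=(out_queue, sink))
    for t in workers:
        t.start()
    w.start()
    for url in urls:
        url_queue.put(url)
    for _ in workers:
        url_queue.put(None)   # one sentinel per worker
    for t in workers:
        t.join()
    out_queue.put(None)       # every result is queued; stop the writer
    w.join()
    return sink

if __name__ == '__main__':
    print(run_pipeline(['http://example.com/page/%d' % i for i in range(1, 4)]))
```

The order of the results depends on thread scheduling, so callers should not rely on it.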

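2. Main components

2.1. URL queue and result queue

Put the URLs to be crawled into a queue, using the standard library's queue.Queue here. The results of visiting each URL are saved in a result queue.

Initialize the URL queue and the result queue: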

from queue import Queue
urls_queue = Queue()
out_queue = Queue()

2.2. Class wrapper – implementing multithreading

Use several threads that keep taking URLs from the URL queue and processing them:

import threading

class ThreadCrawl(threading.Thread):
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue
	
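    # run() must be overridden; it is the thread's entry point
    def run(self):
        while True:
            item = self.queue.get()  # blocks while the queue is empty
            # ... process item here and put the result on self.out_queue ...
            self.queue.task_done()  # tell the queue this item is finished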

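If the queue is empty, get() blocks the thread until an item becomes available. After processing an item, the thread must call task_done() to tell the queue that the item has been handled.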
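
The blocking get() and the task_done() notification can be tried out in a small sketch. The names `worker` and `process_all` are hypothetical, and doubling each number is just a placeholder for real processing.

```python
import threading
from queue import Queue

def worker(q, results):
    while True:
        item = q.get()            # blocks while the queue is empty
        results.append(item * 2)  # placeholder "processing" step
        q.task_done()             # mark this item as finished

def process_all(items, n_threads=3):
    q, results = Queue(), []
    for _ in range(n_threads):
        # daemon=True lets the program exit even though the workers loop forever
        threading.Thread(target=worker, args=(q, results), daemon=True).start()
    for item in items:
        q.put(item)
    q.join()  # returns once task_done() has been called for every item
    return results

if __name__ == '__main__':
    print(sorted(process_all([1, 2, 3, 4])))
```

Because the workers never leave their loop, q.join() (rather than joining the threads) is what tells the main thread that all the work is done.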

2.3. Function wrapper – implementing multithreading

from queue import Queue
from threading import Thread

def func(args):
    pass

if __name__ == '__main__':
    info_html = Queue()
    t1 = Thread(target=func, args=(info_html,))
    t1.start()
    t1.join()

2.4. Thread pool

# A simple pool: the queue is pre-filled with one slot per allowed thread
import threading
import time
import queue

class Threadingpool():
    def __init__(self,max_num = 10):
        self.queue = queue.Queue(max_num)
        for i in range(max_num):
            self.queue.put(threading.Thread)

    def getthreading(self):
        return self.queue.get()

    def addthreading(self):
        self.queue.put(threading.Thread)

def func(p,i):
    time.sleep(1)
    print(i)
    p.addthreading()

if __name__ == "__main__":
    p = Threadingpool()
    for i in range(20):
        thread = p.getthreading()
        t = thread(target = func, args = (p,i))
        t.start()
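
For comparison, the standard library's concurrent.futures.ThreadPoolExecutor provides a ready-made thread pool. The sketch below is a swapped-in alternative to the hand-written pool above, not the author's approach; `task` and `run_tasks` are hypothetical names, and the sleep stands in for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(i):
    time.sleep(0.1)  # simulate a blocking operation
    return i * i

def run_tasks(n, max_workers=10):
    # At most max_workers threads run at once; map() returns results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, range(n)))

if __name__ == '__main__':
    print(run_tasks(20))
```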

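3. Common methods in the queue module

Python's queue module provides synchronized, thread-safe queue classes: the FIFO (first in, first out) Queue, the LIFO (last in, first out) LifoQueue, and the priority queue PriorityQueue. All of them implement the necessary locking and can be used directly from multiple threads, so queues can be used to synchronize threads:

Queue.qsize() returns the approximate size of the queue
Queue.empty() returns True if the queue is empty, otherwise False
Queue.full() returns True if the queue is full, otherwise False
Queue.full() is measured against maxsize
Queue.get([block[, timeout]]) removes and returns an item; timeout is how long to wait
Queue.get_nowait() is equivalent to Queue.get(False)
Queue.put(item[, block[, timeout]]) adds an item; timeout is how long to wait
Queue.put_nowait(item) is equivalent to Queue.put(item, False)
Queue.task_done() is called after a piece of work is finished, to signal the queue that the item has been fully processed
Queue.join() blocks until every item put on the queue has been retrieved and marked with task_done(); only then does the caller continue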
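
The methods listed above can be exercised in a short single-threaded sketch. The function name `queue_demo` and the sample values are illustrative only.

```python
from queue import Queue, LifoQueue, PriorityQueue, Empty, Full

def queue_demo():
    facts = {}
    q = Queue(maxsize=2)
    q.put('a')
    q.put('b')
    facts['size'] = q.qsize()
    facts['full'] = q.full()
    try:
        q.put_nowait('c')          # the queue is full, so this raises Full
    except Full:
        facts['put_nowait_raised'] = True
    facts['fifo'] = [q.get(), q.get()]
    try:
        q.get_nowait()             # the queue is empty, so this raises Empty
    except Empty:
        facts['get_nowait_raised'] = True
    facts['empty'] = q.empty()

    lifo = LifoQueue()
    for x in (1, 2, 3):
        lifo.put(x)
    facts['lifo'] = [lifo.get() for _ in range(3)]   # last in, first out

    pq = PriorityQueue()
    for x in (3, 1, 2):
        pq.put(x)
    facts['priority'] = [pq.get() for _ in range(3)]  # smallest first
    return facts

if __name__ == '__main__':
    print(queue_demo())
```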

4. Examples

Crawl pages 1-10 of Qiushibaike (糗事百科)

Approach 01: using a function

from fake_useragent import UserAgent
import requests
from lxml import etree
from threading import Thread
from queue import Queue, Empty
from time import time

def get_data(url_queue):
    while True:
        try:
            # get_nowait() avoids blocking forever if another thread takes the last URL
            url = url_queue.get_nowait()
        except Empty:
            break
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        e = etree.HTML(resp.text)

        # infos_span = e.xpath('//div[@class="content"]/span[1]')
        # for span in infos_span:
        #    info = span.xpath('string(.)')
        infos = [span.xpath('string(.)') for span in e.xpath('//div[@class="content"]/span[1]')]
        with open('duanzi.txt', 'a', encoding='utf-8') as f:
            for info in infos:
                f.write(info + '\n')

if __name__ == '__main__':
    start = time()
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    url_queue = Queue()
    for i in range(1, 11):
        url = base_url.format(i)
        url_queue.put(url)

    thread_list = []
    for i in range(2):
        t1 = Thread(target=get_data, args=(url_queue,))
        thread_list.append(t1)
        t1.start()
        # t1.join()  # joining here would make the threads run one after another!
    for t in thread_list:
        t.join()

    end = time()
    print('{}:{}----{}'.format(end,start,end-start))
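
The commented-out t1.join() inside the loop above is worth a closer look. The sketch below times both placements of join(); `work` is a hypothetical stand-in that only sleeps, so the numbers illustrate the effect rather than real crawling performance.

```python
import time
from threading import Thread

def work(delay):
    time.sleep(delay)  # stand-in for a slow network request

def run_join_inside_loop(n, delay):
    start = time.perf_counter()
    for _ in range(n):
        t = Thread(target=work, args=(delay,))
        t.start()
        t.join()       # waits here, so the next thread starts only after this one ends
    return time.perf_counter() - start

def run_join_after_loop(n, delay):
    start = time.perf_counter()
    threads = [Thread(target=work, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()       # all threads are already running while we wait
    return time.perf_counter() - start

if __name__ == '__main__':
    print('join inside loop:', run_join_inside_loop(4, 0.1))
    print('join after loop: ', run_join_after_loop(4, 0.1))
```

With join() inside the loop the total time grows with the number of threads; with join() after the loop it stays close to the duration of a single task.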

Approach 02: using a class

from fake_useragent import UserAgent
import requests
from lxml import etree
from threading import Thread
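from queue import Queue, Empty
import time

class Spider1(Thread):
    # Threads share the module-level globals, so url_queue is visible in run()
    # and the constructor below is not strictly needed:
    # def __init__(self, url_queue):
    #     Thread.__init__(self)
    #     self.url_queue = url_queue

    def run(self):
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread takes the last URL
                url = url_queue.get_nowait()
            except Empty:
                break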
            print('get:{}'.format(url))
            headers = {'User-Agent': UserAgent().chrome}
            resp = requests.get(url, headers=headers)
            e = etree.HTML(resp.text)

            # infos_span = e.xpath('//div[@class="content"]/span[1]')
            # for span in infos_span:
            #    info = span.xpath('string(.)')
            infos = [span.xpath('string(.)') for span in e.xpath('//div[@class="content"]/span[1]')]
            with open('duanzi.txt', 'a', encoding='utf-8') as f:
                for info in infos:
                    f.write(info + '\n')

if __name__ == '__main__':
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    url_queue = Queue()
    for i in range(1, 11):
        url = base_url.format(i)
        url_queue.put(url)

    thread_obj = []
    for i in range(2):
        t = Spider1()
        t.start()
        thread_obj.append(t)

    for t in thread_obj:
        t.join()

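2. Processes

multiprocessing is Python's package for managing multiple processes; its Process class is used much like threading.Thread.

1. The multiprocessing module

The module sidesteps the GIL by using subprocesses instead of threads, which lets programmers make full use of the CPUs on a given machine. In multiprocessing, a process is spawned by creating a Process object and then calling its start() method.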

from multiprocessing import Process

def func(name):
    print('hello', name)

if __name__ == "__main__":
    p = Process(target=func,args=('sxt',))
    p.start()
    p.join()  # wait for the process to finish

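2. The Manager class: sharing data

When designing concurrent programs it is best to avoid shared data as far as possible, especially with multiple processes. If you really do need to share data, the manager object returned by Manager() supports the types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value and Array.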

from multiprocessing import Process,Manager,Lock

def print_num(info_queue,l,lo):
    with lo:
        for n in l:
            info_queue.put(n)

def updata_num(info_queue,lo):
    with lo:
        while not info_queue.empty():
            print(info_queue.get())

if __name__ == '__main__':
    manager = Manager()
    into_html = manager.Queue()
    lock = Lock()
    a = [1, 2, 3, 4, 5]
    b = [11, 12, 13, 14, 15]

    p1 = Process(target=print_num, args=(into_html, a, lock))
    p2 = Process(target=print_num, args=(into_html, b, lock))
    p3 = Process(target=updata_num, args=(into_html, lock))
    p1.start()
    p2.start()
    # Wait for both producers first; otherwise the consumer may find the queue still empty and exit early
    p1.join()
    p2.join()
    p3.start()
    p3.join()
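
Besides the Queue used above, a manager can also share dict and list objects. The following is a minimal sketch under that assumption; `record` and `collect` are hypothetical names, and squaring the key is just placeholder work.

```python
from multiprocessing import Process, Manager

def record(shared_dict, shared_list, key):
    # Each child process writes through the manager proxies
    shared_dict[key] = key * key
    shared_list.append(key)

def collect(keys):
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_list = manager.list()
        procs = [Process(target=record, args=(shared_dict, shared_list, k))
                 for k in keys]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy into plain objects before the manager shuts down
        return dict(shared_dict), sorted(shared_list)

if __name__ == '__main__':
    print(collect([1, 2, 3]))
```

The list is sorted before returning because the order in which the child processes append to it is not deterministic.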

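3. Process pool

A process pool maintains a set of worker processes internally. When a task is submitted, a process is taken from the pool; if no process in the pool is free, the program waits until one becomes available.
A process pool offers two methods for submitting a task:
- apply: synchronous execution (serial)
- apply_async: asynchronous execution (parallel)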

from multiprocessing import Pool,Manager

def print_num(info_queue,l):
    for n in l:
        info_queue.put(n)

def updata_num(info_queue):
    while not info_queue.empty():
        print(info_queue.get())

if __name__ == '__main__':
    html_queue =Manager().Queue()
    a=[11,12,13,14,15]
    b=[1,2,3,4,5]
    pool = Pool(3)  # the pool has 3 worker processes
    pool.apply_async(func=print_num,args=(html_queue,a))
    pool.apply_async(func=print_num,args=(html_queue,b))
    pool.apply_async(func=updata_num,args=(html_queue,))
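    # Note: unlike with threads, a pool must be closed before it is joined
    pool.close()  # join() must come after close(), and join() is required; otherwise the main process does not wait for the child processes to finish
    pool.join()  # wait for the pool's processes to finish; without this line the program exits immediately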
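
The difference between apply and apply_async can be seen in a small sketch. The helper names `slow_square`, `with_apply`, and `with_apply_async` are hypothetical, and the sleep only simulates a slow task.

```python
import time
from multiprocessing import Pool

def slow_square(x):
    time.sleep(0.2)  # simulate a slow task
    return x * x

def with_apply(pool, values):
    # apply() blocks until each call returns, so the tasks run one after another
    return [pool.apply(slow_square, (v,)) for v in values]

def with_apply_async(pool, values):
    # apply_async() returns an AsyncResult immediately; get() collects the value later
    pending = [pool.apply_async(slow_square, (v,)) for v in values]
    return [r.get() for r in pending]

if __name__ == '__main__':
    with Pool(3) as pool:
        for fn in (with_apply, with_apply_async):
            start = time.perf_counter()
            result = fn(pool, [1, 2, 3])
            print(fn.__name__, result, round(time.perf_counter() - start, 2))
```

Both helpers return the same values; only the elapsed time differs, since apply_async lets the three tasks overlap across the pool's processes.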

4. Example

from multiprocessing import Process
from fake_useragent import UserAgent
import requests
from lxml import etree
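from queue import Empty  # raised by the Manager queue once it is drained
from multiprocessing import Manager
from time import time

def get_data(url_queue):
    while True:
        try:
            # get_nowait() avoids blocking forever if another process takes the last URL
            url = url_queue.get_nowait()
        except Empty:
            break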
        print('get:{}'.format(url))
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        e = etree.HTML(resp.text)
        infos = [span.xpath('string(.)') for span in e.xpath('//div[@class="content"]/span[1]')]
        with open('duanzi.txt', 'a', encoding='utf-8') as f:
            for info in infos:
                f.write(info + '\n')

if __name__ == '__main__':
    start =time()
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    url_queue = Manager().Queue()
    for i in range(1, 11):
        url = base_url.format(i)
        url_queue.put(url)

    process_list = []
    for i in range(2):
        p1 = Process(target=get_data, args=(url_queue,))
        process_list.append(p1)
        p1.start()

    for p in process_list:
        p.join()
    end = time()
    print('{}:{}----{}'.format(end,start,end-start))

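3. Coroutines with Gevent

Python offers basic support for coroutines through yield, but that support is incomplete. The third-party gevent library provides fairly complete coroutine support for Python.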
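
As a rough illustration of yield-based coroutines, the sketch below uses plain generators and a tiny round-robin scheduler. It is not gevent and performs no real I/O: each `yield` merely marks a point where a task hands control back, and the names `task`, `run_round_robin`, and `demo` are hypothetical.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append((name, i))
        yield  # pretend an I/O wait starts here and hand control back

def run_round_robin(tasks):
    # Tiny scheduler: resume each generator in turn until all are exhausted
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)
            ready.append(current)
        except StopIteration:
            pass

def demo():
    log = []
    run_round_robin([task('A', 2, log), task('B', 2, log)])
    return log

if __name__ == '__main__':
    print(demo())  # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

The two tasks interleave because each one gives up control at its yield, and all the switching has to be written by hand.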

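gevent is a third-party library that implements coroutines on top of greenlet. Its basic idea is:

When a greenlet hits an I/O operation, such as a network access, gevent automatically switches to another greenlet, then switches back at a suitable moment once the I/O has finished. Because I/O is slow and often leaves a program waiting, having gevent switch coroutines automatically ensures that some greenlet is always running instead of waiting on I/O.

Because the switch happens automatically during I/O, gevent has to modify parts of Python's standard library. This is done at startup through monkey patching: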

from gevent import monkey; monkey.patch_socket()
import gevent

def f(n):
    for i in range(n):
        print(gevent.getcurrent(), i)

g1 = gevent.spawn(f, 5)
g2 = gevent.spawn(f, 5)
g3 = gevent.spawn(f, 5)
g1.join()
g2.join()
g3.join()

To apply coroutines to a crawler, do the following:

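from gevent import monkey
monkey.patch_all()  # patch the standard library before importing requests
import gevent
import requests

def f(url):
    print('GET: %s' % url)
    resp = requests.get(url)
    data = resp.text
    print('%d bytes received from %s.' % (len(data), url))

# Start the greenlets together and wait for all of them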
gevent.joinall([
        gevent.spawn(f, 'https://www.python.org/'),
        gevent.spawn(f, 'https://www.163.com/'),
        gevent.spawn(f, 'https://www.baidu.com/'),
])

Example:

import gevent
from gevent import monkey

monkey.patch_all()
import requests
from fake_useragent import UserAgent
from lxml import etree
from queue import Queue
from time import time

def get_data(url_queue):
    while not url_queue.empty():
        url = url_queue.get()
        print('get:{}'.format(url))
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        e = etree.HTML(resp.text)
        infos = [span.xpath('string(.)') for span in e.xpath('//div[@class="content"]/span[1]')]
        with open('duanzi.txt', 'a', encoding='utf-8') as f:
            for info in infos:
                f.write(info + '\n')
        print('success:{}'.format(url))

if __name__ == '__main__':
    start = time()
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    url_queue = Queue()
    for i in range(1, 11):
        url = base_url.format(i)
        url_queue.put(url)

    gevent.joinall([
        gevent.spawn(get_data, url_queue),
        gevent.spawn(get_data, url_queue)

    ])
    end = time()
    print('{}:{}----{}'.format(end, start, end - start))