Multithreading in Web Crawlers

Article link: https://blog.csdn.net/qq_41386300/article/details/83858736

1. Introduction

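The crawlers we wrote earlier were all single-threaded: once a request stalls somewhere, the whole program waits forever. To avoid this, we can use multiple threads or multiple processes.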

Personally I don't recommend this approach, but here is a brief introduction anyway.

2. Usage

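The crawler uses multiple threads to handle network requests: request threads take URLs from a URL queue and store the response for each URL in a second queue. Other threads then read the data from that second queue and write it to a file.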
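The flow described above can be sketched without any network access. This is only a minimal illustration, not the crawler itself: `fake_fetch`, `fetch_worker`, and `run_pipeline` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the sketch runs offline.

```python
from queue import Empty, Queue
from threading import Thread


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text, so the sketch runs offline
    return "<html>" + url + "</html>"


def fetch_worker(url_queue, html_queue):
    # Take URLs until the queue is drained, storing each result in html_queue
    while True:
        try:
            url = url_queue.get_nowait()
        except Empty:
            return
        html_queue.put(fake_fetch(url))


def run_pipeline(urls, n_threads=3):
    url_queue, html_queue = Queue(), Queue()
    for url in urls:
        url_queue.put(url)
    workers = [Thread(target=fetch_worker, args=(url_queue, html_queue))
               for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Drain the result queue; a real crawler would write these to a file
    results = []
    while not html_queue.empty():
        results.append(html_queue.get())
    return results


pages = run_pipeline(["page1", "page2", "page3", "page4"])
print(len(pages))  # 4
```

The order of `pages` depends on thread scheduling, so only the count is stable from run to run.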

3. Main components

3.1 URL queue and result queue

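The URLs to crawl are placed in a queue, using Queue from the standard library's queue module. The result of requesting each URL is stored in a second queue, the result queue.

Initialize the URL queue and the result queue: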

from queue import Queue
url_queue=Queue()
html_queue=Queue()
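As a quick illustration of how such a queue behaves, here is a small sketch. The example.com URLs are made up for the demonstration and are not part of the crawler.

```python
from queue import Queue

url_queue = Queue()
for page in range(1, 4):
    # Hypothetical URLs, used only for illustration
    url_queue.put("https://example.com/page/{}/".format(page))

print(url_queue.qsize())   # 3
first = url_queue.get()    # items come out in the order they were put in
print(first)               # https://example.com/page/1/
print(url_queue.empty())   # False, two URLs remain
```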

3.2 Request threads

Several threads keep taking URLs from the URL queue and processing them:

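from queue import Empty
from random import choice
from threading import Thread

import requests

class ThreadInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        # Header values must not repeat the "User-Agent:" field name
        user_agents = [
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        ]
        headers = {"User-Agent": choice(user_agents)}
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread took the last URL
                url = self.url_queue.get_nowait()
            except Empty:
                break
            try:
                # The timeout keeps one stalled request from hanging the thread
                response = requests.get(url, headers=headers, timeout=10)
                if response.status_code == 200:
                    self.html_queue.put(response.text)
            except requests.RequestException as exc:
                print("request failed:", url, exc)
            finally:
                self.url_queue.task_done()  # mark this URL as processed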

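If the queue is empty, a blocking get() call suspends the thread until the queue is no longer empty. After a thread finishes processing an item taken from the queue, it should notify the queue that the item is done by calling task_done().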
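To make the blocking and notification behavior concrete, here is a small sketch using only the standard library. The names `worker`, `process_all`, and the `STOP` sentinel are hypothetical, and doubling a number stands in for real processing.

```python
from queue import Queue
from threading import Thread

STOP = None  # hypothetical sentinel telling a worker to exit


def worker(task_queue, results):
    while True:
        item = task_queue.get()        # blocks while the queue is empty
        try:
            if item is STOP:
                return
            results.append(item * 2)   # stand-in for real processing
        finally:
            task_queue.task_done()     # tell the queue this item is finished


def process_all(items, n_threads=3):
    task_queue, results = Queue(), []
    threads = [Thread(target=worker, args=(task_queue, results))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for item in items:
        task_queue.put(item)
    task_queue.join()                  # returns once every item has been marked done
    for _ in threads:
        task_queue.put(STOP)           # one sentinel per worker
    for t in threads:
        t.join()
    return sorted(results)


print(process_all([1, 2, 3]))  # [2, 4, 6]
```

Because the workers block in get(), they need an explicit signal to stop. Here that is one sentinel per thread, which is one common choice among several.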

3.3 Processing threads

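These threads process the data in the result queue and save it to a file. If more than one thread writes to the file, the writes must be protected with a lock.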

import codecs
import threading

lock = threading.Lock()
f = codecs.open('xiaohua.txt', 'w', 'utf-8')

When a thread needs to write to the file, it can do so like this:

with lock:
    f.write(something)
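A self-contained sketch of the same idea follows, writing to a temporary file so it can be run anywhere. The helper names `write_lines` and `demo` and the thread labels are made up for the illustration.

```python
import os
import tempfile
from threading import Lock, Thread

lock = Lock()


def write_lines(f, name, count):
    for i in range(count):
        with lock:  # only one thread writes at a time
            f.write("{} line {}\n".format(name, i))


def demo(path, n_threads=3, count=100):
    with open(path, "w", encoding="utf-8") as f:
        threads = [Thread(target=write_lines, args=(f, "t{}".format(k), count))
                   for k in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    # Read the file back to check that every line arrived intact
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


path = os.path.join(tempfile.mkdtemp(), "demo.txt")
lines = demo(path)
print(len(lines))  # 300
```

The lines from different threads interleave in an unpredictable order, but each line is written whole because the lock is held for the entire write call.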

4. A small example

Here is an example that crawls jokes from Qiushibaike:

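from queue import Empty, Queue
from random import choice
from threading import Lock, Thread

from lxml import etree
import requests

# Shared lock so that only one parser thread writes to the file at a time
file_lock = Lock()

# Crawler thread: downloads pages and stores their HTML
class CrawlInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        # Header values must not repeat the "User-Agent:" field name
        user_agents = [
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
            "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        ]
        headers = {"User-Agent": choice(user_agents)}
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread took the last URL
                url = self.url_queue.get_nowait()
            except Empty:
                break  # the URL queue is drained
            try:
                response = requests.get(url, headers=headers, timeout=10)
                if response.status_code == 200:
                    self.html_queue.put(response.text)
            except requests.RequestException as exc:
                print("request failed:", url, exc)
            finally:
                self.url_queue.task_done()  # mark this URL as processed

# Parser thread: extracts the jokes and appends them to a file
class ParseInfo(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self):
        while True:
            try:
                html = self.html_queue.get_nowait()
            except Empty:
                break  # the result queue is drained
            e = etree.HTML(html)
            span_list = e.xpath('//div[@class="content"]/span[1]')
            # Hold the lock while writing so output from different threads does not interleave
            with file_lock, open('xiaohua.txt', 'a', encoding='utf-8') as f:
                for span in span_list:
                    info = span.xpath('string(.)')
                    f.write(info + '\n')
            self.html_queue.task_done()
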
if __name__ == '__main__':
    url_queue = Queue()  # holds the URLs to crawl
    base_url = "https://www.qiushibaike.com/text/page/{}/"
    html_queue = Queue()  # holds the raw HTML of each fetched page, not yet parsed
    for i in range(1, 14):
        new_url = base_url.format(i)
        print(new_url)
        url_queue.put(new_url)

    # Keep references to the crawler threads so the main thread can wait for all three
    crawl_list = []
    for i in range(3):  # create 3 crawler threads
        crawl = CrawlInfo(url_queue, html_queue)
        crawl_list.append(crawl)
        crawl.start()

    # Wait until every page has been fetched before parsing starts
    for crawl in crawl_list:
        crawl.join()

    parse_list = []
    for i in range(3):  # create 3 parser threads
        parse = ParseInfo(html_queue)
        parse_list.append(parse)
        parse.start()
    for parse in parse_list:
        parse.join()