With nothing to do over the holidays, I decided to scrape some novels for fun. Single-threaded crawling turned out to be far too slow, while multi-threaded crawling scrambled the data. After a few days of digging around online, I finally managed to keep the chapters written in order while crawling with multiple threads, so I'm sharing the result here; corrections are welcome. (Note: if there are any copyright or other issues, please contact me and I will take this down.)
Multithreading greatly improves crawling throughput, but because the threads run concurrently and their execution order is unpredictable, the fetched data would normally come back in a scrambled order. The trick is to put the chapter URLs into a queue first and have the threads take them out one at a time (a queue is first-in, first-out, so the URLs are dequeued in chapter order), and to keep a counter that tells each thread whether the chapter it has just fetched is the next one due to be written: if it is, the thread writes it out; if not, the thread waits.
Note: the main reason for using multiple threads is that sending the request and waiting for the server's response is what takes time, so that is where the speed-up comes from. Each thread therefore fetches first and only then checks whether it is its turn to write, blocking if it is not (fetching takes far longer than writing to the buffer).
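Before walking through the real crawler, here is a minimal, self-contained sketch of just this ordering pattern. It is not part of the original code: the network fetch is simulated with a random sleep and the URLs are placeholders. Each worker pulls an (index, url) pair from a FIFO queue, "fetches" it, then busy-waits until the shared counter says it is that chapter's turn to be written.

import random
import time
from queue import Empty, Queue
from threading import Thread

next_to_write = 0  # index of the chapter that may be written next

def worker(q, results):
    global next_to_write
    while True:
        try:
            index, url = q.get_nowait()
        except Empty:
            return  # no chapters left to fetch
        time.sleep(random.uniform(0.1, 0.3))  # stand-in for the slow network fetch
        data = 'content of chapter %d (%s)' % (index, url)
        while next_to_write < index:
            pass  # busy-wait until it is this chapter's turn
        results.append(data)  # "write" in chapter order
        next_to_write += 1

if __name__ == '__main__':
    q = Queue()
    for i in range(8):
        q.put((i, 'https://example.com/chapter/%d' % i))  # placeholder URLs
    results = []
    threads = [Thread(target=worker, args=(q, results)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('\n'.join(results))

Even though the simulated fetches finish in a random order, the output lists chapters 0 through 7 in order, because each worker waits for the shared counter before appending its result.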
The code itself is fairly heavily commented, so here is just a quick walkthrough:
1. Send the request and use XPath to locate the URL of each chapter and the related information.
def get_info():
    global flag, name, author
    url1 = ''  # URL of the novel to crawl (removed here)
    response1 = requests.get(url1, headers=header)  # send the request
    response1.encoding = 'GBK'  # set the response encoding
    html = etree.HTML(response1.text)  # parse the HTML string
    names = html.xpath('//div[@id="info"]/h1/text()')
    name = processing(names)
    authors = html.xpath('//div[@id="info"]//small/a/text()')
    author = processing(authors)
    div_all = html.xpath('//dl[@class="zjlist"]')
    chapter_url = []
    for div in div_all:
        chapter_urls = div.xpath('./dd/a/@href')
        for i in chapter_urls:
            url2s = url1 + i  # full URL of each chapter page
            chapter_url.append(url2s)
    # print(chapter_url)
    return chapter_url
2. Grab the novel's basic information and write it at the beginning of the output file.
def novel_information():  # build the basic information written at the top of the novel
    x = len(get_info())
    x = str(x)
    novel_content = name + ' ' + author + chapter_name + ' 共有' + x + '章节' + cartoon + '\n\n\n'
    return novel_content
save_chapter_num = 1
3. Fetch the chapter content and write it out. A chapter is written only when the write counter (save_chapter_num) matches its position in the novel; otherwise the thread blocks and waits.
def write_data(q):
    global save_chapter_num
    while not q.empty():
        novel_data = q.get()
        url2 = novel_data[1]
        # fetch this chapter's page
        response2 = requests.get(url2, headers=header)  # send the request
        response2.encoding = 'GBK'  # set the response encoding
        html = etree.HTML(response2.text)  # parse the HTML string
        chapter_names = html.xpath('/html/body/div[3]/h1/text()')
        chapter_name = processing(chapter_names)
        # extract the chapter body
        chapter_contents = re.findall('div id="content">(.*?)</div>', response2.text)
        chapter_content = processing(chapter_contents)
        try:
            novel_content = chapter_name + '\n\n' + chapter_content + '\n\n\n'
            time.sleep(2)  # be polite to the server
            # print(novel_content)
            print(chapter_name + ' 爬取成功!\n')
        except:
            novel_content = chapter_name + '\n\n\n'  # fall back so the ordered write below can still advance
            print(chapter_name + ' 爬取失败!')
        while save_chapter_num < novel_data[0] + 1:
            pass  # busy-wait until it is this chapter's turn to be written
        if save_chapter_num == novel_data[0] + 1:
            fw.write(novel_content)
            print(chapter_name, ' 保存成功!!!')
            save_chapter_num += 1
4. The main thread, which mainly creates the queue and the worker threads.
if __name__ == "__main__":
    chapter_url = get_info()
    with open(name + '.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(novel_information())
        # print(novel_information())
    q = Queue()  # create the chapter queue
    for i, url in enumerate(chapter_url):
        q.put((i, url))
    with open(name + '.txt', 'a', encoding='utf-8') as fw:
        ts = []
        for i in range(10):  # start several worker threads
            t = Thread(target=write_data, args=[q])
            t.start()
            ts.append(t)
        for t in ts:
            t.join()
    print('\n', name, '已全部写入!!!!!')
Because the threads run concurrently, the chapters are fetched out of order, but they are saved to the file in order.
The complete code is below (the actual website URL has been removed):
import time
import re
import requests
from lxml import etree
from threading import Thread
from queue import Queue
cartoon = """
へ /|
/\7 ∠_/
/ │ / /
│ Z _,< / /`ヽ
│ ヽ / 〉
Y ` / /
イ● 、 ● ⊂⊃〈 /
() へ | \〈
>ー 、_ ィ │ //
/ へ / ノ<| \\
ヽ_ノ (_/ │//
7 |/
>―r ̄ ̄`ー―_6"""
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
# strip whitespace from each string and concatenate the results
def processing(strs):
    s = ''  # string that accumulates the content
    for n in strs:
        n = ''.join(n.split())  # remove whitespace characters
        s = s + n  # concatenate
    return s  # return the joined string
flag = 1
author = ''
name = ''
write_flag = 1
chapter_url = []
# fetch the novel's index page: basic information and chapter URLs
def get_info():
    global flag, name, author
    url1 = ''  # URL of the novel to crawl (removed here)
    response1 = requests.get(url1, headers=header)  # send the request
    response1.encoding = 'GBK'  # set the response encoding
    html = etree.HTML(response1.text)  # parse the HTML string
    names = html.xpath('//div[@id="info"]/h1/text()')
    name = processing(names)
    authors = html.xpath('//div[@id="info"]//small/a/text()')
    author = processing(authors)
    div_all = html.xpath('//dl[@class="zjlist"]')
    chapter_url = []
    for div in div_all:
        chapter_urls = div.xpath('./dd/a/@href')
        for i in chapter_urls:
            url2s = url1 + i  # full URL of each chapter page
            chapter_url.append(url2s)
    # print(chapter_url)
    return chapter_url
chapter_name = ''
chapter_num = 0
novel_content = ''
def novel_information():  # build the basic information written at the top of the novel
    x = len(get_info())
    x = str(x)
    novel_content = name + ' ' + author + chapter_name + ' 共有' + x + '章节' + cartoon + '\n\n\n'
    return novel_content
save_chapter_num = 1
def write_data(q):
    global save_chapter_num
    while not q.empty():
        novel_data = q.get()
        url2 = novel_data[1]
        # fetch this chapter's page
        response2 = requests.get(url2, headers=header)  # send the request
        response2.encoding = 'GBK'  # set the response encoding
        html = etree.HTML(response2.text)  # parse the HTML string
        chapter_names = html.xpath('/html/body/div[3]/h1/text()')
        chapter_name = processing(chapter_names)
        # extract the chapter body
        chapter_contents = re.findall('div id="content">(.*?)</div>', response2.text)
        chapter_content = processing(chapter_contents)
        try:
            novel_content = chapter_name + '\n\n' + chapter_content + '\n\n\n'
            time.sleep(2)  # be polite to the server
            # print(novel_content)
            print(chapter_name + ' 爬取成功!\n')
        except:
            novel_content = chapter_name + '\n\n\n'  # fall back so the ordered write below can still advance
            print(chapter_name + ' 爬取失败!')
        while save_chapter_num < novel_data[0] + 1:
            pass  # busy-wait until it is this chapter's turn to be written
        if save_chapter_num == novel_data[0] + 1:
            fw.write(novel_content)
            print(chapter_name, ' 保存成功!!!')
            save_chapter_num += 1
if __name__ == "__main__":
    chapter_url = get_info()
    with open(name + '.txt', 'w', encoding='utf-8') as text_file:
        text_file.write(novel_information())
        # print(novel_information())
    q = Queue()  # create the chapter queue
    for i, url in enumerate(chapter_url):
        q.put((i, url))
    with open(name + '.txt', 'a', encoding='utf-8') as fw:
        ts = []
        for i in range(10):  # start several worker threads
            t = Thread(target=write_data, args=[q])
            t.start()
            ts.append(t)
        for t in ts:
            t.join()
    print('\n', name, '已全部写入!!!!!')
The key to the whole program is simply to keep pulling items off the queue, fetching first and writing afterwards; the fetch must come before the ordered-write check, otherwise efficiency is still poor. Also, when using multiple threads, if you don't want the threads to corrupt each other's data, avoid putting per-chapter data in global variables; I spent a long time hunting down a bug caused by exactly that.
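As a small illustration of that last point (this snippet is not from the crawler itself), compare storing per-chapter data in a shared global with keeping it in a local variable: with the global, a thread may find its data overwritten by another thread before it gets to save it.

import time
from threading import Thread

shared = {'chapter': None}  # plays the role of a global shared by every thread

def bad_worker(i):
    shared['chapter'] = 'chapter %d' % i  # another thread may overwrite this before we save
    time.sleep(0.1)  # stand-in for the fetch delay
    print('global version saved:', shared['chapter'])  # often prints the wrong chapter

def good_worker(i):
    chapter = 'chapter %d' % i  # local variable, private to this thread
    time.sleep(0.1)
    print('local version saved:', chapter)  # always the right chapter

if __name__ == '__main__':
    for target in (bad_worker, good_worker):
        threads = [Thread(target=target, args=(i,)) for i in range(3)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()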
There is still glory on the horizon; may you all get what you wish for!