The Simplest Multithreaded Novel Scraper Ever

Latest recommended article published 2024-01-02 14:56:05

Mr.Shawn

Category: Crawler Series (personal learning notes). Tags: crawler, python, multithreading

Article link: https://blog.csdn.net/shawn_fung/article/details/88013364


import requests
import threading
import queue
from lxml import etree
import time
Q = queue.Queue()

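class A(threading.Thread):
    """Producer: reads the table of contents and queues every chapter URL."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.url = 'http://www.17k.com/list/2926161.html'  # table-of-contents page

    def run(self):
        resp = requests.get(self.url, timeout=10)  # time out instead of hanging on a dead connection
        html = resp.content.decode('utf-8')
        text = etree.HTML(html)
        # Collect the href of every chapter link in the volume lists.
        dds = text.xpath('//div[@class="Main List"]/dl[@class="Volume"]/dd/a/@href')
        for url in dds:
            url = 'http://www.17k.com' + url  # prefix the site root to get a full URL
            Q.put(url)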

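class B(threading.Thread):
    """Consumer: downloads each queued chapter and saves it as a text file."""

    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        while True:
            url = Q.get()
            if url is None:  # sentinel queued by the main block: no more chapters
                break
            resp = requests.get(url, timeout=10)
            html = resp.content.decode('utf-8')
            text = etree.HTML(html)
            name = text.xpath('//div[@class="readAreaBox content"]/h1/text()')[0].strip()  # chapter title
            contents = text.xpath('//div[@class="readAreaBox content"]/div[@class="p"]/text()')
            print('Saving %s' % name)
            # contents is a list of separate text fragments, not one string, so write
            # them one per line. The file is opened once, outside the loop: reopening
            # it in 'w' mode for every fragment would truncate it each time.
            with open('./%s.txt' % name, 'w', encoding='utf-8') as f:
                for content in contents:
                    f.write(content)
                    f.write('\n')

if __name__ == '__main__':
    start = time.time()
    s = A()
    q = B()
    s.start()
    q.start()
    s.join()
    Q.put(None)  # every chapter URL is queued now, so tell B it can stop
    q.join()
    print(time.time() - start)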
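
The script above runs a single consumer thread, so chapters are still downloaded one at a time. As a rough sketch of how the same queue pattern could be spread over several consumers, the block below starts a small pool of worker threads and stops each one with a sentinel. The names `run_workers`, `handle`, `num_workers` and `SENTINEL` are made up for this illustration and are not part of the original code. `handle` stands in for the download-and-save step, and the URLs in the demo are placeholders so the sketch runs without network access.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-work marker, one is queued per worker


def run_workers(urls, handle, num_workers=4):
    """Call handle(url) for every url using num_workers threads.

    Returns a list of (url, outcome) pairs in completion order, where
    outcome is handle's return value or the exception it raised.
    """
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()  # guards results, which all workers append to

    def worker():
        while True:
            url = tasks.get()
            if url is SENTINEL:
                break  # no more work for this thread
            try:
                outcome = handle(url)
            except Exception as exc:  # keep the worker alive if one chapter fails
                outcome = exc
            with lock:
                results.append((url, outcome))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(SENTINEL)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    # Placeholder URLs and a trivial handler, only to show the call shape.
    demo_urls = ['http://example.invalid/chapter/%d' % i for i in range(5)]
    finished = run_workers(demo_urls, handle=len, num_workers=3)
    print(len(finished), 'chapters handled')
```

In the scraper, `handle` would be a function wrapping the body of `B.run` for one URL. The worker count is a guess to tune; too many threads against one site mostly gets you rate limited.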
