7 - Multithreaded crawler for Qiushibaike

Latest recommended article published on 2021-09-26 22:34:18

-admin-

Category column: Python3网络爬虫 | Article tags: Python3网络爬虫

Article link: https://blog.csdn.net/flyingkitty_/article/details/105882464


Introduction

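In Python, multithreading (thread) is often described as a "chicken rib": something of little value that is still a pity to throw away. Because of CPython's global interpreter lock (GIL), threads do not run Python bytecode in parallel, so multithreading is generally not recommended for CPU-bound work and multiprocessing is used more often; threads can still help with I/O-bound tasks such as crawling. Even so, it is worth taking some notes.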

Two ways to create threads

Creating threads directly from a function

import _thread
import time


# Function executed by each thread
def print_time(thread_name, delay):
    count = 0
    while count < 5:
        time.sleep(delay)
        count += 1
        print("%s : %s" % (thread_name, time.ctime(time.time())))


# Start two threads
try:
    _thread.start_new_thread(print_time, ("thread-1", 2))
    _thread.start_new_thread(print_time, ("thread-2", 3))
except Exception:
    print("Error: unable to start thread")


# Keep the main thread alive. If it exits, the worker threads
# are typically terminated before they print anything.
while True:
    time.sleep(1)
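
As a side note, the same function-based approach can also be written with threading.Thread(target=...), whose join() method lets the main thread wait without an endless loop. The following is a minimal sketch rather than part of the original notes; the names worker and results and the short delays are assumptions chosen for illustration.

```python
import threading
import time


def worker(thread_name, delay, results):
    # Sleep to simulate work, then record which thread finished each step.
    for step in range(3):
        time.sleep(delay)
        results.append((thread_name, step))


# Shared list used only for illustration (list.append is atomic in CPython).
results = []
threads = [
    threading.Thread(target=worker, args=("thread-1", 0.01, results)),
    threading.Thread(target=worker, args=("thread-2", 0.02, results)),
]
for t in threads:
    t.start()

# join() blocks until each thread has finished, so no busy loop is needed.
for t in threads:
    t.join()

print(len(results))
```

Because join() returns only after the thread's function has returned, the program exits on its own once both threads are done.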

Creating threads with threading

To create a thread with the threading module, subclass threading.Thread directly and override its __init__ and run methods.
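
A minimal sketch of this subclassing pattern might look like the following. The class name SquareThread and its number and result attributes are hypothetical, made up for illustration only.

```python
import threading


class SquareThread(threading.Thread):
    def __init__(self, number):
        # Call the parent constructor before the thread is started.
        super().__init__()
        self.number = number
        self.result = None

    def run(self):
        # run() executes in the new thread once start() is called.
        self.result = self.number * self.number


threads = [SquareThread(n) for n in range(1, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

squares = [t.result for t in threads]
print(squares)
```

Note that the code calls start(), not run(): calling run() directly would execute the method in the current thread instead of a new one.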

Crawling Qiushibaike with threading

from threading import Thread
from queue import Queue, Empty
from fake_useragent import UserAgent
import requests
from lxml import etree
import time


# Crawler thread: downloads pages from url_queue and puts the HTML into html_queue
class CrawlInfo(Thread):
    def __init__(self, url_queue, html_queue):
        Thread.__init__(self)
        self.url_queue = url_queue
        self.html_queue = html_queue

    def run(self):
        headers = {
            'User-Agent': UserAgent().random
        }
        while True:
            try:
                # get_nowait() avoids blocking forever if another thread
                # takes the last URL between an empty() check and get()
                url = self.url_queue.get_nowait()
            except Empty:
                break
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                self.html_queue.put(response.text)


class ParseInfo(Thread):
    def __init__(self, html_queue):
        Thread.__init__(self)
        self.html_queue = html_queue

    def run(self) -> None:
        while not self.html_queue.empty():
            e = etree.HTML(self.html_queue.get())
            contents = e.xpath('//div[@class="content"]/span[1]')
            with open("duanzi.txt", "a", encoding="utf-8") as f:
                for content in contents:
                    info = content.xpath('string(.)')
                    f.write(info + "\n")


if __name__ == '__main__':
    start = time.time()
    # Container for the URLs to crawl
    # FIFO (first-in-first-out) queue
    url_queue = Queue()
    # Container for the downloaded HTML
    html_queue = Queue()
    base_url = 'https://www.qiushibaike.com/text/page/{}/'
    for i in range(1, 13):
        new_url = base_url.format(i)
        url_queue.put(new_url)

    # Create three crawler threads
    crawl_list = []
    for i in range(0, 3):
        crawl1 = CrawlInfo(url_queue, html_queue)
        crawl_list.append(crawl1)
        crawl1.start()

    # Wait for all crawlers to finish
    for crawl_detail in crawl_list:
        crawl_detail.join()

    # Parse the data
    crawl2 = ParseInfo(html_queue)
    crawl2.start()
    crawl2.join()
    print("Elapsed: %.2f s" % (time.time() - start))
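
The crawler above has several threads draining one shared queue. The following network-free sketch shows the same pattern, using get_nowait() so that a worker never blocks if another thread takes the last item first. The Worker class and the "page-N" strings are assumptions that stand in for real downloads; no request is made.

```python
from queue import Queue, Empty
from threading import Thread


class Worker(Thread):
    def __init__(self, task_queue, result_queue):
        super().__init__()
        self.task_queue = task_queue
        self.result_queue = result_queue

    def run(self):
        while True:
            try:
                # Raises Empty instead of blocking when the queue is drained.
                page = self.task_queue.get_nowait()
            except Empty:
                break
            # Stand-in for downloading a page.
            self.result_queue.put("page-%d" % page)


task_queue = Queue()
result_queue = Queue()
for page in range(1, 13):
    task_queue.put(page)

workers = [Worker(task_queue, result_queue) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# All workers have finished, so empty() is reliable here.
pages = []
while not result_queue.empty():
    pages.append(result_queue.get())
print(len(pages))
```

The order of the results depends on thread scheduling, so only the set of processed pages is predictable, not their sequence.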

Thread synchronization

If several threads modify the same data at once, the result can be unpredictable. To keep the data correct, the threads need to be synchronized.

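The Lock and RLock objects in the threading module provide simple thread synchronization. Both have an acquire method and a release method; for data that only one thread may operate on at a time, put the operation between the acquire and release calls.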
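
As a rough illustration of placing an operation between acquire and release, the sketch below has four threads increment one shared counter. The names counter, add_many, outer and inner are hypothetical and not from the original notes.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # Only one thread at a time runs the code between acquire and release.
        lock.acquire()
        try:
            counter += 1
        finally:
            # Release in finally so the lock is freed even if an error occurs.
            lock.release()


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)

# An RLock may be acquired again by the thread that already holds it.
rlock = threading.RLock()


def inner():
    with rlock:  # re-acquired by the same thread; a plain Lock would deadlock here
        return "ok"


def outer():
    with rlock:
        return inner()


reentrant_result = outer()
```

The with statement used for rlock is shorthand for the same acquire and release pair, and it works for Lock as well.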
