Scraping Videos: Just Make It Fast

warm...

Last modified 2022-07-04 13:57:22

Category: Python crawlers. Tags: python, queue, multithreading

First published 2020-10-18 21:33:52

Article link: https://blog.csdn.net/qq_46292926/article/details/109150328

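I've been practicing multithreaded video scraping, so I picked a website to try it out on. For now this post only uses multithreading to do the scraping; the theory behind multithreading will be filled in later.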

Step one of scraping: analyze the page

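Open the page and press F12 to bring up the developer tools, then switch to the Network tab. Looking through the requests shows that the original page's HTML does not contain the data we want.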

Switching to the XHR filter reveals a single request.
Clicking into it shows that its response is exactly the data we want.

It contains the video URLs we want along with the video titles, so all that is left is to extract and process them.
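To make the extraction step concrete, here is a minimal sketch of pulling every `title` and `playurl` out of a nested JSON response. It uses a small recursive helper instead of a third-party library, so it runs with the standard library alone. The helper name `find_all` and the sample payload are assumptions for illustration; the post does not show the real response layout, only that it carries these two fields.

```python
def find_all(node, key):
    """Collect every value stored under `key`, at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may also appear deeper in the tree
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical payload: the real response structure is not shown in the post
sample = {
    "data": [
        {"title": "clip one", "playurl": "https://example.com/1.mp4"},
        {"title": "clip two", "playurl": "https://example.com/2.mp4"},
    ]
}

titles = find_all(sample, "title")
playurls = find_all(sample, "playurl")
# Pair each title with its video URL, ready to be queued for download
pairs = list(zip(titles, playurls))
print(pairs)
```

Pairing by position only works if every entry has both fields; if some entries might lack one of them, it is safer to walk the entries themselves and read both fields from each.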

Writing the code

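The multithreading follows the producer-consumer pattern. Producers request each page and put the video titles and URLs they find onto a queue; consumers take those entries off the queue, download the videos, and save them locally.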
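Stripped of all networking, the pattern can be sketched roughly like this. It is only an illustration under stated assumptions, not the scraper itself: the function names, the fake `page…-video…` items, and the use of `None` as a stop signal are all made up for the example. Producers drain a page queue, consumers drain an item queue, and the main thread sends one stop signal per consumer once every producer has finished.

```python
import threading
from queue import Queue, Empty

STOP = None  # assumed sentinel: tells a consumer to exit


def producer(page_queue, item_queue):
    while True:
        try:
            # get_nowait() raises Empty instead of blocking forever
            page = page_queue.get_nowait()
        except Empty:
            return
        # Stand-in for "request the page and extract its video entries"
        for n in range(2):
            item_queue.put(f"page{page}-video{n}")


def consumer(item_queue, results, lock):
    while True:
        item = item_queue.get()  # blocks until something arrives
        if item is STOP:
            return
        # Stand-in for "download the video and save it"
        with lock:
            results.append(item)


def run_pipeline(pages, n_producers=2, n_consumers=3):
    page_queue, item_queue = Queue(), Queue()
    for page in pages:
        page_queue.put(page)
    results, lock = [], threading.Lock()
    producers = [threading.Thread(target=producer, args=(page_queue, item_queue))
                 for _ in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(item_queue, results, lock))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()  # all items are queued once the producers are done
    for _ in consumers:
        item_queue.put(STOP)  # one stop signal per consumer
    for t in consumers:
        t.join()
    return results


results = run_pipeline(range(1, 6))
print(len(results))
```

The stop signal matters because a consumer cannot tell "the queue is empty right now" apart from "no more work will ever arrive" just by checking `empty()`.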
The full scraper is below, with the details explained in its comments.

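# Import the standard-library and third-party modules
import os
import threading
from queue import Queue, Empty

import requests
import jsonpath
from fake_useragent import UserAgent

# Folder for the downloaded videos (placeholder name carried over from the original)
SAVE_DIR = 'created_folder'


# Producer: takes page URLs and queues a {title: video URL} item per video
class Product(threading.Thread):
    def __init__(self, page_queue, video_queue, *args, **kwargs):
        # Initialize the parent Thread first
        super().__init__(*args, **kwargs)
        self.headers = {'User-Agent': UserAgent().random}
        # Queue of page URLs still to be requested
        self.page_queue = page_queue
        # Queue of video items for the consumers
        self.video_queue = video_queue

    # Override run(): this is what the thread executes
    def run(self):
        while True:
            # get_nowait() avoids the race between empty() and get():
            # another producer could take the last URL in between
            try:
                url = self.page_queue.get_nowait()
            except Empty:
                break
            self.parse_json(url)

    # Request one page of JSON and queue every video found in it
    def parse_json(self, url):
        response = requests.get(url, headers=self.headers, timeout=10).json()
        # Extract the title and MP4 URL fields;
        # jsonpath.jsonpath returns False when nothing matches
        titles = jsonpath.jsonpath(response, '$..title')
        playurls = jsonpath.jsonpath(response, '$..playurl')
        if not titles or not playurls:
            return
        for title, playurl in zip(titles, playurls):
            # One single-entry dict per video: {title: mp4 URL}
            self.video_queue.put({title: playurl})


# Consumer: downloads each queued video and saves it locally
class Customer(threading.Thread):
    def __init__(self, video_queue, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.headers = {'User-Agent': UserAgent().random}
        self.video_queue = video_queue

    # Override run()
    def run(self):
        while True:
            # Block until an item arrives; None is the signal to stop
            mag = self.video_queue.get()
            if mag is None:
                break
            # Unpack the single-entry dict
            for title, mp4 in mag.items():
                # Request the video itself
                response = requests.get(mp4, headers=self.headers, timeout=30)
                # Save the video under its title
                path = os.path.join(SAVE_DIR, title + '.mp4')
                with open(path, 'wb') as f:
                    print('Writing: ' + title)
                    f.write(response.content)


# Main function
def main():
    # Queue of page URLs
    page_queue = Queue()
    # Queue of video items
    video_queue = Queue()
    # Create the folder once, before any consumer starts
    os.makedirs(SAVE_DIR, exist_ok=True)
    # Queue one URL per page (pages 1 to 10)
    for i in range(1, 11):
        # Placeholder: the URL found during analysis is elided in the
        # original; the page number i presumably belongs in it
        url = 'analyzed URL'
        page_queue.put(url)
    # Start 5 producers
    producers = [Product(page_queue, video_queue) for _ in range(5)]
    for t in producers:
        t.start()
    # Start 5 consumers
    consumers = [Customer(video_queue) for _ in range(5)]
    for t in consumers:
        t.start()
    # Once every producer is done, send one stop signal per consumer
    for t in producers:
        t.join()
    for _ in consumers:
        video_queue.put(None)


# Program entry point
if __name__ == '__main__':
    main()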
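
One thing to watch out for in the saving step: the listing uses each video's title directly as the file name, and a title containing characters like `/` or `?` would make `open()` fail or write somewhere unexpected. Here is a minimal sketch of a helper that could be applied to the title first. The name `safe_filename`, the replacement character, and the length limit are my own assumptions, not part of the original code.

```python
import re


def safe_filename(title, replacement="_", max_length=100):
    """Turn an arbitrary title into something usable as a file name."""
    # Replace path separators, characters Windows rejects, and control characters
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Drop surrounding whitespace and trailing dots
    cleaned = cleaned.strip().rstrip(".")
    # Fall back to a fixed name if nothing usable is left
    return cleaned[:max_length] or "untitled"


print(safe_filename("cats/dogs: best of?") + ".mp4")
```

Two different titles can still map to the same cleaned name, so a real run might also want to add a counter or an ID to keep files from overwriting each other.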