python 爬虫下载视频并安装使用 ffmpeg 合并ts视频文件使用16进制修改文件头类型

fuchto

已于 2022-09-09 11:05:06 修改

阅读量2.2k

点赞数 1

文章标签： python 爬虫网络爬虫

于 2022-09-08 16:28:23 首次发布

本文链接：https://blog.csdn.net/fuchto/article/details/126710611

版权

示例：私聊获取示例链接

第一步找到视频资源接口

这是m3u8的资源获取接口

使用 postman 模拟发送请求

发现无法获取到数据报错 403

查看请求头发现还有其他请求参数

设置请求头

在pyhton 中使用request 请求

获取到资源文档发现链接内容都是 ts文件采用多线程下载 ts文件


# 采用线程池下载m3u8里的ts
def down_ts():
    # 读取 文件内容
    with open("./hhls.m3u8", "r", encoding="utf-8") as f:
        # n = 0
        # 创建 10 个线程
        with ThreadPoolExecutor(10) as t:
            # 循环 文件内容
            for n, line in enumerate(f):
                line = line.strip()
                # print(line)
                # 判断是否是忽略内容
                if line.startswith("#"):
                    # 是则跳过循环
                    continue
                #     提交线程
                t.submit(ts_video, n, line)

下载方法

# 下载ts视频 ./hhls 为下载ts视频存放的路径
def ts_video(n, line):

    headers = {
        'origin' : 'https://xxxxx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    }
    resp3 = requests.get(headers=headers, url=line)
    print(resp3.status_code)
    with open(f"./hhls/{n}.ts", "wb+") as f:
        f.write(resp3.content)

视频下载完成之后则需要将 ts 文件合并成MP4 文件

这时则需要使用到 ffmpeg

ffmpeg 安装步骤

下载成功后解压即可

执行合并视频代码

def movie_video():
    filePath = "./hhls"
    # 获取当前文件夹下 使用文件名称
    file_list = os.listdir(filePath)
    print(file_list)
    li = []
    for file in file_list:
        # 字符串截取
        file = file.split(".")[0]
        # 添加进 列表
        li.append(int(file))
        # 列表进行排序
        li.sort()
    ts_file_name = []
    # 循环列表
    for i in li:
        i = str(i) + ".ts"
        # 添加 文件名进入列表
        ts_file_name.append(i)
    #     创建或打开文件 执行写入
    with open("./hhls/file_list.txt", "w+") as f:
        # 循环 文件名列表
        for file in ts_file_name:
            # 循环写入文件
            f.write("file '{}'\n".format(file))
    print("file_list.txt已生成")
    txt_file = "./hhls/file_list.txt"
    mp4 = "./hhls/hhls.mp4"
    # 执行合并命令
    cmd = f"ffmpeg -f concat -i " + txt_file + " -c copy " + mp4
    print(cmd)
    try:
        # 执行命令
        os.system(cmd)
    except Exception as e:
        print(e)
    print("done")

发现报错乱码

乱码解决方法在 pycharm 中修改配置打开 file 下 settings 中搜索 file encodings 修改字符集为 GBK 即可

这是由于没有将 ffmpeg 添加到环境变量导致

回到桌面右键点击此电脑点击属性添加系统环境变量 ffmpeg.exe 所在的文件位置即可

打开cmd 执行命令 ffmpeg -v 查看是否添加成功

添加成功重启 pycharm 编辑器即可

重新运行合并ts文件时报错

Could not find tag for codec bmp in stream #0, codec not currently supported in container
Could not write header for output file #0 (incorrect codec parameters ?): Invalid argument
Error initializing output stream 0:0 --

这种是利用图床来存储视频切片文件前被增加了BMP的header信息，导致ffmpeg错误地识别为BMP图片，自然不能合并了。

其实他就是 ts文件，被改成.bmp 后缀，放到图床，达到白嫖图床的目的

那么如何解决这个问题呢

使用16进制查看器WinHex 编辑删除

47 40 11 之前的数据即可

片段示例

12.ts 文件 16进制数据

删除 bmp头后执行合并成功

png 头部信息 ts片段数据

常用文件格式十六进制文件头
JPEG (jpg)，文件头：FFD8FF
PNG (png)，文件头：89504E47
GIF (gif)，文件头：47494638
TIFF (tif)，文件头：49492A00
Windows Bitmap (bmp)，文件头：424D
CAD (dwg)，文件头：41433130
Adobe Photoshop (psd)，文件头：38425053
Rich Text Format (rtf)，文件头：7B5C727466
XML (xml)，文件头：3C3F786D6C
HTML (html)，文件头：68746D6C3E
Email [thorough only] (eml)，文件头：44656C69766572792D646174653A
Outlook Express (dbx)，文件头：CFAD12FEC5FD746F
Outlook (pst)，文件头：2142444E
MS Word/Excel (xls.or.doc)，文件头：D0CF11E0
MS Access (mdb)，文件头：5374616E64617264204A
WordPerfect (wpd)，文件头：FF575043
Adobe Acrobat (pdf)，文件头：255044462D312E
Quicken (qdf)，文件头：AC9EBD8F
Windows Password (pwl)，文件头：E3828596
ZIP Archive (zip)，文件头：504B0304
RAR Archive (rar)，文件头：52617221
Wave (wav)，文件头：57415645
AVI (avi)，文件头：41564920
Real Audio (ram)，文件头：2E7261FD
Real Media (rm)，文件头：2E524D46
MPEG (mpg)，文件头：000001BA
MPEG (mpg)，文件头：000001B3
Quicktime (mov)，文件头：6D6F6F76
Windows Media (asf)，文件头：3026B2758E66CF11
MIDI (mid)，文件头：4D546864

那么问题来了如何在代码中实现呢

示例：

 # bmp标头文件示例
    f = open(f'./hhls/648.ts','rb')# 二进制读
    # 读取文件流
    a = f.read()# 打印读出来的数据
    # 转换16进制
    hexstr = binascii.b2a_hex(a)
    # 转换成 字符串
    hexstr = hexstr.decode('UTF-8')
    # print(hexstr)
    #判断是否 存在 特定标头
    header = ['FFD8FF','89504E47','47494638','424D']
    for value in header:
        # 搜索 判断16进制内容是否 存在 特定标头
       if hexstr.startswith(value.lower()) :
           # 进行字符串截图
           new_hexstr = hexstr[len(value):]
           # 字符串转为bytes 类型 数据
           new_hexstr = bytes.fromhex(new_hexstr)
           f = open(f'./hhls/648.ts', 'wb') # 二进制写模式
           f.write(new_hexstr) # 二进制写入

成功合成之后

发现视频时长不正确在命令行执行合并发现 482.ts文件有问题无法合并

打开文件发现文件资源有问题属于无效文件

查看网站后发现网站资源也丢失这一片段

解决方法删除 482.ts 文件重新执行脚本

或

修改 file_list.txt 删除 file '482.ts' 即可

重新执行合并命令

合并ts 资源成功

下载方法优化后

# 下载ts视频 ./hhls 为下载ts视频存放的路径   
def ts_video(n, line):

    headers = {
        'origin' : 'https://xxxxx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    }
    resp3 = requests.get(headers=headers, url=line)
    print(resp3.status_code)
    with open(f"./hhls/{n}.ts", "wb+") as f:
        f.write(resp3.content)

    # bmp标头文件示例
    f = open(f'./hhls/{n}.ts','rb')# 二进制读
    # 读取文件流
    a = f.read()# 打印读出来的数据
    # 转换16进制
    hexstr = binascii.b2a_hex(a)
    # 转换成 字符串
    hexstr = hexstr.decode('UTF-8')
    # print(hexstr)
    #判断是否 存在 特定标头
    header = ['FFD8FF','89504E47','47494638','424D']
    for value in header:
        # 搜索 判断16进制内容是否 存在 特定标头
       if hexstr.startswith(value.lower()) :
           # 进行字符串截图
           new_hexstr = hexstr[len(value):]
           # 字符串转为bytes 类型 数据
           new_hexstr = bytes.fromhex(new_hexstr)
           f = open(f'./hhls/{n}.ts', 'wb') # 二进制写模式
           f.write(new_hexstr) # 二进制写入
    print(n, "下载完成")

完整代码示例

示例网站地址：私聊获取

import requests
import os
import binascii
from concurrent.futures import ThreadPoolExecutor

def download_video():
    headers = {
        'origin' : 'xxxxxx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    }
    # 当前链接只有 几分钟 时效 请去示例网站地址 实时获取
    link = 'xxxxxx'
    resp = requests.get(headers=headers,url=link)
    # print(resp)
    m3u8text = resp.text
    with open("hhls.m3u8", "w") as f:
        f.write(m3u8text)
    print("hhls.m3u8下载完成")



# 采用线程池下载m3u8里的ts
def down_ts():
    # 读取 文件内容
    with open("./hhls.m3u8", "r", encoding="utf-8") as f:
        # n = 0
        # 创建 10 个线程
        with ThreadPoolExecutor(10) as t:
            # 循环 文件内容
            for n, line in enumerate(f):
                line = line.strip()
                # print(line)
                # 判断是否是忽略内容
                if line.startswith("#"):
                    # 是则跳过循环
                    continue
                #     提交线程
                t.submit(ts_video, n, line)


# 下载ts视频 ./hhls 为下载ts视频存放的路径
def ts_video(n, line):

    headers = {
        'origin' : 'xxxxxx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36',
    }
    resp3 = requests.get(headers=headers, url=line)
    print(resp3.status_code)
    with open(f"./hhls/{n}.ts", "wb+") as f:
        f.write(resp3.content)

    # bmp标头文件示例
    f = open(f'./hhls/{n}.ts','rb')# 二进制读
    # 读取文件流
    a = f.read()# 打印读出来的数据
    # 转换16进制
    hexstr = binascii.b2a_hex(a)
    # 转换成 字符串
    hexstr = hexstr.decode('UTF-8')
    # print(hexstr)
    #判断是否 存在 特定标头
    header = ['FFD8FF','89504E47','47494638','424D']
    for value in header:
        # 搜索 判断16进制内容是否 存在 特定标头
       if hexstr.startswith(value.lower()) :
           # 进行字符串截图
           new_hexstr = hexstr[len(value):]
           # 字符串转为bytes 类型 数据
           new_hexstr = bytes.fromhex(new_hexstr)
           f = open(f'./hhls/{n}.ts', 'wb') # 二进制写模式
           f.write(new_hexstr) # 二进制写入
    print(n, "下载完成")


def movie_video():
    filePath = "./hhls"
    # 获取当前文件夹下 使用文件名称
    file_list = os.listdir(filePath)
    print(file_list)
    li = []
    for file in file_list:
        # 字符串截取
        file = file.split(".")[0]
        # 添加进 列表
        li.append(int(file))
        # 列表进行排序
        li.sort()
    ts_file_name = []
    # 循环列表
    for i in li:
        i = str(i) + ".ts"
        # 添加 文件名进入列表
        ts_file_name.append(i)
    #     创建或打开文件 执行写入
    with open("./hhls/file_list.txt", "w+") as f:
        # 循环 文件名列表
        for file in ts_file_name:
            # 循环写入文件
            f.write("file '{}'\n".format(file))
    print("file_list.txt已生成")
    txt_file = "./hhls/file_list.txt"
    mp4 = "./hhls/hhls.mp4"
    # 执行合并命令
    cmd = f"ffmpeg -f concat -i " + txt_file + " -c copy " + mp4
    print(cmd)
    try:
        # 执行命令
        os.system(cmd)
    except Exception as e:
        print(e)
    print("done")


if __name__ == '__main__':
    download_video()
    down_ts()
    movie_video()

fuchto

关注

1
点赞
踩
12

收藏

觉得还不错? 一键收藏
打赏
0
评论
python 爬虫下载视频并安装使用 ffmpeg 合并ts视频文件使用16进制修改文件头类型

示例：第一步找到视频资源接口这是m3u8的资源获取接口使用 postman 模拟发送请求发现无法获取到数据报错 403查看请求头发现还有其他请求参数设置请求头 origin 为 https://www.ysgc.tv 请求成功在pyhton 中使用request 请求获取到资源文档发现链接内容都是 ts文件采用多线程下载 ts文件下载方法视频下载完成之后则需要将 ts 文件合并成MP4 文件这时则需要使用到 ffmpeg。
复制链接

扫一扫