[Machine Learning] FFmpeg + Whisper: A Hands-On Two-Stage Approach to Video Understanding (Video-to-Text) with Large Models

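Parsing multimedia streams: FFmpeg can read all common multimedia formats, including MP4, MKV, AVI, MP3, and OGG. It distinguishes two layers in every file: the container format (the wrapper that holds the streams) and the codec (how each stream inside the container is encoded).
Encoding and decoding: FFmpeg can encode and decode audio/video data with many different codecs. For example, it can compress video with H.264 and compress audio with AAC.
Filters: FFmpeg provides a powerful filter system for processing video and audio, for example cropping, trimming, rotating, and scaling.
Muxing and demuxing: FFmpeg can combine multiple audio/video streams into a single file, or split one file into its separate audio/video streams.
Parallel processing: FFmpeg uses multithreading to process work in parallel, for example running several transcoding operations at the same time.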
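
To make the container/codec distinction concrete, here is a minimal sketch that asks ffprobe (the inspection tool shipped alongside FFmpeg) which streams a file contains and which codec each one uses. The helper names `ffprobe_command`, `summarize_streams`, and `probe` are illustrative assumptions, and the `sample` string is a hand-written imitation of ffprobe's JSON output, not the output of a real run.

```python
import json
import subprocess


def ffprobe_command(path):
    """Build an ffprobe command that reports each stream's type and codec as JSON."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "stream=index,codec_type,codec_name",
        "-of", "json",
        path,
    ]


def summarize_streams(ffprobe_json):
    """Turn ffprobe's JSON text into a list of (codec_type, codec_name) tuples."""
    data = json.loads(ffprobe_json)
    return [(s.get("codec_type"), s.get("codec_name")) for s in data.get("streams", [])]


def probe(path):
    """Run ffprobe on a real file (requires ffprobe on PATH)."""
    result = subprocess.run(
        ffprobe_command(path), capture_output=True, text=True, check=True
    )
    return summarize_streams(result.stdout)


# Hand-written sample in the shape ffprobe emits, for illustration only.
sample = (
    '{"streams": ['
    '{"index": 0, "codec_type": "video", "codec_name": "h264"}, '
    '{"index": 1, "codec_type": "audio", "codec_name": "aac"}]}'
)
print(summarize_streams(sample))
```

On a real file, `probe("input.mp4")` would return the same kind of list, one entry per stream in the container.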

2.3 FFmpeg Usage Example

ffmpeg -i input.mp4 -vn -ar 44100 -ac 2 -ab 192k -f mp3 output.mp3

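-i input.mp4 specifies the input file.
-vn disables video, so the video stream is dropped from the output.
-ar 44100 sets the audio sample rate to 44.1 kHz.
-ac 2 sets the number of audio channels to 2 (stereo).
-ab 192k sets the audio bitrate to 192 kbit/s.
-f mp3 sets the output format to MP3.
output.mp3 is the name of the output file.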
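
The same option list can be assembled programmatically, which is handy once the sample rate or bitrate needs to vary. The following is a small sketch; `build_mp3_command` is a hypothetical helper name, and it only builds the argument list without running ffmpeg.

```python
import shlex


def build_mp3_command(input_file, output_file, sample_rate=44100, channels=2, bitrate="192k"):
    """Assemble the ffmpeg argument list for extracting audio as MP3."""
    return [
        "ffmpeg",
        "-i", input_file,          # input file
        "-vn",                     # drop the video stream
        "-ar", str(sample_rate),   # audio sample rate in Hz
        "-ac", str(channels),      # number of audio channels
        "-ab", bitrate,            # audio bitrate
        "-f", "mp3",               # output format
        output_file,
    ]


# Print the command as a single shell-style string for inspection.
print(shlex.join(build_mp3_command("input.mp4", "output.mp3")))
```

Passing the list form (rather than one string) to `subprocess.run` avoids shell quoting problems when file names contain spaces.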

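3. Hands-On: Two-Stage Video Understanding with FFmpeg + Whisper

3.1 Installing FFmpeg

FFmpeg is a system-level program and cannot be installed with pip, so on Debian/Ubuntu it is installed with apt-get: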

sudo apt-get update && sudo apt-get install ffmpeg
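
Before moving on, it can help to confirm from Python that the binary is actually reachable, since every later step shells out to it. This is a minimal sketch; `find_executable` and `require_ffmpeg` are illustrative names, not part of any library.

```python
import shutil


def find_executable(name="ffmpeg"):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def require_ffmpeg():
    """Return the path to ffmpeg, or raise a clear error if it is not installed."""
    path = find_executable("ffmpeg")
    if path is None:
        raise RuntimeError("ffmpeg not found on PATH; install it first (e.g. with apt-get)")
    return path
```

Calling `require_ffmpeg()` at the start of a script turns a confusing failure deep inside the pipeline into an immediate, readable error.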

3.2 Downloading the Whisper Model

As in the previous post, we use the pipeline API from transformers. Start by creating a conda environment and installing transformers.

Create and activate the conda environment:

conda create -n video2text python=3.11
conda activate video2text

Install the transformers library:

pip install transformers -i https://mirrors.cloud.tencent.com/pypi/simple

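The transformers pipeline downloads the model automatically the first time it is used. If your connection to the Hugging Face Hub is slow, set HF_ENDPOINT to a mirror reachable from mainland China before importing transformers:

import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from transformers import pipeline
transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")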

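The Whisper checkpoints differ in parameter count, multilingual support, required GPU memory (VRAM), and inference speed; the comparison table in the openai/whisper README lists these for each model size.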
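
One practical use of such a comparison is choosing the largest checkpoint that fits the available GPU memory. The sketch below assumes rough VRAM figures as commonly quoted for the original Whisper release (about 1 GB for tiny and base, 2 GB for small, 5 GB for medium, 10 GB for large); treat these numbers as assumptions and check them against the README before relying on them. `pick_checkpoint` is a hypothetical helper.

```python
# (checkpoint id, approximate VRAM in GB), ordered from smallest to largest.
# The VRAM values are assumptions for illustration, not measured figures.
WHISPER_CHECKPOINTS = [
    ("openai/whisper-tiny", 1),
    ("openai/whisper-base", 1),
    ("openai/whisper-small", 2),
    ("openai/whisper-medium", 5),
    ("openai/whisper-large", 10),
]


def pick_checkpoint(available_vram_gb):
    """Return the largest checkpoint whose approximate VRAM need fits the budget."""
    chosen = None
    for name, needed_gb in WHISPER_CHECKPOINTS:
        if needed_gb <= available_vram_gb:
            chosen = name  # later (larger) entries overwrite earlier ones
    if chosen is None:
        raise ValueError(f"No checkpoint fits in {available_vram_gb} GB of VRAM")
    return chosen


print(pick_checkpoint(6))
```

With a 6 GB budget this selects `openai/whisper-medium`, the checkpoint used throughout this post.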

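3.3 Extracting Audio from Video with FFmpeg

3.3.1 Option 1: Calling ffmpeg from the Command Line

First build the ffmpeg command as the list ffmpeg_command, then execute it with the run function from the subprocess module.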

The input video file is input_file and the output audio file is output_file.

import subprocess

def extract_audio(input_file, output_file):
    """
    Extract the audio track from an MP4 file with FFmpeg and save it as MP3.

    :param input_file: path to the input MP4 file
    :param output_file: path to the output MP3 file
    """
    # Build the FFmpeg command
    ffmpeg_command = [
        'ffmpeg', '-i', input_file, '-vn', '-acodec', 'libmp3lame', output_file
    ]

    try:
        # Run the command; check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(ffmpeg_command, check=True)
        print(f"Audio successfully extracted from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing error: {e}")
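
Since Whisper works on 16 kHz mono audio internally, one possible variation is to have ffmpeg write exactly that as an uncompressed WAV file instead of an MP3. This is a sketch under that assumption, not the method used in this post; `build_wav_command` and `extract_wav` are illustrative names, and the `-y` flag (overwrite the output without asking) is an addition so the call never blocks on a prompt.

```python
import subprocess


def build_wav_command(input_file, output_file, sample_rate=16000):
    """Assemble an ffmpeg command that writes mono 16-bit PCM WAV audio."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", input_file,
        "-vn",                     # drop the video stream
        "-ac", "1",                # mono
        "-ar", str(sample_rate),   # sample rate in Hz
        "-acodec", "pcm_s16le",    # uncompressed 16-bit PCM
        output_file,
    ]


def extract_wav(input_file, output_file):
    """Run the command; return True on success, False if ffmpeg is missing or fails."""
    try:
        subprocess.run(build_wav_command(input_file, output_file), check=True)
        return True
    except (FileNotFoundError, subprocess.CalledProcessError) as e:
        print(f"Extraction failed: {e}")
        return False
```

Catching `FileNotFoundError` as well covers the case where ffmpeg itself is not installed, which `CalledProcessError` alone does not.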

3.3.2 Option 2: Using ffmpeg through the ffmpeg-python Library

First install ffmpeg-python:

pip install ffmpeg-python -i https://mirrors.cloud.tencent.com/pypi/simple

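Import the ffmpeg module; a single chained call then performs the audio extraction:

import ffmpeg

def extract_audio(input_file, output_file):
    """
    Extract the audio track from an MP4 file with FFmpeg and save it as MP3.

    :param input_file: path to the input MP4 file
    :param output_file: path to the output MP3 file
    """

    try:
        # Build and run the command: MP3 encoder, 2 channels, 44.1 kHz sample rate
        ffmpeg.input(input_file).output(output_file, acodec="libmp3lame", ac=2, ar="44100").run()
        print(f"Audio successfully extracted from {input_file} to {output_file}")
    except ffmpeg.Error as e:
        # ffmpeg-python raises ffmpeg.Error (not subprocess.CalledProcessError) on failure
        print(f"Processing error: {e}")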

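3.4 Transcribing Audio to Text with Whisper

from transformers import pipeline

def speech2text(speech_file):
    # Download (on first use) and instantiate the Whisper model
    transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
    # The result is a dict whose "text" key holds the transcription
    text_dict = transcriber(speech_file)
    return text_dict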

Here pipeline handles both downloading and instantiating openai/whisper-medium; passing the audio file to the resulting transcriber object returns the text directly.
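
If the transcriber is called with `return_timestamps=True`, the returned dictionary also carries a `chunks` list of timestamped segments, which is often more useful for video than one long string. The sketch below formats such a result; `format_timestamp` and `format_chunks` are hypothetical helpers, and `sample_result` is a hand-written example of the result shape rather than real model output.

```python
def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_chunks(result):
    """Render each timestamped chunk as '[start - end] text' on its own line."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be None when the audio is cut off.
        end_text = format_timestamp(end) if end is not None else "??:??:??"
        lines.append(f"[{format_timestamp(start)} - {end_text}] {chunk['text'].strip()}")
    return "\n".join(lines)


# Hand-written example of the result shape, for illustration only.
sample_result = {
    "text": " Hello world. This is a test.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hello world."},
        {"timestamp": (2.5, 65.0), "text": " This is a test."},
    ],
}
print(format_chunks(sample_result))
```

For recordings longer than Whisper's 30-second window, timestamps or the pipeline's `chunk_length_s` argument are typically needed anyway, so this format comes at little extra cost.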

3.5 Complete Video Understanding Code

import argparse
import json
import os
import subprocess

# These must be set before transformers is imported
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"   # download models through the mirror
os.environ["CUDA_VISIBLE_DEVICES"] = "2"              # expose only GPU 2; adjust for your machine
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"             # only relevant if TensorFlow is installed

from transformers import pipeline

def speech2text(speech_file):
    # Download (on first use) and instantiate the Whisper model
    transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")
    text_dict = transcriber(speech_file)
    return text_dict

def extract_audio(input_file, output_file):
    """
    Extract the audio track from an MP4 file with FFmpeg and save it as MP3.

    :param input_file: path to the input MP4 file
    :param output_file: path to the output MP3 file
    """
    # Build the FFmpeg command
    ffmpeg_command = [
        'ffmpeg', '-i', input_file, '-vn', '-acodec', 'libmp3lame', output_file
    ]

    try:
        # Run the command; check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(ffmpeg_command, check=True)
        print(f"Audio successfully extracted from {input_file} to {output_file}")
    except subprocess.CalledProcessError as e:
        print(f"Processing error: {e}")

def main():
    parser = argparse.ArgumentParser(description="Video to text")
    parser.add_argument("--video", "-v", type=str, help="path to the input video file")
    parser.add_argument("--audio", "-a", type=str, help="path to the output audio file")

    args = parser.parse_args()
    print(args)

    # Stage 1: extract the audio; stage 2: transcribe it
    extract_audio(args.video, args.audio)
    text_dict = speech2text(args.audio)
    print("Text in the video:\n" + text_dict["text"])
    #print("Text in the video:\n" + json.dumps(text_dict, indent=4))

if __name__ == "__main__":
    main()
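
Two small conveniences that could be layered on top of the script are deriving the audio path from the video path when `--audio` is omitted, and saving the transcription result next to the video. The following is a sketch of both; `default_audio_path` and `save_transcript` are hypothetical helpers and are not part of the script above.

```python
import json
from pathlib import Path


def default_audio_path(video_path):
    """Return the video path with its extension replaced by .mp3."""
    return str(Path(video_path).with_suffix(".mp3"))


def save_transcript(text_dict, video_path):
    """Write the transcription dict as UTF-8 JSON next to the video; return the path."""
    out_path = Path(video_path).with_suffix(".json")
    # ensure_ascii=False keeps non-English text readable in the file
    out_path.write_text(
        json.dumps(text_dict, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return str(out_path)


print(default_audio_path("clips/demo.mp4"))
```

In `main()`, this would amount to using `args.audio or default_audio_path(args.video)` as the audio path and calling `save_transcript(text_dict, args.video)` after transcription.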