AI Tools: FastRTC (Building Real-Time Audio/Video AI Applications): Introduction, Installation, Usage, and Example Applications

一个处女座的程序猿

Last modified: 2025-04-21 00:35:16

Categories: AI/AGI; Tool/IDE etc. Tags: LLM, FastRTC

First published: 2025-04-19 13:54:05

Article link: https://blog.csdn.net/qq_41185868/article/details/147349919

Introduction to FastRTC

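FastRTC is a real-time communication library for Python that simplifies the real-time processing of audio and video streams. With FastRTC, developers can turn any Python function into a real-time audio or video stream served over WebRTC or WebSocket. The library is particularly well suited to building real-time audio/video AI applications such as voice assistants, live translation, and video chatbots.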

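FastRTC is designed to streamline the development of real-time audio/video AI applications, so that developers can focus on their core logic without needing in-depth knowledge of the underlying communication protocols. Its feature set and flexible architecture make it a strong choice for building real-time communication applications.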

GitHub: https://github.com/gradio-app/fastrtc

Official website: FastRTC

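1. Core features

>> Automatic voice detection and turn-taking: FastRTC has built-in voice activity detection (VAD) that detects when the user is speaking and handles conversational turn-taking automatically, so developers do not have to manage the start and end of each turn by hand.

>> Built-in UI and fast deployment: the .ui.launch() method starts a Gradio-based WebRTC UI for quick testing and sharing. The .mount(app) method mounts a FastRTC stream on a FastAPI app, making it easy to integrate into an existing production system.

>> Multi-protocol support: both WebRTC and WebSocket are supported, to suit different front ends.

>> Temporary phone access: the .fastphone() method provides a free temporary phone number for calling into the stream (a Hugging Face token is required).

>> Extensible backend: because a Stream can be mounted on a FastAPI app, developers can add their own routes and logic around it to meet the more complex needs of production environments.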
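
To make the turn-taking idea concrete, here is a minimal sketch of pause detection over audio frames. It is an energy-threshold toy written for illustration, not FastRTC's actual detector; the names `frame_rms` and `detect_turn_end`, the loudness threshold, and the frame count are all assumptions.

```python
import math
from typing import Optional, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of integer samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_turn_end(
    frames: Sequence[Sequence[int]],
    speech_threshold: float = 500.0,  # assumed loudness cutoff
    pause_frames: int = 3,            # assumed number of quiet frames that ends a turn
) -> Optional[int]:
    """Return the index of the frame at which the speaker's turn ends, or None.

    A turn ends once speech has been heard and is then followed by
    `pause_frames` consecutive quiet frames.
    """
    heard_speech = False
    quiet_run = 0
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= speech_threshold:
            heard_speech = True
            quiet_run = 0  # speech resets the pause counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= pause_frames:
                return i
    return None
```

Silence before the first loud frame is ignored, so the detector only fires after someone has actually spoken. A handler wrapped in ReplyOnPause is spared this bookkeeping, which is the point of the feature.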

Installing and Using FastRTC

1. Installation

Install FastRTC with pip:

pip install fastrtc

To use the built-in voice activity detection (VAD) and text-to-speech (TTS) features, install the corresponding extras:

pip install "fastrtc[vad, tts]"
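
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a small sketch using only the standard library; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("fastrtc")
    print(f"fastrtc {version}" if version else "fastrtc is not installed")
```

If this reports that fastrtc is not installed even though pip succeeded, pip and the interpreter most likely belong to different environments.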

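2. Usage

Quick start (each snippet assumes `stream` is a Stream object, created as in the examples in section 3)

Launch the built-in UI: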

stream.ui.launch()

Mount on a FastAPI app:

stream.mount(app)

Audio-only phone access:

stream.fastphone()
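
Since the three entry points are alternatives, one script can pick among them at startup. The sketch below does this with a mode string; the function `launch_stream` and the `MODE` environment variable are assumptions made for this example, while `.ui.launch()`, `.fastphone()` and `.mount(app)` are the methods listed above.

```python
import os
from typing import Any, Optional


def launch_stream(stream: Any, app: Optional[Any] = None, mode: Optional[str] = None) -> str:
    """Start `stream` in the requested mode and return the mode that was used.

    "UI"      -> stream.ui.launch()  (built-in Gradio UI)
    "PHONE"   -> stream.fastphone()  (temporary phone number)
    otherwise -> stream.mount(app)   (serve through an existing FastAPI app)
    """
    # MODE is a hypothetical environment variable chosen for this sketch.
    mode = (mode or os.getenv("MODE", "API")).upper()
    if mode == "UI":
        stream.ui.launch()
    elif mode == "PHONE":
        stream.fastphone()
    else:
        if app is None:
            raise ValueError("mounting requires a FastAPI app")
        stream.mount(app)
        mode = "API"
    return mode
```

In the mounted case the function only attaches the stream; the FastAPI app still has to be served, for example with uvicorn.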

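FastAPI integration

from pathlib import Path

from fastapi import FastAPI
from fastapi.responses import HTMLResponse

# `stream` is a fastrtc.Stream object defined elsewhere (see the examples in section 3)
app = FastAPI()
# Mount the Stream on the FastAPI app
stream.mount(app)

# (Optional) add a route for the home page
@app.get("/")
async def index():
    return HTMLResponse(content=Path("index.html").read_text(encoding="utf-8"))

# Start the server (assumes this file is saved as app.py):
# uvicorn app:app --host 0.0.0.0 --port 8000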

3. Examples

Audio echo

from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]):
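    # This function receives the audio the user spoke up to the pause
    # Any generator that yields audio can be implemented here
    # For a more complete example, see the LLM voice chat example below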
    yield audio

stream = Stream(
    handler=ReplyOnPause(echo),
    modality="audio", 
    mode="send-receive",
)

LLM voice chat

from fastrtc import (
    ReplyOnPause, AdditionalOutputs, Stream,
    audio_to_bytes, aggregate_bytes_to_16bit
)
import gradio as gr
import numpy as np  # needed for np.ndarray and np.frombuffer below
from groq import Groq
import anthropic
from elevenlabs import ElevenLabs

groq_client = Groq()
claude_client = anthropic.Anthropic()
tts_client = ElevenLabs()


# For an example of keeping conversation history, see "Talk to Claude" in the Cookbook
def response(
    audio: tuple[int, np.ndarray],
):
    # Send the audio to a Whisper model for transcription
    prompt = groq_client.audio.transcriptions.create(
        file=("audio-file.mp3", audio_to_bytes(audio)),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",
    ).text

    # Send the transcript to Claude to get a reply
    # (named `reply` so it does not shadow this function's own name)
    reply = claude_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Extract the plain text from the model output
    response_text = " ".join(
        block.text
        for block in reply.content
        if getattr(block, "type", None) == "text"
    )

    # Use ElevenLabs TTS to turn the reply text into an audio stream
    iterator = tts_client.text_to_speech.convert_as_stream(
        text=response_text,
        voice_id="JBFqnCBsd6RMkjVDRZzb",
        model_id="eleven_multilingual_v2",
        output_format="pcm_24000"
    )
    # Aggregate the byte stream into 16-bit PCM arrays and yield them chunk by chunk
    for chunk in aggregate_bytes_to_16bit(iterator):
        audio_array = np.frombuffer(chunk, dtype=np.int16).reshape(1, -1)
        yield (24000, audio_array)

stream = Stream(
    modality="audio",
    mode="send-receive",
    handler=ReplyOnPause(response),
)
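
The last step of the handler depends on every chunk containing whole 16-bit samples, because a chunk with an odd number of bytes cannot be read as int16. The sketch below shows one way to do that re-chunking in plain Python. It illustrates the idea only and is not the library's implementation of `aggregate_bytes_to_16bit`; the names `align_to_16bit` and `decode_int16_le` are hypothetical, and little-endian byte order is an assumption.

```python
import struct
from typing import Iterable, Iterator, Tuple


def align_to_16bit(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Re-chunk a byte stream so that every yielded chunk has an even length."""
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)  # largest even prefix length
        if usable:
            yield data[:usable]
        leftover = data[usable:]  # zero or one byte carried into the next chunk
    # A single trailing byte cannot form a sample, so it is dropped.


def decode_int16_le(chunk: bytes) -> Tuple[int, ...]:
    """Decode an even-length chunk of little-endian 16-bit PCM into integers."""
    return struct.unpack(f"<{len(chunk) // 2}h", chunk)
```

For example, the chunks `b"\x01"`, `b"\x00\xff"`, `b"\xff"` split samples across chunk boundaries; after alignment they decode to the samples 1 and -1.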

Webcam stream

from fastrtc import Stream
import numpy as np

def flip_vertically(image):
    # Flip the image vertically
    return np.flip(image, axis=0)

stream = Stream(
    handler=flip_vertically,
    modality="video",
    mode="send-receive",
)
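
For readers unsure what flipping along axis 0 does, here is a dependency-free sketch on a nested-list "image" in which each inner list is one row of pixels. The helper names `flip_rows` and `flip_columns` are hypothetical; `flip_rows` mirrors what `np.flip(image, axis=0)` does to the row order.

```python
from typing import List, TypeVar

T = TypeVar("T")


def flip_rows(image: List[List[T]]) -> List[List[T]]:
    """Vertical flip: reverse the order of the rows, leave each row unchanged."""
    return [list(row) for row in reversed(image)]


def flip_columns(image: List[List[T]]) -> List[List[T]]:
    """Horizontal mirror: keep the row order, reverse each row."""
    return [list(reversed(row)) for row in image]
```

With `image = [[1, 2], [3, 4], [5, 6]]`, the vertical flip moves the bottom row to the top, while the horizontal mirror swaps the two columns.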

Object detection

from fastrtc import Stream
import gradio as gr
import cv2
from huggingface_hub import hf_hub_download
# Relative import: this file must sit inside the cloned project as part of a package.
# When running it as a standalone script, use `from inference import YOLOv10` instead.
from .inference import YOLOv10

# Download the ONNX model file from the Hugging Face Hub
model_file = hf_hub_download(
    repo_id="onnx-community/yolov10n", filename="onnx/model.onnx"
)

# Clone the project to get the YOLOv10 implementation:
# git clone https://huggingface.co/spaces/fastrtc/object-detection
model = YOLOv10(model_file)

def detection(image, conf_threshold=0.3):
    # Resize the input image to the size the model expects
    image = cv2.resize(image, (model.input_width, model.input_height))
    # Run object detection and draw the bounding boxes
    new_image = model.detect_objects(image, conf_threshold)
    # Finally, resize to the display size
    return cv2.resize(new_image, (500, 500))

stream = Stream(
    handler=detection,
    modality="video",
    mode="send-receive",
    additional_inputs=[
        # Slider for the confidence threshold
        gr.Slider(minimum=0, maximum=1, step=0.01, value=0.3)
    ]
)
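
The slider value reaches the handler as `conf_threshold`. As a rough sketch of what such a threshold typically does, the code below keeps only detections at or above it. The `Detection` type and `filter_detections` function are hypothetical and are not taken from the YOLOv10 implementation in the cloned project; whether that implementation compares with `>=` or `>` is not something this sketch claims to know.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection result: a class label and a score in [0, 1]."""
    label: str
    confidence: float


def filter_detections(
    detections: List[Detection], conf_threshold: float = 0.3
) -> List[Detection]:
    """Keep the detections whose confidence is at least `conf_threshold`."""
    if not 0.0 <= conf_threshold <= 1.0:
        raise ValueError("conf_threshold must be between 0 and 1")
    return [d for d in detections if d.confidence >= conf_threshold]
```

Raising the threshold trades recall for precision: fewer boxes are drawn, and the ones that remain are those the model scored highest.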