Building an online PDF read-aloud service with pdf.js + FastAPI + OpenAI TTS

风翔

Last modified: 2024-12-16 19:22:51

First published: 2024-12-05 11:10:00

Article link: https://blog.csdn.net/sequoia00/article/details/144259821

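I used to read English PDFs with NaturalReader to check my pronunciation of English words, but NaturalReader's LLM-based natural voices are expensive: about $110 a year, which hurts. So I decided to write my own online service for reading PDFs aloud. (It implements only a few simple features; matching NaturalReader would take considerably more development.)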

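This article shows how to use pdf.js + FastAPI + OpenAI TTS to build a web service that supports uploading and displaying PDF files and converting text to speech (Text-to-Speech, TTS). The front end uses viewer.html from the widely used pdf.js project to display PDFs.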

For integrating pdf.js with FastAPI, see this article:
How to integrate pdf.js's viewer.html with FastAPI, with .mjs support

Project source code:
TTS_reader_pdf_Online
https://gitcode.com/sequoia00/TTS_reader_pdf_Online/overview

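Project overview

The project is a FastAPI-based web service with two main features:

PDF upload and display: users upload PDF files and browse them online through the embedded pdf.js viewer.
Text-to-speech (TTS): users submit text, and the service converts it to speech audio that can be downloaded or played online.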

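Tech stack

Back end: FastAPI
Front end: viewer.html from pdf.js
Audio processing: pydub
Text-to-speech: OpenAI API (through an API aggregator endpoint)
Other: CORS middleware, static file serving, etc.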

Project structure

project/
├── main.py           # main FastAPI application file
├── static/
│   ├── files/        # uploaded PDF files
│   ├── web/          # pdf.js viewer.html and its static resources
│   └── ...           # other static resources
└── audio_cache/      # generated audio files

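Main features

1. Environment setup and dependencies

First, make sure the following Python libraries are installed (python-multipart is required by FastAPI to parse the upload form used later):

pip install fastapi uvicorn pydantic openai pydub python-multipart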

pydub relies on ffmpeg for audio processing, so ffmpeg must be installed as well:

Ubuntu:
```
sudo apt-get install ffmpeg
```
macOS (with Homebrew):
```
brew install ffmpeg
```
Windows:
Download ffmpeg and add it to the PATH environment variable.
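
Before starting the service it can help to confirm that ffmpeg is actually reachable. The following is a minimal sketch, not part of the project code; ffmpeg_available is a hypothetical helper name, and it only checks that an executable called ffmpeg is on PATH.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an executable named ffmpeg can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found; install it before using pydub with MP3 files")
```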

2. FastAPI application configuration

Middleware and static file serving

import os

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.staticfiles import StaticFiles

app = FastAPI()

# Allowed cross-origin sources; "*" allows all origins
origins = ["*"]

app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,         # allowed origins
    allow_credentials=True,
    allow_methods=["*"],           # allowed HTTP methods
    allow_headers=["*"],           # allowed request headers
)

# Directory where uploaded files are saved
UPLOAD_DIRECTORY = "static/files"

# Create the upload directory if it does not exist yet
os.makedirs(UPLOAD_DIRECTORY, exist_ok=True)

# Serve static files so that uploaded PDFs are reachable by URL
app.mount("/static/files", StaticFiles(directory=UPLOAD_DIRECTORY), name="static_files")
app.mount("/static/web", StaticFiles(directory="static/web"), name="static_web")

# Mount the remaining static files
app.mount("/static", StaticFiles(directory="static"), name="static")

Root path redirect

The root path / redirects to viewer.html and loads a PDF named compress.pdf.

from fastapi.responses import RedirectResponse

@app.get("/")
def root():
    return RedirectResponse(url="/static/web/viewer.html?file=/static/files/compress.pdf")

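3. PDF upload and display

Filename sanitization and upload

To keep file names safe and unique, the name supplied with the upload is sanitized before the file is saved. The handler in this section rejects a name that already exists; appending a UUID is another way to avoid collisions.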

from fastapi import UploadFile, File, Form, HTTPException
from fastapi.responses import JSONResponse
import os
import shutil
import hashlib  # used later for the TTS cache keys

def sanitize_filename(name: str) -> str:
    # Keep only letters, digits, spaces, dots, underscores and hyphens
    return "".join(c for c in name if c.isalnum() or c in (' ', '.', '_', '-')).rstrip()

@app.post("/upload-pdf")
async def upload_pdf(file: UploadFile = File(...), custom_name: str = Form(...)):
    # Accept PDF uploads only
    if file.content_type != 'application/pdf':
        raise HTTPException(status_code=400, detail="File type must be PDF")
    
    sanitized_name = sanitize_filename(custom_name)
    if not sanitized_name:
        return JSONResponse(status_code=400, content={"success": False, "error": "Invalid file name"})
    
    unique_filename = f"{sanitized_name}.pdf"
    file_path = os.path.join(UPLOAD_DIRECTORY, unique_filename)

    # Refuse to overwrite an existing file
    if os.path.exists(file_path):
        return JSONResponse(status_code=400, content={"success": False, "error": "File name already exists, please use another name"})
    
    try:
        # Copy the uploaded stream to disk
        with open(file_path, "wb") as buffer:
            shutil.copyfileobj(file.file, buffer)
    except Exception:
        raise HTTPException(status_code=500, detail="Error during upload")
    finally:
        file.file.close()
    
    file_relative_path = f"/static/files/{unique_filename}"
    return JSONResponse(content={"success": True, "file_path": file_relative_path})
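
The text above mentions UUIDs, while the handler simply rejects a name that is already taken. A minimal sketch of the UUID variant might look like the following. It is an illustration under assumptions, not the project's code: unique_pdf_name is a hypothetical helper, and the 8-character suffix length is an arbitrary choice.

```python
import os
import uuid


def sanitize_filename(name: str) -> str:
    """Same whitelist as the upload handler: letters, digits, space, dot, underscore, hyphen."""
    return "".join(c for c in name if c.isalnum() or c in (" ", ".", "_", "-")).rstrip()


def unique_pdf_name(directory: str, custom_name: str) -> str:
    """Return '<name>.pdf', or '<name>_<8 hex chars>.pdf' if the plain name is taken."""
    base = sanitize_filename(custom_name)
    if not base:
        raise ValueError("invalid file name")
    candidate = f"{base}.pdf"
    # Keep drawing random suffixes until the name is free in the directory
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = f"{base}_{uuid.uuid4().hex[:8]}.pdf"
    return candidate
```

With a helper like this, the handler could save the file under the returned name instead of answering with a 400 error when the name is already in use.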

Listing uploaded PDF files

The /list-pdfs endpoint lists every uploaded PDF file together with its URL.

@app.get("/list-pdfs")
async def list_pdfs():
    try:
        files = os.listdir(UPLOAD_DIRECTORY)
        pdf_files = [
            {
                "name": file,
                "url": f"/static/files/{file}"
            }
            for file in files if file.lower().endswith(".pdf")
        ]
        return JSONResponse(content={"success": True, "files": pdf_files})
    except Exception:
        raise HTTPException(status_code=500, detail="Unable to retrieve the file list")

4. Text-to-speech (TTS)

Configuring the OpenAI client

from openai import OpenAI

api_key = "***"  # replace with your API key
client = OpenAI(
    base_url="https://api/v1",
    api_key=api_key
)

Audio cache directory

To avoid regenerating the same audio, generated files are cached in the audio_cache directory.

CACHE_DIR = "audio_cache"
os.makedirs(CACHE_DIR, exist_ok=True)

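Single-passage text-to-speech endpoint

The /text-to-speech/ endpoint takes the submitted text and returns the corresponding speech audio. If audio for the same text is already cached, the cached file is returned directly.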

from pydantic import BaseModel
from fastapi.responses import Response

class TextToSpeechRequest(BaseModel):
    user_input: str

@app.post("/text-to-speech/")
async def text_to_speech(request: TextToSpeechRequest):
    user_input = request.user_input
    try:
        # The MD5 hash of the text is the cache key
        text_hash = hashlib.md5(user_input.encode('utf-8')).hexdigest()
        audio_path = os.path.join(CACHE_DIR, f"{text_hash}.mp3")

        # Call the TTS API only on a cache miss
        if not os.path.exists(audio_path):
            with client.audio.speech.with_streaming_response.create(
                model="tts-1",
                voice="nova",
                input=user_input,
            ) as response:
                response.stream_to_file(audio_path)

        # Return the cached or freshly generated file
        with open(audio_path, "rb") as f:
            audio_data = f.read()
        return Response(content=audio_data, media_type="audio/mpeg")
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
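
One consequence of keying the cache on the text alone is that changing the model or voice later would still return the old audio. A possible refinement, shown here only as a sketch, is to fold both into the key. The cache_path helper is hypothetical, and the defaults simply mirror the "tts-1" and "nova" values used in the article.

```python
import hashlib
import os

CACHE_DIR = "audio_cache"  # same directory name as in the article


def cache_path(text: str, model: str = "tts-1", voice: str = "nova") -> str:
    """Build a cache file path that changes whenever text, model, or voice changes."""
    # A NUL separator keeps ("a", "bc") and ("ab", "c") from producing the same key
    key = "\x00".join((model, voice, text)).encode("utf-8")
    return os.path.join(CACHE_DIR, f"{hashlib.md5(key).hexdigest()}.mp3")
```

The endpoint would then call cache_path(user_input) wherever it currently builds audio_path from the text hash.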

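Full-page text-to-speech

To support long text, the service splits it into chunks at sentence boundaries, generates speech for each chunk, and finally concatenates the pieces into one complete audio file.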

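import asyncio
import re
from typing import AsyncGenerator

from fastapi.responses import StreamingResponse
from pydub import AudioSegment

MAX_CHUNK_SIZE = 200  # maximum number of characters per chunk

def split_text_into_chunks(text: str, max_chunk_size: int = MAX_CHUNK_SIZE) -> list:
    # Split after sentence-ending punctuation (., !, ?) followed by spaces
    sentences = re.split(r'(?<=[.!?]) +', text)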
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_chunk_size:
            current_chunk += " " + sentence if current_chunk else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk)
            if len(sentence) > max_chunk_size:
                for i in range(0, len(sentence), max_chunk_size):
                    chunks.append(sentence[i:i + max_chunk_size])
                current_chunk = ""
            else:
                current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk)

    return chunks

async def generate_tts_audio(chunk: str) -> str:
    text_hash = hashlib.md5(chunk.encode('utf-8')).hexdigest()
    audio_path = os.path.join(CACHE_DIR, f"{text_hash}.mp3")

    if not os.path.exists(audio_path):
        try:
            with client.audio.speech.with_streaming_response.create(
                model="tts-1",
                voice="nova",
                input=chunk,
            ) as response:
                response.stream_to_file(audio_path)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"TTS generation failed: {str(e)}")

    return audio_path

def concatenate_audios(audio_paths: list, output_path: str) -> None:
    combined = AudioSegment.empty()
    for path in audio_paths:
        audio = AudioSegment.from_mp3(path)
        combined += audio
    combined.export(output_path, format="mp3")

@app.post("/page-to-speech/")
async def page_to_speech(request: TextToSpeechRequest):
    user_input = request.user_input.strip()
    if not user_input:
        raise HTTPException(status_code=400, detail="Input text is empty.")

    full_text_hash = hashlib.md5(user_input.encode('utf-8')).hexdigest()
    full_audio_path = os.path.join(CACHE_DIR, f"{full_text_hash}_full.mp3")

    # Serve the cached full-page audio if it already exists
    if os.path.exists(full_audio_path):
        return StreamingResponse(open(full_audio_path, "rb"), media_type="audio/mpeg")

    chunks = split_text_into_chunks(user_input)

    async def audio_generator() -> AsyncGenerator[bytes, None]:
        audio_paths = []
        for chunk in chunks:
            audio_path = await generate_tts_audio(chunk)
            audio_paths.append(audio_path)
            # Stream each chunk to the client as soon as it is ready
            with open(audio_path, "rb") as f:
                yield f.read()
            await asyncio.sleep(0)

        # Every chunk is cached at this point, so build the full-page file
        # from the complete, ordered list of chunk paths
        concatenate_audios(audio_paths, full_audio_path)

    return StreamingResponse(audio_generator(), media_type="audio/mpeg")
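
To make the chunking rule concrete, here is a compact, standalone copy of the splitter run on two tiny inputs. The small max_chunk_size values are chosen only so the behavior is visible; the service itself uses 200.

```python
import re


def split_text_into_chunks(text: str, max_chunk_size: int = 200) -> list[str]:
    """Group sentences into chunks of at most max_chunk_size characters."""
    sentences = re.split(r"(?<=[.!?]) +", text)
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chunk_size:
            # The sentence still fits: append it to the current chunk
            current = f"{current} {sentence}" if current else sentence
        else:
            if current:
                chunks.append(current)
            if len(sentence) > max_chunk_size:
                # A single oversized sentence is cut into fixed-size pieces
                chunks.extend(
                    sentence[i:i + max_chunk_size]
                    for i in range(0, len(sentence), max_chunk_size)
                )
                current = ""
            else:
                current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(split_text_into_chunks("One. Two! Three? Four.", 12))
    # ['One. Two!', 'Three? Four.']
    print(split_text_into_chunks("abcdefghij", 4))
    # ['abcd', 'efgh', 'ij']
```

Note that the regular expression only recognizes ., ! and ? followed by a space, so text without those sentence endings falls through to the fixed-size cut.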

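5. Front end: pdf.js's viewer.html

Place pdf.js's viewer.html under the static/web/ directory. Download the full pdf.js release package from the official pdf.js repository; it contains viewer.html and the resources it depends on.

viewer.html loads the specified PDF file and provides browsing features such as zoom, search, and page navigation. In this project the root path / redirects to viewer.html and loads compress.pdf by default. After a new PDF is uploaded, the front end can call the /list-pdfs endpoint to get the file list and update the list of PDFs shown to the user.

Example: loading an uploaded PDF

viewer.html takes the PDF to load from the file query parameter. For example, /static/web/viewer.html?file=/static/files/example.pdf loads example.pdf.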
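
File names that pass the upload handler may contain spaces, so it is safer to percent-encode them when building the viewer link. The sketch below assumes the /static/files/ and /static/web/ paths used in this article; viewer_url is a hypothetical helper, not part of pdf.js or the project.

```python
from urllib.parse import quote


def viewer_url(filename: str) -> str:
    """Build the viewer.html URL for a PDF stored under /static/files/."""
    # Encode the file name as part of a URL path first ...
    pdf_path = "/static/files/" + quote(filename)
    # ... then encode the whole path again, because it is itself the value
    # of the `file` query parameter
    return "/static/web/viewer.html?file=" + quote(pdf_path, safe="")


if __name__ == "__main__":
    print(viewer_url("example.pdf"))
    # /static/web/viewer.html?file=%2Fstatic%2Ffiles%2Fexample.pdf
```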

6. Running and testing

Make sure all dependencies are installed, then start the FastAPI server:

uvicorn main:app --reload

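Open http://localhost:8000/ in a browser; it redirects to viewer.html and loads the default compress.pdf.

Uploading a PDF file

Use a tool such as Postman, or a front-end page (you need to build the upload UI yourself), to send a POST request to /upload-pdf as multipart form data with the fields file and custom_name.

Text-to-speech

Send a POST request to /text-to-speech/ or /page-to-speech/ with a JSON body such as: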

{
    "user_input": "Hello, and welcome to the text-to-speech service."
}

The response is the generated audio data, which can be played online or downloaded.
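
For a quick test without Postman, the request can also be sent from a short script. This is a sketch using only the standard library; build_tts_request and fetch_audio are hypothetical names, and the base URL assumes the server is running locally on uvicorn's default port 8000.

```python
import json
import urllib.request


def build_tts_request(text: str,
                      base_url: str = "http://localhost:8000",
                      endpoint: str = "/text-to-speech/") -> urllib.request.Request:
    """Prepare a POST request carrying {"user_input": text} as JSON."""
    body = json.dumps({"user_input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_audio(text: str, out_path: str = "speech.mp3") -> str:
    """Send the request and save the returned MP3 bytes to out_path."""
    with urllib.request.urlopen(build_tts_request(text)) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```

Calling fetch_audio("Hello, and welcome to the text-to-speech service.") while the server is running should leave a playable speech.mp3 in the current directory; passing endpoint="/page-to-speech/" to build_tts_request targets the full-page endpoint instead.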

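Deployment suggestions

Once development is done, the application can be deployed in several ways:

Docker: package the application as a Docker image for easier deployment and management.
Cloud services: deploy on a provider such as AWS, GCP, or Azure using the services they offer.
Own server: set up the server environment yourself, install the required dependencies, and run the application.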

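Summary

This article showed how to use FastAPI to build a web service that combines PDF display with text-to-speech. It relies on pdf.js for PDF browsing and on OpenAI's TTS for speech generation, and the result is easy to extend. I hope it helps with your own work on similar projects.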