[Linux] [FunASR] A simple voice-input setup
Python environment
Install torch in a virtual environment (conda/miniconda/micromamba); micromamba is used as the example here:
MY_TORCH_ENV_NAME="torch"
pip_mirror="https://pypi.tuna.tsinghua.edu.cn/simple/"
micromamba create -n $MY_TORCH_ENV_NAME python=3.11
micromamba run -n $MY_TORCH_ENV_NAME pip install -i $pip_mirror torch==2.0.1
micromamba run -n $MY_TORCH_ENV_NAME pip install -i $pip_mirror jieba modelscope funasr_onnx
micromamba run -n $MY_TORCH_ENV_NAME pip install -i $pip_mirror gevent flask
# Download the model
micromamba run -n $MY_TORCH_ENV_NAME modelscope download "damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx"
The torch version is not strictly required; 2.0.1 is simply one known-good version, and any other version supported by FunASR also works.
torch and the NVIDIA packages it depends on are large; you can search for a PyTorch offline-installation guide and install them offline instead.
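After the download finishes, you can confirm the model landed where the server script later expects it. The path below follows modelscope's default cache root (`~/.cache/modelscope/hub`, the same location the server code uses); if your modelscope version uses a different cache directory, adjust accordingly:

```python
# Check that the ONNX model was downloaded into modelscope's default cache.
# The cache root is an assumption matching the server script in this article.
from pathlib import Path

model_dir = (Path.home() / ".cache" / "modelscope" / "hub" / "damo"
             / "speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx")
print(model_dir)
print("downloaded:", model_dir.is_dir())
```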
API service
#!/usr/bin/env python
from flask import Flask, request

app = Flask(__name__)
model = None

@app.route('/stt', methods=['GET'])
def stt():
    wav_file = request.args.get("wav_file", "").strip()
    result = model([wav_file])
    return format_text_with_timestamps(result)

# Convert model output to text (split tokens where the endpoint timestamps show a gap)
def format_text_with_timestamps(result):
    r_arr = []
    for obj in result:
        text = obj['preds']
        timestamps = obj['timestamp']
        formatted_text = []
        for i, (word, (start, end)) in enumerate(zip(text.split(), timestamps)):
            formatted_text.append(word)
            if i < len(timestamps) - 1:
                next_start = timestamps[i + 1][0]
                # insert a space only when the next token does not start
                # exactly where this one ended (i.e. there was a pause)
                if next_start != end:
                    formatted_text.append(' ')
        r_arr.append(''.join(formatted_text))
    return '\n'.join(r_arr)

if __name__ == '__main__':
    server = None
    try:
        from gevent.pywsgi import WSGIServer
        server = WSGIServer(('127.0.0.1', 9978), app)

        from pathlib import Path
        model_root = Path.home() / '.cache' / 'modelscope' / 'hub'
        model_dir = model_root / "damo" / "speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-onnx"
        from funasr_onnx import Paraformer
        model = Paraformer(str(model_dir), batch_size=1, quantize=True)

        import os
        os.system("notify-send -e 'funstt started' 'Model loaded'")
        server.serve_forever()
    finally:
        if server:
            server.stop()
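The timestamp-gap logic can be sanity-checked on its own, without the model or the server. The snippet below inlines a copy of the function so it runs standalone; the mock input shape ('preds' text plus per-token [start, end] pairs) is inferred from the code above, not verified against every funasr_onnx version:

```python
# Same logic as format_text_with_timestamps in server.py, copied here so
# the snippet is self-contained.
def format_text_with_timestamps(result):
    r_arr = []
    for obj in result:
        timestamps = obj['timestamp']
        formatted_text = []
        for i, (word, (start, end)) in enumerate(zip(obj['preds'].split(), timestamps)):
            formatted_text.append(word)
            # a space marks a pause: the next token starts later than this one ended
            if i < len(timestamps) - 1 and timestamps[i + 1][0] != end:
                formatted_text.append(' ')
        r_arr.append(''.join(formatted_text))
    return '\n'.join(r_arr)

# Mock model output: two contiguous tokens, a pause, two more contiguous tokens
mock = [{'preds': '你 好 世 界',
         'timestamp': [[0, 100], [100, 200], [350, 450], [450, 550]]}]
print(format_text_with_timestamps(mock))  # -> 你好 世界
```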
Save this to a file such as server.py, then start the service:
micromamba run -n $MY_TORCH_ENV_NAME python server.py
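With the service running you can exercise the endpoint directly. A minimal client sketch using only the standard library is shown below; note that the wav path goes into a query string, so URL-encoding it is safer than interpolating it raw (simple paths like /tmp/_my_asr.wav work either way):

```python
# Minimal client for the /stt endpoint (host/port taken from server.py above)
from urllib.parse import urlencode
from urllib.request import urlopen

def stt_url(wav_file, base="http://127.0.0.1:9978/stt"):
    # urlencode handles spaces and special characters in the file path
    return base + "?" + urlencode({"wav_file": wav_file})

url = stt_url("/tmp/_my_asr.wav")
print(url)  # -> http://127.0.0.1:9978/stt?wav_file=%2Ftmp%2F_my_asr.wav
# text = urlopen(url).read().decode("utf-8")  # uncomment with the server running
```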
Hotkey script
Install wl-clipboard and ydotool on the system (on X11, xclip and xdotool can be used instead).
#!/usr/bin/env bash
# First call starts recording; the second call stops it, transcribes, and types the result
RECORD_FILE="/tmp/_my_asr.wav"
PID_FILE=/tmp/_my_asr.pid
if [ -f "$PID_FILE" ]; then
    pid=$(cat "$PID_FILE")
    kill -SIGTERM "$pid" && rm "$PID_FILE"
    curl "http://127.0.0.1:9978/stt?wav_file=$RECORD_FILE" | wl-copy
    rm "$RECORD_FILE"
    # Simulate Ctrl+V (29 = KEY_LEFTCTRL, 47 = KEY_V; :1 is press, :0 is release)
    ydotool key 29:1 47:1 47:0 29:0
else
    notify-send -e "Hint" "Press the same hotkey again to insert the text"
    arecord -f cd "$RECORD_FILE" &
    pid=$!
    echo "$pid" > "$PID_FILE"
fi
Bind this script to a system keyboard shortcut: with a text field focused, press the hotkey to start recording, then press it again to insert the transcription.
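The press-to-start / press-again-to-finish behavior works because the PID file doubles as a one-bit state flag. A minimal Python sketch of that toggle (hypothetical demo path, no actual recording or transcription):

```python
# Sketch of the PID-file toggle from the bash script; the path and PID
# value are demo placeholders, and no recorder process is spawned.
import tempfile
from pathlib import Path

PID_FILE = Path(tempfile.gettempdir()) / "_my_asr_demo.pid"
PID_FILE.unlink(missing_ok=True)  # start the demo from a clean state

def on_hotkey():
    """Mirror the script's branch: the PID file's existence decides the action."""
    if PID_FILE.exists():
        PID_FILE.unlink()            # second press: stop recorder, transcribe, paste
        return "stop-and-transcribe"
    PID_FILE.write_text("12345")     # first press: would spawn arecord and save its PID
    return "start-recording"

print(on_hotkey())  # -> start-recording
print(on_hotkey())  # -> stop-and-transcribe
```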
Parts of this article are based on the open-source project stt.