Translation: audio_fingerprint_beginner - Song Recognition Based on Audio Feature Vectors

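This article introduces an audio recognition system that uses Towhee for audio feature extraction and Milvus for database management. The tutorial covers the concept of audio fingerprinting, the Towhee operator audio_embedding.nnfp, setting up the Milvus service, data preparation, and building the system and evaluating its performance.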


Audio Fingerprint I: Build a Demo with Towhee & Milvus

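Translated from Towhee's open-source project audio-fingerprint-beginner.

Some personal opinions and comments have been added. This is a machine translation and is for reference only; reading the original first is recommended.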

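Audio fingerprinting is the process of extracting features to represent digital audio. Typically, the process cuts the input audio into shorter segments of fixed length and converts each segment into a single fingerprint piece of fixed size. Sorting all the pieces by timestamp yields a complete fingerprint for the input audio.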
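
As a rough illustration of the segment-then-embed process just described, here is a minimal Python sketch. The helper names (`split_into_segments`, `toy_fingerprint`, `fingerprint`), the 8000 Hz sample rate, and the hand-made feature (mean absolute amplitude per chunk, L2-normalized) are assumptions for illustration only; the system in this tutorial uses a trained neural model instead.

```python
import math


def split_into_segments(samples, sample_rate, segment_seconds=1.0):
    """Cut a waveform into fixed-length, non-overlapping segments (the incomplete tail is dropped)."""
    seg_len = int(sample_rate * segment_seconds)
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def toy_fingerprint(segment, dim=128):
    """Hypothetical stand-in for a learned model: mean absolute amplitude of `dim` chunks, L2-normalized."""
    chunk = len(segment) // dim
    vec = [sum(abs(x) for x in segment[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def fingerprint(samples, sample_rate, dim=128):
    """Full fingerprint: one fixed-size vector per segment, kept in time order."""
    return [toy_fingerprint(seg, dim) for seg in split_into_segments(samples, sample_rate)]


sample_rate = 8000  # assumed sample rate for the toy signal
audio = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(int(3.5 * sample_rate))]
fp = fingerprint(audio, sample_rate)
print(len(fp), len(fp[0]))  # 3 segments, 128 values each
```

The 3.5-second toy signal yields three one-second pieces; the trailing half second is discarded in this sketch.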


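With audio fingerprints as identities, the system can recognize music even after various transformations. This tutorial uses Towhee as the feature extractor and Milvus as the database to build a simple music recognition demo. It has 4 parts; the last two are optional and cover evaluation and a user interface.

1. Prepare packages, data, and the Milvus service in advance
2. Build the system and understand the key components
3. Evaluate system performance on all example data
4. Play online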


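Preparation

We need to install some Python packages, prepare the example data, and set up the Milvus service.

Dependencies

Install the following Python packages with the proper versions. The command below installs them with pip, and the cell after it tries to import all the packages in Python. If anything fails or raises an unexpected error, install the required packages manually in your environment.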

! python -m pip install -q towhee towhee.models gradio ipython scikit-learn

import os
import pandas as pd

import IPython
import glob

from towhee import pipe, ops
from towhee.datacollection import DataCollection
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
from sklearn.metrics import accuracy_score

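Data

The example data uses a subset of GTZAN as candidates. The query audio files are derived from each candidate: 10 segments are randomly cropped and mixed with some random background noise. You can download the data from GitHub with the following commands: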

! curl -L https://github.com/towhee-io/examples/releases/download/data/audio_fp.zip -O
! unzip -q -o audio_fp.zip
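
To give a feel for how noisy query clips of this kind could be produced, the sketch below crops a random window from a candidate and mixes in noise at a chosen signal-to-noise ratio. It is not the script used to create the dataset: the function names, the synthetic sine-wave candidate, the uniform noise, and the 5 dB SNR are all assumptions for illustration.

```python
import math
import random


def random_crop(samples, crop_len, rng):
    """Return a random contiguous window of `crop_len` samples."""
    start = rng.randrange(len(samples) - crop_len + 1)
    return samples[start:start + crop_len]


def power(samples):
    """Mean squared amplitude."""
    return sum(x * x for x in samples) / len(samples)


def mix_with_noise(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio equals `snr_db`, then add it to `signal`."""
    target_noise_power = power(signal) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


rng = random.Random(0)  # fixed seed so the example is repeatable
sample_rate = 8000      # assumed sample rate
candidate = [math.sin(2 * math.pi * 440 * i / sample_rate) for i in range(30 * sample_rate)]

clip = random_crop(candidate, 10 * sample_rate, rng)  # 10-second crop of a 30-second candidate
noise = [rng.uniform(-1.0, 1.0) for _ in clip]        # stand-in background noise
query = mix_with_noise(clip, noise, snr_db=5)

# Check the achieved SNR from the difference between the noisy and clean clips
residual = [q - c for q, c in zip(query, clip)]
measured_snr = 10 * math.log10(power(clip) / power(residual))
print(len(query), round(measured_snr, 2))
```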

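Read the csv file with pandas to take a look at the data.

The data directory "audio_fp" is organized as follows:

candidates: wav files, 30 seconds each (they can be shorter; in my test, 10-second candidates still allowed 2-second clips to be recognized fairly accurately)
queries: wav files, 10 seconds or shorter each (the wav files here are music with noise added)
ground_truth.csv: a csv file that maps each query to its answer among the candidates and also includes augmentation information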

df = pd.read_csv('audio_fp/ground_truth.csv')
df.head()

	query	answer
0	audio_fp/queries/n_1951069525.num10_audio_0417...	audio_fp/candidates/1951069525.num10.wav
1	audio_fp/queries/n_1951069525.num10_audio_2320...	audio_fp/candidates/1951069525.num10.wav
2	audio_fp/queries/n_1951069525.num10_audio_58aa...	audio_fp/candidates/1951069525.num10.wav
3	audio_fp/queries/n_1951069525.num10_audio_7a56...	audio_fp/candidates/1951069525.num10.wav
4	audio_fp/queries/n_1951069525.num10_audio_a5d8...	audio_fp/candidates/1951069525.num10.wav

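As the csv shows, the answer can be derived from the query path. For later use, we define a function, to be used as a temporary Towhee operator, that returns the ground truth for a given query path: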

def get_gt(query_path):
    filename = query_path.split('/')[-1]
    name = filename.split('_')[1]
    answer = os.path.join('audio_fp', 'candidates', name + '.wav')
    return answer
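
For example, applying the same rule to a made-up query path gives the matching candidate path. The query filename below is hypothetical (the real filenames are truncated in the table above), and `get_gt` is repeated so the snippet runs on its own.

```python
import os


def get_gt(query_path):
    # Same logic as above: the candidate name is the second '_'-separated token of the filename
    filename = query_path.split('/')[-1]
    name = filename.split('_')[1]
    return os.path.join('audio_fp', 'candidates', name + '.wav')


# Hypothetical query filename following the pattern n_<candidate>_audio_<id>.wav
query = 'audio_fp/queries/n_1951069525.num10_audio_example.wav'
answer = get_gt(query)
print(answer)
```

Note that this rule relies on candidate names containing no underscore; a name with an underscore would be cut short by the split.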

How does a query clip sound compared with the original music? Click the play buttons below to listen to a pair of example data:

example_query = df['query'][0]
example_candidate = df['answer'][0]

IPython.display.display(
    f'example query: {example_query}',
    IPython.display.Audio(example_query),
    f'example answer: {example_candidate}',
    IPython.display.Audio(example_candidate)
)

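Set up Milvus

The last thing to prepare is Milvus. For more options and detailed instructions, see the Milvus doc. If you need more help with Milvus, feel free to file an issue or join the discussion on the Milvus GitHub.

This notebook uses milvus 2.2.10 and pymilvus 2.2.11.

Deploying Milvus on a Linux virtual machine (VMware, WSL, ...) or on a server is recommended here.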

# Download docker yaml for Milvus standalone
! wget https://github.com/milvus-io/milvus/releases/download/v2.2.10/milvus-standalone-docker-compose.yml -O docker-compose.yml
# Run command below under the same directory as the docker yaml
! docker-compose up -d
# Install pymilvus
! python -m pip install -q pymilvus==2.2.11

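Key Components

By now you should have installed all the packages, downloaded the example data, and started the Milvus service. It is time to build the music recognition system. In this section you will learn about:
1. Audio fingerprint
2. Milvus collection
Before starting, let's define some global variables.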

HOST = 'Milvus server IP'  # placeholder: set this to your Milvus host, e.g. 'localhost'
PORT = '19530'
COLLECTION_NAME = 'nnfp'
INDEX_TYPE = 'IVF_FLAT'
METRIC_TYPE = 'IP'
DIM = 128
TOPK = 10

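Audio Fingerprint

Towhee makes it easy to build neural data processing pipelines for AI applications. It provides hundreds of models, algorithms, and transformations that can be used as standard pipeline building blocks. You can use Towhee's audio embedding operators to extract features from audio input. If you have any questions about this step, visit the Towhee GitHub.
In this tutorial we choose the Towhee operator audio_embedding.nnfp, which uses a pretrained deep learning model designed for audio retrieval. With the default configuration, it generates 1 embedding of dimension 128 for each second of audio, with no overlap. Let's look at the simplest fingerprinting step with an example candidate.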

emb_pipe = (
    pipe.input('url')
        .map('url', 'frames', ops.audio_decode.ffmpeg())
        .map('frames', 'embedding', ops.audio_embedding.nnfp())
        .output('embedding')
)

Flow:

Input url, the location of the music file
Run ops.audio_decode.ffmpeg() to decode the music
- From the operator documentation: Audio Decode converts the encoded audio back to uncompressed audio frames. In most cases, audio decoding is the first step of an audio processing pipeline.
- This step produces the frames field.
Run ops.audio_embedding.nnfp() to extract the audio vectors
- From the operator documentation: The audio embedding operator converts an input audio into a dense vector which can be used to represent the audio clip’s semantics. Each vector represents for an audio clip with a fixed length of around 1s. This operator generates audio embeddings with fingerprinting method introduced by Neural Audio Fingerprint. The model is implemented in Pytorch. We’ve also trained the nnfp model with FMA dataset (& some noise audio) and shared weights in this operator. The nnfp operator is suitable for audio fingerprinting.
- This step produces the embedding field.

emb_pipe = (
    pipe.input('url')
        .map('url', 'frames', ops.audio_decode.ffmpeg())
        .map('frames', 'embedding', ops.audio_embedding.nnfp())
        .output('embedding')
)
DataCollection(emb_pipe(example_candidate)).show()
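
As a quick sanity check on the output above: with one embedding per second and no overlap, the number of embeddings should roughly equal the audio length in seconds. The helper below is a hypothetical back-of-the-envelope calculation, not part of Towhee, and how the operator treats a trailing partial second is an assumption here (it is simply dropped).

```python
def expected_num_embeddings(duration_seconds, segment_seconds=1.0, hop_seconds=1.0):
    """Count fixed-length windows that fit in the audio; a trailing partial window is dropped."""
    if duration_seconds < segment_seconds:
        return 0
    return int((duration_seconds - segment_seconds) // hop_seconds) + 1


DIM = 128  # embedding dimension used in this tutorial
for seconds in (30, 10, 2.5):
    n = expected_num_embeddings(seconds)
    print(f'{seconds}s audio -> about {n} embeddings, shape ({n}, {DIM})')
```

By this estimate, a 30-second candidate contributes about 30 vectors to the database and a 10-second query is searched with about 10 vectors.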

Milvus Collection

To use Milvus, we need to connect to the service at the corresponding host and port.

connections.connect(host=HOST, port=PORT)

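Before inserting vectors, we first create a Milvus collection. The code below uses the predefined global variables to create a Milvus collection with 3 columns; if a collection with the same name already exists, it is dropped first:

id: primary key with integer values, auto-generated by Milvus
embedding: vector of length DIM with float values
path: audio path corresponding to the audio embedding

It also creates an index on the collection with the IVF_FLAT type and the inner product metric. (Milvus insert and search are handled by Towhee operators, so nothing further needs to be prepared for them in advance.)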

# Create Milvus collection
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, description='embedding ids', is_primary=True, auto_id=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, description='audio embeddings', dim=DIM),
    FieldSchema(name='path', dtype=DataType.VARCHAR, description='audio path', max_length=500)
    ]
schema = CollectionSchema(fields=fields, description='audio fingerprints')

if utility.has_collection(COLLECTION_NAME):
    collection = Collection(COLLECTION_NAME)
    collection.drop() # drop collection if it exists
    
collection = Collection(name=COLLECTION_NAME, schema=schema)

# Create index
index_params = {
    'metric_type': METRIC_TYPE,
    'index_type': INDEX_TYPE,
    'params':{"nlist":2048}
}

status = collection.create_index(field_name='embedding', index_params=index_params)
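
To make the index settings more concrete: IVF_FLAT is an approximate index that groups vectors into `nlist` clusters and scans only some of them per query, and the 'IP' metric ranks results by inner product (larger means more similar). The sketch below shows the exact, brute-force version of that ranking in plain Python. The stored vectors and paths are made-up toy data and the function names are hypothetical; Milvus itself is not involved.

```python
def inner_product(a, b):
    """Inner (dot) product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def brute_force_search(query, stored, topk):
    """Score every stored vector against the query and return the `topk` highest inner products."""
    scored = [(inner_product(query, vec), path) for vec, path in stored]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:topk]


# Toy 3-dimensional "fingerprints" (the real ones have DIM = 128)
stored = [
    ([1.0, 0.0, 0.0], 'song_a.wav'),
    ([0.0, 1.0, 0.0], 'song_b.wav'),
    ([0.5, 0.5, 0.0], 'song_c.wav'),
]
hits = brute_force_search([1.0, 0.0, 0.0], stored, topk=2)
print(hits)  # [(1.0, 'song_a.wav'), (0.5, 'song_c.wav')]
```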

Build System

We have learnt about the key technologies used in the system. Now let's build the system and query it with an example audio using Towhee DataCollection.

Insert


insert_pipe = (
    pipe.input('path')
        .map('path', 'frames', ops.audio_decode.ffmpeg())
        .flat_map('frames', 'fingerprints', ops.audio_embedding.nnfp())
        .map(('fingerprints', 'path'), 'milvus_res', ops.ann_insert.milvus_client(host=HOST, port=PORT, collection_name=COLLECTION_NAME))
        .output()
)


path = glob.glob('audio_fp/candidates/*.wav')

for i in path:
    res = insert_pipe(i)

With the Milvus collection ready, we generate embeddings for the candidate audios and insert all fingerprints into the Milvus collection. The Towhee pipeline above chains Towhee operators; the insert step requires the following operators:

map('path', 'frames', ops.audio_decode.ffmpeg()): decodes audio files into frames using ffmpeg.
flat_map('frames', 'fingerprints', ops.audio_embedding.nnfp()): generates fingerprints from the given list of audio frames and flattens them; 'frames' as input and 'fingerprints' as output.
map(('fingerprints', 'path'), 'milvus_res', ops.ann_insert.milvus_client(host=HOST, port=PORT, collection_name=COLLECTION_NAME)): inserts each fingerprint and the corresponding audio path into the Milvus collection; 'fingerprints' and 'path' as inputs and 'milvus_res' as output.
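
The difference between map and flat_map can be pictured with plain Python lists. This is only a conceptual sketch, not Towhee's implementation: `fake_fingerprints` is a hypothetical stand-in for the embedding operator, the durations are made up, and rows are modeled as dictionaries.

```python
def fake_fingerprints(path):
    """Hypothetical stand-in for an embedding operator: returns one item per second of audio."""
    seconds = {'a.wav': 3, 'b.wav': 2}[path]  # made-up durations
    return [f'{path}#fp{i}' for i in range(seconds)]


def map_rows(rows, fn):
    """map: one output row per input row; the whole list of fingerprints stays in one cell."""
    return [{**row, 'fingerprints': fn(row['path'])} for row in rows]


def flat_map_rows(rows, fn):
    """flat_map: one output row per fingerprint; each row keeps its 'path'."""
    return [{**row, 'fingerprints': fp} for row in rows for fp in fn(row['path'])]


rows = [{'path': 'a.wav'}, {'path': 'b.wav'}]
mapped = map_rows(rows, fake_fingerprints)
flattened = flat_map_rows(rows, fake_fingerprints)
print(len(mapped), len(flattened))  # 2 5
print(flattened[3])  # {'path': 'b.wav', 'fingerprints': 'b.wav#fp0'}
```

In the same spirit, flattening is what lets the following map step pair each single fingerprint with the path of the audio it came from.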

print(f'Total number of embeddings in the collection: {collection.num_entities}')

Total number of embeddings in the collection: 0

def vote(milvus_res):
    votes = {}
    for res in milvus_res:
        path = res[2]
        score = res[1]
        if path not in votes:
            votes[path] = score
        else:
            votes[path] = votes[path] + score
    votes = sorted(votes.items(), key=lambda item: item[1], reverse=True)
    return votes[0]

def select(pred, score):
    preds = {}
    for i, j in zip(pred, score):
        if i not in preds:
            preds[i] = j
        else:
            preds[i] += j
    
    final_preds = sorted(preds.items(), key=lambda item: item[1], reverse=True)
    return final_preds[0][0]
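
A small worked example may help show how these two functions combine per-second search results into one prediction. The search results below are made up, and the assumed layout of each hit, (id, score, path), is inferred from the indices used in `vote`. Condensed but equivalent copies of both functions are included so the snippet runs on its own.

```python
def vote(milvus_res):
    # Sum the scores of all hits per path and return the best (path, total_score)
    votes = {}
    for res in milvus_res:
        path, score = res[2], res[1]
        votes[path] = votes.get(path, 0) + score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)[0]


def select(pred, score):
    # Sum the per-embedding winning scores per path and return the best path
    preds = {}
    for p, s in zip(pred, score):
        preds[p] = preds.get(p, 0) + s
    return sorted(preds.items(), key=lambda item: item[1], reverse=True)[0][0]


# Made-up top-k hits for a 3-second query: one list of (id, score, path) per second
per_second_hits = [
    [(1, 0.75, 'a.wav'), (2, 0.5, 'a.wav'), (3, 0.875, 'b.wav')],
    [(4, 0.5, 'b.wav'), (5, 0.5, 'b.wav'), (6, 0.25, 'a.wav')],
    [(7, 0.75, 'a.wav'), (8, 0.25, 'b.wav')],
]

winners = [vote(hits) for hits in per_second_hits]
print(winners)  # [('a.wav', 1.25), ('b.wav', 1.0), ('a.wav', 0.75)]

result = select([w[0] for w in winners], [w[1] for w in winners])
print(result)  # a.wav
```

In the first second, 'b.wav' has the single highest score, but 'a.wav' wins because its two hits add up to more. Summing again across seconds makes 'a.wav' the final answer with 2.0 against 1.0.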

collection.load()
search_pipe = (
    pipe.input('path')
        .map('path', 'frames', ops.audio_decode.ffmpeg())
        .flat_map('frames', 'embs', ops.audio_embedding.nnfp())
        .map('embs', 'milvus_res', ops.ann_search.milvus_client(
            host=HOST,
            port=PORT,
            collection_name=COLLECTION_NAME,
            metric_type=METRIC_TYPE,
            limit=TOPK,
            output_fields=['path']
        ))
        .map('milvus_res', ('pred', 'score'), vote)
        .window_all(('pred', 'score'), 'result', select)
)


query_pipe = search_pipe.output('path', 'result')
DataCollection(query_pipe(example_query)).show()

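Evaluation

You have built the music recognition system and tried it with an example audio. But how does it perform overall? Here we measure the system's accuracy on all the example data. As the results below show, detecting a 10s segment takes about 1.1s on average (with 128 CPUs and 1 GPU), and the overall accuracy reaches 84%.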

eval_pipe = (
    search_pipe.map('path', 'ground_truth', get_gt)
        .output('result', 'ground_truth')
)

path = glob.glob('audio_fp/queries/*.wav')

preds = []
facts = []
for i in path:
    res = eval_pipe(i).get()
    preds.append(res[0])
    facts.append(res[1])

accuracy_score(facts, preds)
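
Accuracy here is simply the fraction of queries whose predicted path equals the ground-truth path, and the average time can be measured by timing each call. The sketch below shows both in plain Python. `evaluate`, `fake_truth`, and `fake_predict` are hypothetical names, and the stand-in predictor and query names are made up.

```python
import time


def evaluate(predict_fn, queries, truth_fn):
    """Return (accuracy, mean seconds per query) for `predict_fn` over `queries`."""
    correct = 0
    elapsed = 0.0
    for query in queries:
        start = time.perf_counter()
        pred = predict_fn(query)
        elapsed += time.perf_counter() - start
        correct += int(pred == truth_fn(query))
    return correct / len(queries), elapsed / len(queries)


def fake_truth(query):
    # Made-up rule: the answer is the prefix before '_' plus '.wav'
    return query.split('_')[0] + '.wav'


def fake_predict(query):
    # Made-up predictor that is wrong on exactly one query
    return 'wrong.wav' if query == 'b_2' else fake_truth(query)


accuracy, mean_seconds = evaluate(fake_predict, ['a_1', 'a_2', 'b_1', 'b_2'], fake_truth)
print(accuracy)  # 0.75
```

With the real system, `predict_fn` would wrap the search pipeline call and `truth_fn` would be `get_gt`; the timing then covers decoding, embedding, and the Milvus search together.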

Two ways to serve the system

gradio (official)
flask

gradio

import gradio

def query_function(query_path):
    pred = query_pipe(query_path).get()[1]
    return os.path.basename(pred)
    
interface = gradio.Interface(query_function, 
                             gradio.Audio(type='filepath'),
                             gradio.Label()
                            )

interface.launch(inline=True, share=True)

flask

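import os
import tempfile
from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename

app = Flask(__name__)


@app.route('/forward', methods=['POST'])
def forward_request():
    # Get the uploaded audio file from the multipart form field 'audio'
    audio_file = request.files.get('audio')
    if audio_file is None or not audio_file.filename:
        return jsonify({'error': "missing 'audio' file"}), 400

    # Save the upload to the temp directory, sanitizing the client-supplied filename
    temp_dir = tempfile.gettempdir()
    temp_file_path = os.path.join(temp_dir, secure_filename(audio_file.filename))
    audio_file.save(temp_file_path)

    # Run the recognition pipeline on the saved file and return the predicted song
    result = query_function(temp_file_path)
    return jsonify(result)


if __name__ == '__main__':
    app.run(host='0.0.0.0')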