Table of Contents
1.4 Resampling - dimensionality reduction
Speech recognition and exploratory data visualization
Based on the public kernel: https://www.kaggle.com/davids1992/speech-representation-and-data-exploration
0. Purpose
- Build a general framework for tackling speech recognition problems
- Observe the specific characteristics of this problem's data
Imports used, with brief notes
import os
from os.path import isdir, join
from pathlib import Path  # object-oriented filesystem paths
import pandas as pd  # tabular data handling
# Math
import numpy as np
from scipy.fftpack import fft  # only the FFT routine is needed, so it is imported directly
from scipy import signal  # spectrograms and resampling (librosa offers higher-level alternatives)
from scipy.io import wavfile  # reads wavs as raw integer PCM, unlike librosa.load which normalizes to floats
import librosa  # audio feature extraction (MFCC, mel spectrograms)
from sklearn.decomposition import PCA
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns  # statistical plotting on top of matplotlib
import IPython.display as ipd  # inline audio playback in notebooks
import librosa.display  # plotting helpers for waveforms and spectrograms
import plotly.offline as py  # interactive plots rendered offline
py.init_notebook_mode(connected=True)  # enable plotly rendering inside the notebook
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline  # Jupyter magic: render matplotlib figures inline
1. Observing and visualizing the input
There are two theories of human hearing: place theory (https://en.wikipedia.org/wiki/Place_theory_(hearing)), which is frequency-based, and temporal theory (https://en.wikipedia.org/wiki/Temporal_theory_(hearing)). In speech recognition there are two main tendencies: feeding in spectrograms (frequencies), or more sophisticated features such as MFCC (Mel-Frequency Cepstral Coefficients) and PLP. You rarely work with raw temporal data.
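To make the frequency-based view concrete, here is a minimal sketch using a synthetic 440 Hz tone (not the competition data) with scipy's spectrogram; the window and hop sizes are illustrative choices:

```python
import numpy as np
from scipy import signal

# Synthetic 1-second 440 Hz tone at 16 kHz, standing in for a real recording
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 440 * t)

# Frequency-domain view: energy concentrates in the bin nearest 440 Hz
freqs, times, spec = signal.spectrogram(audio, fs=sample_rate,
                                        nperseg=320, noverlap=160)
peak_freq = freqs[spec.mean(axis=1).argmax()]
```

With `nperseg=320` at 16 kHz the frequency bins are 50 Hz apart, so `peak_freq` lands within one bin of 440 Hz.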
The following lists the feature representations that can be used:
1.1 Wave and spectrogram (linear-frequency spectrum)
- Read in a wav as a waveform
train_audio_path = '../input/train/audio/'
filename = '/yes/0a7c2a8d_nohash_0.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
Note: scipy.io.wavfile.read returns the raw integer PCM samples (int16 for these files), not floats in [-1, 1]; librosa.load would return normalized floats.
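A self-contained way to check the sample format, writing a tiny synthetic wav to a temporary file instead of touching the competition data:

```python
import os
import tempfile
import numpy as np
from scipy.io import wavfile

# Write a synthetic 16-bit PCM wav (hypothetical file; the kernel's real
# files live under ../input/train/audio/)
sr = 16000
tone = (0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr) * 32767).astype(np.int16)
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
wavfile.write(path, sr, tone)

# scipy hands back the raw integer samples, not normalized floats
sample_rate, samples = wavfile.read(path)
```

Inspecting `samples.dtype` here gives `int16`, confirming that any normalization to [-1, 1] has to be done by hand when using scipy instead of librosa.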
- Compute the (linear) spectrogram
What is actually computed is the log of the linear spectrogram, for three reasons:
- details in the spectrogram plot are much easier to see
- it approximately matches the ear's logarithmic sensitivity to loudness
- the eps term keeps the argument of the log strictly positive
def log_specgram(audio, sample_rate, window_size=20,
                 step_size=10, eps=1e-10):  # 20 ms windows, 10 ms step
    nperseg = int(round(window_size * sample_rate / 1e3))
    noverlap = int(round(step_size * sample_rate / 1e3))
    freqs, times, spec = signal.spectrogram(audio, fs=sample_rate,
                                            window='hann', nperseg=nperseg,
                                            noverlap=noverlap, detrend=False)
    return freqs, times, np.log(spec.astype(np.float32) + eps)
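To sanity-check the shapes this produces, a minimal standalone sketch (the same function driven by a synthetic tone instead of a real wav):

```python
import numpy as np
from scipy import signal

def log_specgram(audio, sample_rate, window_size=20, step_size=10, eps=1e-10):
    nperseg = int(round(window_size * sample_rate / 1e3))   # 20 ms -> 320 samples at 16 kHz
    noverlap = int(round(step_size * sample_rate / 1e3))    # 10 ms overlap -> 10 ms hop
    freqs, times, spec = signal.spectrogram(audio, fs=sample_rate, window='hann',
                                            nperseg=nperseg, noverlap=noverlap,
                                            detrend=False)
    return freqs, times, np.log(spec.astype(np.float32) + eps)

# Synthetic 1 s tone standing in for a real recording
sr = 16000
audio = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
freqs, times, log_spec = log_specgram(audio, sr)
```

With 320-sample windows, `freqs` has 161 bins (nperseg // 2 + 1), and the eps term guarantees every entry of `log_spec` is finite.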