语音识别基础篇(三) - pocketsphinx在windows下的中文语音识别
如需转载请标明出处:http://blog.csdn.net/itas109
QQ技术交流群:129518033
相关文章
语音识别基础篇(一) - CMU Sphinx简介
语音识别基础篇(二) - pocketsphinx在windows下的编译和运行
1.环境
操作系统:windows 7 64Bit SP1 / Linux
编译器:Viual Studio 2013 + QT 5.6.2
pocketsphinx版本:5prealpha
编译版本:Debug
2.中文语音识别的必须文件
主要是三个文件:
###a.声学模型
这里我们使用官方的声学模型。
模型下载地址:Download models
其中Mandarin为中文普通话,下载下来之后我们可以看到
声学模型:zh_broadcastnews_16k_ptm256_8000.tar.bz2
语言模型:zh_broadcastnews_64000_utf8.DMP
拼音字典:zh_broadcastnews_utf8.dic
###b.语言模型
我们使用在线的lm生成工具,链接
###c.拼音字典
这里的*.dic是我们根据zh_broadcastnews_utf8.dic现有的字典而来的。
3.生成语音模型lm
a.确定要识别的中文命令集
在线生成的lm对上传的文件时有格式要求的。命令集的txt文件必须为utf-8的形式。且不能有UTF-8文件头,换行符需要是\n(即0x0A)。
Unix 系统里,每行结尾只有“<换行>”,即“\n”;Windows系统里面,每行结尾是“<换行><回车 >”,即“\n\r”;Mac系统里,每行结尾是“<回车>”,即“\n”。
因此最简单处理这个问题的解决办法是在linux里编辑(换行是0x0A)。
中文通过在zh_broadcastnews_utf8.dic文件中搜索并拷贝。
假设我们的命令词为打开、关闭、计算器、浏览器
b.在Linux虚拟机下进行编辑

如果可以直接输入中文,也可以不进行复制。这里复制只是linux下没有安装中文输入法。
这里我们将命令集文件命名为command.txt
c.使用在线lm工具生成lm文件
在http://www.speech.cs.cmu.edu/tools/lmtool-new.html点击"选择文件",上传command.txt,然后点击"COMPILE KNOWLEDGE BASE"按钮进行合成。

我们直接下载整合的文件*.tgz。这里的文件命名是在线工具随机生成的。

到这里为止,我们的lm语音模型已经生成了。
将*.lm复制到windows下。

4.生成拼音字典dic
我们可以发现,在上一次中生成的文件中是有*.dic文件的。不过由于该在线工具目前不支持中文注音表,所以这个dic文件只有中文,没有注音的部分。
我们需要做的就是根据zh_broadcastnews_utf8.dic来补充dic注音表。同样还是在linux下编辑复制。

注意:
如果遇到没有完整的词语,那么可以直接单个字进行复制,比如启天,我们就先搜启,再搜天,然后拼到一起。
启天 q i t ian
完成之后将*.dic复制到windows下。

5.测试
准备工作:
将zh_broadcastnews_16k_ptm256_8000文件夹、.lm、.dic三个文件复制到pocketsphinx_continuous.exe(pocketsphinx编译生成的文件)所在的目录下。

运行的Demo地址:
http://download.csdn.net/download/itas109/10193907
在命令提示符中运行
pocketsphinx_continuous -inmic yes -hmm zh_broadcastnews_ptm256_8000 -lm a.lm -dict a.dic
识别结果:

6.其他的注意事项
###a.默认pocketsphinx_continuous是不会显示中文的
将continuous.c修改为continuous.cpp
/* -*- c-basic-offset: 4; indent-tabs-mode: nil -*- */
/* ====================================================================
* Copyright (c) 1999-2010 Carnegie Mellon University. All rights
* reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
*
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
*
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in
* the documentation and/or other materials provided with the
* distribution.
*
* This work was supported in part by funding from the Defense Advanced
* Research Projects Agency and the National Science Foundation of the
* United States of America, and the CMU Sphinx Speech Consortium.
*
* THIS SOFTWARE IS PROVIDED BY CARNEGIE MELLON UNIVERSITY ``AS IS'' AND
* ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
* THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL CARNEGIE MELLON UNIVERSITY
* NOR ITS EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*
* ====================================================================
*
*/
/*
* continuous.c - Simple pocketsphinx command-line application to test
* both continuous listening/silence filtering from microphone
* and continuous file transcription.
*/
/*
* This is a simple example of pocketsphinx application that uses continuous listening
* with silence filtering to automatically segment a continuous stream of audio input
* into utterances that are then decoded.
*
* Remarks:
* - Each utterance is ended when a silence segment of at least 1 sec is recognized.
* - Single-threaded implementation for portability.
* - Uses audio library; can be replaced with an equivalent custom library.
*/
#include <stdio.h>
#include <string.h>
#include <assert.h>
#if defined(_WIN32) && !defined(__CYGWIN__)
#include <windows.h>
#else
#include <sys/select.h>
#endif
#include <sphinxbase/err.h>
#include <sphinxbase/ad.h>
#include "pocketsphinx.h"
#include <string>
static const arg_t cont_args_def[] = {
POCKETSPHINX_OPTIONS,
/* Argument file. */
{"-argfile",
ARG_STRING,
NULL,
"Argument file giving extra arguments."},
{"-adcdev",
ARG_STRING,
NULL,
"Name of audio device to use for input."},
{"-infile",
ARG_STRING,
NULL,
"Audio file to transcribe."},
{"-inmic",
ARG_BOOLEAN,
"no",
"Transcribe audio from microphone."},
{"-time",
ARG_BOOLEAN,
"no",
"Print word times in file transcription."},
CMDLN_EMPTY_OPTION
};
static ps_decoder_t *ps;
static cmd_ln_t *config;
static FILE *rawfd;
std::string UTF8_To_string(const std::string & str)
{
int nwLen = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
wchar_t * pwBuf = new wchar_t[nwLen + 1];//一定要加1,不然会出现尾巴
memset(pwBuf, 0, nwLen * 2 + 2);
MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str.length(), pwBuf, nwLen);
int nLen = WideCharToMultiByte(CP_ACP, 0, pwBuf, -1, NULL, NULL, NULL, NULL);
char * pBuf = new char[nLen + 1];
memset(pBuf, 0, nLen + 1);
WideCharToMultiByte(CP_ACP, 0, pwBuf, nwLen, pBuf, nLen, NULL, NULL);
std::string retStr = pBuf;
delete[]pBuf;
delete[]pwBuf;
pBuf = NULL;
pwBuf = NULL;
return retStr;
}
static void
print_word_times()
{
int frame_rate = cmd_ln_int32_r(config, "-frate");
ps_seg_t *iter = ps_seg_iter(ps);
while (iter != NULL) {
int32 sf, ef, pprob;
float conf;
ps_seg_frames(iter, &sf, &ef);
pprob = ps_seg_prob(iter, NULL, NULL, NULL);
conf = logmath_exp(ps_get_logmath(ps), pprob);
printf("%s %.3f %.3f %f\n", ps_seg_word(iter), ((float)sf / frame_rate),
((float) ef / frame_rate), conf);
iter = ps_seg_next(iter);
}
}
static int
check_wav_header(char *header, int expected_sr)
{
int sr;
if (header[34] != 0x10) {
E_ERROR("Input audio file has [%d] bits per sample instead of 16\n", header[34]);
return 0;
}
if (header[20] != 0x1) {
E_ERROR("Input audio file has compression [%d] and not required PCM\n", header[20]);
return 0;
}
if (header[22] != 0x1) {
E_ERROR("Input audio file has [%d] channels, expected single channel mono\n", header[22]);
return 0;
}
sr = ((header[24] & 0xFF) | ((header[25] & 0xFF) << 8) | ((header[26] & 0xFF) << 16) | ((header[27] & 0xFF) << 24));
if (sr != expected_sr) {
E_ERROR("Input audio file has sample rate [%d], but decoder expects [%d]\n", sr, expected_sr);
return 0;
}
return 1;
}
/*
* Continuous recognition from a file
*/
static void
recognize_from_file()
{
int16 adbuf[2048];
const char *fname;
const char *hyp;
int32 k;
uint8 utt_started, in_speech;
int32 print_times = cmd_ln_boolean_r(config, "-time");
fname = cmd_ln_str_r(config, "-infile");
if ((rawfd = fopen(fname, "rb")) == NULL) {
E_FATAL_SYSTEM("Failed to open file '%s' for reading",
fname);
}
if (strlen(fname) > 4 && strcmp(fname + strlen(fname) - 4, ".wav") == 0) {
char waveheader[44];
fread(waveheader, 1, 44, rawfd);
if (!check_wav_header(waveheader, (int)cmd_ln_float32_r(config, "-samprate")))
E_FATAL("Failed to process file '%s' due to format mismatch.\n", fname);
}
if (strlen(fname) > 4 && strcmp(fname + strlen(fname) - 4, ".mp3") == 0) {
E_FATAL("Can not decode mp3 files, convert input file to WAV 16kHz 16-bit mono before decoding.\n");
}
ps_start_utt(ps);
utt_started = FALSE;
while ((k = fread(adbuf, sizeof(int16), 2048, rawfd)) > 0) {
ps_process_raw(ps, adbuf, k, FALSE, FALSE);
in_speech = ps_get_in_speech(ps);
if (in_speech && !utt_started) {
utt_started = TRUE;
}
if (!in_speech && utt_started) {
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL);
if (hyp != NULL)
printf("%s\n", hyp);
if (print_times)
print_word_times();
fflush(stdout);
ps_start_utt(ps);
utt_started = FALSE;
}
}
ps_end_utt(ps);
if (utt_started) {
hyp = ps_get_hyp(ps, NULL);
if (hyp != NULL) {
printf("%s\n", hyp);
if (print_times) {
print_word_times();
}
}
}
fclose(rawfd);
}
/* Sleep for specified msec */
static void
sleep_msec(int32 ms)
{
#if (defined(_WIN32) && !defined(GNUWINCE)) || defined(_WIN32_WCE)
Sleep(ms);
#else
/* ------------------- Unix ------------------ */
struct timeval tmo;
tmo.tv_sec = 0;
tmo.tv_usec = ms * 1000;
select(0, NULL, NULL, NULL, &tmo);
#endif
}
/*
* Main utterance processing loop:
* for (;;) {
* start utterance and wait for speech to process
* decoding till end-of-utterance silence will be detected
* print utterance result;
* }
*/
static void
recognize_from_microphone()
{
ad_rec_t *ad;
int16 adbuf[2048];
uint8 utt_started, in_speech;
int32 k;
char const *hyp;
if ((ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
(int) cmd_ln_float32_r(config,
"-samprate"))) == NULL)
E_FATAL("Failed to open audio device\n");
if (ad_start_rec(ad) < 0)
E_FATAL("Failed to start recording\n");
if (ps_start_utt(ps) < 0)
E_FATAL("Failed to start utterance\n");
utt_started = FALSE;
E_INFO("Ready....\n");
for (;;) {
if ((k = ad_read(ad, adbuf, 2048)) < 0)
E_FATAL("Failed to read audio\n");
ps_process_raw(ps, adbuf, k, FALSE, FALSE);
in_speech = ps_get_in_speech(ps);
if (in_speech && !utt_started) {
utt_started = TRUE;
E_INFO("Listening...\n");
}
if (!in_speech && utt_started) {
/* speech -> silence transition, time to start new utterance */
ps_end_utt(ps);
hyp = ps_get_hyp(ps, NULL );
if (hyp != NULL) {
std::string str = UTF8_To_string(hyp);
printf("%s\n", str.c_str());
fflush(stdout);
}
if (ps_start_utt(ps) < 0)
E_FATAL("Failed to start utterance\n");
utt_started = FALSE;
E_INFO("Ready....\n");
}
sleep_msec(100);
}
ad_close(ad);
}
int
main(int argc, char *argv[])
{
char const *cfg;
config = cmd_ln_parse_r(NULL, cont_args_def, argc, argv, TRUE);
/* Handle argument file as -argfile. */
if (config && (cfg = cmd_ln_str_r(config, "-argfile")) != NULL) {
config = cmd_ln_parse_file_r(config, cont_args_def, cfg, FALSE);
}
if (config == NULL || (cmd_ln_str_r(config, "-infile") == NULL && cmd_ln_boolean_r(config, "-inmic") == FALSE)) {
E_INFO("Specify '-infile <file.wav>' to recognize from file or '-inmic yes' to recognize from microphone.\n");
cmd_ln_free_r(config);
return 1;
}
ps_default_search_args(config);
ps = ps_init(config);
if (ps == NULL) {
cmd_ln_free_r(config);
return 1;
}
E_INFO("%s COMPILED ON: %s, AT: %s\n\n", argv[0], __DATE__, __TIME__);
if (cmd_ln_str_r(config, "-infile") != NULL) {
recognize_from_file();
} else if (cmd_ln_boolean_r(config, "-inmic")) {
recognize_from_microphone();
}
ps_free(ps);
cmd_ln_free_r(config);
return 0;
}
#if defined(_WIN32_WCE)
#pragma comment(linker,"/entry:mainWCRTStartup")
#include <windows.h>
//Windows Mobile has the Unicode main only
int
wmain(int32 argc, wchar_t * wargv[])
{
char **argv;
size_t wlen;
size_t len;
int i;
argv = malloc(argc * sizeof(char *));
for (i = 0; i < argc; i++) {
wlen = lstrlenW(wargv[i]);
len = wcstombs(NULL, wargv[i], wlen);
argv[i] = malloc(len + 1);
wcstombs(argv[i], wargv[i], wlen);
}
//assuming ASCII parameters
return main(argc, argv);
}
#endif
b.其他的识别方式
1).Keyword lists
打开 /1e-40/
关闭 /1e-20/
pocketsphinx_continuous -inmic yes -hmm zh_broadcastnews_ptm256_8000 -kws keyphrase_cn.list -dict a.dic
2).grammar
#JSGF V1.0;
grammar hello;
public <greet> = (打开 |关闭)(计算器|浏览器);
pocketsphinx_continuous -inmic yes -hmm zh_broadcastnews_ptm256_8000 -jsgf a.jsgf -dict a.dic
3).language model
关闭 g uan b i
打开 d a k ai
浏览器 l iu l an q i
计算器 j i s uan q i
pocketsphinx_continuous -inmic yes -hmm zh_broadcastnews_ptm256_8000 -lm a.lm -dict a.dic
在后续章节我们将介绍一下语音模型的微调来提高识别率、通过QT二次封装语音识别类库等。
觉得文章对你有帮助,可以扫描二维码捐赠给博主,谢谢!

如需转载请标明出处:http://blog.csdn.net/itas109
QQ技术交流群:129518033

被折叠的 条评论
为什么被折叠?



