Speech SDK 5.1--No.1:SAPI OVERVIEW,mainly about speech recognizers

最新推荐文章于 2021-11-20 00:15:23 发布

HappyJandun

最新推荐文章于 2021-11-20 00:15:23 发布

阅读量1.3k

点赞数

文章标签： microsoft speech sdk

Micorsoft--Speech SDK 5.1 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

From official document --sapi.chm

The SAPI application programming interface (API) dramatically reduces(减轻) the code overhead(代价，费用) required for an application to use speech recognition and text-to-speech, making speech technology more accessible and robust for a wide range of applications.

API Overview

The SAPI API provides a high-level interface between an application and speech engines. SAPI implements(实现) all the low-level details needed to control and manage the real-time operations of various speech engines.

The two basic types of SAPI engines are text-to-speech (TTS) systems and speech recognizers. TTS systems synthesize text strings and files into spoken audio using synthetic voices. Speech recognizers convert human spoken audio into readable text strings and files.

API For Speech Recoginiton

Just as ISpVoice is the main interface for speech synthesis,ISpRecoContext is the main interface for speech recognition. Like the ISpVoice, it is an ISpEventSource, which means that it is the speech application's vehicle(中介) for receiving notifications(通知) for the requested speech recognition events.

An application has the choice of two different types of speech recognition engines (ISpRecognizer). A shared recognizer that could possibly be shared with other speech recognition applications is recommended for most speech applications. To create an ISpRecoContext for a shared ISpRecognizer, an application need only call COM's CoCreateInstance on the component CLSID_SpSharedRecoContext. In this case, SAPI will set up the audio input stream, setting it to SAPI's default audio input stream. For large server applications that would run alone on a system, and for which performance is key, an InProc speech recognition engine is more appropriate. In order to create an ISpRecoContext for an InProc ISpRecognizer, the application must first call CoCreateInstance on the component CLSID_SpInprocRecoInstance to create its own InProc ISpRecognizer. Then the application must make a call to ISpRecognizer::SetInput (see also ISpObjectToken) in order to set up the audio input. Finally, the application can call ISpRecognizer::CreateRecoContext to obtain an ISpRecoContext.

The next step is to set up notifications for events the application is interested in. As the ISpRecognizer is also an ISpEventSource, which in turn is an ISpNotifySource, the application can call one of the ISpNotifySource methods from its ISpRecoContext to indicate where the events for that ISpRecoContext should be reported. Then it should call ISpEventSource::SetInterest to indicate which events it needs to be notified of. The most important event is the SPEI_RECOGNITION, which indicates that the ISpRecognizer has recognized some speech for this ISpRecoContext. See SPEVENTENUM for details on the other available speech recognition events.

Finally, a speech application must create, load, and activate an ISpRecoGrammar, which essentially indicates what type of utterances(说话方式) to recognize, i.e., dictation(口授) or a command and control grammar.First, the application creates an ISpRecoGrammar using ISpRecoContext::CreateGrammar. Then, the application loads the appropriate grammar, either by calling ISpRecoGrammar::LoadDictation for dictation or one of the ISpRecoGrammar::LoadCmdxxx methods for command and control.Finally, in order to activate these grammars so that recognition can start, the application calls ISpRecoGrammar::SetDictationState for dictation or ISpRecoGrammar::SetRuleState or ISpRecoGrammar::SetRuleIdState for command and control.

When recognitions come back to the application by means of the requested notification mechanism(机制), the lParam member of the SPEVENT structure will be an ISpRecoResult by which the application can determine what was recognized and for which ISpRecoGrammar of the ISpRecoContext.

An ISpRecognizer, whether shared or InProc, can have multiple ISpRecoContexts associated with it, and each one can be notified in its own way of events pertaining to it. An ISpRecoContext can have multiple ISpRecoGrammars created from it, each one for recognizing different types of utterances.