Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to machine-readable input (for example, to key presses, using the binary code for a string of character codes). The term "voice recognition" is sometimes incorrectly used to refer to speech recognition, when actually referring to speaker recognition, which attempts to identify the person speaking, as opposed to what is being said. Confusingly, journalists and manufacturers of devices that use speech recognition for control commonly use the term Voice Recognition when they mean Speech Recognition.
Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC. They are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). The difference between the cepstrum and the mel-frequency cepstrum is that in the MFC, the frequency bands are equally spaced on the mel scale, which approximates the human auditory system's response more closely than the linearly-spaced frequency bands used in the normal cepstrum. This frequency warping can allow for better representation of sound.
Voice activity detection (also known as speech activity detection or, more simply, speech detection) is a technique used in speech processing wherein the presence or absence of human speech is detected in regions of audio (which may also contain music, noise, or other sound) [1]. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech segments: it can avoid unnecessary coding/transmission of silence packets in VOIP, saving on computation and on network bandwidth.