Early attempts at speech processing and recognition were primarily focused on understanding a handful of simple phonetic elements such as vowels. In 1952, three researchers at Bell Labs, Stephen. Balashek, R. Biddulph, and K. H. Davis, developed a system that could recognize digits spoken by a single speaker.[2] Pioneering works in field of speech recognition using analysis of its spectrum were reported in the 1940s.[3]
One of the first commercially available speech recognition products was Dragon Dictate, released in 1990. In 1992, technology developed by Lawrence Rabiner and others at Bell Labs was used by AT&T in their Voice Recognition Call Processing service to route calls without a human operator. By this point, the vocabulary of these systems was larger than the average human vocabulary.[6]
In 2012, Geoffrey Hinton and his team at the University of Toronto demonstrated that deep neural networks could significantly outperform traditional HMM-based systems on large vocabulary continuous speech recognition tasks. This breakthrough led to widespread adoption of deep learning techniques in the industry.[7][8]
By the mid-2010s, companies like Google, Microsoft, Amazon, and Apple had integrated advanced speech recognition systems into their virtual assistants such as Google Assistant, Cortana, Alexa, and Siri.[9] These systems utilized deep learning models to provide more natural and accurate voice interactions.
The development of Transformer-based models, like Google's BERT (Bidirectional Encoder Representations from Transformers) and OpenAI's GPT (Generative Pre-trained Transformer), further pushed the boundaries of natural language processing and speech recognition. These models enabled more context-aware and semantically rich understanding of speech.[10][7] In recent years, end-to-end speech recognition models have gained popularity. These models simplify the speech recognition pipeline by directly converting audio input into text output, bypassing intermediate steps like feature extraction and acoustic modeling. This approach has streamlined the development process and improved performance.[11]
Dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences, which may vary in speed. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restriction and rules. The optimal match is denoted by the match that satisfies all the restrictions and the rules and that has the minimal cost, where the cost is computed as the sum of absolute differences, for each matched pair of indices, between their values.[citation needed]
A hidden Markov model can be represented as the simplest dynamic Bayesian network. The goal of the algorithm is to estimate a hidden variable x(t) given a list of observations y(t). By applying the Markov property, the conditional probability distribution of the hidden variable x(t) at time t, given the values of the hidden variable x at all times, depends only on the value of the hidden variable x(t − 1). Similarly, the value of the observed variable y(t) only depends on the value of the hidden variable x(t) (both at time t).[citation needed]
An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs.[citation needed]
Phase-aware processing
Phase is usually supposed to be random uniform variable and thus useless. This is due wrapping of phase:[12] result of arctangent function is not continuous due to periodical jumps on . After phase unwrapping (see,[13] Chapter 2.3; Instantaneous phase and frequency), it can be expressed as:[12][14], where is linear phase ( is temporal shift at each frame of analysis), is phase contribution of the vocal tract and phase source.[14]
Obtained phase estimations can be used for noise reduction: temporal smoothing of instantaneous phase [15] and its derivatives by time (instantaneous frequency) and frequency (group delay),[16] smoothing of phase across frequency.[16] Joined amplitude and phase estimators can recover speech more accurately basing on assumption of von Mises distribution of phase.[14]
^Mowlaee, Pejman; Kulmer, Josef; Stahl, Johannes; Mayer, Florian (2017). Single channel phase-aware signal processing in speech communication: theory and practice. Chichester: Wiley. ISBN978-1-119-23882-9.
^ abcKulmer, Josef; Mowlaee, Pejman (April 2015). "Harmonic phase estimation in single-channel speech enhancement using von Mises distribution and prior SNR". Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE. pp. 5063–5067.