Audio mining is a technique by which the content of an audio signal can be automatically analyzed and searched. It is most commonly used in the field of automatic speech recognition, where the analysis tries to identify any speech within the audio. The term ‘audio mining’ is sometimes used interchangeably with audio indexing, phonetic searching, phonetic indexing, speech indexing, audio analytics, speech analytics, word spotting, and information retrieval. Audio indexing, however, is mostly used to describe the pre-process of audio mining, in which the audio file is broken down into a searchable index of words.
History
Academic research on audio mining began in the late 1970s in schools like Carnegie Mellon University, Columbia University, the Georgia Institute of Technology, and the University of Texas.[1] Audio data indexing and retrieval began to receive attention and demand in the early 1990s, when multimedia content started to develop and the volume of audio content significantly increased.[2]
Before audio mining became the mainstream method, written transcripts of audio content were created and manually analyzed.[3]
Process
Audio mining is typically split into four components: audio indexing, speech processing and recognition systems, feature extraction and audio classification.[4] The audio will typically be processed by a speech recognition system in order to identify word or phoneme units that are likely to occur in the spoken content. This information may either be used immediately in pre-defined searches for keywords or phrases (a real-time "word spotting" system), or the output of the speech recognizer may be stored in an index file. One or more audio mining index files can then be loaded at a later date in order to run searches for keywords or phrases.
The results of a search will normally be in terms of hits, which are regions within files that are good matches for the chosen keywords. The user may then be able to listen to the audio corresponding to these hits in order to verify if a correct match was found.
Audio Indexing
The central problem in audio is one of information retrieval: locating the documents that contain the search key. Unlike a human listener, a computer cannot readily distinguish between different kinds of audio content, such as noise, music or human speech, or between characteristics such as speed and mood, so an effective search method is needed. Audio indexing enables efficient searching by analyzing an entire file using speech recognition. It then produces an index of the content, recording words and their locations, through content-based audio retrieval that focuses on extracted audio features.
Indexing is mainly done through two methods: large vocabulary continuous speech recognition (LVCSR) and phonetic-based indexing.
Large Vocabulary Continuous Speech Recognizers (LVCSR)
In text-based indexing, or large vocabulary continuous speech recognition (LVCSR), the audio file is first broken down into recognizable phonemes. These are then run through a dictionary that can contain several hundred thousand entries and matched with words and phrases to produce a full text transcript. A user can then simply search for a desired term, and the relevant portion of the audio content will be returned.
If a word cannot be found in the dictionary, the system chooses the most similar entry it can find. The system uses a language understanding model to assign a confidence level to its matches. If the confidence level is below 100 percent, the system provides all of the matches it found as options.[5]
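As a minimal sketch of how a text-based index might be searched, the following example builds a word index from recognizer output and returns hits ranked by confidence. The transcript data, the tuple layout and the function names are illustrative assumptions rather than the format of any particular system.

```python
from collections import defaultdict

# Hypothetical recognizer output: (word, start time in seconds, confidence).
transcript = [
    ("please", 0.0, 0.98),
    ("reset", 0.4, 0.91),
    ("my", 0.8, 0.99),
    ("password", 1.0, 0.87),
    ("reset", 5.2, 0.55),
]


def build_index(words):
    """Map each lower-cased word to a list of (start, confidence) hits."""
    index = defaultdict(list)
    for word, start, confidence in words:
        index[word.lower()].append((start, confidence))
    return index


def search(index, term, min_confidence=0.0):
    """Return the hits for a term, highest confidence first."""
    hits = [hit for hit in index.get(term.lower(), []) if hit[1] >= min_confidence]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


index = build_index(transcript)
print(search(index, "reset"))
print(search(index, "reset", min_confidence=0.8))
```

Because the index is keyed by word, a lookup is a simple text match, which is consistent with the fast search times attributed to this approach. A word that the recognizer never produced simply returns no hits.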
Advantages and disadvantages
The main draws of LVCSR are its high accuracy and fast searching. LVCSR uses statistical methods to predict the likelihood of different word sequences, so its accuracy is much higher than that of the single-word lookup of a phonetic search. If a word is found, the probability that it was actually spoken is very high.[6] Although the initial processing of audio takes a fair amount of time, searching is quick because only simple text-to-text matching is needed.
On the other hand, LVCSR is susceptible to common issues of speech recognition. The inherently random nature of audio and the problem of external noise both affect the accuracy of text-based indexing.
Another problem with LVCSR is its overreliance on its dictionary. LVCSR only recognizes words that are found in its dictionary, and these dictionaries cannot keep up with the constant emergence of new terminology, names and words. If the dictionary does not contain a word, the system has no way to identify or predict it, which reduces the accuracy and reliability of the system. This is known as the out-of-vocabulary (OOV) problem. Audio mining systems try to cope with OOV by continuously updating the dictionary and language model, but the problem remains significant and has prompted a search for alternatives.[7]
Additionally, due to the need to constantly update and maintain task-based knowledge and large training databases to cope with the OOV problem, high computational costs are incurred. This makes LVCSR an expensive approach to audio mining.
Phonetic-based Indexing
Phonetic-based indexing also breaks the audio file into recognizable phonemes, but instead of converting them into a text index, it keeps them as they are and analyzes them to create a phonetic-based index.
The process of phonetic-based indexing can be split into two phases. The first phase is indexing. It begins by converting the input media into a standard audio representation format (PCM). Then, an acoustic model is applied to the speech. This acoustic model represents characteristics of both an acoustic channel (an environment in which the speech was uttered and a transducer through which it was recorded) and a natural language (in which human beings expressed the input speech). This produces a corresponding phonetic search track, or phonetic audio track (PAT), a highly compressed representation of the phonetic content of the input media.
The second phase is searching. The user's search query term is parsed into a possible phoneme string using a phonetic dictionary. Then, multiple PAT files can be scanned at high speed during a single search for likely phonetic sequences that closely match corresponding strings of phonemes in the query term.[8][9]
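The searching phase can be sketched as follows, under simplifying assumptions: the phonetic track is represented as a plain list of phoneme symbols, the phonetic dictionary is a small hand-written table, and candidate matches are scored with edit distance. None of these details come from a specific product; real phonetic tracks are compressed and typically carry probabilities.

```python
# Hypothetical phonetic dictionary using ARPAbet-like symbols (illustrative only).
PHONETIC_DICT = {
    "data": ["D", "EY", "T", "AH"],
    "mining": ["M", "AY", "N", "IH", "NG"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def phonetic_search(track, term, max_distance=1):
    """Slide a window over the phoneme track and report close matches."""
    query = PHONETIC_DICT[term]
    hits = []
    for start in range(len(track) - len(query) + 1):
        window = track[start:start + len(query)]
        distance = edit_distance(window, query)
        if distance <= max_distance:
            hits.append((start, distance))
    return hits


# The final phoneme was "misheard" as N instead of NG.
track = ["AH", "D", "EY", "T", "AH", "M", "AY", "N", "IH", "N"]
print(phonetic_search(track, "data"))
print(phonetic_search(track, "mining"))
```

Allowing a small distance lets the search tolerate a misrecognized phoneme, but the same tolerance is what makes short queries prone to false matches.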
Advantages and disadvantages
Phonetic indexing is most attractive as it is largely unaffected by linguistic issues such as unrecognized words and spelling errors. Phonetic preprocessing maintains an open vocabulary that does not require updating. That makes it particularly useful for searching specialized terminology or words in foreign languages that do not commonly appear in dictionaries. It is also more effective for searching audio files with disruptive background noise and/or unclear utterances as it can compile results based on the sounds it can discern, and should the user wish to, they can search through the options until they find the desired item.[10]
Furthermore, in contrast to LVCSR, it can process audio files very quickly because languages have relatively few distinct phonemes. However, phonemes cannot be indexed as effectively as whole words, so searching on a phonetic-based system is slow.[11]
An issue with phonetic indexing is its low accuracy. Phoneme-based searches result in more false matches than text-based indexing. This is especially prevalent for short search terms, which have a stronger likelihood of sounding similar to other words or being part of bigger words. It could also return irrelevant results from other languages. Unless the system recognizes exactly the entire word, or understands phonetic sequences of languages, it is difficult for phonetic-based indexing to return accurate findings.[12]
Speech processing and recognition systems
Deemed the most critical and complex component of audio mining, speech recognition requires knowledge of the human speech production system and how it is modeled.
To correspond to the human speech production system, the electrical speech production system is designed to consist of:
Speech generation
Speech perception
Voiced & unvoiced speech
Model of human speech
The electrical speech production system converts an acoustic signal into a corresponding representation of the spoken words using the acoustic models in its software, in which all phonemes are represented. A statistical language model aids the process by estimating how likely words are to follow each other in a given language. Combined with a complex probability analysis, the speech recognition system can take an unknown speech signal and transcribe it into words based on the program's dictionary.[13][14]
An automatic speech recognition (ASR) system includes:
Acoustic analysis: the input sound waveform is transformed into features
Acoustic model: establishes the relationship between the speech signal and phonemes, the pronunciation model and the language model. Training algorithms are applied to a speech database to create a statistical representation of each phoneme, generating an acoustic model with a set of phonemes and their probability measures.
Pronunciation model: phonemes are mapped to specific words
Language model: words are organized to form meaningful sentences
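As a minimal sketch of how a statistical language model can rank competing word sequences, the following example uses a bigram model with add-one smoothing. The tiny corpus, the candidate phrases and the function names are illustrative assumptions; a real recognizer trains on far more text and combines these scores with acoustic scores.

```python
from collections import Counter

# Tiny illustrative corpus (an assumption, not real training data).
corpus = [
    "it is hard to recognize speech",
    "we recognize speech with a model",
    "they walk on a beach",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocabulary_size = len(unigrams)


def bigram_probability(previous, word):
    """Add-one smoothed estimate of P(word | previous)."""
    return (bigrams[(previous, word)] + 1) / (unigrams[previous] + vocabulary_size)


def sequence_score(sentence):
    """Product of the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split()
    score = 1.0
    for previous, word in zip(words, words[1:]):
        score *= bigram_probability(previous, word)
    return score


# Two acoustically similar candidates of equal length.
candidates = ["recognize speech", "recognize beach"]
best = max(candidates, key=sequence_score)
print(best)
```

The candidates have the same number of words so that the products are directly comparable; scores for sequences of different lengths would need normalization.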
Applications of speech processing include speech recognition, speech coding, speaker authentication, speech enhancement and speech synthesis.
Feature extraction
Feature extraction is a prerequisite to the entire speech recognition process and must be established within the system first. Audio files must be processed from start to end, ensuring that no important information is lost.
Sound sources are differentiated through pitch, timbral features, rhythmic features, inharmonicity, autocorrelation and other features based on the signal's predictability, statistical pattern and dynamic characteristics.
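As an illustration of one such feature, pitch can be estimated from the autocorrelation of a signal by finding the lag at which the signal best matches a shifted copy of itself. The sketch below assumes a clean synthetic tone and a hypothetical search range of 50 to 500 Hz; real speech would need framing, windowing and voicing detection.

```python
import math


def autocorrelation(signal, lag):
    """Unnormalized autocorrelation of the signal at the given lag."""
    return sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))


def estimate_pitch(signal, sample_rate, min_hz=50.0, max_hz=500.0):
    """Return the frequency whose period maximizes the autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(signal, lag))
    return sample_rate / best_lag


sample_rate = 8000
# Synthetic 200 Hz tone, 0.1 seconds long.
tone = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(800)]
print(estimate_pitch(tone, sample_rate))
```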
Feature extraction is standardized through the international MPEG-7 standard, which fixes the features used for audio or speech signal classification in terms of the techniques used to analyze and represent the raw data.
Standard speech feature extraction techniques:
Linear predictive coding (LPC) estimates the current speech sample by analyzing previous speech samples
Mel-frequency cepstral coefficients (MFCC) represent the speech signal in parametric form using the mel scale
Perceptual linear prediction (PLP) takes human auditory perception into consideration
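The mel scale used by MFCC can be sketched as follows. The conversion formula shown is one widely used variant and is an assumption here, since several definitions of the mel scale exist; the function names and the 0 to 8000 Hz range are likewise illustrative.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (one common formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(low_hz, high_hz, count):
    """Return `count` frequencies evenly spaced on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (count - 1)
    return [mel_to_hz(low_mel + i * step) for i in range(count)]


centres = mel_spaced_frequencies(0.0, 8000.0, 10)
print([round(frequency) for frequency in centres])
```

Frequencies that are evenly spaced in mels become progressively further apart in Hz, which gives finer resolution at low frequencies, where much of the information in speech lies.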
Audio classification
Audio classification is a form of supervised learning that involves the analysis of audio recordings. It is split into several categories: acoustic data classification, environmental sound classification, musical classification, and natural language utterance classification.[15] The features often used for this process are pitch, timbral features, rhythmic features, inharmonicity and autocorrelation, although other features may also be used. There are several approaches to audio classification using existing classifiers, such as k-nearest neighbors or the naïve Bayes classifier. Using annotated audio data, machines learn to identify and classify sounds.
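A k-nearest neighbors classifier over extracted features can be sketched as follows. The two-dimensional feature vectors, their values and the labels are invented for illustration and assume the features have already been normalized to a comparable range.

```python
import math
from collections import Counter

# Hypothetical annotated data: (normalized feature vector, label).
training = [
    ((0.15, 0.10), "speech"),
    ((0.25, 0.12), "speech"),
    ((0.20, 0.08), "speech"),
    ((0.70, 0.60), "music"),
    ((0.80, 0.55), "music"),
    ((0.75, 0.65), "music"),
]


def knn_classify(sample, examples, k=3):
    """Label a sample by majority vote among its k nearest examples."""
    neighbours = sorted(examples, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


print(knn_classify((0.22, 0.11), training))
print(knn_classify((0.72, 0.58), training))
```

An odd value of k avoids tied votes when there are two classes.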
There has also been research into using deep neural networks (DNNs) for speech recognition and audio classification, due to their effectiveness in other fields such as image classification.[16] One method of using DNNs is to convert audio files into image files by way of spectrograms in order to perform classification.[citation needed]
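The conversion of audio into an image-like array can be sketched with a plain discrete Fourier transform. This is a simplified illustration that assumes a synthetic tone, a rectangular window and hypothetical frame and hop sizes; practical systems use a fast Fourier transform, a tapered window and usually a logarithmic or mel-scaled magnitude.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of a frame's DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(total))
    return spectrum


def spectrogram(signal, frame_size=64, hop=32):
    """Return one magnitude spectrum per overlapping frame (rows = time)."""
    starts = range(0, len(signal) - frame_size + 1, hop)
    return [magnitude_spectrum(signal[s:s + frame_size]) for s in starts]


sample_rate = 8000
# Synthetic 1000 Hz tone, 256 samples long.
tone = [math.sin(2 * math.pi * 1000.0 * n / sample_rate) for n in range(256)]
image = spectrogram(tone)
peak_bin = max(range(len(image[0])), key=lambda k: image[0][k])
print(len(image), len(image[0]), peak_bin * sample_rate / 64)
```

Each row of the result corresponds to a time frame and each column to a frequency bin, so the array can be treated as a single-channel image by a classifier.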
Applications of Audio Mining
Audio mining is used in areas such as musical audio mining (also known as music information retrieval), which relates to the identification of perceptually important characteristics of a piece of music such as melodic, harmonic or rhythmic structure. Searches can then be carried out to find pieces of music that are similar in terms of their melodic, harmonic and/or rhythmic characteristics.
Within the field of linguistics, audio mining has been used for phonetic processing and semantic analysis.[17] The efficiency of audio mining in processing audio-visual data aids speaker identification and segmentation, as well as text transcription. Through this process, speech can be categorized in order to identify information, or to extract information through keywords spoken in the audio. In particular, this has been used for speech analytics. Call centers have used the technology to conduct real-time analysis by identifying changes in tone, sentiment or pitch, among others, which are then processed by a decision engine or artificial intelligence to take further action.[18] Further use has been seen in areas of speech recognition and text-to-speech applications.
It has also been used in conjunction with video mining, in projects such as mining movie data.