Gates Building, Room 100, 3:15-4:30
February 2, 2001

Speech Recognition for Video Indexing

Dr. Savitha Srinivasan, Manager, Multimedia Knowledge Discovery, IBM Almaden Research Center.


Speech recognition technology can be applied to the audio track of a video to obtain words that can be used to index the video. Even though recognition accuracy may be only in the neighborhood of 80%, it still yields enough index terms to make speech the major indexing technique used in video.
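
As an illustration of how recognized words can index a video, the following minimal sketch builds an inverted index from a time-stamped transcript. The transcript format (word, start time in seconds), the sample data, and the function name build_index are assumptions made for this example. They are not taken from the lecture or from any particular recognizer.

```python
from collections import defaultdict

def build_index(transcript):
    """Map each recognized word to the times (in seconds) at which it was spoken."""
    index = defaultdict(list)
    for word, start_time in transcript:
        index[word.lower()].append(start_time)
    return index

# Hypothetical recognizer output: (word, start time in seconds).
transcript = [
    ("speech", 1.2),
    ("recognition", 1.7),
    ("indexes", 2.4),
    ("video", 2.9),
    ("speech", 7.5),
]

index = build_index(transcript)

# Look up where a query term occurs in the video.
hits = index.get("speech", [])
print(hits)
```

Because each query term maps to several time offsets, a misrecognized occurrence of a word does not necessarily prevent the video from being found through its other occurrences.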

The speech recognition problem is difficult for several reasons: word boundaries are hard to determine in continuous speech, words are pronounced differently depending on the adjacent words, words that are very different in meaning can sound almost the same, and the duration of words may vary.

This lecture describes the difficulties of automated speech recognition and the large-vocabulary continuous speech recognition (LVCSR) techniques currently available for this purpose.

LVCSR systems typically consist of three components: a vocabulary, a language model, and a set of pronunciations for each word in the vocabulary. A language model is a domain-specific database of sequences of words in the vocabulary, along with the probabilities of the words occurring in a specific order. The language model assists the recognizer in decoding dictated speech by biasing the output of the speech system towards high-probability word sequences. Recognizing out-of-vocabulary terms continues to be an open issue with this approach.
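
To make the role of the language model concrete, the sketch below trains a toy bigram model and uses it to choose between two candidate word sequences. It is an illustration under stated assumptions, not the system discussed in the lecture: the training sentences, the candidate transcriptions, the add-one smoothing, and the function names are all invented for this example. A real recognizer would combine these scores with acoustic scores.

```python
from collections import Counter
import math

def train_bigram_model(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words  # <s> marks the start of a sentence
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_prob(words, unigrams, bigrams, vocab_size):
    """Log probability of a word sequence under add-one smoothed bigrams."""
    tokens = ["<s>"] + words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Hypothetical domain-specific training text.
corpus = [
    "recognize speech in the video".split(),
    "index the video with speech".split(),
    "recognize speech with a language model".split(),
]
unigrams, bigrams = train_bigram_model(corpus)
vocab_size = len(unigrams)

# Two acoustically similar candidates (assumed for illustration).
candidates = [
    "recognize speech".split(),
    "wreck a nice beach".split(),
]
best = max(candidates, key=lambda c: log_prob(c, unigrams, bigrams, vocab_size))
print(" ".join(best))
```

The second candidate contains words that never appear in the training text, so the model assigns it a low score. The same mechanism also penalizes correct words that are simply missing from the vocabulary, which is one way to see why out-of-vocabulary terms remain difficult.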