We address the problem of automatic topic discovery in video content for real-world applications such as distributed learning and corporate training. We present a model-based approach that segments a single video into topically cohesive segments, a task we refer to as "intravideo" topic discovery.
The segmentation is based on an analysis of the audio track. IBM's large-vocabulary continuous speech recognition (LVCSR) system is applied to the audio track of the video. We view the resulting speech transcript as a predefined cluster of topically related documents, and extract features from this cluster to characterize the collection. CueVideo's phonetic retrieval engine computes the timing information associated with each feature, and the distribution of the features in time drives the segmentation algorithm, which places segment boundaries at changes in topic. Finally, we assign representative labels to each segment, thereby enabling automatic generation of a Table of Contents, topic-based queries, and topical navigation.
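The pipeline above can be illustrated with a minimal sketch of the boundary-detection step. This is not the paper's algorithm: it assumes, for illustration, that each feature comes with occurrence timestamps (as the phonetic retrieval engine would supply), bins them into fixed-length time windows, and proposes a topic boundary wherever the cosine similarity between adjacent windows' feature histograms drops below a threshold. The window length, threshold, and function names are all hypothetical choices.

```python
# Hypothetical sketch: detect topic boundaries from feature timing data.
# Each occurrence is a (feature, time_in_seconds) pair; occurrences are
# binned into fixed windows, and a boundary is proposed where adjacent
# windows share few features (low cosine similarity).
from collections import Counter
from math import sqrt


def window_histograms(occurrences, window=60.0):
    """Bin (feature, time) pairs into per-window feature counts."""
    duration = max(t for _, t in occurrences)
    hists = [Counter() for _ in range(int(duration // window) + 1)]
    for feat, t in occurrences:
        hists[int(t // window)][feat] += 1
    return hists


def cosine(a, b):
    """Cosine similarity between two feature-count dictionaries."""
    num = sum(a[k] * b[k] for k in a if k in b)
    den = (sqrt(sum(v * v for v in a.values()))
           * sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0


def topic_boundaries(occurrences, window=60.0, threshold=0.2):
    """Return window-boundary times where feature overlap collapses."""
    hists = window_histograms(occurrences, window)
    return [i * window for i in range(1, len(hists))
            if cosine(hists[i - 1], hists[i]) < threshold]


# Toy usage: two topically distinct halves of an audio track.
occ = [("neural", 10), ("network", 30), ("neural", 70), ("network", 90),
       ("market", 130), ("stock", 150), ("market", 190), ("stock", 200)]
print(topic_boundaries(occ))  # boundary proposed at 120.0 seconds
```

In practice, a smoothed similarity curve with boundaries at local minima (as in TextTiling-style approaches) is more robust than a fixed threshold, but the core idea, segmenting where the feature distribution in time shifts, is the same.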
Applications that incorporate some form of automatic video categorization based on an analysis of speech transcripts have focused on broadcast news content. The problem we address bears the greatest similarity to the segmentation task in the DARPA-sponsored Topic Detection and Tracking (TDT) initiative. However, there are important differences unique to our problem domain. Our content comes from a heterogeneous corpus of distributed learning and corporate training videos, whose audio durations range from 10 to 90 minutes. More significantly, we require neither pre-encoded knowledge nor pre-labeled training data. Our survey of the literature shows that much of the research addresses topic discovery for large document collections, which is analogous to intervideo topic discovery; there is very little work in the intravideo domain.