Automated Analysis of Images in Documents for Intelligent Document Search

Xiaonan Lu, Saurabh Kataria, William J. Brouwer, James Z. Wang, Prasenjit Mitra and C. Lee Giles
The Pennsylvania State University
Abstract:

Images, capable of highlighting a wide variety of information, are commonly used in documents. The contents of images, including descriptive and numerical data, need to be extracted for effective document search. For instance, two-dimensional (2-D) plots carry important quantitative information in scientific publications. Extracting the data from 2-D plots and storing it in a database enables users to query the information, analyze it, and compare it with data from other documents. There is no documented example of a fully automated system that performs all the required image recognition and extraction tasks and scales to a large digital library. Whereas semi-automated software implementations exist, this work focuses on completely eliminating the need for human intervention. We present a supervised learning-based algorithm that classifies figures in digital documents into five classes: photographs, 2-D plots, 3-D plots, diagrams, and others. We also present an integrated algorithm for extracting numerical data and text from 2-D plot images. These contributions, together with existing algorithms, form a complete system for use in the construction of high-volume digital libraries.



Citation: Xiaonan Lu, Saurabh Kataria, William J. Brouwer, James Z. Wang, Prasenjit Mitra and C. Lee Giles, "Automated Analysis of Images in Documents for Intelligent Document Search," International Journal on Document Analysis and Recognition, vol. 12, no. 2, pp. 65-81, 2009.

Copyright 2009 Springer-Verlag. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. The original publication is available at www.springerlink.com. DOI: 10.1007/s10032-009-0081-0.

Last Modified: March 2, 2009.
© 2009, James Z. Wang