Automated Analysis of Images in Documents for Intelligent Document Search
Xiaonan Lu, Saurabh Kataria, William J. Brouwer,
James Z. Wang, Prasenjit Mitra and C. Lee Giles
The Pennsylvania State University
Abstract:
Images, capable of highlighting a wide variety of information, are
commonly used in documents. Contents of images, including descriptive
and numerical data, need to be extracted for effective document
search. For instance, two-dimensional (2-D) plots carry important
quantitative information for scientific publications. Extracting the
data from 2-D plots and storing them in a database can enable users to
query the information, analyze it and compare it with data from other
documents. There is no documented example of a fully automated system
capable of performing all the required image recognition and
extraction tasks, amenable to a large digital library. Whereas
semi-automated software implementations exist, this work focuses on
completely eliminating the need for human intervention. We present a
supervised learning-based algorithm that classifies figures in digital
documents into five classes: photographs, 2-D plots, 3-D plots,
diagrams, and others. We also present an integrated algorithm for
extracting numerical data and text from 2-D plot images. These unique
contributions together with existing algorithms comprise a complete
system for use in the construction of high-volume digital libraries.
Citation:
Xiaonan Lu, Saurabh Kataria, William J. Brouwer, James Z. Wang,
Prasenjit Mitra and C. Lee Giles, "Automated Analysis of Images in
Documents for Intelligent Document Search," International Journal on
Document Analysis and Recognition, vol. 12, no. 2, pp. 65-81, 2009.
Copyright 2009 Springer-Verlag. Permission to make digital or hard
copies of all or part of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice
and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. The original publication is
available at www.springerlink.com. DOI: 10.1007/s10032-009-0081-0.
Last Modified:
March 2, 2009.
© 2009, James Z. Wang