Automatic Categorization of Figures
in Scientific Documents
Xiaonan Lu, Prasenjit Mitra, James Z. Wang, and C. Lee Giles
The Pennsylvania State University
Figures are an important source of non-textual information
in scientific documents. Current digital libraries
do not provide users with tools to retrieve documents based on
the information available within the figures. We propose
an architecture for retrieving documents by integrating
figures and other information. The initial step in enabling
integrated document search is to categorize figures into a set
of pre-defined types. We propose several categories of figures
based on their functionalities in scholarly articles. We have
developed a machine-learning-based approach for automatic
categorization of figures. Both global features, such as
texture, and part features, such as lines, are utilized in
the architecture for discriminating among figure categories.
The proposed approach has been evaluated on a testbed
document set collected from the CiteSeer scientific literature
digital library. Experimental evaluation has demonstrated
that our algorithms can produce acceptable results for real-world
use. Our tools will be integrated into a scientific-document
Xiaonan Lu, Prasenjit Mitra, James Z. Wang, and C. Lee Giles, "Automatic Categorization of Figures in Scientific Documents,"
Proceedings of the Joint ACM/IEEE Conference on Digital Libraries,
pp. 129-138, Chapel Hill, North Carolina, June 2006.
Copyright 2006. Permission to make digital or hard copies of all or
part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full
citation on the first page. To copy otherwise, to republish, to post
on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
April 7, 2006.
© 2006, James Z. Wang