Automatic Categorization of Figures
in Scientific Documents

Xiaonan Lu, Prasenjit Mitra, James Z. Wang, and C. Lee Giles
The Pennsylvania State University

Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functionalities in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are utilized in the architecture for discriminating among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific document digital library.

Citation: Xiaonan Lu, Prasenjit Mitra, James Z. Wang, and C. Lee Giles, "Automatic Categorization of Figures in Scientific Documents," Proceedings of the Joint ACM/IEEE Conference on Digital Libraries, pp. 129-138, Chapel Hill, North Carolina, June 2006.

Copyright 2006. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Last Modified: April 7, 2006.
2006, James Z. Wang