A Metadata Generation System for Scanned Scientific Volumes

Xiaonan Lu (1), Brewster Kahle (2), James Z. Wang (1), and C. Lee Giles (1)
(1) The Pennsylvania State University
(2) Internet Archive

Large-scale digitization projects have been conducted at digital libraries to preserve cultural artifacts and to provide permanent access. The increasing amount of digitized resources, including scanned books and scientific publications, requires tools and methods that can efficiently analyze and manage large collections of digitized resources. In this work, we tackle the problem of extracting metadata from scanned volumes of journals. Our goal is to extract information describing the internal structure and content of scanned volumes, which is necessary for providing effective content access to digital library users. We propose methods for automatically generating volume-level, issue-level, and article-level metadata based on format and text features extracted from OCRed text. We show the performance of our system on scanned bound historical documents nearly two centuries old. We have developed the system and integrated it into an operational digital library, the Internet Archive, for real-world usage.

Citation: Xiaonan Lu, Brewster Kahle, James Z. Wang, and C. Lee Giles, "A Metadata Generation System for Scanned Scientific Volumes," Proceedings of the Joint ACM/IEEE Conference on Digital Libraries, pp. 167-176, Pittsburgh, Pennsylvania, June 2008.

Copyright 2008. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Last Modified: April 2, 2008.
© 2008, James Z. Wang