A Metadata Generation System for
Scanned Scientific Volumes
Xiaonan Lu (1), Brewster Kahle (2), James Z. Wang (1), and C. Lee Giles (1)
(1) The Pennsylvania State University
(2) Internet Archive
Abstract:
Large-scale digitization projects have been conducted
at digital libraries to preserve cultural artifacts and to
provide permanent access. The increasing amount of
digitized resources, including scanned books and scientific
publications, requires development of tools and methods
that will efficiently analyze and manage large collections of
digitized resources. In this work, we tackle the problem of
extracting metadata from scanned volumes of journals. Our
goal is to extract information describing internal structures
and content of scanned volumes, which is necessary for
providing effective content access functionalities to digital
library users. We propose methods for automatically
generating volume-level, issue-level, and article-level
metadata based on format and text features extracted from
OCRed text. We report the performance of our system on
scanned, bound historical documents nearly two centuries
old. We have developed the system and integrated it into
an operational digital library, the Internet Archive, for real-world
usage.
Citation:
Xiaonan Lu, Brewster Kahle, James Z. Wang, and C. Lee Giles,
"A Metadata Generation System for Scanned Scientific Volumes,"
Proceedings of the Joint ACM/IEEE Conference on Digital Libraries,
pp. 167-176, Pittsburgh, Pennsylvania, June 2008.
Copyright 2008. Permission to make digital or hard copies of all or
part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full
citation on the first page. To copy otherwise, to republish, to post
on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Last Modified:
April 2, 2008.
© 2008, James Z. Wang