Intelligent Parsing of Scanned Volumes for Web Based Archives
Xiaonan Lu, James Z. Wang, and C. Lee Giles
The Pennsylvania State University
Abstract:
The proliferation of digital libraries and the large number of
existing documents raise important issues in the efficient handling of
documents. Printed text in documents needs to be converted into
digital format, and semantic information needs to be parsed and managed
for effective retrieval. In this work, we attempt to solve the
problems faced by current web based archives, where large scale
repositories of electronic resources have been built from scanned
volumes. Specifically, we focus on the scientific domain and target
scanned volumes of scientific publications. Our goal is to automate
the semantic processing of scanned volumes, an important and
challenging step towards efficient retrieval of content within scanned
volumes. We tackle the problem by designing a machine learning-based
method to extract multi-level metadata about the content of scanned
volumes. We combine image and text information within scanned volumes
for intelligent parsing. We developed a system and tested it with
real-world data from the Internet Archive, and the experimental evaluation
demonstrated good results.
Citation:
Xiaonan Lu, James Z. Wang, and C. Lee Giles,
``Intelligent Parsing of Scanned Volumes for Web Based Archives,''
Proceedings of the IEEE International Conference on Semantic Computing,
pp. 559-566, Irvine, California, 2007.
Copyright 2007 IEEE. Permission to make digital or hard copies of all or
part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full
citation on the first page. To copy otherwise, to republish, to post
on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
Last Modified:
July 30, 2007.
© 2007, James Z. Wang