Robust Web Extraction: A Principled Approach
Philip Bohannon, Yahoo! Research
On script-generated web sites, many documents share common HTML tree
structure, allowing wrappers to effectively extract information of
interest. Of course, the scripts and thus the tree structure evolve
over time, causing wrappers to break repeatedly and making wrapper
maintenance costly. In this paper, we explore a novel
approach: we use temporal snapshots of web pages to develop a
tree-edit model of HTML, and use this model to improve wrapper
construction. We view the changes to the tree structure as the
composition of a series of edit operations: deleting nodes, inserting
nodes, and substituting node labels. Trees evolve by applying these
edit operations stochastically.
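As a minimal sketch of such a stochastic tree-edit process (the node
labels and per-operation probabilities below are hypothetical, not the
values learned in the paper):

```python
import random

class Node:
    """A labeled ordered tree node (e.g., an HTML element)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

# Hypothetical operation probabilities; in the paper's setting these
# would be learned from pairs of web-page snapshots.
P_DELETE, P_SUBSTITUTE, P_INSERT = 0.05, 0.10, 0.05
LABELS = ["div", "span", "table", "tr", "td"]

def evolve(node):
    """Stochastically evolve a tree by deleting nodes, substituting
    labels, and inserting nodes, mimicking template drift.
    Returns a list of nodes, since deleting a node promotes its
    (evolved) children to the parent's level."""
    r = random.random()
    if r < P_DELETE:
        out = []                        # node deleted: keep its children
        for c in node.children:
            out.extend(evolve(c))
        return out
    label = node.label
    if r < P_DELETE + P_SUBSTITUTE:
        label = random.choice(LABELS)   # label substituted
    children = []
    for c in node.children:
        children.extend(evolve(c))
    if random.random() < P_INSERT:
        children.append(Node(random.choice(LABELS)))  # fresh node inserted
    return [Node(label, children)]

root = Node("html", [Node("body", [Node("div", [Node("span")])])])
evolved = evolve(root)
```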
Our model is attractive in that the probability that a source tree has
evolved into a target tree can be estimated efficiently -- in
quadratic time in the size of the trees -- making it a potentially
useful tool for a variety of tree-evolution problems. We give an
algorithm to learn the probabilistic model from training examples
consisting of pairs of trees, and apply this algorithm to collections
of web-page snapshots to derive HTML-specific tree edit models.
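To give a feel for the quadratic-time estimate, the sketch below
computes the analogous quantity for label *sequences* rather than
trees (it is not the paper's tree algorithm, and the normalization of
the edit model is deliberately simplified): a dynamic program that
sums, over all edit scripts, the probability that a source sequence
evolved into a target sequence.

```python
def evolve_prob(src, tgt, p_del, p_ins, p_sub, labels):
    """Quadratic-time DP: probability that label sequence `src` evolved
    into `tgt` under a simplified stochastic edit model (delete,
    insert, substitute, keep). Sequence analogue of the tree case."""
    n_lab = len(labels)
    p_keep = 1.0 - p_del - p_sub
    m, n = len(src), len(tgt)
    # D[i][j] = probability that the first i source labels produced
    # the first j target labels.
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0:
                D[i][j] += D[i - 1][j] * p_del                # src[i-1] deleted
            if j > 0:
                D[i][j] += D[i][j - 1] * p_ins / n_lab        # tgt[j-1] inserted
            if i > 0 and j > 0:
                if src[i - 1] == tgt[j - 1]:
                    D[i][j] += D[i - 1][j - 1] * p_keep       # label kept
                else:
                    D[i][j] += D[i - 1][j - 1] * p_sub / (n_lab - 1)  # substituted
    return D[m][n]
```

Each cell is filled in constant time, so the whole table costs O(mn),
mirroring the quadratic bound claimed above; an unchanged sequence is,
as expected, far more likely than a permuted one.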
Finally, we describe a novel wrapper-construction framework that takes
the tree-edit model into account, and compare the quality of resulting
wrappers to that of traditional wrappers on synthetic and real HTML
document examples.
Possible second topic: A Generative Model of Record Extraction
If time permits, I will give an overview of some research-in-progress.
This effort is, to our knowledge, the first attempt to formalize a
variety of information extraction and integration problems around a
single generative model of web site creation, extending existing
models in EXALG, MDR, RoadRunner, and Stalker. We feel the model will
have a variety of uses, including highlighting some 'missing pieces'
in the web-scale extraction puzzle.