Many web sites contain large collections of pages generated using a common
template or layout. For example, Amazon
lays out the author, title, comments, etc. in the same way in all the
book pages. The values used to generate the pages (e.g., the author,
title, ...) typically come from a database. We have studied the problem
of automatically extracting database values from such a collection of web
pages, without any human input. Please follow this
link for the paper
discussing the techniques that we have developed for the above problem.
This page contains the experimental results of applying our techniques to
real web page collections. Some of the collections that we used in our
experiments were obtained from the RoadRunner Project,
which tries to solve a similar problem. The other collections were
manually crawled from well-known data-rich sites such as E-bay and Netflix.
The experimental results are organized as follows. For each input
collection of web pages that we used, we present the following
information:
- Source Pages: The source pages in the collection.
- Extracted Template: The template deduced by our system for
the collection.
- Extracted Schema: The schema deduced by our system.
- Extracted Data: The data encoded in each page that is
extracted by our system.
- Equivalence Classes: Equivalence classes are sets of words
that are used by our system to construct the template. Please refer to
the paper for the definition of equivalence classes.
- Manual Schema: The schema that we deduced manually using
the semantics of the information in the pages. This is used for
evaluating the system.
The extracted schema, values, and template are output by our system in XML.
Schema
The following text illustrates how we encode a schema in XML.
<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>
The schema represented by the above XML text is a tuple with two
attributes: the first attribute is of basic type (string), and the second
attribute is a set of basic type. Each element has a unique id
attribute.
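The structure of this encoding can be checked programmatically. The following is a minimal sketch in Python, using only the standard library; it is not part of the system described here, and the helper names describe and index_by_id are hypothetical. It assumes that a schema contains only the schema, tuple, set and basic elements shown in the example.

```python
import xml.etree.ElementTree as ET

# The example schema from the text above.
SCHEMA_XML = """
<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>
"""


def describe(node):
    """Return a compact, human-readable description of a schema type."""
    if node.tag == "basic":
        return "basic"
    if node.tag == "set":
        # A set wraps exactly one element type.
        return "set(" + describe(node[0]) + ")"
    if node.tag == "tuple":
        return "tuple(" + ", ".join(describe(child) for child in node) + ")"
    if node.tag == "schema":
        # The <schema> element wraps the top-level type.
        return describe(node[0])
    raise ValueError("unexpected schema element: " + node.tag)


def index_by_id(root):
    """Map each id to its schema element, rejecting duplicate ids."""
    types = {}
    for element in root.iter():
        type_id = element.get("id")
        if type_id in types:
            raise ValueError("duplicate id: " + type_id)
        types[type_id] = element
    return types


schema_root = ET.fromstring(SCHEMA_XML.strip())
schema_description = describe(schema_root)
schema_types = index_by_id(schema_root)
print(schema_description)
print(sorted(schema_types))
```

The id index built here is what makes the instanceof and context references in the value and template encodings resolvable.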
Value
The following example value, which is an instance of the schema above,
illustrates how we encode a value in XML.
<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3">
      <![CDATA[What is Mathematics]]>
    </value>
    <value instanceof="4">
      <value instanceof="5">
        <![CDATA[Courant]]>
      </value>
      <value instanceof="5">
        <![CDATA[Robbins]]>
      </value>
    </value>
  </value>
</value>
The instanceof attribute of a <value> element corresponds to
the id attribute of a type in the schema of which the value is an
instance.
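To illustrate how the instanceof references tie a value to its schema, the sketch below converts the example value into nested Python data by looking up each instanceof in the schema. This is an illustrative reading of the format, not the system's own code; the function names index_by_id and to_python are hypothetical, and the choice to map tuples to Python tuples and sets to Python lists is an assumption made for readability.

```python
import xml.etree.ElementTree as ET

# The example schema and value from the text above.
SCHEMA_XML = """
<schema id="1">
  <tuple id="2" order="2">
    <basic id="3"/>
    <set id="4">
      <basic id="5"/>
    </set>
  </tuple>
</schema>
"""

VALUE_XML = """
<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3"><![CDATA[What is Mathematics]]></value>
    <value instanceof="4">
      <value instanceof="5"><![CDATA[Courant]]></value>
      <value instanceof="5"><![CDATA[Robbins]]></value>
    </value>
  </value>
</value>
"""


def index_by_id(root):
    """Map each id in the schema to its element."""
    return {element.get("id"): element for element in root.iter()}


def to_python(value, types):
    """Convert a <value> element to Python data, guided by the schema."""
    type_id = value.get("instanceof")
    if type_id not in types:
        raise ValueError("unknown type id: " + str(type_id))
    kind = types[type_id].tag
    if kind == "basic":
        # Basic values carry their string in a CDATA section.
        return (value.text or "").strip()
    children = [to_python(child, types) for child in value]
    if kind == "set":
        return children
    if kind == "tuple":
        return tuple(children)
    # kind == "schema": unwrap the single top-level value.
    return children[0]


schema_types = index_by_id(ET.fromstring(SCHEMA_XML.strip()))
book = to_python(ET.fromstring(VALUE_XML.strip()), schema_types)
print(book)
```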
Template
The following example template for the schema above illustrates how we
represent a template in XML.
<template schema="1">
  <start-string context="2">
    <![CDATA[<html> <body> Book:]]>
  </start-string>
  <start-string context="5">
    <![CDATA[Author:]]>
  </start-string>
  <end-string context="2">
    <![CDATA[</body> </html>]]>
  </end-string>
</template>
The encoding of the value above using the template above results in the
following page:
<html>
<body>
Book: What is Mathematics
Author: Courant
Author: Robbins
</body>
</html>
A template is just a set of optional start-strings and
end-strings associated with each type in the schema. The context
attribute in the <start-string> and <end-string> elements
identifies the type in the schema that the element is associated with.
In an encoded page, the "start-string" occurs before the encoding of a
sub-value of the type that it is associated with, and the "end-string"
occurs after it. The above representation of the template is equivalent to our
definition of a template in the paper.
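The encoding rule described above can be sketched in a few lines of Python. This is an illustration of the rule as stated on this page, not the system's implementation; the names load_template and encode are hypothetical, and joining the pieces with single spaces is an assumption (the page shown earlier differs only in line breaks).

```python
import xml.etree.ElementTree as ET

# The example value and template from the text above.
VALUE_XML = """
<value instanceof="1">
  <value instanceof="2">
    <value instanceof="3"><![CDATA[What is Mathematics]]></value>
    <value instanceof="4">
      <value instanceof="5"><![CDATA[Courant]]></value>
      <value instanceof="5"><![CDATA[Robbins]]></value>
    </value>
  </value>
</value>
"""

TEMPLATE_XML = """
<template schema="1">
  <start-string context="2"><![CDATA[<html> <body> Book:]]></start-string>
  <start-string context="5"><![CDATA[Author:]]></start-string>
  <end-string context="2"><![CDATA[</body> </html>]]></end-string>
</template>
"""


def load_template(template_xml):
    """Return (start, end) dicts mapping a type id to its string."""
    start, end = {}, {}
    for element in ET.fromstring(template_xml.strip()):
        text = (element.text or "").strip()
        if element.tag == "start-string":
            start[element.get("context")] = text
        elif element.tag == "end-string":
            end[element.get("context")] = text
    return start, end


def encode(value, start, end):
    """Emit start-string, then the value's content, then end-string."""
    type_id = value.get("instanceof")
    parts = []
    if type_id in start:
        parts.append(start[type_id])
    text = (value.text or "").strip()
    if text:
        # Basic values contribute their own string.
        parts.append(text)
    for child in value:
        # Composite values contribute the encodings of their sub-values.
        parts.append(encode(child, start, end))
    if type_id in end:
        parts.append(end[type_id])
    return " ".join(part for part in parts if part)


start_strings, end_strings = load_template(TEMPLATE_XML)
page = encode(ET.fromstring(VALUE_XML.strip()), start_strings, end_strings)
print(page)
```

Because the start-string for type 5 is emitted once per set element, "Author:" appears before each of the two author values, which matches the example page.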
The following links lead to the experimental results for the individual
collections.
- Amazon Cars
- Amazon Pop Artist
- Baseball Players
- RPM Packages
- UEFA National Teams
- UEFA Players
- E-Bay
- Netflix
- Tennis Players' Profiles