Date: 03/04/99

Attendees: Rajeev Motwani, Svetlozar Nestorov, Sebastien Brion, Dick Tsur,
and Yue Zhuge

We continued our discussion of NoDoSE, the Northwestern Document Structure
Extractor (available from http://shrike.cs.nwu.edu/nodose ), developed by
Brad Adelberg and his students. We concluded that NoDoSE, at least in the
version we have, works well only for very regularly structured documents,
such as records separated by delimiters.

We decided to implement our own system to extract structure from text files
and generate XML documents. We studied several ways to perform this task.
The approaches differ mainly in how much information we provide to the
system. For example, we may supply the keywords ourselves or let the system
detect them. Sebastien and Yue spent most of Friday trying to come up with a
detailed proposal.
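
To make the "we provide the keywords" option concrete, here is a minimal
sketch in Python. It is not the proposal itself. It assumes, purely for
illustration, that each field sits on its own line in the form
"Keyword: value" and that every keyword is a valid XML element name. The
function name extract_to_xml and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_to_xml(text, keywords, root_tag="document"):
    """Build an XML tree from lines of the form 'Keyword: value'.

    Assumption: keywords are supplied by the user and are valid XML names.
    Lines that do not start with a known keyword are ignored.
    """
    root = ET.Element(root_tag)
    # Match keywords case-insensitively.
    lookup = {k.lower(): k for k in keywords}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        name = lookup.get(key.strip().lower())
        if sep and name is not None:
            child = ET.SubElement(root, name.lower())
            child.text = value.strip()
    return root


sample = "Date: 03/04/99\nAttendees: Sebastien, Yue\nSome free text."
xml_text = ET.tostring(
    extract_to_xml(sample, ["Date", "Attendees"]), encoding="unicode"
)
print(xml_text)
```

Here the element names come from the keywords, so the resulting tags carry
meaning. Letting the system detect the keywords would replace the
user-supplied list with a discovery step, which this sketch does not
attempt.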

On the XML side, we found a tool that converts HTML documents into
well-formed XML documents: Tidy, at http://www.w3.org/People/Raggett/tidy.
Tidy has UNIX and WinNT versions and is very easy to use. The source code is
also available. The problem with Tidy is that it simply carries all HTML
tags over as XML tags, so "display information" becomes "structural
information". As a result, the structure (e.g., the tags) of the converted
XML documents may not reflect the semantic meaning of the document elements.
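
The following short Python sketch illustrates the problem. The fragment is
a hypothetical, hand-written example of well-formed XML that keeps HTML tag
names. It is not actual Tidy output. Listing its element names shows that
they describe layout rather than meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: well-formed XML that still uses HTML tag names.
fragment = "<p><b>Rajeev Motwani</b><br /><i>Stanford</i></p>"

root = ET.fromstring(fragment)

# Collect element names in document order.
tags = [element.tag for element in root.iter()]
print(tags)
```

The names collected (p, b, br, i) say how the text is displayed. Nothing in
them indicates that one element holds a person's name and another an
affiliation.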