Date: 03/04/99
Attendees: Rajeev Motwani, Svetlozar Nestorov, Sebastien Brion, Dick Tsur, and Yue Zhuge

We continued our discussion of NoDoSE, the Northwestern Document Structure Extractor (available from http://shrike.cs.nwu.edu/nodose ), developed by Brad Adelberg and his students. We concluded that NoDoSE, at least in the version we have, works well only for very regularly structured documents, such as records separated by delimiters.

We decided to implement our own system to extract structure from text files and generate XML documents. We studied several ways to perform this task. The approach depends on how much information we provide to the system: for example, we may supply the keywords ourselves or let the system detect them. Sebastien and Yue spent most of Friday trying to come up with a detailed proposal.

On the XML side, we found a tool that converts HTML documents into well-formed XML documents: Tidy (http://www.w3.org/People/Raggett/tidy). Tidy has UNIX and WinNT versions and is very easy to use. The source code is also available. The problem with Tidy is that it simply converts every HTML tag into an XML tag, so "displaying information" becomes "structural information". As a result, the structure (i.e., the tags) of a converted XML document may have nothing to do with the semantic meaning of the document's elements.
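
To make the keyword-driven option concrete, here is a minimal sketch of the case where we supply the keywords ourselves. It is not the proposal itself, only an illustration under assumptions that are ours: records are separated by blank lines, each field is written as "Keyword: value", and the function name `records_to_xml`, the tag names, and the sample data are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def records_to_xml(text, keywords, root_tag="records", record_tag="record"):
    """Build an XML tree from blank-line-separated records of 'Keyword: value' lines.

    Assumed input format (not taken from any existing tool): one field per
    line, and the keywords are supplied by the user rather than detected.
    """
    root = ET.Element(root_tag)
    # Records are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue  # empty input yields no records
        record = ET.SubElement(root, record_tag)
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            key = key.strip()
            if sep and key in keywords:
                # Known keyword: emit an element named after it, so the tag
                # carries the meaning of the field.
                tag = key.lower().replace(" ", "_")
                ET.SubElement(record, tag).text = value.strip()
            elif line.strip():
                # Unrecognized line: keep it instead of silently dropping it.
                ET.SubElement(record, "unparsed").text = line.strip()
    return root


if __name__ == "__main__":
    sample = "Name: Widget\nPrice: 10\n\nName: Gadget\nfragile item"
    tree = records_to_xml(sample, {"Name", "Price"})
    print(ET.tostring(tree, encoding="unicode"))
```

Unlike the Tidy conversion, the tags here come from the keywords, so they describe what a field means rather than how it is displayed. Letting the system detect the keywords would replace the `keywords` argument with a discovery step, which this sketch does not attempt.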