This package contains Ælfred2, an enhanced SAX2-compatible version of the Ælfred non-validating XML parser, and a validating SAX2 parser using that as its core. Use them like any other SAX2 parsers.
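
As a minimal sketch of what "like any other SAX2 parser" means in practice, the following registers a ContentHandler on an XMLReader and collects element names. It uses only the standard SAX2 and JAXP APIs. The class and method names (Sax2Usage, elementNames) are hypothetical, and the JAXP default parser stands in for Ælfred2 here; substituting an Ælfred2 XMLReader instance is assumed to need no other change.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class Sax2Usage {
    /** Parses the text and returns the local names of its elements in document order. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            // Assumption: the JAXP default parser stands in for Ælfred2.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // guarantees localName is reported
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [config, item, item]
        System.out.println(elementNames("<config><item/><item/></config>"));
    }
}
```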

Some of the documentation below was modified from the original Ælfred README.txt file. All of it has been updated.

About Ælfred

Ælfred is a Java-based XML parser originally from Microstar Software Limited (no longer in existence) and more or less placed into the public domain. (There was also a licence statement, effectively "Open Source", which would seem to be inapplicable for public domain content.)

Design Principles

In most Java applets and applications, XML should not be the central feature; instead, XML is the means to another end, such as loading configuration information, reading meta-data, or parsing transactions.

When an XML parser is only a single component of a much larger program, it cannot be large, slow, or resource-intensive. With Java applets, in particular, code size is a significant issue. The typical modem still does not operate at 56 Kbaud, and sometimes not even with data compression. Assuming an uncompressed 28.8 Kbaud modem, only about 3 KBytes can be downloaded in one second; compression often doubles that speed, but a V.90 modem may not provide another doubling. When used with embedded processors, similar size concerns apply.

Ælfred is designed for easy and efficient use over the Internet, based on the following principles:

  1. Ælfred must be as small as possible, so that it doesn't add too much to an applet's download time.
  2. Ælfred must use as few class files as possible, to minimize the number of HTTP connections necessary. (The use of JAR files has made this less of a concern.)
  3. Ælfred must be compatible with most or all Java implementations and platforms. (Write once, run anywhere.)
  4. Ælfred must use as little memory as possible, so that it does not take away resources from the rest of your program. (It doesn't force you to use DOM or a similar costly data structure API.)
  5. Ælfred must run as fast as possible, so that it does not slow down the rest of your program.
  6. Ælfred must produce correct output for well-formed and valid documents, but need not reject every document that is not valid or not well-formed. (In this revision, correctness was a bigger concern than in the original version; and a validation option is available.)
  7. Ælfred must provide full internationalization from the first release. (Its current weakness there is that it doesn't handle very many encodings beyond the absolute minimum.)

As you can see from this list, Ælfred is designed for production use, but neither validation nor perfect conformance was a requirement. Good validating parsers exist, including one in this package, and you should use them as appropriate. (See conformance reviews available at http://www.xml.com)

One of the main goals of this revision was to significantly improve conformance, while not significantly affecting the other goals stated above. Since the primary use of this parser is with SAX, some classes could be removed, and so the overall size of Ælfred was actually reduced. Subsequent performance work produced a notable speedup (over twenty percent on larger files). That is, the tradeoffs between speed, size, and conformance were re-targeted towards conformance and support of newer APIs (SAX2), with a positive performance impact.

The role anticipated for this version of Ælfred is as a lightweight Open Source SAX parser that can be used in essentially every Java program where the handful of conformance violations (noted below) are acceptable. That certainly includes applets, and nowadays one must also mention embedded systems as being even more size-critical. At this writing, all parsers that are more conformant are significantly larger, even when counting the validation support in this version of Ælfred.

About the Name Ælfred

Ælfred the Great (AElfred in ASCII) was King of Wessex, and some say King of England, at the time of his death in 899 AD. (Edward I was the first to be crowned with that title, not quite a century later.) Ælfred introduced a wide-spread literacy program in the hope that his people would learn to read English, at least, if Latin was too difficult for them. This Ælfred hopes to bring another sort of literacy to Java, using XML, at least, if full SGML is too difficult.

The initial Æ ligature ("AE") is also a reminder that XML is not limited to ASCII.

Character Encodings

The Ælfred parser currently builds in support for a handful of input encodings. Of course these include UTF-8 and UTF-16, which all XML parsers are required to support:

If you use any encoding other than UTF-8 or UTF-16 you should make sure to label your data appropriately:

<?xml version="1.0" encoding="ISO-8859-1"?>

Encodings accessed through java.io.InputStreamReader are not currently supported, unless they're described externally. That means no Russian, Chinese, Japanese, Korean, or other encoding support, unless you:

The difficulty with creating a reader based on the content of the encoding declaration (as shown above) relates to buffering issues. By the time the parser knows the data encoding, some or all of the XML text may already have been decoded using UTF-8. The parsing would need to be restarted, knowing the correct encoding scheme.
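
One way to describe an encoding externally, using only standard SAX and java.io calls, is to hand the parser a character stream instead of a byte stream, so that the decoder is chosen before parsing starts and nothing needs to be restarted. The sketch below assumes that approach. The names ExternalEncoding and textOf are hypothetical, the JAXP default parser stands in for Ælfred2, and which charset names are available depends on the Java runtime.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class ExternalEncoding {
    /** Decodes the bytes with the named charset, parses them, and returns the character content. */
    static String textOf(byte[] data, String charsetName) {
        StringBuilder text = new StringBuilder();
        try {
            // The caller, not the encoding declaration, selects the decoder.
            Reader chars = new InputStreamReader(new ByteArrayInputStream(data), charsetName);
            InputSource source = new InputSource(chars);

            // Assumption: the JAXP default parser stands in for Ælfred2.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length); // may arrive in several chunks
                }
            });
            reader.parse(source);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // No encoding declaration in the data; the charset is supplied by the caller.
        byte[] latin1 = "<name>caf\u00e9</name>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(latin1, "ISO-8859-1"));
    }
}
```

The same call shape applies to any charset the runtime's InputStreamReader accepts; ISO-8859-1 is used here only because every Java runtime is required to provide it.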

Known Conformance Violations

Known conformance issues should be of negligible importance for most applications, and include:

When tested against the July 12, 1999 version of the OASIS XML Conformance test suite, an earlier version passed 1057 of 1067 tests. That contrasts with the original version, which passed 867. The current parser is top-ranked in terms of conformance, as is its validating sibling (which has some additional conformance violations imposed on it by SAX2 API deficiencies as well as some of the more curious SGML layering artifacts found in the XML specification).

Licensing

As noted above, the original distribution was either public domain or a relatively liberal open source license (it's unclear exactly which any court might hold to apply). The license had the constraint that differences from the original distribution must be identified. Those differences are identified below, although if any courts hold that the software is in fact public domain, that Microstar license clearly can't apply.

This version is Copyright (c) 1999-2000 by David Brownell. The applicable license is in the LICENSE file with this distribution.

Changes Since the last Microstar Release

As noted above, Microstar has not updated this parser since the summer of 1998, when it released version 1.2a on its web site. This release is intended to benefit the developer community by refocusing the API on SAX2, and improving conformance to the extent that most developers should not need to use another XML parser.

The code has been cleaned up (referring to the XML 1.0 spec in comments, rather than some preliminary draft, for one example) and has been sped up a bit as well.

SAX2 Support

The original version of Ælfred did not support the SAX2 APIs.

This version supports the SAX2 APIs, exposing the standard boolean feature descriptors. It supports the "DeclHandler" property to provide access to all DTD declarations not already exposed through the SAX1 API. The "LexicalHandler" property is supported, except that entity references are hidden; this means you can see things like comments and CDATA boundaries. SAX1 compatibility is currently provided.
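
The handler properties mentioned above are set through the standard SAX2 property URIs. The following sketch registers one object as both LexicalHandler and DeclHandler and logs comments, CDATA boundaries, and element declarations. The names Sax2Extensions and events are hypothetical, and the JAXP default parser stands in for Ælfred2.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class Sax2Extensions {
    // Standard SAX2 property identifiers.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";
    static final String DECL_HANDLER = "http://xml.org/sax/properties/declaration-handler";

    /** Parses the text and returns a log of the lexical and declaration events seen. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        // DefaultHandler2 implements both LexicalHandler and DeclHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                log.add("comment:" + new String(ch, start, length));
            }
            @Override
            public void startCDATA() { log.add("startCDATA"); }
            @Override
            public void endCDATA() { log.add("endCDATA"); }
            @Override
            public void elementDecl(String name, String model) {
                log.add("elementDecl:" + name);
            }
        };
        try {
            // Assumption: the JAXP default parser stands in for Ælfred2.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.setProperty(DECL_HANDLER, handler);
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]>"
                   + "<doc><!--note--><![CDATA[x < y]]></doc>";
        System.out.println(events(xml));
    }
}
```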

Validation

The 'pipeline' package in this same software distribution contains an XML validation component that validates any full SAX2 event stream (including all document type declarations). There is now a Validator class which combines that class and this enhanced Ælfred parser, creating a validating parser. (This validating parser, and its support classes in other packages, can be removed from this package if a smaller distribution is needed.)

As noted in the documentation for that validating component, certain validity constraints can't be tested. These include all those relying on layering violations (exposing XML at the level of tokens or below, required since XML isn't a context-free grammar), some that SAX2 doesn't support, and a few others. However, the resulting validating parser is notably more conformant than most of the other parsers I've tested over the past several months. Moreover, that component can be used without a parser: any component that emits SAX events can have its output validated on demand.

Bugs Fixed

Bugs fixed in this version include:

  1. Originally Ælfred didn't close file descriptors, which led to file descriptor leakage on programs which ran for any length of time.
  2. NOTATION declarations without system identifiers are now handled correctly.
  3. DTD events are now reported for all invocations of a given parser, not just the first one.
  4. More correct character handling:
  5. Certain validity errors were previously treated as well-formedness violations.
  6. Attribute handling is improved:
  7. More correct entity handling:
  8. Neither conditional sections nor parameter entity references within markup declarations are permitted in the internal subset.
  9. Processing instructions whose target names are "XML" (ignoring case) are now rejected.
  10. Comments may not include "--".
  11. Most "]]>" sequences in text are rejected.
  12. Correct syntax for standalone declarations is enforced.
  13. Setting a locale for diagnostics only produces an exception if the language of that locale isn't English.
  14. Some more encoding names are recognized. These include the Unicode 3.0 variants of UTF-16 (UTF-16BE, UTF-16LE) as well as US-ASCII and a few commonly seen synonyms.
  15. Text (from character content, PIs, or comments) large enough not to fit into internal buffers is now handled correctly even in some cases which were originally handled incorrectly.
  16. Content is now reported for element types for which attributes have been declared, but no content model is known. (Such documents are invalid, but may still be well formed.)

Other bugs may also have been fixed.
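
Several of the fixes listed above (items 9 through 12) are well-formedness checks that can be observed from outside the parser. The sketch below shows one way to probe them: parse a small document and report whether a fatal error was raised. The names WellFormednessChecks and isWellFormed are hypothetical, and the JAXP default parser stands in for Ælfred2, so this demonstrates the expected behaviour of a conformant SAX2 parser rather than the behaviour of this package in particular.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormednessChecks {
    /** Returns true if the text parses without a fatal error, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            // Assumption: the JAXP default parser stands in for Ælfred2.
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler rethrows fatal errors instead of printing them.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false; // the document violated a well-formedness rule
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<doc>ok</doc>",
            "<doc><?XML bad?></doc>",      // item 9: reserved PI target
            "<doc><!-- a -- b --></doc>",  // item 10: "--" inside a comment
            "<doc>a ]]> b</doc>",          // item 11: "]]>" in text
            "<?xml version=\"1.0\" standalone=\"maybe\"?><doc/>" // item 12
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```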