  1. <!DOCTYPE html PUBLIC
  2. '-//W3C//DTD XHTML 1.0 Transitional//EN'
  3. 'http://www.w3.org/TR/xhtml1/DTD/transitional.dtd'>
  4. <html><head>
  5. <title>package overview</title>
  6. <!--
  7. /*
  8. * Copyright (C) 1999,2000,2001 The Free Software Foundation, Inc.
  9. */
  10. -->
  11. </head><body>
  12. <p> This package contains &AElig;lfred2, which includes an
  13. enhanced SAX2-compatible version of the &AElig;lfred
  14. non-validating XML parser, a modular (and hence optional)
  15. DTD validating parser, and modular (and hence optional)
  16. JAXP glue to those.
  17. Use these like any other SAX2 parsers. </p>
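<p> As a minimal sketch of what "like any other SAX2 parser" can mean in practice, the example below obtains an <code>XMLReader</code> through JAXP and counts start-element events. It assumes whatever parser JAXP selects on the running platform; the class name <code>Sax2Usage</code> and the method <code>countElements</code> are hypothetical names chosen for illustration, not part of this package. </p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class Sax2Usage {
    /** Counts the start-element events reported for the given document text. */
    static int countElements(String xml) {
        final int[] count = {0};
        try {
            // Assumption: use whichever SAX2 parser JAXP selects by default.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();

            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    count[0]++;
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        // One root element plus two children.
        System.out.println(countElements("<doc><item/><item/></doc>"));
    }
}
```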
  18. <ul>
  19. <li><a href="#about">About &AElig;lfred</a><ul>
  20. <li><a href="#principles">Design Principles</a></li>
  21. <li><a href="#name">About the Name &AElig;lfred</a></li>
  22. <li><a href="#encodings">Character Encodings</a></li>
  23. <li><a href="#violations">Known Conformance Violations</a></li>
  24. <li><a href="#copyright">Licensing</a></li>
  25. </ul></li>
  26. <li><a href="#changes">Changes Since the Last Microstar Release</a><ul>
  27. <li><a href="#sax2">SAX2 Support</a></li>
  28. <li><a href="#validation">Validation</a></li>
  29. <li><a href="#smaller">You Want Smaller?</a></li>
  30. <li><a href="#bugfixes">Bugs Fixed</a></li>
  31. </ul></li>
  32. </ul>
  33. <h2><a name="about">About &AElig;lfred</a></h2>
  34. <p>&AElig;lfred is an XML parser written in the Java programming language.
  35. <h3><a name="principles">Design Principles</a></h3>
  36. <p>In most Java applets and applications, XML should not be the central
  37. feature; instead, XML is the means to another end, such as loading
  38. configuration information, reading meta-data, or parsing transactions.</p>
  39. <p> When an XML parser is only a single component of a much larger
  40. program, it cannot be large, slow, or resource-intensive. With Java
  41. applets, in particular, code size is a significant issue. The standard
  42. modem is still not operating at 56 Kbaud, or sometimes even with data
  43. compression. Assuming an uncompressed 28.8 Kbaud modem, only about
  44. 3 KBytes can be downloaded in one second; compression often doubles
  45. that speed, but a V.90 modem may not provide another doubling. When
  46. used with embedded processors, similar size concerns apply. </p>
  47. <p> &AElig;lfred is designed for easy and efficient use over the Internet,
  48. based on the following principles: </p> <ol>
  49. <li> &AElig;lfred must be as small as possible, so that it doesn't add too
  50. much to an applet's download time. </li>
  51. <li> &AElig;lfred must use as few class files as possible, to minimize the
  52. number of HTTP connections necessary. (The use of JAR files has made this
  53. less of a concern.) </li>
  54. <li> &AElig;lfred must be compatible with most or all Java implementations
  55. and platforms. (Write once, run anywhere.) </li>
  56. <li> &AElig;lfred must use as little memory as possible, so that it does
  57. not take away resources from the rest of your program. (It doesn't force
  58. you to use DOM or a similar costly data structure API.)</li>
  59. <li> &AElig;lfred must run as fast as possible, so that it does not slow down
  60. the rest of your program. </li>
  61. <li> &AElig;lfred must produce correct output for well-formed and valid
  62. documents, but need not reject every document that is not valid or
  63. not well-formed. (In &AElig;lfred2, correctness was a bigger concern
  64. than in the original version; and a validation option is available.) </li>
  65. <li> &AElig;lfred must provide full internationalization from the first
  66. release. (&AElig;lfred2 now automatically handles all encodings
  67. supported by the underlying JVM; previous versions handled only
  68. UTF-8, UTF-16, ASCII, and ISO-8859-1.)</li>
  69. </ol>
  70. <p>As you can see from this list, &AElig;lfred is designed for production
  71. use, but neither validation nor perfect conformance was a requirement.
  72. Good validating parsers exist, including one in this package,
  73. and you should use them as appropriate. (See conformance reviews
  74. available at <a href="http://www.xml.com/">http://www.xml.com</a>.)
  75. </p>
  76. <p> One of the main goals of &AElig;lfred2 was to significantly improve
  77. conformance, while not significantly affecting the other goals stated above.
  78. Since the only use of this parser is with SAX, some classes could be
  79. removed, and so the overall size of &AElig;lfred was actually reduced.
  80. Subsequent performance work produced a notable speedup (over twenty
  81. percent on larger files). That is, the tradeoffs between speed, size, and
  82. conformance were re-targeted towards conformance and support of newer APIs
  83. (SAX2), with a positive performance impact. </p>
  84. <p> The role anticipated for this version of &AElig;lfred is as a
  85. lightweight Free Software SAX parser that can be used in essentially every
  86. Java program where the handful of conformance violations (noted below)
  87. are acceptable.
  88. That certainly includes applets, and
  89. nowadays one must also mention embedded systems as being even more
  90. size-critical.
  91. At this writing, all parsers that are more conformant are
  92. significantly larger, even when counting the optional
  93. validation support in this version of &AElig;lfred. </p>
  94. <h3><a name="name">About the Name <em>&AElig;lfred</em></a></h3>
  95. <p>&AElig;lfred the Great (AElfred in ASCII) was King of Wessex, and
  96. some say King of England, at the time of his death in 899 AD.
  97. &AElig;lfred introduced a wide-spread literacy program in the hope that
  98. his people would learn to read English, at least, if Latin was too
  99. difficult for them. This &AElig;lfred hopes to bring another sort of
  100. literacy to Java, using XML, at least, if full SGML is too difficult.</p>
  101. <p>The initial &AElig; ligature ("AE") is also a reminder that XML is
  102. not limited to ASCII.</p>
  103. <h3><a name="encodings">Character Encodings</a></h3>
  104. <p> The &AElig;lfred parser currently builds in support for a handful
  105. of input encodings. Of course these include UTF-8 and UTF-16, which
  106. all XML parsers are required to support:</p> <ul>
  107. <li> UTF-8 ... the standard eight bit encoding, used unless
  108. you provide an encoding declaration or a MIME charset tag.</li>
  109. <li> US-ASCII ... an extremely common seven bit encoding,
  110. which happens to be a subset of UTF-8 and ISO-8859-1 as well
  111. as many other encodings. XHTML web pages using US-ASCII
  112. (without an encoding declaration) are probably more
  113. widely interoperable than those in any other encoding. </li>
  114. <li> ISO-8859-1 ... includes accented characters used in
  115. much of western Europe (but excluding the Euro currency
  116. symbol).</li>
  117. <li> UTF-16 ... with several variants, this encodes each
  118. sixteen bit Unicode character in sixteen bits of output.
  119. Variants include UTF-16BE (big endian, no byte order mark),
  120. UTF-16LE (little endian, no byte order mark), and
  121. ISO-10646-UCS-2 (an older and less used encoding, using a
  122. version of Unicode without surrogate pairs). This is
  123. essentially the native encoding used by Java. </li>
  124. <li> ISO-10646-UCS-4 ... a seldom-used four byte encoding,
  125. also known as UTF-32BE. Four byte order variants are supported,
  126. including one known as UTF-32LE. Some operating systems
  127. standardized on UCS-4 despite its significant size penalty,
  128. in anticipation that Unicode (even with surrogate pairs)
  129. would eventually become limiting. UCS-4 permits encoding
  130. of non-Unicode characters, which Java can't represent (and
  131. XML doesn't allow).
  132. </li>
  133. </ul>
  134. <p> If you use any encoding other than UTF-8 or UTF-16 you should
  135. make sure to label your data appropriately: </p>
  136. <blockquote>
  137. &lt;?xml version="1.0" encoding="<b>ISO-8859-15</b>"?&gt;
  138. </blockquote>
  139. <p> Encodings accessed through <code>java.io.InputStreamReader</code>
  140. are now fully supported for both external labels (such as MIME types)
  141. and internal types (as shown above).
  142. There is one limitation in the support for internal labels:
  143. the encodings must be derived from the US-ASCII encoding;
  144. the EBCDIC family of encodings is not recognized.
  145. Note that Java defines its
  146. own encoding names, which don't always correspond to the standard
  147. Internet encoding names defined by the IETF/IANA, and that Java
  148. may even <em>require</em> use of nonstandard encoding names.
  149. Please report
  150. such problems; some of them can be worked around in this parser,
  151. and many can be worked around by using external labels.
  152. </p>
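<p> The two kinds of label can be sketched as follows. This is an illustration against the standard SAX API, assuming the JAXP default parser; <code>EncodingLabels</code> and <code>textOf</code> are hypothetical names. The internal label is the encoding declaration inside the document, and the external label is passed through <code>InputSource.setEncoding</code>, much as a MIME charset parameter might be. </p>

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EncodingLabels {
    /** Parses raw bytes and returns the concatenated character content. */
    static String textOf(byte[] document, String externalEncoding) {
        final StringBuilder text = new StringBuilder();
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(document));
            if (externalEncoding != null) {
                // External label, for example taken from a MIME charset parameter.
                source.setEncoding(externalEncoding);
            }
            SAXParserFactory.newInstance().newSAXParser().parse(source,
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Internal label: the encoding declaration names the byte encoding.
        byte[] internal =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>caf\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(internal, null).equals("caf\u00e9"));

        // External label: no declaration in the data, encoding supplied by the caller.
        byte[] external = "<a>caf\u00e9</a>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(external, "ISO-8859-1").equals("caf\u00e9"));
    }
}
```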
  153. <p>Note that if you are using the Euro symbol with a fixed-length
  154. eight bit encoding, you should probably be using the encoding label
  155. <em>iso-8859-15</em> or, with a Microsoft OS, <em>cp-1252</em>.
  156. Of course, UTF-8 and UTF-16 handle the Euro symbol directly.
  157. </p>
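<p> A quick way to check which encodings can represent the Euro symbol (U+20AC) on a given JVM is sketched below. The charset names are the ones Java's <code>java.nio.charset</code> API accepts, which is an assumption about the running JVM; in particular, Java spells the Microsoft code page <code>windows-1252</code>. The class name <code>EuroCheck</code> is hypothetical. </p>

```java
import java.nio.charset.Charset;

final class EuroCheck {
    /** Reports whether the named charset can encode the Euro sign, U+20AC. */
    static boolean canEncodeEuro(String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode('\u20ac');
    }

    public static void main(String[] args) {
        String[] names = {"ISO-8859-1", "ISO-8859-15", "windows-1252", "UTF-8"};
        for (String name : names) {
            System.out.println(name + ": " + canEncodeEuro(name));
        }
    }
}
```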
  158. <h3><a name="violations">Known Conformance Violations</a></h3>
  159. <p>Known conformance issues should be of negligible importance for
  160. most applications, and include: </p><ul>
  161. <li> Rather than following the voluminous "Appendix B" rules about
  162. what characters may appear in names (and name tokens), the Unicode
  163. rules embedded in <em>java.lang.Character</em> are used.
  164. This means mostly that some names are inappropriately accepted,
  165. though a few are inappropriately rejected. (It's much simpler
  166. to avoid that much special case code. Recent OASIS/NIST test
  167. cases may make these rules realistically testable.) </li>
  168. <li> Text containing "]]&gt;" is not rejected unless it fully resides
  169. in an internal buffer ... which is, thankfully, the typical case. This
  170. text is illegal, but sometimes appears in illegal attempts to
  171. nest CDATA sections. (Not catching that boundary condition
  172. substantially simplifies parsing text.) </li>
  173. <li> Surrogate characters that aren't correctly paired are ignored
  174. rather than rejected, unless they were encoded using UTF-8. (This
  175. simplifies parsing text.) Unicode 3.1 assigned the first characters
  176. to those character codes, in early 2001, so few documents (or tools)
  177. use such characters in any case. </li>
  178. <li> Declarations following a reference to an undefined parameter
  179. entity are not ignored. (Not maintaining and using state
  180. about this validity error simplifies declaration handling; few
  181. XML parsers address this constraint in any case.) </li>
  182. <li> Well formedness constraints for general entity references
  183. are not enforced. (The code to handle the "content" production
  184. is merged with the element parsing code, making it hard to reuse
  185. for this additional situation.) </li>
  186. </ul>
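<p> To see how a particular parser treats one of these cases, a small well-formedness probe can help. The sketch below uses the JAXP default parser, so its results describe that parser and not necessarily this one; as noted above, &AElig;lfred2 may miss a "]]&gt;" sequence that does not fully reside in an internal buffer. <code>WellFormedCheck</code> and <code>isWellFormed</code> are hypothetical names. </p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns true when the parser reports no fatal error for the text. */
    static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler.fatalError rethrows, so a fatal error aborts the parse.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // I/O or configuration problems are not well-formedness results.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><![CDATA[x]]></a>")); // legal CDATA section
        System.out.println(isWellFormed("<a>]]></a>"));           // "]]>" in plain text
    }
}
```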
  187. <p> When tested against the July 12, 1999 version of the OASIS
  188. XML Conformance test suite, an earlier version passed 1057 of 1067 tests.
  189. That contrasts with the original version, which passed 867. The
  190. current parser is top-ranked in terms of conformance, as is its
  191. validating sibling (which has some additional conformance violations
  192. imposed on it by SAX2 API deficiencies as well as some of the more
  193. curious SGML layering artifacts found in the XML specification). </p>
  194. <p> The XML 1.0 specification itself was not without problems,
  195. and after some delays the W3C has come out with a revised
  196. "second edition" specification. While that doesn't resolve all
  197. the problems identified in the XML specification, many of the most
  198. egregious problems have been resolved. (You still need to drink
  199. magic Kool-Aid before some DTD-related issues make sense.)
  200. To the extent possible, this parser conforms to that second
  201. edition specification, and does well against corrected versions
  202. of the OASIS/NIST XML conformance test cases. See <a href=
  203. "http://xmlconf.sourceforge.net">http://xmlconf.sourceforge.net</a>
  204. for more information about SAX2/XML conformance testing. </p>
  205. <h3><a name="copyright">Copyright and distribution terms</a></h3>
  206. <p>
  207. The software in this package is distributed under the GNU General Public
  208. License (with a special exception described below).
  209. </p>
  210. <p>
  211. A copy of the GNU General Public License (GPL) is included in this distribution,
  212. in the file COPYING. If you do not have the source code, it is available at:
  213. <a href="http://www.gnu.org/software/classpath/">http://www.gnu.org/software/classpath/</a>
  214. </p>
  215. <pre>
  216. Linking this library statically or dynamically with other modules is
  217. making a combined work based on this library. Thus, the terms and
  218. conditions of the GNU General Public License cover the whole
  219. combination.
  220. As a special exception, the copyright holders of this library give you
  221. permission to link this library with independent modules to produce an
  222. executable, regardless of the license terms of these independent
  223. modules, and to copy and distribute the resulting executable under
  224. terms of your choice, provided that you also meet, for each linked
  225. independent module, the terms and conditions of the license of that
  226. module. An independent module is a module which is not derived from
  227. or based on this library. If you modify this library, you may extend
  228. this exception to your version of the library, but you are not
  229. obligated to do so. If you do not wish to do so, delete this
  230. exception statement from your version.
  231. Parts derived from code which carried the following notice:
  232. Copyright (c) 1997, 1998 by Microstar Software Ltd.
  233. AElfred is free for both commercial and non-commercial use and
  234. redistribution, provided that Microstar's copyright and disclaimer are
  235. retained intact. You are free to modify AElfred for your own use and
  236. to redistribute AElfred with your modifications, provided that the
  237. modifications are clearly documented.
  238. This program is distributed in the hope that it will be useful, but
  239. WITHOUT ANY WARRANTY; without even the implied warranty of
  240. merchantability or fitness for a particular purpose. Please use it AT
  241. YOUR OWN RISK.
  242. </pre>
  243. <p> Some of this documentation was modified from the original
  244. &AElig;lfred README.txt file. All of it has been updated. </p>
  245. </p>
  246. <h2><a name="changes">Changes Since the Last Microstar Release</a></h2>
  247. <p> As noted above, Microstar has not updated this parser since
  248. the summer of 1998, when it released version 1.2a on its web site.
  249. This release is intended to benefit the developer community by
  250. refocusing the API on SAX2, and improving conformance to the extent
  251. that most developers should not need to use another XML parser. </p>
  252. <p> The code has been cleaned up (referring to the XML 1.0 spec in
  253. all the production numbers in
  254. comments, rather than some preliminary draft, for one example) and
  255. has been sped up a bit as well.
  256. JAXP support has been added, although developers are still
  257. strongly encouraged to use the SAX2 APIs directly. </p>
  258. <h3><a name="sax2">SAX2 Support</a></h3>
  259. <p> The original version of &AElig;lfred did not support the
  260. SAX2 APIs. </p>
  261. <p> This version supports the SAX2 APIs, exposing the standard
  262. boolean feature descriptors. It supports the "DeclHandler" property
  263. to provide access to all DTD declarations not already exposed
  264. through the SAX1 API. The "LexicalHandler" property is supported,
  265. exposing entity boundaries (including the unnamed external subset) and
  266. things like comments and CDATA boundaries. SAX1 compatibility is
  267. currently provided.</p>
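<p> The handler properties are set through the standard SAX2 property URIs. The following sketch registers one object as content handler, "LexicalHandler", and "DeclHandler", then counts comments, CDATA sections, and element declarations. It assumes the JAXP default parser, and <code>LexicalEvents</code> and <code>count</code> are hypothetical names; any SAX2 parser supporting these properties should be usable in the same way. </p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class LexicalEvents {
    // Standard SAX2 property identifiers.
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";
    static final String DECL_HANDLER =
        "http://xml.org/sax/properties/declaration-handler";

    /** Returns {comments, CDATA sections, element declarations} seen. */
    static int[] count(String xml) {
        final int[] counts = new int[3];
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) { counts[0]++; }
            @Override
            public void startCDATA() { counts[1]++; }
            @Override
            public void elementDecl(String name, String model) { counts[2]++; }
        };
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.setProperty(DECL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = count("<!DOCTYPE a [<!ELEMENT a (#PCDATA)>]>"
            + "<!-- note --><a><![CDATA[x]]></a>");
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```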
  268. <h3><a name="validation">Validation</a></h3>
  269. <p> In the 'pipeline' package in this same software distribution is an
  270. <a href="../pipeline/ValidationConsumer.html">XML Validation component</a>
  271. that validates any full SAX2 event stream (including all document type declarations).
  272. There is now an <a href="XmlReader.html">XmlReader</a> class
  273. which combines that class and this enhanced &AElig;lfred parser, creating
  274. an optionally validating SAX2 parser. </p>
  275. <p> As noted in the documentation for that validating component, certain
  276. validity constraints can't reliably be tested by a layered validator.
  277. These include all constraints relying on
  278. layering violations (exposing XML at the level of tokens or below,
  279. required since XML isn't a context-free grammar), some that
  280. SAX2 doesn't support, and a few others. The resulting validating
  281. parser is conformant enough for most applications that aren't doing
  282. strange SGML tricks with DTDs.
  283. Moreover, that validating filter can be used without
  284. a parser ... any application component that emits SAX event streams
  285. can DTD-validate its output on demand. </p>
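<p> From the application's side, DTD validation shows up as recoverable errors delivered to the <code>ErrorHandler</code>. The sketch below requests a validating parser through JAXP and collects those errors. It assumes the JAXP default parser and does not use the <code>XmlReader</code> class described above; <code>DtdValidation</code> and <code>validityErrors</code> are hypothetical names. </p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdValidation {
    /** Returns the validity error messages reported for the document. */
    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // request DTD validation
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        // Validity errors are recoverable; record and continue.
                        errors.add(e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>]>";
        System.out.println(validityErrors(dtd + "<a><b/></a>").isEmpty());
        System.out.println(validityErrors(dtd + "<a/>").isEmpty());
    }
}
```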
  286. <h3><a name="smaller">You Want Smaller?</a></h3>
  287. <p> You'll have noticed that the original version of &AElig;lfred
  288. had small size as a top goal. &AElig;lfred2 normally includes a
  289. DTD validation layer, but you can package without that.
  290. Similarly, JAXP factory support is available but optional.
  291. Then the main added costs due to this revision are for
  292. supporting the SAX2 API itself; DTD validation is as
  293. cleanly layered as allowed by SAX2.</p>
  294. <h3><a name="bugfixes">Bugs Fixed</a></h3>
  295. <p> Bugs fixed in &AElig;lfred2 include: </p>
  296. <ol>
  297. <li> Originally &AElig;lfred didn't close file descriptors, which
  298. led to file descriptor leakage on programs which ran for any
  299. length of time. </li>
  300. <li> NOTATION declarations without system identifiers are
  301. now handled correctly. </li>
  302. <li> DTD events are now reported for all invocations of a
  303. given parser, not just the first one. </li>
  304. <li> More correct character handling: <ul>
  305. <li> Rejects out-of-range characters, both in text and in
  306. character references. </li>
  307. <li> Correctly handles character references that expand to
  308. surrogate pairs. </li>
  309. <li> Correctly handles UTF-8 encodings of surrogate pairs. </li>
  310. <li> Correctly handles Unicode 3.1 rules about illegal UTF-8
  311. encodings: there is only one legal encoding per character. </li>
  312. <li> PUBLIC identifiers are now rejected if they have illegal
  313. characters. </li>
  314. <li> The parser is more correct about what characters are allowed
  315. in names and name tokens. Uses Unicode rules (built in to Java)
  316. rather than the voluminous XML rules, although some extensions
  317. have been made to match XML rules more closely.</li>
  318. <li> Line ends are now normalized to newlines in all known
  319. cases. </li>
  320. </ul></li>
  321. <li> Certain validity errors were previously treated as well
  322. formedness violations. <ul>
  323. <li> Repeated declarations of an element type are no
  324. longer fatal errors. </li>
  325. <li> Undeclared parameter entity references are no longer
  326. fatal errors. </li>
  327. </ul></li>
  328. <li> Attribute handling is improved: <ul>
  329. <li> Whitespace must exist between attributes. </li>
  330. <li> Only one value for a given attribute is permitted. </li>
  331. <li> ATTLIST declarations don't need to declare attributes. </li>
  332. <li> Attribute values are normalized when required. </li>
  333. <li> Tabs in attribute values are normalized to spaces. </li>
  334. <li> Attribute values containing a literal "&lt;" are rejected. </li>
  335. </ul></li>
  336. <li> More correct entity handling: <ul>
  337. <li> Whitespace must precede NDATA when declaring unparsed
  338. entities.</li>
  339. <li> Parameter entity declarations may not have NDATA annotations. </li>
  340. <li> The XML specification has a bug in that it doesn't specify
  341. that certain contexts exist within which parameter entity
  342. expansion must not be performed. Lacking an official erratum,
  343. this parser now disables such expansion inside comments,
  344. processing instructions, ignored sections, public identifiers,
  345. and parts of entity declarations. </li>
  346. <li> Entity expansions that include quote characters no longer
  347. confuse parsing of strings using such expansions. </li>
  348. <li> Whitespace in the values of internal entities is not mapped
  349. to space characters. </li>
  350. <li> General entity references in attribute defaults within the
  351. DTD now cause fatal errors when the entity is not defined at the
  352. time it is referenced. </li>
  353. <li> Malformed general entity references in entity declarations are
  354. now detected. </li>
  355. </ul></li>
  356. <li> Neither conditional sections
  357. nor parameter entity references within markup declarations
  358. are permitted in the internal subset. </li>
  359. <li> Processing instructions whose target names are "XML"
  360. (ignoring case) are now rejected. </li>
  361. <li> Comments may not include "--".</li>
  362. <li> Most "]]&gt;" sequences in text are rejected. </li>
  363. <li> Correct syntax for standalone declarations is enforced. </li>
  364. <li> Setting a locale for diagnostics only produces an exception
  365. if the language of that locale isn't English. </li>
  366. <li> Some more encoding names are recognized. These include the
  367. Unicode 3.0 variants of UTF-16 (UTF-16BE, UTF-16LE) as well as
  368. US-ASCII and a few commonly seen synonyms. </li>
  369. <li> Text (from character content, PIs, or comments) large enough
  370. not to fit into internal buffers is now handled correctly even in
  371. some cases which were originally handled incorrectly.</li>
  372. <li> Content is now reported for element types for which attributes
  373. have been declared, but no content model is known. (Such documents
  374. are invalid, but may still be well formed.) </li>
  375. </ol>
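<p> Two of the normalization rules in the list above are easy to observe from application code: a literal tab in an attribute value is reported as a space, and a CR-LF pair in character content is reported as a single newline. The following sketch shows both, assuming the JAXP default parser; <code>Normalization</code> and <code>attributeAndText</code> are hypothetical names, and the attribute name <code>x</code> is chosen only for this example. </p>

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class Normalization {
    /** Returns {value of attribute "x" on the root element, character content}. */
    static String[] attributeAndText(String xml) {
        final String[] attribute = new String[1];
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (attribute[0] == null) {
                            attribute[0] = atts.getValue("x");
                        }
                    }
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        // Content may arrive in several chunks; accumulate them.
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return new String[] {attribute[0], text.toString()};
    }

    public static void main(String[] args) {
        String[] result = attributeAndText("<a x=\"p\tq\">one\r\ntwo</a>");
        System.out.println("p q".equals(result[0]));      // tab became a space
        System.out.println("one\ntwo".equals(result[1])); // CR-LF became LF
    }
}
```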
  376. <p> Other bugs may also have been fixed. </p>
  377. <p> For better overall validation support, some of the validity
  378. constraints that can't be verified using the SAX2 event stream
  379. are now reported directly by &AElig;lfred2. </p>
  380. </body></html>