<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/css" href="chapter.css"?><chapter>
  <title>Appendix A: DSDL</title>
  <simplesect/>
  <sect1>
    <title>What's the problem?</title>
    <para>Although Relax NG started as a standalone project under the auspices of the Organization for the Advancement of Structured Information Standards (OASIS), it is now being standardized at ISO (ISO/IEC JTC1 SC34 WG1, to be precise) as part of a multi-part standard named DSDL (see http://dsdl.org).</para>
    <para>DSDL, which stands for &quot;Document Schema Definition Languages&quot;, recognizes that the validation of XML documents is too wide and complex a subject to be covered by a single language, and that the industry needs a set of simple, dedicated languages to perform different validation tasks, plus a framework in which these languages can be used together.</para>
    <para>Validating (or schematizing) XML documents involves many different aspects, which can be grouped into four categories:</para>
    <itemizedlist>
        <listitem><para>Validating the structure of the document, i.e. checking the nesting of elements and attributes (this is the domain in which Relax NG excels).</para></listitem>
        <listitem><para>Validating the content of each text node and attribute independently of each other (this is where datatype libraries are needed).</para></listitem>
        <listitem><para>Validating integrity constraints between different elements and attributes.</para></listitem>
        <listitem><para>Validating any other rules (often called business rules).</para></listitem>
      </itemizedlist>
    <para>Throughout this book, we've seen how Relax NG can help us cover an important part of this issue, but we've also seen that Relax NG is simple and efficient because it has been kept focused on solving one and only one problem, and that there are huge gaps which it cannot cover. For instance, if an XML vocabulary includes mixed content models, you can't restrict the content of your documents to be ASCII only, nor can you require that the content of your &quot;modeling&quot; element be spell checked. The goal of DSDL is to provide the means to fill these gaps and to cover the whole domain of document validation.</para>
    <para>DSDL can be seen as a framework and a set of languages to check the quality of XML documents, an issue that appears to be crucial for any XML-based application. Recent work, such as the presentation given by Simon Riggs at XML Europe 2003 or the work of Isabelle Boydens on the quality of big databases, has shown that about 10% of XML documents (or data records) contain at least one error, and this level of quality is unacceptable for many applications. DSDL could thus become an indispensable technology for most XML applications.</para>
  </sect1>
  <sect1>
    <title>A multi-part standard</title>
    <para>DSDL is still a work in progress. It is a multi-part specification in which each part presents a different schema language (except Part 1, which is an introduction, and Part 10, which describes the framework itself).</para>
    <sect2>
      <title>Part 1: Overview</title>
      <para>This is a kind of road map describing DSDL itself and introducing each of the parts.</para>
    </sect2>
    <sect2>
      <title>Part 2: Regular-grammar-based Validation</title>
      <para>This part is Relax NG itself. It is a rewriting of the Relax NG OASIS Technical Committee specification to meet the requirements of ISO publications. Its wording is more formal than that of the OASIS specification, but the features of the language are the same, and any Relax NG implementation that conforms to one of these two documents should also conform to the other.</para>
      <para>DSDL Part 2 is now a &quot;Final Draft International Standard&quot; (FDIS), i.e. it has reached the last stage before publication as an official ISO standard.</para>
    </sect2>
    <sect2>
      <title>Part 3: Rule-based Validation</title>
      <para>This part will describe the next release of the rule-based schema language known as Schematron. Schematron was defined by Rick Jelliffe and other contributors, and its home page is http://www.ascc.net/xml/schematron/.</para>
      <para>The current version of Schematron is a language to express sets of rules as XPath expressions (or, more accurately, as XSLT expressions, since XSLT functions such as <literal>document()</literal> are also supported in XPath expressions).</para>
      <para>Without going into the details of the language, let's say that a Schematron schema is composed of sets of rules named &quot;patterns&quot; (these patterns shouldn't be confused with Relax NG patterns). Each pattern includes one or more rules. Each rule sets the context nodes under which tests are performed, and each test is expressed either as an <literal>assert</literal> or as a <literal>report</literal>. An <literal>assert</literal> raises an error if its test is false, while a <literal>report</literal> raises an error if its test is true.</para>
      <para>A partial schema for our library could be:</para>
      <programlisting>
<![CDATA[ <sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">
  <sch:title>Schematron Schema for library</sch:title>
  <sch:pattern>
   <sch:rule context="/">
    <sch:assert test="library">The document element should be "library".</sch:assert>
   </sch:rule>
   <sch:rule context="/library">
    <sch:assert test="book">There should be at least a book!</sch:assert>
    <sch:assert test="not(@*)">No attribute for library, please!</sch:assert>
   </sch:rule>
   <sch:rule context="/library/book">
    <sch:report test="following-sibling::book/@id=@id">Duplicated ID for this book.</sch:report>
    <sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>
   </sch:rule>
   <sch:rule context="/library/*">
    <sch:assert test="self::book or self::author or self::character">This element shouldn't be here...</sch:assert>
   </sch:rule>
  </sch:pattern>
 </sch:schema>]]>
      </programlisting>
      <para>This simple example shows that it would be very verbose to write a full schema with Schematron, since that would mean writing a rule for each element and, in this rule, writing all the individual tests checking the content model and possibly the relative order of child elements. It also shows that Schematron is hard to beat for expressing what are often called business rules, such as:</para>
      <programlisting>
<![CDATA[ <sch:assert test="@id=concat('_', isbn)">The id should be derived from the ISBN.</sch:assert>]]>
      </programlisting>
      <para>which checks that the <literal>id</literal> attribute of a book is the value of its <literal>isbn</literal> element with a leading underscore added.</para>
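      <para>To make this concrete, here is a small instance document sketched for illustration only (the ISBN values and the second <literal>id</literal> are made up). Assuming the Schematron schema shown above, the first book should satisfy the rule, while the second should trigger the &quot;The id should be derived from the ISBN.&quot; message:</para>
      <programlisting>
<![CDATA[ <library>
  <!-- Passes: the id is the ISBN with a leading underscore -->
  <book id="_0836217462">
   <isbn>0836217462</isbn>
  </book>
  <!-- Fails the assert: "b2" is not equal to concat('_', isbn) -->
  <book id="b2">
   <isbn>0805033106</isbn>
  </book>
 </library>]]>
      </programlisting>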
      <para>DSDL Part 3, the next version of Schematron, should keep this structure and add still more power by allowing not only XPath 1.0 expressions but also expressions taken from other languages such as EXSLT (a standard extension library for XSLT), XPath 2.0, XSLT 2.0 and even XQuery 1.0.</para>
    </sect2>
    <sect2>
      <title>Part 4: Selection of Validation Candidates</title>
      <para>Although Relax NG provides a way to write and combine modular schemas, you often need to validate a composite document against existing schemas which may be written in different languages: for instance, you may want to validate XHTML documents with embedded RDF statements. In this case, you need to split your documents into pieces and validate each of these pieces against its own schema.</para>
      <para>The first contribution to Part 4 was an ISO specification known as &quot;Relax Namespace&quot; by Murata Makoto. It was followed by a couple of others, namely MNS by James Clark and &quot;Namespace Switchboard&quot; by Rick Jelliffe. The latest contribution, &quot;Namespace Routing Language&quot; (NRL), was made by James Clark in June 2003 and builds on the previous proposals. Although it is too early to say whether NRL will become DSDL Part 4, it will most likely influence it heavily. NRL is implemented in the latest versions of Jing.</para>
      <para>The first example given in the specification (http://www.thaiopensource.com/relaxng/nrl.html) shows how NRL can be used to validate a SOAP message containing one or more XHTML documents:</para>
      <programlisting>
<![CDATA[ <rules xmlns="http://www.thaiopensource.com/validate/nrl">
  <namespace ns="http://schemas.xmlsoap.org/soap/envelope/">
    <validate schema="soap-envelope.xsd"/>
  </namespace>
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.rng"/>
  </namespace>
 </rules>]]>
      </programlisting>
      <para>This would split each SOAP message into its envelope, which is validated against the W3C XML Schema schema &quot;soap-envelope.xsd&quot;, and one or more XHTML documents found in the body of the SOAP message, which are validated against the Relax NG schema &quot;xhtml.rng&quot;.</para>
      <para>More advanced features are available, including namespace wildcards, validation modes, open schemas and transparent namespaces, and NRL seems able to handle the most complex cases as long as the basic assumption holds, namely that instance documents can be split according to the namespaces of their elements and attributes.</para>
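      <para>As an illustration, a message of the following shape would be split along the namespace boundary marked by the comments. This instance is a hypothetical sketch written for this discussion, not an example taken from the NRL specification:</para>
      <programlisting>
<![CDATA[ <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <!-- SOAP namespace: this section goes to soap-envelope.xsd -->
  <soap:Body>
   <!-- XHTML namespace: this section goes to xhtml.rng -->
   <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
     <title>Hello</title>
    </head>
    <body>
     <p>Hello, world.</p>
    </body>
   </html>
  </soap:Body>
 </soap:Envelope>]]>
      </programlisting>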
    </sect2>
    <sect2>
      <title>Part 5: Datatypes</title>
      <para>The goal of this part is to define a set of primitive datatypes with their constraining facets, together with the mechanisms to derive new datatypes from this set. It is fair to say that it's probably the least advanced and most complex part of DSDL. While people agree on what shouldn't be done, it is difficult to go beyond the criticism of existing systems such as W3C XML Schema datatypes and propose something better.</para>
      <para>Some interesting ideas were raised during the last DSDL meeting in May 2003. They more or less converge with threads discussed on the XML-DEV mailing list in June, and we may hope that this will lead to something more constructive at the next DSDL meeting in December 2003.</para>
    </sect2>
    <sect2>
      <title>Part 6: Path-based Integrity Constraints</title>
      <para>The goal of this part is basically to define a feature covering W3C XML Schema's xs:unique, xs:key and xs:keyref. Part 6 hasn't seen any contribution yet.</para>
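      <para>For reference, this is roughly what such constraints look like in W3C XML Schema today. The sketch below is only an illustration: the <literal>loan</literal> element and its <literal>book</literal> attribute are hypothetical additions to our library vocabulary, the content of <literal>book</literal> is reduced to its <literal>id</literal> attribute, and nothing here prefigures the syntax that Part 6 will adopt.</para>
      <programlisting>
<![CDATA[ <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="library">
   <xs:complexType>
    <xs:sequence>
     <xs:element name="book" maxOccurs="unbounded">
      <xs:complexType>
       <xs:attribute name="id" type="xs:string" use="required"/>
      </xs:complexType>
     </xs:element>
     <!-- Hypothetical element, used only to show a reference -->
     <xs:element name="loan" minOccurs="0" maxOccurs="unbounded">
      <xs:complexType>
       <xs:attribute name="book" type="xs:string" use="required"/>
      </xs:complexType>
     </xs:element>
    </xs:sequence>
   </xs:complexType>
   <!-- Every book must have an id, unique within the library -->
   <xs:key name="bookKey">
    <xs:selector xpath="book"/>
    <xs:field xpath="@id"/>
   </xs:key>
   <!-- Every loan must point to the id of an existing book -->
   <xs:keyref name="loanRefersToBook" refer="bookKey">
    <xs:selector xpath="loan"/>
    <xs:field xpath="@book"/>
   </xs:keyref>
  </xs:element>
 </xs:schema>]]>
      </programlisting>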
    </sect2>
    <sect2>
      <title>Part 7: Character Repertoire Validation</title>
      <para>This part will make it possible to specify which characters may be used in specific elements and attributes or within entire XML documents. The W3C note &quot;A Notation for Character Collections for the WWW&quot; (http://www.w3.org/TR/charcol/) is used as an input for Part 7, and the first contribution is &quot;Character Repertoire Validation for XML&quot; (CRVX) (http://dret.net/netdret/docs/wilde-crvx-www2003.html).</para>
      <para>A simple example of CRVX is:</para>
      <programlisting>
<![CDATA[ <crvx xmlns="http://dret.net/xmlns/crvx10">
  <restrict structure="ename aname pitarget" charrep="\p{IsBasicLatin}"/>
  <restrict structure="ename aname" charrep="[^0-9]"/>
 </crvx>]]>
      </programlisting>
      <para>In this proposal, the <literal>structure</literal> attribute contains identifiers for &quot;element names&quot; (ename), &quot;attribute names&quot; (aname), &quot;processing instruction targets&quot; (pitarget) and other XML constructs, including element and attribute contents. This example would thus require that element names, attribute names and processing instruction targets all use characters from the BasicLatin block, and that element and attribute names do not use digits.</para>
      <para>There is some overlap between Part 7 and other schema languages such as Part 2 (Relax NG): you can simply take care that the names used in your schema match these rules, and you can use the <literal>data</literal> pattern to check the content of attributes and simple content elements. However, Part 7 gives you a more focused means of expressing these rules independently of other schemas, and it fills some gaps: Relax NG cannot express such constraints on name classes or on mixed content elements.</para>
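      <para>As a sketch of the overlapping case, a Relax NG <literal>data</literal> pattern can restrict a simple content element to the BasicLatin block through the <literal>pattern</literal> facet of the W3C XML Schema datatype library. The <literal>title</literal> element is used here only for illustration:</para>
      <programlisting>
<![CDATA[ <element name="title" xmlns="http://relaxng.org/ns/structure/1.0"
          datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Text content limited to characters from the BasicLatin block -->
  <data type="string">
   <param name="pattern">\p{IsBasicLatin}*</param>
  </data>
 </element>]]>
      </programlisting>
      <para>This works only because the element has simple content. The same restriction cannot be written for a mixed content model or for the names themselves.</para>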
    </sect2>
    <sect2>
      <title>Part 8: Declarative Document Architectures</title>
      <para>This part is still the most mysterious to me. The idea here is to make it possible to add information to documents (such as default values) depending on their structure. The only input considered for Part 8 so far is known as &quot;Architectural Forms&quot;, an old, promising but never widely used technology.</para>
    </sect2>
    <sect2>
      <title>Part 9: Namespace and Datatype-aware DTDs</title>
      <para>There were plenty of good things in DTDs, especially in SGML DTDs. Many people still use them and challenge the need to throw them away and define new schema languages just to support namespaces and datatypes. DSDL Part 9 is for those people, who would like to rely on years of DTD usage without losing all the goodies of newer schema languages. Despite a burst of discussion in April 2002, this part hasn't really advanced yet.</para>
    </sect2>
    <sect2>
      <title>Part 10: Validation Management</title>
      <para>Last but not least, Part 10 (formerly known as Part 1: Interoperability Framework) is the cement that will let you use the different parts of DSDL together with external tools such as XSLT, W3C XML Schema or your favorite spell checker (to come back to an example given in the introduction to this chapter).</para>
      <para>Here again, different contributions have been made, including my own &quot;XML Validation Interoperability Framework&quot; (XVIF) and Rick Jelliffe's Schemachine. The latest contribution is known (and implemented) as &quot;xvif/outie&quot; (see http://downloads.xmlschemata.org/python/xvif/outie/about.xhtml).</para>
      <para>A simple example of an xvif/outie document is:</para>
      <programlisting>
<![CDATA[ <?xml version="1.0" encoding="utf-8"?>
 <framework>
  <rule>
   <instance>
    <transform transformation="normalize.xslt"/>
   </instance>
   <assert>
    <isValid schema="schema.rng"/>
    <isValid schema="schema.sch"/>
   </assert>
  </rule>
 </framework>]]>
      </programlisting>
      <para>This document defines a rule to be checked on the result of the XSLT transformation &quot;normalize.xslt&quot; applied to the instance document. The rule is that the result of the transformation must be valid per both &quot;schema.rng&quot; and &quot;schema.sch&quot;.</para>
    </sect2>
  </sect1>
  <sect1>
    <title>What DSDL should bring you</title>
    <para>As a Relax NG user, you should get from DSDL everything that Relax NG has left aside in order to focus on validating the structure of XML documents, and even more:</para>
    <itemizedlist>
        <listitem><para>You are already using Part 2 (Relax NG).</para></listitem>
        <listitem><para>Part 3 (Schematron) gives you the ability to add highly flexible &quot;business rules&quot; to your schemas.</para></listitem>
        <listitem><para>Part 4 (Selection of Validation Candidates) lets you write and reuse schemas written in any language and combine them to validate composite documents.</para></listitem>
        <listitem><para>Part 5 (Datatypes) should provide a better alternative to W3C XML Schema datatypes.</para></listitem>
        <listitem><para>Part 6 (Path-based Integrity Constraints) will let you specify integrity constraints between elements and attributes.</para></listitem>
        <listitem><para>Part 7 (Character Repertoire Validation) will let you specify which characters may be used in your documents.</para></listitem>
        <listitem><para>Part 8 (Declarative Document Architectures) will let you add information that was left implicit to your documents before validating them.</para></listitem>
        <listitem><para>Part 9 (Namespace and Datatype-aware DTDs) will let you upgrade and reuse your DTDs in the context of newer applications.</para></listitem>
        <listitem><para>Part 10 (Validation Management) will let you do all this together and plug in other transformation and validation tools.</para></listitem>
      </itemizedlist>
    <para>If you like Relax NG, I am sure that you'll enjoy the other members of the DSDL family. They share the same principle of focusing on a specific issue, and this focus keeps them both powerful and easy to use.</para>
  </sect1>
</chapter>