In the previous chapter, we have seen the basics of the "data" pattern used with the highly restricted built-in datatype library.

The extreme simplicity of this built-in type library (limited to the two datatypes "string" and "token") should not be seen as a limitation of Relax NG but rather as the result of a fundamental design decision: validating the structure and validating the content of XML documents are different issues that are better solved by different tools working in close cooperation.

The Relax NG strategy is thus to rely on external pluggable libraries for the validation of the content of the text nodes and attributes.

There is no limit to the variety of external type libraries which could be implemented and used by a Relax NG schema, and the designers of Relax NG think that there is probably room both for generic type libraries and for application-specific type libraries meeting the needs of a specific domain such as mathematics, physics or business.

It is also possible to implement language-specific type libraries: my Python implementation of Relax NG supports a native Python type library which maps the built-in types and allows restrictions to be defined using the Python syntax.

That being said, it is expected that most users will use generic XML type libraries, ranging from a library emulating the datatypes of DTDs, through the W3C XML Schema datatype library, to an ISO/DSDL type library which is not yet defined. In this chapter we'll introduce the two which are already available, i.e. the W3C XML Schema and DTD compatibility type libraries.

!!!W3C XML Schema type library

The so-called simple types of W3C XML Schema take up several chapters of my book about W3C XML Schema, but I'll try to give a brief overview here so that you can use their most basic features within Relax NG schemas. You will find their definition in "Chapter 19: W3C XML Schema Datatypes" and you are of course welcome to read chapters 4, 5, 6 and 16 of my W3C XML Schema book to get a deeper understanding of their behavior!

The W3C XML Schema datatypes which can be used in a Relax NG schema are the so-called "predefined" W3C XML Schema types, i.e. those which are defined in the W3C XML Schema recommendation as opposed to "user defined types" which are derived from the predefined types using the W3C XML Schema language and can't be used from a Relax NG schema. We will see that restrictions (called "facets" in the terminology of W3C XML Schema) can be applied to these datatypes using the Relax NG "param" pattern.

Since we are able to define named patterns in Relax NG, even though there is no access to "user defined W3C XML Schema simple types", we still have the possibility to define "user defined Relax NG patterns consisting of a predefined W3C XML Schema type and a set of facets". This might be a bit confusing right now but it will become clearer with examples; I just wanted to draw your attention to the fact that Relax NG is only borrowing the most basic part of W3C XML Schema datatypes, without borrowing its syntax and derivation methods.

!!The datatypes

The W3C XML Schema predefined datatypes are divided into "primitive" and "derived" types. Primitive types are basic types which do not share a common semantic and behave differently from each other. Each derived type could have been derived from a primitive type using the W3C XML Schema derivation features; it shares the semantic of this primitive type and is provided for the convenience of users since it is expected to be commonly used.

The other notion which needs to be introduced before we start is the notion of lexical and value spaces: the lexical space is the string as it appears in the XML document after a possible whitespace normalization, while the value space is the matching value as interpreted by the datatype library. The distinction is important since all the facets save one (the "pattern" facet, which will be covered in depth in the next chapter: "Chapter 9: W3C XML Schema Regular Expressions") act on the value space. For instance, the two text nodes "1" and "01" will be considered different if the datatype is a token and identical if the datatype is an integer.
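
To make the distinction concrete, here is a minimal Python sketch which compares the two text nodes first as tokens and then as integers. The function names are mine and are not part of any Relax NG implementation, and Python's str.split() is only an approximation of XML whitespace normalization.

```python
def token_value(lexical):
    # Approximate "token" normalization: collapse whitespace runs and trim.
    return " ".join(lexical.split())


def integer_value(lexical):
    # Map a lexical representation to the integer it denotes.
    return int(token_value(lexical))


# As tokens the two text nodes are different strings ...
same_as_token = token_value("1") == token_value("01")
# ... but as integers they denote the same value.
same_as_integer = integer_value("1") == integer_value("01")

print(same_as_token, same_as_integer)
```

The comparison is done on what each function returns, which plays the role of the value space; the strings passed in play the role of the lexical space.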

In this section, we will give a brief presentation of the datatypes classified by their primitive types.

!String datatypes

The string datatypes are:

* "string" : This is the only datatype for which no whitespace normalization is done. There is no restriction on the lexical or value spaces of this datatype, which is identical to the "string" Relax NG built-in type with the exception that restrictions can be applied through "param" patterns on the W3C XML Schema string type.
* "normalizedString" : An intermediate whitespace processing is applied to this datatype: each occurrence of the whitespace characters #x9 (tab), #xA (linefeed) and #xD (carriage return) is replaced by a space (#x20), so that the number of characters is unchanged, but no space collapsing or trimming is performed. As for the "string" datatype, there is no restriction on the lexical or value spaces of this datatype.
* "token" : This datatype is similar to the built-in token datatype: whitespaces are normalized, i.e. all the sequences of whitespaces are replaced by a single space and the leading and trailing spaces are removed. This is, with the two previous ones, the third and last datatype which has no constraint on its value and lexical spaces. We must also note that all the datatypes except "string" and "normalizedString" follow the same normalization rules as the "token" datatype.
* "language" : This was created to accept all the language codes standardized by RFC 1766. Some valid values for this datatype are en, en-US, fr, or fr-FR.
* "NMTOKEN" :  This corresponds to the XML 1.0 "Nmtoken" (Name token) production, which is a single token (a set of characters without spaces) composed of characters allowed in XML name. Some valid values for this datatype are "Snoopy", "CMS", "1950-10-04", or "0836217462". Invalid values include "brought classical music to the Peanuts strip" (spaces are forbidden) or "bold,brash" (commas are forbidden).
* "NMTOKENS" : The lexical and value spaces of "NMTOKENS" are the whitespace separated lists of "NMTOKEN".
* "Name" : This is similar to "NMTOKEN" with the additional restriction that the values must start with a letter or one of the characters ":" or "_". This datatype conforms to the XML 1.0 definition of a "Name". Some valid values for this datatype are "Snoopy", "CMS", or "_1950-10-04-10:00". Invalid values include "0836217462" ("Name" cannot start with a digit), "-1950-10-04-10:00" ("Name" cannot start with a hyphen) or "bold,brash" (commas are forbidden). This datatype should not be used for names that may be "qualified" by a namespace prefix, since we will see another datatype ("QName") that has a specific semantic for these values.
* "NCName" : This is the "noncolonized name" defined by Namespaces in XML 1.0, i.e., a "Name" without any colons (":"). As such, this datatype is probably the predefined datatype that is closest to the notion of a "name" in most programming languages, even though some characters such as "-" or "." may still be a problem in many cases. Some valid values for this datatype are "Snoopy", "CMS", or "_1950-10-04-10-00". Invalid values include "1950-10-04" (an "NCName" cannot start with a digit), "_1950-10-04:10-00" or "bold:brash" (colons are forbidden).
* "ID" : The lexical space of "ID" is the same as the lexical space of "NCName". As defined by the W3C XML Schema recommendation, there is one constraint added to its value space, which is that there must not be any duplicate values in a document. Relax NG doesn't allow datatype libraries to perform this type of check. This is a job for the "DTD compatibility feature", as we will see at the end of this chapter, and its specification asks Relax NG processors supporting this feature to enforce ID uniqueness for W3C XML Schema ID datatypes. Other implementations will just check its lexical space as an "NCName".
* "IDREF" : The lexical space of "IDREF" is the same as the lexical space of "NCName". As for "ID", W3C XML Schema adds the constraint that it must match an ID defined in the same document, and Relax NG makes this behavior optional for Relax NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.
* "IDREFS" : The lexical space of "IDREFS" is the whitespace separated lists of "NCName". As for "ID" and "IDREF", W3C XML Schema adds the constraint that each of the values must match an ID defined in the same document, and Relax NG makes this behavior optional for Relax NG processors supporting the W3C XML Schema type library without supporting the DTD compatibility feature.
* "ENTITY" : The lexical space of "ENTITY" is the same as the lexical space of "NCName". Also provided for compatibility with XML 1.0 DTDs, an "ENTITY" value must match an unparsed entity defined in a DTD.
* "ENTITIES" : The lexical and value spaces of "ENTITIES" are the whitespace separated lists of "ENTITY".
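
The three levels of whitespace processing used by "string", "normalizedString" and "token" can be sketched in Python as follows. This is only an illustration under my own assumptions: the function names are invented for the example and the sample text is arbitrary.

```python
import re


def as_string(text):
    # "string": no whitespace processing at all.
    return text


def as_normalized_string(text):
    # "normalizedString": tab, linefeed and carriage return each become one space.
    return re.sub(r"[\t\n\r]", " ", text)


def as_token(text):
    # "token": same replacement, then runs of spaces collapse and the ends are trimmed.
    return re.sub(r" +", " ", as_normalized_string(text)).strip(" ")


raw = "  Being a dog\tis a\n full-time  job  "
for process in (as_string, as_normalized_string, as_token):
    print(process.__name__, repr(process(raw)))
```

Note that the intermediate processing keeps the number of characters unchanged, while the "token" processing may shorten the text.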

!URIs

Strictly speaking, "anyURI", the only member of this family, isn't considered a string since its value can be different from its lexical representation, in order to compensate for the differences of format between XML and URIs as specified in RFCs 2396 and 2732. These RFCs are not very friendly toward non-ASCII characters and require many character escapes that are not necessary in XML.

As an example of this transformation, the href attribute of an XHTML link written as:

 <a href="http://dmoz.org/World/Français/">
   World/Français
 </a>

would be converted to the value:

 http://dmoz.org/World/Fran%C3%A7ais/

in the value space.

Also note that the "anyURI" datatype doesn't pay any attention to xml:base attributes that may have been defined in the document.
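
This escaping can be approximated in Python with urllib.parse.quote. The sketch below is not the algorithm of the recommendation: the function name and the list of characters left untouched are assumptions made for the example.

```python
from urllib.parse import quote

# Characters left untouched: URI delimiters plus "%" so that existing escapes
# are not escaped twice. This list is an assumption made for the sketch.
UNTOUCHED = ":/?#[]@!$&'()*+,;=%"


def any_uri_value(lexical):
    # Non-ASCII characters are encoded as UTF-8, then each byte is %-escaped.
    return quote(lexical.strip(), safe=UNTOUCHED)


print(any_uri_value("http://dmoz.org/World/Français/"))
```

A URI which is already escaped goes through unchanged, so both spellings end up with the same value.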

!Qualified names

Up to now, we have only briefly mentioned XML namespaces and we will introduce them in "Chapter 11: Namespaces", but we need to use some of their concepts right now. If you're not familiar with namespaces, it should be safe to skip this section: you can be quite sure that you don't need qualified names, and even if you are an XML namespace guru, I wouldn't recommend using them in values, which I consider a bad practice!

What we're talking about here is different from using qualified names for element and attribute names. Using qualified names for element and attribute names is required by the recommendation "Namespaces in XML 1.0" and there isn't much debate left on the subject. Here, we are speaking of using qualified names in element or attribute values, which is much more controversial since it creates a dependency between the markup and its content.

Because of this dependency, you cannot consider a qualified name as a string datatype since its prefix is only a shortcut to the associated namespace URI. The value space of a qualified name is thus not what we see but a tuple composed of the associated namespace URI (replacing the prefix) and its local part (i.e. what is after the prefix and the colon).

For instance, if the "xsd" prefix has been associated with the namespace URI "http://www.w3.org/2001/XMLSchema", a qualified name (QName) "xsd:language" would have a value which is the tuple {"http://www.w3.org/2001/XMLSchema", "language"}. It can be considered equal to a QName "foo:language" if the prefix "foo" has been associated with "http://www.w3.org/2001/XMLSchema", or even to "language" if "http://www.w3.org/2001/XMLSchema" has been defined as the default namespace.

There are two QName datatypes considered as equivalent for Relax NG:

* "QName" : this is the "usual" QName datatype where the lexical space is the set of "colonized" names consisting of a prefix and a local name separated by a colon (":") and the value space is the set of tuples {namespace URI, local name} as explained above. Note that a prefix must be defined through a namespace declaration in the scope of the location where it is used to be considered valid.

* "NOTATION" : for W3C XML Schema, a "NOTATION" is a QName declared as a notation in a W3C XML Schema schema. Since Relax NG has no equivalent syntax to declare notations, a Relax NG processor treats "NOTATION" as a synonym for "QName".
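
The mapping from the lexical space to the value space of a QName can be sketched in Python as follows. The function name and the dictionaries standing for the namespace declarations in scope are hypothetical; a real processor gets these declarations from the XML parser.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema"


def qname_value(lexical, namespaces):
    # Split on the last colon; an unprefixed name uses the default namespace,
    # stored here under the empty prefix.
    prefix, _, local = lexical.rpartition(":")
    if prefix not in namespaces:
        raise ValueError("undeclared prefix: %r" % prefix)
    return (namespaces[prefix], local)


# Three different lexical forms, each read with its own declarations in scope.
first = qname_value("xsd:language", {"xsd": XSD_NS})
second = qname_value("foo:language", {"foo": XSD_NS})
third = qname_value("language", {"": XSD_NS})

print(first == second == third)
```

The three lexical forms are different strings, yet they are equal once resolved against the namespace declarations, which is exactly the dependency between markup and content mentioned above.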

!Binary string-encoded datatypes

XML 1.0 is unable to hold binary content, which must be string-encoded before it can be included in an XML document. W3C XML Schema has defined two primitive datatypes to support two encodings, one that is commonly used (base64) and one that is newer (hexBinary). These encodings may be used to include any binary content, including text formats whose content may be incompatible with the XML markup. Other binary text encodings may also be used (such as uuXXcode, Quoted-Printable, BinHex, aencode, or base85, to name a few), but their value would not be recognized by W3C XML Schema.

* "hexBinary": This defines a simple way to code binary content as a character string by translating the value of each binary octet into two hexadecimal digits. This encoding is different from the encoding method called BinHex (introduced by Apple and described by RFC 1741, which includes a mechanism to compress repetitive characters). A UTF-8 XML header such as: <?xml version="1.0" encoding="UTF-8"?> that is encoded as hexBinary would be: "3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e".

* "base64Binary": This matches the encoding known as "base64" and is described in RFC 2045. It maps groups of 6 bits into an array of 64 printable characters. The same header (followed by a CR LF line break) encoded as base64Binary would be: "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCg==". The W3C XML Schema Recommendation missed the fact that RFC 2045 requests a line break every 76 characters. This should be clarified in an erratum. The consequence of these line breaks being thought of as optional by W3C XML Schema is that the lexical and value spaces of "base64Binary" cannot be considered identical.
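
Both encodings are available in the Python standard library, which gives an easy way to experiment with them. In this sketch, the variable names are mine and I assume that the base64 string quoted above covers the header followed by a CR LF line break.

```python
import base64
import binascii

header = b'<?xml version="1.0" encoding="UTF-8"?>'

# hexBinary: two hexadecimal digits per octet.
hex_form = binascii.hexlify(header).decode("ascii")

# base64Binary. Assumption: the header is followed by a CR LF line break.
b64_form = base64.b64encode(header + b"\r\n").decode("ascii")

print(hex_form)
print(b64_form)

# Different lexical representations can stand for the same octets:
# the case of hexadecimal digits and line breaks inside base64 are not significant.
same_hex = bytes.fromhex("3C3F") == bytes.fromhex("3c3f")
same_b64 = base64.b64decode("PD94\nbWwg") == base64.b64decode("PD94bWwg")
print(same_hex, same_b64)
```

The last two comparisons illustrate why the lexical and value spaces of these datatypes cannot be considered identical.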

!Numeric datatypes

The numeric datatypes are built on top of four primitive datatypes: "decimal" for all the decimal types (including the integer datatypes, considered decimals without a fractional part), "float" and "double" for single and double precision floats, and "boolean" for Booleans.

The first family of numeric datatypes is derived from the primitive type "decimal":

* "decimal": This datatype represents the decimal numbers. The number of digits can be arbitrarily long (the datatype doesn't impose any restriction), but obviously, since an XML document has an arbitrary but finite length, the number of digits of the lexical representation of a "decimal" value needs to be finite. Although the number of digits is not limited, we will see in the next section (facets) how the author of a schema can derive user-defined datatypes with a limited number of digits if needed. Leading and trailing zeros are not significant and may be trimmed. The decimal separator is always a dot ("."); a leading sign ("+" or "-") may be specified and any characters other than the 10 digits (including whitespaces) are forbidden. Allowed values for decimal include "123.456", "+1234.456", "-.456" or "-456".
* "integer": This integer datatype is a subset of "decimal", representing numbers which don't have any fractional digits in their lexical or value spaces. The characters that are accepted are reduced to the 10 digits and an optional leading sign. Like its base datatype, "integer" doesn't impose any limitation on the number of digits, and leading zeros are not significant. Note that the decimal separator is forbidden even if the fractional digits are omitted or are zeros.
* "nonPositiveInteger": The W3C has thought that negative statements would be clearer for developers here and "nonPositiveInteger" are the "integer" which are negative or null (because zero is neither positive nor negative).
* "negativeInteger": "integer" which are strictly negative.
* "nonNegativeInteger": positive or null "integer".
* "positiveInteger": strictly positive "integer".
* "long": integer between -9223372036854775808 and 9223372036854775807, i.e., the values that can be stored in a 64-bit word.
* "int": integer between -2147483648 and 2147483647 (32 bits).
* "short": integer between -32768 and 32767 (16 bits).
* "byte": integer between -128 and 127 (8 bits).
* "unsignedLong": unsigned integers between 0 and 18446744073709551615, i.e., the values that can be stored in a 64-bit word.
* "unsignedInt": unsigned integers between 0 and 4294967295 (32 bits).
* "unsignedShort": unsigned integers between 0 and 65535 (16 bits).
* "unsignedByte": unsigned integers between 0 and 255 (8 bits).
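
The bounds listed above are simply those of signed and unsigned words of 8, 16, 32 and 64 bits. The following Python sketch, in which the table, the regular expression and the function name are assumptions made for the example, checks a lexical representation against them.

```python
import re

# Value ranges of the fixed-size integer datatypes, derived from their word size.
BOUNDS = {
    "long": (-2**63, 2**63 - 1),
    "int": (-2**31, 2**31 - 1),
    "short": (-2**15, 2**15 - 1),
    "byte": (-2**7, 2**7 - 1),
    "unsignedLong": (0, 2**64 - 1),
    "unsignedInt": (0, 2**32 - 1),
    "unsignedShort": (0, 2**16 - 1),
    "unsignedByte": (0, 2**8 - 1),
}

# Lexical space of "integer": an optional sign followed by digits only.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def is_valid(datatype, text):
    lexical = text.strip(" \t\r\n")
    if not INTEGER_LEXICAL.fullmatch(lexical):
        return False
    low, high = BOUNDS[datatype]
    # The comparison is done on the value, so leading zeros do not matter.
    return low <= int(lexical) <= high


print(is_valid("byte", "+007"), is_valid("byte", "128"), is_valid("int", "1.0"))
```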

The second family is made of the "float" and "double" datatypes, which represent IEEE single (32 bits) and double (64 bits) precision floating-point types. These store the values in the form of a mantissa and an exponent of a power of 2 (m x 2^e), allowing a large scale of numbers in a storage that has a fixed length. Fortunately, the lexical space doesn't require that we use powers of 2 (in fact, it doesn't accept powers of 2), but instead lets us use a traditional scientific notation with integer powers of 10. Since the value spaces (powers of 2) don't exactly match the values from the lexical space (powers of 10), the recommendation specifies that the closest value is taken. The consequence of this approximate matching is that float datatypes are the domain of approximation: most float values can't be considered exact.

These datatypes accept several "special" values: positive zero (0), negative zero (-0) (which is less than positive 0 but greater than any negative value), infinity (INF) (which is greater than any value), negative infinity (-INF) (which is less than any float), and "not a number" (NaN).
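
Python floats are IEEE double precision numbers, so a few lines are enough to observe the approximation and the special values. This is only a sketch: the helper name is mine, and Python's own comparison rules are not guaranteed to match those of the recommendation.

```python
import math
import struct


def nearest_float32(x):
    # Round-trip through a 32-bit IEEE representation.
    return struct.unpack("<f", struct.pack("<f", x))[0]


as_double = float("0.1")               # closest 64-bit value to 0.1
as_float = nearest_float32(as_double)  # closest 32-bit value to 0.1
print(as_double == as_float)

# Python's float() happens to accept the INF, -INF and NaN spellings.
inf, neg_inf, nan = float("INF"), float("-INF"), float("NaN")
print(neg_inf < -1e308 < 1e308 < inf)
print(nan == nan)

# Python compares the two zeros as equal; only the sign bit tells them apart.
print(0.0 == -0.0, math.copysign(1.0, -0.0))
```

The same lexical representation "0.1" thus gives two different values depending on whether it is read as a "float" or as a "double". Note also that Python considers the two zeros equal, which differs from the ordering described above.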

The last member is "boolean", a primitive datatype that can take the values "true" and "false"  (or "1" and "0" considered as equivalent).

!Date and time formats

This is probably the most controversial piece of W3C XML Schema datatypes. In order to meet the requirements of the "dates on the web", the W3C XML Schema Working Group has attempted to define a value space for a subset of the ISO 8601 date formats which is a syntactical specification of how dates should be exchanged on the web.

The result is overly complex and yet fails to satisfy the experts of date and time representations, doesn't support any other calendar system than Gregorian and has no support for localization.

One of the fuzziest aspects of these datatypes is that many of them (such as "dateTime", which we'll introduce in a moment) accept values both with and without timezones, introducing for the same datatype two classes of values which can be compared only partially.

Let's take a closer look at this important distinction before we present the detail of these datatypes... Two "dateTime" with a timezone can be compared without any hesitation. W3C XML Schema states that a "dateTime" without a timezone has an undetermined timezone but that you can still compare two such "dateTime". Things get fuzzy when you want to compare a "dateTime" with a timezone and a "dateTime" without: all you know about the "dateTime" with an undetermined timezone is that it can be anywhere in an interval from 14 hours before UTC to 14 hours after UTC. You can never conclude that the two "dateTime" are equal, and you can only say that one is before the other when they are different enough.

Why 14 hours? No, that's not a typo! National regulations have some level of flexibility with the timezones used in their countries, which can vary from their geographical timezone. This variation often even changes with the date in the year, many countries having winter and summer times. As a result, the worst case when the W3C published the W3C XML Schema recommendation was not between -12 and +12 hours from UTC but between -13 and +12 hours. And since the W3C doesn't expect national authorities to ask for permission if they want to enlarge this interval, they have taken a safety margin and written this -14/+14 hours interval in their recommendation.

Since fuzziness isn't what computers like best, it's probably a very good practice to use exclusively "dateTime" with timezones!
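
The partial order can be sketched with Python's datetime module, where a "dateTime" without a timezone is a "naive" datetime. The function below is my own reading of the rule described above (a value without a timezone may lie anywhere between 14 hours before and 14 hours after the same clock reading in UTC) and is not taken from an actual implementation; the sample values are arbitrary.

```python
from datetime import datetime, timedelta, timezone

MAX_OFFSET = timedelta(hours=14)


def compare(p, q):
    # Returns "<", "=", ">" or None when the order cannot be determined.
    p_naive, q_naive = p.tzinfo is None, q.tzinfo is None
    if p_naive == q_naive:
        # Both with or both without a timezone: total order.
        if p == q:
            return "="
        return "<" if p < q else ">"
    if p_naive:
        # Swap the operands and mirror the answer.
        return {"<": ">", ">": "<", None: None}[compare(q, p)]
    # p has a timezone, q has none: q is somewhere in a 28 hour window.
    q_utc = q.replace(tzinfo=timezone.utc)
    if p < q_utc - MAX_OFFSET:
        return "<"
    if p > q_utc + MAX_OFFSET:
        return ">"
    return None


paris = datetime(2001, 10, 26, 21, 32, 52, tzinfo=timezone(timedelta(hours=2)))
utc = datetime(2001, 10, 26, 19, 32, 52, tzinfo=timezone.utc)
unknown = datetime(2001, 10, 26, 21, 32, 52)
two_days_later = datetime(2001, 10, 28, 21, 32, 52)

print(compare(paris, utc), compare(unknown, utc), compare(two_days_later, utc))
```

Two values with timezones are equal when they identify the same instant, whatever their offsets, while a value without a timezone is only ordered against them when it is more than 14 hours away.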

All that being said, the date, time and related datatypes defined by W3C XML Schema are:

* "dateTime": This datatype is defined as representing a "specific instant of time." This is a subset of what ISO 8601 calls a "moment of time." Its lexical value follows the format "CCYY-MM-DDThh:mm:ss," in which all the fields must be present; it may optionally be preceded by a sign and leading figures, if needed, and followed by fractional digits for the seconds and by a time zone. The time zone may be specified using the letter "Z," which identifies UTC, or by the difference of time with UTC. As we've seen, a value such as "2001-10-26T21:32:52", which is defined without a timezone, can't be compared to "2001-10-26T21:32:52+02:00" or "2001-10-26T19:32:52Z", which have a timezone; these last two values are considered equal since they identify the same moment.
* "date": This datatype has the same lexical space as the date part of "dateTime", with an optional timezone, and represents a period of one day in its time zone, "independent of how many hours this day has." The consequence of this definition is that two dates defined in different time zones cannot be equal except if they designate the same interval (2001-10-26+12:00 and 2001-10-25-12:00, for instance). Another consequence is that, like with "dateTime", the order relation between a date with a time zone and a date without a time zone is partial.
* "gYearMonth": ("g" for Gregorian) is a Gregorian calendar month, i.e. a period of one calendar month in its timezone, and its format is the format of "date" without the day part: "2001-10", "2001-10+02:00" or "2001-10Z" for instance.
* "gYear": is a Gregorian calendar year, i.e. a period of one calendar year in its timezone, and its format is the format of "gYearMonth" without its month part: "2001", "2001+02:00" or "2001Z" for instance (note that these three values identify three different periods and are not considered equal).
* "time": The lexical space of "time" is identical to the time part of "dateTime". The semantic of "time" represents a point in time that recurs every day; the meaning of "01:20:15" is "the point in time recurring each day at 01:20:15 am." Like "date" and "dateTime", "time" accepts an optional time zone definition. The same issue arises when comparing times with and without time zones such as "21:32:52", "21:32:52+02:00" and "19:32:52Z".
* "gDay": The lexical space of "gDay" is "---DD" with an optional time zone specification and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar month. "---01" represents for instance the first day of each month with an undetermined timezone. Dates are pinned down depending on the number of days of each month: in February for instance, "---31Z" would occur on February 28th (or 29th for leap years).
* "gMonthDay" : The lexical space of "gMonthDay" is "--MM-DD" with an optional time zone specification and it represents a recurring period of one day in the specified time zone occurring each Gregorian calendar year. Christmas day in the UK would, for instance, be "--12-25Z".
* "gMonth": The lexical space of "gMonth" should have been "--MM" with an optional timezone, but a typo in the W3C XML Schema recommendation has specified it as "--MM--", which you can still find in some tools even though an erratum has fixed it back to "--MM". It represents a recurring period of a calendar month in its timezone. The months of January in Paris would for instance be represented as "--01+01:00".
* "duration": The lexical space of "duration" is "PnYnMnDTnHnMnS", each part (except the leading "P") being optional. A significant amount of complexity comes from the fact that you can mix quantities expressed as months (which have a variable number of days) with quantities expressed as days, such as for instance "P1Y2M8DT123S", which means a duration of 1 year, 2 months, 8 days and 123 seconds. We will not enter into the detail of the algorithms here, but this leads to a partial order relation between durations, which does not facilitate the facets and processing of these datatypes when they use all the parts together.
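
Without going into the full algorithm, the following Python sketch gives an idea of why the order between durations is only partial. It adds each duration to four reference dates (those which, to the best of my knowledge, are used by the W3C XML Schema recommendation for this purpose) and only answers when the four comparisons agree. The representation of a duration as a (months, days, seconds) tuple and the function names are assumptions made for the example.

```python
import calendar
from datetime import datetime, timedelta

# Reference starting points chosen so that months of different lengths are hit.
REFERENCE_STARTS = [
    datetime(1696, 9, 1),
    datetime(1697, 2, 1),
    datetime(1903, 3, 1),
    datetime(1903, 7, 1),
]


def add_duration(start, months, days=0, seconds=0):
    # Add the months first (calendar arithmetic), then the fixed-length part.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    shifted = start.replace(year=year, month=month, day=day)
    return shifted + timedelta(days=days, seconds=seconds)


def compare_durations(a, b):
    # Returns "<", "=", ">" or None when the four comparisons disagree.
    results = set()
    for start in REFERENCE_STARTS:
        end_a, end_b = add_duration(start, *a), add_duration(start, *b)
        results.add("=" if end_a == end_b else "<" if end_a < end_b else ">")
    return results.pop() if len(results) == 1 else None


one_month = (1, 0, 0)
print(compare_durations(one_month, (0, 27, 0)))
print(compare_durations(one_month, (0, 30, 0)))
print(compare_durations(one_month, (0, 32, 0)))
```

One month is always longer than 27 days and always shorter than 32 days, but it cannot be ordered against 30 days since the answer depends on the month which is considered.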

!Examples

After that long and dense enumeration of types, let's see how we could add W3C XML Schema datatypes in our first schema... The most natural choices seem to be:

* id attributes: since the semantic of the "ID" datatype isn't captured when it is used with Relax NG, using it in our schema would be misleading and we will use "NMTOKEN" for the id attributes instead.
* xml:lang: the natural candidate for xml:lang is "language".
* available: we can use a "boolean" for this attribute.
* born and died: "date" seems the right choice since we have been lucky enough to have ISO 8601 dates in our instance documents.
* other text content elements: we have no reason here to preserve whitespaces in these elements and will use "token" datatypes for all of them.

Our first schema could thus be rewritten (note the declaration of the datatype library) as:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
   <element name="book">
    <attribute name="id">
     <data type="NMTOKEN"/>
    </attribute>
    <attribute name="available">
     <data type="boolean"/>
    </attribute>
    <element name="isbn">
     <data type="NMTOKEN"/>
    </element>
    <element name="title">
     <attribute name="xml:lang">
      <data type="language"/>
     </attribute>
     <data type="token"/>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id">
       <data type="NMTOKEN"/>
      </attribute>
      <element name="name">
       <data type="token"/>
      </element>
      <element name="born">
       <data type="date"/>
      </element>
      <optional>
       <element name="died">
        <data type="date"/>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id">
       <data type="NMTOKEN"/>
      </attribute>
      <element name="name">
       <data type="token"/>
      </element>
      <element name="born">
       <data type="date"/>
      </element>
      <element name="qualification">
       <data type="token"/>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or:

 element library {
  element book {
   attribute id {xsd:NMTOKEN},
   attribute available {xsd:boolean},
   element isbn {xsd:NMTOKEN},
   element title {attribute xml:lang {xsd:language}, xsd:token},
   element author {
    attribute id {xsd:NMTOKEN},
    element name {xsd:token},
    element born {xsd:date},
    element died {xsd:date}?}*,
   element character {
    attribute id {xsd:NMTOKEN},
    element name {xsd:token},
    element born {xsd:date},
    element qualification {xsd:token}}*
  } +
 }

Note that the W3C XML Schema datatype library has the special privilege of having its prefix built into the compact syntax: I have used the "xsd" prefix without needing to declare any datatype library! We will see later on that this isn't the case for the DTD compatibility type library.

We noticed in the previous chapter that datatype declarations are local to a "data" pattern and are not inherited by its child patterns. Let's illustrate this now that we have a richer set of datatypes at hand.

In the schema which we've just written, we have defined the "available" attribute as a "boolean" but in our instance documents, we have only used one of the two syntaxes for a boolean ("true" or "false") and not used the other equivalent one (0 or 1). We may want to exclude this second syntax for boolean (for instance if our applications haven't been designed to support it). In this case, we can just exclude these two values:

    <attribute name="available">
     <data type="boolean">
      <except>
       <value>0</value>
       <value>1</value>
      </except>
     </data>
    </attribute>

or:

   attribute available {xsd:boolean - ("0"|"1")}

This seems rather natural, but why is it working? When you think about it, it's working because Relax NG forgets that the type of the attribute is "boolean" as soon as we've left the "data" pattern and uses the default type (the Relax NG built-in "token" type) to test that the value is neither "0" nor "1". If Relax NG did not forget the type of the attribute, the schema would have removed the entire lexical space of "boolean" and would have been impossible to meet, since "0" and "false" are equivalent (and so are "1" and "true").

We have seen a situation where we rely on the fact that the types used in the "data" and "value" patterns are different. There are also situations where we would like them to be the same, and then we need to repeat the type attribute. If our applications are designed to accept both formats for the available attribute and if we need to test that the books are available, we would prefer to use the same type for both patterns, and in this case we can write:

    <attribute name="available">
     <data type="boolean">
      <except>
       <value type="boolean">false</value>
      </except>
     </data>
    </attribute>

or

   attribute available {xsd:boolean - (xsd:boolean "false")},

We now rely on the datatype "boolean" to exclude both "0" and "false" which are equivalent. Of course, in the case of booleans, the number of possible values is limited and we could have simplified our schema to:

    <attribute name="available">
     <value type="boolean">true</value>
    </attribute>

or

   attribute available {xsd:boolean "true"}

but this wouldn't have made the point I wanted to make which is also valid for other datatypes!
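
The effect of the type used for the comparison can be simulated in a few lines of Python. This is only a model of the behavior described above, with names of my own: the "except" clause is represented by a list of excluded values and by the function used to compare them.

```python
# Lexical space of "boolean" and the value each form maps to.
BOOLEAN_VALUE_SPACE = {"true": True, "1": True, "false": False, "0": False}


def as_token(text):
    # Comparison as the built-in "token" type: normalized strings.
    return " ".join(text.split())


def as_boolean(text):
    # Comparison as "boolean": the value, not the spelling, is compared.
    return BOOLEAN_VALUE_SPACE[as_token(text)]


def accepted(excluded, compare):
    # Lexical forms of "boolean" that survive an "except" whose values
    # are compared using the given datatype.
    return {
        lexical
        for lexical in BOOLEAN_VALUE_SPACE
        if all(compare(lexical) != compare(value) for value in excluded)
    }


print(accepted(["0", "1"], as_token))    # values compared as tokens
print(accepted(["0", "1"], as_boolean))  # what would happen if the type were inherited
print(accepted(["false"], as_boolean))   # type repeated on the "value" pattern
```

Excluding "0" and "1" as tokens leaves "true" and "false", excluding them as booleans would leave nothing, and excluding "false" as a boolean leaves "true" and "1".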

!!The facets

The restrictions that a user can apply to the predefined W3C XML Schema datatypes, known as "facets" in the W3C XML Schema recommendation, can be applied in a Relax NG schema through a pattern named "param", directly included within "data" patterns before the optional "except" pattern which we already know. These "param" patterns have a name attribute, which is the name of the facet, and their text content is the value of the facet. When several "param" patterns are included, all the constraints must be matched (in other words, the result is a logical "and" of all the conditions) and the same facet can't be repeated, except for the facet named "pattern".

Yes, I know, this is confusing, but the vocabularies used by Relax NG and W3C XML Schema are different. What Relax NG calls "param" is called "facet" by W3C XML Schema, and what Relax NG calls a "pattern" should not be confused with the W3C XML Schema facet named "pattern"... Also note that, as we have seen previously, what Relax NG calls whitespace normalization is not the same as the whitespace processing applied to the W3C XML Schema "normalizedString" datatype.
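
As a minimal sketch of this syntax (the "code" element and its constraints are hypothetical), here is a "data" pattern carrying two different facets plus a repeated "pattern" facet. A value such as "A9" must satisfy all four conditions to be accepted:

 <element name="code" xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <data type="token">
   <param name="minLength">2</param>
   <param name="maxLength">8</param>
   <param name="pattern">\p{Lu}.*</param>
   <param name="pattern">.*\d</param>
  </data>
 </element>

or

 element code {
  xsd:token {
   minLength = "2"
   maxLength = "8"
   pattern = "\p{Lu}.*"
   pattern = ".*\d"
  }
 }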

The different facets defined by W3C XML Schema are:

* "whiteSpace": this somewhat controversial facet cannot be used in Relax NG.
* "enumeration": this facet cannot be used in Relax NG, since it is equivalent to Relax NG's own enumerations, which should be used instead.
* "pattern": this is the only facet working in the lexical space; all the other facets work in the value space only. This facet checks that the data matches a regular expression. It is covered in the next chapter, "Chapter 9: W3C XML Schema Regular Expressions". For the moment, let's just say that its syntax is close to that of Perl regular expressions (implicitly anchored to the beginning and the end of the value to match), that it does not support the POSIX-style character classes defined in Perl, and that it includes a few XML goodies, supports all the Unicode classes and blocks, and defines a special construct to express "differences" between character classes.
* "length": this facet is available only for string, binary and list datatypes. For string (and string-like) types, it defines the number of Unicode characters; for binary datatypes (i.e. "hexBinary" and "base64Binary"), it defines a number of bytes; and for list datatypes ("ENTITIES", "IDREFS" and "NMTOKENS"), it defines the number of tokens in the list.
* "maxLength": same meaning and restrictions as "length", but defines a maximum length.
* "minLength": same meaning and restrictions as "length", but defines a minimum length.
* "maxExclusive": applies only to decimal, integer (and derived), float and double, and to all the date, time and duration datatypes, and defines a maximum value that cannot be reached. Note that, for date, time and duration datatypes, the order relation between two values is only partial and the result of a comparison cannot always be determined.
* "minExclusive": same restrictions as "maxExclusive"; defines a minimum value that cannot be reached.
* "maxInclusive": same restrictions as "maxExclusive"; defines a maximum value that can be reached.
* "minInclusive": same restrictions as "maxExclusive"; defines a minimum value that can be reached.
* "totalDigits": applies to decimal, integer and derived types to define the maximum number of digits (before and after the decimal point). Like all the facets except "pattern", this facet works on the value space, and "000001.10000000", for instance, would be considered as having only 2 digits.
* "fractionDigits": applies to decimal to define the maximum number of fractional digits (i.e. after the decimal point). Like all the facets except "pattern", this facet works on the value space, and "000001.10000000", for instance, would be considered as having only 1 fractional digit.
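
To make the value-space behavior of the numeric facets more concrete, here is a minimal sketch (the "price" element is hypothetical). Following the rule stated above, leading and trailing zeros do not count against "totalDigits" or "fractionDigits":

 # "1234.56" is accepted: 6 digits in total, 2 of them after the decimal point
 # "000001.10000000" is accepted: in the value space it is just 1.1
 # "1.125" is rejected: 3 fractional digits
 # "0" is rejected: the minimum is exclusive
 element price {
  xsd:decimal {
   minExclusive = "0"
   totalDigits = "6"
   fractionDigits = "2"
  }
 }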

Again, after this enumeration of facets, let's see how we could use some of our new knowledge to improve the schema of our library:

* xml:lang: we might want to ignore the regional differences and accept only two-character codes using the "length" facet.
* isbn: there would be much more to check in an ISBN number, but we may want to use a "pattern" to check that it is composed of 9 digits terminated by a character which is either a digit or the character "x".
* born and died: assuming that our library is only interested in recent books, we could check that these dates belong to the twentieth or twenty-first centuries (in other words, that they lie between 1900 and 2099). We might also want to check that our dates do not specify a timezone, since we have seen that comparing dates with and without timezones is fuzzy and since the instance documents which we have seen up to now have no timezones.
* and the maximum length of the other text data could be constrained using a "maxLength" facet.

The corresponding schema would be:

 <element xmlns="http://relaxng.org/ns/structure/1.0"
  name="library" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
   <element name="book">
    <attribute name="id">
     <data type="NMTOKEN">
       <param name="maxLength">16</param>
     </data>
    </attribute>
    <attribute name="available">
     <data type="boolean"/>
    </attribute>
    <element name="isbn">
     <data type="NMTOKEN">
       <param name="pattern">[[0-9]{9}[[0-9x]</param>
     </data>
    </element>
    <element name="title">
     <attribute name="xml:lang">
      <data type="language">
       <param name="length">2</param>
      </data>
     </attribute>
     <data type="token">
       <param name="maxLength">255</param>
     </data>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id">
       <data type="NMTOKEN">
        <param name="maxLength">16</param>
       </data>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
       </data>
      </element>
      <optional>
       <element name="died">
        <data type="date">
         <param name="minInclusive">1900-01-01</param>
         <param name="maxInclusive">2099-12-31</param>
         <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
        </data>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id">
       <data type="NMTOKEN">
        <param name="maxLength">16</param>
       </data>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
       </data>
      </element>
      <element name="qualification">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>

or:

 element library {
  element book {
   attribute id {xsd:NMTOKEN {maxLength = "16"}},
   attribute available {xsd:boolean},
   element isbn {xsd:NMTOKEN {pattern = "[[0-9]{9}[[0-9x]"}},
   element title {
     attribute xml:lang {xsd:language {length="2"}},
     xsd:token {maxLength="255"}
   },
   element author {
    attribute id {xsd:NMTOKEN {maxLength = "16"}},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }},
    element died {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }}?}*,
   element character {
    attribute id {xsd:NMTOKEN {maxLength = "16"}},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }},
    element qualification {xsd:token {maxLength = "255"}}}*
  } +
 }

Note the usage of the regular expressions in the "pattern" facets. The set of facets of W3C XML Schema isn't extremely rich and the "pattern" facet acts as a Swiss army knife helping you to do all the tricky tasks that other facets can't do!

Also note that facets are restrictions added on top of those already imposed by the lexical space of the datatype: they can only narrow it, and you cannot extend the lexical space of a datatype.
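
For instance (with a hypothetical "copies" element, sketched only to illustrate the point), adding an alternative inside a "pattern" facet does not make "unknown" a valid integer. To accept both, the alternative has to be expressed as a Relax NG choice outside the datatype:

 # "unknown" is still rejected: it is not in the lexical space of xsd:integer
 element copies {xsd:integer {pattern = "\d+|unknown"}}

 # Accepts either an unsigned integer or the literal "unknown"
 element copies {xsd:integer {pattern = "\d+"} | "unknown"}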

!!!DTD Compatibility

DTD Compatibility is both a library which checks the lexical spaces of its "ID", "IDREF" and "IDREFS" datatypes and a feature, i.e. a restriction added to the normal Relax NG processing, which enforces DTD-like rules on the schema and on the instance document. This package is designed to facilitate the transition from DTDs to Relax NG by emulating the attribute types "ID", "IDREF" and "IDREFS". The DTD compatibility feature checks that "ID" values are unique within a document and that "IDREF" and "IDREFS" values are references, or whitespace-separated lists of references, to "ID" values actually defined in the document. It also checks rules on the schema itself, such as the fact that these datatypes are used only in attributes. Unlike their W3C XML Schema counterparts, these datatypes have no facets.

That's pretty much all we have to know about this library and we can use it straight away to define the "id" attributes in our library:

 <element xmlns="http://relaxng.org/ns/structure/1.0" name="library" datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <oneOrMore>
   <element name="book">
    <attribute name="id">
     <data datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0" type="ID"/>
    </attribute>
    <attribute name="available">
     <data type="boolean"/>
    </attribute>
    <element name="isbn">
     <data type="NMTOKEN">
       <param name="pattern">[[0-9]{9}[[0-9x]</param>
     </data>
    </element>
    <element name="title">
     <attribute name="xml:lang">
      <data type="language">
       <param name="length">2</param>
      </data>
     </attribute>
     <data type="token">
       <param name="maxLength">255</param>
     </data>
    </element>
    <zeroOrMore>
     <element name="author">
      <attribute name="id">
       <data datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0" type="ID"/>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
       </data>
      </element>
      <optional>
       <element name="died">
        <data type="date">
         <param name="minInclusive">1900-01-01</param>
         <param name="maxInclusive">2099-12-31</param>
         <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
        </data>
       </element>
      </optional>
     </element>
    </zeroOrMore>
    <zeroOrMore>
     <element name="character">
      <attribute name="id">
       <data datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0" type="ID"/>
      </attribute>
      <element name="name">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
      <element name="born">
       <data type="date">
        <param name="minInclusive">1900-01-01</param>
        <param name="maxInclusive">2099-12-31</param>
        <param name="pattern">[[0-9]{4}-[[0-9]{2}-[[0-9]{2}</param>
       </data>
      </element>
      <element name="qualification">
       <data type="token">
        <param name="maxLength">255</param>
       </data>
      </element>
     </element>
    </zeroOrMore>
   </element>
  </oneOrMore>
 </element>


or:

 datatypes dtd="http://relaxng.org/ns/compatibility/datatypes/1.0"
 element library {
  element book {
   attribute id {dtd:ID},
   attribute available {xsd:boolean},
   element isbn {xsd:NMTOKEN {pattern = "[[0-9]{9}[[0-9x]"}},
   element title {
     attribute xml:lang {xsd:language {length="2"}},
     xsd:token {maxLength="255"}
   },
   element author {
    attribute id {dtd:ID},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }},
    element died {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }}?}*,
   element character {
    attribute id {dtd:ID},
    element name {xsd:token {maxLength = "255"}},
    element born {xsd:date {
      minInclusive = "1900-01-01"
      maxInclusive = "2099-12-31"
      pattern = "[[0-9]{4}-[[0-9]{2}-[[0-9]{2}"
    }},
    element qualification {xsd:token {maxLength = "255"}}}*
  } +
 }
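
The same library also provides the reference types. As a sketch only (the "author-ref" element and its "ref" attribute are hypothetical and do not appear in our library), a book could point to an author defined elsewhere in the document:

 datatypes dtd="http://relaxng.org/ns/compatibility/datatypes/1.0"
 element author-ref {
  attribute ref {dtd:IDREF}
 }

With the DTD compatibility feature active, a "ref" value is accepted only if some "ID" attribute in the same document holds that value.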

As already mentioned, the DTD compatibility feature has been designed to provide compatibility with the features of the DTD, and that includes emulating some of their restrictions. We have already mentioned the fact that these datatypes can only be used in attributes, not in elements, and we need to mention another limitation which can be more insidious and has bitten renowned experts trying to do things such as writing Relax NG schemas for XHTML.

This rule might be called the "consistent attribute definition rule": since a DTD won't allow you to give two different definitions of the content of an element, Relax NG enforces the fact that if an attribute "id" is defined as "ID", "IDREF" or "IDREFS" in an element "bar" somewhere in a Relax NG schema, all the definitions of the same attribute under the same element must use the same type.

The simplest schemas which break this rule, and are thus incorrect with respect to the DTD compatibility feature, are those containing multiple declarations of the same element and attribute with different types, such as:

 <?xml version="1.0" encoding="UTF-8"?>
 <element name="foo" xmlns="http://relaxng.org/ns/structure/1.0"
   datatypeLibrary="http://relaxng.org/ns/compatibility/datatypes/1.0">
   <element name="bar">
     <attribute name="id">
       <data type="ID"/>
     </attribute>
   </element>
   <zeroOrMore>
     <element name="bar">
       <attribute name="id">
         <data type="token" datatypeLibrary=""/>
       </attribute>
     </element>
   </zeroOrMore>
 </element>

or:

 datatypes dtd="http://relaxng.org/ns/compatibility/datatypes/1.0"

 element foo {
    element bar {
      attribute id { dtd:ID }
    },
    element bar {
      attribute id { token }
    } *
  }

Here, we have two definitions of "bar" with "id" attributes having competing types and, since one of these types is a dtd:ID type, this is forbidden.
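
One way out, sketched here under the assumption that every "bar" identifier can legitimately be an "ID", is to give both definitions the same type, which then collapses into a single definition:

 datatypes dtd="http://relaxng.org/ns/compatibility/datatypes/1.0"

 element foo {
    element bar {
      attribute id { dtd:ID }
    } +
  }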

A situation that is tougher to detect and tougher to fix arises when one of these competing definitions involves name classes that allow the inclusion of any element, as we will see in "Chapter 12: Writing Extensible Schemas". The restriction applies even in this case, and the situation can become really nasty.
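
A minimal sketch of this situation, assuming a typical wildcard pattern (the name "anything" is illustrative): the wildcard also matches a "bar" element whose "id" attribute is plain text, which competes with the "dtd:ID" definition:

 datatypes dtd="http://relaxng.org/ns/compatibility/datatypes/1.0"

 start = element foo {
    element bar { attribute id { dtd:ID } },
    anything*
  }

 # Matches any element, including "bar", with any attributes, including "id"
 anything = element * {
    attribute * { text }*,
    (text | anything)*
  }

One possible fix, at the price of forbidding "bar" elements wherever the wildcard is used, is to exclude that name from the wildcard by writing "element * - bar".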


!!!Which library should we use?

All the Relax NG implementations must support the native datatype library and many of them also support the DTD compatibility datatypes library and the W3C XML Schema datatypes library. That means that if we want to define a "token" or "string" datatype we will often have the choice between the native library and W3C XML Schema datatypes and if we are defining "ID", "IDREF" or "IDREFS" we will often have the choice between the DTD compatibility library and W3C XML Schema datatypes.

That makes a lot of choices, and in this section we'll try to give some general rules to help you choose.

!!Native types versus W3C XML Schema datatypes

The criterion for choosing between native and W3C XML Schema datatypes to define "string" and "token" types is simple: if you need facets, use W3C XML Schema datatypes. If not, use native datatypes: your schema will be more portable, since Relax NG processors are not obliged to support the W3C XML Schema type library.
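
In the compact syntax, the two options look like this for the "name" element of our library:

 # Native type: supported by every Relax NG processor, but no facets
 element name {token}

 # W3C XML Schema type: needed as soon as a facet such as "maxLength" is used
 element name {xsd:token {maxLength = "255"}}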

!!DTD versus W3C XML Schema datatypes

When you need to define a datatype covered by both DTD and W3C XML Schema, i.e. "ID", "IDREF" or "IDREFS", the same rule of thumb can be followed: if you are using the DTD compatibility library, your schema should be slightly more portable, but you will lose the facets.

The other factor to take into account is that the rules applied if you are using the DTD compatibility feature are strict and consistent over different implementations while if you are using the W3C XML Schema type library, a processor should apply these same rules if and only if it also supports the DTD datatype library: processors which only support W3C XML Schema datatypes are only supposed to check the lexical space of these datatypes.

In practice, that means that you can use the "ID", "IDREF" or "IDREFS" datatypes from the W3C XML Schema library, but then it is safer to debug your schema using an implementation supporting both the DTD and the W3C XML Schema type libraries.

If you design a Relax NG schema using W3C XML Schema's "ID", "IDREF" and "IDREFS" and test it with an implementation which supports only W3C XML Schema datatypes, you will get lax control over both the instance documents and the schema: the rules of the DTD compatibility feature will not be enforced. When you later use the same schema and instance documents with a Relax NG processor supporting both the DTD and W3C XML Schema datatypes, you will get stricter control; instance documents, and even the schema, which were previously valid may suddenly become invalid or incorrect because of this stricter control.

A simple example of a schema which is correct for Relax NG implementations supporting W3C XML Schema datatypes without the DTD compatibility layer, but which does not satisfy the DTD compatibility feature for implementations supporting both, is a schema defining ID elements:

 <?xml version="1.0" encoding="UTF-8"?>
 <element name="foo" xmlns="http://relaxng.org/ns/structure/1.0"
  datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
   <zeroOrMore>
     <element name="bar">
       <element name="id">
         <data type="ID"/>
       </element>
     </element>
   </zeroOrMore>
 </element>

or:

 element foo {
   element bar {
     element id { xsd:ID }
   } *
 }

Other examples include schemas which do not respect the rule that the definitions of attributes holding these datatypes must be consistent throughout the schema.

The reason for this behavior is that, although I have often spoken of the "DTD compatibility datatype library" for clarity throughout this chapter, DTD compatibility is more than a datatype library. Per the Relax NG formal specification, a datatype library must be decoupled from the validation of the structure of the document, and the context passed to the datatype library is restricted to the namespace declarations available under the node being validated. This context is itself an exception, required to process qualified names. The datatype library thus does not have enough information to perform the tests required to support DTD compatibility: it doesn't even know whether the data to validate has been found in an element or in an attribute. This part of DTD compatibility is thus a feature and not a datatype library as defined by Relax NG.

When we use a datatype from the datatype library "http://relaxng.org/ns/compatibility/datatypes/1.0", we are thus doing two different things:
* using a datatype library which restricts the lexical space of our "data" and "value" patterns
* triggering a feature which validates that "ID" values are unique, and that "IDREF" and "IDREFS" values refer to ids and lists of ids.

Applied to the W3C XML Schema datatype library, this translates as: trigger the ID DTD compatibility feature when available if these datatypes are used.
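
A sketch of these two levels of checking with the W3C XML Schema type (the element names follow the earlier "foo" and "bar" examples):

 element foo {
   element bar {
     attribute id { xsd:ID }
   } *
 }

 # Datatype library level (always checked): id="b1" is accepted, while id="1b"
 # is rejected because an ID cannot start with a digit.
 # Feature level (checked only if the processor also implements DTD compatibility):
 # two "bar" elements both carrying id="b1" are rejected.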