Exploring a foreign country or language goes better with some preliminary explanation of its particularities: this saves a lot of time and trouble and avoids many misunderstandings. Relax NG is no exception to this rule, so we had better start by highlighting how profoundly it differs from other XML schema languages.

!! XML Infoset

One of the few things that all XML schema languages have in common is that they define constraints on a logical view of XML documents (called the XML Infoset) rather than on the document as you would read it in a text file. This is what distinguishes them from other techniques, such as regular expressions, which you might use on XML documents treated as plain text.

This may come as a surprise if you're not familiar with the concept, but in XML, what you see is not what you get: when you write "<book id='b0836217462' available='true'/>", XML applications do not see the string "<book id='b0836217462' available='true'/>". Most of them do not even see a "tag" named "book"; they all see an "element" named "book" with its two attributes "id" and "available". The vast majority of them do not care how you wrote this element in a text document, or even whether you ever wrote it in a text document at all: what they really care about is the element "book" and its two attributes.

The set of information considered significant in an XML document (such as our element "book" and its two attributes) is called the XML Infoset, and it has been published as a W3C Recommendation. The Infoset defines an abstract model of XML documents that has a hierarchical structure and is described in terms generic and neutral enough to be acceptable to specifications with different backgrounds and goals, such as XPath or the DOM.

The different schema languages work at the level of the XML Infoset and their main goal is to let us define constraints on a subset of the XML Infoset. Because they work at that level, they can't be used to express constraints on things which do not belong to the XML Infoset, such as the order of the attributes or the number of spaces between them. In addition, most of them won't let you define constraints on XML comments or Processing Instructions nor on the use of entities.
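To make this concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree parser. It is only an illustration: the helper name infoset_view and the second serialization of the "book" element are assumptions made for this example, and ElementTree exposes just a simplified approximation of the Infoset.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same "book" element. They differ in quote
# characters, attribute order, spacing and empty-element syntax.
text_a = "<book id='b0836217462' available='true'/>"
text_b = '<book   available="true"   id="b0836217462"></book>'

def infoset_view(xml_text):
    """Hypothetical helper: return what an application typically sees."""
    element = ET.fromstring(xml_text)
    # Attributes are sorted because their order is not significant.
    return element.tag, sorted(element.attrib.items())

view_a = infoset_view(text_a)
view_b = infoset_view(text_b)

print(text_a == text_b)  # compares the two text documents
print(view_a == view_b)  # compares the two logical views
```

The two strings differ, yet both yield the same element name and the same two attributes, which is the only level at which a schema language can express constraints.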

!!Different types of schema languages

That being said, the different XML schema languages have chosen different ways of defining those constraints:

*The constraints may be expressed as rules, as is the case with Schematron: you give sets of rules such as "the element named book must have an attribute named id, and this attribute must match this and that rule, ...".
*They may be expressed as a thorough description of each element and attribute, as with DTDs and W3C XML Schema, which say: "this is an element named book, and it has two attributes named id and available which look like this and that".
*They may be expressed as "patterns", similar in principle to regular expressions but adapted to match XML infosets rather than text documents. We will cover this third way of defining constraints in detail throughout this book, since it is the one chosen by Relax NG.

The first XML schema language ever used was the DTD. Of course, DTDs cover more than schema features and include the definition of internal and external entities, but their schema features focus on describing elements: each element must be described, and for each element the description states whether text nodes are allowed, which child elements are allowed and which attributes it has. Pieces of content models may be defined using a special kind of entity (the parameter entity), which works as a kind of macro processing.

W3C XML Schema has extended this principle and defines several kinds of "components" that let you manipulate not only elements but also attributes, datatypes (containers describing the content of elements or attributes) and even groups of elements and groups of attributes. The approach is still very focused on elements and attributes, which are clearly differentiated.

Relax NG, on the contrary, is based on the generic concept of a "pattern", which is more or less symmetrical to the XPath concept of a "node set": as a first approximation, a pattern could be defined as the description of a set of valid node sets.

The difference may be difficult to perceive, but when we define an element with a DTD or W3C XML Schema, we try to describe the element itself, whereas when we define the same element with Relax NG, we define a pattern that is checked against elements, like a regular expression, to see whether they match. The difference is subtle, but the latter option gives us much more flexibility to write, maintain and combine schemas.

!!A simple example:

Let's take a first look at the example that we will be using throughout this book: this "book" element with its two attributes and four different sub-elements:

[http://localhost/rngbook/full-pattern.png]

With a DTD, and to a lesser extent with W3C XML Schema, we are pretty much stuck with defining lists of attributes and elements and cannot mix and combine them. W3C XML Schema has introduced the concept of a "type", an abstract object that has no counterpart in XML documents and that describes the content of an element or an attribute, but even so, types can't be freely combined. This means that we can split the description of this element into blocks such as:

[http://localhost/rngbook/xsd-full-pattern.png]

Relax NG patterns, on the contrary, can freely mix different types of nodes (elements, text and attributes). If we need to, Relax NG is flexible enough to let us split the definition of the book element into a first pattern composed of the "id" attribute, the "isbn", "title" and "author" elements and the first "character" element, and a second one composed of the "available" attribute and the other "character" elements:

[http://localhost/rngbook/rng-full-pattern.png]

This flexibility is not only useful for combining complex patterns but also a source of simplicity for the designers of Relax NG schemas who do not need to learn a long list of limitations which must be checked when they write and combine their schemas.

This generic concept of patterns is powerful enough to replace the specialized containers of DTDs and W3C XML Schema. Relax NG has no need for (and no notion of) reusable element, attribute or type definitions: to reuse an element or attribute, you embed it in a pattern where it stands alone; to reuse a type, you create a pattern that contains the content definition. These patterns are the reusable building blocks of Relax NG. They can be named, reused and even redefined at will, and combined through operators that group them or provide alternatives between them.

The benefit of non-specialized patterns is increased flexibility. This is well known in the construction industry, where reusing a small number of generic parts provides more flexibility and a higher number of possible combinations than using more specific pieces, and it is true for XML schema languages too.

!!A strong mathematical background

This notion of patterns is both new and ancient. It is new in the way it has been applied to XML, first in Relax and now in Relax NG, and ancient because it adapts to XML techniques and theories developed for regular expressions in the 1960s; the name "Relax" stands for REgular LAnguage for XML. It relies on a strong mathematical theory and on work done by Murata Makoto to adapt the mathematical concept of "hedges" to XML.

When Murata Makoto kindly pointed me to his work in answer to my first questions, I was horrified to see that all the maths I had learned at school seemed to have left me and that I couldn't understand the first word of it. I can assure you that you won't need to understand the maths behind Relax NG to use it, and you shouldn't worry about this. On the contrary, it's very comforting to know that the schema language you are using has such a background, and it's a guarantee that its design is flawless.

The Relax NG specification is based on this mathematical background, and Relax NG patterns are defined as logical operations performed on sets of XML structures. This gives the specification a formalism that removes any possibility of ambiguity in its interpretation, which is most important for ensuring the interoperability of the different implementations of Relax NG.

!!And a strong experimental basis

This strong mathematical background doesn't mean that everything needs to be reinvented by Relax NG implementers. On the contrary, the so-called "derivative algorithm" used by James Clark in his processor, Jing, was inspired by work done in 1964 on the "derivation" of regular expressions. It simply removes from the patterns, recursively, the nodes found in the instance document: the document is valid if the patterns left after the last node are all optional.
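The idea of derivation can be sketched in a few lines of Python. This is a deliberately reduced illustration, not Jing's code: it only handles a flat sequence of child element names and ignores attributes, text and nested content. The tuple representation, the function names and the content model chosen for "book" (one "isbn", one "title", one "author", then any number of "character" elements) are all assumptions made for this example.

```python
# Patterns are plain tuples (a hypothetical representation for this sketch).
EMPTY = ("empty",)              # matches the empty sequence
NOT_ALLOWED = ("notAllowed",)   # matches nothing at all

def element(name): return ("element", name)
def group(p1, p2): return ("group", p1, p2)      # p1 followed by p2
def choice(p1, p2): return ("choice", p1, p2)    # p1 or p2
def one_or_more(p): return ("oneOrMore", p)
def zero_or_more(p): return choice(one_or_more(p), EMPTY)

def nullable(p):
    """True if the pattern matches the empty sequence (is 'optional')."""
    kind = p[0]
    if kind == "empty":
        return True
    if kind in ("notAllowed", "element"):
        return False
    if kind == "group":
        return nullable(p[1]) and nullable(p[2])
    if kind == "choice":
        return nullable(p[1]) or nullable(p[2])
    if kind == "oneOrMore":
        return nullable(p[1])
    raise ValueError(kind)

def derivative(p, name):
    """Return the pattern that remains after consuming one element name."""
    kind = p[0]
    if kind in ("empty", "notAllowed"):
        return NOT_ALLOWED
    if kind == "element":
        return EMPTY if p[1] == name else NOT_ALLOWED
    if kind == "choice":
        return choice(derivative(p[1], name), derivative(p[2], name))
    if kind == "group":
        first = group(derivative(p[1], name), p[2])
        if nullable(p[1]):
            # p1 may be skipped, so the name may also start p2.
            return choice(first, derivative(p[2], name))
        return first
    if kind == "oneOrMore":
        return group(derivative(p[1], name), choice(p, EMPTY))
    raise ValueError(kind)

def matches(pattern, names):
    """Consume every name, then check that what is left is optional."""
    for name in names:
        pattern = derivative(pattern, name)
    return nullable(pattern)

# Assumed content model for the children of "book".
book_content = group(element("isbn"),
               group(element("title"),
               group(element("author"),
                     zero_or_more(element("character")))))

print(matches(book_content, ["isbn", "title", "author", "character"]))
print(matches(book_content, ["isbn", "author"]))
```

The first call consumes every name and is left with an optional pattern, so the sequence is accepted; the second runs into a pattern that can no longer match anything. A real processor would also simplify the intermediate patterns as it goes, which this sketch leaves out.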

Murata Makoto, for his part, has adapted the old and well-known algorithm of finite state machines to cope with the level of non-determinism accepted by Relax NG, and has developed a Relax NG validator lightweight enough to be used in a mobile phone.

Besides the fact that it can be implemented with well-known and well-documented algorithms, developers of Relax NG processors also appreciate the simplicity of this underlying model. This should also guarantee strong interoperability between implementations, which is unfortunately not the case with more complex schema languages.

!!Patterns and only patterns

In the history of science, strong theories based on simple and basic particles have proven to have an almost infinite potential and can be used as a foundation for the most complex applications. There is no doubt that Relax NG is, in its domain, one of these applications: easy to explain, easy to implement, and generic and flexible enough to meet the most stringent requirements.

We will present the Relax NG patterns throughout this book but can't leave this chapter before we give a list of some of them.

The three basic patterns match the three types of XML nodes in the scope of Relax NG (which does not pay attention to XML processing instructions and comments):
*Text nodes, which can be specialized into "data" that can carry "datatypes" and be split into list items.
*Elements.
*Attributes.

These patterns can be combined into ordered or unordered groups and into choices defining alternatives between several patterns. Their cardinality, i.e. the number of times they can appear in instance documents, can also be controlled using cardinality patterns and, finally, a whole set of features is provided to build reusable libraries of patterns. Similar to patterns, name classes define sets of element and attribute names; they can be used to open up a schema and to control where elements and attributes with unknown names may be included in instance documents.

Some of these features have been defined to make Relax NG schemas easier to write and are not basic "atomic" patterns. To avoid overloading and complicating the basic model with these "cosmetic" features, the Relax NG specification describes a "simplification algorithm" that Relax NG processors apply internally to transform a full schema into a simple form with fewer and simpler patterns. This algorithm is presented in "Chapter 15: Simplification And Restrictions".
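As a foretaste, the following Python sketch shows two of the rewriting steps of that algorithm: an "optional" pattern becomes a choice between its content and "empty", and a "zeroOrMore" pattern becomes a choice between "oneOrMore" and "empty". The tuple representation and the function name simplify are assumptions made for this illustration, elements are treated as leaves without content, and the real algorithm includes many more steps.

```python
EMPTY = ("empty",)

def simplify(p):
    """Rewrite 'optional' and 'zeroOrMore' in terms of simpler patterns."""
    kind = p[0]
    if kind == "optional":
        # optional(p) becomes choice(p, empty)
        return ("choice", simplify(p[1]), EMPTY)
    if kind == "zeroOrMore":
        # zeroOrMore(p) becomes choice(oneOrMore(p), empty)
        return ("choice", ("oneOrMore", simplify(p[1])), EMPTY)
    if kind in ("group", "choice"):
        return (kind, simplify(p[1]), simplify(p[2]))
    if kind == "oneOrMore":
        return ("oneOrMore", simplify(p[1]))
    return p  # leaves such as ("element", name) or ("empty",)

# A hypothetical pattern: a title followed by any number of characters.
full = ("group",
        ("element", "title"),
        ("zeroOrMore", ("element", "character")))

simple = simplify(full)
print(simple)
```

After this step, the pattern only uses "group", "choice", "oneOrMore", "element" and "empty", so a processor no longer needs any dedicated logic for "optional" or "zeroOrMore".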