123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322 |
- ========================
- eZ Publish markup format
- ========================
- Summarization of discussion results on the new internal eZ Publish markup
- format.
- Scope
- =====
- The discussed format will be used for the storage of documents in the data
- backend and therefore need to be able to represent a sufficient superset of
- markup used by various input and output formats.
- Common use cases
- ----------------
- Common use cases, which should be matched by the document format.
- 1) Web content management
- In web content management the user will most likely edit the contents using
- some rich text editor [#]_ in the browser and the contents will be
- transformed to (X)HTML for output on the website. Depending on the
- customers preferences the output language might be anything from HTML 4, to
- HTML 5, or X/HTML 1, 1.1, 2 or 5.
- 2) Content management
- Content management normally involves more formats like the already known
- Office document import and export, and also exporting documents using known
- print output formats like PDF and LaTeX. The storage format must be able to
- match the markup offered by those documents as much as possible to lose as
- little document semantics as possible.
- 3) Website styling
- Some users want to use web content management systems for easy editing and
- styling of their web contents, which includes formatting of contents beside
- pure semantic markup. This markup should also be possible to store in the
- backend, even it should also be easy to filter out for later content
- cleaning.
- 4) Extensibility
- Content management and publication also means we must offer an easy way to
- integrate with external contents (like images, videos or other external
- data providers). We cannot foresee which applications evolve here, so the
- markup format should stay extensible with custom tags.
- Document component
- ==================
- In the `eZ Components`__ project we develop the `document component`__ which
- aims to provide document conversions between all relevant markup formats. The
- current state is that we can convert documents in all directions between
- RST__, Docbook__, XHTML 1 and HTML <=4.
- We will work next on integrating the eZ Publish markup formats in the chain
- and then integrate `wiki markup languages`__, as well as PDF__ and maybe
- common other markup languages like the `Open Document Format`__.
- The document component currently uses a subset of Docbook as the internal
- conversion format, because an initial evaluation showed that it covers most
- semantic markup structures of the used formats and is easy to process, because
- one of the supported syntax languages is XML. So each format added to the
- document component is required to convert from and to Docbook. This way we
- will be able to convert between all formats using Docbook as an intermediate
- step.
- The document components will offer a base for the conversion required by some
- of the above mentioned use cases.
- Format considerations
- =====================
- With the use cases above and the background of already existing conversion
- tools the following markup languages are up to consideration.
- RST / Wiki markup
- -----------------
- So called "lightweight markup formats" which are easily editable by the user
- and offer great flexibility, because they are commonly extensible by custom
- plugins. They will be available as input and output formats using the document
- component, but are not valid for an internal storage format, because:
- - There are no common tools to parse such languages, so the parser is required
- to be implemented in PHP, which is slower then established markup parser
- frameworks like libxml2, available through the XML extensions in PHP.
- - RST even is a context free language, so no common parser approaches work
- here.
- - A common base for wiki syntaxes is evolving__ but not really defined yet,
- and a lot of different dialects of the language yet exist.
- - The general tool support is quite bad for both language flavors - there are
- only two tools which are really able to parse RST (docutils__ and the
- document component) and most wiki markup parsers are dialect specific.
- X/HTML 1 / X/HTML 5
- -------------------
- X/HTML is easy to parse, because it uses XML as syntax and is used widely in
- the web environment as a markup format for textual contents. A dialect similar
- to XHMLT 1.1 is already used in some versions of eZ Publish as a markup
- language in the database.
- X/HTML semantics
- ^^^^^^^^^^^^^^^^
- X/HTML improves its semantic markup from version to version, and in version 5
- of X/HTML there are several new elements introduced like <video>, <audio> and
- <section>.
- Generally the X/HTML markup is document representation centric without markup
- elements for structures often used in text semantics, like:
- - Footnotes
- Footnotes are available in all other markup formats, like in RST__ and
- Docbook__, but cannot really be represented in in X/HTML.
- - Names, addresses, mail addresses, etc.
- Docbook defines lots of already available markup for elements commonly used
- in various documents, which are only available in X/HTML through external not
- solidified extensions like microformats__.
- X/HTML still includes a lot of markup which is used only or partly for
- representation. The most common example here are tables used to layout
- websites. But also elements like <div> and <span>, or the attributes style="",
- on(load|click|...)="" are used solely for representational purposes. X/HTML is
- not designed for document centric markup, but still designed as a mix of
- representational and semantical markup [CIT_IAN_2008]_.
- However, it lacks elements to express the semantics of many of the
- non-document types of content often seen on the Web. For instance, forum
- sites, auction sites, search engines, online shops, and the like, do not
- fit the document metaphor well, and are not covered by XHTML2
- -- Ian Hickson, HTML 5, W3C Working Draft 22 January 2008
- X/HTML conversion benefits
- ^^^^^^^^^^^^^^^^^^^^^^^^^^
- One might think, that X/HTML offers the benefit of less conversions in the
- most traditional use case, the web content management. Considering the fourth
- use case X/HTML also always is required to be processed on input and output.
- The input processing would need to filter representational elements from a
- document to sanitize the contents stored in the data backend.
- The output processing would need to transform custom extensions, like
- <ezp:object node_id="23"/> or <mymodule:gallery/> into valid X/HTML code, not
- speaking of yet necessary conversions from X/HTML 5 to X/HTML 1 / HTML 4.
- X/HTML editor integration
- ^^^^^^^^^^^^^^^^^^^^^^^^^
- X/HTML integrates perfectly with yet existing editors, even they often do not
- focus on semantically correct markup, but representation centric WYSIWYG
- editing.
- The rich text editors will probably be updated to generate X/HTML 5 sooner or
- later, which could spare us the work of convincing the editors of creating a
- custom markup.
- Custom formatting
- ^^^^^^^^^^^^^^^^^
- Custom user defined formatting like colors, as mentioned in use case 3 is
- offered in X/HTML by default. This may make it hard to filter later on,
- because, like mentioned above, in X/HTML semantic and representational markup
- is mixed by design. On the other hand no markup extensions are required.
- A filter can still remove all elements and attributes not defined in a
- whitelist for valid markup.
- X/HTML 2
- --------
- X/HTML 2 is also a strong improvement compared with X/HTML 1, by offering
- similar section definitions as in Docbook and X/HTML 5 and other small
- improvements. It still has many of the same drawbacks like X/HTML 5, as
- mentioned in the sections `X/HTML conversion benefits`_, `X/HTML semantics`_
- and `X/HTML editor integration`_.
- X/HTML 1
- --------
- Beside the drawbacks mentioned for X/HTML 2 and 5, X/HTML 1 and 1.1 do have
- additional problems. It lacks several of the markup structures introduced in
- X/HTML 2 and 5, especially the <section> element, which makes it hard to
- decide which block level element belongs to which section, like the following
- example shows::
- <h1>Header 1</h1>
- <p>First paragraph...</p>
- <h2>Header 2</h2>
- <p>Second paragraph...</p>
- <p>Third paragraph...</p>
- Where it is not decidable, if the third paragraph belongs to the first or
- second sections, introduced by the respective headers. The same is true for
- the second paragraph. The resulting documents could look like::
- <section>
- <header>Header 1</header>
- <para>First paragraph...</para>
- <section>
- <header>Header 1</header>
- <para>Second paragraph...</para>
- </section>
- <para>Third paragraph...</para>
- </section>
- Or::
- <section>
- <header>Header 1</header>
- <para>First paragraph...</para>
- <section>
- <header>Header 1</header>
- <para>Second paragraph...</para>
- <para>Third paragraph...</para>
- </section>
- </section>
- This may be problematic when converting documents edited in the web interface
- to output formats, which are aware of those structures and style documents
- accordingly.
- Docbook
- -------
- Docbook is one of the most complete XML based markup languages with only
- semantical markup.
- Docbook semantics
- ^^^^^^^^^^^^^^^^^
- Docbook is by far the most complete and established markup language,
- comparable with LaTeX, but XML based. The only problems experienced so far
- converting other markup languages to Docbook are documented in the
- `documentation of the document component`__. The described problems are all
- not really relevant from a semantical point of view, but only small possible
- conversion losses.
- Docbook editor integration
- ^^^^^^^^^^^^^^^^^^^^^^^^^^
- The used rich text editor is required to create non X/HTML elements, to offer
- the user WYSIWYG experience with a Docbook markup format. The elements created
- by the editor can be styled as usual using CSS, like `documented here`__.
- Another possibility would be to keep the editor creating X/HTML and converting
- it to Docbook before storing the document in the database like already
- supported by the document component. This would, of course, reduce the
- features, which can be used from the markup language.
- Custom formatting
- ^^^^^^^^^^^^^^^^^
- Since Docbook is also XML, custom formatting and modules can be integrated
- with the XML source using different XML namespaces, and be converted on output
- to X/HTML including the required representational markup.
- Conclusion
- ==========
- All formats require conversions during input and output of contents, because
- of to the above mentioned use cases. Even there is progress in X/HTML 2 and 5,
- the markup offered by those languages is not nearly as complete as the Docbook
- markup and still includes purely representational markup, which would require
- us to define a subset of X/HTML which is valid to store. Also the X/HTML
- standards in the versions 2 and 5 have not settled down yet and may be up for
- future modifications.
- All formats offer enough capabilities to extend them with custom markup
- directives.
- The XML based formats should offer faster processing then the text based
- formats, especially because of the integration of libxml2 with PHP 5.
- Because of the above considerations Docbook seems the best choice for the
- interal markup format in eZ Publish.
- .. [#] Rich text editors in the web commonly mean editors like TinyMCE__ or
- FCKEditor__, which offer WYSIWYG capabilities in web browsers.
- .. [CIT_IAN_2008] `"HTML 5, 1.1.2. Relationship to XHTML2"`__. World Wide Web
- Consortium. Retrieved on 2008-07-19. “… XHTML2… defines a new HTML
- vocabulary with better features for hyperlinks, multimedia content,
- annotating document edits, rich metadata, declarative interactive forms,
- and describing the semantics of human literary works such as poems and
- scientific papers… However, it lacks elements to express the semantics of
- many of the non-document types of content often seen on the Web. For
- instance, forum sites, auction sites, search engines, online shops, and the
- like, do not fit the document metaphor well, and are not covered by XHTML2…
- This specification aims to extend HTML so that it is also suitable in these
- contexts…”
- __ http://ezcomponents.org/
- __ http://ezcomponents.org/docs/tutorials/Document
- __ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
- __ http://docbook.org/tdg/en/html/docbook.html
- __ http://www.wikicreole.org/wiki/Engines
- __ http://en.wikipedia.org/wiki/Portable_Document_Format
- __ http://de.wikipedia.org/wiki/OpenDocument
- __ http://www.wikicreole.org/wiki/Engines
- __ http://docutils.sourceforge.net/
- __ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#footnotes
- __ http://docbook.org/tdg/en/html/footnote.html
- __ http://en.wikipedia.org/wiki/Microformat
- __ http://ezcomponents.org/docs/api/trunk/Document_conversion.html
- __ http://kore-nordmann.de/blog/the_long_way_to_semantic_web.html#id6
- __ http://tinymce.moxiecode.com/
- __ http://www.fckeditor.net/
- __ http://www.w3.org/TR/2008/WD-html5-20080122/#relationship0
|