========================
eZ Publish markup format
========================

Summary of the discussion results on the new internal eZ Publish markup
format.

Scope
=====

The format discussed here will be used to store documents in the data
backend and therefore needs to be able to represent a sufficient superset of
the markup used by the various input and output formats.

Common use cases
----------------

Common use cases which should be covered by the document format:

1) Web content management

   In web content management the user will most likely edit the contents using
   some rich text editor [#]_ in the browser, and the contents will be
   transformed to (X)HTML for output on the website. Depending on the
   customer's preferences the output language might be anything from HTML 4 to
   HTML 5, or X/HTML 1, 1.1, 2 or 5.

2) Content management

   Content management normally involves more formats, like the already known
   Office document import and export, and also exporting documents using known
   print output formats like PDF and LaTeX. The storage format must be able to
   match the markup offered by those documents as closely as possible, to lose
   as little document semantics as possible.

3) Website styling

   Some users want to use web content management systems for easy editing and
   styling of their web contents, which includes formatting of contents besides
   pure semantic markup. It should also be possible to store this markup in the
   backend, although it should be easy to filter out for later content
   cleaning.

4) Extensibility

   Content management and publication also means we must offer an easy way to
   integrate with external contents (like images, videos or other external
   data providers). We cannot foresee which applications will evolve here, so
   the markup format should stay extensible with custom tags.

Document component
==================

In the `eZ Components`__ project we develop the `document component`__, which
aims to provide document conversions between all relevant markup formats. The
current state is that we can convert documents in all directions between
RST__, Docbook__, XHTML 1 and HTML <=4.

Next we will work on integrating the eZ Publish markup formats into the chain
and then integrate `wiki markup languages`__, as well as PDF__ and maybe
other common markup languages like the `Open Document Format`__.

The document component currently uses a subset of Docbook as the internal
conversion format, because an initial evaluation showed that it covers most
semantic markup structures of the formats in use and is easy to process,
because one of its supported syntaxes is XML. So each format added to the
document component is required to convert from and to Docbook. This way we
will be able to convert between all formats using Docbook as an intermediate
step.

The document component will offer a basis for the conversions required by some
of the above-mentioned use cases.

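
The idea of converting between all formats through one intermediate format can
be sketched as follows. This is only an illustration in Python, not the API of
the document component: the names ``register`` and ``convert`` are
hypothetical, and a plain list of paragraphs stands in for the Docbook
intermediate format.

```python
import html
import re
from typing import Callable, Dict, List

# Stand-in for the Docbook intermediate format: a plain list of paragraphs.
Hub = List[str]

READERS: Dict[str, Callable[[str], Hub]] = {}
WRITERS: Dict[str, Callable[[Hub], str]] = {}


def register(name: str, reader: Callable[[str], Hub],
             writer: Callable[[Hub], str]) -> None:
    """Register a format by its converters from and to the intermediate form."""
    READERS[name] = reader
    WRITERS[name] = writer


def convert(text: str, source: str, target: str) -> str:
    """Convert between any two registered formats via the intermediate form."""
    return WRITERS[target](READERS[source](text))


def read_plain(text: str) -> Hub:
    # Paragraphs are separated by blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def write_plain(doc: Hub) -> str:
    return "\n\n".join(doc)


def read_xhtml(text: str) -> Hub:
    # Toy reader: only understands a flat sequence of <p> elements.
    paragraphs = re.findall(r"<p>(.*?)</p>", text, re.S)
    return [html.unescape(p.strip()) for p in paragraphs]


def write_xhtml(doc: Hub) -> str:
    return "\n".join("<p>" + html.escape(p) + "</p>" for p in doc)


register("plain", read_plain, write_plain)
register("xhtml", read_xhtml, write_xhtml)
```

With this layout each new format needs only one reader and one writer, instead
of a dedicated converter for every other format.
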
Format considerations
=====================

With the use cases above and the background of already existing conversion
tools, the following markup languages are up for consideration.

RST / Wiki markup
-----------------

So-called "lightweight markup formats", which are easily editable by the user
and offer great flexibility, because they are commonly extensible by custom
plugins. They will be available as input and output formats using the document
component, but are not suitable as an internal storage format, because:

- There are no common tools to parse such languages, so the parser has to be
  implemented in PHP, which is slower than established markup parser
  frameworks like libxml2, available through the XML extensions in PHP.

- RST is not even a context-free language, so common parser approaches do not
  work here.

- A common base for wiki syntaxes is evolving__ but not really defined yet,
  and a lot of different dialects of the language still exist.

- The general tool support is quite bad for both language flavors: there are
  only two tools which are really able to parse RST (docutils__ and the
  document component), and most wiki markup parsers are dialect specific.

X/HTML 1 / X/HTML 5
-------------------

X/HTML is easy to parse, because it uses XML as its syntax, and it is widely
used in the web environment as a markup format for textual contents. A dialect
similar to XHTML 1.1 is already used in some versions of eZ Publish as a
markup language in the database.

X/HTML semantics
^^^^^^^^^^^^^^^^

X/HTML improves its semantic markup from version to version, and version 5
of X/HTML introduces several new elements like <video>, <audio> and
<section>.

Generally the X/HTML markup is centered on document representation and lacks
markup elements for structures often used in text semantics, like:

- Footnotes

  Footnotes are available in all other markup formats, like in RST__ and
  Docbook__, but cannot really be represented in X/HTML.

- Names, addresses, mail addresses, etc.

  Docbook already defines markup for many elements commonly used in various
  documents, which are only available in X/HTML through external, not yet
  standardized extensions like microformats__.

X/HTML still includes a lot of markup which is used only or partly for
representation. The most common example here is tables used to lay out
websites. But elements like <div> and <span>, or the attributes style="" and
on(load|click|...)="", are also used solely for representational purposes.
X/HTML is not designed for document-centric markup, but is still designed as
a mix of representational and semantic markup [CIT_IAN_2008]_.

    However, it lacks elements to express the semantics of many of the
    non-document types of content often seen on the Web. For instance, forum
    sites, auction sites, search engines, online shops, and the like, do not
    fit the document metaphor well, and are not covered by XHTML2.

    -- Ian Hickson, HTML 5, W3C Working Draft 22 January 2008

X/HTML conversion benefits
^^^^^^^^^^^^^^^^^^^^^^^^^^

One might think that X/HTML offers the benefit of fewer conversions in the
most traditional use case, web content management. But considering the fourth
use case, X/HTML also always needs to be processed on input and output.

The input processing would need to filter representational elements from a
document to sanitize the contents stored in the data backend.

The output processing would need to transform custom extensions, like
<ezp:object node_id="23"/> or <mymodule:gallery/>, into valid X/HTML code, not
to mention the still necessary conversions from X/HTML 5 to X/HTML 1 / HTML 4.

X/HTML editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^

X/HTML integrates perfectly with already existing editors, even though they
often do not focus on semantically correct markup, but on
representation-centric WYSIWYG editing.

The rich text editors will probably be updated to generate X/HTML 5 sooner or
later, which could spare us the work of getting the editors to create a
custom markup.

Custom formatting
^^^^^^^^^^^^^^^^^

Custom user-defined formatting like colors, as mentioned in use case 3, is
offered in X/HTML by default. This may make it hard to filter later on,
because, as mentioned above, semantic and representational markup are mixed
in X/HTML by design. On the other hand no markup extensions are required.

A filter can still remove all elements and attributes not defined in a
whitelist for valid markup.

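
A minimal sketch of such a whitelist filter, written in Python with the
standard ``xml.etree.ElementTree`` module. The whitelist ``ALLOWED`` and the
function ``sanitize`` are hypothetical names, and the choice to keep the text
content of removed elements (instead of dropping it together with the tag) is
an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical whitelist: element name -> set of allowed attribute names.
ALLOWED = {
    "section": set(),
    "h1": set(),
    "p": set(),
    "em": set(),
    "strong": set(),
    "a": {"href"},
}


def _add_text(target, text):
    """Append text after the last child of target, or to target's own text."""
    if not text:
        return
    if len(target):
        target[-1].tail = (target[-1].tail or "") + text
    else:
        target.text = (target.text or "") + text


def _allowed_attributes(element):
    return {k: v for k, v in element.attrib.items() if k in ALLOWED[element.tag]}


def _filter_into(source, target):
    """Copy the whitelisted content of source into target."""
    _add_text(target, source.text)
    for child in source:
        if child.tag in ALLOWED:
            kept = ET.SubElement(target, child.tag, _allowed_attributes(child))
            _filter_into(child, kept)
        else:
            # Unknown element: drop the tag itself, but keep its content.
            _filter_into(child, target)
        _add_text(target, child.tail)


def sanitize(markup):
    """Return markup with all non-whitelisted elements and attributes removed."""
    source = ET.fromstring(markup)
    if source.tag not in ALLOWED:
        raise ValueError("root element is not whitelisted: " + source.tag)
    root = ET.Element(source.tag, _allowed_attributes(source))
    _filter_into(source, root)
    return ET.tostring(root, encoding="unicode")
```

For example, ``sanitize('<p style="color:red">Some <span>red</span> text</p>')``
strips the style attribute and the <span> element while keeping the text.
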
X/HTML 2
--------

X/HTML 2 is also a strong improvement compared with X/HTML 1, offering
section definitions similar to those in Docbook and X/HTML 5, and other small
improvements. It still has many of the same drawbacks as X/HTML 5, as
mentioned in the sections `X/HTML conversion benefits`_, `X/HTML semantics`_
and `X/HTML editor integration`_.

X/HTML 1
--------

Besides the drawbacks mentioned for X/HTML 2 and 5, X/HTML 1 and 1.1 have
additional problems. They lack several of the markup structures introduced in
X/HTML 2 and 5, especially the <section> element, which makes it hard to
decide which block level element belongs to which section, as the following
example shows::

    <h1>Header 1</h1>
    <p>First paragraph...</p>
    <h2>Header 2</h2>
    <p>Second paragraph...</p>
    <p>Third paragraph...</p>

Here it is not decidable whether the third paragraph belongs to the first or
the second section, introduced by the respective headers. The same is true for
the second paragraph. The resulting documents could look like::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
        </section>
        <para>Third paragraph...</para>
    </section>

Or::

    <section>
        <header>Header 1</header>
        <para>First paragraph...</para>
        <section>
            <header>Header 2</header>
            <para>Second paragraph...</para>
            <para>Third paragraph...</para>
        </section>
    </section>

This may be problematic when converting documents edited in the web interface
to output formats which are aware of those structures and style documents
accordingly.

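
A converter has to pick one of these readings. The following Python sketch
implements the second one, where every block belongs to the most recent
heading until a heading of the same or a higher level appears. The function
name ``nest_sections`` and the wrapping <document> element are hypothetical,
and for brevity only the text of each block is copied.

```python
import re
import xml.etree.ElementTree as ET

# Matches the X/HTML heading elements h1 to h6 and captures the level.
HEADING = re.compile(r"h([1-6])$")


def nest_sections(body):
    """Group a flat list of headings and paragraphs into nested sections."""
    root = ET.Element("document")
    # Stack of (heading level, open element); the root acts as level 0.
    stack = [(0, root)]
    for block in body:
        match = HEADING.match(block.tag)
        if match:
            level = int(match.group(1))
            # Close every open section of the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            header = ET.SubElement(section, "header")
            header.text = block.text
            stack.append((level, section))
        else:
            # Any other block is attached to the innermost open section.
            para = ET.SubElement(stack[-1][1], "para")
            para.text = block.text
    return root


flat = ET.fromstring(
    "<body>"
    "<h1>Header 1</h1><p>First paragraph...</p>"
    "<h2>Header 2</h2><p>Second paragraph...</p><p>Third paragraph...</p>"
    "</body>"
)
nested = ET.tostring(nest_sections(flat), encoding="unicode")
```

The first reading, with the third paragraph placed back in the outer section,
cannot be derived from the flat markup at all, which is exactly the
information that gets lost without a <section> element.
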
Docbook
-------

Docbook is one of the most complete XML-based markup languages with purely
semantic markup.

Docbook semantics
^^^^^^^^^^^^^^^^^

Docbook is by far the most complete and established markup language,
comparable with LaTeX, but XML based. The only problems experienced so far
when converting other markup languages to Docbook are documented in the
`documentation of the document component`__. None of the described problems
is really relevant from a semantic point of view; they are only small
possible conversion losses.

Docbook editor integration
^^^^^^^^^^^^^^^^^^^^^^^^^^

The rich text editor in use is required to create non-X/HTML elements in order
to offer the user a WYSIWYG experience with a Docbook markup format. The
elements created by the editor can be styled as usual using CSS, as
`documented here`__.

Another possibility would be to keep the editor creating X/HTML and to convert
it to Docbook before storing the document in the database, as already
supported by the document component. This would, of course, reduce the set of
features of the markup language which can be used.

Custom formatting
^^^^^^^^^^^^^^^^^

Since Docbook is also XML, custom formatting and modules can be integrated
with the XML source using different XML namespaces, and be converted on output
to X/HTML including the required representational markup.

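
How namespaced custom elements could be resolved on output is sketched below
in Python. Everything specific in it is an assumption for illustration: the
namespace URI ``http://example.org/ezp``, the handler registry ``HANDLERS``,
and the ``/content/view/`` link path are hypothetical and not taken from eZ
Publish. Converting the remaining Docbook elements to X/HTML is out of scope
here.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for custom eZ Publish elements.
EZP_NS = "http://example.org/ezp"


def render_object(element):
    """Render <ezp:object node_id="..."/> as a link (illustrative only)."""
    node_id = element.get("node_id", "")
    # The link path is a made-up placeholder, not a real eZ Publish URL scheme.
    link = ET.Element("a", {"href": "/content/view/" + node_id})
    link.text = "Object " + node_id
    return link


# Registry mapping qualified element names to their output handlers.
HANDLERS = {"{%s}object" % EZP_NS: render_object}


def render_custom(parent):
    """Replace registered custom elements below parent, in place."""
    for index, child in enumerate(list(parent)):
        handler = HANDLERS.get(child.tag)
        if handler is not None:
            replacement = handler(child)
            # Keep the text that followed the custom element.
            replacement.tail = child.tail
            parent[index] = replacement
        else:
            render_custom(child)


doc = ET.fromstring(
    '<para xmlns:ezp="http://example.org/ezp">'
    'See <ezp:object node_id="23"/> for details.</para>'
)
render_custom(doc)
rendered = ET.tostring(doc, encoding="unicode")
```

Because the custom elements live in their own namespace, a module can register
its handler without touching the handling of the standard Docbook elements.
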
Conclusion
==========

All formats require conversions during input and output of contents, because
of the above-mentioned use cases. Even though there is progress in X/HTML 2
and 5, the markup offered by those languages is not nearly as complete as the
Docbook markup and still includes purely representational markup, which would
require us to define a subset of X/HTML which is valid to store. Also, the
X/HTML standards in versions 2 and 5 have not settled down yet and may be
subject to future modifications.

All formats offer enough capabilities to extend them with custom markup
directives.

The XML-based formats should offer faster processing than the text-based
formats, especially because of the integration of libxml2 with PHP 5.

Because of the above considerations, Docbook seems to be the best choice for
the internal markup format in eZ Publish.

.. [#] Rich text editors in the web commonly mean editors like TinyMCE__ or
   FCKEditor__, which offer WYSIWYG capabilities in web browsers.

.. [CIT_IAN_2008] `"HTML 5, 1.1.2. Relationship to XHTML2"`__. World Wide Web
   Consortium. Retrieved on 2008-07-19. “… XHTML2… defines a new HTML
   vocabulary with better features for hyperlinks, multimedia content,
   annotating document edits, rich metadata, declarative interactive forms,
   and describing the semantics of human literary works such as poems and
   scientific papers… However, it lacks elements to express the semantics of
   many of the non-document types of content often seen on the Web. For
   instance, forum sites, auction sites, search engines, online shops, and the
   like, do not fit the document metaphor well, and are not covered by XHTML2…
   This specification aims to extend HTML so that it is also suitable in these
   contexts…”

__ http://ezcomponents.org/
__ http://ezcomponents.org/docs/tutorials/Document
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html
__ http://docbook.org/tdg/en/html/docbook.html
__ http://www.wikicreole.org/wiki/Engines
__ http://en.wikipedia.org/wiki/Portable_Document_Format
__ http://de.wikipedia.org/wiki/OpenDocument
__ http://www.wikicreole.org/wiki/Engines
__ http://docutils.sourceforge.net/
__ http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#footnotes
__ http://docbook.org/tdg/en/html/footnote.html
__ http://en.wikipedia.org/wiki/Microformat
__ http://ezcomponents.org/docs/api/trunk/Document_conversion.html
__ http://kore-nordmann.de/blog/the_long_way_to_semantic_web.html#id6
__ http://tinymce.moxiecode.com/
__ http://www.fckeditor.net/
__ http://www.w3.org/TR/2008/WD-html5-20080122/#relationship0