GML-Introduction.tex 16 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490
  1. %
  2. % Introduction
  3. %
  4. \chapter{Introduction}
  5. \label{c:Introduction}
  6. \section{File Formats}
  7. \subsection{User vs developer perspective}
  8. File formats do often provide tough problems both for the
  9. software engineers who write programs and for the people who are
  10. using them. Software engineers want formats that store data in
  11. an efficient manner, and are easy to read and write. Users want
  12. way to save their data in a convenient and fast manner, and don't
  13. want to be bothered with the choice of a specific format.
  14. \subsection{Converters}
  15. The consequence is that almost every graphics or desktop publishing
  16. system has its own file format, optimized for the needs of that
  17. product. This means that direct data exchange between different
  18. products is impossible, since the file formats will be mutually
  19. exclusive. So, most programs contain lots of converters that
  20. transform data between different formats.
  21. \subsection{The User's Perspective}
  22. Having lots of converters this is inconvenient for the user. First,
  23. it means that $O(n^{2})$ converters are ideally needed to exchange data
  24. between n programs. However, it is unlikely that each program can
  25. read and write each other format.
  26. If programs cannot exchange data since they don't understand each
  27. others format, but both understand a third program, there is a still
  28. a way to exchange data with that format. However, this is
  29. inconvenient because the user has to find out which programs are
  30. candidates for such an operation.
  31. Furthermore, data may be lost in a format conversion. This may be
  32. because the other format is simply too complicated, or even secret
  33. information. There is obviously no way to avoid both issues, but they
  34. should not be as bothering as they are.
  35. \subsection{The Developer's Perspective}
  36. From an engineering point of view, it is conceivable that most
  37. programs need their own format to represent their data in an
  38. efficient way. However, it is not so obvious that the user needs
  39. to be bothered with this. First, consider converters. One easy
  40. way to get rid of them is to provide one powerful format that has
  41. a core part which is understood by all participating
  42. applications, and can be extended to meet a particular format's
  43. needs. The RTF and SGML formats for marking text are such
  44. approaches, The PICT graphics format is a successful example for
  45. a graphics format. Second, consider efficiency. It is often the
  46. case that a more generalized format is less efficient in terms of
  47. storage space or loading time. However, one can provide the
  48. choice of saving data either in a native, efficient format, or in
  49. a exchange format. Many desktop publishing programs use this
  50. approach.
  51. \section{Intermezzo: What is a graph ?}
  52. \begin{definition}[Graph]
  53. A graph is a tuple
  54. \begin{eqnarray*}
  55. G & = & (V,E), \mbox{\textnormal{where}} \\
  56. V & & \mbox{\textnormal{is the set of \emph{nodes}, and}} \\
  57. E & \subseteq & V \times V \mbox{\textnormal{is the set of edges}}
  58. \end{eqnarray*}
  59. We also define a mapping \emph{label} which assigns information
  60. to nodes, edges and labels:
  61. \begin{displaymath}
  62. label: G \cup V \cup E \mapsto \Sigma^{*}
  63. \end{displaymath}
  64. \end{definition}
  65. \begin{notes}
  66. \item In graph theory, $\Sigma$ is usually a fixed alphabet;
  67. GML needs a more general approach and allows to attach
  68. arbitrary attributes to graphs, nodes and edges.
  69. \end{notes}
  70. %
  71. % A Simple GML Example
  72. %
  73. \section{A Simple GML Example}
  74. \begin{example}%
  75. {e:GML:Intro:circle3}%
  76. {A simple graph in GML (circle of three nodes)}
  77. \begin{alltt}
  78. graph [ \ttcomment{Defines a new graph}
  79. node [ \ttcomment{Defines a new node}
  80. id 1 \ttcomment{This node has the id 1}
  81. ]
  82. node [ \ttcomment{Defines a new node}
  83. id 2 \ttcomment{This node has the id 2}
  84. ]
  85. node [ \ttcomment{Defines a new node}
  86. id 3 \ttcomment{This node has the id 3}
  87. ]
  88. edge [ \ttcomment{Defines a new edge}
  89. source 1 \ttcomment{Source is the node with the id 1}
  90. target 2 \ttcomment{Target is the node with the id 2}
  91. ]
  92. edge [ \ttcomment{Defines a new edge}
  93. source 2 \ttcomment{Source is the node with the id 2}
  94. target 3 \ttcomment{Target is the node with the id 3}
  95. ]
  96. edge [ \ttcomment{Defines a new edge}
  97. source 3 \ttcomment{Source is the node with the id 3}
  98. target 1 \ttcomment{Target is the node with the id 1}
  99. ]
  100. ]
  101. \end{alltt}
  102. \end{example}
  103. \begin{example}%
  104. {e:GML:ComplexExample}%
  105. {A complex GML example}
  106. \begin{alltt}
  107. # This file is in version 1 of GML
  108. Version 1
  109. graph [
  110. # \ttcomment{This graph has been created by the program "demo"}
  111. Vendor "demo"
  112. # directed \ttcomment{determines whether a graph is directed (1) or not (0).}
  113. # \ttcomment{In a directed graph, edges are have arrows that indicate the direction.}
  114. directed 1
  115. # \ttcomment{A label is a text attached to an object}
  116. label "The principles of space travel"
  117. node [
  118. id 1
  119. label "Earth"
  120. graphics [
  121. x 0.1
  122. y 0.0
  123. w 0.1
  124. h 0.1
  125. image "earth.gif"
  126. ]
  127. ]
  128. node [
  129. id 2
  130. label "Mars"
  131. graphics [
  132. x 0.9
  133. y 0.0
  134. w 0.055
  135. h 0.055
  136. image "Mars.gif"
  137. ]
  138. ]
  139. edge [
  140. source 1
  141. target 2
  142. ]
  143. ]
  144. \end{alltt}
  145. \end{example}
  146. Example \ref{e:GML:Intro:circle3} shows a simple graph which
  147. consists of three nodes. The more complex example
  148. \ref{e:GML:ComplexExample} demonstrates how to attach text to
  149. graphs, nodes and edges, and how to specify coordinates and
  150. images for nodes and edges. These example illustrates the key
  151. elements of GML:
  152. \begin{itemize}
  153. \item A GML file is made up of pairs of a key and a value.
  154. Examples for keys are \texttt{graph}, \texttt{node} and
  155. \texttt{edge}.
  156. \item The key idea behind GML is that there are some
  157. standard keys like graph, node and edge, and anybody is free
  158. to add its keys to add specific information.
  159. \item Values can be integers, floating point numbers, strings
  160. and lists, where the latter must be enclosed in square
  161. brackets.
  162. \item The graph in Example \ref{e:GML:ComplexExample} did not
  163. specify how to place the labels. They are arranged in a
  164. convenient manner in Figure 2, but an application might also
  165. ignore graph labels and print node labels over the images.
  166. There is no way to prevent a program from not drawing labels,
  167. but we will show in sections \ref{s:GML:NodeAttributes} and
  168. \ref{s:GML:EdgeAttributes} how the placement of labels is specified.
  169. \end{itemize}
  170. %
  171. % Design Issues
  172. %
  173. \section{GML Design Issues}
  174. \subsection{Syntax: Simple or Complex ?}
  175. A complex syntax -- like a programming language -- gives the
  176. designer the freedom to express facts in a efficient and
  177. easy-to-read manner. The \texttt{dot} format is a good example
  178. for this practice. However, the price to be paid is a less simple
  179. implementation. A simple syntax has the advantage that the
  180. implementation is easier, but the format is less efficient in
  181. terms of storage space and runtime. However, there is no reason
  182. why the format should be less powerful; everything that can be
  183. expressed with a complex format can be expressed in a simple
  184. format.
  185. \textbf{Answer: Simple} We chose a simple format over a complex
  186. one. We loose some efficiency, but we gain a much easier
  187. implementation and thus a wider distribution. However,
  188. simplicity is limited: since the format needs to be universal,
  189. some details will be slightly complex. For example, strings are
  190. always terminated by " characters. Therefore, we need a mechanism
  191. to deal with a quote that appears inside a string. Other issues
  192. are maximum line length and non ASCII characters, such as German
  193. umlauts, \"{a}.
  194. \subsection{Data types}
  195. Which types of values do we need to represent? The answer is: all
  196. data types present in programming languages. This includes
  197. numbers (both integers and floating point), boolean values,
  198. characters, strings and composite data types such as record,
  199. array, set and list structures.
  200. \subsection{Constraints}
  201. There are several external constraints which have to be
  202. considered:
  203. \begin{description}
  204. \item[Maximum line length] Some systems cannot handle arbitrary
  205. long lines without problems. Therefore, we need to restrict
  206. line lengths to a size that fits all systems.
  207. \item[Character Set] Internationalization is an important issue
  208. these days, and the ASCII characters are no longer sufficient
  209. for a real application. Therefore, we will use the ISO 8859
  210. character set, which is a common way to code non ASCII
  211. character sets within ASCII. ISO 8859 is also used in HTML, so
  212. we are in good company.
  213. \item[Range of values] The range of numbers is another
  214. sensitive point. We will assume 32 bit signed integers and
  215. double precision floating point values. These should be
  216. supported by all current systems.
  217. This rules out other data types like unsigned integers and 64
  218. bit integers, but those could still be stores as strings and
  219. converted afterwards. We will not assume a maximum length for
  220. strings, as this would be a restrictions for applications that
  221. store long texts in strings.
  222. \end{description}
  223. %
  224. % Notes on our Notation
  225. %
  226. \section{Notes on our Notation}
  227. We use a Pascal like notation with some object oriented
  228. extensions\footnote{Graphlet's implemention of GML is in C++, but
  229. we feel that Pascal is better for for aesthetic reasons, and
  230. more people are familiar with Pascal than with the fine prints
  231. of C++.}. A GML file is composed of key-value pairs, which we
  232. call objects. An object is a parametrized data type:
  233. \begin{alltt}
  234. type Object(\Param{Type}) = record
  235. key: Key;
  236. value: \Param{Type};
  237. end;
  238. \end{alltt}
  239. \noindent In the following, we will use the types \texttt{Object(Integer)},
  240. \texttt{Object(Real)}, \texttt{Object(String)} and
  241. \texttt{Object(List(Object))}. We will also assume that the type
  242. of an object is available at runtime, as in the following
  243. example:
  244. \begin{alltt}
  245. case type(\Param{o}) of
  246. Object(Integer): {\textnormal{\emph{Action for type}}} Integer;
  247. Object(Real): {\textnormal{\emph{Action for type}}} Real;
  248. Object(String): {\textnormal{\emph{Action for type}}} String;
  249. Object(List(Object)): {\textnormal{\emph{Action for type}}} List(Object);
  250. end
  251. \end{alltt}
  252. %
  253. % Paths
  254. %
  255. \subsection{Paths}
  256. The data structure Object forms a tree, where elements of type
  257. \texttt{Object(Integer)}, \texttt{Object(Real)} and
  258. \texttt{Object(String)} are leaves, and elements of type
  259. \texttt{Object(List(Object)))} are inner nodes. We will
  260. frequently use paths in this tree to describe the location of an
  261. object.
  262. \begin{definition}[Path]
  263. Let $k_{1},k_{2},\ldots,k_{n}, n \ge 1$ be keys. The path
  264. \[
  265. .k_{1}.k_{2}.\ldots.k_{n}
  266. \]
  267. denotes all sequences of objects $o_{1},o_{2},\ldots,o_{n}$ where
  268. \begin{eqnarray*}
  269. 1 \le i \le n & : & o_{i}.\mbox{key} = k_{i} \\
  270. 1 \le i \le n-1 & : & \mbox{type}(o_{i}.\mbox{value}) =
  271. \mbox{Object(List(Object))} \\
  272. 2 \le i \le n & : & o_{i} \in o_{i-1}.\mbox{value}
  273. \end{eqnarray*}
  274. \end{definition}
  275. \noindent Examples of paths are \texttt{.id}, \texttt{.graph.label},
  276. \texttt{.node.label} and \texttt{.node.graphics}. A path
  277. describes a class of sequences of objects which can start
  278. anywhere in the object tree. We can also define paths that start
  279. at a specific object:
  280. \begin{definition}[Path Starting at an Object]
  281. Let $k_{1}.k_{2}.\ldots.k_{n}, n \ge 1$ be a path and $o$ be an
  282. object. Then
  283. \[
  284. \mbox{\texttt{o}}.k_{1}.k_{2}.\ldots.k_{n}
  285. \]
  286. denotes the path starting at o, that is all sequences of
  287. objects $o_{1},o_{2},\ldots,o_{n}$ where
  288. \begin{eqnarray*}
  289. 1 \le i \le n & : & o_{i}.\mbox{key} = k_{i} \\
  290. 1 \le i \le n-1 & : & \mbox{type}(o_{i}.\mbox{value}) =
  291. \mbox{List(Object)} \\
  292. 2 \le i \le n & : & o_{i} \in o_{i-1}.\mbox{value} \\
  293. & & o_{1} \in o.\mbox{value}
  294. \end{eqnarray*}
  295. \end{definition}
  296. \noindent Finally, we omit the leading point on a path if it
  297. starts at the root object:
  298. \begin{definition}[Path Starting at the Root]
  299. Let $k_{1},k_{2},\ldots,k_{n}, n \ge 1$ be keys.
  300. \[
  301. k_{1}.k_{2}.\ldots.k_{n}
  302. \]
  303. denotes the path
  304. \[
  305. R.k_{1}.k_{2}.\ldots.k_{n}
  306. \]
  307. where $R$ is the root object of the tree.
  308. \end{definition}
  309. %
  310. % Other Graph File Formats
  311. %
  312. \chapter{Other Graph File Formats}
  313. \begin{note}
  314. This chapter is not complete yet.
  315. \end{note}
  316. \section{Simple adjacency lists}
  317. Many systems use simple adjacency lists, perhaps enriched with labels
  318. or coordinates. Often, an adjacency list is terminated by the end of
  319. the line.
  320. While this format type is convenient and easy to use in these
  321. systems, it has several disadvantages for our purpose. First, it is
  322. not expandible. Second, labels are usually restricted to one
  323. character or a single word. Further, the degree of a node is limited
  324. on systems which do not support arbitrary line lengths.
  325. \section{GraphEd}
  326. GraphEd had a format which is in spirit very simpliar to the one
  327. which is presented in this paper. However, its syntax is more complex
  328. than necessary in several aspects.
  329. \begin{verbatim}
  330. GRAPH "" =
  331. 1 {$ NS 32 32 $} ""
  332. 2 ""
  333. ;
  334. 2 {$ NS 32 32 $} ""
  335. 3 ""
  336. ;
  337. 3 {$ NS 32 32 $} ""
  338. 1 ""
  339. ;
  340. END
  341. \end{verbatim}
  342. This format contains several complications
  343. \begin{itemize}
  344. \item There are several ways to represent lists, e.g \verb|"[ Š]"|,
  345. \verb|"{$ Š $}"| and \texttt{GRAPH}\ldots\texttt{END}.
  346. \item Adjacency lists start with a node and end with \texttt{";"},
  347. whereas the list of nodes in a graph is terminated by \texttt{"END"}.
  348. \item Some syntax elements are superficial, like the \texttt{"="}
  349. after the keyword \texttt{GRAPH}.
  350. \end{itemize}
  351. On the other hand, the format supported generic attributes (inside
  352. \verb|"{$ Š $}"|) which were very similar to the one we propose. The main
  353. difference was that GraphEd's attributes had a key and a list of
  354. values, where we will only have one value per key. Of course
  355. GraphEd's approach made it easier to write files, but the data
  356. structures behind that were more complex, since two list structures
  357. where needed. GraphEd needed a list of attributes and a list of
  358. values, whereas we need only a list of key-value pairs.
  359. Another difference is that GraphEd put the graph structure and labels
  360. into a syntax outside the attributes, where we will combine them,
  361. once again to have only one data structure.
  362. \section{TEI (SGML) Format}
  363. The format used in TEI [reference] is actually a SGML DTD. It has the
  364. advantage that SGML provides a powerful standardized framework for it.
  365. However, the SGML syntax is more complex than ours, so parsing is
  366. more difficult. Our goal is a format that can easily be converted
  367. into any other format, so a easy parser is essential. Also, we do not
  368. need a format with the power of SGML here.
  369. \section{VRML Format}
  370. VRML is a format who's syntax is quite similar to the one defined
  371. here. Basically, it uses a key-value structure. The main difference
  372. is that a key can have a list values.
  373. The format is extensible.
  374. \section{dot Format}
  375. \TBD{}
  376. \section{Tom Sawyer Software Format}
  377. Tom Sawyer Software, Berkeley makes commercial graph layout and graph
  378. editor toolkits. Their file format uses keys in a line started by
  379. \texttt{//} , followed by a list of values, each on its line:
  380. \begin{verbatim}
  381. // Graph Layout Toolkit
  382. Hierarchical Layout
  383. // minimumSlopePercent
  384. 20
  385. // Nodes
  386. 2
  387. 42
  388. \end{verbatim}
  389. There is no hierarchical structure, although they can be modelled
  390. with dummy begin/end keys. The format is extensible; new elements can
  391. be added through a C or C++ interface.
  392. %%% Local Variables:
  393. %%% mode: latex
  394. %%% TeX-master: "GML"
  395. %%% End: