123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490 |
- %
- % Introduction
- %
- \chapter{Introduction}
- \label{c:Introduction}
- \section{File Formats}
- \subsection{User vs developer perspective}
- File formats do often provide tough problems both for the
- software engineers who write programs and for the people who are
- using them. Software engineers want formats that store data in
- an efficient manner, and are easy to read and write. Users want
- way to save their data in a convenient and fast manner, and don't
- want to be bothered with the choice of a specific format.
- \subsection{Converters}
- The consequence is that almost every graphics or desktop publishing
- system has its own file format, optimized for the needs of that
- product. This means that direct data exchange between different
- products is impossible, since the file formats will be mutually
- exclusive. So, most programs contain lots of converters that
- transform data between different formats.
- \subsection{The User's Perspective}
- Having lots of converters this is inconvenient for the user. First,
- it means that $O(n^{2})$ converters are ideally needed to exchange data
- between n programs. However, it is unlikely that each program can
- read and write each other format.
- If programs cannot exchange data since they don't understand each
- others format, but both understand a third program, there is a still
- a way to exchange data with that format. However, this is
- inconvenient because the user has to find out which programs are
- candidates for such an operation.
- Furthermore, data may be lost in a format conversion. This may be
- because the other format is simply too complicated, or even secret
- information. There is obviously no way to avoid both issues, but they
- should not be as bothering as they are.
- \subsection{The Developer's Perspective}
- From an engineering point of view, it is conceivable that most
- programs need their own format to represent their data in an
- efficient way. However, it is not so obvious that the user needs
- to be bothered with this. First, consider converters. One easy
- way to get rid of them is to provide one powerful format that has
- a core part which is understood by all participating
- applications, and can be extended to meet a particular format's
- needs. The RTF and SGML formats for marking text are such
- approaches, The PICT graphics format is a successful example for
- a graphics format. Second, consider efficiency. It is often the
- case that a more generalized format is less efficient in terms of
- storage space or loading time. However, one can provide the
- choice of saving data either in a native, efficient format, or in
- a exchange format. Many desktop publishing programs use this
- approach.
- \section{Intermezzo: What is a graph ?}
- \begin{definition}[Graph]
- A graph is a tuple
- \begin{eqnarray*}
- G & = & (V,E), \mbox{\textnormal{where}} \\
- V & & \mbox{\textnormal{is the set of \emph{nodes}, and}} \\
- E & \subseteq & V \times V \mbox{\textnormal{is the set of edges}}
- \end{eqnarray*}
- We also define a mapping \emph{label} which assigns information
- to nodes, edges and labels:
- \begin{displaymath}
- label: G \cup V \cup E \mapsto \Sigma^{*}
- \end{displaymath}
- \end{definition}
- \begin{notes}
- \item In graph theory, $\Sigma$ is usually a fixed alphabet;
- GML needs a more general approach and allows to attach
- arbitrary attributes to graphs, nodes and edges.
- \end{notes}
- %
- % A Simple GML Example
- %
- \section{A Simple GML Example}
- \begin{example}%
- {e:GML:Intro:circle3}%
- {A simple graph in GML (circle of three nodes)}
- \begin{alltt}
- graph [ \ttcomment{Defines a new graph}
- node [ \ttcomment{Defines a new node}
- id 1 \ttcomment{This node has the id 1}
- ]
- node [ \ttcomment{Defines a new node}
- id 2 \ttcomment{This node has the id 2}
- ]
- node [ \ttcomment{Defines a new node}
- id 3 \ttcomment{This node has the id 3}
- ]
- edge [ \ttcomment{Defines a new edge}
- source 1 \ttcomment{Source is the node with the id 1}
- target 2 \ttcomment{Target is the node with the id 2}
- ]
- edge [ \ttcomment{Defines a new edge}
- source 2 \ttcomment{Source is the node with the id 2}
- target 3 \ttcomment{Target is the node with the id 3}
- ]
- edge [ \ttcomment{Defines a new edge}
- source 3 \ttcomment{Source is the node with the id 3}
- target 1 \ttcomment{Target is the node with the id 1}
- ]
- ]
- \end{alltt}
- \end{example}
- \begin{example}%
- {e:GML:ComplexExample}%
- {A complex GML example}
- \begin{alltt}
- # This file is in version 1 of GML
- Version 1
- graph [
- # \ttcomment{This graph has been created by the program "demo"}
- Vendor "demo"
- # directed \ttcomment{determines whether a graph is directed (1) or not (0).}
- # \ttcomment{In a directed graph, edges are have arrows that indicate the direction.}
- directed 1
- # \ttcomment{A label is a text attached to an object}
- label "The principles of space travel"
- node [
- id 1
- label "Earth"
- graphics [
- x 0.1
- y 0.0
- w 0.1
- h 0.1
- image "earth.gif"
- ]
- ]
- node [
- id 2
- label "Mars"
- graphics [
- x 0.9
- y 0.0
- w 0.055
- h 0.055
- image "Mars.gif"
- ]
- ]
- edge [
- source 1
- target 2
- ]
- ]
- \end{alltt}
- \end{example}
- Example \ref{e:GML:Intro:circle3} shows a simple graph which
- consists of three nodes. The more complex example
- \ref{e:GML:ComplexExample} demonstrates how to attach text to
- graphs, nodes and edges, and how to specify coordinates and
- images for nodes and edges. These example illustrates the key
- elements of GML:
- \begin{itemize}
- \item A GML file is made up of pairs of a key and a value.
- Examples for keys are \texttt{graph}, \texttt{node} and
- \texttt{edge}.
-
- \item The key idea behind GML is that there are some
- standard keys like graph, node and edge, and anybody is free
- to add its keys to add specific information.
- \item Values can be integers, floating point numbers, strings
- and lists, where the latter must be enclosed in square
- brackets.
-
- \item The graph in Example \ref{e:GML:ComplexExample} did not
- specify how to place the labels. They are arranged in a
- convenient manner in Figure 2, but an application might also
- ignore graph labels and print node labels over the images.
- There is no way to prevent a program from not drawing labels,
- but we will show in sections \ref{s:GML:NodeAttributes} and
- \ref{s:GML:EdgeAttributes} how the placement of labels is specified.
- \end{itemize}
- %
- % Design Issues
- %
- \section{GML Design Issues}
- \subsection{Syntax: Simple or Complex ?}
- A complex syntax -- like a programming language -- gives the
- designer the freedom to express facts in a efficient and
- easy-to-read manner. The \texttt{dot} format is a good example
- for this practice. However, the price to be paid is a less simple
- implementation. A simple syntax has the advantage that the
- implementation is easier, but the format is less efficient in
- terms of storage space and runtime. However, there is no reason
- why the format should be less powerful; everything that can be
- expressed with a complex format can be expressed in a simple
- format.
- \textbf{Answer: Simple} We chose a simple format over a complex
- one. We loose some efficiency, but we gain a much easier
- implementation and thus a wider distribution. However,
- simplicity is limited: since the format needs to be universal,
- some details will be slightly complex. For example, strings are
- always terminated by " characters. Therefore, we need a mechanism
- to deal with a quote that appears inside a string. Other issues
- are maximum line length and non ASCII characters, such as German
- umlauts, \"{a}.
- \subsection{Data types}
- Which types of values do we need to represent? The answer is: all
- data types present in programming languages. This includes
- numbers (both integers and floating point), boolean values,
- characters, strings and composite data types such as record,
- array, set and list structures.
- \subsection{Constraints}
- There are several external constraints which have to be
- considered:
- \begin{description}
-
- \item[Maximum line length] Some systems cannot handle arbitrary
- long lines without problems. Therefore, we need to restrict
- line lengths to a size that fits all systems.
-
- \item[Character Set] Internationalization is an important issue
- these days, and the ASCII characters are no longer sufficient
- for a real application. Therefore, we will use the ISO 8859
- character set, which is a common way to code non ASCII
- character sets within ASCII. ISO 8859 is also used in HTML, so
- we are in good company.
-
- \item[Range of values] The range of numbers is another
- sensitive point. We will assume 32 bit signed integers and
- double precision floating point values. These should be
- supported by all current systems.
-
- This rules out other data types like unsigned integers and 64
- bit integers, but those could still be stores as strings and
- converted afterwards. We will not assume a maximum length for
- strings, as this would be a restrictions for applications that
- store long texts in strings.
- \end{description}
- %
- % Notes on our Notation
- %
- \section{Notes on our Notation}
- We use a Pascal like notation with some object oriented
- extensions\footnote{Graphlet's implemention of GML is in C++, but
- we feel that Pascal is better for for aesthetic reasons, and
- more people are familiar with Pascal than with the fine prints
- of C++.}. A GML file is composed of key-value pairs, which we
- call objects. An object is a parametrized data type:
- \begin{alltt}
- type Object(\Param{Type}) = record
- key: Key;
- value: \Param{Type};
- end;
- \end{alltt}
- \noindent In the following, we will use the types \texttt{Object(Integer)},
- \texttt{Object(Real)}, \texttt{Object(String)} and
- \texttt{Object(List(Object))}. We will also assume that the type
- of an object is available at runtime, as in the following
- example:
- \begin{alltt}
- case type(\Param{o}) of
- Object(Integer): {\textnormal{\emph{Action for type}}} Integer;
- Object(Real): {\textnormal{\emph{Action for type}}} Real;
- Object(String): {\textnormal{\emph{Action for type}}} String;
- Object(List(Object)): {\textnormal{\emph{Action for type}}} List(Object);
- end
- \end{alltt}
- %
- % Paths
- %
- \subsection{Paths}
- The data structure Object forms a tree, where elements of type
- \texttt{Object(Integer)}, \texttt{Object(Real)} and
- \texttt{Object(String)} are leaves, and elements of type
- \texttt{Object(List(Object)))} are inner nodes. We will
- frequently use paths in this tree to describe the location of an
- object.
- \begin{definition}[Path]
- Let $k_{1},k_{2},\ldots,k_{n}, n \ge 1$ be keys. The path
- \[
- .k_{1}.k_{2}.\ldots.k_{n}
- \]
- denotes all sequences of objects $o_{1},o_{2},\ldots,o_{n}$ where
- \begin{eqnarray*}
- 1 \le i \le n & : & o_{i}.\mbox{key} = k_{i} \\
- 1 \le i \le n-1 & : & \mbox{type}(o_{i}.\mbox{value}) =
- \mbox{Object(List(Object))} \\
- 2 \le i \le n & : & o_{i} \in o_{i-1}.\mbox{value}
- \end{eqnarray*}
- \end{definition}
- \noindent Examples of paths are \texttt{.id}, \texttt{.graph.label},
- \texttt{.node.label} and \texttt{.node.graphics}. A path
- describes a class of sequences of objects which can start
- anywhere in the object tree. We can also define paths that start
- at a specific object:
- \begin{definition}[Path Starting at an Object]
- Let $k_{1}.k_{2}.\ldots.k_{n}, n \ge 1$ be a path and $o$ be an
- object. Then
- \[
- \mbox{\texttt{o}}.k_{1}.k_{2}.\ldots.k_{n}
- \]
- denotes the path starting at o, that is all sequences of
- objects $o_{1},o_{2},\ldots,o_{n}$ where
- \begin{eqnarray*}
- 1 \le i \le n & : & o_{i}.\mbox{key} = k_{i} \\
- 1 \le i \le n-1 & : & \mbox{type}(o_{i}.\mbox{value}) =
- \mbox{List(Object)} \\
- 2 \le i \le n & : & o_{i} \in o_{i-1}.\mbox{value} \\
- & & o_{1} \in o.\mbox{value}
- \end{eqnarray*}
- \end{definition}
- \noindent Finally, we omit the leading point on a path if it
- starts at the root object:
- \begin{definition}[Path Starting at the Root]
- Let $k_{1},k_{2},\ldots,k_{n}, n \ge 1$ be keys.
- \[
- k_{1}.k_{2}.\ldots.k_{n}
- \]
- denotes the path
- \[
- R.k_{1}.k_{2}.\ldots.k_{n}
- \]
- where $R$ is the root object of the tree.
- \end{definition}
- %
- % Other Graph File Formats
- %
- \chapter{Other Graph File Formats}
- \begin{note}
- This chapter is not complete yet.
- \end{note}
- \section{Simple adjacency lists}
- Many systems use simple adjacency lists, perhaps enriched with labels
- or coordinates. Often, an adjacency list is terminated by the end of
- the line.
- While this format type is convenient and easy to use in these
- systems, it has several disadvantages for our purpose. First, it is
- not expandible. Second, labels are usually restricted to one
- character or a single word. Further, the degree of a node is limited
- on systems which do not support arbitrary line lengths.
- \section{GraphEd}
- GraphEd had a format which is in spirit very simpliar to the one
- which is presented in this paper. However, its syntax is more complex
- than necessary in several aspects.
- \begin{verbatim}
- GRAPH "" =
- 1 {$ NS 32 32 $} ""
- 2 ""
- ;
- 2 {$ NS 32 32 $} ""
- 3 ""
- ;
- 3 {$ NS 32 32 $} ""
- 1 ""
- ;
- END
- \end{verbatim}
- This format contains several complications
- \begin{itemize}
- \item There are several ways to represent lists, e.g \verb|"[ Š]"|,
- \verb|"{$ Š $}"| and \texttt{GRAPH}\ldots\texttt{END}.
- \item Adjacency lists start with a node and end with \texttt{";"},
- whereas the list of nodes in a graph is terminated by \texttt{"END"}.
- \item Some syntax elements are superficial, like the \texttt{"="}
- after the keyword \texttt{GRAPH}.
- \end{itemize}
- On the other hand, the format supported generic attributes (inside
- \verb|"{$ Š $}"|) which were very similar to the one we propose. The main
- difference was that GraphEd's attributes had a key and a list of
- values, where we will only have one value per key. Of course
- GraphEd's approach made it easier to write files, but the data
- structures behind that were more complex, since two list structures
- where needed. GraphEd needed a list of attributes and a list of
- values, whereas we need only a list of key-value pairs.
- Another difference is that GraphEd put the graph structure and labels
- into a syntax outside the attributes, where we will combine them,
- once again to have only one data structure.
- \section{TEI (SGML) Format}
- The format used in TEI [reference] is actually a SGML DTD. It has the
- advantage that SGML provides a powerful standardized framework for it.
- However, the SGML syntax is more complex than ours, so parsing is
- more difficult. Our goal is a format that can easily be converted
- into any other format, so a easy parser is essential. Also, we do not
- need a format with the power of SGML here.
- \section{VRML Format}
- VRML is a format who's syntax is quite similar to the one defined
- here. Basically, it uses a key-value structure. The main difference
- is that a key can have a list values.
- The format is extensible.
- \section{dot Format}
- \TBD{}
- \section{Tom Sawyer Software Format}
- Tom Sawyer Software, Berkeley makes commercial graph layout and graph
- editor toolkits. Their file format uses keys in a line started by
- \texttt{//} , followed by a list of values, each on its line:
- \begin{verbatim}
- // Graph Layout Toolkit
- Hierarchical Layout
- // minimumSlopePercent
- 20
- // Nodes
- 2
- 42
- \end{verbatim}
- There is no hierarchical structure, although they can be modelled
- with dummy begin/end keys. The format is extensible; new elements can
- be added through a C or C++ interface.
- %%% Local Variables:
- %%% mode: latex
- %%% TeX-master: "GML"
- %%% End:
|