title: Hitchhiker's guide to data formats
date: 2015-10-21 09:45
author: Christine Lemmer-Webber
tags: data, serialization, foss
slug: hitchhikers-guide-to-data-formats
Just thinking out loud this morning on what data formats there are and
how they work with the world:
- XML: The 2000s' hippest technology.
Combines a clear, parsable, tree-based syntax with extension
mechanisms and a schema system. Still moderately popular, though not
as much as it once was. Tons of tooling. Many seem to think the tooling
makes it overly complex, and JSON has taken over much of its place.
Has the advantage of unambiguity over vanilla JSON, if you know how
to use it right, but takes more effort to work with.
- SGML:
XML's soupier grandmother. Influential.
- HTML: Kind of like
SGML and XML but for some specific data. Too bad XHTML never
fulfilled its dream. Without XHTML, it's even soupier than SGML, but
there's enough tooling for soup-processing that most developers
don't worry about it.
- JSON: Also tree-based, but
keeps things minimal, just your basic types. Loved by web
developers everywhere. Also ambiguous since on its own, it's
schema-free... this may lead to conflicts between applications. But
if you know the source and the destination perfectly it's fine. Has
the advantage of transforming into basic types in pretty much every
language and widespread tooling. (Don't be evil about being evil,
though? #vaguejokes) If you want to send JSON between a lot of
locations and want to be unambiguous in your meaning, or if you want
more than just the basic types provided, you're going to need
something more... we'll come to that in a bit.
- S-expressions: the
language of Lisp. Lispers claim you can represent anything as
s-expressions, which is true, but that's also kind of ambiguous on
its own. Capable of representing code just as well, which is
why lispers claim benefits of symmetry and "code that can write
code". However, serializing "pure data" is also perfectly possible
with s-expressions. So many variations between languages though...
it's more of a "generalized family" or, even better, a pattern of
data (and code) formats. There are some damn cool
representations of some of these other formats via sexps. Some
people get scared away by all the parens, though, which is too bad,
because (though this strays into code + data, not just data)
homoiconicity can't be beat. (Maybe Wisp can help there?)
- Canonical s-expressions:
S-expressions, with a canonical representation... cool! Most
developers don't know about them, but they were designed for public key
cryptography usage, and are still actively used there (libgcrypt uses
canonical s-expressions under the hood, for instance). No schema
system, and actually pretty much just lists and binary strings, but
the binary strings can be marked with "display hints" so systems can
know how to unpack the data into appropriate types.
- RDF and friends: The "Unicode" of graph-oriented data. Not a
serialization itself, but a specification on the conceptual modeling
of data, and you'll hear "linked data" people talking about it a
lot. A graph of "subject, predicate, object" triples. Pretty cool
once you learn what it is, though the introductory material is
really overwhelming. (Also, good luck representing ordered lists.)
However, there is no one serialization of RDF, which leads to much
confusion among many developers (including myself for a long time,
even as people explained otherwise). For example,
rdf/xml looks like XML,
but woe be upon ye who use XML tooling upon it. So, deserialize to
RDF, then deal with RDF in RDF land, then serialize again... that's
the way to go with RDF. There are saner formats than just rdf/xml; for
example, Turtle
is easy to read. The RDF community seems to get mad when you want to
interpret data as anything other than RDF, which can be very
off-putting, though the goal of a "platonic form" of data is highly
admirable. Graph-based tooling is definitely harder for
most developers to work with than tree-based tooling, but hopefully
"the jQuery of RDF" library will become available some day, and
things will be easier. Interesting stuff to learn, anyway!
- json-ld: A "linked data format" that can technically
be transformed into RDF but, unlike other forms of RDF syntax,
can often be parsed just on its own as simple JSON. So, say you want
to have JSON and keep things easy for most of your users who just
use their favorite interpreted language to extract key-value pairs
from your API. Okay, no problem for them! But suddenly you're also
consuming JSON from multiple origins, and one of them uses "run" to
say "run a mile" whereas your system uses "run" to mean "run a
program". How do you tell these apart? With json-ld you can "expand"
a JSON representation with supplied context to an unambiguous form,
and you can "compact" it down again to the terms you know and
understand in your system, leaving out those you don't. No more
executing a program for a mile!
- Microformats and RDFa: Two communities which have been notoriously and
exasperatingly at odds with each other for over a decade, so why do
I link them together? Well, both of these take the same approach of
embedding data in HTML. Great when you have HTML for your data to go
with, though not all data needs an HTML wrapper. But it's good to be
able to extract it! RDFa simply extracts to RDF, which we've
discussed plenty; Microformats extracts to its own thing. A frequent
point of contention between these groups is vocabulary, and how
to represent it. RDFa people like their vocabulary to have
canonical URIs for each term (well, that's an RDF thing, so not
surprising), while Microformats people like to document everything in a
wiki. Arguments about extensibility are frequent too... if you
want to get into that, see Amy Guy's summary of things.
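
To make the canonical s-expressions item above a bit more concrete, here's a minimal sketch of the encoding as I understand the publicly documented format: every atom is a length-prefixed byte string, lists are wrapped in parens, and a display hint is a bracketed atom in front of the value. The `Hinted` type and the `atom` and `csexp_encode` functions are names made up for this example, not part of libgcrypt or any other library.

```python
from typing import NamedTuple


class Hinted(NamedTuple):
    """Hypothetical wrapper: a byte string tagged with a display hint."""
    hint: bytes
    value: bytes


def atom(data: bytes) -> bytes:
    """Encode one byte string as <decimal length>:<raw bytes>."""
    return str(len(data)).encode("ascii") + b":" + data


def csexp_encode(node) -> bytes:
    """Encode bytes, Hinted values, and nested lists of those."""
    if isinstance(node, Hinted):
        return b"[" + atom(node.hint) + b"]" + atom(node.value)
    if isinstance(node, bytes):
        return atom(node)
    if isinstance(node, list):
        return b"(" + b"".join(csexp_encode(child) for child in node) + b")"
    raise TypeError(f"cannot encode {type(node).__name__}")


record = [b"name", Hinted(b"text/plain", b"Arthur"), [b"towel", b"yes"]]
print(csexp_encode(record))
# b'(4:name[10:text/plain]6:Arthur(5:towel3:yes))'
```

Since there's only one way to write each structure, two parties encoding the same data get byte-identical output, which is what makes this handy for hashing and signing.
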
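
For the RDF item above, it can help to see how little there is to the core model. This is a toy sketch, not an RDF library: a graph is just a set of (subject, predicate, object) tuples, and `match` is a made-up helper that filters them by pattern. The `example.org` URIs are placeholders, and the sketch glosses over the difference between URIs and literals, which real RDF tooling keeps track of.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"  # placeholder namespace for this example

# A graph is a set of (subject, predicate, object) triples.
graph = {
    (EX + "arthur", FOAF + "name", "Arthur Dent"),
    (EX + "arthur", FOAF + "knows", EX + "ford"),
    (EX + "ford", FOAF + "name", "Ford Prefect"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Walk the graph: who does Arthur know, and what are they called?
for _, _, friend in match(graph, s=EX + "arthur", p=FOAF + "knows"):
    for _, _, name in match(graph, s=friend, p=FOAF + "name"):
        print(name)  # Ford Prefect
```

Note that the set has no order at all, which is roughly where the ordered-lists headache comes from.
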
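
The "run a mile" versus "run a program" problem from the json-ld item above can be sketched in a few lines. To be clear, this is a heavily simplified stand-in for the idea, not the real json-ld expansion and compaction algorithms (those handle nesting, value objects, and much more). The two contexts and their `.example` URIs are hypothetical, as are the `expand` and `compact` helpers. One difference from the description above: this sketch leaves unknown terms spelled out as full URIs rather than dropping them, so your code can simply ignore those keys.

```python
# Hypothetical contexts: each maps a short term to a full URI.
FITNESS_CONTEXT = {"run": "http://fitness.example/vocab#run"}
SHELL_CONTEXT = {"run": "http://shell.example/vocab#run"}


def expand(doc, context):
    """Replace short keys with the URIs the given context assigns them."""
    return {context.get(key, key): value for key, value in doc.items()}


def compact(doc, context):
    """Replace URIs with this context's short terms; unknown URIs stay as-is."""
    terms = {uri: term for term, uri in context.items()}
    return {terms.get(key, key): value for key, value in doc.items()}


theirs = expand({"run": "a mile"}, FITNESS_CONTEXT)
ours = expand({"run": "ls -l"}, SHELL_CONTEXT)

print(theirs)                          # {'http://fitness.example/vocab#run': 'a mile'}
print(compact(ours, SHELL_CONTEXT))    # {'run': 'ls -l'}
print(compact(theirs, SHELL_CONTEXT))  # {'http://fitness.example/vocab#run': 'a mile'}
```

After compacting with your own context, only the "run" you actually mean comes back under the short name, and the other one stays visibly foreign.
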
Of course, there are more data formats than that. Heck, even on top of
these data formats there's a lot more out there (these days I spend a
lot of time working on ActivityStreams 2.0 related tooling, which
is just JSON with a specific structure, until you want to get fancier,
add extensions, or jump into linked data land, in which case you can
process it as json-ld). And maybe you'd also find stuff like Cap'n
Proto or Protocol Buffers to be
interesting. But the above are the formats that, today, I think are
generally most interesting or impactful upon my day to day work. I hope
this guide was interesting to you!