Content Addressable Archival System

Timothy Rice aeace7f8da Add License 2 years ago
LICENSE aeace7f8da Add License 2 years ago
README.md f05db2b1c7 Give a little usage 2 years ago
carcass f05db2b1c7 Give a little usage 2 years ago
carcass-absorb cf51dc2bdb Ignore the new md5file when collecting md5sums 2 years ago
carcass-ls a8a8b94bf4 Initial commit 2 years ago

README.md

Carcass: Content Addressable Archival System

It is widely known that Git can store binary data but is not optimized for it. Less widely known, but more important if you deal with many binary files, is that Git develops significant performance problems when you add large archives of binary data. The more binary data you add, the more noticeable the problems become, until the repository is almost impossible to work with.

Yet Git follows certain principles that could be useful for binary archives, provided you begin with a different set of optimization priorities.

Carcass borrows certain principles from Git, such as offline-first storage of objects named after their own hash. However, it is focused on the following types of data:

  • Relatively immutable files: Once you have a media file, you usually don't edit it except maybe to update tags. Even if you transcode the file to another format, the typical workflow gives you a new file rather than an edit of the old file.
  • Pre-compressed files: Whereas Git is optimized for plain text and thus benefits from compressing everything with zlib, media is usually already stored in some compressed format such as flac, jpeg or h264.
  • Entire collection metadata: Media is usually organized into directory-based collections, with some metadata in the collection root.
  • Disjoint collections: Whereas Git has the notion of a working directory containing checked out files, Carcass is focused on disjoint collections. Since collections are disjoint, they don't clobber each other, so you shouldn't need to worry about what is currently extracted from the archive. Therefore, there is no notion of "checking out" a branch or version. Just directly untar the collection you need when you need it.
  • Large files: Because the files are large, you probably don't want everything extracted at once. Instead, you'll untar what you need when you need it (possibly into a tmpfs) and delete it when finished. Since Carcass doesn't track what has been extracted, it also doesn't care if you delete the extracted copy.
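The extract-then-delete workflow from the last bullet can be sketched in a few lines of Python. This is only an illustration, not part of Carcass: the `extracted` helper and its `scratch_root` parameter are hypothetical names, and the archive is assumed to be an uncompressed tar that you trust.

```python
import shutil
import tarfile
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def extracted(archive_path, scratch_root=None):
    """Untar an uncompressed archive into a scratch directory, then delete it.

    scratch_root is a hypothetical parameter: pointing it at a tmpfs mount
    keeps the extracted files in memory instead of on disk.
    """
    workdir = Path(tempfile.mkdtemp(prefix="carcass-", dir=scratch_root))
    try:
        # Mode "r:" reads a plain tar and refuses compressed input.
        with tarfile.open(archive_path, mode="r:") as tar:
            # Use the safer "data" extraction filter where Python provides it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(workdir, **kwargs)
        yield workdir
    finally:
        # Nothing tracks the extracted copy, so it can simply be removed.
        shutil.rmtree(workdir, ignore_errors=True)
```

Used as `with extracted(path_to_object) as workdir: ...`, the collection exists only for the duration of the block.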

These considerations suggest a system built around one uncompressed tar archive per collection directory. Thanks to the flat structure, the lack of extra compression, and no expectation of diffs, such a system can be not only better than Git for binary files, but also much simpler to implement.
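One consequence of naming objects after their own hash in a flat store is that integrity can be checked without any index: re-hash each object and compare the result with its file name. The Python sketch below illustrates the idea. It assumes SHA-256 and a flat `.carcass` directory of files named by hex digest; the hash function Carcass actually uses is not stated here, and `file_digest` and `corrupted_objects` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never need to fit in memory."""
    digest = hashlib.sha256()  # assumption: the real hash function may differ
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corrupted_objects(store=".carcass"):
    """Return the names of objects whose content no longer matches their name."""
    return [
        obj.name
        for obj in sorted(Path(store).iterdir())
        if obj.is_file() and file_digest(obj) != obj.name
    ]
```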

Note that Carcass is currently a proof-of-concept prototype. Don't expect it to be bug-free or feature-rich.

Low priorities:

  • Distribution: just use rsync.
  • History: the date you added an archive to Carcass is probably less interesting to you than the history of the files inside the archive. Carcass takes no responsibility for the latter, and is relatively uninterested in the former.

Usage

Like Git, the carcass command itself is essentially a wrapper around any number of subcommands, which are discovered on your PATH as executables named carcass-<subcmd>. At the moment, there are two subcommands:
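The PATH-based discovery described above can be sketched as follows. This is a hedged Python illustration of the dispatch pattern, not the actual carcass script; the `dispatch` function, its messages, and its exit codes are assumptions.

```python
import shutil
import subprocess
import sys


def dispatch(argv):
    """Run `carcass-<subcmd>` from PATH, passing the remaining arguments through."""
    if not argv:
        print("usage: carcass <subcmd> [args...]", file=sys.stderr)
        return 2
    # shutil.which searches PATH the same way a shell would.
    executable = shutil.which(f"carcass-{argv[0]}")
    if executable is None:
        print(f"carcass: '{argv[0]}' is not a carcass command", file=sys.stderr)
        return 127
    # The subcommand's exit status becomes the wrapper's exit status.
    return subprocess.run([executable, *argv[1:]]).returncode
```

A wrapper script would call `sys.exit(dispatch(sys.argv[1:]))`. With this design, adding a subcommand only requires dropping a new carcass-<subcmd> executable somewhere on PATH.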

  • carcass ls lists every object under .carcass along with the root directory inside each archive.
  • carcass absorb <directory> tars up the target directory, stores the archive as an object under .carcass, and deletes the original directory.
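The two subcommands can be approximated in Python as shown below. This is a sketch under stated assumptions rather than a description of the real scripts: SHA-256 as the hash, a flat `.carcass` directory, the `incoming.tar` staging name, and the function names `absorb` and `list_objects` are all hypothetical.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def absorb(directory, store=".carcass"):
    """Tar a directory without compression, store it by hash, remove the original."""
    source = Path(directory)
    store_dir = Path(store)
    store_dir.mkdir(parents=True, exist_ok=True)
    staging = store_dir / "incoming.tar"  # hypothetical temporary name
    with tarfile.open(staging, mode="w:") as tar:  # "w:" writes a plain tar
        tar.add(source, arcname=source.name)
    digest = hashlib.sha256()  # assumption: the real hash function may differ
    with open(staging, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    target = store_dir / digest.hexdigest()
    staging.replace(target)  # rename the archive to its content hash
    shutil.rmtree(source)    # the original is deleted only after storing
    return target


def list_objects(store=".carcass"):
    """Yield (object name, root directory inside the archive) pairs."""
    for obj in sorted(Path(store).iterdir()):
        with tarfile.open(obj, mode="r:") as tar:
            first = tar.next()  # first member; None for an empty archive
        root = first.name.split("/")[0] if first else ""
        yield obj.name, root
```

Note that the hash here names the bytes of the tar archive, which include file metadata such as modification times, so absorbing the same files twice will not necessarily produce the same object name.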