C and Python library for handling RDF data. ALPHA

Stefano Cossu 996902aad4 Fix memory leak in LSUP_graph_store when output is NULL. 3 months ago


lsup_rdf

This project is work in progress.

Embedded RDF (and maybe later, generic graph) store and manipulation library.

Purpose

The goal of this library is to provide efficient and compact handling of RDF data. At least a complete C API and Python bindings are planned.

This library can be thought of as SQLite or BerkeleyDB for graphs. It can be embedded directly in a program and store persistent data without the need to run a server. In addition, lsup_rdf can perform in-memory graph operations such as validation, de/serialization, boolean operations, lookup, etc.

Two graph back ends are available: an in-memory one based on hash maps and a disk-based one based on LMDB, an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in order of importance).

This library was initially meant to replace the RDFLib dependency and Cython code in Lakesuperior in an effort to reduce code clutter and speed up RDF handling; it is now a project for an independent RDF library, but unless the contributor base expands, it will remain focused on serving Lakesuperior.

Development Status

Alpha. The API structure is not yet stable and may change radically. The code may fail to compile, or may throw a fit when run. Testing is minimal. At the moment this project is only intended for curious developers and researchers.

This is also my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.

Road Map

In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set of features as a standalone library:

  • Handling of graphs, triples, terms
  • Memory- and disk-backed (persistent) graph storage
  • Contexts (disk-backed only)
  • Handling of blank nodes
  • Namespace prefixes
  • Validation of literal and URI terms
  • Validation of RDF triples
  • Fast graph lookup using matching patterns
  • Graph boolean operations
  • Serialization and de-serialization to/from N-Triples and N-Quads
  • Serialization and de-serialization to/from Turtle and TriG
  • Compile-time configuration of max graph size (efficiency vs. capacity)
  • Python bindings
  • Basic command line utilities

Possibly In Scope – Long Term

  • Binary serialization and hashing of graphs
  • Binary protocol for synchronizing remote replicas
  • Backend for massive distributed storage (possibly Ceph)
  • Lua bindings

Likely Out of Scope

(Unless provided and maintained by external contributors)

  • C++ bindings
  • JSON-LD de/serialization
  • SPARQL queries (We'll see... Will definitely need help)

Building

Requirements

  • It is recommended to build and run LSUP_RDF on a Linux system. No other OS has been tested so far.
  • A C compiler. This has only been tested with gcc so far.
  • re2c and Lemon to build the RDF language parsers.
  • cinclude2dot and Graphviz for generating the dependency graph (optional).

make commands

The default make command compiles the library. Enter make help to get an overview of the other available commands.

make install installs libraries and headers under the directory set by the environment variable $PREFIX. If this is unset, the default /usr/local prefix is used.

Options to compile with debug symbols are available.

Compile-Time Constants

DEBUG: Set debug mode: memory map is at reduced size, logging is forced to TRACE level, etc.

LSUP_RDF_STREAM_CHUNK_SIZE: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors on decoding, and they happen to be on a term such as a very long string literal, try recompiling the library with a larger value.
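To decide whether the default chunk size is large enough for a given source, it can help to measure the input first. The following is a rough sketch, not part of lsup_rdf: it assumes a line-based format such as N-Triples, where the longest line is an upper bound for the longest term. The names `CHUNK_SIZE`, `longest_line` and `fits_in_chunk` are hypothetical, and `CHUNK_SIZE` only mirrors the documented default rather than reading it from the library.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumption: mirrors the documented default of 8192; this value is
 * not read from the lsup_rdf headers. */
#ifndef CHUNK_SIZE
#define CHUNK_SIZE 8192
#endif

/* Return the length in bytes of the longest line in `fp`,
 * not counting the newline character. */
size_t longest_line(FILE *fp)
{
    size_t longest = 0, current = 0;
    int c;

    while ((c = fgetc(fp)) != EOF) {
        if (c == '\n') {
            if (current > longest) longest = current;
            current = 0;
        } else {
            current++;
        }
    }
    /* Account for a final line without a trailing newline. */
    if (current > longest) longest = current;

    return longest;
}

/* Return 1 if every line is shorter than `chunk_size` bytes, 0 otherwise.
 * Reads `fp` from its current position to the end. */
int fits_in_chunk(FILE *fp, size_t chunk_size)
{
    return longest_line(fp) < chunk_size;
}
```

Calling `fits_in_chunk(fp, CHUNK_SIZE)` on an open N-Triples file gives a quick hint: a result of 0 suggests recompiling with a larger LSUP_RDF_STREAM_CHUNK_SIZE. For Turtle or TriG, where terms can span lines, this check is only indicative.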

Embedding

The generated liblsuprdf.so and liblsuprdf.a libraries can be linked dynamically or statically to your code. Only the lsup_rdf.h header, which recursively includes other headers in the include directory, needs to be #included in the embedding code.

Environment variables and/or compiler options might have to be set in order to find the dynamic libraries and headers in their install locations.

For compilation and linking examples, refer to test, memtest, perftest and other actions in the current Makefile.

Environment Variables

LSUP_MDB_STORE_PATH: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory will be used. The directory must exist.
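Since the store directory must already exist, an embedding program may want to verify it before touching the library. The sketch below is an illustration only and does not use any lsup_rdf call: `store_path` and `dir_exists` are hypothetical helpers, and treating an empty variable the same as an unset one is an assumption, not documented behavior.

```c
/* Request POSIX declarations (stat, S_ISDIR) on strict compilers. */
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <sys/stat.h>

/* Resolve the store location the way this README describes it:
 * use LSUP_MDB_STORE_PATH if set, otherwise the current directory.
 * Assumption: an empty value is treated like an unset one. */
const char *store_path(void)
{
    const char *path = getenv("LSUP_MDB_STORE_PATH");
    return (path != NULL && path[0] != '\0') ? path : ".";
}

/* Return 1 if `path` names an existing directory, 0 otherwise. */
int dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

A program could call `dir_exists(store_path())` at startup and exit with a clear message if it returns 0, rather than relying on whatever error the store back end reports.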

LSUP_LOGLEVEL: A number between 0 and 5, corresponding to:

  • 0: TRACE
  • 1: DEBUG
  • 2: INFO
  • 3: WARN
  • 4: ERROR
  • 5: FATAL

If unspecified, it is set to 3.
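The mapping above can be mirrored in an embedding program, for example to print the effective log level at startup. This is a minimal sketch and not the library's own parsing code: `parse_loglevel` and `loglevel_name` are hypothetical names, and falling back to the default on invalid or out-of-range input is an assumption.

```c
#include <stdlib.h>

/* WARN, the default stated in this README. */
#define DEFAULT_LOGLEVEL 3

static const char *const LEVEL_NAMES[] = {
    "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"
};

/* Parse a log level string such as the value of LSUP_LOGLEVEL.
 * Assumption: NULL, empty, non-numeric, or out-of-range input
 * falls back to the default. */
int parse_loglevel(const char *value)
{
    char *end;
    long level;

    if (value == NULL || *value == '\0') return DEFAULT_LOGLEVEL;

    level = strtol(value, &end, 10);
    if (*end != '\0' || level < 0 || level > 5) return DEFAULT_LOGLEVEL;

    return (int) level;
}

/* Return the name of a level between 0 and 5, or "UNKNOWN". */
const char *loglevel_name(int level)
{
    return (level >= 0 && level <= 5) ? LEVEL_NAMES[level] : "UNKNOWN";
}
```

Typical use would be `loglevel_name(parse_loglevel(getenv("LSUP_LOGLEVEL")))`.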

LSUP_MDB_MAPSIZE: Virtual memory map size. It is recommended to leave this alone. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems. The map size by itself does not use up any extra resources.

C API Documentation

TODO Almost all header files are documented. Need a doc generator.

Python API Documentation

TODO