# `lsup_rdf`

**This project is work in progress.**

Embedded RDF (and maybe later, generic graph) store and manipulation library.

## Purpose

The goal of this library is to provide efficient and compact handling of RDF data. At least a complete C API and Python bindings are planned.

This library can be thought of as SQLite or BerkeleyDB for graphs. It can be embedded directly in a program and store persistent data without running a server. In addition, `lsup_rdf` can perform in-memory graph operations such as validation, de/serialization, boolean operations, lookup, etc.

Two graph back ends are available: a memory one based on hash maps and a disk-based one based on [LMDB](https://symas.com/lmdb/), an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in order of importance).

The code offers an interface to write a custom back end implementation with minimal changes to the core. More documentation on the topic will follow.

This library was initially meant to replace the RDFLib dependency and Cython code in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to reduce code clutter and speed up RDF handling; it is now a project for an independent RDF library, but unless the contributor base expands, it will remain focused on serving Lakesuperior.

## Development Status

**Alpha.** Considered feature-complete from an MVP standpoint. The API may still change significantly. The code may not compile, or throw a fit when run. The Python API may be behind the C API and not work. Test coverage is not sufficient. Documentation is fairly extensive but needs reformatting.

This code is being integrated in the higher-level `lsup_repo` project and is being improved as issues arise.
The status will move to beta as soon as `lsup_repo` covers a significant range of `lsup_rdf` features.

This is also my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.

## Road Map

### In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set of features as a standalone library:

- ✓ Handling of graphs, triples, terms
- ✓ Memory- and disk-backed (persistent) graph storage
- ✓ Contexts (disk-backed only)
- ⚒ Handling of blank nodes
- ✓ Namespace prefixes
- ✓ Validation of literal and URI terms
- ✓ Validation of RDF triples
- ✓ Fast graph lookup using matching patterns
- ✓ Graph boolean operations
- ✓ Serialization and de-serialization to/from N-Triples and N-Quads
- ✓ Serialization and de-serialization to/from Turtle and TriG
- ✓ Compile-time configuration of max graph size (efficiency vs. capacity)
- ⚒ Python bindings
- ⚒ Basic command line utilities
- ⚒ Store interface for custom back end implementations

### Possibly In Scope – Long Term

- Binary serialization and hashing of graphs
- Binary protocol for synchronizing remote replicas
- True plug-in architecture for 3rd-party store implementations

### Likely Out of Scope

(Unless provided and maintained by external contributors)

- JSON-LD de/serialization
- SPARQL support

## Building

### Requirements

- It is recommended to build and run `lsup_rdf` on a Linux system. No other OS has been tested so far.
- A C compiler. This has only been tested with `gcc` so far.
- [LMDB](https://symas.com/lmdb/) libraries and headers.
- [XXHash](https://github.com/Cyan4973/xxHash) >=0.8 libraries and headers.
- [re2c](https://re2c.org/) to build the RDF language lexers.
- [cinclude2dot](https://www.flourish.org/cinclude2dot) and [Graphviz](https://graphviz.org/) for generating the dependency graph (optional).

### `make` commands

The default `make` command compiles the library.
Enter `make help` to get an overview of the other available commands.

`make install` installs libraries and headers in the directories set by the environment variable `$PREFIX`. If this is unset, the default `/usr/local` prefix is used.

Options to compile with debug symbols are available.

### Compile-Time defines (`-D[...]`)

`DEBUG`: Set debug mode: memory map is at reduced size, logging is forced to TRACE level, etc.

`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors on decoding, and they happen to be on a term such as a very long string literal, try recompiling the library with a larger value.

## Embedding

The generated `liblsuprdf.so` and `liblsuprdf.a` libraries can be linked dynamically or statically to your code. Only the `lsup_rdf.h` header, which recursively includes other headers in the `include` directory, needs to be `#include`d in the embedding code.

Environment variables and/or compiler options might have to be set in order to find the dynamic libraries and headers in their install locations.

For compilation and linking examples, refer to `test`, `memtest`, `perftest` and other actions in the current Makefile.

### Environment Variables

`LSUP_MDB_STORE_PATH`: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory will be used. The directory must exist.

`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:

- 0: `TRACE`
- 1: `DEBUG`
- 2: `INFO`
- 3: `WARN`
- 4: `ERROR`
- 5: `FATAL`

If unspecified, it is set to 3.

`LSUP_MDB_MAPSIZE`: Virtual memory map size.
It is recommended to leave this alone, unless you are running Valgrind or other tools that limit memory usage. The map size by itself does not preallocate any resources and is safe to increase beyond the physical capacity of the host system. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems.

### C API Documentation

Run `doxygen` (see [Doxygen](https://www.doxygen.nl/index.html)) to generate HTML documentation in `docs/html`.

### Python API Documentation

*TODO*
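
### Example Environment Setup

As a minimal sketch of the settings described under Environment Variables, a shell session might prepare the store directory and raise log verbosity before launching an embedding program. The store path and the `my_app` program name below are hypothetical placeholders and not part of `lsup_rdf`; adjust them to your own setup.

```shell
# Hypothetical store location, used here only for illustration.
# For production, pick a permanent directory on the fastest
# storage volume available.
STORE_DIR="${TMPDIR:-/tmp}/lsup_rdf_store"

# The directory must exist before the store is opened.
mkdir -p "$STORE_DIR"
export LSUP_MDB_STORE_PATH="$STORE_DIR"

# 1 = DEBUG. When unset, the level defaults to 3 (WARN).
export LSUP_LOGLEVEL=1

# LSUP_MDB_MAPSIZE is deliberately left unset, as recommended above.

# Launch the embedding program (hypothetical name):
# ./my_app
```

Exporting the variables in the launching shell keeps the configuration outside the program, so the same binary can be pointed at a different store or log level without recompiling.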