# `lsup_rdf`

**This project is work in progress.**

Embedded RDF (and maybe later, generic graph) store and manipulation library.

## Purpose

The goal of this library is to provide efficient and compact handling of RDF data. At least a complete C API and Python bindings are planned.

This library can be thought of as SQLite or BerkeleyDB for graphs. It can be embedded directly in a program and store persistent data without the need to run a server. In addition, `lsup_rdf` can perform in-memory graph operations such as validation, de/serialization, boolean operations, and lookup.

Two graph back ends are available: an in-memory one based on hash maps, and a disk-based one based on [LMDB](https://symas.com/lmdb/), an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in order of importance).

This library was initially meant to replace the RDFLib dependency and Cython code in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to reduce code clutter and speed up RDF handling. It is now an independent RDF library project, but unless the contributor base expands, it will remain focused on serving Lakesuperior.

## Development Status

**Alpha.** The API structure is not yet stable and may change radically. The code may not compile, or may throw a fit when run. Testing is minimal. At the moment this project is only intended for curious developers and researchers.

This is also my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.
## Road Map

### In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set of features as a standalone library:

- Handling of graphs, triples, and terms
- Memory- and disk-backed (persistent) graph storage
- Contexts (disk-backed only)
- Handling of blank nodes
- Namespace prefixes
- Validation of literal and URI terms
- Validation of RDF triples
- Fast graph lookup using matching patterns
- Graph boolean operations
- Serialization and de-serialization to/from N-Triples and N-Quads
- Serialization and de-serialization to/from Turtle and TriG
- Compile-time configuration of maximum graph size (efficiency vs. capacity)
- Python bindings
- Basic command line utilities

### Possibly In Scope – Long Term

- Binary serialization and hashing of graphs
- Binary protocol for synchronizing remote replicas
- Back end for massive distributed storage (possibly Ceph)
- Lua bindings

### Likely Out of Scope

(Unless provided and maintained by external contributors)

- C++ bindings
- JSON-LD de/serialization
- SPARQL queries (we'll see... will definitely need help)

## Usage

### Compile-Time Constants

`DEBUG`: Set debug mode: the memory map is reduced in size, logging is forced to `TRACE` level, etc.

### Environment Variables

`LSUP_MDB_STORE_PATH`: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory is used. The directory must exist.

`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:

- 0: `TRACE`
- 1: `DEBUG`
- 2: `INFO`
- 3: `WARN`
- 4: `ERROR`
- 5: `FATAL`

If unspecified, it is set to 3.

`LSUP_MDB_MAPSIZE`: Virtual memory map size. It is recommended to leave this unset and let the software adjust it to the hardware architecture. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems. The map size by itself does not use up any extra resources.
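As an illustration, the variables above might be set like this before launching a program that uses the library. The store path here is hypothetical; pick a permanent location on fast storage for production use.

```shell
# Hypothetical example: point the persistent store at a dedicated
# directory and raise log verbosity to INFO.
mkdir -p /tmp/lsup_store             # the directory must exist
export LSUP_MDB_STORE_PATH=/tmp/lsup_store
export LSUP_LOGLEVEL=2               # 2 = INFO
# LSUP_MDB_MAPSIZE is left unset so the library picks an
# architecture-appropriate default (1 TB on 64-bit systems).
```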
`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors while decoding, and they happen to occur on a term such as a very long string literal, try recompiling the library with a larger value.