# `lsup_rdf`

**This project is work in progress.**

Embedded RDF (and maybe later, generic graph) store and manipulation library.

## Purpose

The goal of this library is to provide efficient and compact handling of RDF data. At least a complete C API and Python bindings are planned.

This library can be thought of as SQLite or BerkeleyDB for graphs. It can be embedded directly in a program and store persistent data without the need to run a server. In addition, `lsup_rdf` can perform in-memory graph operations such as validation, de/serialization, boolean operations, and lookup.

Two graph back ends are available: an in-memory one based on hash maps, and a disk-based one based on [LMDB](https://symas.com/lmdb/), an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in order of importance).

This library was initially meant to replace the RDFLib dependency and Cython code in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to reduce code clutter and speed up RDF handling. It is now an independent RDF library project, but unless the contributor base expands, it will remain focused on serving Lakesuperior.

## Development Status

**Alpha.** The API structure is not yet stable and may change radically. The code may not compile, or may throw a fit when run. Testing is minimal. At the moment this project is only intended for curious developers and researchers.

This is also my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.
## Road Map

### In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set of features as a standalone library:

- Handling of graphs, triples, and terms
- Memory- and disk-backed (persistent) graph storage
- Contexts (disk-backed only)
- Handling of blank nodes
- Namespace prefixes
- Validation of literal and URI terms
- Validation of RDF triples
- Fast graph lookup using matching patterns
- Graph boolean operations
- Serialization and de-serialization to/from N-Triples and N-Quads
- Serialization and de-serialization to/from Turtle and TriG
- Compile-time configuration of maximum graph size (efficiency vs. capacity)
- Python bindings
- Basic command line utilities

### Possibly In Scope – Long Term

- Binary serialization and hashing of graphs
- Binary protocol for synchronizing remote replicas
- Back end for massive distributed storage (possibly Ceph)
- Lua bindings

### Likely Out of Scope

(Unless provided and maintained by external contributors)

- C++ bindings
- JSON-LD de/serialization
- SPARQL queries (we'll see... will definitely need help)

## Usage

### Compile-Time Constants

`DEBUG`: Set debug mode: the memory map is reduced in size, logging is forced to `TRACE` level, etc.

### Environment Variables

`LSUP_MDB_STORE_PATH`: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory is used. The directory must exist.

`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:

- 0: `TRACE`
- 1: `DEBUG`
- 2: `INFO`
- 3: `WARN`
- 4: `ERROR`
- 5: `FATAL`

If unspecified, it is set to 3.

`LSUP_MDB_MAPSIZE`: Virtual memory map size. It is recommended to leave this unset and let the software adjust it to the hardware architecture. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems. The map size by itself does not use up any extra resources.
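As an illustration, the variables above might be set like this before launching a program that uses the library. The store path here is hypothetical; pick a permanent location on fast storage for production use.

```shell
# Hypothetical example: point the persistent store at a dedicated
# directory and raise log verbosity to INFO.
mkdir -p /tmp/lsup_store             # the directory must exist
export LSUP_MDB_STORE_PATH=/tmp/lsup_store
export LSUP_LOGLEVEL=2               # 2 = INFO
# LSUP_MDB_MAPSIZE is left unset so the library picks an
# architecture-appropriate default (1 TB on 64-bit systems).
```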
`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors while decoding, and they happen to occur on a term such as a very long string literal, try recompiling the library with a larger value.