C and Lua RDF library. ALPHA

`lsup_rdf`

This project is work in progress.

Embedded RDF (and maybe later, generic graph) store and manipulation library.

Purpose

The goal of this library is to provide efficient and compact handling of RDF data. At least a complete C API and Lua bindings (in a separate project) are being built.

This library can be embedded directly in a program and store persistent data without the need to run a server. In addition, lsup_rdf can perform in-memory graph operations such as validation, de/serialization, boolean operations, lookup, etc.

Two graph back ends are available: an in-memory one based on hash maps and a disk-based one based on LMDB, an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program, and mixed in certain operations such as booleans. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in order of importance).

The code offers an interface to write custom back end implementations with minimal changes to the core. More documentation on the topic will follow.

Development Status

Alpha. Considered feature-complete from an MVP standpoint. The API may still change significantly. The code may not compile, or may throw a fit when run; however, most commits (those not prefixed with "WIP") are pushed only after tests pass with a clean make memcheck output.

Test coverage is not sufficient. Documentation is fairly extensive but needs reformatting. This code is being integrated in higher-level projects and is being improved as issues arise. The status will move to beta as soon as at least one dependent project covers a significant range of lsup_rdf features.

Portability is still under assessment. The goal is to make this code POSIX compatible. ANSI C compatibility is out of reach because of the LMDB dependency.

This is my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.

Road Map

In Scope – Short Term

The short-term goal is to support usage in Lakesuperior and a workable set of features as a standalone library:

✓ Handling of graphs, triples, terms
✓ Memory- and disk-backed (persistent) graph storage
✓ Contexts (disk-backed only)
⚒ Handling of blank nodes
✓ Namespace prefixes
✓ Validation of literal and URI terms
✓ Validation of RDF triples
✓ Fast graph lookup using matching patterns
✓ Graph boolean operations
✓ Serialization and de-serialization to/from N-Triples and N-Quads
✓ Serialization and de-serialization to/from Turtle and TriG
✓ Compile-time configuration of max graph size (efficiency vs. capacity)
⚒ Lua bindings (on a separate project)
⚒ Basic command line utilities
✓ Store interface for custom back end implementations
⚒ POSIX conformance

Possibly In Scope – Long Term

Binary serialization and hashing of graphs
Binary protocol for synchronizing remote replicas
True plug-in architecture for 3rd-party store implementations

Out of Scope

JSON-LD: overwhelmingly bloated and complicated to understand and implement.
SPARQL, SPARUL, GraphQL, etc.: they might make sense as separate plug-ins, but this project has a minimalist query approach based on combinations of set operations supported by a scripting language.
Python bindings: an initial implementation has been built and then shelved in favor of Lua, due to the complexity and slowness of the Python C API.

Building

Requirements

It is recommended to build and run lsup_rdf on a Linux system. No other OS has been tested so far.
A C compiler. This has only been tested with gcc so far.
LMDB libraries and headers.
XXHash >=0.8 libraries and headers.
re2c to build the RDF language lexers.
cinclude2dot and Graphviz for generating the dependency graph (optional).

`make` commands

The default make command compiles the library. Enter make help to get an overview of the other available commands.

make install and make debug_install install libraries and headers in the directories set by the environment variable $PREFIX. If this is unset, the default /usr/local prefix is used.

Options to compile with debug symbols are available.

Compile-Time defines (`-D[...]`)

DEBUG: Set debug mode: memory map is at reduced size, logging is forced to TRACE level, etc.

LSUP_RDF_STREAM_CHUNK_SIZE: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors on decoding, and they happen to be on a term such as a very long string literal, try recompiling the library with a larger value.
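
The usual C pattern behind an overridable define like this one can be sketched as follows. This is only an illustration, not the library's actual source: the helper names `term_fits` and `chunks_needed` are hypothetical, and only the macro name and its default of 8192 come from the description above.

```c
#include <stddef.h>

/* Use the value passed with -DLSUP_RDF_STREAM_CHUNK_SIZE=... if present,
 * otherwise fall back to the documented default of 8192. */
#ifndef LSUP_RDF_STREAM_CHUNK_SIZE
#define LSUP_RDF_STREAM_CHUNK_SIZE 8192
#endif

/* Hypothetical helper: a term is safe to decode only if it is strictly
 * smaller than the chunk buffer, as recommended above. */
static int term_fits(size_t term_len)
{
    return term_len < (size_t)LSUP_RDF_STREAM_CHUNK_SIZE;
}

/* Hypothetical helper: number of chunks needed to feed `len` bytes
 * of RDF data to a parser, rounding up. */
static size_t chunks_needed(size_t len)
{
    size_t chunk = (size_t)LSUP_RDF_STREAM_CHUNK_SIZE;
    return (len + chunk - 1) / chunk;
}
```

With the default, a 10000-byte string literal would not fit in a single chunk, which is the situation where recompiling with a larger value is suggested.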

Embedding & linking

The generated liblsuprdf.so and liblsuprdf.a libraries can be linked dynamically or statically to your code. Only the lsup_rdf.h header, which recursively includes other headers in the include directory, needs to be #included in the embedding code.

Environment variables and/or compiler options might have to be set in order to find the dynamic libraries and headers in their install locations.

For compilation and linking examples, refer to test, memtest, perftest and other targets in the current Makefile.

Environment Variables

LSUP_MDB_STORE_PATH: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory will be used. The directory must exist.
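
Because the directory must already exist, an embedding program may want to check it before opening a store. The following is a minimal sketch, assuming the variable names a directory; `resolve_store_path` and `is_existing_dir` are hypothetical helpers written for this example and are not part of the lsup_rdf API.

```c
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: return the configured path, or "." (the current
 * directory) when the value is unset or empty. */
static const char *resolve_store_path(const char *value)
{
    return (value != NULL && *value != '\0') ? value : ".";
}

/* Hypothetical helper: return 1 if `path` exists and is a directory. */
static int is_existing_dir(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Hypothetical helper: 1 if the store location taken from the
 * environment is usable, 0 otherwise. */
static int store_path_ok(void)
{
    return is_existing_dir(resolve_store_path(getenv("LSUP_MDB_STORE_PATH")));
}
```

A program could call `store_path_ok()` at startup and report a clear error if it returns 0, rather than relying on a failure from the storage layer.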

LSUP_LOGLEVEL: A number between 0 and 5, corresponding to:

0: TRACE
1: DEBUG
2: INFO
3: WARN
4: ERROR
5: FATAL

If unspecified, it is set to 3.
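
The mapping above can be sketched in C as follows. The enum and function names are hypothetical and do not come from the lsup_rdf headers; falling back to WARN on an invalid value is also an assumption, since only the unset case is documented.

```c
#include <stdlib.h>

/* Hypothetical names mirroring the documented numeric levels. */
enum { LOG_TRACE = 0, LOG_DEBUG, LOG_INFO, LOG_WARN, LOG_ERROR, LOG_FATAL };

/* Parse a log level string. Returns LOG_WARN (3) when the value is
 * missing, and (by assumption) also when it is not a number in 0..5. */
static int parse_loglevel(const char *value)
{
    if (value == NULL || *value == '\0') return LOG_WARN;

    char *end;
    long level = strtol(value, &end, 10);
    if (*end != '\0' || level < LOG_TRACE || level > LOG_FATAL)
        return LOG_WARN;

    return (int)level;
}

/* Read the level from the LSUP_LOGLEVEL environment variable. */
static int loglevel_from_env(void)
{
    return parse_loglevel(getenv("LSUP_LOGLEVEL"));
}
```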

LSUP_MDB_MAPSIZE: Virtual memory map size. It is recommended to leave this alone, unless you are running Valgrind or other tools that limit memory usage. The map size by itself does not preallocate any resources and is safe to increase beyond the physical capacity of the host system. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems.
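
One way to pick a platform-dependent default and allow an override is sketched below. Several things here are assumptions made for illustration: that the variable holds a plain decimal byte count, that the 64-bit default means 2^40 bytes, and that the 32-bit default is the largest value a `size_t` can hold (just under 4 GiB). The function names are hypothetical and are not the library's API.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical default: 2^40 bytes on 64-bit targets, otherwise the
 * largest size_t value, which is just under 4 GiB on 32-bit targets. */
static size_t default_mapsize(void)
{
#if SIZE_MAX > 0xFFFFFFFFu
    return (size_t)1 << 40;
#else
    return SIZE_MAX;
#endif
}

/* Hypothetical resolver: parse a decimal byte count, falling back to the
 * default when the value is missing, malformed, zero, or out of range. */
static size_t resolve_mapsize(const char *value)
{
    if (value == NULL || value[0] < '0' || value[0] > '9')
        return default_mapsize();

    char *end;
    errno = 0;
    unsigned long long n = strtoull(value, &end, 10);
    if (errno != 0 || *end != '\0' || n == 0 || n > SIZE_MAX)
        return default_mapsize();

    return (size_t)n;
}
```

A caller would pass `getenv("LSUP_MDB_MAPSIZE")` to `resolve_mapsize`. Under Valgrind, for instance, a much smaller value such as 1048576 could be exported before running the program.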

C API Documentation

Run doxygen (see Doxygen) to generate HTML documentation in docs/html.
