Volksdata

Linked Data for the People.

Purpose

Volksdata (full name: Volksdata Linkswagen) is an efficient and compact library for handling, storing, and querying RDF, i.e. Linked Data.

This library can be embedded directly in a program and store persistent data without the need to run a server. In addition, Volksdata can perform in-memory graph operations such as validation, de/serialization, boolean operations, lookup, etc.

Two graph back ends are available: an in-memory one based on hash maps, and a disk-based one built on LMDB, an extremely fast and compact embedded key-value store. Graphs can be created independently with either back end within the same program, and mixed in certain operations such as boolean set operations. Triples in the persistent back end are fully indexed and optimized for a balance of lookup speed, data compactness, and write performance (in that order of importance).

The code offers an interface to write custom back end implementations with minimal changes to the core. More documentation on the topic will follow.

Development Status

Beta. Considered feature-complete from an MVP standpoint. The API may still change significantly; each API change is marked by a new v1.0b<n> tag. Most commits (those not prefixed with "WIP") and merges to the master branch are pushed after tests pass with a clean make memcheck output (and, on a good day, a clean doxygen run).

Test coverage is not sufficient. Documentation is fairly extensive but needs reformatting. This code is being integrated in higher-level projects and is being improved as issues arise.

Portability is still under assessment. The goal is to make this code POSIX compatible. ANSI C compatibility is out of reach because of the LMDB dependency.

This is my first stab at writing a C library (coming from Python) and an unpaid fun project, so don't be surprised if you find some gross stuff.

Road Map

In Scope – Short Term

The short-term goal is to support usage in some personal projects and a workable set of features as a standalone library:

  • ✓ Handling of graphs, triples, terms
  • ✓ Memory- and disk-backed (persistent) graph storage
  • ✓ Contexts (disk-backed only)
  • ✓ Handling of blank nodes
  • ✓ Namespace prefixes
  • ✓ Validation of literal and URI terms
  • ✓ Validation of RDF triples
  • ✓ Fast graph lookup using matching patterns
  • ✓ Graph boolean operations
  • ✓ Serialization and de-serialization to/from N-Triples and N-Quads
  • ✓ Serialization and de-serialization to/from Turtle and TriG
  • ✓ Compile-time configuration of max graph size (efficiency vs. capacity)
  • ✓ Lua bindings (on a separate project)
  • ⚒ Basic command line utilities
  • ✓ Store interface for custom back end implementations

Possibly In Scope – Long Term

  • POSIX conformance
  • Binary serialization and hashing of graphs
  • Binary protocol for synchronizing remote replicas
  • True plug-in architecture for 3rd-party store implementations

Out of Scope

  • JSON-LD: overwhelmingly bloated and complicated to understand and implement.
  • SPARQL, SPARUL, GraphQL, etc: they might make sense as separate plug-ins, but this project has a minimalist query approach based on combinations of set operations supported by an external scripting language.
  • Python bindings: an initial implementation has been built and then shelved in favor of Lua, due to the complexity and slowness of the Python C API, as well as my recent personal gripes with Python.

Building

Requirements

  • It is recommended to build and run Volksdata on a Linux system. No other OS has been tested so far.
  • A C compiler. Only gcc has been tested so far.
  • LMDB libraries and headers.
  • XXHash >=0.8 libraries and headers.
  • re2c to build the RDF language lexers. Only required if the codecs are changed. Otherwise, compiled lexers are included in this git repo.
  • cinclude2dot and Graphviz for generating a dependency graph (optional).

make commands

The default make command (make lib) compiles the library. Enter make help to get an overview of the other available commands.

make install installs libraries and headers in the directories set by the $PREFIX environment variable. If it is unset, the default /usr/local prefix is used.

If LOCAL is set to anything other than 0, the library is installed under $LOCAL_PREFIX instead ($HOME/.local by default).

If DEBUG is set to anything other than 0, the library is compiled with debug symbols.

E.g.

make install DEBUG=1 LOCAL=1

installs the library with debug symbols in ~/.local.

FIXME: At the moment, make test needs DEBUG=1 to run.

Compile-Time defines (-D[...])

DEBUG: Sets debug mode: the memory map is reduced in size, logging is forced to the TRACE level, etc.

VOLK_RDF_STREAM_CHUNK_SIZE: Size of the RDF decoding buffer, i.e., the maximum size of a chunk of RDF data fed to the parser when decoding an RDF file into a graph. This should be larger than the maximum expected size of a single term in your RDF source. The default value is 8192, which is mildly conservative. If you experience parsing errors on decoding, and they happen to be on a term such as a very long string literal, try recompiling the library with a larger value.
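As a sketch, such a define could be passed through the preprocessor flags at build time. Whether the Makefile forwards a user-supplied CPPFLAGS is an assumption here; check the current Makefile for the variable it actually honors:

```shell
# Rebuild with a 32 KiB RDF decoding buffer (illustrative; assumes the
# Makefile forwards CPPFLAGS to the compiler).
make clean
make lib CPPFLAGS="-DVOLK_RDF_STREAM_CHUNK_SIZE=32768"
```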

Embedding & linking

The generated libvolksdata.so and libvolksdata.a libraries can be linked dynamically or statically to your code. Only the volksdata.h header, which recursively includes other headers in the include/volksdata directory, needs to be #included in the embedding code.

Environment variables and/or compiler options might have to be set in order to find the dynamic libraries and headers in their install locations.

For compilation and linking examples, refer to test, memtest, perftest and other targets in the current Makefile.
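A minimal dynamic-linking sketch follows. The paths are illustrative, and the -llmdb and -lxxhash flags are assumptions based on the dependencies listed above; the Makefile targets mentioned here remain the authoritative reference:

```shell
# Compile and dynamically link a program that #includes <volksdata/volksdata.h>.
# Adjust include/lib paths to wherever the library was installed.
gcc -o myapp myapp.c \
    -I"$HOME/.local/include" -L"$HOME/.local/lib" \
    -lvolksdata -llmdb -lxxhash

# If libvolksdata.so lives in a non-standard location, point the
# dynamic linker at it when running:
LD_LIBRARY_PATH="$HOME/.local/lib" ./myapp
```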

Environment Variables

VOLK_DEFAULT_CTX: string to derive the default context URI for context-aware stores when the environment is initialized. This must be a fully qualified URI. If unspecified, the value of VOLK_DEFAULT_CTX_LABEL is used.

VOLK_MDB_STORE_PATH: The file path for the persistent store back end. For production use it is strongly recommended to set this to a permanent location on the fastest storage volume available. If unset, the current directory will be used. The directory must exist.

VOLK_LOGLEVEL: A number between 0 and 5, corresponding to:

  • 0: TRACE
  • 1: DEBUG
  • 2: INFO
  • 3: WARN
  • 4: ERROR
  • 5: FATAL

If unspecified, it is set to 3.

VOLK_MDB_MAPSIZE: Virtual memory map size. It is recommended to leave this alone, except when running Valgrind or other tools that limit memory usage. The map size by itself does not preallocate any resources and is safe to increase beyond the physical capacity of the host system. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit systems.
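Putting the variables above together, a typical runtime environment might look like this (the path and URI are examples for illustration, not defaults):

```shell
# Illustrative environment for an embedding program.
export VOLK_MDB_STORE_PATH="$HOME/volksdata-store"        # directory must already exist
export VOLK_DEFAULT_CTX="http://example.org/ctx/default"  # fully qualified URI
export VOLK_LOGLEVEL=2                                    # INFO
```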

C API Documentation

Run doxygen (see Doxygen) to generate HTML documentation in docs/html.