@@ -0,0 +1,160 @@
+# `lsup_rdf`
+
+**This project is a work in progress.**
+
+Embedded RDF (and maybe later, generic graph) store and manipulation library.
+
+## Purpose
+
+The goal of this library is to provide efficient and compact handling of RDF
+data. At least a complete C API and Python bindings are planned.
+
+This library can be thought of as SQLite or BerkeleyDB for graphs. It can be
+embedded directly in a program and store persistent data without the need to
+run a server. In addition, `lsup_rdf` can perform in-memory graph operations
+such as validation, de/serialization, boolean operations, lookup, etc.
+
+Two graph back ends are available: an in-memory one based on hash maps and a
+disk-based one based on [LMDB](https://symas.com/lmdb/), an extremely fast and
+compact embedded key-value store. Graphs can be created independently with
+either back end within the same program. Triples in the persistent back end are
+fully indexed and optimized for a balance of lookup speed, data compactness,
+and write performance (in order of importance).
+
+This library was initially meant to replace the RDFLib dependency and Cython
+code in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to
+reduce code clutter and speed up RDF handling; it is now a project for an
+independent RDF library, but unless the contributor base expands, it will
+remain focused on serving Lakesuperior.
+
+
+## Development Status
+
+**Alpha.** The API structure is not yet stable and may change radically. The
+code may not compile, or may throw a fit when run. Testing is minimal. At the
+moment this project is only intended for curious developers and researchers.
+
+This is also my first stab at writing a C library (coming from Python) and an
+unpaid fun project, so don't be surprised if you find some gross stuff.
+
+
+## Road Map
+
+### In Scope – Short Term
+
+The short-term goal is to support usage in Lakesuperior and a workable set
+of features as a standalone library:
+
+- Handling of graphs, triples, terms
+- Memory- and disk-backed (persistent) graph storage
+- Contexts (disk-backed only)
+- Handling of blank nodes
+- Namespace prefixes
+- Validation of literal and URI terms
+- Validation of RDF triples
+- Fast graph lookup using matching patterns
+- Graph boolean operations
+- Serialization and de-serialization to/from N-Triples and N-Quads
+- Serialization and de-serialization to/from Turtle and TriG
+- Compile-time configuration of max graph size (efficiency vs. capacity)
+- Python bindings
+- Basic command line utilities
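
To illustrate what "lookup using matching patterns" in the list above usually means for a triple store, here is a minimal sketch. It is not the `lsup_rdf` API: the `Triple` struct and the `term_matches`, `triple_matches`, and `count_matches` functions are hypothetical names made up for this example. Terms are plain C strings, and a `NULL` slot in the pattern acts as a wildcard.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical toy triple: subject, predicate and object as C strings. */
typedef struct {
    const char *s;
    const char *p;
    const char *o;
} Triple;

/* A NULL pattern slot is a wildcard that matches any term. */
int term_matches(const char *pattern, const char *term)
{
    return pattern == NULL || strcmp(pattern, term) == 0;
}

/* A triple matches when all three of its terms match the pattern. */
int triple_matches(const Triple *pattern, const Triple *t)
{
    return term_matches(pattern->s, t->s)
        && term_matches(pattern->p, t->p)
        && term_matches(pattern->o, t->o);
}

/* Linear scan: count the triples in `data` that match `pattern`. */
size_t count_matches(const Triple *data, size_t n, const Triple *pattern)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (triple_matches(pattern, &data[i])) {
            count++;
        }
    }
    return count;
}
```

A pattern of `{"urn:a", NULL, NULL}` selects every triple whose subject is `urn:a`. The linear scan only shows the matching semantics; an indexed store would presumably answer the same query without visiting every triple.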
+
+### Possibly In Scope – Long Term
+
+- Binary serialization and hashing of graphs
+- Binary protocol for synchronizing remote replicas
+- Back end for massive distributed storage (possibly Ceph)
+- Lua bindings
+
+### Likely Out of Scope
+
+(Unless provided and maintained by external contributors)
+
+- C++ bindings
+- JSON-LD de/serialization
+- SPARQL queries (We'll see... Will definitely need help)
+
+## Building
+
+### Requirements
+
+- It is recommended to build and run LSUP_RDF on a Linux system. No other
+  OS has been tested so far.
+- A C compiler. Only `gcc` has been tested so far.
+- [re2c](https://re2c.org/) and [Lemon](https://www.sqlite.org/lemon.html) to
+  build the RDF language parsers.
+- [cinclude2dot](https://www.flourish.org/cinclude2dot) and
+  [Graphviz](https://graphviz.org/) for generating the dependency graph
+  (optional).
+
+
+### `make` Commands
+
+The default `make` command compiles the library. Enter `make help` to get an
+overview of the other available commands.
+
+`make install` installs libraries and headers under the prefix set by the
+environment variable `$PREFIX`. If this is unset, the default `/usr/local`
+prefix is used.
+
+Options to compile with debug symbols are available.
+
+
+### Compile-Time Constants
+
+`DEBUG`: Enables debug mode: the memory map is reduced in size, logging is
+forced to `TRACE` level, etc.
+
+`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the
+maximum size of a chunk of RDF data fed to the parser when decoding an RDF file
+into a graph. This should be larger than the maximum expected size of a single
+term in your RDF source. The default value is 8192, which is mildly
+conservative. If you experience parsing errors on decoding, and they happen to
+be on a term such as a very long string literal, try recompiling the library
+with a larger value.
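
One common way to make a constant such as `LSUP_RDF_STREAM_CHUNK_SIZE` overridable at build time is an `#ifndef` guard, so that a `-D` flag on the compiler command line takes precedence over the default. The sketch below assumes that pattern; whether the library's headers are written this way, and how such a flag is passed through the Makefile, is not stated in this document. `term_fits_in_chunk` is a hypothetical helper that restates the sizing rule above (the chunk should be larger than the largest single term).

```c
#include <stddef.h>

/*
 * Default taken from the README (8192). With this guard, compiling with
 * e.g. -DLSUP_RDF_STREAM_CHUNK_SIZE=32768 replaces the default.
 * Assumption: the real headers may define the constant differently.
 */
#ifndef LSUP_RDF_STREAM_CHUNK_SIZE
#define LSUP_RDF_STREAM_CHUNK_SIZE 8192
#endif

/* Hypothetical check: is the chunk strictly larger than a term? */
int term_fits_in_chunk(size_t term_len)
{
    return term_len < (size_t)LSUP_RDF_STREAM_CHUNK_SIZE;
}
```

With the default value, a 100-character literal passes the check, while a literal of 8192 characters or more would call for a larger chunk size.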
+
+## Embedding
+
+The generated `liblsuprdf.so` and `liblsuprdf.a` libraries can be linked
+dynamically or statically, respectively, to your code. Only the `lsup_rdf.h`
+header, which recursively includes other headers in the `include` directory,
+needs to be `#include`d in the embedding code.
+
+Environment variables and/or compiler options might have to be set in order to
+find the dynamic libraries and headers in their install locations.
+
+For compilation and linking examples, refer to `test`, `memtest`, `perftest`
+and other targets in the current Makefile.
+
+
+### Environment Variables
+
+`LSUP_MDB_STORE_PATH`: The path of the directory used by the persistent store
+back end. For production use it is strongly recommended to set this to a
+permanent location on the fastest storage volume available. If unset, the
+current directory will be used. The directory must exist.
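
Because the store directory must already exist, an embedding program may want to check it before opening a persistent graph. The following is a minimal sketch of such a pre-flight check on a POSIX system. `store_path` and `store_dir_exists` are hypothetical helpers, not part of `lsup_rdf`, and the fallback to `"."` only mirrors the behavior described above.

```c
/* Request POSIX declarations (stat, S_ISDIR) under strict C modes. */
#define _POSIX_C_SOURCE 200809L

#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: the configured store path, or "." when unset. */
const char *store_path(void)
{
    const char *path = getenv("LSUP_MDB_STORE_PATH");
    return (path != NULL && path[0] != '\0') ? path : ".";
}

/* Hypothetical helper: nonzero if `path` is an existing directory. */
int store_dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

Calling `store_dir_exists(store_path())` at start-up lets the program report a missing directory in its own terms instead of surfacing a lower-level storage error.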
+
+`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:
+
+- 0: `TRACE`
+- 1: `DEBUG`
+- 2: `INFO`
+- 3: `WARN`
+- 4: `ERROR`
+- 5: `FATAL`
+
+If unspecified, it defaults to 3 (`WARN`).
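
The mapping above can be sketched in a few lines of C. This is an illustration of the documented behavior, not the library's own code: the `SKETCH_LOG_*` names and `log_level_from_env` are hypothetical, and falling back to `WARN` for malformed or out-of-range values is an assumption (the text above only specifies the default for an unset variable).

```c
#include <stdlib.h>

/* Numeric levels as listed above; the identifiers are hypothetical. */
enum {
    SKETCH_LOG_TRACE = 0,
    SKETCH_LOG_DEBUG = 1,
    SKETCH_LOG_INFO  = 2,
    SKETCH_LOG_WARN  = 3,
    SKETCH_LOG_ERROR = 4,
    SKETCH_LOG_FATAL = 5
};

/* Read LSUP_LOGLEVEL; use WARN (3) when unset, empty, or invalid. */
int log_level_from_env(void)
{
    const char *val = getenv("LSUP_LOGLEVEL");
    if (val == NULL || *val == '\0') {
        return SKETCH_LOG_WARN;
    }

    char *end;
    long level = strtol(val, &end, 10);
    if (*end != '\0' || level < SKETCH_LOG_TRACE || level > SKETCH_LOG_FATAL) {
        return SKETCH_LOG_WARN;  /* Assumption: invalid input uses default. */
    }
    return (int)level;
}
```

For example, running a program with `LSUP_LOGLEVEL=1` in its environment makes this function return the `DEBUG` level.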
+
+`LSUP_MDB_MAPSIZE`: Virtual memory map size. It is recommended to leave this
+alone. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit
+systems. The map size by itself does not use up any extra resources.
+
+
+### C API Documentation
+
+*TODO* Almost all header files are documented. Need a doc generator.
+
+
+### Python API Documentation
+
+*TODO*