Re-add TODO and README.

Stefano Cossu 2 years ago
parent
commit
45e395a356
2 changed files with 207 additions and 0 deletions
  1. 160 0
      README.md
  2. 47 0
      TODO.md

+ 160 - 0
README.md

@@ -0,0 +1,160 @@
+# `lsup_rdf`
+
+**This project is work in progress.**
+
+Embedded RDF (and maybe later, generic graph) store and manipulation library.
+
+## Purpose
+
+The goal of this library is to provide efficient and compact handling of RDF
+data. At least a complete C API and Python bindings are planned.
+
+This library can be thought of as SQLite or BerkeleyDB for graphs. It can be
+embedded directly in a program and store persistent data without the need to
+run a server. In addition, `lsup_rdf` can perform in-memory graph
+operations such as validation, de/serialization, boolean operations, lookup,
+etc.
+
+Two graph back ends are available: an in-memory one based on hash maps and a
+disk-based one built on [LMDB](https://symas.com/lmdb/), an extremely fast and
+compact embedded key-value store. Graphs can be created independently with
+either back end within the same program. Triples in the persistent back end are
+fully indexed and optimized for a balance of lookup speed, data compactness,
+and write performance (in order of importance).
+
+This library was initially meant to replace the RDFLib dependency and Cython code
+in [Lakesuperior](https://notabug.org/scossu/lakesuperior) in an effort to
+reduce code clutter and speed up RDF handling; it is now a project for an
+independent RDF library, but unless the contributor base expands, it will
+remain focused on serving Lakesuperior.
+
+
+## Development Status
+
+**Alpha.** The API structure is not yet stable and may change radically. The
+code may not compile, or may throw a fit when run. Testing is minimal. At the
+moment this project is only intended for curious developers and researchers.
+
+This is also my first stab at writing a C library (coming from Python) and an
+unpaid fun project, so don't be surprised if you find some gross stuff.
+
+
+## Road Map
+
+### In Scope – Short Term
+
+The short-term goal is to support usage in Lakesuperior and a workable set
+of features as a standalone library:
+
+- Handling of graphs, triples, terms
+- Memory- and disk-backed (persistent) graph storage
+- Contexts (disk-backed only)
+- Handling of blank nodes
+- Namespace prefixes
+- Validation of literal and URI terms
+- Validation of RDF triples
+- Fast graph lookup using matching patterns
+- Graph boolean operations
+- Serialization and de-serialization to/from N-Triples and N-Quads
+- Serialization and de-serialization to/from Turtle and TriG
+- Compile-time configuration of max graph size (efficiency vs. capacity)
+- Python bindings
+- Basic command line utilities
+
+### Possibly In Scope – Long Term
+
+- Binary serialization and hashing of graphs
+- Binary protocol for synchronizing remote replicas
+- Backend for massive distributed storage (possibly Ceph)
+- Lua bindings
+
+### Likely Out of Scope
+
+(Unless provided and maintained by external contributors)
+
+- C++ bindings
+- JSON-LD de/serialization
+- SPARQL queries (We'll see... Will definitely need help)
+
+## Building
+
+### Requirements
+
+- It is recommended to build and run `lsup_rdf` on a Linux system. No other
+  OS has been tested so far.
+- A C compiler. This has only been tested with `gcc` so far.
+- [re2c](https://re2c.org/) and [Lemon](https://www.sqlite.org/lemon.html) to
+  build the RDF language parsers.
+- [cinclude2dot](https://www.flourish.org/cinclude2dot) and
+  [Graphviz](https://graphviz.org/) for generating the dependency graph
+  (optional).
+
+
+### `make` commands
+
+The default `make` command compiles the library. Enter `make help` to get an
+overview of the other available commands.
+
+`make install` installs libraries and headers under the directory set by the
+environment variable `$PREFIX`. If this is unset, the default `/usr/local`
+prefix is used.
+
+Options to compile with debug symbols are available.
+
+
+### Compile-Time Constants
+
+`DEBUG`: Set debug mode: memory map is at reduced size, logging is forced to
+TRACE level, etc.
+
+`LSUP_RDF_STREAM_CHUNK_SIZE`: Size of the RDF decoding buffer, i.e., the maximum
+size of a chunk of RDF data fed to the parser when decoding an RDF file into a
+graph. This should be larger than the maximum expected size of a single term in
+your RDF source. The default value is 8192, which is mildly conservative. If you
+experience parsing errors on decoding, and they happen to be on a term such as a
+very long string literal, try recompiling the library with a larger value.
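As a rough illustration of how a compile-time constant like this typically behaves, here is a minimal sketch, not the library's actual code. Only the macro name and its default of 8192 come from the text above; `chunk_fn`, `count_bytes`, and `feed_in_chunks` are hypothetical names invented for the example.

```c
#include <stdio.h>
#include <stddef.h>

/* Can be overridden at build time, e.g. with
 * `-DLSUP_RDF_STREAM_CHUNK_SIZE=65536`; the fallback mirrors the documented
 * default of 8192. */
#ifndef LSUP_RDF_STREAM_CHUNK_SIZE
#define LSUP_RDF_STREAM_CHUNK_SIZE 8192
#endif

/* Hypothetical callback type: receives one chunk of raw RDF bytes. */
typedef void (*chunk_fn)(const char *buf, size_t len, void *ctx);

/* Hypothetical sample callback: adds the chunk length to a size_t total. */
static void
count_bytes(const char *buf, size_t len, void *ctx)
{
    (void) buf;
    *(size_t *) ctx += len;
}

/* Hypothetical helper: read `fh` in chunks of at most
 * LSUP_RDF_STREAM_CHUNK_SIZE bytes, passing each chunk to `feed`.
 * Returns the number of chunks read. */
static size_t
feed_in_chunks(FILE *fh, chunk_fn feed, void *ctx)
{
    char buf[LSUP_RDF_STREAM_CHUNK_SIZE];
    size_t n, chunks = 0;

    while ((n = fread(buf, 1, sizeof buf, fh)) > 0) {
        feed(buf, n, ctx);
        chunks++;
    }
    return chunks;
}
```

In a setup like this, the constant caps the size of every chunk handed to the consumer. That is consistent with the advice above to keep it larger than the longest expected term.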
+
+## Embedding
+
+The generated `liblsuprdf.so` and `liblsuprdf.a` libraries can be linked
+dynamically or statically to your code. Only the `lsup_rdf.h` header, which
+recursively includes other headers in the `include` directory, needs to be
+`#include`d in the embedding code.
+
+Environment variables and/or compiler options might have to be set in order to
+find the dynamic libraries and headers in their install locations.
+
+For compilation and linking examples, refer to the `test`, `memtest`,
+`perftest`, and other targets in the current Makefile.
+
+
+### Environment Variables
+
+`LSUP_MDB_STORE_PATH`: The file path for the persistent store back end. For
+production use it is strongly recommended to set this to a permanent location
+on the fastest storage volume available. If unset, the current directory will
+be used. The directory must exist.
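An embedding program may want to check the store location before opening a persistent graph. The sketch below shows one way to do that with POSIX calls only. `resolve_store_path` and `is_existing_dir` are hypothetical helpers, not part of the `lsup_rdf` API, and treating an empty value the same as an unset one is an assumption.

```c
/* Request POSIX declarations even under strict ISO C compiler modes. */
#define _POSIX_C_SOURCE 200809L

#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Hypothetical helper: return the value of LSUP_MDB_STORE_PATH, or the
 * current directory when the variable is unset (or empty, by assumption). */
static const char *
resolve_store_path(void)
{
    const char *path = getenv("LSUP_MDB_STORE_PATH");
    return (path != NULL && path[0] != '\0') ? path : ".";
}

/* Hypothetical helper: true only if `path` exists and is a directory. */
static bool
is_existing_dir(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

Calling `is_existing_dir(resolve_store_path())` at startup gives an early, readable failure if the directory is missing, before the store is ever opened.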
+
+`LSUP_LOGLEVEL`: A number between 0 and 5, corresponding to:
+
+- 0: `TRACE`
+- 1: `DEBUG`
+- 2: `INFO`
+- 3: `WARN`
+- 4: `ERROR`
+- 5: `FATAL`
+
+If unspecified, it is set to 3.
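The mapping above can be sketched in C as follows. This only illustrates the documented values and default; `parse_loglevel`, `loglevel_name`, and `SKETCH_DEFAULT_LOGLEVEL` are hypothetical names. Falling back to the default on malformed or out-of-range input is an assumption, as the library may handle such values differently.

```c
#include <stdlib.h>

/* Documented default: 3 (WARN). */
#define SKETCH_DEFAULT_LOGLEVEL 3

/* Level names indexed by their documented numeric value. */
static const char *const SKETCH_LEVEL_NAMES[] = {
    "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"
};

/* Hypothetical helper: parse a value as it might appear in LSUP_LOGLEVEL.
 * Returns the default when the value is NULL, empty, non-numeric, or outside
 * the 0-5 range (assumed behavior). */
static int
parse_loglevel(const char *value)
{
    char *end;
    long n;

    if (value == NULL || *value == '\0') return SKETCH_DEFAULT_LOGLEVEL;
    n = strtol(value, &end, 10);
    if (*end != '\0' || n < 0 || n > 5) return SKETCH_DEFAULT_LOGLEVEL;
    return (int) n;
}

/* Hypothetical helper: name for a level, or "UNKNOWN" if out of range. */
static const char *
loglevel_name(int level)
{
    return (level >= 0 && level <= 5) ? SKETCH_LEVEL_NAMES[level] : "UNKNOWN";
}
```

A caller would typically pass `getenv("LSUP_LOGLEVEL")` straight to `parse_loglevel`, which also covers the unset case.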
+
+`LSUP_MDB_MAPSIZE`: Virtual memory map size. It is recommended to leave this
+alone. By default, it is set to 1 TB for 64-bit systems and 4 GB for 32-bit
+systems. The map size by itself does not use up any extra resources.
+
+
+### C API Documentation
+
+*TODO*  Almost all header files are documented. Need a doc generator.
+
+
+### Python API Documentation
+
+*TODO*

+ 47 - 0
TODO.md

@@ -0,0 +1,47 @@
+# Quick TODO list
+
+*P* = pending; *W* = working on it; *D* = done.
+
+## Critical for MVP
+
+- *D* LMDB back end
+- *D* Hash table back end
+- *D* Namespace manager
+- *D* N3 serialization / deserialization
+- *D* Environment
+- *D* Better error handling
+- *D* Logging
+- *D* Store graph
+- *D* Python bindings
+    - *D* Basic module framework
+    - *D* term, triple, graph modules
+    - *D* Codec integration
+    - *D* Graph remove and lookup ops
+    - *D* Namespace module
+    - *D* Tests (basic)
+    - *D* Subclass term types
+- *D* Namespaced IRIs
+- *D* Relative IRIs
+- *D* Flexible store interface
+- *D* Transaction control
+- *W* Turtle serialization / deserialization
+- *P* Full UTF-8 support
+- *P* Extended tests
+    - *P* C API
+    - *P* Python API
+
+
+## Non-critical for MVP
+
+- Term and triple validation
+- Enhanced graph operations
+    - Extract unique terms and 2-term tuples
+- NQ codec
+- TriG codec
+- UTF-16 support
+
+
+## Long-term
+
+- Generic graph (not constrained to RDF)
+- Lua bindings