C and Python data repository library. ALPHA

`lsup_repo`

This software is work in progress.

lsup_repo is a C library providing embedded (server-less) data repository services. It builds upon a foundational library, lsup_rdf, which handles RDF and graph data. Python bindings are also planned, in a separate project.

Background & scope

Lakesuperior was initially built upon the Fedora repository software. This new version is a complete re-engineering and repurposing of the previous software, seeking to provide similar repository services with a simplified set of concepts and constraints, focused on usability and simplicity of design. Unlike Fedora, it does not aim to adhere to, alter, or set any API standard. Long-term sustainability of the handled data is provided by transparent exports into commonly parsable data formats (RDF) and tools to rebuild a repository from data files.

That said, nothing would prevent someone from adding support for LDP, OCFL, Memento, etc. and even creating a complete Fedora implementation by using this library as a foundation for the basic repository functionality.

Use

lsup_repo can be included in a C program to manage the life cycle of RDF and non-RDF data. It can store and manage documents of any format and size, and catalog them via RDF metadata.

lsup_repo does not need to run a server for its core functionality. The interaction with the library is done via a C API or other bindings created for the purpose. That said, a REST API or any other type of server can be built with relative ease on top of this library. A separate project, based on this library, may in the future provide a REST API for generic resource management, likely based on some existing standard.

The current goal of this development is to build a minimum-viable product (MVP) to replace the essential functionality of a previous project, Lakesuperior.

Status

Alpha. Currently in an early implementation phase. The structure of the code may change radically. Basic functionality is still being built. See the road map for details.

Features

Short Term (MVP)

Handle the life cycle of arbitrary data documents (local disk storage)
Handle the life cycle of RDF resources describing such documents
Basic organizational structures: sets, lists, proxies (see "Concepts" below)
Versioning: create a version, revert to a version, "soft" deletion and reinstatement of a deleted resource
Checksum: cryptographic checksum of data, fixity check of resources in transit and at rest on demand
Basic management utilities: integrity check, statistics
Serialization and de-serialization to/from Turtle, TriG, N-Triples, N-Quads
Python bindings

Mid Term

Dump, restore and migration utilities
Notification stream
Multiple back end options (local, S3, other network protocols)
Authentication and access policies

Long Term

Checksum of RDF resources (depends on lsup_rdf development)
Other features as they become necessary

Conceptual framework

Storage areas

While the underlying lsup_rdf allows for working with multiple, user-defined storage back ends, lsup_repo has two designated stores: an in-memory staging area, where resources are created and manipulated, and a persistent store backed by LMDB, where resources are pushed after they are modified. Functions such as LSR_dres_new, LSR_dres_update, and LSR_dres_triples operate on the staging area. Functions such as LSR_dres_store, LSR_dres_get, and LSR_dres_delete move resources between the staging area and the persistent store, or delete them from the persistent store.

A typical resource life cycle in lsup_repo may include, for example:

Creating a resource and adding some data graphs
Storing the staged resource
Retrieving a resource from storage into memory
Modifying the staged resource via add and remove triple sets
Storing the modified resource
Deleting a stored resource
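
The flow above can be pictured with a small toy model. This is only a sketch: it does not use the lsup_repo API, whose signatures are not described here. All names (toy_res, toy_store, toy_stage_new, toy_store_put, etc.) are hypothetical, a resource is reduced to an ID and one text payload, and a plain array stands in for the LMDB-backed store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX 8
#define TOY_LEN 64

/* Hypothetical resource: an ID plus a single text payload. */
typedef struct {
    char id[TOY_LEN];
    char data[TOY_LEN];
    bool used;
} toy_res;

/* Hypothetical persistent store; a plain array stands in for LMDB. */
typedef struct {
    toy_res slots[TOY_MAX];
} toy_store;

/* Create a resource in the staging area (caller-owned memory). */
toy_res toy_stage_new(const char *id, const char *data)
{
    toy_res r = {0};
    assert(id != NULL && data != NULL);
    snprintf(r.id, sizeof r.id, "%s", id);
    snprintf(r.data, sizeof r.data, "%s", data);
    r.used = true;
    return r;
}

/* Modify a staged resource; the persistent store is not touched. */
void toy_stage_update(toy_res *r, const char *data)
{
    assert(r != NULL && data != NULL);
    snprintf(r->data, sizeof r->data, "%s", data);
}

/* Look up a stored resource by ID; NULL if absent. */
toy_res *toy_find(toy_store *s, const char *id)
{
    for (int i = 0; i < TOY_MAX; i++)
        if (s->slots[i].used && strcmp(s->slots[i].id, id) == 0)
            return &s->slots[i];
    return NULL;
}

/* Push a staged resource to the store, replacing any previous copy. */
bool toy_store_put(toy_store *s, const toy_res *r)
{
    toy_res *slot = toy_find(s, r->id);
    for (int i = 0; slot == NULL && i < TOY_MAX; i++)
        if (!s->slots[i].used) slot = &s->slots[i];
    if (slot == NULL) return false;   /* store is full */
    *slot = *r;
    return true;
}

/* Copy a stored resource back into the staging area. */
bool toy_store_get(toy_store *s, const char *id, toy_res *out)
{
    toy_res *slot = toy_find(s, id);
    if (slot == NULL) return false;
    *out = *slot;
    return true;
}

/* Delete a resource from the persistent store. */
bool toy_store_delete(toy_store *s, const char *id)
{
    toy_res *slot = toy_find(s, id);
    if (slot == NULL) return false;
    slot->used = false;
    return true;
}
```

In this model, a change made with toy_stage_update is invisible to toy_store_get until toy_store_put is called again, which is the point of separating the staging area from the persistent store.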

Versioning and soft & hard deletion

lsup_repo is planned to support full resource-level versioning, along the following lines:

A version for a resource may be created on demand at any point in time.
Each stored resource version is accessible via its own URI.
A resource can be restored to a previous version.
Versions of a resource may be collapsed into one (this is an irrevocable action).
A resource can be soft-deleted, i.e. its current state is stored as a version, and then replaced with a "tombstone" marker.
Any stored version of a soft-deleted resource can be accessed via its tombstone.
A soft-deleted resource can be reinstated from any of its stored versions.
A soft-deleted resource or an active resource can be hard-deleted (i.e. entirely and irrevocably removed from storage).
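
The rules above amount to a small state machine. The following sketch models it under simplifying assumptions: versions are kept in a fixed-size array rather than addressed by URI, content is a short string, and all names (ver_res, ver_create, ver_restore, ver_soft_delete, ver_hard_delete) are hypothetical rather than part of lsup_repo. Collapsing versions is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define VER_MAX 8
#define VAL_LEN 32

typedef enum { RES_ACTIVE, RES_TOMBSTONE, RES_GONE } res_state;

/* Hypothetical versioned resource: current content plus snapshots. */
typedef struct {
    res_state state;
    char current[VAL_LEN];
    char versions[VER_MAX][VAL_LEN];
    int n_versions;
} ver_res;

void ver_init(ver_res *r, const char *content)
{
    assert(r != NULL && content != NULL);
    r->state = RES_ACTIVE;
    r->n_versions = 0;
    snprintf(r->current, VAL_LEN, "%s", content);
}

/* Snapshot the current state; returns the version index, or -1. */
int ver_create(ver_res *r)
{
    if (r->state != RES_ACTIVE || r->n_versions == VER_MAX) return -1;
    memcpy(r->versions[r->n_versions], r->current, VAL_LEN);
    return r->n_versions++;
}

/* Restore an active resource, or reinstate a soft-deleted one. */
bool ver_restore(ver_res *r, int idx)
{
    if (r->state == RES_GONE || idx < 0 || idx >= r->n_versions)
        return false;
    memcpy(r->current, r->versions[idx], VAL_LEN);
    r->state = RES_ACTIVE;
    return true;
}

/* Soft delete: store the current state, then leave a tombstone. */
bool ver_soft_delete(ver_res *r)
{
    if (ver_create(r) < 0) return false;
    r->current[0] = '\0';
    r->state = RES_TOMBSTONE;
    return true;
}

/* Hard delete: drop content and all versions, irrevocably. */
void ver_hard_delete(ver_res *r)
{
    r->current[0] = '\0';
    r->n_versions = 0;
    r->state = RES_GONE;
}
```

Note that ver_restore works from both the active and the tombstone state, while nothing can be recovered once ver_hard_delete has run.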

Opaque resources

Lakesuperior stores and organizes arbitrary files that can be found on a hard drive, remote server, etc. These files are called Opaque Resources (O-RES). Their contents are entirely opaque to Lakesuperior, therefore any type of document can be handled.

In further development phases lsup_repo may perform some essential preservation tasks, such as checksum and basic characterization.

Descriptive Resources

Each opaque resource is accompanied by a Descriptive Resource (D-RES). In the first iteration of Lakesuperior this is an RDF named graph which at a minimum contains a pointer to the data location and basic technical metadata. The URI of the named graph is globally unique. Any user-defined metadata can be added to this resource.

Descriptive resources may also exist independently of opaque resources for cataloging and organizational purposes. They have a few characteristics in common:

They are made up of one or more named graphs stored in the Lakesuperior back end. The URIs of these named graphs follow a specific naming pattern and the graphs have explicit links between them, so they can be retrieved as one unit.
They are normal RDF resources, therefore their content is parsable by Lakesuperior and may be queried.
They may consist of several named graphs, each with a purpose defined by the software: library-managed data, user-provided data, versioning data, etc.

An O-RES cannot be looked up directly and has no metadata directly attached to it; instead, its related D-RES is looked up, which contains the file's metadata and a link to the file.

Partitioning a D-RES into multiple graphs allows individual data sets to be annotated, e.g. to establish provenance or versioning information about the asserted facts. Future developments of lsup_repo or software built upon it may take advantage of this structure.

Triples in a D-RES can have any subject; however it is recommended to maintain some consistency about which subjects are treated in each resource. Specifically, the use of a resource as an aggregation or container of triples about multiple independent subjects is discouraged in favor of the use of dedicated data structures, as described below.

Data Structures

Descriptive resources can be organized in various aggregation forms. The aggregating resources are normal descriptive resources, with specific predicates pointing to other descriptive resources.

The life cycle of resource aggregations is entirely independent of the aggregated resources. An aggregation only "contains" pointers to other, entirely independent resources. A resource can be pointed to by an arbitrary number of aggregations, and any of these can be removed at any time without changing the state of the aggregated resources.

On the other hand, deleting a resource that is part of some structure causes a scan of all inbound links (see "Referential Integrity" below) and the removal of all links to it present in other structures; therefore, the deletion of an aggregated resource changes the state of its aggregations.

The types of structures foreseen for the first implementation of lsup_repo are:

Set

A set is simply a descriptive resource containing an unordered collection of unique links to other descriptive resources. Any descriptive resource, including other structures, can be used. Shorthand functions for counting and iterating over Set members, as well as performing boolean operations on them, shall be made available. As it is a descriptive resource, a Set may have descriptive metadata added to it, such as taxonomy, descriptions, labels, etc.

Deleting a resource contained in a set also deletes the membership relationship in the set.
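
As an illustration of the Set semantics described above (unique, unordered links, plus a boolean operation), here is a minimal sketch. The shorthand functions mentioned in the text are not yet specified, so res_set, set_add, set_remove, and set_intersect are hypothetical names, and members are plain URI strings owned by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SET_MAX 16

/* Hypothetical Set: unique links (URI strings) to other resources. */
typedef struct {
    const char *members[SET_MAX];
    size_t size;
} res_set;

bool set_contains(const res_set *s, const char *uri)
{
    assert(s != NULL && uri != NULL);
    for (size_t i = 0; i < s->size; i++)
        if (strcmp(s->members[i], uri) == 0) return true;
    return false;
}

/* Add a link; duplicates are rejected so members stay unique. */
bool set_add(res_set *s, const char *uri)
{
    if (set_contains(s, uri) || s->size == SET_MAX) return false;
    s->members[s->size++] = uri;
    return true;
}

/* Remove a link; order is not significant, so the last member
 * is moved into the freed position. */
bool set_remove(res_set *s, const char *uri)
{
    for (size_t i = 0; i < s->size; i++) {
        if (strcmp(s->members[i], uri) == 0) {
            s->members[i] = s->members[--s->size];
            return true;
        }
    }
    return false;
}

/* Boolean operation: store the intersection of a and b in out. */
void set_intersect(const res_set *a, const res_set *b, res_set *out)
{
    out->size = 0;
    for (size_t i = 0; i < a->size; i++)
        if (set_contains(b, a->members[i])) set_add(out, a->members[i]);
}
```

Deleting a member resource would translate, in this model, into a set_remove call on every set that links to it.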

List

A Lakesuperior List is a plain implementation of a linked list data structure. The list resource contains a link to a single descriptive resource. This resource, called a List Item, represents the first item in the list. Each list item, except for the last one, contains a single link to the next list item.

In addition, every list item has either:

A link to the resource it stands for: the List Item is a proxy for an existing resource, which makes it possible to make the same resource part of multiple lists; or
A link to another list, which results in a nested list.

Shorthand functions to perform common list operations shall be made available. As with other descriptive resources, Lists and List Items can have any type of user-defined metadata and relationships added.

Deleting a list removes all list items under it, but leaves the resources pointed to by the list items intact.

Deleting a list item deletes all its child items and shifts the following item, if any, to the position the deleted item occupied.

Deleting an object that a list item stands for causes the list item to be deleted as well.
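
The linked-list layout and the deletion rules can be sketched as follows. This is a simplified model, not the lsup_repo implementation: nested lists are left out, list items live in heap memory instead of being descriptive resources, and the names list_item, res_list, list_append, list_remove_target, and list_clear are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical List Item: a proxy for a resource, plus a next link. */
typedef struct list_item {
    const char *proxy_for;      /* URI of the resource it stands for */
    struct list_item *next;     /* NULL for the last item */
} list_item;

/* Hypothetical List resource: links to the first item only. */
typedef struct {
    list_item *first;
} res_list;

/* Append a proxy item for `uri` at the end of the list. */
list_item *list_append(res_list *l, const char *uri)
{
    assert(l != NULL && uri != NULL);
    list_item *it = malloc(sizeof *it);
    if (it == NULL) return NULL;
    it->proxy_for = uri;
    it->next = NULL;
    list_item **p = &l->first;
    while (*p != NULL) p = &(*p)->next;
    *p = it;
    return it;
}

/* Remove every item standing for `uri`. The following item takes
 * the position of each removed item. Returns the number removed. */
size_t list_remove_target(res_list *l, const char *uri)
{
    size_t removed = 0;
    list_item **p = &l->first;
    while (*p != NULL) {
        if (strcmp((*p)->proxy_for, uri) == 0) {
            list_item *dead = *p;
            *p = dead->next;
            free(dead);
            removed++;
        } else {
            p = &(*p)->next;
        }
    }
    return removed;
}

/* Delete the list: free all items, leave the target resources alone. */
void list_clear(res_list *l)
{
    while (l->first != NULL) {
        list_item *dead = l->first;
        l->first = dead->next;
        free(dead);
    }
}
```

Because items only hold the URI of the resource they stand for, the same resource can appear in any number of lists, and list_clear never touches the resources themselves.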

Proxy

A List Item is a special case of a Proxy, which is a descriptive resource standing for another descriptive resource. This indirection is useful for adding a specific context to a resource, e.g. additional information on a document in the context of a curated collection that is only valid or relevant to that collection. Proxies can be aggregated in sets or other structures as well, as one sees fit.

Proxy definitions follow the OAI ORE ontology.

Deleting a resource that a proxy stands for also deletes the proxy.

Referential Integrity

The concept of Linked Data, which lsup_repo is partly built upon, does not guarantee that a link pointing to a resource resolves to an actual resource, since it is often impossible to determine which system is responsible for managing that resource, let alone to have any agency over it. Therefore, "broken links" are not excluded.

lsup_repo, however, relies on the assumption that a specific set of resources (all under the urn:lsres: namespace) is under its full control, and therefore guarantees that all references to internally managed resources are maintained at all times. This means that when a resource is deleted, all links pointing to it are identified and removed. This is called referential integrity.

Tools shall be developed to perform referential integrity checks and to notify of dangling links and/or repair them.
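
A minimal sketch of both operations, removing inbound links on deletion and checking for dangling links, is given below. It assumes links are held in a flat array of subject/object pairs, which is not how lsup_repo stores them. The names link_t, is_managed, remove_inbound, and count_dangling are hypothetical; only the urn:lsres: namespace comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MANAGED_NS "urn:lsres:"

/* Hypothetical link: a subject resource pointing to an object URI. */
typedef struct {
    const char *subject;
    const char *object;
} link_t;

/* True if the URI falls under the namespace the repository controls. */
bool is_managed(const char *uri)
{
    assert(uri != NULL);
    return strncmp(uri, MANAGED_NS, strlen(MANAGED_NS)) == 0;
}

/* Remove all links pointing to `uri`, compacting the array in place.
 * Returns the new number of links. */
size_t remove_inbound(link_t *links, size_t n, const char *uri)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (strcmp(links[i].object, uri) != 0) links[kept++] = links[i];
    return kept;
}

/* Count dangling links: managed objects absent from `existing`.
 * Links to external URIs are never reported. */
size_t count_dangling(const link_t *links, size_t n,
                      const char *const *existing, size_t n_existing)
{
    size_t dangling = 0;
    for (size_t i = 0; i < n; i++) {
        if (!is_managed(links[i].object)) continue;
        bool found = false;
        for (size_t j = 0; j < n_existing && !found; j++)
            found = strcmp(links[i].object, existing[j]) == 0;
        if (!found) dangling++;
    }
    return dangling;
}
```

The namespace test is what separates the two cases discussed above: links to external URIs may be broken without being reported, while links under urn:lsres: must always resolve.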

Managed resources

Some Lakesuperior resources (D-RES) and RDF terms may be managed by the repository and are handled in a special way under most circumstances.

Examples of such resources can be:

Support structures [TODO specify]
RDF predicates
RDF types

Some managed resources may be handled by the user in different ways depending on the state of a resource. For example, an RDF type of lsup:List can be specified by the user on creation, but after that it may not be modified manually.
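
The lsup:List example can be expressed as a simple policy check. The sketch below is an assumption about how such a rule might look, not the library's actual logic: op_kind, managed_types, is_managed_type, and user_type_allowed are hypothetical names, and the list of managed types contains only the one type named in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { OP_CREATE, OP_UPDATE } op_kind;

/* Hypothetical list of repository-managed RDF types. Only lsup:List
 * is taken from the text; a real list would be longer. */
static const char *const managed_types[] = { "lsup:List" };

bool is_managed_type(const char *rdf_type)
{
    assert(rdf_type != NULL);
    size_t n = sizeof managed_types / sizeof managed_types[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(managed_types[i], rdf_type) == 0) return true;
    return false;
}

/* Decide whether a user-supplied rdf:type statement is accepted. */
bool user_type_allowed(op_kind op, const char *rdf_type)
{
    if (!is_managed_type(rdf_type)) return true;  /* ordinary type */
    return op == OP_CREATE;   /* managed: settable on creation only */
}
```

With this rule, user_type_allowed(OP_CREATE, "lsup:List") is accepted while the same type is rejected on OP_UPDATE, and types outside the managed list pass in both cases.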

TODO A more detailed list of these managed resources and their behaviour will be included in an expanded version of this documentation.
