C and Python data repository library. ALPHA


lsup_repo

This software is work in progress.

lsup_repo is a C and Python library providing embedded (server-less) data repository services. It builds upon a foundational library, lsup_rdf, which handles RDF and graph data.

Background & scope

Lakesuperior was initially built upon the Fedora repository software. This new version is a complete reengineering and repurposing of the previous software, seeking to provide similar repository services with a simplified set of concepts and constraints, focused on usability and simplicity of design. Unlike Fedora, it does not aim to adhere to, alter, or set any API standard. Long-term sustainability of the handled data is provided by transparent exports into commonly parsable data formats (RDF) and tools to rebuild a repository from data files.

That said, nothing would prevent someone from adding support for LDP, OCFL, Memento, etc., or even from creating a complete Fedora implementation by using this library as a foundation for the basic repository functionality.

Use

lsup_repo can be included in a C or Python program to manage the life cycle of RDF and non-RDF data. It can store and manage documents of any format and size, and catalog them via RDF metadata.

lsup_repo does not need to run a server for its core functionality. The interaction with the library is done via a C or Python API. That said, a REST API or any other type of server can be built with relative ease on top of this library. A separate project, based on this library, may in the future provide a REST API for generic resource management, likely based on some existing standard.

The current goal of this development is to build a minimum-viable product (MVP) to replace the essential functionality of a previous project, Lakesuperior.

Status

Pre-alpha. Currently at the beginning of the implementation phase. The structure of the code may change radically. Features mentioned below are to be understood as goals, not as implemented functionality.

Features

Short Term (MVP)

  • Handle the life cycle of arbitrary data documents (local disk storage)
  • Handle the life cycle of RDF resources describing such documents
  • Basic organizational structures: sets, lists, proxies (see "Concepts" below)
  • Versioning: create a version, revert to a version, "soft" deletion and reinstatement of a deleted resource
  • Checksum: cryptographic checksum of data, fixity check of resources in transit and at rest on demand
  • Basic management utilities: integrity check, statistics
  • Serialization and de-serialization to/from Turtle, TriG, N-Triples, N-quads
  • Python bindings
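As an illustration of the checksum and fixity goal above, here is a minimal sketch using only the Python standard library. It is not lsup_repo's API: the function names `file_checksum` and `fixity_ok`, the choice of SHA-256, and the temporary file standing in for a stored document are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected_hex, algorithm="sha256"):
    """Compare the current checksum with the one recorded at ingest."""
    return file_checksum(path, algorithm) == expected_hex


# A temporary file stands in for a stored data document (hypothetical).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example payload")
    demo_path = tmp.name

recorded = file_checksum(demo_path)      # checksum stored at ingest time
intact = fixity_ok(demo_path, recorded)  # on-demand check at rest

with open(demo_path, "ab") as fh:
    fh.write(b"!")                       # simulate silent corruption
corrupted = not fixity_ok(demo_path, recorded)

os.remove(demo_path)
```

Reading in chunks keeps memory use flat regardless of document size, which matters for a repository meant to hold files of any size.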

Mid Term

  • Dump, restore and migration utilities
  • Notification stream
  • Multiple back end options (local, S3, other network protocols)
  • Authentication and access policies

Long Term

  • Checksum of RDF resources (depends on lsup_rdf development)
  • Other features as they become necessary

Concepts

Data resources

At the center of Lakesuperior is the goal of storing and organizing arbitrary files that can be found on a hard drive, remote server, etc. These files are called Data Resources (DATA-R). Their contents are entirely opaque to Lakesuperior, therefore any type of document can be handled.

Descriptive Resources

Each data resource is accompanied by a Descriptive Resource (DESC-R). In the first iteration of Lakesuperior this is an RDF named graph which at a minimum contains a pointer to the data location and basic technical metadata. The URI of the named graph is globally unique. Such a resource stands for the non-RDF resource in a Linked Data context. User-defined metadata can also be added to it.

Descriptive resources may also exist independently of data resources for cataloging and organizational purposes. They have a few characteristics in common:

  • They are made up of one or more named graphs stored in the Lakesuperior back end. The URIs of these named graphs follow a specific naming pattern and the graphs are explicitly linked to each other, so they can be retrieved as one unit.
  • They are normal RDF resources, therefore their content is parsable by Lakesuperior and may be queried.
  • They may consist of several named graphs, each with a purpose defined by the software: library-managed data, user-provided data, versioning data, etc.

Partitioning a DESC-R into multiple graphs allows individual data sets to be annotated, e.g. to establish provenance or versioning information about the asserted facts. Future developments of lsup_repo or software built upon it may take advantage of this structure.
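The partitioning described above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `<resource URI>#<role>` naming pattern, the `ex:` predicates, and the function names are hypothetical and do not reflect lsup_repo's actual scheme or API.

```python
# Hypothetical roles, mirroring "library-managed data, user-provided data,
# versioning data" from the text.
ROLES = ("admin", "user", "version")


def graph_uri(res_uri, role):
    """Assumed naming pattern: <resource URI>#<role>."""
    return f"{res_uri}#{role}"


def create_desc_resource(store, res_uri, data_location):
    """Create one named graph per role and link them from the admin graph."""
    admin = graph_uri(res_uri, "admin")
    store[admin] = {(res_uri, "ex:dataLocation", data_location)}
    for role in ROLES:
        g = graph_uri(res_uri, role)
        store.setdefault(g, set())
        # Explicit link, so the graphs can be retrieved as one unit.
        store[admin].add((res_uri, "ex:hasGraph", g))


def get_desc_resource(store, res_uri):
    """Follow the explicit links to collect every graph of the resource."""
    admin = graph_uri(res_uri, "admin")
    linked = [o for s, p, o in store[admin] if p == "ex:hasGraph"]
    return {g: store[g] for g in linked}


store = {}  # named-graph URI -> set of (subject, predicate, object) triples
create_desc_resource(store, "urn:ex:doc1", "file:///data/doc1.pdf")
store[graph_uri("urn:ex:doc1", "user")].add(
    ("urn:ex:doc1", "ex:title", "Annual report")
)
unit = get_desc_resource(store, "urn:ex:doc1")
```

Keeping user-provided triples in their own graph means that graph can be annotated or versioned without touching the library-managed data.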

Triples in a DESC-R can have any subject; however it is recommended to maintain some consistency about which subjects are treated in each resource. Specifically, the use of a resource as an aggregation or container of triples about multiple independent entities is discouraged in favor of the use of dedicated data structures, as described below.

Resource Structures

Descriptive resources can be organized in various aggregation forms. The aggregating resources are normal descriptive resources, with specific predicates pointing to other descriptive resources.

It is important to note that, unlike in Fedora or other LDP implementations, the life cycle of resource aggregations is entirely independent of the aggregated resources. In LDP, deleting a container would remove its contained resources. Also, in LDP a resource can be contained by only one container (except in the case of indirect containers, to some extent). In Lakesuperior an aggregation only "contains" pointers to other, entirely independent resources. These resources can be pointed to by an arbitrary number of other aggregations, and any aggregation can be removed at any time without changing the state of the aggregated resources.

On the other hand, deleting a resource that is part of some structure causes a scan of all inbound links (see "Referential Integrity" below) and the removal of all links to it present in other structures; therefore, the deletion of an aggregated resource changes the state of its aggregations.
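The asymmetry between the two deletion cases can be sketched with a toy in-memory model. The `Repo` class and its methods are hypothetical and serve only to illustrate the behaviour described above; they are not part of lsup_repo.

```python
class Repo:
    """Toy model: each resource maps to the set of URIs it links to."""

    def __init__(self):
        self.links = {}

    def create(self, uri, members=()):
        self.links[uri] = set(members)

    def delete(self, uri):
        # Remove the resource itself, then scan for and drop inbound links.
        del self.links[uri]
        for targets in self.links.values():
            targets.discard(uri)


repo = Repo()
repo.create("urn:ex:doc1")
repo.create("urn:ex:agg1", members=["urn:ex:doc1"])
repo.create("urn:ex:agg2", members=["urn:ex:doc1"])

# Deleting an aggregation leaves the aggregated resource untouched.
repo.delete("urn:ex:agg1")
doc_survives = "urn:ex:doc1" in repo.links

# Deleting an aggregated resource changes the aggregations pointing to it.
repo.delete("urn:ex:doc1")
agg2_members = repo.links["urn:ex:agg2"]
```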

The types of structures foreseen for the first implementation of lsup_repo are:

Set

A set is simply a descriptive resource containing an unordered collection of unique links to other descriptive resources. Any descriptive resource, including other structures, can be a member. Shorthand functions for counting and iterating over Set members, as well as performing boolean operations on them, shall be made available. As it is a descriptive resource, a Set may have descriptive metadata added to it, such as taxonomy, descriptions, labels, etc.
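A rough picture of the shorthand operations mentioned above, modelled in plain Python. The `ResourceSet` class, its method names, and the example URIs are assumptions for illustration; the actual lsup_repo functions are not yet defined.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceSet:
    """Hypothetical stand-in for a Set: unique links plus free-form metadata."""
    uri: str
    members: set = field(default_factory=set)
    metadata: dict = field(default_factory=dict)

    def __len__(self):
        return len(self.members)

    def __iter__(self):
        # A Set is unordered; sorting only makes the demo deterministic.
        return iter(sorted(self.members))

    def add(self, member_uri):
        self.members.add(member_uri)  # adding a duplicate has no effect

    def union(self, other, uri):
        return ResourceSet(uri, self.members | other.members)

    def intersection(self, other, uri):
        return ResourceSet(uri, self.members & other.members)


maps = ResourceSet("urn:ex:maps", metadata={"label": "Maps"})
for member in ("urn:ex:doc1", "urn:ex:doc2", "urn:ex:doc1"):
    maps.add(member)

rare = ResourceSet("urn:ex:rare", {"urn:ex:doc2", "urn:ex:doc3"})
rare.add(maps.uri)  # a Set may also link to another structure

rare_maps = maps.intersection(rare, "urn:ex:rare-maps")
```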

List

A Lakesuperior List is the implementation of a "Linked List" data structure. It contains a link to a single descriptive resource. This resource, called a List Item, represents the first item in the list. Each list item, except for the last one, contains a single link to the next list item.

In addition, every list item has either:

  • A link to the resource it stands for: the List Item is a proxy for an existing resource, which makes it possible to make the same resource part of multiple lists; or

  • A link to another list, which results in a nested list.

Shorthand functions to perform common list operations shall be made available. As with other descriptive resources, Lists and List Items can have any type of user-defined metadata and relationships added.
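The List structure described above can be modelled as follows. This is a sketch, not the library's data model: `ListItem`, `ResourceList`, `make_list`, and `iter_resources` are hypothetical names, and resources are represented by plain URI strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ListItem:
    proxy_for: Optional[str] = None             # URI the item stands for, or
    sublist: Optional["ResourceList"] = None    # a nested list
    next: Optional["ListItem"] = None           # absent on the last item


@dataclass
class ResourceList:
    first: Optional[ListItem] = None            # single link to the first item


def make_list(*entries):
    """Build a list from URIs (str) and nested ResourceList objects."""
    lst = ResourceList()
    tail = None
    for entry in entries:
        if isinstance(entry, ResourceList):
            item = ListItem(sublist=entry)
        else:
            item = ListItem(proxy_for=entry)
        if tail is None:
            lst.first = item
        else:
            tail.next = item
        tail = item
    return lst


def iter_resources(lst):
    """Walk the list in order, descending into nested lists."""
    item = lst.first
    while item is not None:
        if item.sublist is not None:
            yield from iter_resources(item.sublist)
        else:
            yield item.proxy_for
        item = item.next


inner = make_list("urn:ex:b", "urn:ex:c")
outer = make_list("urn:ex:a", inner, "urn:ex:d")
other = make_list("urn:ex:a")  # the same resource in a second list

flat = list(iter_resources(outer))
```

Because each item is a proxy rather than the resource itself, `urn:ex:a` appears in both `outer` and `other` without either list owning it.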

Proxy

A List Item is a special case of a Proxy, which is a descriptive resource standing for another descriptive resource. This indirection is useful for adding a specific context to a resource, e.g. additional information on a document in the context of a curated collection that is only valid or relevant to that collection. Proxies can be aggregated in sets or other structures as well, as one sees fit.

Proxy definitions follow the OAI ORE ontology.
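To make the proxy idea concrete, the sketch below builds proxy triples with the ORE terms `ore:Proxy`, `ore:proxyFor`, and `ore:proxyIn`. Everything else is an assumption for illustration: triples are plain Python tuples, and the `ex:note` predicate, the URIs, and the helper names are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def make_proxy(proxy_uri, resource_uri, aggregation_uri, context=None):
    """Return triples for a proxy of a resource within one aggregation."""
    triples = {
        (proxy_uri, RDF_TYPE, ORE + "Proxy"),
        (proxy_uri, ORE + "proxyFor", resource_uri),
        (proxy_uri, ORE + "proxyIn", aggregation_uri),
    }
    # Context-specific statements are attached to the proxy, not the resource.
    for predicate, value in (context or {}).items():
        triples.add((proxy_uri, predicate, value))
    return triples


def values_in_context(graph, resource_uri, aggregation_uri, predicate):
    """Find what is said about a resource in the context of one aggregation."""
    proxies = {
        s for s, p, o in graph
        if p == ORE + "proxyFor" and o == resource_uri
        and (s, ORE + "proxyIn", aggregation_uri) in graph
    }
    return sorted(o for s, p, o in graph if s in proxies and p == predicate)


graph = make_proxy(
    "urn:ex:p1", "urn:ex:doc1", "urn:ex:exhibit",
    {"ex:note": "Shown in room 2"},
) | make_proxy(
    "urn:ex:p2", "urn:ex:doc1", "urn:ex:teaching",
    {"ex:note": "Week 3 reading"},
)

exhibit_notes = values_in_context(
    graph, "urn:ex:doc1", "urn:ex:exhibit", "ex:note"
)
```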

Referential Integrity

The concept of Linked Data, which lsup_repo is partly built upon, does not guarantee that a link pointing to a resource resolves to an actual resource, since it is often impossible to determine which system is responsible for managing that resource, let alone to have any control over it. Therefore, "broken links" cannot be ruled out.

lsup_repo, however, relies on the assumption that a specific set of resources is under its full control, and therefore guarantees that all references to internally managed resources are maintained at all times. This means that when a resource is deleted, all links pointing to it are identified and removed. This is called referential integrity.

Tools shall be developed to perform periodic referential integrity checks and to report dangling links and/or repair them.
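Such a check could look roughly like the sketch below. It assumes, purely for illustration, that internally managed resources are recognizable by a `urn:lsup:` URI prefix and that triples are plain tuples; neither reflects an actual lsup_repo convention.

```python
MANAGED_PREFIX = "urn:lsup:"  # hypothetical marker for managed URIs


def find_dangling(triples, existing):
    """Return triples whose object is a managed URI with no resource behind it.

    Links to external URIs are skipped: the repository has no control
    over them, so they cannot be verified.
    """
    return {
        (s, p, o) for s, p, o in triples
        if o.startswith(MANAGED_PREFIX) and o not in existing
    }


def repair(triples, existing):
    """Drop every dangling link, keeping all other triples."""
    return triples - find_dangling(triples, existing)


existing = {"urn:lsup:a", "urn:lsup:set1"}
triples = {
    ("urn:lsup:set1", "ex:member", "urn:lsup:a"),
    ("urn:lsup:set1", "ex:member", "urn:lsup:gone"),
    ("urn:lsup:a", "ex:seeAlso", "http://example.org/elsewhere"),
}

dangling = find_dangling(triples, existing)
repaired = repair(triples, existing)
```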

Managed resources

Some Lakesuperior resources (DESC-R) and RDF terms may be managed by the repository and are handled in a special way under most circumstances.

Examples of such resources can be:

  • Support structures [TODO specify]
  • RDF predicates
  • RDF types

Some managed resources may be handled by the user in different ways depending on the state of a resource. For example, an RDF type of lsup:List can be specified by the user on creation, but after that it may not be modified manually.
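The lsup:List example can be sketched as a state-dependent rule: a managed type is accepted when the resource is created and rejected on later manual changes. The `Resource` class, the `ManagedTermError` exception, and the `ex:Playlist` type are hypothetical; only `lsup:List` comes from the text.

```python
MANAGED_TYPES = {"lsup:List"}


class ManagedTermError(Exception):
    """Raised when a user tries to change a repository-managed term."""


class Resource:
    def __init__(self, uri, types=()):
        self.uri = uri
        # On creation, managed types may be specified by the user.
        self.types = set(types)

    @staticmethod
    def _check(rdf_type):
        if rdf_type in MANAGED_TYPES:
            raise ManagedTermError(f"{rdf_type} is managed by the repository")

    def add_type(self, rdf_type):
        self._check(rdf_type)
        self.types.add(rdf_type)

    def remove_type(self, rdf_type):
        self._check(rdf_type)
        self.types.discard(rdf_type)


res = Resource("urn:ex:list1", types=["lsup:List"])
res.add_type("ex:Playlist")  # user-defined types remain editable

try:
    res.remove_type("lsup:List")
    blocked = False
except ManagedTermError:
    blocked = True
```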

TODO A more detailed list of these managed resources and their behaviour will be included in an expanded version of this documentation.