C and Python data repository library. ALPHA
lsup_repo
This software is a work in progress.
`lsup_repo` is a C and Python library providing embedded (server-less) data repository services. It builds upon a foundational library, `lsup_rdf`, which handles RDF and graph data.
Lakesuperior was initially built upon the Fedora repository software. This new version is a complete reengineering and repurposing of the previous software, seeking to provide similar repository services with a simplified set of concepts and constraints, focused on usability and simplicity of design. Unlike Fedora, it does not aim to adhere to, alter, or set any API standard. Long-term sustainability of the handled data is provided by transparent exports into commonly parsable data formats (RDF) and tools to rebuild a repository from data files.
That said, nothing would prevent someone from adding support for LDP, OCFL, Memento, etc., or even creating a complete Fedora implementation that uses this library as the foundation for basic repository functionality.
`lsup_repo` can be included in a C or Python program to manage the life cycle of RDF and non-RDF data. It can store and manage documents of any format and size, and catalog them via RDF metadata.
`lsup_repo` does not need to run a server for its core functionality. The interaction with the library is done via a C or Python API. That said, a REST API or any other type of server can be built with relative ease on top of this library. A separate project, based on this library, may in the future provide a REST API for generic resource management, likely based on some existing standard.
The current goal of this development is to build a minimum-viable product (MVP) to replace the essential functionality of a previous project, Lakesuperior.
Alpha. Currently at the beginning of the implementation phase. The structure of the code may change radically. Basic functionality is still being built. See road map for details.
`lsup_rdf` development)
While the underlying `LSUP_RDF` allows for working with multiple, user-defined storage back ends, `LSUP_REPO` has two designated stores: an in-memory staging area, where resources are created and manipulated, and a persistent store backed by LMDB, where resources are pushed after they are modified. Functions such as `LSR_dres_new`, `LSR_dres_update`, and `LSR_dres_triples` operate on the staging area. Functions such as `LSR_dres_store`, `LSR_dres_get`, and `LSR_dres_delete` move resources between the staging area and the persistent store, or delete them from the persistent store.
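The division of labor between the two stores can be pictured with a small, self-contained Python model. This is only a sketch of the concept, not the `lsup_repo` API: the `ToyRepo` class, its method names and signatures, and the plain dictionaries standing in for LMDB are all assumptions made for illustration.

```python
class ToyRepo:
    """Toy model of a staging area plus a persistent store (hypothetical)."""

    def __init__(self):
        self.staging = {}     # in-memory working copies: URI -> set of triples
        self.persistent = {}  # plain dict standing in for the LMDB-backed store

    def new(self, uri, triples=()):
        """Create a resource in the staging area only."""
        self.staging[uri] = set(triples)

    def update(self, uri, remove=(), add=()):
        """Modify a staged resource; the persistent copy is untouched."""
        res = self.staging[uri]
        res.difference_update(remove)
        res.update(add)

    def store(self, uri):
        """Push a staged resource to the persistent store."""
        self.persistent[uri] = frozenset(self.staging[uri])

    def get(self, uri):
        """Copy a persisted resource back into the staging area."""
        self.staging[uri] = set(self.persistent[uri])
        return self.staging[uri]

    def delete(self, uri):
        """Remove a resource from the persistent store."""
        del self.persistent[uri]


repo = ToyRepo()
uri = "urn:lsres:example"
repo.new(uri, [(uri, "dc:title", "Draft")])
repo.store(uri)
repo.update(
    uri,
    remove=[(uri, "dc:title", "Draft")],
    add=[(uri, "dc:title", "Final")],
)

# The staged copy has changed; the persisted copy stays as it was
# until store() is called again.
staged_title = next(o for s, p, o in repo.staging[uri] if p == "dc:title")
persisted_title = next(o for s, p, o in repo.persistent[uri] if p == "dc:title")
```

In this model, edits never reach the persistent store implicitly: a resource must be pushed explicitly after it is modified.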
Usual actions in a resource lifecycle flow in `LSUP_REPO` may be, for example:
`LSUP_REPO` is planned to support full resource-level versioning, along the following lines:
Lakesuperior stores and organizes arbitrary files that can be found on a hard drive, remote server, etc. These files are called Opaque Resources (O-RES). Their contents are entirely opaque to Lakesuperior; therefore, any type of document can be handled.
In further development phases, `LSUP_REPO` may perform some essential preservation tasks, such as checksum and basic characterization.
Each opaque resource is accompanied by a Descriptive Resource (D-RES). In the first iteration of Lakesuperior this is an RDF named graph which, at a minimum, contains a pointer to the data location and basic technical metadata. The URI of the named graph is globally unique. Any user-defined metadata can be added to this resource.
Descriptive resources may also exist independently of opaque resources for cataloging and organizational purposes. They have a few characteristics in common:
An O-RES cannot be looked up directly and has no metadata directly attached to it; instead, its related D-RES is looked up, which contains the file's metadata and a link to the file.
Partitioning a D-RES into multiple graphs allows individual data sets to be annotated, e.g. to establish provenance or versioning information about the asserted facts. Future developments of `lsup_repo` or software built upon it may take advantage of this structure.
Triples in a D-RES can have any subject; however, it is recommended to maintain some consistency about which subjects are treated in each resource. Specifically, using a resource as an aggregation or container of triples about multiple independent subjects is discouraged in favor of dedicated data structures, as described below.
Descriptive resources can be organized in various aggregation forms. The aggregating resources are normal descriptive resources, with specific predicates pointing to other descriptive resources.
The life cycle of resource aggregations is entirely independent of the aggregated resources. An aggregation only "contains" pointers to other, entirely independent resources. Each of those resources can be pointed to by an arbitrary number of aggregations, and any aggregation can be removed at any time without changing the state of the aggregated resources.
On the other hand, deleting a resource that is part of some structure causes a scan of all inbound links (see "Referential Integrity" below) and the removal of all links to it present in other structures; therefore, the deletion of an aggregated resource changes the state of its aggregations.
The types of structures foreseen for the first implementation of `lsup_repo` are:
A set is simply a descriptive resource containing an unordered number of unique links to other descriptive resources. Any descriptive resource, including other structures, can be used. Shorthand functions for counting and iterating over Set members, as well as performing boolean operations on them, shall be made available. As it is a descriptive resource, a Set may have descriptive metadata added to it, such as taxonomy, descriptions, labels, etc.
Deleting a resource contained in a set also deletes the membership relationship in the set.
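The Set semantics described above map closely onto a mathematical set of URIs. The following is a minimal sketch under that assumption; `ToySet`, `delete_resource`, and the example URIs are hypothetical names for illustration and do not reflect the shorthand functions that `lsup_repo` will eventually expose.

```python
class ToySet:
    """Toy model of a Set: unordered, unique links to other resources."""

    def __init__(self, uri, label=None):
        self.uri = uri
        self.label = label    # example of descriptive metadata on the Set itself
        self.members = set()  # URIs of member resources; duplicates collapse

    def add(self, member_uri):
        self.members.add(member_uri)

    def __len__(self):
        return len(self.members)

    def __iter__(self):
        return iter(self.members)


def delete_resource(uri, all_sets):
    """Deleting a resource also removes its membership from every set."""
    for s in all_sets:
        s.members.discard(uri)


photos = ToySet("urn:lsres:set-photos", label="Photographs")
maps = ToySet("urn:lsres:set-maps", label="Maps")
for m in ("urn:lsres:a", "urn:lsres:b", "urn:lsres:a"):  # "a" is added twice
    photos.add(m)
for m in ("urn:lsres:b", "urn:lsres:c"):
    maps.add(m)

in_both = photos.members & maps.members    # boolean AND across two sets
in_either = photos.members | maps.members  # boolean OR across two sets

# Removing resource "b" drops it from both sets; the sets themselves remain.
delete_resource("urn:lsres:b", [photos, maps])
```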
A Lakesuperior List is a plain implementation of a linked list data structure. The list resource contains a link to a single descriptive resource. This resource, called a List Item, represents the first item in the list. Each list item, except for the last one, contains a single link to the next list item.
In addition, every list item has either:
A link to the resource it stands for: the List Item is a proxy for an existing resource, which makes it possible to make the same resource part of multiple lists; or
A link to another list, which results in a nested list.
Shorthand functions to perform common list operations shall be made available. As with other descriptive resources, Lists and List Items can have any type of user-defined metadata and relationships added.
Deleting a list removes all list items under it, but leaves the resources pointed to by the list items intact.
Deleting a list item deletes all its child items and shifts the following item, if one exists, to the position the deleted item occupied.
Deleting an object that a list item stands for causes the list item to be deleted as well.
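The list behavior described above can be modeled as an ordinary singly linked list whose nodes point either to a resource URI or to another list. This is a sketch under stated assumptions only: `ListItem`, `ToyList`, and their methods are hypothetical and are not the shorthand functions planned for the library.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ListItem:
    # Either the URI of the resource this item stands for, or a nested list.
    target: Union[str, "ToyList"]
    next: Optional["ListItem"] = None


@dataclass
class ToyList:
    first: Optional[ListItem] = None

    def append(self, target):
        """Add a new item at the end of the list and return it."""
        item = ListItem(target)
        if self.first is None:
            self.first = item
        else:
            last = self.first
            while last.next is not None:
                last = last.next
            last.next = item
        return item

    def flatten(self):
        """Yield resource URIs in order, descending into nested lists."""
        item = self.first
        while item is not None:
            if isinstance(item.target, ToyList):
                yield from item.target.flatten()
            else:
                yield item.target
            item = item.next

    def remove_target(self, uri):
        """Drop every item standing for `uri`; following items shift up."""
        # Unlink matching items at the head of the list.
        while self.first is not None and self.first.target == uri:
            self.first = self.first.next
        item = self.first
        while item is not None:
            if isinstance(item.target, ToyList):
                item.target.remove_target(uri)
            # Unlink matching successors by relinking past them.
            while item.next is not None and item.next.target == uri:
                item.next = item.next.next
            item = item.next


inner = ToyList()
inner.append("urn:lsres:b")
inner.append("urn:lsres:c")

outer = ToyList()
outer.append("urn:lsres:a")
outer.append(inner)          # nested list
outer.append("urn:lsres:b")  # the same resource is also part of `inner`

before = list(outer.flatten())
outer.remove_target("urn:lsres:b")  # simulates deletion of resource "b"
after = list(outer.flatten())
```

Because items only hold a link to the resource they stand for, resource "b" can appear in both lists, and removing it only unlinks the proxy items.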
A List Item is a special case of a Proxy, which is a descriptive resource standing for another descriptive resource. This indirection is useful for adding a specific context to a resource, e.g. additional information on a document in the context of a curated collection that is only valid or relevant to that collection. Proxies can be aggregated in sets or other structures as well, as one sees fit.
Proxy definitions follow the OAI ORE ontology.
Deleting a resource that a proxy stands for also deletes the proxy.
The concept of Linked Data, which `lsup_repo` is partly built upon, does not mandate the guarantee that a link pointing to a resource resolves to an actual resource, since it is often impossible to determine which system is responsible for managing that resource, let alone having any agency upon it. Therefore, "broken links" are not excluded.
`lsup_repo`, however, relies on the assumption that a specific set of resources (all under the `urn:lsres:` namespace) is under its full control, and therefore guarantees that all references to internally managed resources are maintained at all times. This means that when a resource is deleted, all links pointing to it are identified and removed. This is called referential integrity.
Tools shall be developed to perform referential integrity checks and to notify of dangling links and/or repair them.
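As an illustration of what such a guarantee and check involve, the sketch below models a repository as a dictionary of resources and their outbound links. The data layout and the function names `delete_resource` and `dangling_links` are assumptions made for this example; only the `urn:lsres:` namespace comes from the text above.

```python
MANAGED_NS = "urn:lsres:"


def delete_resource(graph, uri):
    """Delete a resource and remove every inbound link pointing to it.

    `graph` maps a resource URI to a set of (predicate, object) pairs.
    """
    del graph[uri]
    for subject in graph:
        graph[subject] = {(p, o) for p, o in graph[subject] if o != uri}


def dangling_links(graph):
    """List links into the managed namespace that resolve to nothing.

    Links to external resources are ignored, since the repository
    has no control over them.
    """
    return sorted(
        (s, p, o)
        for s, links in graph.items()
        for p, o in links
        if o.startswith(MANAGED_NS) and o not in graph
    )


graph = {
    "urn:lsres:a": {
        ("ex:ref", "urn:lsres:b"),
        ("ex:ref", "http://example.org/x"),
    },
    "urn:lsres:b": {("ex:ref", "urn:lsres:a")},
    "urn:lsres:c": {("ex:ref", "urn:lsres:gone")},  # deliberately broken link
}

broken = dangling_links(graph)

# Deleting "b" through delete_resource leaves no new dangling link behind.
delete_resource(graph, "urn:lsres:b")
broken_after = dangling_links(graph)
```

The external link to `http://example.org/x` is never reported, which mirrors the distinction between links the repository manages and links it cannot control.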
Some Lakesuperior resources (DSC-R) and RDF terms may be managed by the repository and are handled in a special way under most circumstances.
Examples of such resources can be:
Some managed resources may be handled by the user in different ways depending on the state of a resource. For example, an RDF type of `lsup:List` can be specified by the user on creation, but after that it may not be modified manually.
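One way to picture this state-dependent rule is a guard that accepts a managed type at creation time and rejects later manual changes to it. The sketch below is purely illustrative: the `MANAGED_TYPES` registry, the `ManagedTermError` exception, and both functions are hypothetical, and the actual list of managed terms is not yet documented.

```python
MANAGED_TYPES = {"lsup:List"}  # hypothetical registry of managed RDF types


class ManagedTermError(ValueError):
    """Raised when a user tries to change a managed term manually."""


def create_types(types):
    """On creation, the user may specify managed types freely."""
    return set(types)


def update_types(current, remove=(), add=()):
    """After creation, managed types can be neither added nor removed."""
    touched = (set(remove) | set(add)) & MANAGED_TYPES
    if touched:
        raise ManagedTermError(
            f"managed types cannot be modified: {sorted(touched)}"
        )
    return (set(current) - set(remove)) | set(add)


types = create_types(["lsup:List", "ex:Playlist"])

# Changing a user-defined type is allowed.
types = update_types(types, remove=["ex:Playlist"], add=["ex:Album"])

# Removing the managed type after creation is rejected.
try:
    update_types(types, remove=["lsup:List"])
    rejected = False
except ManagedTermError:
    rejected = True
```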
TODO A more detailed list of these managed resources and their behaviour will be included in an expanded version of this documentation.