|
@@ -0,0 +1,189 @@
|
|
|
+# `lsup_repo`
|
|
|
+
|
|
|
+**This software is work in progress.**
|
|
|
+
|
|
|
+`lsup_repo` is a C and Python library providing embedded (server-less)
|
|
|
+data repository services. It builds upon a foundational library, [`lsup_rdf`](
|
|
|
+https://notabug.org/scossu/lsup_rdf), which handles RDF and graph data.
|
|
|
+
|
|
|
+## Use
|
|
|
+
|
|
|
+`lsup_repo` can be included in a C or Python program to manage the life cycle
|
|
|
+of RDF and non-RDF data. It can store and manage documents of any format
|
|
|
+and size, and catalog them via RDF metadata.
|
|
|
+
|
|
|
+`lsup_repo` does not need to run a server for its core functionality. The
|
|
|
+interaction with the library is done via a C or Python API. That said, a REST
|
|
|
+API or any other type of server can be built with relative ease on top of this
|
|
|
+library. A separate project, based on this library, may in the future provide a
|
|
|
+REST API for generic resource management, likely based on some existing
|
|
|
+standard.
|
|
|
+
|
|
|
+The current goal of this development is to build a minimum-viable product (MVP)
|
|
|
+to replace the essential functionality of a previous project,
|
|
|
+[Lakesuperior](https://notabug.org/scossu/lakesuperior).
|
|
|
+
|
|
|
+## Status
|
|
|
+
|
|
|
+Pre-alpha. Currently at the beginning of the implementation phase. The
|
|
|
+structure of the code may change radically. Features mentioned below are to be
|
|
|
+understood as goals.
|
|
|
+
|
|
|
+## Features
|
|
|
+
|
|
|
+### Short Term (MVP)
|
|
|
+
|
|
|
+- Handle the life cycle of arbitrary data documents (local disk storage)
|
|
|
+- Handle the life cycle of RDF resources describing such documents
|
|
|
+- Basic organizational structures: sets, lists, proxies (see "Concepts" below)
|
|
|
+- Versioning: create a version, revert to a version, "soft" deletion and
|
|
|
+ reinstatement of a deleted resource
|
|
|
+- Checksum: cryptographic checksum of data, fixity check of resources in
|
|
|
+ transit and at rest on demand
|
|
|
+- Basic management utilities: integrity check, statistics
|
|
|
+- Serialization and de-serialization to/from Turtle, TriG, N-Triples, N-Quads
|
|
|
+- Python bindings
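The checksum and fixity items above can be illustrated with a minimal sketch. It uses only Python's standard `hashlib` module and is not the `lsup_repo` API, which is not yet defined; the function names `checksum` and `fixity_ok` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, expected, algorithm="sha256"):
    """Compare the current checksum with the one recorded at ingest time."""
    return checksum(path, algorithm) == expected
```

A fixity check at rest would then amount to calling `fixity_ok()` with the digest stored in the resource's technical metadata.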
|
|
|
+
|
|
|
+### Mid Term
|
|
|
+
|
|
|
+- Dump, restore and migration utilities
|
|
|
+- Notification stream
|
|
|
+- Multiple back end options (local, S3, other network protocols)
|
|
|
+- Authentication and access policies
|
|
|
+
|
|
|
+### Long Term
|
|
|
+
|
|
|
+- Checksum of RDF resources (depends on `lsup_rdf` development)
|
|
|
+- Other features as they become necessary
|
|
|
+
|
|
|
+## Concepts
|
|
|
+
|
|
|
+### Scope
|
|
|
+
|
|
|
+Lakesuperior was initially built upon the Fedora repository software. This new
|
|
|
+version is a complete reengineering and repurposing of the previous software,
|
|
|
+seeking to provide similar repository services with a simplified set of
|
|
|
+concepts and constraints, focused on usability and simplicity of design. Unlike
|
|
|
+Fedora, it does not aim to adhere to, alter, or set any API standard.
|
|
|
+Long-term sustainability of the handled data is provided by transparent exports
|
|
|
+into commonly parsable data formats (RDF) and tools to rebuild a repository
|
|
|
+from data files.
|
|
|
+
|
|
|
+### Data Resources
|
|
|
+
|
|
|
+At the center of Lakesuperior is the goal of storing and organizing arbitrary
|
|
|
+files that can be found on a hard drive, remote server, etc. These files are
|
|
|
+called *Data Resources* (DATA-R). Their contents are entirely opaque to
|
|
|
+Lakesuperior, therefore any type of document can be handled.
|
|
|
+
|
|
|
+### Descriptive Resources
|
|
|
+
|
|
|
+Each data resource is accompanied by a *Descriptive Resource* (DESC-R). In the
|
|
|
+first iteration of Lakesuperior this is an RDF named graph which, at a minimum,
|
|
|
+contains a pointer to the data location and basic technical metadata. The URI
|
|
|
+of the named graph is globally unique. This resource stands for the non-RDF
|
|
|
+resource in a Linked Data context. User-defined metadata can also be added to it.
|
|
|
+
|
|
|
+Descriptive resources may also exist independently of data resources for
|
|
|
+cataloging and organizational purposes. They have a few characteristics in
|
|
|
+common:
|
|
|
+
|
|
|
+- They are made up of one or more named graphs stored in the Lakesuperior
|
|
|
+ back end. The URIs of these named graphs follow a specific naming pattern,
|
|
|
+ and the graphs are explicitly linked to each other, so they can be retrieved as one
|
|
|
+ unit.
|
|
|
+- They are normal RDF resources, therefore their content is parsable by
|
|
|
+ Lakesuperior and may be queried.
|
|
|
+- They may consist of several named graphs, each with a purpose defined by the
|
|
|
+ software: library-managed data, user-provided data, versioning data, etc.
|
|
|
+
|
|
|
+Partitioning a DESC-R into multiple graphs allows individual data sets to be
|
|
|
+annotated, e.g. to establish provenance or versioning information about the
|
|
|
+asserted facts. Future developments of `lsup_repo` or software built upon it
|
|
|
+may take advantage of this structure.
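As a rough illustration of this partitioning, the sketch below models a DESC-R as a group of named graphs in a toy in-memory store. Everything in it is hypothetical: the `#admin` / `#user` fragment naming pattern, the `QuadStore` class, and the partition names are assumptions, since the actual naming pattern and storage layer of `lsup_repo` are not described here.

```python
from collections import defaultdict

# Hypothetical partitions: one named graph per purpose.
PARTS = ("admin", "user")


def graph_uri(resource_uri, part):
    """Derive a named graph URI from the resource URI (assumed pattern)."""
    return f"{resource_uri}#{part}"


class QuadStore:
    """Tiny in-memory store mapping graph URIs to sets of triples."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, graph, s, p, o):
        self.graphs[graph].add((s, p, o))

    def desc_resource(self, resource_uri):
        """Collect all partitions of one DESC-R as a {part: triples} dict."""
        return {
            part: set(self.graphs.get(graph_uri(resource_uri, part), set()))
            for part in PARTS
        }
```

Because each partition is its own graph, statements about a partition (for example, who asserted the user-provided triples) can be attached to the graph URI without touching the triples inside it.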
|
|
|
+
|
|
|
+Triples in a DESC-R can have any subject; however, it is recommended to maintain
|
|
|
+some consistency about which subjects are treated in each resource.
|
|
|
+Specifically, the use of a resource as an aggregation or container of triples
|
|
|
+about multiple independent entities is discouraged in favor of the use of
|
|
|
+dedicated data structures, as described below.
|
|
|
+
|
|
|
+### Resource Structures
|
|
|
+
|
|
|
+Descriptive resources can be organized in various aggregation forms. The
|
|
|
+aggregating resources are normal descriptive resources, with specific
|
|
|
+predicates pointing to other descriptive resources.
|
|
|
+
|
|
|
+It is important to note that, unlike in Fedora or other LDP implementations,
|
|
|
+the life cycle of resource aggregations is entirely independent of the
|
|
|
+aggregated resources. In LDP, deleting a container would remove its contained
|
|
|
+resources. Also, in LDP a resource can be contained by only a single container
|
|
|
+(except in the case of indirect containers, to some extent). In Lakesuperior an
|
|
|
+aggregation only "contains" pointers to other entirely independent resources,
|
|
|
+which can be pointed to by an arbitrary number of other aggregations; these can
|
|
|
+be removed at any time without changing the state of the aggregated resources.
|
|
|
+
|
|
|
+On the other hand, deleting a resource that is part of some structure causes a
|
|
|
+scan of all inbound links (see "Referential Integrity" below) and the removal
|
|
|
+of all links to it present in other structures; therefore, the deletion of an
|
|
|
+aggregated resource changes the state of its aggregations.
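The asymmetry between deleting an aggregation and deleting an aggregated resource can be sketched with a toy model. The `Repo` class and the `doc:` / `set:` identifiers below are purely illustrative assumptions and do not reflect how `lsup_repo` stores or scans links.

```python
class Repo:
    """Toy model: resources keyed by URI, each holding a set of outbound links."""

    def __init__(self):
        self.links = {}  # uri -> set of URIs it points to

    def create(self, uri, members=()):
        self.links[uri] = set(members)

    def delete(self, uri):
        """Delete a resource and remove every inbound link to it."""
        del self.links[uri]
        for targets in self.links.values():
            targets.discard(uri)


repo = Repo()
repo.create("doc:1")
repo.create("doc:2")
repo.create("set:a", ["doc:1", "doc:2"])
repo.create("set:b", ["doc:1"])

# Removing an aggregation leaves its members untouched.
repo.delete("set:b")
members_after_aggregation_delete = sorted(repo.links)

# Removing a member changes every aggregation that pointed to it.
repo.delete("doc:1")
set_a_after_member_delete = sorted(repo.links["set:a"])
```

In this sketch the scan is a linear pass over all resources; a real implementation would more likely rely on an index of inbound links.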
|
|
|
+
|
|
|
+The types of structures foreseen for the first implementation of `lsup_repo`
|
|
|
+are:
|
|
|
+
|
|
|
+#### Set
|
|
|
+
|
|
|
+A set is simply a descriptive resource containing an unordered collection of unique
|
|
|
+links to other descriptive resources. Any descriptive resource, including other
|
|
|
+structures, can be used. Shorthand functions for counting and iterating over
|
|
|
+Set members, as well as performing boolean operations on them, shall be made
|
|
|
+available. As it is a descriptive resource, a Set can have descriptive metadata
|
|
|
+added to it, such as taxonomy, descriptions, labels, etc.
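The Set behavior described above maps closely onto Python's built-in `set`. The following sketch is an assumption about what such shorthand functions might look like; the `ResourceSet` class and its method names are hypothetical and are not part of any published `lsup_repo` interface.

```python
class ResourceSet:
    """A descriptive resource holding unique, unordered links plus metadata."""

    def __init__(self, uri, members=(), metadata=None):
        self.uri = uri
        self.members = set(members)  # duplicates collapse automatically
        self.metadata = dict(metadata or {})

    def __len__(self):
        return len(self.members)

    def __iter__(self):
        return iter(self.members)

    def union(self, other, uri):
        return ResourceSet(uri, self.members | other.members)

    def intersection(self, other, uri):
        return ResourceSet(uri, self.members & other.members)

    def difference(self, other, uri):
        return ResourceSet(uri, self.members - other.members)
```

Each boolean operation returns a new Set with its own URI, so the operands stay unchanged and the result can receive its own metadata.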
|
|
|
+
|
|
|
+#### List
|
|
|
+
|
|
|
+A Lakesuperior *List* is the implementation of a "Linked List" data structure.
|
|
|
+It contains a link to a single descriptive resource. This resource, called a
|
|
|
+*List Item*, represents the first item in the list. Each list item, except for
|
|
|
+the last one, contains a single link to the next list item.
|
|
|
+
|
|
|
+In addition, every list item has either:
|
|
|
+
|
|
|
+- A link to the resource it stands for: the List Item is a proxy for an
|
|
|
+ existing resource, which makes it possible to make the same resource part of
|
|
|
+ multiple lists; or
|
|
|
+
|
|
|
+- A link to another list, which results in a nested list.
|
|
|
+
|
|
|
+Shorthand functions to perform common list operations shall be made available.
|
|
|
+As with other descriptive resources, Lists and List Items can have any type of
|
|
|
+user-defined metadata and relationships added.
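A minimal in-memory sketch of this structure follows. The class names `ListItem` and `ResourceList`, and the `flatten` helper, are hypothetical and only illustrate the linked-list shape described above: each item points to the next item and carries either a proxied resource or a nested list.

```python
class ListItem:
    """A list item: a proxy for a resource, or a pointer to a nested list."""

    def __init__(self, uri, proxy_for=None, sublist=None):
        if (proxy_for is None) == (sublist is None):
            raise ValueError("exactly one of proxy_for or sublist is required")
        self.uri = uri
        self.proxy_for = proxy_for
        self.sublist = sublist
        self.next = None  # link to the next item; None for the last one


class ResourceList:
    """A list resource holding a single link to its first item."""

    def __init__(self, uri):
        self.uri = uri
        self.first = None

    def append(self, item):
        if self.first is None:
            self.first = item
            return
        cur = self.first
        while cur.next is not None:
            cur = cur.next
        cur.next = item

    def __iter__(self):
        cur = self.first
        while cur is not None:
            yield cur
            cur = cur.next

    def flatten(self):
        """Yield target resource URIs in order, descending into nested lists."""
        for item in self:
            if item.sublist is not None:
                yield from item.sublist.flatten()
            else:
                yield item.proxy_for
```

Since items are proxies rather than the resources themselves, the same resource can appear in several lists, or several times in one list, without conflict.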
|
|
|
+
|
|
|
+#### Proxy
|
|
|
+
|
|
|
+A List Item is a special case of a *Proxy*, which is a descriptive resource
|
|
|
+standing for another descriptive resource. This indirection is useful for
|
|
|
+adding a specific context to a resource, e.g. additional information on a
|
|
|
+document in the context of a curated collection that is only valid or relevant
|
|
|
+to that collection. Proxies can be aggregated in sets or other structures as
|
|
|
+well, as one sees fit.
|
|
|
+
|
|
|
+Proxy definitions follow the [OAI ORE](http://www.openarchives.org/ore/)
|
|
|
+ontology.
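As an illustration, a proxy could be expressed with the `ore:Proxy`, `ore:proxyFor` and `ore:proxyIn` terms of the ORE vocabulary. The sketch below represents triples as plain Python tuples; the `make_proxy` helper and the `urn:example:` URIs are hypothetical, and whether `lsup_repo` will use exactly these predicates is an assumption.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def make_proxy(proxy_uri, target_uri, aggregation_uri, context=()):
    """Return the triples describing a proxy for target_uri in an aggregation."""
    triples = {
        (proxy_uri, RDF_TYPE, ORE + "Proxy"),
        (proxy_uri, ORE + "proxyFor", target_uri),
        (proxy_uri, ORE + "proxyIn", aggregation_uri),
    }
    # Context-specific statements are attached to the proxy, not the target.
    triples.update((proxy_uri, p, o) for p, o in context)
    return triples


proxy = make_proxy(
    "urn:example:proxy1",
    "urn:example:doc1",
    "urn:example:collection1",
    context=[("urn:example:curatorNote", "Featured item in this collection")],
)
```

The curator's note lives on the proxy, so it applies only within `urn:example:collection1` and the description of `urn:example:doc1` itself is unchanged.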
|
|
|
+
|
|
|
+### Referential Integrity
|
|
|
+
|
|
|
+The concept of Linked Data, which `lsup_repo` is partly built upon, does not
|
|
|
+mandate the guarantee that a link pointing to a resource resolves to an actual
|
|
|
+resource, since it is often impossible to determine which system is responsible
|
|
|
+for managing that resource, let alone to have any agency over it. Therefore,
|
|
|
+"broken links" are not excluded.
|
|
|
+
|
|
|
+`lsup_repo`, however, relies on the assumption that a specific set of resources
|
|
|
+is under its full control, and therefore guarantees that all references to
|
|
|
+internally managed resources are maintained at all times. This means that when
|
|
|
+a resource is deleted, all links pointing to it are identified and removed.
|
|
|
+This is called *referential integrity*.
|
|
|
+
|
|
|
+Tools shall be developed to perform periodic referential integrity checks and
|
|
|
+to notify of dangling links and/or repair them.
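Such a check could look roughly like the sketch below. It assumes that internally managed resources can be recognized by a URI prefix and that links are available as a simple mapping; both are simplifications for illustration, and `find_dangling` and `repair` are hypothetical names rather than planned `lsup_repo` functions.

```python
def find_dangling(links, managed_prefix):
    """Return (source, target) pairs whose target is managed but missing.

    `links` maps each existing resource URI to the set of URIs it points to.
    Targets outside `managed_prefix` are external and are not checked.
    """
    return sorted(
        (src, tgt)
        for src, targets in links.items()
        for tgt in targets
        if tgt.startswith(managed_prefix) and tgt not in links
    )


def repair(links, dangling):
    """Remove each dangling link from its source resource."""
    for src, tgt in dangling:
        links[src].discard(tgt)
```

Keeping detection and repair separate lets a tool either report dangling links or fix them, matching the two options mentioned above.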
|