Initial README draft and base modules.

Stefano Cossu 3 years ago
parent
commit
3c30aa4197
13 changed files with 490 additions and 10 deletions
  1. 0 6
      .gitignore
  2. 3 0
      .gitmodules
  3. 1 3
      Makefile
  4. 189 0
      README.md
  5. 28 0
      TODO.md
  6. 1 0
      ext/libcsv
  7. 1 1
      ext/lsup_rdf
  8. 32 0
      include/core.h
  9. 136 0
      include/desc.h
  10. 0 0
      include/resource.h
  11. 38 0
      src/core.c
  12. 61 0
      src/desc.c
  13. 0 0
      src/resource.c

+ 0 - 6
.gitignore

@@ -104,12 +104,6 @@ venv.bak/
 
 sandbox.c
 
-# Lexer and parser artifacts.
-
-src/codec/*_grammar.c
-src/codec/*_grammar.h
-src/codec/*_parser.c
-
 # IDE
 .syntastic*
 .vimrc

+ 3 - 0
.gitmodules

@@ -1,3 +1,6 @@
 [submodule "ext/lsup_rdf"]
 	path = ext/lsup_rdf
 	url = https://notabug.org/scossu/lsup_rdf.git
+[submodule "ext/libcsv"]
+	path = ext/libcsv
+	url = https://github.com/rgamble/libcsv.git

+ 1 - 3
Makefile

@@ -5,9 +5,7 @@ INCLUDE_BASE = . -Iinclude -Iext/xxHash -Iext/openldap/libraries/liblmdb \
 	-Iext/tpl/src -Iext/uthash/src -Iext/log/src
 INCLUDE = -I$(INCLUDE_BASE)
 LIB = -luuid -lpthread
-SRC = ext/xxHash/xxhash.c ext/tpl/src/tpl.c \
-	ext/openldap/libraries/liblmdb/mdb.c \
-	ext/openldap/libraries/liblmdb/midl.c ext/log/src/log.c \
+SRC = ext/lsup_rdf/src/*.c \
 	src/*.c src/codec/*_grammar.c src/codec/*_parser.c
 
 DEPS := $(shell echo "${INCLUDE_BASE}" | sed -e 's/ -I/,/g')

+ 189 - 0
README.md

@@ -0,0 +1,189 @@
+# `lsup_repo`
+
+**This software is work in progress.**
+
+`lsup_repo` is a C and Python library providing embedded (server-less)
+data repository services. It builds upon a foundational library, [`lsup_rdf`](
+https://notabug.org/scossu/lsup_rdf), which handles RDF and graph data.
+
+## Use
+
+`lsup_repo` can be included in a C or Python program to manage the life cycle
+of RDF and non-RDF data. It can store and manage documents of any format and
+size, and catalog them via RDF metadata.
+
+`lsup_repo` does not need to run a server for its core functionality. The
+interaction with the library is done via a C or Python API. That said, a REST
+API or any other type of server can be built with relative ease on top of this
+library. A separate project, based on this library, may in the future provide a
+REST API for generic resource management, likely based on some existing
+standard.
+
+The current goal of this development is to build a minimum-viable product (MVP)
+to replace the essential functionality of a previous project,
+[Lakesuperior](https://notabug.org/scossu/lakesuperior).
+
+## Status
+
+Pre-alpha. Currently at the beginning of the implementation phase. The
+structure of the code may change radically. Features mentioned below should be
+understood as goals.
+
+## Features
+
+### Short Term (MVP)
+
+- Handle the life cycle of arbitrary data documents (local disk storage)
+- Handle the life cycle of RDF resources describing such documents
+- Basic organizational structures: sets, lists, proxies (see "Concepts" below)
+- Versioning: create a version, revert to a version, "soft" deletion and
+  reinstatement of a deleted resource
+- Checksum: cryptographic checksum of data, fixity check of resources in
+  transit and at rest on demand
+- Basic management utilities: integrity check, statistics
+- Serialization and de-serialization to/from Turtle, TriG, N-Triples, N-quads
+- Python bindings
+
+### Mid Term
+
+- Dump, restore and migration utilities
+- Notification stream
+- Multiple back end options (local, S3, other network protocols)
+- Authentication and access policies
+
+### Long Term
+
+- Checksum of RDF resources (depends on `lsup_rdf` development)
+- Other features as they become necessary
+
+## Concepts
+
+### Scope
+
+Lakesuperior was initially built upon the Fedora repository software. This new
+version is a complete reengineering and repurposing of the previous software,
+seeking to provide similar repository services with a simplified set of
+concepts and constraints, focused on usability and simplicity of design. Unlike
+Fedora, it does not aim to adhere to, alter, or set any API standard.
+Long-term sustainability of the handled data is provided by transparent exports
+into commonly parsable data formats (RDF) and tools to rebuild a repository
+from data files.
+
+### Data resources
+
+At the center of Lakesuperior is the goal of storing and organizing arbitrary
+files that can be found on a hard drive, remote server, etc. These files are
+called *Data Resources* (DATA-R). Their contents are entirely opaque to
+Lakesuperior, therefore any type of document can be handled.
+
+### Descriptive Resources
+
+Each data resource is accompanied by a *Descriptive Resource* (DESC-R). In the
+first iteration of Lakesuperior this is an RDF named graph which at a minimum
+contains a pointer to the data location and basic technical metadata. The URI
+of the named graph is globally unique. This resource stands for the non-RDF
+resource in a Linked Data context. It can also carry user-defined metadata.
+
+Descriptive resources may also exist independently of data resources for
+cataloging and organizational purposes. They have a few characteristics in
+common:
+
+- They are made up of one or more named graphs stored in the Lakesuperior
+  back end. The URIs of each named graph follow a specific naming pattern as
+  well as having explicit links between them, so they can be retrieved as one
+  unit.
+- They are normal RDF resources, therefore their content is parsable by
+  Lakesuperior and may be queried.
+- They may consist of several named graphs, each with a purpose defined by the
+  software: library-managed data, user-provided data, versioning data, etc.
+
+Partitioning a DESC-R into multiple graphs allows individual data sets to be
+annotated, e.g. to establish provenance or versioning information about the
+asserted facts. Future developments of `lsup_repo` or software built upon it
+may take advantage of this structure.
+
+Triples in a DESC-R can have any subject; however it is recommended to maintain
+some consistency about which subjects are treated in each resource.
+Specifically, the use of a resource as an aggregation or container of triples
+about multiple independent entities is discouraged in favor of the use of
+dedicated data structures, as described below.
+
+### Resource Structures
+
+Descriptive resources can be organized in various aggregation forms. The
+aggregating resources are normal descriptive resources, with specific
+predicates pointing to other descriptive resources.
+
+It is important to note that, unlike in Fedora or other LDP implementations,
+the life cycle of resource aggregations is entirely independent of the
+aggregated resources. In LDP, deleting a container would remove its contained
+resources. Also, in LDP a resource can only be contained by a single container
+(except in the case of indirect containers, to some extent). In Lakesuperior an
+aggregation only "contains" pointers to other entirely independent resources,
+which can be pointed to by an arbitrary number of other aggregations, which can
+be removed at any time without changing the state of the aggregated resources.
+
+On the other hand, deleting a resource that is part of some structure causes a
+scan of all inbound links (see "Referential Integrity" below) and the removal
+of all links to it present in other structures; therefore, the deletion of an
+aggregated resource changes the state of its aggregations.
+
+The types of structures foreseen for the first implementation of `lsup_repo`
+are:
+
+#### Set
+
+A set is simply a descriptive resource containing an unordered collection of
+unique links to other descriptive resources. Any descriptive resource,
+including other structures, can be used. Shorthand functions for counting and
+iterating over Set members, as well as performing boolean operations on them,
+shall be made available. As it is a descriptive resource, a Set can have
+descriptive metadata added to it, such as taxonomy, descriptions, labels, etc.
+
+#### List
+
+A Lakesuperior *List* is the implementation of a "Linked List" data structure.
+It contains a link to a single descriptive resource. This resource, called a
+*List Item*, represents the first item in the list. Each list item, except for
+the last one, contains a single link to the next list item.
+
+In addition, every list item has either:
+
+- A link to the resource it stands for: the List Item is a proxy for an
+  existing resource, which makes it possible to make the same resource part of
+  multiple lists; or
+
+- A link to another list, which results in a nested list.
+
+Shorthand functions to perform common list operations shall be made available.
+As with other descriptive resources, Lists and List Items can have any type of
+user-defined metadata and relationships added.
+
+#### Proxy
+
+A List Item is a special case of a *Proxy*, which is a descriptive resource
+standing for another descriptive resource. This indirection is useful for
+adding a specific context to a resource, e.g. additional information on a
+document in the context of a curated collection that is only valid or relevant
+to that collection. Proxies can be aggregated in sets or other structures as
+well, as one sees fit.
+
+Proxy definitions follow the [OAI ORE](http://www.openarchives.org/ore/)
+ontology.
+
+### Referential Integrity
+
+The concept of Linked Data, which `lsup_repo` is partly built upon, does not
+mandate the guarantee that a link pointing to a resource resolves to an actual
+resource, since it is often impossible to determine which system is responsible
+for managing that resource, let alone having any control over it. Therefore,
+"broken links" are not excluded.
+
+`lsup_repo`, however, relies on the assumption that a specific set of resources
+is under its full control, and therefore guarantees that all references to
+internally managed resources are maintained at all times. This means that when
+a resource is deleted, all links pointing to it are identified and removed.
+This is called *referential integrity*.
+
+Tools shall be developed to perform periodic referential integrity checks and
+to report and/or repair dangling links.
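
Review note on the README's "Resource Structures" and "Referential Integrity" sections: the following is a minimal, self-contained C sketch of the behavior described there, not the `lsup_repo` implementation. It assumes a toy repository in which resources are array slots and links are entries in a boolean matrix. Every `Demo`/`demo_` name is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_RES 8

/* Hypothetical toy repository: each resource is a slot, and
 * links[a][b] == true means resource a holds a link to resource b. */
typedef struct {
    bool exists[DEMO_MAX_RES];
    bool links[DEMO_MAX_RES][DEMO_MAX_RES];
} DemoRepo;

/* True if `id` is in range and refers to an existing resource. */
static bool demo_valid (const DemoRepo *repo, int id)
{ return id >= 0 && id < DEMO_MAX_RES && repo->exists[id]; }

/* Create a resource in the first free slot; return its ID, or -1 if full. */
int demo_create (DemoRepo *repo)
{
    for (int i = 0; i < DEMO_MAX_RES; i++) {
        if (!repo->exists[i]) {
            repo->exists[i] = true;
            return i;
        }
    }
    return -1;
}

/* Link two existing resources; refuse links that would dangle. */
bool demo_link (DemoRepo *repo, int from, int to)
{
    if (!demo_valid (repo, from) || !demo_valid (repo, to)) return false;
    repo->links[from][to] = true;
    return true;
}

/* Count the outbound links (e.g. Set members) of a resource. */
int demo_count_out (const DemoRepo *repo, int id)
{
    if (!demo_valid (repo, id)) return 0;
    int n = 0;
    for (int i = 0; i < DEMO_MAX_RES; i++) if (repo->links[id][i]) n++;
    return n;
}

/* Delete a resource along with its outbound and inbound links. */
void demo_delete (DemoRepo *repo, int id)
{
    if (!demo_valid (repo, id)) return;
    for (int i = 0; i < DEMO_MAX_RES; i++) {
        repo->links[id][i] = false;     /* The resource's own pointers. */
        repo->links[i][id] = false;     /* Pointers held by other resources. */
    }
    repo->exists[id] = false;
}
```

With a zero-initialized `DemoRepo`, deleting an aggregation leaves its members untouched, while deleting a member removes it from every aggregation that pointed to it. A real back end would need an index of inbound links rather than a full matrix scan.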

+ 28 - 0
TODO.md

@@ -0,0 +1,28 @@
+# Quick TODO list
+
+*P* = pending; *W* = working on it; *D* = done.
+
+## Critical for MVP
+
+- *P* Descriptive resource CRUD
+- *P* Data resource CRUD
+- *P* Modular back end layout
+- *P* Data resource back end (disk)
+- *P* Checksum
+  - *P* SHA256
+  - *P* BLAKE2
+- Structures
+  - Sets
+  - Proxies
+  - Linked Lists
+- Versioning
+- 
+
+## Non-critical for MVP
+
+- 
+
+
+## Long-term
+
+- Workflow management

+ 1 - 0
ext/libcsv

@@ -0,0 +1 @@
+Subproject commit 6e750805f54f81a26470486d6ed79efaf7d13805

+ 1 - 1
ext/lsup_rdf

@@ -1 +1 @@
-Subproject commit 979cb4f408dcdf687e1e628e443236c2c9c85dd7
+Subproject commit 1a4343c0712f697d8de9b5dc163cf2a38f7e668a

+ 32 - 0
include/core.h

@@ -0,0 +1,32 @@
+/** @file core.h
+ *
+ * @brief LSUP_REPO core module.
+ */
+
+#ifndef _LSR_CORE_H
+#define _LSR_CORE_H
+
+#include "lsup_rdf.h"
+
+
+/** @brief Whether the environment is initialized.
+ */
+extern bool LSR_is_init;
+
+
+/** @brief Initialize LSUP and LSR environments.
+ *
+ * This function is idempotent.
+ */
+LSUP_rc LSR_init (void);
+
+
+/** @brief Tear down LSUP and LSR environments.
+ *
+ * This function is idempotent. It is also called automatically at the end of
+ * a program (atexit()).
+ */
+void LSR_done (void);
+
+
+#endif /* _LSR_CORE_H */
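
Review note on `include/core.h`: the header documents `LSR_init()` and `LSR_done()` as idempotent, with teardown also running via `atexit()`. The committed `src/core.c` does not yet set `LSR_is_init`, so the sketch below shows one way that contract could be met. It is a self-contained illustration with hypothetical `demo_` names and counters, not code from the repository.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for LSR_is_init, LSR_init() and LSR_done(). */
bool demo_is_init = false;
int demo_init_count = 0;    /* How many times real setup work ran. */
int demo_done_count = 0;    /* How many times real teardown work ran. */

void demo_done (void)
{
    if (!demo_is_init) return;      /* Already torn down: no-op. */
    demo_done_count++;
    demo_is_init = false;
}

int demo_init (void)
{
    static bool registered = false;

    if (demo_is_init) return 0;     /* Already set up: no-op. */

    /* Register teardown only once, so it also runs at normal exit. */
    if (!registered) {
        if (atexit (demo_done) != 0) return -1;
        registered = true;
    }

    demo_init_count++;
    demo_is_init = true;
    return 0;
}
```

Calling either function twice in a row performs the work once. The flag guard in `demo_done()` also makes the `atexit()` call harmless when the caller has already torn the environment down.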

+ 136 - 0
include/desc.h

@@ -0,0 +1,136 @@
+/** @file desc.h
+ *
+ * @brief Descriptive (RDF) resource.
+ *
+ * The Descriptive Resource (DESC-R), together with the Data Resource (DATA-R),
+ * is the building block of lsup_repo information. Its contents are fully
+ * understood by the library.
+ */
+
+#ifndef _LSR_DESC_H
+#define _LSR_DESC_H
+
+#include "core.h"
+
+
+/*
+ * Typedefs.
+ */
+
+typedef char LSR_id[UUIDSTR_SIZE];
+
+typedef enum res_flags_t {
+    LSR_RS_MANAGED          = 1 << 0,       // Managed by the repo.
+} LSR_ResFlags;
+
+/** @brief DESC-R structure.
+ *
+ * A Descriptive Resource (DESC-R) is made up of several RDF named graphs.
+ * Each graph has a function defined by the framework and can be managed by
+ * the framework or by the user.
+ *
+ * The URI of each graph is derived from the ID of the resource. This is only
+ * for readability: when stored, the resource generates triples that
+ * explicitly link these graphs together. These URIs have a URN prefix so that
+ * they are portable and, in a Linked Data server, can be easily replaced with
+ * absolute URLs based on the server root.
+ *
+ * The `user_attr` and `admin_attr` graphs contain triples whose objects are
+ * not managed resources—i.e. URIs on the WWW, blank nodes or literal terms.
+ * These triples are generally referred to as "attributes" in this
+ * documentation.
+ *
+ * The `user_rel` and `admin_rel` graphs contain exclusively triples whose
+ * objects are URIs managed by the current repository. They may have special
+ * functionality attached and may be created as a direct consequence of a user
+ * action (e.g. adding a member to a Set or List). These triples are
+ * generally referred to as "relationships" in this documentation.
+ *
+ * Relationships between these graphs are expressed by triples stored in the
+ * default graph.
+ */
+typedef struct desc_t {
+    LSUP_Graph *            user_attr;      // User-defined attributes.
+    LSUP_Graph *            admin_attr;     // Managed attributes.
+    LSUP_Graph *            user_rel;       // User-defined relationships.
+    LSUP_Graph *            admin_rel;      // Managed relationships.
+    uuid_t                  id;             // Resource identifier (UUID4).
+    LSR_ResFlags            flags;          // Flags.
+} LSR_Desc;
+
+
+/*
+ * API functions.
+ */
+
+/** @brief Create an in-memory DESC-R from graph data.
+ *
+ * The resource is volatile until it is stored in a persistent back end. It
+ * must be stored in a context-capable back end (e.g. `LSUP_STORE_MDB`).
+ *
+ * The resource is assigned a UUID4. The resource URI, used in
+ * relationships, is the ID prefixed with the `LSR_NS_DESC` namespace.
+ *
+ * @param[in] data Graph with triples to populate the new resource. The graph
+ *  URI is not used. All URIs in the graph may be relative to the resource.
+ *  Hence, to reference the resource itself, an IRIRef with an empty string as
+ *  data is used.
+ *
+ * @param[out] rsrc Resource handle. It must be freed with #LSR_desc_free().
+ */
+LSUP_rc
+LSR_desc_new (LSUP_Graph *data, LSR_Desc **rsrc);
+
+
+/** @brief Create an in-memory DESC-R from a stored resource.
+ *
+ * Once created, the resource may be modified independently from its stored
+ * counterpart. In order to make changes permanent, it must be stored again
+ * using #LSR_desc_store().
+ *
+ * @param[in] id ID of the resource to be retrieved, without the namespace.
+ *
+ * @param[out] rsrc Resource handle. It must be freed with #LSR_desc_free().
+ */
+LSUP_rc
+LSR_desc_get (LSR_id id, LSR_Desc **rsrc);
+
+
+/** @brief Free a DESC-R.
+ */
+void
+LSR_desc_free (LSR_Desc *rsrc);
+
+
+/** @brief Store a DESC-R, overwriting any data if it already exists.
+ *
+ * This is a "create or overwrite" function that deletes any existing resource
+ * under the given ID before storing the content of the in-memory resource
+ * at hand.
+ *
+ * All data are copied and the resource may be freed after this operation.
+ *
+ * @param[in] rsrc Resource to be created or overwritten.
+ */
+LSUP_rc
+LSR_desc_store (LSR_Desc *rsrc);
+
+
+/** @brief Perform a delta update on a stored resource.
+ *
+ * This function operates directly on a stored resource without the need to
+ * provide an in-memory DESC-R. It first deletes triples by given patterns,
+ * then adds triples. Both steps are optional.
+ *
+ * @param[in] id ID of the resource to be modified, without the namespace.
+ *
+ * @param[in] remove Array of 3-member arrays of terms. Each array of
+ *  terms represents the s, p, o bound terms. Any and all can be NULL, as in
+ *  #LSUP_graph_remove(). The array is terminated by a `{NULL}` array.
+ *
+ * @param[in] add Array of triples to be added, terminated by a NULL.
+ */
+LSUP_rc
+LSR_desc_update (LSR_id id, LSUP_Term **remove, LSUP_Triple *add);
+
+#endif /* _LSR_DESC_H */
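
Review note on the delta update declared at the end of `include/desc.h`: the sketch below illustrates the documented "remove by pattern, then add" order on a small in-memory array. It is an assumption-laden illustration, not the library's behavior: terms are plain C strings rather than `LSUP_Term`, the store is a fixed-size array, and all `Demo`/`demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX 16

/* Hypothetical stand-in: a triple of plain strings instead of LSUP_Term. */
typedef struct { const char *s, *p, *o; } DemoTriple;

typedef struct {
    DemoTriple  items[DEMO_MAX];
    size_t      count;
} DemoStore;

/* A NULL pattern member matches any term in that position. */
static int demo_match (const char *pattern, const char *term)
{ return pattern == NULL || strcmp (pattern, term) == 0; }

/* An all-NULL pattern marks the end of the pattern array. */
static int demo_is_end (const char *const pattern[3])
{ return !pattern[0] && !pattern[1] && !pattern[2]; }

/* Remove triples by pattern first, then add. Either array may be NULL.
 * `add` ends with a triple whose subject is NULL. Returns the number of
 * triples removed plus added, or -1 if the store is full. */
int demo_update (
        DemoStore *st, const char *(*rm_patterns)[3], const DemoTriple *add)
{
    int changed = 0;

    for (size_t r = 0; rm_patterns && !demo_is_end (rm_patterns[r]); r++) {
        size_t kept = 0;
        for (size_t i = 0; i < st->count; i++) {
            DemoTriple t = st->items[i];
            if (
                demo_match (rm_patterns[r][0], t.s) &&
                demo_match (rm_patterns[r][1], t.p) &&
                demo_match (rm_patterns[r][2], t.o)
            ) changed++;                        /* Drop matching triple. */
            else st->items[kept++] = t;         /* Compact the survivors. */
        }
        st->count = kept;
    }

    for (size_t a = 0; add && add[a].s; a++) {
        if (st->count == DEMO_MAX) return -1;
        st->items[st->count++] = add[a];
        changed++;
    }

    return changed;
}
```

One consequence of using an all-NULL array as the terminator, at least in this sketch, is that a "match everything" pattern cannot be expressed, since it is indistinguishable from the end marker.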

+ 0 - 0
include/resource.h


+ 38 - 0
src/core.c

@@ -0,0 +1,38 @@
+#include "core.h"
+
+
+bool LSR_is_init = false;
+
+/*
+ * Repo-managed predicates.
+ */
+static char *mgd_pred_str [] = {
+    "ebucore:hasMimeType",
+    "lsup:created",
+    "lsup:createdBy",
+    "lsup:lastModified",
+    "lsup:lastModifiedBy",
+    "premis:hasSize",
+    "premis:hasMessageDigest",
+
+    NULL
+};
+
+
+LSUP_rc LSR_init (void)
+{
+    LSUP_rc rc = LSUP_init();
+    if (rc < 0) return rc;
+
+    // TODO other initialization.
+
+    return LSUP_OK;
+}
+
+
+void LSR_done (void)
+{
+    // TODO other teardown.
+
+    LSUP_done();
+}
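
Review note on `src/core.c`: `mgd_pred_str` is a NULL-terminated array that nothing reads yet. A lookup over it could be as simple as the sketch below. The predicate names are copied from the diff; the function `demo_pred_is_managed` is hypothetical, and comparing prefixed names as plain strings is an assumption (the real code would presumably compare expanded terms).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same prefixed names as in src/core.c; the lookup itself is hypothetical. */
static const char *demo_mgd_pred_str[] = {
    "ebucore:hasMimeType",
    "lsup:created",
    "lsup:createdBy",
    "lsup:lastModified",
    "lsup:lastModifiedBy",
    "premis:hasSize",
    "premis:hasMessageDigest",

    NULL
};

/* Linear scan up to the NULL sentinel; fine for a handful of entries. */
bool demo_pred_is_managed (const char *pred)
{
    if (pred == NULL) return false;

    for (size_t i = 0; demo_mgd_pred_str[i] != NULL; i++)
        if (strcmp (demo_mgd_pred_str[i], pred) == 0) return true;

    return false;
}
```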

+ 61 - 0
src/desc.c

@@ -0,0 +1,61 @@
+#include "desc.h"
+
+LSUP_rc
+LSR_desc_new (LSUP_Graph *data, LSR_Desc **rsrc_p)
+{
+    LSUP_rc rc = LSUP_NOACTION;
+
+    LSR_Desc *rsrc;
+    MALLOC_GUARD (rsrc, LSUP_MEM_ERR);
+
+    LSUP_GraphIterator
+        *lu_it      = LSUP_graph_lookup (data, NULL, NULL, NULL, NULL),
+        *uattr_it   = LSUP_graph_add_init (rsrc->user_attr),
+        *aattr_it   = LSUP_graph_add_init (rsrc->admin_attr),
+        *urel_it    = LSUP_graph_add_init (rsrc->user_rel),
+        *arel_it    = LSUP_graph_add_init (rsrc->admin_rel);
+
+    LSUP_Triple src_spo;
+
+    while (LSUP_graph_iter_next (lu_it, &src_spo) == LSUP_OK) {
+        LSUP_Triple dest_spo = {
+            LSUP_term_copy (src_spo.s),
+            LSUP_term_copy (src_spo.p),
+            LSUP_term_copy (src_spo.o),
+        };
+
+        // TODO if the pred is managed, ignore the triple and send a warning.
+
+        // TODO Else if the resource is a structure type, handle special workflow.
+
+        // TODO Else if the subject or object is a resource, check if it exists; if
+        // it does, add triple to urel; if not, return an error.
+
+        // TODO Else add triple to user_attr.
+    }
+
+finally:
+    LSUP_graph_iter_free (lu_it);
+    LSUP_graph_add_done (uattr_it);
+    LSUP_graph_add_done (aattr_it);
+    LSUP_graph_add_done (urel_it);
+    LSUP_graph_add_done (arel_it);
+
+    if (rc == LSUP_OK) *rsrc_p = rsrc;
+    else {
+        LSR_desc_free (rsrc);
+        *rsrc_p = NULL;
+    }
+
+    return rc;
+}
+
+
+void LSR_desc_free (LSR_Desc *rsrc)
+{
+    LSUP_graph_free (rsrc->user_attr);
+    LSUP_graph_free (rsrc->admin_attr);
+    LSUP_graph_free (rsrc->user_rel);
+    LSUP_graph_free (rsrc->admin_rel);
+    free (rsrc);
+}
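
Review note on the TODO comments in `LSR_desc_new()`: the routing they describe (skip managed predicates, send links to managed resources to the relationship graph, send everything else to the attribute graph) can be sketched as a pure decision function. This is a hedged illustration only. It inspects the object alone, as in the `desc.h` definition of relationships, and omits the structure-type branch. The namespace value, the `DemoRoute` enum and the `demo_` functions are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical namespace; the real value of LSR_NS_DESC is not shown here. */
#define DEMO_NS_DESC "urn:demo:desc:"

typedef enum {
    DEMO_SKIP_MANAGED,      /* Managed predicate: ignore and warn. */
    DEMO_TO_USER_REL,       /* Object is an existing managed resource. */
    DEMO_TO_USER_ATTR,      /* Anything else. */
    DEMO_ERR_MISSING,       /* Object looks managed but does not exist. */
} DemoRoute;

/* Return true if `item` is found in the NULL-terminated `list`. */
static bool demo_in_list (const char *const *list, const char *item)
{
    for (size_t i = 0; list[i] != NULL; i++)
        if (strcmp (list[i], item) == 0) return true;
    return false;
}

/* Decide where a user-supplied triple goes, based on predicate and object. */
DemoRoute demo_route (
        const char *pred, const char *obj,
        const char *const *managed_preds, const char *const *known_uris)
{
    if (demo_in_list (managed_preds, pred)) return DEMO_SKIP_MANAGED;

    if (strncmp (obj, DEMO_NS_DESC, strlen (DEMO_NS_DESC)) == 0)
        return demo_in_list (known_uris, obj)
            ? DEMO_TO_USER_REL : DEMO_ERR_MISSING;

    return DEMO_TO_USER_ATTR;
}
```

Keeping the decision separate from the graph insertion would make it testable without a store. Note also that, as committed, `rc` in `LSR_desc_new()` never leaves `LSUP_NOACTION`, so the function always frees the new resource. Whatever fills in these TODOs will need to set it.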

+ 0 - 0
src/resource.c