
Remove SPARQL connector; update documentation.

Stefano Cossu committed 6 years ago
commit 00e4c33cc4

+ 34 - 20
README.md

@@ -3,24 +3,39 @@
 LAKEsuperior is an experimental [Fedora Repository](http://fedorarepository.org)
 implementation.
 
-## Basic concepts
+## Guiding Principles
 
 LAKEsuperior aims at being an uncomplicated, efficient Fedora 4 implementation.
 
-Key features:
+Its main goals are:
+
+- *Simplicity of design:* LAKEsuperior relies on [LMDB](https://symas.com/lmdb/),
+an embedded, high-performance key-value store, for storing metadata, and on
+the filesystem for storing binaries.
+- *Efficiency:* while raw speed is important, LAKEsuperior also aims to be
+conservative with resources. Its memory and CPU footprints are small. Python C
+extensions are used where possible to improve performance.
+- *Reliability:* fully ACID-compliant writes guarantee consistency of data.
+- *Ease of management:* contents can be queried directly via term search or
+SPARQL, without the aid of external indices (see the query sketch after this
+list). Scripts and interfaces for repository administration and monitoring are
+shipped with the standard release.
+- *Portability:* the project aims to maintain a minimal set of dependencies.
+
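The following is a minimal, self-contained sketch of the kind of SPARQL query the *Ease of management* point refers to. It uses rdflib's default in-memory store and made-up resource data purely for illustration; it is not LAKEsuperior's own query interface or API.

```python
# Illustrative only: a SPARQL query over RDF metadata, using rdflib's
# in-memory store and fabricated data as a stand-in for repository contents.
from rdflib import Dataset, Literal, URIRef
from rdflib.namespace import DCTERMS

ds = Dataset()
graph = ds.graph(URIRef('http://example.org/context/main'))
graph.add((
    URIRef('http://example.org/resource/1'),
    DCTERMS.title,
    Literal('A sample resource'),
))

# Find every resource with a dcterms:title, across all named graphs.
query = '''
    SELECT ?resource ?title WHERE {
        GRAPH ?g { ?resource <http://purl.org/dc/terms/title> ?title }
    }
'''
for row in ds.query(query):
    print(row.resource, row.title)
```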
+## Key features
 
 - Drop-in replacement for Fedora4 (with some caveats: see
   [Delta document](doc/notes/fcrepo4_deltas.md))—currently being tested with
   Hyrax 2
-- Stores metadata in a graph store, binaries in filesystem
-- Simple search and SPARQL Query API via back-end triplestore (planned)
-- No performance issues storing many resources under the same container; no
+- Term-based search (*planned*) and SPARQL Query API + UI
+- No performance penalty for storing many resources under the same container; no
   [kudzu](https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml)
   pairtree segmentation <sup id="a1">[1](#f1)</sup>
-- Mitigates "many member" issue: constant performance writing to a resource with
+- Constant write performance on a resource with many children or members;
   option to omit children in retrieval (see the request sketch after this list)
-- Flexible back-end layouts: options to organize information in back end
-- Migration tool (planned)
+- Migration tools (*planned*)
+- Python API (*planned*): authors of Python clients can use LAKEsuperior as an
+  embedded repository with no HTTP traffic or intermediate RDF serialization
+  and deserialization involved.
 
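As a concrete illustration of the "omit children in retrieval" option above, the sketch below shows how a Fedora 4 style client can request a container representation without its containment triples, using the standard LDP `Prefer` header. The base URL is a placeholder rather than a documented LAKEsuperior default, and actual behaviour may differ per the Delta document.

```python
# Hedged example: retrieving a container while omitting its children, via the
# standard LDP Prefer header understood by Fedora 4 style servers.
# The base URL below is a placeholder, not a documented LAKEsuperior default.
import requests

resp = requests.get(
    'http://localhost:8000/ldp/my-container',
    headers={
        'Accept': 'text/turtle',
        'Prefer': ('return=representation; '
                   'omit="http://www.w3.org/ns/ldp#PreferContainment"'),
    },
)
resp.raise_for_status()
print(resp.text)  # container triples, without ldp:contains statements
```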
 Implementation of the official [Fedora API specs](https://fedora.info/spec/)
 (Fedora 5.x and beyond) is not
@@ -30,20 +45,16 @@ project if it gains support.
 Please make sure you read the [Delta document](doc/notes/fcrepo4_deltas.md) for
 divergences with the official Fedora4 implementation.
 
-The application code strives to maintain a linear, intuitive code structure to
-foster collaboration. *TODO link to tech overview and approach*
-
 ## Installation
 
 ### Dependencies
 
-1. A triplestore.
-   [Fuseki](https://jena.apache.org/documentation/fuseki2/#download-fuseki)
-   is the benchmark used so far in development. Other implementations are
-   possible as long as they support RDF 1.1 and SPARQL over HTTP
-1. A message broker supporting the STOMP protocol. If you have a separate
-   instance of official Fedora listening to port 61613, that will do the job
-1. Python 3.5 or greater
+1. Python 3.5 or greater.
+1. The [LMDB](https://symas.com/lmdb/) database library. It should be included
+in most Linux distributions' standard package repositories.
+1. A message broker supporting the STOMP protocol. For testing and evaluation
+purposes, CoilMQ is included in the dependencies and should be automatically
+installed.
 
 ### Installation steps
 
@@ -59,7 +70,7 @@ foster collaboration. *TODO link to tech overview and approach*
    `export FCREPO_CONFIG_DIR=<your config dir location>` (alternatively you can
    add this line to your virtualenv `activate` script)
 1. Configure the application
-1. Start your triplestore and STOMP broker
+1. Start your STOMP broker
 1. Run `util/bootstrap.py` to initialize the binary and graph stores
 1. Run `./fcrepo` for a multi-threaded server or `flask run` for a
    single-threaded development server
@@ -75,7 +86,10 @@ for a rudimentary road map and status.
 
 ## Further documentation
 
-The design documents are in the [doc/pdf](doc/pdf) folder. *@TODO needs update*
+Miscellaneous documents are in the [doc](doc) folder. They will be organized
+and linked better some day.
+
+---
 
 <b id="f1">1</b> However if your client splits pairtrees upstream, such as
 Hyrax does, that obviously needs to change to get rid of the path

+ 13 - 16
lakesuperior/store_layouts/ldp_rs/lmdb_store.py

@@ -170,29 +170,26 @@ class LmdbStore(Store):
     in the same environment due to complications in handling transaction
     contexts.
 
-    There are 3 main data sets (preservation worthy data):
+    There are 4 main data sets (preservation worthy data):
 
-    - tk:t (triple key: pickled triple; unique keys)
-    - tk:c (Triple key: pickled context; multi-valued keys)
+    - t:st (term key: serialized term; 1:1)
+    - spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
+    - c: (context keys only, values are the empty bytestring; 1:1)
     - pfx:ns (prefix: pickled namespace; unique)
 
-    And 8 indices to optimize lookup for all possible bound/unbound term
+    And 6 indices to optimize lookup for all possible bound/unbound term
     combinations in a triple:
 
-    - c:tk (pickled context URI: triple key)
-    - sk:tk (subject key: triple key)
-    - pk:tk (pred key: triple key)
-    - ok:tk (object key: triple key)
-    - spk:tk (subject + predicate key: triple key)
-    - sok:tk (subject + object key: triple key)
-    - pok:tk (predicate + object key: triple key)
+    - th:t (term hash: term key; 1:1)
+    - s:po (S key: joined P, O keys; dupsort, dupfixed)
+    - p:so (P key: joined S, O keys; dupsort, dupfixed)
+    - o:sp (O key: joined S, P keys; dupsort, dupfixed)
+    - c:spo (context → triple association; dupsort, dupfixed)
     - ns:pfx (pickled namespace: prefix; unique)
 
-    The above indices (except for ns:pfx) are all multi-valued and store
-    fixed-length hash values referring to triples for economy's sake.
-
-    The search keys for terms are hashed on lookup. @TODO If this is too slow,
-    we may want to index term hashes.
+    The main data set and the indices are stored in separate environments,
+    i.e. separate files in the filesystem. The indices could be recreated
+    from the main data set in case of a disaster.
     '''
     context_aware = True
     # This is a hassle to maintain for no apparent gain. If some use is devised
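The data set and index names in the docstring above map naturally onto LMDB named databases. Below is a minimal sketch, using the py-lmdb bindings, of how the two environments and their sub-databases could be opened; the paths, map sizes, and key/value lengths are assumptions for illustration, not LAKEsuperior's actual initialization code.

```python
# Illustrative sketch (not LAKEsuperior code): opening the main and index LMDB
# environments with the named databases described in the docstring above.
import lmdb

main_env = lmdb.open('/tmp/lsup_main', max_dbs=8, map_size=1024 ** 3)
idx_env = lmdb.open('/tmp/lsup_index', max_dbs=8, map_size=1024 ** 3)

# Main (preservation-worthy) data sets.
t_st = main_env.open_db(b't:st')      # term key: serialized term (1:1)
spo_c = main_env.open_db(b'spo:c', dupsort=True, dupfixed=True)
c_only = main_env.open_db(b'c:')      # context keys; values are empty
pfx_ns = main_env.open_db(b'pfx:ns')  # prefix: pickled namespace

# Lookup indices, all recreatable from the main data set.
th_t = idx_env.open_db(b'th:t')       # term hash: term key (1:1)
s_po = idx_env.open_db(b's:po', dupsort=True, dupfixed=True)
p_so = idx_env.open_db(b'p:so', dupsort=True, dupfixed=True)
o_sp = idx_env.open_db(b'o:sp', dupsort=True, dupfixed=True)
c_spo = idx_env.open_db(b'c:spo', dupsort=True, dupfixed=True)
ns_pfx = idx_env.open_db(b'ns:pfx')   # pickled namespace: prefix

# Example write: associate a triple key with a context key. Key and value
# lengths here are placeholders; dupfixed only requires that all values in a
# given database share one fixed length.
with main_env.begin(write=True, db=spo_c) as txn:
    txn.put(b'\x00' * 15, b'\x01' * 5, dupdata=True)
```

Under such a layout, a fully bound triple lookup can hit `spo:c` directly, while a lookup bound on a single term would go through one of the `s:po`, `p:so`, or `o:sp` indices.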

+ 0 - 58
lakesuperior/store_layouts/ldp_rs/sparql_connector.py

@@ -1,58 +0,0 @@
-import logging
-
-from abc import ABCMeta
-from pprint import pformat
-
-from rdflib import Dataset
-from rdflib.term import URIRef
-from rdflib.plugins.stores.sparqlstore import SPARQLStore, SPARQLUpdateStore
-from SPARQLWrapper.Wrapper import POST
-
-from lakesuperior.dictionaries.namespaces import ns_collection as nsc
-from lakesuperior.store_layouts.ldp_rs.base_connector import BaseConnector
-
-
-class SparqlConnector(BaseConnector):
-    '''
-    Handles the connection and dataset information.
-
-    This is independent from the application context (production/test) and can
-    be passed any configuration options.
-    '''
-
-    # N.B. This is Fuseki-specific.
-    UNION_GRAPH_URI = URIRef('urn:x-arq:UnionGraph')
-
-    _logger = logging.getLogger(__name__)
-
-    def _init_connection(self, location, query_ep, update_ep=None,
-            autocommit=False):
-        '''
-        Initialize the connection to the SPARQL endpoint.
-
-        If `update_ep` is not specified, the store is initialized as read-only.
-        '''
-        if update_ep:
-            self.store = SPARQLUpdateStore(
-                    queryEndpoint=location + query_ep,
-                    update_endpoint=location + update_ep,
-                    autocommit=autocommit,
-                    dirty_reads=not autocommit)
-
-            self.readonly = False
-        else:
-            self.store = SPARQLStore(
-                    location + query_ep, default_query_method=POST)
-            self.readonly = True
-
-        self.ds = Dataset(self.store, default_union=True)
-
-
-    def optimize_edits(self):
-        opt_edits = [
-                l for l in self.store._edits
-                if not l.strip().startswith('PREFIX')]
-        #opt_edits = list(ns_pfx_sparql.values()) + opt_edits
-        self.store._edits = opt_edits
-        self._logger.debug('Changes to be committed: {}'.format(
-            pformat(self.store._edits)))