
Convert all .md docs to .rst.

Stefano Cossu, 7 years ago
Parent
Current commit: af458f1604

+ 0 - 175
README.md

@@ -1,175 +0,0 @@
-# LAKEsuperior
-
-[![build status](
-  http://img.shields.io/travis/scossu/lakesuperior/master.svg?style=flat)](
- https://travis-ci.org/username/repo)
-
-LAKEsuperior is an alternative [Fedora Repository](http://fedorarepository.org)
-implementation.
-
-## Guiding Principles
-
-LAKEsuperior aims at being an uncomplicated, efficient Fedora 4 implementation.
-
-Its main goals are:
-
-- **Reliability:** Based on solid technologies with stability in mind.
-- **Efficiency:** Small memory and CPU footprint, high scalability.
-- **Ease of management:** Tools to perform monitoring and maintenance included.
-- **Simplicity of design:** Straight-forward architecture, robustness over
-  features.
-
-## Key features
-
-- Drop-in replacement for Fedora4 (with some
-  [caveats](docs/notes/fcrepo4_deltas.md)); currently being tested with Hyrax 2
-- Very stable persistence layer based on [LMDB](https://symas.com/lmdb/) and
-  filesystem. Fully ACID-compliant writes guarantee consistency of data.
-- Term-based search (*planned*) and SPARQL Query API + UI
-- No performance penalty for storing many resources under the same container; no
-  [kudzu](https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml)
-  pairtree segmentation <sup id="a1">[1](#f1)</sup>
-- Extensible [provenance metadata](docs/notes/model.md) tracking
-- [Multi-modal access](docs/notes/architecture.md#multi-modal-access): HTTP
-  (REST), command line interface and native Python API.
-- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
-
-Implementation of the official [Fedora API specs](https://fedora.info/spec/)
-(Fedora 5.x and beyond) is not
-foreseen in the short term, however it would be a natural evolution of this
-project if it gains support.
-
-Please make sure you read the [Delta document](docs/notes/fcrepo4_deltas.md) for
-divergences with the official Fedora4 implementation.
-
-## Target Audience
-
-LAKEsuperior is for anybody who cares about preserving data in the long term.
-
-Less vaguely, LAKEsuperior is targeted at who needs to store large quantities
-of highly linked metadata and documents.
-
-Its Python/C environment and API make it particularly well suited for academic
-and scientific environments who would be able to embed it in a Python
-application as a library or extend it via plug-ins.
-
-LAKEsuperior is able to be exposed to the Web as a
-[Linked Data Platform](https://www.w3.org/TR/ldp-primer/) server. It also acts
-as a SPARQL query (read-only) endpoint, however it is not meant to be used as
-a full-fledged triplestore at the moment.
-
-In its current status, LAKEsuperior is aimed at developers and
-hands-on managers who are interested in evaluating this project.
-
-## Quick Install: Running in Docker
-
-You can run LAKEsuperior in Docker for a hands-off quickstart.
-
-[Docker](http://docker.com/) is a containerization platform that allows you to run
-services in lightweight virtual machine environments without having to worry about
-installing all of the prerequisites on your host machine.
-
-1. Install the correct [Docker Community Edition](https://www.docker.com/community-edition)
-   for your operating system.
-1. Clone this repo: `git clone https://github.com/scossu/lakesuperior.git`
-1. `cd` into repo folder
-1. Run `docker-compose up`
-
-LAKEsuperior should now be available at `http://localhost:8000/`.
-
-The provided Docker configuration includes persistent storage as a self-container Docker
-volume, meaning your data will persist between runs. If you want to clear the decks,
-simply run `docker-compose down -v`.
-
-## Manual Install (a bit less quick, a bit more power)
-
-**Note:** These instructions have been tested on Linux. They may work on Darwin
-with little modification, and possibly on Windows with some
-modifications. Feedback is welcome.
-
-### Dependencies
-
-1. Python 3.5 or greater.
-1. A message broker supporting the STOMP protocol. For testing and evaluation
-purposes, [CoilMQ](https://github.com/hozn/coilmq) is included with the
-dependencies and should be automatically installed.
-
-### Installation steps
-
-1. Create a virtualenv in a project folder:
-   `virtualenv -p <python 3.5+ exec path> <virtualenv folder>`
-1. Activate the virtualenv: `source <path_to_virtualenv>/bin/activate`
-1. Clone this repo: `git clone https://github.com/scossu/lakesuperior.git`
-1. `cd` into repo folder
-1. Install dependencies: `pip install -r requirements.txt`
-1. Start your STOMP broker, e.g.: `coilmq &`. If you have another queue manager
-   listening to port 61613 you can either configure a different port on the
-   application configuration, or use the existing message queue.
-1. Run `./lsup-admin bootstrap` to initialize the binary and graph stores
-1. Run `./fcrepo`.
-
-### Configuration
-
-The app should run for testing and evaluation purposes without any further
-configuration. All the application data are stored by default in the `data`
-directory.
-
-To change the default configuration you should:
-
-1. Copy the `etc.skeleton` folder to a separate location
-1. Set the configuration folder location in the environment:
-   `export FCREPO_CONFIG_DIR=<your config dir location>` (you can
-   add this line at the end of your virtualenv `activate` script)
-1. Configure the application
-1. Bootstrap the app or copy the original data folders to the new location if
-   any loction options changed
-1. (Re)start the server: `./fcrepo`
-
-The configuration options are documented in the files.
-
-**Note:** `test.yml` must specify a different location for the graph and for
-the binary stores than the default one, otherwise running a test suite will
-destroy your main data store. The application will issue an error message and
-refuse to start if these locations overlap.
-
-### Production deployment
-
-If you like fried repositories for lunch, deploy before 11AM.
-
-## Status and development
-
-LAKEsuperior is in **alpha** status. Please see the
-[project issues](https://github.com/scossu/lakesuperior/issues) list for a
-rudimentary road map.
-
-## Contributing
-
-This has been so far a single person's off-hours project (with much input from
-several sides). In order to turn into anything close to a Beta release and
-eventually to a production-ready implementation, it needs some community love.
-
-Contributions are welcome in all forms, including ideas, issue reports, or
-even just spinning up the software and providing some feedback. LAKEsuperior is
-meant to live as a community project.
-
-## Technical documentation
-
-[Architecture Overview](docs/notes/architecture.md)
-
-[Content Model](docs/notes/model.md)
-
-[Messaging](docs/notes/messaging.md)
-
-[Migration, Backup & Restore](docs/notes/migration.md)
-
-[Command-Line Reference](docs/notes/cli.md)
-
-[Storage Implementation](docs/notes/storage.md)
-
-[Performance Benchmarks](docs/notes/performance.md)
-
----
-
-<b id="f1">1</b> However if your client splits pairtrees upstream, such as
-Hyrax does, that obviously needs to change to get rid of the path
-segments. [↩](#a1)

+ 194 - 0
README.rst

@@ -0,0 +1,194 @@
+LAKEsuperior
+============
+
+|build status|
+
+LAKEsuperior is an alternative `Fedora
+Repository <http://fedorarepository.org>`__ implementation.
+
+Guiding Principles
+------------------
+
+LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
+implementation.
+
+Its main goals are:
+
+-  **Reliability:** Based on solid technologies with stability in mind.
+-  **Efficiency:** Small memory and CPU footprint, high scalability.
+-  **Ease of management:** Tools to perform monitoring and maintenance
+   included.
+-  **Simplicity of design:** Straightforward architecture, robustness
+   over features.
+
+Key features
+------------
+
+-  Drop-in replacement for Fedora4 (with some
+   `caveats <docs/fcrepo4_deltas.rst>`__); currently being tested
+   with Hyrax 2
+-  Very stable persistence layer based on
+   `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
+   ACID-compliant writes guarantee consistency of data.
+-  Term-based search (*planned*) and SPARQL Query API + UI
+-  No performance penalty for storing many resources under the same
+   container; no
+   `kudzu <https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
+   pairtree segmentation [1]_
+-  Extensible `provenance metadata <docs/model.rst>`__ tracking
+-  `Multi-modal
+   access <docs/architecture.rst#multi-modal-access>`__: HTTP
+   (REST), command line interface and native Python API.
+-  Fits in a pocket: you can carry 50M triples in an 8 GB memory stick.
+
+Implementation of the official `Fedora API
+specs <https://fedora.info/spec/>`__ (Fedora 5.x and beyond) is not
+foreseen in the short term; however, it would be a natural evolution of
+this project if it gains support.
+
+Please make sure you read the `Delta
+document <docs/fcrepo4_deltas.rst>`__ for divergences with the
+official Fedora4 implementation.
+
+Target Audience
+---------------
+
+LAKEsuperior is for anybody who cares about preserving data in the long
+term.
+
+Less vaguely, LAKEsuperior is targeted at those who need to store large
+quantities of highly linked metadata and documents.
+
+Its Python/C environment and API make it particularly well suited for
+academic and scientific environments, which can embed it in a Python
+application as a library or extend it via plug-ins.
+
+LAKEsuperior can be exposed to the Web as a `Linked Data
+Platform <https://www.w3.org/TR/ldp-primer/>`__ server. It also acts as
+a read-only SPARQL query endpoint; however, it is not meant to be used
+as a full-fledged triplestore at the moment.
+
+In its current state, LAKEsuperior is aimed at developers and hands-on
+managers who are interested in evaluating this project.
+
+Quick Install: Running in Docker
+--------------------------------
+
+You can run LAKEsuperior in Docker for a hands-off quickstart.
+
+`Docker <http://docker.com/>`__ is a containerization platform that
+allows you to run services in lightweight virtual machine environments
+without having to worry about installing all of the prerequisites on
+your host machine.
+
+1. Install the correct `Docker Community
+   Edition <https://www.docker.com/community-edition>`__ for your
+   operating system.
+2. Clone this repo:
+   ``git clone https://github.com/scossu/lakesuperior.git``
+3. ``cd`` into repo folder
+4. Run ``docker-compose up``
+
+LAKEsuperior should now be available at ``http://localhost:8000/``.
+
+The provided Docker configuration includes persistent storage as a
+self-contained Docker volume, meaning your data will persist between
+runs. If you want to clear the decks, simply run
+``docker-compose down -v``.
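+
+As a quick sanity check (a minimal sketch; it assumes the Docker stack
+above is running and uses the default ``/ldp`` root endpoint), you can
+fetch the repository root from Python:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> # Ask for a Turtle representation of the repository root.
+    >>> req = Request('http://localhost:8000/ldp', headers={'Accept': 'text/turtle'})
+    >>> print(urlopen(req).read().decode())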
+
+Manual Install (a bit less quick, a bit more power)
+---------------------------------------------------
+
+**Note:** These instructions have been tested on Linux. They may work on
+Darwin with little modification, and possibly on Windows with some
+modifications. Feedback is welcome.
+
+Dependencies
+~~~~~~~~~~~~
+
+1. Python 3.5 or greater.
+2. A message broker supporting the STOMP protocol. For testing and
+   evaluation purposes, `CoilMQ <https://github.com/hozn/coilmq>`__ is
+   included with the dependencies and should be automatically installed.
+
+Installation steps
+~~~~~~~~~~~~~~~~~~
+
+1. Create a virtualenv in a project folder:
+   ``virtualenv -p <python 3.5+ exec path> <virtualenv folder>``
+2. Activate the virtualenv: ``source <path_to_virtualenv>/bin/activate``
+3. Clone this repo:
+   ``git clone https://github.com/scossu/lakesuperior.git``
+4. ``cd`` into repo folder
+5. Install dependencies: ``pip install -r requirements.txt``
+6. Start your STOMP broker, e.g.: ``coilmq &``. If you have another
+   queue manager listening on port 61613, you can either configure a
+   different port in the application configuration or use the existing
+   message queue.
+7. Run ``./lsup-admin bootstrap`` to initialize the binary and graph
+   stores
+8. Run ``./fcrepo``.
+
+Configuration
+~~~~~~~~~~~~~
+
+The app should run for testing and evaluation purposes without any
+further configuration. All application data is stored by default in
+the ``data`` directory.
+
+To change the default configuration you should:
+
+1. Copy the ``etc.skeleton`` folder to a separate location
+2. Set the configuration folder location in the environment:
+   ``export FCREPO_CONFIG_DIR=<your config dir location>`` (you can add
+   this line at the end of your virtualenv ``activate`` script)
+3. Configure the application
+4. Bootstrap the app or copy the original data folders to the new
+   location if any location options changed
+5. (Re)start the server: ``./fcrepo``
+
+The configuration options are documented in the files.
+
+**Note:** ``test.yml`` must specify different locations for the graph
+and binary stores than the default ones; otherwise, running the test
+suite will destroy your main data store. The application will issue an
+error message and refuse to start if these locations overlap.
+
+Production deployment
+~~~~~~~~~~~~~~~~~~~~~
+
+If you like fried repositories for lunch, deploy before 11AM.
+
+Status and development
+----------------------
+
+LAKEsuperior is in **alpha** status. Please see the `project
+issues <https://github.com/scossu/lakesuperior/issues>`__ list for a
+rudimentary road map.
+
+Contributing
+------------
+
+So far this has been a single person’s off-hours project (with much
+input from several sides). In order to turn it into anything close to a
+beta release and eventually a production-ready implementation, it needs
+some community love.
+
+Contributions are welcome in all forms, including ideas, issue reports,
+or even just spinning up the software and providing some feedback.
+LAKEsuperior is meant to live as a community project.
+
+Documentation
+-------------
+
+The documentation is maintained on `Read The Docs
+<http://lakesuperior.readthedocs.io/en/latest/>`__.
+
+--------------
+
+.. [1] However, if your client splits pairtrees upstream, as Hyrax does,
+   that obviously needs to change to get rid of the path segments.
+
+.. |build status| image:: http://img.shields.io/travis/scossu/lakesuperior/master.svg?style=flat
+   :target: https://travis-ci.org/scossu/lakesuperior

+ 48 - 0
docs/api.rst

@@ -0,0 +1,48 @@
+API Documentation
+==================
+
+.. module:: lakesuperior
+
+.. automodule:: lakesuperior.api.resource
+   :members:
+
+.. automodule:: lakesuperior.api.query
+   :members:
+
+.. automodule:: lakesuperior.api.admin
+   :members:
+
+
+Resource API
+~~~~~~~~~~~~
+
+.. autoclass:: lakesuperior.model.ldpr.Ldpr
+   :members:
+   :inherited-members:
+
+.. autoclass:: lakesuperior.model.ldp_factory.LdpFactory
+   :members:
+   :inherited-members:
+
+Low-Level API
+~~~~~~~~~~~~~
+
+Store Layout
+^^^^^^^^^^^^
+
+.. autoclass:: lakesuperior.store.ldp_rs.rsrc_centric_layout.RsrcCentricLayout
+   :members:
+   :inherited-members:
+
+.. autoclass:: lakesuperior.store.ldp_nr.base_non_rdf_layout.BaseNonRdfLayout
+   :members:
+   :inherited-members:
+
+RDF Store Abstraction
+^^^^^^^^^^^^^^^^^^^^^
+
+.. automodule:: lakesuperior.store.ldp_rs.lmdb_store
+   :members:
+   :inherited-members:
+
+

+ 48 - 0
docs/architecture.rst

@@ -0,0 +1,48 @@
+LAKEsuperior Architecture
+=========================
+
+LAKEsuperior is written in Python. Parts of the code may eventually be
+rewritten in `Cython <http://cython.readthedocs.io/>`__ for
+performance.
+
+Multi-Modal Access
+------------------
+
+LAKEsuperior services and data are accessible in multiple ways:
+
+-  Via HTTP. This is the canonical way to interact with LDP resources
+   and conforms quite closely to the Fedora specs (currently v4).
+-  Via command line. This method includes long-running admin tasks which
+   are not available via HTTP.
+-  Via a Python API. This allows Python scripts to access the same
+   functionality available via the two access methods above in a
+   programmatic way. It is possible to write Python plugins or even to
+   embed LAKEsuperior in a Python application, without running a web
+   server.
+
+Architecture Overview
+---------------------
+
+.. figure:: assets/lakesuperior_arch.png
+   :alt: LAKEsuperior Architecture
+
+   LAKEsuperior Architecture
+
+The LAKEsuperior REST API provides access to the underlying Python API.
+All REST and CLI operations can be replicated by a Python program
+accessing this API.
+
+The main advantage of the Python API is that it makes it very easy to
+manipulate graph and binary data without the need to serialize or
+deserialize native data structures. This matters, for example, when
+handling large ETL jobs.
+
+The Python API is divided into three main areas:
+
+-  `Resource API <../../lakesuperior/api/resource.py>`__. This API is in
+   charge of all the resource CRUD operations and implements the
+   majority of the Fedora specs.
+-  `Admin API <../../lakesuperior/api/admin.py>`__. This exposes utility
+   methods, mostly long-running maintenance jobs.
+-  `Query API <../../lakesuperior/api/query.py>`__. This provides
+   several facilities for querying repository data.
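+
+As a minimal sketch of this programmatic access (it reuses the calls
+shown in the Content Model document; the resource UID is purely
+illustrative), a Python script can fetch a resource without going
+through HTTP:
+
+::
+
+    >>> import lakesuperior.env_setup
+    >>> from lakesuperior.api import resource
+    >>> # Retrieve an LDP resource by its UID and read its identifier back.
+    >>> rsrc = resource.get('/a/b/c')
+    >>> rsrc.uid
+    /a/b/c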

+ 33 - 0
docs/cli.rst

@@ -0,0 +1,33 @@
+LAKEsuperior Command Line Reference
+===================================
+
+The LAKEsuperior command line tool is used for maintenance and
+administration purposes.
+
+The script is invoked from the main install directory. The tool is
+self-documented, so this is just a redundant overview:
+
+::
+
+    $ ./lsup_admin
+    Usage: lsup-admin [OPTIONS] COMMAND [ARGS]...
+
+    Options:
+      --help  Show this message and exit.
+
+      bootstrap     Bootstrap binary and graph stores.
+      check_fixity  [STUB] Check fixity of a resource.
+      check_refint  [STUB] Check referential integrity.
+      cleanup       [STUB] Clean up orphan database items.
+      copy          [STUB] Copy (backup) repository data.
+      dump          [STUB] Dump repository to disk.
+      load          [STUB] Load serialized repository data.
+      stats         Print repository statistics.
+
+All entries marked ``[STUB]`` are not yet implemented; however, the
+``lsup_admin <command> --help`` command will issue a description of what
+the command is meant to do. Please see the `TODO <TODO>`__ document for
+a rough road map.
+
+All of the above commands are also available via, and based upon, the
+native Python API.

+ 242 - 0
docs/fcrepo4_deltas.rst

@@ -0,0 +1,242 @@
+Divergences between LAKEsuperior and FCREPO4
+=============================================
+
+This is a (vastly incomplete) list of discrepancies between the current
+FCREPO4 implementation and LAKEsuperior. More will be added as more
+clients use it.
+
+Not yet implemented (but in the plans)
+--------------------------------------
+
+-  Various headers handling
+-  Versioning (incomplete)
+-  AuthN/Z
+-  Fixity check
+-  Blank nodes
+
+Potentially breaking changes
+----------------------------
+
+The following divergences may lead to incompatibilities with some
+clients.
+
+Atomicity
+~~~~~~~~~
+
+FCREPO4 supports batch atomic operations, whereby a transaction can be
+opened and a number of operations (i.e. multiple R/W requests to the
+repository) can be performed. The operations are persisted in the
+repository only if and when the transaction is committed.
+
+LAKEsuperior only supports atomicity within a single HTTP request: a
+single HTTP request that should result in multiple write operations to
+the storage layer is only persisted if no exception is thrown.
+Otherwise, the operation is rolled back in order to prevent resources
+from being left in an inconsistent state.
+
+Tombstone methods
+~~~~~~~~~~~~~~~~~
+
+If a client requests a tombstone resource in FCREPO4 with a method other
+than DELETE, the server will return ``405 Method Not Allowed``
+regardless of whether the tombstone exists or not.
+
+LAKEsuperior will return ``405`` only if the tombstone actually exists,
+``404`` otherwise.
+
+Web UI
+~~~~~~
+
+FCREPO4 includes a web UI for simple CRUD operations.
+
+Such a UI is not in the immediate LAKEsuperior development plans.
+However, a basic UI is available for read-only interaction: LDP resource
+browsing, SPARQL query and other search facilities, and administrative
+tools. Some of the latter *may* involve write operations, such as
+clean-up tasks.
+
+Automatic path segment generation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A ``POST`` request without a slug in FCREPO4 results in a pairtree
+consisting of several intermediate nodes leading to the automatically
+minted identifier. E.g.
+
+::
+
+    POST /rest
+
+results in ``/rest/8c/9a/07/4e/8c9a074e-dda3-5256-ea30-eec2dd4fcf61``
+being created.
+
+The same request in LAKEsuperior would create
+``/rest/8c9a074e-dda3-5256-ea30-eec2dd4fcf61`` (obviously the
+identifiers will be different).
+
+This seems to break Hyrax at some point, but might have been fixed. This
+needs to be verified further.
+
+Non-standard client breaking changes
+------------------------------------
+
+The following changes may be incompatible with clients relying on some
+FCREPO4 behavior not endorsed by LDP or other specifications.
+
+Pairtrees
+~~~~~~~~~
+
+FCREPO4 generates “pairtree” resources if a resource is created in a
+path whose segments are missing. E.g. when creating ``/a/b/c/d``, if
+``/a/b`` and ``/a/b/c`` do not exist, FCREPO4 will create two Pairtree
+resources. POSTing and PUTting into Pairtrees is not allowed. Also, a
+containment triple is established between the closest LDPC and the
+created resource, e.g. if ``a`` exists, a
+``</a> ldp:contains </a/b/c/d>`` triple is created.
+
+LAKEsuperior does not employ Pairtrees. In the example above
+LAKEsuperior would create a fully qualified LDPC for each missing
+segment, which can be POSTed and PUT to. Containment triples are created
+between each link in the path, i.e. ``</a> ldp:contains </a/b>``,
+``</a/b> ldp:contains </a/b/c>`` etc. This may potentially break clients
+relying on the direct containment model.
+
+The rationale behind this change is that Pairtrees are the byproduct of
+a limitation imposed by Modeshape and introduce complexity in the
+software stack and confusion for the client. LAKEsuperior aligns with
+the more intuitive UNIX filesystem model, where each segment of a path
+is a “folder” or container (except for the leaf nodes that can be either
+folders or files). In any case, clients are discouraged from generating
+deep paths in LAKEsuperior without a specific purpose because these
+resources create unnecessary data.
+
+Non-mandatory, non-authoritative slug in version POST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO4 requires a ``Slug`` header to POST to ``fcr:versions`` to create
+a new version.
+
+LAKEsuperior adheres to the more general FCREPO POST rule: if no slug
+is provided, an automatic ID is generated instead. The ID is a UUID4.
+
+Note that internally this ID is not called “label” but “uid” since it is
+treated as a fully qualified identifier. The ``fcrepo:hasVersionLabel``
+predicate, however ambiguous in this context, will be kept until the
+adoption of Memento, which will change the retrieval mechanisms.
+
+Also, if a POST is issued on the same resource ``fcr:versions`` location
+using a version ID that already exists, LAKEsuperior will just mint a
+random identifier rather than returning an error.
+
+Deprecation track
+-----------------
+
+LAKEsuperior offers some “legacy” options to replicate the FCREPO4
+behavior; however, it encourages new development to use a different
+approach for some types of interaction.
+
+Endpoints
+~~~~~~~~~
+
+The FCREPO root endpoint is ``/rest``. The LAKEsuperior root endpoint is
+``/ldp``.
+
+This should not pose a problem if a client does not have ``rest``
+hard-coded in its code, but in any event, the ``/rest`` endpoint is
+provided for backwards compatibility.
+
+LAKEsuperior adds the (currently stub) ``query`` endpoint. Other
+endpoints for non-LDP services may be opened in the future.
+
+Automatic LDP class assignment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since LAKEsuperior rejects client-provided server-managed triples, and
+since the LDP types are among them, the LDP container type is inferred
+from the provided properties: if the ``ldp:hasMemberRelation`` and
+``ldp:membershipResource`` properties are provided, the resource is a
+Direct Container. If in addition to these the
+``ldp:insertedContentRelation`` property is present, the resource is an
+Indirect Container. If any of the first two are missing, the resource is
+a Container (@TODO discuss: shall it be a Basic Container?)
+
+Clients are encouraged to omit LDP types in PUT, POST and PATCH
+requests.
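+
+For illustration (a hedged sketch: the endpoint, target path and
+membership predicate are hypothetical and only meant to show the
+inference rule above), a POST whose payload carries only
+``ldp:membershipResource`` and ``ldp:hasMemberRelation`` should yield a
+Direct Container:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> body = b'''
+    ... @prefix ldp: <http://www.w3.org/ns/ldp#> .
+    ... <> ldp:membershipResource </ldp/members> ;
+    ...    ldp:hasMemberRelation <http://purl.org/dc/terms/hasPart> .
+    ... '''
+    >>> req = Request('http://localhost:8000/ldp', data=body, method='POST',
+    ...     headers={'Content-Type': 'text/turtle'})
+    >>> urlopen(req).status  # expect 201 Created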
+
+Lenient handling
+~~~~~~~~~~~~~~~~
+
+FCREPO4 requires server-managed triples to be expressly indicated in a
+PUT request, unless the ``Prefer`` header is set to
+``handling=lenient; received="minimal"``, in which case the RDF payload
+must not have any server-managed triples.
+
+LAKEsuperior works under the assumption that clients should never
+provide server-managed triples. It automatically handles PUT requests
+sent to existing resources by returning a 412 if any server-managed
+triples are included in the payload. This is the same as setting
+``Prefer`` to ``handling=strict``, which is the default.
+
+If ``Prefer`` is set to ``handling=lenient``, all server-managed triples
+sent with the payload are ignored.
+
+Clients using the ``Prefer`` header to control PUT behavior as
+advertised by the specs should not notice any difference.
+
+Optional improvements
+---------------------
+
+The following are improvements in performance or usability that can only
+be taken advantage of if client code is adjusted.
+
+LDP-NR content and metadata
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO4 relies on the ``/fcr:metadata`` identifier to retrieve RDF
+metadata about an LDP-NR. LAKEsuperior supports this as a legacy option,
+but encourages the use of content negotiation to do the same while
+offering explicit endpoints for RDF and non-RDF content retrieval.
+
+Any request to an LDP-NR with an ``Accept`` header set to one of the
+supported RDF serialization formats will yield the RDF metadata of the
+resource instead of the binary contents.
+
+The ``fcr:metadata`` URI returns the RDF metadata of an LDP-NR.
+
+The ``fcr:content`` URI returns the non-RDF content.
+
+The two options above return an HTTP error if requested for an LDP-RS.
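+
+For example (a sketch; the binary resource path is hypothetical),
+requesting an LDP-NR with an RDF ``Accept`` header returns its metadata
+rather than its bytes:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> req = Request('http://localhost:8000/ldp/my-image',
+    ...     headers={'Accept': 'text/turtle'})
+    >>> print(urlopen(req).read().decode())  # RDF metadata, not the image payload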
+
+“Include” and “Omit” options for children
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LAKEsuperior offers an additional ``Prefer`` header option to exclude
+all references to child resources (i.e. by removing all the
+``ldp:contains`` triples) while leaving the other server-managed triples
+when retrieving a resource:
+
+::
+
+    Prefer: return=representation; [include | omit]="http://fedora.info/definitions/v4/repository#Children"
+
+The default behavior is to include all children URIs.
+
+Soft-delete and purge
+~~~~~~~~~~~~~~~~~~~~~
+
+**NOTE**: The implementation of this section is incomplete and debated.
+
+In FCREPO4, deleting a resource leaves a tombstone and removes all
+traces of the previous resource.
+
+In LAKEsuperior, a normal DELETE creates a new version snapshot of the
+resource and puts a tombstone in its place. The resource versions are
+still available in the ``fcr:versions`` location. The resource can be
+“resurrected” by issuing a POST to its tombstone. This will result in a
+``201``.
+
+If a tombstone is deleted, the resource and its versions are completely
+deleted (purged).
+
+Moreover, setting the ``Prefer:no-tombstone`` header option on DELETE
+allows a resource and its versions to be deleted directly without
+leaving a tombstone.
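+
+For example (a sketch; the path is hypothetical and this behavior is
+still debated as noted above), a resource can be purged in one step:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> req = Request('http://localhost:8000/ldp/a/b/c', method='DELETE',
+    ...     headers={'Prefer': 'no-tombstone'})
+    >>> urlopen(req).status  # the resource and its versions are purged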

+ 16 - 1
docs/index.rst

@@ -3,13 +3,28 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

-Welcome to lakesuperior's documentation!
+LAKEsuperior
========================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

+    Architecture Overview <architecture>
+    Divergences from Fedora 4 <fcrepo4_deltas>
+    Content Model <model>
+    Messaging SPI <messaging>
+    Migration Guide <migration>
+    Command Line Reference <cli>
+    Storage Implementation <storage>
+    Performance Benchmarks <performance>
+    API documentation <api>
+
+.. toctree::
+   :maxdepth: 3
+   :caption: Technical notes:
+
+    notes/indexing_strategy


Indices and tables

+ 30 - 0
docs/messaging.rst

@@ -0,0 +1,30 @@
+LAKEsuperior Messaging
+======================
+
+LAKEsuperior implements a messaging system based on ActivityStreams, as
+indicated by the `Fedora API
+specs <https://fedora.info/2017/06/30/spec/#notifications>`__. The
+metadata set provided is currently quite minimal but can be easily
+enriched by extending the `default formatter
+class <https://github.com/scossu/lakesuperior/blob/master/lakesuperior/messaging/messenger.py>`__.
+
+STOMP is the only supported protocol at the moment. More protocols may
+be made available at a later time.
+
+LAKEsuperior can send messages to any number of destinations: see
+`configuration <https://github.com/scossu/lakesuperior/blob/master/etc.defaults/application.yml#L79>`__.
+By default, CoilMQ is provided for testing purposes and listens to
+``localhost:61613``. The default route sends messages to
+``/topic/fcrepo``.
+
+A small command-line utility, also provided with the Python
+dependencies, allows you to watch incoming messages. To monitor
+messages, enter the following *after activating your virtualenv*:
+
+::
+
+    stomp -H localhost -P 61613 -L /topic/fcrepo
+
+See the `stomp.py library reference
+page <https://github.com/jasonrbriggs/stomp.py/wiki/Command-Line-Access>`__
+for details.
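+
+Messages can also be consumed programmatically. The sketch below is
+written against the stomp.py 4.x API available at the time of writing
+(listener and ``subscribe`` signatures differ in later releases) and
+assumes the default broker and topic configured above:
+
+::
+
+    import time
+    import stomp
+
+    class FcrepoListener(stomp.ConnectionListener):
+        def on_message(self, headers, message):
+            # 'message' carries the ActivityStreams payload sent by LAKEsuperior.
+            print(message)
+
+    conn = stomp.Connection([('localhost', 61613)])
+    conn.set_listener('', FcrepoListener())
+    conn.start()          # not needed in stomp.py >= 5
+    conn.connect(wait=True)
+    conn.subscribe(destination='/topic/fcrepo', id=1, ack='auto')
+    time.sleep(60)        # keep the script alive long enough to see messages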

+ 65 - 0
docs/migration.rst

@@ -0,0 +1,65 @@
+Migration, Backup & Restore
+===========================
+
+All LAKEsuperior data is by default fully contained in a folder. This
+means that only the data, configurations and code folders are needed for
+it to run. No Postgres, Redis, or such. Data and configuration folders
+can be moved around as needed.
+
+Migration Tool
+--------------
+
+Migration is the process of importing and converting data from a
+different Fedora or LDP implementation into a new LAKEsuperior instance.
+This process uses the HTTP/LDP API of the original repository. A
+command-line utility is available as part of the ``lsup-admin`` suite to
+assist in such operation.
+
+A repository can be migrated with a one-line command such as:
+
+::
+
+    ./lsup-admin migrate http://source-repo.edu/rest /local/dest/folder
+
+For more options, enter
+
+::
+
+    ./lsup-admin migrate --help
+
+The script will crawl through the resources and follow outbound links
+within them. In order to do this, resources are added as raw triples
+(i.e. no consistency checks are made).
+
+**Note:** the consistency check tool has not been implemented yet, but
+its release should follow shortly. This will ensure that all the links
+between resources are consistent in regard to referential integrity.
+
+This script will create a full dataset in the specified destination
+folder, complete with a default configuration that allows you to start
+the LAKEsuperior server immediately after the migration is complete.
+
+Two approaches to migration are possible:
+
+1. By providing a starting point on the source repository. E.g. if the
+   repository you want to migrate is at ``http://repo.edu/rest/prod``
+   you can add the ``-s /prod`` option to the script to avoid migrating
+   irrelevant branches. Note that the script will still reach outside of
+   the starting point if resources are referencing other resources
+   outside of it.
+2. By providing a file containing a list of resources to migrate. This
+   is useful if a source repository cannot produce a full list (e.g. the
+   root node has more children than the server can handle) but a list of
+   individual resources is available via an external index (Solr,
+   triplestore, etc.). The resources can be indicated by their fully
+   qualified URIs or paths relative to the repository root. (*TODO
+   latter option needs testing*)
+
+Backup And Restore
+------------------
+
+A backup of a LAKEsuperior repository consists of copying the RDF and
+non-RDF data folders. These folders are indicated in the application
+configuration. The default commands provided by your OS (``cp``,
+``rsync``, ``tar`` etc. for Unix) are all that is needed.

+ 66 - 0
docs/model.rst

@@ -0,0 +1,66 @@
+LAKEsuperior Content Model Rationale
+====================================
+
+Internal and Public URIs; Identifiers
+-------------------------------------
+
+Resource URIs are stored internally in LAKEsuperior as domain-agnostic
+URIs with the scheme ``info:fcres<resource UID>``. This allows resources
+to be portable across systems. E.g. a resource with an internal URI of
+``info:fcres/a/b/c``, when accessed via the
+``http://localhost:8000/ldp`` endpoint, will be found at
+``http://localhost:8000/ldp/a/b/c``.
+
+The resource UID that makes up the rest of the URI looks like a UNIX
+filesystem path, i.e. it always starts with a forward slash and can be
+made up of multiple segments separated by slashes. E.g. ``/`` is the
+root node UID and ``/a`` is a resource UID just below root. Their
+internal URIs are ``info:fcres/`` and ``info:fcres/a`` respectively.
+
+In the Python API, the UID and internal URI of an LDP resource can be
+accessed via the ``uid`` and ``uri`` properties respectively:
+
+::
+
+    >>> import lakesuperior.env_setup
+    >>> from lakesuperior.api import resource
+    >>> rsrc = resource.get('/a/b/c')
+    >>> rsrc.uid
+    /a/b/c
+    >>> rsrc.uri
+    rdflib.term.URIRef('info:fcres/a/b/c')
+
+Store Layout
+------------
+
+One of the key concepts in LAKEsuperior is the store layout. This is a
+module built with a specific purpose in mind, i.e. allowing fine-grained
+recording of provenance metadata while providing reasonable performance.
+
+Store layout modules could be replaceable (work needs to be done to
+develop an interface to allow that). The default (and currently the
+only) layout shipped with LAKEsuperior is the `resource-centric
+layout <../../lakesuperior/store/ldp_rs/rsrc_centric_layout>`__. This
+layout implements a so-called `graph-per-aspect
+pattern <http://patterns.dataincubator.org/book/graph-per-aspect.html>`__
+which stores different sets of statements about a resource in separate
+named graphs.
+
+The named graphs used for each resource are:
+
+-  An admin graph (``info:fcsystem/graph/admin<resource UID>``) which
+   stores administrative metadata, mostly server-managed triples such as
+   LDP types, system create/update timestamps and agents, etc.
+-  A structure graph (``info:fcsystem/graph/structure<resource UID>``)
+   reserved for containment triples. The reason for this separation is
+   purely convenience, since it makes it easy to retrieve all the
+   properties of a large container without its child references.
+-  One (and, possibly, in the future, more user-defined) named graph for
+   user-provided data
+   (``info:fcsystem/graph/userdata/_main<resource UID>``).
+
+Each of these graphs can be annotated with provenance metadata. The
+layout decides which triples go in which graph based on the predicate or
+RDF type contained in the triple. Adding logic to support arbitrary
+named graphs based e.g. on user agent, or to add more provenance
+information, should be relatively simple.
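+
+As an illustration of the naming scheme above (the UID is arbitrary), a
+resource with UID ``/a/b/c`` has its triples spread across:
+
+::
+
+    info:fcsystem/graph/admin/a/b/c
+    info:fcsystem/graph/structure/a/b/c
+    info:fcsystem/graph/userdata/_main/a/b/c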

+ 0 - 43
docs/notes/architecture.md

@@ -1,43 +0,0 @@
-# LAKEsuperior Architecture
-
-LAKEsuperior is written in Python. It is not excluded that parts of the code
-may be rewritten in [Cython](http://cython.readthedocs.io/) for performance.
-
-
-## Multi-Modal Access
-
-LAKEsuperior services and data are accessible in multiple ways:
-
-- Via HTTP. This is the canonical way to interact with LDP resources and
-  conforms quite closely to the Fedora specs (currently v4).
-- Via command line. This method includes long-running admin tasks which are not
-  available via HTTP.
-- Via a Python API. This method allows to use Python scripts to access the same
-  methods available to the two methods above in a programmatic way. It is
-  possible to write Python plugins or even to embed LAKEsuperior in a
-  Python application, even without running a web server.
-
-
-## Architecture Overview
-
-![LAKEsuperior Architecture](../assets/lakesuperior_arch.png)
-
-The LAKEsuperior REST API provides access to the underlying Python API. All
-REST and CLI operations can be replicated by a Python program accessing this
-API.
-
-The main advantage of the Python API is that it makes it very easy to maipulate
-graph and binary data without the need to serialize or deserialize native data
-structures. This matters when handling large ETL jobs for example.
-
-The Python API is divided in three main areas:
-
-- [Resource API](../../lakesuperior/api/resource.py). This API is in charge of
-  all the resource CRUD operations and implements the majority of the Fedora
-  specs.
-- [Admin API](../../lakesuperior/api/admin.py). This exposes utility methods,
-  mostly long-running maintenance jobs.
-- [Query API](../../lakesuperior/api/query.py). This provides several
-  facilities for querying repository data.
-
-

+ 0 - 33
docs/notes/cli.md

@@ -1,33 +0,0 @@
-# LAKEsuperior Command Line Reference
-
-The LAKEsuperior command line tool is used for maintenance and administration
-purposes.
-
-The script is invoked from the main install directory. The tool is
-self-documented, so this is just a redundant overview:
-
-```
-$ ./lsup_admin
-Usage: lsup-admin [OPTIONS] COMMAND [ARGS]...
-
-Options:
-  --help  Show this message and exit.
-
-  bootstrap     Bootstrap binary and graph stores.
-  check_fixity  [STUB] Check fixity of a resource.
-  check_refint  [STUB] Check referential integrity.
-  cleanup       [STUB] Clean up orphan database items.
-  copy          [STUB] Copy (backup) repository data.
-  dump          [STUB] Dump repository to disk.
-  load          [STUB] Load serialized repository data.
-  stats         Print repository statistics.
-
-```
-
-All entries marked `[STUB]` are not yet implemented, however the
-`lsup_admin <command> --help` command will issue a description of what the
-command is meant to do. Please see the [TODO](TODO) document for a rough road
-map.
-
-All of the above commands are also available via, and based upon, the native
-Python API.

+ 0 - 213
docs/notes/fcrepo4_deltas.md

@@ -1,213 +0,0 @@
-# Divergencies between lakesuperior and FCREPO4
-
-This is a (vastly incomplete) list of discrepancies between the current FCREPO4
-implementation and LAKEsuperior. More will be added as more clients will use
-it.
-
-
-## Not yet implemented (but in the plans)
-
-- Various headers handling
-- Versioning (incomplete)
-- AuthN/Z
-- Fixity check
-- Blank nodes
-
-
-## Potentially breaking changes
-
-The following  divergences may lead into incompatibilities with some clients.
-
-### Atomicity
-
-FCREPO4 supports batch atomic operations whereas a transaction can be opened
-and a number of operations (i.e. multiple R/W requests to the repository) can
-be performed. The operations are persisted in the repository only if and when
-the transaction is committed.
-
-LAKesuperior only supports atomicity for a single HTTP request. I.e. a single
-HTTTP request that should result in multiple write operations to the storage
-layer is only persisted if no exception is thrown. Otherwise, the operation is
-rolled back in order to prevent resources to be left in an inconsistent state.
-
-### Tombstone methods
-
-If a client requests a tombstone resource in
-FCREPO4 with a method other than DELETE, the server will return `405 Method Not
-Allowed` regardless of whether the tombstone exists or not.
-
-LAKEsuperior will return `405` only if the tombstone actually exists, `404`
-otherwise.
-
-### Web UI
-
-FCREPO4 includes a web UI for simple CRUD operations.
-
-Such a UI is not in the immediate LAKEsuperior development plans. However, a
-basic UI is available for read-only interaction: LDP resource browsing, SPARQL
-query and other search facilities, and administrative tools. Some of the latter
-*may* involve write operations, such as clean-up tasks.
-
-### Automatic path segment generation
-
-A `POST` request without a slug in FCREPO4 results in a pairtree consisting of
-several intermediate nodes leading to the automatically minted identifier. E.g.
-
-    POST /rest
-
-results in `/rest/8c/9a/07/4e/8c9a074e-dda3-5256-ea30-eec2dd4fcf61` being
-created.
-
-The same request in LAKEsuperior would create
-`/rest/8c9a074e-dda3-5256-ea30-eec2dd4fcf61` (obviously the identifiers will be
-different).
-
-This seems to break Hyrax at some point, but might have been fixed. This needs
-to be verified further.
-
-
-## Non-standard client breaking changes
-
-The following changes may be incompatible with clients relying on some FCREPO4
-behavior not endorsed by LDP or other specifications.
-
-### Pairtrees
-
-FCREPO4 generates "pairtree" resources if a resource is created in a path whose
-segments are missing. E.g. when creating `/a/b/c/d`, if `/a/b` and `/a/b/c` do
-not exist, FCREPO4 will create two Pairtree resources. POSTing and PUTting into
-Pairtrees is not allowed. Also, a containment triple is established between the
-closest LDPC and the created resource, e.g. if `a` exists, a `</a> ldp:contains
-</a/b/c/d>` triple is created.
-
-LAKEsuperior does not employ Pairtrees. In the example above LAKEsuperior would
-create a fully qualified LDPC for each missing segment, which can be POSTed and
-PUT to. Containment triples are created between each link in the path, i.e.
-`</a> ldp:contains </a/b>`, `</a/b> ldp:contains </a/b/c>` etc. This may
-potentially break clients relying on the direct containment model.
-
-The rationale behind this change is that Pairtrees are the byproduct of a
-limitation imposed by Modeshape and introduce complexity in the software stack
-and confusion for the client. LAKEsuperior aligns with the more intuitive UNIX
-filesystem model, where each segment of a path is a "folder" or container
-(except for the leaf nodes that can be either folders or files). In any
-case, clients are discouraged from generating deep paths in LAKEsuperior
-without a specific purpose because these resources create unnecessary data.
-
-### Non-mandatory, non-authoritative slug in version POST
-
-FCREPO4 requires a `Slug` header to POST to `fcr:versions` to create a new
-version.
-
-LAKEsuperior adheres to the more general FCREPO POST rule and if no slug is
-provided, an automatic ID is generated instead. The ID is a UUID4.
-
-Note that internally this ID is not called "label" but "uid" since it
-is treated as a fully qualified identifier. The `fcrepo:hasVersionLabel`
-predicate, however ambiguous in this context, will be kept until the adoption
-of Memento, which will change the retrieval mechanisms.
-
-Also, if a POST is issued on the same resource `fcr:versions` location using
-a version ID that already exists, LAKEsuperior will just mint a random
-identifier rather than returning an error.
-
-
-## Deprecation track
-
-LAKEsuperior offers some "legacy" options to replicate the FCREPO4 behavior,
-however encourages new development to use a different approach for some types
-of interaction.
-
-### Endpoints
-
-The FCREPO root endpoint is `/rest`. The LAKEsuperior root endpoint is `/ldp`.
-
-This should not pose a problem if a client does not have `rest` hard-coded in
-its code, but in any event, the `/rest` endpoint is provided for backwards
-compatibility.
-
-LAKEsuperior adds the (currently stub) `query` endpoint. Other endpoints for
-non-LDP services may be opened in the future.
-
-### Automatic LDP class assignment
-
-Since LAKEsuperior rejects client-provided server-managed triples, and since
-the LDP types are among them, the LDP container type is inferred from the
-provided properties: if the `ldp:hasMemberRelation` and
-`ldp:membershipResource` properties are provided, the resource is a Direct
-Container. If in addition to these the `ldp:insertedContentRelation` property
-is present, the resource is an Indirect Container. If any of the first two are
-missing, the resource is a Container (@TODO discuss: shall it be a Basic
-Container?)
-
-Clients are encouraged to omit LDP types in PUT, POST and PATCH requests.
-
-### Lenient handling
-
-FCREPO4 requires server-managed triples to be expressly indicated in a PUT
-request, unless the `Prefer` header is set to
-`handling=lenient; received="minimal"`, in which case the RDF payload must not
-have any server-managed triples.
-
-LAKEsuperior works under the assumption that client should never provide
-server-managed triples. It automatically handles PUT requests sent to existing
-resources by returning a 412 if any server managed triples are included in the
-payload. This is the same as setting `Prefer` to `handling=strict`, which is
-the default.
-
-If `Prefer` is set to `handling=lenient`, all server-managed triples sent with
-the payload are ignored.
-
-Clients using the `Prefer` header to control PUT behavior as advertised by the
-specs should not notice any difference.
-
-
-## Optional improvements
-
-The following are improvements in performance or usability that can only be taken
-advantage of if client code is adjusted.
-
-### LDP-NR content and metadata
-
-FCREPO4 relies on the `/fcr:metadata` identifier to retrieve RDF metadata about
-an LDP-NR. LAKEsuperior supports this as a legacy option, but encourages the
-use of content negotiation to do the same while offering explicit endpoints
-for RDF and non-RDF content retrieval.
-
-Any request to an LDP-NR with an `Accept` header set to one of the supported
-RDF serialization formats will yield the RDF metadata of the resource instead
-of the binary contents.
-
-The `fcr:metadata` URI returns the RDF metadata of a LDP-NR.
-
-The `fcr:content` URI returns the non-RDF content.
-
-The two optionsabove return an HTTP error if requested for a LDP-RS.
-
-### "Include" and "Omit" options for children
-
-LAKEsuperior offers an additional `Prefer` header option to exclude all
-references to child resources (i.e. by removing all the `ldp:contains` triples)
-while leaving the other server-managed triples when retrieving a resource:
-
-    Prefer: return=representation; [include | omit]="http://fedora.info/definitions/v4/repository#Children"
-
-The default behavior is to include all children URIs.
-
-### Soft-delete and purge
-
-**NOTE**: The implementation of this section is incomplete and debated.
-
-In FCREPO4 a deleted resource leaves a tombstone deleting all traces of the
-previous resource.
-
-In LAKEsuperior, a normal DELETE creates a new version snapshot of the resource
-and puts a tombstone in its place. The resource versions are still available
-in the `fcr:versions` location. The resource can be "resurrected" by
-issuing a POST to its tombstone. This will result in a `201`.
-
-If a tombstone is deleted, the resource and its versions are completely deleted
-(purged).
-
-Moreover, setting the `Prefer:no-tombstone` header option on DELETE allows to
-delete a resource and its versions directly without leaving a tombstone.

+ 0 - 260
docs/notes/indexing_strategy.md

@@ -1,260 +0,0 @@
-# LMDB Store design for RDFLib
-
-This is a log of subsequent strategies employed to store triples in LMDB.
-
-Strategy #5a is the one currently used. The rest is kept for historic reasons
-and academic curiosity (and also because it was too much work to just wipe out
-of memory).
-
-## Storage approach
-
-- Pickle quad and create MD5 or SHA1 hash.
-- Store triples in one database paired with key; store indices separately.
-
-Different strategies involve layout and number of databases.
-
-## Strategy #1
-
-- kq: key: serialized triple (1:1)
-- sk: Serialized subject: key (1:m)
-- pk: Serialized predicate: key (1:m)
-- ok: Serialized object: key (1:m)
-- (optional) lok: Serialized literal object: key (1:m)
-- (optional) tok: Serialized RDF type: key (1:m)
-- ck: Serialized context: key (1:m)
-
-### Retrieval approach
-
-To find all matches for a quad:
-
-- If all terms in the quad are bound, generate the key from the pickled
-quad and look up the triple in `kt`
-- If all terms are unbound, return an iterator of all values in `kt`.
-- If some values are bound and some unbound (most common query):
-  - Get a base list of keys associated wirh the first bound term
-  - For each subsequent bound term, check if each key associated with the term
-  matches a key in the base list
-  - Continue through all the bound terms. If a match is not found at any point,
-  continue to the next term
-  - If a match is found in all the bound term databases, look up the pickled quad
-  matching the key in `kq` and yield it
-
-More optimization can be introduced later, e.g. separating literal and RDF
-type objects in separate databases. Literals can have very long values and a
-database with a longer key setting may be useful. RDF terms can be indexed
-separately because they are the most common bound term.
-
-### Example lookup
-
-Keys and Triples (should actually be quads but this is a simplified version):
-
-A: s1 p1 o1
-B: s1 p2 o2
-C: s2 p3 o1
-D: s2 p3 o3
-
-Indices:
-
-- SK:
-  - s1: A, B
-  - s2: C, D
-- PK:
-  - p1: A
-  - p2: B
-  - p3: C, D
- - OK:
-  - o1: A, C
-  - o2: B
-  - o3: D
-
-Queries:
-
-- s1 ?p ?o → {A, B}
-- s1 p2 ?o → {A, B} & {B} = {B}
-- ?s ?p o3 → {D}
-- s1 p2 o5 → {} (Exit at OK: no term matches 'o5')
-- s2 p3 o2 → {C, D} & {C, D} & {B} = {}
-
-
-## Strategy #2
-
-Separate data and indices in two environments.
-
-### Main data store
-
-Key to quad; main keyspace; all unique.
-
-### Indices
-
-None of these databases is of critical preservation concern. They can be
-rebuilt from the main data store.
-
-All dupsort and dupfixed.
-
-@TODO The first three may not be needed if computing term hash is fast enough.
-
-- t2k (term to term key)
-- lt2k (literal to term key: longer keys)
-- k2t (term key to term)
-
-- s2k (subject key to quad key)
-- p2k (pred key to quad key)
-- o2k (object key to quad key)
-- c2k (context key to quad key)
-
-- sc2qk (subject + context keys to quad key)
-- po2qk (predicate + object keys to quad key)
-
-- sp2qk (subject + predicate keys to quad key)
-- oc2qk (object + context keys to quad key)
-
-- so2qk (subject + object keys to quad key)
-- pc2qk (predicate + context keys to quad key)
-
-
-## Strategy #3
-
-Contexts are much fewer (even in graph per aspect, 5-10 triples per graph)
-
-### Main data store
-
-Preservation-worthy data
-
-- tk:t (triple key: triple; dupsort, dupfixed)
-- tk:c (context key: triple; unique)
-
-### Indices
-
-Rebuildable from main data store
-
-- s2k (subject key: triple key)
-- p2k (pred key: triple key)
-- o2k (object key: triple key)
-- sp2k
-- so2k
-- po2k
-- spo2k
-
-### Lookup
-
-1. Look up triples by s, p, o, sp, so, po and get keys
-2. If a context is specified, for each key try to seek to (context, key) in ct
-   to verify it exists
-3. Intersect sets
-4. Match triple keys with data using kt
-
-#### Shortcuts
-
-- Get all contexts: return list of keys from ct
-- Get all triples for a context: get all values for a contex from ct and match
-  triple data with kt
-- Get one triple match for all contexts: look up in triple indices and match
-  triple data with kt
-
-
-## Strategy #4
-
-Terms are entered individually in main data store. Also, shorter keys are
-used rather than hashes. These two aspects save a great deal of space and I/O,
-but require an additional index to put the terms together in a triple.
-
-### Main Data Store
-
-- t:st (term key: serialized term; 1:1)
-- spo:c (joined S, P, O keys: context key; 1:m)
-- c: (context keys only, values are the empty bytestring)
-
-Storage total: variable
-
-### Indices
-
-- th:t (term hash: term key; 1:1)
-- c:spo (context key: joined triple keys; 1:m)
-- s:po (S key: P + O key; 1:m)
-- p:so (P key: S + O keys; 1:m)
-- o:sp (object key: triple key; 1:m)
-- sp:o (S + P keys: O key; 1:m)
-- so:p (S + O keys: P key; 1:m)
-- po:s (P + O keys: S key; 1:m)
-
-Storage total: 143 bytes per triple
-
-### Disadvantages
-
-- Lots of indices
-- Terms can get orphaned:
-  - No easy way to know if a term is used anywhere in a quad
-  - Needs some routine cleanup
-  - On the other hand, terms are relatively light-weight and can be reused
-  - Almost surely not reusable are UUIDs, message digests, timestamps etc.
-
-
-## Strategy #5
-
-Reduce number of indices and rely on parsing and splitting keys to find triples
-with two bound parameters.
-
-This is especially important for keeping indexing synchronous to achieve fully
-ACID writes.
-
-### Main data store
-
-Same as Strategy #4:
-
-- t:st (term key: serialized term; 1:1)
-- spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
-- c: (context keys only, values are the empty bytestring; 1:1)
-
-Storage total: variable (same as #4)
-
-### Indices
-
-- th:t (term hash: term key; 1:1)
-- s:po (S key: joined P, O keys; dupsort, dupfixed)
-- p:so (P key: joined S, O keys; dupsort, dupfixed)
-- o:sp (O key: joined S, P keys; dupsort, dupfixed)
-- c:spo (context → triple association; dupsort, dupfixed)
-
-Storage total: 95 bytes per triple
-
-### Lookup strategy
-
-- ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term key
-- s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from term key
-- s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
-- s p ?: [s:po] All PO for S → filter result by P in split key
-    → [t:st] term from term key
-
-### Advantages
-
-- Less indices: smaller index size and less I/O
-
-### Disadvantages
-
-- Possibly slower retrieval for queries with 2 bound terms (run metrics)
-
-### Further optimization
-
-In order to minimize traversing and splittig results, the first retrieval
-should be made on the term with less average keys. Search order can be balanced
-by establishing a lookup order for indices.
-
-This can be achieved by calling stats on the index databases and looking up the
-database with *most* keys. Since there is an equal number of entries in each of
-the (s:po, p:so, o:sp) indices, the one with most keys will have the least
-average number of values per key. If that lookup is done first, the initial
-data set to traverse and filter will be smaller.
-
-
-## Strategy #5a
-
-This is a slightly different implementation of #5 that somewhat simplifies and
-perhaps speeds up things a bit. It is the currently employed solution.
-
-The indexing and lookup strtegy is the same; but instead of using a separator
-byte for splitting compound keys, the logic relies on the fact that keys have
-a fixed length and are sliced instead. This *should* result in faster key
-manipulation, also because in most cases `memoryview` buffers can be used
-directly instead of being copied from memory.
-
-Index storage is 90 bytes per triple.

+ 311 - 0
docs/notes/indexing_strategy.rst

@@ -0,0 +1,311 @@
+LMDB Store design for RDFLib
+============================
+
+This is a log of subsequent strategies employed to store triples in
+LMDB.
+
+Strategy #5a is the one currently used. The rest is kept for historic
+reasons and academic curiosity (and also because it was too much work to
+just wipe out of memory).
+
+Storage approach
+----------------
+
+-  Pickle quad and create MD5 or SHA1 hash.
+-  Store triples in one database paired with key; store indices
+   separately.
+
+Different strategies involve layout and number of databases.
+
+Strategy #1
+-----------
+
+-  kq: key: serialized triple (1:1)
+-  sk: Serialized subject: key (1:m)
+-  pk: Serialized predicate: key (1:m)
+-  ok: Serialized object: key (1:m)
+-  (optional) lok: Serialized literal object: key (1:m)
+-  (optional) tok: Serialized RDF type: key (1:m)
+-  ck: Serialized context: key (1:m)
+
+Retrieval approach
+~~~~~~~~~~~~~~~~~~
+
+To find all matches for a quad:
+
+-  If all terms in the quad are bound, generate the key from the pickled
+   quad and look up the triple in ``kt``
+-  If all terms are unbound, return an iterator of all values in ``kt``.
+-  If some values are bound and some unbound (most common query):
+
+   -  Get a base list of keys associated with the first bound term
+   -  For each subsequent bound term, check if each key associated with
+      the term matches a key in the base list
+   -  Continue through all the bound terms. If a match is not found at
+      any point, discard the key and move on to the next one
+   -  If a match is found in all the bound term databases, look up the
+      pickled quad matching the key in ``kq`` and yield it
+
+More optimization can be introduced later, e.g. separating literal and
+RDF type objects in separate databases. Literals can have very long
+values and a database with a longer key setting may be useful. RDF terms
+can be indexed separately because they are the most common bound term.
+
+Example lookup
+~~~~~~~~~~~~~~
+
+Keys and Triples (should actually be quads but this is a simplified
+version):
+
+-  A: s1 p1 o1
+-  B: s1 p2 o2
+-  C: s2 p3 o1
+-  D: s2 p3 o3
+
+Indices:
+
+-  SK:
+
+   -  s1: A, B
+   -  s2: C, D
+
+-  PK:
+
+   -  p1: A
+   -  p2: B
+   -  p3: C, D
+
+-  OK:
+
+   -  o1: A, C
+   -  o2: B
+   -  o3: D
+
+Queries:
+
+-  s1 ?p ?o → {A, B}
+-  s1 p2 ?o → {A, B} & {B} = {B}
+-  ?s ?p o3 → {D}
+-  s1 p2 o5 → {} (Exit at OK: no term matches ‘o5’)
+-  s2 p3 o2 → {C, D} & {C, D} & {B} = {}
+
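+The same intersection logic can be sketched in Python with plain
+dictionaries (an illustration of the algorithm only, not of the LMDB
+layout; the toy indices mirror the example data above):
+
+.. code-block:: python
+
+   # Toy indices from the example above: bound term -> set of quad keys.
+   SK = {'s1': {'A', 'B'}, 's2': {'C', 'D'}}
+   PK = {'p1': {'A'}, 'p2': {'B'}, 'p3': {'C', 'D'}}
+   OK = {'o1': {'A', 'C'}, 'o2': {'B'}, 'o3': {'D'}}
+
+   def lookup(s=None, p=None, o=None):
+       """Intersect the key sets of all bound terms; None means unbound."""
+       result = None
+       for index, term in ((SK, s), (PK, p), (OK, o)):
+           if term is None:
+               continue                      # Unbound term: no constraint.
+           keys = index.get(term, set())
+           result = keys if result is None else result & keys
+           if not result:
+               return set()                  # Exit early: no possible match.
+       # All terms unbound: every quad matches.
+       return result if result is not None else {'A', 'B', 'C', 'D'}
+
+   lookup(s='s1', p='p2')            # {'B'}
+   lookup(o='o3')                    # {'D'}
+   lookup(s='s2', p='p3', o='o2')    # set()
+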
+Strategy #2
+-----------
+
+Separate data and indices in two environments.
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Key to quad; main keyspace; all unique.
+
+Indices
+~~~~~~~
+
+None of these databases is of critical preservation concern. They can be
+rebuilt from the main data store.
+
+All dupsort and dupfixed.
+
+@TODO The first three may not be needed if computing the term hash is fast
+enough.
+
+-  t2k (term to term key)
+-  lt2k (literal to term key: longer keys)
+-  k2t (term key to term)
+
+-  s2k (subject key to quad key)
+-  p2k (pred key to quad key)
+-  o2k (object key to quad key)
+-  c2k (context key to quad key)
+
+-  sc2qk (subject + context keys to quad key)
+-  po2qk (predicate + object keys to quad key)
+
+-  sp2qk (subject + predicate keys to quad key)
+-  oc2qk (object + context keys to quad key)
+
+-  so2qk (subject + object keys to quad key)
+-  pc2qk (predicate + context keys to quad key)
+
+Strategy #3
+-----------
+
+Contexts are much fewer than triples (even in a graph-per-aspect layout,
+each graph only contains 5-10 triples).
+
+.. _main-data-store-1:
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Preservation-worthy data
+
+-  tk:t (triple key: triple; dupsort, dupfixed)
+-  tk:c (context key: triple; unique)
+
+.. _indices-1:
+
+Indices
+~~~~~~~
+
+Rebuildable from main data store
+
+-  s2k (subject key: triple key)
+-  p2k (pred key: triple key)
+-  o2k (object key: triple key)
+-  sp2k
+-  so2k
+-  po2k
+-  spo2k
+
+Lookup
+~~~~~~
+
+1. Look up triples by s, p, o, sp, so, po and get keys
+2. If a context is specified, for each key try to seek to (context, key)
+   in ct to verify it exists
+3. Intersect sets
+4. Match triple keys with data using kt
+
+Shortcuts
+^^^^^^^^^
+
+-  Get all contexts: return list of keys from ct
+-  Get all triples for a context: get all values for a context from ct
+   and match triple data with kt
+-  Get one triple match for all contexts: look up in triple indices and
+   match triple data with kt
+
+Strategy #4
+-----------
+
+Terms are entered individually in the main data store. Also, shorter keys
+are used rather than hashes. These two aspects save a great deal of
+space and I/O, but require an additional index to put the terms together
+in a triple.
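+
+The term/key indirection can be illustrated with plain in-memory Python;
+the hash function, key size and naive key counter are assumptions rather
+than the actual LAKEsuperior code, and ``t:st`` / ``th:t`` are the stores
+listed below:
+
+.. code-block:: python
+
+   import hashlib
+   import pickle
+
+   term_store = {}   # stands in for t:st (term key: serialized term)
+   hash_index = {}   # stands in for th:t (term hash: term key)
+   next_key = 0      # naive counter; real keys are short fixed-length bytes
+
+   def add_term(term):
+       """Store a term only once and return its short key."""
+       global next_key
+       ser = pickle.dumps(term)
+       th = hashlib.sha1(ser).digest()
+       if th in hash_index:               # Term already stored: reuse key.
+           return hash_index[th]
+       key = next_key.to_bytes(5, 'big')  # Assumed 5-byte term keys.
+       next_key += 1
+       term_store[key] = ser
+       hash_index[th] = key
+       return key
+
+   # A triple key is simply the concatenation of its three term keys.
+   spo_key = add_term('<s1>') + add_term('<p1>') + add_term('"a literal"')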
+
+.. _main-data-store-2:
+
+Main Data Store
+~~~~~~~~~~~~~~~
+
+-  t:st (term key: serialized term; 1:1)
+-  spo:c (joined S, P, O keys: context key; 1:m)
+-  c: (context keys only, values are the empty bytestring)
+
+Storage total: variable
+
+.. _indices-2:
+
+Indices
+~~~~~~~
+
+-  th:t (term hash: term key; 1:1)
+-  c:spo (context key: joined triple keys; 1:m)
+-  s:po (S key: P + O key; 1:m)
+-  p:so (P key: S + O keys; 1:m)
+-  o:sp (O key: S + P keys; 1:m)
+-  sp:o (S + P keys: O key; 1:m)
+-  so:p (S + O keys: P key; 1:m)
+-  po:s (P + O keys: S key; 1:m)
+
+Storage total: 143 bytes per triple
+
+Disadvantages
+~~~~~~~~~~~~~
+
+-  Lots of indices
+-  Terms can get orphaned:
+
+   -  No easy way to know if a term is used anywhere in a quad
+   -  Needs some routine cleanup
+   -  On the other hand, terms are relatively light-weight and can be
+      reused
+   -  UUIDs, message digests, timestamps etc. are almost surely not
+      reusable
+
+Strategy #5
+-----------
+
+Reduce number of indices and rely on parsing and splitting keys to find
+triples with two bound parameters.
+
+This is especially important for keeping indexing synchronous to achieve
+fully ACID writes.
+
+.. _main-data-store-3:
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Same as Strategy #4:
+
+-  t:st (term key: serialized term; 1:1)
+-  spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
+-  c: (context keys only, values are the empty bytestring; 1:1)
+
+Storage total: variable (same as #4)
+
+.. _indices-3:
+
+Indices
+~~~~~~~
+
+-  th:t (term hash: term key; 1:1)
+-  s:po (S key: joined P, O keys; dupsort, dupfixed)
+-  p:so (P key: joined S, O keys; dupsort, dupfixed)
+-  o:sp (O key: joined S, P keys; dupsort, dupfixed)
+-  c:spo (context → triple association; dupsort, dupfixed)
+
+Storage total: 95 bytes per triple
+
+Lookup strategy
+~~~~~~~~~~~~~~~
+
+-  ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term
+   key
+-  s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from
+   term key
+-  s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
+-  s p ?: [s:po] All PO for S → filter result by P in split key → [t:st]
+   term from term key
+
+Advantages
+~~~~~~~~~~
+
+-  Fewer indices: smaller index size and less I/O
+
+.. _disadvantages-1:
+
+Disadvantages
+~~~~~~~~~~~~~
+
+-  Possibly slower retrieval for queries with 2 bound terms (run
+   metrics)
+
+Further optimization
+~~~~~~~~~~~~~~~~~~~~
+
+In order to minimize traversing and splitting results, the first
+retrieval should be made on the term whose index has the fewest average
+values per key. Search order can be balanced by establishing a lookup
+order for indices.
+
+This can be achieved by calling stats on the index databases and looking
+up the database with *most* keys. Since there is an equal number of
+entries in each of the (s:po, p:so, o:sp) indices, the one with most
+keys will have the least average number of values per key. If that
+lookup is done first, the initial data set to traverse and filter will
+be smaller.
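+
+A minimal sketch of this ranking with the Python LMDB binding follows.
+The environment path and database names are illustrative; note also that
+for dupsort databases the ``entries`` stat counts duplicate values too,
+so the sketch counts distinct keys with a cursor walk:
+
+.. code-block:: python
+
+   import lmdb
+
+   env = lmdb.open('/path/to/index_env', max_dbs=16, readonly=True)
+   index_names = ('s:po', 'p:so', 'o:sp')
+   dbs = {
+       name: env.open_db(name.encode(), dupsort=True)
+       for name in index_names}
+
+   def distinct_keys(txn, db):
+       # Walk unique keys only, skipping the dupsort duplicate values.
+       cur = txn.cursor(db)
+       return sum(1 for _ in cur.iternext_nodup(keys=True, values=False))
+
+   with env.begin() as txn:
+       # The index with the most distinct keys has the fewest values per
+       # key, hence the smallest initial result set: look it up first.
+       lookup_order = sorted(
+           index_names,
+           key=lambda name: distinct_keys(txn, dbs[name]),
+           reverse=True)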
+
+Strategy #5a
+------------
+
+This is a slightly different implementation of #5 that somewhat
+simplifies and perhaps speeds up things a bit. It is the currently
+employed solution.
+
+The indexing and lookup strategy is the same, but instead of using a
+separator byte for splitting compound keys, the logic relies on the fact
+that keys have a fixed length and are sliced instead. This *should*
+result in faster key manipulation, also because in most cases
+``memoryview`` buffers can be used directly instead of being copied from
+memory.
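+
+For illustration only (the term-key length is an assumption and the hot
+path is not written exactly like this), slicing a joined ``spo`` key, as
+stored in ``spo:c`` or ``c:spo``, into its term keys may look like:
+
+.. code-block:: python
+
+   KEY_LEN = 5   # assumed fixed term-key length, in bytes
+
+   def split_spo(spo_key):
+       """Slice a joined s+p+o key into term keys without copying."""
+       buf = memoryview(spo_key)
+       return (buf[:KEY_LEN],
+               buf[KEY_LEN:2 * KEY_LEN],
+               buf[2 * KEY_LEN:])
+
+   joined = bytes(range(15))            # stand-in for a stored spo key
+   s_key, p_key, o_key = split_spo(joined)
+   assert bytes(p_key) == bytes(range(5, 10))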
+
+Index storage is 90 bytes per triple.

+ 0 - 27
docs/notes/messaging.md

@@ -1,27 +0,0 @@
-# LAKEsuperior Messaging
-
-LAKEsuperior implements a messaging system based on ActivityStreams, as
-indicated by the
-[Fedora API specs](https://fedora.info/2017/06/30/spec/#notifications).
-The metadata set provided is currently quite minimal but can be easily
-enriched by extending the
-[default formatter class](https://github.com/scossu/lakesuperior/blob/master/lakesuperior/messaging/messenger.py).
-
-STOMP is the only supported protocol at the moment. More protocols may be made
-available at a later time.
-
-LAKEsuperior can send messages to any number of destinations: see
-[configuration](https://github.com/scossu/lakesuperior/blob/master/etc.defaults/application.yml#L79).
-By default, CoilMQ is provided for testing purposes and listens to
-`localhost:61613`. The default route sends messages to `/topic/fcrepo`.
-
-A small command-line utility, also provided with the Python dependencies,
-can be used to watch incoming messages. To monitor messages, enter the
-following *after activating your virtualenv*:
-
-```
-stomp -H localhost -P 61613 -L /topic/fcrepo
-```
-
-See the [stomp.py library reference page](https://github.com/jasonrbriggs/stomp.py/wiki/Command-Line-Access)
-for details.

+ 0 - 58
docs/notes/migration.md

@@ -1,58 +0,0 @@
-# Migration, Backup & Restore
-
-All LAKEsuperior data is by default fully contained in a folder. This means
-that only the data, configurations and code folders are needed for it to run.
-No Postgres, Redis, or similar services. Data and configuration folders can be moved around
-as needed.
-
-## Migration Tool
-
-Migration is the process of importing and converting data from a different
-Fedora or LDP implementation into a new LAKEsuperior instance. This process
-uses the HTTP/LDP API of the original repository. A command-line utility is
-available as part of the `lsup-admin` suite to assist in such operation.
-
-A repository can be migrated with a one-line command such as:
-
-```
-./lsup-admin migrate http://source-repo.edu/rest /local/dest/folder
-```
-
-For more options, enter
-
-```
-./lsup-admin migrate --help
-```
-
-The script will crawl through the resources and follow outbound links
-within them. In order to do this, resources are added as raw triples (i.e.
-no consistency checks are made).
-
-**Note:** the consistency check tool has not been implemented yet, but its
-release should follow shortly. This will ensure that all the links
-between resources are consistent in regard to referential integrity.
-
-This script will create a full dataset in the specified destination folder,
-complete with a default configuration that allows the LAKEsuperior server to
-be started immediately after the migration is complete.
-
-Two approaches to migration are possible:
-
-1. By providing a starting point on the source repository. E.g. if the
-   repository you want to migrate is at `http://repo.edu/rest/prod` you can add
-   the `-s /prod` option to the script to avoid migrating irrelevant branches.
-   Note that the script will still reach outside of the starting point if
-   resources are referencing other resources outside of it.
-2. By providing a file containing a list of resources to migrate. This is
-   useful if a source repository cannot produce a full list (e.g. the root node
-   has more children than the server can handle) but a list of individual
-   resources is available via an external index (Solr, triplestore, etc.).
-   The resources can be indicated by their fully qualified URIs or paths
-   relative to the repository root. (*TODO latter option needs testing*)
-
-## Backup And Restore
-
-A backup of a LAKEsuperior repository consists of copying the RDF and non-RDF
-data folders. These folders are indicated in the application configuration. The
-default commands provided by your OS (`cp`, `rsync`, `tar` etc. for Unix) are
-all that is needed.

+ 0 - 63
docs/notes/model.md

@@ -1,63 +0,0 @@
-# LAKEsuperior Content Model Rationale
-
-## Internal and Public URIs; Identifiers
-
-Resource URIs are stored internally in LAKEsuperior as domain-agnostic URIs
-with the scheme `info:fcres<resource UID>`. This allows resources to be
-portable across systems. E.g. a resource with an internal URI of
-`info:fcres/a/b/c`, when accessed via the `http://localhost:8000/ldp`
-endpoint, will be found at `http://localhost:8000/ldp/a/b/c`.
-
-The resource UID making up the internal URI looks like a UNIX
-filesystem path, i.e. it always starts with a forward slash and can be made up
-of multiple segments separated by slashes. E.g. `/` is the root node UID,
-`/a` is a resource UID just below root. Their internal URIs are `info:fcres/`
-and `info:fcres/a` respectively.
-
-In the Python API, the UID and internal URI of an LDP resource can be accessed
-via the `uid` and `uri` properties respectively:
-
-```
->>> import lakesuperior.env_setup
->>> from lakesuperior.api import resource
->>> rsrc = resource.get('/a/b/c')
->>> rsrc.uid
-/a/b/c
->>> rsrc.uri
-rdflib.terms.URIRef('info:fcres/a/b/c')
-```
-
-## Store Layout
-
-One of the key concepts in LAKEsuperior is the store layout. This is a
-module built with a
-specific purpose in mind, i.e. allowing fine-grained recording of provenance
-metadata while providing reasonable performance.
-
-Store layout modules could be replaceable (work needs to
-be done to develop an interface to allow that). The default (and only at the
-moment) layout shipped with LAKEsuperior is the
-[resource-centric layout](../../lakesuperior/store/ldp_rs/rsrc_centric_layout).
-This layout implements a so-called
-[graph-per-aspect pattern](http://patterns.dataincubator.org/book/graph-per-aspect.html)
-which stores different sets of statements about a resource in separate named
-graphs.
-
-The named graphs used for each resource are:
-
-- An admin graph (`info:fcsystem/graph/admin<resource UID>`) which stores
-  administrative metadata, mostly server-managed triples such as LDP types,
-  system create/update timestamps and agents, etc.
-- A structure graph (`info:fcsystem/graph/structure<resource UID>`) reserved for
-  containment triples. The reason
-  for this separation is purely convenience, since it makes it easy to retrieve
-  all the properties of a large container without its child references.
-- One (and, possibly, in the future, more user-defined) named graph for
-  user-provided data (`info:fcsystem/graph/userdata/_main<resource UID>`).
-
-Each of these graphs can be annotated with provenance metadata. The layout
-decides which triples go in which graph based on the predicate or RDF type
-contained in the triple. Adding logic to support arbitrary named graphs based
-e.g. on user agent, or to add more provenance information, should be relatively
-simple.
-

+ 0 - 112
docs/notes/performance.md

@@ -1,112 +0,0 @@
-# Performance Benchmark Report
-
-## Environment
-
-### Hardware
-
-#### ‘Rather Snappy’ Laptop
-
-- Dell Precision M3800 Laptop
-- 4x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
-- 12Gb RAM
-- SSD
-
-#### ‘Ole Workhorse’ server
-
-8x Intel(R) Xeon(R) CPU X5550  @ 2.67GHz
-16Gb RAM
-Magnetic drive, XXX RPM
-
-### Software
-
-- Arch Linux OS
-- glibc 2.26-11
-- python 3.5.4
-- lmdb 0.9.21-1
-
-### Benchmark script
-
-[Generator script](../../util/benchmark.py)
-
-The script was run with default values: 10,000 children under the same parent,
-PUT requests.
-
-### Data Set
-
-Synthetic graph created by the benchmark script. The graph is unique for each
-request and consists of 200 triples which are partly random data, with a
-consistent size and variation:
-
-- 50 triples have an object that is a URI of an external resource (50 unique
-  predicates; 5 unique objects).
-- 50 triples have an object that is a URI of a repository-managed resource
-  (50 unique predicates; 5 unique objects).
-- 100 triples have an object that is a 64-character random Unicode string
-  (50 unique predicates; 100 unique objects).
-
-## Results
-
-### ‘Rather Snappy’ Laptop
-
-#### FCREPO/Modeshape 4.7.5
-
-15'45" running time
-
-0.094" per resource (100%—reference point)
-
-3.4M triples total in repo at the end of the process
-
-Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64" (100%)
-
-Peak memory usage: 2.47Gb
-
-Database size: 3.3 Gb
-
-
-#### LAKEsuperior Alpha 6, LMDB Back End
-
-25' running time
-
-0.152" per resource (161%)
-
-*Some gaps every ~40-50 requests, probably disk flush*
-
-Retrieval of parent resource (10K triples), pipe to /dev/null: 2.13" (58%)
-
-Peak memory usage: ~650 Mb (3 idle workers, 1 active)
-
-Database size: 523 Mb (16%)
-
-### ‘Ole Workhorse’ server
-
-#### FCREPO
-
-0:47:38 running time
-
-0.285" per resource (100%)
-
-Retrieval of parent resource: 9.6" (100%)
-
-#### LAKEsuperior
-
-1:14:19 running time
-
-0.446" per resource (156%)
-
-Retrieval of parent resource: 5.58" (58%)
-
-## Conclusions
-
-LAKEsuperior appears to be markedly slower on writes and markedly faster on
-reads. Both these factors are very likely related to the underlying LMDB store
-which is optimized for read performance.
-
-Comparison of results between the laptop and the server demonstrates that both
-read and write performance gaps
-are identical in the two environments. Disk speed severely affects the numbers.
-
-**Note:** As you can guess, these are only very partial and specific results. They
-should not be taken as a thorough performance assessment. Such an assessment
-may be impossible and pointless to make given the very different nature of
-the storage models, which may behave radically differently depending on many
-variables.

+ 0 - 82
docs/notes/storage.md

@@ -1,82 +0,0 @@
-# Storage Implementation
-
-LAKEsuperior stores non-RDF ("binary") data in the filesystem and RDF data in
-an embedded key-value store, [LMDB](https://symas.com/lmdb/).
-
-## RDF Storage design
-
-LMDB is a very fast, very lightweight C library. It is inspired by BerkeleyDB
-but introduces significant improvements in terms of efficiency and stability.
-
-The LAKEsuperior RDF store consists of two files: the main data store and the
-indices (plus two lock files that are generated at runtime). A good amount of
-effort has been put to develop an indexing strategy that is balanced between
-write performance, read performance, and data size, with no compromise made on
-consistency.
-
-The main data
-store is the one containing the preservation-worthy data. While the indices are
-necessary for LAKEsuperior to function, they can be entirely rebuilt from the
-main data store in case of file corruption (recovery tools are on the TODO
-list).
-
-Detailed notes about the various strategies researched can be found
-[here](indexing_strategy.md).
-
-## Scalability
-
-Since LAKEsuperior is focused on design simplicity, efficiency and reliability,
-its RDF store is embedded and not horizontally scalable. However, LAKEsuperior
-is quite frugal with disk space. About 55 million triples can be
-stored in 8Gb of space (mileage can vary depending on how heterogeneous the
-triples are). This makes it easier to use expensive SSD drives for
-the RDF store, in order to improve performance. A single LMDB environment can
-reportedly scale up to 128 terabytes.
-
-## Maintenance
-
-LMDB has a very simple configuration, and all options are hardcoded
-in LAKEsuperior in order to exploit its features. A database automatically
-recovers from a crash.
-
-The LAKEsuperior RDF store abstraction maintains a registry of unique terms.
-These terms are not deleted if a triple is deleted, even if no triple is using
-them, because it would be too expensive to look for orphaned terms during a
-delete request. While these terms are relatively lightweight, it would be good
-to run a periodical clean-up job. Tools will be developed in the near future to
-facilitate this maintenance task.
-
-## Consistency
-
-LAKEsuperior wraps each LDP operation in a transaction. The indices are updated
-synchronously within the same transaction in order to guarantee
-consistency. If a system loses power or crashes, only the last transaction is
-lost, and the last successful write will include primary and index data.
-
-## Concurrency
-
-LMDB employs
-[MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
-to achieve fully ACID transactions. This implies that during
-a write, the whole database is locked. Multiple writes can be initiated
-concurrently, but the performance gain of doing so may be little because
-only one write operation can be performed at a time. Reasonable efforts have
-been put to make write transactions as short as possible (and more can be
-done). Also, this excludes a priori the option to implement long-running atomic
-operations, unless one is willing to block writes on the application for an
-indefinite length of time. On the other hand, write operations never block and
-are never blocked, so an application with a high read-to-write ratio may still
-benefit from multi-threaded requests.
-
-## Performance
-
-The [Performance Benchmark Report](performance.md) contains benchmark results.
-
-Write performance is lower than Modeshape/Fedora4; this may be mostly due to
-the fact that indices are written synchronously in a blocking transaction;
-also, the LMDB B+Tree structure is optimized for read performance rather than
-write performance. Some optimizations on the application layer could be made.
-
-Reads are faster than Modeshape/Fedora.
-
-All tests so far have been performed in a single thread.

+ 131 - 0
docs/performance.rst

@@ -0,0 +1,131 @@
+Performance Benchmark Report
+============================
+
+Environment
+-----------
+
+Hardware
+~~~~~~~~
+
+‘Rather Snappy’ Laptop
+^^^^^^^^^^^^^^^^^^^^^^
+
+-  Dell Precision M3800 Laptop
+-  4x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
+-  12Gb RAM
+-  SSD
+
+‘Ole Workhorse’ server
+^^^^^^^^^^^^^^^^^^^^^^
+
+-  8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
+-  16Gb RAM
+-  Magnetic drive, XXX RPM
+
+Software
+~~~~~~~~
+
+-  Arch Linux OS
+-  glibc 2.26-11
+-  python 3.5.4
+-  lmdb 0.9.21-1
+
+Benchmark script
+~~~~~~~~~~~~~~~~
+
+`Generator script <../../util/benchmark.py>`__
+
+The script was run with default values: 10,000 children under the same
+parent, PUT requests.
+
+Data Set
+~~~~~~~~
+
+Synthetic graph created by the benchmark script. The graph is unique for
+each request and consists of 200 triples which are partly random data,
+with a consistent size and variation:
+
+-  50 triples have an object that is a URI of an external resource (50
+   unique predicates; 5 unique objects).
+-  50 triples have an object that is a URI of a repository-managed
+   resource (50 unique predicates; 5 unique objects).
+-  100 triples have an object that is a 64-character random Unicode
+   string (50 unique predicates; 100 unique objects).
+
+Results
+-------
+
+.. _rather-snappy-laptop-1:
+
+‘Rather Snappy’ Laptop
+~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO/Modeshape 4.7.5
+^^^^^^^^^^^^^^^^^^^^^^
+
+15’45" running time
+
+0.094" per resource (100%—reference point)
+
+3.4M triples total in repo at the end of the process
+
+Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64"
+(100%)
+
+Peak memory usage: 2.47Gb
+
+Database size: 3.3 Gb
+
+LAKEsuperior Alpha 6, LMDB Back End
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+25’ running time
+
+0.152" per resource (161%)
+
+*Some gaps every ~40-50 requests, probably disk flush*
+
+Retrieval of parent resource (10K triples), pipe to /dev/null: 2.13"
+(58%)
+
+Peak memory usage: ~650 Mb (3 idle workers, 1 active)
+
+Database size: 523 Mb (16%)
+
+.. _ole-workhorse-server-1:
+
+‘Ole Workhorse’ server
+~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO
+^^^^^^
+
+0:47:38 running time
+
+0.285" per resource (100%)
+
+Retrieval of parent resource: 9.6" (100%)
+
+LAKEsuperior
+^^^^^^^^^^^^
+
+1:14:19 running time
+
+0.446" per resource (156%)
+
+Retrieval of parent resource: 5.58" (58%)
+
+Conclusions
+-----------
+
+LAKEsuperior appears to be markedly slower on writes and markedly faster
+on reads. Both these factors are very likely related to the underlying
+LMDB store which is optimized for read performance.
+
+Comparison of results between the laptop and the server demonstrates
+that the read and write performance gaps are nearly identical in the two
+environments. Disk speed severely affects the numbers.
+
+**Note:** As you can guess, these are only very partial and specific
+results. They should not be taken as a thorough performance assessment.
+Such an assessment may be impossible and pointless to make given the
+very different nature of the storage models, which may behave radically
+differently depending on many variables.

+ 94 - 0
docs/storage.rst

@@ -0,0 +1,94 @@
+Storage Implementation
+======================
+
+LAKEsuperior stores non-RDF (“binary”) data in the filesystem and RDF
+data in an embedded key-value store, `LMDB <https://symas.com/lmdb/>`__.
+
+RDF Storage design
+------------------
+
+LMDB is a very fast, very lightweight C library. It is inspired by
+BerkeleyDB but introduces significant improvements in terms of
+efficiency and stability.
+
+The LAKEsuperior RDF store consists of two files: the main data store
+and the indices (plus two lock files that are generated at runtime). A
+good amount of effort has been put into developing an indexing strategy
+that is balanced between write performance, read performance, and data
+size, with no compromise made on consistency.
+
+The main data store is the one containing the preservation-worthy data.
+While the indices are necessary for LAKEsuperior to function, they can
+be entirely rebuilt from the main data store in case of file corruption
+(recovery tools are on the TODO list).
+
+Detailed notes about the various strategies researched can be found
+`here <notes/indexing_strategy.rst>`__.
+
+Scalability
+-----------
+
+Since LAKEsuperior is focused on design simplicity, efficiency and
+reliability, its RDF store is embedded and not horizontally scalable.
+However, LAKEsuperior is quite frugal with disk space. About 55 million
+triples can be stored in 8Gb of space (mileage can vary depending on how
+heterogeneous the triples are). This makes it easier to use expensive
+SSD drives for the RDF store, in order to improve performance. A single
+LMDB environment can reportedly scale up to 128 terabytes.
+
+Maintenance
+-----------
+
+LMDB has a very simple configuration, and all options are hardcoded in
+LAKEsuperior in order to exploit its features. A database automatically
+recovers from a crash.
+
+The LAKEsuperior RDF store abstraction maintains a registry of unique
+terms. These terms are not deleted if a triple is deleted, even if no
+triple is using them, because it would be too expensive to look for
+orphaned terms during a delete request. While these terms are relatively
+lightweight, it would be good to run a periodical clean-up job. Tools
+will be developed in the near future to facilitate this maintenance
+task.
+
+Consistency
+-----------
+
+LAKEsuperior wraps each LDP operation in a transaction. The indices are
+updated synchronously within the same transaction in order to guarantee
+consistency. If a system loses power or crashes, only the last
+transaction is lost, and the last successful write will include primary
+and index data.
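+
+As a minimal illustration of this pattern with the Python LMDB binding
+(database names, keys and values below are made up and do not reflect the
+actual store layout), a write that touches both the data and an index
+looks like:
+
+.. code-block:: python
+
+   import lmdb
+
+   env = lmdb.open('/path/to/store', max_dbs=8, map_size=2 ** 30)
+   main_db = env.open_db(b'main')
+   index_db = env.open_db(b'index', dupsort=True)
+
+   # Both puts are committed together; if anything fails the transaction
+   # is aborted, so the index can never drift from the main data.
+   with env.begin(write=True) as txn:
+       txn.put(b'triple-key', b'serialized triple', db=main_db)
+       txn.put(b'subject-key', b'triple-key', db=index_db)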
+
+Concurrency
+-----------
+
+LMDB employs
+`MVCC <https://en.wikipedia.org/wiki/Multiversion_concurrency_control>`__
+to achieve fully ACID transactions. This implies that during a write,
+the whole database is locked. Multiple writes can be initiated
+concurrently, but the performance gain of doing so may be little because
+only one write operation can be performed at a time. Reasonable efforts
+have been put to make write transactions as short as possible (and more
+can be done). Also, this excludes a priori the option to implement
+long-running atomic operations, unless one is willing to block writes on
+the application for an indefinite length of time. On the other hand,
+write operations never block and are never blocked, so an application
+with a high read-to-write ratio may still benefit from multi-threaded
+requests.
+
+Performance
+-----------
+
+The `Performance Benchmark Report <performance.rst>`__ contains benchmark
+results.
+
+Write performance is lower than Modeshape/Fedora4; this may be mostly
+due to the fact that indices are written synchronously in a blocking
+transaction; also, the LMDB B+Tree structure is optimized for read
+performance rather than write performance. Some optimizations on the
+application layer could be made.
+
+Reads are faster than Modeshape/Fedora.
+
+All tests so far have been performed in a single thread.