
Convert all .md docs to .rst.

Stefano Cossu, 7 years ago
Parent
Current commit: af458f1604

+ 0 - 175
README.md

@@ -1,175 +0,0 @@
-# LAKEsuperior
-
-[![build status](
-  http://img.shields.io/travis/scossu/lakesuperior/master.svg?style=flat)](
- https://travis-ci.org/username/repo)
-
-LAKEsuperior is an alternative [Fedora Repository](http://fedorarepository.org)
-implementation.
-
-## Guiding Principles
-
-LAKEsuperior aims at being an uncomplicated, efficient Fedora 4 implementation.
-
-Its main goals are:
-
-- **Reliability:** Based on solid technologies with stability in mind.
-- **Efficiency:** Small memory and CPU footprint, high scalability.
-- **Ease of management:** Tools to perform monitoring and maintenance included.
-- **Simplicity of design:** Straight-forward architecture, robustness over
-  features.
-
-## Key features
-
-- Drop-in replacement for Fedora4 (with some
-  [caveats](docs/notes/fcrepo4_deltas.md)); currently being tested with Hyrax 2
-- Very stable persistence layer based on [LMDB](https://symas.com/lmdb/) and
-  filesystem. Fully ACID-compliant writes guarantee consistency of data.
-- Term-based search (*planned*) and SPARQL Query API + UI
-- No performance penalty for storing many resources under the same container; no
-  [kudzu](https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml)
-  pairtree segmentation <sup id="a1">[1](#f1)</sup>
-- Extensible [provenance metadata](docs/notes/model.md) tracking
-- [Multi-modal access](docs/notes/architecture.md#multi-modal-access): HTTP
-  (REST), command line interface and native Python API.
-- Fits in a pocket: you can carry 50M triples in an 8Gb memory stick.
-
-Implementation of the official [Fedora API specs](https://fedora.info/spec/)
-(Fedora 5.x and beyond) is not
-foreseen in the short term, however it would be a natural evolution of this
-project if it gains support.
-
-Please make sure you read the [Delta document](docs/notes/fcrepo4_deltas.md) for
-divergences with the official Fedora4 implementation.
-
-## Target Audience
-
-LAKEsuperior is for anybody who cares about preserving data in the long term.
-
-Less vaguely, LAKEsuperior is targeted at who needs to store large quantities
-of highly linked metadata and documents.
-
-Its Python/C environment and API make it particularly well suited for academic
-and scientific environments who would be able to embed it in a Python
-application as a library or extend it via plug-ins.
-
-LAKEsuperior is able to be exposed to the Web as a
-[Linked Data Platform](https://www.w3.org/TR/ldp-primer/) server. It also acts
-as a SPARQL query (read-only) endpoint, however it is not meant to be used as
-a full-fledged triplestore at the moment.
-
-In its current status, LAKEsuperior is aimed at developers and
-hands-on managers who are interested in evaluating this project.
-
-## Quick Install: Running in Docker
-
-You can run LAKEsuperior in Docker for a hands-off quickstart.
-
-[Docker](http://docker.com/) is a containerization platform that allows you to run
-services in lightweight virtual machine environments without having to worry about
-installing all of the prerequisites on your host machine.
-
-1. Install the correct [Docker Community Edition](https://www.docker.com/community-edition)
-   for your operating system.
-1. Clone this repo: `git clone https://github.com/scossu/lakesuperior.git`
-1. `cd` into repo folder
-1. Run `docker-compose up`
-
-LAKEsuperior should now be available at `http://localhost:8000/`.
-
-The provided Docker configuration includes persistent storage as a self-container Docker
-volume, meaning your data will persist between runs. If you want to clear the decks,
-simply run `docker-compose down -v`.
-
-## Manual Install (a bit less quick, a bit more power)
-
-**Note:** These instructions have been tested on Linux. They may work on Darwin
-with little modification, and possibly on Windows with some
-modifications. Feedback is welcome.
-
-### Dependencies
-
-1. Python 3.5 or greater.
-1. A message broker supporting the STOMP protocol. For testing and evaluation
-purposes, [CoilMQ](https://github.com/hozn/coilmq) is included with the
-dependencies and should be automatically installed.
-
-### Installation steps
-
-1. Create a virtualenv in a project folder:
-   `virtualenv -p <python 3.5+ exec path> <virtualenv folder>`
-1. Activate the virtualenv: `source <path_to_virtualenv>/bin/activate`
-1. Clone this repo: `git clone https://github.com/scossu/lakesuperior.git`
-1. `cd` into repo folder
-1. Install dependencies: `pip install -r requirements.txt`
-1. Start your STOMP broker, e.g.: `coilmq &`. If you have another queue manager
-   listening to port 61613 you can either configure a different port on the
-   application configuration, or use the existing message queue.
-1. Run `./lsup-admin bootstrap` to initialize the binary and graph stores
-1. Run `./fcrepo`.
-
-### Configuration
-
-The app should run for testing and evaluation purposes without any further
-configuration. All the application data are stored by default in the `data`
-directory.
-
-To change the default configuration you should:
-
-1. Copy the `etc.skeleton` folder to a separate location
-1. Set the configuration folder location in the environment:
-   `export FCREPO_CONFIG_DIR=<your config dir location>` (you can
-   add this line at the end of your virtualenv `activate` script)
-1. Configure the application
-1. Bootstrap the app or copy the original data folders to the new location if
-   any loction options changed
-1. (Re)start the server: `./fcrepo`
-
-The configuration options are documented in the files.
-
-**Note:** `test.yml` must specify a different location for the graph and for
-the binary stores than the default one, otherwise running a test suite will
-destroy your main data store. The application will issue an error message and
-refuse to start if these locations overlap.
-
-### Production deployment
-
-If you like fried repositories for lunch, deploy before 11AM.
-
-## Status and development
-
-LAKEsuperior is in **alpha** status. Please see the
-[project issues](https://github.com/scossu/lakesuperior/issues) list for a
-rudimentary road map.
-
-## Contributing
-
-This has been so far a single person's off-hours project (with much input from
-several sides). In order to turn into anything close to a Beta release and
-eventually to a production-ready implementation, it needs some community love.
-
-Contributions are welcome in all forms, including ideas, issue reports, or
-even just spinning up the software and providing some feedback. LAKEsuperior is
-meant to live as a community project.
-
-## Technical documentation
-
-[Architecture Overview](docs/notes/architecture.md)
-
-[Content Model](docs/notes/model.md)
-
-[Messaging](docs/notes/messaging.md)
-
-[Migration, Backup & Restore](docs/notes/migration.md)
-
-[Command-Line Reference](docs/notes/cli.md)
-
-[Storage Implementation](docs/notes/storage.md)
-
-[Performance Benchmarks](docs/notes/performance.md)
-
----
-
-<b id="f1">1</b> However if your client splits pairtrees upstream, such as
-Hyrax does, that obviously needs to change to get rid of the path
-segments. [↩](#a1)

+ 194 - 0
README.rst

@@ -0,0 +1,194 @@
+LAKEsuperior
+============
+
+|build status|
+
+LAKEsuperior is an alternative `Fedora
+Repository <http://fedorarepository.org>`__ implementation.
+
+Guiding Principles
+------------------
+
+LAKEsuperior aims at being an uncomplicated, efficient Fedora 4
+implementation.
+
+Its main goals are:
+
+-  **Reliability:** Based on solid technologies with stability in mind.
+-  **Efficiency:** Small memory and CPU footprint, high scalability.
+-  **Ease of management:** Tools to perform monitoring and maintenance
+   included.
+-  **Simplicity of design:** Straightforward architecture, robustness
+   over features.
+
+Key features
+------------
+
+-  Drop-in replacement for Fedora4 (with some
+   `caveats <docs/fcrepo4_deltas.rst>`__); currently being tested
+   with Hyrax 2
+-  Very stable persistence layer based on
+   `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
+   ACID-compliant writes guarantee consistency of data.
+-  Term-based search (*planned*) and SPARQL Query API + UI
+-  No performance penalty for storing many resources under the same
+   container; no
+   `kudzu <https://www.nature.org/ourinitiatives/urgentissues/land-conservation/forests/kudzu.xml>`__
+   pairtree segmentation [1]_
+-  Extensible `provenance metadata <docs/model.rst>`__ tracking
+-  `Multi-modal
+   access <docs/architecture.rst#multi-modal-access>`__: HTTP
+   (REST), command line interface and native Python API.
+-  Fits in a pocket: you can carry 50M triples in an 8 GB memory stick.
+
+Implementation of the official `Fedora API
+specs <https://fedora.info/spec/>`__ (Fedora 5.x and beyond) is not
+foreseen in the short term; however, it would be a natural evolution of
+this project if it gains support.
+
+Please make sure you read the `Delta
+document <docs/fcrepo4_deltas.rst>`__ for divergences with the
+official Fedora4 implementation.
+
+Target Audience
+---------------
+
+LAKEsuperior is for anybody who cares about preserving data in the long
+term.
+
+Less vaguely, LAKEsuperior is targeted at those who need to store large
+quantities of highly linked metadata and documents.
+
+Its Python/C environment and API make it particularly well suited for
+academic and scientific environments, which can embed it in a Python
+application as a library or extend it via plug-ins.
+
+LAKEsuperior can be exposed to the Web as a `Linked Data
+Platform <https://www.w3.org/TR/ldp-primer/>`__ server. It also acts as
+a read-only SPARQL query endpoint; however, it is not meant to be used
+as a full-fledged triplestore at the moment.
+
+In its current state, LAKEsuperior is aimed at developers and hands-on
+managers who are interested in evaluating this project.
+
+Quick Install: Running in Docker
+--------------------------------
+
+You can run LAKEsuperior in Docker for a hands-off quickstart.
+
+`Docker <http://docker.com/>`__ is a containerization platform that
+allows you to run services in lightweight virtual machine environments
+without having to worry about installing all of the prerequisites on
+your host machine.
+
+1. Install the correct `Docker Community
+   Edition <https://www.docker.com/community-edition>`__ for your
+   operating system.
+2. Clone this repo:
+   ``git clone https://github.com/scossu/lakesuperior.git``
+3. ``cd`` into repo folder
+4. Run ``docker-compose up``
+
+LAKEsuperior should now be available at ``http://localhost:8000/``.
+
+The provided Docker configuration includes persistent storage as a
+self-contained Docker volume, meaning your data will persist between
+runs. If you want to clear the decks, simply run
+``docker-compose down -v``.
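+
+As a quick sanity check (a minimal sketch; it assumes the Docker stack
+above is running and uses the default ``/ldp`` root endpoint), you can
+fetch the repository root from Python:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> # Ask for a Turtle representation of the repository root.
+    >>> req = Request('http://localhost:8000/ldp', headers={'Accept': 'text/turtle'})
+    >>> print(urlopen(req).read().decode())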
+
+Manual Install (a bit less quick, a bit more power)
+---------------------------------------------------
+
+**Note:** These instructions have been tested on Linux. They may work on
+Darwin with little modification, and possibly on Windows with some
+modifications. Feedback is welcome.
+
+Dependencies
+~~~~~~~~~~~~
+
+1. Python 3.5 or greater.
+2. A message broker supporting the STOMP protocol. For testing and
+   evaluation purposes, `CoilMQ <https://github.com/hozn/coilmq>`__ is
+   included with the dependencies and should be automatically installed.
+
+Installation steps
+~~~~~~~~~~~~~~~~~~
+
+1. Create a virtualenv in a project folder:
+   ``virtualenv -p <python 3.5+ exec path> <virtualenv folder>``
+2. Activate the virtualenv: ``source <path_to_virtualenv>/bin/activate``
+3. Clone this repo:
+   ``git clone https://github.com/scossu/lakesuperior.git``
+4. ``cd`` into repo folder
+5. Install dependencies: ``pip install -r requirements.txt``
+6. Start your STOMP broker, e.g.: ``coilmq &``. If you have another
+   queue manager listening on port 61613, you can either configure a
+   different port in the application configuration or use the existing
+   message queue.
+7. Run ``./lsup-admin bootstrap`` to initialize the binary and graph
+   stores
+8. Run ``./fcrepo``.
+
+Configuration
+~~~~~~~~~~~~~
+
+The app should run for testing and evaluation purposes without any
+further configuration. All application data is stored by default in
+the ``data`` directory.
+
+To change the default configuration you should:
+
+1. Copy the ``etc.skeleton`` folder to a separate location
+2. Set the configuration folder location in the environment:
+   ``export FCREPO_CONFIG_DIR=<your config dir location>`` (you can add
+   this line at the end of your virtualenv ``activate`` script)
+3. Configure the application
+4. Bootstrap the app or copy the original data folders to the new
+   location if any location options changed
+5. (Re)start the server: ``./fcrepo``
+
+The configuration options are documented in the files.
+
+**Note:** ``test.yml`` must specify different locations for the graph
+and binary stores than the default ones; otherwise, running the test
+suite will destroy your main data store. The application will issue an
+error message and refuse to start if these locations overlap.
+
+Production deployment
+~~~~~~~~~~~~~~~~~~~~~
+
+If you like fried repositories for lunch, deploy before 11AM.
+
+Status and development
+----------------------
+
+LAKEsuperior is in **alpha** status. Please see the `project
+issues <https://github.com/scossu/lakesuperior/issues>`__ list for a
+rudimentary road map.
+
+Contributing
+------------
+
+So far this has been a single person’s off-hours project (with much
+input from several sides). In order to turn it into anything close to a
+beta release and eventually a production-ready implementation, it needs
+some community love.
+
+Contributions are welcome in all forms, including ideas, issue reports,
+or even just spinning up the software and providing some feedback.
+LAKEsuperior is meant to live as a community project.
+
+Documentation
+-------------
+
+The documentation is maintained on `Read The Docs
+<http://lakesuperior.readthedocs.io/en/latest/>`__.
+
+--------------
+
+.. [1] However, if your client splits pairtrees upstream, as Hyrax does,
+   that obviously needs to change to get rid of the path segments.
+
+.. |build status| image:: http://img.shields.io/travis/scossu/lakesuperior/master.svg?style=flat
+   :target: https://travis-ci.org/scossu/lakesuperior

+ 48 - 0
docs/api.rst

@@ -0,0 +1,48 @@
+API Documentation
+==================
+
+.. module:: lakesuperior
+
+.. automodule:: lakesuperior.api.resource
+   :members:
+
+.. automodule:: lakesuperior.api.query
+   :members:
+
+.. automodule:: lakesuperior.api.admin
+   :members:
+
+
+Resource API
+~~~~~~~~~~~~
+
+.. autoclass:: lakesuperior.model.ldpr.Ldpr
+   :members:
+   :inherited-members:
+
+.. autoclass:: lakesuperior.model.ldp_factory.LdpFactory
+   :members:
+   :inherited-members:
+
+Low-Level API
+~~~~~~~~~~~~~
+
+Store Layout
+^^^^^^^^^^^^
+
+.. autoclass:: lakesuperior.store.ldp_rs.rsrc_centric_layout.RsrcCentricLayout
+   :members:
+   :inherited-members:
+
+.. autoclass:: lakesuperior.store.ldp_nr.base_non_rdf_layout.BaseNonRdfLayout
+   :members:
+   :inherited-members:
+
+RDF Store Abstraction
+^^^^^^^^^^^^^^^^^^^^^
+
+.. automodule:: lakesuperior.store.ldp_rs.lmdb_store
+   :members:
+   :inherited-members:
+
+

+ 48 - 0
docs/architecture.rst

@@ -0,0 +1,48 @@
+LAKEsuperior Architecture
+=========================
+
+LAKEsuperior is written in Python. Parts of the code may eventually be
+rewritten in `Cython <http://cython.readthedocs.io/>`__ for
+performance.
+
+Multi-Modal Access
+------------------
+
+LAKEsuperior services and data are accessible in multiple ways:
+
+-  Via HTTP. This is the canonical way to interact with LDP resources
+   and conforms quite closely to the Fedora specs (currently v4).
+-  Via command line. This method includes long-running admin tasks which
+   are not available via HTTP.
+-  Via a Python API. This allows Python scripts to access the same
+   functionality available via the two access methods above in a
+   programmatic way. It is possible to write Python plugins or even to
+   embed LAKEsuperior in a Python application, without running a web
+   server.
+
+Architecture Overview
+---------------------
+
+.. figure:: assets/lakesuperior_arch.png
+   :alt: LAKEsuperior Architecture
+
+   LAKEsuperior Architecture
+
+The LAKEsuperior REST API provides access to the underlying Python API.
+All REST and CLI operations can be replicated by a Python program
+accessing this API.
+
+The main advantage of the Python API is that it makes it very easy to
+manipulate graph and binary data without the need to serialize or
+deserialize native data structures. This matters, for example, when
+handling large ETL jobs.
+
+The Python API is divided into three main areas:
+
+-  `Resource API <../../lakesuperior/api/resource.py>`__. This API is in
+   charge of all the resource CRUD operations and implements the
+   majority of the Fedora specs.
+-  `Admin API <../../lakesuperior/api/admin.py>`__. This exposes utility
+   methods, mostly long-running maintenance jobs.
+-  `Query API <../../lakesuperior/api/query.py>`__. This provides
+   several facilities for querying repository data.
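+
+As a minimal sketch of this programmatic access (it reuses the calls
+shown in the Content Model document; the resource UID is purely
+illustrative), a Python script can fetch a resource without going
+through HTTP:
+
+::
+
+    >>> import lakesuperior.env_setup
+    >>> from lakesuperior.api import resource
+    >>> # Retrieve an LDP resource by its UID and read its identifier back.
+    >>> rsrc = resource.get('/a/b/c')
+    >>> rsrc.uid
+    /a/b/c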

+ 33 - 0
docs/cli.rst

@@ -0,0 +1,33 @@
+LAKEsuperior Command Line Reference
+===================================
+
+The LAKEsuperior command line tool is used for maintenance and
+administration purposes.
+
+The script is invoked from the main install directory. The tool is
+self-documented, so this is just a redundant overview:
+
+::
+
+    $ ./lsup_admin
+    Usage: lsup-admin [OPTIONS] COMMAND [ARGS]...
+
+    Options:
+      --help  Show this message and exit.
+
+      bootstrap     Bootstrap binary and graph stores.
+      check_fixity  [STUB] Check fixity of a resource.
+      check_refint  [STUB] Check referential integrity.
+      cleanup       [STUB] Clean up orphan database items.
+      copy          [STUB] Copy (backup) repository data.
+      dump          [STUB] Dump repository to disk.
+      load          [STUB] Load serialized repository data.
+      stats         Print repository statistics.
+
+All entries marked ``[STUB]`` are not yet implemented; however, the
+``lsup_admin <command> --help`` command will issue a description of what
+the command is meant to do. Please see the `TODO <TODO>`__ document for
+a rough road map.
+
+All of the above commands are also available via, and based upon, the
+native Python API.

+ 242 - 0
docs/fcrepo4_deltas.rst

@@ -0,0 +1,242 @@
+Divergences between LAKEsuperior and FCREPO4
+=============================================
+
+This is a (vastly incomplete) list of discrepancies between the current
+FCREPO4 implementation and LAKEsuperior. More will be added as more
+clients use it.
+
+Not yet implemented (but in the plans)
+--------------------------------------
+
+-  Various headers handling
+-  Versioning (incomplete)
+-  AuthN/Z
+-  Fixity check
+-  Blank nodes
+
+Potentially breaking changes
+----------------------------
+
+The following divergences may lead to incompatibilities with some
+clients.
+
+Atomicity
+~~~~~~~~~
+
+FCREPO4 supports batch atomic operations, whereby a transaction can be
+opened and a number of operations (i.e. multiple R/W requests to the
+repository) can be performed. The operations are persisted in the
+repository only if and when the transaction is committed.
+
+LAKEsuperior only supports atomicity within a single HTTP request: a
+single HTTP request that should result in multiple write operations to
+the storage layer is only persisted if no exception is thrown.
+Otherwise, the operation is rolled back in order to prevent resources
+from being left in an inconsistent state.
+
+Tombstone methods
+~~~~~~~~~~~~~~~~~
+
+If a client requests a tombstone resource in FCREPO4 with a method other
+than DELETE, the server will return ``405 Method Not Allowed``
+regardless of whether the tombstone exists or not.
+
+LAKEsuperior will return ``405`` only if the tombstone actually exists,
+``404`` otherwise.
+
+Web UI
+~~~~~~
+
+FCREPO4 includes a web UI for simple CRUD operations.
+
+Such a UI is not in the immediate LAKEsuperior development plans.
+However, a basic UI is available for read-only interaction: LDP resource
+browsing, SPARQL query and other search facilities, and administrative
+tools. Some of the latter *may* involve write operations, such as
+clean-up tasks.
+
+Automatic path segment generation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A ``POST`` request without a slug in FCREPO4 results in a pairtree
+consisting of several intermediate nodes leading to the automatically
+minted identifier. E.g.
+
+::
+
+    POST /rest
+
+results in ``/rest/8c/9a/07/4e/8c9a074e-dda3-5256-ea30-eec2dd4fcf61``
+being created.
+
+The same request in LAKEsuperior would create
+``/rest/8c9a074e-dda3-5256-ea30-eec2dd4fcf61`` (obviously the
+identifiers will be different).
+
+This seems to break Hyrax at some point, but might have been fixed. This
+needs to be verified further.
+
+Non-standard client breaking changes
+------------------------------------
+
+The following changes may be incompatible with clients relying on some
+FCREPO4 behavior not endorsed by LDP or other specifications.
+
+Pairtrees
+~~~~~~~~~
+
+FCREPO4 generates “pairtree” resources if a resource is created in a
+path whose segments are missing. E.g. when creating ``/a/b/c/d``, if
+``/a/b`` and ``/a/b/c`` do not exist, FCREPO4 will create two Pairtree
+resources. POSTing and PUTting into Pairtrees is not allowed. Also, a
+containment triple is established between the closest LDPC and the
+created resource, e.g. if ``a`` exists, a
+``</a> ldp:contains </a/b/c/d>`` triple is created.
+
+LAKEsuperior does not employ Pairtrees. In the example above
+LAKEsuperior would create a fully qualified LDPC for each missing
+segment, which can be POSTed and PUT to. Containment triples are created
+between each link in the path, i.e. ``</a> ldp:contains </a/b>``,
+``</a/b> ldp:contains </a/b/c>`` etc. This may potentially break clients
+relying on the direct containment model.
+
+The rationale behind this change is that Pairtrees are the byproduct of
+a limitation imposed by Modeshape and introduce complexity in the
+software stack and confusion for the client. LAKEsuperior aligns with
+the more intuitive UNIX filesystem model, where each segment of a path
+is a “folder” or container (except for the leaf nodes that can be either
+folders or files). In any case, clients are discouraged from generating
+deep paths in LAKEsuperior without a specific purpose because these
+resources create unnecessary data.
+
+Non-mandatory, non-authoritative slug in version POST
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO4 requires a ``Slug`` header to POST to ``fcr:versions`` to create
+a new version.
+
+LAKEsuperior adheres to the more general FCREPO POST rule: if no slug
+is provided, an automatic ID is generated instead. The ID is a UUID4.
+
+Note that internally this ID is not called “label” but “uid” since it is
+treated as a fully qualified identifier. The ``fcrepo:hasVersionLabel``
+predicate, however ambiguous in this context, will be kept until the
+adoption of Memento, which will change the retrieval mechanisms.
+
+Also, if a POST is issued on the same resource ``fcr:versions`` location
+using a version ID that already exists, LAKEsuperior will just mint a
+random identifier rather than returning an error.
+
+Deprecation track
+-----------------
+
+LAKEsuperior offers some “legacy” options to replicate the FCREPO4
+behavior; however, it encourages new development to use a different
+approach for some types of interaction.
+
+Endpoints
+~~~~~~~~~
+
+The FCREPO root endpoint is ``/rest``. The LAKEsuperior root endpoint is
+``/ldp``.
+
+This should not pose a problem if a client does not have ``rest``
+hard-coded in its code, but in any event, the ``/rest`` endpoint is
+provided for backwards compatibility.
+
+LAKEsuperior adds the (currently stub) ``query`` endpoint. Other
+endpoints for non-LDP services may be opened in the future.
+
+Automatic LDP class assignment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since LAKEsuperior rejects client-provided server-managed triples, and
+since the LDP types are among them, the LDP container type is inferred
+from the provided properties: if the ``ldp:hasMemberRelation`` and
+``ldp:membershipResource`` properties are provided, the resource is a
+Direct Container. If in addition to these the
+``ldp:insertedContentRelation`` property is present, the resource is an
+Indirect Container. If any of the first two are missing, the resource is
+a Container (@TODO discuss: shall it be a Basic Container?)
+
+Clients are encouraged to omit LDP types in PUT, POST and PATCH
+requests.
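+
+For illustration (a hedged sketch: the endpoint, target path and
+membership predicate are hypothetical and only meant to show the
+inference rule above), a POST whose payload carries only
+``ldp:membershipResource`` and ``ldp:hasMemberRelation`` should yield a
+Direct Container:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> body = b'''
+    ... @prefix ldp: <http://www.w3.org/ns/ldp#> .
+    ... <> ldp:membershipResource </ldp/members> ;
+    ...    ldp:hasMemberRelation <http://purl.org/dc/terms/hasPart> .
+    ... '''
+    >>> req = Request('http://localhost:8000/ldp', data=body, method='POST',
+    ...     headers={'Content-Type': 'text/turtle'})
+    >>> urlopen(req).status  # expect 201 Created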
+
+Lenient handling
+~~~~~~~~~~~~~~~~
+
+FCREPO4 requires server-managed triples to be expressly indicated in a
+PUT request, unless the ``Prefer`` header is set to
+``handling=lenient; received="minimal"``, in which case the RDF payload
+must not have any server-managed triples.
+
+LAKEsuperior works under the assumption that clients should never
+provide server-managed triples. It automatically handles PUT requests
+sent to existing resources by returning a 412 if any server-managed
+triples are included in the payload. This is the same as setting
+``Prefer`` to ``handling=strict``, which is the default.
+
+If ``Prefer`` is set to ``handling=lenient``, all server-managed triples
+sent with the payload are ignored.
+
+Clients using the ``Prefer`` header to control PUT behavior as
+advertised by the specs should not notice any difference.
+
+Optional improvements
+---------------------
+
+The following are improvements in performance or usability that can only
+be taken advantage of if client code is adjusted.
+
+LDP-NR content and metadata
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO4 relies on the ``/fcr:metadata`` identifier to retrieve RDF
+metadata about an LDP-NR. LAKEsuperior supports this as a legacy option,
+but encourages the use of content negotiation to do the same while
+offering explicit endpoints for RDF and non-RDF content retrieval.
+
+Any request to an LDP-NR with an ``Accept`` header set to one of the
+supported RDF serialization formats will yield the RDF metadata of the
+resource instead of the binary contents.
+
+The ``fcr:metadata`` URI returns the RDF metadata of an LDP-NR.
+
+The ``fcr:content`` URI returns the non-RDF content.
+
+The two options above return an HTTP error if requested for an LDP-RS.
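+
+For example (a sketch; the binary resource path is hypothetical),
+requesting an LDP-NR with an RDF ``Accept`` header returns its metadata
+rather than its bytes:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> req = Request('http://localhost:8000/ldp/my-image',
+    ...     headers={'Accept': 'text/turtle'})
+    >>> print(urlopen(req).read().decode())  # RDF metadata, not the image payload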
+
+“Include” and “Omit” options for children
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+LAKEsuperior offers an additional ``Prefer`` header option to exclude
+all references to child resources (i.e. by removing all the
+``ldp:contains`` triples) while leaving the other server-managed triples
+when retrieving a resource:
+
+::
+
+    Prefer: return=representation; [include | omit]="http://fedora.info/definitions/v4/repository#Children"
+
+The default behavior is to include all children URIs.
+
+Soft-delete and purge
+~~~~~~~~~~~~~~~~~~~~~
+
+**NOTE**: The implementation of this section is incomplete and debated.
+
+In FCREPO4, deleting a resource leaves a tombstone and removes all
+traces of the previous resource.
+
+In LAKEsuperior, a normal DELETE creates a new version snapshot of the
+resource and puts a tombstone in its place. The resource versions are
+still available in the ``fcr:versions`` location. The resource can be
+“resurrected” by issuing a POST to its tombstone. This will result in a
+``201``.
+
+If a tombstone is deleted, the resource and its versions are completely
+deleted (purged).
+
+Moreover, setting the ``Prefer:no-tombstone`` header option on DELETE
+allows a resource and its versions to be deleted directly without
+leaving a tombstone.
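+
+For example (a sketch; the path is hypothetical and this behavior is
+still debated as noted above), a resource can be purged in one step:
+
+::
+
+    >>> from urllib.request import Request, urlopen
+    >>> req = Request('http://localhost:8000/ldp/a/b/c', method='DELETE',
+    ...     headers={'Prefer': 'no-tombstone'})
+    >>> urlopen(req).status  # the resource and its versions are purged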

+ 16 - 1
docs/index.rst

@@ -3,13 +3,28 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

-Welcome to lakesuperior's documentation!
+LAKEsuperior
========================================

.. toctree::
   :maxdepth: 2
   :caption: Contents:

+    Architecture Overview <architecture>
+    Divergences from Fedora 4 <fcrepo4_deltas>
+    Content Model <model>
+    Messaging SPI <messaging>
+    Migration Guide <migration>
+    Command Line Reference <cli>
+    Storage Implementation <storage>
+    Performance Benchmarks <performance>
+    API documentation <api>
+
+.. toctree::
+   :maxdepth: 3
+   :caption: Technical notes:
+
+    notes/indexing_strategy


Indices and tables

+ 30 - 0
docs/messaging.rst

@@ -0,0 +1,30 @@
+LAKEsuperior Messaging
+======================
+
+LAKEsuperior implements a messaging system based on ActivityStreams, as
+indicated by the `Fedora API
+specs <https://fedora.info/2017/06/30/spec/#notifications>`__. The
+metadata set provided is currently quite minimal but can be easily
+enriched by extending the `default formatter
+class <https://github.com/scossu/lakesuperior/blob/master/lakesuperior/messaging/messenger.py>`__.
+
+STOMP is the only supported protocol at the moment. More protocols may
+be made available at a later time.
+
+LAKEsuperior can send messages to any number of destinations: see
+`configuration <https://github.com/scossu/lakesuperior/blob/master/etc.defaults/application.yml#L79>`__.
+By default, CoilMQ is provided for testing purposes and listens to
+``localhost:61613``. The default route sends messages to
+``/topic/fcrepo``.
+
+A small command-line utility, also provided with the Python
+dependencies, allows you to watch incoming messages. To monitor
+messages, enter the following *after activating your virtualenv*:
+
+::
+
+    stomp -H localhost -P 61613 -L /topic/fcrepo
+
+See the `stomp.py library reference
+page <https://github.com/jasonrbriggs/stomp.py/wiki/Command-Line-Access>`__
+for details.
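+
+Messages can also be consumed programmatically. The sketch below is
+written against the stomp.py 4.x API available at the time of writing
+(listener and ``subscribe`` signatures differ in later releases) and
+assumes the default broker and topic configured above:
+
+::
+
+    import time
+    import stomp
+
+    class FcrepoListener(stomp.ConnectionListener):
+        def on_message(self, headers, message):
+            # 'message' carries the ActivityStreams payload sent by LAKEsuperior.
+            print(message)
+
+    conn = stomp.Connection([('localhost', 61613)])
+    conn.set_listener('', FcrepoListener())
+    conn.start()          # not needed in stomp.py >= 5
+    conn.connect(wait=True)
+    conn.subscribe(destination='/topic/fcrepo', id=1, ack='auto')
+    time.sleep(60)        # keep the script alive long enough to see messages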

+ 65 - 0
docs/migration.rst

@@ -0,0 +1,65 @@
+Migration, Backup & Restore
+===========================
+
+All LAKEsuperior data is by default fully contained in a folder. This
+means that only the data, configurations and code folders are needed for
+it to run. No Postgres, Redis, or such. Data and configuration folders
+can be moved around as needed.
+
+Migration Tool
+--------------
+
+Migration is the process of importing and converting data from a
+different Fedora or LDP implementation into a new LAKEsuperior instance.
+This process uses the HTTP/LDP API of the original repository. A
+command-line utility is available as part of the ``lsup-admin`` suite to
+assist in such operation.
+
+A repository can be migrated with a one-line command such as:
+
+::
+
+    ./lsup-admin migrate http://source-repo.edu/rest /local/dest/folder
+
+For more options, enter
+
+::
+
+    ./lsup-admin migrate --help
+
+The script will crawl through the resources and follow outbound links
+within them. In order to do this, resources are added as raw triples
+(i.e. no consistency checks are made).
+
+**Note:** the consistency check tool has not been implemented yet, but
+its release should follow shortly. This will ensure that all the links
+between resources are consistent in regard to referential integrity.
+
+This script will create a full dataset in the specified destination
+folder, complete with a default configuration that allows you to start
+the LAKEsuperior server immediately after the migration is complete.
+
+Two approaches to migration are possible:
+
+1. By providing a starting point on the source repository. E.g. if the
+   repository you want to migrate is at ``http://repo.edu/rest/prod``
+   you can add the ``-s /prod`` option to the script to avoid migrating
+   irrelevant branches. Note that the script will still reach outside of
+   the starting point if resources are referencing other resources
+   outside of it.
+2. By providing a file containing a list of resources to migrate. This
+   is useful if a source repository cannot produce a full list (e.g. the
+   root node has more children than the server can handle) but a list of
+   individual resources is available via an external index (Solr,
+   triplestore, etc.). The resources can be indicated by their fully
+   qualified URIs or paths relative to the repository root. (*TODO
+   latter option needs testing*)
+
+Backup And Restore
+------------------
+
+A backup of a LAKEsuperior repository consists of copying the RDF and
+non-RDF data folders. These folders are indicated in the application
+configuration. The default commands provided by your OS (``cp``,
+``rsync``, ``tar`` etc. for Unix) are all that is needed.

+ 66 - 0
docs/model.rst

@@ -0,0 +1,66 @@
+LAKEsuperior Content Model Rationale
+====================================
+
+Internal and Public URIs; Identifiers
+-------------------------------------
+
+Resource URIs are stored internally in LAKEsuperior as domain-agnostic
+URIs with the scheme ``info:fcres<resource UID>``. This allows resources
+to be portable across systems. E.g. a resource with an internal URI of
+``info:fcres/a/b/c``, when accessed via the
+``http://localhost:8000/ldp`` endpoint, will be found at
+``http://localhost:8000/ldp/a/b/c``.
+
+The resource UID that makes up the rest of the URI looks like a UNIX
+filesystem path, i.e. it always starts with a forward slash and can be
+made up of multiple segments separated by slashes. E.g. ``/`` is the
+root node UID and ``/a`` is a resource UID just below root. Their
+internal URIs are ``info:fcres/`` and ``info:fcres/a`` respectively.
+
+In the Python API, the UID and internal URI of an LDP resource can be
+accessed via the ``uid`` and ``uri`` properties respectively:
+
+::
+
+    >>> import lakesuperior.env_setup
+    >>> from lakesuperior.api import resource
+    >>> rsrc = resource.get('/a/b/c')
+    >>> rsrc.uid
+    /a/b/c
+    >>> rsrc.uri
+    rdflib.term.URIRef('info:fcres/a/b/c')
+
+Store Layout
+------------
+
+One of the key concepts in LAKEsuperior is the store layout. This is a
+module built with a specific purpose in mind, i.e. allowing fine-grained
+recording of provenance metadata while providing reasonable performance.
+
+Store layout modules could be replaceable (work needs to be done to
+develop an interface to allow that). The default (and currently the
+only) layout shipped with LAKEsuperior is the `resource-centric
+layout <../../lakesuperior/store/ldp_rs/rsrc_centric_layout>`__. This
+layout implements a so-called `graph-per-aspect
+pattern <http://patterns.dataincubator.org/book/graph-per-aspect.html>`__
+which stores different sets of statements about a resource in separate
+named graphs.
+
+The named graphs used for each resource are:
+
+-  An admin graph (``info:fcsystem/graph/admin<resource UID>``) which
+   stores administrative metadata, mostly server-managed triples such as
+   LDP types, system create/update timestamps and agents, etc.
+-  A structure graph (``info:fcsystem/graph/structure<resource UID>``)
+   reserved for containment triples. The reason for this separation is
+   purely convenience, since it makes it easy to retrieve all the
+   properties of a large container without its child references.
+-  One (and, possibly, in the future, more user-defined) named graph for
+   user-provided data
+   (``info:fcsystem/graph/userdata/_main<resource UID>``).
+
+Each of these graphs can be annotated with provenance metadata. The
+layout decides which triples go in which graph based on the predicate or
+RDF type contained in the triple. Adding logic to support arbitrary
+named graphs based e.g. on user agent, or to add more provenance
+information, should be relatively simple.
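+
+As an illustration of the naming scheme above (the UID is arbitrary), a
+resource with UID ``/a/b/c`` has its triples spread across:
+
+::
+
+    info:fcsystem/graph/admin/a/b/c
+    info:fcsystem/graph/structure/a/b/c
+    info:fcsystem/graph/userdata/_main/a/b/c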

+ 0 - 43
docs/notes/architecture.md

@@ -1,43 +0,0 @@
-# LAKEsuperior Architecture
-
-LAKEsuperior is written in Python. It is not excluded that parts of the code
-may be rewritten in [Cython](http://cython.readthedocs.io/) for performance.
-
-
-## Multi-Modal Access
-
-LAKEsuperior services and data are accessible in multiple ways:
-
-- Via HTTP. This is the canonical way to interact with LDP resources and
-  conforms quite closely to the Fedora specs (currently v4).
-- Via command line. This method includes long-running admin tasks which are not
-  available via HTTP.
-- Via a Python API. This method allows to use Python scripts to access the same
-  methods available to the two methods above in a programmatic way. It is
-  possible to write Python plugins or even to embed LAKEsuperior in a
-  Python application, even without running a web server.
-
-
-## Architecture Overview
-
-![LAKEsuperior Architecture](../assets/lakesuperior_arch.png)
-
-The LAKEsuperior REST API provides access to the underlying Python API. All
-REST and CLI operations can be replicated by a Python program accessing this
-API.
-
-The main advantage of the Python API is that it makes it very easy to maipulate
-graph and binary data without the need to serialize or deserialize native data
-structures. This matters when handling large ETL jobs for example.
-
-The Python API is divided in three main areas:
-
-- [Resource API](../../lakesuperior/api/resource.py). This API is in charge of
-  all the resource CRUD operations and implements the majority of the Fedora
-  specs.
-- [Admin API](../../lakesuperior/api/admin.py). This exposes utility methods,
-  mostly long-running maintenance jobs.
-- [Query API](../../lakesuperior/api/query.py). This provides several
-  facilities for querying repository data.
-
-

+ 0 - 33
docs/notes/cli.md

@@ -1,33 +0,0 @@
-# LAKEsuperior Command Line Reference
-
-The LAKEsuperior command line tool is used for maintenance and administration
-purposes.
-
-The script is invoked from the main install directory. The tool is
-self-documented, so this is just a redundant overview:
-
-```
-$ ./lsup_admin
-Usage: lsup-admin [OPTIONS] COMMAND [ARGS]...
-
-Options:
-  --help  Show this message and exit.
-
-  bootstrap     Bootstrap binary and graph stores.
-  check_fixity  [STUB] Check fixity of a resource.
-  check_refint  [STUB] Check referential integrity.
-  cleanup       [STUB] Clean up orphan database items.
-  copy          [STUB] Copy (backup) repository data.
-  dump          [STUB] Dump repository to disk.
-  load          [STUB] Load serialized repository data.
-  stats         Print repository statistics.
-
-```
-
-All entries marked `[STUB]` are not yet implemented, however the
-`lsup_admin <command> --help` command will issue a description of what the
-command is meant to do. Please see the [TODO](TODO) document for a rough road
-map.
-
-All of the above commands are also available via, and based upon, the native
-Python API.

+ 0 - 213
docs/notes/fcrepo4_deltas.md

@@ -1,213 +0,0 @@
-# Divergencies between lakesuperior and FCREPO4
-
-This is a (vastly incomplete) list of discrepancies between the current FCREPO4
-implementation and LAKEsuperior. More will be added as more clients will use
-it.
-
-
-## Not yet implemented (but in the plans)
-
-- Various headers handling
-- Versioning (incomplete)
-- AuthN/Z
-- Fixity check
-- Blank nodes
-
-
-## Potentially breaking changes
-
-The following  divergences may lead into incompatibilities with some clients.
-
-### Atomicity
-
-FCREPO4 supports batch atomic operations whereas a transaction can be opened
-and a number of operations (i.e. multiple R/W requests to the repository) can
-be performed. The operations are persisted in the repository only if and when
-the transaction is committed.
-
-LAKesuperior only supports atomicity for a single HTTP request. I.e. a single
-HTTTP request that should result in multiple write operations to the storage
-layer is only persisted if no exception is thrown. Otherwise, the operation is
-rolled back in order to prevent resources to be left in an inconsistent state.
-
-### Tombstone methods
-
-If a client requests a tombstone resource in
-FCREPO4 with a method other than DELETE, the server will return `405 Method Not
-Allowed` regardless of whether the tombstone exists or not.
-
-LAKEsuperior will return `405` only if the tombstone actually exists, `404`
-otherwise.
-
-### Web UI
-
-FCREPO4 includes a web UI for simple CRUD operations.
-
-Such a UI is not in the immediate LAKEsuperior development plans. However, a
-basic UI is available for read-only interaction: LDP resource browsing, SPARQL
-query and other search facilities, and administrative tools. Some of the latter
-*may* involve write operations, such as clean-up tasks.
-
-### Automatic path segment generation
-
-A `POST` request without a slug in FCREPO4 results in a pairtree consisting of
-several intermediate nodes leading to the automatically minted identifier. E.g.
-
-    POST /rest
-
-results in `/rest/8c/9a/07/4e/8c9a074e-dda3-5256-ea30-eec2dd4fcf61` being
-created.
-
-The same request in LAKEsuperior would create
-`/rest/8c9a074e-dda3-5256-ea30-eec2dd4fcf61` (obviously the identifiers will be
-different).
-
-This seems to break Hyrax at some point, but might have been fixed. This needs
-to be verified further.
-
-
-## Non-standard client breaking changes
-
-The following changes may be incompatible with clients relying on some FCREPO4
-behavior not endorsed by LDP or other specifications.
-
-### Pairtrees
-
-FCREPO4 generates "pairtree" resources if a resource is created in a path whose
-segments are missing. E.g. when creating `/a/b/c/d`, if `/a/b` and `/a/b/c` do
-not exist, FCREPO4 will create two Pairtree resources. POSTing and PUTting into
-Pairtrees is not allowed. Also, a containment triple is established between the
-closest LDPC and the created resource, e.g. if `a` exists, a `</a> ldp:contains
-</a/b/c/d>` triple is created.
-
-LAKEsuperior does not employ Pairtrees. In the example above LAKEsuperior would
-create a fully qualified LDPC for each missing segment, which can be POSTed and
-PUT to. Containment triples are created between each link in the path, i.e.
-`</a> ldp:contains </a/b>`, `</a/b> ldp:contains </a/b/c>` etc. This may
-potentially break clients relying on the direct containment model.
-
-The rationale behind this change is that Pairtrees are the byproduct of a
-limitation imposed by Modeshape and introduce complexity in the software stack
-and confusion for the client. LAKEsuperior aligns with the more intuitive UNIX
-filesystem model, where each segment of a path is a "folder" or container
-(except for the leaf nodes that can be either folders or files). In any
-case, clients are discouraged from generating deep paths in LAKEsuperior
-without a specific purpose because these resources create unnecessary data.
-
-### Non-mandatory, non-authoritative slug in version POST
-
-FCREPO4 requires a `Slug` header to POST to `fcr:versions` to create a new
-version.
-
-LAKEsuperior adheres to the more general FCREPO POST rule and if no slug is
-provided, an automatic ID is generated instead. The ID is a UUID4.
-
-Note that internally this ID is not called "label" but "uid" since it
-is treated as a fully qualified identifier. The `fcrepo:hasVersionLabel`
-predicate, however ambiguous in this context, will be kept until the adoption
-of Memento, which will change the retrieval mechanisms.
-
-Also, if a POST is issued on the same resource `fcr:versions` location using
-a version ID that already exists, LAKEsuperior will just mint a random
-identifier rather than returning an error.
-
-
-## Deprecation track
-
-LAKEsuperior offers some "legacy" options to replicate the FCREPO4 behavior,
-however encourages new development to use a different approach for some types
-of interaction.
-
-### Endpoints
-
-The FCREPO root endpoint is `/rest`. The LAKEsuperior root endpoint is `/ldp`.
-
-This should not pose a problem if a client does not have `rest` hard-coded in
-its code, but in any event, the `/rest` endpoint is provided for backwards
-compatibility.
-
-LAKEsuperior adds the (currently stub) `query` endpoint. Other endpoints for
-non-LDP services may be opened in the future.
-
-### Automatic LDP class assignment
-
-Since LAKEsuperior rejects client-provided server-managed triples, and since
-the LDP types are among them, the LDP container type is inferred from the
-provided properties: if the `ldp:hasMemberRelation` and
-`ldp:membershipResource` properties are provided, the resource is a Direct
-Container. If in addition to these the `ldp:insertedContentRelation` property
-is present, the resource is an Indirect Container. If any of the first two are
-missing, the resource is a Container (@TODO discuss: shall it be a Basic
-Container?)
-
-Clients are encouraged to omit LDP types in PUT, POST and PATCH requests.
-
-### Lenient handling
-
-FCREPO4 requires server-managed triples to be expressly indicated in a PUT
-request, unless the `Prefer` header is set to
-`handling=lenient; received="minimal"`, in which case the RDF payload must not
-have any server-managed triples.
-
-LAKEsuperior works under the assumption that client should never provide
-server-managed triples. It automatically handles PUT requests sent to existing
-resources by returning a 412 if any server managed triples are included in the
-payload. This is the same as setting `Prefer` to `handling=strict`, which is
-the default.
-
-If `Prefer` is set to `handling=lenient`, all server-managed triples sent with
-the payload are ignored.
-
-Clients using the `Prefer` header to control PUT behavior as advertised by the
-specs should not notice any difference.
-
-
-## Optional improvements
-
-The following are improvements in performance or usability that can only be taken
-advantage of if client code is adjusted.
-
-### LDP-NR content and metadata
-
-FCREPO4 relies on the `/fcr:metadata` identifier to retrieve RDF metadata about
-an LDP-NR. LAKEsuperior supports this as a legacy option, but encourages the
-use of content negotiation to do the same while offering explicit endpoints
-for RDF and non-RDF content retrieval.
-
-Any request to an LDP-NR with an `Accept` header set to one of the supported
-RDF serialization formats will yield the RDF metadata of the resource instead
-of the binary contents.
-
-The `fcr:metadata` URI returns the RDF metadata of a LDP-NR.
-
-The `fcr:content` URI returns the non-RDF content.
-
-The two optionsabove return an HTTP error if requested for a LDP-RS.
-
-### "Include" and "Omit" options for children
-
-LAKEsuperior offers an additional `Prefer` header option to exclude all
-references to child resources (i.e. by removing all the `ldp:contains` triples)
-while leaving the other server-managed triples when retrieving a resource:
-
-    Prefer: return=representation; [include | omit]="http://fedora.info/definitions/v4/repository#Children"
-
-The default behavior is to include all children URIs.
-
-### Soft-delete and purge
-
-**NOTE**: The implementation of this section is incomplete and debated.
-
-In FCREPO4 a deleted resource leaves a tombstone deleting all traces of the
-previous resource.
-
-In LAKEsuperior, a normal DELETE creates a new version snapshot of the resource
-and puts a tombstone in its place. The resource versions are still available
-in the `fcr:versions` location. The resource can be "resurrected" by
-issuing a POST to its tombstone. This will result in a `201`.
-
-If a tombstone is deleted, the resource and its versions are completely deleted
-(purged).
-
-Moreover, setting the `Prefer:no-tombstone` header option on DELETE allows to
-delete a resource and its versions directly without leaving a tombstone.

+ 0 - 260
docs/notes/indexing_strategy.md

@@ -1,260 +0,0 @@
-# LMDB Store design for RDFLib
-
-This is a log of subsequent strategies employed to store triples in LMDB.
-
-Strategy #5a is the one currently used. The rest is kept for historic reasons
-and academic curiosity (and also because it was too much work to just wipe out
-of memory).
-
-## Storage approach
-
-- Pickle quad and create MD5 or SHA1 hash.
-- Store triples in one database paired with key; store indices separately.
-
-Different strategies involve layout and number of databases.
-
-## Strategy #1
-
-- kq: key: serialized triple (1:1)
-- sk: Serialized subject: key (1:m)
-- pk: Serialized predicate: key (1:m)
-- ok: Serialized object: key (1:m)
-- (optional) lok: Serialized literal object: key (1:m)
-- (optional) tok: Serialized RDF type: key (1:m)
-- ck: Serialized context: key (1:m)
-
-### Retrieval approach
-
-To find all matches for a quad:
-
-- If all terms in the quad are bound, generate the key from the pickled
-quad and look up the triple in `kt`
-- If all terms are unbound, return an iterator of all values in `kt`.
-- If some values are bound and some unbound (most common query):
-  - Get a base list of keys associated wirh the first bound term
-  - For each subsequent bound term, check if each key associated with the term
-  matches a key in the base list
-  - Continue through all the bound terms. If a match is not found at any point,
-  continue to the next term
-  - If a match is found in all the bound term databases, look up the pickled quad
-  matching the key in `kq` and yield it
-
-More optimization can be introduced later, e.g. separating literal and RDF
-type objects in separate databases. Literals can have very long values and a
-database with a longer key setting may be useful. RDF terms can be indexed
-separately because they are the most common bound term.
-
-### Example lookup
-
-Keys and Triples (should actually be quads but this is a simplified version):
-
-A: s1 p1 o1
-B: s1 p2 o2
-C: s2 p3 o1
-D: s2 p3 o3
-
-Indices:
-
-- SK:
-  - s1: A, B
-  - s2: C, D
-- PK:
-  - p1: A
-  - p2: B
-  - p3: C, D
- - OK:
-  - o1: A, C
-  - o2: B
-  - o3: D
-
-Queries:
-
-- s1 ?p ?o → {A, B}
-- s1 p2 ?o → {A, B} & {B} = {B}
-- ?s ?p o3 → {D}
-- s1 p2 o5 → {} (Exit at OK: no term matches 'o5')
-- s2 p3 o2 → {C, D} & {C, D} & {B} = {}
-
-
-## Strategy #2
-
-Separate data and indices in two environments.
-
-### Main data store
-
-Key to quad; main keyspace; all unique.
-
-### Indices
-
-None of these databases is of critical preservation concern. They can be
-rebuilt from the main data store.
-
-All dupsort and dupfixed.
-
-@TODO The first three may not be needed if computing term hash is fast enough.
-
-- t2k (term to term key)
-- lt2k (literal to term key: longer keys)
-- k2t (term key to term)
-
-- s2k (subject key to quad key)
-- p2k (pred key to quad key)
-- o2k (object key to quad key)
-- c2k (context key to quad key)
-
-- sc2qk (subject + context keys to quad key)
-- po2qk (predicate + object keys to quad key)
-
-- sp2qk (subject + predicate keys to quad key)
-- oc2qk (object + context keys to quad key)
-
-- so2qk (subject + object keys to quad key)
-- pc2qk (predicate + context keys to quad key)
-
-
-## Strategy #3
-
-Contexts are much fewer (even in graph per aspect, 5-10 triples per graph)
-
-### Main data store
-
-Preservation-worthy data
-
-- tk:t (triple key: triple; dupsort, dupfixed)
-- tk:c (context key: triple; unique)
-
-### Indices
-
-Rebuildable from main data store
-
-- s2k (subject key: triple key)
-- p2k (pred key: triple key)
-- o2k (object key: triple key)
-- sp2k
-- so2k
-- po2k
-- spo2k
-
-### Lookup
-
-1. Look up triples by s, p, o, sp, so, po and get keys
-2. If a context is specified, for each key try to seek to (context, key) in ct
-   to verify it exists
-3. Intersect sets
-4. Match triple keys with data using kt
-
-#### Shortcuts
-
-- Get all contexts: return list of keys from ct
-- Get all triples for a context: get all values for a contex from ct and match
-  triple data with kt
-- Get one triple match for all contexts: look up in triple indices and match
-  triple data with kt
-
-
-## Strategy #4
-
-Terms are entered individually in main data store. Also, shorter keys are
-used rather than hashes. These two aspects save a great deal of space and I/O,
-but require an additional index to put the terms together in a triple.
-
-### Main Data Store
-
-- t:st (term key: serialized term; 1:1)
-- spo:c (joined S, P, O keys: context key; 1:m)
-- c: (context keys only, values are the empty bytestring)
-
-Storage total: variable
-
-### Indices
-
-- th:t (term hash: term key; 1:1)
-- c:spo (context key: joined triple keys; 1:m)
-- s:po (S key: P + O key; 1:m)
-- p:so (P key: S + O keys; 1:m)
-- o:sp (object key: triple key; 1:m)
-- sp:o (S + P keys: O key; 1:m)
-- so:p (S + O keys: P key; 1:m)
-- po:s (P + O keys: S key; 1:m)
-
-Storage total: 143 bytes per triple
-
-### Disadvantages
-
-- Lots of indices
-- Terms can get orphaned:
-  - No easy way to know if a term is used anywhere in a quad
-  - Needs some routine cleanup
-  - On the other hand, terms are relatively light-weight and can be reused
-  - Almost surely not reusable are UUIDs, message digests, timestamps etc.
-
-
-## Strategy #5
-
-Reduce number of indices and rely on parsing and splitting keys to find triples
-with two bound parameters.
-
-This is especially important for keeping indexing synchronous to achieve fully
-ACID writes.
-
-### Main data store
-
-Same as Strategy #4:
-
-- t:st (term key: serialized term; 1:1)
-- spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
-- c: (context keys only, values are the empty bytestring; 1:1)
-
-Storage total: variable (same as #4)
-
-### Indices
-
-- th:t (term hash: term key; 1:1)
-- s:po (S key: joined P, O keys; dupsort, dupfixed)
-- p:so (P key: joined S, O keys; dupsort, dupfixed)
-- o:sp (O key: joined S, P keys; dupsort, dupfixed)
-- c:spo (context → triple association; dupsort, dupfixed)
-
-Storage total: 95 bytes per triple
-
-### Lookup strategy
-
-- ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term key
-- s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from term key
-- s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
-- s p ?: [s:po] All PO for S → filter result by P in split key
-    → [t:st] term from term key
-
-### Advantages
-
-- Less indices: smaller index size and less I/O
-
-### Disadvantages
-
-- Possibly slower retrieval for queries with 2 bound terms (run metrics)
-
-### Further optimization
-
-In order to minimize traversing and splittig results, the first retrieval
-should be made on the term with less average keys. Search order can be balanced
-by establishing a lookup order for indices.
-
-This can be achieved by calling stats on the index databases and looking up the
-database with *most* keys. Since there is an equal number of entries in each of
-the (s:po, p:so, o:sp) indices, the one with most keys will have the least
-average number of values per key. If that lookup is done first, the initial
-data set to traverse and filter will be smaller.
-
-
-## Strategy #5a
-
-This is a slightly different implementation of #5 that somewhat simplifies and
-perhaps speeds up things a bit. It is the currently employed solution.
-
-The indexing and lookup strtegy is the same; but instead of using a separator
-byte for splitting compound keys, the logic relies on the fact that keys have
-a fixed length and are sliced instead. This *should* result in faster key
-manipulation, also because in most cases `memoryview` buffers can be used
-directly instead of being copied from memory.
-
-Index storage is 90 bytes per triple.

+ 311 - 0
docs/notes/indexing_strategy.rst

@@ -0,0 +1,311 @@
+LMDB Store design for RDFLib
+============================
+
+This is a log of subsequent strategies employed to store triples in
+LMDB.
+
+Strategy #5a is the one currently used. The rest is kept for historic
+reasons and academic curiosity (and also because it was too much work to
+just wipe out of memory).
+
+Storage approach
+----------------
+
+-  Pickle quad and create MD5 or SHA1 hash.
+-  Store triples in one database paired with key; store indices
+   separately.
+
+Different strategies involve layout and number of databases.
+
+Strategy #1
+-----------
+
+-  kq: key: serialized triple (1:1)
+-  sk: Serialized subject: key (1:m)
+-  pk: Serialized predicate: key (1:m)
+-  ok: Serialized object: key (1:m)
+-  (optional) lok: Serialized literal object: key (1:m)
+-  (optional) tok: Serialized RDF type: key (1:m)
+-  ck: Serialized context: key (1:m)
+
+Retrieval approach
+~~~~~~~~~~~~~~~~~~
+
+To find all matches for a quad:
+
+-  If all terms in the quad are bound, generate the key from the pickled
+   quad and look up the triple in ``kt``
+-  If all terms are unbound, return an iterator of all values in ``kt``.
+-  If some values are bound and some unbound (most common query):
+
+   -  Get a base list of keys associated with the first bound term
+   -  For each subsequent bound term, check if each key associated with
+      the term matches a key in the base list
+   -  Continue through all the bound terms. If a match is not found at
+      any point, discard the key and move on to the next one
+   -  If a match is found in all the bound term databases, look up the
+      pickled quad matching the key in ``kq`` and yield it
+
+More optimization can be introduced later, e.g. separating literal and
+RDF type objects in separate databases. Literals can have very long
+values and a database with a longer key setting may be useful. RDF terms
+can be indexed separately because they are the most common bound term.
+
+Example lookup
+~~~~~~~~~~~~~~
+
+Keys and Triples (should actually be quads but this is a simplified
+version):
+
+-  A: s1 p1 o1
+-  B: s1 p2 o2
+-  C: s2 p3 o1
+-  D: s2 p3 o3
+
+Indices:
+
+-  SK:
+
+   -  s1: A, B
+   -  s2: C, D
+
+-  PK:
+
+   -  p1: A
+   -  p2: B
+   -  p3: C, D
+
+-  OK:
+
+   -  o1: A, C
+   -  o2: B
+   -  o3: D
+
+Queries:
+
+-  s1 ?p ?o → {A, B}
+-  s1 p2 ?o → {A, B} & {B} = {B}
+-  ?s ?p o3 → {D}
+-  s1 p2 o5 → {} (Exit at OK: no term matches ‘o5’)
+-  s2 p3 o2 → {C, D} & {C, D} & {B} = {}
+
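+The same intersection logic can be sketched in Python with plain
+dictionaries (an illustration of the algorithm only, not of the LMDB
+layout; the toy indices mirror the example data above):
+
+.. code-block:: python
+
+   # Toy indices from the example above: bound term -> set of quad keys.
+   SK = {'s1': {'A', 'B'}, 's2': {'C', 'D'}}
+   PK = {'p1': {'A'}, 'p2': {'B'}, 'p3': {'C', 'D'}}
+   OK = {'o1': {'A', 'C'}, 'o2': {'B'}, 'o3': {'D'}}
+
+   def lookup(s=None, p=None, o=None):
+       """Intersect the key sets of all bound terms; None means unbound."""
+       result = None
+       for index, term in ((SK, s), (PK, p), (OK, o)):
+           if term is None:
+               continue                      # Unbound term: no constraint.
+           keys = index.get(term, set())
+           result = keys if result is None else result & keys
+           if not result:
+               return set()                  # Exit early: no possible match.
+       # All terms unbound: every quad matches.
+       return result if result is not None else {'A', 'B', 'C', 'D'}
+
+   lookup(s='s1', p='p2')            # {'B'}
+   lookup(o='o3')                    # {'D'}
+   lookup(s='s2', p='p3', o='o2')    # set()
+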
+Strategy #2
+-----------
+
+Separate data and indices in two environments.
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Key to quad; main keyspace; all unique.
+
+Indices
+~~~~~~~
+
+None of these databases is of critical preservation concern. They can be
+rebuilt from the main data store.
+
+All dupsort and dupfixed.
+
+@TODO The first three may not be needed if computing the term hash is fast
+enough.
+
+-  t2k (term to term key)
+-  lt2k (literal to term key: longer keys)
+-  k2t (term key to term)
+
+-  s2k (subject key to quad key)
+-  p2k (pred key to quad key)
+-  o2k (object key to quad key)
+-  c2k (context key to quad key)
+
+-  sc2qk (subject + context keys to quad key)
+-  po2qk (predicate + object keys to quad key)
+
+-  sp2qk (subject + predicate keys to quad key)
+-  oc2qk (object + context keys to quad key)
+
+-  so2qk (subject + object keys to quad key)
+-  pc2qk (predicate + context keys to quad key)
+
+Strategy #3
+-----------
+
+Contexts are much fewer than triples (even in a graph-per-aspect layout,
+each graph only contains 5-10 triples).
+
+.. _main-data-store-1:
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Preservation-worthy data
+
+-  tk:t (triple key: triple; dupsort, dupfixed)
+-  tk:c (context key: triple; unique)
+
+.. _indices-1:
+
+Indices
+~~~~~~~
+
+Rebuildable from main data store
+
+-  s2k (subject key: triple key)
+-  p2k (pred key: triple key)
+-  o2k (object key: triple key)
+-  sp2k
+-  so2k
+-  po2k
+-  spo2k
+
+Lookup
+~~~~~~
+
+1. Look up triples by s, p, o, sp, so, po and get keys
+2. If a context is specified, for each key try to seek to (context, key)
+   in ct to verify it exists
+3. Intersect sets
+4. Match triple keys with data using kt
+
+Shortcuts
+^^^^^^^^^
+
+-  Get all contexts: return list of keys from ct
+-  Get all triples for a context: get all values for a context from ct
+   and match triple data with kt
+-  Get one triple match for all contexts: look up in triple indices and
+   match triple data with kt
+
+Strategy #4
+-----------
+
+Terms are entered individually in the main data store. Also, shorter keys
+are used rather than hashes. These two aspects save a great deal of
+space and I/O, but require an additional index to put the terms together
+in a triple.
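+
+The term/key indirection can be illustrated with plain in-memory Python;
+the hash function, key size and naive key counter are assumptions rather
+than the actual LAKEsuperior code, and ``t:st`` / ``th:t`` are the stores
+listed below:
+
+.. code-block:: python
+
+   import hashlib
+   import pickle
+
+   term_store = {}   # stands in for t:st (term key: serialized term)
+   hash_index = {}   # stands in for th:t (term hash: term key)
+   next_key = 0      # naive counter; real keys are short fixed-length bytes
+
+   def add_term(term):
+       """Store a term only once and return its short key."""
+       global next_key
+       ser = pickle.dumps(term)
+       th = hashlib.sha1(ser).digest()
+       if th in hash_index:               # Term already stored: reuse key.
+           return hash_index[th]
+       key = next_key.to_bytes(5, 'big')  # Assumed 5-byte term keys.
+       next_key += 1
+       term_store[key] = ser
+       hash_index[th] = key
+       return key
+
+   # A triple key is simply the concatenation of its three term keys.
+   spo_key = add_term('<s1>') + add_term('<p1>') + add_term('"a literal"')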
+
+.. _main-data-store-2:
+
+Main Data Store
+~~~~~~~~~~~~~~~
+
+-  t:st (term key: serialized term; 1:1)
+-  spo:c (joined S, P, O keys: context key; 1:m)
+-  c: (context keys only, values are the empty bytestring)
+
+Storage total: variable
+
+.. _indices-2:
+
+Indices
+~~~~~~~
+
+-  th:t (term hash: term key; 1:1)
+-  c:spo (context key: joined triple keys; 1:m)
+-  s:po (S key: P + O key; 1:m)
+-  p:so (P key: S + O keys; 1:m)
+-  o:sp (O key: S + P keys; 1:m)
+-  sp:o (S + P keys: O key; 1:m)
+-  so:p (S + O keys: P key; 1:m)
+-  po:s (P + O keys: S key; 1:m)
+
+Storage total: 143 bytes per triple
+
+Disadvantages
+~~~~~~~~~~~~~
+
+-  Lots of indices
+-  Terms can get orphaned:
+
+   -  No easy way to know if a term is used anywhere in a quad
+   -  Needs some routine cleanup
+   -  On the other hand, terms are relatively light-weight and can be
+      reused
+   -  UUIDs, message digests, timestamps etc. are almost surely not
+      reusable
+
+Strategy #5
+-----------
+
+Reduce number of indices and rely on parsing and splitting keys to find
+triples with two bound parameters.
+
+This is especially important for keeping indexing synchronous to achieve
+fully ACID writes.
+
+.. _main-data-store-3:
+
+Main data store
+~~~~~~~~~~~~~~~
+
+Same as Strategy #4:
+
+-  t:st (term key: serialized term; 1:1)
+-  spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
+-  c: (context keys only, values are the empty bytestring; 1:1)
+
+Storage total: variable (same as #4)
+
+.. _indices-3:
+
+Indices
+~~~~~~~
+
+-  th:t (term hash: term key; 1:1)
+-  s:po (S key: joined P, O keys; dupsort, dupfixed)
+-  p:so (P key: joined S, O keys; dupsort, dupfixed)
+-  o:sp (O key: joined S, P keys; dupsort, dupfixed)
+-  c:spo (context → triple association; dupsort, dupfixed)
+
+Storage total: 95 bytes per triple
+
+Lookup strategy
+~~~~~~~~~~~~~~~
+
+-  ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term
+   key
+-  s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from
+   term key
+-  s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
+-  s p ?: [s:po] All PO for S → filter result by P in split key → [t:st]
+   term from term key
+
+Advantages
+~~~~~~~~~~
+
+-  Fewer indices: smaller index size and less I/O
+
+.. _disadvantages-1:
+
+Disadvantages
+~~~~~~~~~~~~~
+
+-  Possibly slower retrieval for queries with 2 bound terms (run
+   metrics)
+
+Further optimization
+~~~~~~~~~~~~~~~~~~~~
+
+In order to minimize traversing and splitting results, the first
+retrieval should be made on the term whose index has the fewest average
+values per key. Search order can be balanced by establishing a lookup
+order for indices.
+
+This can be achieved by calling stats on the index databases and looking
+up the database with *most* keys. Since there is an equal number of
+entries in each of the (s:po, p:so, o:sp) indices, the one with most
+keys will have the least average number of values per key. If that
+lookup is done first, the initial data set to traverse and filter will
+be smaller.
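+
+A minimal sketch of this ranking with the Python LMDB binding follows.
+The environment path and database names are illustrative; note also that
+for dupsort databases the ``entries`` stat counts duplicate values too,
+so the sketch counts distinct keys with a cursor walk:
+
+.. code-block:: python
+
+   import lmdb
+
+   env = lmdb.open('/path/to/index_env', max_dbs=16, readonly=True)
+   index_names = ('s:po', 'p:so', 'o:sp')
+   dbs = {
+       name: env.open_db(name.encode(), dupsort=True)
+       for name in index_names}
+
+   def distinct_keys(txn, db):
+       # Walk unique keys only, skipping the dupsort duplicate values.
+       cur = txn.cursor(db)
+       return sum(1 for _ in cur.iternext_nodup(keys=True, values=False))
+
+   with env.begin() as txn:
+       # The index with the most distinct keys has the fewest values per
+       # key, hence the smallest initial result set: look it up first.
+       lookup_order = sorted(
+           index_names,
+           key=lambda name: distinct_keys(txn, dbs[name]),
+           reverse=True)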
+
+Strategy #5a
+------------
+
+This is a slightly different implementation of #5 that somewhat
+simplifies and perhaps speeds up things a bit. It is the currently
+employed solution.
+
+The indexing and lookup strategy is the same, but instead of using a
+separator byte for splitting compound keys, the logic relies on the fact
+that keys have a fixed length and are sliced instead. This *should*
+result in faster key manipulation, also because in most cases
+``memoryview`` buffers can be used directly instead of being copied from
+memory.
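+
+For illustration only (the term-key length is an assumption and the hot
+path is not written exactly like this), slicing a joined ``spo`` key, as
+stored in ``spo:c`` or ``c:spo``, into its term keys may look like:
+
+.. code-block:: python
+
+   KEY_LEN = 5   # assumed fixed term-key length, in bytes
+
+   def split_spo(spo_key):
+       """Slice a joined s+p+o key into term keys without copying."""
+       buf = memoryview(spo_key)
+       return (buf[:KEY_LEN],
+               buf[KEY_LEN:2 * KEY_LEN],
+               buf[2 * KEY_LEN:])
+
+   joined = bytes(range(15))            # stand-in for a stored spo key
+   s_key, p_key, o_key = split_spo(joined)
+   assert bytes(p_key) == bytes(range(5, 10))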
+
+Index storage is 90 bytes per triple.

+ 0 - 27
docs/notes/messaging.md

@@ -1,27 +0,0 @@
-# LAKEsuperior Messaging
-
-LAKEsuperior implements a messaging system based on ActivityStreams, as
-indicated by the
-[Fedora API specs](https://fedora.info/2017/06/30/spec/#notifications).
-The metadata set provided is currently quite minimal but can be easily
-enriched by extending the
-[default formatter class](https://github.com/scossu/lakesuperior/blob/master/lakesuperior/messaging/messenger.py).
-
-STOMP is the only supported protocol at the moment. More protocols may be made
-available at a later time.
-
-LAKEsuperior can send messages to any number of destinations: see
-[configuration](https://github.com/scossu/lakesuperior/blob/master/etc.defaults/application.yml#L79).
-By default, CoilMQ is provided for testing purposes and listens to
-`localhost:61613`. The default route sends messages to `/topic/fcrepo`.
-
-A small command-line utility, also provided with the Python dependencies,
-can be used to watch incoming messages. To monitor messages, enter the
-following *after activating your virtualenv*:
-
-```
-stomp -H localhost -P 61613 -L /topic/fcrepo
-```
-
-See the [stomp.py library reference page](https://github.com/jasonrbriggs/stomp.py/wiki/Command-Line-Access)
-for details.

+ 0 - 58
docs/notes/migration.md

@@ -1,58 +0,0 @@
-# Migration, Backup & Restore
-
-All LAKEsuperior data is by default fully contained in a folder. This means
-that only the data, configurations and code folders are needed for it to run.
-No Postgres, Redis, or similar services. Data and configuration folders can be moved around
-as needed.
-
-## Migration Tool
-
-Migration is the process of importing and converting data from a different
-Fedora or LDP implementation into a new LAKEsuperior instance. This process
-uses the HTTP/LDP API of the original repository. A command-line utility is
-available as part of the `lsup-admin` suite to assist in such operation.
-
-A repository can be migrated with a one-line command such as:
-
-```
-./lsup-admin migrate http://source-repo.edu/rest /local/dest/folder
-```
-
-For more options, enter
-
-```
-./lsup-admin migrate --help
-```
-
-The script will crawl through the resources and follow outbound links
-within them. In order to do this, resources are added as raw triples (i.e.
-no consistency checks are made).
-
-**Note:** the consistency check tool has not been implemented yet, but its
-release should follow shortly. This will ensure that all the links
-between resources are consistent in regard to referential integrity.
-
-This script will create a full dataset in the specified destination folder,
-complete with a default configuration that allows the LAKEsuperior server to
-be started immediately after the migration is complete.
-
-Two approaches to migration are possible:
-
-1. By providing a starting point on the source repository. E.g. if the
-   repository you want to migrate is at `http://repo.edu/rest/prod` you can add
-   the `-s /prod` option to the script to avoid migrating irrelevant branches.
-   Note that the script will still reach outside of the starting point if
-   resources are referencing other resources outside of it.
-2. By providing a file containing a list of resources to migrate. This is
-   useful if a source repository cannot produce a full list (e.g. the root node
-   has more children than the server can handle) but a list of individual
-   resources is available via an external index (Solr, triplestore, etc.).
-   The resources can be indicated by their fully qualified URIs or paths
-   relative to the repository root. (*TODO latter option needs testing*)
-
-## Backup And Restore
-
-A backup of a LAKEsuperior repository consists of copying the RDF and non-RDF
-data folders. These folders are indicated in the application configuration. The
-default commands provided by your OS (`cp`, `rsync`, `tar` etc. for Unix) are
-all that is needed.

+ 0 - 63
docs/notes/model.md

@@ -1,63 +0,0 @@
-# LAKEsuperior Content Model Rationale
-
-## Internal and Public URIs; Identifiers
-
-Resource URIs are stored internally in LAKEsuperior as domain-agnostic URIs
-with the scheme `info:fcres<resource UID>`. This allows resources to be
-portable across systems. E.g. a resource with an internal URI of
-`info:fcres/a/b/c`, when accessed via the `http://localhost:8000/ldp`
-endpoint, will be found at `http://localhost:8000/ldp/a/b/c`.
-
-The resource UID making up the internal URI looks like a UNIX
-filesystem path, i.e. it always starts with a forward slash and can be made up
-of multiple segments separated by slashes. E.g. `/` is the root node UID,
-`/a` is a resource UID just below root. Their internal URIs are `info:fcres/`
-and `info:fcres/a` respectively.
-
-In the Python API, the UID and internal URI of an LDP resource can be accessed
-via the `uid` and `uri` properties respectively:
-
-```
->>> import lakesuperior.env_setup
->>> from lakesuperior.api import resource
->>> rsrc = resource.get('/a/b/c')
->>> rsrc.uid
-/a/b/c
->>> rsrc.uri
-rdflib.terms.URIRef('info:fcres/a/b/c')
-```
-
-## Store Layout
-
-One of the key concepts in LAKEsuperior is the store layout. This is a
-module built with a
-specific purpose in mind, i.e. allowing fine-grained recording of provenance
-metadata while providing reasonable performance.
-
-Store layout modules could be replaceable (work needs to
-be done to develop an interface to allow that). The default (and only at the
-moment) layout shipped with LAKEsuperior is the
-[resource-centric layout](../../lakesuperior/store/ldp_rs/rsrc_centric_layout).
-This layout implements a so-called
-[graph-per-aspect pattern](http://patterns.dataincubator.org/book/graph-per-aspect.html)
-which stores different sets of statements about a resource in separate named
-graphs.
-
-The named graphs used for each resource are:
-
-- An admin graph (`info:fcsystem/graph/admin<resource UID>`) which stores
-  administrative metadata, mostly server-managed triples such as LDP types,
-  system create/update timestamps and agents, etc.
-- A structure graph (`info:fcsystem/graph/structure<resource UID>`) reserved for
-  containment triples. The reason
-  for this separation is purely convenience, since it makes it easy to retrieve
-  all the properties of a large container without its child references.
-- One (and, possibly, in the future, more user-defined) named graph for
-  user-provided data (`info:fcsystem/graph/userdata/_main<resource UID>`).
-
-Each of these graphs can be annotated with provenance metadata. The layout
-decides which triples go in which graph based on the predicate or RDF type
-contained in the triple. Adding logic to support arbitrary named graphs based
-e.g. on user agent, or to add more provenance information, should be relatively
-simple.
-

+ 0 - 112
docs/notes/performance.md

@@ -1,112 +0,0 @@
-# Performance Benchmark Report
-
-## Environment
-
-### Hardware
-
-#### ‘Rather Snappy’ Laptop
-
-- Dell Precision M3800 Laptop
-- 4x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
-- 12Gb RAM
-- SSD
-
-#### ‘Ole Workhorse’ server
-
-8x Intel(R) Xeon(R) CPU X5550  @ 2.67GHz
-16Gb RAM
-Magnetic drive, XXX RPM
-
-### Software
-
-- Arch Linux OS
-- glibc 2.26-11
-- python 3.5.4
-- lmdb 0.9.21-1
-
-### Benchmark script
-
-[Generator script](../../util/benchmark.py)
-
-The script was run with default values: 10,000 children under the same parent,
-PUT requests.
-
-### Data Set
-
-Synthetic graph created by the benchmark script. The graph is unique for each
-request and consists of 200 triples which are partly random data, with a
-consistent size and variation:
-
-- 50 triples have an object that is a URI of an external resource (50 unique
-  predicates; 5 unique objects).
-- 50 triples have an object that is a URI of a repository-managed resource
-  (50 unique predicates; 5 unique objects).
-- 100 triples have an object that is a 64-character random Unicode string
-  (50 unique predicates; 100 unique objects).
-
-## Results
-
-### ‘Rather Snappy’ Laptop
-
-#### FCREPO/Modeshape 4.7.5
-
-15'45" running time
-
-0.094" per resource (100%—reference point)
-
-3.4M triples total in repo at the end of the process
-
-Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64" (100%)
-
-Peak memory usage: 2.47Gb
-
-Database size: 3.3 Gb
-
-
-#### LAKEsuperior Alpha 6, LMDB Back End
-
-25' running time
-
-0.152" per resource (161%)
-
-*Some gaps every ~40-50 requests, probably disk flush*
-
-Retrieval of parent resource (10K triples), pipe to /dev/null: 2.13" (58%)
-
-Peak memory usage: ~650 Mb (3 idle workers, 1 active)
-
-Database size: 523 Mb (16%)
-
-### ‘Ole Workhorse’ server
-
-#### FCREPO
-
-0:47:38 running time
-
-0.285" per resource (100%)
-
-Retrieval of parent resource: 9.6" (100%)
-
-#### LAKEsuperior
-
-1:14:19 running time
-
-0.446" per resource (156%)
-
-Retrieval of parent resource: 5.58" (58%)
-
-## Conclusions
-
-LAKEsuperior appears to be markedly slower on writes and markedly faster on
-reads. Both these factors are very likely related to the underlying LMDB store
-which is optimized for read performance.
-
-Comparison of results between the laptop and the server demonstrates that both
-read and write performance gaps
-are identical in the two environments. Disk speed severely affects the numbers.
-
-**Note:** As you can guess, these are only very partial and specific results. They
-should not be taken as a thorough performance assessment. Such an assessment
-may be impossible and pointless to make given the very different nature of
-the storage models, which may behave radically differently depending on many
-variables.

+ 0 - 82
docs/notes/storage.md

@@ -1,82 +0,0 @@
-# Storage Implementation
-
-LAKEsuperior stores non-RDF ("binary") data in the filesystem and RDF data in
-an embedded key-value store, [LMDB](https://symas.com/lmdb/).
-
-## RDF Storage design
-
-LMDB is a very fast, very lightweight C library. It is inspired by BerkeleyDB
-but introduces significant improvements in terms of efficiency and stability.
-
-The LAKEsuperior RDF store consists of two files: the main data store and the
-indices (plus two lock files that are generated at runtime). A good amount of
-effort has been put to develop an indexing strategy that is balanced between
-write performance, read performance, and data size, with no compromise made on
-consistency.
-
-The main data
-store is the one containing the preservation-worthy data. While the indices are
-necessary for LAKEsuperior to function, they can be entirely rebuilt from the
-main data store in case of file corruption (recovery tools are on the TODO
-list).
-
-Detailed notes about the various strategies researched can be found
-[here](indexing_strategy.md).
-
-## Scalability
-
-Since LAKEsuperior is focused on design simplicity, efficiency and reliability,
-its RDF store is embedded and not horizontally scalable. However, LAKEsuperior
-is quite frugal with disk space. About 55 million triples can be
-stored in 8Gb of space (mileage can vary depending on how heterogeneous the
-triples are). This makes it easier to use expensive SSD drives for
-the RDF store, in order to improve performance. A single LMDB environment can
-reportedly scale up to 128 terabytes.
-
-## Maintenance
-
-LMDB has a very simple configuration, and all options are hardcoded
-in LAKEsuperior in order to exploit its features. A database automatically
-recovers from a crash.
-
-The LAKEsuperior RDF store abstraction maintains a registry of unique terms.
-These terms are not deleted if a triple is deleted, even if no triple is using
-them, because it would be too expensive to look for orphaned terms during a
-delete request. While these terms are relatively lightweight, it would be good
-to run a periodical clean-up job. Tools will be developed in the near future to
-facilitate this maintenance task.
-
-## Consistency
-
-LAKEsuperior wraps each LDP operation in a transaction. The indices are updated
-synchronously within the same transaction in order to guarantee
-consistency. If a system loses power or crashes, only the last transaction is
-lost, and the last successful write will include primary and index data.
-
-## Concurrency
-
-LMDB employs
-[MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
-to achieve fully ACID transactions. This implies that during
-a write, the whole database is locked. Multiple writes can be initiated
-concurrently, but the performance gain of doing so may be little because
-only one write operation can be performed at a time. Reasonable efforts have
-been put to make write transactions as short as possible (and more can be
-done). Also, this excludes a priori the option to implement long-running atomic
-operations, unless one is willing to block writes on the application for an
-indefinite length of time. On the other hand, write operations never block and
-are never blocked, so an application with a high read-to-write ratio may still
-benefit from multi-threaded requests.
-
-## Performance
-
-The [Performance Benchmark Report](performance.md) contains benchmark results.
-
-Write performance is lower than Modeshape/Fedora4; this may be mostly due to
-the fact that indices are written synchronously in a blocking transaction;
-also, the LMDB B+Tree structure is optimized for read performance rather than
-write performance. Some optimizations on the application layer could be made.
-
-Reads are faster than Modeshape/Fedora.
-
-All tests so far have been performed in a single thread.

+ 131 - 0
docs/performance.rst

@@ -0,0 +1,131 @@
+Performance Benchmark Report
+============================
+
+Environment
+-----------
+
+Hardware
+~~~~~~~~
+
+‘Rather Snappy’ Laptop
+^^^^^^^^^^^^^^^^^^^^^^
+
+-  Dell Precision M3800 Laptop
+-  4x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
+-  12Gb RAM
+-  SSD
+
+‘Ole Workhorse’ server
+^^^^^^^^^^^^^^^^^^^^^^
+
+-  8x Intel(R) Xeon(R) CPU X5550 @ 2.67GHz
+-  16Gb RAM
+-  Magnetic drive, XXX RPM
+
+Software
+~~~~~~~~
+
+-  Arch Linux OS
+-  glibc 2.26-11
+-  python 3.5.4
+-  lmdb 0.9.21-1
+
+Benchmark script
+~~~~~~~~~~~~~~~~
+
+`Generator script <../../util/benchmark.py>`__
+
+The script was run with default values: 10,000 children under the same
+parent, PUT requests.
+
+Data Set
+~~~~~~~~
+
+Synthetic graph created by the benchmark script. The graph is unique for
+each request and consists of 200 triples which are partly random data,
+with a consistent size and variation:
+
+-  50 triples have an object that is a URI of an external resource (50
+   unique predicates; 5 unique objects).
+-  50 triples have an object that is a URI of a repository-managed
+   resource (50 unique predicates; 5 unique objects).
+-  100 triples have an object that is a 64-character random Unicode
+   string (50 unique predicates; 100 unique objects).
+
+Results
+-------
+
+.. _rather-snappy-laptop-1:
+
+‘Rather Snappy’ Laptop
+~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO/Modeshape 4.7.5
+^^^^^^^^^^^^^^^^^^^^^^
+
+15’45" running time
+
+0.094" per resource (100%—reference point)
+
+3.4M triples total in repo at the end of the process
+
+Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64"
+(100%)
+
+Peak memory usage: 2.47Gb
+
+Database size: 3.3 Gb
+
+LAKEsuperior Alpha 6, LMDB Back End
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+25’ running time
+
+0.152" per resource (161%)
+
+*Some gaps every ~40-50 requests, probably disk flush*
+
+Retrieval of parent resource (10K triples), pipe to /dev/null: 2.13"
+(58%)
+
+Peak memory usage: ~650 Mb (3 idle workers, 1 active)
+
+Database size: 523 Mb (16%)
+
+.. _ole-workhorse-server-1:
+
+‘Ole Workhorse’ server
+~~~~~~~~~~~~~~~~~~~~~~
+
+FCREPO
+^^^^^^
+
+0:47:38 running time
+
+0.285" per resource (100%)
+
+Retrieval of parent resource: 9.6" (100%)
+
+LAKEsuperior
+^^^^^^^^^^^^
+
+1:14:19 running time
+
+0.446" per resource (156%)
+
+Retrieval of parent resource: 5.58" (58%)
+
+Conclusions
+-----------
+
+LAKEsuperior appears to be markedly slower on writes and markedly faster
+on reads. Both these factors are very likely related to the underlying
+LMDB store which is optimized for read performance.
+
+Comparison of results between the laptop and the server demonstrates
+that the read and write performance gaps are nearly identical in the two
+environments. Disk speed severely affects the numbers.
+
+**Note:** As you can guess, these are only very partial and specific
+results. They should not be taken as a thorough performance assessment.
+Such an assessment may be impossible and pointless to make given the
+very different nature of the storage models, which may behave radically
+differently depending on many variables.

+ 94 - 0
docs/storage.rst

@@ -0,0 +1,94 @@
+Storage Implementation
+======================
+
+LAKEsuperior stores non-RDF (“binary”) data in the filesystem and RDF
+data in an embedded key-value store, `LMDB <https://symas.com/lmdb/>`__.
+
+RDF Storage design
+------------------
+
+LMDB is a very fast, very lightweight C library. It is inspired by
+BerkeleyDB but introduces significant improvements in terms of
+efficiency and stability.
+
+The LAKEsuperior RDF store consists of two files: the main data store
+and the indices (plus two lock files that are generated at runtime). A
+good amount of effort has been put into developing an indexing strategy
+that is balanced between write performance, read performance, and data
+size, with no compromise made on consistency.
+
+The main data store is the one containing the preservation-worthy data.
+While the indices are necessary for LAKEsuperior to function, they can
+be entirely rebuilt from the main data store in case of file corruption
+(recovery tools are on the TODO list).
+
+Detailed notes about the various strategies researched can be found
+`here <notes/indexing_strategy.rst>`__.
+
+Scalability
+-----------
+
+Since LAKEsuperior is focused on design simplicity, efficiency and
+reliability, its RDF store is embedded and not horizontally scalable.
+However, LAKEsuperior is quite frugal with disk space. About 55 million
+triples can be stored in 8Gb of space (mileage can vary depending on how
+heterogeneous the triples are). This makes it easier to use expensive
+SSD drives for the RDF store, in order to improve performance. A single
+LMDB environment can reportedly scale up to 128 terabytes.
+
+Maintenance
+-----------
+
+LMDB has a very simple configuration, and all options are hardcoded in
+LAKEsuperior in order to exploit its features. A database automatically
+recovers from a crash.
+
+The LAKEsuperior RDF store abstraction maintains a registry of unique
+terms. These terms are not deleted if a triple is deleted, even if no
+triple is using them, because it would be too expensive to look for
+orphaned terms during a delete request. While these terms are relatively
+lightweight, it would be good to run a periodical clean-up job. Tools
+will be developed in the near future to facilitate this maintenance
+task.
+
+Consistency
+-----------
+
+LAKEsuperior wraps each LDP operation in a transaction. The indices are
+updated synchronously within the same transaction in order to guarantee
+consistency. If a system loses power or crashes, only the last
+transaction is lost, and the last successful write will include primary
+and index data.
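+
+As a minimal illustration of this pattern with the Python LMDB binding
+(database names, keys and values below are made up and do not reflect the
+actual store layout), a write that touches both the data and an index
+looks like:
+
+.. code-block:: python
+
+   import lmdb
+
+   env = lmdb.open('/path/to/store', max_dbs=8, map_size=2 ** 30)
+   main_db = env.open_db(b'main')
+   index_db = env.open_db(b'index', dupsort=True)
+
+   # Both puts are committed together; if anything fails the transaction
+   # is aborted, so the index can never drift from the main data.
+   with env.begin(write=True) as txn:
+       txn.put(b'triple-key', b'serialized triple', db=main_db)
+       txn.put(b'subject-key', b'triple-key', db=index_db)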
+
+Concurrency
+-----------
+
+LMDB employs
+`MVCC <https://en.wikipedia.org/wiki/Multiversion_concurrency_control>`__
+to achieve fully ACID transactions. This implies that during a write,
+the whole database is locked. Multiple writes can be initiated
+concurrently, but the performance gain of doing so may be little because
+only one write operation can be performed at a time. Reasonable efforts
+have been put to make write transactions as short as possible (and more
+can be done). Also, this excludes a priori the option to implement
+long-running atomic operations, unless one is willing to block writes on
+the application for an indefinite length of time. On the other hand,
+write operations never block and are never blocked, so an application
+with a high read-to-write ratio may still benefit from multi-threaded
+requests.
+
+Performance
+-----------
+
+The `Performance Benchmark Report <performance.rst>`__ contains benchmark
+results.
+
+Write performance is lower than Modeshape/Fedora4; this may be mostly
+due to the fact that indices are written synchronously in a blocking
+transaction; also, the LMDB B+Tree structure is optimized for read
+performance rather than write performance. Some optimizations on the
+application layer could be made.
+
+Reads are faster than Modeshape/Fedora.
+
+All tests so far have been performed in a single thread.