
Update remaining doc pages; add RPI page; add benchmark plot images.

Stefano Cossu 5 years ago
parent
commit
58da9647cb

+ 7 - 11
docs/about.rst

@@ -1,8 +1,8 @@
 About Lakesuperior
 ==================
 
-Lakesuperior is a repository system to store binary files and their metadata as
-Linked Data. It is a `Fedora Repository <http://fedorarepository.org>`__
+Lakesuperior is a repository system that stores binary files and their metadata
+as Linked Data. It is a `Fedora Repository <http://fedorarepository.org>`__
 implementation focused on efficiency, stability and integration with Python.
 
 Fedora is a mature repository software system historically adopted by
@@ -13,8 +13,7 @@ any type of binary files and their metadata in Linked Data format.
 Guiding Principles
 ------------------
 
-Lakesuperior aims at being an uncomplicated, efficient Fedora 4
-implementation.
+Lakesuperior aims at being an efficient and flexible Fedora 4 implementation.
 
 Its main goals are:
 
@@ -28,9 +27,7 @@ Its main goals are:
 Key features
 ------------
 
--  Drop-in replacement for Fedora4 (with some
-   :doc:`caveats <fcrepo4_deltas>`); currently being tested
-   with Hyrax 2
+-  Drop-in replacement for Fedora4 (with some :doc:`caveats <fcrepo4_deltas>`)
 -  Very stable persistence layer based on
    `LMDB <https://symas.com/lmdb/>`__ and filesystem. Fully
    ACID-compliant writes guarantee consistency of data.
@@ -42,12 +39,11 @@ Key features
 -  Extensible :doc:`provenance metadata <model>` tracking
 -  :doc:`Multi-modal access <architecture>`: HTTP
    (REST), command line interface and native Python API.
--  Fits in a pocket: you can carry 50M triples in an 8Gb memory stick [#]_.
+-  Fits in a pocket: you can carry 64M triples in a 32Gb memory stick [#]_.
 
 Implementation of the official `Fedora API
-specs <https://fedora.info/spec/>`__ (Fedora 5.x and beyond) is not
-foreseen in the short term, however it would be a natural evolution of
-this project if it gains support.
+specs <https://fedora.info/spec/>`__ and OCFL are currently being
+considered as the next major development steps.
 
 Please make sure you read the :doc:`Delta document <fcrepo4_deltas>` for
 divergences with the official Fedora4 implementation.

+ 24 - 2
docs/api.rst

@@ -10,9 +10,31 @@ The Lakesuperior API modules of most interest for a client are:
 - :mod:`lakesuperior.api.query`
 - :mod:`lakesuperior.api.admin`
 
-:mod:`lakesuperior.model.ldp.ldpr` is used to manipulate resources.
+Lower-Level Interfaces
+----------------------
+
+:mod:`lakesuperior.model.ldp` handles the concepts of LDP resources,
+containers, binaries, etc.
+
+:mod:`lakesuperior.store.ldp_rs.rsrc_centric_layout` handles the "layout" of
+LDP resources as named graphs in a triplestore. It is possible (currently not
+without changes to the core libraries) to devise a different layout for e.g. a
+more sparse, or richer, data model.
+
+Similarly, :mod:`lakesuperior.store.ldp_nr.base_non_rdf_layout` offers an
+interface to handle the layout of LDP-NR resources. Currently only one
+implementation is available but it is also possible to create a new module to
+e.g. handle files in an S3 bucket, a Cassandra database, or create Bagit or
+OCFL file structures, and configure Lakesuperior to use one, or more, of those
+persistence methods.
+
+Deep Tissue
+-----------
 
-The full API docs are listed below.
+Some of the Cython libraries in :mod:`lakesuperior.model.structures`,
+:mod:`lakesuperior.model.rdf`, and :mod:`lakesuperior.store` have
+Python-accessible methods for high-performance manipulation. The
+:py:class:`lakesuperior.model.rdf.graph.Graph` class is an example of that.
 
 Full API Documentation
 ----------------------

+ 14 - 4
docs/architecture.rst

@@ -1,9 +1,18 @@
 Lakesuperior Architecture
 =========================
 
-Lakesuperior is written in Python. It is not excluded that parts of the
-code may be rewritten in `Cython <http://cython.readthedocs.io/>`__ for
-performance.
+Lakesuperior is written in Python and `Cython
+<http://cython.readthedocs.io/>`__; the latter for lower-level components that
+interface with C basic data structures for maximum efficiency.
+
+Aside from an optional dependency on a message queue server, Lakesuperior aims
+at being self-contained. All persistence is done on an embedded database. This
+allows a minimum memory and CPU footprint, and a high degree of scalability,
+from :doc:`single-board computers <rpi>` to multi-core, high-load servers.
+
+Inefficient applications "get the job done" by burning through CPU cycles,
+memory, storage and electricity, and spew out great amounts of carbon and
+digits on cloud provider bills. Lakesuperior strives to be mindful of that.
 
 Multi-Modal Access
 ------------------
@@ -18,7 +27,8 @@ Lakesuperior services and data are accessible in multiple ways:
    the same methods available to the two methods above in a programmatic
    way. It is possible to write Python plugins or even to embed
    Lakesuperior in a Python application, even without running a web
-   server.
+   server. This is also the only way to access some of the lower-level
+   application layers, which make it possible to skip much heavy-handed data
+   processing.
 
 Architecture Overview
 ---------------------

BIN
docs/assets/fedora_rpi.jpg


BIN
docs/assets/plot_fcrepo_post_100K.png


BIN
docs/assets/plot_fcrepo_put_30K.png


BIN
docs/assets/plot_lsup_post_100K.png


BIN
docs/assets/plot_lsup_pyapi_post_100K.png


+ 45 - 23
docs/cli.rst

@@ -75,27 +75,39 @@ This command is used to run performance tests in a predictable way.
 
 The command line options can be queried with the ``--help`` option::
 
-   Usage: lsup-benchmark [OPTIONS]
-
-   Options:
-   -e, --endpoint TEXT       LDP endpoint. Default: http://localhost:8000/ldp
-   -c, --count INTEGER       Number of resources to ingest. Default: {def_ct}
-   -p, --parent TEXT         Path to the container resource under which the new
-                             resources will be created. It must begin with a
-                             slash (`/`) character. Default: /pomegranate
-   -d, --delete-container    Delete container resource and its children if
-                             already existing. By default, the container is not
-                             deleted and new resources are added to it.
-   -m, --method TEXT         HTTP method to use. Case insensitive. Either PUT
-                             or POST. Default: PUT
-   -s, --graph-size INTEGER  Number of triples in each graph. Default: 200
-   -t, --resource-type TEXT  Type of resources to ingest. One of `r` (only LDP-
-                             RS, i.e. RDF), `n` (only  LDP-NR, i.e. binaries),
-                             or `b` (50/50% of both). Default: r
-   -p, --plot                Plot a graph of ingest timings. The graph figure
-                             is displayed on screen with basic manipulation and
-                             save options.
-   --help                    Show this message and exit.
+    Usage: lsup-benchmark [OPTIONS]
+
+      Run the benchmark.
+
+    Options:
+      -m, --mode TEXT           Mode of ingestion. One of `ldp`, `python`. With
+                                the former, the HTTP/LDP web server is used. With
+                                the latter, the Python API is used, in which case
+                                the server need not be running. Default:
+                                http://localhost:8000/ldp
+      -e, --endpoint TEXT       LDP endpoint. Only meaningful with `ldp` mode.
+                                Default: http://localhost:8000/ldp
+      -c, --count INTEGER       Number of resources to ingest. Default: {def_ct}
+      -p, --parent TEXT         Path to the container resource under which the new
+                                resources will be created. It must begin with a
+                                slash (`/`) character. Default: /pomegranate
+      -d, --delete-container    Delete container resource and its children if
+                                already existing. By default, the container is not
+                                deleted and new resources are added to it.
+      -X, --method TEXT         HTTP method to use. Case insensitive. Either PUT
+                                or POST. Default: PUT
+      -s, --graph-size INTEGER  Number of triples in each random graph, rounded
+                                down to a multiple of 8. Default: 200
+      -S, --image-size INTEGER  Size of random square image, in pixels for each
+                                dimension, rounded down to a multiple of 8.
+                                Default: 1024
+      -t, --resource-type TEXT  Type of resources to ingest. One of `r` (only LDP-
+                                RS, i.e. RDF), `n` (only  LDP-NR, i.e. binaries),
+                                or `b` (50/50% of both). Default: r
+      -P, --plot                Plot a graph of ingest timings. The graph figure
+                                is displayed on screen with basic manipulation and
+                                save options.
+      --help                    Show this message and exit.
 
 The benchmark tool is able to create RDF sources, or non-RDF, or an equal mix
 of them, via POST or PUT, in a given LDP endpoint. It runs single-threaded.
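
The help text above states that ``--graph-size`` and ``--image-size`` are rounded down to a multiple of 8. A minimal sketch of that arithmetic, assuming plain integer floor division; the helper name ``round_down_to_multiple`` is hypothetical and is not taken from the benchmark script:

```python
def round_down_to_multiple(value: int, base: int = 8) -> int:
    """Return the largest multiple of ``base`` that is <= ``value``."""
    if base <= 0:
        raise ValueError("base must be a positive integer")
    # Floor division drops the remainder; multiplying back gives the multiple.
    return (value // base) * base


# Hypothetical inputs: the two documented defaults and one non-multiple.
for requested in (200, 1024, 203):
    print(requested, "->", round_down_to_multiple(requested))
```

Under this assumption the documented defaults (200 and 1024) are left unchanged, while a value such as 203 would be reduced to 200.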
@@ -109,8 +121,8 @@ The non-RDF sources are randomly generated 1024x1024 pixel PNG images.
 You are warmly encouraged to run the script and share the performance results (
 *TODO add template for posting results*).
 
-``profiler``
-------------
+``lsup-profiler``
+-----------------
 
 This command launches a single-threaded HTTP server (Flask) on port 5000 that
 logs profiling information. This is useful for analyzing application
@@ -121,3 +133,13 @@ For more information, consult the `Python profilers guide
 
 Do not launch this while a WSGI server (``fcrepo``) is already running, because
 that also launches a Flask server on port 5000.
+
+Locust (experimental)
+---------------------
+
+`Locust <http://locust.io>`__ is an HTTP load tester. It can launch many
+requests on an HTTP endpoint. A rudimentary Locust file is currently available.
+
+To run Locust against Lakesuperior or FCREPO, run in the project root::
+
+    locust -f lakesuperior/util/locustfile.py --host http://localhost:8000/

+ 25 - 5
docs/contributing.rst

@@ -2,7 +2,7 @@ Contributing to Lakesuperior
 ============================
 
 Lakesuperior has been so far a single person’s off-hours project (with much
-very valuable input from several sides). In order to turn into anything close
+valuable input from several sides). In order to turn into anything close
 to a Beta release and eventually to a production-ready implementation, it
 needs some community love.
 
@@ -19,13 +19,30 @@ To set up the software for developing code, documentation, or tests::
 
     mkdir lsup # or whatever you may want to call it
     cd lsup
-    python3 -m venv .
+    python3 -m venv venv
     source venv/bin/activate
-    git clone https://github.com/scossu/lakesuperior.git app
-    cd app
+    git clone --recurse-submodules https://github.com/scossu/lakesuperior.git src
+    cd src
     pip install -e .
 
-This will allow to alter the code without having to recompile it after changes.
+This makes it possible to alter the code without re-running ``pip install``
+after changes (unless the Cython modules are changed; see below).
+
+Modifying Cython Modules
+------------------------
+
+Cython files must be recompiled into C files and then into binary files every
+time they are changed. To recompile Lakesuperior modules, run::
+
+    python setup.py build_ext --inplace
+
+For a faster compilation while testing, the environment variable ``CFLAGS``
+can be set to ``-O0`` to turn off compiler optimization. The compiled code may
+run slower, so this is not recommended for performance benchmarking.
+
+Refer to the `Cython documentation
+<http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html>`__
+for a detailed description of the Cython compilation process.
 
 Contribution Guidelines
 -----------------------
@@ -54,3 +71,6 @@ addition to the code, please fork the repo, create a new branch for your topic
 and open a pull request against development. In case you find a critical bug,
 a hotfix can be proposed against master if agreed in the related issue
 discussion.
+
+Last but not least, please read the `Code of
+Conduct <../../code_of_conduct.md>`__ carefully.

+ 2 - 0
docs/indexing_strategy.rst

@@ -335,3 +335,5 @@ second ("filter") term. This approach, instead, only looks up the relevant
 keys and composes the results. It is slower on writes and nearly doubles the
 size of the indices, but it makes reads faster and more memory-efficient.
 
+Alpha20 uses the same strategy but keys are treated as ``size_t`` integers
+rather than ``char*`` strings, thus making the code much cleaner.
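
To illustrate why fixed-width integer keys are simpler to handle than byte strings, here is a minimal Python sketch. It assumes three unsigned 64-bit keys per triple, packed without padding; the names ``pack_triple_key`` and ``unpack_triple_key`` and the exact layout are hypothetical and are not taken from the Lakesuperior source.

```python
import struct

# Three unsigned 64-bit integers, native byte order, no padding.
# This stands in for three C ``size_t`` values on a 64-bit platform
# (an assumption made for illustration only).
TRIPLE_KEY = struct.Struct("=3Q")


def pack_triple_key(s: int, p: int, o: int) -> bytes:
    """Pack subject, predicate and object keys into one fixed-size record."""
    return TRIPLE_KEY.pack(s, p, o)


def unpack_triple_key(raw: bytes) -> tuple:
    """Recover the three integer keys from a packed record."""
    return TRIPLE_KEY.unpack(raw)


key = pack_triple_key(1, 42, 7)
print(len(key), unpack_triple_key(key))
```

With integers, every record has the same known size and each component can be compared or incremented directly, with no string slicing.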

+ 74 - 69
docs/performance.rst

@@ -15,16 +15,18 @@ Hardware
 -  16Gb RAM
 -  SSD
 -  OS X 10.13
--  python 3.7.2
+-  Python 3.7.2
 -  lmdb 0.9.22
 
 Benchmark script
 ~~~~~~~~~~~~~~~~
 
-`Generator script <../../util/benchmark.py>`__
+`Source code <../../util/benchmark.py>`__
 
-The script was run with default values: resprectively 10,000 and 100,000
-children under the same parent. PUT and POST requests were tested separately.
+The script was run by generating 100,000 children under the same parent. PUT
+and POST requests were tested separately. The POST method produced pairtrees
+in Fedora to counter its known issue with many resources as direct children of
+a container.
 
 The script calculates only the timings used for the PUT or POST requests, not
 counting the time used to generate the random data.
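
The separation between data generation and request timing described above can be sketched as follows. This is an illustration only: ``make_payload`` and ``send`` are hypothetical stand-ins, not functions from ``benchmark.py``, and ``send`` performs no network I/O.

```python
import os
import time


def make_payload(size: int = 1024) -> bytes:
    """Hypothetical stand-in for random data generation (not timed)."""
    return os.urandom(size)


def send(payload: bytes) -> int:
    """Hypothetical stand-in for a PUT or POST request (timed)."""
    return len(payload)


def timed_ingest(count: int) -> list:
    """Return one timing per request, excluding payload generation."""
    timings = []
    for _ in range(count):
        payload = make_payload()        # outside the timed region
        start = time.perf_counter()
        send(payload)                   # only this call is measured
        timings.append(time.perf_counter() - start)
    return timings


timings = timed_ingest(100)
print(f"mean request time: {sum(timings) / len(timings):.6f}s")
```

Starting the clock only after the payload exists keeps the cost of random data generation out of the reported figures.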
@@ -43,6 +45,10 @@ with a consistent size and variation:
 -  100 triples have an object that is a 64-character random Unicode
    string (50 unique predicates; 100 unique objects).
 
+The benchmark script is also capable of generating random binaries and a mix of
+binary and RDF resources; a large-scale benchmark, however, was impractical at
+the moment due to storage constraints.
+
 LDP Data Retrieval
 ~~~~~~~~~~~~~~~~~~
 
@@ -53,12 +59,8 @@ REST API request::
 SPARQL Query
 ~~~~~~~~~~~~
 
-*Note:* The query may take a long time and therefore is made on the
-single-threaded server (``lsup-server``) that does not impose a timeout (of
-course, gunicorn could also be used by changing the configuration to allow a
-long timeout).
-
-Sample query::
+The following query was used against the repository after the 100K resource
+ingest::
 
    PREFIX ldp: <http://www.w3.org/ns/ldp#>
    SELECT (COUNT(?s) AS ?c) WHERE {
@@ -83,76 +85,79 @@ IPython console::
 
    In [1]: from lakesuperior import env_setup
    In [2]: from lakesuperior.api import resource as rsrc_api
-   In [3]: %timeit x = rsrc_api.get('/pomegranate').imr.as_rdflib
+   In [3]: %timeit x = rsrc_api.get('/pomegranate').imr.as_rdflib()
 
 Results
 -------
 
-10K Resources
-^^^^^^^^^^^^^
-
-===============================  =============  =============  ============  ============  ============
-System                           PUT            POST           Store         GET           SPARQL Query
-===============================  =============  =============  ============  ============  ============
-FCREPO / Modeshape 4.7.5         68ms (100%)    XXms (100%)    3.9Gb (100%)  6.2s (100%)   N/A         
-Lakesuperior 1.0a20 REST API     105ms (159%)   XXXms (XXX%)   298Mb (8%)    2.1s          XXXXXXXs    
-Lakesuperior 1.0a20 Python API   53ms (126%)    XXms (XXX%)    789Mb (21%)   381ms         N/A         
-===============================  =============  =============  ============  ============  ============
-
-**Notes:**
-
-- The Python API time for the GET request in alpha18 is 8.5% of the request.
-  This means that over 91% of the time is spent serializing the results.
-  This time could be dramatically reduced by using faster serialization
-  libraries, or can be outright zeroed out by an application that uses the
-  Python API directly and manipulates the native RDFLib objects (of course, if
-  a serialized output is eventually needed, that cost is unavoidable).
-- Similarly, the ``triples`` retrieval method of the SPARQL query only takes
-  13.6% of the request time. The rest is spent evaluating SPARQL and results.
-  An application can use ``triples`` directly for relatively simple lookups
-  without that overhead.
-
-100K Resources
-^^^^^^^^^^^^^^
-
-===============================  ===============  ===============  =============  ===============  ==============
-System                           PUT              POST             Store          GET              SPARQL Query  
-===============================  ===============  ===============  =============  ===============  ==============
-FCREPO / Modeshape 4.7.5         500+ms*          65ms (100%)\*\*  12Gb (100%)    3m41s (100%)     N/A           
-Lakesuperior 1.0a20 REST API     104ms (100%)     123ms (189%)     8.7Gb (72%)    30s (14%)        19.3s (100%)  
-Lakesuperior 1.0a20 Python API   69ms (60%)       58ms (89%)       8.7Gb (72%)    6s (2.7%)        9.17s (47%)
-===============================  ===============  ===============  =============  ===============  ==============
-
-\* POST was stopped at 30K resources after the ingest time reached >1s per
-resource. This is the manifestation of the "many members" issue which is
-visible in the graph below. The "Store" value is for the PUT operation which
-ran regularly with 100K resources.
-
-\*\* the POST test with 100K resources was conducted with fedora 4.7.5 because
-5.0 would not automatically create a pairtree, thereby resulting in the same
-performance as the PUT method.
-
-\*\*\* Timing based on a warm cache. The first query timed at 0m22.2s.
+===================  ===============  ================  =============  ====================  ==============
+Software             PUT              POST              Store Size     GET                   SPARQL Query  
+===================  ===============  ================  =============  ====================  ==============
+FCREPO 5.0.2         >500ms [#]_      65ms (100%) [#]_  12Gb (100%)    3m41s (100%)          N/A           
+Lakesuperior REST    104ms (100%)     123ms (189%)      8.7Gb (72%)    30s (14%)             19.3s (100%)  
+Lakesuperior Python  69ms (60%)       58ms (89%)        8.7Gb (72%)    6.7s (3%) [#]_ [#]_   9.17s (47%)
+===================  ===============  ================  =============  ====================  ==============
+
+.. [#] PUT was stopped at 30K resources after the ingest time reached >1s per
+    resource. This is the manifestation of the "many members" issue which is
+    visible in the chart below. The "Store Size" value is for the POST
+    operation, which completed normally with 100K resources.
+
+.. [#] The POST test with 100K resources was conducted with Fedora 4.7.5
+    because 5.0 would not automatically create a pairtree, thereby resulting
+    in the same performance as the PUT method.
+
+.. [#] Timing based on a warm cache. The first query timed at 22.2s.
+
+.. [#] The Python API time for the "GET request" (retrieval) without the
+    conversion to Python in alpha20 is 3.2 seconds, versus the 6.7s that
+    includes conversion to Python/RDFlib objects. This can be improved by
+    using more efficient libraries that allow serialization and
+    deserialization of RDF.
+
+Charts
+------
+
+.. figure:: assets/plot_fcrepo_put_30K.png
+   :alt: Fedora with PUT, 30K request time chart
+
+   Fedora/Modeshape using PUT requests under the same parent. The "many
+   members" issue is clearly visible after a threshold is reached.
+
+.. figure:: assets/plot_fcrepo_post_100K.png
+   :alt: Fedora with POST, 100K request time chart
+
+   Fedora/Modeshape using POST requests generating pairtrees. The performance
+   is greatly improved; however, the ingest time increases linearly with the
+   repository size (O(n) time complexity).
+
+.. figure:: assets/plot_lsup_post_100K.png
+   :alt: Lakesuperior with POST, 100K request time chart
+
+   Lakesuperior using POST requests, NOT generating pairtrees (equivalent to
+   a PUT request). The timing increase is closer to an O(log n) pattern.
+
+.. figure:: assets/plot_lsup_pyapi_post_100K.png
+   :alt: Lakesuperior Python API, 100K request time chart
+
+   Lakesuperior using the Python API. The pattern is much smoother, with less
+   frequent and less pronounced spikes. The O(log n) performance is more
+   clearly visible here: time increases quickly at the beginning, then slows
+   down as the repository size increases.
 
 Conclusions
 -----------
 
-Lakesuperior appears to be markedly slower on writes and markedly faster
-on reads. Both these factors are very likely related to the underlying
-LMDB store which is optimized for read performance.
+Lakesuperior appears to be slower on writes and much faster on reads than
+Fedora 4-5. Both these factors are very likely related to the underlying LMDB
+store, which is optimized for read performance. The write performance gap is
+more than closed when ingesting via the Python API.
 
 In a real-world application scenario, in which a client may perform multiple
 reads before and after storing resources, the write performance gap may
 decrease. A Python application using the Python API for querying and writing
-would experience a dramatic improvement in reading timings, and somewhat in
-write timings.
-
-Comparison of results between the laptop and the server demonstrates
-that both read and write performance ratios between repository systems are
-identical in the two environments.
+would experience a dramatic improvement in read as well as write timings.
 
 As may be obvious, these are only very partial and specific
-results. They should not be taken as a thorough performance assessment.
-Such an assessment may be impossible and pointless to make given the
-very different nature of the storage models, which may behave radically
-differently depending on many variables.
+results. They should not be taken as a thorough performance assessment, but
+rather as a starting point to which specific use-case variables may be added.

+ 34 - 0
docs/rpi.rst

@@ -0,0 +1,34 @@
+Lakesuperior on a Raspberry Pi
+==============================
+
+.. figure:: assets/fedora_rpi.jpg
+   :alt: Raspberry Pi on pinboard. Fedora inside.
+
+   Look, a Fedora!
+
+**Experiment in Progress**
+
+Lakesuperior has been successfully installed and run on a Raspberry Pi 3 board.
+The software was compiled on Alpine Linux using `musl <http://musl.libc.org>`__
+C libraries. (It also runs fine with musl on more conventional hardware, but
+performance benchmarks vis-a-vis glibc have not been performed yet.)
+
+Performance is obviously much lower than on even a consumer-grade laptop;
+however, the low cost of single-board computers may open up Lakesuperior to
+new applications that need to connect many small LDP micro-repositories.
+
+If this experiment proves worthwhile, a disk image containing the full system
+can be made available. The image would be flashed to a microSD card and
+inserted into a Raspberry Pi for a ready-to-use system.
+
+Some tweaks to the software could be made to better adapt it to small
+repositories. For example, by adding a compile-time option to force the use of
+fixed 32-bit keys on an ARM64 processor rather than the current 64-bit keys (a
+32-bit system would use 32-bit keys natively), it would be possible for
+Lakesuperior to handle half-sized indices and still be capable of holding,
+in theory, millions of triples.
+
+Cell phones next?
+
+More to come on the topic.
+
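
The key-size trade-off mentioned in the Raspberry Pi page can be estimated with a few lines of Python. This sketch assumes an index entry made of three fixed-width keys (subject, predicate, object); that layout and the function names are assumptions for illustration, not taken from the Lakesuperior source.

```python
def key_space(bits: int) -> int:
    """Number of distinct values a fixed-width unsigned key can hold."""
    return 2 ** bits


def triple_key_bytes(bits: int) -> int:
    """Bytes used by three keys of the given width (assumed entry layout)."""
    return 3 * bits // 8


for bits in (32, 64):
    print(f"{bits}-bit keys: {key_space(bits):,} distinct keys, "
          f"{triple_key_bytes(bits)} bytes per three-key entry")
```

Under these assumptions, a 32-bit key still addresses about 4.29 billion distinct terms, while each three-key entry shrinks from 24 to 12 bytes.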