Update documentation: README, benchmarks and storage.

Stefano Cossu 7 years ago
parent
commit
610c21b057
5 changed files with 191 additions and 124 deletions
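  1. doc/notes/fcrepo4_deltas.md (46 additions, 31 deletions)
  2. doc/notes/performance.md (65 additions, 0 deletions)
  3. doc/notes/performance.txt (0 additions, 92 deletions)
  4. doc/notes/storage.md (79 additions, 0 deletions)
  5. tests/10K_children.py (1 addition, 1 deletion)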

+ 46 - 31
doc/notes/fcrepo4_deltas.md

@@ -10,11 +10,10 @@ it.
 See [TODO](TODO)
 
 - Various header handling
-- Versioning
+- Versioning (incomplete)
 - AuthN/Z
 - Fixity check
 - Blank nodes
-- Hash URIs
 
 
 ## Potentially breaking changes
@@ -28,7 +27,7 @@ and a number of operations (i.e. multiple R/W requests to the repository) can
 be performed. The operations are persisted in the repository only if and when
 the transaction is committed.
 
-LAKesuperior only supports atomicity for a single LDP request. I.e. a single
+LAKEsuperior only supports atomicity for a single HTTP request. I.e. a single
 HTTP request that should result in multiple write operations to the storage
 layer is only persisted if no exception is thrown. Otherwise, the operation is
 rolled back in order to prevent resources from being left in an inconsistent state.
@@ -46,18 +45,12 @@ otherwise.
 
 FCREPO4 includes a web UI for simple CRUD operations.
 
-Such a UI is not in the immediate LAKEsuperior development plans. However, a very
-basic UI may at some point be built for read-only interaction (i.e.
-requesting resources and clicking through relationship links). A more complete
-UI should be built for simple and/or SPARQL queries.
+Such a UI is not in the immediate LAKEsuperior development plans. However, a
+basic UI is available for read-only interaction: LDP resource browsing, SPARQL
+query and other search facilities, and administrative tools. Some of the latter
+*may* involve write operations, such as clean-up tasks.
 
-
-## Non-standard client breaking changes
-
-The following changes may be incompatible with clients relying on some FCREPO4
-behavior not endorsed by LDP or other specifications.
-
-### Automatic pairtree generation
+### Automatic path segment generation
 
 A `POST` request without a slug in FCREPO4 results in a pairtree consisting of
 several intermediate nodes leading to the automatically minted identifier. E.g.
@@ -71,21 +64,41 @@ The same request in LAKEsuperior would create
 `/rest/8c9a074e-dda3-5256-ea30-eec2dd4fcf61` (obviously the identifiers will be
 different).
 
-### Explicit intermediate paths
+This appeared to break Hyrax at some point, but may have since been fixed. This
+needs to be verified further.
 
-In FCREPO4, a PUT request to `/rest/a/b/c`, given `/rest/a` and `rest/a/b` not
-previously existing, results in the creation of Pairtree resources that are
-retrievable. In LAKEsuperior the same operation results only in the creation of
-containment triple in the graph store, which are not exposed in the LDP API.
-Therefore, a GET to `rest/a` in FCREPO4 will result in a 200, a GET to `rest/a`
-in LAKEsuperior in a 404.
 
-In both above cases, PUTting into `rest/a` yields a 409, POSTing to it results
-in a 201.
+## Non-standard client breaking changes
+
+The following changes may be incompatible with clients relying on some FCREPO4
+behavior not endorsed by LDP or other specifications.
+
+### Pairtrees
+
+FCREPO4 generates "pairtree" resources if a resource is created in a path whose
+segments are missing. E.g. when creating `/a/b/c/d`, if `/a/b` and `/a/b/c` do
+not exist, FCREPO4 will create two Pairtree resources. POSTing and PUTting into
+Pairtrees is not allowed. Also, a containment triple is established between the
+closest LDPC and the created resource, e.g. if `a` exists, a `</a> ldp:contains
+</a/b/c/d>` triple is created.
+
+LAKEsuperior does not employ Pairtrees. In the example above LAKEsuperior would
+create a fully qualified LDPC for each missing segment, which can be POSTed and
+PUT to. Containment triples are created between each link in the path, i.e.
+`</a> ldp:contains </a/b>`, `</a/b> ldp:contains </a/b/c>` etc. This may
+potentially break clients relying on the direct containment model.
+
+The rationale behind this change is that Pairtrees are the byproduct of a
+limitation imposed by Modeshape and introduce complexity in the software stack
+and confusion for the client. LAKEsuperior aligns with the more intuitive UNIX
+filesystem model, where each segment of a path is a "folder" or container
+(except for the leaf nodes that can be either folders or files). In any
+case, clients are discouraged from generating deep paths in LAKEsuperior
+without a specific purpose because these resources create unnecessary data.
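
Note on the Pairtrees section above: the two containment models can be compared with a minimal sketch. This is illustrative only and is not code from either project; the function names are hypothetical, and falling back to the root `/` when no ancestor exists is an assumption.

```python
def path_segments(path):
    """Return the cumulative paths of a resource path.

    '/a/b/c' -> ['/a', '/a/b', '/a/b/c']
    """
    parts = [p for p in path.split('/') if p]
    return ['/' + '/'.join(parts[:i]) for i in range(1, len(parts) + 1)]


def fcrepo4_containment(path, existing):
    """Pairtree model: a single triple from the closest existing ancestor
    to the newly created resource.

    Falling back to '/' when no ancestor exists is an assumption.
    """
    ancestors = path_segments(path)[:-1]
    closest = next((a for a in reversed(ancestors) if a in existing), '/')
    return [(closest, 'ldp:contains', path)]


def lakesuperior_containment(path, existing):
    """Full-container model: one triple per link in the path, for every
    segment that does not exist yet."""
    chain = ['/'] + path_segments(path)
    return [
        (parent, 'ldp:contains', child)
        for parent, child in zip(chain, chain[1:])
        if child not in existing
    ]


if __name__ == '__main__':
    existing = {'/a'}
    print(fcrepo4_containment('/a/b/c/d', existing))
    print(lakesuperior_containment('/a/b/c/d', existing))
```

With `/a` already present, the first function yields one triple linking `/a` directly to `/a/b/c/d`, while the second yields one triple per missing link, which is the difference a client relying on the direct containment model would notice.
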
 
 ### Non-mandatory, non-authoritative slug in version POST
 
-FCREPO requires a `Slug` header to POST to `fcr:versions` to create a new
+FCREPO4 requires a `Slug` header to POST to `fcr:versions` to create a new
 version.
 
 LAKEsuperior adheres to the more general FCREPO POST rule and if no slug is
@@ -150,6 +163,12 @@ the payload are ignored.
 Clients using the `Prefer` header to control PUT behavior as advertised by the
 specs should not notice any difference.
 
+
+## Optional improvements
+
+The following are improvements in performance or usability that can only be
+taken advantage of if client code is adjusted.
+
 ### LDP-NR metadata by content negotiation
 
 FCREPO4 relies on the `/fcr:metadata` identifier to retrieve RDF metadata about
@@ -158,12 +177,6 @@ use of content negotiation to do the same. Any request to an LDP-NR with an
 `Accept` header set to one of the supported RDF serialization formats will
 yield the RDF metadata of the resource instead of the binary contents.
 
-
-## Optional improvements
-
-The following are improvements in performance or usability that can only taken
-advantage of if client code is adjusted.
-
 ### "Include" and "Omit" options for children
 
 LAKEsuperior offers an additional `Prefer` header option to exclude all
@@ -172,10 +185,12 @@ while leaving the other server-managed triples when retrieving a resource:
 
     Prefer: return=representation; [include | omit]="http://fedora.info/definitions/v4/repository#Children"
 
-The default behavior is including all children URIs.
+The default behavior is to include all children URIs.
 
 ### Soft-delete and purge
 
+**NOTE**: The implementation of this section is incomplete and debated.
+
 In FCREPO4 a deleted resource leaves a tombstone deleting all traces of the
 previous resource.
 

+ 65 - 0
doc/notes/performance.md

@@ -0,0 +1,65 @@
+# Performance Benchmark Report
+
+## Environment
+
+### Hardware
+
+- Dell Precision M3800 Laptop
+- 8x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
+- 12 GB RAM
+- SSD
+
+### Software
+
+- Arch Linux OS
+- glibc 2.26-11
+- python 3.5.4
+- lmdb 0.9.21-1
+- db (BerkeleyDB) 5.3.28-3
+
+### Benchmark script
+
+[Generator script](../../tests/10K_children.py)
+
+The script was run with default values: 10,000 children under the same parent,
+PUT requests.
+
+### Data Set
+
+Synthetic graph created by the benchmark script. The graph is unique for each
+request and consists of 200 triples which are partly random data, with a
+consistent size and variation:
+
+- 50 triples have an object that is a URI of an external resource (50 unique
+  predicates; 5 unique objects).
+- 50 triples have an object that is a URI of a repository-managed resource
+  (50 unique predicates; 5 unique objects).
+- 100 triples have an object that is a 64-character random Unicode string
+  (50 unique predicates; 100 unique objects).
+
+## Results
+
+### FCREPO/Modeshape 4.7.5
+
+15'45" running time
+0.094" per resource
+3.4M triples total in repo at the end of the process
+
+Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64"
+
+Peak memory usage: 2.47Gb
+Database size: 3.3 Gb
+
+
+### LAKEsuperior Alpha 6, LMDB Back End
+
+25' running time
+0.152" per resource
+
+Some gaps every ~40-50 requests, probably disk flush
+
+Retrieval of parent resource (10K triples), pipe to /dev/null: 2.13"
+
+Peak memory usage: ~650 Mb (3 idle workers, 1 active)
+Database size: 523 Mb
+
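
Note on the figures in performance.md: the per-resource timings and the relative differences can be checked with a few lines of arithmetic. This is only a sketch over the numbers reported above; variable names are illustrative. The reported 0.152" per resource for LAKEsuperior is slightly higher than 25' / 10,000, which suggests the 25' total is rounded.

```python
N_RESOURCES = 10000  # default number of children in the benchmark script


def to_seconds(minutes, seconds=0):
    """Convert a minutes'seconds" reading into seconds."""
    return minutes * 60 + seconds


def per_resource(total_seconds, n=N_RESOURCES):
    """Average wall-clock time per PUT request."""
    return total_seconds / n


# Totals as reported in the benchmark notes.
fcrepo_total = to_seconds(15, 45)   # 15'45"
lsup_total = to_seconds(25)         # 25' (apparently rounded)

fcrepo_avg = per_resource(fcrepo_total)
lsup_avg = per_resource(lsup_total)

# Ratios based on the reported per-resource and retrieval timings.
write_slowdown = 0.152 / 0.094      # LAKEsuperior writes vs FCREPO
read_speedup = 3.64 / 2.13          # FCREPO parent retrieval vs LAKEsuperior

if __name__ == '__main__':
    print('FCREPO avg per resource: {:.4f}s'.format(fcrepo_avg))
    print('LAKEsuperior avg per resource: {:.4f}s'.format(lsup_avg))
    print('Write slowdown: {:.2f}x'.format(write_slowdown))
    print('Read speedup: {:.2f}x'.format(read_speedup))
```
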

+ 0 - 92
doc/notes/performance.txt

@@ -1,92 +0,0 @@
-# Performance Benchmark Notes
-
-## Environment
-
-### Hardware
-
-- Dell Precison M3800 Laptop
-- 8x Intel(R) Core(TM) i7-4712HQ CPU @ 2.30GHz
-- 12Gb RAM
-- SSD
-
-### Software
-
-- Arch Linux OS
-- glibc 2.26-11
-- python 3.5.4
-- lmdb 0.9.21-1
-- db (BerkeleyDB) 5.3.28-3
-
-### Sample Data Set
-
-Modified Duchamp VIAF dataset (343 triples; changed all subjects to `<>`)
-
-## Vanilla FCREPO 4.7
-
-10K PUTs to new resources under the same container:
-
-9'27" running time
-0.057" per resource
-3.4M triples total in repo at the end of the process
-
-Retrieval of parent resource (~10000 triples), pipe to /dev/null: 3.64"
-
-Database size: 3.3 Gb
-
-
-## Sleepycat Back End Test
-
-10K PUTs to new resources under the same container:
-
-~18' running time
-0.108" per resource
-3.4M triples total in repo at the end of the process
-
-Retrieval of parent resource, pipe to /dev/null: 3.6"
-
-Database size: 1.2 Gb
-
-
-## LMDB Back End Test
-
-### Strategy #4
-
-10K PUTs to new resources under the same container:
-
-~29' running time
-0.178" per resource
-3.4M triples total in repo at the end of the process
-
-Some gaps every ~40-50 requests, probably blocking transactions or disk
-flush
-
-Database size: 633 Mb
-
-Retrieval of parent resource, pipe to /dev/null: 3.48"
-
-
-### Strategy #5
-
-10K PUTs to new resources under the same container:
-
-29' running time
-0.176" per resource
-3.4M triples total in repo at the end of the process
-
-Less gaps than strategy #4, however overall timing is almost identical. The
-blocker seems to be somewhere else.
-
-Database size: 422 Mb
-
-Retrieval of parent resource, pipe to /dev/null: 7.5"
-
-
-### After using triple methods rather than SPARQL for extract_imr
-
-25' running time
-0.155" per resource
-
-Database size: 523 Mb
-
-Retrieval of parent resource, pipe to /dev/null: 1.9"
-

+ 79 - 0
doc/notes/storage.md

@@ -0,0 +1,79 @@
+# Storage implementation
+
+LAKEsuperior stores non-RDF ("binary") data in the filesystem and RDF data in
+an embedded key-value store, [LMDB](https://symas.com/lmdb/).
+
+## RDF Storage design
+
+LMDB is a very fast, very lightweight C library. It is inspired by BerkeleyDB
+but introduces significant improvements in terms of efficiency and stability.
+
+The LAKEsuperior RDF store consists of two files: the main data store and the
+indices (plus two lock files that are generated at runtime). A good amount of
+effort has been put into developing an indexing strategy that is balanced between
+write performance, read performance, and data size, with no compromise made on
+consistency.
+
+The main data
+store is the one containing the preservation-worthy data. While the indices are
+necessary for LAKEsuperior to function, they can be entirely rebuilt from the
+main data store in case of file corruption (recovery tools are on the TODO
+list).
+
+## Scalability
+
+Since LAKEsuperior is focused on design simplicity, efficiency and reliability,
+its RDF store is embedded and not horizontally scalable. However, LAKEsuperior
+is quite frugal with disk space. About 55 million triples can be
+stored in 8Gb of space (mileage can vary depending on how heterogeneous the
+triples are). This makes it easier to use expensive SSD drives for
+the RDF store, in order to improve performance. A single LMDB environment can
+reportedly scale up to 128 terabytes.
+
+## Maintenance
+
+LMDB has a very simple configuration, and all options are hardcoded
+in LAKEsuperior in order to exploit its features. A database automatically
+recovers from a crash.
+
+The LAKEsuperior RDF store abstraction maintains a registry of unique terms.
+These terms are not deleted if a triple is deleted, even if no triple is using
+them, because it would be too expensive to look up orphaned terms during a
+delete request. While these terms are relatively lightweight, it would be good
+to run a periodic clean-up job. Tools will be developed in the near future to
+facilitate this maintenance task.
+
+## Consistency
+
+LAKEsuperior wraps each LDP operation in a transaction. The indices are updated
+synchronously within the same transaction in order to guarantee
+consistency. If a system loses power or crashes, only the last transaction is
+lost, and the last successful write will include primary and index data.
+
+## Concurrency
+
+LMDB employs
+[MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
+to achieve fully ACID transactions. Only one write transaction can be open at a
+time, so during a write the whole database is locked against other writers.
+Multiple writes can be initiated concurrently, but the performance gain of
+doing so may be small because they are executed one at a time. Reasonable
+effort has been made to keep write transactions as short as possible (and more
+can be done). This also rules out long-running atomic operations, unless one
+is willing to block writes on the application for an indefinite length of
+time. On the other hand, read operations never block and are never blocked, so
+an application with a high read-to-write ratio may still benefit from
+multi-threaded requests.
+
+## Performance
+
+The [Performance Benchmark Report](performance.md) contains benchmark results.
+
+Write performance is lower than Modeshape/Fedora4; this may be mostly due to
+the fact that indices are written synchronously in a blocking transaction;
+also, the LMDB B+Tree structure is optimized for read performance rather than
+write performance. Some optimizations on the application layer could be made.
+
+Reads are faster than Modeshape/Fedora.
+
+All tests so far have been performed in a single thread.
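
Note on storage.md: two of its claims, that indices can be rebuilt entirely from the main data store and that unused terms linger in the registry until a clean-up job runs, can be illustrated with a toy in-memory model. This is a sketch under stated assumptions: the class, its methods and its key layout are hypothetical and do not reflect LAKEsuperior's actual LMDB schema or API.

```python
class ToyTripleStore:
    """In-memory stand-in for a main data store plus a derived index."""

    def __init__(self):
        self.terms = {}       # registry: term -> integer key
        self.triples = set()  # main data: (s_key, p_key, o_key)
        self.index = {}       # derived: term key -> triples using that term
        self._next_key = 0

    def _key(self, term):
        """Return the key for a term, registering the term if it is new."""
        if term not in self.terms:
            self.terms[term] = self._next_key
            self._next_key += 1
        return self.terms[term]

    def add(self, s, p, o):
        """Store a triple and update the index in the same step."""
        triple = tuple(self._key(t) for t in (s, p, o))
        self.triples.add(triple)
        for key in triple:
            self.index.setdefault(key, set()).add(triple)

    def remove(self, s, p, o):
        """Delete a triple. Terms are left in the registry on purpose."""
        triple = tuple(self.terms[t] for t in (s, p, o))
        self.triples.discard(triple)
        for key in triple:
            self.index.get(key, set()).discard(triple)

    def rebuild_index(self):
        """Recreate the index from the main data alone."""
        self.index = {}
        for triple in self.triples:
            for key in triple:
                self.index.setdefault(key, set()).add(triple)

    def purge_orphan_terms(self):
        """Periodic clean-up: drop terms that no triple uses any more."""
        used = {key for triple in self.triples for key in triple}
        orphans = [t for t, key in self.terms.items() if key not in used]
        for term in orphans:
            self.index.pop(self.terms[term], None)
            del self.terms[term]
        return len(orphans)


if __name__ == '__main__':
    store = ToyTripleStore()
    store.add('a', 'p', 'b')
    store.add('a', 'p', 'c')
    store.remove('a', 'p', 'c')
    print(len(store.terms))            # 'c' is still registered
    print(store.purge_orphan_terms())  # number of terms removed
```

Scanning every triple to find orphaned terms is what makes this check too costly to run on each delete, which is why the document defers it to a separate maintenance job.
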

+ 1 - 1
tests/10K_children.py

@@ -14,7 +14,7 @@ from util.generators import random_utf8_string
 
 default_n = 10000
 webroot = 'http://localhost:8000/ldp'
-#webroot = 'http://localhost:8080/fcrepo/rest'
+#webroot = 'http://localhost:8080/rest'
 container_uri = webroot + '/pomegranate'
 
 sys.stdout.write('How many children? [{}] >'.format(default_n))