
More docs.

Stefano Cossu 6 years ago
parent
commit
ecca1777ea
3 changed files with 260 additions and 5 deletions
  1. 6 3
      README.md
  2. 249 0
      doc/notes/indexing_strategy.md
  3. 5 2
      doc/notes/storage.md

+ 6 - 3
README.md

@@ -85,10 +85,13 @@ If you like fried repositories for lunch, deploy before 11AM.
 LAKEsuperior is in **alpha** status. Please see the [TODO](doc/notes/TODO) list
 for a rudimentary road map and status.
 
-## Further documentation
+## Technical documentation
 
-Miscellaneous documents are in the [doc](doc) folder. They will be organized
-and linked better some day.
+[Storage Implementation](doc/notes/storage.md)
+
+[Performance benchmarks](doc/notes/performance.md)
+
+[TODO list](doc/notes/TODO)
 
 ---
 

+ 249 - 0
doc/notes/indexing_strategy.md

@@ -0,0 +1,249 @@
+# LMDB Store design for RDFLib
+
+Spoiler: Strategy #5 is the one currently used.
+
+## Storage approach
+
+- Pickle each quad and create an MD5 or SHA1 hash of it.
+- Store quads in one database keyed by that hash; store indices separately.
+
+Different strategies involve layout and number of databases.
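A minimal sketch of the hashing step described above, assuming quads are plain tuples of serializable terms (the function name is illustrative):

```python
import hashlib
import pickle

def quad_key(quad):
    """Return a fixed-length key for a quad by hashing its pickled form."""
    return hashlib.sha1(pickle.dumps(quad)).digest()

key = quad_key(('s1', 'p1', 'o1', 'c1'))
assert len(key) == 20  # SHA-1 digests are 20 bytes
```

The same quad always produces the same key, so the hash can double as a stable lookup key across all index databases.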
+
+## Strategy #1
+
+- kq: key: serialized triple (1:1)
+- sk: Serialized subject: key (1:m)
+- pk: Serialized predicate: key (1:m)
+- ok: Serialized object: key (1:m)
+- (optional) lok: Serialized literal object: key (1:m)
+- (optional) tok: Serialized RDF type: key (1:m)
+- ck: Serialized context: key (1:m)
+
+### Retrieval approach
+
+To find all matches for a quad:
+
+- If all terms in the quad are bound, generate the key from the pickled
+quad and look up the triple in `kq`.
+- If all terms are unbound, return an iterator of all values in `kq`.
+- If some values are bound and some unbound (most common query):
+  - Get a base set of keys associated with the first bound term
+  - For each subsequent bound term, intersect its associated keys with the
+  base set
+  - Continue through all the bound terms; if the key set becomes empty at any
+  point, exit with no results
+  - For each key that matches in all the bound term databases, look up the
+  pickled quad matching the key in `kq` and yield it
+
+More optimization can be introduced later, e.g. storing literal and RDF
+type objects in separate databases. Literals can have very long values, so a
+database with a longer key setting may be useful. RDF types can be indexed
+separately because they are the most common bound object term.
+
+### Example lookup
+
+Keys and triples (these should actually be quads, but the example is simplified):
+
+A: s1 p1 o1
+B: s1 p2 o2
+C: s2 p3 o1
+D: s2 p3 o3
+
+Indices:
+
+- SK:
+  - s1: A, B
+  - s2: C, D
+- PK:
+  - p1: A
+  - p2: B
+  - p3: C, D
+- OK:
+  - o1: A, C
+  - o2: B
+  - o3: D
+
+Queries:
+
+- s1 ?p ?o → {A, B}
+- s1 p2 ?o → {A, B} & {B} = {B}
+- ?s ?p o3 → {D}
+- s1 p2 o5 → {} (Exit at OK: no term matches 'o5')
+- s2 p3 o2 → {C, D} & {C, D} & {B} = {}
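The queries above can be replayed with a small sketch, with plain Python dicts standing in for the LMDB databases (the dict stand-ins and function name are assumptions for illustration):

```python
# Key-to-triple store and per-term indices from the example above.
KQ = {'A': ('s1', 'p1', 'o1'), 'B': ('s1', 'p2', 'o2'),
      'C': ('s2', 'p3', 'o1'), 'D': ('s2', 'p3', 'o3')}
SK = {'s1': {'A', 'B'}, 's2': {'C', 'D'}}
PK = {'p1': {'A'}, 'p2': {'B'}, 'p3': {'C', 'D'}}
OK = {'o1': {'A', 'C'}, 'o2': {'B'}, 'o3': {'D'}}

def lookup(s=None, p=None, o=None):
    """Intersect key sets from each bound term; None means unbound."""
    keys = None
    for term, index in ((s, SK), (p, PK), (o, OK)):
        if term is None:
            continue
        found = index.get(term, set())
        keys = found if keys is None else keys & found
        if not keys:
            return set()          # early exit: a bound term has no match
    return set(KQ) if keys is None else keys

assert lookup(s='s1') == {'A', 'B'}
assert lookup(s='s1', p='p2') == {'B'}
assert lookup(o='o3') == {'D'}
assert lookup(s='s1', p='p2', o='o5') == set()
assert lookup(s='s2', p='p3', o='o2') == set()
```

The all-bound case is simplified here to a plain intersection; per the strategy it would instead hash the pickled quad and probe `kq` directly.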
+
+
+## Strategy #2
+
+Separate data and indices in two environments.
+
+### Main data store
+
+Key to quad; main keyspace; all unique.
+
+### Indices
+
+None of these databases is of critical preservation concern. They can be
+rebuilt from the main data store.
+
+All dupsort and dupfixed.
+
+@TODO The first three may not be needed if computing term hash is fast enough.
+
+- t2k (term to term key)
+- lt2k (literal to term key: longer keys)
+- k2t (term key to term)
+
+- s2k (subject key to quad key)
+- p2k (pred key to quad key)
+- o2k (object key to quad key)
+- c2k (context key to quad key)
+
+- sc2qk (subject + context keys to quad key)
+- po2qk (predicate + object keys to quad key)
+
+- sp2qk (subject + predicate keys to quad key)
+- oc2qk (object + context keys to quad key)
+
+- so2qk (subject + object keys to quad key)
+- pc2qk (predicate + context keys to quad key)
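The term-to-key indirection of the first three indices can be sketched as follows; dicts stand in for the `t2k` / `k2t` LMDB databases, and the helper name is illustrative:

```python
import hashlib
import pickle

T2K, K2T = {}, {}   # dict stand-ins for the t2k / k2t databases

def add_term(term):
    """Store a term once, keyed by the hash of its serialized form."""
    key = hashlib.sha1(pickle.dumps(term)).digest()
    T2K[term] = key
    K2T[key] = term
    return key

k = add_term('http://example.org/s1')
assert K2T[k] == 'http://example.org/s1'
assert add_term('http://example.org/s1') == k   # re-adding is idempotent
```

Whether this indirection pays off depends on the @TODO above: if recomputing the hash is fast enough, the `t2k` / `lt2k` lookups can be skipped.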
+
+
+## Strategy #3
+
+Contexts are much fewer than triples (even with one graph per aspect, a graph
+holds only 5-10 triples).
+
+### Main data store
+
+Preservation-worthy data
+
+- tk:t (triple key: triple; dupsort, dupfixed)
+- tk:c (context key: triple; unique)
+
+### Indices
+
+Rebuildable from main data store
+
+- s2k (subject key: triple key)
+- p2k (pred key: triple key)
+- o2k (object key: triple key)
+- sp2k
+- so2k
+- po2k
+- spo2k
+
+### Lookup
+
+1. Look up triples by s, p, o, sp, so, po and get keys
+2. If a context is specified, for each key try to seek to (context, key) in
+   `tk:c` to verify it exists
+3. Intersect sets
+4. Match triple keys with data using `tk:t`
+
+#### Shortcuts
+
+- Get all contexts: return the list of keys from `tk:c`
+- Get all triples for a context: get all values for a context from `tk:c` and
+  match triple data with `tk:t`
+- Get one triple match for all contexts: look up in the triple indices and
+  match triple data with `tk:t`
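The context-verification step (step 2 above) can be sketched with dicts standing in for the databases; all sample data and the function name are illustrative:

```python
# Stand-ins: a subject index, the context-to-triple-keys database,
# and the main triple-key-to-triple store.
S2K = {'s1': {'t1', 't2'}}
CT = {'c1': {'t1'}, 'c2': {'t2', 't3'}}
KT = {'t1': ('s1', 'p1', 'o1'), 't2': ('s1', 'p2', 'o2'),
      't3': ('s2', 'p1', 'o1')}

def triples_for_subject(s, context=None):
    """Look up triple keys by subject, optionally verifying the context."""
    keys = S2K.get(s, set())
    if context is not None:
        # Step 2: keep only keys that exist under (context, key).
        keys = keys & CT.get(context, set())
    return {KT[k] for k in keys}

assert triples_for_subject('s1') == {('s1', 'p1', 'o1'), ('s1', 'p2', 'o2')}
assert triples_for_subject('s1', 'c2') == {('s1', 'p2', 'o2')}
```

In LMDB the context check would be a cursor seek rather than a set intersection, but the filtering logic is the same.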
+
+
+## Strategy #4
+
+Terms are entered individually in main data store. Also, shorter keys are
+used rather than hashes. These two aspects save a great deal of space and I/O,
+but require an additional index to put the terms together in a triple.
+
+### Main Data Store
+
+- t:st (term key: serialized term; 1:1)
+- spo:c (joined S, P, O keys: context key; 1:m)
+- c: (context keys only, values are the empty bytestring)
+
+Storage total: variable
+
+### Indices
+
+- th:t (term hash: term key; 1:1)
+- c:spo (context key: joined triple keys; 1:m)
+- s:po (S key: P + O key; 1:m)
+- p:so (P key: S + O keys; 1:m)
+- o:sp (O key: S + P keys; 1:m)
+- sp:o (S + P keys: O key; 1:m)
+- so:p (S + O keys: P key; 1:m)
+- po:s (P + O keys: S key; 1:m)
+
+Storage total: 143 bytes per triple
+
+### Disadvantages
+
+- Lots of indices
+- Terms can get orphaned:
+  - No easy way to know if a term is used anywhere in a quad
+  - Needs some routine cleanup
+  - On the other hand, terms are relatively light-weight and can be reused
+  - Almost surely reusable are UUIDs, message digests, timestamps etc.
+
+
+## Strategy #5
+
+Reduce number of indices and rely on parsing and splitting keys to find triples
+with two bound parameters.
+
+This is especially important for keeping indexing synchronous to achieve fully
+ACID writes.
+
+### Main data store
+
+Same as Strategy #4:
+
+- t:st (term key: serialized term; 1:1)
+- spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
+- c: (context keys only, values are the empty bytestring; 1:1)
+
+Storage total: variable (same as #4)
+
+### Indices
+
+- th:t (term hash: term key; 1:1)
+- s:po (S key: joined P, O keys; dupsort, dupfixed)
+- p:so (P key: joined S, O keys; dupsort, dupfixed)
+- o:sp (O key: joined S, P keys; dupsort, dupfixed)
+- c:spo (context → triple association; dupsort, dupfixed)
+
+Storage total: 95 bytes per triple
+
+### Lookup strategy
+
+- ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term key
+- s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from term key
+- s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
+- s p ?: [s:po] All PO for S → filter result by P in split key
+    → [t:st] term from term key
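The `s p ?` case above, where a result is filtered by splitting joined keys, can be sketched as follows (the 5-byte key length, sample data, and function names are assumptions for illustration):

```python
KEY_LEN = 5  # fixed term-key length in bytes

def split_key(joined):
    """Split a joined key into fixed-length term keys."""
    return [joined[i:i + KEY_LEN] for i in range(0, len(joined), KEY_LEN)]

def lookup_sp(s_key, p_key, s_po):
    """s:po stores joined P+O keys under each S key; for an `s p ?`
    pattern, split each value and filter on the P component."""
    for po in s_po.get(s_key, ()):
        p, o = split_key(po)
        if p == p_key:
            yield o

s_po = {b'sssss': [b'pppppooooo', b'qqqqqzzzzz']}
assert list(lookup_sp(b'sssss', b'ppppp', s_po)) == [b'ooooo']
```

The extra split-and-filter pass is the price paid for dropping the two-term indices of Strategy #4.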
+
+### Advantages
+
+- Fewer indices: smaller index size and less I/O
+
+### Disadvantages
+
+- Possibly slower retrieval for queries with 2 bound terms (run metrics)
+
+### Further optimization
+
+In order to minimize traversing and splitting results, the first retrieval
+should be made on the term with the fewest average values per key. Search
+order can be balanced by establishing a lookup order for indices.
+
+This can be achieved by calling stats on the index databases and looking up
+the database with the *most* keys. Since there is an equal number of entries
+in each of the (s:po, p:so, o:sp) indices, the one with the most keys will
+have the least average number of values per key. If that lookup is done first,
+the initial data set to traverse and filter will be smaller.
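A sketch of that ordering heuristic, with hypothetical key counts standing in for the database stats (the function name and numbers are illustrative):

```python
def first_lookup_index(key_counts):
    """Choose the index with the most keys: with equal total entries, it
    has the fewest average values per key, hence the smallest initial
    result set to traverse and filter."""
    return max(key_counts, key=key_counts.get)

# Hypothetical distinct-key counts gathered from the index databases:
counts = {'s:po': 1200, 'p:so': 40, 'o:sp': 900}
assert first_lookup_index(counts) == 's:po'
```

With py-lmdb, per-database statistics are available via the transaction's `stat()` call; note that for dupsort databases the entry count includes duplicate values, so the distinct-key count would need to be derived separately.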
+
+Also, keys can be split into equally sized chunks without using a
+separator. This relies on the fixed length of the keys. It also allows the
+use of memory views that can be sliced without being copied. The performance
+gain should be estimated, since this changes quite a bit of code in the module.
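The memory-view point can be illustrated directly: slicing a `memoryview` yields zero-copy views over the original buffer (the 5-byte key length is an assumption):

```python
joined = b'sssssppppp'          # two fixed-length (5-byte) term keys, no separator

view = memoryview(joined)
s_key, p_key = view[:5], view[5:]   # slicing a memoryview copies no bytes

assert bytes(s_key) == b'sssss' and bytes(p_key) == b'ppppp'
assert s_key.obj is joined          # both slices still share the original buffer
```

Since LMDB already returns buffers into its memory map, avoiding the extra copy on every key split could be significant on large result sets.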
+
+

+ 5 - 2
doc/notes/storage.md

@@ -1,4 +1,4 @@
-# Storage implementation
+# Storage Implementation
 
 LAKEsuperior stores non-RDF ("binary") data in the filesystem and RDF data in
 an embedded key-value store, [LMDB](https://symas.com/lmdb/).
@@ -20,6 +20,9 @@ necessary for LAKEsuperior to function, they can be entirely rebuilt from the
 main data store in case of file corruption (recovery tools are on the TODO
 list).
 
+Detailed notes about the various strategies researched can be found
+[here](indexing_strategy.md).
+
 ## Scalability
 
 Since LAKEsuperior is focused on design simplicity, efficiency and reliability,
@@ -67,7 +70,7 @@ benefit from multi-threaded requests.
 
 ## Performance
 
-The [Performance Benchmark Report](performance.txt) contains benchmark results.
+The [Performance Benchmark Report](performance.md) contains benchmark results.
 
 Write performance is lower than Modeshape/Fedora4; this may be mostly due to
 the fact that indices are written synchronously in a blocking transaction;