LMDB Store design for RDFLib
============================

This is a log of subsequent strategies employed to store triples in
LMDB.

Strategy #5a is the one currently used. The rest is kept for historical
reasons and academic curiosity (and also because it was too much work to
just wipe out of memory).
Storage approach
----------------

- Pickle each quad and create an MD5 or SHA1 hash of it.
- Store quads in one database keyed by that hash; store indices
  separately.

The strategies below differ in the layout and number of databases.
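As a minimal sketch of this storage approach, using a plain tuple in place of actual RDFLib terms (the function name and tuple layout here are illustrative, not the store's real API):

```python
import hashlib
import pickle

def quad_key(quad):
    """Derive a fixed-length store key by hashing the pickled quad.

    Illustrative only: the real store serializes RDFLib term objects,
    not plain strings.
    """
    return hashlib.md5(pickle.dumps(quad)).digest()

key = quad_key(("s1", "p1", "o1", "c1"))
len(key)  # MD5 digests are 16 bytes
```

Because the key is a digest of the serialized quad, fully bound lookups can recompute it without touching any index.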
Strategy #1
-----------

- kq: key: serialized triple (1:1)
- sk: serialized subject: key (1:m)
- pk: serialized predicate: key (1:m)
- ok: serialized object: key (1:m)
- (optional) lok: serialized literal object: key (1:m)
- (optional) tok: serialized RDF type: key (1:m)
- ck: serialized context: key (1:m)
Retrieval approach
~~~~~~~~~~~~~~~~~~

To find all matches for a quad:

- If all terms in the quad are bound, generate the key from the pickled
  quad and look up the triple in ``kq``.
- If all terms are unbound, return an iterator of all values in ``kq``.
- If some terms are bound and some unbound (the most common query):

  - Get a base list of keys associated with the first bound term.
  - For each subsequent bound term, check whether each key associated
    with the term matches a key in the base list.
  - Continue through all the bound terms. If a match is not found at
    any point, continue to the next term.
  - If a match is found in all the bound term databases, look up the
    pickled quad matching the key in ``kq`` and yield it.
More optimization can be introduced later, e.g. separating literal and
RDF type objects into their own databases. Literals can have very long
values, so a database with a longer key setting may be useful. RDF type
terms can be indexed separately because they are the most common bound
term.
Example lookup
~~~~~~~~~~~~~~

Keys and triples (these should actually be quads, but this is a
simplified version)::

   A: s1 p1 o1
   B: s1 p2 o2
   C: s2 p3 o1
   D: s2 p3 o3

Indices:

- SK:

  - s1: A, B
  - s2: C, D

- PK:

  - p1: A
  - p2: B
  - p3: C, D

- OK:

  - o1: A, C
  - o2: B
  - o3: D

Queries:

- s1 ?p ?o → {A, B}
- s1 p2 ?o → {A, B} & {B} = {B}
- ?s ?p o3 → {D}
- s1 p2 o5 → {} (exit at OK: no key matches 'o5')
- s2 p3 o2 → {C, D} & {C, D} & {B} = {}
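The queries above can be reproduced with an in-memory sketch of the intersection logic (plain dicts and sets stand in for the LMDB databases):

```python
# Each index maps a bound term to the set of triple keys containing it.
SK = {"s1": {"A", "B"}, "s2": {"C", "D"}}
PK = {"p1": {"A"}, "p2": {"B"}, "p3": {"C", "D"}}
OK = {"o1": {"A", "C"}, "o2": {"B"}, "o3": {"D"}}

def lookup(s=None, p=None, o=None):
    """Intersect the key sets of all bound terms; at least one of
    s, p, o must be bound."""
    result = None
    for index, term in ((SK, s), (PK, p), (OK, o)):
        if term is None:
            continue  # unbound term: no constraint
        keys = index.get(term, set())
        result = keys if result is None else result & keys
        if not result:
            return set()  # short-circuit: no match possible
    return result

lookup("s1", "p2")  # {'B'}
```

The short-circuit mirrors the "exit at OK" case in the example: as soon as one bound term contributes an empty set, the whole query is empty.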
Strategy #2
-----------

Separate data and indices in two environments.

Main data store
~~~~~~~~~~~~~~~

Key to quad; main keyspace; all unique.

Indices
~~~~~~~

None of these databases is of critical preservation concern; they can
be rebuilt from the main data store.

All are dupsort and dupfixed.

@TODO The first three may not be needed if computing the term hash is
fast enough.

- t2k (term to term key)
- lt2k (literal to term key: longer keys)
- k2t (term key to term)
- s2k (subject key to quad key)
- p2k (pred key to quad key)
- o2k (object key to quad key)
- c2k (context key to quad key)
- sc2qk (subject + context keys to quad key)
- po2qk (predicate + object keys to quad key)
- sp2qk (subject + predicate keys to quad key)
- oc2qk (object + context keys to quad key)
- so2qk (subject + object keys to quad key)
- pc2qk (predicate + context keys to quad key)
Strategy #3
-----------

Contexts are much fewer than triples (even with one graph per aspect,
5-10 triples per graph).

.. _main-data-store-1:

Main data store
~~~~~~~~~~~~~~~

Preservation-worthy data:

- tk:t (triple key: triple; dupsort, dupfixed)
- tk:c (context key: triple; unique)
.. _indices-1:

Indices
~~~~~~~

Rebuildable from the main data store:

- s2k (subject key: triple key)
- p2k (pred key: triple key)
- o2k (object key: triple key)
- sp2k
- so2k
- po2k
- spo2k
Lookup
~~~~~~

1. Look up triples by s, p, o, sp, so, po and get keys.
2. If a context is specified, for each key try to seek to (context,
   key) in ct to verify it exists.
3. Intersect sets.
4. Match triple keys with data using kt.

Shortcuts
^^^^^^^^^

- Get all contexts: return the list of keys from ct.
- Get all triples for a context: get all values for a context from ct
  and match triple data with kt.
- Get one triple match for all contexts: look up in the triple indices
  and match triple data with kt.
Strategy #4
-----------

Terms are entered individually in the main data store. Also, shorter
keys are used rather than hashes. These two aspects save a great deal
of space and I/O, but require an additional index to put the terms
together in a triple.

.. _main-data-store-2:

Main Data Store
~~~~~~~~~~~~~~~

- t:st (term key: serialized term; 1:1)
- spo:c (joined S, P, O keys: context key; 1:m)
- c: (context keys only, values are the empty bytestring)

Storage total: variable

.. _indices-2:

Indices
~~~~~~~
- th:t (term hash: term key; 1:1)
- c:spo (context key: joined triple keys; 1:m)
- s:po (S key: joined P + O keys; 1:m)
- p:so (P key: joined S + O keys; 1:m)
- o:sp (O key: joined S + P keys; 1:m)
- sp:o (S + P keys: O key; 1:m)
- so:p (S + O keys: P key; 1:m)
- po:s (P + O keys: S key; 1:m)

Storage total: 143 bytes per triple
Disadvantages
~~~~~~~~~~~~~

- Lots of indices.
- Terms can get orphaned:

  - There is no easy way to know whether a term is used anywhere in a
    quad.
  - Some routine cleanup is needed.
  - On the other hand, terms are relatively lightweight and can be
    reused.
  - Almost surely not reusable are UUIDs, message digests, timestamps,
    etc.
Strategy #5
-----------

Reduce the number of indices and rely on parsing and splitting keys to
find triples with two bound parameters.

This is especially important for keeping indexing synchronous to
achieve fully ACID writes.

.. _main-data-store-3:

Main data store
~~~~~~~~~~~~~~~

Same as Strategy #4:

- t:st (term key: serialized term; 1:1)
- spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
- c: (context keys only, values are the empty bytestring; 1:1)

Storage total: variable (same as #4)

.. _indices-3:

Indices
~~~~~~~

- th:t (term hash: term key; 1:1)
- s:po (S key: joined P, O keys; dupsort, dupfixed)
- p:so (P key: joined S, O keys; dupsort, dupfixed)
- o:sp (O key: joined S, P keys; dupsort, dupfixed)
- c:spo (context → triple association; dupsort, dupfixed)

Storage total: 95 bytes per triple
Lookup strategy
~~~~~~~~~~~~~~~

- ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term
  key
- s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from
  term key
- s ? ?: [s:po] all PO for S → split key → [t:st] term from term key
- s p ?: [s:po] all PO for S → filter results by P in split key →
  [t:st] term from term key
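The "s p ?" case above can be sketched as follows, using a separator byte between packed term keys (Strategy #5a later replaces the separator with fixed-length slicing); the index contents, key values, and separator choice are hypothetical:

```python
SEP = b"\x00"  # assumed separator byte between packed term keys

# The s:po index emulated as: S key -> list of joined P+O keys.
s_po = {
    b"s1": [b"p1" + SEP + b"o1", b"p2" + SEP + b"o2"],
}

def lookup_sp(s_key, p_key):
    """'s p ?' case: fetch all PO entries for S, split each joined key
    on the separator, and keep those whose P part matches."""
    matches = []
    for joined in s_po.get(s_key, []):
        p_part, o_part = joined.split(SEP)
        if p_part == p_key:
            matches.append((s_key, p_part, o_part))
    return matches

lookup_sp(b"s1", b"p2")  # [(b's1', b'p2', b'o2')]
```

The final [t:st] step (resolving each term key back to its serialized term) is omitted here for brevity.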
Advantages
~~~~~~~~~~

- Fewer indices: smaller index size and less I/O.

.. _disadvantages-1:

Disadvantages
~~~~~~~~~~~~~

- Possibly slower retrieval for queries with two bound terms (run
  metrics).
Further optimization
~~~~~~~~~~~~~~~~~~~~

To minimize traversing and splitting results, the first retrieval
should be made on the term whose index has the fewest values per key
on average. The search order can be balanced by establishing a lookup
order for the indices.

This can be achieved by calling stats on the index databases and
looking up the database with the *most* keys. Since there is an equal
number of entries in each of the (s:po, p:so, o:sp) indices, the one
with the most keys will have the lowest average number of values per
key. If that lookup is done first, the initial data set to traverse
and filter will be smaller.
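The ordering heuristic can be sketched with hypothetical per-index statistics (in the real store these counts would come from LMDB stat calls on the index databases):

```python
# Hypothetical stats: each index holds one entry per triple, but a
# different number of distinct keys.
index_stats = {
    "s:po": {"entries": 9000, "keys": 300},
    "p:so": {"entries": 9000, "keys": 20},
    "o:sp": {"entries": 9000, "keys": 4500},
}

# The index with the most distinct keys has the fewest values per key
# on average, so it should be queried first.
lookup_order = sorted(
    index_stats, key=lambda name: index_stats[name]["keys"], reverse=True)

lookup_order  # ['o:sp', 's:po', 'p:so']
```

With these numbers, starting from o:sp yields an average of two candidates per key instead of the 450 that p:so would produce.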
Strategy #5a
------------

This is a slightly different implementation of #5 that somewhat
simplifies, and perhaps speeds up, things a bit. It is the currently
employed solution.

The indexing and lookup strategy is the same, but instead of using a
separator byte for splitting compound keys, the logic relies on the
fact that keys have a fixed length, and slices them instead. This
*should* result in faster key manipulation, also because in most cases
``memoryview`` buffers can be used directly instead of being copied
from memory.

Index storage is 90 bytes per triple.
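A minimal sketch of the fixed-length slicing, assuming a hypothetical 5-byte term key (the actual key length is an implementation detail of the store):

```python
KLEN = 5  # assumed fixed length of a single term key, in bytes

def split_triple_key(spo_key):
    """Slice a joined S+P+O key into three term keys.

    With fixed-length keys no separator byte is needed, and the
    memoryview slices reference the original buffer without copying.
    """
    view = memoryview(spo_key)
    return tuple(view[i:i + KLEN] for i in range(0, 3 * KLEN, KLEN))

s, p, o = split_triple_key(b"AAAAABBBBBCCCCC")
```

This is why fixed-length keys pair well with LMDB's zero-copy reads: a value returned by a cursor can be sliced in place rather than parsed.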
|