
LMDB Store design for RDFLib
============================

This is a log of the successive strategies employed to store triples in
LMDB.

Strategy #5a is the one currently used. The rest is kept for historic
reasons and academic curiosity (and also because it was too much work to
just wipe it out of memory).

Storage approach
----------------

- Pickle quad and create MD5 or SHA1 hash.
- Store triples in one database paired with key; store indices
  separately.

Different strategies involve layout and number of databases.
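
As a rough illustration of the storage approach above, the following
sketch derives a fixed-length key from a pickled quad using only the
standard library (the helper name is made up for this example):

.. code-block:: python

   import hashlib
   import pickle

   def quad_key(quad):
       """Serialize a quad and derive a fixed-length key from its digest.

       ``quad`` is any picklable (s, p, o, c) tuple; SHA1 is shown here,
       but MD5 works the same way.
       """
       data = pickle.dumps(quad)
       return hashlib.sha1(data).digest(), data

   # The key goes into the main key:quad database; the serialized terms
   # are indexed separately under the same key.
   key, data = quad_key(('s1', 'p1', 'o1', 'c1'))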
Strategy #1
-----------

- kq: key: serialized triple (1:1)
- sk: Serialized subject: key (1:m)
- pk: Serialized predicate: key (1:m)
- ok: Serialized object: key (1:m)
- (optional) lok: Serialized literal object: key (1:m)
- (optional) tok: Serialized RDF type: key (1:m)
- ck: Serialized context: key (1:m)

Retrieval approach
~~~~~~~~~~~~~~~~~~

To find all matches for a quad:

- If all terms in the quad are bound, generate the key from the pickled
  quad and look up the triple in ``kq``
- If all terms are unbound, return an iterator of all values in ``kq``
- If some terms are bound and some unbound (the most common query);
  see the sketch after this list:

  - Get a base list of keys associated with the first bound term
  - For each subsequent bound term, check if each key associated with
    the term matches a key in the base list
  - Continue through all the bound terms. If a match is not found at
    any point, continue to the next term
  - If a match is found in all the bound term databases, look up the
    pickled quad matching the key in ``kq`` and yield it
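
A minimal sketch of this intersection step, using plain Python dicts and
sets to stand in for the LMDB databases (all names are illustrative):

.. code-block:: python

   def lookup(bound, indices, kq):
       """Intersect the key sets of the bound terms of a pattern.

       ``bound`` maps index names ('sk', 'pk', 'ok', 'ck') to bound term
       values; ``indices`` maps index names to {term: set_of_keys} dicts
       standing in for the index databases; ``kq`` maps keys to quads.
       """
       keys = None
       for idx_name, term in bound.items():
           candidates = indices[idx_name].get(term, set())
           keys = candidates if keys is None else keys & candidates
           if not keys:
               return  # Exit early: no key matches this term.
       for key in keys or ():
           yield kq[key]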
More optimization can be introduced later, e.g. separating literal and
RDF type objects into separate databases. Literals can have very long
values, so a database with a longer key setting may be useful. RDF type
objects can be indexed separately because ``rdf:type`` is among the most
commonly bound terms.
Example lookup
~~~~~~~~~~~~~~

Keys and Triples (should actually be quads but this is a simplified
version):

- A: s1 p1 o1
- B: s1 p2 o2
- C: s2 p3 o1
- D: s2 p3 o3

Indices:

- SK:

  - s1: A, B
  - s2: C, D

- PK:

  - p1: A
  - p2: B
  - p3: C, D

- OK:

  - o1: A, C
  - o2: B
  - o3: D

Queries:

- s1 ?p ?o → {A, B}
- s1 p2 ?o → {A, B} & {B} = {B}
- ?s ?p o3 → {D}
- s1 p2 o5 → {} (Exit at OK: no term matches ‘o5’)
- s2 p3 o2 → {C, D} & {C, D} & {B} = {}
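
The same queries, written out with plain Python sets to show the
intersections:

.. code-block:: python

   SK = {'s1': {'A', 'B'}, 's2': {'C', 'D'}}
   PK = {'p1': {'A'}, 'p2': {'B'}, 'p3': {'C', 'D'}}
   OK = {'o1': {'A', 'C'}, 'o2': {'B'}, 'o3': {'D'}}

   assert SK['s1'] == {'A', 'B'}                   # s1 ?p ?o
   assert SK['s1'] & PK['p2'] == {'B'}             # s1 p2 ?o
   assert OK['o3'] == {'D'}                        # ?s ?p o3
   assert OK.get('o5', set()) == set()             # s1 p2 o5 (exit at OK)
   assert SK['s2'] & PK['p3'] & OK['o2'] == set()  # s2 p3 o2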
Strategy #2
-----------

Separate data and indices in two environments.

Main data store
~~~~~~~~~~~~~~~

Key to quad; main keyspace; all unique.

Indices
~~~~~~~

None of these databases is of critical preservation concern. They can be
rebuilt from the main data store.

All dupsort and dupfixed.

@TODO The first three may not be needed if computing term hash is fast
enough.

- t2k (term to term key)
- lt2k (literal to term key: longer keys)
- k2t (term key to term)
- s2k (subject key to quad key)
- p2k (pred key to quad key)
- o2k (object key to quad key)
- c2k (context key to quad key)
- sc2qk (subject + context keys to quad key)
- po2qk (predicate + object keys to quad key)
- sp2qk (subject + predicate keys to quad key)
- oc2qk (object + context keys to quad key)
- so2qk (subject + object keys to quad key)
- pc2qk (predicate + context keys to quad key)
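
A sketch of how such an index environment might be opened with the
py-lmdb binding; the database names follow the list above, while the map
size and the choice to open the 1:1 term mappings as plain databases are
illustrative assumptions:

.. code-block:: python

   import lmdb

   # Index environment; the main data store lives in a separate environment.
   idx_env = lmdb.open('indices', max_dbs=16, map_size=2 ** 30)

   # Quad-key indices: 1:m (dupsort) with fixed-size values (dupfixed).
   idx_dbs = {
       name: idx_env.open_db(name.encode(), dupsort=True, dupfixed=True)
       for name in (
           's2k', 'p2k', 'o2k', 'c2k',
           'sc2qk', 'po2qk', 'sp2qk', 'oc2qk', 'so2qk', 'pc2qk',
       )
   }

   # Term mappings (t2k, lt2k, k2t) are 1:1 and opened as plain databases
   # in this sketch.
   for name in ('t2k', 'lt2k', 'k2t'):
       idx_dbs[name] = idx_env.open_db(name.encode())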
Strategy #3
-----------

Contexts are much fewer than triples (even with one graph per aspect,
5-10 triples per graph).
.. _main-data-store-1:

Main data store
~~~~~~~~~~~~~~~

Preservation-worthy data:

- tk:t (triple key: triple; dupsort, dupfixed)
- tk:c (context key: triple; unique)

.. _indices-1:

Indices
~~~~~~~

Rebuildable from the main data store:

- s2k (subject key: triple key)
- p2k (pred key: triple key)
- o2k (object key: triple key)
- sp2k
- so2k
- po2k
- spo2k
Lookup
~~~~~~

1. Look up triples by s, p, o, sp, so, po and get keys
2. If a context is specified, for each key try to seek to (context, key)
   in ct to verify it exists
3. Intersect sets
4. Match triple keys with data using kt
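
A sketch of step 2 with the py-lmdb binding, assuming a ``ct`` database
keyed by context with triple keys as duplicate (dupsort) values; the
function name is made up for this example:

.. code-block:: python

   def filter_by_context(txn, ct_db, c_key, triple_keys):
       """Keep only the triple keys recorded under the given context.

       ``set_key_dup()`` positions the cursor on an exact (key, value)
       pair and returns False if that pair is absent.
       """
       cur = txn.cursor(db=ct_db)
       return {tk for tk in triple_keys if cur.set_key_dup(c_key, tk)}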
Shortcuts
^^^^^^^^^

- Get all contexts: return list of keys from ct
- Get all triples for a context: get all values for a context from ct
  and match triple data with kt
- Get one triple match for all contexts: look up in triple indices and
  match triple data with kt
Strategy #4
-----------

Terms are entered individually in the main data store. Also, shorter
keys are used rather than hashes. These two aspects save a great deal of
space and I/O, but require an additional index to put the terms together
in a triple.

.. _main-data-store-2:

Main Data Store
~~~~~~~~~~~~~~~

- t:st (term key: serialized term; 1:1)
- spo:c (joined S, P, O keys: context key; 1:m)
- c: (context keys only, values are the empty bytestring)

Storage total: variable
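
A sketch of this key layout with plain Python structures, assuming
5-byte sequential term keys (the actual key length and hash function are
implementation details):

.. code-block:: python

   import hashlib

   KEY_LEN = 5      # Illustrative fixed term-key length.
   terms = {}       # t:st — term key: serialized term
   term_keys = {}   # th:t — term hash: term key (see indices below)
   spo_c = {}       # spo:c — joined S, P, O keys: context key(s)

   def term_key(term):
       """Return the existing key for a term, or assign the next one."""
       th = hashlib.sha1(term.encode()).digest()
       if th not in term_keys:
           tk = len(terms).to_bytes(KEY_LEN, 'big')
           term_keys[th] = tk
           terms[tk] = term
       return term_keys[th]

   s, p, o, c = (term_key(t) for t in ('s1', 'p1', 'o1', 'c1'))
   spo_c.setdefault(s + p + o, set()).add(c)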
.. _indices-2:

Indices
~~~~~~~

- th:t (term hash: term key; 1:1)
- c:spo (context key: joined triple keys; 1:m)
- s:po (S key: P + O keys; 1:m)
- p:so (P key: S + O keys; 1:m)
- o:sp (O key: S + P keys; 1:m)
- sp:o (S + P keys: O key; 1:m)
- so:p (S + O keys: P key; 1:m)
- po:s (P + O keys: S key; 1:m)

Storage total: 143 bytes per triple
Disadvantages
~~~~~~~~~~~~~

- Lots of indices
- Terms can get orphaned:

  - No easy way to know if a term is used anywhere in a quad
  - Needs some routine cleanup
  - On the other hand, terms are relatively light-weight and can be
    reused
  - Almost surely not reusable are UUIDs, message digests, timestamps
    etc.
Strategy #5
-----------

Reduce the number of indices and rely on parsing and splitting keys to
find triples with two bound parameters.

This is especially important for keeping indexing synchronous to achieve
fully ACID writes.

.. _main-data-store-3:

Main data store
~~~~~~~~~~~~~~~

Same as Strategy #4:

- t:st (term key: serialized term; 1:1)
- spo:c (joined S, P, O keys: context key; dupsort, dupfixed)
- c: (context keys only, values are the empty bytestring; 1:1)

Storage total: variable (same as #4)
.. _indices-3:

Indices
~~~~~~~

- th:t (term hash: term key; 1:1)
- s:po (S key: joined P, O keys; dupsort, dupfixed)
- p:so (P key: joined S, O keys; dupsort, dupfixed)
- o:sp (O key: joined S, P keys; dupsort, dupfixed)
- c:spo (context → triple association; dupsort, dupfixed)

Storage total: 95 bytes per triple
Lookup strategy
~~~~~~~~~~~~~~~

- ? ? ? c: [c:spo] all SPO for C → split key → [t:st] term from term
  key
- s p o c: [c:spo] exact SPO & C match → split key → [t:st] term from
  term key
- s ? ?: [s:po] All PO for S → split key → [t:st] term from term key
- s p ?: [s:po] All PO for S → filter result by P in split key → [t:st]
  term from term key (see the sketch after this list)
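
A sketch of the ``s p ?`` case with plain Python structures, assuming
compound values are term keys joined by a separator byte (Strategy #5a
below replaces the separator with fixed-length slicing):

.. code-block:: python

   SEP = b'\x00'   # Illustrative separator between joined term keys.

   def lookup_sp(s_po, s_key, p_key):
       """Yield (s, p, o) term-key triples for a bound subject and predicate.

       ``s_po`` stands in for the s:po index: S key → set of joined
       P + O keys.
       """
       for po in s_po.get(s_key, ()):
           p, o = po.split(SEP, 1)
           if p == p_key:      # Filter the split keys by P.
               yield s_key, p, o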
Advantages
~~~~~~~~~~

- Fewer indices: smaller index size and less I/O
.. _disadvantages-1:

Disadvantages
~~~~~~~~~~~~~

- Possibly slower retrieval for queries with 2 bound terms (run
  metrics)
Further optimization
~~~~~~~~~~~~~~~~~~~~

In order to minimize traversing and splitting results, the first
retrieval should be made on the index with the fewest average values per
key. Search order can be balanced by establishing a lookup order for
indices.

This can be achieved by calling stats on the index databases and looking
up the database with the *most* keys. Since there is an equal number of
entries in each of the (s:po, p:so, o:sp) indices, the one with the most
keys will have the least average number of values per key. If that
lookup is done first, the initial data set to traverse and filter will
be smaller.
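
A sketch of establishing that lookup order with py-lmdb; because the
entry count of a dupsort database includes duplicate values, this
version counts distinct keys by iterating with ``iternext_nodup()`` (a
real implementation would likely cache or sample such statistics):

.. code-block:: python

   def preferred_index(txn, idx_dbs):
       """Pick the index with the most distinct keys (fewest values per key).

       ``idx_dbs`` maps index names ('s:po', 'p:so', 'o:sp') to open LMDB
       database handles; the names are illustrative.
       """
       def distinct_keys(db):
           return sum(1 for _ in txn.cursor(db=db).iternext_nodup())

       return max(idx_dbs, key=lambda name: distinct_keys(idx_dbs[name]))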
Strategy #5a
------------

This is a slightly different implementation of #5 that somewhat
simplifies things and perhaps speeds them up a bit. It is the currently
employed solution.

The indexing and lookup strategy is the same, but instead of using a
separator byte for splitting compound keys, the logic relies on the fact
that keys have a fixed length and can simply be sliced. This *should*
result in faster key manipulation, also because in most cases
``memoryview`` buffers can be used directly instead of being copied.

Index storage is 90 bytes per triple.
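
A sketch of the slicing approach, assuming 5-byte term keys (the actual
length may differ); slicing a ``memoryview`` yields zero-copy views into
the joined key, e.g. the buffers returned by a read transaction opened
with ``buffers=True``:

.. code-block:: python

   KEY_LEN = 5   # Illustrative fixed length of a single term key.

   def split_triple_key(joined):
       """Slice a joined S + P + O key into three fixed-length term keys."""
       buf = memoryview(joined)
       return buf[:KEY_LEN], buf[KEY_LEN:2 * KEY_LEN], buf[2 * KEY_LEN:]

   s, p, o = split_triple_key(b'AAAAABBBBBCCCCC')
   assert bytes(p) == b'BBBBB'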