structures.rst 3.1 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899
  1. Data Structure Internals
  2. ========================
  3. **(Draft)**
  4. Lakesuperior has its own methods for handling in-memory graphs. These methods
  5. rely on C data structures and are therefore much faster than Python/RDFLib
  6. objects.
  7. The graph data model modules are in :py:module:`lakesuperior.model.graph`.
  8. The Graph Data Model
  9. --------------------
  10. Triples are stored in a C hash set. Each triple is represented by a pointer to
  11. a ``BufferTriple`` structure stored in a temporary memory pool. This pool is
  12. tied to the life cycle of the ``SimpleGraph`` object it belongs to.
  13. A triple structure contains three pointers to ``Buffer`` structures, which
  14. contain a serialized version of a RDF term. These structures are stored in the
  15. ``SimpleGraph`` memory pool as well.
  16. Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples``
  17. property. These are C hash sets holding addresses of unique terms and
  18. triples inserted in the graph. If the same term is entered more than once,
  19. in any position in any triple, the first one entered is used and is pointed to
  20. by the triple. This makes the graph data structure very compact.
  21. In summary, the pointers can be represented this way::
  22. <serialized term data in mem pool (x3)>
  23. ^ ^ ^
  24. | | |
  25. <Term structures in mem pool (x3)>
  26. ^ ^ ^
  27. | | |
  28. <Term struct addresses in _terms set (x3)>
  29. ^ ^ ^
  30. | | |
  31. <Triple structure in mem pool>
  32. ^
  33. |
  34. <address of triple in _triples set>
  35. Let's say we insert the following triples in a ``SimpleGraph``::
  36. <urn:s:0> <urn:p:0> <urn:o:0>
  37. <urn:s:0> <urn:p:1> <urn:o:1>
  38. <urn:s:0> <urn:p:1> <urn:o:2>
  39. <urn:s:0> <urn:p:0> <urn:o:0>
  40. The memory pool contains the following byte arrays of raw data, displayed in
  41. the following list with their relative addresses (simplified to 8-bit
  42. addresses and fixed-length byte strings for readability)::
  43. 0x00 <urn:s:0>
  44. 0x09 <urn:p:0>
  45. 0x12 <urn:o:0>
  46. 0x1b <urn:s:0>
  47. 0x24 <urn:p:1>
  48. 0x2d <urn:o:1>
  49. 0x36 <urn:s:0>
  50. 0x3f <urn:p:1>
  51. 0x48 <urn:o:2>
  52. 0x51 <urn:s:0>
  53. 0x5a <urn:p:0>
  54. 0x63 <urn:o:0>
  55. However, the ``_terms`` set contains only ``Buffer`` structures pointing to
  56. unique addresses::
  57. 0x00
  58. 0x09
  59. 0x12
  60. 0x24
  61. 0x2d
  62. 0x48
  63. The other terms are just unutilized. They will be deallocated en masse when
  64. the ``SimpleGraph`` object is garbage collected.
  65. The ``_triples`` set would then contain 3 unique entries pointing to the unique
  66. term addresses::
  67. 0x00 0x09 0x12
  68. 0x00 0x24 0x2d
  69. 0x00 0x24 0x48
  70. (the actual addresses would actually belong to the structures pointing to the
  71. raw data, but this is just an illustrative example).
  72. The advantage of this approach is that the memory pool is contiguous and
  73. append-only (until it gets purged), so it's cheap to just add to it, while the
  74. sets that must maintain uniqueness and are the ones that most operations
  75. (lookup, adding, removing, slicing, copying, etc.) are done on, contain much
  76. less data and are therefore faster.