123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899 |
- Data Structure Internals
- ========================
- **(Draft)**
- Lakesuperior has its own methods for handling in-memory graphs. These methods
- rely on C data structures and are therefore much faster than Python/RDFLib
- objects.
- The graph data model modules are in :py:module:`lakesuperior.model.graph`.
- The Graph Data Model
- Triples are stored in a C hash set. Each triple is represented by a pointer to
- a ``BufferTriple`` structure stored in a temporary memory pool. This pool is
- tied to the life cycle of the ``SimpleGraph`` object it belongs to.
- A triple structure contains three pointers to ``Buffer`` structures, which
- contain a serialized version of a RDF term. These structures are stored in the
- ``SimpleGraph`` memory pool as well.
- Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples``
- property. These are C hash sets holding addresses of unique terms and
- triples inserted in the graph. If the same term is entered more than once,
- in any position in any triple, the first one entered is used and is pointed to
- by the triple. This makes the graph data structure very compact.
- In summary, the pointers can be represented this way::
- <serialized term data in mem pool (x3)>
- ^ ^ ^
- | | |
- <Term structures in mem pool (x3)>
- ^ ^ ^
- | | |
- <Term struct addresses in _terms set (x3)>
- ^ ^ ^
- | | |
- <Triple structure in mem pool>
- ^
- |
- <address of triple in _triples set>
- Let's say we insert the following triples in a ``SimpleGraph``::
- <urn:s:0> <urn:p:0> <urn:o:0>
- <urn:s:0> <urn:p:1> <urn:o:1>
- <urn:s:0> <urn:p:1> <urn:o:2>
- <urn:s:0> <urn:p:0> <urn:o:0>
- The memory pool contains the following byte arrays of raw data, displayed in
- the following list with their relative addresses (simplified to 8-bit
- addresses and fixed-length byte strings for readability)::
- 0x00 <urn:s:0>
- 0x09 <urn:p:0>
- 0x12 <urn:o:0>
- 0x1b <urn:s:0>
- 0x24 <urn:p:1>
- 0x2d <urn:o:1>
- 0x36 <urn:s:0>
- 0x3f <urn:p:1>
- 0x48 <urn:o:2>
- 0x51 <urn:s:0>
- 0x5a <urn:p:0>
- 0x63 <urn:o:0>
- However, the ``_terms`` set contains only ``Buffer`` structures pointing to
- unique addresses::
- 0x00
- 0x09
- 0x12
- 0x24
- 0x2d
- 0x48
- The other terms are just unutilized. They will be deallocated en masse when
- the ``SimpleGraph`` object is garbage collected.
- The ``_triples`` set would then contain 3 unique entries pointing to the unique
- term addresses::
- 0x00 0x09 0x12
- 0x00 0x24 0x2d
- 0x00 0x24 0x48
- (the actual addresses would actually belong to the structures pointing to the
- raw data, but this is just an illustrative example).
- The advantage of this approach is that the memory pool is contiguous and
- append-only (until it gets purged), so it's cheap to just add to it, while the
- sets that must maintain uniqueness and are the ones that most operations
- (lookup, adding, removing, slicing, copying, etc.) are done on, contain much
- less data and are therefore faster.
|