Data Structure Internals ======================== **(Draft)** Lakesuperior has its own methods for handling in-memory graphs. These methods rely on C data structures and are therefore much faster than Python/RDFLib objects. The graph data model modules are in :py:module:`lakesuperior.model.graph`. The Graph Data Model -------------------- Triples are stored in a C hash set. Each triple is represented by a pointer to a ``BufferTriple`` structure stored in a temporary memory pool. This pool is tied to the life cycle of the ``SimpleGraph`` object it belongs to. A triple structure contains three pointers to ``Buffer`` structures, which contain a serialized version of a RDF term. These structures are stored in the ``SimpleGraph`` memory pool as well. Each ``SimpleGraph`` object has a ``_terms`` property and a ``_triples`` property. These are C hash sets holding addresses of unique terms and triples inserted in the graph. If the same term is entered more than once, in any position in any triple, the first one entered is used and is pointed to by the triple. This makes the graph data structure very compact. In summary, the pointers can be represented this way:: ^ ^ ^ | | | ^ ^ ^ | | | ^ ^ ^ | | | ^ |
Let's say we insert the following triples in a ``SimpleGraph``:: The memory pool contains the following byte arrays of raw data, displayed in the following list with their relative addresses (simplified to 8-bit addresses and fixed-length byte strings for readability):: 0x00 0x09 0x12 0x1b 0x24 0x2d 0x36 0x3f 0x48 0x51 0x5a 0x63 However, the ``_terms`` set contains only ``Buffer`` structures pointing to unique addresses:: 0x00 0x09 0x12 0x24 0x2d 0x48 The other terms are just unutilized. They will be deallocated en masse when the ``SimpleGraph`` object is garbage collected. The ``_triples`` set would then contain 3 unique entries pointing to the unique term addresses:: 0x00 0x09 0x12 0x00 0x24 0x2d 0x00 0x24 0x48 (the actual addresses would actually belong to the structures pointing to the raw data, but this is just an illustrative example). The advantage of this approach is that the memory pool is contiguous and append-only (until it gets purged), so it's cheap to just add to it, while the sets that must maintain uniqueness and are the ones that most operations (lookup, adding, removing, slicing, copying, etc.) are done on, contain much less data and are therefore faster.