title: LAKEsuperior Recommendations
author: "Stefano Cossu, the Art Institute of Chicago <scossu@artic.edu>
Kevin Ford, the Art Institute of Chicago, <kford1@rtic.edu>"
papersize: letter
geometry: "margin=1in"
output: pdf_document
fontfamily: ebgaramond
fontfamilyoptions: osf
fontsize: 11pt
LAKEsuperior is the code name of the current internal Fedora repository that is part of LAKE, the institutional repository and DAMS for the Art Institute of Chicago Collections.
To overcome the technical limitations of the current Modeshape-backed Fedora 4 implementation, we propose an alternative implementation of Fedora.
The end goal of this project is a repository system, primarily but not exclusively supported and used by AIC, that is fully compliant with the official Fedora API specifications at [http://fedora.info].
The full-scale project is divided into three levels, corresponding to three major milestones:
Our Fedora implementation should fulfill the following high-level requirements:
This is in no way an exhaustive development path, and even less a spec sheet. It is mostly a collection of recommended patterns for resolving specific issues in building an alternative implementation of Fedora. Some of these recommendations may differ from the final implementation.
The Level 1 implementation is aimed at providing a proof-of-concept (PoC) repository system with a minimal set of features. The resulting system should be easily deployable and offer the most commonly used features, as well as the features that present particular challenges in the current Fedora implementation.
The Level 1 development cycle should deliver a system that can be tested by a variety of stakeholders, with the purpose of gathering feedback to drive further development and possible collaboration with other Fedora 4 alternative implementation efforts.
The Level 1 LAKEsuperior implementation supports a minimum set of features meeting the following requirements:
Blazegraph is the most convenient choice for a PoC.
Pros:
Cons:
Alternatives can be considered beyond Level 1 if significant hurdles or limitations are encountered during this development phase. Therefore, the code should use standard triplestore features and explicitly document the implementation-specific features necessary for LAKEsuperior to function.
Binaries are stored in the filesystem. See below for implementation details.
The application is written in Python 3 using the Python standard library and mature, well-respected third-party libraries: Flask, Requests, RDFLib, etc.
Application configuration is stored in separate files in an appropriate structured format (e.g. YAML).
Partitioning within a quad-store can be done via named graphs. Each “resource” as intended in LDP is contained in a separate named graph. This makes it possible to clearly identify such LDP “resources”.
There are multiple approaches to this implementation, mostly differing in the handling of the graph name:
[prefix definitions]
res:16fb6c41-b862-4adc-8656-5f9c356b56bb {
  res:16fb6c41-b862-4adc-8656-5f9c356b56bb a ldp:NonRdfSource ;
    ebucore:height "1024" ;
    ebucore:width "1920" ;
    ebucore:filename "xyz.png" ;
    premis:hasMessageDigest <urn:sha1:eaa3379d8415071369f3c8b3699fa91fcfc6888c> .
}
This is the most straightforward implementation and does not impose any additional construct on a straight SPARQL query. On the other hand, the repetition of the graph name in the subject is redundant and limits further strategic use of named graphs.
[LEVEL 2?]
[prefix definitions]
main {
  res:a a ltype:Resource ;
    fcrepo:hasVersion res:b , res:c .
  res:b a ltype:Snapshot .
  res:c a ltype:Snapshot .
}
res:a {
  lake:16fb6c41-b862-4adc-8656-5f9c356b56bb
    a ldp:Container , aictype:Work ;
    skos:prefLabel "Composition in Red" .
}
res:b {
  lake:16fb6c41-b862-4adc-8656-5f9c356b56bb
    a ldp:Container , aictype:Work ;
    skos:prefLabel "Composition in Reed" .
}
res:c {
  lake:16fb6c41-b862-4adc-8656-5f9c356b56bb
    a ldp:Container , aictype:Work ;
    skos:prefLabel "Composition in Read" .
}
This other approach has a “main” graph holding metadata about the first-class resources: their provenance, relationships with other named graphs, etc. Multiple named graphs can be different representations of the same resource. This makes it easier to build provenance data, such as version snapshots. On the other hand it is a more indirect approach, in that LAKEsuperior has to find one or more named graphs that represent the resource that is being requested. In the example above, the repository application needs to get triples from the graph that has the ltype:Resource type and skip the ones from the ltype:Snapshot graphs.
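The two-step lookup described above can be sketched in a few lines of Python. This is only an illustration: the dictionary `STORE` is a hypothetical in-memory stand-in for the quad store, prefixed names are kept as plain strings, and `current_triples` is a name invented here, not part of any existing code.

```python
RDF_TYPE = "rdf:type"
WORK = "lake:16fb6c41-b862-4adc-8656-5f9c356b56bb"

# Hypothetical in-memory stand-in for a quad store: graph name -> set of triples.
STORE = {
    "main": {
        ("res:a", RDF_TYPE, "ltype:Resource"),
        ("res:a", "fcrepo:hasVersion", "res:b"),
        ("res:a", "fcrepo:hasVersion", "res:c"),
        ("res:b", RDF_TYPE, "ltype:Snapshot"),
        ("res:c", RDF_TYPE, "ltype:Snapshot"),
    },
    "res:a": {(WORK, "skos:prefLabel", "Composition in Red")},
    "res:b": {(WORK, "skos:prefLabel", "Composition in Reed")},
    "res:c": {(WORK, "skos:prefLabel", "Composition in Read")},
}


def current_triples(store, subject):
    """Return triples about `subject`, read only from ltype:Resource graphs."""
    # Step 1: ask the main graph which named graphs hold current state.
    current_graphs = {
        s for s, p, o in store["main"]
        if p == RDF_TYPE and o == "ltype:Resource"
    }
    # Step 2: collect matching triples from those graphs, skipping snapshots.
    return {
        triple
        for name in current_graphs
        for triple in store.get(name, set())
        if triple[0] == subject
    }


print(current_triples(STORE, WORK))
```

In a real triplestore both steps would likely collapse into a single SPARQL query joining the main graph with the named graphs it describes.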
This is done by the application layer. This implementation supports all necessary building blocks for interacting with ActiveFedora: LDP-NR, LDPC, LDP-DC, LDP-IC, etc.
Resources are stored internally with a prefixed namespace that is replaced by the domain-specific URI prefix. No host-specific information is stored in the triplestore in order to guarantee portability of the data set.
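A minimal sketch of this prefix substitution follows. It assumes that the internal namespace is the `res:` namespace used in the samples of this document and that the public base URL comes from the application configuration; the function names are hypothetical.

```python
# Assumption: internal resources live under the res: namespace shown in the
# samples of this document.
INTERNAL_NS = "http://definitions.artic.edu/lake/resource#"


def to_public(uri, base_url):
    """Replace the internal namespace with the host-specific base URL."""
    if uri.startswith(INTERNAL_NS):
        return base_url + uri[len(INTERNAL_NS):]
    return uri  # external URIs pass through untouched


def to_internal(uri, base_url):
    """Replace the host-specific base URL with the internal namespace."""
    if uri.startswith(base_url):
        return INTERNAL_NS + uri[len(base_url):]
    return uri


base = "http://lakesuperior.artic.edu/rest/"
internal = INTERNAL_NS + "16fb6c41-b862-4adc-8656-5f9c356b56bb"
print(to_public(internal, base))
```

Because only the internal form is persisted, moving the repository to another host would only require changing the configured base URL.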
LAKEsuperior supports all the LDP container types and their related behavior, especially with regard to direct and indirect containers.
For individually deleted triples (i.e. property values) versioning can be engaged.
Allowing the deletion of a resource altogether poses several challenges (see also “Referential integrity” below). Non-destructive deletion and optional admin-restricted purging is the recommended approach.
Two “soft-delete” approaches are possible:
Deleted resources (with either method) should leave a tombstone. If the tombstone is deleted, the resource may be deleted permanently (barring referential integrity issues).
A tombstone, while available, may offer a method to “resurrect” a deleted resource.
Currently several clients rely on the mechanism by which, given <main:a> and <main:a/b>, if <main:a> is deleted, <main:a/b> is also deleted. A tombstone should be left on <main:a> and surface if either <main:a> or <main:a/b> is requested, as per current fcrepo4 behavior [VERIFY].
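The cascading soft-delete and tombstone lookup can be sketched as follows. The three containers (`live`, `trash`, `tombstones`) are hypothetical in-memory stand-ins for whatever the persistence layer provides, and the 410 status for a tombstone is an assumption that should be checked against fcrepo4 behavior, like the rest of this section.

```python
def soft_delete(live, trash, tombstones, uid):
    """Move `uid` and its descendants out of `live`; leave a tombstone on `uid`."""
    for key in [k for k in live if k == uid or k.startswith(uid + "/")]:
        trash[key] = live.pop(key)
    tombstones.add(uid)


def lookup(live, tombstones, uid):
    """Return (status, uid of the resource or tombstone that answered)."""
    if uid in live:
        return 200, uid
    # Walk up the path so that a tombstone on an ancestor also surfaces.
    parts = uid.split("/")
    for depth in range(len(parts), 0, -1):
        ancestor = "/".join(parts[:depth])
        if ancestor in tombstones:
            return 410, ancestor
    return 404, None


def resurrect(live, trash, tombstones, uid):
    """Restore `uid` and its descendants, then remove the tombstone."""
    for key in [k for k in trash if k == uid or k.startswith(uid + "/")]:
        live[key] = trash.pop(key)
    tombstones.discard(uid)


live = {"main:a": {}, "main:a/b": {}, "main:c": {}}
trash, tombstones = {}, set()
soft_delete(live, trash, tombstones, "main:a")
print(lookup(live, tombstones, "main:a/b"))
```

Purging would then amount to emptying the matching `trash` entries and discarding the tombstone, subject to the referential integrity concerns discussed elsewhere in this document.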
[DISCUSS utility, extent and pitfalls] Since any triples can be inserted in a triplestore, it may be useful to explicitly identify a LAKEsuperior “resource” by adding e.g. a <http://definitions.artic.edu/lake/type#Resource> RDF type to all the resources directly managed and exposed by LAKEsuperior. This RDF type is server managed and not exposed in the LDP API. The presence of this type for a resource may trigger further validation of server-managed properties and other constraints.
The designated LAKE resource class should be included in each named graph for each LAKEsuperior resource appearing as a subject.
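One way to audit this rule is sketched below. It assumes that managed resources can be recognized by the `res:` namespace used in the samples of this document; `untyped_resources` is a hypothetical helper name and triples are modelled as plain string tuples.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
LAKE_RESOURCE = "http://definitions.artic.edu/lake/type#Resource"
# Assumption: managed resources are identified by this namespace.
RESOURCE_NS = "http://definitions.artic.edu/lake/resource#"


def untyped_resources(triples):
    """Return managed subjects in a graph that lack the LAKE resource type."""
    managed = {s for s, _, _ in triples if s.startswith(RESOURCE_NS)}
    typed = {
        s for s, p, o in triples
        if p == RDF_TYPE and o == LAKE_RESOURCE
    }
    return managed - typed


graph = [
    (RESOURCE_NS + "a", RDF_TYPE, LAKE_RESOURCE),
    (RESOURCE_NS + "a", "http://www.w3.org/2004/02/skos/core#prefLabel", "A"),
    (RESOURCE_NS + "b", "http://www.w3.org/2004/02/skos/core#prefLabel", "B"),
]
print(untyped_resources(graph))
```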
While it is not ideal to enforce low-level structural rules in a repository, experience has proven that when systems become out of sync or external clients behave in an unexpected way, a repository without any structural integrity rules becomes more easily corrupted and structural issues become harder to identify and fix. Therefore it may be valuable to place some basic restrictions for resources in the repository so that errors surface early and are more likely to be caught and repaired.
Files are stored in the filesystem. The filesystem path for each LDP-NR is obtained by combining a root prefix defined in the application configuration with a balanced pairtree derived from the SHA1 checksum of the file content. This means that identical binaries can be represented by multiple LDP resources but are stored as a single file behind the scenes.
A server-managed triple contains the path to a file for each version.
Sample data set:
[prefix definitions]
area:main {
  res:16fb6c41-b862-4adc-8656-5f9c356b56bb
    a ldp:NonRdfSource , ltype:Resource ;
    ebucore:height "1024" ;
    ebucore:width "1920" ;
    ebucore:filename "xyz.png" ;
    premis:hasMessageDigest <urn:sha1:eaa3379d8415071369f3c8b3699fa91fcfc6888c> ;
    fcr:content "/eaa3/379d8/4150/eaa3379d8415071369f3c8b3699fa91fcfc6888c" ;
    aic:status res:58f00eca-c398-02f0-f9bb-6b2b6105c0ef .
}
Note that fcr:content is redundant since it can be inferred from the premis:hasMessageDigest URI.
The binary file content is stored in <bitstream folder path>/eaa3/379d8/4150/eaa3379d8415071369f3c8b3699fa91fcfc6888c.
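Deriving a storage path from a checksum can be sketched as below. The branch width (4 characters) and depth (3 levels) are assumptions chosen for illustration, so the result may not match the sample path above character for character; the function names are hypothetical.

```python
import hashlib
from pathlib import PurePosixPath


def path_for_digest(root, digest, width=4, depth=3):
    """Build <root>/<branch>/<branch>/<branch>/<digest> from a hex digest."""
    branches = [digest[i * width:(i + 1) * width] for i in range(depth)]
    return str(PurePosixPath(root, *branches, digest))


def path_for_content(root, content, width=4, depth=3):
    """Hash the binary content with SHA1 and return its storage path."""
    digest = hashlib.sha1(content).hexdigest()
    return path_for_digest(root, digest, width, depth)


# Identical content always maps to the same path, so it is stored only once.
print(path_for_digest("/data", "eaa3379d8415071369f3c8b3699fa91fcfc6888c"))
```

Since the path is a pure function of the checksum, two LDP-NRs with identical content resolve to the same file without any extra bookkeeping.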
Sample LDP requests and responses:
GET http://lakesuperior.artic.edu/rest/main:16fb6c41-b862-4adc-8656-5f9c356b56bb
[binary data]
GET http://lakesuperior.artic.edu/rest/main:16fb6c41-b862-4adc-8656-5f9c356b56bb/fcr:metadata
[prefix definitions]
<http://lakesuperior.artic.edu/rest/main:16fb6c41-b862-4adc-8656-5f9c356b56bb>
  a ldp:NonRdfSource ;
  ebucore:height "1024" ;
  ebucore:width "1920" ;
  ebucore:filename "xyz.png" ;
  premis:hasMessageDigest <urn:sha1:eaa3379d8415071369f3c8b3699fa91fcfc6888c> ;
  iana:describedBy <http://lakesuperior.artic.edu/rest/main:16fb6c41-b862-4adc-8656-5f9c356b56bb/fcr:metadata> ;
  aic:status res:58f00eca-c398-02f0-f9bb-6b2b6105c0ef .
Server-managed properties should be supported the same way the current Fedora 4.x implementation does, to the extent that allows a drop-in replacement as a Samvera client. This includes "magic" LDP predicates for direct and indirect containers.
Some server-managed properties are not exposed in the client-facing API. These properties have predicates within a dedicated namespace. If a client tries to insert a triple including an internal-use-only predicate, the application should return a 409 Conflict response.
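The check on incoming triples could look like the sketch below. The namespace URI, the exception class and the `contentPath` predicate are hypothetical placeholders; only the 409 status comes from the text above.

```python
# Hypothetical namespace reserved for internal-use-only predicates.
INTERNAL_NS = "http://definitions.artic.edu/lake/internal#"


class InternalPredicateError(Exception):
    """Raised when a client payload uses an internal-use-only predicate."""
    status_code = 409  # Conflict


def validate_client_triples(triples):
    """Raise InternalPredicateError if any predicate is in the internal namespace."""
    offending = sorted({p for _, p, _ in triples if p.startswith(INTERNAL_NS)})
    if offending:
        raise InternalPredicateError(
            "Internal-use-only predicates are not allowed: " + ", ".join(offending)
        )


payload = [("res:a", INTERNAL_NS + "contentPath", "/some/path")]
try:
    validate_client_triples(payload)
except InternalPredicateError as exc:
    print(exc.status_code, exc)
```

In a Flask application the exception could be mapped to the HTTP response by an error handler, which keeps the validation logic independent of the web layer.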
Some internal use only predicates are:
The Level 2 implementation builds upon the Level 1 proof of concept and, ideally, possible feedback from a variety of testers.
The goal for Level 2 is a feature-complete, beta-quality product compatible with Hyrax and Islandora (pending the presence of Islandora stakeholders).
Structural restrictions can include:
Enforcing these three types of restrictions may satisfy a broad range of use cases.
Restrictions should be completely optional for implementers.
The subsystem responsible for enforcing restrictions should be close enough to the core repository to ensure that all interaction with the persistence layer passes through it; and isolated enough that it can always be configured and enabled or disabled separately.
Referential integrity should ideally be enforced among resources that share the same domain, i.e. are managed by the same repository.
Referential integrity is hard to maintain if the repository allows deletion of versioned resources. Consider two resources, <a> and <b>:
1. The triple <a> ns:rel <b> is created.
2. The relationship is removed from <a>, but it is still present in historical versions.
3. <b> is deleted. The referential integrity is broken in previous versions of <a>.
4. If <a> is restored to its previous version, this breakage surfaces in the main resource.
There are three possible ways to address this:
Actual referential integrity is a better candidate for Level 2 but some early decisions should be made with this setup in mind.
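The versioning scenario described in this section can be made concrete with a small audit function. This is a sketch under simplifying assumptions: triples are plain string tuples, an object counts as an in-repository reference if it starts with the given prefix, and a resource is "live" if it appears as a subject among the current triples.

```python
def dangling_references(current, historic, ns):
    """List (area, s, p, o) where `o` is in `ns` but is no longer a live resource."""
    live = {s for s, _, _ in current}
    broken = []
    for area, triples in (("current", current), ("historic", historic)):
        for s, p, o in triples:
            if o.startswith(ns) and o not in live:
                broken.append((area, s, p, o))
    return broken


# <b> has been deleted; only an old snapshot of <a> still points to it.
current = [("res:a", "skos:prefLabel", "A")]
historic = [("snap:a1", "ns:rel", "res:b")]
print(dangling_references(current, historic, "res:"))

# Restoring the snapshot copies the stale link into the main resource.
restored = current + [("res:a", "ns:rel", "res:b")]
print(dangling_references(restored, historic, "res:"))
```

A check of this kind could run before a delete or a version restore is committed, which is one reason to keep the scenario in mind even before full referential integrity is implemented.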
Previous versions of a resource are stored as sets of triples within a separate named graph:
PREFIX area: <http://definitions.artic.edu/lake/area#>
PREFIX res: <http://definitions.artic.edu/lake/resource#>
PREFIX ltype: <http://definitions.artic.edu/lake/type#>
PREFIX snap: <http://definitions.artic.edu/lake/snapshot#>
PREFIX aic: <http://definitions.artic.edu/ontology/1.0/>
area:main {
  res:16fb6c41-b862-4adc-8656-5f9c356b56bb
    a ldp:NonRdfSource , ltype:Resource ;
    ebucore:height "1024" ;
    ebucore:width "1920" ;
    ebucore:filename "xyz.png" ;
    premis:hasMessageDigest <urn:sha1:eaa3379d8415071369f3c8b3699fa91fcfc6888c> ;
    aic:status res:58f00eca-c398-02f0-f9bb-6b2b6105c0ef .
}
area:historic {
  res:16fb6c41-b862-4adc-8656-5f9c356b56bb
    fedora:hasVersion
      snap:fa297320-4ae1-46c9-8d2b-9356e713489f ,
      snap:7c96c502-b101-410d-8119-3b690a38a46a .
  snap:fa297320-4ae1-46c9-8d2b-9356e713489f
    a ldp:NonRdfSource , ltype:Resource ;
    fedora:hasVersionLabel "Version 2" ;
    ebucore:height "1018" ;
    ebucore:width "1318" ;
    ebucore:filename "xyz_older.png" ;
    premis:hasMessageDigest <urn:sha1:dae1230ad61117e33f4e338a1648983e1af84377> ;
    fedora:created "2017-08-22T14:21:14.941Z"^^xsd:dateTime ;
    aic:status res:58f00eca-c398-02f0-f9bb-6b2b6105c0ef .
  snap:7c96c502-b101-410d-8119-3b690a38a46a
    a ldp:NonRdfSource , ltype:Resource ;
    fedora:hasVersionLabel "Version 1" ;
    ebucore:height "768" ;
    ebucore:width "1024" ;
    ebucore:filename "xyz_oldest.jpg" ;
    premis:hasMessageDigest <urn:sha1:dae9d4b279aa27cde354e5a0a8e7da07c7560fbc> ;
    fedora:created "2017-08-22T14:51:46.036Z"^^xsd:dateTime ;
    aic:status res:58f00eca-c398-02f0-f9bb-6b2b6105c0ef .
}
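Creating a snapshot in this layout amounts to copying the current triples of a resource under a new `snap:` subject in the historic graph. The sketch below illustrates this with a hypothetical in-memory store (graph name mapped to a set of string triples); `create_snapshot` and the use of a random UUID for the snapshot identifier are assumptions.

```python
import uuid
from datetime import datetime, timezone


def create_snapshot(store, uid, label):
    """Copy the current triples of `uid` into area:historic under a new snap: ID."""
    snap_id = "snap:%s" % uuid.uuid4()
    main = store["area:main"]
    historic = store.setdefault("area:historic", set())
    # Re-home every current triple of the resource onto the snapshot subject.
    for s, p, o in main:
        if s == uid:
            historic.add((snap_id, p, o))
    # Link the resource to its new version and record version metadata.
    historic.add((uid, "fedora:hasVersion", snap_id))
    historic.add((snap_id, "fedora:hasVersionLabel", label))
    historic.add((snap_id, "fedora:created", datetime.now(timezone.utc).isoformat()))
    return snap_id


store = {"area:main": {("res:x", "ebucore:filename", "xyz.png")}}
snap = create_snapshot(store, "res:x", "Version 1")
print(sorted(store["area:historic"]))
```

The main graph is left untouched, so a snapshot can be taken immediately before an update overwrites the current state.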
Versions are retrieved by a mechanism similar to fcrepo4, with some differences:
GET http://lakesuperior.artic.edu/rest/historic:16fb6c41-b862-4adc-8656-5f9c356b56bb
[prefix definitions]
<http://lakesuperior.artic.edu/rest/historic:16fb6c41-b862-4adc-8656-5f9c356b56bb>
  fedora:hasVersion
    <http://lakesuperior.artic.edu/rest/historic:fa297320-4ae1-46c9-8d2b-9356e713489f> ,
    <http://lakesuperior.artic.edu/rest/historic:7c96c502-b101-410d-8119-3b690a38a46a> .
<http://lakesuperior.artic.edu/rest/historic:fa297320-4ae1-46c9-8d2b-9356e713489f>
  fedora:created "2017-08-22T14:21:14.941Z"^^xsd:dateTime ;
  fedora:hasVersionLabel "Version 2" .
<http://lakesuperior.artic.edu/rest/historic:7c96c502-b101-410d-8119-3b690a38a46a>
  fedora:created "2017-08-22T14:51:46.036Z"^^xsd:dateTime ;
  fedora:hasVersionLabel "Version 1" .
Most importantly, note the lack of a fedora:hasVersions property that points to the ./fcr:versions resource. This is replaced by a resource with the same URI as the main resource, within the historic namespace. If a client requires the fedora:hasVersions property for whatever reason, it may be reasonable to add it and have it point to the resource in the historic area.
Transactions are not supported by LAKEsuperior at the LDP API level, at least in Levels 1 and 2, and maybe not in early stable releases either. The pitfalls of implementing transactions are too many to tackle in early releases.
However, Level 1 should guarantee that a single client request is always atomic even if this entails multiple interactions with the underlying store. For this reason, any candidate backing triplestore must support transactions.
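The all-or-nothing behavior expected of a single request can be illustrated with a context manager. This sketch emulates atomicity by snapshot and rollback on a hypothetical in-memory store; an actual implementation would more likely delegate to the transaction support of the backing triplestore.

```python
import copy
from contextlib import contextmanager


@contextmanager
def atomic(store):
    """Keep every change made inside the block, or none if an error occurs."""
    backup = copy.deepcopy(store)
    try:
        yield store
    except Exception:
        # Roll back to the state captured before the request started.
        store.clear()
        store.update(backup)
        raise


store = {"area:main": {("res:a", "skos:prefLabel", "Composition in Red")}}
try:
    with atomic(store) as tx:
        tx["area:main"].add(("res:a", "aic:status", "res:x"))
        raise RuntimeError("second store interaction failed")
except RuntimeError:
    pass
print(store)
```

Here the first change is discarded because a later step of the same request failed, which is the guarantee described above for a single client request.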
[TODO] Fixity support should be implemented as per current fcrepo4 specs.
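Pending that work, the core of a fixity check can be sketched as a comparison between a freshly computed checksum and the stored premis:hasMessageDigest URN. This does not reproduce the fcrepo4 fixity API or its response format; `check_fixity` is a hypothetical helper and the `urn:<algorithm>:<hex digest>` layout is taken from the samples in this document.

```python
import hashlib
import io


def check_fixity(stream, digest_urn, chunk_size=65536):
    """Return True if the stream content matches a urn:<algorithm>:<hex> digest."""
    scheme, algorithm, expected = digest_urn.split(":", 2)
    if scheme != "urn":
        raise ValueError("Not a digest URN: %s" % digest_urn)
    hasher = hashlib.new(algorithm)
    # Read in chunks so that large binaries are never loaded fully in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest() == expected.lower()


content = b"sample binary content"
urn = "urn:sha1:" + hashlib.sha1(content).hexdigest()
print(check_fixity(io.BytesIO(content), urn))
print(check_fixity(io.BytesIO(b"corrupted content"), urn))
```

With real files, `stream` would be the result of opening the bitstream path in binary mode.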
TBD