Compact, fast, low-maintenance digital repository.

Pocket Archive

The idea

Stick it in your pocket and carry it around. Install it on a cloud server. Install it on a Raspberry Pi. Browse it offline. Browse it online. Duplicate it, share it, harvest it and aggregate it. Feed it non-GMO spreadsheets regularly and it will thrive.

A more sensical description

Pocket Archive is a digital archival system and static site generator for small- to medium-(?) sized archives. It is designed to function in environments with unreliable connectivity, and it requires very few technical and human resources to set up, run, and use.

Pocket Archive fulfills the following functions:

Storage and management of files and metadata-only resources
Management of descriptive, administrative, and technical metadata
Dynamic relationships between resources
Static site generation (discovery interface)

In spite of its design simplicity, Pocket Archive strives to be highly flexible. It is based on Volksdata, a very compact Linked Data store written in C. There is no restriction on the types and schema of metadata allowed, or on the file types supported. A file-based configuration makes it possible to set up content types and validation rules, or to have (almost) no rules at all.

Why

Several years ago, the author of this project believed that he should work in larger and larger institutions, with larger and larger data sets. One day, he came across a project that changed his perspective.

"From a standpoint of preserving human cultural heritage at large, does it make more sense to design very large repositories for very rich institutions, with a lot of layers of safety but also a lot of bureaucracy and redundancy, or contribute to many decentralized projects that are highly efficient, small, representing periferal cultures, and most importantly, that are at much higher risk of loss than the large institutions"?

Both: this software has been conceived with the experience of large-scale repositories as the background to decide what works and what doesn't, what is necessary and what is superfluous, and what catalogers and archivists need to do their job.

Basic concepts

Until a proper reference is written, this section should serve as high-level documentation to help evaluate the functionality, and to help the author stay on track. Some of these ideas have been ripped right off the author's day job, so there is a good chance they work.

General philosophy

The functional goals of Pocket Archive are simplicity and flexibility, from both a user's and a maintainer's perspectives. These two properties are usually seen as conflicting, but within reason, they can coexist.

Pocket Archive is built upon a minimalistic framework: C and Lua, with very few dependencies. As with these foundational elements, it strives to offer few tools that can be combined in a multitude of ways to achieve many goals, rather than many tools each doing a specific thing.

Resource

The Linked Data adage goes, "everything is a Resource". Without confusing users too much by taking the concept to the Linked Data extremes, the term resource is used in this project to describe individual, self-contained units of information such as:

Files;
Intellectual or physical artifacts (artworks, documents, books, etc.);
Structural elements inside or around an entity, such as the order of pages in a book, the two sides of a postcard, a collection of artifacts, etc.

Files are called opaque resources. They are viewed by Pocket Archive as "opaque" in that the system doesn't care about their contents. It only ensures that files are stored as they were submitted, and keeps checksums to guard against data corruption.
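The checksum idea can be illustrated with a minimal sketch. The source does not say which hash algorithm Pocket Archive uses, so SHA-256 is an assumption here, and the function names `file_checksum` and `is_intact` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()  # assumption: the real system may use another algorithm
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_checksum: str) -> bool:
    """Compare a file's current digest with the one recorded at deposit time."""
    return file_checksum(path) == recorded_checksum
```

Reading in chunks keeps memory use flat for large files, which matters on small devices such as a Raspberry Pi.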

All other entities are called descriptive resources. These are effectively Linked Data, which can be queried and searched for. Each file also has its own descriptive resource, so that it can be classified, discovered, and described.

Submission

A Pocket Archive repository is populated via submissions. A submission is performed by telling the archive to pick up some files from a folder it can access, push them into storage, add metadata to them, and index them so that they can be found later.

A submission is directed by a laundry list, which is a spreadsheet listing all the resources (both opaque and descriptive) to be created, and the metadata assigned to them. The laundry list, formatted as a CSV (comma-separated values) file, can be edited with several free and open source applications, such as LibreOffice. For repetitive, high-volume submissions, templates can be set up to facilitate filling in metadata fields. An example submission, which includes a laundry list, is available.

Detailed instructions on how to write a laundry list shall be added later. For now, the following are the basic guidelines to build a submission package:

Resources are arranged in files and folders on a local filesystem that Pocket Archive can access.
File and folder arrangement is important. A folder represents a descriptive resource, and can have metadata attached to it. A file or folder under a parent folder is automatically added as a child of the parent resource. This relationship is intended to present the parent as a container of other sub-resources (descriptive and/or opaque). With this method, hierarchies of any complexity can be built.
File and folder order in the submission folder is not important. There is no need to rename files and folders to force a specific ordering. Ordering is specified via the laundry list instead (see below).
The laundry list file is placed under the submission package folder and must be named pkar_submission.csv.
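As a rough illustration of the folder-to-container rule above, the following sketch derives parent/child links purely from relative source paths. The function name `infer_parents` is hypothetical, and the choice to treat a resource whose parent folder is not listed as top-level is an assumption, not documented behavior.

```python
from pathlib import PurePosixPath


def infer_parents(source_paths):
    """Map each source path to its parent path, or None for top-level resources."""
    known = set(source_paths)
    parents = {}
    for path in source_paths:
        parent = str(PurePosixPath(path).parent)
        # Assumption: a parent that is not itself a listed resource is ignored.
        parents[path] = parent if parent in known else None
    return parents
```

For example, `infer_parents(["album", "album/front.jpg", "letter.pdf"])` links `album/front.jpg` to `album` and leaves the other two without a parent.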

A laundry list is thus formatted:

The first row is reserved for the headers, which indicate the field names.
Each subsequent row represents a resource (except in a multi-value case, described below). The pas:sourcePath and pas:contentType fields are mandatory for each resource. All other fields are optional for the submission; however, some type definitions may impose constraints in this regard.
All field names, except for id, have a namespace prefix among the ones defined in the configuration. See dedicated section for details about namespaces.
Fields with a special meaning:
- id: optional and single-valued. If provided, it becomes the primary identifier for the resource, which is used anywhere information about the resource is retrieved. The depositor is responsible for ensuring that the provided ID is unique across the system. If left blank, the system generates an identifier that is guaranteed to be unique.
- pas:sourcePath: mandatory and single-valued. It refers to the file or folder path relative to the package.
- pas:contentType: mandatory and single-valued. It defines the content type assigned to the resource. For files, it should be pas:File or a sub-type thereof. For folders it must not be a pas:File or sub-type.
To provide multiple values for one or more fields, add the additional values in rows directly below the resource's first row. In these additional rows, the pas:sourcePath field must be left empty, and additional values for single-valued fields are ignored.
The ordering of the rows determines the ordering of the resources in their container. The system automatically assigns an order to the resources, using their source path and their position in the laundry list. Resources at the top are not assigned an order, as they are considered self-standing. If an order is needed for those, the pas:next field can be set to the desired resource (see point below about relationships), or they can be put in an enclosing folder that acts as a collection.
Relationships can be established between resources. These are stored as persistent links and appear as hyperlinks in the discovery interface. A relationship can only be set for a field that is configured as "resource" type. To set a relationship with a resource in the same laundry list that doesn't have an explicit ID set, insert the source path of the resource. For a resource that already has an ID, either by being assigned one manually or by being already deposited, insert the full ID including the par: namespace (e.g. for ID 12345, insert par:12345).
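The row rules above can be sketched as a small parser. This is not Pocket Archive's actual code (which is written in C and Lua); it is a minimal Python sketch, and the `dc:` prefix, the `pas:Artifact` type name, and the sample values are illustrative assumptions.

```python
import csv
import io

# Fields the text above describes as single-valued.
SINGLE_VALUED = {"id", "pas:sourcePath", "pas:contentType"}

# Hypothetical sample: one folder with two subjects, and one file inside it.
SAMPLE = """id,pas:sourcePath,pas:contentType,dc:title,dc:subject
,album,pas:Artifact,Family album,portraits
,,,,travel
,album/front.jpg,pas:File,Front cover,
"""


def parse_laundry_list(text):
    """Group laundry list rows into resources, folding continuation rows."""
    resources = []
    for row in csv.DictReader(io.StringIO(text)):
        if row.get("pas:sourcePath"):
            # A filled source path starts a new resource.
            resources.append({})
        elif not resources:
            raise ValueError("the first data row must have a pas:sourcePath")
        current = resources[-1]
        for field, value in row.items():
            if field is None or not value:
                continue  # skip empty cells and cells beyond the header
            values = current.setdefault(field, [])
            if field in SINGLE_VALUED and values:
                continue  # extra values for single-valued fields are ignored
            values.append(value)
    return resources
```

Calling `parse_laundry_list(SAMPLE)` yields two resources in row order; the first carries both `portraits` and `travel` under `dc:subject`. Keeping the list order intact is what would let a later step derive the ordering of resources within their container.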

Update

A submission is also used to update existing resources. Each resource update is a full replacement of all the resource's metadata, so a submission must include a full representation of each of the resources updated.

To facilitate this task while avoiding the need to hold on to all of the archive's laundry lists, Pocket Archive can generate a laundry list for one or more selected resources. This list, which represents the current state of the requested resources, can be edited and submitted as an update. This method is much faster and more intuitive than clicking around an alien user interface filled with icons and terms that one has never seen before.
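To make the round trip concrete, here is a minimal sketch of how resources held in memory could be serialized back into laundry list rows, with multi-valued fields spilling onto continuation rows. The function name `export_laundry_list`, the in-memory layout (field name mapped to a list of values), and the `dc:subject` field are all assumptions; the actual generator may work quite differently.

```python
import csv
import io


def export_laundry_list(resources, fields):
    """Serialize resources (field -> list of values) as laundry list CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for resource in resources:
        # One row per value index; only the first row carries single values.
        depth = max((len(resource.get(f, [])) for f in fields), default=1)
        for i in range(max(depth, 1)):
            writer.writerow(
                [resource[f][i] if i < len(resource.get(f, [])) else "" for f in fields]
            )
    return buffer.getvalue()
```

Because an update replaces all of a resource's metadata, an export like this has to include every field of the resource, not only the ones to be changed.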

Metadata & content model

Note: The scope of this functional area is currently under review. Things may change.

Metadata are (yes, it's a plural noun) controlled by a content model, which in this project is understood as the entirety of the definitions of content types recognized by the system, and how they relate to one another. Each type definition is encoded in its own configuration file, which defines a single content type. This configuration is specific to each individual Pocket Archive installation, which can use the baseline one provided by default, or extend it via additional configurations. Please look at the default model configuration files that come with Pocket Archive.

One doesn't have to define all possible types in detail. Pocket Archive provides some basic types, e.g.: Anything (the super-class of them all), Artifact, File, Part, which can be used in a very basic installation and should not be radically altered, because some basic functionality of the system relies on them. To add more specific definitions, subtypes can be defined. A subtype inherits all the property definitions of its broader model, and adds more specific behavior. An example classification could be: Anything -> File -> Image File -> Scientific Image. Each of the sub-types would only define the special properties of that definition, which add to, or replace, the properties of its broader definitions.
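The inheritance rule can be sketched as follows. The type names come from the example classification above, but the dictionary layout, the `broader` key, and the property names (`label`, `checksum`, `width`) are hypothetical and do not reflect the real configuration format.

```python
# Hypothetical in-memory content model; the real one lives in config files.
TYPES = {
    "Anything": {"broader": None, "properties": {"label": {"type": "string"}}},
    "File": {"broader": "Anything", "properties": {"checksum": {"type": "string"}}},
    "Image File": {"broader": "File", "properties": {"width": {"type": "number"}}},
    "Scientific Image": {
        "broader": "Image File",
        "properties": {"label": {"type": "string", "min_cardinality": 1}},
    },
}


def resolve_properties(type_name, types=TYPES):
    """Merge property definitions from the root type down to type_name."""
    chain = []
    while type_name is not None:
        chain.append(type_name)
        type_name = types[type_name]["broader"]
    merged = {}
    for name in reversed(chain):  # broadest first, so subtypes override
        merged.update(types[name]["properties"])
    return merged
```

Here `resolve_properties("Scientific Image")` returns `label`, `checksum`, and `width`, with the `label` definition replaced by the more specific one from the subtype.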

All resources in Pocket Archive must be assigned a content type. If someone has to deal with a resource that doesn't fit in any of the predefined content models, they can assign it the most specific type that they can. At worst, they can put it under Anything. Of course, if one starts dealing with many unclassifiable resources that look similar, it's probably best to define a model for them; but that is not mandatory.

Each metadata field can be restricted by constraints. These constraints can be on:

Type: the data type for the field, e.g. string, number, resource (relationship), etc.
Cardinality: how many values can be set for a field, for each resource. These values can be adjusted to set mandatory fields, single-valued fields, etc.
Range: the range of values allowed. How this is interpreted depends on the data type: for a number it can be a min/max range, for a string a regular expression pattern, for a resource the type(s) of the resources pointed to, etc.

All of these constraints are optional. Fields that are not defined are themselves optional and may accept any number of values. So it's up to the repository manager to decide how specific or how free-form their archive should be.
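A minimal sketch of how the three constraint kinds might be checked for one field. The constraint keys (`type`, `min_cardinality`, `max_cardinality`, `range`) and the function name `validate_field` are assumptions made for illustration; the validation rules of Pocket Archive itself are still on the road map.

```python
import re


def validate_field(values, constraint):
    """Return a list of error messages for one field's values."""
    errors = []
    count = len(values)
    # Cardinality: an absent bound means "no limit".
    if count < constraint.get("min_cardinality", 0):
        errors.append("too few values")
    max_card = constraint.get("max_cardinality")
    if max_card is not None and count > max_card:
        errors.append("too many values")
    kind = constraint.get("type")
    value_range = constraint.get("range")
    for value in values:
        if kind == "number":
            try:
                number = float(value)
            except ValueError:
                errors.append(f"{value!r} is not a number")
                continue
            # Range for numbers: an inclusive (min, max) pair.
            if value_range and not (value_range[0] <= number <= value_range[1]):
                errors.append(f"{value!r} is out of range")
        elif kind == "string" and value_range:
            # Range for strings: a regular expression the whole value must match.
            if not re.fullmatch(value_range, value):
                errors.append(f"{value!r} does not match pattern")
    return errors
```

An empty constraint accepts anything, which mirrors the rule that undefined fields are unrestricted. Checking resource-typed fields would additionally need access to the target resource's content type, which this sketch leaves out.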

Note that fields that are not defined at least by a label may be hard for users browsing the discovery interface to understand.

Site generation

Pocket Archive can generate HTML pages and all the related assets to run a complete static website. The advantages of a static website over a dynamic one are that it is much simpler and more economical to set up and run, and that, having no server-side application code, it offers a much smaller attack surface.

The entire site must be regenerated every time resources are created or updated. This is usually very fast, but on large archives it can take a while. This is the downside of static websites: they are static.
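The full-regeneration model can be sketched in a few lines. This is only an illustration of the approach, not the project's generator: the function name `generate_site`, the flat mapping of ID to title, and the bare HTML are assumptions, and IDs are assumed to be safe to use as file names.

```python
import html
from pathlib import Path


def generate_site(resources, out_dir):
    """Write one HTML page per resource plus an index linking to all of them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for res_id, title in sorted(resources.items()):
        safe_title = html.escape(title)  # never emit raw metadata into HTML
        page = f"<html><body><h1>{safe_title}</h1></body></html>"
        (out / f"{res_id}.html").write_text(page, encoding="utf-8")
        links.append(f'<li><a href="{res_id}.html">{safe_title}</a></li>')
    index = "<html><body><ul>" + "".join(links) + "</ul></body></html>"
    (out / "index.html").write_text(index, encoding="utf-8")
    return len(links) + 1  # number of files written
```

Every call rewrites every page, so the cost grows with the size of the archive, which is the trade-off described above.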

Status

ALPHA. Pocket Archive is a very recent project, in fast development. Its foundational library, Volksdata, has been developed as a spare-time project for 6 years and has just entered beta status.

Road map

Simple road map for a rough prototype:

⚒ Configuration + config parser
- ✓ Application
- ⚒ Content model
- ⎊ Validation rules
- ⎊ Relationship inference rules
- Local overrides
⚒ Submission module
- ✓ SIP building
- ✓ Metadata from LL
- ✓ Structure inference
- Relationship inference
⚒ HTML generator
- ✓ Index
- ✓ Resource
- ✓ Static assets
- ⚒ Transformers
⎊ Non-HTML generators
- LL
- RDF (turtle)
⎊ Front end
- JS search engine
