|
@@ -0,0 +1,269 @@
|
|
|
|
+# Pocket Archive
|
|
|
|
+
|
|
|
|
+## The idea
|
|
|
|
+
|
|
|
|
+Stick it in your pocket and carry it around. Install it on a cloud server.
|
|
|
|
+Install it on a Raspberry Pi. Browse it offline. Browse it online. Duplicate
|
|
|
|
+it, share it, harvest it and aggregate it. Feed it non-GMO spreadsheets
|
|
|
|
+regularly and it will thrive.
|
|
|
|
+
|
|
|
|
+## A more sensical description
|
|
|
|
+
|
|
|
|
+Pocket Archive is a digital archival system and static site generator for
|
|
|
|
+small- to medium-(?) sized archives. It is designed to function in environments
|
|
|
|
+with unreliable connectivity and requires very low technical and human
|
|
|
|
+resources to set up, run, and use.
|
|
|
|
+
|
|
|
|
+Pocket Archive fulfills the following functions:
|
|
|
|
+
|
|
|
|
+- Storage and management of files and metadata-only resources
|
|
|
|
+- Management of descriptive, administrative, and technical metadata
|
|
|
|
+- Dynamic relationships between resources
|
|
|
|
+- Static site generation (discovery interface)
|
|
|
|
+
|
|
|
|
+In spite of its design simplicity, Pocket Archive strives to be highly
|
|
|
|
+flexible. It is based on [Volksdata
|
|
|
|
+](https://git.knowledgetx.com/scossu/volksdata), a very compact Linked Data
|
|
|
|
+store written in C. There is no restriction on the types and schemas of metadata
|
|
|
|
+allowed, or the file types supported. A file-based configuration makes it possible to set
|
|
|
|
+up content types and validation rules, or to have (almost) no rules at all.
|
|
|
|
+
|
|
|
|
+## Why
|
|
|
|
+
|
|
|
|
+Several years ago, the author of this project believed that he should work in
|
|
|
|
+larger and larger institutions, with larger and larger data sets. One day, he
|
|
|
|
+came across a [project](https://zenodo.org/records/8111569) that changed his
|
|
|
|
+perspective.
|
|
|
|
+
|
|
|
|
+"From a standpoint of preserving human cultural heritage at large, does it make
|
|
|
|
+more sense to design very large repositories for very rich institutions, with a
|
|
|
|
+lot of layers of safety but also a lot of bureaucracy and redundancy, or
|
|
|
|
+contribute to many decentralized projects that are highly efficient, small,
|
|
|
|
+representing peripheral cultures, and most importantly, that are at much higher
|
|
|
|
+risk of loss than the large institutions?"
|
|
|
|
+
|
|
|
|
+Both: this software has been conceived with the experience of large-scale
|
|
|
|
+repositories as the background to decide what works and what doesn't, what is
|
|
|
|
+necessary and what is superfluous, and what catalogers and archivists need to
|
|
|
|
+do their job.
|
|
|
|
+
|
|
|
|
+## Basic concepts
|
|
|
|
+
|
|
|
|
+Until a proper reference is written, this should serve as high-level
|
|
|
|
+documentation to help evaluate the functionality and for the author to stay on
|
|
|
|
+track. Some of these ideas have been ripped right off my day job, so there is
|
|
|
|
+a good chance they work.
|
|
|
|
+
|
|
|
|
+### General philosophy
|
|
|
|
+
|
|
|
|
+The functional goals of Pocket Archive are simplicity and flexibility, from
|
|
|
|
+both a user's and a maintainer's perspectives. These two properties are usually
|
|
|
|
+seen as conflicting, but within reason, they can coexist.
|
|
|
|
+
|
|
|
|
+Pocket Archive is built upon a minimalistic framework: C and Lua, with very few
|
|
|
|
+dependencies. As with these foundational elements, it strives to offer few
|
|
|
|
+tools that can be combined in a multitude of ways to achieve many goals, rather
|
|
|
|
+than many tools each doing a specific thing.
|
|
|
|
+
|
|
|
|
+### Resource
|
|
|
|
+
|
|
|
|
+The Linked Data adage goes, "everything is a Resource". Without confusing users
|
|
|
|
+too much by taking the concept to the Linked Data extremes, the term *resource*
|
|
|
|
+is used in this project to describe individual, self-contained units of
|
|
|
|
+information such as:
|
|
|
|
+
|
|
|
|
+- Files;
|
|
|
|
+- Intellectual or physical artifacts (artworks, documents, books, etc.);
|
|
|
|
+- Structural elements inside or around an entity, such as the order of pages in
|
|
|
|
+ a book, the two sides of a postcard, a collection of artifacts, etc.
|
|
|
|
+
|
|
|
|
+Files are called *opaque resources*. They are viewed by Pocket Archive as
|
|
|
|
+"opaque" in that the system doesn't care about their contents. It only ensures
|
|
|
|
+that files are stored as they were submitted, and keeps checksums to guard
|
|
|
|
+against data corruption.
|
|
|
|
+
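As a rough illustration of the checksum idea (not Pocket Archive's actual code, which is built on C and Lua), the Python sketch below computes a digest when a file is stored and compares it again later. The SHA-256 algorithm and the function names `file_checksum` and `is_intact` are assumptions made for this example; this document does not say which algorithm the system uses.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as stream:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded: str, algorithm: str = "sha256") -> bool:
    """Compare the file's current checksum with the one recorded at submission."""
    return file_checksum(path, algorithm) == recorded
```

The digest would be recorded once at submission time and stored with the file's descriptive resource; any later mismatch signals that the stored bytes have changed.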
|
|
|
|
+All other entities are called *descriptive resources*. These are effectively
|
|
|
|
+Linked Data, which can be queried and searched for. Each file also has its own
|
|
|
|
+descriptive resource, so that it can be classified, discovered, and described.
|
|
|
|
+
|
|
|
|
+### Submission
|
|
|
|
+
|
|
|
|
+A Pocket Archive repository is populated via *submissions*. A submission is
|
|
|
|
+performed by telling the archive to pick up some files from a folder it can
|
|
|
|
+access, push them into storage, add metadata to them, and index them so that
|
|
|
|
+they can be found later.
|
|
|
|
+
|
|
|
|
+A submission is directed by a *laundry list*, which is a spreadsheet listing
|
|
|
|
+all the resources (both opaque and descriptive) to be created, and the metadata
|
|
|
|
+assigned to them. The laundry list, formatted as a CSV (comma-separated values)
|
|
|
|
+file, can be edited by several free and open source applications, such as
|
|
|
|
+LibreOffice. For repetitive, high-volume submissions, templates can be set to
|
|
|
|
+facilitate filling in metadata fields. An [example submission
|
|
|
|
+](test/sample_submission/postcard-bag/data/), which includes a laundry list, is
|
|
|
|
+available.
|
|
|
|
+
|
|
|
|
+Detailed instructions on how to write a laundry list shall be added later. For
|
|
|
|
+now, the following are the basic guidelines to build a submission package:
|
|
|
|
+
|
|
|
|
+- Resources are arranged in files and folders on a local filesystem that Pocket
|
|
|
|
+ Archive can access.
|
|
|
|
+- File and folder arrangement is important. A folder represents a descriptive
|
|
|
|
+ resource, and can have metadata attached to it. A file or folder under a
|
|
|
|
+ parent folder is automatically added as a *child* of the parent resource.
|
|
|
|
+ This relationship is intended to present the parent as a container of other
|
|
|
|
+ sub-resources (descriptive and/or opaque). With this method, hierarchies of
|
|
|
|
+ any complexity can be built.
|
|
|
|
+- File and folder order in the submission folder is *not* important. No need to
|
|
|
|
+ rename files and folders to force a specific ordering. Ordering is specified via
|
|
|
|
+ the laundry list instead. See below.
|
|
|
|
+- The laundry list file is placed under the submission package folder and must
|
|
|
|
+ be named `pkar_submission.csv`.
|
|
|
|
+
|
|
|
|
+A laundry list is formatted as follows:
|
|
|
|
+
|
|
|
|
+- The first row is reserved for the headers, which indicate the field names.
|
|
|
|
+- Each subsequent row represents a resource (except in a multi-value case,
|
|
|
|
+ described below). The `pas:sourcePath` and `pas:contentType` fields are
|
|
|
|
+ mandatory for each resource. All other fields are optional for the
|
|
|
|
+ submission; however, some type definitions may have constraints in this
|
|
|
|
+ regard.
|
|
|
|
+- All field names, except for `id`, have a namespace prefix among the ones
|
|
|
|
+ defined in the configuration. See the dedicated section for details about
|
|
|
|
+ namespaces.
|
|
|
|
+- Fields with a special meaning:
|
|
|
|
+ - `id`: optional and single-valued. If provided, it becomes the primary
|
|
|
|
+ identifier for the resource, which is used anywhere information about the
|
|
|
|
+ resource is retrieved. The depositor is responsible for ensuring that
|
|
|
|
+ the provided ID is unique across the system. If left blank, the system
|
|
|
|
+ generates an identifier that is guaranteed to be unique.
|
|
|
|
+ - `pas:sourcePath`: mandatory and single-valued. It refers to the file or
|
|
|
|
+ folder path relative to the package.
|
|
|
|
+ - `pas:contentType`: mandatory and single-valued. It defines the content
|
|
|
|
+ type assigned to the resource. For files, it should be `pas:File` or a
|
|
|
|
+ sub-type thereof. For folders it must not be a `pas:File` or sub-type.
|
|
|
|
+- To provide multiple values for one or more fields, additional values are
|
|
|
|
+ added to rows below the previous. For these additional rows, the `pas:sourcePath`
|
|
|
|
+ field **must not** be filled, and additional values for single-valued fields
|
|
|
|
+ are ignored.
|
|
|
|
+- The ordering of the rows determines the ordering of the resources in their
|
|
|
|
+ container. The system automatically assigns an order to the resources, using
|
|
|
|
+ their source path and their position in the laundry list. Resources at the
|
|
|
|
+ top are not assigned an order, as they are considered self-standing. If an
|
|
|
|
+ order is needed for those, the `pas:next` field can be set to the desired
|
|
|
|
+ resource (see point below about relationships), or they can be put in an
|
|
|
|
+ enclosing folder that acts as a collection.
|
|
|
|
+- Relationships can be established between resources. These are stored as
|
|
|
|
+ persistent links and appear as hyperlinks in the discovery interface. A
|
|
|
|
+ relationship can only be set for a field that is configured as "resource"
|
|
|
|
+ type. To set a relationship with a resource in the
|
|
|
|
+ same laundry list that doesn't have an explicit ID set, insert the source
|
|
|
|
+ path of the resource. For a resource that already has an ID, either by being
|
|
|
|
+ assigned one manually or by being already deposited, insert the full ID
|
|
|
|
+ including the `par:` namespace (e.g. for ID `12345`, insert `par:12345`).
|
|
|
|
+
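The formatting rules above can be sketched in Python as follows. This is only an illustration of how rows might be grouped into resources, not the parser Pocket Archive ships. The `dc:title` and `dc:subject` columns, the `pas:Artifact` type name, and the sample paths are hypothetical; only `id`, `pas:sourcePath`, `pas:contentType`, and `pas:File` come from the rules themselves.

```python
import csv
import io

# Fields described above as single-valued.
SINGLE_VALUED = {"id", "pas:sourcePath", "pas:contentType"}

# Hypothetical laundry list: a postcard folder with two files.
SAMPLE = """\
id,pas:sourcePath,pas:contentType,dc:title,dc:subject
,postcard,pas:Artifact,Greetings from the lake,Lakes
,,,,Tourism
front,postcard/front.jpg,pas:File,Front side,
,postcard/back.jpg,pas:File,Back side,
"""


def read_laundry_list(stream):
    """Group laundry list rows into one record per resource, in row order."""
    resources = []
    for row in csv.DictReader(stream):
        # Keep only the cells that are actually filled in.
        values = {
            k: v.strip()
            for k, v in row.items()
            if k is not None and v and v.strip()
        }
        if "pas:sourcePath" in values:
            # A filled source path starts a new resource.
            resources.append({k: [v] for k, v in values.items()})
        elif resources:
            # Continuation row: extra values for the resource above it.
            current = resources[-1]
            for field, value in values.items():
                if field in SINGLE_VALUED:
                    continue  # extra values for single-valued fields are ignored
                current.setdefault(field, []).append(value)
    return resources


resources = read_laundry_list(io.StringIO(SAMPLE))
for res in resources:
    print(res["pas:sourcePath"][0], res.get("dc:subject", []))
```

Because the list preserves row order, the position of each record can later be used to derive the ordering of resources within their container.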
|
|
|
|
+### Update
|
|
|
|
+
|
|
|
|
+A submission is also used to update existing resources. Each resource update is
|
|
|
|
+a full replacement of all the resource's metadata, so a submission must include
|
|
|
|
+a full representation of each of the resources updated.
|
|
|
|
+
|
|
|
|
+To facilitate this task while avoiding the need to hold on to all of the
|
|
|
|
+archive's laundry lists, Pocket Archive can generate a laundry list for one or
|
|
|
|
+more selected resources. This list, which represents the current state of the
|
|
|
|
+resources requested, can be edited and submitted for an update. This method is
|
|
|
|
+much faster and more intuitive than clicking around an alien user interface filled
|
|
|
|
+with icons and terms that one has never seen before.
|
|
|
|
+
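A minimal sketch of what such an export could look like, assuming each resource is held in memory as a mapping from field name to a list of values. That record shape, the function name `write_laundry_list`, and the `dc:subject` field are assumptions for illustration; the actual generator may work differently.

```python
import csv
import io


def write_laundry_list(resources, stream):
    """Write resources (field name -> list of values) as laundry list rows."""
    # Collect column names in first-seen order across all resources.
    fields = []
    for res in resources:
        for name in res:
            if name not in fields:
                fields.append(name)

    writer = csv.DictWriter(stream, fieldnames=fields, restval="")
    writer.writeheader()
    for res in resources:
        # One row per value index: the first row carries the single-valued
        # fields, later rows carry only the additional values.
        depth = max(len(values) for values in res.values())
        for i in range(depth):
            row = {name: values[i] for name, values in res.items() if i < len(values)}
            writer.writerow(row)


# Hypothetical resource as it might be read back from the archive.
current_state = [
    {
        "id": ["12345"],
        "pas:sourcePath": ["postcard"],
        "pas:contentType": ["pas:Artifact"],
        "dc:subject": ["Lakes", "Tourism"],
    }
]
out = io.StringIO()
write_laundry_list(current_state, out)
print(out.getvalue())
```

Since an update replaces all of a resource's metadata, the exported rows need to carry every current value, not just the ones the editor intends to change.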
|
|
|
|
+### Metadata & content model
|
|
|
|
+
|
|
|
|
+**Note:** The scope of this functional area is currently under review. Things
|
|
|
|
+may change.
|
|
|
|
+
|
|
|
|
+Metadata are (yes, it's a *plural* noun) controlled by a *content model*, which
|
|
|
|
+in this project is intended as the entirety of definitions of content types
|
|
|
|
+recognized by the system, and how they relate to one another. Each *type
|
|
|
|
+definition* is encoded in a configuration file defining a single content
|
|
|
|
+category type. This configuration is specific to each individual Pocket Archive
|
|
|
|
+installation, which can use the baseline one provided by default, or extend it
|
|
|
|
+via additional configurations. Please look at the [default model
|
|
|
|
+configuration](config/model/typedef) files that come with Pocket Archive.
|
|
|
|
+
|
|
|
|
+One doesn't have to define all possible types in detail. Pocket Archive
|
|
|
|
+provides some basic types, e.g.: `Anything` (the super-class of them all),
|
|
|
|
+`Artifact`, `File`, `Part`, which can be used in a very basic installation and
|
|
|
|
+should not be radically altered, because some basic functionality of the system
|
|
|
|
+relies on them. To add more specific definitions, *subtypes* can be defined. A
|
|
|
|
+subtype inherits all the property definitions of its broader model, and adds
|
|
|
|
+more specific behavior. An example classification could be: Anything -> File ->
|
|
|
|
+Image File -> Scientific Image. Each of the sub-types would only define the
|
|
|
|
+special properties of that definition, which add to, or replace, the properties
|
|
|
|
+of its broader definitions.
|
|
|
|
+
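The inheritance described above can be pictured with a small Python sketch. The dictionary layout, the `broader` key, and the property names are all hypothetical and do not reflect the real configuration file format; the point is only that properties are collected from the most general type down, so that a narrower type can add to or replace what it inherits.

```python
# Hypothetical type definitions: each one names its broader type and lists
# only the properties it adds or replaces.
TYPEDEFS = {
    "Anything": {
        "broader": None,
        "properties": {"label": {"type": "string"}},
    },
    "File": {
        "broader": "Anything",
        "properties": {"size": {"type": "number"}},
    },
    "ImageFile": {
        "broader": "File",
        "properties": {"width": {"type": "number"}, "height": {"type": "number"}},
    },
    "ScientificImage": {
        "broader": "ImageFile",
        "properties": {
            # Replaces the inherited definition of "label".
            "label": {"type": "string", "min_count": 1},
            "instrument": {"type": "string"},
        },
    },
}


def resolve_properties(type_name, typedefs=TYPEDEFS):
    """Merge property definitions from the root type down to type_name."""
    chain = []
    current = type_name
    while current is not None:
        chain.append(current)
        current = typedefs[current]["broader"]

    resolved = {}
    # Apply the broadest type first so narrower types override it.
    for name in reversed(chain):
        resolved.update(typedefs[name]["properties"])
    return resolved


print(sorted(resolve_properties("ScientificImage")))
```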
|
|
|
|
+All resources in Pocket Archive must be assigned a content type. If someone has
|
|
|
|
+to deal with a resource that doesn't fit in any of the predefined content
|
|
|
|
+models, they can assign it the most specific type that they can. At worst, they
|
|
|
|
+can put it under Anything. Of course, if one starts dealing with many
|
|
|
|
+unclassifiable resources that look similar, it's probably best to define a
|
|
|
|
+model for them; but that is not mandatory.
|
|
|
|
+
|
|
|
|
+Each metadata field can be restricted by constraints. These constraints can be
|
|
|
|
+on:
|
|
|
|
+
|
|
|
|
+- Type: the data type for the field, e.g. string, number, resource
|
|
|
|
+ (relationship), etc.
|
|
|
|
+- Cardinality: how many values can be set for a field, for each resource. These
|
|
|
|
+ values can be adjusted to set mandatory fields, single-valued fields, etc.
|
|
|
|
+- Range: the range of values allowed. How this is interpreted depends on the
|
|
|
|
+ data type: for a number it can be a min/max range, for a string a regular
|
|
|
|
+ expression pattern, for a resource the type(s) of the resources pointed to,
|
|
|
|
+ etc.
|
|
|
|
+
|
|
|
|
+All of these constraints are optional. Fields that are not defined may accept
|
|
|
|
+any number of values, and are optional. So it's up to the repository manager
|
|
|
|
+to decide how specific or how free-form their archive should be.
|
|
|
|
+
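To make the three kinds of constraint concrete, here is a hedged Python sketch of a field validator. The constraint keys (`type`, `min_count`, `max_count`, `range`) and the function name `validate_field` are invented for this example and are not the project's configuration syntax. Range checks on resource-typed fields are left out because they would require looking up the target resource's type.

```python
import re


def validate_field(values, constraint=None):
    """Return a list of error messages for one field's values (empty if valid)."""
    if constraint is None:
        return []  # undefined fields accept any number of values

    errors = []

    # Cardinality: how many values the field may hold.
    count = len(values)
    min_count = constraint.get("min_count", 0)
    max_count = constraint.get("max_count")
    if count < min_count:
        errors.append(f"expected at least {min_count} value(s), got {count}")
    if max_count is not None and count > max_count:
        errors.append(f"expected at most {max_count} value(s), got {count}")

    kind = constraint.get("type")
    value_range = constraint.get("range")
    for value in values:
        if kind == "number":
            # Type: the value must parse as a number.
            try:
                number = float(value)
            except ValueError:
                errors.append(f"{value!r} is not a number")
                continue
            # Range: for numbers, an inclusive (min, max) pair.
            if value_range is not None:
                low, high = value_range
                if not (low <= number <= high):
                    errors.append(f"{value!r} is outside {low}..{high}")
        elif kind == "string" and value_range is not None:
            # Range: for strings, a regular expression the whole value must match.
            if re.fullmatch(value_range, value) is None:
                errors.append(f"{value!r} does not match {value_range!r}")
    return errors
```

For instance, `validate_field([], {"min_count": 1})` reports a missing mandatory value, while `validate_field(["a", "b"])` returns an empty list because no constraint is defined for the field.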
|
|
|
|
+Note that fields that are not defined at least by a label may be hard to
|
|
|
|
+understand by users browsing the discovery interface.
|
|
|
|
+
|
|
|
|
+### Site generation
|
|
|
|
+
|
|
|
|
+Pocket Archive can generate HTML pages and all the related assets to
|
|
|
|
+run a complete static website. The advantages of a static website over a
|
|
|
|
+dynamic one are that it's much simpler and more economical to set up and run, and
|
|
|
|
+it's far less exposed to malicious attacks.
|
|
|
|
+
|
|
|
|
+The entire site must be generated every time resources are created or updated.
|
|
|
|
+This is usually very fast, but on large archives it can take a while. This is
|
|
|
|
+the downside of static websites: they are static.
|
|
|
|
+
|
|
|
|
+## Status
|
|
|
|
+
|
|
|
|
+**ALPHA**. Pocket Archive is a very recent project, in fast development. Its
|
|
|
|
+foundational library, Volksdata, has been developed as a spare-time project for
|
|
|
|
+6 years and has just entered beta status.
|
|
|
|
+
|
|
|
|
+### Road map
|
|
|
|
+
|
|
|
|
+Simple road map for a rough prototype:
|
|
|
|
+
|
|
|
|
+- ⚒ Configuration + config parser
|
|
|
|
+ - ✓ Application
|
|
|
|
+ - ⚒ Content model
|
|
|
|
+ - ⎊ Validation rules
|
|
|
|
+ - ⎊ Relationship inference rules
|
|
|
|
+ - Local overrides
|
|
|
|
+- ⚒ Submission module
|
|
|
|
+ - ✓ SIP building
|
|
|
|
+ - ✓ Metadata from LL
|
|
|
|
+ - ✓ Structure inference
|
|
|
|
+ - Relationships inference
|
|
|
|
+- ⚒ HTML generator
|
|
|
|
+ - ✓ Index
|
|
|
|
+ - ✓ Resource
|
|
|
|
+ - ✓ Static assets
|
|
|
|
+ - ⚒ Transformers
|
|
|
|
+- ⎊ Non-HTML generators
|
|
|
|
+ - LL
|
|
|
|
+ - RDF (turtle)
|
|
|
|
+- ⎊ Front end
|
|
|
|
+ - JS search engine
|