# Pocket Archive submission guide

*Audience: archivists, system administrators, developers*

**WORK IN PROGRESS**

Terms appearing in **bold** are referenced in the [glossary](./glossary.md).

## Archival process overview

Pocket Archive receives new contents, and updates to existing contents, via
**submissions**. A submission is an individual contribution to the archive that
can add, update, or delete **resources**, or any combination of these. A
submission may include multiple resources, which may be related to one another
but do not have to be.

![The full cycle of operations for a given resource in Pocket
Archive.](./pkar_res_lifecycle.png)

1. Archivist selects and lays out resources to be archived on their own
   workstation.
2. Archivist creates a **laundry list** that includes an inventory of the
   resources and their metadata. This, together with the files and folders
   previously prepared, constitutes the **SIP**.
3. Archivist transfers the SIP to the **Drop box**: first the files and
   folders, then the laundry list.
4. Upon receipt of the laundry list, Pocket Archive processes the incoming
   materials and archives them.
5. Pocket Archive generates a report after the process is complete (regardless
   of whether it was successful or failed).
6. Depending on setup, Pocket Archive may delete the SIP from the Drop box if
   the submission succeeded.
7. Depending on setup, Pocket Archive may (re-)generate the static site.
8. If the archivist wants to update the archived resources, they can request
   either a full copy of the SIP or, to update metadata only, just the laundry
   list, then edit it and re-submit it.

Processing of the SIP (point 4 above) either succeeds or fails as a unit. This
means that a submission will never perform only a part of the task that it is
meant to do. This is called an **atomic** operation and it is designed to
ensure consistency of the data.
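
The transfer order in step 3 matters because the arrival of the laundry list is what triggers processing. The following is a minimal sketch of that order, assuming the Drop box is reachable as a local or mounted folder; `transfer_sip` and its parameters are hypothetical names used for illustration, and a real deployment may use a different transfer mechanism.

```python
import shutil
from pathlib import Path


def transfer_sip(data_dir: Path, laundry_list: Path, drop_box: Path) -> Path:
    """Copy the data folder first, then the laundry list, into the drop box."""
    drop_box.mkdir(parents=True, exist_ok=True)
    # Step 1: files and folders. dirs_exist_ok lets a retry reuse a folder
    # left behind by an interrupted copy.
    shutil.copytree(data_dir, drop_box / data_dir.name, dirs_exist_ok=True)
    # Step 2: the laundry list goes last, because its receipt starts processing.
    return Path(shutil.copy2(laundry_list, drop_box / laundry_list.name))
```

Copying the laundry list last means that processing cannot start while data files are still in transit.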

Individual steps are described in detail in the following chapters.

## Submission Information Package structure

A submission is performed by preparing a Submission Information Package, or
**SIP**, and sending it to Pocket Archive for processing. The SIP consists of
data, i.e. files optionally arranged in a curator-defined folder hierarchy, and
metadata, gathered in a single file called a laundry list.

A [working SIP example](../test/sample_submission) including files and a
laundry list, used for testing, is available as a quick reference. Other
examples are illustrated further down in this document.

As the above life cycle chart shows, the SIP is a disposable artifact. Once it
is successfully archived, it can be deleted. The full SIP can be regenerated by
the archive and retrieved at a later time.

The original files in the archivist's workstation can be optionally kept and/or
copied to local storage. This is strongly recommended, at least until Pocket
Archive reaches a stable status and can be exclusively relied on for long-term
preservation. More copies means more chances to recover data from corruption or
loss, but it also means higher storage costs.

### Source file & folder layout

Preparation of the SIP begins with selecting the materials to submit.
Generally, it is good practice to select a group of **artifacts** more or less
related to one another, e.g. a small coherent **collection**, or a day's work
within a large collection that may take long to complete. It is not critical
to get this part perfectly right, as more can be added to the archive at a
later time. It is more important to keep submissions neither too large, as a
single malformed element can cause the whole submission to fail, nor too small,
to avoid a confusing number of iterations. Submissions of tens to hundreds of
files are in a fairly safe range.

The arrangement of files and folders is important; the ordering of elements
in a folder is less so. A file or sub-folder inside a parent folder creates a
membership relationship between the two. For example, one can create the
following structure:

```
my_collection
  |
  `- artifact1
  |   |
  |   `- file1.tiff
  |   |
  |   `- file2.pdf
  |
  `- artifact2
      |
      `- file3.mpg
```

This creates a collection, `my_collection`, with two members, `artifact1` and
`artifact2`, the former containing `file1.tiff` and `file2.pdf`, and the latter
containing `file3.mpg`.

Ordering of the files or folders in a SIP is defined in the laundry list, as we
will see further down, so using file names to force a certain order is not
necessary (however, it can provide a good starting point for large lists of
files or folders under a parent).

Some file and folder structure will also be used in future versions of Pocket
Archive to create more metadata, but at the moment this is not implemented.

Empty folders can be created and submitted: they can be used as placeholders
for resources that have no files directly related. But the same effect can be
obtained by other means with the laundry list.

### Laundry list

Once the selection of files to be included in the SIP is complete, a laundry
list is compiled. As the name suggests, this is an inventory of all the
resources that go into the submission; but it provides much more information
than that, by defining metadata and relationships between resources.

The laundry list is a **CSV** file.

Laundry lists may be edited in any application that supports CSV reading and
writing. Care must be taken to export the file to CSV. In LibreOffice Calc, for
example, "Save" writes the file in `.ods` format by default, which is not
usable as a laundry list. The spreadsheet must instead be exported in `.csv`
format.

#### Multi-sheet documents

Many spreadsheet applications allow grouping multiple tables or sheets in one
file. CSV supports only one table per file. While some may find it convenient
to keep multiple laundry lists in one spreadsheet file, one must take care to
export each sheet individually as a CSV file.

#### Laundry list format

The first row of a laundry list is reserved for the header, which indicates the
field names.  These can be in any order, but following a specific order is
recommended. The order used in this document and in all laundry lists
automatically generated by Pocket Archive is: `content_type`, `id`,
`source_path`, and then all ordinary fields in alphabetical order.
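
As an illustration of the recommended column order, here is a small sketch using Python's standard `csv` module. The function names `order_header` and `write_laundry_list` are hypothetical and are not part of Pocket Archive; the sketch assumes each row is a dictionary keyed by field name.

```python
import csv
import io

SPECIAL_FIELDS = ["content_type", "id", "source_path"]


def order_header(field_names):
    """Special fields first, then all other fields in alphabetical order."""
    ordinary = sorted(set(field_names) - set(SPECIAL_FIELDS))
    return SPECIAL_FIELDS + ordinary


def write_laundry_list(rows, out):
    """Write rows (dicts keyed by field name) as CSV with an ordered header."""
    fields = order_header({key for row in rows for key in row})
    # restval fills in empty cells for fields that a row does not define.
    writer = csv.DictWriter(out, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_laundry_list(
    [{"content_type": "still_image", "label": "My first deposited artifact"}],
    buffer,
)
```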

Each subsequent row represents a resource (except in a multi-value case,
described below). The `content_type` field is mandatory for each resource.

The `source_path` field is only mandatory for files. All other fields are
optional for the submission; however, some type definitions may require, or at
least strongly recommend, certain fields. This depends on the content model
used.

#### Fields with a special meaning

- `content_type`: mandatory, single-valued. It defines the content type
  assigned to the resource. For files, it must be `file` or a sub-type thereof,
  except for inferred resources (see below). For folders, it must not be
  `file` or a sub-type thereof. Consult the content model of your archive for a
  list of defined type names.
- `id`: optional, single-valued. If provided, it becomes the primary
  identifier for the resource, which is used anywhere information about the
  resource is retrieved. The IDs generated by default by Pocket Archive are
  16-character random strings containing only uppercase and lowercase letters
  and digits. The depositor is responsible for ensuring that the provided ID
  is unique across the system. If left blank, the system generates an
  identifier that is guaranteed to be unique. However, re-submitting the
  laundry list with the field still blank will create a duplicate
  resource.
- `source_path`: mandatory for files, single-valued. It refers to the
  file or folder path relative to the package, using forward slash `/`
  characters to separate folders and subfolders or files.
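
The `source_path` values for a package can be collected by walking its folder tree. The sketch below is one way to do that; `list_source_paths` is a hypothetical helper, and it assumes that the folder passed in is the root of the package, so that paths come out relative to it and separated by `/` on any operating system.

```python
from pathlib import Path


def list_source_paths(sip_root: Path) -> list:
    """Return every file and folder under sip_root as a '/'-separated relative path."""
    paths = [p.relative_to(sip_root).as_posix() for p in sip_root.rglob("*")]
    # Sorting by name is only a starting point: the final order is defined
    # by the rows of the laundry list.
    return sorted(paths)
```

The result can be pasted into the `source_path` column of a new laundry list and then reordered and completed by hand.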

Example of a table representing an artifact with two files:

<table>
  <thead>
    <tr>
      <th>content\_type</th>
      <th>id</th>
      <th>source\_path</th>
      <th>creation\_date</th>
      <th>label</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>still\_image</td>
      <td>Sg9hYIISjRjlkP62</td>
      <td>my\_collection/artifact1</td>
      <td>2002</td>
      <td>My first deposited artifact</td>
    </tr>
    <tr>
      <td>still\_image\_file</td>
      <td>7hic19YTXA8Fudxo</td>
      <td>my\_collection/artifact1/file1.tiff</td>
      <td>2025</td>
      <td></td>
    </tr>
    <tr>
      <td>still\_image\_file</td>
      <td>Z509TdNhpTjPYDS4</td>
      <td>my\_collection/artifact1/file2.pdf</td>
      <td>2025</td>
      <td></td>
    </tr>
  </tbody>
</table>

Note the difference between the `still_image` and the `still_image_file`
resources. We will get back to it further down.

#### Multi-valued fields

Some fields may allow multiple values. To provide multiple values for one or
more fields, add each additional value in a new row directly below the
resource's first row. For these additional rows, the special fields
`content_type`, `id`, and `source_path` **must not** be filled.

Example of a table with a single resource with multi-valued fields:

<table>
  <thead>
    <tr>
      <th>content\_type</th>
      <th>id</th>
      <th>source\_path</th>
      <th>alt\_label</th>
      <th>description</th>
      <th>label</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>still\_image</td>
      <td>Sg9hYIISjRjlkP62</td>
      <td>my\_collection/artifact1</td>
      <td>An alternative label</td>
      <td>A description of the artifact goes here.</td>
      <td>This is the title and must have only one value.</td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td></td>
      <td>You can have as many as you like of these</td>
      <td>Another description goes here.</td>
      <td></td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td></td>
      <td>FREE alt labels! (as long as supplies last)</td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

The submission process checks whether the `content_type` cell is filled to
determine whether a row in the table is a continuation of the previous
one, adding multiple values. A row without `content_type` but with `id`
and/or `source_path` is considered an error.
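
The continuation rule can be sketched as follows. This is not the actual Pocket Archive parser, only an illustration of the rule as described above; `group_rows` is a hypothetical name, and the sketch assumes a well-formed CSV whose header includes the three special fields.

```python
import csv
import io


def group_rows(csv_text: str) -> list:
    """Fold continuation rows into the resource row that precedes them."""
    resources = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["content_type"]:
            # A filled content_type starts a new resource.
            resources.append({k: [v] if v else [] for k, v in row.items()})
            continue
        # Continuation row: id and source_path must be empty.
        if row["id"] or row["source_path"]:
            raise ValueError(
                f"line {reader.line_num}: id or source_path without content_type"
            )
        if not resources:
            raise ValueError(f"line {reader.line_num}: no resource row above")
        for key, value in row.items():
            if value:
                resources[-1][key].append(value)
    return resources
```

Each resource comes out as a dictionary mapping field names to lists of values, so single-valued and multi-valued fields are handled the same way.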

#### Ordering (sorting)

The ordering of rows in a laundry list determines the ordering of the resources
in their container. The system automatically assigns an order to the resources,
using their source path and their position in the laundry list. Resources at
the top level are not assigned an order, as they are considered self-standing.
If an order is needed for those, the `pas:next` **property** can be set to the
desired resource (see below about relationships), or they can be put in
an enclosing folder that acts as a collection.

#### Relationships

**Relationships** can be established between resources. These are stored as
persistent links and appear as hyperlinks in the discovery interface. A
relationship can only be set for a field that is configured as "resource" type.
Consult your **content model** to find which properties are relationships.

To set a relationship with a resource in the same laundry list that doesn't
have an explicit ID set, insert the source path of the resource. For a resource
that already has an ID, either because one was assigned manually or because
the resource was already deposited, insert the ID string.

Example table with implicit and explicit relationships, some path-based and
some ID-based:

<table>
  <thead>
    <tr>
      <th>content\_type</th>
      <th>id</th>
      <th>source\_path</th>
      <th>has\_member</th>
      <th>label</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>collection</td>
      <td>p9tXQGBb9iC7xEqm</td>
      <td>my\_collection-1</td>
      <td></td>
      <td>This collection has implicit members from the folder hierarchy.</td>
    </tr>
    <tr>
      <td>still\_image</td>
      <td>KHwYidw4R7xUAEMN</td>
      <td>my\_collection-2/image001</td>
      <td></td>
      <td>Resource with an explicit ID. The ID can be used in a reference.</td>
    </tr>
    <tr>
      <td>text</td>
      <td></td>
      <td>my\_collection-2/text0001</td>
      <td></td>
      <td>Resource without explicit ID. It can be referenced by source\_path.</td>
    </tr>
    <tr>
      <td>collection</td>
      <td>EUXRg9igmU9ouzVH</td>
      <td>my\_collection-2</td>
      <td>p9tXQGBb9iC7xEqm</td>
      <td>This collection has explicit member relationships.</td>
    </tr>
    <tr>
      <td></td>
      <td></td>
      <td></td>
      <td>my\_collection-2/text0001</td>
      <td></td>
    </tr>
  </tbody>
</table>

When the laundry list is processed for submission, the path-based references
are replaced with IDs, which are automatically generated where not provided.
Therefore, a laundry list generated from archived resources may look different
from the original one. The generated laundry list should be used for
re-submission.
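
The replacement of path-based references can be pictured with the sketch below. It is an illustration under stated assumptions, not the actual implementation: `resolve_references` is a hypothetical name, resources are assumed to be dictionaries with single-valued `id` and `source_path` and list-valued relationship fields, and `new_id` is any caller-supplied function that returns a fresh unique ID.

```python
from typing import Callable


def resolve_references(resources: list, link_fields: list,
                       new_id: Callable[[], str]) -> list:
    """Give every resource an ID, then swap source_path references for IDs."""
    path_to_id = {}
    # First pass: make sure every resource has an ID and index it by path.
    for res in resources:
        if not res.get("id"):
            res["id"] = new_id()
        if res.get("source_path"):
            path_to_id[res["source_path"]] = res["id"]
    # Second pass: values matching a known path become IDs; anything else
    # is assumed to be an ID already and is left untouched.
    for res in resources:
        for field in link_fields:
            if field in res:
                res[field] = [path_to_id.get(v, v) for v in res[field]]
    return resources
```

Two passes are used so that a reference can point to a resource listed further down in the same laundry list.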

#### Resource types and sub-types

This chapter is a very concise introduction to content modeling in Pocket
Archive, which is treated in detail in the [Content modeling guide
(WIP)](./content_model.md). It is strongly recommended to read that guide
before archiving resources in earnest.

The three main resource types found in a submission are:

- **Artifact**: the digital surrogate of a physical and/or intellectual object,
  such as a photograph, a video, a letter, a painting, etc. This resource
  contains data related to the subject, content, author, taxonomy, etc. of the
  intellectual work.
- File: a digital capture or document related to an artifact. The file is
  accompanied by a **metadata** resource, which is automatically generated from
  the metadata that the archivist enters in the laundry list. These metadata
  should be exclusively about the file itself, e.g. time of creation, file
  size, file type, etc., as well as how the file relates to the artifact (e.g.
  detail shot, documentation, transcript, 3/4 view, etc.) or other files. Some
  of these metadata are generated automatically by analyzing the file during
  the submission process. Information about the artifact itself goes
  exclusively on the artifact resource.
- **Brick**: a structural element used to build logical, hierarchical
  structures with multiple resources. Bricks can represent many things: the
  ordering of chapters and pages in a book, front and back sides of a postcard
  or a vinyl record, the ordering of artifacts and collections in a collection,
  etc. They have no intellectual content themselves, but they can reference an
  artifact or file. They are mostly automatically generated by the submission
  process, and are mostly hidden in the **presentation**, but they can be
  explicitly created in a laundry list to create specific structures.

These three key content types are seldom used as-is. They usually have
sub-types, which are defined in the content model. See the content modeling
guide for more information about sub-types.

Also see the [sample laundry
list](../test/sample_submission/pkar_submisson-demo.csv) for examples of
artifacts, files, and bricks making up a two-sided postcard.

### Submission ID and submission name

Each submission gets a randomly generated ID when it starts. This ID is
attached to all the resources in the submission. This makes it possible to
generate a laundry list that contains exactly the same resources that were
originally submitted (possibly with added auto-generated implicit resources).

The ID is automatically generated and system-controlled. It cannot be changed.
A submission can also have a name, which is optional and user-defined. The
submission name is determined by the file name used for the laundry list. E.g.
`pkar_submission-my_new_collection.csv` will use `my_new_collection`, i.e. the
text between `pkar_submission-` and `.csv`, as the submission name. Submission
names are not required to be unique, but they can include, for example, a date
to make them easier to search for.
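
The naming rule can be expressed in a few lines. The function below is a hypothetical helper written for illustration; returning `None` for file names that do not follow the pattern is an assumption of this sketch, since the behavior of Pocket Archive in that case is not described here.

```python
from pathlib import Path
from typing import Optional

PREFIX = "pkar_submission-"
SUFFIX = ".csv"


def submission_name(laundry_list_path: str) -> Optional[str]:
    """Return the text between the prefix and the suffix of the file name."""
    file_name = Path(laundry_list_path).name
    if file_name.startswith(PREFIX) and file_name.endswith(SUFFIX):
        # An empty name (file called exactly "pkar_submission-.csv") counts as none.
        return file_name[len(PREFIX):-len(SUFFIX)] or None
    return None
```

For example, `submission_name("pkar_submission-my_new_collection.csv")` returns `"my_new_collection"`.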

### Update

A submission is also used to update existing resources. Each resource update is
a full replacement of all the resource's metadata, so a submission must include
a full representation of each of the resources updated.

To facilitate this task while avoiding the need to hold on to all of the
archive's laundry lists, Pocket Archive can generate a laundry list for one or
more selected resources. This list, which represents the current state of the
resources requested, can be edited and submitted for an update. 
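
Because an update replaces all of a resource's metadata, a field left out of the re-submitted row is removed from the resource. A simple pre-flight check, sketched here with a hypothetical `dropped_fields` helper, compares the generated laundry list row with the edited one; both are assumed to be dictionaries keyed by field name.

```python
def dropped_fields(archived: dict, submitted: dict) -> list:
    """List fields that have a value in the archive but none in the update."""
    return sorted(
        key for key, value in archived.items()
        if value and not submitted.get(key)
    )


current = {"id": "Sg9hYIISjRjlkP62", "label": "Title", "description": "Text"}
edited = {"id": "Sg9hYIISjRjlkP62", "label": "New title"}
lost = dropped_fields(current, edited)  # fields that the update would remove
```

Here `lost` contains `description`, which signals that the update would delete the description unless it is added back to the edited row.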

### Shortcuts

Some shortcuts are available to facilitate the management of larger
submissions.

#### Implicit resources

Some implicit resources are created 

#### ID generation

As mentioned before, explicitly adding IDs in a laundry list simplifies later
editing and management. However, this is one of the most tedious parts of a
laundry list creation.

Fortunately, such repetitive and error-prone tasks can be easily automated
with tools provided by most spreadsheet applications. A macro (a mini-program
that runs in an application) for LibreOffice Calc is [provided
here](../src/util/libreoffice_idgen.bas) to automatically generate 16-character
IDs for all the cells selected in a table.
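
For those who do not use LibreOffice, a comparable result can be obtained with a short script. The sketch below is not the provided macro and is not part of Pocket Archive; `new_id` and `fill_blank_ids` are hypothetical names. It assumes the laundry list has an `id` column, fills blank IDs only on resource rows, and leaves continuation rows untouched. Random IDs make collisions very unlikely, but the depositor remains responsible for uniqueness across the system.

```python
import csv
import io
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def new_id(length: int = 16) -> str:
    """Random string of uppercase and lowercase letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def fill_blank_ids(csv_text: str) -> str:
    """Return the CSV text with an ID added to every resource row lacking one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Continuation rows (empty content_type) must keep an empty id.
        if row["content_type"] and not row["id"]:
            row["id"] = new_id()
        writer.writerow(row)
    return out.getvalue()
```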