# Pocket Archive submission guide

*Audience: archivists, system administrators, developers*

**WORK IN PROGRESS**

Terms appearing in **bold** are referenced in the [glossary](./glossary.md).

## Archival process overview

Pocket Archive receives new contents, and updates to existing contents, via **submissions**. A submission is an individual contribution to the archive that can add, update, or delete **resources**, or do any combination of these. A submission may include multiple resources, which can be related but do not have to be.

![The full cycle of operations for a given resource in Pocket Archive.](./pkar_res_lifecycle.png)

1. Archivist selects and lays out resources to be archived on their own workstation.
2. Archivist creates a **laundry list** that includes an inventory of the resources and their metadata. This, together with the files and folders previously prepared, constitutes the **SIP**.
3. Archivist transfers the SIP to the **Drop box**: first the files and folders, then the laundry list.
4. Upon receipt of the laundry list, Pocket Archive processes the incoming materials and archives them.
5. Pocket Archive generates a report after the process is complete (regardless of whether it succeeded or failed).
6. Depending on setup, Pocket Archive may delete the SIP from the Drop box if the submission succeeded.
7. Depending on setup, Pocket Archive may (re-)generate the static site.
8. If the archivist wants to update the archived resources, they can request either a full copy of the SIP or, to update only metadata, just the laundry list, then edit and re-submit it.

Processing of the SIP (point 4 above) either succeeds or fails as a unit. This means that a submission will never perform only a part of the task that it is meant to do. This is called an **atomic** operation and it is designed to ensure consistency of the data.

Individual steps are described in detail in the following chapters.

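
The transfer order in step 3 (files and folders first, laundry list last) can be sketched as follows. This is only an illustration under stated assumptions: it assumes that the Drop box is reachable as a local or mounted directory and that the laundry list belongs at its top level, neither of which is specified in this guide. The function name `transfer_sip` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def transfer_sip(sip_dir: Path, laundry_list: Path, drop_box: Path) -> Path:
    """Copy a SIP to the drop box: data first, laundry list last.

    Assumption: `drop_box` is a plain directory (e.g. a mounted share).
    """
    # 1. Copy files and folders, preserving the folder hierarchy.
    #    The laundry list is skipped here in case it is stored inside sip_dir.
    shutil.copytree(
        sip_dir,
        drop_box,
        dirs_exist_ok=True,
        ignore=shutil.ignore_patterns(laundry_list.name),
    )
    # 2. Copy the laundry list only once all the data are in place, since
    #    its arrival is what triggers processing (step 4).
    return Path(shutil.copy2(laundry_list, drop_box / laundry_list.name))
```

Sending the laundry list last matters because processing starts upon its receipt; if it arrived first, files referenced by it might not be in place yet.
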
## Submission Information Package structure

A submission is performed by preparing a Submission Information Package, or **SIP**, and sending it to Pocket Archive for processing. The SIP consists of data, i.e. files optionally arranged in a curator-defined folder hierarchy, and metadata, which are gathered in a single file called a laundry list.

A [working SIP example](../test/sample_submission) including files and a laundry list, used for testing, is available as a quick reference. Other examples are illustrated further down in this document.

As the above life cycle chart shows, the SIP is a disposable artifact. Once it is successfully archived, it can be deleted. The full SIP can be regenerated by the archive and retrieved at a later time.

The original files in the archivist's workstation can be optionally kept and/or copied to local storage. This is strongly recommended, at least until Pocket Archive reaches a stable status and can be exclusively relied on for long-term preservation. More copies mean more chances to recover data from corruption or loss, but also higher storage costs.

### Source file & folder layout

Preparation of the SIP begins with selecting the materials to submit. Generally, it is good practice to select a group of **artifacts** more or less related to one another, e.g. a small coherent **collection**, or a day's work within a large collection that may take long to complete. It is not critical to get this part perfectly right, as more can be added to the archive at a later time. It is more important to keep submissions neither too large, as a single malformed element can cause the whole submission to fail, nor too small, to avoid too many iterations that can become confusing. Submissions of tens to hundreds of files are in a safe range.

The arrangement of files and folders is important; the ordering of elements in a folder is less so.
A file or sub-folder inside a parent folder creates a membership relationship between the two, so that, e.g. one can create the following structure:

```
my_collection
|
`- artifact1
|  |
|  `- file1.tiff
|  |
|  `- file2.pdf
|
`- artifact2
   |
   `- file3.mpg
```

This creates a collection, `my_collection`, with two members, `artifact1` and `artifact2`, the former containing `file1.tiff` and `file2.pdf`, and the latter containing `file3.mpg`.

Ordering of the files or folders in a SIP is defined in the laundry list, as we will see further down, so using file names to force a certain order is not necessary (however, it can provide a good starting point for large lists of files or folders under a parent).

Some file and folder structure will also be used in future versions of Pocket Archive to create more metadata, but at the moment this is not implemented.

Empty folders can be created and submitted: they can be used as placeholders for resources that have no directly related files. The same effect can also be obtained by other means with the laundry list.

### Laundry list

Once the selection of files to be included in the SIP is complete, a laundry list is compiled. As the name suggests, this is an inventory of all the resources that go into the submission; but it provides much more information than that, by defining metadata and relationships between resources. The laundry list is a **CSV** file.

Laundry lists may be edited in any application that supports CSV reading and writing. Care must be taken to export the file to CSV. In LibreOffice, for example, "Save" writes the file in `.ods` format by default, which is not usable as a laundry list. The spreadsheet must instead be exported in `.csv` format.

#### Multi-sheet documents

Many spreadsheet applications allow grouping multiple tables or sheets in one file. CSV supports only one table per file.
While some may find it convenient to keep multiple laundry lists in one spreadsheet file, one must take care to export each sheet individually as a CSV file.

#### Laundry list format

The first row of a laundry list is reserved for the header, which indicates the field names. These can be in any order, but following a specific order is recommended. The order used in this document and in all laundry lists automatically generated by Pocket Archive is: `content_type`, `id`, `source_path`, and then all ordinary fields in alphabetical order.

Each subsequent row represents a resource (except in a multi-value case, described below).

The `content_type` field is mandatory for each resource. The `source_path` field is only mandatory for files. All other fields are optional for the submission; however, some type definitions may require them or at least strongly recommend them. This depends on the content model used.

#### Fields with a special meaning

- `content_type`: mandatory, single-valued. It defines the content type assigned to the resource. For files, it must be `file` or a sub-type thereof, except for inferred resources (see below). For folders, it must not be `file` or a sub-type thereof. Consult the content model of your archive for a list of defined type names.
- `id`: optional, single-valued. If provided, it becomes the primary identifier for the resource, which is used wherever information about the resource is retrieved. The IDs generated by default by Pocket Archive are 16-character random strings containing only uppercase and lowercase letters and digits. The depositor is responsible for ensuring that the provided ID is unique across the system. If left blank, the system generates an identifier that is guaranteed to be unique. However, re-submitting the laundry list with the same field still blank will create a duplicate resource.
- `source_path`: mandatory for files, single-valued.
  It refers to the file or folder path relative to the package, using forward slash `/` characters to separate folders and subfolders or files.

Example of a table representing an artifact with two files:


| content_type | id | source_path | creation_date | label |
|---|---|---|---|---|
| still_image | Sg9hYIISjRjlkP62 | my_collection/artifact1 | 2002 | My first deposited artifact |
| still_image_file | 7hic19YTXA8Fudxo | my_collection/artifact1/file1.tiff | 2025 | |
| still_image_file | Z509TdNhpTjPYDS4 | my_collection/artifact1/file2.pdf | 2025 | |

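
For archivists who prefer scripting to a spreadsheet, a table like the one above could be written with Python's standard `csv` module. This is a minimal sketch, not part of Pocket Archive: the function names are hypothetical, and it assumes that the default comma-separated dialect of the `csv` module is acceptable. The header follows the recommended order: special fields first, then ordinary fields alphabetically.

```python
import csv
import io

SPECIAL_FIELDS = ["content_type", "id", "source_path"]


def header_for(rows: list[dict[str, str]]) -> list[str]:
    """Special fields first, then all other fields in alphabetical order."""
    ordinary = sorted({key for row in rows for key in row} - set(SPECIAL_FIELDS))
    return SPECIAL_FIELDS + ordinary


def build_laundry_list(rows: list[dict[str, str]]) -> str:
    """Serialize resource rows to CSV text (assumed dialect: csv default)."""
    buffer = io.StringIO(newline="")
    # restval="" leaves a cell empty when a row does not define a field.
    writer = csv.DictWriter(buffer, fieldnames=header_for(rows), restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


EXAMPLE_ROWS = [
    {"content_type": "still_image", "id": "Sg9hYIISjRjlkP62",
     "source_path": "my_collection/artifact1",
     "creation_date": "2002", "label": "My first deposited artifact"},
    {"content_type": "still_image_file", "id": "7hic19YTXA8Fudxo",
     "source_path": "my_collection/artifact1/file1.tiff",
     "creation_date": "2025"},
    {"content_type": "still_image_file", "id": "Z509TdNhpTjPYDS4",
     "source_path": "my_collection/artifact1/file2.pdf",
     "creation_date": "2025"},
]

if __name__ == "__main__":
    print(build_laundry_list(EXAMPLE_ROWS))
```
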
Note the difference between the `still_image` and the `still_image_file` resources. We will get back to it further down.

#### Multi-valued fields

Some fields may allow multiple values. To provide multiple values for one or more fields, the additional values are added in the rows below the first one. For these additional rows, the special fields `content_type`, `id`, and `source_path` **must not** be filled.

Example of a table with a single resource with multi-valued fields:


| content_type | id | source_path | alt_label | description | label |
|---|---|---|---|---|---|
| still_image | Sg9hYIISjRjlkP62 | my_collection/artifact1 | An alternative label | A description of the artifact goes here. | This is the title and must have only one value. |
| | | | You can have as many as you like of these | Another description goes here. | |
| | | | FREE alt labels! (as long as supplies last) | | |

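
To make the continuation-row convention concrete, here is a rough sketch of how a script might group rows into resources. It is not Pocket Archive's actual parser; `group_rows` is a hypothetical name. It treats a row with an empty `content_type` as a continuation of the previous resource, and rejects a continuation row that fills `id` or `source_path`.

```python
import csv
import io


def group_rows(csv_text: str) -> list[dict[str, list[str]]]:
    """Group laundry list rows into resources with list-valued fields."""
    resources: list[dict[str, list[str]]] = []
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    for row in reader:
        if row.get("content_type"):
            # A filled content_type starts a new resource.
            resources.append({})
        elif row.get("id") or row.get("source_path"):
            raise ValueError(
                f"line {reader.line_num}: continuation row must not set "
                "id or source_path"
            )
        elif not resources:
            raise ValueError(
                f"line {reader.line_num}: continuation row before any resource"
            )
        # Append every non-empty cell to the current resource.
        for field, value in row.items():
            if field and value:
                resources[-1].setdefault(field, []).append(value)
    return resources
```

Applied to the table above, this would yield one resource whose `alt_label` holds three values, `description` two, and `label` one.
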
The submission process checks whether the `content_type` cell is filled to determine whether a row in the table is a continuation of the previous one, adding multiple values. Having a row without `content_type` and with `id` and/or `source_path` is considered an error.

#### Ordering (sorting)

The ordering of rows in a laundry list determines the ordering of the resources in their container. The system automatically assigns an order to the resources, using their source path and their position in the laundry list.

Resources at the top level of the hierarchy are not assigned an order, as they are considered self-standing. If an order is needed for those, the `pas:next` **property** can be set to the desired resource (see point below about relationships), or they can be put in an enclosing folder that acts as a collection.

**Relationships** can be established between resources. These are stored as persistent links and appear as hyperlinks in the discovery interface.

A relationship can only be set for a field that is configured as "resource" type. Consult your **content model** to find which properties are relationships.

To set a relationship with a resource in the same laundry list that doesn't have an explicit ID set, insert the source path of the resource. For a resource that already has an ID, either by being assigned one manually or by being already deposited, insert the ID string.

Example table with implicit and explicit relationships, some path-based and some ID-based:


| content_type | id | source_path | has_member | label |
|---|---|---|---|---|
| collection | p9tXQGBb9iC7xEqm | my_collection-1 | | This collection has implicit members from the folder hierarchy. |
| still_image | KHwYidw4R7xUAEMN | my_collection-2/image001 | | Resource with an explicit ID. The ID can be used in a reference. |
| text | | my_collection-2/text0001 | | Resource without explicit ID. It can be referenced by source_path. |
| collection | EUXRg9igmU9ouzVH | my_collection-2 | p9tXQGBb9iC7xEqm | This collection has explicit member relationships. |
| | | | my_collection-2/text0001 | |

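
The two kinds of reference in the table above (by ID and by source path) could be normalized to IDs along the following lines. This is an illustrative sketch only, not the submission process itself: `resolve_references`, `random_id`, and the `ref_fields` parameter are hypothetical, and resources are assumed to be dictionaries with list-valued relationship fields.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_id(length: int = 16) -> str:
    """Random ID of letters and digits (stand-in for the system's generator)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def resolve_references(resources, ref_fields=("has_member",), make_id=random_id):
    """Return copies of the resources with path-based references replaced by IDs."""
    resolved = [dict(res) for res in resources]
    path_to_id = {}
    for res in resolved:
        if not res.get("id"):
            # Assign an ID where the laundry list left the cell blank.
            res["id"] = make_id()
        if res.get("source_path"):
            path_to_id[res["source_path"]] = res["id"]
    for res in resolved:
        for field in ref_fields:
            if field in res:
                # Known source paths become IDs; anything else is kept as-is
                # (assumed to be the ID of an already deposited resource).
                res[field] = [path_to_id.get(ref, ref) for ref in res[field]]
    return resolved
```
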
When the laundry list is processed for submission, the path-based references are replaced with IDs, which are automatically generated where not provided. Therefore, a laundry list generated from archived resources may look different from the original one. The generated laundry list should be used for re-submission.

#### Resource types and sub-types

This chapter is a very concise introduction to content modeling in Pocket Archive, which is treated in detail in the [Content modeling guide (WIP)](./content_model.md). It is strongly recommended to read that guide before archiving resources in earnest.

The three main resource types found in a submission are:

- **Artifact**: the digital surrogate of a physical and/or intellectual object, such as a photograph, a video, a letter, a painting, etc. This resource contains data related to the subject, content, author, taxonomy, etc. of the intellectual work.
- File: a digital capture or document related to an artifact. The file is accompanied by a **metadata** resource, which is automatically generated from the metadata that the archivist enters in the laundry list. These metadata should be exclusively about the file itself, e.g. time of creation, file size, file type, etc., as well as how the file relates to the artifact (e.g. detail shot, documentation, transcript, 3/4 view, etc.) or to other files. Some of these metadata are generated automatically by analyzing the file during the submission process. Information about the artifact itself goes exclusively on the artifact resource.
- **Brick**: a structural element used to build logical, hierarchical structures with multiple resources. Bricks can represent many things: the ordering of chapters and pages in a book, front and back sides of a postcard or a vinyl record, the ordering of artifacts and collections in a collection, etc. They have no intellectual content themselves, but they can reference an artifact or file.
  They are mostly automatically generated by the submission process and are mostly hidden in the **presentation**, but they can be explicitly created in a laundry list to create specific structures.

These three key content types are seldom used as-is. They usually have sub-types, which are defined in the content model. See the content modeling guide for more information about sub-types. Also see the [sample laundry list](../test/sample_submission/pkar_submisson-demo.csv) for examples of artifacts, files, and bricks making up a two-sided postcard.

### Submission ID and submission name

Each submission gets a randomly generated ID when it starts. This ID is attached to all the resources in the submission. This makes it possible to generate a laundry list that contains exactly the same resources that were originally submitted (possibly with added auto-generated implicit resources). The ID is automatically generated and system-controlled. It cannot be changed.

A submission can also have a name, which is optional and user-defined. The submission name is determined by the file name used for the laundry list. E.g. `pkar_submission-my_new_collection.csv` will use `my_new_collection`, i.e. the text between `pkar_submission-` and `.csv`, as the submission name. Submission names are not required to be unique, but they can include, for example, a date to make them easier to find.

### Update

A submission is also used to update existing resources. Each resource update is a full replacement of all the resource's metadata, so a submission must include a full representation of each of the resources updated.

To facilitate this task while avoiding the need to hold on to all of the archive's laundry lists, Pocket Archive can generate a laundry list for one or more selected resources. This list, which represents the current state of the resources requested, can be edited and submitted for an update.

### Shortcuts

Some shortcuts are available to facilitate the management of larger submissions.

#### Implicit resources

Some implicit resources are created

#### ID generation

As mentioned before, explicitly adding IDs in a laundry list simplifies later editing and management. However, this is one of the most tedious parts of creating a laundry list. Fortunately, such repetitive and error-prone tasks can be easily automated with tools provided by most spreadsheet applications. A macro (a mini-program that runs in an application) for LibreOffice Calc is [provided here](../src/util/libreoffice_idgen.bas) to automatically generate 16-character IDs for all the cells selected in a table.
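
For those who do not use LibreOffice, IDs in the same format (16 characters, uppercase and lowercase letters and digits) could also be generated with a short script. The sketch below is independent of the macro linked above and does not reproduce its code; the function names are hypothetical. It only avoids collisions within the batch and with IDs passed in explicitly, so uniqueness across the whole archive remains the depositor's responsibility.

```python
import secrets
import string

ID_ALPHABET = string.ascii_letters + string.digits
ID_LENGTH = 16


def generate_id(length: int = ID_LENGTH) -> str:
    """Return a random string of letters and digits."""
    return "".join(secrets.choice(ID_ALPHABET) for _ in range(length))


def generate_ids(count: int, existing: frozenset[str] = frozenset()) -> list[str]:
    """Return `count` distinct IDs, none of which appears in `existing`."""
    seen = set(existing)
    ids: list[str] = []
    while len(ids) < count:
        candidate = generate_id()
        if candidate not in seen:  # a collision is very unlikely, but cheap to check
            seen.add(candidate)
            ids.append(candidate)
    return ids


if __name__ == "__main__":
    # Print one ID per line, ready to paste into the `id` column.
    print("\n".join(generate_ids(10)))
```
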