Explorar o código

Begin writing up notes.

Stefano Cossu hai 1 ano
pai
achega
9dfaa75d75
Modificáronse 1 ficheiros con 206 adicións e 0 borrados
  1. 206 0
      NOTES.md

+ 206 - 0
NOTES.md

@@ -0,0 +1,206 @@
+# Romanization service implementation notes
+
+## `.cfg` files
+
+The `.cfg` format seems to follow a INI-like syntax that is ad-hoc-parsed by
+the Transliterator.
+
+So far only top-level section names have been encountered.
+
+Key-value pairs may express either a transliteration operation, e.g.
+
+```
+U+182CU+1820=qU+0307a
+```
+
+Or a passthrough (verbatim copy) operation, such as
+
+```
+At head of title=At head of title
+```
+
+Or a configuration directive, e.g.
+
+```
+SubfieldsAlwaysExcluded=uvxy0123456789
+```
+
+This last option may appear in any section, the first two only in the
+transliteration sections.
+
+It is unclear how configuration directives can be distinguished from
+transliteration rules, except by naming all the possible verbatim copy options.
+A more readable and efficent format would have discrete subsections for
+configuration and transliteration; if possible, vebatim copy should be
+implicit, which would make maintenance easier.
+
+It is unclear at the moment if spaces around the `=` sign are ignored.
+
+
+## `ReRomanizeRecord.bas`
+
+Much of the code deals with MARC records. No need to concern about that since
+the new Transliterator is meant to convert text strings to text strings.
+
+### Functions
+
+#### ReRomanizeText
+
+- Determine direction (input param): R2S or S2R
+- Determine personal name handling
+- Whether to uppercase first word
+
+
+#### LoadOneRomanizationTable
+
+Load cfg file (line by line, we can do the whole thing) and parse table
+metadata.
+
+Skip lines starting with `#` (comments).
+
+Lines starting with `[` are section defiinitions. The supported sections are
+`General`, `ScriptToRoman`, and `RomanToScript`.
+
+##### `General` Section
+
+Potentially relevant variables:
+
+- Name
+- NoRomanization
+- AllowCaseVariation
+- ApostropheCharacters
+- AllowCaseVariation
+- ApostropheCharacters
+- BySyllables
+- Truncation
+
+Likely irrelevant:
+
+- FontName
+- DoNotUse880Field
+- AllowDefineButton (what is this? Just UI related?)
+
+##### `RomanToScript` section
+
+If there is no `=` sign, it is assumed to be a multi-line directive, and the
+next line should be loaded and merged with the previous content.
+
+Otherwise, a keyword indicating a configuration directive is looked up.
+
+Currently supported, and potentially relevant, keywords are:
+
+- FieldsIncluded
+- IncludeFormattingCharactersLcPattern
+- OtherSubfieldsExcludedByTag
+- VowelMarker
+
+Likely irrelevant keywords: 
+
+- CreateEmpty880s
+- Subfield6Code
+- SubfieldsAlwaysExcluded
+
+If no keyword is detected, proceed to transliteration. [TODO transliteration
+logic details still to be looked at]
+
+
+##### `ScriptToRoman` section
+
+The logic is the same as the `RomanToscript` section, but the configuration
+keyword are different.
+
+Currently supported, potentially relevant:
+
+ - UppercaseFirstCharacterInSubfield [TODO verify]
+ - PersonalNameHandling
+
+ Likely irrelevant:
+
+ - CreateEmptyFields
+ - FieldsIncluded
+ - SubfieldsAlwaysExcluded
+ - OtherSubfieldsExcludedByTag
+
+[TODO Complete other function analysis]
+
+## General strategy for rewrite
+
+### API endpoints
+
+#### List available transliterations
+
+##### Invocation
+
+```
+GET /tables
+```
+
+Returns all available transliterations, and which directions are available for
+each.
+
+##### Response
+
+`200 OK`; body: K/V pairs of: script name, list of `r2s` (Roman to script),
+`s2r` (Script to Roman), or both.
+
+#### Transliterate a string
+
+##### Invocation
+
+```
+POST /trans/<script>/<direction>
+```
+
+##### Query parameters
+- `<script>`: script name as obtained by the `/tables` endpoint.
+- `<direction>`: transliteration direction as  obtained by the `/tables`
+endpoint.
+
+##### POST body
+
+- `data`: Input text (UTF-8) to transliterate.
+
+##### Response
+
+- `200 OK` if transliteration was successful; response body: transliterated
+  string (UTF-8)
+- `400 Bad Request` if a script name is not available in the requested
+  direction; response body: details of failure.
+- `500 Server Error` if an internal error occurred; response body: generic
+  error message (no details about the error)
+
+#### Reload the translation tables
+
+Reload the tables if they have been modified. This is done internally at
+server start. This should be auth-protected.
+
+##### Invocation
+
+`POST /reload_config`
+
+##### Authantication
+
+API token (probably just a hard-coded value in a .env file should suffice)
+
+##### Response
+
+- `204 No content` if the tables were reloaded successfully; no response body.
+- `500 Server Error` on internal error.
+
+
+### Functional approach
+
+1. Load all translation table metadata on server startup. This is equivalent
+   to invoking `reload_config` via REST API (see above) and is done by
+   scanning a designated directory containing only the translation table,
+   finding the metadata in the `General` section, disccovering the
+   `ScriptToRoman` and `RomanToScript` sections, and storing these metadata in
+   a variable available to all requests.
+
+2. Upon invocation of the `trans` method: load the relevant configuration file
+   (this operation will be cached in order to save expensive parsing) and apply
+   the relevant `ScriptToRoman` or `RomanToScript` transliteration to the
+   provided string.
+
+3. Upon invocation of the `/reload_config` method: reload the table metadata
+   as on startup; invalidate the cache for all the configurations.