Browse Source

Partial hooks guide; move all docs under `doc`.

Stefano Cossu 1 year ago
parent
commit
b5bef43025
5 changed files with 284 additions and 81 deletions
  1. 5 79
      README.md
  2. 6 2
      TODO.md
  3. 3 0
      doc/config.md
  4. 195 0
      doc/hooks.md
  5. 75 0
      doc/rest_api.md

+ 5 - 79
README.md

@@ -16,88 +16,14 @@ Start container:
 docker run -e TXL_FLASK_SECRET=changeme -p 8000:8000 transliterator:latest
 ```
 
-## Web UI
-
-`/` renders a simple HTML form to test the transliteration service.
-
-
-## REST API
-
-### `GET /health`
-
-Useful endpoint for health checks.
-
-#### Response code
-
-`200 OK` if the service is running.
-
-### `GET /languages`
-
-List all the languages supported.
-
-#### Response code
-
-`200 OK`
-
-#### Response body
-
-MIME type: `application/json`
-
-Content: a JSON object of the supported language tables. Keys are the keywords
-used throughout the API, e.g. for `/transliterate`. Each key is paired with an
-object that contains some basic metadata about the language features. At the
-moment, only the human-readable name is available.
-
-### `GET /table/<lang>`
-
-Dump a language table.
+For running in development mode, add `-e FLASK_ENV=development` to the options.
 
-#### URI parameters
 
-- `<lang>`: Language code as given by the `/languages` endpoint. 
-
-#### Response code
-
-`200 OK`
-
-#### Response body
-
-MIME type: `application/json`
-
-Content: language configuration as a JSON object with all the transliteration
-rules as they are read by the application. If the table inherits from a parent,
-the computed values from the merged tables are shown.
-
-### `POST /transliterate/<lang>[/r2s]`
-
-Transliterate an input string in a given language.
-
-#### URI parameters
-
-- `<lang>`: Language code as given by the `/languages` endpoint. 
-- `r2s`: if appended to the URI, the transliteration is intended to be
-  Roman-to-script, and the input string should be Latin text. If not, the
-  default behavior is followed, which is interpreting the input as a script
-  in the given language, and returning the Romanized text.
-
-#### POST body
-
-- `text`: Input text to be transliterated.
-
-#### Response code
-
-- `200 OK` on successful operation.
-- `400 Bad Request` for an invalid request. The reason for the failure is
-  normally printed in the response body.
-
-#### Response body
-
-MIME Type: `text/plain`
+## Web UI
 
-Content: transliterated string. Characters not found in the mapping are copied
-verbatim (see "Configuration files" section for more information).
+`/` renders a simple HTML form to test the transliteration service.
 
 
-## Configuration files
+## Further documentation
 
-TODO
+See the [`doc`](./doc) folder for additional documentation.

+ 6 - 2
TODO.md

@@ -10,8 +10,12 @@ discussion, etc.); *X* = not implementing.
 - *D* Basic transliteration in both directions
 - *D* Basic REST API
 - *D* Basic UI
-- *P* Life cycle hooks for plugins
-- *P* API documentation
+- *W* Life cycle hooks for plugins
+- *P* Regular expressions in ignore lists
+- *P* Word boundaries (design)
+- *D* API documentation
+- *P* Config file documentation
+- *W* Hooks documentation
 - *W* Complete conversion of existing tables to YAML
   - *P* Arabic
   - *P* Armenian

+ 3 - 0
doc/config.md

@@ -0,0 +1,3 @@
+# Transliterator configuration file format
+
+TODO

+ 195 - 0
doc/hooks.md

@@ -0,0 +1,195 @@
+# Transliterator life cycle hooks
+
+This is a guide for language specialists with some Python development skills,
+or who can partner with Python developers to create new, complex
+transliteration features.
+
+This software strives to become a universal transliteration tool; however, the
+challenge of such goal is that some scripts have very complex, specific
+rules that, if implemented in a generic tool, would quickly make its code
+unmanageable.
+
+The solution we propose in this tool is to keep its core functionality simple,
+yet versatile enough to work with the vast majority of scripts; and to enable
+script- or language-specific extensions that can be managd by subject matter
+experts so that it's easy to isolate those specific features.
+
+This is implemented life cycle hooks, which are "ports" where additional logic
+can be executed to change the outcome of a transliteration at certain stages
+of the process.
+
+
+## Overview of the transliteration process
+
+In order to understand how hooks work, it is necessary to understand the logic
+underpinning the transliteration process.
+
+When a transliteration request is sent to the application, the following
+happens:
+
+1. A configuration is read and parsed for the script/language specified. This
+   is actually only done once, the first time that that language is used since
+   the application was (re)started. If a configuration is changed, the
+   application must be restarted in order for it to parse the updated rules.
+   See [`config.md`](./config.md) for a complete guide on the configuration
+   structure.
+   a. If the table is designated as inheriting another table (e.g. `russian`
+      inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
+      sections under `roman_to_script` and/or `script_to_roman` are first read
+      and then te present table's tokens are merged onto them. If a key is
+      present in both tables, the "child" table token overrides the parent. The
+      same thing happens with the "ignore" list in the `roman_to_script`
+      section.
+   b. Each of the tokens are rearranged so that longer tokens are
+      sorted before shorter ones that are completely contained in the beginning
+      part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
+      'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
+      `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
+      ensure that specific word definitions are parsed in their entirety before
+      they are transliterated. Hence, the Hanzi sequence "北京" is correctly
+      interpreted as "Beijing" instead of "bei jing".
+2. Once the transliteration rules are loaded, the application proceeds to
+   scanning the input text. The application initializes an empty list, which
+   shall contain the parsed tokens, and starts a loop that advances a cursor,
+   which represents the reading position starting at 0 (the beginning of the
+   text).
+3. For Roman-to-script transliteration, tokens in the `ignore` list are first
+   compared against the text at the cursor position. The amount of
+   characters compared is equal to the length of each token in the ignore list.
+   a. If there is a match, the matching token is added to the output list and
+      the cursor advanced by the number of characters in the token.
+   b. If all ignore tokens are scanned and there is no match, the application
+      proceeds with the next step at the same cursor position.
+4. Tokens in the relevant `map` of the transliteration table are compared, one
+   by one in the order established in 1.b, with the string at the cursor
+   position. The amount of characters compared is equal to the length of the
+   token.
+   a. If there is a match, the transliteration indicated in the token is added
+      to the output list, and the cursor advanced by the number of characters
+      in the token.
+   b. If there is no match, the next token is tried. If all the tokens have
+      been tried and still no match results, the single character at the
+      current position is added verbatim to the output list, and the cursor
+      advances by one position.
+5. When the end of the input text is reached, if the configuration indicates
+   that capitalization is required (this is true by default), te first element
+   of the output list is capitalized.
+6. The output list is joined into one string.
+7. The string is compacted by removing excessive whitespace: Two or more
+   contiguous spaces are collapsed into one, and whitespace is stripped off
+   both ends of the text.
+8. The output string is returned.
+
+
+## Hooks
+
+Hooks are entry points for arbitrary code that may be added to change the
+transliteration behavior at specific point of the process described above.
+Each of the hooks may indicate the name of a function and optional, additional
+paramters.
+
+Hook functions may be defined for each language/script in the corresponding
+configuration file. See [`config.md`](./config.md) for details.
+
+The function name takes the form of `<module name>/<function name>` and must
+correspond to an existing module and function under the `transliterator.hooks`
+package.
+
+Each hook requires input parameters that are specific to its context, and are
+passed to the corresponding hook function(s) by the internal process. They must
+be defined in each function associated with the hook. Hooks may also accept
+optional keyword-only arguments, as described below, whose values can be
+defined in the configuration.
+
+Each function must also return an output that the process is able to handle as
+expected. These are also defined below.
+
+**[TODO]** These hooks are being implemented in a vacuum, without much of a
+real-world use case. Modifications to these capabilities may change as actual
+challenges arise.
+
+### `post_config`
+
+This hook is run after the whole configuration is parsed and possibly merged
+with a parent configuration.
+
+This hook is run once with the configuration parsing.
+
+#### Input parameters
+
+- `config` (dict): The parsed configuration data structure.
+- `**kwargs`: Additional arguments that may be defined in the configuration.
+
+#### Output
+
+(dict) Configuration data structure.
+
+### `begin_input_token`
+
+This hook is run at the beginning of each iteration of the input parsing loop.
+
+#### Input parameters
+
+- `input` (str): the whole input text.
+- `cursor` (int): cursor position.
+- `ouptut` (list): Output list in its current state.
+- `**kwargs`: Additional arguments that may be defined in the configuration.
+
+#### Output
+
+(int) Cursor position.
+
+### `pre_ignore_token`
+
+Run before each ignore token is compared with the input.
+
+#### Input parameters
+
+- `input` (str): the whole input text.
+- `cursor` (int): cursor position.
+- `token` (str): Current ignore token.
+- `**kwargs`: Additional arguments that may be defined in the configuration.
+
+#### Output
+
+(int) Cursor position.
+
+### `on_ignore_match`
+
+Run when an ignore token matches.
+
+TODO
+
+### `pre_tx_token`
+
+Run before comparing each transliteration token with the current text.
+
+TODO
+
+### `on_tx_token_match`
+
+Run when a transliteration token matches the input.
+
+TODO
+
+### `on_no_tx_token_match`
+
+Run after all tokens for the current position have been tried and no match has
+been found. If defined, this **replaces** the default behavior of copying the
+character to the output.
+
+TODO
+
+### `pre_assembly`
+
+Run after the whole text has been scanned, before the output list is
+capitalized and assembled into a string.
+
+TODO
+
+### `post_assembly`
+
+Run after the output has been assembled into a string, before whitespace is
+stripped off.
+
+TODO

+ 75 - 0
doc/rest_api.md

@@ -0,0 +1,75 @@
+# Transliterator REST API
+
+## `GET /health`
+
+Useful endpoint for health checks.
+
+### Response code
+
+`200 OK` if the service is running.
+
+## `GET /languages`
+
+List all the languages supported.
+
+### Response code
+
+`200 OK`
+
+### Response body
+
+MIME type: `application/json`
+
+Content: a JSON object of the supported language tables. Keys are the keywords
+used throughout the API, e.g. for `/transliterate`. Each key is paired with an
+object that contains some basic metadata about the language features. At the
+moment, only the human-readable name is available.
+
+## `GET /table/<lang>`
+
+Dump a language table.
+
+### URI parameters
+
+- `<lang>`: Language code as given by the `/languages` endpoint. 
+
+### Response code
+
+`200 OK`
+
+### Response body
+
+MIME type: `application/json`
+
+Content: language configuration as a JSON object with all the transliteration
+rules as they are read by the application. If the table inherits from a parent,
+the computed values from the merged tables are shown.
+
+## `POST /transliterate/<lang>[/r2s]`
+
+Transliterate an input string in a given language.
+
+### URI parameters
+
+- `<lang>`: Language code as given by the `/languages` endpoint. 
+- `r2s`: if appended to the URI, the transliteration is intended to be
+  Roman-to-script, and the input string should be Latin text. If not, the
+  default behavior is followed, which is interpreting the input as a script
+  in the given language, and returning the Romanized text.
+
+### POST body
+
+- `text`: Input text to be transliterated.
+
+### Response code
+
+- `200 OK` on successful operation.
+- `400 Bad Request` for an invalid request. The reason for the failure is
+  normally printed in the response body.
+
+### Response body
+
+MIME Type: `text/plain`
+
+Content: transliterated string. Characters not found in the mapping are copied
+verbatim (see "Configuration files" section for more information).