3 years ago · b5bef43025
--- a/README.md
+++ b/README.md
@@ -16,88 +16,14 @@ Start container:
 
				 docker run -e TXL_FLASK_SECRET=changeme -p 8000:8000 transliterator:latest
			
 
				 ```
			
 
				 
			
 
				-## Web UI
			
 
				-
			
 
				-`/` renders a simple HTML form to test the transliteration service.
			
 
				-
			
 
				-
			
 
				-## REST API
			
 
				-
			
 
				-### `GET /health`
			
 
				-
			
 
				-Useful endpoint for health checks.
			
 
				-
			
 
				-#### Response code
			
 
				-
			
 
				-`200 OK` if the service is running.
			
 
				-
			
 
				-### `GET /languages`
			
 
				-
			
 
				-List all the languages supported.
			
 
				-
			
 
				-#### Response code
			
 
				-
			
 
				-`200 OK`
			
 
				-
			
 
				-#### Response body
			
 
				-
			
 
				-MIME type: `application/json`
			
 
				-
			
 
				-Content: a JSON object of the supported language tables. Keys are the keywords
			
 
				-used throughout the API, e.g. for `/transliterate`. Each key is paired with an
			
 
				-object that contains some basic metadata about the language features. At the
			
 
				-moment, only the human-readable name is available.
			
 
				-
			
 
				-### `GET /table/<lang>`
			
 
				-
			
 
				-Dump a language table.
			
 
				+For running in development mode, add `-e FLASK_ENV=development` to the options.
			
 
				 
			
 
				-#### URI parameters
			
 
				 
			
 
				-- `<lang>`: Language code as given by the `/languages` endpoint. 
			
 
				-
			
 
				-#### Response code
			
 
				-
			
 
				-`200 OK`
			
 
				-
			
 
				-#### Response body
			
 
				-
			
 
				-MIME type: `application/json`
			
 
				-
			
 
				-Content: language configuration as a JSON object with all the transliteration
			
 
				-rules as they are read by the application. If the table inherits from a parent,
			
 
				-the computed values from the merged tables are shown.
			
 
				-
			
 
				-### `POST /transliterate/<lang>[/r2s]`
			
 
				-
			
 
				-Transliterate an input string in a given language.
			
 
				-
			
 
				-#### URI parameters
			
 
				-
			
 
				-- `<lang>`: Language code as given by the `/languages` endpoint. 
			
 
				-- `r2s`: if appended to the URI, the transliteration is intended to be
			
 
				-  Roman-to-script, and the input string should be Latin text. If not, the
			
 
				-  default behavior is followed, which is interpreting the input as a script
			
 
				-  in the given language, and returning the Romanized text.
			
 
				-
			
 
				-#### POST body
			
 
				-
			
 
				-- `text`: Input text to be transliterated.
			
 
				-
			
 
				-#### Response code
			
 
				-
			
 
				-- `200 OK` on successful operation.
			
 
				-- `400 Bad Request` for an invalid request. The reason for the failure is
			
 
				-  normally printed in the response body.
			
 
				-
			
 
				-#### Response body
			
 
				-
			
 
				-MIME Type: `text/plain`
			
 
				+## Web UI
			
 
				 
			
 
				-Content: transliterated string. Characters not found in the mapping are copied
			
 
				-verbatim (see "Configuration files" section for more information).
			
 
				+`/` renders a simple HTML form to test the transliteration service.
			
 
				 
			
 
				 
			
 
				-## Configuration files
			
 
				+## Further documentation
			
 
				 
			
 
				-TODO
			
 
				+See the [`doc`](./doc) folder for additional documentation.
			
--- a/TODO.md
+++ b/TODO.md
@@ -10,8 +10,12 @@ discussion, etc.); *X* = not implementing.
 
				 - *D* Basic transliteration in both directions
			
 
				 - *D* Basic REST API
			
 
				 - *D* Basic UI
			
 
				-- *P* Life cycle hooks for plugins
			
 
				-- *P* API documentation
			
 
				+- *W* Life cycle hooks for plugins
			
 
				+- *P* Regular expressions in ignore lists
			
 
				+- *P* Word boundaries (design)
			
 
				+- *D* API documentation
			
 
				+- *P* Config file documentation
			
 
				+- *W* Hooks documentation
			
 
				 - *W* Complete conversion of existing tables to YAML
			
 
				   - *P* Arabic
			
 
				   - *P* Armenian
			
--- a/doc/config.md
+++ b/doc/config.md
@@ -0,0 +1,3 @@
 
				+# Transliterator configuration file format
			
 
				+
			
 
				+TODO
			
--- a/doc/hooks.md
+++ b/doc/hooks.md
@@ -0,0 +1,195 @@
 
				+# Transliterator life cycle hooks
			
 
				+
			
 
				+This is a guide for language specialists with some Python development skills,
			
 
				+or who can partner with Python developers to create new, complex
			
 
				+transliteration features.
			
 
				+
			
 
				+This software strives to become a universal transliteration tool; however, the
			
 
				+challenge of such a goal is that some scripts have very complex, specific
			
 
				+rules that, if implemented in a generic tool, would quickly make its code
			
 
				+unmanageable.
			
 
				+
			
 
				+The solution we propose in this tool is to keep its core functionality simple,
			
 
				+yet versatile enough to work with the vast majority of scripts; and to enable
			
 
				+script- or language-specific extensions that can be managed by subject matter
			
 
				+experts so that it's easy to isolate those specific features.
			
 
				+
			
 
				+This is implemented through life cycle hooks, which are "ports" where additional logic
			
 
				+can be executed to change the outcome of a transliteration at certain stages
			
 
				+of the process.
			
 
				+
			
 
				+
			
 
				+## Overview of the transliteration process
			
 
				+
			
 
				+In order to understand how hooks work, it is necessary to understand the logic
			
 
				+underpinning the transliteration process.
			
 
				+
			
 
				+When a transliteration request is sent to the application, the following
			
 
				+happens:
			
 
				+
			
 
				+1. A configuration is read and parsed for the script/language specified. This
			
 
				+   is actually only done once, the first time the language is used since
			
 
				+   the application was (re)started. If a configuration is changed, the
			
 
				+   application must be restarted in order for it to parse the updated rules.
			
 
				+   See [`config.md`](./config.md) for a complete guide on the configuration
			
 
				+   structure.
			
 
				+   a. If the table is designated as inheriting another table (e.g. `russian`
			
 
				+      inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
			
 
				+      sections under `roman_to_script` and/or `script_to_roman` are first read
			
 
				+      and then the present table's tokens are merged onto them. If a key is
			
 
				+      present in both tables, the "child" table token overrides the parent. The
			
 
				+      same thing happens with the "ignore" list in the `roman_to_script`
			
 
				+      section.
			
 
				+   b. The tokens are rearranged so that longer tokens are
			
 
				+      sorted before shorter ones that are completely contained in the beginning
			
 
				+      part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
			
 
				+      'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
			
 
				+      `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
			
 
				+      ensure that specific word definitions are parsed in their entirety before
			
 
				+      they are transliterated. Hence, the Hanzi sequence "北京" is correctly
			
 
				+      interpreted as "Beijing" instead of "bei jing".
			
 
				+2. Once the transliteration rules are loaded, the application proceeds to
			
 
				+   scanning the input text. The application initializes an empty list, which
			
 
				+   shall contain the parsed tokens, and starts a loop that advances a cursor,
			
 
				+   which represents the reading position starting at 0 (the beginning of the
			
 
				+   text).
			
 
				+3. For Roman-to-script transliteration, tokens in the `ignore` list are first
			
 
				+   compared against the text at the cursor position. The number of
			
 
				+   characters compared is equal to the length of each token in the ignore list.
			
 
				+   a. If there is a match, the matching token is added to the output list and
			
 
				+      the cursor advanced by the number of characters in the token.
			
 
				+   b. If all ignore tokens are scanned and there is no match, the application
			
 
				+      proceeds with the next step at the same cursor position.
			
 
				+4. Tokens in the relevant `map` of the transliteration table are compared, one
			
 
				+   by one in the order established in 1.b, with the string at the cursor
			
 
				+   position. The number of characters compared is equal to the length of the
			
 
				+   token.
			
 
				+   a. If there is a match, the transliteration indicated in the token is added
			
 
				+      to the output list, and the cursor advanced by the number of characters
			
 
				+      in the token.
			
 
				+   b. If there is no match, the next token is tried. If all the tokens have
			
 
				+      been tried and still no match results, the single character at the
			
 
				+      current position is added verbatim to the output list, and the cursor
			
 
				+      advances by one position.
			
 
				+5. When the end of the input text is reached, if the configuration indicates
			
 
				+   that capitalization is required (this is true by default), the first element
			
 
				+   of the output list is capitalized.
			
 
				+6. The output list is joined into one string.
			
 
				+7. The string is compacted by removing excessive whitespace: Two or more
			
 
				+   contiguous spaces are collapsed into one, and whitespace is stripped off
			
 
				+   both ends of the text.
			
 
				+8. The output string is returned.
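
Steps 1.a and 1.b above can be sketched in a few lines of Python. This is only an illustration, not the application's code: the function names `merge_maps` and `order_tokens` are hypothetical, maps are assumed to be plain dicts, and the ordering of tokens that do not share a prefix (plain lexicographic here) is an assumption chosen because it reproduces the example in step 1.b.

```python
from functools import cmp_to_key


def merge_maps(parent_map, child_map):
    """Step 1.a: child entries override parent entries with the same key."""
    return {**parent_map, **child_map}


def _compare(a, b):
    """Order two tokens; a longer token precedes any prefix of itself."""
    if a == b:
        return 0
    if a.startswith(b):
        return -1  # a is longer and begins with b: a goes first
    if b.startswith(a):
        return 1   # b is longer and begins with a: b goes first
    # Unrelated tokens: plain lexicographic order (an assumption).
    return -1 if a < b else 1


def order_tokens(tokens):
    """Step 1.b: sort so that no token is shadowed by one of its prefixes."""
    return sorted(tokens, key=cmp_to_key(_compare))


merged = merge_maps({"A": "a", "B": "b"}, {"B": "v", "AB": "ab"})
ordered = order_tokens(["A", "B", "AB", "BCD", "ABCD", "BCDE", "BEFGH"])
print(merged)
print(ordered)
```

With this ordering, a scanner that tries tokens from first to last always meets `ABCD` before `AB` or `A`, which is the property step 4 relies on.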
			
 
				+
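
The scanning loop in steps 2 through 7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual implementation: the function name `scan`, its signature, and the representation of the map as an already-ordered list of `(source, destination)` pairs are all hypothetical, and tokens are assumed to be non-empty.

```python
import re


def scan(text, token_map, ignore=(), capitalize=True):
    """Transliterate `text` using an ordered list of (source, dest) pairs."""
    output = []   # step 2: list of parsed tokens
    cursor = 0    # step 2: reading position
    while cursor < len(text):
        # Step 3: ignore tokens are copied to the output unchanged.
        for token in ignore:
            if text[cursor:cursor + len(token)] == token:
                output.append(token)
                cursor += len(token)
                break
        else:
            # Step 4: try each map token in its pre-established order.
            for source, dest in token_map:
                if text[cursor:cursor + len(source)] == source:
                    output.append(dest)
                    cursor += len(source)
                    break
            else:
                # Step 4.b: no match, copy one character verbatim.
                output.append(text[cursor])
                cursor += 1
    # Step 5: capitalize the first element only.
    if capitalize and output:
        output[0] = output[0][:1].upper() + output[0][1:]
    # Step 6: join; step 7: collapse runs of spaces and strip both ends.
    return re.sub(r" {2,}", " ", "".join(output)).strip()


# Hypothetical table; the longer token is listed before its prefix.
TABLE = [("北京", "Beijing "), ("北", "bei "), ("京", "jing ")]
print(scan("北京", TABLE))
```

Because `北京` is tried before `北`, the two-character sequence is consumed as a single token rather than as two separate syllables.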
			
 
				+
			
 
				+## Hooks
			
 
				+
			
 
				+Hooks are entry points for arbitrary code that may be added to change the
			
 
				+transliteration behavior at specific points of the process described above.
			
 
				+Each of the hooks may indicate the name of a function and optional, additional
			
 
				+parameters.
			
 
				+
			
 
				+Hook functions may be defined for each language/script in the corresponding
			
 
				+configuration file. See [`config.md`](./config.md) for details.
			
 
				+
			
 
				+The function name takes the form of `<module name>/<function name>` and must
			
 
				+correspond to an existing module and function under the `transliterator.hooks`
			
 
				+package.
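
One way such a `<module name>/<function name>` reference could be resolved is with `importlib`. The helper below is a hedged sketch: `resolve_hook` and its `package` parameter are hypothetical names, and how the application actually loads hook functions is not described here. The demonstration uses the standard library `json` package only so that the snippet runs without the transliterator installed.

```python
from importlib import import_module


def resolve_hook(reference, package="transliterator.hooks"):
    """Turn 'module/function' into the callable it names under `package`."""
    module_name, _, function_name = reference.partition("/")
    if not module_name or not function_name:
        raise ValueError(f"Expected '<module>/<function>', got {reference!r}")
    module = import_module(f"{package}.{module_name}")
    hook = getattr(module, function_name)
    if not callable(hook):
        raise TypeError(f"{reference!r} does not name a callable")
    return hook


# Stand-in demonstration: resolves json.decoder.JSONDecoder.
hook = resolve_hook("decoder/JSONDecoder", package="json")
print(hook.__name__)
```

Failing early on a malformed or missing reference, as this sketch does, would surface configuration mistakes when the table is loaded rather than in the middle of a transliteration.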
			
 
				+
			
 
				+Each hook requires input parameters that are specific to its context, and are
			
 
				+passed to the corresponding hook function(s) by the internal process. They must
			
 
				+be defined in each function associated with the hook. Hooks may also accept
			
 
				+optional keyword-only arguments, as described below, whose values can be
			
 
				+defined in the configuration.
			
 
				+
			
 
				+Each function must also return an output that the process is able to handle as
			
 
				+expected. These are also defined below.
			
 
				+
			
 
				+**[TODO]** These hooks are being implemented in a vacuum, without much of a
			
 
				+real-world use case. These capabilities may change as actual
			
 
				+challenges arise.
			
 
				+
			
 
				+### `post_config`
			
 
				+
			
 
				+This hook is run after the whole configuration is parsed and possibly merged
			
 
				+with a parent configuration.
			
 
				+
			
 
				+This hook is run once, when the configuration is parsed.
			
 
				+
			
 
				+#### Input parameters
			
 
				+
			
 
				+- `config` (dict): The parsed configuration data structure.
			
 
				+- `**kwargs`: Additional arguments that may be defined in the configuration.
			
 
				+
			
 
				+#### Output
			
 
				+
			
 
				+(dict) Configuration data structure.
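
A `post_config` hook function might look like the sketch below. The function name `drop_tokens`, its `section` and `drop` keyword arguments, and the assumption that each `map` is a plain dict at this stage are all hypothetical; only the `config` and `**kwargs` inputs and the dict return value come from the description above.

```python
def drop_tokens(config, *, section="script_to_roman", drop=(), **kwargs):
    """Remove the listed keys from one section's map and return the config."""
    section_map = config.get(section, {}).get("map", {})
    for key in drop:
        # Missing keys are ignored rather than treated as errors.
        section_map.pop(key, None)
    return config


# Hypothetical parsed configuration, reduced to the part the hook touches.
config = {"script_to_roman": {"map": {"ъ": '"', "а": "a"}}}
config = drop_tokens(config, drop=["ъ"])
print(config)
```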
			
 
				+
			
 
				+### `begin_input_token`
			
 
				+
			
 
				+This hook is run at the beginning of each iteration of the input parsing loop.
			
 
				+
			
 
				+#### Input parameters
			
 
				+
			
 
				+- `input` (str): the whole input text.
			
 
				+- `cursor` (int): cursor position.
			
 
				+- `output` (list): Output list in its current state.
			
 
				+- `**kwargs`: Additional arguments that may be defined in the configuration.
			
 
				+
			
 
				+#### Output
			
 
				+
			
 
				+(int) Cursor position.
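
As an illustration, a `begin_input_token` function could move the cursor past characters that should never reach the matching steps. The name `skip_chars` and its `chars` keyword argument are hypothetical, and the sketch assumes that the returned integer replaces the current cursor position, which the description above implies but does not state.

```python
def skip_chars(input, cursor, output, *, chars="\u200c\u200d", **kwargs):
    """Advance the cursor past any run of characters listed in `chars`."""
    # `input` shadows the built-in, but matches the documented parameter name.
    while cursor < len(input) and input[cursor] in chars:
        cursor += 1
    return cursor


# The zero-width non-joiner at position 0 is skipped.
print(skip_chars("\u200cab", 0, []))
```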
			
 
				+
			
 
				+### `pre_ignore_token`
			
 
				+
			
 
				+Run before each ignore token is compared with the input.
			
 
				+
			
 
				+#### Input parameters
			
 
				+
			
 
				+- `input` (str): the whole input text.
			
 
				+- `cursor` (int): cursor position.
			
 
				+- `token` (str): Current ignore token.
			
 
				+- `**kwargs`: Additional arguments that may be defined in the configuration.
			
 
				+
			
 
				+#### Output
			
 
				+
			
 
				+(int) Cursor position.
			
 
				+
			
 
				+### `on_ignore_match`
			
 
				+
			
 
				+Run when an ignore token matches.
			
 
				+
			
 
				+TODO
			
 
				+
			
 
				+### `pre_tx_token`
			
 
				+
			
 
				+Run before comparing each transliteration token with the current text.
			
 
				+
			
 
				+TODO
			
 
				+
			
 
				+### `on_tx_token_match`
			
 
				+
			
 
				+Run when a transliteration token matches the input.
			
 
				+
			
 
				+TODO
			
 
				+
			
 
				+### `on_no_tx_token_match`
			
 
				+
			
 
				+Run after all tokens for the current position have been tried and no match has
			
 
				+been found. If defined, this **replaces** the default behavior of copying the
			
 
				+character to the output.
			
 
				+
			
 
				+TODO
			
 
				+
			
 
				+### `pre_assembly`
			
 
				+
			
 
				+Run after the whole text has been scanned, before the output list is
			
 
				+capitalized and assembled into a string.
			
 
				+
			
 
				+TODO
			
 
				+
			
 
				+### `post_assembly`
			
 
				+
			
 
				+Run after the output has been assembled into a string, before whitespace is
			
 
				+stripped off.
			
 
				+
			
 
				+TODO
			
--- a/doc/rest_api.md
+++ b/doc/rest_api.md
@@ -0,0 +1,75 @@
 
				+# Transliterator REST API
			
 
				+
			
 
				+## `GET /health`
			
 
				+
			
 
				+Useful endpoint for health checks.
			
 
				+
			
 
				+### Response code
			
 
				+
			
 
				+`200 OK` if the service is running.
			
 
				+
			
 
				+## `GET /languages`
			
 
				+
			
 
				+List all the languages supported.
			
 
				+
			
 
				+### Response code
			
 
				+
			
 
				+`200 OK`
			
 
				+
			
 
				+### Response body
			
 
				+
			
 
				+MIME type: `application/json`
			
 
				+
			
 
				+Content: a JSON object of the supported language tables. Keys are the keywords
			
 
				+used throughout the API, e.g. for `/transliterate`. Each key is paired with an
			
 
				+object that contains some basic metadata about the language features. At the
			
 
				+moment, only the human-readable name is available.
			
 
				+
			
 
				+## `GET /table/<lang>`
			
 
				+
			
 
				+Dump a language table.
			
 
				+
			
 
				+### URI parameters
			
 
				+
			
 
				+- `<lang>`: Language code as given by the `/languages` endpoint. 
			
 
				+
			
 
				+### Response code
			
 
				+
			
 
				+`200 OK`
			
 
				+
			
 
				+### Response body
			
 
				+
			
 
				+MIME type: `application/json`
			
 
				+
			
 
				+Content: language configuration as a JSON object with all the transliteration
			
 
				+rules as they are read by the application. If the table inherits from a parent,
			
 
				+the computed values from the merged tables are shown.
			
 
				+
			
 
				+## `POST /transliterate/<lang>[/r2s]`
			
 
				+
			
 
				+Transliterate an input string in a given language.
			
 
				+
			
 
				+### URI parameters
			
 
				+
			
 
				+- `<lang>`: Language code as given by the `/languages` endpoint. 
			
 
				+- `r2s`: if appended to the URI, the transliteration is intended to be
			
 
				+  Roman-to-script, and the input string should be Latin text. If not, the
			
 
				+  default behavior is followed, which is interpreting the input as a script
			
 
				+  in the given language, and returning the Romanized text.
			
 
				+
			
 
				+### POST body
			
 
				+
			
 
				+- `text`: Input text to be transliterated.
			
 
				+
			
 
				+### Response code
			
 
				+
			
 
				+- `200 OK` on successful operation.
			
 
				+- `400 Bad Request` for an invalid request. The reason for the failure is
			
 
				+  normally printed in the response body.
			
 
				+
			
 
				+### Response body
			
 
				+
			
 
				+MIME Type: `text/plain`
			
 
				+
			
 
				+Content: transliterated string. Characters not found in the mapping are copied
			
 
				+verbatim (see [`config.md`](./config.md) for more information).
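
A client call to the `POST /transliterate/<lang>[/r2s]` endpoint could be prepared as in the sketch below. Several details are assumptions: the helper name `build_transliterate_request` is hypothetical, the body is sent form-encoded because the encoding of the `text` field is not specified above, and the base URL follows the `-p 8000:8000` mapping from the README's `docker run` example. The request is only built here; passing it to `urllib.request.urlopen` would send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_transliterate_request(base_url, lang, text, r2s=False):
    """Build (but do not send) a POST request for the transliterate endpoint."""
    path = f"/transliterate/{quote(lang)}" + ("/r2s" if r2s else "")
    # Form encoding of the `text` field is an assumption.
    body = urlencode({"text": text}).encode("utf-8")
    return Request(base_url.rstrip("/") + path, data=body, method="POST")


req = build_transliterate_request("http://localhost:8000", "russian", "Москва")
print(req.get_method(), req.full_url)
```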