|
@@ -35,21 +35,21 @@ happens:
|
|
|
structure.
|
|
|
|
|
|
a. If the table is designated as inheriting another table (e.g. `russian`
|
|
|
- inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
|
|
|
- sections under `roman_to_script` and/or `script_to_roman` are first read
|
|
|
- and then te present table's tokens are merged onto them. If a key is
|
|
|
- present in both tables, the "child" table token overrides the parent. The
|
|
|
- same thing happens with the "ignore" list in the `roman_to_script`
|
|
|
- section.
|
|
|
+ inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
|
|
|
+ sections under `roman_to_script` and/or `script_to_roman` are first read
|
|
|
+ and then the present table's tokens are merged onto them. If a key is
|
|
|
+ present in both tables, the "child" table token overrides the parent. The
|
|
|
+ same thing happens with the "ignore" list in the `roman_to_script`
|
|
|
+ section.
|
|
|
|
|
|
b. The tokens are rearranged so that longer tokens are
|
|
|
- sorted before shorter ones that are completely contained in the beginning
|
|
|
- part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
|
|
|
- 'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
|
|
|
- `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
|
|
|
- ensure that specific word definitions are parsed in their entirety before
|
|
|
- they are transliterated. Hence, the Hanzi sequence "北京" is correctly
|
|
|
- interpreted as "Beijing" instead of "bei jing".
|
|
|
+ sorted before shorter ones that are completely contained in the beginning
|
|
|
+ part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
|
|
|
+ 'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
|
|
|
+ `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
|
|
|
+ ensure that specific word definitions are parsed in their entirety before
|
|
|
+ they are transliterated. Hence, the Hanzi sequence "北京" is correctly
|
|
|
+ interpreted as "Beijing" instead of "bei jing".
|
|
|
2. Once the transliteration rules are loaded, the application proceeds to
|
|
|
scanning the input text. The application initializes an empty list, which
|
|
|
shall contain the parsed tokens, and starts a loop that advances a cursor,
|
|
@@ -60,10 +60,10 @@ happens:
|
|
|
characters compared is equal to the length of each token in the ignore list.
|
|
|
|
|
|
a. If there is a match, the matching token is added to the output list and
|
|
|
- the cursor advanced by the number of characters in the token.
|
|
|
+ the cursor advanced by the number of characters in the token.
|
|
|
|
|
|
b. If all ignore tokens are scanned and there is no match, the application
|
|
|
- proceeds with the next step at the same cursor position.
|
|
|
+ proceeds with the next step at the same cursor position.
|
|
|
4. Tokens in the relevant `map` of the transliteration table are compared, one
|
|
|
by one in the order established in 1.b, with the string at the cursor
|
|
|
position. The number of characters compared is equal to the length of the
|
|
@@ -99,14 +99,16 @@ The function name takes the form of `<module name>/<function name>` and must
|
|
|
correspond to an existing module and function under the `transliterator.hooks`
|
|
|
package.
|
|
|
|
|
|
-Each hook requires input parameters that are specific to its context, and are
|
|
|
-passed to the corresponding hook function(s) by the internal process. They must
|
|
|
-be defined in each function associated with the hook. Hooks may also accept
|
|
|
-optional keyword-only arguments, as described below, whose values can be
|
|
|
+Each hook requires some arguments to be defined in each function associated
|
|
|
+with it: `ctx`, an instance of `transliterator.trans.Context` which carries
|
|
|
+information about the current scanner status and can be manipulated by the hook
|
|
|
+function; and `**kw`, optional keyword-only arguments, whose values can be
|
|
|
defined in the configuration.
|
|
|
|
|
|
Each function must also return an output that the process is able to handle as
|
|
|
-expected. These are also defined below.
|
|
|
+expected. The output may instruct the application to make a specific decision
|
|
|
+after the hook function is executed. Possible return values are defined below
|
|
|
+for each hook.
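
As a rough illustration of the shape described above, a hook function takes `ctx` plus optional keyword arguments and returns a value the application can act on. The `Context` class below is a hypothetical stand-in for `transliterator.trans.Context` (only `ctx.cur` is named in this document; `src` is an assumed attribute), and both `limit_length` and its `max_chars` option are invented for the example. The meaning of `"break"` depends on the hook the function is attached to.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for `transliterator.trans.Context`."""
    src: str      # input text (attribute name is an assumption)
    cur: int = 0  # cursor position (`ctx.cur` in this document)


def limit_length(ctx, **kw):
    """Stop scanning once the cursor reaches an optional `max_chars` limit."""
    max_chars = kw.get("max_chars")  # hypothetical option set in the config
    if max_chars is not None and ctx.cur >= max_chars:
        return "break"  # ask the application to stop
    return None  # no special instruction: proceed as normal
```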
|
|
|
|
|
|
**[TODO]** These hooks are being implemented in a vacuum, without much of a
|
|
|
real-world use case. These capabilities may change as actual
|
|
@@ -117,83 +119,97 @@ challenges arise.
|
|
|
This hook is run after the whole configuration is parsed and possibly merged
|
|
|
with a parent configuration.
|
|
|
|
|
|
-This hook is run once with the configuration parsing.
|
|
|
+#### Return
|
|
|
|
|
|
-#### Input parameters
|
|
|
-
|
|
|
-- `config` (dict): The parsed configuration data structure.
|
|
|
-- `**kwargs`: Additional arguments that may be defined in the configuration.
|
|
|
-
|
|
|
-#### Output
|
|
|
-
|
|
|
-(dict) Configuration data structure.
|
|
|
+`None`.
|
|
|
|
|
|
### `begin_input_token`
|
|
|
|
|
|
This hook is run at the beginning of each iteration of the input parsing loop.
|
|
|
|
|
|
-#### Input parameters
|
|
|
-
|
|
|
-- `input` (str): the whole input text.
|
|
|
-- `cursor` (int): cursor position.
|
|
|
-- `ouptut` (list): Output list in its current state.
|
|
|
-- `**kwargs`: Additional arguments that may be defined in the configuration.
|
|
|
+#### Return
|
|
|
|
|
|
-#### Output
|
|
|
-
|
|
|
-(int) Cursor position.
|
|
|
+(str | None) Possible values are `"cont"`, `"break"`, or `None`. If `None` is
|
|
|
+returned, the parsing proceeds as normal. `"cont"` causes the application to
|
|
|
+skip the parsing of the current token. `"break"` interrupts the text scanning
|
|
|
+and proceeds directly to handling the result list for output. **CAUTION**: when
|
|
|
+returning `"cont"`, it is the responsibility of the function to advance `ctx.cur`
|
|
|
+so that the loop doesn't become an infinite one.
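
The cautionary note can be made concrete with a small sketch. Everything here is illustrative: `Context` is a hypothetical stand-in for the real context class (`src` and `dest_ls` are assumed attribute names), `skip_bracketed` is an invented hook, and `scan` is a toy loop that only mimics how the return values might be honored, not the library's actual scanner.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for `transliterator.trans.Context`."""
    src: str                                     # input text (assumed name)
    cur: int = 0                                 # cursor (`ctx.cur`)
    dest_ls: list = field(default_factory=list)  # output tokens (assumed name)


def skip_bracketed(ctx, **kw):
    """`begin_input_token` hook: skip text enclosed in square brackets."""
    if ctx.src[ctx.cur] != "[":
        return None  # not our case: let normal parsing handle it
    end = ctx.src.find("]", ctx.cur)
    if end == -1:
        return "break"  # unbalanced bracket: stop scanning altogether
    ctx.cur = end + 1  # advance the cursor ourselves, as "cont" requires
    return "cont"


def scan(src, hook):
    """Toy scanning loop (not the library's) honoring the hook's return."""
    ctx = Context(src)
    while ctx.cur < len(ctx.src):
        action = hook(ctx)
        if action == "break":
            break
        if action == "cont":
            continue  # without the cursor advance above, this would spin forever
        ctx.dest_ls.append(ctx.src[ctx.cur].upper())  # placeholder "parsing"
        ctx.cur += 1
    return "".join(ctx.dest_ls)
```

With this toy loop, `scan("ab[xy]cd", skip_bracketed)` yields `"ABCD"`: the bracketed span is skipped because the hook moved the cursor past it before returning `"cont"`.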
|
|
|
|
|
|
### `pre_ignore_token`
|
|
|
|
|
|
Run before each ignore token is compared with the input.
|
|
|
|
|
|
-#### Input parameters
|
|
|
-
|
|
|
-- `input` (str): the whole input text.
|
|
|
-- `cursor` (int): cursor position.
|
|
|
-- `token` (str): Current ignore token.
|
|
|
-- `**kwargs`: Additional arguments that may be defined in the configuration.
|
|
|
-
|
|
|
#### Output
|
|
|
|
|
|
-(int) Cursor position.
|
|
|
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
|
|
|
+current ignore token. `"break"` stops looking up ignore tokens for the current
|
|
|
+position. This function can return `"cont"` without advancing the cursor and
|
|
|
+without causing an infinite loop.
|
|
|
|
|
|
### `on_ignore_match`
|
|
|
|
|
|
Run when an ignore token matches.
|
|
|
|
|
|
-TODO
|
|
|
+#### Output
|
|
|
+
|
|
|
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
|
|
|
+on looking up the ignore list. `"break"` stops looking up ignore tokens for the
|
|
|
+current position. See cautionary note on `begin_input_token`.
|
|
|
|
|
|
### `pre_tx_token`
|
|
|
|
|
|
Run before comparing each transliteration token with the current text.
|
|
|
|
|
|
-TODO
|
|
|
+#### Output
|
|
|
+
|
|
|
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
|
|
|
+current token. `"break"` stops looking up all tokens for the current
|
|
|
+position. See cautionary note on `begin_input_token`.
|
|
|
|
|
|
### `on_tx_token_match`
|
|
|
|
|
|
Run when a transliteration token matches the input.
|
|
|
|
|
|
-TODO
|
|
|
+#### Output
|
|
|
+
|
|
|
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
|
|
|
+on looking up the token list. `"break"` stops looking up tokens for the
|
|
|
+current position and effectively reports a non-match.
|
|
|
|
|
|
### `on_no_tx_token_match`
|
|
|
|
|
|
Run after all tokens for the current position have been tried and no match has
|
|
|
-been found. If defined, this **replaces** the default behavior of copying the
|
|
|
-character to the output.
|
|
|
+been found.
|
|
|
+
|
|
|
+#### Output
|
|
|
|
|
|
-TODO
|
|
|
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips to the next
|
|
|
+position in the input text. Int his case, the function **must** advance the
|
|
|
+cursor. `"break"` stops all text parsing and proceeds to the assembly of the
|
|
|
+output.
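
A possible use of this hook, sketched under assumptions: the function below emits a placeholder for any character that no token matched, then advances the cursor itself before returning `"cont"`. `Context`, its `src` and `dest_ls` attributes, the `mark_unknown` function, and the `placeholder` option are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for `transliterator.trans.Context`."""
    src: str                                     # input text (assumed name)
    cur: int = 0                                 # cursor (`ctx.cur`)
    dest_ls: list = field(default_factory=list)  # output tokens (assumed name)


def mark_unknown(ctx, **kw):
    """`on_no_tx_token_match` hook: emit a placeholder for unmatched input."""
    placeholder = kw.get("placeholder", "?")  # hypothetical config option
    ctx.dest_ls.append(placeholder)
    ctx.cur += 1  # mandatory when returning "cont" from this hook
    return "cont"
```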
|
|
|
|
|
|
### `pre_assembly`
|
|
|
|
|
|
Run after the whole text has been scanned, before the output list is
|
|
|
-capitalized and assembled into a string.
|
|
|
+capitalized and assembled into a string. This function may manipulate the token
|
|
|
+list and/or handle the assembly itself, in which case it can return the
|
|
|
+assembled string and bypass any further output handling.
|
|
|
+
|
|
|
+#### Output
|
|
|
|
|
|
-TODO
|
|
|
+(str | None) If the output is a string, the transliteration function returns
|
|
|
+this string immediately; otherwise it proceeds with standard adjustments and
|
|
|
+assembly of the output list.
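
For instance, a `pre_assembly` hook could take over the assembly only when a separator is configured, and otherwise defer to the standard handling. This is a sketch: `Context.dest_ls` is an assumed name for the token list, and `join_with_separator` and its `sep` option are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for `transliterator.trans.Context`."""
    dest_ls: list = field(default_factory=list)  # output tokens (assumed name)


def join_with_separator(ctx, **kw):
    """`pre_assembly` hook: assemble the output with a custom separator."""
    sep = kw.get("sep")  # hypothetical config option
    if sep is None:
        return None  # fall through to the standard assembly
    # Returning a string bypasses any further output handling.
    return sep.join(tok for tok in ctx.dest_ls if tok)
```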
|
|
|
|
|
|
### `post_assembly`
|
|
|
|
|
|
Run after the output has been assembled into a string, before whitespace is
|
|
|
-stripped off.
|
|
|
+stripped off. This function can access and manipulate `ctx.dest`, which is
|
|
|
+the output string.
|
|
|
+
|
|
|
+#### Output
|
|
|
|
|
|
-TODO
|
|
|
+(str | None) If the output is a string, the transliteration function returns
|
|
|
+this string immediately; otherwise it proceeds with standard adjustments of the
|
|
|
+output string.
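
Both behaviors can be shown in one small sketch: the hook below edits `ctx.dest` in place and returns `None` so that the standard adjustments still run, unless a hypothetical `final` option asks it to return the finished string directly. `Context` is a stand-in class, and `collapse_spaces` and `final` are names made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for `transliterator.trans.Context`."""
    dest: str = ""  # assembled output string (`ctx.dest`)


def collapse_spaces(ctx, **kw):
    """`post_assembly` hook: collapse runs of spaces in the output string."""
    ctx.dest = re.sub(r" {2,}", " ", ctx.dest)
    if kw.get("final", False):  # hypothetical config option
        return ctx.dest.strip()  # returned as-is, skipping further adjustments
    return None  # let the standard adjustments proceed on ctx.dest
```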
|