
WIP Presentable hooks doc draft; initial hook implementation.

Stefano Cossu 2 years ago
parent
commit
abd82fa208
4 changed files with 253 additions and 80 deletions
  1. doc/hooks.md (+72 −56)
  2. transliterator/hooks/__init__.py (+5 −0)
  3. transliterator/tables/__init__.py (+63 −0)
  4. transliterator/trans.py (+113 −24)

+ 72 - 56
doc/hooks.md

@@ -35,21 +35,21 @@ happens:
    structure.
 
    a. If the table is designated as inheriting another table (e.g. `russian`
-      inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
-      sections under `roman_to_script` and/or `script_to_roman` are first read
-      and then te present table's tokens are merged onto them. If a key is
-      present in both tables, the "child" table token overrides the parent. The
-      same thing happens with the "ignore" list in the `roman_to_script`
-      section.
+   inheriting the `_cyrillic_base` table), the parent's tokens in the `map`
+   sections under `roman_to_script` and/or `script_to_roman` are first read
+   and then the present table's tokens are merged onto them. If a key is
+   present in both tables, the "child" table token overrides the parent. The
+   same thing happens with the "ignore" list in the `roman_to_script`
+   section.
 
   b. The tokens are rearranged so that longer tokens are
-      sorted before shorter ones that are completely contained in the beginning
-      part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
-      'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
-      `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
-      ensure that specific word definitions are parsed in their entirety before
-      they are transliterated. Hence, the Hanzi sequence "北京" is correctly
-      interpreted as "Beijing" instead of "bei jing".
+   sorted before shorter ones that are completely contained in the beginning
+   part of the former. E.g. a list of tokens such as `['A', 'B', 'AB',
+   'BCD', 'ABCD', 'BCDE', 'BEFGH']` (showing only keys here) becomes
+   `['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']`. This is critical to
+   ensure that specific word definitions are parsed in their entirety before
+   they are transliterated. Hence, the Hanzi sequence "北京" is correctly
+   interpreted as "Beijing" instead of "bei jing".
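The ordering described in step 1.b can be sketched with a comparator that places a token after any longer token it is a prefix of (a hypothetical sketch, not the library's actual `Token` implementation):

```python
from functools import cmp_to_key

def token_cmp(a, b):
    # A token that is a prefix of another must sort AFTER it, so the
    # longer token is tried first; otherwise plain lexicographic order.
    if a != b and b.startswith(a):
        return 1   # a is a prefix of b: a sorts after b
    if a != b and a.startswith(b):
        return -1  # b is a prefix of a: a sorts before b
    return -1 if a < b else 1

tokens = ['A', 'B', 'AB', 'BCD', 'ABCD', 'BCDE', 'BEFGH']
print(sorted(tokens, key=cmp_to_key(token_cmp)))
# ['ABCD', 'AB', 'A', 'BCDE', 'BCD', 'BEFGH', 'B']
```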
 2. Once the transliteration rules are loaded, the application proceeds to
    scanning the input text. The application initializes an empty list, which
    shall contain the parsed tokens, and starts a loop that advances a cursor,
@@ -60,10 +60,10 @@ happens:
    characters compared is equal to the length of each token in the ignore list.
 
    a. If there is a match, the matching token is added to the output list and
-      the cursor advanced by the number of characters in the token.
+   the cursor advanced by the number of characters in the token.
 
    b. If all ignore tokens are scanned and there is no match, the application
-      proceeds with the next step at the same cursor position.
+   proceeds with the next step at the same cursor position.
 4. Tokens in the relevant `map` of the transliteration table are compared, one
    by one in the order established in 1.b, with the string at the cursor
    position. The amount of characters compared is equal to the length of the
@@ -99,14 +99,16 @@ The function name takes the form of `<module name>/<function name>` and must
 correspond to an existing module and function under the `transliterator.hooks`
 package.
 
-Each hook requires input parameters that are specific to its context, and are
-passed to the corresponding hook function(s) by the internal process. They must
-be defined in each function associated with the hook. Hooks may also accept
-optional keyword-only arguments, as described below, whose values can be
+Each hook function must accept the following arguments: `ctx`, an instance
+of `transliterator.trans.Context`, which carries information about the
+current scanner status and can be manipulated by the hook function; and
+`**kw`, optional keyword-only arguments, whose values can be
 defined in the configuration.
 
 Each function must also return an output that the process is able to handle as
-expected. These are also defined below.
+expected. The output may instruct the application to make a specific decision
+after the hook function is executed. Possible return values are defined below
+for each hook.
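As a hypothetical illustration of this contract (the module and function names are invented for the example), a hook function would look like:

```python
# Hypothetical hook function; registered in a table configuration, it would
# live in a module under the transliterator.hooks package.
def my_hook(ctx, **kw):
    # `ctx` is a transliterator.trans.Context instance; `**kw` receives any
    # keyword arguments defined for this hook in the configuration.
    if ctx.src.startswith("#"):
        return "break"  # instruct the caller to stop scanning
    return None  # proceed as normal
```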
 
 **[TODO]** These hooks are being implemented in a vacuum, without much of a
 real-world use case. Modifications to these capabilities may change as actual
@@ -117,83 +119,97 @@ challenges arise.
 This hook is run after the whole configuration is parsed and possibly merged
 with a parent configuration.
 
-This hook is run once with the configuration parsing.
+#### Output
 
-#### Input parameters
-
-- `config` (dict): The parsed configuration data structure.
-- `**kwargs`: Additional arguments that may be defined in the configuration.
-
-#### Output
-
-(dict) Configuration data structure.
+`None`.
 
 ### `begin_input_token`
 
 This hook is run at the beginning of each iteration of the input parsing loop.
 
-#### Input parameters
-
-- `input` (str): the whole input text.
-- `cursor` (int): cursor position.
-- `ouptut` (list): Output list in its current state.
-- `**kwargs`: Additional arguments that may be defined in the configuration.
+#### Output
 
-#### Output
-
-(int) Cursor position.
+(str | None) Possible values are `"cont"`, `"break"`, or `None`. If `None` is
+returned, the parsing proceeds as normal. `"cont"` causes the application to
+skip the parsing of the current token. `"break"` interrupts the text scanning
+and proceeds directly to handling the result list for output. **CAUTION**: when
+returning `"cont"`, it is the responsibility of the function to advance
+`ctx.cur` so that the loop does not become infinite.
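For instance, a hypothetical `begin_input_token` hook that skips digits must advance `ctx.cur` itself before returning `"cont"`:

```python
# Hypothetical begin_input_token hook: skip ASCII digits in the input.
# Because it returns "cont", it must advance ctx.cur itself, or the
# scanning loop would never progress.
def skip_digits(ctx, **kw):
    if ctx.src[ctx.cur].isdigit():
        ctx.cur += 1
        return "cont"
    return None
```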
 
 ### `pre_ignore_token`
 
 Run before each ignore token is compared with the input.
 
-#### Input parameters
-
-- `input` (str): the whole input text.
-- `cursor` (int): cursor position.
-- `token` (str): Current ignore token.
-- `**kwargs`: Additional arguments that may be defined in the configuration.
-
 #### Output
 
-(int) Cursor position.
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
+current ignore token. `"break"` stops looking up ignore tokens for the current
+position. This function can return `"cont"` without advancing the cursor and
+without causing an infinite loop.
 
 ### `on_ignore_match`
 
 Run when an ignore token matches.
 
-TODO
+#### Output
+
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
+on looking up the ignore list. `"break"` stops looking up ignore tokens for the
+current position. See cautionary note on `begin_input_token`.
 
 ### `pre_tx_token`
 
 Run before comparing each transliteration token with the current text.
 
-TODO
+#### Output
+
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
+current token. `"break"` stops looking up all tokens for the current
+position. See cautionary note on `begin_input_token`.
 
 ### `on_tx_token_match`
 
 Run when a transliteration token matches the input.
 
-TODO
+#### Output
+
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
+on looking up the token list. `"break"` stops looking up tokens for the
+current position and effectively reports a non-match.
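As a hypothetical illustration, an `on_tx_token_match` hook could veto matches based on state the hook itself tracks on the context (the `in_brackets` attribute is invented for this example):

```python
# Hypothetical on_tx_token_match hook: void matches while inside a
# bracketed span, assuming the hook family maintains ctx.in_brackets.
def veto_in_brackets(ctx, **kw):
    if getattr(ctx, "in_brackets", False):
        return "cont"  # void the match, keep scanning the token list
    return None
```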
 
 ### `on_no_tx_token_match`
 
 Run after all tokens for the current position have been tried and no match has
-been found. If defined, this **replaces** the default behavior of copying the
-character to the output.
+been found.
+
+#### Output
 
-TODO
+(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips to the next
+position in the input text. In this case, the function **must** advance the
+cursor. `"break"` stops all text parsing and proceeds to the assembly of the
+output.
 
 ### `pre_assembly`
 
 Run after the whole text has been scanned, before the output list is
-capitalized and assembled into a string.
+capitalized and assembled into a string. This function may manipulate the token
+list and/or handle the assembly itself, in which case it can return the
+assembled string and bypass any further output handling.
+
+#### Output
 
-TODO
+(str | None) If the output is a string, the transliteration function returns
+this string immediately; otherwise it proceeds with standard adjustments and
+assembly of the output list.
 
 ### `post_assembly`
 
 Run after the output has been assembled into a string, before whitespace is
-stripped off.
+stripped off. This function can access and manipulate `ctx.dest` which is
+the output string.
+
+#### Output
 
-TODO
+(str | None) If the output is a string, the transliteration function returns
+this string immediately; otherwise it proceeds with standard adjustments of the
+output string.
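A minimal hypothetical `post_assembly` hook, e.g. one forcing the assembled string to lower case and returning it to bypass further output handling:

```python
# Hypothetical post_assembly hook: normalize the assembled output string.
# Returning a string makes the transliteration function return it directly.
def lowercase_output(ctx, **kw):
    ctx.dest = ctx.dest.lower()
    return ctx.dest
```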

+ 5 - 0
transliterator/hooks/__init__.py

@@ -0,0 +1,5 @@
+__doc__ = """ Hook functions.
+
+Put modules and functions that should be found by life cycle hooks in this
+directory.
+"""

+ 63 - 0
transliterator/tables/__init__.py

@@ -20,10 +20,28 @@ language (or language and script? TBD)
 
 TABLE_DIR = path.join(path.dirname(path.realpath(__file__)), "data")
 
+# Available hook names.
+HOOKS = (
+    "post_config",
+    "begin_input_token",
+    "pre_ignore_token",
+    "on_ignore_match",
+    "pre_tx_token",
+    "on_tx_token_match",
+    "on_no_tx_token_match",
+    "pre_assembly",
+    "post_assembly",
+)
+# Package path where hook functions are kept.
+HOOK_PKG_PATH = "transliterator.hooks"
 
 logger = logging.getLogger(__name__)
 
 
+class ConfigError(Exception):
+    """ Raised when a malformed configuration is detected. """
+
+
 class Token(str):
     """
     Token class: minimal unit of text parsing.
@@ -112,6 +130,10 @@ def load_table(tname):
         tdata["script_to_roman"]["map"] = tuple(
                 (k.content, tokens[k]) for k in sorted(tokens))
 
+        if "hooks" in tdata["script_to_roman"]:
+            tdata["script_to_roman"]["hooks"] = load_hook_fn(
+                    tname, tdata["script_to_roman"])
+
     if "roman_to_script" in tdata:
         tokens = {
                 Token(k): v
@@ -140,4 +162,45 @@ def load_table(tname):
         tdata["roman_to_script"]["ignore"] = [
                 t.content for t in sorted(ignore)]
 
+        if "hooks" in tdata["roman_to_script"]:
+            tdata["roman_to_script"]["hooks"] = load_hook_fn(
+                    tname, tdata["roman_to_script"])
+
     return tdata
+
+
+def load_hook_fn(cname, sec):
+    """
+    Load hook functions from configuration file.
+
+    Args:
+        cname (str): Name of the table configuration, used in error messages.
+
+        sec (dict): The `script_to_roman` or `roman_to_script` section
+        that may contain the `hooks` key to be parsed.
+
+    Return:
+        dict: Dictionary of hook name and list of hook functions pairs.
+    """
+    hook_fn = {}
+    for cfg_hook, cfg_hook_fns in sec.get("hooks", {}).items():
+        if cfg_hook not in HOOKS:
+            raise ConfigError(f"{cfg_hook} is not a valid hook name!")
+
+        hook_fn[cfg_hook] = []
+        # There may be more than one function in each hook. They are
+        # executed in the order they are found.
+        for cfg_hook_fn in cfg_hook_fns:
+            modname, fnname = path.splitext(cfg_hook_fn)
+            fnname = fnname.lstrip(".")
+            try:
+                fn = getattr(
+                        import_module("." + modname, HOOK_PKG_PATH), fnname)
+            except (ImportError, AttributeError):
+                raise ConfigError(
+                    f"Hook function {fnname} defined in {cname} configuration "
+                    f"not found in module {HOOK_PKG_PATH}.{modname}!"
+                )
+            hook_fn[cfg_hook].append(fn)
+
+    return hook_fn
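A sketch of how a configured hook entry is split into module and function names, mirroring the `path.splitext` usage above (the `casing.adjust_case` entry is hypothetical):

```python
from os import path

# Hypothetical entry in a table's "hooks" mapping: "<module>.<function>"
# relative to the transliterator.hooks package.
cfg_hook_fn = "casing.adjust_case"
modname, fnname = path.splitext(cfg_hook_fn)
fnname = fnname.lstrip(".")
print(modname, fnname)  # prints: casing adjust_case
```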

+ 113 - 24
transliterator/trans.py

@@ -11,6 +11,27 @@ MULTI_WS_RE = re.compile(r"\s{2,}")
 logger = logging.getLogger(__name__)
 
 
+class Context:
+    """
+    Context used within the transliteration and passed to hook functions.
+    """
+    def __init__(self, src, general, langsec):
+        """
+        Initialize a context.
+
+        Args:
+            src (str): The original text. This is meant to never change.
+
+            general (dict): General section of the current configuration.
+
+            langsec (dict): Language configuration section being used.
+        """
+        self.src = src
+        self.general = general
+        self.langsec = langsec
+        # Instance attributes, not class attributes: a mutable class-level
+        # list would be shared across all instances.
+        self.cur = 0  # Input text cursor.
+        self.dest_ls = []  # Token list making up the output string.
+
+
 def transliterate(src, lang, r2s=False):
     """
     Transliterate a single string.
@@ -37,7 +58,7 @@ def transliterate(src, lang, r2s=False):
     logger.info(f"Loaded table for {lang}.")
 
     # General directives.
-    # general_dir = cfg.get("directives", {})
+    general = cfg.get("general", {})
 
     if not r2s and "script_to_roman" not in cfg:
         raise NotImplementedError(
@@ -50,54 +71,122 @@ def transliterate(src, lang, r2s=False):
 
     langsec = cfg["script_to_roman"] if not r2s else cfg["roman_to_script"]
     langsec_dir = langsec.get("directives", {})
+    langsec_hooks = langsec.get("hooks", {})
+
+    ctx = Context(src, general, langsec)
+
+    _run_hook("post_config", ctx, langsec_hooks)
 
-    i = 0
-    dest_ls = []
     # Loop through source characters. The increment of each loop depends on
     # the length of the token that eventually matches.
     ignore_list = langsec.get("ignore", [])  # Only present in R2S
-    while i < len(src):
+    while ctx.cur < len(src):
+        # This hook may skip the parsing of the current
+        # token or exit the scanning loop altogether.
+        hret = _run_hook("begin_input_token", ctx, langsec_hooks)
+        if hret == "break":
+            break
+        if hret == "cont":
+            continue
         # Check ignore list first. Find as many subsequent ignore tokens
         # as possible before moving on to looking for match tokens.
         while True:
-            ignoring = False
+            ctx.ignoring = False
             for tk in ignore_list:
+                hret = _run_hook("pre_ignore_token", ctx, langsec_hooks)
+                if hret == "break":
+                    break
+                if hret == "cont":
+                    continue
+
                 step = len(tk)
-                if tk == src[i:i + step]:
+                if tk == src[ctx.cur:ctx.cur + step]:
+                    hret = _run_hook("on_ignore_match", ctx, langsec_hooks)
+                    if hret == "break":
+                        break
+                    if hret == "cont":
+                        continue
+
                     logger.info(f"Ignored token: {tk}")
-                    dest_ls.append(tk)
-                    i += step
-                    ignoring = True
+                    ctx.dest_ls.append(tk)
+                    ctx.cur += step
+                    ctx.ignoring = True
                     break
             # We looked through all ignore tokens, not found any. Move on.
-            if not ignoring:
+            if not ctx.ignoring:
                 break
 
-        match = False
+        ctx.match = False
         for src_tk, dest_tk in langsec["map"]:
+            hret = _run_hook("pre_tx_token", ctx, langsec_hooks)
+            if hret == "break":
+                break
+            if hret == "cont":
+                continue
+
             # Longer tokens should be guaranteed to be scanned before their
             # substrings at this point.
             step = len(src_tk)
-            if src_tk == src[i:i + step]:
+            if src_tk == src[ctx.cur:ctx.cur + step]:
+                # This hook may skip this token or break out of the token
+                # lookup for the current position.
+                hret = _run_hook("on_tx_token_match", ctx, langsec_hooks)
+                if hret == "break":
+                    break
+                if hret == "cont":
+                    continue
+
                 # A match is found. Stop scanning tokens, append result, and
                 # proceed scanning the source.
-                dest_ls.append(dest_tk)
-                match = True
-                i += step
+                ctx.dest_ls.append(dest_tk)
+                ctx.match = True
+                ctx.cur += step
+                break
+
+        if not ctx.match:
+            hret = _run_hook("on_no_tx_token_match", ctx, langsec_hooks)
+            if hret == "break":
                 break
+            if hret == "cont":
+                continue
 
-        if not match:
             # No match found. Copy non-mapped character (one at a time).
-            logger.info(f"Token {src[i]} at position {i} is not mapped.")
-            dest_ls.append(src[i])
-            i += 1
+            logger.info(
+                    f"Token {src[ctx.cur]} at position {ctx.cur} is not mapped.")
+            ctx.dest_ls.append(src[ctx.cur])
+            ctx.cur += 1
 
     if langsec_dir.get("capitalize", False):
-        dest_ls[0] = dest_ls[0].capitalize()
+        ctx.dest_ls[0] = ctx.dest_ls[0].capitalize()
+
+    # This hook may take care of the assembly and cause the function to return
+    # its own return value.
+    hret = _run_hook("pre_assembly", ctx, langsec_hooks)
+    if hret is not None:
+        return hret
+
+    logger.debug(f"Output list: {ctx.dest_ls}")
+    ctx.dest = "".join(ctx.dest_ls)
+
+    # This hook may manipulate the output string and cause the function to
+    # return that.
+    hret = _run_hook("post_assembly", ctx, langsec_hooks)
+    if hret is not None:
+        return hret
+
+    # Strip multiple spaces and leading/trailing whitespace.
+    ctx.dest = re.sub(MULTI_WS_RE, ' ', ctx.dest.strip())
+
+    return ctx.dest
 
-    logger.debug(f"Output list: {dest_ls}")
-    dest = "".join(dest_ls)
 
-    dest = re.sub(MULTI_WS_RE, ' ', dest.strip())
+def _run_hook(hname, ctx, hooks):
+    ret = None
+    for hook_def in hooks.get(hname, []):
+        kwargs = hook_def[1] if len(hook_def) > 1 else {}
+        ret = hook_def[0](ctx, **kwargs)
+        if ret in ("break", "cont"):
+            # This will stop parsing hook functions and tell the caller to
+            # break out of the outer loop or skip the iteration.
+            return ret

-    return dest
+    return ret
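The dispatch semantics of `_run_hook` can be exercised in isolation; a self-contained sketch (with invented stand-in hook functions) of the short-circuit behavior:

```python
# Stand-in hook functions for illustration only.
def passthrough(ctx, **kw):
    return None

def stopper(ctx, **kw):
    return "break"

def run_hooks(hooks, ctx):
    # Mirror of the _run_hook logic above: each entry is a (function,
    # kwargs) pair; "break"/"cont" short-circuits the chain, otherwise
    # the last hook's return value wins.
    ret = None
    for fn, kwargs in hooks:
        ret = fn(ctx, **kwargs)
        if ret in ("break", "cont"):
            return ret
    return ret

print(run_hooks([(passthrough, {}), (stopper, {})], None))  # prints: break
```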