
Adjust hooks logic and docs.

Stefano Cossu, 2 years ago
parent
commit
5a1ccfa2b5
2 changed files with 149 additions and 47 deletions
  1. doc/hooks.md: 112 additions, 24 deletions
  2. transliterator/trans.py: 37 additions, 23 deletions

doc/hooks.md (+112, -24)

@@ -14,9 +14,9 @@ yet versatile enough to work with the vast majority of scripts; and to enable
 script- or language-specific extensions that can be managed by subject matter
 experts so that it's easy to isolate those specific features.
 
-This is implemented life cycle hooks, which are "ports" where additional logic
-can be executed to change the outcome of a transliteration at certain stages
-of the process.
+This is implemented by using life cycle hooks, which are "ports" into the
+transliteration workflow where additional logic can be executed to change the
+outcome of a transliteration at certain stages of the process.
 
 
 ## Overview of the transliteration process
@@ -64,13 +64,15 @@ happens:
 
    b. If all ignore tokens are scanned and there is no match, the application
    proceeds with the next step at the same cursor position.
-4. Tokens in the relevant `map` of the transliteration table are compared, one
-   by one in the order established in 1.b, with the string at the cursor
-   position. The amount of characters compared is equal to the length of the
-   token.
+4. Tokens in the relevant `map` section of the transliteration table are
+   compared, one by one in the order established in 1.b, with the string at the
+   cursor position. The amount of characters compared is equal to the length of
+   the token.
+
    a. If there is a match, the transliteration indicated in the token is added
       to the output list, and the cursor advanced by the number of characters
       in the token.
+
    b. If there is no match, the next token is tried. If all the tokens have
       been tried and still no match results, the single character at the
       current position is added verbatim to the output list, and the cursor
@@ -95,9 +97,10 @@ parameters.
 Hook functions may be defined for each language/script in the corresponding
 configuration file. See [`config.md`](./config.md) for details.
 
-The function name takes the form of `<module name>/<function name>` and must
+The function name takes the form of `<module name>.<function name>` and must
 correspond to an existing module and function under the `transliterator.hooks`
-package.
+package. Check the [`rot3.yml`](../transliterator/tables/data/rot3.yml) test
+configuration and the referred functions for a working example.
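
As a hedged illustration of this naming form (the module and function names below are invented for the example and are not part of the package):

```python
# transliterator/hooks/my_hooks.py -- hypothetical module placed under the
# transliterator.hooks package.

def log_position(ctx, **kwargs):
    """Referenced from a table configuration as ``my_hooks.log_position``."""
    # ``ctx`` is the transliterator.trans.Context instance described below;
    # extra keyword arguments are assumed to come from the hook's
    # configuration entry.
    return None
```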
 
 Each hook requires some arguments to be defined in each function associated
 with it: `ctx`, an instance of `transliterator.trans.Context` which carries
@@ -119,17 +122,33 @@ challenges arise.
 This hook is run after the whole configuration is parsed and possibly merged
 with a parent configuration.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position. It should be 0 at this point.
+- `ctx.dest_ls`: destination token list. It should be empty at this point.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+
 #### Return
 
-`None`.
+`None`
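
A minimal sketch of a `post_config_parse` hook (the function name is hypothetical; it only reads the members listed above):

```python
import logging

logger = logging.getLogger(__name__)


def report_map_size(ctx, **kwargs):
    """Hypothetical post_config_parse hook: log how many mappings were parsed."""
    # ctx.langsec is the S2R or R2S section of the merged configuration.
    logger.info("Parsed %d mapping tokens.", len(ctx.langsec.get("map", [])))
    return None  # None lets the workflow proceed normally.
```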
 
 ### `begin_input_token`
 
 This hook is run at the beginning of each iteration of the input parsing loop.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+
 #### Return
 
-(str | None) Possible values are `"cont"`, `"break"`, or `None`. If `None` is
+Possible values are `"cont"`, `"break"`, or `None`. If `None` is
 returned, the parsing proceeds as normal. `"cont"` causes the application to
 skip the parsing of the current token. `"break"` interrupts the text scanning
 and proceeds directly to handling the result list for output. **CAUTION**: when
@@ -140,9 +159,17 @@ so that the loop doesn't become an infinite one.
 
 Run before each ignore token is compared with the input.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+
 #### Output
 
-(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
+`"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
 current ignore token. `"break"` stops looking up ignore tokens for the current
 position. This function can return `"cont"` without advancing the cursor and
 without causing an infinite loop.
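
A minimal sketch of a `pre_ignore_token` hook; the `max_pos` keyword argument is hypothetical and would come from the hook's configuration entry:

```python
def skip_ignore_after(ctx, max_pos=0, **kwargs):
    """Hypothetical pre_ignore_token hook: stop ignore lookups past a position."""
    if ctx.cur >= max_pos:
        return "break"  # stop looking up ignore tokens at this position
    return None  # otherwise proceed with the normal comparison
```
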
@@ -151,9 +178,20 @@ without causing an infinite loop.
 
 Run when an ignore token matches.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+- `ctx.tk`: matching ignore token.
+- `ctx.ignoring`: whether an ignore token matched. If set to `False`, the rest
+  of the workflow will assume a non-match.
+
 #### Output
 
-(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
+`"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
 on looking up the ignore list. `"break"` stops looking up ignore tokens for the
 current position. See cautionary note on `begin_input_token`.
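
A sketch of an `on_ignore_match` hook that vetoes matches for one token; the hyphen is only an example value:

```python
def keep_hyphens(ctx, **kwargs):
    """Hypothetical on_ignore_match hook: never ignore a plain hyphen."""
    if ctx.tk == "-":
        return "cont"  # void the match and keep scanning the ignore list
    return None  # accept the match
```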
 
@@ -161,9 +199,19 @@ current position. See cautionary note on `begin_input_token`.
 
 Run before comparing each transliteration token with the current text.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+- `ctx.src_tk`: the input token being looked up.
+- `ctx.dest_tk`: the transliterated string associated with the current token.
+
 #### Output
 
-(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
+`"cont"`, `"break"`, or `None`. `"cont"` skips the checks on the
 current token. `"break"` stops looking up all tokens for the current
 position. See cautionary note on `begin_input_token`.
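
A sketch of a `pre_tx_token` hook; the `skip_dest` keyword is hypothetical and would be supplied by the configuration entry:

```python
def skip_mappings(ctx, skip_dest=(), **kwargs):
    """Hypothetical pre_tx_token hook: skip tokens mapping to unwanted output."""
    if ctx.dest_tk in skip_dest:
        return "cont"  # skip the comparison for this token only
    return None
```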
 
@@ -171,9 +219,22 @@ position. See cautionary note on `begin_input_token`.
 
 Run when a transliteration token matches the input.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list. The matching token will be added to it
+  after this hook is run.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+- `ctx.src_tk`: the matching input token.
+- `ctx.dest_tk`: the transliterated string to be added to the output.
+- `ctx.match`: whether there was a match. If set to `False`, the rest of the
+  workflow will assume a non-match.
+
 #### Output
 
-(str | None) `"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
+`"cont"`, `"break"`, or `None`. `"cont"` voids the match and keeps
 on looking up the token list. `"break"` stops looking up tokens for the
 current position and effectively reports a non-match.
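
A sketch of an `on_tx_token_match` hook that rewrites the output token before it is appended (purely illustrative):

```python
def bracket_matches(ctx, **kwargs):
    """Hypothetical on_tx_token_match hook: bracket every transliterated token."""
    # ctx.dest_tk is appended to ctx.dest_ls after this hook returns None.
    ctx.dest_tk = f"[{ctx.dest_tk}]"
    return None
```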
 
@@ -182,9 +243,21 @@ current position and effectively reports a non-match.
 Run after all tokens for the current position have been tried and no match has
 been found.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+- `ctx.src_tk`: the last input token tried (there is no match at this point).
+- `ctx.dest_tk`: the transliterated string associated with the last tried token.
+- `ctx.match`: whether there was a match. This is always `False` here and will
+  have no consequence if changed.
+
 #### Output
 
-(str | None) `"cont"`, `"break"`, or `None`. `"cont"` skips to the next
+`"cont"`, `"break"`, or `None`. `"cont"` skips to the next
 position in the input text. In this case, the function **must** advance the
 cursor. `"break"` stops all text parsing and proceeds to the assembly of the
 output.
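
A minimal sketch of an `on_no_tx_token_match` hook that emits a placeholder instead of the default verbatim copy; note that it advances the cursor itself, as required when returning `"cont"`:

```python
def replace_unmapped(ctx, placeholder="?", **kwargs):
    """Hypothetical on_no_tx_token_match hook: substitute unmapped characters."""
    ctx.dest_ls.append(placeholder)
    ctx.cur += 1   # the hook must advance the cursor when returning "cont"
    return "cont"  # skip the default verbatim copy and move on
```
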
@@ -196,20 +269,35 @@ capitalized and assembled into a string. This function may manipulate the token
 list and/or handle the assembly itself, in which case it can return the
 assembled string and bypass any further output handling.
 
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+
 #### Output
 
-(str | None) If the output is a string, the transliteration function returns
-this string immediately; otherwise it proceeds with standard adjustments and
-assembly of the output list.
+A string or `None`. If the output is a string, the transliteration function
+returns this string immediately; otherwise it proceeds with standard
+adjustments and assembly of the output list.
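
A sketch of a `pre_assembly` hook that cleans up the token list and lets the standard assembly continue (the function name is hypothetical):

```python
def drop_empty_tokens(ctx, **kwargs):
    """Hypothetical pre_assembly hook: remove empty strings before assembly."""
    ctx.dest_ls[:] = [tk for tk in ctx.dest_ls if tk]
    return None  # returning a string here would bypass standard assembly
```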
 
 ### `post_assembly`
 
 Run after the output has been assembled into a string, before whitespace is
-stripped off. This function can access and manipulate `ctx.dest` which is
-the output string.
+stripped off.
+
+#### Available context members
+
+- `ctx.src`: Source text. It should not be reassigned.
+- `ctx.cur`: cursor position.
+- `ctx.dest_ls`: destination token list.
+- `ctx.general`: Configuration general options.
+- `ctx.langsec`: language section (S2R or R2S) of configuration.
+- `ctx.dest`: output string.
 
 #### Output
 
-(str | None) If the output is a string, the transliteration function returns
-this string immediately; otherwise it proceeds with standard adjustments of the
-output string.
+`"ret"` or `None`. If `"ret"`, the transliteration function returns `ctx.dest`
+immediately; otherwise it proceeds with standard adjustments of the output
+string before returning.
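
A sketch of a `post_assembly` hook that applies its own clean-up and returns the result immediately (the doubled-apostrophe rule is only an example):

```python
def collapse_apostrophes(ctx, **kwargs):
    """Hypothetical post_assembly hook: collapse doubled apostrophes."""
    ctx.dest = ctx.dest.replace("''", "'")
    return "ret"  # return ctx.dest as-is, skipping whitespace normalization
```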

transliterator/trans.py (+37, -23)

@@ -90,34 +90,42 @@ def transliterate(src, lang, r2s=False):
             continue
         # Check ignore list first. Find as many subsequent ignore tokens
         # as possible before moving on to looking for match tokens.
+        ctx.tk = None
         while True:
             ctx.ignoring = False
-            for tk in ignore_list:
+            for ctx.tk in ignore_list:
                 hret = _run_hook("pre_ignore_token", ctx, langsec_hooks)
                 if hret == "break":
                     break
                 if hret == "continue":
                     continue
 
-                step = len(tk)
-                if tk == src[ctx.i:ctx.i + step]:
+                step = len(ctx.tk)
+                if ctx.tk == src[ctx.cur:ctx.cur + step]:
+                    # The position matches an ignore token.
                     hret = _run_hook("on_ignore_match", ctx, langsec_hooks)
                     if hret == "break":
                         break
                     if hret == "continue":
                         continue
 
-                    logger.info(f"Ignored token: {tk}")
-                    ctx.dest_ls.append(tk)
-                    ctx.i += step
+                    logger.info(f"Ignored token: {ctx.tk}")
+                    ctx.dest_ls.append(ctx.tk)
+                    ctx.cur += step
                     ctx.ignoring = True
                     break
             # We looked through all ignore tokens, not found any. Move on.
             if not ctx.ignoring:
                 break
+            # Otherwise, if we found a match, check if the next position may be
+            # ignored as well.
 
+        delattr(ctx, "tk")
+        delattr(ctx, "ignoring")
+
+        # Begin transliteration token lookup.
         ctx.match = False
-        for src_tk, dest_tk in langsec["map"]:
+        for ctx.src_tk, ctx.dest_tk in langsec["map"]:
             hret = _run_hook("pre_tx_token", ctx, langsec_hooks)
             if hret == "break":
                 break
@@ -126,8 +134,9 @@ def transliterate(src, lang, r2s=False):
 
             # Longer tokens should be guaranteed to be scanned before their
             # substrings at this point.
-            step = len(src_tk)
-            if src_tk == src[ctx.i:ctx.i + step]:
+            step = len(ctx.src_tk)
+            if ctx.src_tk == src[ctx.cur:ctx.cur + step]:
+                ctx.match = True
                 # This hook may skip this token or break out of the token
                 # lookup for the current position.
                 hret = _run_hook("on_tx_token_match", ctx, langsec_hooks)
@@ -138,12 +147,11 @@ def transliterate(src, lang, r2s=False):
 
                 # A match is found. Stop scanning tokens, append result, and
                 # proceed scanning the source.
-                ctx.dest_ls.append(dest_tk)
-                ctx.match = True
-                ctx.i += step
+                ctx.dest_ls.append(ctx.dest_tk)
+                ctx.cur += step
                 break
 
-        if not ctx.match:
+        if ctx.match is False:
             hret = _run_hook("on_no_tx_token_match", ctx, langsec_hooks)
             if hret == "break":
                 break
@@ -152,12 +160,15 @@ def transliterate(src, lang, r2s=False):
 
             # No match found. Copy non-mapped character (one at a time).
             logger.info(
-                    f"Token {src[ctx.i]} at position {ctx.i} is not mapped.")
-            ctx.dest_ls.append(src[ctx.i])
-            ctx.i += 1
+                f"Token {src[ctx.cur]} at position {ctx.cur} is not mapped."
+            )
+            ctx.dest_ls.append(src[ctx.cur])
+            ctx.cur += 1
 
-    if langsec_dir.get("capitalize", False):
-        ctx.dest_ls[0] = ctx.dest_ls[0].capitalize()
+    delattr(ctx, "src_tk")
+    delattr(ctx, "dest_tk")
+    delattr(ctx, "match")
+    delattr(ctx, "cur")
 
     # This hook may take care of the assembly and cause the function to return
     # its own return value.
@@ -165,14 +176,17 @@ def transliterate(src, lang, r2s=False):
     if hret is not None:
         return hret
 
+    if langsec_dir.get("capitalize", False):
+        ctx.dest_ls[0] = ctx.dest_ls[0].capitalize()
+
     logger.debug(f"Output list: {ctx.dest_ls}")
     ctx.dest = "".join(ctx.dest_ls)
 
-    # This hook may manipulate the output string and cause the function to
-    # return that.
+    # This hook may reassign the output string and/or cause the function to
+    # return it immediately.
     hret = _run_hook("post_assembly", ctx, langsec_hooks)
-    if hret is not None:
-        return hret
+    if hret == "ret":
+        return ctx.dest
 
     # Strip multiple spaces and leading/trailing whitespace.
     ctx.dest = re.sub(MULTI_WS_RE, ' ', ctx.dest.strip())
@@ -183,7 +197,7 @@ def transliterate(src, lang, r2s=False):
 def _run_hook(hname, ctx, hooks):
     for hook_def in hooks.get(hname, []):
         kwargs = hook_def[1] if len(hook_def) > 1 else {}
-        ret = hook_def[0](ctx.src, ctx.cur, ctx.dest_ls, **kwargs)
+        ret = hook_def[0](ctx, **kwargs)
         if ret in ("break", "cont"):
             # This will stop parsing hooks functions and tell the caller to
             # break out of the outer loop or skip iteration.
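
Judging from `_run_hook` above (an inference from this diff, not a documented API), the `hooks` mapping pairs each hook name with a list of entries of the form `(callable,)` or `(callable, kwargs)`. A hedged, self-contained sketch of such a structure, using an invented hook function:

```python
def replace_unmapped(ctx, placeholder="?", **kwargs):
    """Hypothetical hook, as sketched in doc/hooks.md above."""
    ctx.dest_ls.append(placeholder)
    ctx.cur += 1
    return "cont"


# Shape inferred from _run_hook: hook name -> list of (callable[, kwargs]) entries.
langsec_hooks = {
    "on_no_tx_token_match": [
        (replace_unmapped, {"placeholder": "\ufffd"}),
    ],
}
```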