
Add documentation for WB token and normalization.

scossu 2 months ago
parent
commit
52a0652a4a
1 changed files with 38 additions and 2 deletions

+ 38 - 2
doc/config.md

@@ -194,6 +194,28 @@ that:
   before "BA" but after "AD");
 - strings beginning with different characters are sorted alphabetically.
 
+
+#### `roman_to_script.normalize`
+
+Type: key-value pairs; value is a list of strings.
+
+Normalization rules. Each key is paired with one or more tokens that are
+replaced with the key before the transliteration. E.g. if we have (from the
+classical Greek config):
+
+```
+normalize:
+  "\u03B1":  # α  Greek Small Letter Alpha
+    - "\u1F00"  # ἀ  Greek Small Letter Alpha With Psili
+    - "\u1F80"  # ᾀ  Greek Small Letter Alpha With Psili And Ypogegrammeni
+    - "\u1FB0"  # ᾰ  Greek Small Letter Alpha With Vrachy
+    - "\u1FB1"  # ᾱ  Greek Small Letter Alpha With Macron
+    - "\u1FB3"  # ᾳ  Greek Small Letter Alpha With Ypogegrammeni
+```
+
+then all ἀ, ᾀ, ᾰ, etc. are normalized to α, so only α needs to be mapped for
+transliteration.
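
The normalization step described above can be sketched in Python as follows. This is only an illustration, not the project's actual implementation: the names `NORMALIZE`, `build_normalization_table`, and `normalize` are hypothetical, and the rules are assumed to be already loaded from the YAML configuration into a plain dictionary.

```python
# Hypothetical sketch of the normalization step; names are illustrative only.

# Rules as they might look once loaded from the `normalize` section:
# each key is the normalized form, each value the list of variants.
NORMALIZE = {
    "\u03B1": ["\u1F00", "\u1F80", "\u1FB0", "\u1FB1", "\u1FB3"],
}


def build_normalization_table(rules):
    """Invert {key: [variants]} into a flat {variant: key} lookup."""
    table = {}
    for key, variants in rules.items():
        for variant in variants:
            table[variant] = key
    return table


def normalize(text, rules):
    """Replace every variant in `text` with its normalized key."""
    table = build_normalization_table(rules)
    # Longer variants go first so that a multi-character token is not
    # partially consumed by a shorter one (an assumption of this sketch).
    for variant in sorted(table, key=len, reverse=True):
        text = text.replace(variant, table[variant])
    return text


# ἀγαθᾳ becomes αγαθα: both accented alphas collapse to plain α.
result = normalize("\u1F00\u03B3\u03B1\u03B8\u1FB3", NORMALIZE)
```

After this step the transliteration map only needs an entry for `α`, since none of the accented variants reach it.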
+
 #### `roman_to_script.map`
 
 Type: key-value pairs
@@ -204,14 +226,28 @@ Transliteration rules. Each rule takes the following form:
   "<source>": "<destination>"
 ```
 
-Unicode code points on either side are written using the YAML notation:
-`\u????`
+Unicode code points on either side can be written using the YAML notation
+(`\u????`) or the literal Unicode characters.
 
 These rules can be written in any order; however, writing longer strings, such
 as full names, before individual phonemes and characters makes the file more
 readable. The strings are sorted by the application using the same rules
 described above for the ignore list.
 
+To create specific rules for initial, final, and standalone tokens, the `%`
+character is used to mark the word boundary. Hence:
+
+```
+  "token%": "transliteration of token at beginning of word"
+  "%token": "transliteration of token at end of word"
+  "%token%": "transliteration of standalone token"
+  "token": "transliteration of token anywhere else not specified"
+```
+
+The order of priority is: standalone, initial, final, and medial. As with the
+other tokens, these rules can be written in any order in the configuration;
+they are reordered when the table is loaded.
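
As a rough illustration of the priority order above, the following Python sketch classifies map keys by the position of the `%` marker and orders rules for the same token accordingly. It follows the key forms exactly as listed in the example above. The names `classify`, `order_rules`, and `PRIORITY` are hypothetical. Sorting by the bare token is a simplification and does not reproduce the application's full sorting rules.

```python
# Hypothetical sketch of word-boundary priority; not the application's code.

WB = "%"  # word-boundary marker, as described above

# Lower number means higher priority, following the stated order.
PRIORITY = {"standalone": 0, "initial": 1, "final": 2, "medial": 3}


def classify(key):
    """Return (bare_token, position) for a map key, per the forms listed above."""
    starts = key.startswith(WB)
    ends = key.endswith(WB)
    bare = key.strip(WB)
    if starts and ends:
        return bare, "standalone"  # "%token%"
    if ends:
        return bare, "initial"     # "token%"
    if starts:
        return bare, "final"       # "%token"
    return bare, "medial"          # "token"


def order_rules(rules):
    """Order rules so that, for each bare token, higher priority comes first."""
    def sort_key(item):
        bare, position = classify(item[0])
        return (bare, PRIORITY[position])
    return sorted(rules.items(), key=sort_key)


# Rules written in arbitrary order in the configuration.
rules = {"a": "medial", "%a": "final", "a%": "initial", "%a%": "standalone"}
ordered_keys = [key for key, _ in order_rules(rules)]
```

Here `ordered_keys` comes out as `["%a%", "a%", "%a", "a"]` regardless of the order in which the rules were written.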
+
 
 #### `roman_to_script.hooks`