Aksharamukha update (#103)

* Yiddish transliteration via submodules.

* Update checkout workflow.

* Change refs for Yiddish submodules.

* Fix WORKDIR in Dockerfile

* Do not remove yiddish module.

* Manually add yiddish submodules.

* Use git clone instead of submodule.

* Move ext checkout to github actions.

* Chinese numerals (#97)

* WIP Parse Chinese numerals.

* WIP complete number parsing.

* Complete Chinese numerals:

* Use standard table override instead of pre-config hooks.
* Add a few test strings.

* Complete numerals:

* Transliterate all numeric examples correctly
* Modify hook return logic for consistency
* WIP partial spacing fix.

* Some cleanup; upgrade docker OS.

* Add dependency for uwsgi.

* Squashed commit of the following: (#98)

commit 30859a52b9cc325c323b414133856d0af3ffc2a6
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 22:17:36 2024 -0500

    Move ext checkout to github actions.

commit 6d8da6df68ac764f90deb15861089095043fd4ba
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 21:45:01 2024 -0500

    Use git clone instead of submodule.

commit ade9da589179870d331b703ff526d7fff33e88bb
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 21:42:45 2024 -0500

    Manually add yiddish submodules.

commit 77cb9ef2959f611d0220cc405e0b584ece71147c
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 21:23:37 2024 -0500

    Do not remove yiddish module.

commit e405b3605dd2629ed5557ccc5fdd5fe8812799ed
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 09:11:41 2024 -0500

    Fix WORKDIR in Dockerfile

commit 95445ba642163e28b94df6736ad6946ad7dc76c0
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 09:07:50 2024 -0500

    Change refs for Yiddish submodules.

commit 208ea095e792195981f644497ccd5fcd55e15c1b
Author: scossu <stefano@cossu.cc>
Date:   Wed Feb 28 08:45:58 2024 -0500

    Update checkout workflow.

* Add debug output to /trans response.

* Split docker files and requirements.

* Add bad request debug handler.

* Adjust CI workflows.

* Fix image name typo.

* Refine triggers.

* Fix typo on test workflow trigger.

* Use JSON in POST body.

* Also use JSON in feedback request; update docs.

* Return json data in 400 debug.

* Update Aksharamukha.

* Add new set of languages; separate pre and post options in Aksharamukha.
Stefano Cossu 1 year ago
parent
commit
99dcaac38c

+ 1 - 1
deps.txt

@@ -1,5 +1,5 @@
 # External dependencies.
-aksharamukha>=2.1,<3
+aksharamukha>=2.2,<3
 camel-tools>=1.5
 funcy>=1.15,<2
 pymarc>=4.0,<5

+ 12 - 6
scriptshifter/hooks/aksharamukha/romanizer.py

@@ -15,16 +15,22 @@ from scriptshifter.exceptions import BREAK
 logger = getLogger(__name__)
 
 
-def s2r_post_config(ctx, src_script):
+def s2r_post_config(ctx, src_script, pre=[], post=[]):
     # options = detect_preoptions(ctx.src, src_script)
-    options = [n for n, v in ctx.options.items() if v and n != "capitalize"]
-    ctx.dest = process(src_script, "IAST", ctx.src, pre_options=options)
+    pre_options = pre + [
+            n for n, v in ctx.options.items() if v and n != "capitalize"]
+    ctx.dest = process(
+            src_script, "RomanLoC", ctx.src,
+            pre_options=pre_options, post_options=post)
 
     return BREAK
 
 
-def r2s_post_config(ctx, dest_script):
-    options = [n for n, v in ctx.options.items() if v and n != "capitalize"]
-    ctx.dest = process("IAST", dest_script, ctx.src, post_options=options)
+def r2s_post_config(ctx, dest_script, pre=[], post=[]):
+    post_options = post + [
+            n for n, v in ctx.options.items() if v and n != "capitalize"]
+    ctx.dest = process(
+            "RomanLoC", dest_script, ctx.src,
+            pre_options=pre, post_options=post_options)
 
     return BREAK
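
The hunk above splits Aksharamukha options into a pre-processing and a post-processing list, and merges the user-enabled options into one of them depending on direction. A minimal sketch of just that list-building step follows. It is not the project's code: `build_options`, `s2r_option_sets`, `r2s_option_sets`, the `SimpleNamespace` stand-in for `ctx`, and the option name `SomeOption` are all hypothetical, and the call to Aksharamukha's `process` is left out. The sketch uses `None` defaults instead of `[]`; this is an alternative that avoids sharing one mutable default list between calls, not a claim that the original misbehaves.

```python
from types import SimpleNamespace


def build_options(ctx_options, fixed=None):
    """Return the fixed table options followed by the enabled user options."""
    fixed = list(fixed or [])
    # Same filter as in the diff: keep truthy options, skip "capitalize".
    enabled = [n for n, v in ctx_options.items() if v and n != "capitalize"]
    return fixed + enabled


def s2r_option_sets(ctx, pre=None, post=None):
    # Script to Roman: user options are merged into the pre-options.
    return build_options(ctx.options, pre), list(post or [])


def r2s_option_sets(ctx, pre=None, post=None):
    # Roman to script: user options are merged into the post-options.
    return list(pre or []), build_options(ctx.options, post)


# Hypothetical context: one enabled option, one disabled, plus "capitalize".
ctx = SimpleNamespace(
    options={"capitalize": True, "SomeOption": True, "Other": False}
)

print(s2r_option_sets(ctx, post=["HindiMarathiRomanLoCFix"]))
print(r2s_option_sets(ctx, pre=["HindiMarathiRomanLoCFix"]))
```

The two tuples returned are `(pre_options, post_options)`, which is the order in which they would be handed to a transliteration call.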

+ 16 - 0
scriptshifter/tables/data/gujarati.yml

@@ -0,0 +1,16 @@
+general:
+  name: Gujarati
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Gujarati"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Gujarati"
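
Each `post_config` entry in the table above is a two-item list: a dotted function name and a mapping of keyword arguments. The sketch below shows one way such an entry could be dispatched. It is an assumption-laden illustration, not scriptshifter's loader: `HOOK_REGISTRY`, `register`, `run_hooks`, the dict-based `ctx`, and the string `"BREAK"` sentinel are all hypothetical, and the table is written as the Python structure a YAML parser would produce for the file above.

```python
HOOK_REGISTRY = {}  # hypothetical name -> function lookup


def register(name):
    """Decorator that records a hook function under a dotted name."""
    def deco(fn):
        HOOK_REGISTRY[name] = fn
        return fn
    return deco


@register("aksharamukha.romanizer.s2r_post_config")
def s2r_post_config(ctx, src_script, pre=None, post=None):
    # Stand-in body: record the arguments instead of transliterating.
    ctx["dest"] = f"<{src_script} -> RomanLoC: {ctx['src']}>"
    return "BREAK"  # hypothetical sentinel meaning "stop processing"


# Parsed form of the script_to_roman section of the table above.
table = {
    "script_to_roman": {
        "hooks": {
            "post_config": [
                [
                    "aksharamukha.romanizer.s2r_post_config",
                    {"src_script": "Gujarati"},
                ],
            ],
        },
    },
}


def run_hooks(table, direction, hook_name, ctx):
    """Call each configured hook in order; stop at the first BREAK."""
    for fn_name, kwargs in table[direction]["hooks"].get(hook_name, []):
        result = HOOK_REGISTRY[fn_name](ctx, **kwargs)
        if result == "BREAK":
            return result
    return None


ctx = {"src": "sample text", "dest": None}
print(run_hooks(table, "script_to_roman", "post_config", ctx))
print(ctx["dest"])
```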

+ 24 - 0
scriptshifter/tables/data/index.yml

@@ -58,6 +58,8 @@ greek_classical:
   name: Greek (classical)
 greek_modern:
   name: Greek (modern)
+gujarati:
+  name: Gujarati
 hebrew:
   name: Hebrew
 hindi:
@@ -68,6 +70,8 @@ katakana:
   name: Japanese (Katakana)
 kalmyk_cyrillic:
   name: Kalmyk (Cyrillic)
+kannada:
+  name: Kannada
 kara-kalpak_cyrillic:
   name: Kara-Kalpak (Cyrillic)
 karachai-balkar_cyrillic:
@@ -80,6 +84,8 @@ khakass_cyrillic:
   name: Khakass (Cyrillic)
 khanty_cyrillic:
   name: Khanty (Cyrillic)
+khmer:
+  name: Khmer
 komi_cyrillic:
   name: Komi (Cyrillic)
 korean_nonames:
@@ -96,8 +102,12 @@ lithuanian_cyrillic:
   name: Lithuanian (Cyrillic)
 macedonian:
   name: Macedonian
+marathi:
+  name: Marathi (Devanagari)
 mansi_cyrillic:
   name: Mansi (Cyrillic)
+malayalam:
+  name: Malayalam
 moldovan_cyrillic:
   name: Moldovan (Cyrillic)
 mongolian_cyrillic:
@@ -108,8 +118,16 @@ mordvin_cyrillic:
   name: Mordvin (Cyrillic)
 nenets_cyrillic:
   name: Nenets (Cyrillic)
+oriya:
+  name: Oriya
 ossetic_cyrillic:
   name: Ossetic (Cyrillic)
+pali:
+  name: Pali
+panjabi:
+  name: Panjabi
+prakrit:
+  name: Prakrit (Devanagari)
 pulaar:
   name: Pulaar (Adlam)
 gurmukhi:
@@ -118,10 +136,14 @@ romani_cyrillic:
   name: Romani (Cyrillic)
 russian:
   name: Russian
+sanskrit:
+  name: Sanskrit (Devanagari)
 serbian:
   name: Serbian
 shor_cyrillic:
   name: Shor (Cyrillic)
+sinhalese:
+  name: Sinhalese
 syriac_cyrillic:
   name: Syriac (Cyrillic)
 tajik_cyrillic:
@@ -132,6 +154,8 @@ tamil_brahmi:
   name: Tamil Brahmi
 tamil_extended:
   name: Tamil (extended)
+telugu:
+  name: Telugu
 thai:
   name: Thai
 tatar-kryashen_cyrillic:
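
Every key added to `index.yml` is expected to have a matching table file whose `general.name` carries the same label. A small consistency check along these lines can catch copy-paste slips when many near-identical tables are added at once. This is a sketch under assumptions: `find_name_mismatches` is a hypothetical helper, the inputs are plain dicts standing in for parsed YAML, and the sample data (including the misspelled `Kanada` and the missing `telugu` table) is made up for illustration.

```python
def find_name_mismatches(index, tables):
    """Compare index labels with each table's general.name.

    Returns a list of (key, problem) tuples, sorted by key.
    """
    problems = []
    for key, entry in sorted(index.items()):
        table = tables.get(key)
        if table is None:
            problems.append((key, "missing table"))
            continue
        table_name = table.get("general", {}).get("name")
        if table_name != entry["name"]:
            problems.append(
                (key, f"index says {entry['name']!r}, table says {table_name!r}")
            )
    return problems


# Hypothetical sample data, not taken from the repository.
index = {
    "gujarati": {"name": "Gujarati"},
    "kannada": {"name": "Kannada"},
    "telugu": {"name": "Telugu"},
}
tables = {
    "gujarati": {"general": {"name": "Gujarati"}},
    "kannada": {"general": {"name": "Kanada"}},  # deliberate mismatch
}

for key, problem in find_name_mismatches(index, tables):
    print(f"{key}: {problem}")
```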

+ 16 - 0
scriptshifter/tables/data/kannada.yml

@@ -0,0 +1,16 @@
+general:
+  name: Kannada
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Kannada"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Kannada"

+ 16 - 0
scriptshifter/tables/data/khmer.yml

@@ -0,0 +1,16 @@
+general:
+  name: Khmer
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Khmer"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Khmer"

+ 16 - 0
scriptshifter/tables/data/malayalam.yml

@@ -0,0 +1,16 @@
+general:
+  name: Malayalam
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Malayalam"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Malayalam"

+ 18 - 0
scriptshifter/tables/data/marathi.yml

@@ -0,0 +1,18 @@
+general:
+  name: Marathi (Devanagari)
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Devanagari"
+        - post: ["HindiMarathiRomanLoCFix"]
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Devanagari"
+        - pre: ["HindiMarathiRomanLoCFix"]

+ 16 - 0
scriptshifter/tables/data/oriya.yml

@@ -0,0 +1,16 @@
+general:
+  name: Oriya
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Oriya"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Oriya"

+ 16 - 0
scriptshifter/tables/data/pali.yml

@@ -0,0 +1,16 @@
+general:
+  name: Pali
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Pali"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Pali"

+ 16 - 0
scriptshifter/tables/data/panjabi.yml

@@ -0,0 +1,16 @@
+general:
+  name: Panjabi
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Punjabi"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Punjabi"

+ 16 - 0
scriptshifter/tables/data/prakrit.yml

@@ -0,0 +1,16 @@
+general:
+  name: Prakrit (Devanagari)
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Devanagari"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Devanagari"

+ 16 - 0
scriptshifter/tables/data/sanskrit.yml

@@ -0,0 +1,16 @@
+general:
+  name: Sanskrit (Devanagari)
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Devanagari"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Devanagari"

+ 16 - 0
scriptshifter/tables/data/sinhalese.yml

@@ -0,0 +1,16 @@
+general:
+  name: Sinhalese
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Sinhala"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Sinhala"

+ 16 - 0
scriptshifter/tables/data/telugu.yml

@@ -0,0 +1,16 @@
+general:
+  name: Telugu
+
+script_to_roman:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.s2r_post_config
+        - src_script: "Telugu"
+
+roman_to_script:
+  hooks:
+    post_config:
+      -
+        - aksharamukha.romanizer.r2s_post_config
+        - dest_script: "Telugu"