Greek (#94)

* WIP first pass at ancient and modern Greek.

* WIP - add breathing and punctuation, and tests.

* Update Transliterator config files for Greek (from Randy).

* WIP Greek.

* Fix 4-digit numerals.

* Fix 1-3 digit numerals.

* Implement normalize logic and block for classical Greek.

* Fix EOF numerals parsing; enable modern Greek tests.

* WIP use combined Unicode symbols.

* Use combined diacritics where available.

* Greek: normalize Vrachy, Macron, Dialytika, Prosgegrammeni, Ypogegrammeni.

* Greek: more mapping fixes.

* Merged changes.

* Fix merge issues.

* Inherit normalization rules.

* WIP BOW & EOW handling.

* Regroup vowel assimilation to distinguish tonal vowels.

* Alt implementation of initial and final tokens.

* Add documentation for WB token and normalization.

* Correct section of normalize rules in doc.

* Remove extraneous lines from Dockerfile.
Stefano Cossu 1 year ago
parent
commit
aab0ea04bd

+ 0 - 4
Dockerfile

@@ -4,10 +4,6 @@ RUN apt update
 RUN apt install -y build-essential tzdata gfortran libopenblas-dev libboost-all-dev
 
 ENV TZ=America/New_York
-
-# Copy and compile Kakadu codec.
-WORKDIR ${_workroot}
-
 ENV _workroot "/usr/local/scriptshifter/src"
 
 WORKDIR ${_workroot}

+ 41 - 2
doc/config.md

@@ -194,6 +194,7 @@ that:
   before "BA" but after "AD");
 - strings beginning with different characters are sorted alphabetically.
 
+
 #### `roman_to_script.map`
 
 Type: key-value pairs
@@ -204,14 +205,28 @@ Transliteration rules. Each rule takes the following form:
   "<source>": "<destination>"
 ```
 
-Unicode code points on either side are written using the YAML notation:
-`\u????`
+Unicode code points on either side can be written using the YAML notation
+(`\u????`) or the literal Unicode characters.
 
 These rules can be written in any order, however writing longer
 strings such as full names before individual phonemes and characters makes the
 file more readable. The strings are sorted by the application using the same
 rules described above for the ignore list.
 
+To create specific rules for initial, final, and standalone tokens, the `%`
+character is used to mark the word boundary. Hence:
+
+```
+  "token%": "transliteration of token at beginning of word"
+  "%token": "transliteration of token at end of word"
+  "%token%": "transliteration of standalone token"
+  "token": "transliteration of token anywhere else not specified"
+```
+
+The order of priority is: standalone, initial, final, and medial. As for the
+other tokens, these can be in any order in the configuration and will be
+reordered when the table is loaded.
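
The priority order described above can be sketched in a few lines of Python. This is only an illustration: `split_key` and `by_priority` are hypothetical helper names, not part of the application. The `%` marker and the two flag values mirror the `TOKEN_WB_MARKER`, `BOW`, and `EOW` constants this commit adds to `scriptshifter/tables/__init__.py`.

```python
WB_MARKER = "%"
BOW = 1 << 1  # initial token: marker at the end of the key, e.g. "token%"
EOW = 1 << 0  # final token: marker at the start of the key, e.g. "%token"


def split_key(key):
    """Return (bare_token, flags) for a map key (hypothetical helper)."""
    flags = 0
    if key.endswith(WB_MARKER):
        flags |= BOW
    if key.startswith(WB_MARKER):
        flags |= EOW
    return key.strip(WB_MARKER), flags


def by_priority(keys):
    """Order variants of one token: standalone, initial, final, medial."""
    # A higher flag value means a higher priority, so sort in reverse.
    return sorted(keys, key=lambda k: split_key(k)[1], reverse=True)


# The four variants may appear in any order in the configuration.
variants = ["s", "%s", "s%", "%s%"]
ordered = by_priority(variants)
```

With both flags set, a standalone key scores 3, an initial key 2, a final key 1, and a medial key 0, which yields the documented order.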
+
 
 #### `roman_to_script.hooks`
 
@@ -274,3 +289,27 @@ Type: list
 
 This is only a valid subsection of S2R. It removes double capitalization rules
 from the inherited list.
+
+#### `script_to_roman.normalize`
+
+Type: key-value pairs; value is a list of strings.
+
+Normalization rules. Each key is paired with one or more tokens that are
+replaced with the key before the transliteration. E.g. if we have (from the
+classical Greek config):
+
+```
+normalize:
+"\u03B1":  # α 	Greek Small Letter Alpha
+  - "\u1F00"  # ἀ 	Greek Small Letter Alpha With Psili
+  - "\u1F80"  # ᾀ 	Greek Small Letter Alpha With Psili And Ypogegrammeni
+  - "\u1FB0"  # ᾰ 	Greek Small Letter Alpha With Vrachy
+  - "\u1FB1"  # ᾱ 	Greek Small Letter Alpha With Macron
+  - "\u1FB3"  # ᾳ 	Greek Small Letter Alpha With Ypogegrammeni
+```
+
+then all ἀ, ᾀ, ᾰ, etc. are normalized to α, so only α needs to be
+mapped for transliteration.
+
+This section is only available for S2R at the moment. It is inherited by
+child configurations.
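
As a rough illustration of the normalization step documented above, the following sketch inverts the key-to-variants rules into a variant-to-key lookup and applies it to a string before transliteration. The function names `invert` and `normalize` are assumptions made for this example; the actual loading logic is in the `load_table` change in this commit, and the point where the rules are applied is not shown in this diff.

```python
# Rules in the documented shape: key = normalized form, value = variants.
normalize_cfg = {
    "\u03B1": ["\u1F00", "\u1F80", "\u1FB0", "\u1FB1", "\u1FB3"],
}


def invert(cfg):
    """Build a variant -> normalized-form lookup (hypothetical helper)."""
    return {
        variant: norm
        for norm, variants in cfg.items()
        for variant in variants
    }


def normalize(text, lookup):
    """Replace every variant found in text with its normalized form."""
    # Longer variants go first so multi-character tokens are not split.
    for variant in sorted(lookup, key=len, reverse=True):
        text = text.replace(variant, lookup[variant])
    return text


lookup = invert(normalize_cfg)
# "ἀγαθᾳ" loses its breathing mark and iota subscript.
result = normalize("\u1F00\u03B3\u03B1\u03B8\u1FB3", lookup)
```

After this pass, the transliteration map only needs an entry for plain α rather than one per accented variant.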

+ 418 - 108
legacy/data/GreekClassicalRomanization.cfg

@@ -3,153 +3,357 @@
 # Added fields to FieldsIncluded 19 July 2007
 # Added curly single and double quotes 06 April 2010
 # Last updated 30 May 2013
+# Updated by Randall K. Barry, 11 May 2023 to add script code
 
 [General]
-Name=Greek classical
+Name=Greek (Classical)
+ScriptCode=(S
+
+# UNICODE GREEK CHARACTER RANGE: U+0370-U+03FF
+# ISO 15924 4-alpha script code: Grek
+
 Truncation=%
 
 [RomanToScript]
-FieldsIncluded=100 110 111 130 240 245 246 250 260 264 440 490 600 610 611 630 651 700 710 711 730 740 800 830
+FieldsIncluded=240 245 246 247 250 260 264 440 490 740
 SubfieldsAlwaysExcluded=uvxy0123456789
-OtherSubfieldsExcludedByTag=100/e 110/e 111/j 246/i 260/c 264/c 650/a 700/e 700/i 710/e 710/i 711/i 711/j 730/i
-Subfield6Code=(S
-# in this stanza no attempt has been made to provide rough breathing in the script
-#  but others may well wish to provide it - a policy issue as much as anything
+OtherSubfieldsExcludedByTag=246/i 260/c 264/c
+
+# In this conversion, no attempt has been made to provide rough breathing in the script
+# but others may well wish to provide it. This is a policy issue.
 
 # RDA boilerplate phrases not transliterated:
 Place of publication not identified=Place of publication not identified
 publisher not identified=publisher not identified
+date of publication not identified=date of publication not identified
+At head of title=At head of title
+Colophon=Colophon
+On cover=On cover
+Cover title.=Cover title.
+and others=and others
+and one other=and one other
+and two others=and two others
+and three others=and three others
+and four others=and four others
+and five others=and five others
+and six others=and six others
+and seven others=and seven others
+and eight others=and eight others
+and nine others=and nine others
+and ten others=and ten others
+and eleven others=and eleven others
+and twelve others=and twelve others
+and thirteen others=and thirteen others
+and fourteen others=and fourteen others
+and fifteen others=and fifteen others
+and sixteen others=and sixteen others
+and seventeen others=and seventeen others
+and eighteen others=and eighteen others
+and nineteen others=and nineteen others
+et al.=et al.
+i.e.=i.e.
+S.l.=S.l.
+s.n.=s.n.
+XLIX=XLIX
+XLIV=XLIV
+XLIII=XLIII
+XLII=XLII
+XLI=XLI
+XLVIII=XLVIII
+XLVII=XLVII
+XLVI=XLVI
+XLV=XLV
+XL=XL
+LXXXIX=LXXXIX
+LXXXIV=LXXXIV
+LXXXIII=LXXXIII
+LXXXII=LXXXII
+LXXXI=LXXXI
+LXXXVIII=LXXXVIII
+LXXXVII=LXXXVII
+LXXXVI=LXXXVI
+LXXXV=LXXXV
+LXXX=LXXX
+LXXIX=LXXIX
+LXXIV=LXXIV
+LXXIII=LXXIII
+LXXII=LXXII
+LXXI=LXXI
+LXXVIII=LXXVIII
+LXXVII=LXXVII
+LXXVI=LXXVI
+LXXV=LXXV
+LXX=LXX
+LXIX=LXIX
+LXIV=LXIV
+LXIII=LXIII
+LXII=LXII
+LXI=LXI
+LXVIII=LXVIII
+LXVII=LXVII
+LXVI=LXVI
+LXV=LXV
+LX=LX
+LIX=LIX
+LIV=LIV
+LIII=LIII
+LII=LII
+LI=LI
+LVIII=LVIII
+LVII=LVII
+LVI=LVI
+LV=LV
+XXXIX=XXXIX
+XXXIV=XXXIV
+XXXIII=XXXIII
+XXXII=XXXII
+XXXI=XXXI
+XXXVIII=XXXVIII
+XXXVII=XXXVII
+XXXVI=XXXVI
+XXXV=XXXV
+XXX=XXX
+XXIX=XXIX
+XXIV=XXIV
+XXIII=XXIII
+XXII=XXII
+XXI=XXI
+XXVIII=XXVIII
+XXVII=XXVII
+XXVI=XXVI
+XXV=XXV
+XX=XX
+XIX=XIX
+XIV=XIV
+XIII=XIII
+XII=XII
+XI=XI
+XVIII=XVIII
+XVII=XVII
+XVI=XVI
+XV=XV
+IV=IV
+VIII=VIII
+VII=VII
+VI=VI
+IX=IX
+III=III
+II=II
+
+# combining grave U+0300
+# combining acute U+0301
+# combining circumflex U+0302
+# combining tilde U+0303
+# combining macron U+0304
+# combining breve U+0306
+# combining dot above U+0307
+# combining diaeresis U+0308
+# combining ring above U+030A
+# combining double acute U+030B
+# combining caron (hachek) U+030C
+# combining candrabindu U+0310
+# combining comma above U+0313
+# combining rough breathing mark U+0314
+# combining dot below U+0323
+# combining dieresis below U+0324
+# combining comma below U+0326
+# combining cedilla U+0327
+# combining ogonek (hook) U+0328
+# combining low line U+0332
+# combining double low line U+0333
+# combining left ligature U+FE20
+# combining right ligature U+FE21
+# soft sign/prime (spacing) U+02B9
+# hard sign/double prime (spacing) U+02BA
+# ayn(spacing) U+02BB
+# alif (spacing) U+02BC
+# middle dot (spacing) U+00B7
 
-# this conversion shouldn't be needed, but does no harm
-AU=U+0391U+03C5
+.)U+0333=U+03FF
+.)=U+037D
+?U+0333=U+037E
+"U+0332=U+201C
+"U+0333=U+201D
+'U+0332=U+2018
+'U+0333=U+2019
+(.U+0333=U+03FE
+(.=U+037C
+|)U+0333=U+03FD
+|)=U+037B
+# Left pointing double angle quotation mark
+U+003CU+003C=U+00AB
+# Right pointing double angle quotation mark
+U+003EU+003E=U+00BB
+6U+0332=U+03DB
+6U+0333=U+03DA
 Au=U+0391U+03C5
-A=U+0391
 au=U+03B1U+03C5
-a=U+03B1
+aU+0301=U+03AC
 B=U+0392
 b=U+03B2
-# this conversion shouldn't be needed, but does no harm
-CH=U+03A7
+bU+0333=U+03D0
 Ch=U+03A7
 ch=U+03C7
-DU+0332%=U+039DU+03C4
+cU+030C=U+03EB
+DU+0332=U+039DU+03C4
+dU+0332=U+03BDU+03C4
 D=U+0394
-dU+0332%=U+03BDU+03C4
 d=U+03B4
-# this conversion shouldn't be needed, but does no harm
-EU+0304U=U+0397U+03C5
-EU+0304u=U+0397U+03C5
-EU+0304=U+0397
-# this conversion shouldn't be needed, but does no harm
-EU=U+0395U+03C5
 Eu=U+0395U+03C5
-E=U+0395
-eU+0304u=U+03B7U+03C5
-eU+0304=U+03B7
 eu=U+03B5U+03C5
+EU+0301=U+0388
+eU+0301=U+03AD
+HeU+0304U+0301=U+0389U+0314
+eU+0304U+0301=U+03AE
+EU+0304U+0301=U+0389
+EU+0304=U+0397
+eU+0304=U+03B7
+eU+0332=U+03F5
+eU+0333=U+03F6
+F=U+03E4
+f=U+03E5
+GU+0332=U+03DC
+gU+0332=U+03DD
+HaU+0301=U+0386U+0314
+haU+0301=U+03ACU+0314
+Ha=U+0391U+0314
+ha=U+03B1U+0314
+A=U+0391
+a=U+03B1
+heU+0304U+0301=U+03AEU+0314
+HeU+0301=U+0388U+0314
+heU+0301=U+03ADU+0314
+HeU+0304=U+0397U+0314
+heU+0304=U+03B7U+0314
+He=U+0395U+0314
+he=U+03B5U+0314
+E=U+0395
 e=U+03B5
-G=U+0393
-g=U+03B3
-# this conversion shouldn't be needed, but does no harm
-HA=U+0391
-Ha=U+0391
-# this conversion shouldn't be needed, but does no harm
-HEU+0304=U+0397
-HeU+0304=U+0397
-# this conversion shouldn't be needed, but does no harm
-HE=U+0395
-He=U+0395
-# this conversion shouldn't be needed, but does no harm
-HI=U+0399
-Hi=U+0399
-# this conversion shouldn't be needed, but does no harm
-HOU+0304=U+03A9
-HoU+0304=U+03A9
-# this conversion shouldn't be needed, but does no harm
-HO=U+039F
-Ho=U+039F
-# this conversion shouldn't be needed, but does no harm
-HY=U+03A5
-Hy=U+03A5
-H=U+0314
-ha=U+03B1
-heU+0304=U+03B7
-he=U+03B5
-hi=U+03B9
-hoU+0304=U+03C9
-ho=U+03BF
-hy=U+03C5
-h=U+0314
-# this conversion shouldn't be needed, but does no harm
-IU=U+0399U+03C5
+HiU+0301=U+038AU+0314
+hiU+0301=U+03AFU+0314
+Hi=U+0399U+0314
+hi=U+03B9U+0314
+HoU+0304U+0301=U+038FU+0314
+HoU+0301=U+038CU+0314
+hoU+0304=U+03C9U+0314
+HoU+0304=U+03A9U+0314
+Ho=U+039FU+0314
+ho=U+03BFU+0314
+HU+0307=U+03E8
+hU+0307=U+03E9
+HU+0323=U+0370
+hU+0323=U+0371
+HU+0332=U+03E6
+hU+0332=U+03E7
+HyU+0301=U+038EU+0314
+Hy=U+03A5U+0314
+hy=U+03C5U+0314
 Iu=U+0399U+03C5
-I=U+0399
 iu=U+03B9U+03C5
-i=U+03B9
-K=U+039A
-k=U+03BA
+IU+0301=U+038A
+iU+0301=U+03AF
+IU+0308=U+03AA
+iU+0308U+0301=U+0390
+iU+0308=U+03CA
+J=U+037F
+j=U+03F3
+KU+0323y=U+03EC
+kU+0323y=U+03ED
+KU+0332=U+03DE
+kU+0332=U+03DF
+KU+0326=U+03CF
+kU+0326=U+03D7
+kU+0332=U+03F0
 L=U+039B
 l=U+03BB
 M=U+039C
 m=U+03BC
-N=U+039D
-ng=U+03B3U+0393
-nk=U+03B3U+03BA
-nx=U+03B3U+03BE
 nch=U+03B3U+03C7
+ng=U+03B3U+03B3
+%nk%=U+03B3U+03BA
+nx=U+03B3U+03BE
+NoU+0332=U+2116
+N=U+039D
 n=U+03BD
-# this conversion shouldn't be needed, but does no harm
-OU+0304U=U+03C9U+03C5
-OU+0304u=U+03C9U+03C5
-# this conversion shouldn't be needed, but does no harm
-OU=U+039FU+03C5
+K=U+039A
+k=U+03BA
+G=U+0393
+g=U+03B3
 Ou=U+039FU+03C5
-OU+0304=U+03A9
-O=U+039F
 ou=U+03BFU+03C5
+OU+0301=U+038C
+oU+0301=U+03CC
+OU+0304U+0301=U+038F
+oU+0304U+0301=U+03CE
+OU+0304=U+03A9
 oU+0304=U+03C9
+O=U+039F
 o=U+03BF
-# this conversion shouldn't be needed, but does no harm
-PH=U+03A6
 Ph=U+03A6
-# this conversion shouldn't be needed, but does no harm
-PS=U+03A8
-Ps=U+03A8
-P=U+03A0
 ph=U+03C6
+Ps=U+03A8
 ps=U+03C8
+pU+0333hU+0333=U+03D5
+pU+0333=U+03D6
+P=U+03A0
 p=U+03C0
-# this conversion shouldn't be needed, but does no harm
-RH=U+03A1U
-Rh=U+03A1U
+Q=U+03D8
+q=U+03D9
+Rh=U+03A1U+0314
+rh=U+03C1U+0314
+rU+0332=U+03F1
+rU+0333=U+03FC
 R=U+03A1
-rh=U+03C1U
 r=U+03C1
+SU+FE20U+0332SU+FE21U+0332=U+0372
+sU+FE20U+0332sU+FE21U+0332=U+0373
+SU+FE20HU+FE21=U+03F7
+sU+FE20hU+FE21=U+03F8
+SU+FE20SU+FE21=U+03E1
+sU+FE20sU+FE21=U+03E0
+SU+030C=U+03E2
+sU+030C=U+03E3
+SU+0323=U+03FA
+sU+0323=U+03FB
+SU+0332=U+03F9
+sU+0332=U+03F2
 S=U+03A3
 %s=U+03C2
 s=U+03C3
-# this conversion shouldn't be needed, but does no harm
-TH=U+0398
+TU+0333HU+0333=U+03F4
+tU+0333hU+0333=U+03D1
 Th=U+0398
-T=U+03A4
 th=U+03B8
+TU+0323i=U+03EE
+tU+0323i=U+03EF
+T=U+03A4
 t=U+03C4
-# this conversion shouldn't be needed, but does no harm
-U=U+03A5
-# this conversion shouldn't be needed (always forms part of diphthong), but does no harm
-u=U+03C5
-# this conversion shouldn't be needed in classical Greek, but does no harm
+I=U+0399
+i=U+03B9
+U+0020U+0301=U+0384
+U+0020U+0308U+0301=U+0385
+;U+0333=U+0387
+U+02B9=U+0374
+U+0326=U+0375
+U+0328=U+037A
 V=U+0392
-# this conversion shouldn't be needed in classical Greek, but does no harm
 v=U+03B2
+WU+0323=U+0376
+wU+0323=U+0377
 X=U+039E
 x=U+03BE
+YU+0301U+0333=U+03D3
+YU+0301=U+038E
+yU+0301=U+03CD
+YU+0308U+0333=U+03D4
+yU+0308U+0301=U+03B0
+YU+0308=U+03AB
+yU+0308=U+03CB
+YU+0333=U+03D2
 Y=U+03A5
 y=U+03C5
 Z=U+0396
 z=U+03B6
-# rather than provide all long vowel versions, simply strip what's left
-U+0304=
 
 [ScriptToRoman]
 FieldsIncluded=100 110 111 130 240 245 246 250 260 264 440 490 600 610 611 630 651 700 710 711 730 740 800 830
@@ -160,18 +364,56 @@ OtherSubfieldsExcludedByTag=100/e 110/e 111/j 246/i 260/c 264/c 650/a 700/e 700/
 Place of publication not identified=Place of publication not identified
 publisher not identified=publisher not identified
 
-U+201C="
-U+201D="
-U+2018='
-U+2019='
-U+0300=
-U+0301=
-U+0302=
-U+0313=
-U+0314U+03A1=Rh
-U+0314U+03C1=rh
-# this is known to be ambiguous...
-U+0314=h
+
+U+201C="U+0332
+U+201D="U+0333
+U+2018='U+0332
+U+2019='U+0333
+U+2116=NoU+0332
+# U+0300=
+# U+0301=
+# U+0302=
+# U+0313=
+U+0370=HU+0323
+U+0371=hU+0323
+U+0372=SU+FE20U+0332SU+FE21U+0332
+U+0373=sU+FE20U+0332sU+FE21U+0332
+U+0374=U+02B9
+U+0375=U+0326
+U+0376=WU+0323
+U+0377=wU+0323
+# U+0378 reserved
+# U+0379 reserved
+U+037A=U+0328
+U+037B=|)
+U+037C=(.
+U+037D=.)
+U+037E=?U+0333
+U+037F=J
+# U+0380 reserved 
+# U+0381 reserved 
+# U+0382 reserved 
+# U+0383 reserved 
+U+0384=U+0020U+0301
+U+0385=U+0020U+0308U+0301
+U+0386U+0314=HaU+0301
+U+0386=AU+0301
+U+0387=;U+0333
+U+0388U+0314=HeU+0301
+U+0388=EU+0301
+U+0389U+0314=HeU+0304U+0301
+U+0389=EU+0304U+0301
+U+038AU+0314=HiU+0301
+U+038A=IU+0301
+# U+038B reserved
+U+038CU+0314=HoU+0301
+U+038C=OU+0301
+# U+038D reserved
+U+038EU+0314=HyU+0301
+U+038E=YU+0301
+U+038FU+0314=HoU+0304U+0301
+U+038F=OU+0304U+0301
+U+0390=iU+0308U+0301
 U+0391U+0314=Ha
 U+0391U+03C5=Au
 U+0391=A
@@ -190,8 +432,9 @@ U+0399U+03C5=Iu
 U+0399=I
 U+039A=K
 U+039B=L
+U+039CU+03C0%=B
 U+039C=M
-U+039DU+03C4%=DU+0332
+# U+039DU+03C4%=DU+0332
 U+039D=N
 U+039E=X
 U+039FU+0314=Ho
@@ -200,6 +443,7 @@ U+039F=O
 U+03A0=P
 U+03A1U+0314=Rh
 U+03A1=R
+# U+03A2 reserved 
 U+03A3=S
 U+03A4=T
 U+03A5U+0314=Hy
@@ -209,12 +453,23 @@ U+03A7=Ch
 U+03A8=Ps
 U+03A9U+0314=HoU+0304
 U+03A9=OU+0304
+U+03AA=IU+0308
+U+03AB=YU+0308
+U+03ACU+0314=haU+0301
+U+03AC=aU+0301
+U+03ADU+0314=heU+0301
+U+03AD=eU+0301
+U+03AEU+0314=heU+0304U+0301
+U+03AE=eU+0304U+0301
+U+03AFU+0314=hiU+0301
+U+03AF=iU+0301
+U+03B0=yU+0308U+0301
 U+03B1U+0314=ha
 U+03B1U+03C5=au
 U+03B1=a
 U+03B2=b
 U+03B3U+03B3=ng
-U+03B3U+03BA=nk
+%U+03B3U+03BA%=nk
 U+03B3U+03BE=nx
 U+03B3U+03C7=nch
 U+03B3=g
@@ -231,8 +486,9 @@ U+03B9U+03C5=iu
 U+03B9=i
 U+03BA=k
 U+03BB=l
+U+03BCU+03C0%=b
 U+03BC=m
-U+03BDU+03C4%=dU+0332
+# U+03BDU+03C4%=dU+0332
 U+03BD=n
 U+03BE=x
 U+03BFU+0314=ho
@@ -251,3 +507,57 @@ U+03C7=ch
 U+03C8=ps
 U+03C9U+0314=hoU+0304
 U+03C9=oU+0304
+U+03CA=iU+0308
+U+03CB=yU+0308
+U+03CC=oU+0301
+U+03CD=yU+0301
+U+03CE=oU+0304U+0301
+U+03CF=KU+0326
+U+03D0=bU+0333
+U+03D1=tU+0333hU+0333
+U+03D2=YU+0333
+U+03D3=YU+0301U+0333
+U+03D4=YU+0308U+0333
+U+03D5=pU+0333hU+0333
+U+03D6=pU+0333
+U+03D7=kU+0326
+U+03D8=Q
+U+03D9=q
+U+03DA=6U+0333
+U+03DB=6U+0332
+U+03DC=GU+0332
+U+03DD=gU+0332
+U+03DE=KU+0324
+U+03DF=kU+0324
+U+03E0=sU+FE20sU+FE21
+U+03E1=SU+FE20SU+FE21
+U+03E2=SU+030C
+U+03E3=sU+030C
+U+03E4=F
+U+03E5=f
+U+03E6=HU+0332
+U+03E7=hU+0332
+U+03E8=HU+0307
+U+03E9=hU+0307
+U+03EA=CU+030C
+U+03EB=cU+030C
+U+03EC=KU+0323y
+U+03ED=kU+0323y
+U+03EE=TU+0323i
+U+03EF=tU+0323i
+U+03F0=kU+0332
+U+03F1=rU+0332
+U+03F2=sU+0332
+U+03F3=j
+U+03F4=TU+0333HU+0333
+U+03F5=eU+0332
+U+03F6=eU+0333
+U+03F7=SU+FE20HU+FE21
+U+03F8=sU+FE20hU+FE21
+U+03F9=SU+0332
+U+03FA=SU+0323
+U+03FB=sU+0323
+U+03FC=rU+0333
+U+03FD=|)U+0333
+U+03FE=(.U+0333
+U+03FF=.)U+0333

+ 559 - 0
legacy/data/GreekModernRomanization.cfg

@@ -0,0 +1,559 @@
+# Original table by Randall K. Barry, 11 April 2023
+# Updated by Randall K. Barry, 11 May 2023 to add script code
+
+[General]
+Name=Greek (Modern)
+ScriptCode=(S
+
+# UNICODE GREEK CHARACTER RANGE: U+0370-U+03FF
+# ISO 15924 4-alpha script code: Grek
+
+Truncation=%
+
+[RomanToScript]
+FieldsIncluded=240 245 246 247 250 260 264 440 490 740
+SubfieldsAlwaysExcluded=uvxy0123456789
+OtherSubfieldsExcludedByTag=246/i 260/c 264/c
+
+# In this conversion, no attempt has been made to provide rough breathing in the script
+# but others may well wish to provide it. This is a policy issue.
+
+# RDA boilerplate phrases not transliterated:
+Place of publication not identified=Place of publication not identified
+publisher not identified=publisher not identified
+date of publication not identified=date of publication not identified
+At head of title=At head of title
+Colophon=Colophon
+On cover=On cover
+Cover title.=Cover title.
+and others=and others
+and one other=and one other
+and two others=and two others
+and three others=and three others
+and four others=and four others
+and five others=and five others
+and six others=and six others
+and seven others=and seven others
+and eight others=and eight others
+and nine others=and nine others
+and ten others=and ten others
+and eleven others=and eleven others
+and twelve others=and twelve others
+and thirteen others=and thirteen others
+and fourteen others=and fourteen others
+and fifteen others=and fifteen others
+and sixteen others=and sixteen others
+and seventeen others=and seventeen others
+and eighteen others=and eighteen others
+and nineteen others=and nineteen others
+et al.=et al.
+i.e.=i.e.
+S.l.=S.l.
+s.n.=s.n.
+XLIX=XLIX
+XLIV=XLIV
+XLIII=XLIII
+XLII=XLII
+XLI=XLI
+XLVIII=XLVIII
+XLVII=XLVII
+XLVI=XLVI
+XLV=XLV
+XL=XL
+LXXXIX=LXXXIX
+LXXXIV=LXXXIV
+LXXXIII=LXXXIII
+LXXXII=LXXXII
+LXXXI=LXXXI
+LXXXVIII=LXXXVIII
+LXXXVII=LXXXVII
+LXXXVI=LXXXVI
+LXXXV=LXXXV
+LXXX=LXXX
+LXXIX=LXXIX
+LXXIV=LXXIV
+LXXIII=LXXIII
+LXXII=LXXII
+LXXI=LXXI
+LXXVIII=LXXVIII
+LXXVII=LXXVII
+LXXVI=LXXVI
+LXXV=LXXV
+LXX=LXX
+LXIX=LXIX
+LXIV=LXIV
+LXIII=LXIII
+LXII=LXII
+LXI=LXI
+LXVIII=LXVIII
+LXVII=LXVII
+LXVI=LXVI
+LXV=LXV
+LX=LX
+LIX=LIX
+LIV=LIV
+LIII=LIII
+LII=LII
+LI=LI
+LVIII=LVIII
+LVII=LVII
+LVI=LVI
+LV=LV
+XXXIX=XXXIX
+XXXIV=XXXIV
+XXXIII=XXXIII
+XXXII=XXXII
+XXXI=XXXI
+XXXVIII=XXXVIII
+XXXVII=XXXVII
+XXXVI=XXXVI
+XXXV=XXXV
+XXX=XXX
+XXIX=XXIX
+XXIV=XXIV
+XXIII=XXIII
+XXII=XXII
+XXI=XXI
+XXVIII=XXVIII
+XXVII=XXVII
+XXVI=XXVI
+XXV=XXV
+XX=XX
+XIX=XIX
+XIV=XIV
+XIII=XIII
+XII=XII
+XI=XI
+XVIII=XVIII
+XVII=XVII
+XVI=XVI
+XV=XV
+IV=IV
+VIII=VIII
+VII=VII
+VI=VI
+IX=IX
+III=III
+II=II
+
+# combining grave U+0300
+# combining acute U+0301
+# combining circumflex U+0302
+# combining tilde U+0303
+# combining macron U+0304
+# combining breve U+0306
+# combining dot above U+0307
+# combining diaeresis U+0308
+# combining ring above U+030A
+# combining double acute U+030B
+# combining caron (hachek) U+030C
+# combining candrabindu U+0310
+# combining comma above U+0313
+# combining rough breathing mark U+0314
+# combining dot below U+0323
+# combining dieresis below U+0324
+# combining comma below U+0326
+# combining cedilla U+0327
+# combining ogonek (hook) U+0328
+# combining low line U+0332
+# combining double low line U+0333
+# combining left ligature U+FE20
+# combining right ligature U+FE21
+# soft sign/prime (spacing) U+02B9
+# hard sign/double prime (spacing) U+02BA
+# ayn(spacing) U+02BB
+# alif (spacing) U+02BC
+# middle dot (spacing) U+00B7
+
+.)U+0333=U+03FF
+.)=U+037D
+?U+0333=U+037E
+"U+0332=U+201C
+"U+0333=U+201D
+'U+0332=U+2018
+'U+0333=U+2019
+(.U+0333=U+03FE
+(.=U+037C
+|)U+0333=U+03FD
+|)=U+037B
+# Left pointing double angle quotation mark
+U+003CU+003C=U+00AB
+# Right pointing double angle quotation mark
+U+003EU+003E=U+00BB
+6U+0332=U+03DB
+6U+0333=U+03DA
+Au=U+0391U+03C5
+au=U+03B1U+03C5
+aU+0301=U+03AC
+B%=U+039CU+03C0
+b%=U+03BCU+03C0
+bU+0333=U+03D0
+Ch=U+03A7
+ch=U+03C7
+cU+030C=U+03EB
+DU+0332=U+039DU+03C4
+dU+0332=U+03BDU+03C4
+D=U+0394
+d=U+03B4
+Eu=U+0395U+03C5
+eu=U+03B5U+03C5
+EU+0301=U+0388
+eU+0301=U+03AD
+HeU+0304U+0301=U+0389U+0314
+eU+0304U+0301=U+03AE
+EU+0304U+0301=U+0389
+EU+0304=U+0397
+eU+0304=U+03B7
+eU+0332=U+03F5
+eU+0333=U+03F6
+F=U+03E4
+f=U+03E5
+GU+0332=U+03DC
+gU+0332=U+03DD
+HaU+0301=U+0386U+0314
+haU+0301=U+03ACU+0314
+Ha=U+0391U+0314
+ha=U+03B1U+0314
+A=U+0391
+a=U+03B1
+heU+0304U+0301=U+03AEU+0314
+HeU+0301=U+0388U+0314
+heU+0301=U+03ADU+0314
+HeU+0304=U+0397U+0314
+heU+0304=U+03B7U+0314
+He=U+0395U+0314
+he=U+03B5U+0314
+E=U+0395
+e=U+03B5
+HiU+0301=U+038AU+0314
+hiU+0301=U+03AFU+0314
+Hi=U+0399U+0314
+hi=U+03B9U+0314
+HoU+0304U+0301=U+038FU+0314
+HoU+0301=U+038CU+0314
+hoU+0304=U+03C9U+0314
+HoU+0304=U+03A9U+0314
+Ho=U+039FU+0314
+ho=U+03BFU+0314
+HU+0307=U+03E8
+hU+0307=U+03E9
+HU+0323=U+0370
+hU+0323=U+0371
+HU+0332=U+03E6
+hU+0332=U+03E7
+HyU+0301=U+038EU+0314
+Hy=U+03A5U+0314
+hy=U+03C5U+0314
+Iu=U+0399U+03C5
+iu=U+03B9U+03C5
+IU+0301=U+038A
+iU+0301=U+03AF
+IU+0308=U+03AA
+iU+0308U+0301=U+0390
+iU+0308=U+03CA
+J=U+037F
+j=U+03F3
+KU+0323y=U+03EC
+kU+0323y=U+03ED
+KU+0332=U+03DE
+kU+0332=U+03DF
+KU+0326=U+03CF
+kU+0326=U+03D7
+kU+0332=U+03F0
+L=U+039B
+l=U+03BB
+M=U+039C
+m=U+03BC
+nch=U+03B3U+03C7
+ng=U+03B3U+03B3
+%nk%=U+03B3U+03BA
+nx=U+03B3U+03BE
+NoU+0332=U+2116
+N=U+039D
+n=U+03BD
+K=U+039A
+k=U+03BA
+G=U+0393
+g=U+03B3
+Ou=U+039FU+03C5
+ou=U+03BFU+03C5
+OU+0301=U+038C
+oU+0301=U+03CC
+OU+0304U+0301=U+038F
+oU+0304U+0301=U+03CE
+OU+0304=U+03A9
+oU+0304=U+03C9
+O=U+039F
+o=U+03BF
+Ph=U+03A6
+ph=U+03C6
+Ps=U+03A8
+ps=U+03C8
+pU+0333hU+0333=U+03D5
+pU+0333=U+03D6
+P=U+03A0
+p=U+03C0
+Q=U+03D8
+q=U+03D9
+Rh=U+03A1U+0314
+rh=U+03C1U+0314
+rU+0332=U+03F1
+rU+0333=U+03FC
+R=U+03A1
+r=U+03C1
+SU+FE20U+0332SU+FE21U+0332=U+0372
+sU+FE20U+0332sU+FE21U+0332=U+0373
+SU+FE20HU+FE21=U+03F7
+sU+FE20hU+FE21=U+03F8
+SU+FE20SU+FE21=U+03E1
+sU+FE20sU+FE21=U+03E0
+SU+030C=U+03E2
+sU+030C=U+03E3
+SU+0323=U+03FA
+sU+0323=U+03FB
+SU+0332=U+03F9
+sU+0332=U+03F2
+S=U+03A3
+%s=U+03C2
+s=U+03C3
+TU+0333HU+0333=U+03F4
+tU+0333hU+0333=U+03D1
+Th=U+0398
+th=U+03B8
+TU+0323i=U+03EE
+tU+0323i=U+03EF
+T=U+03A4
+t=U+03C4
+I=U+0399
+i=U+03B9
+U+0020U+0301=U+0384
+U+0020U+0308U+0301=U+0385
+;U+0333=U+0387
+U+02B9=U+0374
+U+0326=U+0375
+U+0328=U+037A
+V=U+0392
+v=U+03B2
+WU+0323=U+0376
+wU+0323=U+0377
+X=U+039E
+x=U+03BE
+YU+0301U+0333=U+03D3
+YU+0301=U+038E
+yU+0301=U+03CD
+YU+0308U+0333=U+03D4
+yU+0308U+0301=U+03B0
+YU+0308=U+03AB
+yU+0308=U+03CB
+YU+0333=U+03D2
+Y=U+03A5
+y=U+03C5
+Z=U+0396
+z=U+03B6
+
+[ScriptToRoman]
+FieldsIncluded=100 110 111 130 240 245 246 250 260 264 440 490 600 610 611 630 651 700 710 711 730 740 800 830
+SubfieldsAlwaysExcluded=uvxy0123456789
+OtherSubfieldsExcludedByTag=100/e 110/e 111/j 246/i 260/c 264/c 650/a 700/e 700/i 710/e 710/i 711/i 711/j 730/i
+
+# RDA boilerplate phrases not transliterated:
+Place of publication not identified=Place of publication not identified
+publisher not identified=publisher not identified
+
+
+U+201C="U+0332
+U+201D="U+0333
+U+2018='U+0332
+U+2019='U+0333
+U+2116=NoU+0332
+# U+0300=
+# U+0301=
+# U+0302=
+# U+0313=
+U+0370=HU+0323
+U+0371=hU+0323
+U+0372=SU+FE20U+0332SU+FE21U+0332
+U+0373=sU+FE20U+0332sU+FE21U+0332
+U+0374=U+02B9
+U+0375=U+0326
+U+0376=WU+0323
+U+0377=wU+0323
+# U+0378 reserved
+# U+0379 reserved
+U+037A=U+0328
+U+037B=|)
+U+037C=(.
+U+037D=.)
+U+037E=?U+0333
+U+037F=J
+# U+0380 reserved 
+# U+0381 reserved 
+# U+0382 reserved 
+# U+0383 reserved 
+U+0384=U+0020U+0301
+U+0385=U+0020U+0308U+0301
+U+0386U+0314=HaU+0301
+U+0386=AU+0301
+U+0387=;U+0333
+U+0388U+0314=HeU+0301
+U+0388=EU+0301
+U+0389U+0314=HeU+0304U+0301
+U+0389=EU+0304U+0301
+U+038AU+0314=HiU+0301
+U+038A=IU+0301
+# U+038B reserved
+U+038CU+0314=HoU+0301
+U+038C=OU+0301
+# U+038D reserved
+U+038EU+0314=HyU+0301
+U+038E=YU+0301
+U+038FU+0314=HoU+0304U+0301
+U+038F=OU+0304U+0301
+U+0390=iU+0308U+0301
+U+0391U+0314=Ha
+U+0391U+03C5=Au
+U+0391=A
+U+0392=V
+U+0393=G
+U+0394=D
+U+0395U+0314=He
+U+0395U+03C5=Eu
+U+0395=E
+U+0396=Z
+U+0397U+0314=HeU+0304
+U+0397=EU+0304
+U+0398=Th
+U+0399U+0314=Hi
+U+0399U+03C5=Iu
+U+0399=I
+U+039A=K
+U+039B=L
+U+039CU+03C0%=B
+U+039C=M
+U+039DU+03C4%=DU+0332
+U+039D=N
+U+039E=X
+U+039FU+0314=Ho
+U+039FU+03C5=Ou
+U+039F=O
+U+03A0=P
+U+03A1U+0314=Rh
+U+03A1=R
+# U+03A2 reserved 
+U+03A3=S
+U+03A4=T
+U+03A5U+0314=Hy
+U+03A5=Y
+U+03A6=Ph
+U+03A7=Ch
+U+03A8=Ps
+U+03A9U+0314=HoU+0304
+U+03A9=OU+0304
+U+03AA=IU+0308
+U+03AB=YU+0308
+U+03ACU+0314=haU+0301
+U+03AC=aU+0301
+U+03ADU+0314=heU+0301
+U+03AD=eU+0301
+U+03AEU+0314=heU+0304U+0301
+U+03AE=eU+0304U+0301
+U+03AFU+0314=hiU+0301
+U+03AF=iU+0301
+U+03B0=yU+0308U+0301
+U+03B1U+0314=ha
+U+03B1U+03C5=au
+U+03B1=a
+U+03B2=v
+U+03B3U+03B3=ng
+%U+03B3U+03BA%=nk
+U+03B3U+03BE=nx
+U+03B3U+03C7=nch
+U+03B3=g
+U+03B4=d
+U+03B5U+0314=he
+U+03B5U+03C5=eu
+U+03B5=e
+U+03B6=z
+U+03B7U+0314=heU+0304
+U+03B7=eU+0304
+U+03B8=th
+U+03B9U+0314=hi
+U+03B9U+03C5=iu
+U+03B9=i
+U+03BA=k
+U+03BB=l
+U+03BCU+03C0%=b
+U+03BC=m
+U+03BDU+03C4%=dU+0332
+U+03BD=n
+U+03BE=x
+U+03BFU+0314=ho
+U+03BFU+03C5=ou
+U+03BF=o
+U+03C0=p
+U+03C1U+0314=rh
+U+03C1=r
+U+03C2=s
+U+03C3=s
+U+03C4=t
+U+03C5U+0314=hy
+U+03C5=y
+U+03C6=ph
+U+03C7=ch
+U+03C8=ps
+U+03C9U+0314=hoU+0304
+U+03C9=oU+0304
+U+03CA=iU+0308
+U+03CB=yU+0308
+U+03CC=oU+0301
+U+03CD=yU+0301
+U+03CE=oU+0304U+0301
+U+03CF=KU+0326
+U+03D0=bU+0333
+U+03D1=tU+0333hU+0333
+U+03D2=YU+0333
+U+03D3=YU+0301U+0333
+U+03D4=YU+0308U+0333
+U+03D5=pU+0333hU+0333
+U+03D6=pU+0333
+U+03D7=kU+0326
+U+03D8=Q
+U+03D9=q
+U+03DA=6U+0333
+U+03DB=6U+0332
+U+03DC=GU+0332
+U+03DD=gU+0332
+U+03DE=KU+0324
+U+03DF=kU+0324
+U+03E0=sU+FE20sU+FE21
+U+03E1=SU+FE20SU+FE21
+U+03E2=SU+030C
+U+03E3=sU+030C
+U+03E4=F
+U+03E5=f
+U+03E6=HU+0332
+U+03E7=hU+0332
+U+03E8=HU+0307
+U+03E9=hU+0307
+U+03EA=CU+030C
+U+03EB=cU+030C
+U+03EC=KU+0323y
+U+03ED=kU+0323y
+U+03EE=TU+0323i
+U+03EF=tU+0323i
+U+03F0=kU+0332
+U+03F1=rU+0332
+U+03F2=sU+0332
+U+03F3=j
+U+03F4=TU+0333HU+0333
+U+03F5=eU+0332
+U+03F6=eU+0333
+U+03F7=SU+FE20HU+FE21
+U+03F8=sU+FE20hU+FE21
+U+03F9=SU+0332
+U+03FA=SU+0323
+U+03FB=sU+0323
+U+03FC=rU+0333
+U+03FD=|)U+0333
+U+03FE=(.U+0333
+U+03FF=.)U+0333

+ 141 - 0
scriptshifter/hooks/greek/__init__.py

@@ -0,0 +1,141 @@
+__doc__ = """ Greek hooks. """
+
+
+from logging import getLogger
+
+from scriptshifter.exceptions import CONT
+
+
+# Suffixed by ʹ
+# Indices are positions in the numeric string from the right
+DIGITS = {
+    1: {  # Units
+        "α": 1,
+        "β": 2,
+        "γ": 3,
+        "δ": 4,
+        "ε": 5,
+        "ϝ": 6,
+        "ϛ": 6,
+        "στ": 6,
+        "ζ": 7,
+        "η": 8,
+        "θ": 9,
+    },
+    2: {  # Tens
+        "ι": 1,
+        "κ": 2,
+        "λ": 3,
+        "μ": 4,
+        "ν": 5,
+        "ξ": 6,
+        "ο": 7,
+        "π": 8,
+        "ϙ": 9,
+        "ϟ": 9,
+    },
+    3: {  # Hundreds
+        "ρ": 1,
+        "σ": 2,
+        "τ": 3,
+        "υ": 4,
+        "φ": 5,
+        "χ": 6,
+        "ψ": 7,
+        "ω": 8,
+        "ϡ": 9,
+    },
+    # Prefixed by ͵
+    4: {
+        "α": 1,
+        "β": 2,
+        "γ": 3,
+        "δ": 4,
+        "ε": 5,
+        "ϝ": 6,
+        "ϛ": 6,
+        "στ": 6,
+        "ζ": 7,
+        "η": 8,
+        "θ": 9,
+    },
+}
+
+NUM_SUFFIX = "\u0374"  # ʹ
+THOUSANDS_PREFIX = "\u0375"  # ͵
+
+logger = getLogger(__name__)
+
+
+def parse_numeral(ctx):
+    """
+    Parse a numeric string.
+
+    Runs on begin_input_token hook.
+
+    Note that this logic does not raise a warning or error for numeral
+    characters mixed with letter characters without a space. Therefore,
+    "͵ακακαα" would transliterate "1021kaa", and "͵αακαα", "1001kaa".
+    """
+    # Parse thousands.
+    if ctx.src[ctx.cur] == THOUSANDS_PREFIX:
+        tk = ctx.src[ctx.cur + 1]
+
+        try:
+            ctx.dest_ls.append(str(DIGITS[4][tk]))
+            ctx.cur += 2
+
+        except KeyError:
+            ctx.warnings.append(
+                    f"Character `{tk}` at position {ctx.cur + 1} "
+                    "is not a valid thousands character.")
+            ctx.cur += 1
+
+            return CONT
+
+        ext = ["0", "0", "0"]
+        ext_cur = 0
+        for i in range(0, 3):
+            # Parse following characters until EOW or max 3.
+            if ctx.cur >= len(ctx.src) or ctx.src[ctx.cur] == " ":
+                break
+
+            try:
+                ext[ext_cur] = str(DIGITS[3 - i][ctx.src[ctx.cur]])
+                ctx.cur += 1
+            except KeyError:
+                # If the number char is not in the correct position, pad with 0
+                continue
+            finally:
+                ext_cur += 1
+        ctx.dest_ls.extend(ext)
+
+        logger.debug(f"Stopping numeral parsing at position #{ctx.cur}.")
+
+        return CONT
+
+    # Parse 1-999.
+    # This requires a different approach, i.e. backtracking previously
+    # transliterated characters.
+    if ctx.src[ctx.cur] == NUM_SUFFIX:
+        # Move back up to 3 positions.
+        for i in range(1, 4):
+            cur = ctx.cur - i
+            if cur >= 0:
+                num_tk = ctx.src[cur]  # Number to be parsed
+                if ctx.src[cur] in DIGITS[i]:
+                    # Not yet reached word boundary.
+                    ctx.dest_ls[-i] = str(DIGITS[i][num_tk])
+                else:
+                    if ctx.src[cur] != " ":  # Word boundary.
+                        # Something's wrong.
+                        ctx.warnings.append(
+                                f"Character `{ctx.src[cur]}` at position "
+                                f"{cur} is not a valid digit character "
+                                f"at place #{4 - i} in a numeral.")
+
+                    ctx.cur += 1
+                    return CONT  # Continue normal parsing.
+
+        ctx.cur += 1
+        return CONT
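
For readers unfamiliar with Greek alphabetic numerals, the value tables used by the hook above can be exercised with a small standalone sketch. It is not the hook itself: `greek_to_int` and the three lookup tables are hypothetical names, the function returns an integer instead of writing to `ctx.dest_ls`, and it simply adds up letter values without the positional zero-padding or warnings that `parse_numeral` implements. Only one of the alternative signs for 6 and for 90 is included.

```python
# Letter values, mirroring the DIGITS table in the hook.
UNITS = dict(zip("αβγδεϛζηθ", range(1, 10)))
TENS = dict(zip("ικλμνξοπϙ", range(10, 100, 10)))
HUNDREDS = dict(zip("ρστυφχψωϡ", range(100, 1000, 100)))

NUM_SUFFIX = "\u0374"        # numeral sign closing a number
THOUSANDS_PREFIX = "\u0375"  # lower numeral sign marking thousands


def greek_to_int(numeral):
    """Convert a Greek alphabetic numeral to an integer (additive sketch)."""
    total = 0
    chars = numeral.rstrip(NUM_SUFFIX)
    if chars.startswith(THOUSANDS_PREFIX):
        # The letter after the prefix counts thousands.
        total += UNITS[chars[1]] * 1000
        chars = chars[2:]
    for ch in chars:
        for table in (HUNDREDS, TENS, UNITS):
            if ch in table:
                total += table[ch]
                break
        else:
            raise ValueError(f"Not a numeral character: {ch!r}")
    return total


year = greek_to_int("\u0375\u03B1\u03C9\u03BA\u03B1\u0374")  # ͵αωκαʹ
small = greek_to_int("\u03C1\u03BA\u03B3\u0374")             # ρκγʹ
```

The hook takes a different route for numbers below 1000 because the numeral sign comes last: by the time it is reached, the preceding letters have already been transliterated, so they are overwritten in place.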

+ 14 - 0
scriptshifter/hooks/greek/data.yml

@@ -0,0 +1,14 @@
+normalize:
+  psili:
+    -
+      - "ἀ"
+      - "ἂ"
+      - "ἄ"
+      - "ἆ"
+    - "α"
+  dasia:
+    -
+      - "ἃ"
+      - "ἅ"
+      - "ἇ"
+    - "ἁ"

+ 51 - 2
scriptshifter/tables/__init__.py

@@ -29,6 +29,7 @@ TABLE_DIR = environ.get("TXL_CONFIG_TABLE_DIR", DEFAULT_TABLE_DIR)
 # Available hook names.
 HOOKS = (
     "post_config",
+    "post_normalize",
     "begin_input_token",
     "pre_ignore_token",
     "on_ignore_match",
@@ -43,6 +44,15 @@ HOOK_PKG_PATH = "scriptshifter.hooks"
 # Default characters defining a word boundary. This is configurable per-table.
 WORD_BOUNDARY = " \n\t:;.,\"'-()[]{}"
 
+# Token word boundary marker. Used in maps to distinguish special
+# transliterations for initial, final, and standalone tokens.
+TOKEN_WB_MARKER = "%"
+
+# Word boundary bitwise flags.
+BOW = 1 << 1
+EOW = 1 << 0
+
+
 logger = logging.getLogger(__name__)
 
 
@@ -54,9 +64,22 @@ class Token(str):
     in a way that prioritizes a longer string over a shorter one with identical
     root.
     """
+    flags = 0
+
     def __init__(self, content):
         self.content = content
 
+        # Assign special precedence based on token position.
+        # Standalone has precedence, then initial, then final, then medial.
+        # This is somewhat arbitrary and may change if special cases arise.
+        # WB markers are moved to flags to allow default comparison.
+        if self.content.endswith(TOKEN_WB_MARKER):
+            self.flags |= BOW
+            self.content = self.content.rstrip(TOKEN_WB_MARKER)
+        if self.content.startswith(TOKEN_WB_MARKER):
+            self.flags |= EOW
+            self.content = self.content.lstrip(TOKEN_WB_MARKER)
+
     def __lt__(self, other):
         """
         Operator to sort tokens.
@@ -76,6 +99,17 @@ class Token(str):
         other_len = len(other.content)
         min_len = min(self_len, other_len)
 
+        # Check word boundary flags only if tokens are identical.
+        # Higher flag value has precedence.
+        if (
+                (self.flags > 0 or other.flags > 0)
+                and self.content == other.content):
+            logger.debug(f"{self.content} flags: {self.flags}")
+            logger.debug(f"{other.content} flags: {other.flags}")
+            logger.debug("Performing flags comparison.")
+
+            return self.flags > other.flags
+
         # If one of the strings is entirely contained in the other string...
         if self.content[:min_len] == other.content[:min_len]:
             # logger.debug("Roots match.")
@@ -147,8 +181,23 @@ def load_table(tname):
                 Token(k): v
                 for k, v in tdata["script_to_roman"].get("map", {}).items()}
         tdata["script_to_roman"]["map"] = tuple(
-                (k.content, tokens[k]) for k in sorted(tokens))
+                (k, tokens[k]) for k in sorted(tokens))
+
+        # Normalization.
+        normalize = {}
+
+        # Inherit normalization rules.
+        for parent in parents:
+            parent_langsec = load_table(parent)["script_to_roman"]
+            normalize |= parent_langsec.get("normalize", {})
+
+        for k, v in tdata["script_to_roman"].get("normalize", {}).items():
+            for vv in v:
+                normalize[Token(vv)] = k
+
+        tdata["script_to_roman"]["normalize"] = dict(sorted(normalize.items()))
 
+        # Hook function.
         if "hooks" in tdata["script_to_roman"]:
             tdata["script_to_roman"]["hooks"] = load_hook_fn(
                     tname, tdata["script_to_roman"])
@@ -168,7 +217,7 @@ def load_table(tname):
             for k, v in tdata["roman_to_script"].get("map", {}).items()
         }
         tdata["roman_to_script"]["map"] = tuple(
-                (k.content, tokens[k]) for k in sorted(tokens))
+                (k, tokens[k]) for k in sorted(tokens))
 
         # Ignore regular expression patterns.
         # Patterns are evaluated in the order they are listed in the config.
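
The word-boundary handling added to `Token` above can be illustrated in isolation. The following is a minimal sketch, not the project's `Token` class: `split_wb_flags` and `sort_key` are hypothetical helpers that mirror only the marker stripping and the flag comparison for keys with identical content. The longer-string-first ordering of the real `__lt__` is deliberately left out.

```python
TOKEN_WB_MARKER = "%"
BOW = 1 << 1  # marker at the end of the key: token begins a word
EOW = 1 << 0  # marker at the start of the key: token ends a word


def split_wb_flags(key):
    """Return (content, flags) for a map key such as "Μπ%" or "%s"."""
    flags = 0
    if key.endswith(TOKEN_WB_MARKER):
        flags |= BOW
        key = key.rstrip(TOKEN_WB_MARKER)
    if key.startswith(TOKEN_WB_MARKER):
        flags |= EOW
        key = key.lstrip(TOKEN_WB_MARKER)
    return key, flags


def sort_key(key):
    """Hypothetical sort key: for identical content, higher flags sort first."""
    content, flags = split_wb_flags(key)
    return (content, -flags)


# Same content, four positions: standalone, initial, final, medial.
keys = ["γκ", "%γκ", "γκ%", "%γκ%"]
ordered = sorted(keys, key=sort_key)
print(ordered)
```

With these flag values a standalone key (`%γκ%`, flags 3) sorts before an initial one (`γκ%`, flags 2), then a final one (`%γκ`, flags 1), then the unmarked medial key, which matches the precedence described in the `Token.__init__` comment.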

+ 791 - 0
scriptshifter/tables/data/greek_classical.yml

@@ -0,0 +1,791 @@
+general:
+  name: Classical Greek (ancient and medieval)
+  notes:
+    - Compiled based on https://www.loc.gov/catdir/cpso/romanization/greek.pdf
+  parents:
+    - _ignore_base
+
+script_to_roman:
+  hooks:
+    begin_input_token:
+      -
+        - greek.parse_numeral
+
+  normalize:
+    # Assimilate all vowels that can be in a diphthong with upsilon to a
+    # non-tonal and a tonal form, so that a hiatus can be established.
+    # The accent used for the assimilated form is Varia, which is used for the
+    # transliteration rules of hiatuses further down.
+
+    # Alpha
+    "\u03B1":  # α 	Greek Small Letter Alpha
+      - "\u1F00"  # ἀ 	Greek Small Letter Alpha With Psili
+      - "\u1F80"  # ᾀ 	Greek Small Letter Alpha With Psili And Ypogegrammeni
+      - "\u1FB0"  # ᾰ 	Greek Small Letter Alpha With Vrachy
+      - "\u1FB1"  # ᾱ 	Greek Small Letter Alpha With Macron
+      - "\u1FB3"  # ᾳ 	Greek Small Letter Alpha With Ypogegrammeni
+    "\u1F70":  # ὰ 	Greek Small Letter Alpha With Varia
+      - "\u03AC"  # ά 	Greek Small Letter Alpha With Tonos
+      - "\u1F02"  # ἂ 	Greek Small Letter Alpha With Psili And Varia
+      - "\u1F04"  # ἄ 	Greek Small Letter Alpha With Psili And Oxia
+      - "\u1F06"  # ἆ 	Greek Small Letter Alpha With Psili And Perispomeni
+      - "\u1F71"  # ά 	Greek Small Letter Alpha With Oxia
+      - "\u1F82"  # ᾂ 	Greek Small Letter Alpha With Psili And Varia And Ypogegrammeni
+      - "\u1F84"  # ᾄ 	Greek Small Letter Alpha With Psili And Oxia And Ypogegrammeni
+      - "\u1F86"  # ᾆ 	Greek Small Letter Alpha With Psili And Perispomeni And Ypogegrammeni
+      - "\u1FB2"  # ᾲ 	Greek Small Letter Alpha With Varia And Ypogegrammeni
+      - "\u1FB4"  # ᾴ 	Greek Small Letter Alpha With Oxia And Ypogegrammeni
+      - "\u1FB6"  # ᾶ 	Greek Small Letter Alpha With Perispomeni
+      - "\u1FB7"  # ᾷ 	Greek Small Letter Alpha With Perispomeni And Ypogegrammeni
+    "\u0391":  # Α 	Greek Capital Letter Alpha
+      - "\u1F08"  # Ἀ 	Greek Capital Letter Alpha With Psili
+      - "\u1F88"  # ᾈ 	Greek Capital Letter Alpha With Psili And Prosgegrammeni
+      - "\u1FB8"  # Ᾰ 	Greek Capital Letter Alpha With Vrachy
+      - "\u1FB9"  # Ᾱ 	Greek Capital Letter Alpha With Macron
+      - "\u1FBC"  # ᾼ 	Greek Capital Letter Alpha With Prosgegrammeni
+    "\u1FBA":  # Ὰ 	Greek Capital Letter Alpha With Varia
+      - "\u0386"  # Ά 	Greek Capital Letter Alpha With Tonos
+      - "\u1F0A"  # Ἂ 	Greek Capital Letter Alpha With Psili And Varia
+      - "\u1F0C"  # Ἄ 	Greek Capital Letter Alpha With Psili And Oxia
+      - "\u1F8A"  # ᾊ 	Greek Capital Letter Alpha With Psili And Varia And Prosgegrammeni
+      - "\u1F8C"  # ᾌ 	Greek Capital Letter Alpha With Psili And Oxia And Prosgegrammeni
+      - "\u1F8E"  # ᾎ 	Greek Capital Letter Alpha With Psili And Perispomeni And Prosgegrammeni
+    # Rough alpha
+    "\u1F01":  # ἁ 	Greek Small Letter Alpha With Dasia
+      - "\u03B1\u0314" # Small alpha + combining reversed comma above
+      - "\u1F81"  # ᾁ 	Greek Small Letter Alpha With Dasia And Ypogegrammeni
+    "\u1F03":  # ἃ 	Greek Small Letter Alpha With Dasia And Varia
+      - "\u1F05"  # ἅ 	Greek Small Letter Alpha With Dasia And Oxia
+      - "\u1F07"  # ἇ 	Greek Small Letter Alpha With Dasia And Perispomeni
+      - "\u1F83"  # ᾃ 	Greek Small Letter Alpha With Dasia And Varia And Ypogegrammeni
+      - "\u1F85"  # ᾅ 	Greek Small Letter Alpha With Dasia And Oxia And Ypogegrammeni
+      - "\u1F87"  # ᾇ 	Greek Small Letter Alpha With Dasia And Perispomeni And Ypogegrammeni
+    "\u1F09":  # Ἁ 	Greek Capital Letter Alpha With Dasia
+      - "\u0391\u0314" # Capital alpha + combining reversed comma above
+      - "\u1F89"  # ᾉ 	Greek Capital Letter Alpha With Dasia And Prosgegrammeni
+    "\u1F0B":  # Ἃ 	Greek Capital Letter Alpha With Dasia And Varia
+      - "\u1F8B"  # ᾋ 	Greek Capital Letter Alpha With Dasia And Varia And Prosgegrammeni
+      - "\u1F8D"  # ᾍ 	Greek Capital Letter Alpha With Dasia And Oxia And Prosgegrammeni
+      - "\u1F8F"  # ᾏ 	Greek Capital Letter Alpha With Dasia And Perispomeni And Prosgegrammeni
+      - "\u1F0D"  # Ἅ 	Greek Capital Letter Alpha With Dasia And Oxia
+      - "\u1F0F"  # Ἇ 	Greek Capital Letter Alpha With Dasia And Perispomeni
+    # Epsilon
+    "\u03B5":  # ε 	Greek Small Letter Epsilon
+      - "\u1F10"  # ἐ 	Greek Small Letter Epsilon With Psili
+    "\u1F72":  # ὲ 	Greek Small Letter Epsilon With Varia
+      - "\u03AD"  # έ 	Greek Small Letter Epsilon With Tonos
+      - "\u1F12"  # ἒ 	Greek Small Letter Epsilon With Psili And Varia
+      - "\u1F14"  # ἔ 	Greek Small Letter Epsilon With Psili And Oxia
+      - "\u1F73"  # έ 	Greek Small Letter Epsilon With Oxia
+    "\u0395":  # Ε 	Greek Capital Letter Epsilon
+      - "\u1F18"  # Ἐ 	Greek Capital Letter Epsilon With Psili
+    "\u1F1A":  # Ἒ 	Greek Capital Letter Epsilon With Psili And Varia
+      - "\u1F1C"  # Ἔ 	Greek Capital Letter Epsilon With Psili And Oxia
+      - "\u0388"  # Έ 	Greek Capital Letter Epsilon With Tonos
+    # Rough epsilon
+    "\u1F11":  #  ἑ 	Greek Small Letter Epsilon With Dasia
+      - "\u03B5\u0314"  # Small epsilon + combination Dasia
+    "\u1F13":  # ἓ 	Greek Small Letter Epsilon With Dasia And Varia
+      - "\u1F15"  # ἕ 	Greek Small Letter Epsilon With Dasia And Oxia
+    "\u1F19":  # Ἑ 	Greek Capital Letter Epsilon With Dasia
+      - "\u0395\u0314"  # Capital epsilon + combination Dasia
+    "\u1F1B":  # Ἓ 	Greek Capital Letter Epsilon With Dasia And Varia
+      - "\u1F1D"  # Ἕ 	Greek Capital Letter Epsilon With Dasia And Oxia
+    # Eta
+    "\u03B7":  # η 	Greek Small Letter Eta
+      - "\u1F20"  # ἠ 	Greek Small Letter Eta With Psili
+      - "\u1FC3"  # ῃ 	Greek Small Letter Eta With Ypogegrammeni
+      - "\u1F90"  # ᾐ 	Greek Small Letter Eta With Psili And Ypogegrammeni
+    "\u1F74":  # ὴ 	Greek Small Letter Eta With Varia
+      - "\u03AE"  # ή 	Greek Small Letter Eta With Tonos
+      - "\u1F22"  # ἢ 	Greek Small Letter Eta With Psili And Varia
+      - "\u1F24"  # ἤ 	Greek Small Letter Eta With Psili And Oxia
+      - "\u1F26"  # ἦ 	Greek Small Letter Eta With Psili And Perispomeni
+      - "\u1F75"  # ή 	Greek Small Letter Eta With Oxia
+      - "\u1F92"  # ᾒ 	Greek Small Letter Eta With Psili And Varia And Ypogegrammeni
+      - "\u1F94"  # ᾔ 	Greek Small Letter Eta With Psili And Oxia And Ypogegrammeni
+      - "\u1F96"  # ᾖ 	Greek Small Letter Eta With Psili And Perispomeni And Ypogegrammeni
+      - "\u1FC2"  # ῂ 	Greek Small Letter Eta With Varia And Ypogegrammeni
+      - "\u1FC4"  # ῄ 	Greek Small Letter Eta With Oxia And Ypogegrammeni
+      - "\u1FC6"  # ῆ 	Greek Small Letter Eta With Perispomeni
+      - "\u1FC7"  # ῇ 	Greek Small Letter Eta With Perispomeni And Ypogegrammeni
+    "\u0397":  # Η 	Greek Capital Letter Eta
+      - "\u1F28"  # Ἠ 	Greek Capital Letter Eta With Psili
+      - "\u1F98"  # ᾘ 	Greek Capital Letter Eta With Psili And Prosgegrammeni
+      - "\u1FCC"  # ῌ 	Greek Capital Letter Eta With Prosgegrammeni
+    "\u1F2A":  # Ἢ 	Greek Capital Letter Eta With Psili And Varia
+      - "\u0389"  # Ή 	Greek Capital Letter Eta With Tonos
+      - "\u1F2C"  # Ἤ 	Greek Capital Letter Eta With Psili And Oxia
+      - "\u1F2E"  # Ἦ 	Greek Capital Letter Eta With Psili And Perispomeni
+      - "\u1F9A"  # ᾚ 	Greek Capital Letter Eta With Psili And Varia And Prosgegrammeni
+      - "\u1F9C"  # ᾜ 	Greek Capital Letter Eta With Psili And Oxia And Prosgegrammeni
+      - "\u1F9E"  # ᾞ 	Greek Capital Letter Eta With Psili And Perispomeni And Prosgegrammeni
+    # Rough eta
+    "\u1F21":  # ἡ 	Greek Small Letter Eta With Dasia
+      - "\u03B7\u0314"  # Small eta + combination dasia
+      - "\u1F91"  # ᾑ 	Greek Small Letter Eta With Dasia And Ypogegrammeni
+    "\u1F23":  # ἣ 	Greek Small Letter Eta With Dasia And Varia
+      - "\u1F25"  # ἥ 	Greek Small Letter Eta With Dasia And Oxia
+      - "\u1F27"  # ἧ 	Greek Small Letter Eta With Dasia And Perispomeni
+      - "\u1F93"  # ᾓ 	Greek Small Letter Eta With Dasia And Varia And Ypogegrammeni
+      - "\u1F95"  # ᾕ 	Greek Small Letter Eta With Dasia And Oxia And Ypogegrammeni
+      - "\u1F97"  # ᾗ 	Greek Small Letter Eta With Dasia And Perispomeni And Ypogegrammeni
+    "\u1F29":  # Ἡ 	Greek Capital Letter Eta With Dasia
+      - "\u0397\u0314"  # Capital Eta + combination dasia
+      - "\u1F99"  # ᾙ 	Greek Capital Letter Eta With Dasia And Prosgegrammeni
+    "\u1F2B":  # Ἣ 	Greek Capital Letter Eta With Dasia And Varia
+      - "\u1F2D"  # Ἥ 	Greek Capital Letter Eta With Dasia And Oxia
+      - "\u1F2F"  # Ἧ 	Greek Capital Letter Eta With Dasia And Perispomeni
+      - "\u1F9B"  # ᾛ 	Greek Capital Letter Eta With Dasia And Varia And Prosgegrammeni
+      - "\u1F9D"  # ᾝ 	Greek Capital Letter Eta With Dasia And Oxia And Prosgegrammeni
+      - "\u1F9F"  # ᾟ 	Greek Capital Letter Eta With Dasia And Perispomeni And Prosgegrammeni
+    # Iota
+    # TODO clarify about possible hiatus combinations with upsilon.
+    "\u03B9":
+      - "\u0390"
+      - "\u03AF"
+      - "\u03CA"
+      - "\u1F30"
+      - "\u1F32"
+      - "\u1F34"
+      - "\u1F36"
+      - "\u1F76"
+      - "\u1F77"
+      - "\u1FD0"
+      - "\u1FD1"
+      - "\u1FD2"
+      - "\u1FD3"
+      - "\u1FD6"
+      - "\u1FD7"
+    "\u0399":
+      - "\u038A"
+      - "\u03AA"
+      - "\u1F38"
+      - "\u1F3A"
+      - "\u1F3C"
+      - "\u1F3E"
+      - "\u1FD8"
+      - "\u1FD9"
+    # Rough iota
+    "\u1F31":
+      - "\u03B9\u0314"
+      - "\u1F33"
+      - "\u1F35"
+      - "\u1F37"
+    "\u1F39":
+      - "\u0399\u0314"
+      - "\u1F3B"
+      - "\u1F3D"
+      - "\u1F3F"
+    # Omicron
+    "\u03BF":  # ο 	Greek Small Letter Omicron
+      - "\u1F40"  # ὀ 	Greek Small Letter Omicron With Psili
+    "\u1F78":  # ὸ 	Greek Small Letter Omicron With Varia
+      - "\u03CC"  # ό 	Greek Small Letter Omicron With Tonos
+      - "\u1F42"  # ὂ 	Greek Small Letter Omicron With Psili And Varia
+      - "\u1F44"  # ὄ 	Greek Small Letter Omicron With Psili And Oxia
+      - "\u1F79"  # ό 	Greek Small Letter Omicron With Oxia
+    "\u039F":  # Ο 	Greek Capital Letter Omicron
+      - "\u1F48"  # Ὀ 	Greek Capital Letter Omicron With Psili
+    "\u1F4A":  # Ὂ 	Greek Capital Letter Omicron With Psili And Varia
+      - "\u038C"  # Ό 	Greek Capital Letter Omicron With Tonos
+      - "\u1F4C"  # Ὄ 	Greek Capital Letter Omicron With Psili And Oxia
+    # Rough Omicron
+    "\u1F41":  # ὁ 	Greek Small Letter Omicron With Dasia
+      - "\u03BF\u0314"  # Small omicron + combination dasia
+    "\u1F43":  # ὃ 	Greek Small Letter Omicron With Dasia And Varia
+      - "\u1F45"  # ὅ 	Greek Small Letter Omicron With Dasia And Oxia
+    "\u1F49":  # Ὁ 	Greek Capital Letter Omicron With Dasia
+      - "\u039F\u0314"  # Capital omicron + combination dasia
+    "\u1F4B":  # Ὃ 	Greek Capital Letter Omicron With Dasia And Varia
+      - "\u1F4D"  # Ὅ 	Greek Capital Letter Omicron With Dasia And Oxia
+    # Rho
+    "\u03C1":
+      - "\u1FE4"
+    # Rough Rho
+    "\u1FE5":
+      - "\u03C1\u0314"
+    "\u1FEC":
+      - "\u03A1\u0314"
+
+    # Upsilon
+    "\u03C5":
+      - "\u03CD"  # ύ 	Greek Small Letter Upsilon With Tonos
+      - "\u1F50"  # ὐ 	Greek Small Letter Upsilon With Psili
+      - "\u1F52"  # ὒ 	Greek Small Letter Upsilon With Psili And Varia
+      - "\u1F54"  # ὔ 	Greek Small Letter Upsilon With Psili And Oxia
+      - "\u1F56"  # ὖ 	Greek Small Letter Upsilon With Psili And Perispomeni
+      - "\u1F7A"  # ὺ 	Greek Small Letter Upsilon With Varia
+      - "\u1F7B"  # ύ 	Greek Small Letter Upsilon With Oxia
+      - "\u1FE0"  # ῠ 	Greek Small Letter Upsilon With Vrachy
+      - "\u1FE1"  # ῡ 	Greek Small Letter Upsilon With Macron
+      - "\u1FE6"  # ῦ 	Greek Small Letter Upsilon With Perispomeni
+    "\u03CB":  # ϋ 	Greek Small Letter Upsilon With Dialytika
+      - "\u03B0"  # ΰ 	Greek Small Letter Upsilon With Dialytika And Tonos
+      - "\u1FE2"  # ῢ 	Greek Small Letter Upsilon With Dialytika And Varia
+      - "\u1FE3"  # ΰ 	Greek Small Letter Upsilon With Dialytika And Oxia
+      - "\u1FE7"  # ῧ 	Greek Small Letter Upsilon With Dialytika And Perispomeni
+    "\u03A5":
+        # NOTE: Capital upsilon + psili seems to be absent from Unicode table.
+      - "\u03AB"  # Ϋ 	Greek Capital Letter Upsilon With Dialytika
+      - "\u1F59"  # Ὑ 	Greek Capital Letter Upsilon With Dasia
+      - "\u1FE8"  # Ῠ 	Greek Capital Letter Upsilon With Vrachy
+      - "\u1FE9"  # Ῡ 	Greek Capital Letter Upsilon With Macron
+    "\u1FEA":  # Ὺ 	Greek Capital Letter Upsilon With Varia
+      - "\u1F5B"  # Ὓ 	Greek Capital Letter Upsilon With Dasia And Varia
+      - "\u1F5D"  # Ὕ 	Greek Capital Letter Upsilon With Dasia And Oxia
+      - "\u1F5F"  # Ὗ 	Greek Capital Letter Upsilon With Dasia And Perispomeni
+    # Rough Upsilon
+    "\u1F51":
+      - "\u03C5\u0314"
+      - "\u1F53"
+      - "\u1F55"
+      - "\u1F57"
+    "\u1F59":
+      - "\u03A5\u0314"
+      - "\u1F5B"
+      - "\u1F5D"
+      - "\u1F5F"
+
+    # Omega
+    "\u03C9":  # ω 	Greek Small Letter Omega
+      - "\u1F60"  # ὠ 	Greek Small Letter Omega With Psili
+      - "\u1FA0"  # ᾠ 	Greek Small Letter Omega With Psili And Ypogegrammeni
+      - "\u1FF3"  # ῳ 	Greek Small Letter Omega With Ypogegrammeni
+    "\u1F7C":  # ὼ 	Greek Small Letter Omega With Varia
+      - "\u03CE"  # ώ 	Greek Small Letter Omega With Tonos
+      - "\u1F62"  # ὢ 	Greek Small Letter Omega With Psili And Varia
+      - "\u1F64"  # ὤ 	Greek Small Letter Omega With Psili And Oxia
+      - "\u1F66"  # ὦ 	Greek Small Letter Omega With Psili And Perispomeni
+      - "\u1F7D"  # ώ 	Greek Small Letter Omega With Oxia
+      - "\u1FA2"  # ᾢ 	Greek Small Letter Omega With Psili And Varia And Ypogegrammeni
+      - "\u1FA4"  # ᾤ 	Greek Small Letter Omega With Psili And Oxia And Ypogegrammeni
+      - "\u1FA6"  # ᾦ 	Greek Small Letter Omega With Psili And Perispomeni And Ypogegrammeni
+      - "\u1FF2"  # ῲ 	Greek Small Letter Omega With Varia And Ypogegrammeni
+      - "\u1FF4"  # ῴ 	Greek Small Letter Omega With Oxia And Ypogegrammeni
+      - "\u1FF6"  # ῶ 	Greek Small Letter Omega With Perispomeni
+      - "\u1FF7"  # ῷ 	Greek Small Letter Omega With Perispomeni And Ypogegrammeni
+    "\u03A9":  # Ω 	Greek Capital Letter Omega
+      - "\u1F68"  # Ὠ 	Greek Capital Letter Omega With Psili
+      - "\u1FA8"  # ᾨ 	Greek Capital Letter Omega With Psili And Prosgegrammeni
+      - "\u1FFC"  # ῼ 	Greek Capital Letter Omega With Prosgegrammeni
+    "\u1FFA":  # Ὼ 	Greek Capital Letter Omega With Varia
+      - "\u038F"  # Ώ 	Greek Capital Letter Omega With Tonos
+      - "\u1F6A"  # Ὢ 	Greek Capital Letter Omega With Psili And Varia
+      - "\u1F6C"  # Ὤ 	Greek Capital Letter Omega With Psili And Oxia
+      - "\u1F6E"  # Ὦ 	Greek Capital Letter Omega With Psili And Perispomeni
+      - "\u1FAA"  # ᾪ 	Greek Capital Letter Omega With Psili And Varia And Prosgegrammeni
+      - "\u1FAC"  # ᾬ 	Greek Capital Letter Omega With Psili And Oxia And Prosgegrammeni
+      - "\u1FAE"  # ᾮ 	Greek Capital Letter Omega With Psili And Perispomeni And Prosgegrammeni
+    # Rough omega
+    "\u1F61":  # ὡ 	Greek Small Letter Omega With Dasia
+      - "\u03C9\u0314"  # Small omega + combination dasia
+      - "\u1FA1"  # ᾡ 	Greek Small Letter Omega With Dasia And Ypogegrammeni
+    "\u1F63":  # ὣ 	Greek Small Letter Omega With Dasia And Varia
+      - "\u1F65"  # ὥ 	Greek Small Letter Omega With Dasia And Oxia
+      - "\u1F67"  # ὧ 	Greek Small Letter Omega With Dasia And Perispomeni
+      - "\u1FA3"  # ᾣ 	Greek Small Letter Omega With Dasia And Varia And Ypogegrammeni
+      - "\u1FA5"  # ᾥ 	Greek Small Letter Omega With Dasia And Oxia And Ypogegrammeni
+      - "\u1FA7"  # ᾧ 	Greek Small Letter Omega With Dasia And Perispomeni And Ypogegrammeni
+    "\u1F69":  # Ὡ 	Greek Capital Letter Omega With Dasia
+      - "\u03A9\u0314"  # Capital omega + combination dasia
+      - "\u1FA9"  # ᾩ 	Greek Capital Letter Omega With Dasia And Prosgegrammeni
+    "\u1F6B":  # Ὣ 	Greek Capital Letter Omega With Dasia And Varia
+      - "\u1F6D"  # Ὥ 	Greek Capital Letter Omega With Dasia And Oxia
+      - "\u1F6F"  # Ὧ 	Greek Capital Letter Omega With Dasia And Perispomeni
+      - "\u1FAB"  # ᾫ 	Greek Capital Letter Omega With Dasia And Varia And Prosgegrammeni
+      - "\u1FAD"  # ᾭ 	Greek Capital Letter Omega With Dasia And Oxia And Prosgegrammeni
+      - "\u1FAF"  # ᾯ 	Greek Capital Letter Omega With Dasia And Perispomeni And Prosgegrammeni
+
+    # Remove all remaining standalone diacritics (combining and spacing forms).
+    "":
+      - "\u0314"
+      - "\u0342"
+      - "\u0343"
+      - "\u0344"
+      - "\u0345"
+      - "\u037A"
+      - "\u0384"
+      - "\u0385"
+      - "\u1FBD"
+      - "\u1FBE"
+      - "\u1FBF"
+      - "\u1FC0"
+      - "\u1FC1"
+      - "\u1FCD"
+      - "\u1FCE"
+      - "\u1FCF"
+      - "\u1FED"
+      - "\u1FEE"
+      - "\u1FFD"
+
+  map:
+    "\u201C": "\"\u0332"
+    "\u201D": "\"\u0333"
+    "\u2018": "'\u0332"
+    #"\u2019": "'\u0333"
+    "\u2116": "No\u0332"
+    # "\u0300": ""
+    # "\u0301": ""
+    # "\u0302": ""
+    # "\u0313": ""
+    "\u0370": "H\u0323"
+    "\u0371": "h\u0323"
+    "\u0372": "S\uFE20\u0332S\uFE21\u0332"
+    "\u0373": "s\uFE20\u0332s\uFE21\u0332"
+    "\u0374": "\u02B9"
+    "\u0375": "\u0326"
+    "\u0376": "W\u0323"
+    "\u0377": "w\u0323"
+    # \u0378 reserved
+    # \u0379 reserved
+    "\u037A": "\u0328"
+    "\u037B": "|)"
+    "\u037C": "(."
+    "\u037D": ".)"
+    "\u037E": "?\u0333"
+    "\u037F": "J"
+    # \u0380 reserved
+    # \u0381 reserved
+    # \u0382 reserved
+    # \u0383 reserved
+    "\u0384": " \u0301"
+    "\u0385": " \u0308\u0301"
+    "\u0386": "A\u0301"
+    "\u0387": ";\u0333"
+    "\u0388\u0314": "He\u0301"
+    "\u0388": "E\u0301"
+    "\u0389\u0314": "H\u0113\u0301"
+    "\u0389": "\u0112\u0301"
+    "\u038A\u0314": "Hi\u0301"
+    "\u038A": "I\u0301"
+    # \u038B reserved
+    "\u038C\u0314": "Ho\u0301"
+    "\u038C": "O\u0301"
+    # \u038D reserved
+    "\u038E\u0314": "Hy\u0301"
+    "\u038E": "Y\u0301"
+    "\u038F\u0314": "H\u014D\u0301"
+    "\u038F": "\u014C\u0301"
+    "\u0390": "i\u0308\u0301"
+    "\u1F09": "Ha"
+    "\u1F0B": "Ha"
+    "\u0391\u1F31": "Hai"
+    "\u0391\u1F51": "Hau"
+    "\u0391\u1F61": "Ha\u014D"
+    "\u0391\u03C5": "Au"
+    "\u1FBA\u03C5": "Ay"  # Tonos on preceding vowel
+    "\u0391": "A"
+    "\u1FBA": "A"
+    "\u0392": "B"
+    "\u0393": "G"
+    "\u0394": "D"
+    "\u0395\u03C5": "Eu"
+    "\u1F19": "He"
+    "\u1F1B": "He"
+    "\u1F19\u03C5": "Heu"
+    "\u1F1A\u03C5": "Ey"  # Tonos on preceding vowel
+    "\u1F1B\u03C5": "Hey"  # Tonos on preceding vowel
+    "\u0395": "E"
+    "\u1F1A": "E"
+    "\u0396": "Z"
+    "\u1F29": "H\u0113"
+    "\u1F2B": "H\u0113"
+    "\u0397": "\u0112"
+    "\u1F2A": "\u0112"
+    "\u0397\u03C5": "\u0112u"
+    "\u1F2A\u03C5": "\u0112y"  # Tonos on preceding vowel
+    "\u1F29\u1F51": "H\u0113u"
+    "\u1F2B\u1F51": "H\u0113y"  # Tonos on preceding vowel
+    "\u0398": "Th"
+    "\u1F39": "Hi"
+    "\u0399\u03C5": "Iu"
+    "\u0399": "I"
+    "\u039A": "K"
+    "\u039B": "L"
+    "\u039C\u03C0%": "B"
+    "\u039C": "M"
+    "\u039D\u03C4%": "\u1E0E"
+    "\u039D": "N"
+    "\u039E": "X"
+    "\u1F49": "Ho"
+    "\u1F4B": "Ho"
+    "\u039F\u03C5": "Ou"
+    "\u1F4A\u03C5": "Oy"  # Tonos on preceding vowel
+    "\u039F": "O"
+    "\u1F4A": "O"
+    "\u03A0": "P"
+    "\u1FEC": "Rh"
+    "\u03A1": "R"
+    # \u03A2 reserved
+    "\u03A3": "S"
+    "\u03A4": "T"
+    "\u1F59": "Hy"
+    "\u03A5": "Y"
+    "\u03A5\u03B9": "Ui"
+    "\u03A5\u1F31": "Hui"
+    "\u03A6": "Ph"
+    "\u03A7": "Ch"
+    "\u03A8": "Ps"
+    "\u1F69": "H\u014D"
+    "\u1F6B": "H\u014D"
+    "\u1F69\u03C5": "H\u014Du"
+    "\u1F6B\u03C5": "H\u014Dy"  # Tonos on preceding vowel
+    "\u03A9": "\u014C"
+    "\u1FFA": "\u014C"
+    "\u03A9\u03C5": "\u014Cu"
+    "\u1FFA\u03C5": "\u014Cy"  # Tonos on preceding vowel
+    "\u03AA": "I\u0308"
+    "\u03AB": "Y\u0308"
+    "\u03AC\u0314": "ha\u0301"
+    "\u03AC": "a\u0301"
+    "\u03AD\u0314": "he\u0301"
+    "\u03AD": "e\u0301"
+    "\u03AE\u0314": "h\u0113\u0301"
+    "\u03AE": "\u0113\u0301"
+    "\u03AF\u0314": "hi\u0301"
+    "\u03AF": "i\u0301"
+    "\u03B0": "y\u0308\u0301"
+    "\u03B1": "a"
+    "\u1F70": "a"
+    "\u03B1\u03C5": "au"
+    "\u03B1\u1F31": "hai"
+    "\u03B1\u1F51": "hau"
+    "\u03B1\u1F61": "ha\u014D"
+    "\u1F01": "ha"
+    "\u1F03": "ha"
+    "\u1F01\u03C5": "hau"
+    "\u1F01\u1F31": "hai"
+    "\u1F03\u03C5": "hay"  # Tonos on preceding vowel
+    "\u1f70\u03C5": "ay"  # Tonos on preceding vowel
+    "\u03B2": "b"
+    "\u03B3\u03B3": "ng"
+    "\u03B3\u03BA": "nk"
+    "\u0393\u03BA%": "Gk"
+    "\u03B3\u03BA%": "gk"
+    "%\u0393\u03BA": "Gk"
+    "%\u03B3\u03BA": "gk"
+    "\u03B3\u03BE": "nx"
+    "\u03B3\u03C7": "nch"
+    "\u03B3": "g"
+    "\u03B4": "d"
+    "\u1F11": "he"
+    "\u1F13": "he"
+    "\u03B5\u03C5": "eu"
+    "\u1F72\u03C5": "ey"  # Tonos on preceding vowel
+    "\u03B5\u1F51": "heu"
+    "\u1F13\u1F51": "hey"  # Tonos on preceding vowel
+    "\u03B5": "e"
+    "\u1F72": "e"
+    "\u03B6": "z"
+    "\u03B7": "\u0113"
+    "\u1F74": "\u0113"
+    "\u03B7\u03C5": "\u0113u"
+    "\u1F74\u03C5": "\u0113y"  # Tonos on preceding vowel
+    "\u1F21": "h\u0113"
+    "\u1F23": "h\u0113"
+    "\u1F21\u03C5": "h\u0113u"
+    "\u03B7\u1F51": "h\u0113u"
+    "\u1F23\u03C5": "h\u0113y"  # Tonos on preceding vowel
+    "\u03B8": "th"
+    "\u1F31": "hi"
+    "\u1F31\u03C5": "hiu"
+    "\u03B9\u03C5": "iu"
+    "\u03B9": "i"
+    "\u03BA": "k"
+    "\u03BB": "l"
+    "\u03BC\u03C0%": "b"
+    "\u03BC": "m"
+    "\u03BD\u03C4%": "\u1E0F"
+    "\u03BD": "n"
+    "\u03BE": "x"
+    "\u1F41": "ho"
+    "\u1F43": "ho"
+    "\u1F41\u03C5": "hou"
+    "\u03BF\u1F51": "hou"
+    "\u1F43\u03C5": "hoy"  # Tonos on preceding vowel
+    "\u03BF": "o"
+    "\u1F78": "o"
+    "\u03BF\u03C5": "ou"
+    "\u1F78\u03C5": "oy"  # Tonos on preceding vowel
+    "\u03C0": "p"
+    "\u1FE5": "rh"
+    "\u03C1": "r"
+    "\u03C2": "s"
+    "\u03C3": "s"
+    "\u03C4": "t"
+    "\u1F51": "hy"
+    "\u1F59": "Hy"
+    "\u03C5": "y"
+    "\u03CB": "y"
+    "\u03C5\u03B9": "ui"
+    "\u03C5\u1F31": "hui"
+    "\u03C6": "ph"
+    "\u03C7": "ch"
+    "\u03C8": "ps"
+    "\u03C9": "\u014D"
+    "\u1F7C": "\u014D"
+    "\u03C9\u03C5": "\u014Du"
+    "\u1F7C\u03C5": "\u014Dy"  # Tonos on preceding vowel
+    "\u1F61": "h\u014D"
+    "\u1F63": "h\u014D"
+    "\u1F61\u03C5": "h\u014Du"
+    "\u03C9\u1F51": "h\u014Du"
+    "\u1F63\u03C5": "h\u014Dy"  # Tonos on preceding vowel
+    "\u03CA": "i\u0308"
+    "\u03CB": "y"
+    "\u03CC": "o\u0301"
+    "\u03CD": "y\u0301"
+    "\u03CE": "\u014D\u0301"
+    "\u03CF": "K\u0326"
+    "\u03D0": "b\u0333"
+    "\u03D1": "t\u0333h\u0333"
+    "\u03D2": "Y\u0333"
+    "\u03D3": "Y\u0301\u0333"
+    "\u03D4": "Y\u0308\u0333"
+    "\u03D5": "p\u0333h\u0333"
+    "\u03D6": "p\u0333"
+    "\u03D7": "k\u0326"
+    "\u03D8": "Ḳ"
+    "\u03D9": "ḳ"
+    "\u03DA": "6\u0333"
+    "\u03DB": "6\u0332"
+    #"\u03DC": "G\u0332"
+    "\u03DC": "W"
+    #"\u03DD": "g\u0332"
+    "\u03DD": "w"
+    "\u03DE": "K\u0324"
+    "\u03DF": "k\u0324"
+    "\u03E0": "s\uFE20s\uFE21"
+    "\u03E1": "S\uFE20S\uFE21"
+    "\u03E2": "S\u030C"
+    "\u03E3": "s\u030C"
+    "\u03E4": "F"
+    "\u03E5": "f"
+    "\u03E6": "H\u0332"
+    "\u03E7": "h\u0332"
+    "\u03E8": "H\u0307"
+    "\u03E9": "h\u0307"
+    "\u03EA": "C\u030C"
+    "\u03EB": "c\u030C"
+    "\u03EC": "K\u0323y"
+    "\u03ED": "k\u0323y"
+    "\u03EE": "T\u0323i"
+    "\u03EF": "t\u0323i"
+    "\u03F0": "k\u0332"
+    "\u03F1": "r\u0332"
+    "\u03F2": "s"
+    "\u03F3": "j"
+    "\u03F4": "T\u0333H\u0333"
+    "\u03F5": "e\u0332"
+    "\u03F6": "e\u0333"
+    "\u03F7": "S\uFE20H\uFE21"
+    "\u03F8": "s\uFE20h\uFE21"
+    "\u03F9": "S"
+    "\u03FA": "S\u0323"
+    "\u03FB": "s\u0323"
+    "\u03FC": "r\u0333"
+    "\u03FD": "|)\u0333"
+    "\u03FE": "(.\u0333"
+    "\u03FF": ".)\u0333"
+
+  roman_to_script:
+    map:
+      ".)\u0333": "\u03FF"
+      ".)": "\u037D"
+      "?\u0333": "\u037E"
+      "\"\u0332": "\u201C"
+      "\"\u0333": "\u201D"
+      "'\u0332": "\u2018"
+      "'\u0333": "\u2019"
+      "(.\u0333": "\u03FE"
+      "(.": "\u037C"
+      "|)\u0333": "\u03FD"
+      "|)": "\u037B"
+      # Left pointing double angle quotation mark
+      "\u003C\u003C": "\u00AB"
+      # Right pointing double angle quotation mark
+      "\u003E\u003E": "\u00BB"
+      "6\u0332": "\u03DB"
+      "6\u0333": "\u03DA"
+      "Au": "\u0391\u03C5"
+      "au": "\u03B1\u03C5"
+      "a\u0301": "\u03AC"
+      "B": "\u0392"
+      "b": "\u03B2"
+      "b\u0333": "\u03D0"
+      "Ch": "\u03A7"
+      "ch": "\u03C7"
+      "c\u030C": "\u03EB"
+      "\u1E0E": "\u039D\u03C4"
+      "\u1E0F": "\u03BD\u03C4"
+      "D": "\u0394"
+      "d": "\u03B4"
+      "Eu": "\u0395\u03C5"
+      "eu": "\u03B5\u03C5"
+      "E\u0301": "\u0388"
+      "e\u0301": "\u03AD"
+      "H\u0113\u0301": "\u0389\u0314"
+      "\u0113\u0301": "\u03AE"
+      "\u0112\u0301": "\u0389"
+      "\u0112": "\u0397"
+      "\u0112u": "\u0397\u03C5"
+      "\u0113": "\u03B7"
+      "\u0113u": "\u03B7\u03C5"
+      "h\u0113u": "\u03B7\u1F51"
+      "e\u0332": "\u03F5"
+      "e\u0333": "\u03F6"
+      "F": "\u03E4"
+      "f": "\u03E5"
+      #"G\u0332": "\u03DC"
+      "W": "\u03DC"
+      #"g\u0332": "\u03DD"
+      "w": "\u03DD"
+      "Ha\u0301": "\u0386\u0314"
+      "ha\u0301": "\u03AC\u0314"
+      "Ha": "\u1F09"
+      "ha": "\u03B1\u0314"
+      "A": "\u0391"
+      "a": "\u03B1"
+      "h\u0113\u0301": "\u03AE\u0314"
+      "He\u0301": "\u0388\u0314"
+      "he\u0301": "\u03AD\u0314"
+      "H\u0113": "\u1F29"
+      "H\u0113u": "\u1F29\u03C5"
+      "h\u0113": "\u1F21"
+      "h\u0113u": "\u1F21\u03C5"
+      "He": "\u1F19"
+      "he": "\u1F11"
+      "E": "\u0395"
+      "e": "\u03B5"
+      "Hi\u0301": "\u038A\u0314"
+      "hi\u0301": "\u03AF\u0314"
+      "Hi": "\u1F39"
+      "hi": "\u1F31"
+      "H\u014D\u0301": "\u038F\u0314"
+      "Ho\u0301": "\u038C\u0314"
+      "h\u014D": "\u1F61"
+      "H\u014D": "\u1F69"
+      "Ho": "\u1F49"
+      "ho": "\u1F41"
+      "H\u0307": "\u03E8"
+      "h\u0307": "\u03E9"
+      "H\u0323": "\u0370"
+      "h\u0323": "\u0371"
+      "H\u0332": "\u03E6"
+      "h\u0332": "\u03E7"
+      "Hy\u0301": "\u038E\u0314"
+      "Hy": "\u1F59"
+      "hy": "\u1F51"
+      "Iu": "\u0399\u03C5"
+      "iu": "\u03B9\u03C5"
+      "I\u0301": "\u038A"
+      "i\u0301": "\u03AF"
+      "I\u0308": "\u03AA"
+      "i\u0308\u0301": "\u0390"
+      "i\u0308": "\u03CA"
+      "J": "\u037F"
+      "j": "\u03F3"
+      "K\u0323y": "\u03EC"
+      "k\u0323y": "\u03ED"
+      "K\u0324": "\u03DE"
+      "k\u0324": "\u03DF"
+      "K\u0326": "\u03CF"
+      "k\u0326": "\u03D7"
+      "k\u0332": "\u03F0"
+      "L": "\u039B"
+      "l": "\u03BB"
+      "M": "\u039C"
+      "m": "\u03BC"
+      "nch": "\u03B3\u03C7"
+      "ng": "\u03B3\u03B3"
+      "%nk%": "\u03B3\u03BA"
+      "nx": "\u03B3\u03BE"
+      "No\u0332": "\u2116"
+      "N": "\u039D"
+      "n": "\u03BD"
+      "K": "\u039A"
+      "k": "\u03BA"
+      "G": "\u0393"
+      "g": "\u03B3"
+      "Ou": "\u039F\u03C5"
+      "ou": "\u03BF\u03C5"
+      "O\u0301": "\u038C"
+      "o\u0301": "\u03CC"
+      "\u014C\u0301": "\u038F"
+      "\u014D\u0301": "\u03CE"
+      "\u014C": "\u03A9"
+      "\u014Cu": "\u03A9\u03C5"
+      "\u014D": "\u03C9"
+      "\u014Du": "\u03C9\u03C5"
+      "O": "\u039F"
+      "o": "\u03BF"
+      "Ph": "\u03A6"
+      "ph": "\u03C6"
+      "Ps": "\u03A8"
+      "ps": "\u03C8"
+      "p\u0333h\u0333": "\u03D5"
+      "p\u0333": "\u03D6"
+      "P": "\u03A0"
+      "p": "\u03C0"
+      "Ḳ": "\u03D8"
+      "ḳ": "\u03D9"
+      "Rh": "\u1FEC"
+      "rh": "\u1FE5"
+      "r\u0332": "\u03F1"
+      "r\u0333": "\u03FC"
+      "R": "\u03A1"
+      "r": "\u03C1"
+      "S\uFE20\u0332S\uFE21\u0332": "\u0372"
+      "s\uFE20\u0332s\uFE21\u0332": "\u0373"
+      "S\uFE20H\uFE21": "\u03F7"
+      "s\uFE20h\uFE21": "\u03F8"
+      "S\uFE20S\uFE21": "\u03E1"
+      "s\uFE20s\uFE21": "\u03E0"
+      "S\u030C": "\u03E2"
+      "s\u030C": "\u03E3"
+      "S\u0323": "\u03FA"
+      "s\u0323": "\u03FB"
+      "S": "\u03F9"
+      "s": "\u03F2"
+      "S": "\u03A3"
+      "%s": "\u03C2"
+      "s": "\u03C3"
+      "T\u0333H\u0333": "\u03F4"
+      "t\u0333h\u0333": "\u03D1"
+      "Th": "\u0398"
+      "th": "\u03B8"
+      "T\u0323i": "\u03EE"
+      "t\u0323i": "\u03EF"
+      "T": "\u03A4"
+      "t": "\u03C4"
+      "I": "\u0399"
+      "i": "\u03B9"
+      "\u0020\u0301": "\u0384"
+      "\u0020\u0308\u0301": "\u0385"
+      ";\u0333": "\u0387"
+      "\u02B9": "\u0374"
+      "\u0326": "\u0375"
+      "\u0328": "\u037A"
+      "V": "\u0392"
+      "v": "\u03B2"
+      "W\u0323": "\u0376"
+      "w\u0323": "\u0377"
+      "X": "\u039E"
+      "x": "\u03BE"
+      "Y\u0301\u0333": "\u03D3"
+      "Y\u0301": "\u038E"
+      "y\u0301": "\u03CD"
+      "Y\u0308\u0333": "\u03D4"
+      "y\u0308\u0301": "\u03B0"
+      "Y\u0308": "\u03AB"
+      "y\u0308": "\u03CB"
+      "Y\u0333": "\u03D2"
+      "Y": "\u03A5"
+      "Ui": "\u03A5\u03B9"
+      "Hui": "\u03A5\u1F31"
+      "y": "\u03C5"
+      "ui": "\u03C5\u03B9"
+      "hui": "\u03C5\u1F31"
+      "Z": "\u0396"
+      "z": "\u03B6"
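
To make the shape of the `normalize` section more concrete, here is a small sketch of how a target-to-variants mapping like the one above can be inverted into a variant-to-target lookup and applied to a string. The inversion follows the loop in `load_table`; the `normalize_text` helper and its plain `str.replace` strategy are assumptions for illustration, since the function that applies the rules is not part of this excerpt.

```python
# Excerpt shaped like the `normalize` section: target -> list of variants.
normalize_cfg = {
    "\u03B1": ["\u1F00", "\u1FB3"],        # α <- ἀ, ᾳ
    "\u1F70": ["\u03AC", "\u1F04"],        # ὰ <- ά, ἄ
    "\u1F01": ["\u03B1\u0314", "\u1F81"],  # ἁ <- α + combining dasia, ᾁ
    "": ["\u0314"],                        # drop a leftover combining dasia
}


def build_lookup(cfg):
    """Invert {target: [variants]} into {variant: target}."""
    lookup = {}
    for target, variants in cfg.items():
        for variant in variants:
            lookup[variant] = target
    return lookup


def normalize_text(text, lookup):
    """Hypothetical applier: replace longer variants first so that
    two-code-point sequences win over their single-code-point parts."""
    for variant in sorted(lookup, key=len, reverse=True):
        text = text.replace(variant, lookup[variant])
    return text


lookup = build_lookup(normalize_cfg)
result = normalize_text("\u1F04\u03B1\u0314", lookup)
print(result)
```

In this sketch the decomposed sequence alpha plus combining dasia is folded into the precomposed rough alpha before the catch-all rule removes stray diacritics, which is why the longer variants are tried first.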

+ 14 - 0
scriptshifter/tables/data/greek_modern.yml

@@ -0,0 +1,14 @@
+general:
+  name: Greek (modern)
+  parents:
+    - greek_classical
+
+script_to_roman:
+  map:
+    "\u0392": "V"
+    "\u03B2": "v"
+
+roman_to_script:
+  map:
+    "V": "\u0392"
+    "v": "\u03B2"
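
The modern table only lists what differs from its parent. As a rough sketch of the intended effect, assuming child entries simply override parent entries with the same key (the actual merge logic lives in `load_table` and is only partly visible in this diff), the resulting script-to-Roman map behaves like a dictionary union:

```python
# Tiny excerpts of the two maps; the variable names are illustrative only.
classical_s2r = {
    "\u0392": "B",  # Β
    "\u03B2": "b",  # β
    "\u0393": "G",  # Γ
    "\u03B3": "g",  # γ
}
modern_overrides = {
    "\u0392": "V",  # Β
    "\u03B2": "v",  # β
}

# Child entries win over parent entries with the same key (Python 3.9+).
modern_s2r = classical_s2r | modern_overrides

print(modern_s2r["\u03B2"], modern_s2r["\u03B3"])
```

Everything not overridden, such as gamma here, keeps the classical transliteration.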

+ 6 - 2
scriptshifter/tables/data/index.yml

@@ -50,10 +50,14 @@ ethiopic:
   name: Ethiopic (Amharic)
 even-evenki_cyrillic:
   name: Even/Evenki (Cyrillic)
-georgian:
-  name: Georgian
 gagauz_cyrillic:
   name: Gagauz (Cyrillic)
+georgian:
+  name: Georgian
+greek_classical:
+  name: Greek (classical)
+greek_modern:
+  name: Greek (modern)
 hebrew:
   name: Hebrew
 hindi:

+ 60 - 26
scriptshifter/trans.py

@@ -2,16 +2,12 @@ import logging
 import re
 
 from scriptshifter.exceptions import BREAK, CONT
-from scriptshifter.tables import WORD_BOUNDARY, load_table
+from scriptshifter.tables import BOW, EOW, WORD_BOUNDARY, load_table
 
 
 # Match multiple spaces.
 MULTI_WS_RE = re.compile(r"\s{2,}")
 
-# Cursor bitwise flags.
-CUR_BOW = 1 << 0
-CUR_EOW = 1 << 1
-
 logger = logging.getLogger(__name__)
 
 
@@ -25,11 +21,11 @@ class Context:
 
     @src.setter
     def src(self):
-        raise NotImplementedError("Atribute is read-only.")
+        raise NotImplementedError("Attribute is read-only.")
 
     @src.deleter
     def src(self):
-        raise NotImplementedError("Atribute is read-only.")
+        raise NotImplementedError("Attribute is read-only.")
 
     def __init__(self, src, general, langsec, options={}):
         """
@@ -110,11 +106,18 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
     if _run_hook("post_config", ctx, langsec_hooks) == BREAK:
         return getattr(ctx, "dest", ""), ctx.warnings
 
+    if "normalize" in ctx.langsec:
+        _normalize_src(ctx)
+
+    if _run_hook("post_normalize", ctx, langsec_hooks) == BREAK:
+        return getattr(ctx, "dest", ""), ctx.warnings
+
     # Loop through source characters. The increment of each loop depends on
     # the length of the token that eventually matches.
     ignore_list = langsec.get("ignore", [])  # Only present in R2S
     ctx.cur = 0
     word_boundary = langsec.get("word_boundary", WORD_BOUNDARY)
+
     while ctx.cur < len(ctx.src):
         # Reset cursor position flags.
         # Carry over extended "beginning of word" flag.
@@ -122,19 +125,14 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
         cur_char = ctx.src[ctx.cur]
 
         # Look for a word boundary and flag the word beginning/end if found.
-        if (ctx.cur == 0 or ctx.src[ctx.cur - 1] in word_boundary) and (
-                cur_char not in word_boundary):
+        if _is_bow(ctx.cur, ctx, word_boundary):
             # Beginning of word.
             logger.debug(f"Beginning of word at position {ctx.cur}.")
-            ctx.cur_flags |= CUR_BOW
-        if (
-            ctx.cur == len(ctx.src) - 1
-            or ctx.src[ctx.cur + 1] in word_boundary
-        ) and (cur_char not in word_boundary):
-            # Beginning of word.
+            ctx.cur_flags |= BOW
+        if _is_eow(ctx.cur, ctx, word_boundary):
             # End of word.
             logger.debug(f"End of word at position {ctx.cur}.")
-            ctx.cur_flags |= CUR_EOW
+            ctx.cur_flags |= EOW
 
         # This hook may skip the parsing of the current
         # token or exit the scanning loop altogether.
@@ -183,28 +181,45 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
 
         # Begin transliteration token lookup.
         ctx.match = False
-        for ctx.src_tk, ctx.dest_tk in langsec["map"]:
+
+        for ctx.src_tk, ctx.dest_str in langsec["map"]:
             hret = _run_hook("pre_tx_token", ctx, langsec_hooks)
             if hret == BREAK:
                 break
             if hret == CONT:
                 continue
 
-            step = len(ctx.src_tk)
+            step = len(ctx.src_tk.content)
+            # If the token is longer than the remainder of the string,
+            # it cannot match.
+            if ctx.cur + step > len(ctx.src):
+                continue
 
             # If the first character of the token is greater (= higher code
             # point value) than the current character, then break the loop
             # without a match, because we know there won't be any more match
             # due to the alphabetical ordering.
-            if ctx.src_tk[0] > cur_char:
+            if ctx.src_tk.content[0] > cur_char:
                 logger.debug(
-                        f"{ctx.src_tk} is after "
+                        f"{ctx.src_tk.content} is after "
                         f"{ctx.src[ctx.cur:ctx.cur + step]}. Breaking loop.")
                 break
 
+            # If src_tk has a WB flag but the token is not at WB, skip.
+            if (
+                (ctx.src_tk.flags & BOW and not ctx.cur_flags & BOW)
+                or
+                # Can't rely on EOW flag, we must check on the last character
+                # of the potential match.
+                (ctx.src_tk.flags & EOW and not _is_eow(
+                        ctx.cur + step - 1, ctx, word_boundary))
+            ):
+                continue
+
             # Longer tokens should be guaranteed to be scanned before their
             # substrings at this point.
-            if ctx.src_tk == ctx.src[ctx.cur:ctx.cur + step]:
+            # Similarly, flagged tokens are evaluated first.
+            if ctx.src_tk.content == ctx.src[ctx.cur:ctx.cur + step]:
                 ctx.match = True
                 # This hook may skip this token or break out of the token
                 # lookup for the current position.
@@ -223,20 +238,21 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
                     or
                     (
                         ctx.options["capitalize"] == "all"
-                        and ctx.cur_flags & CUR_BOW
+                        and ctx.cur_flags & BOW
                     )
                 ):
                     logger.info("Capitalizing token.")
                     double_cap = False
                     for dcap_rule in ctx.langsec.get("double_cap", []):
-                        if ctx.dest_tk == dcap_rule:
-                            ctx.dest_tk = ctx.dest_tk.upper()
+                        if ctx.dest_str == dcap_rule:
+                            ctx.dest_str = ctx.dest_str.upper()
                             double_cap = True
                             break
                     if not double_cap:
-                        ctx.dest_tk = ctx.dest_tk[0].upper() + ctx.dest_tk[1:]
+                        ctx.dest_str = (
+                                ctx.dest_str[0].upper() + ctx.dest_str[1:])
 
-                ctx.dest_ls.append(ctx.dest_tk)
+                ctx.dest_ls.append(ctx.dest_str)
                 ctx.cur += step
                 break
 
@@ -281,6 +297,24 @@ def transliterate(src, lang, t_dir="s2r", capitalize=False, options={}):
     return ctx.dest, ctx.warnings
 
 
+def _normalize_src(ctx):
+    for nk, nv in ctx.langsec.get("normalize", {}).items():
+        ctx._src = ctx.src.replace(nk, nv)
+    logger.debug(f"Normalized source: {ctx.src}")
+
+
+def _is_bow(cur, ctx, word_boundary):
+    return (cur == 0 or ctx.src[cur - 1] in word_boundary) and (
+            ctx.src[cur] not in word_boundary)
+
+
+def _is_eow(cur, ctx, word_boundary):
+    return (
+        cur == len(ctx.src) - 1
+        or ctx.src[cur + 1] in word_boundary
+    ) and (ctx.src[cur] not in word_boundary)
+
+
 def _run_hook(hname, ctx, hooks):
     ret = None
     for hook_def in hooks.get(hname, []):
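
The word-boundary helpers and the normalization pass added in this file can be tried in isolation. The following is a minimal sketch, not the project's code: the flag values, the `WORD_BOUNDARY` string, and the names `normalize` and `token_applies` are assumptions made for illustration. Only the boolean logic mirrors `_is_bow`, `_is_eow`, `_normalize_src`, and the flag check inside the token loop.

```python
# Hypothetical stand-ins for the names imported from scriptshifter.tables.
# The real values are not shown in this diff.
BOW = 1 << 0
EOW = 1 << 1
WORD_BOUNDARY = " \n\t:;.,\"'-()[]{}"  # assumed, for illustration only


def normalize(src, rules):
    """Apply each variant -> canonical replacement in turn."""
    for variant, canonical in rules.items():
        src = src.replace(variant, canonical)
    return src


def is_bow(cur, src, word_boundary=WORD_BOUNDARY):
    """True if src[cur] is the first character of a word."""
    return (cur == 0 or src[cur - 1] in word_boundary) and (
        src[cur] not in word_boundary)


def is_eow(cur, src, word_boundary=WORD_BOUNDARY):
    """True if src[cur] is the last character of a word."""
    return (
        cur == len(src) - 1 or src[cur + 1] in word_boundary
    ) and src[cur] not in word_boundary


def token_applies(content, flags, cur, src):
    """Check a (possibly flagged) token against src at position cur."""
    step = len(content)
    # A token longer than the remainder of the string cannot match.
    if cur + step > len(src):
        return False
    if flags & BOW and not is_bow(cur, src):
        return False
    # The end-of-word test looks at the last character of the candidate
    # match, not at the cursor position.
    if flags & EOW and not is_eow(cur + step - 1, src):
        return False
    return src[cur:cur + step] == content


src = "logos kai"
print(token_applies("s", EOW, 4, src))  # "s" closes the word "logos"
print(token_applies("o", EOW, 1, src))  # "o" is in the middle of the word
```

Checking the end-of-word condition on the last character of the candidate match is what allows a multi-character token to carry an end-of-word flag.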

+ 54 - 0
tests/data/script_samples/greek.csv

@@ -0,0 +1,54 @@
+greek_classical,Ἡσιόδου τοῦ Ἀσκραίου Ἔργα καὶ ἡμέραι,Hēsiodou tou Askraiou Erga kai hēmerai,,
+greek_classical,Ἡ τοῦ Ὁμήρου Ἰλιάς,Hē tou Homērou Ilias,,
+greek_classical,Φίληβος ἢ Περὶ ἡδονῆς,Philēbos ē Peri hēdonēs,,
+greek_classical,Ἀγνώστῳ θεῷ,Agnōstō theō,,
+greek_classical,κεῖται παρ’ Ἅιδῃ,keitai par’ Hadē,,
+greek_classical,Αἴτια Ῥωμαϊκά,Aitia Rhōmaika,,
+greek_classical,Ὅτι οὐδ’ ἡδέως ζῆν ἔστι κατ’ Ἐπίκουρον,Hoti oud’ hēdeōs zēn esti kat’ Epikouron,,
+greek_classical,Περὶ τοῦ μὴ ῥᾳδίως πιστεύειν διαβολῇ,Peri tou mē rhadiōs pisteuein diabolē,,
+greek_classical,ἀΰπνους νύκτας ἴαυον,aypnous nyktas iauon,,
+greek_classical,Λητοῦς καὶ Διὸς υἱός,Lētous kai Dios huios,,
+greek_classical,ὑϊκὸν πάσχειν,hyikon paschein,,
+greek_classical,εἶπε πρὸς τὸν ἄνδρα τὸν ἑωυτῆς,eipe pros ton andra ton heōutēs,,
+greek_classical,τί τοῦδ’ ἂν εὕρημ’ ηὗρον εὐτυχέστερον;,ti toud’ an heurēm’ hēuron eutychesteron,,
+greek_classical,Τοῦ Κατὰ πασῶν αἱρέσεων ἐλέγχου βιβλίον αʹ,Tou Kata pasōn haireseōn elenchou biblion 1,,
+greek_classical,καλὸν κἀγαθόν,kalon kagathon,,
+greek_classical,ᾤχοντο θοἰμάτιον λαβόντες μου,ōchonto thoimation labontes mou,,
+greek_classical,Περὶ ἰλίγγων,Peri ilingōn,,
+greek_classical,ὅτε τ’ ἴαχε σάλπιγξ,hote t’ iache salpinx,,
+greek_classical,Ἐγχειρίδιον ἁρμονικῆς,Encheiridion harmonikēs,,
+greek_classical,ἄλαϲτα δὲ ϝέργα πάθον κακὰ μηϲαμένοι,alasta de werga pathon kaka mēsamenoi,,
+greek_classical,Δαμαρέτα τ’ ἐρατά τε Ϝιανθεμίϲ,Damareta t’ erata te Wianthemis,,
+greek_classical,ξένϝος,xenwos,,
+greek_classical,Πάτροϙλος,Patroḳlos,,
+greek_modern,"Ἐτήσια ἔκθεσις / Κυπριακὴ Δημοκρατία, Ὑπουργεῖον Ἐργασίας καὶ Κοινωνικῶν Ἀσφαλίσεων","Etēsia ekthesis / Kypriakē Dēmokratia, Hypourgeion Ergasias kai Koinōnikōn Asphaliseōn",,
+greek_modern,"Ετήσια έκθεση / Κυπριακή Δημοκρατία, Υπουργείο Εργασίας και Κοινωνικών Ασφαλίσεων","Etēsia ekthesē / Kypriakē Dēmokratia, Hypourgeio Ergasias kai Koinōnikōn Asphaliseōn",,
+greek_modern,Ελληνικό Ίδρυμα Ευρωπαϊκής και Εξωτερικής Πολιτικής,Hellēniko Hidryma Eurōpaikēs kai Exōterikēs Politikēs,,
+greek_modern,Ελευθέριος Δ. Παυλίδης,Eleutherios D. Paulidēs,,
+greek_modern,Ορθόδοξος Αυτοκέφαλος Εκκλησία της Αλβανίας,Orthodoxos Autokephalos Ekklēsia tēs Alvanias,,
+greek_modern,Βίος και πολιτεία του Αλέξη Ζορμπά,Vios kai politeia tou Alexē Zormpa,,
+greek_modern,Βίος καὶ πολιτεία τοῦ Ἀλέξη Ζορμπᾶ,Vios kai politeia tou Alexē Zormpa,,
+greek_modern,Λασκαρίνα Μπουμπουλίνα,Laskarina Boumpoulina,,
+greek_modern,Νταίηβιντ Μίτσελ,Ḏaiēvint Mitsel,,
+greek_modern,Τζαίημς Τζόυς,Tzaiēms Tzoys,,
+greek_modern,Ἡ κοινωνιολογία τοῦ ρεμπέτικου,Hē koinōniologia tou rempetikou,,
+greek_modern,Βίλλυ Μπραντ,Villy Brant,,
+greek_modern,Μπραντ Πιτ,Brant Pit,,
+greek_modern,Γιάκομπ Φίλιπ Φαλμεράυερ,Giakomp Philip Phalmerayer,,
+greek_modern,Σαρλ Ογκουστίν ντε Κουλόμπ,Sarl Onkoustin de Koulomp,,
+greek_modern,Λαμπέρτο Ντίνι,Lamperto Ḏini,,
+greek_modern,Τζωρτζ Χέρμπερτ Ουώκερ Μπους,Tzōrtz Chermpert Ouōker Bous,,
+greek_modern,Ουίνστων Τσώρτσιλ,Ouinstōn Tsōrtsil,,
+greek_modern,Παγκόσμιο Κέντρο Εμπορίου,Pankosmio Kentro Emporiou,,
+greek_modern,Φαίδων Γκιζίκης,Phaidōn Gkizikēs,,
+greek_modern,Γκέτεμποργκ,Gketemporgk,,
+greek_modern,Ουάσιγκτον,Ouasinkton,,
+greek_modern,Ουάσινγκτον,Ouasinnkton,,
+greek_modern,Αεροδρόμιο Ρόναλντ Ρέιγκαν της Ουάσινγκτον,Aerodromio Ronalnt Reinkan tēs Ouasinnkton,,
+greek_modern,Ντμίτρι Ιβάνοβιτς Μεντελέγιεφ,Ḏmitri Ivanovits Mentelegieph,,
+greek_modern,Άγγελος Σταύρου Βλάχος,Angelos Staurou Vlachos,,
+greek_modern,ΟΔΗΓΟΣ ΜΑΡΚΕΤΙΝΓΚ ΕΛΛΑΔΟΣ / Ἑλληνικό Ἰνστιτοῦτο Μάρκετινγκ τῆς Ἑλληνικῆς Ἑταιρίας Διοικήσεως Ἐπιχειρήσεων,Hodēgos marketingk Hellados / Hellēniko Institouto Marketingk tēs Hellēnikēs Hetairias Dioikēseōs Epicheirēseōn,,
+greek_modern,Σάλπιγξ Ἑλληνική,Salpinx Hellēnikē,,
+greek_modern,Μπιντπάϋ,Bintpay,,
+greek_modern,Η υιοθεσία ενηλίκων,Hē huiothesia enēlikōn,,
+greek_modern,οι Άρπυιες,hoi Harpuies,,
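
Each row of the sample file appears to hold the table ID, the source-script string, and the expected romanization, followed by two empty fields whose purpose is not shown in this diff. A small sketch of reading such rows with Python's standard `csv` module follows; the variable names are illustrative, and the two rows are copied from the file above.

```python
import csv
import io

# Two rows copied from tests/data/script_samples/greek.csv.
SAMPLE = (
    "greek_classical,καλὸν κἀγαθόν,kalon kagathon,,\n"
    "greek_modern,Μπραντ Πιτ,Brant Pit,,\n"
)

rows = list(csv.reader(io.StringIO(SAMPLE)))
for table_id, script, roman, *rest in rows:
    # `rest` holds the two trailing fields, empty in every row of this file.
    print(f"{table_id}: {script} -> {roman}")
```

Using `csv.reader` rather than splitting on commas matters here, because several rows in the file quote fields that contain commas.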