En-Zh Dictionary discussion / feedback

With 2.2.0, the font added the ability to wrap an English word with braces {word} and have the word displayed in Traditional Chinese/jyutping. This is a thread for discussion / suggestions.


How it works

Some technical background helps here, so you know what is possible and where we can intervene.

Within the font

Behind the scenes, the font simply does a replacement whenever a particular sequence of characters appears. For example, there is a rule for {monkey}, such that it is replaced with 馬騮.

What this means is that the text is rigid: neither {Monkey} nor {monkeys} would work out of the box.
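To make the rigidity concrete, here is a minimal sketch of exact-match replacement in Python. It is only a simplified model of the behaviour described above, not the font's actual rule format; the `RULES` table and the `render` function are hypothetical names made up for illustration.

```python
import re

# Hypothetical rule table: exact character sequence -> replacement text.
RULES = {"{monkey}": "馬騮"}


def render(text: str) -> str:
    """Replace each braced token that has an exact-match rule.

    Tokens without a rule are left exactly as typed.
    """
    return re.sub(
        r"\{[^{}]*\}",
        lambda m: RULES.get(m.group(0), m.group(0)),
        text,
    )
```

In this model `render("{monkey}")` gives 馬騮, while `render("{Monkey}")` and `render("{monkeys}")` come back unchanged, because no rule matches those exact sequences.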

Note the strange text uF68FC: this is the target Chinese text. We cannot create more of these; fonts have a hard limit of 65,535 “shapes”, and Pokfield has used up its quota.

The database

These rules are prepared from a Google Sheet:

At the moment the same database produces the rules for both En → Zh and Zh → En. This is a temporary solution; Zh-En probably needs its own database.

The dataset is mish-mashed from:

  • all countries’ common names (~250)
  • world cities (~400)
  • Hong Kong MTR locations (~200)
  • high-frequency words, in two batches
    • batch 1, about 1500, was selected from a Collins English-Spanish frequency dictionary I had on my shelf;
    • batch 2, the remaining 1500 high-frequency words, came from merging various online lists and de-duplicating. These are Google Translated.

In general my manual translations, which cover the majority of the vocabulary, favor spoken Cantonese (e.g., 馬騮 instead of 猴子). The Google Translations, on the other hand, are all literal. We should correct these as they surface.

Updates / versioning

Updating these rules from a new Sheet is neither difficult nor time-consuming. These changes are considered minor, and bump only the sub-sub-version number (e.g., 2.0.0 → 2.0.1).

This is a proposal for a new feature. @jkwchui

Plurals, capitalization, and conjugates

The problem

It is desirable that users can write English prose in a word processor and simply wrap words in braces to have them translated. Currently it doesn’t really work that way, because the same meaning in English can be spelled with different character sequences. While we can tag every word in the (grammatically incorrect) “monkey eat banana”, the tagging fails on at least one word in each of the following:

  • Monkey eats banana
  • Monkeys eat banana
  • monkey eat bananas

Users should not be expected to know how it works under the hood, and they would just consider this… buggy.

The solutions

I propose we

  1. add an additional derived entry for plurals and conjugates, and
  2. automatically generate capitalized versions for every input

This is feasible because it does not involve adding any new Chinese combinations (which would count towards the font’s 65k ceiling); it only adds new rules (where the ceiling is 2^32 bytes).

Derived entries

  • 1.1 If the word is a noun (but not a proper noun), add the pluralized version (e.g., monkeys, technologies, fishes)
  • 1.2 If a noun exists as plural, add the singular version (e.g., datum, bacterium)
  • 1.3 If the word is a verb, add the conjugated versions and the infinitive (e.g., run → runs, to run)

(Other rules needed?)
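As a rough illustration of rules 1.1 to 1.3, the sketch below derives extra entries from a base word. It is an assumption about how this could be done, not the actual implementation: the function names, the `pos` labels and the `IRREGULAR_SINGULARS` table are all hypothetical, and only regular English suffix rules are handled, so irregular forms would still need manual entries.

```python
VOWELS = "aeiou"

# Hypothetical hand-maintained table for rule 1.2 (plural -> singular).
IRREGULAR_SINGULARS = {"data": "datum", "bacteria": "bacterium"}


def add_s(word: str) -> str:
    """Apply the regular -s / -es / -ies suffix rule.

    The same rule covers regular noun plurals and third-person verbs.
    """
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    if word.endswith("y") and len(word) > 1 and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    return word + "s"


def derive(word: str, pos: str, is_proper: bool = False) -> list[str]:
    """Return the base entry plus its derived forms."""
    if pos == "noun" and not is_proper:
        if word in IRREGULAR_SINGULARS:               # rule 1.2
            return [word, IRREGULAR_SINGULARS[word]]
        return [word, add_s(word)]                    # rule 1.1
    if pos == "verb":
        return [word, add_s(word), "to " + word]      # rule 1.3
    return [word]
```

For example, `derive("technology", "noun")` yields `["technology", "technologies"]` and `derive("run", "verb")` yields `["run", "runs", "to run"]`. Every derived form would point to the same Chinese target as its base word, so no new Chinese combinations are needed.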

Capitalized versions

2.1 When creating rules from a lower-case (derived) entry, also programmatically generate the capitalized version: Run, Runs, To run
2.2 When creating rules from a capitalized entry, also programmatically generate the lower-cased version: America → america

Note that there are certain rules that would need to be exceptions here, e.g., China (country) / china (ceramic), Turkey (country) / turkey (animal).
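A minimal sketch of rules 2.1 and 2.2, including the exceptions, could look like the following. The `CASE_SENSITIVE` set and the `case_variants` function are hypothetical, and the sketch assumes that "capitalized" means an upper-case first letter and that the lower-cased version lowers the whole entry.

```python
# Hypothetical exception list: words whose meaning depends on capitalization.
# Stored lower-cased so both "China" and "china" are caught.
CASE_SENSITIVE = {"china", "turkey"}


def case_variants(entry: str) -> list[str]:
    """Return the entry plus its opposite-case variant, if one should exist."""
    if not entry or entry.lower() in CASE_SENSITIVE:
        return [entry]                                   # exception: no variant
    if entry[0].islower():
        return [entry, entry[0].upper() + entry[1:]]     # rule 2.1
    if entry[0].isupper():
        return [entry, entry.lower()]                    # rule 2.2
    return [entry]                                       # digits, symbols, etc.
```

Here `case_variants("to run")` gives `["to run", "To run"]` and `case_variants("America")` gives `["America", "america"]`, while `case_variants("China")` returns only `["China"]`, so the country and the ceramic can keep separate entries.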

Plurals, capitalization, and conjugates

Plurals, capitalization, and conjugates have been implemented in 2.2.2.

The method to accomplish this is briefly written up at Font-embedded “Translation” – jon.hk. The data-set will need lots of manual translations / tuning over the next months or years, but the foundation for the technology is laid.