ExCantonese roadmap & status

ExCantonese is an in-house Elixir library that seeks to be the do-everything solution for Cantonese works. ExCantonese and ExOpenType form the driving force behind the font development. The libraries are proprietary, and it is unlikely that either library will be open-sourced.[^1]

[^1] Not as if there is anyone other than me who writes Elixir and does Cantonese / fonts!

This first post is a living roadmap / status update, with progress reports (and Jon’s soliloquies™) in the subsequent posts.


Design / Specifications

Modules

ExCantonese MUST have structs + functions for working with existing data in-memory:

  • data around Unicode codepoints,
  • data around word/compounds,
  • data around phrases,
  • data around romanizations

These are grouped under the ExCantonese.Data.* namespace, with the data files located under /priv/data/. Data loading utilities are grouped under ExCantonese.Data.Utils, and the loaders are under ExCantonese.Data.*.Loader.

Existing data SHOULD read its data file only once, either at compile time or at runtime when the first query is made (via Erlang persistent terms). These SHOULD be compacted to < 20 MB of memory when deployed.[^2]

[^2] This rules out including Unicode.Unihan as a dependency, which takes around 200 MB of memory. In prior work we needed to extract the relevant information from Unihan and add it to our data source.

There MUST also be modules for specific instances (usually relating to user input), at various levels of granularity:

  • jyutping
  • characters 字
  • words 詞
  • phrases
  • works

There MUST be modules for working with each input/output format, grouped under ExCantonese.IO.*:

  • Cantonese markups
    • dry
    • hydrated
  • LaTeX
    • ruby
    • triple-ruby
  • Typst[^3]
  • HTML

There MAY be work around SVG here, or it may be spun out into a separate repo.

[^3] This awaits the stabilization of Typst's furigana support.

Layering

ExCantonese follows a standard Core - Boundary layering.

Quality Control

Without the possibility of code review, development MUST use:

  • credo for good practices
  • dialyxir for static type analysis
  • doctor for doc/test coverage
  • mix format for every check-in (noting that I apply special styling around usage of multiple -> )

We MAY treat warnings as errors, and we MAY enforce type specs on Core.

For parsing, we SHOULD favor an explicit grammar, parsed with PEGasus, over regular expressions.


Roadmap / Status

As of 2024-02-01, ExCantonese is in a sorry state and needs a tear-down and rebuild. Work on the library started before Unicode.Unihan, relied on Rime as a data source, and, frankly, I didn’t understand what problem needed to be solved.

The goal is to complete a disciplined rebuild by Apr 2024, and use this as a base to build up web APIs, whose access will be the lagniappe for paying customers.

Question: what’s the tie-in between WebAPIs and rebuild aside from the library quality itself?

Also, if your goal is Web APIs, can I suggest starting with a rough list of the API endpoints you want before the rebuild? Having built a few web APIs before… that can sometimes affect design.

Oh… and I spent some time last night learning Elixir. Nifty language… I’ve never seen a language actually use graphemes as the fundamental iteration type for its string structure. Most languages were designed before that concept was even formalized… what a different world.
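For anyone who hasn't seen this in practice, here is a small standard-library-only illustration of what grapheme-based strings mean in Elixir (the literal below spells "é" as two codepoints on purpose):

```elixir
# "é" written as two codepoints: "e" plus a combining acute accent (U+0301).
s = "e\u0301"

# Elixir's String module iterates by grapheme, so this is one "character"...
1 = String.length(s)
["e\u0301"] = String.graphemes(s)

# ...even though it is two codepoints (and three UTF-8 bytes).
2 = length(String.codepoints(s))
3 = byte_size(s)
```

The bare pattern matches (`1 = …`) double as assertions: they raise `MatchError` if any value differs.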

Re-building: what happened last year was that I didn’t have a very clear idea of what the library should do, and let the data-related stuff creep into the core. For the most part this is just tearing them apart.

The slightly thorny part is handling :persistent_term. In Erlang/Elixir, most things are share-nothing / immutable, but there are a few ways to share state. :persistent_term is an obscure Erlang feature that provides fast-read, VM-wide state (but is extremely expensive to write to), which is exactly what we want for this dictionary data.

The way this works in Unicode.Unihan and ExCantonese is that we do a parse of the CSV (slow), then (1) save the result to an Erlang Term File (.etf), and (2) push it into :persistent_term. On future application starts, if the .etf exists, we just do the fast load.
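As a sketch of that load path (module name, file layout, and CSV shape below are all assumptions for illustration, not the real ExCantonese code):

```elixir
# Hypothetical sketch of the CSV -> .etf -> :persistent_term load path.
defmodule DataLoadSketch do
  @pt_key {__MODULE__, :char_data}

  # Slow path on the first ever run; fast path whenever the .etf exists.
  def load!(csv_path, etf_path) do
    data =
      if File.exists?(etf_path) do
        # Fast load: deserialize the pre-built Erlang Term File.
        etf_path |> File.read!() |> :erlang.binary_to_term()
      else
        # Slow load: parse the CSV once, then cache it as .etf for next time.
        data = parse_csv!(csv_path)
        File.write!(etf_path, :erlang.term_to_binary(data, [:compressed]))
        data
      end

    # VM-wide, fast-read storage; writes are expensive, so write exactly once.
    :persistent_term.put(@pt_key, data)
  end

  def lookup(char), do: :persistent_term.get(@pt_key) |> Map.get(char)

  # Assumed toy CSV shape: one character per line, readings after it.
  defp parse_csv!(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Map.new(fn line ->
      [char | readings] = String.split(line, ",")
      {char, readings}
    end)
  end
end
```

Calling `load!/2` a second time hits the `.etf` branch and skips the CSV parse entirely, which is the whole point of the cache.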


What I haven’t quite wrapped my head around is the right way to do the bit that actually matters. I had been working with the belief that

  1. ExCantonese.Data.Char (data for each character, such as how each character could be pronounced; the same file also drives the font through the default readings but not completely) and
  2. ExCantonese.Data.Word (data from Words.hk about words; not completely what is used in the font)

are all I need. This is wrong.

This is wrong because I performed multiple interventions in how the character data is used (e.g., in general the highest-priority Traditional character reading is used as the default, but this is not true for 叶, for which I took jip6), and both added and removed data from the word bank (e.g., 十九 is found in Words.hk as sap1 gau1, but it is not in the font for obvious reasons).

To take the Dry markup to the Hydrated markup, what I need is to replicate as closely as possible what the font is doing. This actually means building out

  • ExCantonese.Data.Font.Char and
  • ExCantonese.Data.Font.Word

which is attained by parsing the collection of font rules.
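A minimal sketch of that layering (the module name, override map, and data shape below are invented for illustration; the real version would be attained by parsing the font rules):

```elixir
# Hypothetical sketch: the Font.Char layer applies the font's manual
# overrides on top of the general per-character data.
defmodule FontCharSketch do
  # Example override from the post: the font takes jip6 for 叶 rather than
  # the highest-priority Traditional reading.
  @overrides %{"叶" => "jip6"}

  # char_data maps a character to its readings, highest priority first
  # (a stand-in for what ExCantonese.Data.Char would provide).
  def default_reading(char, char_data) do
    Map.get(@overrides, char) || hd(Map.fetch!(char_data, char))
  end
end
```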

None of this is hard at all, but it is a little lower in priority than the website stuff. I’ve been fighting fires for the studio business the last few days, and I just need a few days once the website is up.


Elixir / Phoenix / LiveView / Ash is quite a lot to learn. What is quite interesting (I did a bit of React/Vue before coming to Elixir/Phoenix) is how strongly it is oriented around building out the application; the web-facing stuff can generally fit around the application. (<- that’s my way of saying I haven’t thought much about the API!)

If I were pushed, I’d say what would be available would be just a single POST to /hydrate/, where the request body contains the Dry markup text and the response is the Hydrated markup. (Programmers can then do what they want in their fav lang.)
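To make that endpoint body concrete, here is a toy stand-in for the hydration step itself (the actual Dry/Hydrated markup formats are not shown in this post; the bracket notation and reading map below are invented for illustration):

```elixir
# Toy illustration only: "hydrate" a plain string by annotating each known
# character with a bracketed reading, leaving everything else untouched.
defmodule HydrateSketch do
  # Assumed reading map; the real data would come from the font rules.
  @readings %{"你" => "nei5", "好" => "hou2"}

  def hydrate(dry) do
    dry
    |> String.graphemes()
    |> Enum.map_join(fn g ->
      case @readings do
        %{^g => reading} -> "#{g}[#{reading}]"
        _ -> g
      end
    end)
  end
end
```

A POST to /hydrate/ would then be a thin wrapper: read the request body, call a function like this, and return the result as plain text.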

(I also haven’t thought enough about the API because I really, really want to build something “the Ash way”, and I haven’t learnt enough.)