ExCantonese is an in-house Elixir library that seeks to be the do-everything solution for Cantonese works. ExCantonese and ExOpenType form the driving force behind the font development. The libraries are proprietary, and it is unlikely that either library will be open-sourced.[^1]
[^1] Not like there is anyone other than me who writes Elixir and also does Cantonese / fonts!
This first post is a living roadmap / status update, with progress reports (and Jon’s soliloquies™) in the subsequent posts.
Design / Specifications
Modules
ExCantonese MUST have structs + functions for working with existing data in-memory:
- data around Unicode codepoints,
- data around words/compounds,
- data around phrases,
- data around romanizations
These are grouped under the ExCantonese.Data.* namespace, with the data files located under /priv/data/. Data loading utilities are grouped under ExCantonese.Data.Utils, and the loaders are under ExCantonese.Data.*.Loader.
Existing data SHOULD be read from its data file only once, either at compile time or at runtime when the first query is made (via Erlang persistent terms). The loaded data SHOULD be compacted to under 20 MB of memory when deployed.[^2]
[^2] This rules out including Unicode.Unihan as a dependency, which takes around 200 MB of memory. As a preliminary step, we need to extract the relevant information from Unihan and add it to our own data source.
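The "load once, on first query" option can be sketched with `:persistent_term` roughly as follows. This is a minimal sketch and not the actual ExCantonese loader: the module name `LazyDataSketch`, the cache key, and the two hard-coded entries standing in for a file under /priv/data/ are all assumptions made for illustration.

```elixir
defmodule LazyDataSketch do
  @moduledoc """
  Hypothetical sketch: load a data set on first query and cache it in
  `:persistent_term`. Not the real ExCantonese loader.
  """

  # Key under which the loaded data is cached (assumed name).
  @key {__MODULE__, :codepoints}

  @doc "Looks up a codepoint, loading the data set on first use."
  def lookup(codepoint) when is_integer(codepoint) do
    Map.get(data(), codepoint)
  end

  # Returns the cached map, loading and storing it if it is not cached yet.
  defp data do
    case :persistent_term.get(@key, nil) do
      nil ->
        loaded = load()
        :persistent_term.put(@key, loaded)
        loaded

      cached ->
        cached
    end
  end

  # Stand-in for reading and decoding a data file (assumption: the real
  # loader would read from the application's priv directory instead).
  defp load do
    %{0x5EE3 => "gwong2", 0x6771 => "dung1"}
  end
end
```

Calling `LazyDataSketch.lookup(0x5EE3)` triggers the load on the first call and reads from the cache afterwards. Two processes racing on the very first query may both run `load/0`; since they store the same value, that costs a duplicate load rather than an inconsistency. Persistent terms suit this use because reads are cheap and the data is written once.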
There MUST also be modules for specific instances (usually relating to user input), at various levels of granularity:
- jyutping
- characters 字
- words 詞
- phrase
- work
There MUST be modules for working with each input-output format, grouped under ExCantonese.IO.*:
- Cantonese markups
  - dry
  - hydrated
- LaTeX
  - ruby
  - triple-ruby
- Typst[^3]
- HTML
There MAY be work around SVG here, or it MAY be spun out into a separate repo.
[^3] This awaits stabilization of Typst furigana.
Layering
ExCantonese follows a standard Core - Boundary layering.
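As a rough illustration of what such a split can look like, here is a minimal sketch under my own assumptions: the core holds pure data and functions, and the boundary deals with untrusted input and reports errors. The module names `CoreSketch.Syllable` and `BoundarySketch` are hypothetical, and the jyutping handling is deliberately naive (a sound followed by a tone digit 1 to 6).

```elixir
defmodule CoreSketch.Syllable do
  # Core: pure data and functions, no processes, no I/O.
  @enforce_keys [:sound, :tone]
  defstruct [:sound, :tone]

  # Splits e.g. "gwong2" into sound "gwong" and tone 2.
  # Returns :error when the text does not end in a tone digit 1..6
  # or when nothing precedes the digit.
  def parse(text) when is_binary(text) do
    {sound, tone} = String.split_at(text, -1)

    if sound != "" and tone in ~w(1 2 3 4 5 6) do
      {:ok, %__MODULE__{sound: sound, tone: String.to_integer(tone)}}
    else
      :error
    end
  end
end

defmodule BoundarySketch do
  # Boundary: accepts raw user input, delegates to the core,
  # and turns failures into a tagged error listing the bad tokens.
  alias CoreSketch.Syllable

  def parse_line(line) when is_binary(line) do
    results =
      line
      |> String.split()
      |> Enum.map(fn raw -> {raw, Syllable.parse(raw)} end)

    case Enum.filter(results, fn {_raw, result} -> result == :error end) do
      [] -> {:ok, Enum.map(results, fn {_raw, {:ok, syllable}} -> syllable end)}
      bad -> {:error, Enum.map(bad, fn {raw, _result} -> raw end)}
    end
  end
end
```

With this split, `BoundarySketch.parse_line("gwong2 dung1")` returns `{:ok, [...]}` with two syllable structs, while `BoundarySketch.parse_line("gwong2 waa9")` returns `{:error, ["waa9"]}`. The core stays easy to test in isolation because it never touches processes or external state.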
Quality Control
Without the possibility of code reviews, development MUST use:
- credo for good practices
- dialyxir for static type analysis
- doctor for doc/test coverage
- mix format for every check-in. (Noting that I apply special styling around the usage of multiple ->)
We MAY treat warnings as errors, and we MAY enforce type specs on Core.
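One way to wire these checks together is a single Mix alias, sketched below. This is not the project's actual mix.exs: the app name `:quality_sketch`, the `quality` alias name, and the version requirements are assumptions for illustration, so check the current releases before copying them.

```elixir
defmodule QualitySketch.MixProject do
  use Mix.Project

  def project do
    [
      app: :quality_sketch,
      version: "0.1.0",
      elixir: "~> 1.15",
      deps: deps(),
      aliases: aliases()
    ]
  end

  # Development-only tooling; version requirements are illustrative.
  defp deps do
    [
      {:credo, "~> 1.7", only: [:dev, :test], runtime: false},
      {:dialyxir, "~> 1.4", only: [:dev, :test], runtime: false},
      {:doctor, "~> 0.21", only: :dev, runtime: false}
    ]
  end

  # `mix quality` runs every check in order and stops at the first failure.
  defp aliases do
    [
      quality: [
        "format --check-formatted",
        "compile --warnings-as-errors",
        "credo --strict",
        "doctor",
        "dialyzer"
      ]
    ]
  end
end
```

Running `mix quality` before each check-in then stands in for the reviewer who is not there. The `compile --warnings-as-errors` step corresponds to the optional "warnings as errors" policy and can be dropped if that proves too strict.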
For parsing, we SHOULD favor an explicit grammar, parsed using PEGasus, over regular expressions.
Roadmap / Status
As of 2024-02-01, ExCantonese is in a sorry state and needs a teardown and rebuild. Work on the library started before Unicode.Unihan, relied on Rime as a data source, and frankly, I didn’t understand what problem needed to be solved.
The goal is to complete a disciplined rebuild by Apr 2024, and use this as a base to build up web APIs, whose access will be the lagniappe for paying customers.