Cantonese-Jyutping Markup Formats (2024 Q1)

jkwchui · February 2, 2024, 4:49pm

[WIP documentation]
[Chinese version at …]

When you prepare Jyutping-annotated material with v2 (Pokfield), you do not just create a specific document; you have in fact annotated that piece of text with Jyutping in perpetuity. This reference article describes part of that ecosystem, and explains why contributing to the repository requires the (ugly) plain-text form as the mandatory document, while everything else (including what is rendered by Canto Font) is optional.

A little about how the font works

The Canto Font is made up of two parts, the pictures and the rules:

Pictures: each unique reading of a character has two pictures drawn for it (color, and black-and-white for systems that don’t support color fonts)
Rules: describe which picture should be drawn by default, and when a different picture should be substituted in

Pokfield contains about 80,000 pictures and 100,000 rules. These smarts ultimately derive from an in-house Elixir (programming language) library called ExCantonese.

This library, being the parent, can “see” from the plain text 區.au1議員係區議員 what you have prepared or seen rendered in Pages/Word, and understands that the second 區, in that context, is keoi1.

The Dry and Hydrated Formats

Dry format (what you prepare)

The dry format is what you would prepare and submit to the repo. Here are some examples:

普洱|加|檸檬|變|檸茶
區.au1議員|係|區議員
南京市|長江大橋
南京|市長|[江大橋 name] (aspirational)

Preparing the dry format is a two-step procedure:

In your choice of word-processing app, correct the pronunciation using the . dot notation
Copy the text into a text editor (e.g., Visual Studio Code, TextEdit), and mark off word boundaries with a | pipe. (On US keyboards, this key is above the Enter/Return key.)

Word boundaries can be fuzzy in Chinese (長江大橋 or 長江|大橋), and often there isn’t a “right answer”. That’s OK.
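To make the structure of the dry format concrete, here is a minimal Python sketch that splits a line on pipes and picks out any dot-notation overrides. It is not how ExCantonese works internally; the function name `parse_dry` and the rule that an override is lowercase letters followed by a tone digit 1-6 are assumptions made for illustration, and the aspirational [text tag] syntax is not handled.

```python
import re

# Assumption: a character is anything that is not a pipe, dot, or whitespace,
# and an override is ".", lowercase ASCII letters, then a tone digit 1-6.
TOKEN = re.compile(r"(?P<char>[^|.\s])(?:\.(?P<jp>[a-z]+[1-6]))?")


def parse_dry(line):
    """Return a list of words; each word is a list of (char, override) pairs.

    The override is None when the writer accepted the default reading.
    """
    words = []
    for chunk in line.split("|"):
        words.append([(m.group("char"), m.group("jp")) for m in TOKEN.finditer(chunk)])
    return words


for word in parse_dry("區.au1議員|係|區議員"):
    print(word)
```

Only the first 區 carries an explicit reading here; every other character comes back with `None`, which is exactly the information that still has to be resolved by ExCantonese.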

When you copy your | pipe-fenced plain-text back into your editor, the pipes become invisible but create a subtle spacing (formally, each | is replaced with a modified zero-width non-joiner, U+200C), which helps the human reader see the word boundaries.
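At the character level, the pipe and U+200C carry the same information, so the mapping can be reversed. The sketch below only illustrates that equivalence; the function names are hypothetical, and the visible spacing itself comes from how the font draws the character, which this code does not reproduce.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def pipes_to_zwnj(dry):
    """Swap each visible word-boundary pipe for an invisible U+200C."""
    return dry.replace("|", ZWNJ)


def zwnj_to_pipes(text):
    """Recover the pipe-fenced form from text containing U+200C."""
    return text.replace(ZWNJ, "|")


hidden = pipes_to_zwnj("南京市|長江大橋")
print(len(hidden), zwnj_to_pipes(hidden))
```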

The advantage of the Dry format is that it requires very little effort to generate, yet implicitly contains the pronunciation for each character. The major downside, however, is that making this pronunciation explicit requires access to ExCantonese and an Elixir programmer, and Elixir is decidedly a non-mainstream language.

Tag (Aspirational)

I need to see if it is indeed possible to “no show” according to a complex rule. If so, we would have the Dry format with classes/tags, such as

[黃蓉 name]|係|[金庸 name]|武俠小說|[射雕英雄傳 book]|嘅|女主角。

A useful list of tags would be

word
name
book
emph
strike
sup
sub
every other alphanumeric as tag

Or if it really works,

class="xyz"

The downstream formats may then convert these markup to something appropriate for that renderer, e.g.,

<u>黃蓉</u>係<u>金庸</u>武俠小說<book>射雕英雄傳</book>嘅女主角。
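Since the tag syntax is still aspirational, the following Python sketch is only one guess at how a downstream converter might turn it into the HTML shown above. The tag-to-element table, the `span` fallback for unlisted tags, and the function name `to_html` are all assumptions, not part of ExCantonese.

```python
import re

# Assumption: a tagged span is "[", text without spaces or brackets,
# one space, an alphanumeric tag, then "]".
TAGGED = re.compile(r"\[(?P<text>[^\[\]\s]+) (?P<tag>[A-Za-z0-9]+)\]")

# Hypothetical mapping from tags to HTML elements.
HTML_FOR_TAG = {"name": "u", "book": "book"}


def to_html(dry):
    """Render tagged spans as HTML elements and drop the word-boundary pipes."""
    def render(match):
        text, tag = match.group("text"), match.group("tag")
        if tag in HTML_FOR_TAG:
            element = HTML_FOR_TAG[tag]
            return f"<{element}>{text}</{element}>"
        # Any other alphanumeric tag becomes a class on a generic span.
        return f'<span class="{tag}">{text}</span>'

    return TAGGED.sub(render, dry).replace("|", "")


print(to_html("[黃蓉 name]|係|[金庸 name]|武俠小說|[射雕英雄傳 book]|嘅|女主角。"))
```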

Hydrated format

Jon periodically takes the community’s Dry formats in the repository and, by passing the content through ExCantonese, provides a variety of outputs from them. The first is the hydrated markup.

One of the problems of relying on ExCantonese is that the library’s data may change over time. Say the default reading for 洱 is lei2 right now, but someone later argues persuasively for ji5 and we change the default to that. The dry-formatted 普洱|加|檸檬|變|檸茶 will then be interpreted differently from what you intended, and that’s… no good.

The hydrated markup is an archival format that explicitly states the jyutping for every character:

普.pou2洱.lei2|加.gaa1|檸.ning4檬.mung1|變.bin3|檸.ning2茶.caa4
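The drift problem can be illustrated with a toy hydrator in Python. The two lookup tables below are tiny hand-written stand-ins for ExCantonese’s data, with readings copied from the example above; the word-first, character-second lookup order is an assumption for illustration, and dot-notation overrides are not handled.

```python
# Hypothetical stand-ins for the library's data; not the real tables.
WORD_READINGS = {"檸茶": ["ning2", "caa4"]}
CHAR_DEFAULTS = {
    "普": "pou2", "洱": "lei2", "加": "gaa1", "檸": "ning4",
    "檬": "mung1", "變": "bin3", "茶": "caa4",
}


def hydrate(dry, word_readings, char_defaults):
    """Attach a reading to every character: whole-word entries win over per-character defaults."""
    hydrated_words = []
    for word in dry.split("|"):
        readings = word_readings.get(word) or [char_defaults[c] for c in word]
        hydrated_words.append("".join(f"{c}.{jp}" for c, jp in zip(word, readings)))
    return "|".join(hydrated_words)


dry = "普洱|加|檸檬|變|檸茶"
before = hydrate(dry, WORD_READINGS, CHAR_DEFAULTS)
# Simulate the default for 洱 being changed after the text was submitted.
after = hydrate(dry, WORD_READINGS, {**CHAR_DEFAULTS, "洱": "ji5"})
print(before)
print(after)
```

The same dry input yields two different hydrated lines depending on the data at the time of processing, whereas a stored hydrated line is unaffected by later changes.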

This is a machine-oriented format; hard to read for humans, easy to parse in code. [TODO-JC provide PEG grammar as spoiler]

By making the pronunciation self-contained, the hydrated form empowers programmers fluent in other languages to build upon your efforts.
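As a rough illustration of how little is needed to consume the hydrated form outside Elixir, here is a Python sketch that reads a line into words of (character, jyutping) pairs. It uses a regular expression rather than a PEG grammar, and the syllable pattern (lowercase letters plus a tone digit 1-6) and the name `parse_hydrated` are assumptions.

```python
import re

# Assumption: each unit is one character, a dot, lowercase letters, and a tone digit 1-6.
SYLLABLE = re.compile(r"(.)\.([a-z]+[1-6])")


def parse_hydrated(line):
    """Return a list of words, each a list of (char, jyutping) pairs."""
    words = []
    for chunk in line.split("|"):
        pairs = SYLLABLE.findall(chunk)
        # Rebuild the chunk from the matches; any difference means some
        # character lacked a reading or the chunk contained stray text.
        if "".join(f"{c}.{jp}" for c, jp in pairs) != chunk:
            raise ValueError(f"not fully hydrated: {chunk!r}")
        words.append(pairs)
    return words


for word in parse_hydrated("普.pou2洱.lei2|加.gaa1|檸.ning4檬.mung1"):
    print(word)
```

Rejecting partially annotated input keeps the parser strict: a line that passes is guaranteed to carry a reading for every character.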

Other Downstream Markup Formats

The hydrated markup flows downstream into other presentations. ExCantonese could use the information to prepare

LaTeX and Typst (for complex / multilingual parallel print / PDF layouts; graded editions),
HTML (for static webpages / ePub eBooks), and
SVG/PNG (standalone vector / raster images)
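For the HTML case, one plausible target is the standard `<ruby>` / `<rt>` markup. The Python sketch below is an assumption about what such a conversion could look like, not ExCantonese’s actual output; the `word` class name and the function name are hypothetical.

```python
import re
from html import escape

# Assumption: same unit shape as the hydrated format (char, dot, syllable with tone 1-6).
SYLLABLE = re.compile(r"(.)\.([a-z]+[1-6])")


def hydrated_to_ruby(line):
    """Wrap each character in <ruby> with its jyutping in <rt>, one span per word."""
    spans = []
    for chunk in line.split("|"):
        rubies = "".join(
            f"<ruby>{escape(char)}<rt>{jp}</rt></ruby>"
            for char, jp in SYLLABLE.findall(chunk)
        )
        spans.append(f'<span class="word">{rubies}</span>')
    return "".join(spans)


print(hydrated_to_ruby("變.bin3|檸.ning2茶.caa4"))
```

Keeping one span per word preserves the pipe boundaries, so a stylesheet could add the subtle inter-word spacing.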

Interesting future projects may include conversion to audio (jyutping-to-speech) or into HK sign language (text+jyutping to animation).

As of February 2024, there are big holes here and nothing is fully automated. Once the business side has stabilized and the company has extra resources, I hope to build and deploy a web interface that lets you submit your Dry markup and get everything in a zip.