[WIP] [need translation]
This reference article describes at a high level how the Cantonese Font works. The text is written with the early-access testers in mind: reading this overview will give you a better understanding of how to report your findings. Users who are curious or have special needs will also find it interesting.
Terminology
Fonts, for most people, are something that “just works”. When we are working with fonts, however, we need to be more precise about the words we use, so that we are on the same page. Give this section a skim, and come back to it as a reference when needed.
- font refers to a complete, standalone file that the user installs. These files end with a .ttf (TrueType) extension. For the Cantonese Font, each of these is a particular combination of jyutping size, font-face, and whether it is fixed width (monospaced) or fluid width (where every character occupies only as much space as the jyutping or character demands).
- font collection, or collection, is a set of fonts which share the same functionality but look visually different. Three sans-serif fonts with different jyutping sizes, all sharing the version number 2.1.12, would be considered a collection.
- character is a linguistic concept, referring to the idea of an ideograph. “蛋” is a character. Characters usually are assigned a unique hexadecimal (base-16) number by the Unicode Consortium; this is called the codepoint. The codepoint for 蛋, for example, is 86CB, which can be looked up at various online sources (like Compart).
- glyph is a specific drawn shape of a character. What is notable in the Cantonese Font is that the glyph contains shape information of both the character and the Jyutping. The following picture shows three characters, one of which is 蛋; it has five glyphs, since there are five “pictures”.
- SVG stands for Scalable Vector Graphics, and is an image format that describes visual appearance as mathematical curves (instead of, say, a PNG/JPEG, which describes appearance pixel by pixel). We prepare a glyph by first drawing it as an SVG.
- (black-and-white / monochrome) fall-back. Not every font renderer supports OpenType-SVG (the standard we use). In particular, Microsoft champions its own color font format. To ensure some graceful degradation, each glyph is supplied with both a colored (SVG) image and a “colorless” format. We call the latter the fall-back (or path instruction).
- word / compound is a linguistic concept, referring to a combination of characters. “蛋糕” is a word. In Chinese it is not always clear what counts as a word: is “分工合作” one, two, three, or four words? Especially when it is ambiguous, I tend to favor compound over word.
- (word) segmentation is the procedure by which a sentence is split into words. For CJK (Chinese-Japanese-Korean) languages, which do not use spaces to mark off word boundaries, this is quite important and affects the semantics (meaning) and pronunciation (sound). As an example, 學生會好慘 can be alternately segmented as 學生會|好慘 (student union suffers) or 學生|會|好慘 (students will suffer), and 會 would accordingly carry a different sound of wui2 or wui3.
- features / rules are programmatic instructions that are embedded inside a font. They originated to resolve clashing Latin character combinations, such as f and l, by specifying a rule to replace the combination fl with a new combined (ligature; “stuck together”) glyph. I abuse these in the Cantonese Fonts to make them look “smart”.
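The codepoint quoted in the character entry above can be checked without an online lookup. Here is a minimal Python sketch using only the standard library. The helper name codepoint_hex is made up for illustration, and the uni prefix simply mirrors the glyph names that appear in the rules quoted in this article; it is not a statement about how the font is actually built.

```python
import unicodedata

def codepoint_hex(char: str) -> str:
    """Return the Unicode codepoint of a single character as uppercase hex."""
    return format(ord(char), "04X")

egg = "蛋"
print(codepoint_hex(egg))          # 86CB
print(unicodedata.name(egg))       # the official Unicode name of the character
print("uni" + codepoint_hex(egg))  # glyph-name style seen in this article's rules
```

This is handy when reporting a finding: quoting the codepoint removes any doubt about which character is meant.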
What’s Inside a Font
A font comprises glyphs and features.
Glyphs
Every glyph represents one particular character together with one of its jyutping readings. Continuing with 蛋 as our example, the character has three glyphs (蛋 can be pronounced in three ways in Cantonese), and each glyph has a color SVG and a monochrome fall-back.
The Cantonese Font contains 29,146 CJK characters, drawn as 39,419 SVG glyphs plus 39,419 fall-backs. It has an additional 1,000 standard Latin glyphs, and another 10,850 word/compound phantom glyphs. We’ll get to the phantom glyphs with the Features.
Features
Features can be thought of as programming instructions embedded into fonts. They look like
sub f l by fl;
which means “substitute consecutive f and l with a glyph named fl”.
One of the rules associated with 蛋, for example, is
sub uni86CB period d a a n two by uni86CB.daan2;
which means that when someone types 蛋 and then .daan2, the font should replace all seven characters with a glyph named uni86CB.daan2.
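To make the pattern of such a rule concrete, here is a small Python sketch that rebuilds the rule text from a character and a jyutping. It is an illustration only, not the actual build script: the function name spelling_rule is hypothetical, and the digit names other than two are an assumption extrapolated from the period and two glyph names visible in the rule above.

```python
# Glyph names for the six jyutping tone digits. Only "two" appears in the
# article; the others are assumed to follow the same spelled-out convention.
DIGIT_NAMES = {"1": "one", "2": "two", "3": "three",
               "4": "four", "5": "five", "6": "six"}

def spelling_rule(char: str, jyutping: str) -> str:
    """Build a rule that replaces <char>.<jyutping> with the matching glyph."""
    base = f"uni{ord(char):04X}"
    # Latin letters are named after themselves; tone digits are spelled out.
    typed = [DIGIT_NAMES.get(c, c) for c in jyutping]
    sequence = " ".join([base, "period", *typed])
    return f"sub {sequence} by {base}.{jyutping};"

print(spelling_rule("蛋", "daan2"))
# sub uni86CB period d a a n two by uni86CB.daan2;
```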
A group of features/rules is called a lookup, and lookups are applied to the input one after another.
The Cantonese Font uses features in four ways:
1. Spelling (type for alternate pronunciation) features
The rule shown above,
sub uni86CB period d a a n two by uni86CB.daan2;
enables selection of a particular pronunciation. 蛋 has three of these to let the user choose the pronunciation; note that one can coerce the default jyutping (daan6) by typing 蛋.daan6; this is what enables the Hydrated Cantonese markup format to “fix” a pronunciation.
There are thus 39,419 spelling rules, one for each glyph, collected into 40 lookups of up to 1,000 rules each. Spelling rules are applied before any other rules.
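The lookup count follows from the rule count: cutting 39,419 rules into chunks of at most 1,000 gives 40 chunks, the last of which is only partly full. A quick Python sketch of that arithmetic, where split_into_lookups is a hypothetical helper and the rule strings are placeholders rather than real rules:

```python
def split_into_lookups(rules, size=1000):
    """Cut a flat list of rules into consecutive chunks of at most `size`."""
    return [rules[i:i + size] for i in range(0, len(rules), size)]

# Placeholder names standing in for the 39,419 spelling rules.
rules = [f"rule{i}" for i in range(39_419)]
lookups = split_into_lookups(rules)

print(len(lookups))      # 40
print(len(lookups[-1]))  # 419
```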
2. Word / compound features
Word features are used to provide a correct reading for words that use a non-default sound for any character. For example, the default sound for 行 is hang4; in the context of 銀行, we want to keep 銀 with the default sound, but 行 ought to be replaced with 行.hong4. We write a pair of rules to do this substitution.
One of the primary tuning tasks in the testing period is to identify more compounds that a typical user may use. For example, the default of hang4 for 行 means that 行緊, 行入, 行去 (none of which you would find in a dictionary as a “word”) would all show the incorrect hang4.
The solution here is case-by-case; we can (try to) enumerate all the use cases and create word features for them, or, in this case, the better solution is probably to change the default reading to haang4. In setting out a testing period, my expectation is that we will get the default jyutpings set in stone by public release.
There are somewhere around 10,000 pairs of word features. This changes, and is expected to change, with each version.
3. Selection features
Some applications can pop up a panel that shows all the alternate glyphs for a selected character. How does the application know which glyphs are related? That is specified with the selection rules. Like spelling features, there are 39,419 rules, one for each glyph.
4. Cultural features
The “English-to-Chinese” dictionary, maps, idioms, etc. are really just a bunch of word features. We split them out here because functionally they do something different.
About “Phantom Glyphs”
Fonts are very grumpy, and accept only certain rules. In particular, they do not allow rules that substitute several characters with several characters (many-to-many); the following is forbidden:
sub c a t by d o g;
To get around this, I create an empty glyph, say, ghost, and do two substitutions:
sub c a t by ghost;
sub ghost by d o g;
The first is many-to-one, and the second is one-to-many, so this is allowed. ghost here is only used as a transient placeholder; I call such glyphs Phantom Glyphs. One of them is created for each word and each culture rule.
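Applied to the 銀行 example from the word features, the two-step trick might be generated along these lines. This is a sketch under assumptions: the helper names and the phantom glyph name phantom.word0001 are invented for illustration, and the real font may name and organize its phantom glyphs differently.

```python
def glyph_name(char, reading=None):
    """Default glyph name, or the reading-specific variant if one is given."""
    name = f"uni{ord(char):04X}"
    return f"{name}.{reading}" if reading else name

def word_rules(word, readings, phantom):
    """Return the many-to-one and one-to-many rules for one compound.

    `readings` has one entry per character; None keeps the default glyph.
    """
    source = " ".join(glyph_name(c) for c in word)
    target = " ".join(glyph_name(c, r) for c, r in zip(word, readings))
    return [f"sub {source} by {phantom};", f"sub {phantom} by {target};"]

# 銀 keeps its default sound; 行 is forced to hong4.
rules = word_rules("銀行", [None, "hong4"], "phantom.word0001")
for rule in rules:
    print(rule)
```

The first printed rule collapses the two characters into the phantom glyph, and the second expands the phantom glyph into the default 銀 followed by the hong4 variant of 行.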
When the User Types
Let’s say you’re using Word. You choose a Canto Font and merrily type away (your input; let’s say it’s 我今日咪同哥哥去咗銀行). What happens then?
Word gets first shot at the input, and it decides where to chop the string into lines. Let’s say your font-size is large, and each line can contain only about 6 CJK characters; the sentence would then split over two lines:
我今日咪同哥
哥去咗銀行
Canto Font then gets dibs on each line, and applies the rules in order:
1. spelling rules
2. selection rules
3. word rules
4. culture rules
In this case, there are no .jyutping annotations and no special glyph chosen, so (1) and (2) don’t apply. You may be surprised that 哥哥 is rendered as go1 go1, when everywhere else it is rendered as go4 go1; the reason is that Word decides on the line break, and with 哥哥 split up as 哥 (new line) 哥, the word rule for 哥哥 is not applied.
銀行, on the other hand, falls on the same line, and the word rule for 銀行 is applied to give the correct reading.
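The effect of the line break can be imitated in a few lines of Python. This is only a toy model of the behavior described above, not of any real text engine: the hypothetical words_matched helper searches each line separately, the way the word rules only ever see one line at a time.

```python
WORDS = ["哥哥", "銀行"]  # the two compounds in the example sentence

def words_matched(lines, words=WORDS):
    """Return the compounds found when each line is searched on its own."""
    return [w for w in words if any(w in line for line in lines)]

sentence = "我今日咪同哥哥去咗銀行"
wrapped = [sentence[:6], sentence[6:]]  # the application wraps after 6 characters

print(wrapped)                     # ['我今日咪同哥', '哥去咗銀行']
print(words_matched([sentence]))   # ['哥哥', '銀行']
print(words_matched(wrapped))      # ['銀行']
```

Once the sentence is wrapped, 哥哥 straddles the break and no longer appears inside any single line, so only 銀行 is found.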
Oh no! 咪 is displayed as mai1 when you wanted it to be mai6. English speakers naively expect that the written script drives the meaning, but in Cantonese it is the combination of the written script and the sound that drives the meaning. 咪 is one of the characters where we need to override the sound, so we change our input to 我今日咪.mai6同哥哥去咗銀行.
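If you want to check which characters in a piece of text carry a forced reading, the character.jyutping notation is easy to pick apart. The Python sketch below is a reading aid only: the font itself does this with substitution rules, not with a regular expression, and the pattern assumes lowercase letters followed by a tone digit from 1 to 6. The names TOKEN and parse_annotations are made up for this example.

```python
import re

# One character, optionally followed by ".<letters><tone 1-6>".
TOKEN = re.compile(r"(.)(?:\.([a-z]+[1-6]))?")

def parse_annotations(text):
    """Return (character, forced reading or None) pairs."""
    return [(m.group(1), m.group(2)) for m in TOKEN.finditer(text)]

pairs = parse_annotations("我今日咪.mai6同哥哥去咗銀行")
forced = [(char, reading) for char, reading in pairs if reading]

print(len(pairs))  # 11
print(forced)      # [('咪', 'mai6')]
```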
And now things get messy and unpredictable. There are two ways that the application can split our input (a Latin glyph is about half the width of a CJK glyph):
我今日咪.mai
6同哥哥去咗銀行
or
我今日咪.mai6同哥
哥去咗銀行
Which is it?
The answer is… we don’t know. Applications do not just break the lines; some of them talk to the font and allow certain substitutions to happen (e.g., replacing the long 咪.mai6 with a single glyph, thereby reducing the total length). Each application handles this differently and we do not have control here. Sorry.
(I have carefully crafted the worst possible case; in reality this happens rarely, but when it does happen it is very confusing.)
What is saved
You pasted in your text, and corrected individual readings; it looks nice and pretty. You hit Save. What happens next?
Most applications do not store the font within the file you saved. They save the text and formatting. Font choice, to the application, is just another formatting attribute; the next time you open the file, Word sees that some text should be rendered with Cantonese Font v2, and it asks your operating system to provide it. This way it avoids duplicating Times New Roman once for each Word document you create.
However, if you pass this file to other people and they don’t have the Cantonese Font installed, their Word will substitute whatever generic CJK font it has (such as 新細明體). Until they install Canto Font too, they will not see the jyutping.
(There is a fun hack here for Windows users who really, really like the colors. You prepare your .doc document, and send the file to a Mac/Word machine. They see colors and can “fix” it by printing to PDF for you.)
Fixing the output - PDF
PDF is a unique file format that does put the glyphs that are used into the file. (This is why you can read PDFs that contain fonts you don’t have on your machine.) If you want other people to see what you see, you should Print to PDF and send the PDF.
Really fixing the output - outlined PDF
If you greatly care about other people seeing exactly what you see, you can take one more step. In Adobe Acrobat (not Acrobat Reader; you need the expensive subscription Acrobat), the Print Production → Preflight → Outline Fonts function will convert the glyphs inside the PDF into vector images. The file size will bloat, but now you have an absolute guarantee about the fidelity.
Printing caveats
Colors in SVG are specified as RGB colors (for light-emitting screens), whereas printing uses CMYK ink to subtract light. Expect the printed colors to be somewhat duller than what you see on the screen.
If you are doing professional print production, our in-house processes give finer control over every aspect, and you can consider hiring our expertise.