Final request for feedback

David Newall

2022-02-20 03:11:37 UTC

Hi All,

I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
and opinions for what should be the last time.

What's different about what I'm finally intending to publish:

1. I'm using a dictionary for the UNICODE encoding map, instead of a
sparse array. This isn't because it's faster (it's 3ns slower, which seems
quite acceptable) or smaller (a dictionary is over double the size for
GNU's UnifontMedium). I'm doing this because it's two fewer files to
publish: I don't need to publish sparseget and I don't need to publish
an AWK script to convert Fontforge .g2n files into a sparse array.

2. I've replaced utf8show with utf8decode (which generates an array of
UNICODE values) and unicodeshow.

3. I'm not storing the map in the font, but passing it as a parameter to
unicodeshow because I think it's simpler. Storing it in the font means
defining a new font (definefont).
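
A minimal sketch of the dictionary lookup described in point 1, under
stated assumptions: demomap and lookupglyph are hypothetical names, and the
three entries merely stand in for a full map keyed by code point. This is
not the code being published.

```postscript
%!PS
% Hypothetical three-entry map from code point to glyph name.
% A real map (e.g. one built from the Adobe Glyph List) would be far larger.
/demomap <<
  16#201C /quotedblleft
  16#201D /quotedblright
  16#263A /smileface
>> def

% Look up a code point; fall back to /.notdef when the map has no entry.
%: codepoint map -- glyphname
/lookupglyph {
  exch 2 copy known { get } { pop pop /.notdef } ifelse
} def

16#263A demomap lookupglyph ==   % a mapped code point
16#0041 demomap lookupglyph ==   % an unmapped code point
```

With a current font that actually contains the glyph, the resulting name
could be handed to the Level 2 glyphshow operator to paint it.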

These are the alternative programs for printing a UTF-8 string.

This is what I think I'll publish:

%!PS
%%IncludeResource: procset unicodeshow
%%IncludeResource: procset utf8decode
/Helvetica 20 selectfont
100 300 moveto
(Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
ReverseAdobeGlyphList exch unicodeshow
showpage
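
For readers who want a feel for the decode step in the program above, here
is a rough sketch of a string-to-array decoder. It is an assumption about
what such a procedure might look like, not the utf8decode being published;
the name utf8sketch is hypothetical, and it does no validation (stray
continuation bytes, overlong forms and truncated input are not handled).

```postscript
%!PS
% Hypothetical, non-validating UTF-8 decoder sketch.
%: string -- array-of-codepoints
/utf8sketch {
  4 dict begin
  /pending 0 def                 % continuation bytes still expected
  /acc 0 def                     % code point being assembled
  [ exch {
    /b exch def
    b 16#C0 and 16#80 eq {
      % 10xxxxxx: continuation byte, append its low six bits
      /acc acc 6 bitshift b 16#3F and or def
      /pending pending 1 sub def
      pending 0 eq { acc } if    % sequence complete: leave code point on stack
    } {
      b 16#80 lt { b } {         % 0xxxxxxx: ASCII, emit directly
        b 16#F0 ge { /acc b 16#07 and def /pending 3 def } {
          b 16#E0 ge { /acc b 16#0F and def /pending 2 def }
                     { /acc b 16#1F and def /pending 1 def } ifelse
        } ifelse
      } ifelse
    } ifelse
  } forall ]
  end
} def

(A\342\230\272) utf8sketch ==
```

The last line decodes the letter A followed by the three-byte sequence for
U+263A and prints the resulting array.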

This is what I was previously intending, using a dictionary:

%!PS
%%IncludeResource: procset unicodefont
%%IncludeResource: procset unicodeshow
%%IncludeResource: procset utf8decode
/Helvetica findfont 20 scalefont ReverseAdobeGlyphList unicodefont
/MyFont exch definefont setfont
100 300 moveto
(Welcome to \342\200\234UTF-8\342\200\235 \342\230\272) utf8decode
unicodeshow
showpage

There's one extra line if using a sparse array instead of a dictionary:

%%IncludeResource: procset sparseget

I think the first is better but am open to opposing opinions.

Thanks,

David

luser droog

2022-02-22 15:36:22 UTC

Post by David Newall
Hi All,
I'm about to publish my UTF-8 code. Before I do I'm asking for feedback
and opinions for what should be the last time.

That looks really good to me. I'm a little sad that definefont is out,
but it really doesn't appear to offer very much. It seems like PostScript
*almost* has the pieces available to put this together seamlessly.
But the conversion probably can't use a filtered file because of the need
to convert from a string to an array. And packing the glyph selection
into a composite font would be a ton of work if it's even possible.

Carlos

2022-02-26 00:44:36 UTC

On Tue, 22 Feb 2022 07:36:22 -0800 (PST)

[...] And packing
the glyph selection into a composite font would be a ton of work if
it's even possible.

It is possible to create a tree of composite fonts, where each byte in
a UTF-8 sequence dispatches to the next font, and the last one picks
the glyph. The problems with this approach are 1. the complexity of
creating and populating the font tree, and 2. the fact that
the base fonts at the leaves can only encode 64 glyphs each (since
that's how many values the last byte in a multibyte UTF-8 sequence can
hold), and not even at the beginning of the /Encoding array, which is a
waste.

A simpler approach is to reencode the UTF-8 string to a made-up UTF-24
encoding (3 bytes per codepoint), and then use a simple chain of 8/8
mapping (FMapType 2) composite fonts. Here the first byte selects the Unicode
plane (sections of 65536 codepoints; only 4 or 5 are assigned), the
second byte the segment of 256 codepoints in that plane, and the third
one the glyph inside that segment.
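
As a small worked example of that three-byte split, the sketch below takes
a code point and leaves the plane, segment and in-segment index on the
stack. The name utf24split is hypothetical and used only for illustration;
it mirrors the arithmetic described above rather than any published code.

```postscript
%!PS
% Split a code point into the three bytes of the made-up UTF-24 form.
%: codepoint -- plane segment index
/utf24split {
  dup 65536 idiv exch             % plane = codepoint div 65536
  dup 65536 mod 256 idiv exch     % segment = (codepoint mod 65536) div 256
  256 mod                         % index = codepoint mod 256
} def

16#263A utf24split                % U+263A: plane 0, segment 16#26, index 16#3A
3 array astore ==
```

For U+263A the three values are 0, 38 and 58, so the reencoded string would
hold the bytes 00 26 3A for that character.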

While in theory this needs 1 comp. font to choose the plane + 256 comp.
fonts (1 for each plane) + 256x256 base fonts = 65793 fonts, the
majority of them are just the same empty font.

Below is an example of this approach. You get a unicode font by calling
"unicodize" on a font with CharStrings, and you reencode UTF-8 strings
with the "u" operator:

/Courier-Unicode /Courier findfont unicodize 12 scalefont setfont
(oh là là)u show

It uses the AdobeGlyphList for now -- maybe David will come up with
something better.

The code probably has some bugs. I only tested it with Emacs' "Hello"
demo:

%!PS

/f /Arial findfont def
/uf /UFont f unicodize def

uf 14 scalefont setfont
700
[
( Europe: ¡Hola!, Grüß Gott, Hyvää päivää, Tere õhtust, Bonġu)
( Cześć!, Dobrý den, Здравствуйте!, Γειά σας, გამარჯობა)
( Africa: ሠላም)
( Middle/Near East: שָׁלוֹם, السّلام عليكم)
( South Asia: નમસ્તે, नमस्ते, ನಮಸ್ಕಾರ, നമസ്കാരം, ଶୁଣିବେ,)
( ආයුබෝවන්, வணக்கம், నమస్కారం, བཀྲ་ཤིས་བདེ་ལེགས༎)
( South East Asia: ជំរាបសួរ, ສະບາຍດີ, မင်္ဂလာပါ, สวัสดีครับ, Chào bạn)
( East Asia: 你好, 早晨, こんにちは, 안녕하세요)
( Misc: Eĥoŝanĝo ĉiuĵaŭde, ⠓⠑⠇⠇⠕, ∀ p ∈ world • hello p □)
( CJK variety: GB(元气,开发), BIG5(元氣,開發), JIS(元気,開発), KSC(元氣,開發))
( Unicode charset: Eĥoŝanĝo ĉiuĵaŭde, Γειά σας, שלום, Здравствуйте!)
] {
1 index 20 exch moveto
u show
30 sub
} forall

pop
showpage

Here's the code. Our old friend the iterator makes an appearance :)

%!PS

%% create a composite font suitable for strings with UTF-24 encoding
%: key originalfont -- newfont
/unicodize {
40 dict begin
/ofont exch def
/key exch def
/fname key dup length string cvs def
/basefonts 10 dict def
/planefonts 10 dict def
%: string string -- name
/newname {
/s2 exch def /s1 exch def
/s s1 length s2 length add 1 add string def
s 0 s1 putinterval
s s1 length (-) putinterval
s s1 length 1 add s2 putinterval
s cvn
} def
%: int -- string
/tohex { 16 10 string cvrs } def
%: array element -- newarray
/append { /e exch def [ exch aload pop e ] } def
%: suffix -- font
/newbasefont {
/suffix exch def
/name fname suffix newname def
ofont dup length dict copy
dup /Encoding [ 256 { /.notdef } repeat ] put
dup /FontName name put
dup basefonts exch name exch put
} def
/emptybasefont (Base-E) newbasefont def
%: suffix -- font
/newplanefont {
/suffix exch def
/name fname suffix newname def
<< /FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FontName name
/FMapType 2
/Encoding [ 256 { 0 } repeat ]
/FDepVector [ emptybasefont ]
>>
dup planefonts exch name exch put
} def
/emptyplanefont (Plane-E) newplanefont def
/mainfont << /FontType 0
/FontMatrix [ 1 0 0 1 0 0 ]
/FontName fname
/FMapType 2
/Encoding [ 256 { 0 } repeat ]
/FDepVector [ emptyplanefont ]
>>
def

%: font subfont code --
/addsubfont {
/c exch def /sf exch def /f exch def
f /FDepVector 2 copy get sf append put
f /Encoding get c f /FDepVector get length 1 sub put
} def
%: glyphname code --
/putglyph {
dup /plane exch 65536 idiv def
dup /range exch 65536 mod 256 idiv def
/code exch 256 mod def
/glyph exch def
/idx mainfont /Encoding get plane get def
idx 0 eq {
plane tohex newplanefont
dup mainfont exch plane addsubfont
} {
mainfont /FDepVector get idx get
} ifelse
/planefont exch def
/idx planefont /Encoding get range get def
idx 0 eq {
plane 256 mul range add tohex newbasefont
dup planefont exch range addsubfont
} {
planefont /FDepVector get idx get
} ifelse
/basefont exch def
basefont /Encoding get code glyph put
} def
%: glyphname -- code true | false
/getcode {
/g exch def
AdobeGlyphList g known {
AdobeGlyphList g get true
} {
/s g g length string cvs def
s length 7 eq {
s 0 3 getinterval (uni) eq {
s 7 string copy dup 0 (16#) putinterval
{ cvi } stopped { pop false } { true } ifelse
} {
s 0 1 getinterval (u) eq {
9 string dup 3 s 1 6 getinterval putinterval
dup 0 (16#) putinterval
{ cvi } stopped { pop false } { true } ifelse
} { false } ifelse
} ifelse
} { false } ifelse
} ifelse
} def
% fill the fonts...
ofont /CharStrings get { pop dup getcode { putglyph } { pop } ifelse } forall
% register them...
basefonts { definefont pop } forall
planefonts { definefont pop } forall
% register & return main font
key mainfont definefont
end
} bind def

%: string|array -- iterator ( -- nextchar true | false )
/sequenceiterator {
2 dict begin
/s exch def
/counter [ 0 ] def
[ counter 0 /get cvx s length /lt cvx [
s counter 0 /get cvx /get cvx true
counter 0 2 /copy cvx /get cvx 1 /add cvx /put cvx
] cvx [
false
] cvx /ifelse cvx
] cvx
end
} bind def

%% reencode UTF-8 to UTF-24
%: string -- string
/u {
3 dict begin
/src exch def
/nextch src sequenceiterator def
% count UTF-8 sequence starts
0 src { dup 128 lt exch 2#11000000 and 2#11000000 eq or
{ 1 } { 0 } ifelse add } forall
3 mul string /dest exch def
0 {
% decode sequence
nextch not { exit } if
dup 128 lt {
0 % 0xxxxxxx - 0 following bytes
} {
dup dup 2#11000000 ge exch 2#11011111 le and {
2#00011111 and 1 % 110xxxxx - 1 following byte
} {
dup dup 2#11100000 ge exch 2#11101111 le and {
2#00001111 and 2 % 1110xxxx - 2 following bytes
} {
dup dup 2#11110000 ge exch 2#11110111 le and {
2#00000111 and 3 % 11110xxx - 3 following bytes
} {
pop 0 0 % invalid sequence
} ifelse
} ifelse
} ifelse
} ifelse
{ 6 bitshift nextch pop 2#00111111 and add } repeat
% stack: index-to-dest, codepoint
2 copy 65536 idiv dest 3 1 roll put
exch 1 add exch 2 copy 65536 mod 256 idiv dest 3 1 roll put
exch 1 add exch 2 copy 256 mod dest 3 1 roll put pop
1 add
} loop
pop
dest
end
} bind def

David Newall

2022-02-28 10:39:30 UTC

Hi Carlos,

Post by Carlos
A simpler approach is to reencode the UTF-8 string

What an elegant decoder; and I like the iterator with its clever use of
an array.
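
The array trick can be shown in isolation. The sketch below is a
hypothetical makecounter (not part of the posted code) that builds a
procedure around a one-element array; because the array object itself is
embedded in the procedure body, its contents survive between calls and act
as private mutable state, much like the counter in sequenceiterator.

```postscript
%!PS
% Build a procedure that returns 0, 1, 2, ... on successive calls.
%: -- proc
/makecounter {
  1 dict begin
  /cell [ 0 ] def                 % one-element array holding the state
  [ cell 0 /get cvx               % push the current value
    cell 0 2 /copy cvx /get cvx 1 /add cvx /put cvx   % increment in place
  ] cvx
  end
} def

/c makecounter def
c == c == c ==                    % successive values from one counter
```

Each call to makecounter allocates a fresh array, so separate counters do
not interfere with one another.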

Invalid sequences should produce U+FFFD. Add:

/unget {
load 0 get dup 0 get dup 0 gt
{ 1 sub 0 exch put } { pop pop } ifelse
} def

and then only two changes:

pop 16#FFFD 0 % invalid sequence

and

6 bitshift nextch not { pop 16#FFFD exit } if
dup 2#11000000 and 2#10000000 ne
{ /nextch unget pop pop 16#FFFD exit } if
2#00111111 and add

It still accepts overlong sequences but gives output consistent with the
input.

Regards,

David

David Newall

2022-02-28 11:26:53 UTC

Hi Carlos,

Post by Carlos
It is possible to create a tree of composite fonts, where each byte in
a UTF-8 sequence dispatches to the next font, and the last one picks
the glyph.

Thank you for the clearest example of composite fonts that I've ever
seen. Unfortunately, composite fonts make cshow much less useful (only the
last byte of each character is pushed on the stack) and they don't work at
all with kshow.

It's an intriguing idea but I'm not sure where to go with it.

What I'm currently working on fails when exceeding 64K glyphs (Adobe
PostScript array and dictionary implementation limits). A composite
font gets past that limit, but not when simply transforming a standard
font into a composite font (the CharStrings limit still applies).

Regards,

David

Carlos

2022-02-26 00:56:15 UTC

On Sun, 20 Feb 2022 14:11:37 +1100
David Newall <***@davidnewall.com> wrote:
[...]

Post by David Newall
3. I'm not storing the map in the font, but passing it as a parameter
to unicodeshow because I think it's simpler. Storing it in the font
means defining a new font (definefont).

I think the map problem (how to get a good map, since the AdobeGlyphList
is insufficient) is the key. The interface and/or implementation IMO
is not so important (I posted an alternative implementation in another
message, but it's still limited to the meager 4K+ glyphs in the Adobe
list plus whatever extra /uniXXXX the font has...).

C.