From: Robert Haas <robertmhaas@gmail.com>
To: Jeroen Vermeulen <jtvjtv@gmail.com>
Cc: VASUKI M <vasukianand0119@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Tue, 16 Dec 2025 09:26:12 -0500
Message-ID: <CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com>
In-Reply-To: <CA+zULE47EXZOp7qKYODd+mjSgDiR-WX5ZNBkwdKnj-Zc0FT58w@mail.gmail.com>
References: <19354-eefe6d8b3e84f9f2@postgresql.org>
<CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
<CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
<2292889.1765846569@sss.pgh.pa.us>
<CAE2r8H5vaSyaC_t1FcpHBo-BB_=SrFj7GFnOC-SxC6WDf5c9VA@mail.gmail.com>
<CA+zULE47EXZOp7qKYODd+mjSgDiR-WX5ZNBkwdKnj-Zc0FT58w@mail.gmail.com>
On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen <jtvjtv@gmail.com> wrote:
> My one worry is perhaps Johab is on the list because one important user needed it.
>
> But even then that requirement may have gone away?
Well, that was over 20 years ago. There's a very good chance that even
if somebody was using JOHAB back then, they're not still using it now.
What's mystifying to me is that, presumably, somebody had a reason at
the time for thinking that this was correct. I know that our quality
standards were a whole lot looser back then, but I still don't quite
understand why someone would have spent time and effort writing code
based on a purely fictitious encoding scheme. So I went looking for
where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
from a file JOHAB.TXT, of which the latest version seems to be found
here:
https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
regenerates the current mapping files. Playing with it a bit:
rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR: invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
convert_from
--------------
괦
(1 row)
So, \x8a5c is the original example, which does appear in JOHAB.TXT,
and \x8444 is the first multi-byte character in that file, and both of
them fail. But 89ef, which also appears in that file, doesn't fail,
and from what I can tell the mapping is correct. So apparently we've
got the "right" mappings, but you can only actually use the ones that
match the code's rules for something to be a valid multi-byte
character, which aren't actually in sync with the mapping table. I'm
left with the conclusions that (1) nobody ever actually tried using
this encoding for anything real until 3 days ago and (2) we don't have
any testing infrastructure that verifies that the characters in the
mapping tables are actually accepted by pg_verifymbstr(). I wonder how
many other encodings we have that don't actually work?
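
To make the missing check in (2) concrete, here is a rough sketch of what
such a cross-check could look like: walk the codes in a mapping file and
flag the ones a validity rule would reject. Everything in it is an
assumption for illustration. The function names are hypothetical, the
"code, Unicode value, # name" column layout is assumed, and the rule
itself (trailing byte must be in 0xA1..0xFE) is only a stand-in that
happens to be consistent with the three results above. It is not the
actual pg_verifymbstr() logic.

```python
from typing import Iterable, Iterator, List


def accepted_by_assumed_rule(code: int) -> bool:
    """Hypothetical stand-in for the server-side validity check.

    ASSUMPTION: a two-byte code is accepted only if its trailing byte is
    in 0xA1..0xFE. This matches the three psql results in this message,
    but it is not taken from the PostgreSQL source.
    """
    if code <= 0xFF:
        # Single byte: accept ASCII; this sketch rejects a lone high byte.
        return code < 0x80
    trail = code & 0xFF
    return 0xA1 <= trail <= 0xFE


def mapped_codes(lines: Iterable[str]) -> Iterator[int]:
    """Yield the first-column code from each mapping line.

    ASSUMPTION: lines look like '0x8444<TAB>0x....<TAB># name', and
    anything after '#' is a comment.
    """
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        yield int(data.split()[0], 16)


def rejected_codes(codes: Iterable[int]) -> List[int]:
    """Return the mapped codes that the assumed rule would refuse."""
    return [c for c in codes if not accepted_by_assumed_rule(c)]


if __name__ == "__main__":
    # The three byte sequences tried in the psql session above.
    for code in (0x8A5C, 0x8444, 0x89EF):
        verdict = "accepted" if accepted_by_assumed_rule(code) else "rejected"
        print(f"0x{code:04X}: {verdict}")
```

To run it against a real file, something like
rejected_codes(mapped_codes(open("JOHAB.TXT"))) would list every mapped
code the assumed rule refuses. The same loop could in principle be
pointed at any other encoding's mapping file and validity rule.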
--
Robert Haas
EDB: http://www.enterprisedb.com