Re: BUG #19354: JOHAB rejects valid byte sequences

From: Jeroen Vermeulen <jtvjtv@gmail.com>
To: Robert Haas <robertmhaas@gmail.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Tue, 16 Dec 2025 01:07:12 +0100
Message-ID: <CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
In-Reply-To: <CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
References: <19354-eefe6d8b3e84f9f2@postgresql.org>
	<CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>

Hi Robert.  Thanks for following up.

The original author of the support code in libpqxx also noted that there
was a discrepancy.  Python does accept these 2-byte sequences, and decodes
them to Hangul characters.

The way I read the Wikipedia section, Johab differs from the EUC encodings
in that it adds characters whose second byte falls in the ASCII range.
I guess that was needed to support Chinese characters in addition to
Hangul.  Unit-testing for the embedded-backslash hazard was what led me to
find the problem.
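
To illustrate the embedded-backslash hazard, here is a small Python sketch.
It is not the libpqxx test itself; the variable names are made up for
illustration, and it relies on CPython's built-in "johab" codec, the same
one used in the bug report.

```python
# Sketch only: shows that a valid JOHAB character can carry 0x5c (the ASCII
# backslash) as its second byte. Variable names are illustrative.
raw = b"\x8a\x5c"  # byte sequence from the bug report

# CPython's johab codec decodes the pair to a single Hangul character.
decoded = raw.decode("johab")

# The trail byte has the same value as an ASCII backslash.
trail_is_backslash = raw[1:2] == b"\\"

# A byte-oriented scanner that ignores the encoding would count a backslash
# here, even though the text contains none.
naive_backslash_count = raw.count(b"\\")
real_backslash_count = decoded.count("\\")

print(len(decoded), trail_is_backslash, naive_backslash_count, real_backslash_count)
```

Any code that scans JOHAB text byte by byte for escape characters would
need to skip trail bytes like this one.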

This bit worries me: "Other, vendor-defined, Johab variants also exist",
such as an EBCDIC-based one and a stateful one!


Jeroen

On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > Calling libpq, connecting to a UTF8 database and successfully setting
> > client encoding to JOHAB, this statement:
> >
> >     PQexec(connection, "SELECT '\x8a\x5c'");
> >
> > Returned an empty result with this error message:
> >
> >     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
> >
> > AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character
> > "굎".
> > Easily verified in Python:
> >
> >     print(b'\x8a\x5c'.decode('johab'))
> >
> > It's the same story for some other valid sequences I tried, including
> > this character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
>
> My reading of pg_johab_verifystr() is that it accepts any character
> without the high bit set as a single-byte character. Otherwise, it
> calls pg_johab_mblen() to determine the length of the character, and
> that in turn calls pg_euc_mblen(), which returns 3 if the first byte
> is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
> to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
> Your byte string doesn't match that rule, so it makes sense that it
> fails.
>
> What confuses me is that
> https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
> that the encoding is always a 2-byte encoding and that any 2-byte
> sequence with the high bit set on the first character is a valid
> character. So the rules we're implementing don't seem to match that at
> all. But unfortunately the intent behind the current code is not
> clear. It was introduced by Bruce in 2002 in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
> there or elsewhere explaining what the thought was behind the way the
> code works, so I don't know if this is some weird variant of JOHAB
> that intentionally works differently or if this was just never
> correct.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>
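
For reference, the check Robert describes can be modelled roughly as
follows. This is a Python sketch of the description in the quoted text, not
the PostgreSQL source: the function names are hypothetical, and applying
the 0xa1..0xfe range check to trail bytes only is an assumption (the
description says "each byte"). The second function models the looser rule
Robert reads from the Wikipedia section; it is not a full JOHAB validator.

```python
def model_len(first: int) -> int:
    """Length rule as described: 1 for ASCII, 3 after 0x8f, otherwise 2."""
    if first < 0x80:
        return 1
    return 3 if first == 0x8F else 2


def model_accepts(data: bytes) -> bool:
    """Model of the described check (assumption: only trail bytes are ranged)."""
    i = 0
    while i < len(data):
        n = model_len(data[i])
        chunk = data[i:i + n]
        if len(chunk) < n:
            return False  # truncated multibyte character
        # Trail bytes must fall in the EUC range 0xa1..0xfe.
        if not all(0xA1 <= b <= 0xFE for b in chunk[1:]):
            return False
        i += n
    return True


def loose_two_byte_accepts(data: bytes) -> bool:
    """Looser rule: any 2-byte sequence whose first byte has the high bit set."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif i + 1 < len(data):
            i += 2
        else:
            return False  # lead byte with no trail byte
    return True


sample = b"\x8a\x5c"  # the sequence from the bug report
print(model_accepts(sample), loose_two_byte_accepts(sample))
```

Under this model the reported sequence fails only because its trail byte
0x5c lies outside 0xa1..0xfe, which matches the error in the report.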
