MIME-Version: 1.0
References: <19354-eefe6d8b3e84f9f2@postgresql.org> <CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
In-Reply-To: <CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
From: Jeroen Vermeulen <jtvjtv@gmail.com>
Date: Tue, 16 Dec 2025 01:07:12 +0100
Message-ID: <CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
To: Robert Haas <robertmhaas@gmail.com>
Cc: pgsql-bugs@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000442ae50646068384"
Archived-At: <https://www.postgresql.org/message-id/CA%2BzULE4L4rA2DLAcfy%3DeQL7w_ZexV4P5zpQRbP%3D_qrhJBEOzjg%40mail.gmail.com>
Precedence: bulk

--000000000000442ae50646068384
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Robert.  Thanks for following up.

The original author of the support code in libpqxx also noted that there
was a discrepancy.  Python does accept these 2-byte sequences, and decodes
them to Hangul characters.
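
For the record, this is roughly the check I mean: a small sketch using
nothing but Python 3's built-in "johab" codec.  The helper name decode_all
is just mine for illustration.

```python
# Decode the sequences from the bug report with Python's built-in johab codec.
SEQUENCES = [b"\x8a\x5b", b"\x8a\x5c", b"\x8a\x5d"]

def decode_all(seqs):
    """Return (raw bytes, decoded text) pairs.

    Raises UnicodeDecodeError if the codec rejects a sequence.
    """
    return [(raw, raw.decode("johab")) for raw in seqs]

for raw, text in decode_all(SEQUENCES):
    # Print the code point rather than the glyph to stay terminal-safe.
    print(raw.hex(" "), "->", f"U+{ord(text):04X}")
```

Each sequence decodes to a single character without an error, which is
all I am claiming here.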

The way I read the Wikipedia section, Johab differs from the EUC encodings in
that the second byte of a character can have an ASCII-range value.  I guess
that was needed to support Chinese characters in addition to Hangul.
Unit-testing for the embedded-backslash hazard is what led me to the problem.
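
To make the hazard concrete, here is a rough sketch (not libpqxx's actual
code) of how a byte-wise scanner sees a backslash inside the character from
the report, while a scanner that skips trail bytes does not.  The function
names are made up, and the "high bit set means a two-byte character" rule is
only my reading of the Wikipedia description.

```python
BACKSLASH = 0x5C

def naive_backslashes(data):
    """Offsets of every 0x5c byte, ignoring character boundaries."""
    return [i for i, b in enumerate(data) if b == BACKSLASH]

def johab_backslashes(data):
    """Offsets of 0x5c bytes that are real single-byte backslashes.

    Assumption: any byte with the high bit set starts a two-byte character.
    """
    found = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            i += 2  # skip the lead byte and its trail byte together
            continue
        if data[i] == BACKSLASH:
            found.append(i)
        i += 1
    return found

# 'a', the Hangul character 0x8a 0x5c, 'b', then a real backslash.
sample = b"a\x8a\x5cb\\"
print(naive_backslashes(sample))  # [2, 4]
print(johab_backslashes(sample))  # [4]
```

The naive scan reports offset 2, which is the trail byte of the Hangul
character and not a backslash at all.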

This bit worries me: "Other, vendor-defined, Johab variants also exist",
such as an EBCDIC-based one and a stateful one!


Jeroen

On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:

> On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > Calling libpq, connecting to a UTF8 database and successfully setting
> > client encoding to JOHAB, this statement:
> >
> >     PQexec(connection, "SELECT '\x8a\x5c'");
> >
> > Returned an empty result with this error message:
> >
> >     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
> >
> > AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character
> > "굎".
> > Easily verified in Python:
> >
> >     print(b'\x8a\x5c'.decode('johab'))
> >
> > It's the same story for some other valid sequences I tried, including
> > this character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.
>
> My reading of pg_johab_verifystr() is that it accepts any character
> without the high bit set as a single-byte character. Otherwise, it
> calls pg_johab_mblen() to determine the length of the character, and
> that in turn calls pg_euc_mblen(), which returns 3 if the first byte
> is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
> to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
> Your byte string doesn't match that rule, so it makes sense that it
> fails.
>
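
As a rough model of that description, here is a Python sketch next to the
looser rule I take from the Wikipedia section.  It is not a transcription of
the C code, and the function names are mine.  I am assuming the range check
applies to the bytes after the first, since a 0x8f lead byte could never
pass a 0xa1..0xfe check itself.

```python
def euc_style_valid(seq):
    """Model of the check described above, for one multibyte character."""
    if not seq or not seq[0] & 0x80:
        return False  # not a multibyte character at all
    expected_len = 3 if seq[0] == 0x8F else 2
    if len(seq) != expected_len:
        return False
    # Assumption: only the bytes after the first must be in 0xa1..0xfe.
    return all(0xA1 <= b <= 0xFE for b in seq[1:])

def wikipedia_style_valid(seq):
    """Looser rule: any two bytes where the first has the high bit set."""
    return len(seq) == 2 and bool(seq[0] & 0x80)

for seq in (b"\x8a\x5c", b"\xb0\xa1"):
    print(seq.hex(" "), euc_style_valid(seq), wikipedia_style_valid(seq))
```

Under this model 0x8a 0x5c fails the first rule only because of its second
byte, and passes the second rule.
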
> What confuses me is that
> https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
> that the encoding is always a 2-byte encoding and that any 2-byte
> sequence with the high bit set on the first character is a valid
> character. So the rules we're implementing don't seem to match that at
> all. But unfortunately the intent behind the current code is not
> clear. It was introduced by Bruce in 2002 in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
> there or elsewhere explaining what the thought was behind the way the
> code works, so I don't know if this is some weird variant of JOHAB
> that intentionally works differently or if this was just never
> correct.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com
>

--000000000000442ae50646068384--