From: Jeroen Vermeulen
Date: Tue, 16 Dec 2025 01:07:12 +0100
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
To: Robert Haas
Cc: pgsql-bugs@lists.postgresql.org
References: <19354-eefe6d8b3e84f9f2@postgresql.org>

Hi Robert.  Thanks for following up.

The original author of the support code in libpqxx also noted that there was a discrepancy.  Python does accept these 2-byte sequences, and decodes them to Hangul characters.

The way I read the Wikipedia section, Johab isn't like the EUC encodings in that it adds characters that contain ASCII-like values in the second byte.  I guess that was needed to support Chinese characters in addition to Hangul.  Unit-testing for the embedded-backslash hazard was what led me to find the problem.
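
To make the embedded-backslash hazard concrete, here is a minimal Python sketch. It assumes that any byte with the high bit set starts a two-byte character (my reading of the Wikipedia section); the helper names are made up for illustration and are not part of libpq or libpqxx.

```python
def find_backslashes_naively(data: bytes) -> list[int]:
    """Return offsets of every 0x5c byte, ignoring character boundaries."""
    return [i for i, b in enumerate(data) if b == 0x5C]


def find_backslashes_johab(data: bytes) -> list[int]:
    """Return offsets of 0x5c bytes that are real single-byte characters.

    Assumption: a byte with the high bit set is the lead byte of a
    two-byte character, so its trail byte is skipped unexamined.
    """
    offsets = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            i += 2  # skip lead byte and trail byte together
        else:
            if data[i] == 0x5C:
                offsets.append(i)
            i += 1
    return offsets


sample = b"\x8a\x5c"  # the sequence from the bug report
print(find_backslashes_naively(sample))
print(find_backslashes_johab(sample))
```

The byte-wise scan reports a backslash inside the Hangul character, while the boundary-aware scan does not. Any escaping or quoting code that scans bytes without tracking character boundaries would misread the trail byte.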

This bit worries me: "Other, vendor-defined, Johab variants also exist", such as an EBCDIC-based one and a stateful one!


Jeroen

On Mon, Dec 15, 2025, 18:46 Robert Haas <robertmhaas@gmail.com> wrote:
On Sat, Dec 13, 2025 at 2:12 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> Calling libpq, connecting to a UTF8 database and successfully setting client
> encoding to JOHAB, this statement:
>
>     PQexec(connection, "SELECT '\x8a\x5c'");
>
> Returned an empty result with this error message:
>
>     ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
>
> AFAICT, 0x8a 0x5c is a valid JOHAB sequence making up Hangul character "굎".
> Easily verified in Python:
>
>     print(b'\x8a\x5c'.decode('johab'))
>
> It's the same story for some other valid sequences I tried, including this
> character's "neighbours" 0x8a 0x5b and 0x8a 0x5d.

My reading of pg_johab_verifystr() is that it accepts any character
without the high bit set as a single-byte character. Otherwise, it
calls pg_johab_mblen() to determine the length of the character, and
that in turn calls pg_euc_mblen(), which returns 3 if the first byte
is 0x8f and otherwise 2. Whatever the answer, it then wants each byte
to pass IS_EUC_RANGE_VALID() which allows for bytes from 0xa1 to 0xfe.
Your byte string doesn't match that rule, so it makes sense that it
fails.
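
The check described above can be modelled in a few lines of Python. This is only a sketch of the description, not the PostgreSQL source: the function names are hypothetical stand-ins for pg_euc_mblen() and IS_EUC_RANGE_VALID(), and it assumes that only the bytes after the lead byte are range-checked. If the lead byte were range-checked too, 0x8a 0x5c would still be rejected.

```python
def euc_mblen(first: int) -> int:
    """Length rule as described: 3 for a 0x8f lead byte, otherwise 2."""
    return 3 if first == 0x8F else 2


def is_euc_range_valid(b: int) -> bool:
    """Range rule as described: bytes from 0xa1 to 0xfe pass."""
    return 0xA1 <= b <= 0xFE


def verify_as_described(data: bytes) -> bool:
    """Return True if every character in data passes the described check."""
    i = 0
    while i < len(data):
        if not data[i] & 0x80:
            i += 1  # high bit clear: single-byte character
            continue
        length = euc_mblen(data[i])
        trail = data[i + 1:i + length]
        if len(trail) < length - 1:
            return False  # truncated multibyte character
        # Assumption: only the trailing bytes are range-checked.
        if not all(is_euc_range_valid(b) for b in trail):
            return False
        i += length
    return True


print(verify_as_described(b"\x8a\x5c"))
```

Under this model the trail byte 0x5c falls outside 0xa1..0xfe, so the sequence from the report is rejected.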

What confuses me is that
https://en.wikipedia.org/wiki/KS_X_1001#Johab_encoding seems to say
that the encoding is always a 2-byte encoding and that any 2-byte
sequence with the high bit set on the first character is a valid
character. So the rules we're implementing don't seem to match that at
all. But unfortunately the intent behind the current code is not
clear. It was introduced by Bruce in 2002 in commit
a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, but I don't see comments
there or elsewhere explaining what the thought was behind the way the
code works, so I don't know if this is some weird variant of JOHAB
that intentionally works differently or if this was just never
correct.
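
One way to compare the two readings empirically is to ask a reference decoder which trail bytes it accepts for a given lead byte. The sketch below uses Python's built-in johab codec for that, on the assumption that it implements the variant Wikipedia describes; accepted_trail_bytes is a hypothetical helper name.

```python
def accepted_trail_bytes(lead: int, codec: str = "johab") -> set[int]:
    """Return the trail bytes that decode to one character after `lead`."""
    accepted = set()
    for trail in range(0x100):
        try:
            decoded = bytes([lead, trail]).decode(codec)
        except UnicodeDecodeError:
            continue  # the codec rejects this two-byte sequence
        if len(decoded) == 1:
            accepted.add(trail)
    return accepted


trail_bytes = accepted_trail_bytes(0x8A)
# Trail bytes the codec accepts but the 0xa1..0xfe range rule would reject.
outside_euc_range = sorted(b for b in trail_bytes if not 0xA1 <= b <= 0xFE)
print([hex(b) for b in outside_euc_range])
```

Any value printed here is a trail byte that the reference decoder accepts but a 0xa1..0xfe range check would refuse.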

--
Robert Haas
EDB: http://www.enterprisedb.com