public inbox for pgsql-bugs@postgresql.org  
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Robert Haas <robertmhaas@gmail.com>
Cc: Jeroen Vermeulen <jtvjtv@gmail.com>
Cc: VASUKI M <vasukianand0119@gmail.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Tue, 16 Dec 2025 10:41:46 -0500
Message-ID: <2393116.1765899706@sss.pgh.pa.us>
In-Reply-To: <CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com>
References: <19354-eefe6d8b3e84f9f2@postgresql.org>
	<CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
	<CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
	<2292889.1765846569@sss.pgh.pa.us>
	<CAE2r8H5vaSyaC_t1FcpHBo-BB_=SrFj7GFnOC-SxC6WDf5c9VA@mail.gmail.com>
	<CA+zULE47EXZOp7qKYODd+mjSgDiR-WX5ZNBkwdKnj-Zc0FT58w@mail.gmail.com>
	<CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com>

Robert Haas <robertmhaas@gmail.com> writes:
> ... So I went looking for
> where we got the mapping tables from. UCS_to_JOHAB.pl expects to read
> from a file JOHAB.TXT, of which the latest version seems to be found
> here:
> https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
> And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
> regenerates the current mapping files.

Thanks for doing that research!

> So apparently we've
> got the "right" mappings, but you can only actually use the ones that
> match the code's rules for something to be a valid multi-byte
> character, which aren't actually in sync with the mapping table.

Yeah.  Looking at the code in wchar.c, it's clear that it thinks
that JOHAB has the same character-length rules as EUC_KR, which is
something that one might guess based on available documentation that
says it's related to that encoding.  So I can see how we got here.
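
To make the mismatch concrete, here is a rough sketch of an EUC-style
length-and-validity rule.  It is a simplified paraphrase, not a copy of
what's in wchar.c; the function names euc_style_mblen() and
euc_style_verifychar() are made up for illustration, and the trail-byte
range 0xa1..0xfe is assumed to be the usual EUC one.

```c
#include <stddef.h>

/*
 * Hypothetical helpers, for illustration only: an EUC-style rule in
 * which the lead byte alone determines the character length.
 */
int
euc_style_mblen(const unsigned char *s)
{
	if (*s == 0x8e)				/* SS2 prefix: two bytes */
		return 2;
	if (*s == 0x8f)				/* SS3 prefix: three bytes */
		return 3;
	if (*s & 0x80)				/* any other high-bit lead byte */
		return 2;
	return 1;					/* plain ASCII */
}

/*
 * Returns the character length if the bytes pass the EUC-style check,
 * or -1 if they do not.  Every trail byte must lie in 0xa1..0xfe
 * (assumed range).
 */
int
euc_style_verifychar(const unsigned char *s, size_t len)
{
	int			mbl;

	if (len == 0)
		return -1;
	mbl = euc_style_mblen(s);
	if (len < (size_t) mbl)
		return -1;
	for (int i = 1; i < mbl; i++)
	{
		if (s[i] < 0xa1 || s[i] > 0xfe)
			return -1;
	}
	return mbl;
}
```

Under a rule like this, a two-byte code whose trail byte is below 0x80
(say the bytes 0x88 0x61) gets rejected, and a lead byte of 0x8f is
taken to start a three-byte character, whatever the mapping table says.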

However, that doesn't mean we can fix pg_johab_mblen() and we're done.
I'm still quite afraid that we'd be introducing security-grade
inconsistencies of interpretation between different PG versions.

> I'm
> left with the conclusions that (1) nobody ever actually tried using
> this encoding for anything real until 3 days ago and (2) we don't have
> any testing infrastructure that verifies that the characters in the
> mapping tables are actually accepted by pg_verifymbstr(). I wonder how
> many other encodings we have that don't actually work?

Indeed.  Anyone want to do some testing?
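
As a starting point, the sort of check I have in mind could look
roughly like the sketch below: walk the codes from a mapping table and
count the ones a verifier refuses.  Everything here is hypothetical.
count_rejected() and standin_verifychar() are invented names, the
stand-in verifier is a toy rather than pg_verifymbstr(), and the sketch
assumes codes of at most two bytes.

```c
#include <stddef.h>

/* Per-character verifier: returns the length, or -1 if invalid. */
typedef int (*verify_fn) (const unsigned char *s, size_t len);

/*
 * Toy stand-in verifier (hypothetical): accepts one ASCII byte, or two
 * bytes that both have the high bit set.
 */
int
standin_verifychar(const unsigned char *s, size_t len)
{
	if (len == 0)
		return -1;
	if (!(s[0] & 0x80))
		return 1;
	if (len < 2 || !(s[1] & 0x80))
		return -1;
	return 2;
}

/*
 * Count the mapping-table codes the verifier does not accept at their
 * full length.  Codes above 0xff are laid out as two big-endian bytes;
 * anything else is treated as a single byte.
 */
size_t
count_rejected(const unsigned int *codes, size_t ncodes, verify_fn verify)
{
	size_t		rejected = 0;

	for (size_t i = 0; i < ncodes; i++)
	{
		unsigned char buf[2];
		size_t		n;

		if (codes[i] > 0xff)
		{
			buf[0] = (unsigned char) ((codes[i] >> 8) & 0xff);
			buf[1] = (unsigned char) (codes[i] & 0xff);
			n = 2;
		}
		else
		{
			buf[0] = (unsigned char) codes[i];
			n = 1;
		}
		if (verify(buf, n) != (int) n)
			rejected++;
	}
	return rejected;
}
```

A real test would feed in the codes from the generated map files and
call the actual per-encoding verifier; any nonzero count means the
table and the validity rules disagree.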

			regards, tom lane