From: Tom Lane
To: Robert Haas
cc: Jeroen Vermeulen, VASUKI M, pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Tue, 16 Dec 2025 10:41:46 -0500
Message-ID: <2393116.1765899706@sss.pgh.pa.us>

Robert Haas writes:
> ... So I went looking for
> where we got the mapping tables from.
> UCS_to_JOHAB.pl expects to read
> from a file JOHAB.TXT, of which the latest version seems to be found
> here:
> https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT
> And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.txt file, it
> regenerates the current mapping files.

Thanks for doing that research!

> So apparently we've
> got the "right" mappings, but you can only actually use the ones that
> match the code's rules for something to be a valid multi-byte
> character, which aren't actually in sync with the mapping table.

Yeah.  Looking at the code in wchar.c, it's clear that it thinks that
JOHAB has the same character-length rules as EUC_KR, which is
something that one might guess based on available documentation
that says it's related to that encoding.  So I can see how we
got here.  However, that doesn't mean we can fix pg_johab_mblen()
and we're done.  I'm still quite afraid that we'd be introducing
security-grade inconsistencies of interpretation between different
PG versions.

> I'm
> left with the conclusions that (1) nobody ever actually tried using
> this encoding for anything real until 3 days ago and (2) we don't have
> any testing infrastructure that verifies that the characters in the
> mapping tables are actually accepted by pg_verifymbstr(). I wonder how
> many other encodings we have that don't actually work?

Indeed.  Anyone want to do some testing?

			regards, tom lane
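
As a rough illustration of the kind of test asked for above, here is a minimal Python sketch that enumerates two-byte JOHAB sequences and checks them against an EUC_KR-style validity rule. Several things here are assumptions and not taken from the thread: Python's built-in "johab" codec stands in for the JOHAB.TXT mapping table, the function names are hypothetical, and the rule in euc_style_accepts() (both bytes in 0xA1..0xFE) is only a guess at what "same rules as EUC_KR" means. It is not a transcription of wchar.c or pg_verifymbstr().

```python
def euc_style_accepts(seq: bytes) -> bool:
    """Hypothetical EUC_KR-style rule (an assumption, not PostgreSQL's code):
    a two-byte character is valid only if both bytes are in 0xA1..0xFE."""
    return len(seq) == 2 and all(0xA1 <= b <= 0xFE for b in seq)


def johab_two_byte_chars():
    """Yield (bytes, char) for every two-byte sequence that Python's
    'johab' codec decodes to exactly one character. The codec is used
    here as a stand-in for the JOHAB.TXT mapping table."""
    for lead in range(0x80, 0x100):
        for trail in range(0x100):
            seq = bytes([lead, trail])
            try:
                ch = seq.decode("johab")
            except UnicodeDecodeError:
                continue
            if len(ch) == 1:
                yield seq, ch


def split_by_rule():
    """Partition the mapped characters into those the EUC-style rule
    would accept and those it would reject."""
    accepted, rejected = [], []
    for seq, ch in johab_two_byte_chars():
        (accepted if euc_style_accepts(seq) else rejected).append((seq, ch))
    return accepted, rejected


if __name__ == "__main__":
    accepted, rejected = split_by_rule()
    print(f"accepted by EUC-style rule: {len(accepted)}")
    print(f"rejected by EUC-style rule: {len(rejected)}")
    # Show a few mapped characters that the rule would refuse.
    for seq, ch in rejected[:5]:
        print(seq.hex(), f"U+{ord(ch):04X}")
```

A real test would instead feed every byte sequence from the generated mapping files to the server's own verifier, and would do so for each supported encoding, not just JOHAB.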