From: Robert Haas
Date: Tue, 16 Dec 2025 09:26:12 -0500
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
To: Jeroen Vermeulen
Cc: VASUKI M, Tom Lane, pgsql-bugs@lists.postgresql.org
References: <19354-eefe6d8b3e84f9f2@postgresql.org> <2292889.1765846569@sss.pgh.pa.us>

On Tue, Dec 16, 2025 at 2:42 AM Jeroen Vermeulen wrote:
> My one worry is perhaps Johab is on the list because one important user needed it.
>
> But even then that requirement may have gone away?

Well, that was over 20 years ago. There's a very good chance that even
if somebody was using JOHAB back then, they're not still using it now.

What's mystifying to me is that, presumably, somebody had a reason at
the time for thinking that this was correct. I know that our quality
standards were a whole lot looser back then, but I still don't quite
understand why someone would have spent time and effort writing code
based on a purely fictitious encoding scheme.

So I went looking for where we got the mapping tables from.
UCS_to_JOHAB.pl expects to read from a file JOHAB.TXT, of which the
latest version seems to be found here:

https://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/JOHAB.TXT

And indeed, if I run UCS_to_JOHAB.pl on that JOHAB.TXT file, it
regenerates the current mapping files. Playing with it a bit:

rhaas=# select convert_from(e'\\x8a5c'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x5c
rhaas=# select convert_from(e'\\x8444'::bytea, 'johab');
ERROR:  invalid byte sequence for encoding "JOHAB": 0x84 0x44
rhaas=# select convert_from(e'\\x89ef'::bytea, 'johab');
 convert_from
--------------
 괦
(1 row)

So, \x8a5c is the original example, which does appear in JOHAB.TXT,
and \x8444 is the first multi-byte character in that file, and both of
them fail. But \x89ef, which also appears in that file, doesn't fail,
and from what I can tell the mapping is correct. So apparently we've
got the "right" mappings, but you can only actually use the ones that
match the code's rules for something to be a valid multi-byte
character, and those rules aren't actually in sync with the mapping
table.

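To make that mismatch easier to hunt for, here is a rough sketch of the kind of check I have in mind. It is not anything in the PostgreSQL tree: the function names are made up, the line format (hex code, whitespace, hex code point, optional comment) is my assumption about how the Unicode mapping files are laid out, and is_valid is a hypothetical stand-in for whatever the encoding's validity rules really are.

```python
import re
from typing import Callable, Iterable, List

# Assumed mapping-file line format: "0x8444 <whitespace> 0x3133 <whitespace> # comment".
# Lines that do not match (comments, blanks, anything else) are skipped.
_MAPPING_LINE = re.compile(r"^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)")


def mapped_codes(lines: Iterable[str]) -> List[bytes]:
    """Return the encoded byte sequence of every entry in a mapping file."""
    codes = []
    for line in lines:
        match = _MAPPING_LINE.match(line.strip())
        if match is None:
            continue
        hex_code = match.group(1)
        if len(hex_code) % 2:
            # Pad odd-length codes such as "0x5" to a whole number of bytes.
            hex_code = "0" + hex_code
        codes.append(bytes.fromhex(hex_code))
    return codes


def rejected_codes(
    lines: Iterable[str], is_valid: Callable[[bytes], bool]
) -> List[bytes]:
    """List mapped byte sequences that the (hypothetical) validator rejects."""
    return [code for code in mapped_codes(lines) if not is_valid(code)]
```

Given the lines of a mapping file and some is_valid callable, rejected_codes() returns every byte sequence that has a mapping but would never get past validation. One way to supply is_valid would be to issue convert_from() calls like the ones above and treat an "invalid byte sequence" error as a rejection, but that part is left out here.
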
I'm left with the conclusions that (1) nobody ever actually tried
using this encoding for anything real until 3 days ago and (2) we
don't have any testing infrastructure that verifies that the
characters in the mapping tables are actually accepted by
pg_verifymbstr(). I wonder how many other encodings we have that don't
actually work?

-- 
Robert Haas
EDB: http://www.enterprisedb.com