Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wCpOI-002LVe-2x for pgsql-bugs@arkaria.postgresql.org; Wed, 15 Apr 2026 01:50:07 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wCpOG-00DtxD-17 for pgsql-bugs@arkaria.postgresql.org; Wed, 15 Apr 2026 01:50:05 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wCpOF-00Dtx4-38 for pgsql-bugs@lists.postgresql.org; Wed, 15 Apr 2026 01:50:04 +0000 Received: from mail-dy1-x1334.google.com ([2607:f8b0:4864:20::1334]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wCpOE-00000001454-2knH for pgsql-bugs@lists.postgresql.org; Wed, 15 Apr 2026 01:50:04 +0000 Received: by mail-dy1-x1334.google.com with SMTP id 5a478bee46e88-2bd801b4078so517366eec.2 for ; Tue, 14 Apr 2026 18:50:02 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1776217802; cv=none; d=google.com; s=arc-20240605; b=jTedIjxfZhCJROh9m0vzu/ltnV9aCXMfG+yJJseDE4sl+FkAsceGAeSG30OeXhPCkO jVA/AerIZKKBeWzh5WvByy7noeVChjmy20ju+mt9DLu493kMA7yDSQ7IvN5zVPrLdAsS jQSqyK+mTs03tyke6vumdjxrKsn4/f/7ag2e5HNxlR8jf/kudemRLlN5zQDQRSJai919 pm2Zz/hmsHdymo4curc/UJyANLw9RthtQaov3gmYljtSKRltoUszZymiNfYCKVblM4Q0 N9asoklocxhssqwVEB+d0dIgUfXgLx0FcKUQ0pZdZodEnL+67xljiH+idiB+T1ozYMXo KPUg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=nOSNNMEW3fFd5rfyTiD5KU/QGUHmdmJtknOc3hQNNn8=; fh=Y04uqfRpkJDpmW1HGWJ1sfB26ZDWnkUKcyCMLtL19Rk=; b=B98VE+2GejMptUaXYI0dDU+tiMBKZJzNvnuRuHCcD+K/8UdzkLoO6Oh4wx5RsE9MOg T6zW13UullMJ4CU+xt2sQRhwEI9U43D36jyAj2Y35j3YejFCb5fwS1paFF9GoNSkJzDh seCaKuUtScF0j5MRJNlb0TzPn2GbV2TJulqLH19ETbwx3IY+UoL+/NYBdWSf0b50/1V0 xatPbTwVCR8yjAvVYSbAIUpyXKNokQAhxWtPuaTJRKU+ZqfDykl5+8IuEwJow9/eX/FO tbc6XewiYP4yFCzx9sYUpOl+IcUjB8QYy3aRh1Jy9/FzRSUtI2T39PIxdUI4jbqNAFPH +C7A==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1776217802; x=1776822602; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=nOSNNMEW3fFd5rfyTiD5KU/QGUHmdmJtknOc3hQNNn8=; b=iNZ/C3FKdlYW5cKsQBdlW34XxikJ4oTUcB1XsnObYMmbyLomXvd6mz5gVsEohrJxTp MAt7//MrNwxxOVtXXbmlALXbHnrGRhoxxgu69WH95wLrLcVfIIRDEXtVjqPpJ3CR3AOx bmg7YZzEBLXnc5hODpP9tsdjnYpAcsXgF8MG2flYZjUsN1hB14Xm+4B+v9fYI//qU9Tu JC9XBnngR9ZQ9BUIFDf3Extm34TjOC1iFd5AiZeho+P4gmQP/RRZquFE2SbjpNqhLdew ij2PmJKtZNrCLCyzNBQNxc3tjp2U7Sb4nUWD7oQ9oF2xhsYdDgpmbj0JA2JvS4XRTvX9 ZBtw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1776217802; x=1776822602; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=nOSNNMEW3fFd5rfyTiD5KU/QGUHmdmJtknOc3hQNNn8=; b=ncTefu3RLSsc4JHBxovDPeDSLUDzLg+qo2FMyx5NifCWT/UT3Ei9RYZSU7oGasEAkK z3U2zbeLNqbFE1ICze4GMJnCkxj6yipi+BeinHjKqqmhIS390MVX8iWkAtDecCFtE6BL 7L77OENH6gd3LuD/whFWLS2xSQnaqebN6RB2wT5TPvrsiZSJW22AmQl4F0UtFWcSLqbm toLqHd4nV7eZeXDpJhvMDvNkECKCuvR4C+4jQqQjXqhaqSDa8LtVam0UqkSgLMeiZjGu wnkCbeazTVHmV6jbuZn3a8RkgIlf2Rvk4237KHRwRhT+jyLVht2OovxHFiLKjPz0+O3f gdJw== X-Forwarded-Encrypted: i=1; AFNElJ9xGu/iFltxtSnoJ25zU5DOv+WFSmzAnJXTeJx9pGksDY1irVbEDFyl6ChsLTVeGIkXCCJ+3VEtDcc8@lists.postgresql.org X-Gm-Message-State: AOJu0YwoK0XAiyhPzRADISMk2qmreFrFpWV0YBhrqYZ20pW4XV5BqO9h QNEF9ZorNj+Etmt7wrghnqZ+18S7laCYCym4MqTqoynNGtFaqNTZ/no96p0jTDKHkRbohdTT9a7 ePL8NKNlMht0l7InSt0+n8DWwFfaO9HA= X-Gm-Gg: AeBDieuLfFqO5IThgNv/bhQbV61lEqTe0K3WuSCGUvuUFdNKCWyEriORwWt0hffi+OQ 87LFNxWq7EZIlpF7LouCbnGze1lXi17yIMGs03FfG5TpoVjMe5DzwKMC/UhFTToUIh/U59om1r5 hdavEzYrk6765NSge1itEi6j3epJUtJq5KQ5yCoVtS+NbSWm/k6RqRzomJnLdvYmf5d8HMywWhK Nk/wd+Z3Xf7FC3au3KZKwV3EmB0ZPGoKXPx6r3HU0WN1M83KdOV/GcYTO5BfJHubY25nzDh+zwe taABN6l8beRAqdGfu0cRc8rhVq/FOX1X9RS3ZDNZm4AsGO7uH3OMZAQ+qm79wu5gnPV2/SoJ8xT CzhgZmXWGOg== X-Received: by 2002:a05:7300:8c9f:b0:2d9:6373:ace6 with SMTP id 5a478bee46e88-2de7b683e7emr140665eec.5.1776217801666; Tue, 14 Apr 2026 18:50:01 -0700 (PDT) MIME-Version: 1.0 References: <19354-eefe6d8b3e84f9f2@postgresql.org> <2292889.1765846569@sss.pgh.pa.us> <2393116.1765899706@sss.pgh.pa.us> <6a8122ac-123d-4e93-9269-0b3be1e4a5a4@iki.fi> In-Reply-To: From: Thomas Munro Date: Wed, 15 Apr 2026 13:49:24 +1200 X-Gm-Features: AQROBzA1y7rNWVAiXh5kPyfzvLGNsWnT8COCx9O3VpKfiDuHNMydL4M4eAN2BJs Message-ID: Subject: Re: BUG #19354: JOHAB rejects valid byte sequences To: assam258@gmail.com Cc: Heikki Linnakangas , Robert Haas , Tom Lane , Jeroen Vermeulen , VASUKI M , pgsql-bugs@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Apr 15, 2026 at 1:20=E2=80=AFPM Henson Choi wr= ote: > In short: completion form is a frequency-curated lookup, combinational > form is an algorithmic composition that covers the full modern Hangul > space. Unicode later adopted the combinational form's coverage as a > completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3) > encodes exactly the same 11,172 modern syllables, as precomposed code > points. So today the three Korean-related encodings PostgreSQL > supports sit along this spectrum: EUC_KR (curated completion form), > UHC (extended completion form), and JOHAB (algorithmic combinational > form). Thank you! Yes, that makes total sense. Here are my own notes (compiled from English-language Wikipedia articles), which say essentially the same thing + some notes about Hancom: The Korean writing system: 1. Hanja: Chinese characters used in names, legal and historical documents, and to disambiguate homonyms. The number of characters in use is difficult to pin down (as in Japan and China). 2. Hangul: a phonetic system used for almost all modern Korean text. Hangul characters are composed of 2-5 "jamo", commonly 2-3 in modern texts, each representing a consonant/vowel. Character set standards: 1. KS X 1001: 4,888 Hanja (of the vast number of hard to count CJK ideographs) + 2,350 precomposed Hangul (of 11,172 theoretically possible jamo combinations). 2. KS X 1002: added some more but no one ever implemented it, possibly because... 3. Unicode: all 11,172 possible precomposed Hangul + individual jamo for composition + all Hanja/Kanji/Hanzi characters known to humanity (still growing). Encodings: 1. EUR-KR, AKA Wansung (=3D "precomposed"): directly encoded KS X 1001. 2. JOHAB (=3D "combining"): deferred to KS X 1001 for Hanja, but described all possible Hangul as jamo stored in bitfields. 3. UHC (=3D "Unified Hangul Code", invented by Microsoft): used EUR-KR as a base but supplied all possible pre-composed Hangul and 8,222 Hanja (complete CJK as of Unicode 2.0). 4. UTF-8, UTF-16, UTF-32: Unicode. Realpolitik that fed back into standards: 1. The Hancom "Hangul" word processor used de facto standard JOHAB encoding, and dominated. 2. KS X 1001 recognised this and added that annex. 3. MS-DOS/Windows recognised this and called it CP1361. 4. MS-DOS/Windows switched to UHC/CP949 alongside Unicode some time in the early to mid 90s. 5. Hancom switched to Unicode around the turn of the millennium. I will study your patch and your analysis. It looks good on first read. > Why keep it rather than remove it > --------------------------------- > > > I understand the appeal of simply deleting a dead-looking encoding, > and Thomas' removal patch is clean work. However, Korean archival > data from the 1990s (government records, academic repositories, early > online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB > in PostgreSQL provides a straightforward ingest path > (client_encoding=3DJOHAB, convert_from, then store as UTF-8). Once > removed, that path closes with no obvious alternative short of > preprocessing outside PostgreSQL. Fixing the verifier preserves the > capability at the cost of a ~30-line correction plus tests. The counter argument would be that you could use iconv --from-code=3DJOHAB ..., or libiconv, or the codecs available in Python, Java, etc for dealing with historical archived data, something that data archivists must be very aware of. And for old Hancom word processor files, not really of relevance to PostgreSQL, apparently they can be imported by modern word processors. > Happy to iterate on the patch, the commit message, or the tests. > Thanks to everyone for the careful analysis that preceded this; I > recognise that the consensus was leaning toward removal, and I would > appreciate a chance to have this fix considered as an alternative. Cool. For now I'll leave the removal on ice, and look into committing your patch. Thanks for working on it!