Re: BUG #19354: JOHAB rejects valid byte sequences

From: Thomas Munro <thomas.munro@gmail.com>
To: assam258@gmail.com
Cc: Heikki Linnakangas <hlinnaka@iki.fi>
Cc: Robert Haas <robertmhaas@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Jeroen Vermeulen <jtvjtv@gmail.com>
Cc: VASUKI M <vasukianand0119@gmail.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Wed, 15 Apr 2026 13:49:24 +1200
Message-ID: <CA+hUKGJMrcS=hBkqVk=5pjM4w8edG=_ArASC82RqB6HQro-v-g@mail.gmail.com>
In-Reply-To: <CAAAe_zCLVunjt1u+2E86shwc3hk1x4bzUyU86nY1fq-nAVYN0Q@mail.gmail.com>
References: <19354-eefe6d8b3e84f9f2@postgresql.org>
	<CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
	<CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
	<2292889.1765846569@sss.pgh.pa.us>
	<CAE2r8H5vaSyaC_t1FcpHBo-BB_=SrFj7GFnOC-SxC6WDf5c9VA@mail.gmail.com>
	<CA+zULE47EXZOp7qKYODd+mjSgDiR-WX5ZNBkwdKnj-Zc0FT58w@mail.gmail.com>
	<CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com>
	<2393116.1765899706@sss.pgh.pa.us>
	<CA+TgmoaoW4F2rRzYcQQim9ddT4-6H3oi0UYV9Ucw-rRQ5MdHsg@mail.gmail.com>
	<CA+hUKGKy-ViGBXdOjcPownBM=OdWiULO8H1RyH1r_8qNp=U4CA@mail.gmail.com>
	<6a8122ac-123d-4e93-9269-0b3be1e4a5a4@iki.fi>
	<CAAAe_zCLVunjt1u+2E86shwc3hk1x4bzUyU86nY1fq-nAVYN0Q@mail.gmail.com>

On Wed, Apr 15, 2026 at 1:20 PM Henson Choi <assam258@gmail.com> wrote:
> In short: completion form is a frequency-curated lookup, combinational
> form is an algorithmic composition that covers the full modern Hangul
> space.  Unicode later adopted the combinational form's coverage as a
> completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
> encodes exactly the same 11,172 modern syllables, as precomposed code
> points.  So today the three Korean-related encodings PostgreSQL
> supports sit along this spectrum: EUC_KR (curated completion form),
> UHC (extended completion form), and JOHAB (algorithmic combinational
> form).

Thank you!  Yes, that makes total sense.  Here are my own notes
(compiled from English-language Wikipedia articles), which say
essentially the same thing + some notes about Hancom:

The Korean writing system:
1.  Hanja: Chinese characters used in names, legal and historical
documents, and to disambiguate homonyms.  The number of characters in
use is difficult to pin down (as in Japan and China).
2.  Hangul: a phonetic system used for almost all modern Korean text.
Hangul characters are composed of 2-5 "jamo", commonly 2-3 in modern
texts, each representing a consonant/vowel.

Character set standards:
1.  KS X 1001: 4,888 Hanja (of the vast number of hard to count CJK
ideographs) + 2,350 precomposed Hangul (of 11,172 theoretically
possible jamo combinations).
2.  KS X 1002: added some more but no one ever implemented it,
possibly because...
3.  Unicode: all 11,172 possible precomposed Hangul + individual jamo
for composition + all Hanja/Kanji/Hanzi characters known to humanity
(still growing).
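
To make the 11,172 figure concrete, here is a minimal Python sketch of
the Unicode Hangul Syllables arithmetic (19 initial consonants x 21
vowels x 28 final slots, where the 28th slot means "no final").  The
names compose, HANGUL_BASE, etc. are my own, chosen for illustration;
nothing here comes from PostgreSQL.

```python
# Unicode lays out precomposed modern Hangul algorithmically from U+AC00.
HANGUL_BASE = 0xAC00
N_INITIAL = 19   # initial consonants
N_MEDIAL = 21    # vowels
N_FINAL = 28     # 27 final consonants + 1 slot for "no final"

TOTAL_SYLLABLES = N_INITIAL * N_MEDIAL * N_FINAL


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Return the precomposed syllable for zero-based jamo indexes."""
    if not (0 <= initial < N_INITIAL
            and 0 <= medial < N_MEDIAL
            and 0 <= final < N_FINAL):
        raise ValueError("jamo index out of range")
    return chr(HANGUL_BASE + (initial * N_MEDIAL + medial) * N_FINAL + final)


if __name__ == "__main__":
    print(TOTAL_SYLLABLES)
    print(compose(0, 0), compose(18, 20, 27))  # first and last syllable
```

The last syllable lands on U+D7A3, which is why the block is exactly
U+AC00 - U+D7A3.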

Encodings:
1.  EUC-KR, AKA Wansung (= "precomposed"): directly encoded KS X 1001.
2.  JOHAB (= "combining"): deferred to KS X 1001 for Hanja, but
described all possible Hangul as jamo stored in bitfields.
3.  UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
as a base but supplied all possible precomposed Hangul and 8,222
Hanja (complete CJK as of Unicode 2.0).
4.  UTF-8, UTF-16, UTF-32: Unicode.
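
As a rough illustration of "jamo stored in bitfields", here is a
Python sketch of how a two-byte JOHAB Hangul code can be unpacked: a
set high bit followed by three 5-bit fields (initial, medial, final).
The field-value ranges below are my understanding of the JOHAB layout,
not something taken from the patch, and the sketch only handles
complete modern syllables (no fill codes for partial jamo, no Hanja or
symbols).  All names are hypothetical.

```python
def _index_table(*ranges):
    """Map 5-bit field values to consecutive jamo indexes.

    Field values outside the given ranges are simply absent.
    """
    table, index = {}, 0
    for lo, hi in ranges:
        for code in range(lo, hi + 1):
            table[code] = index
            index += 1
    return table


# Assumed field-value ranges (gaps are unused values).
INITIAL = _index_table((2, 20))                              # 19 consonants
MEDIAL = _index_table((3, 7), (10, 15), (18, 23), (26, 29))  # 21 vowels
FINAL = _index_table((1, 1), (2, 17), (19, 29))              # none + 27 finals


def decode_johab_hangul(code: int) -> str:
    """Decode a 16-bit JOHAB Hangul code to a precomposed Unicode syllable."""
    if not 0x8000 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit code with the high bit set")
    try:
        initial = INITIAL[(code >> 10) & 0x1F]
        medial = MEDIAL[(code >> 5) & 0x1F]
        final = FINAL[code & 0x1F]
    except KeyError:
        raise ValueError("not a complete modern Hangul syllable") from None
    # Same arithmetic as the Unicode Hangul Syllables block.
    return chr(0xAC00 + (initial * 21 + medial) * 28 + final)


if __name__ == "__main__":
    print(decode_johab_hangul(0x8861))
```

The point of the sketch is that no lookup table of syllables is
needed: three small jamo tables plus arithmetic cover the whole
modern Hangul space.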

Realpolitik that fed back into standards:
1.  The Hancom "Hangul" word processor used de facto standard JOHAB
encoding, and dominated.
2.  KS X 1001 recognised this and added an annex describing JOHAB.
3.  MS-DOS/Windows recognised this and called it CP1361.
4.  MS-DOS/Windows switched to UHC/CP949 alongside Unicode some time
in the early to mid 90s.
5.  Hancom switched to Unicode around the turn of the millennium.

I will study your patch and your analysis.  It looks good on first read.

> Why keep it rather than remove it
> ---------------------------------
>
>
> I understand the appeal of simply deleting a dead-looking encoding,
> and Thomas' removal patch is clean work.  However, Korean archival
> data from the 1990s (government records, academic repositories, early
> online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
> in PostgreSQL provides a straightforward ingest path
> (client_encoding=JOHAB, convert_from, then store as UTF-8).  Once
> removed, that path closes with no obvious alternative short of
> preprocessing outside PostgreSQL.  Fixing the verifier preserves the
> capability at the cost of a ~30-line correction plus tests.

The counterargument would be that you could use iconv
--from-code=JOHAB ..., or libiconv, or the codecs available in Python,
Java, etc. to deal with historical archived data, something that data
archivists must be well aware of.  As for old Hancom word processor
files, which are not really relevant to PostgreSQL, apparently they
can be imported by modern word processors.
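
For example, preprocessing outside PostgreSQL could look something
like this minimal Python sketch, which relies on the "johab" codec
that ships with CPython.  The function name johab_to_utf8 is made up
for illustration, and a real archival job would need a policy for
invalid bytes rather than just raising.

```python
def johab_to_utf8(raw: bytes) -> bytes:
    """Transcode JOHAB-encoded bytes to UTF-8.

    Decoding is strict, so invalid input raises UnicodeDecodeError
    instead of being silently replaced.
    """
    text = raw.decode("johab", errors="strict")
    return text.encode("utf-8")


if __name__ == "__main__":
    # Build a JOHAB sample with the same codec, then convert it back.
    sample = "한글".encode("johab")
    print(johab_to_utf8(sample).decode("utf-8"))
```

The resulting UTF-8 could then be loaded with client_encoding=UTF8,
without the server needing to know about JOHAB at all.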

> Happy to iterate on the patch, the commit message, or the tests.
> Thanks to everyone for the careful analysis that preceded this; I
> recognise that the consensus was leaning toward removal, and I would
> appreciate a chance to have this fix considered as an alternative.

Cool.  For now I'll leave the removal on ice, and look into committing
your patch.  Thanks for working on it!
