Re: BUG #19354: JOHAB rejects valid byte sequences

From: Henson Choi <assam258@gmail.com>
To: Heikki Linnakangas <hlinnaka@iki.fi>
To: Thomas Munro <thomas.munro@gmail.com>
To: Robert Haas <robertmhaas@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>
Cc: Jeroen Vermeulen <jtvjtv@gmail.com>
Cc: VASUKI M <vasukianand0119@gmail.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #19354: JOHAB rejects valid byte sequences
Date: Wed, 15 Apr 2026 10:20:06 +0900
Message-ID: <CAAAe_zCLVunjt1u+2E86shwc3hk1x4bzUyU86nY1fq-nAVYN0Q@mail.gmail.com>
In-Reply-To: <6a8122ac-123d-4e93-9269-0b3be1e4a5a4@iki.fi>
References: <19354-eefe6d8b3e84f9f2@postgresql.org>
	<CA+TgmoaRGSezRaA7x00X495Qho8WGTzggbDSUt-JsruXceZWug@mail.gmail.com>
	<CA+zULE4L4rA2DLAcfy=eQL7w_ZexV4P5zpQRbP=_qrhJBEOzjg@mail.gmail.com>
	<2292889.1765846569@sss.pgh.pa.us>
	<CAE2r8H5vaSyaC_t1FcpHBo-BB_=SrFj7GFnOC-SxC6WDf5c9VA@mail.gmail.com>
	<CA+zULE47EXZOp7qKYODd+mjSgDiR-WX5ZNBkwdKnj-Zc0FT58w@mail.gmail.com>
	<CA+TgmoZaoc37ohnhF5inoPxWzfoznV483xQw8Fmw+ELFScv47g@mail.gmail.com>
	<2393116.1765899706@sss.pgh.pa.us>
	<CA+TgmoaoW4F2rRzYcQQim9ddT4-6H3oi0UYV9Ucw-rRQ5MdHsg@mail.gmail.com>
	<CA+hUKGKy-ViGBXdOjcPownBM=OdWiULO8H1RyH1r_8qNp=U4CA@mail.gmail.com>
	<6a8122ac-123d-4e93-9269-0b3be1e4a5a4@iki.fi>

Hi hackers,



> > So +1 from me, set the phasers to git rm.
>
> +1
>
> > Wait until 20, or just do it now?
> Let's just do it now.
>



Following up on my earlier note with an actual review of the primary
Korean national standard and a fix patch.  The fix turns out to be
small, and I believe it resolves the ambiguity that drove the removal
proposal.


Standard reference
------------------


The authoritative specification for JOHAB is Annex 3 of KS X 1001
(originally KS C 5601-1992 Annex 3, renumbered KS X 1001:1992 and
republished as KS X 1001:2004), published by the Korean Agency for
Technology and Standards (KATS) and available from the national
e-standards portal:



https://standard.go.kr/KSCI/api/std/viewMachine.do?reformNo=08&tmprKsNo=KSX1001&formType=STD


The decisive passages are quoted below in the original Korean with an
English translation, so non-Korean readers can verify the byte ranges
the fix implements.


Two terms from the standard recur throughout the quoted passages:

  * 완성형 부호계 (romanised "WANSUNG", literally "completion-form
    code set").  Each Hangul syllable is assigned a single code point
    drawn from a fixed table of pre-composed syllables.  The main
    body of KS X 1001 defines such a table of 2,350 syllables; per
    the standard's commentary, that subset was chosen by frequency
    analysis over samples from publishing, print media, industry,
    academia and dictionaries at the time of the 1987 revision,
    which is why some valid modern syllables (e.g. 뢔, 쌰, 쎼, 쓔,
    쬬) were deliberately excluded.  EUC-KR is the packed 8-bit form
    of that WANSUNG table, and Microsoft's CP949 / UHC is a later
    superset that fills in additional syllables.

  * 조합형 부호계 (romanised "JOHAB", literally "combinational code
    set").  Each Hangul syllable is constructed at encoding time
    from 5-bit codes for the initial consonant, medial vowel, and
    final consonant packed into two bytes, so all 11,172 modern
    syllables are directly representable without a lookup table.
    This is what Annex 3 defines and what PostgreSQL ships under
    the encoding name JOHAB.

In short: completion form is a frequency-curated lookup, combinational
form is an algorithmic composition that covers the full modern Hangul
space.  Unicode later adopted the combinational form's coverage as a
completion-form table: the Hangul Syllables block (U+AC00 - U+D7A3)
encodes exactly the same 11,172 modern syllables, as precomposed code
points.  So today the three Korean-related encodings PostgreSQL
supports sit along this spectrum: EUC_KR (curated completion form),
UHC (extended completion form), and JOHAB (algorithmic combinational
form).
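
To make the 11,172 figure concrete, here is a minimal sketch of the
arithmetic behind the Unicode Hangul Syllables block.  The function
name and the enum constants are illustrative only; the indices are
zero-based positions in Unicode's jamo order, not the JOHAB 5-bit
codes discussed later, and the 28th "final" slot stands for "no final
consonant".

```c
#include <assert.h>
#include <stdint.h>

/* Jamo counts used by the Hangul Syllables block (U+AC00 - U+D7A3). */
enum
{
    N_INITIAL = 19,
    N_MEDIAL = 21,
    N_FINAL = 28                /* 27 finals plus "no final" */
};

/*
 * Compose a precomposed syllable code point from zero-based jamo
 * indices in Unicode order.  Hypothetical helper for illustration.
 */
static uint32_t
hangul_syllable_codepoint(unsigned initial, unsigned medial, unsigned final)
{
    assert(initial < N_INITIAL && medial < N_MEDIAL && final < N_FINAL);
    return 0xAC00u + (initial * N_MEDIAL + medial) * N_FINAL + final;
}
```

The product 19 * 21 * 28 is 11,172, which is why the block ends at
U+D7A3 and why the combinational form and the Unicode block cover the
same set of modern syllables.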


부속서 3 보조 부호계 (2바이트 조합형 부호계)
[Annex 3.  Supplementary code set (two-byte combinational code)]


1. 적용 범위
[Scope]


  이 부속서에서는 기본 부호계인 2바이트 완성형 부호계의 보조 부호계로서,
  2바이트 조합형 부호계를 규정한다.
  [This annex specifies the two-byte combinational code set as the
  supplementary code set to the two-byte completion-form code set that
  constitutes the main body of the standard.]


2. 도형 문자
[Graphic characters]


  a) 한 글
     [Hangul]
     부속서 3 표 2에 규정된 첫소리 글자 19자, 가운뎃소리 글자 21자,
     끝소리 글자 27자로 조합 가능한, 모든 현대 한글 글자 마디(11 172자)
     및 현대 한글 낱자(67자)
     [All modern Hangul syllables (11,172) and modern Hangul jamo (67)
     that can be composed from the 19 initials, 21 medials, and 27
     finals defined in Annex 3 Table 2.]
  b) 한 자
     [Hanja]
     2바이트 완성형 부호계에서 규정한 한자(4 888자)
     [The 4,888 Hanja defined in the two-byte completion-form code
     set.]
  c) 그 밖의 문자
     [Other characters]
     2바이트 완성형 부호계에서 규정한 문자 중에서 현대 한글 글자 마디
     및 현대 한글 낱자, 한자를 제외한 도형 문자(937자)
     [The 937 graphic characters defined in the completion-form code
     set other than modern Hangul syllables, modern Hangul jamo, and
     Hanja.]


3. 도형 문자의 배치 영역
[Graphic-character placement]


  도형 문자의 배치 영역은 부속서 3 표 1과 같다.
  [The placement of the graphic characters is given in Annex 3
  Table 1.]


부속서 3 표 1  도형 문자의 배치 영역
[Annex 3 Table 1.  Placement of graphic characters]


  구 분              첫째 바이트    둘째 바이트
  [Category]         [Lead byte]    [Trail byte]
  ----------------   -----------    --------------------
  한글 글자마디      84H–D3H        41H–7EH, 81H–FEH
  [Hangul syllables]
  사용자 정의 영역   D8H            31H–7EH, 91H–FEH
  [User-defined area]
  기타 문자          D9H–DEH        31H–7EH, 91H–FEH
  [Other characters]
  한 자              E0H–F9H        31H–7EH, 91H–FEH
  [Hanja]


  비 고 16진수를 나타내기 위하여 맨 뒤에 H를 적는다
        (10 H는 10진법으로 16이다).
  [Note: a trailing H denotes a hexadecimal value
        (e.g. 10H equals 16 in decimal).]
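
Table 1 can also be read as data.  The following is only a sketch of
that reading, not the form the patch uses: the struct, the table name
and johab_pair_category() are hypothetical names introduced here, and
the byte values are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>

/* One row of Annex 3 Table 1: a lead range and two trail ranges. */
struct johab_row
{
    const char *category;
    unsigned char lead_lo, lead_hi;
    unsigned char trail_lo1, trail_hi1;
    unsigned char trail_lo2, trail_hi2;
};

static const struct johab_row johab_table1[] = {
    {"Hangul syllables", 0x84, 0xD3, 0x41, 0x7E, 0x81, 0xFE},
    {"User-defined area", 0xD8, 0xD8, 0x31, 0x7E, 0x91, 0xFE},
    {"Other characters", 0xD9, 0xDE, 0x31, 0x7E, 0x91, 0xFE},
    {"Hanja", 0xE0, 0xF9, 0x31, 0x7E, 0x91, 0xFE},
};

/* Return the category name for a valid (lead, trail) pair, else NULL. */
static const char *
johab_pair_category(unsigned char lead, unsigned char trail)
{
    size_t      nrows = sizeof(johab_table1) / sizeof(johab_table1[0]);

    for (size_t i = 0; i < nrows; i++)
    {
        const struct johab_row *r = &johab_table1[i];

        if (lead < r->lead_lo || lead > r->lead_hi)
            continue;
        if ((trail >= r->trail_lo1 && trail <= r->trail_hi1) ||
            (trail >= r->trail_lo2 && trail <= r->trail_hi2))
            return r->category;
        return NULL;            /* lead matched, trail is in a gap */
    }
    return NULL;                /* ASCII, or lead byte in a gap */
}

/* Spot checks using byte pairs that appear in this thread. */
static void
johab_table1_demo(void)
{
    assert(johab_pair_category(0x8A, 0x5C) != NULL);    /* bug report */
    assert(johab_pair_category(0x84, 0x44) != NULL);
    assert(johab_pair_category(0x8A, 0x40) == NULL);    /* trail gap */
    assert(johab_pair_category(0xD5, 0x41) == NULL);    /* lead gap */
}
```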


4. 한글 글자 마디의 부호값 구성 및 배열
[Encoding and layout of Hangul syllables]


  각 한글 글자 마디의 부호값은 2바이트 내에 첫소리 글자 5비트,
  가운뎃소리 글자 5비트, 끝소리 글자 5비트로 하여, 한글 낱자를 조합하여
  표현한 값으로 정의한다. 각 한글 낱자의 순서는 최상위 비트(MSB)를 1로
  하고 나서 첫소리, 가운뎃소리, 끝소리 글자가 순서대로 나오도록
  구성한다.
  [The code value of each Hangul syllable is defined as the composition
  of the Hangul letters within two bytes: 5 bits for the initial
  consonant, 5 bits for the medial vowel, and 5 bits for the final
  consonant, laid out with the most-significant bit set to 1 followed
  by the initial, medial, and final in that order.]
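
The bit layout described in clause 4 can be sketched as follows.  This
only splits and repacks the three 5-bit fields; the assignment of
5-bit codes to individual jamo is Annex 3 Table 2, which is not
reproduced in this message, so the sketch deliberately stops at the
raw field values.  The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct johab_fields
{
    unsigned    initial;        /* 5-bit initial-consonant code */
    unsigned    medial;         /* 5-bit medial-vowel code */
    unsigned    final;          /* 5-bit final-consonant code */
};

/* Split a Hangul code value laid out as 1 | initial | medial | final. */
static struct johab_fields
johab_split(unsigned char b1, unsigned char b2)
{
    uint16_t    code = (uint16_t) ((b1 << 8) | b2);
    struct johab_fields f;

    assert(code & 0x8000);      /* MSB is 1 for every Hangul syllable */
    f.initial = (code >> 10) & 0x1F;
    f.medial = (code >> 5) & 0x1F;
    f.final = code & 0x1F;
    return f;
}

/* Inverse of johab_split(): pack three 5-bit codes into 16 bits. */
static uint16_t
johab_pack(unsigned initial, unsigned medial, unsigned final)
{
    assert(initial < 32 && medial < 32 && final < 32);
    return (uint16_t) (0x8000u | (initial << 10) | (medial << 5) | final);
}
```

For the reported sequence \x8A\x5C the fields come out as initial 2,
medial 18, final 28.  The low five bits of the final-consonant code
live entirely in the second byte, which is how a trail byte can land
on an ASCII value such as 0x5C.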


Annex 3 continues with Table 2 (5-bit jamo codes), Table 3 (row-wise
mapping between completion-form and combinational-form for Hanja and
other characters), and usage notes.  Those are not needed for the
verifier fix, but they do confirm that the mapping tables we already
ship in johab_to_utf8.map line up with the standard; the same is true
of the data under unicode.org's JOHAB.TXT that Robert pointed to
earlier in the thread.


On "multiple variants": the KS national standard for JOHAB (Annex 3)
is singular and authoritative, and the mapping tables we ship match
it.  The Wikipedia note about EBCDIC-based and stateful JOHAB variants
refers to niche vendor encodings that PostgreSQL never implemented.

The historical "variant" churn in Korean encoding is in fact not about
JOHAB but about the completion-form main body of KS X 1001 and its
packed form EUC-KR: Microsoft's CP949 / UHC extended WANSUNG with
additional Hangul syllables, and different vendors disagreed at the
edges.  PostgreSQL already separates those concerns by carrying
EUC_KR and UHC as distinct encodings, so fixing JOHAB does not
re-open that family of ambiguities.


Diagnosis
---------


pg_johab_mblen() in src/common/wchar.c delegates to pg_euc_mblen(),
whose relevant branches treat 0x8F (EUC's SS3) as a 3-byte prefix and
any other high-bit byte as a 2-byte prefix.  pg_johab_verifychar()
then requires each trail byte to satisfy IS_EUC_RANGE_VALID(), defined
in the same file as ((c) >= 0xa1 && (c) <= 0xfe).  Neither rule
corresponds to the standard:


  * JOHAB has no three-byte sequences.  0x8F is simply a valid Hangul
    lead byte (it lies in the 0x84-0xD3 Hangul syllable range from
    Table 1) that begins a normal 2-byte sequence; EUC's SS3 handling
    spuriously inflates its length to 3.
  * Hangul trail bytes are 0x41-0x7E or 0x81-0xFE; the other three
    categories use 0x31-0x7E or 0x91-0xFE.  Restricting trail bytes to
    0xA1-0xFE rejects large portions of the standard, including the
    sequences in the bug report.  0x5C (ASCII backslash) is a valid
    Hangul trail byte, which is exactly what Jeroen's unit test
    surfaced.
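
For readers who want to see both failure modes without a PostgreSQL
build, here is a standalone approximation of the pre-patch logic.
old_verifychar() follows the lines the patch removes; old_mblen() is
my paraphrase of the pg_euc_mblen() branches described above, and the
macros are local copies, so treat the whole block as a sketch rather
than the actual source.

```c
#include <assert.h>

#define SS2 0x8e
#define SS3 0x8f
#define IS_HIGHBIT_SET(c)       ((unsigned char) (c) & 0x80)
#define IS_EUC_RANGE_VALID(c)   ((c) >= 0xa1 && (c) <= 0xfe)

/* EUC length rule: SS3 (0x8F) announces a 3-byte character. */
static int
old_mblen(const unsigned char *s)
{
    if (*s == SS2)
        return 2;
    if (*s == SS3)
        return 3;
    return IS_HIGHBIT_SET(*s) ? 2 : 1;
}

/* Pre-patch verifier shape: every trail byte must be in 0xA1-0xFE. */
static int
old_verifychar(const unsigned char *s, int len)
{
    int         l,
                mbl;

    l = mbl = old_mblen(s);
    if (len < l)
        return -1;
    if (!IS_HIGHBIT_SET(*s))
        return mbl;
    while (--l > 0)
    {
        unsigned char c = *++s;

        if (!IS_EUC_RANGE_VALID(c))
            return -1;
    }
    return mbl;
}

/* The two failure modes from the list above, plus one that worked. */
static void
old_rules_demo(void)
{
    const unsigned char reported[] = {0x8A, 0x5C};
    const unsigned char ss3_lead[] = {0x8F, 0x41};
    const unsigned char accepted[] = {0x89, 0xA1};

    assert(old_verifychar(reported, 2) == -1);  /* trail below 0xA1 */
    assert(old_mblen(ss3_lead) == 3);           /* length inflated */
    assert(old_verifychar(ss3_lead, 2) == -1);
    assert(old_verifychar(accepted, 2) == 2);
}
```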


The consequence is that a substantial portion of johab_to_utf8.map is
unreachable today: the verifier rejects the byte sequences before
conversion is attempted.  That matches Robert's observation that the
"right" mapping existed but was gated behind an incorrect rule.


Patch
-----


The attached 0001-Fix-JOHAB-encoding-validation.txt makes these
changes:


  src/common/wchar.c
    Rewrite pg_johab_mblen() to return 2 when the lead byte falls in
    any of the ranges listed in Annex 3 Table 1, and 1 otherwise
    (ASCII, plus lead bytes in the gaps, which the verifier then
    rejects).  Rewrite pg_johab_verifychar() to apply the correct
    trail-byte range depending on whether the lead byte is a Hangul
    lead byte (trail 0x41-0x7E or 0x81-0xFE) or a non-Hangul lead
    byte (trail 0x31-0x7E or 0x91-0xFE).  Two helper macros
    IS_JOHAB_LEAD_HANGUL() and IS_JOHAB_LEAD_OTHER() express the
    lead-byte classification once and are shared between mblen and
    verifychar.  pg_johab_dsplen() no longer delegates to
    pg_euc_dsplen(); it returns 2 for any high-bit lead byte and
    falls back to pg_ascii_dsplen() otherwise.  A comment block above
    the implementation reproduces Table 1 for future maintainers.
    Also correct pg_wchar_table[PG_JOHAB].maxmblen from 3 to 2 so
    that callers sizing buffers from maxmblen do not over-allocate
    and so that the value matches the spec.
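
One practical consequence of trail bytes overlapping ASCII is that a
plain byte scan for 0x5C is not safe on JOHAB text; the buffer has to
be walked character by character with an mblen-style function.  A
minimal sketch, assuming a local copy of the lead-byte classification
(sketch_johab_mblen() and both counting functions are hypothetical
names, not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

#define LEAD_HANGUL(c)  ((c) >= 0x84 && (c) <= 0xD3)
#define LEAD_OTHER(c) \
    (((c) >= 0xD8 && (c) <= 0xDE) || ((c) >= 0xE0 && (c) <= 0xF9))

/* Same classification as the rewritten mblen: 2 for a lead byte, else 1. */
static int
sketch_johab_mblen(const unsigned char *s)
{
    return (LEAD_HANGUL(*s) || LEAD_OTHER(*s)) ? 2 : 1;
}

/* Count backslash characters, skipping 0x5C bytes that are trail bytes. */
static size_t
count_backslash_chars(const unsigned char *buf, size_t len)
{
    size_t      n = 0;
    size_t      i = 0;

    assert(buf != NULL || len == 0);
    while (i < len)
    {
        int         l = sketch_johab_mblen(buf + i);

        if (l == 1 && buf[i] == 0x5C)
            n++;
        i += (size_t) l;        /* a truncated pair simply ends the loop */
    }
    return n;
}

/* Naive byte scan, for contrast: counts every 0x5C byte. */
static size_t
count_backslash_bytes(const unsigned char *buf, size_t len)
{
    size_t      n = 0;

    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0x5C)
            n++;
    return n;
}
```

On the four bytes 8A 5C 5C 61 the naive scan reports two backslashes
while the character-wise walk reports one, because the first 0x5C is
the trail byte of \x8A\x5C.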


  doc/src/sgml/charset.sgml
    Update the JOHAB row in the character-set table to show the
    maximum character length as 1-2 instead of 1-3, matching the
    standard and the corrected maxmblen.


  src/test/regress/sql/johab.sql
  src/test/regress/expected/johab.out
  src/test/regress/expected/johab_1.out
  src/test/regress/parallel_schedule
    A new regression test, modelled on euc_kr.sql, that runs in UTF8
    databases and skips otherwise.  It covers:


      - the original bug sequences \x8A\x5B, \x8A\x5C, \x8A\x5D
        decoding to 굍, 굎, 굏;
      - the first multibyte character from JOHAB.TXT (\x84\x44 -> ㄳ),
        previously rejected;
      - byte sequences that already decoded under the old rules
        (\x89\xEF -> 괦, \x89\xA1 -> 고) to guard against regression;
      - Hanja trail bytes that used to be rejected (\xE0\x31,
        \xE0\x7E, \xE0\x91);
      - one representative of the "other characters" category
        (\xD9\x31);
      - each lead-byte gap (0x80, 0xD5, 0xDF, 0xFA) producing an
        "invalid byte sequence" error;
      - trail bytes in the gaps for Hangul (0x40, 0x7F, 0x80) and for
        the non-Hangul categories (0x30, 0x7F, 0x90, 0xFF);
      - an incomplete trailing byte for a valid lead byte.


Compatibility
-------------


The mapping tables themselves are unchanged.  Byte sequences that
decode successfully today continue to decode to the same characters;
the change is strictly additive in that previously-rejected sequences
now succeed.  Because JOHAB is a client-only encoding there is no
on-disk representation to reconcile, so back-branch behaviour would
move from a strict subset of valid JOHAB to full valid JOHAB, without
reinterpreting any byte sequence that was previously accepted.  I
believe that is safe to back-patch, but confining the change to v19
is also entirely reasonable if the project prefers to limit the
exposure.


Why keep it rather than remove it
---------------------------------


I understand the appeal of simply deleting a dead-looking encoding,
and Thomas' removal patch is clean work.  However, Korean archival
data from the 1990s (government records, academic repositories, early
online corpora) does exist as JOHAB bytes; as a client encoding, JOHAB
in PostgreSQL provides a straightforward ingest path
(client_encoding=JOHAB, convert_from, then store as UTF-8).  Once
removed, that path closes with no obvious alternative short of
preprocessing outside PostgreSQL.  Fixing the verifier preserves the
capability at the cost of a ~30-line correction plus tests.


Happy to iterate on the patch, the commit message, or the tests.
Thanks to everyone for the careful analysis that preceded this; I
recognise that the consensus was leaning toward removal, and I would
appreciate a chance to have this fix considered as an alternative.


Regards,
Henson

From 94fc0d0c2f2e7428f111fb952dda635b99c84da3 Mon Sep 17 00:00:00 2001
From: Henson Choi <assam258@gmail.com>
Date: Wed, 15 Apr 2026 08:46:56 +0900
Subject: [PATCH] Fix JOHAB encoding validation to match KS X 1001 Annex 3.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Since the encoding was added in 2002, pg_johab_mblen() and
pg_johab_verifychar() have borrowed their byte-length and trail-byte
rules from EUC-KR via pg_euc_mblen() and IS_EUC_RANGE_VALID(), which
demand trail bytes in 0xA1-0xFE.  JOHAB does not follow that rule: per
KS X 1001:2004 Annex 3 Table 1, trail bytes may fall anywhere in
0x41-0x7E or 0x81-0xFE for Hangul syllables (0x31-0x7E or 0x91-0xFE
for the other three categories), including the ASCII graphic range
and in particular 0x5C, the backslash.  As a result, most of the
mappings shipped in johab_to_utf8.map were unreachable: the verifier
rejected the byte sequences before they could be converted.  The
first multi-byte character in the source JOHAB.TXT (\x84\x44) and the
originally reported sequence \x8A\x5C = "굎" were both affected.

Rewrite pg_johab_mblen() and pg_johab_verifychar() to classify the
leading byte into the four categories defined by Annex 3 Table 1 and
accept only the trail-byte ranges specified for each category.  The
encoding is strictly two bytes wide for any non-ASCII character, so
also correct pg_wchar_table[PG_JOHAB].maxmblen from 3 to 2 and the
corresponding column in charset.sgml.  A new regression test covers
the original bug sequences, boundary cases for each lead and trail
range, and the invalid-byte gaps.

The mapping tables themselves were already correct and are unchanged,
so this fix is backward-compatible: sequences that decoded before
continue to decode identically, and the sequences that were
erroneously rejected now succeed.

Bug: #19354
Reported-by: Jeroen Vermeulen <jtvjtv@gmail.com>
Discussion: https://postgr.es/m/19354-eefe6d8b3e84f9f2@postgresql.org
---
 doc/src/sgml/charset.sgml             |  2 +-
 src/common/wchar.c                    | 69 ++++++++++++++++-----
 src/test/regress/expected/johab.out   | 87 +++++++++++++++++++++++++++
 src/test/regress/expected/johab_1.out |  9 +++
 src/test/regress/parallel_schedule    |  2 +-
 src/test/regress/sql/johab.sql        | 58 ++++++++++++++++++
 6 files changed, 209 insertions(+), 18 deletions(-)
 create mode 100644 src/test/regress/expected/johab.out
 create mode 100644 src/test/regress/expected/johab_1.out
 create mode 100644 src/test/regress/sql/johab.sql

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 746e40bb9d2..8ff7f7ed03d 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1934,7 +1934,7 @@ ORDER BY c COLLATE ebcdic;
          <entry>Korean (Hangul)</entry>
          <entry>No</entry>
          <entry>No</entry>
-         <entry>1&ndash;3</entry>
+         <entry>1&ndash;2</entry>
          <entry></entry>
         </row>
         <row>
diff --git a/src/common/wchar.c b/src/common/wchar.c
index a44ee73accf..f493e4d9a99 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -438,18 +438,45 @@ pg_wchar2euc_with_len(const pg_wchar *from, unsigned char *to, int len)
 
 
 /*
- * JOHAB
+ * JOHAB (KS X 1001:2004 Annex 3, a.k.a. the 2-byte combinational code)
+ *
+ * Byte ranges per Annex 3 Table 1:
+ *
+ *   Category              Lead byte    Trail byte
+ *   --------------------  -----------  ---------------------
+ *   Hangul syllables      0x84 - 0xD3  0x41 - 0x7E, 0x81 - 0xFE
+ *   User-defined area A   0xD8         0x31 - 0x7E, 0x91 - 0xFE
+ *   Other characters      0xD9 - 0xDE  0x31 - 0x7E, 0x91 - 0xFE
+ *   Hanja                 0xE0 - 0xF9  0x31 - 0x7E, 0x91 - 0xFE
+ *
+ * ASCII (< 0x80) is single-byte.  Lead bytes in the gaps between the ranges
+ * above (0x80-0x83, 0xD4-0xD7, 0xDF, 0xFA-0xFF) are invalid.  Likewise,
+ * trail bytes that fall outside their allowed union are invalid: for Hangul
+ * this excludes 0x00-0x40, 0x7F-0x80, and 0xFF; for the other categories
+ * this excludes 0x00-0x30, 0x7F-0x90, and 0xFF.
+ *
+ * Note that unlike EUC-KR, trail bytes may fall within the ASCII graphic
+ * range (including 0x5C backslash), so callers dealing with JOHAB text
+ * must not assume ASCII bytes are self-synchronizing.
  */
+#define IS_JOHAB_LEAD_HANGUL(c)	((c) >= 0x84 && (c) <= 0xD3)
+#define IS_JOHAB_LEAD_OTHER(c)	\
+	(((c) >= 0xD8 && (c) <= 0xDE) || ((c) >= 0xE0 && (c) <= 0xF9))
+
 static int
 pg_johab_mblen(const unsigned char *s)
 {
-	return pg_euc_mblen(s);
+	if (IS_JOHAB_LEAD_HANGUL(*s) || IS_JOHAB_LEAD_OTHER(*s))
+		return 2;
+	return 1;
 }
 
 static int
 pg_johab_dsplen(const unsigned char *s)
 {
-	return pg_euc_dsplen(s);
+	if (IS_HIGHBIT_SET(*s))
+		return 2;
+	return pg_ascii_dsplen(s);
 }
 
 /*
@@ -1156,25 +1183,35 @@ pg_euctw_verifystr(const unsigned char *s, int len)
 static int
 pg_johab_verifychar(const unsigned char *s, int len)
 {
-	int			l,
-				mbl;
-	unsigned char c;
+	unsigned char b1,
+				b2;
 
-	l = mbl = pg_johab_mblen(s);
+	if (!IS_HIGHBIT_SET(*s))
+		return 1;
 
-	if (len < l)
+	if (len < 2)
 		return -1;
 
-	if (!IS_HIGHBIT_SET(*s))
-		return mbl;
+	b1 = s[0];
+	b2 = s[1];
 
-	while (--l > 0)
+	/*
+	 * Per KS X 1001:2004 Annex 3 Table 1, trailing byte ranges depend on the
+	 * leading byte's category.
+	 */
+	if (IS_JOHAB_LEAD_HANGUL(b1))
 	{
-		c = *++s;
-		if (!IS_EUC_RANGE_VALID(c))
-			return -1;
+		/* Hangul syllables: 0x41-0x7E or 0x81-0xFE */
+		if ((b2 >= 0x41 && b2 <= 0x7E) || (b2 >= 0x81 && b2 <= 0xFE))
+			return 2;
 	}
-	return mbl;
+	else if (IS_JOHAB_LEAD_OTHER(b1))
+	{
+		/* User-defined, other characters, Hanja: 0x31-0x7E or 0x91-0xFE */
+		if ((b2 >= 0x31 && b2 <= 0x7E) || (b2 >= 0x91 && b2 <= 0xFE))
+			return 2;
+	}
+	return -1;
 }
 
 static int
@@ -1901,7 +1938,7 @@ const pg_wchar_tbl pg_wchar_table[] = {
 	[PG_GBK] = {0, 0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifychar, pg_gbk_verifystr, 2},
 	[PG_UHC] = {0, 0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifychar, pg_uhc_verifystr, 2},
 	[PG_GB18030] = {0, 0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifychar, pg_gb18030_verifystr, 4},
-	[PG_JOHAB] = {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifychar, pg_johab_verifystr, 3},
+	[PG_JOHAB] = {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifychar, pg_johab_verifystr, 2},
 	[PG_SHIFT_JIS_2004] = {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifychar, pg_sjis_verifystr, 2},
 };
 
diff --git a/src/test/regress/expected/johab.out b/src/test/regress/expected/johab.out
new file mode 100644
index 00000000000..d2eafdf73e4
--- /dev/null
+++ b/src/test/regress/expected/johab.out
@@ -0,0 +1,87 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- Bug #19354 original report plus its neighbors: these three byte sequences
+-- are valid Hangul syllables per Annex 3 Table 1 (lead 0x8A is in 0x84-0xD3,
+-- trail 0x5B/0x5C/0x5D is in 0x41-0x7E) but were rejected by the prior
+-- EUC-KR-derived check that demanded trail bytes in 0xA1-0xFE.
+SELECT convert_from('\x8a5b'::bytea, 'johab') AS "0x8a5b",
+       convert_from('\x8a5c'::bytea, 'johab') AS "0x8a5c",
+       convert_from('\x8a5d'::bytea, 'johab') AS "0x8a5d";
+ 0x8a5b | 0x8a5c | 0x8a5d 
+--------+--------+--------
+ 굍     | 굎     | 굏
+(1 row)
+
+-- First multi-byte character in unicode.org's JOHAB.TXT, also rejected by
+-- the prior check (trail 0x44 in Hangul range 0x41-0x7E).
+SELECT convert_from('\x8444'::bytea, 'johab') AS "0x8444";
+ 0x8444 
+--------
+ ㄳ
+(1 row)
+
+-- Regression check for byte sequences that already decoded correctly under
+-- the old rules (trail byte already within the old-allowed 0xA1-0xFE).
+SELECT convert_from('\x89ef'::bytea, 'johab') AS "0x89ef",
+       convert_from('\x89a1'::bytea, 'johab') AS "0x89a1";
+ 0x89ef | 0x89a1 
+--------+--------
+ 괦     | 고
+(1 row)
+
+-- Hanja range (lead 0xE0-0xF9) with trail bytes in the old-rejected region
+-- 0x31-0xA0.  Per Annex 3 Table 1 the Hanja trail range is 0x31-0x7E and
+-- 0x91-0xFE.
+SELECT convert_from('\xe031'::bytea, 'johab') AS "0xe031",
+       convert_from('\xe07e'::bytea, 'johab') AS "0xe07e",
+       convert_from('\xe091'::bytea, 'johab') AS "0xe091";
+ 0xe031 | 0xe07e | 0xe091 
+--------+--------+--------
+ 伽     | 嵌     | 感
+(1 row)
+
+-- "Other characters" category (lead 0xD9-0xDE) with a low trail byte.
+SELECT convert_from('\xd931'::bytea, 'johab') AS "0xd931";
+ 0xd931 
+--------
+ 　
+(1 row)
+
+-- Invalid lead bytes: the gaps between the four lead-byte ranges defined by
+-- Annex 3 Table 1.
+SELECT convert_from('\x8041'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x80
+SELECT convert_from('\xd541'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xd5
+SELECT convert_from('\xdf41'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xdf
+SELECT convert_from('\xfa41'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xfa
+-- Invalid trail bytes: values inside the gaps within each trail-byte range.
+-- For Hangul the gaps are 0x00-0x40, 0x7F-0x80, and 0xFF.
+SELECT convert_from('\x8a40'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x40
+SELECT convert_from('\x8a7f'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x7f
+SELECT convert_from('\x8a80'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x80
+-- For the other categories the gaps are 0x00-0x30, 0x7F-0x90, and 0xFF.
+SELECT convert_from('\xe030'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x30
+SELECT convert_from('\xe07f'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x7f
+SELECT convert_from('\xe090'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x90
+SELECT convert_from('\xe0ff'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0xff
+-- Incomplete sequence: a valid lead byte with no trail byte is rejected.
+SELECT convert_from('\x8a'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a
diff --git a/src/test/regress/expected/johab_1.out b/src/test/regress/expected/johab_1.out
new file mode 100644
index 00000000000..89028ad81e0
--- /dev/null
+++ b/src/test/regress/expected/johab_1.out
@@ -0,0 +1,9 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index cc365393bb7..63f7419d255 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -28,7 +28,7 @@ test: strings md5 numerology point lseg line box path polygon circle date time t
 # geometry depends on point, lseg, line, box, path, polygon, circle
 # horology depends on date, time, timetz, timestamp, timestamptz, interval
 # ----------
-test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr
+test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr johab
 
 # ----------
 # Load huge amounts of data
diff --git a/src/test/regress/sql/johab.sql b/src/test/regress/sql/johab.sql
new file mode 100644
index 00000000000..7a919f430a7
--- /dev/null
+++ b/src/test/regress/sql/johab.sql
@@ -0,0 +1,58 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- Bug #19354 original report plus its neighbors: these three byte sequences
+-- are valid Hangul syllables per Annex 3 Table 1 (lead 0x8A is in 0x84-0xD3,
+-- trail 0x5B/0x5C/0x5D is in 0x41-0x7E) but were rejected by the prior
+-- EUC-KR-derived check that demanded trail bytes in 0xA1-0xFE.
+SELECT convert_from('\x8a5b'::bytea, 'johab') AS "0x8a5b",
+       convert_from('\x8a5c'::bytea, 'johab') AS "0x8a5c",
+       convert_from('\x8a5d'::bytea, 'johab') AS "0x8a5d";
+
+-- First multi-byte character in unicode.org's JOHAB.TXT, also rejected by
+-- the prior check (trail 0x44 in Hangul range 0x41-0x7E).
+SELECT convert_from('\x8444'::bytea, 'johab') AS "0x8444";
+
+-- Regression check for byte sequences that already decoded correctly under
+-- the old rules (trail byte already within the old-allowed 0xA1-0xFE).
+SELECT convert_from('\x89ef'::bytea, 'johab') AS "0x89ef",
+       convert_from('\x89a1'::bytea, 'johab') AS "0x89a1";
+
+-- Hanja range (lead 0xE0-0xF9) with trail bytes in the old-rejected region
+-- 0x31-0xA0.  Per Annex 3 Table 1 the Hanja trail range is 0x31-0x7E and
+-- 0x91-0xFE.
+SELECT convert_from('\xe031'::bytea, 'johab') AS "0xe031",
+       convert_from('\xe07e'::bytea, 'johab') AS "0xe07e",
+       convert_from('\xe091'::bytea, 'johab') AS "0xe091";
+
+-- "Other characters" category (lead 0xD9-0xDE) with a low trail byte.
+SELECT convert_from('\xd931'::bytea, 'johab') AS "0xd931";
+
+-- Invalid lead bytes: the gaps between the four lead-byte ranges defined by
+-- Annex 3 Table 1.
+SELECT convert_from('\x8041'::bytea, 'johab');
+SELECT convert_from('\xd541'::bytea, 'johab');
+SELECT convert_from('\xdf41'::bytea, 'johab');
+SELECT convert_from('\xfa41'::bytea, 'johab');
+
+-- Invalid trail bytes: values inside the gaps within each trail-byte range.
+-- For Hangul the gaps are 0x00-0x40, 0x7F-0x80, and 0xFF.
+SELECT convert_from('\x8a40'::bytea, 'johab');
+SELECT convert_from('\x8a7f'::bytea, 'johab');
+SELECT convert_from('\x8a80'::bytea, 'johab');
+-- For the other categories the gaps are 0x00-0x30, 0x7F-0x90, and 0xFF.
+SELECT convert_from('\xe030'::bytea, 'johab');
+SELECT convert_from('\xe07f'::bytea, 'johab');
+SELECT convert_from('\xe090'::bytea, 'johab');
+SELECT convert_from('\xe0ff'::bytea, 'johab');
+
+-- Incomplete sequence: a valid lead byte with no trail byte is rejected.
+SELECT convert_from('\x8a'::bytea, 'johab');
-- 
2.50.1 (Apple Git-155)



Attachments:

  [text/plain] 0001-Fix-JOHAB-encoding-validation.txt (15.0K, 3-0001-Fix-JOHAB-encoding-validation.txt)
 	[PG_GBK] = {0, 0, pg_gbk_mblen, pg_gbk_dsplen, pg_gbk_verifychar, pg_gbk_verifystr, 2},
 	[PG_UHC] = {0, 0, pg_uhc_mblen, pg_uhc_dsplen, pg_uhc_verifychar, pg_uhc_verifystr, 2},
 	[PG_GB18030] = {0, 0, pg_gb18030_mblen, pg_gb18030_dsplen, pg_gb18030_verifychar, pg_gb18030_verifystr, 4},
-	[PG_JOHAB] = {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifychar, pg_johab_verifystr, 3},
+	[PG_JOHAB] = {0, 0, pg_johab_mblen, pg_johab_dsplen, pg_johab_verifychar, pg_johab_verifystr, 2},
 	[PG_SHIFT_JIS_2004] = {0, 0, pg_sjis_mblen, pg_sjis_dsplen, pg_sjis_verifychar, pg_sjis_verifystr, 2},
 };
 
diff --git a/src/test/regress/expected/johab.out b/src/test/regress/expected/johab.out
new file mode 100644
index 00000000000..d2eafdf73e4
--- /dev/null
+++ b/src/test/regress/expected/johab.out
@@ -0,0 +1,87 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- Bug #19354 original report plus its neighbors: these three byte sequences
+-- are valid Hangul syllables per Annex 3 Table 1 (lead 0x8A is in 0x84-0xD3,
+-- trail 0x5B/0x5C/0x5D is in 0x41-0x7E) but were rejected by the prior
+-- EUC-KR-derived check that demanded trail bytes in 0xA1-0xFE.
+SELECT convert_from('\x8a5b'::bytea, 'johab') AS "0x8a5b",
+       convert_from('\x8a5c'::bytea, 'johab') AS "0x8a5c",
+       convert_from('\x8a5d'::bytea, 'johab') AS "0x8a5d";
+ 0x8a5b | 0x8a5c | 0x8a5d 
+--------+--------+--------
+ 굍     | 굎     | 굏
+(1 row)
+
+-- First multi-byte character in unicode.org's JOHAB.TXT, also rejected by
+-- the prior check (trail 0x44 in Hangul range 0x41-0x7E).
+SELECT convert_from('\x8444'::bytea, 'johab') AS "0x8444";
+ 0x8444 
+--------
+ ㄳ
+(1 row)
+
+-- Regression check for byte sequences that already decoded correctly under
+-- the old rules (trail byte already within the old-allowed 0xA1-0xFE).
+SELECT convert_from('\x89ef'::bytea, 'johab') AS "0x89ef",
+       convert_from('\x89a1'::bytea, 'johab') AS "0x89a1";
+ 0x89ef | 0x89a1 
+--------+--------
+ 괦     | 고
+(1 row)
+
+-- Hanja range (lead 0xE0-0xF9) with trail bytes in the old-rejected region
+-- 0x31-0xA0.  Per Annex 3 Table 1 the Hanja trail range is 0x31-0x7E and
+-- 0x91-0xFE.
+SELECT convert_from('\xe031'::bytea, 'johab') AS "0xe031",
+       convert_from('\xe07e'::bytea, 'johab') AS "0xe07e",
+       convert_from('\xe091'::bytea, 'johab') AS "0xe091";
+ 0xe031 | 0xe07e | 0xe091 
+--------+--------+--------
+ 伽     | 嵌     | 感
+(1 row)
+
+-- "Other characters" category (lead 0xD9-0xDE) with a low trail byte.
+SELECT convert_from('\xd931'::bytea, 'johab') AS "0xd931";
+ 0xd931 
+--------
+ 　
+(1 row)
+
+-- Invalid lead bytes: the gaps between the four lead-byte ranges defined by
+-- Annex 3 Table 1.
+SELECT convert_from('\x8041'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x80
+SELECT convert_from('\xd541'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xd5
+SELECT convert_from('\xdf41'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xdf
+SELECT convert_from('\xfa41'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xfa
+-- Invalid trail bytes: values inside the gaps within each trail-byte range.
+-- For Hangul the gaps are 0x00-0x40, 0x7F-0x80, and 0xFF.
+SELECT convert_from('\x8a40'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x40
+SELECT convert_from('\x8a7f'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x7f
+SELECT convert_from('\x8a80'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a 0x80
+-- For the other categories the gaps are 0x00-0x30, 0x7F-0x90, and 0xFF.
+SELECT convert_from('\xe030'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x30
+SELECT convert_from('\xe07f'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x7f
+SELECT convert_from('\xe090'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0x90
+SELECT convert_from('\xe0ff'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0xe0 0xff
+-- Incomplete sequence: a valid lead byte with no trail byte is rejected.
+SELECT convert_from('\x8a'::bytea, 'johab');
+ERROR:  invalid byte sequence for encoding "JOHAB": 0x8a
diff --git a/src/test/regress/expected/johab_1.out b/src/test/regress/expected/johab_1.out
new file mode 100644
index 00000000000..89028ad81e0
--- /dev/null
+++ b/src/test/regress/expected/johab_1.out
@@ -0,0 +1,9 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index cc365393bb7..63f7419d255 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -28,7 +28,7 @@ test: strings md5 numerology point lseg line box path polygon circle date time t
 # geometry depends on point, lseg, line, box, path, polygon, circle
 # horology depends on date, time, timetz, timestamp, timestamptz, interval
 # ----------
-test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr
+test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr johab
 
 # ----------
 # Load huge amounts of data
diff --git a/src/test/regress/sql/johab.sql b/src/test/regress/sql/johab.sql
new file mode 100644
index 00000000000..7a919f430a7
--- /dev/null
+++ b/src/test/regress/sql/johab.sql
@@ -0,0 +1,58 @@
+-- This test exercises the JOHAB client encoding (KS X 1001:2004 Annex 3).
+-- JOHAB's valid byte ranges differ from EUC-KR: trail bytes may fall within
+-- the ASCII graphic range (0x41-0x7E for Hangul, 0x31-0x7E for the other
+-- categories), including 0x5C which is the ASCII backslash.  The test runs
+-- only in UTF8 databases, since some decoded characters have no equivalent
+-- in other server encodings.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- Bug #19354 original report plus its neighbors: these three byte sequences
+-- are valid Hangul syllables per Annex 3 Table 1 (lead 0x8A is in 0x84-0xD3,
+-- trail 0x5B/0x5C/0x5D is in 0x41-0x7E) but were rejected by the prior
+-- EUC-KR-derived check that demanded trail bytes in 0xA1-0xFE.
+SELECT convert_from('\x8a5b'::bytea, 'johab') AS "0x8a5b",
+       convert_from('\x8a5c'::bytea, 'johab') AS "0x8a5c",
+       convert_from('\x8a5d'::bytea, 'johab') AS "0x8a5d";
+
+-- First multi-byte character in unicode.org's JOHAB.TXT, also rejected by
+-- the prior check (trail 0x44 in Hangul range 0x41-0x7E).
+SELECT convert_from('\x8444'::bytea, 'johab') AS "0x8444";
+
+-- Regression check for byte sequences that already decoded correctly under
+-- the old rules (trail byte already within the old-allowed 0xA1-0xFE).
+SELECT convert_from('\x89ef'::bytea, 'johab') AS "0x89ef",
+       convert_from('\x89a1'::bytea, 'johab') AS "0x89a1";
+
+-- Hanja range (lead 0xE0-0xF9) with trail bytes in the old-rejected region
+-- 0x31-0xA0.  Per Annex 3 Table 1 the Hanja trail range is 0x31-0x7E and
+-- 0x91-0xFE.
+SELECT convert_from('\xe031'::bytea, 'johab') AS "0xe031",
+       convert_from('\xe07e'::bytea, 'johab') AS "0xe07e",
+       convert_from('\xe091'::bytea, 'johab') AS "0xe091";
+
+-- "Other characters" category (lead 0xD9-0xDE) with a low trail byte.
+SELECT convert_from('\xd931'::bytea, 'johab') AS "0xd931";
+
+-- Invalid lead bytes: the gaps between the four lead-byte ranges defined by
+-- Annex 3 Table 1.
+SELECT convert_from('\x8041'::bytea, 'johab');
+SELECT convert_from('\xd541'::bytea, 'johab');
+SELECT convert_from('\xdf41'::bytea, 'johab');
+SELECT convert_from('\xfa41'::bytea, 'johab');
+
+-- Invalid trail bytes: values inside the gaps within each trail-byte range.
+-- For Hangul the gaps are 0x00-0x40, 0x7F-0x80, and 0xFF.
+SELECT convert_from('\x8a40'::bytea, 'johab');
+SELECT convert_from('\x8a7f'::bytea, 'johab');
+SELECT convert_from('\x8a80'::bytea, 'johab');
+-- For the other categories the gaps are 0x00-0x30, 0x7F-0x90, and 0xFF.
+SELECT convert_from('\xe030'::bytea, 'johab');
+SELECT convert_from('\xe07f'::bytea, 'johab');
+SELECT convert_from('\xe090'::bytea, 'johab');
+SELECT convert_from('\xe0ff'::bytea, 'johab');
+
+-- Incomplete sequence: a valid lead byte with no trail byte is rejected.
+SELECT convert_from('\x8a'::bytea, 'johab');
-- 
2.50.1 (Apple Git-155)
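
For anyone who wants to poke at the byte-range rules outside the tree, here is
a minimal standalone sketch of the same check.  It is not part of the patch:
the names johab_lead_hangul, johab_lead_other and johab_verify_char are
hypothetical, and the explicit len == 0 guard is my own assumption (the patched
pg_johab_verifychar relies on its caller for that).  The ranges are copied from
the Annex 3 Table 1 summary in the wchar.c comment above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Lead byte of a Hangul syllable: 0x84-0xD3 (hypothetical helper name). */
static bool
johab_lead_hangul(unsigned char c)
{
	return c >= 0x84 && c <= 0xD3;
}

/* Lead byte of user-defined area A, other characters, or Hanja. */
static bool
johab_lead_other(unsigned char c)
{
	return (c >= 0xD8 && c <= 0xDE) || (c >= 0xE0 && c <= 0xF9);
}

/*
 * Returns 1 for a single ASCII byte, 2 for a valid two-byte JOHAB
 * sequence, and -1 for anything else (bad lead byte, bad trail byte,
 * or truncated input).  The len == 0 case is an assumption of this
 * sketch, not something the patch itself handles.
 */
static int
johab_verify_char(const unsigned char *s, size_t len)
{
	unsigned char b1;
	unsigned char b2;

	if (len == 0)
		return -1;

	b1 = s[0];
	if (b1 < 0x80)
		return 1;			/* ASCII is single-byte */

	if (len < 2)
		return -1;			/* lead byte without a trail byte */

	b2 = s[1];

	/* The allowed trail bytes depend on the lead byte's category. */
	if (johab_lead_hangul(b1))
		return ((b2 >= 0x41 && b2 <= 0x7E) ||
				(b2 >= 0x81 && b2 <= 0xFE)) ? 2 : -1;

	if (johab_lead_other(b1))
		return ((b2 >= 0x31 && b2 <= 0x7E) ||
				(b2 >= 0x91 && b2 <= 0xFE)) ? 2 : -1;

	return -1;				/* lead byte falls in one of the gaps */
}
```

Fed the byte pairs from johab.sql, this should accept 0x8A5C and 0xE031 and
reject 0x8A40, 0xE090 and a lone 0x8A, matching the expected output in the
patch.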
