public inbox for pgsql-novice@postgresql.org  
help / color / mirror / Atom feed
From: Chris Papademetrious <Christopher.Papademetrious@synopsys.com>
To: Greg Sabino Mullane <htamfids@gmail.com>
Cc: pgsql-novice@lists.postgresql.org <pgsql-novice@lists.postgresql.org>
Subject: RE: is there a way to automate deduplication of strings?
Date: Thu, 1 Jan 2026 00:25:20 +0000
Message-ID: <DM4PR12MB603913E485D34B4ED790FDAFDDBAA@DM4PR12MB6039.namprd12.prod.outlook.com> (raw)
In-Reply-To: <CAKAnmmL1UZcP_SUotSRpDD2a6+k4qCRWvFGawDb6-JkXY0Rmbw@mail.gmail.com>
References: <DM4PR12MB603953767048EE1B8A39283ADDB1A@DM4PR12MB6039.namprd12.prod.outlook.com>
	<CAKAnmmL1UZcP_SUotSRpDD2a6+k4qCRWvFGawDb6-JkXY0Rmbw@mail.gmail.com>

Hi Greg,

Thanks for the reply! I tried to be vague to avoid getting distracted by the details, but I think I overdid it!

Let’s say I have a table of transactions like this:

CREATE TABLE transaction (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent TEXT NOT NULL,
    --
    -- ...more columns here...
    --
);

The table can contain millions of transactions. The user_agent field stores information about the application that performed each transaction. These user-agent values will be populated from a relatively small set of unique values. For example,

MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID 00000000-0000-0000-0000-000000000000)
MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop  (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID 11111111-1111-1111-1111-111111111111)
MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID 11111111-1111-1111-1111-111111111111)

The values themselves will vary over time (as new versions appear and old versions age out) so the set cannot be hardcoded, but the column will always contain large numbers of duplicate values.

I could store the user-agent values in a separate table and reference them by a UUID computed from their value:

CREATE TABLE user_agent (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent TEXT NOT NULL UNIQUE,
);

CREATE TABLE transaction (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_agent_id UUID NOT NULL,
    CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id) REFERENCES user_agent(id) ON DELETE CASCADE
);

but this adds transactional complexity for storage and retrieval, along with cleanup of no-longer-referenced values over time.

I’m wishing for a magic “sparsely stored texts” column in Postgres that performs this deduplication automatically, but I don’t think it exists. So I’m wondering, is there an extension or some other trick to get the space savings without the transactional complexity?


  *   Chris


From: Greg Sabino Mullane <htamfids@gmail.com>
Sent: Wednesday, December 31, 2025 10:12 AM
To: Chris Papademetrious <chrispy@synopsys.com>
Cc: pgsql-novice@lists.postgresql.org
Subject: Re: is there a way to automate deduplication of strings?

It is not quite clear what you are trying to do. Can you provide a small test table to show what you want to achieve?

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com<https://urldefense.com/v3/__https:/www.crunchydata.com__;!!A4F2R9G_pg...;
Enterprise Postgres Software Products & Tech Support



reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: pgsql-novice@postgresql.org
  Cc: Christopher.Papademetrious@synopsys.com, htamfids@gmail.com, pgsql-novice@lists.postgresql.org
  Subject: RE: is there a way to automate deduplication of strings?
  In-Reply-To: <DM4PR12MB603913E485D34B4ED790FDAFDDBAA@DM4PR12MB6039.namprd12.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox