RE: is there a way to automate deduplication of strings?

public inbox for pgsql-novice@postgresql.org  
help / color / mirror / Atom feed

From: Chris Papademetrious <Christopher.Papademetrious@synopsys.com>
To: Tim Anderson <postgresql@timando.net>
To: pgsql-novice@lists.postgresql.org <pgsql-novice@lists.postgresql.org>
Subject: RE: is there a way to automate deduplication of strings?
Date: Thu, 8 Jan 2026 02:37:31 +0000
Message-ID: <DM4PR12MB603984E63493B9170CE51EBADD85A@DM4PR12MB6039.namprd12.prod.outlook.com> (raw)
In-Reply-To: <75e0f4dd-cc89-4305-bb8b-4e19c0060ff4@timando.net>
References: <DM4PR12MB603953767048EE1B8A39283ADDB1A@DM4PR12MB6039.namprd12.prod.outlook.com>
	<CAKAnmmL1UZcP_SUotSRpDD2a6+k4qCRWvFGawDb6-JkXY0Rmbw@mail.gmail.com>
	<DM4PR12MB603913E485D34B4ED790FDAFDDBAA@DM4PR12MB6039.namprd12.prod.outlook.com>
	<75e0f4dd-cc89-4305-bb8b-4e19c0060ff4@timando.net>

Hi Tim,

That was my original plan. The 16 bytes for a UUID is fine - the strings are much longer so it's still a win. But I'm deciding whether the transactional complexity is worth the space savings.

If there was some more automated way of doing this store-values-indirected-by-their-hash stuff (a Postgres feature I didn't know about or an extension that does it automatically), it would make the decision easier.

 - Chris


-----Original Message-----
From: Tim Anderson <postgresql@timando.net> 
Sent: Wednesday, January 7, 2026 5:06 PM
To: pgsql-novice@lists.postgresql.org
Subject: Re: is there a way to automate deduplication of strings?

One thing you could do if you're OK with the space overhead of uuid vs int, is use a hash of the user agent e.g. `md5(user_agent)::uuid` which would reduce the need to lookup the value when inserting. Then when you get a foreign key violation, add the user agent to the user_agent table.

On 1/1/26 10:25, Chris Papademetrious wrote:
>
> Hi Greg,
>
> Thanks for the reply! I tried to be vague to avoid getting distracted 
> by the details, but I think I overdid it!
>
> Let’s say I have a table of transactions like this:
>
> CREATE TABLE transaction (
>
>     id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
>     user_agent TEXT NOT NULL,
>
>     --
>
>     -- ...more columns here...
>
>     --
>
> );
>
> The table can contain millions of transactions. The *user_agent*field 
> stores information about the application that performed each 
> transaction. These user-agent values will be populated from a 
> relatively small set of unique values. For example,
>
> MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop 
> (BuildID 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile 
> (BuildID 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID
> 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID
> 00000000-0000-0000-0000-000000000000)
>
> MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop 
> (BuildID 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile 
> (BuildID 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID
> 11111111-1111-1111-1111-111111111111)
>
> MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID
> 11111111-1111-1111-1111-111111111111)
>
> The values themselves will vary over time (as new versions appear and 
> old versions age out) so the set cannot be hardcoded, but the column 
> will always contain large numbers of duplicate values.
>
> I could store the user-agent values in a separate table and reference 
> them by a UUID computed from their value:
>
> CREATE TABLE user_agent (
>
>     id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
>     user_agent TEXT NOT NULL UNIQUE,
>
> );
>
> CREATE TABLE transaction (
>
>     id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
>
> user_agent_id UUID NOT NULL,
>
>     CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id) REFERENCES
> user_agent(id) ON DELETE CASCADE
>
> );
>
> but this adds transactional complexity for storage and retrieval, 
> along with cleanup of no-longer-referenced values over time.
>
> I’m wishing for a magic “sparsely stored texts” column in Postgres 
> that performs this deduplication automatically, but I don’t think it 
> exists. So I’m wondering, is there an extension or some other trick to 
> get the space savings without the transactional complexity?
>
>   * Chris
>
> *From:*Greg Sabino Mullane <htamfids@gmail.com>
> *Sent:* Wednesday, December 31, 2025 10:12 AM
> *To:* Chris Papademetrious <chrispy@synopsys.com>
> *Cc:* pgsql-novice@lists.postgresql.org
> *Subject:* Re: is there a way to automate deduplication of strings?
>
> It is not quite clear what you are trying to do. Can you provide a 
> small test table to show what you want to achieve?
>
> Cheers,
>
> Greg
>
> --
>
> Crunchy Data - 
> https://urldefense.com/v3/__https://www.crunchydata.com__;!!A4F2R9G_pg
> !d8s5tNhFHCYqdvuxCWPn_vper3_G_W0Y5HQMctNm15hKZoVs5VpRP78WKTQjdCl8rTYrz
> JQ58Dr_QJKAC3o0Bg$ 
> <https://urldefense.com/v3/__https:/www.crunchydata.com__;!!A4F2R9G_pg
> !dkWhEcznp8l1toENuJsDgY1GABWqYLWZBHncXtqiCaZguLzkN0U0-3JBHLCm7wE8gN0L4
> ZIwyXqM5pGNLIs7ORVTFh7KnJI$>
>
> Enterprise Postgres Software Products & Tech Support
>

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: pgsql-novice@postgresql.org
  Cc: Christopher.Papademetrious@synopsys.com, postgresql@timando.net, pgsql-novice@lists.postgresql.org
  Subject: RE: is there a way to automate deduplication of strings?
  In-Reply-To: <DM4PR12MB603984E63493B9170CE51EBADD85A@DM4PR12MB6039.namprd12.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox