Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vdbfa-002hjJ-33 for pgsql-novice@arkaria.postgresql.org; Wed, 07 Jan 2026 22:06:23 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vdbfZ-00FlR5-1Q for pgsql-novice@arkaria.postgresql.org; Wed, 07 Jan 2026 22:06:22 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vdbfZ-00FlQx-0X for pgsql-novice@lists.postgresql.org; Wed, 07 Jan 2026 22:06:21 +0000 Received: from mx1.tim.tja.au ([2404:9400:1:0:216:3eff:fef0:f426]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vdbfW-0059b3-0A for pgsql-novice@lists.postgresql.org; Wed, 07 Jan 2026 22:06:20 +0000 Received: from mail.tim.tja.au (localhost [IPv6:::1]) by mx1.tim.tja.au (Postfix) with ESMTP id AFCB01C8EFB for ; Thu, 8 Jan 2026 08:05:56 +1000 (AEST) Received: from [10.42.0.101] ([120.22.81.73]) by mail.tim.tja.au with ESMTPSA id fHr7KMTYXmlKLAIAF2wIhQ (envelope-from ) for ; Thu, 08 Jan 2026 08:05:56 +1000 Message-ID: <75e0f4dd-cc89-4305-bb8b-4e19c0060ff4@timando.net> Date: Thu, 8 Jan 2026 08:05:56 +1000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: is there a way to automate deduplication of strings? To: pgsql-novice@lists.postgresql.org References: Content-Language: en-US From: Tim Anderson In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk One thing you could do if you're OK with the space overhead of uuid vs int, is use a hash of the user agent e.g. `md5(user_agent)::uuid` which would reduce the need to lookup the value when inserting. Then when you get a foreign key violation, add the user agent to the user_agent table. On 1/1/26 10:25, Chris Papademetrious wrote: > > Hi Greg, > > Thanks for the reply! I tried to be vague to avoid getting distracted > by the details, but I think I overdid it! > > Let’s say I have a table of transactions like this: > > CREATE TABLE transaction ( > >     id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > >     user_agent TEXT NOT NULL, > >     -- > >     -- ...more columns here... > >     -- > > ); > > The table can contain millions of transactions. The *user_agent*field > stores information about the application that performed each > transaction. These user-agent values will be populated from a > relatively small set of unique values. For example, > > MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop > (BuildID 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile > (BuildID 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Windows 11 25H2) Desktop (BuildID > 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.0 (Windows 11 25H2) Mobile (BuildID > 00000000-0000-0000-0000-000000000000) > > MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Desktop  > (BuildID 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Linux; x86_64; Ubuntu 22.04.3 LTS) Mobile > (BuildID 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Windows 11 25H2) Desktop (BuildID > 11111111-1111-1111-1111-111111111111) > > MyFictitiousApp/1.1 (Windows 11 25H2) Mobile (BuildID > 11111111-1111-1111-1111-111111111111) > > The values themselves will vary over time (as new versions appear and > old versions age out) so the set cannot be hardcoded, but the column > will always contain large numbers of duplicate values. > > I could store the user-agent values in a separate table and reference > them by a UUID computed from their value: > > CREATE TABLE user_agent ( > >     id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > >     user_agent TEXT NOT NULL UNIQUE, > > ); > > CREATE TABLE transaction ( > >     id UUID PRIMARY KEY DEFAULT gen_random_uuid(), > > user_agent_id UUID NOT NULL, > >     CONSTRAINT fk_user_agent FOREIGN KEY (user_agent_id) REFERENCES > user_agent(id) ON DELETE CASCADE > > ); > > but this adds transactional complexity for storage and retrieval, > along with cleanup of no-longer-referenced values over time. > > I’m wishing for a magic “sparsely stored texts” column in Postgres > that performs this deduplication automatically, but I don’t think it > exists. So I’m wondering, is there an extension or some other trick to > get the space savings without the transactional complexity? > > * Chris > > *From:*Greg Sabino Mullane > *Sent:* Wednesday, December 31, 2025 10:12 AM > *To:* Chris Papademetrious > *Cc:* pgsql-novice@lists.postgresql.org > *Subject:* Re: is there a way to automate deduplication of strings? > > It is not quite clear what you are trying to do. Can you provide a > small test table to show what you want to achieve? > > Cheers, > > Greg > > -- > > Crunchy Data - https://www.crunchydata.com > > > Enterprise Postgres Software Products & Tech Support >