Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vbJdd-0048ie-2x for pgsql-novice@arkaria.postgresql.org; Thu, 01 Jan 2026 14:26:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vbJdb-008UCY-1z for pgsql-novice@arkaria.postgresql.org; Thu, 01 Jan 2026 14:26:52 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vbJdb-008UCQ-0X for pgsql-novice@lists.postgresql.org; Thu, 01 Jan 2026 14:26:52 +0000 Received: from mail-ot1-x32b.google.com ([2607:f8b0:4864:20::32b]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vbJdZ-003fEY-0x for pgsql-novice@lists.postgresql.org; Thu, 01 Jan 2026 14:26:51 +0000 Received: by mail-ot1-x32b.google.com with SMTP id 46e09a7af769-7c6dbdaced8so9405124a34.1 for ; Thu, 01 Jan 2026 06:26:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1767277608; x=1767882408; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=9cIMxeUHVJ5StR4RW1t7dmnI0pGo4c/jTOpd0xtJ0dQ=; b=m7gIjR6CkpCLp94yjPkWwITjGY0KoRTTpXH3zmi1WO27rtRWvk5QE9w3oklvf/SU0G mixf6SRLiWJtzr2Izkfy6JUvqPeUYf8sEFFJZ3B8OtAtzYtsPgZHWSitcC+sg/vQm9b6 8bIdJplCljOWRyqOBa3aDT3dAiU3pAnshqNEeKbamUXv1iqsphgSBbB3mlDxqhOFC6G6 yn76jN3SRuDON8Z5E4vJb2VJwK/1YDqg/iB7mX1hl1gbp8UObhtakmKD4gU2u9+nB9eu uByHHOQTz9g+j6pSbnhW1/RUUjPBz7aaPJZArGWXkWZiwbvcoXPQ1JGl+yIzjM6EtTNr 3QMQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1767277608; x=1767882408; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=9cIMxeUHVJ5StR4RW1t7dmnI0pGo4c/jTOpd0xtJ0dQ=; b=LL9zbupc1oXRNiIMPN90nBnFvGWuWKTUrxoUycaqaKT8ZGMxxDIvTYLX9/lSo3IKbC kOxTzccOfWnrZ3W8sVUGmpGSmsJqB3BTTOfG4ClyvN9UAgJrqLtrC+5ggmyFDxicyHEv F2rfpHkkPdcPBYVqCKe051ww8Of/4Q6RVn1Bq5w1HDBxddg63Vkk9CLPioFZLOgSzMzR W7VZX0qQ5xDWVfV/ZeiNfJ/PV7nhZk6C5x/hccJA/rJiL35EpSjYN9y3tVo/0XNey1Xf 3mdKJBQnWbB7LCsQ3fvfwPfvnv80d8HGZRHcWeoKc4rXAEJQD+1+9auY2TziHHZ8iOpf KLtA== X-Gm-Message-State: AOJu0YwTg2wj6rmnzM9b6VxHO8xNmUIZGJmaXkxoPFxkyiqd0fj2QO3g zEKy3vaZKBgUEht8u/jzet6XnVvUDiPAzXJFQGHFTV4ZITEpMmQIe4gk/2f/b382oi8z5gqQj6b TujTzRpSJPd4z8EnjaA1pcrwmm9GwlCg= X-Gm-Gg: AY/fxX6DvcuAA7lUfIaXyfC8dXWus6hPibXC9WzRxa+gVvYs59788LFIqcGz1sxp1F/ 7J7MXM+g1A17+RmAKUnbnUn2h/oHsIBaUgxMrBUTm8ggkmfriPLGy1s0sdk1MCcvlcGhgrkWAcz llGx2AuepRzyA8YxcgN+q9tNj+CZB3SUSg0e+TWFCl1+0FyB+R+BrLmgPM4ZKoA3d8/UBleExYW Bp3Erat0EjpDLvYb8HoUgNFXglibucc0MdMOlDz9ctIToRjImLHS4hHkTS/BiUAI0TfP/J/B/n0 6jl01QVOGcfCZgevdncQP/eYRn3poA== X-Google-Smtp-Source: AGHT+IEyxE9eiyGmV/LPuT49kZWqNLg0sU7yQDzBfXIok1htotlyOzakLXBUY77DIspSrqlcIgSQWm/KQTUg9mtVbO0= X-Received: by 2002:a05:6820:827:b0:659:9a49:8f12 with SMTP id 006d021491bc7-65d0e9f40d5mr15525098eaf.35.1767277608033; Thu, 01 Jan 2026 06:26:48 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Greg Sabino Mullane Date: Thu, 1 Jan 2026 09:26:12 -0500 X-Gm-Features: AQt7F2qHltAwfgNpPQcDD1GTdZ6C50fB79kXUHqt19mo8EdMbxAs8M9e2akFbFg Message-ID: Subject: Re: is there a way to automate deduplication of strings? To: Chris Papademetrious Cc: "pgsql-novice@lists.postgresql.org" Content-Type: multipart/alternative; boundary="0000000000000be1ae064754628b" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000000be1ae064754628b Content-Type: text/plain; charset="UTF-8" Ah, okay, so what you want is to normalize the data. Turns out, that's something database purists strive for, so you are in good company. Yes, there is a little more overhead, but that's kind of the cost of using a relational model. You can smooth things out a great bit with views, triggers, and functions. For example, you could create a view that joins the two tables together under the hood but an application or user would simply select from it as if it were the original non-duplicated form: create view txn as select t.id /* plus other fields from "t" here */ , u.user_agent from transaction t join user_agent u on (u.id = t.user_agent_id); You can also make the view ("txn" in this case) updateable by adding INSTEAD OF triggers on the view, which then call some functions that do the necessary maintenance work behind the scenes. Then your application simply calls insert/update/delete on the view as if it were the original table. https://www.postgresql.org/docs/current/sql-createtrigger.html We can help with those on this list, of course, but there are some good examples on that page to get you started. Minor notes: * Since the user_agent contains a small and finite number of variants, I would not bother with a uuid and just use an int: create table user_agent (id int primary key generated always as identity, user_agent text not null unique); * You ought to look at the new uuidv7() function, as it improves a lot on regular uuids. See for example: https://www.thenile.dev/blog/uuidv7 * Rather than ON DELETE CASCADE, you probably want ON DELETE RESTRICT for this use case * You mentioned "cleanup of no-longer-referenced values over time" I would not sweat this, at least for this particular example. If those user agents were used once, they may get used again. For different types of data and larger tables, you could write a small sql function that cleans up stray entries and make sure to call that function after a bulk delete, or just cron it to run weekly. Cheers, Greg -- Crunchy Data - https://www.crunchydata.com Enterprise Postgres Software Products & Tech Support --0000000000000be1ae064754628b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Ah, okay, so what you want is to normalize the data. Turns= out, that's something database purists strive for, so you are in good = company. Yes, there is a little more overhead, but that's kind of the c= ost of using a relational model. You can smooth things out a great bit with= views, triggers, and functions. For example, you could create a view that = joins the two tables together under the hood but an application or user wou= ld simply select from it as if it were the original non-duplicated form:
create view txn as select t.id /* plus oth= er fields from "t" here */ , u.user_agent from transaction t join= user_agent u on (u.id =3D t.user_agent_id);
You can also make the view ("txn" in this case) updateable b= y adding INSTEAD OF triggers on the view, which then call some functions th= at do the necessary maintenance work behind the scenes. Then your applicati= on simply calls insert/update/delete on the view as if it were the original= table.

https://www.postgresql.org/docs/current/sql-createtrigger.ht= ml

We can help with those on this list, of course, but there are= some good examples on that page to get you started.

Minor notes:
* Since the user_agent contains a small and finite number of variants,= I would not bother with a uuid and just use an int:

create table us= er_agent (id int primary key generated always as identity, user_agent text = not null unique);

* You ought to look at the new uuidv7() function, = as it improves a lot on regular uuids. See for example:

https://www.thenile.dev/blog/uuidv7=

* Rather than ON DELETE CASCADE, you probably want ON DELETE RESTRI= CT for this use case

* You mentioned "cleanup of no-longer-refe= renced values over time"

I would not sweat this, at least for t= his particular example. If those user agents were used once, they may get u= sed again. For different types of data and larger tables, you could write a= small sql function that cleans up stray entries and make sure to call that= function after a bulk delete, or just cron it to run weekly.

Che= ers,
Greg

--
Crunchy Data - https://www.crunchy= data.com
Enterprise Postgres Software Products & Tech Sup= port

--0000000000000be1ae064754628b--