MIME-Version: 1.0
From: Priya V <mailme0216@gmail.com>
Date: Thu, 9 Apr 2026 17:39:06 -0500
Message-ID: 
 <CAFsZ43y2s3FE=RhDoTRKNVDRmdoRaL5X9CzpoZaT7E=XoLdyVg@mail.gmail.com>
Subject: Async standby lag + physical slot + hot_standby_feedback=on appeared
 to degrade primary performance
To: pgsql-performance@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000c5cfb0064f0eafd7"
Archived-At: 
 <https://www.postgresql.org/message-id/CAFsZ43y2s3FE%3DRhDoTRKNVDRmdoRaL5X9CzpoZaT7E%3DXoLdyVg%40mail.gmail.com>
Precedence: bulk

--000000000000c5cfb0064f0eafd7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi all,

I’m looking for insight into a behavior we observed in a PostgreSQL
physical replication setup.

Environment:

   - PostgreSQL version: 15.14
   - DB size: 282 GB
   - Environment: AWS EC2
   - PR = primary
   - HA = synchronous standby
   - DP = asynchronous standby
   - DP used a physical replication slot
   - hot_standby_feedback = on (set on DP)

Observed behavior:

   - DP fell behind PR, with replication lag reaching about 400 GB
   - There were no user queries running on DP
   - During this period, query performance on PR degraded and application
   backlog built up on PR
   - After removing DP from replication, PR performance improved gradually
   over about 1 to 2 hours, not immediately

Why this is confusing:

   - DP was async, so this does not appear to be synchronous commit wait
   - There were no active queries on DP at the time we checked
   - The delayed recovery on PR makes me wonder whether cleanup on PR had
   been held back for some time, causing dead tuple accumulation / bloat /
   autovacuum backlog, and whether removing DP only allowed PR to recover
   gradually afterward

My questions:

   1. In an async physical standby setup, can a lagging standby with a
   physical slot and hot_standby_feedback=on still hold back VACUUM cleanup
   on the primary even when no queries are currently running on the
   standby?
   2. Can an old or stale slot xmin on the primary explain this kind of
   behavior?
   3. Does the 1–2 hour gradual recovery after removing DP point more
   toward cleanup debt / dead tuple buildup / bloat on PR, WAL retention /
   storage pressure, or a combination of both?
   4. What PR-side evidence would best confirm the root cause after the
   fact? For example:
      - pg_stat_replication.backend_xmin
      - pg_replication_slots.xmin
      - pg_replication_slots.restart_lsn
      - pg_stat_user_tables.n_dead_tup
      - autovacuum activity on heavily updated tables
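
To make question 4 concrete, here is a minimal sketch of the kind of PR-side
check I have in mind. The SQL only uses the views listed above (as they exist
in PostgreSQL 15); the helper name flag_slots, the thresholds, and the sample
slot names are hypothetical and only for illustration. These views show
current state, so I assume this would have to be sampled while the problem is
happening rather than after the slot is gone.

```python
# Sketch only: the queries target the PostgreSQL 15 views named above.
# The helper name and both thresholds are hypothetical, not recommendations.

SLOT_QUERY = """
SELECT slot_name,
       active,
       age(xmin) AS xmin_age,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE slot_type = 'physical';
"""

DEAD_TUPLE_QUERY = """
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
"""


def flag_slots(rows, max_xmin_age=1_000_000, max_retained_bytes=50 * 1024**3):
    """Return (slot_name, reasons) for slots that exceed either threshold.

    rows: iterable of dicts shaped like the SLOT_QUERY result.
    A NULL xmin or restart_lsn arrives as None and is not flagged.
    """
    flagged = []
    for row in rows:
        reasons = []
        if row["xmin_age"] is not None and row["xmin_age"] > max_xmin_age:
            reasons.append("old xmin")
        if (row["retained_bytes"] is not None
                and row["retained_bytes"] > max_retained_bytes):
            reasons.append("large WAL retention")
        if reasons:
            flagged.append((row["slot_name"], reasons))
    return flagged
```

The query strings can be run on PR with psql or any driver; flag_slots only
post-processes the rows, so it can be tried without a database connection.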

Any insights would be appreciated.

Thanks.

--000000000000c5cfb0064f0eafd7--