public inbox for pgsql-performance@postgresql.org  
From: Priya V <mailme0216@gmail.com>
To: pgsql-performance@lists.postgresql.org
Subject: Async standby lag + physical slot + hot_standby_feedback=on appeared to degrade primary performance
Date: Thu, 9 Apr 2026 17:39:06 -0500
Message-ID: <CAFsZ43y2s3FE=RhDoTRKNVDRmdoRaL5X9CzpoZaT7E=XoLdyVg@mail.gmail.com> (raw)

Hi all,

I’m looking for insight into a behavior we observed in a PostgreSQL
physical replication setup.

Environment:

   - PostgreSQL version: 15.14
   - DB size: 282 GB
   - Environment: AWS EC2
   - PR = primary
   - HA = synchronous standby
   - DP = asynchronous standby
   - DP used a physical replication slot
   - hot_standby_feedback = on (set on DP)

Observed behavior:

   - DP fell about 400 GB behind PR in replication lag
   - There were no user queries running on DP
   - During this period, query performance on PR degraded and application
   backlog built up on PR
   - After removing DP from replication, PR performance improved gradually
   over about 1 to 2 hours, not immediately

Why this is confusing:

   - DP was async, so this does not appear to be a synchronous commit wait
   - There were no active queries on DP at the time we checked
   - The delayed recovery on PR makes me wonder whether cleanup on PR had
   been held back for some time, causing dead tuple accumulation / bloat /
   autovacuum backlog, and whether removing DP only allowed PR to recover
   gradually afterward

My questions:

   1. In an async physical standby setup, can a lagging standby with a
   physical slot and hot_standby_feedback=on still hold back VACUUM cleanup
   on the primary even when no queries are currently running on the standby?
   2. Can an old or stale slot xmin on the primary explain this kind of
   behavior?
   3. Does the 1–2 hour gradual recovery after removing DP point more
   toward cleanup debt / dead tuple buildup / bloat on PR, WAL retention /
   storage pressure, or a combination of both?
   4. What PR-side evidence would best confirm the root cause after the
   fact? For example:
      - pg_stat_replication.backend_xmin
      - pg_replication_slots.xmin
      - pg_replication_slots.restart_lsn
      - pg_stat_user_tables.n_dead_tup
      - autovacuum activity on heavily updated tables
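
To make the list above concrete, here is a minimal sketch of how those
items could be collected on PR. The views, columns, and functions
(pg_replication_slots, pg_stat_replication, pg_stat_user_tables, age(),
pg_wal_lsn_diff(), pg_current_wal_lsn()) are standard in PostgreSQL 15;
the dictionary keys, the Python function names, and the LIMIT of 20 are
illustrative assumptions. These views only show current state, so if
the slot has already been dropped, the slot query will return nothing.

```python
# Sketch only: diagnostic queries for the primary (PR), plus an offline LSN helper.
# Dict keys, function names, and the LIMIT are hypothetical choices for illustration.

DIAGNOSTIC_QUERIES = {
    # Slot xmin and how much WAL the slot is forcing PR to retain.
    "slot_state": """
        SELECT slot_name, active, xmin, age(xmin) AS xmin_age,
               restart_lsn,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
                   AS retained_wal_bytes
        FROM pg_replication_slots
    """,
    # xmin reported by connected standbys via hot_standby_feedback.
    "walsender_state": """
        SELECT application_name, state, sync_state, backend_xmin,
               age(backend_xmin) AS backend_xmin_age,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
                   AS replay_lag_bytes
        FROM pg_stat_replication
    """,
    # Tables with the most dead tuples and their last autovacuum time.
    "dead_tuples": """
        SELECT relname, n_live_tup, n_dead_tup,
               last_autovacuum, autovacuum_count
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 20
    """,
}


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN printed as 'HI/LO' (two hex numbers) to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_gap_bytes(newer: str, older: str) -> int:
    """Byte distance between two LSNs, e.g. values saved by monitoring."""
    return lsn_to_bytes(newer) - lsn_to_bytes(older)


def run_diagnostics(cursor) -> dict:
    """Run each query on any DB-API 2.0 cursor connected to PR."""
    results = {}
    for name, sql in DIAGNOSTIC_QUERIES.items():
        cursor.execute(sql)
        results[name] = cursor.fetchall()
    return results
```

The LSN helper is meant for comparing values captured earlier (for
example from monitoring snapshots) when the live views are no longer
available. As a sanity check on scale, lsn_gap_bytes("64/0", "0/0") is
100 * 2^32 bytes, which is 400 GiB.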

Any insights would be appreciated.

Thanks.

