Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Subject: BUG #19490: Streaming standby on 16.14 stops applying WAL on
 MultiXactOffsetSLRU when primary is 16.8
To: pgsql-bugs@lists.postgresql.org
From: PG Bug reporting form <noreply@postgresql.org>
Cc: radim@boringsql.com
Reply-To: radim@boringsql.com, pgsql-bugs@lists.postgresql.org
Date: Wed, 20 May 2026 21:16:59 +0000
Message-ID: <19490-9c59c6a583513b99@postgresql.org>
Auto-Submitted: auto-generated
Archived-At: 
 <https://www.postgresql.org/message-id/19490-9c59c6a583513b99%40postgresql.org>
Precedence: bulk

The following bug has been logged on the website:

Bug reference:      19490
Logged by:          Radim Marek
Email address:      radim@boringsql.com
PostgreSQL version: 16.14
Operating system:   Linux - Ubuntu 22.04
Description:

Hello,
due to a mistake we ran a standby on a newer 16.x minor version against a
primary that had not been upgraded. This led to repeated stalls in WAL replay.

Description:

A streaming replication standby running 16.14 stops advancing replay while
WAL keeps arriving from a 16.8 primary. The startup process is parked in
futex_wait_queue with wait_event = LWLock:MultiXactOffsetSLRU and no longer
makes progress.

pg_stat_slru shows zero MultiXact activity over the same window, so it
appears to stop on the lock itself rather than inside any SLRU read/write
path. Downgrading the standby binary to 16.12 (same data directory) resolved
the symptom under the same workload.

Configuration:

Primary: 16.8-1.pgdg22.04+1. We observed the problem both under load and when
"relatively" idle (below 1000 QPS).
Replica: 16.14-1.pgdg22.04+1, physical streaming, async, no cascading. It is
the single replica on 16.14 (due to misconfiguration). Other replicas (running
16.8) are not affected.

hot_standby_feedback enabled, logical replication from primary. Default WAL
segment size. Default SLRU buffer sizes.

Observed symptoms on the standby

1. pg_stat_replication on primary, just the affected node

client_addr  state      sent_lag  write_lag  flush_lag  replay_lag_bytes  replay_lag
10.x.x.x     streaming  0         0          0          8766784344        02:42:50

2. Receive/write/flush all at the primary's current LSN; only replay is far
behind and growing.

3. Startup process wait event on standby (sampled repeatedly, always
identical)
pid    wait_event_type  wait_event           state
19095  LWLock           MultiXactOffsetSLRU  (null)

4. Kernel stack of the startup process
cat /proc/19095/stack
[<0>] futex_wait_queue+0x67/0xa0
[<0>] __futex_wait+0x155/0x1d0
[<0>] futex_wait+0x74/0x120
[<0>] do_futex+0x16d/0x230
[<0>] __x64_sys_futex+0x95/0x200
[<0>] x64_sys_call+0x117b/0x2480
[<0>] do_syscall_64+0x81/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80
cat /proc/19095/wchan
futex_wait_queue

5. pg_stat_slru on the standby, after pg_stat_reset_slru(NULL) and a
60-second wait under live WAL streaming
name             blks_zeroed  blks_hit  blks_read  blks_written
MultiXactMember  0            0         0          0
MultiXactOffset  0            0         0          0

6. There was no MultiXact SLRU activity while the startup process was
reportedly waiting on the MultiXact offset SLRU lock.

7. Replay LSN frozen, receive LSN advancing. Sampled 60 sec apart.
recv           replay          lag_bytes
1476A/D1DA158  14767/EE01DB78  9111848416
1476A/EB565D0  14767/EE01DB78  9138571864

8. No replay progress; ~9 GB of WAL buffered locally that is never applied.

9. Other backends on the standby: only a diagnostic psql client. No
hot-standby readers.

10. MultiXact age on the primary is small (~360k on most DBs, ~239k on the
main DB). No MultiXact storm.
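
For anyone who wants to cross-check the lag figures, the receive and replay
LSNs sampled 60 seconds apart above can be turned into byte positions by hand.
The following is only a minimal sketch: the helper names parse_lsn and
lag_bytes are made up for illustration, and it assumes the usual reading of an
LSN as two hexadecimal halves of a 64-bit WAL position.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'HI/LO' (both hex) into a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(recv: str, replay: str) -> int:
    """Bytes of WAL received but not yet replayed (hypothetical helper)."""
    return parse_lsn(recv) - parse_lsn(replay)


# (recv, replay) pairs copied from the two samples in this report.
samples = [
    ("1476A/D1DA158", "14767/EE01DB78"),
    ("1476A/EB565D0", "14767/EE01DB78"),
]

lags = [lag_bytes(recv, replay) for recv, replay in samples]

# Replay is considered frozen if the replay LSN is identical in both samples.
replay_frozen = samples[0][1] == samples[1][1]

# How far the receive position moved between the two samples.
recv_advance = parse_lsn(samples[1][0]) - parse_lsn(samples[0][0])

for (recv, replay), lag in zip(samples, lags):
    print(f"recv={recv} replay={replay} lag_bytes={lag}")
print(f"replay frozen: {replay_frozen}, recv advanced by {recv_advance} bytes")
```

The computed lags agree with the lag_bytes column reported above, and the
receive position moves by about 26.7 MB in the 60-second window while the
replay position does not move at all.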

Workarounds

- Restarting the standby cleared the block, but once it caught up the stall
recurred.
- Downgrading the standby binary to 16.12 (16.12-1.pgdg22.04+1) against
the same data directory restored normal replay. After 60s under the same
workload pg_stat_slru shows only 2 hits / 0 reads on MultiXact.

I understand that running 6 minor versions behind is not a particularly good
setup, but given that this is a supported direction, it might be worth at least
a mention in the 16.13/16.14 release notes.

---

Hope this helps,
Radim