Subject: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8
To: pgsql-bugs@lists.postgresql.org
From: PG Bug reporting form
Cc: radim@boringsql.com
Reply-To: radim@boringsql.com, pgsql-bugs@lists.postgresql.org
Date: Wed, 20 May 2026 21:16:59 +0000
Message-ID: <19490-9c59c6a583513b99@postgresql.org>

The following bug has been logged on the website:

Bug reference:      19490
Logged by:          Radim Marek
Email address:      radim@boringsql.com
PostgreSQL version: 16.14
Operating system:   Linux - Ubuntu 22.04
Description:

Hello,

Due to a mistake, we ran a standby on a higher 16.x minor version against a non-upgraded primary. This led to repeated problems with WAL replay.

Description:

A streaming replication standby running 16.14 stops advancing replay while WAL keeps arriving from a 16.8 primary. The startup process is parked in futex_wait_queue with wait_event = LWLock:MultiXactOffsetSLRU and no longer makes progress. pg_stat_slru shows zero MultiXact activity over the same window, so it appears to stop on the lock itself rather than inside any SLRU read/write path. Downgrading the standby binary to 16.12 (same data directory) resolved the symptom under the same workload.

Configuration:

Primary: 16.8-1.pgdg22.04+1, observed both under load and while "relatively" idle (below 1000 QPS).
Replica: 16.14-1.pgdg22.04+1, physical streaming, async; the only replica on 16.14 (due to misconfiguration), no cascading. Other replicas not affected (running 16.8).
hot_standby_feedback enabled, logical replication from primary. Default WAL segment size. Default SLRU buffer sizes.

Observed symptoms on the standby

1. pg_stat_replication on primary, just the affected node

client_addr  state      sent_lag  write_lag  flush_lag  replay_lag_bytes  replay_lag
10.x.x.x     streaming  0         0          0          8766784344        02:42:50

2. Receive/write/flush are all at the primary's current LSN; only replay is far behind and growing.

3. Startup process wait event on standby (sampled repeatedly, always identical)

pid    wait_event_type  wait_event           state
19095  LWLock           MultiXactOffsetSLRU  (null)

4. Kernel stack of the startup process

cat /proc/19095/stack
[<0>] futex_wait_queue+0x67/0xa0
[<0>] __futex_wait+0x155/0x1d0
[<0>] futex_wait+0x74/0x120
[<0>] do_futex+0x16d/0x230
[<0>] __x64_sys_futex+0x95/0x200
[<0>] x64_sys_call+0x117b/0x2480
[<0>] do_syscall_64+0x81/0x170
[<0>] entry_SYSCALL_64_after_hwframe+0x78/0x80

cat /proc/19095/wchan
futex_wait_queue

5. pg_stat_slru on the standby, after pg_stat_reset_slru(NULL) and a 60-second wait under live WAL streaming

name             blks_zeroed  blks_hit  blks_read  blks_written
MultiXactMember  0            0         0          0
MultiXactOffset  0            0         0          0

6. There was no MultiXact SLRU activity while the startup process was reportedly waiting on the MultiXact offset SLRU lock.

7. Replay LSN frozen, receive LSN advancing. Sampled 60 sec apart.

recv           replay          lag_bytes
1476A/D1DA158  14767/EE01DB78  9111848416
1476A/EB565D0  14767/EE01DB78  9138571864

8. No replay progress; ~9 GB of WAL buffered locally that is never applied.

9. Other backends on the standby: only a diagnostic psql client. No hot-standby readers.

10. MultiXact age on the primary is small (~360k on most DBs, ~239k on the main DB). No MultiXact storm.

Workarounds

- Restarting the standby cleared the block, but once it caught up the problem repeated.
- Downgrading the standby binary to 16.12 (16.12-1.pgdg22.04+1) against the same data directory restored normal replay. After 60s under the same workload, pg_stat_slru shows only 2 hits / 0 reads on MultiXact.

I understand that running 6 minor versions behind is not a particularly good setup, but given that this is a supported direction, it might be worth at least a mention in the 16.13/16.14 release notes.

---
Hope this helps,
Radim
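
For reference, the lag_bytes figures in the recv/replay samples above can be reproduced from the LSN pairs themselves. The following is a minimal sketch in Python, assuming the usual PostgreSQL LSN text form of two hexadecimal halves (high 32 bits, a slash, low 32 bits). The helper names parse_lsn and lsn_diff are illustrative only and are not part of any PostgreSQL API; on the server, pg_wal_lsn_diff performs the equivalent subtraction.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN in 'HIGH/LOW' hexadecimal text form to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(recv: str, replay: str) -> int:
    """Bytes of WAL received but not yet replayed."""
    return parse_lsn(recv) - parse_lsn(replay)


# The two samples from the report, taken 60 seconds apart.
samples = [
    ("1476A/D1DA158", "14767/EE01DB78"),
    ("1476A/EB565D0", "14767/EE01DB78"),
]

for recv, replay in samples:
    print(recv, replay, lsn_diff(recv, replay))
# 1476A/D1DA158 14767/EE01DB78 9111848416
# 1476A/EB565D0 14767/EE01DB78 9138571864
```

The replay position is identical in both samples, so the growth of 26723448 bytes between them comes entirely from WAL that was received during the 60-second window.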