Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPw7P-0019bi-2q for pgsql-hackers@arkaria.postgresql.org; Thu, 21 May 2026 05:38:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wPw7N-009Bxm-0y for pgsql-hackers@arkaria.postgresql.org; Thu, 21 May 2026 05:38:50 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPw7M-009Bxe-36 for pgsql-hackers@lists.postgresql.org; Thu, 21 May 2026 05:38:49 +0000 Received: from mail-pf1-x436.google.com ([2607:f8b0:4864:20::436]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wPw7L-00000000gGP-2xsg for pgsql-hackers@postgresql.org; Thu, 21 May 2026 05:38:49 +0000 Received: by mail-pf1-x436.google.com with SMTP id d2e1a72fcca58-83d31ac4017so2720897b3a.3 for ; Wed, 20 May 2026 22:38:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1779341925; cv=none; d=google.com; s=arc-20240605; b=Kd6S3vJTEh+Q8PtxPfq2j4PiJB7eVx7ng0J19rnStgZi+8MaIVm+JwUGA1jPq3tcEg 68jnp6/M3P+mb/1XHdKYJfwN/Px+RGFb64e9BbW2baPnQD59/URTYGzQIwJlVHUNYB0P LXHhr2S7x61T2b5z98iUGEOkWMg0PheAFhmL9IRmYXeqvs243uMlJFrImn597RWNu+A7 GRz5jpQcu/QsCtSB4YhMOaekYwwSpMYVdI8sQ+5CYCmokPOFsG0n6bsjVIWU2gkYcCCS KR0xV9OTLXDXCKNSXLZGtYEKbh+OZH0/Bhh0mtW9QGlmdmUiNwabZvHoQM22Ap4uwBse BSyA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=h/xr+4wlHskFHn6lIyeb8+GrHfiB6NgLVsYJ4sp+JGI=; fh=vC0M9IMXKLJ697FQuf/TkukK3rFBE/dYVMEfCB2n5Ec=; b=YVUWoi4g2QUq5HXGdVpuVBJL0AlQWsZndfoYfi7tHCAqZitlETYSTo5gOBSR2s69WW F1UFZPlFqWlKVjeO86bQhmXzsKWDew5qwqhdF5saC3rBDiXvcr5Jfm5izqCK/fuEO1Wt xCDb8zcGDFOlPsxdipa94FjPFqTGzOeeSIYl1AocoBg54u/sVW/hdVayfC7ckZV3zb0z Ahfahrcses5qxPZDt1HP2hR7XiP/0VQL27e2fe8svnL09/ijyo9nHbFPUuRjHetE6XLu XsPB3CYBfUTPSdWQZ+RcEBTJmUzD5Q7MiO5mIH+LMWQwVP4gMQSNu0zb5TdgMTWT+/Z5 kZKw==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779341925; x=1779946725; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=h/xr+4wlHskFHn6lIyeb8+GrHfiB6NgLVsYJ4sp+JGI=; b=gBoIjb/spoqCSh5sx6HmbI0vqyIuspTyS1YBcOoDGhXFpFOsd1IkzGMl25IKnBevYW yXTAMcU86BKVmstaMv3y4akJ8BH89Ba8vLIdV3mrcsMasdkcZQ7AJWQnQ3+xmyk+b/+o h4i0Ft63CNTO3zYvVr764qbJEDE7d5d0V7sI51/4Z3nl7gCaDEJm8erfIsxDc9O7Y6aS auORmMYeFExayGMproOp0u5WgFSc2ckI9diSGbxlgngVDh+RvGqgEo2lw5Vz2HHppP95 y/EWTYCZOs28JolPUZatyBWAI+U/uD8MHTZkP6YPbjd4cdSokXjsn/CcWiI4Cx/NAJDH pz+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779341925; x=1779946725; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=h/xr+4wlHskFHn6lIyeb8+GrHfiB6NgLVsYJ4sp+JGI=; b=pVSF/r/WiysnrCNEuwavWAJiy86kAfI2gjxbelV87qqvehFkcfQc7Fr6l1h2mYGN+o jJkzMMMWGxhttadgbAj1ONNeAc3K0rJvEInbWoC3dvQjOV1ordaZeu3sTPPyS7sqcy/a z7/oEFkOt5OkXHrCYtLxFnd11HymcnMJ2gt0252n4fNW81SRQVuwHIEgQkmAtX2RpsOg aMVriR1yJPPeBVneqNP2NCb9xqyyptzjVgMiOCU/sJ6RD5vlakMbD3tr2gnp1B831aGQ upSd21ZEHIvQhpFMdq2fYGBYMPYqsI0vLoiuVX0IpCokHkYeaq5KZkwwURKC+vaEnh02 Ne9Q== X-Gm-Message-State: AOJu0YzzeRNl5mgQLObKM5n3zmnpqf/kXdP5/GK6u2C5jIhSf7uQ10IV 8I7JOXAmcspTNzgvt6F5Wfb9MT0QU/CN1jQB8jsiM+OWM04wBRoOFXWgS26ICi7OgdcAAr3eB5C k/6zo+G296HAnvuy6OozINNKsVbuizrF/LwsQnW8= X-Gm-Gg: Acq92OGJU6zyMk7F+PCeeAUWcJ+6W6I/l8jmORjSxVpFsku0phV4wIdVjHYPFb7C+Wv 4wDPaP2XWjFJ6+21QtoDe9migCN2rTX/I2JwGUHKphetgMe6J47HAOQh8wdL1hQSlh9mjNuPvaZ y4/JwaHhujzW+6hKRee53yWMH26BE3lfONl1tgWv3+A6HUcQb0nmFLcKGX/HJ2YMCpj6BRjK7L3 iwG70/sWaiock6+zX9xVRQ008FbsuQlyl/vrT5J3V2sMDpPKD2oL719AOQcpzKjHQrcFOkOOxI0 +IwfUvODt9azPpAGEF8hIHQdkHbEHZVIozLbfjwr7iZnEaQWOKo= X-Received: by 2002:a05:6a00:1d8c:b0:83d:43be:d537 with SMTP id d2e1a72fcca58-8414ac8d1c3mr1490009b3a.17.1779341925198; Wed, 20 May 2026 22:38:45 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: vellaipandiyan sm Date: Thu, 21 May 2026 11:08:33 +0530 X-Gm-Features: AVHnY4KABJ5uuy8v_HvfAU6jP2EDfFHop8Froezvlkwf-ROf6knRTceljCWS3l0 Message-ID: Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information To: torikoshia Cc: Pgsql Hackers Content-Type: multipart/alternative; boundary="00000000000062b07906524d53b0" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000062b07906524d53b0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello hackers, I reviewed the EXPLAIN storage I/O patch and the overall direction seems useful, especially for distinguishing shared-buffer hits from actual storage reads during query analysis. One concern that stood out to me from the later discussion is the interaction with asynchronous I/O and worker-based I/O accounting. Since the patch currently relies on per-process getrusage() statistics, it seems possible that the reported values could become partial or misleading once I/O is performed outside the backend process context. In particular, worker-based AIO could undercount storage reads/writes while still returning non-zero values, which may make the output appear more accurate than it actually is. I also wonder whether this should affect the semantics/documentation of the feature itself =E2=80=94 i.e. whether the reported numbers should be descri= bed more explicitly as =E2=80=9Cbackend-observed synchronous storage I/O=E2=80=9D ra= ther than total query I/O. That said, the examples showing differences between OS cache hits and actual storage reads were quite compelling, especially for query performance debugging. Thanks for working on the patch. Regards, Vellaipandiyan On Thu, May 21, 2026 at 11:07=E2=80=AFAM torikoshia wrote: > Hi, > > When reading the output of EXPLAIN (ANALYZE) to diagnose slow queries > for our customers, I often want to know how many page faults occurred, > especially major page faults, which involve disk access. > > Currently, the BUFFERS option in EXPLAIN provides information on whether > a page was found in the shared buffers, but it doesn't provide insight > into whether the page was found in the OS cache or not and disk I/O > occurred. > > Since page faults especially major one impact performance compared to > shared buffer and OS cache hits, it would be helpful to track these > events. > > I have attached a PoC patch that modifies EXPLAIN to include page fault > information during both the planning and execution phases of a query. > The output would look like this: > > =3D# EXPLAIN (ANALYZE, PAGEFAULTS) > SELECT * FROM pgbench_branches b > JOIN pgbench_accounts a ON b.bid =3D a.bid ORDER BY a.aid; > > QUERY PLAN > > > -------------------------------------------------------------------------= ---------------------------------------------------------------------------= ----------------- > Nested Loop (cost=3D0.58..335386.98 rows=3D4999917 width=3D197) (ac= tual > time=3D3.785..5590.294 rows=3D5000000 loops=3D1) > > ... (omitted) > > Planning: > Buffers: shared hit=3D50 read=3D48 dirtied=3D4 written=3D2 > Page Faults: minor=3D30 major=3D19 <-ADDED > Planning Time: 22.080 ms > Execution: > Page Faults: minor=3D49 major=3D5 <-ADDED > Execution Time: 5794.356 ms > > The patch has not yet added this functionality to auto_explain, but I > believe this feature would be more useful in auto_explain than in the > EXPLAIN command itself. I plan to add it to auto_explain if there's > interest in including page fault information in EXPLAIN. > > While GUCs like log_statement_stats allow logging of getrusage(2) > information, including page faults, always enabling this can lead to > excessive log output. It would be better to log this data only when > queries are slow. Therefore, adding this feature to auto_explain seems > like a good solution. > > > The patch introduces a new option, PAGEFAULTS, but it may be more > appropriate to include the page fault information in another option, > such as SUMMARY, especially if there are other useful resources that can > be obtained from getrusage(2). > > > The patch cannot be applied to Windows because getrusage() in PostgreSQL > ported for Windows currently only tracks CPU times. I'm not sure if > information on major and minor page faults is accessible on Windows, but > it might be acceptable to treat it similarly to log_statement_stats and > exclude it from Windows support. > > > I also tried to add page faults information for each plan node, similar > to the BUFFERS option, but I decided against this approach due to the > performance impact. The frequent calls to getrusage(2), i.e. each time > getting one row, created significant overhead. > > > Any feedback would be greatly appreciated. > > > -- > Regards, > > -- > Atsushi Torikoshi > Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K. --00000000000062b07906524d53b0 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello hackers,

I reviewed the EXPLAIN storage I/O p= atch and the overall direction seems useful, especially for distinguishing = shared-buffer hits from actual storage reads during query analysis.

= One concern that stood out to me from the later discussion is the interacti= on with asynchronous I/O and worker-based I/O accounting.

Since the = patch currently relies on per-process getrusage() statistics, it seems poss= ible that the reported values could become partial or misleading once I/O i= s performed outside the backend process context. In particular, worker-base= d AIO could undercount storage reads/writes while still returning non-zero = values, which may make the output appear more accurate than it actually is.=

I also wonder whether this should affect the semantics/documentatio= n of the feature itself =E2=80=94 i.e. whether the reported numbers should = be described more explicitly as =E2=80=9Cbackend-observed synchronous stora= ge I/O=E2=80=9D rather than total query I/O.

That said, the examples= showing differences between OS cache hits and actual storage reads were qu= ite compelling, especially for query performance debugging.

Thanks f= or working on the patch.

Regards,
Vellaipandiyan

On Thu, May 21, 2026 at 11:07=E2=80=AFAM torikoshia <torikoshia@oss.nttdata.com> wrote:
Hi,

When reading the output of EXPLAIN (ANALYZE) to diagnose slow queries
for our customers, I often want to know how many page faults occurred,
especially major page faults, which involve disk access.

Currently, the BUFFERS option in EXPLAIN provides information on whether a page was found in the shared buffers, but it doesn't provide insight =
into whether the page was found in the OS cache or not and disk I/O
occurred.

Since page faults especially major one impact performance compared to
shared buffer and OS cache hits, it would be helpful to track these
events.

I have attached a PoC patch that modifies EXPLAIN to include page fault information during both the planning and execution phases of a query.
The output would look like this:

=C2=A0 =C2=A0=3D# EXPLAIN (ANALYZE, PAGEFAULTS)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0SELECT * FROM pgbench_branches b =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0JOIN pgbench_accounts a ON = b.bid =3D a.bid ORDER BY a.aid;

=C2=A0 =C2=A0QUERY PLAN

---------------------------------------------------------------------------= ---------------------------------------------------------------------------= ---------------
=C2=A0 =C2=A0 =C2=A0Nested Loop=C2=A0 (cost=3D0.58..335386.98 rows=3D499991= 7 width=3D197) (actual
time=3D3.785..5590.294 rows=3D5000000 loops=3D1)

=C2=A0 =C2=A0 =C2=A0 =C2=A0... (omitted)

=C2=A0 =C2=A0 =C2=A0Planning:
=C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D50 read=3D48 dirtied=3D4 w= ritten=3D2
=C2=A0 =C2=A0 =C2=A0 =C2=A0Page Faults: minor=3D30 major=3D19 <-ADDED =C2=A0 =C2=A0 =C2=A0Planning Time: 22.080 ms
=C2=A0 =C2=A0 =C2=A0Execution:
=C2=A0 =C2=A0 =C2=A0 =C2=A0Page Faults: minor=3D49 major=3D5 <-ADDED
=C2=A0 =C2=A0 =C2=A0Execution Time: 5794.356 ms

The patch has not yet added this functionality to auto_explain, but I
believe this feature would be more useful in auto_explain than in the
EXPLAIN command itself. I plan to add it to auto_explain if there's interest in including page fault information in EXPLAIN.

While GUCs like log_statement_stats allow logging of getrusage(2)
information, including page faults, always enabling this can lead to
excessive log output. It would be better to log this data only when
queries are slow. Therefore, adding this feature to auto_explain seems
like a good solution.


The patch introduces a new option, PAGEFAULTS, but it may be more
appropriate to include the page fault information in another option,
such as SUMMARY, especially if there are other useful resources that can be obtained from getrusage(2).


The patch cannot be applied to Windows because getrusage() in PostgreSQL ported for Windows currently only tracks CPU times. I'm not sure if information on major and minor page faults is accessible on Windows, but it might be acceptable to treat it similarly to log_statement_stats and exclude it from Windows support.


I also tried to add page faults information for each plan node, similar to the BUFFERS option, but I decided against this approach due to the
performance impact. The frequent calls to getrusage(2), i.e. each time
getting one row, created significant overhead.


Any feedback would be greatly appreciated.


--
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
--00000000000062b07906524d53b0--