public inbox for pgsql-hackers@postgresql.org
help / color / mirror / Atom feedRe: RFC: Allow EXPLAIN to Output Page Fault Information
22+ messages / 4 participants
[nested] [flat]
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
@ 2025-02-08 13:54 Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-02-08 13:54 UTC (permalink / raw)
To: torikoshia <torikoshia@oss.nttdata.com>; +Cc: pgsql-hackers; rjuju123@gmail.com; tgl@sss.pgh.pa.us; Bruce Momjian <bruce@momjian.us>
On Mon, 27 Jan 2025 at 10:05, torikoshia <torikoshia@oss.nttdata.com> wrote:
> Therefore, I believe it would be reasonable to report the raw values
> as-is, as they should still be useful for understanding storage I/O
> activity.
Sounds reasonable.
Below some feedback on the patch. It's all really minor. The patch
looks great. I'll play around with it a bit next week.
meta: it's confusing that this one is called v1 again, it would be
clearer if it was called v2.
nit: at line 528 the "if (es->buffers)" check can simply be merged
with the if block above (which does the exact same check)
> if (usage->inblock <= 0 && usage->outblock <= 0)
> return false;
>
> else
> return true;
nit: You can replace that if-else with:
return usage->inblock > 0 || usage->outblock > 0;
> StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
Missing a function comment
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-09 11:51 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-02-09 11:51 UTC (permalink / raw)
To: torikoshia <torikoshia@oss.nttdata.com>; +Cc: pgsql-hackers; rjuju123@gmail.com; tgl@sss.pgh.pa.us; Bruce Momjian <bruce@momjian.us>
On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I'll play around with it a bit next week.
Okay, I played around with it and couldn't find any issues. I marked
the patch as "ready for committer" in the commitfest app[1], given
that all feedback in my previous email was very minor.
[1]: https://commitfest.postgresql.org/52/5526/
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-09 17:59 ` Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
0 siblings, 1 reply; 22+ messages in thread
From: Andres Freund @ 2025-02-09 17:59 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; tgl@sss.pgh.pa.us; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-02-09 12:51:40 +0100, Jelte Fennema-Nio wrote:
> On Sat, 8 Feb 2025 at 14:54, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I'll play around with it a bit next week.
>
> Okay, I played around with it and couldn't find any issues. I marked
> the patch as "ready for committer" in the commitfest app[1], given
> that all feedback in my previous email was very minor.
I'm somewhat against this patch, as it's fairly fundamentally incompatible
with AIO. There's no real way to get information in this manner if the IO
isn't executed synchronously in process context...
Greetings,
Andres
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
@ 2025-02-09 18:05 ` Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
0 siblings, 1 reply; 22+ messages in thread
From: Tom Lane @ 2025-02-09 18:05 UTC (permalink / raw)
To: Andres Freund <andres@anarazel.de>; +Cc: Jelte Fennema-Nio <postgres@jeltef.nl>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Andres Freund <andres@anarazel.de> writes:
> I'm somewhat against this patch, as it's fairly fundamentally incompatible
> with AIO. There's no real way to get information in this manner if the IO
> isn't executed synchronously in process context...
Even without looking ahead to AIO, there's bgwriter, walwriter, and
checkpointer processes that all take I/O load away from foreground
processes. I don't really believe that this will produce useful
numbers.
regards, tom lane
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
@ 2025-02-09 20:06 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
0 siblings, 2 replies; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-02-09 20:06 UTC (permalink / raw)
To: Tom Lane <tgl@sss.pgh.pa.us>; +Cc: Andres Freund <andres@anarazel.de>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > I'm somewhat against this patch, as it's fairly fundamentally incompatible
> > with AIO. There's no real way to get information in this manner if the IO
> > isn't executed synchronously in process context...
Hmm, I had not considered how this would interact with your AIO work.
I agree that getting this info would be hard/impossible to do
efficiently, when IOs are done by background IO processes that
interleave IOs from different queries. But I'd expect that AIOs that
are done using iouring would be tracked correctly without having to
change this code at all (because I assume those are done from the
query backend process).
One other thought: I think the primary benefit of this feature is
being able to see how many read IOs actually hit the disk, as opposed
to hitting OS page cache. That benefit disappears when using Direct
IO, because then there's no OS page cache.
How many years away do you think that widespread general use of
AIO+Direct IO is, though? I think that for the N years from now until
then, it would be very nice to have this feature to help debug query
performance problems. Then once the numbers become too
inaccurate/useless at some point, we could simply remove them again.
> Even without looking ahead to AIO, there's bgwriter, walwriter, and
> checkpointer processes that all take I/O load away from foreground
> processes. I don't really believe that this will produce useful
> numbers.
The bgwriter, walwriter, and checkpointer should only take away
*write* IOs. For read IOs the numbers should be very accurate and as
explained above read IOs is where I think the primary benefit of this
feature is.
But even for write IOs I think the numbers would be useful when
looking at them with the goal of finding out why a particular query is
slow: If the bgwriter or checkpointer do the writes, then the query
should be roughly as fast as if no writes to the disk had taken place
at all, but if the query process does the writes then those writes are
probably blocking further execution of the query and thus slowing it
down.
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-10 13:23 ` torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
1 sibling, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-02-10 13:23 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; andres@anarazel.de; tgl@sss.pgh.pa.us; +Cc: pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2025-02-10 05:06, Jelte Fennema-Nio wrote:
Thanks for reviewing the patch and comments!
Fixed issues you pointed out and attached v2 patch.
> On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Andres Freund <andres@anarazel.de> writes:
>> > I'm somewhat against this patch, as it's fairly fundamentally incompatible
>> > with AIO. There's no real way to get information in this manner if the IO
>> > isn't executed synchronously in process context...
>
> Hmm, I had not considered how this would interact with your AIO work.
> I agree that getting this info would be hard/impossible to do
> efficiently, when IOs are done by background IO processes that
> interleave IOs from different queries. But I'd expect that AIOs that
> are done using iouring would be tracked correctly without having to
> change this code at all (because I assume those are done from the
> query backend process).
>
> One other thought: I think the primary benefit of this feature is
> being able to see how many read IOs actually hit the disk, as opposed
> to hitting OS page cache. That benefit disappears when using Direct
> IO, because then there's no OS page cache.
>
> How many years away do you think that widespread general use of
> AIO+Direct IO is, though? I think that for the N years from now until
> then, it would be very nice to have this feature to help debug query
> performance problems. Then once the numbers become too
> inaccurate/useless at some point, we could simply remove them again.
AIO efforts are something I haven't fully grasped yet, but Jelte's
comments seem reasonable to me.
Of course, as someone proposing this, I'm naturally biased toward
thinking it’s beneficial.
What do you think?
>> Even without looking ahead to AIO, there's bgwriter, walwriter, and
>> checkpointer processes that all take I/O load away from foreground
>> processes. I don't really believe that this will produce useful
>> numbers.
>
> The bgwriter, walwriter, and checkpointer should only take away
> *write* IOs. For read IOs the numbers should be very accurate and as
> explained above read IOs is where I think the primary benefit of this
> feature is.
>
> But even for write IOs I think the numbers would be useful when
> looking at them with the goal of finding out why a particular query is
> slow: If the bgwriter or checkpointer do the writes, then the query
> should be roughly as fast as if no writes to the disk had taken place
> at all, but if the query process does the writes then those writes are
> probably blocking further execution of the query and thus slowing it
> down.
I agree with this as well.
For example, in a SELECT query executed immediately after a large number
of INSERTs, we can observe that writes to storage occur due to WAL
writes for hint bits.
This makes the query take longer compared to a scenario where these
writes do not occur.
I think we can guess what is happening from the output:
postgres=# insert into t1 (select i, repeat('a', 1000) from
generate_series(1, 1000000) i);
INSERT 0 1000000
postgres=# explain analyze table t1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=1.926..11035.531 rows=1000100 loops=1)
Buffers: shared read=168575 dirtied=142858 written=142479
Planning:
Buffers: shared hit=3 read=3 written=1
Storage I/O: read=48 times write=16 times
Planning Time: 4.472 ms
Execution:
Storage I/O: read=2697272 times write=4480096 times // many writes
Execution Time: 11099.424 ms // slow
(9 rows)
postgres=# explain analyze table t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=2.066..2926.394 rows=1000100 loops=1)
Buffers: shared read=168575 written=14
Planning Time: 0.295 ms
Execution:
Storage I/O: read=2697200 times write=224 times // few writes
Execution Time: 3016.257 ms // fast
(6 rows)
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
[text/x-diff] v2-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (33.9K, 2-v2-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 5459326fcfca8e119ce5d694f2180839958b3d5c Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 10 Feb 2025 21:53:56 +0900
Subject: [PATCH v2] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison. By measuring the numbers of
storage I/O read and write, we can better identify if storage I/O is a
bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'll implement it later.
Reviewed-by: Jelte Fennema-Nio <postgres@jeltef.nl>
---
doc/src/sgml/ref/explain.sgml | 25 ++++--
src/backend/access/brin/brin.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 110 +++++++++++++++++++++++++-
src/backend/commands/prepare.c | 8 ++
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 ++++++--
src/backend/executor/instrument.c | 62 ++++++++++++++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 19 ++++-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain.out | 37 ++++++++-
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 292 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 6361a14e65..7b45a34fc2 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -198,13 +198,24 @@ ROLLBACK;
previously unmodified blocks that were changed by this query; while the
number of blocks <emphasis>written</emphasis> indicates the number of
previously-dirtied blocks evicted from cache by this backend during
- query processing.
- The number of blocks shown for an
- upper-level node includes those used by all its child nodes. In text
- format, only non-zero values are printed. Buffers information is
- included by default when <literal>ANALYZE</literal> is used but
- otherwise is not included by default, but can be enabled using this
- option.
+ query processing. In text format, only non-zero values are printed.
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 KB.
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccf824bbdb..fc711be2e1 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2551,7 +2551,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2913,7 +2913,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2928,8 +2928,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db9..98cfde8875 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1619,7 +1619,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1827,7 +1827,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,8 +1837,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82e..d23bbb3699 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -145,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -475,6 +477,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -496,7 +500,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -516,11 +523,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ storageio.inblock -= storageio_start.inblock;
+ storageio.outblock -= storageio_start.outblock;
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -644,7 +656,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -654,6 +666,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -663,7 +676,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -747,8 +772,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -760,8 +786,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -813,6 +841,35 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio = {0};
+ StorageIOUsage storageio_end;
+
+ GetStorageIOUsage(&storageio_end);
+ StorageIOUsageAccumDiff(&storageio, &storageio_end, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4232,6 +4289,51 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 8989c0c882..e11601b21d 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -579,6 +579,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -594,7 +596,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -644,6 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -656,6 +663,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index dc3322c256..cd4bf22082 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
TidStoreDetach(dead_items);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 134ff62f5c..aced0bbe4b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -64,6 +64,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -599,6 +600,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -680,6 +682,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -775,6 +784,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1170,11 +1185,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1406,6 +1421,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start = {0};
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1459,13 +1476,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1478,11 +1496,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storageIO usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2d3569b374..94664d5fab 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,21 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/* Only count parallel workers' usage */
+StorageIOUsage pgStorageIOUsageParallel;
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +202,46 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +297,38 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ea7419951f..8e67dd57b3 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -108,6 +108,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397..5c8bc76c53 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5a6eff75c6..7fc7281b1a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -99,6 +107,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -109,11 +118,15 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb584..270dc37c84 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c9437..27f0ea052a 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index ee31e41d50..f569d8f6fd 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -127,10 +127,16 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Local-Written-Blocks>N</Local-Written-Blocks> +
<Temp-Read-Blocks>N</Temp-Read-Blocks> +
<Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
</Planning> +
<Planning-Time>N.N</Planning-Time> +
<Triggers> +
</Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
<Execution-Time>N.N</Execution-Time> +
</Query> +
</explain>
@@ -175,6 +181,8 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
Planning Time: N.N +
Triggers: +
Serialization: +
@@ -191,6 +199,9 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
Execution Time: N.N
(1 row)
@@ -237,7 +248,13 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Local Dirtied Blocks": N, +
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
- "Temp Written Blocks": N +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
} +
} +
]
@@ -299,11 +316,17 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Local I/O Read Time": N.N, +
"Local I/O Write Time": N.N, +
"Temp I/O Read Time": N.N, +
- "Temp I/O Write Time": N.N +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -423,12 +446,18 @@ select explain_filter('explain (memory, analyze, format json) select * from int8
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
"Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
"Memory Used": N, +
"Memory Allocated": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -641,6 +670,7 @@ select jsonb_pretty(
}, +
"Planning": { +
"Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
"Temp Read Blocks": 0, +
"Local Read Blocks": 0, +
"Shared Hit Blocks": 0, +
@@ -653,6 +683,9 @@ select jsonb_pretty(
}, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
"Planning Time": 0.0, +
"Execution Time": 0.0 +
} +
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a3bee93de..a5eeaca55c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 3d17d7d7fb7a4603b48acb275b5a416f110db464
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2025-03-17 23:52 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-03-17 23:52 UTC (permalink / raw)
To: torikoshia <torikoshia@oss.nttdata.com>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Mon, 10 Feb 2025 at 14:23, torikoshia <torikoshia@oss.nttdata.com> wrote:
> Thanks for reviewing the patch and comments!
> Fixed issues you pointed out and attached v2 patch.
This patch needs a rebase, because it's failing to compile currently.
So I marked this as "Waiting on Author" in the commitfest app.
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-03-19 13:15 ` torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-03-19 13:15 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-03-18 08:52, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:23, torikoshia <torikoshia@oss.nttdata.com>
> wrote:
>> Thanks for reviewing the patch and comments!
>> Fixed issues you pointed out and attached v2 patch.
>
> This patch needs a rebase, because it's failing to compile currently.
> So I marked this as "Waiting on Author" in the commitfest app.
Thanks! I've attached an updated patch.
BTW based on your discussion, I thought this patch could not be merged
anytime soon. Does that align with your understanding?
- With bgworker-based AIO, this patch could mislead users into
underestimating the actual storage I/O load, which is undesirable.
- With io_uring-based AIO, this patch could provide meaningful values,
but it may take some time before io_uring sees widespread adoption.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
[text/x-diff] v3-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (35.5K, 2-v3-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 5c889804ebc145173e7c9cf2519c7a91ede05bcf Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Wed, 19 Mar 2025 22:07:19 +0900
Subject: [PATCH v3] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison. By measuring the numbers of
storage I/O read and write, we can better identify if storage I/O is a
bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'll implement it later.
---
doc/src/sgml/ref/explain.sgml | 25 ++++--
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 110 +++++++++++++++++++++++++-
src/backend/commands/prepare.c | 8 ++
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 ++++++--
src/backend/executor/instrument.c | 62 ++++++++++++++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 19 ++++-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain.out | 37 ++++++++-
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 296 insertions(+), 42 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 9ed1061b7f..5a66b10afe 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -198,13 +198,24 @@ ROLLBACK;
previously unmodified blocks that were changed by this query; while the
number of blocks <emphasis>written</emphasis> indicates the number of
previously-dirtied blocks evicted from cache by this backend during
- query processing.
- The number of blocks shown for an
- upper-level node includes those used by all its child nodes. In text
- format, only non-zero values are printed. Buffers information is
- included by default when <literal>ANALYZE</literal> is used but
- otherwise is not included by default, but can be enabled using this
- option.
+ query processing. In text format, only non-zero values are printed.
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 KB.
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 737ad63880..fdabcf506b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2558,7 +2558,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2920,7 +2920,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2935,8 +2935,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad88..d768f2dc17 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1084,7 +1084,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2135,7 +2135,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2150,8 +2150,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924a..c66e742909 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1618,7 +1618,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1826,7 +1826,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1836,8 +1836,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 33a16d2d8e..d16833f31e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,6 +144,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -325,6 +327,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -346,7 +350,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -366,12 +373,17 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ storageio.inblock -= storageio_start.inblock;
+ storageio.outblock -= storageio_start.outblock;
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, NULL, NULL, -1, into, es, queryString, params,
queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -497,7 +509,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -507,6 +519,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -516,7 +529,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -609,8 +634,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -622,8 +648,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -680,6 +708,35 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio = {0};
+ StorageIOUsage storageio_end;
+
+ GetStorageIOUsage(&storageio_end);
+ StorageIOUsageAccumDiff(&storageio, &storageio_end, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4270,6 +4327,51 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf7d2b2309..d1e73a0793 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,6 +583,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -599,7 +601,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -649,6 +655,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -662,6 +669,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ExplainOnePlan(pstmt, cplan, entry->plansource, query_index,
into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cde..f77124f8c5 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e9337a97d1..1728e1cdd0 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -65,6 +65,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -608,6 +609,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -689,6 +691,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -784,6 +793,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1189,11 +1204,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1436,6 +1451,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start = {0};
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1490,13 +1507,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1509,11 +1527,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storageIO usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f470..5fa33b97da 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,21 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/* Only count parallel workers' usage */
+StorageIOUsage pgStorageIOUsageParallel;
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +202,46 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +297,38 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 387839eb5d..0becde3319 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -70,6 +70,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(struct ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397..5c8bc76c53 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6..e09c2f8943 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -100,6 +108,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -110,11 +119,15 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb584..270dc37c84 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c9437..27f0ea052a 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 340747a8f7..f635525ee5 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -127,10 +127,16 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Local-Written-Blocks>N</Local-Written-Blocks> +
<Temp-Read-Blocks>N</Temp-Read-Blocks> +
<Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
</Planning> +
<Planning-Time>N.N</Planning-Time> +
<Triggers> +
</Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
<Execution-Time>N.N</Execution-Time> +
</Query> +
</explain>
@@ -175,6 +181,8 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
Planning Time: N.N +
Triggers: +
Serialization: +
@@ -191,6 +199,9 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
Execution Time: N.N
(1 row)
@@ -237,7 +248,13 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Local Dirtied Blocks": N, +
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
- "Temp Written Blocks": N +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
} +
} +
]
@@ -335,11 +352,17 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Local I/O Read Time": N.N, +
"Local I/O Write Time": N.N, +
"Temp I/O Read Time": N.N, +
- "Temp I/O Write Time": N.N +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -459,12 +482,18 @@ select explain_filter('explain (memory, analyze, format json) select * from int8
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
"Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
"Memory Used": N, +
"Memory Allocated": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -677,6 +706,7 @@ select jsonb_pretty(
}, +
"Planning": { +
"Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
"Temp Read Blocks": 0, +
"Local Read Blocks": 0, +
"Shared Hit Blocks": 0, +
@@ -689,6 +719,9 @@ select jsonb_pretty(
}, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
"Planning Time": 0.0, +
"Execution Time": 0.0 +
} +
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bfa276d2d3..b4afc49048 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2608,6 +2608,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 190dc27998d5b7b4c36e12bebe62f7176f4b4507
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2025-03-22 11:23 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-03-22 11:23 UTC (permalink / raw)
To: torikoshia <torikoshia@oss.nttdata.com>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia@oss.nttdata.com> wrote:
> BTW based on your discussion, I thought this patch could not be merged
> anytime soon. Does that align with your understanding?
Yeah, that aligns with my understanding. I don't think it's realistic
to get this merged before the code freeze, but I think both of the
below issues could be resolved.
> - With bgworker-based AIO, this patch could mislead users into
> underestimating the actual storage I/O load, which is undesirable.
To resolve this, I think the patch would need to change to not report
anything if bgworker-based AIO is used. So I moved this patch to the
next commitfest, and marked it as "waiting for author" there.
> - With io_uring-based AIO, this patch could provide meaningful values,
> but it may take some time before io_uring sees widespread adoption.
I submitted this patch to help make io_uring-based AIO more of a reality:
https://commitfest.postgresql.org/patch/5570/
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-03-25 01:27 ` torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-03-25 01:27 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia@oss.nttdata.com>
> wrote:
>> BTW based on your discussion, I thought this patch could not be merged
>> anytime soon. Does that align with your understanding?
>
> Yeah, that aligns with my understanding. I don't think it's realistic
> to get this merged before the code freeze, but I think both of the
> below issues could be resolved.
>
>> - With bgworker-based AIO, this patch could mislead users into
>> underestimating the actual storage I/O load, which is undesirable.
>
> To resolve this, I think the patch would need to change to not report
> anything if bgworker-based AIO is used.
Agreed.
I feel the new GUC io_method can be used to determine whether
bgworker-based AIO is being used.
> So I moved this patch to the
> next commitfest, and marked it as "waiting for author" there.
Thanks for moving it.
>> - With io_uring-based AIO, this patch could provide meaningful values,
>> but it may take some time before io_uring sees widespread adoption.
>
> I submitted this patch to help make io_uring-based AIO more of a
> reality:
> https://commitfest.postgresql.org/patch/5570/
Thanks for working on that, too.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2025-04-11 13:18 ` torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-04-11 13:18 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2025-03-25 10:27, torikoshia wrote:
> On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
>
>> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia@oss.nttdata.com>
>> wrote:
>>> BTW based on your discussion, I thought this patch could not be
>>> merged
>>> anytime soon. Does that align with your understanding?
>>
>> Yeah, that aligns with my understanding. I don't think it's realistic
>> to get this merged before the code freeze, but I think both of the
>> below issues could be resolved.
>>
>>> - With bgworker-based AIO, this patch could mislead users into
>>> underestimating the actual storage I/O load, which is undesirable.
>>
>> To resolve this, I think the patch would need to change to not report
>> anything if bgworker-based AIO is used.
>
> Agreed.
> I feel the new GUC io_method can be used to determine whether
> bgworker-based AIO is being used.
I took this approach and when io_method=worker, no additional output is
shown in the attached patch.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
[text/x-diff] v4-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (71.3K, 2-v4-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From e80e53eb36f7909ca8638b26d3cd61a58a5bc889 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Fri, 11 Apr 2025 22:01:22 +0900
Subject: [PATCH v4] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
This output is shown when io_method=worker, since asynchronous workers
handle I/O for multiple processes, and isolating the EXPLAIN target's
I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'll implement it later.
---
doc/src/sgml/ref/explain.sgml | 25 +-
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 +++-
src/backend/commands/prepare.c | 8 +
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 79 ++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain_1.out | 849 ++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 1145 insertions(+), 38 deletions(-)
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 9ed1061b7f..493afe6a34 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -201,10 +201,27 @@ ROLLBACK;
query processing.
The number of blocks shown for an
upper-level node includes those used by all its child nodes. In text
- format, only non-zero values are printed. Buffers information is
- included by default when <literal>ANALYZE</literal> is used but
- otherwise is not included by default, but can be enabled using this
- option.
+ format, only non-zero values are printed.
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown. Also, When
+ <varname>io_method</varname> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f85..c6a8f74213 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2557,7 +2557,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2919,7 +2919,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,8 +2934,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e25d817c19..5463d81c96 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1084,7 +1084,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2135,7 +2135,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2150,8 +2150,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924a..c66e742909 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1618,7 +1618,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1826,7 +1826,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1836,8 +1836,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 786ee865f1..2e391b347b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -325,6 +328,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -346,7 +351,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -361,17 +369,21 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, NULL, NULL, -1, into, es, queryString, params,
queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -497,7 +509,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -507,6 +519,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -516,7 +529,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -609,8 +634,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -622,8 +648,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -680,6 +708,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4260,6 +4316,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf7d2b2309..68b87aab0c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,6 +583,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -599,7 +601,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -649,6 +655,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -662,6 +669,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ExplainOnePlan(pstmt, cplan, entry->plansource, query_index,
into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cde..f77124f8c5 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 39c990ae63..cf46633100 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -65,6 +65,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -609,6 +610,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -690,6 +692,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -785,6 +794,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1190,11 +1205,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1437,6 +1452,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1491,13 +1508,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1510,11 +1528,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f470..9cb0e9300b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,22 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+StorageIOUsage pgStorageIOUsageParallel; /* only count parallel workers'
+ * usage */
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +203,47 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +299,53 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 387839eb5d..0becde3319 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -70,6 +70,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(struct ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397..5c8bc76c53 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6..5392f05022 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -100,6 +108,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -110,11 +119,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb584..270dc37c84 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c9437..27f0ea052a 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 0000000000..215ce1818b
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,849 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
+reset work_mem;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d16bc20865..0d7f4579b9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2636,6 +2636,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 914ea1c93c0e446a0cd174497fd6a22fd6071c5e
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2025-05-08 13:51 ` torikoshia <torikoshia@oss.nttdata.com>
2025-10-28 08:43 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-05-08 13:51 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2025-04-11 22:18, torikoshia wrote:
> On 2025-03-25 10:27, torikoshia wrote:
>> On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
>>
>>> On Wed, 19 Mar 2025 at 14:15, torikoshia <torikoshia@oss.nttdata.com>
>>> wrote:
>>>> BTW based on your discussion, I thought this patch could not be
>>>> merged
>>>> anytime soon. Does that align with your understanding?
>>>
>>> Yeah, that aligns with my understanding. I don't think it's realistic
>>> to get this merged before the code freeze, but I think both of the
>>> below issues could be resolved.
>>>
>>>> - With bgworker-based AIO, this patch could mislead users into
>>>> underestimating the actual storage I/O load, which is undesirable.
>>>
>>> To resolve this, I think the patch would need to change to not report
>>> anything if bgworker-based AIO is used.
>>
>> Agreed.
>> I feel the new GUC io_method can be used to determine whether
>> bgworker-based AIO is being used.
>
> I took this approach and when io_method=worker, no additional output
> is shown in the attached patch.
Rebased the patch.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
[text/x-diff] v5-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (71.1K, 2-v5-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 822620b7d7af2894ad126797ba52cd7f3326a676 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshi@sraoss.co.jp>
Date: Thu, 8 May 2025 22:43:02 +0900
Subject: [PATCH v5] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
This output is shown when io_method=worker, since asynchronous workers
handle I/O for multiple processes, and isolating the EXPLAIN target's
I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'm going to implement it if this patch is merged.
---
doc/src/sgml/ref/explain.sgml | 24 +
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 +++-
src/backend/commands/prepare.c | 8 +
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 79 ++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain_1.out | 849 ++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 1148 insertions(+), 34 deletions(-)
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 6dda680aa0..86b66e06cd 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -204,6 +204,30 @@ ROLLBACK;
format, only non-zero values are printed. Buffers information is
automatically included when <literal>ANALYZE</literal> is used.
</para>
+ <para>
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown. Also, When
+ <varname>io_method</varname> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ </para>
+ <para>
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f85..c6a8f74213 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2557,7 +2557,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2919,7 +2919,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,8 +2934,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd8910..e439381009 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1084,7 +1084,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2147,7 +2147,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2162,8 +2162,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924a..c66e742909 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1618,7 +1618,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1826,7 +1826,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1836,8 +1836,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 786ee865f1..2e391b347b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -325,6 +328,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -346,7 +351,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -361,17 +369,21 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, NULL, NULL, -1, into, es, queryString, params,
queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -497,7 +509,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -507,6 +519,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -516,7 +529,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -609,8 +634,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -622,8 +648,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -680,6 +708,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4260,6 +4316,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf7d2b2309..68b87aab0c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -583,6 +583,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -599,7 +601,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -649,6 +655,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -662,6 +669,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ExplainOnePlan(pstmt, cplan, entry->plansource, query_index,
into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cde..f77124f8c5 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 39c990ae63..cf46633100 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -65,6 +65,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -609,6 +610,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -690,6 +692,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -785,6 +794,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1190,11 +1205,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1437,6 +1452,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1491,13 +1508,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1510,11 +1528,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f470..9cb0e9300b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,22 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+StorageIOUsage pgStorageIOUsageParallel; /* only count parallel workers'
+ * usage */
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +203,47 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +299,53 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 03c5b3d73e..02e6b85aac 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -70,6 +70,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, CachedPlan *cplan,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(struct ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397..5c8bc76c53 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6..5392f05022 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -100,6 +108,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -110,11 +119,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb584..270dc37c84 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c9437..27f0ea052a 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 0000000000..215ce1818b
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,849 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
+reset work_mem;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9ea573fae2..4148ae3317 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2638,6 +2638,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 8fcc6487809efa5508a4b70049402236a53be84d
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2025-10-28 08:43 ` torikoshia <torikoshia@oss.nttdata.com>
2026-01-25 14:35 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2025-10-28 08:43 UTC (permalink / raw)
To: pgsql-hackers; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>; postgres@jeltef.nl
On 2025-05-08 22:51, torikoshia wrote:
> On 2025-04-11 22:18, torikoshia wrote:
>> On 2025-03-25 10:27, torikoshia wrote:
>>> On 2025-03-22 20:23, Jelte Fennema-Nio wrote:
>>>
>>>> On Wed, 19 Mar 2025 at 14:15, torikoshia
>>>> <torikoshia@oss.nttdata.com> wrote:
>>>>> BTW based on your discussion, I thought this patch could not be
>>>>> merged
>>>>> anytime soon. Does that align with your understanding?
>>>>
>>>> Yeah, that aligns with my understanding. I don't think it's
>>>> realistic
>>>> to get this merged before the code freeze, but I think both of the
>>>> below issues could be resolved.
>>>>
>>>>> - With bgworker-based AIO, this patch could mislead users into
>>>>> underestimating the actual storage I/O load, which is undesirable.
>>>>
>>>> To resolve this, I think the patch would need to change to not
>>>> report
>>>> anything if bgworker-based AIO is used.
>>>
>>> Agreed.
>>> I feel the new GUC io_method can be used to determine whether
>>> bgworker-based AIO is being used.
>>
>> I took this approach and when io_method=worker, no additional output
>> is shown in the attached patch.
>
Rebased the patch again.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA Japan Corporation to SRA OSS K.K.
Attachments:
[text/x-diff] v6-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (71.4K, 2-v6-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 56ce089616803b69fa2568cc9b2aa3e605af0a52 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Tue, 28 Oct 2025 17:37:11 +0900
Subject: [PATCH v6] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
This output is shown when io_method=worker, since asynchronous workers
handle I/O for multiple processes, and isolating the EXPLAIN target's
I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'm going to implement it if this patch is merged.
Squashed commit of the following:
---
doc/src/sgml/ref/explain.sgml | 24 +
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 +++-
src/backend/commands/prepare.c | 8 +
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 79 ++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain_1.out | 857 ++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 1156 insertions(+), 34 deletions(-)
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 6dda680aa0d..86b66e06cd3 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -204,6 +204,30 @@ ROLLBACK;
format, only non-zero values are printed. Buffers information is
automatically included when <literal>ANALYZE</literal> is used.
</para>
+ <para>
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown. Also, When
+ <varname>io_method</varname> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ </para>
+ <para>
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2f7d1437919..5be98e2e568 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2557,7 +2557,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2919,7 +2919,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,8 +2934,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..dc1225e5a83 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1088,7 +1088,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2151,7 +2151,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2166,8 +2166,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 313fe66bc96..66244cfb1f9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1617,7 +1617,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1825,7 +1825,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1835,8 +1835,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e6edae0845c..acc3250f101 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -326,6 +329,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -347,7 +352,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -362,16 +370,20 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -495,7 +507,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -505,6 +517,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -514,7 +527,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -598,8 +623,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -611,8 +637,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -669,6 +697,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4273,6 +4329,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..89ce3377313 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -582,6 +582,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -597,7 +599,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -647,6 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -659,6 +666,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..23d028eea7e 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..7a407c4ecf7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -65,6 +65,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -610,6 +611,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -691,6 +693,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -786,6 +795,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1191,11 +1206,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1431,6 +1446,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1484,13 +1501,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1503,11 +1521,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..9cb0e9300b4 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,22 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+StorageIOUsage pgStorageIOUsageParallel; /* only count parallel workers'
+ * usage */
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +203,47 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +299,53 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 6e51d50efc7..d27b2400e16 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -68,6 +68,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397a..5c8bc76c53d 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6c..5392f05022e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -100,6 +108,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -110,11 +119,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb5844..270dc37c84f 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c94376..27f0ea052a1 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 00000000000..426ebc2aa34
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,857 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+explain (costs off) select 1 as a, 2 as b having false;
+ QUERY PLAN
+--------------------------
+ Result
+ Replaces: Aggregate
+ One-Time Filter: false
+(3 rows)
+
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
+reset work_mem;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bb4e1b37005..d6492847eeb 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2657,6 +2657,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-10-28 08:43 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2026-01-25 14:35 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2026-01-28 13:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2026-01-25 14:35 UTC (permalink / raw)
To: torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; +Cc: andres@anarazel.de; tgl@sss.pgh.pa.us; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Tue Oct 28, 2025 at 9:43 AM CET, torikoshia wrote:
> Rebased the patch again.
I took another look at this patch because I think it would be really
useful to have. Below is my review:
1.
ExplainPropertyInteger("Storage I/O Read", NULL,
usage->inblock, es);
ExplainPropertyInteger("Storage I/O Read", NULL,
usage->outblock, es);
The second one is a copy paste error and should say Storage I/O Write
instead of read.
2. I think the comment on pgStorageIOUsageParallel could be a bit
clearer on how it works, because it's different than how we
accumulate most others explain numbers. Something like:
Accumulates the I/O usage send by parallel workers to the main
process. This does not contain the I/O from the main backend process
itself because the kernel tracks that instead of us.
3. I'm not sure if I like the "times" suffix either, because it doesn't
mean that did that many reads. It probably read a bunch of blocks in
one go. Maybe just put the number without a suffix.
4. I think it would be good to use the string "Storage I/O" in the docs,
so that it's easily findable for people that try to figure out how it
works.
5. I think this debug statement can be removed:
ereport(DEBUG1,
(errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
storageiousage->inblock, storageiousage->outblock)));
6. The prepare exacute planning calculation is missing a
StorageIOUsageDiff call, so it will report the number since process
start.
7. I think it would be good for GetStorageIOUsage to stil initialize the
struct values even if pgaio workers are enabled. Let's just set them
to 0 in that case.
8. Now that worker is the default io_method the new explain fields are never
present in the explain tests anymore. So explain_1.out is
unnecessary and can be removed. That obviously also means that
there's no coverage for this feature at all in the tests currently,
which is clearly a problem. So I think it would be good to add some
actual tests for the feature using some other io_method in a Perl TAP
test.
9. The added docs about the kernel not being built with the correct
options seems a bit too much detail. I'd say remove that sentence.
And maybe shorten th
10. nit: There's "Also, When..." in the docs, when should be lower case
there.
Attached is your original patch plus a fixup patch that address:
1, 2, 3, 5, 6 and 7.
Attachments:
[text/x-patch] v7-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (71.4K, 2-v7-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 3001695ca38e4ea65c635ff90de9da53d251590b Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Tue, 28 Oct 2025 17:37:11 +0900
Subject: [PATCH v7 1/2] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
This output is shown when io_method=worker, since asynchronous workers
handle I/O for multiple processes, and isolating the EXPLAIN target's
I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'm going to implement it if this patch is merged.
Squashed commit of the following:
---
doc/src/sgml/ref/explain.sgml | 24 +
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 +++-
src/backend/commands/prepare.c | 8 +
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 79 ++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain_1.out | 857 ++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 1156 insertions(+), 34 deletions(-)
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..244990a99fb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -204,6 +204,30 @@ ROLLBACK;
format, only non-zero values are printed. Buffers information is
automatically included when <literal>ANALYZE</literal> is used.
</para>
+ <para>
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not support <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown. Also, When
+ <varname>io_method</varname> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ </para>
+ <para>
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6887e421442..9b34914140f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2572,7 +2572,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2934,7 +2934,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2949,8 +2949,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d63fb4ba27..5f34524be3b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1116,7 +1116,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2176,7 +2176,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2191,8 +2191,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 90ab4e91b56..f4609017d80 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1617,7 +1617,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1825,7 +1825,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1835,8 +1835,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..7902866da7b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -326,6 +329,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -347,7 +352,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -362,16 +370,20 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -495,7 +507,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -505,6 +517,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -514,7 +527,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -598,8 +623,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -611,8 +637,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -669,6 +697,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4275,6 +4331,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..08bf28d2078 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -582,6 +582,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -597,7 +599,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -647,6 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
}
plan_list = cplan->stmt_list;
@@ -659,6 +666,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c3b3c9ea21a..d19cd04e421 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 772e81f3154..081acf483b4 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -621,6 +622,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -702,6 +704,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -797,6 +806,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1207,11 +1222,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1452,6 +1467,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1505,13 +1522,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1524,11 +1542,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..316671c3ced 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,22 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+StorageIOUsage pgStorageIOUsageParallel; /* only count parallel workers'
+ * usage */
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -194,27 +200,47 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -270,6 +296,53 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 86226f8db70..7625f08c95e 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -68,6 +68,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..f91ef700991 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..b95157d5588 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -101,6 +109,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -111,11 +120,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb5844..270dc37c84f 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index fa2b79cd5ed..2e267de7b28 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 00000000000..426ebc2aa34
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,857 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+explain (costs off) select 1 as a, 2 as b having false;
+ QUERY PLAN
+--------------------------
+ Result
+ Replaces: Aggregate
+ One-Time Filter: false
+(3 rows)
+
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Read: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
+reset work_mem;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1c8610fd46c..cd696b9fccc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2691,6 +2691,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: a9bdb63bba8a631cd4797393307eecf5fcde9167
--
2.52.0
[text/x-patch] v7-0002-fixup-Add-storage-I-O-tracking-to-BUFFERS-option.patch (3.6K, 3-v7-0002-fixup-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 2a45234fca1ba77cd0080d366fdf879ef9165d9f Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio <postgres@jeltef.nl>
Date: Sun, 25 Jan 2026 15:29:18 +0100
Subject: [PATCH v7 2/2] fixup! Add storage I/O tracking to 'BUFFERS' option
---
src/backend/commands/explain.c | 6 +++---
src/backend/commands/prepare.c | 4 +++-
src/backend/executor/instrument.c | 17 +++++++++++------
3 files changed, 17 insertions(+), 10 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7902866da7b..8f668ee46c7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -4376,8 +4376,8 @@ show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
ExplainIndentText(es);
appendStringInfoString(es->str, "Storage I/O:");
- appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
- appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+ appendStringInfo(es->str, " read=%ld", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld", (long) usage->outblock);
appendStringInfoChar(es->str, '\n');
}
@@ -4385,7 +4385,7 @@ show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
{
ExplainPropertyInteger("Storage I/O Read", NULL,
usage->inblock, es);
- ExplainPropertyInteger("Storage I/O Read", NULL,
+ ExplainPropertyInteger("Storage I/O Write", NULL,
usage->outblock, es);
}
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 08bf28d2078..7c38f1393fd 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -648,12 +648,14 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
plan_list = cplan->stmt_list;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 316671c3ced..902e7e54b57 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -22,8 +22,13 @@
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
-StorageIOUsage pgStorageIOUsageParallel; /* only count parallel workers'
- * usage */
+/*
+ * Accumulates the I/O usage send by parallel workers to the main
+ * process. This does not contain the I/O from the main backend process
+ * itself because the kernel tracks that instead of us.
+ */
+StorageIOUsage pgStorageIOUsageParallel;
+
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
@@ -224,10 +229,6 @@ InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, Wal
memset(storageiousage, 0, sizeof(StorageIOUsage));
StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
-
- ereport(DEBUG1,
- (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
- storageiousage->inblock, storageiousage->outblock)));
}
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
@@ -331,7 +332,11 @@ GetStorageIOUsage(StorageIOUsage *usage)
* I/O, don't get the I/O usage statistics when AIO worker is enabled.
*/
if (pgaio_workers_enabled())
+ {
+ usage->inblock = 0;
+ usage->outblock = 0;
return;
+ }
if (getrusage(RUSAGE_SELF, &rusage))
{
--
2.52.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-10-28 08:43 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2026-01-25 14:35 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2026-01-28 13:27 ` torikoshia <torikoshia@oss.nttdata.com>
2026-01-28 14:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 1 reply; 22+ messages in thread
From: torikoshia @ 2026-01-28 13:27 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: pgsql-hackers; andres@anarazel.de; tgl@sss.pgh.pa.us; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2026-01-25 23:35, Jelte Fennema-Nio wrote:
> On Tue Oct 28, 2025 at 9:43 AM CET, torikoshia wrote:
>> Rebased the patch again.
>
> I took another look at this patch because I think it would be really
> useful to have. Below is my review:
Thank you for the review!
> 8. Now that worker is the default io_method the new explain fields are
> never
> present in the explain tests anymore. So explain_1.out is
> unnecessary and can be removed. That obviously also means that
> there's no coverage for this feature at all in the tests currently,
> which is clearly a problem. So I think it would be good to add some
> actual tests for the feature using some other io_method in a Perl
> TAP
> test.
However, I noticed that there are test environments where io_method is
set to io_uring (for example, Linux Debian Trixie builds on Cirrus CI
using Meson), so it appears that explain_1.out is still needed.
Given this, adding the Perl TAP test might be somewhat redundant, but
for now I’m attaching a patch that includes both.
> 9. The added docs about the kernel not being built with the correct
> options seems a bit too much detail. I'd say remove that sentence.
> And maybe shorten th
Is this comment still a work in progress?
I agree with the other comments. Attached an updated patch reflecting
them.
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA Japan Corporation to SRA OSS K.K
Attachments:
[text/x-diff] v8-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (74.6K, 2-v8-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From 1cfd9089e320e8177b166aff004a4065645df948 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Wed, 28 Jan 2026 22:19:52 +0900
Subject: [PATCH v8] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
This patch enables to track storage I/O usage by calling getrusage(2)
at both the planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
Note that no output is shown when io_method=worker, since asynchronous
workers handle I/O for multiple processes, and isolating the EXPLAIN
target's I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'm going to implement it if this patch is merged.
---
doc/src/sgml/ref/explain.sgml | 25 +
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 ++-
src/backend/commands/prepare.c | 12 +-
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 84 +-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
.../test_misc/t/011_explain_storage_io.pl | 89 ++
src/test/regress/expected/explain_1.out | 858 ++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
17 files changed, 1255 insertions(+), 35 deletions(-)
create mode 100644 src/test/modules/test_misc/t/011_explain_storage_io.pl
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..d50c18855c3 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -204,6 +204,31 @@ ROLLBACK;
format, only non-zero values are printed. Buffers information is
automatically included when <literal>ANALYZE</literal> is used.
</para>
+ <para>
+ If possible, this option also displays <emphasis>Storage I/O</emphasis>
+ statistics at the end of the plan. The storage I/O
+ <emphasis>read</emphasis> value indicates the number of read operations
+ performed on storage during query planning and execution, while the
+ Storage I/O <emphasis>write</emphasis> value indicates the number of
+ write operations performed on storage during these phases.
+ These values are obtained from the <function>getrusage()</function> system
+ call. Note that on platforms that do not support
+ <function>getrusage()</function>, such as Windows, no output will be shown,
+ even if reads or writes actually occur. Also, when
+ <xref linkend="guc-io-method"/> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ </para>
+ <para>
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6887e421442..9b34914140f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2572,7 +2572,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2934,7 +2934,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2949,8 +2949,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d63fb4ba27..5f34524be3b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1116,7 +1116,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2176,7 +2176,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2191,8 +2191,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 90ab4e91b56..f4609017d80 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1617,7 +1617,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1825,7 +1825,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1835,8 +1835,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..8f668ee46c7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -326,6 +329,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -347,7 +352,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -362,16 +370,20 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -495,7 +507,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -505,6 +517,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -514,7 +527,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -598,8 +623,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -611,8 +637,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -669,6 +697,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4275,6 +4331,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Write", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..7c38f1393fd 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -582,6 +582,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -597,7 +599,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -642,11 +648,14 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
plan_list = cplan->stmt_list;
@@ -659,6 +668,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c3b3c9ea21a..d19cd04e421 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 772e81f3154..081acf483b4 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -621,6 +622,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -702,6 +704,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -797,6 +806,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1207,11 +1222,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1452,6 +1467,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1505,13 +1522,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1524,11 +1542,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..4d3711b4aa9 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,27 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/*
+ * Accumulates the I/O usage sent by parallel workers to the main
+ * process. This does not contain the I/O from the main backend process
+ * itself because the kernel tracks that instead of us.
+ */
+StorageIOUsage pgStorageIOUsageParallel;
+
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -194,27 +205,43 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -270,6 +297,57 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ {
+ usage->inblock = 0;
+ usage->outblock = 0;
+ return;
+ }
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 86226f8db70..7625f08c95e 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -68,6 +68,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..f91ef700991 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..b95157d5588 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -101,6 +109,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -111,11 +120,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb5844..270dc37c84f 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index fa2b79cd5ed..2e267de7b28 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/modules/test_misc/t/011_explain_storage_io.pl b/src/test/modules/test_misc/t/011_explain_storage_io.pl
new file mode 100644
index 00000000000..cb846aee92f
--- /dev/null
+++ b/src/test/modules/test_misc/t/011_explain_storage_io.pl
@@ -0,0 +1,89 @@
+
+# Copyright (c) 2024-2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use locale;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Node initialization
+my $node_sync = PostgreSQL::Test::Cluster->new('sync');
+$node_sync->init();
+$node_sync->append_conf('postgresql.conf', "io_method = sync");
+$node_sync->start;
+
+# Check that the XML-formatted EXPLAIN output contains Storage I/O
+# read and write information.
+my $xml_sync = $node_sync->safe_psql(
+ 'postgres',
+ q{
+ EXPLAIN (ANALYZE, BUFFERS, FORMAT XML)
+ SELECT * FROM pg_class;
+ }
+);
+
+like(
+ $xml_sync,
+ qr/<Storage-I-O-Read>\d+<\/Storage-I-O-Read>/,
+ "Storage-I-O-Read is shown in EXPLAIN XML (io_method=sync)"
+);
+
+like(
+ $xml_sync,
+ qr/<Storage-I-O-Write>\d+<\/Storage-I-O-Write>/,
+ "Storage-I-O-Write is shown in EXPLAIN XML (io_method=sync)"
+);
+
+# Do the same test if io_method=io_uring is supported
+if (have_io_uring())
+{
+ my $node_io_uring = PostgreSQL::Test::Cluster->new('io_uring');
+ $node_io_uring->init;
+ $node_io_uring->append_conf('postgresql.conf', "io_method = 'io_uring'");
+ $node_io_uring->start;
+
+ my $xml_io_uring = $node_io_uring->safe_psql(
+ 'postgres',
+ q{
+ EXPLAIN (ANALYZE, BUFFERS, FORMAT XML)
+ SELECT * FROM pg_class;
+ }
+ );
+
+ like(
+ $xml_io_uring,
+ qr/<Storage-I-O-Read>\d+<\/Storage-I-O-Read>/,
+ "Storage-I-O-Read is shown in EXPLAIN XML (io_method=io_uring)"
+ );
+
+ like(
+ $xml_io_uring,
+ qr/<Storage-I-O-Write>\d+<\/Storage-I-O-Write>/,
+ "Storage-I-O-Write is shown in EXPLAIN XML (io_method=io_uring)"
+ );
+}
+else
+{
+ note "io_uring is not supported on this platform. skipping io_uring tests";
+}
+
+sub have_io_uring
+{
+ # To detect if io_uring is supported, we look at the error message for
+ # assigning an invalid value to an enum GUC, which lists all the valid
+ # options. We need to use -C to deal with running as administrator on
+ # windows, the superuser check is omitted if -C is used.
+ my ($stdout, $stderr) =
+ run_command [qw(postgres -C invalid -c io_method=invalid)];
+ die "can't determine supported io_method values"
+ unless $stderr =~ m/Available values: ([^\.]+)\./;
+ my $methods = $1;
+ note "supported io_method values are: $methods";
+
+ return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+done_testing();
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 00000000000..71f95ac23c1
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,858 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+explain (costs off) select 1 as a, 2 as b having false;
+ QUERY PLAN
+--------------------------
+ Result
+ Replaces: Aggregate
+ One-Time Filter: false
+(3 rows)
+
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Storage I/O Write": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0, +
+ "Storage I/O Write": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ddbe4c64971..92e29069989 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2691,6 +2691,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
--
2.43.0
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2025-10-28 08:43 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
2026-01-25 14:35 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2026-01-28 13:27 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information torikoshia <torikoshia@oss.nttdata.com>
@ 2026-01-28 14:59 ` torikoshia <torikoshia@oss.nttdata.com>
0 siblings, 0 replies; 22+ messages in thread
From: torikoshia @ 2026-01-28 14:59 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: pgsql-hackers; andres@anarazel.de; tgl@sss.pgh.pa.us; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On 2026-01-28 22:27, torikoshia wrote:
> On 2026-01-25 23:35, Jelte Fennema-Nio wrote:
>
>> On Tue Oct 28, 2025 at 9:43 AM CET, torikoshia wrote:
>>> Rebased the patch again.
>>
>> I took another look at this patch because I think it would be really
>> useful to have. Below is my review:
>
> Thank you for the review!
>
>> 8. Now that worker is the default io_method the new explain fields are
>> never
>> present in the explain tests anymore. So explain_1.out is
>> unnecessary and can be removed. That obviously also means that
>> there's no coverage for this feature at all in the tests currently,
>> which is clearly a problem. So I think it would be good to add some
>> actual tests for the feature using some other io_method in a Perl
>> TAP
>> test.
>
> However, I noticed that there are test environments where io_method is
> set to io_uring (for example, Linux Debian Trixie builds on Cirrus CI
> using Meson), so it appears that explain_1.out is still needed.
> Given this, adding the Perl TAP test might be somewhat redundant, but
> for now I’m attaching a patch that includes both.
>
>> 9. The added docs about the kernel not being built with the correct
>> options seems a bit too much detail. I'd say remove that sentence.
>> And maybe shorten th
>
> Is this comment still a work in progress?
>
> I agree with the other comments. Attached an updated patch reflecting
> them.
Updated the patch to fix a regression test failure.
>
>
> Regards,
>
> --
> Atsushi Torikoshi
> Seconded from NTT DATA Japan Corporation to SRA OSS K.K
Attachments:
[text/x-diff] v9-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch (74.7K, 2-v9-0001-Add-storage-I-O-tracking-to-BUFFERS-option.patch)
download | inline diff:
From fc66a97d42208758104e329957b9633c2fe65014 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Wed, 28 Jan 2026 23:48:51 +0900
Subject: [PATCH v9] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit the shared
buffer, but does not distinguish between a cache hit in the OS cache or
a storage I/O operation.
While shared buffers and OS cache offer similar performance, storage
I/O is significantly slower in comparison in general. By measuring
the numbers of storage I/O read and write, we can better identify if
storage I/O is a bottleneck in performance.
This patch enables to track storage I/O usage by calling getrusage(2)
at both the planning and execution phase start and end points.
A more granular approach as well as current BUFFERS option(tracking at
each plan node) was considered but found to be impractical due to the
high performance cost of frequent getrusage() calls.
Note that no output is shown when io_method=worker, since asynchronous
workers handle I/O for multiple processes, and isolating the EXPLAIN
target's I/O is difficult.
TODO:
I believe this information is mainly useful when used in auto_explain.
I'm going to implement it if this patch is merged.
---
doc/src/sgml/ref/explain.sgml | 25 +
src/backend/access/brin/brin.c | 8 +-
src/backend/access/gin/gininsert.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 125 ++-
src/backend/commands/prepare.c | 12 +-
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 +-
src/backend/executor/instrument.c | 84 +-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 20 +-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
.../test_misc/t/011_explain_storage_io.pl | 89 ++
src/test/regress/expected/explain_1.out | 859 ++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 +
17 files changed, 1256 insertions(+), 35 deletions(-)
create mode 100644 src/test/modules/test_misc/t/011_explain_storage_io.pl
create mode 100644 src/test/regress/expected/explain_1.out
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..d50c18855c3 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -204,6 +204,31 @@ ROLLBACK;
format, only non-zero values are printed. Buffers information is
automatically included when <literal>ANALYZE</literal> is used.
</para>
+ <para>
+ If possible, this option also displays <emphasis>Storage I/O</emphasis>
+ statistics at the end of the plan. The storage I/O
+ <emphasis>read</emphasis> value indicates the number of read operations
+ performed on storage during query planning and execution, while the
+ Storage I/O <emphasis>write</emphasis> value indicates the number of
+ write operations performed on storage during these phases.
+ These values are obtained from the <function>getrusage()</function> system
+ call. Note that on platforms that do not support
+ <function>getrusage()</function>, such as Windows, no output will be shown,
+ even if reads or writes actually occur. Also, when
+ <xref linkend="guc-io-method"/> is set to <literal>worker</literal>, no output
+ will be shown, as I/O handled by asynchronous workers cannot be measured
+ accurately.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ </para>
+ <para>
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
+ </para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 6887e421442..9b34914140f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2572,7 +2572,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2934,7 +2934,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2949,8 +2949,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d63fb4ba27..5f34524be3b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1116,7 +1116,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->bufferusage[i], NULL, &ginleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2176,7 +2176,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2191,8 +2191,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 90ab4e91b56..f4609017d80 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1617,7 +1617,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1825,7 +1825,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1835,8 +1835,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..8f668ee46c7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -32,6 +32,7 @@
#include "parser/analyze.h"
#include "parser/parsetree.h"
#include "rewrite/rewriteHandler.h"
+#include "storage/aio_subsys.h"
#include "storage/bufmgr.h"
#include "tcop/tcopprot.h"
#include "utils/builtins.h"
@@ -144,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -326,6 +329,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -347,7 +352,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -362,16 +370,20 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
}
@@ -495,7 +507,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -505,6 +517,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -514,7 +527,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize global variable counters for parallel query workers.
+ * Even if the query is cancelled on the way, the EXPLAIN execution
+ * always passes here, so it can be initialized here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -598,8 +623,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -611,8 +637,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -669,6 +697,34 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio;
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4275,6 +4331,65 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, treat this case as having nothing to print.
+ */
+ if (pgaio_workers_enabled())
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ /*
+ * Since showing only the I/O excluding AIO workers underestimates the
+ * total I/O, do not show anything.
+ */
+ if (pgaio_workers_enabled())
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Write", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..7c38f1393fd 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -582,6 +582,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -597,7 +599,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -642,11 +648,14 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
+ /* calc differences of buffer and storage I/O counters. */
if (es->buffers)
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ StorageIOUsageDiff(&storageio, &storageio_start);
}
plan_list = cplan->stmt_list;
@@ -659,6 +668,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ es->buffers ? &storageio : NULL,
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c3b3c9ea21a..d19cd04e421 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 772e81f3154..081acf483b4 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -66,6 +66,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -621,6 +622,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -702,6 +704,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -797,6 +806,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1207,11 +1222,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and Storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1452,6 +1467,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start;
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1505,13 +1522,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and StorageI/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1524,11 +1542,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..4d3711b4aa9 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,27 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
+#include "storage/aio_subsys.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/*
+ * Accumulates the I/O usage sent by parallel workers to the main
+ * process. This does not contain the I/O from the main backend process
+ * itself because the kernel tracks that instead of us.
+ */
+StorageIOUsage pgStorageIOUsageParallel;
+
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -194,27 +205,43 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ struct StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -270,6 +297,57 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for StorageIOUsage usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* dst -= sub */
+void
+StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub)
+{
+ dst->inblock -= sub->inblock;
+ dst->outblock -= sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ /*
+ * Since getting the I/O excluding AIO workers underestimates the total
+ * I/O, don't get the I/O usage statistics when AIO worker is enabled.
+ */
+ if (pgaio_workers_enabled())
+ {
+ usage->inblock = 0;
+ usage->outblock = 0;
+ return;
+ }
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index 86226f8db70..7625f08c95e 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -68,6 +68,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..f91ef700991 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..b95157d5588 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -101,6 +109,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -111,11 +120,16 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageDiff(StorageIOUsage *dst, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb5844..270dc37c84f 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index fa2b79cd5ed..2e267de7b28 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/modules/test_misc/t/011_explain_storage_io.pl b/src/test/modules/test_misc/t/011_explain_storage_io.pl
new file mode 100644
index 00000000000..cb846aee92f
--- /dev/null
+++ b/src/test/modules/test_misc/t/011_explain_storage_io.pl
@@ -0,0 +1,89 @@
+
+# Copyright (c) 2024-2026, PostgreSQL Global Development Group
+
+use strict;
+use warnings FATAL => 'all';
+use locale;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Node initialization
+my $node_sync = PostgreSQL::Test::Cluster->new('sync');
+$node_sync->init();
+$node_sync->append_conf('postgresql.conf', "io_method = sync");
+$node_sync->start;
+
+# Check that the XML-formatted EXPLAIN output contains Storage I/O
+# read and write information.
+my $xml_sync = $node_sync->safe_psql(
+ 'postgres',
+ q{
+ EXPLAIN (ANALYZE, BUFFERS, FORMAT XML)
+ SELECT * FROM pg_class;
+ }
+);
+
+like(
+ $xml_sync,
+ qr/<Storage-I-O-Read>\d+<\/Storage-I-O-Read>/,
+ "Storage-I-O-Read is shown in EXPLAIN XML (io_method=sync)"
+);
+
+like(
+ $xml_sync,
+ qr/<Storage-I-O-Write>\d+<\/Storage-I-O-Write>/,
+ "Storage-I-O-Write is shown in EXPLAIN XML (io_method=sync)"
+);
+
+# Do the same test if io_method=io_uring is supported
+if (have_io_uring())
+{
+ my $node_io_uring = PostgreSQL::Test::Cluster->new('io_uring');
+ $node_io_uring->init;
+ $node_io_uring->append_conf('postgresql.conf', "io_method = 'io_uring'");
+ $node_io_uring->start;
+
+ my $xml_io_uring = $node_io_uring->safe_psql(
+ 'postgres',
+ q{
+ EXPLAIN (ANALYZE, BUFFERS, FORMAT XML)
+ SELECT * FROM pg_class;
+ }
+ );
+
+ like(
+ $xml_io_uring,
+ qr/<Storage-I-O-Read>\d+<\/Storage-I-O-Read>/,
+ "Storage-I-O-Read is shown in EXPLAIN XML (io_method=io_uring)"
+ );
+
+ like(
+ $xml_io_uring,
+ qr/<Storage-I-O-Write>\d+<\/Storage-I-O-Write>/,
+ "Storage-I-O-Write is shown in EXPLAIN XML (io_method=io_uring)"
+ );
+}
+else
+{
+ note "io_uring is not supported on this platform. skipping io_uring tests";
+}
+
+sub have_io_uring
+{
+ # To detect if io_uring is supported, we look at the error message for
+ # assigning an invalid value to an enum GUC, which lists all the valid
+ # options. We need to use -C to deal with running as administrator on
+ # windows, the superuser check is omitted if -C is used.
+ my ($stdout, $stderr) =
+ run_command [qw(postgres -C invalid -c io_method=invalid)];
+ die "can't determine supported io_method values"
+ unless $stderr =~ m/Available values: ([^\.]+)\./;
+ my $methods = $1;
+ note "supported io_method values are: $methods";
+
+ return ($methods =~ m/io_uring/) ? 1 : 0;
+}
+
+done_testing();
diff --git a/src/test/regress/expected/explain_1.out b/src/test/regress/expected/explain_1.out
new file mode 100644
index 00000000000..e15487613f4
--- /dev/null
+++ b/src/test/regress/expected/explain_1.out
@@ -0,0 +1,859 @@
+--
+-- EXPLAIN
+--
+-- There are many test cases elsewhere that use EXPLAIN as a vehicle for
+-- checking something else (usually planner behavior). This file is
+-- concerned with testing EXPLAIN in its own right.
+--
+-- To produce stable regression test output, it's usually necessary to
+-- ignore details such as exact costs or row counts. These filter
+-- functions replace changeable output details with fixed strings.
+create function explain_filter(text) returns setof text
+language plpgsql as
+$$
+declare
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just 'N'
+ ln := regexp_replace(ln, '-?\m\d+\M', 'N', 'g');
+ -- In sort output, the above won't match units-suffixed numbers
+ ln := regexp_replace(ln, '\m\d+kB', 'NkB', 'g');
+ -- Ignore text-mode buffers output because it varies depending
+ -- on the system state
+ CONTINUE WHEN (ln ~ ' +Buffers: .*');
+ -- Ignore text-mode "Planning:" line because whether it's output
+ -- varies depending on the system state
+ CONTINUE WHEN (ln = 'Planning:');
+ return next ln;
+ end loop;
+end;
+$$;
+-- To produce valid JSON output, replace numbers with "0" or "0.0" not "N"
+create function explain_filter_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ -- Replace any numeric word with just '0'
+ ln := regexp_replace(ln, '\m\d+\M', '0', 'g');
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+-- Disable JIT, or we'll get different output on machines where that's been
+-- forced on
+set jit = off;
+-- Similarly, disable track_io_timing, to avoid output differences when
+-- enabled.
+set track_io_timing = off;
+-- Simple cases
+explain (costs off) select 1 as a, 2 as b having false;
+ QUERY PLAN
+--------------------------
+ Result
+ Replaces: Aggregate
+ One-Time Filter: false
+(3 rows)
+
+select explain_filter('explain select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers off, verbose) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Output: q1, q2
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+select explain_filter('explain (analyze, buffers, format xml) select * from int8_tbl i8');
+ explain_filter
+--------------------------------------------------------
+ <explain xmlns="http://www.postgresql.org/N/explain"> +
+ <Query> +
+ <Plan> +
+ <Node-Type>Seq Scan</Node-Type> +
+ <Parallel-Aware>false</Parallel-Aware> +
+ <Async-Capable>false</Async-Capable> +
+ <Relation-Name>int8_tbl</Relation-Name> +
+ <Alias>i8</Alias> +
+ <Startup-Cost>N.N</Startup-Cost> +
+ <Total-Cost>N.N</Total-Cost> +
+ <Plan-Rows>N</Plan-Rows> +
+ <Plan-Width>N</Plan-Width> +
+ <Actual-Startup-Time>N.N</Actual-Startup-Time> +
+ <Actual-Total-Time>N.N</Actual-Total-Time> +
+ <Actual-Rows>N.N</Actual-Rows> +
+ <Actual-Loops>N</Actual-Loops> +
+ <Disabled>false</Disabled> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ </Plan> +
+ <Planning> +
+ <Shared-Hit-Blocks>N</Shared-Hit-Blocks> +
+ <Shared-Read-Blocks>N</Shared-Read-Blocks> +
+ <Shared-Dirtied-Blocks>N</Shared-Dirtied-Blocks>+
+ <Shared-Written-Blocks>N</Shared-Written-Blocks>+
+ <Local-Hit-Blocks>N</Local-Hit-Blocks> +
+ <Local-Read-Blocks>N</Local-Read-Blocks> +
+ <Local-Dirtied-Blocks>N</Local-Dirtied-Blocks> +
+ <Local-Written-Blocks>N</Local-Written-Blocks> +
+ <Temp-Read-Blocks>N</Temp-Read-Blocks> +
+ <Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Planning> +
+ <Planning-Time>N.N</Planning-Time> +
+ <Triggers> +
+ </Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Execution> +
+ <Execution-Time>N.N</Execution-Time> +
+ </Query> +
+ </explain>
+(1 row)
+
+select explain_filter('explain (analyze, serialize, buffers, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Actual Startup Time: N.N +
+ Actual Total Time: N.N +
+ Actual Rows: N.N +
+ Actual Loops: N +
+ Disabled: false +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Planning: +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
+ Planning Time: N.N +
+ Triggers: +
+ Serialization: +
+ Time: N.N +
+ Output Volume: N +
+ Format: "text" +
+ Shared Hit Blocks: N +
+ Shared Read Blocks: N +
+ Shared Dirtied Blocks: N +
+ Shared Written Blocks: N +
+ Local Hit Blocks: N +
+ Local Read Blocks: N +
+ Local Dirtied Blocks: N +
+ Local Written Blocks: N +
+ Temp Read Blocks: N +
+ Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
+ Execution Time: N.N
+(1 row)
+
+select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+(1 row)
+
+select explain_filter('explain (buffers, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ } +
+ } +
+ ]
+(1 row)
+
+-- Check expansion of window definitions
+select explain_filter('explain verbose select sum(unique1) over w, sum(unique2) over (w order by hundred), sum(tenthous) over (w order by hundred) from tenk1 window w as (partition by ten)');
+ explain_filter
+-------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w, (sum(unique2) OVER w1), (sum(tenthous) OVER w1), ten, hundred
+ Window: w AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w1, sum(tenthous) OVER w1
+ Window: w1 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(11 rows)
+
+select explain_filter('explain verbose select sum(unique1) over w1, sum(unique2) over (w1 order by hundred), sum(tenthous) over (w1 order by hundred rows 10 preceding) from tenk1 window w1 as (partition by ten)');
+ explain_filter
+---------------------------------------------------------------------------------------------------------
+ WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: sum(unique1) OVER w1, (sum(unique2) OVER w2), (sum(tenthous) OVER w3), ten, hundred
+ Window: w1 AS (PARTITION BY tenk1.ten)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, (sum(unique2) OVER w2), sum(tenthous) OVER w3
+ Window: w3 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred ROWS 'N'::bigint PRECEDING)
+ -> WindowAgg (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous, sum(unique2) OVER w2
+ Window: w2 AS (PARTITION BY tenk1.ten ORDER BY tenk1.hundred)
+ -> Sort (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+ Sort Key: tenk1.ten, tenk1.hundred
+ -> Seq Scan on public.tenk1 (cost=N.N..N.N rows=N width=N)
+ Output: ten, hundred, unique1, unique2, tenthous
+(14 rows)
+
+-- Check output including I/O timings. These fields are conditional
+-- but always set in JSON format, so check them only in this case.
+set track_io_timing = on;
+select explain_filter('explain (analyze, buffers, format json) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl", +
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Shared I/O Read Time": N.N, +
+ "Shared I/O Write Time": N.N,+
+ "Local I/O Read Time": N.N, +
+ "Local I/O Write Time": N.N, +
+ "Temp I/O Read Time": N.N, +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+set track_io_timing = off;
+-- SETTINGS option
+-- We have to ignore other settings that might be imposed by the environment,
+-- so printing the whole Settings field unfortunately won't do.
+begin;
+set local plan_cache_mode = force_generic_plan;
+select true as "OK"
+ from explain_filter('explain (settings) select * from int8_tbl i8') ln
+ where ln ~ '^ *Settings: .*plan_cache_mode = ''force_generic_plan''';
+ OK
+----
+ t
+(1 row)
+
+select explain_filter_to_json('explain (settings, format json) select * from int8_tbl i8') #> '{0,Settings,plan_cache_mode}';
+ ?column?
+----------------------
+ "force_generic_plan"
+(1 row)
+
+rollback;
+-- GENERIC_PLAN option
+select explain_filter('explain (generic_plan) select unique1 from tenk1 where thousand = $1');
+ explain_filter
+---------------------------------------------------------------------------------
+ Bitmap Heap Scan on tenk1 (cost=N.N..N.N rows=N width=N)
+ Recheck Cond: (thousand = $N)
+ -> Bitmap Index Scan on tenk1_thous_tenthous (cost=N.N..N.N rows=N width=N)
+ Index Cond: (thousand = $N)
+(4 rows)
+
+-- should fail
+select explain_filter('explain (analyze, generic_plan) select unique1 from tenk1 where thousand = $1');
+ERROR: EXPLAIN options ANALYZE and GENERIC_PLAN cannot be used together
+CONTEXT: PL/pgSQL function explain_filter(text) line 5 at FOR over EXECUTE statement
+-- MEMORY option
+select explain_filter('explain (memory) select * from int8_tbl i8');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+select explain_filter('explain (memory, analyze, buffers off) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Memory: used=NkB allocated=NkB
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (memory, summary, format yaml) select * from int8_tbl i8');
+ explain_filter
+-------------------------------
+ - Plan: +
+ Node Type: "Seq Scan" +
+ Parallel Aware: false +
+ Async Capable: false +
+ Relation Name: "int8_tbl"+
+ Alias: "i8" +
+ Startup Cost: N.N +
+ Total Cost: N.N +
+ Plan Rows: N +
+ Plan Width: N +
+ Disabled: false +
+ Planning: +
+ Memory Used: N +
+ Memory Allocated: N +
+ Planning Time: N.N
+(1 row)
+
+select explain_filter('explain (memory, analyze, format json) select * from int8_tbl i8');
+ explain_filter
+------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Node Type": "Seq Scan", +
+ "Parallel Aware": false, +
+ "Async Capable": false, +
+ "Relation Name": "int8_tbl",+
+ "Alias": "i8", +
+ "Startup Cost": N.N, +
+ "Total Cost": N.N, +
+ "Plan Rows": N, +
+ "Plan Width": N, +
+ "Actual Startup Time": N.N, +
+ "Actual Total Time": N.N, +
+ "Actual Rows": N.N, +
+ "Actual Loops": N, +
+ "Disabled": false, +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N +
+ }, +
+ "Planning": { +
+ "Shared Hit Blocks": N, +
+ "Shared Read Blocks": N, +
+ "Shared Dirtied Blocks": N, +
+ "Shared Written Blocks": N, +
+ "Local Hit Blocks": N, +
+ "Local Read Blocks": N, +
+ "Local Dirtied Blocks": N, +
+ "Local Written Blocks": N, +
+ "Temp Read Blocks": N, +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N, +
+ "Memory Used": N, +
+ "Memory Allocated": N +
+ }, +
+ "Planning Time": N.N, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Write": N +
+ }, +
+ "Execution Time": N.N +
+ } +
+ ]
+(1 row)
+
+prepare int8_query as select * from int8_tbl i8;
+select explain_filter('explain (memory) execute int8_query');
+ explain_filter
+---------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Memory: used=NkB allocated=NkB
+(2 rows)
+
+-- Test EXPLAIN (GENERIC_PLAN) with partition pruning
+-- partitions should be pruned at plan time, based on constants,
+-- but there should be no pruning based on parameter placeholders
+create table gen_part (
+ key1 integer not null,
+ key2 integer not null
+) partition by list (key1);
+create table gen_part_1
+ partition of gen_part for values in (1)
+ partition by range (key2);
+create table gen_part_1_1
+ partition of gen_part_1 for values from (1) to (2);
+create table gen_part_1_2
+ partition of gen_part_1 for values from (2) to (3);
+create table gen_part_2
+ partition of gen_part for values in (2);
+-- should scan gen_part_1_1 and gen_part_1_2, but not gen_part_2
+select explain_filter('explain (generic_plan) select key1, key2 from gen_part where key1 = 1 and key2 = $1');
+ explain_filter
+---------------------------------------------------------------------------
+ Append (cost=N.N..N.N rows=N width=N)
+ -> Seq Scan on gen_part_1_1 gen_part_1 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+ -> Seq Scan on gen_part_1_2 gen_part_2 (cost=N.N..N.N rows=N width=N)
+ Filter: ((key1 = N) AND (key2 = $N))
+(5 rows)
+
+drop table gen_part;
+--
+-- Test production of per-worker data
+--
+-- Unfortunately, because we don't know how many worker processes we'll
+-- actually get (maybe none at all), we can't examine the "Workers" output
+-- in any detail. We can check that it parses correctly as JSON, and then
+-- remove it from the displayed results.
+begin;
+-- encourage use of parallel plans
+set parallel_setup_cost=0;
+set parallel_tuple_cost=0;
+set min_parallel_table_scan_size=0;
+set max_parallel_workers_per_gather=4;
+select jsonb_pretty(
+ explain_filter_to_json('explain (analyze, verbose, buffers, format json)
+ select * from tenk1 order by tenthous')
+ -- remove "Workers" node of the Seq Scan plan node
+ #- '{0,Plan,Plans,0,Plans,0,Workers}'
+ -- remove "Workers" node of the Sort plan node
+ #- '{0,Plan,Plans,0,Workers}'
+ -- Also remove its sort-type fields, as those aren't 100% stable
+ #- '{0,Plan,Plans,0,Sort Method}'
+ #- '{0,Plan,Plans,0,Sort Space Type}'
+);
+ jsonb_pretty
+-------------------------------------------------------------
+ [ +
+ { +
+ "Plan": { +
+ "Plans": [ +
+ { +
+ "Plans": [ +
+ { +
+ "Alias": "tenk1", +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Schema": "public", +
+ "Disabled": false, +
+ "Node Type": "Seq Scan", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Relation Name": "tenk1", +
+ "Parallel Aware": true, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer",+
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Sort Key": [ +
+ "tenk1.tenthous" +
+ ], +
+ "Node Type": "Sort", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Sort Space Used": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Parent Relationship": "Outer", +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ } +
+ ], +
+ "Output": [ +
+ "unique1", +
+ "unique2", +
+ "two", +
+ "four", +
+ "ten", +
+ "twenty", +
+ "hundred", +
+ "thousand", +
+ "twothousand", +
+ "fivethous", +
+ "tenthous", +
+ "odd", +
+ "even", +
+ "stringu1", +
+ "stringu2", +
+ "string4" +
+ ], +
+ "Disabled": false, +
+ "Node Type": "Gather Merge", +
+ "Plan Rows": 0, +
+ "Plan Width": 0, +
+ "Total Cost": 0.0, +
+ "Actual Rows": 0.0, +
+ "Actual Loops": 0, +
+ "Startup Cost": 0.0, +
+ "Async Capable": false, +
+ "Parallel Aware": false, +
+ "Workers Planned": 0, +
+ "Local Hit Blocks": 0, +
+ "Temp Read Blocks": 0, +
+ "Workers Launched": 0, +
+ "Actual Total Time": 0.0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Shared Read Blocks": 0, +
+ "Actual Startup Time": 0.0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Planning": { +
+ "Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
+ "Temp Read Blocks": 0, +
+ "Local Read Blocks": 0, +
+ "Shared Hit Blocks": 0, +
+ "Storage I/O Write": 0, +
+ "Shared Read Blocks": 0, +
+ "Temp Written Blocks": 0, +
+ "Local Dirtied Blocks": 0, +
+ "Local Written Blocks": 0, +
+ "Shared Dirtied Blocks": 0, +
+ "Shared Written Blocks": 0 +
+ }, +
+ "Triggers": [ +
+ ], +
+ "Execution": { +
+ "Storage I/O Read": 0, +
+ "Storage I/O Write": 0 +
+ }, +
+ "Planning Time": 0.0, +
+ "Execution Time": 0.0 +
+ } +
+ ]
+(1 row)
+
+rollback;
+-- Test display of temporary objects
+create temp table t1(f1 float8);
+create function pg_temp.mysin(float8) returns float8 language plpgsql
+as 'begin return sin($1); end';
+select explain_filter('explain (verbose) select * from t1 where pg_temp.mysin(f1) < 0.5');
+ explain_filter
+------------------------------------------------------------
+ Seq Scan on pg_temp.t1 (cost=N.N..N.N rows=N width=N)
+ Output: f1
+ Filter: (pg_temp.mysin(t1.f1) < 'N.N'::double precision)
+(3 rows)
+
+-- Test compute_query_id
+set compute_query_id = on;
+select explain_filter('explain (verbose) select * from int8_tbl i8');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on public.int8_tbl i8 (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+-- Test compute_query_id with utility statements containing plannable query
+select explain_filter('explain (verbose) declare test_cur cursor for select * from int8_tbl');
+ explain_filter
+-------------------------------------------------------------
+ Seq Scan on public.int8_tbl (cost=N.N..N.N rows=N width=N)
+ Output: q1, q2
+ Query Identifier: N
+(3 rows)
+
+select explain_filter('explain (verbose) create table test_ctas as select 1');
+ explain_filter
+----------------------------------------
+ Result (cost=N.N..N.N rows=N width=N)
+ Output: N
+ Query Identifier: N
+(3 rows)
+
+-- Test SERIALIZE option
+select explain_filter('explain (analyze,buffers off,serialize) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize text,buffers,timing off) select * from int8_tbl i8');
+ explain_filter
+-----------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+select explain_filter('explain (analyze,serialize binary,buffers,timing) select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=binary
+ Execution Time: N.N ms
+(4 rows)
+
+-- this tests an edge case where we have no data to return
+select explain_filter('explain (analyze,buffers off,serialize) create temp table explain_temp as select * from int8_tbl i8');
+ explain_filter
+-------------------------------------------------------------------------------------------------
+ Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Serialization: time=N.N ms output=NkB format=text
+ Execution Time: N.N ms
+(4 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory case)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,10) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Memory Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (disk case)
+set work_mem to 64;
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over() from generate_series(1,2500) a(n)');
+ explain_filter
+----------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS ()
+ Storage: Disk Maximum Storage: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
+select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
+ explain_filter
+----------------------------------------------------------------------------------------
+ WindowAgg (actual time=N.N..N.N rows=N.N loops=N)
+ Window: w1 AS (PARTITION BY ((a.n < N)))
+ Storage: Disk Maximum Storage: NkB
+ -> Sort (actual time=N.N..N.N rows=N.N loops=N)
+ Sort Key: ((a.n < N))
+ Sort Method: external merge Disk: NkB
+ -> Function Scan on generate_series a (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(9 rows)
+
+reset work_mem;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ddbe4c64971..92e29069989 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2691,6 +2691,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: e3094679b9835fed2ea5c7d7877e8ac8e7554d33
--
2.48.1
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-10 13:31 ` Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
1 sibling, 1 reply; 22+ messages in thread
From: Andres Freund @ 2025-02-10 13:31 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-02-09 21:06:02 +0100, Jelte Fennema-Nio wrote:
> On Sun, 9 Feb 2025 at 19:05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Andres Freund <andres@anarazel.de> writes:
> > > I'm somewhat against this patch, as it's fairly fundamentally incompatible
> > > with AIO. There's no real way to get information in this manner if the IO
> > > isn't executed synchronously in process context...
>
> Hmm, I had not considered how this would interact with your AIO work.
> I agree that getting this info would be hard/impossible to do
> efficiently, when IOs are done by background IO processes that
> interleave IOs from different queries. But I'd expect that AIOs that
> are done using iouring would be tracked correctly without having to
> change this code at all (because I assume those are done from the
> query backend process).
>
> One other thought: I think the primary benefit of this feature is
> being able to see how many read IOs actually hit the disk, as opposed
> to hitting OS page cache. That benefit disappears when using Direct
> IO, because then there's no OS page cache.
>
> How many years away do you think that widespread general use of
> AIO+Direct IO is, though?
I think it'll always be a subset of use. It doesn't make sense to use DIO for
a small databases or untuned databases. Or a system that's deliberately
overcommmitted.
But this will also not work with AIO w/ Buffered IO. Which we hope to use much
more commonly.
If suddenly I have to reimplement something like this to work with worker
based IO, it'll certainly take longer to get to AIO.
Greetings,
Andres Freund
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
@ 2025-02-10 22:52 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 23:30 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-02-10 22:52 UTC (permalink / raw)
To: Andres Freund <andres@anarazel.de>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> I think it'll always be a subset of use. It doesn't make sense to use DIO for
> a small databases or untuned databases. Or a system that's deliberately
> overcommmitted.
Thanks, that's useful context.
> But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> more commonly.
To be clear, here you mean worker based AIO right? Because it would
work with io_uring based AIO, right?
> If suddenly I have to reimplement something like this to work with worker
> based IO, it'll certainly take longer to get to AIO.
I totally understand. But in my opinion it would be completely fine to
decide that these new IO stats are simply not available for worker
based IO. Just like they're not available for Windows either with this
patch.
I think it would be a shame to make perfect be the enemy of good here
(as often seems to happen with PG patches). I'd rather have this
feature for some setups, than for no setups at all.
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-10 23:30 ` Andres Freund <andres@anarazel.de>
2025-02-10 23:53 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
0 siblings, 1 reply; 22+ messages in thread
From: Andres Freund @ 2025-02-10 23:30 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > more commonly.
>
> To be clear, here you mean worker based AIO right? Because it would
> work with io_uring based AIO, right?
I mostly meant worker based AIO, yes. I haven't checked how accurately these
are kept for io_uring. I would hope they are...
> > If suddenly I have to reimplement something like this to work with worker
> > based IO, it'll certainly take longer to get to AIO.
>
> I totally understand. But in my opinion it would be completely fine to
> decide that these new IO stats are simply not available for worker
> based IO. Just like they're not available for Windows either with this
> patch.
The thing is that you'd often get completely misleading stats. Some of the IO
will still be done by the backend itself, so there will be a non-zero
value. But it will be a significant undercount, because the asynchronously
executed IO won't be tracked (if worker mode is used).
Greetings,
Andres Freund
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 23:30 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
@ 2025-02-10 23:53 ` Andres Freund <andres@anarazel.de>
2025-02-11 08:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
0 siblings, 1 reply; 22+ messages in thread
From: Andres Freund @ 2025-02-10 23:53 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-02-10 18:30:56 -0500, Andres Freund wrote:
> On 2025-02-10 23:52:17 +0100, Jelte Fennema-Nio wrote:
> > On Mon, 10 Feb 2025 at 14:31, Andres Freund <andres@anarazel.de> wrote:
> > > But this will also not work with AIO w/ Buffered IO. Which we hope to use much
> > > more commonly.
> >
> > To be clear, here you mean worker based AIO right? Because it would
> > work with io_uring based AIO, right?
>
> I mostly meant worker based AIO, yes. I haven't checked how accurately these
> are kept for io_uring. I would hope they are...
It does look like it is tracked.
> > > If suddenly I have to reimplement something like this to work with worker
> > > based IO, it'll certainly take longer to get to AIO.
> >
> > I totally understand. But in my opinion it would be completely fine to
> > decide that these new IO stats are simply not available for worker
> > based IO. Just like they're not available for Windows either with this
> > patch.
>
> The thing is that you'd often get completely misleading stats. Some of the IO
> will still be done by the backend itself, so there will be a non-zero
> value. But it will be a significant undercount, because the asynchronously
> executed IO won't be tracked (if worker mode is used).
<clear cache>
postgres[985394][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ worker │
└───────────┘
(1 row)
postgres[985394][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=17906.00..17906.01 rows=1 width=8) (actual time=199.494..199.494 rows=1 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=57.906 │
│ -> Seq Scan on manyrows (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.380..140.671 rows=1000000 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=57.906 │
│ Planning: │
│ Buffers: shared hit=41 read=12 │
│ Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.869 ms │
│ Execution Time: 199.554 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
<clear cache>
postgres[1014152][1]=# SHOW io_method ;
┌───────────┐
│ io_method │
├───────────┤
│ io_uring │
└───────────┘
(1 row)
postgres[1014152][1]=# EXPLAIN ANALYZE SELECT count(*) FROM manyrows ;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=17906.00..17906.01 rows=1 width=8) (actual time=111.591..111.593 rows=1 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=14.342 │
│ -> Seq Scan on manyrows (cost=0.00..15406.00 rows=1000000 width=0) (actual time=0.161..70.843 rows=1000000 loops=1) │
│ Buffers: shared read=5406 │
│ I/O Timings: shared read=14.342 │
│ Planning: │
│ Buffers: shared hit=41 read=12 │
│ Storage I/O: read=192 times write=0 times │
│ Planning Time: 1.768 ms │
│ Execution: │
│ Storage I/O: read=86496 times write=0 times │
│ Execution Time: 111.670 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Independent to of this, it's probably not good that we're tracking shared
buffer hits after io combining, if I interpret this correctly... That looks to
be an issue in master, not just the AIO branch.
Greetings,
Andres Freund
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 23:30 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 23:53 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
@ 2025-02-11 08:59 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-11 15:36 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
0 siblings, 1 reply; 22+ messages in thread
From: Jelte Fennema-Nio @ 2025-02-11 08:59 UTC (permalink / raw)
To: Andres Freund <andres@anarazel.de>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > I mostly meant worker based AIO, yes. I haven't checked how accurately these
> > are kept for io_uring. I would hope they are...
>
> It does look like it is tracked.
nice!
> > The thing is that you'd often get completely misleading stats. Some of the IO
> > will still be done by the backend itself, so there will be a non-zero
> > value. But it will be a significant undercount, because the asynchronously
> > executed IO won't be tracked (if worker mode is used).
Yeah, makes sense. Like I said, I would be completely fine with not
showing these numbers at all/setting them to 0 for setups where we
cannot easily get useful numbers (and this bgworker AIO would be one
of those setups).
> Independent to of this, it's probably not good that we're tracking shared
> buffer hits after io combining, if I interpret this correctly... That looks to
> be an issue in master, not just the AIO branch.
You mean that e.g. a combined IO for 20 blocks still sounds only as 1
"shared read"? Yeah, that sounds like a bug.
^ permalink raw reply [nested|flat] 22+ messages in thread
* Re: RFC: Allow EXPLAIN to Output Page Fault Information
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:31 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 23:30 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-10 23:53 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Andres Freund <andres@anarazel.de>
2025-02-11 08:59 ` Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
@ 2025-02-11 15:36 ` Andres Freund <andres@anarazel.de>
0 siblings, 0 replies; 22+ messages in thread
From: Andres Freund @ 2025-02-11 15:36 UTC (permalink / raw)
To: Jelte Fennema-Nio <postgres@jeltef.nl>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; torikoshia <torikoshia@oss.nttdata.com>; pgsql-hackers; rjuju123@gmail.com; Bruce Momjian <bruce@momjian.us>
Hi,
On 2025-02-11 09:59:43 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 00:53, Andres Freund <andres@anarazel.de> wrote:
> > > The thing is that you'd often get completely misleading stats. Some of the IO
> > > will still be done by the backend itself, so there will be a non-zero
> > > value. But it will be a significant undercount, because the asynchronously
> > > executed IO won't be tracked (if worker mode is used).
>
> Yeah, makes sense. Like I said, I would be completely fine with not
> showing these numbers at all/setting them to 0 for setups where we
> cannot easily get useful numbers (and this bgworker AIO would be one
> of those setups).
Shrug. It means that it'll not work in what I hope will be the default
mechanism before long. I just can't get excited for that. In all likelihood
it'll result in bug reports that I'll then be on the hook to fix.
> > Independent to of this, it's probably not good that we're tracking shared
> > buffer hits after io combining, if I interpret this correctly... That looks to
> > be an issue in master, not just the AIO branch.
>
> You mean that e.g. a combined IO for 20 blocks still sounds only as 1
> "shared read"? Yeah, that sounds like a bug.
Yep.
Greetings,
Andres Freund
^ permalink raw reply [nested|flat] 22+ messages in thread
end of thread, other threads:[~2026-01-28 14:59 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-02-08 13:54 Re: RFC: Allow EXPLAIN to Output Page Fault Information Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 11:51 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-09 17:59 ` Andres Freund <andres@anarazel.de>
2025-02-09 18:05 ` Tom Lane <tgl@sss.pgh.pa.us>
2025-02-09 20:06 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 13:23 ` torikoshia <torikoshia@oss.nttdata.com>
2025-03-17 23:52 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-19 13:15 ` torikoshia <torikoshia@oss.nttdata.com>
2025-03-22 11:23 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-03-25 01:27 ` torikoshia <torikoshia@oss.nttdata.com>
2025-04-11 13:18 ` torikoshia <torikoshia@oss.nttdata.com>
2025-05-08 13:51 ` torikoshia <torikoshia@oss.nttdata.com>
2025-10-28 08:43 ` torikoshia <torikoshia@oss.nttdata.com>
2026-01-25 14:35 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2026-01-28 13:27 ` torikoshia <torikoshia@oss.nttdata.com>
2026-01-28 14:59 ` torikoshia <torikoshia@oss.nttdata.com>
2025-02-10 13:31 ` Andres Freund <andres@anarazel.de>
2025-02-10 22:52 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-10 23:30 ` Andres Freund <andres@anarazel.de>
2025-02-10 23:53 ` Andres Freund <andres@anarazel.de>
2025-02-11 08:59 ` Jelte Fennema-Nio <postgres@jeltef.nl>
2025-02-11 15:36 ` Andres Freund <andres@anarazel.de>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox