public inbox for pgsql-hackers@postgresql.org
help / color / mirror / Atom feedFrom: torikoshia <torikoshia@oss.nttdata.com>
To: Jelte Fennema-Nio <postgres@jeltef.nl>
To: tgl@sss.pgh.pa.us
Cc: Bruce Momjian <bruce@momjian.us>
Cc: Pgsql Hackers <pgsql-hackers@postgresql.org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: Mon, 06 Jan 2025 18:49:06 +0900
Message-ID: <1f22794321b745549d54359d343e37b8@oss.nttdata.com> (raw)
In-Reply-To: <D6PENGW1PL79.266GQNPSMRIGA@jeltef.nl>
References: <c20f6340eb26f3b736abc59471bfada8@oss.nttdata.com>
<1614321.1735055528@sss.pgh.pa.us>
<D6MJOHS7HZ80.3B17NDGUV6T47@jeltef.nl>
<Z3LII9YEgp3kAZG-@momjian.us>
<3861403.1735576764@sss.pgh.pa.us>
<D6PENGW1PL79.266GQNPSMRIGA@jeltef.nl>
On Tue, Dec 31, 2024 at 1:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
>
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that. Maybe there are filesystems
> where a read that we request maps one-to-one with a subsequent
> page fault, but it hardly seems likely to me that that's
> universal. Also, you can't tell page faults for reading program
> code apart from those for data, and you won't get any information
> at all about writes.
Thanks for the explanation.
On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl>
wrote:
> On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> I certainly would love to see storage I/O numbers as distinct from
>>> kernel read I/O numbers.
>>
>> Me too, but I think it is 100% wishful thinking to imagine that
>> page fault counts match up with that.
>
> Okay I played around with this patch a bit, in hopes of proving you
> wrong. But I now agree with you. I cannot seem to get any numbers out
> of
> this that make sense.
>
> The major page fault numbers are always zero, even after running:
>
> echo 1 > /proc/sys/vm/drop_caches
>
> If Takahori has a way to get some more useful insights from this patch,
> I'm quite interested in the steps he took (I might very well have
> missed
> something obvious).
Thanks for testing.
I also did pg_ctl restart to clear buffercache in addition to your step
and saw many major faults again.
However, when I replaced the restart with pg_buffercache_evict(), I also
observed too few number of major fault.
I now feel majflt from getrusage() is not appropriate metrics for
measuring storage I/O.
> **However, I think the general direction has merit**: Changing this
> patch to
> use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> is 0 when everything is in page cache, and it is very high when stuff
> is
> not. I was only hacking around and basically did this:
>
> s/ru_minflt/ru_inblock/g
> s/ru_majflt/ru_oublock/g
Great!
I misunderstood these metrics contain page cached I/O.
As far as I inspected, they come from read_bytes/write_bytes of
task_io_accounting and the comment seems they are what we want, i.e.
storage I/O:
--
/usr/src/linux-headers-5.15.0-127/include/linux/task_io_accounting.h
struct task_io_accounting {
..(snip)..
#ifdef CONFIG_TASK_IO_ACCOUNTING
/*
* The number of bytes which this task has caused to be read
from
* storage.
*/
u64 read_bytes;
/*
* The number of bytes which this task has caused, or shall
cause to be
* written to disk.
*/
u64 write_bytes;
> Obviously more is needed. We'd probably want to show these numbers in
> useful units like MB or something. Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).
Updated the PoC patch to calculate them by KB:
=# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Seq Scan on pgbench_accounts (cost=0.00..263935.35 rows=10000035
width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
Buffers: shared hit=2587 read=161348
Planning Time: 0.367 ms
Execution:
Storage I/O: read=1291856 KB write=0 KB
Execution Time: 4353.253 ms
(6 rows)
> Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).
Maybe, but as far as using getrusage(), ru_inblock and ru_outblock seem
the best.
> One other thing that I noticed when playing around with this, which
> would need to be addressed: Parallel workers need to pass these values
> to the main process somehow, otherwise the IO from those processes gets
> lost.
Yes.
I haven't developed it yet but I believe we can pass them like
buffer/WAL usage.
--
Regards,
--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
Attachments:
[text/x-diff] v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch (10.5K, 2-v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch)
download | inline diff:
From 7c76d63811381ee4252e02285f58e0816dcf273d Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 6 Jan 2025 18:32:40 +0900
Subject: [PATCH v1] PoC: Allow EXPLAIN to output storage I/O information
This patch adds STORAGEIO option to EXPLAIN and it tracks the
amount of read/write from storage during planning and execution
phases using ru_inblock/ru_outblock from getrusage(2).
This is a PoC patch and only shows them when the FORMAT is TEXT.
Accumulating Parallel workers' I/O has not implemented yet.
Building on Windows causes error.
---
src/backend/commands/explain.c | 103 ++++++++++++++++++++++++++++++++-
src/backend/commands/prepare.c | 23 ++++++++
src/include/commands/explain.h | 10 ++++
3 files changed, 134 insertions(+), 2 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82..3c3b68b31 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,8 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
+
#include "access/xact.h"
#include "catalog/pg_type.h"
#include "commands/createas.h"
@@ -145,6 +147,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio(ExplainState *es, const StorageIO *usage);
+static void show_storageio(ExplainState *es, const StorageIO *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -217,6 +221,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
buffers_set = true;
es->buffers = defGetBoolean(opt);
}
+ else if (strcmp(opt->defname, "storageio") == 0)
+ es->storageio = defGetBoolean(opt);
else if (strcmp(opt->defname, "wal") == 0)
es->wal = defGetBoolean(opt);
else if (strcmp(opt->defname, "settings") == 0)
@@ -475,6 +481,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIO storageio = {0};
+ StorageIO storageio_start = {0};
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -499,6 +507,15 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
bufusage_start = pgBufferUsage;
INSTR_TIME_SET_CURRENT(planstart);
+ if (es->storageio)
+ {
+ struct rusage rusage;
+
+ getrusage(RUSAGE_SELF, &rusage);
+ storageio_start.inblock = rusage.ru_inblock;
+ storageio_start.outblock = rusage.ru_oublock;
+ }
+
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params);
@@ -518,9 +535,19 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
}
+ if (es->storageio)
+ {
+ struct rusage rusage;
+
+ getrusage(RUSAGE_SELF, &rusage);
+ storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+ storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+ }
+
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->storageio ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -644,7 +671,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIO *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -654,6 +681,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIO storageio = {0};
+ StorageIO storageio_start = {0};
+ struct rusage rusage;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -674,6 +704,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
*/
INSTR_TIME_SET_CURRENT(starttime);
+ if (es->storageio)
+ {
+ getrusage(RUSAGE_SELF, &rusage);
+
+ storageio_start.inblock = rusage.ru_inblock;
+ storageio_start.outblock = rusage.ru_oublock;
+ }
+
/*
* Use a snapshot with an updated command ID to ensure this query sees
* results of any previously executed queries.
@@ -748,7 +786,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
ExplainPrintPlan(es, queryDesc);
/* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ if (peek_buffer_usage(es, bufusage) || peek_storageio(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -762,6 +801,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
if (bufusage)
show_buffer_usage(es, bufusage);
+ if (es->storageio)
+ show_storageio(es, planstorageio);
+
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -813,6 +855,26 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ if (es->storageio)
+ {
+ getrusage(RUSAGE_SELF, &rusage);
+
+ storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+ storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4232,6 +4294,43 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+static bool
+peek_storageio(ExplainState *es, const StorageIO *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return false;
+
+ else
+ return true;
+}
+
+/*
+ * Show storage I/O.
+ *
+ * Since the unit of inblock/outblock is 512 bytes, change them to KB by dividing by two.
+ */
+static void
+show_storageio(ExplainState *es, const StorageIO *usage)
+{
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
+ appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 8989c0c88..4743bacf4 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -17,6 +17,7 @@
#include "postgres.h"
#include <limits.h>
+#include <sys/resource.h>
#include "access/xact.h"
#include "catalog/pg_type.h"
@@ -579,6 +580,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIO storageio_start = {0};
+ StorageIO storageio = {0};
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -595,6 +598,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (es->buffers)
bufusage_start = pgBufferUsage;
+
+ if (es->storageio)
+ {
+ struct rusage rusage;
+
+ getrusage(RUSAGE_SELF, &rusage);
+ storageio_start.inblock = rusage.ru_inblock;
+ storageio_start.outblock = rusage.ru_oublock;
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -646,6 +659,15 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
}
+ if (es->storageio)
+ {
+ struct rusage rusage;
+
+ getrusage(RUSAGE_SELF, &rusage);
+ storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+ storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+ }
+
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -656,6 +678,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->storageio ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ea7419951..7b452b306 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -49,6 +49,7 @@ typedef struct ExplainState
bool analyze; /* print actual times */
bool costs; /* print estimated costs */
bool buffers; /* print buffer usage */
+ bool storageio; /* print storageio */
bool wal; /* print WAL usage */
bool timing; /* print detailed node timing */
bool summary; /* print total planning and execution timing */
@@ -73,6 +74,14 @@ typedef struct ExplainState
ExplainWorkersState *workers_state; /* needed if parallel plan */
} ExplainState;
+typedef struct StorageIO
+{
+ /* Note that page size here is 512 byte */
+ long inblock; /* # of pages that has been read from storage */
+ long outblock; /* # of pages that has been written or shall be written to storage */
+} StorageIO;
+
+
/* Hook for plugins to get control in ExplainOneQuery() */
typedef void (*ExplainOneQuery_hook_type) (Query *query,
int cursorOptions,
@@ -108,6 +117,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIO *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
base-commit: b1ef48980ddd082a90ed39aa5914af45e53059e6
--
2.34.1
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: pgsql-hackers@postgresql.org
Cc: torikoshia@oss.nttdata.com, postgres@jeltef.nl, tgl@sss.pgh.pa.us, bruce@momjian.us
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
In-Reply-To: <1f22794321b745549d54359d343e37b8@oss.nttdata.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox