public inbox for pgsql-hackers@postgresql.org  
help / color / mirror / Atom feed
From: torikoshia <torikoshia@oss.nttdata.com>
To: Jelte Fennema-Nio <postgres@jeltef.nl>
To: tgl@sss.pgh.pa.us
Cc: Bruce Momjian <bruce@momjian.us>
Cc: Pgsql Hackers <pgsql-hackers@postgresql.org>
Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
Date: Mon, 06 Jan 2025 18:49:06 +0900
Message-ID: <1f22794321b745549d54359d343e37b8@oss.nttdata.com> (raw)
In-Reply-To: <D6PENGW1PL79.266GQNPSMRIGA@jeltef.nl>
References: <c20f6340eb26f3b736abc59471bfada8@oss.nttdata.com>
	<1614321.1735055528@sss.pgh.pa.us>
	<D6MJOHS7HZ80.3B17NDGUV6T47@jeltef.nl>
	<Z3LII9YEgp3kAZG-@momjian.us>
	<3861403.1735576764@sss.pgh.pa.us>
	<D6PENGW1PL79.266GQNPSMRIGA@jeltef.nl>


On Tue, Dec 31, 2024 at 1:39 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> 
>> I certainly would love to see storage I/O numbers as distinct from
>> kernel read I/O numbers.
> 
> Me too, but I think it is 100% wishful thinking to imagine that
> page fault counts match up with that.  Maybe there are filesystems
> where a read that we request maps one-to-one with a subsequent
> page fault, but it hardly seems likely to me that that's
> universal.  Also, you can't tell page faults for reading program
> code apart from those for data, and you won't get any information
> at all about writes.

Thanks for the explanation.


On Tue, Dec 31, 2024 at 7:57 AM Jelte Fennema-Nio <postgres@jeltef.nl> 
wrote:
> On Mon Dec 30, 2024 at 5:39 PM CET, Tom Lane wrote:
>> Bruce Momjian <bruce@momjian.us> writes:
>>> I certainly would love to see storage I/O numbers as distinct from
>>> kernel read I/O numbers.
>> 
>> Me too, but I think it is 100% wishful thinking to imagine that
>> page fault counts match up with that.
> 
> Okay I played around with this patch a bit, in hopes of proving you
> wrong. But I now agree with you. I cannot seem to get any numbers out 
> of
> this that make sense.
> 
> The major page fault numbers are always zero, even after running:
> 
> echo 1 > /proc/sys/vm/drop_caches
> 
> If Takahori has a way to get some more useful insights from this patch,
> I'm quite interested in the steps he took (I might very well have 
> missed
> something obvious).

Thanks for testing.

I also did pg_ctl restart to clear buffercache in addition to your step 
and saw many major faults again.
However, when I replaced the restart with pg_buffercache_evict(), I also 
observed too few number of major fault.
I now feel majflt from getrusage() is not appropriate metrics for 
measuring storage I/O.


> **However, I think the general direction has merit**: Changing this 
> patch to
> use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock`
> is 0 when everything is in page cache, and it is very high when stuff 
> is
> not. I was only hacking around and basically did this:
> 
> s/ru_minflt/ru_inblock/g
> s/ru_majflt/ru_oublock/g

Great!
I misunderstood these metrics contain page cached I/O.

As far as I inspected, they come from read_bytes/write_bytes of 
task_io_accounting and the comment seems they are what we want, i.e. 
storage I/O:

    -- 
/usr/src/linux-headers-5.15.0-127/include/linux/task_io_accounting.h
    struct task_io_accounting {
     ..(snip)..
    #ifdef CONFIG_TASK_IO_ACCOUNTING
            /*
             * The number of bytes which this task has caused to be read 
from
             * storage.
             */
            u64 read_bytes;

            /*
             * The number of bytes which this task has caused, or shall 
cause to be
             * written to disk.
             */
            u64 write_bytes;

> Obviously more is needed. We'd probably want to show these numbers in
> useful units like MB or something. Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).

Updated the PoC patch to calculate them by KB:

   =# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts;
                                                              QUERY PLAN
   
---------------------------------------------------------------------------------------------------------------------------------
    Seq Scan on pgbench_accounts  (cost=0.00..263935.35 rows=10000035 
width=97) (actual time=1.447..3900.279 rows=10000000 loops=1)
      Buffers: shared hit=2587 read=161348
    Planning Time: 0.367 ms
    Execution:
      Storage I/O: read=1291856 KB write=0 KB
    Execution Time: 4353.253 ms
   (6 rows)


>  Also, maybe there's some better way
> of getting read/write numbers for the current process than
> ru_inblock/ru_oublock (but this one seems to work at least reasonably
> well).

Maybe, but as far as using getrusage(), ru_inblock and ru_outblock seem 
the best.

> One other thing that I noticed when playing around with this, which
> would need to be addressed: Parallel workers need to pass these values
> to the main process somehow, otherwise the IO from those processes gets 
> lost.

Yes.
I haven't developed it yet but I believe we can pass them like 
buffer/WAL usage.


-- 
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.

Attachments:

  [text/x-diff] v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch (10.5K, 2-v1-0001-PoC-Allow-EXPLAIN-to-output-storage-I-O-informati.patch)
  download | inline diff:
From 7c76d63811381ee4252e02285f58e0816dcf273d Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikoshia@oss.nttdata.com>
Date: Mon, 6 Jan 2025 18:32:40 +0900
Subject: [PATCH v1] PoC: Allow EXPLAIN to output storage I/O information

This patch adds STORAGEIO option to EXPLAIN and it tracks the
amount of read/write from storage during planning and execution
phases using ru_inblock/ru_outblock from getrusage(2).

This is a PoC patch and only shows them when the FORMAT is TEXT.
Accumulating Parallel workers' I/O has not implemented yet.
Building on Windows causes error.
---
 src/backend/commands/explain.c | 103 ++++++++++++++++++++++++++++++++-
 src/backend/commands/prepare.c |  23 ++++++++
 src/include/commands/explain.h |  10 ++++
 3 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82..3c3b68b31 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -13,6 +13,8 @@
  */
 #include "postgres.h"
 
+#include <sys/resource.h>
+
 #include "access/xact.h"
 #include "catalog/pg_type.h"
 #include "commands/createas.h"
@@ -145,6 +147,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio(ExplainState *es, const StorageIO *usage);
+static void show_storageio(ExplainState *es, const StorageIO *usage);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
@@ -217,6 +221,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 			buffers_set = true;
 			es->buffers = defGetBoolean(opt);
 		}
+		else if (strcmp(opt->defname, "storageio") == 0)
+			es->storageio = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "wal") == 0)
 			es->wal = defGetBoolean(opt);
 		else if (strcmp(opt->defname, "settings") == 0)
@@ -475,6 +481,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 				planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	StorageIO	storageio = {0};
+	StorageIO	storageio_start = {0};
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
@@ -499,6 +507,15 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		bufusage_start = pgBufferUsage;
 	INSTR_TIME_SET_CURRENT(planstart);
 
+	if (es->storageio)
+	{
+		struct rusage rusage;
+
+		getrusage(RUSAGE_SELF, &rusage);
+		storageio_start.inblock = rusage.ru_inblock;
+		storageio_start.outblock = rusage.ru_oublock;
+	}
+
 	/* plan the query */
 	plan = pg_plan_query(query, queryString, cursorOptions, params);
 
@@ -518,9 +535,19 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 	}
 
+	if (es->storageio)
+	{
+		struct rusage rusage;
+
+		getrusage(RUSAGE_SELF, &rusage);
+		storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+		storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+	}
+
 	/* run it (if needed) and produce output */
 	ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
 				   &planduration, (es->buffers ? &bufusage : NULL),
+				   (es->storageio ? &storageio : NULL),
 				   es->memory ? &mem_counters : NULL);
 }
 
@@ -644,7 +671,7 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage,
+			   const BufferUsage *bufusage, const StorageIO *planstorageio,
 			   const MemoryContextCounters *mem_counters)
 {
 	DestReceiver *dest;
@@ -654,6 +681,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	int			eflags;
 	int			instrument_option = 0;
 	SerializeMetrics serializeMetrics = {0};
+	StorageIO	storageio = {0};
+	StorageIO	storageio_start = {0};
+	struct rusage rusage;
 
 	Assert(plannedstmt->commandType != CMD_UTILITY);
 
@@ -674,6 +704,14 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	 */
 	INSTR_TIME_SET_CURRENT(starttime);
 
+	if (es->storageio)
+	{
+		getrusage(RUSAGE_SELF, &rusage);
+
+		storageio_start.inblock = rusage.ru_inblock;
+		storageio_start.outblock = rusage.ru_oublock;
+	}
+
 	/*
 	 * Use a snapshot with an updated command ID to ensure this query sees
 	 * results of any previously executed queries.
@@ -748,7 +786,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	ExplainPrintPlan(es, queryDesc);
 
 	/* Show buffer and/or memory usage in planning */
-	if (peek_buffer_usage(es, bufusage) || mem_counters)
+	if (peek_buffer_usage(es, bufusage) || peek_storageio(es, planstorageio) ||
+		mem_counters)
 	{
 		ExplainOpenGroup("Planning", "Planning", true, es);
 
@@ -762,6 +801,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		if (bufusage)
 			show_buffer_usage(es, bufusage);
 
+		if (es->storageio)
+			show_storageio(es, planstorageio);
+
 		if (mem_counters)
 			show_memory_counters(es, mem_counters);
 
@@ -813,6 +855,26 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 
 	totaltime += elapsed_time(&starttime);
 
+	if (es->storageio)
+	{
+		getrusage(RUSAGE_SELF, &rusage);
+
+		storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+		storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+		{
+			ExplainIndentText(es);
+			appendStringInfoString(es->str, "Execution:\n");
+			es->indent++;
+		}
+		show_storageio(es, &storageio);
+
+		if (es->format == EXPLAIN_FORMAT_TEXT)
+			es->indent--;
+		ExplainCloseGroup("Execution", "Execution", true, es);
+	}
+
 	/*
 	 * We only report execution time if we actually ran the query (that is,
 	 * the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4232,6 +4294,43 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 	}
 }
 
+static bool
+peek_storageio(ExplainState *es, const StorageIO *usage)
+{
+	if (usage == NULL)
+		return false;
+
+	if (usage->inblock <= 0 && usage->outblock <= 0)
+		return false;
+
+	else
+		return true;
+}
+
+/*
+ * Show storage I/O.
+ *
+ * Since the unit of inblock/outblock is 512 bytes, change them to KB by dividing by two.
+ */
+static void
+show_storageio(ExplainState *es, const StorageIO *usage)
+{
+	/* Show only positive counter values. */
+	if (usage->inblock <= 0 && usage->outblock <= 0)
+		return;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Storage I/O:");
+		appendStringInfo(es->str, " read=%ld KB", (long) usage->inblock / 2);
+		appendStringInfo(es->str, " write=%ld KB", (long) usage->outblock / 2);
+
+		appendStringInfoChar(es->str, '\n');
+	}
+}
+
 /*
  * Show WAL usage details.
  */
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 8989c0c88..4743bacf4 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -17,6 +17,7 @@
 #include "postgres.h"
 
 #include <limits.h>
+#include <sys/resource.h>
 
 #include "access/xact.h"
 #include "catalog/pg_type.h"
@@ -579,6 +580,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	instr_time	planduration;
 	BufferUsage bufusage_start,
 				bufusage;
+	StorageIO	storageio_start = {0};
+	StorageIO	storageio = {0};
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
@@ -595,6 +598,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 	if (es->buffers)
 		bufusage_start = pgBufferUsage;
+
+	if (es->storageio)
+	{
+		struct rusage rusage;
+
+		getrusage(RUSAGE_SELF, &rusage);
+		storageio_start.inblock = rusage.ru_inblock;
+		storageio_start.outblock = rusage.ru_oublock;
+	}
+
 	INSTR_TIME_SET_CURRENT(planstart);
 
 	/* Look it up in the hash table */
@@ -646,6 +659,15 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
 	}
 
+	if (es->storageio)
+	{
+		struct rusage rusage;
+
+		getrusage(RUSAGE_SELF, &rusage);
+		storageio.inblock = rusage.ru_inblock - storageio_start.inblock;
+		storageio.outblock = rusage.ru_oublock - storageio_start.outblock;
+	}
+
 	plan_list = cplan->stmt_list;
 
 	/* Explain each query */
@@ -656,6 +678,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
 						   &planduration, (es->buffers ? &bufusage : NULL),
+						   (es->storageio ? &storageio : NULL),
 						   es->memory ? &mem_counters : NULL);
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ea7419951..7b452b306 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -49,6 +49,7 @@ typedef struct ExplainState
 	bool		analyze;		/* print actual times */
 	bool		costs;			/* print estimated costs */
 	bool		buffers;		/* print buffer usage */
+	bool		storageio;		/* print storageio */
 	bool		wal;			/* print WAL usage */
 	bool		timing;			/* print detailed node timing */
 	bool		summary;		/* print total planning and execution timing */
@@ -73,6 +74,14 @@ typedef struct ExplainState
 	ExplainWorkersState *workers_state; /* needed if parallel plan */
 } ExplainState;
 
+typedef struct StorageIO
+{
+	/* Note that page size here is 512 byte */
+	long		inblock;			/* # of pages that has been read from storage */
+	long		outblock;			/* # of pages that has been written or shall be written to storage */
+}			StorageIO;
+
+
 /* Hook for plugins to get control in ExplainOneQuery() */
 typedef void (*ExplainOneQuery_hook_type) (Query *query,
 										   int cursorOptions,
@@ -108,6 +117,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
 						   ParamListInfo params, QueryEnvironment *queryEnv,
 						   const instr_time *planduration,
 						   const BufferUsage *bufusage,
+						   const StorageIO *planstorageio,
 						   const MemoryContextCounters *mem_counters);
 
 extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);

base-commit: b1ef48980ddd082a90ed39aa5914af45e53059e6
-- 
2.34.1



reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: pgsql-hackers@postgresql.org
  Cc: torikoshia@oss.nttdata.com, postgres@jeltef.nl, tgl@sss.pgh.pa.us, bruce@momjian.us
  Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information
  In-Reply-To: <1f22794321b745549d54359d343e37b8@oss.nttdata.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox