Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tV9qM-005ycE-Ev for pgsql-hackers@arkaria.postgresql.org; Tue, 07 Jan 2025 13:42:02 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tV9qL-003rBi-Ew for pgsql-hackers@arkaria.postgresql.org; Tue, 07 Jan 2025 13:42:01 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tV9qL-003rAw-1a for pgsql-hackers@lists.postgresql.org; Tue, 07 Jan 2025 13:42:00 +0000 Received: from mail-yb1-xb34.google.com ([2607:f8b0:4864:20::b34]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tV9qI-000KxA-1N for pgsql-hackers@postgresql.org; Tue, 07 Jan 2025 13:41:59 +0000 Received: by mail-yb1-xb34.google.com with SMTP id 3f1490d57ef6-e4930eca0d4so20337095276.3 for ; Tue, 07 Jan 2025 05:41:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1736257318; x=1736862118; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=JtCWyfm8XfZGI92ChCBw6Ys+Hkcd/tIHUn5aU2dDwus=; b=I/hu7QuBmbr2EFv7zwnBhycgxkJNtNd3AeNhX36g4awjQTbGdtls+TSqiknTrwQGpF y7NVH5VGgIcSVcOXlEBMeJHK+uQvCbWfVIkIc5GgnlRZSa4FSOmmmi59mLWRwdw2oChB 24xCU/WLG9r1vPuXQUvxEhBIIC+DLMfDNadYapHofd4uQ99ikSC0gFmp6Sg6/4XK7UVq c5KlrVqRSeP4Ej5tR73l929siX+DvxBhqwos4RGI5/GdbeBDNOHBWlW6F3q6vc7wWtc3 vM1bzf93CeM1TYmKnc9gStIANeDrt/irB5TWY3uqXDpXeR49iH9jAtJxvGDhmbGVwu6Z ITBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736257318; x=1736862118; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=JtCWyfm8XfZGI92ChCBw6Ys+Hkcd/tIHUn5aU2dDwus=; b=jD4COrqfCUJhIcT+XXvoZ3zMgU/9H92zxlptLI5SMGFR6fBpkIk7tw9+GwtPJNPIw+ hcbAFNt4h2z3Pd/HLUZ2YBrx4d1FaBOxeecv0XzYj6qjykEwxaSm+J5xbebCNFsEpGxN CRxWHK+793S6VkDs/K3SIvYqQzstR0sxwBhxuDi4cr08/8m8dlmeoX+ftKxqlil4PvWa reKAKH1fiWyNUW1HI6U/rfr6NbQK+YrxvkO8/1lg6PfKvIaWQBZGNFuZpBsA36B+kwxD S1Cg6xAm2RfsYlKdjjwsjL0oRpPAglU+3ShV7XcgfwFhDUIPwRxagGSaGttUGpX9EHLD lTug== X-Forwarded-Encrypted: i=1; AJvYcCUyw2vlZZVLWA5wx4OhaALjgWQ209pDoDlDHT2nbnH00cgChws0eV0ZSneYDu+UYIGvP5psnFCOKrYLLm2a@postgresql.org X-Gm-Message-State: AOJu0YxEwiwJjRo+5rZbdAO/VyKfFVl2WjL7Bz8feYjTlVBMA4VpgxFK ewVmiTuYG0GqpmNoN1FCF/bL9OI+nk3ruonLCWOVcQO/5+G7Wofr8iJu/FcN31F+nrUA/yd321e ga1vCeoTTLPwKk/fr+Qq2Peus+3I= X-Gm-Gg: ASbGnctv9es/whpCQbrFo2U67m3JK3zjWx5xbhbHBUVyG6Fq7/ilhhn5zhExD81ajTP jpxViUhiowLLb9LluSlYc95B6ncWZ2wKEX6EuhLkc9JZhJalo/zAwcxlqFniMjQlqflnKpQ== X-Google-Smtp-Source: AGHT+IEOyJlIWp1A/IXWNQrEOvmLHlI9qmtbSApbB3qrSEU1WacnIF4hrSH2Tij8T9NdSH8eq1fprVUJX5a6pyvtMfM= X-Received: by 2002:a25:4a82:0:b0:e39:9749:bb2b with SMTP id 3f1490d57ef6-e538c28e1cbmr26449197276.29.1736257317960; Tue, 07 Jan 2025 05:41:57 -0800 (PST) MIME-Version: 1.0 References: <1614321.1735055528@sss.pgh.pa.us> <3861403.1735576764@sss.pgh.pa.us> <1f22794321b745549d54359d343e37b8@oss.nttdata.com> In-Reply-To: From: Atsushi Torikoshi Date: Tue, 7 Jan 2025 22:41:47 +0900 Message-ID: Subject: Re: RFC: Allow EXPLAIN to Output Page Fault Information To: Jelte Fennema-Nio , rjuju123@gmail.com Cc: torikoshia , tgl@sss.pgh.pa.us, Bruce Momjian , Pgsql Hackers Content-Type: multipart/alternative; boundary="000000000000ad0951062b1de8c2" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000ad0951062b1de8c2 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Mon, Jan 6, 2025 at 6:59=E2=80=AFPM Julien Rouhaud = wrote: > Hi, > > On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia wrote: > > > > > **However, I think the general direction has merit**: Changing this > > > patch to > > > use `ru_inblock`/`ru_oublock` gives very useful insights. `ru_inblock= ` > > > is 0 when everything is in page cache, and it is very high when stuff > is > > > not. I was only hacking around and basically did this: > > > > > > s/ru_minflt/ru_inblock/g > > > s/ru_majflt/ru_oublock/g > > > [...] > > > Also, maybe there's some better way > > > of getting read/write numbers for the current process than > > > ru_inblock/ru_oublock (but this one seems to work at least reasonably > > > well). > > > > Maybe, but as far as using getrusage(), ru_inblock and ru_outblock seem > the > > best. > > FWIW that's the counters we've been using in pg_stat_kcache / powa to > compute > disk IO and "postgres shared buffers hit VS OS cache hit VS disk read" hi= t > ratio for many years and we didn't get any problem report for that. Thanks for sharing the information! On Tue, Jan 7, 2025 at 5:09=E2=80=AFAM Jelte Fennema-Nio wrote: > On Mon Jan 6, 2025 at 10:49 AM CET, torikoshia wrote: > > On Tue, Dec 31, 2024 at 7:57=E2=80=AFAM Jelte Fennema-Nio > > Updated the PoC patch to calculate them by KB: > > > > =3D# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_accounts; > > QUERY PLA= N > > > > > -------------------------------------------------------------------------= -------------------------------------------------------- > > Seq Scan on pgbench_accounts (cost=3D0.00..263935.35 rows=3D100000= 35 > > width=3D97) (actual time=3D1.447..3900.279 rows=3D10000000 loops=3D1) > > Buffers: shared hit=3D2587 read=3D161348 > > Planning Time: 0.367 ms > > Execution: > > Storage I/O: read=3D1291856 KB write=3D0 KB > > Execution Time: 4353.253 ms > > (6 rows) > > > > > > The core functionality works well in my opinion. I think it makes sense > to spend the effort to move this from PoC quality to something > committable. Thanks for the comment and review! If there are no other opinions, I will make a patch based on the direction of the current PoC patch. > 4. I think this setting should be enabled by default for ANALYZE, just > like BUFFERS is now since c2a4078e[1]. > > 5. I'm wondering whether this option deserves its own EXPLAIN option, or > if it should simply be made part of BUFFERS. Although this is not information about PostgreSQL buffers, I now feel that making this addition part of BUFFERS is attractive, since people using BUFFERS would also sometimes want to know about the storage I/O. Since BUFFERS is now ON by default with EXPLAIN ANALYZE, I am concerned about the performance impact. However, if it is limited to just twice=E2=80=94once at the start and once = at the end=E2=80=94for the planning phase, execution phase, and each parallel work= er, I believe the impact is relatively small. > 9. I think this division by 2 could use some explanation in a comment. I > understand that you're doing this because linux divides its original > bytes using 512 bytes[2] and your additional factor of 2 gets that to > 1024 bytes. But that's not clear immediately from the code. > > I'm also not convinced that 512 is the blocksize if this logic is > even correct on every platform. I'm wondering if maybe we should > simply show the blocks after all. > Maybe so. I'll look into this and then decide the unit. For the other comments, I plan to address them as you suggested. Regards, Atsushi Torikoshi --000000000000ad0951062b1de8c2 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Mon, Jan 6, 2025 at 6:59=E2= =80=AFPM Julien Rouhaud <rjuju123@= gmail.com> wrote:
Hi,

On Mon, Jan 06, 2025 at 06:49:06PM +0900, torikoshia = wrote:
>
> > **However, I think the general direction has me= rit**: Changing this
> > patch to
> > use `ru_inblock`/`r= u_oublock` gives very useful insights. `ru_inblock`
> > is 0 when = everything is in page cache, and it is very high when stuff is
> >= not. I was only hacking around and basically did this:
> >
>= ; > s/ru_minflt/ru_inblock/g
> > s/ru_majflt/ru_oublock/g
&g= t; > [...]
> >=C2=A0 Also, maybe there's some better way> > of getting read/write numbers for the current process than
&g= t; > ru_inblock/ru_oublock (but this one seems to work at least reasonab= ly
> > well).
>
> Maybe, but as far as using getrusage= (), ru_inblock and ru_outblock seem the
> best.

FWIW that'= s the counters we've been using in pg_stat_kcache / powa to compute
= disk IO and "postgres shared buffers hit VS OS cache hit VS disk read&= quot; hit
ratio for many years and we didn't get any problem report = for that.

Thanks for sharing the informati= on!

On Tue, Jan 7, 2025 at 5:09=E2=80=AFAM Jelte Fennem= a-Nio <postgres@jeltef.nl> = wrote:
On Mon Ja= n 6, 2025 at 10:49 AM CET, torikoshia wrote:
> On Tue, Dec 31, 2024 at 7:57=E2=80=AFAM Jelte Fennema-Nio <postgres@jeltef.nl>=
> Updated the PoC patch to calculate them by KB:
>
>=C2=A0 =C2=A0 =3D# EXPLAIN (ANALYZE, STORAGEIO) SELECT * FROM pgbench_a= ccounts;
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0QUERY PLAN
>=C2=A0 =C2=A0
> ----------------------------------------------------------------------= -----------------------------------------------------------
>=C2=A0 =C2=A0 =C2=A0Seq Scan on pgbench_accounts=C2=A0 (cost=3D0.00..26= 3935.35 rows=3D10000035
> width=3D97) (actual time=3D1.447..3900.279 rows=3D10000000 loops=3D1)<= br> >=C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D2587 read=3D161348
>=C2=A0 =C2=A0 =C2=A0Planning Time: 0.367 ms
>=C2=A0 =C2=A0 =C2=A0Execution:
>=C2=A0 =C2=A0 =C2=A0 =C2=A0Storage I/O: read=3D1291856 KB write=3D0 KB<= br> >=C2=A0 =C2=A0 =C2=A0Execution Time: 4353.253 ms
>=C2=A0 =C2=A0 (6 rows)
>
>

The core functionality works well in my opinion. I think it makes sense
to spend the effort to move this from PoC quality to something
committable.

Thanks for the comment and re= view!
If there are no other opinions, I will make a patch based o= n the direction of the current PoC patch.
=C2=A0=C2=A0
4. I think this setting should be enabled by default for ANALYZE, just
=C2=A0 =C2=A0like BUFFERS is now since c2a4078e[1].

5. I'm wondering whether this option deserves its own EXPLAIN option, o= r
=C2=A0 =C2=A0if it should simply be made part of BUFFERS.
= =C2=A0
Although this is not information about PostgreSQL buffers,= I now feel that making this addition part of BUFFERS is attractive, since = people using BUFFERS would also sometimes want to know about the storage I/= O.=C2=A0

Since BUFFERS is now ON by default with E= XPLAIN ANALYZE, I am concerned about the performance impact.
Howe= ver, if it is limited to just twice=E2=80=94once at the start and once at t= he end=E2=80=94for the planning phase, execution phase, and each parallel w= orker, I believe the impact is relatively small.
=C2=A0
9. I think this division by 2= could use some explanation in a comment. I
=C2=A0 =C2=A0understand that you're doing this because linux divides it= s original
=C2=A0 =C2=A0bytes using 512 bytes[2] and your additional factor of 2 gets = that to
=C2=A0 =C2=A01024 bytes. But that's not clear immediately from the code= .

=C2=A0 =C2=A0I'm also not convinced that 512 is the blocksize if this l= ogic is
=C2=A0 =C2=A0even correct on every platform. I'm wondering if maybe we = should
=C2=A0 =C2=A0simply show the blocks after all.

Maybe so. I'll look into this and then decide the unit.=C2=A0

For the other comments, I plan to address them as y= ou suggested.

=C2=A0
Regards,
= Atsushi Torikoshi
--000000000000ad0951062b1de8c2--