Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)

public inbox for pgsql-bugs@postgresql.org  
help / color / mirror / Atom feed

Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
9+ messages / 4 participants
[nested] [flat]

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
@ 2026-04-02 14:49 Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  0 siblings, 1 reply; 9+ messages in thread

From: Adrian Mönnich @ 2026-04-02 14:49 UTC (permalink / raw)
  To: Andres Freund <andres@anarazel.de>; +Cc: pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

Indeed, good catch. I was generating the test data from an older prod data copy
and not a more recent one. In any case, the performance was fine on that same
copy on 14/15 and got bad on 16.

I just re-ran it with a larger database (and also replaced the gzipped SQL file
from my initial message with the latest one).

PG14: https://explain.depesz.com/s/ysdJ
PG16, 4M: massive cpu + disk usage and thus aborted after a few seconds
PG16, 32M: https://explain.depesz.com/s/mYiY

Cheers,
Adrian

> Hi,

> On 2026-04-02 16:06:27 +0200, Adrian Mönnich wrote:
>> thanks a lot, I just tried with work_mem set to 128MB on PG16 and it worked fine:
>> https://explain.depesz.com/s/7Zan
>> 
>> Likewise on PG18:
>> https://explain.depesz.com/s/H15B
>> 
>> And with enable_memoize=0 (PG18, 128MB):
>> https://explain.depesz.com/s/SaVI

> That's good.


>> So increasing work_mem seems like a good workaround for when we upgrade
>> our production DB. But I guess there's still a but somewhere that results to the
>> wrong estimate?

> I don't even know if it's a misestimate that didn't happen in the earlier
> versions - the join order is different in 14 than it's in the later ones.  I
> don't know why that is at this point.

> This means that we don't know if 14 would have had the same misestimation if
> the same join order had been chosen.


> There also seem to be some data differences:

> 14: https://explain.depesz.com/s/17Fp#source
>   ->  Parallel Seq Scan on contributions contributions_1 
> (cost=0.00..164891.13 rows=2687413 width=5) (actual time=0.013..454.721 rows=2143186 loops=3)

> 16: https://explain.depesz.com/s/7Zan
>   ->  Parallel Seq Scan on contributions contributions_1 
> (cost=0.00..37776.28 rows=1643228 width=5) (actual time=0.081..78.499 rows=1314582.00 loops=3)

> That's a pretty substantial difference in the number of rows.


> Greetings,

> Andres Freund







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
@ 2026-04-02 18:12 ` Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-05-19 10:10   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  0 siblings, 2 replies; 9+ messages in thread

From: Tomas Vondra @ 2026-04-02 18:12 UTC (permalink / raw)
  To: Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; +Cc: pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

Hi,

I can reproduce the performance getting much worse in 16, using the
provided SQL scripts. This is what I see:

  14: 1551.363 ms
  15: 1385.414 ms
  16: 161571.400 ms
  17: 156434.543 ms
  18: 159095.001 ms

I'm attaching the explains for 15+16. I don't know what's causing it,
but I have a couple interesting observations.

1) If I disable parallel query, the timings change to

  14: 3990.439 ms
  15: 3518.453 ms
  16: 3606.460 ms
  17: 3591.039 ms
  18: 3617.872 ms

So no regression in this case. It seems to be related to parallelism.


2) There seems to be an explosion of temporary files. We don't have that
in explain, but I queried pg_stat_database before/after the query, and
there's huge difference. Both start at

  temp_files               | 112
  temp_bytes               | 1942275280

so 112 files, ~2GB disk space. But after the query, 15 says

  temp_files               | 721
  temp_bytes               | 2755839184

while 16 has

  temp_files               | 2078995
  temp_bytes               | 70607906000

2M files and 70GB? Wow!


3) Indeed, before the query completes the pgsql_tmp directory has this:

  63M	pgsql_tmp3499395.0.fileset
  63G	pgsql_tmp3499395.1.fileset
  95M	pgsql_tmp3499395.2.fileset
  95M	pgsql_tmp3499395.3.fileset
  127M	pgsql_tmp3499395.4.fileset

So I guess that's one of the parallel hash joins doing something, and
consuming 63GB of disk space? I don't see anything suspicious in the
plan, but I assume parallel HJ may not report the relevant stats.

FWIW bumping up work_mem (to 64MB) solved this with the sample data.

I suspect this is going to be something like the hash join explosion,
where we just happen to add more and more batches. I don't have time to
investigate this more at the moment.


regards

-- 
Tomas Vondra


Attachments:

  [text/x-log] 16.log (6.7K, 2-16.log)
  download | inline:
                                                                                                         QUERY PLAN                                                                                                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=424803.70..424803.71 rows=1 width=8) (actual time=143020.230..161567.872 rows=1 loops=1)
   ->  Gather  (cost=424803.48..424803.69 rows=2 width=8) (actual time=142964.457..161567.851 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=423803.48..423803.49 rows=1 width=8) (actual time=142963.280..142968.559 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=338640.72..421783.68 rows=807919 width=4) (actual time=142746.820..142933.580 rows=1523232 loops=3)
                     Hash Cond: (attachments.folder_id = folders.id)
                     ->  Parallel Seq Scan on attachments  (cost=0.00..54748.97 rows=1980177 width=8) (actual time=0.233..85.380 rows=1576685 loops=3)
                           Filter: (NOT is_deleted)
                           Rows Removed by Filter: 327767
                     ->  Parallel Hash  (cost=329694.99..329694.99 rows=545258 width=4) (actual time=142532.861..142538.136 rows=1030633 loops=3)
                           Buckets: 262144  Batches: 16  Memory Usage: 9664kB
                           ->  Parallel Hash Right Join  (cost=228094.50..329694.99 rows=545258 width=4) (actual time=48620.655..142449.984 rows=1030633 loops=3)
                                 Hash Cond: (contributions.id = folders.contribution_id)
                                 Filter: (NOT COALESCE(sessions.is_deleted, contributions.is_deleted, subcontributions.is_deleted, false))
                                 Rows Removed by Filter: 5283
                                 ->  Parallel Seq Scan on contributions  (cost=0.00..61461.31 rows=2674131 width=5) (actual time=0.034..149.134 rows=2139359 loops=3)
                                 ->  Parallel Hash  (cost=209138.05..209138.05 rows=1090516 width=10) (actual time=22774.998..22775.308 rows=1035916 loops=3)
                                       Buckets: 262144 (originally 262144)  Batches: 262144 (originally 32)  Memory Usage: 19168kB
                                       ->  Parallel Hash Left Join  (cost=115722.60..209138.05 rows=1090516 width=10) (actual time=542.858..617.491 rows=1035916 loops=3)
                                             Hash Cond: (folders.session_id = sessions.id)
                                             ->  Parallel Hash Left Join  (cost=105930.64..184862.48 rows=1090516 width=13) (actual time=412.124..472.199 rows=1035916 loops=3)
                                                   Hash Cond: (folders.subcontribution_id = subcontributions.id)
                                                   Filter: ((contributions_1.is_deleted IS NULL) OR (NOT contributions_1.is_deleted))
                                                   Rows Removed by Filter: 150
                                                   ->  Parallel Hash Join  (cost=23682.51..85276.38 rows=1163755 width=16) (actual time=158.760..242.455 rows=1036066 loops=3)
                                                         Hash Cond: (folders.event_id = events.id)
                                                         ->  Parallel Seq Scan on folders  (cost=0.00..40321.07 rows=1331730 width=20) (actual time=0.009..58.211 rows=1064701 loops=3)
                                                               Filter: ((NOT is_deleted) AND (link_type <> 1))
                                                               Rows Removed by Filter: 4424
                                                         ->  Parallel Hash  (cost=14575.53..14575.53 rows=555039 width=4) (actual time=40.413..40.418 rows=444370 loops=3)
                                                               Buckets: 262144  Batches: 16  Memory Usage: 5344kB
                                                               ->  Parallel Seq Scan on events  (cost=0.00..14575.53 rows=555039 width=4) (actual time=0.009..19.095 rows=444370 loops=3)
                                                                     Filter: (NOT is_deleted)
                                                                     Rows Removed by Filter: 63752
                                                   ->  Parallel Hash  (cost=76907.44..76907.44 rows=325495 width=6) (actual time=119.187..119.190 rows=184447 loops=3)
                                                         Buckets: 262144  Batches: 4  Memory Usage: 8576kB
                                                         ->  Nested Loop Left Join  (cost=0.44..76907.44 rows=325495 width=6) (actual time=0.026..105.242 rows=184447 loops=3)
                                                               ->  Parallel Seq Scan on subcontributions  (cost=0.00..6294.95 rows=325495 width=9) (actual time=0.002..6.771 rows=184447 loops=3)
                                                               ->  Memoize  (cost=0.44..0.84 rows=1 width=5) (actual time=0.000..0.000 rows=1 loops=553341)
                                                                     Cache Key: subcontributions.contribution_id
                                                                     Cache Mode: logical
                                                                     Hits: 123894  Misses: 44462  Evictions: 0  Overflows: 0  Memory Usage: 4690kB
                                                                     Worker 0:  Hits: 142773  Misses: 49442  Evictions: 0  Overflows: 0  Memory Usage: 5215kB
                                                                     Worker 1:  Hits: 142496  Misses: 50274  Evictions: 0  Overflows: 0  Memory Usage: 5303kB
                                                                     ->  Index Scan using pk_contributions on contributions contributions_1  (cost=0.43..0.83 rows=1 width=5) (actual time=0.001..0.001 rows=1 loops=144178)
                                                                           Index Cond: (id = subcontributions.contribution_id)
                                             ->  Parallel Hash  (cost=5715.98..5715.98 rows=248398 width=5) (actual time=20.697..20.698 rows=198719 loops=3)
                                                   Buckets: 262144  Batches: 4  Memory Usage: 7936kB
                                                   ->  Parallel Seq Scan on sessions  (cost=0.00..5715.98 rows=248398 width=5) (actual time=0.003..7.165 rows=198719 loops=3)
 Planning Time: 1.446 ms
 Execution Time: 161571.400 ms
(52 rows)

  [text/x-log] 15.log (6.5K, 3-15.log)
  download | inline:
                                                                                                         QUERY PLAN                                                                                                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=433938.40..433938.41 rows=1 width=8) (actual time=1324.805..1384.694 rows=1 loops=1)
   ->  Gather  (cost=433938.19..433938.40 rows=2 width=8) (actual time=1323.712..1384.690 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=432938.19..432938.20 rows=1 width=8) (actual time=1322.444..1322.455 rows=1 loops=3)
               ->  Parallel Hash Join  (cost=347888.29..430918.53 rows=807861 width=4) (actual time=1135.831..1287.886 rows=1523232 loops=3)
                     Hash Cond: (attachments.folder_id = folders.id)
                     ->  Parallel Seq Scan on attachments  (cost=0.00..54689.63 rows=1975074 width=8) (actual time=0.015..74.821 rows=1576685 loops=3)
                           Filter: (NOT is_deleted)
                           Rows Removed by Filter: 327767
                     ->  Parallel Hash  (cost=338919.45..338919.45 rows=546627 width=4) (actual time=988.970..988.979 rows=1030633 loops=3)
                           Buckets: 262144  Batches: 16  Memory Usage: 9664kB
                           ->  Parallel Hash Join  (cost=222930.79..338919.45 rows=546627 width=4) (actual time=862.438..942.550 rows=1030633 loops=3)
                                 Hash Cond: (folders.event_id = events.id)
                                 ->  Parallel Hash Left Join  (cost=199218.52..306522.55 rows=623482 width=8) (actual time=647.388..774.130 rows=1058109 loops=3)
                                       Hash Cond: (folders.contribution_id = contributions.id)
                                       Filter: (NOT COALESCE(sessions.is_deleted, contributions.is_deleted, subcontributions.is_deleted, false))
                                       Rows Removed by Filter: 6434
                                       ->  Parallel Hash Left Join  (cost=93909.23..175314.98 rows=1246964 width=14) (actual time=351.134..404.342 rows=1064544 loops=3)
                                             Hash Cond: (folders.session_id = sessions.id)
                                             ->  Parallel Hash Left Join  (cost=84126.27..146673.74 rows=1246964 width=17) (actual time=217.466..279.889 rows=1064544 loops=3)
                                                   Hash Cond: (folders.subcontribution_id = subcontributions.id)
                                                   Filter: ((contributions_1.is_deleted IS NULL) OR (NOT contributions_1.is_deleted))
                                                   Rows Removed by Filter: 157
                                                   ->  Parallel Seq Scan on folders  (cost=0.00..40289.07 rows=1331373 width=20) (actual time=0.016..59.322 rows=1064701 loops=3)
                                                         Filter: ((NOT is_deleted) AND (link_type <> 1))
                                                         Rows Removed by Filter: 4424
                                                   ->  Parallel Hash  (cost=78785.58..78785.58 rows=325495 width=6) (actual time=103.535..103.536 rows=184447 loops=3)
                                                         Buckets: 262144  Batches: 4  Memory Usage: 7488kB
                                                         ->  Nested Loop Left Join  (cost=0.44..78785.58 rows=325495 width=6) (actual time=0.030..90.520 rows=184447 loops=3)
                                                               ->  Parallel Seq Scan on subcontributions  (cost=0.00..6246.95 rows=325495 width=9) (actual time=0.002..6.298 rows=184447 loops=3)
                                                               ->  Memoize  (cost=0.44..0.84 rows=1 width=5) (actual time=0.000..0.000 rows=1 loops=553341)
                                                                     Cache Key: subcontributions.contribution_id
                                                                     Cache Mode: logical
                                                                     Hits: 135333  Misses: 47817  Evictions: 0  Overflows: 0  Memory Usage: 4904kB
                                                                     Worker 0:  Hits: 136389  Misses: 48426  Evictions: 0  Overflows: 0  Memory Usage: 4966kB
                                                                     Worker 1:  Hits: 137268  Misses: 48108  Evictions: 0  Overflows: 0  Memory Usage: 4933kB
                                                                     ->  Index Scan using pk_contributions on contributions contributions_1  (cost=0.43..0.83 rows=1 width=5) (actual time=0.001..0.001 rows=1 loops=144351)
                                                                           Index Cond: (id = subcontributions.contribution_id)
                                             ->  Parallel Hash  (cost=5706.98..5706.98 rows=248398 width=5) (actual time=17.667..17.667 rows=198719 loops=3)
                                                   Buckets: 262144  Batches: 4  Memory Usage: 7936kB
                                                   ->  Parallel Seq Scan on sessions  (cost=0.00..5706.98 rows=248398 width=5) (actual time=0.003..7.229 rows=198719 loops=3)
                                       ->  Parallel Hash  (cost=61434.91..61434.91 rows=2674191 width=5) (actual time=184.109..184.109 rows=2139359 loops=3)
                                             Buckets: 262144  Batches: 64  Memory Usage: 6048kB
                                             ->  Parallel Seq Scan on contributions  (cost=0.00..61434.91 rows=2674191 width=5) (actual time=0.009..83.995 rows=2139359 loops=3)
                                 ->  Parallel Hash  (cost=14575.53..14575.53 rows=556859 width=4) (actual time=38.855..38.855 rows=444370 loops=3)
                                       Buckets: 262144  Batches: 16  Memory Usage: 5344kB
                                       ->  Parallel Seq Scan on events  (cost=0.00..14575.53 rows=556859 width=4) (actual time=0.008..18.893 rows=444370 loops=3)
                                             Filter: (NOT is_deleted)
                                             Rows Removed by Filter: 63752
 Planning Time: 1.354 ms
 Execution Time: 1385.414 ms
(52 rows)

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
@ 2026-04-02 19:00   ` Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:34     ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  1 sibling, 1 reply; 9+ messages in thread

From: Tomas Vondra @ 2026-04-02 19:00 UTC (permalink / raw)
  To: Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; +Cc: pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>



On 4/2/26 20:12, Tomas Vondra wrote:
> Hi,
> 
> I can reproduce the performance getting much worse in 16, using the
> provided SQL scripts. This is what I see:
> 
>   14: 1551.363 ms
>   15: 1385.414 ms
>   16: 161571.400 ms
>   17: 156434.543 ms
>   18: 159095.001 ms
> 
> I'm attaching the explains for 15+16. I don't know what's causing it,
> but I have a couple interesting observations.
> 
> 1) If I disable parallel query, the timings change to
> 
>   14: 3990.439 ms
>   15: 3518.453 ms
>   16: 3606.460 ms
>   17: 3591.039 ms
>   18: 3617.872 ms
> 
> So no regression in this case. It seems to be related to parallelism.
> 
> 
> 2) There seems to be an explosion of temporary files. We don't have that
> in explain, but I queried pg_stat_database before/after the query, and
> there's huge difference. Both start at
> 
>   temp_files               | 112
>   temp_bytes               | 1942275280
> 
> so 112 files, ~2GB disk space. But after the query, 15 says
> 
>   temp_files               | 721
>   temp_bytes               | 2755839184
> 
> while 16 has
> 
>   temp_files               | 2078995
>   temp_bytes               | 70607906000
> 
> 2M files and 70GB? Wow!
> 
> 
> 3) Indeed, before the query completes the pgsql_tmp directory has this:
> 
>   63M	pgsql_tmp3499395.0.fileset
>   63G	pgsql_tmp3499395.1.fileset
>   95M	pgsql_tmp3499395.2.fileset
>   95M	pgsql_tmp3499395.3.fileset
>   127M	pgsql_tmp3499395.4.fileset
> 
> So I guess that's one of the parallel hash joins doing something, and
> consuming 63GB of disk space? I don't see anything suspicious in the
> plan, but I assume parallel HJ may not report the relevant stats.
> 
> FWIW bumping up work_mem (to 64MB) solved this with the sample data.
> 
> I suspect this is going to be something like the hash join explosion,
> where we just happen to add more and more batches. I don't have time to
> investigate this more at the moment.
> 

FWIW I think that's what's happening. If I add an elog(WARNING) into
ExecParallelHashJoinSetUpBatches, I see this:

    WARNING:  0x55dbe375a5e8 initializing 16 batches
    WARNING:  0x7f3868a3a978 initializing 32 batches
    WARNING:  0x7f3868a3ab80 initializing 4 batches
    WARNING:  0x55dbe36148c0 initializing 4 batches
    WARNING:  0x7f3868a3b230 initializing 16 batches
    WARNING:  0x7f3868a3a978 initializing 64 batches
    WARNING:  0x55dbe36144b0 initializing 128 batches
    WARNING:  0x55dbe36144b0 initializing 256 batches
    WARNING:  0x55dbe36144b0 initializing 512 batches
    WARNING:  0x55dbe36144b0 initializing 1024 batches
    WARNING:  0x7f3868a3a978 initializing 2048 batches
    WARNING:  0x7f3868a3a978 initializing 4096 batches
    WARNING:  0x55dbe36144b0 initializing 8192 batches
    WARNING:  0x55dbe36144b0 initializing 16384 batches
    WARNING:  0x55dbe36144b0 initializing 32768 batches
    WARNING:  0x7f3868a3a978 initializing 65536 batches
    WARNING:  0x55dbe36144b0 initializing 131072 batches
    WARNING:  0x7f3868a3a978 initializing 262144 batches

so we're ending with 256k batches, for this one join. I'm not sure how
exactly this maps to the 2M files from pg_stat_database, but it means
~0.5M tuplestores and ~10GB virtual memory (at lest per top).

I don't know what triggers the batch increase, but I still suspect it's
similar to the explosion we fixed (or mitigated) in PG18, but only for
serial (non-parallel) joins.

regards

-- 
Tomas Vondra







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
@ 2026-04-02 22:34     ` Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:43       ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tom Lane <tgl@sss.pgh.pa.us>
  0 siblings, 1 reply; 9+ messages in thread

From: Tomas Vondra @ 2026-04-02 22:34 UTC (permalink / raw)
  To: Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; +Cc: pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

On 4/2/26 21:00, Tomas Vondra wrote:
> ...
> FWIW I think that's what's happening. If I add an elog(WARNING) into
> ExecParallelHashJoinSetUpBatches, I see this:
> 
>     WARNING:  0x55dbe375a5e8 initializing 16 batches
>     WARNING:  0x7f3868a3a978 initializing 32 batches
>     WARNING:  0x7f3868a3ab80 initializing 4 batches
>     WARNING:  0x55dbe36148c0 initializing 4 batches
>     WARNING:  0x7f3868a3b230 initializing 16 batches
>     WARNING:  0x7f3868a3a978 initializing 64 batches
>     WARNING:  0x55dbe36144b0 initializing 128 batches
>     WARNING:  0x55dbe36144b0 initializing 256 batches
>     WARNING:  0x55dbe36144b0 initializing 512 batches
>     WARNING:  0x55dbe36144b0 initializing 1024 batches
>     WARNING:  0x7f3868a3a978 initializing 2048 batches
>     WARNING:  0x7f3868a3a978 initializing 4096 batches
>     WARNING:  0x55dbe36144b0 initializing 8192 batches
>     WARNING:  0x55dbe36144b0 initializing 16384 batches
>     WARNING:  0x55dbe36144b0 initializing 32768 batches
>     WARNING:  0x7f3868a3a978 initializing 65536 batches
>     WARNING:  0x55dbe36144b0 initializing 131072 batches
>     WARNING:  0x7f3868a3a978 initializing 262144 batches
> 
> so we're ending with 256k batches, for this one join. I'm not sure how
> exactly this maps to the 2M files from pg_stat_database, but it means
> ~0.5M tuplestores and ~10GB virtual memory (at lest per top).
> 
> I don't know what triggers the batch increase, but I still suspect it's
> similar to the explosion we fixed (or mitigated) in PG18, but only for
> serial (non-parallel) joins.
> 

An interesting question is "What changed in PG16?" causing the query to
fail, when it worked OK on earlier versions. I guess the main suspect is
this item from release notes

  Allow parallelization of FULL and internal right OUTER hash joins

So I guess it might be interesting to flip the joins to inner, see if it
still fails like that, and then see if that crashes on PG15 too.

Although the query has only inner and left outer joins, which seems
unrelated to the change. It might be simply a consequence of the planner
picking a different join tree (due to some general optimizer changes).

It might be interesting to try forcing the same join tree (which might
be possible with join_collapse_limit=1) on PG15. Maybe it'll crash the
same way?

Maybe it'd be easier to try reducing the query first, before doing any
of this. Start removing the joins one by one from the "top" (per the
explain), until it stops failing. That might leave a much smaller query.


regards

-- 
Tomas Vondra







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:34     ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
@ 2026-04-02 22:43       ` Tom Lane <tgl@sss.pgh.pa.us>
  2026-04-04 14:45         ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  0 siblings, 1 reply; 9+ messages in thread

From: Tom Lane @ 2026-04-02 22:43 UTC (permalink / raw)
  To: Tomas Vondra <tomas@vondra.me>; +Cc: Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

Tomas Vondra <tomas@vondra.me> writes:
> An interesting question is "What changed in PG16?" causing the query to
> fail, when it worked OK on earlier versions.

"git bisect" could be informative here.  I agree with trying to
minimize the query first, though --- else you may waste time
going down blind alleys, as a result of planner changes changing
the join order without affecting the critical executor behavior.

			regards, tom lane






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:34     ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:43       ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tom Lane <tgl@sss.pgh.pa.us>
@ 2026-04-04 14:45         ` Tomas Vondra <tomas@vondra.me>
  2026-04-16 05:25           ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Thomas Munro <thomas.munro@gmail.com>
  0 siblings, 1 reply; 9+ messages in thread

From: Tomas Vondra @ 2026-04-04 14:45 UTC (permalink / raw)
  To: Tom Lane <tgl@sss.pgh.pa.us>; +Cc: Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

On 4/3/26 00:43, Tom Lane wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> An interesting question is "What changed in PG16?" causing the query to
>> fail, when it worked OK on earlier versions.
> 
> "git bisect" could be informative here.  I agree with trying to
> minimize the query first, though --- else you may waste time
> going down blind alleys, as a result of planner changes changing
> the join order without affecting the critical executor behavior.
> 

I did a bit of bisecting today (with the full query), and unsurprisingly
it started failing at:

commit 11c2d6fdf5af1aacec9ca2005543f1b0fc4cc364 (HEAD ->
hashjoin-explosion-bisect)
Author: Thomas Munro <tmunro@postgresql.org>
Date:   Fri Mar 31 11:01:51 2023 +1300

    Parallel Hash Full Join.

    Full and right outer joins were not supported in the initial
    implementation of Parallel Hash Join because of deadlock hazards
    (see discussion).  Therefore FULL JOIN inhibited parallelism, as
    the other join strategies can't do that in parallel either.

    ...

Although, it's a bit strange, AFAIK the query does not have any full
outer join. Also, for me it now fails like this:

    Sat Apr  4 04:00:58 PM CEST 2026
    ERROR:  invalid DSA memory alloc request size 1811939328
    CONTEXT:  parallel worker
    Sat Apr  4 04:02:04 PM CEST 2026

I believe it's the same issue (I still get the same tempfile explosion).

After a bit of trial-and-error I managed to reduce the query to a single
join:

  SET parallel_setup_cost = 0;
  SET cpu_tuple_cost = 1;
  SET enable_nestloop = off;

  EXPLAIN ANALYZE SELECT *
  FROM attachments.folders
  LEFT OUTER JOIN events.contributions
   ON events.contributions.id = attachments.folders.contribution_id;

The trick is to force it to do a parallel hash join by adjusting the CPU
costs. I don't think it can be reduced even further, even just switching
to an inner join makes it work fine.

At this point I was suspecting the data distributions for the join
columns may be somewhat weird, causing issues for the hashjoin batching.
For events.contributions.id it's perfectly fine - it's entirely unique,
with each ID having 1 entry. Unsurprisingly, because it's the PK. But
for attachments.folders.contribution_id I see this:

SELECT contribution_id, count(*) FROM attachments.folders
 GROUP BY contribution_id ORDER BY 2 DESC;

 contribution_id | count
-----------------+--------
                 | 464515
         5492978 |     67
         4117499 |     42
         4045045 |     41
             ...

So there's ~500k entries with NULL, that can't possibly match to
anything (right)? I assume we still add them to the hash, though.
Because if I explicitly filter them out, it starts working fine:

EXPLAIN ANALYZE SELECT *
FROM attachments.folders
LEFT OUTER JOIN events.contributions
  ON events.contributions.id = attachments.folders.contribution_id
WHERE attachments.folders.contribution_id IT NOT NULL;
...
Planning Time: 0.192 ms
Execution Time: 670.950 ms

and when I invert the condition (to IS NULL), it stats failing pretty
much right away.

regards

-- 
Tomas Vondra

^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:34     ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:43       ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tom Lane <tgl@sss.pgh.pa.us>
  2026-04-04 14:45         ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
@ 2026-04-16 05:25           ` Thomas Munro <thomas.munro@gmail.com>
  2026-04-16 18:52             ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  0 siblings, 1 reply; 9+ messages in thread

From: Thomas Munro @ 2026-04-16 05:25 UTC (permalink / raw)
  To: Tomas Vondra <tomas@vondra.me>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>

On Sun, Apr 5, 2026 at 2:45 AM Tomas Vondra <tomas@vondra.me> wrote:
> At this point I was suspecting the data distributions for the join
> columns may be somewhat weird, causing issues for the hashjoin batching.
> For events.contributions.id it's perfectly fine - it's entirely unique,
> with each ID having 1 entry. Unsurprisingly, because it's the PK. But
> for attachments.folders.contribution_id I see this:
>
> SELECT contribution_id, count(*) FROM attachments.folders
>  GROUP BY contribution_id ORDER BY 2 DESC;
>
>  contribution_id | count
> -----------------+--------
>                  | 464515
>          5492978 |     67
>          4117499 |     42
>          4045045 |     41
>              ...
>
> So there's ~500k entries with NULL, that can't possibly match to
> anything (right)? I assume we still add them to the hash, though.

That's also the conditions required to prevent the
"stop-partitioning-it's-not-working" logic from triggering.  That
thing where we know we need to pick a better lower than 100%.  But
what?

Did this commit help?

commit 1811f1af98fb237fdd5adb588cd4b57c433b75f8
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Thu Mar 19 15:21:36 2026 -0400

    Improve hash join's handling of tuples with null join keys.






^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 19:00   ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:34     ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-02 22:43       ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tom Lane <tgl@sss.pgh.pa.us>
  2026-04-04 14:45         ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
  2026-04-16 05:25           ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Thomas Munro <thomas.munro@gmail.com>
@ 2026-04-16 18:52             ` Tomas Vondra <tomas@vondra.me>
  0 siblings, 0 replies; 9+ messages in thread

From: Tomas Vondra @ 2026-04-16 18:52 UTC (permalink / raw)
  To: Thomas Munro <thomas.munro@gmail.com>; +Cc: Tom Lane <tgl@sss.pgh.pa.us>; Adrian Mönnich <adrian.moennich@cern.ch>; Andres Freund <andres@anarazel.de>; pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>



On 4/16/26 07:25, Thomas Munro wrote:
> On Sun, Apr 5, 2026 at 2:45 AM Tomas Vondra <tomas@vondra.me> wrote:
>> At this point I was suspecting the data distributions for the join
>> columns may be somewhat weird, causing issues for the hashjoin batching.
>> For events.contributions.id it's perfectly fine - it's entirely unique,
>> with each ID having 1 entry. Unsurprisingly, because it's the PK. But
>> for attachments.folders.contribution_id I see this:
>>
>> SELECT contribution_id, count(*) FROM attachments.folders
>>  GROUP BY contribution_id ORDER BY 2 DESC;
>>
>>  contribution_id | count
>> -----------------+--------
>>                  | 464515
>>          5492978 |     67
>>          4117499 |     42
>>          4045045 |     41
>>              ...
>>
>> So there's ~500k entries with NULL, that can't possibly match to
>> anything (right)? I assume we still add them to the hash, though.
> 
> That's also the conditions required to prevent the
> "stop-partitioning-it's-not-working" logic from triggering.  That
> thing where we know we need to pick a better lower than 100%.  But
> what?
> 
> Did this commit help?
> 
> commit 1811f1af98fb237fdd5adb588cd4b57c433b75f8
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Thu Mar 19 15:21:36 2026 -0400
> 
>     Improve hash join's handling of tuples with null join keys.

Possibly. With the original (simplified) query, I get no failures with
current master. And it starts failing after I revert 1811f1af98.

With the alternative queries (with IS NOT NULL), it seems to work OK
even after the revert. So maybe the queries are not failing for the same
reason?

regards

-- 
Tomas Vondra







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

* Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
  2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
  2026-04-02 18:12 ` Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Tomas Vondra <tomas@vondra.me>
@ 2026-05-19 10:10   ` Adrian Mönnich <adrian.moennich@cern.ch>
  1 sibling, 0 replies; 9+ messages in thread

From: Adrian Mönnich @ 2026-05-19 10:10 UTC (permalink / raw)
  To: Tomas Vondra <tomas@vondra.me>; Andres Freund <andres@anarazel.de>; +Cc: pgsql-bugs@lists.postgresql.org; Tomas Vondra <tv@fuzzy.cz>; Thomas Munro <thomas.munro@gmail.com>

Hi,

just wondering, when this gets fixed, will the fix only go into the latest
master version, or also get backported to other still-supported versions?

Cheers,
Adrian







^ permalink  raw  reply  [nested|flat] 9+ messages in thread

end of thread, other threads:[~2026-05-19 10:10 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-04-02 14:49 Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours) Adrian Mönnich <adrian.moennich@cern.ch>
2026-04-02 18:12 ` Tomas Vondra <tomas@vondra.me>
2026-04-02 19:00   ` Tomas Vondra <tomas@vondra.me>
2026-04-02 22:34     ` Tomas Vondra <tomas@vondra.me>
2026-04-02 22:43       ` Tom Lane <tgl@sss.pgh.pa.us>
2026-04-04 14:45         ` Tomas Vondra <tomas@vondra.me>
2026-04-16 05:25           ` Thomas Munro <thomas.munro@gmail.com>
2026-04-16 18:52             ` Tomas Vondra <tomas@vondra.me>
2026-05-19 10:10   ` Adrian Mönnich <adrian.moennich@cern.ch>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox