public inbox for pgsql-performance@postgresql.org  
help / color / mirror / Atom feed
From: Justin Pryzby <pryzby@telsasoft.com>
To: pgsql-performance@postgresql.org
Subject: CREATE STATISTICS and join selectivity
Date: Wed, 15 Nov 2017 14:19:13 -0600
Message-ID: <20171115201913.GX2167@telsasoft.com> (raw)
List-Unsubscribe:  <mailto:majordomo@postgresql.org?body=unsub%20pgsql-performance>

I just noticed that PG10 CREATE STATISTICS (dependencies) doesn't seem to work
for joins on multiple columns; is that right?

With statistics on table for 20171111 but not 20171110:

ts=# CREATE STATISTICS x ON site_id,sect_id FROM eric_enodeb_cell_20171111;
ts=# ANALYZE VERBOSE eric_enodeb_cell_20171111;

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, sect_id) WHERE a.site_id=318 AND sect_id=1489;
Nested Loop  (cost=0.83..4565.09 rows=1 width=0) (actual time=23.595..69.541 rows=96 loops=1)
=> bad estimate on redundant WHERE WITHOUT multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171111 a JOIN eric_enodeb_cell_20171111 b USING(start_time, sect_id) WHERE a.site_id=318 AND sect_id=1489;
Nested Loop  (cost=0.83..4862.41 rows=96 width=0) (actual time=0.034..3.882 rows=96 loops=1)
=> good estimate on redundant WHERE WITH multivar statistics

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, sect_id);
Merge Join  (cost=18249.85..19624.18 rows=54858 width=0) (actual time=157.252..236.945 rows=55050 loops=1)
=> good estimate on JOIN on SECT_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171110 a JOIN eric_enodeb_cell_20171110 b USING(start_time, site_id);
Merge Join  (cost=0.83..14431.81 rows=261499 width=0) (actual time=0.031..259.382 rows=262638 loops=1)
=> good estimate on JOIN on SITE_id without stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171111 a JOIN eric_enodeb_cell_20171111 b USING(start_time, site_id);
Merge Join  (cost=0.83..14706.29 rows=268057 width=0) (actual time=37.360..331.276 rows=268092 loops=1)
=> good estimate on JOIN on SITE_id with stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171111 a JOIN eric_enodeb_cell_20171111 b USING(start_time, sect_id);
Merge Join  (cost=18560.89..19959.67 rows=55944 width=0) (actual time=130.865..198.439 rows=55956 loops=1)
=> good estimate on JOIN on SECT_id with stats

ts=# explain ANALYZE SELECT FROM eric_enodeb_cell_20171111 a JOIN eric_enodeb_cell_20171111 b USING(start_time, sect_id, site_id);
Gather  (cost=1000.83..12222.06 rows=460 width=0) (actual time=1.686..149.707 rows=55956 loops=1)
=> poor estimate on redundant JOIN WITH stats (??)

I've already fixed our reports to avoid this kind of thing and support our PG95
customers, but I tentatively would've expected PG10 MV stats to "know" that
USING(site_id, sect_id) is no more selective than USING(sect_id), same as it
knows that's true for WHERE site... AND sect....

Justin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: pgsql-performance@postgresql.org
  Cc: pryzby@telsasoft.com
  Subject: Re: CREATE STATISTICS and join selectivity
  In-Reply-To: <20171115201913.GX2167@telsasoft.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox