MIME-Version: 1.0
References: 
 <CAA5RZ0vZwR_dSK6fo0P2-EnskUVN0NjLHnGnJMFDPC8-kEW3sQ@mail.gmail.com>
In-Reply-To: 
 <CAA5RZ0vZwR_dSK6fo0P2-EnskUVN0NjLHnGnJMFDPC8-kEW3sQ@mail.gmail.com>
From: Lukas Fittl <lukas@fittl.com>
Date: Thu, 21 May 2026 21:38:09 -0700
Message-ID: 
 <CAP53Pky9CViOfy1yQxJv0VE7aEiuDb13SppwGa7T_qOeBSumfA@mail.gmail.com>
Subject: Re: Improve pg_stat_statements scalability
To: Sami Imseih <samimseih@gmail.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAP53Pky9CViOfy1yQxJv0VE7aEiuDb13SppwGa7T_qOeBSumfA%40mail.gmail.com>
Precedence: bulk

On Mon, May 11, 2026 at 4:54=E2=80=AFPM Sami Imseih <samimseih@gmail.com> w=
rote:
> pg_stat_statements has well-known scaling problems under
> high concurrency. This patch series is an initial proposal
> for how to $SUBJECT.

Agreed. Let's fix this for Postgres 20.

Thank you for putting together initial patches!

>
> The three scaling problems:
>
> 1. Per-entry SpinLock contention. Every counter update acquires
> a SpinLock on the entry. Hot queries executed across many
> backends contend on the same lock, and the hold time grows as
> we add more counters to the struct. Higher core counts mean
> more CPUs contending for the same spinlock, making the problem
> worse on modern hardware. See this complaint here [3].
>
> 2. Deallocation under exclusive LWLock. The hash table is
> fixed-size, bounded by pg_stat_statements.max. When full, a
> single backend performs least-frequently-used eviction of
> the bottom 5% while holding an exclusive LWLock, blocking all
> other backends from reading or writing pg_stat_statements.
> This happens inline during query execution, and workloads with
> many unique queries trigger it frequently. Making matters
> worse, pg_stat_statements.max is a static GUC, requiring a
> full server restart to change, and that is disruptive in
> production systems.
>
> 3. Query text file bloat and GC. Deallocated entries leave dead
> text in the external file. When the file grows to 2x the size
> of live text, GC rewrites the entire file under exclusive lock.
> Disk I/O under exclusive lock is the worst combination, and
> frequent deallocations (problem #2) trigger more GC.

Yep, agreed on these problems.

>
> Using the pgstat, "statistics collector", system to improve
> pg_stat_statements scalability has been discussed previously
> [4]. It writes stats locally first and flushes periodically
> to shared memory, avoiding spinlock contention on the write
> path. Also, the storage is a dshash (partitioned hash table),
> which reduces contention on lookups and allows dynamic resizing
> without restart. The prerequisite work to make this usable by
> extensions has been building over two release cycles:
>
> - PG18: Pluggable cumulative statistics API
> (pgstat_register_kind) [7949d959]
>
> - PG19: Serialization callbacks (to_serialized_data,
> from_serialized_data, finish) for custom stats kinds
> [4ba012a8ed9]
>
> This patch builds on the prerequisite work to address the
> scalability issues mentioned above.
>
> The series consists of two patches:
>
> ---
>
> [0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid)
>
> Adds infrastructure for flushing pending statistics to shared
> memory on demand. pgstat_flush_pending() flushes all pending
> entries in the calling backend immediately. Unlike
> pgstat_report_stat(), it can be called mid-transaction, making
> it suitable for view functions that need fresh shared stats
> before the transaction ends.
>
> pg_stat_flush_pending(pid) is the SQL-callable interface to
> the same function.
>
> This patch is related to the discussion in [1] about flushing
> stats within running transactions. The pg_stat_statements
> modernization provides a good example where this is useful.
> At least for the pg_stat_statements tests, we need a way to
> flush statistics within a transaction. I plan to spin this off
> to a dedicated thread, but include it here for now so this
> patchset can be tested.

I noticed the following in the 0001 patch (to be clear, looking at the
v2 version you posted later), which I wasn't sure about:

> diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/u=
tils/activity/pgstat_relation.c
> index b2ca28f83ba..848687a9f7e 100644
> --- a/src/backend/utils/activity/pgstat_relation.c
> +++ b/src/backend/utils/activity/pgstat_relation.c
> @@ -828,64 +828,76 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref=
, bool nowait)
>
> ...
>         if (t > tabentry->lastscan)
>             tabentry->lastscan =3D t;
>     }
>     tabentry->tuples_returned +=3D lstats->counts.tuples_returned;
>     tabentry->tuples_fetched +=3D lstats->counts.tuples_fetched;
> -    tabentry->tuples_inserted +=3D lstats->counts.tuples_inserted;
> -    tabentry->tuples_updated +=3D lstats->counts.tuples_updated;
> -    tabentry->tuples_deleted +=3D lstats->counts.tuples_deleted;
>     tabentry->tuples_hot_updated +=3D lstats->counts.tuples_hot_updated;
>     tabentry->tuples_newpage_updated +=3D lstats->counts.tuples_newpage_u=
pdated;
> +    tabentry->blocks_fetched +=3D lstats->counts.blocks_fetched;
> +    tabentry->blocks_hit +=3D lstats->counts.blocks_hit;

If I follow correctly, you're moving tuples_(inserted|updated|deleted)
here to special case them so they respect transaction rollbacks, but
you don't do that for tuples_hot_updated and tuples_newpage_updated.
Shouldn't those also be ignored if a transaction rolls back?

>
> [0002] pg_stat_statements: modernize entry storage with
> pgstat kind
>
> This is the main patch. It replaces the private shared-memory
> hash table with the pgstat subsystem's dshash (registered as a
> custom stats kind).

I've taken a first pass at this, and noticed an issue (again, in the
v2 patch for clarity):

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
b/contrib/pg_stat_statements/pg_stat_statements.c
> index 92315627916..0e6e65e3e51 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> ...
> @@ -1254,9 +1066,62 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char=
 *queryString,
>      }
>  }
>
> ...
> +/*
> + * Store query text into a shared entry via the external text file.
> + *
> + * Caller must hold the entry lock.  Does nothing if text is already pre=
sent.
> + */
> +static void
> +pgss_store_query_text(PgStatShared_Pgss *shared_entry,
> +                      const char *query, int query_len, int encoding)
> +{
> +    Size        query_offset;
> +    int            gc_count;
> +
> +    if (shared_entry->query_len >=3D 0)
> +        return;
> +
> +    LWLockAcquire(&pgss->lock.lock, LW_SHARED);
> +    if (qtext_store(query, query_len, &query_offset, &gc_count))
> +    {
> +        shared_entry->query_offset =3D query_offset;
> +        shared_entry->query_len =3D query_len;
> +        shared_entry->encoding =3D encoding;
> +    }
> +    LWLockRelease(&pgss->lock.lock);
> +}
> +
>
> ...
>      /*
> @@ -1313,192 +1180,171 @@ pgss_store(const char *query, int64 queryId,
>      key.queryid =3D queryId;
>      key.toplevel =3D (nesting_level =3D=3D 0);
>
> -    /* Lookup the hash table entry with shared lock. */
> -    LWLockAcquire(&pgss->lock.lock, LW_SHARED);
> -
> -    entry =3D (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL=
);
> -
> -    /* Create new entry, if not present */
> -    if (!entry)
> +    /*
> +     * If jstate is set, create the shared entry and store normalized qu=
ery
> +     * text.  Don't increment counters; entries with zero calls are not
> +     * returned by pg_stat_statements_internal().
> +     */
> +    if (jstate)
>      {
> -        Size        query_offset;
> -        int            gc_count;
> -        bool        stored;
> -        bool        do_gc;
> +        const char *store_query;
>
> -        /*
> -         * Create a new, normalized query string if caller asked.  We do=
n't
> -         * need to hold the lock while doing this work.  (Note: in any c=
ase,
> -         * it's possible that someone else creates a duplicate hashtable=
 entry
> -         * in the interval where we don't hold the lock below.  That cas=
e is
> -         * handled by entry_alloc.)
> -         */
> -        if (jstate)
> ...
>
> -        /* Append new query text to file with only shared lock held */
> -        stored =3D qtext_store(norm_query ? norm_query : query, query_le=
n,
> -                             &query_offset, &gc_count);
> -
> -        /*
> -         * Determine whether we need to garbage collect external query t=
exts
> -         * while the shared lock is still held.  This micro-optimization
> -         * avoids taking the time to decide this while holding exclusive=
 lock.
> -         */
> -        do_gc =3D need_gc_qtexts();
> -
> -        /* Need exclusive lock to make a new hashtable entry - promote *=
/
> -        LWLockRelease(&pgss->lock.lock);
> -        LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE);
> +        norm_query =3D generate_normalized_query(jstate, query,
> +                                               query_location,
> +                                               &query_len);
> +        store_query =3D norm_query ? norm_query : query;
>
> -        /*
> -         * A garbage collection may have occurred while we weren't holdi=
ng the
> -         * lock.  In the unlikely event that this happens, the query tex=
t we
> -         * stored above will have been garbage collected, so write it ag=
ain.
> -         * This should be infrequent enough that doing it while holding
> -         * exclusive lock isn't a performance problem.
> -         */
> -        if (!stored || pgss->gc_count !=3D gc_count)
> -            stored =3D qtext_store(norm_query ? norm_query : query, quer=
y_len,
> -                                 &query_offset, NULL);
> +        entry_ref =3D pgstat_get_entry_ref_locked(PGSTAT_KIND_PGSS,
> +                                                key.dbid,
> +                                                pgss_objid(&key),
> +                                                true);
> +        if (!entry_ref)
> +        {
> +            if (norm_query)
> +                pfree(norm_query);
> +            return;
> +        }
>
> -        /* If we failed to write to the text file, give up */
> -        if (!stored)
> -            goto done;
> +        shared_entry =3D (PgStatShared_Pgss *) entry_ref->shared_stats;
> +        pgss_entry_init(shared_entry, &key, encoding);
> +        pgss_store_query_text(shared_entry, store_query, query_len, enco=
ding);

It appears you've moved the equivalent of the "if (!entry)" check into
the pgss_store_query_text function, and we now unconditionally call
generate_normalized_query.

Is there a reason we couldn't just defer that work until we've
confirmed we actually need to save the query text?

>
> The pgstats hash key has an objid:
>
> typedef struct PgStat_HashKey
> {
>     PgStat_Kind kind;           /* statistics entry kind */
>     Oid         dboid;          /* database ID. InvalidOid for shared
> objects. */
>     uint64      objid;          /* object ID (table, function, etc.), or
>                                  * identifier. */
> } PgStat_HashKey;
>
> So the entry objid is computed by combining hashes from userid, queryid a=
nd
> toplevel.

Makes sense.

>
> Note that because of 0001 there are no changes required to the
> existing regression tests. We may need to add some new tests,
> but I have not thought too much about that yet.
>
> Key design changes:
>
> - No SpinLock on the execution path. Counters accumulate
> per-backend and merge into shared memory on flush. Stddev
> and related fields use Welford's algorithm as before, but since the stats
> are first updated locally, we need
> a way to merge the stats on flush, so we use Chan's parallel
> algorithm for this purpose [5].
>
> - No fixed shared memory allocation. Entries live in the pgstat
> dshash, which grows dynamically. pg_stat_statements.max
> becomes PGC_SIGHUP, and therefore can be changed dynamically.

Big +1 on these first two design changes - its clear that we get a
benefit here, and its nice to simplify/streamline the code as well.

> - Throttled inline eviction. When entry count reaches
> pgss_max, a backend attempts eviction using a conditional
> lock and a shared timestamp that ensures at most one eviction
> cycle per 10 seconds. Other backends simply skip entry
> creation without blocking. This is acceptable because the
> current upstream behavior already suffers from the same data
> loss under heavy churn. Newly created entries are immediately
> candidates for eviction and are frequently removed in the
> next deallocation cycle.

I feel this is problematic as presented (looking at pgss_maybe_evict
in v2), because even on a somewhat steady workload you have a high
risk of losing the 5001st entry (if max is 5000).

What if we instead trigger deallocation early, e.g. at 90% of entries
being full? I think this could also go well with a background worker
doing the work.

> The throttled approach makes this
> trade-off explicit and avoids the cost of blocking all backends
> behind an exclusive lock to create space for entries that may
> just be removed shortly. See benchmark results below for
> measuring the retention of "hot" entries.
>
> The aging decay mechanism remains in place, but the sticky
> entries are no longer needed for this patch, and we simply
> don't evict calls =3D=3D 0.
>
> A background worker for eviction was also considered. The
> current inline approach was chosen because it avoids the
> added complexity while still preventing other backends from
> blocking. If a background worker has more merit, I am open
> to that discussion.

I think a background worker might simplify things around deallocation
and query text GC handling - we have precedent that contrib extensions
utilize background workers, e.g. pg_prewarm, pg_stash_advice, etc.

> - Query texts in DSA memory with disk fallback. A new GUC
> pg_stat_statements.query_text_memory (default 64 MB) controls
> DSA shared memory for query text storage. When enabled and
> not exhausted, new texts are stored in DSA instead of the
> external file, eliminating file I/O on the read path. If DSA
> is disabled (set to 0) or full, texts fall back to the
> existing file-based storage. Entry eviction and reset properly
> free DSA-allocated texts, and GC of the text file skips
> DSA-backed entries.

I feel like this design introduces a performance cliff at the
in-memory =3D> disk boundary that will be surprising and hard to
understand. We had previously discussed doing an in-memory only
design, but I could see that being harder to get agreement on in
practice with the large query text files many of us have seen on
customer systems.

What if we instead designed this as an in-memory buffering mechanism?

i.e. new query texts are first written to the in-memory buffer, and
then flushed to disk. The in-memory buffer is kept as DSA like you
have it, so many concurrent sessions suddenly running a new query only
have one of them actually saving it to the buffer. We could then have
a background writer write that out to disk in batches, to make room
for new entries in the in-memory buffer.

If we wanted to get fancy we could even introduce compression when we
write it out to disk, since that work would no longer have to be
happening inline with the query execution - it could instead be
happening in the background worker.

>
> Open questions:
>
> 1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the
> custom kind ID. For commit, we'd want a proper kind number.

Makes sense to me - I assume the question is when we assign the real ID?

I think we could separately consider whether pg_stat_statements should
live in core (maybe?), but that seems like a different patch set to
me.

>
> 2. The 10-second eviction throttle is a compile-time constant
> (EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed
> interval sufficient?

I suspect its probably slightly too frequent, but its a bit hard to
say for sure. Personally, I don't think we need a GUC for it
necessarily, at least not from how I currently look at it.

> 3. The current design skips entry creation when at capacity and
> eviction is throttled. An alternative would be to allow
> temporary overshoot (soft limit) and never reject entries.
> Thoughts on the trade-off? However, for heavy churn and
> many unique entries we can easily overshoot the max by
> magnitudes higher which I don't think is a good idea.

Per earlier note, I think we should change this so that we trigger
eviction early before actually reaching the max.

It seems reasonable to still drop entries when we actually exceed the
max, instead of blocking on the deallocation finishing (as we do
today).

>
> 4. The default size of query text memory. The patch has it at 64MB.

I think we can work on other parts of the design first, but I wonder
if we should auto-scale that based on shared_buffers (and allow
overriding) - 64MB default seems high for very small instances.

Thanks,
Lukas

--
Lukas Fittl