Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wQHeu-001PNr-2B for pgsql-hackers@arkaria.postgresql.org; Fri, 22 May 2026 04:38:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wQHes-00CIYX-0j for pgsql-hackers@arkaria.postgresql.org; Fri, 22 May 2026 04:38:51 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wQHer-00CIYP-28 for pgsql-hackers@lists.postgresql.org; Fri, 22 May 2026 04:38:50 +0000 Received: from mail-qv1-xf30.google.com ([2607:f8b0:4864:20::f30]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wQHep-00000000Dkm-3Vqj for pgsql-hackers@postgresql.org; Fri, 22 May 2026 04:38:49 +0000 Received: by mail-qv1-xf30.google.com with SMTP id 6a1803df08f44-8b3fe2f19a4so72088136d6.2 for ; Thu, 21 May 2026 21:38:47 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1779424726; cv=none; d=google.com; s=arc-20240605; b=D65ucK45w7mTac3MKJSrz4+cCvHmxDihU9I9IMeywtz6jnWpA7uHTj1ecXTXUjOnhZ GBRzOH/m00wycbc/+x1UPYEY4dcc/CN0U6MbvfbHLrXrsQ5MVItXltS8PvDyBQriIkgL 7Qcb8TQfKVJUU74AVpl2fsXGzGf2+vhXdtsxi7fqh3WsZmLE6nlkt+1XmRUd5Z9E1M7e vyJNIFjUTCEpMltGkJTnKy8ZX+PNKZ3T+VUIMpxtjctqSqNKYacRgJguqBSDsv0P28bU hGp7+plvWFju7UHD3s8rOPFAZOWXNBTBScDg90QOOOEI9RHxht9GCCYuXEcrhCJ1q/qn rImw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=zUyyJ+aWdLBAcxsubZHDwx5CG1bhuXUSbxLWfnlL3WA=; fh=eVUSIqo8oW6GlS9+lJ2dMbh393DkmW/G0CKaTSrcg9Y=; b=B/Ty0KxQeV4NiTJj6G/oDU4gM1w+S8q4c7KRY64p+QPwU6hKZfrlMvdCjXwqSgTTUY AabueMkUgUmd/InBf8ku6ewNFl5OhT5l2zMtPULf9Hw4jKzEICEwXn95qAN0JD9Q9fVW EV1Yu+hUJ70qByhzKiuY4opz9251occEEa5Jy/OHuYwJV4kT+gXshHfNZ05JjuxXBXfS a7SockiZZ6Ml9duYTTme7BnNKL4hMkn0r4upwcjBk0uncv66wqwnZ4UZQwzhRvItPAbR OKkshLtQ+rUUVXpAA3ibhLCuD1QVr7cggRF/ENPQRYKmDC4Qc/y2vP2v5r4zFnGkA6Ll qdcw==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=fittl.com; s=google; t=1779424726; x=1780029526; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=zUyyJ+aWdLBAcxsubZHDwx5CG1bhuXUSbxLWfnlL3WA=; b=ejYHxVS0lc5E6GKQFfFmlqcD/htdxzWdJkWv8HlbpJdTpX71gJbVanJa2JHe19MJaU 9Ec5mXEF0diWY7/7dLXURkj8knmTRYnAHI2nCY12nCHS7BJpuA4YYTTlIIWGjhxg+2an FPxTMNDfNGS2tDY4V5vWLfaJKrZLG0yQ1g7r4= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779424726; x=1780029526; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=zUyyJ+aWdLBAcxsubZHDwx5CG1bhuXUSbxLWfnlL3WA=; b=saryEj2oQ6ISVMjqe8aNskgOEbkbldusDzPHE/nMAPzWJ2SvDIYEN+AQzAssrzuAPd l0064LRku3/H6ZN1AWBRFRWN4GqQqSv+Ykq8FwWf5lH0kKx9E+q+IiVK28Kh3X8PT0nz VE6yJxVImlWpBvJjglGlkih2lUDqK1nraebJTGfUU5k/gIE6f8kota6gPIdZIu7+M80o qrtjwAmVUU9j/pHhfMPATgKdfdeeiv1PgEBia9vaz+yu3YDioyeXxC303GDizItQ4TOd T7Khmuw21wxP+7mho4HOon2G8IK2nbUx+D12a7jD+uKSHTDxHsSsOe6+uvSj+bnqBqp4 I6SA== X-Gm-Message-State: AOJu0YzBJxvkW/tyOi4GdS+5ckTQ0O7I6XOF7i9QLugnRLBlR9eeTmf1 PxKv95OH+JAqw6tG1NkHPEsTCMvsEJBmW3yqqn8+EhF9qwnqLure4Bzk7JkqSbffkTPXC4728ek jPQxi6OXu0U5X9zKRTHZLKnIwr4bqSb7ParPV6sL9 X-Gm-Gg: Acq92OGTGancX0IGRwswuv+WmMr0gDqr6p9I4TJp7oqgTpNoXlFV6ZSu1r2o6hSHACC UDrV9C6CbsfzceAiLYwaURERvFPANwUa0mwtrdZ6PvgHuZHIsSrXYt4EFk/0PhJFpWY9Hsbm70T fNliMZeDOXNMEpori428a8xSbU1bzMMs4zMO/g3rZ1tcHICU7rOIU6SpUqXVXVxPvKhasNOylqy vomqgKEnvxIg0bW7lUUKAp9s9LbqyH8juyeAfWo1YnLGa2/FuoDPvm1efgyQpYz9g8CfGcrOgpM DLKNW2xxXJcPfgOTtu41dHSY8vPdfhEpsS6cXxECjziPnAMUdw== X-Received: by 2002:a05:6214:540e:b0:8b1:f069:3fc9 with SMTP id 6a1803df08f44-8cc7b65a95bmr37179426d6.23.1779424726112; Thu, 21 May 2026 21:38:46 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Lukas Fittl Date: Thu, 21 May 2026 21:38:09 -0700 X-Gm-Features: AVHnY4JprcCZZ31dC5r_Gj73Ukz9uCLXOOav39BQV_VySvb2O52rjaR0NvKO44I Message-ID: Subject: Re: Improve pg_stat_statements scalability To: Sami Imseih Cc: pgsql-hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Mon, May 11, 2026 at 4:54=E2=80=AFPM Sami Imseih w= rote: > pg_stat_statements has well-known scaling problems under > high concurrency. This patch series is an initial proposal > for how to $SUBJECT. Agreed. Let's fix this for Postgres 20. Thank you for putting together initial patches! > > The three scaling problems: > > 1. Per-entry SpinLock contention. Every counter update acquires > a SpinLock on the entry. Hot queries executed across many > backends contend on the same lock, and the hold time grows as > we add more counters to the struct. Higher core counts mean > more CPUs contending for the same spinlock, making the problem > worse on modern hardware. See this complaint here [3]. > > 2. Deallocation under exclusive LWLock. The hash table is > fixed-size, bounded by pg_stat_statements.max. When full, a > single backend performs least-frequently-used eviction of > the bottom 5% while holding an exclusive LWLock, blocking all > other backends from reading or writing pg_stat_statements. > This happens inline during query execution, and workloads with > many unique queries trigger it frequently. Making matters > worse, pg_stat_statements.max is a static GUC, requiring a > full server restart to change, and that is disruptive in > production systems. > > 3. Query text file bloat and GC. Deallocated entries leave dead > text in the external file. When the file grows to 2x the size > of live text, GC rewrites the entire file under exclusive lock. > Disk I/O under exclusive lock is the worst combination, and > frequent deallocations (problem #2) trigger more GC. Yep, agreed on these problems. > > Using the pgstat, "statistics collector", system to improve > pg_stat_statements scalability has been discussed previously > [4]. It writes stats locally first and flushes periodically > to shared memory, avoiding spinlock contention on the write > path. Also, the storage is a dshash (partitioned hash table), > which reduces contention on lookups and allows dynamic resizing > without restart. The prerequisite work to make this usable by > extensions has been building over two release cycles: > > - PG18: Pluggable cumulative statistics API > (pgstat_register_kind) [7949d959] > > - PG19: Serialization callbacks (to_serialized_data, > from_serialized_data, finish) for custom stats kinds > [4ba012a8ed9] > > This patch builds on the prerequisite work to address the > scalability issues mentioned above. > > The series consists of two patches: > > --- > > [0001] pgstat: add pgstat_flush_pending() and pg_stat_flush_pending(pid) > > Adds infrastructure for flushing pending statistics to shared > memory on demand. pgstat_flush_pending() flushes all pending > entries in the calling backend immediately. Unlike > pgstat_report_stat(), it can be called mid-transaction, making > it suitable for view functions that need fresh shared stats > before the transaction ends. > > pg_stat_flush_pending(pid) is the SQL-callable interface to > the same function. > > This patch is related to the discussion in [1] about flushing > stats within running transactions. The pg_stat_statements > modernization provides a good example where this is useful. > At least for the pg_stat_statements tests, we need a way to > flush statistics within a transaction. I plan to spin this off > to a dedicated thread, but include it here for now so this > patchset can be tested. I noticed the following in the 0001 patch (to be clear, looking at the v2 version you posted later), which I wasn't sure about: > diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/u= tils/activity/pgstat_relation.c > index b2ca28f83ba..848687a9f7e 100644 > --- a/src/backend/utils/activity/pgstat_relation.c > +++ b/src/backend/utils/activity/pgstat_relation.c > @@ -828,64 +828,76 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref= , bool nowait) > > ... > if (t > tabentry->lastscan) > tabentry->lastscan =3D t; > } > tabentry->tuples_returned +=3D lstats->counts.tuples_returned; > tabentry->tuples_fetched +=3D lstats->counts.tuples_fetched; > - tabentry->tuples_inserted +=3D lstats->counts.tuples_inserted; > - tabentry->tuples_updated +=3D lstats->counts.tuples_updated; > - tabentry->tuples_deleted +=3D lstats->counts.tuples_deleted; > tabentry->tuples_hot_updated +=3D lstats->counts.tuples_hot_updated; > tabentry->tuples_newpage_updated +=3D lstats->counts.tuples_newpage_u= pdated; > + tabentry->blocks_fetched +=3D lstats->counts.blocks_fetched; > + tabentry->blocks_hit +=3D lstats->counts.blocks_hit; If I follow correctly, you're moving tuples_(inserted|updated|deleted) here to special case them so they respect transaction rollbacks, but you don't do that for tuples_hot_updated and tuples_newpage_updated. Shouldn't those also be ignored if a transaction rolls back? > > [0002] pg_stat_statements: modernize entry storage with > pgstat kind > > This is the main patch. It replaces the private shared-memory > hash table with the pgstat subsystem's dshash (registered as a > custom stats kind). I've taken a first pass at this, and noticed an issue (again, in the v2 patch for clarity): diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c > index 92315627916..0e6e65e3e51 100644 > --- a/contrib/pg_stat_statements/pg_stat_statements.c > +++ b/contrib/pg_stat_statements/pg_stat_statements.c > ... > @@ -1254,9 +1066,62 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char= *queryString, > } > } > > ... > +/* > + * Store query text into a shared entry via the external text file. > + * > + * Caller must hold the entry lock. Does nothing if text is already pre= sent. > + */ > +static void > +pgss_store_query_text(PgStatShared_Pgss *shared_entry, > + const char *query, int query_len, int encoding) > +{ > + Size query_offset; > + int gc_count; > + > + if (shared_entry->query_len >=3D 0) > + return; > + > + LWLockAcquire(&pgss->lock.lock, LW_SHARED); > + if (qtext_store(query, query_len, &query_offset, &gc_count)) > + { > + shared_entry->query_offset =3D query_offset; > + shared_entry->query_len =3D query_len; > + shared_entry->encoding =3D encoding; > + } > + LWLockRelease(&pgss->lock.lock); > +} > + > > ... > /* > @@ -1313,192 +1180,171 @@ pgss_store(const char *query, int64 queryId, > key.queryid =3D queryId; > key.toplevel =3D (nesting_level =3D=3D 0); > > - /* Lookup the hash table entry with shared lock. */ > - LWLockAcquire(&pgss->lock.lock, LW_SHARED); > - > - entry =3D (pgssEntry *) hash_search(pgss_hash, &key, HASH_FIND, NULL= ); > - > - /* Create new entry, if not present */ > - if (!entry) > + /* > + * If jstate is set, create the shared entry and store normalized qu= ery > + * text. Don't increment counters; entries with zero calls are not > + * returned by pg_stat_statements_internal(). > + */ > + if (jstate) > { > - Size query_offset; > - int gc_count; > - bool stored; > - bool do_gc; > + const char *store_query; > > - /* > - * Create a new, normalized query string if caller asked. We do= n't > - * need to hold the lock while doing this work. (Note: in any c= ase, > - * it's possible that someone else creates a duplicate hashtable= entry > - * in the interval where we don't hold the lock below. That cas= e is > - * handled by entry_alloc.) > - */ > - if (jstate) > ... > > - /* Append new query text to file with only shared lock held */ > - stored =3D qtext_store(norm_query ? norm_query : query, query_le= n, > - &query_offset, &gc_count); > - > - /* > - * Determine whether we need to garbage collect external query t= exts > - * while the shared lock is still held. This micro-optimization > - * avoids taking the time to decide this while holding exclusive= lock. > - */ > - do_gc =3D need_gc_qtexts(); > - > - /* Need exclusive lock to make a new hashtable entry - promote *= / > - LWLockRelease(&pgss->lock.lock); > - LWLockAcquire(&pgss->lock.lock, LW_EXCLUSIVE); > + norm_query =3D generate_normalized_query(jstate, query, > + query_location, > + &query_len); > + store_query =3D norm_query ? norm_query : query; > > - /* > - * A garbage collection may have occurred while we weren't holdi= ng the > - * lock. In the unlikely event that this happens, the query tex= t we > - * stored above will have been garbage collected, so write it ag= ain. > - * This should be infrequent enough that doing it while holding > - * exclusive lock isn't a performance problem. > - */ > - if (!stored || pgss->gc_count !=3D gc_count) > - stored =3D qtext_store(norm_query ? norm_query : query, quer= y_len, > - &query_offset, NULL); > + entry_ref =3D pgstat_get_entry_ref_locked(PGSTAT_KIND_PGSS, > + key.dbid, > + pgss_objid(&key), > + true); > + if (!entry_ref) > + { > + if (norm_query) > + pfree(norm_query); > + return; > + } > > - /* If we failed to write to the text file, give up */ > - if (!stored) > - goto done; > + shared_entry =3D (PgStatShared_Pgss *) entry_ref->shared_stats; > + pgss_entry_init(shared_entry, &key, encoding); > + pgss_store_query_text(shared_entry, store_query, query_len, enco= ding); It appears you've moved the equivalent of the "if (!entry)" check into the pgss_store_query_text function, and we now unconditionally call generate_normalized_query. Is there a reason we couldn't just defer that work until we've confirmed we actually need to save the query text? > > The pgstats hash key has an objid: > > typedef struct PgStat_HashKey > { > PgStat_Kind kind; /* statistics entry kind */ > Oid dboid; /* database ID. InvalidOid for shared > objects. */ > uint64 objid; /* object ID (table, function, etc.), or > * identifier. */ > } PgStat_HashKey; > > So the entry objid is computed by combining hashes from userid, queryid a= nd > toplevel. Makes sense. > > Note that because of 0001 there are no changes required to the > existing regression tests. We may need to add some new tests, > but I have not thought too much about that yet. > > Key design changes: > > - No SpinLock on the execution path. Counters accumulate > per-backend and merge into shared memory on flush. Stddev > and related fields use Welford's algorithm as before, but since the stats > are first updated locally, we need > a way to merge the stats on flush, so we use Chan's parallel > algorithm for this purpose [5]. > > - No fixed shared memory allocation. Entries live in the pgstat > dshash, which grows dynamically. pg_stat_statements.max > becomes PGC_SIGHUP, and therefore can be changed dynamically. Big +1 on these first two design changes - its clear that we get a benefit here, and its nice to simplify/streamline the code as well. > - Throttled inline eviction. When entry count reaches > pgss_max, a backend attempts eviction using a conditional > lock and a shared timestamp that ensures at most one eviction > cycle per 10 seconds. Other backends simply skip entry > creation without blocking. This is acceptable because the > current upstream behavior already suffers from the same data > loss under heavy churn. Newly created entries are immediately > candidates for eviction and are frequently removed in the > next deallocation cycle. I feel this is problematic as presented (looking at pgss_maybe_evict in v2), because even on a somewhat steady workload you have a high risk of losing the 5001st entry (if max is 5000). What if we instead trigger deallocation early, e.g. at 90% of entries being full? I think this could also go well with a background worker doing the work. > The throttled approach makes this > trade-off explicit and avoids the cost of blocking all backends > behind an exclusive lock to create space for entries that may > just be removed shortly. See benchmark results below for > measuring the retention of "hot" entries. > > The aging decay mechanism remains in place, but the sticky > entries are no longer needed for this patch, and we simply > don't evict calls =3D=3D 0. > > A background worker for eviction was also considered. The > current inline approach was chosen because it avoids the > added complexity while still preventing other backends from > blocking. If a background worker has more merit, I am open > to that discussion. I think a background worker might simplify things around deallocation and query text GC handling - we have precedent that contrib extensions utilize background workers, e.g. pg_prewarm, pg_stash_advice, etc. > - Query texts in DSA memory with disk fallback. A new GUC > pg_stat_statements.query_text_memory (default 64 MB) controls > DSA shared memory for query text storage. When enabled and > not exhausted, new texts are stored in DSA instead of the > external file, eliminating file I/O on the read path. If DSA > is disabled (set to 0) or full, texts fall back to the > existing file-based storage. Entry eviction and reset properly > free DSA-allocated texts, and GC of the text file skips > DSA-backed entries. I feel like this design introduces a performance cliff at the in-memory =3D> disk boundary that will be surprising and hard to understand. We had previously discussed doing an in-memory only design, but I could see that being harder to get agreement on in practice with the large query text files many of us have seen on customer systems. What if we instead designed this as an in-memory buffering mechanism? i.e. new query texts are first written to the in-memory buffer, and then flushed to disk. The in-memory buffer is kept as DSA like you have it, so many concurrent sessions suddenly running a new query only have one of them actually saving it to the buffer. We could then have a background writer write that out to disk in batches, to make room for new entries in the in-memory buffer. If we wanted to get fancy we could even introduce compression when we write it out to disk, since that work would no longer have to be happening inline with the query execution - it could instead be happening in the background worker. > > Open questions: > > 1. The attached patches use PGSTAT_KIND_EXPERIMENTAL for the > custom kind ID. For commit, we'd want a proper kind number. Makes sense to me - I assume the question is when we assign the real ID? I think we could separately consider whether pg_stat_statements should live in core (maybe?), but that seems like a different patch set to me. > > 2. The 10-second eviction throttle is a compile-time constant > (EVICTION_INTERVAL_MS). Should this be a GUC, or is a fixed > interval sufficient? I suspect its probably slightly too frequent, but its a bit hard to say for sure. Personally, I don't think we need a GUC for it necessarily, at least not from how I currently look at it. > 3. The current design skips entry creation when at capacity and > eviction is throttled. An alternative would be to allow > temporary overshoot (soft limit) and never reject entries. > Thoughts on the trade-off? However, for heavy churn and > many unique entries we can easily overshoot the max by > magnitudes higher which I don't think is a good idea. Per earlier note, I think we should change this so that we trigger eviction early before actually reaching the max. It seems reasonable to still drop entries when we actually exceed the max, instead of blocking on the deallocation finishing (as we do today). > > 4. The default size of query text memory. The patch has it at 64MB. I think we can work on other parts of the design first, but I wonder if we should auto-scale that based on shared_buffers (and allow overriding) - 64MB default seems high for very small instances. Thanks, Lukas -- Lukas Fittl