Re: scaling up from t1n to 60 million records

public inbox for pgsql-general@postgresql.org  
help / color / mirror / Atom feed

From: Martin Mueller <martinmueller@northwestern.edu>
To: pgsql-general <pgsql-general@postgresql.org>
Subject: Re: scaling up from t1n to 60 million records
Date: Tue, 19 May 2026 18:52:57 +0000
Message-ID: <CY8PR05MB10108C25DA7344760DAA414FEC4002@CY8PR05MB10108.namprd05.prod.outlook.com> (raw)
In-Reply-To: <ecd7305e-888b-43bb-9e16-4297c93e4904@aklaver.com>
References: <CY8PR05MB1010861EAD48ED098786C9690C4002@CY8PR05MB10108.namprd05.prod.outlook.com>
	<ecd7305e-888b-43bb-9e16-4297c93e4904@aklaver.com>

Here is  a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern  texts with 1.5  billion words.  They exist as TEI-XML files and each word is wrapped in a <w> element.  Here are the first and last two words in the corpus

                  <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>

      <w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>

...

      <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>

          <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>

The corpus has many corrupt spellings and errors in the linguistic annotation. Most of them are low-frequency phenomena and occur in no more than 64 documents. In nearly all cases you have enough evidence to correct a word or its annotation if you can see the word in the middle of a text string that  includes

  1.
the spelling
  2.
the lemma
  3.
the part of speech tag
  4.
a standard spelling (e.g. 'loue' for 'love')
  5.
up to seven previous words
  6.
up to seven next words
  7.
the spelling and POS tag of the previous
  8.
the lemma and POS tag of the next word
  9.
the Xpath of the current work

My goal is to involve users of the corpus in identifying and correcting corrupt readings. I call this a "philological shopping cart" since the offering of a correction can be thought of as a sale. Instead of buying something, with the machine registering the who, what, when, and where of the purchase, I offer an emendation, with the machine registering the who, what, when, and where of my emendation.

My hunch is that it would not be particularly difficult to build such a philological shopping cart and that in terms of scale it would not be a big thing.

I am trying to mirror that "shopping cart" on my Mac.  There are about 60 million word occurrences that occur in no more than 64 texts. The basic table has the columns described above, and half a dozen other columns for data entry and various counts.  There are some helper tables. The most important of them is a simple case-insensitive list of spellings with their document frequencies.  This is very useful for finding suspect spellings with queries like "show me all spellings in a low frequency range that contain 'tb' and look for words where replacing 'tb' with 'th' will find a word with a higher document frequency. That picks up spellings like 'tbe', 'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years.  I have Aqua Data Studio as a front end for Postgres, currently version 17, running on a five-year old Mac with an Intel processor and 32 GB of memory.  I know a lot less about the innards of a SQL database than I should.

My largest kwic table has about 15 million rows with dozen columns for each row. Except for the left and right context, the columns consists of single words or numbers. The left and right context columns rarely add up to more than 35 characters each..  I have used plain indexes for some columns, with commands like "Create index on kwics16(keyword)", where 'kwics16' is the table name. My typical routine takes  a single-user interactive form: ask a query, wait for the results (typically seconds, sometimes a minute or more), and do something with the results.  I know next to nothing about the size of the database or tables, and it's not something Ihave needed to worry about. There are occasional memory bottle necks, because Aqua Data Studio isn't particularly good at release memory once it's no longer used. Closing and reopening the client fixes that.

It takes an hour or so to upload a table of this kind into the database.  Several tables of that size exist on my database and don't cause any trouble. I don't know at what point I would be running into constraints of an aging Mac with 32 GB of memory and a 2 TB hard drive.

I could comfortably live with what I'm doing now, dividing the data into three or four frequency ranges.

Given this information, should I try and create a single table or am I likely to run into serious constraints if I move beyond my current maximum table size of 15 million records.

Perhaps there is no clear answer, and I should just experiment.  But if any reader with more knowledge of Postgres thinks that in my environment I would be skating on thin ice if I move beyond current limits, I'd be grateful to be told so.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
From: Adrian Klaver <adrian.klaver@aklaver.com>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <martinmueller@northwestern.edu>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large
> spreadsheet on steroids that analyzes rare or defective spellings in a
> corpus of 65,000 texts and1.5 billion words.  I typically extract  data
> from the corpus with python scripts, turn them into tables and load them
> into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables  with up to ten
> million rows.  If the result set is large, I suspect that most of time
> machine's time is spent displaying result sets. I have used indexing
> sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

>
>
> I am thinking about scaling up to table with about 60 million rows.  Are
> there things to do or watch out for? Or should I proceed on the
> assumption that that 60 million records are within scope and that the
> added timecost is roughly linear?
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>

--
Adrian Klaver
adrian.klaver@aklaver.com

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: pgsql-general@postgresql.org
  Cc: martinmueller@northwestern.edu
  Subject: Re: scaling up from t1n to 60 million records
  In-Reply-To: <CY8PR05MB10108C25DA7344760DAA414FEC4002@CY8PR05MB10108.namprd05.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox