scaling up from t1n to 60 million records

public inbox for pgsql-general@postgresql.org  
help / color / mirror / Atom feed

scaling up from t1n to 60 million records
6+ messages / 4 participants
[nested] [flat]

* scaling up from t1n to 60 million records
@ 2026-05-19 14:27 Martin Mueller <martinmueller@northwestern.edu>
  2026-05-19 14:32 ` Re: scaling up from t1n to 60 million records Jan Karremans <karremans.ja@gmail.com>
  2026-05-19 14:41 ` Re: scaling up from t1n to 60 million records Ron Johnson <ronljohnsonjr@gmail.com>
  2026-05-19 14:44 ` Re: scaling up from t1n to 60 million records Adrian Klaver <adrian.klaver@aklaver.com>
  0 siblings, 3 replies; 6+ messages in thread

From: Martin Mueller @ 2026-05-19 14:27 UTC (permalink / raw)
  To: pgsql-general

I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet on steroids that analyzes rare or defective spellings in a corpus of 65,000 texts and1.5 billion words.  I typically extract  data from the corpus with python scripts, turn them into tables and load them into the database.

On my Mac with 32 GB of memory performance is OK with queries that typically within seconds extract data rows from tables  with up to ten million rows.  If the result set is large, I suspect that most of time machine's time is spent displaying result sets. I have used indexing sparingly. While it helps, the time savings often don't matter much.

I am thinking about scaling up to table with about 60 million rows.  Are there things to do or watch out for? Or should I proceed on the assumption that that 60 million records are within scope and that the added timecost is roughly linear?

Martin Mueller
Professor emeritus of English and Classics
Northwestern University

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scaling up from t1n to 60 million records
  2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
@ 2026-05-19 14:32 ` Jan Karremans <karremans.ja@gmail.com>
  2 siblings, 0 replies; 6+ messages in thread

From: Jan Karremans @ 2026-05-19 14:32 UTC (permalink / raw)
  To: Martin Mueller <martinmueller@northwestern.edu>; +Cc: pgsql-general

Dear Martin,

I think you would be mostly good for just going ahead with this.
You might look at the size of your tables, but I expect that all to be well within safe ranges.

Cheers,
Jan

> On 19 May 2026, at 16:27, Martin Mueller <martinmueller@northwestern.edu> wrote:
> 
> I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet on steroids that analyzes rare or defective spellings in a corpus of 65,000 texts and1.5 billion words.  I typically extract  data from the corpus with python scripts, turn them into tables and load them into the database.
> 
> On my Mac with 32 GB of memory performance is OK with queries that typically within seconds extract data rows from tables  with up to ten million rows.  If the result set is large, I suspect that most of time machine's time is spent displaying result sets. I have used indexing sparingly. While it helps, the time savings often don't matter much. 
> 
> I am thinking about scaling up to table with about 60 million rows.  Are there things to do or watch out for? Or should I proceed on the assumption that that 60 million records are within scope and that the added timecost is roughly linear?
>  
> Martin Mueller
> Professor emeritus of English and Classics
> Northwestern University
>  
>  
>  



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scaling up from t1n to 60 million records
  2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
@ 2026-05-19 14:41 ` Ron Johnson <ronljohnsonjr@gmail.com>
  2 siblings, 0 replies; 6+ messages in thread

From: Ron Johnson @ 2026-05-19 14:41 UTC (permalink / raw)
  To: pgsql-general

On Tue, May 19, 2026 at 10:27 AM Martin Mueller <
martinmueller@northwestern.edu> wrote:

> I use Postgres with a GUI frontend (Aquafold) as a very large spreadsheet
> on steroids that analyzes rare or defective spellings in a corpus of 65,000
> texts and1.5 billion words.  I typically extract  data from the corpus with
> python scripts, turn them into tables and load them into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables  with up to ten
> million rows.  If the result set is large, I suspect that most of time
> machine's time is spent displaying result sets. I have used indexing
> sparingly. While it helps, the time savings often don't matter much.
>
>
> I am thinking about scaling up to table with about 60 million rows.  Are
> there things to do or watch out for?
>

Use the correct tool for the task at hand, even if you are not a carpenter
and thus only know how to use a hammer.

Or should I proceed on the assumption that that 60 million records are
> within scope and that the added timecost is roughly linear?
>

In my experience, database performance shows a hockey stick graph: good
while stuff fits in memory, and then suddenly not so good.

The correct tool for full text search is PG's Full Text Search (ts_vector)
facility, paired with GIN indexes.  Do you use them?  Probably not, based
on your comments, but that would "keep 'everything' in memory", thus
staving off performance degradation.

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scaling up from t1n to 60 million records
  2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
@ 2026-05-19 14:44 ` Adrian Klaver <adrian.klaver@aklaver.com>
  2026-05-19 18:52   ` Re: scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
  2 siblings, 1 reply; 6+ messages in thread

From: Adrian Klaver @ 2026-05-19 14:44 UTC (permalink / raw)
  To: Martin Mueller <martinmueller@northwestern.edu>; pgsql-general

On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large 
> spreadsheet on steroids that analyzes rare or defective spellings in a 
> corpus of 65,000 texts and1.5 billion words.  I typically extract  data 
> from the corpus with python scripts, turn them into tables and load them 
> into the database.
> 
> 
> On my Mac with 32 GB of memory performance is OK with queries that 
> typically within seconds extract data rows from tables  with up to ten 
> million rows.  If the result set is large, I suspect that most of time 
> machine's time is spent displaying result sets. I have used indexing 
> sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

> 
> 
> I am thinking about scaling up to table with about 60 million rows.  Are 
> there things to do or watch out for? Or should I proceed on the 
> assumption that that 60 million records are within scope and that the 
> added timecost is roughly linear?
> 
> Martin Mueller
> 
> Professor emeritus of English and Classics
> 
> Northwestern University
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scaling up from t1n to 60 million records
  2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
  2026-05-19 14:44 ` Re: scaling up from t1n to 60 million records Adrian Klaver <adrian.klaver@aklaver.com>
@ 2026-05-19 18:52   ` Martin Mueller <martinmueller@northwestern.edu>
  2026-05-19 20:18     ` Re: scaling up from t1n to 60 million records Ron Johnson <ronljohnsonjr@gmail.com>
  0 siblings, 1 reply; 6+ messages in thread

From: Martin Mueller @ 2026-05-19 18:52 UTC (permalink / raw)
  To: pgsql-general

Here is  a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern  texts with 1.5  billion words.  They exist as TEI-XML files and each word is wrapped in a <w> element.  Here are the first and last two words in the corpus

                  <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>

      <w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>

...

      <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>

          <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>

The corpus has many corrupt spellings and errors in the linguistic annotation. Most of them are low-frequency phenomena and occur in no more than 64 documents. In nearly all cases you have enough evidence to correct a word or its annotation if you can see the word in the middle of a text string that  includes

  1.
the spelling
  2.
the lemma
  3.
the part of speech tag
  4.
a standard spelling (e.g. 'loue' for 'love')
  5.
up to seven previous words
  6.
up to seven next words
  7.
the spelling and POS tag of the previous
  8.
the lemma and POS tag of the next word
  9.
the Xpath of the current work

My goal is to involve users of the corpus in identifying and correcting corrupt readings. I call this a "philological shopping cart" since the offering of a correction can be thought of as a sale. Instead of buying something, with the machine registering the who, what, when, and where of the purchase, I offer an emendation, with the machine registering the who, what, when, and where of my emendation.

My hunch is that it would not be particularly difficult to build such a philological shopping cart and that in terms of scale it would not be a big thing.

I am trying to mirror that "shopping cart" on my Mac.  There are about 60 million word occurrences that occur in no more than 64 texts. The basic table has the columns described above, and half a dozen other columns for data entry and various counts.  There are some helper tables. The most important of them is a simple case-insensitive list of spellings with their document frequencies.  This is very useful for finding suspect spellings with queries like "show me all spellings in a low frequency range that contain 'tb' and look for words where replacing 'tb' with 'th' will find a word with a higher document frequency. That picks up spellings like 'tbe', 'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years.  I have Aqua Data Studio as a front end for Postgres, currently version 17, running on a five-year old Mac with an Intel processor and 32 GB of memory.  I know a lot less about the innards of a SQL database than I should.

My largest kwic table has about 15 million rows with dozen columns for each row. Except for the left and right context, the columns consists of single words or numbers. The left and right context columns rarely add up to more than 35 characters each..  I have used plain indexes for some columns, with commands like "Create index on kwics16(keyword)", where 'kwics16' is the table name. My typical routine takes  a single-user interactive form: ask a query, wait for the results (typically seconds, sometimes a minute or more), and do something with the results.  I know next to nothing about the size of the database or tables, and it's not something Ihave needed to worry about. There are occasional memory bottle necks, because Aqua Data Studio isn't particularly good at release memory once it's no longer used. Closing and reopening the client fixes that.

It takes an hour or so to upload a table of this kind into the database.  Several tables of that size exist on my database and don't cause any trouble. I don't know at what point I would be running into constraints of an aging Mac with 32 GB of memory and a 2 TB hard drive.

I could comfortably live with what I'm doing now, dividing the data into three or four frequency ranges.

Given this information, should I try and create a single table or am I likely to run into serious constraints if I move beyond my current maximum table size of 15 million records.

Perhaps there is no clear answer, and I should just experiment.  But if any reader with more knowledge of Postgres thinks that in my environment I would be skating on thin ice if I move beyond current limits, I'd be grateful to be told so.

Martin Mueller
Professor emeritus of English and Classics
Northwestern University
From: Adrian Klaver <adrian.klaver@aklaver.com>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <martinmueller@northwestern.edu>; pgsql-general@postgresql.org <pgsql-general@postgresql.org>
Subject: Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large
> spreadsheet on steroids that analyzes rare or defective spellings in a
> corpus of 65,000 texts and1.5 billion words.  I typically extract  data
> from the corpus with python scripts, turn them into tables and load them
> into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables  with up to ten
> million rows.  If the result set is large, I suspect that most of time
> machine's time is spent displaying result sets. I have used indexing
> sparingly. While it helps, the time savings often don't matter much.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

>
>
> I am thinking about scaling up to table with about 60 million rows.  Are
> there things to do or watch out for? Or should I proceed on the
> assumption that that 60 million records are within scope and that the
> added timecost is roughly linear?
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>

--
Adrian Klaver
adrian.klaver@aklaver.com

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: scaling up from t1n to 60 million records
  2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
  2026-05-19 14:44 ` Re: scaling up from t1n to 60 million records Adrian Klaver <adrian.klaver@aklaver.com>
  2026-05-19 18:52   ` Re: scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
@ 2026-05-19 20:18     ` Ron Johnson <ronljohnsonjr@gmail.com>
  0 siblings, 0 replies; 6+ messages in thread

From: Ron Johnson @ 2026-05-19 20:18 UTC (permalink / raw)
  To: pgsql-general

Indices are your friend!  (Except when loading data.)

Add them on any relevant column.
https://www.postgresql.org/docs/16/sql-createindex.html

The "(expression)" clause might be useful in your situation, since it can
exclude some words from an index, exclude empty cells, index upper-case
versions of the word, etc.


On Tue, May 19, 2026 at 2:53 PM Martin Mueller <
martinmueller@northwestern.edu> wrote:

> Here is  a more detailed version.
>
> I work on the curation of a corpus of some 65,000 Early Modern  texts with
> 1.5  billion words.  They exist as TEI-XML files and each word is wrapped
> in a <w> element.  Here are the first and last two words in the corpus
>
> <w lemma="here" pos="av" xml:id="a73abc-001-b-0010">HEre</w>
>
>
> <w lemma="begin" pos="vvz" reg="beginneth" xml:id="a73abc-001-b-0020">begynneth</w>
>
> ...
>
>   <w lemma="mercy" pos="n1" xml:id="e20ady-0008-3120">mercy</w>
>
>           <pc unit="sentence" xml:id="e20ady-0008-3130">.</pc>
>
>
> The corpus has many corrupt spellings and errors in the linguistic
> annotation. Most of them are low-frequency phenomena and occur in no more
> than 64 documents. In nearly all cases you have enough evidence to correct
> a word or its annotation if you can see the word in the middle of a text
> string that  includes
>
>
>    1. the spelling
>    2. the lemma
>    3. the part of speech tag
>    4. a standard spelling (e.g. 'loue' for 'love')
>    5. up to seven previous words
>    6. up to seven next words
>    7. the spelling and POS tag of the previous
>    8. the lemma and POS tag of the next word
>    9. the Xpath of the current work
>
>
> My goal is to involve users of the corpus in identifying and correcting
> corrupt readings. I call this a "philological shopping cart" since the
> offering of a correction can be thought of as a sale. Instead of buying
> something, with the machine registering the who, what, when, and where of
> the purchase, I offer an emendation, with the machine registering the who,
> what, when, and where of my emendation.
>
> My hunch is that it would not be particularly difficult to build such a
> philological shopping cart and that in terms of scale it would not be a big
> thing.
>
> I am trying to mirror that "shopping cart" on my Mac.  There are about 60
> million word occurrences that occur in no more than 64 texts. The basic
> table has the columns described above, and half a dozen other columns for
> data entry and various counts.  There are some helper tables. The most
> important of them is a simple case-insensitive list of spellings with their
> document frequencies.  This is very useful for finding suspect spellings
> with queries like "show me all spellings in a low frequency range that
> contain 'tb' and look for words where replacing 'tb' with 'th' will find a
> word with a higher document frequency. That picks up spellings like 'tbe',
> 'tbat', 'autboritie', etc.
>
> I've worked with KWIC tables of this kind for several years.  I have Aqua
> Data Studio as a front end for Postgres, currently version 17, running on a
> five-year old Mac with an Intel processor and 32 GB of memory.  I know a
> lot less about the innards of a SQL database than I should.
>
> My largest kwic table has about 15 million rows with dozen columns for
> each row. Except for the left and right context, the columns consists of
> single words or numbers. The left and right context columns rarely add up
> to more than 35 characters each..  I have used plain indexes for some
> columns, with commands like "Create index on kwics16(keyword)", where
> 'kwics16' is the table name. My typical routine takes  a single-user
> interactive form: ask a query, wait for the results (typically seconds,
> sometimes a minute or more), and do something with the results.  I know
> next to nothing about the size of the database or tables, and it's not
> something Ihave needed to worry about. There are occasional memory bottle
> necks, because Aqua Data Studio isn't particularly good at release memory
> once it's no longer used. Closing and reopening the client fixes that.
>
> It takes an hour or so to upload a table of this kind into the database.
> Several tables of that size exist on my database and don't cause any
> trouble. I don't know at what point I would be running into constraints of
> an aging Mac with 32 GB of memory and a 2 TB hard drive.
>
> I could comfortably live with what I'm doing now, dividing the data into
> three or four frequency ranges.
>
> Given this information, should I try and create a single table or am I
> likely to run into serious constraints if I move beyond my current maximum
> table size of 15 million records.
>
> Perhaps there is no clear answer, and I should just experiment.  But if
> any reader with more knowledge of Postgres thinks that in my environment I
> would be skating on thin ice if I move beyond current limits, I'd be
> grateful to be told so.
>
>
>
>
>
>
>
>
>
>
> Martin Mueller
> Professor emeritus of English and Classics
> Northwestern University
> *From: *Adrian Klaver <adrian.klaver@aklaver.com>
> *Date: *Tuesday, May 19, 2026 at 09:45
> *To: *Martin Mueller <martinmueller@northwestern.edu>;
> pgsql-general@postgresql.org <pgsql-general@postgresql.org>
> *Subject: *Re: scaling up from t1n to 60 million records
>
> On 5/19/26 7:27 AM, Martin Mueller wrote:
> > I use Postgres with a GUI frontend (Aquafold) as a very large
> > spreadsheet on steroids that analyzes rare or defective spellings in a
> > corpus of 65,000 texts and1.5 billion words.  I typically extract  data
> > from the corpus with python scripts, turn them into tables and load them
> > into the database.
> >
> >
> > On my Mac with 32 GB of memory performance is OK with queries that
> > typically within seconds extract data rows from tables  with up to ten
> > million rows.  If the result set is large, I suspect that most of time
> > machine's time is spent displaying result sets. I have used indexing
> > sparingly. While it helps, the time savings often don't matter much.
>
> This is going to need more information:
>
> 1) Postgres version.
>
> 2) The table schema including indexes.
>
> 3) An example of the query.
>
> 4) Where you are measuring the time.
>
> 5) The client you are displaying the results in.
>
> >
> >
> > I am thinking about scaling up to table with about 60 million rows.  Are
> > there things to do or watch out for? Or should I proceed on the
> > assumption that that 60 million records are within scope and that the
> > added timecost is roughly linear?
> >
> > Martin Mueller
> >
> > Professor emeritus of English and Classics
> >
> > Northwestern University
> >
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>


-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-05-19 20:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-19 14:27 scaling up from t1n to 60 million records Martin Mueller <martinmueller@northwestern.edu>
2026-05-19 14:32 ` Jan Karremans <karremans.ja@gmail.com>
2026-05-19 14:41 ` Ron Johnson <ronljohnsonjr@gmail.com>
2026-05-19 14:44 ` Adrian Klaver <adrian.klaver@aklaver.com>
2026-05-19 18:52   ` Martin Mueller <martinmueller@northwestern.edu>
2026-05-19 20:18     ` Ron Johnson <ronljohnsonjr@gmail.com>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox