Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPQtS-000jIL-00 for pgsql-general@arkaria.postgresql.org; Tue, 19 May 2026 20:18:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wPQtP-0051UN-26 for pgsql-general@arkaria.postgresql.org; Tue, 19 May 2026 20:18:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wPQtP-0051UE-0E for pgsql-general@lists.postgresql.org; Tue, 19 May 2026 20:18:20 +0000 Received: from mail-ot1-x331.google.com ([2607:f8b0:4864:20::331]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wPQtM-00000000NbL-1hf3 for pgsql-general@postgresql.org; Tue, 19 May 2026 20:18:18 +0000 Received: by mail-ot1-x331.google.com with SMTP id 46e09a7af769-7dcd9061b1aso3720949a34.2 for ; Tue, 19 May 2026 13:18:16 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1779221896; cv=none; d=google.com; s=arc-20240605; b=RRp/HpMzC7x0QdL0AQ1dMxCUrGfjvMnESOg2QgivPv7bjGyq5vofaoihW62psa46on FaNUy34nUjogskOD+cLkEIHlXDXlbE4X423ruMoyTj7bgPV/k0AF9rMTdAm1bXGG2190 IVe+6HJlC38aKESP9MJN4pNotsolAhpe976tZiJ7bOoykjWOS0FCC+G801CBRUzXcckM Z8hfdzevbNTDuvgs6kZ+TXwTOqinaAyWLvT35o4c9Ir7X3ViHyhuI+AbcSbPoAuEy3TV PGRm5YVUGZPZp3Sds1VwMRS/b2TKxmW2iyDXwfciWEm4LJikstOLRseqdhn5Oyyz18GZ abog== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=Hl9F1lVGhD8L31YUWXtXBjFA5aAcZaZgaeBnmQkA2cA=; fh=KNSq+t9BltSXFnT3Yof/aKGBtqxeA+bTALiYdvTslaY=; b=etU3Rev9XW0Hb19ziW3CH7/7MyKq5+F6mLmBi6fz550WyYaZ7hcGqnXsxreVT/qwzs kIzrIK7SRhpza277QPbKGA++69KJb9G6mHhyyhsM2aTnQObdw/vcWGxnMhFPSSiweeCi nSpnJMSJV62hp/FOYVom24uTph520ObRf4LsRQajAZM0ezclb9ZnY3tO9/ddjBoM5MCt RdXUum41cJgsyv6PWhBlTOCSdak+XTtIWVL0McgsuzQNBwAOq3vO8G6Lb6R5ZsHfl8Q/ +XU6omHfikNeY89sPDSE5vyyLCnfkjBsDJGNSnFjoL//OxK1Cn34djOHakRd4S3kMvj6 L60g==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1779221896; x=1779826696; darn=postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=Hl9F1lVGhD8L31YUWXtXBjFA5aAcZaZgaeBnmQkA2cA=; b=lh28K4/9J1y4hdpU5fem2ZxyHchQbTpASzZl4wacgclMBJI7imVtKhRdRr7YFGdKu/ l3Wctd+JattA8CGbWVfSBaHpmWUU4z2R7yi3O1oOxW4lsJfH4Qa3+Psr4o+l3d44vnuE YrEzYCZBQBjbeA0GSrTff+yaiAbd1IUrr0Tx98Ll+XkLDwof4RIhRsTtG/YtMn3nYHJD OZzdUmofDcJc+u/d42w0GyFz2e92uBVCFPPVkFe5GGDCVWOZ0qSulNcCqAYA7+mym5J/ ZO4NjXyu1t2oFOf+GlObta1BROW454BZiBnMjZiDn1IYeeALtPIwpUjAB3yWZ2hV8Whb bpYw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1779221896; x=1779826696; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=Hl9F1lVGhD8L31YUWXtXBjFA5aAcZaZgaeBnmQkA2cA=; b=XHwa1vDDb8Ey/0ARVax6ao72clMUce+157GKQtovnXBK6ux3E8w66e2KYXz32JKwYD /4IJTfXduKKiaX2C3xoecEn8q3+G8ka1YaBHW1pE8f38b9Chvcw7qNKpPUPURIEP9XaT YM2W4Tmjm58n/SEmaIpkUcGDNuWLYKEmo7j18IV/THXF0tEZGrFHeTfwSLdz0Fcf5AcW Hh/9yEVpNjHOQTNoxmW71wxDepegexrA1/B0YZOjnHYNNjV8mo+w5S9YqY7bS1Q8X/ba Hz5H4CM/SFNa6at3OWe1eDh4PyBAsbaJqh+tEuVX2sBfiF3aNHg7pH/rFLHdYOjGzBxf EtrA== X-Gm-Message-State: AOJu0Yzaq8Db8JDYG031uYbZ5TR60w4lfMNwSCDCt7sBIuHeOqDiciAG QvcQmhYvxNYFh1+IGt/HiVjXOuRPnaTKFI+2nIjORsq11adVomF+eIsdXgblSL7Rq24s+B95+K+ 6gcmoL3tcS0mmMJIVPsqd5FmZwx1H396Em8XR X-Gm-Gg: Acq92OGSvKMNEVCMknkMP3dFqV92U7KAyap1P/8IQ7G7ekP7AdbFy5HQrL0nQz0vXZ2 8NQfDKSQpluNy8jRJoRT5QfiEH2TtZAPlqjwofTzHesjLyBE1ILlLvtbz86hVfAzeMHzOQAoFZC uKEVKnWeveDkRb3XwXtG8D7IMjKJJG2xjXcg7yJ19hy/3zRwTmNtWMvfJNu8aSnDC3m3RQU+eXT d3k3ztjvXKoZreyW+2uq9we92JDOeFQvoHotLIfTRk5tRD8/n6kXKppZ1YLvRkg6vOJ64M2gwi8 KRN95Zb0 X-Received: by 2002:a05:6830:d01:b0:7d7:ed69:81b2 with SMTP id 46e09a7af769-7e4ea071959mr14191205a34.5.1779221896079; Tue, 19 May 2026 13:18:16 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Ron Johnson Date: Tue, 19 May 2026 16:18:04 -0400 X-Gm-Features: AVHnY4KxrSboniEOEdS777UNfwcTlLFml8ZBFqtOYfkCzeGDrguPhns8tJAWI28 Message-ID: Subject: Re: scaling up from t1n to 60 million records To: pgsql-general Content-Type: multipart/alternative; boundary="00000000000017ad1706523161fc" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000017ad1706523161fc Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Indices are your friend! (Except when loading data.) Add them on any relevant column. https://www.postgresql.org/docs/16/sql-createindex.html The "(expression)" clause might be useful in your situation, since it can exclude some words from an index, exclude empty cells, index upper-case versions of the word, etc. On Tue, May 19, 2026 at 2:53=E2=80=AFPM Martin Mueller < martinmueller@northwestern.edu> wrote: > Here is a more detailed version. > > I work on the curation of a corpus of some 65,000 Early Modern texts wit= h > 1.5 billion words. They exist as TEI-XML files and each word is wrapped > in a element. Here are the first and last two words in the corpus > > HEre > > > begynneth > > ... > > mercy > > . > > > The corpus has many corrupt spellings and errors in the linguistic > annotation. Most of them are low-frequency phenomena and occur in no more > than 64 documents. In nearly all cases you have enough evidence to correc= t > a word or its annotation if you can see the word in the middle of a text > string that includes > > > 1. the spelling > 2. the lemma > 3. the part of speech tag > 4. a standard spelling (e.g. 'loue' for 'love') > 5. up to seven previous words > 6. up to seven next words > 7. the spelling and POS tag of the previous > 8. the lemma and POS tag of the next word > 9. the Xpath of the current work > > > My goal is to involve users of the corpus in identifying and correcting > corrupt readings. I call this a "philological shopping cart" since the > offering of a correction can be thought of as a sale. Instead of buying > something, with the machine registering the who, what, when, and where of > the purchase, I offer an emendation, with the machine registering the who= , > what, when, and where of my emendation. > > My hunch is that it would not be particularly difficult to build such a > philological shopping cart and that in terms of scale it would not be a b= ig > thing. > > I am trying to mirror that "shopping cart" on my Mac. There are about 60 > million word occurrences that occur in no more than 64 texts. The basic > table has the columns described above, and half a dozen other columns for > data entry and various counts. There are some helper tables. The most > important of them is a simple case-insensitive list of spellings with the= ir > document frequencies. This is very useful for finding suspect spellings > with queries like "show me all spellings in a low frequency range that > contain 'tb' and look for words where replacing 'tb' with 'th' will find = a > word with a higher document frequency. That picks up spellings like 'tbe'= , > 'tbat', 'autboritie', etc. > > I've worked with KWIC tables of this kind for several years. I have Aqua > Data Studio as a front end for Postgres, currently version 17, running on= a > five-year old Mac with an Intel processor and 32 GB of memory. I know a > lot less about the innards of a SQL database than I should. > > My largest kwic table has about 15 million rows with dozen columns for > each row. Except for the left and right context, the columns consists of > single words or numbers. The left and right context columns rarely add up > to more than 35 characters each.. I have used plain indexes for some > columns, with commands like "Create index on kwics16(keyword)", where > 'kwics16' is the table name. My typical routine takes a single-user > interactive form: ask a query, wait for the results (typically seconds, > sometimes a minute or more), and do something with the results. I know > next to nothing about the size of the database or tables, and it's not > something Ihave needed to worry about. There are occasional memory bottle > necks, because Aqua Data Studio isn't particularly good at release memory > once it's no longer used. Closing and reopening the client fixes that. > > It takes an hour or so to upload a table of this kind into the database. > Several tables of that size exist on my database and don't cause any > trouble. I don't know at what point I would be running into constraints o= f > an aging Mac with 32 GB of memory and a 2 TB hard drive. > > I could comfortably live with what I'm doing now, dividing the data into > three or four frequency ranges. > > Given this information, should I try and create a single table or am I > likely to run into serious constraints if I move beyond my current maximu= m > table size of 15 million records. > > Perhaps there is no clear answer, and I should just experiment. But if > any reader with more knowledge of Postgres thinks that in my environment = I > would be skating on thin ice if I move beyond current limits, I'd be > grateful to be told so. > > > > > > > > > > > Martin Mueller > Professor emeritus of English and Classics > Northwestern University > *From: *Adrian Klaver > *Date: *Tuesday, May 19, 2026 at 09:45 > *To: *Martin Mueller ; > pgsql-general@postgresql.org > *Subject: *Re: scaling up from t1n to 60 million records > > On 5/19/26 7:27 AM, Martin Mueller wrote: > > I use Postgres with a GUI frontend (Aquafold) as a very large > > spreadsheet on steroids that analyzes rare or defective spellings in a > > corpus of 65,000 texts and1.5 billion words. I typically extract data > > from the corpus with python scripts, turn them into tables and load the= m > > into the database. > > > > > > On my Mac with 32 GB of memory performance is OK with queries that > > typically within seconds extract data rows from tables with up to ten > > million rows. If the result set is large, I suspect that most of time > > machine's time is spent displaying result sets. I have used indexing > > sparingly. While it helps, the time savings often don't matter much. > > This is going to need more information: > > 1) Postgres version. > > 2) The table schema including indexes. > > 3) An example of the query. > > 4) Where you are measuring the time. > > 5) The client you are displaying the results in. > > > > > > > I am thinking about scaling up to table with about 60 million rows. Ar= e > > there things to do or watch out for? Or should I proceed on the > > assumption that that 60 million records are within scope and that the > > added timecost is roughly linear? > > > > Martin Mueller > > > > Professor emeritus of English and Classics > > > > Northwestern University > > > > > -- > Adrian Klaver > adrian.klaver@aklaver.com > --=20 Death to , and butter sauce. Don't boil me, I'm still alive. lobster! --00000000000017ad1706523161fc Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Indices are your friend!=C2=A0 (Except wh= en loading data.)

Add them on any rele= vant column.

The "(expression)" clause might be usef= ul in your situation, since it can exclude some words from an index, exclud= e empty cells, index upper-case versions of the word, etc.


On Tue, May 19, 2026 at 2:53=E2=80=AFPM Martin Mueller = <martinmueller@northwe= stern.edu> wrote:
Here is =C2=A0a more detailed version.

I work on the curation of a corpus of some 65,000 Early Modern =C2=A0texts = with 1.5 =C2=A0billion words.=C2=A0 They exist as TEI-XML files and each wo= rd is wrapped in a <w> element.=C2=A0 Here are the first and last two= words in the corpus

=E2=80=82=E2=80=82=E2= =80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80= =82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82<w=C2=A0lemma=3D"here"=C2=A0pos=3D"av&= quot;=C2=A0xml:id=3D"a73abc-001-b-0010">HEre</w><= /p>

=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82 <w=C2=A0lemma=3D"b= egin"=C2=A0pos=3D"vvz"=C2=A0reg=3D"beginneth"=C2= =A0xml:id=3D"a73abc-001-b-0020">begynneth</w>

...

=C2=A0=E2=80=82=E2=80=82=E2=80=82=E2=80=82=E2=80=82<w=C2=A0lemma=3D&q= uot;mercy"=C2=A0pos=3D"n1"=C2=A0xml:id=3D"e20ady-0008-3= 120">mercy</w>

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 <pc=C2=A0unit=3D"sentence&quo= t;=C2=A0xml:id=3D"e20ady-0008-3130">.</pc>



The corpus has many corrupt spellings and errors in the linguistic annotati= on. Most of them are low-frequency phenomena and occur in no more than 64 d= ocuments. In nearly all cases you have enough evidence to correct a word or= its annotation if you can see the word in the middle of a text string that =C2=A0includes

  1. the spelling
  2. the lemma
  3. the part of speech tag
  4. a standard spelling (e.g= . 'loue' for 'love')
  5. up to seven previous wor= ds
  6. up to seven next words
  7. the spelling and POS tag= of the previous
  8. the lemma and POS tag of= the next word
  9. the Xpath of the current= work

My goal is to involve users of the corpus in identifying and correcting cor= rupt readings. I call this a "philological shopping cart" since t= he offering of a correction can be thought of as a sale. Instead of buying = something, with the machine registering the who, what, when, and where of the purchase, I offer an emendation, with th= e machine registering the who, what, when, and where of my emendation.

My hunch is that it would not be particularly difficult to build such a phi= lological shopping cart and that in terms of scale it would not be a big th= ing.=C2=A0

I am trying to mirror that "shopping cart" on my Mac.=C2=A0 There= are about 60 million word occurrences that occur in no more than 64 texts.= The basic table has the columns described above, and half a dozen other co= lumns for data entry and various counts.=C2=A0 There are some helper tables. The most important of them is a simple case-insens= itive list of spellings with their document frequencies.=C2=A0 This is very= useful for finding suspect spellings with queries like "show me all s= pellings in a low frequency range that contain 'tb' and look for words where replacing 'tb' with 'th&= #39; will find a word with a higher document frequency. That picks up spell= ings like 'tbe', 'tbat', 'autboritie', etc.

I've worked with KWIC tables of this kind for several years.=C2=A0 I ha= ve Aqua Data Studio as a front end for Postgres, currently version 17, runn= ing on a five-year old Mac with an Intel processor and 32 GB of memory.=C2= =A0 I know a lot less about the innards of a SQL database than I should.=C2=A0

My largest kwic table has about 15 million rows with dozen columns for each= row. Except for the left and right context, the columns consists of single= words or numbers. The left and right context columns rarely add up to more= than 35 characters each..=C2=A0 I have used plain indexes for some columns, with commands like "Create index= on kwics16(keyword)", where 'kwics16' is the table name. My t= ypical routine takes =C2=A0a single-user interactive form: ask a query, wai= t for the results (typically seconds, sometimes a minute or more), and do something with the results.=C2=A0 I know next to nothing = about the size of the database or tables, and it's not something Ihave = needed to worry about. There are occasional memory bottle necks, because Aq= ua Data Studio isn't particularly good at release memory once it's no longer used. Closing and reopening the cli= ent fixes that.=C2=A0

It takes an hour or so to upload a table of this kind into the database.=C2= =A0 Several tables of that size exist on my database and don't cause an= y trouble. I don't know at what point I would be running into constrain= ts of an aging Mac with 32 GB of memory and a 2 TB hard drive.=C2=A0

I could comfortably live with what I'm doing now, dividing the data int= o three or four frequency ranges.=C2=A0

Given this information, should I try and create a single table or am I like= ly to run into serious constraints if I move beyond my current maximum tabl= e size of 15 million records.

Perhaps there is no clear answer, and I should just experiment.=C2=A0 But i= f any reader with more knowledge of Postgres thinks that in my environment = I would be skating on thin ice if I move beyond current limits, I'd be = grateful to be told so.=C2=A0










Martin Mueller
Professor emeritus of Eng= lish and Classics
Northwestern University
From: Adrian Klaver <adrian.klaver@aklaver.com>
Date: Tuesday, May 19, 2026 at 09:45
To: Martin Mueller <martinmueller@northwestern.edu>; pgsql-general@postgre= sql.org <pgsql-general@postgresql.org>
Subject: Re: scaling up from t1n to 60 million records

On 5/19/26 7:27 AM, Martin Mueller wrote:
> I use Postgres with a GUI frontend (Aquafold) as a very large
> spreadsheet on steroids that analyzes rare or defective spellings in a=
> corpus of 65,000 texts and1.5 billion words.=C2=A0 I typically extract= =C2=A0data
> from the corpus with python scripts, turn them into tables and load th= em
> into the database.
>
>
> On my Mac with 32 GB of memory performance is OK with queries that
> typically within seconds extract data rows from tables =C2=A0with up t= o ten
> million rows.=C2=A0 If the result set is large, I suspect that most of= time
> machine's time is spent displaying result sets. I have used indexi= ng
> sparingly. While it helps, the time savings often don't matter muc= h.

This is going to need more information:

1) Postgres version.

2) The table schema including indexes.

3) An example of the query.

4) Where you are measuring the time.

5) The client you are displaying the results in.

>
>
> I am thinking about scaling up to table with about 60 million rows.=C2= =A0 Are
> there things to do or watch out for? Or should I proceed on the
> assumption that that 60 million records are within scope and that the<= br> > added timecost is roughly linear?
>
> Martin Mueller
>
> Professor emeritus of English and Classics
>
> Northwestern University
>


--
Adrian Klaver
adrian.klave= r@aklaver.com


--
Death to <Redacted>, and butter sauce.Don't boil me, I'm still alive.
<Redacted> lobs= ter!
--00000000000017ad1706523161fc--