Return-Path: pg_adm@postgres.berkeley.edu
Received: by postgres.Berkeley.EDU (5.61/1.29)
	id AA27690; Fri, 19 Jun 92 12:44:13 -0700
Date: Fri, 19 Jun 92 12:44:13 -0700
Message-Id: <9206191944.AA27690@postgres.Berkeley.EDU>
From: musman@radar.nrl.navy.mil (Scott Musman)
Subject: Re: internal storage???
To: postgres@postgres.berkeley.edu
Sender: pg_adm@postgres.berkeley.edu


	Thanks for the informative reply.. Some of the problem was
caused by operator error on my part, but some of what is happening is
still unclear. 

create sig_amp (
       sig_id = int4,
       samp_number = int2,
       proc_code = int2,
       real_amp = bytea[],
       im_amp = bytea[]
       )

>One thing is the way you have this declared uses bytea[] a variable length
>array of byte arrays.  Do you really want the 2-dimensions? Could you get by
>with just bytea?  The extra variable length dimension costs you 4 bytes per
>row.

	Quite right! This is most certainly an error on my part. The
files that I'm loading into postgres via "copy" are:

id<tab>samp1<tab>code_id<tab>{v1,...vn}<tab>{v1,...vn}<new-line>
id<tab>samp2<tab>code_id<tab>{v1,...vn}<tab>{v1,...vn}<new-line>
...
...

	Each part of the signal is a 1-d vector. The 2nd dimension had
to be a large part of my problem. I definately did not want to define
the above data as a 2d array. 

>>       Anyway... The purpose of this message is that after storing
>> some 6,000 signal entries into a database I happened to notice that
>> the disk storage being eaten up by this storage format is over 100 Mb.
>
>wow! I don't think this is right. If we do the math:
>
>        100 Mb / 6000 tuples = 17476 bytes per tuple
>
>which in Postgres is impossible.  If it were 12000 tuples then we near
>the realm of possibility.  Could it be that you populated then deleted
>and repopulated (possibly several times)?

	I made sure that I did a "destroy sig_amp" before reloading
any part of the database, so I'm pretty sure that that was not my
problem!! I also vacuumed the db several times to see if that was the
problem. I'm assuming that loading files of the above format into a db
should duplicate the large amount of storage that gets eaten up if you 
try doing something stupid like I did by saying it's a "bytea[]". 

	I guess also that I don't understand what "bytea" means?
Retrieving real_amp[1] always seems to return "{"?? I was stupidly
under the impression that a byte array would store (at the very least)
byte numbers between 0-127.  
	The signal datum are all 8-bit numbers. Seeing as I have so
many of the durn things to store I figured that compacting the data in
the database would be a good thing. After having problems with bytea I
tried using an array of int2 (int2[]). Unfortunately reading from the
disk files using the "copy" command only gets 0's instead of the
actual numbers. Is their a bug fix for this that I'm missing?? I
haven't checked too recently to see if any new bugs in 3.1 have been
fixed?? 
	Yesterday I changed the int2[] into int4[]. Now it works!! It
reads the disk files, and has reduced the size of the db from over
100Mb down to about 40Mb. This size seems to jive better with your
calculations, but internally still involves using 4bytes to store
information that could be easily stored with only one byte. 
	Unfortunately, when using the int4[] format I seem to be able
to retrieve info from the db from the monitor program, but my programatic
interface to the db (which uses libpq calls) keeps bombing with the
following error: 

Error: Invalid argument to pg_alloc().
NOTICE:Jun 19 12:53:05:PortalHeapMemoryFree: 0x12b630 not in alloc set!

	Maybe I don't have enough shared memory or something??? Does
this error look familiar to anyone??

	Right now, I've changed the format for storing these vectors
again.. Now I'm storing the information as "text", which just stores
the ascii representation of the information. The above error has
dissapeared (without changing any code), and the db size is now down to
around 20Mb. Retrieval of the text is faster (I haven't benchmarked
it, but can see it when my progam is operating) than for the int4[]
version. As my program which uses the data from the db has to convert
the retrieved portal values into an array of ints whether I internally
use int4 or text anyway, I'm not sure that I've lost too much (unless
of course if I ever have to try and access "signal_amp[23]" or
something from a db query command).... 
	Anyone have any suggestions for a better compromise, in
trading off storage vs access flexibility??

		-- Scott
