Return-Path: postarch
Received: by postgres.Berkeley.EDU (5.61/1.29)
	id AA26213; Tue, 10 Dec 91 10:29:08 -0800
Message-Id: <9112101829.AA26213@postgres.Berkeley.EDU>
From: postarch (Postgres Mailing Archive)
Subject: Re: Lock problems?
To: postgres@postgres.berkeley.edu
Sender: pg_adm@postgres.berkeley.edu
Reply-To: mer@postgres.berkeley.edu
In-Reply-To: Your message of "Sun, 01 Dec 91 14:49:34 PST."
             <9112012348.AA02627@dpi.inpe.br> 
Date: Tue, 10 Dec 91 10:25:12 PST

you write:
> Hi,
> 
> I have two postgres users running copies of a libpq aplication, so that
> they access the same database classes. Sometimes core is dumped, while trying
> to complete a transaction, after messages from the backend like this:

...stuff deleted...

 The transaction in question is:

  begin
  retrieve portal projects (Project.label) where Project.label = "NONE"
  fetch all in projects
  close projects
  append Project (label = "NONE" ,x1 = "4.390000e+04" ,y1  = "4.190000e+04" ,x2
   = "6.410000e+04" ,y2  = "6.210000e+04" ,projection  = 1)
  end

>         There is yet another message from the backend:
> -----------------------------------------------------------------------------
> NOTICE:Nov 29 14:56:32:I have been signalled by the postmaster.
> NOTICE:Nov 29 14:56:32:Some backend process has died unexpectedly and possibly
> NOTICE:Nov 29 14:56:32:corrupted shared memory.  The current transaction was
> NOTICE:Nov 29 14:56:32:aborted, and I am going to exit.  Please resend the
> NOTICE:Nov 29 14:56:32:last query. -- The postgres backend
> Error: No response from the backend, exiting...
> -----------------------------------------------------------------------------
> 
>Those situations seems to occur while two libpq applications are sharing tables
>in the same database. It seems like something related to the lock system..
> 
> Could someone point me any suggestion on how to manage this problem?

I have done a lot of experimenting with this, trying to pin down just what the
problem is.  It looks like there is a multi-user bug in postgres relating to
the use of portals.  I ran a bunch of tests using the exact sequence of
queries given above and was able to cause the backend to crash occasionally
with the above notice sent the other backend (This is on postgres 3.1). I
was, however, never able to make the backend crash if i used unique portal
names in each backend.  This appears to be a possible work around.

I don't believe there are any problems with the lock manager.  The above
xact is bound to cause deadlock.  The reason is that all locks on user
classes are two-phase and only released when the xact commits.  In the
retrieve you acquire a read lock on project, then later you try to upgrade
your read lock to an exclusive write lock with the append.  When two
backends are executing simultaneously this might very well (and always does
in my tests) cause each to wait for the other to release the initial read
lock before it can go ahead with the append.

	backend 1			backend 2

	retrieve - read lock	
					retrieve - read lock
	...				...
	append - write lock blocks
					append - write lock blocks

	deadlock

Postgres then waits a specified amount of time and kills the xact in
backend 1.  When the xact in backend 1 aborts the read lock it holds on
project is released and this allows backend 2 to acquire the write lock
it wants.  Backend 2 then successfully commits its append.

To avoid this situation it is best to the append first if you can.  This
way one of the backends will get exclusive access to project and deadlock
will be avoided.


Jeff Meredith
mer@postgres.berkeley.edu
