Return-Path: postarch
Received: by postgres.Berkeley.EDU (5.61/1.29)
	id AA20738; Mon, 6 Apr 92 14:51:41 -0700
Message-Id: <9204062151.AA20738@postgres.Berkeley.EDU>
From: postarch (Postgres Mailing Archive)
Subject: Re: class difference
To: postgres@postgres.berkeley.edu
Sender: pg_adm@postgres.berkeley.edu
Reply-To: joey@postgres.berkeley.edu
In-Reply-To: Your message of "Mon, 06 Apr 92 11:18:02 PDT."
             <9204061818.AA17173@postgres.Berkeley.EDU> 
Date: Mon, 06 Apr 92 14:51:33 PDT

you write:
> Actually, I needed something like the SQL minus operation.
> I finally implemented it by creating a temporary class, and then
> deleting the unwanted instances.
> 
> E.g. given
> 
> Class A
> x  y
> 1  10
> 2  20
> 3  30
> 
> Class B
> x  y
> 1  10
> 3  30
> 
> I wanted A - B = ({2,20}), so i did
> 
> retrieve into t unique (A.x, A.y) where A.x != B.x
> delete t where t.x = B.x
> (leaving the tuples i wanted in t)
> 
While this works for your example, it is not correct in general.
Your first retrieve command will retrieve all unique tuples in A
such that there exists a tuple in B with a different x value.  This
will be most of the tuples in A (in your example, it's *all* of the tuples
in A).  So your first line doesn't do much more than make a copy of A.
This is not incorrect, but the qualification "where A.x != B.x" isn't helping
out very much.

Then your second delete command deletes *too many* tuples in general,
i.e. it deletes all the tuples in t for which there is a tuple in B with
*a matching x value.*  If (1,10) was not in B, but (1,20) was, you would still
get the final answer {(2,20)}, rather than the correct answer 
{(1,10), (2,20)}.

One reasonable correct scheme (see note on duplicates below for caveat)
would be to make a copy of A (i.e. "retrieve into tmp unique (A.all)")
and then remove from tmp all those tuples for which there is a tuple
in B that matches on *ALL* attributes.  For your example,
"delete tmp where tmp.x = B.x and tmp.y = B.y"
is the correct syntax.

> Is there a better way to do it?
One optimization is only to put the key columns of A into the where clause
of the delete command (if you know what the key columns are).
For instance, if you knew that y was a key of A, then you'd
get the correct result by saying
"delete tmp where tmp.y = B.y"

Otherwise, to express this in one command without a set difference
operator you need subqueries.
(e.g. in many versions of SQL you could say
SELECT DISTINCT A.* FROM A
WHERE NOT EXISTS (SELECT B.x1 FROM B
                   WHERE B.x1 = A.x1
                     AND B.x2 = A.x2
                     ...
                     AND B.xn = A.xn);
)

Postquel does not support subqueries, and there is no conversion of
the NOT EXISTS construct to join.  This is unfortunate, since your
hand-constructed "t" table is likely to be less efficient than an
internal scheme for processing the query.  That's a historical
decision that was made by the designers of Quel.  Note that
efficiently handling "correlated subqueries" like the SQL statment
above requires a lot of extra complexity, both in the language design and
the database code.

Note also that the semantics for the number of duplicates in the
output of set-difference are kind of tricky.  I believe SQL2 has two 
commands, "EXCEPT ALL" and "EXCEPT DISTINCT" (which is the default if you
just say "EXCEPT".)  The former does multi-set difference (i.e. the number 
of copies of a value in the output is equal to the max of 0 and the 
difference between the number of copies in the left and right inputs), 
while the latter removes duplicates from each input and then does set 
difference directly.  This can be rather confusing if you're not careful.  
The scheme we've discussed above does the latter semantics, which may or 
may not be what you want.  Doing the former semantics in Postquel sounds 
pretty yucky.

Joe Hellerstein

