11..  TThhee PPOOSSTTGGRREESS AAcccceessss MMeetthhooddss

     This  section  describes the POSTGRES access methods in
detail.  The major concepts covered here are

  o  the relation descriptor and its contents,

  o  the differences between heap and index relations,

  o  scan keys, scan descriptors, and the scan interface,
     and

  o  the POSTGRES access method interface routines.

11..11..  TThhee RReellaattiioonn DDeessccrriippttoorr

     The  relation  descriptor, or _r_e_l_d_e_s_c, is the in-memory
data structure that describes an open relation.  The reldesc
is  an  argument  to  most of the procedures that operate on
relations.  Relations may be opened by name or  by  relation
ID  --  either  of these uniquely identifies the relation to
which it corresponds.  A relation's ID,  or  _r_e_l_i_d,  is  the
object ID of the tuple in ppgg__ccllaassss that describes it.

     The structure that stores a reldesc is declared in the
source file utils/rel.h.  The structure definition is

     typedef struct RelationData {
          File                     rd_fd;
          int                      rd_nblocks;
          uint16                   rd_refcnt;
          bool                     rd_ismem;
          bool                     rd_isnailed;
          AccessMethodTupleForm    rd_am;
          RelationTupleForm        rd_rel;
          ObjectId                 rd_id;
          Pointer                  lockInfo;
          TupleDescriptorData      rd_att;
          /* VARIABLE LENGTH ARRAY AT END OF STRUCT */
     } RelationData;

     typedef RelationData          *Relation;


     The meaning of the rrdd__ffdd entry depends on  the  storage
manager  that  stores  the  relation.  For the magnetic disk
storage manager, this is the POSTGRES virtual file  descrip-
tor  (VFD)  for  the open file that stores the relation.  On
the Sony jukebox, this entry is meaningless, and  is  simply
set  to a non-negative value if the jukebox relation is suc-
cessfully opened.


                              11


     The rrdd__nnbblloocckkss entry is the number  of  blocks  in  the
relation,  but  this  number is not reliable.  This field is
used by the executor during scans of heap relations to avoid
scanning past the end of the relation.  For index relations,
and during scans of heap relations, the  value  stored  here
will  be wrong.  It should not be used outside of the execu-
tor.

     The entry rrdd__rreeffccoouunntt reflects the number of  currently
active references to this reldesc inside the backend.  Every
backend manages a private  cache  of  relation  descriptors.
When  the  cache is full, reldescs with rrdd__rreeffccoouunntt equal to
zero may be evicted to make room for new descriptors.  If  a
single  relation  is  opened  by  a single backend more than
once, every open request  returns  a  pointer  to  the  same
reldesc,  and  rrdd__rreeffccoouunntt  is  incremented to keep track of
each reference.  When the relation is closed, the  reference
count is decremented.

     The  structure  entry rrdd__iissmmeemm is unused in the current
system.  It is intended to support in-memory-only relations,
changes  for  which need not be flushed to stable storage at
transaction boundaries.

     The private cache of relation descriptors contains sev-
eral  reldescs that cannot be evicted, in order to guarantee
that the cache continues to work.  For example, in order  to
instantiate  the  relation  descriptor  for  a user relation
uusseerr__rreellnn,  POSTGRES  must  open  and  scan   ppgg__ccllaassss   and
ppgg__aattttrriibbuuttee  for  its  relation and attribute data.  If the
reldesc for ppgg__ccllaassss is not in the cache, then  no  relation
(including  ppgg__ccllaassss) can ever be instantiated in the cache.
The reldescs for ppgg__ccllaassss,  ppgg__ttyyppee,  and  ppgg__aattttrriibbuuttee  all
fall  into this category, and are initialized specially when
the system starts up.

     To keep track of which relations may not be evicted
from the private reldesc cache, every reldesc contains an
rd_isnailed entry.  If this entry is set to true, then the
reldesc is nailed and may not be evicted from the cache,
even if its rd_refcnt drops to zero.

     The rrdd__aamm entry stores a pointer to the  access  method
tuple  for  the  access  method  that manages this relation.
Currently, POSTGRES supports a heap access method for  stor-
ing user and system data, and btree and rtree indexed access
methods for storing indices on data.  Support for  a  hashed
indexed  access  method  will  be  added in the near future.
Because these access methods have different implementations,
POSTGRES  must  know what actual routine to call to dispatch
general access  method  interface  calls  for  a  particular
access method.  For indexed access methods, this information


                              22


is stored in the access method tuple.  Since there  is  only
one  access  method for user and system data (the heap), the
rrdd__aamm entry is NULL for reldescs describing heap  relations.

     The  access  method tuple contents will be described in
section 1.1.1.

     The structure entry rrdd__rreell  stores  a  pointer  to  the
ppgg__ccllaassss  tuple  that describes this relation.  The ppgg__ccllaassss
tuple is copied to a safe place in memory when the  relation
is  entered  into  the  reldesc  cache.  The tuple contains,
among other things, the relation's name and the user  id  of
its owner.  The complete contents of the tuple pointed to by
rrdd__rreell are supplied in section 1.1.2.

     The entry rrdd__iidd stores the relid of the open  relation.
The  relid is the object ID of the ppgg__ccllaassss tuple describing
the relation.  This does not appear in the tuple pointed  to
by  the  rrdd__rreell  entry, because that tuple includes only the
ordinary attributes, and not the system attributes, from the
ppgg__ccllaassss  tuple.   Note  that  the ppgg__ccllaassss tuple has system
attributes,  as  it  is  stored  in  the  relation.    These
attributes  simply  are  not  copied  into  memory  when the
reldesc is initialized.

     The structure entry lloocckkIInnffoo  points  to  an  in-memory
representation  of  the  rule lock data associated with this
relation.  This entry is not used  by  the  access  methods.
The  rule  system  and  the  data  stored in a rule lock are
beyond the scope of this section.

     The final entry in the reldesc is rrdd__aatttt, a  vector  of
attribute  descriptor  data  for  the relation.  This vector
stores one entry for every user-level (that is,  non-system)
attribute that the relation stores.  This vector is initial-
ized when the reldesc is loaded into the cache.  The data in
the  vector  come from a scan of ppgg__aattttrriibbuuttee.  The contents
of the vector are described in section 1.1.3.

11..11..11..  TThhee AAcccceessss MMeetthhoodd TTuuppllee FFoorrmm ((rrdd__aamm))

     The rrdd__aamm entry in a reldesc points at an _a_c_c_e_s_s _m_e_t_h_o_d
_t_u_p_l_e  _f_o_r_m  describing  the  access method that manages the
relation1.   This  tuple  has  one attribute for each of the
routines that implement the standard access method interface
____________________
   1 In POSTGRES, the word _f_o_r_m is often used to name struc-
tures.  In  this  instance,  for  example, the clause _a_c_c_e_s_s
_m_e_t_h_o_d _t_u_p_l_e _f_o_r_m is used to name a structure that describes
an access method tuple.


                              33


for  the particular access method.  Details of the interface
routines appear in section 1.4.  Here, we list only the con-
tents  of  the  access  method tuple form.  This declaration
appears in the source file  ccaattaalloogg//ppgg__aamm..hh,  and  describes
the tuples stored in the ppgg__aamm system catalog.

     ttyyppeeddeeff ssttrruucctt AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD {{
          NNaammeeDDaattaa       aammnnaammee;;
          OObbjjeeccttIIdd       aammoowwnneerr;;
          cchhaarr           aammkkiinndd;;
          uuiinntt1166         aammssttrraatteeggiieess;;
          uuiinntt1166         aammssuuppppoorrtt;;
          RReeggPPrroocceedduurree   aammggeettttuuppllee;;
          RReeggPPrroocceedduurree   aammiinnsseerrtt;;
          RReeggPPrroocceedduurree   aammddeelleettee;;
          RReeggPPrroocceedduurree   aammggeettaattttrr;;
          RReeggPPrroocceedduurree   aammsseettlloocckk;;
          RReeggPPrroocceedduurree   aammsseettttiidd;;
          RReeggPPrroocceedduurree   aammffrreeeettuuppllee;;
          RReeggPPrroocceedduurree   aammbbeeggiinnssccaann;;
          RReeggPPrroocceedduurree   aammrreessccaann;;
          RReeggPPrroocceedduurree   aammeennddssccaann;;
          RReeggPPrroocceedduurree   aammmmaarrkkppooss;;
          RReeggPPrroocceedduurree   aammrreessttrrppooss;;
          RReeggPPrroocceedduurree   aammooppeenn;;
          RReeggPPrroocceedduurree   aammcclloossee;;
          RReeggPPrroocceedduurree   aammbbuuiilldd;;
          RReeggPPrroocceedduurree   aammccrreeaattee;;
          RReeggPPrroocceedduurree   aammddeessttrrooyy;;
     }} AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD;;

     ttyyppeeddeeff AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD     **AAcccceessssMMeetthhooddTTuupplleeFFoorrmm;;


     The  structure  entry  aammnnaammee is the name of the access
method.  In the current system, this is  one  of  _b_t_r_e_e,  or
_r_t_r_e_e.  Support for _h_a_s_h is forthcoming.

     The aammoowwnneerr entry is the object ID of the ppgg__uusseerr tuple
for the owner of  this  access  method.   For  all  existing
access methods, this is the OID for the user _p_o_s_t_g_r_e_s.  This
entry is intended to support access protection, so that, for
example, only the owner of an access method could change its
interface routines.  Such protection is not  implemented  in
the current system.

     The aammkkiinndd entry is unused in the current system.

     The  structure  entry  aammssttrraatteeggiieess  is  the  number of
strategies (operators) that can be used in searches on  this
index.


                              44


  o  For btrees, this is 5: <, <=, =, >=, and >.

  o  For rtrees, this is 8, corresponding to the operators
     for left, left-or-overlap, overlap, right-or-overlap,
     right, same, contains, and contained-by.

  o  When hash tables are supported, the number of operators
     will be 1, for equality searches.

     Operators that are supported by an access method are
listed in pg_amop.  This relation contains one group of
operators for each operator class, or opclass.  An opclass
is an access method/type pair.  For example, an opclass
int4_ops is defined for btrees, so that int4 values may be
indexed using btrees.

     The structure entry amsupport is the number of support
routines required by this access method.  For example, when
inserting keys into a btree, the btree access method
requires one support routine that compares two keys and
returns negative, zero, or positive, depending on whether
the first key is less than, equal to, or greater than the
second, respectively.  Similarly, rtrees require routines
that compute the size, intersection, and union of two
rectangles.  These are not operators that are available to
users for index searches, so they are stored separately from
the strategies.  These routines are stored in the relation
pg_amsupport.

     The rest of the AccessMethodTupleForm structure stores
the object IDs of functions that support the standard access
method interface for this access method.  Those routines are
summarized in the table below.


                              55


+------------+-----------------------------------------------------+
|Entry name  |                 Routine description                 |
+------------+-----------------------------------------------------+
|amgettuple  | Get the next tuple in a scan.                       |
+------------+-----------------------------------------------------+
|aminsert    | Insert a new tuple into the index.                  |
+------------+-----------------------------------------------------+
|amdelete    | Delete a particular tid from the index.             |
+------------+-----------------------------------------------------+
|amgetattr   | Get a particular attribute from the index tuple.    |
+------------+-----------------------------------------------------+
|amsetlock   | Unsupported.                                        |
+------------+-----------------------------------------------------+
|amsettid    | Unsupported.                                        |
+------------+-----------------------------------------------------+
|amfreetuple | Unsupported.                                        |
+------------+-----------------------------------------------------+
|ambeginscan | Start a scan with a qualification on the index key. |
+------------+-----------------------------------------------------+
|amrescan    | Reset an active scan to the beginning.              |
+------------+-----------------------------------------------------+
|amendscan   | End an active scan.                                 |
+------------+-----------------------------------------------------+
|ammarkpos   | Mark the current position in an index scan.         |
+------------+-----------------------------------------------------+
|amrestrpos  | Restore the scan to the previously-marked position. |
+------------+-----------------------------------------------------+
|amopen      | Open the index relation.                            |
+------------+-----------------------------------------------------+
|amclose     | Close an open index relation.                       |
+------------+-----------------------------------------------------+
|ambuild     | Define an index on an existing heap relation.       |
+------------+-----------------------------------------------------+
|amcreate    | Create a new, empty, index relation.                |
+------------+-----------------------------------------------------+
|amdestroy   | Destroy an existing index relation.                 |
+------------+-----------------------------------------------------+

     In general, the programmer need not worry about how
function dispatch via the AccessMethodTupleForm works.  This
is handled properly for all indexed access methods by code
in access/common.

1.1.2.  The Relation Tuple Form (rd_rel)

     The reldesc for a relation includes a pointer to the
pg_class tuple for the relation.  When a reldesc is
instantiated into the cache, it is initialized from the data
for the relation from pg_class.  The user-level (that is,
non-system) attributes for the pg_class tuple constitute the
relation tuple form.  This structure is declared in
catalog/pg_relation.h [2] as Form_pg_relation.  The
declaration is

     CCAATTAALLOOGG((ppgg__rreellaattiioonn)) BBOOOOTTSSTTRRAAPP {{
          cchhaarr1166    rreellnnaammee;;
          ooiidd       rreelloowwnneerr;;
          ooiidd       rreellaamm;;
          iinntt44           rreellppaaggeess;;
          iinntt44           rreellttuupplleess;;
          ddtt        rreelleexxppiirreess;;
          ddtt        rreellpprreesseerrvveedd;;
          bbooooll           rreellhhaassiinnddeexx;;
          bbooooll           rreelliisssshhaarreedd;;
          cchhaarr           rreellkkiinndd;;
          cchhaarr           rreellaarrcchh;;
          iinntt22           rreellnnaattttss;;
          iinntt22      rreellssmmggrr;;
          iinntt2288     rreellkkeeyy;;
          ooiidd88      rreellkkeeyyoopp;;
          aacclliitteemm   rreellaaccll[[11]];;
     }} FFoorrmmDDaattaa__ppgg__rreellaattiioonn;;

     ttyyppeeddeeff FFoorrmmDDaattaa__ppgg__rreellaattiioonn  **FFoorrmm__ppgg__rreellaattiioonn;;

The notation

     CATALOG(pg_relation) BOOTSTRAP {

is turned into a structure declaration by cpp macros at
compile time.  This declaration is also used to produce setup
files, called bki files, that create and populate the
template1 database.

     The structure entry relname is the name of the
relation.

     The relowner entry is the object ID of the pg_user
tuple describing the relation's owner.  A relation is owned
by the user that created it.

     The relam entry is the object ID of the pg_am tuple for
the access method that manages this relation.  For heap
relations, this is zero, an invalid object ID.


____________________
   [2] The class pg_class was previously called pg_relation.
In all user-visible parts of the system, the name was
converted in 1990, but internally (as in the names of system
header files), the name was not always changed.


                              77


     The rreellppaaggeess and rreellttuupplleess entries  are,  respectively,
the approximate number of pages and tuples in this relation.
These numbers are set by the  vacuum  cleaner  and  when  an
index  is  defined  on  a heap relation, and so are wrong in
general.  They are not changed when tuples are inserted into
or  deleted  from  the  relation.  When the relation is ini-
tially created, both are zero.  Both rreellttuupplleess and  rreellppaaggeess
are  used by the planner to estimate plan costs during query
optimization.  Interestingly, then, if neither rreellttuupplleess nor
rreellppaaggeess  have  been updated by the vacuum cleaner recently,
their incorrect values can skew query optimization such that
queries may run for an unexpectedly long time.

     The  rreelleexxppiirreess  entry  is  the  amount of history that
should be saved for this relation.  This is not used in  the
current system.

     The  rreellpprreesseerrvveedd  entry  is the date and time at which
the relation was last vacuumed.  This is used by the planner
when it produces query plans for historical queries.  If the
historical query extends to before the  time  at  which  the
relation  was  last  vacuumed,  then  the  archive  must  be
scanned.  Otherwise, it need not be.

     The rreellhhaassiinnddeexx entry is _t_r_u_e if an index exists on the
relation.   When  an  index  is  destroyed, this flag is not
changed, and so an entry of _t_r_u_e may  mean  that  there  was
recently  an index on the relation, but that there no longer
is.  The vacuum cleaner restores this to the  correct  value
when  it  runs,  and  defining  an  index on a heap relation
always sets this entry to _t_r_u_e for the heap.  RReellhhaassiinnddeexx is
always _f_a_l_s_e for an index relation.

     The  rreelliisssshhaarreedd  entry  is _t_r_u_e for some shared system
catalogs.  Most relations for database _d_b_n_a_m_e reside in  the
directory  $$PPGGDDAATTAA//bbaassee//_d_b_n_a_m_e.   However,  some  relations,
such as ppgg__uusseerr and ppgg__ddaattaabbaassee, must be visible to users of
all databases simultaneously.  These relations are stored in
the directory $$PPGGDDAATTAA, and  the  rreelliisssshhaarreedd  entry  in  the
ppgg__ccllaassss tuple for these relations is set to _t_r_u_e.

     The rreellkkiinndd entry has one of three values:

  +o  _r means that the relation is an ordinary heap.

  +o  _i means that the relation is an index.

  +o  _u  means that the relation is an uncatalogued (that is,
     temporary) heap relation which  will  be  automatically
     destroyed when the transaction ends.


                              88


     The  rreellaarrcchh  entry  describes the frequency with which
the relation should  be  archived.   Although  three  levels
exist,  only  two  are actually supported.  The three levels
are

  +o  _h, for heavy update traffic and frequent archival;

  +o  _l, for light update traffic  and  infrequent  archival;
     and

  +o  _n, for no archival.

     The only levels actually supported are _h and _n.  If the
rreellaarrcchh entry is _n, then the  vacuum  cleaner  discards  all
historical data from the relation when it runs.

     The  rreellnnaattttss  entry is the number of attributes in the
relation.

     The rreellssmmggrr entry identifies the storage  manager  that
manages  storage for this relation.  Storage manager zero is
magnetic disk.  On the  system  installed  at  UC  Berkeley,
storage managers one and two are, respectively, for the Sony
optical disk jukebox and main memory relations.  The storage
manager code is called from the POSTGRES buffer manager, and
is of no concern to access method implementors.

     The rreellkkeeyy entry is a vector of attribute numbers  that
define  the  unique key for this relation.  This is not sup-
ported in the current system.

     The structure ends with  a  variable-length  vector  of
access control list information, stored in the rreellaaccll entry.
The access control list is used to grant  or  deny  read  or
write access on the relation to particular users.

11..11..33..  TThhee TTuuppllee DDeessccrriippttoorr DDaattaa ((rrdd__aatttt))

     The  reldesc  ends with a vector of attribute data that
describe the tuples stored in the relation.  This is a vari-
able-length  entry;  different relations have different num-
bers of attributes, and so their rrdd__aatttt vectors are of  dif-
ferent length.

     There   are  two  data  structures  of  interest.   The
TTuupplleeDDeessccrriippttoorrDDaattaa structure, defined in  aacccceessss//ttuuppddeesscc..hh,
is   a   variable-length   array  of  AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa
entries.  The AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structure,  defined  in
ccaattaalloogg//ppgg__aattttrriibbuuttee..hh,  contains  detailed information on a
single attribute.


                              99


     The data structure for AttributeTupleFormData is the
same as a Form_pg_attribute, which is

     CATALOG(pg_attribute) BOOTSTRAP {
          oid       attrelid;
          char16    attname;
          oid       atttypid;
          oid       attdefrel;
          int4      attnvals;
          oid       atttyparg;
          int2      attlen;
          int2      attnum;
          int2      attbound;
          bool      attbyval;
          bool      attcanindex;
          oid       attproc;
          int4      attnelems;
          int4      attcacheoff;
     } FormData_pg_attribute;

     typedef FormData_pg_attribute *Form_pg_attribute;


     As above, the

     CATALOG(pg_attribute) BOOTSTRAP

line is converted to a struct declaration by cpp macros at
compile time.

     The attrelid entry is the object ID of the pg_class
tuple for the relation that contains this attribute.

     The attname entry is the name of the attribute.

     The atttypid entry is the object ID of the pg_type
tuple for the type of this attribute.

     The attdefrel entry is used for object-oriented type
management by POSTGRES.  When a class is declared that
inherits from some other class, the new class implicitly has
all the attributes that the original class had.  In this
case, the attdefrel entry is the object ID of the relation
from which this attribute is inherited.

     The  aattttnnvvaallss  entry is intended to support query opti-
mization.  It should store the number of distinct values for
this  attribute  that  appear  in  the relation.  This value
should  be  computed  at  vacuum   time,   and   stored   in
ppgg__aattttrriibbuuttee  by  the  vacuum cleaner.  However, the current
vacuum cleaner does not compute statistics, so  aattttnnvvaallss  is
always  zero.   If  the  correct  value were maintained, the


                             1100


optimizer could use it to produce better query  plans.   The
optimizer currently ignores aattttnnvvaallss.

     The aattttttyyppaarrgg entry is unused.

     The  aattttlleenn  entry  is the length of this attribute, in
bytes, for fixed-length attributes,  or  -1,  for  variable-
length attributes.

     The aattttbboouunndd entry was intended to support array-valued
attributes, but is not used by the existing  POSTGRES  array
implementation.

     The  aattttbbyyvvaall  entry is _t_r_u_e if this is a pass-by-value
attribute, and _f_a_l_s_e if it is pass-by-reference.  This entry
is  set  by looking at the ttyyppbbyyvvaall attribute in the ppgg__ttyyppee
tuple for the type of this attribute.

     The aattttccaanniinnddeexx entry is _t_r_u_e  if  this  entry  can  be
indexed,  and  false  otherwise.  Array attributes cannot be
indexed.

     The aattttpprroocc entry is unused.

     The aattttnneelleemmss entry is used for  array  attributes,  to
record  the  number  of  elements  that may be stored in the
attribute.

     The aattttccaacchheeooffff entry is never maintained on disk,  but
is set in memory to allow fast attribute lookups.  The first
time an attribute is fetched from a  tuple,  its  offset  is
computed  by summing the lengths of the attributes that pre-
cede it.  If all of  the  preceding  attributes  are  fixed-
length,  and  none are NULL, then the computed offset may be
cached for use on subsequent lookups.  The offset is  stored
in aattttccaacchheeooffff.

     The AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structure stores information
on a single attribute.  The vector that describes all of the
attributes  in  a relation, rrdd__aatttt, contains pointers to in-
memory AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structures for each attribute.
The  vector  is stored in the TTuupplleeDDeessccrriippttoorrDDaattaa structure,
whose declaration is

     ttyyppeeddeeff ssttrruucctt TTuupplleeDDeessccrriippttoorrDDaattaa {{
          AAttttrriibbuutteeTTuupplleeFFoorrmm  ddaattaa[[11]];;
          //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY **//
     }} TTuupplleeDDeessccrriippttoorrDDaattaa;;

     ttyyppeeddeeff TTuupplleeDDeessccrriippttoorrDDaattaa   **TTuupplleeDDeessccrriippttoorr;;


                             1111


11..11..44..  TThhee SSttrraatteeggyy MMaapp  aanndd  SSuuppppoorrtt  RRoouuttiinneess  ffoorr  IInnddeexx
RReellddeessccss

     The  sections  above describe the contents of a reldesc
that are common to the heap and indexed access methods.  The
indexed  access methods store some additional information at
the end of the reldesc.  This information is used to  select
operators  to  apply  during  scans, and to find the support
routines that are  required  by  particular  indexed  access
methods.   These  data  structures,  their layout, and their
purposes are obscure.   They  reflect  poor  design  of  the
indexed access methods.  This section is difficult to under-
stand, and may be safely  ignored  by  everyone  except  the
maintainer of the POSTGRES access methods.

     Immediately following the TupleDescriptorData vector,
the indexed access methods have two additional pieces of
data.  The first is a pointer to the IndexStrategyData
structure for this access method.  The second is a pointer
to the vector of support routines for the access method.
Neither of these is a named entry of any data structure;
their existence is assumed (and, at least so far, correctly
maintained) by the code that manages reldescs for the
indexed access methods.

     The problem with this design is that it makes it impos-
sible to name these entries from the debugger.  The  follow-
ing cumbersome GDB statements will print out the contents of
these entries:

     pprriinntt ((IInnddeexxSSttrraatteeggyy))((&&rreellddeesscc-->>rrdd__aatttt[[
          rreellddeesscc-->>rrdd__rreell-->>rreellnnaattttss]]))
     pprriinntt ((RReeggPPrroocceedduurree **))((((((cchhaarr **))
          &&rreellddeesscc-->>rrdd__aatttt[[rreellddeesscc-->>rrdd__rreell-->>rreellnnaattttss]]))
          ++ ssiizzeeooff((IInnddeexxSSttrraatteeggyy))))

The first statement prints the  pointer  to  the  vector  of
index  strategy  map  data; the second prints the pointer to
the vector of support procedure IDs.  These pointers must be
dereferenced  in  order  to examine the contents of the vec-
tors.  Lines have been  broken  so  that  the  commands  fit
across  the page; the statements should be typed on a single
line.
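
     The pointer arithmetic behind these statements can be sketched in
C.  This is an illustration only, not the POSTGRES source: MiniReldesc,
the stub types, and the helper names are hypothetical stand-ins.  Only
the layout (attribute vector, then strategy pointer, then support
pointer) is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real types declared in utils/rel.h. */
typedef struct AttrStub     *AttributeTupleForm;  /* one attribute entry  */
typedef struct StrategyStub *IndexStrategy;       /* strategy map pointer */
typedef unsigned int         RegProcedure;        /* procedure ID         */

typedef struct MiniReldesc {
    int                natts;  /* plays the role of rd_rel->relnatts */
    AttributeTupleForm att[];  /* variable-length vector (rd_att)    */
} MiniReldesc;

/* Bytes needed for a reldesc with natts attributes plus the two
 * unnamed trailing pointers used by the indexed access methods. */
size_t mini_reldesc_size(int natts)
{
    return offsetof(MiniReldesc, att)
         + (size_t) natts * sizeof(AttributeTupleForm)
         + sizeof(IndexStrategy)
         + sizeof(RegProcedure *);
}

/* Slot holding the strategy map pointer: just past the last attribute. */
IndexStrategy *strategy_slot(MiniReldesc *r)
{
    return (IndexStrategy *) &r->att[r->natts];
}

/* Slot holding the support vector pointer: one pointer further on. */
RegProcedure **support_slot(MiniReldesc *r)
{
    return (RegProcedure **) ((char *) &r->att[r->natts]
                              + sizeof(IndexStrategy));
}

/* Allocate a zeroed reldesc; returns NULL on allocation failure. */
MiniReldesc *mini_reldesc_alloc(int natts)
{
    MiniReldesc *r;

    assert(natts >= 0);
    r = calloc(1, mini_reldesc_size(natts));
    if (r != NULL)
        r->natts = natts;
    return r;
}
```

Because neither slot is a named structure member, every access has to
repeat this address computation, which is exactly what makes the
entries awkward to inspect from a debugger.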

     The strategy map is a vector of scan key entries.  Scan
keys  are described in sections 1.3.1 and 1.3.2.  The strat-
egy map stores information on what procedure to call  for  a
particular  operator  on  this  index.  For example, a btree
storing iinntt44 values would  have  the  procedure  ID  of  the
iinntt44lltt procedure at the location corresponding to '<' in the
strategy vector.  The strategy map is constructed  automati-
cally  by the system when the index relation is opened.  The


                             1122


contents are read from the ppgg__aammoopp relation, which is  keyed
by operator class and access method.

     Finally, the vector of support procedures is simply an
array of procedure IDs, or pg_proc tuple OIDs, for the support
routines required by the access method.  This vector is
initialized when the index relation is opened by scanning
the pg_amsupport relation.  The tuples in pg_amsupport are
keyed by access method and operator class.  POSTGRES assigns
no meaning to the entries in this vector; they are intended
for use by the access method.  The btree access method uses
a single support routine, which compares two keys and
returns a value less than zero, equal to zero, or greater than
zero if the first key is, respectively, less than, equal to,
or greater than the second.  The rtree access method uses
three support routines, which compute the size of a rectangle
in standard units, the intersection of two rectangles, and
the union of two rectangles.
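
     As an illustration of the btree comparison contract, a routine
for four-byte integers might look like the sketch below.  The name
int4_compare is hypothetical and is not taken from the POSTGRES source.

```c
#include <assert.h>
#include <stdint.h>

/* Three-way comparison in the style described for the btree support
 * routine: negative, zero, or positive as a is less than, equal to,
 * or greater than b.  Written as (a > b) - (a < b) to avoid the
 * overflow that computing a - b could cause. */
int int4_compare(int32_t a, int32_t b)
{
    return (a > b) - (a < b);
}

/* Small self-check of the contract. */
void int4_compare_check(void)
{
    assert(int4_compare(1, 2) < 0);
    assert(int4_compare(7, 7) == 0);
    assert(int4_compare(INT32_MAX, INT32_MIN) > 0);
}
```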

11..11..55..  SSuummmmaarryy ooff tthhee RReellaattiioonn DDeessccrriippttoorr DDaattaa SSttrruuccttuurree

     The  reldesc  describes  an  open  relation, and is the
argument that is passed to most of the routines that operate
on relations.  POSTGRES manages a private cache of reldescs,
and at most one copy of the reldesc  for  a  given  relation
appears in the cache at a time.  The reldesc includes point-
ers to the relation tuple form and a vector  of  information
that  describes  the attributes that appear in tuples of the
relation.  For indexed  access  methods,  the  reldesc  also
includes  a  pointer  to  the  access  method  tuple form, a
pointer to a vector of strategy information  that  describes
the  operators supported by the access method for the opera-
tor class, and a pointer  to  a  vector  of  access  method-
specific support routines.

11..22..  HHeeaapp aanndd IInnddeexx RReellaattiioonnss

     Heap relations are the primary POSTGRES storage structure;
all user relations and all the system catalog information
are stored in heaps.  The POSTGRES heap is logically an
unordered set of tuples.  Heaps consist of zero or more
8192-byte blocks, each of which contains zero or more
tuples.

     Index relations store key/value pairs,  where  the  key
allows  fast  lookup  on some set of attributes from a heap,
and the value is a pointer to a tuple in the heap.

11..22..11..  TTuuppllee IIddeennttiiffiieerrss

     Tuples are identified by tuple identifiers, or tids.  A
tid is a triple of the form (blockno, pageno, offset).  The
original design of the system allowed for more than one page
to be stored on a single block of a relation, but this feature
has never been used.  As a result, the middle value
(pageno) is always zero.  To confuse things further, the
term "page" is used interchangeably with the term "block" in
the code and by programmers talking about the system.  Since
the original usage of "page" was never implemented, it has
come to mean exactly the same thing as "block."

     Thus a tid is a triple of the form (blockno, 0, offset).

     The _b_l_o_c_k_n_o is the number of the block in the relation;
blocks  are  numbered  sequentially  from  zero.  When a new
block is allocated to a  relation,  it  is  given  the  next
available block number.

     The offset stored in a tid is the offset of the tuple
within the block.  Blocks store zero or more tuples, and
tuples are numbered sequentially from one within a block.
The number of tuples that fit in a block depends on the
schema of the relation that the block stores, and the number
may vary for a given relation (for example, if the tuple
includes variable-length attributes).
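
     The tid triple can be modeled with a small C structure.  This is
a sketch under stated assumptions: the MiniTid type, its field widths,
and the helper names are hypothetical, and block_start assumes that
blocks are laid out back to back in a single file, which the text does
not state.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 8192u   /* block size given in the text */

/* Hypothetical tid layout; field widths are illustrative only. */
typedef struct MiniTid {
    uint32_t blockno;  /* block number, counted from zero      */
    uint16_t pageno;   /* always zero in the current system    */
    uint16_t offset;   /* tuple number within block, from one  */
} MiniTid;

/* Build a tid of the form (blockno, 0, offset). */
MiniTid make_tid(uint32_t blockno, uint16_t offset)
{
    MiniTid tid;

    assert(offset >= 1);   /* tuples are numbered from one */
    tid.blockno = blockno;
    tid.pageno = 0;
    tid.offset = offset;
    return tid;
}

/* ASSUMPTION: byte position of the block holding this tuple, if all
 * blocks of the relation sit contiguously in one file. */
uint64_t block_start(MiniTid tid)
{
    return (uint64_t) tid.blockno * BLOCK_SIZE;
}

/* Two tids name the same tuple only if all three parts match. */
int tid_equal(MiniTid a, MiniTid b)
{
    return a.blockno == b.blockno
        && a.pageno == b.pageno
        && a.offset == b.offset;
}
```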

     A tid uniquely identifies a single tuple in a relation.
In the current version of the system, relations are not com-
pressed when they are vacuumed, so tuples do not move around
inside a relation.  Thus the tid is a valid  identifier  for
the  tuple  until the tuple is replaced or deleted, at which
point it (logically) vanishes from the relation.  Note  that
other  tuples  on  the  same  page as a deleted tuple do not
change their tids when the tuple is deleted; the freed
slot is reserved for subsequent reuse.  In practice, slots are
never reused in the current system.  The vacuum cleaner and
the tuple allocation code may be changed to fix this problem
in the future.

11..22..22..  DDaattaa SSttoorreedd iinn HHeeaapp RReellaattiioonnss

     Every heap tuple begins with a HeapTupleData structure.
The declaration for this structure appears in the source
file access/htup.h, and is

     ttyyppeeddeeff ssttrruucctt HHeeaappTTuupplleeDDaattaa {{
          SSiizzee           tt__lleenn;;
          IItteemmPPooiinntteerrDDaattaa     tt__ccttiidd;;
          IItteemmPPooiinntteerrDDaattaa     tt__cchhaaiinn;;
          uunniioonn {{
               IItteemmPPooiinntteerrDDaattaa     ll__llttiidd;;
               RRuulleeLLoocckk       ll__lloocckk;;
          }}              tt__lloocckk;;


                             1144


          OObbjjeeccttIIdd       tt__ooiidd;;
          CCoommmmaannddIIdd      tt__ccmmiinn;;
          CCoommmmaannddIIdd      tt__ccmmaaxx;;
          TTrraannssaaccttiioonnIIdd  tt__xxmmiinn;;
          TTrraannssaaccttiioonnIIdd  tt__xxmmaaxx;;
          AABBSSTTIIMMEE        tt__ttmmiinn,, tt__ttmmaaxx;;
          AAttttrriibbuutteeNNuummbbeerr     tt__nnaattttss;;
          cchhaarr           tt__vvttyyppee;;
          cchhaarr           tt__iinnffoommaasskk;;
          cchhaarr           tt__lloocckkttyyppee;;
          uuiinntt88               tt__hhooffff;;
          cchhaarr           tt__bbiittss[[MMiinnHHeeaappTTuupplleeBBiittmmaappSSiizzee // 88]];;
     }} HHeeaappTTuupplleeDDaattaa;;    //** MMOORREE DDAATTAA FFOOLLLLOOWWSS AATT EENNDD OOFF SSTTRRUUCCTT **//

     ttyyppeeddeeff HHeeaappTTuupplleeDDaattaa    **HHeeaappTTuuppllee;;

This structure includes a variable-length bitmap  vector  at
its  end, after which the user-level attributes of the tuple
appear.

     The tt__lleenn entry is the  length  of  the  entire  tuple,
including the HHeeaappTTuupplleeDDaattaa structure and the user data that
follows it.  This is used to allocate  sufficient  space  in
memory and on disk pages for the tuple.

     The tt__ccttiidd entry is the tid of this tuple.  Storing the
tid in the tuple is easier than  computing  it  when  it  is
needed  (for  example, when an index tuple is constructed to
point at the heap tuple).

     When a tuple is replaced, POSTGRES maintains a pointer
from the old version of the tuple to the new version.  This
pointer is stored in the t_chain entry.  When the new version
of the tuple is successfully inserted, the page containing
the old version is fetched, and the original tuple's
t_chain entry is set to the t_ctid entry of the new version.
This strategy, called forward chaining, leaves older records
pointing at newer records.  The advantage of forward chaining
is that it allows records to be updated without changing
indices that point at them if the indexed attribute has not
changed.

     The tt__cchhaaiinn entry was originally  intended  to  support
tuple differencing, which would allow POSTGRES to store only
the changed values for the new version in most cases, rather
than  the  entire new tuple.  Tuple differencing is not sup-
ported in the  current  system.   In  addition,  the  system
always inserts a new index record when a heap tuple is modi-
fied,  regardless  of  whether  the  indexed  attribute  has
changed.  This may be fixed in the future.


                             1155


     The t_lock entry stores a rule lock for the tuple.  A
rule lock is a tag that identifies a set of rules that
should be run whenever this tuple is manipulated.  The mechanism
is beyond the scope of this section.  The value stored
in t_lock is a list of tids when the tuple is on disk, and
is swizzled to an in-memory pointer when the tuple is copied
into memory.

     The t_oid entry of the HeapTupleData structure is the
object ID that uniquely identifies this tuple.  Although
tids currently identify a tuple uniquely for its lifetime,
this may change someday, when tuple differencing and
space reclamation are implemented.  The OID is guaranteed
by design never to change.  Furthermore, when a tuple
is replaced, its tid will change, but its OID is guaranteed
not to.  This means that the OID is a safe, globally unique,
eternal identifier for the tuple.

     The t_cmin and t_cmax entries are, respectively, the
command IDs that inserted and deleted this tuple.  In  every
transaction,  the  user  may run up to 32768 individual com-
mands.  Any of these commands may update the database.   The
command  ID  is stored in every tuple to allow the system to
keep track of which command inside a single transaction made
a  change.   The  reason  that this is necessary is to allow
subsequent commands inside the same transaction to  see  the
effects  of changes made by earlier commands in the transac-
tion, even though those changes are  not  visible  to  other
users running concurrently in other transactions.

     The t_xmin and t_xmax entries are, respectively, the
transaction  IDs  of  the  transactions  that  inserted  and
deleted this tuple.  These transaction IDs are used by POST-
GRES to check at runtime whether the transaction  that  made
the  changes  ever  committed.   Changes made by uncommitted
transactions may be safely ignored.

     The tt__ttmmiinn and tt__ttmmaaxx entries  are,  respectively,  the
commit times of the inserting and deleting transactions that
operated on this tuple.  When POSTGRES inserts a  tuple,  it
writes the inserting command and transaction IDs, but leaves
the tt__ttmmiinn entry empty.  When  the  transaction  commits,  a
special  entry  is  written  to  the ppgg__ttiimmee relation.  This
entry contains the time at which the transaction  committed.
Later,  the  vacuum  cleaner  (or, in some cases, a POSTGRES
backend in normal operation) will fill in  the  commit  time
for the inserting transaction.  Once this time is filled in,
POSTGRES does not need to check the transaction log in order
to  see  whether  the transaction ever committed.  This sub-
stantially speeds up processing, since the  transaction  log
check generally requires disk accesses.


                             1166


     The t_natts entry is the number of attributes that are
stored in this tuple.  The number of attributes may not be
the same as the number stored by the relation, since the
relation schema may change over time as a result of the
addattr command.  Any references to attributes beyond the
end of the tuple will return the special database value NULL
(note  that  database NULL is not the same as the zero value
used by C).

     The tt__vvttyyppee entry is for the version type,  whose  pur-
pose  has  been lost to history.  This is unused by the cur-
rent system.  The  value  stored  here  is  typically  zero,
although  the  header file aacccceessss//hhttuupp..hh claims that only _i,
_r, and _d are supported.

     The tt__iinnffoommaasskk entry  is  used  to  encode  information
about  the  tuple  to  speed up fetching of attributes.  The
information encoded here includes flags  indicating  whether
the  tuple  contains  any null values or any variable-length
attributes.

     The tt__lloocckkttyyppee entry indicates whether the value stored
in  the  tt__lloocckk  union  is  an  on-disk  (tid)  or in-memory
(pointer) lock representation.  This is stored far from  the
union  in order to keep the data structure tightly packed in
memory.

     The tt__hhooffff entry is the number of bytes occupied by the
tuple  header  (that  is,  by  the HHeeaappTTuupplleeDDaattaa structure).
This may vary from tuple to tuple, because if the tuple con-
tains  no  nulls  values,  then  the  bitmap  of null values
(stored in tt__bbiittss, below) is not  necessary,  and  does  not
appear  in  the  tuple.   For  any tuple, the address of the
HHeeaappTTuupplleeDDaattaa structure  that  stores  it,  plus  the  value
stored  in  tt__hhooffff, is the address of the first byte of user
data stored in the tuple.
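
     The address computation just described can be sketched as
follows.  MiniHeapTuple is a hypothetical, heavily reduced header that
keeps only t_len and t_hoff; the helper names are likewise invented for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reduced tuple header: only the two fields needed
 * here.  The real HeapTupleData has many more. */
typedef struct MiniHeapTuple {
    uint32_t t_len;   /* total length: header plus user data */
    uint8_t  t_hoff;  /* bytes occupied by the header        */
} MiniHeapTuple;

/* First byte of user data: the tuple's address plus t_hoff. */
char *tuple_user_data(MiniHeapTuple *tup)
{
    return (char *) tup + tup->t_hoff;
}

/* Bytes of user data, derived from t_len and t_hoff. */
size_t tuple_user_len(const MiniHeapTuple *tup)
{
    assert(tup->t_len >= tup->t_hoff);
    return (size_t) tup->t_len - tup->t_hoff;
}
```

Storing the header size in each tuple lets the same code locate user
data whether or not the null bitmap is present.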

     The tt__bbiittss vector contains one bit for every  attribute
in  the  tuple  if  the  tuple  contains  nulls.   For  each
attribute, the corresponding bit is set if the attribute  is
null, and is clear otherwise.  Nulls are not actually stored
in the tuple; just the tt__bbiittss bit for the null value appear.
This  keeps  tuples  containing  null  values small.  If the
tuple contains no null values, then  the  tt__bbiittss  vector  is
left  off  the  end  of  the structure, and user data begins
immediately after the tt__hhooffff entry.
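
     A null test against such a bitmap might be sketched as below.
The function name and its parameters are hypothetical.  The mapping of
attribute n to bit (n - 1) % 8 of byte (n - 1) / 8 is an assumption;
the text does not specify the bit order, and the real code may differ.

```c
#include <assert.h>

/* Returns 1 if attribute attnum (numbered from one) is null.
 * Follows the text: a set bit means null, and attributes past
 * natts read as null.  ASSUMPTION: attribute n maps to bit
 * (n - 1) % 8 of byte (n - 1) / 8. */
int att_is_null(const unsigned char *bits, int has_nulls,
                int natts, int attnum)
{
    int i;

    assert(attnum >= 1);
    if (attnum > natts)
        return 1;          /* beyond the end of the stored tuple */
    if (!has_nulls)
        return 0;          /* bitmap absent: nothing is null     */
    i = attnum - 1;
    return (bits[i / 8] >> (i % 8)) & 1;
}
```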

     Collectively, the values stored in the HeapTupleData
structure are referred to as the tuple header.  These values
are accessible from the query language as system attributes.
Every heap tuple has the same system attributes, but the
user attributes they store vary from relation to relation,
and possibly even within a relation.

     System attributes are all assigned attribute numbers less
than zero internally by the system.  User-defined attributes
all have attribute numbers greater than zero.  The names of
the system attributes and their corresponding HeapTupleData
structure elements are listed in the following table (see
note 3).

         +----------------------+------------------+
         |HeapTupleData element | System Attribute |
         +----------------------+------------------+
         |t_chain               | chain            |
         +----------------------+------------------+
         |t_cmax                | cmax             |
         +----------------------+------------------+
         |t_cmin                | cmin             |
         +----------------------+------------------+
         |t_ctid                | ctid             |
         +----------------------+------------------+
         |t_oid                 | oid              |
         +----------------------+------------------+
         |t_lock                | rlock            |
         +----------------------+------------------+
         |t_tmax                | tmax             |
         +----------------------+------------------+
         |t_tmin                | tmin             |
         +----------------------+------------------+
         |t_xmax                | xmax             |
         +----------------------+------------------+
         |t_xmin                | xmin             |
         +----------------------+------------------+
         |t_vtype               | vtype            |
         +----------------------+------------------+

11..22..33..  DDaattaa SSttoorreedd iinn IInnddeexx RReellaattiioonnss

     All POSTGRES indices are secondary -- that is, ordinary
system  and  user  data  are  stored  in heap relations, and
indices store pointers into the heaps.  All of  the  indexed
access  methods  store  index tuples, which are much smaller
than heap tuples.  In general, the  indexed  access  methods
also  store  other  data  on  POSTGRES  pages; what data are
stored depends on the indexed access method.  For the rtree
and btree access methods, some pages are designated as
internal pages, and only store pointers to other pages in
the same index.  Leaf pages, on the other hand, store pointers
into the heap.
____________________
   3  At one time, there existed a system attribute anchor
which corresponded with the HeapTupleData element t_anchor.
Anchor was the anchor point, or complete tuple, stored in a
sequence of difference tuples.  Neither is used in the system
at this time since tuple differencing has not yet been
implemented.

     A particular index is defined on a  single  heap  rela-
tion,  and  all the heap tids that the index stores refer to
that relation.  Since indices support fast keyed lookup, the
indices  also  store key data, which are typically either an
attribute from the heap or some function of an attribute  or
attributes from the heap.  For example, if the EMP class
stored employee records, and included a salary attribute,
then a btree index on EMP.salary would store the salaries
from EMP together with the tids of the tuples with each
particular salary.

     The index tuple format is defined in access/itup.h, and
is

     ttyyppeeddeeff ssttrruucctt IInnddeexxTTuupplleeDDaattaa {{
          IItteemmPPooiinntteerrDDaattaa               tt__ttiidd;;
          uunnssiiggnneedd sshhoorrtt           tt__iinnffoo;;
     }} IInnddeexxTTuupplleeDDaattaa;;   //** MMOORREE DDAATTAA FFOOLLLLOOWWSS **//

     ttyyppeeddeeff IInnddeexxTTuupplleeDDaattaa   **IInnddeexxTTuuppllee;;


     The tt__ttiidd entry is the tid of the heap tuple that  cor-
responds to this index tuple.

     The t_info entry encodes some information about the
index tuple.  This is a sixteen-bit quantity whose layout is
as follows:

  +o  Bit  fifteen  (the  leftmost  bit)  is set if the index
     tuple contains null values, and is clear otherwise.

  +o  Bit fourteen is set if the index tuple  contains  vari-
     able-length attributes, and is clear otherwise.

  +o  Bit  thirteen is set if there are rules associated with
     the index tuple, and  is  clear  otherwise.   Rules  on
     index tuples are not supported by the current system.

  +o  Bits  twelve through zero are the size of the tuple, in
     bytes.
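
     The bit layout above translates directly into masks.  The sketch
below uses hypothetical macro and function names (ITUP_* and itup_*);
only the bit positions come from the list above.

```c
#include <assert.h>

/* Hypothetical mask names; bit positions are those listed above. */
#define ITUP_HAS_NULLS  0x8000u  /* bit 15       */
#define ITUP_HAS_VARLEN 0x4000u  /* bit 14       */
#define ITUP_HAS_RULES  0x2000u  /* bit 13       */
#define ITUP_SIZE_MASK  0x1FFFu  /* bits 12 to 0 */

/* Pack the null flag, variable-length flag, and size into 16 bits.
 * The rules bit is left clear, since rules on index tuples are
 * not supported. */
unsigned short itup_pack(int nulls, int varlen, unsigned size)
{
    unsigned info = size;

    assert(size <= ITUP_SIZE_MASK);
    if (nulls)
        info |= ITUP_HAS_NULLS;
    if (varlen)
        info |= ITUP_HAS_VARLEN;
    return (unsigned short) info;
}

unsigned itup_size(unsigned short info)   { return info & ITUP_SIZE_MASK; }
int itup_has_nulls(unsigned short info)   { return (info & ITUP_HAS_NULLS) != 0; }
int itup_has_varlen(unsigned short info)  { return (info & ITUP_HAS_VARLEN) != 0; }
int itup_has_rules(unsigned short info)   { return (info & ITUP_HAS_RULES) != 0; }
```

A thirteen-bit size field can represent at most 8191 bytes, so an
index tuple described this way is always smaller than one 8192-byte
block.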

     Immediately following the t_info entry is the index
key.  The index key is the set of all attributes from the
heap tuple that form the search key for this index.  The
current implementations of the btree and rtree access methods
support only a single index key.  This may change, at
least for btrees, in the near future.

11..22..44..  HHooww tthhee RReellaattiioonnss aarree MMaannaaggeedd

     When the POSTQUEL user issues an update to a heap rela-
tion (either inserting or deleting data), all of the  corre-
sponding  indices are automatically updated, as well.  POST-
GRES stores, for every heap relation,  the  index  relations
defined  on it, and for every index relation, the attributes
from the heap that form  the  index  key.   These  data  are
stored in ppgg__iinnddeexx.

     When  the  user  issues a POSTQUEL query against a heap
relation that requires the relation to be scanned,  POSTGRES
checks to see whether any of the indices defined on the heap
will permit the scan to be executed more quickly.  For example,
if a btree index is defined on the salary attribute of
the EMP relation, then a query for all employees with
salaries  between  $30,000  and  $50,000 can be satisfied by
scanning the btree index, and only selecting  out  the  heap
tuples  with  the correct salaries.  This allows the scan to
be completed more quickly than would a  sequential  scan  of
all of the tuples in the EMP relation.

11..33..  SSccaann KKeeyyss,, SSccaann DDeessccrriippttoorrss,, aanndd SSccaannss

     A scan is the abstraction used in POSTGRES to search a
heap or index relation for tuples that satisfy some qualification.
A scan descriptor, or scandesc, is the data structure
that describes an open scan.  The qualification is
stored in the scandesc in data structures called scan keys.
Each key consists of a procedure p, an attribute number k,
and a value v.  Logically, every tuple t in the relation is
checked to see whether its kth attribute, t_k, passes the
qualification stored in the scan key.  The tuple passes the
qualification if p(t_k, v) returns true.  In order for the
tuple to satisfy the entire qualification, it must pass this
test for every scan key stored in the scandesc.  This  means
that  a  scandesc  stores a conjunctive qualification on the
relation.  Disjunctive  qualifications  may  be  handled  by
using  more than one scan, and suppressing duplicate tuples,
or by using a scan with no keys, and applying  the  disjunc-
tive tests manually.
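
     The conjunctive test can be sketched in a few lines of C.  This
is a simplified model, not the POSTGRES scan code: MiniScanKey,
tuple_qualifies, and the integer-only attributes are assumptions made
to keep the example short.

```c
#include <assert.h>

typedef int (*KeyProc)(int attr, int value);

/* Hypothetical reduced scan key: procedure p, attribute k, value v. */
typedef struct MiniScanKey {
    KeyProc p;  /* procedure to apply               */
    int     k;  /* attribute number, counted from 1 */
    int     v;  /* value to test against            */
} MiniScanKey;

int int_ge(int a, int b) { return a >= b; }
int int_le(int a, int b) { return a <= b; }

/* A tuple qualifies only if p(t_k, v) is true for every key. */
int tuple_qualifies(const int *tuple, int natts,
                    const MiniScanKey *keys, int nkeys)
{
    int i;

    for (i = 0; i < nkeys; i++) {
        assert(keys[i].k >= 1 && keys[i].k <= natts);
        if (!keys[i].p(tuple[keys[i].k - 1], keys[i].v))
            return 0;
    }
    return 1;   /* with zero keys, every tuple qualifies */
}
```

Two keys on the same attribute, one using int_ge with 30000 and one
using int_le with 50000, express the salary range query mentioned
earlier as a conjunction.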

     Once a scan is opened, it returns tuples one at a time.
Every tuple that the scan returns is guaranteed  to  satisfy
the  qualification, and the scan is guaranteed to return all
qualifying tuples.

     The rest of this section describes the data  structures
associated with scans, scandescs, and scan keys.


                             2200


11..33..11..  SSccaann KKeeyyss

     A  scan is a conjunction of zero or more qualifications
on  single  attributes  in  the  relation.   Every   single-
attribute  qualification  is stored in a scan key.  The data
structure used to store scan keys appears in the source file
aacccceessss//sskkeeyy..hh.  Its declaration is

     ttyyppeeddeeff ssttrruucctt SSccaannKKeeyyEEnnttrryyDDaattaa {{
          bbiittss1166         ffllaaggss;;
          AAttttrriibbuutteeNNuummbbeerr     aattttrriibbuutteeNNuummbbeerr;;
          RReeggPPrroocceedduurree   pprroocceedduurree;;
          iinntt            ((**ffuunncc)) (());;
          iinntt3322          nnaarrggss;;
          DDaattuumm               aarrgguummeenntt;;
     }} SSccaannKKeeyyEEnnttrryyDDaattaa;;

     ##ddeeffiinnee CChheecckkIIffNNuullll      00xx11
     ##ddeeffiinnee UUnnaarryyPPrroocceedduurree   00xx22
     ##ddeeffiinnee NNeeggaatteeRReessuulltt          00xx44
     ##ddeeffiinnee CCoommmmuutteeAArrgguummeennttss 00xx88

     ttyyppeeddeeff SSccaannKKeeyyEEnnttrryyDDaattaa **SSccaannKKeeyyEEnnttrryy;;

The flags entry is a sixteen-bit bitmask describing the
function to be called and its return values.  The flag values
appear in the #defines that follow the data structure.
If the CheckIfNull flag is set, then the scan code should
check for null attributes and return null for them without
calling the function.  However, this behavior is not implemented
in the current system.  If the UnaryProcedure bit is
set, then this procedure takes a single argument (the appropriate
attribute from the tuple), rather than an attribute
and a constant.  If the NegateResult bit is set, then the
scan code will negate the logical value returned by the procedure
when checking for tuples that satisfy the qualification.
Finally, if the CommuteArguments flag is set, then
the scan code will call the procedure with the supplied
value and the attribute from the tuple, rather than the
opposite order.
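
     The effect of the NegateResult and CommuteArguments flags can be
sketched as follows.  The flag values are those given in the
declaration above; apply_key, BinaryProc, and int_lt are hypothetical
names, and the UnaryProcedure and CheckIfNull cases are deliberately
not modeled.

```c
#include <assert.h>

/* Flag values as given in the declaration above. */
#define CheckIfNull      0x1
#define UnaryProcedure   0x2
#define NegateResult     0x4
#define CommuteArguments 0x8

typedef int (*BinaryProc)(int, int);

int int_lt(int a, int b) { return a < b; }

/* Apply a two-argument key procedure to an attribute and the key's
 * argument, honoring CommuteArguments and NegateResult. */
int apply_key(unsigned flags, BinaryProc func, int attr, int arg)
{
    int result;

    assert((flags & UnaryProcedure) == 0);  /* binary case only */
    result = (flags & CommuteArguments) ? func(arg, attr)
                                        : func(attr, arg);
    if (flags & NegateResult)
        result = !result;
    return result;
}
```

With these two flags a single "less than" procedure can also serve for
"greater than" (commuted) and "greater than or equal" (negated)
qualifications.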

     The aattttrriibbuutteeNNuummbbeerr entry is the  attribute  number  in
the tuple to check.

     The procedure entry is the object ID of the pg_proc
tuple that describes the procedure to call.  When the procedure
is called for the first time, the pg_proc relation is
scanned and the appropriate function is dynamically  loaded,
if necessary.

     The func entry is a pointer to a function that must
return type bool.  This function pointer is initialized by
the scan key from the pg_proc tuple for the desired procedure,
and need not be set by the caller.  The function
pointer is cached to save repeated scans of pg_proc.

     The nargs entry is the number of arguments that are
required by the function.  This is filled in from the
pg_proc tuple, and need not be set by the caller.  However,
it should always be either one or two.

     Finally, the aarrgguummeenntt entry is the additional  argument
to the function.  This must be set by the caller.

11..33..22..  SSeettttiinngg uupp SSccaann KKeeyyss

     In order to properly set up a scan key, the caller must
know which attributes in the  tuple  are  of  interest,  the
object IDs of the procedures that should be used to test the
attributes, and what values they should be  tested  against.
In order to simplify scan key setup, a number of conventions
have been adopted in the POSTGRES source code.

     First, the schemas for all of the system catalogs are
defined in header files in the catalog directory.  The
header files are named relname.h, where relname is the name
of the catalog of interest.  For example, the schema of the
pg_user relation appears in the header file
catalog/pg_user.h.  The lone exception to this rule is that the
schema for the pg_class relation appears in the file
catalog/pg_relation.h.  This is an artifact of the terminology
change (relations became classes) that swept the project in
the early 1990s.

     Second, the header files described above include standard
#defines for the relations that they describe.  The
relation's name, as a string, is Name_relname; for example,
Name_pg_user.  The number of attributes in the relation is
Natts_relname (Natts_pg_user).  Every attribute may be referenced
as Anum_relname_attname (for example, the usename
attribute of the pg_user relation is Anum_pg_user_usename).
These conventions permit programmers to use symbolic names,
rather than embedded constants, to set up scans and open
relations.

     Similar conventions have  not  been  adopted  for  user
relations,  so  programmers who want to set up and use scans
on them must already know their schemas.

     Third, there exist some #defined constants for procedures
that are frequently used in scans on the system catalogs.
These procedure IDs appear in the source file
catalog/pg_proc.h, and generally take the form
OperationOfInterestRegProcedure -- for example,
CharacterEqualRegProcedure and ObjectIdEqualRegProcedure.

     Finally,  the constant values used by POSTGRES in scans
are of type DDaattuumm.  A number of  macros  have  been  defined
that  convert  embedded  constants  to values of type DDaattuumm.
These macros are defined in the source file ttmmpp//ddaattuumm..hh, and
are

+---------------------------+--------------------------+--------------------------+
|       convert type        |        from Datum        |         to Datum         |
+---------------------------+--------------------------+--------------------------+
|one-byte char              | DatumGetChar(v)          | CharGetDatum(v)          |
+---------------------------+--------------------------+--------------------------+
|one-byte integer           | DatumGetInt8(v)          | Int8GetDatum(v)          |
+---------------------------+--------------------------+--------------------------+
|unsigned one-byte integer  | DatumGetUInt8(v)         | UInt8GetDatum(v)         |
+---------------------------+--------------------------+--------------------------+
|two-byte integer           | DatumGetInt16(v)         | Int16GetDatum(v)         |
+---------------------------+--------------------------+--------------------------+
|unsigned two-byte integer  | DatumGetUInt16(v)        | UInt16GetDatum(v)        |
+---------------------------+--------------------------+--------------------------+
|four-byte integer          | DatumGetInt32(v)         | Int32GetDatum(v)         |
+---------------------------+--------------------------+--------------------------+
|unsigned four-byte integer | DatumGetUInt32(v)        | UInt32GetDatum(v)        |
+---------------------------+--------------------------+--------------------------+
|four-byte float            | DatumGetFloat32(v)       | Float32GetDatum(v)       |
+---------------------------+--------------------------+--------------------------+
|eight-byte double          | DatumGetFloat64(v)       | Float64GetDatum(v)       |
+---------------------------+--------------------------+--------------------------+
|void *                     | DatumGetPointer(v)       | PointerGetDatum(v)       |
+---------------------------+--------------------------+--------------------------+
|pointer to struct          | DatumGetStructPointer(v) | StructPointerGetDatum(v) |
+---------------------------+--------------------------+--------------------------+
|16-byte char string name   | DatumGetName(v)          | NameGetDatum(v)          |
+---------------------------+--------------------------+--------------------------+
|object ID                  | DatumGetObjectId(v)      | ObjectIdGetDatum(v)      |
+---------------------------+--------------------------+--------------------------+

     In POSTGRES, four-byte floating point values are passed by
reference, not by value.  All other four-byte quantities are
pass-by-value.  A very common programming error in POSTGRES
is to use values of type Name as if they were pointers.  In
fact, Name is a structure containing a sixteen-byte character
array, and so will be passed on the stack if its address
is not explicitly used.  These facts should be documented
elsewhere, but are worth mentioning here.
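
     The difference between pass-by-value and pass-by-reference
quantities can be illustrated with a small, self-contained model.
The sketch below is not the code in tmp/datum.h: every name in it
(SketchDatum, SketchNameData, the Sketch... conversion functions,
and sketch_name_set) is hypothetical, and it assumes only that a
datum is one pointer-sized word.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a datum is a single pointer-sized word. */
typedef uintptr_t SketchDatum;

/* Hypothetical stand-in for a sixteen-byte name structure. */
typedef struct SketchNameData {
    char data[16];
} SketchNameData;

/* Pass-by-value: the integer itself is stored in the datum. */
static SketchDatum SketchInt32GetDatum(int32_t v)
{
    return (SketchDatum)(uint32_t)v;
}

static int32_t SketchDatumGetInt32(SketchDatum d)
{
    return (int32_t)(uint32_t)d;
}

/* Pass-by-reference: the datum holds the ADDRESS of the value. */
static SketchDatum SketchFloat32GetDatum(float *v)
{
    return (SketchDatum)v;
}

static float *SketchDatumGetFloat32(SketchDatum d)
{
    return (float *)d;
}

/* A name is too large for one word, so its address is passed. */
static SketchDatum SketchNameGetDatum(SketchNameData *n)
{
    return (SketchDatum)n;
}

static SketchNameData *SketchDatumGetName(SketchDatum d)
{
    return (SketchNameData *)d;
}

/* Zero-fill a name, then copy in at most 15 bytes of the string. */
static void sketch_name_set(SketchNameData *n, const char *s)
{
    assert(n != NULL && s != NULL);
    memset(n, 0, sizeof(*n));
    strncpy(n->data, s, sizeof(n->data) - 1);
}
```

In this model the caller of SketchNameGetDatum must take the
address of the structure explicitly, which is the point the
paragraph above makes about values of type Name.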

     The following code fragment shows how to set up a scan
key on the pg_user relation for all tuples that have usename
equal to "mao" and an object ID of 1806:

     NameData name;
     ScanKeyEntryData skey[2];

     /* zero-fill the name, then copy the string into it */
     bzero(&name, sizeof(name));
     bcopy("mao", &name.data[0], strlen("mao"));

     ScanKeyEntryInitialize(&skey[0], (bits16)0x0,
          Anum_pg_user_usename,
          (RegProcedure)NameEqualRegProcedure,
          NameGetDatum(&name));
     ScanKeyEntryInitialize(&skey[1], (bits16)0x0,
          ObjectIdAttributeNumber,
          (RegProcedure)ObjectIdEqualRegProcedure,
          ObjectIdGetDatum(1806));


11..33..33..  TThhee SSccaann DDeessccrriippttoorr

     Once the scan keys are properly initialized,  they  may
be  used  to  open  a  scan  on a relation.  An open scan is
described by a scandesc.  The data structures that  describe
open scans are HHeeaappSSccaanns and IInnddeexxSSccaanns, and are declared in
the source file aacccceessss//rreellssccaann..hh.

11..33..33..11..  TThhee HHeeaapp SSccaann DDeessccrriippttoorr

     The declaration for HHeeaappSSccaannDDeessccDDaattaa is

     ttyyppeeddeeff ssttrruucctt HHeeaappSSccaannDDeessccDDaattaa {{
          RReellaattiioonn       rrss__rrdd;;
          HHeeaappTTuuppllee      rrss__ppttuupp;;
          HHeeaappTTuuppllee      rrss__ccttuupp;;
          HHeeaappTTuuppllee      rrss__nnttuupp;;
          BBuuffffeerr         rrss__ppbbuuff;;
          BBuuffffeerr         rrss__ccbbuuff;;
          BBuuffffeerr         rrss__nnbbuuff;;
          ssttrruucctt ddcchhaaiinn  **rrss__ddcc;;
          IItteemmPPooiinntteerrDDaattaa     rrss__mmppttiidd;;
          IItteemmPPooiinntteerrDDaattaa     rrss__mmccttiidd;;
          IItteemmPPooiinntteerrDDaattaa     rrss__mmnnttiidd;;
          IItteemmPPooiinntteerrDDaattaa     rrss__mmccdd;;
          BBoooolleeaann        rrss__aatteenndd;;
          TTiimmeeQQuuaall       rrss__ttrr;;
          uuiinntt1166         rrss__ccddeellttaa;;
          bbooooll           rrss__ppaarraalllleell__ookk;;
          uuiinntt1166         rrss__nnkkeeyyss;;
          SSccaannKKeeyyDDaattaa    rrss__kkeeyy;;
          //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY AATT EENNDD OOFF SSTTRRUUCCTT **//
     }} HHeeaappSSccaannDDeessccDDaattaa;;

     ttyyppeeddeeff HHeeaappSSccaannDDeessccDDaattaa **HHeeaappSSccaannDDeesscc;;


                             2244


     The rrss__rrdd entry is a pointer to  the  reldesc  for  the
relation on which this scan has been opened.

     The  rrss__ppttuupp, rrss__ccttuupp, and rrss__nnttuupp entries are, respec-
tively, the previous tuple, current tuple,  and  next  tuple
visited in this scan.  The scan code caches these because it
was originally thought that  scans  would  change  direction
frequently.   In fact, that is not the case, and maintaining
the previous and next tuple pointers has turned  out  to  be
pure  overhead.   Getting rid of the previous and next tuple
pointers everywhere that they appear in the  scandesc  would
speed up scan processing.

     The  rrss__ppbbuuff,  rrss__ccbbuuff,  and  rrss__nnbbuuff  entries  are the
buffers on which the previous, current, and next  tuple  are
stored.   These  buffers  are  kept pinned, which guarantees
that they will not be evicted from the shared  buffer  cache
by  any  backend.  Pinning the buffers allows the backend to
reference them directly, without reacquiring  a  pointer  to
the buffer every time data on it is used.  The shared buffer
cache is protected by mutual exclusion.  Some POSTGRES ports
use  System  V semaphores to implement exclusion.  Acquiring
and releasing these  semaphores  is  slow,  so  pinning  the
buffers during a scan improves performance significantly.

     The rrss__ddcc entry is intended to support tuple differenc-
ing, which is not supported in the current  version  of  the
system.

     The  rrss__mmppttiidd,  rrss__mmccttiidd,  and rrss__mmnnttiidd entries support
marking of positions in scans.  While processing  some  join
strategies,  the  executor  marks  a location so that it can
return to it later.  When such a mark is made, the  tids  of
the  rrss__ppttuupp,  rrss__ccttuupp, and rrss__nnttuupp tuples are copied to the
rrss__mmppttiidd, rrss__mmccttiidd, and rrss__mmnnttiidd entries.

     The rrss__mmccdd entry is intended to support  tuple  differ-
encing, and is not used in the current system.

     The  rrss__aatteenndd  entry  indicates whether the scan should
begin at the end of the relation.  Some attempt is  made  to
store  a  sensible  value here, but it is ignored by much of
the code and should not be relied on.

     The rrss__ttrr entry is a time range or snapshot time quali-
fication that indicates what historical tuples are of inter-
est.  The special constant value NNoowwTTiimmeeQQuuaall indicates  that
only current data is of interest.

     The  entry  rrss__ccddeellttaa is intended to support tuple dif-
ferencing, and is not used at present.


                             2255


     The structure entry rs_parallel_ok was added to support
parallelization of POSTGRES for shared-memory architectures
by Wei Hong, whose doctoral dissertation was on that topic.
This entry is no longer used.

     The entry rs_nkeys stores the number of ScanKeyEntryData
structures that are stored in the rs_key entry that follows.

     The rs_key entry stores a variable-length vector of
ScanKeyEntryData structures, one for each scan key that is to
be applied to the scan.
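
     A count followed by a variable-length vector at the end of a
structure implies that the descriptor and its keys live in a single
allocation.  The sketch below shows one way to build such an object
in standard C, using a C99 flexible array member.  This is an
assumption made for the sake of a portable example: the names
(SketchKeyEntry, SketchScanDesc, sketch_scan_alloc) are hypothetical,
and malloc stands in for the POSTGRES allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much reduced scan key entry. */
typedef struct SketchKeyEntry {
    unsigned short flags;
    short          attno;
    long           argument;
} SketchKeyEntry;

/* Fixed header followed by nkeys entries in the same allocation. */
typedef struct SketchScanDesc {
    unsigned short nkeys;
    SketchKeyEntry key[];   /* variable length array at end of struct */
} SketchScanDesc;

/* Allocate a descriptor large enough for nkeys entries and copy them in. */
static SketchScanDesc *sketch_scan_alloc(unsigned short nkeys,
                                         const SketchKeyEntry *keys)
{
    size_t size = sizeof(SketchScanDesc) + nkeys * sizeof(SketchKeyEntry);
    SketchScanDesc *scan = (SketchScanDesc *) malloc(size);

    if (scan == NULL)
        return NULL;
    scan->nkeys = nkeys;
    if (nkeys > 0) {
        assert(keys != NULL);
        memcpy(scan->key, keys, nkeys * sizeof(SketchKeyEntry));
    }
    return scan;
}
```

Because the keys are copied into the descriptor, the caller's key
array need not outlive the call, and a single free releases both
the header and the keys.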

11..33..33..22..  TThhee IInnddeexx SSccaann DDeessccrriippttoorr

     The declaration for IInnddeexxSSccaannDDeessccDDaattaa is

     ttyyppeeddeeff ssttrruucctt IInnddeexxSSccaannDDeessccDDaattaa {{
          RReellaattiioonn       rreellaattiioonn;;
          PPooiinntteerr        ooppaaqquuee;;
          IItteemmPPooiinntteerrDDaattaa     pprreevviioouussIItteemmDDaattaa;;
          IItteemmPPooiinntteerrDDaattaa     ccuurrrreennttIItteemmDDaattaa;;
          IItteemmPPooiinntteerrDDaattaa     nneexxttIItteemmDDaattaa;;
          MMaarrkkDDaattaa       pprreevviioouussMMaarrkkDDaattaa;;
          MMaarrkkDDaattaa       ccuurrrreennttMMaarrkkDDaattaa;;
          MMaarrkkDDaattaa       nneexxttMMaarrkkDDaattaa;;
          uuiinntt88               ffllaaggss;;
          BBoooolleeaann        ssccaannFFrroommEEnndd;;
          uuiinntt1166         nnuummbbeerrOOffKKeeyyss;;
          SSccaannKKeeyyDDaattaa    kkeeyyDDaattaa;;
          //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY AATT EENNDD OOFF SSTTRRUUCCTT **//
     }} IInnddeexxSSccaannDDeessccDDaattaa;;

     ttyyppeeddeeff IInnddeexxSSccaannDDeessccDDaattaa     **IInnddeexxSSccaannDDeesscc;;


     The  rreellaattiioonn entry points at the reldesc for the rela-
tion being scanned.

     The ooppaaqquuee entry is  for  use  by  the  indexed  access
method, and is assigned no meaning by higher-level scan pro-
cessing code.  The btree code uses it to store a pointer  to
a  BBTTSSccaannOOppaaqquuee  structure, defined in aacccceessss//nnbbttrreeee..hh.  The
BBTTSSccaannOOppaaqquuee structure stores, among other things, the  cur-
rent  buffer  in use by the scan.  This is equivalent to the
rrss__ccbbuuff entry of the HHeeaappSSccaannDDeessccDDaattaa structure.

     The pprreevviioouussIItteemmDDaattaa, ccuurrrreennttIItteemmDDaattaa, and nneexxttIItteemmDDaattaa
entries  store  the  tids of the previous, current, and next
index tuples (_n_o_t heap tuples!)  returned by the  scan.   As
was the case for heap scans, it was originally believed that
index scans would change  direction  frequently,  making  it


                             2266


useful  to cache these values.  In fact, this never happens,
and the previous and next tuple tids could be  removed  from
this structure.

     The pprreevviioouussMMaarrkkDDaattaa, ccuurrrreennttMMaarrkkDDaattaa, and nneexxttMMaarrkkDDaattaa
entries serve the same purpose as  rrss__mmppttiidd,  rrss__mmccttiidd,  and
rrss__mmnnttiidd in the HHeeaappSSccaannDDeessccDDaattaa structure.

     The  ffllaaggss entry is used to encode the direction of the
scan (forwards, backwards, or no movement).   The  constants
for  these  three  values are declared in aacccceessss//ssddiirr..hh, and
are BBaacckkwwaarrddSSccaannDDiirreeccttiioonn, NNooMMoovveemmeennttSSccaannDDiirreeccttiioonn, and FFoorr--
wwaarrddSSccaannDDiirreeccttiioonn.

     The  ssccaannFFrroommEEnndd  flag is _t_r_u_e if the scan should begin
at an endpoint, and false otherwise.  For  example,  in  the
btree  code,  a  scan  for all values greater than 500 could
begin at 500 and move forward, or at the end of the relation
and move backward.

     The  nnuummbbeerrOOffKKeeyyss  entry  records  the  number  of keys
stored in the kkeeyyDDaattaa entry that  follows.  Currently,  nnuumm--
bbeerrOOffKKeeyyss  is  hardcoded to 1 in many places since multi-key
indexing is not supported at this time.

     The kkeeyyDDaattaa entry stores a vector  of  SSccaannKKeeyyEEnnttrryyDDaattaa
structures that describe the scan key in use on the index.

11..44..  TThhee PPOOSSTTGGRREESS AAcccceessss MMeetthhoodd IInntteerrffaaccee

     This  section  describes  the programmatic interface to
the  POSTGRES  access  methods.    The   previous   sections
described  the data structures that are important when using
the access methods.  Here, the structures are filled in  and
used to do real work.

     The prototypes for the functions described in this sec-
tion are not well-managed in the current version of the sys-
tem.   Some routines are not prototyped at all, and the pro-
totypes that do exist are not ANSI.  The header files in the
directory aacccceessss are where the prototypes appear.

     The basic operations covered here are

  o  opening and closing relations,

  o  using scans to fetch tuples,

  o  fetching particular attributes of a tuple, and

  o  inserting, deleting, and replacing tuples.


                             2277


     There are two classes of interfaces.  The routines
whose names begin with heap_ operate on heap relations.  The
routines whose names begin with index_ work on index relations.
The index_ routines use the pg_am tuple for the
access method to call the routine that does the work for a
particular access method.  This dispatch happens transparently
to the user.
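
     Dispatch of this kind is commonly built from a table of
function pointers.  The sketch below is a simplified model of the
idea, not the POSTGRES implementation: a small in-memory table
stands in for the pg_am catalog, and all names (SketchAccessMethod,
sketch_am_table, sketch_index_insert, and the two per-method
routines) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature shared by every access method's insert routine. */
typedef int (*SketchInsertProc)(int key);

/* Hypothetical stand-in for one access method's catalog entry. */
typedef struct SketchAccessMethod {
    const char       *amname;
    SketchInsertProc  aminsert;
} SketchAccessMethod;

/* Two made-up methods; the return values only identify who ran. */
static int sketch_btree_insert(int key) { return key + 1000; }
static int sketch_other_insert(int key) { return key + 2000; }

static const SketchAccessMethod sketch_am_table[] = {
    { "btree", sketch_btree_insert },
    { "other", sketch_other_insert },
};

/*
 * Generic entry point: find the access method by name and call its
 * routine.  Returns -1 if the access method is unknown.
 */
static int sketch_index_insert(const char *amname, int key)
{
    size_t i;
    size_t n = sizeof(sketch_am_table) / sizeof(sketch_am_table[0]);

    assert(amname != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(sketch_am_table[i].amname, amname) == 0)
            return sketch_am_table[i].aminsert(key);
    }
    return -1;
}
```

The caller names only the generic routine; adding an access method
means adding a table entry, with no change to the callers.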

11..44..11..  MMaannaaggiinngg tthhee SSyysstteemm CCaattaallooggss

     Because  backend  code frequently scans and updates the
system catalogs, and because these catalogs often have  spe-
cial  indices  defined  on  them,  POSTGRES includes special
interfaces for dealing with some system catalogs.  This sec-
tion describes those interfaces.

11..44..11..11..  TThhee SSyysstteemm CCaacchhee

     Each  POSTGRES  backend maintains, in private memory, a
cache of system catalog tuples that are frequently  required
during  query  processing.  Cache consistency is provided by
some extremely complicated invalidation code  that  runs  at
transaction  boundaries.   This cache, called the _s_y_s _c_a_c_h_e,
actually consists of a number of different caches that  sup-
port  fast  lookup  of catalog tuples by various attributes.
The cache  management  code  appears  in  the  source  files
uuttiillss//ccaacchhee//ssyyssccaacchhee..cc  and  uuttiillss//ccaacchhee//ccaattccaacchhee..cc, and the
cache  invalidation  code  is  in  ssttoorraaggee//iippcc//ssiinnvvaall..cc  and
uuttiillss//ccaacchhee//iinnvvaall..cc.

     The primary interface for using the sys cache is

     HHeeaappTTuuppllee
     SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((iinntt ccaacchheeiidd,, cchhaarr **kkeeyy11,, cchhaarr **kkeeyy22,,
                   cchhaarr **kkeeyy33,, cchhaarr **kkeeyy44))

The ccaacchheeiidd argument is the sys cache of interest.  The sys-
tem currently supports the following caches:


                             2288


The UUSSEESSYYSSIIDD cache, for example, allows fast lookup of users
in ppgg__uusseerr by the uusseessyyssiidd attribute.  To do such a lookup,

     HHeeaappTTuuppllee ppgg__uusseerr__ttuupp;;

     ppgg__uusseerr__ttuupp == SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((UUSSEESSYYSSIIDD,, 11880066,, 00,, 00,, 00));;

If  any tuple with uusseessyyssiidd equal to 1806 exists in ppgg__uusseerr,
it is loaded into the cache if necessary, and a  pointer  to
it is returned.
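
     The "loaded into the cache if necessary" behavior can be
modelled in a few lines.  The sketch below is only an illustration
under stated assumptions: an in-memory array stands in for the
catalog relation, the user names are made up, there is no
invalidation, and every name (SketchUserTuple, sketch_catalog,
sketch_search_cache, sketch_catalog_scans) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical catalog tuple. */
typedef struct SketchUserTuple {
    int         usesysid;
    const char *usename;
} SketchUserTuple;

/* Stand-in for the catalog relation; the contents are made up. */
static SketchUserTuple sketch_catalog[] = {
    { 6,    "user_a" },
    { 1806, "user_b" },
};

#define SKETCH_NCATALOG (sizeof(sketch_catalog) / sizeof(sketch_catalog[0]))
#define SKETCH_NSLOTS   8

/* One cached tuple pointer per hash slot; NULL means empty. */
static SketchUserTuple *sketch_cache[SKETCH_NSLOTS];

/* Counts how often the catalog itself had to be scanned. */
static int sketch_catalog_scans = 0;

/* Return the tuple for usesysid, loading it into the cache on a miss. */
static SketchUserTuple *sketch_search_cache(int usesysid)
{
    size_t slot = (unsigned) usesysid % SKETCH_NSLOTS;
    size_t i;

    /* Cache hit: no catalog access is needed. */
    if (sketch_cache[slot] != NULL &&
        sketch_cache[slot]->usesysid == usesysid)
        return sketch_cache[slot];

    /* Cache miss: scan the catalog and remember the result. */
    sketch_catalog_scans++;
    for (i = 0; i < SKETCH_NCATALOG; i++) {
        if (sketch_catalog[i].usesysid == usesysid) {
            sketch_cache[slot] = &sketch_catalog[i];
            return sketch_cache[slot];
        }
    }
    return NULL;    /* no such tuple */
}
```

The second lookup of the same key is answered from the cache, which
is the saving the sys cache provides for frequently used catalog
tuples.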

11..44..11..22..  MMaaiinnttaaiinniinngg SSyysstteemm CCaattaalloogg IInnddiicceess

     Since  system  catalog indices are critical to the fast
and correct functioning of POSTGRES,  several  support  rou-
tines  have  been defined to open, update, and close indices
on a given catalog when it is updated.  An  example  of  the
use of these routines appears in the file ccaattaalloogg//hheeaapp..cc, in
the routine AAddddNNeewwAAttttrriibbuutteeTTuupplleess.  The routines are defined
in  ccaattaalloogg//iinnddeexxiinngg..cc.  These rouintes open all the indices
on some catalog relation, insert index tuples into  all  the
indices, and close the indices.  The interfaces are

     vvooiidd
     CCaattaallooggOOppeennIInnddiicceess((iinntt nnIInnddiicceess,, cchhaarr ****nnaammeess,,
                  RReellaattiioonn **iinndd__rreellnnss))

     vvooiidd
     CCaattaallooggIInnddeexxIInnsseerrtt((RReellaattiioonn **iinndd__rreellnnss,, iinntt nnIInnddiicceess,,
                  RReellaattiioonn hheeaapp__rreellnn,, HHeeaappTTuuppllee hhttuupp))

     vvooiidd
     CCaattaallooggCClloosseeIInnddiicceess((iinntt nnIInnddiicceess,, RReellaattiioonn **iinndd__rreellnnss))

For  CCaattaallooggOOppeennIInnddiicceess,, the nnIInnddiicceess argument is the number
of indices on the catalog; constants for  all  catalogs  are
defined in ccaattaalloogg//iinnddeexxiinngg..hh.  The nnaammeess vector is an array
of names of indices on  the  catalog;  constants  for  these
arrays  are  provided  in  the  same  header file.  Finally,
iinndd__rreellnnss is a vector of  index  reldesc  pointers  that  is
large  enough  to  hold  nnIInnddiicceess reldesc pointers.  This is
filled in by CCaattaallooggOOppeennIInnddiicceess.

     For CCaattaallooggIInnddeexxIInnsseerrtt, the iinndd__rreellnnss argument  is  the
reldesc  vector returned by CCaattaallooggOOppeennIInnddiicceess.  NNIInnddiicceess is
the number of indices on the catalog.  The  hheeaapp__rreellnn  argu-
ment  is the catalog relation being updated, and hhttuupp is the
new tuple inserted into the catalog.   When  CCaattaallooggIInnddeexxIInn--
sseerrtt  returns, all of the indices on the system catalog will
have been updated with index tuples pointing at the new heap
tuple.


                             2299


     For  CCaattaallooggCClloosseeIInnddiicceess,  nnIInnddiicceess  is  the  number of
indices to close, and iinndd__rreellnnss is the reldesc  vector  from
CCaattaallooggOOppeennIInnddiicceess.

11..44..22..  OOppeenniinngg aanndd CClloossiinngg RReellaattiioonnss

     Relations may be opened by relid (object ID of the cor-
responding ppgg__ccllaassss tuple) or by name.  The interfaces are

     RReellaattiioonn
     hheeaapp__ooppeenn((OObbjjeeccttIIdd rreelliidd))

     RReellaattiioonn
     hheeaapp__ooppeennrr((NNaammee rreellnnaammee))

     RReellaattiioonn
     iinnddeexx__ooppeenn((OObbjjeeccttIIdd rreelliidd))

     RReellaattiioonn
     iinnddeexx__ooppeennrr((NNaammee rreellnnaammee))

The first two open the requested heap relation if it exists,
and  return  the reldesc for it.  The second two do the same
for index relations.  If the named relations do  not  exist,
the  routines  will  report a NNOOTTIICCEE-level eelloogg message, but
will not abort.  Instead, they return a NULL reldesc.

     In order to close a relation,

     vvooiidd
     hheeaapp__cclloossee((RReellaattiioonn rreellddeesscc))

     vvooiidd
     iinnddeexx__cclloossee((RReellaattiioonn rreellddeesscc))


11..44..33..  UUssiinngg SSccaannss

     Once a relation has been opened, a scan may be  set  up
on it to find tuples of interest.

11..44..33..11..  BBeeggiinnnniinngg aanndd EEnnddiinngg SSccaannss

     To set up a scan key,

     vvooiidd
     SSccaannKKeeyyEEnnttrryyIInniittiiaalliizzee((SSccaannKKeeyyEEnnttrryy eennttrryy,, bbiittss1166 ffllaaggss,,
                      AAttttrriibbuutteeNNuummbbeerr aattttrriibbuutteeNNuummbbeerr,,
                      RReeggPPrroocceedduurree pprroocceedduurree,, DDaattuumm aarrgguummeenntt))

As  many  scan  key  entries as desired may be defined for a
single scan.  These should  be  allocated  as  an  array  of


                             3300


SSccaannKKeeyyEEnnttrryyDDaattaa structures, as shown in section 1.3.2.

     Once a set of scan keys is initialized, the routines

     HeapScanDesc
     heap_beginscan(Relation reldesc, bool atend,
                 TimeQual timeQual, unsigned nkeys, ScanKey key)

     IndexScanDesc
     index_beginscan(Relation reldesc, bool scanFromEnd,
                  uint16 numberOfKeys, ScanKey key)

will begin scans using the keys.  The keys should be in the
array beginning at address key.

     To end an open scan,

     void
     heap_endscan(HeapScanDesc scan)

     void
     index_endscan(IndexScanDesc scan)

The scans may no longer be used once the endscan routine has
been called on them.

11..44..33..22..  FFeettcchhiinngg QQuuaalliiffyyiinngg TTuupplleess

     Once  a scan has been initialized by the bbeeggiinnssccaann rou-
tines, qualifying tuples may be fetched from the relation by
calling the routines

     HHeeaappTTuuppllee
     hheeaapp__ggeettnneexxtt((HHeeaappSSccaannDDeesscc ssccaann,, iinntt bbaacckkww,, BBuuffffeerr **bb))

     RReettrriieevveeIInnddeexxRReessuulltt
     iinnddeexx__ggeettnneexxtt((IInnddeexxSSccaannDDeesscc ssccaann,, SSccaannDDiirreeccttiioonn ddiirreeccttiioonn))

In  both cases, the ssccaann argument is the scan to use to find
qualifying tuples.  For heap scans, if  bbaacckkww  is  non-zero,
then the scan moves backwards; otherwise, it moves forwards.
There is never a reason to set the bbaacckkww  flag.   For  index
scans, ddiirreeccttiioonn may be one of FFoorrwwaarrddSSccaannDDiirreeccttiioonn, NNooMMoovvee--
mmeennttSSccaannDDiirreeccttiioonn, or BBaacckkwwaarrddSSccaannDDiirreeccttiioonn.  FFoorrwwaarrddSSccaannDDii--
rreeccttiioonn is commonly used, and the others may not work.

     The  purpose  of the bb argument for the heap routine is
described below.


                             3311


11..44..33..33..  BBuuffffeerr MMaannaaggeemmeenntt

     POSTGRES  manages  a  shared  cache  of   recently-used
buffers.   This  cache  is  available to all of the backends
that are running concurrently.  The shared cache allows some
backends  to  take  advantage  of  work done by others.  For
example, pages from the system catalog are  typically  moved
into the cache by one backend and read by many others.

     When  a  given  backend  has  a pointer into one of the
pages in the shared buffer cache, that buffer is _p_i_n_n_e_d.   A
pinned  buffer  cannot  be evicted from the cache.  When the
pointer is dropped, the buffer may be  safely  unpinned  and
evicted  from  the cache, and the space that it occupied may
be used by another page from a database.
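
     The relationship between pinning and eviction can be modelled
with a per-buffer reference count.  The sketch below is a
single-process illustration only: it omits the shared memory and
mutual exclusion that the real buffer cache requires, and all names
(SketchBufferSlot, sketch_pool, sketch_pin, sketch_release,
sketch_evict) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUFFERS 4

/* Hypothetical buffer slot: which page it holds and who is using it. */
typedef struct SketchBufferSlot {
    int page;        /* page number held, or -1 if the slot is empty */
    int pin_count;   /* number of outstanding references */
} SketchBufferSlot;

static SketchBufferSlot sketch_pool[SKETCH_NBUFFERS] = {
    { -1, 0 }, { -1, 0 }, { -1, 0 }, { -1, 0 }
};

/* Take a reference: the buffer cannot be evicted while it is held. */
static void sketch_pin(int buf)
{
    assert(buf >= 0 && buf < SKETCH_NBUFFERS);
    sketch_pool[buf].pin_count++;
}

/* Drop a reference taken earlier with sketch_pin. */
static void sketch_release(int buf)
{
    assert(buf >= 0 && buf < SKETCH_NBUFFERS);
    assert(sketch_pool[buf].pin_count > 0);
    sketch_pool[buf].pin_count--;
}

/* Try to evict the page; refuse while any reference remains. */
static bool sketch_evict(int buf)
{
    assert(buf >= 0 && buf < SKETCH_NBUFFERS);
    if (sketch_pool[buf].pin_count > 0)
        return false;
    sketch_pool[buf].page = -1;
    return true;
}
```

Once the last reference is released, nothing prevents the slot from
being reused for a different page, so a pointer kept past that point
may refer to unrelated data.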

     When scanning tuples in a relation, the user may choose
to examine the tuple directly on the page, or to copy the
tuple to private space and to examine the copy.  The benefit
of using the tuple on the page is that no extra copy is
required.  The downside is that the user must manage the scan
more carefully.  In particular, the user must unpin the
buffer once the pointer into it is no longer needed.

     The interface to fetch tuples from a heap scan is

     HeapTuple
     heap_getnext(HeapScanDesc scan, int backw, Buffer *b)

The b argument is a pointer to a value of type Buffer.
Buffer is basically an integer, which is the number of the
buffer in the shared buffer cache.

  +o  If the bb argument to hheeaapp__ggeettnneexxtt  is  not  ((BBuuffffeerr  **))
     NNUULLLL,  then  the  buffer  number  on which the returned
     tuple appears will be copied to the BBuuffffeerr value that bb
     points to.

  +o  If  the  bb  argument  is NULL, then memory is allocated
     (via ppaalllloocc4) and the tuple is  copied  from  the  data
     page to the allocated memory.

     The  following  code  fragments  show  how to use the bb
argument in both cases.  The first example  uses  the  tuple
directly on the data page.

____________________
   4 PPaalllloocc is the POSTGRES memory allocator.  PPaalllloocc  allo-
cates  memory  from  memory  pools,  some  of  which perform
garbage collection at transaction end.


                             3322


     HHeeaappTTuuppllee hhttuupp;;
     BBuuffffeerr bb;;

     ......
     hhttuupp == hheeaapp__ggeettnneexxtt((hhssccaann,, 00,, &&bb));;

     //** ...... pprroocceessss tthhee ttuuppllee ...... **//

     //** aallll ddoonnee **//
     RReelleeaasseeBBuuffffeerr((bb));;

When  hheeaapp__ggeettnneexxtt  returns, bb is the buffer number on which
the tuple appears, and the _p_i_n _c_o_u_n_t on bb  has  been  incre-
mented to account for this new reference.  After the call to
RReelleeaasseeBBuuffffeerr, the tuple pointed to by hhttuupp may no longer be
used.  Some of the most insidious bugs in the project's his-
tory were from users who unpinned buffers, but continued  to
use  tuples  on them.  The buffer would be paged out at some
later time by another process, and the  user  would  have  a
pointer into some random location of another data page.

     The second example shows how to use a copy of the tuple
that does not reside on the page.

     HHeeaappTTuuppllee hhttuupp;;

     ......
     hhttuupp == hheeaapp__ggeettnneexxtt((hhssccaann,, 00,, ((BBuuffffeerr **)) NNUULLLL));;

     //** ...... pprroocceessss tthhee ttuuppllee ...... **//

     //** aallll ddoonnee **//
     ppffrreeee((hhttuupp));;

In this case, the user should release the  memory  allocated
to the tuple before returning.  Allocated memory is automat-
ically freed at transaction boundaries, but relying on  this
feature  causes  memory  leaks that make POSTGRES run slowly
and consume large amounts of system memory.

11..44..33..44..  UUssiinngg aann IInnddeexx SSccaann

     Buffer management is  not  required  for  index  scans.
Data of interest are copied to a special structure, called a
RReettrriieevveeIInnddeexxRReessuulltt.  A RReettrriieevveeIInnddeexxRReessuulltt includes the tid
of  the  index  tuple  and the tid of the heap tuple that it
references.  To use these,

     RReettrriieevveeIInnddeexxRReessuulltt rreess;;
     RReellaattiioonn hheeaapp__rreellnn,, iinnddeexx__rreellnn;;
     HHeeaappTTuuppllee hhttuupp;;
     IItteemmPPooiinntteerr hheeaapp__ttiidd;;


                             3333


     BBuuffffeerr bb;;

     ......
     iinnddssccaann == iinnddeexx__bbeeggiinnssccaann((iinnddeexx__rreellnn,, ......));;
     rreess == iinnddeexx__ggeettnneexxtt((iinnddssccaann,, FFoorrwwaarrddSSccaannDDiirreeccttiioonn));;
     iiff ((rreess)) {{
          hheeaapp__ttiidd == RReettrriieevveeIInnddeexxRReessuullttGGeettHHeeaappIItteemmPPooiinntteerr((rreess));;
          hhttuupp == hheeaapp__ffeettcchh((hheeaapp__rreellnn,, NNoowwTTiimmeeQQuuaall,, hheeaapp__ttiidd,, &&bb));;
          iiff ((HHeeaappTTuupplleeIIssVVaalliidd((hhttuupp)))) {{

               //** ...... pprroocceessss tthhee hheeaapp ttuuppllee ...... **//

               //** aallll ddoonnee **//
               RReelleeaasseeBBuuffffeerr((bb));;
          }}
     }}

The index relation requires no explicit buffer management.
The heap tid of interest is extracted from the
RetrieveIndexResult, and is used to fetch the heap tuple itself.  The
heap_fetch interface fetches a heap tuple by tid; its declaration
is

     HeapTuple
     heap_fetch(Relation relation, TimeQual timeQual,
                ItemPointer tid, Buffer *b)

The description of buffer management for heap scans applies
to the heap_fetch interface, as well.

     If  the  resulting heap tuple does not satisfy the time
qualification, then hheeaapp__ffeettcchh will return NULL.  Since  the
index  may contain old data until the vacuum cleaner is run,
this is possible, so the user must check  the  return  value
from hheeaapp__ffeettcchh.

     If there are no more index tuples that match the
qualification, iinnddeexx__ggeettnneexxtt will return NULL.  At that point,
the scan must be ended using iinnddeexx__eennddssccaann.  If
iinnddeexx__eennddssccaann is not called, any subsequent calls to
iinnddeexx__ggeettnneexxtt on the scan will reset it to the beginning of
the qualifying index tuples, and they will all be returned
again.  This behavior is not an intended feature, and callers
should not rely on it.
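
     The control flow described above can be sketched with a small,
self-contained model.  Everything in it is hypothetical: ToyHeapTuple,
ToyIndexScan, and the toy_ routines are stand-ins invented for this
illustration, not POSTGRES interfaces, and a "deleted" flag stands in
for failing the time qualification.  The sketch shows only the two
checks a caller must make: stop when the index scan returns NULL, and
skip any index entry whose heap fetch returns NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy heap tuple; "deleted" models failing the time qualification. */
typedef struct ToyHeapTuple {
    int  value;
    bool deleted;
} ToyHeapTuple;

/* Toy index scan: a list of heap slot numbers (tids) and a cursor. */
typedef struct ToyIndexScan {
    const int *tids;
    int        ntids;
    int        next;
} ToyIndexScan;

/* Return the next tid, or NULL when no index entries remain. */
const int *
toy_index_getnext(ToyIndexScan *scan)
{
    if (scan->next >= scan->ntids)
        return NULL;
    return &scan->tids[scan->next++];
}

/* Return the heap tuple at tid, or NULL if it is no longer valid. */
const ToyHeapTuple *
toy_heap_fetch(const ToyHeapTuple *heap, int nslots, int tid)
{
    assert(tid >= 0 && tid < nslots);
    return heap[tid].deleted ? NULL : &heap[tid];
}

/* Sum the values of all valid heap tuples reachable through the index. */
int
toy_sum_via_index(const ToyHeapTuple *heap, int nslots,
                  const int *tids, int ntids)
{
    ToyIndexScan scan = { tids, ntids, 0 };
    const int   *tid;
    int          sum = 0;

    while ((tid = toy_index_getnext(&scan)) != NULL) {
        const ToyHeapTuple *htup = toy_heap_fetch(heap, nslots, *tid);

        if (htup == NULL)
            continue;           /* stale index entry: skip it */
        sum += htup->value;
    }
    /* the real interface would end the scan with index_endscan here */
    return sum;
}
```

     The model omits buffer management entirely; in the real
interface each successful heap fetch is paired with a buffer release.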

11..44..44..  FFeettcchhiinngg AAttttrriibbuutteess

     Once a tuple of interest  has  been  fetched  from  the
heap,  attributes  of  interest may be extracted from it and
used.  The interfaces are

     cchhaarr **


                             3344


     hheeaapp__ggeettaattttrr((HHeeaappTTuuppllee ttuupp,, BBuuffffeerr bb,, iinntt aattttnnuumm,,
                  TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,, bbooooll **iissnnuullll))

     PPooiinntteerr
     iinnddeexx__ggeettaattttrr((IInnddeexxTTuuppllee ttuupp,, iinntt aattttnnuumm,,
                  TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,, bbooooll **iissnnuullll))

In general, users (and even POSTGRES implementors) need  not
extract  index tuple values; only access method implementors
need to do that.  The rest of this section  concentrates  on
the hheeaapp__ggeettaattttrr interface.

     The ttuupp argument points at the tuple of interest.  If bb
is non-null, then it  is  the  buffer  on  which  the  tuple
resides.   If  the  user is doing explicit buffer management
via hheeaapp__ffeettcchh or hheeaapp__ggeettnneexxtt, then the returned buffer may
be passed along to hheeaapp__ggeettaattttrr.

     The  tuple descriptor is actually the rrdd__aatttt entry from
the reldesc for the relation  that  stores  the  tuple.   To
extract the tuple descriptor,

     RReellaattiioonn rreellnn;;
     TTuupplleeDDeessccrriippttoorr ttuuppddeesscc;;

     ......
     ttuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((rreellnn));;


     The  iissnnuullll  argument  is  a pointer to a value of type
bbooooll.  If the corresponding attribute has the database value
null,  then  iissnnuullll will be _t_r_u_e and the return value should
be ignored.

     Although  it  is  declared  to  return  type  cchhaarr   **,
hheeaapp__ggeettaattttrr  actually  returns  type  DDaattuumm,  so the return
value should always be coerced to that  type.   This  is  an
implementation  mistake.   The DDaattuummGGeett_T_y_p_e macros described
in section 1.3.2 convert  values  from  DDaattuumm  to  the  type
required by the user.
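
     The calling convention just described (check the null flag
first, then convert the returned datum) can be illustrated with a toy
model.  ToyDatum, ToyTuple, toy_getattr, and ToyDatumGetInt32 are
hypothetical names made up for this sketch; they only imitate the
shape of the real interface.  The sketch assumes attribute numbers
count from one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t ToyDatum;     /* stand-in for Datum */

/* Toy tuple: parallel arrays of values and null flags. */
typedef struct ToyTuple {
    int             natts;
    const ToyDatum *values;
    const char     *nulls;      /* 'n' marks a null attribute, ' ' otherwise */
} ToyTuple;

/* Hypothetical analogue of a DatumGetType conversion macro. */
#define ToyDatumGetInt32(d) ((int32_t) (d))

/*
 * Fetch attribute attnum (counting from one).  On return, *isnull
 * tells the caller whether the attribute is null; if it is, the
 * returned datum is meaningless and must be ignored.
 */
ToyDatum
toy_getattr(const ToyTuple *tup, int attnum, bool *isnull)
{
    assert(attnum >= 1 && attnum <= tup->natts);
    *isnull = (tup->nulls[attnum - 1] == 'n');
    return *isnull ? (ToyDatum) 0 : tup->values[attnum - 1];
}
```

     A caller would test the flag before converting, for example
`d = toy_getattr(&tup, 1, &isnull); if (!isnull) n = ToyDatumGetInt32(d);`.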

11..44..55..  IInnsseerrttiinngg NNeeww TTuupplleess

     When  a  user  submits  a  POSTQUEL query to insert new
tuples into a heap relation, the POSTGRES executor automati-
cally  updates  all  associated indices with pointers to the
new tuple after it has been inserted.  There  is  no  simple
interface for doing this in general on user relations inside
the POSTGRES backend.  Section 1.4.1.2  describes  a  simple
way  for  doing  this  on system catalogs, but the user must
explicitly code updates to indices on  user  relations  when
new values are inserted into the heap.


                             3355


     The  code  for  the routine CCooppyyFFrroomm in the source file
ccoommmmaannddss//ccooppyy..cc contains the sample code that  was  used  to
produce  this  system.   That file is a reasonable reference
for how to manage tuple insertions correctly.

11..44..55..11..  FFoorrmmiinngg aa HHeeaapp TTuuppllee

     The routine

     HHeeaappTTuuppllee
     hheeaapp__ffoorrmmttuuppllee((AAttttrriibbuutteeNNuummbbeerr nnaattttss,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,,
                    DDaattuumm **vvaalluueess,, cchhaarr **nnuullllss))

creates a heap tuple from the supplied vector of DDaattuumm  val-
ues.   The nnaattttss argument is the number of attributes in the
relation (the rrdd__rreell-->>rreellnnaattttss entry of the  reldesc).   The
ttuuppddeesscc  is  the  rrdd__aatttt  entry of the reldesc, which may be
fetched via RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr.

     Both vvaalluueess and nnuullllss are arrays of length nnaattttss.  VVaall--
uueess  contains  a  datum  for every non-null attribute in the
tuple.  The _T_y_p_eGGeettDDaattuumm macros can be  used  to  initialize
the  entries  in  this  array.  If an attribute is null, its
space is ignored in the array.  For example,  if  the  third
attribute  is null, then the value stored in the third loca-
tion of the vvaalluueess array will be ignored by  hheeaapp__ffoorrmmttuuppllee.

     The  nnuullllss  array  is  a  vector of characters, one per
attribute in the tuple.  If the corresponding attribute  has
the database value null, then the array entry is the charac-
ter _n.  If the corresponding attribute is non-null, then the
array entry is the ASCII blank character.
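
     As a minimal sketch of this encoding, the helper below builds a
nulls vector from an array of booleans.  The name build_nulls is
hypothetical (it is not a POSTGRES routine), and plain malloc stands
in for palloc so that the fragment is self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build a nulls vector in the form described
 * above.  Entry i is 'n' if attribute i is null and an ASCII blank
 * otherwise.  Returns NULL if allocation fails; the caller frees
 * the result.
 */
char *
build_nulls(const bool *attr_is_null, int natts)
{
    char *nulls;
    int   i;

    assert(natts > 0);
    nulls = malloc((size_t) natts);
    if (nulls == NULL)
        return NULL;
    for (i = 0; i < natts; i++)
        nulls[i] = attr_is_null[i] ? 'n' : ' ';
    return nulls;
}
```

     Note that the vector is not a C string: it has exactly one
entry per attribute and no terminating NUL.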

     The heap tuple returned by hheeaapp__ffoorrmmttuuppllee consumes ppaall--
lloocced memory, and should be ppffrreeeed when the heap  and  index
relations have all been updated.

11..44..55..22..  IInnsseerrttiinngg tthhee HHeeaapp TTuuppllee

     To insert the tuple returned by hheeaapp__ffoorrmmttuuppllee,

     OObbjjeeccttIIdd
     hheeaapp__iinnsseerrtt((RReellaattiioonn rreellnn,, HHeeaappTTuuppllee ttuupp,, ddoouubbllee **ooffff))

On  return, the ttuupp-->>tt__ccttiidd entry is the tid of the tuple in
the heap.  The ooffff argument is unnecessary and should  never
be  used.  Its purpose is unclear.  This argument may safely
be C-language NULL.  hheeaapp__iinnsseerrtt returns the  object  ID  of
the  new  tuple  in the heap (the tt__ooiidd entry of ttuupp is also
correctly set).


                             3366


11..44..55..33..  FFiinnddiinngg tthhee IInnddiicceess

     The  procedure  GGeettIInnddeexxRReellaattiioonnss  finds  all  of   the
indices  defined on a heap relation and initializes an array
with the reldescs for all of the indices.  The interface is

     GGeettIInnddeexxRReellaattiioonnss((OObbjjeeccttIIdd hheeaapp__rreelliidd,, iinntt **nn__iinnddiicceess,,
                       RReellaattiioonn ****iinnddeexx__rreellddeessccss));;

The hheeaapp__rreelliidd argument is the rrdd__iidd entry  of  the  reldesc
for the heap.  NN__iinnddiicceess is set, on return, to the number of
indices for the heap relation.  On return, iinnddeexx__rreellddeessccss is
a  pointer  to  an  array  of  reldescs,  one for each index
defined on the heap.

11..44..55..44..  UUppddaattiinngg tthhee IInnddiicceess

     A POSTGRES index may be

  +o  _f_u_n_c_t_i_o_n_a_l, meaning that the index  is  on  the  return
     value  of some function of the heap tuple's attributes,
     rather than on the attributes themselves;

  +o  _p_a_r_t_i_a_l, meaning that a predicate controls  whether  or
     not  particular heap tuples should appear in the index;
     or

  +o  _r_e_g_u_l_a_r, meaning that the  index  is  on  one  or  more
     attributes of all tuples in the heap.

     Functional indices are used for some system catalogs,
since the current version of POSTGRES does not support
multi-column btree indices.  Partial indices are very seldom
used.  Regular indices are by far the most common.  The rest
of this section ignores functional and partial indices, and
concentrates on regular ones.  For sample code that updates
functional and partial indices, see the routine CCooppyyFFrroomm in
ccoommmmaannddss//ccooppyy..cc.  Special interfaces, described in section
1.4.1.2, have been defined to update the functional and
regular indices on system catalogs, so the user need not
manage these by hand.

     After  calling GGeettIInnddeexxRReellaattiioonnss, the user has the num-
ber of indices and all of the index reldescs for the indices
defined  on the heap.  Updating the indices requires forming
an index tuple for each index and inserting it.  The follow-
ing code fragment gives an example.

     RReellaattiioonn hheeaapp__rreellnn;;
     RReellaattiioonn **iinnddeexx__rreellnnss;;
     iinntt nn__iinnddiicceess;;
     DDaattuumm iiddaattuumm;;


                             3377


     HHeeaappTTuuppllee hhttuupp;;
     IInnddeexxTTuuppllee iittuupp;;
     HHeeaappTTuuppllee ppgg__iinnddeexxttuupp;;
     FFoorrmm__ppgg__iinnddeexx ppggiiffoorrmm;;
     TTuupplleeDDeessccrriippttoorr ttuuppddeesscc;;
     TTuupplleeDDeessccrriippttoorr iittuuppddeesscc;;
     cchhaarr **nnuullllss;;
     iinntt ii;;

     ......
     ttuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((hheeaapp__rreellnn));;

     //** ggeett ssppaaccee ffoorr nnuullll ffllaaggss ---- tthhiiss iiss mmoorree tthhaann wwee nneeeedd **//
     nnuullllss == ppaalllloocc((hheeaapp__rreellnn-->>rrdd__rreell-->>rreellnnaattttss));;
     ffoorr ((ii == 00;; ii << hheeaapp__rreellnn-->>rrdd__rreell-->>rreellnnaattttss;; ii++++))
          nnuullllss[[ii]] == '' '';;

     //** ...... iinnsseerrtt tthhee hheeaapp ttuuppllee,, wwhhiicchh sseettss hhttuupp-->>tt__ccttiidd ...... **//

     //** uuppddaattee eeaacchh iinnddeexx **//
     ffoorr ((ii == 00;; ii << nn__iinnddiicceess;; ii++++)) {{
          //** ggeett tthhee ppgg__iinnddeexx ttuuppllee ffoorr tthhiiss iinnddeexx **//
          ppgg__iinnddeexxttuupp == SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((IINNDDEEXXRREELLIIDD,,
                                            iinnddeexx__rreellnnss[[ii]]-->>rrdd__iidd,,
                                            NNUULLLL,, NNUULLLL,, NNUULLLL));;
          ppggiiffoorrmm == ((FFoorrmm__ppgg__iinnddeexx)) GGEETTSSTTRRUUCCTT((ppgg__iinnddeexxttuupp));;

          //** ffoorrmm aa ddaattuumm ttoo uussee ffoorr tthhee iinnddeexx ttuuppllee **//
          FFoorrmmIInnddeexxDDaattuumm((iinnddeexx__rreellnnss[[ii]]-->>rrdd__rreell-->>rreellnnaattttss,,
                      ((AAttttrriibbuutteeNNuummbbeerr **)) &&ppggiiffoorrmm-->>iinnddkkeeyy..ddaattaa[[00]],,
                      hhttuupp,, ttuuppddeesscc,, IInnvvaalliiddBBuuffffeerr,,
                      &&iiddaattuumm,, nnuullllss,, NNUULLLL));;

          //** ffoorrmm tthhee iinnddeexx ttuuppllee **//
          iittuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((iinnddeexx__rreellnnss[[ii]]));;
          iittuupp == iinnddeexx__ffoorrmmttuuppllee((11,, iittuuppddeesscc,, &&iiddaattuumm,, nnuullllss));;

          //** mmaakkee iitt ppooiinntt aatt tthhee hheeaapp ttuuppllee **//
          iittuupp-->>tt__ttiidd == hhttuupp-->>tt__ccttiidd;;

          //** iinnsseerrtt iitt **//
          ((vvooiidd)) iinnddeexx__iinnsseerrtt((iinnddeexx__rreellnnss[[ii]],, iittuupp,, NNUULLLL,, NNUULLLL));;

          ppffrreeee((iittuupp));;
     }}

     ppffrreeee((nnuullllss));;

The loop iterates over all the indices defined on the heap,
updating each in turn.  For each index, it fetches the
corresponding ppgg__iinnddeexx tuple from the cache of system tuples,
extracts the FFoorrmm__ppgg__iinnddeexx structure from it, and then forms
a DDaattuumm with the appropriate value for the index tuple.
Finally, an index tuple is formed and inserted into the index.

11..44..66..  DDeelleettiinngg TTuupplleess

     Compared  to  inserting tuples, deleting tuples is sim-
ple.  Users never delete tuples from indices; this  is  done
only  by the vacuum cleaner.  Deleting a heap tuple requires
the tid of  the  tuple  to  be  deleted.   The  tid  may  be
extracted  from  the  heap tuple itself, which can be found,
for example, by a scan of the heap relation.  Given the tid,
the interface for deleting heap tuples is

     RRuulleeLLoocckk
     hheeaapp__ddeelleettee((RReellaattiioonn rreellnn,, IItteemmPPooiinntteerr ttiidd))

The  RRuulleeLLoocckk  return  value from hheeaapp__ddeelleettee was originally
intended to support the POSTGRES rule system, but the design
of  the  rule  system  changed,  so  this value is not used.
HHeeaapp__ddeelleettee always returns NULL.

11..44..77..  RReeppllaacciinngg EExxiissttiinngg TTuupplleess

     When a heap tuple in POSTGRES is replaced, it is marked
as  deleted and a new tuple is inserted in the relation with
new values.  The new tuple has the same  object  ID  as  the
original  tuple, but will be stored at a new location and so
will have a new tid.
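
     The effect of replacement on object IDs and tids can be seen
in a toy model of a heap.  The structures and routines below
(ToyTuple, ToyHeap, toy_insert, toy_replace) are invented for this
illustration and are not POSTGRES code; a slot number plays the role
of the tid.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_HEAP_SIZE 16

typedef struct ToyTuple {
    unsigned oid;       /* object ID: preserved across replacement */
    int      salary;    /* the single user attribute */
    bool     deleted;   /* set when the tuple is deleted or replaced */
} ToyTuple;

typedef struct ToyHeap {
    ToyTuple slots[TOY_HEAP_SIZE];
    int      nused;     /* slots[0..nused-1] are occupied */
    unsigned next_oid;  /* next object ID to hand out */
} ToyHeap;

/* Store a tuple in the next free slot and return its slot number. */
static int
toy_append(ToyHeap *heap, unsigned oid, int salary)
{
    int tid;

    assert(heap->nused < TOY_HEAP_SIZE);
    tid = heap->nused++;
    heap->slots[tid].oid = oid;
    heap->slots[tid].salary = salary;
    heap->slots[tid].deleted = false;
    return tid;
}

/* Insert a brand-new tuple with a fresh object ID. */
int
toy_insert(ToyHeap *heap, int salary)
{
    return toy_append(heap, heap->next_oid++, salary);
}

/*
 * Replace the tuple at old_tid: mark it deleted, then store a new
 * tuple carrying the same object ID at a new location.
 */
int
toy_replace(ToyHeap *heap, int old_tid, int new_salary)
{
    assert(old_tid >= 0 && old_tid < heap->nused);
    assert(!heap->slots[old_tid].deleted);
    heap->slots[old_tid].deleted = true;
    return toy_append(heap, heap->slots[old_tid].oid, new_salary);
}
```

     The old tuple stays in place, marked deleted, which is why index
entries that still point at it go stale rather than being updated.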

     Index tuples are never replaced.  If the heap tuples to
which they point have been replaced, the vacuum cleaner
deletes the stale index tuples from the index.  No user action
is required for that cleanup.  However, when a heap tuple is
replaced, every index defined on its relation must have a new
index tuple inserted that points at the new heap tuple.  The
code for doing this is identical to the code for inserting
index tuples that point at new heap tuples, and will not be
repeated here.

     Replacing a tuple consists of two steps.  First, a  new
tuple  is formed, based on the old one, but with some values
changed.  Second, the old tuple is replaced by the new  one.

     Replacing  a  tuple requires knowledge of the schema of
the relation storing the tuple, since new values for partic-
ular  attributes must be supplied.  The sample code fragment
below assumes the existence  of  a  relation  eemmpp  with  the
schema

     eemmpp((nnaammee == cchhaarr1166,, ssaallaarryy == iinntt44))


                             3399


The  code  sample replaces a particular eemmpp tuple, assigning
the employee a salary of 50,000.  Note that the  code  frag-
ment  does  not update any indices defined on eemmpp; that code
must be added if any such indices exist.

     RReellaattiioonn eemmpprreellnn;;
     HHeeaappTTuuppllee eemmppttuupp;;
     HHeeaappTTuuppllee nneewwttuupp;;
     BBuuffffeerr eemmppbbuuff;;
     DDaattuumm vvaalluuee[[22]];;
     cchhaarr nnuullllss[[22]];;
     cchhaarr rreeppll[[22]];;

     ......
     //** bbyy hheerree,, wwee hhaavvee tthhee eemmpp ttuuppllee ttoo rreeppllaaccee **//
     vvaalluuee[[00]] == ((DDaattuumm)) 00;;
     vvaalluuee[[11]] == IInntt3322GGeettDDaattuumm((5500000000));;   //** nneeww ssaallaarryy **//

     //** nnoo nnuullll aattttrriibbuutteess iinn tthhiiss ttuuppllee **//
     nnuullllss[[00]] == nnuullllss[[11]] == '' '';;

     //** rreeppllaaccee tthhee sseeccoonndd aatttt,, nnoott tthhee ffiirrsstt **//
     rreeppll[[00]] == '' '';;
     rreeppll[[11]] == ''rr'';;

     //** ggeett aa nneeww ttuuppllee bbaasseedd oonn tthhee oolldd oonnee,, wwiitthh tthhee nneeww ssaallaarryy **//
     nneewwttuupp == hheeaapp__mmooddiiffyyttuuppllee((eemmppttuupp,, eemmppbbuuff,, eemmpprreellnn,, vvaalluuee,,
                         nnuullllss,, rreeppll));;

     //** rreeppllaaccee tthhee oolldd ttuuppllee wwiitthh tthhee nneeww oonnee **//
     ((vvooiidd)) hheeaapp__rreeppllaaccee((eemmpprreellnn,, &&eemmppttuupp-->>tt__ccttiidd,, nneewwttuupp));;

The interface for hheeaapp__mmooddiiffyyttuuppllee is

     HHeeaappTTuuppllee
     hheeaapp__mmooddiiffyyttuuppllee((HHeeaappTTuuppllee ttuuppllee,, BBuuffffeerr bbuuffffeerr,,
                RReellaattiioonn rreellaattiioonn,, DDaattuumm **rreeppllVVaalluuee,,
                cchhaarr **rreeppllNNuullll,, cchhaarr **rreeppll))

The ttuuppllee argument is the  tuple  to  replace.   The  bbuuffffeerr
argument  is  the  buffer  in  the  buffer cache on which it
appears, as returned by  hheeaapp__ggeettnneexxtt  or  hheeaapp__ffeettcchh.   The
rreellaattiioonn argument is the relation containing the tuple.

     The next three arguments control what values are stored
in the new tuple.  The first, rreeppllVVaalluuee, is a vector of val-
ues of type DDaattuumm containing one entry for each attribute in
the relation.  If an attribute is not to  be  replaced,  but
rather  should be copied from the old tuple, then the corre-
sponding entry in rreeppllVVaalluuee will be ignored.


                             4400


     The rreeppllNNuullll argument contains one entry per  attribute
in  the  relation.   If the corresponding attribute is to be
replaced, and if the rreeppllNNuullll entry is the character _n, then
the  new  value for that attribute is null.  If the value is
to be replaced and the rreeppllNNuullll entry  is  an  ASCII  blank,
then  the  value stored in rreeppllVVaalluuee is consulted to get the
new value for the attribute.

     Finally, the  rreeppll  argument  contains  one  entry  per
attribute in the relation.  If the rreeppll entry for a particu-
lar attribute is the character _r, then the value  should  be
replaced,  and  the  rreeppllVVaalluuee  and rreeppllNNuullll entries for the
attribute are used to assign a new value.  If the rreeppll entry
is  an ASCII blank, then the value is copied from the origi-
nal tuple.
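
     The interaction of the three vectors can be summarized in a
small model that works on plain arrays instead of tuples.  The names
ToyDatum and toy_modify are hypothetical; the function is a sketch of
the rules stated above, not the implementation of the real routine.

```c
#include <assert.h>

typedef long ToyDatum;          /* stand-in for Datum */

/*
 * Apply the replacement rules to each attribute i:
 *   repl[i] == ' '                       copy the old value and null flag
 *   repl[i] == 'r', replNull[i] == 'n'   the new value is null
 *   repl[i] == 'r', replNull[i] == ' '   the new value is replValue[i]
 * Results are written to newValue and newNull.
 */
void
toy_modify(int natts,
           const ToyDatum *oldValue, const char *oldNull,
           const ToyDatum *replValue, const char *replNull,
           const char *repl,
           ToyDatum *newValue, char *newNull)
{
    int i;

    for (i = 0; i < natts; i++) {
        assert(repl[i] == ' ' || repl[i] == 'r');
        if (repl[i] == ' ') {
            newValue[i] = oldValue[i];
            newNull[i] = oldNull[i];
        } else if (replNull[i] == 'n') {
            newValue[i] = 0;    /* placeholder; the attribute is null */
            newNull[i] = 'n';
        } else {
            newValue[i] = replValue[i];
            newNull[i] = ' ';
        }
    }
}
```

     With repl set to blank for the first attribute and _r for the
second, only the second entry of replValue and replNull is consulted,
which matches the salary example earlier in this section.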

11..44..88..  CCrreeaattiinngg aanndd DDeessttrrooyyiinngg RReellaattiioonnss

     Creating a heap relation requires the user to supply  a
schema  for the new relation.  Creating an index relation is
easier, because the schema of the index is defined in  terms
of the heap.  The two cases are treated separately below.

11..44..88..11..  HHeeaapp RReellaattiioonnss

     Creating  a heap relation requires a schema for the new
relation.  The schema should be a TTuupplleeDDeessccrriippttoorr, which  is
the same as the rrdd__aatttt entry of the reldesc for the relation
about to be  created.   Creating  the  tuple  descriptor  is
straightforward;  a  vector  is  allocated  with  one  FFoorrmm--
DDaattaa__ppgg__aattttrriibbuuttee structure for each attribute.   The  FFoorrmm--
DDaattaa__ppgg__aattttrriibbuuttee  structures  are filled in with the names,
types,  and  other  appropriate  information  for  the   new
attributes.   The aattttrreelliidd and aattttddeeffrreell entries may be left
as IInnvvaalliiddOObbjjeeccttIIdd; these will be filled in by the  code  to
create the heap relation.

     Once  the tuple descriptor has been constructed, creat-
ing the relation is straightforward.  The interface is

     OObbjjeeccttIIdd
     hheeaapp__ccrreeaattee((cchhaarr **rreellnnaammee,, iinntt aarrcchhiivvee,, uunnssiiggnneedd nnaattttss,,
              uunnssiiggnneedd ssmmggrr,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc))

The rreellnnaammee argument is the name of  the  new  relation;  it
must be unique, or hheeaapp__ccrreeaattee will abort the transaction.

     The  aarrcchhiivvee  argument, although declared to be of type
iinntt in the prototype, is actually a character;  this  should
be one of _n (no archiving), _l (light archiving), or _h (heavy
archiving).  Light and heavy archiving behave identically in
the current system, and cause deleted and replaced tuples to


                             4411


be copied to a special archive by the  vacuum  cleaner.   No
archiving  causes  the vacuum cleaner to discard deleted and
replaced tuples.
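
     Because the archive mode is passed as an int but interpreted as
a character, a caller may want to validate it before creating a
relation.  The two helpers below are hypothetical conveniences
written for this illustration; they are not part of POSTGRES.

```c
#include <assert.h>
#include <stdbool.h>

/* True if archive is one of the three documented mode characters. */
bool
toy_archive_mode_valid(int archive)
{
    return archive == 'n' || archive == 'l' || archive == 'h';
}

/*
 * True if deleted and replaced tuples are copied to the archive.
 * Light ('l') and heavy ('h') archiving behave identically here.
 */
bool
toy_archive_keeps_history(int archive)
{
    assert(toy_archive_mode_valid(archive));
    return archive != 'n';
}
```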

     The nnaattttss argument is the number of attributes for  the
new  relation.  This should be the same as the length of the
ttuuppddeesscc vector.

     The ssmmggrr argument is the storage manager on  which  the
relation  should  be  created.  The set of supported storage
managers varies from installed system to  installed  system.
Storage  manager number zero is always magnetic disk.  Other
numbers are installation-dependent.

     Finally, the ttuuppddeesscc argument is the  tuple  descriptor
for the new relation.

     HHeeaapp__ccrreeaattee returns the object ID of the new relation's
ppgg__ccllaassss tuple, which is the same as the relid  of  the  new
relation.

     Destroying a heap relation is even simpler:

     vvooiidd
     hheeaapp__ddeessttrrooyy((cchhaarr **rreellnnaammee))

If  the  named  relation does not exist, hheeaapp__ddeessttrrooyy aborts
the transaction.  Otherwise, the relation is  destroyed  and
all  indices defined on it are removed.  In addition, if the
relation inherited  attributes  from  other  relations,  its
inheritance information is deleted from the system catalogs.

11..44..88..22..  IInnddeexx RReellaattiioonnss

     Creating an index  relation  is  straightforward.   The
interface is

     vvooiidd
     iinnddeexx__ccrreeaattee((NNaammee hheeaapp__nnaammee,, NNaammee iinnddeexx__nnaammee,,
               FFuunnccIInnddeexxIInnffoo ffIInnffoo,, OObbjjeeccttIIdd aamm__iidd,,
               AAttttrriibbuutteeNNuummbbeerr nnaattttss,, AAttttrriibbuutteeNNuummbbeerr **aattttnnooss,,
               OObbjjeeccttIIdd **ooppccllaasssseess,, uuiinntt1166 ppaarraammCCoouunntt,,
               DDaattuumm **ppaarraammss,, LLiissppVVaalluuee pprreeddiiccaattee))

The  ffIInnffoo,  ppaarraammCCoouunntt, ppaarraammss, and pprreeddiiccaattee arguments are
for defining functional or partial indices, and will not  be
covered here.

     The hheeaapp__nnaammee argument is the name of the heap relation
on which the index is to be defined.  The  iinnddeexx__nnaammee  argu-
ment is the name of the new index relation.


                             4422


     The  aamm__iidd  argument  is  the  object  ID of the access
method tuple for this index, from ppgg__aamm.

     The nnaattttss argument is the number of  attributes  (keys)
for this index.  The current system supports only single-key
indices, but this may change in a subsequent  release.   The
aattttnnooss  argument  is  a  vector  of attribute numbers in the
heap.  If the index is to be defined on the third  attribute
in  the  heap,  then  this  vector would consist of a single
entry, which would be three.

     The ooppccllaasssseess argument is the object ID of the operator
class  tuple  from  ppgg__aammoopp  to use for each attribute.  The
operator classes map types and operators for  the  index  to
particular functions in ppgg__pprroocc.

     Sample  code that defines regular, functional, and par-
tial  indices  may  be  found  in  the  source   file   ccoomm--
mmaannddss//ddeeffiinndd..cc.

     When  iinnddeexx__ccrreeaattee returns, the new index has been cre-
ated and tuples have been inserted into it for each tuple in
the heap relation.

11..44..99..  SSccaann PPoossiittiioonn MMaannaaggeemmeenntt

     Scans  on  index  or heap relations support _m_a_r_k_i_n_g and
_r_e_s_t_o_r_i_n_g of the scan's position.  A mark records  the  cur-
rent  scan  location,  and a restore restores the previously
marked location.  Only one mark may be defined  on  an  open
scan.

     Marking  and  restoring  are of limited use outside the
executor, but the interfaces are included here for complete-
ness.  They are

     vvooiidd
     hheeaapp__mmaarrkkppooss((HHeeaappSSccaannDDeesscc hhssccaann))

     vvooiidd
     hheeaapp__rreessttrrppooss((HHeeaappSSccaannDDeesscc hhssccaann))

     vvooiidd
     iinnddeexx__mmaarrkkppooss((IInnddeexxSSccaannDDeesscc iissccaann))

     vvooiidd
     iinnddeexx__rreessttrrppooss((IInnddeexxSSccaannDDeesscc iissccaann))


                             4433


11..44..1100..  LLoocckkiinngg

     Users of the access method interface routines need not
worry about locking.  The access method routines acquire
(and, when appropriate, release) locks on relations in
response to scans and updates.  Heap relations are protected
by two-phase relation-level locks.  Btree indices use
Lehman-Yao [5] short-term locking for high concurrency, but
since the underlying relations are locked at the relation
level, index updates are serialized.  Rtree indices use
two-phase relation-level locking.

     To improve concurrency, system catalogs in POSTGRES are
not subject to two-phase locking.  This means that under some
circumstances concurrent transactions that operate on the
system catalogs are not serializable.  This flaw is common in
commercial relational systems, as well, and occurs
infrequently enough that it does not matter in practice.


____________________
   5 Lehman, P., Yao, S., ``Efficient Locking for Concurrent
Operations  on  B-trees'', _A_C_M _T_r_a_n_s_a_c_t_i_o_n_s _o_n _D_a_t_a_b_a_s_e _S_y_s_-
_t_e_m_s, 6(4), December 1981.


                             4444