.\"
.\" postgres internals access method document
.\"
.\"	to print, groff -te -me am.me
.\"
.\" the '-te' flag to groff preprocesses with tbl and eqn, so if you're
.\" not using groff, you need to run those preprocessors by hand.
.\"
.\" /usr/local/devel/postgres-v4r2/src/doc/implementation/RCS/am.me,v 1.11 1993/07/20 22:21:48 mao Exp
.fo ''%''
.EQ
delim %%
.EN
.nr pp 11
.nr tp 11
.ps 11
.ds PG "\\s-2POSTGRES\\s0
.ds PQ "\\s-2POSTQUEL\\s0
.de RN
\\fC\\$1\\fP\\$2
..
.de SE
\\fC\\$1\\fP\\$2
..
.de FI
\\fC\\$1\\fP\\$2
..
.de BU
.ip \0\0\(bu
..
.de (E
.in +0.5i
.sp
.nf
.na
.ft C
.ta 0.5i 1i 1.5i 2i 2.5i 3i 3.5i 4i 4.5i 5i 5.5i
..
.de )E
.in
.sp
.fi
.ad
.ft R
..
.sh 1 "The \*(PG Access Methods"
.pp
This section describes the \*(PG access methods in detail.
The major concepts covered here are
.BU
the relation descriptor and its contents,
.BU
the differences between heap and index relations,
.BU
scan keys, scan descriptors, and the scan interface,
and
.BU
the \*(PG access method interface routines.
.pp
.sh 2 "The Relation Descriptor"
.pp
The relation descriptor,
or
.i reldesc , 
is the in-memory data structure that describes an open relation.
The reldesc is an argument to most of the procedures that operate
on relations.
Relations may be opened by name or by relation ID \*-
either of these uniquely identifies the relation to which
it corresponds.
A relation's ID,
or
.i relid ,
is the object ID of the tuple in
.RN pg_class
that describes it.
.pp
The structure that stores a reldesc is declared in the source file
.FI utils/rel.h .
The structure definition is
.(E
typedef struct RelationData {
	File				rd_fd;
	int				rd_nblocks;
	uint16			rd_refcnt;
	bool				rd_ismem;
	bool				rd_isnailed;
	AccessMethodTupleForm	rd_am;
	RelationTupleForm	rd_rel;
	ObjectId			rd_id;
	Pointer			lockInfo;
	TupleDescriptorData	rd_att;
	/* VARIABLE LENGTH ARRAY AT END OF STRUCT */
} RelationData;

typedef RelationData		*Relation;
.)E
.pp
The meaning of the
.SE rd_fd
entry depends on the storage manager that stores the relation.
For the magnetic disk storage manager,
this is the \*(PG virtual file descriptor (VFD) for the open
file that stores the relation.
On the Sony jukebox,
this entry is meaningless,
and is simply set to a non-negative value if the jukebox relation is
successfully opened.
.pp
The
.SE rd_nblocks
entry is the number of blocks in the relation,
but this number is not reliable.
This field is used by the executor during scans of heap relations
to avoid scanning past the end of the relation.
For index relations,
and during scans of heap relations,
the value stored here will be wrong.
It should not be used outside of the executor.
.pp
The entry
.SE rd_refcount
reflects the number of currently active references to this reldesc
inside the backend.
Every backend manages a private cache of relation descriptors.
When the cache is full,
reldescs with
.SE rd_refcount
equal to zero may be evicted to make room for new descriptors.
If a single relation is opened by a single backend more than once,
every open request returns a pointer to the same reldesc,
and
.SE rd_refcount
is incremented to keep track of each reference.
When the relation is closed,
the reference count is decremented.
.pp
The structure entry
.SE rd_ismem
is unused in the current system.
It is intended to support in-memory-only relations,
changes for which need not be flushed to stable storage at transaction
boundaries.
.pp
The private cache of relation descriptors contains several
reldescs that cannot be evicted,
in order to guarantee that the cache continues to work.
For example,
in order to instantiate the relation descriptor for a user relation
.RN user_reln ,
\*(PG must open and scan
.RN pg_class
and
.RN pg_attribute
for its relation and attribute data.
If the reldesc for
.RN pg_class
is not in the cache,
then no relation
(including
.RN pg_class )
can ever be instantiated in the cache.
The reldescs for
.RN pg_class ,
.RN pg_type ,
and
.RN pg_attribute
all fall into this category,
and are initialized specially when the system starts up.
.pp
To keep track of which relations may not be evicted from the
private reldesc cache,
every reldesc contains an
.SE rd_isnailed
entry.
If this entry is set to
.i true ,
then the reldesc is nailed and may not be evicted from the cache,
even if its
.SE rd_refcount
drops to zero.
.pp
The
.SE rd_am
entry stores a pointer to the access method tuple for the access method
that manages this relation.
Currently,
\*(PG supports a heap access method for storing user and system data,
and btree and rtree indexed access methods for storing indices on
data.
Support for a hashed indexed access method will be added in the near future.
Because these access methods have different implementations,
\*(PG must know what actual routine to call to dispatch general
access method interface calls for a particular access method.
For indexed access methods,
this information is stored in the access method tuple.
Since there is only one access method for user and system data (the heap),
the
.SE rd_am
entry is NULL for reldescs describing heap relations.
.pp
The access method tuple contents will be described in section
\n($1.\n($2.1.
.\" XXX maintainable section ref
.pp
The structure entry
.SE rd_rel
stores a pointer to the
.RN pg_class
tuple that describes this relation.
The
.RN pg_class
tuple is copied to a safe place in memory when the relation is entered
into the reldesc cache.
The tuple contains,
among other things,
the relation's name and the user id of its owner.
The complete contents of the tuple pointed to by
.SE rd_rel
are supplied in section
\n($1.\n($2.2.
.\" XXX maintainable section ref
.pp
The entry
.SE rd_id
stores the relid of the open relation.
The relid is the object ID of the
.RN pg_class
tuple describing the relation.
This does not appear in the tuple pointed to by the
.SE rd_rel
entry,
because that tuple includes only the ordinary attributes,
and not the system attributes,
from the
.RN pg_class
tuple.
Note that the
.RN pg_class
tuple has system attributes,
as it is stored in the relation.
These attributes simply are not copied into memory
when the reldesc is initialized.
.pp
The structure entry
.SE lockInfo
points to an in-memory representation of the rule lock data associated
with this relation.
This entry is not used by the access methods.
The rule system and the data stored in a rule lock
are beyond the scope of this section.
.pp
The final entry in the reldesc is
.SE rd_att ,
a vector of attribute descriptor data for the relation.
This vector stores one entry for every user-level (that is,
non-system) attribute that the relation stores.
This vector is initialized when the reldesc is loaded into the cache.
The data in the vector come from a scan of
.RN pg_attribute .
The contents of the vector are described in section
\n($1.\n($2.3.
.\" XXX maintainable section ref
.sh 3 "The Access Method Tuple Form (rd_am)"
.pp
The
.SE rd_am
entry in a reldesc points at an 
.i "access method tuple form"
describing the access method that manages the relation\**.
.(f
\**
In \*(PG, the word
.i form
is often used to name structures. In this 
instance, for example, the clause 
.i "access method tuple form" 
is used to name a structure that describes an access method
tuple.
.)f
This tuple has one attribute for each of
the routines that implement the standard
access method interface for the particular access method.
Details of the interface routines appear in section
\n($1.4.
.\" XXX maintainable section ref
Here,
we list only the contents of the access method tuple form.
This declaration appears in the source file
.FI catalog/pg_am.h ,
and describes the tuples stored in the
.RN pg_am
system catalog.
.(E
typedef struct AccessMethodTupleFormD {
	NameData		amname;
	ObjectId		amowner;
	char			amkind;
	uint16		amstrategies;
	uint16		amsupport;
	RegProcedure	amgettuple;
	RegProcedure	aminsert;
	RegProcedure	amdelete;
	RegProcedure	amgetattr;
	RegProcedure	amsetlock;
	RegProcedure	amsettid;
	RegProcedure	amfreetuple;
	RegProcedure	ambeginscan;
	RegProcedure	amrescan;
	RegProcedure	amendscan;
	RegProcedure	ammarkpos;
	RegProcedure	amrestrpos;
	RegProcedure	amopen;
	RegProcedure	amclose;
	RegProcedure	ambuild;
	RegProcedure	amcreate;
	RegProcedure	amdestroy;
} AccessMethodTupleFormD;

typedef AccessMethodTupleFormD	*AccessMethodTupleForm;
.)E
.pp
The structure entry
.SE amname
is the name of the access method.
In the current system,
this is one of
.i btree ,
or
.i rtree .
Support for
.i hash
is forthcoming.
.pp
The
.SE amowner
entry is the object ID of the
.RN pg_user
tuple for the owner of this access method.
For all existing access methods,
this is the OID for the user
.i postgres .
This entry is intended to support access protection,
so that,
for example,
only the owner of an access method could change its
interface routines.
Such protection is not implemented in the current system.
.pp
The
.SE amkind
entry is unused in the current system.
.pp
The structure entry
.SE amstrategies
is the number of strategies (operators) that can be used in searches
on this index.
.BU
For btrees, this is 5:
<, \(<=, =, \(>=, and >.
.BU
For rtrees,
this is 8,
corresponding to the operators for
left, left-or-overlap, overlap, right-or-overlap, right, same,
contains, and contained-by.
.BU
When hash tables are supported,
the number of operators will be 1, for equality searches.
.pp
Operators that are supported by an access method
are listed in
.RN pg_amop .
This relation contains one group of operators for each
operator class,
or
.i opclass .
An opclass is an access method/type pair.
For example,
an opclass
.SE int4_ops
is defined for btrees,
so that
.SE int4
values may be indexed using btrees.
.pp
The structure entry
.SE amsupport
is the number of support routines required by this access method.
For example,
when inserting keys into a btree,
the btree access method requires one support routine that compares two keys
and returns negative, zero, or positive,
depending on whether the first key is less than, equal to, or greater than
the second,
respectively.
Similarly,
rtrees require routines that compute the size,
intersection,
and union of two rectangles.
These are not operators that are available to users for index searches,
so they are stored separately from the strategies.
These routines are stored in the relation
.RN pg_amsupport .
.pp
The rest of the
.SE AccessMethodTupleForm
structure stores the object IDs of functions that support
the standard access method interface for this access method.
Those routines are summarized in the table below.
.TS
center allbox;
cfI cfI
lfC l.
Entry name	routine description
amgettuple	Get the next tuple in a scan.
aminsert	Insert a new tuple into the index.
amdelete	Delete a particular \fItid\fP from the index.
amgetattr	Get a particular attribute from the index tuple.
amsetlock	Unsupported.
amsettid	Unsupported.
amfreetuple	Unsupported.
ambeginscan	Start a scan with a qualification on the index key.
amrescan	Reset an active scan to the beginning.
amendscan	End an active scan.
ammarkpos	Mark the current position in an index scan.
amrestrpos	Restore the scan to the previously-marked position.
amopen	Open the index relation.
amclose	Close an open index relation.
ambuild	Define an index on an existing heap relation.
amcreate	Create a new, empty, index relation.
amdestroy	Destroy an existing index relation.
.TE
.pp
In general,
the programmer need not worry about how function dispatch via the
.SE AccessMethodTupleForm
works.
This is handled properly for all indexed access methods by code in
.FI access/common.
.sh 3 "The Relation Tuple Form (rd_rel)"
.pp
The reldesc for a relation includes a pointer to the
.RN pg_class
tuple for the relation.
When a reldesc is instantiated into the cache,
it is initialized from the data for the relation from
.RN pg_class .
The user-level (that is, non-system) attributes for the
.RN pg_class
tuple constitute the
.i "relation tuple form" .
This structure is declared in
.RN catalog/pg_relation.h \**
.(f
\**
The class
.RN pg_class
was previously called
.RN pg_relation .
In all user-visible parts of the system,
the name was converted in 1990,
but internally
(as in the names of system header files),
the name was not always changed.
.)f
as 
.RN Form_pg_relation .
The declaration is
.(E
CATALOG(pg_relation) BOOTSTRAP {
     char16 	relname;
     oid 		relowner;
     oid 		relam;
     int4 		relpages;
     int4 		reltuples;
     dt 		relexpires;
     dt 		relpreserved;
     bool 		relhasindex;
     bool 		relisshared;
     char 		relkind;
     char 		relarch;
     int2 		relnatts;
     int2		relsmgr;
     int28 	relkey;
     oid8		relkeyop;
     aclitem	relacl[1];
} FormData_pg_relation;

typedef FormData_pg_relation	*Form_pg_relation;
.)E
The notation
.(E
CATALOG(pg_relation) BOOTSTRAP {
.)E
is turned into a structure declaration by cpp macros at compile time.
This declaration is also used to produce setup files,
called
.i "bki files" ,
that create and populate the
.i template1
database.
.pp
The structure entry
.SE relname
is the name of the relation.
.pp
The
.SE relowner
entry is the object ID of the
.RN pg_user
tuple describing the relation's owner.
A relation is owned by the user that created it.
.pp
The
.SE relam
entry is the object ID of the
.RN pg_am
tuple for the access method that manages this relation.
For heap relations,
this is zero,
an invalid object ID.
.pp
The
.SE relpages
and
.SE reltuples
entries are,
respectively,
the approximate number of pages and tuples in this relation.
These numbers are set by the vacuum cleaner and when an index
is defined on a heap relation,
and so are wrong in general.
They are not changed when tuples are inserted into or deleted from
the relation.
When the relation is initially created,
both are zero.
Both
.SE reltuples
and
.SE relpages
are used by the planner to estimate plan costs during query
optimization.
Interestingly, then, if neither 
.SE reltuples
nor
.SE relpages
have been updated by the vacuum cleaner recently,
their incorrect values can skew
query optimization such that queries may run for an 
unexpectedly long time. 
.pp
The
.SE relexpires
entry is the amount of history that should be saved for this relation.
This is not used in the current system.
.pp
The
.SE relpreserved
entry is the date and time at which the relation was last vacuumed.
This is used by the planner when it produces query plans for historical
queries.
If the historical query extends to before the time at which the relation
was last vacuumed,
then the archive must be scanned.
Otherwise,
it need not be.
.pp
The
.SE relhasindex
entry is
.i true
if an index exists on the relation.
When an index is destroyed,
this flag is not changed,
and so an entry of
.i true
may mean that there was recently an index on the relation,
but that there no longer is.
The vacuum cleaner restores this to the correct value when it runs,
and defining an index on a heap relation always sets this entry to
.i true
for the heap. 
.SE Relhasindex
is always 
.i false
for an index relation.
.pp
The
.SE relisshared
entry is
.i true
for some shared system catalogs.
Most relations for database
.i dbname
reside in the directory
\fC$PGDATA/base/\fP\fIdbname\fP.
However,
some relations,
such as
.RN pg_user
and
.RN pg_database ,
must be visible to users of all databases simultaneously.
These relations are stored in the directory
.FI $PGDATA ,
and the
.SE relisshared
entry in the
.RN pg_class
tuple for these relations is set to
.i true .
.pp
The
.SE relkind
entry has one of three values:
.BU
.i r
means that the relation is an ordinary heap.
.BU
.i i
means that the relation is an index.
.BU
.i u
means that the relation is an uncatalogued (that is, temporary)
heap relation
which will be automatically destroyed when the transaction ends.
.pp
The
.SE relarch
entry describes the frequency with which the relation should be archived.
Although three levels exist,
only two are actually supported.
The three levels are
.BU
.i h ,
for heavy update traffic and frequent archival;
.BU
.i l ,
for light update traffic and infrequent archival;
and
.BU
.i n ,
for no archival.
.pp
The only levels actually supported are
.i h
and
.i n .
If the
.SE relarch
entry is
.i n ,
then the vacuum cleaner discards all historical data from the relation
when it runs.
.pp
The
.SE relnatts
entry is the number of attributes in the relation.
.pp
The
.SE relsmgr
entry identifies the storage manager that manages storage for this
relation.
Storage manager zero is magnetic disk.
On the system installed at UC Berkeley,
storage managers one and two are,
respectively,
for the Sony optical disk jukebox and main memory relations.
The storage manager code is called from the \*(PG buffer manager,
and is of no concern to access method implementors.
.pp
The
.SE relkey
entry is a vector of attribute numbers that define the unique key for
this relation.
This is not supported in the current system.
.pp
The structure ends with a variable-length vector of access control list
information,
stored in the
.SE relacl
entry.
The access control list is used to grant or deny
read or write access on the relation to particular users.
.sh 3 "The Tuple Descriptor Data (rd_att)"
.pp
The reldesc ends with a vector of attribute data that describe the
tuples stored in the relation.
This is a variable-length entry;
different relations have different numbers of attributes,
and so their
.SE rd_att
vectors are of different length.
.pp
There are two data structures of interest.
The
.SE "TupleDescriptorData"
structure,
defined in
.FI access/tupdesc.h ,
is a variable-length array of
.SE AttributeTupleFormData
entries.
The
.SE "AttributeTupleFormData"
structure,
defined in
.FI catalog/pg_attribute.h ,
contains detailed information on a single attribute.
.pp
The data structure for
.SE AttributeTupleFormData
is the same as a
.SE Form_pg_attribute ,
which is
.(E
CATALOG(pg_attribute) BOOTSTRAP {
    oid  		attrelid;
    char16  	attname;
    oid  		atttypid;
    oid  		attdefrel;
    int4  		attnvals;
    oid  		atttyparg;
    int2 		attlen;
    int2  		attnum;
    int2 		attbound;
    bool  		attbyval;
    bool 		attcanindex;
    oid 		attproc;
    int4		attnelems;
    int4		attcacheoff;
} FormData_pg_attribute;

typedef FormData_pg_attribute	*Form_pg_attribute;
.)E
.pp
As above, the
.(E
CATALOG(pg_attribute) BOOTSTRAP
.)E
line is converted to a struct declaration by cpp macros at compile time.
.pp
The
.SE attrelid
entry is the object ID of the
.RN pg_class
tuple for the relation that contains this attribute.
.pp
The
.SE attname
entry is the name of the attribute.
.pp
The
.SE atttypid
entry is the object ID of the
.RN pg_type
tuple for the type of this attribute.
.pp
The
.SE attdefrel
entry is used for object-oriented type management by \*(PG.
When a class is declared that inherits from some other class,
the new class implicitly has all the attributes that the original
class had.
In this case,
the
.SE attdefrel
entry is the object ID of the relation from which this attribute is inherited.
.pp
The
.SE attnvals
entry is intended to support query optimization.
It should store the number of distinct values for this attribute
that appear in the relation.
This value should be computed at vacuum time,
and stored in
.RN pg_attribute
by the vacuum cleaner.
However,
the current vacuum cleaner does not compute statistics,
so
.SE attnvals
is always zero.
If the correct value were maintained,
the optimizer could use it to produce better query plans.
The optimizer currently ignores
.SE attnvals .
.pp
The
.SE atttyparg
entry is unused.
.pp
The
.SE attlen
entry is the length of this attribute, in bytes,
for fixed-length attributes,
or \(mi\|1,
for variable-length attributes.
.pp
The
.SE attbound
entry was intended to support array-valued attributes,
but is not used by the existing \*(PG array implementation.
.pp
The
.SE attbyval
entry is
.i true
if this is a pass-by-value attribute,
and
.i false
if it is pass-by-reference.
This entry is set by looking at the
.SE typbyval
attribute in the
.RN pg_type
tuple for the type of this attribute.
.pp
The
.SE attcanindex
entry is
.i true
if this entry can be indexed,
and false otherwise.
Array attributes cannot be indexed.
.pp
The
.SE attproc
entry is unused.
.pp
The
.SE attnelems
entry is used for array attributes,
to record the number of elements that may be stored in the attribute.
.pp
The
.SE attcacheoff
entry is never maintained on disk,
but is set in memory to allow fast attribute lookups.
The first time an attribute is fetched from a tuple,
its offset is computed by summing the lengths of the attributes
that precede it.
If all of the preceding attributes are fixed-length,
and none are NULL,
then the computed offset may be cached for use on subsequent lookups.
The offset is stored in
.SE attcacheoff .
.pp
The
.SE "AttributeTupleFormData"
structure stores information on a single attribute.
The vector that describes all of the attributes in a relation,
.SE rd_att ,
contains pointers to in-memory
.SE AttributeTupleFormData
structures for each attribute.
The vector is stored in
the
.SE TupleDescriptorData
structure,
whose declaration is
.(E
typedef struct TupleDescriptorData {
	AttributeTupleForm	data[1];
	/* VARIABLE LENGTH ARRAY */
} TupleDescriptorData;

typedef TupleDescriptorData	*TupleDescriptor;
.)E
.sh 3 "The Strategy Map and Support Routines for Index Reldescs"
.pp
The sections above describe the contents of a reldesc that are common
to the heap and indexed access methods.
The indexed access methods store some additional information at the
end of the reldesc.
This information is used to select operators to apply during scans,
and to find the support routines that are required by particular
indexed access methods.
These data structures,
their layout,
and their purposes are obscure.
They reflect poor design of the indexed access methods.
This section is difficult to understand,
and may be safely ignored by everyone except the maintainer of the
\*(PG access methods.
.pp
Immediately following the TupleDescriptorData vector,
the indexed access methods have two additional pieces of data.
The first is a pointer to the
.SE IndexStrategyData
structure for this access method.
The second is a pointer to the vector of support routines for
the access method.
Neither of these is a named entry of any data structure;
their existence is assumed
(and,
at least so far,
correctly maintained)
by the code that manages reldescs for the indexed access methods.
.pp
The problem with this design is that it makes it impossible to name
these entries from the debugger.
The following cumbersome GDB statements will print out the contents
of these entries:
.(E
print (IndexStrategy)(&reldesc->rd_att[
	reldesc->rd_rel->relnatts])
print (RegProcedure *)(((char *)
	&reldesc->rd_att[reldesc->rd_rel->relnatts])
	+ sizeof(IndexStrategy))
.)E
The first statement prints the pointer
to the vector of index strategy map data;
the second prints the pointer to the vector of support procedure IDs.
These pointers must be dereferenced in order to examine the contents
of the vectors.
Lines have been broken so that the commands fit across the page;
the statements should be typed on a single line.
.pp
The strategy map is a vector of scan key entries.
Scan keys are described in sections
\n($1.3.1 and \n($1.3.2.
.\" XXX maintainable section ref
The strategy map stores information on what procedure to call
for a particular operator on this index.
For example,
a btree storing
.SE int4
values would have the procedure ID of the
.SE int4lt
procedure at the location corresponding to '<' in the strategy vector.
The strategy map is constructed automatically by the system when
the index relation is opened.
The contents are read from the
.RN pg_amop
relation,
which is keyed by operator class and access method.
.pp
Finally,
the vector of support procedures is simply an array of procedure IDs,
or
.RN pg_proc
tuple OIDs,
for the support routines required by the access method.
This vector is initialized when the index relation is opened by scanning
the
.RN pg_amsupport
relation.
The tuples in
.RN pg_amsupport
are keyed by access method and operator class.
\*(PG assigns no meaning to the entries in this vector;
they are intended for use by the access method.
The btree access method uses a single support routine,
which compares two keys and returns less than zero,
zero,
and greater than zero,
respectively,
if the first key is less than, equal to, or greater than the second.
The rtree access method uses three support routines,
which compute the size of a rectangle in standard units,
the intersection of two rectangles,
and the union of two rectangles.
.sh 3 "Summary of the Relation Descriptor Data Structure"
.pp
The reldesc describes an open relation,
and is the argument that is passed to most of the routines
that operate on relations.
\*(PG manages a private cache of reldescs,
and at most one copy of the reldesc for a given relation appears
in the cache at a time.
The reldesc includes pointers to the relation tuple form
and a vector of information that describes the attributes
that appear in tuples of the relation.
For indexed access methods,
the reldesc also includes a pointer to the access method tuple form,
a pointer to a vector of strategy information that describes the
operators supported by the access method for the operator class,
and a pointer to a vector of access method-specific support routines.
.sh 2 "Heap and Index Relations"
.pp
Heap relations are the primary \*(PG storage structure;
all user relations and all the system catalog information is stored
in heaps.
The \*(PG heap is logically an unordered set of tuples.
Heaps consist of zero or more 8192-byte blocks,
each of which contains zero or more tuples.
.pp
Index relations store key/value pairs,
where the key allows fast lookup on some set of attributes from a heap,
and the value is a pointer to a tuple in the heap.
.sh 3 "Tuple Identifiers"
.pp
Tuples are identified by
.i "tuple identifiers" ,
or tids.
A tid is a triple of the form (\fIblockno\fP, \fIpageno\fP, \fIoffset\fP).
The original design of the system allowed for more than one page to
be stored on a single block of a relation,
but this feature has never been used.
As a result,
the middle value (\fIpageno\fP) is always zero.
To confuse things further,
the term
.q page
is used interchangeably with the term
.q block
in the code and by programmers talking about the system.
Since the original usage of
.q page
was never implemented,
it has come to mean exactly the same thing as
.q block. \|
.pp
Thus a tid is a triple of the form (\fIblockno\fP, 0, \fIoffset\fP).
.pp
The
.i blockno
is the number of the block in the relation;
blocks are numbered sequentially from zero.
When a new block is allocated to a relation,
it is given the next available block number.
.pp
The
.i offset
stored in a tid is the offset of the tuple within the block.
Blocks store zero or more tuples,
and tuples are numbered sequentially from one within a block.
The number of tuples that fit in a block depend on the schema of
the relation that the block stores,
and the number may vary for a given relation
(for example,
if the tuple includes variable-length attributes).
.pp
A tid uniquely identifies a single tuple in a relation.
In the current version of the system,
relations are not compressed when they are vacuumed,
so tuples do not move around inside a relation.
Thus the tid is a valid identifier for the tuple until the tuple
is replaced or deleted,
at which point it (logically) vanishes from the relation.
Note that other tuples on the same page as a deleted tuple do not
change their tids when the tuple is deleted;
the available slot is reserved for subsequent reuse.
In fact,
slots are never reused in the current system.
The vacuum cleaner and the tuple allocation code may be changed to
fix this problem in the future.
.sh 3 "Data Stored in Heap Relations"
.pp
Every heap tuple begins with a
.SE HeapTupleData
structure.
The declaration for this structure appears in the source file
.FI access/htup.h ,
and is
.(E
typedef struct HeapTupleData {
	Size			t_len;
	ItemPointerData	t_ctid;
	ItemPointerData	t_chain;
	union {
		ItemPointerData	l_ltid;
		RuleLock		l_lock;
	}			t_lock;
	ObjectId		t_oid;
	CommandId		t_cmin;
	CommandId		t_cmax;
	TransactionId	t_xmin;
	TransactionId	t_xmax;
	ABSTIME		t_tmin, t_tmax;
	AttributeNumber	t_natts;
	char			t_vtype;
	char			t_infomask;
	char			t_locktype;
	uint8			t_hoff;
	char			t_bits[MinHeapTupleBitmapSize / 8];
} HeapTupleData;	/* MORE DATA FOLLOWS AT END OF STRUCT */

typedef HeapTupleData	*HeapTuple;
.)E
This structure includes a variable-length bitmap vector at its end,
after which the user-level attributes of the tuple appear.
.pp
The
.SE t_len
entry is the length of the entire tuple,
including the
.SE HeapTupleData
structure and the user data that follows it.
This is used to allocate sufficient space in memory and on disk pages
for the tuple.
.pp
The
.SE t_ctid
entry is the tid of this tuple.
Storing the tid in the tuple is easier than computing it
when it is needed
(for example,
when an index tuple is constructed to point at the heap tuple).
.pp
When a tuple is replaced,
\*(PG maintains a pointer from the old version of the tuple
to the new version.
This pointer is stored in the
.SE t_chain
entry.
When the new version of the tuple is successfully inserted,
the page containing the old version is fetched,
and the original tuple's
.SE t_chain
entry is set to the
.SE t_ctid
entry of the new version.
This strategy,
called
.i "forward chaining" ,
leaves older records pointing at newer records.
The advantage if forward chaining is that it allows
records to be updated without changing indices that point at them
if the indexed attribute has not chained.
.pp
The
.SE t_chain
entry was originally intended to support tuple differencing,
which would allow \*(PG to store only the changed values for the
new version in most cases,
rather than the entire new tuple.
Tuple differencing is not supported in the current system.
In addition,
the system always inserts a new index record when a heap tuple
is modified,
regardless of whether the indexed attribute has changed.
This may be fixed in the future.
.pp
The
.SE t_lock
entry stores a rule lock for the tuple.
A rule lock is a tag that identifies a set of rules that should be
run whenever this tuple is manipulated.
The mechanism is beyond the scope of this section.
The value stored in
.SE t_lock
is a list of tids when the tuple is on disk,
and is swizzled to an in-memory pointer when the tuple is
copied into memory.
.pp
The
.SE t_oid
entry of the
.SE HeapTupleData
structure is the object ID that uniquely identifies this tuple.
Although tids will currently uniquely identify a tuple for its lifetime,
this may change someday,
when tuple differencing and space reclamation are implemented.
The OID will is guaranteed by design never to change.
Furthermore,
when a tuple is replaced,
its tid will change,
but its OID is guaranteed not to.
This means that the OID is a safe,
globally unique,
eternal identifier for the tuple.
.pp
The
.SE t_cmin
and
.SE t_cmax
entries
are,
respectively,
the command IDs that inserted and deleted this tuple.
In every transaction,
the user may run up to 32768 individual commands.
Any of these commands may update the database.
The command ID is stored in every tuple to allow the system
to keep track of which command inside a single transaction
made a change.
The reason that this is necessary is to allow subsequent commands
inside the same transaction to see the effects of changes made
by earlier commands in the transaction,
even though those changes are not visible to other users running
concurrently in other transactions.
.pp
The
.SE t_xmin
and
.SE t_xmax
entries are,
respectively,
the transaction IDs of the transactions that inserted and deleted
this tuple.
These transaction IDs are used by \*(PG to check at runtime whether
the transaction that made the changes ever committed.
Changes made by uncommitted transactions may be safely ignored.
.pp
The
.SE t_tmin
and
.SE t_tmax
entries are,
respectively,
the commit times of the inserting and deleting transactions that
operated on this tuple.
When \*(PG inserts a tuple,
it writes the inserting command and transaction IDs,
but leaves the
.SE t_tmin
entry empty.
When the transaction commits,
a special entry is written to the
.FI pg_time
relation.
This entry contains the time at which the transaction committed.
Later,
the vacuum cleaner
(or,
in some cases,
a \*(PG backend in normal operation)
will fill in the commit time for the inserting transaction.
Once this time is filled in,
\*(PG does not need to check the transaction log in order to
see whether the transaction ever committed.
This substantially speeds up processing,
since the transaction log check generally requires disk accesses.
.pp
The
.SE t_natts
entry is the number of attributes that are stored in this tuple.
The number of attributes may not be the same as the number stored
by the relation,
since the relation schema may change over time as a result of the
.i addattr
command.
Any references to attributes beyond the end of the tuple will return
the special database value NULL
(note that database NULL is not the same as the zero value used by C).
.pp
The
.SE t_vtype
entry is for the version type,
whose purpose has been lost to history.
This is unused by the current system.
The value stored here is typically zero,
although the header file
.FI access/htup.h
claims that only
.i i ,
.i r ,
and
.i d
are supported.
.pp
The
.SE t_infomask
entry is used to encode information about the tuple to speed up
fetching of attributes.
The information encoded here includes flags indicating whether
the tuple contains any null values or any variable-length attributes.
.pp
The
.SE t_locktype
entry indicates whether the value stored in the
.SE t_lock
union is an on-disk (tid) or in-memory (pointer) lock representation.
This is stored far from the union in order to keep the data structure
tightly packed in memory.
.pp
The
.SE t_hoff
entry is the number of bytes occupied by the tuple header
(that is,
by the
.SE HeapTupleData
structure).
This may vary from tuple to tuple,
because if the tuple contains no nulls values,
then the bitmap of null values
(stored in
.SE t_bits ,
below)
is not necessary,
and does not appear in the tuple.
For any tuple,
the address of the
.SE HeapTupleData
structure that stores it,
plus the value stored in
.SE t_hoff ,
is the address of the first byte of user data stored in the tuple.
.pp
The
.SE t_bits
vector contains one bit for every attribute in the tuple
if the tuple contains nulls.
For each attribute,
the corresponding bit is set if the attribute is null,
and is clear otherwise.
Nulls are not actually stored in the tuple;
just the
.SE t_bits
bit for the null value appear.
This keeps tuples containing null values small.
If the tuple contains no null values,
then the
.SE t_bits
vector is left off the end of the structure,
and user data begins immediately after the
.SE t_hoff
entry.
.pp
Collectively,
the values stored in the
.SE HeapTupleData
structure are referred to as the
tuple header.
These values are accessible from the query language as
.i "system attributes" .
Every heap tuple has the same
system attributes,
but the
.i "user attributes"
they store vary from relation to relation,
and possibly even within a relation.
.pp
System attributes are all assigned attributes less than zero
internally by the system. The 
User-defined attributes all have attribute numbers greater than zero.
The names of the system attributes and their corresponding 
.SE HeapTupleData
structure elements are listed in the following table\** .
.(f
\**
At one time, there existed a system attribute 
.i anchor
which corresponded with the 
.SE HeapTupleData
element
.SE t_anchor .
.i Anchor
was the anchor point, or complete tuple, stored in a sequence of 
difference tuples. 
Neither are used in the system at this time since tuple differencing
has not yet been implemented.
.)f
.TS
center allbox;
cfI cfI
lfC l.
HeapTupleData element	System Attribute
t_chain	chain
t_cmax 	cmax
t_cmin	cmin
t_ctid	ctid
t_oid	oid
t_lock	rlock
t_tmax	tmax
t_tmin	tmin
x_xmax	xmax
t_xmin	xmin
t_vtype	vtype
.TE
.sh 3 "Data Stored in Index Relations"
.pp
All \*(PG indices are secondary \*-
that is,
ordinary system and user data are stored in heap relations,
and indices store pointers into the heaps.
All of the indexed access methods store index tuples,
which are much smaller than heap tuples.
In general,
the indexed access methods also store other data on \*(PG pages;
what data are stored depends on the indexed access method.
For the rtree and btree access methods,
some pages are designated as
.i internal
pages,
and only store pointers to other pages in the same index.
.i Leaf
pages,
on the other hand,
store pointers into the heap.
.pp
A particular index is defined on a single heap relation,
and all the heap tids that the index stores refer to that relation.
Since indices support fast keyed lookup,
the indices also store key data,
which are typically either an attribute from the heap or
some function of an attribute or attributes from the heap.
For example,
if the
.RN EMP
class stored employee records,
and included a
.RN salary
attribute,
then a btree index on
.RN EMP.salary
would store the salaries from
.RN EMP
together with the tids of the tuples with each particular salary.
.pp
The index tuple format is defined in
.FI access/itup.h ,
and is
.(E
typedef struct IndexTupleData {
	ItemPointerData			t_tid;
	unsigned short			t_info;
} IndexTupleData;	/* MORE DATA FOLLOWS */

typedef IndexTupleData	*IndexTuple;
.)E
.pp
The
.SE t_tid
entry is the tid of the heap tuple that corresponds to this index tuple.
.pp
The
.SE t_info
entry encodes some information about the index tuple.
This is a sixteen-bit quantity whose layout is as follows:
.BU
Bit fifteen (the leftmost bit) is set if the index tuple contains
null values,
and is clear otherwise.
.BU
Bit fourteen is set if the index tuple contains variable-length
attributes,
and is clear otherwise.
.BU
Bit thirteen is set if there are rules associated with the index
tuple,
and is clear otherwise.
Rules on index tuples are not supported by the current system.
.BU
Bits twelve through zero are the size of the tuple, in bytes.
.pp
Immediately following the
.SE t_info
entry is the index key.
The index key is the set of all attributes from the heap tuple
that form the search key for this index.
The current implementations of the btree and rtree access methods
support only a single index key.
This may change,
at least for btrees,
in the near future.
.sh 3 "How the Relations are Managed"
.pp
When the \*(PQ user issues an update to a heap relation
(either inserting or deleting data),
all of the corresponding indices are automatically updated,
as well.
\*(PG stores,
for every heap relation,
the index relations defined on it,
and for every index relation,
the attributes from the heap that form the index key.
These data are stored in
.RN pg_index .
.pp
When the user issues a \*(PQ query against a heap relation that requires
the relation to be scanned,
\*(PG checks to see whether any of the indices defined on the heap
will permit the scan to be executed more quickly.
For example,
if a btree index is defined on the
.RN salary
attribute of the
.RN EMP
relation,
then a query for all employees with salaries between $30,000 and $50,000
can be satisfied by scanning the btree index,
and only selecting out the heap tuples with the correct salaries.
This allows the scan to be completed more quickly than would a sequential
scan of all of the tuples in the
.RN EMP
relation.
.sh 2 "Scan Keys, Scan Descriptors, and Scans"
.pp
A
.i scan
is the abstraction used in \*(PG to search a heap or index relation
for tuples that satisfy some qualification.
A scan descriptor,
or 
.i scandesc ,
is the data structure that describes an open scan.
The qualification is stored in the scandesc in data structures called
.i "scan keys" .
Each key consists of a procedure
.i p ,
an attribute number
.i k ,
and a value
.i v .
Logically, every tuple
.i t
in the relation is checked to see whether
its
.i k th
attribute,
%t sub k%,
passes the qualification stored in the scan key.
The tuple passes the qualification if
%p({t sub k}, v)%
returns true.
In order for the tuple to satisfy the entire qualification,
it must pass this test for every scan key stored in the scandesc.
This means that a scandesc stores a conjunctive qualification
on the relation.
Disjunctive qualifications may be handled by using more than one
scan,
and suppressing duplicate tuples,
or by using a scan with no keys,
and applying the disjunctive tests manually.
.pp
Once a scan is opened,
it returns tuples one at a time.
Every tuple that the scan returns is guaranteed to satisfy the qualification,
and the scan is guaranteed to return all qualifying tuples.
.pp
The rest of this section describes the data structures associated
with scans,
scandescs,
and scan keys.
.sh 3 "Scan Keys"
.pp
A scan is a conjunction of zero or more qualifications on single
attributes in the relation.
Every single-attribute qualification is stored in a scan key.
The data structure used to store scan keys appears in the source file
.FI access/skey.h .
Its declaration is
.(E
typedef struct ScanKeyEntryData {
	bits16		flags;
	AttributeNumber	attributeNumber;
	RegProcedure	procedure;
	int 			(*func) ();
	int32 		nargs;
	Datum			argument;
} ScanKeyEntryData;

#define CheckIfNull		0x1
#define UnaryProcedure	0x2
#define NegateResult		0x4
#define CommuteArguments	0x8

typedef ScanKeyEntryData	*ScanKeyEntry;
.)E
The
.SE flags
entry is a sixteen-bit bitmask describing the function to be called
and its return values.
The flag values appear in the
\fC#define\fPs that follow the data structure.
If the
\fCCheckIfNull\fP flag is set,
then the scan code should check for null attributes and return null
for them without calling the function.
However,
this behavior is not implemented in the current system.
If the
\fCUnaryProcedure\fP bit is set,
then this procedure takes a single argument
(the appropriate attribute from the tuple),
rather than an attribute and a constant.
If the
\fCNegateResult\fP bit is set,
then the scan code will negate the logical value returned by the
procedure when checking for tuples that satisfy the qualification.
Finally,
if the \fPCommuteArguments\fP flag is set,
then the scan code will call the procedure with the supplied value
and the attribute from the tuple,
rather than the opposite order.
.pp
The
.SE attributeNumber
entry is the attribute number in the tuple to check.
.pp
The
.SE procedure
entry is the object ID of the
.RN pg_proc
tuple that describes the procedure to call.
When the procedure is called for the first time,
the
.RN pg_proc
relation is scanned and the appropriate function is dynamically loaded,
if necessary.
.pp
The
.SE func
entry is a pointer to a function that must return type \fCbool\fP.
This function pointer is initialized by the scan key from the
.RN pg_proc
tuple for the desired
.SE procedure ,
and need not be set by the caller.
The function pointer is cached to save repeated scans of
.RN pg_proc .
.pp
The
.SE nargs
entry is the number of arguments that are required by the function.
This is filled in from the
.RN pg_proc
tuple,
and need not be set by the caller.
However,
it should always be either one or two.
.pp
Finally,
the
.SE argument
entry is the additional argument to the function.
This must be set by the caller.
.sh 3 "Setting up Scan Keys"
.pp
In order to properly set up a scan key,
the caller must know which attributes in the tuple are of interest,
the object IDs of the procedures that should be used to test the attributes,
and what values they should be tested against.
In order to simplify scan key setup,
a number of conventions have been adopted in the \*(PG source code.
.pp
First,
the schemas for all of the system catalogs are defined in header files
in the
.FI catalog
directory.
The header files are named
.i relname .h,
where
.i relname
is the name of the catalog of interest.
For example,
the schema of the
.RN pg_user
relation appears in the header file
.FI catalog/pg_user.h .
The lone exception to this rule is that the schema for the
.RN pg_class
relation appears in the file
.RN catalog/pg_relation.h .
This is an artifact of the terminology change (relations became classes)
that swept the project in the early 1990s.
.pp
Second,
the header files described above include standard
.SE #define s
for the relations that they describe.
The relation's name,
as a string,
is
\fCName_\fP\fIrelname\fP;
for example,
\fCName_pg_user\fP.
The number of attributes in the relation is
\fCNatts_\fP\fIrelname\fP
(\fCNatts_pg_user\fP).
Every attribute may be referenced as
\fCAnum_\fP\fIrelname\fP\fC_\fP\fIattname\fP
(for example,
the
.RN usename
attribute of the
.RN pg_user
relation is
.SE Anum_pg_user_usename .
These conventions permit programmers to use symbolic names,
rather than embedded constants,
to set up scans and open relations.
.pp
Similar conventions have not been adopted for user relations,
so programmers who want to set up and use scans on them
must already know their schemas.
.pp
Third,
there exist some
.SE #define d
constants for procedures that are frequently used in scans on the
system catalogs.
These procedure IDs appear in the source file
.FI catalog/pg_proc.h ,
and generally take the form
\fIOperationOfInterest\fP\fCRegProcedure\fP \*-
for example,
.SE CharacterEqualRegProcedure
and
.SE ObjectIdEqualRegProcedure .
.pp
Finally,
the constant values used by \*(PG in scans are of type
.SE Datum .
A number of macros have been defined that convert embedded
constants to values of type
.SE Datum .
These macros are defined in the source file
.FI tmp/datum.h ,
and are
.TS
center allbox;
cI cI cI
l l l.
convert type	from Datum	to Datum
=
one-byte char	DatumGetChar(\fIv\fP)	CharGetDatum(\fIv\fP)
one-byte integer	DatumGetInt8(\fIv\fP)	Int8GetDatum(\fIv\fP)
unsigned one-byte integer	DatumGetUInt8(\fIv\fP)	UInt8GetDatum(\fIv\fP)
two-byte integer	DatumGetInt16(\fIv\fP)	Int16GetDatum(\fIv\fP)
unsigned two-byte integer	DatumGetUInt16(\fIv\fP)	UInt16GetDatum(\fIv\fP)
four-byte integer	DatumGetInt32(\fIv\fP)	Int32GetDatum(\fIv\fP)
unsigned four-byte integer	DatumGetUInt32(\fIv\fP)	UInt32GetDatum(\fIv\fP)
four-byte float	DatumGetFloat32(\fIv\fP)	Float32GetDatum(\fIv\fP)
eight-byte double	DatumGetFloat64(\fIv\fP)	Float64GetDatum(\fIv\fP)
void *	DatumGetPointer(\fIv\fP)	PointerGetDatum(\fIv\fP)
pointer to struct	DatumGetStructPointer(\fIv\fP)	StructPointerGetDatum(\fIv\fP)
16-byte char string name	DatumGetName(\fIv\fP)	NameGetDatum(\fIv\fP)
object ID	DatumGetObjectId(\fIv\fP)	ObjectIdGetDatum(\fIv\fP)
.TE
In \*(PG,
four-byte floating point values are passed by reference,
not by value.
All other four-byte quantities are pass-by-value.
A very common programming error in \*(PG is to use values of type
.SE Name
as if they were pointers.
In fact,
.SE Name
is a structure containing a sixteen-byte character array,
and so will be passed on the stack if its address is not explicitly used.
These facts should be documented elsewhere,
but are worth mentioning here.
.pp
The following code fragment shows how to set up a scan key on the
.RN pg_user
relation for all tuples that have
.SE usename
equal to
.q mao
and an object ID of 1806:
.(E
NameData name;
ScanKeyEntryData skey[2];

bzero(&name, sizeof(name));
bcopy(name.data[0], "mao", strlen("mao"));

ScanKeyEntryInitialize(&skey[0], (bits16)0x0,
	Anum_pg_user_usename,
	(RegProcedure)NameEqualRegProcedure,
	NameGetDatum(name));
ScanKeyEntryInitialize(&skey[1], (bits16)0x0,
	ObjectIdAttributeNumber,
	(RegProcedure)ObjectIdEqualsRegProcedure,
	ObjectIdGetDatum(1806));
.)E
.sh 3 "The Scan Descriptor"
.pp
Once the scan keys are properly initialized,
they may be used to open a scan on a relation.
An open scan is described by a scandesc.
The data structures that describe open scans are
.SE HeapScan s
and
.SE IndexScan s,
and are declared in the source file
.FI access/relscan.h .
.sh 4 "The Heap Scan Descriptor"
.pp
The declaration for
.SE HeapScanDescData
is
.(E
typedef struct HeapScanDescData {
	Relation		rs_rd;
	HeapTuple		rs_ptup;
	HeapTuple		rs_ctup;
	HeapTuple		rs_ntup;
	Buffer		rs_pbuf;
	Buffer		rs_cbuf;
	Buffer		rs_nbuf;
	struct dchain	*rs_dc;
	ItemPointerData	rs_mptid;
	ItemPointerData	rs_mctid;
	ItemPointerData	rs_mntid;
	ItemPointerData	rs_mcd;
	Boolean		rs_atend;
	TimeQual		rs_tr;
	uint16		rs_cdelta;
	bool			rs_parallel_ok;
	uint16		rs_nkeys;
	ScanKeyData	rs_key;
	/* VARIABLE LENGTH ARRAY AT END OF STRUCT */
} HeapScanDescData;

typedef HeapScanDescData *HeapScanDesc;
.)E
.pp
The
.SE rs_rd
entry is a pointer to the reldesc for the relation on
which this scan has been opened.
.pp
The
.SE rs_ptup ,
.SE rs_ctup ,
and
.SE rs_ntup
entries are,
respectively,
the previous tuple,
current tuple,
and next tuple visited in this scan.
The scan code caches these because it was originally thought
that scans would change direction frequently.
In fact,
that is not the case,
and maintaining the previous and next tuple pointers has turned out
to be pure overhead.
Getting rid of the previous and next tuple pointers everywhere that
they appear in the scandesc would speed up scan processing.
.pp
The
.SE rs_pbuf ,
.SE rs_cbuf ,
and
.SE rs_nbuf
entries are the buffers on which the previous,
current,
and next tuple are stored.
These buffers are kept pinned,
which guarantees that they will not be evicted from the
shared buffer cache by any backend.
Pinning the buffers allows the backend to reference them
directly,
without reacquiring a pointer to the buffer every time
data on it is used.
The shared buffer cache is protected by mutual exclusion.
Some \*(PG ports use System V semaphores to implement exclusion.
Acquiring and releasing these semaphores is slow,
so pinning the buffers during a scan improves performance significantly.
.pp
The
.SE rs_dc
entry is intended to support tuple differencing,
which is not supported in the current version of the system.
.pp
The
.SE rs_mptid ,
.SE rs_mctid ,
and
.SE rs_mntid
entries support marking of positions in scans.
While processing some join strategies,
the executor marks a location so that it can return to it later.
When such a mark is made,
the tids of the
.SE rs_ptup ,
.SE rs_ctup ,
and
.SE rs_ntup
tuples are copied to the
.SE rs_mptid ,
.SE rs_mctid ,
and
.SE rs_mntid
entries.
.pp
The
.SE rs_mcd
entry is intended to support tuple differencing,
and is not used in the current system.
.pp
The
.SE rs_atend
entry indicates whether the scan should begin at the end of the
relation.
Some attempt is made to store a sensible value here,
but it is ignored by much of the code and should not be relied on.
.pp
The
.SE rs_tr
entry is a time range or snapshot time qualification
that indicates what historical tuples are of interest.
The special constant value
.SE NowTimeQual
indicates that only current data is of interest.
.pp
The entry
.SE rs_cdelta
is intended to support tuple differencing,
and is not used at present.
.pp
The structure entry
.SE rs_parallel_ok
was added to support parallelization of \*(PG for shared-memory
architectures by Wei Hong,
whose doctoral dissertation was on that topic.
This entry is no longer used.
.pp
The entry
.SE rs_nkeys
stores the number of
.SE ScanKeyEntryData
structures that are stored in the
.SE rs_key
entry that follows.
.pp
The
.SE rs_key
entry stores a variable-length vector of
.SE ScanKeyEntryData
structures,
one per scan key that are to be applied for the scan.
.sh 4 "The Index Scan Descriptor"
.pp
The declaration for
.SE IndexScanDescData
is
.(E
typedef struct IndexScanDescData {
	Relation		relation;
	Pointer		opaque;
	ItemPointerData	previousItemData;
	ItemPointerData	currentItemData;
	ItemPointerData	nextItemData;
	MarkData		previousMarkData;
	MarkData		currentMarkData;
	MarkData		nextMarkData;
	uint8			flags;
	Boolean		scanFromEnd;
	uint16		numberOfKeys;
	ScanKeyData	keyData;
	/* VARIABLE LENGTH ARRAY AT END OF STRUCT */
} IndexScanDescData;

typedef IndexScanDescData	*IndexScanDesc;
.)E
.pp
The
.SE relation
entry points at the reldesc for the relation being scanned.
.pp
The
.SE opaque
entry is for use by the indexed access method,
and is assigned no meaning by higher-level scan processing code.
The btree code uses it to store a pointer to a
.SE BTScanOpaque
structure,
defined in
.FI access/nbtree.h .
The
.SE BTScanOpaque
structure stores,
among other things,
the current buffer in use by the scan.
This is equivalent to the
.SE rs_cbuf
entry of the
.SE HeapScanDescData
structure.
.pp
The
.SE previousItemData ,
.SE currentItemData ,
and
.SE nextItemData
entries store the tids of the previous,
current,
and next index tuples
(\fInot\fP heap tuples!)
returned by the scan.
As was the case for heap scans,
it was originally believed that index scans would change direction
frequently,
making it useful to cache these values.
In fact,
this never happens,
and the
previous and next tuple tids could be removed from this structure.
.pp
The
.SE previousMarkData ,
.SE currentMarkData ,
and
.SE nextMarkData
entries serve the same purpose as
.SE rs_mptid ,
.SE rs_mctid ,
and
.SE rs_mntid
in the
.SE HeapScanDescData
structure.
.pp
The
.SE flags
entry is used to encode the direction of the scan
(forwards, backwards, or no movement).
The constants for these three values are declared in
.FI access/sdir.h ,
and are
.SE BackwardScanDirection ,
.SE NoMovementScanDirection ,
and
.SE ForwardScanDirection .
.pp
The
.SE scanFromEnd
flag is
.i true
if the scan should begin at an endpoint,
and false otherwise.
For example,
in the btree code,
a scan for all values greater than 500 could begin at 500 and
move forward,
or at the end of the relation and move backward.
.pp
The
.SE numberOfKeys
entry records the number of keys stored in the
.SE keyData
entry that follows. Currently, 
.SE numberOfKeys 
is hardcoded to 1 in many places since multi-key indexing is not
supported at this time. 
.pp
The
.SE keyData
entry stores a vector of
.SE ScanKeyEntryData
structures that describe the scan key in use on the index.
.sh 2 "The \*(PG Access Method Interface"
.pp
This section describes the programmatic interface to the \*(PG
access methods.
The previous sections described the data structures that are
important when using the access methods.
Here,
the structures are filled in and used to do real work.
.pp
The prototypes for the functions described in this section
are not well-managed in the current version of the system.
Some routines are not prototyped at all,
and the prototypes that do exist are not ANSI.
The header files in the directory
.FI access
are where the prototypes appear.
.pp
The basic operations covered here are
.BU
opening and closing relations,
.BU
using scans to fetch tuples,
.BU
fetching particular attributes of a tuple,
and
.BU
inserting,
deleting,
and replacing tuples.
.pp
There are two classes of interfaces.
The routines whose names begin with
.SE heap_
operate on heap relations.
The routines whose names begin with
.SE index_
work on index relations.
The
.SE index_
routines use the
.RN pg_am
tuple for the access method to call the routine that does the
work for a particular access method.
This dispatch happens transparently to the user.
.sh 3 "Managing the System Catalogs"
.pp
Because backend code frequently scans and updates the system catalogs,
and because these catalogs often have special indices defined on them,
\*(PG includes special interfaces for dealing with some system catalogs.
This section describes those interfaces.
.sh 4 "The System Cache"
.pp
Each \*(PG backend maintains,
in private memory,
a cache of system catalog tuples that are frequently required
during query processing.
Cache consistency is provided by some extremely complicated
invalidation code that runs at transaction boundaries.
This cache,
called the
.i "sys cache" ,
actually consists of a number of different caches
that support fast lookup of catalog tuples by various attributes.
The cache management code appears in the source files
.FI utils/cache/syscache.c
and
.FI utils/cache/catcache.c ,
and the cache invalidation code is in
.FI storage/ipc/sinval.c
and
.FI utils/cache/inval.c .
.pp
The primary interface for using the sys cache is
.(E
HeapTuple
SearchSysCacheTuple(int cacheid, char *key1, char *key2,
		    char *key3, char *key4)
.)E
The
.SE cacheid
argument is the sys cache of interest.
The system currently supports the following caches:
.TS
center allbox;
cfI cfI cfI
l l l.
Cache ID	Relation	Cache Key Attributes
AMOPOPID	pg_amop	amopclaid, amopopr
AMOPSTRATEGY	pg_amop	amopid, amopclaid, amopstrategy
ATTNAME	pg_attribute	attrelid, attname
ATTNUM	pg_attribute	attrelid, attnum
INDEXRELID	pg_index	indexrelid
LANNAME	pg_language	lanname
OPRNAME	pg_operator	oprname, oprleft, oprright, oprkind
OPROID	pg_operator	OID
PRONAME	pg_proc	proname, pronargs, proargtypes
PROOID	pg_proc	OID
RELNAME	pg_class	relname
RELOID	pg_class	OID
TYPNAME	pg_type	typname
TYPOID	pg_type	OID
AMNAME	pg_am	amname
CLANAME	pg_opclass	opcname
INDRELIDKEY	pg_index	indrelid, indkey
INHRELID	pg_inherits	inhrelid, inhseqno
PRS2PLANCODE	pg_prs2	prs2ruleid, prs2planno
RULOID	pg_rewrite	OID
PRS2STUB	pg_prs2	prs2relid, prs2no
PRS2RULEID	pg_prs2	prs2name
PRS2EVENTREL	pg_prs2	OID
AGGNAME	pg_aggregate	aggname
NAMEREL	pg_naming	parent_oid, filename
LOBJREL	pg_large_object	object_oid
LISTENREL	pg_listener	relname, pid
USENAME	pg_user	usename
USESYSID	pg_user	usesysid
GRONAME	pg_group	groname
GROSYSID	pg_group	grosysid
REWRITENAME	pg_rewrite	rulename
.TE
The
.SE USESYSID
cache,
for example,
allows fast lookup of users in
.RN pg_user
by the
.RN usesysid
attribute.
To do such a lookup,
.(E
HeapTuple pg_user_tup;

pg_user_tup = SearchSysCacheTuple(USESYSID, 1806, 0, 0, 0);
.)E
If any tuple with
.RN usesysid
equal to 1806 exists in
.RN pg_user ,
it is loaded into the cache if necessary,
and a pointer to it is returned.
.sh 4 "Maintaining System Catalog Indices"
.pp
Since system catalog indices are critical to the fast and correct
functioning of \*(PG,
several support routines have been defined to open,
update,
and close indices on a given catalog when it is updated.
An example of the use of these routines appears in the file
.FI catalog/heap.c ,
in the routine
.SE AddNewAttributeTuples .
The routines are defined in
.FI catalog/indexing.c .
These rouintes open all the indices on some catalog relation,
insert index tuples into all the indices,
and close the indices.
The interfaces are
.(E
void
CatalogOpenIndices(int nIndices, char **names, 
		   Relation *ind_relns)

void
CatalogIndexInsert(Relation *ind_relns, int nIndices,
		   Relation heap_reln, HeapTuple htup)

void
CatalogCloseIndices(int nIndices, Relation *ind_relns)
.)E
For
.SE CatalogOpenIndices,
the
.SE nIndices
argument is the number of indices on the catalog;
constants for all catalogs are defined in
.FI catalog/indexing.h .
The
.SE names
vector is an array of names of indices on the catalog;
constants for these arrays are provided in the same header file.
Finally,
.SE ind_relns
is a vector of index reldesc pointers that is large enough to hold
.SE nIndices
reldesc pointers.
This is filled in by
.SE CatalogOpenIndices .
.pp
For
.SE CatalogIndexInsert ,
the
.SE ind_relns
argument is the reldesc vector returned by
.SE CatalogOpenIndices .
.SE NIndices
is the number of indices on the catalog.
The
.SE heap_reln
argument is the catalog relation being updated,
and
.SE htup
is the new tuple inserted into the catalog.
When
.SE CatalogIndexInsert
returns,
all of the indices on the system catalog will have been updated
with index tuples pointing at the new heap tuple.
.pp
For
.SE CatalogCloseIndices ,
.SE nIndices
is the number of indices to close,
and
.SE ind_relns
is the reldesc vector from
.SE CatalogOpenIndices .
.sh 3 "Opening and Closing Relations"
.pp
Relations may be opened by relid (object ID of the corresponding
.RN pg_class
tuple)
or by name.
The interfaces are
.(E
Relation
heap_open(ObjectId relid)

Relation
heap_openr(Name relname)

Relation
index_open(ObjectId relid)

Relation
index_openr(Name relname)
.)E
The first two open the requested heap relation if it exists,
and return the reldesc for it.
The second two do the same for index relations.
If the named relations do not exist,
the routines will report a
.SE NOTICE -level
.SE elog
message,
but will not abort.
Instead,
they return a NULL reldesc.
.pp
In order to close a relation,
.(E
void
heap_close(Relation reldesc)

void
index_close(Relation reldesc)
.)E
.sh 3 "Using Scans"
.pp
Once a relation has been opened,
a scan may be set up on it to find tuples of interest.
.sh 4 "Beginning and Ending Scans"
.pp
To set up a scan key,
.(E
void
ScanKeyEntryInitialize(ScanKeyEntry entry, bits16 flags,
		       AttributeNumber attributeNumber,
		       RegProcedure procedure, Datum argument)
.)E
As many scan key entries as desired may be defined for a single scan.
These should be allocated as an array of
.SE ScanKeyEntryData
structures,
as shown in section
\n($1.3.2.
.\" XXX maintainable section ref
.pp
Once a set of scan keys is initialized,
the routines
.(E
HeapScanDesc
heap_beginscan(Relation reldesc, bool atend,
	       TimeQual timeQual, unsigned nkeys, ScanKey key)

IndexScanDesc
index_beginscan(Relation reldesc, bool scanFromEnd,
	        uint16 numberOfKeys, ScanKey key)
.)E
will begin scans using the keys.
The keys should be in the array beginning at address
.SE key .
.pp
To end an open scan,
.(E
void
heap_endscan(HeapScanDesc scan)

void
index_endscan(IndexScanDesc scan)
.)E
The scans may no longer be used once the
.SE endscan
routine has been called on them.
.sh 4 "Fetching Qualifying Tuples"
.pp
Once a scan has been initialized by the
.SE beginscan
routines,
qualifying tuples may be fetched from the relation by calling
the routines
.(E
HeapTuple
heap_getnext(HeapScanDesc scan, int backw, Buffer *b)

RetrieveIndexResult
index_getnext(IndexScanDesc scan, ScanDirection direction)
.)E
In both cases,
the
.SE scan
argument is the scan to use to find qualifying tuples.
For heap scans,
if
.SE backw
is non-zero,
then the scan moves backwards;
otherwise,
it moves forwards.
There is never a reason to set the
.SE backw
flag.
For index scans,
.SE direction
may be one of
.SE ForwardScanDirection ,
.SE NoMovementScanDirection ,
or
.SE BackwardScanDirection .
.SE ForwardScanDirection
is commonly used,
and the others may not work.
.pp
The purpose of the
.SE b
argument for the heap routine is described below.
.sh 4 "Buffer Management"
.pp
\*(PG manages a shared cache of recently-used buffers.
This cache is available to all of the backends that are running
concurrently.
The shared cache allows some backends to take advantage of work
done by others.
For example,
pages from the system catalog are typically moved into the cache
by one backend and read by many others.
.pp
When a given backend has a pointer into one of the pages in the
shared buffer cache,
that buffer is
.i pinned .
A pinned buffer cannot be evicted from the cache.
When the pointer is dropped,
the buffer may be safely unpinned and evicted from the cache,
and the space that it occupied may be used by another page from
a database.
.pp
When scanning tuples in a relation,
the user may choose to examine the tuple directly on the page,
or to copy the tuple to private space and to examine the copy.
The benefit to using the tuple on the page is that no extra
copy is required.
The down side is that the user must be more careful in managing
his scan.
In particular,
he must unpin the buffer when he no longer plans to use the
pointer into it.
.pp
The interface to fetch tuples from a heap scan is
.(E
HeapTuple
heap_getnext(HeapScanDesc scan, int backw, Buffer *b)
.)E
The
.SE b
argument is a pointer to a value of type
.SE Buffer .
.SE Buffer
is basically an integer,
which is the number of the buffer in the shared buffer cache.
.BU
If the
.SE b
argument to
.SE heap_getnext
is not
.SE "(Buffer *) NULL" ,
then the buffer number on which the returned tuple appears will
be copied to the
.SE Buffer
value that
.SE b
points to.
.BU
If the
.SE b
argument is NULL,
then memory is allocated (via
.SE palloc \**)
.(f
\**
.SE Palloc
is the \*(PG memory allocator. 
.SE Palloc
allocates memory from memory pools, some of which perform garbage
collection at transaction end. 
.)f
and the tuple is copied from the data page to the allocated memory.
.pp
The following code fragments show how to use the
.SE b
argument in both cases.
The first example uses the tuple directly on the data page.
.(E
HeapTuple htup;
Buffer b;

\&...
htup = heap_getnext(hscan, 0, &b);

/* ... process the tuple ... */

/* all done */
ReleaseBuffer(b);
.)E
When
.SE heap_getnext
returns,
.SE b
is the buffer number on which the tuple appears,
and the
.i "pin count"
on
.SE b
has been incremented to account for this new reference.
After the call to
.SE ReleaseBuffer ,
the tuple pointed to by
.SE htup
may no longer be used.
Some of the most insidious bugs in the project's history were
from users who unpinned buffers,
but continued to use tuples on them.
The buffer would be paged out at some later time by another process,
and the user would have a pointer into some random location of
another data page.
.pp
The second example shows how to use a copy of the tuple
that does not reside on the page.
.(E
HeapTuple htup;

\&...
htup = heap_getnext(hscan, 0, (Buffer *) NULL);

/* ... process the tuple ... */

/* all done */
pfree(htup);
.)E
In this case,
the user should release the memory allocated to
the tuple before returning.
Allocated memory is automatically freed at transaction boundaries,
but relying on this feature causes memory leaks that make
\*(PG run slowly and consume large amounts of system memory.
.sh 4 "Using an Index Scan"
.pp
Buffer management is not required for index scans.
Data of interest are copied to a special structure,
called a
.SE RetrieveIndexResult .
A
.SE RetrieveIndexResult
includes the tid of the index tuple
and the tid of the heap tuple that it references.
To use these,
.(E
RetrieveIndexResult res;
Relation heap_reln, index_reln;
HeapTuple htup;
ItemPointer heap_tid;
Buffer b;

\&...
indscan = index_beginscan(index_reln, ...);
res = index_getnext(indscan, ForwardScanDirection);
if (res) {
	heap_tid = RetrieveIndexResultGetHeapItemPointer(res);
	htup = heap_fetch(heap_reln, NowTimeQual, heap_tid, &b);
	if (HeapTupleIsValid(htup)) {

		/* ... process the heap tuple ... */

		/* all done */
		ReleaseBuffer(b);
	}
}
.)E
The index relation requires no explicit buffer management.
The heap tid of interest is extracted from the
.SE RetrieveIndexResult ,
and is used to fetch the heap tuple itself.
The
.SE heap_fetch
interface fetches a heap tuple by tid;
its declaration is
.(E
HeapTuple
heap_fetch(Relation relation, TimeQual timeQual,
           ItemPointer tid, Buffer *b)
.)E
The description of buffer management for heap scans applies to the
.SE heap_fetch
interface, as well.
.pp
If the resulting heap tuple does not satisfy the time qualification,
then
.SE heap_fetch
will return NULL.
Since the index may contain old data until the vacuum cleaner is run,
this is possible,
so the user must check the return value from
.SE heap_fetch .
.pp
If there are no more index tuples that match the qualification,
.SE index_getnext
will return NULL.
At that point,
the scan must be ended using
.SE index_endscan .
If 
.SE index_endscan 
is not called, any subsequent calls to
.SE index_getnext
on the scan will reset it to the beginning of the qualifying
index tuples,
and they will all be returned again.
This is not intended as a feature,
so should not be used.
.sh 3 "Fetching Attributes"
.pp
Once a tuple of interest has been fetched from the heap,
attributes of interest may be extracted from it and used.
The interfaces are
.(E
char *
heap_getattr(HeapTuple tup, Buffer b, int attnum,
             TupleDescriptor tupdesc, bool *isnull)

Pointer
index_getattr(IndexTuple tup, int attnum,
             TupleDescriptor tupdesc, bool *isnull)
.)E
In general,
users (and even \*(PG implementors) need not extract index tuple
values;
only access method implementors need to do that.
The rest of this section concentrates on the
.SE heap_getattr
interface.
.pp
The
.SE tup
argument points at the tuple of interest.
If
.SE b
is non-null,
then it is the buffer on which the tuple resides.
If the user is doing explicit buffer management via
.SE heap_fetch
or
.SE heap_getnext ,
then the returned buffer may be passed along to
.SE heap_getattr .
.pp
The tuple descriptor is actually the
.SE rd_att
entry from the reldesc for the relation that stores the tuple.
To extract the tuple descriptor,
.(E
Relation reln;
TupleDescriptor tupdesc;

\&...
tupdesc = RelationGetTupleDescriptor(reln);
.)E
.pp
The
.SE isnull
argument is a pointer to a value of type
.SE bool .
If the corresponding attribute has the database value null,
then
.SE isnull
will be
.i true
and the return value should be ignored.
.pp
Although it is declared to return type
.SE "char *" ,
.SE heap_getattr
actually returns type
.SE Datum ,
so the return value should always be coerced to that type.
This is an implementation mistake.
The \fCDatumGet\fP\fIType\fP macros described in
section
\n($1.3.2
.\" XXX maintainable section ref
convert values from
.SE Datum
to the type required by the user.
.sh 3 "Inserting New Tuples"
.pp
When a user submits a \*(PQ query to insert new tuples into a heap relation,
the \*(PG executor automatically updates all associated indices with
pointers to the new tuple after it has been inserted.
There is no simple interface for doing this in general on user relations
inside the \*(PG backend.
Section
\n($1.4.1.2
.\" XXX maintainable section ref
describes a simple way for doing this on system catalogs,
but the user must explicitly code updates to indices on user relations
when new values are inserted into the heap.
.pp
The code for the routine
.SE CopyFrom
in the source file
.FI commands/copy.c
contains the sample code that was used to produce this system.
That file is a reasonable reference for how to manage tuple
insertions correctly.
.sh 4 "Forming a Heap Tuple"
.pp
The routine
.(E
HeapTuple
heap_formtuple(AttributeNumber natts, TupleDescriptor tupdesc,
               Datum *values, char *nulls)
.)E
creates a heap tuple from the supplied vector of
.SE Datum
values.
The
.SE natts
argument is the number of attributes in the relation
(the
.SE rd_rel->relnatts
entry of the reldesc).
The
.SE tupdesc
is the
.SE rd_att
entry of the reldesc,
which may be fetched via
.SE RelationGetTupleDescriptor .
.pp
Both
.SE values
and
.SE nulls
are arrays of length
.SE natts .
.SE Values
contains a datum for every non-null attribute in the tuple.
The
\fIType\fP\fCGetDatum\fP macros can be used to initialize the
entries in this array.
If an attribute is null,
its space is ignored in the array.
For example,
if the third attribute is null,
then the value stored in the third location of the
.SE values
array will be ignored by
.SE heap_formtuple .
.pp
The
.SE nulls
array is a vector of characters,
one per attribute in the tuple.
If the corresponding attribute has the database value null,
then the array entry is the character
.i n .
If the corresponding attribute is non-null,
then the array entry is the ASCII blank character.
.pp
The heap tuple returned by
.SE heap_formtuple
consumes
.SE palloc ed
memory,
and should be
.SE pfree d
when the heap and index relations have all been updated.
.sh 4 "Inserting the Heap Tuple"
.pp
To insert the tuple returned by
.SE heap_formtuple ,
.(E
ObjectId
heap_insert(Relation reln, HeapTuple tup, double *off)
.)E
On return,
the
.SE tup->t_ctid
entry is the tid of the tuple in the heap.
The
.SE off
argument is unnecessary and should never be used.
Its purpose is unclear.
This argument may safely be C-language NULL.
.SE heap_insert
returns the object ID of the new tuple in the heap
(the
.SE t_oid
entry of
.SE tup
is also correctly set).
.sh 4 "Finding the Indices"
.pp
The procedure
.SE GetIndexRelations
finds all of the indices defined on a heap relation
and initializes an array with the reldescs for all of the indices.
The interface is
.(E
GetIndexRelations(ObjectId heap_relid, int *n_indices,
                  Relation **index_reldescs);
.)E
The
.SE heap_relid
argument is the
.SE rd_id
entry of the reldesc for the heap.
.SE N_indices
is set,
on return,
to the number of indices for the heap relation.
On return,
.SE index_reldescs
is a pointer to an array of reldescs,
one for each index defined on the heap.
.sh 4 "Updating the Indices"
.pp
A \*(PG index may be
.BU
.i functional ,
meaning that the index is on the return value of some function of
the heap tuple's attributes,
rather than on the attributes themselves;
.BU
.i partial ,
meaning that a predicate controls whether or not particular heap
tuples should appear in the index;
or
.BU
.i regular ,
meaning that the index is on
one or more attributes of all tuples in the heap.
.pp
Functional indices are used for some system catalogs,
since the current version of \*(PG does not support multi-column
btree indices.
Partial indices are very seldom used.
Normal indices are by far the most common.
The rest of this section ignores functional and partial indices,
and concentrates on normal ones.
For sample code that updates these two kinds of indices,
see the routine
.SE CopyFrom
in
.SE commands/copy.c .
Special interfaces,
described in section \n($1.4.1.2,
.\" XXX maintainable section ref
have been defined to update the functional and normal indices
on system catalogs,
so the user need not manage these by hand.
.pp
After calling
.SE GetIndexRelations ,
the user has the number of indices and all of the index reldescs
for the indices defined on the heap.
Updating the indices requires forming an index tuple for each
index and inserting it.
The following code fragment gives an example.
.(E
Relation heap_reln;
Relation *index_relns;
int n_indices;
Datum idatum;
HeapTuple htup;
IndexTuple itup;
HeapTuple pg_indextup;
Form_pg_index pgiform;
TupleDescriptor tupdesc;
TupleDescriptor itupdesc;
char *nulls;
int i;

\&...
tupdesc = RelationGetTupleDescriptor(heap_reln);

/* get space for null flags -- this is more than we need */
nulls = palloc(heap_reln->rd_rel->relnatts);
for (i = 0; i < heap_reln->rd_rel->relnatts; i++)
	nulls[i] = ' ';

/* ... insert the heap tuple, which sets htup->t_ctid ... */

/* update each index */
for (i = 0; i < n_indices; i++) {
	/* get the pg_index tuple for this index */
	pg_indextup = SearchSysCacheTuple(INDEXRELID,
	                                  index_relns[i]->rd_id,
	                                  NULL, NULL, NULL);
	pgiform = (Form_pg_index) GETSTRUCT(pg_indextup);

	/* form a datum to use for the index tuple */
	FormIndexDatum(index_relns[i]->rd_rel->relnatts,
		       (AttributeNumber *) &pgiform->indkey.data[0],
		       htup, tupdesc, InvalidBuffer,
		       &idatum, nulls, NULL);

	/* form the index tuple */
	itupdesc = RelationGetTupleDescriptor(index_relns[i]);
	itup = index_formtuple(1, itupdesc, &idatum, nulls);

	/* make it point at the heap tuple */
	itup->t_tid = htup->t_ctid;

	/* insert it */
	(void) index_insert(index_relns[i], itup, NULL, NULL);

	pfree(itup);
}

pfree(nulls);
.)E
The loop iterates over all the indices defined on the heap,
updating each in turn.
For each index,
it fetches the corresponding
.RN pg_index
tuple in the cache of system tuples,
extracts the
.SE Form_pg_index
structure from each,
and then forms a
.SE Datum
with the appropriate value for the index tuple.
Finally,
an index tuple is formed and inserted into the index.
.sh 3 "Deleting Tuples"
.pp
Compared to inserting tuples,
deleting tuples is simple.
Users never delete tuples from indices;
this is done only by the vacuum cleaner.
Deleting a heap tuple requires the tid of the tuple to be deleted.
The tid may be extracted from the heap tuple itself,
which can be found,
for example,
by a scan of the heap relation.
Given the tid,
the interface for deleting heap tuples is
.(E
RuleLock
heap_delete(Relation reln, ItemPointer tid)
.)E
The
.SE RuleLock
return value from
.SE heap_delete
was originally intended to support the \*(PG rule system,
but the design of the rule system changed,
so this value is not used.
.SE Heap_delete
always returns NULL.
.sh 3 "Replacing Existing Tuples"
.pp
When a heap tuple in \*(PG is replaced,
it is marked as deleted and a new tuple is inserted in the relation
with new values.
The new tuple has the same object ID as the original tuple,
but will be stored at a new location
and so will have a new tid.
.pp
Index tuples are never replaced.
If the heap tuples to which they point have been replaced,
the vacuum cleaner deletes them from the index.
No user action is required.
However,
if a heap tuple is replaced,
all the indices defined on it must have new index tuples inserted.
These index tuples must point at the new heap tuple.
The code for doing this is identical to the code for
inserting index tuples that point at new heap tuples,
and will not be repeated here.
.pp
Replacing a tuple consists of two steps.
First,
a new tuple is formed,
based on the old one,
but with some values changed.
Second,
the old tuple is replaced by the new one.
.pp
Replacing a tuple requires knowledge of the schema of the
relation storing the tuple,
since new values for particular attributes must be supplied.
The sample code fragment below assumes the existence of a relation
.RN emp
with the schema
.(E
emp(name = char16, salary = int4)
.)E
The code sample replaces a particular
.RN emp
tuple,
assigning the employee a salary of 50,000.
Note that the code fragment does not update any indices defined on
.RN emp ;
that code must be added if any such indices exist.
.(E
Relation empreln;
HeapTuple emptup;
HeapTuple newtup;
Buffer empbuf;
Datum value[2];
char nulls[2];
char repl[2];

\&...
/* by here, we have the emp tuple to replace */
value[0] = (Datum) 0;
value[1] = Int32GetDatum(50000);	/* new salary */

/* no null attributes in this tuple */
nulls[0] = nulls[1] = ' ';

/* replace the second att, not the first */
repl[0] = ' ';
repl[1] = 'r';

/* get a new tuple based on the old one, with the new salary */
newtup = heap_modifytuple(emptup, empbuf, empreln, value, 
		          nulls, repl);

/* replace the old tuple with the new one */
(void) heap_replace(empreln, &emptup->t_ctid, newtup);
.)E
The interface for
.SE heap_modifytuple
is
.(E
HeapTuple
heap_modifytuple(HeapTuple tuple, Buffer buffer, 
		 Relation relation, Datum *replValue, 
		 char *replNull, char *repl)
.)E
The
.SE tuple
argument is the tuple to replace.
The
.SE buffer
argument is the buffer in the buffer cache on which it appears,
as returned by
.SE heap_getnext
or
.SE heap_fetch .
The
.SE relation
argument is the relation containing the tuple.
.pp
The next three arguments control what values are stored in the new
tuple.
The first,
.SE replValue ,
is a vector of values of type
.SE Datum
containing one entry for each attribute in the relation.
If an attribute is not to be replaced,
but rather should be copied from the old tuple,
then the corresponding entry in
.SE replValue
will be ignored.
.pp
The
.SE replNull
argument contains one entry per attribute in the relation.
If the corresponding attribute is to be replaced,
and if the
.SE replNull
entry is the character
.i n ,
then the new value for that attribute is null.
If the value is to be replaced and the
.SE replNull
entry is an ASCII blank,
then the value stored in
.SE replValue
is consulted to get the new value for the attribute.
.pp
Finally,
the
.SE repl
argument contains one entry per attribute in the relation.
If the
.SE repl
entry for a particular attribute is the character
.i r ,
then the value should be replaced,
and the
.SE replValue
and
.SE replNull
entries for the attribute are used to assign a new value.
If the
.SE repl
entry is an ASCII blank,
then the value is copied from the original tuple.
.sh 3 "Creating and Destroying Relations"
.pp
Creating a heap relation requires the user to supply a schema
for the new relation.
Creating an index relation is easier,
because the schema of the index is defined in terms of the heap.
The two cases are treated separately below.
.sh 4 "Heap Relations"
.pp
Creating a heap relation requires a schema for the new relation.
The schema should be a
.SE TupleDescriptor ,
which is the same as the
.SE rd_att
entry of the reldesc for the relation about to be created.
Creating the tuple descriptor is straightforward;
a vector is allocated with one
.SE FormData_pg_attribute
structure for each attribute.
The
.SE FormData_pg_attribute
structures are filled in with the names, types,
and other appropriate information for the new attributes.
The
.SE attrelid
and
.SE attdefrel
entries may be left as
.SE InvalidObjectId ;
these will be filled in by the code to create the heap relation.
.pp
Once the tuple descriptor has been constructed,
creating the relation is straightforward.
The interface is
.(E
ObjectId
heap_create(char *relname, int archive, unsigned natts,
	    unsigned smgr, TupleDescriptor tupdesc)
.)E
The
.SE relname
argument is the name of the new relation;
it must be unique,
or
.SE heap_create
will abort the transaction.
.pp
The
.SE archive
argument,
although declared to be of type
.SE int
in the prototype,
is actually a character;
this should be one of
.i n
(no archiving),
.i l
(light archiving),
or
.i h
(heavy archiving).
Light and heavy archiving behave identically in the current system,
and cause deleted and replaced tuples to be copied to a special
archive by the vacuum cleaner.
No archiving causes the vacuum cleaner to discard deleted and replaced
tuples.
.pp
The
.SE natts
argument is the number of attributes for the new relation.
This should be the same as the length of the
.SE tupdesc
vector.
.pp
The
.SE smgr
argument is the storage manager on which the relation should be created.
The set of supported storage managers varies from installed system
to installed system.
Storage manager number zero is always magnetic disk.
Other numbers are installation-dependent.
.pp
Finally,
the
.SE tupdesc
argument is the tuple descriptor for the new relation.
.pp
.SE Heap_create
returns the object ID of the new relation's
.RN pg_class
tuple,
which is the same as the relid of the new relation.
.pp
Destroying a heap relation is even simpler:
.(E
void
heap_destroy(char *relname)
.)E
If the named relation does not exist,
.SE heap_destroy
aborts the transaction.
Otherwise,
the relation is destroyed and all indices defined on it are removed.
In addition,
if the relation inherited attributes from other relations,
its inheritance information is deleted from the system catalogs.
.sh 4 "Index Relations"
.pp
Creating an index relation is straightforward.
The interface is
.(E
void
index_create(Name heap_name, Name index_name,
	     FuncIndexInfo fInfo, ObjectId am_id,
	     AttributeNumber natts, AttributeNumber *attnos,
	     ObjectId *opclasses, uint16 paramCount,
	     Datum *params, LispValue predicate)
.)E
The
.SE fInfo ,
.SE paramCount ,
.SE params ,
and
.SE predicate
arguments are for defining functional or partial indices,
and will not be covered here.
.pp
The
.SE heap_name
argument is the name of the heap relation on which the index
is to be defined.
The
.SE index_name
argument is the name of the new index relation.
.pp
The
.SE am_id
argument is the object ID of the access method tuple for this index,
from
.RN pg_am .
.pp
The
.SE natts
argument is the number of attributes (keys) for this index.
The current system supports only single-key indices,
but this may change in a subsequent release.
The
.SE attnos
argument is a vector of attribute numbers in the heap.
If the index is to be defined on the third attribute in the heap,
then this vector would consist of a single entry,
which would be three.
.pp
The
.SE opclasses
argument is the object ID of the operator class tuple from
.SE pg_amop
to use for each attribute.
The operator classes map types and operators for the index to
particular functions in
.RN pg_proc .
.pp
Sample code that defines regular,
functional,
and partial indices may be found in the source file
.FI commands/defind.c .
.pp
When
.SE index_create
returns,
the new index has been created and tuples have been inserted into it
for each tuple in the heap relation.
.sh 3 "Scan Position Management"
.pp
Scans on index or heap relations support
.i marking
and
.i restoring
of the scan's position.
A mark records the current scan location,
and a restore restores the previously marked location.
Only one mark may be defined on an open scan.
.pp
Marking and restoring are of limited use outside the executor,
but the interfaces are included here for completeness.
They are
.(E
void
heap_markpos(HeapScanDesc hscan)

void
heap_restrpos(HeapScanDesc hscan)

void
index_markpos(IndexScanDesc iscan)

void
index_restrpos(IndexScanDesc iscan)
.)E
.sh 3 "Locking"
.pp
Users of the access method interface routines need not worry about
locking.
The access method routines acquire (and, when appropriate, release)
locks on relations in response to scans and updates.
Heap relations are protected by two-phase relation level locks.
Btree indices use Lehman-Yao\**
.(f
\**
Lehman, P., Yao, S., ``Efficient Locking for Concurrent Operations
on B-trees'', \fIACM Transactions on Database Systems\fR, 6(4),
December 1981.
.)f
short-term locking for high concurrency,
but since the underlying relations are locked at the relation level,
index updates are serialized.
Rtree indices use two-phase relation-level locking.
.pp
System catalogs in \*(PG are not subject to two-phase locking
in order to improve concurrency.
This means that under some circumstances concurrent transactions
that operate on the system catalogs are not serializable.
This flaw is common in commercial relational systems,
as well,
and occurs infrequently enough that it does not matter in practice.