11.. TThhee PPOOSSTTGGRREESS AAcccceessss MMeetthhooddss This section describes the POSTGRES access methods in detail. The major concepts covered here are +o the relation descriptor and its contents, +o the differences between heap and index relations, +o scan keys, scan descriptors, and the scan interface, and +o the POSTGRES access method interface routines. 11..11.. TThhee RReellaattiioonn DDeessccrriippttoorr The relation descriptor, or _r_e_l_d_e_s_c, is the in-memory data structure that describes an open relation. The reldesc is an argument to most of the procedures that operate on relations. Relations may be opened by name or by relation ID -- either of these uniquely identifies the relation to which it corresponds. A relation's ID, or _r_e_l_i_d, is the object ID of the tuple in ppgg__ccllaassss that describes it. The structure that stores a reldesc is declared in the source file uuttiillss//rreell..hh. The structure definition is ttyyppeeddeeff ssttrruucctt RReellaattiioonnDDaattaa {{ FFiillee rrdd__ffdd;; iinntt rrdd__nnbblloocckkss;; uuiinntt1166 rrdd__rreeffccnntt;; bbooooll rrdd__iissmmeemm;; bbooooll rrdd__iissnnaaiilleedd;; AAcccceessssMMeetthhooddTTuupplleeFFoorrmm rrdd__aamm;; RReellaattiioonnTTuupplleeFFoorrmm rrdd__rreell;; OObbjjeeccttIIdd rrdd__iidd;; PPooiinntteerr lloocckkIInnffoo;; TTuupplleeDDeessccrriippttoorrDDaattaa rrdd__aatttt;; //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY AATT EENNDD OOFF SSTTRRUUCCTT **// }} RReellaattiioonnDDaattaa;; ttyyppeeddeeff RReellaattiioonnDDaattaa **RReellaattiioonn;; The meaning of the rrdd__ffdd entry depends on the storage manager that stores the relation. For the magnetic disk storage manager, this is the POSTGRES virtual file descrip- tor (VFD) for the open file that stores the relation. On the Sony jukebox, this entry is meaningless, and is simply set to a non-negative value if the jukebox relation is suc- cessfully opened. 11 The rrdd__nnbblloocckkss entry is the number of blocks in the relation, but this number is not reliable. This field is used by the executor during scans of heap relations to avoid scanning past the end of the relation. For index relations, and during scans of heap relations, the value stored here will be wrong. It should not be used outside of the execu- tor. The entry rrdd__rreeffccoouunntt reflects the number of currently active references to this reldesc inside the backend. Every backend manages a private cache of relation descriptors. When the cache is full, reldescs with rrdd__rreeffccoouunntt equal to zero may be evicted to make room for new descriptors. If a single relation is opened by a single backend more than once, every open request returns a pointer to the same reldesc, and rrdd__rreeffccoouunntt is incremented to keep track of each reference. When the relation is closed, the reference count is decremented. The structure entry rrdd__iissmmeemm is unused in the current system. It is intended to support in-memory-only relations, changes for which need not be flushed to stable storage at transaction boundaries. The private cache of relation descriptors contains sev- eral reldescs that cannot be evicted, in order to guarantee that the cache continues to work. For example, in order to instantiate the relation descriptor for a user relation uusseerr__rreellnn, POSTGRES must open and scan ppgg__ccllaassss and ppgg__aattttrriibbuuttee for its relation and attribute data. If the reldesc for ppgg__ccllaassss is not in the cache, then no relation (including ppgg__ccllaassss) can ever be instantiated in the cache. The reldescs for ppgg__ccllaassss, ppgg__ttyyppee, and ppgg__aattttrriibbuuttee all fall into this category, and are initialized specially when the system starts up. To keep track of which relations may not be evicted from the private reldesc cache, every reldesc contains an rrdd__iissnnaaiilleedd entry. If this entry is set to _t_r_u_e, then the reldesc is nailed and may not be evicted from the cache, even if its rrdd__rreeffccoouunntt drops to zero. The rrdd__aamm entry stores a pointer to the access method tuple for the access method that manages this relation. Currently, POSTGRES supports a heap access method for stor- ing user and system data, and btree and rtree indexed access methods for storing indices on data. Support for a hashed indexed access method will be added in the near future. Because these access methods have different implementations, POSTGRES must know what actual routine to call to dispatch general access method interface calls for a particular access method. For indexed access methods, this information 22 is stored in the access method tuple. Since there is only one access method for user and system data (the heap), the rrdd__aamm entry is NULL for reldescs describing heap relations. The access method tuple contents will be described in section 1.1.1. The structure entry rrdd__rreell stores a pointer to the ppgg__ccllaassss tuple that describes this relation. The ppgg__ccllaassss tuple is copied to a safe place in memory when the relation is entered into the reldesc cache. The tuple contains, among other things, the relation's name and the user id of its owner. The complete contents of the tuple pointed to by rrdd__rreell are supplied in section 1.1.2. The entry rrdd__iidd stores the relid of the open relation. The relid is the object ID of the ppgg__ccllaassss tuple describing the relation. This does not appear in the tuple pointed to by the rrdd__rreell entry, because that tuple includes only the ordinary attributes, and not the system attributes, from the ppgg__ccllaassss tuple. Note that the ppgg__ccllaassss tuple has system attributes, as it is stored in the relation. These attributes simply are not copied into memory when the reldesc is initialized. The structure entry lloocckkIInnffoo points to an in-memory representation of the rule lock data associated with this relation. This entry is not used by the access methods. The rule system and the data stored in a rule lock are beyond the scope of this section. The final entry in the reldesc is rrdd__aatttt, a vector of attribute descriptor data for the relation. This vector stores one entry for every user-level (that is, non-system) attribute that the relation stores. This vector is initial- ized when the reldesc is loaded into the cache. The data in the vector come from a scan of ppgg__aattttrriibbuuttee. The contents of the vector are described in section 1.1.3. 11..11..11.. TThhee AAcccceessss MMeetthhoodd TTuuppllee FFoorrmm ((rrdd__aamm)) The rrdd__aamm entry in a reldesc points at an _a_c_c_e_s_s _m_e_t_h_o_d _t_u_p_l_e _f_o_r_m describing the access method that manages the relation1. This tuple has one attribute for each of the routines that implement the standard access method interface ____________________ 1 In POSTGRES, the word _f_o_r_m is often used to name struc- tures. In this instance, for example, the clause _a_c_c_e_s_s _m_e_t_h_o_d _t_u_p_l_e _f_o_r_m is used to name a structure that describes an access method tuple. 33 for the particular access method. Details of the interface routines appear in section 1.4. Here, we list only the con- tents of the access method tuple form. This declaration appears in the source file ccaattaalloogg//ppgg__aamm..hh, and describes the tuples stored in the ppgg__aamm system catalog. ttyyppeeddeeff ssttrruucctt AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD {{ NNaammeeDDaattaa aammnnaammee;; OObbjjeeccttIIdd aammoowwnneerr;; cchhaarr aammkkiinndd;; uuiinntt1166 aammssttrraatteeggiieess;; uuiinntt1166 aammssuuppppoorrtt;; RReeggPPrroocceedduurree aammggeettttuuppllee;; RReeggPPrroocceedduurree aammiinnsseerrtt;; RReeggPPrroocceedduurree aammddeelleettee;; RReeggPPrroocceedduurree aammggeettaattttrr;; RReeggPPrroocceedduurree aammsseettlloocckk;; RReeggPPrroocceedduurree aammsseettttiidd;; RReeggPPrroocceedduurree aammffrreeeettuuppllee;; RReeggPPrroocceedduurree aammbbeeggiinnssccaann;; RReeggPPrroocceedduurree aammrreessccaann;; RReeggPPrroocceedduurree aammeennddssccaann;; RReeggPPrroocceedduurree aammmmaarrkkppooss;; RReeggPPrroocceedduurree aammrreessttrrppooss;; RReeggPPrroocceedduurree aammooppeenn;; RReeggPPrroocceedduurree aammcclloossee;; RReeggPPrroocceedduurree aammbbuuiilldd;; RReeggPPrroocceedduurree aammccrreeaattee;; RReeggPPrroocceedduurree aammddeessttrrooyy;; }} AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD;; ttyyppeeddeeff AAcccceessssMMeetthhooddTTuupplleeFFoorrmmDD **AAcccceessssMMeetthhooddTTuupplleeFFoorrmm;; The structure entry aammnnaammee is the name of the access method. In the current system, this is one of _b_t_r_e_e, or _r_t_r_e_e. Support for _h_a_s_h is forthcoming. The aammoowwnneerr entry is the object ID of the ppgg__uusseerr tuple for the owner of this access method. For all existing access methods, this is the OID for the user _p_o_s_t_g_r_e_s. This entry is intended to support access protection, so that, for example, only the owner of an access method could change its interface routines. Such protection is not implemented in the current system. The aammkkiinndd entry is unused in the current system. The structure entry aammssttrraatteeggiieess is the number of strategies (operators) that can be used in searches on this index. 44 +o For btrees, this is 5: <, <=, =, >=, and >. +o For rtrees, this is 8, corresponding to the operators for left, left-or-overlap, overlap, right-or-overlap, right, same, contains, and contained-by. +o When hash tables are supported, the number of operators will be 1, for equality searches. Operators that are supported by an access method are listed in ppgg__aammoopp. This relation contains one group of operators for each operator class, or _o_p_c_l_a_s_s. An opclass is an access method/type pair. For example, an opclass iinntt44__ooppss is defined for btrees, so that iinntt44 values may be indexed using btrees. The structure entry aammssuuppppoorrtt is the number of support routines required by this access method. For example, when inserting keys into a btree, the btree access method requires one support routine that compares two keys and returns negative, zero, or positive, depending on whether the first key is less than, equal to, or greater than the second, respectively. Similarly, rtrees require routines that compute the size, intersection, and union of two rect- angles. These are not operators that are available to users for index searches, so they are stored separately from the strategies. These routines are stored in the relation ppgg__aammssuuppppoorrtt. The rest of the AAcccceessssMMeetthhooddTTuupplleeFFoorrmm structure stores the object IDs of functions that support the standard access method interface for this access method. Those routines are summarized in the table below. 55 +------------+-----------------------------------------------------+ |_E_n_t_r_y _n_a_m_e | _r_o_u_t_i_n_e _d_e_s_c_r_i_p_t_i_o_n | +------------+-----------------------------------------------------+ |aammggeettttuuppllee | Get the next tuple in a scan. | +------------+-----------------------------------------------------+ |aammiinnsseerrtt | Insert a new tuple into the index. | +------------+-----------------------------------------------------+ |aammddeelleettee | Delete a particular _t_i_d from the index. | +------------+-----------------------------------------------------+ |aammggeettaattttrr | Get a particular attribute from the index tuple. | +------------+-----------------------------------------------------+ |aammsseettlloocckk | Unsupported. | +------------+-----------------------------------------------------+ |aammsseettttiidd | Unsupported. | +------------+-----------------------------------------------------+ |aammffrreeeettuuppllee | Unsupported. | +------------+-----------------------------------------------------+ |aammbbeeggiinnssccaann | Start a scan with a qualification on the index key. | +------------+-----------------------------------------------------+ |aammrreessccaann | Reset an active scan to the beginning. | +------------+-----------------------------------------------------+ |aammeennddssccaann | End an active scan. | +------------+-----------------------------------------------------+ |aammmmaarrkkppooss | Mark the current position in an index scan. | +------------+-----------------------------------------------------+ |aammrreessttrrppooss | Restore the scan to the previously-marked position. | +------------+-----------------------------------------------------+ |aammooppeenn | Open the index relation. | +------------+-----------------------------------------------------+ |aammcclloossee | Close an open index relation. | +------------+-----------------------------------------------------+ |aammbbuuiilldd | Define an index on an existing heap relation. | +------------+-----------------------------------------------------+ |aammccrreeaattee | Create a new, empty, index relation. | +------------+-----------------------------------------------------+ |aammddeessttrrooyy | Destroy an existing index relation. | +------------+-----------------------------------------------------+ In general, the programmer need not worry about how function dispatch via the AAcccceessssMMeetthhooddTTuupplleeFFoorrmm works. This is handled properly for all indexed access methods by code in aacccceessss//ccoommmmoonn.. 11..11..22.. TThhee RReellaattiioonn TTuuppllee FFoorrmm ((rrdd__rreell)) The reldesc for a relation includes a pointer to the ppgg__ccllaassss tuple for the relation. When a reldesc is instan- tiated into the cache, it is initialized from the data for the relation from ppgg__ccllaassss. The user-level (that is, non- system) attributes for the ppgg__ccllaassss tuple constitute the _r_e_l_a_t_i_o_n _t_u_p_l_e _f_o_r_m. This structure is declared in 66 ccaattaalloogg//ppgg__rreellaattiioonn..hh2 as FFoorrmm__ppgg__rreellaattiioonn. The declaration is CCAATTAALLOOGG((ppgg__rreellaattiioonn)) BBOOOOTTSSTTRRAAPP {{ cchhaarr1166 rreellnnaammee;; ooiidd rreelloowwnneerr;; ooiidd rreellaamm;; iinntt44 rreellppaaggeess;; iinntt44 rreellttuupplleess;; ddtt rreelleexxppiirreess;; ddtt rreellpprreesseerrvveedd;; bbooooll rreellhhaassiinnddeexx;; bbooooll rreelliisssshhaarreedd;; cchhaarr rreellkkiinndd;; cchhaarr rreellaarrcchh;; iinntt22 rreellnnaattttss;; iinntt22 rreellssmmggrr;; iinntt2288 rreellkkeeyy;; ooiidd88 rreellkkeeyyoopp;; aacclliitteemm rreellaaccll[[11]];; }} FFoorrmmDDaattaa__ppgg__rreellaattiioonn;; ttyyppeeddeeff FFoorrmmDDaattaa__ppgg__rreellaattiioonn **FFoorrmm__ppgg__rreellaattiioonn;; The notation CCAATTAALLOOGG((ppgg__rreellaattiioonn)) BBOOOOTTSSTTRRAAPP {{ is turned into a structure declaration by cpp macros at com- pile time. This declaration is also used to produce setup files, called _b_k_i _f_i_l_e_s, that create and populate the _t_e_m_- _p_l_a_t_e_1 database. The structure entry rreellnnaammee is the name of the rela- tion. The rreelloowwnneerr entry is the object ID of the ppgg__uusseerr tuple describing the relation's owner. A relation is owned by the user that created it. The rreellaamm entry is the object ID of the ppgg__aamm tuple for the access method that manages this relation. For heap relations, this is zero, an invalid object ID. ____________________ 2 The class ppgg__ccllaassss was previously called ppgg__rreellaattiioonn. In all user-visible parts of the system, the name was con- verted in 1990, but internally (as in the names of system header files), the name was not always changed. 77 The rreellppaaggeess and rreellttuupplleess entries are, respectively, the approximate number of pages and tuples in this relation. These numbers are set by the vacuum cleaner and when an index is defined on a heap relation, and so are wrong in general. They are not changed when tuples are inserted into or deleted from the relation. When the relation is ini- tially created, both are zero. Both rreellttuupplleess and rreellppaaggeess are used by the planner to estimate plan costs during query optimization. Interestingly, then, if neither rreellttuupplleess nor rreellppaaggeess have been updated by the vacuum cleaner recently, their incorrect values can skew query optimization such that queries may run for an unexpectedly long time. The rreelleexxppiirreess entry is the amount of history that should be saved for this relation. This is not used in the current system. The rreellpprreesseerrvveedd entry is the date and time at which the relation was last vacuumed. This is used by the planner when it produces query plans for historical queries. If the historical query extends to before the time at which the relation was last vacuumed, then the archive must be scanned. Otherwise, it need not be. The rreellhhaassiinnddeexx entry is _t_r_u_e if an index exists on the relation. When an index is destroyed, this flag is not changed, and so an entry of _t_r_u_e may mean that there was recently an index on the relation, but that there no longer is. The vacuum cleaner restores this to the correct value when it runs, and defining an index on a heap relation always sets this entry to _t_r_u_e for the heap. RReellhhaassiinnddeexx is always _f_a_l_s_e for an index relation. The rreelliisssshhaarreedd entry is _t_r_u_e for some shared system catalogs. Most relations for database _d_b_n_a_m_e reside in the directory $$PPGGDDAATTAA//bbaassee//_d_b_n_a_m_e. However, some relations, such as ppgg__uusseerr and ppgg__ddaattaabbaassee, must be visible to users of all databases simultaneously. These relations are stored in the directory $$PPGGDDAATTAA, and the rreelliisssshhaarreedd entry in the ppgg__ccllaassss tuple for these relations is set to _t_r_u_e. The rreellkkiinndd entry has one of three values: +o _r means that the relation is an ordinary heap. +o _i means that the relation is an index. +o _u means that the relation is an uncatalogued (that is, temporary) heap relation which will be automatically destroyed when the transaction ends. 88 The rreellaarrcchh entry describes the frequency with which the relation should be archived. Although three levels exist, only two are actually supported. The three levels are +o _h, for heavy update traffic and frequent archival; +o _l, for light update traffic and infrequent archival; and +o _n, for no archival. The only levels actually supported are _h and _n. If the rreellaarrcchh entry is _n, then the vacuum cleaner discards all historical data from the relation when it runs. The rreellnnaattttss entry is the number of attributes in the relation. The rreellssmmggrr entry identifies the storage manager that manages storage for this relation. Storage manager zero is magnetic disk. On the system installed at UC Berkeley, storage managers one and two are, respectively, for the Sony optical disk jukebox and main memory relations. The storage manager code is called from the POSTGRES buffer manager, and is of no concern to access method implementors. The rreellkkeeyy entry is a vector of attribute numbers that define the unique key for this relation. This is not sup- ported in the current system. The structure ends with a variable-length vector of access control list information, stored in the rreellaaccll entry. The access control list is used to grant or deny read or write access on the relation to particular users. 11..11..33.. TThhee TTuuppllee DDeessccrriippttoorr DDaattaa ((rrdd__aatttt)) The reldesc ends with a vector of attribute data that describe the tuples stored in the relation. This is a vari- able-length entry; different relations have different num- bers of attributes, and so their rrdd__aatttt vectors are of dif- ferent length. There are two data structures of interest. The TTuupplleeDDeessccrriippttoorrDDaattaa structure, defined in aacccceessss//ttuuppddeesscc..hh, is a variable-length array of AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa entries. The AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structure, defined in ccaattaalloogg//ppgg__aattttrriibbuuttee..hh, contains detailed information on a single attribute. 99 The data structure for AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa is the same as a FFoorrmm__ppgg__aattttrriibbuuttee, which is CCAATTAALLOOGG((ppgg__aattttrriibbuuttee)) BBOOOOTTSSTTRRAAPP {{ ooiidd aattttrreelliidd;; cchhaarr1166 aattttnnaammee;; ooiidd aattttttyyppiidd;; ooiidd aattttddeeffrreell;; iinntt44 aattttnnvvaallss;; ooiidd aattttttyyppaarrgg;; iinntt22 aattttlleenn;; iinntt22 aattttnnuumm;; iinntt22 aattttbboouunndd;; bbooooll aattttbbyyvvaall;; bbooooll aattttccaanniinnddeexx;; ooiidd aattttpprroocc;; iinntt44 aattttnneelleemmss;; iinntt44 aattttccaacchheeooffff;; }} FFoorrmmDDaattaa__ppgg__aattttrriibbuuttee;; ttyyppeeddeeff FFoorrmmDDaattaa__ppgg__aattttrriibbuuttee **FFoorrmm__ppgg__aattttrriibbuuttee;; As above, the CCAATTAALLOOGG((ppgg__aattttrriibbuuttee)) BBOOOOTTSSTTRRAAPP line is converted to a struct declaration by cpp macros at compile time. The aattttrreelliidd entry is the object ID of the ppgg__ccllaassss tuple for the relation that contains this attribute. The aattttnnaammee entry is the name of the attribute. The aattttttyyppiidd entry is the object ID of the ppgg__ttyyppee tuple for the type of this attribute. The aattttddeeffrreell entry is used for object-oriented type management by POSTGRES. When a class is declared that inherits from some other class, the new class implicitly has all the attributes that the original class had. In this case, the aattttddeeffrreell entry is the object ID of the relation from which this attribute is inherited. The aattttnnvvaallss entry is intended to support query opti- mization. It should store the number of distinct values for this attribute that appear in the relation. This value should be computed at vacuum time, and stored in ppgg__aattttrriibbuuttee by the vacuum cleaner. However, the current vacuum cleaner does not compute statistics, so aattttnnvvaallss is always zero. If the correct value were maintained, the 1100 optimizer could use it to produce better query plans. The optimizer currently ignores aattttnnvvaallss. The aattttttyyppaarrgg entry is unused. The aattttlleenn entry is the length of this attribute, in bytes, for fixed-length attributes, or -1, for variable- length attributes. The aattttbboouunndd entry was intended to support array-valued attributes, but is not used by the existing POSTGRES array implementation. The aattttbbyyvvaall entry is _t_r_u_e if this is a pass-by-value attribute, and _f_a_l_s_e if it is pass-by-reference. This entry is set by looking at the ttyyppbbyyvvaall attribute in the ppgg__ttyyppee tuple for the type of this attribute. The aattttccaanniinnddeexx entry is _t_r_u_e if this entry can be indexed, and false otherwise. Array attributes cannot be indexed. The aattttpprroocc entry is unused. The aattttnneelleemmss entry is used for array attributes, to record the number of elements that may be stored in the attribute. The aattttccaacchheeooffff entry is never maintained on disk, but is set in memory to allow fast attribute lookups. The first time an attribute is fetched from a tuple, its offset is computed by summing the lengths of the attributes that pre- cede it. If all of the preceding attributes are fixed- length, and none are NULL, then the computed offset may be cached for use on subsequent lookups. The offset is stored in aattttccaacchheeooffff. The AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structure stores information on a single attribute. The vector that describes all of the attributes in a relation, rrdd__aatttt, contains pointers to in- memory AAttttrriibbuutteeTTuupplleeFFoorrmmDDaattaa structures for each attribute. The vector is stored in the TTuupplleeDDeessccrriippttoorrDDaattaa structure, whose declaration is ttyyppeeddeeff ssttrruucctt TTuupplleeDDeessccrriippttoorrDDaattaa {{ AAttttrriibbuutteeTTuupplleeFFoorrmm ddaattaa[[11]];; //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY **// }} TTuupplleeDDeessccrriippttoorrDDaattaa;; ttyyppeeddeeff TTuupplleeDDeessccrriippttoorrDDaattaa **TTuupplleeDDeessccrriippttoorr;; 1111 11..11..44.. TThhee SSttrraatteeggyy MMaapp aanndd SSuuppppoorrtt RRoouuttiinneess ffoorr IInnddeexx RReellddeessccss The sections above describe the contents of a reldesc that are common to the heap and indexed access methods. The indexed access methods store some additional information at the end of the reldesc. This information is used to select operators to apply during scans, and to find the support routines that are required by particular indexed access methods. These data structures, their layout, and their purposes are obscure. They reflect poor design of the indexed access methods. This section is difficult to under- stand, and may be safely ignored by everyone except the maintainer of the POSTGRES access methods. Immediately following the TupleDescriptorData vector, the indexed access methods have two additional pieces of data. The first is a pointer to the IInnddeexxSSttrraatteeggyyDDaattaa structure for this access method. The second is a pointer to the vector of support routines for the access method. Neither of these is a named entry of any data structure; their existence is assumed (and, at least so far, correctly maintained) by the code that manages reldescs for the indexed access methods. The problem with this design is that it makes it impos- sible to name these entries from the debugger. The follow- ing cumbersome GDB statements will print out the contents of these entries: pprriinntt ((IInnddeexxSSttrraatteeggyy))((&&rreellddeesscc-->>rrdd__aatttt[[ rreellddeesscc-->>rrdd__rreell-->>rreellnnaattttss]])) pprriinntt ((RReeggPPrroocceedduurree **))((((((cchhaarr **)) &&rreellddeesscc-->>rrdd__aatttt[[rreellddeesscc-->>rrdd__rreell-->>rreellnnaattttss]])) ++ ssiizzeeooff((IInnddeexxSSttrraatteeggyy)))) The first statement prints the pointer to the vector of index strategy map data; the second prints the pointer to the vector of support procedure IDs. These pointers must be dereferenced in order to examine the contents of the vec- tors. Lines have been broken so that the commands fit across the page; the statements should be typed on a single line. The strategy map is a vector of scan key entries. Scan keys are described in sections 1.3.1 and 1.3.2. The strat- egy map stores information on what procedure to call for a particular operator on this index. For example, a btree storing iinntt44 values would have the procedure ID of the iinntt44lltt procedure at the location corresponding to '<' in the strategy vector. The strategy map is constructed automati- cally by the system when the index relation is opened. The 1122 contents are read from the ppgg__aammoopp relation, which is keyed by operator class and access method. Finally, the vector of support procedures is simply an array of procedure IDs, or ppgg__pprroocc tuple OIDs, for the sup- port routines required by the access method. This vector is initialized when the index relation is opened by scanning the ppgg__aammssuuppppoorrtt relation. The tuples in ppgg__aammssuuppppoorrtt are keyed by access method and operator class. POSTGRES assigns no meaning to the entries in this vector; they are intended for use by the access method. The btree access method uses a single support routine, which compares two keys and returns less than zero, zero, and greater than zero, respec- tively, if the first key is less than, equal to, or greater than the second. The rtree access method uses three support routines, which compute the size of a rectangle in standard units, the intersection of two rectangles, and the union of two rectangles. 11..11..55.. SSuummmmaarryy ooff tthhee RReellaattiioonn DDeessccrriippttoorr DDaattaa SSttrruuccttuurree The reldesc describes an open relation, and is the argument that is passed to most of the routines that operate on relations. POSTGRES manages a private cache of reldescs, and at most one copy of the reldesc for a given relation appears in the cache at a time. The reldesc includes point- ers to the relation tuple form and a vector of information that describes the attributes that appear in tuples of the relation. For indexed access methods, the reldesc also includes a pointer to the access method tuple form, a pointer to a vector of strategy information that describes the operators supported by the access method for the opera- tor class, and a pointer to a vector of access method- specific support routines. 11..22.. HHeeaapp aanndd IInnddeexx RReellaattiioonnss Heap relations are the primary POSTGRES storage struc- ture; all user relations and all the system catalog informa- tion is stored in heaps. The POSTGRES heap is logically an unordered set of tuples. Heaps consist of zero or more 8192-byte blocks, each of which contains zero or more tuples. Index relations store key/value pairs, where the key allows fast lookup on some set of attributes from a heap, and the value is a pointer to a tuple in the heap. 11..22..11.. TTuuppllee IIddeennttiiffiieerrss Tuples are identified by _t_u_p_l_e _i_d_e_n_t_i_f_i_e_r_s, or tids. A tid is a triple of the form (_b_l_o_c_k_n_o, _p_a_g_e_n_o, _o_f_f_s_e_t). The 1133 original design of the system allowed for more than one page to be stored on a single block of a relation, but this fea- ture has never been used. As a result, the middle value (_p_a_g_e_n_o) is always zero. To confuse things further, the term "page" is used interchangeably with the term "block" in the code and by programmers talking about the system. Since the original usage of "page" was never implemented, it has come to mean exactly the same thing as "block." Thus a tid is a triple of the form (_b_l_o_c_k_n_o, 0, _o_f_f_- _s_e_t). The _b_l_o_c_k_n_o is the number of the block in the relation; blocks are numbered sequentially from zero. When a new block is allocated to a relation, it is given the next available block number. The _o_f_f_s_e_t stored in a tid is the offset of the tuple within the block. Blocks store zero or more tuples, and tuples are numbered sequentially from one within a block. The number of tuples that fit in a block depend on the schema of the relation that the block stores, and the number may vary for a given relation (for example, if the tuple includes variable-length attributes). A tid uniquely identifies a single tuple in a relation. In the current version of the system, relations are not com- pressed when they are vacuumed, so tuples do not move around inside a relation. Thus the tid is a valid identifier for the tuple until the tuple is replaced or deleted, at which point it (logically) vanishes from the relation. Note that other tuples on the same page as a deleted tuple do not change their tids when the tuple is deleted; the available slot is reserved for subsequent reuse. In fact, slots are never reused in the current system. The vacuum cleaner and the tuple allocation code may be changed to fix this problem in the future. 11..22..22.. DDaattaa SSttoorreedd iinn HHeeaapp RReellaattiioonnss Every heap tuple begins with a HHeeaappTTuupplleeDDaattaa structure. The declaration for this structure appears in the source file aacccceessss//hhttuupp..hh, and is ttyyppeeddeeff ssttrruucctt HHeeaappTTuupplleeDDaattaa {{ SSiizzee tt__lleenn;; IItteemmPPooiinntteerrDDaattaa tt__ccttiidd;; IItteemmPPooiinntteerrDDaattaa tt__cchhaaiinn;; uunniioonn {{ IItteemmPPooiinntteerrDDaattaa ll__llttiidd;; RRuulleeLLoocckk ll__lloocckk;; }} tt__lloocckk;; 1144 OObbjjeeccttIIdd tt__ooiidd;; CCoommmmaannddIIdd tt__ccmmiinn;; CCoommmmaannddIIdd tt__ccmmaaxx;; TTrraannssaaccttiioonnIIdd tt__xxmmiinn;; TTrraannssaaccttiioonnIIdd tt__xxmmaaxx;; AABBSSTTIIMMEE tt__ttmmiinn,, tt__ttmmaaxx;; AAttttrriibbuutteeNNuummbbeerr tt__nnaattttss;; cchhaarr tt__vvttyyppee;; cchhaarr tt__iinnffoommaasskk;; cchhaarr tt__lloocckkttyyppee;; uuiinntt88 tt__hhooffff;; cchhaarr tt__bbiittss[[MMiinnHHeeaappTTuupplleeBBiittmmaappSSiizzee // 88]];; }} HHeeaappTTuupplleeDDaattaa;; //** MMOORREE DDAATTAA FFOOLLLLOOWWSS AATT EENNDD OOFF SSTTRRUUCCTT **// ttyyppeeddeeff HHeeaappTTuupplleeDDaattaa **HHeeaappTTuuppllee;; This structure includes a variable-length bitmap vector at its end, after which the user-level attributes of the tuple appear. The tt__lleenn entry is the length of the entire tuple, including the HHeeaappTTuupplleeDDaattaa structure and the user data that follows it. This is used to allocate sufficient space in memory and on disk pages for the tuple. The tt__ccttiidd entry is the tid of this tuple. Storing the tid in the tuple is easier than computing it when it is needed (for example, when an index tuple is constructed to point at the heap tuple). When a tuple is replaced, POSTGRES maintains a pointer from the old version of the tuple to the new version. This pointer is stored in the tt__cchhaaiinn entry. When the new ver- sion of the tuple is successfully inserted, the page con- taining the old version is fetched, and the original tuple's tt__cchhaaiinn entry is set to the tt__ccttiidd entry of the new version. This strategy, called _f_o_r_w_a_r_d _c_h_a_i_n_i_n_g, leaves older records pointing at newer records. The advantage if forward chain- ing is that it allows records to be updated without changing indices that point at them if the indexed attribute has not chained. The tt__cchhaaiinn entry was originally intended to support tuple differencing, which would allow POSTGRES to store only the changed values for the new version in most cases, rather than the entire new tuple. Tuple differencing is not sup- ported in the current system. In addition, the system always inserts a new index record when a heap tuple is modi- fied, regardless of whether the indexed attribute has changed. This may be fixed in the future. 1155 The tt__lloocckk entry stores a rule lock for the tuple. A rule lock is a tag that identifies a set of rules that should be run whenever this tuple is manipulated. The mech- anism is beyond the scope of this section. The value stored in tt__lloocckk is a list of tids when the tuple is on disk, and is swizzled to an in-memory pointer when the tuple is copied into memory. The tt__ooiidd entry of the HHeeaappTTuupplleeDDaattaa structure is the object ID that uniquely identifies this tuple. Although tids will currently uniquely identify a tuple for its life- time, this may change someday, when tuple differencing and space reclamation are implemented. The OID will is guaran- teed by design never to change. Furthermore, when a tuple is replaced, its tid will change, but its OID is guaranteed not to. This means that the OID is a safe, globally unique, eternal identifier for the tuple. The tt__ccmmiinn and tt__ccmmaaxx entries are, respectively, the command IDs that inserted and deleted this tuple. In every transaction, the user may run up to 32768 individual com- mands. Any of these commands may update the database. The command ID is stored in every tuple to allow the system to keep track of which command inside a single transaction made a change. The reason that this is necessary is to allow subsequent commands inside the same transaction to see the effects of changes made by earlier commands in the transac- tion, even though those changes are not visible to other users running concurrently in other transactions. The tt__xxmmiinn and tt__xxmmaaxx entries are, respectively, the transaction IDs of the transactions that inserted and deleted this tuple. These transaction IDs are used by POST- GRES to check at runtime whether the transaction that made the changes ever committed. Changes made by uncommitted transactions may be safely ignored. The tt__ttmmiinn and tt__ttmmaaxx entries are, respectively, the commit times of the inserting and deleting transactions that operated on this tuple. When POSTGRES inserts a tuple, it writes the inserting command and transaction IDs, but leaves the tt__ttmmiinn entry empty. When the transaction commits, a special entry is written to the ppgg__ttiimmee relation. This entry contains the time at which the transaction committed. Later, the vacuum cleaner (or, in some cases, a POSTGRES backend in normal operation) will fill in the commit time for the inserting transaction. Once this time is filled in, POSTGRES does not need to check the transaction log in order to see whether the transaction ever committed. This sub- stantially speeds up processing, since the transaction log check generally requires disk accesses. 1166 The tt__nnaattttss entry is the number of attributes that are stored in this tuple. The number of attributes may not be the same as the number stored by the relation, since the relation schema may change over time as a result of the _a_d_d_a_t_t_r command. Any references to attributes beyond the end of the tuple will return the special database value NULL (note that database NULL is not the same as the zero value used by C). The tt__vvttyyppee entry is for the version type, whose pur- pose has been lost to history. This is unused by the cur- rent system. The value stored here is typically zero, although the header file aacccceessss//hhttuupp..hh claims that only _i, _r, and _d are supported. The tt__iinnffoommaasskk entry is used to encode information about the tuple to speed up fetching of attributes. The information encoded here includes flags indicating whether the tuple contains any null values or any variable-length attributes. The tt__lloocckkttyyppee entry indicates whether the value stored in the tt__lloocckk union is an on-disk (tid) or in-memory (pointer) lock representation. This is stored far from the union in order to keep the data structure tightly packed in memory. The tt__hhooffff entry is the number of bytes occupied by the tuple header (that is, by the HHeeaappTTuupplleeDDaattaa structure). This may vary from tuple to tuple, because if the tuple con- tains no nulls values, then the bitmap of null values (stored in tt__bbiittss, below) is not necessary, and does not appear in the tuple. For any tuple, the address of the HHeeaappTTuupplleeDDaattaa structure that stores it, plus the value stored in tt__hhooffff, is the address of the first byte of user data stored in the tuple. The tt__bbiittss vector contains one bit for every attribute in the tuple if the tuple contains nulls. For each attribute, the corresponding bit is set if the attribute is null, and is clear otherwise. Nulls are not actually stored in the tuple; just the tt__bbiittss bit for the null value appear. This keeps tuples containing null values small. If the tuple contains no null values, then the tt__bbiittss vector is left off the end of the structure, and user data begins immediately after the tt__hhooffff entry. Collectively, the values stored in the HHeeaappTTuupplleeDDaattaa structure are referred to as the tuple header. These values are accessible from the query language as _s_y_s_t_e_m _a_t_t_r_i_b_u_t_e_s. Every heap tuple has the same system attributes, but the _u_s_e_r _a_t_t_r_i_b_u_t_e_s they store vary from relation to relation, 1177 and possibly even within a relation. System attributes are all assigned attributes less than zero internally by the system. The User-defined attributes all have attribute numbers greater than zero. The names of the system attributes and their corresponding HHeeaappTTuupplleeDDaattaa structure elements are listed in the following table3 . +----------------------+------------------+ |_H_e_a_p_T_u_p_l_e_D_a_t_a _e_l_e_m_e_n_t | _S_y_s_t_e_m _A_t_t_r_i_b_u_t_e | +----------------------+------------------+ |tt__cchhaaiinn | chain | +----------------------+------------------+ |tt__ccmmaaxx | cmax | +----------------------+------------------+ |tt__ccmmiinn | cmin | +----------------------+------------------+ |tt__ccttiidd | ctid | +----------------------+------------------+ |tt__ooiidd | oid | +----------------------+------------------+ |tt__lloocckk | rlock | +----------------------+------------------+ |tt__ttmmaaxx | tmax | +----------------------+------------------+ |tt__ttmmiinn | tmin | +----------------------+------------------+ |xx__xxmmaaxx | xmax | +----------------------+------------------+ |tt__xxmmiinn | xmin | +----------------------+------------------+ |tt__vvttyyppee | vtype | +----------------------+------------------+ 11..22..33.. DDaattaa SSttoorreedd iinn IInnddeexx RReellaattiioonnss All POSTGRES indices are secondary -- that is, ordinary system and user data are stored in heap relations, and indices store pointers into the heaps. All of the indexed access methods store index tuples, which are much smaller than heap tuples. In general, the indexed access methods also store other data on POSTGRES pages; what data are stored depends on the indexed access method. For the rtree ____________________ 3 At one time, there existed a system attribute _a_n_c_h_o_r which corresponded with the HHeeaappTTuupplleeDDaattaa element tt__aanncchhoorr. _A_n_c_h_o_r was the anchor point, or complete tuple, stored in a sequence of difference tuples. Neither are used in the sys- tem at this time since tuple differencing has not yet been implemented. 1188 and btree access methods, some pages are designated as _i_n_t_e_r_n_a_l pages, and only store pointers to other pages in the same index. _L_e_a_f pages, on the other hand, store point- ers into the heap. A particular index is defined on a single heap rela- tion, and all the heap tids that the index stores refer to that relation. Since indices support fast keyed lookup, the indices also store key data, which are typically either an attribute from the heap or some function of an attribute or attributes from the heap. For example, if the EEMMPP class stored employee records, and included a ssaallaarryy attribute, then a btree index on EEMMPP..ssaallaarryy would store the salaries from EEMMPP together with the tids of the tuples with each par- ticular salary. The index tuple format is defined in aacccceessss//iittuupp..hh, and is ttyyppeeddeeff ssttrruucctt IInnddeexxTTuupplleeDDaattaa {{ IItteemmPPooiinntteerrDDaattaa tt__ttiidd;; uunnssiiggnneedd sshhoorrtt tt__iinnffoo;; }} IInnddeexxTTuupplleeDDaattaa;; //** MMOORREE DDAATTAA FFOOLLLLOOWWSS **// ttyyppeeddeeff IInnddeexxTTuupplleeDDaattaa **IInnddeexxTTuuppllee;; The tt__ttiidd entry is the tid of the heap tuple that cor- responds to this index tuple. The tt__iinnffoo entry encodes some information about the index tuple. This is a sixteen-bit quantity whose layout is as follows: +o Bit fifteen (the leftmost bit) is set if the index tuple contains null values, and is clear otherwise. +o Bit fourteen is set if the index tuple contains vari- able-length attributes, and is clear otherwise. +o Bit thirteen is set if there are rules associated with the index tuple, and is clear otherwise. Rules on index tuples are not supported by the current system. +o Bits twelve through zero are the size of the tuple, in bytes. Immediately following the tt__iinnffoo entry is the index key. The index key is the set of all attributes from the heap tuple that form the search key for this index. The current implementations of the btree and rtree access meth- ods support only a single index key. This may change, at 1199 least for btrees, in the near future. 11..22..44.. HHooww tthhee RReellaattiioonnss aarree MMaannaaggeedd When the POSTQUEL user issues an update to a heap rela- tion (either inserting or deleting data), all of the corre- sponding indices are automatically updated, as well. POST- GRES stores, for every heap relation, the index relations defined on it, and for every index relation, the attributes from the heap that form the index key. These data are stored in ppgg__iinnddeexx. When the user issues a POSTQUEL query against a heap relation that requires the relation to be scanned, POSTGRES checks to see whether any of the indices defined on the heap will permit the scan to be executed more quickly. For exam- ple, if a btree index is defined on the ssaallaarryy attribute of the EEMMPP relation, then a query for all employees with salaries between $30,000 and $50,000 can be satisfied by scanning the btree index, and only selecting out the heap tuples with the correct salaries. This allows the scan to be completed more quickly than would a sequential scan of all of the tuples in the EEMMPP relation. 11..33.. SSccaann KKeeyyss,, SSccaann DDeessccrriippttoorrss,, aanndd SSccaannss A _s_c_a_n is the abstraction used in POSTGRES to search a heap or index relation for tuples that satisfy some qualifi- cation. A scan descriptor, or _s_c_a_n_d_e_s_c, is the data struc- ture that describes an open scan. The qualification is stored in the scandesc in data structures called _s_c_a_n _k_e_y_s. Each key consists of a procedure _p, an attribute number _k, and a value _v. Logically, every tuple _t in the relation is checked to see whether its _kth attribute, _t_k, passes the qualification stored in the scan key. The tuple passes the qualification if _p(_t_k,_v) returns true. In order for the tuple to satisfy the entire qualification, it must pass this test for every scan key stored in the scandesc. This means that a scandesc stores a conjunctive qualification on the relation. Disjunctive qualifications may be handled by using more than one scan, and suppressing duplicate tuples, or by using a scan with no keys, and applying the disjunc- tive tests manually. Once a scan is opened, it returns tuples one at a time. Every tuple that the scan returns is guaranteed to satisfy the qualification, and the scan is guaranteed to return all qualifying tuples. The rest of this section describes the data structures associated with scans, scandescs, and scan keys. 2200 11..33..11.. SSccaann KKeeyyss A scan is a conjunction of zero or more qualifications on single attributes in the relation. Every single- attribute qualification is stored in a scan key. The data structure used to store scan keys appears in the source file aacccceessss//sskkeeyy..hh. Its declaration is ttyyppeeddeeff ssttrruucctt SSccaannKKeeyyEEnnttrryyDDaattaa {{ bbiittss1166 ffllaaggss;; AAttttrriibbuutteeNNuummbbeerr aattttrriibbuutteeNNuummbbeerr;; RReeggPPrroocceedduurree pprroocceedduurree;; iinntt ((**ffuunncc)) (());; iinntt3322 nnaarrggss;; DDaattuumm aarrgguummeenntt;; }} SSccaannKKeeyyEEnnttrryyDDaattaa;; ##ddeeffiinnee CChheecckkIIffNNuullll 00xx11 ##ddeeffiinnee UUnnaarryyPPrroocceedduurree 00xx22 ##ddeeffiinnee NNeeggaatteeRReessuulltt 00xx44 ##ddeeffiinnee CCoommmmuutteeAArrgguummeennttss 00xx88 ttyyppeeddeeff SSccaannKKeeyyEEnnttrryyDDaattaa **SSccaannKKeeyyEEnnttrryy;; The ffllaaggss entry is a sixteen-bit bitmask describing the function to be called and its return values. The flag val- ues appear in the ##ddeeffiinnees that follow the data structure. If the CChheecckkIIffNNuullll flag is set, then the scan code should check for null attributes and return null for them without calling the function. However, this behavior is not imple- mented in the current system. If the UUnnaarryyPPrroocceedduurree bit is set, then this procedure takes a single argument (the appro- priate attribute from the tuple), rather than an attribute and a constant. If the NNeeggaatteeRReessuulltt bit is set, then the scan code will negate the logical value returned by the pro- cedure when checking for tuples that satisfy the qualifica- tion. Finally, if the CCoommmmuutteeAArrgguummeennttss flag is set, then the scan code will call the procedure with the supplied value and the attribute from the tuple, rather than the opposite order. The aattttrriibbuutteeNNuummbbeerr entry is the attribute number in the tuple to check. The pprroocceedduurree entry is the object ID of the ppgg__pprroocc tuple that describes the procedure to call. When the proce- dure is called for the first time, the ppgg__pprroocc relation is scanned and the appropriate function is dynamically loaded, if necessary. The ffuunncc entry is a pointer to a function that must return type bbooooll. This function pointer is initialized by 2211 the scan key from the ppgg__pprroocc tuple for the desired pprrooccee-- dduurree, and need not be set by the caller. The function pointer is cached to save repeated scans of ppgg__pprroocc. The nnaarrggss entry is the number of arguments that are required by the function. This is filled in from the ppgg__pprroocc tuple, and need not be set by the caller. However, it should always be either one or two. Finally, the aarrgguummeenntt entry is the additional argument to the function. This must be set by the caller. 11..33..22.. SSeettttiinngg uupp SSccaann KKeeyyss In order to properly set up a scan key, the caller must know which attributes in the tuple are of interest, the object IDs of the procedures that should be used to test the attributes, and what values they should be tested against. In order to simplify scan key setup, a number of conventions have been adopted in the POSTGRES source code. First, the schemas for all of the system catalogs are defined in header files in the ccaattaalloogg directory. The header files are named _r_e_l_n_a_m_e.h, where _r_e_l_n_a_m_e is the name of the catalog of interest. For example, the schema of the ppgg__uusseerr relation appears in the header file ccaattaa-- lloogg//ppgg__uusseerr..hh. The lone exception to this rule is that the schema for the ppgg__ccllaassss relation appears in the file ccaattaa-- lloogg//ppgg__rreellaattiioonn..hh. This is an artifact of the terminology change (relations became classes) that swept the project in the early 1990s. Second, the header files described above include stan- dard ##ddeeffiinnees for the relations that they describe. The relation's name, as a string, is NNaammee___r_e_l_n_a_m_e; for example, NNaammee__ppgg__uusseerr. The number of attributes in the relation is NNaattttss___r_e_l_n_a_m_e (NNaattttss__ppgg__uusseerr). Every attribute may be ref- erenced as AAnnuumm___r_e_l_n_a_m_e___a_t_t_n_a_m_e (for example, the uusseennaammee attribute of the ppgg__uusseerr relation is AAnnuumm__ppgg__uusseerr__uusseennaammee. These conventions permit programmers to use symbolic names, rather than embedded constants, to set up scans and open relations. Similar conventions have not been adopted for user relations, so programmers who want to set up and use scans on them must already know their schemas. Third, there exist some ##ddeeffiinneed constants for proce- dures that are frequently used in scans on the system cata- logs. These procedure IDs appear in the source file ccaattaa-- lloogg//ppgg__pprroocc..hh, and generally take the form _O_p_e_r_a_t_i_o_n_O_f_I_n_t_e_r_- _e_s_tRReeggPPrroocceedduurree -- for example, CChhaarraacctteerrEEqquuaallRReeggPPrroocceedduurree 2222 and OObbjjeeccttIIddEEqquuaallRReeggPPrroocceedduurree. Finally, the constant values used by POSTGRES in scans are of type DDaattuumm. A number of macros have been defined that convert embedded constants to values of type DDaattuumm. These macros are defined in the source file ttmmpp//ddaattuumm..hh, and are +---------------------------+--------------------------+--------------------------+ | _c_o_n_v_e_r_t _t_y_p_e | _f_r_o_m _D_a_t_u_m | _t_o _D_a_t_u_m | +---------------------------+--------------------------+--------------------------+ +---------------------------+--------------------------+--------------------------+ |one-byte char | DatumGetChar(_v) | CharGetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |one-byte integer | DatumGetInt8(_v) | Int8GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |unsigned one-byte integer | DatumGetUInt8(_v) | UInt8GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |two-byte integer | DatumGetInt16(_v) | Int16GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |unsigned two-byte integer | DatumGetUInt16(_v) | UInt16GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |four-byte integer | DatumGetInt32(_v) | Int32GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |unsigned four-byte integer | DatumGetUInt32(_v) | UInt32GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |four-byte float | DatumGetFloat32(_v) | Float32GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |eight-byte double | DatumGetFloat64(_v) | Float64GetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |void * | DatumGetPointer(_v) | PointerGetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |pointer to struct | DatumGetStructPointer(_v) | StructPointerGetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |16-byte char string name | DatumGetName(_v) | NameGetDatum(_v) | +---------------------------+--------------------------+--------------------------+ |object ID | DatumGetObjectId(_v) | ObjectIdGetDatum(_v) | +---------------------------+--------------------------+--------------------------+ In POSTGRES, four-byte floating point values are passed by reference, not by value. All other four-byte quantities are pass-by-value. A very common programming error in POSTGRES is to use values of type NNaammee as if they were pointers. In fact, NNaammee is a structure containing a sixteen-byte charac- ter array, and so will be passed on the stack if its address is not explicitly used. These facts should be documented elsewhere, but are worth mentioning here. The following code fragment shows how to set up a scan key on the ppgg__uusseerr relation for all tuples that have uusseennaammee equal to "mao" and an object ID of 1806: 2233 NNaammeeDDaattaa nnaammee;; SSccaannKKeeyyEEnnttrryyDDaattaa sskkeeyy[[22]];; bbzzeerroo((&&nnaammee,, ssiizzeeooff((nnaammee))));; bbccooppyy((nnaammee..ddaattaa[[00]],, ""mmaaoo"",, ssttrrlleenn((""mmaaoo""))));; SSccaannKKeeyyEEnnttrryyIInniittiiaalliizzee((&&sskkeeyy[[00]],, ((bbiittss1166))00xx00,, AAnnuumm__ppgg__uusseerr__uusseennaammee,, ((RReeggPPrroocceedduurree))NNaammeeEEqquuaallRReeggPPrroocceedduurree,, NNaammeeGGeettDDaattuumm((nnaammee))));; SSccaannKKeeyyEEnnttrryyIInniittiiaalliizzee((&&sskkeeyy[[11]],, ((bbiittss1166))00xx00,, OObbjjeeccttIIddAAttttrriibbuutteeNNuummbbeerr,, ((RReeggPPrroocceedduurree))OObbjjeeccttIIddEEqquuaallssRReeggPPrroocceedduurree,, OObbjjeeccttIIddGGeettDDaattuumm((11880066))));; 11..33..33.. TThhee SSccaann DDeessccrriippttoorr Once the scan keys are properly initialized, they may be used to open a scan on a relation. An open scan is described by a scandesc. The data structures that describe open scans are HHeeaappSSccaanns and IInnddeexxSSccaanns, and are declared in the source file aacccceessss//rreellssccaann..hh. 11..33..33..11.. TThhee HHeeaapp SSccaann DDeessccrriippttoorr The declaration for HHeeaappSSccaannDDeessccDDaattaa is ttyyppeeddeeff ssttrruucctt HHeeaappSSccaannDDeessccDDaattaa {{ RReellaattiioonn rrss__rrdd;; HHeeaappTTuuppllee rrss__ppttuupp;; HHeeaappTTuuppllee rrss__ccttuupp;; HHeeaappTTuuppllee rrss__nnttuupp;; BBuuffffeerr rrss__ppbbuuff;; BBuuffffeerr rrss__ccbbuuff;; BBuuffffeerr rrss__nnbbuuff;; ssttrruucctt ddcchhaaiinn **rrss__ddcc;; IItteemmPPooiinntteerrDDaattaa rrss__mmppttiidd;; IItteemmPPooiinntteerrDDaattaa rrss__mmccttiidd;; IItteemmPPooiinntteerrDDaattaa rrss__mmnnttiidd;; IItteemmPPooiinntteerrDDaattaa rrss__mmccdd;; BBoooolleeaann rrss__aatteenndd;; TTiimmeeQQuuaall rrss__ttrr;; uuiinntt1166 rrss__ccddeellttaa;; bbooooll rrss__ppaarraalllleell__ookk;; uuiinntt1166 rrss__nnkkeeyyss;; SSccaannKKeeyyDDaattaa rrss__kkeeyy;; //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY AATT EENNDD OOFF SSTTRRUUCCTT **// }} HHeeaappSSccaannDDeessccDDaattaa;; ttyyppeeddeeff HHeeaappSSccaannDDeessccDDaattaa **HHeeaappSSccaannDDeesscc;; 2244 The rrss__rrdd entry is a pointer to the reldesc for the relation on which this scan has been opened. The rrss__ppttuupp, rrss__ccttuupp, and rrss__nnttuupp entries are, respec- tively, the previous tuple, current tuple, and next tuple visited in this scan. The scan code caches these because it was originally thought that scans would change direction frequently. In fact, that is not the case, and maintaining the previous and next tuple pointers has turned out to be pure overhead. Getting rid of the previous and next tuple pointers everywhere that they appear in the scandesc would speed up scan processing. The rrss__ppbbuuff, rrss__ccbbuuff, and rrss__nnbbuuff entries are the buffers on which the previous, current, and next tuple are stored. These buffers are kept pinned, which guarantees that they will not be evicted from the shared buffer cache by any backend. Pinning the buffers allows the backend to reference them directly, without reacquiring a pointer to the buffer every time data on it is used. The shared buffer cache is protected by mutual exclusion. Some POSTGRES ports use System V semaphores to implement exclusion. Acquiring and releasing these semaphores is slow, so pinning the buffers during a scan improves performance significantly. The rrss__ddcc entry is intended to support tuple differenc- ing, which is not supported in the current version of the system. The rrss__mmppttiidd, rrss__mmccttiidd, and rrss__mmnnttiidd entries support marking of positions in scans. While processing some join strategies, the executor marks a location so that it can return to it later. When such a mark is made, the tids of the rrss__ppttuupp, rrss__ccttuupp, and rrss__nnttuupp tuples are copied to the rrss__mmppttiidd, rrss__mmccttiidd, and rrss__mmnnttiidd entries. The rrss__mmccdd entry is intended to support tuple differ- encing, and is not used in the current system. The rrss__aatteenndd entry indicates whether the scan should begin at the end of the relation. Some attempt is made to store a sensible value here, but it is ignored by much of the code and should not be relied on. The rrss__ttrr entry is a time range or snapshot time quali- fication that indicates what historical tuples are of inter- est. The special constant value NNoowwTTiimmeeQQuuaall indicates that only current data is of interest. The entry rrss__ccddeellttaa is intended to support tuple dif- ferencing, and is not used at present. 2255 The structure entry rrss__ppaarraalllleell__ookk was added to support parallelization of POSTGRES for shared-memory architectures by Wei Hong, whose doctoral dissertation was on that topic. This entry is no longer used. The entry rrss__nnkkeeyyss stores the number of SSccaannKKeeyyEEnnttrryy-- DDaattaa structures that are stored in the rrss__kkeeyy entry that follows. The rrss__kkeeyy entry stores a variable-length vector of SSccaannKKeeyyEEnnttrryyDDaattaa structures, one per scan key that are to be applied for the scan. 11..33..33..22.. TThhee IInnddeexx SSccaann DDeessccrriippttoorr The declaration for IInnddeexxSSccaannDDeessccDDaattaa is ttyyppeeddeeff ssttrruucctt IInnddeexxSSccaannDDeessccDDaattaa {{ RReellaattiioonn rreellaattiioonn;; PPooiinntteerr ooppaaqquuee;; IItteemmPPooiinntteerrDDaattaa pprreevviioouussIItteemmDDaattaa;; IItteemmPPooiinntteerrDDaattaa ccuurrrreennttIItteemmDDaattaa;; IItteemmPPooiinntteerrDDaattaa nneexxttIItteemmDDaattaa;; MMaarrkkDDaattaa pprreevviioouussMMaarrkkDDaattaa;; MMaarrkkDDaattaa ccuurrrreennttMMaarrkkDDaattaa;; MMaarrkkDDaattaa nneexxttMMaarrkkDDaattaa;; uuiinntt88 ffllaaggss;; BBoooolleeaann ssccaannFFrroommEEnndd;; uuiinntt1166 nnuummbbeerrOOffKKeeyyss;; SSccaannKKeeyyDDaattaa kkeeyyDDaattaa;; //** VVAARRIIAABBLLEE LLEENNGGTTHH AARRRRAAYY AATT EENNDD OOFF SSTTRRUUCCTT **// }} IInnddeexxSSccaannDDeessccDDaattaa;; ttyyppeeddeeff IInnddeexxSSccaannDDeessccDDaattaa **IInnddeexxSSccaannDDeesscc;; The rreellaattiioonn entry points at the reldesc for the rela- tion being scanned. The ooppaaqquuee entry is for use by the indexed access method, and is assigned no meaning by higher-level scan pro- cessing code. The btree code uses it to store a pointer to a BBTTSSccaannOOppaaqquuee structure, defined in aacccceessss//nnbbttrreeee..hh. The BBTTSSccaannOOppaaqquuee structure stores, among other things, the cur- rent buffer in use by the scan. This is equivalent to the rrss__ccbbuuff entry of the HHeeaappSSccaannDDeessccDDaattaa structure. The pprreevviioouussIItteemmDDaattaa, ccuurrrreennttIItteemmDDaattaa, and nneexxttIItteemmDDaattaa entries store the tids of the previous, current, and next index tuples (_n_o_t heap tuples!) returned by the scan. As was the case for heap scans, it was originally believed that index scans would change direction frequently, making it 2266 useful to cache these values. In fact, this never happens, and the previous and next tuple tids could be removed from this structure. The pprreevviioouussMMaarrkkDDaattaa, ccuurrrreennttMMaarrkkDDaattaa, and nneexxttMMaarrkkDDaattaa entries serve the same purpose as rrss__mmppttiidd, rrss__mmccttiidd, and rrss__mmnnttiidd in the HHeeaappSSccaannDDeessccDDaattaa structure. The ffllaaggss entry is used to encode the direction of the scan (forwards, backwards, or no movement). The constants for these three values are declared in aacccceessss//ssddiirr..hh, and are BBaacckkwwaarrddSSccaannDDiirreeccttiioonn, NNooMMoovveemmeennttSSccaannDDiirreeccttiioonn, and FFoorr-- wwaarrddSSccaannDDiirreeccttiioonn. The ssccaannFFrroommEEnndd flag is _t_r_u_e if the scan should begin at an endpoint, and false otherwise. For example, in the btree code, a scan for all values greater than 500 could begin at 500 and move forward, or at the end of the relation and move backward. The nnuummbbeerrOOffKKeeyyss entry records the number of keys stored in the kkeeyyDDaattaa entry that follows. Currently, nnuumm-- bbeerrOOffKKeeyyss is hardcoded to 1 in many places since multi-key indexing is not supported at this time. The kkeeyyDDaattaa entry stores a vector of SSccaannKKeeyyEEnnttrryyDDaattaa structures that describe the scan key in use on the index. 11..44.. TThhee PPOOSSTTGGRREESS AAcccceessss MMeetthhoodd IInntteerrffaaccee This section describes the programmatic interface to the POSTGRES access methods. The previous sections described the data structures that are important when using the access methods. Here, the structures are filled in and used to do real work. The prototypes for the functions described in this sec- tion are not well-managed in the current version of the sys- tem. Some routines are not prototyped at all, and the pro- totypes that do exist are not ANSI. The header files in the directory aacccceessss are where the prototypes appear. The basic operations covered here are +o opening and closing relations, +o using scans to fetch tuples, +o fetching particular attributes of a tuple, and +o inserting, deleting, and replacing tuples. 2277 There are two classes of interfaces. The routines whose names begin with hheeaapp__ operate on heap relations. The routines whose names begin with iinnddeexx__ work on index rela- tions. The iinnddeexx__ routines use the ppgg__aamm tuple for the access method to call the routine that does the work for a particular access method. This dispatch happens transpar- ently to the user. 11..44..11.. MMaannaaggiinngg tthhee SSyysstteemm CCaattaallooggss Because backend code frequently scans and updates the system catalogs, and because these catalogs often have spe- cial indices defined on them, POSTGRES includes special interfaces for dealing with some system catalogs. This sec- tion describes those interfaces. 11..44..11..11.. TThhee SSyysstteemm CCaacchhee Each POSTGRES backend maintains, in private memory, a cache of system catalog tuples that are frequently required during query processing. Cache consistency is provided by some extremely complicated invalidation code that runs at transaction boundaries. This cache, called the _s_y_s _c_a_c_h_e, actually consists of a number of different caches that sup- port fast lookup of catalog tuples by various attributes. The cache management code appears in the source files uuttiillss//ccaacchhee//ssyyssccaacchhee..cc and uuttiillss//ccaacchhee//ccaattccaacchhee..cc, and the cache invalidation code is in ssttoorraaggee//iippcc//ssiinnvvaall..cc and uuttiillss//ccaacchhee//iinnvvaall..cc. The primary interface for using the sys cache is HHeeaappTTuuppllee SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((iinntt ccaacchheeiidd,, cchhaarr **kkeeyy11,, cchhaarr **kkeeyy22,, cchhaarr **kkeeyy33,, cchhaarr **kkeeyy44)) The ccaacchheeiidd argument is the sys cache of interest. The sys- tem currently supports the following caches: 2288 The UUSSEESSYYSSIIDD cache, for example, allows fast lookup of users in ppgg__uusseerr by the uusseessyyssiidd attribute. To do such a lookup, HHeeaappTTuuppllee ppgg__uusseerr__ttuupp;; ppgg__uusseerr__ttuupp == SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((UUSSEESSYYSSIIDD,, 11880066,, 00,, 00,, 00));; If any tuple with uusseessyyssiidd equal to 1806 exists in ppgg__uusseerr, it is loaded into the cache if necessary, and a pointer to it is returned. 11..44..11..22.. MMaaiinnttaaiinniinngg SSyysstteemm CCaattaalloogg IInnddiicceess Since system catalog indices are critical to the fast and correct functioning of POSTGRES, several support rou- tines have been defined to open, update, and close indices on a given catalog when it is updated. An example of the use of these routines appears in the file ccaattaalloogg//hheeaapp..cc, in the routine AAddddNNeewwAAttttrriibbuutteeTTuupplleess. The routines are defined in ccaattaalloogg//iinnddeexxiinngg..cc. These rouintes open all the indices on some catalog relation, insert index tuples into all the indices, and close the indices. The interfaces are vvooiidd CCaattaallooggOOppeennIInnddiicceess((iinntt nnIInnddiicceess,, cchhaarr ****nnaammeess,, RReellaattiioonn **iinndd__rreellnnss)) vvooiidd CCaattaallooggIInnddeexxIInnsseerrtt((RReellaattiioonn **iinndd__rreellnnss,, iinntt nnIInnddiicceess,, RReellaattiioonn hheeaapp__rreellnn,, HHeeaappTTuuppllee hhttuupp)) vvooiidd CCaattaallooggCClloosseeIInnddiicceess((iinntt nnIInnddiicceess,, RReellaattiioonn **iinndd__rreellnnss)) For CCaattaallooggOOppeennIInnddiicceess,, the nnIInnddiicceess argument is the number of indices on the catalog; constants for all catalogs are defined in ccaattaalloogg//iinnddeexxiinngg..hh. The nnaammeess vector is an array of names of indices on the catalog; constants for these arrays are provided in the same header file. Finally, iinndd__rreellnnss is a vector of index reldesc pointers that is large enough to hold nnIInnddiicceess reldesc pointers. This is filled in by CCaattaallooggOOppeennIInnddiicceess. For CCaattaallooggIInnddeexxIInnsseerrtt, the iinndd__rreellnnss argument is the reldesc vector returned by CCaattaallooggOOppeennIInnddiicceess. NNIInnddiicceess is the number of indices on the catalog. The hheeaapp__rreellnn argu- ment is the catalog relation being updated, and hhttuupp is the new tuple inserted into the catalog. When CCaattaallooggIInnddeexxIInn-- sseerrtt returns, all of the indices on the system catalog will have been updated with index tuples pointing at the new heap tuple. 2299 For CCaattaallooggCClloosseeIInnddiicceess, nnIInnddiicceess is the number of indices to close, and iinndd__rreellnnss is the reldesc vector from CCaattaallooggOOppeennIInnddiicceess. 11..44..22.. OOppeenniinngg aanndd CClloossiinngg RReellaattiioonnss Relations may be opened by relid (object ID of the cor- responding ppgg__ccllaassss tuple) or by name. The interfaces are RReellaattiioonn hheeaapp__ooppeenn((OObbjjeeccttIIdd rreelliidd)) RReellaattiioonn hheeaapp__ooppeennrr((NNaammee rreellnnaammee)) RReellaattiioonn iinnddeexx__ooppeenn((OObbjjeeccttIIdd rreelliidd)) RReellaattiioonn iinnddeexx__ooppeennrr((NNaammee rreellnnaammee)) The first two open the requested heap relation if it exists, and return the reldesc for it. The second two do the same for index relations. If the named relations do not exist, the routines will report a NNOOTTIICCEE-level eelloogg message, but will not abort. Instead, they return a NULL reldesc. In order to close a relation, vvooiidd hheeaapp__cclloossee((RReellaattiioonn rreellddeesscc)) vvooiidd iinnddeexx__cclloossee((RReellaattiioonn rreellddeesscc)) 11..44..33.. UUssiinngg SSccaannss Once a relation has been opened, a scan may be set up on it to find tuples of interest. 11..44..33..11.. BBeeggiinnnniinngg aanndd EEnnddiinngg SSccaannss To set up a scan key, vvooiidd SSccaannKKeeyyEEnnttrryyIInniittiiaalliizzee((SSccaannKKeeyyEEnnttrryy eennttrryy,, bbiittss1166 ffllaaggss,, AAttttrriibbuutteeNNuummbbeerr aattttrriibbuutteeNNuummbbeerr,, RReeggPPrroocceedduurree pprroocceedduurree,, DDaattuumm aarrgguummeenntt)) As many scan key entries as desired may be defined for a single scan. These should be allocated as an array of 3300 SSccaannKKeeyyEEnnttrryyDDaattaa structures, as shown in section 1.3.2. Once a set of scan keys is initialized, the routines HHeeaappSSccaannDDeesscc hheeaapp__bbeeggiinnssccaann((RReellaattiioonn rreellddeesscc,, bbooooll aatteenndd,, TTiimmeeQQuuaall ttiimmeeQQuuaall,, uunnssiiggnneedd nnkkeeyyss,, SSccaannKKeeyy kkeeyy)) IInnddeexxSSccaannDDeesscc iinnddeexx__bbeeggiinnssccaann((RReellaattiioonn rreellddeesscc,, bbooooll ssccaannFFrroommEEnndd,, uuiinntt1166 nnuummbbeerrOOffKKeeyyss,, SSccaannKKeeyy kkeeyy)) will begin scans using the keys. The keys should be in the array beginning at address kkeeyy. To end an open scan, vvooiidd hheeaapp__eennddssccaann((HHeeaappSSccaannDDeesscc ssccaann)) vvooiidd iinnddeexx__eennddssccaann((IInnddeexxSSccaannDDeesscc ssccaann)) The scans may no longer be used once the eennddssccaann routine has been called on them. 11..44..33..22.. FFeettcchhiinngg QQuuaalliiffyyiinngg TTuupplleess Once a scan has been initialized by the bbeeggiinnssccaann rou- tines, qualifying tuples may be fetched from the relation by calling the routines HHeeaappTTuuppllee hheeaapp__ggeettnneexxtt((HHeeaappSSccaannDDeesscc ssccaann,, iinntt bbaacckkww,, BBuuffffeerr **bb)) RReettrriieevveeIInnddeexxRReessuulltt iinnddeexx__ggeettnneexxtt((IInnddeexxSSccaannDDeesscc ssccaann,, SSccaannDDiirreeccttiioonn ddiirreeccttiioonn)) In both cases, the ssccaann argument is the scan to use to find qualifying tuples. For heap scans, if bbaacckkww is non-zero, then the scan moves backwards; otherwise, it moves forwards. There is never a reason to set the bbaacckkww flag. For index scans, ddiirreeccttiioonn may be one of FFoorrwwaarrddSSccaannDDiirreeccttiioonn, NNooMMoovvee-- mmeennttSSccaannDDiirreeccttiioonn, or BBaacckkwwaarrddSSccaannDDiirreeccttiioonn. FFoorrwwaarrddSSccaannDDii-- rreeccttiioonn is commonly used, and the others may not work. The purpose of the bb argument for the heap routine is described below. 3311 11..44..33..33.. BBuuffffeerr MMaannaaggeemmeenntt POSTGRES manages a shared cache of recently-used buffers. This cache is available to all of the backends that are running concurrently. The shared cache allows some backends to take advantage of work done by others. For example, pages from the system catalog are typically moved into the cache by one backend and read by many others. When a given backend has a pointer into one of the pages in the shared buffer cache, that buffer is _p_i_n_n_e_d. A pinned buffer cannot be evicted from the cache. When the pointer is dropped, the buffer may be safely unpinned and evicted from the cache, and the space that it occupied may be used by another page from a database. When scanning tuples in a relation, the user may choose to examine the tuple directly on the page, or to copy the tuple to private space and to examine the copy. The benefit to using the tuple on the page is that no extra copy is required. The down side is that the user must be more care- ful in managing his scan. In particular, he must unpin the buffer when he no longer plans to use the pointer into it. The interface to fetch tuples from a heap scan is HHeeaappTTuuppllee hheeaapp__ggeettnneexxtt((HHeeaappSSccaannDDeesscc ssccaann,, iinntt bbaacckkww,, BBuuffffeerr **bb)) The bb argument is a pointer to a value of type BBuuffffeerr. BBuuffffeerr is basically an integer, which is the number of the buffer in the shared buffer cache. +o If the bb argument to hheeaapp__ggeettnneexxtt is not ((BBuuffffeerr **)) NNUULLLL, then the buffer number on which the returned tuple appears will be copied to the BBuuffffeerr value that bb points to. +o If the bb argument is NULL, then memory is allocated (via ppaalllloocc4) and the tuple is copied from the data page to the allocated memory. The following code fragments show how to use the bb argument in both cases. The first example uses the tuple directly on the data page. ____________________ 4 PPaalllloocc is the POSTGRES memory allocator. PPaalllloocc allo- cates memory from memory pools, some of which perform garbage collection at transaction end. 3322 HHeeaappTTuuppllee hhttuupp;; BBuuffffeerr bb;; ...... hhttuupp == hheeaapp__ggeettnneexxtt((hhssccaann,, 00,, &&bb));; //** ...... pprroocceessss tthhee ttuuppllee ...... **// //** aallll ddoonnee **// RReelleeaasseeBBuuffffeerr((bb));; When hheeaapp__ggeettnneexxtt returns, bb is the buffer number on which the tuple appears, and the _p_i_n _c_o_u_n_t on bb has been incre- mented to account for this new reference. After the call to RReelleeaasseeBBuuffffeerr, the tuple pointed to by hhttuupp may no longer be used. Some of the most insidious bugs in the project's his- tory were from users who unpinned buffers, but continued to use tuples on them. The buffer would be paged out at some later time by another process, and the user would have a pointer into some random location of another data page. The second example shows how to use a copy of the tuple that does not reside on the page. HHeeaappTTuuppllee hhttuupp;; ...... hhttuupp == hheeaapp__ggeettnneexxtt((hhssccaann,, 00,, ((BBuuffffeerr **)) NNUULLLL));; //** ...... pprroocceessss tthhee ttuuppllee ...... **// //** aallll ddoonnee **// ppffrreeee((hhttuupp));; In this case, the user should release the memory allocated to the tuple before returning. Allocated memory is automat- ically freed at transaction boundaries, but relying on this feature causes memory leaks that make POSTGRES run slowly and consume large amounts of system memory. 11..44..33..44.. UUssiinngg aann IInnddeexx SSccaann Buffer management is not required for index scans. Data of interest are copied to a special structure, called a RReettrriieevveeIInnddeexxRReessuulltt. A RReettrriieevveeIInnddeexxRReessuulltt includes the tid of the index tuple and the tid of the heap tuple that it references. To use these, RReettrriieevveeIInnddeexxRReessuulltt rreess;; RReellaattiioonn hheeaapp__rreellnn,, iinnddeexx__rreellnn;; HHeeaappTTuuppllee hhttuupp;; IItteemmPPooiinntteerr hheeaapp__ttiidd;; 3333 BBuuffffeerr bb;; ...... iinnddssccaann == iinnddeexx__bbeeggiinnssccaann((iinnddeexx__rreellnn,, ......));; rreess == iinnddeexx__ggeettnneexxtt((iinnddssccaann,, FFoorrwwaarrddSSccaannDDiirreeccttiioonn));; iiff ((rreess)) {{ hheeaapp__ttiidd == RReettrriieevveeIInnddeexxRReessuullttGGeettHHeeaappIItteemmPPooiinntteerr((rreess));; hhttuupp == hheeaapp__ffeettcchh((hheeaapp__rreellnn,, NNoowwTTiimmeeQQuuaall,, hheeaapp__ttiidd,, &&bb));; iiff ((HHeeaappTTuupplleeIIssVVaalliidd((hhttuupp)))) {{ //** ...... pprroocceessss tthhee hheeaapp ttuuppllee ...... **// //** aallll ddoonnee **// RReelleeaasseeBBuuffffeerr((bb));; }} }} The index relation requires no explicit buffer management. The heap tid of interest is extracted from the RReettrriieevveeIInn-- ddeexxRReessuulltt, and is used to fetch the heap tuple itself. The hheeaapp__ffeettcchh interface fetches a heap tuple by tid; its decla- ration is HHeeaappTTuuppllee hheeaapp__ffeettcchh((RReellaattiioonn rreellaattiioonn,, TTiimmeeQQuuaall ttiimmeeQQuuaall,, IItteemmPPooiinntteerr ttiidd,, BBuuffffeerr **bb)) The description of buffer management for heap scans applies to the hheeaapp__ffeettcchh interface, as well. If the resulting heap tuple does not satisfy the time qualification, then hheeaapp__ffeettcchh will return NULL. Since the index may contain old data until the vacuum cleaner is run, this is possible, so the user must check the return value from hheeaapp__ffeettcchh. If there are no more index tuples that match the quali- fication, iinnddeexx__ggeettnneexxtt will return NULL. At that point, the scan must be ended using iinnddeexx__eennddssccaann. If iinnddeexx__eennddssccaann is not called, any subsequent calls to iinnddeexx__ggeettnneexxtt on the scan will reset it to the beginning of the qualifying index tuples, and they will all be returned again. This is not intended as a feature, so should not be used. 11..44..44.. FFeettcchhiinngg AAttttrriibbuutteess Once a tuple of interest has been fetched from the heap, attributes of interest may be extracted from it and used. The interfaces are cchhaarr ** 3344 hheeaapp__ggeettaattttrr((HHeeaappTTuuppllee ttuupp,, BBuuffffeerr bb,, iinntt aattttnnuumm,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,, bbooooll **iissnnuullll)) PPooiinntteerr iinnddeexx__ggeettaattttrr((IInnddeexxTTuuppllee ttuupp,, iinntt aattttnnuumm,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,, bbooooll **iissnnuullll)) In general, users (and even POSTGRES implementors) need not extract index tuple values; only access method implementors need to do that. The rest of this section concentrates on the hheeaapp__ggeettaattttrr interface. The ttuupp argument points at the tuple of interest. If bb is non-null, then it is the buffer on which the tuple resides. If the user is doing explicit buffer management via hheeaapp__ffeettcchh or hheeaapp__ggeettnneexxtt, then the returned buffer may be passed along to hheeaapp__ggeettaattttrr. The tuple descriptor is actually the rrdd__aatttt entry from the reldesc for the relation that stores the tuple. To extract the tuple descriptor, RReellaattiioonn rreellnn;; TTuupplleeDDeessccrriippttoorr ttuuppddeesscc;; ...... ttuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((rreellnn));; The iissnnuullll argument is a pointer to a value of type bbooooll. If the corresponding attribute has the database value null, then iissnnuullll will be _t_r_u_e and the return value should be ignored. Although it is declared to return type cchhaarr **, hheeaapp__ggeettaattttrr actually returns type DDaattuumm, so the return value should always be coerced to that type. This is an implementation mistake. The DDaattuummGGeett_T_y_p_e macros described in section 1.3.2 convert values from DDaattuumm to the type required by the user. 11..44..55.. IInnsseerrttiinngg NNeeww TTuupplleess When a user submits a POSTQUEL query to insert new tuples into a heap relation, the POSTGRES executor automati- cally updates all associated indices with pointers to the new tuple after it has been inserted. There is no simple interface for doing this in general on user relations inside the POSTGRES backend. Section 1.4.1.2 describes a simple way for doing this on system catalogs, but the user must explicitly code updates to indices on user relations when new values are inserted into the heap. 3355 The code for the routine CCooppyyFFrroomm in the source file ccoommmmaannddss//ccooppyy..cc contains the sample code that was used to produce this system. That file is a reasonable reference for how to manage tuple insertions correctly. 11..44..55..11.. FFoorrmmiinngg aa HHeeaapp TTuuppllee The routine HHeeaappTTuuppllee hheeaapp__ffoorrmmttuuppllee((AAttttrriibbuutteeNNuummbbeerr nnaattttss,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc,, DDaattuumm **vvaalluueess,, cchhaarr **nnuullllss)) creates a heap tuple from the supplied vector of DDaattuumm val- ues. The nnaattttss argument is the number of attributes in the relation (the rrdd__rreell-->>rreellnnaattttss entry of the reldesc). The ttuuppddeesscc is the rrdd__aatttt entry of the reldesc, which may be fetched via RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr. Both vvaalluueess and nnuullllss are arrays of length nnaattttss. VVaall-- uueess contains a datum for every non-null attribute in the tuple. The _T_y_p_eGGeettDDaattuumm macros can be used to initialize the entries in this array. If an attribute is null, its space is ignored in the array. For example, if the third attribute is null, then the value stored in the third loca- tion of the vvaalluueess array will be ignored by hheeaapp__ffoorrmmttuuppllee. The nnuullllss array is a vector of characters, one per attribute in the tuple. If the corresponding attribute has the database value null, then the array entry is the charac- ter _n. If the corresponding attribute is non-null, then the array entry is the ASCII blank character. The heap tuple returned by hheeaapp__ffoorrmmttuuppllee consumes ppaall-- lloocced memory, and should be ppffrreeeed when the heap and index relations have all been updated. 11..44..55..22.. IInnsseerrttiinngg tthhee HHeeaapp TTuuppllee To insert the tuple returned by hheeaapp__ffoorrmmttuuppllee, OObbjjeeccttIIdd hheeaapp__iinnsseerrtt((RReellaattiioonn rreellnn,, HHeeaappTTuuppllee ttuupp,, ddoouubbllee **ooffff)) On return, the ttuupp-->>tt__ccttiidd entry is the tid of the tuple in the heap. The ooffff argument is unnecessary and should never be used. Its purpose is unclear. This argument may safely be C-language NULL. hheeaapp__iinnsseerrtt returns the object ID of the new tuple in the heap (the tt__ooiidd entry of ttuupp is also correctly set). 3366 11..44..55..33.. FFiinnddiinngg tthhee IInnddiicceess The procedure GGeettIInnddeexxRReellaattiioonnss finds all of the indices defined on a heap relation and initializes an array with the reldescs for all of the indices. The interface is GGeettIInnddeexxRReellaattiioonnss((OObbjjeeccttIIdd hheeaapp__rreelliidd,, iinntt **nn__iinnddiicceess,, RReellaattiioonn ****iinnddeexx__rreellddeessccss));; The hheeaapp__rreelliidd argument is the rrdd__iidd entry of the reldesc for the heap. NN__iinnddiicceess is set, on return, to the number of indices for the heap relation. On return, iinnddeexx__rreellddeessccss is a pointer to an array of reldescs, one for each index defined on the heap. 11..44..55..44.. UUppddaattiinngg tthhee IInnddiicceess A POSTGRES index may be +o _f_u_n_c_t_i_o_n_a_l, meaning that the index is on the return value of some function of the heap tuple's attributes, rather than on the attributes themselves; +o _p_a_r_t_i_a_l, meaning that a predicate controls whether or not particular heap tuples should appear in the index; or +o _r_e_g_u_l_a_r, meaning that the index is on one or more attributes of all tuples in the heap. Functional indices are used for some system catalogs, since the current version of POSTGRES does not support multi-column btree indices. Partial indices are very seldom used. Normal indices are by far the most common. The rest of this section ignores functional and partial indices, and concentrates on normal ones. For sample code that updates these two kinds of indices, see the routine CCooppyyFFrroomm in ccoomm-- mmaannddss//ccooppyy..cc. Special interfaces, described in section 1.4.1.2, have been defined to update the functional and nor- mal indices on system catalogs, so the user need not manage these by hand. After calling GGeettIInnddeexxRReellaattiioonnss, the user has the num- ber of indices and all of the index reldescs for the indices defined on the heap. Updating the indices requires forming an index tuple for each index and inserting it. The follow- ing code fragment gives an example. RReellaattiioonn hheeaapp__rreellnn;; RReellaattiioonn **iinnddeexx__rreellnnss;; iinntt nn__iinnddiicceess;; DDaattuumm iiddaattuumm;; 3377 HHeeaappTTuuppllee hhttuupp;; IInnddeexxTTuuppllee iittuupp;; HHeeaappTTuuppllee ppgg__iinnddeexxttuupp;; FFoorrmm__ppgg__iinnddeexx ppggiiffoorrmm;; TTuupplleeDDeessccrriippttoorr ttuuppddeesscc;; TTuupplleeDDeessccrriippttoorr iittuuppddeesscc;; cchhaarr **nnuullllss;; iinntt ii;; ...... ttuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((hheeaapp__rreellnn));; //** ggeett ssppaaccee ffoorr nnuullll ffllaaggss ---- tthhiiss iiss mmoorree tthhaann wwee nneeeedd **// nnuullllss == ppaalllloocc((hheeaapp__rreellnn-->>rrdd__rreell-->>rreellnnaattttss));; ffoorr ((ii == 00;; ii << hheeaapp__rreellnn-->>rrdd__rreell-->>rreellnnaattttss;; ii++++)) nnuullllss[[ii]] == '' '';; //** ...... iinnsseerrtt tthhee hheeaapp ttuuppllee,, wwhhiicchh sseettss hhttuupp-->>tt__ccttiidd ...... **// //** uuppddaattee eeaacchh iinnddeexx **// ffoorr ((ii == 00;; ii << nn__iinnddiicceess;; ii++++)) {{ //** ggeett tthhee ppgg__iinnddeexx ttuuppllee ffoorr tthhiiss iinnddeexx **// ppgg__iinnddeexxttuupp == SSeeaarrcchhSSyyssCCaacchheeTTuuppllee((IINNDDEEXXRREELLIIDD,, iinnddeexx__rreellnnss[[ii]]-->>rrdd__iidd,, NNUULLLL,, NNUULLLL,, NNUULLLL));; ppggiiffoorrmm == ((FFoorrmm__ppgg__iinnddeexx)) GGEETTSSTTRRUUCCTT((ppgg__iinnddeexxttuupp));; //** ffoorrmm aa ddaattuumm ttoo uussee ffoorr tthhee iinnddeexx ttuuppllee **// FFoorrmmIInnddeexxDDaattuumm((iinnddeexx__rreellnnss[[ii]]-->>rrdd__rreell-->>rreellnnaattttss,, ((AAttttrriibbuutteeNNuummbbeerr **)) &&ppggiiffoorrmm-->>iinnddkkeeyy..ddaattaa[[00]],, hhttuupp,, ttuuppddeesscc,, IInnvvaalliiddBBuuffffeerr,, &&iiddaattuumm,, nnuullllss,, NNUULLLL));; //** ffoorrmm tthhee iinnddeexx ttuuppllee **// iittuuppddeesscc == RReellaattiioonnGGeettTTuupplleeDDeessccrriippttoorr((iinnddeexx__rreellnnss[[ii]]));; iittuupp == iinnddeexx__ffoorrmmttuuppllee((11,, iittuuppddeesscc,, &&iiddaattuumm,, nnuullllss));; //** mmaakkee iitt ppooiinntt aatt tthhee hheeaapp ttuuppllee **// iittuupp-->>tt__ttiidd == hhttuupp-->>tt__ccttiidd;; //** iinnsseerrtt iitt **// ((vvooiidd)) iinnddeexx__iinnsseerrtt((iinnddeexx__rreellnnss[[ii]],, iittuupp,, NNUULLLL,, NNUULLLL));; ppffrreeee((iittuupp));; }} ppffrreeee((nnuullllss));; The loop iterates over all the indices defined on the heap, updating each in turn. For each index, it fetches the cor- responding ppgg__iinnddeexx tuple in the cache of system tuples, extracts the FFoorrmm__ppgg__iinnddeexx structure from each, and then 3388 forms a DDaattuumm with the appropriate value for the index tuple. Finally, an index tuple is formed and inserted into the index. 11..44..66.. DDeelleettiinngg TTuupplleess Compared to inserting tuples, deleting tuples is sim- ple. Users never delete tuples from indices; this is done only by the vacuum cleaner. Deleting a heap tuple requires the tid of the tuple to be deleted. The tid may be extracted from the heap tuple itself, which can be found, for example, by a scan of the heap relation. Given the tid, the interface for deleting heap tuples is RRuulleeLLoocckk hheeaapp__ddeelleettee((RReellaattiioonn rreellnn,, IItteemmPPooiinntteerr ttiidd)) The RRuulleeLLoocckk return value from hheeaapp__ddeelleettee was originally intended to support the POSTGRES rule system, but the design of the rule system changed, so this value is not used. HHeeaapp__ddeelleettee always returns NULL. 11..44..77.. RReeppllaacciinngg EExxiissttiinngg TTuupplleess When a heap tuple in POSTGRES is replaced, it is marked as deleted and a new tuple is inserted in the relation with new values. The new tuple has the same object ID as the original tuple, but will be stored at a new location and so will have a new tid. Index tuples are never replaced. If the heap tuples to which they point have been replaced, the vacuum cleaner deletes them from the index. No user action is required. However, if a heap tuple is replaced, all the indices defined on it must have new index tuples inserted. These index tuples must point at the new heap tuple. The code for doing this is identical to the code for inserting index tuples that point at new heap tuples, and will not be repeated here. Replacing a tuple consists of two steps. First, a new tuple is formed, based on the old one, but with some values changed. Second, the old tuple is replaced by the new one. Replacing a tuple requires knowledge of the schema of the relation storing the tuple, since new values for partic- ular attributes must be supplied. The sample code fragment below assumes the existence of a relation eemmpp with the schema eemmpp((nnaammee == cchhaarr1166,, ssaallaarryy == iinntt44)) 3399 The code sample replaces a particular eemmpp tuple, assigning the employee a salary of 50,000. Note that the code frag- ment does not update any indices defined on eemmpp; that code must be added if any such indices exist. RReellaattiioonn eemmpprreellnn;; HHeeaappTTuuppllee eemmppttuupp;; HHeeaappTTuuppllee nneewwttuupp;; BBuuffffeerr eemmppbbuuff;; DDaattuumm vvaalluuee[[22]];; cchhaarr nnuullllss[[22]];; cchhaarr rreeppll[[22]];; ...... //** bbyy hheerree,, wwee hhaavvee tthhee eemmpp ttuuppllee ttoo rreeppllaaccee **// vvaalluuee[[00]] == ((DDaattuumm)) 00;; vvaalluuee[[11]] == IInntt3322GGeettDDaattuumm((5500000000));; //** nneeww ssaallaarryy **// //** nnoo nnuullll aattttrriibbuutteess iinn tthhiiss ttuuppllee **// nnuullllss[[00]] == nnuullllss[[11]] == '' '';; //** rreeppllaaccee tthhee sseeccoonndd aatttt,, nnoott tthhee ffiirrsstt **// rreeppll[[00]] == '' '';; rreeppll[[11]] == ''rr'';; //** ggeett aa nneeww ttuuppllee bbaasseedd oonn tthhee oolldd oonnee,, wwiitthh tthhee nneeww ssaallaarryy **// nneewwttuupp == hheeaapp__mmooddiiffyyttuuppllee((eemmppttuupp,, eemmppbbuuff,, eemmpprreellnn,, vvaalluuee,, nnuullllss,, rreeppll));; //** rreeppllaaccee tthhee oolldd ttuuppllee wwiitthh tthhee nneeww oonnee **// ((vvooiidd)) hheeaapp__rreeppllaaccee((eemmpprreellnn,, &&eemmppttuupp-->>tt__ccttiidd,, nneewwttuupp));; The interface for hheeaapp__mmooddiiffyyttuuppllee is HHeeaappTTuuppllee hheeaapp__mmooddiiffyyttuuppllee((HHeeaappTTuuppllee ttuuppllee,, BBuuffffeerr bbuuffffeerr,, RReellaattiioonn rreellaattiioonn,, DDaattuumm **rreeppllVVaalluuee,, cchhaarr **rreeppllNNuullll,, cchhaarr **rreeppll)) The ttuuppllee argument is the tuple to replace. The bbuuffffeerr argument is the buffer in the buffer cache on which it appears, as returned by hheeaapp__ggeettnneexxtt or hheeaapp__ffeettcchh. The rreellaattiioonn argument is the relation containing the tuple. The next three arguments control what values are stored in the new tuple. The first, rreeppllVVaalluuee, is a vector of val- ues of type DDaattuumm containing one entry for each attribute in the relation. If an attribute is not to be replaced, but rather should be copied from the old tuple, then the corre- sponding entry in rreeppllVVaalluuee will be ignored. 4400 The rreeppllNNuullll argument contains one entry per attribute in the relation. If the corresponding attribute is to be replaced, and if the rreeppllNNuullll entry is the character _n, then the new value for that attribute is null. If the value is to be replaced and the rreeppllNNuullll entry is an ASCII blank, then the value stored in rreeppllVVaalluuee is consulted to get the new value for the attribute. Finally, the rreeppll argument contains one entry per attribute in the relation. If the rreeppll entry for a particu- lar attribute is the character _r, then the value should be replaced, and the rreeppllVVaalluuee and rreeppllNNuullll entries for the attribute are used to assign a new value. If the rreeppll entry is an ASCII blank, then the value is copied from the origi- nal tuple. 11..44..88.. CCrreeaattiinngg aanndd DDeessttrrooyyiinngg RReellaattiioonnss Creating a heap relation requires the user to supply a schema for the new relation. Creating an index relation is easier, because the schema of the index is defined in terms of the heap. The two cases are treated separately below. 11..44..88..11.. HHeeaapp RReellaattiioonnss Creating a heap relation requires a schema for the new relation. The schema should be a TTuupplleeDDeessccrriippttoorr, which is the same as the rrdd__aatttt entry of the reldesc for the relation about to be created. Creating the tuple descriptor is straightforward; a vector is allocated with one FFoorrmm-- DDaattaa__ppgg__aattttrriibbuuttee structure for each attribute. The FFoorrmm-- DDaattaa__ppgg__aattttrriibbuuttee structures are filled in with the names, types, and other appropriate information for the new attributes. The aattttrreelliidd and aattttddeeffrreell entries may be left as IInnvvaalliiddOObbjjeeccttIIdd; these will be filled in by the code to create the heap relation. Once the tuple descriptor has been constructed, creat- ing the relation is straightforward. The interface is OObbjjeeccttIIdd hheeaapp__ccrreeaattee((cchhaarr **rreellnnaammee,, iinntt aarrcchhiivvee,, uunnssiiggnneedd nnaattttss,, uunnssiiggnneedd ssmmggrr,, TTuupplleeDDeessccrriippttoorr ttuuppddeesscc)) The rreellnnaammee argument is the name of the new relation; it must be unique, or hheeaapp__ccrreeaattee will abort the transaction. The aarrcchhiivvee argument, although declared to be of type iinntt in the prototype, is actually a character; this should be one of _n (no archiving), _l (light archiving), or _h (heavy archiving). Light and heavy archiving behave identically in the current system, and cause deleted and replaced tuples to 4411 be copied to a special archive by the vacuum cleaner. No archiving causes the vacuum cleaner to discard deleted and replaced tuples. The nnaattttss argument is the number of attributes for the new relation. This should be the same as the length of the ttuuppddeesscc vector. The ssmmggrr argument is the storage manager on which the relation should be created. The set of supported storage managers varies from installed system to installed system. Storage manager number zero is always magnetic disk. Other numbers are installation-dependent. Finally, the ttuuppddeesscc argument is the tuple descriptor for the new relation. HHeeaapp__ccrreeaattee returns the object ID of the new relation's ppgg__ccllaassss tuple, which is the same as the relid of the new relation. Destroying a heap relation is even simpler: vvooiidd hheeaapp__ddeessttrrooyy((cchhaarr **rreellnnaammee)) If the named relation does not exist, hheeaapp__ddeessttrrooyy aborts the transaction. Otherwise, the relation is destroyed and all indices defined on it are removed. In addition, if the relation inherited attributes from other relations, its inheritance information is deleted from the system catalogs. 11..44..88..22.. IInnddeexx RReellaattiioonnss Creating an index relation is straightforward. The interface is vvooiidd iinnddeexx__ccrreeaattee((NNaammee hheeaapp__nnaammee,, NNaammee iinnddeexx__nnaammee,, FFuunnccIInnddeexxIInnffoo ffIInnffoo,, OObbjjeeccttIIdd aamm__iidd,, AAttttrriibbuutteeNNuummbbeerr nnaattttss,, AAttttrriibbuutteeNNuummbbeerr **aattttnnooss,, OObbjjeeccttIIdd **ooppccllaasssseess,, uuiinntt1166 ppaarraammCCoouunntt,, DDaattuumm **ppaarraammss,, LLiissppVVaalluuee pprreeddiiccaattee)) The ffIInnffoo, ppaarraammCCoouunntt, ppaarraammss, and pprreeddiiccaattee arguments are for defining functional or partial indices, and will not be covered here. The hheeaapp__nnaammee argument is the name of the heap relation on which the index is to be defined. The iinnddeexx__nnaammee argu- ment is the name of the new index relation. 4422 The aamm__iidd argument is the object ID of the access method tuple for this index, from ppgg__aamm. The nnaattttss argument is the number of attributes (keys) for this index. The current system supports only single-key indices, but this may change in a subsequent release. The aattttnnooss argument is a vector of attribute numbers in the heap. If the index is to be defined on the third attribute in the heap, then this vector would consist of a single entry, which would be three. The ooppccllaasssseess argument is the object ID of the operator class tuple from ppgg__aammoopp to use for each attribute. The operator classes map types and operators for the index to particular functions in ppgg__pprroocc. Sample code that defines regular, functional, and par- tial indices may be found in the source file ccoomm-- mmaannddss//ddeeffiinndd..cc. When iinnddeexx__ccrreeaattee returns, the new index has been cre- ated and tuples have been inserted into it for each tuple in the heap relation. 11..44..99.. SSccaann PPoossiittiioonn MMaannaaggeemmeenntt Scans on index or heap relations support _m_a_r_k_i_n_g and _r_e_s_t_o_r_i_n_g of the scan's position. A mark records the cur- rent scan location, and a restore restores the previously marked location. Only one mark may be defined on an open scan. Marking and restoring are of limited use outside the executor, but the interfaces are included here for complete- ness. They are vvooiidd hheeaapp__mmaarrkkppooss((HHeeaappSSccaannDDeesscc hhssccaann)) vvooiidd hheeaapp__rreessttrrppooss((HHeeaappSSccaannDDeesscc hhssccaann)) vvooiidd iinnddeexx__mmaarrkkppooss((IInnddeexxSSccaannDDeesscc iissccaann)) vvooiidd iinnddeexx__rreessttrrppooss((IInnddeexxSSccaannDDeesscc iissccaann)) 4433 11..44..1100.. LLoocckkiinngg Users of the access method interface routines need not worry about locking. The access method routines acquire (and, when appropriate, release) locks on relations in response to scans and updates. Heap relations are protected by two-phase relation level locks. Btree indices use Lehman-Yao5 short-term locking for high concurrency, but since the underlying relations are locked at the relation level, index updates are serialized. Rtree indices use two- phase relation-level locking. System catalogs in POSTGRES are not subject to two- phase locking in order to improve concurrency. This means that under some circumstances concurrent transactions that operate on the system catalogs are not serializable. This flaw is common in commercial relational systems, as well, and occurs infrequently enough that it does not matter in practice. ____________________ 5 Lehman, P., Yao, S., ``Efficient Locking for Concurrent Operations on B-trees'', _A_C_M _T_r_a_n_s_a_c_t_i_o_n_s _o_n _D_a_t_a_b_a_s_e _S_y_s_- _t_e_m_s, 6(4), December 1981. 4444