11.. TThhee PPOOSSTTGGRREESS SSttoorraaggee MMaannaaggeerr SSwwiittcchh The POSTGRES abstraction that hides device-specific characteristics from the data manager is called the _s_t_o_r_a_g_e _m_a_n_a_g_e_r _s_w_i_t_c_h. The storage manager switch is similar to the _c_d_e_v_s_w and _b_d_e_v_s_w interfaces of UNIX and the user- supplied backing store interface in Mach. This section describes the POSTGRES storage manager switch in detail. Hiding device characteristics from the data manager required that the abstract operations required on the data store be identified. For example, the interface requires routines for opening relations and instantiating particular 8kByte relation blocks. Every device is managed by a _d_e_v_i_c_e _m_a_n_a_g_e_r, which knows how to implement the abstract opera- tions on the particular device. The _s_t_o_r_a_g_e _m_a_n_a_g_e_r is the module that chooses the correct device manager to service a particular request. Once the interfaces required of the device manager were defined, an appropriate structure (the _s_t_o_r_a_g_e _m_a_n_a_g_e_r _s_w_i_t_c_h _t_a_b_l_e) was compiled into POSTGRES. This structure contains an entry for all existing device managers. New device types are added by editing a source file, adding appropriate entries to the storage manager switch table, and recompiling POSTGRES. Whenever a device operation is required, the data man- ager calls the storage manager to carry it out. Every rela- tion is tagged in POSTGRES with the device manager on which it appears. The storage manager identifies the appropriate device and routine in the storage manager switch table, and calls the device manager to execute the routine on behalf of the data manager. Data may be located on a particular device manager at the granularity of a relation. Physical relations may not be horizontally or vertically partitioned, but the POSTGRES rules system can be used to create a partitioned logical relation. 11..11.. IIssoollaattiioonn ooff DDeevviiccee DDiiffffeerreenncceess New device managers are written to manage new types of devices. For example, main memory, magnetic disk, and the Sony optical disk jukebox are all managed by different device managers. To isolate the database system from the underlying devices, device managers are required to accept and return data in the format used internally by the data manager. This means, among other things, that data pages from user relations must be accepted and returned in units of 8kBytes. 11 Nevertheless, there is no requirement that a device manager use the internal representation when writing a page to persistent storage. Data pages may be compressed, broken into pieces, or concatenated into larger units by a device manager, as long as the original page can be reconstructed on demand. In addition, device managers may implement caches man- aged according to an appropriate policy. This means that the new POSTGRES architecture supports a multi-level cache. A main-memory cache of recently-used blocks is maintained in shared memory by the buffer manager. The buffer manager calls the storage manager to migrate pages between shared memory and persistent storage. A device manager may imple- ment a separate cache for its own use, subject to whatever policy it likes. For example, the Sony jukebox device man- ager maintains a cache of _e_x_t_e_n_t_s (contiguous collections of user data blocks) on magnetic disk. If there is locality of reference in accesses to a Sony jukebox relation, then some requests may be satisfied without accessing the jukebox device at all. This cache will be described in more detail in section 4.2. 11..22.. TThhee SSttoorraaggee MMaannaaggeerr SSwwiittcchh IInntteerrffaaccee This section defines the interface presented by the storage manager switch. The storage manager switch data structure, ff__ssmmggrr, con- tains ttyyppeeddeeff ssttrruucctt ff__ssmmggrr {{ iinntt ((**ssmmggrr__iinniitt))(());; //** mmaayy bbee NNUULLLL **// iinntt ((**ssmmggrr__sshhuuttddoowwnn))(());; //** mmaayy bbee NNUULLLL **// iinntt ((**ssmmggrr__ccrreeaattee))(());; iinntt ((**ssmmggrr__uunnlliinnkk))(());; iinntt ((**ssmmggrr__eexxtteenndd))(());; iinntt ((**ssmmggrr__ooppeenn))(());; iinntt ((**ssmmggrr__cclloossee))(());; iinntt ((**ssmmggrr__rreeaadd))(());; iinntt ((**ssmmggrr__wwrriittee))(());; iinntt ((**ssmmggrr__fflluusshh))(());; iinntt ((**ssmmggrr__bblliinnddwwrrtt))(());; iinntt ((**ssmmggrr__nnbblloocckkss))(());; iinntt ((**ssmmggrr__ccoommmmiitt))(());; //** mmaayy bbee NNUULLLL **// iinntt ((**ssmmggrr__aabboorrtt))(());; //** mmaayy bbee NNUULLLL **// iinntt ((**ssmmggrr__ccoosstt))(());; }} ff__ssmmggrr;; This structure defines the interface routines that all device managers must define. When a new device type is added to POSTGRES, this structure is filled in with function 22 pointers appropriate to the device, and POSTGRES is recom- piled. The storage manager calls these routines in response to requests from the data manager. In some cases, the switch entry for an interface routine may be NNUULLLL. This means that the corresponding operation is not defined for the device in question. Unless otherwise stated in the sec- tions that follow, all interface routines return an integer status code indicating success or failure. The following table summarizes the responsibilities of these interface routines. 33 +-------------------------------------------------------------------------+ +----------------------------------------+--------------------------------+ | _R_o_u_t_i_n_e | _P_u_r_p_o_s_e | +----------------------------------------+--------------------------------+ |smgr_init() | Called at startup to allow | | | initialization. | +----------------------------------------+--------------------------------+ |smgr_shutdown() | Called at shutdown to allow | | | graceful exit. | +----------------------------------------+--------------------------------+ |smgr_create(_r) | Create relation _r. | +----------------------------------------+--------------------------------+ |smgr_unlink(_r) | Destroy relation _r. | +----------------------------------------+--------------------------------+ |smgr_extend(_r, _b_u_f) | Add a new block to the end of | | | relation _r and fill it with | | | data from _b_u_f. | +----------------------------------------+--------------------------------+ |smgr_open(_r) | Open relation _r. The relation | | | descriptor includes a pointer | | | to the state for the open ob- | | | ject. | +----------------------------------------+--------------------------------+ |smgr_close(_r) | Close relation _r. | +----------------------------------------+--------------------------------+ |smgr_read(_r, _b_l_o_c_k, _b_u_f) | Read block _b_l_o_c_k of relation _r | | | into the buffer pointed to by | | | _b_u_f. | +----------------------------------------+--------------------------------+ |smgr_write(_r, _b_l_o_c_k, _b_u_f) | Write _b_u_f as block _b_l_o_c_k of | | | relation _r. | +----------------------------------------+--------------------------------+ |smgr_flush(_r, _b_l_o_c_k, _b_u_f) | Write _b_u_f synchronously as | | | block _b_l_o_c_k of relation _r. | +----------------------------------------+--------------------------------+ |smgr_blindwrt(_n_d, _n_r, _i_d, _i_r, _b_l_k, _b_u_f) | Do a ``blind write'' of buffer | | | _b_u_f as block _b_l_k of the rela- | | | tion named _n_r in the database | | | named _n_d. | +----------------------------------------+--------------------------------+ |smgr_nblocks(_r) | Return the number of blocks in | | | relation _r. | +----------------------------------------+--------------------------------+ |smgr_commit() | Force all dirty data to stable | | | storage. | +----------------------------------------+--------------------------------+ |smgr_abort() | Changes made during this | | | transaction may be discarded. | +----------------------------------------+--------------------------------+ +T-h-e--r-e-s-t--o-f--t-h-i-s--s-e-c-t-i-o-n--g-i-v-e-s--m-o-r-e--d-e-t-a-i-l-e-d--d-e-s-c-r-i-p-t-i-o-n-s-.----------------+ 44 11..22..11.. ssmmggrr__iinniitt The routine iinntt ssmmggrr__iinniitt(()) is called when POSTGRES starts up, and should initialize any private data used by a device manager. In addition, the first call to this routine should initialize any shared data. Typically, this is done by noticing whether shared data exists, and creating and initializing it if it does not. If a particular device manager requires no initializa- tion, it may define this routine to be NULL. 11..22..22.. ssmmggrr__sshhuuttddoowwnn The routine iinntt ssmmggrr__sshhuuttddoowwnn(()) is analogous to ssmmggrr__iinniitt(()), but is called when POSTGRES exits cleanly. Any necessary cleanup may be done here. If no work is required, this routine may be NULL. 11..22..33.. ssmmggrr__ccrreeaattee The routine to create new relations on a particular device manager is iinntt ssmmggrr__ccrreeaattee((rreellnn)) RReellaattiioonn rreellnn;; The rreellnn argument points to a POSTGRES relation descriptor for the relation to be created. The device manager should initialize whatever storage structure is required for the new relation. For example, the magnetic disk device manager creates an empty file with the name of the relation. 11..22..44.. ssmmggrr__uunnlliinnkk The routine iinntt ssmmggrr__uunnlliinnkk((rreellnn)) RReellaattiioonn rreellnn;; is called by the storage manager to destroy the relation described by rreellnn. Device managers may take whatever action is appropriate. For example, the magnetic disk device 55 manager removes the relation's associated file and archive, and reclaims the disk blocks they occupy. Since the Sony write-once drive cannot reclaim blocks once they are used, the Sony jukebox device manager simply marks the relation as destroyed and returns. 11..22..55.. ssmmggrr__eexxtteenndd When a relation must be extended with a new data block, the data manager calls iinntt ssmmggrr__eexxtteenndd((rreellnn,, bbuuffffeerr)) RReellaattiioonn rreellnn;; cchhaarr **bbuuffffeerr;; to do the work. The buffer is an 8kByte buffer in the for- mat used by the data manager internally. The data manager assumes that 8kByte data blocks in relations are numbered starting from one. Thus ssmmggrr__eexxtteenndd(()) must logically allo- cate a new block for the relation pointed to by rreellnn and store some representation of bbuuffffeerr there. 11..22..66.. ssmmggrr__ooppeenn The routine iinntt ssmmggrr__ooppeenn((rreellnn)) RReellaattiioonn rreellnn;; should open the relation pointed to by rreellnn and return a file descriptor for it. The notion of file descriptors is a holdover from the pre-storage manager switch days, and is no longer necessary. If file descriptors make no sense for a given device, then the associated device manager may return any non-negative value. A negative return value indicates an error. 11..22..77.. ssmmggrr__cclloossee When the data manager is finished with a relation, it calls iinntt ssmmggrr__cclloossee((rreellnn)) RReellaattiioonn rreellnn;; The device manager may release resources associated with the relation pointed to by rreellnn. 66 11..22..88.. ssmmggrr__rreeaadd To instantiate an 8kByte data block from a relation, the data manager calls iinntt ssmmggrr__rreeaadd((rreellnn,, bblloocckknnuumm,, bbuuffffeerr)) RReellaattiioonn rreellnn;; BBlloocckkNNuummbbeerr bblloocckknnuumm;; cchhaarr **bbuuffffeerr;; As stated above, 8kByte blocks in a relation are logically numbered from one. The storage manager must locate the block of interest and load its contents into bbuuffffeerr. 11..22..99.. ssmmggrr__wwrriittee The routine iinntt ssmmggrr__wwrriittee((rreellnn,, bblloocckknnuumm,, bbuuffffeerr)) RReellaattiioonn rreellnn;; BBlloocckkNNuummbbeerr bblloocckknnuumm;; cchhaarr **bbuuffffeerr;; writes some representation of bbuuffffeerr as block number bblloocckk-- nnuumm of the relation pointed to by rreellnn. This write may be asynchronous; the device manager need not guarantee that the buffer is flushed through to stable storage. Synchronous writes are handled using the ssmmggrr__fflluusshh routine, described below. As long as this routine guarantees that the buffer has been copied somewhere safe for eventual writing, it may return successfully. The buffer is an 8kByte POSTGRES page in the format used by the data manager. It may be stored in any form, but must be reconstructed exactly when the ssmmggrr__rreeaadd routine is called on it. 11..22..1100.. ssmmggrr__fflluusshh This routine synchronously writes a block to stable storage. The POSTGRES data manager almost never needs to do synchronous writes of single blocks. In some cases, how- ever, like the write of the transaction status file that marks a transaction committed, a single block must be writ- ten synchronously. In this case, the data manager calls the device manager's ssmmggrr__fflluusshh routine. The interface for this routine is iinntt ssmmggrr__fflluusshh((rreellnn,, bblloocckknnuumm,, bbuuffffeerr)) 77 RReellaattiioonn rreellnn;; BBlloocckkNNuummbbeerr bblloocckknnuumm;; cchhaarr **bbuuffffeerr;; The behavior of ssmmggrr__fflluusshh is almost exactly the same as that of ssmmggrr__wwrriittee, except that ssmmggrr__fflluusshh must not return until bbuuffffeerr is safely written to stable storage. 11..22..1111.. ssmmggrr__bblliinnddwwrrtt Under normal circumstances, POSTGRES moves dirty pages from memory to stable storage by calling the ssmmggrr__wwrriittee rou- tine for the appropriate storage manager. In order to call ssmmggrr__wwrriittee, the buffer manager must construct and pass a relation descriptor. In certain cases, POSTGRES is unable to construct the relation descriptor for a dirty buffer that must be evicted from the shared buffer cache. This can happen, for example, when the relation descriptor has just been created in another process, and has not yet been committed to the database. In such a case, POSTGRES must write the buffer without constructing a relation descriptor. The buffer manager can detect this case, and handles it using a strategy called "blind writes." When a blind write must be done, the buffer manager determines the name and object id of the database, the name and object id of the relation, and the block number of the buffer to be written. All of this information is stored with the buffer in the buffer cache. The device manager is responsible for deter- mining, from this information alone, where the buffer must be written. This means, in general, that every device man- ager must use some subset of these values as a unique key for finding the proper location to write a data block. In general, the blind write routine is slower than the ordinary write routine, since it must assemble enough infor- mation to synchronously open the relation and write the data. The interface for this routine is iinntt ssmmggrr__bblliinnddwwrrtt((ddbbnnaammee,, rreellnnaammee,, ddbbiidd,, rreelliidd,, bbllkknnoo,, bbuuffffeerr)) NNaammee ddbbnnaammee;; NNaammee rreellnnaammee;; OObbjjeeccttIIdd ddbbiidd;; OObbjjeeccttIIdd rreelliidd;; BBlloocckkNNuummbbeerr bbllkknnoo;; cchhaarr **bbuuffffeerr;; 88 It is expected that most device managers will write data much more efficiently using ssmmggrr__wwrriittee than ssmmggrr__bblliinnddwwrrtt. 11..22..1122.. ssmmggrr__nnbblloocckkss The routine iinntt ssmmggrr__nnbblloocckkss((rreellnn)) RReellaattiioonn rreellnn;; should return the number of blocks stored for the relation pointed to by rreellnn. The number of blocks is used during query planning and execution. Since this is a potentially expensive operation, the storage manager maintains a shared-memory cache of relation sizes recently computed by device managers. 11..22..1133.. ssmmggrr__ccoommmmiitt When a transaction finishes, the data manager calls the ssmmggrr__ccoommmmiitt routine for every device manager. This routine must flush any pending asynchronous writes for this transac- tion to disk, so that all relation data is on persistent storage before the transaction is marked committed. In addition, the device manager can release resources or update its state appropriately. This routine may be NULL, if there is no work to do for a device manager at transaction commit. If it is supplied, its interface is iinntt ssmmggrr__ccoommmmiitt(()) There is one important issue regarding transaction com- mit that should also be mentioned. Much research has been done in database systems to support _d_i_s_t_r_i_b_u_t_i_o_n, or databases that span multiple computers in a network. Although POSTGRES is not a distributed database system, it does support _d_a_t_a _d_i_s_t_r_i_b_u_t_i_o_n by means of the storage man- ager switch and device managers. Data distribution means that different tables, or parts of tables, may reside on storage devices connected to dif- ferent computers on a network. POSTGRES does all data manipulation at a single site, but the data it works with may be distributed. The data manager will call the device manager, which can operate on a local device, or use the 99 network to satisfy requests on a remote device. In fact, the Sony jukebox device manager works correctly whether it runs on the host computer to which the jukebox is connected, or on some other computer on the Internet. At transaction commit time, the data manager informs each device manager that all its data must be on stable storage. Each device manager flushes any pending writes synchronously, and returns. Once all device managers have forced data to stable storage, the data manager syn- chronously records that the transaction has committed by writing a ten-byte record to the _t_r_a_n_s_a_c_t_i_o_n _s_t_a_t_u_s _f_i_l_e on one of the storage devices. If the system crashes before the commit record is stored in the transaction status file, POSTGRES will consider the transaction to be aborted when the transaction's changes are encountered later. Thus the ssmmggrr__ccoommmmiitt interface supports data distribu- tion without using a traditional multi-phase commit proto- col. In particular, device managers are not notified whether a transaction actually commits, once they have forced data to stable storage. 11..22..1144.. ssmmggrr__aabboorrtt If a transaction aborts, the data manager calls ssmmggrr__aabboorrtt in place of ssmmggrr__ccoommmmiitt. In this case, it does not matter if changes made to user relations during the aborted transaction are reflected on stable storage. Device managers may deschedule pending writes, as long as the writes include no data from other transactions. This routine may be NULL, if there is no work to do for a device manager at transaction abort. 11..33.. AArrcchhiitteeccttuurree SSuummmmaarryy The POSTGRES storage manager architecture assigns a particular device manager to manage each device, or class of devices, available to the database system. A well-defined set of interface routines makes relation accesses device- transparent. The architecture uses a _s_t_o_r_a_g_e _m_a_n_a_g_e_r _s_w_i_t_c_h to record the interface routines for all the device managers known to the system. So far, three device managers exist: magnetic disk, a Sony WORM optical disk jukebox, and volatile main memory. 1100