.nr tp 13 .nr sp 13 .nr pp 11 .nr fp 9 .sz \n(pp .vs 13 \" set 1.5 spacing between lines .nr $r \n(.vu/\n(.su .nr $R \n($r .fo ''%'' .(l C .sz \n(sp .b MANAGING PERSISTENT OBJECTS IN A MULTI-LEVEL STORE .sz \n(pp .sp 3 .i Michael Stonebraker Computer Science Division, EECS Department University of California Berkeley, Ca. 94720 .sp .r .)l .sp 3 .ls 1 .ce .uh Abstract .pp This paper presents an architecture for a persistent object store in which multi-level storage is explicitly included. Traditionally, DBMSs have assumed that all accessible data resides on magnetic disk, and recently several researchers have begun to consider the possibility that significant amounts of data will occupy space in a main memory cache. We feel that future object managers will be called on to manage very large object bases in which time critical objects reside in main memory, other objects are disk resident, and the remainder occupy tertiary memory. Moreover, it is possible that more than three levels will be present, and that some of these levels will be on remote hardware. This paper contains an architectural proposal addressing these needs along with a sketch of the required query optimizer. .sh 1 "INTRODUCTION" .pp Traditionally DBMSs have assumed that all data resided on magnetic disk. Therefore, all optimization decisions were oriented toward disk technology. For example, access methods have been proposed that are efficient on disk devices, e.g. B-trees and R-trees. Query processing strategies have been designed that work well for disks, e.g. merge-sort [SELI79], and query optimizers have been architected with a disk environment in mind [SELI79]. .pp Recently, there has been some work on including in this traditional environment the possibility that significant portions of a data base may reside in main memory. Query processing strategies have been designed which require large amounts of main memory to be effective, e.g hash joins [DEWI84, DEWI86]. In addition, access methods appropriate to main memory have been constructed, e.g. T-trees [LEHM86] and the possibility of using AVL trees was evaluated in [DEWI84]. Lastly, query optimizers have begun to take more careful note of available main memory when constructing query plans [HONG90]. .pp However, current general purpose DBMSs are still too slow to manage the real-time data bases associated, for example, with factory floor applications or telephone switching. Such applications are addressed by special purpose system software such as VAX-Elan and Rdb-Elan. It would clearly be desirable to expand the set of applications amenable to a general purpose DBMS solution by increasing its support for main memory data. .pp We also believe that most data bases will expand dramatically in size, requiring the inclusion of tertiary storage. In addition, it is possible that there will be multiple tertiary stores, perhaps at remote locations. Moreover, it is not unreasonable that there may be more than three levels in future systems. Therefore, in Section 2 we present a proposal for a multi-level storage architecture. The special needs of long fields are covered in Section 3, and then we move in Section 4 to an outline of the query optimizer needed in this environment. We conclude in Sections 5 with our prototyping plans. .pp In the remainder of the paper we make several assumptions. First, we assume that an abstract data type facility is available [STON86B, STON90]. Hence, a user can define new data types, functions and operators. Moreover, such types, functions, and operators are automatically available in the query language supported by the DBMS. For the purposes of this discussion, we assume the ADT facility available in POSTGRES and the query language, POSTQUEL [STON86]. However, any comparable capabilities could be readily substituted. .pp We also assume that the storage manager uses a no-overwrite philosophy as in [STON87]. Therefore, certain techniques that we propose take advantage of this property. Anyone interested in a Write Ahead Log (WAL) storage manager must make minor adjustments to our proposal. .sh 1 "A MULTI-LEVEL STORAGE MANAGER" .pp We assume that the storage system consists of a collection of L .b logical devices that form a rooted tree. Hence, there is a unique root, called main memory, with zero or more direct descendant devices, each of which can have zero or more descendants. Moreover, these L devices can be on various computer systems in a network. For the moment we will assume that L = 3 and denote the devices by main memory, disk and archive, which are assumed to be on a single computer system. The extension of our proposal to L > 3 devices and to a distributed environment is discussed in [STON91]. .pp A multi-level storage manager must be able to address the needs of the following clients: 1) real time applications which need sub-millisecond response times for requests to a main memory data base along with conventional response times to disk based data. Persistent programming languages are an example of this class of applications. 2) applications with mamouth data bases which need conventional response times to disk based data and reasonable response times to archival data. To address these needs, one could either use a .b physical hierarchy or a .b logical hierarchy. We first discuss this issue and then turn to our specific proposal. .sh 2 "Physical or Logical Hierarchy" .pp The traditional way of viewing a multi-level store would be to generalize the .b physical .b block model used by current disk-based systems. Current DBMSs and file systems assume that data blocks reside on disk and that .b worthy blocks are placed in main memory. Movement of blocks between main memory and disk is controlled by the buffer manager, which utilizes a replacement policy based on how recently each block has been touched and perhaps other semantic information [CHOU85]. The generalization to a three level store is presented in column 3 of Figure 1, where data blocks reside on the archive, worthy blocks are placed on the disk and very worthy blocks are placed in main memory. Storage blocks would be moved in the hierarchy as access patterns change by a generalized buffer manager. Some work on migration of whole files in such a three level store has been done in [SMIT81]. .pp Although a physical block model is appealing because of its simplicity, we feel that it is inadequate for the following reasons: 1) Persistent programming languages need an object cache in main memory with the property that pointers to other objects are "swizzled". Specifically, a pointer to an object should be represented as a unique identifier (UID) on disk but as a physical pointer in main memory. Another simple example of an object changing representation when it moves between levels is a variable length character string. It might be stored in main memory as a pointer to a location in a heap containing a null terminated string. However, on disk the string would be represented by a length designator followed by the data. The physical block model is inadequate because it does not support an object changing representation when it move between levels. 2) A conventional relational DBMS maintains a main memory cache of system catalog objects (e.g. open tables, scan positions, etc.). In all systems we are familiar with, this cache is managed as a separate main memory data base. Moreover, system catalog objects have different representations in main memory and disk. Again, objects must change representation when a table is "opened" or "closed". 3) An application specific compression algorithm should be applied to images when they are placed on disk or archive. Moreover, when an image is fetched, it will sometimes be appropriate to decode it (for example to display it) and sometimes to leave it encoded (for example if the function to be applied to the image can use the encoded format). Moreover, it is possible that multiple levels of compression will be appropriate. For example, one should spend a small number of seconds compressing an image when it is sent to the disk and a larger amount of time when it is dispatched to the archive. Again, multiple representation changes may be required. 4) Objects are usually accessed through secondary (or primary) indexes. Such indexes can use direct physical pointers for main memory objects, but must employ UIDs for disk based data. Moreover, an AVL tree is an effective indexing scheme for main memory data, but fails disasterously on disk-based data, and a B-tree should be used instead. Hence, both the representation of an index and its basic algorithm should change when objects move in the storage hierarchy. 5) The degree of security controlling access to an object may be different for the different storage levels. One reason to place data in main memory is to ensure the highest possible performance. However, a traditional DBMS runs in a different address space from the application program. Hence, objects are not directly accessible to an application; rather commands must be sent over an interprocess message system and the result returned in the same way. This overhead is expensive for main memory data, and one might choose to allow the application program direct access to the main memory data. Consequently, the physical mechanisms used for access control to objects may vary from level to level in the storage hierarchy. .pp Because the physical block model does not support data objects or indexes changing representations or degree of security when they move between levels, we propose a more general .b logical model to rectify these deficiencies, as indicated in the rest of Figure 1. .(z .hl .(c .TS c c c c c c c c |l|l|l|l|. location main memory disk archive data base data base data base _ main memory main memory objects in cache of disk blocks cache of archive blocks main memory format (M) _ disk disk objects in cache of archive blocks disk format (D) _ archive archive objects in archive format (A) _ .TE .)c .sp .ce Logical Model of a Three Level Store .ce Figure 1 .hl .)z Here, the right-most column indicates the archive data base is the physical block model discussed earlier and supports caching of worthy blocks at higher levels in the storage hierarchy. However, the physical block model is generalized by supporting additional columns in the table. Each column corresponds to a different representation for an object which is appropriate for a specific storage device. Hence, there are three representations possible in Figure 3, one appropriate for the archive (A), one appropriate for the disk (D), and one appropriate for main memory (M). Moreover, .b three different logical data bases coexist, a main memory data base, a disk data base and an archive data base, each of the latter two with caching at higher levels of the hierarchy. Furthermore, an object may exist in any of these three data bases. Lastly, it is possible that one or more of these columns will not be used in a given application. In this case, the model simply contains less columns. .pp This architecture is a generalization of traditional DBMSs, which assume that all records are in disk format, D even if they are cached in main memory. Main memory caching of system catalogs in a different format then occurs outside the storage manager using specialized code. It is also a generalization of the POSTGRES storage system [STON87], which utilizes two different representations for disk and archive objects. Hence, objects are converted when they move between these levels. However, POSTGRES has no support for main memory objects. Lastly, it is also a generalization of commercial persistent object stores, (e.g. the products from Ontologic, Object Design, Objectivity, and Versant) which usually "swizzle" pointers when an object is moved from disk to main memory. Hence, they support two formats for objects, namely M and D, and convert between these representations when objects are moved. However, none of these products support an archive format. .pp We now turn to our specific proposal for a storage manager supporting the logical model. .sh 2 "The Storage Manager" .pp The DBMS stores a set of instances (objects, tuples) of collections (classes, tables). The instances of each collection may be in the format appropriate to any logical device, and we can think of the instances in a particular format as forming a .b logical data base. When a user query is submitted to this system, e.g: .(l retrieve (EMP.name) where EMP.age = 30 .)l it must be executed against some subset of the these logical data bases. To optimize this process, we require a .b distribution .b criteria, for each collection which indicates the location of instances among the three logical data bases. .pp Current persistent object stores, e.g. Orion [KIM90], maintain a list of UIDs of instances in main memory, with the remainder assumed to be in disk representation. Specifying which objects are in main memory using a list of UIDs is unduly restrictive. For example, employees over 65 are retired and might be placed on the archive, while those under 30 might exist in the main memory representation and those over 30 in the disk representation. It is very inefficient to require specifying such partitioning using lists of UIDs. Not only is it space inefficient, but also it does not permit optimization of queries. For example, finding employees under 20 is a query which need be directed only to the objects in main memory representation; however unless the partitioning is described semantically there will be no way to figure this out. .pp On the other hand, POSTGRES uses a semantic criteria to specify the location of objects. Unfortunately, the criteria is hard-coded to be: .(l disk representation: all objects valid at time = now archive representation: other objects .)l A much better alternative would be to allow a user or application program or maybe even the DBMS itself to dynamically specify the semantic composition of objects at each level through a general distribution criteria, and two possible approaches seem reasonable. First, we could require that the criteria for each device be mutually exclusive and correct at all times. One such set of criteria for EMP might be: .(l main memory representation: EMP where age >= 30 and age < 60 disk representation: age < 30 archive representation: age >= 60 .)l If the criteria form a partition, then any update or insert must be installed in the correct data base before the installing transaction commits. Moreover, certain queries need only be processed for one data base. For example, to find the names of all 25 year old employees, one need only query the disk data base. We will call this form of operation .b synchronous, because the distribution criteria is dynamically kept correct for each device. .pp The other possibility would be to require the criteria to form a partition as above. However, instead of guaranteeing that each criteria is correct, the execution engine only guarantees that each instance will be on its correct .b home device or on a device that is an .b ancestor of its home device. In this case, each insert or update can be installed on the main memory device, and then moved .b asynchronously at some later time to a lower level. This asynchronous mode of operation supports faster commit than synchronous mode because modifications can be installed in main memory. However, it has the disadvantage, that queries that should be logically processed by device-i must be processed for all ancestor devices in addition. .pp In a transaction processing environment, we can see the obvious utility of asynchronous operation, while in a decision support application, the synchronous mode might be better. Therefore, one might expect to support both modes of operation; however, we choose to support only asynchronous operation. Because we expect that each device is faster than its descendants by perhaps one order of magnitude, a query to a specific device will generally be much faster than the same query to any of its descendants. Hence, requiring each query to be processed by all ancestors of a device may not be a significant burden. .pp Our storage architecture assumes a general purpose DBMS whose query optimizer has been modified to produce the correct number of queries to the various actual collections. Moreover, we require a collection of background demons, one per logical storage device, which perform sophisticated storage management functions. Hence, we need a storage manager for main memory, the disk and the archive, and each storage manager controls space allocation on its device. Therefore, each one controls the caching of blocks of lower level objects on its device and also controls the asynchronous migration of objects from its data base to lower level home data bases. To signify that this is much more than a buffer manager, we term this software the .b vacuum .b cleaner. Hence, the main memory vacuum cleaner is responsible for the main memory buffer pool of disk blocks and archive blocks as well as for reclaiming space in the main memory data base by migrating instances to their home data bases. Of course the vacuum cleaner must be able to identify instances to be moved, so it must contain a complete execution engine. Also, when an instance is moved, the vacuum cleaner must pass the affected instance to a .b receiver for the target data base, who will install the instance. .pp The last task of the vacuum cleaner associated with each device is to support dynamically changing the distribution criteria, and there are two models of redistribution which we propose. The first model is used by a specific application prior to executing a query and supports .b temporary redistribution of instances, while the second model is used by a data base administrator for .b permanent redistribution. Denote the permanent distribution criteria as {PERM(device)}, and consider a second temporary distribution criteria, {TEMP(device)}. For each device, initially TEMP = PERM. .pp A user can execute a query in one of two ways. First, he can run the query directly as: .(l retrieve (target-list) where qualification .)l For example, he might want the names of employees with low salaries as follows: .(l retrieve (EMP.name) where EMP.salary < 1000 .)l The query optimizer will construct queries for those logical data bases which have a distribution criteria which intersects the above qualification. .pp Alternately, he can temporarily move the data to a higher level and then run the query. The syntax he requires is: .(l elevate objects where qualification to destination .)l For example, he might specify: .(l elevate EMP where EMP.salary < 1000 to main-memory .)l In this case, the distribution criteria changes as follows: .(l TEMP(destination) = TEMP (destination) union qualification TEMP(other-levels) = TEMP (other-levels) minus qualification .)l Elevation is performed synchronously with appropriate locking. An elevate command to a lower level device serves little purpose and will be ignored. .pp Therefore, if the application accessing the low salary employees expects to access this set several times, it can perform the following: .(l elevate EMP where EMP.salary < 1000 to main-memory retrieve (EMP.name) where EMP.salary < 1000 .)l This latter command sequence will cause qualifying instances to be moved to main memory, which may require instance conversion, as well as insertions into main-memory access methods. This overhead will be worth while only if the retrieved objects will be subsequently accessed several times. In this case, the user is requesting data with temporary locality of reference to be elevated to a faster mode of access. When the user is finished with this data, he can cause the data to be returned with a second command: .(l return objects where qualification .)l This command will cause each vacuum cleaner to perform the following steps: .in+2 .ll-2 1) Identify the instances to be moved from its device, j, to device-i with the following query: .(l retrieve (collection-instances) where qualification AND PERM(device-i) and TEMP (device-j) .)l 2) Asynchronously return the instances to their correct home. 3) Appropriately update TEMP for device-j. .ll+2 .in-2 .pp Because we are dealing with a no-overwrite storage environment, the following optimization can be employed for an elevate-return sequence. When an elevate command is executed the data can be .b copied to the destination, leaving the instances also in the lower level data base. If the higher level objects are not modified, then execution of a return command can cause them to be discarded rather than returned. On the other hand, any updates will result in new instances which require return. As a result, only the new instances need be actually moved back to the destination. The bookkeeping to support this is straightforward. One need only record the time, TIME, of the elevation command. When the return command is executed, the instances to be move back to device-i can be identified by: .(l retrieve (collection-instances) where qualification AND PERM(device-i) and collection.valid-time > TIME .)l .pp It is clear that users ofter move data to higher levels and then forget to return them to a lower level when they are done. In this case, the higher storage levels will become cluttered, causing two significant disadvantages. First, space will be taken up that could be better used for caching blocks from lower levels of storage. For example, space in main memory can be used either to store main memory data or to cache disk blocks or archive blocks. The more space that is used for main memory data, the less space that is available for the disk cache. This may mean that there is no space for .b worthy disk blocks such as root nodes of B-trees, etc. The second disadvantage is even more serious. If there is too much main memory data, then data instances will be paged out to a swapping device. In this case, a data base optimized for main memory storage will actually reside partly on disk storage, and the performance implications may be disasterous. For example, main memory data may use AVL trees for an access method. However, in [DEWI84] it is demonstrated that AVL trees perform much worse than B-trees if any significant fraction of their structure goes out to disk. .pp To alleviate this problem, we propose that each vacuum cleaner be able to issue .b return commands on the user's behalf to free up space if required. Traditional storage managers make such placement decisions solely based on the time since the last access to the object. In our environment, the sets of instances that have been temporarily moved are of various sizes. Therefore, the vacuum cleaner should make placement decisions based on both the time since an instance satisfying the qualification has been touched and by the total size of the set of instances. When qualifications overlap, this will lead to errors, but it is an easy to administer policy. .pp The above capability supports temporary movement of data under user control from lower levels to higher levels. The data is then returned to its permanent home under user control or vacuum cleaner control. We now turn to a mechanism to support changing the permanent distribution criteria, PERM. .pp Many users will not take advantage of the .b elevate command because their individual accesses will not justify the cost of the conversion of objects. However, the data base administrator might notice that overall better performance would occur if storage was rearranged. This requires a change in the permanent distribution criteria, PERM, and therefore we require the following .b move command: .(l move objects where qualification to destination .)l .b Move changes PERM in the obvious way, i.e: .(l PERM(destination) = PERM (destination) union qualification PERM(other-levels) = PERM (other-levels) minus qualification .)l Like the .b elevate command, movement of instances up the tree occurs synchronously and down the tree asynchronously. However, .b move differs from .b elevate in that movements of instances are performed by actually deleting the instances from the source data base and inserting them in the target data base, rather than by copying them and subsequently invalidating the copies. Because of this fact, a .b move command will cause the vacuum cleaner for device-i to identify instances to be sent to device-j by: .(l retrieve (collection-instances) where qualification AND NOT PERM (device-i) AND PERM(device-j) .)l i.e. any temporary redistributions currently in effect can be ignored. .pp We observe that some collections of objects to be moved may take substantial rearrangement time. For example, to move a one gigabyte collection of objects from archive to disk requires about 83 minutes. It makes no sense to perform this rearrangement synchronously, since the data in question would be unavailable for a long period of time. To deal with this issue, we propose the notion of a .b goal distribution criteria, {GOAL (device)}. The data base administrator can set a new goal at any time for any level using the following command: .(l target objects where qualification for destination .)l For example, .(l target EMP where EMP.salary < 500 for main-memory .)l The vacuum cleaner for device-i must find instances where: .(l PERM (device-i) AND NOT GOAL (device-i) .)l and asynchronously move them to the correct level with minimal locking. Consider the case that there is a term in GOAL of the form: .(l collection.attribute-1 operator-1 value-1 .)l and that an ordering index such as a B-tree or AVL tree exists on the field, attribute-1. If PERM does not include a clause that uses attribute-1, then the vacuum cleaner records the above clause as its movement goal. On the other hand, if PERM includes a clause using attribute-1, e.g: .(l collection.attribute-1 operator-1 value-2 .)l then, the storage manager has a movement goal of the form: .(l collection.attribute-1 operator-1 value-1 AND NOT collection.attribute operator-2 value-2 .)l In either case, the vacuum cleaner can enter the indicated index and identify small collections of records to incrementally move. The required algorithms are sketched in [STON89] for a similar problem, that of incrementally building secondary indexes. .pp We also propose that the same goal mechanism be used to construct new indexes for instances, which require sufficient time that synchronous index construction is not advantageous. Again [STON89] contains detailed algorithms. .pp The last requirement is extensions to the type system. We assume an abstract data type (ADT) system of the form in [STON86B]. Therefore, a user can construct a collection using the syntax: .(l create collection-name (attribute-1 = type-1, ... , attribute-n = type-n) .)l For example, the EMP collection could be constructed as follows: .(l create EMP (name = char16, address = point, manager = EMP, age = int4, salary = int4) .)l This specification indicates the types that the user wishes to give to the system for storage and receive back as answers to queries. However, the types actually stored in the three collections, may be different. To support construction of the various representations, we propose the addition of the following commands: .(l use name-1 for name-2 on device-1 convert name-1 to name-2 using function-name .)l For example, the following specification supports a main memory representation of the char16 type: .(l use m-char16 for char16 on main memory convert char16 to m-char16 using make-string .)l Here, make-string is a previously registered function with an argument of type char16 and a result type of m-char16. .pp When a create statement is processed, the DBMS must construct three actual create statements. To do so it identifies any relevant collection of .b use statements for the types specified by the user and utilizes them in the obvious way. If no .b use statement is encountered, then the DBMS simply utilizes the type specified in the user's create statement. Whenever, a vacuum cleaner is required to convert from one representation to another, it makes use of information in a relevant .b convert statement to identify which function to call. .pp An ADT system must support indexing for all three data bases. Since each index may be specific to a level, then a user must indicate a specific device in an index creation command as follows: .(l create sal-index (EMP) on salary as B-tree for disk .)l If he leaves out the location clause, then the DBMS should assume that the index is to be built for all devices. .pp The last extension we propose is to allow the data base administrator to specify for each collection which of two security modes he wishes, .b trusted or .b secure mode. In either case, the application program runs in a separate address space from the DBMS. With trusted mode, objects in the collection are placed in a segment which is shared between the DBMS and the application. Thereby, instances can be directly manipulated by the application. Alternately, instances occur in a segment private to the DBMS and records must be exchanged over an interprocess message system. .pp Trusted mode allows very fast access by the application, especially for main memory data. Once the application had identified a collection of records by running a query to obtain their storage addresses, subsequent accesses could be performed by direct memory access. .sh 2 "How Many DBMSs" .pp It might be argued that the above proposal is equivalent to .b three DBMSs, one for each storage device, and that the software complexity will be prohibitive. In this section, we claim that the above proposal is only modest extra work on top of an extendible DBMS such as POSTGRES. The main pieces of a DBMS are the parser, planner, executor, access methods, utilities, and transaction management system, and we discuss each piece in turn. .pp The parser is the same one used for a traditional system. Concerning the planner, it must be extended for a tertiary memory environment, and the modest extensions required are discussed in Section 5. The executor must evaluate a query plan for each instance fetched by the access methods. Given that our proposal specifies representation issues within the ADT system, there is essentially no extra work in this module, given that an ADT system has been constructed. Access methods may well be specific to devices, and we envison ones optimized for each type of device. Clearly, there is extra effort required to add access methods; however, if the DBMS has been designed along the lines of [STON87] to allow indexes to be added, then there is no additional complexity outside of the access methods. In the utilities, there is low level code to support each device. However, if any device comes with a reasonable file system, then such code should be modest. Lastly, a single concurrency control system must be used for all devices; hence no extra difficulties occur here, and a no-overwrite storage manager obviates the need for crash recovery code. The only requirement is a vacuum cleaner for each device. Hence, we estimate that a three level DBMS is much closer in complexity to a single extensible DBMS than to three times that complexity. .pp It might be argued that a two level hierarchy is enough because disk will be formatted the same way as the archive. In some applications this may be true; however, we believe that an architecture allowing N representations will be generally superior to one allowing only two representations. .sh 1 "STORAGE AND INDEXING OF LARGE OBJECTS" .pp Because very large data bases will usually contain long fields, we discuss their storage and indexing in this section. First, the characteristics of archival memory that constrain the problem of storing long fields are discussed. Then, we discuss record storage in this environment. Lastly, functions may be very expensive to compute, and the section closes with a proposal that addresses this fact. .sh 2 "Memory Characteristics" .pp In this section the abstract model for a three level store is considered. The archive device consists of a collection of 2 or more read/write stations onto which .b media .b platters can be loaded from storage racks by mechanical robots. The load time for a platter is dominated by the speed of the robotics, which is typically 5-10 seconds in current devices. Once loaded, the time to read or write any specific storage block depends on the characteristics of the device. For optical disks, access times vary from 10-100 msecs depending on how far the disk arm must move to the desired block. For tape media the access time is on the order of 1-100 seconds depending on what tape technology tape is used (9track, 8mm, DAT) and how far into the tape the desired block resides. Once located, a block can be read quickly as sequential read speeds are usually 0.2 mbyte per second or higher. Hence, tertiary store is characterized by the following parameters: .(l AP -- platter switch time in seconds AR -- time to move to a random block on a loaded platter in seconds. AT -- transfer rate in bytes/sec once the desired information is under the read head .)l .pp For secondary memory, it will be useful to include the following parameters: .(l DR -- time to move to a random block on a magnetic disk drive in seconds DT -- transfer rate in mbytes/sec once the desired information is under the read head .)l Although it is possible to build an I/O controller which would move archive blocks directly to the disk and back, we assume a more conventional organization in which archive blocks are read into main memory. In this case, we assume the existence of two physical block sizes, B1 and B2, which are the units of transfer respectively between the disk and main memory and the archive and main memory. Presumably B1 is 4K bytes, while B2 will be a much larger value, say 64K or 128K. .sh 2 "Storage of Records with Long Fields" .pp Clustering has been studied extensively with a disk as the assumed storage device [CHAN89]. Such studies try to arrange a collection of records so that the number of physical disk reads which must be performed to access a set of related objects is minimized. Similar work assuming an optical disk as a storage device is reported in [CHRI87]. .pp The two popular forms of archives are ones using optical disks and tapes. In tape systems, AR is the dominent time and must be carefully optimized. On the other hand, in an optical disk tertiary memory system, platter switches (AP) are a factor of 100 larger than seeks (AR), and will therefore dominate performance. Consequently, we believe that the important clustering problem for this device is to arrange data records so as to minimize the number of platter switches when accessing a set of records. Therefore, we assume that all the instances of each collection are allocated to a single .b home platter. As a result, queries to a specific collection would be confined to a single media. Since, the platter capacity of most archives exceeds 3 Gigabytes, this will support moderate size collections. Larger ones must be horizontally partitioned [CERI84] to multiple platters by an .b archive distribution criteria. We assume that the archive distribution criteria is a set of clauses of the form: .(l collection.fieldname operator value to platter-number .)l Should a user wish to cluster together instances of different collections, he can ensure that they are allocated to the same platters by carefully structuring the collection of archive distribution criteria. .pp Lastly, we assume that instances of each collection are stored with all short fields together in a record and all long fields stored separately. The reasoning behind this decision is presented in [STON91]. .sh 2 "Functions on Long Fields" .pp Functions in a large object environment may be very expensive to compute. For example, given the collection: .(l EMP (name, address, manager, age, salary, picture) .)l one might ask the query: .(l retrieve (EMP.name) where EMP.age > 50 and beard (EMP.picture) = "red" .)l In this case, .b beard, is a classification function operating on the image of the employee and will take many thousands of instructions to compute. Hence, the second term is vastly more expensive than the first one in CPU time. Moreover, the function .b beard may not require all the bits in a picture to perform the classification; therefore it should be possible for the function to selectively retrieve just the information it requires. Lastly, an index on EMP.picture will not accelerate the above query; rather we require indexes on a function of EMP.picture. .pp The following proposal addresses these requirements. When a new type is registered with the DBMS, a single extra flag must be supported in the type definition, namely LONG. This flag denotes to the DBMS that the type in question is a candidate for separate storage. For types which are not LONG, functions can reference their arguments either by value of by reference. On the other hand, functions on LONG types must use a special interface. Specifically, when a function on a LONG field is called, it receives a .b magic .b cookie which is similar to a file descriptor. The function can then use a library of routines to .b seek to a specific location in the long object and then .b read or .b write a specific number of bytes to or from a buffer. Hence, it receives an abstraction for long fields that is the same as that of a file and can buffer as much of the long field as desired. Unnecessary portions of the long field never need be read from secondary or tertiary memory. .pp We assume that qualifications have clauses of the following forms: .(l 1) collection.fieldname operator constant 2) function (collection.fieldname) operator constant .)l An example of the first form is .(l EMP.salary > 5000 .)l Clauses of this sort have long been addressed by query optimizers [SELI79], and the generalization to an abstract data type context performed in [STON86B]. An example of the second form is: .(l beard (EMP.picture) = "red" .)l Such classification functions are very common, and example functions for photographs include beard, glasses, hair color, large nose, and scowl. .pp We assume that indexes are available for the abstract data type fields present in any collection as in [STON86B]. However, we also assume that indexes are generalized to be available on a function of a data type. This idea was proposed in [LYNC88] in the context of textual data, and there is little difficulty in implementing this generalization, as noted in [AOKI89]. Moreover, [MAIE86] proposed essentially the same construct by suggesting that the member sub-objects of a complex object be indexable. .pp We further assume that archive indexes are stored on the same platter as the data they index. Moreover, they may be cached at higher levels of the storage hierarchy and should be allowed to be built incrementally. .pp The optimizer must deal intelligently with functions which are very expensive to compute and/or fetch substantial portions of long objects. Earlier, [STON86B] identified a collection of information which must be specified by the definer of each function. This information is used by the parser and query optimizer to process queries on ADT fields. In the proposed environment four additional parameters must be added to this collection when a function, f, is defined: .in+2 .ll-2 1) Fraction of archive blocks read -- AB(f) When a function is applied to a long field, this parameter specifies the fraction of the blocks it expects to read from the archive through the .b magic .b cookie interface. 2) Fraction of disk blocks read -- DB(f) In case the object is buffered on the disk or converted to disk representation, the query optimizer can estimate the fraction of the disk blocks that will be read using this parameter. 3) Fraction of bytes examined -- FB(f) This number represents the fraction of the bytes in a long field that the function must examine. This quantity will be used in the CPU computation to follow. 3) CPU time per byte -- CPU_b(f) Since classification functions are often extremely expensive to compute, this parameter allows the optimizer to make a better estimate of the CPU time to execute the function. This will also be used in the CPU computation to follow. 4) CPU time per call -- CPU_c(f) There are occasional functions on short fields that are CPU intensive. For example, suppose passwords are added to the EMP collection and stored in encoded form. Then, a system administrator might want to execute the following command .(l retrieve (EMP.name) where break(EMP.password) = "easy" .)l to look for users with passwords that are too easy to break. In this situation, the break function is extremely CPU intensive, even though the password field is short. .ll+2 .in-2 With the above three parameters, the CPU time for applying a function to an attribute can be estimated as: .(l CPU_c(f) + CPU_b(f) * FB(f) * (expected length of the attribute) .)l If the definer of a function does not specify these parameters, they would be defaulted to the values appropriate for short fields. Reasonable values might be: AB = 1, DB = 1, FB = 1, CPU_b = 10 and CPU_c = 100. .sh 1 "QUERY OPTIMIZATION" .pp The optimization framework in [SELI79] carries over into the environment of this paper with a few extensions and modifications. This section discusses how the cost function must change and how the space of plans to be considered must be extended. .pp A traditional optimizer [SELI79] uses a cost function of the form: .(l query cost = expected (I/Os) + W1 * expected (records examined) (2) .)l These terms are respectively the expected number of I/O's and the expected number of records examined, multiplied by a conversion factor, W1. When tertiary memory is considered, the above formula must change to: .(l query cost = expected CPU time + W1 * (DR + B1 / DT) * expected disk I/Os + W2 * (AR + B2 / AT) * expected archive I/Os + (3) W2 * P * expected platter changes .)l Here, expected CPU time is estimated by multiplying the expected number of records examined by the expected CPU time per record. The second clause is the expected disk time in seconds, and W1 is therefore the system-specific conversion rate between CPU seconds and disk seconds. The third and fourth terms together form the archive time, and therefore W2 is the conversion rate between CPU seconds and archive seconds. .pp In our environment, each query will be decomposed into as many as three actual queries, one to the main memory data base, one to the disk data base and one to the archive data base. The first term is the only one considered for the main memory query, while the first two terms are considered for the disk query. Only for the archive query are all terms considered. .pp An optimizer should therefore compute (3) for each possible plan and then choose the expected cheapest one. However, in the environment of this paper, the order of evaluation of clauses that restrict the same collection must be carefully considered. For example, consider the query .(l retrieve (EMP.name) where beard (EMP.picture) = "red" and EMP.salary = 500 .)l In the case that there is no applicable index, a conventional optimizer will evaluate the two clauses on picture and salary in random order. Since, the cost of evaulating .(l beard (EMP.picture) = "red" .)l is very large compared to .(l EMP.salary = 500 .)l the optimizer should construct a plan whereby the latter clause is evaluated first. Only for those instances with the correct salary must the expensive computation be performed. Therefore, the optimizer must consider the two different orderings of the clauses as separate plans. .pp In addition, when multiple clauses exist for a collection, a traditional optimizer will process all of them at the same time in some order. However, in this environment, the optimizer must also consider processing the clauses separated by other intervening operations. An example of the utility of this approach is discussed in [STON91]. Also discussed in [STON91] is the necessity of using more than one index when processing a restriction query for a collection. .pp Therefore, in our environment, if an optimizer is given a query with N clauses spanning K collections, there are as many as N! different ordering of the clauses that merit consideration. For each of these orderings there may be several different actual plans to evaluate. Consequently, the search space grows dramatically relative to the space evaluated by a conventional optimizer. .pp For each possible plan, the optimizer must construct an estimate for (3) and then choose the expected cheapest one. We need to extend [SELI79] with cost calculations for clauses involving long fields, clauses involving expensive functions on short fields, and costs for archival store access. We now treat these topics in order. .pp Consider a clause C of the form: .(l f (long field) operator constant .)l If there is an index on .(l f (long field), .)l then this clause becomes a "normal" one and can be evaluated using [SELI79]. Otherwise, the optimizer must estimate the following constant: .(l CONST(C) = (expected number of instances examined) * (expected field length) .)l Hence, the query optimizer must guess the number of instances examined using classical mechanisms, and the average field length is assumed available from the system catalogs. Moreover, it must make two additional estimates for the long fields in each collection: .(l disk fraction: the fraction of the bytes in the long objects present in the disk cache m-m fraction: the fraction of the bytes in the long objects present in the main memory cache .)l The archive fraction is one minus these two numbers. Of course, the obvious restrictions on these numbers for the disk and main memory data bases should be assumed. .pp As a result, the query optimizer can evaluate the cost of a clause containing a long field as: .(l CPU time = CPU_c + CPU_b * FB(f) * CONST(C) expected disk I/Os = (disk fraction) * DB(f) * CONST(C) expected archive I/Os = (archive fraction) * AB(f) * CONST(C) .)l .pp CPU intensive functions on short fields are the second kind of access to be individually accounted for. Here, the I/O will be accounted for in the traditional metrics for the rest of the query from [SELI79]. Only the CPU resources need be considered separately, and CPU time for functions of short fields is the same as that computed for long fields above. .pp We now turn to the cost calculations for operations to archival storage other than long field access. A query plan consists of a collection of query processing steps, each of which is a scan of a collection, an indexed scan of a collection, a join of two collections by iterative substitution, a join of two collections by merge sort, or a join of two collections by hash join. The CPU time for each plan operation can be estimated using conventional means. The disk and archive I/O for sequential scans and indexed scans can be computed by the optimizer using the disk fraction and main-memory fraction for the short fields in a given collection. The number of platter swaps required is one more than the number of platters on which the collection resides. .pp For merge-sort or hash joins, it is clear that any required temporary collections should be allocated in main memory or on disk. Hence, the archive need be read only during the initial scan of both collections. After that standard disk-based formulas apply. Furthermore, the total number of platter switches for each collection is 1 plus the number of platters that each collection occupies. .pp The last tactic is iterative substitution. For disk based collections, the standard formulas apply. For archive collections, one should only consider this strategy if there is an index on the inner collection. Furthermore, if the inner collection is a multi-platter collection, then there is a danger of one platter switch per outer record. To avoid this disasterous overhead, iterative substitution should only be considered if the inner collection is clustered on platters in the same way as the outer collection. This requires that both collections use the same field in the distribution criteria used to partition the table. In this case, the number of platter switches is again 1 plus the number of platters for each collection. .pp Based on these considerations, an optimizer can estimate the cost for any given query plan according to (3) above and then choose the expected cheapest one. However, there are at least two optimization tactics that may be useful to reduce the search space. First, it may be reasonable to automatically delay all expensive clauses to the end of a query. Then, at the end of a plan, they should be executed in ascending order of .(l function cost * clause selectivity .)l Second, it may be very desirable to perform multiple query optimization. This tactic has been studied in a disk-based environment when queries have overlapping terms in their qualifications [SELL86]. In our environment, it may be desirable to group a large collection of queries into a .b bundle and then make a sequential pass through the instances of a collection processing all queries in parallel. Consider the following two queries: .(l retrieve (EMP.name) where EMP.age > 40 retrieve (EMP.salary) where EMP.dept = "shoe" .)l If there are indexes on neither age nor dept, then a sequential scan of EMP is required. In this case, both qualifications can be economically checked in a single scan of the collection. .sh 1 "CONCLUSIONS" .pp We have proposed an architecture for a multi-level storage manager which integrates both main memory and archive data bases into a common framework and substantially generalizes many previous proposals. Specifically, it supports real-time applications, the caching requirements of persistent programming languages, and the needs of applications with very large data bases in a common framework. .pp Moreover, we have proposed the query optimization support required for the resulting environment. Basically, the optimizer must be extended to cope with: .(l 1) the characteristics of the archive media 2) the desirability of storing long fields in a separate location from short fields 3) the prospect of CPU intensive functions .)l and we have shown a methodology to accomplish these tasks. .pp In order to turn a prototype like POSTGRES into the system outlined in Section 3-5, the main steps required are: .(l 1) support for a main memory data base 2) the possibility of indexes on functions of an attribute 3) the replacement of a hard coded distribution criteria with a general one 4) extending the optimizer as noted in Section 5 5) implementation of separate storage for long fields 6) rearchitecting the POSTGRES disk-to-archive vacuum cleaner 7) implementation of a main memory vacuum cleaner .)l We are currently designing such a system, currently denoted POSTGRES II, with these characteristics. The scope of distribution support in POSTGRES II is also under study. .sh 1 "REFERENCES" .nr ii 10m .ip [AOKI89] Aoki, P., ``Implementation of Extended Indexes in POSTGRES,'' Electronics Research Laboratory, University of California, Technical Report 89-62, July 1989. .ip [CATT90] Cattell, R. and Skeen, J., "Engineering Database Benchmark," Sun Microsystems, Technical Report, April 1990. .ip [CERI84] Ceri, S. and Pelagatti, G., "Distributed Databases, Principles and Systems," McGraw-Hill, New York, N.Y., 1984. .ip [CHAN89] Chang, E. and Katz, R., "Exploiting Inheritance and Structure Semantics for Effective Clustering and Buffering in an Object-oriented DBMS," Proc. 1989 ACM- SIGMOD Conference on Management of Data, Portland, Ore., June 1989. .ip [CHRI87] Christodoulakis, S., "Analysis of Retrieval Performance for Records and Objects Using Optical Disk Technology," ACM TODS, June 1987. .ip [CHOU85] Chou, H. and Dewitt, D., ``An Evaluation of Buffer Management Stockholm, Sweden, August 1985. .ip [DEWI84] Dewitt, D. et. al., ``Implementation Techniques for Main Memory Data Base Systems,'' Proc. 1984 ACM-SIGMOD Conference on Management of Data, Boston, Ma., June 1984. .ip [DEWI86] Dewitt, D. et. al., .q "GAMMA: A High Performance Dataflow Database Machine," Proc. 1986 VLDB Conference, Kyoto, Japan, Sept. 1986. .ip [DEWI90] Dewitt, D. et. al., " A Study of Three Alternative Workstation-Server Architectures for Object-oriented Database Systems," Proc. 1990 VLDB Conference, Brisbane, Australia, August, 1990. .ip [HONG90] Hong, W. and Stonebraker, M., "Parallel Query Processing in XPRS," Memo UCB/ERL M90/47, University of California, Berkeley, Ca., May 1990. .ip [KIM90] Kim, W. et. al., "Architecture of the Orion Next-Generation Database System," IEEE Transactions on Knowledge and Data Engineering, March 1990. .ip [LEHM86] Lehman, T. and Carey, M., "Query Processing in Main Memory Database Management Systems," Proc. 1986 ACM-SIGMOD Conference on Management of Data, Washington, D.C., June 1986. .ip [LYNC88] Lynch, C. and Stonebraker, M., ``Extended User-Defined Indexing with Application to Textual Databases,'' Proc. 1988 VLDB Conference, Los Angeles, Ca., Sept. 1988. .ip [MAIE86] Maier, D. and Stein, J., "Indexing in an Object-Oriented DBMS," OGC Technical Report CS/E-86-006, Oregon graduate Center, Beaverton, Ore., May 1986. .ip [MOSH90] Mosher, C. (ed), "The POSTGRES Reference Manual, Version 2," Electronics Research Laboratory, University of California, Berkeley, Ca., Report M90/53, July 1990. .ip [NAVA82] Navathe, S. et. al., "Vertical Partitioning for Physical and Distribution Design of databases," Stanford University, Technical Report STAN-CS-82-957, 1982. .ip [SELI79] Selinger, P. et. al., ``Access Path Selection in a Relational Data Base System,'' Proc 1979 ACM-SIGMOD Conference on Management of Data, Boston, Mass., June 1979. .ip [SELL86] Sellis, T., ``Global Query Optimization,'' Proc 1986 ACM-SIGMOD Conference on Management of Data, Washington, D.C., June 1986. .ip [SMIT81] Smith, A., "Analysis of Long Term File Reference Patterns for Application to File Migration Algorithms", IEEE Transactions on Software Engineering, July, 1981. .ip [STON86] Stonebraker, M. and Rowe, L., .q "The Design of POSTGRES," Proc. 1986 ACM-SIGMOD Conference on Management of Data, Washington, D.C., May 1986. .ip [STON86B] Stonebraker, M., "Inclusion of New Types in Relational Data Base Systems," Proc. 1986 IEEE Data Engineering Conference, Los Angeles, Ca., Feb. 1986. .ip [STON87] Stonebraker, M., "The POSTGRES Storage System," Proc. 1987 VLDB Conference, Brighton, England, Sept. 1987. .ip [STON89] Stonebraker, M., "The Case for Partial Indexes," SIGMOD RECORD, Dec. 1989. .ip [STON90] Stonebraker, M. et. al., "The Implementation of POSTGRES," IEEE Transactions on Knowledge and Data Engineering, March 1990. .ip [STON91] Stonebraker, M. et. al., "Managing Persistent Objects in a Multi-Level Store," Electronics Research Laboratory, University of California, Technical Report M91/16, March 1991.