head	1.1;
access;
symbols;
locks; strict;
comment	@.\" @;


1.1
date	92.07.14.20.10.41;	author mao;	state Exp;
branches;
next	;


desc
@mike olson's master's thesis on the postgres storage manager switch.
@


1.1
log
@Initial revision
@
text
@.\"
.\"  This document describes the storage manager switch and Inversion
.\"  file system provided with POSTGRES.  This work was performed in
.\"  partial satisfaction of the requirements for a M.Sc. degree from
.\"  UC Berkeley.
.\"
.nr pi 3n
.nr pp 11
.nr tp 11
.nr sp 11
.\"
.\" $0 is the section-header trap
.\"
.nr xu 0
.de $0
.nr QQ \\n($0+2
.(x
.if \\n($0=1 .sp 0.7v
.ti \\n(QQm
\\*($n. \\$1
.)x
..
.de RV
.ie "\\$2"" \
\{\
.	ds VN "0.0
.	ds VD "(No date)
.	ds VT "UNAUDITED VERSION
.\}
.el \
\{\
.	ds VN \\$3
.	ds VD \\$4
.	ie "\\$7"Exp" \
.		ds VT "DRAFT
.	el \
.		ds VT \\$7
.\}
..
.de BU
.ip \0\0\0\(bu
..
.de CW
\\f(CW\\$1\\fP\\$2
..
.de (E
.(b
.ft CW
..
.de )E
.ft R
.)b
..
.ds PG "\\s-2POSTGRES\\s0
.ds PQ "\\s-2POSTQUEL\\s0
.ds UX "\\s-2UNIX\\s0
.ps 11
.\" on my printer, .79+.11 == 1 && .75 == 1.
.m1 0.79i
.m2 0.11i
.m3 0
.m4 0.75i
.ba +0.5i
.ll 6.5i
.ls 2
.RV $Header: /home/postgres/mao/misc/master/RCS/thesis.me,v 2.4 1992/05/12 22:53:29 mao FINAL $
.rm RV
.EQ
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.ps 10
.ft 2
.ps 11
.ps \n(99
.ft \n(98
.EN
.he '''\fR%\fP'
.ft B
.ce
ABSTRACT
.sp 0.5v
.lp
Advances in mass storage technology are producing
devices capable of holding terabytes of data.
These new devices,
often called
.i "tertiary storage devices" ,
have dramatically different performance characteristics than magnetic disks.
Conventional database systems
include explicit dependencies on magnetic disk,
and so are unsuited to manage tertiary storage devices.
A layer of abstraction has been introduced between the access methods
and physical storage devices in \*(PG.
This abstraction layer supports the addition of tertiary storage.
An example implementation of a tertiary storage device manager
is presented.
Finally,
the architecture of a novel file system
that takes advantage of POSTGRES database system services is discussed.
This file system may use any of the storage devices managed by
\*(PG.
.sh 1 "INTRODUCTION"
.pp
The research presented here investigates strategies for supporting massive data
storage devices in a relational database management system.
Such devices,
often called
.i "tertiary storage devices"
because they are slower and cheaper than magnetic disk (secondary) storage,
have recently become widely available commercially.
.pp
This thesis describes contributions in three key areas.
First,
a new internal interface for the \*(PG database system
makes it easy to add new kinds of hardware to the set of storage devices
available to users.
This abstraction,
called the
.i "storage manager switch" ,
hides characteristics of underlying storage devices,
allowing the data manager to treat all devices the same.
.pp
Second,
as an example of device extensibility,
the \*(PG data storage pool incorporates
a 327GByte Sony write-once optical disk jukebox.
This device stores relational data under \*(PG version 4.0 at Berkeley.
.pp
Finally,
a new file system has been designed and implemented on top of \*(PG.
Since it is built on top of the database system,
this file system provides functionality not available in
any other file system.
Some of the services provided are time travel and transaction
protection for user file updates.
The file system also takes advantage of the storage manager switch
to provide file storage on any device known to \*(PG.
.sh 2 "An Overview of the POSTGRES Database System"
.pp
The \*(PG database system [MOSH92]
uses a novel no-overwrite technique for managing storage.
This technique allows the user to see the entire history of the database
and obviates the need for a conventional write-ahead log,
speeding recovery [STON87].
When a tuple is updated or deleted,
the original tuple is marked invalid,
but remains in place.
For updates,
a new tuple containing the new values
is added to the database.
By using transaction start times and a special status file
that records whether or not a transaction has committed,
\*(PG can present a transaction-consistent view of the database
at any moment in history.
This capability is referred to as
.i "time travel" .
Since only the start time and commit state of a transaction
must be recorded in the status file,
no special log processing is required at crash recovery time.
.pp
Periodically,
obsolete tuples must be garbage-collected from the database,
and either moved elsewhere or physically deleted.
If the tuples are not saved elsewhere,
some historical state of the database is lost.
If time travel is desired,
the tuples must be saved forever somewhere.
This process is referred to as
.i "tuple archiving" .
.pp
\*(PG includes a special-purpose process,
called the
.i "vacuum cleaner" ,
that archives tuples.
Obsolete tuples are physically removed from the relation in which they
originally appeared,
and are inserted into a separate relation.
This keeps the relation storing current state of the database small,
but the database may still grow without bound.
.sh 2 "The Importance of Tertiary Storage"
.pp
Tertiary storage is generally an order of magnitude or more
slower,
both in access latency and throughput,
than magnetic disk storage.
In most cases,
physical recording media are managed by robotics,
which move them between shelves and players on demand.
Moving a platter typically takes thousands of times longer
than transferring a data block.
However, the use of robotics multiplies
the amount of storage available to a single reader,
typically by two or three orders of magnitude.
.pp
The inclusion of tertiary storage in relational database systems
is important for several reasons.
As described in [STON87],
\*(PG relies on tertiary storage to hold historical database state.
More generally,
the relatively low cost per byte of tertiary storage
makes it attractive to systems builders,
which puts pressure on software vendors to support it.
In addition,
new data types like voice,
video,
and still images will require orders of magnitude more
space than conventional data types.
Finally,
there is strong commercial pressure to support ever-larger
conventional databases.
For example,
retailers and stockbrokers
would store every financial transaction forever
if they had sufficient space [BERN88].
.sh 2 "Thesis Overview"
.pp
This thesis describes two key extensions to the \*(PG database system.
First,
\*(PG has been made device-extensible,
so that new storage technologies may be used to store data.
Second,
a file system has been implemented on top of the database system,
providing important semantic guarantees to file system users
by taking advantage of database services.
.pp
Other device-extensible database systems include Starburst [HAAS90].
However,
Starburst does not provide a file system interface to non-database users.
Current file systems research is exploring ways of adding
functionality to existing file systems.
[CABR88], [CHUT92], and [SELT90] describe transaction-protected file systems,
and [ROOM92] supports time travel,
but none of these systems offers both services simultaneously.
.pp
The remainder of this thesis is organized as follows.
Section two gives a brief overview of the new storage manager,
and the important differences it exhibits from [STON87].
Section three presents the new storage manager architecture.
Section four describes the inclusion of a particular new storage device,
namely a 327GByte Sony write-once optical disk jukebox.
Section five introduces the Inversion file system,
which is built on top of the \*(PG database system.
Section six shows performance measurements for the Inversion file system.
Section seven presents conclusions,
and section eight lists some directions for future research.
.sh 1 "OVERVIEW OF THE NEW POSTGRES STORAGE MANAGER"
.pp
Throughout this paper,
the term
.i "storage manager"
refers to an interface layer in \*(PG that accepts requests for
operations on the data store.
This layer determines which storage device must be used to
satisfy the request and calls device-specific code to handle it.
Each device has a set of interface routines for use by the storage manager.
The collection of these routines for a single device is called a
.i "device manager" .
.sh 2 "An Arbitrary Storage Hierarchy"
.pp
[STON87] proposed a two-level storage hierarchy
consisting of magnetic disk and tertiary storage,
as shown in Figure 1.
.(z
.in +0.5i
.hl
... 0.36 7.01 4.74 10.19 0 0 4096 4096
... 0u 1831u 2522u 0u -206u 5867u 2358266u -2352605u
.PS 1831 2522 
.br
.ps 11
\h'290u'\v'1290u'\D'a288u -539u 288u 540u'
.sp -1
\h'282u'\v'714u'\D'e590u 84u'
.sp -1
\h'290u'\v'714u'\D'l0u 576u'
.sp -1
\h'865u'\v'714u'\D'l0u 576u'
.sp -1
\v'357u'\D'l0u -357u'
.sp -1
\D'l1152u 0u'
.sp -1
\h'1152u'\D'l0u 357u'
.sp -1
\h'1152u'\v'357u'\D'l-1152u 0u'
.sp -1
\h'578u'\v'354u'\D'l0u 324u'
.sp -1
\h'563u'\v'621u'\D'l15u 57u'
.sp -1
\h'592u'\v'621u'\D'l-14u 57u'
.sp -1
\h'578u'\v'948u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fImagnetic\fP'u/2u'\s11\fImagnetic\fP\h'-\w'\s11\fImagnetic\fP'u/2u'
.sp -1
\h'578u'\v'1128u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIdisk\fP'u/2u'\s11\fIdisk\fP\h'-\w'\s11\fIdisk\fP'u/2u'
.sp -1
\h'578u'\v'192u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'\s11\fIPOSTGRES data manager\fP\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'
.sp -1
\h'541u'\v'480u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRNew data stored\fP'u'\s11\fRNew data stored\fP
.sp -1
\h'541u'\v'588u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRon magnetic disk\fP'u'\s11\fRon magnetic disk\fP
.sp -1
\h'1658u'\v'1543u'\D'l0u -1151u'
.sp -1
\h'1658u'\v'392u'\D'l864u 0u'
.sp -1
\h'2522u'\v'392u'\D'l0u 1151u'
.sp -1
\h'2522u'\v'1543u'\D'l-864u 0u'
.sp -1
\h'1727u'\v'605u'\D'l0u -70u'
.sp -1
\h'1727u'\v'535u'\D'l507u 0u'
.sp -1
\h'2234u'\v'535u'\D'l0u 70u'
.sp -1
\h'2234u'\v'605u'\D'l-507u 0u'
.sp -1
\h'2303u'\v'749u'\D'l0u -219u'
.sp -1
\h'2303u'\v'530u'\D'l144u 0u'
.sp -1
\h'2447u'\v'530u'\D'l0u 219u'
.sp -1
\h'2447u'\v'749u'\D'l-144u 0u'
.sp -1
\h'1729u'\v'678u'\D'l503u 0u'
.sp -1
\h'1729u'\v'714u'\D'l503u 0u'
.sp -1
\h'1729u'\v'750u'\D'l503u 0u'
.sp -1
\h'1729u'\v'786u'\D'l503u 0u'
.sp -1
\h'1729u'\v'822u'\D'l503u 0u'
.sp -1
\h'1729u'\v'858u'\D'l503u 0u'
.sp -1
\h'1729u'\v'894u'\D'l503u 0u'
.sp -1
\h'1729u'\v'930u'\D'l503u 0u'
.sp -1
\h'1729u'\v'966u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1002u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1038u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1074u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1110u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1146u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1182u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1218u'\D'l503u 0u'
.sp -1
\h'1729u'\v'1254u'\D'l503u 0u'
.sp -1
\h'2088u'\v'1416u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fItertiary storage\fP'u/2u'\s11\fItertiary storage\fP\h'-\w'\s11\fItertiary storage\fP'u/2u'
.sp -1
\h'720u'\v'1831u'\D'l0u 0u'
.sp -1
\h'720u'\v'1831u'\D'l0u 0u'
.sp -1
\h'720u'\v'1831u'\D'l0u 0u'
.sp -1
\h'720u'\v'1831u'\D'l0u 0u'
.sp -1
\h'865u'\v'966u'\D'l792u 0u'
.sp -1
\h'1599u'\v'981u'\D'l58u -15u'
.sp -1
\h'1599u'\v'952u'\D'l58u 14u'
.sp -1
\h'1009u'\v'1092u'\h'-0.0m'\v'0.2m'\s11\fROld data moves\fP
.sp -1
\h'1009u'\v'1200u'\h'-0.0m'\v'0.2m'\s11\fRto tertiary\fP
.sp -1
\h'1009u'\v'1308u'\h'-0.0m'\v'0.2m'\s11\fRstorage\fP
.sp -1
.sp 1+1831u
.PE
.sp -2v
.ce
Figure 1: Architecture of the \*(PG Storage Manager Proposed in [STON87]
.hl
.in -0.5i
.)z
.pp
The new storage manager,
described here,
is more general.
New devices may be added to the system
by writing a set of interface routines.
When a user creates a new relation,
he assigns it to a particular device,
where it resides until it is destroyed.
The relation's archive may also be assigned to any device manager.
Thus the fixed two-level hierarchy proposed in [STON87]
is replaced by a pool of storage,
as shown in Figure 2.
.(z
.in +0.5i
.hl
... 0.24 6.89 5.24 10.01 0 0 4096 4096
... 0u 1797u 2880u 0u -137u 5766u 2359158u -2353529u
.PS 1797 2880 
.br
.ps 11
\h'2086u'\v'629u'\h'-0.0m'\v'0.2m'\s11\fRUser controls\fP
.sp -1
\h'575u'\v'1475u'\D'a288u -539u 287u 540u'
.sp -1
\h'568u'\v'899u'\D'e590u 84u'
.sp -1
\h'575u'\v'899u'\D'l0u 576u'
.sp -1
\h'863u'\v'1133u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fImagnetic\fP'u/2u'\s11\fImagnetic\fP\h'-\w'\s11\fImagnetic\fP'u/2u'
.sp -1
\h'863u'\v'1313u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIdisk\fP'u/2u'\s11\fIdisk\fP\h'-\w'\s11\fIdisk\fP'u/2u'
.sp -1
\h'1150u'\v'899u'\D'l0u 576u'
.sp -1
\v'950u'\D'l0u -86u'
.sp -1
\v'864u'\D'l432u 0u'
.sp -1
\h'432u'\v'864u'\D'l0u 86u'
.sp -1
\h'432u'\v'950u'\D'l-432u 0u'
.sp -1
\v'1037u'\D'l0u -87u'
.sp -1
\v'950u'\D'l432u 0u'
.sp -1
\h'432u'\v'950u'\D'l0u 87u'
.sp -1
\h'432u'\v'1037u'\D'l-432u 0u'
.sp -1
\v'1123u'\D'l0u -86u'
.sp -1
\v'1037u'\D'l432u 0u'
.sp -1
\h'432u'\v'1037u'\D'l0u 86u'
.sp -1
\h'432u'\v'1123u'\D'l-432u 0u'
.sp -1
\v'1210u'\D'l0u -87u'
.sp -1
\v'1123u'\D'l432u 0u'
.sp -1
\h'432u'\v'1123u'\D'l0u 87u'
.sp -1
\h'432u'\v'1210u'\D'l-432u 0u'
.sp -1
\v'1296u'\D'l0u -86u'
.sp -1
\v'1210u'\D'l432u 0u'
.sp -1
\h'432u'\v'1210u'\D'l0u 86u'
.sp -1
\h'432u'\v'1296u'\D'l-432u 0u'
.sp -1
\h'2339u'\v'1365u'\D'l0u -69u'
.sp -1
\h'2339u'\v'1296u'\D'l109u 0u'
.sp -1
\h'2448u'\v'1296u'\D'l0u 69u'
.sp -1
\h'2448u'\v'1365u'\D'l-109u 0u'
.sp -1
\h'2483u'\v'1365u'\D'l0u -69u'
.sp -1
\h'2483u'\v'1296u'\D'l109u 0u'
.sp -1
\h'2592u'\v'1296u'\D'l0u 69u'
.sp -1
\h'2592u'\v'1365u'\D'l-109u 0u'
.sp -1
\h'2627u'\v'1365u'\D'l0u -69u'
.sp -1
\h'2627u'\v'1296u'\D'l109u 0u'
.sp -1
\h'2736u'\v'1296u'\D'l0u 69u'
.sp -1
\h'2736u'\v'1365u'\D'l-109u 0u'
.sp -1
\h'2339u'\v'1475u'\D'l0u -70u'
.sp -1
\h'2339u'\v'1405u'\D'l109u 0u'
.sp -1
\h'2448u'\v'1405u'\D'l0u 70u'
.sp -1
\h'2448u'\v'1475u'\D'l-109u 0u'
.sp -1
\h'2483u'\v'1475u'\D'l0u -70u'
.sp -1
\h'2483u'\v'1405u'\D'l109u 0u'
.sp -1
\h'2592u'\v'1405u'\D'l0u 70u'
.sp -1
\h'2592u'\v'1475u'\D'l-109u 0u'
.sp -1
\h'2627u'\v'1475u'\D'l0u -70u'
.sp -1
\h'2627u'\v'1405u'\D'l109u 0u'
.sp -1
\h'2736u'\v'1405u'\D'l0u 70u'
.sp -1
\h'2736u'\v'1475u'\D'l-109u 0u'
.sp -1
\h'2304u'\v'1728u'\D'l0u -864u'
.sp -1
\h'2304u'\v'864u'\D'l576u 0u'
.sp -1
\h'2880u'\v'864u'\D'l0u 864u'
.sp -1
\h'2880u'\v'1728u'\D'l-576u 0u'
.sp -1
\h'2339u'\v'1008u'\D'l0u -69u'
.sp -1
\h'2339u'\v'939u'\D'l109u 0u'
.sp -1
\h'2448u'\v'939u'\D'l0u 69u'
.sp -1
\h'2448u'\v'1008u'\D'l-109u 0u'
.sp -1
\h'2483u'\v'1008u'\D'l0u -69u'
.sp -1
\h'2483u'\v'939u'\D'l109u 0u'
.sp -1
\h'2592u'\v'939u'\D'l0u 69u'
.sp -1
\h'2592u'\v'1008u'\D'l-109u 0u'
.sp -1
\h'2627u'\v'1008u'\D'l0u -69u'
.sp -1
\h'2627u'\v'939u'\D'l109u 0u'
.sp -1
\h'2736u'\v'939u'\D'l0u 69u'
.sp -1
\h'2736u'\v'1008u'\D'l-109u 0u'
.sp -1
\h'2771u'\v'1043u'\D'l0u -110u'
.sp -1
\h'2771u'\v'933u'\D'l69u 0u'
.sp -1
\h'2840u'\v'933u'\D'l0u 110u'
.sp -1
\h'2840u'\v'1043u'\D'l-69u 0u'
.sp -1
\h'2771u'\v'1187u'\D'l0u -110u'
.sp -1
\h'2771u'\v'1077u'\D'l69u 0u'
.sp -1
\h'2840u'\v'1077u'\D'l0u 110u'
.sp -1
\h'2840u'\v'1187u'\D'l-69u 0u'
.sp -1
\h'2771u'\v'1331u'\D'l0u -110u'
.sp -1
\h'2771u'\v'1221u'\D'l69u 0u'
.sp -1
\h'2840u'\v'1221u'\D'l0u 110u'
.sp -1
\h'2840u'\v'1331u'\D'l-69u 0u'
.sp -1
\h'2771u'\v'1475u'\D'l0u -110u'
.sp -1
\h'2771u'\v'1365u'\D'l69u 0u'
.sp -1
\h'2840u'\v'1365u'\D'l0u 110u'
.sp -1
\h'2840u'\v'1475u'\D'l-69u 0u'
.sp -1
\h'2339u'\v'1117u'\D'l0u -69u'
.sp -1
\h'2339u'\v'1048u'\D'l109u 0u'
.sp -1
\h'2448u'\v'1048u'\D'l0u 69u'
.sp -1
\h'2448u'\v'1117u'\D'l-109u 0u'
.sp -1
\h'2483u'\v'1117u'\D'l0u -69u'
.sp -1
\h'2483u'\v'1048u'\D'l109u 0u'
.sp -1
\h'2592u'\v'1048u'\D'l0u 69u'
.sp -1
\h'2592u'\v'1117u'\D'l-109u 0u'
.sp -1
\h'2627u'\v'1117u'\D'l0u -69u'
.sp -1
\h'2627u'\v'1048u'\D'l109u 0u'
.sp -1
\h'2736u'\v'1048u'\D'l0u 69u'
.sp -1
\h'2736u'\v'1117u'\D'l-109u 0u'
.sp -1
\h'2339u'\v'1221u'\D'l0u -69u'
.sp -1
\h'2339u'\v'1152u'\D'l109u 0u'
.sp -1
\h'2448u'\v'1152u'\D'l0u 69u'
.sp -1
\h'2448u'\v'1221u'\D'l-109u 0u'
.sp -1
\h'2483u'\v'1221u'\D'l0u -69u'
.sp -1
\h'2483u'\v'1152u'\D'l109u 0u'
.sp -1
\h'2592u'\v'1152u'\D'l0u 69u'
.sp -1
\h'2592u'\v'1221u'\D'l-109u 0u'
.sp -1
\h'2627u'\v'1221u'\D'l0u -69u'
.sp -1
\h'2627u'\v'1152u'\D'l109u 0u'
.sp -1
\h'2736u'\v'1152u'\D'l0u 69u'
.sp -1
\h'2736u'\v'1221u'\D'l-109u 0u'
.sp -1
\h'2339u'\v'1259u'\D'l395u 0u'
.sp -1
\h'2590u'\v'1565u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIExabyte 8mm\fP'u/2u'\s11\fIExabyte 8mm\fP\h'-\w'\s11\fIExabyte 8mm\fP'u/2u'
.sp -1
\h'2590u'\v'1673u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fItape jukebox\fP'u/2u'\s11\fItape jukebox\fP\h'-\w'\s11\fItape jukebox\fP'u/2u'
.sp -1
\h'789u'\v'1728u'\D'l0u 0u'
.sp -1
\h'789u'\v'1728u'\D'l0u 0u'
.sp -1
\h'789u'\v'1728u'\D'l0u 0u'
.sp -1
\h'789u'\v'1728u'\D'l0u 0u'
.sp -1
\h'432u'\v'501u'\D'l0u -144u'
.sp -1
\h'432u'\v'357u'\D'l1728u 0u'
.sp -1
\h'2160u'\v'357u'\D'l0u 144u'
.sp -1
\h'2160u'\v'501u'\D'l-1728u 0u'
.sp -1
\h'1294u'\v'143u'\D'l0u 216u'
.sp -1
\h'1280u'\v'301u'\D'l14u 58u'
.sp -1
\h'1309u'\v'301u'\D'l-15u 58u'
.sp -1
\h'1043u'\v'503u'\D'l-756u 360u'
.sp -1
\h'333u'\v'825u'\D'l-46u 38u'
.sp -1
\h'345u'\v'851u'\D'l-58u 12u'
.sp -1
\h'1150u'\v'503u'\D'l-287u 360u'
.sp -1
\h'888u'\v'809u'\D'l-25u 54u'
.sp -1
\h'910u'\v'827u'\D'l-47u 36u'
.sp -1
\h'1438u'\v'503u'\D'l288u 360u'
.sp -1
\h'1679u'\v'827u'\D'l47u 36u'
.sp -1
\h'1702u'\v'809u'\D'l24u 54u'
.sp -1
\h'1582u'\v'503u'\D'l864u 360u'
.sp -1
\h'2388u'\v'854u'\D'l58u 9u'
.sp -1
\h'2399u'\v'827u'\D'l47u 36u'
.sp -1
\h'1365u'\v'1077u'\D'l0u -69u'
.sp -1
\h'1365u'\v'1008u'\D'l507u 0u'
.sp -1
\h'1872u'\v'1008u'\D'l0u 69u'
.sp -1
\h'1872u'\v'1077u'\D'l-507u 0u'
.sp -1
\h'1941u'\v'1221u'\D'l0u -213u'
.sp -1
\h'1941u'\v'1008u'\D'l144u 0u'
.sp -1
\h'2085u'\v'1008u'\D'l0u 213u'
.sp -1
\h'2085u'\v'1221u'\D'l-144u 0u'
.sp -1
\h'1366u'\v'1151u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1187u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1222u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1259u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1294u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1331u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1366u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1403u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1438u'\D'l504u 0u'
.sp -1
\h'1366u'\v'1475u'\D'l504u 0u'
.sp -1
\h'1296u'\v'1797u'\D'l0u -933u'
.sp -1
\h'1296u'\v'864u'\D'l864u 0u'
.sp -1
\h'2160u'\v'864u'\D'l0u 933u'
.sp -1
\h'2160u'\v'1797u'\D'l-864u 0u'
.sp -1
\h'576u'\v'144u'\D'l0u -144u'
.sp -1
\h'576u'\D'l1440u 0u'
.sp -1
\h'2016u'\D'l0u 144u'
.sp -1
\h'2016u'\v'144u'\D'l-1440u 0u'
.sp -1
\h'1294u'\v'449u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fINew POSTGRES storage manager switch\fP'u/2u'\s11\fINew POSTGRES storage manager switch\fP\h'-\w'\s11\fINew POSTGRES storage manager switch\fP'u/2u'
.sp -1
\h'1331u'\v'269u'\h'-0.0m'\v'0.2m'\s11\fRAll data treated identically\fP
.sp -1
\h'2230u'\v'737u'\h'-0.0m'\v'0.2m'\s11\fRdata placement\fP
.sp -1
\h'215u'\v'1385u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fImain\fP'u/2u'\s11\fImain\fP\h'-\w'\s11\fImain\fP'u/2u'
.sp -1
\h'215u'\v'1493u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fImemory\fP'u/2u'\s11\fImemory\fP\h'-\w'\s11\fImemory\fP'u/2u'
.sp -1
\h'1726u'\v'1601u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fISony optical disk\fP'u/2u'\s11\fISony optical disk\fP\h'-\w'\s11\fISony optical disk\fP'u/2u'
.sp -1
\h'1726u'\v'1709u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIjukebox\fP'u/2u'\s11\fIjukebox\fP\h'-\w'\s11\fIjukebox\fP'u/2u'
.sp -1
\h'1294u'\v'89u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'\s11\fIPOSTGRES data manager\fP\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'
.sp -1
.sp 1+1797u
.PE
.sp
.ce
Figure 2: New \*(PG Storage Manager Architecture
.hl
.in -0.5i
.)z
The \*(PG data manager no longer communicates directly
with devices to read or write data.
Instead,
a new interface layer handles data placement
and retrieval.
.pp
Device managers currently exist for magnetic disk,
battery-backed main memory,
and the Sony optical disk jukebox.
Other devices that may be incorporated in the future
include rewritable optical disk jukeboxes
and helical scan tape jukeboxes.
The user may elect to locate a new relation on any of these devices.
Once the relation is assigned to a device manager,
its location is transparent to the user.
It is accessed in exactly the same way,
regardless of its placement.
.sh 2 "Using Criteria other than Time to Determine Data Placement"
.pp
Since archiving historical data is an important feature in \*(PG,
the vacuum cleaner uses tuple update times to move data from
a
.q current
relation into a
.q historical
one.
The original storage manager proposal declared that current
data would be stored on a fast device
(magnetic disk),
and historical data on some slower device.
The implicit assumption is that current data will
be accessed more frequently than historical data.
The desired policy is to make the most frequent accesses fastest.
.pp
The design proposed in [STON87] has several limitations.
First,
there exist databases from which no tuples ever get deleted.
Consider,
for example,
a library of books.
New volumes are purchased,
but old ones remain in the collection forever.
In this case,
there is no historical data,
and the current database state grows without bound.
.pp
Second,
the assumption that current data will be read more frequently
than historical data is not always correct.
For example,
accountants at the Internal Revenue Service audit tax returns
two years after they are filed.
In this case,
the best policy would be to place all two-year-old tax returns
on magnetic disk,
and returns that are younger or older on tertiary storage.
IRS employees could then more quickly choose two-year-old returns to audit.
.pp
Finally,
users may want to specify criteria other than time for
data placement.
For example,
consider a credit-card billing database.
Tuples for customers who make many purchases
should be stored in main memory,
to speed up billing updates on their accounts.
Tuples for less active customers might be stored on magnetic
disk.
The least active customers of all could have their tuples
migrated to an optical disk store,
since the extra cost of fetching their bills would not be
incurred very often.
.pp
In all of these examples,
the user is able to predict his access patterns reliably,
and can instruct the database system how data should be laid
out in order to minimize access times.
Current systems support caches intended to speed up accesses,
but these caches are not subject to user control.
.pp
The new storage manager
supports a more general model than its predecessor.
First,
the storage manager
does away with the notion of a fixed two-level data storage hierarchy.
New devices with new performance characteristics may be added
to the database system.
When new relations are created,
they may be located on any storage device,
and their archives may be located on any storage device.
Thus,
a relation's current data may be put on the Sony write-once jukebox
and its historical data on magnetic disk,
if desired.
This would make accesses to old data faster than accesses to new data.
.pp
Second,
using the \*(PG rule system [POTA92],
the user may create a logical relation that is a view
of many physical relations.
These physical relations may be located on different storage devices.
Further,
the user may specify rules that move data from one physical relation
to another in response to an update.
Thus the user may choose to partition a logical relation
based on an arbitrarily complex function of any of its attributes.
The same strategy applies to historical data:
the user may partition the single logical relation storing history
among several physical relations,
according to whatever policy he desires.
.pp
For example,
consider a relational schema for storing tuples on employees.
For each employee,
the system should record an employee number,
a name,
a salary,
and a scanned photograph for security purposes.
A conventional schema would be
.(E
EMP(id = int, name = text, salary = float,
    picture = image)
.)E
where
.CW image
is a type known to the database system,
and used to store pictures.
.pp
One problem with the schema above is that it stores all data
in every employee tuple on magnetic disk.
Since scanned photographs are likely to be very large,
storing them on magnetic disk will be expensive.
In addition,
it might be preferable to store salaries in memory,
so that functions that operate on them can run more quickly.
Since the user may have a good idea of the sizes and access
frequencies of columns in the
.CW EMP
table,
he should be able to partition tuples to reduce storage
costs and speed up common queries.
.pp
The new storage manager allows the user to declare a more flexible schema.
The example above can be rewritten as
.(E
ENAME (id = int, name = text) store = "magnetic disk"
ESAL (id = int, salary = float) store = "main memory"
EPHOTO (id = int, photo = image) store = "sony jukebox"
.)E
The original
.CW EMP
relation has been decomposed into three new physical relations.
The following \*(PG rule will reconstruct the logical
.CW EMP
relation on demand:
.(E
on retrieve to EMP do instead
    retrieve (n.id, n.name, s.salary, p.photo)
        from n in ENAME, s in ESAL, p in EPHOTO
        where n.id = s.id and s.id = p.id
.)E
In addition,
a rule may be declared to handle updates on the logical
.CW EMP
relation:
.(E
on append to EMP do instead
    append ENAME (id = new.id, name = new.name)
    append ESAL (id = new.id, salary = new.salary)
    append EPHOTO (id = new.id, photo = new.photo)
.)E
.pp
Thus,
the new storage manager eliminates the fixed two-level storage
hierarchy.
Data may be placed on a device in order to conserve space
(as in the case of
.CW EMP.photo )
or in order to speed up accesses
(as in the case of
.CW EMP.salary ).
Using the \*(PG rules system in conjunction with the new storage manager
gives the user considerable freedom to horizontally or vertically
partition data tables to achieve good performance while conserving
expensive storage space.
.sh 1 "ARCHITECTURE OF THE NEW POSTGRES STORAGE MANAGER"
.pp
This section describes the architecture of the new storage manager in detail.
It begins with the design changes that were necessary
in the \*(PG data manager,
and is followed by the new design.
.sh 2 "Changes to POSTGRES"
.pp
Before the design and installation of the new storage manager,
\*(PG contained hard-coded dependencies on magnetic disk.
These dependencies were manifested in the following ways.
.BU
The query optimizer used a cost function that assumed
the performance characteristics of magnetic disk for
data stored in the database.
.BU
The code that handled updates to the database assumed
that data blocks could be overwritten in place.
.BU
The query planner and executor used system calls on the \*(UX file system
to compute characteristics like
the length and modification times of files storing relations.
.pp
Tertiary storage has dramatically different performance
characteristics than magnetic disk.
For example,
throughput and seek times for optical disk are typically
an order of magnitude worse than those for magnetic disk.
Tape performs poorly under random access patterns.
Robotics introduce sharp discontinuities into the performance
curves of jukebox devices,
because moving a tape or disk from a shelf to a drive takes tens of
seconds.
.pp
Hard-coded dependencies on magnetic disk made it very hard
to add new storage devices to the old system.
Even if it were possible to store data on the new devices,
the query optimizer and executor would make poor decisions
for queries that accessed tertiary storage.
.pp
Thus,
the most important change to \*(PG was to extract
dependencies on magnetic disk.
First,
the operations that the data manager carried out on storage devices
were determined.
From these operations,
an abstraction was created that hid device differences.
These abstractions are discussed in section 3.3.
A set of interface routines was defined that supported this
abstraction.
These interface routines are listed in detail in section 3.4.
.sh 2 "The Storage Manager Switch"
.pp
The abstraction that hides device-specific characteristics from
the data manager is called the
.i "storage manager switch" .
The storage manager switch is similar
to the
.i cdevsw
and
.i bdevsw
interfaces of \*(UX [LEFF89]
and the user-supplied backing store interface in Mach [RASH88].
.pp
As discussed in the previous section,
installing the storage manager switch required
that the abstract operations required on the data store be identified.
For example,
the interface requires
routines for opening relations and instantiating particular
8kByte relation blocks.
.pp
Once the abstract interface was defined,
an appropriate structure
(the
.i "storage manager switch table" )
was compiled into the data manager.
This structure contains an entry for all existing device managers.
New device types are added by editing a source file,
adding appropriate entries to the storage manager switch table,
and recompiling the \*(PG database system.
.pp
Whenever a device operation is required,
the data manager calls the storage manager to carry it out.
Every relation is tagged in \*(PG with the device manager
on which it appears.
The storage manager identifies the appropriate device and routine in
the storage manager switch table,
and calls the device manager to execute the routine
on behalf of the data manager.
.pp
Data may be located on a particular device manager
at the granularity of a relation.
Physical relations may not be horizontally or vertically partitioned,
but as mentioned before,
the \*(PG rules system can be used to create a partitioned logical relation.
.sh 2 "Isolation of Device Differences"
.pp
New device managers are written to manage new types of devices.
For example,
main memory,
magnetic disk,
and the Sony optical disk jukebox
are all managed by different device managers.
To isolate the database system from the underlying devices,
device managers are required to accept and return data
in the format used internally by the data manager.
This means,
among other things,
that data pages from user relations must be accepted and returned
in units of 8kBytes.
.pp
Nevertheless,
there is no requirement that a device manager use the internal
representation when writing a page to persistent storage.
Data pages may be compressed,
broken into pieces,
or concatenated into larger units by a device manager,
as long as the original page can be reconstructed on demand.
.pp
In addition,
device managers may implement caches managed according
to an appropriate policy.
This means that the new \*(PG architecture supports a multi-level cache.
A main-memory cache of recently-used blocks is maintained in
shared memory by the buffer manager.
The buffer manager calls the storage manager to
migrate pages between shared memory and persistent storage.
A device manager may implement a separate cache for its own use,
subject to whatever policy it likes.
For example,
the Sony jukebox device manager maintains a cache of
.i extents
(contiguous collections of user data blocks)
on magnetic disk.
If there is locality of reference in accesses to a Sony jukebox relation,
then some requests may be satisfied without accessing the jukebox
device at all.
This cache will be described in more detail in section 4.2.
.sh 2 "The Storage Manager Switch Interface"
.pp
This section defines the interface presented by the storage manager switch.
Although the routines presented here are specific to \*(PG,
the strategy of isolating storage management from data management is general,
and similar routines should exist for conventional database management
systems.
.pp
Figure 3 shows the storage manager switch data structure,
.CW f_smgr .
.(z
.in +0.5i
.hl
.ft C
.nf
typedef struct f_smgr {
    int         (*smgr_init)();         /* may be NULL */
    int         (*smgr_shutdown)();     /* may be NULL */
    int         (*smgr_create)();
    int         (*smgr_unlink)();
    int         (*smgr_extend)();
    int         (*smgr_open)();
    int         (*smgr_close)();
    int         (*smgr_read)();
    int         (*smgr_write)();
    int         (*smgr_flush)();
    int         (*smgr_blindwrt)();
    int         (*smgr_nblocks)();
    int         (*smgr_commit)();       /* may be NULL */
    int         (*smgr_abort)();        /* may be NULL */
    int         (*smgr_cost)();
} f_smgr;
.ft
.fi
.sp
.ce
Figure 3: The \fCf_smgr\fP Storage Manager Switch Structure
.hl
.in -0.5i
.)z
This structure defines the interface routines
that all device managers must define.
When a new device type is added to \*(PG,
this structure is filled in with function pointers appropriate to the device,
and \*(PG is recompiled.
The storage manager calls these routines in response
to requests from the data manager.
In some cases,
the switch entry for an interface routine may be
.CW NULL .
This means that the corresponding operation is not defined for the
device in question.
Unless otherwise stated in the sections that follow,
all interface routines return an integer status code
indicating success or failure.
.pp
Table 1 summarizes the responsibilities
of these interface routines.
.(z
.in +0.5i
.ft R
.hl
.TS
.if \n+(b.=1 .nr d. \n(.c-\n(c.-1
.de 3f
.ps \n(.s
.vs \n(.vu
.in \n(.iu
.if \n(.u .fi
.if \n(.j .ad
.if \n(.j=0 .na
..
.nf
.nr #~ 0
.if \n(.T .if n .nr #~ 0.6n
.ds #d .d
.if \(ts\n(.z\(ts\(ts .ds #d nl
.fc
.nr 3d \n(.s
.rm 4g 4h
.nr 3e \n(.lu
.eo
.am 4h
.br
.di a+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Called at startup to allow initialization.
.br
.di
.nr a| \n(dn
.nr a- \n(dl
..
.ec \
.eo
.am 4h
.br
.di b+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Called at shutdown to allow graceful exit.
.br
.di
.nr b| \n(dn
.nr b- \n(dl
..
.ec \
.eo
.am 4h
.br
.di c+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Create relation \fIr\fP.
.br
.di
.nr c| \n(dn
.nr c- \n(dl
..
.ec \
.eo
.am 4h
.br
.di d+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Destroy relation \fIr\fP.
.br
.di
.nr d| \n(dn
.nr d- \n(dl
..
.ec \
.eo
.am 4h
.br
.di e+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Add a new block to the end of relation \fIr\fP
and fill it with data from \fIbuf\fP.
.br
.di
.nr e| \n(dn
.nr e- \n(dl
..
.ec \
.eo
.am 4h
.br
.di f+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Open relation \fIr\fP.
The relation descriptor includes
a pointer to the state for the open object.
.br
.di
.nr f| \n(dn
.nr f- \n(dl
..
.ec \
.eo
.am 4h
.br
.di g+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Close relation \fIr\fP.
.br
.di
.nr g| \n(dn
.nr g- \n(dl
..
.ec \
.eo
.am 4h
.br
.di h+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Read block \fIblock\fP of relation \fIr\fP into the buffer
pointed to by \fIbuf\fP.
.br
.di
.nr h| \n(dn
.nr h- \n(dl
..
.ec \
.eo
.am 4h
.br
.di i+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Write \fIbuf\fP as block \fIblock\fP of relation \fIr\fP.
.br
.di
.nr i| \n(dn
.nr i- \n(dl
..
.ec \
.eo
.am 4h
.br
.di j+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Write \fIbuf\fP synchronously as block \fIblock\fP of relation \fIr\fP.
.br
.di
.nr j| \n(dn
.nr j- \n(dl
..
.ec \
.eo
.am 4h
.br
.di k+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Do a ``blind write'' of buffer \fIbuf\fP as block \fIblk\fP
of the relation named \fIn\s-2\dr\u\s0\fP in the database named
\fIn\s-2\dd\u\s0\fP.
.br
.di
.nr k| \n(dn
.nr k- \n(dl
..
.ec \
.eo
.am 4h
.br
.di l+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Return the number of blocks in relation \fIr\fP.
.br
.di
.nr l| \n(dn
.nr l- \n(dl
..
.ec \
.eo
.am 4h
.br
.di m+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Force all dirty data to stable storage.
.br
.di
.nr m| \n(dn
.nr m- \n(dl
..
.ec \
.eo
.am 4h
.br
.di n+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Changes made during this transaction may be discarded.
.br
.di
.nr n| \n(dn
.nr n- \n(dl
..
.ec \
.eo
.am 4h
.br
.di o+
.3f
.ft \n(.f
.ll 3in
.if \n(.l<\n(4h .ll \n(4hu
.in 0
.na
.fi
Return in \fIc\fP the estimated cost of accessing
.i nblock
blocks and
.i ntuples
tuples of relation \fIr\fP.
.br
.di
.nr o| \n(dn
.nr o- \n(dl
..
.ec \
.3f
.nf
.ll \n(3eu
.nr 4g 0
.nr 3i \w\fIRoutine\fP
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_init()
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_shutdown()
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_create(\fIr\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_unlink(\fIr\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_extend(\fIr\fP, \fIbuf\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_open(\fIr\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_close(\fIr\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_read(\fIr\fP, \fIblock\fP, \fIbuf\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_write(\fIr\fP, \fIblock\fP, \fIbuf\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_flush(\fIr\fP, \fIblock\fP, \fIbuf\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_blindwrt(\fIn\s-2\dd\u\s0\fP, \fIn\s-2\dr\u\s0\fP, \fIi\s-2\dd\u\s0\fP, \fIi\s-2\dr\u\s0\fP, \fIblk\fP, \fIbuf\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_nblocks(\fIr\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_commit()
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_abort()
.if \n(4g<\n(3i .nr 4g \n(3i
.nr 3i \wsmgr_cost(\fIr\fP, \fInblocks\fP, \fIntuples\fP, \fIc\fP)
.if \n(4g<\n(3i .nr 4g \n(3i
.4g
.rm 4g
.nr 4h 0
.nr 3i \w\fIPurpose\fP
.if \n(4h<\n(3i .nr 4h \n(3i
.4h
.rm 4h
.nr 3i \n(a-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(b-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(c-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(d-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(e-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(f-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(g-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(h-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(i-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(j-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(k-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(l-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(m-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(n-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i \n(o-
.if \n(4h<\n(3i .nr 4h \n(3i
.nr 3i 3in
.if \n(4h<\n(3i .nr 4h \n(3i
.3f
.nf
.ll \n(3eu
.nr 3i 1n
.nr 4f 0
.nr 4a \n(4f+((2*\n(3i)/2)
.nr 4g +\n(4a
.nr 4b \n(4g+((6*\n(3i)/2)
.nr 4h +\n(4b
.nr TW \n(4h
.nr TW +((2*\n(3i)/2)
.if t .if \n(TW>\n(.lu .tm Table at line 1247 file Input is too wide - \n(TW units
.ne 16v+30p
.nr #I \n(.i
.in +(\n(.lu-\n(TWu-\n(.iu)/2u
.fc  
.nr #T 0-1
.nr #a 0-1
.nr #a 0-1
.eo
.de T#
.nr 3f 1m
.ds #d .d
.if \(ts\n(.z\(ts\(ts .ds #d nl
.mk ##
.nr ## -1v
.ls 1
.if \n(#T>=0 .nr #a \n(#T
.if \n(T. .vs \n(.vu-\n(.sp
.if \n(T. \v'-1p'\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0\v'2p'\h'|0'\h'-1p'\s\n(3d\l'|\n(TWu+1p\(ul'\s0\v'-1p'
.if \n(T. .vs
.if \n(#a>=0 .sp -1
.if \n(#a>=0 \h'|0'\h'-1p'\v'1p'\s\n(3d\h'-\n(#~u'\L'|\n(#au-1v-2p'\s0\v'\n(\*(#du-\n(#au+1v+1p'\h'2p'\v'-1p'\s\n(3d\h'-\n(#~u'\L'|\n(#au-1v+2p'\s0\v'\n(\*(#du-\n(#au+1v-1p'\h'|\n(TWu'
.if \n(#a>=0 .sp -1
.if \n(#a>=0 \h'(|\n(4bu+|\n(4gu)/2u'\v'-1p'\s\n(3d\h'-\n(#~u'\L'|\n(#au-1v+2p'\s0\v'\n(\*(#du-\n(#au+1v-1p'\h'|\n(TWu'
.if \n(#a>=0 .sp -1
.if \n(#a>=0 \h'|\n(TWu'\h'-1p'\v'-1p'\s\n(3d\h'-\n(#~u'\L'|\n(#au-1v+2p'\s0\v'\n(\*(#du-\n(#au+1v-1p'\h'2p'\v'1p'\s\n(3d\h'-\n(#~u'\L'|\n(#au-1v-2p'\s0\v'\n(\*(#du-\n(#au+1v+1p'
.ls
..
.ec
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\v'-1p'\h'|0'\h'-1p'\s\n(3d\l'|\n(TWu+1p\(ul'\s0\v'2p'\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0\v'-1p'
.vs \n(3gu
.mk #a
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'\fIRoutine\fP\h'|\n(4bu'\fIPurpose\fP
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\v'-1p'\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0\v'2p'\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0\v'-1p'
.vs \n(3gu
.ne \n(a|u+\n(.Vu
.if (\n(a|+\n(#^-1v)>\n(#- .nr #- +(\n(a|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_init()\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.a+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(b|u+\n(.Vu
.if (\n(b|+\n(#^-1v)>\n(#- .nr #- +(\n(b|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_shutdown()\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.b+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(c|u+\n(.Vu
.if (\n(c|+\n(#^-1v)>\n(#- .nr #- +(\n(c|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_create(\fIr\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.c+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(d|u+\n(.Vu
.if (\n(d|+\n(#^-1v)>\n(#- .nr #- +(\n(d|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_unlink(\fIr\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.d+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(e|u+\n(.Vu
.if (\n(e|+\n(#^-1v)>\n(#- .nr #- +(\n(e|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_extend(\fIr\fP, \fIbuf\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.e+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(f|u+\n(.Vu
.if (\n(f|+\n(#^-1v)>\n(#- .nr #- +(\n(f|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_open(\fIr\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.f+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(g|u+\n(.Vu
.if (\n(g|+\n(#^-1v)>\n(#- .nr #- +(\n(g|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_close(\fIr\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.g+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(h|u+\n(.Vu
.if (\n(h|+\n(#^-1v)>\n(#- .nr #- +(\n(h|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_read(\fIr\fP, \fIblock\fP, \fIbuf\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.h+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(i|u+\n(.Vu
.if (\n(i|+\n(#^-1v)>\n(#- .nr #- +(\n(i|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_write(\fIr\fP, \fIblock\fP, \fIbuf\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.i+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(j|u+\n(.Vu
.if (\n(j|+\n(#^-1v)>\n(#- .nr #- +(\n(j|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_flush(\fIr\fP, \fIblock\fP, \fIbuf\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.j+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(k|u+\n(.Vu
.if (\n(k|+\n(#^-1v)>\n(#- .nr #- +(\n(k|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_blindwrt(\fIn\s-2\dd\u\s0\fP, \fIn\s-2\dr\u\s0\fP, \fIi\s-2\dd\u\s0\fP, \fIi\s-2\dr\u\s0\fP, \fIblk\fP, \fIbuf\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.k+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.sp 0.1v
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(l|u+\n(.Vu
.if (\n(l|+\n(#^-1v)>\n(#- .nr #- +(\n(l|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_nblocks(\fIr\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.l+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(m|u+\n(.Vu
.if (\n(m|+\n(#^-1v)>\n(#- .nr #- +(\n(m|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_commit()\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.m+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(n|u+\n(.Vu
.if (\n(n|+\n(#^-1v)>\n(#- .nr #- +(\n(n|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_abort()\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.n+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.nr 3g \n(.v
.vs \n(.vu-\n(.sp
\h'|0'\h'1p'\s\n(3d\l'|\n(TWu-1p\(ul'\s0
.vs \n(3gu
.ne \n(o|u+\n(.Vu
.if (\n(o|+\n(#^-1v)>\n(#- .nr #- +(\n(o|+\n(#^-\n(#--1v)
.ta \n(4gu \n(4hu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'smgr_cost(\fIr\fP, \fInblocks\fP, \fIntuples\fP, \fIc\fP)\h'|\n(4bu'
.mk ##
.nr 3b \n(##
.sp |\n(##u-1v
.nr 3h \n(4bu
.in +\n(3hu
.o+
.in -\n(3hu
.mk 3c
.if \n(3c>\n(3b .nr 3b \n(3c
.sp |\n(3bu
.fc
.nr T. 1
.T# 1
.in \n(#Iu
.3f
.nr #a 0
.rm a+
.rm b+
.rm c+
.rm d+
.rm e+
.rm f+
.rm g+
.rm h+
.rm i+
.rm j+
.rm k+
.rm l+
.rm m+
.rm n+
.rm o+
.TE
.if \n-(b.=0 .nr c. \n(.c-\n(d.-106
.ft B
.sp 0.5v
.ce
Table 1: The \*(PG Storage Manager Switch Interface
.ft
.hl
.in -0.5i
.)z
The rest of this section gives more detailed descriptions.
.sh 3 "smgr_init"
.pp
The routine
.(E
int
smgr_init()
.)E
is called when \*(PG starts up,
and should initialize any private data used by a device manager.
In addition,
the first call to this routine should initialize any shared
data.
Typically,
this is done by noticing whether shared data exists,
and creating and initializing it if it does not.
If a particular device manager requires no initialization,
it may define this routine to be NULL.
.sh 3 "smgr_shutdown"
.pp
The routine
.(E
int
smgr_shutdown()
.)E
is analogous to
.CW smgr_init() ,
but is called when \*(PG exits cleanly.
Any necessary cleanup may be done here.
If no work is required,
this routine may be NULL.
.sh 3 "smgr_create"
.pp
The routine to create new relations on a particular device manager is
.(E
int
smgr_create(reln)
    Relation reln;
.)E
The
.CW reln
argument points to a \*(PG relation descriptor for the relation to
be created.
The device manager should initialize whatever storage structure
is required for the new relation.
For example,
the magnetic disk device manager creates an empty file with the
name of the relation.
.sh 3 "smgr_unlink"
.pp
The routine
.(E
int
smgr_unlink(reln)
    Relation reln;
.)E
is called by the storage manager to destroy the relation described by
.CW reln .
Device managers may take whatever action is appropriate.
For example,
the magnetic disk device manager removes the relation's associated
file and archive, and reclaims the disk blocks they occupy.
Since the Sony write-once drive cannot reclaim blocks once they are used,
the Sony jukebox device manager simply marks the relation as
destroyed and returns.
.sh 3 "smgr_extend"
.pp
When a relation must be extended with a new data block,
the data manager calls
.(E
int
smgr_extend(reln, buffer)
    Relation reln;
    char *buffer;
.)E
to do the work.
The buffer is an 8kByte buffer in the format used by the data manager
internally.
The data manager assumes that 8kByte data blocks in relations are numbered
starting from one.
Thus
.CW smgr_extend()
must logically allocate a new block for the relation pointed to by
.CW reln
and store some representation of
.CW buffer
there.
.sh 3 "smgr_open"
.pp
The routine
.(E
int
smgr_open(reln)
    Relation reln;
.)E
should open the relation pointed to by
.CW reln
and return a file descriptor for it.
The notion of file descriptors is a holdover from
the pre-storage manager switch days,
and is no longer necessary.
If file descriptors make no sense for a given device,
then the associated device manager may return any non-negative value.
A negative return value indicates an error.
.sh 3 "smgr_close"
.pp
When the data manager is finished with a relation,
it calls
.(E
int
smgr_close(reln)
    Relation reln;
.)E
The device manager may release resources associated with the
relation pointed to by
.CW reln .
.sh 3 "smgr_read"
.pp
To instantiate an 8kByte data block from a relation,
the data manager calls
.(E
int
smgr_read(reln, blocknum, buffer)
    Relation reln;
    BlockNumber blocknum;
    char *buffer;
.)E
As stated above,
8kByte blocks in a relation are logically numbered from one.
The storage manager must locate the block of interest
and load its contents into
.CW buffer .
.sh 3 "smgr_write"
.pp
The routine
.(E
int
smgr_write(reln, blocknum, buffer)
    Relation reln;
    BlockNumber blocknum;
    char *buffer;
.)E
writes some representation of
.CW buffer
as block number
.CW blocknum
of the relation pointed to by
.CW reln .
This write may be asynchronous;
the device manager need not guarantee
that the buffer is flushed through to stable storage.
Synchronous writes are handled using the
.CW smgr_flush
routine,
described below.
As long as this routine
guarantees that the buffer has been copied somewhere safe
for eventual writing,
it may return successfully.
.pp
The buffer is an 8kByte \*(PG page in the format used
by the data manager.
It may be stored in any form,
but must be reconstructed exactly when the
.CW smgr_read
routine is called on it.
.sh 3 "smgr_flush"
.pp
This routine synchronously writes a block to stable storage.
The \*(PG data manager almost never needs to do synchronous writes
of single blocks.
In some cases,
however,
like the write of the transaction status file that marks a transaction
committed,
a single block must be written synchronously.
In this case,
the data manager calls the device manager's
.CW smgr_flush
routine.
The interface for this routine is
.(E
int
smgr_flush(reln, blocknum, buffer)
    Relation reln;
    BlockNumber blocknum;
    char *buffer;
.)E
The behavior of
.CW smgr_flush
is almost exactly the same as that of
.CW smgr_write ,
except that
.CW smgr_flush
must not return until
.CW buffer
is safely written to stable storage.
.sh 3 "smgr_blindwrt"
.pp
Under normal circumstances,
\*(PG moves dirty pages from memory to stable storage
by calling the
.CW smgr_write
routine for the appropriate storage manager.
In order to call
.CW smgr_write ,
the buffer manager must construct and pass a relation descriptor.
.pp
In certain cases,
\*(PG is unable to construct the relation descriptor
for a dirty buffer that must be evicted from the shared buffer cache.
This can happen,
for example,
when the relation descriptor has just been created in another process,
and has not yet been committed to the database.
In such a case,
\*(PG must write the buffer without constructing a relation descriptor.
.pp
The buffer manager can detect this case,
and handles it using a strategy called
.q "blind writes."
When a blind write must be done,
the buffer manager determines the name and object id of the database,
the name and object id of the relation,
and the block number of the buffer to be written.
All of this information is stored with the buffer in the buffer cache.
The device manager is responsible for determining,
from this information alone,
where the buffer must be written.
This means,
in general,
that every device manager must use some subset of these
values as a unique key for finding the proper location
to write a data block.
.pp
The interface for this routine is
.(E
int
smgr_blindwrt(dbname, relname, dbid, relid, blkno,
              buffer)
    Name dbname;
    Name relname;
    ObjectId dbid;
    ObjectId relid;
    BlockNumber blkno;
    char *buffer;
.)E
.pp
It is expected that most device managers will write data much
more efficiently using
.CW smgr_write
than
.CW smgr_blindwrt .
.sh 3 "smgr_nblocks"
.pp
The routine
.(E
int
smgr_nblocks(reln)
    Relation reln;
.)E
should return the number of blocks stored for the relation
pointed to by
.CW reln .
The number of blocks is used
during query planning and execution.
.pp
Since this is a potentially expensive operation,
the storage manager maintains a shared-memory cache of relation sizes
recently computed by device managers.
.sh 3 "smgr_commit"
.pp
When a transaction finishes,
the data manager calls the
.CW smgr_commit
routine for every device manager.
This routine must flush any pending asynchronous writes for this
transaction to disk,
so that all relation data is on persistent storage before
the transaction is marked committed.
In addition,
the device manager can release resources or update its state
appropriately.
.pp
This routine may be NULL,
if there is no work to do for a device manager at transaction commit.
If it is supplied,
its interface is
.(E
int
smgr_commit()
.)E
.pp
There is one important issue regarding transaction commit
that should also be mentioned.
Much research has been done in database systems to support
.i distribution ,
or databases that span multiple computers in a network.
Although \*(PG is not a distributed database system,
it does support
.i "data distribution"
by means of the storage manager switch
and device managers.
.pp
Data distribution means that different tables,
or parts of tables,
may reside on storage devices connected to different computers
on a network.
\*(PG does all data manipulation at a single site,
but the data it works with may be distributed.
The data manager will call the device manager,
which can operate on a local device,
or use the network to satisfy requests on a remote device.
In fact,
the Sony jukebox device manager works correctly whether it runs
on the host computer to which the jukebox is connected,
or on some other computer on the Internet.
.pp
At transaction commit time,
the data manager informs each device manager that all its
data must be on stable storage.
Each device manager flushes any pending writes synchronously,
and returns.
Once all device managers have forced data to stable storage,
the data manager synchronously records that the transaction
has committed by writing a ten-byte record to the
.i "transaction status file"
on one of the storage devices.
If the system crashes before the commit record is stored
in the transaction status file,
\*(PG will consider the transaction to be aborted when
the transaction's changes are encountered later.
.pp
Thus the
.CW smgr_commit
interface supports data distribution without
using a traditional multi-phase commit protocol ([KIM84], [SKEE81]).
In particular,
device managers are not notified whether a transaction actually
commits,
once they have forced data to stable storage.
.sh 3 "smgr_abort"
.pp
If a transaction aborts,
the data manager calls
.CW smgr_abort
in place of
.CW smgr_commit .
In this case,
it does not matter if changes made to user relations
during the aborted transaction are reflected on stable storage.
Device managers may deschedule pending writes,
as long as the writes include no data from other transactions.
.pp
This routine may be NULL,
if there is no work to do for a device manager at transaction abort.
.sh 3 "smgr_cost"
.pp
Query optimization requires cost estimates for different access paths
on relations [SELI79].
Since the cost function depends on the performance characteristics
of the device storing the relation,
each device manager must provide a cost estimator.
.pp
The interface is
.(E
int
smgr_cost(reln, nblocks, ntuples, costOutP)
    Relation reln;
    int nblocks;
    int ntuples;
    double *costOutP;
.)E
This function should compute an estimate of the cost of
instantiating
.CW nblocks
blocks
and
.CW ntuples
tuples from the relation pointed to by
.CW reln .
The
.CW costOutP
parameter is a pointer to a location for the returned cost.
An example cost function for a tertiary storage device is given
in section 4.
.sh 2 "Architecture Summary"
.pp
The new storage manager architecture
assigns a particular device manager to manage each device,
or class of devices,
available to the database system.
A well-defined set of interface routines makes relation accesses
device-transparent.
.pp
The new architecture uses a
.i "storage manager switch"
to record the interface routines for all the device managers
known to the system.
An earlier version of this architecture
stored these interface routines in a database relation,
and made the set of device managers dynamically extensible.
All existing device managers,
however,
use \*(PG shared memory and spinlocks.
Since existing device managers could not be developed outside
the \*(PG data manager,
the mechanism for doing so was eliminated.
.pp
So far,
three device managers exist.
The magnetic disk device manager is only 440 lines of C code,
including comments.
The main memory device manager is 600 lines,
most of which is a dynamically-extensible hashing package
to make block lookups fast.
The most complicated device manager by far \*-
the Sony optical disk write-once jukebox device manager \*-
is less than 3000 lines of code.
This complexity is largely due to the fast recovery and aggressive
sharing techniques used.
The design and implementation of the Sony jukebox device manager
are described in the next section.
.sh 1 "IMPLEMENTATION OF THE SONY JUKEBOX DEVICE MANAGER"
.pp
The tertiary storage device currently supported by \*(PG
is a jukebox of 50 double-sided write-once platters with
a total capacity of 327GByte.
A robotic arm moves platters between a bank of shelves and
two read/write decks.
The device is connected via a SCSI bus to a host computer.
A complete description of the hardware can be found in [SONY89a]
and [SONY89b].
.pp
\*(PG uses this device via a client/server protocol described in
section 4.1.
The Sony jukebox device manager caches optical disk blocks
on magnetic disk for faster access;
the cache management policy,
and some important issues in the cache design,
are described in section 4.2.
Section 4.3 describes the various shared-memory caches
used by the jukebox device manager.
Sections 4.4, 4.5, and 4.6 respectively discuss issues raised by the
write-once properties of the device,
how commits and aborts are handled,
and the cost function provided in
.CW sj_cost .
.sh 2 "POSTGRES Access to the Sony Jukebox"
.pp
The Sony jukebox must be physically connected to a host computer
over a SCSI bus.
This host computer
is not necessarily the computer on which users want to run \*(PG.
In order to make the device accessible from other computers,
a server on the host accepts requests to mount platters
and to read and write data over the network [MCFA91].
The network connection uses TCP/IP over BSD sockets [LEFF84].
.pp
The server manages a queue of requests from multiple clients,
so \*(PG is not necessarily the only user of the jukebox.
When requests on more than two platters appear in the request queue,
the server provides the illusion that all are mounted simultaneously
by moving platters among the two drives and shelves as necessary.
The request queue managed by the server is completely external to \*(PG,
so no information about the queue's length or contents can be used
when optimizing queries.
A future version of the jukebox server will make
queue status available to clients.
This will allow \*(PG to estimate query costs more accurately
than it does now
(see section 4.6 for more details),
and will result in better query optimization.
.pp
Figure 4 shows the jukebox server architecture.
In this figure,
there are three clients of the jukebox server;
\*(PG and two others.
.(z
.in +0.5i
.hl
... 0.24 7.51 4.99 10.26 0 0 4096 4096
... 0u 1584u 2736u 0u -137u 5910u 2359158u -2353385u
.PS 1584 2736 
.br
.ps 11
\h'1208u'\v'920u'\D'l-58u 15u'
.sp -1
\h'1208u'\v'949u'\D'l-58u -14u'
.sp -1
\h'1150u'\v'935u'\D'l720u 0u'
.sp -1
\h'1813u'\v'949u'\D'l57u -14u'
.sp -1
\h'1813u'\v'920u'\D'l57u 15u'
.sp -1
\h'2109u'\v'321u'\D'l49u -34u'
.sp -1
\h'2099u'\v'294u'\D'l59u -7u'
.sp -1
\h'2158u'\v'287u'\D'l-1151u 432u'
.sp -1
\h'1056u'\v'685u'\D'l-49u 34u'
.sp -1
\h'1066u'\v'712u'\D'l-59u 7u'
.sp -1
\h'1401u'\v'333u'\D'l37u -46u'
.sp -1
\h'1384u'\v'310u'\D'l54u -23u'
.sp -1
\h'1438u'\v'287u'\D'l-575u 432u'
.sp -1
\h'900u'\v'673u'\D'l-37u 46u'
.sp -1
\h'918u'\v'696u'\D'l-55u 23u'
.sp -1
\h'475u'\v'327u'\D'l-44u -40u'
.sp -1
\h'451u'\v'343u'\D'l-20u -56u'
.sp -1
\h'431u'\v'287u'\D'l288u 432u'
.sp -1
\h'675u'\v'679u'\D'l44u 40u'
.sp -1
\h'699u'\v'663u'\D'l20u 56u'
.sp -1
\h'432u'\v'1152u'\D'l0u -432u'
.sp -1
\h'432u'\v'720u'\D'l720u 0u'
.sp -1
\h'1152u'\v'720u'\D'l0u 432u'
.sp -1
\h'1152u'\v'1152u'\D'l-720u 0u'
.sp -1
\h'1941u'\v'933u'\D'l0u -69u'
.sp -1
\h'1941u'\v'864u'\D'l507u 0u'
.sp -1
\h'2448u'\v'864u'\D'l0u 69u'
.sp -1
\h'2448u'\v'933u'\D'l-507u 0u'
.sp -1
\h'2517u'\v'1077u'\D'l0u -219u'
.sp -1
\h'2517u'\v'858u'\D'l144u 0u'
.sp -1
\h'2661u'\v'858u'\D'l0u 219u'
.sp -1
\h'2661u'\v'1077u'\D'l-144u 0u'
.sp -1
\h'1942u'\v'1007u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1043u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1079u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1115u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1151u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1187u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1223u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1259u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1295u'\D'l504u 0u'
.sp -1
\h'1942u'\v'1331u'\D'l504u 0u'
.sp -1
\h'1872u'\v'288u'\D'l0u -144u'
.sp -1
\h'1872u'\v'144u'\D'l576u 0u'
.sp -1
\h'2448u'\v'144u'\D'l0u 144u'
.sp -1
\h'2448u'\v'288u'\D'l-576u 0u'
.sp -1
\h'1152u'\v'288u'\D'l0u -144u'
.sp -1
\h'1152u'\v'144u'\D'l576u 0u'
.sp -1
\h'1728u'\v'144u'\D'l0u 144u'
.sp -1
\h'1728u'\v'288u'\D'l-576u 0u'
.sp -1
\h'357u'\v'144u'\D'l0u 0u'
.sp -1
\h'357u'\v'144u'\D'l0u 0u'
.sp -1
\h'357u'\v'144u'\D'l0u 0u'
.sp -1
\h'357u'\v'144u'\D'l0u 0u'
.sp -1
\v'288u'\D'l0u -288u'
.sp -1
\D'l1008u 0u'
.sp -1
\h'1008u'\D'l0u 288u'
.sp -1
\h'1008u'\v'288u'\D'l-1008u 0u'
.sp -1
\h'1872u'\v'1584u'\D'l0u -864u'
.sp -1
\h'1872u'\v'720u'\D'l864u 0u'
.sp -1
\h'2736u'\v'720u'\D'l0u 864u'
.sp -1
\h'2736u'\v'1584u'\D'l-864u 0u'
.sp -1
\h'2627u'\v'212u'\h'-0.0m'\v'0.2m'\h'-\w'\s24\fR. . .\fP'u/2u'\s24\fR. . .\fP\h'-\w'\s24\fR. . .\fP'u/2u'
.sp -1
\h'503u'\v'161u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'\s11\fIPOSTGRES data manager\fP\h'-\w'\s11\fIPOSTGRES data manager\fP'u/2u'
.sp -1
\h'1438u'\v'233u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIclient\fP'u/2u'\s11\fIclient\fP\h'-\w'\s11\fIclient\fP'u/2u'
.sp -1
\h'2158u'\v'233u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIclient\fP'u/2u'\s11\fIclient\fP\h'-\w'\s11\fIclient\fP'u/2u'
.sp -1
\h'791u'\v'845u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIjukebox server\fP'u/2u'\s11\fIjukebox server\fP\h'-\w'\s11\fIjukebox server\fP'u/2u'
.sp -1
\h'791u'\v'989u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIprocess\fP'u/2u'\s11\fIprocess\fP\h'-\w'\s11\fIprocess\fP'u/2u'
.sp -1
\h'1510u'\v'1025u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRSCSI bus\fP'u/2u'\s11\fRSCSI bus\fP\h'-\w'\s11\fRSCSI bus\fP'u/2u'
.sp -1
\h'1798u'\v'485u'\h'-0.0m'\v'0.2m'\s11\fRTCP/IP connections over\fP
.sp -1
\h'1798u'\v'593u'\h'-0.0m'\v'0.2m'\s11\fRnetwork\fP
.sp -1
\h'2302u'\v'1421u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fISony optical disk\fP'u/2u'\s11\fISony optical disk\fP\h'-\w'\s11\fISony optical disk\fP'u/2u'
.sp -1
\h'2302u'\v'1529u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fIjukebox\fP'u/2u'\s11\fIjukebox\fP\h'-\w'\s11\fIjukebox\fP'u/2u'
.sp -1
.sp 1+1584u
.PE
.sp
.ce
Figure 4: Sony Jukebox Client/Server Process Architecture
.hl
.in -0.5i
.)z
These others are not clients of \*(PG;
they are processes that have their own data stored on optical
platters,
which they manage themselves.
Clients of the database system connect to \*(PG,
which manages relational data stored on platters for them.
.pp
The jukebox server uses platter names and ownership to provide
some data protection,
and to keep users other than \*(PG from reading or writing \*(PG platters.
\*(PG must be the only user of platters that it owns,
so that it can handle block allocation cleanly.
.sh 2 "Cache Design"
.pp
Transfers from the jukebox to the database system are
expensive to set up
and slow to execute.
This is because the device itself is slow,
and because there may be a network between the database system
and the storage device.
In order to mask some of the latency,
the Sony jukebox device manager caches blocks from optical disk
on magnetic disk.
This cache is used for both reads and writes;
writes are flushed to the Sony jukebox in LRU order
when dirty cache entries are reclaimed.
This section describes the cache's structure,
contents,
and replacement policy.
.sh 3 "Persistent Magnetic Disk Cache"
.pp
The magnetic disk cache of optical disk blocks is
.i persistent .
This means that its contents are valid across executions of \*(PG.
Logically,
data stored either in the cache or on an optical platter
is considered to be on the Sony jukebox.
This is distinct from write-through techniques;
writes to the magnetic disk cache must survive system crashes,
but need not be flushed immediately to an optical disk platter.
.pp
There were two motivations for this decision.
First,
a persistent cache means that the cache need not be populated
when a data manager starts up,
and simplifies issues of sharing,
since all instantiations of \*(PG use the same cache.
Cache population is extremely expensive,
since transfers from the jukebox are so slow.
If there is any locality of reference between executions
of the database system,
then the persistent cache should speed up queries that
use the jukebox.
.pp
The second motivation for persistence
is to allow \*(PG to delay
writes to the jukebox device
for as long as possible,
to avoid paying unnecessary transfer costs.
Data written to the jukebox device manager
first hits the magnetic disk cache.
If the cache is persistent,
then it need not be write-through.
Blocks may remain in the magnetic disk cache,
since they will be found by subsequent invocations of \*(PG.
.pp
Persistence introduces substantial complexity into cache management.
For reasons described in section 4.2.2,
user blocks are stored as contiguous chunks of data in the
magnetic disk cache,
and cache metadata is stored separately.
Since individual writes to the cache are very large,
atomicity of writes is a problem.
Since the metadata and the data it describes are stored separately,
the atomicity problem is exacerbated.
.pp
In general,
the problem is that the cache data and metadata must be written
out separately.
If one,
but not both,
of the writes completes before a system crash,
then the cache is in an inconsistent state.
This state must be detected and repaired before the
associated data is used.
.pp
To solve this problem,
the Sony jukebox device manager uses a
.i "link token"
technique like the one described in [SULL92].
The cache data and its associated metadata are tagged with a unique number,
which is generated by the \*(PG object id service.
Whenever a cache entry is used,
the metadata and data are checked to be sure that they match.
If not,
then the cache's internal locking scheme guarantees that the inconsistency
was caused by a crash, and the correct state must be recovered.
.pp
The recovery strategy is to throw away both the cache data and metadata
and reinstate it from the optical disk jukebox.
The order in which cache operations are performed
guarantees that any inconsistent data in the cache
is either present on the Sony jukebox,
or is new data from a transaction in progress when the
system crashed.
In either case,
the inconsistent cache data may be discarded.
If it is on the jukebox,
then it may be reinstated later.
If it was new data from a transaction in progress at the
time of a system crash,
then it is invalid and need not be recovered.
.pp
This cache management strategy supports fast recovery,
an important feature of \*(PG [STON87], [STON90].
Since cache inconsistencies are discovered and repaired on first use,
no special recovery code needs to run when \*(PG restarts
after a crash.
.sh 3 "Device Manager Block Allocation and Caching"
.pp
The Sony jukebox device manager allocates
space to relations in units of
.i extents .
The extent size is configurable when \*(PG is compiled.
The current implementation uses 32 8kByte blocks,
or 256kBytes of user data.
When a new block is requested for a relation
(via
.CW smgr_extend )
and there is no existing extent with free space,
a new extent is allocated for it.
Subsequent
.CW smgr_extend
calls consume subsequent blocks in the new extent.
.pp
Large extents were chosen for several reasons.
First,
they favor sequential access patterns by guaranteeing
contiguity of user data.
Second,
large extents favor large transfers,
and large transfers are more efficient than smaller ones.
This is due to the high cost of setting up a transfer
on the Sony optical disk jukebox.
Finally,
the device manager maintains an index to locate extents quickly.
A single index entry can point at an entire extent,
so large extents require fewer index entries than small ones,
saving space.
.pp
Whenever possible,
the Sony jukebox device manager attempts to locate multiple extents
for a single relation on the same side of the same platter.
If the platter is full,
the next choice is to locate new extents on a new platter.
The last choice is to put them on the reverse side of the same platter.
This is because reading from one side of a platter,
flipping it over,
and reading from the other side is more expensive
than reading from one side of two different platters.
Only one side of a single platter may be mounted at a time.
Since there are two read/write decks in the jukebox,
two separate platters can be mounted simultaneously.
.pp
The current implementation
moves user data between the Sony jukebox and the magnetic
disk cache in units of full extents.
The transfer size is configurable,
and can be any size less than or equal to a full extent.
Full extents favor locality of reference.
The next section describes the replacement policy for
cached extents,
and addresses some issues raised by the write-once nature of the jukebox.
.sh 3 "Replacement Policy for Cached Extents"
.pp
Extents stored in the magnetic disk cache are flushed
to the Sony jukebox in least-recently-used order.
When space in the cache is required for a new extent,
the oldest unused extent is evicted.
If the extent is dirty
(that is,
some block in it has been modified since the extent entered the cache),
it must be written to the Sony jukebox.
Clean extents may be evicted without being written.
.pp
The important complications in this scheme come about
because of the write-once nature of the Sony optical disk jukebox.
If only part of an extent has been allocated and written since it
appeared in the cache,
then only the parts that contain new data should be written to the jukebox.
The unused part of the extent must be left free on the jukebox
for use later.
This requires the device manager to do transfers in units
smaller than complete extents
under some circumstances.
.pp
A worse problem is that the magnetic disk cache
may not always contain the most recent version of an extent.
This situation can come about because the \*(PG buffer manager
has a cache of relation blocks in shared memory.
If some dirty blocks in the shared memory cache have
not been written to the magnetic disk cache,
then incorrect or incomplete blocks may be written to the jukebox.
Once written,
these blocks can never be replaced.
Thus the jukebox device manager must guarantee that it has the
latest copies of all the blocks in the extent before
it attempts to write them to the Sony jukebox.
.pp
These two problems have distinct solutions.
Writing partial extents to the jukebox is straightforward;
among the information stored in the magnetic disk cache metadata
is a map of the dirty pages in an extent,
so only those pages need to be written.
Partial writes are less efficient than full-extent writes,
but are only necessary if cached extents must be forced out
before they are filled.
In the long term,
extents should fill up with data,
so reads will eventually transfer whole extents.
.pp
Locating the latest version of every block in an extent
required a change to the \*(PG buffer manager.
A new interface routine was added to the code that maintains
the shared memory buffer cache.
.(E
int
DirtyBufferCopy(reln, blocknum, buf)
    Relation reln;
    BlockNumber blocknum;
    char *buf;
.)E
If block
.CW blocknum
of relation
.CW reln
appears in the buffer pool and is dirty,
then it is copied to
.CW buf ,
marked clean,
and
.CW DirtyBufferCopy
returns
.CW TRUE .
Otherwise,
.CW DirtyBufferCopy
just returns
.CW FALSE .
.pp
Just before writing a dirty segment to optical disk,
the Sony jukebox device manager calls
.CW DirtyBufferCopy
once for each buffer that appears in the extent.
This assembles the latest copy of the dirty extent.
This is guaranteed to behave correctly,
since the Sony jukebox device manager acquires an
exclusive lock on the extent beforehand.
The exclusive lock guarantees that no user can
update blocks that appear in the extent until
the lock is released.
.pp
One final issue in cache replacement is detecting and recovering
from crashes during flushes of dirty extents.
Consider the case in which a dirty extent is being
written to the Sony jukebox,
and the system crashes before the write completes.
In that case,
the cache data and metadata are still consistent,
but some of the blocks marked
.q dirty
in the cache already appear on the jukebox.
.pp
Because the jukebox obeys the write ordering specified by \*(PG,
this case is easy to detect and handle.
Just before writing a dirty extent,
the Sony jukebox storage manager attempts to read the first block
it will write.
If the read succeeds,
then part of the extent was successfully written before a crash.
The amount written can be determined by reading the extent from
the Sony jukebox,
and the remaining dirty blocks can be written safely.
Attempts to overwrite blocks on the Sony jukebox actually destroy
the data stored there,
so the jukebox device manager must be very careful about this.
.sh 2 "Use of Shared Memory"
.pp
The fact that the magnetic disk block cache is shared
by all \*(PG processes requires that the Sony jukebox device
manager coordinate cache operations with its peers.
Specifically,
only a single \*(PG process may update a particular cache
region at a time.
In addition,
cache metadata is stored in memory for efficiency.
Processes that want to examine cache metadata can do so
without accessing the magnetic disk version.
Only processes that change the metadata must force their
changes to disk.
.pp
To support sharing,
the Sony jukebox device manager uses the \*(PG shared memory
and semaphore abstractions.
Cache metadata is stored at a named location in shared memory,
and all concurrent processes can attach this region
to their address spaces.
Semaphores with well-known names exist to control exclusive
access to objects in shared memory.
The rest of this section describes what is stored in shared memory
and how access to it is controlled.
.pp
As mentioned in section 4.2,
the magnetic disk block cache is logically divided into
.i "user data"
and
.i metadata .
The user data portion consists of a control block
followed by a sequence of data blocks.
The control block stores the name and object identifier
of the relation owning the data blocks,
the link token described in the previous section,
and some extra administrative information.
The metadata identifies dirty data and unallocated data blocks
and stores the link token,
among other information.
The
.i extent
allocated by the jukebox storage manager is the
user data portion,
including the control block.
This entire extent is read from and written to the jukebox.
The metadata is never stored on the jukebox.
It is maintained only on magnetic disk,
and is recreated from the control block when an extent
moves from an optical disk platter into the magnetic disk cache.
.pp
Both the metadata and the user data portions of the cache
are persistent,
as defined in section 4.2.1.
When the first \*(PG process starts up after a crash,
the Sony jukebox device manager
detects the fact that its shared memory is not yet initialized
and loads the cache metadata into shared memory.
This requires an exclusive lock on the cache metadata for the
duration of the load.
Such a lock is acquired using the \*(PG semaphore abstraction.
As the metadata is loaded,
a hash table is constructed in shared memory to make lookups
of particular extents (by relation, database, and block numbers) fast.
This initialization takes on the order of fifty milliseconds
on a DECstation 3100 running Ultrix 4.2.
.pp
Once the metadata is loaded into shared memory
and the hash table is initialized,
the exclusive lock is released.
This is the only case in which an exclusive lock is held
on a shared data structure during a disk access.
.pp
When a \*(PG process wants access to a particular cached extent,
it takes the following steps:
.np
It acquires an exclusive lock on the hash table
and looks up the extent.
The exclusive lock keeps concurrent processes from changing
the hash table during lookup.
This lock need not exclude other readers,
but does so (for the sake of simplicity)
in the current implementation.
For the purposes of this discussion,
assume that the extent is present in the cache.
.np
It releases the exclusive lock on the hash table.
.np
It acquires an exclusive lock on the particular extent.
Every cache entry has an associated semaphore that controls access to it.
Since semaphores are a scarce resource,
the \*(PG shared lock table could be used instead.
.np
It verifies that the extent is the one desired.
The extent may have changed,
for example,
if some concurrent process evicted the extent and replaced it with another
before the exclusive lock on the extent was acquired.
Another possibility is that the cache is inconsistent due to a crash.
It is possible to detect and recover from this case
as described in section 4.2.1.
.np
\*(PG then sets the
.CW IO_INPROGRESS
flag on the extent's metadata
and releases the exclusive lock.
Concurrent processes will notice that the
.CW IO_INPROGRESS
flag is set and defer their accesses until it is cleared.
Sleeping and wakeup are coordinated using spinlocks or semaphores,
depending on operating system support.
.np
Finally,
it releases the exclusive lock on the extent.
.pp
At this point,
the extent may be read or written freely.
.pp
If the extent sought does not yet appear in the cache,
\*(PG must evict an existing extent and load the one desired.
The process for doing so is similar to that for reading an extent,
except that there is a second round of locking to update the cache
metadata and the shared hash table.
In addition,
the code is careful to detect and handle race conditions,
such as the case where two concurrent processes instantiate the same
extent in the cache at the same time.
.pp
The cache management and locking protocol are loosely based on that
used by the \*(PG buffer manager.
The implementation holds exclusive locks on objects for as short a
period as possible,
and never holds an exclusive lock during an I/O
(except during initialization after a crash,
as described above).
In addition,
it guarantees that different processes can update different parts of
the cache simultaneously.
.pp
The complexity involved in managing the cache for such aggressive
sharing is substantial.
This,
along with cache persistence,
are the primary reasons that the Sony jukebox device manager requires
3000 lines of code.
.sh 2 "Managing Write-Once Media"
.pp
Another source of complexity in the Sony jukebox device manager
is the fact that the optical platters are write-once.
Although [STON87] claims that the \*(PG storage manager
is no-overwrite,
it would be more correct to call it
.q seldom-overwrite. \|
There are circumstances in which tuples or blocks
are overwritten in place,
and these cases cause problems for the Sony jukebox storage manager.
.pp
One such case is when a tuple is deleted or replaced.
The no-overwrite storage manager in \*(PG reserves some space
in the tuple to mark it with the
transaction ID of the updating transaction.
Marking a tuple constitutes an overwrite-in-place of the existing tuple,
and cannot be supported by the Sony jukebox device manager.
.pp
Another case requiring overwrite is when the \*(PG vacuum cleaner runs.
This process fills in the commit times of transactions affecting
tuples.
Commit times are recorded in fields in the tuples themselves,
and so this also constitutes overwrite-in-place.
Furthermore,
the vacuum cleaner assumes that it can physically remove
expired tuples from data pages.
This is not true for write-once media.
.pp
The present implementation of the Sony jukebox device manager
fails to solve either problem.
It detects attempts to overwrite existing pages,
and returns a failure error code to the caller.
This means that relations stored on the Sony jukebox cannot be vacuumed,
and that tuples stored on the jukebox can never be deleted.
Since vacuuming is only a performance optimization,
failure to support it does not lead to incorrect behavior.
The fact that jukebox tuples are permanent is a more serious flaw.
Essentially,
this makes physical device characteristics visible to the user.
The design goal of the storage manager switch was to make
relations location-transparent,
and this property violates location transparency.
.pp
A possible solution would be to change the representation of pages
stored inside the Sony jukebox device manager.
The pages would be separated into overwritable parts
(updating transaction ID and transaction commit times for tuples
on the page),
which would be stored on magnetic disk,
and non-overwritable parts
(the tuple data),
which would be stored on the jukebox.
The overwritable parts of a tuple consume thirteen bytes of storage,
so this strategy will increase the amount of magnetic disk space required.
Under this scheme,
conventional \*(PG data pages could be reassembled
by the Sony jukebox device manager on demand.
.pp
Another problem raised by write-once storage devices is the
fact that data blocks in \*(PG relations do get overwritten
in place.
For example,
a data block is overwritten every time a new tuple is added to it.
As above,
this causes problems for the Sony jukebox device manager.
.pp
In this case,
the solution is straightforward.
The current version of \*(PG does not cluster data in relations,
so all insertions to heap relations happen on the relation's
highest-numbered block.
The Sony jukebox device manager stores the last block of every
relation it manages on magnetic disk.
When a new block is allocated,
the old last block is moved to the optical disk jukebox,
and the new one replaces it on magnetic disk.
Note that the Sony jukebox device manager does not
use the magnetic disk device manager for this;
it operates on the magnetic disk directly.
.pp
The current system flags write-once device managers
in the storage manager switch,
and the data manager is careful not to vacuum relations stored on these.
In addition,
since indexed access methods generally require block overwrite-in-place
at random locations in the relation,
these cannot be located on write-once device managers.
The correct solution is to support logical overwrite-in-place
on write-once device managers
by writing software to handle overwrites-in-place,
but this has not yet been done in the Sony jukebox device manager.
.sh 2 "Commit and Abort Processing"
.pp
When a transaction commits or aborts,
the data manager calls the
.CW smgr_commit
or
.CW smgr_abort
interface routines to guarantee that all data managed by all
device managers is on stable storage.
.pp
Writes to the physical jukebox device are synchronous and unbuffered,
so no special processing needs to be done for them.
Writes to magnetic disk may be captured by the file system buffer cache,
and must be forced through at transaction commit.
In addition,
the Sony jukebox device manager must flush changes to stable storage
even in case of a transaction abort.
This is because the device manager may write a page to disk
on behalf of some other transaction.
For example,
a page in the buffer cache may be dirtied by one transaction,
and written by a device manager in another transaction
which evicts the page from the cache.
.pp
To guarantee that all changes are on stable storage,
the Sony jukebox device manager keeps track of all magnetic disk
file descriptors on which it has done writes.
At transaction commit or abort,
the jukebox device manager issues an
.CW fsync
system call on each of these file descriptors.
This forces all pending writes to disk.
On completion,
the device manager returns successfully.
.sh 2 "Query Cost Estimation"
.pp
The \*(PG optimizer is based on the design proposed by [SELI79]
and is described in detail in [FONG86].
To support multiple devices,
the query optimizer must call the
.CW smgr_cost
function for the device managers that manage relations under consideration.
The optimizer uses selectivity functions
defined by the access methods to estimate the number
of blocks and number of tuples in the relation that
the query plan will touch.
These values are passed to
.CW smgr_cost .
This section describes the cost estimation function provided by
the Sony jukebox device manager.
.pp
[SELI79] proposes a cost estimation function
.EQ
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.ps 11
.ft 2
.ds 11 "cost
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(eq
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "W
.ds 13 "\f11\fP
.as 12 \v'20u'\s-3\|\*(13\s+3\v'-20u'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "\v'-.3m'.\v'.3m'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "nblocks
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(pl
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "W
.ds 13 "\f12\fP
.as 12 \v'20u'\s-3\|\*(13\s+3\v'-20u'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "\v'-.3m'.\v'.3m'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "ntuples
.as 11 "\*(12
.ds 11 \x'0'\f2\s11\*(11\|\s\n(99\f(\n(98\x'3u'
.nr 11 \w'\*(11'
.nr MK 0
.if 108>\n(.v .ne 108u
.rn 11 10
\*(10
.ps \n(99
.ft \n(98
.EN
.sp 0.3v
to estimate the cost of a particular access path to a relation.
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f11\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " is a coefficient that
.ps \n(99
.ft \n(98
\*(12
approximates the cost of instantiating a block from the storage device,
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.as 12 "and 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f12\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " is a coefficient that approximates the CPU cost of processing
.ps \n(99
.ft \n(98
\*(12
a single tuple.
These coefficients are set empirically.
.pp
The Sony jukebox storage manager adds a term to this equation.
It uses
.EQ
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.ps 11
.ft 2
.ds 11 "cost
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(eq
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "W
.ds 13 "\f10\fP
.as 12 \v'20u'\s-3\|\*(13\s+3\v'-20u'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "\v'-.3m'.\v'.3m'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "npswitches
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(pl
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "W
.ds 13 "\f11\fP
.as 12 \v'20u'\s-3\|\*(13\s+3\v'-20u'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "\v'-.3m'.\v'.3m'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "nblocks
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(pl
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "W
.ds 13 "\f12\fP
.as 12 \v'20u'\s-3\|\*(13\s+3\v'-20u'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "\v'-.3m'.\v'.3m'
.as 11 "\*(12
.ds 12 "\|
.as 11 "\*(12
.ds 12 "ntuples
.as 11 "\*(12
.ds 11 \x'0'\f2\s11\*(11\|\s\n(99\f(\n(98\x'3u'
.nr 11 \w'\*(11'
.nr MK 0
.if 108>\n(.v .ne 108u
.rn 11 10
\*(10
.ps \n(99
.ft \n(98
.EN
.sp 0.3v
The new term estimates the cost of platter switches
associated with the access path under consideration.
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f10\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " approximates the cost of a single platter switch,
.ps \n(99
.ft \n(98
\*(12
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.as 12 "and 
.ps 11
.ft 2
.ds 13 "npswitches
.ds 13 \x'0'\f2\s11\*(13\|\s\n(99\f(\n(98
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " estimates the number of platter switches
.ps \n(99
.ft \n(98
\*(12
required.
.pp
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f10\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " is set empirically.
.ps \n(99
.ft \n(98
\*(12
A platter switch on the Sony jukebox takes,
on the average,
6.7 seconds.
This is roughly the amount of time it takes to read 5MByte,
or 625 8kByte data pages,
from an optical platter.
Thus
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f10\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 14 "\ 
.as 13 "\*(14
.ds 14 "\(eq
.as 13 "\*(14
.ds 14 "\ 
.as 13 "\*(14
.ds 14 "\f16\fP\f12\fP\f15\fP
.as 13 "\*(14
.ds 14 "W
.ds 15 "\f11\fP
.as 14 \v'20u'\s-3\|\*(15\s+3\v'-20u'
.as 13 "\*(14
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 ".
.ps \n(99
.ft \n(98
\*(12
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "W
.ds 14 "\f11\fP
.as 13 \v'20u'\s-3\|\*(14\s+3\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 ",
.ps \n(99
.ft \n(98
\*(12
in turn,
is set empirically to 40 times the cost of instantiating a
magnetic disk block.
In general,
device managers in \*(PG normalize their cost function coefficients
to the values used by the magnetic disk device manager.
.pp
The number of platter switches,
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "npswitches
.ds 13 \x'0'\f2\s11\*(13\|\s\n(99\f(\n(98
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 ",
.ps \n(99
.ft \n(98
\*(12
is estimated in the following way.
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.as 12 "Let 
.ps 11
.ft 2
.ds 13 "nblocks
.ds 14 "r
.as 13 \v'20u'\s-3\*(14\s+3\|\v'-20u'
.ds 13 \x'0'\f2\s11\*(13\s\n(99\f(\n(98\x'3u'
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " be the total number of blocks stored in
.ps \n(99
.ft \n(98
\*(12
the relation under consideration.
Then the probability that any particular platter must be loaded
is assumed to be
.EQ
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.ps 11
.ft 2
.ds 11 "nblocks
.ds 12 "nblocks
.ds 13 "r
.as 12 \v'20u'\s-3\*(13\s+3\|\v'-20u'
.nr 11 \w'\s11\*(11'
.nr 12 \w'\s11\*(12'
.nr 13 \n(11
.if \n(12>\n(13 .nr 13 \n(12
.nr 13 \n(13+\s11.5m\s0
.ds 11 \v'62u'\h'\n(13u-\n(12u/2u'\*(12\
\h'-\n(12u-\n(11u/2u'\v'-114u'\*(11\
\h'-\n(13u-\n(11u/2u+.1m'\v'26u'\l'\n(13u-.2m'\h'.1m'\v'26u'
.ds 11 \x'0'\x'0-35u'\f2\s11\*(11\s\n(99\f(\n(98\x'65u'
.nr 11 \w'\*(11'
.nr MK 0
.if 222>\n(.v .ne 222u
.rn 11 10
\*(10
.ps \n(99
.ft \n(98
.EN
.sp 0.3v
where
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "nblocks
.ds 13 \x'0'\f2\s11\*(13\|\s\n(99\f(\n(98
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 " is the parameter passed in by the query optimizer.
.ps \n(99
.ft \n(98
\*(12
This assumes that the probability of touching a given block is
uniform over all platters storing data for the relation;
that is,
that every platter storing data for the relation
has an equal chance of being loaded to satisfy the query.
In general,
this is an invalid assumption,
since it ignores the number of blocks stored on a given platter,
but it simplifies cost estimation substantially.
.pp
The number of platters storing data for the relation,
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.rm 12 
.ps 11
.ft 2
.ds 13 "nplatters
.ds 13 \x'0'\f2\s11\*(13\|\s\n(99\f(\n(98
.as 12 \*(13
.ps \n(99
.ft \n(98
.as 12 ",
.ps \n(99
.ft \n(98
\*(12
is computed by examining the system catalogs.
Then the number of platter switches is computed as
.EQ
.nr 99 \n(.s
.nr 98 \n(.f
.af 98 01
.ps 11
.ft 2
.ds 11 "npswitches
.ds 12 "\ 
.as 11 "\|\*(12
.ds 12 "\(eq
.as 11 "\*(12
.ds 12 "\ 
.as 11 "\*(12
.ds 12 "\f1max\fP
.as 11 "\*(12
.ds 12 "\|
.ds 13 "\f11\fP
.ds 14 "\|
.as 13 "\*(14
.ds 14 "\f1,\fP
.as 13 "\*(14
.ds 14 "\|
.as 13 "\*(14
.ds 14 "nblocks
.ds 15 "nblocks
.ds 16 "r
.as 15 \v'20u'\s-3\*(16\s+3\|\v'-20u'
.nr 14 \w'\s11\*(14'
.nr 15 \w'\s11\*(15'
.nr 16 \n(14
.if \n(15>\n(16 .nr 16 \n(15
.nr 16 \n(16+\s11.5m\s0
.ds 14 \v'62u'\h'\n(16u-\n(15u/2u'\*(15\
\h'-\n(15u-\n(14u/2u'\v'-114u'\*(14\
\h'-\n(16u-\n(14u/2u+.1m'\v'26u'\l'\n(16u-.2m'\h'.1m'\v'26u'
.as 13 "\*(14
.ds 14 "\v'-.3m'.\v'.3m'
.as 13 "\*(14
.ds 14 "nplatters
.as 13 "\*(14
.as 12 "\*(13
.ds 13 "\|
.as 12 "\|\*(13
.ds 12 \|\v'0u'\b'\(lt\(bv\(lb'\v'0u'\*(12\|\v'0u'\b'\(rt\(bv\(rb'\v'0u'
.as 11 "\*(12
.ds 11 \x'0'\x'0-56u'\f2\s11\*(11\s\n(99\f(\n(98\x'86u'
.nr 11 \w'\*(11'
.nr MK 0
.if 264>\n(.v .ne 264u
.rn 11 10
\*(10
.ps \n(99
.ft \n(98
.EN
.sp 0.3v
This formula assumes that at least one platter must be mounted
in order to satisfy a query.
.pp
The cost function described in this section favors access plans
that touch fewer blocks of relations on the Sony jukebox.
This cost function provides only a rough estimate query's actual cost.
The contents of the magnetic disk cache are ignored,
and no attempt is made to determine whether a desired
platter is already mounted in the jukebox.
In addition,
the actual distribution of data pages across platters is
not considered.
Section 8 suggests some improvements to the existing cost function.
.sh 2 "Summary of Sony Jukebox Device Manager Implementation"
.pp
This section has described the design of the Sony optical disk
jukebox device manager.
The most important features of this device manager
are the magnetic disk cache of recently-used optical disk blocks
and the write-once nature of the recording medium.
These two features make the device manager more complex
than the magnetic disk or main memory device managers.
.pp
The Sony jukebox device manager demonstrates that the
storage manager switch abstraction was successful.
New storage devices may be integrated into \*(PG
by a sophisticated programmer with knowledge of
the database system's internals.
.pp
The next section introduces the Inversion file system,
which was implemented on top of \*(PG and takes advantage
of the storage manager switch to provide a file system
on any device supported by \*(PG.
.sh 1 "DESIGN OF THE INVERSION FILE SYSTEM"
.pp
Conventional database systems are generally built on top of file
systems [STON92].
This section describes the design and implementation of a file system
built on top of the database system.
This file system,
called
.q Inversion
because the conventional roles of the file system
and database system are inverted,
runs under \*(PG version 4.0.
It supports file storage on any device managed by \*(PG,
and provides useful services not found in many conventional file systems.
.pp
Current file systems researchers are concentrating on providing
new services to administrators and users.
For example,
Episode [CHUT92] embeds transaction protection in the file system directly.
However,
transactions protect only file system metadata,
to speed recovery after crashes,
and are not available to users.
[SELT90] proposes embedding user-level transaction support
directly in existing file systems,
and QuickSilver [CABR88] is an early example of such a system.
Other researchers are investigating support for time travel.
For example,
Plan 9 [PIKE90] and 3DFS [ROOM92]
allow the user to see past states of the file system.
However,
most such systems provide coarse temporal granularity.
Both Plan 9 and 3DFS take periodic snapshots of the file system,
and states that existed between snapshots are not recoverable.
.pp
The Inversion file system provides transactions and fine-grained
time travel to users by taking advantage of the
\*(PG no-overwrite storage manager.
File data is stored in the database,
so that file updates are transaction-protected.
In addition,
the user may ask to see the state of the file system at
any time in the past.
All transactions that had committed as of that time will
be visible,
so the file system state will be exactly the same as it was at that moment.
This is an important improvement on the coarse-grained time travel
provided by other systems.
.pp
Another feature provided by Inversion is fast recovery.
No file system consistency checker needs to run on the Inversion file
system after a crash,
since recovery is managed by the \*(PG storage manager.
File system recovery is essentially instantaneous.
Any updates that were in progress at the time of the crash,
but had not committed,
will be rolled back.
Any committed updates are guaranteed to be persistent across crashes.
LFS [ROSE91] uses a log to provide similar guarantees about
recovery time,
but does not support transactions.
.pp
In addition,
files in the Inversion file system may be located on any
device manager in \*(PG.
This means that the Inversion file system can span multiple
devices (and device types) transparently.
.pp
A less conventional,
but nonetheless important,
feature of Inversion is that files are strongly typed.
Since \*(PG supports the dynamic loading of user code for
execution in the data manager,
this means that users can arrange for their applications to run
in the file system manager itself,
rather than in a separate address space.
By exploiting typing,
operations may be defined that can be applied
to certain groups of files.
For example,
object files can be treated differently from text files.
Locus [WALK83] supports typed files,
but the set of file types is small and difficult to extend.
.pp
Finally,
the fact that Inversion is built on top of \*(PG
makes it possible to issue
.i "ad hoc"
queries on the file system metadata,
or even file data itself.
Instead of mastering the use of many different programs,
the user may examine the file systems structure and contents
by formulating simple \*(PQ queries.
In addition,
indices may be defined to make selected queries run faster,
at the user's discretion.
.pp
.i "Ad hoc"
queries are especially powerful because they support
attribute-based naming of files [SECH91].
For example,
in Inversion,
it is possible to store revision control information
about a file in a relation.
The user can then formulate queries on the database
to find files based on revision control information.
Since arbitrary schemas are permitted,
any attribute of interest may be stored in the database
and used to look up files.
.pp
In summary,
the Inversion file system provides services not available
in conventional file systems,
including
.BU
transaction-protected file updates,
.BU
fine-grained time travel,
.BU
device independence,
.BU
strong typing of files,
and
.BU
.i "ad hoc"
queries on the file system state.
.pp
The rest of this section describes the design and implementation
of Inversion.
.(z
.in +0.5i
.hl
... 0 -1.2 5.25 0.1 0 0 4096 4096
... 0u 749u 3023u 0u 0u 58u 2358516u -2358458u
.PS 749 3023 
.br
\v'58u'\h'-0.0m'\v'0.2m'\h'-\w'\s11'u/2u'\s11\h'-\w'\s11'u/2u'
.sp -1
\h'173u'\v'58u'\h'-0.0m'\v'0.2m'\h'-\w'filename'u/2u'filename\h'-\w'filename'u/2u'
.sp -1
\v'115u'\D'l0u -115u'
.sp -1
\D'l345u 0u'
.sp -1
\h'345u'\D'l0u 115u'
.sp -1
\h'345u'\v'115u'\D'l-345u 0u'
.sp -1
\h'561u'\v'173u'\h'-0.0m'\v'0.2m'\h'-\w'0'u/2u'0\h'-\w'0'u/2u'
.sp -1
\h'345u'\v'230u'\D'l0u -115u'
.sp -1
\h'345u'\v'115u'\D'l432u 0u'
.sp -1
\h'777u'\v'115u'\D'l0u 115u'
.sp -1
\h'777u'\v'230u'\D'l-432u 0u'
.sp -1
\h'561u'\v'58u'\h'-0.0m'\v'0.2m'\h'-\w'\fIchunkno\fP'u/2u'\fIchunkno\fP\h'-\w'\fIchunkno\fP'u/2u'
.sp -1
\h'561u'\v'288u'\h'-0.0m'\v'0.2m'\h'-\w'1'u/2u'1\h'-\w'1'u/2u'
.sp -1
\h'345u'\v'345u'\D'l0u -115u'
.sp -1
\h'345u'\v'230u'\D'l432u 0u'
.sp -1
\h'777u'\v'230u'\D'l0u 115u'
.sp -1
\h'777u'\v'345u'\D'l-432u 0u'
.sp -1
\h'561u'\v'403u'\h'-0.0m'\v'0.2m'\h'-\w'2'u/2u'2\h'-\w'2'u/2u'
.sp -1
\h'345u'\v'461u'\D'l0u -116u'
.sp -1
\h'345u'\v'345u'\D'l432u 0u'
.sp -1
\h'777u'\v'345u'\D'l0u 116u'
.sp -1
\h'777u'\v'461u'\D'l-432u 0u'
.sp -1
\h'777u'\v'230u'\D'l0u -115u'
.sp -1
\h'777u'\v'115u'\D'l1152u 0u'
.sp -1
\h'1929u'\v'115u'\D'l0u 115u'
.sp -1
\h'1929u'\v'230u'\D'l-1152u 0u'
.sp -1
\h'1353u'\v'58u'\h'-0.0m'\v'0.2m'\h'-\w'\fIchunk\fP'u/2u'\fIchunk\fP\h'-\w'\fIchunk\fP'u/2u'
.sp -1
\h'777u'\v'345u'\D'l0u -115u'
.sp -1
\h'777u'\v'230u'\D'l1152u 0u'
.sp -1
\h'1929u'\v'230u'\D'l0u 115u'
.sp -1
\h'1929u'\v'345u'\D'l-1152u 0u'
.sp -1
\h'777u'\v'461u'\D'l0u -116u'
.sp -1
\h'777u'\v'345u'\D'l1152u 0u'
.sp -1
\h'1929u'\v'345u'\D'l0u 116u'
.sp -1
\h'1929u'\v'461u'\D'l-1152u 0u'
.sp -1
\h'345u'\v'461u'\D'l0u 288u'
.sp -1
\h'345u'\v'749u'\D'l1584u 0u'
.sp -1
\h'1929u'\v'749u'\D'l0u -288u'
.sp -1
\h'1281u'\v'489u'\h'-0.0m'\v'0.2m'\h'-\w'.'u/2u'.\h'-\w'.'u/2u'
.sp -1
\h'1281u'\v'547u'\h'-0.0m'\v'0.2m'\h'-\w'.'u/2u'.\h'-\w'.'u/2u'
.sp -1
\h'1281u'\v'605u'\h'-0.0m'\v'0.2m'\h'-\w'.'u/2u'.\h'-\w'.'u/2u'
.sp -1
\h'2678u'\v'58u'\h'-0.0m'\v'0.2m'\h'-\w'\s-2POSTGRES\s0 tuples'u/2u'\s-2POSTGRES\s0 tuples\h'-\w'\s-2POSTGRES\s0 tuples'u/2u'
.sp -1
\h'2332u'\v'58u'\D'l-403u 115u'
.sp -1
\h'1980u'\v'143u'\D'l-51u 30u'
.sp -1
\h'1988u'\v'171u'\D'l-59u 2u'
.sp -1
\h'2332u'\v'58u'\D'l-403u 230u'
.sp -1
\h'1972u'\v'247u'\D'l-43u 41u'
.sp -1
\h'1986u'\v'272u'\D'l-57u 16u'
.sp -1
\h'2332u'\v'58u'\D'l-403u 345u'
.sp -1
\h'1963u'\v'355u'\D'l-34u 48u'
.sp -1
\h'1982u'\v'377u'\D'l-53u 26u'
.sp -1
.sp 1+749u
.PE
.ce
Figure 5: Decomposition of files into relations in Inversion
.hl
.in -0.5i
.)z
.sh 2 "Decomposing Files into Relations"
.pp
Files,
generally viewed by users as byte streams,
are stored in conventional file systems
as a series of data blocks.
The Inversion file system similarly
.q chunks
user data.
Figure 5 shows the schema used to store file data
in \*(PG relations.
.pp
For every file,
a uniquely-named relation is created.
File data is collected into chunks slightly smaller than 8kBytes.
The size of the chunk is calculated so that a single
tuple will fit exactly on a \*(PG data manager page.
This page size was chosen early in the design of \*(PG,
and was intended to make magnetic disk transfers fast.
Although Inversion does not require magnetic disk in order
to function,
the page size inherited from the data manager survives.
.pp
When a user writes a new data chunk to a file,
a tuple is created consisting of the
.i "chunk number" ,
or index of this chunk into the file,
and the data chunk.
This tuple is appended to the relation storing the file.
.pp
The Inversion file system provides a set of interface routines
to create, open, close, read, write, and seek on files.
Byte-oriented operations are turned into operations on chunks
by calculating the chunk numbers of the affected chunks.
In order to be compatible with existing \*(UX filesystems,
Inversion uses longwords to communicate a file's current read
or write position.
This limits Inversion files to 2GBytes in size.
This constraint is easy to remove,
at the cost of compatibility with existing file systems.
.pp
A file is located on particular device manager at creation.
From that point on,
accesses are device-transparent,
both to the user and to the Inversion file system itself.
The underlying device manager is called to instantiate blocks
of the relation storing the file.
Inversion does not know anything about devices.
.pp
When a file is modified,
the tuples storing changed chunks are replaced
in the normal way.
In order to speed up seeks on files,
Inversion maintains a Btree index on the chunk number attribute.
.sh 2 "Namespace and Metadata Management"
.pp
Inversion stores the file system namespace in a class
.(E
pg_largeobject(poid, fname)
.)E
where
.CW poid
is the
.CW pg_largeobject
object identifier of the parent directory,
and
.CW fname
is the file name.
The
object identifier associated with
the
.CW pg_largeobject
tuple for the file is its filesystem-wide unique identifier.
.pp
The names of the relations containing file data are computed
from the file's
.CW pg_largeobject
object identifier.
If the object identifier is 12345,
then the relation containing the file data is
.CW inv12345 ,
and the block number index on the relation is called
.CW inx12345 .
.pp
In addition to namespace management,
most file systems manage other metadata associated with a file.
For example,
the file's owner,
size,
and last access,
modification,
and creation times are typically stored by the file system.
Inversion also maintains this data for all files.
The owner and creation time for a file may be determined
by examining the
.CW pg_class
tuple for the relation in which the file is stored.
.pp
In the current version of \*(PG,
relations are append-at-end only,
so frequently changing data consumes space quickly.
Since file sizes,
last access times,
and last modification times typically change frequently,
Inversion does not store them in a relation,
in order to conserve space.
Instead,
these attributes are stored in a hash table on magnetic disk,
and updated in place.
Once \*(PG can cluster data in relations and reuse free space
on data pages,
these items should be stored in a relation.
.sh 2 "File System Management Issues"
.pp
Two important issues in file system management
are the policy for caching filesystem blocks for fast access
and the policy for laying files out on storage devices.
The current implementation of Inversion does not address
either of these issues.
.pp
\*(PG supports a multi-level cache,
since the buffer manager caches buffers in shared memory
and device managers
(like the Sony jukebox device manager)
can implement caches for their own use.
These caches may be distributed among different storage technologies.
For example,
the Sony jukebox device manager maintains its cache of recently-used
extents on magnetic disk.
The \*(PG shared buffer cache resides in main memory.
This separation of cache location solves a commonly-cited
problem of multi-level caches.
Since the caches are on different devices,
they do not compete for resources.
.pp
Despite the fact that device managers can provide custom caching,
there is no special cache of file system data blocks
maintained by Inversion.
Inversion is built entirely above the storage manager switch
layer,
and so has no knowledge of a file's location.
It would have been possible to create an in-memory cache of file
system blocks,
but this is precisely the service already supplied by the
\*(PG shared buffer cache.
.pp
Since Inversion does not control file location,
it cannot control placement of file blocks on storage media.
Device managers enforce policies for block layout
in the current implementation.
At some point,
\*(PG implementors will have to move this policy
higher in the system,
since support for data clustering in relations
requires that the data manager make decisions on block layout.
In either case,
no support currently exists for enforcement of a layout policy
in the Inversion file system code.
.sh 1 "PERFORMANCE OF THE INVERSION FILE SYSTEM"
.pp
This section presents measurements of the performance
of the Inversion file system.
Inversion is intended to support physical scientists
working on the Sequoia 2000 project [STON91], [KATZ91].
In general,
these scientists will use Inversion as a network file server.
The system configuration evaluated here is the same
as will be used by Sequoia researchers.
Inversion is compared to NFS [SAND85] running on identical
hardware.
.sh 2 "System Configuration"
.pp
Inversion was installed on a DECstation 5900 with 128MBytes of main memory.
The operating system running on the machine was Ultrix 4.2.
Files were located on a 1.3GByte DEC RZ58 disk drive attached
to the DECsystem 5900.
.pp
Files were opened, read, and written from a remote client
running on a DECstation 3100 under Ultrix 4.2.
Client/server communication was via TCP/IP over a 10Mbit/sec Ethernet.
.pp
Inversion was compared to the Ultrix 4.2 implementation of NFS.
The NFS server was run on the same DECsystem 5900,
using the same disk,
as Inversion.
The NFS client was the same DECstation 3100.
.pp
The NFS implementation on the DECsystem 5900 used a service
called PRESTOserve to speed up writes.
To guarantee that NFS servers remain stateless,
NFS must force every write to stable storage synchronously [SAND85].
PRESTOserve consists of a board containing 1MByte of battery-backed
RAM and driver software to cache NFS writes in non-volatile memory.
As will be seen below,
this substantially improved the write throughput of NFS under Ultrix.
This non-volatile memory was not used by Inversion.
.sh 2 "The Benchmark"
.pp
The benchmark consisted of the following operations:
.BU
Create a 25MByte file.
.BU
Measure the latency to read or write a single byte
at a random location in the file.
.BU
Read 1MByte in a single large transfer.
.BU
Read 1MByte sequentially in page-sized units.
The page size was chosen to be efficient for the file system under test.
.BU
Read 1MByte in page-sized units distributed
at random throughout the file.
.BU
Repeat the 1MByte transfer tests,
writing instead of reading.
.pp
All caches were flushed before each test.
These tests measure worst-case throughput
for operations that Sequoia researchers are likely to carry out.
.(z
.in +0.5i
.hl
... 0.925 7.668 5.737 10.262 0 0 4096 4096
... 0u 1494u 2771u 0u -532u 5909u 2358157u -2352779u
.PS 1494 2771 
.br
.ps 11
\h'36u'\v'720u'\D'l72u 0u'
.sp -1
\h'36u'\v'432u'\D'l72u 0u'
.sp -1
\h'36u'\v'144u'\D'l72u 0u'
.sp -1
\h'216u'\v'1009u'\D'l0u -801u'
.sp -1
\h'216u'\v'208u'\D'l144u 0u'
.sp -1
\h'360u'\v'208u'\D'l0u 801u'
.sp -1
\h'360u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'584u'\v'1009u'\D'l0u -317u'
.sp -1
\h'584u'\v'692u'\D'l144u 0u'
.sp -1
\h'728u'\v'692u'\D'l0u 317u'
.sp -1
\h'728u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'72u'\D'l0u 1008u'
.sp -1
\h'72u'\v'1008u'\D'l864u 0u'
.sp -1
\v'738u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR50.0\fP'u'\s11\fR50.0\fP
.sp -1
\v'450u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR100.0\fP'u'\s11\fR100.0\fP
.sp -1
\v'162u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR150.0\fP'u'\s11\fR150.0\fP
.sp -1
\h'288u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'648u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
\h'648u'\v'1206u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRNFS\fP'u/2u'\s11\fRNFS\fP\h'-\w'\s11\fRNFS\fP'u/2u'
.sp -1
\h'432u'\v'1385u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR(a) Time to create\fP'u/2u'\s11\fR(a) Time to create\fP\h'-\w'\s11\fR(a) Time to create\fP'u/2u'
.sp -1
\h'432u'\v'1494u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRa 25MByte object\fP'u/2u'\s11\fRa 25MByte object\fP\h'-\w'\s11\fRa 25MByte object\fP'u/2u'
.sp -1
\h'36u'\v'17u'\h'-0.0m'\v'0.2m'\h'-\w'\s12\fRsec\fP'u'\s12\fRsec\fP
.sp -1
\h'1152u'\v'432u'\D'l72u 0u'
.sp -1
\h'1152u'\v'576u'\D'l72u 0u'
.sp -1
\h'1152u'\v'720u'\D'l72u 0u'
.sp -1
\h'1152u'\v'864u'\D'l72u 0u'
.sp -1
\h'1396u'\v'1009u'\D'l0u -288u'
.sp -1
\h'1396u'\v'721u'\D'l144u 0u'
.sp -1
\h'1540u'\v'721u'\D'l0u 288u'
.sp -1
\h'1540u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'2404u'\v'1009u'\D'l0u -219u'
.sp -1
\h'2404u'\v'790u'\D'l144u 0u'
.sp -1
\h'2548u'\v'790u'\D'l0u 219u'
.sp -1
\h'2548u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1702u'\v'1009u'\D'l0u -374u'
.sp -1
\h'1702u'\v'635u'\D'l144u 0u'
.sp -1
\h'1846u'\v'635u'\D'l0u 374u'
.sp -1
\h'1846u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'2122u'\v'1009u'\D'l0u -202u'
.sp -1
\h'2122u'\v'807u'\D'l144u 0u'
.sp -1
\h'2266u'\v'807u'\D'l0u 202u'
.sp -1
\h'2266u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1187u'\v'216u'\D'l0u 792u'
.sp -1
\h'1187u'\v'1008u'\D'l1584u 0u'
.sp -1
\h'1115u'\v'882u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR0.1\fP'u'\s11\fR0.1\fP
.sp -1
\h'1115u'\v'738u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR0.2\fP'u'\s11\fR0.2\fP
.sp -1
\h'1115u'\v'594u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR0.3\fP'u'\s11\fR0.3\fP
.sp -1
\h'1115u'\v'450u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR0.4\fP'u'\s11\fR0.4\fP
.sp -1
\h'1475u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRread\fP'u/2u'\s11\fRread\fP\h'-\w'\s11\fRread\fP'u/2u'
.sp -1
\h'1763u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRwrite\fP'u/2u'\s11\fRwrite\fP\h'-\w'\s11\fRwrite\fP'u/2u'
.sp -1
\h'2195u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRread\fP'u/2u'\s11\fRread\fP\h'-\w'\s11\fRread\fP'u/2u'
.sp -1
\h'2483u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRwrite\fP'u/2u'\s11\fRwrite\fP\h'-\w'\s11\fRwrite\fP'u/2u'
.sp -1
\h'2339u'\v'1242u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix NFS\fP'u/2u'\s11\fRUltrix NFS\fP\h'-\w'\s11\fRUltrix NFS\fP'u/2u'
.sp -1
\h'1619u'\v'1242u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'1979u'\v'1385u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR(b) Time to read or write a single byte\fP'u/2u'\s11\fR(b) Time to read or write a single byte\fP\h'-\w'\s11\fR(b) Time to read or write a single byte\fP'u/2u'
.sp -1
\h'1979u'\v'1494u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRat a random location\fP'u/2u'\s11\fRat a random location\fP\h'-\w'\s11\fRat a random location\fP'u/2u'
.sp -1
\h'1152u'\v'268u'\h'-0.0m'\v'0.2m'\h'-\w'\s12\fRsec\fP'u'\s12\fRsec\fP
.sp -1
.sp 1+1494u
.PE
.sp
.ce
.ps 11
Figure 6:  Elapsed time to create 25MByte file, read or write a single byte
.sp 0.5v
.hl
.in -0.5i
.)z
.sh 2 "Benchmark Results"
.pp
Figure 6a shows the elapsed time to create a 25MByte file
under Inversion and under Ultrix NFS.
As shown,
Inversion gets about 36% of the throughput of NFS for
file creation.
This difference is due primarily to the extra overhead in maintaining
indices in Inversion.
For every page written to the file,
Inversion must create a Btree index entry so that the page
can be located quickly later.
Btree writes are interleaved with data file writes,
penalizing Inversion by forcing the disk head to move frequently.
The NFS implementation does not maintain as much
indexing information on the data file,
and so can postpone writing its index until all data blocks
have been written.
This means that NFS writes the data file sequentially,
improving throughput.
.pp
Figure 6b shows the overhead for reading or writing a single
byte of the 25MByte file just created.
Since all caches were flushed prior to running the test,
a disk access is required.
For single-byte reads,
Inversion gets 70 percent of the throughput of NFS.
Single-byte writes are slightly worse;
Inversion is 61 percent of NFS.
Since Inversion never overwrites data in place,
a new entry must be written to the Btree block index,
accounting for the difference.
.(z
.in +0.5i
.hl
... 0.6 7.981 4.987 10.262 0 0 4096 4096
... 0u 1313u 2526u 0u -344u 5909u 2358099u -2352535u
.PS 1313 2526 
.br
.ps 11
\h'43u'\v'720u'\D'l72u 0u'
.sp -1
\h'43u'\v'432u'\D'l72u 0u'
.sp -1
\h'43u'\v'144u'\D'l72u 0u'
.sp -1
\h'43u'\v'288u'\D'l72u 0u'
.sp -1
\h'43u'\v'576u'\D'l72u 0u'
.sp -1
\h'43u'\v'864u'\D'l72u 0u'
.sp -1
\h'288u'\v'1009u'\D'l0u -478u'
.sp -1
\h'288u'\v'531u'\D'l144u 0u'
.sp -1
\h'432u'\v'531u'\D'l0u 478u'
.sp -1
\h'432u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'582u'\v'1009u'\D'l0u -398u'
.sp -1
\h'582u'\v'611u'\D'l143u 0u'
.sp -1
\h'725u'\v'611u'\D'l0u 398u'
.sp -1
\h'725u'\v'1009u'\D'l-143u 0u'
.sp -1
\h'1088u'\v'1009u'\D'l0u -697u'
.sp -1
\h'1088u'\v'312u'\D'l144u 0u'
.sp -1
\h'1232u'\v'312u'\D'l0u 697u'
.sp -1
\h'1232u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1376u'\v'1009u'\D'l0u -317u'
.sp -1
\h'1376u'\v'692u'\D'l144u 0u'
.sp -1
\h'1520u'\v'692u'\D'l0u 317u'
.sp -1
\h'1520u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1877u'\v'1009u'\D'l0u -783u'
.sp -1
\h'1877u'\v'226u'\D'l144u 0u'
.sp -1
\h'2021u'\v'226u'\D'l0u 783u'
.sp -1
\h'2021u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'2176u'\v'1009u'\D'l0u -346u'
.sp -1
\h'2176u'\v'663u'\D'l144u 0u'
.sp -1
\h'2320u'\v'663u'\D'l0u 346u'
.sp -1
\h'2320u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'79u'\D'l0u 1008u'
.sp -1
\h'79u'\v'1008u'\D'l2447u 0u'
.sp -1
\h'7u'\v'162u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR6.0\fP'u'\s11\fR6.0\fP
.sp -1
\h'7u'\v'306u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR5.0\fP'u'\s11\fR5.0\fP
.sp -1
\h'7u'\v'450u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR4.0\fP'u'\s11\fR4.0\fP
.sp -1
\h'7u'\v'594u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR3.0\fP'u'\s11\fR3.0\fP
.sp -1
\h'7u'\v'738u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR2.0\fP'u'\s11\fR2.0\fP
.sp -1
\h'7u'\v'882u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1.0\fP'u'\s11\fR1.0\fP
.sp -1
\v'18u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRseconds\fP'u'\s11\fRseconds\fP
.sp -1
\h'511u'\v'1097u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRSingle 1MByte read\fP'u/2u'\s11\fRSingle 1MByte read\fP\h'-\w'\s11\fRSingle 1MByte read\fP'u/2u'
.sp -1
\h'1302u'\v'1097u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1MByte read\fP'u/2u'\s11\fR1MByte read\fP\h'-\w'\s11\fR1MByte read\fP'u/2u'
.sp -1
\h'1302u'\v'1313u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'\s11\fRpage-sized chunks\fP\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'
.sp -1
\h'2094u'\v'1097u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1MByte read\fP'u/2u'\s11\fR1MByte read\fP\h'-\w'\s11\fR1MByte read\fP'u/2u'
.sp -1
\h'2094u'\v'1206u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRat random in\fP'u/2u'\s11\fRat random in\fP\h'-\w'\s11\fRat random in\fP'u/2u'
.sp -1
\h'2094u'\v'1313u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'\s11\fRpage-sized chunks\fP\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'
.sp -1
\h'367u'\v'486u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'1158u'\v'270u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'1950u'\v'162u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'655u'\v'558u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
\h'1446u'\v'630u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
\h'1302u'\v'1206u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRsequentially in\fP'u/2u'\s11\fRsequentially in\fP\h'-\w'\s11\fRsequentially in\fP'u/2u'
.sp -1
\h'2238u'\v'594u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
.sp 1+1313u
.PE
.sp 2
.ce
Figure 7: Elapsed read times for Inversion and Ultrix NFS
.sp 0.5v
.hl
.in -0.5i
.)z
.pp
Figure 7 compares Inversion to Ultrix NFS on large and small
reads.
The first pair of bars compares throughput using a single
large transfer to move data from the server to the client.
In this case,
Inversion gets eighty percent of the throughput of NFS.
When smaller transfers are used,
Inversion drops to 47 percent of NFS.
Profiling reveals that extra work is done in allocating and
copying buffers in Inversion.
If some of this overhead were eliminated,
Inversion's performance could be brought closer to that of NFS.
Since single-byte transfer times are much closer under the
two file systems,
there is reason to believe that tuning will improve Inversion.
.pp
The final pair of bars in Figure 7 compares the transfer
rates of Inversion and Ultrix NFS when the pages read are
distributed at random throughout the 25MByte file.
In this case,
Inversion gets 43 percent the throughput of NFS.
The additional overhead incurred by traversing the Btree page index
in Inversion accounts for much of the slowdown.
.pp
Figure 8 presents the write performance of Inversion and Ultrix NFS.
.(z
.in +0.5i
.hl
... 0.613 7.981 4.987 10.262 0 0 4096 4096
... 0u 1314u 2519u 0u -352u 5910u 2358546u -2352988u
.PS 1314 2519 
.br
.ps 11
\h'36u'\v'864u'\D'l72u 0u'
.sp -1
\h'36u'\v'720u'\D'l72u 0u'
.sp -1
\h'36u'\v'576u'\D'l72u 0u'
.sp -1
\h'36u'\v'432u'\D'l72u 0u'
.sp -1
\h'36u'\v'288u'\D'l72u 0u'
.sp -1
\h'36u'\v'144u'\D'l72u 0u'
.sp -1
\h'286u'\v'1009u'\D'l0u -645u'
.sp -1
\h'286u'\v'364u'\D'l144u 0u'
.sp -1
\h'430u'\v'364u'\D'l0u 645u'
.sp -1
\h'430u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'574u'\v'1009u'\D'l0u -276u'
.sp -1
\h'574u'\v'733u'\D'l144u 0u'
.sp -1
\h'718u'\v'733u'\D'l0u 276u'
.sp -1
\h'718u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1081u'\v'1009u'\D'l0u -806u'
.sp -1
\h'1081u'\v'203u'\D'l144u 0u'
.sp -1
\h'1225u'\v'203u'\D'l0u 806u'
.sp -1
\h'1225u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1369u'\v'1009u'\D'l0u -236u'
.sp -1
\h'1369u'\v'773u'\D'l144u 0u'
.sp -1
\h'1513u'\v'773u'\D'l0u 236u'
.sp -1
\h'1513u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'1870u'\v'1009u'\D'l0u -881u'
.sp -1
\h'1870u'\v'128u'\D'l144u 0u'
.sp -1
\h'2014u'\v'128u'\D'l0u 881u'
.sp -1
\h'2014u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'2169u'\v'1009u'\D'l0u -236u'
.sp -1
\h'2169u'\v'773u'\D'l144u 0u'
.sp -1
\h'2313u'\v'773u'\D'l0u 236u'
.sp -1
\h'2313u'\v'1009u'\D'l-144u 0u'
.sp -1
\h'72u'\D'l0u 1008u'
.sp -1
\h'72u'\v'1008u'\D'l2447u 0u'
.sp -1
\v'162u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR6.0\fP'u'\s11\fR6.0\fP
.sp -1
\v'306u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR5.0\fP'u'\s11\fR5.0\fP
.sp -1
\v'450u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR4.0\fP'u'\s11\fR4.0\fP
.sp -1
\v'594u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR3.0\fP'u'\s11\fR3.0\fP
.sp -1
\v'738u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR2.0\fP'u'\s11\fR2.0\fP
.sp -1
\v'882u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1.0\fP'u'\s11\fR1.0\fP
.sp -1
\v'54u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRseconds\fP'u'\s11\fRseconds\fP
.sp -1
\h'504u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRSingle 1MByte write\fP'u/2u'\s11\fRSingle 1MByte write\fP\h'-\w'\s11\fRSingle 1MByte write\fP'u/2u'
.sp -1
\h'1295u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1MByte written\fP'u/2u'\s11\fR1MByte written\fP\h'-\w'\s11\fR1MByte written\fP'u/2u'
.sp -1
\h'1295u'\v'1206u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRsequentially in\fP'u/2u'\s11\fRsequentially in\fP\h'-\w'\s11\fRsequentially in\fP'u/2u'
.sp -1
\h'1295u'\v'1314u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'\s11\fRpage-sized chunks\fP\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'
.sp -1
\h'2087u'\v'1098u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fR1MByte written\fP'u/2u'\s11\fR1MByte written\fP\h'-\w'\s11\fR1MByte written\fP'u/2u'
.sp -1
\h'2087u'\v'1206u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRat random in\fP'u/2u'\s11\fRat random in\fP\h'-\w'\s11\fRat random in\fP'u/2u'
.sp -1
\h'2087u'\v'1314u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'\s11\fRpage-sized chunks\fP\h'-\w'\s11\fRpage-sized chunks\fP'u/2u'
.sp -1
\h'360u'\v'306u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'648u'\v'666u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
\h'1151u'\v'126u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'1439u'\v'702u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
\h'1943u'\v'54u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRInversion\fP'u/2u'\s11\fRInversion\fP\h'-\w'\s11\fRInversion\fP'u/2u'
.sp -1
\h'2231u'\v'702u'\h'-0.0m'\v'0.2m'\h'-\w'\s11\fRUltrix\fP'u/2u'\s11\fRUltrix\fP\h'-\w'\s11\fRUltrix\fP'u/2u'
.sp -1
.sp 1+1314u
.PE
.sp 2
.ce
Figure 8: Elapsed write times for Inversion and Ultrix NFS
.sp 0.5v
.hl
.in -0.5i
.)z
The tests run were identical to those performed for Figure 7,
except that reads became writes.
In these tests,
the effect of the PRESTOserve board used by NFS is dramatic.
.pp
Since NFS must flush every write to stable storage,
Inversion should have dramatically better performance
than NFS without non-volatile RAM.
The reason for this is that NFS is forced to treat every
write as a single transaction,
and commit it to disk immediately.
Inversion,
however,
can obey the transaction constraints imposed by the client program,
and commit a large number of writes simultaneously.
.pp
Figure 8 shows that Inversion is slower than Ultrix NFS
backed by PRESTOserve.
For a single large write request,
Inversion gets 43 percent the throughput of NFS.
For page-sized sequential writes,
Inversion does worse,
getting only 31 percent of NFS' throughput.
For random accesses,
Inversion has only 28 percent the performance of NFS.
In fact,
the NFS measurements show no degradation due to random accesses,
since the whole 1MByte write fits in the PRESTOserve cache,
and is not flushed to disk.
.sh 2 "Evaluation of Benchmark Results"
.pp
[STON92] presents results from a benchmark of Inversion
run on a symmetric multiprocessor.
The results from that study are substantially better than those
observed here.
On a multi-processor,
Inversion gets between 50 and 95 percent of the throughput
of a native file system running on the same hardware.
.pp
The most important difference between this study and that
conducted for [STON92] is the process structure used to
evaluate the file system.
In the current benchmark,
the client program ran on a remote machine from the file server.
In [STON92],
both processes ran on the same machine.
When the client and server are on the same hardware,
more intelligent IPC mechanisms can be used.
In particular,
buffers shared by the two processes can be marked copy-on-write.
This eliminates an extra data copy.
Since no traffic crosses the Ethernet,
the savings are even more substantial.
Finally,
the symmetric multiprocessor examined in [STON92] allowed
one processor to move blocks between a main memory cache
and the file system
while other processors were carrying out computations.
As a result,
the earlier performance study achieved better results than
those presented here.
.pp
The results presented in [STON92] make it clear that the
communication overhead incurred by the current implementation
of Inversion is high.
In order to determine just how high,
another version of the benchmark was run.
In this case,
the benchmark tests were dynamically loaded into the \*(PG
process being used as the Inversion file server.
Since users can dynamically load their routines into the data manager,
this is an available option in cases where performance in critical.
In this case,
the Inversion
.q client
and
.q server
run in the same address space on the DECsystem 5900.
.pp
In this configuration,
data is never copied across address space boundaries,
and protocol overhead is eliminated.
The performance presented here is the best that a user
could achieve with the current implementation of Inversion.
Note that the same file system can be used simultaneously
by dynamically-loaded code
and by the more conventional client/server architecture.
.pp
Table 2 shows the performance of the single-process benchmark.
For convenience,
the performance of client/server Inversion and Ultrix NFS,
presented in the previous section,
are included.
.(z
.in +0.5i
.hl
.TS
.if \n+(b.=1 .nr d. \n(.c-\n(c.-1
.de 3f
.ps \n(.s
.vs \n(.vu
.in \n(.iu
.if \n(.u .fi
.if \n(.j .ad
.if \n(.j=0 .na
..
.nf
.nr #~ 0
.if \n(.T .if n .nr #~ 0.6n
.ds #d .d
.if \(ts\n(.z\(ts\(ts .ds #d nl
.fc
.nr 3d \n(.s
.rm 4k 4l 4m 4n
.nr 4k 0
.nr 3i \w\fIOperation\fP
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wCreate 25MByte file
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wSingle 1MByte read
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wPage-sized sequential 1MByte read
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wPage-sized random 1MByte read
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wSingle 1MByte write
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wPage-sized sequential 1MByte write
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wPage-sized random 1MByte write
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wRead single byte
.if \n(4k<\n(3i .nr 4k \n(3i
.nr 3i \wWrite single byte
.if \n(4k<\n(3i .nr 4k \n(3i
.4k
.rm 4k
.nr 4l 0
.nr 3i \w\fIInversion\fP
.if \n(4l<\n(3i .nr 4l \n(3i
.nr 3i \w\fIclient/server\fP
.if \n(4l<\n(3i .nr 4l \n(3i
.nr 3b 0
.nr 3c 0
.nr 3i \w141
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.5
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w3
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w4
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.8
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w5
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.5
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w4
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.6
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w5
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.6
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w6
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.0
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.02
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.03
.if \n(3c<\n(3i .nr 3c \n(3i
.4l
.rm 4l
.nr 4g \n(3b
.nr 3i \n(4g+\n(3c
.if \n(3i>\n(4l .nr 4l \n(3i
.if \n(3i<\n(4l .nr 4g +(\n(4l-\n(3i)/2
.nr 4m 0
.nr 3i \w\fIUltrix NFS\fP
.if \n(4m<\n(3i .nr 4m \n(3i
.nr 3b 0
.nr 3c 0
.nr 3i \w50
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.6
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w2
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.8
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w2
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.2
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w2
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w2
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.0
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w1
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.7
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w1
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.7
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.01
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.02
.if \n(3c<\n(3i .nr 3c \n(3i
.4m
.rm 4m
.nr 4h \n(3b
.nr 3i \n(4h+\n(3c
.if \n(3i>\n(4m .nr 4m \n(3i
.if \n(3i<\n(4m .nr 4h +(\n(4m-\n(3i)/2
.nr 4n 0
.nr 3i \w\fIInversion\fP
.if \n(4n<\n(3i .nr 4n \n(3i
.nr 3i \w\fIsingle process\fP
.if \n(4n<\n(3i .nr 4n \n(3i
.nr 3b 0
.nr 3c 0
.nr 3i \w111
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.6
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.8
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w1
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w1
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.4
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w2
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.9
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.01
.if \n(3c<\n(3i .nr 3c \n(3i
.nr 3i \w0
.if \n(3b<\n(3i .nr 3b \n(3i
.nr 3i \w.02
.if \n(3c<\n(3i .nr 3c \n(3i
.4n
.rm 4n
.nr 4i \n(3b
.nr 3i \n(4i+\n(3c
.if \n(3i>\n(4n .nr 4n \n(3i
.if \n(3i<\n(4n .nr 4i +(\n(4n-\n(3i)/2
.nr 3i 1n
.nr 4j 0
.nr 4a \n(4j+((0*\n(3i)/2)
.nr 4k +\n(4a
.nr 4b \n(4k+((6*\n(3i)/2)
.nr 4l +\n(4b
.nr 4g +\n(4b
.nr 4c \n(4l+((6*\n(3i)/2)
.nr 4m +\n(4c
.nr 4h +\n(4c
.nr 4d \n(4m+((6*\n(3i)/2)
.nr 4n +\n(4d
.nr 4i +\n(4d
.nr TW \n(4n
.if t .if \n(TW>\n(.lu .tm Table at line 3960 file Input is too wide - \n(TW units
.nr #I \n(.i
.in +(\n(.lu-\n(TWu-\n(.iu)/2u
.fc  
.nr #T 0-1
.nr #a 0-1
.eo
.de T#
.nr 3f 1m
.ds #d .d
.if \(ts\n(.z\(ts\(ts .ds #d nl
.mk ##
.nr ## -1v
.ls 1
.ls
..
.ec
.ta \n(4ku \n(4lu \n(4mu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'\h'|\n(4bu'\fIInversion\fP\h'|\n(4cu'\h'|\n(4du'\fIInversion\fP
.ta \n(4ku \n(4lu \n(4mu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'\fIOperation\fP\h'|\n(4bu'\fIclient/server\fP\h'|\n(4cu'\fIUltrix NFS\fP\h'|\n(4du'\fIsingle process\fP
.sp 0.5v
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Create 25MByte file\h'|\n(4bu'141.5\h'|\n(4cu'50.6\h'|\n(4du'111.6
.sp 0.5v
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Single 1MByte read\h'|\n(4bu'3.4\h'|\n(4cu'2.8\h'|\n(4du'0.4
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Page-sized sequential 1MByte read\h'|\n(4bu'4.8\h'|\n(4cu'2.2\h'|\n(4du'0.4
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Page-sized random 1MByte read\h'|\n(4bu'5.5\h'|\n(4cu'2.4\h'|\n(4du'0.8
.sp 0.5v
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Single 1MByte write\h'|\n(4bu'4.6\h'|\n(4cu'2.0\h'|\n(4du'1.4
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Page-sized sequential 1MByte write\h'|\n(4bu'5.6\h'|\n(4cu'1.7\h'|\n(4du'1.4
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Page-sized random 1MByte write\h'|\n(4bu'6.0\h'|\n(4cu'1.7\h'|\n(4du'2.9
.sp 0.5v
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Read single byte\h'|\n(4bu'0.02\h'|\n(4cu'0.01\h'|\n(4du'0.01
.ta \n(4ku \n(4gu \n(4lu \n(4hu \n(4mu \n(4iu \n(4nu 
.nr 3f 1m
.nr 3b \n(.f
\&\h'|\n(4au'Write single byte\h'|\n(4bu'0.03\h'|\n(4cu'0.02\h'|\n(4du'0.02
.fc
.nr T. 1
.T# 1
.in \n(#Iu
.3f
.TE
.if \n-(b.=0 .nr c. \n(.c-\n(d.-20
.sp
.ce
Table 2: Elapsed time in seconds for benchmark tests in three configurations
.hl
.in -0.5i
.)z
The elapsed time for each test is reported in seconds.
The measurements shown are the means of ten runs.
In all cases,
the standard deviation was negligible.
.pp
As Table 2 shows,
the single-process Inversion benchmark is faster than either
of the network benchmarks in virtually all categories.
The important exception is in random write time,
for which Ultrix NFS using PRESTOserve is fastest,
since no disk seeks are required.
Note,
however,
that the single-process implementation of Inversion
is faster than Ultrix with PRESTOserve for sequential transfers.
File creation is slower in both Inversion benchmarks,
due to the overhead of creating the Btree index of blocks.
.pp
The important comparison is between Inversion running on two machines
and Inversion running in a single process.
For 1MByte operations,
remote access adds between three and five seconds to the
elapsed time of each test.
It is clear that the client/server communication protocol
used by the file system is much too heavy-weight,
and should be optimized.
Given this optimization,
it is reasonable to expect performance within fifty percent of
Ultrix NFS and PRESTOserve from Inversion.
.sh 1 "CONCLUSIONS"
.pp
The research presented here makes contributions in three areas.
These contributions are
.BU
introducing a storage manager switch
abstraction into a database system,
making it easy to add new device types to a data manager;
.BU
the design and implementation of a device manager for
a Sony write-once optical disk jukebox;
and
.BU
the design and implementation of a novel filesystem,
Inversion,
which provides important semantic guarantees.
.pp
As tertiary storage devices are more widely deployed,
the importance of the storage manager switch abstraction will increase.
Current commercial relational databases include hard-coded
dependencies on magnetic disk.
Experience with \*(PG indicates that these dependencies
can be subtle and hard to identify.
Eliminating these dependencies is critical,
since tertiary storage has dramatically different characteristics
from magnetic disks.
.pp
The issues raised in designing the Sony jukebox device manager
include cache management policies,
allocation strategies,
and transfer size optimizations.
The Sony jukebox device manager achieves aggressive sharing
and supports fast recovery
by implementing a persistent cache.
The expense of setting up transfers from optical platters
is amortized by doing extent-based allocation,
and making extent sizes sufficiently large.
The cache size,
extent size,
and transfer size are all configurable.
.pp
The Inversion file system provides transaction protection,
fine-grained time travel,
fast recovery,
device independence,
strong typing,
and support for
.i "ad hoc"
queries on file system metadata and data.
As shown in the performance evaluation section,
Inversion gets between thirty and eighty percent of the throughput of
an optimized version of Ultrix NFS, which uses special-purpose
hardware to accelerate file system accesses.
Tuning should improve Inversion's performance further.
.sh 1 "FUTURE WORK"
.pp
The work presented here raises several new research problems.
This section lists some of the ways in which the storage manager,
the Sony jukebox device manager,
the Inversion file system,
or \*(PG as a whole could be improved.
.sh 2 "Better Integration of the Storage Manager into POSTGRES"
.pp
There are several ways in which the integration of the new
storage manager into \*(PG could be improved.
Most important of these is making relations truly location-transparent.
Doing so requires hiding the fact that some devices are actually
write-once.
As mentioned above,
this requires writing software to support logical
overwrite-in-place on relation data blocks.
Overwrite should work on all device managers.
.pp
Another area for future work is improving the cost estimation
functions for tertiary storage.
This may require more extensive changes to the query optimizer,
so that device managers can,
for example,
estimate locality of particular access paths.
The current cost estimation function for the Sony jukebox device manager
provides only a coarse guide in choosing an optimal execution plan.
One way to improve the ability of device managers to correctly
estimate plan cost would be to allow them to collect device-specific
statistics on the relations that they store.
.pp
There are also interesting research issues
in designing new access paths to relations on tertiary storage.
For example,
it may be desirable to invent new join strategies or indexing structures
that are optimized for high-latency,
low-bandwidth devices.
Such strategies would make tertiary storage much more attractive.
.pp
Finally,
when \*(PG supports clustering of relation data,
some design decisions in (for example) the Sony jukebox device manager
will need to be rethought.
If blocks at arbitrary locations in relations may be overwritten,
then the current strategy of keeping the highest-numbered block
on magnetic disk will not work.
.sh 2 "Tuning the Sony Jukebox Device Manager"
.pp
More experiments should be done on the Sony jukebox device manager,
to quantify the costs of cache maintenance,
transfer setup,
and extent-based allocation.
The behavior of the device manager could certainly be tuned
with some analysis of its current behavior.
In addition,
cache replacement policies other than LRU should be investigated.
.pp
Another strategy would be to reimplement the caching code completely.
The current implementation assumes a cache size on the order of tens
of megabytes.
If this were scaled to,
say,
a gigabyte,
different cache layout strategies would make sense.
For example,
extents could be laid out using a hash function,
rather than contiguously on disk.
.sh 2 "Work on the Inversion File System"
.pp
The Inversion file system,
similarly,
should be profiled and tuned to improve its performance.
It may make sense to use other indices,
or other indexed access methods altogether,
on the blocks and metadata of files stored in Inversion.
The protocol used for network file access should be
redesigned to speed up transfers.
.pp
One issue that should be addressed is the programmatic
interface to Inversion files.
Since Inversion provides a strict superset of the services
provided by conventional \*(UX systems,
the interface to Inversion will be different from the
interface to the native file system.
For example,
time travel requires the specification of timestamps
when opening a file.
This interface is not yet clearly defined.
.pp
Two other important issues are management of a file system cache
and the policy used to lay out file data blocks on storage devices.
Unfortunately,
both of these require violating the storage manager abstraction.
It would be interesting to see if a clean interface
could be devised that would continue to support device extensibility,
and still allow Inversion to handle layout and caching.
.sh 2 "Addition of New Device Managers"
.pp
Finally,
new device managers should be added to \*(PG.
Hardware that will soon become available to the project
includes
a Metrum videocassette jukebox,
an Exabyte 8500 8mm helical scan jukebox,
and a Hewlitt-Packard magneto-optical disk jukebox.
New device managers should be written for these.
.pp
Currently,
a version of \*(PG exists that optimizes and executes
parallel query plans [HONG92].
Parallel \*(PG uses striping,
or horizontal partitioning of relations,
in order to achieve data parallelism.
A striping device manager for magnetic disk would be easy to write,
and would eliminate a good deal of special-case code
in the parallel version of the system.
.sh 1 "ACKNOWLEDGEMENTS"
.pp
Joey Hellerstein and Wei Hong
reviewed early versions of this thesis
and suggested many improvements.
Margo Seltzer and Mark Sullivan,
Sin City's senior citizens,
offered useful advice
on the design and implementation of the storage manager
switch and the Inversion file system.
Dr. Randy Katz and Dr. Ray Larson
served as members of my thesis committee.
Dr. Michael Stonebraker's guidance and encouragement
have been invaluable throughout the project.
Finally,
I want especially to thank my wife Teresa
and my son Matthew for their patience and support over the
course of an apparently endless academic career.
.sh 1 "REFERENCES"
.ls 1
.nr ps 1v
.nr ii 0.9i
.ip [BERN88]
Bernstein, P., \fIet al.\fP, ``Future Directions in DBMS Research'',
\fIThe Laguna Beach Report\fP, Laguna Beach, CA, February 1988.
.ip [CABR88]
Cabrera, L., and Wyllie, J., ``QuickSilver Distributed File Services:
An Architecture for Horizontal Growth'', \fIProc. 2nd IEEE Conference
on Computer Workstations\fP, March 1988.
.ip [CHUT92]
Chutani, S., \fIet al.\fP, ``The Episode File System'', \fIProc. USENIX
Winter 1992 Technical Conference\fP, San Francisco, CA, January 1992.
.ip [FONG86]
Fong, Z., ``The Design and Implementation of the \*(PG Query Optimizer'',
M.Sc. Report, University of California at Berkeley,
Berkeley, CA, Aug. 1986.
.ip [HAAS90]
Haas, L., \fIet al.\fP, ``Starburst Midflight: As the Dust Clears'',
\fIIEEE Transactions on Knowledge and Data Engineering\fP, March 1990.
.ip [HONG92]
Hong, W., ``Exploiting Inter-Operation Parallelism in XPRS'', \fIProc.
1992 ACM-SIGMOD International Conference on Management of Data\fP,
San Diego, CA, June, 1992 (to appear).
.ip [KATZ91]
Katz, R., \fIet al.\fP, ``Robo-line Storage: Low Latency, High Capacity
Storage Systems Over Geographically Distributed Networks,''
Sequoia 2000 Technical Report 91/3, UC Berkeley, October 1991.
.ip [KIM84]
Kim, W., ``Highly Available Systems for Database Applications'', \fIACM
Computing Surveys\fP 16(1), 1984.
.ip [LEFF84]
Leffler, S., \fIet al.\fP, ``An Advanced 4.3BSD Interprocess Communication
Tutorial'', \fI\s-1UNIX\s0 Programmer's Manual Supplementary Documents\fP
vol. 1, Berkeley, CA, 1984.
.ip [LEFF89]
Leffler, S., \fIet al.\fP, \fIThe Design and Implementation of the
4.3 BSD Operating System\fP, Prentice-Hall, 1989.
.ip [MCFA91]
McFadden, A., ``C Library Interface for the Sony Optical Disk Changer'',
CS199 Project Report, UC Berkeley, May 1991.
.ip [MOSH92]
Mosher, C., \fIed.\fP, ``The \*(PG Reference Manual, Version 4'',
UCB Technical Report M92/14, Electronics Research Laboratory, University
of California at Berkeley, Berkeley, CA, March 1992.
.ip [PIKE90]
Pike, R., \fIet al.\fP, ``Plan 9 From Bell Labs'', \fIProc. Summer 1990
UKUUG Conference\fP, London, England, July 1990.
.ip [POTA91]
Potamianos, J., ``Semantics and Performance of Integrated DBMS Rule
Systems'', Ph.D. Dissertation, University of California at Berkeley,
Berkeley, CA, January 1992.
.ip [RASH88]
Rashid, R., \fIet al.\fP, ``Machine-Independent Virtual Memory Management
for Paged Uniprocessor and Multiprocessor Architectures'', \fIIEEE
Transactions on Computers\fP Vol. 37 No. 8, August 1988.
.ip [ROOM92]
Roome, W.D., ``3DFS: A Time-Oriented File Server'', \fIProc. USENIX Winter
1992 Technical Conference\fP, San Francisco, CA, January 1992.
.ip [ROSE91]
Rosenblum, M. and Ousterhout, J., ``The Design and Implementation of
a Log-Structured File System'', \fIProc. 13th Symposium on Operating
Systems Principles\fP, 1991.
.ip [SAND85]
Sandberg, R., \fIet al.\fP, ``Design and Implementation of the Sun
Network File System'', \fIProc. Usenix Summer 1985 Technical Conference\fP,
June 1985.
.ip [SECH91]
Sechrest, S., ``Attribute-Based Naming of Files'', University of Michigan
Technical Report CSE-TR-78-91, January 1991.
.ip [SELI79]
Selinger, P., \fIet al.\fP, ``Access Path Selection in a Relational
Database Management System'', \fIProc. 1979 ACM-SIGMOD Conference on
Management of Data\fP, Boston, MA, June 1979.
.ip [SELT90]
Seltzer, M., and Stonebraker, M., ``Transaction Support in Read Optimized
and Write Optimized File Systems,'' \fIProc. 16th International Conference
on Very Large Data Bases\fR, Brisbane, Australia, August 1990.
.ip [SKEE81]
Skeen, D., ``Nonblocking Commit Protocols'', \fIProc. 1981 ACM SIGMOD
Conference\fP, 1981.
.ip [SONY89a]
Sony Corporation, ``Writable Disk Auto Changer WDA-610 Specifications
and Operating Instructions'', 3-751-106-21 (1), Japan, 1989.
.ip [SONY89b]
Sony Corporation, ``Writable Disk Drive WDD-600 and Writable Disk WDM-6DL0
Operating Instructions'', 3-751-047-21 (1), Japan, 1989.
.ip [STON87]
Stonebraker, M., ``The \*(PG Storage System'', \fIProc. 1987 VLDB
Conference\fP, Brighton, England, Sept. 1987.
.ip [STON90]
Sonebraker, M., \fIet al.\fP, ``The Implementation of \*(PG'',
\fIIEEE Transactions on Knowledge and Data Engineering\fP, March 1990.
.ip [STON91]
Stonebraker, M., and Dozier, J., ``Sequoia 2000:  Large Capacity
Object Servers to Support Global Change Research,''
Sequoia 2000 Technical Report 91/1, UC Berkeley, July 1991.
.ip [STON92]
Stonebraker, M., and Olson, M., ``Large Object Support in \*(PG'',
submitted to \fIProc. 1992 VLDB Conference\fP, Vancouver, BC, Canada.
.ip [SULL92]
Sullivan, M.  and Olson, M., ``An Index Implementation Supporting Fast
Recovery for the \*(PG Storage System'', \fIProc. 8th Annual International
Conference on Data Engineering\fP, Tempe, AZ, February 1992.
.ip [WALK83]
Walker, B., \fIet al.\fP, ``The LOCUS Distributed Operating System'',
\fIOperating Systems Review\fP v. 17 no. 5, November 1983.
.nr % 0
.af % i
.he ''''
.bp
.rs
.sp |2i
.ce 99
.ps 14
.ft B
Extending the \*(PG Database System
.sp 0.2v
to Manage Tertiary Storage
.ps 12
.he '''\fR%\fP'
.sp 2
.ft R
Michael Allen Olson
.sp 0.2v
Department of Electrical Engineering and Computer Science
University of California at Berkeley
.sp
Master's Thesis
.sp 3.5i
Version \*(VN (\*(VD)
\*(VT
Printed \*(td
.ce 0
.bp
.rs
.sp |4i
.ps 11
.ce 99
Extending the \*(PG Database System to Manage Tertiary Storage
.sp
Copyright \(co 1992
.sp
by
.sp
Michael Allen Olson
.bp
.ce
.ps 12
Table of Contents
.ps
.sp 2
.nr pi 0
.xp
@
