.\" This is -*-nroff-*-
.\" XXX standard disclaimer belongs here....
.\" $Header: /home2/aoki/master/src/ref/RCS/large_objects.3pqsrc,v 1.12 1993/08/23 09:03:16 aoki Exp $
.TH INTRODUCTION "LARGE OBJECTS" 01/23/93
.XA 0 "Section 7 \(em Large Objects"
.BH "SECTION 7 \(em LARGE OBJECTS"
.SH DESCRIPTION
In \*(PG, data values are stored in tuples, and individual tuples
cannot span multiple data pages.  Since the size of a data page is
8192 bytes, the upper limit on the size of a data value is relatively
low.  To support the storage of larger atomic values, \*(PG provides a
.IR "large object"
interface.  This interface provides file-oriented access to user data
that has been explicitly declared to be a large type.
.PP
Version \*(PV of \*(PG supports two different implementations of large
objects.  These two implementations allow users to trade off speed of
access against transaction protection and crash recovery on large
object data.  Applications that can tolerate lost data may store
object data in conventional files that are fast to access, but cannot
be recovered in the case of system crashes.  For applications that
require stricter guarantees of durability, a transaction-protected
large object implementation is available.  This section describes the
two implementations and the programmatic and query language interfaces
to large object data.
.PP
Unlike the BLOB support provided by most commercial relational
database management systems, \*(PG allows users to define specific
large object types.  \*(PG large objects are first-class objects in
the database, and any operation that can be applied to a conventional
(small) abstract data type (ADT) may also be applied to a large one.
For example, two different large object types, such as
.IR image
and
.IR voice ,
may be created.  Functions that operate on image data, and other
functions that operate on voice data, may be declared to the database
system.  The data manager will distinguish between image and voice
data automatically, and will allow users to invoke the appropriate
functions on values of each of these types.  In addition, indices may
be created on large data values, or on functions of them.  Finally,
operators may be defined that operate on large values.  Users may
invoke these functions and operators from the query language.  The
database system will enforce type restrictions on large object data
values.
.PP
The \*(PG large object interface is modeled after the Unix file system
interface, with analogs of 
.IR open (2),
.IR read (2),
.IR write (2),
.IR lseek (2),
etc.  User functions call these routines to retrieve only the data of
interest from a large object.  For example, if a large object type
called
.IR mugshot
existed that stored photographs of faces, then a function called
.IR beard
could be declared on
.IR mugshot
data.
.IR Beard
could look at the lower third of a photograph, and determine the color
of the beard that appeared there, if any.  The entire large object
value need not be buffered, or even examined, by the
.IR beard
function.  As mentioned above, \*(PG supports functional indices on
large object data.  In this example, the results of the
.IR beard
function could be stored in a B-tree index to provide fast searches
for people with red beards.
.SH "\*(UU FILES AS LARGE OBJECT ADTS"
The simplest large object interface supplied with \*(PG is also the
least robust.  It does not support transaction protection, crash
recovery, or time travel.  On the other hand, it can be used on
existing data files (such as word-processor files) that must be
accessed simultaneously by the database system and existing
application programs.
.PP
This implementation stores large object data in a \*(UU file, and
stores only the file name in the database.  Importing a large object
into the database is as simple as storing the file name in a
distinguished \*(lqlarge object name\*(rq relation.  Interface
routines allow the database system to open, seek, read, write, and
close these \*(UU files by an internal large object identifier.
.PP
The functions
.IR lo_filein
and
.IR lo_fileout
convert between \*(UU filenames and internal large object identifiers.
These functions are \*(PG registered functions, meaning they can be
used directly in Postquel queries as well as from dynamically loaded C
functions.  If you are defining a simple large object ADT, these
functions can be used as your \*(lqinput\*(rq and \*(lqoutput\*(rq
functions (see
.IR "define type" (commands)
and the \*(PG Manual sections concerning user-defined types for
details).
.PP
The routine
.(C
char *lo_filein(filename)
	char *filename;
.)C
imports a new \*(UU file storing large object data into the database
system.  This routine stores the filename in a large object naming
relation and assigns it a unique large object identifier.
.PP
The converse routine,
.(C
char *lo_fileout(object)
	LargeObject *object;
.)C
returns the \*(UU filename associated with a large object.
.PP
The file storing the large object must be accessible on the machine on
which \*(PG is running.  The data is not copied into the database
system, so if the file is later removed, its contents cannot be
recovered.
.PP
Large objects are accessible from both the \*(PG backend, using
dynamically-loaded functions, and from the front-end, using the \*(LI
interface.  These interfaces will be described in detail below.
.SH "INVERSION LARGE OBJECTS"
In contrast to \*(UU files as large objects, the Inversion large
object implementation guarantees transaction protection, crash
recovery, and time travel on user large object data.  This
implementation breaks large objects up into \*(lqchunks\*(rq and
stores the chunks in tuples in the database.  A B-tree index
guarantees fast searches for the correct chunk number when doing
random access reads and writes.
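.PP
The chunk lookup can be pictured with a small sketch.  It assumes
fixed-size chunks; the chunk size used here and all of the names are
illustrative assumptions, not the values or routines used by the
Inversion implementation.
.(C
```c
/* ASSUMPTION: illustrative chunk size only; the real value is
 * internal to the Inversion implementation. */
#define SKETCH_CHUNK_SIZE 8000L

struct chunk_pos {
    long chunk_no;   /* which chunk holds the byte */
    long chunk_off;  /* byte offset inside that chunk */
};

/* Map a byte offset within a large object to a chunk position. */
struct chunk_pos
sketch_locate(long offset)
{
    struct chunk_pos pos;

    pos.chunk_no = offset / SKETCH_CHUNK_SIZE;
    pos.chunk_off = offset % SKETCH_CHUNK_SIZE;
    return pos;
}
```
.)C
Under these assumptions, a seek to byte 100000 lands in chunk 12 at
offset 4000, and the index only has to find that one chunk.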
.PP
If a transaction that has made changes to an Inversion large object
subsequently aborts, the changes are backed out in the normal way.
Inversion large objects are stored in the database, and so are not
directly accessible to other programs.  Only programs that use the
\*(PG data manager can read and write Inversion large objects.
.PP
To use Inversion large objects, a new large object should be created
using the
.IR LOcreat ()
interface, defined below.  Afterwards, the name of the large object
can be stored in an ordinary tuple.
.PP
The next section describes the programmatic interface to both \*(UU
and Inversion large objects.
.XA 1 "Backend Interface to Large Objects"
.SH "BACKEND INTERFACE TO LARGE OBJECTS"
Large object data is accessible from front-end programs linked with
the \*(LI library, and from dynamically-loaded routines that execute
in the \*(PG backend.  This section describes access from dynamically
loaded C functions.
.SH "Creating New Large Objects"
The routine
.(C
int LOcreat(path, mode, objtype)
    char *path;
    int mode;
    int objtype;
.)C
creates a new large object.
.PP
The pathname is a slash-separated list of components, and must be a
unique pathname in the \*(PG large object namespace.  There is a
virtual root directory (\*(lq/\*(rq) in which objects may be placed.
.PP
The
.IR objtype
parameter can be one of
.IR Inversion
or
.IR Unix ,
which are symbolic constants defined in
.(C
\&.../include/catalog/pg_lobj.h
.)C
The interpretation of the
.IR mode
argument depends on the
.IR objtype
selected.
.PP
For \*(UU files,
.IR mode
is the mode used to protect the file on the \*(UU file system.  On
creation, the file is open for reading and writing.
.PP
For Inversion large objects,
.IR mode
is a bitmask describing several different attributes of the new
object.  The symbolic constants listed here are defined in
.(C
\&.../include/tmp/libpq-fs.h
.)C
The access type (read, write, or both) is controlled by
.SM OR\c
\&'ing together the bits
.SM INV_READ
and
.SM INV_WRITE\c
\&.  If the large object should be archived \(em that is, if
historical versions of it should be moved periodically to a special
archive relation \(em then the
.SM INV_ARCHIVE
bit should be set.  The low-order sixteen bits of
.IR mode
are the storage manager number on which the large object should
reside.  In the distributed version of \*(PG, only the magnetic disk
storage manager is supported.  For users running \*(PG at UC Berkeley,
additional storage managers are available.  For sites other than
Berkeley, these bits should always be zero.  At Berkeley, storage
manager zero is magnetic disk, storage manager one is a Sony optical
disk jukebox, and storage manager two is main memory.
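.PP
The layout of the Inversion mode word can be sketched as follows.
The bit values and the helper names below are placeholders chosen for
illustration; real programs should use the constants from the header
file named above.
.(C
```c
/* ASSUMPTION: placeholder bit values for illustration only.
 * Real programs must use INV_READ, INV_WRITE and INV_ARCHIVE
 * from libpq-fs.h. */
#define SK_INV_ARCHIVE 0x10000
#define SK_INV_WRITE   0x20000
#define SK_INV_READ    0x40000
#define SK_SMGR_MASK   0x0ffff   /* low-order sixteen bits */

/* Build an Inversion mode word from access flags and a
 * storage manager number (zero outside Berkeley). */
int
sk_inv_mode(int flags, int smgr)
{
    return (flags & ~SK_SMGR_MASK) | (smgr & SK_SMGR_MASK);
}

/* Recover the storage manager number from a mode word. */
int
sk_inv_smgr(int mode)
{
    return mode & SK_SMGR_MASK;
}
```
.)C
Because storage manager zero is magnetic disk, OR'ing the access bits
together with nothing else selects magnetic disk.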
.PP
The commands below create large objects of the two types and leave them
open for reading and writing.  The Inversion large object is not
archived, and is located on magnetic disk:
.(C
unix_fd = LOcreat("/my_unix_obj", 0600, Unix);

inv_fd = LOcreat("/my_inv_obj",
                 INV_READ|INV_WRITE, Inversion);
.)C
.SH "Opening Large Objects"
Existing large objects may be opened for reading or writing by calling
the routine
.(C
int LOopen(path, mode)
    char *path;
    int mode;
.)C
The
.IR path
argument specifies the large object's pathname, and is the same as the
pathname used to create the object.  The
.IR mode
argument is interpreted differently by the two implementations.  For
\*(UU large objects, values should be chosen from the set of mode bits
passed to the
.IR open
system call; that is,
.SM O_CREAT\c
,
.SM O_RDONLY\c
,
.SM O_WRONLY\c
,
.SM O_RDWR\c
,
and
.SM O_TRUNC\c
\&.  For Inversion large objects, only the bits
.SM INV_READ
and
.SM INV_WRITE
have any meaning.
.PP
To open the two large objects created in the last example, a
programmer would issue the commands
.(C
unix_fd = LOopen("/my_unix_obj", O_RDWR);

inv_fd = LOopen("/my_inv_obj", INV_READ|INV_WRITE);
.)C
If a large object is opened before it has been created, then a new
large object is created using the \*(UU implementation, and the new
object is opened.
.SH "Seeking on Large Objects"
The routine
.(C
int
LOlseek(fd, offset, whence)
    int fd;
    int offset;
    int whence;
.)C
moves the current location pointer for a large object to the specified
position.  The
.IR fd
parameter is the file descriptor returned by either
.IR LOcreat
or
.IR LOopen .
.IR Offset
is the byte offset in the large object to which to seek.  The only
legal value for
.IR whence
in the current release of the system is
.SM L_SET\c
, as defined in
.IR "<sys/file.h>" .
.PP
\*(UU large objects allow holes to exist in objects; that is, a
program may seek well past the end of the object and write bytes.
Intervening blocks will not be created; reading them will return
zero-filled blocks.  Inversion large objects do not support holes.
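.PP
The hole behavior of \*(UU large objects comes from the underlying
file system, and can be observed with the ordinary system calls.  The
sketch below uses plain POSIX calls on a temporary file rather than
the large object routines; the function name and the temporary path
are assumptions made for illustration.
.(C
```c
#define _XOPEN_SOURCE 700   /* for mkstemp under strict modes */
#include <stdlib.h>
#include <unistd.h>

/*
 * Write one byte at offset `where` of a fresh temporary file,
 * then read the byte at offset `probe` (which must be < where).
 * Returns the byte read from the hole, or -1 on any failure.
 */
int
sk_read_hole(long where, long probe)
{
    char path[] = "/tmp/sk_holeXXXXXX";
    char c = 'x';
    int fd, result = -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    (void) unlink(path);            /* file vanishes on close */

    /* SEEK_SET is the POSIX name for L_SET */
    if (lseek(fd, (off_t) where, SEEK_SET) == (off_t) where
        && write(fd, &c, 1) == 1
        && lseek(fd, (off_t) probe, SEEK_SET) == (off_t) probe
        && read(fd, &c, 1) == 1)
        result = (unsigned char) c;

    (void) close(fd);
    return result;
}
```
.)C
Writing at byte 100000 and then reading byte 50000 yields a zero
byte, even though nothing was ever written there.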
.PP
The following code seeks to byte location 100000 of the example large
objects:
.(C
unix_status = LOlseek(unix_fd, 100000, L_SET);

inv_status = LOlseek(inv_fd, 100000, L_SET);
.)C
On error,
.IR LOlseek
returns a value less than zero.  On success, the new offset is
returned.
.SH "Writing to Large Objects"
Once a large object has been created, it may be filled by calling
.(C
int
LOwrite(fd, wbuf)
    int fd;
    struct varlena *wbuf;
.)C
Here,
.IR fd
is the file descriptor returned by
.IR LOcreat
or
.IR LOopen ,
and
.IR wbuf
describes the data to write.  The
.IR "varlena"
structure in \*(PG consists of four bytes in which the length of the
datum is stored, followed by the data itself.  The four length bytes
include themselves.
.PP
For example, to write 1024 bytes of zeroes to the sample large
objects:
.(C
struct varlena *vl;

vl = (struct varlena *) palloc(1028);
VARSIZE(vl) = 1028;
bzero(VARDATA(vl), 1024);

nwrite_unix = LOwrite(unix_fd, vl);

nwrite_inv = LOwrite(inv_fd, vl);
.)C
.IR LOwrite
returns the number of bytes actually written, or a negative number on
error.  For Inversion large objects, the entire write is guaranteed to
succeed or fail.  That is, if the number of bytes written is
non-negative, then it equals 
.IR VARSIZE (vl).
.PP
The 
.IR VARSIZE ()
and
.IR VARDATA ()
macros are declared in the file
.(C
\&.../include/tmp/postgres.h
.)C
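.PP
The length arithmetic implied by the varlena layout can be tried
outside the backend with a stand-in structure.  Everything prefixed
with sk_ or SK_ below is an assumption made for this sketch, and
malloc takes the place of palloc; real backend code uses struct
varlena, VARSIZE and VARDATA.
.(C
```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* ASSUMPTION: stand-in for the varlena layout described above. */
struct sk_varlena {
    int32_t vl_len;   /* total size, including these four bytes */
    char    vl_dat[]; /* data follows the length word */
};
#define SK_VARHDRSZ    ((int) sizeof(int32_t))
#define SK_VARSIZE(p)  ((p)->vl_len)
#define SK_VARDATA(p)  ((p)->vl_dat)

/* Wrap nbytes of data in a freshly allocated varlena
 * (malloc stands in for palloc). */
struct sk_varlena *
sk_make_varlena(const char *data, int nbytes)
{
    struct sk_varlena *vl;

    vl = malloc(SK_VARHDRSZ + nbytes);
    if (vl == NULL)
        return NULL;
    SK_VARSIZE(vl) = SK_VARHDRSZ + nbytes;
    memcpy(SK_VARDATA(vl), data, nbytes);
    return vl;
}

/* Number of data bytes carried, excluding the length word. */
int
sk_data_len(const struct sk_varlena *vl)
{
    return SK_VARSIZE(vl) - SK_VARHDRSZ;
}
```
.)C
Three bytes of data therefore give a total size of seven, which is
why the earlier example allocates 1028 bytes to hold 1024 bytes of
zeroes.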
.SH "Reading from Large Objects"
Data may be read from large objects by calling the routine
.(C
struct varlena *
LOread(fd, len)
    int fd;
    int len;
.)C
This routine returns the byte count actually read and the data in a
varlena structure.  For example,
.(C
struct varlena *unix_vl, *inv_vl;
int nread_ux, nread_inv;
char *data_ux, *data_inv;

unix_vl = LOread(unix_fd, 100);
/* VARSIZE counts its own four length bytes */
nread_ux = VARSIZE(unix_vl) - 4;
data_ux = VARDATA(unix_vl);

inv_vl = LOread(inv_fd, 100);
nread_inv = VARSIZE(inv_vl) - 4;
data_inv = VARDATA(inv_vl);
.)C
The returned varlena structures have been allocated by the \*(PG
memory manager
.IR palloc ,
and may be
.IR pfree d
when they are no longer needed.
.SH "Closing a Large Object"
Once a large object is no longer needed, it may be closed by calling
.(C
int
LOclose(fd)
    int fd;
.)C
where
.IR fd
is the file descriptor returned by
.IR LOopen
or
.IR LOcreat .
On success,
.IR LOclose
returns zero.  A negative return value indicates an error.
.PP
For example,
.(C
if (LOclose(unix_fd) < 0) {
    /* error */
}

if (LOclose(inv_fd) < 0) {
    /* error */
}
.)C
.XA 1 "LIBPQ Interface to Large Objects"
.SH "LIBPQ LARGE OBJECT INTERFACE"
Large objects may also be accessed from database client programs that
link the \*(LI library.  This library provides a set of routines that
support opening, reading, writing, closing, and seeking on large
objects.  The interface is similar to that provided via the backend,
but rather than using varlena structures, a more conventional
\*(UU-style buffer scheme is used.
.PP
In version \*(PV of \*(PG, large object operations must be enclosed in
a transaction block.  This is true even for \*(UU large objects, which
are not transaction-protected.  This is due to a shortcoming in the
memory management scheme for large objects, and will be rectified in
the future.  The end of this section shows a short example program
that correctly transaction-protects its file system operations.
.PP
This section describes the \*(LI interface in detail.
.SH "Creating a Large Object"
The routine
.(C
int
p_creat(path, mode, objtype)
    char *path;
    int mode;
    int objtype;
.)C
creates a new large object.  The
.IR path
argument specifies a large-object system pathname.
.PP
The
.IR objtype
parameter can be one of
.IR Inversion
or
.IR Unix ,
which are symbolic constants defined in
.(C
\&.../include/catalog/pg_lobj.h
.)C
The interpretation of the
.IR mode
argument depends on the
.IR objtype
selected.
.PP
For \*(UU files,
.IR mode
is the mode used to protect the file on the \*(UU file system.  On
creation, the file is open for reading and writing.
.PP
For Inversion large objects,
.IR mode
is a bitmask describing several different attributes of the new
object.  The symbolic constants listed here are defined in
.(C
\&.../include/tmp/libpq-fs.h
.)C
The access type (read, write, or both) is controlled by
.SM OR\c
\&'ing together the bits
.SM INV_READ 
and
.SM INV_WRITE\c
\&.  If the large object should be archived \(em that is, if
historical versions of it should be moved periodically to a special
archive relation \(em then the
.SM INV_ARCHIVE
bit should be set.  The low-order sixteen bits of
.IR mode
are the storage manager number on which the large object should
reside.  For sites other than Berkeley, these bits should always be
zero.  At Berkeley, storage manager zero is magnetic disk, storage
manager one is a Sony optical disk jukebox, and storage manager two is
main memory.
.PP
The commands below create large objects of the two types and leave them
open for reading and writing.  The Inversion large object is not
archived, and is located on magnetic disk:
.(C
unix_fd = p_creat("/my_unix_obj", 0600, Unix);
.sp 0.5v
inv_fd = p_creat("/my_inv_obj",
                 INV_READ|INV_WRITE, Inversion);
.)C
.SH "Opening an Existing Large Object"
To open an existing large object, call
.(C
int
p_open(path, mode)
    char *path;
    int mode;
.)C
The
.IR path
argument specifies the large object pathname for the object to open.
The mode bits control whether the object is opened for reading,
writing, or both.  For \*(UU large objects, the appropriate flags are
.SM O_CREAT\c
,
.SM O_RDONLY\c
,
.SM O_WRONLY\c
,
.SM O_RDWR\c
,
and
.SM O_TRUNC\c
\&.  For Inversion large objects, only
.SM INV_READ 
and
.SM INV_WRITE
are recognized.
.PP
If a large object is opened before it is created, it is created by
default using the \*(UU file implementation.
.SH "Writing Data to a Large Object"
The routine
.(C
int
p_write(fd, buf, len)
    int fd;
    char *buf;
    int len;
.)C
writes
.IR len
bytes from
.IR buf
to large object
.IR fd .
The
.IR fd
argument must have been returned by a previous
.IR p_creat
or
.IR p_open .
.PP
The number of bytes actually written is returned.
In the event of an error,
the return value is negative.
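.PP
Since the return value is the number of bytes actually written, a
cautious caller may want to loop until the whole buffer has been
accepted.  This section does not say whether short writes ever occur,
so the loop below is a defensive sketch only.  The names sk_write_all
and sk_short_writer are assumptions; the latter is a stand-in with the
same argument list as p_write so that the loop can be exercised
without a database.
.(C
```c
/* Signature shared by p_write and the stand-in below. */
typedef int (*sk_write_fn)(int fd, char *buf, int len);

/*
 * Call `wr` until all len bytes are written or an error occurs.
 * Returns len on success, or a negative value on error.
 */
int
sk_write_all(sk_write_fn wr, int fd, char *buf, int len)
{
    int done = 0;

    while (done < len) {
        int n = wr(fd, buf + done, len - done);

        if (n < 0)
            return n;        /* error reported by writer */
        if (n == 0)
            return -1;       /* no progress; give up */
        done += n;
    }
    return done;
}

/* Stand-in writer for demonstration: accepts at most 3 bytes
 * per call and counts the total in sk_total. */
int sk_total;

int
sk_short_writer(int fd, char *buf, int len)
{
    int n = len < 3 ? len : 3;

    (void) fd;
    (void) buf;
    sk_total += n;
    return n;
}
```
.)C
In a real client, p_write would be passed in place of
sk_short_writer.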
.SH "Reading Data from a Large Object"
The routine
.(C
int
p_read(fd, buf, nbytes)
    int fd;
    char *buf;
    int nbytes;
.)C
reads up to
.IR nbytes
bytes into buffer
.IR buf
from the large object descriptor
.IR fd .
The number of bytes actually read is returned.
In the event of an error,
the return value is less than zero.
.SH "Seeking on a Large Object"
To change the current read or write location on a large object,
call
.(C
int
p_lseek(fd, offset, whence)
    int fd;
    int offset;
    int whence;
.)C
This routine moves the current location pointer for the large object
described by
.IR fd
to the new location specified by
.IR offset .
For this release of \*(PG, only
.SM L_SET
is a legal value for
.IR whence .
.SH "Closing a Large Object"
A large object may be closed by calling
.(C
int
p_close(fd)
    int fd;
.)C
where
.IR fd
is a large object descriptor returned by
.IR p_creat
or
.IR p_open .
On success,
.IR p_close
returns zero.  On error, the return value is negative.
.XA 1 "Sample Large Object Programs"
.SH "SAMPLE LARGE OBJECT PROGRAMS"
The \*(PG large object implementation serves as the basis for a file
system (the \*(lqInversion file system\*(rq) built on top of the data
manager.  This file system provides time travel, transaction
protection, and fast crash recovery to clients of ordinary file system
services.  It uses the Inversion large object implementation to
provide these services.
.PP
The programs that comprise the Inversion file system are included in
the \*(PG source distribution, in the directory
.(C
\&.../src/bin/fsutils
.)C
This directory contains a set of programs for manipulating files and
directories.  These programs are based on the Berkeley Software
Distribution NET-2 release.
.PP
These programs are useful in manipulating Inversion files, but they
also serve as examples of how to code large object accesses in \*(LI.
All of the programs are \*(LI clients, and all use the interfaces that
have been described in this section.
.PP
Interested readers should refer to the files in the directory
.(C
\&.../src/bin/fsutils
.)C
for in-depth examples of the use of large objects.  Below, a more
terse example is provided.  This code fragment creates a new large
object managed by Inversion, fills it with data from a \*(UU file, and
closes it.
.bp
.(C M
#include <fcntl.h>	/* O_RDONLY */
#include "tmp/c.h"
#include "tmp/libpq-fe.h"
#include "tmp/libpq-fs.h"
#include "catalog/pg_lobj.h"

#define	MYBUFSIZ	1024

main()
{
	int inv_fd;
	int fd;
	char *qry_result;
	char buf[MYBUFSIZ];
	int nbytes;
	int tmp;

	PQsetdb("mydatabase");

	/* large object accesses must be */
	/* transaction-protected         */
	qry_result = PQexec("begin");

	if (*qry_result == 'E')	/* error */
		exit (1);

	/* open the UNIX file */
	fd = open("/my_unix_file", O_RDONLY, 0666);
	if (fd < 0)	/* error */
		exit (1);

	/* create the Inversion file */
	inv_fd = p_creat("/inv_file", INV_WRITE, Inversion);
	if (inv_fd < 0)	/* error */
		exit (1);

	/* copy the UNIX file to the Inversion */
	/* large object                        */
	while ((nbytes = read(fd, buf, MYBUFSIZ)) > 0)
	{
		tmp = p_write(inv_fd, buf, nbytes);
		if (tmp < nbytes)	/* error */
			exit (1);
	}

	(void) close(fd);
	(void) p_close(inv_fd);

	/* commit the transaction */
	qry_result = PQexec("end");

	if (*qry_result == 'E')	/* error */
		exit (1);

	/* by here, success */
	exit (0);
}
.)C
.SH "BUGS"
Shouldn't have to distinguish between Inversion and \*(UU large
objects when you open an existing large object.  The system knows
which implementation was used.  The flags argument should be the same
in these two cases.
.PP
All large object file names (paths) are limited to 256 characters.
.PP
In the Inversion file system, file name components (the sections
of the path preceding, following or in between \*(lq/\*(rq) are 
limited to 16 characters each.  The maximum path length is still 256
characters.
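.PP
A client can check a pathname against these limits before handing it
to the large object routines.  The sketch below is one possible
check; the function and constant names are assumptions, and whether
the system counts a terminating null byte toward either limit is not
stated here.
.(C
```c
#include <string.h>

#define SK_MAX_PATH      256   /* limit on the whole path */
#define SK_MAX_COMPONENT 16    /* Inversion per-component limit */

/* Return 1 if path fits both limits, 0 otherwise. */
int
sk_path_ok(const char *path)
{
    const char *p;
    int run = 0;   /* length of the current component */

    if (strlen(path) > SK_MAX_PATH)
        return 0;
    for (p = path; *p != 0; p++) {
        if (*p == '/')
            run = 0;
        else if (++run > SK_MAX_COMPONENT)
            return 0;
    }
    return 1;
}
```
.)C
For \*(UU large objects only the overall length test applies, since
the component limit is specific to the Inversion file system.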
.SH "SEE ALSO"
introduction(commands),
define function(commands),
define type(commands),
load(commands).
