393 Lecture Notes in Computer Science - Jim Gray

[Pages:89]393

Lecture Notes in Computer Science

Edited by G. Goos and J. Hartmanis

60

M. J. Flynn, J. N. Gray, A. K. Jones, K. Lagally H. Opderbeck, G. J. Popek, B. Randell J. H. Saltzer, H. R. Wehle

Operating Systems

An Advanced Course

Edited by R. Bayer, R. M. Graham, and G. Seegm?ller

Springer-Verlag Berlin Heidelberg New York 1978

394

Notes on Data Base Operating Systems Jim Gray

IBM Research Laboratory San Jose, California. 95193

Summer 1977

ACKNOWLEDGMENTS

This paper plagiarizes the work of the large and anonymous army of people working in the field. Because of the state of the field, there are few references to the literature (much of the "literature" is in internal memoranda, private correspondence, program logic manuals and prologues to the source language listings of various systems.)

The section on data management largely reflects the ideas of Don Chamberlin, Ted Codd, Chris Date, Dieter Gawlick, Andy Heller, Frank King, Franco Putzolu, and Bob Taylor. The discussion of views is abstracted from a paper co-authored with Don Chamberlin and Irv Traiger.

The section on data communications stems from conversations with Denny Anderson, Homer Leonard, and Charlie Sanders.

The section on transaction scheduling derives from discussions with Bob Jackson and Thomas Work.

The ideas on distributed transaction management are an amalgam of discussions with Honor Leonard.

Bruce Lindsay motivated the discussion of exception handling.

The presentation of consistency and locking derives from discussions and papers co-authored with Kapali Eswaran, Raymond Lorie, Franco Putzolu and Irving Traiger. Also Ron Obermark (IMS program isolation), Phil Macri (DMS 1100), and Paul Roever clarified many locking issues for me.

The presentation of recovery is co-authored with Paul McJones and John Nauman. Dieter Gawlick made many valuable suggestions. It reflects the ideas of Mike Blasgen, Dar Busa, Ron Obermark, Earl Jenner, Tom Price, Franco Putzolu, Butler Lampson, Howard Sturgis and Steve Weick.

All members of the System R group (IBM Research, San Jose) have contributed materially to this paper.

I am indebted to Mike Blasgen, Dieter Gawlick, especially to John Nauman, each of whom made many constructive suggestions about earlier drafts of these notes.

If you feel your ideas or work are inadequately plagiarized, please annotate this manuscript and return it to me.

395

1. INTRODUCTION

Most large institutions have now heavily invested in a data base system. In general they have automated such clerical tasks as inventory control, order entry, or billing. These systems often support a worldwide network of hundreds of terminals. Their purpose is to reliably store and retrieve large quantities of data. The life of many institutions is critically dependent on such systems, when the system is down the corporation has amnesia.

This puts an enormous burden on the implementers and operators of such systems. The systems must on the one hand be very high performance and on the other hand they must be very reliable.

1.1. A SAMPLE SYSTEM

Perhaps it is best to begin by giving an example of such a system. A large bank may have one thousand teller terminals (several have 20,000 tellers but at present no single system supports such a large network). For each teller, there is a record describing the teller's cash drawer and for each branch there is a record describing the cash position of that branch (bank general ledger), It is likely to have several million demand deposit accounts (say 10,000,000 accounts). Associated with each account is a master record giving the account owner, the account balance, and a list of recent deposits and withdrawals applicable to this account. This database occupies over 10,000,000,000 bytes and must all be on-line at all times,

The database is manipulated with application dependent transactions, which were written for this application when it was installed. There are many transactions defined on this database to query it and update it. A particular user is allowed to invoke a subset of these transactions. Invoking a transaction consists of typing a message and pushing a button. The teller terminal appends the transaction identity, teller identity and terminal identity to the message and transmits it to the central data manager. The data communication manager receives the message and translates it to some canonical form.

It then passes the message to the transaction manager, which validates the teller's authorization to invoke the specified transaction and then allocates and dispatches an instance of the transaction. The transaction processes the message, generates a response, and terminates. Data communications delivers the message to the teller.

Perhaps the most common transaction is in this environment is the DEBIT_CREDIT transaction which takes in a message from any teller, debits or credits the appropriate account (after running some validity checks), adjusts the teller cash drawer and branch balance, and then sends a response message to the teller. The transaction flow is:

396

DEBIT_CREDIT: BEGIN_TRANSACTION; GET MESSAGE; EXTRACT ACCOUT_NUMBER, DELTA, TELLER, BRANCH FROM MESSAGE; FIND ACCOUNT(ACCOUT_NUMBER) IN DATA BASE; IF NOT_FOUND | ACCOUNT_BALANCE + DELTA < 0 THEN PUT NEGATIVE RESPONSE; ELSE DO; ACCOUNT_BALANCE = ACCOUNT_BALANCE + DELTA; POST HISTORY RECORD ON ACCOUNT (DELTA); CASH_DRAWER(TELLER) = CASH_DRAWER(TELLER) + DELTA; BRANCH_BALANCE(BRANCH) = BRANCH_BALANCE(BRANCH) + DELTA; PUT MESSAGE ('NEW BALANCE =' ACCOUNT_BALANCE); END; COMMIT;

At peak periods the system runs about thirty transactions per second with a response time of two seconds.

The DEBIT_CREDIT transaction is very "small". There is another class of transactions that behave rather differently. For example, once a month a transaction is run which produces a summary statement for each account. This transaction might be described by:

MONTHLY_STATEMENT: ANSWER :: = SELECT * FROM ACCOUNT, HISTORY WHERE ACCOUNT. ACCOUNT_NUMBER = HISTORY. ACCOUNT_NUMBER AND HISTORY_DATE > LAST_REPORT GROUPED BY ACCOUNT. ACCOUNT_NUMBER, ASCENDING BY ACCOUNT. ACCOUNT_ADDRESS;

That is, collect all recent history records for each account and place them clustered with the account record into an answer file. The answers appear sorted by mailing address.

If each account has about fifteen transactions against it per month then this transaction will read 160,000,000 records and write a similar number of records. A naive implementation of this transaction will take 80 days to execute (50 milliseconds per disk seek implies two million seeks per day.) However, the system must run this transaction once a month and it must complete within a few hours.

There is a broad spread of transactions between these two types. Two particularly interesting types of transactions are conversational transactions that carry on a dialogue with the user and distributed transactions that access data or terminals at several nodes of a computer network,

Systems of 10,000 terminals or 100,000,000,000 bytes of on-line data or 150 transactions per second are generally considered to be the limit of present technology (software and hardware).

1.2. RELATIONSHIP TO OPERATING SYSTEM

If one tries to implement such an application on top of a general-purpose operating system it quickly becomes clear that many necessary functions are absent from the operating system. Historically, two approaches have been taken to this problem:

397

o Write a new, simpler and "vastly superior" operating system.

o Extend the basic operating system to have the desired function.

The first approach was very popular in the mid-sixties and is having a renaissance with the advent of minicomputers. The initial cost of a data management system is so low that almost any large customer can justify "rolling his own". The performance of such tailored systems is often ten times better than one based on a general purpose system. One must trade this off against the problems of maintaining the system as it grows to meet new needs and applications. Group's that followed this path now find themselves maintaining a rather large operating system, which must be modified to support new devices (faster disks, tape archives,...) and new protocols (e. g. networks and displays.) Gradually, these systems have grown to include all the functions of a general-purpose operating system. Perhaps the most successful approach to this has been to implement a hypervisor that runs both the data management operating system and some non-standard operating system. The "standard" operating system runs when the data manager is idle. The hypervisor is simply an interrupt handler which dispatches one or another system.

The second approach of extending the basic operating system is plagued with a different set of difficulties. The principal problem is the performance penalty of a general-purpose operating system. Very few systems are designed to deal with very large files, or with networks of thousands of nodes. To take a specific example, consider the process structure of a general-purpose system: The allocation and deallocation of a process should be very fast (500 instructions for the pair is expensive) because we want to do it 100 times per second. The storage occupied by the process descriptor should also be small (less than 1000 bytes.) Lastly, preemptive scheduling of processes makes no sense since they are not CPU bound (they do a lot of I/O). A typical system uses 16,000 bytes to represent a process and requires 200,000 instructions to allocate and deallocate this structure (systems without protection do it cheaper.) Another problem is that the general-purpose systems have been designed for batch and time-sharing operation. They have not paid sufficient attention to issues such as continuous operation: keeping the system up for weeks at a time and gracefully degrading in case of some hardware or software error.

1.3. GENERAL STRUCTURE OF DATA MANAGEMENT SYSTEMS

These notes try to discuss issues that are independent of which operating system strategy is adopted. No matter how the system is structured, there are certain problems it must solve. The general structure common to several data management systems is presented. Then two particular problems within the transaction management component are discussed in detail: concurrency control (locking) and system reliability (recovery).

398

This presentation decomposes the system into four major components: o Dictionary: the central repository of the description and definition of all persistent system objects. o Data Communications: manages teleprocessing lines and message traffic. o Data Base manager: manages the information stored in the system. o Transaction Management: manages system resources and system services such as locking and recovery.

Each of these components calls one another and in turn depends on the basic operating system for services. 1.3. BIBLIOGRAPHY These notes are rather nitty-gritty; they are aimed at system implementers rather than at users. If this is the wrong level of detail for you (is too detailed) then you may prefer the very readable books: Martin, Computer Data Base Organization, Prentice Hall, 1977 (What every DP Vice President should

know.) Martin, Computer Data Base Organization, (2nd edition), Prentice Hall, 1976 (What every application

programmer should know.) The following is a brief list of sane of the more popular general-purpose data management systems that are commercially available: Airlines Control Program, International Business Machines Corporation Customer Information Computer System, International Business Machines Corporation Data Management System 1100, Sperry Univac Corporation Extended Data Management system, Xerox Corporation Information Management System /Virtual Systems, International Business Machines Corporation Integrated Database Management System, Cullinane Corporation Integrated Data Store/1, Honeywell Information Systems Inc Model 204 Data Base Management System, Computer Corporation of America System 2000, MRI Systems Corporation. Total, Cincom Systems Corporation Each of these manufacturers will be pleased to provide you with extensive descriptions of their systems.

399

Several experimental systems are under construction at present. Some of the more interesting are: Astrahan et. al., "System R: a Relational Approach to Database Management", Astrahan et. al., ACM

Transactions on Database Systems, Vol. 1, No. 2, June 1976. Marill and Stern, "The Datacomputer-A Network Data Utility." Proc. 1975 National Computer Conference,

AFIPS Press, 1975, Stonebraker et. al., "The Design and Implementation of INGRESS." ACM Transactions on Database

Systems, Vol. 1, No. 3, Sept 1976, There are very few publicly available case studies of data base usage. The following are interesting but may not be representative: IBM Systems Journal, Vol. 16, No. 2, June 1977. (Describes the facilities and use of IMS and ACP). "IMS/VS Primer," IBM World Trade Systems Center, Palo Alto California, Form number S320-5767-1,

January 1977. "Share Guide IMS User Profile, A Summary of Message Processing Program Activity in Online IMS

Systems" IBM Palo Alto-Raleigh Systems Center Bulletin, form number 6320-6005, January 1977 Also there is one "standard" (actually "proposed standard" system): CODASYL Data Base Task Group Report, April 1971. Available from ACM

400

2. DICTIONARY

2.1. WHAT IT IS

The description of the system, the databases, the transactions, the telecommunications network, and of the users are all collected in the dictionary. This repository:

o Defines the attributes of objects such as databases and terminals.

o Cross-references these objects.

o Records natural language (e. g. German) descriptions of the meaning and use of objects.

When the system arrives, the dictionary contains only a very few definitions of transactions (usually utilities), defines a few distinguished users (operator, data base administrator,...), and defines a few special terminals (master console). The system administrator proceeds to define new terminals, transactions, users, and databases. (The system administrator function includes data base administration (DBA) and data communications (network) administration (DCA). Also, the system administrator may modify existing definitions to match the actual system or to reflect changes. This addition and modification process is treated as an editing operation.

For example, one defines a new user by entering the "define" transaction and selecting USER from the menu of definable types. This causes a form to be displayed, which has a field for each attribute of a user. The definer fills in this form and submits it to the dictionary. If the form is incorrectly filled out, it is redisplayed and the definer corrects it. Redefinition follows a similar pattern; the current form is displayed, edited and then submitted. (There is also a non-interactive interface to the dictionary for programs rather than people.)

All changes are validated by the dictionary for syntactic and semantic correctness. The ability to establish the correctness of a definition is similar to ability of a compiler to detect the correctness of a program. That is, many semantic errors go undetected. These errors are a significant problem.

Aside from validating and storing definitions, the dictionary provides a query facility which answers questions such as: "Which transactions use record type A of file B?" or, "What are the attributes of terminal 34261".

The dictionary performs one further service, that of compiling the definitions into a "machine readable" form more directly usable by the other system components. For example, a terminal definition is converted from a variable length character string to a fixed format "descriptor" giving the terminal attributes in non-symbolic form.

The dictionary is a database along with a set of transactions to manipulate this database. Some systems integrate the dictionary with the data management system so that the data definition and data manipulation interface are homogeneous. This has the virtue of sharing large bodies of code and of providing a uniform interface to the user. Ingress and System R are examples of such systems.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download