If you were designing a Web-based system to make airline reservations and to sell airline tickets, which DBMS Architecture would you choose from Section 2.5? Why? Why would the other architectures not be a good choice?
Instructions: Your response to the initial question should be 250-300 words. There must be at least one APA-formatted reference (and APA in-text citation) to support the thoughts in the post as needed. Do not use direct quotes; rather, rephrase the author’s words and continue to use in-text citations.
FUNDAMENTALS OF
Database
Systems
SEVENTH EDITION
Ramez Elmasri
Department of Computer Science and Engineering
The University of Texas at Arlington
Shamkant B. Navathe
College of Computing
Georgia Institute of Technology
Boston Columbus Indianapolis New York San Francisco Hoboken
Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto
Delhi Mexico City São Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo
Vice President and Editorial Director, ECS:
Marcia J. Horton
Acquisitions Editor: Matt Goldstein
Editorial Assistant: Kelsey Loanes
Marketing Managers: Bram Van Kempen, Demetrius Hall
Marketing Assistant: Jon Bryant
Senior Managing Editor: Scott Disanno
Production Project Manager: Rose Kernan
Program Manager: Carole Snyder
Global HE Director of Vendor Sourcing
and Procurement: Diane Hynes
Director of Operations: Nick Sklitsis
Operations Specialist: Maura Zaldivar-Garcia
Cover Designer: Black Horse Designs
Manager, Rights and Permissions: Rachel Youdelman
Associate Project Manager, Rights and Permissions:
Timothy Nicholls
Full-Service Project Management: Rashmi Tickyani,
iEnergizer Aptara®, Ltd.
Composition: iEnergizer Aptara®, Ltd.
Printer/Binder: Edwards Brothers Malloy
Cover Printer: Phoenix Color/Hagerstown
Cover Image: Micha Pawlitzki/Terra/Corbis
Typeface: 10.5/12 Minion Pro
ISBN-10: 0-13-397077-9
ISBN-13: 978-0-13-397077-7
Copyright © 2016, 2011, 2007 by Ramez Elmasri and Shamkant B. Navathe. All rights reserved. Manufactured
in the United States of America. This publication is protected by Copyright and permissions should be obtained
from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any
form or by any means, electronic, mechanical, photocopying, recording, or likewise. To obtain permission(s) to
use materials from this work, please submit a written request to Pearson Higher Education, Permissions
Department, 221 River Street, Hoboken, NJ 07030.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks.
Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations
have been printed in initial caps or all caps.
The author and publisher of this book have used their best efforts in preparing this book. These efforts include
the development, research, and testing of theories and programs to determine their effectiveness. The author and
publisher make no warranty of any kind, expressed or implied, with regard to these programs or the
documentation contained in this book. The author and publisher shall not be liable in any event for incidental or
consequential damages with, or arising out of, the furnishing, performance, or use of these programs.
Microsoft and/or its respective suppliers make no representations about the suitability of the information
contained in the documents and related graphics published as part of the services for any purpose. All such
documents and related graphics are provided “as is” without warranty of any kind. Microsoft and/or its respective
suppliers hereby disclaim all warranties and conditions with regard to this information, including all warranties
and conditions of merchantability, whether express, implied or statutory, fitness for a particular purpose, title
and non-infringement. In no event shall Microsoft and/or its respective suppliers be liable for any special,
indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether
in an action of contract, negligence or other tortious action, arising out of or in connection with the use or
performance of information available from the services.
The documents and related graphics contained herein could include technical inaccuracies or typographical
errors. Changes are periodically added to the information herein. Microsoft and/or its respective suppliers may
make improvements and/or changes in the product(s) and/or the program(s) described herein at any time.
Partial screen shots may be viewed in full within the software version specified.
Library of Congress Cataloging-in-Publication Data on File
To Amalia
and
to Ramy, Riyad, Katrina, and Thomas
R. E.
To my wife Aruna for her love, support, and understanding
and
to Rohan, Maya, and Ayush for bringing so much joy into our lives
S.B.N.
Preface
This book introduces the fundamental concepts
necessary for designing, using, and implementing
database systems and database applications. Our presentation stresses the funda-
mentals of database modeling and design, the languages and models provided by the
database management systems, and database system implementation techniques.
The book is meant to be used as a textbook for a one- or two-semester course in
database systems at the junior, senior, or graduate level, and as a reference book. Our
goal is to provide an in-depth and up-to-date presentation of the most important
aspects of database systems and applications, and related technologies. We assume
that readers are familiar with elementary programming and data-structuring con-
cepts and that they have had some exposure to the basics of computer organization.
New to This Edition
The following key features have been added in the seventh edition:
■ A reorganization of the chapter ordering (this was based on a survey of the
instructors who use the textbook); however, the book is still organized so
that the individual instructor can choose to follow the new chapter ordering
or choose a different ordering of chapters (for example, follow the chapter
order from the sixth edition) when presenting the materials.
■ There are two new chapters on recent advances in database systems and big
data processing; one new chapter (Chapter 24) covers an introduction to the
newer class of database systems known as NOSQL databases, and the other
new chapter (Chapter 25) covers technologies for processing big data,
including MapReduce and Hadoop.
■ The chapter on query processing and optimization has been expanded and
reorganized into two chapters; Chapter 18 focuses on strategies and algo-
rithms for query processing whereas Chapter 19 focuses on query optimiza-
tion techniques.
■ A second UNIVERSITY database example has been added to the early chap-
ters (Chapters 3 through 8) in addition to our COMPANY database example
from the previous editions.
■ Many of the individual chapters have been updated to varying degrees to include
newer techniques and methods; rather than discuss these enhancements here,
we will describe them later in the preface when we discuss the organization of
the seventh edition.
The following are key features of the book:
■ A self-contained, flexible organization that can be tailored to individual
needs; in particular, the chapters can be used in different orders depending
on the instructor’s preference.
■ A companion website (http://www.pearsonhighered.com/cs-resources)
includes data to be loaded into various types of relational databases for more
realistic student laboratory exercises.
■ A dependency chart (shown later in this preface) to show which chapters
depend on other earlier chapters; this can guide the instructor who wants to
tailor the order of presentation of the chapters.
■ A collection of supplements, including a robust set of materials for instruc-
tors and students such as PowerPoint slides, figures from the text, and an
instructor’s guide with solutions.
Organization and Contents of the Seventh Edition
There are some organizational changes in the seventh edition as well as improvements
to the individual chapters. The book is now divided into 12 parts as follows:
■ Part 1 (Chapters 1 and 2) describes the basic introductory concepts neces-
sary for a good understanding of database models, systems, and languages.
Chapters 1 and 2 introduce databases, typical users, and DBMS concepts,
terminology, and architecture, as well as a discussion of the progression of
database technologies over time and a brief history of data models. These
chapters have been updated to introduce some of the newer technologies
such as NOSQL systems.
■ Part 2 (Chapters 3 and 4) includes the presentation on entity-relationship
modeling and database design; however, it is important to note that instruc-
tors can cover the relational model chapters (Chapters 5 through 8) before
Chapters 3 and 4 if that is their preferred order of presenting the course
materials. In Chapter 3, the concepts of the Entity-Relationship (ER) model
and ER diagrams are presented and used to illustrate conceptual database
design. Chapter 4 shows how the basic ER model can be extended to incorpo-
rate additional modeling concepts such as subclasses, specialization, gener-
alization, union types (categories) and inheritance, leading to the
enhanced-ER (EER) data model and EER diagrams. The notation for the class
diagrams of UML is also introduced in Chapters 3 and 4 as an alternative
model and diagrammatic notation for ER/EER diagrams.
■ Part 3 (Chapters 5 through 9) includes a detailed presentation on relational
databases and SQL with some additional new material in the SQL chapters
to cover a few SQL constructs that were not in the previous edition. Chapter 5
describes the basic relational model, its integrity constraints, and update
operations. Chapter 6 describes some of the basic parts of the SQL standard
for relational databases, including data definition, data modification opera-
tions, and simple SQL queries. Chapter 7 presents more complex SQL que-
ries, as well as the SQL concepts of triggers, assertions, views, and schema
modification. Chapter 8 describes the formal operations of the relational
algebra and introduces the relational calculus. The material on SQL (Chap-
ters 6 and 7) is presented before our presentation on relational algebra and
calculus in Chapter 8 to allow instructors to start SQL projects early in a
course if they wish (it is possible to cover Chapter 8 before Chapters 6 and 7
if the instructor desires this order). The final chapter in Part 3, Chapter 9,
covers ER- and EER-to-relational mapping, which are algorithms that can be
used for designing a relational database schema from a conceptual ER/EER
schema design.
■ Part 4 (Chapters 10 and 11) contains the chapters on database programming
techniques; these chapters can be assigned as reading materials and augmented
with materials on the particular language used in the course for program-
ming projects (much of this documentation is readily available on the Web).
Chapter 10 covers traditional SQL programming topics, such as embedded
SQL, dynamic SQL, ODBC, SQLJ, JDBC, and SQL/CLI. Chapter 11 introduces
Web database programming, using the PHP scripting language in our exam-
ples, and includes new material that discusses Java technologies for Web
database programming.
■ Part 5 (Chapters 12 and 13) covers the updated material on object-relational
and object-oriented databases (Chapter 12) and XML (Chapter 13); both of
these chapters now include a presentation of how the SQL standard incorpo-
rates object concepts and XML concepts into more recent versions of the
SQL standard. Chapter 12 first introduces the concepts for object databases,
and then shows how they have been incorporated into the SQL standard in
order to add object capabilities to relational database systems. It then covers
the ODMG object model standard, and its object definition and query lan-
guages. Chapter 13 covers the XML (eXtensible Markup Language) model
and languages, and discusses how XML is related to database systems. It
presents XML concepts and languages, and compares the XML model to
traditional database models. We also show how data can be converted
between the XML and relational representations, and the SQL commands
for extracting XML documents from relational tables.
■ Part 6 (Chapters 14 and 15) contains the normalization and relational design
theory chapters (we moved all the formal aspects of normalization algo-
rithms to Chapter 15). Chapter 14 defines functional dependencies, and
the normal forms that are based on functional dependencies. Chapter 14
also develops a step-by-step intuitive normalization approach, and includes
the definitions of multivalued dependencies and join dependencies.
Chapter 15 covers normalization theory, and the formalisms, theories,
and algorithms developed for relational database design by normaliza-
tion, including the relational decomposition algorithms and the relational
synthesis algorithms.
■ Part 7 (Chapters 16 and 17) contains the chapters on file organizations on
disk (Chapter 16) and indexing of database files (Chapter 17). Chapter 16
describes primary methods of organizing files of records on disk, including
ordered (sorted), unordered (heap), and hashed files; both static and
dynamic hashing techniques for disk files are covered. Chapter 16 has been
updated to include materials on buffer management strategies for DBMSs as
well as an overview of new storage devices and standards for files and mod-
ern storage architectures. Chapter 17 describes indexing techniques for files,
including B-tree and B+-tree data structures and grid files, and has been
updated with new examples and an enhanced discussion on indexing,
including how to choose appropriate indexes and index creation during
physical design.
■ Part 8 (Chapters 18 and 19) includes the chapters on query processing algo-
rithms (Chapter 18) and optimization techniques (Chapter 19); these two
chapters have been updated and reorganized from the single chapter that
covered both topics in the previous editions and include some of the newer
techniques that are used in commercial DBMSs. Chapter 18 presents algo-
rithms for searching for records on disk files, and for joining records from
two files (tables), as well as for other relational operations. Chapter 18 con-
tains new material, including a discussion of the semi-join and anti-join
operations with examples of how they are used in query processing, as well
as a discussion of techniques for selectivity estimation. Chapter 19 covers
techniques for query optimization using cost estimation and heuristic rules;
it includes new material on nested subquery optimization, use of histograms,
physical optimization, and join ordering methods and optimization of
typical queries in data warehouses.
■ Part 9 (Chapters 20, 21, and 22) covers transaction processing concepts;
concurrency control; and database recovery from failures. These chapters
have been updated to include some of the newer techniques that are used
in some commercial and open source DBMSs. Chapter 20 introduces the
techniques needed for transaction processing systems, and defines the
concepts of recoverability and serializability of schedules; it has a new sec-
tion on buffer replacement policies for DBMSs and a new discussion on
the concept of snapshot isolation. Chapter 21 gives an overview of the var-
ious types of concurrency control protocols, with a focus on two-phase
locking. We also discuss timestamp ordering and optimistic concurrency
control techniques, as well as multiple-granularity locking. Chapter 21
includes a new presentation of concurrency control methods that are based
on the snapshot isolation concept. Finally, Chapter 22 focuses on database
recovery protocols, and gives an overview of the concepts and techniques
that are used in recovery.
■ Part 10 (Chapters 23, 24, and 25) includes the chapter on distributed data-
bases (Chapter 23), plus the two new chapters on NOSQL storage systems
for big data (Chapter 24) and big data technologies based on Hadoop and
MapReduce (Chapter 25). Chapter 23 introduces distributed database
concepts, including availability and scalability, replication and fragmenta-
tion of data, maintaining data consistency among replicas, and many other
concepts and techniques. In Chapter 24, NOSQL systems are categorized
into four general categories with an example system in each category used
for our examples, and the data models, operations, as well as the replica-
tion/distribution/scalability strategies of each type of NOSQL system are
discussed and compared. In Chapter 25, the MapReduce programming
model for distributed processing of big data is introduced, and then we
have presentations of the Hadoop system and HDFS (Hadoop Distributed
File System), as well as the Pig and Hive high-level interfaces, and the
YARN architecture.
■ Part 11 (Chapters 26 through 29) is entitled Advanced Database Models,
Systems, and Applications and includes the following materials: Chapter 26
introduces several advanced data models including active data-
bases/triggers (Section 26.1), temporal databases (Section 26.2), spatial data-
bases (Section 26.3), multimedia databases (Section 26.4), and deductive
databases (Section 26.5). Chapter 27 discusses information retrieval (IR)
and Web search, and includes topics such as IR and keyword-based search,
comparing DB with IR, retrieval models, search evaluation, and ranking
algorithms. Chapter 28 is an introduction to data mining including
overviews of various data mining methods such as association rule mining,
clustering, classification, and sequential pattern discovery. Chapter 29 is an
overview of data warehousing including topics such as data warehousing
models and operations, and the process of building a data warehouse.
■ Part 12 (Chapter 30) includes one chapter on database security, which
includes a discussion of SQL commands for discretionary access control
(GRANT, REVOKE), as well as mandatory security levels and models for
including mandatory access control in relational databases, and a discussion
of threats such as SQL injection attacks, as well as other techniques and
methods related to data security and privacy.
Appendix A gives a number of alternative diagrammatic notations for displaying a
conceptual ER or EER schema. These may be substituted for the notation we use, if
the instructor prefers. Appendix B gives some important physical parameters of
disks. Appendix C gives an overview of the QBE graphical query language, and
Appendixes D and E (available on the book’s Companion Website located at
http://www.pearsonhighered.com/elmasri) cover legacy database systems, based on
the hierarchical and network database models. They have been used for more than
thirty years as a basis for many commercial database applications and transaction-
processing systems.
Guidelines for Using This Book
There are many different ways to teach a database course. The chapters in Parts 1
through 7 can be used in an introductory course on database systems in the order
that they are given or in the preferred order of individual instructors. Selected chap-
ters and sections may be left out and the instructor can add other chapters from the
rest of the book, depending on the emphasis of the course. At the end of the open-
ing section of some of the book’s chapters, we list sections that are candidates for
being left out whenever a less-detailed discussion of the topic is desired. We suggest
covering up to Chapter 15 in an introductory database course and including selected
parts of other chapters, depending on the background of the students and the
desired coverage. For an emphasis on system implementation techniques, chapters
from Parts 7, 8, and 9 should replace some of the earlier chapters.
Chapters 3 and 4, which cover conceptual modeling using the ER and EER models,
are important for a good conceptual understanding of databases. However, they
may be partially covered, covered later in a course, or even left out if the emphasis
is on DBMS implementation. Chapters 16 and 17 on file organizations and indexing
may also be covered early, later, or even left out if the emphasis is on database mod-
els and languages. For students who have completed a course on file organization,
parts of these chapters can be assigned as reading material or some exercises can be
assigned as a review for these concepts.
If the emphasis of a course is on database design, then the instructor should cover
Chapters 3 and 4 early on, followed by the presentation of relational databases. A
total life-cycle database design and implementation project would cover conceptual
design (Chapters 3 and 4), relational databases (Chapters 5, 6, and 7), data model
mapping (Chapter 9), normalization (Chapter 14), and application programs
implementation with SQL (Chapter 10). Chapter 11 also should be covered if the
emphasis is on Web database programming and applications. Additional documen-
tation on the specific programming languages and RDBMS used would be required.
The book is written so that it is possible to cover topics in various sequences. The
following chapter dependency chart shows the major dependencies among chap-
ters. As the diagram illustrates, it is possible to start with several different topics
following the first two introductory chapters. Although the chart may seem com-
plex, it is important to note that if the chapters are covered in order, the dependen-
cies are not lost. The chart can be consulted by instructors wishing to use an
alternative order of presentation.
For a one-semester course based on this book, selected chapters can be assigned as
reading material. The book also can be used for a two-semester course sequence.
The first course, Introduction to Database Design and Database Systems, at the
sophomore, junior, or senior level, can cover most of Chapters 1 through 15. The
second course, Database Models and Implementation Techniques, at the senior or
first-year graduate level, can cover most of Chapters 16 through 30. The two-
semester sequence can also be designed in various other ways, depending on the
preferences of the instructors.
Supplemental Materials
Support material is available to qualified instructors at Pearson’s instructor
resource center (http://www.pearsonhighered.com/irc). For access, contact your
local Pearson representative.
■ PowerPoint lecture notes and figures.
■ A solutions manual.
Figure: Chapter dependency chart. Nodes: Chapters 1, 2 (Introductory); 3, 4 (ER, EER Models); 5 (Relational Model); 6, 7 (SQL); 8 (Relational Algebra); 9 (ER-, EER-to-Relational); 10, 11 (DB, Web Programming); 12, 13 (ODB, ORDB, XML); 14, 15 (FD, MVD, Normalization); 16, 17 (File Organization, Indexing); 18, 19 (Query Processing, Optimization); 20, 21, 22 (Transactions, CC, Recovery); 23, 24, 25 (DDB, NOSQL, Big Data); 26, 27 (Advanced Models, IR); 28, 29 (Data Mining, Warehousing); 30 (DB Security).
Acknowledgments
It is a great pleasure to acknowledge the assistance and contributions of many indi-
viduals to this effort. First, we would like to thank our editor, Matt Goldstein, for
his guidance, encouragement, and support. We would like to acknowledge the
excellent work of Rose Kernan for production management, Patricia Daly for a
thorough copy editing of the book, Martha McMaster for her diligence in proofing
the pages, and Scott Disanno, Managing Editor of the production team. We also
wish to thank Kelsey Loanes from Pearson for her continued help with the project,
and reviewers Michael Doherty, Deborah Dunn, Imad Rahal, Karen Davis, Gilliean
Lee, Leo Mark, Monisha Pulimood, Hassan Reza, Susan Vrbsky, Li Da Xu, Weining
Zhang and Vincent Oria.
Ramez Elmasri would like to thank Kulsawasd Jitkajornwanich, Vivek Sharma, and
Surya Swaminathan for their help with preparing some of the material in Chap-
ter 24. Sham Navathe would like to acknowledge the following individuals who
helped in critically reviewing and revising various topics: Dan Forsythe and Satish
Damle for discussion of storage systems; Rafi Ahmed for detailed re-organization
of the material on query processing and optimization; Harish Butani, Balaji
Palanisamy, and Prajakta Kalmegh for their help with the Hadoop and MapReduce
technology material; Vic Ghorpadey and Nenad Jukic for revision of the Data
Warehousing material; and finally, Frank Rietta for newer techniques in database
security, Kunal Malhotra for various discussions, and Saurav Sahay for advances in
information retrieval systems.
We would like to repeat our thanks to those who have reviewed and contributed to
previous editions of Fundamentals of Database Systems.
■ First edition. Alan Apt (editor), Don Batory, Scott Downing, Dennis
Heimbinger, Julia Hodges, Yannis Ioannidis, Jim Larson, Per-Ake Larson,
Dennis McLeod, Rahul Patel, Nicholas Roussopoulos, David Stemple,
Michael Stonebraker, Frank Tompa, and Kyu-Young Whang.
■ Second edition. Dan Joraanstad (editor), Rafi Ahmed, Antonio Albano, David
Beech, Jose Blakeley, Panos Chrysanthis, Suzanne Dietrich, Vic Ghorpadey,
Goetz Graefe, Eric Hanson, Junguk L. Kim, Roger King, Vram Kouramajian,
Vijay Kumar, John Lowther, Sanjay Manchanda, Toshimi Minoura, Inderpal
Mumick, Ed Omiecinski, Girish Pathak, Raghu Ramakrishnan, Ed Robertson,
Eugene Sheng, David Stotts, Marianne Winslett, and Stan Zdonik.
■ Third edition. Maite Suarez-Rivas and Katherine Harutunian (editors);
Suzanne Dietrich, Ed Omiecinski, Rafi Ahmed, Francois Bancilhon, Jose
Blakeley, Rick Cattell, Ann Chervenak, David W. Embley, Henry A. Etlinger,
Leonidas Fegaras, Dan Forsythe, Farshad Fotouhi, Michael Franklin, Sreejith
Gopinath, Goetz Graefe, Richard Hull, Sushil Jajodia, Ramesh K. Karne,
Harish Kotbagi, Vijay Kumar, Tarcisio Lima, Ramon A. Mata-Toledo, Jack
McCaw, Dennis McLeod, Rokia Missaoui, Magdi Morsi, M. Narayanaswamy,
Carlos Ordonez, Joan Peckham, Betty Salzberg, Ming-Chien Shan, Junping
Sun, Rajshekhar Sunderraman, Aravindan Veerasamy, and Emilia E. Villareal.
■ Fourth edition. Maite Suarez-Rivas, Katherine Harutunian, Daniel Rausch,
and Juliet Silveri (editors); Phil Bernhard, Zhengxin Chen, Jan Chomicki,
Hakan Ferhatosmanoglu, Len Fisk, William Hankley, Ali R. Hurson, Vijay
Kumar, Peretz Shoval, Jason T. L. Wang (reviewers); Ed Omiecinski (who
contributed to Chapter 27). Contributors from the University of Texas at
Arlington are Jack Fu, Hyoil Han, Babak Hojabri, Charley Li, Ande Swathi,
and Steven Wu; contributors from Georgia Tech are Weimin Feng, Dan Forsythe,
Angshuman Guin, Abrar Ul-Haque, Bin Liu, Ying Liu, Wanxia Xie,
and Waigen Yee.
■ Fifth edition. Matt Goldstein and Katherine Harutunian (editors); Michelle
Brown, Gillian Hall, Patty Mahtani, Maite Suarez-Rivas, Bethany Tidd, and
Joyce Cosentino Wells (from Addison-Wesley); Hani Abu-Salem, Jamal R.
Alsabbagh, Ramzi Bualuan, Soon Chung, Sumali Conlon, Hasan Davulcu,
James Geller, Le Gruenwald, Latifur Khan, Herman Lam, Byung S. Lee,
Donald Sanderson, Jamil Saquer, Costas Tsatsoulis, and Jack C. Wileden
(reviewers); Raj Sunderraman (who contributed the laboratory projects);
Salman Azar (who contributed some new exercises); Gaurav Bhatia, Fari-
borz Farahmand, Ying Liu, Ed Omiecinski, Nalini Polavarapu, Liora Sahar,
Saurav Sahay, and Wanxia Xie (from Georgia Tech).
■ Sixth edition. Matt Goldstein (editor); Gillian Hall (production manage-
ment); Rebecca Greenberg (copy editing); Jeff Holcomb, Marilyn Lloyd,
Margaret Waples, and Chelsea Bell (from Pearson); Rafi Ahmed, Venu
Dasigi, Neha Deodhar, Fariborz Farahmand, Hariprasad Kumar, Leo Mark,
Ed Omiecinski, Balaji Palanisamy, Nalini Polavarapu, Parimala R. Pranesh,
Bharath Rengarajan, Liora Sahar, Saurav Sahay, Narsi Srinivasan, and
Wanxia Xie.
Last, but not least, we gratefully acknowledge the support, encouragement, and
patience of our families.
R. E.
S.B.N.
Contents
Preface
About the Authors
■ part 1
Introduction to Databases ■
chapter 1 Databases and Database Users
1.1 Introduction
1.2 An Example
1.3 Characteristics of the Database Approach
1.4 Actors on the Scene
1.5 Workers behind the Scene
1.6 Advantages of Using the DBMS Approach
1.7 A Brief History of Database Applications
1.8 When Not to Use a DBMS
1.9 Summary
Review Questions
Exercises
Selected Bibliography
chapter 2 Database System Concepts and Architecture
2.1 Data Models, Schemas, and Instances
2.2 Three-Schema Architecture and Data Independence
2.3 Database Languages and Interfaces
2.4 The Database System Environment
2.5 Centralized and Client/Server Architectures for DBMSs
2.6 Classification of Database Management Systems
2.7 Summary
Review Questions
Exercises
Selected Bibliography
■ part 2
Conceptual Data Modeling and Database Design ■
chapter 3 Data Modeling Using the Entity–Relationship (ER) Model
3.1 Using High-Level Conceptual Data Models for Database Design
3.2 A Sample Database Application
3.3 Entity Types, Entity Sets, Attributes, and Keys
3.4 Relationship Types, Relationship Sets, Roles, and Structural Constraints
3.5 Weak Entity Types
3.6 Refining the ER Design for the COMPANY Database
3.7 ER Diagrams, Naming Conventions, and Design Issues
3.8 Example of Other Notation: UML Class Diagrams
3.9 Relationship Types of Degree Higher than Two
3.10 Another Example: A UNIVERSITY Database
3.11 Summary
Review Questions
Exercises
Laboratory Exercises
Selected Bibliography
chapter 4 The Enhanced Entity–Relationship (EER) Model
4.1 Subclasses, Superclasses, and Inheritance
4.2 Specialization and Generalization
4.3 Constraints and Characteristics of Specialization and Generalization Hierarchies
4.4 Modeling of UNION Types Using Categories
4.5 A Sample UNIVERSITY EER Schema, Design Choices, and Formal Definitions
4.6 Example of Other Notation: Representing Specialization and Generalization in UML Class Diagrams
4.7 Data Abstraction, Knowledge Representation, and Ontology Concepts
4.8 Summary
Review Questions
Exercises
Laboratory Exercises
Selected Bibliography
■ part 3
The Relational Data Model and SQL ■
chapter 5 The Relational Data Model and Relational Database Constraints
5.1 Relational Model Concepts
5.2 Relational Model Constraints and Relational Database Schemas
5.3 Update Operations, Transactions, and Dealing with Constraint Violations
5.4 Summary
Review Questions
Exercises
Selected Bibliography
chapter 6 Basic SQL
6.1 SQL Data Definition and Data Types
6.2 Specifying Constraints in SQL
6.3 Basic Retrieval Queries in SQL
6.4 INSERT, DELETE, and UPDATE Statements in SQL
6.5 Additional Features of SQL
6.6 Summary
Review Questions
Exercises
Selected Bibliography
chapter 7 More SQL: Complex Queries, Triggers, Views, and Schema Modification
7.1 More Complex SQL Retrieval Queries
7.2 Specifying Constraints as Assertions and Actions as Triggers
7.3 Views (Virtual Tables) in SQL
7.4 Schema Change Statements in SQL
7.5 Summary
Review Questions
Exercises
Selected Bibliography
chapter 8 The Relational Algebra and Relational Calculus
8.1 Unary Relational Operations: SELECT and PROJECT
8.2 Relational Algebra Operations from Set Theory
8.3 Binary Relational Operations: JOIN and DIVISION
8.4 Additional Relational Operations
8.5 Examples of Queries in Relational Algebra
8.6 The Tuple Relational Calculus
8.7 The Domain Relational Calculus
8.8 Summary
Review Questions
Exercises
Laboratory Exercises
Selected Bibliography
chapter 9 Relational Database Design by ER- and EER-to-Relational Mapping
9.1 Relational Database Design Using ER-to-Relational Mapping
9.2 Mapping EER Model Constructs to Relations
9.3 Summary
Review Questions
Exercises
Laboratory Exercises
Selected Bibliography
■ part 4
Database Programming Techniques ■
chapter 10 Introduction to SQL Programming
Techniques 309
10.1 Overview of Database Programming Techniques and Issues 310
10.2 Embedded SQL, Dynamic SQL, and SQLJ 314
10.3 Database Programming with Function Calls and Class
Libraries: SQL/CLI and JDBC 326
10.4 Database Stored Procedures and SQL/PSM 335
10.5 Comparing the Three Approaches 338
10.6 Summary 339
Review Questions 340
Exercises 340
Selected Bibliography 341
chapter 11 Web Database Programming Using PHP 343
11.1 A Simple PHP Example 344
11.2 Overview of Basic Features of PHP 346
11.3 Overview of PHP Database Programming 353
11.4 Brief Overview of Java Technologies for Database Web
Programming 358
11.5 Summary 358
Review Questions 359
Exercises 359
Selected Bibliography 359
■ part 5
Object, Object-Relational, and XML: Concepts, Models,
Languages, and Standards ■
chapter 12 Object and Object-Relational
Databases 363
12.1 Overview of Object Database Concepts 365
12.2 Object Database Extensions to SQL 379
12.3 The ODMG Object Model and the Object Definition Language
ODL 386
12.4 Object Database Conceptual Design 405
12.5 The Object Query Language OQL 408
12.6 Overview of the C++ Language Binding in the ODMG
Standard 417
12.7 Summary 418
Review Questions 420
Exercises 421
Selected Bibliography 422
chapter 13 XML: Extensible Markup Language 425
13.1 Structured, Semistructured, and Unstructured Data 426
13.2 XML Hierarchical (Tree) Data Model 430
13.3 XML Documents, DTD, and XML Schema 433
13.4 Storing and Extracting XML Documents
from Databases 442
13.5 XML Languages 443
13.6 Extracting XML Documents from Relational Databases 447
13.7 XML/SQL: SQL Functions for Creating XML Data 453
13.8 Summary 455
Review Questions 456
Exercises 456
Selected Bibliography 456
■ part 6
Database Design Theory and Normalization ■
chapter 14 Basics of Functional Dependencies
and Normalization for Relational
Databases 459
14.1 Informal Design Guidelines for Relation
Schemas 461
14.2 Functional Dependencies 471
14.3 Normal Forms Based on Primary Keys 474
14.4 General Definitions of Second and Third Normal
Forms 483
14.5 Boyce-Codd Normal Form 487
14.6 Multivalued Dependency and Fourth
Normal Form 491
14.7 Join Dependencies and Fifth Normal Form 494
14.8 Summary 495
Review Questions 496
Exercises 497
Laboratory Exercises 501
Selected Bibliography 502
chapter 15 Relational Database Design Algorithms
and Further Dependencies 503
15.1 Further Topics in Functional Dependencies: Inference Rules,
Equivalence, and Minimal Cover 505
15.2 Properties of Relational Decompositions 513
15.3 Algorithms for Relational Database Schema
Design 519
15.4 About Nulls, Dangling Tuples, and Alternative Relational
Designs 523
15.5 Further Discussion of Multivalued Dependencies
and 4NF 527
15.6 Other Dependencies and Normal Forms 530
15.7 Summary 533
Review Questions 534
Exercises 535
Laboratory Exercises 536
Selected Bibliography 537
■ part 7
File Structures, Hashing, Indexing, and Physical
Database Design ■
chapter 16 Disk Storage, Basic File Structures,
Hashing, and Modern Storage
Architectures 541
16.1 Introduction 542
16.2 Secondary Storage Devices 547
16.3 Buffering of Blocks 556
16.4 Placing File Records on Disk 560
16.5 Operations on Files 564
16.6 Files of Unordered Records (Heap Files) 567
16.7 Files of Ordered Records (Sorted Files) 568
16.8 Hashing Techniques 572
16.9 Other Primary File Organizations 582
16.10 Parallelizing Disk Access Using RAID
Technology 584
16.11 Modern Storage Architectures 588
16.12 Summary 592
Review Questions 593
Exercises 595
Selected Bibliography 598
chapter 17 Indexing Structures for Files and Physical
Database Design 601
17.1 Types of Single-Level Ordered Indexes 602
17.2 Multilevel Indexes 613
17.3 Dynamic Multilevel Indexes Using B-Trees
and B+-Trees 617
17.4 Indexes on Multiple Keys 631
17.5 Other Types of Indexes 633
17.6 Some General Issues Concerning Indexing 638
17.7 Physical Database Design in Relational
Databases 643
17.8 Summary 646
Review Questions 647
Exercises 648
Selected Bibliography 650
■ part 8
Query Processing and Optimization ■
chapter 18 Strategies for Query Processing 655
18.1 Translating SQL Queries into Relational Algebra
and Other Operators 657
18.2 Algorithms for External Sorting 660
18.3 Algorithms for SELECT Operation 663
18.4 Implementing the JOIN Operation 668
18.5 Algorithms for PROJECT and Set Operations 676
18.6 Implementing Aggregate Operations and Different
Types of JOINs 678
18.7 Combining Operations Using Pipelining 681
18.8 Parallel Algorithms for Query Processing 683
18.9 Summary 688
Review Questions 688
Exercises 689
Selected Bibliography 689
chapter 19 Query Optimization 691
19.1 Query Trees and Heuristics for Query
Optimization 692
19.2 Choice of Query Execution Plans 701
19.3 Use of Selectivities in Cost-Based
Optimization 710
19.4 Cost Functions for SELECT Operation 714
19.5 Cost Functions for the JOIN Operation 717
19.6 Example to Illustrate Cost-Based Query
Optimization 726
19.7 Additional Issues Related to Query
Optimization 728
19.8 An Example of Query Optimization in Data
Warehouses 731
19.9 Overview of Query Optimization in Oracle 733
19.10 Semantic Query Optimization 737
19.11 Summary 738
Review Questions 739
Exercises 740
Selected Bibliography 740
■ part 9
Transaction Processing, Concurrency Control,
and Recovery ■
chapter 20 Introduction to Transaction Processing
Concepts and Theory 745
20.1 Introduction to Transaction Processing 746
20.2 Transaction and System Concepts 753
20.3 Desirable Properties of Transactions 757
20.4 Characterizing Schedules Based on Recoverability 759
20.5 Characterizing Schedules Based on Serializability 763
20.6 Transaction Support in SQL 773
20.7 Summary 776
Review Questions 777
Exercises 777
Selected Bibliography 779
chapter 21 Concurrency Control Techniques 781
21.1 Two-Phase Locking Techniques for Concurrency
Control 782
21.2 Concurrency Control Based on Timestamp Ordering 792
21.3 Multiversion Concurrency Control Techniques 795
21.4 Validation (Optimistic) Techniques and Snapshot Isolation
Concurrency Control 798
21.5 Granularity of Data Items and Multiple Granularity
Locking 800
21.6 Using Locks for Concurrency Control in Indexes 805
21.7 Other Concurrency Control Issues 806
21.8 Summary 807
Review Questions 808
Exercises 809
Selected Bibliography 810
chapter 22 Database Recovery Techniques 813
22.1 Recovery Concepts 814
22.2 NO-UNDO/REDO Recovery Based on Deferred
Update 821
22.3 Recovery Techniques Based on Immediate Update 823
22.4 Shadow Paging 826
22.5 The ARIES Recovery Algorithm 827
22.6 Recovery in Multidatabase Systems 831
22.7 Database Backup and Recovery from Catastrophic Failures 832
22.8 Summary 833
Review Questions 834
Exercises 835
Selected Bibliography 838
■ part 10
Distributed Databases, NOSQL Systems,
and Big Data ■
chapter 23 Distributed Database Concepts 841
23.1 Distributed Database Concepts 842
23.2 Data Fragmentation, Replication, and Allocation Techniques for
Distributed Database Design 847
23.3 Overview of Concurrency Control and Recovery in Distributed
Databases 854
23.4 Overview of Transaction Management in Distributed Databases 857
23.5 Query Processing and Optimization in Distributed Databases 859
23.6 Types of Distributed Database Systems 865
23.7 Distributed Database Architectures 868
23.8 Distributed Catalog Management 875
23.9 Summary 876
Review Questions 877
Exercises 878
Selected Bibliography 880
chapter 24 NOSQL Databases and Big Data Storage
Systems 883
24.1 Introduction to NOSQL Systems 884
24.2 The CAP Theorem 888
24.3 Document-Based NOSQL Systems and MongoDB 890
24.4 NOSQL Key-Value Stores 895
24.5 Column-Based or Wide Column NOSQL Systems 900
24.6 NOSQL Graph Databases and Neo4j 903
24.7 Summary 909
Review Questions 909
Selected Bibliography 910
chapter 25 Big Data Technologies Based on MapReduce
and Hadoop 911
25.1 What Is Big Data? 914
25.2 Introduction to MapReduce and Hadoop 916
25.3 Hadoop Distributed File System (HDFS) 921
25.4 MapReduce: Additional Details 926
25.5 Hadoop v2 alias YARN 936
25.6 General Discussion 944
25.7 Summary 953
Review Questions 954
Selected Bibliography 956
■ part 11
Advanced Database Models, Systems, and
Applications ■
chapter 26 Enhanced Data Models: Introduction to Active,
Temporal, Spatial, Multimedia, and Deductive
Databases 961
26.1 Active Database Concepts and Triggers 963
26.2 Temporal Database Concepts 974
26.3 Spatial Database Concepts 987
26.4 Multimedia Database Concepts 994
26.5 Introduction to Deductive Databases 999
26.6 Summary 1012
Review Questions 1014
Exercises 1015
Selected Bibliography 1018
chapter 27 Introduction to Information Retrieval
and Web Search 1021
27.1 Information Retrieval (IR) Concepts 1022
27.2 Retrieval Models 1029
27.3 Types of Queries in IR Systems 1035
27.4 Text Preprocessing 1037
27.5 Inverted Indexing 1040
27.6 Evaluation Measures of Search Relevance 1044
27.7 Web Search and Analysis 1047
27.8 Trends in Information Retrieval 1057
27.9 Summary 1063
Review Questions 1064
Selected Bibliography 1066
chapter 28 Data Mining Concepts 1069
28.1 Overview of Data Mining Technology 1070
28.2 Association Rules 1073
28.3 Classification 1085
28.4 Clustering 1088
28.5 Approaches to Other Data Mining Problems 1091
28.6 Applications of Data Mining 1094
28.7 Commercial Data Mining Tools 1094
28.8 Summary 1097
Review Questions 1097
Exercises 1098
Selected Bibliography 1099
chapter 29 Overview of Data Warehousing
and OLAP 1101
29.1 Introduction, Definitions, and Terminology 1102
29.2 Characteristics of Data Warehouses 1103
29.3 Data Modeling for Data Warehouses 1105
29.4 Building a Data Warehouse 1111
29.5 Typical Functionality of a Data Warehouse 1114
29.6 Data Warehouse versus Views 1115
29.7 Difficulties of Implementing Data Warehouses 1116
29.8 Summary 1117
Review Questions 1117
Selected Bibliography 1118
■ part 12
Additional Database Topics: Security ■
chapter 30 Database Security 1121
30.1 Introduction to Database Security Issues 1122
30.2 Discretionary Access Control Based on Granting and Revoking
Privileges 1129
30.3 Mandatory Access Control and Role-Based Access Control for
Multilevel Security 1134
30.4 SQL Injection 1143
30.5 Introduction to Statistical Database Security 1146
30.6 Introduction to Flow Control 1147
30.7 Encryption and Public Key Infrastructures 1149
30.8 Privacy Issues and Preservation 1153
30.9 Challenges to Maintaining Database Security 1154
30.10 Oracle Label-Based Security 1155
30.11 Summary 1158
Review Questions 1159
Exercises 1160
Selected Bibliography 1161
appendix A Alternative Diagrammatic Notations for ER
Models 1163
appendix B Parameters of Disks 1167
appendix C Overview of the QBE Language 1171
C.1 Basic Retrievals in QBE 1171
C.2 Grouping, Aggregation, and Database Modification in QBE 1175
appendix D Overview of the Hierarchical Data Model
(located on the Companion Website at
http://www.pearsonhighered.com/elmasri)
appendix E Overview of the Network Data Model
(located on the Companion Website at
http://www.pearsonhighered.com/elmasri)
Selected Bibliography 1179
Index 1215
About the Authors
Ramez Elmasri is a professor and the associate chairperson of the Department of
Computer Science and Engineering at the University of Texas at Arlington. He has
over 140 refereed research publications, and has supervised 16 PhD students and
over 100 MS students. His research has covered many areas of database manage-
ment and big data, including conceptual modeling and data integration, query
languages and indexing techniques, temporal and spatio-temporal databases, bio-
informatics databases, data collection from sensor networks, and mining/analysis
of spatial and spatio-temporal data. He has worked as a consultant to various com-
panies, including Digital, Honeywell, Hewlett Packard, and Action Technologies,
as well as consulting with law firms on patents. He was the Program Chair of the
1993 International Conference on Conceptual Modeling (ER conference) and pro-
gram vice-chair of the 1994 IEEE International Conference on Data Engineering.
He has served on the ER conference steering committee and has been on the pro-
gram committees of many conferences. He has given several tutorials at the VLDB,
ICDE, and ER conferences. He also co-authored the book “Operating Systems: A
Spiral Approach” (McGraw-Hill, 2009) with Gil Carrick and David Levine. Elmasri
is a recipient of the UTA College of Engineering Outstanding Teaching Award in
1999. He holds a BS degree in Engineering from Alexandria University, and MS
and PhD degrees in Computer Science from Stanford University.
Shamkant B. Navathe is a professor and the founder of the database research group
at the College of Computing, Georgia Institute of Technology, Atlanta. He has
worked with IBM and Siemens in their research divisions and has been a consultant
to various companies including Digital, Computer Corporation of America,
Hewlett Packard, Equifax, and Persistent Systems. He was the General Co-chairman
of the 1996 International VLDB (Very Large Data Base) conference in Bombay,
India. He was also program co-chair of ACM SIGMOD 1985 International Confer-
ence and General Co-chair of the IFIP WG 2.6 Data Semantics Workshop in 1995.
He has served on the VLDB foundation and has been on the steering committees of
several conferences. He has been an associate editor of a number of journals
including ACM Computing Surveys, and IEEE Transactions on Knowledge and
Data Engineering. He also co-authored the book “Conceptual Design: An Entity
Relationship Approach” (Addison Wesley, 1992) with Carlo Batini and Stefano
Ceri. Navathe is a fellow of the Association for Computing Machinery (ACM) and
recipient of the IEEE TCDE Computer Science, Engineering and Education Impact
award in 2015. Navathe holds a PhD from the University of Michigan and has over
150 refereed publications in journals and conferences.
part 1
Introduction to Databases
chapter 1
Databases and Database Users
Databases and database systems are an essential
component of life in modern society: most of us
encounter several activities every day that involve some interaction with a database.
For example, if we go to the bank to deposit or withdraw funds, if we make a hotel
or airline reservation, if we access a computerized library catalog to search for a
bibliographic item, or if we purchase something online—such as a book, toy, or
computer—chances are that our activities will involve someone or some computer
program accessing a database. Even purchasing items at a supermarket often auto-
matically updates the database that holds the inventory of grocery items.
These interactions are examples of what we may call traditional database
applications, in which most of the information that is stored and accessed is either
textual or numeric. In the past few years, advances in technology have led to exciting
new applications of database systems. The proliferation of social media Web sites,
such as Facebook, Twitter, and Flickr, among many others, has required the cre-
ation of huge databases that store nontraditional data, such as posts, tweets,
images, and video clips. New types of database systems, often referred to as big data
storage systems, or NOSQL systems, have been created to manage data for social
media applications. These types of systems are also used by companies such as
Google, Amazon, and Yahoo, to manage the data required in their Web search
engines, as well as to provide cloud storage, whereby users are provided with stor-
age capabilities on the Web for managing all types of data including documents,
programs, images, videos and emails. We will give an overview of these new types
of database systems in Chapter 24.
We now mention some other applications of databases. The wide availability of
photo and video technology on cellphones and other devices has made it possible to
store images, audio clips, and video streams digitally. These types of files are becom-
ing an important component of multimedia databases. Geographic information
systems (GISs) can store and analyze maps, weather data, and satellite images.
Data warehouses and online analytical processing (OLAP) systems are used in
many companies to extract and analyze useful business information from very large
databases to support decision making. Real-time and active database technology
is used to control industrial and manufacturing processes. And database search
techniques are being applied to the World Wide Web to improve the search for
information that is needed by users browsing the Internet.
To understand the fundamentals of database technology, however, we must start
from the basics of traditional database applications. In Section 1.1 we start by defin-
ing a database, and then we explain other basic terms. In Section 1.2, we provide a
simple UNIVERSITY database example to illustrate our discussion. Section 1.3
describes some of the main characteristics of database systems, and Sections 1.4
and 1.5 categorize the types of personnel whose jobs involve using and interacting
with database systems. Sections 1.6, 1.7, and 1.8 offer a more thorough discussion
of the various capabilities provided by database systems and discuss some typical
database applications. Section 1.9 summarizes the chapter.
The reader who desires a quick introduction to database systems can study
Sections 1.1 through 1.5, then skip or browse through Sections 1.6 through 1.8 and
go on to Chapter 2.
1.1 Introduction
Databases and database technology have had a major impact on the growing use of
computers. It is fair to say that databases play a critical role in almost all areas where
computers are used, including business, electronic commerce, social media, engi-
neering, medicine, genetics, law, education, and library science. The word database
is so commonly used that we must begin by defining what a database is. Our initial
definition is quite general.
A database is a collection of related data.1 By data, we mean known facts that can
be recorded and that have implicit meaning. For example, consider the names,
telephone numbers, and addresses of the people you know. Nowadays, this data is
typically stored in mobile phones, which have their own simple database software.
This data can also be recorded in an indexed address book or stored on a hard
drive, using a personal computer and software such as Microsoft Access or Excel.
This collection of related data with an implicit meaning is a database.
1 We will use the word data as both singular and plural, as is common in database literature; the context
will determine whether it is singular or plural. In standard English, data is used for plural and datum for
singular.
The preceding definition of database is quite general; for example, we may consider
the collection of words that make up this page of text to be related data and hence to
constitute a database. However, the common use of the term database is usually
more restricted. A database has the following implicit properties:
■ A database represents some aspect of the real world, sometimes called the
miniworld or the universe of discourse (UoD). Changes to the miniworld
are reflected in the database.
■ A database is a logically coherent collection of data with some inherent
meaning. A random assortment of data cannot correctly be referred to as a
database.
■ A database is designed, built, and populated with data for a specific purpose.
It has an intended group of users and some preconceived applications in
which these users are interested.
In other words, a database has some source from which data is derived, some degree
of interaction with events in the real world, and an audience that is actively inter-
ested in its contents. The end users of a database may perform business transactions
(for example, a customer buys a camera) or events may happen (for example, an
employee has a baby) that cause the information in the database to change. In order
for a database to be accurate and reliable at all times, it must be a true reflection of
the miniworld that it represents; therefore, changes must be reflected in the data-
base as soon as possible.
A database can be of any size and complexity. For example, the list of names and
addresses referred to earlier may consist of only a few hundred records, each with a
simple structure. On the other hand, the computerized catalog of a large library
may contain half a million entries organized under different categories—by pri-
mary author’s last name, by subject, by book title—with each category organized
alphabetically. A database of even greater size and complexity would be maintained
by a social media company such as Facebook, which has more than a billion users.
The database has to maintain information on which users are related to one another
as friends, the postings of each user, which users are allowed to see each posting,
and a vast amount of other types of information needed for the correct operation of
their Web site. For such Web sites, a large number of databases are needed to keep
track of the constantly changing information required by the social media Web site.
An example of a large commercial database is Amazon.com. It contains data for
over 60 million active users, and millions of books, CDs, videos, DVDs, games,
electronics, apparel, and other items. The database occupies over 42 terabytes
(a terabyte is 10^12 bytes worth of storage) and is stored on hundreds of computers
(called servers). Millions of visitors access Amazon.com each day and use the
database to make purchases. The database is continually updated as new books
and other items are added to the inventory, and stock quantities are updated as
purchases are transacted.
A database may be generated and maintained manually or it may be computer-
ized. For example, a library card catalog is a database that may be created and
maintained manually. A computerized database may be created and maintained
either by a group of application programs written specifically for that task or by a
database management system. Of course, we are only concerned with computer-
ized databases in this text.
A database management system (DBMS) is a computerized system that enables
users to create and maintain a database. The DBMS is a general-purpose software
system that facilitates the processes of defining, constructing, manipulating, and
sharing databases among various users and applications. Defining a database
involves specifying the data types, structures, and constraints of the data to be
stored in the database. The database definition or descriptive information is also
stored by the DBMS in the form of a database catalog or dictionary; it is called
meta-data. Constructing the database is the process of storing the data on some
storage medium that is controlled by the DBMS. Manipulating a database includes
functions such as querying the database to retrieve specific data, updating the data-
base to reflect changes in the miniworld, and generating reports from the data.
Sharing a database allows multiple users and programs to access the database
simultaneously.
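The defining, constructing, and manipulating functions described above can be sketched with a small example. This is a minimal illustration only, assuming Python's built-in sqlite3 module as the DBMS; the contact table, its columns, and the sample names and phone numbers are hypothetical and do not come from the text. A single in-memory connection does not illustrate sharing among multiple users.

```python
import sqlite3

# Hypothetical in-memory address book; table and column names are illustrative.
conn = sqlite3.connect(":memory:")

# Defining: specify the structure, data types, and constraints of the data.
conn.execute(
    """
    CREATE TABLE contact (
        name    TEXT NOT NULL,
        phone   TEXT NOT NULL,
        address TEXT
    )
    """
)

# Constructing: store the data on a medium controlled by the DBMS.
conn.executemany(
    "INSERT INTO contact (name, phone, address) VALUES (?, ?, ?)",
    [
        ("Alice Example", "555-0100", "1 Main St"),
        ("Bob Example", "555-0101", "2 Oak Ave"),
    ],
)

# Manipulating: update the database to reflect a change in the miniworld,
# then query it to retrieve specific data.
conn.execute(
    "UPDATE contact SET phone = ? WHERE name = ?", ("555-0199", "Bob Example")
)
rows = conn.execute("SELECT name, phone FROM contact ORDER BY name").fetchall()
print(rows)

# The definition itself is kept by the DBMS as meta-data; in SQLite the
# catalog is exposed through the sqlite_master table.
catalog_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'contact'"
).fetchone()[0]
print(catalog_sql)
```

The last query reads the stored table definition back from the catalog rather than from the program's own source, which is the sense in which the DBMS, not the application, holds the database description.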
An application program accesses the database by sending queries or requests for
data to the DBMS. A query2 typically causes some data to be retrieved; a transaction
may cause some data to be read and some data to be written into the database.
Other important functions provided by the DBMS include protecting the database
and maintaining it over a long period of time. Protection includes system protec-
tion against hardware or software malfunction (or crashes) and security protection
against unauthorized or malicious access. A typical large database may have a life
cycle of many years, so the DBMS must be able to maintain the database system by
allowing the system to evolve as requirements change over time.
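The difference between a query and a transaction can be made concrete with a short sketch. It assumes Python's sqlite3 module and a hypothetical account table with a non-negative balance constraint; the table, the transfer function, and the amounts are illustrative and are not taken from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100), (2, 50)")
conn.commit()

# A query: data is only retrieved.
balance_before = conn.execute(
    "SELECT balance FROM account WHERE id = 1"
).fetchone()[0]


def transfer(conn, src, dst, amount):
    """Hypothetical transaction that both reads and writes the database."""
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised.
    with conn:
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
        )


transfer(conn, 1, 2, 30)

try:
    # The debit would make the balance negative, so the CHECK constraint
    # fails and the credit made earlier in the same transaction is undone.
    transfer(conn, 1, 2, 500)
    failed = False
except sqlite3.IntegrityError:
    failed = True

balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(balance_before, failed, balances)
```

The failed transfer leaves both balances as they were after the first transfer, a small instance of the DBMS protecting the database from a partially applied change.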
It is not absolutely necessary to use general-purpose DBMS software to implement
a computerized database. It is possible to write a customized set of programs to cre-
ate and maintain the database, in effect creating a special-purpose DBMS software
for a specific application, such as airlines reservations. In either case—whether we
use a general-purpose DBMS or not—a considerable amount of complex software
is deployed. In fact, most DBMSs are very complex software systems.
To complete our initial definitions, we will call the database and DBMS software
together a database system. Figure 1.1 illustrates some of the concepts we have
discussed so far.
2 The term query, originally meaning a question or an inquiry, is sometimes loosely used for all types of
interactions with databases, including modifying the data.
1.2 An Example
Let us consider a simple example that most readers may be familiar with: a
UNIVERSITY database for maintaining information concerning students, courses,
and grades in a university environment. Figure 1.2 shows the database structure
and a few sample data records. The database is organized as five files, each of which
stores data records of the same type.3 The STUDENT file stores data on each stu-
dent, the COURSE file stores data on each course, the SECTION file stores data on
each section of a course, the GRADE_REPORT file stores the grades that students
receive in the various sections they have completed, and the PREREQUISITE file
stores the prerequisites of each course.
3 We use the term file informally here. At a conceptual level, a file is a collection of records that may or
may not be ordered.
Figure 1.1
A simplified database system environment.
[Figure: Users/Programmers; Database System containing Application Programs/Queries and DBMS Software (Software to Process Queries/Programs, Software to Access Stored Data); Stored Database Definition (Meta-Data); Stored Database.]
Figure 1.2
A database that stores student and course information.
STUDENT
Name    Student_number    Class    Major
Smith   17                1        CS
Brown   8                 2        CS
COURSE
Course_name                  Course_number    Credit_hours    Department
Intro to Computer Science    CS1310           4               CS
Data Structures              CS3320           4               CS
Discrete Mathematics         MATH2410         3               MATH
Database                     CS3380           3               CS
SECTION
Section_identifier    Course_number    Semester    Year    Instructor
85                    MATH2410         Fall        07      King
92                    CS1310           Fall        07      Anderson
102                   CS3320           Spring      08      Knuth
112                   MATH2410         Fall        08      Chang
119                   CS1310           Fall        08      Anderson
135                   CS3380           Fall        08      Stone
GRADE_REPORT
Student_number    Section_identifier    Grade
17                112                   B
17                119                   C
8                 85                    A
8                 92                    A
8                 102                   B
8                 135                   A
PREREQUISITE
Course_number    Prerequisite_number
CS3380           CS3320
CS3380           MATH2410
CS3320           CS1310
To define this database, we must specify the structure of the records of each file by
specifying the different types of data elements to be stored in each record. In
Figure 1.2, each STUDENT record includes data to represent the student’s Name,
Student_number, Class (such as freshman or ‘1’, sophomore or ‘2’, and so forth),
and Major (such as mathematics or ‘MATH’ and computer science or ‘CS’); each
COURSE record includes data to represent the Course_name, Course_number,
Credit_hours, and Department (the department that offers the course), and so
on. We must also specify a data type for each data element within a record. For
example, we can specify that Name of STUDENT is a string of alphabetic characters,
Student_number of STUDENT is an integer, and Grade of GRADE_REPORT is a
single character from the set {‘A’, ‘B’, ‘C’, ‘D’, ‘F’, ‘I’}. We may also use a coding
scheme to represent the values of a data item. For example, in Figure 1.2 we represent
the Class of a STUDENT as 1 for freshman, 2 for sophomore, 3 for junior,
4 for senior, and 5 for graduate student.
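As one possible rendering of the record definitions described for STUDENT and GRADE_REPORT, the following sketch declares them in SQL through Python's sqlite3 module. The choice of SQLite, the specific column types, and the CHECK constraints are assumptions made for illustration; the text itself specifies only the data elements, the grade set, and the class coding scheme.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE STUDENT (
        Name           TEXT,
        Student_number INTEGER,
        -- Coding scheme: 1 through 5 stand for freshman through graduate student.
        Class          INTEGER CHECK (Class BETWEEN 1 AND 5),
        Major          TEXT
    );
    CREATE TABLE GRADE_REPORT (
        Student_number     INTEGER,
        Section_identifier INTEGER,
        -- A single character from the set given in the text.
        Grade              TEXT CHECK (Grade IN ('A', 'B', 'C', 'D', 'F', 'I'))
    );
    """
)

# The coding scheme for Class, kept here as an application-side lookup.
CLASS_CODES = {
    1: "freshman",
    2: "sophomore",
    3: "junior",
    4: "senior",
    5: "graduate student",
}

conn.execute("INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS')")
conn.execute("INSERT INTO GRADE_REPORT VALUES (17, 112, 'B')")

# A grade outside the declared set is rejected by the DBMS.
try:
    conn.execute("INSERT INTO GRADE_REPORT VALUES (17, 119, 'Z')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

smith_class = conn.execute(
    "SELECT Class FROM STUDENT WHERE Name = 'Smith'"
).fetchone()[0]
print(CLASS_CODES[smith_class], rejected)
```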
To construct the UNIVERSITY database, we store data to represent each student,
course, section, grade report, and prerequisite as a record in the appropriate file.
Notice that records in the various files may be related. For example, the record for
Smith in the STUDENT file is related to two records in the GRADE_REPORT file that
specify Smith’s grades in two sections. Similarly, each record in the PREREQUISITE
file relates two course records: one representing the course and the other represent-
ing the prerequisite. Most medium-size and large databases include many types of
records and have many relationships among the records.
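The relationships among records mentioned above can be traced with a short sketch that loads part of the Figure 1.2 data and follows the shared values between files. It assumes Python's sqlite3 module and a straightforward table-per-file layout, which is only one way to store these records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER, Class INTEGER, Major TEXT);
    CREATE TABLE GRADE_REPORT (Student_number INTEGER, Section_identifier INTEGER, Grade TEXT);
    CREATE TABLE COURSE (Course_name TEXT, Course_number TEXT, Credit_hours INTEGER, Department TEXT);
    CREATE TABLE PREREQUISITE (Course_number TEXT, Prerequisite_number TEXT);
    """
)

# Constructing: one record per student, grade report, course, and prerequisite.
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?)",
    [("Smith", 17, 1, "CS"), ("Brown", 8, 2, "CS")],
)
conn.executemany(
    "INSERT INTO GRADE_REPORT VALUES (?, ?, ?)",
    [(17, 112, "B"), (17, 119, "C"), (8, 85, "A"),
     (8, 92, "A"), (8, 102, "B"), (8, 135, "A")],
)
conn.executemany(
    "INSERT INTO COURSE VALUES (?, ?, ?, ?)",
    [("Intro to Computer Science", "CS1310", 4, "CS"),
     ("Data Structures", "CS3320", 4, "CS"),
     ("Discrete Mathematics", "MATH2410", 3, "MATH"),
     ("Database", "CS3380", 3, "CS")],
)
conn.executemany(
    "INSERT INTO PREREQUISITE VALUES (?, ?)",
    [("CS3380", "CS3320"), ("CS3380", "MATH2410"), ("CS3320", "CS1310")],
)

# Smith's STUDENT record is related to two GRADE_REPORT records.
smith_grades = conn.execute(
    """
    SELECT g.Section_identifier, g.Grade
    FROM STUDENT s
    JOIN GRADE_REPORT g ON g.Student_number = s.Student_number
    WHERE s.Name = 'Smith'
    ORDER BY g.Section_identifier
    """
).fetchall()

# Each PREREQUISITE record relates two COURSE records.
prereq_pairs = conn.execute(
    """
    SELECT c.Course_name, p.Course_name
    FROM PREREQUISITE r
    JOIN COURSE c ON c.Course_number = r.Course_number
    JOIN COURSE p ON p.Course_number = r.Prerequisite_number
    ORDER BY c.Course_name, p.Course_name
    """
).fetchall()
print(smith_grades)
print(prereq_pairs)
```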
Database manipulation involves querying and updating. Examples of queries are as
follows:
■ Retrieve the transcript—a list of all courses and grades—of ‘Smith’
■ List the names of students who took the section of the ‘Database’ course
offered in fall 2008 and their grades in that section
■ List the prerequisites of the ‘Database’ course
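The three informal queries listed above might be written as follows. This is a sketch, assuming Python's sqlite3 module, SQL as the query language, and a table-per-file layout of the Figure 1.2 data with Year stored as two-character text; none of these choices is prescribed by the text.

```python
import sqlite3

# Figure 1.2 data; the SQL schema is an assumed rendering of the five files.
SETUP = """
CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER, Class INTEGER, Major TEXT);
CREATE TABLE COURSE (Course_name TEXT, Course_number TEXT, Credit_hours INTEGER, Department TEXT);
CREATE TABLE SECTION (Section_identifier INTEGER, Course_number TEXT, Semester TEXT, Year TEXT, Instructor TEXT);
CREATE TABLE GRADE_REPORT (Student_number INTEGER, Section_identifier INTEGER, Grade TEXT);
CREATE TABLE PREREQUISITE (Course_number TEXT, Prerequisite_number TEXT);
INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS'), ('Brown', 8, 2, 'CS');
INSERT INTO COURSE VALUES
  ('Intro to Computer Science', 'CS1310', 4, 'CS'),
  ('Data Structures', 'CS3320', 4, 'CS'),
  ('Discrete Mathematics', 'MATH2410', 3, 'MATH'),
  ('Database', 'CS3380', 3, 'CS');
INSERT INTO SECTION VALUES
  (85, 'MATH2410', 'Fall', '07', 'King'),
  (92, 'CS1310', 'Fall', '07', 'Anderson'),
  (102, 'CS3320', 'Spring', '08', 'Knuth'),
  (112, 'MATH2410', 'Fall', '08', 'Chang'),
  (119, 'CS1310', 'Fall', '08', 'Anderson'),
  (135, 'CS3380', 'Fall', '08', 'Stone');
INSERT INTO GRADE_REPORT VALUES
  (17, 112, 'B'), (17, 119, 'C'), (8, 85, 'A'),
  (8, 92, 'A'), (8, 102, 'B'), (8, 135, 'A');
INSERT INTO PREREQUISITE VALUES
  ('CS3380', 'CS3320'), ('CS3380', 'MATH2410'), ('CS3320', 'CS1310');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Retrieve the transcript (all courses and grades) of 'Smith'.
transcript = conn.execute("""
    SELECT c.Course_name, g.Grade
    FROM STUDENT s
    JOIN GRADE_REPORT g ON g.Student_number = s.Student_number
    JOIN SECTION sec ON sec.Section_identifier = g.Section_identifier
    JOIN COURSE c ON c.Course_number = sec.Course_number
    WHERE s.Name = 'Smith'
    ORDER BY c.Course_name
""").fetchall()

# Students who took the 'Database' section offered in fall 2008, with grades.
fall_2008 = conn.execute("""
    SELECT s.Name, g.Grade
    FROM COURSE c
    JOIN SECTION sec ON sec.Course_number = c.Course_number
    JOIN GRADE_REPORT g ON g.Section_identifier = sec.Section_identifier
    JOIN STUDENT s ON s.Student_number = g.Student_number
    WHERE c.Course_name = 'Database' AND sec.Semester = 'Fall' AND sec.Year = '08'
""").fetchall()

# Prerequisites of the 'Database' course.
prerequisites = conn.execute("""
    SELECT p.Prerequisite_number
    FROM PREREQUISITE p
    JOIN COURSE c ON c.Course_number = p.Course_number
    WHERE c.Course_name = 'Database'
    ORDER BY p.Prerequisite_number
""").fetchall()
print(transcript, fall_2008, prerequisites)
```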
Examples of updates include the following:
■ Change the class of ‘Smith’ to sophomore
■ Create a new section for the ‘Database’ course for this semester
■ Enter a grade of ‘A’ for ‘Smith’ in the ‘Database’ section of last semester
These informal queries and updates must be specified precisely in the query lan-
guage of the DBMS before they can be processed.
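To show what specifying the updates precisely can look like, the sketch below expresses the three example updates in SQL through Python's sqlite3 module. Several details are assumptions that the informal statements leave open: the new section's identifier (145), term, and instructor are hypothetical, and "last semester" is taken to mean section 135 from Figure 1.2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER, Class INTEGER, Major TEXT);
CREATE TABLE SECTION (Section_identifier INTEGER, Course_number TEXT, Semester TEXT, Year TEXT, Instructor TEXT);
CREATE TABLE GRADE_REPORT (Student_number INTEGER, Section_identifier INTEGER, Grade TEXT);
INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS'), ('Brown', 8, 2, 'CS');
INSERT INTO SECTION VALUES (135, 'CS3380', 'Fall', '08', 'Stone');
INSERT INTO GRADE_REPORT VALUES (17, 112, 'B'), (17, 119, 'C'), (8, 135, 'A');
""")

with conn:  # commit the three updates together, or roll all of them back
    # Change the class of 'Smith' to sophomore (code 2).
    conn.execute("UPDATE STUDENT SET Class = 2 WHERE Name = 'Smith'")
    # Create a new section of 'Database' (CS3380); the identifier, term,
    # and instructor are hypothetical values.
    conn.execute(
        "INSERT INTO SECTION VALUES (145, 'CS3380', 'Spring', '09', 'Stone')"
    )
    # Enter a grade of 'A' for Smith, assuming "last semester" is section 135.
    conn.execute("INSERT INTO GRADE_REPORT VALUES (17, 135, 'A')")

smith_class = conn.execute(
    "SELECT Class FROM STUDENT WHERE Name = 'Smith'"
).fetchone()[0]
database_sections = [
    row[0]
    for row in conn.execute(
        "SELECT Section_identifier FROM SECTION"
        " WHERE Course_number = 'CS3380' ORDER BY Section_identifier"
    )
]
smith_grade = conn.execute(
    "SELECT Grade FROM GRADE_REPORT"
    " WHERE Student_number = 17 AND Section_identifier = 135"
).fetchone()[0]
print(smith_class, database_sections, smith_grade)
```

The number of choices that had to be made before the statements could run is the point of the exercise: an informal request such as "this semester" has no meaning to the DBMS until it is resolved to concrete values.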
At this stage, it is useful to describe the database as part of a larger undertaking
known as an information system within an organization. The Information Tech-
nology (IT) department within an organization designs and maintains an informa-
tion system consisting of various computers, storage systems, application software,
and databases. Design of a new application for an existing database or design of a
brand new database starts off with a phase called requirements specification and
analysis. These requirements are documented in detail and transformed into a
conceptual design that can be represented and manipulated using some comput-
erized tools so that it can be easily maintained, modified, and transformed into a
database implementation. (We will introduce a model called the Entity-Relation-
ship model in Chapter 3 that is used for this purpose.) The design is then translated
to a logical design that can be expressed in a data model implemented in a com-
mercial DBMS. (Various types of DBMSs are discussed throughout the text, with an
emphasis on relational DBMSs in Chapters 5 through 9.)
The final stage is physical design, during which further specifications are provided for
storing and accessing the database. The database design is implemented, populated
with actual data, and continuously maintained to reflect the state of the miniworld.
1.3 Characteristics of the Database Approach
A number of characteristics distinguish the database approach from the much
older approach of writing customized programs to access data stored in files. In
traditional file processing, each user defines and implements the files needed for a
specific software application as part of programming the application. For example,
one user, the grade reporting office, may keep files on students and their grades.
Programs to print a student’s transcript and to enter new grades are implemented
as part of the application. A second user, the accounting office, may keep track of
students’ fees and their payments. Although both users are interested in data about
students, each user maintains separate files—and programs to manipulate these
files—because each requires some data not available from the other user’s files.
This redundancy in defining and storing data results in wasted storage space and
in redundant efforts to maintain common up-to-date data.
In the database approach, a single repository maintains data that is defined once
and then accessed by various users repeatedly through queries, transactions, and
application programs. The main characteristics of the database approach versus the
file-processing approach are the following:
■ Self-describing nature of a database system
■ Insulation between programs and data, and data abstraction
■ Support of multiple views of the data
■ Sharing of data and multiuser transaction processing
We describe each of these characteristics in a separate section. We will discuss addi-
tional characteristics of database systems in Sections 1.6 through 1.8.
1.3.1 Self-Describing Nature of a Database System
A fundamental characteristic of the database approach is that the database system
contains not only the database itself but also a complete definition or description of
the database structure and constraints. This definition is stored in the DBMS cata-
log, which contains information such as the structure of each file, the type and stor-
age format of each data item, and various constraints on the data. The information
stored in the catalog is called meta-data, and it describes the structure of the
primary database (Figure 1.1). It is important to note that some newer types of
database systems, known as NOSQL systems, do not require meta-data. Rather, the data
is stored as self-describing data that includes the data item names and data values
together in one structure (see Chapter 24).
The catalog is used by the DBMS software and also by database users who need
information about the database structure. A general-purpose DBMS software
package is not written for a specific database application. Therefore, it must refer
to the catalog to know the structure of the files in a specific database, such as the
type and format of data it will access. The DBMS software must work equally well
with any number of database applications—for example, a university database, a
banking database, or a company database—as long as the database definition is
stored in the catalog.
In traditional file processing, data definition is typically part of the application pro-
grams themselves. Hence, these programs are constrained to work with only one
specific database, whose structure is declared in the application programs. For
example, an application program written in C++ may have struct or class declara-
tions. Whereas file-processing software can access only specific databases, DBMS
software can access diverse databases by extracting the database definitions from
the catalog and using these definitions.
For the example shown in Figure 1.2, the DBMS catalog will store the definitions of
all the files shown. Figure 1.3 shows some entries in a database catalog. Whenever a
request is made to access, say, the Name of a STUDENT record, the DBMS software
refers to the catalog to determine the structure of the STUDENT file and the position
and size of the Name data item within a STUDENT record. By contrast, in a typical
file-processing application, the file structure and, in the extreme case, the exact
location of Name within a STUDENT record are already coded within each program
that accesses this data item.
Figure 1.3
An example of a database catalog for the database in Figure 1.2.

RELATIONS
Relation_name     No_of_columns
STUDENT           4
COURSE            4
SECTION           5
GRADE_REPORT      3
PREREQUISITE      2

COLUMNS
Column_name           Data_type         Belongs_to_relation
Name                  Character (30)    STUDENT
Student_number        Character (4)     STUDENT
Class                 Integer (1)       STUDENT
Major                 Major_type        STUDENT
Course_name           Character (10)    COURSE
Course_number         XXXXNNNN          COURSE
….                    ….                ….
….                    ….                ….
….                    ….                ….
Prerequisite_number   XXXXNNNN          PREREQUISITE

Note: Major_type is defined as an enumerated type with all known majors.
XXXXNNNN is used to define a type with four alphabetic characters followed by four numeric digits.
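The idea of a catalog that can be queried like ordinary data can be sketched with Python's built-in sqlite3 module. This is only an illustration under our own assumptions: the text does not prescribe a particular DBMS, the column types are adapted loosely from Figure 1.3, and Major is simplified to plain text. In SQLite the catalog is exposed through the sqlite_master table and the PRAGMA table_info command.

```python
import sqlite3

# In-memory database; the schema loosely follows the STUDENT entries of Figure 1.3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT ("
    " Name CHAR(30), Student_number CHAR(4), Class INTEGER, Major TEXT)"
)

# The catalog is read with an ordinary query.
relations = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(STUDENT)")]

print(relations)
print(columns)
```

A general-purpose program could use the same two queries against any database file, which is the sense in which the DBMS software is not tied to one application.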
1.3.2 Insulation between Programs and Data, and Data Abstraction
In traditional file processing, the structure of data files is embedded in the applica-
tion programs, so any changes to the structure of a file may require changing all
programs that access that file. By contrast, DBMS access programs do not require
such changes in most cases. The structure of data files is stored in the DBMS cata-
log separately from the access programs. We call this property program-data
independence.
For example, a file access program may be written in such a way that it can access
only STUDENT records of the structure shown in Figure 1.4. If we want to add
another piece of data to each STUDENT record, say the Birth_date, such a program
will no longer work and must be changed. By contrast, in a DBMS environment, we
only need to change the description of STUDENT records in the catalog (Figure 1.3)
to reflect the inclusion of the new data item Birth_date; no programs are changed.
The next time a DBMS program refers to the catalog, the new structure of
STUDENT records will be accessed and used.
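A minimal sketch of program-data independence, again using Python's sqlite3 module as a stand-in DBMS (our assumption, not something the text specifies). The function student_name and the sample row are hypothetical. The point is that the access program names the data item it wants, so adding Birth_date to the stored description does not require changing the program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, Student_number TEXT, Class INTEGER, Major TEXT)")
conn.execute("INSERT INTO STUDENT VALUES ('Smith', '17', 1, 'CS')")  # illustrative row


def student_name(conn, number):
    """Access program: refers to the Name data item by name, not by position."""
    row = conn.execute(
        "SELECT Name FROM STUDENT WHERE Student_number = ?", (number,)).fetchone()
    return row[0] if row else None


before = student_name(conn, "17")

# Change only the description of STUDENT records held by the DBMS.
conn.execute("ALTER TABLE STUDENT ADD COLUMN Birth_date TEXT")

after = student_name(conn, "17")  # the same program still works unchanged
print(before, after)
```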
In some types of database systems, such as object-oriented and object-relational
systems (see Chapter 12), users can define operations on data as part of the database
definitions. An operation (also called a function or method) is specified in two
parts. The interface (or signature) of an operation includes the operation name and
the data types of its arguments (or parameters). The implementation (or method) of
the operation is specified separately and can be changed without affecting the inter-
face. User application programs can operate on the data by invoking these opera-
tions through their names and arguments, regardless of how the operations are
implemented. This may be termed program-operation independence.
The characteristic that allows program-data independence and program-operation
independence is called data abstraction. A DBMS provides users with a conceptual
representation of data that does not include many of the details of how the data is
stored or how the operations are implemented. Informally, a data model is a type of
data abstraction that is used to provide this conceptual representation. The data
model uses logical concepts, such as objects, their properties, and their interrela-
tionships, that may be easier for most users to understand than computer storage
concepts. Hence, the data model hides storage and implementation details that are
not of interest to most database users.
Looking at the example in Figures 1.2 and 1.3, the internal implementation of the
STUDENT file may be defined by its record length—the number of characters
(bytes) in each record—and each data item may be specified by its starting byte
within a record and its length in bytes. The STUDENT record would thus be repre-
sented as shown in Figure 1.4. But a typical database user is not concerned with the
location of each data item within a record or its length; rather, the user is concerned
that when a reference is made to Name of STUDENT, the correct value is returned.
A conceptual representation of the STUDENT records is shown in Figure 1.2. Many
other details of file storage organization—such as the access paths specified on a
file—can be hidden from database users by the DBMS; we discuss storage details in
Chapters 16 and 17.
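To make the contrast concrete, the following sketch shows what a file-processing program has to know in order to read a STUDENT record stored in the layout of Figure 1.4. The helper names pack and unpack and the sample values are our own. Every program written this way carries the starting positions and lengths itself, which is exactly the detail a DBMS hides behind the conceptual representation.

```python
# (data item name, starting position (1-based), length), as listed in Figure 1.4
LAYOUT = [("Name", 1, 30), ("Student_number", 31, 4), ("Class", 35, 1), ("Major", 36, 4)]


def pack(values):
    """Build one fixed-length record, padding each data item with blanks."""
    return "".join(str(values[name]).ljust(length)[:length] for name, _, length in LAYOUT)


def unpack(record):
    """Cut a record apart using the hard-coded positions and lengths."""
    return {name: record[start - 1:start - 1 + length].rstrip()
            for name, start, length in LAYOUT}


record = pack({"Name": "Smith", "Student_number": "17", "Class": 1, "Major": "CS"})
fields = unpack(record)
print(len(record))     # record length in characters
print(fields["Name"])
```

If a Birth_date item were inserted into the record, LAYOUT and every program that embeds it would have to change.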
In the database approach, the detailed structure and organization of each file are
stored in the catalog. Database users and application programs refer to the concep-
tual representation of the files, and the DBMS extracts the details of file storage
from the catalog when these are needed by the DBMS file access modules. Many
data models can be used to provide this data abstraction to database users. A major
part of this text is devoted to presenting various data models and the concepts they
use to abstract the representation of data.
In object-oriented and object-relational databases, the abstraction process includes
not only the data structure but also the operations on the data. These operations
provide an abstraction of miniworld activities commonly understood by the users.
For example, an operation CALCULATE_GPA can be applied to a STUDENT object
to calculate the grade point average. Such operations can be invoked by the user
queries or application programs without having to know the details of how the
operations are implemented.
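Program-operation independence can be sketched in a few lines of Python. The Student class below is hypothetical, and the 4-point grade scale and unweighted average are our assumptions, since the text does not say how the grade point average is computed. Callers depend only on the interface of calculate_gpa (no arguments, returns a number), so the body could be replaced, for example by a credit-weighted formula, without touching them.

```python
class Student:
    # Assumed 4-point scale; not defined in the text.
    _POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

    def __init__(self, name, grades):
        self.name = name
        self._grades = list(grades)

    def calculate_gpa(self):
        """Interface: no arguments, returns a float. The implementation may change."""
        if not self._grades:
            return 0.0
        return sum(self._POINTS[g] for g in self._grades) / len(self._grades)


# Grades taken from the TRANSCRIPT view in Figure 1.5(a).
brown = Student("Brown", ["A", "A", "B", "A"])
smith = Student("Smith", ["C", "B"])
print(brown.calculate_gpa(), smith.calculate_gpa())
```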
1.3.3 Support of Multiple Views of the Data
A database typically has many types of users, each of whom may require a different
perspective or view of the database. A view may be a subset of the database or it may
contain virtual data that is derived from the database files but is not explicitly stored.
Some users may not need to be aware of whether the data they refer to is stored or
derived. A multiuser DBMS whose users have a variety of distinct applications must
provide facilities for defining multiple views. For example, one user of the database
of Figure 1.2 may be interested only in accessing and printing the transcript of each
student; the view for this user is shown in Figure 1.5(a). A second user, who is inter-
ested only in checking that students have taken all the prerequisites of each course
for which the student registers, may require the view shown in Figure 1.5(b).
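A view of this kind can be sketched with Python's sqlite3 module (our choice of DBMS for illustration). The tables are cut down to the few columns needed, and the TRANSCRIPT view here is a simplified version of Figure 1.5(a) that omits Semester, Year, and Section_id. The view's rows are derived from the stored tables each time it is queried and are not stored separately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER);
CREATE TABLE GRADE_REPORT (Student_number INTEGER, Course_number TEXT, Grade TEXT);
INSERT INTO STUDENT VALUES ('Smith', 17), ('Brown', 8);
INSERT INTO GRADE_REPORT VALUES
  (17, 'MATH2410', 'B'), (17, 'CS1310', 'C'), (8, 'CS3380', 'A');

-- Virtual data: defined once, derived from the base tables on demand.
CREATE VIEW TRANSCRIPT AS
  SELECT s.Name AS Student_name, g.Course_number, g.Grade
  FROM STUDENT s JOIN GRADE_REPORT g ON g.Student_number = s.Student_number;
""")

# The transcript user queries the view as if it were a stored file.
rows = conn.execute(
    "SELECT Course_number, Grade FROM TRANSCRIPT"
    " WHERE Student_name = 'Smith' ORDER BY Course_number").fetchall()
print(rows)
```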
1.3.4 Sharing of Data and Multiuser Transaction Processing
A multiuser DBMS, as its name implies, must allow multiple users to access the
database at the same time. This is essential if data for multiple applications is to be
integrated and maintained in a single database. The DBMS must include concurrency
control software to ensure that several users trying to update the same data
do so in a controlled manner so that the result of the updates is correct. For
example, when several reservation agents try to assign a seat on an airline flight, the
DBMS should ensure that each seat can be accessed by only one agent at a time for
assignment to a passenger. These types of applications are generally called online
transaction processing (OLTP) applications. A fundamental role of multiuser
DBMS software is to ensure that concurrent transactions operate correctly and
efficiently.

Data Item Name    Starting Position in Record    Length in Characters (bytes)
Name              1                              30
Student_number    31                             4
Class             35                             1
Major             36                             4

Figure 1.4
Internal storage format for a STUDENT record, based on the database catalog in Figure 1.3.
The concept of a transaction has become central to many database applications. A
transaction is an executing program or process that includes one or more database
accesses, such as reading or updating of database records. Each transaction is sup-
posed to execute a logically correct database access if executed in its entirety with-
out interference from other transactions. The DBMS must enforce several
transaction properties. The isolation property ensures that each transaction
appears to execute in isolation from other transactions, even though hundreds of
transactions may be executing concurrently. The atomicity property ensures that
either all the database operations in a transaction are executed or none are. We dis-
cuss transactions in detail in Part 9.
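The atomicity property can be illustrated with a small sketch in Python's sqlite3 module. The SEAT table, the seat numbers, and the function assign_seats are hypothetical. The sketch uses a single connection, so it demonstrates "all or none" behavior only; it does not exercise concurrency control between several agents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SEAT (Seat_no TEXT PRIMARY KEY, Passenger TEXT)")
conn.execute("INSERT INTO SEAT VALUES ('12A', NULL), ('12B', NULL)")
conn.commit()


def assign_seats(conn, passenger, seat_numbers):
    """Assign all requested seats, or none of them."""
    with conn:  # commits on success, rolls back if an exception escapes the block
        for seat_no in seat_numbers:
            cur = conn.execute(
                "UPDATE SEAT SET Passenger = ? WHERE Seat_no = ? AND Passenger IS NULL",
                (passenger, seat_no))
            if cur.rowcount != 1:
                raise ValueError(f"seat {seat_no} is not available")


assign_seats(conn, "first customer", ["12B"])
try:
    # 12A is free but 12B is taken, so the whole transaction is undone.
    assign_seats(conn, "second customer", ["12A", "12B"])
except ValueError:
    pass

seats = dict(conn.execute("SELECT Seat_no, Passenger FROM SEAT"))
print(seats)
```

Seat 12A stays unassigned after the failed request, even though its update had already been executed inside the transaction.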
The preceding characteristics are important in distinguishing a DBMS from tradi-
tional file-processing software. In Section 1.6 we discuss additional features that
characterize a DBMS. First, however, we categorize the different types of people
who work in a database system environment.
(a) TRANSCRIPT
Student_name    Student_transcript
                Course_number    Grade    Semester    Year    Section_id
Smith           CS1310           C        Fall        08      119
                MATH2410         B        Fall        08      112
Brown           MATH2410         A        Fall        07      85
                CS1310           A        Fall        07      92
                CS3320           B        Spring      08      102
                CS3380           A        Fall        08      135

(b) COURSE_PREREQUISITES
Course_name        Course_number    Prerequisites
Database           CS3380           CS3320
                                    MATH2410
Data Structures    CS3320           CS1310

Figure 1.5
Two views derived from the database in Figure 1.2. (a) The TRANSCRIPT view.
(b) The COURSE_PREREQUISITES view.
1.4 Actors on the Scene
For a small personal database, such as the list of addresses discussed in Section 1.1,
one person typically defines, constructs, and manipulates the database, and there is
no sharing. However, in large organizations, many people are involved in the
design, use, and maintenance of a large database with hundreds or thousands of
users. In this section we identify the people whose jobs involve the day-to-day use
of a large database; we call them the actors on the scene. In Section 1.5 we consider
people who may be called workers behind the scene—those who work to maintain
the database system environment but who are not actively interested in the data-
base contents as part of their daily job.
1.4.1 Database Administrators
In any organization where many people use the same resources, there is a need for
a chief administrator to oversee and manage these resources. In a database environ-
ment, the primary resource is the database itself, and the secondary resource is the
DBMS and related software. Administering these resources is the responsibility of
the database administrator (DBA). The DBA is responsible for authorizing access
to the database, coordinating and monitoring its use, and acquiring software and
hardware resources as needed. The DBA is accountable for problems such as secu-
rity breaches and poor system response time. In large organizations, the DBA is
assisted by a staff that carries out these functions.
1.4.2 Database Designers
Database designers are responsible for identifying the data to be stored in the data-
base and for choosing appropriate structures to represent and store this data. These
tasks are mostly undertaken before the database is actually implemented and popu-
lated with data. It is the responsibility of database designers to communicate with
all prospective database users in order to understand their requirements and to cre-
ate a design that meets these requirements. In many cases, the designers are on the
staff of the DBA and may be assigned other staff responsibilities after the database
design is completed. Database designers typically interact with each potential group
of users and develop views of the database that meet the data and processing
requirements of these groups. Each view is then analyzed and integrated with the
views of other user groups. The final database design must be capable of supporting
the requirements of all user groups.
1.4.3 End Users
End users are the people whose jobs require access to the database for querying,
updating, and generating reports; the database primarily exists for their use. There
are several categories of end users:
■ Casual end users occasionally access the database, but they may need differ-
ent information each time. They use a sophisticated database query interface
to specify their requests and are typically middle- or high-level managers or
other occasional browsers.
■ Naive or parametric end users make up a sizable portion of database
end users. Their main job function revolves around constantly querying
and updating the database, using standard types of queries and updates—
called canned transactions—that have been carefully programmed and
tested. Many of these tasks are now available as mobile apps for use with
mobile devices. The tasks that such users perform are varied. A few
examples are:
□ Bank customers and tellers check account balances and post withdrawals
and deposits.
□ Reservation agents or customers for airlines, hotels, and car rental
companies check availability for a given request and make reservations.
□ Employees at receiving stations for shipping companies enter package
identifications via bar codes and descriptive information through buttons
to update a central database of received and in-transit packages.
□ Social media users post and read items on social media Web sites.
■ Sophisticated end users include engineers, scientists, business analysts, and
others who thoroughly familiarize themselves with the facilities of the DBMS
in order to implement their own applications to meet their complex require-
ments.
■ Standalone users maintain personal databases by using ready-made pro-
gram packages that provide easy-to-use menu-based or graphics-based
interfaces. An example is the user of a financial software package that stores
a variety of personal financial data.
A typical DBMS provides multiple facilities to access a database. Naive end users
need to learn very little about the facilities provided by the DBMS; they simply have
to understand the user interfaces of the mobile apps or standard transactions
designed and implemented for their use. Casual users learn only a few facilities that
they may use repeatedly. Sophisticated users try to learn most of the DBMS facilities
in order to achieve their complex requirements. Standalone users typically become
very proficient in using a specific software package.
1.4.4 System Analysts and Application Programmers (Software Engineers)
System analysts determine the requirements of end users, especially naive and
parametric end users, and develop specifications for standard canned transactions
that meet these requirements. Application programmers implement these specifi-
cations as programs; then they test, debug, document, and maintain these canned
transactions. Such analysts and programmers—commonly referred to as software
developers or software engineers—should be familiar with the full range of capa-
bilities provided by the DBMS to accomplish their tasks.
1.5 Workers behind the Scene
In addition to those who design, use, and administer a database, others are associ-
ated with the design, development, and operation of the DBMS software and system
environment. These persons are typically not interested in the database content
itself. We call them the workers behind the scene, and they include the following
categories:
■ DBMS system designers and implementers design and implement the
DBMS modules and interfaces as a software package. A DBMS is a very
complex software system that consists of many components, or modules,
including modules for implementing the catalog, query language process-
ing, interface processing, accessing and buffering data, controlling concur-
rency, and handling data recovery and security. The DBMS must interface
with other system software, such as the operating system and compilers for
various programming languages.
■ Tool developers design and implement tools—the software packages that
facilitate database modeling and design, database system design, and
improved performance. Tools are optional packages that are often pur-
chased separately. They include packages for database design, performance
monitoring, natural language or graphical interfaces, prototyping, simula-
tion, and test data generation. In many cases, independent software vendors
develop and market these tools.
■ Operators and maintenance personnel (system administration personnel)
are responsible for the actual running and maintenance of the hardware and
software environment for the database system.
Although these categories of workers behind the scene are instrumental in making
the database system available to end users, they typically do not use the database
contents for their own purposes.
1.6 Advantages of Using the DBMS Approach
In this section we discuss some additional advantages of using a DBMS and the
capabilities that a good DBMS should possess. These capabilities are in addition to
the four main characteristics discussed in Section 1.3. The DBA must utilize these
capabilities to accomplish a variety of objectives related to the design, administra-
tion, and use of a large multiuser database.
1.6.1 Controlling Redundancy
In traditional software development utilizing file processing, every user group
maintains its own files for handling its data-processing applications. For example,
consider the UNIVERSITY database example of Section 1.2; here, two groups of
users might be the course registration personnel and the accounting office. In the
traditional approach, each group independently keeps files on students. The
accounting office keeps data on registration and related billing information,
whereas the registration office keeps track of student courses and grades. Other
groups may further duplicate some or all of the same data in their own files.
This redundancy in storing the same data multiple times leads to several problems.
First, there is the need to perform a single logical update—such as entering data on
a new student—multiple times: once for each file where student data is recorded.
This leads to duplication of effort. Second, storage space is wasted when the same
data is stored repeatedly, and this problem may be serious for large databases.
Third, files that represent the same data may become inconsistent. This may happen
because an update is applied to some of the files but not to others. Even if an
update—such as adding a new student—is applied to all the appropriate files, the
data concerning the student may still be inconsistent because the updates are applied
independently by each user group. For example, one user group may enter a stu-
dent’s birth date erroneously as ‘JAN-19-1988’, whereas the other user groups may
enter the correct value of ‘JAN-29-1988’.
In the database approach, the views of different user groups are integrated during
database design. Ideally, we should have a database design that stores each logical
data item—such as a student’s name or birth date—in only one place in the data-
base. This is known as data normalization, and it ensures consistency and saves
storage space (data normalization is described in Part 6 of the text).
However, in practice, it is sometimes necessary to use controlled redundancy to
improve the performance of queries. For example, we may store Student_name and
Course_number redundantly in a GRADE_REPORT file (Figure 1.6(a)) because
whenever we retrieve a GRADE_REPORT record, we want to retrieve the student
name and course number along with the grade, student number, and section
identifier. By placing all the data together, we do not have to search multiple files to
collect this data. This is known as denormalization. In such cases, the DBMS should
have the capability to control this redundancy in order to prohibit
inconsistencies among the files. This may be done by automatically checking that the
Student_name–Student_number values in any GRADE_REPORT record in
Figure 1.6(a) match one of the Name–Student_number values of a STUDENT record
(Figure 1.2). Similarly, the Section_identifier–Course_number values in GRADE_REPORT
can be checked against SECTION records. Such checks can be specified to the DBMS
during database design and automatically enforced by the DBMS whenever the
GRADE_REPORT file is updated. Figure 1.6(b) shows a GRADE_REPORT record that
is inconsistent with the STUDENT file in Figure 1.2; this kind of error may be entered
if the redundancy is not controlled. Can you tell which part is inconsistent?

(a) GRADE_REPORT
Student_number    Student_name    Section_identifier    Course_number    Grade
17                Smith           112                   MATH2410         B
17                Smith           119                   CS1310           C
8                 Brown           85                    MATH2410         A
8                 Brown           92                    CS1310           A
8                 Brown           102                   CS3320           B
8                 Brown           135                   CS3380           A

(b) GRADE_REPORT
Student_number    Student_name    Section_identifier    Course_number    Grade
17                Brown           112                   MATH2410         B

Figure 1.6
Redundant storage of Student_name and Course_number in GRADE_REPORT.
(a) Consistent data. (b) Inconsistent record.
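The kind of check described above can be sketched as a query, here in Python's sqlite3 module (our choice for illustration). The STUDENT rows are inferred from the consistent data in Figure 1.6(a), and only a few GRADE_REPORT rows are loaded. Note that this is an after-the-fact audit; a DBMS that controls the redundancy would apply an equivalent check at the time GRADE_REPORT is updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE STUDENT (Student_number INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE GRADE_REPORT (
  Student_number INTEGER, Student_name TEXT,
  Section_identifier INTEGER, Course_number TEXT, Grade TEXT);
INSERT INTO STUDENT VALUES (17, 'Smith'), (8, 'Brown');
INSERT INTO GRADE_REPORT VALUES
  (17, 'Smith', 119, 'CS1310', 'C'),
  (8, 'Brown', 85, 'MATH2410', 'A'),
  (17, 'Brown', 112, 'MATH2410', 'B');  -- the record of Figure 1.6(b)
""")

# Report GRADE_REPORT rows whose (number, name) pair matches no STUDENT record.
inconsistent = conn.execute("""
  SELECT g.Student_number, g.Student_name, g.Section_identifier
  FROM GRADE_REPORT g
  LEFT JOIN STUDENT s
    ON s.Student_number = g.Student_number AND s.Name = g.Student_name
  WHERE s.Student_number IS NULL
""").fetchall()
print(inconsistent)
```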
1.6.2 Restricting Unauthorized Access
When multiple users share a large database, it is likely that most users will not be
authorized to access all information in the database. For example, financial data
such as salaries and bonuses is often considered confidential, and only autho-
rized persons are allowed to access such data. In addition, some users may only
be permitted to retrieve data, whereas others are allowed to retrieve and update.
Hence, the type of access operation—retrieval or update—must also be con-
trolled. Typically, users or user groups are given account numbers protected by
passwords, which they can use to gain access to the database. A DBMS should
provide a security and authorization subsystem, which the DBA uses to create
accounts and to specify account restrictions. Then, the DBMS should enforce
these restrictions automatically. Notice that we can apply similar controls to the
DBMS software. For example, only the DBA’s staff may be allowed to use certain
privileged software, such as the software for creating new accounts. Similarly,
parametric users may be allowed to access the database only through the pre-
defined apps or canned transactions developed for their use. We discuss data-
base security and authorization in Chapter 30.
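Controlling the type of access operation can be sketched with the authorizer callback of Python's sqlite3 module. This is a deliberately small stand-in for a real security and authorization subsystem: SQLite has no user accounts, and the EMPLOYEE table, the salary value, and the function retrieval_only are hypothetical. The callback is consulted for each operation a statement needs and either allows or denies it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Name TEXT, Salary INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('Smith', 30000)")
conn.commit()


def retrieval_only(action, arg1, arg2, db_name, source):
    """Allow SELECT statements and column reads; deny every other operation."""
    if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ):
        return sqlite3.SQLITE_OK
    return sqlite3.SQLITE_DENY


# From here on, this connection behaves like a retrieval-only account.
conn.set_authorizer(retrieval_only)

names = conn.execute("SELECT Name FROM EMPLOYEE").fetchall()  # retrieval is permitted
try:
    conn.execute("UPDATE EMPLOYEE SET Salary = 0")            # update is refused
    update_allowed = True
except sqlite3.DatabaseError:
    update_allowed = False
print(names, update_allowed)
```

The same callback receives the table and column names for each read, so it could also be extended to hide a confidential column such as Salary.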
1.6.3 Providing Persistent Storage for Program Objects
Databases can be used to provide persistent storage for program objects and data
structures. This is one of the main reasons for object-oriented database systems
(see Chapter 12). Programming languages typically have complex data structures,
such as structs or class definitions in C++ or Java. The values of program variables
or objects are discarded once a program terminates, unless the programmer explic-
itly stores them in permanent files, which often involves converting these complex
structures into a format suitable for file storage. When the need arises to read this
data once more, the programmer must convert from the file format to the program
variable or object structure. Object-oriented database systems are compatible with
programming languages such as C++ and Java, and the DBMS software auto-
matically performs any necessary conversions. Hence, a complex object in C++
can be stored permanently in an object-oriented DBMS. Such an object is said to
be persistent, since it survives the termination of program execution and can
later be directly retrieved by another program.
The persistent storage of program objects and data structures is an important func-
tion of database systems. Traditional database systems often suffered from the so-
called impedance mismatch problem, since the data structures provided by the
DBMS were incompatible with the programming language’s data structures.
Object-oriented database systems typically offer data structure compatibility with
one or more object-oriented programming languages.
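The conversion work behind the impedance mismatch can be seen in a short Python sketch. The Student dataclass and its fields are hypothetical, and sqlite3 stands in for a traditional (non-object-oriented) DBMS. The programmer flattens the object into a row when storing it and rebuilds the object by hand when reading it back; an object-oriented DBMS would perform these steps automatically. An in-memory database is used here for brevity, so a file path would be needed for the data to outlive the program.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class Student:
    name: str
    student_number: int
    major: str


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER, Major TEXT)")

# Object -> row: the program converts the object into a storable format.
original = Student("Smith", 17, "CS")
conn.execute("INSERT INTO STUDENT VALUES (?, ?, ?)", astuple(original))

# Row -> object: and converts it back when the data is read again.
row = conn.execute("SELECT Name, Student_number, Major FROM STUDENT").fetchone()
restored = Student(*row)
print(restored == original)
```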
1.6.4 Providing Storage Structures and Search Techniques for Efficient Query Processing
Database systems must provide capabilities for efficiently executing queries and
updates. Because the database is typically stored on disk, the DBMS must provide
specialized data structures and search techniques to speed up disk search for the
desired records. Auxiliary files called indexes are often used for this purpose.
Indexes are typically based on tree data structures or hash data structures that are
suitably modified for disk search. In order to process the database records needed
by a particular query, those records must be copied from disk to main memory.
Therefore, the DBMS often has a buffering or caching module that maintains parts
of the database in main memory buffers. In general, the operating system is respon-
sible for disk-to-memory buffering. However, because data buffering is crucial to
the DBMS performance, most DBMSs do their own data buffering.
The query processing and optimization module of the DBMS is responsible for
choosing an efficient query execution plan for each query based on the existing
storage structures. The choice of which indexes to create and maintain is part of
physical database design and tuning, which is one of the responsibilities of the DBA
staff. We discuss query processing and optimization in Part 8 of the text.
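The effect of an index on the chosen execution plan can be observed with SQLite's EXPLAIN QUERY PLAN statement, used here through Python's sqlite3 module. The index name idx_student_name and the helper plan are our own, and the exact wording of the plan text differs between SQLite versions, so the sketch only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER, Major TEXT)")


def plan(conn, sql):
    """Return the optimizer's plan description for a query as one string."""
    # The fourth column of each EXPLAIN QUERY PLAN row holds the description.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM STUDENT WHERE Name = 'Smith'"

before = plan(conn, query)   # without an index: a scan of the whole table
conn.execute("CREATE INDEX idx_student_name ON STUDENT (Name)")
after = plan(conn, query)    # with the index: a search through idx_student_name

print(before)
print(after)
```

The query text is identical in both cases; only the physical design changed, and the optimizer picked up the new access path on its own.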
1.6.5 Providing Backup and Recovery
A DBMS must provide facilities for recovering from hardware or software failures.
The backup and recovery subsystem of the DBMS is responsible for recovery. For
example, if the computer system fails in the middle of a complex update transac-
tion, the recovery subsystem is responsible for making sure that the database is
restored to the state it was in before the transaction started executing. Disk backup
is also necessary in case of a catastrophic disk failure. We discuss recovery and
backup in Chapter 22.
1.6.6 Providing Multiple User Interfaces
Because many types of users with varying levels of technical knowledge use a data-
base, a DBMS should provide a variety of user interfaces. These include apps for
mobile users, query languages for casual users, programming language interfaces
for application programmers, forms and command codes for parametric users,
and menu-driven interfaces and natural language interfaces for standalone users.
Both forms-style interfaces and menu-driven interfaces are commonly known as
graphical user interfaces (GUIs). Many specialized languages and environments
exist for specifying GUIs. Capabilities for providing Web GUI interfaces to a
database—or Web-enabling a database—are also quite common.
1.6.7 Representing Complex Relationships among Data
A database may include numerous varieties of data that are interrelated in many
ways. Consider the example shown in Figure 1.2. The record for ‘Brown’ in the
STUDENT file is related to four records in the GRADE_REPORT file. Similarly,
each section record is related to one course record and to a number of
GRADE_REPORT records—one for each student who completed that section. A
DBMS must have the capability to represent a variety of complex relationships
among the data, to define new relationships as they arise, and to retrieve and
update related data easily and efficiently.
1.6.8 Enforcing Integrity Constraints
Most database applications have certain integrity constraints that must hold for
the data. A DBMS should provide capabilities for defining and enforcing these
constraints. The simplest type of integrity constraint involves specifying a data
type for each data item. For example, in Figure 1.3, we specified that the value of
the Class data item within each STUDENT record must be a one-digit integer and
that the value of Name must be a string of no more than 30 alphabetic characters.
To restrict the value of Class between 1 and 5 would be an additional constraint
that is not shown in the current catalog. A more complex type of constraint that
frequently occurs involves specifying that a record in one file must be related to
records in other files. For example, in Figure 1.2, we can specify that every section
record must be related to a course record. This is known as a referential integrity
constraint. Another type of constraint specifies uniqueness on data item values,
such as every course record must have a unique value for Course_number. This is
known as a key or uniqueness constraint. These constraints are derived from the
meaning or semantics of the data and of the miniworld it represents. It is the
responsibility of the database designers to identify integrity constraints during
database design. Some constraints can be specified to the DBMS and automatically
enforced. Other constraints may have to be checked by update programs or at the
time of data entry. For typical large applications, it is customary to call such con-
straints business rules.
A data item may be entered erroneously and still satisfy the specified integrity con-
straints. For example, if a student receives a grade of ‘A’ but a grade of ‘C’ is entered
in the database, the DBMS cannot discover this error automatically because ‘C’ is a
valid value for the Grade data type. Such data entry errors can only be discovered
manually (when the student receives the grade and complains) and corrected later
by updating the database. However, a grade of ‘Z’ would be rejected automatically
by the DBMS because ‘Z’ is not a valid value for the Grade data type. When we dis-
cuss each data model in subsequent chapters, we will introduce rules that pertain to
that model implicitly. For example, in the Entity-Relationship model in Chapter 3,
a relationship must involve at least two entities. Rules that pertain to a specific data
model are called inherent rules of the data model.
1.6.9 Permitting Inferencing and Actions
Using Rules and Triggers
Some database systems provide capabilities for defining deduction rules for infer-
encing new information from the stored database facts. Such systems are called
deductive database systems. For example, there may be complex rules in the mini-
world application for determining when a student is on probation. These can be
specified declaratively as rules, which when compiled and maintained by the DBMS
can determine all students on probation. In a traditional DBMS, explicit procedural
program code would have to be written to support such applications. But if
the miniworld rules change, it is generally more convenient to change the declared
deduction rules than to recode procedural programs. In today’s relational database
systems, it is possible to associate triggers with tables. A trigger is a form of a rule
activated by updates to the table, which results in performing some additional oper-
ations to some other tables, sending messages, and so on. More involved proce-
dures to enforce rules are popularly called stored procedures; they become a part of
the overall database definition and are invoked appropriately when certain condi-
tions are met. More powerful functionality is provided by active database systems,
which provide active rules that can automatically initiate actions when certain
events and conditions occur (see Chapter 26 for introductions to active databases in
Section 26.1 and deductive databases in Section 26.5).
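As a rough sketch of a trigger in a relational system, the following uses Python's sqlite3 module. The GRADE_ALERT table, the trigger name, and the rule that a grade of 'F' should raise an alert are all assumptions made for illustration; they do not come from the text.

```python
import sqlite3

SCHEMA = """
    CREATE TABLE GRADE_REPORT (
        Student_number     INTEGER,
        Section_identifier INTEGER,
        Grade              TEXT
    );
    -- Hypothetical table that receives the trigger's side effect.
    CREATE TABLE GRADE_ALERT (
        Student_number     INTEGER,
        Section_identifier INTEGER,
        Message            TEXT
    );
    -- Fires after each insert into GRADE_REPORT; when the condition holds,
    -- it performs an additional operation on another table.
    CREATE TRIGGER flag_failing_grade
    AFTER INSERT ON GRADE_REPORT
    WHEN NEW.Grade = 'F'
    BEGIN
        INSERT INTO GRADE_ALERT
        VALUES (NEW.Student_number, NEW.Section_identifier, 'failing grade recorded');
    END;
"""

def record_grades(grades):
    """Insert grade rows and return whatever alerts the trigger produced."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO GRADE_REPORT VALUES (?, ?, ?)", grades)
    return conn.execute("SELECT * FROM GRADE_ALERT").fetchall()
```

The application only inserts into GRADE_REPORT; the alert row appears because the rule is stored with the database definition rather than coded in each update program.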
1.6.10 Additional Implications of Using
the Database Approach
This section discusses a few additional implications of using the database approach
that can benefit most organizations.
Potential for Enforcing Standards. The database approach permits the DBA to
define and enforce standards among database users in a large organization. This facil-
itates communication and cooperation among various departments, projects, and
users within the organization. Standards can be defined for names and formats of
data elements, display formats, report structures, terminology, and so on. The DBA
can enforce standards in a centralized database environment more easily than in an
environment where each user group has control of its own data files and software.
Reduced Application Development Time. A prime selling feature of the data-
base approach is that developing a new application—such as the retrieval of certain
data from the database for printing a new report—takes very little time. Designing
and implementing a large multiuser database from scratch may take more time
than writing a single specialized file application. However, once a database is up
and running, substantially less time is generally required to create new applications
using DBMS facilities. Development time using a DBMS is estimated to be one-
sixth to one-fourth of that for a file system.
Flexibility. It may be necessary to change the structure of a database as require-
ments change. For example, a new user group may emerge that needs information
not currently in the database. In response, it may be necessary to add a file to the
database or to extend the data elements in an existing file. Modern DBMSs allow
certain types of evolutionary changes to the structure of the database without affect-
ing the stored data and the existing application programs.
Availability of Up-to-Date Information. A DBMS makes the database available
to all users. As soon as one user’s update is applied to the database, all other users
can immediately see this update. This availability of up-to-date information is
essential for many transaction-processing applications, such as reservation systems
or banking databases, and it is made possible by the concurrency control and recov-
ery subsystems of a DBMS.
Economies of Scale. The DBMS approach permits consolidation of data and
applications, thus reducing the amount of wasteful overlap between activities of
data-processing personnel in different projects or departments as well as redundan-
cies among applications. This enables the whole organization to invest in more
powerful processors, storage devices, or networking gear, rather than having each
department purchase its own (lower performance) equipment. This reduces overall
costs of operation and management.
1.7 A Brief History of Database Applications
We now give a brief historical overview of the applications that use DBMSs and
how these applications provided the impetus for new types of database systems.
1.7.1 Early Database Applications Using Hierarchical
and Network Systems
Many early database applications maintained records in large organizations such as
corporations, universities, hospitals, and banks. In many of these applications,
there were large numbers of records of similar structure. For example, in a univer-
sity application, similar information would be kept for each student, each course,
each grade record, and so on. There were also many types of records and many
interrelationships among them.
One of the main problems with early database systems was the intermixing of con-
ceptual relationships with the physical storage and placement of records on disk.
Hence, these systems did not provide sufficient data abstraction and program-data
independence capabilities. For example, the grade records of a particular student
could be physically stored next to the student record. Although this provided very
efficient access for the original queries and transactions that the database was
designed to handle, it did not provide enough flexibility to access records efficiently
when new queries and transactions were identified. In particular, new queries that
required a different storage organization for efficient processing were quite difficult
to implement efficiently. It was also laborious to reorganize the database when
changes were made to the application’s requirements.
Another shortcoming of early systems was that they provided only programming
language interfaces. This made it time-consuming and expensive to implement
new queries and transactions, since new programs had to be written, tested, and
debugged. Most of these database systems were implemented on large and
expensive mainframe computers starting in the mid-1960s and continuing
through the 1970s and 1980s. The main types of early systems were based on
three main paradigms: hierarchical systems, network model–based systems, and
inverted file systems.
1.7.2 Providing Data Abstraction and Application Flexibility
with Relational Databases
Relational databases were originally proposed to separate the physical storage of
data from its conceptual representation and to provide a mathematical foundation
for data representation and querying. The relational data model also introduced
high-level query languages that provided an alternative to programming language
interfaces, making it much faster to write new queries. Relational representation of
data somewhat resembles the example we presented in Figure 1.2. Relational sys-
tems were initially targeted to the same applications as earlier systems, and pro-
vided flexibility to develop new queries quickly and to reorganize the database as
requirements changed. Hence, data abstraction and program-data independence
were much improved when compared to earlier systems.
Early experimental relational systems developed in the late 1970s and the com-
mercial relational database management systems (RDBMS) introduced in the
early 1980s were quite slow, since they did not use physical storage pointers or
record placement to access related data records. With the development of new
storage and indexing techniques and better query processing and optimization,
their performance improved. Eventually, relational databases became the domi-
nant type of database system for traditional database applications. Relational data-
bases now exist on almost all types of computers, from small personal computers
to large servers.
1.7.3 Object-Oriented Applications and the Need
for More Complex Databases
The emergence of object-oriented programming languages in the 1980s and the
need to store and share complex, structured objects led to the development of
object-oriented databases (OODBs). Initially, OODBs were considered a competitor
to relational databases, since they provided more general data structures. They also
incorporated many of the useful object-oriented paradigms, such as abstract data
types, encapsulation of operations, inheritance, and object identity. However, the
complexity of the model and the lack of an early standard contributed to their lim-
ited use. They are now mainly used in specialized applications, such as engineering
design, multimedia publishing, and manufacturing systems. Despite expectations
that they would make a big impact, their overall penetration into the database products
market remains low. In addition, many object-oriented concepts were incorporated
into the newer versions of relational DBMSs, leading to object-relational
database management systems, known as ORDBMSs.
1.7.4 Interchanging Data on the Web
for E-Commerce Using XML
The World Wide Web provides a large network of interconnected computers.
Users can create static Web pages using a Web publishing language, such as Hyper-
Text Markup Language (HTML), and store these documents on Web servers where
other users (clients) can access them and view them through Web browsers. Docu-
ments can be linked through hyperlinks, which are pointers to other documents.
Starting in the 1990s, electronic commerce (e-commerce) emerged as a major
application on the Web. Much of the critical information on e-commerce Web
pages is dynamically extracted data from DBMSs, such as flight information, prod-
uct prices, and product availability. A variety of techniques were developed to allow
the interchange of dynamically extracted data on the Web for display on Web
pages. The Extensible Markup Language (XML) is one standard for interchanging
data among various types of databases and Web pages. XML combines concepts
from the models used in document systems with database modeling concepts.
Chapter 13 is devoted to an overview of XML.
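To make the idea of interchanging dynamically extracted data concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The element names (flights, flight, origin, destination, price) and the function names are hypothetical; no particular interchange schema is implied by the text.

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows):
    """Serialize flight rows (as they might come from a DBMS query) to an XML string."""
    root = ET.Element("flights")
    for row in rows:
        flight = ET.SubElement(root, "flight", number=row["number"])
        ET.SubElement(flight, "origin").text = row["origin"]
        ET.SubElement(flight, "destination").text = row["destination"]
        ET.SubElement(flight, "price").text = str(row["price"])
    return ET.tostring(root, encoding="unicode")

def xml_to_rows(document):
    """Parse the XML string back into rows on the receiving side."""
    root = ET.fromstring(document)
    return [
        {
            "number": flight.get("number"),
            "origin": flight.findtext("origin"),
            "destination": flight.findtext("destination"),
            "price": float(flight.findtext("price")),
        }
        for flight in root.findall("flight")
    ]
```

The sending and receiving programs need to agree only on the element names, not on how either side stores the data internally.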
1.7.5 Extending Database Capabilities
for New Applications
The success of database systems in traditional applications encouraged devel-
opers of other types of applications to attempt to use them. Such applications
traditionally used their own specialized software and file and data structures.
Database systems now offer extensions to better support the specialized require-
ments for some of these applications. The following are some examples of these
applications:
■ Scientific applications that store large amounts of data resulting from scien-
tific experiments in areas such as high-energy physics, the mapping of the
human genome, and the discovery of protein structures
■ Storage and retrieval of images, including scanned news or personal photo-
graphs, satellite photographic images, and images from medical procedures
such as x-rays and MRI (magnetic resonance imaging) tests
■ Storage and retrieval of videos, such as movies, and video clips from news
or personal digital cameras
■ Data mining applications that analyze large amounts of data to search for
the occurrences of specific patterns or relationships, and for identifying
unusual patterns in areas such as credit card fraud detection
■ Spatial applications that store and analyze spatial locations of data, such as
weather information, maps used in geographical information systems, and
automobile navigational systems
■ Time series applications that store information such as economic data at
regular points in time, such as daily sales and monthly gross national
product figures
It was quickly apparent that basic relational systems were not very suitable for many
of these applications, usually for one or more of the following reasons:
■ More complex data structures were needed for modeling the application
than the simple relational representation.
■ New data types were needed in addition to the basic numeric and character
string types.
■ New operations and query language constructs were necessary to manipu-
late the new data types.
■ New storage and indexing structures were needed for efficient searching on
the new data types.
This led DBMS developers to add functionality to their systems. Some functionality
was general purpose, such as incorporating concepts from object-oriented data-
bases into relational systems. Other functionality was special purpose, in the form
of optional modules that could be used for specific applications. For example, users
could buy a time series module to use with their relational DBMS for their time
series application.
1.7.6 Emergence of Big Data Storage Systems
and NOSQL Databases
In the first decade of the twenty-first century, the proliferation of applications and
platforms such as social media Web sites, large e-commerce companies, Web search
indexes, and cloud storage/backup led to a surge in the amount of data stored on
large databases and massive servers. New types of database systems were necessary
to manage these huge databases—systems that would provide fast search and
retrieval as well as reliable and safe storage of nontraditional types of data, such as
social media posts and tweets. Some of the requirements of these new systems were
not compatible with SQL relational DBMSs (SQL is the standard data model and
language for relational databases). The term NOSQL is generally interpreted as Not
Only SQL, meaning that in systems that manage large amounts of data, some of the
data is stored using SQL systems, whereas other data would be stored using NOSQL,
depending on the application requirements.
1.8 When Not to Use a DBMS
In spite of the advantages of using a DBMS, there are a few situations in which a
DBMS may involve unnecessary overhead costs that would not be incurred in
traditional file processing. The overhead costs of using a DBMS are due to the
following:
■ High initial investment in hardware, software, and training
■ The generality that a DBMS provides for defining and processing data
■ Overhead for providing security, concurrency control, recovery, and integ-
rity functions
Therefore, it may be more desirable to develop customized database applications
under the following circumstances:
■ Simple, well-defined database applications that are not expected to change
at all
■ Stringent, real-time requirements for some application programs that may
not be met because of DBMS overhead
■ Embedded systems with limited storage capacity, where a general-purpose
DBMS would not fit
■ No multiple-user access to data
Certain industries and applications have elected not to use general-purpose
DBMSs. For example, many computer-aided design (CAD) tools used by mechan-
ical and civil engineers have proprietary file and data management software that
is geared for the internal manipulations of drawings and 3D objects. Similarly,
communication and switching systems designed by companies like AT&T were
early manifestations of database software that was made to run very fast with
hierarchically organized data for quick access and routing of calls. GIS imple-
mentations often implement their own data organization schemes for efficiently
implementing functions related to processing maps, physical contours, lines,
polygons, and so on.
1.9 Summary
In this chapter we defined a database as a collection of related data, where data
means recorded facts. A typical database represents some aspect of the real world
and is used for specific purposes by one or more groups of users. A DBMS is a
generalized software package for implementing and maintaining a computerized
database. The database and software together form a database system. We identi-
fied several characteristics that distinguish the database approach from traditional
file-processing applications, and we discussed the main categories of database
users, or the actors on the scene. We noted that in addition to database users, there
are several categories of support personnel, or workers behind the scene, in a data-
base environment.
We presented a list of capabilities that should be provided by the DBMS software to
the DBA, database designers, and end users to help them design, administer, and
use a database. Then we gave a brief historical perspective on the evolution of data-
base applications. We pointed out the recent rapid growth of the amounts and types
of data that must be stored in databases, and we discussed the emergence of new
systems for handling “big data” applications. Finally, we discussed the overhead
costs of using a DBMS and discussed some situations in which it may not be advan-
tageous to use one.
Review Questions
1.1. Define the following terms: data, database, DBMS, database system, data-
base catalog, program-data independence, user view, DBA, end user, canned
transaction, deductive database system, persistent object, meta-data, and
transaction-processing application.
1.2. What four main types of actions involve databases? Briefly discuss each.
1.3. Discuss the main characteristics of the database approach and how it differs
from traditional file systems.
1.4. What are the responsibilities of the DBA and the database designers?
1.5. What are the different types of database end users? Discuss the main activi-
ties of each.
1.6. Discuss the capabilities that should be provided by a DBMS.
1.7. Discuss the differences between database systems and information retrieval
systems.
Exercises
1.8. Identify some informal queries and update operations that you would expect
to apply to the database shown in Figure 1.2.
1.9. What is the difference between controlled and uncontrolled redundancy?
Illustrate with examples.
1.10. Specify all the relationships among the records of the database shown in
Figure 1.2.
1.11. Give some additional views that may be needed by other user groups for the
database shown in Figure 1.2.
1.12. Cite some examples of integrity constraints that you think can apply to the
database shown in Figure 1.2.
1.13. Give examples of systems in which it may make sense to use traditional file
processing instead of a database approach.
1.14. Consider Figure 1.2.
a. If the name of the ‘CS’ (Computer Science) Department changes to ‘CSSE’
(Computer Science and Software Engineering) Department and the cor-
responding prefix for the course number also changes, identify the col-
umns in the database that would need to be updated.
b. Can you restructure the columns in the COURSE, SECTION, and
PREREQUISITE tables so that only one column will need to be updated?
Selected Bibliography
The October 1991 issue of Communications of the ACM and Kim (1995) include
several articles describing next-generation DBMSs; many of the database features
discussed in the former are now commercially available. The March 1976 issue of
ACM Computing Surveys offers an early introduction to database systems and may
provide a historical perspective for the interested reader. We will include references
to other concepts, systems, and applications introduced in this chapter in the later
text chapters that discuss each topic in more detail.
Chapter 2
Database System Concepts
and Architecture
The architecture of DBMS packages has evolved
from the early monolithic systems, where the whole
DBMS software package was one tightly integrated system, to the modern DBMS
packages that are modular in design, with a client/server system architecture. The
recent growth in the amount of data requiring storage has led to database systems
with distributed architectures comprised of thousands of computers that manage
the data stores. This evolution mirrors the trends in computing, where large cen-
tralized mainframe computers are replaced by hundreds of distributed worksta-
tions and personal computers connected via communications networks to various
types of server machines—Web servers, database servers, file servers, application
servers, and so on. The current cloud computing environments consist of thou-
sands of large servers managing so-called big data for users on the Web.
In a basic client/server DBMS architecture, the system functionality is distributed
between two types of modules.1 A client module is typically designed so that it
will run on a mobile device, user workstation, or personal computer (PC). Typi-
cally, application programs and user interfaces that access the database run in the
client module. Hence, the client module handles user interaction and provides
the user-friendly interfaces such as apps for mobile devices, or forms- or menu-
based GUIs (graphical user interfaces) for PCs. The other kind of module, called
a server module, typically handles data storage, access, search, and other func-
tions. We discuss client/server architectures in more detail in Section 2.5. First,
we must study more basic concepts that will give us a better understanding of
modern database architectures.
1As we shall see in Section 2.5, there are variations on this simple two-tier client/server architecture.
In this chapter we present the terminology and basic concepts that will be used
throughout the text. Section 2.1 discusses data models and defines the concepts
of schemas and instances, which are fundamental to the study of database sys-
tems. We discuss the three-schema DBMS architecture and data independence
in Section 2.2; this provides a user’s perspective on what a DBMS is supposed to
do. In Section 2.3 we describe the types of interfaces and languages that are typi-
cally provided by a DBMS. Section 2.4 discusses the database system software
environment. Section 2.5 gives an overview of various types of client/server
architectures. Section 2.6 presents a classification of the types of DBMS
packages, and Section 2.7 summarizes the chapter.
The material in Sections 2.4 through 2.6 provides detailed concepts that may be
considered as supplementary to the basic introductory material.
2.1 Data Models, Schemas, and Instances
One fundamental characteristic of the database approach is that it provides some
level of data abstraction. Data abstraction generally refers to the suppression of
details of data organization and storage, and the highlighting of the essential features
for an improved understanding of data. Supporting this abstraction lets different users perceive
data at their preferred level of detail. A data model—a collection of concepts that
can be used to describe the structure of a database—provides the necessary means
to achieve this abstraction.2 By structure of a database we mean the data types, rela-
tionships, and constraints that apply to the data. Most data models also include a
set of basic operations for specifying retrievals and updates on the database.
In addition to the basic operations provided by the data model, it is becoming more
common to include concepts in the data model to specify the dynamic aspect or
behavior of a database application. This allows the database designer to specify a set
of valid user-defined operations that are allowed on the database objects.3 An
example of a user-defined operation could be COMPUTE_GPA, which can be
applied to a STUDENT object. On the other hand, generic operations to insert,
delete, modify, or retrieve any kind of object are often included in the basic data
model operations. Concepts to specify behavior are fundamental to object-oriented
data models (see Chapter 12) but are also being incorporated in more traditional
data models. For example, object-relational models (see Chapter 12) extend the basic
relational model to include such concepts, among others. In the basic relational data
model, there is a provision to attach behavior to the relations in the form of persis-
tent stored modules, popularly known as stored procedures (see Chapter 10).
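The difference between a user-defined operation and the generic operations can be sketched in Python as follows. The grade-point scale, the weighting by credit hours, and the attribute names are assumptions made for the example; the text names only the operation COMPUTE_GPA and the STUDENT object.

```python
from dataclasses import dataclass, field

# Assumed grade-point scale; the text does not define one.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

@dataclass
class Student:
    name: str
    student_number: int
    # Each entry is a (letter_grade, credit_hours) pair.
    grades: list = field(default_factory=list)

    def compute_gpa(self):
        """User-defined operation: credit-weighted grade point average."""
        total_hours = sum(hours for _, hours in self.grades)
        if total_hours == 0:
            return None  # no completed courses yet
        points = sum(GRADE_POINTS[grade] * hours for grade, hours in self.grades)
        return points / total_hours
```

Creating a Student or appending to its grades corresponds to the generic insert and modify operations, whereas compute_gpa is behavior that makes sense only for this kind of object.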
2Sometimes the word model is used to denote a specific database description, or schema—for example,
the marketing data model. We will not use this interpretation.
3The inclusion of concepts to describe behavior reflects a trend whereby database design and software
design activities are increasingly being combined into a single activity. Traditionally, specifying behavior is
associated with software design.
2.1.1 Categories of Data Models
Many data models have been proposed, which we can categorize according to
the types of concepts they use to describe the database structure. High-level or
conceptual data models provide concepts that are close to the way many users per-
ceive data, whereas low-level or physical data models provide concepts that describe
the details of how data is stored on the computer storage media, typically magnetic
disks. Concepts provided by physical data models are generally meant for computer
specialists, not for end users. Between these two extremes is a class of representational
(or implementation) data models,4 which provide concepts that may be easily
understood by end users but that are not too far removed from the way data is orga-
nized in computer storage. Representational data models hide many details of data
storage on disk but can be implemented on a computer system directly.
Conceptual data models use concepts such as entities, attributes, and relationships.
An entity represents a real-world object or concept, such as an employee or a project
from the miniworld that is described in the database. An attribute represents some
property of interest that further describes an entity, such as the employee’s name or
salary. A relationship among two or more entities represents an association among
the entities, for example, a works-on relationship between an employee and a
project. Chapter 3 presents the entity–relationship model—a popular high-level
conceptual data model. Chapter 4 describes additional abstractions used for advanced
modeling, such as generalization, specialization, and categories (union types).
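The three conceptual notions above can be mirrored loosely in Python dataclasses. This is only an analogy, not a conceptual modeling notation, and the hours_per_week attribute and the projects_of helper are hypothetical additions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Employee:          # entity
    name: str            # attribute
    salary: float        # attribute

@dataclass(frozen=True)
class Project:           # entity
    title: str           # attribute

@dataclass(frozen=True)
class WorksOn:           # relationship between an employee and a project
    employee: Employee
    project: Project
    hours_per_week: float  # hypothetical attribute of the relationship

def projects_of(employee, relationships):
    """Follow the works-on relationship from one employee to that employee's projects."""
    return [r.project for r in relationships if r.employee == employee]
```

Nothing here says how the data would be stored, which is the point of a high-level model.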
Representational or implementation data models are the models used most fre-
quently in traditional commercial DBMSs. These include the widely used relational
data model, as well as the so-called legacy data models—the network and
hierarchical models—that have been widely used in the past. Part 3 of the text is
devoted to the relational data model, and its constraints, operations, and languages.5
The SQL standard for relational databases is described in Chapters 6 and 7. Repre-
sentational data models represent data by using record structures and hence are
sometimes called record-based data models.
We can regard the object data model as an example of a new family of higher-level
implementation data models that are closer to conceptual data models. A standard
for object databases called the ODMG object model has been proposed by the
Object Data Management Group (ODMG). We describe the general characteristics
of object databases and the object model proposed standard in Chapter 12. Object
data models are also frequently utilized as high-level conceptual models, particu-
larly in the software engineering domain.
Physical data models describe how data is stored as files in the computer by repre-
senting information such as record formats, record orderings, and access paths. An
4The term implementation data model is not a standard term; we have introduced it to refer to the avail-
able data models in commercial database systems.
5A summary of the hierarchical and network data models is included in Appendices D and E. They are
accessible from the book’s Web site.
access path is a search structure that makes the search for particular database
records efficient, such as indexing or hashing. We discuss physical storage tech-
niques and access structures in Chapters 16 and 17. An index is an example of an
access path that allows direct access to data using an index term or a keyword. It is
similar to the index at the end of this text, except that it may be organized in a lin-
ear, hierarchical (tree-structured), or some other fashion.
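A hash-based access path can be imitated in a few lines of Python. The sketch assumes records are dictionaries held in a list and uses a Python dict as the hash structure; a real DBMS builds such structures over disk blocks, and the function names are hypothetical.

```python
def build_hash_index(records, key):
    """Map each value of `key` to the positions of the records that hold it."""
    index = {}
    for position, record in enumerate(records):
        index.setdefault(record[key], []).append(position)
    return index

def lookup(records, index, value):
    """Fetch matching records through the index instead of scanning every record."""
    return [records[position] for position in index.get(value, [])]
```

Without the index, finding the students with a given Major means examining every record; with it, the search goes directly to the matching positions.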
Another class of data models is known as self-describing data models. The data
storage in systems based on these models combines the description of the data with
the data values themselves. In traditional DBMSs, the description (schema) is separated
from the data. These models include XML (see Chapter 13) as well as many of
the key-value stores and NOSQL systems (see Chapter 24) that were recently cre-
ated for managing big data.
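The contrast can be illustrated with a short Python sketch. JSON is used here as a stand-in for a self-describing format, which is an assumption of the example, and the sample values and the Email field are made up for illustration.

```python
import json

# Traditional style: the description (schema) is kept once, apart from the values.
schema = ("Name", "Student_number", "Class", "Major")
rows = [("Smith", 17, 1, "CS"), ("Brown", 8, 2, "CS")]

# Self-describing style: every stored item carries its own field names.
documents = [json.dumps(dict(zip(schema, row))) for row in rows]
# Because each item describes itself, items need not share one structure.
documents.append(json.dumps({"Name": "Lee", "Student_number": 21, "Email": "lee@example.edu"}))

def field_names(document):
    """Recover the structure of one item from the item itself."""
    return sorted(json.loads(document).keys())
```

The price of this flexibility is that the field names are repeated in every item instead of being stored once in a catalog.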
2.1.2 Schemas, Instances, and Database State
In a data model, it is important to distinguish between the description of the
database and the database itself. The description of a database is called the
database schema, which is specified during database design and is not expected
to change frequently.6 Most data models have certain conventions for displaying
schemas as diagrams.7 A displayed schema is called a schema diagram. Figure 2.1
shows a schema diagram for the database shown in Figure 1.2; the diagram dis-
plays the structure of each record type but not the actual instances of records.
6Schema changes are usually needed as the requirements of the database applications change. Most
database systems include operations for allowing schema changes.
7It is customary in database parlance to use schemas as the plural for schema, even though schemata is
the proper plural form. The word scheme is also sometimes used to refer to a schema.
Figure 2.1
Schema diagram for the database in Figure 1.2.
STUDENT: Name, Student_number, Class, Major
COURSE: Course_name, Course_number, Credit_hours, Department
PREREQUISITE: Course_number, Prerequisite_number
SECTION: Section_identifier, Course_number, Semester, Year, Instructor
GRADE_REPORT: Student_number, Section_identifier, Grade
We call each object in the schema—such as STUDENT or COURSE—a schema
construct.
A schema diagram displays only some aspects of a schema, such as the names of
record types and data items, and some types of constraints. Other aspects are not
specified in the schema diagram; for example, Figure 2.1 shows neither the data
type of each data item nor the relationships among the various files. Many types of
constraints are not represented in schema diagrams. A constraint such as students
majoring in computer science must take CS1310 before the end of their sophomore
year is quite difficult to represent diagrammatically.
The actual data in a database may change quite frequently. For example, the data-
base shown in Figure 1.2 changes every time we add a new student or enter a new
grade. The data in the database at a particular moment in time is called a database
state or snapshot. It is also called the current set of occurrences or instances in
the database. In a given database state, each schema construct has its own current
set of instances; for example, the STUDENT construct will contain the set of indi-
vidual student entities (records) as its instances. Many database states can be con-
structed to correspond to a particular database schema. Every time we insert or
delete a record or change the value of a data item in a record, we change one state
of the database into another state.
The distinction between database schema and database state is very important.
When we define a new database, we specify its database schema only to the
DBMS. At this point, the corresponding database state is the empty state with
no data. We get the initial state of the database when the database is first
populated or loaded with the initial data. From then on, every time an update
operation is applied to the database, we get another database state. At any point
in time, the database has a current state.8 The DBMS is partly responsible for
ensuring that every state of the database is a valid state—that is, a state that
satisfies the structure and constraints specified in the schema. Hence, specify-
ing a correct schema to the DBMS is extremely important and the schema must
be designed with utmost care. The DBMS stores the descriptions of the schema
constructs and constraints—also called the meta-data—in the DBMS catalog so
that DBMS software can refer to the schema whenever it needs to. The schema
is sometimes called the intension, and a database state is called an extension of
the schema.
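The difference between a schema and its states can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite (through Python's standard sqlite3 module) stands in for the DBMS, the column types are our own choice, and the sample rows and the helper current_state are illustrative rather than part of the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Defining the database: only the schema is specified to the DBMS.
# (Column types and the primary key are assumptions made for this sketch.)
conn.execute("""
    CREATE TABLE STUDENT (
        Name           TEXT NOT NULL,
        Student_number INTEGER PRIMARY KEY,
        Class          INTEGER,
        Major          TEXT
    )
""")

def current_state(conn):
    """Return the current state of STUDENT as a list of records."""
    return conn.execute(
        "SELECT Name, Student_number, Class, Major FROM STUDENT "
        "ORDER BY Student_number"
    ).fetchall()

# Empty state: the schema exists, but there is no data yet.
empty_state = current_state(conn)

# Loading the first records gives the initial state.
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?)",
    [("Smith", 17, 1, "CS"), ("Brown", 8, 2, "CS")],
)
initial_state = current_state(conn)

# Every update produces another state; the schema itself is unchanged.
conn.execute("UPDATE STUDENT SET Class = 2 WHERE Student_number = 17")
next_state = current_state(conn)

# The DBMS refuses an update that would lead to an invalid state,
# here a second student with an existing Student_number.
try:
    conn.execute("INSERT INTO STUDENT VALUES ('Jones', 17, 1, 'MATH')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The schema (intension) is stated once, while empty_state, initial_state, and next_state are three different extensions of it.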
Although, as mentioned earlier, the schema is not supposed to change frequently,
it is not uncommon that changes occasionally need to be applied to the schema as
the application requirements change. For example, we may decide that another
data item needs to be stored for each record in a file, such as adding the Date_of_birth
to the STUDENT schema in Figure 2.1. This is known as schema evolution. Most
modern DBMSs include some operations for schema evolution that can be applied
while the database is operational.
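As a minimal sketch of schema evolution, the following adds Date_of_birth to STUDENT while the table already holds data. SQLite is assumed as the DBMS, the column types are assumptions, and the helper column_names is hypothetical; other systems offer comparable but not identical operations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER PRIMARY KEY, "
    "Class INTEGER, Major TEXT)"
)
conn.execute("INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS')")

def column_names(conn, table):
    """Read the column names of a table from SQLite's catalog."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, default, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

before = column_names(conn, "STUDENT")

# Schema evolution: add a data item while the database holds data.
conn.execute("ALTER TABLE STUDENT ADD COLUMN Date_of_birth TEXT")

after = column_names(conn, "STUDENT")

# Existing records carry NULL for the new item until a value is supplied.
row = conn.execute("SELECT Name, Date_of_birth FROM STUDENT").fetchone()
```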
8 The current state is also called the current snapshot of the database. It has also been called a database
instance, but we prefer to use the term instance to refer to individual records.
2.2 Three-Schema Architecture and Data Independence
Three of the four important characteristics of the database approach, listed in
Section 1.3, are (1) use of a catalog to store the database description (schema) so
as to make it self-describing, (2) insulation of programs and data (program-data
and program-operation independence), and (3) support of multiple user views.
In this section we specify an architecture for database systems, called the
three-schema architecture,9 that was proposed to help achieve and visualize
these characteristics. Then we discuss further the concept of data independence.
2.2.1 The Three-Schema Architecture
The goal of the three-schema architecture, illustrated in Figure 2.2, is to separate
the user applications from the physical database. In this architecture, schemas can
be defined at the following three levels:
1. The internal level has an internal schema, which describes the physical
storage structure of the database. The internal schema uses a physical data
model and describes the complete details of data storage and access paths for
the database.
9 This is also known as the ANSI/SPARC (American National Standards Institute/Standards Planning
and Requirements Committee) architecture, after the committee that proposed it (Tsichritzis & Klug, 1978).
Figure 2.2 The three-schema architecture. [Diagram, top to bottom: End Users; External Level (External View . . . External View); External/Conceptual Mapping; Conceptual Level (Conceptual Schema); Conceptual/Internal Mapping; Internal Level (Internal Schema); Stored Database.]
2. The conceptual level has a conceptual schema, which describes the structure
of the whole database for a community of users. The conceptual schema hides
the details of physical storage structures and concentrates on describing enti-
ties, data types, relationships, user operations, and constraints. Usually, a rep-
resentational data model is used to describe the conceptual schema when a
database system is implemented. This implementation conceptual schema is
often based on a conceptual schema design in a high-level data model.
3. The external or view level includes a number of external schemas or user
views. Each external schema describes the part of the database that a partic-
ular user group is interested in and hides the rest of the database from that
user group. As in the previous level, each external schema is typically imple-
mented using a representational data model, possibly based on an external
schema design in a high-level conceptual data model.
The three-schema architecture is a convenient tool with which the user can visual-
ize the schema levels in a database system. Most DBMSs do not separate the three
levels completely and explicitly, but they support the three-schema architecture to
some extent. Some older DBMSs may include physical-level details in the concep-
tual schema. The three-level ANSI architecture has an important place in database
technology development because it clearly separates the users’ external level, the
database’s conceptual level, and the internal storage level for designing a database.
It is very much applicable in the design of DBMSs, even today. In most DBMSs that
support user views, external schemas are specified in the same data model that
describes the conceptual-level information (for example, a relational DBMS like
Oracle or SQL Server uses SQL for this).
Notice that the three schemas are only descriptions of data; the actual data is stored
at the physical level only. In the three-schema architecture, each user group refers
to its own external schema. Hence, the DBMS must transform a request specified
on an external schema into a request against the conceptual schema, and then into
a request on the internal schema for processing over the stored database. If the
request is a database retrieval, the data extracted from the stored database must be
reformatted to match the user’s external view. The processes of transforming
requests and results between levels are called mappings. These mappings may be
time-consuming, so some DBMSs—especially those that are meant to support small
databases—do not support external views. Even in such systems, however, it is nec-
essary to transform requests between the conceptual and internal levels.
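The external/conceptual mapping described above can be sketched with an SQL view. In this sketch SQLite is assumed as the DBMS, the view name STUDENT_GRADES and its column Student_name are hypothetical, and the sample rows are illustrative. The user group queries only the view; the DBMS rewrites the request against the base tables and reformats the result to match the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual level: base tables shared by the whole user community.
    CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER PRIMARY KEY,
                          Class INTEGER, Major TEXT);
    CREATE TABLE GRADE_REPORT (Student_number INTEGER,
                               Section_identifier INTEGER, Grade TEXT);
    INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS'), ('Brown', 8, 2, 'CS');
    INSERT INTO GRADE_REPORT VALUES (17, 112, 'B'), (17, 119, 'C'), (8, 85, 'A');

    -- External level: a hypothetical view for one user group. The view
    -- definition is the external/conceptual mapping; it hides Class,
    -- Major, and Student_number from this group.
    CREATE VIEW STUDENT_GRADES AS
        SELECT s.Name AS Student_name, g.Section_identifier, g.Grade
        FROM STUDENT s JOIN GRADE_REPORT g
             ON s.Student_number = g.Student_number;
""")

# A request stated purely in terms of the external schema.
rows = conn.execute(
    "SELECT Section_identifier, Grade FROM STUDENT_GRADES "
    "WHERE Student_name = 'Smith' ORDER BY Section_identifier"
).fetchall()
```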
2.2.2 Data Independence
The three-schema architecture can be used to further explain the concept of data
independence, which can be defined as the capacity to change the schema at one
level of a database system without having to change the schema at the next higher
level. We can define two types of data independence:
1. Logical data independence is the capacity to change the conceptual schema
without having to change external schemas or application programs. We
may change the conceptual schema to expand the database (by adding a
record type or data item), to change constraints, or to reduce the database
(by removing a record type or data item). In the last case, external schemas
that refer only to the remaining data should not be affected. For example,
the external schema of Figure 1.5(a) should not be affected by changing the
GRADE_REPORT file (or record type) shown in Figure 1.2 into the one
shown in Figure 1.6(a). Only the view definition and the mappings need to
be changed in a DBMS that supports logical data independence. After the
conceptual schema undergoes a logical reorganization, application pro-
grams that reference the external schema constructs must work as before.
Changes to constraints can be applied to the conceptual schema without
affecting the external schemas or application programs.
2. Physical data independence is the capacity to change the internal schema
without having to change the conceptual schema. Hence, the external schemas
need not be changed either. Changes to the internal schema may be
needed because some physical files were reorganized—for example, by cre-
ating additional access structures—to improve the performance of retrieval
or update. If the same data as before remains in the database, we should not
have to change the conceptual schema. For example, providing an access
path to improve retrieval speed of SECTION records (Figure 1.2) by semester
and year should not require a query such as “list all sections offered in fall
2008” to be changed, although the query would be executed more efficiently
by the DBMS by utilizing the new access path.
Generally, physical data independence exists in most databases and file environ-
ments where physical details, such as the exact location of data on disk, and hard-
ware details of storage encoding, placement, compression, splitting, merging of
records, and so on are hidden from the user. Applications remain unaware of these
details. On the other hand, logical data independence is harder to achieve because it
allows structural and constraint changes without affecting application programs—a
much stricter requirement.
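Physical data independence can be observed directly in a small experiment. The sketch below assumes SQLite as the DBMS; the index name idx_section_term, the helper plan, the column types, and the four-digit years are our own choices for illustration. The same query text is run before and after a new access path is created: the answer is identical, and only the access plan chosen by the DBMS differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SECTION (Section_identifier INTEGER PRIMARY KEY, "
    "Course_number TEXT, Semester TEXT, Year INTEGER, Instructor TEXT)"
)
conn.executemany(
    "INSERT INTO SECTION VALUES (?, ?, ?, ?, ?)",
    [
        (85, "MATH2410", "Fall", 2007, "King"),
        (112, "MATH2410", "Fall", 2008, "Chang"),
        (119, "CS1310", "Fall", 2008, "Anderson"),
    ],
)

# "List all sections offered in fall 2008", written once and never edited.
QUERY = (
    "SELECT Section_identifier FROM SECTION "
    "WHERE Semester = 'Fall' AND Year = 2008 ORDER BY Section_identifier"
)

def plan(conn, sql):
    """Return SQLite's textual access plan for a statement."""
    # The last field of each EXPLAIN QUERY PLAN row is the description.
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

result_before = conn.execute(QUERY).fetchall()
plan_before = plan(conn, QUERY)

# Internal-schema change: create an additional access path.
conn.execute("CREATE INDEX idx_section_term ON SECTION (Semester, Year)")

result_after = conn.execute(QUERY).fetchall()
plan_after = plan(conn, QUERY)
```

Comparing plan_before with plan_after shows the new index being used, while QUERY and the table definition are untouched.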
Whenever we have a multiple-level DBMS, its catalog must be expanded to include
information on how to map requests and data among the various levels. The DBMS
uses additional software to accomplish these mappings by referring to the mapping
information in the catalog. Data independence occurs because when the schema is
changed at some level, the schema at the next higher level remains unchanged; only
the mapping between the two levels is changed. Hence, application programs refer-
ring to the higher-level schema need not be changed.
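The role of the mapping in logical data independence can be sketched as follows. SQLite is assumed as the DBMS, and the view STUDENT_MAJORS, the table MAJOR_ASSIGNMENT, and the particular reorganization (moving Major into its own table) are hypothetical choices made only for this illustration. The application query refers to the external schema and is never edited; only the view definition, that is, the mapping, is rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER PRIMARY KEY,
                          Class INTEGER, Major TEXT);
    INSERT INTO STUDENT VALUES ('Smith', 17, 1, 'CS'), ('Brown', 8, 2, 'CS');
    -- External schema used by the application (hypothetical view).
    CREATE VIEW STUDENT_MAJORS AS SELECT Name, Major FROM STUDENT;
""")

# The application program only knows the external schema.
APP_QUERY = "SELECT Name, Major FROM STUDENT_MAJORS ORDER BY Name"
answer_before = conn.execute(APP_QUERY).fetchall()

# Logical reorganization of the conceptual schema: Major moves into a
# separate table. Only the mapping (the view definition) is redefined.
# Note: CREATE TABLE ... AS does not carry over the primary key, which
# is acceptable for this sketch.
conn.executescript("""
    DROP VIEW STUDENT_MAJORS;
    CREATE TABLE MAJOR_ASSIGNMENT AS
        SELECT Student_number, Major FROM STUDENT;
    CREATE TABLE STUDENT_NEW AS
        SELECT Name, Student_number, Class FROM STUDENT;
    DROP TABLE STUDENT;
    ALTER TABLE STUDENT_NEW RENAME TO STUDENT;
    CREATE VIEW STUDENT_MAJORS AS
        SELECT s.Name, m.Major
        FROM STUDENT s JOIN MAJOR_ASSIGNMENT m
             ON s.Student_number = m.Student_number;
""")

answer_after = conn.execute(APP_QUERY).fetchall()
```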
2.3 Database Languages and Interfaces
In Section 1.4 we discussed the variety of users supported by a DBMS. The DBMS
must provide appropriate languages and interfaces for each category of users. In
this section we discuss the types of languages and interfaces provided by a DBMS
and the user categories targeted by each interface.
2.3.1 DBMS Languages
Once the design of a database is completed and a DBMS is chosen to implement the
database, the first step is to specify conceptual and internal schemas for the data-
base and any mappings between the two. In many DBMSs where no strict separa-
tion of levels is maintained, one language, called the data definition language
(DDL), is used by the DBA and by database designers to define both schemas. The
DBMS will have a DDL compiler whose function is to process DDL statements in
order to identify descriptions of the schema constructs and to store the schema
description in the DBMS catalog.
In DBMSs where a clear separation is maintained between the conceptual and
internal levels, the DDL is used to specify the conceptual schema only. Another
language, the storage definition language (SDL), is used to specify the internal
schema. The mappings between the two schemas may be specified in either one of
these languages. In most relational DBMSs today, there is no specific language that
performs the role of SDL. Instead, the internal schema is specified by a combination
of functions, parameters, and specifications related to storage of files. These permit
the DBA staff to control indexing choices and mapping of data to storage. For a true
three-schema architecture, we would need a third language, the view definition
language (VDL), to specify user views and their mappings to the conceptual
schema, but in most DBMSs the DDL is used to define both conceptual and external
schemas. In relational DBMSs, SQL is used in the role of VDL to define user or
application views as results of predefined queries (see Chapters 6 and 7).
Once the database schemas are compiled and the database is populated with data,
users must have some means to manipulate the database. Typical manipulations
include retrieval, insertion, deletion, and modification of the data. The DBMS pro-
vides a set of operations or a language called the data manipulation language
(DML) for these purposes.
In current DBMSs, the preceding types of languages are usually not considered dis-
tinct languages; rather, a comprehensive integrated language is used that includes
constructs for conceptual schema definition, view definition, and data manipula-
tion. Storage definition is typically kept separate, since it is used for defining physi-
cal storage structures to fine-tune the performance of the database system, which is
usually done by the DBA staff. A typical example of a comprehensive database lan-
guage is the SQL relational database language (see Chapters 6 and 7), which repre-
sents a combination of DDL, VDL, and DML, as well as statements for constraint
specification, schema evolution, and many other features. The SDL was a compo-
nent in early versions of SQL but has been removed from the language to keep it at
the conceptual and external levels only.
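To make the roles concrete, the short sketch below labels a handful of SQL statements by the role each plays. The labels are our own annotation and not SQL syntax; SQLite is assumed as the DBMS, and the view CS_COURSES and the column Course_level are hypothetical.

```python
import sqlite3

# (role, statement) pairs: one language, several roles.
statements = [
    ("DDL", "CREATE TABLE COURSE (Course_name TEXT, "
            "Course_number TEXT PRIMARY KEY, Credit_hours INTEGER, "
            "Department TEXT)"),
    ("DML", "INSERT INTO COURSE VALUES ('Database', 'CS3380', 3, 'CS')"),
    ("DML", "INSERT INTO COURSE VALUES "
            "('Discrete Mathematics', 'MATH2410', 3, 'MATH')"),
    ("VDL", "CREATE VIEW CS_COURSES AS "
            "SELECT Course_name, Course_number FROM COURSE "
            "WHERE Department = 'CS'"),
    # Schema evolution is also expressed in the same language.
    ("DDL", "ALTER TABLE COURSE ADD COLUMN Course_level TEXT"),
]

conn = sqlite3.connect(":memory:")
for role, sql in statements:
    conn.execute(sql)  # the DBMS does not need the role label

# Retrieval through the view, again in the same language.
cs_courses = conn.execute("SELECT * FROM CS_COURSES").fetchall()
roles_used = sorted({role for role, _ in statements})
```

Nothing in this list describes storage structures, which matches the observation that storage definition is kept outside the language.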
There are two main types of DMLs. A high-level or nonprocedural DML can be
used on its own to specify complex database operations concisely. Many DBMSs
allow high-level DML statements either to be entered interactively from a display
monitor or terminal or to be embedded in a general-purpose programming lan-
guage. In the latter case, DML statements must be identified within the program so
that they can be extracted by a precompiler and processed by the DBMS. A low-
level or procedural DML must be embedded in a general-purpose programming
language. This type of DML typically retrieves individual records or objects from
the database and processes each separately. Therefore, it needs to use programming
language constructs, such as looping, to retrieve and process each record from a set
of records. Low-level DMLs are also called record-at-a-time DMLs because of this
property. High-level DMLs, such as SQL, can specify and retrieve many records in
a single DML statement; therefore, they are called set-at-a-time or set-oriented
DMLs. A query in a high-level DML often specifies which data to retrieve rather
than how to retrieve it; therefore, such languages are also called declarative.
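The contrast between the two styles can be sketched in Python with SQLite assumed as the DBMS. The function names and sample rows are hypothetical. The first function imitates a record-at-a-time style: the host language loops over individual records and does the filtering and summing itself. It still uses SQL to fetch the rows, so it only mimics a true procedural DML. The second function states what is wanted in a single set-oriented statement.

```python
import sqlite3

def make_db():
    """Build a small COURSE table for the comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE COURSE (Course_number TEXT PRIMARY KEY, "
        "Credit_hours INTEGER, Department TEXT)"
    )
    conn.executemany(
        "INSERT INTO COURSE VALUES (?, ?, ?)",
        [("CS1310", 4, "CS"), ("CS3320", 4, "CS"), ("MATH2410", 3, "MATH")],
    )
    return conn

def total_cs_credits_record_at_a_time(conn):
    """Procedural style: visit each record and decide in the host language."""
    total = 0
    cursor = conn.execute(
        "SELECT Course_number, Credit_hours, Department FROM COURSE"
    )
    for _course_number, credit_hours, department in cursor:
        if department == "CS":
            total += credit_hours
    return total

def total_cs_credits_set_at_a_time(conn):
    """Declarative style: one statement says which data is wanted."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(Credit_hours), 0) FROM COURSE "
        "WHERE Department = 'CS'"
    ).fetchone()
    return total

conn = make_db()
loop_total = total_cs_credits_record_at_a_time(conn)
set_total = total_cs_credits_set_at_a_time(conn)
```

Both functions compute the same value; in the second, the choice of how to scan and aggregate is left to the DBMS.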
Whenever DML commands, whether high level or low level, are embedded in a
general-purpose programming language, that language is called the host language
and the DML is called the data sublanguage.10 On the other hand, a high-level
DML used in a standalone interactive manner is called a query language. In gen-
eral, both retrieval and update commands of a high-level DML may be used inter-
actively and are hence considered part of the query language.11
Casual end users typically use a high-level query language to specify their requests,
whereas programmers use the DML in its embedded form. For naive and paramet-
ric users, there usually are user-friendly interfaces for interacting with the data-
base; these can also be used by casual users or others who do not want to learn the
details of a high-level query language. We discuss these types of interfaces next.
2.3.2 DBMS Interfaces
User-friendly interfaces provided by a DBMS may include the following:
Menu-based Interfaces for Web Clients or Browsing. These interfaces pres-
ent the user with lists of options (called menus) that lead the user through the for-
mulation of a request. Menus do away with the need to memorize the specific
commands and syntax of a query language; rather, the query is composed step-by-
step by picking options from a menu that is displayed by the system. Pull-down
menus are a very popular technique in Web-based user interfaces. They are also
often used in browsing interfaces, which allow a user to look through the contents
of a database in an exploratory and unstructured manner.
Apps for Mobile Devices. These interfaces present mobile users with access to
their data. For example, banking, reservations, and insurance companies, among
many others, provide apps that allow users to access their data through a mobile
phone or mobile device. The apps have built-in programmed interfaces that typically
allow users to log in using their account name and password; the apps then provide
a limited menu of options for mobile access to the user data, as well as options such
as paying bills (for banks) or making reservations (for reservation Web sites).
10 In object databases, the host and data sublanguages typically form one integrated language, for
example, C++ with some extensions to support database functionality. Some relational systems also
provide integrated languages, for example, Oracle’s PL/SQL.
11 According to the English meaning of the word query, it should really be used to describe retrievals
only, not updates.
Forms-based Interfaces. A forms-based interface displays a form to each user.
Users can fill out all of the form entries to insert new data, or they can fill out only
certain entries, in which case the DBMS will retrieve matching data for the remain-
ing entries. Forms are usually designed and programmed for naive users as inter-
faces to canned transactions. Many DBMSs have forms specification languages,
which are special languages that help programmers specify such forms. SQL*Forms
is a form-based language that specifies queries using a form designed in conjunc-
tion with the relational database schema. Oracle Forms is a component of the Ora-
cle product suite that provides an extensive set of features to design and build
applications using forms. Some systems have utilities that define a form by letting
the end user interactively construct a sample form on the screen.
Graphical User Interfaces. A GUI typically displays a schema to the user in dia-
grammatic form. The user then can specify a query by manipulating the diagram.
In many cases, GUIs utilize both menus and forms.
Natural Language Interfaces. These interfaces accept requests written in Eng-
lish or some other language and attempt to understand them. A natural language
interface usually has its own schema, which is similar to the database conceptual
schema, as well as a dictionary of important words. The natural language interface
refers to the words in its schema, as well as to the set of standard words in its dic-
tionary, that are used to interpret the request. If the interpretation is successful, the
interface generates a high-level query corresponding to the natural language request
and submits it to the DBMS for processing; otherwise, a dialogue is started with the
user to clarify the request.
Keyword-based Database Search. These are somewhat similar to Web search
engines, which accept strings of natural language (like English or Spanish) words
and match them with documents at specific sites (for local search engines) or Web
pages on the Web at large (for engines like Google or Ask). They use predefined
indexes on words and use ranking functions to retrieve and present resulting docu-
ments in a decreasing degree of match. Such “free form” textual query interfaces are
not yet common in structured relational databases, although a research area called
keyword-based querying has emerged recently for relational databases.
Speech Input and Output. Limited use of speech as an input query and speech
as an answer to a question or result of a request is becoming commonplace. Appli-
cations with limited vocabularies, such as inquiries for telephone directory, flight
arrival/departure, and credit card account information, are allowing speech for
input and output to enable customers to access this information. The speech input
is detected using a library of predefined words and used to set up the parameters
that are supplied to the queries. For output, a similar conversion from text or num-
bers into speech takes place.
Interfaces for Parametric Users. Parametric users, such as bank tellers, often
have a small set of operations that they must perform repeatedly. For example, a
teller is able to use single function keys to invoke routine and repetitive transactions
such as account deposits or withdrawals, or balance inquiries. Systems analysts and
programmers design and implement a special interface for each known class of
naive users. Usually a small set of abbreviated commands is included, with the goal
of minimizing the number of keystrokes required for each request.
Interfaces for the DBA. Most database systems contain privileged commands
that can be used only by the DBA staff. These include commands for creating
accounts, setting system parameters, granting account authorization, changing a
schema, and reorganizing the storage structures of a database.
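As one illustration of the interfaces listed above, the behavior of a forms-based interface can be sketched as a single function. Everything here is an assumption made for the example: SQLite as the DBMS, the function name submit_form, and the rule that a blank entry is represented by None. A completely filled form inserts a new record; a partially filled form retrieves the records that match the filled entries.

```python
import sqlite3

FORM_FIELDS = ("Name", "Student_number", "Class", "Major")

def submit_form(conn, entries):
    """Process one form; entries maps field name to value (None = blank)."""
    unknown = set(entries) - set(FORM_FIELDS)
    if unknown:
        raise ValueError(f"unknown form fields: {sorted(unknown)}")
    filled = {f: entries[f] for f in FORM_FIELDS if entries.get(f) is not None}

    if len(filled) == len(FORM_FIELDS):
        # Every entry filled in: insert new data.
        with conn:
            conn.execute(
                "INSERT INTO STUDENT (Name, Student_number, Class, Major) "
                "VALUES (?, ?, ?, ?)",
                [filled[f] for f in FORM_FIELDS],
            )
        return []

    # Some entries blank: retrieve matching records. Field names come from
    # the fixed FORM_FIELDS tuple; values are passed as parameters.
    where = " AND ".join(f"{f} = ?" for f in filled) or "1 = 1"
    sql = (
        f"SELECT {', '.join(FORM_FIELDS)} FROM STUDENT "
        f"WHERE {where} ORDER BY Student_number"
    )
    return conn.execute(sql, list(filled.values())).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER PRIMARY KEY, "
    "Class INTEGER, Major TEXT)"
)
submit_form(conn, {"Name": "Smith", "Student_number": 17, "Class": 1, "Major": "CS"})
submit_form(conn, {"Name": "Brown", "Student_number": 8, "Class": 2, "Major": "CS"})
matches = submit_form(conn, {"Major": "CS", "Class": 2})
```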
2.4 The Database System Environment
A DBMS is a complex software system. In this section we discuss the types of soft-
ware components that constitute a DBMS and the types of computer system soft-
ware with which the DBMS interacts.
2.4.1 DBMS Component Modules
Figure 2.3 illustrates, in a simplified form, the typical DBMS components. The
figure is divided into two parts. The top part of the figure refers to the various
users of the database environment and their interfaces. The lower part shows the
internal modules of the DBMS responsible for storage of data and processing of
transactions.
The database and the DBMS catalog are usually stored on disk. Access to the
disk is controlled primarily by the operating system (OS), which schedules disk
read/write. Many DBMSs have their own buffer management module to sched-
ule disk read/write, because management of buffer storage has a considerable
effect on performance. Reducing disk read/write improves performance consid-
erably. A higher-level stored data manager module of the DBMS controls access
to DBMS information that is stored on disk, whether it is part of the database or
the catalog.
Let us consider the top part of Figure 2.3 first. It shows interfaces for the DBA staff,
casual users who work with interactive interfaces to formulate queries, application
programmers who create programs using some host programming languages, and
parametric users who do data entry work by supplying parameters to predefined
transactions. The DBA staff works on defining the database and tuning it by mak-
ing changes to its definition using the DDL and other privileged commands.
The DDL compiler processes schema definitions, specified in the DDL, and stores
descriptions of the schemas (meta-data) in the DBMS catalog. The catalog includes
information such as the names and sizes of files, names and data types of data items,
storage details of each file, mapping information among schemas, and constraints.
In addition, the catalog stores many other types of information that are needed by
the DBMS modules, which can then look up the catalog information as needed.
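A catalog lookup of this kind can be sketched with SQLite, whose catalog is exposed through the sqlite_master table and the table_info pragma. Other DBMSs expose their catalogs differently, and the helper describe is hypothetical. The sketch stores two schema definitions and then reads the names and data types of the data items back from the catalog rather than from the program text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The DDL statements are processed and their descriptions are stored.
conn.executescript("""
    CREATE TABLE COURSE (Course_name TEXT, Course_number TEXT PRIMARY KEY,
                         Credit_hours INTEGER, Department TEXT);
    CREATE TABLE PREREQUISITE (Course_number TEXT, Prerequisite_number TEXT);
""")

def describe(conn):
    """Look up table names and (column, type) pairs in the catalog."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, default, pk).
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in info]
    return catalog

catalog = describe(conn)
```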
Casual users and persons with occasional need for information from the database
interact using the interactive query interface in Figure 2.3. We have not explicitly
shown any menu-based or form-based or mobile interactions that are typically used
to generate the interactive query automatically or to access canned transactions.
These queries are parsed and validated for correctness of the query syntax, the
names of files and data elements, and so on by a query compiler that compiles
them into an internal form. This internal query is subjected to query optimization
(discussed in Chapters 18 and 19). Among other things, the query optimizer is
concerned with the rearrangement and possible reordering of operations, elimination
of redundancies, and use of efficient search algorithms during execution. It
consults the system catalog for statistical and other physical information about the
stored data and generates executable code that performs the necessary operations
for the query and makes calls on the runtime processor.
Figure 2.3 Component modules of a DBMS and their interactions. [Diagram. Users: DBA Staff, Casual Users, Application Programmers, Parametric Users. Inputs: DDL Statements, Privileged Commands, Interactive Query, Application Programs. Modules: DDL Compiler, Query Compiler, Query Optimizer, Precompiler, DML Compiler, Host Language Compiler, Compiled Transactions, Runtime Database Processor, System Catalog/Data Dictionary, Concurrency Control/Backup/Recovery Subsystems, Stored Data Manager, Stored Database.]
Application programmers write programs in host languages such as Java, C, or C++
that are submitted to a precompiler. The precompiler extracts DML commands
from an application program written in a host programming language. These com-
mands are sent to the DML compiler for compilation into object code for database
access. The rest of the program is sent to the host language compiler. The object
codes for the DML commands and the rest of the program are linked, forming a
canned transaction whose executable code includes calls to the runtime database
processor. It is also becoming increasingly common to use scripting languages such
as PHP and Python to write database programs. Canned transactions are executed
repeatedly by parametric users via PCs or mobile apps; these users simply supply
the parameters to the transactions. Each execution is considered to be a separate
transaction. An example is a bank payment transaction where the account number,
payee, and amount may be supplied as parameters.
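The bank payment example can be sketched as a canned transaction in Python, one of the scripting languages mentioned above. The table layout, the function name pay, the use of whole currency units, and the rule that a balance may not go negative are all assumptions made for this illustration, with SQLite standing in for the DBMS. The transaction logic is fixed in advance; the parametric user supplies only the account number, payee, and amount.

```python
import sqlite3

def make_bank():
    """Create a tiny banking database (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ACCOUNT (
            Account_number INTEGER PRIMARY KEY,
            Balance        INTEGER NOT NULL CHECK (Balance >= 0)
        );
        CREATE TABLE PAYMENT (Account_number INTEGER, Payee TEXT,
                              Amount INTEGER);
        INSERT INTO ACCOUNT VALUES (1001, 500);
    """)
    return conn

def pay(conn, account_number, payee, amount):
    """Canned transaction: record a payment and debit the account.

    Each call is a separate transaction. Returns True if it committed
    and False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO PAYMENT VALUES (?, ?, ?)",
                (account_number, payee, amount),
            )
            conn.execute(
                "UPDATE ACCOUNT SET Balance = Balance - ? "
                "WHERE Account_number = ?",
                (amount, account_number),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; nothing was kept.
        return False

conn = make_bank()
first = pay(conn, 1001, "Electric Co", 200)   # enough funds
second = pay(conn, 1001, "Landlord", 400)     # would overdraw the account
balance = conn.execute(
    "SELECT Balance FROM ACCOUNT WHERE Account_number = 1001"
).fetchone()[0]
payments = conn.execute("SELECT COUNT(*) FROM PAYMENT").fetchone()[0]
```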
In the lower part of Figure 2.3, the runtime database processor executes (1) the
privileged commands, (2) the executable query plans, and (3) the canned transac-
tions with runtime parameters. It works with the system catalog and may update it
with statistics. It also works with the stored data manager, which in turn uses basic
operating system services for carrying out low-level input/output (read/write)
operations between the disk and main memory. The runtime database processor
handles other aspects of data transfer, such as management of buffers in the main
memory. Some DBMSs have their own buffer management module whereas others
depend on the OS for buffer management. We have shown concurrency control
and backup and recovery systems separately as a module in this figure. They are
integrated into the working of the runtime database processor for purposes of
transaction management.
It is common to have the client program that accesses the DBMS running on a
separate computer or device from the computer on which the database resides. The
former is called the client computer running DBMS client software and the latter is
called the database server. In many cases, the client accesses a middle computer,
called the application server, which in turn accesses the database server. We elabo-
rate on this topic in Section 2.5.
Figure 2.3 is not meant to describe a specific DBMS; rather, it illustrates typical
DBMS modules. The DBMS interacts with the operating system when disk accesses—
to the database or to the catalog—are needed. If the computer system is shared by
many users, the OS will schedule DBMS disk access requests and DBMS processing
along with other processes. On the other hand, if the computer system is mainly
dedicated to running the database server, the DBMS will control main memory
buffering of disk pages. The DBMS also interfaces with compilers for general-
purpose host programming languages, and with application servers and client pro-
grams running on separate machines through the system network interface.
2.4.2 Database System Utilities
In addition to possessing the software modules just described, most DBMSs have
database utilities that help the DBA manage the database system. Common utili-
ties have the following types of functions:
■ Loading. A loading utility is used to load existing data files—such as text
files or sequential files—into the database. Usually, the current (source) for-
mat of the data file and the desired (target) database file structure are speci-
fied to the utility, which then automatically reformats the data and stores it
in the database. With the proliferation of DBMSs, transferring data from
one DBMS to another is becoming common in many organizations. Some
vendors offer conversion tools that generate the appropriate loading pro-
grams, given the existing source and target database storage descriptions
(internal schemas).
■ Backup. A backup utility creates a backup copy of the database, usually by
dumping the entire database onto tape or other mass storage medium. The
backup copy can be used to restore the database in case of catastrophic disk
failure. Incremental backups are also often used, where only changes since
the previous backup are recorded. Incremental backup is more complex, but
saves storage space.
■ Database storage reorganization. This utility can be used to reorganize a
set of database files into different file organizations and create new access
paths to improve performance.
■ Performance monitoring. Such a utility monitors database usage and pro-
vides statistics to the DBA. The DBA uses the statistics in making decisions
such as whether or not to reorganize files or whether to add or drop indexes
to improve performance.
Other utilities may be available for sorting files, handling data compression,
monitoring access by users, interfacing with the network, and performing other
functions.
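Two of the utilities described above, loading and backup, can be sketched with Python's standard csv and sqlite3 modules. The source format (comma-separated text with a header row), the function name load_students, and the use of an in-memory database as the backup target are assumptions for the example; a production utility would also handle malformed input and very large files.

```python
import csv
import io
import sqlite3

# Existing data file in its current (source) format, here plain text.
SOURCE_TEXT = (
    "Name,Student_number,Class,Major\n"
    "Smith,17,1,CS\n"
    "Brown,8,2,CS\n"
)

def load_students(conn, text_file):
    """Loading utility: reformat text records and store them in STUDENT."""
    reader = csv.DictReader(text_file)
    rows = [
        (r["Name"], int(r["Student_number"]), int(r["Class"]), r["Major"])
        for r in reader
    ]
    with conn:  # load all records in one transaction
        conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Name TEXT, Student_number INTEGER PRIMARY KEY, "
    "Class INTEGER, Major TEXT)"
)
loaded = load_students(conn, io.StringIO(SOURCE_TEXT))

# Backup utility: copy the entire database to another storage target.
backup_conn = sqlite3.connect(":memory:")
conn.backup(backup_conn)
restored = backup_conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0]
```

This is a full backup; an incremental backup would record only the changes made since the previous copy.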
2.4.3 Tools, Application Environments, and Communications Facilities
Other tools are often available to database designers, users, and the DBMS. CASE
tools12 are used in the design phase of database systems. Another tool that can be
quite useful in large organizations is an expanded data dictionary (or data repository)
system. In addition to storing catalog information about schemas and constraints,
the data dictionary stores other information, such as design decisions, usage standards,
application program descriptions, and user information. Such a system is
also called an information repository. This information can be accessed directly by
users or the DBA when needed. A data dictionary utility is similar to the DBMS
catalog, but it includes a wider variety of information and is accessed mainly by
users rather than by the DBMS software.
12 Although CASE stands for computer-aided software engineering, many CASE tools are used primarily
for database design.
Application development environments, such as PowerBuilder (Sybase)
or JBuilder (Borland), have been quite popular. These systems provide an environ-
ment for developing database applications and include facilities that help in many
facets of database systems, including database design, GUI development, querying
and updating, and application program development.
The DBMS also needs to interface with communications software, whose function
is to allow users at locations remote from the database system site to access the
database through computer terminals, workstations, or personal computers. These
are connected to the database site through data communications hardware such as
Internet routers, phone lines, long-haul networks, local networks, or satellite com-
munication devices. Many commercial database systems have communication
packages that work with the DBMS. The integrated DBMS and data communica-
tions system is called a DB/DC system. In addition, some distributed DBMSs are
physically distributed over multiple machines. In this case, communications net-
works are needed to connect the machines. These are often local area networks
(LANs), but they can also be other types of networks.
2.5 Centralized and Client/Server Architectures for DBMSs
2.5.1 Centralized DBMSs Architecture
Architectures for DBMSs have followed trends similar to those for general com-
puter system architectures. Older architectures used mainframe computers to pro-
vide the main processing for all system functions, including user application
programs and user interface programs, as well as all the DBMS functionality. The
reason was that in older systems, most users accessed the DBMS via computer ter-
minals that did not have processing power and only provided display capabilities.
Therefore, all processing was performed remotely on the computer system housing
the DBMS, and only display information and controls were sent from the computer
to the display terminals, which were connected to the central computer via various
types of communications networks.
As prices of hardware declined, most users replaced their terminals with PCs and
workstations, and more recently with mobile devices. At first, database systems
used these computers similarly to how they had used display terminals, so that the
DBMS itself was still a centralized DBMS in which all the DBMS functionality,
application program execution, and user interface processing were carried out on
one machine. Figure 2.4 illustrates the physical components in a centralized archi-
tecture. Gradually, DBMS systems started to exploit the available processing power
at the user side, which led to client/server DBMS architectures.
2.5.2 Basic Client/Server Architectures
First, we discuss client/server architecture in general; then we discuss how it is
applied to DBMSs. The client/server architecture was developed to deal with com-
puting environments in which a large number of PCs, workstations, file servers,
printers, database servers, Web servers, e-mail servers, and other software and
equipment are connected via a network. The idea is to define specialized servers
with specific functionalities. For example, it is possible to connect a number of PCs
or small workstations as clients to a file server that maintains the files of the client
machines. Another machine can be designated as a printer server by being con-
nected to various printers; all print requests by the clients are forwarded to this
machine. Web servers or e-mail servers also fall into the specialized server cate-
gory. The resources provided by specialized servers can be accessed by many client
machines. The client machines provide the user with the appropriate interfaces to
utilize these servers, as well as with local processing power to run local applications.
This concept can be carried over to other software packages, with specialized pro-
grams—such as a CAD (computer-aided design) package—being stored on specific
server machines and being made accessible to multiple clients. Figure 2.5 illustrates
client/server architecture at the logical level; Figure 2.6 is a simplified diagram that
shows the physical architecture. Some machines would be client sites only (for
example, mobile devices or workstations/PCs that have only client software
installed). Other machines would be dedicated servers, and others would have both
client and server functionality.
[Figure 2.4: A physical centralized architecture. Labels: terminals with display monitors, network; software (application programs, DBMS, compilers, text editors, terminal display control, operating system); hardware/firmware (CPU, memory, disk, I/O devices such as printers and tape drives, controllers, system bus).]
The concept of client/server architecture assumes an underlying framework that
consists of many PCs/workstations and mobile devices as well as a smaller number
of server machines, connected via wireless networks or LANs and other types of
computer networks. A client in this framework is typically a user machine that pro-
vides user interface capabilities and local processing. When a client requires access
to additional functionality—such as database access—that does not exist at the cli-
ent, it connects to a server that provides the needed functionality. A server is a sys-
tem containing both hardware and software that can provide services to the client
machines, such as file access, printing, archiving, or database access. In general,
some machines install only client software, others only server software, and still
others may include both client and server software, as illustrated in Figure 2.6.
However, it is more common for client and server software to run on separate
machines. Two main types of basic DBMS architectures were created on this under-
lying client/server framework: two-tier and three-tier.13 We discuss them next.
[Figure 2.5: Logical two-tier client/server architecture. Labels: clients; network; print server, DBMS server, file server, and other servers.]
[Figure 2.6: Physical two-tier client/server architecture. Labels: diskless client, client with disk, server, and combined server and client machines at sites 1 through n, connected by a communication network.]
2.5.3 Two-Tier Client/Server Architectures for DBMSs
In relational database management systems (RDBMSs), many of which started
as centralized systems, the system components that were first moved to the
client side were the user interface and application programs. Because SQL (see
Chapters 6 and 7) provided a standard language for RDBMSs, this created a
logical dividing point between client and server. Hence, the query and transac-
tion functionality related to SQL processing remained on the server side. In
such an architecture, the server is often called a query server or transaction
server because it provides these two functionalities. In an RDBMS, the server is
also often called an SQL server.
The user interface programs and application programs can run on the client side.
When DBMS access is required, the program establishes a connection to the
DBMS (which is on the server side); once the connection is created, the client
program can communicate with the DBMS. A standard called Open Database
Connectivity (ODBC) provides an application programming interface (API),
which allows client-side programs to call the DBMS, as long as both client and
server machines have the necessary software installed. Most DBMS vendors pro-
vide ODBC drivers for their systems. A client program can actually connect to
several RDBMSs and send query and transaction requests using the ODBC API,
which are then processed at the server sites. Any query results are sent back to the
client program, which can process and display the results as needed. A related
standard for the Java programming language, called JDBC, has also been defined.
This allows Java client programs to access one or more DBMSs through a stan-
dard interface.
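The connect, query, and fetch cycle described above can be sketched in Python, whose standard DB-API plays a role roughly comparable to ODBC or JDBC. This is only an illustration under stated assumptions: the built-in sqlite3 module runs the DBMS inside the client process rather than on a separate server machine, and the FLIGHT table, its columns, and its rows are hypothetical.

```python
import sqlite3

def open_connection():
    # Stand-in for connecting to a remote SQL server; a real two-tier
    # client would hand a host name and credentials to its driver.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FLIGHT (Flight_number TEXT PRIMARY KEY, Seats_left INTEGER)"
    )
    conn.executemany(
        "INSERT INTO FLIGHT VALUES (?, ?)", [("AA100", 12), ("UA200", 0)]
    )
    conn.commit()
    return conn

def flights_with_seats(conn):
    # The client program sends an SQL query; the DBMS processes it and
    # the result rows come back to the client for display.
    cursor = conn.execute(
        "SELECT Flight_number, Seats_left FROM FLIGHT "
        "WHERE Seats_left > 0 ORDER BY Flight_number"
    )
    return cursor.fetchall()

conn = open_connection()
for flight_number, seats_left in flights_with_seats(conn):
    print(flight_number, seats_left)
```

With a client/server DBMS, only the connection call would change; the query and fetch steps keep the same shape because the interface is standardized.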
The architectures described here are called two-tier architectures because the soft-
ware components are distributed over two systems: client and server. The advan-
tages of this architecture are its simplicity and seamless compatibility with existing
systems. The emergence of the Web changed the roles of clients and servers, leading
to the three-tier architecture.
2.5.4 Three-Tier and n-Tier Architectures for Web Applications
Many Web applications use an architecture called the three-tier architecture,
which adds an intermediate layer between the client and the database server, as
illustrated in Figure 2.7(a).
13 There are many other variations of client/server architectures. We discuss the two most basic ones
here.
This intermediate layer or middle tier is called the application server or the Web
server, depending on the application. This server plays an intermediary role by
running application programs and storing business rules (procedures or con-
straints) that are used to access data from the database server. It can also improve
database security by checking a client’s credentials before forwarding a request to
the database server. Clients contain user interfaces and Web browsers. The inter-
mediate server accepts requests from the client, processes the request and sends
database queries and commands to the database server, and then acts as a conduit
for passing (partially) processed data from the database server to the clients, where
it may be processed further and filtered to be presented to the users. Thus, the user
interface, application rules, and data access act as the three tiers. Figure 2.7(b) shows
another view of the three-tier architecture used by database and other application
package vendors. The presentation layer displays information to the user and allows
data entry. The business logic layer handles intermediate rules and constraints before
data is passed up to the user or down to the DBMS. The bottom layer includes all
data management services. The middle layer can also act as a Web server, which
retrieves query results from the database server and formats them into dynamic
Web pages that are viewed by the Web browser at the client side. The client machine
is typically a PC or mobile device connected to the Web.
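The division of responsibilities among the three tiers can be sketched as three groups of Python functions. This is a minimal single-process illustration, not a deployment: in practice each tier would typically run on its own machine, and the ACCOUNT table, the token store, and all function names here are hypothetical.

```python
import sqlite3

# --- Database services layer (bottom tier) ---
def make_database():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ACCOUNT (Owner TEXT PRIMARY KEY, Balance INTEGER)")
    db.execute("INSERT INTO ACCOUNT VALUES ('smith', 500)")
    db.commit()
    return db

# --- Business logic layer (middle tier) ---
VALID_TOKENS = {"token-smith": "smith"}  # hypothetical credential store

def handle_balance_request(db, token):
    # Check the client's credentials before forwarding a query
    # to the database server.
    owner = VALID_TOKENS.get(token)
    if owner is None:
        return {"status": "denied"}
    row = db.execute(
        "SELECT Balance FROM ACCOUNT WHERE Owner = ?", (owner,)
    ).fetchone()
    return {"status": "ok", "owner": owner, "balance": row[0]}

# --- Presentation layer (client tier) ---
def render(response):
    # Format the (partially) processed data for display to the user.
    if response["status"] != "ok":
        return "Access denied"
    return f"{response['owner']}: balance {response['balance']}"

db = make_database()
print(render(handle_balance_request(db, "token-smith")))
print(render(handle_balance_request(db, "bad-token")))
```

Note that the presentation code never issues SQL itself; every request passes through the middle tier, which is where the credential check and business rules live.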
Other architectures have also been proposed. It is possible to divide the layers
between the user and the stored data further into finer components, thereby giving
rise to n-tier architectures, where n may be four or five tiers. Typically, the business
logic layer is divided into multiple layers. Besides distributing programming and
data throughout a network, n-tier applications afford the advantage that any one
tier can run on an appropriate processor or operating system platform and can be
handled independently. Vendors of ERP (enterprise resource planning) and CRM
(customer relationship management) packages often use a middleware layer, which
accounts for the front-end modules (clients) communicating with a number of
back-end databases (servers).
[Figure 2.7: Logical three-tier client/server architecture, with a couple of commonly used nomenclatures. (a) Client (GUI, Web interface); application server or Web server (application programs, Web pages); database server (database management system). (b) Presentation layer; business logic layer; database services layer.]
Advances in encryption and decryption technology make it safer to transfer sensi-
tive data from server to client in encrypted form, where it will be decrypted. The
latter can be done by the hardware or by advanced software. This technology gives
higher levels of data security, but the network security issues remain a major con-
cern. Various technologies for data compression also help to transfer large amounts
of data from servers to clients over wired and wireless networks.
2.6 Classification of Database Management Systems
Several criteria can be used to classify DBMSs. The first is the data model on
which the DBMS is based. The main data model used in many current commercial
DBMSs is the relational data model, and the systems based on this model are
known as SQL systems. The object data model has been implemented in some
commercial systems but has not had widespread use. Recently, so-called big data
systems, also known as key-value storage systems and NOSQL systems, use vari-
ous data models: document-based, graph-based, column-based, and key-value
data models. Many legacy applications still run on database systems based on the
hierarchical and network data models.
The relational DBMSs are evolving continuously, and, in particular, have been
incorporating many of the concepts that were developed in object databases. This
has led to a new class of DBMSs called object-relational DBMSs. We can catego-
rize DBMSs based on the data model: relational, object, object-relational, NOSQL,
key-value, hierarchical, network, and other.
Some experimental DBMSs are based on the XML (eXtensible Markup Language)
model, which is a tree-structured data model. These have been called native XML
DBMSs. Several commercial relational DBMSs have added XML interfaces and
storage to their products.
The second criterion used to classify DBMSs is the number of users supported by
the system. Single-user systems support only one user at a time and are mostly
used with PCs. Multiuser systems, which include the majority of DBMSs, support
concurrent multiple users.
The third criterion is the number of sites over which the database is distributed. A
DBMS is centralized if the data is stored at a single computer site. A centralized
DBMS can support multiple users, but the DBMS and the database reside totally at
a single computer site. A distributed DBMS (DDBMS) can have the actual database
and DBMS software distributed over many sites connected by a computer network.
Big data systems are often massively distributed, with hundreds of sites. The data is
often replicated on multiple sites so that failure of a site will not make some data
unavailable.
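The benefit of replication can be illustrated with a toy Python sketch in which every value is copied to all reachable sites and a read succeeds as long as any one copy survives. The Site class and both functions are hypothetical, and the sketch deliberately ignores the consistency and recovery issues that a real distributed DBMS must handle.

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.up = True      # whether the site is currently reachable
        self.store = {}     # this site's copy of the data

def write_all(sites, key, value):
    # Replicate the value to every site that is currently reachable.
    for site in sites:
        if site.up:
            site.store[key] = value

def read_any(sites, key):
    # Return the value from the first reachable site that holds a copy.
    for site in sites:
        if site.up and key in site.store:
            return site.store[key]
    raise KeyError(key)

sites = [Site("A"), Site("B"), Site("C")]
write_all(sites, "flight:AA100", {"seats_left": 12})
sites[0].up = False  # simulate the failure of one site
print(read_any(sites, "flight:AA100"))
```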
Homogeneous DDBMSs use the same DBMS software at all the sites, whereas
heterogeneous DDBMSs can use different DBMS software at each site. It is also
possible to develop middleware software to access several autonomous preexisting
databases stored under heterogeneous DBMSs. This leads to a federated DBMS (or
multidatabase system), in which the participating DBMSs are loosely coupled and
have a degree of local autonomy. Many DDBMSs use client-server architecture, as
we described in Section 2.5.
The fourth criterion is cost. It is difficult to propose a classification of DBMSs
based on cost. Today we have open source (free) DBMS products like MySQL and
PostgreSQL that are supported by third-party vendors with additional services.
The main RDBMS products are available as free 30-day evaluation copies
as well as personal versions, which may cost under $100 and allow a fair amount of
functionality. The giant systems are being sold in modular form with components
to handle distribution, replication, parallel processing, mobile capability, and so
on, and with a large number of parameters that must be defined for the configura-
tion. Furthermore, they are sold in the form of licenses—site licenses allow unlim-
ited use of the database system with any number of copies running at the customer
site. Another type of license limits the number of concurrent users or the number
of user seats at a location. Standalone single-user versions of some systems like
Microsoft Access are sold per copy or included in the overall configuration of a
desktop or laptop. In addition, data warehousing and mining features, as well as
support for additional data types, are made available at extra cost. It is possible to
pay millions of dollars for the installation and maintenance of large database sys-
tems annually.
We can also classify a DBMS on the basis of the types of access path options for
storing files. One well-known family of DBMSs is based on inverted file structures.
Finally, a DBMS can be general purpose or special purpose. When performance is
a primary consideration, a special-purpose DBMS can be designed and built for a
specific application; such a system cannot be used for other applications without
major changes. Many airline reservations and telephone directory systems devel-
oped in the past are special-purpose DBMSs. These fall into the category of online
transaction processing (OLTP) systems, which must support a large number of
concurrent transactions without imposing excessive delays.
Let us briefly elaborate on the main criterion for classifying DBMSs: the data
model. The relational data model represents a database as a collection of tables,
where each table can be stored as a separate file. The database in Figure 1.2 resem-
bles a basic relational representation. Most relational databases use the high-level
query language called SQL and support a limited form of user views. We discuss
the relational model and its languages and operations in Chapters 5 through 8, and
techniques for programming relational applications in Chapters 10 and 11.
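As a small illustration of these ideas, the following Python sketch uses the built-in sqlite3 module to define one table, populate it, and define a user view through SQL. The STUDENT table name and Student_number column follow the textbook's example, but the column set, the sample rows, and the CS_STUDENT view are assumptions made for this sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- A relation is declared as a table with named columns.
    CREATE TABLE STUDENT (
        Student_number INTEGER PRIMARY KEY,
        Name           TEXT NOT NULL,
        Major          TEXT
    );
    INSERT INTO STUDENT VALUES (17, 'Smith', 'CS');
    INSERT INTO STUDENT VALUES (8,  'Brown', 'CS');
    INSERT INTO STUDENT VALUES (21, 'Lee',   'MATH');

    -- A user view: a named query that one user group sees as a table.
    CREATE VIEW CS_STUDENT AS
        SELECT Student_number, Name FROM STUDENT WHERE Major = 'CS';
""")

# The high-level query states what is wanted, not how to retrieve it.
cs_students = db.execute("SELECT Name FROM CS_STUDENT ORDER BY Name").fetchall()
print(cs_students)
```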
The object data model defines a database in terms of objects, their properties, and
their operations. Objects with the same structure and behavior belong to a class,
and classes are organized into hierarchies (or acyclic graphs). The operations of
each class are specified in terms of predefined procedures called methods. Rela-
tional DBMSs have been extending their models to incorporate object database
concepts and other capabilities; these systems are referred to as object-relational or
extended relational systems. We discuss object databases and object-relational
systems in Chapter 12.
Big data systems are based on various data models, with the following four data
models most common. The key-value data model associates a unique key with
each value (which can be a record or object) and provides very fast access to a
value given its key. The document data model is based on JSON (JavaScript
Object Notation) and stores the data as documents, which somewhat resemble
complex objects. The graph data model stores objects as graph nodes and rela-
tionships among objects as directed graph edges. Finally, the column-based data
models store the columns of rows clustered on disk pages for fast access and
allow multiple versions of the data. We will discuss some of these in more detail
in Chapter 24.
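The four data models can be contrasted by representing similar information in plain Python structures. These are in-memory analogies rather than the storage formats of any particular system, and all keys, course numbers, and field names are hypothetical.

```python
import json

# Key-value: a unique key maps to a value that the store treats as opaque.
key_value = {"student:17": b"Smith|CS"}

# Document: a JSON document that somewhat resembles a complex object.
document = json.loads(
    '{"Student_number": 17, "Name": "Smith", "Courses": ["CS1310", "MATH2410"]}'
)

# Graph: objects are nodes; relationships are directed edges.
nodes = {"s17": {"Name": "Smith"}, "c1310": {"Title": "Intro to CS"}}
edges = [("s17", "ENROLLED_IN", "c1310")]

# Column-based: the values of each column are kept together, and a cell
# may hold several (timestamp, value) versions.
columns = {
    "Name":  {"row17": [(1, "Smith")]},
    "Major": {"row17": [(1, "MATH"), (2, "CS")]},
}

def latest(column, row):
    # The version with the largest timestamp is the current value.
    return max(columns[column][row])[1]

print(key_value["student:17"])
print(document["Courses"])
print(edges[0])
print(latest("Major", "row17"))
```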
The XML model has emerged as a standard for exchanging data over the Web and
has been used as a basis for implementing several prototype native XML systems.
XML uses hierarchical tree structures. It combines database concepts with concepts
from document representation models. Data is represented as elements; with the
use of tags, data can be nested to create complex tree structures. This model con-
ceptually resembles the object model but uses different terminology. XML capabili-
ties have been added to many commercial DBMS products. We present an overview
of XML in Chapter 13.
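The nesting of tagged elements into a tree can be seen in a short Python sketch that parses a small XML document with the standard library. The element names and values are illustrative assumptions, not a schema defined in this chapter.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Worker><Ssn>123456789</Ssn><Hours>32.5</Hours></Worker>
    <Worker><Ssn>453453453</Ssn><Hours>20.0</Hours></Worker>
  </Project>
</Projects>
"""

def hours_per_project(xml_text):
    # Parse the text into a tree, then walk the nested elements.
    root = ET.fromstring(xml_text.strip())
    totals = {}
    for project in root.findall("Project"):
        name = project.findtext("Name")
        totals[name] = sum(
            float(worker.findtext("Hours")) for worker in project.findall("Worker")
        )
    return totals

print(hours_per_project(XML_TEXT))
```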
Two older, historically important data models, now known as legacy data models,
are the network and hierarchical models. The network model represents data as
record types and also represents a limited type of 1:N relationship, called a set type.
A 1:N, or one-to-many, relationship relates one instance of a record to many record
instances using some pointer linking mechanism in these models. The network
model, also known as the CODASYL DBTG model,14 has an associated record-at-
a-time language that must be embedded in a host programming language. The net-
work DML was proposed in the 1971 Database Task Group (DBTG) Report as an
extension of the COBOL language.
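A 1:N set type with pointer linking can be sketched as an owner record that points to a chain of member records, which a program then traverses one record at a time. This is a loose Python analogy under stated assumptions: the Record class, the head-insertion policy, and the sample data are hypothetical and do not reproduce the actual DBTG language or storage layout.

```python
class Record:
    def __init__(self, **fields):
        self.fields = fields
        self.first_member = None  # owner record -> first member of the set
        self.next_member = None   # member record -> next member of the set

def connect(owner, member):
    # Link the member into the owner's set by inserting it at the head
    # of the pointer chain.
    member.next_member = owner.first_member
    owner.first_member = member

def members(owner):
    # Record-at-a-time navigation: follow pointers from one member
    # record to the next until the chain ends.
    current = owner.first_member
    while current is not None:
        yield current
        current = current.next_member

research = Record(Dname="Research")  # owner record
for name in ("Smith", "Wong"):
    connect(research, Record(Name=name))  # member records

print([m.fields["Name"] for m in members(research)])
```

The explicit loop over pointers is the point of the sketch: the program, not the DBMS, decides how to reach each related record.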
The hierarchical model represents data as hierarchical tree structures. Each hierar-
chy represents a number of related records. There is no standard language for the
hierarchical model. A popular hierarchical DML is DL/1 of the IMS system. IMS
dominated the DBMS market for over 20 years between 1965 and 1985, and DL/1
was a de facto industry standard for a long time.15
14 CODASYL DBTG stands for Conference on Data Systems Languages Database Task Group, which is
the committee that specified the network model and its language.
15 The full chapters on the network and hierarchical models from the second edition of this book are
available from this book’s Companion Web site at http://www.aw.com/elmasri.
2.7 Summary
In this chapter we introduced the main concepts used in database systems. We
defined a data model and we distinguished three main categories:
■ High-level or conceptual data models (based on entities and relationships)
■ Low-level or physical data models
■ Representational or implementation data models (record-based, object-
oriented)
We distinguished the schema, or description of a database, from the database itself.
The schema does not change very often, whereas the database state changes every
time data is inserted, deleted, or modified. Then we described the three-schema
DBMS architecture, which allows three schema levels:
■ An internal schema describes the physical storage structure of the database.
■ A conceptual schema is a high-level description of the whole database.
■ External schemas describe the views of different user groups.
A DBMS that cleanly separates the three levels must have mappings among
the schemas to transform requests and query results from one level to the
next. Most DBMSs do not separate the three levels completely. We used the
three-schema architecture to define the concepts of logical and physical data
independence.
Then we discussed the main types of languages and interfaces that DBMSs support.
A data definition language (DDL) is used to define the database conceptual schema.
In most DBMSs, the DDL also defines user views and, sometimes, storage struc-
tures; in other DBMSs, separate languages or functions exist for specifying storage
structures. This distinction is fading away in today’s relational implementations,
with SQL serving as a catchall language to perform multiple roles, including view
definition. The storage definition part (SDL) was included in SQL’s early versions,
but is now typically implemented as special commands for the DBA in relational
DBMSs. The DBMS compiles all schema definitions and stores their descriptions in
the DBMS catalog.
A data manipulation language (DML) is used for specifying database retrievals and
updates. DMLs can be high level (set-oriented, nonprocedural) or low level (record-
oriented, procedural). A high-level DML can be embedded in a host programming
language, or it can be used as a standalone language; in the latter case it is often
called a query language.
We discussed different types of interfaces provided by DBMSs and the types of
DBMS users with which each interface is associated. Then we discussed the
database system environment, typical DBMS software modules, and DBMS
utilities for helping users and the DBA staff perform their tasks. We continued
with an overview of the two-tier and three-tier architectures for database
applications.
Finally, we classified DBMSs according to several criteria: data model, number of
users, number of sites, types of access paths, and cost. We discussed the availabil-
ity of DBMSs and additional modules—from no cost in the form of open source
software to configurations that annually cost millions to maintain. We also
pointed out the variety of licensing arrangements for DBMS and related prod-
ucts. The main classification of DBMSs is based on the data model. We briefly
discussed the main data models used in current commercial DBMSs.
Review Questions
2.1. Define the following terms: data model, database schema, database state,
internal schema, conceptual schema, external schema, data independence,
DDL, DML, SDL, VDL, query language, host language, data sublanguage,
database utility, catalog, client/server architecture, three-tier architecture,
and n-tier architecture.
2.2. Discuss the main categories of data models. What are the basic differences
among the relational model, the object model, and the XML model?
2.3. What is the difference between a database schema and a database state?
2.4. Describe the three-schema architecture. Why do we need mappings among
schema levels? How do different schema definition languages support this
architecture?
2.5. What is the difference between logical data independence and physical data
independence? Which one is harder to achieve? Why?
2.6. What is the difference between procedural and nonprocedural DMLs?
2.7. Discuss the different types of user-friendly interfaces and the types of users
who typically use each.
2.8. With what other computer system software does a DBMS interact?
2.9. What is the difference between the two-tier and three-tier client/server
architectures?
2.10. Discuss some types of database utilities and tools and their functions.
2.11. What is the additional functionality incorporated in n-tier architecture
(n > 3)?
Exercises
2.12. Think of different users for the database shown in Figure 1.2. What types of
applications would each user need? To which user category would each
belong, and what type of interface would each need?
2.13. Choose a database application with which you are familiar. Design a schema
and show a sample database for that application, using the notation of Fig-
ures 1.2 and 2.1. What types of additional information and constraints
would you like to represent in the schema? Think of several users of your
database, and design a view for each.
2.14. If you were designing a Web-based system to make airline reservations and sell
airline tickets, which DBMS architecture would you choose from Section 2.5?
Why? Why would the other architectures not be a good choice?
2.15. Consider Figure 2.1. In addition to constraints relating the values of col-
umns in one table to columns in another table, there are also constraints that
impose restrictions on values in a column or a combination of columns
within a table. One such constraint dictates that a column or a group of col-
umns must be unique across all rows in the table. For example, in the
STUDENT table, the Student_number column must be unique (to prevent two
different students from having the same Student_number). Identify the col-
umn or the group of columns in the other tables that must be unique across
all rows in the table.
Selected Bibliography
Many database textbooks, including Date (2004), Silberschatz et al. (2011), Ramak-
rishnan and Gehrke (2003), Garcia-Molina et al. (2002, 2009), and Abiteboul et al.
(1995), provide a discussion of the various database concepts presented here.
Tsichritzis and Lochovsky (1982) is an early textbook on data models. Tsichritzis
and Klug (1978) and Jardine (1977) present the three-schema architecture, which
was first suggested in the DBTG CODASYL report (1971) and later in an American
National Standards Institute (ANSI) report (1975). An in-depth analysis of the rela-
tional data model and some of its possible extensions is given in Codd (1990). The
proposed standard for object-oriented databases is described in Cattell et al. (2000).
Many documents describing XML are available on the Web, such as XML (2005).
Examples of database utilities are the ETI Connect, Analyze and Transform tools
(http://www.eti.com) and the database administration tool, DBArtisan, from
Embarcadero Technologies (http://www.embarcadero.com).
Part 2
Conceptual Data Modeling and Database Design
Chapter 3 Data Modeling Using the Entity–Relationship (ER) Model
Conceptual modeling is a very important phase in
designing a successful database application. Gener-
ally, the term database application refers to a particular database and the associ-
ated programs that implement the database queries and updates. For example, a
BANK database application that keeps track of customer accounts would include
programs that implement database updates corresponding to customer deposits
and withdrawals. These programs would provide user-friendly graphical user inter-
faces (GUIs) utilizing forms and menus for the end users of the application—the
bank customers or bank tellers in this example. In addition, it is now common to
provide interfaces to these programs to BANK customers via mobile devices using
mobile apps. Hence, a major part of the database application will require the
design, implementation, and testing of these application programs. Traditionally,
the design and testing of application programs has been considered to be part of
software engineering rather than database design. In many software design tools, the
database design methodologies and software engineering methodologies are inter-
twined since these activities are strongly related.
In this chapter, we follow the traditional approach of concentrating on the database
structures and constraints during conceptual database design. The design of appli-
cation programs is typically covered in software engineering courses. We present
the modeling concepts of the entity–relationship (ER) model, which is a popular
high-level conceptual data model. This model and its variations are frequently used
for the conceptual design of database applications, and many database design tools
employ its concepts. We describe the basic data-structuring concepts and con-
straints of the ER model and discuss their use in the design of conceptual schemas
for database applications. We also present the diagrammatic notation associated
with the ER model, known as ER diagrams.
Object modeling methodologies such as the Unified Modeling Language (UML)
are becoming increasingly popular in both database and software design. These
methodologies go beyond database design to specify detailed design of software
modules and their interactions using various types of diagrams. An important part
of these methodologies—namely, class diagrams1—is similar in many ways to the
ER diagrams. In class diagrams, operations on objects are specified, in addition to
specifying the database schema structure. Operations can be used to specify the
functional requirements during database design, as we will discuss in Section 3.1.
We present some of the UML notation and concepts for class diagrams that are
particularly relevant to database design in Section 3.8, and we briefly compare these
to ER notation and concepts. Additional UML notation and concepts are presented
in Section 4.6.
This chapter is organized as follows: Section 3.1 discusses the role of high-level con-
ceptual data models in database design. We introduce the requirements for a sam-
ple database application in Section 3.2 to illustrate the use of concepts from the ER
model. This sample database is used throughout the text. In Section 3.3 we present
the concepts of entities and attributes, and we gradually introduce the diagram-
matic technique for displaying an ER schema. In Section 3.4 we introduce the con-
cepts of binary relationships and their roles and structural constraints. Section 3.5
introduces weak entity types. Section 3.6 shows how a schema design is refined to
include relationships. Section 3.7 reviews the notation for ER diagrams, summa-
rizes the issues and common pitfalls that occur in schema design, and discusses
how to choose the names for database schema constructs such as entity types and
relationship types. Section 3.8 introduces some UML class diagram concepts, compares
them to ER model concepts, and applies them to the same COMPANY database
example. Section 3.9 discusses more complex types of relationships. Section 3.10
summarizes the chapter.
The material in Sections 3.8 and 3.9 may be excluded from an introductory course. If
a more thorough coverage of data modeling concepts and conceptual database design
is desired, the reader should continue to Chapter 4, where we describe extensions to
the ER model that lead to the enhanced–ER (EER) model, which includes concepts
such as specialization, generalization, inheritance, and union types (categories).
3.1 Using High-Level Conceptual Data Models for Database Design
Figure 3.1 shows a simplified overview of the database design process. The first step
shown is requirements collection and analysis. During this step, the database
designers interview prospective database users to understand and document their
data requirements. The result of this step is a concisely written set of users’ require-
ments. These requirements should be specified in as detailed and complete a form
as possible. In parallel with specifying the data requirements, it is useful to specify
the known functional requirements of the application. These consist of the user-
defined operations (or transactions) that will be applied to the database, including
both retrievals and updates. In software design, it is common to use data flow dia-
grams, sequence diagrams, scenarios, and other techniques to specify functional
requirements. We will not discuss any of these techniques here; they are usually
described in detail in software engineering texts.
1 A class is similar to an entity type in many ways.
Figure 3.1
A simplified diagram to illustrate the main phases of database design.
Data side: Miniworld → REQUIREMENTS COLLECTION AND ANALYSIS → Data Requirements → CONCEPTUAL DESIGN → Conceptual Schema (in a high-level data model) → LOGICAL DESIGN (DATA MODEL MAPPING) → Logical (Conceptual) Schema (in the data model of a specific DBMS) → PHYSICAL DESIGN → Internal Schema.
Functional side: Functional Requirements → FUNCTIONAL ANALYSIS → High-Level Transaction Specification → APPLICATION PROGRAM DESIGN → TRANSACTION IMPLEMENTATION → Application Programs.
The steps through conceptual design are DBMS-independent; the later steps are DBMS-specific.
Once the requirements have been collected and analyzed, the next step is to create a
conceptual schema for the database, using a high-level conceptual data model. This
step is called conceptual design. The conceptual schema is a concise description of
the data requirements of the users and includes detailed descriptions of the entity
types, relationships, and constraints; these are expressed using the concepts pro-
vided by the high-level data model. Because these concepts do not include imple-
mentation details, they are usually easier to understand and can be used to
communicate with nontechnical users. The high-level conceptual schema can also
be used as a reference to ensure that all users’ data requirements are met and that
the requirements do not conflict. This approach enables database designers to con-
centrate on specifying the properties of the data, without being concerned with
storage and implementation details, which makes it easier to create a good con-
ceptual database design.
During or after the conceptual schema design, the basic data model operations can
be used to specify the high-level user queries and operations identified during
functional analysis. This also serves to confirm that the conceptual schema meets
all the identified functional requirements. Modifications to the conceptual schema
can be introduced if some functional requirements cannot be specified using the
initial schema.
The next step in database design is the actual implementation of the database, using
a commercial DBMS. Most current commercial DBMSs use an implementation
data model—such as the relational (SQL) model—so the conceptual schema is
transformed from the high-level data model into the implementation data model.
This step is called logical design or data model mapping; its result is a database
schema in the implementation data model of the DBMS. Data model mapping is
often automated or semiautomated within the database design tools.
The last step is the physical design phase, during which the internal storage struc-
tures, file organizations, indexes, access paths, and physical design parameters for
the database files are specified. In parallel with these activities, application pro-
grams are designed and implemented as database transactions corresponding to the
high-level transaction specifications.
We present only the basic ER model concepts for conceptual schema design in this
chapter. Additional modeling concepts are discussed in Chapter 4, when we intro-
duce the EER model.
3.2 A Sample Database Application
In this section we describe a sample database application, called COMPANY, which
serves to illustrate the basic ER model concepts and their use in schema design. We
list the data requirements for the database here, and then create its conceptual
schema step-by-step as we introduce the modeling concepts of the ER model. The
COMPANY database keeps track of a company’s employees, departments, and
projects. Suppose that after the requirements collection and analysis phase, the
database designers provide the following description of the miniworld—the part of
the company that will be represented in the database.
■ The company is organized into departments. Each department has a unique
name, a unique number, and a particular employee who manages the depart-
ment. We keep track of the start date when that employee began managing
the department. A department may have several locations.
■ A department controls a number of projects, each of which has a unique
name, a unique number, and a single location.
■ The database will store each employee’s name, Social Security number,2
address, salary, sex (gender), and birth date. An employee is assigned to one
department, but may work on several projects, which are not necessarily
controlled by the same department. It is required to keep track of the cur-
rent number of hours per week that an employee works on each project, as
well as the direct supervisor of each employee (who is another employee).
■ The database will keep track of the dependents of each employee for insur-
ance purposes, including each dependent’s first name, sex, birth date, and
relationship to the employee.
Figure 3.2 shows how the schema for this database application can be displayed by
means of the graphical notation known as ER diagrams. This figure will be
explained gradually as the ER model concepts are presented. We describe the step-
by-step process of deriving this schema from the stated requirements—and explain
the ER diagrammatic notation—as we introduce the ER model concepts.
3.3 Entity Types, Entity Sets, Attributes,
and Keys
The ER model describes data as entities, relationships, and attributes. In Section 3.3.1
we introduce the concepts of entities and their attributes. We discuss entity types
and key attributes in Section 3.3.2. Then, in Section 3.3.3, we specify the initial con-
ceptual design of the entity types for the COMPANY database. We describe relation-
ships in Section 3.4.
3.3.1 Entities and Attributes
Entities and Their Attributes. The basic concept that the ER model represents is
an entity, which is a thing or object in the real world with an independent existence.
An entity may be an object with a physical existence (for example, a particular per-
son, car, house, or employee) or it may be an object with a conceptual existence (for
instance, a company, a job, or a university course). Each entity has attributes—the
particular properties that describe it. For example, an EMPLOYEE entity may be
described by the employee’s name, age, address, salary, and job. A particular entity
will have a value for each of its attributes. The attribute values that describe each
entity become a major part of the data stored in the database.
2The Social Security number, or SSN, is a unique nine-digit identifier assigned to each individual in the
United States to keep track of his or her employment, benefits, and taxes. Other countries may have
similar identification schemes, such as personal identification card numbers.
Figure 3.3 shows two entities and the values of their attributes. The EMPLOYEE
entity e1 has four attributes: Name, Address, Age, and Home_phone; their values
are ‘John Smith’, ‘2311 Kirby, Houston, Texas 77001’, ‘55’, and ‘713-749-2630’,
respectively. The COMPANY entity c1 has three attributes: Name, Headquarters, and
President; their values are ‘Sunco Oil’, ‘Houston’, and ‘John Smith’, respectively.
Figure 3.2
An ER schema diagram for the COMPANY database. The diagrammatic notation is introduced gradually throughout
this chapter and is summarized in Figure 3.14.
Entity types and attributes shown: EMPLOYEE (Name (Fname, Minit, Lname), Ssn, Bdate, Address, Sex, Salary); DEPARTMENT (Name, Number, Locations, Number_of_employees); PROJECT (Name, Number, Location); DEPENDENT (Name, Sex, Birth_date, Relationship).
Relationship types shown: WORKS_FOR, MANAGES (Start_date), CONTROLS, WORKS_ON (Hours), SUPERVISION (roles Supervisor and Supervisee), DEPENDENTS_OF.
Several types of attributes occur in the ER model: simple versus composite, single-
valued versus multivalued, and stored versus derived. First we define these attribute
types and illustrate their use via examples. Then we discuss the concept of a NULL
value for an attribute.
Composite versus Simple (Atomic) Attributes. Composite attributes can be
divided into smaller subparts, which represent more basic attributes with indepen-
dent meanings. For example, the Address attribute of the EMPLOYEE entity shown
in Figure 3.3 can be subdivided into Street_address, City, State, and Zip,3 with the
values ‘2311 Kirby’, ‘Houston’, ‘Texas’, and ‘77001’. Attributes that are not divisible
are called simple or atomic attributes. Composite attributes can form a hierarchy;
for example, Street_address can be further subdivided into three simple component
attributes: Number, Street, and Apartment_number, as shown in Figure 3.4. The value
of a composite attribute is the concatenation of the values of its component simple
attributes.
Figure 3.3
Two entities, EMPLOYEE e1 and COMPANY c1, and their attributes.
e1: Name = John Smith; Address = 2311 Kirby, Houston, Texas 77001; Age = 55; Home_phone = 713-749-2630
c1: Name = Sunco Oil; Headquarters = Houston; President = John Smith
3Zip Code is the name used in the United States for a five-digit postal code, such as 76019, which can
be extended to nine digits, such as 76019-0015. We use the five-digit Zip in our examples.
Figure 3.4
A hierarchy of composite attributes. Address is composed of Street_address, City, State, and Zip; Street_address is composed of Number, Street, and Apartment_number.
Composite attributes are useful to model situations in which a user sometimes
refers to the composite attribute as a unit but at other times refers specifically to its
components. If the composite attribute is referenced only as a whole, there is no
need to subdivide it into component attributes. For example, if there is no need to
refer to the individual components of an address (Zip Code, street, and so on), then
the whole address can be designated as a simple attribute.
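As a minimal sketch of the composite Address attribute from Figure 3.4, the following Python fragment nests one record type inside another. The class names, the field name zip_code, and the way the component values are joined into one string are assumptions made for illustration; the ER model itself does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StreetAddress:
    number: str
    street: str
    apartment_number: Optional[str] = None  # absent for many addresses

    def value(self) -> str:
        parts = [self.number, self.street]
        if self.apartment_number is not None:
            parts.append(self.apartment_number)
        return " ".join(parts)


@dataclass(frozen=True)
class Address:
    street_address: StreetAddress  # itself a composite attribute
    city: str
    state: str
    zip_code: str

    def value(self) -> str:
        # Concatenate the component values into the composite value.
        return f"{self.street_address.value()}, {self.city}, {self.state} {self.zip_code}"


address = Address(StreetAddress("2311", "Kirby"), "Houston", "Texas", "77001")
whole = address.value()  # the attribute referenced as a unit
city = address.city      # one component referenced on its own
```

If no user ever needs a component on its own, a single string field would serve equally well, which is the simple-attribute alternative described above.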
Single-Valued versus Multivalued Attributes. Most attributes have a single
value for a particular entity; such attributes are called single-valued. For example,
Age is a single-valued attribute of a person. In some cases an attribute can have a
set of values for the same entity—for instance, a Colors attribute for a car, or a
College_degrees attribute for a person. Cars with one color have a single value,
whereas two-tone cars have two color values. Similarly, one person may not have any
college degrees, another person may have one, and a third person may have two or
more degrees; therefore, different people can have different numbers of values for the
College_degrees attribute. Such attributes are called multivalued. A multivalued
attribute may have lower and upper bounds to constrain the number of values allowed
for each individual entity. For example, the Colors attribute of a car may be restricted to
have between one and two values, if we assume that a car can have two colors at most.
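The bounds on a multivalued attribute can be sketched in Python as a check on the size of a set. The Car class, the constant names, and the vehicle identifier X1 are hypothetical; the bounds of one and two are the ones assumed in the example above.

```python
from dataclasses import dataclass
from typing import FrozenSet

MIN_COLORS = 1  # lower bound from the example above
MAX_COLORS = 2  # upper bound from the example above


@dataclass(frozen=True)
class Car:
    vehicle_id: str
    colors: FrozenSet[str]  # multivalued attribute: a set of values

    def __post_init__(self) -> None:
        # Enforce the bounds on the number of values.
        if not MIN_COLORS <= len(self.colors) <= MAX_COLORS:
            raise ValueError(
                f"Colors must have {MIN_COLORS} to {MAX_COLORS} values, "
                f"got {len(self.colors)}"
            )


two_tone = Car("TK629", frozenset({"red", "black"}))

try:
    Car("X1", frozenset({"red", "black", "white"}))  # hypothetical three-color car
    rejected = False
except ValueError:
    rejected = True
```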
Stored versus Derived Attributes. In some cases, two (or more) attribute val-
ues are related—for example, the Age and Birth_date attributes of a person. For a
particular person entity, the value of Age can be determined from the current
(today’s) date and the value of that person’s Birth_date. The Age attribute is hence
called a derived attribute and is said to be derivable from the Birth_date attribute,
which is called a stored attribute. Some attribute values can be derived from related
entities; for example, an attribute Number_of_employees of a DEPARTMENT entity
can be derived by counting the number of employees related to (working for) that
department.
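The two kinds of derivation can be sketched as follows: Age is computed from the stored Birth_date and a given date, and Number_of_employees is computed by counting related entities. The birth date, the function names, and the small set of (employee, department) pairs are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Set, Tuple


@dataclass(frozen=True)
class Person:
    name: str
    birth_date: date  # stored attribute

    def age(self, today: date) -> int:
        # Derived attribute: computed from birth_date and the given date.
        before_birthday = (today.month, today.day) < (
            self.birth_date.month,
            self.birth_date.day,
        )
        return today.year - self.birth_date.year - int(before_birthday)


def number_of_employees(department: str, works_for: Set[Tuple[str, str]]) -> int:
    # Derived from related entities: count the (employee, department) pairs.
    return sum(1 for _, dept in works_for if dept == department)


person = Person("John Smith", date(1965, 1, 9))  # birth date is hypothetical
works_for = {("e1", "d1"), ("e3", "d1"), ("e2", "d2")}
```

Because neither value is stored, neither can drift out of date with respect to the stored data it depends on.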
NULL Values. In some cases, a particular entity may not have an applicable value
for an attribute. For example, the Apartment_number attribute of an address applies
only to addresses that are in apartment buildings and not to other types of resi-
dences, such as single-family homes. Similarly, a College_degrees attribute applies
only to people with college degrees. For such situations, a special value called NULL
is created. An address of a single-family home would have NULL for its
Apartment_number attribute, and a person with no college degree would have
NULL for College_degrees. NULL can also be used if we do not know the value of an
attribute for a particular entity—for example, if we do not know the home phone
number of ‘John Smith’ in Figure 3.3. The meaning of the former type of NULL is
not applicable, whereas the meaning of the latter is unknown. The unknown category
of NULL can be further classified into two cases. The first case arises when it is known
that the attribute value exists but is missing—for instance, if the Height attribute of a
person is listed as NULL. The second case arises when it is not known whether the
attribute value exists—for example, if the Home_phone attribute of a person is NULL.
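To make the three meanings of NULL concrete, the sketch below gives each one its own marker. The enumeration and its member names are assumptions introduced here for illustration only.

```python
from enum import Enum


class Null(Enum):
    NOT_APPLICABLE = "attribute does not apply to this entity"
    MISSING = "value exists but is not known"
    EXISTENCE_UNKNOWN = "not known whether a value exists"


person = {
    "Name": "John Smith",
    "Age": 55,
    "Apartment_number": Null.NOT_APPLICABLE,  # lives in a single-family home
    "Height": Null.MISSING,                   # every person has a height
    "Home_phone": Null.EXISTENCE_UNKNOWN,     # may or may not have a phone
}


def null_attributes(entity: dict) -> dict:
    # Map each NULL-valued attribute name to the kind of NULL it carries.
    return {name: v for name, v in entity.items() if isinstance(v, Null)}


nulls = null_attributes(person)
```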
Complex Attributes. Notice that, in general, composite and multivalued attri-
butes can be nested arbitrarily. We can represent arbitrary nesting by grouping
components of a composite attribute between parentheses ( ) and separating
the components with commas, and by displaying multivalued attributes between
braces { }. Such attributes are called complex attributes. For example, if a person
can have more than one residence and each residence can have a single address and
multiple phones, an attribute Address_phone for a person can be specified as shown
in Figure 3.5.4 Both Phone and Address are themselves composite attributes.
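The nesting in Address_phone can be sketched with nested record types, using a tuple wherever the notation uses braces. All class and field names are hypothetical renderings of the attribute names, and the second residence uses invented values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Phone:
    area_code: str
    phone_number: str


@dataclass(frozen=True)
class StreetAddress:
    number: str
    street: str
    apartment_number: Optional[str] = None


@dataclass(frozen=True)
class Address:
    street_address: StreetAddress
    city: str
    state: str
    zip_code: str


@dataclass(frozen=True)
class AddressPhone:
    phones: Tuple[Phone, ...]  # {Phone(...)}: multivalued component
    address: Address           # Address(...): single composite component


# The outer braces make Address_phone itself multivalued: one entry per residence.
address_phone = (
    AddressPhone(
        phones=(Phone("713", "749-2630"),),
        address=Address(StreetAddress("2311", "Kirby"), "Houston", "Texas", "77001"),
    ),
    AddressPhone(  # second residence: hypothetical values
        phones=(Phone("555", "010-0001"), Phone("555", "010-0002")),
        address=Address(StreetAddress("10", "Main", "4B"), "Dallas", "Texas", "75201"),
    ),
)
phone_counts = [len(residence.phones) for residence in address_phone]
```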
3.3.2 Entity Types, Entity Sets, Keys, and Value Sets
Entity Types and Entity Sets. A database usually contains groups of entities that
are similar. For example, a company employing hundreds of employees may want to
store similar information concerning each of the employees. These employee entities
share the same attributes, but each entity has its own value(s) for each attribute. An
entity type defines a collection (or set) of entities that have the same attributes. Each
entity type in the database is described by its name and attributes. Figure 3.6 shows
two entity types: EMPLOYEE and COMPANY, and a list of some of the attributes
for each. A few individual entities of each type are also illustrated, along with the
values of their attributes. The collection of all entities of a particular entity type in the
database at any point in time is called an entity set or entity collection; the entity set
is usually referred to using the same name as the entity type, even though they are
two separate concepts. For example, EMPLOYEE refers to both a type of entity and
the current collection of all employee entities in the database. It is now more
common to give separate names to the entity type and entity collection, for example,
in object and object-relational data models (see Chapter 12).
4For those familiar with XML, we should note that complex attributes are similar to complex elements in
XML (see Chapter 13).
Figure 3.5
A complex attribute: Address_phone.
{Address_phone( {Phone(Area_code, Phone_number)}, Address(Street_address(Number, Street, Apartment_number), City, State, Zip) )}
Figure 3.6
Two entity types, EMPLOYEE and COMPANY, and some member entities of each.
Entity Type Name: EMPLOYEE
Attributes: Name, Age, Salary
Entity Set (Extension):
e1 (John Smith, 55, 80K)
e2 (Fred Brown, 40, 30K)
e3 (Judy Clark, 25, 20K)
Entity Type Name: COMPANY
Attributes: Name, Headquarters, President
Entity Set (Extension):
c1 (Sunco Oil, Houston, John Smith)
c2 (Fast Computer, Dallas, Bob King)
An entity type is represented in ER diagrams5 (see Figure 3.2) as a rectangular box
enclosing the entity type name. Attribute names are enclosed in ovals and are
attached to their entity type by straight lines. Composite attributes are attached to
their component attributes by straight lines. Multivalued attributes are displayed in
double ovals. Figure 3.7(a) shows a CAR entity type in this notation.
An entity type describes the schema or intension for a set of entities that share the
same structure. The collection of entities of a particular entity type is grouped into
an entity set, which is also called the extension of the entity type.
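One way to picture the distinction, sketched here in Python, is to let a class definition play the part of the entity type (intension) and a set of its instances play the part of the entity set (extension). Reading the salaries in Figure 3.6 as thousands and the added fourth employee are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Employee:  # entity type: a name plus attributes (the intension)
    name: str
    age: int
    salary: int


# Entity set (extension): the entities of this type at one point in time.
employee_set = {
    Employee("John Smith", 55, 80_000),
    Employee("Fred Brown", 40, 30_000),
    Employee("Judy Clark", 25, 20_000),
}

schema = [f.name for f in fields(Employee)]

# Adding an entity changes the extension but not the intension.
later_set = employee_set | {Employee("Ann Lee", 31, 45_000)}  # hypothetical entity
```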
Key Attributes of an Entity Type. An important constraint on the entities of an
entity type is the key or uniqueness constraint on attributes. An entity type usually
has one or more attributes whose values are distinct for each individual entity in the
entity set. Such an attribute is called a key attribute, and its values can be used to
identify each entity uniquely. For example, the Name attribute is a key of the
COMPANY entity type in Figure 3.6 because no two companies are allowed to have
the same name. For the PERSON entity type, a typical key attribute is Ssn (Social Secu-
rity number). Sometimes several attributes together form a key, meaning that the
combination of the attribute values must be distinct for each entity. If a set of attri-
butes possesses this property, the proper way to represent this in the ER model that
we describe here is to define a composite attribute and designate it as a key attribute
of the entity type. Notice that such a composite key must be minimal; that is, all
component attributes must be included in the composite attribute to have the
uniqueness property. Superfluous attributes must not be included in a key. In ER
diagrammatic notation, each key attribute has its name underlined inside the oval,
as illustrated in Figure 3.7(a).
Specifying that an attribute is a key of an entity type means that the preceding
uniqueness property must hold for every entity set of the entity type. Hence, it is a
constraint that prohibits any two entities from having the same value for the key
attribute at the same time. It is not the property of a particular entity set; rather, it is
a constraint on any entity set of the entity type at any point in time. This key con-
straint (and other constraints we discuss later) is derived from the constraints of the
miniworld that the database represents.
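The uniqueness property can be checked against one entity set with a short function, sketched below using the three CAR entities of Figure 3.7(b). The function name and the dictionary representation are assumptions. Note that such a check can only show that a given entity set does not violate a key; the key constraint itself is a statement about every entity set of the entity type.

```python
from typing import Dict, Iterable, Sequence


def satisfies_key(entities: Iterable[Dict[str, str]], key: Sequence[str]) -> bool:
    """Return True if no two entities agree on every attribute in key."""
    seen = set()
    for entity in entities:
        value = tuple(entity[attr] for attr in key)
        if value in seen:
            return False
        seen.add(value)
    return True


# Entities from Figure 3.7(b), restricted to the attributes used here.
cars = [
    {"Number": "ABC 123", "State": "TEXAS", "Vehicle_id": "TK629"},
    {"Number": "ABC 123", "State": "NEW YORK", "Vehicle_id": "WP9872"},
    {"Number": "VSY 720", "State": "TEXAS", "Vehicle_id": "TD729"},
]

registration_ok = satisfies_key(cars, ["Number", "State"])  # composite key
number_alone_ok = satisfies_key(cars, ["Number"])           # ABC 123 repeats
```

Here the combination (Number, State) is distinct for every car while neither component is, which is what makes Registration a minimal composite key in this example.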
Some entity types have more than one key attribute. For example, each of the
Vehicle_id and Registration attributes of the entity type CAR (Figure 3.7) is a key in
its own right. The Registration attribute is an example of a composite key formed
from two simple component attributes, State and Number, neither of which is a key
on its own. An entity type may also have no key, in which case it is called a weak
entity type (see Section 3.5).
5We use a notation for ER diagrams that is close to the original proposed notation (Chen, 1976). Many
other notations are in use; we illustrate some of them later in this chapter when we present UML class
diagrams, and some additional diagrammatic notations are given in Appendix A.
In our diagrammatic notation, if two attributes are underlined separately, then each
is a key on its own. Unlike the relational model (see Section 5.2.2), there is no con-
cept of primary key in the ER model that we present here; the primary key will be
chosen during mapping to a relational schema (see Chapter 9).
Value Sets (Domains) of Attributes. Each simple attribute of an entity type is
associated with a value set (or domain of values), which specifies the set of values
that may be assigned to that attribute for each individual entity. In Figure 3.6, if the
range of ages allowed for employees is between 16 and 70, we can specify the value
set of the Age attribute of EMPLOYEE to be the set of integer numbers between 16
and 70. Similarly, we can specify the value set for the Name attribute to be the set of
strings of alphabetic characters separated by blank characters, and so on. Value sets
are not typically displayed in basic ER diagrams and are similar to the basic data
types available in most programming languages, such as integer, string, Boolean,
float, enumerated type, subrange, and so on. However, data types of attributes can
be specified in UML class diagrams (see Section 3.8) and in other diagrammatic
notations used in database design tools. Additional data types to represent common
database types, such as date, time, and other concepts, are also employed.
Figure 3.7
The CAR entity type with two key attributes, Registration and Vehicle_id. (a) ER diagram notation. (b) Entity set with three entities.
(a) CAR with attributes Registration (State, Number), Vehicle_id, Make, Model, Year, and the multivalued attribute Color.
(b) CAR
Registration (Number, State), Vehicle_id, Make, Model, Year, {Color}
CAR1
((ABC 123, TEXAS), TK629, Ford Mustang, convertible, 2004, {red, black})
CAR2
((ABC 123, NEW YORK), WP9872, Nissan Maxima, 4-door, 2005, {blue})
CAR3
((VSY 720, TEXAS), TD729, Chrysler LeBaron, 4-door, 2002, {white, blue})
Mathematically, an attribute A of entity set E whose value set is V can be defined as
a function from E to the power set6 P(V) of V:
A : E → P(V)
We refer to the value of attribute A for entity e as A(e). The previous definition cov-
ers both single-valued and multivalued attributes, as well as NULLs. A NULL value
is represented by the empty set. For single-valued attributes, A(e) is restricted to
being a singleton set for each entity e in E, whereas there is no restriction on multi-
valued attributes.7 For a composite attribute A, the value set V is the power set of
the Cartesian product of P(V1), P(V2), . . . , P(Vn), where V1, V2, . . . , Vn are the
value sets of the simple component attributes that form A:
V = P(P(V1) × P(V2) × . . . × P(Vn))
The value set provides all possible values. Usually only a small number of these val-
ues exist in the database at a particular time. Those values represent the data from
the current state of the miniworld and correspond to the data as it actually exists in
the miniworld.
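The definition of an attribute as a function from E to P(V) can be sketched by mapping each entity to a set of values. The names below, the Phones attribute, and its values are hypothetical; the Age values and the range 16 to 70 come from the examples above. Treating the empty set as acceptable for a single-valued attribute is our reading of how NULL fits the definition, not something the definition states.

```python
from typing import Dict, FrozenSet, Hashable

Attribute = Dict[str, FrozenSet[Hashable]]  # maps each entity to a subset of V

AGE_VALUES = frozenset(range(16, 71))  # value set: integers from 16 to 70

age: Attribute = {
    "e1": frozenset({55}),
    "e2": frozenset({40}),
    "e3": frozenset({25}),
}
phones: Attribute = {  # hypothetical multivalued attribute
    "e1": frozenset({"713-749-2630"}),
    "e2": frozenset(),  # NULL, represented by the empty set
    "e3": frozenset({"555-0100", "555-0101"}),
}


def within_value_set(attribute: Attribute, value_set: FrozenSet[Hashable]) -> bool:
    # Every A(e) must be a subset of the value set V.
    return all(values <= value_set for values in attribute.values())


def is_null(attribute: Attribute, entity: str) -> bool:
    return len(attribute[entity]) == 0


def is_single_valued(attribute: Attribute) -> bool:
    # A singleton for every entity; an empty set (NULL) is tolerated here.
    return all(len(values) <= 1 for values in attribute.values())
```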
3.3.3 Initial Conceptual Design of the COMPANY Database
We can now define the entity types for the COMPANY database, based on the
requirements described in Section 3.2. After defining several entity types and their
attributes here, we refine our design in Section 3.4 after we introduce the concept of
a relationship. According to the requirements listed in Section 3.2, we can identify
four entity types—one corresponding to each of the four items in the specification
(see Figure 3.8):
1. An entity type DEPARTMENT with attributes Name, Number, Locations,
Manager, and Manager_start_date. Locations is the only multivalued attribute.
We can specify that both Name and Number are (separate) key attributes
because each was specified to be unique.
2. An entity type PROJECT with attributes Name, Number, Location, and
Controlling_department. Both Name and Number are (separate) key attributes.
3. An entity type EMPLOYEE with attributes Name, Ssn, Sex, Address, Salary,
Birth_date, Department, and Supervisor. Both Name and Address may be
composite attributes; however, this was not specified in the requirements.
We must go back to the users to see if any of them will refer to the individual
components of Name—First_name, Middle_initial, Last_name—or of Address. In
our example, Name is modeled as a composite attribute, whereas Address is
not, presumably after consultation with the users.
4. An entity type DEPENDENT with attributes Employee, Dependent_name, Sex,
Birth_date, and Relationship (to the employee).
6The power set P(V) of a set V is the set of all subsets of V.
7A singleton set is a set with only one element (value).
Another requirement is that an employee can work on several projects, and the
database has to store the number of hours per week an employee works on each
project. This requirement is listed as part of the third requirement in Section 3.2,
and it can be represented by a multivalued composite attribute of EMPLOYEE
called Works_on with the simple components (Project, Hours). Alternatively, it
can be represented as a multivalued composite attribute of PROJECT called
Workers with the simple components (Employee, Hours). We choose the first
alternative in Figure 3.8; we shall see in the next section that this will be refined into
a many-to-many relationship, once we introduce the concepts of relationships.
Figure 3.8
Preliminary design of entity types for the COMPANY database. Some of the shown attributes will be refined into relationships.
DEPARTMENT: Name, Number, {Locations}, Manager, Manager_start_date
PROJECT: Name, Number, Location, Controlling_department
EMPLOYEE: Name (Fname, Minit, Lname), Ssn, Sex, Address, Salary, Birth_date, Department, Supervisor, {Works_on (Project, Hours)}
DEPENDENT: Employee, Dependent_name, Sex, Birth_date, Relationship
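The preliminary design can be written down as four record types, sketched below. The Python types chosen for each attribute (strings for dates and references, for instance) are assumptions; only the attribute names, the multivalued and composite markings, and the keys come from the design above. The final dictionary lists the attributes that refer to other entity types.

```python
from dataclasses import dataclass, fields
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Department:
    name: str                  # key
    number: int                # key
    locations: FrozenSet[str]  # multivalued
    manager: str               # refers to an EMPLOYEE
    manager_start_date: str


@dataclass(frozen=True)
class Project:
    name: str                    # key
    number: int                  # key
    location: str
    controlling_department: str  # refers to a DEPARTMENT


@dataclass(frozen=True)
class Employee:
    name: Tuple[str, str, str]  # composite: (Fname, Minit, Lname)
    ssn: str
    sex: str
    address: str
    salary: int
    birth_date: str
    department: str                         # refers to a DEPARTMENT
    supervisor: str                         # refers to an EMPLOYEE
    works_on: FrozenSet[Tuple[str, float]]  # multivalued composite: (Project, Hours)


@dataclass(frozen=True)
class Dependent:
    employee: str  # refers to an EMPLOYEE
    dependent_name: str
    sex: str
    birth_date: str
    relationship: str


# Attributes that refer to other entity types.
reference_attributes = {
    "Department": ["manager"],
    "Project": ["controlling_department"],
    "Employee": ["department", "supervisor", "works_on"],
    "Dependent": ["employee"],
}
entity_types = {
    "Department": Department,
    "Project": Project,
    "Employee": Employee,
    "Dependent": Dependent,
}
```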
3.4 Relationship Types, Relationship Sets,
Roles, and Structural Constraints
In Figure 3.8 there are several implicit relationships among the various entity types.
In fact, whenever an attribute of one entity type refers to another entity type, some
relationship exists. For example, the attribute Manager of DEPARTMENT refers to
an employee who manages the department; the attribute Controlling_department
of PROJECT refers to the department that controls the project; the attribute
Supervisor of EMPLOYEE refers to another employee (the one who supervises this
employee); the attribute Department of EMPLOYEE refers to the department for
which the employee works; and so on. In the ER model, these references should not
be represented as attributes but as relationships. The initial COMPANY database
schema from Figure 3.8 will be refined in Section 3.6 to represent relationships
explicitly. In the initial design of entity types, relationships are typically captured in
the form of attributes. As the design is refined, these attributes get converted into
relationships between entity types.
This section is organized as follows: Section 3.4.1 introduces the concepts of rela-
tionship types, relationship sets, and relationship instances. We define the concepts
of relationship degree, role names, and recursive relationships in Section 3.4.2, and
then we discuss structural constraints on relationships—such as cardinality ratios
and existence dependencies—in Section 3.4.3. Section 3.4.4 shows how relationship
types can also have attributes.
3.4.1 Relationship Types, Sets, and Instances
A relationship type R among n entity types E1, E2, . . . , En defines a set of associa-
tions—or a relationship set—among entities from these entity types. Similar to the
case of entity types and entity sets, a relationship type and its corresponding rela-
tionship set are customarily referred to by the same name, R. Mathematically, the
relationship set R is a set of relationship instances ri, where each ri associates n
individual entities (e1, e2, . . . , en), and each entity ej in ri is a member of entity set Ej,
1 ≤ j ≤ n. Hence, a relationship set is a mathematical relation on E1, E2, . . . , En;
alternatively, it can be defined as a subset of the Cartesian product of the entity sets
E1 × E2 × . . . × En. Each of the entity types E1, E2, . . . , En is said to participate in the
relationship type R; similarly, each of the individual entities e1, e2, . . . , en is said to
participate in the relationship instance ri = (e1, e2, . . . , en).
Informally, each relationship instance ri in R is an association of entities, where the
association includes exactly one entity from each participating entity type. Each
such relationship instance ri represents the fact that the entities participating in ri
are related in some way in the corresponding miniworld situation. For example,
consider a relationship type WORKS_FOR between the two entity types
EMPLOYEE and DEPARTMENT, which associates each employee with the depart-
ment for which the employee works. Each relationship instance in the relationship
set WORKS_FOR associates one EMPLOYEE entity and one DEPARTMENT
entity. Figure 3.9 illustrates this example, where each relationship instance ri is
shown connected to the EMPLOYEE and DEPARTMENT entities that participate
in ri. In the miniworld represented by Figure 3.9, the employees e1, e3, and e6 work
for department d1; the employees e2 and e4 work for department d2; and the employ-
ees e5 and e7 work for department d3.
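The WORKS_FOR example can be sketched as a set of pairs and checked against the definition of a relationship set as a subset of the Cartesian product of the participating entity sets. Representing entities by their labels and the helper name employees_of are assumptions.

```python
from itertools import product

EMPLOYEE = {"e1", "e2", "e3", "e4", "e5", "e6", "e7"}
DEPARTMENT = {"d1", "d2", "d3"}

# Relationship set: each instance pairs one employee with one department.
WORKS_FOR = {
    ("e1", "d1"), ("e3", "d1"), ("e6", "d1"),
    ("e2", "d2"), ("e4", "d2"),
    ("e5", "d3"), ("e7", "d3"),
}

# A relationship set is a subset of the Cartesian product of the entity sets.
is_subset = WORKS_FOR <= set(product(EMPLOYEE, DEPARTMENT))


def employees_of(department: str) -> set:
    return {e for e, d in WORKS_FOR if d == department}
```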
In ER diagrams, relationship types are displayed as diamond-shaped boxes, which
are connected by straight lines to the rectangular boxes representing the participat-
ing entity types. The relationship name is displayed in the diamond-shaped box
(see Figure 3.2).
3.4.2 Relationship Degree, Role Names, and Recursive
Relationships
Degree of a Relationship Type. The degree of a relationship type is the number
of participating entity types. Hence, the WORKS_FOR relationship is of degree
two. A relationship type of degree two is called binary, and one of degree three is
called ternary. An example of a ternary relationship is SUPPLY, shown in Fig-
ure 3.10, where each relationship instance ri associates three entities—a supplier s, a
part p, and a project j—whenever s supplies part p to project j. Relationships can
generally be of any degree, but the ones most common are binary relationships.
Higher-degree relationships are generally more complex than binary relationships;
we characterize them further in Section 3.9.
Figure 3.9
Some instances in the WORKS_FOR relationship set, which represents a relationship type WORKS_FOR between EMPLOYEE and DEPARTMENT. The figure shows employee entities e1 through e7, relationship instances r1 through r7, and department entities d1 through d3.
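When relationship instances are written as tuples, the degree is simply the length of each tuple, as the sketch below shows. The SUPPLY instances are hypothetical, since the individual connections in Figure 3.10 are not listed in the text, and the function names are assumptions.

```python
def degree(relationship_set) -> int:
    """Number of participating entity types, read off the instance tuples."""
    arities = {len(instance) for instance in relationship_set}
    if len(arities) != 1:
        raise ValueError("all instances must have the same number of entities")
    return arities.pop()


WORKS_FOR = {("e1", "d1"), ("e2", "d2")}  # binary: (employee, department)

# Ternary: (supplier, part, project). These instances are hypothetical.
SUPPLY = {
    ("s1", "p1", "j1"),
    ("s1", "p2", "j1"),
    ("s2", "p1", "j2"),
    ("s2", "p3", "j3"),
}


def suppliers_of(part: str, project: str) -> set:
    # Suppliers s such that s supplies the given part to the given project.
    return {s for s, p, j in SUPPLY if p == part and j == project}
```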
Relationships as Attributes. It is sometimes convenient to think of a binary rela-
tionship type in terms of attributes, as we discussed in Section 3.3.3. Consider the
WORKS_FOR relationship type in Figure 3.9. One can think of an attribute called
Department of the EMPLOYEE entity type, where the value of Department for each
EMPLOYEE entity is (a reference to) the DEPARTMENT entity for which that
employee works. Hence, the value set for this Department attribute is the set of all
DEPARTMENT entities, which is the DEPARTMENT entity set. This is what we did in
Figure 3.8 when we specified the initial design of the entity type EMPLOYEE for the
COMPANY database. However, when we think of a binary relationship as an attribute,
we always have two options or two points of view. In this example, the alternative point
of view is to think of a multivalued attribute Employees of the entity type
DEPARTMENT whose value for each DEPARTMENT entity is the set of EMPLOYEE enti-
ties who work for that department. The value set of this Employees attribute is the power
set of the EMPLOYEE entity set. Either of these two attributes—Department of
EMPLOYEE or Employees of DEPARTMENT—can represent the WORKS_FOR relation-
ship type. If both are represented, they are constrained to be inverses of each other.8
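The two points of view, and the inverse constraint between them, can be sketched by deriving both attributes from the same WORKS_FOR pairs. The variable and function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set

WORKS_FOR = {
    ("e1", "d1"), ("e3", "d1"), ("e6", "d1"),
    ("e2", "d2"), ("e4", "d2"),
    ("e5", "d3"), ("e7", "d3"),
}

# Point of view 1: single-valued attribute Department of EMPLOYEE.
department_of: Dict[str, str] = {e: d for e, d in WORKS_FOR}

# Point of view 2: multivalued attribute Employees of DEPARTMENT.
employees_of: Dict[str, Set[str]] = defaultdict(set)
for e, d in WORKS_FOR:
    employees_of[d].add(e)


def are_inverses(dept_of: Dict[str, str], emps_of: Dict[str, Set[str]]) -> bool:
    forward = all(e in emps_of.get(d, set()) for e, d in dept_of.items())
    backward = all(dept_of.get(e) == d for d, es in emps_of.items() for e in es)
    return forward and backward


consistent = are_inverses(department_of, employees_of)

# Changing only one side breaks the inverse constraint.
moved = dict(department_of, e1="d2")
inconsistent = not are_inverses(moved, employees_of)
```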
Figure 3.10
Some relationship instances in the SUPPLY ternary relationship set. The figure shows supplier entities s1 and s2, part entities p1 through p3, project entities j1 through j3, and relationship instances r1 through r7.
8This concept of representing relationship types as attributes is used in a class of data models called
functional data models. In object databases (see Chapter 12), relationships can be represented by
reference attributes, either in one direction or in both directions as inverses. In relational databases
(see Chapter 5), foreign keys are a type of reference attribute used to represent relationships.
Role Names and Recursive Relationships. Each entity type that participates
in a relationship type plays a particular role in the relationship. The role name sig-
nifies the role that a participating entity from the entity type plays in each relation-
ship instance, and it helps to explain what the relationship means. For example, in
the WORKS_FOR relationship type, EMPLOYEE plays the role of employee or worker
and DEPARTMENT plays the role of department or employer.
Role names are not technically necessary in relationship types where all the partici-
pating entity types are distinct, since each participating entity type name can be used
as the role name. However, in some cases the same entity type participates more than
once in a relationship type in different roles. In such cases the role name becomes
essential for distinguishing the meaning of the role that each participating entity
plays. Such relationship types are called recursive relationships or self-referencing
relationships. Figure 3.11 shows an example. The SUPERVISION relationship type
relates an employee to a supervisor, where both employee and supervisor entities are
members of the same EMPLOYEE entity set. Hence, the EMPLOYEE entity type
participates twice in SUPERVISION: once in the role of supervisor (or boss), and
once in the role of supervisee (or subordinate). Each relationship instance ri in
SUPERVISION associates two different employee entities ej and ek, one of which
plays the role of supervisor and the other the role of supervisee. In Figure 3.11, the
lines marked ‘1’ represent the supervisor role, and those marked ‘2’ represent the
supervisee role; hence, e1 supervises e2 and e3, e4 supervises e6 and e7, and e5 super-
vises e1 and e4. In this example, each relationship instance must be connected with
two lines, one marked with ‘1’ (supervisor) and the other with ‘2’ (supervisee).
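As a rough sketch, the SUPERVISION instances described for Figure 3.11 can be written as pairs whose positions are labeled with role names. The tuple type and the helper functions are our own illustrative choices, not part of the ER model.

```python
from collections import namedtuple

# Each relationship instance names the role played by each EMPLOYEE entity.
Supervision = namedtuple("Supervision", ["supervisor", "supervisee"])

# Instances as stated in the text: e1 supervises e2 and e3,
# e4 supervises e6 and e7, and e5 supervises e1 and e4.
supervision = [
    Supervision("e1", "e2"), Supervision("e1", "e3"),
    Supervision("e4", "e6"), Supervision("e4", "e7"),
    Supervision("e5", "e1"), Supervision("e5", "e4"),
]

def supervisees_of(employee):
    """Employees related to the given employee playing the supervisor role."""
    return sorted(r.supervisee for r in supervision if r.supervisor == employee)

def supervisors_of(employee):
    """Employees related to the given employee playing the supervisee role."""
    return sorted(r.supervisor for r in supervision if r.supervisee == employee)

print(supervisees_of("e5"))
print(supervisors_of("e1"))
```

Without the role names, a pair such as (e5, e1) would not say who supervises whom, since both positions draw from the same EMPLOYEE entity set.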
Figure 3.11 A recursive relationship SUPERVISION between EMPLOYEE in the supervisor role (1) and EMPLOYEE in the subordinate role (2).
76 Chapter 3 Data Modeling Using the Entity–Relationship (ER) Model
3.4.3 Constraints on Binary Relationship Types
Relationship types usually have certain constraints that limit the possible combina-
tions of entities that may participate in the corresponding relationship set. These
constraints are determined from the miniworld situation that the relationships rep-
resent. For example, in Figure 3.9, if the company has a rule that each employee
must work for exactly one department, then we would like to describe this con-
straint in the schema. We can distinguish two main types of binary relationship
constraints: cardinality ratio and participation.
Cardinality Ratios for Binary Relationships. The cardinality ratio for a binary
relationship specifies the maximum number of relationship instances that an entity
can participate in. For example, in the WORKS_FOR binary relationship type,
DEPARTMENT:EMPLOYEE is of cardinality ratio 1:N, meaning that each department
can be related to (that is, employs) any number of employees (N),9 but an employee
can be related to (work for) at most one department (1). This means that for
this particular relationship type WORKS_FOR, a particular department entity can
be related to any number of employees (N indicates there is no maximum number).
On the other hand, an employee can be related to a maximum of one department.
The possible cardinality ratios for binary relationship types are 1:1, 1:N, N:1,
and M:N.
An example of a 1:1 binary relationship is MANAGES (Figure 3.12), which relates a
department entity to the employee who manages that department. This represents
the miniworld constraints that, at any point in time, an employee can manage at
most one department and a department can have at most one manager. The relationship
type WORKS_ON (Figure 3.13) is of cardinality ratio M:N, because the
miniworld rule is that an employee can work on several projects and a project can
have several employees.
9N stands for any number of related entities (zero or more). In some notations, the asterisk symbol (*) is
used instead of N.
Figure 3.12 A 1:1 relationship, MANAGES.
Cardinality ratios for binary relationships are represented on ER diagrams by dis-
playing 1, M, and N on the diamonds as shown in Figure 3.2. Notice that in this
notation, we can either specify no maximum (N) or a maximum of one (1) on par-
ticipation. An alternative notation (see Section 3.7.4) allows the designer to specify
a specific maximum number on participation, such as 4 or 5.
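The following sketch reports the tightest cardinality ratio that a given relationship set happens to satisfy. The function name observed_ratio and the sample instances are hypothetical. Note that a cardinality ratio is a rule of the miniworld stated in the schema, so a set of instances can show that a ratio is violated but can never establish that it holds in general.

```python
from collections import Counter

def observed_ratio(pairs):
    """Tightest ratio LEFT:RIGHT consistent with a set of (left, right) instances."""
    pairs = set(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    # A left entity related to several right entities puts N on the right side.
    right_side = "N" if any(c > 1 for c in left_counts.values()) else "1"
    # A right entity related to several left entities puts M (or N) on the left side.
    if any(c > 1 for c in right_counts.values()):
        left_side = "M" if right_side == "N" else "N"
    else:
        left_side = "1"
    return f"{left_side}:{right_side}"

# Hypothetical instances, written as (DEPARTMENT, EMPLOYEE) and so on.
works_for = {("d1", "e1"), ("d1", "e2"), ("d2", "e3")}
manages = {("e1", "d1"), ("e2", "d2")}
works_on = {("e1", "p1"), ("e1", "p2"), ("e2", "p1")}

print(observed_ratio(works_for), observed_ratio(manages), observed_ratio(works_on))
```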
Participation Constraints and Existence Dependencies. The participation
constraint specifies whether the existence of an entity depends on its being related
to another entity via the relationship type. This constraint specifies the minimum
number of relationship instances that each entity can participate in and is some-
times called the minimum cardinality constraint. There are two types of participa-
tion constraints—total and partial—that we illustrate by example. If a company
policy states that every employee must work for a department, then an employee
entity can exist only if it participates in at least one WORKS_FOR relationship
instance (Figure 3.9). Thus, the participation of EMPLOYEE in WORKS_FOR is
called total participation, meaning that every entity in the total set of employee
entities must be related to a department entity via WORKS_FOR. Total participation
is also called existence dependency. In Figure 3.12 we do not expect every
employee to manage a department, so the participation of EMPLOYEE in the
MANAGES relationship type is partial, meaning that some or part of the set of
employee entities are related to some department entity via MANAGES, but not
necessarily all. We will refer to the cardinality ratio and participation constraints,
taken together, as the structural constraints of a relationship type.
Figure 3.13 An M:N relationship, WORKS_ON.
In ER diagrams, total participation (or existence dependency) is displayed as a double
line connecting the participating entity type to the relationship, whereas partial par-
ticipation is represented by a single line (see Figure 3.2). Notice that in this notation,
we can either specify no minimum (partial participation) or a minimum of one (total
participation). An alternative notation (see Section 3.7.4) allows the designer to spec-
ify a specific minimum number on participation in the relationship, such as 4 or 5.
We will discuss constraints on higher-degree relationships in Section 3.9.
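A minimal sketch of how participation can be checked against stored instances follows. The function name and the data are hypothetical, and the check only reports what the current instances show; whether participation must be total is decided by the miniworld rules.

```python
def observed_participation(entity_set, pairs, position):
    """Report 'total' if every entity appears in some relationship instance.

    position selects which component of each pair holds the entity type
    being checked (0 for the first, 1 for the second).
    """
    participating = {pair[position] for pair in pairs}
    return "total" if set(entity_set) <= participating else "partial"

# Hypothetical instances, written as (EMPLOYEE, DEPARTMENT) pairs.
employees = {"e1", "e2", "e3"}
works_for = {("e1", "d1"), ("e2", "d1"), ("e3", "d2")}
manages = {("e1", "d1"), ("e3", "d2")}

print(observed_participation(employees, works_for, 0))
print(observed_participation(employees, manages, 0))
```

Here every employee works for a department, but e2 manages none, which matches the total participation of EMPLOYEE in WORKS_FOR and its partial participation in MANAGES.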
3.4.4 Attributes of Relationship Types
Relationship types can also have attributes, similar to those of entity types. For
example, to record the number of hours per week that a particular employee works
on a particular project, we can include an attribute Hours for the WORKS_ON
relationship type in Figure 3.13. Another example is to include the date on which
a manager started managing a department via an attribute Start_date for the
MANAGES relationship type in Figure 3.12.
Notice that attributes of 1:1 or 1:N relationship types can be migrated to one of the
participating entity types. For example, the Start_date attribute for the MANAGES
relationship can be an attribute of either EMPLOYEE (manager) or DEPARTMENT,
although conceptually it belongs to MANAGES. This is because MANAGES is a 1:1
relationship, so every department or employee entity participates in at most one
relationship instance. Hence, the value of the Start_date attribute can be determined
separately, either by the participating department entity or by the participating
employee (manager) entity.
For a 1:N relationship type, a relationship attribute can be migrated only to the
entity type on the N-side of the relationship. For example, in Figure 3.9, if the
WORKS_FOR relationship also has an attribute Start_date that indicates when an
employee started working for a department, this attribute can be included as an
attribute of EMPLOYEE. This is because each employee works for at most one
department, and hence participates in at most one relationship instance in
WORKS_FOR, but a department can have many employees, each with a different start date.
In both 1:1 and 1:N relationship types, the decision where to place a relationship
attribute—as a relationship type attribute or as an attribute of a participating entity
type—is determined subjectively by the schema designer.
For M:N (many-to-many) relationship types, some attributes may be determined
by the combination of participating entities in a relationship instance, not by any
single entity. Such attributes must be specified as relationship attributes. An example
is the Hours attribute of the M:N relationship WORKS_ON (Figure 3.13); the number
of hours per week an employee currently works on a project is determined by an
employee-project combination and not separately by either entity.
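The placement rules for relationship attributes can be sketched as follows, assuming simple in-memory mappings. All values shown are made up for illustration.

```python
# M:N case: Hours is determined by the (employee, project) combination,
# so it is keyed by the pair and cannot be stored with either entity alone.
hours = {("e1", "p1"): 32.5, ("e1", "p2"): 7.5, ("e2", "p1"): 40.0}

# 1:N case: Start_date of WORKS_FOR migrated to the N-side entity type EMPLOYEE.
# Each employee participates in at most one WORKS_FOR instance, so one value suffices.
employee = {
    "e1": {"department": "d1", "start_date": "2015-03-01"},
    "e2": {"department": "d1", "start_date": "2016-07-15"},
}

def total_hours(emp):
    """Sum the weekly hours of one employee over all projects."""
    return sum(h for (e, _), h in hours.items() if e == emp)

print(total_hours("e1"))
```

Storing Start_date with DEPARTMENT instead would not work in the 1:N case, because one department entity would need a different value for each of its employees.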
3.5 Weak Entity Types
Entity types that do not have key attributes of their own are called weak entity types. In
contrast, regular entity types that do have a key attribute—which include all the exam-
ples discussed so far—are called strong entity types. Entities belonging to a weak entity
type are identified by being related to specific entities from another entity type in com-
bination with one of their attribute values. We call this other entity type the identifying
or owner entity type,10 and we call the relationship type that relates a weak entity type
to its owner the identifying relationship of the weak entity type.11 A weak entity type
always has a total participation constraint (existence dependency) with respect to its
identifying relationship because a weak entity cannot be identified without an owner
entity. However, not every existence dependency results in a weak entity type. For
example, a DRIVER_LICENSE entity cannot exist unless it is related to a PERSON entity,
even though it has its own key (License_number) and hence is not a weak entity.
Consider the entity type DEPENDENT, related to EMPLOYEE, which is used to keep
track of the dependents of each employee via a 1:N relationship (Figure 3.2). In our
example, the attributes of DEPENDENT are Name (the first name of the dependent),
Birth_date, Sex, and Relationship (to the employee). Two dependents of two distinct
employees may, by chance, have the same values for Name, Birth_date, Sex, and
Relationship, but they are still distinct entities. They are identified as distinct entities
only after determining the particular employee entity to which each dependent is
related. Each employee entity is said to own the dependent entities that are related to it.
A weak entity type normally has a partial key, which is the attribute that can
uniquely identify weak entities that are related to the same owner entity.12 In our
example, if we assume that no two dependents of the same employee ever have the
same first name, the attribute Name of DEPENDENT is the partial key. In the worst
case, a composite attribute of all the weak entity’s attributes will be the partial key.
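One way to picture this identification scheme is a store keyed by the owner's key together with the partial key. In the sketch below, the Ssn values and the function add_dependent are hypothetical, and we assume Ssn is the key of EMPLOYEE.

```python
# DEPENDENT entities identified by (owner Ssn, partial key Name).
dependents = {}

def add_dependent(owner_ssn, name, birth_date, sex, relationship):
    """Add a weak entity; the partial key must be unique per owner."""
    key = (owner_ssn, name)
    if key in dependents:
        raise ValueError(f"employee {owner_ssn} already has a dependent named {name}")
    dependents[key] = {
        "Birth_date": birth_date, "Sex": sex, "Relationship": relationship,
    }

# Two dependents with identical attribute values remain distinct entities
# because they are owned by different employees.
add_dependent("123456789", "Alice", "1988-12-30", "F", "Daughter")
add_dependent("987654321", "Alice", "1988-12-30", "F", "Daughter")
print(len(dependents))
```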
In ER diagrams, both a weak entity type and its identifying relationship are distin-
guished by surrounding their boxes and diamonds with double lines (see Fig-
ure 3.2). The partial key attribute is underlined with a dashed or dotted line.
Weak entity types can sometimes be represented as complex (composite, multival-
ued) attributes. In the preceding example, we could specify a multivalued attribute
Dependents for EMPLOYEE, which is a multivalued composite attribute with the
component attributes Name, Birth_date, Sex, and Relationship. The choice of which
representation to use is made by the database designer. One criterion that may be
used is to choose the weak entity type representation if the weak entity type partici-
pates independently in relationship types other than its identifying relationship type.
In general, any number of levels of weak entity types can be defined; an owner
entity type may itself be a weak entity type. In addition, a weak entity type may have
more than one identifying entity type and an identifying relationship type of degree
higher than two, as we illustrate in Section 3.9.
10The identifying entity type is also sometimes called the parent entity type or the dominant entity type.
11The weak entity type is also sometimes called the child entity type or the subordinate entity type.
12The partial key is sometimes called the discriminator.
3.6 Refining the ER Design for
the COMPANY Database
We can now refine the database design in Figure 3.8 by changing the attributes that
represent relationships into relationship types. The cardinality ratio and participa-
tion constraint of each relationship type are determined from the requirements
listed in Section 3.2. If some cardinality ratio or dependency cannot be determined
from the requirements, the users must be questioned further to determine these
structural constraints.
In our example, we specify the following relationship types:
■ MANAGES, which is a 1:1 (one-to-one) relationship type between EMPLOYEE
and DEPARTMENT. EMPLOYEE participation is partial. DEPARTMENT
participation is not clear from the requirements. We question the users, who
say that a department must have a manager at all times, which implies total
participation.13 The attribute Start_date is assigned to this relationship type.
■ WORKS_FOR, a 1:N (one-to-many) relationship type between
DEPARTMENT and EMPLOYEE. Both participations are total.
■ CONTROLS, a 1:N relationship type between DEPARTMENT and PROJECT.
The participation of PROJECT is total, whereas that of DEPARTMENT is deter-
mined to be partial, after consultation with the users indicates that some
departments may control no projects.
■ SUPERVISION, a 1:N relationship type between EMPLOYEE (in the supervi-
sor role) and EMPLOYEE (in the supervisee role). Both participations are
determined to be partial, after the users indicate that not every employee is a
supervisor and not every employee has a supervisor.
■ WORKS_ON, determined to be an M:N (many-to-many) relationship type
with attribute Hours, after the users indicate that a project can have several
employees working on it. Both participations are determined to be total.
■ DEPENDENTS_OF, a 1:N relationship type between EMPLOYEE and
DEPENDENT, which is also the identifying relationship for the weak entity
type DEPENDENT. The participation of EMPLOYEE is partial, whereas that of
DEPENDENT is total.
After specifying the previous six relationship types, we remove from the entity types in
Figure 3.8 all attributes that have been refined into relationships. These include Manager
and Manager_start_date from DEPARTMENT; Controlling_department from
PROJECT; Department, Supervisor, and Works_on from EMPLOYEE; and Employee from
DEPENDENT. It is important to have the least possible redundancy when we design the
conceptual schema of a database. If some redundancy is desired at the storage level or at
the user view level, it can be introduced later, as discussed in Section 1.6.1.
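The six relationship types specified above can be summarized in a small data structure. This is only a sketch: the tuple layout and the helper total_participations are our own illustrative choices, and the cardinality ratios and participation constraints are copied from the list above.

```python
# (relationship type, cardinality ratio, ((entity type, participation), ...))
RELATIONSHIP_TYPES = [
    ("MANAGES", "1:1", (("EMPLOYEE", "partial"), ("DEPARTMENT", "total"))),
    ("WORKS_FOR", "1:N", (("DEPARTMENT", "total"), ("EMPLOYEE", "total"))),
    ("CONTROLS", "1:N", (("DEPARTMENT", "partial"), ("PROJECT", "total"))),
    # Supervisor role first, supervisee role second.
    ("SUPERVISION", "1:N", (("EMPLOYEE", "partial"), ("EMPLOYEE", "partial"))),
    ("WORKS_ON", "M:N", (("EMPLOYEE", "total"), ("PROJECT", "total"))),
    ("DEPENDENTS_OF", "1:N", (("EMPLOYEE", "partial"), ("DEPENDENT", "total"))),
]

def total_participations(entity_type):
    """Names of relationship types in which the entity type participates totally."""
    return [
        name
        for name, _, participants in RELATIONSHIP_TYPES
        if any(e == entity_type and p == "total" for e, p in participants)
    ]

print(total_participations("EMPLOYEE"))
```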
13The rules in the miniworld that determine the constraints are sometimes called the business rules,
since they are determined by the business or organization that will utilize the database.
3.7 ER Diagrams, Naming Conventions,
and Design Issues
3.7.1 Summary of Notation for ER Diagrams
Figures 3.9 through 3.13 illustrate examples of the participation of entity types in
relationship types by displaying their entity sets and relationship sets (or
extensions)—the individual entity instances in an entity set and the individual rela-
tionship instances in a relationship set. In ER diagrams the emphasis is on repre-
senting the schemas rather than the instances. This is more useful in database
design because a database schema changes rarely, whereas the contents of the entity
sets may change frequently. In addition, the schema is obviously easier to display,
because it is much smaller.
Figure 3.2 displays the COMPANY ER database schema as an ER diagram. We now
review the full ER diagram notation. Regular (strong) entity types such as
EMPLOYEE, DEPARTMENT, and PROJECT are shown in rectangular boxes. Relation-
ship types such as WORKS_FOR, MANAGES, CONTROLS, and WORKS_ON are
shown in diamond-shaped boxes attached to the participating entity types with
straight lines. Attributes are shown in ovals, and each attribute is attached by a straight
line to its entity type or relationship type. Component attributes of a composite attri-
bute are attached to the oval representing the composite attribute, as illustrated by the
Name attribute of EMPLOYEE. Multivalued attributes are shown in double ovals, as
illustrated by the Locations attribute of DEPARTMENT. Key attributes have their names
underlined. Derived attributes are shown in dotted ovals, as illustrated by the
Number_of_employees attribute of DEPARTMENT.
Weak entity types are distinguished by being placed in double rectangles and by
having their identifying relationship placed in double diamonds, as illustrated by
the DEPENDENT entity type and the DEPENDENTS_OF identifying relationship type.
The partial key of the weak entity type is underlined with a dotted line.
In Figure 3.2 the cardinality ratio of each binary relationship type is specified
by attaching a 1, M, or N on each participating edge. The cardinality ratio
of DEPARTMENT:EMPLOYEE in MANAGES is 1:1, whereas it is 1:N for
DEPARTMENT:EMPLOYEE in WORKS_FOR, and M:N for WORKS_ON. The participation
constraint is specified by a single line for partial participation and by double
lines for total participation (existence dependency).
In Figure 3.2 we show the role names for the SUPERVISION relationship type
because the same EMPLOYEE entity type plays two distinct roles in that relation-
ship. Notice that the cardinality ratio is 1:N from supervisor to supervisee because
each employee in the role of supervisee has at most one direct supervisor, whereas
an employee in the role of supervisor can supervise zero or more employees.
Figure 3.14 summarizes the conventions for ER diagrams. It is important to note
that there are many other alternative diagrammatic notations (see Section 3.7.4 and
Appendix A).
3.7.2 Proper Naming of Schema Constructs
When designing a database schema, the choice of names for entity types, attributes,
relationship types, and (particularly) roles is not always straightforward. One
should choose names that convey, as much as possible, the meanings attached to
the different constructs in the schema. We choose to use singular names for entity
types, rather than plural ones, because the entity type name applies to each indi-
vidual entity belonging to that entity type. In our ER diagrams, we will use the con-
vention that entity type and relationship type names are in uppercase letters,
attribute names have their initial letter capitalized, and role names are in lowercase
letters. We have used this convention in Figure 3.2.
As a general practice, given a narrative description of the database requirements,
the nouns appearing in the narrative tend to give rise to entity type names, and the
verbs tend to indicate names of relationship types. Attribute names generally arise
from additional nouns that describe the nouns corresponding to entity types.
Another naming consideration involves choosing binary relationship names to
make the ER diagram of the schema readable from left to right and from top to bot-
tom. We have generally followed this guideline in Figure 3.2. To explain this nam-
ing convention further, we have one exception to the convention in Figure 3.2—the
DEPENDENTS_OF relationship type, which reads from bottom to top. When we
describe this relationship, we can say that the DEPENDENT entities (bottom entity
type) are DEPENDENTS_OF (relationship name) an EMPLOYEE (top entity type). To
change this to read from top to bottom, we could rename the relationship type to
HAS_DEPENDENTS, which would then read as follows: An EMPLOYEE entity (top
entity type) HAS_DEPENDENTS (relationship name) of type DEPENDENT (bottom
entity type). Notice that this issue arises because each binary relationship can be
described starting from either of the two participating entity types, as discussed in
the beginning of Section 3.4.
3.7.3 Design Choices for ER Conceptual Design
It is occasionally difficult to decide whether a particular concept in the miniworld
should be modeled as an entity type, an attribute, or a relationship type. In this
section, we give some brief guidelines as to which construct should be chosen in
particular situations.
In general, the schema design process should be considered an iterative refinement
process, where an initial design is created and then iteratively refined until the most
suitable design is reached. Some of the refinements that are often used include the
following:
■ A concept may be first modeled as an attribute and then refined into a
relationship because it is determined that the attribute is a reference to another
entity type. It is often the case that a pair of such attributes that are inverses of
one another are refined into a binary relationship. We discussed this type of
refinement in detail in Section 3.6. It is important to note that in our notation,
once an attribute is replaced by a relationship, the attribute itself should be
removed from the entity type to avoid duplication and redundancy.
■ Similarly, an attribute that exists in several entity types may be elevated or
promoted to an independent entity type. For example, suppose that each
of several entity types in a UNIVERSITY database, such as STUDENT,
INSTRUCTOR, and COURSE, has an attribute Department in the
initial design; the designer may then choose to create an entity type
DEPARTMENT with a single attribute Dept_name and relate it to the three
entity types (STUDENT, INSTRUCTOR, and COURSE) via appropriate
relationships. Other attributes/relationships of DEPARTMENT may be
discovered later.
■ An inverse refinement to the previous case may be applied. For example, if
an entity type DEPARTMENT exists in the initial design with a single attribute
Dept_name and is related to only one other entity type, STUDENT, then
DEPARTMENT may be reduced or demoted to an attribute of STUDENT.
■ Section 3.9 discusses choices concerning the degree of a relationship. In
Chapter 4, we discuss other refinements concerning specialization/generalization.
Figure 3.14 Summary of the notation for ER diagrams. Symbols shown: entity, weak entity,
relationship, identifying relationship, attribute, key attribute, multivalued attribute,
composite attribute, derived attribute, total participation of E2 in R, cardinality ratio
1:N for E1:E2 in R, and structural constraint (min, max) on participation of E in R.
3.7.4 Alternative Notations for ER Diagrams
There are many alternative diagrammatic notations for displaying ER diagrams.
Appendix A gives some of the more popular notations. In Section 3.8, we introduce
the Unified Modeling Language (UML) notation for class diagrams, which has been
proposed as a standard for conceptual object modeling.
In this section, we describe one alternative ER notation for specifying structural
constraints on relationships, which replaces the cardinality ratio (1:1, 1:N, M:N)
and single/double-line notation for participation constraints. This notation
involves associating a pair of integer numbers (min, max) with each participation
of an entity type E in a relationship type R, where 0 ≤ min ≤ max and max ≥ 1. The
numbers mean that for each entity e in E, e must participate in at least min and at
most max relationship instances in R at any point in time. In this method,
min = 0 implies partial participation, whereas min > 0 implies total participation.
Figure 3.15 displays the COMPANY database schema using the (min, max) nota-
tion.14 Usually, one uses either the cardinality ratio/single-line/double-line nota-
tion or the (min, max) notation. The (min, max) notation is more precise, and we
can use it to specify some structural constraints for relationship types of higher
degree. However, it is not sufficient for specifying some key constraints on higher-
degree relationships, as discussed in Section 3.9.
Figure 3.15 also displays all the role names for the COMPANY database schema.
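A (min, max) constraint can be checked against a relationship set by counting how many instances each entity participates in. The sketch below uses a hypothetical function check_min_max and made-up WORKS_FOR instances; it assumes (1,1) for the participation of EMPLOYEE and (4,N) for the participation of DEPARTMENT, with N represented as None.

```python
from collections import Counter

def check_min_max(entity_set, pairs, position, min_count, max_count=None):
    """Return the entities whose participation count violates (min, max).

    position selects the component of each pair holding the entity type;
    max_count=None stands for N (no maximum).
    """
    counts = Counter(pair[position] for pair in set(pairs))
    violations = []
    for entity in sorted(entity_set):
        n = counts.get(entity, 0)
        if n < min_count or (max_count is not None and n > max_count):
            violations.append(entity)
    return violations

# Hypothetical WORKS_FOR instances as (EMPLOYEE, DEPARTMENT) pairs.
works_for = {("e1", "d1"), ("e2", "d1"), ("e3", "d1"), ("e4", "d1"), ("e5", "d2")}
employees = {"e1", "e2", "e3", "e4", "e5"}
departments = {"d1", "d2"}

print(check_min_max(employees, works_for, 0, 1, 1))   # EMPLOYEE (1,1)
print(check_min_max(departments, works_for, 1, 4))    # DEPARTMENT (4,N)
```

In this data every employee satisfies (1,1), while d2 has only one employee and therefore violates the minimum of 4.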
14In some notations, particularly those used in object modeling methodologies such as UML, the (min,
max) is placed on the opposite sides to the ones we have shown. For example, for the WORKS_FOR
relationship in Figure 3.15, the (1,1) would be on the DEPARTMENT side, and the (4,N) would be on the
EMPLOYEE side. Here we used the original notation from Abrial (1974).
3.8 Example of Other Notation:
UML Class Diagrams
The UML methodology is being used extensively in software design and has many
types of diagrams for various software design purposes. We only briefly present the
basics of UML class diagrams here and compare them with ER diagrams. In some
ways, class diagrams can be considered as an alternative notation to ER diagrams.
Additional UML notation and concepts are presented in Section 8.6. Figure 3.16
shows how the COMPANY ER database schema in Figure 3.15 can be displayed
using UML class diagram notation. The entity types in Figure 3.15 are modeled as
classes in Figure 3.16. An entity in ER corresponds to an object in UML.
Figure 3.15 ER diagrams for the company schema, with structural constraints specified using
(min, max) notation and role names.
In UML class diagrams, a class (similar to an entity type in ER) is displayed as a box
(see Figure 3.16) that includes three sections: The top section gives the class name
(similar to entity type name); the middle section includes the attributes; and the
last section includes operations that can be applied to individual objects (similar to
individual entities in an entity set) of the class. Operations are not specified in ER
diagrams. Consider the EMPLOYEE class in Figure 3.16. Its attributes are Name, Ssn,
Bdate, Sex, Address, and Salary. The designer can optionally specify the domain (or
data type) of an attribute if desired, by placing a colon (:) followed by the domain
name or description, as illustrated by the Name, Sex, and Bdate attributes
of EMPLOYEE in Figure 3.16. A composite attribute is modeled as a
structured domain, as illustrated by the Name attribute of EMPLOYEE. A multival-
ued attribute will generally be modeled as a separate class, as illustrated by the
LOCATION class in Figure 3.16.
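To suggest how the three sections of a UML class might carry over to code, here is a sketch of the EMPLOYEE class as a Python dataclass. The Python types, the sample values, and the body of the age operation are our assumptions; the diagram gives only the operation name.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Name:
    """Structured domain for the composite attribute Name."""
    fname: str
    minit: str
    lname: str

@dataclass
class Employee:
    # Attribute section of the class.
    name: Name
    ssn: str
    bdate: date
    sex: str
    address: str
    salary: float

    # Operation section: one operation applied to an individual object.
    def age(self, today: date) -> int:
        """Whole years elapsed between Bdate and the given date."""
        years = today.year - self.bdate.year
        if (today.month, today.day) < (self.bdate.month, self.bdate.day):
            years -= 1  # birthday not yet reached this year
        return years

emp = Employee(Name("John", "B", "Smith"), "123456789",
               date(1965, 1, 9), "M", "Houston, TX", 30000.0)
print(emp.age(date(2016, 1, 8)))
```

An ER diagram would capture the attributes of this class but not the age operation.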
Figure 3.16 The COMPANY conceptual schema in UML class diagram notation. Classes shown:
EMPLOYEE, DEPARTMENT, PROJECT, DEPENDENT, and LOCATION.
Relationship types are called associations in UML terminology, and relationship
instances are called links. A binary association (binary relationship type) is repre-
sented as a line connecting the participating classes (entity types), and may option-
ally have a name. A relationship attribute, called a link attribute, is placed in a box
that is connected to the association’s line by a dashed line. The (min, max) notation
described in Section 3.7.4 is used to specify relationship constraints, which are
called multiplicities in UML terminology. Multiplicities are specified in the form
min..max, and an asterisk (*) indicates no maximum limit on participation. However,
the multiplicities are placed on the opposite ends of the relationship when compared
with the (min, max) notation discussed in Section 3.7.4 (compare Figures 3.15
and 3.16). In UML, a single asterisk indicates a multiplicity of 0..*, and a
single 1 indicates a multiplicity of 1..1. A recursive relationship type (see Section 3.4.2)
is called a reflexive association in UML, and the role names—like the multiplicities—
are placed at the opposite ends of an association when compared with the placing of
role names in Figure 3.15.
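The multiplicity forms just described can be read mechanically. The following sketch uses a hypothetical helper, parse_multiplicity, that converts a multiplicity string into a (min, max) pair, with None standing for no maximum.

```python
def parse_multiplicity(text):
    """Convert 'min..max', '*', or a single number into a (min, max) pair."""
    text = text.strip()
    if text == "*":
        return (0, None)              # a single asterisk means 0..*
    if ".." not in text:
        n = int(text)
        return (n, n)                 # a single number n means n..n
    low, high = text.split("..")
    high = high.strip()
    return (int(low), None if high == "*" else int(high))

for m in ["4..*", "*", "1", "0..1"]:
    print(m, parse_multiplicity(m))
```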
In UML, there are two types of relationships: association and aggregation.
Aggregation is meant to represent a relationship between a whole object and its com-
ponent parts, and it has a distinct diagrammatic notation. In Figure 3.16, we modeled
the locations of a department and the single location of a project as aggregations.
However, aggregation and association do not have different structural properties, and
the choice as to which type of relationship to use—aggregation or association—is
somewhat subjective. In the ER model, both are represented as relationships.
UML also distinguishes between unidirectional and bidirectional associations
(or aggregations). In the unidirectional case, the line connecting the classes is dis-
played with an arrow to indicate that only one direction for accessing related
objects is needed. If no arrow is displayed, the bidirectional case is assumed, which
is the default. For example, if we always expect to access the manager of a depart-
ment starting from a DEPARTMENT object, we would draw the association line rep-
resenting the MANAGES association with an arrow from DEPARTMENT to
EMPLOYEE. In addition, relationship instances may be specified to be ordered.
For example, we could specify that the employee objects related to each depart-
ment through the WORKS_FOR association (relationship) should be ordered by
their Start_date attribute value. Association (relationship) names are optional in
UML, and relationship attributes are displayed in a box attached with a dashed
line to the line representing the association/aggregation (see Start_date and Hours
in Figure 3.16).
The operations given in each class are derived from the functional requirements of
the application, as we discussed in Section 3.1. It is generally sufficient to specify the
operation names initially for the logical operations that are expected to be applied
to individual objects of a class, as shown in Figure 3.16. As the design is refined,
more details are added, such as the exact argument types (parameters) for each
operation, plus a functional description of each operation. UML has function
descriptions and sequence diagrams to specify some of the operation details, but
these are beyond the scope of our discussion.
Weak entities can be modeled using the UML construct called qualified association
(or qualified aggregation); this can represent both the identifying relationship
and the partial key, which is placed in a box attached to the owner class. This is
illustrated by the DEPENDENT class and its qualified aggregation to EMPLOYEE in
Figure 3.16. In UML terminology, the partial key attribute Dependent_name is called
the discriminator, because its value distinguishes the objects associated with
(related to) the same EMPLOYEE entity. Qualified associations are not restricted to
modeling weak entities, and they can be used to model other situations in UML.
This section is not meant to be a complete description of UML class diagrams, but
rather to illustrate one popular type of alternative diagrammatic notation that can
be used for representing ER modeling concepts.
3.9 Relationship Types of Degree Higher than Two
In Section 3.4.2 we defined the degree of a relationship type as the number of par-
ticipating entity types and called a relationship type of degree two binary and a
relationship type of degree three ternary. In this section, we elaborate on the differ-
ences between binary and higher-degree relationships, when to choose higher-
degree versus binary relationships, and how to specify constraints on higher-degree
relationships.
3.9.1 Choosing between Binary and Ternary (or Higher-Degree) Relationships
The ER diagram notation for a ternary relationship type is shown in Figure 3.17(a),
which displays the schema for the SUPPLY relationship type that was displayed at the
instance level in Figure 3.10. Recall that the relationship set of SUPPLY is a set of rela-
tionship instances (s, j, p), where the meaning is that s is a SUPPLIER who is currently
supplying a PART p to a PROJECT j. In general, a relationship type R of degree n will
have n edges in an ER diagram, one connecting R to each participating entity type.
Figure 3.17(b) shows an ER diagram for three binary relationship types CAN_SUPPLY,
USES, and SUPPLIES. In general, a ternary relationship type represents different
information than do three binary relationship types. Consider the three binary
relationship types CAN_SUPPLY, USES, and SUPPLIES. Suppose that
CAN_SUPPLY, between SUPPLIER and PART, includes an instance (s, p) whenever
supplier s can supply part p (to any project); USES, between PROJECT and PART,
includes an instance (j, p) whenever project j uses part p; and SUPPLIES, between
SUPPLIER and PROJECT, includes an instance (s, j) whenever supplier s supplies
some part to project j. The existence of three relationship instances (s, p),
(j, p), and (s, j) in CAN_SUPPLY, USES, and SUPPLIES, respectively, does not neces-
sarily imply that an instance (s, j, p) exists in the ternary relationship SUPPLY,
because the meaning is different. It is often tricky to decide whether a particular
relationship should be represented as a relationship type of degree n or should be
broken down into several relationship types of smaller degrees. The designer must
base this decision on the semantics or meaning of the particular situation being
represented. The typical solution is to include the ternary relationship plus one or
more of the binary relationships, if they represent different meanings and if all are
needed by the application.
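The difference between the ternary relationship and the three binary relationships can be made concrete with a small sketch. The instance data below (suppliers s1 and s2, projects j1 and j2, parts p1 and p2) is invented for illustration and is not taken from Figure 3.10. The sketch projects a SUPPLY relationship set onto three binary sets and then exhibits a triple whose three pairs are all present even though the triple itself is not in SUPPLY.

```python
# Hypothetical SUPPLY relationship set: triples (supplier, project, part).
SUPPLY = {
    ("s1", "j1", "p1"),
    ("s1", "j2", "p2"),
    ("s2", "j1", "p2"),
}

# Binary relationship sets obtained here by projecting SUPPLY.
# (Assumption: in a real database CAN_SUPPLY and USES could hold
# additional pairs that never occur in SUPPLY.)
CAN_SUPPLY = {(s, p) for (s, j, p) in SUPPLY}   # SUPPLIER-PART
USES = {(j, p) for (s, j, p) in SUPPLY}         # PROJECT-PART
SUPPLIES = {(s, j) for (s, j, p) in SUPPLY}     # SUPPLIER-PROJECT


def pairs_present(s, j, p):
    """True if (s, p), (j, p), and (s, j) all exist in the binary sets."""
    return (s, p) in CAN_SUPPLY and (j, p) in USES and (s, j) in SUPPLIES


candidate = ("s1", "j1", "p2")
print(pairs_present(*candidate))  # True: all three pairs exist
print(candidate in SUPPLY)        # False: the triple is not in SUPPLY
```

Supplier s1 can supply p2, project j1 uses p2, and s1 supplies something to j1, yet s1 does not supply p2 to j1. The binary sets alone cannot tell these situations apart.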
Figure 3.17
Ternary relationship types. (a) The SUPPLY relationship. (b) Three binary relationships not
equivalent to SUPPLY. (c) SUPPLY represented as a weak entity type.
Some database design tools are based on variations of the ER model that permit
only binary relationships. In this case, a ternary relationship such as SUPPLY must
be represented as a weak entity type, with no partial key and with three identifying
relationships. The three participating entity types SUPPLIER, PART, and PROJECT
are together the owner entity types (see Figure 3.17(c)). Hence, an entity in the
weak entity type SUPPLY in Figure 3.17(c) is identified by the combination of its
three owner entities from SUPPLIER, PART, and PROJECT.
It is also possible to represent the ternary relationship as a regular entity type by
introducing an artificial or surrogate key. In this example, a key attribute Supply_id
could be used for the supply entity type, converting it into a regular entity type.
Three binary N:1 relationships relate SUPPLY to each of the three participating
entity types.
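The two representations differ only in how a SUPPLY entity is identified. The following sketch, which uses Python dataclasses purely as an illustration, contrasts them. The class names WeakSupply and RegularSupply, the helper new_regular_supply, and the sample values are assumptions made for this example; the attribute names follow Figure 3.17.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class WeakSupply:
    """SUPPLY as a weak entity: no partial key, identified by its owners."""
    sname: str       # key of the owner SUPPLIER
    part_no: int     # key of the owner PART
    proj_name: str   # key of the owner PROJECT
    quantity: int

    def identity(self):
        # The combination of the three owner entities identifies the entity.
        return (self.sname, self.part_no, self.proj_name)


@dataclass(frozen=True)
class RegularSupply:
    """SUPPLY as a regular entity with a surrogate key Supply_id."""
    supply_id: int
    sname: str
    part_no: int
    proj_name: str
    quantity: int

    def identity(self):
        # The surrogate key alone identifies the entity.
        return self.supply_id


_next_supply_id = count(1)  # hypothetical generator of surrogate key values


def new_regular_supply(sname, part_no, proj_name, quantity):
    return RegularSupply(next(_next_supply_id), sname, part_no, proj_name, quantity)


weak = WeakSupply("Acme", 10, "ProductX", 500)
regular = new_regular_supply("Acme", 10, "ProductX", 500)
print(weak.identity())     # ('Acme', 10, 'ProductX')
print(regular.identity())  # the generated Supply_id value
```

With the surrogate key, nothing in the identifier prevents two SUPPLY entities from referring to the same supplier, part, and project, so that restriction would have to be stated separately if it is required.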
Another example is shown in Figure 3.18. The ternary relationship type OFFERS
represents information on instructors offering courses during particular semesters;
hence it includes a relationship instance (i, s, c) whenever INSTRUCTOR i offers
COURSE c during SEMESTER s. The three binary relationship types shown in Fig-
ure 3.18 have the following meanings: CAN_TEACH relates a course to the instruc-
tors who can teach that course, TAUGHT_DURING relates a semester to the instructors
who taught some course during that semester, and OFFERED_DURING relates a
semester to the courses offered during that semester by any instructor. These ter-
nary and binary relationships represent different information, but certain
constraints should hold among the relationships. For example, a relationship
instance (i, s, c) should not exist in OFFERS unless an instance (i, s) exists in
TAUGHT_DURING, an instance (s, c) exists in OFFERED_DURING, and an instance
(i, c) exists in CAN_TEACH. However, the reverse is not always true;
we may have instances (i, s), (s, c), and (i, c) in the three binary relationship types
with no corresponding instance (i, s, c) in OFFERS. Note that in this example,
based on the meanings of the relationships, we can infer the instances of
TAUGHT_DURING and OFFERED_DURING from the instances in OFFERS, but
we cannot infer the instances of CAN_TEACH; therefore, TAUGHT_DURING and
OFFERED_DURING are redundant and can be left out.
Figure 3.18
Another example of ternary versus binary relationship types.
Although in general three binary relationships cannot replace a ternary relation-
ship, they may do so under certain additional constraints. In our example, if the
CAN_TEACH relationship is 1:1 (an instructor can teach only one course, and a
course can be taught by only one instructor), then the ternary relationship OFFERS
can be left out because it can be inferred from the three binary relationships
CAN_TEACH, TAUGHT_DURING, and OFFERED_DURING. The schema designer
must analyze the meaning of each specific situation to decide which of the binary
and ternary relationship types are needed.
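The inference rules for the OFFERS example can be sketched in a few lines. The instructor names, semesters, and course numbers below are hypothetical, and the function names are chosen only for this illustration. The sketch derives TAUGHT_DURING and OFFERED_DURING from OFFERS, checks the constraint that every OFFERS instance is backed by the three binary instances, and rebuilds OFFERS from the binary sets in the special case where CAN_TEACH is 1:1.

```python
# Hypothetical OFFERS relationship set: (instructor, semester, course).
OFFERS = {
    ("Smith", "Fall 2015", "CS101"),
    ("Smith", "Spring 2016", "CS101"),
    ("Wong", "Fall 2015", "CS202"),
}

# CAN_TEACH cannot be inferred from OFFERS in general; it is given here
# and happens to be 1:1 (an assumption of this example).
CAN_TEACH = {("Smith", "CS101"), ("Wong", "CS202")}

# These two binary sets can be inferred from OFFERS.
TAUGHT_DURING = {(i, s) for (i, s, c) in OFFERS}
OFFERED_DURING = {(s, c) for (i, s, c) in OFFERS}


def offers_is_consistent(offers, can_teach, taught_during, offered_during):
    """Each (i, s, c) requires (i, c), (i, s), and (s, c) in the binary sets."""
    return all(
        (i, c) in can_teach and (i, s) in taught_during and (s, c) in offered_during
        for (i, s, c) in offers
    )


def is_one_to_one(pairs):
    """True if no left value and no right value occurs in more than one pair."""
    lefts = [a for a, _ in pairs]
    rights = [b for _, b in pairs]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)


def rebuild_offers(can_teach, taught_during, offered_during):
    """Join the three binary sets; matches OFFERS only when CAN_TEACH is 1:1."""
    return {
        (i, s, c)
        for (i, c) in can_teach
        for (i2, s) in taught_during
        if i2 == i and (s, c) in offered_during
    }


print(offers_is_consistent(OFFERS, CAN_TEACH, TAUGHT_DURING, OFFERED_DURING))  # True
print(is_one_to_one(CAN_TEACH))                                                # True
print(rebuild_offers(CAN_TEACH, TAUGHT_DURING, OFFERED_DURING) == OFFERS)      # True
```

If CAN_TEACH were not 1:1, the same join could produce triples that are not in OFFERS, which is why the ternary relationship is needed in the general case.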
Notice that it is possible to have a weak entity type with a ternary (or n-ary) identi-
fying relationship type. In this case, the weak entity type can have several owner
entity types. An example is shown in Figure 3.19. This example shows part of a
database that keeps track of candidates interviewing for jobs at various companies,
which may be part of an employment agency database. In the requirements, a can-
didate can have multiple interviews with the same company (for example, with dif-
ferent company departments or on separate dates), but a job offer is made based on
one of the interviews. Here, INTERVIEW is represented as a weak entity with two
owners CANDIDATE and COMPANY, and with the partial key Dept_date. An
INTERVIEW entity is uniquely identified by a candidate, a company, and the combi-
nation of the date and department of the interview.
3.9.2 Constraints on Ternary (or Higher-Degree) Relationships
There are two notations for specifying structural constraints on n-ary relationships,
and they specify different constraints. They should thus both be used if it is impor-
tant to fully specify the structural constraints on a ternary or higher-degree rela-
tionship. The first notation is based on the cardinality ratio notation of binary
relationships displayed in Figure 3.2. Here, a 1, M, or N is specified on each
participation arc (both M and N symbols stand for many or any number).[15] Let us
illustrate this constraint using the SUPPLY relationship in Figure 3.17.
Figure 3.19
A weak entity type INTERVIEW with a ternary identifying relationship type.
Recall that the relationship set of SUPPLY is a set of relationship instances (s, j, p),
where s is a SUPPLIER, j is a PROJECT, and p is a PART. Suppose that the constraint
exists that for a particular project-part combination, only one supplier will be used
(only one supplier supplies a particular part to a particular project). In this case, we
place 1 on the SUPPLIER participation, and M, N on the PROJECT, PART participa-
tions in Figure 3.17. This specifies the constraint that a particular (j, p) combination
can appear at most once in the relationship set because each such (PROJECT, PART)
combination uniquely determines a single supplier. Hence, any relationship
instance (s, j, p) is uniquely identified in the relationship set by its (j, p) combina-
tion, which makes (j, p) a key for the relationship set. In this notation, the participa-
tions that have a 1 specified on them are not required to be part of the identifying
key for the relationship set.[16] If all three cardinalities are M or N, then the key will
be the combination of all three participants.
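As an illustration, the constraint expressed by placing 1 on the SUPPLIER participation can be checked against a relationship set. The sketch below is only one way to phrase the check; the function names and the sample triples are assumptions made for this example.

```python
from collections import defaultdict


def suppliers_per_project_part(supply):
    """Group the suppliers in SUPPLY by their (project, part) combination."""
    groups = defaultdict(set)
    for s, j, p in supply:
        groups[(j, p)].add(s)
    return groups


def satisfies_one_supplier_constraint(supply):
    """True if each (j, p) combination determines a single supplier,
    that is, (j, p) can serve as a key for the relationship set."""
    return all(len(suppliers) == 1
               for suppliers in suppliers_per_project_part(supply).values())


# Hypothetical relationship instances (s, j, p).
ok_supply = {("s1", "j1", "p1"), ("s2", "j1", "p2"), ("s1", "j2", "p1")}
bad_supply = ok_supply | {("s2", "j1", "p1")}  # second supplier for (j1, p1)

print(satisfies_one_supplier_constraint(ok_supply))   # True
print(satisfies_one_supplier_constraint(bad_supply))  # False
```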
The second notation is based on the (min, max) notation displayed in Figure 3.15
for binary relationships. A (min, max) on a participation here specifies that each
entity is related to at least min and at most max relationship instances in the rela-
tionship set. These constraints have no bearing on determining the key of an n-ary
relationship, where n > 2,[17] but specify a different type of constraint that places
restrictions on how many relationship instances each entity can participate in.
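A (min, max) constraint on one participation can be checked by counting, for each entity, the relationship instances in which it appears. The following is a minimal sketch under assumed data; the function names, the position argument (which component of the triple to count), and the sample sets are hypothetical.

```python
from collections import Counter


def participation_counts(relationship_set, position):
    """Count how many relationship instances each entity takes part in."""
    return Counter(instance[position] for instance in relationship_set)


def satisfies_min_max(relationship_set, position, entities, min_count, max_count=None):
    """Check a (min, max) constraint; max_count=None stands for N (no limit)."""
    counts = participation_counts(relationship_set, position)
    for entity in entities:
        n = counts[entity]  # a Counter yields 0 for entities that never appear
        if n < min_count:
            return False
        if max_count is not None and n > max_count:
            return False
    return True


# Hypothetical SUPPLY instances (s, j, p) and the full set of suppliers.
SUPPLY = {("s1", "j1", "p1"), ("s1", "j2", "p1"), ("s2", "j1", "p2")}
SUPPLIERS = {"s1", "s2", "s3"}

print(satisfies_min_max(SUPPLY, 0, SUPPLIERS, 0, 2))  # True: (0, 2) holds
print(satisfies_min_max(SUPPLY, 0, SUPPLIERS, 1))     # False: s3 never participates
print(satisfies_min_max(SUPPLY, 0, SUPPLIERS, 0, 1))  # False: s1 participates twice
```

Note that the check needs the whole entity set, not just the entities that occur in the relationship set, because a minimum greater than zero is violated by entities that do not participate at all.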
3.10 Another Example: A UNIVERSITY Database
We now present another example, a UNIVERSITY database, to illustrate the ER
modeling concepts. Suppose that a database is needed to keep track of student
enrollments in classes and students’ final grades. After analyzing the miniworld
rules and the users’ needs, the requirements for this database were determined to be
as follows (for brevity, we show the chosen entity type names and attribute names
for the conceptual schema in parentheses as we describe the requirements; relation-
ship type names are only shown in the ER schema diagram):
■ The university is organized into colleges (COLLEGE), and each college has a
unique name (CName), a main office (COffice) and phone (CPhone), and a
particular faculty member who is dean of the college. Each college adminis-
ters a number of academic departments (DEPT). Each department has a
unique name (DName), a unique code number (DCode), a main office
(DOffice) and phone (DPhone), and a particular faculty member who chairs
the department. We keep track of the start date (CStartDate) when that fac-
ulty member began chairing the department.
15. This notation allows us to determine the key of the relationship relation, as we discuss in Chapter 9.
16. This is also true for cardinality ratios of binary relationships.
17. The (min, max) constraints can determine the keys for binary relationships.
■ A department offers a number of courses (COURSE), each of which has a
unique course name (CoName), a unique code number (CCode), a course
level (Level: this can be coded as 1 for freshman level, 2 for sophomore, 3 for
junior, 4 for senior, 5 for MS level, and 6 for PhD level), a number of credit
hours (Credits), and a course description (CDesc). The database also keeps
track of instructors (INSTRUCTOR); and each instructor has a unique iden-
tifier (Id), name (IName), office (IOffice), phone (IPhone), and rank (Rank);
in addition, each instructor works for one primary academic department.
■ The database keeps student data (STUDENT) and stores each student’s
name (SName, composed of first name (FName), middle name (MName),
last name (LName)), student id (Sid, unique for every student), address
(Addr), phone (Phone), major code (Major), and date of birth (DoB). A stu-
dent is assigned to one primary academic department. It is required to keep
track of the student’s grades in each section the student has completed.
■ Courses are offered as sections (SECTION). Each section is related to a single
course and a single instructor and has a unique section identifier (SecId). A
section also has a section number (SecNo: this is coded as 1, 2, 3, . . . for mul-
tiple sections offered during the same semester/year), semester (Sem), year
(Year), classroom (CRoom: this is coded as a combination of building code
(Bldg) and room number (RoomNo) within the building), and days/times
(DaysTime: for example, ‘MWF 9am-9.50am’ or ‘TR 3.30pm-5.20pm’—
restricted to only allowed days/time values). (Note: The database will keep
track of all the sections offered for the past several years, in addition to the
current offerings. The SecId is unique for all sections, not just the sections for
a particular semester.) The database keeps track of the students in each section,
and the grade is recorded when available (this is a many-to-many relationship
between students and sections). A section must have at least five students.
The ER diagram for these requirements is shown in Figure 3.20 using the min-max ER
diagrammatic notation. Notice that for the SECTION entity type, we only showed
SecId as an underlined key, but because of the miniworld constraints, several other
combinations of values have to be unique for each section entity. For example, each of
the following combinations must be unique based on the typical miniworld constraints:
1. (SecNo, Sem, Year, CCode (of the COURSE related to the SECTION)): This
specifies that the section numbers of a particular course must be different
during each particular semester and year.
2. (Sem, Year, CRoom, DaysTime): This specifies that in a particular semester
and year, a classroom cannot be used by two different sections at the same
days/time.
3. (Sem, Year, DaysTime, Id (of the INSTRUCTOR teaching the SECTION)):
This specifies that in a particular semester and year, an instructor cannot
teach two sections at the same days/time. Note that this rule will not apply if
an instructor is allowed to teach two combined sections together in the par-
ticular university.
Can you think of any other attribute combinations that have to be unique?
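The three uniqueness requirements listed above can be tested mechanically on sample data. In the sketch below, each section is flattened into a dictionary that also carries CCode (from the related COURSE) and Id (from the related INSTRUCTOR); this flattening, the function name duplicates, and the two sample sections are assumptions made for illustration only.

```python
from collections import Counter


def duplicates(sections, attrs):
    """Return the value combinations of attrs shared by more than one section."""
    combos = Counter(tuple(sec[a] for a in attrs) for sec in sections)
    return [combo for combo, n in combos.items() if n > 1]


# Two hypothetical sections of the same course, taught by the same
# instructor at the same days/time in different rooms.
sections = [
    {"SecId": 1, "SecNo": 1, "Sem": "Fall", "Year": 2015, "CCode": "CS101",
     "CRoom": "ENG-101", "DaysTime": "MWF 9am-9.50am", "Id": 7},
    {"SecId": 2, "SecNo": 2, "Sem": "Fall", "Year": 2015, "CCode": "CS101",
     "CRoom": "ENG-102", "DaysTime": "MWF 9am-9.50am", "Id": 7},
]

# Combination 1: section numbers of a course within a semester and year.
print(duplicates(sections, ("SecNo", "Sem", "Year", "CCode")))     # []
# Combination 2: one section per classroom at a given days/time.
print(duplicates(sections, ("Sem", "Year", "CRoom", "DaysTime")))  # []
# Combination 3: one section per instructor at a given days/time.
print(duplicates(sections, ("Sem", "Year", "DaysTime", "Id")))
# [('Fall', 2015, 'MWF 9am-9.50am', 7)]
```

The sample data satisfies the first two combinations but violates the third, unless the university allows an instructor to teach combined sections, as noted in the text.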
Figure 3.20
An ER diagram for a UNIVERSITY database schema.
3.11 Summary
In this chapter we presented the modeling concepts of a high-level conceptual data
model, the entity–relationship (ER) model. We started by discussing the role that a
high-level data model plays in the database design process, and then we presented a
sample set of database requirements for the COMPANY database, which is one of the
examples that is used throughout this text. We defined the basic ER model concepts
of entities and their attributes. Then we discussed NULL values and presented the
various types of attributes, which can be nested arbitrarily to produce complex
attributes:
■ Simple or atomic
■ Composite
■ Multivalued
We also briefly discussed stored versus derived attributes. Then we discussed the
ER model concepts at the schema or “intension” level:
■ Entity types and their corresponding entity sets
■ Key attributes of entity types
■ Value sets (domains) of attributes
■ Relationship types and their corresponding relationship sets
■ Participation roles of entity types in relationship types
We presented two methods for specifying the structural constraints on relationship
types. The first method distinguished two types of structural constraints:
■ Cardinality ratios (1:1, 1:N, M:N for binary relationships)
■ Participation constraints (total, partial)
We noted that, alternatively, another method of specifying structural constraints is
to specify minimum and maximum numbers (min, max) on the participation of
each entity type in a relationship type. We discussed weak entity types and the
related concepts of owner entity types, identifying relationship types and partial key
attributes.
Entity–relationship schemas can be represented diagrammatically as ER diagrams.
We showed how to design an ER schema for the COMPANY database by first defin-
ing the entity types and their attributes and then refining the design to include rela-
tionship types. We displayed the ER diagram for the COMPANY database schema.
We discussed some of the basic concepts of UML class diagrams and how they
relate to ER modeling concepts. We also described ternary and higher-degree
relationship types in more detail, and we discussed the circumstances under which
they are distinguished from binary relationships. Finally, we presented require-
ments for a UNIVERSITY database schema as another example, and we showed the
ER schema design.
The ER modeling concepts we have presented thus far—entity types, relationship
types, attributes, keys, and structural constraints—can model many database appli-
cations. However, more complex applications—such as engineering design, medi-
cal information systems, and telecommunications—require additional concepts if
we want to model them with greater accuracy. We discuss some advanced model-
ing concepts in Chapter 8 and revisit further advanced data modeling techniques in
Chapter 26.
Review Questions
3.1. Discuss the role of a high-level data model in the database design process.
3.2. List the various cases where use of a NULL value would be appropriate.
3.3. Define the following terms: entity, attribute, attribute value, relationship
instance, composite attribute, multivalued attribute, derived attribute, com-
plex attribute, key attribute, and value set (domain).
3.4. What is an entity type? What is an entity set? Explain the differences among
an entity, an entity type, and an entity set.
3.5. Explain the difference between an attribute and a value set.
3.6. What is a relationship type? Explain the differences among a relationship
instance, a relationship type, and a relationship set.
3.7. What is a participation role? When is it necessary to use role names in the
description of relationship types?
3.8. Describe the two alternatives for specifying structural constraints on rela-
tionship types. What are the advantages and disadvantages of each?
3.9. Under what conditions can an attribute of a binary relationship type be
migrated to become an attribute of one of the participating entity types?
3.10. When we think of relationships as attributes, what are the value sets of these
attributes? What class of data models is based on this concept?
3.11. What is meant by a recursive relationship type? Give some examples of
recursive relationship types.
3.12. When is the concept of a weak entity used in data modeling? Define the
terms owner entity type, weak entity type, identifying relationship type, and
partial key.
3.13. Can an identifying relationship of a weak entity type be of a degree greater
than two? Give examples to illustrate your answer.
3.14. Discuss the conventions for displaying an ER schema as an ER diagram.
3.15. Discuss the naming conventions used for ER schema diagrams.
Exercises
3.16. Which combinations of attributes have to be unique for each individual
SECTION entity in the UNIVERSITY database shown in Figure 3.20 to enforce
each of the following miniworld constraints:
a. During a particular semester and year, only one section can use a particu-
lar classroom at a particular DaysTime value.
b. During a particular semester and year, an instructor can teach only one
section at a particular DaysTime value.
c. During a particular semester and year, the section numbers for sections
offered for the same course must all be different.
Can you think of any other similar constraints?
3.17. Composite and multivalued attributes can be nested to any number of lev-
els. Suppose we want to design an attribute for a STUDENT entity type to
keep track of previous college education. Such an attribute will have one
entry for each college previously attended, and each such entry will be com-
posed of college name, start and end dates, degree entries (degrees awarded
at that college, if any), and transcript entries (courses completed at that col-
lege, if any). Each degree entry contains the degree name and the month and
year the degree was awarded, and each transcript entry contains a course
name, semester, year, and grade. Design an attribute to hold this informa-
tion. Use the conventions in Figure 3.5.
3.18. Show an alternative design for the attribute described in Exercise 3.17 that
uses only entity types (including weak entity types, if needed) and relation-
ship types.
3.19. Consider the ER diagram in Figure 3.21, which shows a simplified schema
for an airline reservations system. Extract from the ER diagram the require-
ments and constraints that produced this schema. Try to be as precise as
possible in your requirements and constraints specification.
3.20. In Chapters 1 and 2, we discussed the database environment and database
users. We can consider many entity types to describe such an environment,
such as DBMS, stored database, DBA, and catalog/data dictionary. Try to
specify all the entity types that can fully describe a database system and its
environment; then specify the relationship types among them, and draw an
ER diagram to describe such a general database environment.
3.21. Design an ER schema for keeping track of information about votes taken in
the U.S. House of Representatives during the current two-year congressional
session. The database needs to keep track of each U.S. STATE’s Name
(e.g., ‘Texas’, ‘New York’, ‘California’) and include the Region of the state
(whose domain is {‘Northeast’, ‘Midwest’, ‘Southeast’, ‘Southwest’, ‘West’}).
Each CONGRESS_PERSON in the House of Representatives is described by
his or her Name, plus the District represented, the Start_date when the con-
gressperson was first elected, and the political Party to which he or she
belongs (whose domain is {‘Republican’, ‘Democrat’, ‘Independent’,
‘Other’}). The database keeps track of each BILL (i.e., proposed law),
including the Bill_name, the Date_of_vote on the bill, whether the bill
Passed_or_failed (whose domain is {‘Yes’, ‘No’}), and the Sponsor (the
congressperson(s) who sponsored—that is, proposed—the bill). The data-
base also keeps track of how each congressperson voted on each bill (domain
of Vote attribute is {‘Yes’, ‘No’, ‘Abstain’, ‘Absent’}). Draw an ER schema
diagram for this application. State clearly any assumptions you make.
Figure 3.21
An ER diagram for an AIRLINE database schema.
Notes: A LEG (segment) is a nonstop portion of a flight. A LEG_INSTANCE is a
particular occurrence of a LEG on a particular date.
3.22. A database is being constructed to keep track of the teams and games of a
sports league. A team has a number of players, not all of whom participate in
each game. It is desired to keep track of the players participating in each
game for each team, the positions they played in that game, and the result of
the game. Design an ER schema diagram for this application, stating any
assumptions you make. Choose your favorite sport (e.g., soccer, baseball,
football).
3.23. Consider the ER diagram shown in Figure 3.22 for part of a BANK database.
Each bank can have multiple branches, and each branch can have multiple
accounts and loans.
a. List the strong (nonweak) entity types in the ER diagram.
b. Is there a weak entity type? If so, give its name, partial key, and identify-
ing relationship.
c. What constraints do the partial key and the identifying relationship of the
weak entity type specify in this diagram?
d. List the names of all relationship types, and specify the (min, max)
constraint on each participation of an entity type in a relationship type.
Justify your choices.
Figure 3.22
An ER diagram for a BANK database schema.
e. List concisely the user requirements that led to this ER schema design.
f. Suppose that every customer must have at least one account but is
restricted to at most two loans at a time, and that a bank branch cannot
have more than 1,000 loans. How does this show up on the (min, max)
constraints?
3.24. Consider the ER diagram in Figure 3.23. Assume that an employee may
work in up to two departments or may not be assigned to any department.
Assume that each department must have one and may have up to three
phone numbers. Supply (min, max) constraints on this diagram. State clearly
any additional assumptions you make. Under what conditions would the
relationship HAS_PHONE be redundant in this example?
3.25. Consider the ER diagram in Figure 3.24. Assume that a course may or may
not use a textbook, but that a text by definition is a book that is used in some
course. A course may not use more than five books. Instructors teach from
two to four courses. Supply (min, max) constraints on this diagram. State
clearly any additional assumptions you make. If we add the relationship
ADOPTS, to indicate the textbook(s) that an instructor uses for a course,
should it be a binary relationship between INSTRUCTOR and TEXT, or a
ternary relationship among all three entity types? What (min, max) con-
straints would you put on the relationship? Why?
Figure 3.23
Part of an ER diagram for a COMPANY database.
Figure 3.24
Part of an ER diagram for a COURSES database.
3.26. Consider an entity type SECTION in a UNIVERSITY database, which describes
the section offerings of courses. The attributes of SECTION are
Section_number, Semester, Year, Course_number, Instructor, Room_no (where
section is taught), Building (where section is taught), Weekdays (domain is
the possible combinations of weekdays in which a section can be offered
{‘MWF’, ‘MW’, ‘TT’, and so on}), and Hours (domain is all possible
time periods during which sections are offered {‘9–9:50 a.m.’, ‘10–10:50
a.m.’, . . . , ‘3:30–4:50 p.m.’, ‘5:30–6:20 p.m.’, and so on}). Assume that
Section_number is unique for each course within a particular semes-
ter/year combination (that is, if a course is offered multiple times during
a particular semester, its section offerings are numbered 1, 2, 3, and so
on). There are several composite keys for section, and some attributes
are components of more than one key. Identify three composite keys,
and show how they can be represented in an ER schema diagram.
3.27. Cardinality ratios often dictate the detailed design of a database. The cardi-
nality ratio depends on the real-world meaning of the entity types involved
and is defined by the specific application. For the following binary relation-
ships, suggest cardinality ratios based on the common-sense meaning of the
entity types. Clearly state any assumptions you make.
    Entity 1                                 Cardinality Ratio   Entity 2
 1. STUDENT                                  ______________      SOCIAL_SECURITY_CARD
 2. STUDENT                                  ______________      TEACHER
 3. CLASSROOM                                ______________      WALL
 4. COUNTRY                                  ______________      CURRENT_PRESIDENT
 5. COURSE                                   ______________      TEXTBOOK
 6. ITEM (that can be found in an order)     ______________      ORDER
 7. STUDENT                                  ______________      CLASS
 8. CLASS                                    ______________      INSTRUCTOR
 9. INSTRUCTOR                               ______________      OFFICE
10. EBAY_AUCTION_ITEM                        ______________      EBAY_BID
3.28. Consider the ER schema for the MOVIES database in Figure 3.25.
Assume that MOVIES is a populated database. ACTOR is used as a generic term
and includes actresses. Given the constraints shown in the ER schema, respond
to the following statements with True, False, or Maybe. Assign a response of
Maybe to statements that, although not explicitly shown to be True, cannot be
proven False based on the schema as shown. Justify each answer.
a. There are no actors in this database that have been in no movies.
b. There are some actors who have acted in more than ten movies.
c. Some actors have done a lead role in multiple movies.
d. A movie can have only a maximum of two lead actors.
e. Every director has been an actor in some movie.
f. No producer has ever been an actor.
g. A producer cannot be an actor in some other movie.
h. There are movies with more than a dozen actors.
i. Some producers have been a director as well.
j. Most movies have one director and one producer.
k. Some movies have one director but several producers.
l. There are some actors who have done a lead role, directed a movie, and
produced a movie.
m. No movie has a director who also acted in that movie.
3.29. Given the ER schema for the MOVIES database in Figure 3.25, draw an
instance diagram using three movies that have been released recently.
Draw instances of each entity type: MOVIES, ACTORS, PRODUCERS,
DIRECTORS involved; make up instances of the relationships as they exist in
reality for those movies.
Figure 3.25
An ER diagram for a MOVIES database schema.
3.30. Illustrate the UML diagram for Exercise 3.16. Your UML design should
observe the following requirements:
a. A student should have the ability to compute his/her GPA and add or
drop majors and minors.
b. Each department should be able to add or delete courses and hire or ter-
minate faculty.
c. Each instructor should be able to assign or change a student’s grade for a
course.
Note: Some of these functions may be spread over multiple classes.
Laboratory Exercises
3.31. Consider the UNIVERSITY database described in Exercise 3.16. Build the ER
schema for this database using a data modeling tool such as ERwin or
Rational Rose.
3.32. Consider a MAIL_ORDER database in which employees take orders for parts
from customers. The data requirements are summarized as follows:
■ The mail order company has employees, each identified by a unique em-
ployee number, first and last name, and Zip Code.
■ Each customer of the company is identified by a unique customer number,
first and last name, and Zip Code.
■ Each part sold by the company is identified by a unique part number, a
part name, price, and quantity in stock.
■ Each order placed by a customer is taken by an employee and is given a
unique order number. Each order contains specified quantities of one or
more parts. Each order has a date of receipt as well as an expected ship
date. The actual ship date is also recorded.
Design an entity–relationship diagram for the mail order database and build
the design using a data modeling tool such as ERwin or Rational Rose.
3.33. Consider a MOVIE database in which data is recorded about the movie
industry. The data requirements are summarized as follows:
■ Each movie is identified by title and year of release. Each movie has a
length in minutes. Each has a production company, and each is classified
under one or more genres (such as horror, action, drama, and so forth).
Each movie has one or more directors and one or more actors appear in it.
Each movie also has a plot outline. Finally, each movie has zero or more
quotable quotes, each of which is spoken by a particular actor appearing
in the movie.
■ Actors are identified by name and date of birth and appear in one or more
movies. Each actor has a role in the movie.
■ Directors are also identified by name and date of birth and direct one or
more movies. It is possible for a director to act in a movie (including one
that he or she may also direct).
■ Production companies are identified by name and each has an address. A
production company produces one or more movies.
Design an entity–relationship diagram for the movie database and enter the
design using a data modeling tool such as ERwin or Rational Rose.
3.34. Consider a CONFERENCE_REVIEW database in which researchers submit
their research papers for consideration. Reviews by reviewers are recorded
for use in the paper selection process. The database system caters primarily
to reviewers who record answers to evaluation questions for each paper they
review and make recommendations regarding whether to accept or reject
the paper. The data requirements are summarized as follows:
■ Authors of papers are uniquely identified by e-mail id. First and last names
are also recorded.
■ Each paper is assigned a unique identifier by the system and is described
by a title, abstract, and the name of the electronic file containing the paper.
■ A paper may have multiple authors, but one of the authors is designated as
the contact author.
■ Reviewers of papers are uniquely identified by e-mail address. Each re-
viewer’s first name, last name, phone number, affiliation, and topics of in-
terest are also recorded.
■ Each paper is assigned between two and four reviewers. A reviewer rates
each paper assigned to him or her on a scale of 1 to 10 in four categories:
technical merit, readability, originality, and relevance to the conference.
Finally, each reviewer provides an overall recommendation regarding
each paper.
■ Each review contains two types of written comments: one to be seen by
the review committee only and the other as feedback to the author(s).
Design an entity–relationship diagram for the CONFERENCE_REVIEW data-
base and build the design using a data modeling tool such as ERwin or
Rational Rose.
3.35. Consider the ER diagram for the AIRLINE database shown in Figure 3.21.
Build this design using a data modeling tool such as ERwin or Rational Rose.
Selected Bibliography
The entity–relationship model was introduced by Chen (1976), and related work
appears in Schmidt and Swenson (1975), Wiederhold and Elmasri (1979), and
Senko (1975). Since then, numerous modifications to the ER model have been
suggested. We have incorporated some of these in our presentation. Structural
constraints on relationships are discussed in Abrial (1974), Elmasri and Wieder-
hold (1980), and Lenzerini and Santucci (1983). Multivalued and composite attri-
butes are incorporated in the ER model in Elmasri et al. (1985). Although we did
not discuss languages for the ER model and its extensions, there have been several
proposals for such languages. Elmasri and Wiederhold (1981) proposed the
GORDAS query language for the ER model. Another ER query language was pro-
posed by Markowitz and Raz (1983). Senko (1980) presented a query language for
Senko’s DIAM model. A formal set of operations called the ER algebra was
presented by Parent and Spaccapietra (1985). Gogolla and Hohenstein (1991) pre-
sented another formal language for the ER model. Campbell et al. (1985) presented
a set of ER operations and showed that they are relationally complete. A conference
for the dissemination of research results related to the ER model has been held reg-
ularly since 1979. The conference, now known as the International Conference on
Conceptual Modeling, has been held in Los Angeles (ER 1979, ER 1983, ER 1997),
Washington, D.C. (ER 1981), Chicago (ER 1985), Dijon, France (ER 1986), New
York City (ER 1987), Rome (ER 1988), Toronto (ER 1989), Lausanne, Switzerland
(ER 1990), San Mateo, California (ER 1991), Karlsruhe, Germany (ER 1992),
Arlington, Texas (ER 1993), Manchester, England (ER 1994), Brisbane, Australia
(ER 1995), Cottbus, Germany (ER 1996), Singapore (ER 1998), Paris, France (ER
1999), Salt Lake City, Utah (ER 2000), Yokohama, Japan (ER 2001), Tampere, Fin-
land (ER 2002), Chicago, Illinois (ER 2003), Shanghai, China (ER 2004), Klagen-
furt, Austria (ER 2005), Tucson, Arizona (ER 2006), Auckland, New Zealand (ER
2007), Barcelona, Catalonia, Spain (ER 2008), and Gramado, RS, Brazil (ER 2009).
The 2010 conference was held in Vancouver, British Columbia, Canada (ER 2010),
the 2011 conference in Brussels, Belgium (ER 2011), the 2012 conference in Florence, Italy (ER 2012),
the 2013 conference in Hong Kong, China (ER 2013), and the 2014 conference in Atlanta, Georgia
(ER 2014). The 2015 conference is to be held in Stockholm, Sweden.
Chapter 4 The Enhanced Entity–Relationship (EER) Model
The ER modeling concepts discussed in Chapter 3
are sufficient for representing many database sche-
mas for traditional database applications, which include many data-processing
applications in business and industry. Since the late 1970s, however, designers of
database applications have tried to design more accurate database schemas that
reflect the data properties and constraints more precisely. This was particularly
important for newer applications of database technology, such as databases for
engineering design and manufacturing (CAD/CAM),1 telecommunications, com-
plex software systems, and geographic information systems (GISs), among many
other applications. These types of databases have requirements that are more com-
plex than the more traditional applications. This led to the development of addi-
tional semantic data modeling concepts that were incorporated into conceptual
data models such as the ER model. Various semantic data models have been pro-
posed in the literature. Many of these concepts were also developed independently
in related areas of computer science, such as the knowledge representation area of
artificial intelligence and the object modeling area in software engineering.
In this chapter, we describe features that have been proposed for semantic data
models and show how the ER model can be enhanced to include these concepts,
which leads to the enhanced ER (EER) model.2 We start in Section 4.1 by incorpo-
rating the concepts of class/subclass relationships and type inheritance into the ER
model. Then, in Section 4.2, we add the concepts of specialization and generalization.
Section 4.3 discusses the various types of constraints on specialization/generalization,
and Section 4.4 shows how the UNION construct can be modeled by including the
1CAD/CAM stands for computer-aided design/computer-aided manufacturing.
2EER has also been used to stand for extended ER model.
concept of category in the EER model. Section 4.5 gives a sample UNIVERSITY
database schema in the EER model and summarizes the EER model concepts by
giving formal definitions. We will use the terms object and entity interchangeably
in this chapter, because many of these concepts are commonly used in object-
oriented models.
We present the UML class diagram notation for representing specialization and
generalization in Section 4.6, and we briefly compare these with EER notation and
concepts. This serves as an example of alternative notation, and is a continuation
of Section 3.8, which presented basic UML class diagram notation that corre-
sponds to the basic ER model. In Section 4.7, we discuss the fundamental abstrac-
tions that are used as the basis of many semantic data models. Section 4.8
summarizes the chapter.
For a detailed introduction to conceptual modeling, Chapter 4 should be consid-
ered a continuation of Chapter 3. However, if only a basic introduction to ER mod-
eling is desired, this chapter may be omitted. Alternatively, the reader may choose
to skip some or all of the later sections of this chapter (Sections 4.4 through 4.8).
4.1 Subclasses, Superclasses, and Inheritance
The EER model includes all the modeling concepts of the ER model that were pre-
sented in Chapter 3. In addition, it includes the concepts of subclass and superclass
and the related concepts of specialization and generalization (see Sections 4.2
and 4.3). Another concept included in the EER model is that of a category or union
type (see Section 4.4), which is used to represent a collection of objects (entities)
that is the union of objects of different entity types. Associated with these concepts
is the important mechanism of attribute and relationship inheritance. Unfortu-
nately, no standard terminology exists for these concepts, so we use the most com-
mon terminology. Alternative terminology is given in footnotes. We also describe a
diagrammatic technique for displaying these concepts when they arise in an EER
schema. We call the resulting schema diagrams enhanced ER or EER diagrams.
The first enhanced ER (EER) model concept we take up is that of a subtype or
subclass of an entity type. As we discussed in Chapter 3, the name of an entity type is
used to represent both a type of entity and the entity set or collection of entities of that
type that exist in the database. For example, the entity type EMPLOYEE describes the
type (that is, the attributes and relationships) of each employee entity, and also refers
to the current set of EMPLOYEE entities in the COMPANY database. In many cases an
entity type has numerous subgroupings or subtypes of its entities that are meaningful
and need to be represented explicitly because of their significance to the database
application. For example, the entities that are members of the EMPLOYEE entity
type may be distinguished further into SECRETARY, ENGINEER, MANAGER,
TECHNICIAN, SALARIED_EMPLOYEE, HOURLY_EMPLOYEE, and so on. The set or
collection of entities in each of the latter groupings is a subset of the entities that
belong to the EMPLOYEE entity set, meaning that every entity that is a member of
one of these subgroupings is also an employee. We call each of these subgroupings a
subclass or subtype of the EMPLOYEE entity type, and the EMPLOYEE entity type is
called the superclass or supertype for each of these subclasses. Figure 4.1 shows how
to represent these concepts diagrammatically in EER diagrams. (The circle notation in
Figure 4.1 will be explained in Section 4.2.)
We call the relationship between a superclass and any one of its subclasses a
superclass/subclass or supertype/subtype or simply class/subclass relationship.3
In our previous example, EMPLOYEE/SECRETARY and EMPLOYEE/TECHNICIAN
are two class/subclass relationships. Notice that a member entity of the subclass
represents the same real-world entity as some member of the superclass; for
example, a SECRETARY entity ‘Joan Logano’ is also the EMPLOYEE ‘Joan Logano.’
Hence, the subclass member is the same as the entity in the superclass, but in a
distinct specific role. When we implement a superclass/subclass relationship in
the database system, however, we may represent a member of the subclass as a
distinct database object—say, a distinct record that is related via the key attribute
to its superclass entity. In Section 9.2, we discuss various options for representing
superclass/subclass relationships in relational databases.
An entity cannot exist in the database merely by being a member of a subclass; it
must also be a member of the superclass. Such an entity can be included optionally
3A class/subclass relationship is often called an IS-A (or IS-AN) relationship because of the way we
refer to the concept. We say a SECRETARY is an EMPLOYEE, a TECHNICIAN is an EMPLOYEE, and
so on.
Figure 4.1
EER diagram notation to represent subclasses and specialization. EMPLOYEE (attributes Ssn, Name (Fname, Minit, Lname), Birth_date, Address) has three specializations: {SECRETARY, TECHNICIAN, ENGINEER}, {MANAGER}, and {HOURLY_EMPLOYEE, SALARIED_EMPLOYEE}. Other labels shown: Typing_speed, Tgrade, Eng_type, Pay_scale, Salary, MANAGES, PROJECT, BELONGS_TO, TRADE_UNION.
as a member of any number of subclasses. For example, a salaried employee who is
also an engineer belongs to the two subclasses ENGINEER and SALARIED_EMPLOYEE
of the EMPLOYEE entity type. However, it is not necessary that every entity in a
superclass is a member of some subclass.
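The membership rules just stated can be sketched with plain sets. The following Python fragment is only an illustration: the entity identifiers (e1, e2, and so on) and the particular memberships are made up for the example and are not taken from the COMPANY database.

```python
# Illustrative entity sets; identifiers and memberships are hypothetical.
employee = {"e1", "e2", "e3", "e4"}      # superclass entity set
engineer = {"e1", "e2"}                  # subclass entity set
salaried_employee = {"e1", "e3"}         # subclass from another specialization

subclasses = {"ENGINEER": engineer, "SALARIED_EMPLOYEE": salaried_employee}


def is_valid_subclass(subclass_set, superclass_set):
    """A subclass entity set must be a subset of the superclass entity set."""
    return subclass_set <= superclass_set


def subclasses_of(entity, subclasses):
    """Return the sorted names of all subclasses that contain the entity."""
    return sorted(name for name, members in subclasses.items() if entity in members)


# Every subclass member is also a member of the superclass.
all_valid = all(is_valid_subclass(s, employee) for s in subclasses.values())
```

In this sketch, e1 belongs to both subclasses, whereas e4 is an employee that belongs to neither, which the model allows.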
An important concept associated with subclasses (subtypes) is that of type
inheritance. Recall that the type of an entity is defined by the attributes it possesses
and the relationship types in which it participates. Because an entity in the subclass
represents the same real-world entity from the superclass, it should possess values
for its specific attributes as well as values of its attributes as a member of the super-
class. We say that an entity that is a member of a subclass inherits all the attributes of
the entity as a member of the superclass. The entity also inherits all the relationships
in which the superclass participates. Notice that a subclass, with its own specific (or
local) attributes and relationships together with all the attributes and relationships it
inherits from the superclass, can be considered an entity type in its own right.4
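Type inheritance has a rough analogue in class inheritance in programming languages. The sketch below uses Python dataclasses purely as an analogy; the attribute values are invented, and, as the footnote to this section points out, an object in such a language usually has a single type, so this does not capture an entity that belongs to several subclasses at once.

```python
from dataclasses import dataclass, fields


@dataclass
class Employee:
    # Attributes of the superclass (a subset of those in Figure 4.1).
    ssn: str
    name: str
    birth_date: str
    address: str


@dataclass
class Secretary(Employee):
    # Specific (local) attribute of the subclass.
    typing_speed: int


def attribute_names(entity_type):
    """List inherited attributes first, then the subclass's own attributes."""
    return [f.name for f in fields(entity_type)]


# Hypothetical values, for illustration only.
joan = Secretary(
    ssn="000-00-0000",
    name="Joan Logano",
    birth_date="1980-01-01",
    address="unknown",
    typing_speed=70,
)
```

Here the Secretary object carries the four inherited attributes together with its own Typing_speed, and it is also an instance of Employee.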
4.2 Specialization and Generalization
4.2.1 Specialization
Specialization is the process of defining a set of subclasses of an entity type; this
entity type is called the superclass of the specialization. The set of subclasses that
forms a specialization is defined on the basis of some distinguishing characteristic
of the entities in the superclass. For example, the set of subclasses {SECRETARY,
ENGINEER, TECHNICIAN} is a specialization of the superclass EMPLOYEE that dis-
tinguishes among employee entities based on the job type of each employee.
We may have several specializations of the same entity type based on different
distinguishing characteristics. For example, another specialization of the
EMPLOYEE entity type may yield the set of subclasses {SALARIED_EMPLOYEE,
HOURLY_EMPLOYEE}; this specialization distinguishes among employees based on
the method of pay.
Figure 4.1 shows how we represent a specialization diagrammatically in an EER
diagram. The subclasses that define a specialization are attached by lines to a circle
that represents the specialization, which is connected in turn to the superclass. The
subset symbol on each line connecting a subclass to the circle indicates the direction
of the superclass/subclass relationship.5 Attributes that apply only to entities of a
particular subclass, such as Typing_speed of SECRETARY, are attached to the rectangle
representing that subclass. These are called specific (or local) attributes of
the subclass. Similarly, a subclass can participate in specific relationship types,
such as the HOURLY_EMPLOYEE subclass participating in the BELONGS_TO
4In some object-oriented programming languages, a common restriction is that an entity (or object) has
only one type. This is generally too restrictive for conceptual database modeling.
5There are many alternative notations for specialization; we present the UML notation in Section 4.6 and
other proposed notations in Appendix A.
relationship in Figure 4.1. We will explain the d symbol in the circles in Figure 4.1
and additional EER diagram notation shortly.
Figure 4.2 shows a few entity instances that belong to subclasses of the {SECRETARY,
ENGINEER, TECHNICIAN} specialization. Again, notice that an entity that belongs to
a subclass represents the same real-world entity as the entity connected to it in the
EMPLOYEE superclass, even though the same entity is shown twice; for example, e1
is shown in both EMPLOYEE and SECRETARY in Figure 4.2. As the figure suggests,
a superclass/subclass relationship such as EMPLOYEE/SECRETARY somewhat
resembles a 1:1 relationship at the instance level (see Figure 3.12). The main differ-
ence is that in a 1:1 relationship two distinct entities are related, whereas in a super-
class/subclass relationship the entity in the subclass is the same real-world entity as
the entity in the superclass but is playing a specialized role—for example, an
EMPLOYEE specialized in the role of SECRETARY, or an EMPLOYEE specialized in
the role of TECHNICIAN.
There are two main reasons for including class/subclass relationships and special-
izations. The first is that certain attributes may apply to some but not all entities of
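Figure 4.2
Instances of a specialization. The EMPLOYEE entity set contains e1 through e8; the subclasses SECRETARY, ENGINEER, and TECHNICIAN contain e1, e2, e3, e4, e5, e7, and e8 among them, each connected to the same entity in EMPLOYEE.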
the superclass entity type. A subclass is defined in order to group the entities to
which these attributes apply. The members of the subclass may still share the
majority of their attributes with the other members of the superclass. For example,
in Figure 4.1 the SECRETARY subclass has the specific attribute Typing_speed,
whereas the ENGINEER subclass has the specific attribute Eng_type, but
SECRETARY and ENGINEER share their other inherited attributes from the
EMPLOYEE entity type.
The second reason for using subclasses is that only entities that are members of the
subclass may participate in some relationship types. For example, if only
HOURLY_EMPLOYEE entities can belong to a trade union, we can represent that fact by
creating the subclass HOURLY_EMPLOYEE of EMPLOYEE and relating the subclass
to an entity type TRADE_UNION via the BELONGS_TO relationship type, as illus-
trated in Figure 4.1.
4.2.2 Generalization
We can think of a reverse process of abstraction in which we suppress the differences
among several entity types, identify their common features, and generalize them
into a single superclass of which the original entity types are special subclasses. For
example, consider the entity types CAR and TRUCK shown in Figure 4.3(a). Because
they have several common attributes, they can be generalized into the entity type
VEHICLE, as shown in Figure 4.3(b). Both CAR and TRUCK are now subclasses of the
Figure 4.3
Generalization. (a) Two entity types, CAR and TRUCK. CAR has attributes Vehicle_id, License_plate_no, Price, Max_speed, and No_of_passengers; TRUCK has attributes Vehicle_id, License_plate_no, Price, No_of_axles, and Tonnage.
(b) Generalizing CAR and TRUCK into the superclass VEHICLE. VEHICLE has attributes Vehicle_id, Price, and License_plate_no; CAR keeps No_of_passengers and Max_speed; TRUCK keeps No_of_axles and Tonnage; the specialization circle is marked d.
generalized superclass VEHICLE. We use the term generalization to refer to the pro-
cess of defining a generalized entity type from the given entity types.
Notice that the generalization process can be viewed as being functionally the
inverse of the specialization process; we can view {CAR, TRUCK} as a specialization
of VEHICLE rather than viewing VEHICLE as a generalization of CAR and TRUCK. A
diagrammatic notation to distinguish between generalization and specialization is
used in some design methodologies. An arrow pointing to the generalized super-
class represents a generalization process, whereas arrows pointing to the special-
ized subclasses represent a specialization process. We will not use this notation
because the decision as to which process was followed in a particular situation is
often subjective.
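The generalization of CAR and TRUCK can be sketched as a set computation: the attributes the entity types have in common move up to the superclass, and the remainder stay with each subclass. The attribute names below come from Figure 4.3; the function name generalize is our own, and the fragment is an illustration rather than a design procedure.

```python
# Attribute sets of the two original entity types, as in Figure 4.3(a).
car = {"Vehicle_id", "License_plate_no", "Price", "Max_speed", "No_of_passengers"}
truck = {"Vehicle_id", "License_plate_no", "Price", "No_of_axles", "Tonnage"}


def generalize(*entity_types):
    """Return the common attributes and, per entity type, the leftover ones."""
    common = set.intersection(*entity_types)
    specific = [attrs - common for attrs in entity_types]
    return common, specific


# VEHICLE receives the shared attributes; CAR and TRUCK keep the rest.
vehicle, (car_specific, truck_specific) = generalize(car, truck)
```

The result matches Figure 4.3(b): VEHICLE holds Vehicle_id, Price, and License_plate_no, while each subclass retains only its specific attributes.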
So far we have introduced the concepts of subclasses and superclass/subclass rela-
tionships, as well as the specialization and generalization processes. In general, a
superclass or subclass represents a collection of entities of the same type and hence
also describes an entity type; that is why superclasses and subclasses are all shown in
rectangles in EER diagrams, like entity types.
4.3 Constraints and Characteristics of Specialization and Generalization Hierarchies
First, we discuss constraints that apply to a single specialization or a single general-
ization. For brevity, our discussion refers only to specialization even though it
applies to both specialization and generalization. Then, we discuss differences
between specialization/generalization lattices (multiple inheritance) and hierarchies
(single inheritance), and we elaborate on the differences between the specialization
and generalization processes during conceptual database schema design.
4.3.1 Constraints on Specialization and Generalization
In general, we may have several specializations defined on the same entity type (or
superclass), as shown in Figure 4.1. In such a case, entities may belong to subclasses
in each of the specializations. A specialization may also consist of a single subclass
only, such as the {MANAGER} specialization in Figure 4.1; in such a case, we do not
use the circle notation.
In some specializations we can determine exactly the entities that will become
members of each subclass by placing a condition on the value of some attribute of
the superclass. Such subclasses are called predicate-defined (or condition-defined)
subclasses. For example, if the EMPLOYEE entity type has an attribute Job_type, as
shown in Figure 4.4, we can specify the condition of membership in the
SECRETARY subclass by the condition (Job_type = ‘Secretary’), which we call the
defining predicate of the subclass. This condition is a constraint specifying that
exactly those entities of the EMPLOYEE entity type whose attribute value for Job_type
is ‘Secretary’ belong to the subclass. We display a predicate-defined subclass by
writing the predicate condition next to the line that connects the subclass to the
specialization circle.
If all subclasses in a specialization have their membership condition on the same
attribute of the superclass, the specialization itself is called an attribute-defined
specialization, and the attribute is called the defining attribute of the special-
ization.6 In this case, all the entities with the same value for the attribute belong to
the same subclass. We display an attribute-defined specialization by placing the
defining attribute name next to the arc from the circle to the superclass, as shown
in Figure 4.4.
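An attribute-defined specialization can be sketched as a function that places each entity into the subclass whose defining predicate it satisfies. The predicate values below follow Figure 4.4; the employee records, the Ssn values, and the function name specialize are hypothetical.

```python
# Hypothetical EMPLOYEE entities; only Job_type matters for this sketch.
employees = [
    {"Ssn": "111", "Job_type": "Secretary"},
    {"Ssn": "222", "Job_type": "Engineer"},
    {"Ssn": "333", "Job_type": "Technician"},
    {"Ssn": "444", "Job_type": "Engineer"},
    {"Ssn": "555", "Job_type": "Accountant"},  # satisfies no defining predicate
]

# Subclass name -> required value of the defining attribute (as in Figure 4.4).
defining_values = {
    "SECRETARY": "Secretary",
    "TECHNICIAN": "Technician",
    "ENGINEER": "Engineer",
}


def specialize(entities, defining_attribute, defining_values):
    """Assign each entity to the subclasses whose predicate it satisfies."""
    members = {name: [] for name in defining_values}
    for entity in entities:
        for name, value in defining_values.items():
            if entity[defining_attribute] == value:
                members[name].append(entity["Ssn"])
    return members


members = specialize(employees, "Job_type", defining_values)
```

Because Job_type holds a single value per entity, no entity can land in two subclasses, and the entity with Job_type 'Accountant' lands in none.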
When we do not have a condition for determining membership in a subclass, the
subclass is called user-defined. Membership in such a subclass is determined by the
database users when they apply the operation to add an entity to the subclass; hence,
membership is specified individually for each entity by the user, not by any condi-
tion that may be evaluated automatically.
Two other constraints may apply to a specialization. The first is the disjointness
constraint, which specifies that the subclasses of the specialization must be disjoint
sets. This means that an entity can be a member of at most one of the subclasses of
the specialization. A specialization that is attribute-defined implies the disjointness
constraint (if the attribute used to define the membership predicate is single-
valued). Figure 4.4 illustrates this case, where the d in the circle stands for disjoint. The
d notation also applies to user-defined subclasses of a specialization that must be
disjoint, as illustrated by the specialization {HOURLY_EMPLOYEE, SALARIED_EMPLOYEE}
in Figure 4.1. If the subclasses are not constrained to be disjoint, their sets of entities
6Such an attribute is called a discriminator or discriminating attribute in UML terminology.
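Figure 4.4
EER diagram notation for an attribute-defined specialization on Job_type. EMPLOYEE (attributes Ssn, Name (Fname, Minit, Lname), Birth_date, Address, Job_type) is specialized through a circle marked d, with defining attribute Job_type, into SECRETARY (‘Secretary’), TECHNICIAN (‘Technician’), and ENGINEER (‘Engineer’). Subclass attribute labels shown: Typing_speed, Tgrade, Eng_type.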
may be overlapping; that is, the same (real-world) entity may be a member of more
than one subclass of the specialization. This case, which is the default, is displayed
by placing an o in the circle, as shown in Figure 4.5.
The second constraint on specialization is called the completeness (or totalness)
constraint, which may be total or partial. A total specialization constraint specifies
that every entity in the superclass must be a member of at least one subclass
in the specialization. For example, if every EMPLOYEE must be either an
HOURLY_EMPLOYEE or a SALARIED_EMPLOYEE, then the specialization
{HOURLY_EMPLOYEE, SALARIED_EMPLOYEE} in Figure 4.1 is a total specialization
of EMPLOYEE. This is shown in EER diagrams by using a double line to connect
the superclass to the circle. A single line is used to display a partial specialization,
which allows an entity not to belong to any of the subclasses. For example, if some
EMPLOYEE entities do not belong to any of the subclasses {SECRETARY, ENGINEER,
TECHNICIAN} in Figures 4.1 and 4.4, then that specialization is partial.7
Notice that the disjointness and completeness constraints are independent. Hence,
we have the following four possible constraints on a specialization:
■ Disjoint, total
■ Disjoint, partial
■ Overlapping, total
■ Overlapping, partial
Of course, the correct constraint is determined from the real-world meaning that
applies to each specialization. In general, a superclass that was identified through
the generalization process usually is total, because the superclass is derived from the
subclasses and hence contains only the entities that are in the subclasses.
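The two constraints are declared on the schema, but it can help to see what they demand of a database state. The following sketch checks whether a given state is consistent with a disjoint or a total specialization; the entity identifiers and memberships are made up, and the function names are our own.

```python
from itertools import combinations


def is_disjoint(subclass_sets):
    """True if no entity appears in more than one subclass."""
    return all(a.isdisjoint(b) for a, b in combinations(subclass_sets, 2))


def is_total(superclass_set, subclass_sets):
    """True if every superclass entity appears in at least one subclass."""
    return superclass_set <= set().union(*subclass_sets)


def describe_state(superclass_set, subclass_sets):
    """Name the strictest of the four combinations that this state satisfies."""
    disjointness = "disjoint" if is_disjoint(subclass_sets) else "overlapping"
    completeness = "total" if is_total(superclass_set, subclass_sets) else "partial"
    return f"{disjointness}, {completeness}"


# Hypothetical states; the identifiers are invented for illustration.
employees = {"e1", "e2", "e3", "e4"}
hourly, salaried = {"e1", "e2"}, {"e3", "e4"}
parts = {"p1", "p2", "p3"}
manufactured, purchased = {"p1", "p2"}, {"p2"}

pay_method_state = describe_state(employees, [hourly, salaried])
part_state = describe_state(parts, [manufactured, purchased])
```

A state that happens to be disjoint does not by itself show that the schema requires disjointness; the check only tells us which declared constraints the state would not violate.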
Certain insertion and deletion rules apply to specialization (and generalization) as a
consequence of the constraints specified earlier. Some of these rules are as follows:
■ Deleting an entity from a superclass implies that it is automatically deleted
from all the subclasses to which it belongs.
7The notation of using single or double lines is similar to that for partial or total participation of an entity
type in a relationship type, as described in Chapter 3.
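Figure 4.5
EER diagram notation for an overlapping (nondisjoint) specialization. PART (attributes Part_no, Description) is specialized through a circle marked o into MANUFACTURED_PART and PURCHASED_PART. Subclass attribute labels shown: Manufacture_date, Drawing_no, Batch_no, Supplier_name, List_price.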
■ Inserting an entity in a superclass implies that the entity is mandatorily
inserted in all predicate-defined (or attribute-defined) subclasses for which
the entity satisfies the defining predicate.
■ Inserting an entity in a superclass of a total specialization implies that
the entity is mandatorily inserted in at least one of the subclasses of the
specialization.
The reader is encouraged to make a complete list of rules for insertions and dele-
tions for the various types of specializations.
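The three rules listed above can be sketched for the special case of predicate-defined subclasses. The class below is a minimal illustration, not a DBMS mechanism: the class name Specialization, its methods, and the sample entities are all hypothetical.

```python
class Specialization:
    """A superclass entity set with predicate-defined subclasses (a sketch)."""

    def __init__(self, predicates, total=False):
        self.superclass = {}            # key -> entity (a dict of attributes)
        self.predicates = predicates    # subclass name -> defining predicate
        self.total = total              # True for a total specialization
        self.subclasses = {name: set() for name in predicates}

    def insert(self, key, entity):
        # Insertion places the entity in every subclass whose predicate holds.
        matching = [n for n, p in self.predicates.items() if p(entity)]
        # A total specialization requires at least one subclass.
        if self.total and not matching:
            raise ValueError("entity must belong to at least one subclass")
        self.superclass[key] = entity
        for name in matching:
            self.subclasses[name].add(key)

    def delete(self, key):
        # Deleting from the superclass removes the entity from all subclasses.
        del self.superclass[key]
        for members in self.subclasses.values():
            members.discard(key)


spec = Specialization(
    {
        "SECRETARY": lambda e: e["Job_type"] == "Secretary",
        "ENGINEER": lambda e: e["Job_type"] == "Engineer",
    }
)
spec.insert("111", {"Job_type": "Secretary"})
spec.insert("222", {"Job_type": "Manager"})   # partial: belongs to no subclass
```

User-defined subclasses would need an explicit operation for adding an entity to a subclass, which this sketch omits.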
4.3.2 Specialization and Generalization Hierarchies and Lattices
A subclass itself may have further subclasses specified on it, forming a hierarchy or
a lattice of specializations. For example, in Figure 4.6 ENGINEER is a subclass of
EMPLOYEE and is also a superclass of ENGINEERING_MANAGER; this represents the
real-world constraint that every engineering manager is required to be an engineer.
A specialization hierarchy has the constraint that every subclass participates as a
subclass in only one class/subclass relationship; that is, each subclass has only one
parent, which results in a tree structure or strict hierarchy. In contrast, for a
specialization lattice, a subclass can be a subclass in more than one class/subclass
relationship. Hence, Figure 4.6 is a lattice.
Figure 4.7 shows another specialization lattice of more than one level. This may
be part of a conceptual schema for a UNIVERSITY database. Notice that this
arrangement would have been a hierarchy except for the STUDENT_ASSISTANT
subclass, which is a subclass in two distinct class/subclass relationships.
Figure 4.6
A specialization lattice with shared subclass ENGINEERING_MANAGER. Classes shown: EMPLOYEE, SECRETARY, TECHNICIAN, ENGINEER, MANAGER, HOURLY_EMPLOYEE, SALARIED_EMPLOYEE, ENGINEERING_MANAGER. Two specialization circles are marked d.
The requirements for the part of the UNIVERSITY database shown in Figure 4.7
are the following:
1. The database keeps track of three types of persons: employees, alumni, and
students. A person can belong to one, two, or all three of these types. Each
person has a name, SSN, sex, address, and birth date.
2. Every employee has a salary, and there are three types of employees: fac-
ulty, staff, and student assistants. Each employee belongs to exactly one
of these types. For each alumnus, a record of the degree or degrees that
he or she earned at the university is kept, including the name of the
degree, the year granted, and the major department. Each student has a
major department.
3. Each faculty member has a rank, whereas each staff member has a staff position. Student
assistants are classified further as either research assistants or teaching
assistants, and the percent of time that they work is recorded in the database.
Research assistants have their research project stored, whereas teaching
assistants have the current course they work on.
Figure 4.7
A specialization lattice with multiple inheritance for a UNIVERSITY database. Classes shown: PERSON, EMPLOYEE, ALUMNUS, STUDENT, STAFF, FACULTY, STUDENT_ASSISTANT, GRADUATE_STUDENT, UNDERGRADUATE_STUDENT, RESEARCH_ASSISTANT, TEACHING_ASSISTANT. Attribute labels shown: Ssn, Name, Sex, Address, Birth_date, Salary, Major_dept, Degrees (Degree, Year, Major), Percent_time, Position, Rank, Degree_program, Class, Course, Project. One specialization circle is marked o and three are marked d.
4. Students are further classified as either graduate or undergraduate, with
the specific attributes degree program (M.S., Ph.D., M.B.A., and so on)
for graduate students and class (freshman, sophomore, and so on) for
undergraduates.
In Figure 4.7, all person entities represented in the database are members of
the PERSON entity type, which is specialized into the subclasses {EMPLOYEE,
ALUMNUS, STUDENT}. This specialization is overlapping; for example, an alum-
nus may also be an employee and a student pursuing an advanced degree. The
subclass STUDENT is the superclass for the specialization {GRADUATE_STUDENT,
UNDERGRADUATE_STUDENT}, whereas EMPLOYEE is the superclass for the
specialization {STUDENT_ASSISTANT, FACULTY, STAFF}. Notice that
STUDENT_ASSISTANT is also a subclass of STUDENT. Finally, STUDENT_ASSISTANT
is the superclass for the specialization into {RESEARCH_ASSISTANT,
TEACHING_ASSISTANT}.
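Whether a set of class/subclass relationships forms a hierarchy or a lattice can be checked mechanically: it is a hierarchy exactly when no subclass has more than one parent. The sketch below encodes the relationships just described for Figure 4.7 as (superclass, subclass) pairs; the function names are our own.

```python
# (superclass, subclass) pairs as described in the text for Figure 4.7.
class_subclass = [
    ("PERSON", "EMPLOYEE"),
    ("PERSON", "ALUMNUS"),
    ("PERSON", "STUDENT"),
    ("EMPLOYEE", "STUDENT_ASSISTANT"),
    ("EMPLOYEE", "FACULTY"),
    ("EMPLOYEE", "STAFF"),
    ("STUDENT", "GRADUATE_STUDENT"),
    ("STUDENT", "UNDERGRADUATE_STUDENT"),
    ("STUDENT", "STUDENT_ASSISTANT"),
    ("STUDENT_ASSISTANT", "RESEARCH_ASSISTANT"),
    ("STUDENT_ASSISTANT", "TEACHING_ASSISTANT"),
]


def multi_parent_subclasses(pairs):
    """Return the sorted names of subclasses that have more than one superclass."""
    parents = {}
    for superclass, subclass in pairs:
        parents.setdefault(subclass, set()).add(superclass)
    return sorted(name for name, supers in parents.items() if len(supers) > 1)


def is_hierarchy(pairs):
    """A hierarchy (tree) has no subclass with more than one parent."""
    return not multi_parent_subclasses(pairs)


# Dropping the STUDENT/STUDENT_ASSISTANT relationship leaves a strict hierarchy.
without_shared = [p for p in class_subclass if p != ("STUDENT", "STUDENT_ASSISTANT")]
```

Only STUDENT_ASSISTANT has two parents, which is what makes the arrangement in Figure 4.7 a lattice.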
In such a specialization lattice or hierarchy, a subclass inherits the attributes not
only of its direct superclass, but also of all its predecessor superclasses all the way to
the root of the hierarchy or lattice if necessary. For example, an entity in
GRADUATE_STUDENT inherits all the attributes of that entity as a STUDENT and as a
PERSON. Notice that an entity may exist in several leaf nodes of the hierarchy,
where a leaf node is a class that has no subclasses of its own. For example, a member
of GRADUATE_STUDENT may also be a member of RESEARCH_ASSISTANT.
A subclass with more than one superclass is called a shared subclass, such as
ENGINEERING_MANAGER in Figure 4.6. This leads to the concept known as
multiple inheritance, where the shared subclass ENGINEERING_MANAGER
directly inherits attributes and relationships from multiple superclasses. Notice
that the existence of at least one shared subclass leads to a lattice (and hence to
multiple inheritance); if no shared subclasses existed, we would have a hierarchy
rather than a lattice and only single inheritance would exist. An important rule
related to multiple inheritance can be illustrated by the example of the shared
subclass STUDENT_ASSISTANT in Figure 4.7, which inherits attributes from
both EMPLOYEE and STUDENT. Here, both EMPLOYEE and STUDENT inherit the
same attributes from PERSON. The rule states that if an attribute (or relation-
ship) originating in the same superclass (PERSON) is inherited more than once
via different paths (EMPLOYEE and STUDENT) in the lattice, then it should be
included only once in the shared subclass (STUDENT_ASSISTANT). Hence, the
attributes of PERSON are inherited only once in the STUDENT_ASSISTANT sub-
class in Figure 4.7.
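The inherit-once rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionaries SUPERCLASSES and LOCAL_ATTRIBUTES and the function inherited_attributes are hypothetical names, and the attribute lists only loosely follow Figure 4.7.

```python
# Hypothetical encoding of part of the Figure 4.7 lattice:
# each class maps to its direct superclasses.
SUPERCLASSES = {
    "PERSON": [],
    "EMPLOYEE": ["PERSON"],
    "STUDENT": ["PERSON"],
    "STUDENT_ASSISTANT": ["EMPLOYEE", "STUDENT"],
}

# Local (specific) attributes of each class; illustrative assumptions only.
LOCAL_ATTRIBUTES = {
    "PERSON": ["Ssn", "Name", "Birth_date"],
    "EMPLOYEE": ["Salary"],
    "STUDENT": ["Major_dept"],
    "STUDENT_ASSISTANT": ["Percent_time"],
}


def inherited_attributes(cls):
    """Collect local and inherited attributes, including each one only once."""
    seen_classes = set()
    attributes = []

    def visit(name):
        if name in seen_classes:  # already reached via another path in the lattice
            return
        seen_classes.add(name)
        for parent in SUPERCLASSES[name]:
            visit(parent)
        attributes.extend(LOCAL_ATTRIBUTES[name])

    visit(cls)
    return attributes


attrs = inherited_attributes("STUDENT_ASSISTANT")
print(attrs)
```

Although PERSON is reachable through both EMPLOYEE and STUDENT, the set of visited classes ensures that its attributes are added a single time.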
It is important to note here that some models and languages are limited to single
inheritance and do not allow multiple inheritance (shared subclasses). It is also
important to note that some models do not allow an entity to have multiple
types, and hence an entity can be a member of only one leaf class.8 In such a
model, it is necessary to create additional subclasses as leaf nodes to cover all
possible combinations of classes that may have some entity that belongs to all
these classes simultaneously. For example, in the overlapping specialization of
PERSON into {EMPLOYEE, ALUMNUS, STUDENT} (or {E, A, S} for short), it would
be necessary to create seven subclasses of PERSON in order to cover all possible
types of entities: E, A, S, E_A, E_S, A_S, and E_A_S. Obviously, this can lead to
extra complexity.
8 In some models, the class is further restricted to be a leaf node in the hierarchy or lattice.
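The count of seven is simply the number of nonempty subsets of {E, A, S}. The following sketch enumerates them; the variable names are our own, and the underscore naming mirrors the labels used above.

```python
from itertools import combinations

overlapping = ["E", "A", "S"]  # EMPLOYEE, ALUMNUS, STUDENT

# One leaf class per nonempty combination of the overlapping subclasses.
leaf_classes = [
    "_".join(combo)
    for size in range(1, len(overlapping) + 1)
    for combo in combinations(overlapping, size)
]

print(leaf_classes)
print(len(leaf_classes))  # 2**3 - 1 combinations
```

In general, n overlapping subclasses would require 2^n - 1 leaf classes in such a model, which is why the extra complexity grows quickly.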
Although we have used specialization to illustrate our discussion, similar concepts
apply equally to generalization, as we mentioned at the beginning of this section.
Hence, we can also speak of generalization hierarchies and generalization lattices.
4.3.3 Utilizing Specialization and Generalization in Refining Conceptual Schemas
Now we elaborate on the differences between the specialization and generalization
processes and how they are used to refine conceptual schemas during conceptual
database design. In the specialization process, the database designers typically start
with an entity type and then define subclasses of the entity type by successive spe-
cialization; that is, they repeatedly define more specific groupings of the entity
type. For example, when designing the specialization lattice in Figure 4.7, we may
first specify an entity type PERSON for a university database. Then we discover
that three types of persons will be represented in the database: university
employees, alumni, and students. We therefore create the specialization {EMPLOYEE, ALUMNUS,
STUDENT}. The overlapping constraint is chosen because a person may belong
to more than one of the subclasses. We specialize EMPLOYEE further into
{STAFF, FACULTY, STUDENT_ASSISTANT}, and specialize STUDENT into
{GRADUATE_STUDENT, UNDERGRADUATE_STUDENT}. Finally, we specialize
STUDENT_ASSISTANT into {RESEARCH_ASSISTANT, TEACHING_ASSISTANT}.
This process is called top-down conceptual refinement. So far, we have a hier-
archy; then we realize that STUDENT_ASSISTANT is a shared subclass, since it is
also a subclass of STUDENT, leading to the lattice.
It is possible to arrive at the same hierarchy or lattice from the other direction. In
such a case, the process involves generalization rather than specialization and cor-
responds to a bottom-up conceptual synthesis. For example, the database design-
ers may first discover entity types such as STAFF, FACULTY, ALUMNUS,
GRADUATE_STUDENT, UNDERGRADUATE_STUDENT, RESEARCH_ASSISTANT,
TEACHING_ASSISTANT, and so on; then they generalize {GRADUATE_STUDENT,
UNDERGRADUATE_STUDENT} into STUDENT; then {RESEARCH_ASSISTANT,
TEACHING_ASSISTANT} into STUDENT_ASSISTANT; then {STAFF, FACULTY,
STUDENT_ASSISTANT} into EMPLOYEE; and finally {EMPLOYEE, ALUMNUS, STUDENT}
into PERSON.
The final design of hierarchies or lattices resulting from either process may be
identical; the only difference relates to the manner or order in which the schema
superclasses and subclasses were created during the design process. In practice, it
is likely that a combination of the two processes is employed. Notice that the
notion of representing data and knowledge by using superclass/subclass hierar-
chies and lattices is quite common in knowledge-based systems and expert sys-
tems, which combine database technology with artificial intelligence techniques.
For example, frame-based knowledge representation schemes closely resemble
class hierarchies. Specialization is also common in software engineering design
methodologies that are based on the object-oriented paradigm.
4.4 Modeling of UNION Types Using Categories
It is sometimes necessary to represent a collection of entities from different entity
types. In this case, a subclass will represent a collection of entities that is a subset of
the UNION of entities from distinct entity types; we call such a subclass a union type
or a category.9
For example, suppose that we have three entity types: PERSON, BANK, and
COMPANY. In a database for motor vehicle registration, an owner of a vehicle can
be a person, a bank (holding a lien on a vehicle), or a company. We need to create
a class (collection of entities) that includes entities of all three types to play the
role of vehicle owner. A category (union type) OWNER that is a subclass of the
UNION of the three entity sets of COMPANY, BANK, and PERSON can be created
for this purpose. We display categories in an EER diagram as shown in Figure 4.8.
The superclasses COMPANY, BANK, and PERSON are connected to the circle with
the ∪ symbol, which stands for the set union operation. An arc with the subset
symbol connects the circle to the (subclass) OWNER category. In Figure 4.8 we
have two categories: OWNER, which is a subclass (subset) of the union of PERSON,
BANK, and COMPANY; and REGISTERED_VEHICLE, which is a subclass (subset) of
the union of CAR and TRUCK.
A category has two or more superclasses that may represent collections of enti-
ties from distinct entity types, whereas other superclass/subclass relationships
always have a single superclass. To better understand the difference,
we can compare a category, such as OWNER in Figure 4.8, with the
ENGINEERING_MANAGER shared subclass in Figure 4.6. The latter is a subclass of
each of the three superclasses ENGINEER, MANAGER, and SALARIED_EMPLOYEE,
so an entity that is a member of ENGINEERING_MANAGER must exist in all
three collections. This represents the constraint that an engineering manager must
be an ENGINEER, a MANAGER, and a SALARIED_EMPLOYEE; that is, the
ENGINEERING_MANAGER entity set is a subset of the intersection of the three
entity sets. On the other hand, a category is a subset of the union of its super-
classes. Hence, an entity that is a member of OWNER must exist in only one of the
superclasses. This represents the constraint that an OWNER may be a COMPANY,
a BANK, or a PERSON in Figure 4.8.
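The contrast between the two constraints can be expressed with ordinary set operations. In this sketch all entity identifiers are invented for illustration, and the two helper functions are hypothetical names.

```python
# Hypothetical entity identifiers; none of these values come from the figures.
engineers = {"emp1", "emp2", "emp3"}
managers = {"emp2", "emp3", "emp4"}
salaried_employees = {"emp2", "emp3", "emp5"}

persons = {"per1", "per2"}
banks = {"bank1"}
companies = {"co1", "co2"}


def valid_shared_subclass(members, superclasses):
    """A shared subclass must be a subset of the INTERSECTION of its superclasses."""
    return members <= set.intersection(*superclasses)


def valid_category(members, superclasses):
    """A category must be a subset of the UNION of its superclasses."""
    return members <= set.union(*superclasses)


engineering_managers = {"emp2"}
owners = {"per1", "bank1", "co2"}

shared_ok = valid_shared_subclass(
    engineering_managers, [engineers, managers, salaried_employees]
)
category_ok = valid_category(owners, [persons, banks, companies])
# The same owner set fails the stricter intersection test.
owners_as_shared = valid_shared_subclass(owners, [persons, banks, companies])

print(shared_ok, category_ok, owners_as_shared)
```

Each owner belongs to exactly one of the three superclasses, so the union test succeeds while the intersection test does not.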
9 Our use of the term category is based on the ECR (entity–category–relationship) model (Elmasri et al.,
1985).
Attribute inheritance works more selectively in the case of categories. For exam-
ple, in Figure 4.8 each OWNER entity inherits the attributes of a COMPANY, a
PERSON, or a BANK, depending on the superclass to which the entity belongs. On
the other hand, a shared subclass such as ENGINEERING_MANAGER (Figure 4.6)
inherits all the attributes of its superclasses SALARIED_EMPLOYEE, ENGINEER,
and MANAGER.
It is interesting to note the difference between the category REGISTERED_VEHICLE
(Figure 4.8) and the generalized superclass VEHICLE (Figure 4.3(b)). In Figure
4.3(b), every car and every truck is a VEHICLE; but in Figure 4.8, the
REGISTERED_VEHICLE category includes some cars and some trucks but not necessarily
all of them (for example, some cars or trucks may not be registered). In general,
a specialization or generalization such as that in Figure 4.3(b), if it were partial,
would not preclude VEHICLE from containing other types of entities, such as
motorcycles. However, a category such as REGISTERED_VEHICLE in Figure 4.8
implies that only cars and trucks, but not other types of entities, can be members
of REGISTERED_VEHICLE.
[Figure 4.8: Two categories (union types): OWNER and REGISTERED_VEHICLE.]
A category can be total or partial. A total category holds the union of all entities in
its superclasses, whereas a partial category can hold a subset of the union. A total
category is represented diagrammatically by a double line connecting the category
and the circle, whereas a partial category is indicated by a single line.
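As a small illustration of the total/partial distinction, the following sketch classifies a category by comparing it with the union of its superclasses. The vehicle identifiers and the function category_kind are hypothetical.

```python
cars = {"car1", "car2"}      # hypothetical CAR entities
trucks = {"truck1"}          # hypothetical TRUCK entities


def category_kind(category, superclasses):
    """Return 'total' if the category equals the union of its superclasses, else 'partial'."""
    union = set().union(*superclasses)
    if not category <= union:
        raise ValueError("a category must be a subset of the union of its superclasses")
    return "total" if category == union else "partial"


registered = {"car1", "truck1"}  # car2 is not registered
kind_registered = category_kind(registered, [cars, trucks])
kind_all = category_kind(cars | trucks, [cars, trucks])

print(kind_registered, kind_all)
```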
The superclasses of a category may have different key attributes, as demonstrated
by the OWNER category in Figure 4.8, or they may have the same key attribute, as
demonstrated by the REGISTERED_VEHICLE category. Notice that if a category is
total (not partial), it may be represented alternatively as a total specialization (or a
total generalization). In this case, the choice of which representation to use is sub-
jective. If the two classes represent the same type of entities and share numerous
attributes, including the same key attributes, specialization/generalization is pre-
ferred; otherwise, categorization (union type) is more appropriate.
It is important to note that some modeling methodologies do not have union
types. In these models, a union type must be represented in a roundabout way
(see Section 9.2).
4.5 A Sample UNIVERSITY EER Schema, Design Choices, and Formal Definitions
In this section, we first give an example of a database schema in the EER model to
illustrate the use of the various concepts discussed here and in Chapter 3. Then, we
discuss design choices for conceptual schemas, and finally we summarize the EER
model concepts and define them formally in the same manner in which we formally
defined the concepts of the basic ER model in Chapter 3.
4.5.1 A Different UNIVERSITY Database Example
Consider a UNIVERSITY database that has different requirements from the UNIVERSITY
database presented in Section 3.10. This database keeps track of students and their
majors, transcripts, and registration as well as of the university’s course offerings.
The database also keeps track of the sponsored research projects of faculty and
graduate students. This schema is shown in Figure 4.9. A discussion of the require-
ments that led to this schema follows.
For each person, the database maintains information on the person’s name [Name],
Social Security number [Ssn], address [Address], sex [Sex], and birth date [Bdate].
Two subclasses of the PERSON entity type are identified: FACULTY and STUDENT.
Specific attributes of FACULTY are rank [Rank] (assistant, associate, adjunct, research,
visiting, and so on), office [Foffice], office phone [Fphone], and salary [Salary]. All
faculty members are related to the academic department(s) with which they are affiliated
[BELONGS] (a faculty member can be associated with several departments, so the
relationship is M:N). A specific attribute of STUDENT is [Class] (freshman = 1,
sophomore = 2, … , MS student = 5, PhD student = 6). Each STUDENT is also related to his
or her major and minor departments (if known) [MAJOR] and [MINOR], to the course
sections he or she is currently attending [REGISTERED], and to the courses completed
[TRANSCRIPT]. Each TRANSCRIPT instance includes the grade the student received
[Grade] in a section of a course.
[Figure 4.9: An EER conceptual schema for a different UNIVERSITY database.]
GRAD_STUDENT is a subclass of STUDENT, with the defining predicate (Class = 5 OR
Class = 6). For each graduate student, we keep a list of previous degrees in a compos-
ite, multivalued attribute [Degrees]. We also relate the graduate student to a faculty
advisor [ADVISOR] and to a thesis committee [COMMITTEE], if one exists.
An academic department has the attributes name [Dname], telephone [Dphone], and
office number [Office] and is related to the faculty member who is its chairperson
[CHAIRS] and to the college to which it belongs [CD]. Each college has attributes col-
lege name [Cname], office number [Coffice], and the name of its dean [Dean].
A course has attributes course number [C#], course name [Cname], and course
description [Cdesc]. Several sections of each course are offered, with each section
having the attributes section number [Sec#] and the year and quarter in which the
section was offered ([Year] and [Qtr]).10 Section numbers uniquely identify each
section. The sections being offered during the current quarter are in a subclass
CURRENT_SECTION of SECTION, with the defining predicate Qtr = Current_qtr and
Year = Current_year. Each section is related to the instructor who taught or is teach-
ing it ([TEACH]), if that instructor is in the database.
The category INSTRUCTOR_RESEARCHER is a subset of the union of FACULTY and
GRAD_STUDENT and includes all faculty, as well as graduate students who are sup-
ported by teaching or research. Finally, the entity type GRANT keeps track of research
grants and contracts awarded to the university. Each grant has attributes grant title
[Title], grant number [No], the awarding agency [Agency], and the starting date
[St_date]. A grant is related to one principal investigator [PI] and to all researchers it
supports [SUPPORT]. Each instance of support has as attributes the starting date of
support [Start], the ending date of the support (if known) [End], and the percentage of
time being spent on the project [Time] by the researcher being supported.
4.5.2 Design Choices for Specialization/Generalization
It is not always easy to choose the most appropriate conceptual design for a
database application. In Section 3.7.3, we presented some of the typical issues
that confront a database designer when choosing among the concepts of entity
types, relationship types, and attributes to represent a particular miniworld
situation as an ER schema. In this section, we discuss design guidelines and
choices for the EER concepts of specialization/generalization and categories
(union types).
10 We assume that the quarter system rather than the semester system is used in this university.
As we mentioned in Section 3.7.3, conceptual database design should be considered
as an iterative refinement process until the most suitable design is reached. The
following guidelines can help direct the design process for EER concepts:
■ In general, many specializations and subclasses can be defined to make
the conceptual model accurate. However, the drawback is that the
design becomes quite cluttered. It is important to represent only those
subclasses that are deemed necessary to avoid extreme cluttering of the
conceptual schema.
■ If a subclass has few specific (local) attributes and no specific relationships,
it can be merged into the superclass. The specific attributes would hold NULL
values for entities that are not members of the subclass. A type attribute
could specify whether an entity is a member of the subclass.
■ Similarly, if all the subclasses of a specialization/generalization have few spe-
cific attributes and no specific relationships, they can be merged into the
superclass and replaced with one or more type attributes that specify the
subclass or subclasses that each entity belongs to (see Section 9.2 for how
this criterion applies to relational databases).
■ Union types and categories should generally be avoided unless the situation
definitely warrants this type of construct, which does occur in some practi-
cal situations. If possible, we try to model using specialization/generaliza-
tion as discussed at the end of Section 4.4.
■ The choice of disjoint/overlapping and total/partial constraints on special-
ization/generalization is driven by the rules in the miniworld being mod-
eled. If the requirements do not indicate any particular constraints, the
default would generally be overlapping and partial, since this does not spec-
ify any restrictions on subclass membership.
As an example of applying these guidelines, consider Figure 4.6, where no specific
(local) attributes are shown. We could merge all the subclasses into the EMPLOYEE
entity type and add the following attributes to EMPLOYEE:
■ An attribute Job_type whose value set {‘Secretary’, ‘Engineer’, ‘Technician’}
would indicate which subclass in the first specialization each employee
belongs to.
■ An attribute Pay_method whose value set {‘Salaried’, ‘Hourly’} would
indicate which subclass in the second specialization each employee
belongs to.
■ An attribute Is_a_manager whose value set {‘Yes’, ‘No’} would indicate
whether an individual employee entity is a manager or not.
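The merged design can be sketched as a single Python record type. This is a rough illustration under several assumptions: the class Employee and its method subclasses are hypothetical, Job_type is allowed to be absent for employees in none of the job-type subclasses, a Boolean stands in for the {‘Yes’, ‘No’} value set, and membership in ENGINEERING_MANAGER is derived automatically even though the EER model only requires that it be a subset of the intersection.

```python
from dataclasses import dataclass
from typing import Optional

JOB_TYPES = {"Secretary", "Engineer", "Technician"}
PAY_METHODS = {"Salaried", "Hourly"}


@dataclass
class Employee:
    ssn: str
    job_type: Optional[str]  # None: in no job-type subclass (assumed to be allowed)
    pay_method: str
    is_a_manager: bool       # stands in for the {'Yes', 'No'} value set

    def __post_init__(self):
        # Enforce the value sets of the type attributes.
        if self.job_type is not None and self.job_type not in JOB_TYPES:
            raise ValueError(f"unknown Job_type: {self.job_type}")
        if self.pay_method not in PAY_METHODS:
            raise ValueError(f"unknown Pay_method: {self.pay_method}")

    def subclasses(self):
        """Recover the subclass memberships that the type attributes encode."""
        names = set()
        if self.job_type is not None:
            names.add(self.job_type.upper())
        names.add(f"{self.pay_method.upper()}_EMPLOYEE")
        if self.is_a_manager:
            names.add("MANAGER")
        # Assumption: anyone in all three superclasses counts as an engineering manager.
        if {"ENGINEER", "MANAGER", "SALARIED_EMPLOYEE"} <= names:
            names.add("ENGINEERING_MANAGER")
        return names


emp = Employee(ssn="123456789", job_type="Engineer",
               pay_method="Salaried", is_a_manager=True)
print(sorted(emp.subclasses()))
```

The three type attributes carry the same information as the subclass memberships, at the cost of moving the membership rules into validation code.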
4.5.3 Formal Definitions for the EER Model Concepts
We now summarize the EER model concepts and give formal definitions. A class11
defines a type of entity and represents a set or collection of entities of that type; this
includes any of the EER schema constructs that correspond to collections of enti-
ties, such as entity types, subclasses, superclasses, and categories. A subclass S is a
class whose entities must always be a subset of the entities in another class, called
the superclass C of the superclass/subclass (or IS-A) relationship. We denote
such a relationship by C/S. For such a superclass/subclass relationship, we must
always have
S ⊆ C
A specialization Z = {S1, S2, … , Sn} is a set of subclasses that have the same super-
class G; that is, G/Si is a superclass/subclass relationship for i = 1, 2, … , n. G is called
a generalized entity type (or the superclass of the specialization, or a generalization
of the subclasses {S1, S2, … , Sn}). Z is said to be total if we always (at any point in
time) have
$$\bigcup_{i=1}^{n} S_i = G$$
Otherwise, Z is said to be partial. Z is said to be disjoint if we always have
Si ∩ Sj = ∅ (empty set) for i ≠ j
Otherwise, Z is said to be overlapping.
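These definitions translate directly into set tests. The sketch below checks a specialization at one point in time; the function names and the sample entity identifiers are hypothetical.

```python
from itertools import combinations


def is_specialization(G, Z):
    """Every subclass Si in Z must satisfy Si ⊆ G."""
    return all(S <= G for S in Z)


def is_total(G, Z):
    """Z is total if the union of all Si equals G."""
    return set().union(*Z) == G


def is_disjoint(Z):
    """Z is disjoint if Si ∩ Sj is empty for every pair with i ≠ j."""
    return all(Si.isdisjoint(Sj) for Si, Sj in combinations(Z, 2))


# Hypothetical snapshot of PERSON and its subclasses.
person = {"p1", "p2", "p3", "p4", "p5"}
employee, alumnus, student = {"p1", "p2"}, {"p2", "p3"}, {"p2", "p4"}
Z1 = [employee, alumnus, student]

# Hypothetical snapshot of STUDENT and its subclasses.
graduate, undergraduate = {"p2"}, {"p4"}
Z2 = [graduate, undergraduate]

print(is_specialization(person, Z1), is_total(person, Z1), is_disjoint(Z1))
print(is_specialization(student, Z2), is_total(student, Z2), is_disjoint(Z2))
```

Note that the definitions require the conditions to hold at every point in time, whereas this check only inspects a single database state.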
A subclass S of C is said to be predicate-defined if a predicate p on the attributes of
C is used to specify which entities in C are members of S; that is, S = C[p], where
C[p] is the set of entities in C that satisfy p. A subclass that is not defined by a
predicate is called user-defined.
A specialization Z (or generalization G) is said to be attribute-defined if a
predicate (A = ci), where A is an attribute of G and ci is a constant value from
the domain of A, is used to specify membership in each subclass Si in Z. Notice
that if ci ≠ cj for i ≠ j, and A is a single-valued attribute, then the specialization
will be disjoint.
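A minimal sketch of both ideas follows, assuming entities are represented as dictionaries. The function select plays the role of C[p], and attribute_defined builds one subclass per constant ci; both names, and the sample data, are hypothetical.

```python
def select(C, p):
    """Return C[p]: the entities of C that satisfy predicate p."""
    return [e for e in C if p(e)]


# Predicate-defined subclass: GRAD_STUDENT = STUDENT[Class = 5 OR Class = 6].
students = [
    {"Ssn": "111", "Class": 1},
    {"Ssn": "222", "Class": 5},
    {"Ssn": "333", "Class": 6},
]
grad_students = select(students, lambda s: s["Class"] in (5, 6))


def attribute_defined(G, A, constants):
    """Build one subclass per constant c, each defined by the predicate (A = c)."""
    return {c: select(G, lambda e, c=c: e[A] == c) for c in constants}


employees = [
    {"Ssn": "444", "Job_type": "Secretary"},
    {"Ssn": "555", "Job_type": "Engineer"},
    {"Ssn": "666", "Job_type": "Engineer"},
]
by_job = attribute_defined(employees, "Job_type",
                           ["Secretary", "Engineer", "Technician"])

print([s["Ssn"] for s in grad_students])
print({job: len(members) for job, members in by_job.items()})
```

Because Job_type is single-valued and the constants are distinct, no employee can appear in two of the resulting subclasses, which is the disjointness property noted above.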
A category T is a class that is a subset of the union of n defining superclasses D1, D2,
… , Dn, n > 1 and is formally specified as follows:
T ⊆ (D1 ∪ D2 … ∪ Dn)
11 The use of the word class here refers to a collection (set) of entities, which differs from its more
common use in object-oriented programming languages such as C++. In C++, a class is a structured
type definition along with its applicable functions (operations).
A predicate pi on the attributes of Di can be used to specify the members of each Di
that are members of T. If a predicate is specified on every Di, we get
T = (D1[p1] ∪ D2[p2] … ∪ Dn[pn])
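As an illustration, the INSTRUCTOR_RESEARCHER category of Section 4.5.1 can be computed as a union of predicate-restricted superclasses. In this sketch the function category, the Supported attribute, and all data values are assumptions made for the example.

```python
# Hypothetical entities, identified by Ssn.
faculty = [{"Ssn": "f1"}, {"Ssn": "f2"}]
grad_students = [
    {"Ssn": "g1", "Supported": True},   # supported by teaching or research
    {"Ssn": "g2", "Supported": False},
]


def category(defining):
    """Given (Di, pi) pairs, return the keys of D1[p1] ∪ D2[p2] ∪ ... ∪ Dn[pn]."""
    members = set()
    for D, p in defining:
        members |= {e["Ssn"] for e in D if p(e)}
    return members


instructor_researcher = category([
    (faculty, lambda f: True),                  # all faculty are included
    (grad_students, lambda g: g["Supported"]),  # only supported graduate students
])

print(sorted(instructor_researcher))
```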
We should now extend the definition of relationship type given in Chapter 3 by
allowing any class—not only any entity type—to participate in a relationship.
Hence, we should replace the words entity type with class in that definition. The
graphical notation of EER is consistent with ER because all classes are represented
by rectangles.
4.6 Example of Other Notation: Representing Specialization and Generalization in UML Class Diagrams
We now discuss the UML notation for generalization/specialization and inheri-
tance. We already presented basic UML class diagram notation and terminology
in Section 3.8. Figure 4.10 illustrates a possible UML class diagram corresponding
to the EER diagram in Figure 4.7. The basic notation for specialization/generaliza-
tion (see Figure 4.10) is to connect the subclasses by vertical lines to a horizontal
line, which has a triangle connecting the horizontal line through another vertical
line to the superclass. A blank triangle indicates a specialization/generalization
with the disjoint constraint, and a filled triangle indicates an overlapping con-
straint. The root superclass is called the base class, and the subclasses (leaf nodes)
are called leaf classes.
The preceding discussion and the example in Figure 4.10, as well as the presenta-
tion in Section 3.8, gave a brief overview of UML class diagrams and terminology.
We focused on the concepts that are relevant to ER and EER database modeling
rather than on those concepts that are more relevant to software engineering. In
UML, there are many details that we have not discussed because they are outside
the scope of this text and are mainly relevant to software engineering. For example,
classes can be of various types:
■ Abstract classes define attributes and operations but do not have objects
corresponding to those classes. These are mainly used to specify a set of
attributes and operations that can be inherited.
■ Concrete classes can have objects (entities) instantiated to belong to the
class.
■ Template classes specify a template that can be further used to define
other classes.
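Programming languages offer rough counterparts to these three kinds of classes. The Python sketch below is an analogy only, not UML itself: the classes Person, Student, and Collection are hypothetical, and a generic class is used as an approximate stand-in for a template class.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar


class Person(ABC):
    """Abstract class: declares attributes and operations but has no direct objects."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def describe(self):
        ...


class Student(Person):
    """Concrete class: objects can be instantiated to belong to it."""

    def describe(self):
        return f"{self.name} (student)"


T = TypeVar("T")


class Collection(Generic[T]):
    """Approximate analogue of a template class, parameterized by element type T."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)


try:
    Person("Ann")                 # abstract classes cannot be instantiated
    abstract_instantiated = True
except TypeError:
    abstract_instantiated = False

students = Collection[Student]()  # a class derived from the template
students.add(Student("Ann"))
print(abstract_instantiated, students.items[0].describe())
```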
In database design, we are mainly concerned with specifying concrete classes whose
collections of objects are permanently (or persistently) stored in the database. The
bibliographic notes at the end of this chapter give some references to books that
describe complete details of UML.
[Figure 4.10: A UML class diagram corresponding to the EER diagram in Figure 4.7, illustrating UML notation for specialization/generalization.]
4.7 Data Abstraction, Knowledge Representation, and Ontology Concepts
In this section, we discuss in general terms some of the modeling concepts that we
described quite specifically in our presentation of the ER and EER models in Chap-
ter 3 and earlier in this chapter. This terminology is not only used in conceptual
data modeling but also in artificial intelligence literature when discussing
knowledge representation (KR). This section discusses the similarities and differ-
ences between conceptual modeling and knowledge representation, and introduces
some of the alternative terminology and a few additional concepts.
The goal of KR techniques is to develop concepts for accurately modeling some domain
of knowledge by creating an ontology12 that describes the concepts of the domain
and how these concepts are interrelated. The ontology is used to store and manipu-
late knowledge for drawing inferences, making decisions, or answering questions.
The goals of KR are similar to those of semantic data models; the two disciplines
share important similarities but also differ in several respects:
■ Both disciplines use an abstraction process to identify common properties and
important aspects of objects in the miniworld (also known as domain of discourse
in KR) while suppressing insignificant differences and unimportant details.
■ Both disciplines provide concepts, relationships, constraints, operations,
and languages for defining data and representing knowledge.
■ KR is generally broader in scope than semantic data models. Different forms
of knowledge, such as rules (used in inference, deduction, and search),
incomplete and default knowledge, and temporal and spatial knowledge, are
represented in KR schemes. Database models are being expanded to include
some of these concepts (see Chapter 26).
■ KR schemes include reasoning mechanisms that deduce additional facts
from the facts stored in a database. Hence, whereas most current database
systems are limited to answering direct queries, knowledge-based systems
using KR schemes can answer queries that involve inferences over the
stored data. Database technology is being extended with inference mecha-
nisms (see Section 26.5).
■ Whereas most data models concentrate on the representation of database
schemas, or meta-knowledge, KR schemes often mix up the schemas with
the instances themselves in order to provide flexibility in representing
exceptions. This often results in inefficiencies when these KR schemes are
implemented, especially when compared with databases and when a large
amount of structured data (facts) needs to be stored.
We now discuss four abstraction concepts that are used in semantic data models,
such as the EER model, as well as in KR schemes: (1) classification and instantia-
tion, (2) identification, (3) specialization and generalization, and (4) aggregation
and association. The paired concepts of classification and instantiation are inverses
of one another, as are generalization and specialization. The concepts of aggrega-
tion and association are also related. We discuss these abstract concepts and their
relation to the concrete representations used in the EER model to clarify the data
abstraction process and to improve our understanding of the related process of
conceptual schema design. We close the section with a brief discussion of ontology,
which is being used widely in recent knowledge representation research.
12 An ontology is somewhat similar to a conceptual schema, but with more knowledge, rules, and exceptions.
4.7.1 Classification and Instantiation
The process of classification involves systematically assigning similar objects/enti-
ties to object classes/entity types. We can now describe (in DB) or reason about (in
KR) the classes rather than the individual objects. Collections of objects that share
the same types of attributes, relationships, and constraints are classified into classes
in order to simplify the process of discovering their properties. Instantiation is the
inverse of classification and refers to the generation and specific examination of
distinct objects of a class. An object instance is related to its object class by the
IS-AN-INSTANCE-OF or IS-A-MEMBER-OF relationship. Although EER dia-
grams do not display instances, the UML diagrams allow a form of instantiation by
permitting the display of individual objects. We did not describe this feature in our
introduction to UML class diagrams.
In general, the objects of a class should have a similar type structure. However,
some objects may display properties that differ in some respects from the other
objects of the class; these exception objects also need to be modeled, and KR
schemes allow more varied exceptions than do database models. In addition, cer-
tain properties apply to the class as a whole and not to the individual objects; KR
schemes allow such class properties. UML diagrams also allow specification of
class properties.
In the EER model, entities are classified into entity types according to their basic
attributes and relationships. Entities are further classified into subclasses and cat-
egories based on additional similarities and differences (exceptions) among them.
Relationship instances are classified into relationship types. Hence, entity types,
subclasses, categories, and relationship types are the different concepts that are
used for classification in the EER model. The EER model does not provide
explicitly for class properties, but it may be extended to do so. In UML, objects
are classified into classes, and it is possible to display both class properties and
individual objects.
Knowledge representation models allow multiple classification schemes in
which one class is an instance of another class (called a meta-class). Notice that
this cannot be represented directly in the EER model, because we have only two
levels—classes and instances. The only relationship among classes in the EER
model is a superclass/subclass relationship, whereas in some KR schemes an
additional class/instance relationship can be represented directly in a class
hierarchy. An instance may itself be another class, allowing multiple-level
classification schemes.
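Some programming languages support this kind of multiple-level classification directly. As an illustration outside the EER model, Python lets a class be an instance of a meta-class; the names EntityType, Person, and Student below are hypothetical.

```python
class EntityType(type):
    """Hypothetical meta-class: every class created with it is an instance of EntityType."""

    registry = []  # a property of the meta level: names of all classes created so far

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        EntityType.registry.append(name)


class Person(metaclass=EntityType):
    pass


class Student(Person):  # superclass/subclass (IS-A) relationship
    pass


matthew = Student()

print(isinstance(matthew, Student))     # object IS-AN-INSTANCE-OF class
print(issubclass(Student, Person))      # class IS-A-SUBCLASS-OF class
print(isinstance(Student, EntityType))  # class IS-AN-INSTANCE-OF meta-class
print(EntityType.registry)
```

The three tests correspond to three different relationships: instance to class, subclass to superclass, and class to meta-class. Only the first two have direct counterparts in the EER model.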
4.7.2 Identification
Identification is the abstraction process whereby classes and objects are made
uniquely identifiable by means of some identifier. For example, a class name uniquely
identifies a whole class within a schema. An additional mechanism is necessary for
telling distinct object instances apart by means of object identifiers. Moreover, it is
necessary to identify multiple manifestations in the database of the same real-world
object. For example, we may have a tuple <‘Matthew Clarke’, ‘610618’, ‘376-9821’> in
a PERSON relation and another tuple <‘301-54-0836’, ‘CS’, 3.8> in a STUDENT rela-
tion that happen to represent the same real-world entity. There is no way to identify
the fact that these two database objects (tuples) represent the same real-world
entity unless we make a provision at design time for appropriate cross-referencing to
supply this identification. Hence, identification is needed at two levels:
■ To distinguish among database objects and classes
■ To identify database objects and to relate them to their real-world counterparts
In the EER model, identification of schema constructs is based on a system of
unique names for the constructs in a schema. For example, every class in an EER
schema—whether it is an entity type, a subclass, a category, or a relationship type—
must have a distinct name. The names of attributes of a particular class must also be
distinct. Rules for unambiguously identifying attribute name references in a spe-
cialization or generalization lattice or hierarchy are needed as well.
At the object level, the values of key attributes are used to distinguish among enti-
ties of a particular entity type. For weak entity types, entities are identified by a
combination of their own partial key values and the entities they are related to in
the owner entity type(s). Relationship instances are identified by some combination
of the entities that they relate to, depending on the cardinality ratio specified.
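Object-level identification can be sketched with dictionaries whose keys play the role of identifiers. The entity types EMPLOYEE, DEPENDENT, and PROJECT, the relationship WORKS_ON, and all values are assumptions chosen for illustration; the relationship is assumed to be M:N, so both participating keys are needed.

```python
# Regular entity type: the key attribute value (Ssn) identifies each entity.
employees = {
    "123456789": {"Name": "Smith"},
    "987654321": {"Name": "Wong"},
}

# Weak entity type: identified by (owner key, partial key).
dependents = {
    ("123456789", "Alice"): {"Relationship": "Daughter"},
    ("987654321", "Alice"): {"Relationship": "Spouse"},
}

# M:N relationship instances: identified by the keys of the related entities.
works_on = {
    ("123456789", "P1"): {"Hours": 10},
    ("123456789", "P2"): {"Hours": 30},
}


def find_dependent(owner_ssn, name):
    """Look up a dependent by its owner's key plus its own partial key."""
    return dependents[(owner_ssn, name)]


print(find_dependent("123456789", "Alice")["Relationship"])
print(find_dependent("987654321", "Alice")["Relationship"])
```

The partial key "Alice" alone is ambiguous; only its combination with the owner's key picks out a single dependent.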
4.7.3 Specialization and Generalization
Specialization is the process of classifying a class of objects into more specialized
subclasses. Generalization is the inverse process of generalizing several classes into
a higher-level abstract class that includes the objects in all these classes. Specializa-
tion is conceptual refinement, whereas generalization is conceptual synthesis. Sub-
classes are used in the EER model to represent specialization and generalization.
We call the relationship between a subclass and its superclass an IS-A-SUBCLASS-OF
relationship, or simply an IS-A relationship. This is the same as the IS-A relation-
ship discussed earlier in Section 4.5.3.
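Class inheritance in a programming language offers a loose analogy to the IS-A relationship. The sketch below is only an analogy, not an EER construct, and the class names Employee, Secretary, and Engineer and their attributes are hypothetical.

```python
class Employee:
    """Superclass: attributes shared by every member of the class."""

    def __init__(self, ssn: str, name: str) -> None:
        self.ssn = ssn
        self.name = name


class Secretary(Employee):
    """Subclass obtained by specialization; adds a specific attribute."""

    def __init__(self, ssn: str, name: str, typing_speed: int) -> None:
        super().__init__(ssn, name)
        self.typing_speed = typing_speed


class Engineer(Employee):
    """A second subclass of the same superclass."""

    def __init__(self, ssn: str, name: str, eng_type: str) -> None:
        super().__init__(ssn, name)
        self.eng_type = eng_type


s = Secretary("123", "Pat", 70)

# Every Secretary IS-A Employee and inherits the superclass attributes.
print(isinstance(s, Employee))   # True
print(s.name, s.typing_speed)    # Pat 70
```

Reading the definitions from Secretary and Engineer up to Employee corresponds to generalization; reading them from Employee down corresponds to specialization.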
4.7.4 Aggregation and Association
Aggregation is an abstraction concept for building composite objects from their
component objects. There are three cases where this concept can be related to the
EER model. The first case is the situation in which we aggregate attribute values of
an object to form the whole object. The second case is when we represent an aggre-
gation relationship as an ordinary relationship. The third case, which the EER
model does not provide for explicitly, involves the possibility of combining objects
that are related by a particular relationship instance into a higher-level aggregate
object. This is sometimes useful when the higher-level aggregate object is itself to be
related to another object. We call the relationship between the primitive objects and
their aggregate object IS-A-PART-OF; the inverse is called IS-A-COMPONENT-OF.
UML provides for all three types of aggregation.
The abstraction of association is used to associate objects from several independent
classes. Hence, it is somewhat similar to the second use of aggregation. It is repre-
sented in the EER model by relationship types, and in UML by associations. This
abstract relationship is called IS-ASSOCIATED-WITH.
In order to understand the different uses of aggregation better, consider the ER
schema shown in Figure 4.11(a), which stores information about interviews by
job applicants to various companies. The class COMPANY is an aggregation of
the attributes (or component objects) Cname (company name) and Caddress
(company address), whereas JOB_APPLICANT is an aggregate of Ssn, Name,
Address, and Phone. The relationship attributes Contact_name and Contact_phone
represent the name and phone number of the person in the company who is
responsible for the interview. Suppose that some interviews result in job offers,
whereas others do not. We would like to treat INTERVIEW as a class to associate it
with JOB_OFFER. The schema shown in Figure 4.11(b) is incorrect because it
requires each interview relationship instance to have a job offer. The schema
shown in Figure 4.11(c) is not allowed because the ER model does not allow rela-
tionships among relationships.
One way to represent this situation is to create a higher-level aggregate class com-
posed of COMPANY, JOB_APPLICANT, and INTERVIEW and to relate this class to
JOB_OFFER, as shown in Figure 4.11(d). Although the EER model as described in
this book does not have this facility, some semantic data models do allow it and call
the resulting object a composite or molecular object. Other models treat entity
types and relationship types uniformly and hence permit relationships among rela-
tionships, as illustrated in Figure 4.11(c).
To represent this situation correctly in the ER model as described here, we need to
create a new weak entity type INTERVIEW, as shown in Figure 4.11(e), and relate it to
JOB_OFFER. Hence, we can always represent these situations correctly in the ER
model by creating additional entity types, although it may be conceptually more
desirable to allow direct representation of aggregation, as in Figure 4.11(d), or to
allow relationships among relationships, as in Figure 4.11(c).
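To make the weak entity approach more concrete, here is one possible relational sketch using Python's sqlite3 module. It is an assumption-laden illustration rather than a mapping prescribed by the text: the column Interview_date stands in for the Date attribute, the Offer_id column is hypothetical, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE COMPANY (Cname TEXT PRIMARY KEY, Caddress TEXT);
    CREATE TABLE JOB_APPLICANT (
        Ssn TEXT PRIMARY KEY, Name TEXT, Address TEXT, Phone TEXT
    );
    -- INTERVIEW as a weak entity: identified by both owners plus its date.
    CREATE TABLE INTERVIEW (
        Cname          TEXT NOT NULL REFERENCES COMPANY(Cname),
        Ssn            TEXT NOT NULL REFERENCES JOB_APPLICANT(Ssn),
        Interview_date TEXT NOT NULL,
        Contact_name   TEXT,
        Contact_phone  TEXT,
        PRIMARY KEY (Cname, Ssn, Interview_date)
    );
    -- RESULTS_IN: an offer refers to an interview; an interview needs no offer.
    CREATE TABLE JOB_OFFER (
        Offer_id       INTEGER PRIMARY KEY,   -- hypothetical surrogate key
        Cname          TEXT NOT NULL,
        Ssn            TEXT NOT NULL,
        Interview_date TEXT NOT NULL,
        FOREIGN KEY (Cname, Ssn, Interview_date)
            REFERENCES INTERVIEW (Cname, Ssn, Interview_date)
    );
""")

conn.execute("INSERT INTO COMPANY VALUES ('Acme', '1 Main St')")
conn.execute("INSERT INTO JOB_APPLICANT VALUES ('123', 'Pat', '9 Elm St', '555-0100')")
conn.executemany(
    "INSERT INTO INTERVIEW VALUES (?, ?, ?, ?, ?)",
    [("Acme", "123", "2024-01-10", "Lee", "555-0101"),
     ("Acme", "123", "2024-02-01", "Lee", "555-0101")],
)
# Only the second interview results in an offer.
conn.execute(
    "INSERT INTO JOB_OFFER (Cname, Ssn, Interview_date) VALUES ('Acme', '123', '2024-02-01')"
)

rows = conn.execute("""
    SELECT i.Interview_date, o.Offer_id IS NOT NULL AS has_offer
    FROM INTERVIEW AS i
    LEFT JOIN JOB_OFFER AS o
      ON o.Cname = i.Cname AND o.Ssn = i.Ssn
     AND o.Interview_date = i.Interview_date
    ORDER BY i.Interview_date
""").fetchall()
print(rows)  # [('2024-01-10', 0), ('2024-02-01', 1)]
```

Because the offer points at the interview and not the other way around, an interview without an offer is representable, which is the property the ternary version in Figure 4.11(b) lacks.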
The main structural distinction between aggregation and association is that when
an association instance is deleted, the participating objects may continue to exist.
However, if we support the notion of an aggregate object—for example, a CAR that
is made up of objects ENGINE, CHASSIS, and TIRES—then deleting the aggregate
CAR object amounts to deleting all its component objects.
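The deletion behavior described above can be imitated in a relational setting. The following sketch, using Python's sqlite3 module, is one way to do it under our own assumptions: the COMPONENT, PERSON, and DRIVES tables are hypothetical, and ON DELETE CASCADE is used to stand in for the semantics of an aggregate object.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
conn.executescript("""
    CREATE TABLE CAR (Car_id INTEGER PRIMARY KEY);
    -- Aggregation: a component cannot outlive the car it is part of.
    CREATE TABLE COMPONENT (
        Component_id INTEGER PRIMARY KEY,
        Kind         TEXT NOT NULL,
        Car_id       INTEGER NOT NULL REFERENCES CAR(Car_id) ON DELETE CASCADE
    );
    CREATE TABLE PERSON (Person_id INTEGER PRIMARY KEY);
    -- Association: DRIVES merely links two independent objects.
    CREATE TABLE DRIVES (
        Person_id INTEGER NOT NULL REFERENCES PERSON(Person_id),
        Car_id    INTEGER NOT NULL REFERENCES CAR(Car_id),
        PRIMARY KEY (Person_id, Car_id)
    );
""")

conn.execute("INSERT INTO CAR VALUES (1)")
conn.executemany("INSERT INTO COMPONENT (Kind, Car_id) VALUES (?, 1)",
                 [("ENGINE",), ("CHASSIS",), ("TIRES",)])
conn.execute("INSERT INTO PERSON VALUES (1)")
conn.execute("INSERT INTO DRIVES VALUES (1, 1)")


def count(table: str) -> int:
    """Return the number of rows in one of the tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Deleting the association instance leaves both participants in place.
conn.execute("DELETE FROM DRIVES WHERE Person_id = 1 AND Car_id = 1")
after_assoc_delete = (count("PERSON"), count("CAR"))

# Deleting the aggregate removes its components as well.
conn.execute("DELETE FROM CAR WHERE Car_id = 1")
components_left = count("COMPONENT")

print(after_assoc_delete, components_left)  # (1, 1) 0
```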
Figure 4.11
Aggregation. (a) The relationship type INTERVIEW. (b) Including JOB_OFFER in a
ternary relationship type (incorrect). (c) Having the RESULTS_IN relationship
participate in other relationships (not allowed in ER). (d) Using aggregation and a
composite (molecular) object (generally not allowed in ER but allowed by some
modeling tools). (e) Correct representation in ER.
4.7.5 Ontologies and the Semantic Web
In recent years, the amount of computerized data and information available on
the Web has spiraled out of control. Many different models and formats are used.
In addition to the database models that we present in this text, much information
is stored in the form of documents, which have considerably less structure than
database information does. One ongoing project that is attempting to allow
information exchange among computers on the Web is called the Semantic
Web, which attempts to create knowledge representation models that are quite
general in order to allow meaningful information exchange and search among
machines. The concept of ontology is considered to be the most promising basis
for achieving the goals of the Semantic Web and is closely related to knowledge
representation. In this section, we give a brief introduction to what ontology is
and how it can be used as a basis to automate information understanding, search,
and exchange.
The study of ontologies attempts to describe the concepts and relationships that are
possible in reality through some common vocabulary; therefore, it can be consid-
ered as a way to describe the knowledge of a certain community about reality.
Ontology originated in the fields of philosophy and metaphysics. One commonly
used definition of ontology is a specification of a conceptualization (see footnote 13).
In this definition, a conceptualization is the set of concepts and relationships that
are used to represent the part of reality or knowledge that is of interest to a com-
munity of users. Specification refers to the language and vocabulary terms that are
used to specify the conceptualization. The ontology includes both specification and
conceptualization. For example, the same conceptualization may be specified in two
different languages, giving two separate ontologies. Because this definition is so
general, there is no consensus on exactly what an ontology is. Some possible ways to
describe ontologies are as follows:
■ A thesaurus (or even a dictionary or a glossary of terms) describes the rela-
tionships between words (vocabulary) that represent various concepts.
■ A taxonomy describes how concepts of a particular area of knowledge
are related using structures similar to those used in a specialization or
generalization.
■ A detailed database schema is considered by some to be an ontology that
describes the concepts (entities and attributes) and relationships of a mini-
world from reality.
■ A logical theory uses concepts from mathematical logic to try to define con-
cepts and their interrelationships.
Usually the concepts used to describe ontologies are similar to the concepts we dis-
cuss in conceptual modeling, such as entities, attributes, relationships, specializa-
tions, and so on. The main difference between an ontology and, say, a database
schema, is that the schema is usually limited to describing a small subset of a mini-
world from reality in order to store and manage data. An ontology is usually con-
sidered to be more general in that it attempts to describe a part of reality or a
domain of interest (for example, medical terms, electronic-commerce applications,
sports, and so on) as completely as possible.
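Of the descriptions listed above, a taxonomy is the easiest to make concrete. The following Python sketch is a toy example under our own assumptions: the medical terms, the BROADER mapping, and the helper functions are all hypothetical and are not drawn from any real ontology.

```python
from typing import Dict, List, Optional

# Each concept maps to its broader concept; None marks the root.
BROADER: Dict[str, Optional[str]] = {
    "Disease": None,
    "Infectious disease": "Disease",
    "Viral infection": "Infectious disease",
    "Influenza": "Viral infection",
}


def ancestors(concept: str) -> List[str]:
    """Return the broader concepts of `concept`, nearest first."""
    chain: List[str] = []
    parent = BROADER[concept]
    while parent is not None:
        chain.append(parent)
        parent = BROADER[parent]
    return chain


def is_a(concept: str, candidate: str) -> bool:
    """True if `candidate` is a (transitive) generalization of `concept`."""
    return candidate in ancestors(concept)


print(ancestors("Influenza"))
# ['Viral infection', 'Infectious disease', 'Disease']
print(is_a("Influenza", "Disease"))   # True
print(is_a("Disease", "Influenza"))   # False
```

The structure is the same as a specialization hierarchy, which is why the concepts of conceptual modeling carry over so directly to ontologies.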
Footnote 13: This definition is given in Gruber (1995).
4.8 Summary
In this chapter we discussed extensions to the ER model that improve its repre-
sentational capabilities. We called the resulting model the enhanced ER or EER
model. We presented the concept of a subclass and its superclass and the related
mechanism of attribute/relationship inheritance. We saw how it is sometimes
necessary to create additional classes of entities, either because of additional spe-
cific attributes or because of specific relationship types. We discussed two main
processes for defining superclass/subclass hierarchies and lattices: specialization
and generalization.
Next, we showed how to display these new constructs in an EER diagram. We also
discussed the various types of constraints that may apply to specialization or gener-
alization. The two main constraints are total/partial and disjoint/overlapping. We
discussed the concept of a category or union type, which is a subset of the union of
two or more classes, and we gave formal definitions of all the concepts presented.
We introduced some of the notation and terminology of UML for representing
specialization and generalization. In Section 4.7, we briefly discussed the discipline
of knowledge representation and how it is related to semantic data modeling. We
also gave an overview and summary of the types of abstract data representation
concepts: classification and instantiation, identification, specialization and gener-
alization, and aggregation and association. We saw how EER and UML concepts
are related to each of these.
Review Questions
4.1. What is a subclass? When is a subclass needed in data modeling?
4.2. Define the following terms: superclass of a subclass, superclass/subclass rela-
tionship, IS-A relationship, specialization, generalization, category, specific
(local) attributes, and specific relationships.
4.3. Discuss the mechanism of attribute/relationship inheritance. Why is it use-
ful?
4.4. Discuss user-defined and predicate-defined subclasses, and identify the dif-
ferences between the two.
4.5. Discuss user-defined and attribute-defined specializations, and identify the
differences between the two.
4.6. Discuss the two main types of constraints on specializations and generalizations.
4.7. What is the difference between a specialization hierarchy and a specializa-
tion lattice?
4.8. What is the difference between specialization and generalization? Why do
we not display this difference in schema diagrams?
4.9. How does a category differ from a regular shared subclass? What is a cate-
gory used for? Illustrate your answer with examples.
4.10. For each of the following UML terms (see Sections 3.8 and 4.6), discuss the
corresponding term in the EER model, if any: object, class, association, aggre-
gation, generalization, multiplicity, attributes, discriminator, link, link attri-
bute, reflexive association, and qualified association.
4.11. Discuss the main differences between the notation for EER schema dia-
grams and UML class diagrams by comparing how common concepts are
represented in each.
4.12. List the various data abstraction concepts and the corresponding modeling
concepts in the EER model.
4.13. What aggregation feature is missing from the EER model? How can the EER
model be further enhanced to support it?
4.14. What are the main similarities and differences between conceptual database
modeling techniques and knowledge representation techniques?
4.15. Discuss the similarities and differences between an ontology and a database
schema.
Exercises
4.16. Design an EER schema for a database application that you are interested in.
Specify all constraints that should hold on the database. Make sure that the
schema has at least five entity types, four relationship types, a weak entity
type, a superclass/subclass relationship, a category, and an n-ary (n > 2) rela-
tionship type.
4.17. Consider the BANK ER schema in Figure 3.21, and suppose that it
is necessary to keep track of different types of ACCOUNTS
(SAVINGS_ACCTS, CHECKING_ACCTS, … ) and LOANS (CAR_LOANS,
HOME_LOANS, … ). Suppose that it is also desirable to keep track of
each ACCOUNT’s TRANSACTIONS (deposits, withdrawals, checks, … )
and each LOAN’s PAYMENTS; both of these include the amount, date,
and time. Modify the BANK schema, using ER and EER concepts of
specialization and generalization. State any assumptions you make
about the additional requirements.
4.18. The following narrative describes a simplified version of the organization of
Olympic facilities planned for the summer Olympics. Draw an EER diagram
that shows the entity types, attributes, relationships, and specializations for
this application. State any assumptions you make. The Olympic facilities are
divided into sports complexes. Sports complexes are divided into one-sport
and multisport types. Multisport complexes have areas of the complex desig-
nated for each sport with a location indicator (e.g., center, NE corner, and so
on). A complex has a location, chief organizing individual, total occupied
area, and so on. Each complex holds a series of events (e.g., the track sta-
dium may hold many different races). For each event there is a planned date,
duration, number of participants, number of officials, and so on. A roster of
all officials will be maintained together with the list of events each official
will be involved in. Different equipment is needed for the events (e.g., goal
posts, poles, parallel bars) as well as for maintenance. The two types of facil-
ities (one-sport and multisport) will have different types of information. For
each type, the number of facilities needed is kept, together with an approxi-
mate budget.
4.19. Identify all the important concepts represented in the library database case
study described below. In particular, identify the abstractions of classifica-
tion (entity types and relationship types), aggregation, identification, and
specialization/generalization. Specify (min, max) cardinality constraints
whenever possible. List details that will affect the eventual design but that
have no bearing on the conceptual design. List the semantic constraints sep-
arately. Draw an EER diagram of the library database.
Case Study: The Georgia Tech Library (GTL) has approximately 16,000
members, 100,000 titles, and 250,000 volumes (an average of 2.5 copies per
book). About 10% of the volumes are out on loan at any one time. The librar-
ians ensure that the books that members want to borrow are available when
the members want to borrow them. Also, the librarians must know how
many copies of each book are in the library or out on loan at any given time.
A catalog of books is available online that lists books by author, title, and
subject area. For each title in the library, a book description is kept in the
catalog; the description ranges from one sentence to several pages. The refer-
ence librarians want to be able to access this description when members
request information about a book. Library staff includes chief librarian,
departmental associate librarians, reference librarians, check-out staff, and
library assistants.
Books can be checked out for 21 days. Members are allowed to have only
five books out at a time. Members usually return books within three to four
weeks. Most members know that they have one week of grace before a
notice is sent to them, so they try to return books before the grace period
ends. About 5% of the members have to be sent reminders to return books.
Most overdue books are returned within a month of the due date. Approxi-
mately 5% of the overdue books are either kept or never returned. The most
active members of the library are defined as those who borrow books at
least ten times during the year. The top 1% of membership does 15% of the
borrowing, and the top 10% of the membership does 40% of the borrowing.
About 20% of the members are totally inactive in that they are members
who never borrow.
To become a member of the library, applicants fill out a form including their
SSN, campus and home mailing addresses, and phone numbers. The librarians
issue a numbered, machine-readable card with the member’s photo on it.
This card is good for four years. A month before a card expires, a notice is
sent to a member for renewal. Professors at the institute are considered auto-
matic members. When a new faculty member joins the institute, his or her
information is pulled from the employee records and a library card is mailed
to his or her campus address. Professors are allowed to check out books for
three-month intervals and have a two-week grace period. Renewal notices to
professors are sent to their campus address.
The library does not lend some books, such as reference books, rare books,
and maps. The librarians must differentiate between books that can be lent
and those that cannot be lent. In addition, the librarians have a list of some
books they are interested in acquiring but cannot obtain, such as rare or out-
of-print books and books that were lost or destroyed but have not been
replaced. The librarians must have a system that keeps track of books that
cannot be lent as well as books that they are interested in acquiring. Some
books may have the same title; therefore, the title cannot be used as a means
of identification. Every book is identified by its International Standard Book
Number (ISBN), a unique international code assigned to all books. Two
books with the same title can have different ISBNs if they are in different
languages or have different bindings (hardcover or softcover). Editions of
the same book have different ISBNs.
The proposed database system must be designed to keep track of the mem-
bers, the books, the catalog, and the borrowing activity.
4.20. Design a database to keep track of information for an art museum. Assume
that the following requirements were collected:
■ The museum has a collection of ART_OBJECTS. Each ART_OBJECT has a
unique Id_no, an Artist (if known), a Year (when it was created, if known),
a Title, and a Description. The art objects are categorized in several ways, as
discussed below.
■ ART_OBJECTS are categorized based on their type. There are three main
types—PAINTING, SCULPTURE, and STATUE—plus another type called
OTHER to accommodate objects that do not fall into one of the three main
types.
■ A PAINTING has a Paint_type (oil, watercolor, etc.), material on which
it is Drawn_on (paper, canvas, wood, etc.), and Style (modern,
abstract, etc.).
■ A SCULPTURE or a STATUE has a Material from which it was created (wood,
stone, etc.), Height, Weight, and Style.
■ An art object in the OTHER category has a Type (print, photo, etc.) and Style.
■ ART_OBJECTs are categorized as either PERMANENT_COLLECTION
(objects that are owned by the museum) or BORROWED. Information
captured about objects in the PERMANENT_COLLECTION includes
Date_acquired, Status (on display, on loan, or stored), and Cost. Information
captured about BORROWED objects includes the Collection from which it
was borrowed, Date_borrowed, and Date_returned.
■ Information describing the country or culture of Origin (Italian, Egyptian,
American, Indian, and so forth) and Epoch (Renaissance, Modern,
Ancient, and so forth) is captured for each ART_OBJECT.
■ The museum keeps track of ARTIST information, if known: Name,
Date_born (if known), Date_died (if not living), Country_of_origin, Epoch,
Main_style, and Description. The Name is assumed to be unique.
■ Different EXHIBITIONS occur, each having a Name, Start_date, and End_date.
EXHIBITIONS are related to all the art objects that were on display during
the exhibition.
■ Information is kept on other COLLECTIONS with which the museum
interacts; this information includes Name (unique), Type (museum, per-
sonal, etc.), Description, Address, Phone, and current Contact_person.
Draw an EER schema diagram for this application. Discuss any assumptions
you make, and then justify your EER design choices.
4.21. Figure 4.12 shows an example of an EER diagram for a small-private-airport
database; the database is used to keep track of airplanes, their owners, air-
port employees, and pilots. From the requirements for this database, the fol-
lowing information was collected: Each AIRPLANE has a registration number
[Reg#], is of a particular plane type [OF_TYPE], and is stored in a particular
hangar [STORED_IN]. Each PLANE_TYPE has a model number [Model], a
capacity [Capacity], and a weight [Weight]. Each HANGAR has a number
[Number], a capacity [Capacity], and a location [Location]. The database also
keeps track of the OWNERs of each plane [OWNS] and the EMPLOYEEs who
have maintained the plane [MAINTAIN]. Each relationship instance in OWNS
relates an AIRPLANE to an OWNER and includes the purchase date [Pdate].
Each relationship instance in MAINTAIN relates an EMPLOYEE to a service
record [SERVICE]. Each plane undergoes service many times; hence, it is
related by [PLANE_SERVICE] to a number of SERVICE records. A SERVICE
record includes as attributes the date of maintenance [Date], the number of
hours spent on the work [Hours], and the type of work done [Work_code]. We
use a weak entity type [SERVICE] to represent airplane service, because the
airplane registration number is used to identify a service record. An OWNER
is either a person or a corporation. Hence, we use a union type (category)
[OWNER] that is a subset of the union of corporation [CORPORATION] and
person [PERSON] entity types. Both pilots [PILOT] and employees
[EMPLOYEE] are subclasses of PERSON. Each PILOT has specific attributes
license number [Lic_num] and restrictions [Restr]; each EMPLOYEE has spe-
cific attributes salary [Salary] and shift worked [Shift]. All PERSON entities in
the database have data kept on their Social Security number [Ssn], name
[Name], address [Address], and telephone number [Phone]. For CORPORATION
entities, the data kept includes name [Name], address [Address], and
telephone number [Phone]. The database also keeps track of the types of
planes each pilot is authorized to fly [FLIES] and the types of planes each
employee can do maintenance work on [WORKS_ON]. Show how the
SMALL_AIRPORT EER schema in Figure 4.12 may be represented in UML
notation. (Note: We have not discussed how to represent categories (union
types) in UML, so you do not have to map the categories in this and the fol-
lowing question.)
4.22. Show how the UNIVERSITY EER schema in Figure 4.9 may be represented in
UML notation.
Figure 4.12
EER schema for a SMALL_AIRPORT database.
4.23. Consider the entity sets and attributes shown in the following table. Place a
checkmark in one column in each row to indicate the relationship between
the far left and far right columns.
a. The left side has a relationship with the right side.
b. The right side is an attribute of the left side.
c. The left side is a specialization of the right side.
d. The left side is a generalization of the right side.
No. | Entity Set      | (a) Has a Relationship with | (b) Has an Attribute that is | (c) Is a Specialization of | (d) Is a Generalization of | Entity Set or Attribute
1.  | MOTHER          |   |   |   |   | PERSON
2.  | DAUGHTER        |   |   |   |   | MOTHER
3.  | STUDENT         |   |   |   |   | PERSON
4.  | STUDENT         |   |   |   |   | Student_id
5.  | SCHOOL          |   |   |   |   | STUDENT
6.  | SCHOOL          |   |   |   |   | CLASS_ROOM
7.  | ANIMAL          |   |   |   |   | HORSE
8.  | HORSE           |   |   |   |   | Breed
9.  | HORSE           |   |   |   |   | Age
10. | EMPLOYEE        |   |   |   |   | SSN
11. | FURNITURE       |   |   |   |   | CHAIR
12. | CHAIR           |   |   |   |   | Weight
13. | HUMAN           |   |   |   |   | WOMAN
14. | SOLDIER         |   |   |   |   | PERSON
15. | ENEMY_COMBATANT |   |   |   |   | PERSON
4.24. Draw a UML diagram for storing a played game of chess in a database.
You may look at http://www.chessgames.com for an application similar to
what you are designing. State clearly any assumptions you make in your
UML diagram. A sample of assumptions you can make about the scope is
as follows:
1. The game of chess is played between two players.
2. The game is played on an 8 × 8 board like the one shown below:
3. The players are assigned a color of black or white at the start of the game.
4. Each player starts with the following pieces (traditionally called
chessmen):
a. king
b. queen
c. 2 rooks
d. 2 bishops
e. 2 knights
f. 8 pawns
5. Every piece has its own initial position.
6. Every piece has its own set of legal moves based on the state of the game.
You do not need to worry about which moves are or are not legal except
for the following issues:
a. A piece may move to an empty square or capture an opposing piece.
b. If a piece is captured, it is removed from the board.
c. If a pawn moves to the last row, it is “promoted” by converting it to
another piece (queen, rook, bishop, or knight).
Note: Some of these functions may be spread over multiple classes.
4.25. Draw an EER diagram for a game of chess as described in Exercise 4.24. Focus
on persistent storage aspects of the system. For example, the system would
need to retrieve all the moves of every game played in sequential order.
4.26. Which of the following EER diagrams is/are incorrect and why? State clearly
any assumptions you make.
[Diagrams (a), (b), and (c) for Exercise 4.26 are not reproduced; the surviving labels are E, E1, E2, E3, and R.]
4.27. Consider the following EER diagram that describes the computer systems at
a company. Provide your own attributes and key for each entity type. Supply
max cardinality constraints justifying your choice. Write a complete narra-
tive description of what this EER diagram represents.
[EER diagram for Exercise 4.27 is not reproduced; the surviving labels are COMPUTER, LAPTOP, DESKTOP, SOFTWARE, OPERATING_SYSTEM, COMPONENT, MEMORY, VIDEO_CARD, SOUND_CARD, ACCESSORY, KEYBOARD, MOUSE, MONITOR, INSTALLED, INSTALLED_OS, SUPPORTS, OPTIONS, MEM_OPTIONS, and SOLD_WITH.]
Laboratory Exercises
4.28. Consider a GRADE_BOOK database in which instructors within an academic
department record points earned by individual students in their classes. The
data requirements are summarized as follows:
■ Each student is identified by a unique identifier, first and last name, and
an e-mail address.
■ Each instructor teaches certain courses each term. Each course is identified
by a course number, a section number, and the term in which it is taught. For
each course he or she teaches, the instructor specifies the minimum number
of points required in order to earn letter grades A, B, C, D, and F. For exam-
ple, 90 points for an A, 80 points for a B, 70 points for a C, and so forth.
■ Students are enrolled in each course taught by the instructor.
■ Each course has a number of grading components (such as midterm
exam, final exam, project, and so forth). Each grading component has a
maximum number of points (such as 100 or 50) and a weight (such as
20% or 10%). The weights of all the grading components of a course usually
total 100%.
■ Finally, the instructor records the points earned by each student in each of
the grading components in each of the courses. For example, student 1234
earns 84 points for the midterm exam grading component of the section 2
course CSc2310 in the fall term of 2009. The midterm exam grading com-
ponent may have been defined to have a maximum of 100 points and a
weight of 20% of the course grade.
Design an enhanced entity–relationship diagram for the grade book data-
base and build the design using a data modeling tool such as ERwin or
Rational Rose.
4.29. Consider an ONLINE_AUCTION database system in which members (buyers
and sellers) participate in the sale of items. The data requirements for this
system are summarized as follows:
■ The online site has members, each of whom is identified by a unique
member number and is described by an e-mail address, name, password,
home address, and phone number.
■ A member may be a buyer or a seller. A buyer has a shipping address
recorded in the database. A seller has a bank account number and routing
number recorded in the database.
■ Items are placed by a seller for sale and are identified by a unique item
number assigned by the system. Items are also described by an item title,
a description, starting bid price, bidding increment, the start date of the
auction, and the end date of the auction.
■ Items are also categorized based on a fixed classification hierarchy (for
example, a modem may be classified as COMPUTER → HARDWARE →
MODEM).
■ Buyers make bids for items they are interested in. Bid price and time of
bid are recorded. The bidder at the end of the auction with the highest bid
price is declared the winner, and a transaction between buyer and seller
may then proceed.
■ The buyer and seller may record feedback regarding their completed
transactions. Feedback contains a rating of the other party participating
in the transaction (1–10) and a comment.
Design an enhanced entity–relationship diagram for the ONLINE_AUCTION
database and build the design using a data modeling tool such as ERwin or
Rational Rose.
4.30. Consider a database system for a baseball organization such as the major
leagues. The data requirements are summarized as follows:
■ The personnel involved in the league include players, coaches, managers,
and umpires. Each is identified by a unique personnel id. They are also
described by their first and last names along with the date and place of
birth.
■ Players are further described by other attributes such as their batting ori-
entation (left, right, or switch) and have a lifetime batting average (BA).
■ Within the players group is a subset of players called pitchers. Pitchers
have a lifetime ERA (earned run average) associated with them.
■ Teams are uniquely identified by their names. Teams are also described by
the city in which they are located and the division and league in which
they play (such as Central division of the American League).
■ Teams have one manager, a number of coaches, and a number of players.
■ Games are played between two teams, with one designated as the home
team and the other the visiting team on a particular date. The score (runs,
hits, and errors) is recorded for each team. The team with the most runs is
declared the winner of the game.
■ With each finished game, a winning pitcher and a losing pitcher are
recorded. In case there is a save awarded, the save pitcher is also recorded.
■ With each finished game, the number of hits (singles, doubles, triples, and
home runs) obtained by each player is also recorded.
Design an enhanced entity–relationship diagram for the BASEBALL data-
base and enter the design using a data modeling tool such as ERwin or
Rational Rose.
4.31. Consider the EER diagram for the UNIVERSITY database shown in Figure 4.9.
Enter this design using a data modeling tool such as ERwin or Rational Rose.
Make a list of the differences in notation between the diagram in the text
and the corresponding equivalent diagrammatic notation you end up using
with the tool.
4.32. Consider the EER diagram for the small AIRPORT database shown in Fig-
ure 4.12. Build this design using a data modeling tool such as ERwin or Rational
Rose. Be careful how you model the category OWNER in this diagram. (Hint:
Consider using CORPORATION_IS_OWNER and PERSON_IS_OWNER as
two distinct relationship types.)
4.33. Consider the UNIVERSITY database described in Exercise 3.16. You already
developed an ER schema for this database using a data modeling tool such as
ERwin or Rational Rose in Lab Exercise 3.31. Modify this diagram by clas-
sifying COURSES as either UNDERGRAD_COURSES or GRAD_COURSES
and INSTRUCTORS as either JUNIOR_PROFESSORS or SENIOR_PROFESSORS.
Include appropriate attributes for these new entity types. Then establish
relationships indicating that junior instructors teach undergraduate courses
whereas senior instructors teach graduate courses.
Selected Bibliography
Many papers have proposed conceptual or semantic data models. We give a repre-
sentative list here. One group of papers, including Abrial (1974), Senko’s DIAM
model (1975), the NIAM method (Verheijen and VanBekkum 1982), and Bracchi
et al. (1976), presents semantic models that are based on the concept of binary rela-
tionships. Another group of early papers discusses methods for extending the rela-
tional model to enhance its modeling capabilities. This includes the papers by
Schmid and Swenson (1975), Navathe and Schkolnick (1978), Codd’s RM/T model
(1979), Furtado (1978), and the structural model of Wiederhold and Elmasri (1979).
The ER model was proposed originally by Chen (1976) and is formalized in Ng
(1981). Since then, numerous extensions of its modeling capabilities have been pro-
posed, as in Scheuermann et al. (1979), Dos Santos et al. (1979), Teorey et al. (1986),
Gogolla and Hohenstein (1991), and the entity–category–relationship (ECR) model
of Elmasri et al. (1985). Smith and Smith (1977) present the concepts of generaliza-
tion and aggregation. The semantic data model of Hammer and McLeod (1981)
introduces the concepts of class/subclass lattices, as well as other advanced model-
ing concepts.
A survey of semantic data modeling appears in Hull and King (1987). Eick (1991)
discusses design and transformations of conceptual schemas. Analysis of con-
straints for n-ary relationships is given in Soutou (1998). UML is described in detail
in Booch, Rumbaugh, and Jacobson (1999). Fowler and Scott (2000) and Stevens
and Pooley (2000) give concise introductions to UML concepts.
Fensel (2000, 2003) discusses the Semantic Web and application of ontologies.
Uschold and Gruninger (1996) and Gruber (1995) discuss ontologies. The June
2002 issue of Communications of the ACM is devoted to ontology concepts and
applications. Fensel (2003) discusses ontologies and e-commerce.
Part 3 The Relational Data Model and SQL
Chapter 5 The Relational Data Model and Relational Database Constraints
This chapter opens Part 3 of the book, which covers
relational databases. The relational data model was
first introduced by Ted Codd of IBM Research in 1970 in a classic paper (Codd,
1970), and it attracted immediate attention due to its simplicity and mathematical
foundation. The model uses the concept of a mathematical relation—which looks
somewhat like a table of values—as its basic building block, and has its theoretical
basis in set theory and first-order predicate logic. In this chapter we discuss the
basic characteristics of the model and its constraints.
The first commercial implementations of the relational model became available in
the early 1980s, such as the SQL/DS system on the MVS operating system by IBM
and the Oracle DBMS. Since then, the model has been implemented in a large num-
ber of commercial systems, as well as a number of open source systems. Current
popular commercial relational DBMSs (RDBMSs) include DB2 (from IBM), Oracle
(from Oracle), Sybase DBMS (now from SAP), and SQL Server and Microsoft
Access (from Microsoft). In addition, several open source systems, such as MySQL
and PostgreSQL, are available.
Because of the importance of the relational model, all of Part 3 is devoted to this
model and some of the languages associated with it. In Chapters 6 and 7, we describe
some aspects of SQL, which is a comprehensive model and language that is the
standard for commercial relational DBMSs. (Additional aspects of SQL will be cov-
ered in other chapters.) Chapter 8 covers the operations of the relational algebra and
introduces the relational calculus—these are two formal languages associated with
the relational model. The relational calculus is considered to be the basis for the
SQL language, and the relational algebra is used in the internals of many database
implementations for query processing and optimization (see Part 8 of the book).
Other features of the relational model are presented in subsequent parts of the
book. Chapter 9 relates the relational model data structures to the constructs of the
ER and EER models (presented in Chapters 3 and 4), and presents algorithms for
designing a relational database schema by mapping a conceptual schema in the ER
or EER model into a relational representation. These mappings are incorporated
into many database design and CASE1 tools. Chapters 10 and 11 in Part 4 discuss
the programming techniques used to access database systems and the notion of
connecting to relational databases via ODBC and JDBC standard protocols. We
also introduce the topic of Web database programming in Chapter 11. Chapters 14
and 15 in Part 6 present another aspect of the relational model, namely the formal
constraints of functional and multivalued dependencies; these dependencies are
used to develop a relational database design theory based on the concept known as
normalization.
In this chapter, we concentrate on describing the basic principles of the relational
model of data. We begin by defining the modeling concepts and notation of the
relational model in Section 5.1. Section 5.2 is devoted to a discussion of relational
constraints that are considered an important part of the relational model and are
automatically enforced in most relational DBMSs. Section 5.3 defines the update
operations of the relational model, discusses how violations of integrity constraints
are handled, and introduces the concept of a transaction. Section 5.4 summarizes
the chapter.
This chapter and Chapter 8 focus on the formal foundations of the relational model,
whereas Chapters 6 and 7 focus on the SQL practical relational model, which is the
basis of most commercial and open source relational DBMSs. Many concepts are
common between the formal and practical models, but a few differences exist that
we shall point out.
5.1 Relational Model Concepts
The relational model represents the database as a collection of relations. Informally,
each relation resembles a table of values or, to some extent, a flat file of records. It is
called a flat file because each record has a simple linear or flat structure. For exam-
ple, the database of files that was shown in Figure 1.2 is similar to the basic rela-
tional model representation. However, there are important differences between
relations and files, as we shall soon see.
When a relation is thought of as a table of values, each row in the table represents a
collection of related data values. A row represents a fact that typically corresponds
to a real-world entity or relationship. The table name and column names are used
to help to interpret the meaning of the values in each row. For example, the
first table of Figure 1.2 is called STUDENT because each row represents facts
about a particular student entity. The column names (Name, Student_number,
Class, and Major) specify how to interpret the data values in each row, based on the
column each value is in. All values in a column are of the same data type.
1CASE stands for computer-aided software engineering.
In the formal relational model terminology, a row is called a tuple, a column
header is called an attribute, and the table is called a relation. The data type
describing the types of values that can appear in each column is represented by a
domain of possible values. We now define these terms—domain, tuple, attribute,
and relation—formally.
5.1.1 Domains, Attributes, Tuples, and Relations
A domain D is a set of atomic values. By atomic we mean that each value in the
domain is indivisible as far as the formal relational model is concerned. A common
method of specifying a domain is to specify a data type from which the data values
forming the domain are drawn. It is also useful to specify a name for the domain, to
help in interpreting its values. Some examples of domains follow:
■ Usa_phone_numbers. The set of ten-digit phone numbers valid in the United
States.
■ Local_phone_numbers. The set of seven-digit phone numbers valid within a
particular area code in the United States. The use of local phone numbers is
quickly becoming obsolete, being replaced by standard ten-digit numbers.
■ Social_security_numbers. The set of valid nine-digit Social Security numbers.
(This is a unique identifier assigned to each person in the United States for
employment, tax, and benefits purposes.)
■ Names. The set of character strings that represent names of persons.
■ Grade_point_averages. Possible values of computed grade point averages;
each must be a real (floating-point) number between 0 and 4.
■ Employee_ages. Possible ages of employees in a company; each must be an
integer value between 15 and 80.
■ Academic_department_names. The set of academic department names in a
university, such as Computer Science, Economics, and Physics.
■ Academic_department_codes. The set of academic department codes, such as
‘CS’, ‘ECON’, and ‘PHYS’.
The preceding are called logical definitions of domains. A data type or format is
also specified for each domain. For example, the data type for the domain
Usa_phone_numbers can be declared as a character string of the form (ddd)ddd-dddd,
where each d is a numeric (decimal) digit and the first three digits form a valid
telephone area code. The data type for Employee_ages is an integer number between
15 and 80. For Academic_department_names, the data type is the set of all character
strings that represent valid department names. A domain is thus given a name, data
type, and format. Additional information for interpreting the values of a domain
can also be given; for example, a numeric domain such as Person_weights should
have the units of measurement, such as pounds or kilograms.
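A minimal Python sketch of the idea that a domain is a named set of atomic values with a data type and format. The dictionary DOMAINS and the helper in_domain are illustrative names that do not come from the text, and the phone-number test checks only the (ddd)ddd-dddd format, not whether the area code is valid.

```python
import re

# Each domain is modeled as a name plus a membership test over atomic values.
# DOMAINS and in_domain are hypothetical names used only for this sketch.
DOMAINS = {
    # Format check only: (ddd)ddd-dddd. Area-code validity is not checked.
    "Usa_phone_numbers": lambda v: isinstance(v, str)
        and re.fullmatch(r"\(\d{3}\)\d{3}-\d{4}", v) is not None,
    # Integer between 15 and 80, as in the text.
    "Employee_ages": lambda v: isinstance(v, int) and 15 <= v <= 80,
    # Real (floating-point) number between 0 and 4, as in the text.
    "Grade_point_averages": lambda v: isinstance(v, float) and 0.0 <= v <= 4.0,
}

def in_domain(domain_name, value):
    """Return True if value belongs to the named domain."""
    return DOMAINS[domain_name](value)
```

For instance, in_domain("Employee_ages", 19) holds, whereas in_domain("Usa_phone_numbers", "373-1616") does not, because a seven-digit local number is outside that domain.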
A relation schema2 R, denoted by R(A1, A2, … , An), is made up of a relation name
R and a list of attributes, A1, A2, … , An. Each attribute Ai is the name of a role
played by some domain D in the relation schema R. D is called the domain of Ai
and is denoted by dom(Ai). A relation schema is used to describe a relation; R is
called the name of this relation. The degree (or arity) of a relation is the number of
attributes n of its relation schema.
A relation of degree seven, which stores information about university students,
would contain seven attributes describing each student as follows:
STUDENT(Name, Ssn, Home_phone, Address, Office_phone, Age, Gpa)
Using the data type of each attribute, the definition is sometimes written as:
STUDENT(Name: string, Ssn: string, Home_phone: string, Address: string,
Office_phone: string, Age: integer, Gpa: real)
For this relation schema, STUDENT is the name of the relation, which has seven
attributes. In the preceding definition, we showed assignment of generic types such
as string or integer to the attributes. More precisely, we can specify the following
previously defined domains for some of the attributes of the STUDENT relation:
dom(Name) = Names; dom(Ssn) = Social_security_numbers; dom(Home_phone) =
Usa_phone_numbers3; dom(Office_phone) = Usa_phone_numbers; and dom(Gpa) =
Grade_point_averages. It is also possible to refer to attributes of a relation schema by
their position within the relation; thus, the second attribute of the STUDENT rela-
tion is Ssn, whereas the fourth attribute is Address.
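The STUDENT schema can be sketched in Python as an ordered list of (attribute, type) pairs, using the generic types given above. The variable names are illustrative only. Note that Python positions start at 0, so the second attribute is at index 1.

```python
# STUDENT relation schema as an ordered list of (attribute name, generic type) pairs.
STUDENT = [
    ("Name", "string"), ("Ssn", "string"), ("Home_phone", "string"),
    ("Address", "string"), ("Office_phone", "string"),
    ("Age", "integer"), ("Gpa", "real"),
]

degree = len(STUDENT)            # degree (arity) = number of attributes
second_attribute = STUDENT[1][0]  # positional reference to the second attribute
fourth_attribute = STUDENT[3][0]  # positional reference to the fourth attribute
```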
A relation (or relation state)4 r of the relation schema R(A1, A2, … , An), also denoted
by r(R), is a set of n-tuples r = {t1, t2, … , tm}. Each n-tuple t is an ordered list of n
values t = <v1, v2, … , vn>, where each value vi, 1 ≤ i ≤ n, is an element of dom(Ai) or is
a special NULL value. (NULL values are discussed further below and in Section 5.1.2.)
The ith value in tuple t, which corresponds to the attribute Ai, is referred to as t[Ai] or
t.Ai (or t[i] if we use the positional notation). The terms relation intension for the
schema R and relation extension for a relation state r(R) are also commonly used.
Figure 5.1 shows an example of a STUDENT relation, which corresponds to the
STUDENT schema just specified. Each tuple in the relation represents a particular
student entity (or object). We display the relation as a table, where each tuple is
shown as a row and each attribute corresponds to a column header indicating a role
or interpretation of the values in that column. NULL values represent attributes
whose values are unknown or do not exist for some individual STUDENT tuple.
2A relation schema is sometimes called a relation scheme.
3With the large increase in phone numbers caused by the proliferation of mobile phones, most metropol-
itan areas in the United States now have multiple area codes, so seven-digit local dialing has been
discontinued in most areas. We changed this domain to Usa_phone_numbers instead of Local_phone_
numbers, which would be a more general choice. This illustrates how database requirements can change
over time.
4This has also been called a relation instance. We will not use this term because instance is also used
to refer to a single tuple or row.
The earlier definition of a relation can be restated more formally using set theory
concepts as follows. A relation (or relation state) r(R) is a mathematical relation of
degree n on the domains dom(A1), dom(A2), … , dom(An), which is a subset of the
Cartesian product (denoted by ×) of the domains that define R:
r(R) ⊆ (dom(A1) × dom(A2) × . . . × dom(An))
The Cartesian product specifies all possible combinations of values from the under-
lying domains. Hence, if we denote the total number of values, or cardinality, in a
domain D by |D| (assuming that all domains are finite), the total number of tuples
in the Cartesian product is
|dom(A1)| × |dom(A2)| × . . . × |dom(An)|
This product of cardinalities of all domains represents the total number of possible
instances or tuples that can ever exist in any relation state r(R). Of all these possible
combinations, a relation state at a given time—the current relation state—reflects
only the valid tuples that represent a particular state of the real world. In general, as
the state of the real world changes, so does the relation state, by being transformed
into another relation state. However, the schema R is relatively static and changes
very infrequently—for example, as a result of adding an attribute to represent new
information that was not originally stored in the relation.
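The counting argument can be checked on a small scale. The following sketch uses two tiny, made-up finite domains (they are assumptions for illustration, not domains from the text) and Python's itertools.product to enumerate the Cartesian product.

```python
from itertools import product

# Two tiny, made-up finite domains (illustrative values only).
dom_class = {1, 2}
dom_major = {"CS", "MATH", "PHYS"}

# The Cartesian product lists every possible combination of values.
all_tuples = set(product(dom_class, dom_major))

# Its size is the product of the domain cardinalities: |D1| x |D2|.
product_size = len(dom_class) * len(dom_major)

# A relation state is some subset of the Cartesian product.
r = {(1, "CS"), (2, "MATH")}
is_valid_state = r <= all_tuples
```

Here the product contains 2 × 3 = 6 possible tuples, and the state r holds only two of them.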
It is possible for several attributes to have the same domain. The attribute names indi-
cate different roles, or interpretations, for the domain. For example, in the STUDENT
relation, the same domain Usa_phone_numbers plays the role of Home_phone, referring
to the home phone of a student, and the role of Office_phone, referring to the office
phone of the student. A third possible attribute (not shown) with the same domain
could be Mobile_phone.
5.1.2 Characteristics of Relations
The earlier definition of relations implies certain characteristics that make a rela-
tion different from a file or a table. We now discuss some of these characteristics.
STUDENT (relation name; the column headers are the attributes and the rows are the tuples)
Name            Ssn          Home_phone     Address               Office_phone   Age  Gpa
Benjamin Bayer  305-61-2435  (817)373-1616  2918 Bluebonnet Lane  NULL           19   3.21
Chung-cha Kim   381-62-1245  (817)375-4409  125 Kirby Road        NULL           18   2.89
Dick Davidson   422-11-2320  NULL           3452 Elgin Road       (817)749-1253  25   3.53
Rohan Panchal   489-22-1100  (817)376-9821  265 Lark Lane         (817)749-6492  28   3.93
Barbara Benson  533-69-1238  (817)839-8461  7384 Fontana Lane     NULL           19   3.25
Figure 5.1
The attributes and tuples of a relation STUDENT.
Ordering of Tuples in a Relation. A relation is defined as a set of tuples. Math-
ematically, elements of a set have no order among them; hence, tuples in a relation
do not have any particular order. In other words, a relation is not sensitive to the
ordering of tuples. However, in a file, records are physically stored on disk (or in
memory), so there always is an order among the records. This ordering indicates
first, second, ith, and last records in the file. Similarly, when we display a relation as
a table, the rows are displayed in a certain order.
Tuple ordering is not part of a relation definition because a relation attempts to rep-
resent facts at a logical or abstract level. Many tuple orders can be specified on the
same relation. For example, tuples in the STUDENT relation in Figure 5.1 could be
ordered by values of Name, Ssn, Age, or some other attribute. The definition of a rela-
tion does not specify any order: There is no preference for one ordering over another.
Hence, the relation displayed in Figure 5.2 is considered identical to the one shown in
Figure 5.1. When a relation is implemented as a file or displayed as a table, a particular
ordering may be specified on the records of the file or the rows of the table.
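A short Python sketch of this point, modeling a relation state as a set of tuples. Only the Name and Age values of three tuples from Figure 5.1 are used, and the variable names are illustrative.

```python
# Relation states modeled as Python sets of (Name, Age) tuples from Figure 5.1.
state_a = {("Benjamin Bayer", 19), ("Chung-cha Kim", 18), ("Dick Davidson", 25)}
state_b = {("Dick Davidson", 25), ("Benjamin Bayer", 19), ("Chung-cha Kim", 18)}

# Sets ignore the order in which their elements are listed.
same_relation = state_a == state_b

# An ordering is imposed only when the relation is displayed or stored.
by_name = sorted(state_a)                      # one possible display order
by_age = sorted(state_a, key=lambda t: t[1])   # another display order
```

Both sorted lists present the same relation; neither order is part of its definition.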
Ordering of Values within a Tuple and an Alternative Definition of a Relation.
According to the preceding definition of a relation, an n-tuple is an ordered list of n
values, so the ordering of values in a tuple—and hence of attributes in a relation
schema—is important. However, at a more abstract level, the order of attributes
and their values is not that important as long as the correspondence between attri-
butes and values is maintained.
An alternative definition of a relation can be given, making the ordering of values
in a tuple unnecessary. In this definition, a relation schema R = {A1, A2, … , An} is a
set of attributes (instead of an ordered list of attributes), and a relation state r(R) is
a finite set of mappings r = {t1, t2, … , tm}, where each tuple ti is a mapping from R
to D, and D is the union (denoted by ∪) of the attribute domains; that is, D =
dom(A1) ∪ dom(A2) ∪ … ∪ dom(An). In this definition, t[Ai] must be in dom(Ai)
for 1 ≤ i ≤ n for each mapping t in r. Each mapping ti is called a tuple.
According to this definition of tuple as a mapping, a tuple can be considered as a
set of (<attribute>, <value>) pairs, where each pair gives the value of the mapping
from an attribute Ai to a value vi from dom(Ai). The ordering of attributes is not
important, because the attribute name appears with its value. By this definition, the
two tuples shown in Figure 5.3 are identical. This makes sense at an abstract level,
since there really is no reason to prefer having one attribute value appear before
another in a tuple. When the attribute name and value are included together in a
tuple, it is known as self-describing data, because the description of each value
(attribute name) is included in the tuple.
STUDENT
Name            Ssn          Home_phone     Address               Office_phone   Age  Gpa
Dick Davidson   422-11-2320  NULL           3452 Elgin Road       (817)749-1253  25   3.53
Barbara Benson  533-69-1238  (817)839-8461  7384 Fontana Lane     NULL           19   3.25
Rohan Panchal   489-22-1100  (817)376-9821  265 Lark Lane         (817)749-6492  28   3.93
Chung-cha Kim   381-62-1245  (817)375-4409  125 Kirby Road        NULL           18   2.89
Benjamin Bayer  305-61-2435  (817)373-1616  2918 Bluebonnet Lane  NULL           19   3.21
Figure 5.2
The relation STUDENT from Figure 5.1 with a different order of tuples.
We will mostly use the first definition of relation, where the attributes are ordered
in the relation schema and the values within tuples are similarly ordered, because it
simplifies much of the notation. However, the alternative definition given here is
more general.5
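The alternative definition maps naturally onto a Python dictionary. The sketch below writes the Dick Davidson tuple from Figure 5.1 with its attributes in two different orders. Using None to stand in for NULL is an assumption of this sketch.

```python
# A tuple under the alternative definition: a mapping from attribute names to values.
# None stands in for NULL here; that choice is an assumption of this sketch.
t1 = {"Name": "Dick Davidson", "Ssn": "422-11-2320", "Home_phone": None,
      "Address": "3452 Elgin Road", "Office_phone": "(817)749-1253",
      "Age": 25, "Gpa": 3.53}
t2 = {"Address": "3452 Elgin Road", "Name": "Dick Davidson", "Ssn": "422-11-2320",
      "Age": 25, "Office_phone": "(817)749-1253", "Gpa": 3.53, "Home_phone": None}

# Dictionary equality ignores the order in which the pairs were written.
identical = t1 == t2

# The same tuple viewed as a set of (attribute, value) pairs.
pairs1 = frozenset(t1.items())
pairs2 = frozenset(t2.items())
```

Because every value travels with its attribute name, the two orderings describe the same tuple.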
Values and NULLs in the Tuples. Each value in a tuple is an atomic value; that
is, it is not divisible into components within the framework of the basic relational
model. Hence, composite and multivalued attributes (see Chapter 3) are not
allowed. This model is sometimes called the flat relational model. Much of the
theory behind the relational model was developed with this assumption in mind,
which is called the first normal form assumption.6 Hence, multivalued attributes
must be represented by separate relations, and composite attributes are represented
only by their simple component attributes in the basic relational model.7
An important concept is that of NULL values, which are used to represent the values of
attributes that may be unknown or may not apply to a tuple. A special value, called
NULL, is used in these cases. For example, in Figure 5.1, some STUDENT tuples have
NULL for their office phones because they do not have an office (that is, office phone
does not apply to these students). Another student has a NULL for home phone, presum-
ably because either he does not have a home phone or he has one but we do not know it
(value is unknown). In general, we can have several meanings for NULL values, such as
value unknown, value exists but is not available, or attribute does not apply to this tuple
(also known as value undefined). An example of the last type of NULL will occur if we
add an attribute Visa_status to the STUDENT relation that applies only to tuples representing
foreign students. It is possible to devise different codes for different meanings of
NULL values. Incorporating different types of NULL values into relational model operations
has proven difficult and is outside the scope of our presentation.
t = <(Name, Dick Davidson), (Ssn, 422-11-2320), (Home_phone, NULL), (Address, 3452 Elgin Road), (Office_phone, (817)749-1253), (Age, 25), (Gpa, 3.53)>
t = <(Address, 3452 Elgin Road), (Name, Dick Davidson), (Ssn, 422-11-2320), (Age, 25), (Office_phone, (817)749-1253), (Gpa, 3.53), (Home_phone, NULL)>
Figure 5.3
Two identical tuples when the order of attributes and values is not part of relation definition.
5We will use the alternative definition of relation when we discuss query processing and optimization in
Chapter 18.
6We discuss this assumption in more detail in Chapter 14.
7Extensions of the relational model remove these restrictions. For example, object-relational systems
(Chapter 12) allow complex-structured attributes, as do the non-first normal form or nested relational
models.
The exact meaning of a NULL value governs how it fares during arithmetic aggrega-
tions or comparisons with other values. For example, a comparison of two NULL
values leads to ambiguities—if both Customer A and B have NULL addresses, it does
not mean they have the same address. During database design, it is best to avoid
NULL values as much as possible. We will discuss this further in Chapters 7 and 8 in
the context of operations and queries, and in Chapter 14 in the context of database
design and normalization.
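One concrete illustration of how NULLs behave in comparisons and aggregations, using SQLite through Python's built-in sqlite3 module. The choice of SQLite and the table t with its ages are assumptions made for this sketch. Other SQL systems follow the same general rules, which are discussed in later chapters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Comparing two NULLs yields NULL (unknown), not true.
null_equals_null = cur.execute("SELECT NULL = NULL").fetchone()[0]

# Testing whether a value is NULL is a separate operation.
null_is_null = cur.execute("SELECT NULL IS NULL").fetchone()[0]

# Aggregates such as AVG skip NULLs. The ages are made up for illustration.
cur.execute("CREATE TABLE t (age INTEGER)")
cur.executemany("INSERT INTO t VALUES (?)", [(20,), (None,), (30,)])
avg_age = cur.execute("SELECT AVG(age) FROM t").fetchone()[0]
conn.close()
```

The comparison comes back as None (SQL NULL), so two customers with NULL addresses are not reported as having the same address, and the average is computed over the two known ages only.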
Interpretation (Meaning) of a Relation. The relation schema can be interpreted
as a declaration or a type of assertion. For example, the schema of the STUDENT
relation of Figure 5.1 asserts that, in general, a student entity has a Name, Ssn,
Home_phone, Address, Office_phone, Age, and Gpa. Each tuple in the relation can
then be interpreted as a fact or a particular instance of the assertion. For example,
the first tuple in Figure 5.1 asserts the fact that there is a STUDENT whose Name is
Benjamin Bayer, Ssn is 305-61-2435, Age is 19, and so on.
Notice that some relations may represent facts about entities, whereas other rela-
tions may represent facts about relationships. For example, a relation schema
MAJORS (Student_ssn, Department_code) asserts that students major in academic
disciplines. A tuple in this relation relates a student to his or her major discipline.
Hence, the relational model represents facts about both entities and relationships
uniformly as relations. This sometimes compromises understandability because
one has to guess whether a relation represents an entity type or a relationship type.
The entity–relationship (ER) model, together with its entity and relationship
concepts, was described in detail in Chapter 3. The mapping procedures
in Chapter 9 show how different constructs of the ER/EER conceptual data models
(see Part 2) get converted to relations.
An alternative interpretation of a relation schema is as a predicate; in this case, the
values in each tuple are interpreted as values that satisfy the predicate. For example,
the predicate STUDENT (Name, Ssn, …) is true for the five tuples in relation STUDENT
of Figure 5.1. These tuples represent five different propositions or facts in the
real world. This interpretation is quite useful in the context of logical programming
languages, such as Prolog, because it allows the relational model to be used within
these languages (see Section 26.5). An assumption called the closed world assumption
states that the only true facts in the universe are those present within the extension
(state) of the relation(s). Any other combination of values makes the predicate false.
This interpretation is useful when we consider queries on relations based on
relational calculus in Section 8.6.
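The predicate interpretation and the closed world assumption can be sketched as a membership test. The MAJORS state below pairs two Ssn values from Figure 5.1 with department codes that are made up for illustration, and majors_predicate is a hypothetical helper name.

```python
# Extension (state) of a MAJORS(Student_ssn, Department_code) relation.
# The department codes assigned here are made up for illustration.
majors = {("305-61-2435", "CS"), ("381-62-1245", "ECON")}

def majors_predicate(student_ssn, department_code):
    """True exactly for the combinations present in the relation state.

    Under the closed world assumption, every combination that is absent
    from the state makes the predicate false.
    """
    return (student_ssn, department_code) in majors
```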
5.1.3 Relational Model Notation
We will use the following notation in our presentation:
■ A relation schema R of degree n is denoted by R(A1, A2, … , An).
■ The uppercase letters Q, R, S denote relation names.
■ The lowercase letters q, r, s denote relation states.
■ The letters t, u, v denote tuples.
■ In general, the name of a relation schema such as STUDENT also indicates
the current set of tuples in that relation—the current relation state—whereas
STUDENT(Name, Ssn, …) refers only to the relation schema.
■ An attribute A can be qualified with the relation name R to which it belongs
by using the dot notation R.A—for example, STUDENT.Name or
STUDENT.Age. This is because the same name may be used for two attri-
butes in different relations. However, all attribute names in a particular
relation must be distinct.
■ An n-tuple t in a relation r(R) is denoted by t = <v1, v2, … , vn>, where vi is
the value corresponding to attribute Ai. The following notation refers to
component values of tuples:
- Both t[Ai] and t.Ai (and sometimes t[i]) refer to the value vi in t for attribute Ai.
- Both t[Au, Aw, … , Az] and t.(Au, Aw, … , Az), where Au, Aw, … , Az is a list
of attributes from R, refer to the subtuple of values <vu, vw, … , vz> from t
corresponding to the attributes specified in the list.
As an example, consider the tuple t = <‘Barbara Benson’, ‘533-69-1238’,
‘(817)839-8461’, ‘7384 Fontana Lane’, NULL, 19, 3.25> from the STUDENT relation in Figure 5.1;
we have t[Name] = <‘Barbara Benson’>, and t[Ssn, Gpa, Age] = <‘533-69-1238’,
3.25, 19>.
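The subtuple notation can be mimicked in Python with a small helper. The name subtuple and the use of None for NULL are assumptions of this sketch.

```python
STUDENT_ATTRS = ("Name", "Ssn", "Home_phone", "Address", "Office_phone", "Age", "Gpa")

# The Barbara Benson tuple from Figure 5.1; None stands in for NULL.
t = ("Barbara Benson", "533-69-1238", "(817)839-8461",
     "7384 Fontana Lane", None, 19, 3.25)

def subtuple(attrs, tup, wanted):
    """Return the values of tup for the attributes in wanted, in the order requested."""
    return tuple(tup[attrs.index(a)] for a in wanted)
```

With these definitions, subtuple(STUDENT_ATTRS, t, ["Ssn", "Gpa", "Age"]) plays the role of t[Ssn, Gpa, Age] and returns the values in the order in which the attributes are listed.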
5.2 Relational Model Constraints
and Relational Database Schemas
So far, we have discussed the characteristics of single relations. In a relational data-
base, there will typically be many relations, and the tuples in those relations are
usually related in various ways. The state of the whole database will correspond to
the states of all its relations at a particular point in time. There are generally many
restrictions or constraints on the actual values in a database state. These constraints
are derived from the rules in the miniworld that the database represents, as we dis-
cussed in Section 1.6.8.
In this section, we discuss the various restrictions on data that can be specified on a
relational database in the form of constraints. Constraints on databases can gener-
ally be divided into three main categories:
1. Constraints that are inherent in the data model. We call these inherent
model-based constraints or implicit constraints.
2. Constraints that can be directly expressed in the schemas of the data model, typi-
cally by specifying them in the DDL (data definition language, see Section 2.3.1).
We call these schema-based constraints or explicit constraints.
3. Constraints that cannot be directly expressed in the schemas of the data
model, and hence must be expressed and enforced by the application pro-
grams or in some other way. We call these application-based or semantic
constraints or business rules.
The characteristics of relations that we discussed in Section 5.1.2 are the inherent
constraints of the relational model and belong to the first category. For example, the
constraint that a relation cannot have duplicate tuples is an inherent constraint. The
constraints we discuss in this section are of the second category, namely, constraints
that can be expressed in the schema of the relational model via the DDL. Constraints
in the third category are more general, relate to the meaning as well as behavior of
attributes, and are difficult to express and enforce within the data model, so they are
usually checked within the application programs that perform database updates. In
some cases, these constraints can be specified as assertions in SQL (see Chapter 7).
Another important category of constraints is data dependencies, which include
functional dependencies and multivalued dependencies. They are used mainly for
testing the “goodness” of the design of a relational database and are utilized in a
process called normalization, which is discussed in Chapters 14 and 15.
The schema-based constraints include domain constraints, key constraints, con-
straints on NULLs, entity integrity constraints, and referential integrity constraints.
5.2.1 Domain Constraints
Domain constraints specify that within each tuple, the value of each attribute A must
be an atomic value from the domain dom(A). We have already discussed the ways in
which domains can be specified in Section 5.1.1. The data types associated with
domains typically include standard numeric data types for integers (such as short
integer, integer, and long integer) and real numbers (float and double-precision float).
Characters, Booleans, fixed-length strings, and variable-length strings are also avail-
able, as are date, time, timestamp, and other special data types. Domains can also be
described by a subrange of values from a data type or as an enumerated data type in
which all possible values are explicitly listed. Rather than describe these in detail here,
we discuss the data types offered by the SQL relational standard in Section 6.1.
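As a rough illustration of a domain given by a subrange of a data type, the sketch below declares a CHECK clause in SQLite through Python's built-in sqlite3 module. The employee table and its rows are hypothetical. SQLite treats a declared type such as INTEGER loosely, so the CHECK clause is what enforces the subrange here. The SQL data types themselves are covered in Section 6.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical EMPLOYEE table; the CHECK clause narrows age to the subrange 15..80.
conn.execute(
    "CREATE TABLE employee (name TEXT, age INTEGER CHECK (age BETWEEN 15 AND 80))"
)
conn.execute("INSERT INTO employee VALUES ('A. Example', 30)")  # inside the domain

try:
    conn.execute("INSERT INTO employee VALUES ('B. Example', 12)")  # outside the domain
    rejected = False
except sqlite3.IntegrityError:
    # SQLite reports a violated CHECK constraint as an IntegrityError.
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
conn.close()
```

The second insertion is refused, so only the tuple whose age lies in the domain is stored.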
5.2.2 Key Constraints and Constraints on NULL Values
In the formal relational model, a relation is defined as a set of tuples. By definition,
all elements of a set are distinct; hence, all tuples in a relation must also be distinct.
This means that no two tuples can have the same combination of values for all their
attributes. Usually, there are other subsets of attributes of a relation schema R with
the property that no two tuples in any relation state r of R should have the same
combination of values for these attributes. Suppose that we denote one such subset
of attributes by SK; then for any two distinct tuples t1 and t2 in a relation state r of R,
we have the constraint that:
t1[SK] ≠ t2[SK]
Any such set of attributes SK is called a superkey of the relation schema R. A super-
key SK specifies a uniqueness constraint that no two distinct tuples in any state r of
R can have the same value for SK. Every relation has at least one default superkey—
the set of all its attributes. A superkey can have redundant attributes, however, so a
more useful concept is that of a key, which has no redundancy. A key K of a relation
schema R is a superkey of R with the additional property that removing any attribute
A from K leaves a set of attributes K′ that is no longer a superkey of R.
Hence, a key satisfies two properties:
1. Two distinct tuples in any state of the relation cannot have identical values
for (all) the attributes in the key. This uniqueness property also applies to a
superkey.
2. It is a minimal superkey—that is, a superkey from which we cannot remove
any attributes and still have the uniqueness constraint hold. This minimality
property is required for a key but is optional for a superkey.
Hence, a key is a superkey but not vice versa. A superkey may be a key (if it is mini-
mal) or may not be a key (if it is not minimal). Consider the STUDENT relation of
Figure 5.1. The attribute set {Ssn} is a key of STUDENT because no two student
tuples can have the same value for Ssn.8 Any set of attributes that includes Ssn—for
example, {Ssn, Name, Age}—is a superkey. However, the superkey {Ssn, Name, Age}
is not a key of STUDENT because removing Name or Age or both from the set still
leaves us with a superkey. In general, any superkey formed from a single attribute is
also a key. A key with multiple attributes must require all its attributes together to
have the uniqueness property.
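The two properties above can be tested mechanically against a given relation state. The following is a minimal sketch in Python; the function names and the three sample tuples are hypothetical and are not taken from Figure 5.1. Keep in mind that a key is a property of the schema, so a check on one state can only refute a proposed key, never establish it.

```python
def is_superkey(state, attrs):
    """True if no two tuples in `state` agree on every attribute in `attrs`."""
    seen = set()
    for t in state:
        projection = tuple(t[a] for a in attrs)
        if projection in seen:
            return False
        seen.add(projection)
    return True


def is_key(state, attrs):
    """True if `attrs` is a superkey and dropping any one attribute breaks uniqueness."""
    if not is_superkey(state, attrs):
        return False
    for a in attrs:
        rest = [b for b in attrs if b != a]
        # A single-attribute superkey has nothing left to remove, so skip the empty set.
        if rest and is_superkey(state, rest):
            return False
    return True


# Hypothetical STUDENT state: two students share a name.
students = [
    {"Ssn": "111-11-1111", "Name": "Pat Lee", "Age": 19},
    {"Ssn": "222-22-2222", "Name": "Pat Lee", "Age": 21},
    {"Ssn": "333-33-3333", "Name": "Sam Roy", "Age": 19},
]

ssn_is_key = is_key(students, ["Ssn"])
wide_is_superkey = is_superkey(students, ["Ssn", "Name", "Age"])
wide_is_key = is_key(students, ["Ssn", "Name", "Age"])
name_is_superkey = is_superkey(students, ["Name"])
```

In this sample state, {Name, Age} also happens to pass `is_key`, which illustrates why keys must be determined from the meaning of the attributes and not from the data currently stored.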
The value of a key attribute can be used to identify uniquely each tuple in the rela-
tion. For example, the Ssn value 305-61-2435 identifies uniquely the tuple corre-
sponding to Benjamin Bayer in the STUDENT relation. Notice that a set of attributes
constituting a key is a property of the relation schema; it is a constraint that should
hold on every valid relation state of the schema. A key is determined from the mean-
ing of the attributes, and the property is time-invariant: It must continue to hold
when we insert new tuples in the relation. For example, we cannot and should not
designate the Name attribute of the STUDENT relation in Figure 5.1 as a key because
it is possible that two students with identical names will exist at some point in a
valid state.9
In general, a relation schema may have more than one key. In this case, each of the
keys is called a candidate key. For example, the CAR relation in Figure 5.4 has two
candidate keys: License_number and Engine_serial_number. It is common to designate
one of the candidate keys as the primary key of the relation. This is the candidate
key whose values are used to identify tuples in the relation. We use the convention
that the attributes that form the primary key of a relation schema are underlined, as
shown in Figure 5.4. Notice that when a relation schema has several candidate keys,
8Note that Ssn is also a superkey.
9Names are sometimes used as keys, but then some artifact—such as appending an ordinal number—must
be used to distinguish between persons with identical names.
160 Chapter 5 The Relational Data Model and Relational Database Constraints
the choice of one to become the primary key is somewhat arbitrary; however, it is
usually better to choose a primary key with a single attribute or a small number
of attributes. The other candidate keys are designated as unique keys and are
not underlined.
Another constraint on attributes specifies whether NULL values are or are not per-
mitted. For example, if every STUDENT tuple must have a valid, non-NULL value for
the Name attribute, then Name of STUDENT is constrained to be NOT NULL.
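As a rough illustration of how a primary key, a second candidate key, and NOT NULL constraints might be declared, the sketch below uses Python's built-in sqlite3 module with the CAR relation of Figure 5.4. The column types, the helper name try_insert, and the choice to make Engine_serial_number NOT NULL are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CAR (
        License_number       TEXT PRIMARY KEY NOT NULL,  -- chosen primary key
        Engine_serial_number TEXT NOT NULL UNIQUE,       -- other candidate key (unique key)
        Make  TEXT,
        Model TEXT,
        Year  TEXT
    )
""")
conn.execute(
    "INSERT INTO CAR VALUES ('Texas ABC-739', 'A69352', 'Ford', 'Mustang', '02')"
)


def try_insert(row):
    """Attempt an insertion and report whether the DBMS accepted it."""
    try:
        conn.execute("INSERT INTO CAR VALUES (?, ?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    # Repeats an existing Engine_serial_number value.
    try_insert(("Florida TVP-347", "A69352", "Oldsmobile", "Cutlass", "05")),
    # Repeats an existing License_number value.
    try_insert(("Texas ABC-739", "B43696", "Oldsmobile", "Cutlass", "05")),
    # Both candidate key values are new.
    try_insert(("Florida TVP-347", "B43696", "Oldsmobile", "Cutlass", "05")),
    # NULL in a NOT NULL attribute.
    try_insert(("New York MPO-22", None, "Oldsmobile", "Delta", "01")),
]
```

Only the third insertion is accepted; the others each violate one of the declared constraints.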
5.2.3 Relational Databases and Relational
Database Schemas
The definitions and constraints we have discussed so far apply to single relations
and their attributes. A relational database usually contains many relations, with
tuples in relations that are related in various ways. In this section, we define a rela-
tional database and a relational database schema.
A relational database schema S is a set of relation schemas S = {R1, R2, … , Rm} and
a set of integrity constraints IC. A relational database state10 DB of S is a set of
relation states DB = {r1, r2, … , rm} such that each ri is a state of Ri and such that the
ri relation states satisfy the integrity constraints specified in IC. Figure 5.5 shows a
relational database schema that we call COMPANY = {EMPLOYEE, DEPARTMENT,
DEPT_LOCATIONS, PROJECT, WORKS_ON, DEPENDENT}. In each relation schema,
the underlined attribute represents the primary key. Figure 5.6 shows a relational
database state corresponding to the COMPANY schema. We will use this schema
and database state in this chapter and in Chapters 6 through 8 for developing
sample queries in different relational languages. (The data shown here is
expanded and available for loading as a populated database from the Compan-
ion Website for the text, and can be used for the hands-on project exercises at
the end of the chapters.)
When we refer to a relational database, we implicitly include both its schema and its
current state. A database state that does not obey all the integrity constraints is
CAR
License_number     | Engine_serial_number | Make       | Model   | Year
Texas ABC-739      | A69352               | Ford       | Mustang | 02
Florida TVP-347    | B43696               | Oldsmobile | Cutlass | 05
New York MPO-22    | X83554               | Oldsmobile | Delta   | 01
California 432-TFY | C43742               | Mercedes   | 190-D   | 99
California RSK-629 | Y82935               | Toyota     | Camry   | 04
Texas RSK-629      | U028365              | Jaguar     | XJS     | 04
Figure 5.4
The CAR relation, with two candidate keys: License_number and Engine_serial_number.
10A relational database state is sometimes called a relational database snapshot or instance. However,
as we mentioned earlier, we will not use the term instance since it also applies to single tuples.
called not valid, and a state that satisfies all the constraints in the defined set of
integrity constraints IC is called a valid state.
In Figure 5.5, the Dnumber attribute in both DEPARTMENT and DEPT_LOCATIONS
stands for the same real-world concept—the number given to a department. That
same concept is called Dno in EMPLOYEE and Dnum in PROJECT. Attributes that
represent the same real-world concept may or may not have identical names in dif-
ferent relations. Alternatively, attributes that represent different concepts may have
the same name in different relations. For example, we could have used the attribute
name Name for both Pname of PROJECT and Dname of DEPARTMENT; in this case, we
would have two attributes that share the same name but represent different real-
world concepts—project names and department names.
In some early versions of the relational model, an assumption was made that the
same real-world concept, when represented by an attribute, would have identical
attribute names in all relations. This creates problems when the same real-world
concept is used in different roles (meanings) in the same relation. For example, the
concept of Social Security number appears twice in the EMPLOYEE relation of
Figure 5.5: once in the role of the employee’s SSN, and once in the role of the
supervisor’s SSN. We are required to give them distinct attribute names—Ssn and
Super_ssn, respectively—because they appear in the same relation and in order to
distinguish their meaning.
Each relational DBMS must have a data definition language (DDL) for defining a
relational database schema. Current relational DBMSs are mostly using SQL for
this purpose. We present the SQL DDL in Sections 6.1 and 6.2.
EMPLOYEE
Fname Minit Lname Ssn Bdate Address Sex Salary Super_ssn Dno
DEPARTMENT
Dname Dnumber Mgr_ssn Mgr_start_date
DEPT_LOCATIONS
Dnumber Dlocation
PROJECT
Pname Pnumber Plocation Dnum
WORKS_ON
Essn Pno Hours
DEPENDENT
Essn Dependent_name Sex Bdate Relationship
Figure 5.5
Schema diagram for the COMPANY relational database schema.
EMPLOYEE
Fname    | Minit | Lname   | Ssn       | Bdate      | Address                  | Sex | Salary | Super_ssn | Dno
John     | B     | Smith   | 123456789 | 1965-01-09 | 731 Fondren, Houston, TX | M   | 30000  | 333445555 | 5
Franklin | T     | Wong    | 333445555 | 1955-12-08 | 638 Voss, Houston, TX    | M   | 40000  | 888665555 | 5
Alicia   | J     | Zelaya  | 999887777 | 1968-01-19 | 3321 Castle, Spring, TX  | F   | 25000  | 987654321 | 4
Jennifer | S     | Wallace | 987654321 | 1941-06-20 | 291 Berry, Bellaire, TX  | F   | 43000  | 888665555 | 4
Ramesh   | K     | Narayan | 666884444 | 1962-09-15 | 975 Fire Oak, Humble, TX | M   | 38000  | 333445555 | 5
Joyce    | A     | English | 453453453 | 1972-07-31 | 5631 Rice, Houston, TX   | F   | 25000  | 333445555 | 5
Ahmad    | V     | Jabbar  | 987987987 | 1969-03-29 | 980 Dallas, Houston, TX  | M   | 25000  | 987654321 | 4
James    | E     | Borg    | 888665555 | 1937-11-10 | 450 Stone, Houston, TX   | M   | 55000  | NULL      | 1

DEPARTMENT
Dname          | Dnumber | Mgr_ssn   | Mgr_start_date
Research       | 5       | 333445555 | 1988-05-22
Administration | 4       | 987654321 | 1995-01-01
Headquarters   | 1       | 888665555 | 1981-06-19

DEPT_LOCATIONS
Dnumber | Dlocation
1       | Houston
4       | Stafford
5       | Bellaire
5       | Sugarland
5       | Houston

WORKS_ON
Essn      | Pno | Hours
123456789 | 1   | 32.5
123456789 | 2   | 7.5
666884444 | 3   | 40.0
453453453 | 1   | 20.0
453453453 | 2   | 20.0
333445555 | 2   | 10.0
333445555 | 3   | 10.0
333445555 | 10  | 10.0
333445555 | 20  | 10.0
999887777 | 30  | 30.0
999887777 | 10  | 10.0
987987987 | 10  | 35.0
987987987 | 30  | 5.0
987654321 | 30  | 20.0
987654321 | 20  | 15.0
888665555 | 20  | NULL

PROJECT
Pname           | Pnumber | Plocation | Dnum
ProductX        | 1       | Bellaire  | 5
ProductY        | 2       | Sugarland | 5
ProductZ        | 3       | Houston   | 5
Computerization | 10      | Stafford  | 4
Reorganization  | 20      | Houston   | 1
Newbenefits     | 30      | Stafford  | 4

DEPENDENT
Essn      | Dependent_name | Sex | Bdate      | Relationship
333445555 | Alice          | F   | 1986-04-05 | Daughter
333445555 | Theodore       | M   | 1983-10-25 | Son
333445555 | Joy            | F   | 1958-05-03 | Spouse
987654321 | Abner          | M   | 1942-02-28 | Spouse
123456789 | Michael        | M   | 1988-01-04 | Son
123456789 | Alice          | F   | 1988-12-30 | Daughter
123456789 | Elizabeth      | F   | 1967-05-05 | Spouse

Figure 5.6
One possible database state for the COMPANY relational database schema.
Integrity constraints are specified on a database schema and are expected to hold on
every valid database state of that schema. In addition to domain, key, and NOT NULL
constraints, two other types of constraints are considered part of the relational
model: entity integrity and referential integrity.
5.2.4 Entity Integrity, Referential Integrity, and Foreign Keys
The entity integrity constraint states that no primary key value can be NULL. This is
because the primary key value is used to identify individual tuples in a relation. Hav-
ing NULL values for the primary key implies that we cannot identify some tuples. For
example, if two or more tuples had NULL for their primary keys, we may not be able
to distinguish them if we try to reference them from other relations.
Key constraints and entity integrity constraints are specified on individual relations.
The referential integrity constraint is specified between two relations and is used to
maintain the consistency among tuples in the two relations. Informally, the referen-
tial integrity constraint states that a tuple in one relation that refers to another rela-
tion must refer to an existing tuple in that relation. For example, in Figure 5.6, the
attribute Dno of EMPLOYEE gives the department number for which each employee
works; hence, its value in every EMPLOYEE tuple must match the Dnumber value of
some tuple in the DEPARTMENT relation.
To define referential integrity more formally, first we define the concept of a foreign
key. The conditions for a foreign key, given below, specify a referential integrity
constraint between the two relation schemas R1 and R2. A set of attributes FK in
relation schema R1 is a foreign key of R1 that references relation R2 if it satisfies the
following rules:
1. The attributes in FK have the same domain(s) as the primary key attributes
PK of R2; the attributes FK are said to reference or refer to the relation R2.
2. A value of FK in a tuple t1 of the current state r1(R1) either occurs as a value
of PK for some tuple t2 in the current state r2(R2) or is NULL. In the former
case, we have t1[FK] = t2[PK], and we say that the tuple t1 references or
refers to the tuple t2.
In this definition, R1 is called the referencing relation and R2 is the referenced
relation. If these two conditions hold, a referential integrity constraint from R1 to
R2 is said to hold. In a database of many relations, there are usually many referential
integrity constraints.
To specify these constraints, first we must have a clear understanding of the mean-
ing or role that each attribute or set of attributes plays in the various relation sche-
mas of the database. Referential integrity constraints typically arise from the
relationships among the entities represented by the relation schemas. For example,
consider the database shown in Figure 5.6. In the EMPLOYEE relation, the attribute
Dno refers to the department for which an employee works; hence, we designate Dno
to be a foreign key of EMPLOYEE referencing the DEPARTMENT relation. This means
that a value of Dno in any tuple t1 of the EMPLOYEE relation must match a value of
the primary key of DEPARTMENT—the Dnumber attribute—in some tuple t2 of the
DEPARTMENT relation, or the value of Dno can be NULL if the employee does not
belong to a department or will be assigned to a department later. For example, in
Figure 5.6 the tuple for employee ‘John Smith’ references the tuple for the ‘Research’
department, indicating that ‘John Smith’ works for this department.
Notice that a foreign key can refer to its own relation. For example, the attribute
Super_ssn in EMPLOYEE refers to the supervisor of an employee; this is another
employee, represented by a tuple in the EMPLOYEE relation. Hence, Super_ssn is a
foreign key that references the EMPLOYEE relation itself. In Figure 5.6 the tuple for
employee ‘John Smith’ references the tuple for employee ‘Franklin Wong,’ indicat-
ing that ‘Franklin Wong’ is the supervisor of ‘John Smith’.
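The two foreign key conditions can be checked directly against relation states. The sketch below is one possible formulation in Python; the function name, the dictionary representation of tuples, and the last EMPLOYEE tuple (which refers to department 7 and to a supervisor absent from this small state) are hypothetical. A foreign key value is treated as NULL only when all of its attributes are None, which is an assumption for composite foreign keys.

```python
def find_dangling_references(r1, fk, r2, pk):
    """Return the tuples of r1 whose FK value is neither NULL nor a PK value in r2."""
    pk_values = {tuple(t2[a] for a in pk) for t2 in r2}
    dangling = []
    for t1 in r1:
        value = tuple(t1[a] for a in fk)
        if all(v is None for v in value):
            continue  # a NULL foreign key satisfies the constraint
        if value not in pk_values:
            dangling.append(t1)
    return dangling


# A small part of the state in Figure 5.6, plus one hypothetical offending tuple.
departments = [{"Dnumber": 5}, {"Dnumber": 4}, {"Dnumber": 1}]
employees = [
    {"Ssn": "888665555", "Super_ssn": None, "Dno": 1},
    {"Ssn": "333445555", "Super_ssn": "888665555", "Dno": 5},
    {"Ssn": "123456789", "Super_ssn": "333445555", "Dno": 5},
    {"Ssn": "677678989", "Super_ssn": "987654321", "Dno": 7},  # hypothetical
]

# EMPLOYEE.Dno references DEPARTMENT.Dnumber.
bad_dno = find_dangling_references(employees, ["Dno"], departments, ["Dnumber"])

# EMPLOYEE.Super_ssn references EMPLOYEE.Ssn: referencing and referenced relation coincide.
bad_super = find_dangling_references(employees, ["Super_ssn"], employees, ["Ssn"])
```

The same function handles a foreign key that refers to its own relation simply by passing the same state as both r1 and r2.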
We can diagrammatically display referential integrity constraints by drawing a directed
arc from each foreign key to the relation it references. For clarity, the arrowhead may
point to the primary key of the referenced relation. Figure 5.7 shows the schema in
Figure 5.5 with the referential integrity constraints displayed in this manner.
All integrity constraints should be specified on the relational database schema (that is,
specified as part of its definition) if we want the DBMS to enforce these constraints on
EMPLOYEE
Fname Minit Lname Ssn Bdate Address Sex Salary Super_ssn Dno
DEPARTMENT
Dname Dnumber Mgr_ssn Mgr_start_date
DEPT_LOCATIONS
Dnumber Dlocation
PROJECT
Pname Pnumber Plocation Dnum
WORKS_ON
Essn Pno Hours
DEPENDENT
Essn Dependent_name Sex Bdate Relationship
Figure 5.7
Referential integrity constraints displayed on the COMPANY relational database schema.
5.3 Update Operations, Transactions, and Dealing with Constraint Violations 165
the database states. Hence, the DDL includes provisions for specifying the various
types of constraints so that the DBMS can automatically enforce them. In SQL, the
CREATE TABLE statement of the SQL DDL allows the definition of primary key,
unique key, NOT NULL, entity integrity, and referential integrity constraints, among
other constraints (see Sections 6.1 and 6.2).
5.2.5 Other Types of Constraints
The preceding integrity constraints are included in the data definition language
because they occur in most database applications. Another class of general con-
straints, sometimes called semantic integrity constraints, are not part of the DDL
and have to be specified and enforced in a different way. Examples of such con-
straints are the salary of an employee should not exceed the salary of the employee’s
supervisor and the maximum number of hours an employee can work on all projects
per week is 56. Such constraints can be specified and enforced within the applica-
tion programs that update the database, or by using a general-purpose constraint
specification language. Mechanisms called triggers and assertions can be used in
SQL, through the CREATE ASSERTION and CREATE TRIGGER statements, to specify
some of these constraints (see Chapter 7). It is more common to check for these
types of constraints within the application programs than to use constraint specifi-
cation languages because the latter are sometimes difficult and complex to use, as
we discuss in Section 26.1.
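To make the application-program approach concrete, the following sketch checks the 56-hour rule before inserting a WORKS_ON tuple. It uses Python's sqlite3 module; the function name assign_hours and the constant name are hypothetical, and the check ignores concurrent updates, which a real application would also have to consider.

```python
import sqlite3

MAX_WEEKLY_HOURS = 56  # the limit used as an example in the text

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE WORKS_ON (
        Essn  TEXT    NOT NULL,
        Pno   INTEGER NOT NULL,
        Hours REAL,
        PRIMARY KEY (Essn, Pno)
    )
""")
conn.executemany(
    "INSERT INTO WORKS_ON VALUES (?, ?, ?)",
    [("123456789", 1, 32.5), ("123456789", 2, 7.5)],
)


def assign_hours(essn, pno, hours):
    """Insert a WORKS_ON tuple only if the employee's total stays within the limit."""
    (current,) = conn.execute(
        "SELECT COALESCE(SUM(Hours), 0) FROM WORKS_ON WHERE Essn = ?", (essn,)
    ).fetchone()
    if current + hours > MAX_WEEKLY_HOURS:
        return False  # semantic constraint would be violated, so reject
    conn.execute("INSERT INTO WORKS_ON VALUES (?, ?, ?)", (essn, pno, hours))
    return True


first = assign_hours("123456789", 3, 10.0)    # total becomes 50.0
second = assign_hours("123456789", 10, 10.0)  # total would become 60.0
```

Because the rule lives in the program and not in the schema, any other program that updates WORKS_ON must repeat the same check, which is one drawback of this approach.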
The types of constraints we discussed so far may be called state constraints
because they define the constraints that a valid state of the database must satisfy.
Another type of constraint, called transition constraints, can be defined to deal
with state changes in the database.11 An example of a transition constraint is: “the
salary of an employee can only increase.” Such constraints are typically enforced
by the application programs or specified using active rules and triggers, as we dis-
cuss in Section 26.1.
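A transition constraint compares the old and new values of a tuple, which a trigger can do. The sketch below expresses "the salary of an employee can only increase" as a trigger in SQLite, driven from Python's sqlite3 module. The trigger name, the reduced EMPLOYEE schema, and the helper set_salary are assumptions for illustration; trigger syntax differs among DBMSs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        Ssn    TEXT PRIMARY KEY NOT NULL,
        Salary INTEGER
    );

    -- Reject any update that would lower an existing salary.
    CREATE TRIGGER salary_only_increases
    BEFORE UPDATE OF Salary ON EMPLOYEE
    FOR EACH ROW
    WHEN NEW.Salary < OLD.Salary
    BEGIN
        SELECT RAISE(ABORT, 'salary can only increase');
    END;

    INSERT INTO EMPLOYEE VALUES ('999887777', 25000);
""")


def set_salary(ssn, salary):
    """Apply a salary change; return False if the DBMS rejects it."""
    try:
        conn.execute("UPDATE EMPLOYEE SET Salary = ? WHERE Ssn = ?", (salary, ssn))
        return True
    except sqlite3.DatabaseError:
        return False


raise_ok = set_salary("999887777", 28000)  # an increase
cut_ok = set_salary("999887777", 20000)    # a decrease
```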
5.3 Update Operations, Transactions,
and Dealing with Constraint Violations
The operations of the relational model can be categorized into retrievals and
updates. The relational algebra operations, which can be used to specify retrievals,
are discussed in detail in Chapter 8. A relational algebra expression forms a new
relation after applying a number of algebraic operators to an existing set of rela-
tions; its main use is for querying a database to retrieve information. The user for-
mulates a query that specifies the data of interest, and a new relation is formed by
applying relational operators to retrieve this data. The result relation becomes the
answer to (or result of) the user’s query. Chapter 8 also introduces the language
11State constraints are sometimes called static constraints, and transition constraints are sometimes
called dynamic constraints.
called relational calculus, which is used to define a query declaratively without giv-
ing a specific order of operations.
In this section, we concentrate on the database modification or update operations.
There are three basic operations that can change the states of relations in the data-
base: Insert, Delete, and Update (or Modify). They insert new data, delete old data,
or modify existing data records, respectively. Insert is used to insert one or more
new tuples in a relation, Delete is used to delete tuples, and Update (or Modify) is
used to change the values of some attributes in existing tuples. Whenever these
operations are applied, the integrity constraints specified on the relational database
schema should not be violated. In this section we discuss the types of constraints
that may be violated by each of these operations and the types of actions that may
be taken if an operation causes a violation. We use the database shown in Figure 5.6
for examples and discuss only domain constraints, key constraints, entity integrity
constraints, and the referential integrity constraints shown in Figure 5.7. For each
type of operation, we give some examples and discuss any constraints that each
operation may violate.
5.3.1 The Insert Operation
The Insert operation provides a list of attribute values for a new tuple t that is to be
inserted into a relation R. Insert can violate any of the four types of constraints.
Domain constraints can be violated if an attribute value is given that does not
appear in the corresponding domain or is not of the appropriate data type. Key
constraints can be violated if a key value in the new tuple t already exists in another
tuple in the relation r(R). Entity integrity can be violated if any part of the primary
key of the new tuple t is NULL. Referential integrity can be violated if the value of
any foreign key in t refers to a tuple that does not exist in the referenced relation.
Here are some examples to illustrate this discussion.
■ Operation:
Insert <‘Cecilia’, ‘F’, ‘Kolonsky’, NULL, ‘1960-04-05’, ‘6357 Windy Lane, Katy,
TX’, F, 28000, NULL, 4> into EMPLOYEE.
Result: This insertion violates the entity integrity constraint (NULL for the
primary key Ssn), so it is rejected.
■ Operation:
Insert <‘Alicia’, ‘J’, ‘Zelaya’, ‘999887777’, ‘1960-04-05’, ‘6357 Windy Lane, Katy,
TX’, F, 28000, ‘987654321’, 4> into EMPLOYEE.
Result: This insertion violates the key constraint because another tuple with
the same Ssn value already exists in the EMPLOYEE relation, and so it is
rejected.
■ Operation:
Insert <‘Cecilia’, ‘F’, ‘Kolonsky’, ‘677678989’, ‘1960-04-05’, ‘6357 Windswept,
Katy, TX’, F, 28000, ‘987654321’, 7> into EMPLOYEE.
Result: This insertion violates the referential integrity constraint specified on
Dno in EMPLOYEE because no corresponding referenced tuple exists in
DEPARTMENT with Dnumber = 7.
■ Operation:
Insert <‘Cecilia’, ‘F’, ‘Kolonsky’, ‘677678989’, ‘1960-04-05’, ‘6357 Windy Lane,
Katy, TX’, F, 28000, NULL, 4> into EMPLOYEE.
Result: This insertion satisfies all constraints, so it is acceptable.
If an insertion violates one or more constraints, the default option is to reject the
insertion. In this case, it would be useful if the DBMS could provide a reason to the
user as to why the insertion was rejected. Another option is to attempt to correct the
reason for rejecting the insertion, but this is typically not used for violations caused by
Insert; rather, it is used more often in correcting violations for Delete and Update.
In the first operation, the DBMS could ask the user to provide a value for Ssn, and
could then accept the insertion if a valid Ssn value is provided. In operation 3, the
DBMS could either ask the user to change the value of Dno to some valid value
(or set it to NULL), or it could ask the user to insert a DEPARTMENT tuple with
Dnumber = 7 and could accept the original insertion only after such an operation
was accepted. Notice that in the latter case the insertion violation can cascade back
to the EMPLOYEE relation if the user attempts to insert a tuple for department 7 with
a value for Mgr_ssn that does not exist in the EMPLOYEE relation.
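The four insertions discussed above can be replayed against a small SQLite database from Python. This is a sketch under several assumptions: the column types are our own choice, DEPARTMENT is reduced to two attributes (omitting Mgr_ssn) to avoid a circular dependency when loading, and only three existing EMPLOYEE tuples from Figure 5.6 are loaded. SQLite is loosely typed, so domain constraint violations are not illustrated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE DEPARTMENT (Dnumber INTEGER PRIMARY KEY, Dname TEXT NOT NULL);
    CREATE TABLE EMPLOYEE (
        Fname TEXT, Minit TEXT, Lname TEXT,
        Ssn TEXT PRIMARY KEY NOT NULL,  -- NOT NULL stated explicitly for entity integrity
        Bdate TEXT, Address TEXT, Sex TEXT, Salary INTEGER,
        Super_ssn TEXT REFERENCES EMPLOYEE(Ssn),
        Dno INTEGER REFERENCES DEPARTMENT(Dnumber)
    );
    INSERT INTO DEPARTMENT VALUES (1, 'Headquarters'), (4, 'Administration'), (5, 'Research');
    INSERT INTO EMPLOYEE VALUES
        ('James', 'E', 'Borg', '888665555', '1937-11-10',
         '450 Stone, Houston, TX', 'M', 55000, NULL, 1),
        ('Jennifer', 'S', 'Wallace', '987654321', '1941-06-20',
         '291 Berry, Bellaire, TX', 'F', 43000, '888665555', 4),
        ('Alicia', 'J', 'Zelaya', '999887777', '1968-01-19',
         '3321 Castle, Spring, TX', 'F', 25000, '987654321', 4);
""")


def insert_employee(row):
    """Attempt an insertion; the default action on a violation is to reject it."""
    try:
        conn.execute("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected ({exc})"


results = [
    # Operation 1: NULL primary key (entity integrity).
    insert_employee(("Cecilia", "F", "Kolonsky", None, "1960-04-05",
                     "6357 Windy Lane, Katy, TX", "F", 28000, None, 4)),
    # Operation 2: Ssn value already present (key constraint).
    insert_employee(("Alicia", "J", "Zelaya", "999887777", "1960-04-05",
                     "6357 Windy Lane, Katy, TX", "F", 28000, "987654321", 4)),
    # Operation 3: no DEPARTMENT tuple with Dnumber = 7 (referential integrity).
    insert_employee(("Cecilia", "F", "Kolonsky", "677678989", "1960-04-05",
                     "6357 Windswept, Katy, TX", "F", 28000, "987654321", 7)),
    # Operation 4: satisfies all constraints.
    insert_employee(("Cecilia", "F", "Kolonsky", "677678989", "1960-04-05",
                     "6357 Windy Lane, Katy, TX", "F", 28000, None, 4)),
]
```

The message attached to each rejection is the kind of explanation the text suggests a DBMS should give the user; its exact wording depends on the system and version.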
5.3.2 The Delete Operation
The Delete operation can violate only referential integrity. This occurs if the tuple
being deleted is referenced by foreign keys from other tuples in the database. To
specify deletion, a condition on the attributes of the relation selects the tuple (or
tuples) to be deleted. Here are some examples.
■ Operation:
Delete the WORKS_ON tuple with Essn = ‘999887777’ and Pno = 10.
Result: This deletion is acceptable and deletes exactly one tuple.
■ Operation:
Delete the EMPLOYEE tuple with Ssn = ‘999887777’.
Result: This deletion is not acceptable, because there are tuples in
WORKS_ON that refer to this tuple. Hence, if the tuple in EMPLOYEE is
deleted, referential integrity violations will result.
■ Operation:
Delete the EMPLOYEE tuple with Ssn = ‘333445555’.
Result: This deletion will result in even worse referential integrity violations,
because the tuple involved is referenced by tuples from the EMPLOYEE,
DEPARTMENT, WORKS_ON, and DEPENDENT relations.
Several options are available if a deletion operation causes a violation. The first
option, called restrict, is to reject the deletion. The second option, called cascade, is
to attempt to cascade (or propagate) the deletion by deleting tuples that reference the
tuple that is being deleted. For example, in operation 2, the DBMS could automati-
cally delete the offending tuples from WORKS_ON with Essn = ‘999887777’. A
third option, called set null or set default, is to modify the referencing attribute
values that cause the violation; each such value is either set to NULL or changed to
reference another default valid tuple. Notice that if a referencing attribute that
causes a violation is part of the primary key, it cannot be set to NULL; otherwise, it
would violate entity integrity.
Combinations of these three options are also possible. For example, to avoid having
operation 3 cause a violation, the DBMS may automatically delete all tuples from
WORKS_ON and DEPENDENT with Essn = ‘333445555’. Tuples in EMPLOYEE with
Super_ssn = ‘333445555’ and the tuple in DEPARTMENT with Mgr_ssn = ‘333445555’
can have their Super_ssn and Mgr_ssn values changed to other valid values or to
NULL. Although it may make sense to delete automatically the WORKS_ON and
DEPENDENT tuples that refer to an EMPLOYEE tuple, it may not make sense to delete
other EMPLOYEE tuples or a DEPARTMENT tuple.
In general, when a referential integrity constraint is specified in the DDL, the DBMS
will allow the database designer to specify which of the options applies in case of a
violation of the constraint. We discuss how to specify these options in the SQL DDL
in Chapter 6.
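The restrict, cascade, and set null options can be tried out in SQLite through Python's sqlite3 module. In the sketch below, the assignment of an option to each foreign key is our own choice for illustration, the schemas are reduced to the attributes involved, and only a few tuples from Figure 5.6 are loaded (in particular, DEPARTMENT holds only the Research tuple).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE EMPLOYEE (
        Ssn       TEXT PRIMARY KEY NOT NULL,
        Super_ssn TEXT REFERENCES EMPLOYEE(Ssn) ON DELETE SET NULL
    );
    CREATE TABLE DEPARTMENT (
        Dnumber INTEGER PRIMARY KEY,
        Mgr_ssn TEXT NOT NULL REFERENCES EMPLOYEE(Ssn) ON DELETE RESTRICT
    );
    CREATE TABLE WORKS_ON (
        Essn  TEXT    NOT NULL REFERENCES EMPLOYEE(Ssn) ON DELETE CASCADE,
        Pno   INTEGER NOT NULL,
        Hours REAL,
        PRIMARY KEY (Essn, Pno)
    );
    INSERT INTO EMPLOYEE VALUES
        ('888665555', NULL), ('333445555', '888665555'), ('987654321', '888665555'),
        ('999887777', '987654321'), ('123456789', '333445555');
    INSERT INTO DEPARTMENT VALUES (5, '333445555');
    INSERT INTO WORKS_ON VALUES
        ('999887777', 30, 30.0), ('999887777', 10, 10.0), ('333445555', 2, 10.0);
""")


def delete_employee(ssn):
    """Attempt a deletion; return False if a restrict option rejects it."""
    try:
        conn.execute("DELETE FROM EMPLOYEE WHERE Ssn = ?", (ssn,))
        return True
    except sqlite3.IntegrityError:
        return False


# Restrict: '333445555' manages department 5, so the deletion is rejected as a whole.
deleted_manager = delete_employee("333445555")

# Set null: deleting '987654321' sets the Super_ssn of '999887777' to NULL.
deleted_supervisor = delete_employee("987654321")
super_after = conn.execute(
    "SELECT Super_ssn FROM EMPLOYEE WHERE Ssn = '999887777'"
).fetchone()[0]

# Cascade: deleting '999887777' also deletes her WORKS_ON tuples.
deleted_worker = delete_employee("999887777")
works_on_left = conn.execute(
    "SELECT COUNT(*) FROM WORKS_ON WHERE Essn = '999887777'"
).fetchone()[0]
```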
5.3.3 The Update Operation
The Update (or Modify) operation is used to change the values of one or more
attributes in a tuple (or tuples) of some relation R. It is necessary to specify a condi-
tion on the attributes of the relation to select the tuple (or tuples) to be modified.
Here are some examples.
■ Operation:
Update the salary of the EMPLOYEE tuple with Ssn = ‘999887777’ to 28000.
Result: Acceptable.
■ Operation:
Update the Dno of the EMPLOYEE tuple with Ssn = ‘999887777’ to 1.
Result: Acceptable.
■ Operation:
Update the Dno of the EMPLOYEE tuple with Ssn = ‘999887777’ to 7.
Result: Unacceptable, because it violates referential integrity.
■ Operation:
Update the Ssn of the EMPLOYEE tuple with Ssn = ‘999887777’ to ‘987654321’.
Result: Unacceptable, because it violates the primary key constraint by repeating
a value that already exists as a primary key in another tuple; it also violates
referential integrity constraints because there are other relations that refer to the
existing value of Ssn.
Updating an attribute that is neither part of a primary key nor part of a foreign key
usually causes no problems; the DBMS need only check to confirm that the new
value is of the correct data type and domain. Modifying a primary key value is simi-
lar to deleting one tuple and inserting another in its place because we use the pri-
mary key to identify tuples. Hence, the issues discussed earlier in both Sections 5.3.1
(Insert) and 5.3.2 (Delete) come into play. If a foreign key attribute is modified, the
5.4 Summary 169
DBMS must make sure that the new value refers to an existing tuple in the refer-
enced relation (or is set to NULL). Similar options exist to deal with referential integ-
rity violations caused by Update as those options discussed for the Delete operation.
In fact, when a referential integrity constraint is specified in the DDL, the DBMS will
allow the user to choose separate options to deal with a violation caused by Delete
and a violation caused by Update (see Section 6.2).
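The update examples of this section can be replayed in SQLite from Python, as sketched below. The schemas are reduced to the attributes involved, the ON UPDATE CASCADE option on WORKS_ON.Essn is our own choice, and the final operation with the new Ssn value '111223333' is a hypothetical addition that shows a primary key change propagating to referencing tuples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE DEPARTMENT (Dnumber INTEGER PRIMARY KEY);
    CREATE TABLE EMPLOYEE (
        Ssn    TEXT PRIMARY KEY NOT NULL,
        Salary INTEGER,
        Dno    INTEGER REFERENCES DEPARTMENT(Dnumber)
    );
    CREATE TABLE WORKS_ON (
        Essn  TEXT    NOT NULL REFERENCES EMPLOYEE(Ssn) ON UPDATE CASCADE,
        Pno   INTEGER NOT NULL,
        Hours REAL,
        PRIMARY KEY (Essn, Pno)
    );
    INSERT INTO DEPARTMENT VALUES (1), (4), (5);
    INSERT INTO EMPLOYEE VALUES ('999887777', 25000, 4), ('987654321', 43000, 4);
    INSERT INTO WORKS_ON VALUES ('999887777', 30, 30.0), ('999887777', 10, 10.0);
""")


def try_update(sql, params):
    """Attempt an update and report whether the DBMS accepted it."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected ({exc})"


results = [
    # Neither a primary key nor a foreign key attribute.
    try_update("UPDATE EMPLOYEE SET Salary = ? WHERE Ssn = ?", (28000, "999887777")),
    # Foreign key changed to an existing department.
    try_update("UPDATE EMPLOYEE SET Dno = ? WHERE Ssn = ?", (1, "999887777")),
    # Foreign key changed to a department that does not exist.
    try_update("UPDATE EMPLOYEE SET Dno = ? WHERE Ssn = ?", (7, "999887777")),
    # Primary key changed to a value that already exists.
    try_update("UPDATE EMPLOYEE SET Ssn = ? WHERE Ssn = ?", ("987654321", "999887777")),
    # Hypothetical: primary key changed to a fresh value; WORKS_ON follows by cascade.
    try_update("UPDATE EMPLOYEE SET Ssn = ? WHERE Ssn = ?", ("111223333", "999887777")),
]

moved = conn.execute(
    "SELECT COUNT(*) FROM WORKS_ON WHERE Essn = '111223333'"
).fetchone()[0]
```

Without the cascade option on WORKS_ON.Essn, the last update would be rejected as well, because the two WORKS_ON tuples would be left referring to an Ssn value that no longer exists.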
5.3.4 The Transaction Concept
A database application program running against a relational database typically exe-
cutes one or more transactions. A transaction is an executing program that includes
some database operations, such as reading from the database, or applying inser-
tions, deletions, or updates to the database. At the end of the transaction, it must
leave the database in a valid or consistent state that satisfies all the constraints spec-
ified on the database schema. A single transaction may involve any number of
retrieval operations (to be discussed as part of relational algebra and calculus in
Chapter 8, and as a part of the language SQL in Chapters 6 and 7) and any number
of update operations. These retrievals and updates will together form an atomic
unit of work against the database. For example, a transaction to apply a bank with-
drawal will typically read the user account record, check if there is a sufficient bal-
ance, and then update the record by the withdrawal amount.
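The bank withdrawal can be sketched as a single transaction using Python's sqlite3 module. The ACCOUNT table, its attribute names, and the function withdraw are hypothetical. The connection is used as a context manager, which commits if the block completes and rolls back if an exception escapes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ACCOUNT (
        Acct_no TEXT PRIMARY KEY NOT NULL,
        Balance INTEGER NOT NULL CHECK (Balance >= 0)
    )
""")
conn.execute("INSERT INTO ACCOUNT VALUES ('A-1', 100)")
conn.commit()


def withdraw(acct_no, amount):
    """Read the balance, check it, and update it as one unit of work."""
    try:
        with conn:  # commit on success, roll back if an exception is raised
            row = conn.execute(
                "SELECT Balance FROM ACCOUNT WHERE Acct_no = ?", (acct_no,)
            ).fetchone()
            if row is None or row[0] < amount:
                raise ValueError("unknown account or insufficient balance")
            conn.execute(
                "UPDATE ACCOUNT SET Balance = Balance - ? WHERE Acct_no = ?",
                (amount, acct_no),
            )
        return True
    except ValueError:
        return False


ok = withdraw("A-1", 30)       # sufficient balance
refused = withdraw("A-1", 500)  # insufficient balance, nothing is changed
balance = conn.execute("SELECT Balance FROM ACCOUNT WHERE Acct_no = 'A-1'").fetchone()[0]
```

Either way, the database is left in a valid state: the balance is reduced by the full amount or not changed at all.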
A large number of commercial applications running against relational databases in
online transaction processing (OLTP) systems are executing transactions at rates
that reach several hundred per second. Transaction processing concepts, concur-
rent execution of transactions, and recovery from failures will be discussed in
Chapters 20 to 22.
5.4 Summary
In this chapter we presented the modeling concepts, data structures, and constraints
provided by the relational model of data. We started by introducing the concepts of
domains, attributes, and tuples. Then, we defined a relation schema as a list of attri-
butes that describe the structure of a relation. A relation, or relation state, is a set of
tuples that conforms to the schema.
Several characteristics differentiate relations from ordinary tables or files. The first
is that a relation is not sensitive to the ordering of tuples. The second involves the
ordering of attributes in a relation schema and the corresponding ordering of val-
ues within a tuple. We gave an alternative definition of relation that does not require
ordering of attributes, but we continued to use the first definition, which requires
attributes and tuple values to be ordered, for convenience. Then, we discussed val-
ues in tuples and introduced NULL values to represent missing or unknown infor-
mation. We emphasized that NULL values should be avoided as much as possible.
We classified database constraints into inherent model-based constraints, explicit
schema-based constraints, and semantic constraints or business rules. Then, we
discussed the schema constraints pertaining to the relational model, starting with
domain constraints, then key constraints (including the concepts of superkey,
key, and primary key), and the NOT NULL constraint on attributes. We defined
relational databases and relational database schemas. Additional relational con-
straints include the entity integrity constraint, which prohibits primary key attri-
butes from being NULL. We described the interrelation referential integrity
constraint, which is used to maintain consistency of references among tuples
from various relations.
The modification operations on the relational model are Insert, Delete, and Update.
Each operation may violate certain types of constraints (refer to Section 5.3). When-
ever an operation is applied, the resulting database state must be a valid state.
Finally, we introduced the concept of a transaction, which is important in relational
DBMSs because it allows the grouping of several database operations into a single
atomic action on the database.
Review Questions
5.1. Define the following terms as they apply to the relational model of data:
domain, attribute, n-tuple, relation schema, relation state, degree of a rela-
tion, relational database schema, and relational database state.
5.2. Why are tuples in a relation not ordered?
5.3. Why are duplicate tuples not allowed in a relation?
5.4. What is the difference between a key and a superkey?
5.5. Why do we designate one of the candidate keys of a relation to be the pri-
mary key?
5.6. Discuss the characteristics of relations that make them different from ordi-
nary tables and files.
5.7. Discuss the various reasons that lead to the occurrence of NULL values in
relations.
5.8. Discuss the entity integrity and referential integrity constraints. Why is each
considered important?
5.9. Define foreign key. What is this concept used for?
5.10. What is a transaction? How does it differ from an Update operation?
Exercises
5.11. Suppose that each of the following Update operations is applied directly to
the database state shown in Figure 5.6. Discuss all integrity constraints
Exercises 171
violated by each operation, if any, and the different ways of enforcing
these constraints.
a. Insert <‘Robert’, ‘F’, ‘Scott’, ‘943775543’, ‘1972-06-21’, ‘2365 Newcastle Rd, Bellaire, TX’, M, 58000, ‘888665555’, 1> into EMPLOYEE.
b. Insert <‘ProductA’, 4, ‘Bellaire’, 2> into PROJECT.
c. Insert <‘Production’, 4, ‘943775543’, ‘2007-10-01’> into DEPARTMENT.
d. Insert <‘677678989’, NULL, ‘40.0’> into WORKS_ON.
e. Insert <‘453453453’, ‘John’, ‘M’, ‘1990-12-12’, ‘spouse’> into DEPENDENT.
f. Delete the WORKS_ON tuples with Essn = ‘333445555’.
g. Delete the EMPLOYEE tuple with Ssn = ‘987654321’.
h. Delete the PROJECT tuple with Pname = ‘ProductX’.
i. Modify the Mgr_ssn and Mgr_start_date of the DEPARTMENT tuple with
Dnumber = 5 to ‘123456789’ and ‘2007-10-01’, respectively.
j. Modify the Super_ssn attribute of the EMPLOYEE tuple with Ssn =
‘999887777’ to ‘943775543’.
k. Modify the Hours attribute of the WORKS_ON tuple with Essn =
‘999887777’ and Pno = 10 to ‘5.0’.
5.12. Consider the AIRLINE relational database schema shown in Figure 5.8,
which describes a database for airline flight information. Each FLIGHT is
identified by a Flight_number, and consists of one or more FLIGHT_LEGs
with Leg_numbers 1, 2, 3, and so on. Each FLIGHT_LEG has scheduled
arrival and departure times, airports, and one or more LEG_INSTANCEs—
one for each Date on which the flight travels. FAREs are kept for each
FLIGHT. For each FLIGHT_LEG instance, SEAT_RESERVATIONs are kept, as
are the AIRPLANE used on the leg and the actual arrival and departure times
and airports. An AIRPLANE is identified by an Airplane_id and is of a particu-
lar AIRPLANE_TYPE. CAN_LAND relates AIRPLANE_TYPEs to the AIRPORTs
at which they can land. An AIRPORT is identified by an Airport_code. Con-
sider an update for the AIRLINE database to enter a reservation on a particu-
lar flight or flight leg on a given date.
a. Give the operations for this update.
b. What types of constraints would you expect to check?
c. Which of these constraints are key, entity integrity, and referential integ-
rity constraints, and which are not?
d. Specify all the referential integrity constraints that hold on the schema
shown in Figure 5.8.
5.13. Consider the relation CLASS(Course#, Univ_Section#, Instructor_name,
Semester, Building_code, Room#, Time_period, Weekdays, Credit_hours). This represents
classes taught in a university, with unique Univ_Section#s. Identify what
you think should be various candidate keys, and write in your own words the
conditions or assumptions under which each candidate key would be valid.
AIRPORT(Airport_code, Name, City, State)
FLIGHT(Flight_number, Airline, Weekdays)
FLIGHT_LEG(Flight_number, Leg_number, Departure_airport_code, Scheduled_departure_time,
    Arrival_airport_code, Scheduled_arrival_time)
LEG_INSTANCE(Flight_number, Leg_number, Date, Number_of_available_seats, Airplane_id,
    Departure_airport_code, Departure_time, Arrival_airport_code, Arrival_time)
FARE(Flight_number, Fare_code, Amount, Restrictions)
AIRPLANE_TYPE(Airplane_type_name, Max_seats, Company)
CAN_LAND(Airplane_type_name, Airport_code)
AIRPLANE(Airplane_id, Total_number_of_seats, Airplane_type)
SEAT_RESERVATION(Flight_number, Leg_number, Date, Seat_number, Customer_name, Customer_phone)
Figure 5.8
The AIRLINE relational database schema.
5.14. Consider the following six relations for an order-processing database appli-
cation in a company:
CUSTOMER(Cust#, Cname, City)
ORDER(Order#, Odate, Cust#, Ord_amt)
ORDER_ITEM(Order#, Item#, Qty)
ITEM(Item#, Unit_price)
SHIPMENT(Order#, Warehouse#, Ship_date)
WAREHOUSE(Warehouse#, City)
Here, Ord_amt refers to total dollar amount of an order; Odate is the date the
order was placed; and Ship_date is the date an order (or part of an order) is
shipped from the warehouse. Assume that an order can be shipped from several
warehouses. Specify the foreign keys for this schema, stating any assumptions
you make. What other constraints can you think of for this database?
5.15. Consider the following relations for a database that keeps track of business
trips of salespersons in a sales office:
SALESPERSON(Ssn, Name, Start_year, Dept_no)
TRIP(Ssn, From_city, To_city, Departure_date, Return_date, Trip_id)
EXPENSE(Trip_id, Account#, Amount)
A trip can be charged to one or more accounts. Specify the foreign keys for
this schema, stating any assumptions you make.
5.16. Consider the following relations for a database that keeps track of student
enrollment in courses and the books adopted for each course:
STUDENT(Ssn, Name, Major, Bdate)
COURSE(Course#, Cname, Dept)
ENROLL(Ssn, Course#, Quarter, Grade)
BOOK_ADOPTION(Course#, Quarter, Book_isbn)
TEXT(Book_isbn, Book_title, Publisher, Author)
Specify the foreign keys for this schema, stating any assumptions you make.
5.17. Consider the following relations for a database that keeps track of automo-
bile sales in a car dealership (OPTION refers to some optional equipment
installed on an automobile):
CAR(Serial_no, Model, Manufacturer, Price)
OPTION(Serial_no, Option_name, Price)
SALE(Salesperson_id, Serial_no, Date, Sale_price)
SALESPERSON(Salesperson_id, Name, Phone)
First, specify the foreign keys for this schema, stating any assumptions you
make. Next, populate the relations with a few sample tuples, and then give
an example of an insertion in the SALE and SALESPERSON relations that
violates the referential integrity constraints and of another insertion that
does not.
5.18. Database design often involves decisions about the storage of attributes. For
example, a Social Security number can be stored as one attribute or split into
three attributes (one for each of the three hyphen-delineated groups of
numbers in a Social Security number—XXX-XX-XXXX). However, Social
Security numbers are usually represented as just one attribute. The decision
is based on how the database will be used. This exercise asks you to think
about specific situations where dividing the SSN is useful.
5.19. Consider a STUDENT relation in a UNIVERSITY database with the following
attributes (Name, Ssn, Local_phone, Address, Cell_phone, Age, Gpa). Note that
the cell phone may be from a different city and state (or province) from the
local phone. A possible tuple of the relation is shown below:
Name                         Ssn          Local_phone  Address                          Cell_phone  Age  Gpa
George Shaw William Edwards  123-45-6789  555-1234     123 Main St., Anytown, CA 94539  555-4321    19   3.75
a. Identify the critical missing information from the Local_phone and
Cell_phone attributes. (Hint: How do you call someone who lives in a dif-
ferent state or province?)
b. Would you store this additional information in the Local_phone and
Cell_phone attributes or add new attributes to the schema for STUDENT?
c. Consider the Name attribute. What are the advantages and disadvantages
of splitting this field from one attribute into three attributes (first name,
middle name, and last name)?
d. What general guideline would you recommend for deciding when to
store information in a single attribute and when to split the information?
e. Suppose the student can have between 0 and 5 phones. Suggest two dif-
ferent designs that allow this type of information.
5.20. Recent changes in privacy laws have disallowed organizations from using
Social Security numbers to identify individuals unless certain restrictions
are satisfied. As a result, most U.S. universities cannot use SSNs as primary
keys (except for financial data). In practice, Student_id, a unique identifier
assigned to every student, is likely to be used as the primary key rather than
SSN since Student_id can be used throughout the system.
a. Some database designers are reluctant to use generated keys (also known
as surrogate keys) for primary keys (such as Student_id) because they are
artificial. Can you propose any natural choices of keys that can be used to
identify the student record in a UNIVERSITY database?
b. Suppose that you are able to guarantee uniqueness of a natural key that
includes last name. Are you guaranteed that the last name will not change
during the lifetime of the database? If last name can change, what solu-
tions can you propose for creating a primary key that still includes last
name but remains unique?
c. What are the advantages and disadvantages of using generated (surro-
gate) keys?
Selected Bibliography
The relational model was introduced by Codd (1970) in a classic paper. Codd also
introduced relational algebra and laid the theoretical foundations for the relational
model in a series of papers (Codd, 1971, 1972, 1972a, 1974); he was later given the
Turing Award, the highest honor of the ACM (Association for Computing Machin-
ery) for his work on the relational model. In a later paper, Codd (1979) discussed
extending the relational model to incorporate more meta-data and semantics about
the relations; he also proposed a three-valued logic to deal with uncertainty in rela-
tions and incorporating NULLs in the relational algebra. The resulting model is
known as RM/T. Childs (1968) had earlier used set theory to model databases.
Later, Codd (1990) published a book examining over 300 features of the relational
data model and database systems. Date (2001) provides a retrospective review and
analysis of the relational data model.
Since Codd’s pioneering work, much research has been conducted on various
aspects of the relational model. Todd (1976) describes an experimental DBMS
called PRTV that directly implements the relational algebra operations. Schmidt
and Swenson (1975) introduce additional semantics into the relational model by
classifying different types of relations. Chen’s (1976) entity–relationship model,
which is discussed in Chapter 3, is a means to communicate the real-world seman-
tics of a relational database at the conceptual level. Wiederhold and Elmasri (1979)
introduce various types of connections between relations to enhance its constraints.
Extensions of the relational model are discussed in Chapters 11 and 26. Additional
bibliographic notes for other aspects of the relational model and its languages, sys-
tems, extensions, and theory are given in Chapters 6 to 9, 14, 15, 23, and 30. Maier
(1983) and Atzeni and De Antonellis (1993) provide an extensive theoretical treat-
ment of the relational data model.
Chapter 6 Basic SQL
The SQL language may be considered one of the
major reasons for the commercial success of rela-
tional databases. Because it became a standard for relational databases, users were
less concerned about migrating their database applications from other types of
database systems—for example, older network or hierarchical systems—to rela-
tional systems. This is because even if the users became dissatisfied with the partic-
ular relational DBMS product they were using, converting to another relational
DBMS product was not expected to be too expensive and time-consuming because
both systems followed the same language standards. In practice, of course, there
are differences among various commercial relational DBMS packages. However,
if the user is diligent in using only those features that are part of the standard,
and if two relational DBMSs faithfully support the standard, then conversion
between two systems should be simplified. Another advantage of having such a
standard is that users may write statements in a database application program
that can access data stored in two or more relational DBMSs without having to
change the database sublanguage (SQL), as long as both/all of the relational
DBMSs support standard SQL.
This chapter presents the practical relational model, which is based on the SQL
standard for commercial relational DBMSs, whereas Chapter 5 presented the most
important concepts underlying the formal relational data model. In Chapter 8 (Sections
8.1 through 8.5), we shall discuss the relational algebra operations, which are
very important for understanding the types of requests that may be specified on a
relational database. They are also important for query processing and optimization
in a relational DBMS, as we shall see in Chapters 18 and 19. However, the relational
algebra operations are too low-level for most commercial DBMS users because a
query in relational algebra is written as a sequence of operations that, when exe-
cuted, produces the required result. Hence, the user must specify how—that is, in
what order—to execute the query operations. On the other hand, the SQL language
provides a higher-level declarative language interface, so the user only specifies
what the result is to be, leaving the actual optimization and decisions on how to
execute the query to the DBMS. Although SQL includes some features from rela-
tional algebra, it is based to a greater extent on the tuple relational calculus, which
we describe in Section 8.6. However, the SQL syntax is more user-friendly than
either of the two formal languages.
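To make the contrast between specifying how and specifying what more concrete, the following minimal sketch computes the same answer twice. It assumes Python with the standard-library sqlite3 module, which is our choice for illustration and not a system discussed in this chapter; the three EMPLOYEE rows are hypothetical sample data.

```python
import sqlite3

# In-memory database with a few illustrative EMPLOYEE rows (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (Lname VARCHAR(15), Salary DECIMAL(10,2), Dno INT)")
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
    [("Smith", 30000, 5), ("Wong", 40000, 5), ("Zelaya", 25000, 4)],
)

# Procedural style: the program spells out *how* to find the answer,
# visiting every row and testing it in an order chosen by the programmer.
procedural = []
for lname, salary, dno in conn.execute("SELECT Lname, Salary, Dno FROM EMPLOYEE"):
    if dno == 5 and salary > 35000:
        procedural.append(lname)

# Declarative style: the query states *what* is wanted; the DBMS decides how.
declarative = [
    row[0]
    for row in conn.execute(
        "SELECT Lname FROM EMPLOYEE WHERE Dno = 5 AND Salary > 35000"
    )
]

print(procedural, declarative)
```

Both lists hold the same names, but only the second formulation leaves the choice of access strategy to the DBMS.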
The name SQL is presently expanded as Structured Query Language. Originally,
SQL was called SEQUEL (Structured English QUEry Language) and was designed
and implemented at IBM Research as the interface for an experimental relational
database system called SYSTEM R. SQL is now the standard language for com-
mercial relational DBMSs. The standardization of SQL is a joint effort by the
American National Standards Institute (ANSI) and the International Organization for
Standardization (ISO), and the first SQL standard is called SQL-86 or SQL1. A
revised and much expanded standard called SQL-92 (also referred to as SQL2)
was subsequently developed. The next standard that is well-recognized is
SQL:1999, which started out as SQL3. Additional updates to the standard are
SQL:2003 and SQL:2006, which added XML features (see Chapter 13) among
other updates to the language. Another update in 2008 incorporated more object
database features into SQL (see Chapter 12), and a further update is SQL:2011.
We will try to cover the latest version of SQL as much as possible, but some of the
newer features are discussed in later chapters. It is also not possible to cover the
language in its entirety in this text. It is important to note that when new features
are added to SQL, it usually takes a few years for some of these features to make it
into the commercial SQL DBMSs.
SQL is a comprehensive database language: It has statements for data definitions,
queries, and updates. Hence, it is both a DDL and a DML. In addition, it has facili-
ties for defining views on the database, for specifying security and authorization,
for defining integrity constraints, and for specifying transaction controls. It also has
rules for embedding SQL statements into a general-purpose programming language
such as Java or C/C++.[1]
The later SQL standards (starting with SQL:1999) are divided into a core specifica-
tion plus specialized extensions. The core is supposed to be implemented by all
RDBMS vendors that are SQL compliant. The extensions can be implemented as
optional modules to be purchased independently for specific database applications
such as data mining, spatial data, temporal data, data warehousing, online analyti-
cal processing (OLAP), multimedia data, and so on.
Because the subject of SQL is both important and extensive, we devote two chap-
ters to its basic features. In this chapter, Section 6.1 describes the SQL DDL com-
mands for creating schemas and tables, and gives an overview of the basic data
types in SQL. Section 6.2 presents how basic constraints such as key and referential
integrity are specified. Section 6.3 describes the basic SQL constructs for
specifying retrieval queries, and Section 6.4 describes the SQL commands for
insertion, deletion, and update.
[1] Originally, SQL had statements for creating and dropping indexes on the files that represent relations,
but these have been dropped from the SQL standard for some time.
In Chapter 7, we will describe more complex SQL retrieval queries, as well as the
ALTER commands for changing the schema. We will also describe the CREATE
ASSERTION statement, which allows the specification of more general constraints
on the database, and the concept of triggers, which is presented in more detail in
Chapter 26. We discuss the SQL facility for defining views on the database in Chap-
ter 7. Views are also called virtual or derived tables because they present the user
with what appear to be tables; however, the information in those tables is derived
from previously defined tables.
Section 6.5 lists some SQL features that are presented in other chapters of the book;
these include object-oriented features in Chapter 12, XML in Chapter 13, transac-
tion control in Chapter 20, active databases (triggers) in Chapter 26, online analyti-
cal processing (OLAP) features in Chapter 29, and security/authorization in
Chapter 30. Section 6.6 summarizes the chapter. Chapters 10 and 11 discuss the
various database programming techniques for programming with SQL.
6.1 SQL Data Definition and Data Types
SQL uses the terms table, row, and column for the formal relational model terms
relation, tuple, and attribute, respectively. We will use the corresponding terms
interchangeably. The main SQL command for data definition is the CREATE state-
ment, which can be used to create schemas, tables (relations), types, and domains,
as well as other constructs such as views, assertions, and triggers. Before we describe
the relevant CREATE statements, we discuss schema and catalog concepts in Sec-
tion 6.1.1 to place our discussion in perspective. Section 6.1.2 describes how tables
are created, and Section 6.1.3 describes the most important data types available for
attribute specification. Because the SQL specification is very large, we give a descrip-
tion of the most important features. Further details can be found in the various SQL
standards documents (see end-of-chapter bibliographic notes).
6.1.1 Schema and Catalog Concepts in SQL
Early versions of SQL did not include the concept of a relational database schema;
all tables (relations) were considered part of the same schema. The concept of an
SQL schema was incorporated starting with SQL2 in order to group together tables
and other constructs that belong to the same database application (in some systems,
a schema is called a database). An SQL schema is identified by a schema name and
includes an authorization identifier to indicate the user or account who owns the
schema, as well as descriptors for each element in the schema. Schema elements
include tables, types, constraints, views, domains, and other constructs (such as
authorization grants) that describe the schema. A schema is created via the CREATE
SCHEMA statement, which can include all the schema elements’ definitions. Alter-
natively, the schema can be assigned a name and authorization identifier, and the
elements can be defined later. For example, the following statement creates a
schema called COMPANY owned by the user with authorization identifier ‘Jsmith’.
Note that each statement in SQL ends with a semicolon.
CREATE SCHEMA COMPANY AUTHORIZATION 'Jsmith';
In general, not all users are authorized to create schemas and schema elements. The
privilege to create schemas, tables, and other constructs must be explicitly granted
to the relevant user accounts by the system administrator or DBA.
In addition to the concept of a schema, SQL uses the concept of a catalog—a named
collection of schemas.[2] Database installations typically have a default environment
and schema, so when a user connects and logs in to that database installation, the
user can refer directly to tables and other constructs within that schema without
having to specify a particular schema name. A catalog always contains a special
schema called INFORMATION_SCHEMA, which provides information on all the
schemas in the catalog and all the element descriptors in these schemas. Integrity
constraints such as referential integrity can be defined between relations only if
they exist in schemas within the same catalog. Schemas within the same catalog can
also share certain elements, such as type and domain definitions.
6.1.2 The CREATE TABLE Command in SQL
The CREATE TABLE command is used to specify a new relation by giving it a name
and specifying its attributes and initial constraints. The attributes are specified first,
and each attribute is given a name, a data type to specify its domain of values, and
possibly attribute constraints, such as NOT NULL. The key, entity integrity, and ref-
erential integrity constraints can be specified within the CREATE TABLE statement
after the attributes are declared, or they can be added later using the ALTER TABLE
command (see Chapter 7). Figure 6.1 shows sample data definition statements in
SQL for the COMPANY relational database schema shown in Figure 5.7.
Typically, the SQL schema in which the relations are declared is implicitly specified
in the environment in which the CREATE TABLE statements are executed. Alterna-
tively, we can explicitly attach the schema name to the relation name, separated by
a period. For example, by writing
CREATE TABLE COMPANY.EMPLOYEE
rather than
CREATE TABLE EMPLOYEE
as in Figure 6.1, we can explicitly (rather than implicitly) make the EMPLOYEE table
part of the COMPANY schema.
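The effect of qualifying a table name can be tried out in a small sketch. It assumes Python's standard sqlite3 module; SQLite does not implement CREATE SCHEMA or INFORMATION_SCHEMA, so an attached database named COMPANY stands in for the schema, and the per-database catalog table sqlite_master stands in for the catalog information. The SCRATCH table and the tables_in helper are hypothetical names introduced only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no CREATE SCHEMA; an attached database plays a similar naming role here.
conn.execute("ATTACH DATABASE ':memory:' AS COMPANY")

# Explicitly qualified: the table is created inside the COMPANY namespace.
conn.execute("CREATE TABLE COMPANY.EMPLOYEE (Ssn CHAR(9) PRIMARY KEY, Lname VARCHAR(15))")
# Unqualified: the table lands in the default namespace, which SQLite calls "main".
conn.execute("CREATE TABLE SCRATCH (Note VARCHAR(30))")

def tables_in(schema_name):
    """List table names recorded in the given namespace's catalog table."""
    rows = conn.execute(
        f"SELECT name FROM {schema_name}.sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]

print(tables_in("COMPANY"))
print(tables_in("main"))
```

The qualified table appears only under COMPANY, while the unqualified one appears only in the default namespace, which mirrors the implicit versus explicit choice described above.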
The relations declared through CREATE TABLE statements are called base tables
(or base relations); this means that the table and its rows are actually created
and stored as a file by the DBMS. Base relations are distinguished from virtual
relations, created through the CREATE VIEW statement (see Chapter 7), which
may or may not correspond to an actual physical file. In SQL, the attributes in a
base table are considered to be ordered in the sequence in which they are specified
in the CREATE TABLE statement. However, rows (tuples) are not considered
to be ordered within a table (relation).
[2] SQL also includes the concept of a cluster of catalogs.
CREATE TABLE EMPLOYEE
    ( Fname           VARCHAR(15)    NOT NULL,
      Minit           CHAR,
      Lname           VARCHAR(15)    NOT NULL,
      Ssn             CHAR(9)        NOT NULL,
      Bdate           DATE,
      Address         VARCHAR(30),
      Sex             CHAR,
      Salary          DECIMAL(10,2),
      Super_ssn       CHAR(9),
      Dno             INT            NOT NULL,
    PRIMARY KEY (Ssn),
    FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(Ssn),
    FOREIGN KEY (Dno) REFERENCES DEPARTMENT(Dnumber) );
CREATE TABLE DEPARTMENT
    ( Dname           VARCHAR(15)    NOT NULL,
      Dnumber         INT            NOT NULL,
      Mgr_ssn         CHAR(9)        NOT NULL,
      Mgr_start_date  DATE,
    PRIMARY KEY (Dnumber),
    UNIQUE (Dname),
    FOREIGN KEY (Mgr_ssn) REFERENCES EMPLOYEE(Ssn) );
CREATE TABLE DEPT_LOCATIONS
    ( Dnumber         INT            NOT NULL,
      Dlocation       VARCHAR(15)    NOT NULL,
    PRIMARY KEY (Dnumber, Dlocation),
    FOREIGN KEY (Dnumber) REFERENCES DEPARTMENT(Dnumber) );
CREATE TABLE PROJECT
    ( Pname           VARCHAR(15)    NOT NULL,
      Pnumber         INT            NOT NULL,
      Plocation       VARCHAR(15),
      Dnum            INT            NOT NULL,
    PRIMARY KEY (Pnumber),
    UNIQUE (Pname),
    FOREIGN KEY (Dnum) REFERENCES DEPARTMENT(Dnumber) );
CREATE TABLE WORKS_ON
    ( Essn            CHAR(9)        NOT NULL,
      Pno             INT            NOT NULL,
      Hours           DECIMAL(3,1)   NOT NULL,
    PRIMARY KEY (Essn, Pno),
    FOREIGN KEY (Essn) REFERENCES EMPLOYEE(Ssn),
    FOREIGN KEY (Pno) REFERENCES PROJECT(Pnumber) );
CREATE TABLE DEPENDENT
    ( Essn            CHAR(9)        NOT NULL,
      Dependent_name  VARCHAR(15)    NOT NULL,
      Sex             CHAR,
      Bdate           DATE,
      Relationship    VARCHAR(8),
    PRIMARY KEY (Essn, Dependent_name),
    FOREIGN KEY (Essn) REFERENCES EMPLOYEE(Ssn) );
Figure 6.1
SQL CREATE TABLE data definition statements for defining the COMPANY schema from Figure 5.7.
It is important to note that in Figure 6.1, there are some foreign keys that may cause
errors because they are specified either via circular references or because they refer
to a table that has not yet been created. For example, the foreign key Super_ssn in
the EMPLOYEE table is a circular reference because it refers to the EMPLOYEE table
itself. The foreign key Dno in the EMPLOYEE table refers to the DEPARTMENT table,
which has not been created yet. To deal with this type of problem, these constraints
can be left out of the initial CREATE TABLE statement, and then added later using
the ALTER TABLE statement (see Chapter 7). We displayed all the foreign keys in
Figure 6.1 to show the complete COMPANY schema in one place.
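How a particular system reacts to such references is implementation dependent. As one hedged illustration, the sketch below uses Python's standard sqlite3 module (our assumption, not a system covered in the text). SQLite happens to accept a foreign key that names a table not yet created, and it does not support adding constraints through ALTER TABLE, so the sketch shows only the enforcement of the self-referencing Super_ssn key and of Dno once both tables exist. The tables are cut down to the relevant attributes, and try_insert is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# EMPLOYEE refers to itself (Super_ssn) and to DEPARTMENT, which does not exist yet.
# SQLite accepts the forward reference here; other systems may reject it.
conn.execute("""
    CREATE TABLE EMPLOYEE (
        Ssn       CHAR(9) NOT NULL,
        Super_ssn CHAR(9),
        Dno       INT     NOT NULL,
        PRIMARY KEY (Ssn),
        FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(Ssn),
        FOREIGN KEY (Dno) REFERENCES DEPARTMENT(Dnumber)
    )
""")
conn.execute("CREATE TABLE DEPARTMENT (Dnumber INT NOT NULL, PRIMARY KEY (Dnumber))")
conn.execute("INSERT INTO DEPARTMENT VALUES (1)")

def try_insert(ssn, super_ssn, dno):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO EMPLOYEE VALUES (?, ?, ?)", (ssn, super_ssn, dno))
        return True
    except sqlite3.IntegrityError:
        return False

print(try_insert("888665555", None, 1))         # no supervisor, existing department
print(try_insert("333445555", "888665555", 1))  # supervisor already stored
print(try_insert("123456789", "000000000", 1))  # supervisor does not exist
print(try_insert("987654321", None, 9))         # department does not exist
```

A NULL Super_ssn is accepted because a foreign key may be NULL, whereas a non-NULL value must match an existing Ssn.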
6.1.3 Attribute Data Types and Domains in SQL
The basic data types available for attributes include numeric, character string, bit
string, Boolean, date, and time.
■ Numeric data types include integer numbers of various sizes (INTEGER or
INT, and SMALLINT) and floating-point (real) numbers of various precision
(FLOAT or REAL, and DOUBLE PRECISION). Formatted numbers can be
declared by using DECIMAL(i, j)—or DEC(i, j) or NUMERIC(i, j)—where i, the
precision, is the total number of decimal digits and j, the scale, is the number
of digits after the decimal point. The default for scale is zero, and the default
for precision is implementation-defined.
■ Character-string data types are either fixed length (CHAR(n) or
CHARACTER(n), where n is the number of characters) or varying length
(VARCHAR(n) or CHAR VARYING(n) or CHARACTER VARYING(n), where n is
the maximum number of characters). When specifying a literal string value,
it is placed between single quotation marks (apostrophes), and it is case
sensitive (a distinction is made between uppercase and lowercase).[3] For
fixed-length strings, a shorter string is padded with blank characters to the right.
For example, if the value ‘Smith’ is for an attribute of type CHAR(10), it is
padded with five blank characters to become ‘Smith     ’ if needed. Padded
blanks are generally ignored when strings are compared. For comparison
purposes, strings are considered ordered in alphabetic (or lexicographic)
order; if a string str1 appears before another string str2 in alphabetic order,
then str1 is considered to be less than str2.[4] There is also a concatenation
operator denoted by || (double vertical bar) that can concatenate two strings
in SQL. For example, ‘abc’ || ‘XYZ’ results in a single string ‘abcXYZ’.
Another variable-length string data type called CHARACTER LARGE OBJECT
or CLOB is also available to specify columns that have large text values, such
as documents. The CLOB maximum length can be specified in kilobytes
(K), megabytes (M), or gigabytes (G). For example, CLOB(20M) specifies a
maximum length of 20 megabytes.
[3] This is not the case with SQL keywords, such as CREATE or CHAR. With keywords, SQL is case
insensitive, meaning that SQL treats uppercase and lowercase letters as equivalent in keywords.
[4] For nonalphabetic characters, there is a defined order.
■ Bit-string data types are either of fixed length n (BIT(n)) or varying length
(BIT VARYING(n), where n is the maximum number of bits). The default for n,
the length of a character string or bit string, is 1. Literal bit strings are placed
between single quotes but preceded by a B to distinguish them from character
strings; for example, B‘10101’.[5] Another variable-length bit-string data type
called BINARY LARGE OBJECT or BLOB is also available to specify columns
that have large binary values, such as images. As for CLOB, the maximum
length of a BLOB can be specified in kilobits (K), megabits (M), or gigabits (G).
For example, BLOB(30G) specifies a maximum length of 30 gigabits.
[5] Bit strings whose length is a multiple of 4 can be specified in hexadecimal notation, where the literal
string is preceded by X and each hexadecimal character represents 4 bits.
■ A Boolean data type has the traditional values of TRUE or FALSE. In SQL,
because of the presence of NULL values, a three-valued logic is used, so a
third possible value for a Boolean data type is UNKNOWN. We discuss the
need for UNKNOWN and the three-valued logic in Chapter 7.
■ The DATE data type has ten positions, and its components are YEAR, MONTH,
and DAY in the form YYYY-MM-DD. The TIME data type has at least eight
positions, with the components HOUR, MINUTE, and SECOND in the form
HH:MM:SS. Only valid dates and times should be allowed by the SQL
implementation. This implies that months must be between 1 and 12 and days
must be between 1 and 31; furthermore, a day should be a valid day for the
corresponding month. The < (less than) comparison can be used with dates
or times: an earlier date is considered to be smaller than a later date, and
similarly with time. Literal values are represented by single-quoted strings
preceded by the keyword DATE or TIME; for example, DATE ‘2014-09-27’ or
TIME ‘09:12:47’. In addition, a data type TIME(i), where i is called time
fractional seconds precision, specifies i + 1 additional positions for TIME: one
position for an additional period (.) separator character, and i positions for
specifying decimal fractions of a second. A TIME WITH TIME ZONE data type
includes an additional six positions for specifying the displacement from the
standard universal time zone, which is in the range +13:00 to –12:59 in units
of HOURS:MINUTES. If WITH TIME ZONE is not included, the default is the
local time zone for the SQL session.
Some additional data types are discussed below. The list of types discussed here is
not exhaustive; different implementations have added more data types to SQL.
■ A timestamp data type (TIMESTAMP) includes the DATE and TIME fields, plus
a minimum of six positions for decimal fractions of seconds and an optional
WITH TIME ZONE qualifier. Literal values are represented by single-quoted
strings preceded by the keyword TIMESTAMP, with a blank space between
date and time; for example, TIMESTAMP ‘2014-09-27 09:12:47.648302’.
■ Another data type related to DATE, TIME, and TIMESTAMP is the INTERVAL data
type. This specifies an interval—a relative value that can be used to increment
or decrement an absolute value of a date, time, or timestamp. Intervals are
qualified to be either YEAR/MONTH intervals or DAY/TIME intervals.
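Several of the behaviors listed above for character strings and Boolean values can be checked directly. The sketch below is an illustration only and assumes Python's standard sqlite3 module. SQLite reports TRUE and FALSE as the integers 1 and 0 and reports UNKNOWN as NULL (None in Python), and it does not blank-pad CHAR(n) values, so padding is not shown. The scalar helper is a hypothetical convenience function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Evaluate a one-row, one-column query and return its single value."""
    return conn.execute(sql).fetchone()[0]

# Concatenation with the || operator.
joined = scalar("SELECT 'abc' || 'XYZ'")

# String literals are case sensitive and compare in lexicographic order.
same_case = scalar("SELECT 'Smith' = 'smith'")
ordered = scalar("SELECT 'Adams' < 'Baker'")

# A comparison involving NULL yields UNKNOWN, which sqlite3 returns as None.
unknown = scalar("SELECT NULL = NULL")
false_wins = scalar("SELECT NULL AND 0")  # UNKNOWN AND FALSE is FALSE
true_wins = scalar("SELECT NULL OR 1")    # UNKNOWN OR TRUE is TRUE

print(joined, same_case, ordered, unknown, false_wins, true_wins)
```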
The format of DATE, TIME, and TIMESTAMP can be considered as a special type of
string. Hence, they can generally be used in string comparisons by being cast (or
coerced or converted) into the equivalent strings.
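The string view of dates can be illustrated with a short sketch, again assuming Python's standard sqlite3 module. SQLite stores the values below as text, so the < comparison is literally a string comparison; the department names and dates are illustrative sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DEPARTMENT (Dname VARCHAR(15), Mgr_start_date DATE)")
conn.executemany(
    "INSERT INTO DEPARTMENT VALUES (?, ?)",
    [
        ("Research", "1988-05-22"),
        ("Administration", "1995-01-01"),
        ("Headquarters", "1981-06-19"),
    ],
)

# Because YYYY-MM-DD puts the most significant field first,
# lexicographic string order coincides with chronological order.
early = [
    name
    for (name,) in conn.execute(
        "SELECT Dname FROM DEPARTMENT "
        "WHERE Mgr_start_date < '1990-01-01' ORDER BY Mgr_start_date"
    )
]
print(early)
```

This works only because every date uses the same fixed-width format; a value such as '1988-5-22' would no longer sort correctly as a string.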
It is possible to specify the data type of each attribute directly, as in Figure 6.1; alter-
natively, a domain can be declared, and the domain name can be used with the
attribute specification. This makes it easier to change the data type for a domain
that is used by numerous attributes in a schema, and improves schema readability.
For example, we can create a domain SSN_TYPE by the following statement:
CREATE DOMAIN SSN_TYPE AS CHAR(9);
We can use SSN_TYPE in place of CHAR(9) in Figure 6.1 for the attributes Ssn and
Super_ssn of EMPLOYEE, Mgr_ssn of DEPARTMENT, Essn of WORKS_ON, and Essn
of DEPENDENT. A domain can also have an optional default specification via a
DEFAULT clause, as we discuss later for attributes. Notice that domains may not be
available in some implementations of SQL.
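Where CREATE DOMAIN is unavailable, the same single point of change can be approximated outside the DBMS. The sketch below is one possible workaround and not part of SQL: it assumes Python with the standard sqlite3 module (SQLite has no CREATE DOMAIN), and the DOMAINS dictionary and create_table_sql function are hypothetical names. A domain name is expanded into its base type when the CREATE TABLE text is generated.

```python
import sqlite3

# Hypothetical stand-in for CREATE DOMAIN SSN_TYPE AS CHAR(9): one named definition
# that every Ssn-like column reuses when the DDL text is generated.
DOMAINS = {"SSN_TYPE": "CHAR(9)"}

def create_table_sql(table, columns, table_constraints=()):
    """Build a CREATE TABLE statement, expanding domain names into base types."""
    parts = []
    for name, type_or_domain, *options in columns:
        base_type = DOMAINS.get(type_or_domain, type_or_domain)
        parts.append(" ".join([name, base_type, *options]))
    parts.extend(table_constraints)
    return f"CREATE TABLE {table} ({', '.join(parts)})"

works_on_sql = create_table_sql(
    "WORKS_ON",
    [
        ("Essn", "SSN_TYPE", "NOT NULL"),
        ("Pno", "INT", "NOT NULL"),
        ("Hours", "DECIMAL(3,1)", "NOT NULL"),
    ],
    ["PRIMARY KEY (Essn, Pno)"],
)
print(works_on_sql)

conn = sqlite3.connect(":memory:")
conn.execute(works_on_sql)  # the generated statement is ordinary SQL
```

Changing the entry for SSN_TYPE changes every generated column that uses it, which is the maintenance benefit that domains are meant to provide.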
In SQL, there is also a CREATE TYPE command, which can be used to create
user-defined types or UDTs. These can then be used either as data types for attributes, or
as the basis for creating tables. We shall discuss CREATE TYPE in detail in Chap-
ter 12, because it is often used in conjunction with specifying object database features
that have been incorporated into more recent versions of SQL.
6.2 Specifying Constraints in SQL
This section describes the basic constraints that can be specified in SQL as part of
table creation. These include key and referential integrity constraints, restrictions
on attribute domains and NULLs, and constraints on individual tuples within a rela-
tion using the CHECK clause. We discuss the specification of more general con-
straints, called assertions, in Chapter 7.
6.2.1 Specifying Attribute Constraints and Attribute Defaults
Because SQL allows NULLs as attribute values, a constraint NOT NULL may be specified
if NULL is not permitted for a particular attribute. This is always implicitly specified for
the attributes that are part of the primary key of each relation, but it can be specified for
any other attributes whose values are required not to be NULL, as shown in Figure 6.1.
It is also possible to define a default value for an attribute by appending the clause
DEFAULT <value> to an attribute definition. The default value is included in any
new tuple if an explicit value is not provided for that attribute. Figure 6.2 illustrates
an example of specifying a default manager for a new department and a default
department for a new employee. If no default clause is specified, the default default
value is NULL for attributes that do not have the NOT NULL constraint.
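The interplay of NOT NULL and DEFAULT can be observed in a small sketch. It assumes Python's standard sqlite3 module and a cut-down EMPLOYEE table containing only the attributes needed for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EMPLOYEE (
        Ssn     CHAR(9)     NOT NULL,
        Lname   VARCHAR(15) NOT NULL,
        Address VARCHAR(30),
        Dno     INT         NOT NULL DEFAULT 1,
        PRIMARY KEY (Ssn)
    )
""")

# Address and Dno are omitted, so their defaults apply: NULL and 1, respectively.
conn.execute("INSERT INTO EMPLOYEE (Ssn, Lname) VALUES ('123456789', 'Smith')")
address, dno = conn.execute(
    "SELECT Address, Dno FROM EMPLOYEE WHERE Ssn = '123456789'"
).fetchone()

# An explicit NULL for a NOT NULL attribute is rejected.
try:
    conn.execute("INSERT INTO EMPLOYEE (Ssn, Lname) VALUES ('987654321', NULL)")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

print(address, dno, null_rejected)
```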
Another type of constraint can restrict attribute or domain values using the CHECK
clause following an attribute or domain definition.[6] For example, suppose that
department numbers are restricted to integer numbers between 1 and 20; then, we
can change the attribute declaration of Dnumber in the DEPARTMENT table (see Fig-
ure 6.1) to the following:
Dnumber INT NOT NULL CHECK (Dnumber > 0 AND Dnumber < 21);
The CHECK clause can also be used in conjunction with the CREATE DOMAIN state-
ment. For example, we can write the following statement:
CREATE DOMAIN D_NUM AS INTEGER
CHECK (D_NUM > 0 AND D_NUM < 21);
⁶The CHECK clause can also be used for other purposes, as we shall see.
CREATE TABLE EMPLOYEE
( … ,
Dno INT NOT NULL DEFAULT 1,
CONSTRAINT EMPPK
PRIMARY KEY (Ssn),
CONSTRAINT EMPSUPERFK
FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(Ssn)
ON DELETE SET NULL ON UPDATE CASCADE,
CONSTRAINT EMPDEPTFK
FOREIGN KEY(Dno) REFERENCES DEPARTMENT(Dnumber)
ON DELETE SET DEFAULT ON UPDATE CASCADE);
CREATE TABLE DEPARTMENT
( … ,
Mgr_ssn CHAR(9) NOT NULL DEFAULT '888665555',
… ,
CONSTRAINT DEPTPK
PRIMARY KEY(Dnumber),
CONSTRAINT DEPTSK
UNIQUE (Dname),
CONSTRAINT DEPTMGRFK
FOREIGN KEY (Mgr_ssn) REFERENCES EMPLOYEE(Ssn)
ON DELETE SET DEFAULT ON UPDATE CASCADE);
CREATE TABLE DEPT_LOCATIONS
( … ,
PRIMARY KEY (Dnumber, Dlocation),
FOREIGN KEY (Dnumber) REFERENCES DEPARTMENT(Dnumber)
ON DELETE CASCADE ON UPDATE CASCADE);
Figure 6.2 Example illustrating how default attribute values and referential integrity triggered actions are specified in SQL.
We can then use the created domain D_NUM as the attribute type for all attributes
that refer to department numbers in Figure 6.1, such as Dnumber of DEPARTMENT,
Dnum of PROJECT, Dno of EMPLOYEE, and so on.
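The following sketch shows how such a domain might be used as an attribute type. It assumes that the domain D_NUM defined above already exists and that the DBMS supports domains, which, as noted earlier, is not true of every implementation. DEPT_DEMO and PROJ_DEMO are hypothetical cut-down tables, not the full declarations of Figure 6.1.

```sql
-- Assumes the domain D_NUM has already been created and that the
-- DBMS supports CREATE DOMAIN. Table names are hypothetical.
CREATE TABLE DEPT_DEMO
( Dname    VARCHAR(15)  NOT NULL,
  Dnumber  D_NUM        NOT NULL,   -- domain used in place of INT
  PRIMARY KEY (Dnumber) );

CREATE TABLE PROJ_DEMO
( Pname    VARCHAR(15)  NOT NULL,
  Pnumber  INT          NOT NULL,
  Dnum     D_NUM        NOT NULL,   -- same domain, so the same range check
  PRIMARY KEY (Pnumber),
  FOREIGN KEY (Dnum) REFERENCES DEPT_DEMO(Dnumber) );

INSERT INTO DEPT_DEMO VALUES ('Research', 5);   -- accepted: 5 lies between 1 and 20
INSERT INTO DEPT_DEMO VALUES ('Archive', 25);   -- rejected by the CHECK of D_NUM
```

Because the range restriction is stated once in the domain, changing the permitted range requires changing only the domain definition. Note that some systems, PostgreSQL among them, require the CHECK condition of a domain to refer to the tested value by the keyword VALUE rather than by the domain name.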
6.2.2 Specifying Key and Referential Integrity Constraints
Because keys and referential integrity constraints are very important, there are special
clauses within the CREATE TABLE statement to specify them. Some examples to
illustrate the specification of keys and referential integrity are shown in Figure 6.1.⁷
The PRIMARY KEY clause specifies one or more attributes that make up the primary
key of a relation. If a primary key has a single attribute, the clause can follow the
attribute directly. For example, the primary key of DEPARTMENT can be specified as
follows (instead of the way it is specified in Figure 6.1):
Dnumber INT PRIMARY KEY,
The UNIQUE clause specifies alternate (unique) keys, also known as candidate keys,
as illustrated in the DEPARTMENT and PROJECT table declarations in Figure 6.1.
The UNIQUE clause can also be specified directly for a unique key if it is a single
attribute, as in the following example:
Dname VARCHAR(15) UNIQUE,
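To contrast the two placements, the following sketch declares single-attribute keys directly after the attribute and a composite key in a separate clause. The table names DEPT_INLINE and DEPT_LOC_DEMO are hypothetical and chosen only for this illustration.

```sql
-- Column-level form: usable because each key is a single attribute.
CREATE TABLE DEPT_INLINE
( Dname    VARCHAR(15)  NOT NULL UNIQUE,
  Dnumber  INT          PRIMARY KEY );

-- Table-level form: needed when the key spans more than one attribute.
CREATE TABLE DEPT_LOC_DEMO
( Dnumber    INT          NOT NULL,
  Dlocation  VARCHAR(15)  NOT NULL,
  PRIMARY KEY (Dnumber, Dlocation) );
```

The column-level form cannot express the key of DEPT_LOC_DEMO, because neither Dnumber nor Dlocation is unique on its own; only the combination is.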
Referential integrity is specified via the FOREIGN KEY clause, as shown in Fig-
ure 6.1. As we discussed in Section 5.2.4, a referential integrity constraint can be
violated when tuples are inserted or deleted, or when a foreign key or primary key
attribute value is updated. The default action that SQL takes for an integrity viola-
tion is to reject the update operation that will cause a violation, which is known as
the RESTRICT option. However, the schema designer can specify an alternative
action to be taken by attaching a referential triggered action clause to any foreign
key constraint. The options include SET NULL, CASCADE, and SET DEFAULT. An
option must be qualified with either ON DELETE or ON UPDATE. We illustrate this
with the examples shown in Figure 6.2. Here, the database designer chooses ON
DELETE SET NULL and ON UPDATE CASCADE for the foreign key Super_ssn of
EMPLOYEE. This means that if the tuple for a supervising employee is deleted, the
value of Super_ssn is automatically set to NULL for all employee tuples that were
referencing the deleted employee tuple. On the other hand, if the Ssn value for a
supervising employee is updated (say, because it was entered incorrectly), the new
value is cascaded to Super_ssn for all employee tuples referencing the updated
employee tuple.⁸
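The effect of these two triggered actions can be sketched with a small example. EMP_SUP is a hypothetical, cut-down version of EMPLOYEE, and the constraint names are our own. Some systems restrict triggered actions on a foreign key that references its own table, so the sketch assumes a DBMS that supports them as described here.

```sql
-- EMP_SUP is a hypothetical cut-down version of EMPLOYEE.
CREATE TABLE EMP_SUP
( Ssn        CHAR(9)      NOT NULL,
  Lname      VARCHAR(15)  NOT NULL,
  Super_ssn  CHAR(9),
  CONSTRAINT EMPSUP_PK
    PRIMARY KEY (Ssn),
  CONSTRAINT EMPSUP_FK
    FOREIGN KEY (Super_ssn) REFERENCES EMP_SUP(Ssn)
      ON DELETE SET NULL ON UPDATE CASCADE );

INSERT INTO EMP_SUP VALUES ('888665555', 'Borg',  NULL);
INSERT INTO EMP_SUP VALUES ('333445555', 'Wong',  '888665555');
INSERT INTO EMP_SUP VALUES ('123456789', 'Smith', '333445555');

-- ON UPDATE CASCADE: the Super_ssn of Smith changes to '333445556' as well.
UPDATE EMP_SUP SET Ssn = '333445556' WHERE Ssn = '333445555';

-- ON DELETE SET NULL: the Super_ssn of Wong becomes NULL.
DELETE FROM EMP_SUP WHERE Ssn = '888665555';
```

Had no triggered action been attached to the foreign key, both the UPDATE and the DELETE would have been rejected, because each would leave a Super_ssn value that refers to no existing employee tuple.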
In general, the action taken by the DBMS for SET NULL or SET DEFAULT is the
same for both ON DELETE and ON UPDATE: The value of the affected referencing
attributes is changed to NULL for SET NULL and to the specified default value of the
referencing attribute for SET DEFAULT. The action for CASCADE ON DELETE is to
delete all the referencing tuples, whereas the action for CASCADE ON UPDATE is to
change the value of the referencing foreign key attribute(s) to the updated (new)
primary key value for all the referencing tuples. It is the responsibility of the database
designer to choose the appropriate action and to specify it in the database
schema. As a general rule, the CASCADE option is suitable for “relationship” relations
(see Section 9.1), such as WORKS_ON; for relations that represent multivalued
attributes, such as DEPT_LOCATIONS; and for relations that represent weak
entity types, such as DEPENDENT.
⁷Key and referential integrity constraints were not included in early versions of SQL.
⁸Notice that the foreign key Super_ssn in the EMPLOYEE table is a circular reference and hence may
have to be added later as a named constraint using the ALTER TABLE statement as we discussed at
the end of Section 6.1.2.
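As an illustration of the CASCADE rule of thumb for weak entity types, the following sketch uses two hypothetical cut-down tables, EMP_OWNER and DEPENDENT_DEMO, in place of the full EMPLOYEE and DEPENDENT declarations.

```sql
-- Hypothetical cut-down tables; only the attributes needed here are kept.
CREATE TABLE EMP_OWNER
( Ssn    CHAR(9)      NOT NULL,
  Lname  VARCHAR(15)  NOT NULL,
  PRIMARY KEY (Ssn) );

CREATE TABLE DEPENDENT_DEMO
( Essn            CHAR(9)      NOT NULL,
  Dependent_name  VARCHAR(15)  NOT NULL,
  PRIMARY KEY (Essn, Dependent_name),
  FOREIGN KEY (Essn) REFERENCES EMP_OWNER(Ssn)
    ON DELETE CASCADE ON UPDATE CASCADE );

INSERT INTO EMP_OWNER VALUES ('333445555', 'Wong');
INSERT INTO DEPENDENT_DEMO VALUES ('333445555', 'Alice');
INSERT INTO DEPENDENT_DEMO VALUES ('333445555', 'Theodore');

-- ON DELETE CASCADE: both dependent tuples of Wong are deleted as well.
DELETE FROM EMP_OWNER WHERE Ssn = '333445555';
```

CASCADE fits here because a dependent tuple has no meaning without its owner employee. SET NULL would not even be possible, since Essn is part of the primary key of DEPENDENT_DEMO and therefore cannot be NULL.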
6.2.3 Giving Names to Constraints
Figure 6.2 also illustrates how a constraint may be given a constraint name, follow-
ing the keyword CONSTRAINT. The names of all constraints within a particular
schema must be unique. A constraint name is used to identify a particular con-
straint in case the constraint must be dropped later and replaced with another con-
straint, as we discuss in Chapter 7. Giving names to constraints is optional. It is also
possible to temporarily defer a constraint until the end of a transaction, as we shall
discuss in Chapter 20 when we present transaction concepts.
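The following sketch suggests how a constraint name might be used to replace a constraint. It assumes that the EMPLOYEE table of Figure 6.2 exists with its constraint EMPSUPERFK; the replacement action ON DELETE RESTRICT is chosen only for illustration. The exact ALTER TABLE syntax differs among implementations, and some require a CASCADE or RESTRICT option after the constraint name when it is dropped.

```sql
-- Drop the named foreign key constraint declared in Figure 6.2 ...
ALTER TABLE EMPLOYEE
  DROP CONSTRAINT EMPSUPERFK;

-- ... and add a replacement with a different action on deletion.
ALTER TABLE EMPLOYEE
  ADD CONSTRAINT EMPSUPERFK
    FOREIGN KEY (Super_ssn) REFERENCES EMPLOYEE(Ssn)
      ON DELETE RESTRICT ON UPDATE CASCADE;
```

If the constraint had been left unnamed, the DBMS would have generated a name for it, and the designer would first have to look that name up before the constraint could be dropped.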
6.2.4 Specifying Constraints on Tuples Using CHECK
In addition to key and referential integrity constraints, which are specified by spe-
cial keywords, other table constraints can be specified through additional CHECK
clauses at the end of a CREATE TABLE statement. These can be called row-based
constraints because they apply to each row individually and are checked whenever
a row is inserted or modified. For example, suppose that the DEPARTMENT table in
Figure 6.1 had an additional attribute Dept_create_date, which stores the date when
the department was created. Then we could add the following CHECK clause at the
end of the CREATE TABLE statement for the DEPARTMENT table to make sure that a
manager’s start date is not earlier than the department creation date:
CHECK (Dept_create_date <= Mgr_start_date);
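To see the row-based check in context, consider the following sketch. DEPT_DATES is a hypothetical cut-down table, and the constraint name DEPT_DATES_CK and the date values are our own.

```sql
-- DEPT_DATES is a hypothetical cut-down version of DEPARTMENT.
CREATE TABLE DEPT_DATES
( Dnumber           INT   NOT NULL,
  Mgr_start_date    DATE,
  Dept_create_date  DATE,
  PRIMARY KEY (Dnumber),
  CONSTRAINT DEPT_DATES_CK
    CHECK (Dept_create_date <= Mgr_start_date) );

-- Accepted: the department was created before the manager started.
INSERT INTO DEPT_DATES VALUES (5, DATE '1988-05-22', DATE '1980-01-01');

-- Rejected: the manager would start before the department exists.
INSERT INTO DEPT_DATES VALUES (4, DATE '1995-01-01', DATE '2000-01-01');

-- Accepted: with a NULL start date the condition is unknown, not false.
INSERT INTO DEPT_DATES VALUES (1, NULL, DATE '1981-06-19');
```

The third INSERT is worth noting: a CHECK constraint rejects a row only when its condition evaluates to false, so a NULL in either date lets the row pass. If that is not intended, the attributes should also be declared NOT NULL.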
The CHECK clause can also be used to specify more general constraints using
the CREATE ASSERTION statement of SQL. We discuss this in Chapter 7 because
it requires the full power of queries, which are discussed in Sections 6.3
and 7.1.
6.3 Basic Retrieval Queries in SQL
SQL has one basic statement for retrieving information from a database: the
SELECT statement. The SELECT statement is not the same as the SELECT operation
of relational algebra, which we shall discuss in Chapter 8. There are many options
and flavors to the SELECT statement in SQL, so we will introduce its features grad-
ually. We will use example queries specified on the schema of Figure 5.5 and will
refer to the sample database state shown in Figure 5.6 to show the results of some
of these queries. In this section, we present the features of SQL for simple retrieval
queries. Features of SQL for specifying more complex retrieval queries are pre-
sented in Section 7.1.
Before proceeding, we must point out an important distinction between the practical
SQL model and the formal relational model discussed in Chapter 5: SQL allows a
table (relation) to have two or more tuples that are identical in all their attribute
values. Hence, in general, an SQL table is not a set of tuples, because a set does not
allow two identical members; rather, it is a multiset (sometimes called a bag) of
tuples. Some SQL relations are constrained to be sets because a key constraint has
been declared or because the DISTINCT option has been used with the SELECT state-
ment (described later in this section). We should be aware of this distinction as we
discuss the examples.
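The distinction can be sketched with a small example. SALARY_DEMO is a hypothetical table that deliberately declares no key, so that identical tuples are permitted.

```sql
-- SALARY_DEMO is hypothetical and has no key constraint.
CREATE TABLE SALARY_DEMO
( Lname   VARCHAR(15),
  Salary  DECIMAL(10,2) );

INSERT INTO SALARY_DEMO VALUES ('Smith', 30000);
INSERT INTO SALARY_DEMO VALUES ('Smith', 30000);  -- identical tuple is accepted
INSERT INTO SALARY_DEMO VALUES ('Wong',  40000);

-- Multiset result: three rows, with ('Smith', 30000) appearing twice.
SELECT Lname, Salary FROM SALARY_DEMO;

-- Set result: two rows, because duplicates are eliminated.
SELECT DISTINCT Lname, Salary FROM SALARY_DEMO;
```

Had a PRIMARY KEY or UNIQUE constraint been declared on Lname, the second INSERT would have been rejected and the table itself would be constrained to be a set.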
6.3.1 The SELECT-FROM-WHERE Structure of Basic SQL Queries
Queries in SQL can be very complex. We will start with simple queries, and then
progress to more complex ones in a step-by-step manner. The basic form of the
SELECT statement, sometimes called a mapping or a select-from-where block, is
formed of the three clauses SELECT, FROM, and WHERE and has the following form:⁹
SELECT
FROM