CS 640 Principles of Database Management and Use
Winter 2013
https://cs.uwaterloo.ca/~ohartig/CS640/
DC 2556, x33734
"An overview of relational databases and how they are used; exposure to relational database technology. Fundamentals of transactions. Database views. Introductions to two or three alternative data models and systems, such as those used for structured text, spatial data, multimedia data, and information retrieval. Introduction to several current topics in database research, such as warehousing, data mining, managing data streams, data cleaning, data integration, and distributed databases."
Note that this is an introductory course. Students who are already familiar with the basic principles of database management and use may be more interested in the database systems course (CS 648).
Students are expected to understand the fundamentals of programming languages, data structures, operating systems, and algorithms, each at least at the level of an introductory course.
Students are required to review the materials concerning academic integrity and academic honesty. Each student must complete and sign the Academic Integrity Acknowledgement Form, and hand it in before the first assignment is due.
The course is structured around a collection of published research articles. Each of these articles includes an extensive bibliography of related work. There are also many database textbooks that are used to teach database fundamentals to undergraduate students. Students might wish to refer to one of the following:
The Encyclopedia of Database Systems includes short, readable articles on many topics:
and several topics are covered by short monographs in the series:
Finally, the database area is well-covered by Michael Ley's online DBLP bibliography:
There will be no tests or exams.
Tuesdays and Thursdays 11:30 - 12:50; DC 3313
 | Tuesday | Thursday | Assignments due |
Jan 8 / 10 | Presentation: Introduction, ER | Presentation: ER (cont'd), RM (basics) | |
Jan 15 / 17 | Presentation: RM (cont'd) | Discussion: ER models + DB design + ER to RM | |
Jan 22 / 24 | Presentation*: Query languages (RA) | - no class - | |
Jan 29 / 31 | - no class - | Discussion*: RM (normalization) | A1 |
Feb 5 / 7 | Presentation*: Physical design | Discussion: Query languages (SEQUEL/SQL, QBE) | |
Feb 12 / 14 | Presentation: Query processing | Discussion: Physical design and storage | |
Feb 19 / 21 | - no class - (reading week) | - no class - (reading week) | |
Feb 26 / 28 | Presentation: Transaction management | Discussion: Query processing + optimization | |
Mar 5 / 7 | Presentation: Security | Discussion: Transaction mgmt. (concurrency control, locking) | A2 |
Mar 12 / 14 | Presentation: Views | Discussion: Transaction mgmt. (recovery) | |
Mar 19 / 21 | Presentation: Distributed DBs | Discussion: Views | A3 |
Mar 26 / 28 | Presentation: DB Application Development | Discussion: Parallel Database Systems | |
Apr 2 / 4 | Presentation: The Web as a DB | Discussion: Data Warehousing | A4 |
*instructor: M. Tamer Özsu
For each topic, background material and some of the principal ideas will be introduced first, and (except for the last topic) the related publications will be discussed the following week. Students must read the assigned publications in advance of the class in which the content will be discussed. The following questions are intended to guide your reading and help you prepare for the discussion in class. You need not submit written answers unless specifically instructed in advance to do so. Nevertheless, you will find the discussion more fruitful if you are prepared to answer the following questions:
Furthermore, you will likely find Keshav's essay entitled "How to Read a Paper" instructive.
"The Entity-Relationship (ER) model and its accompanying ER diagrams are widely used for database design and systems analysis.... We will present step-by-step guidelines, a set of decision rules proven to be useful in building ER diagrams, and a case study problem with a preferred answer as well as a set of incorrect diagrams for the problem."
"A database design methodology is defined for the design of large relational databases. First, the data requirements are conceptualized using an extended entity-relationship model, with the extensions being additional semantics such as ternary relationships, optional relationships, and the generalization abstraction. The extended entity-relationship model is then decomposed according to a set of basic entity-relationship constructs, and these are transformed into candidate relations. A set of basic transformations has been developed for the three types of relations: entity relations, extended entity relations, and relationship relations. Candidate relations are further analyzed and modified to attain the highest degree of normalization desired. The methodology produces database designs that are not only accurate representations of reality, but flexible enough to accommodate future processing requirements. It also reduces the number of data dependencies that must be analyzed, using the extended ER model conceptualization, and maintains data integrity through normalization. This approach can be implemented manually or in a simple software package as long as a "good" solution is acceptable and absolute optimality is not required."
(!) Note that Sections 5.3.3, 5.3.5, 5.5, and 5.6 are not relevant for the class discussion (and thus do not need to be read). Furthermore, the proofs may be skimmed.
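To make the kind of transformation described above concrete, here is a minimal sketch of how an entity set and a one-to-many relationship might be mapped to candidate relations. The Department/Employee schema is purely illustrative (it is not taken from the paper), and the DDL is run through Python's built-in sqlite3 module:

    import sqlite3

    # Entity sets Department and Employee; a one-to-many "works in"
    # relationship is folded into Employee as a foreign key, yielding
    # candidate relations that are already in a normalized form.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,   -- key attribute of the entity set
            name    TEXT NOT NULL
        );
        CREATE TABLE employee (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            dept_id INTEGER REFERENCES department(dept_id)  -- the 1:N relationship
        );
    """)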
"SEQUEL 2 is a relational data language that provides a consistent, English keyword-oriented set of facilities for query, data definition, data manipulation, and data control. SEQUEL 2 may be used either as a stand-alone interface for nonspecialists in data processing or as a data sublanguage embedded in a host programming language for use by application programmers and data base administrators. This paper describes SEQUEL 2 and the means by which it is coupled to a host language."
"Discussed is a high-level data base management language that provides the user with a convenient and unified interface to query, update, define, and control a data base. When the user performs an operation against the data base, he fills in an example of a solution to that operation in skeleton tables that can be associated with actual tables in the data base. The system is currently being used experimentally for various applications."
To install your own copy of DB2, follow these instructions.
"We address the problem of automatically adjusting the physical organization of a data base to optimize its performance as its access requirements change. We describe the principles of the automatic index selection facility of a prototype self-adaptive data base management system that is currently under development. The importance of accurate usage model acquisition and data characteristics estimation is stressed. The statistics gathering mechanisms that are being incorporated into our prototype system are discussed. Exponential smoothing techniques are used for averaging statistics observed over different periods of time in order to predict future characteristics. An heuristic algorithm for selecting indices to match projected access requirements is presented. The cost model on which the decision procedure is based is flexible enough to incorporate the overhead costs of index creation, index storage and application program recompilation."
"Relational database systems have traditionally optimzed for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM’s stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster."
"There has been extensive work in query optimization since the early '70s. It is hard to capture the breadth and depth of this large body of work in a short article. Therefore, I have decided to focus primarily on the optimization of SQL queries in relational database systems and present my biased and incomplete view of this field. The goal of this article is not to be comprehensive, but rather to explain the foundations and present samplings of significant work in this area. I would like to apologize to the many contributors in this area whose work I have failed to explicitly acknowledge due to oversight or lack of space. I take the liberty of trading technical precision for ease of presentation."
"The problem of choosing the appropriate granularity (size) of lockable objects is introduced and the tradeoff between concurrency and overhead is discussed. A locking protocol which allows simultaneous locking at various granularities by different transactions is presented. It is based on the introduction of additional lock modes besides the conventional share mode and exclusive mode. A proof is given of the equivalence of this protocol to a conventional one...."
"In this paper, a terminological framework is provided for describing different transactionoriented recovery schemes for database systems in a conceptual rather than an implementation-dependent way. By introducing the terms materialized database, propagation strategy, and checkpoint, we obtain a means for classifying arbitrary implementations from a unified viewpoint. This is complemented by a classification scheme for logging techniques, which are precisely defined by using the other terms. It is shown that these criteria are related to all relevant questions such as speed and scope of recovery and amount of redundant information required. The primary purpose of this paper, however, is to establish an adequate and precise terminology for a topic in which the confusion of concepts and implementational aspects still imposes a lot of problems."
"... it is desirable to provide the users with interfaces that give them only information that is relevant to them. In shared relational databases, which are the subject of this article, this is done by defining views for each class of users. Views represent simplified models of the database, and users can express queries and updates against them. How to handle queries expressed against views is well understood: The user's query is composed with the view definition so as to obtain a query that can be executed on the underlying database. Similarly, updates expressed against a view have to be translated into updates that can be executed on the underlying database...."
"Materialized views can provide massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must know how and when to exploit materialized views. This paper presents a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and describes how it can be incorporated in transformation-based optimizers. The current version handles views composed of selections, joins and a final group-by. Optimization remains fully cost based, that is, a single best rewrite is not selected by heuristic rules but multiple rewrites are generated and the optimizer chooses the best alternative in the normal way. Experimental results based on an implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand."
"Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996."
"Automating clinical and administrative processes via an electronic patient record (EPR) gives clinicians the point-of-care tools they need to deliver better patient care. However, to improve clinical practice as a whole and then evaluate it, healthcare must go beyond basic automation and convert EPR data into aggregated, multidimensional information. Unfortunately, few EPR systems have the established, powerful analytical clinical data warehproxy.lib.uwaterloo.caouses (CDWs) required for this conversion. This article describes how an organization can support best practice by leveraging a CDW that is fully integrated into its EPR and clinical decision support (CDS) system. The article (1) discusses the requirements for comprehensive CDS, including on-line analytical processing (OLAP) of data at both transactional and aggregate levels, (2) suggests that the transactional data acquired by an OLTP EPR system must be remodeled to support retrospective, population-based, aggregate analysis of those data, and (3) concludes that this aggregate analysis is best provided by a separate CDW system."
Apr. 4, 2013. Olaf Hartig