CS 640 Principles of Database Management and Use
Winter 2013
https://cs.uwaterloo.ca/~ohartig/CS640/
DC 2556, x33734
"An overview of relational databases and how they are used; exposure to relational database technology. Fundamentals of transactions. Database views. Introductions to two or three alternative data models and systems, such as those used for structured text, spatial data, multimedia data, and information retrieval. Introduction to several current topics in database research, such as warehousing, data mining, managing data streams, data cleaning, data integration, and distributed databases."
Note that this is an introductory course. Students who are already familiar with the basic principles of database management and use may be more interested in the database systems course (CS 648).
Students are expected to understand the fundamentals of programming languages, data structures, operating systems, and algorithms, each at least at the level of an introductory course.
Students are required to review the materials concerning academic integrity and academic honesty. Each student must complete and sign the Academic Integrity Acknowledgement Form, and hand it in before the first assignment is due.
The course is structured around a collection of published research articles. Each of these articles includes an extensive bibliography of related work. There are also many database textbooks that are used to teach database fundamentals to undergraduate students. Students might wish to refer to one of the following:
The Encyclopedia of Database Systems includes short, readable articles on many topics:
and several topics are covered by short monographs in the series:
Finally, the database area is well-covered by Michael Ley's online DBLP bibliography:
There will be no tests or exams.
Tuesdays and Thursdays 11:30 - 12:50; DC 3313
 | Tuesday | Thursday | Assignments due |
Jan 8 / 10 | Presentation: Introduction, ER | Presentation: ER (cont'd), RM (basics) | |
Jan 15 / 17 | Presentation: RM (cont'd) | Discussion: ER models + DB design + ER to RM | |
Jan 22 / 24 | Presentation*: Query languages (RA) | - no class - | |
Jan 29 / 31 | - no class - | Discussion*: RM (normalization) | A1 |
Feb 5 / 7 | Presentation*: Physical design | Discussion: Query languages (SEQUEL/SQL, QBE) | |
Feb 12 / 14 | Presentation: Query processing | Discussion: Physical design and storage | |
Feb 19 / 21 | - no class - (reading week) | - no class - (reading week) | |
Feb 26 / 28 | Presentation: Transaction management | Discussion: Query processing + optimization | |
Mar 5 / 7 | Presentation: Security | Discussion: Transaction mgmt. (concurrency control, locking) | A2 |
Mar 12 / 14 | Presentation: Views | Discussion: Transaction mgmt. (recovery) | |
Mar 19 / 21 | Presentation: Distributed DBs | Discussion: Views | A3 |
Mar 26 / 28 | Presentation: DB Application Development | Discussion: Parallel Database Systems | |
Apr 2 / 4 | Presentation: The Web as a DB | Discussion: Data Warehousing | A4 |
*instructor: M. Tamer Özsu
For each topic, background material and some of the principal ideas will be introduced first, and (except for the last topic) the related publications will be discussed the following week. Students must read the assigned publications in advance of the class in which the content will be discussed. The following questions are intended to guide your reading and help you prepare for the discussion in class. You need not submit written answers unless specifically instructed in advance to do so. Nevertheless, you will find the discussion more fruitful if you are prepared to answer the following questions:
Furthermore, you will likely find Keshav's essay entitled "How to Read a Paper" instructive.
"The Entity-Relationship (ER) model and its accompanying ER diagrams are widely used for database design and systems analysis.... We will present step-by-step guidelines, a set of decision rules proven to be useful in building ER diagrams, and a case study problem with a preferred answer as well as a set of incorrect diagrams for the problem."
"A database design methodology is defined for the design of large relational databases. First, the data requirements are conceptualized using an extended entity-relationship model, with the extensions being additional semantics such as ternary relationships, optional relationships, and the generalization abstraction. The extended entity-relationship model is then decomposed according to a set of basic entity-relationship constructs, and these are transformed into candidate relations. A set of basic transformations has been developed for the three types of relations: entity relations, extended entity relations, and relationship relations. Candidate relations are further analyzed and modified to attain the highest degree of normalization desired. The methodology produces database designs that are not only accurate representations of reality, but flexible enough to accommodate future processing requirements. It also reduces the number of data dependencies that must be analyzed, using the extended ER model conceptualization, and maintains data integrity through normalization. This approach can be implemented manually or in a simple software package as long as a "good" solution is acceptable and absolute optimality is not required."
(!) Note that Sections 5.3.3, 5.3.5, 5.5, and 5.6 are not relevant for the class discussion (and thus do not need to be read). Furthermore, the proofs may be skimmed.
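To make the kind of transformation described above concrete, here is a minimal sketch of how an entity set and a one-to-many relationship might be mapped to candidate relations. The Department/Employee schema is purely illustrative (it is not taken from the paper), and the DDL is run through Python's built-in sqlite3 module:

    import sqlite3

    # Entity sets Department and Employee; a one-to-many "works in"
    # relationship is folded into Employee as a foreign key, yielding
    # candidate relations that are already in a normalized form.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,   -- key attribute of the entity set
            name    TEXT NOT NULL
        );
        CREATE TABLE employee (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            dept_id INTEGER REFERENCES department(dept_id)  -- the 1:N relationship
        );
    """)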
"SEQUEL 2 is a relational data language that provides a consistent, English keyword-oriented set of facilities for query, data definition, data manipulation, and data control. SEQUEL 2 may be used either as a stand-alone interface for nonspecialists in data processing or as a data sublanguage embedded in a host programming language for use by application programmers and data base administrators. This paper describes SEQUEL 2 and the means by which it is coupled to a host language."
"Discussed is a high-level data base management language that provides the user with a convenient and unified interface to query, update, define, and control a data base. When the user performs an operation against the data base, he fills in an example of a solution to that operation in skeleton tables that can be associated with actual tables in the data base. The system is currently being used experimentally for various applications."
To install your own copy of DB2, follow these instructions.
"We address the problem of automatically adjusting the physical organization of a data base to optimize its performance as its access requirements change. We describe the principles of the automatic index selection facility of a prototype self-adaptive data base management system that is currently under development. The importance of accurate usage model acquisition and data characteristics estimation is stressed. The statistics gathering mechanisms that are being incorporated into our prototype system are discussed. Exponential smoothing techniques are used for averaging statistics observed over different periods of time in order to predict future characteristics. An heuristic algorithm for selecting indices to match projected access requirements is presented. The cost model on which the decision procedure is based is flexible enough to incorporate the overhead costs of index creation, index storage and application program recompilation."
"Relational database systems have traditionally optimzed for I/O performance and organized records sequentially on disk pages using the N-ary Storage Model (NSM) (a.k.a., slotted pages). Recent research, however, indicates that cache utilization and performance is becoming increasingly important on modern platforms. In this paper, we first demonstrate that in-page data placement is the key to high cache performance and that NSM exhibits low cache utilization on modern platforms. Next, we propose a new data organization model called PAX (Partition Attributes Across), that significantly improves cache performance by grouping together all values of each attribute within each page. Because PAX only affects layout inside the pages, it incurs no storage penalty and does not affect I/O behavior. According to our experimental results, when compared to NSM (a) PAX exhibits superior cache and memory bandwidth utilization, saving at least 75% of NSM’s stall time due to data cache accesses, (b) range selection queries and updates on memory-resident relations execute 17-25% faster, and (c) TPC-H queries involving I/O execute 11-48% faster."
"There has been extensive work in query optimization since the early '70s. It is hard to capture the breadth and depth of this large body of work in a short article. Therefore, I have decided to focus primarily on the optimization of SQL queries in relational database systems and present my biased and incomplete view of this field. The goal of this article is not to be comprehensive, but rather to explain the foundations and present samplings of significant work in this area. I would like to apologize to the many contributors in this area whose work I have failed to explicitly acknowledge due to oversight or lack of space. I take the liberty of trading technical precision for ease of presentation."
"The problem of choosing the appropriate granularity (size) of lockable objects is introduced and the tradeoff between concurrency and overhead is discussed. A locking protocol which allows simultaneous locking at various granularities by different transactions is presented. It is based on the introduction of additional lock modes besides the conventional share mode and exclusive mode. A proof is given of the equivalence of this protocol to a conventional one...."
"In this paper, a terminological framework is provided for describing different transactionoriented recovery schemes for database systems in a conceptual rather than an implementation-dependent way. By introducing the terms materialized database, propagation strategy, and checkpoint, we obtain a means for classifying arbitrary implementations from a unified viewpoint. This is complemented by a classification scheme for logging techniques, which are precisely defined by using the other terms. It is shown that these criteria are related to all relevant questions such as speed and scope of recovery and amount of redundant information required. The primary purpose of this paper, however, is to establish an adequate and precise terminology for a topic in which the confusion of concepts and implementational aspects still imposes a lot of problems."
"... it is desirable to provide the users with interfaces that give them only information that is relevant to them. In shared relational databases, which are the subject of this article, this is done by defining views for each class of users. Views represent simplified models of the database, and users can express queries and updates against them. How to handle queries expressed against views is well understood: The user's query is composed with the view definition so as to obtain a query that can be executed on the underlying database. Similarly, updates expressed against a view have to be translated into updates that can be executed on the underlying database...."
"Materialized views can provide massive improvements in query processing time, especially for aggregation queries over large tables. To realize this potential, the query optimizer must know how and when to exploit materialized views. This paper presents a fast and scalable algorithm for determining whether part or all of a query can be computed from materialized views and describes how it can be incorporated in transformation-based optimizers. The current version handles views composed of selections, joins and a final group-by. Optimization remains fully cost based, that is, a single best rewrite is not selected by heuristic rules but multiple rewrites are generated and the optimizer chooses the best alternative in the normal way. Experimental results based on an implementation in Microsoft SQL Server show outstanding performance and scalability. Optimization time increases slowly with the number of views but remains low even up to a thousand."
"Data warehousing and on-line analytical processing (OLAP) are essential elements of decision support, which has increasingly become a focus of the database industry. Many commercial products and services are now available, and all of the principal database management system vendors now have offerings in these areas. Decision support places some rather different requirements on database technology compared to traditional on-line transaction processing applications. This paper provides an overview of data warehousing and OLAP technologies, with an emphasis on their new requirements. We describe back end tools for extracting, cleaning and loading data into a data warehouse; multidimensional data models typical of OLAP; front end client tools for querying and data analysis; server extensions for efficient query processing; and tools for metadata management and for managing the warehouse. In addition to surveying the state of the art, this paper also identifies some promising research issues, some of which are related to problems that the database research community has worked on for years, but others are only just beginning to be addressed. This overview is based on a tutorial that the authors presented at the VLDB Conference, 1996."
"Automating clinical and administrative processes via an electronic patient record (EPR) gives clinicians the point-of-care tools they need to deliver better patient care. However, to improve clinical practice as a whole and then evaluate it, healthcare must go beyond basic automation and convert EPR data into aggregated, multidimensional information. Unfortunately, few EPR systems have the established, powerful analytical clinical data warehproxy.lib.uwaterloo.caouses (CDWs) required for this conversion. This article describes how an organization can support best practice by leveraging a CDW that is fully integrated into its EPR and clinical decision support (CDS) system. The article (1) discusses the requirements for comprehensive CDS, including on-line analytical processing (OLAP) of data at both transactional and aggregate levels, (2) suggests that the transactional data acquired by an OLTP EPR system must be remodeled to support retrospective, population-based, aggregate analysis of those data, and (3) concludes that this aggregate analysis is best provided by a separate CDW system."
Apr. 4, 2013. Olaf Hartig