CPSC 350 Database Essentials and Data Mining

From MWCSWiki

NEW COURSE PROPOSAL

CPSC 350 Database Essentials and Data Mining

Course History

A. Has this course been taught as a "Topics" or "Experimental" course? If so, list previous title and number. (NOTE: Under current guidelines, courses do not have to be taught on an "experimental" basis before being approved.)

In spring 2007, the special topics course CPSC 370 – “Database Essentials and Data Mining” will be offered as a pilot for what we hope will become this permanent course. In the past, some of this subject matter has also been covered in another course, CPSC 410 (Database Principles and Design). The relationship between the two courses is explained below.

B. What were enrollments in all previous offerings? (Marsha, what have the enrollments been for Rita’s class?)

C. When was the course taught? N/A

Catalogue Description

Department/discipline and proposed number.

CPSC 350 (course number not yet approved)

Brief description of course content (include a syllabus).

Introduces logical database modeling and design, query languages, and issues in data representation. Surveys data mining techniques for extracting useful knowledge from large-scale data stores, including preparation of data sets, statistical algorithms for data analysis, and evaluation of algorithm performance. May involve student team projects to develop small but representative data collection and analysis applications.

Prerequisites.

CPSC 125, CPSC 230

Number of credits proposed.

3 Credits.

Relevance and Importance of the course:

A. To department/discipline offerings.

What is the reason for this course proposal (i.e., how does it relate to other courses in the department's curriculum)?

The management of information has become one of the key issues confronting technologists today, especially as the so-called “information explosion” increases. Nearly every application domain of computer science is generating large volumes of information – whether scientific observations, marketing data, or English text – and demanding that it be stored, retrieved, and analyzed. Indeed, the primary function of many of today’s computer systems is this ability to help users access their information, with all other functions merely supplemental.

Because database systems are so pervasive within both academia and industry, and because they are so broadly applicable, we feel that it is important for all of our majors to learn the theory behind them and gain experience designing them. We also want this to take place relatively early in the curriculum, since projects in later courses (e.g., CPSC 390 – Software Engineering, CPSC 415 – Artificial Intelligence) as well as many individual study and research projects, will feature a database in some part of their design. In fact, the principles underlying how to represent and query data are so ubiquitous that it seems foolish to delay introducing them any longer than strictly necessary. We seek, therefore, to establish a junior-level course, required for all majors, that will present students with the fundamentals of how database systems operate and give them practical experience in working with them.

This view about the importance of databases in the curriculum is confirmed by the Association for Computing Machinery (ACM), the world’s most prestigious organization for technology professionals and educators. The ACM publishes curriculum guidelines for computer science programs (where can these be found?), and the most recent version of these (what year?) recommends that a course in database theory and practice be required for an undergraduate degree in the field. Additionally, we have received strong feedback from local alumni on this topic. Looking back at their undergraduate coursework in light of the issues that arise in their developing careers, a significant number of them have said that material on databases is what was most lacking in their studies at Mary Washington. For all of these reasons, we feel it is imperative to introduce such a course and make it mandatory for all majors.

This proposed course would complement, not replace, our existing course entitled CPSC 410 – Database Principles and Design. To see why, it is important to understand the difference between a “database” and a “database management system” (or “DBMS”). A database is simply a structured repository of information. A DBMS, on the other hand, is a type of system (such as Oracle or Microsoft SQL Server) that software developers use to create a database. CPSC 410 will concentrate on the issues involved in implementing a DBMS, such as concurrency control, recovery, transaction logs, and distributed protocols. These topics are common to many standard courses on databases, yet they represent a portion of the database field that is far less widely applicable than what we will cover in CPSC 350. The distinction between the two courses is thus how databases can be used as components of a larger solution (350) versus how database management systems, which developers use to create databases, are themselves implemented (410). The latter topic involves rich and fertile material, and hence it will remain an elective. Requiring juniors to master the material in CPSC 350 will allow 410 instructors to assume this basic knowledge and to have sufficient time to explore these advanced topics in more detail.
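As an illustration of the CPSC 350 side of this distinction, the following minimal sketch uses a database as one component of a small program. It assumes Python's built-in sqlite3 module stands in for the DBMS (SQLite is not one of the systems named above), and the table names and sample rows are hypothetical. The program only declares a schema and asks questions of it; storage, transactions, and recovery are left entirely to the DBMS, which is the material CPSC 410 would examine.

```python
import sqlite3

# SQLite plays the role of the DBMS; ":memory:" keeps the database in RAM.
conn = sqlite3.connect(":memory:")

# Schema (metadata): hypothetical tables for students and their enrollments.
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE enrollment ("
    " student_id INTEGER NOT NULL REFERENCES student(id),"
    " course TEXT NOT NULL)"
)

# Instance (data): the 'with' block commits on success and rolls back on error.
with conn:
    conn.executemany(
        "INSERT INTO student VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
    )
    conn.executemany(
        "INSERT INTO enrollment VALUES (?, ?)",
        [(1, "CPSC 350"), (2, "CPSC 350"), (2, "CPSC 410")],
    )

# A declarative query: we state what we want, not how to compute it.
rows = conn.execute(
    "SELECT s.name, COUNT(*) AS courses"
    " FROM student s JOIN enrollment e ON e.student_id = s.id"
    " GROUP BY s.id, s.name"
    " ORDER BY s.name"
).fetchall()

for name, courses in rows:
    print(name, courses)
```

Nothing in this program says how rows are laid out on disk or how concurrent writers would be handled; those are implementation concerns of the DBMS itself.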

We also wish to include material in CPSC 350 on the emerging field of “data mining,” or extracting meaningful knowledge from large amounts of raw data. Increasingly, the sheer amount of data stored in many databases precludes it from being analyzed manually: the only way to make use of it is by employing statistical algorithms to derive general conclusions about trends and patterns. This is what allows, say, a supermarket chain to analyze millions of consumer purchases that are recorded at each cash register and draw conclusions about how best to arrange items on the shelves. We believe that in the future, merely understanding how databases operate without some exposure to these data mining concepts will be insufficient for our students. And since the basics of database theory can be comfortably covered in half a semester, this course presents a perfect opportunity to give students an introduction to this topic as well.
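The supermarket example can be made concrete with a small sketch. The one below assumes a handful of made-up purchase records and counts how often each pair of items is bought together, which is the simplest form of the pattern-finding described above. The item names, the `baskets` data, and the `pair_support` helper are all hypothetical, and real data mining tools use far more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records: each set is one customer's basket.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives every pair a single canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(baskets)

# The pair bought together most often is a candidate for adjacent shelving.
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

With millions of baskets and thousands of products, counting every pair this way becomes impractical, which is why the course would cover purpose-built algorithms and ways of evaluating them.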


2. If this course is approved, which course(s), if any, would be deleted from your current offerings?

None.

3. Will this course:

a) Be required for the major? Yes, both the traditional Computer Science major and also the new concentration in Information Systems.

b) Be an elective in the major? No.

c) Not count in the major? No.

B. To the Curriculum of the College:

How is this course related to the courses in other departments? N/A

Why is the proposing department the appropriate one to offer the course?

The Computer Science department has the practical and academic background that is required to teach this course.

Resource Requirements

A. Does the library have the materials needed to support the course (provide a list of material needed for the course that the library does not have on hand)? Yes.

B. Will this course require staff time from departments other than your own (Audio Visual, Library, etc.)? No.

C. Does the College have the faculty to teach the course (who, and how will this affect their current teaching duties)? (What should I say here?)

D. Does the College have the equipment, facilities, etc., currently on hand, that will be needed to teach the course (and, if not, tell how and where you will obtain these resources)? (Just checking – can we have student teams easily host PHP or RoR websites that are visible from anywhere on the Web?)

Effective Date

Spring 2008 (?)


Syllabus

SAMPLE COURSE SYLLABUS

CPSC 350: Database Essentials and Data Mining

Week(s) Topics

1 Database modeling concepts (abstraction, data/instance vs. metadata/schema, inconsistency & redundancy)

2-3 Relational database design (review of set theory, the relational model, functional dependencies, candidate & foreign keys, normalization)

4 Overview of the relational algebra (select, project, and join operators)

5 Overview of the relational calculus (declarative vs. procedural formulations)

5-6 SQL as an implementation of the relational calculus (DDL & DML, views, indexes)

7 Accessing data programmatically via embedded SQL (using C++/Java examples)

7-8 Web-driven databases (using an MVC framework)

9 Data mining concepts (inductive statistical inference; assumptions and limitations)

10 Data preparation and cleansing (issues); data mining paradigms: classification vs. clustering vs. numeric prediction

11-12 Algorithms: naïve Bayes classifiers, decision trees, rule-based covering

12 Feature selection

13 Evaluation techniques: cross-validation, bootstrapping

14 Project presentations

Texts

· An Introduction to Database Systems, C. J. Date, Addison-Wesley, 8th Edition.

· Data Mining: Practical Machine Learning Tools and Techniques, Ian H. Witten & Eibe Frank, Morgan Kaufmann, 2nd Edition.
