CSCI 598/682 with Dr. J, California State University, Chico

CSCI 598:  Advanced Topics in Computer Science
CSCI 682:  Topics in Artificial Intelligence


Data Mining: Knowledge Discovery in Databases


Registration Information

This course is also offered as a special-session, self-paced, archived course. If you are interested in signing up for this course as a distance education student, please contact the Center for Regional and Continuing Education (RCE) by e-mail at rce@csuchico.edu or by phone at 530 898-6105 for detailed registration information.


Information for local students:


Term/Year     Class Number   Section       Act   Days   Time        Room
Spring 2008   6331           CSCI 682-01   DIS   TR     1230-0145   OCNL 239
Fall 2006     5900           CSCI 598-02   DIS   TR     1100-1215   MLIB 031
 

* Archived webcast available through HorizonLive! as part of the CHICO Computer Science Program.


Description

"Data mining, also known as knowledge-discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition."
-  From Wikipedia at http://en.wikipedia.org/wiki/Data_mining

Data mining is "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data."
-  From W. Frawley and G. Piatetsky-Shapiro and C. Matheus,
"Knowledge discovery in databases: An overview."
AI Magazine, Fall 1992, pages 213-228.

Data mining is "the science of extracting useful information from large data sets or databases."
-  From D. Hand, H. Mannila, and P. Smyth, Principles of Data Mining.
MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X.

See also these ACM Computing Reviews 2008 articles: Adversarial Information Retrieval, Computational Intelligence in Human Genetics

This course introduces the student to basic concepts, tasks, methods, and techniques in data mining; in particular, the course focuses on practical machine learning tools and techniques used in data mining. Students will develop an understanding of the data mining process and issues, learn various techniques for data mining, and apply the techniques in solving data mining problems using data mining tools and systems.

Students from departments such as Statistics, Biology, Mathematics, and Electrical & Computer Engineering who are working in interdisciplinary research (e.g., bioinformatics, modeling, data analysis) are especially encouraged to take this course.

Prerequisites

[CSCI 598] CSCI 311 or graduate standing.
[CSCI 682] Graduate standing or permission of instructor.



Required Text(s)

Data Mining: Practical Machine Learning Tools and Techniques, 2/e
Ian Witten and Eibe Frank, 2005.
Elsevier Inc. Burlington, Massachusetts.
ISBN 0-12-088407-0

Also available: Companion website for the textbook.


Additional Requirements

Students will be required to open and maintain a Chico State Connection (CSC Portal) account.
 
Students are responsible for regularly checking their WebCT Vista account (automatically generated through the CSC Portal) to access an up-to-date on-line calendar of events, current scores, on-line quizzes, etc.
 
Students are expected to use the WEKA open source data mining software in Java.



Objectives

  1. To become familiar with the fundamental concepts of data mining relative to the computing sciences.
  2. To become competent in recognizing which machine learning algorithms and techniques are available for specific application areas in data mining.
  3. To develop an understanding of data mining sufficient to facilitate conversations with the machine learning community.


Grade Evaluation

This course is designed to give students equal exposure to both Theory and Practice. Students are expected to demonstrate proficiency in both the theoretical and practical aspects of this course.

 
   60%    Written homework/assignments; laboratory (WEKA) projects
   35%    (Individual) Research paper
          Components (as percentages of the research paper grade):
          • 5%    Title/topic and brief description
          • 15%   Annotated bibliography
          • 20%   Rough draft
          • 50%   Final paper
          • 10%   Oral presentation (mandatory for local students)
          Selected topics for research papers:
          • Web search and Web mining
          • Data mining for fraud detection
          • Mining text and sequential data
          • Data mining in bioinformatics (genomics, proteomics, etc.)
    5%    Class participation (local students)
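
As an illustration only, the weighting above can be sketched in Python. The weights come from the table; the function and dictionary names are hypothetical, and the sketch assumes a local student with a percentage score (0 to 100) for every component. It does not cover how the participation share is handled for distance students, which is not stated here.

```python
# Weights copied from the Grade Evaluation table; all names are illustrative.
COURSE_WEIGHTS = {
    "homework_and_labs": 0.60,
    "research_paper": 0.35,
    "participation": 0.05,
}

# Research paper components, as fractions of the research paper grade.
PAPER_WEIGHTS = {
    "topic": 0.05,
    "bibliography": 0.15,
    "rough_draft": 0.20,
    "final_paper": 0.50,
    "presentation": 0.10,
}


def weighted_score(scores, weights):
    """Return the weighted sum of percentage scores (each 0-100)."""
    return sum(weights[name] * scores[name] for name in weights)


def course_percentage(homework_and_labs, paper_scores, participation):
    """Combine the three course components into one overall percentage."""
    paper = weighted_score(paper_scores, PAPER_WEIGHTS)
    return weighted_score(
        {
            "homework_and_labs": homework_and_labs,
            "research_paper": paper,
            "participation": participation,
        },
        COURSE_WEIGHTS,
    )


if __name__ == "__main__":
    # Hypothetical scores, not real data.
    paper_scores = {name: 90.0 for name in PAPER_WEIGHTS}
    print(round(course_percentage(80.0, paper_scores, 100.0), 2))
```

With these made-up scores the overall percentage works out to 0.60 × 80 + 0.35 × 90 + 0.05 × 100, or about 84.5.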
 

Students are expected to turn in all course requirements assigned by the professor; otherwise, the professor reserves the right to assign a final grade lower than the one the student would calculate from the scores alone.


Final Grades

Final grades shall be expressed as a percentage of the maximum possible score of all evaluated materials. Letter grades will be given according to the following scheme:


Real Interval      Letter Grade   University Definition
[96.25, 100.00]    A              Superior Work
[92.50, 96.25)     A-
[88.75, 92.50)     B+             Very Good Work
[85.00, 88.75)     B
[81.25, 85.00)     B-
[77.50, 81.25)     C+             Adequate Work
[73.75, 77.50)     C
[70.00, 73.75)     C-
[66, 70)           D+             Minimally Acceptable Work
[60, 66)           D
[ 0, 60)           F              Unacceptable Work
     


Note:  It is Dr. J's policy not to assign a final grade of D or D+ to graduate students. Hence,
graduate students with a class standing below C- (70%) earn a final grade of F.
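
For illustration, the grading scheme and the note on graduate students can be sketched as a small Python lookup. The interval bounds are copied from the table; the function name and the `graduate` flag are hypothetical, and the sketch assumes the plain letters A, B, C, D, and F fall between the plus and minus grades in the usual way.

```python
# (lower bound, letter) pairs, highest first. Each interval is closed on the
# left, so a score equal to a bound earns the higher grade.
GRADE_BOUNDS = [
    (96.25, "A"),
    (92.50, "A-"),
    (88.75, "B+"),
    (85.00, "B"),
    (81.25, "B-"),
    (77.50, "C+"),
    (73.75, "C"),
    (70.00, "C-"),
    (66.00, "D+"),
    (60.00, "D"),
]


def letter_grade(percentage, graduate=False):
    """Map an overall percentage (0-100) to a letter grade.

    With graduate=True, anything below C- (70%) becomes F, following the
    note on graduate students.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    for lower, letter in GRADE_BOUNDS:
        if percentage >= lower:
            if graduate and letter in ("D+", "D"):
                return "F"
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical scores, not real data.
    for score in (96.25, 84.5, 68.0):
        print(score, letter_grade(score), letter_grade(score, graduate=True))
```

Storing only the lower bounds keeps the lookup to a single pass and makes the half-open intervals explicit: a score of exactly 92.50 is an A-, while 92.49 is a B+.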



General Policies

Dr. J has some general policies (see http://www.ecst.csuchico.edu/~juliano/Teaching/Policies.html) that apply to all courses that he teaches. Students are expected to read and understand these policies upon registering for the course.

Please note that these policies are designed specifically for Dr. J's on-site courses; not all policies may apply to this course, particularly if you are registered through the Center for Regional and Continuing Education (RCE) as a remote student. You must contact Dr. J if you have any questions or concerns regarding the applicability of a policy to this course.



Topical Coverage

Decision tables, decision trees, classification rules, association rules, numeric prediction, linear models, instance-based learning, clustering, training and testing, predicting performance, cross-validation, Bayesian networks, combining multiple models.