SENG 474: Data Mining

Instructor: Alex Thomo
Phone: (250) 472-5786
Office: ECS 556
Office Hours: MR 1:30 - 2:30 p.m.
Email: thomo@cs.uvic.ca
TA: Maryam Shoaran
Email: maryam@csc.uvic.ca
Course Outline: Link

Text:

Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Addison Wesley (2005)

References:

  1. Data Mining: Concepts and Techniques
    by Jiawei Han, Micheline Kamber
    Morgan Kaufmann; 2nd edition (2006)
  2. Data Mining: Practical Machine Learning Tools and Techniques
    by Ian H. Witten, Eibe Frank
    Morgan Kaufmann; 2nd edition (2005)

Marks so far: Link

Midterm: Thursday, Feb 15.

Midterm Solutions.

Assignments:

Assignment 1. Naive Bayes in Weka. Solutions.

Assignment 2. Solutions.

Assignment 3. Solutions.

ORACLE Data Mining: Tutorial.

Term Paper: Description.

Lecture Handouts:

Predictive Data Mining

  • Intro to Data Mining. Slides.
  • Data-Related Issues. Types of Attributes. Types of Data Sets. Data with Relationships among Objects. Data Quality. Feature Subset Selection. Slides.
  • Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain. Decision Trees with Numerical Attributes. Regression Trees. Slides.
  • Rule-Based Classifiers. Coverage and Accuracy. Decision Trees vs. Rules. Ordered Rule Sets. Separate-and-Conquer Algorithms. PRISM and RIPPER Algorithms. Slides.
  • Uncertain knowledge. Belief and Probability. Conditional probability. Bayes' Rule. Conditional Independence. Normalization constant. Naive Bayes Classifier. Text Categorization. Slides.
  • Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity. Slides.
  • Bayesian Belief Networks: Practice. Slides.
  • Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals. Holdout estimation. Cross-validation. The bootstrap. Counting the cost. Slides I. Slides II (practice).
  • ROC curves. Slides.
  • Precision and Recall. Slides.
  • Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm. Slides. Example Excel spreadsheet. See also Point-Line Distance.
  • Beyond Linear Separability: Artificial Neural Networks. Slides.
  • Midterm Review. Slides.

Association Analysis

  • Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting. Slides.
  • More on the Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in the Apriori Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets. Slides.
  • Alternative Methods for Frequent Itemset Generation. FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm. Slides.
  • Evaluation of Association Patterns: Objective Measures of Interestingness, Skewed Support Distribution, Cross-Support Patterns, Lowest Confidence Rule.
    Applications: Transforming Attributes. Multi-Level Association Rules. Slides. Simpson's Paradox. Slides.
  • Mining word associations. Min-Apriori. Slides.
  • Mining of sequences. Candidate Generation. Timing Constraints. Slides.
  • FP-Tree/FP-Growth practice. Slides.
  • Mining Graphs. Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates. Slides.

Cluster Analysis

  • Applications of Cluster Analysis. Types of Clusters. K-means Algorithm. Problems with Selecting Initial Points. Bisecting K-means. Limitations of K-means. Slides. Self-Organizing Maps. Slides.
  • Agglomerative Hierarchical Clustering. Divisive Hierarchical Clustering. Density-Based Clustering (DBSCAN). Fuzzy Clustering. HICAP: Hierarchical Clustering with Pattern Preservation. Slides.

Mining the Web

  • Information Retrieval. PageRank. Web Link Matrix. Dead Ends and Spider Traps. Hubs and Authorities. Web Spam. Combating Spam. Slides.
Review: Slides. Practice material: Slides.

Assignments:
There will be three assignments.