Instructor: Alex Thomo
Phone: (250) 472-5786
Office: ECS 556
Office Hours: MR 1:30 - 2:30 p.m.
Email: thomo@cs.uvic.ca
TA: Maryam Shoaran
Email: maryam@csc.uvic.ca
Course Outline: Link
Text:
Introduction to Data Mining (First Edition)
by Pang-Ning Tan, Michael Steinbach, Vipin Kumar
Addison Wesley (2005)
References:
- Data Mining: Concepts and Techniques
by Jiawei Han, Micheline Kamber
Morgan Kaufmann; 2nd edition (2006)
- Data Mining: Practical Machine Learning Tools and Techniques
by Ian H. Witten, Eibe Frank
Morgan Kaufmann; 2nd edition (2005)
Marks so far:
link
Midterm: Feb 15 (Thursday).
Midterm Solutions.
Assignments:
Assignment 1.
Naive Bayes in Weka
Solutions.
Assignment 2.
Solutions.
Assignment 3.
Solutions.
ORACLE Data Mining:
Tutorial.
Term Paper:
Description.
Lecture Handouts:
Predictive Data Mining
- Intro to Data Mining
Slides.
- Data-Related Issues, Types of Attributes, Types of data sets, Data with Relationships among Objects,
Data Quality, Feature Subset Selection.
Slides.
- Applying Decision Trees. Learning Decision Trees. Measures of Node Impurity, Entropy. Information Gain.
Decision Trees with Numerical Attributes. Regression Trees.
Slides.
- Rule-Based Classifiers. Coverage and Accuracy.
Decision Trees vs. rules. Ordered Rule Set.
Separate-and-conquer algorithms. PRISM and RIPPER algorithms.
Slides.
- Uncertain knowledge. Belief and Probability. Conditional
probability. Bayes' Rule. Conditional Independence. Normalization constant.
Naive Bayes Classifier. Text Categorization.
Slides.
- Bayesian Belief Networks: Semantics, Inference, Classification, Construction, Complexity.
Slides.
- Bayesian Belief Networks: Practice.
Slides.
- Credibility: Evaluating what's been learned. Predicting performance. Confidence intervals.
Holdout estimation. Cross-validation. The bootstrap.
Counting the cost.
Slides I.
Slides II (practice).
- ROC curves.
Slides.
- Precision and Recall
Slides.
- Linear Separators: Hyperplane Geometry, Margin, Perceptron Algorithm.
Slides. Example Excel spreadsheet.
See also Point-LineDistance.
- Beyond Linear Separability: Artificial Neural Networks.
Slides.
- Midterm Review.
Slides.
Association Analysis
- Frequent Itemset Generation: The Apriori Principle, Apriori Algorithm, Candidate Generation and Pruning, Support Counting.
Slides.
- More on Apriori Algorithm. Rule Generation: Confidence-Based Pruning, Rule Generation in Apriori
Algorithm. Compact Representation of Frequent Itemsets: Maximal Frequent Itemsets, Closed Frequent Itemsets.
Slides.
- Alternative Methods for Frequent Itemset Generation.
FP-Growth Algorithm: FP-Tree Representation, Frequent Itemset Generation in FP-Growth Algorithm.
Slides.
- Evaluation of Association Patterns: Objective Measures of Interestingness,
Skewed distribution, Cross-support patterns, Lowest confidence rule.
Applications: Transforming attributes. Multi-level Association Rules.
Slides.
- Simpson's Paradox
Slides.
- Mining word associations. Min-Apriori.
Slides.
- Mining of sequences. Candidate Generation. Timing Constraints.
Slides.
- FP-Tree/FP-Growth practice.
Slides.
- Mining Graphs.
Frequent Subgraph Mining. Edge Growing. Multiplicity of Candidates.
Slides.
Cluster Analysis
- Applications of Cluster Analysis. Types of Clusters. K-means Algorithm.
Problems with Selecting Initial Points. Bisecting K-means.
Limitations of K-means.
Slides.
- Self Organizing Maps.
Slides.
- Agglomerative Hierarchical Clustering. Divisive Hierarchical Clustering.
Density-based clustering: DBSCAN. Fuzzy Clustering.
HICAP: Hierarchical Clustering with Pattern Preservation.
Slides.
Mining the Web
- Information Retrieval. PageRank. Web Link Matrix. Dead ends and spider traps. Hubs and Authorities.
Web spam. Combating spam.
Slides.
Review: Slides.
Exercising material: Slides.
Assignments: There will be three assignments.