Statistical Learning and Data Mining

BIOS 667: Advanced Data Analysis

Course Objectives
Biostatisticians increasingly need statistical methods for discovering patterns and relationships in very large amounts of data. In this course, students will learn statistical methods for uncovering the underlying structure of large, complex datasets. Specific topics include bootstrap methods, discriminant analysis, k-nearest neighbors, classification and regression trees, and random forests. By the end of the course, students will be able to use the R/S-Plus programming environment to analyze data with the methods presented.

Data Mining and Pattern Recognition
This course focuses on machine learning methods, i.e., methods that seek to discover structure from the evidence of the data alone. Most of the methods discussed are computationally intensive, so the analyst must become proficient in an efficient statistical programming environment. This course therefore requires the use of the R/S-Plus programming environment.

Course Time: Monday/Wednesday 1:30PM–2:50PM

Room: Theater Row, Room 1015

Office hours: Monday/Wednesday 3:00PM–4:00PM

Office: Theater Row, 3022

Required Text
Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, New York.

Supplemental materials are posted on Blackboard.

Prerequisites: BIOS 513, 514, and 524

Grading: Grades will be based on homework assignments consisting of short-answer, statistical computing, and problem-solving exercises. In addition, a final project is required. The final grade is weighted as follows:

90% Homework
10% Term Paper

Students must adhere to VCU's honor system when submitting any take-home work.

Course Outline

I. Introduction to the R Programming Environment

II. Review of Linear and Logistic Regression

III. Penalized models

IV. Bayes' rule

V. Discriminant analysis

VI. Kernel density estimation

VII. k-nearest neighbors

VIII. Model assessment and cross validation

IX. Classification and regression trees

X. Random forests

XI. Bootstrap methods