BIOS 667: Advanced Data Analysis
Course Objectives
Biostatisticians increasingly need statistical methods that can uncover patterns and relationships in large volumes of data. In this course, students will study statistical methods used to discover the underlying structure of large, complex datasets. Specific topics will include bootstrap methods, discriminant analysis, k-nearest neighbors, classification and regression trees, and random forests. At the conclusion of the course, students will be able to apply the methods presented to analyze data in the R/S-Plus programming environment.
Data Mining and Pattern Recognition
This course focuses on methods tied to machine learning, i.e., methods that seek to discover structure from the evidence of the data alone. Most of the methods discussed are computationally intensive, so the analyst must become proficient in an efficient statistical programming environment. For this reason, the course requires the use of the R/S-Plus programming environment.
Course Time: Tuesdays/Thursdays 10:30 AM–11:50 AM
Room: Theater Row, Room 1015
Office Hours: Mondays/Tuesdays 12:00 PM–1:00 PM
Office: Theater Row, Room 3022
Required Text
Trevor Hastie, Robert Tibshirani, Jerome Friedman (2001). The Elements of Statistical Learning. Springer, New York.
Supplemental materials are posted on Blackboard.
Prerequisites: BIOS 513, 514, and 524
Grading: Grades will be based on homework assignments consisting of short-answer, statistical computing, and problem-solving exercises. In addition, a final project is required. The final grade is weighted as follows:
- 90% Homework
- 10% Term Paper
Course Outline
I. Introduction to the R Programming Environment
II. Bayes' rule
III. Discriminant analysis
IV. Kernel density estimation
V. k-nearest neighbors
VI. Classification and regression trees
VII. Model assessment and cross-validation
VIII. Random forests
IX. L1-penalized models
X. Bootstrap methods

