BIOS 667: Advanced Data Analysis

Course Objectives
Biostatisticians increasingly rely on statistical methods that discover patterns and relationships in vast amounts of data. In this course, students will gain insight into statistical methods used to discover the underlying structure of large, complex datasets. Specific topics include bootstrap methods, discriminant analysis, k-nearest neighbors, classification and regression trees, and random forests. At the conclusion of the course, students will be able to apply the methods presented to analyze data in the R/S-Plus programming environment.

Data Mining and Pattern Recognition
This course focuses on methods tied to machine learning, i.e., methods that seek to discover structure from the evidence of the data alone. Most of the methods discussed are computationally intensive, so the analyst must become proficient in an efficient statistical programming environment. For this reason, the course requires the use of the R/S-Plus programming environment.

Course Time: Monday/Wednesday 1:30PM–2:50PM

Room: Theater Row, Room 1015

Office hours: Monday/Wednesday 3:00PM–4:00PM

Office: Theater Row, 3022

Required Text
Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, New York.

Supplemental materials are posted on Blackboard.

Prerequisites: BIOS 513, 514, and 524

Grading: Grades will be based on assigned homework consisting of short-answer, statistical computing, and problem-solving exercises. In addition, a final project is required, with weighting for the final assigned grade as follows:

Students must use VCU's honor system when handing in any take-home work.

Course Outline

    I. Introduction to the R Programming Environment

    II. Review of Linear and Logistic Regression

    III. Penalized models

    IV. Bayes' rule

    V. Discriminant analysis

    VI. Kernel density estimation

    VII. k-nearest neighbors

    VIII. Model assessment and cross validation

    IX. Classification and regression trees

    X. Random Forests

    XI. Bootstrap methods


