Opportunities and Challenges of Complex Biomedical Data: Introduction to the Science of "Big Data"

BIOSTAT 202 Summer 2020 (3 units)
Course Director: Aaron Wolfe Scheffler, PhD
Assistant Professor
Department of Epidemiology & Biostatistics


The growing availability of large amounts of data — obtained either through research or electronic capture of everyday activity — has been termed "big data". This course introduces the opportunities and challenges of using biological and health-related "big data" to perform biomedical research. We will distinguish big data from non-big data and explore the phases of data science: obtaining data, cleaning data, visualizing data, analyzing data, and drawing conclusions.

At the conclusion of this course, students will be able to:

  • Access public-use (and non-public) sources of data such as NHANES and social media data;
  • Use software to manipulate and clean "big data";
  • Generate effective graphical displays of data;
  • Describe the advantages and disadvantages of different approaches to both supervised (classification and regression) and unsupervised modeling (clustering and data reduction);
  • Describe the issues that arise when trying to use "big data"-based observational studies to derive causal conclusions; and
  • Describe the features of pragmatic clinical trials and how they differ from more conventional clinical trials.


Course Director:

Aaron Wolfe Scheffler, PhD
email: aaron.scheffler@ucsf.edu

Lecturers: Elaine Allen, PhD
email: isabel.allen@ucsf.edu

Charles McCulloch, PhD
email: charles.mcculloch@ucsf.edu

Teaching Assistants: TBA

Twice-weekly lectures introduce the substantive content for each module, which is then reinforced in weekly applied homework problem sets. Weekly computer lab sessions give students guided problems to work through and the opportunity to learn the software, ask questions, and interact more with faculty.

  1. Lectures: Mondays and Thursdays, 1:00 PM to 2:30 PM, July 30 through September 10.
  2. Computer Labs: Thursdays, 2:45 PM to 4:15 PM, convened by Zoom webconferencing software.

In addition, all students will be required to submit a final project in which they manipulate, clean, and analyze data emanating from a large data source. Students will be given a choice of datasets and guidelines for performing the project.

All course materials and handouts will be posted on the course's online syllabus.


The software package IBM SPSS Modeler is used in this course. This software is available free of charge to students in the course. The Course Director will provide instructions for obtaining the free license and installing the software before the course start date.


Grades will be based on the Computer Lab assignments and the Final Project. Lab assignments will be due by the start of lecture the following week. Homework problem sets will account for 70% of the points for the course. The final project, based on course-supplied data sets, will account for 30% of the points for the course.

Students must hand in all homework problem sets (even if late), complete a satisfactory Final Project, and receive at least 80% of the total number of points assigned during the quarter to receive a Satisfactory (if taking the course Satisfactory/Unsatisfactory) or a B (if taking the course for a letter grade).
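
As an illustration only, the grading arithmetic above can be sketched in a few lines of Python. The function and variable names are hypothetical, and the sketch assumes that homework scores are pooled into one fraction and combined with the project score using the stated 70/30 weights. It does not model late penalties or any other rule not described here.

```python
HOMEWORK_WEIGHT = 70  # percent of course points from homework problem sets
PROJECT_WEIGHT = 30   # percent of course points from the final project
PASSING_PERCENT = 80  # minimum overall percentage for Satisfactory or a B


def overall_percent(hw_earned, hw_possible, project_earned, project_possible):
    """Weighted course percentage on a 0-100 scale (assumed pooling of homework)."""
    homework_part = HOMEWORK_WEIGHT * hw_earned / hw_possible
    project_part = PROJECT_WEIGHT * project_earned / project_possible
    return homework_part + project_part


def meets_requirements(percent, all_homework_submitted, project_satisfactory):
    """All three stated conditions must hold at once."""
    return all_homework_submitted and project_satisfactory and percent >= PASSING_PERCENT


# Hypothetical student: 63 of 70 homework points, 24 of 30 project points.
percent = overall_percent(63, 70, 24, 30)
print(percent, meets_requirements(percent, True, True))
```

Note that a high point total alone is not enough under this reading: a student with a missing problem set or an unsatisfactory project would not meet the requirements regardless of the percentage.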

Students not in full-year TICR Programs who satisfactorily pass all course requirements will, upon request, receive a Certificate of Course Completion.

UCSF Graduate Division Policy on Disabilities


To apply for this course, please fill out and submit the application below. Please see our fee page for cost information. The deadline for application is July 13, 2020. Only one application needs to be completed for all courses desired during the quarter.

The application is best completed using the latest version of Firefox, Chrome or Safari.

APPLICATION
Information for how to pay; please read before applying