General Overview

STAT 447: Data Science Programming Methods is a course in the Department of Statistics at the University of Illinois.

Data Science Programming Methods was first offered in the Spring 2019, Fall 2019, and Fall 2020 terms under the earlier course number STAT 430: Topics in Applied Statistics. Since the Fall 2021 term, it has been offered under the new course number STAT 447. The instructor is Dirk Eddelbuettel, who also designed the course and taught the previous instances (which can still be accessed; see the resources/websites link on the left).

Course lecture slides as well as guest lectures are publicly accessible; see the lectures by topic and guest lectures links on the left.

Objectives

A 2018 report by the National Academies of Sciences, Engineering, and Medicine stated:

Data science is emerging as a field that is revolutionizing science and industries alike. Work across nearly all domains is becoming more data driven, affecting both the jobs that are available and the skills that are required. As more data and ways of analyzing them become available, more aspects of the economy, society, and daily life will become dependent on data.

This course introduces key concepts for computational literacy in a data science context:

  • Basic shell operations (which are a core building block for all computing systems) and key commands (find, grep, sed, awk, …) up to shell scripts
  • Git (and GitHub) for version control: for versioning source code, write-ups, and much more, such as social computing
  • Structured Query Language (SQL) as a key data processing tool
  • R Programming: getting familiar with the best language and environment for programming with data
  • Reproducible computing: Markdown is at the core of this and incredibly versatile
  • Further extensions such as Docker for deployment

This course is fast-paced and covers a considerable amount of material.

Format

  • Lectures, generally as (on-line) slides along with short asynchronous videos
  • Self-study, which offers plenty of reading and coding to do
  • Six (on-line) homeworks administered via the PrairieLearn system
  • Six (on-line) quizzes via the PrairieTest facility, which also uses the PrairieLearn system
  • A self-directed group project demonstrating data science programming
  • On-line office hours with instructor and course assistants

Brief One-Paragraph Description

Statistics and Data Science are focused on making sense of data – and face an ever-increasing demand for their work. Yet at the same time, data sets increase in size and scope. Proper tooling is essential to meet these challenges, and as applied work in data analysis is in effect applied computational work, we will learn the computational tools and programming methods to meet these data science challenges. Proficiency at the shell, familiarity with git version control, sufficient understanding of SQL, and of course actual expertise in R programming are the goals of this course, preparing students for the coming computational challenges. We have used RStudio Cloud instances in the past and can recommend them, but there is no longer a campus-provided framework. Prior programming experience (in R or another language) will certainly be helpful, but is not a formal requirement for taking the course.

Detailed Content and Lectures

Please see the Lectures by Topics link to the left. Content is often refreshed or added as the course progresses, but the prior years (from the most recent in 2021 back to the earlier runs in 2020, Fall 2019, and Spring 2019) always remain available as a complete reference.