HW 00

Important

This first homework will not be graded; however, it is critical that you complete it on time. All of the computing tools we’ll use in the class will be introduced during this assignment.

Introduction

The goal of this homework is to briefly introduce you to the computing tools we’ll use in the course, set up your computing access, and complete the BIOSTAT 725 student survey. We will talk more about these computing tools in lecture.

Computing

RStudio

Note

R is the name of the programming language itself and RStudio is a convenient interface.

In this class, you have the option to use RStudio on your laptop (i.e., locally) or through a container hosted by Duke OIT. I suggest that everyone become comfortable with both options, for the following reasons:

  • Flexibility: If your laptop has problems right before a due date, it will be helpful to be set up in the container.
  • Independence: It is important to be able to compute on your laptop, because when you graduate you will no longer have access to the Duke containers.

The container is offered as a convenience and you should take advantage of it when needed. We will now give instructions for using both.

Installing RStudio on your laptop

  • Most of you probably already have RStudio installed on your laptop. If you do not, please follow the installation instructions to install both R and RStudio.
  • When given the option, choose the most recent stable version of both.
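Once both are installed, a quick check from the R console can confirm which versions you ended up with. The snippet below is only a sketch: the "4.0.0" threshold is an illustrative assumption and not a course requirement, and the rstudioapi package is assumed to be available only when you run this inside RStudio.

```r
# Print the installed R version string
print(R.version.string)

# Compare against a minimum version; "4.0.0" is an illustrative threshold,
# not a requirement stated by the course
r_is_recent <- getRversion() >= "4.0.0"
print(r_is_recent)

# The RStudio version can only be queried from inside RStudio,
# and only if the rstudioapi package is installed
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  print(rstudioapi::versionInfo()$version)
}
```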

Reserve RStudio container

  • Go to https://cmgr.oit.duke.edu/containers. You will log in using your NetID credentials.

  • Click “Reserve STA725” to reserve an RStudio container. Be sure you reserve the container labeled STA725 to ensure you have the computing setup you need for the class.

You only need to reserve a container once per semester.

Open RStudio container

  • Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and Password.

  • Click STA725 to log into the Docker container. You should now see the RStudio environment.

Stan

In this course, we will use the package rstan as our primary tool for conducting Bayesian inference. The container already has rstan installed, so these steps are only needed for installation on your laptop.

Make sure to follow these instructions closely, since prior to installing rstan, you need to configure your R installation to be able to compile C++ code.
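As a rough way to see whether your laptop is ready, the sketch below checks for a working C++ toolchain before rstan is installed. It assumes the helper package pkgbuild is installed (it is not mentioned in the instructions above), and it does not replace the linked installation instructions, which take precedence if anything disagrees.

```r
# Check for a working compiler toolchain using pkgbuild, if it is installed.
# pkgbuild is an assumption here; install it with install.packages("pkgbuild").
toolchain_ok <- if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools(debug = TRUE)
} else {
  NA  # unknown: pkgbuild is not available
}

# Check whether rstan is already installed, without attaching it
rstan_installed <- requireNamespace("rstan", quietly = TRUE)

if (isTRUE(toolchain_ok) && !rstan_installed) {
  message("Toolchain found; rstan can now be installed, ",
          "e.g. with install.packages(\"rstan\").")
} else if (isFALSE(toolchain_ok)) {
  message("No C++ toolchain detected; revisit the configuration steps first.")
}
```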

Stan Hello World!

Once rstan has been installed, test to make sure we can load the R package.

library(rstan)
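When rstan loads, its startup message suggests two optional settings. They are not required for this homework; the sketch below shows them as a convenience. Using every available core is an assumption that suits a personal laptop, and you may prefer fewer cores on a shared machine.

```r
# Run chains in parallel; fall back to 1 core if the count cannot be detected
options(mc.cores = max(1L, parallel::detectCores(), na.rm = TRUE))

# Cache compiled models on disk so unchanged .stan files are not recompiled
if (requireNamespace("rstan", quietly = TRUE)) {
  rstan::rstan_options(auto_write = TRUE)
}
```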

Now we will simulate some data that we can fit with linear regression. We will also define a Stan data object (no need to understand this now; we will go into it in detail in future lectures).

###Set a seed for reproducibility
set.seed(54)

###Simulation settings
n <- 100 # number of observations
p <- 3 # number of covariates

###True parameter values
beta <- matrix(c(rnorm(p + 1)), ncol = 1)
sigma <- 1.5

###Simulate covariates and outcome
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))

###Create a Stan data object
stan_data <- list(
  n = n,
  p = p,
  Y = Y,  
  X = X
)
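Before moving to Stan, it can help to confirm the simulated data look sensible. The sketch below repeats the simulation so that it runs on its own, checks the dimensions, and fits the same regression by ordinary least squares with lm(). The OLS comparison is an addition for illustration and not part of the assignment; the names ols and comparison are hypothetical.

```r
# Repeat the simulation above so this block is self-contained
set.seed(54)
n <- 100
p <- 3
beta <- matrix(rnorm(p + 1), ncol = 1)
sigma <- 1.5
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))

# Dimension checks: X has an intercept column plus p covariates
stopifnot(length(Y) == n, all(dim(X) == c(n, p + 1)))

# X already contains the intercept column, so drop lm()'s own intercept
ols <- lm(Y ~ X - 1)

# Compare the true coefficients with the least-squares estimates
comparison <- cbind(truth = as.numeric(beta), ols = unname(coef(ols)))
print(round(comparison, 2))

# Residual standard error, an estimate of sigma
print(round(summary(ols)$sigma, 2))
```

With only 100 observations, the estimates should be near the true values but will not match them exactly.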

Now, we can define a Stan model (you do not need to understand this yet; we are just testing!). In RStudio, create a new .stan file called test.stan and then copy and paste the following Stan code. To create a .stan file in RStudio, go to File -> New File -> Stan File.

// Saved in test.stan
data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p + 1] X;
}
parameters {
  vector[p + 1] beta;
  real<lower = 0> sigma;
}
model {
  Y ~ normal(X * beta, sigma);
}
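For readers who want to see the statistical model behind this code, it can be written as below. This is a sketch for orientation only: the notation $x_i$ for the $i$-th row of X (including the intercept column) is introduced here and does not appear in the Stan file.

```latex
Y_i \mid \beta, \sigma \sim \mathcal{N}\!\left(x_i^\top \beta,\; \sigma^2\right),
\qquad i = 1, \dots, n,
\qquad \beta \in \mathbb{R}^{p+1},\quad \sigma > 0.
```

Because the model block declares no priors, Stan uses improper flat priors on beta and on sigma over its positive support.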

Next, we test to see if the model can compile. Note that compilation can sometimes take a bit of time.

stan_model <- stan_model(file = "test.stan")

OK, great. We will now obtain posterior samples, using default specifications for inference.

fit <- sampling(stan_model, data = stan_data, refresh = 0)

Finally, print some summary estimates for the model parameters.

print(fit)
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]    1.89    0.00 0.18    1.54    1.77    1.89    2.01    2.23  4666    1
beta[2]    0.20    0.00 0.18   -0.15    0.08    0.20    0.32    0.55  4391    1
beta[3]   -0.30    0.00 0.16   -0.62   -0.41   -0.31   -0.19    0.01  5189    1
beta[4]    1.58    0.00 0.19    1.22    1.46    1.58    1.71    1.95  4983    1
sigma      1.71    0.00 0.13    1.49    1.63    1.71    1.79    1.98  4479    1
lp__    -102.49    0.04 1.61 -106.30 -103.33 -102.13 -101.31 -100.34  1907    1

Samples were drawn using NUTS(diag_e) at Mon Jan  6 15:15:47 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
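The Rhat and n_eff columns in this output can also be checked programmatically. The sketch below defines a hypothetical helper, flag_poor_convergence(), that works on a summary matrix; the thresholds 1.01 and 400 are illustrative rules of thumb, not course requirements. The demonstration matrix copies the n_eff and Rhat columns from the printed output above.

```r
# Hypothetical helper: return the names of parameters whose Rhat is too high
# or whose effective sample size is too low. Thresholds are illustrative.
flag_poor_convergence <- function(summary_matrix, rhat_max = 1.01, n_eff_min = 400) {
  rhat <- summary_matrix[, "Rhat"]
  n_eff <- summary_matrix[, "n_eff"]
  flagged <- rhat > rhat_max | n_eff < n_eff_min
  # which() drops NA entries, e.g. for constant quantities
  rownames(summary_matrix)[which(flagged)]
}

# n_eff and Rhat values copied from the printed summary above
printed <- matrix(
  c(4666, 1, 4391, 1, 5189, 1, 4983, 1, 4479, 1, 1907, 1),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("beta[1]", "beta[2]", "beta[3]", "beta[4]", "sigma", "lp__"),
    c("n_eff", "Rhat")
  )
)

# No parameter is flagged for these values
print(flag_poor_convergence(printed))
```

With rstan loaded and a fitted model in hand, the same helper could be applied as flag_poor_convergence(summary(fit)$summary).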

Git and GitHub

In addition to R and RStudio, we will use git and GitHub for version control and collaboration.

Note

Git is a version control system (like the “Track Changes” feature in Microsoft Word, but more powerful) and GitHub is the home for your Git-based projects on the internet (like Dropbox, but much better). Git is important because:

Sign up for GitHub account

You will need a GitHub account to access the assignments, exams, and in-class exercises for the course.

  • If you do not have a GitHub account, go to https://github.com and sign up for an account.

Tip

Click here for advice on choosing a username.

  • If you already have a GitHub account, you can move on to the next step.

Connect RStudio and GitHub

Now that you have RStudio and a GitHub account, we will configure git so that RStudio and GitHub communicate with one another.

Set up your SSH Key

You will authenticate with GitHub using SSH. Below is an outline of the authentication steps.

Note

You only need to complete this authentication process once per system. So, if you are using both your laptop and the container, you will need to do it twice.

  • Step 0: Open your RStudio (either the STA725 RStudio container or your laptop).
  • Step 1: Type credentials::ssh_setup_github() into the console on the bottom left of the RStudio environment.
  • Step 2: R will ask “No SSH key found. Generate one now?” Enter 1 for yes.
  • Step 3: You will generate a key. It will begin with “ssh-rsa….” R will then ask “Would you like to open a browser now?” Enter 1 for yes.
  • Step 4: You may be asked to provide your username and password to log into GitHub. These are the ones associated with the GitHub account you set up. After entering this information, paste the key in and give it a name. You might name it in a way that indicates where the key will be used (e.g., biostat725).
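If you are unsure whether a key already exists on the system you are using, the sketch below looks one up without generating a new one. It assumes the credentials package is installed (the same package used in Step 1); the variable name key is hypothetical.

```r
# Look up an existing SSH key; auto_keygen = FALSE means no key is created.
# If no key is found, ssh_key_info() raises an error, which is caught here.
key <- NULL
if (requireNamespace("credentials", quietly = TRUE)) {
  key <- tryCatch(
    credentials::ssh_key_info(auto_keygen = FALSE),
    error = function(e) NULL
  )
}

if (is.null(key)) {
  message("No SSH key found (or credentials is not installed); ",
          "run credentials::ssh_setup_github() to create one.")
} else {
  # Print the public key, which is the text you paste into GitHub
  cat(key$pubkey, "\n")
}
```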

Configure git

The last thing we need to do is configure git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.

To do so, you will use the use_git_config() function from the usethis package.

Type the following lines of code in the console in RStudio filling in your name and the email address associated with your GitHub account.

usethis::use_git_config(
  user.name = "Your name", 
  user.email = "Email associated with your GitHub account")

For example, mine would be

usethis::use_git_config(
  user.name = "berchuck",
  user.email = "sib2@duke.edu")

It may look like nothing happened, but you are now ready to interact between GitHub and RStudio! We will begin working with RStudio and GitHub in lecture this week.
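To confirm the settings were saved, you can read them back from git's global configuration. The sketch below calls the git command line directly and assumes git is on your system path; the variable names are hypothetical.

```r
# Read a value from git's global configuration; returns character(0) if unset
read_git_config <- function(key) {
  git_path <- Sys.which("git")
  if (!nzchar(git_path)) {
    return(character(0))  # git was not found on the system path
  }
  # git exits with a non-zero status when the key is unset, which system2()
  # reports as a warning; suppress it and return the (empty) result
  suppressWarnings(
    system2(git_path, c("config", "--global", key), stdout = TRUE, stderr = FALSE)
  )
}

user_name <- read_git_config("user.name")
user_email <- read_git_config("user.email")

print(user_name)
print(user_email)
```

If either value prints as character(0), rerun usethis::use_git_config() with your details. The usethis function git_sitrep() gives a fuller report of the same settings.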

Note

You should use the email address you used to create your GitHub account; it’s OK if it isn’t your Duke email.

Submit GitHub username

Use the link below to submit your GitHub username and confirm that you (1) have access to an RStudio container and (2) have completed the steps to configure git.

🔗 https://forms.office.com/r/c732cyKi6f

BIOSTAT 725 Student Survey

Use the link below to complete the BIOSTAT 725 Student Survey. This survey will help me learn more about you, your interests, and your previous statistics and computing experience.

🔗 https://forms.office.com/r/ygCiEV0wK3
