HW 00
This first homework will not be graded; however, it is critical that you complete it on time. All of the computing tools we’ll use in the class will be introduced during this assignment.
Introduction
The goal of this homework is to briefly introduce you to the computing tools we’ll use in the course, set up your computing access, and complete the BIOSTAT 725 student survey. We will talk more about these computing tools in lecture.
Computing
RStudio
R is the name of the programming language itself and RStudio is a convenient interface.
In this class, you have the option to use RStudio on your laptop (i.e., locally) or through a container hosted by Duke OIT. My suggestion is for everyone to be comfortable using both options, for the following reasons:
- Flexibility: If your laptop has problems right before a due date, it will be helpful to be set up in the container.
- Independence: It is important to be able to compute on your laptop, because when you graduate you will no longer have access to the Duke containers.
The container is offered as a convenience and you should take advantage of it when needed. We will now give instructions for using both.
Installing RStudio on your laptop
- Most of you probably already have RStudio installed on your laptop. In case you do not, please follow these instructions to install both R and RStudio: Installation instructions.
- When given the option, choose the most recent stable version of both.
Reserve RStudio container
Go to https://cmgr.oit.duke.edu/containers. You will log in using your NetID credentials.
Click “Reserve STA725” to reserve an RStudio container. Be sure you reserve the container labeled STA725 to ensure you have the computing setup you need for the class.
You only need to reserve a container once per semester.
Open RStudio container
Go to https://cmgr.oit.duke.edu/containers and log in with your Duke NetID and password.
Click STA725 to log into the Docker container. You should now see the RStudio environment.
Stan
In this course, we will use the package rstan as our primary tool for conducting Bayesian inference. The container already has rstan installed, so the steps below are only needed for installation on your laptop.
- Follow the installation guide here: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started
Make sure to follow these instructions closely, since prior to installing rstan, you need to configure your R installation to be able to compile C++ code.
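Before installing rstan, it can help to confirm that R is able to find a C++ toolchain. The following is a minimal sketch, assuming the pkgbuild package is installed (pkgbuild is an assumption here, not a requirement of this assignment); if it is not installed, the check is skipped and the result is NA.

```r
# Sketch: check whether R can find the build tools needed to compile C++.
# Assumption: relies on the pkgbuild package, which this assignment does
# not otherwise require.
toolchain_ok <- if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools(debug = FALSE)
} else {
  NA  # pkgbuild is not installed, so we cannot tell
}
print(toolchain_ok)
```

A result of TRUE suggests the compiler setup is in place; FALSE suggests revisiting the toolchain section of the installation guide.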
Stan Hello World!
Once rstan has been installed, test to make sure we can load the R package.
library(rstan)
Now we will simulate some data that we can fit with linear regression. We will also define a Stan data object (no need to understand this now; we will go into this in detail in future lectures).
###Set a seed for reproducibility
set.seed(54)
###Simulation settings
n <- 100 # number of observations
p <- 3 # number of covariates
###True parameter values
beta <- matrix(c(rnorm(p + 1)), ncol = 1)
sigma <- 1.5
###Simulate covariates and outcome
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))
###Create a Stan data object
stan_data <- list(
  n = n,
  p = p,
  Y = Y,
  X = X
)
Now, we can define a Stan model (you do not need to understand this yet, we are just testing!). In RStudio, create a new .stan file called test.stan and then copy and paste the following Stan code. To create a .stan file from RStudio, File -> New File -> Stan File.
// Saved in test.stan
data {
  int<lower = 1> n;
  int<lower = 1> p;
  vector[n] Y;
  matrix[n, p + 1] X;
}
parameters {
  vector[p + 1] beta;
  real<lower = 0> sigma;
}
model {
  Y ~ normal(X * beta, sigma);
}
Next, we test to see if the model can compile. Note that compilation can sometimes take a bit of time.
stan_model <- stan_model(file = "test.stan")
OK, great. We will now obtain posterior samples, using default specifications for inference.
fit <- sampling(stan_model, data = stan_data, refresh = 0)
Finally, print some summary estimates for the model parameters.
print(fit)
Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
           mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
beta[1]    1.89    0.00 0.18    1.54    1.77    1.89    2.01    2.23  4666    1
beta[2]    0.20    0.00 0.18   -0.15    0.08    0.20    0.32    0.55  4391    1
beta[3]   -0.30    0.00 0.16   -0.62   -0.41   -0.31   -0.19    0.01  5189    1
beta[4]    1.58    0.00 0.19    1.22    1.46    1.58    1.71    1.95  4983    1
sigma      1.71    0.00 0.13    1.49    1.63    1.71    1.79    1.98  4479    1
lp__    -102.49    0.04 1.61 -106.30 -103.33 -102.13 -101.31 -100.34  1907    1
Samples were drawn using NUTS(diag_e) at Mon Jan 6 15:15:47 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
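As an optional sanity check on the summary above, the posterior means can be compared with ordinary least squares. The Stan model declares no explicit priors, so the posterior means for beta should land close to the least-squares estimates. The sketch below re-creates the simulated data with the same seed and settings and fits the regression with lm(); the object names ols and comparison are illustrative and not part of the assignment.

```r
# Re-create the simulated data from above (same seed and settings).
set.seed(54)
n <- 100
p <- 3
beta <- matrix(c(rnorm(p + 1)), ncol = 1)
sigma <- 1.5
X <- cbind(1, matrix(rnorm(n * p), ncol = p))
Y <- as.numeric(X %*% beta + rnorm(n, 0, sigma))

# X already contains an intercept column, so drop lm()'s own intercept.
ols <- lm(Y ~ X - 1)

# Side by side: true values used in the simulation vs. least-squares estimates.
comparison <- cbind(truth = as.numeric(beta), ols = unname(coef(ols)))
print(round(comparison, 2))

# Residual standard deviation, comparable to sigma in the Stan output.
print(round(summary(ols)$sigma, 2))
```

If the two sets of estimates differ substantially, that usually points to a problem in how the data were passed to Stan rather than in the sampler itself.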
Git and GitHub
In addition to R and RStudio, we will use git and GitHub for version control and collaboration.
Git is a version control system (like the “Track Changes” feature in Microsoft Word, but more powerful) and GitHub is the home for your Git-based projects on the internet (like Dropbox, but much better). Git is important because:
- Results produced are more reliable and trustworthy (Ostblom and Timbers 2022)
- It facilitates more effective collaboration (Ostblom and Timbers 2022)
- It contributes to science, which builds and organizes knowledge in terms of testable hypotheses (Alexander 2023)
- It makes it possible to identify and correct errors or biases in the analysis process (Alexander 2023)
Sign up for GitHub account
You will need a GitHub account to access the assignments, exams, and in-class exercises for the course.
- If you do not have a GitHub account, go to https://github.com and sign up for an account.
Click here for advice on choosing a username.
- If you already have a GitHub account, you can move on to the next step.
Connect RStudio and GitHub
Now that you have RStudio and a GitHub account, we will configure git so that RStudio and GitHub communicate with one another.
Set up your SSH Key
You will authenticate with GitHub using SSH. Below is an outline of the authentication steps.
You only need to do this authentication process once per system. So, if you are using both your laptop and the container, you will need to do this process twice.
- Step 0: Open RStudio (either the STA725 RStudio container or your laptop).
- Step 1: Type credentials::ssh_setup_github() into the console on the bottom left of the RStudio environment.
- Step 2: R will ask “No SSH key found. Generate one now?” Enter 1 for yes.
- Step 3: You will generate a key. It will begin with “ssh-rsa….” R will then ask “Would you like to open a browser now?” Enter 1 for yes.
- Step 4: You may be asked to provide your username and password to log into GitHub. These are the ones associated with the account that you set up. After entering this information, paste the key in and give it a name. You might name it in a way that indicates where the key will be used (e.g., biostat725).
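After completing the steps above, you may want to confirm that a key now exists on the system you are using. This is a rough sketch, assuming the credentials package from Step 1 is installed and that ssh_key_info() returns the public key in a pubkey element; it only checks that a local key can be found and does not verify that the key was added to your GitHub account.

```r
# Sketch: confirm that an SSH key is available on this system.
# Assumption: the credentials package (used in Step 1) is installed.
key_found <- if (requireNamespace("credentials", quietly = TRUE)) {
  tryCatch({
    # auto_keygen = FALSE: only look for an existing key, never create one.
    info <- credentials::ssh_key_info(auto_keygen = FALSE)
    !is.null(info$pubkey)
  }, error = function(e) FALSE)  # no usable key was found
} else {
  NA  # credentials is not installed, so we cannot tell
}
print(key_found)
```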
Configure git
The last thing we need to do is configure your git so that RStudio can communicate with GitHub. This requires two pieces of information: your name and email address.
To do so, you will use the use_git_config() function from the usethis package.
Type the following lines of code in the console in RStudio filling in your name and the email address associated with your GitHub account.
usethis::use_git_config(
  user.name = "Your name",
  user.email = "Email associated with your GitHub account")
For example, mine would be
usethis::use_git_config(
  user.name = "berchuck",
  user.email = "sib2@duke.edu")
It may look like nothing happened, but you are now ready to interact between GitHub and RStudio! We will begin working with RStudio and GitHub in lecture this week.
You should be using the email address you used to create your GitHub account; it’s OK if it isn’t your Duke email.
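Since the configuration step prints little feedback, one way to confirm it worked is to read the values back from git. The sketch below assumes git is installed and available on your PATH; get_git_global() is a hypothetical helper name introduced only for this illustration.

```r
# Sketch: read back a global git setting. Returns NA if git is not on the
# PATH or the setting is unset. get_git_global() is a hypothetical helper.
get_git_global <- function(key) {
  if (!nzchar(Sys.which("git"))) return(NA_character_)
  value <- suppressWarnings(
    system2("git", c("config", "--global", "--get", key),
            stdout = TRUE, stderr = FALSE)
  )
  if (length(value) == 0) NA_character_ else value[1]
}

print(get_git_global("user.name"))
print(get_git_global("user.email"))
```

Both values should match what you passed to use_git_config(); an NA suggests the configuration did not take effect on this system.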
Submit GitHub username
Use the link below to submit your GitHub username and confirm that you (1) have access to an RStudio container and (2) have completed the steps to configure git.
BIOSTAT 725 Student Survey
Use the link below to complete the BIOSTAT 725 Student Survey. This survey will help me learn more about you, your interests, and your previous statistics and computing experience.