Purpose

Imagine that you are interested in the relationship between stress and performance. To assess it, you observe 600 people at work and measure their stress via a self-report (e.g., “I feel stressed”) and their performance via objective performance scores for the day (e.g., number of sales). You regress performance on stress and find that the estimated coefficient relating to two is 0.45. You then build a 95% confidence interval using the standard error that the analysis spit out and find that the CI is 0.45 +- 0.1.

What does that confidence interval actually mean? The purpose of this exercise is to build intuition behind frequentist CIs.

Steps

  1. Generate the population

  2. Sample the population. On that sample…

    2a) Regress performance on stress

    2b) Calculate a CI

    2c) Does the CI contain the population parameter?

  3. Re-sample and repeat

1) Generate the population

Our population will contain 100,000 people

pop_number <- 100000

with stress scores distributed about zero. The scale here doesn’t matter – we care about the relationship between stress and performance and less about (in this example) the distributions of stress and performance themselves.

population_stress <- rnorm(pop_number, 0, 5)

The true relationship between stress and performance will be 0.45. Let’s set that parameter

stress_performance_coefficient <- 0.45

and then generate performance.

population_performance <- stress_performance_coefficient*population_stress + rnorm(pop_number, 0, 1)

Now plug everything into a data set. Remember, this is the population.

df <- data.frame(
  'person' = c(1:pop_number),
  'stress' = c(population_stress),
  'performance' = c(population_performance)
)

What is the paramter relating stress to performance? 0.45, keep that in mind. Time to sample the population as if we conducted a study and run our regression.

2) Sample the population

Randomly select 600 people from our population. That is, pretend we ran a study on 600 subjects.

sample_size <- 600
random_numbers <- sample(c(1:pop_number), sample_size)

sample_df <- df %>%
  filter(person %in% random_numbers)

2a) Regress Performance on Stress

Use the lm command for regression in R.

summary(lm(performance ~ stress,
           data = sample_df))
## 
## Call:
## lm(formula = performance ~ stress, data = sample_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07131 -0.60920  0.00378  0.61895  3.08501 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.04018    0.03778  -1.064    0.288    
## stress       0.44654    0.00734  60.836   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.925 on 598 degrees of freedom
## Multiple R-squared:  0.8609, Adjusted R-squared:  0.8607 
## F-statistic:  3701 on 1 and 598 DF,  p-value: < 2.2e-16

2b) Compute the CI

Save the output of the regression in an object so we can pull out the specific coefficients that we are interested in.

output <- summary(lm(performance ~ stress,
                  data = sample_df))

Pull out the coefficient relating stress to performance

slope_coefficient <- output$coefficients[2,1]

and use it along with the SEs to calculate the confidence interval.

se_upper <- slope_coefficient + 1.96*output$coefficients[2,2]
se_lower <- slope_coefficient - 1.96*output$coefficients[2,2]

2c) Does the CI contain the population parameter?

Remember that the parameter is 0.45.

contain_parameter <- NULL

if(se_lower <= stress_performance_coefficient && se_upper >= stress_performance_coefficient){
  contain_parameter <- 'yes'
}else{
  contain_parameter <- 'no'
}

contain_parameter
## [1] "yes"

What did we do? We sampled the population, ran a regression to relate stress to performance, and then calculated a CI on the slope term. The interpretation of a CI, however, is across infinite samples. Now we need to run through the sample, regress, and calculate CI procedure again and again and again – Monte Carlo.

3) Re sample and repeat

I created a function that samples the population, runs a regression, calculates the CI, and then saves whether or not the interval contained 0.45 (‘yes’ or ‘no’). You can view that code in the raw rmarkdown file. For now, just know that the function is called sample_regress_calc_ci.

We are going to re-run step 2 from above 900 times

sims <- 900

and store the ‘yes’ or ‘no’ result in a vector.

all_ci_contains <- numeric(sims)

Here is the full Monte Carlo code.

sims <- 900
all_ci_contains <- numeric(sims)

for(i in 1:sims){
  
  result <- sample_regress_calc_ci()
  all_ci_contains[i] <- result
  
}

Interpretation

How many computed intervals contain the population parameter?

sum(all_ci_contains == 'yes') / sims
## [1] 0.9566667