One way to think about regression is as a tool that takes a set of predictors and creates a weighted linear composite that maximally correlates with the response variable. It combines multiple predictors into a single variable using regression weights, and those weights are chosen so that, once the composite is formed, its correlation with the outcome is as large as possible.
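
In symbols: regression picks weights b1 and b2 so that cor(y, b1*x1 + b2*x2) is as large as possible. The fitted values are exactly that composite plus an intercept, and since adding a constant doesn't change a correlation, the multiple correlation R is just cor(y, y-hat).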

Here’s a simulation to drive that point home.

500 people.

N <- 500

The correlation matrix for three variables, x1, x2, and the outcome, y. The correlation between x1 and x2 is 0.1, the correlation between x1 and y is 0.4, and the correlation between x2 and y is 0.4.

sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
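
A quick sanity check, if you want one: a valid (nondegenerate) correlation matrix is symmetric with all-positive eigenvalues, which base R can confirm directly.

eigen(sigma)$values  # should all be > 0 for mvrnorm to be happy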

The mean for each variable is 0.

mu <- c(0, 0, 0)

Use the correlation matrix and mean specifications to generate data.

library(MASS)

df <- mvrnorm(N, mu, sigma)

Turn it into a data frame and label it.

df <- data.frame(df)
names(df) <- c('x1', 'x2', 'y')
df$id <- 1:N
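
It's worth confirming that the simulated correlations land near the targets; at N = 500 they'll wobble a bit from sample to sample.

round(cor(df[, c('x1', 'x2', 'y')]), 2)  # should sit close to the sigma we specified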

Run regression and print the output.

summary(lm(y ~ x1 + x2,
           data = df))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.35466 -0.61663 -0.01457  0.57977  2.28447 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03398    0.03946   0.861     0.39    
## x1           0.32643    0.04118   7.927 1.49e-14 ***
## x2           0.41226    0.03930  10.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8824 on 497 degrees of freedom
## Multiple R-squared:  0.2823, Adjusted R-squared:  0.2794 
## F-statistic: 97.76 on 2 and 497 DF,  p-value: < 2.2e-16

Here’s the kicker: you can think of those weights as an optimal recipe telling us how to create the composite.

Create a composite using (rounded versions of) the regression weights.

library(tidyverse)
df <- df %>%
  mutate(composite_x = 0.33*x1 + 0.4*x2)

Those weights provide, up to rounding, the maximum correlation between our composite and the outcome.

cor(df$y, df$composite_x)
## [1] 0.5312647
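
As an aside, rather than typing rounded numbers by hand, you can pull the exact weights straight out of the fitted model; composite_exact below is just an illustrative name.

fit <- lm(y ~ x1 + x2, data = df)
b <- coef(fit)  # exact, unrounded weights, named by predictor
df$composite_exact <- b['x1']*df$x1 + b['x2']*df$x2
cor(df$y, df$composite_exact)  # at least as high as the rounded-weight correlation above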

In other words, the above correlation could not be higher with any other set of weights; regression found the weights that make it as large as it can be. Squaring it (0.5313^2 ≈ 0.282) recovers the Multiple R-squared from the two-predictor model, and regressing y on the composite tells the same story:

summary(lm(y ~ composite_x,
           data = df))
## 
## Call:
## lm(formula = y ~ composite_x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36900 -0.61715 -0.01408  0.58139  2.26629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03395    0.03943   0.861     0.39    
## composite_x  1.01427    0.07248  13.994   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8816 on 498 degrees of freedom
## Multiple R-squared:  0.2822, Adjusted R-squared:  0.2808 
## F-statistic: 195.8 on 1 and 498 DF,  p-value: < 2.2e-16
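
If you want to see that maximality for yourself, here's a brute-force sketch: draw a few thousand random weight pairs, form each composite, and check that none of them out-correlates the regression composite (the seed is arbitrary, purely for reproducibility).

set.seed(123)  # arbitrary seed
random_cors <- replicate(10000, {
  w <- runif(2, -1, 1)                # a random pair of weights
  cor(df$y, w[1]*df$x1 + w[2]*df$x2)  # correlation that composite achieves
})
max(random_cors)  # should never exceed cor(df$y, df$composite_exact)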

Bo²m =)