One way to think about regression is as a tool that takes a set of predictors and combines them into a single weighted, linear composite. The regression weights are chosen so that, once the composite is formed, it correlates maximally with the response variable.

Here’s a simulation to drive that point home.

500 people.

`N <- 500`

The correlation matrix for three variables: x1, x2, and the outcome y. The correlation between x1 and x2 is 0.1, and each predictor correlates 0.4 with y.

```
sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
```

The mean for each variable is 0.

`mu <- c(0,0,0)`

Use the correlation matrix and mean specifications to generate data.

```
library(MASS)
df <- mvrnorm(N, mu, sigma)
```

Turn it into a data frame and label it.

```
df <- data.frame(df)
names(df) <- c('x1', 'x2', 'y')
df$id <- 1:N
```

Run regression and print the output.

```
summary(lm(y ~ x1 + x2,
data = df))
```

```
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.35466 -0.61663 -0.01457  0.57977  2.28447 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03398    0.03946   0.861     0.39    
## x1           0.32643    0.04118   7.927 1.49e-14 ***
## x2           0.41226    0.03930  10.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8824 on 497 degrees of freedom
## Multiple R-squared:  0.2823, Adjusted R-squared:  0.2794 
## F-statistic: 97.76 on 2 and 497 DF,  p-value: < 2.2e-16
```

Here’s the kicker: you can think of those weights as the optimal recipe telling us how to create the composite.

Create a composite using the regression weights (rounded here to two decimal places).

```
library(tidyverse)
df <- df %>%
  mutate(composite_x = 0.33*x1 + 0.4*x2)
```

Those weights provide (up to rounding) the maximum correlation between our composite and the outcome.

`cor(df$y, df$composite_x)`

`## [1] 0.5312647`

In other words, no other set of weights could make that correlation any higher: regression found the weights that make it as large as it can be. (Notice that 0.5312647² ≈ 0.2822, which matches the R-squared from the multiple regression above.) Regressing y on the composite alone makes the same point.

```
summary(lm(y ~ composite_x,
data = df))
```

```
## 
## Call:
## lm(formula = y ~ composite_x, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36900 -0.61715 -0.01408  0.58139  2.26629 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03395    0.03943   0.861     0.39    
## composite_x  1.01427    0.07248  13.994   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8816 on 498 degrees of freedom
## Multiple R-squared:  0.2822, Adjusted R-squared:  0.2808 
## F-statistic: 195.8 on 1 and 498 DF,  p-value: < 2.2e-16
```
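If you want to see the optimality claim directly, here’s a quick check (a sketch, regenerating the data with an arbitrary seed rather than reusing the exact draw above): build the composite from the fitted coefficients, then try a thousand random weight pairs and confirm none of them produces a higher correlation with y.

```
library(MASS)

# Regenerate the simulated data (seed is an arbitrary choice for reproducibility)
set.seed(1)
N <- 500
sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
df <- data.frame(mvrnorm(N, c(0, 0, 0), sigma))
names(df) <- c('x1', 'x2', 'y')

# Correlation between y and the composite built from the fitted weights
fit <- lm(y ~ x1 + x2, data = df)
b <- coef(fit)
r_fit <- cor(df$y, b['x1'] * df$x1 + b['x2'] * df$x2)

# Correlations between y and composites built from 1,000 random weight pairs
r_random <- replicate(1000, {
  w <- runif(2, -1, 1)
  cor(df$y, w[1] * df$x1 + w[2] * df$x2)
})

max(r_random) <= r_fit  # TRUE: no random pair beats the regression weights
```

The random pairs can get close, but none can exceed `r_fit` — the OLS weights are, by construction, the correlation-maximizing linear combination (which is also why `r_fit` equals the square root of the model’s Multiple R-squared).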

Bo\(^2\)m =)