
Week 3: Multiple Linear Regression

library(PerformanceAnalytics)
library(tidyverse)
library(psych)

Multiple Linear Regression

This week we will cover multiple linear regression. As in the previous post, and in any model you want to fit, we need to test our assumptions and follow some general steps.

Procedure to perform Multiple Linear Regression:

  1. Scatterplot matrix
  2. Fit multiple linear regression models
  3. Estimate the parameters, determine their standard errors, and obtain tests and confidence intervals.
  4. Evaluate and interpret the model

Let’s create our data

set.seed(21) #so we get the same results even with a random selection
df<-data.frame(temperature=rnorm(100,mean = 25,sd=5),
               wind=rnorm(100,mean = 10,sd=1.2),
               radiation=rnorm(100,mean = 5000,sd=1000),
               bacterias=rnorm(100,mean = 50,sd=15),
               groups=c("Day","Night")) #groups
head(df,2)
##   temperature     wind radiation bacterias groups
## 1    28.96507 9.026934  5748.074  32.28645    Day
## 2    27.61126 6.467452  3630.011  31.57719  Night

MLR Assumption (Independence)

Simple linear regression does not require this check because it has only one independent variable. With multiple predictors, we need to verify that the independent variables are independent of each other, i.e., not too highly correlated.

cor(df$temperature,df$wind) #function to test the relationship between your independent variables
## [1] -0.0069861
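
If you want the same check for every pair of predictors at once, you can pass all the numeric independent variables to cor(). A minimal sketch using the data frame created above:

cor(df[,c("temperature","wind","radiation")]) #correlation matrix for all numeric predictors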

Normality and Linearity Assumptions

We still need to test the other two assumptions before fitting the model. A function that will save us time is chart.Correlation: it produces a matrix with the pairwise correlations and their significance (independence), histograms and density curves (normality), and bivariate scatter plots with fitted lines (linearity). Although this is not a formal test of normality, the histograms let you see the distribution of each numeric variable and make some initial judgements.

chart.Correlation(df[,-5], #numeric data set
                  histogram = T, #True show histograms
                  method = "pearson") #correlation method "pearson", "kendall", "spearman"

Similarly, with a few variables you can use psych::corr.test to obtain the correlations and their significance; this is what we used for the correlation plot in week 1.

correlation.test <- psych::corr.test(df[,1:4],use = "pairwise",method = "spearman") #relationship between all your variables
my.correlation<-correlation.test$r # from the test get ($) the correlation values
p.values.correlation<-correlation.test$p # from the test get ($) the correlation p values

  • Values close to 1 (or -1) indicate that two independent variables carry nearly the same information; they do not meet the assumptions for multiple linear regression and should be transformed or discarded (see the sketch below).
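
As a quick illustration, you could flag problematic pairs automatically from the objects created above; the 0.8 cutoff below is only an illustrative rule of thumb, not a fixed threshold:

high.cor<-which(abs(my.correlation)>0.8 & upper.tri(my.correlation), arr.ind = TRUE) #predictor pairs above the cutoff
high.cor #row/column indices of the flagged pairs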

Regression fitting

fit.lm<-lm(bacterias~.,data=df) #all the variables 
#fit.lm<-lm(bacterias~temperature+wind+radiation+groups,data=df) #equivalent to the formula above
summary(fit.lm)
## 
## Call:
## lm(formula = bacterias ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.829 -10.180   2.138  10.238  39.795 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.119654  19.742062   3.450 0.000836 ***
## temperature -0.304967   0.317770  -0.960 0.339638    
## wind        -1.469393   1.474082  -0.997 0.321385    
## radiation    0.000886   0.001674   0.529 0.597873    
## groupsNight -0.544984   3.265375  -0.167 0.867805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.14 on 95 degrees of freedom
## Multiple R-squared:  0.0252, Adjusted R-squared:  -0.01584 
## F-statistic: 0.6141 on 4 and 95 DF,  p-value: 0.6535

Breakdown of the elements of the result:

  • The Estimate column gives the change in y associated with each term in the regression. The intercept is the value of y when all other terms are 0, in this case 68.12 bacteria units. The remaining estimates show the change in y (bacteria population) for a one-unit increase in each x: for example, temperature in the full model would reduce the bacteria population by 0.30 units per unit increase in temperature, but only if P < 0.05. Here it is not, so we cannot reach that conclusion.

  • The second column, Std. Error, gives the standard error of each estimate.

  • The test statistic t value.

  • The p-value, Pr(>|t|): the probability of obtaining the given t statistic, and therefore the calculated estimate, by chance.
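
All of these quantities can also be pulled out of the fitted object programmatically rather than read off the printout; a short sketch using base R extractors:

coef(fit.lm) #the Estimate column as a named vector
confint(fit.lm, level = 0.95) #confidence intervals for each coefficient (step 3 of the procedure)
summary(fit.lm)$coefficients #estimates, standard errors, t values and p-values as a matrix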

Model test (Homoscedasticity)

We should make sure that the fitted model meets this last assumption, constant variance, before keeping it: the spread of the error term needs to be the same across all values of the independent variables.

par(mfrow=c(2,2))
plot(fit.lm)

par(mfrow=c(1,1))

Here we can see that the residual errors are roughly constant and evenly distributed: the Residuals vs Fitted and Scale-Location plots (left column) show a fairly flat band of points, which is what homoscedasticity should look like. The Normal Q-Q plot (top right) shows the points falling along the line, supporting the normality assumption.
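
If you also want a formal test of homoscedasticity on top of the visual check, the Breusch-Pagan test is a common choice. A sketch assuming the lmtest package (not used elsewhere in this post) is installed:

#install.packages("lmtest") #if not already installed
lmtest::bptest(fit.lm) #Breusch-Pagan test; a small p-value suggests non-constant variance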

Leverage and extrapolation

Leverage, h, is a quantity that depends only on the predictors for all the cases in the data, without any reference to the responses. The value h_i is called the leverage of observation i, and it indicates the “pull” that observation has on the regression fit.

Let’s look at that pull for different points.
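
A minimal sketch of how you could compute and inspect the leverages in base R; the 2*p/n cutoff used here is just a common rule of thumb, not the only option:

h<-hatvalues(fit.lm) #leverage h_i for each observation
plot(h, type = "h", ylab = "Leverage") #one spike per observation
abline(h = 2*length(coef(fit.lm))/nrow(df), lty = 2) #rule-of-thumb cutoff 2p/n
which(h > 2*length(coef(fit.lm))/nrow(df)) #observations with unusually high leverage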

Cases with large leverage should be inspected to make sure there are no errors in the data or other problems. We will show some methods to deal with them in the following posts.