6 min read

Week 7: Relating Mean Functions

This week we will use the following packages, as always. Load them at the beginning of the session 👨👩

library(tidyverse)
library(EnvStats)
library(DMwR)
#install.packages("package_name")

Removing Terms

We try to get a more straightforward mean function by constraining some of the coefficients in \(\beta\) to equal specified values or to equal one another. Models formed in this way are called submodels because they are created by imposing constraints on the regression coefficients in the full model \(E(y|x)=\beta^Tu\).

Submodels

Submodels remove terms from the full model in order to select the simplest model that still fits the data well. They assume, and let us test for, a constant variance \(Var(y|x)=Var(y|u)=\sigma^2\).

A regression model based on a subset of terms depends on the distribution of the terms, which in turn depends on the distribution of the predictors. If the terms are independent or linearly related, the submodel mean functions generally have the same form as the full mean function. If the terms are dependent but not linearly related, anything can happen, and the regression for the full model may tell us little about the regression for a submodel. These comments apply whenever the term deleted has a nonzero regression coefficient in the full model. If the coefficient is zero, then deleting the term has no effect on the mean function.
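
A quick simulated sketch of this point (hypothetical variables x1 and x2, not the algae data): when two terms are linearly related, the coefficient in a submodel absorbs part of the effect of the deleted term.

set.seed(1)
x1 <- rnorm(100)                          # hypothetical predictor
x2 <- 0.8 * x1 + rnorm(100, sd = 0.6)     # x2 is linearly related to x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(100)
coef(lm(y ~ x1 + x2))   # full model: estimates close to the true values 2 and 3
coef(lm(y ~ x1))        # submodel: the x1 coefficient absorbs part of x2's effect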

When we did linear regression, we fitted the interaction and removed it when it was non-significant; that was already a way of removing terms. Now we will remove terms from multiple regressions.

data("algae")
algae_a3<-filter(algae,a3!=0)%>%
  select(1:11,14)%>%
  drop_na()#na.omit(complete.cases())
fit_H0<-lm(a3~.,data=algae_a3)
summary(fit_H0)
## 
## Call:
## lm(formula = a3 ~ ., data = algae_a3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.527  -4.308  -1.205   2.487  30.599 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  43.1645992 15.6403798   2.760  0.00682 **
## seasonspring  5.1478595  2.2536983   2.284  0.02435 * 
## seasonsummer  0.8013056  2.1432028   0.374  0.70924   
## seasonwinter  4.6709330  2.2595379   2.067  0.04115 * 
## sizemedium   -0.5528254  2.0015445  -0.276  0.78293   
## sizesmall    -2.7688779  2.2582335  -1.226  0.22287   
## speedlow     -2.8809072  2.5902335  -1.112  0.26856   
## speedmedium  -1.7684714  1.8970353  -0.932  0.35334   
## mxPH         -3.5083661  1.8176549  -1.930  0.05626 . 
## mnO2         -0.8201334  0.3830932  -2.141  0.03458 * 
## Cl            0.0050799  0.0154392   0.329  0.74278   
## NO3          -0.2209960  0.3540525  -0.624  0.53384   
## NH4          -0.0019624  0.0008455  -2.321  0.02221 * 
## oPO4          0.0008710  0.0229068   0.038  0.96974   
## PO4          -0.0004776  0.0187716  -0.025  0.97975   
## Chla         -0.0137879  0.0458108  -0.301  0.76402   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.219 on 106 degrees of freedom
## Multiple R-squared:  0.2504, Adjusted R-squared:  0.1443 
## F-statistic:  2.36 on 15 and 106 DF,  p-value: 0.00576

Well, we got many terms, and we may think that we should simply keep the significant ones. However, that may not be the best option, since a term's estimate and significance can change with different combinations of the other terms. Dropping a term from a regression can change both the mean function and the variance function in fundamental ways unless the regression coefficient for that term is zero. 🤔 So how do we decide?

Analysis of variance is used to summarize the calculations necessary for comparing competing models fitted to the same data. It is appropriate only for nested hypotheses, i.e., when the smaller model can be obtained from the larger one by constraining its coefficients.

  • \(H_0\): The relationship assumed in the nested model is reasonable, i.e., dropping the terms did not change the variance \(Var(y|x)=\sigma^2\).
  • \(H_A\): The relationship assumed in the nested model is not reasonable, i.e., dropping the terms did change the variance \(Var(y|x)=\sigma^2\).
fit_HA <- lm(a3 ~ season + mxPH + mnO2 + NH4, data = algae_a3)  # reduced (nested) model with the terms that looked significant
summary(fit_HA)
## 
## Call:
## lm(formula = a3 ~ season + mxPH + mnO2 + NH4, data = algae_a3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.819  -4.206  -1.210   1.747  30.284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  37.318118  12.257879   3.044  0.00289 **
## seasonspring  5.631754   2.097030   2.686  0.00831 **
## seasonsummer  0.953364   2.066956   0.461  0.64550   
## seasonwinter  4.771590   2.088103   2.285  0.02414 * 
## mxPH         -3.297595   1.382247  -2.386  0.01868 * 
## mnO2         -0.699682   0.308001  -2.272  0.02497 * 
## NH4          -0.002156   0.000787  -2.740  0.00712 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.077 on 115 degrees of freedom
## Multiple R-squared:  0.2184, Adjusted R-squared:  0.1776 
## F-statistic: 5.356 on 6 and 115 DF,  p-value: 6.589e-05

Now most of the significant terms have smaller p-values, and even mxPH is now significant, so will this model be better than the last one? You can test both models with \(F^*=\frac{(RSS_{NH}-RSS_{AH})/(df_{NH}-df_{AH})}{\hat\sigma_{AH}^2}\), where NH is the null (reduced) model and AH the alternative (full) model. Larger values of \(F^*\) provide evidence against the NH and in favor of the AH. To judge the worth of the submodel, look at the mean squared error (MSE) of a fitted value from the submodel.

In R we will use the function anova(model1,model2) to test which model is better.

anova(fit_HA,fit_H0)
## Analysis of Variance Table
## 
## Model 1: a3 ~ season + mxPH + mnO2 + NH4
## Model 2: a3 ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 + oPO4 + 
##     PO4 + Chla
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    115 5760.4                           
## 2    106 5524.6  9    235.78 0.5027 0.8697
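
If you want to check where the table comes from, the same \(F^*\) can be computed by hand from the two fits (a small verification using the fit_HA and fit_H0 objects from above):

# NH = the reduced (nested) model fit_HA, AH = the full model fit_H0
RSS_NH <- sum(residuals(fit_HA)^2);  df_NH <- df.residual(fit_HA)
RSS_AH <- sum(residuals(fit_H0)^2);  df_AH <- df.residual(fit_H0)
F_star <- ((RSS_NH - RSS_AH) / (df_NH - df_AH)) / (RSS_AH / df_AH)
F_star                                                # matches the F value above
pf(F_star, df_NH - df_AH, df_AH, lower.tail = FALSE)  # matches Pr(>F) above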

Our F-value is small (and its p-value large), meaning that dropping those terms did not change the variance, so we keep the reduced model. As you may have realized, with \(k\) terms there are \(2^k\) possible submodels, and checking them all by hand is time-consuming, so we will let the machine do the calculations and use stepwise regression.
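
For a sense of scale, with the 11 predictor columns kept in algae_a3 the number of possible subsets is already in the thousands:

2^11   # = 2048 candidate submodels, before even considering interactions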

Stepwise regression

Stepwise regression works by successively adding or deleting terms from a base model. We consider it because including highly correlated terms in a regression model can inflate the variances of estimates, fitted values, and predictions, so it is usually better to remove highly correlated terms from the model; a quick way to spot them is sketched below.
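
One way to spot such terms is to look at their correlations or variance inflation factors; a small sketch using the full fit from above (car::vif() assumes the car package is installed, which is not among the packages loaded this week):

cor(select(algae_a3, mxPH:Chla))   # pairwise correlations among the numeric terms
car::vif(fit_H0)                   # extra dependency: car; generalized VIFs, large values flag inflated variances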

full.model <- lm(a3~., data = algae_a3)
# Stepwise regression model
step.model <- MASS::stepAIC(full.model, direction = "both", 
                      trace = FALSE)
summary(step.model)
## 
## Call:
## lm(formula = a3 ~ season + mxPH + mnO2 + NH4, data = algae_a3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.819  -4.206  -1.210   1.747  30.284 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  37.318118  12.257879   3.044  0.00289 **
## seasonspring  5.631754   2.097030   2.686  0.00831 **
## seasonsummer  0.953364   2.066956   0.461  0.64550   
## seasonwinter  4.771590   2.088103   2.285  0.02414 * 
## mxPH         -3.297595   1.382247  -2.386  0.01868 * 
## mnO2         -0.699682   0.308001  -2.272  0.02497 * 
## NH4          -0.002156   0.000787  -2.740  0.00712 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.077 on 115 degrees of freedom
## Multiple R-squared:  0.2184, Adjusted R-squared:  0.1776 
## F-statistic: 5.356 on 6 and 115 DF,  p-value: 6.589e-05

You have two options: in forward selection, terms are added one by one until the t-statistic for the next term falls below the reference cutoff; in backward elimination, terms are deleted one by one until the t-statistic for the next term to be deleted is greater than the cutoff (direction = "both" alternates between the two). The final, suggested model is the one with the smallest AIC. Look at all the intermediate models by using trace = TRUE; a sketch of running each direction on its own follows.
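
As a sketch of each direction run separately (the object names backward.model, null.model, and forward.model are just illustrative; everything else reuses full.model, step.model, and algae_a3 from above):

# Backward elimination starts from the full model and only removes terms
backward.model <- MASS::stepAIC(full.model, direction = "backward", trace = TRUE)
# Forward selection starts from the intercept-only model and needs a scope of
# terms it is allowed to add
null.model <- lm(a3 ~ 1, data = algae_a3)
forward.model <- MASS::stepAIC(null.model, direction = "forward", trace = TRUE,
  scope = list(lower = ~ 1,
               upper = ~ season + size + speed + mxPH + mnO2 + Cl + NO3 + NH4 +
                         oPO4 + PO4 + Chla))
# Compare the candidates: the suggested model is the one with the smallest AIC
AIC(full.model, step.model, backward.model, forward.model)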

Nevertheless, there are a few articles 1, 2, and other posts that discuss the bias of this sequential fitting: because many tests are run on the same data, stepwise selection is likely to give you false readings. Algorithmic stepwise model selection can overstate significance; instead, you can use factor analysis, partial least squares regression, or the least absolute shrinkage and selection operator (LASSO), which group or constrain the coefficient estimates in some way. Packages in R like pls, lars, or glmnet can help with these tasks.
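
As a closing illustration, a minimal LASSO sketch with glmnet (assuming the glmnet package is installed; it is not loaded at the top of this post):

library(glmnet)                                    # extra dependency for the LASSO
X <- model.matrix(a3 ~ ., data = algae_a3)[, -1]   # numeric matrix of terms, factors expanded to dummies, intercept dropped
y <- algae_a3$a3
cv_fit <- cv.glmnet(X, y, alpha = 1)               # alpha = 1 is the LASSO penalty; lambda chosen by cross-validation
coef(cv_fit, s = "lambda.min")                     # shrunken coefficients; some are set exactly to zero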