Executive Summary

Regression analysis is one of the most important tools for data scientists, and for statisticians in general, because it provides a simple yet insightful way to model the underlying dynamics of the data. In this analysis, we show some strategies for modeling fuel economy in miles per gallon (mpg) as a function of transmission type, in order to see whether there is a significant relationship between the two. We also explore some regression tools for drawing inferential conclusions about the population. The analysis concludes that, while manual transmission is positively associated with mpg, this effect is not statistically different from zero once other relevant variables are controlled for. More analysis is needed, perhaps with a larger data set, to estimate this effect more precisely.

Introduction

Regression analysis is one of the most important tools for data scientists, and for statisticians in general, because it provides a simple yet insightful way to model the underlying dynamics of the data. It offers an intuitive and easy-to-understand framework for making population inference, measuring uncertainty, developing predictions and measuring the partial effects of variables. If we are prepared to complicate the computations mathematically at the cost of interpretability, generalized regression methods provide a wide variety of families of distributions that can be used to model the data. In this brief analysis of the data from the 1974 Motor Trend US magazine, which comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models), we show some of the advantages of using regression models. The attributes in the data have the following descriptions:

1. mpg: Miles/(US) gallon
2. cyl: Number of cylinders
3. disp: Displacement (cu.in.)
4. hp: Gross horsepower
5. drat: Rear axle ratio
6. wt: Weight (1000 lbs)
7. qsec: 1/4 mile time
8. vs: V/S
9. am: Transmission (0 = automatic, 1 = manual)
10. gear: Number of forward gears
11. carb: Number of carburetors

Regression Analysis

First we are going to load the data set and make some summaries.

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
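A minimal sketch of how the listing above can be produced, assuming the data come from the `mtcars` data set that ships with base R (the column names and first rows match it):

```r
# mtcars is part of the 'datasets' package that is loaded with base R
data(mtcars)

# First six rows, as in the listing above
head(mtcars)

# Number of observations and variables, plus per-variable summaries
dim(mtcars)
summary(mtcars)
```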

There are only 32 observations in the data. This is going to be important because the variance of our estimators depends on the number of observations, so we expect our estimated variances to be large. We observe that some categorical variables in the data are coded as numeric, for instance the number of cylinders and the number of carburetors. Normally we would recode them as factors, since we are interested in the partial effect of each level on mpg; however, each level other than the reference level enters the regression model as a separate variable, and we don’t want to reduce our degrees of freedom even more. Recall that the degrees of freedom of the t statistic equal \(n-k\), where \(n\) is the number of observations and \(k\) is the number of estimated coefficients including the intercept. There is no assumption about the distribution of our covariates that prevents us from treating these variables as numeric.
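The cost in degrees of freedom can be illustrated with a small sketch. The two models below are hypothetical and are not among the models fitted in this analysis; they only compare the residual degrees of freedom when `cyl` and `carb` enter as numeric versus as factors.

```r
data(mtcars)

# cyl and carb as numeric: k = 4 coefficients (intercept, cyl, carb, am)
fit_numeric <- lm(mpg ~ cyl + carb + am, data = mtcars)

# cyl and carb as factors: one dummy per level other than the reference level
fit_factor <- lm(mpg ~ factor(cyl) + factor(carb) + am, data = mtcars)

# Residual degrees of freedom, n - k, for each coding
df.residual(fit_numeric)
df.residual(fit_factor)
```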

We are interested in whether fuel economy in mpg has a statistically significant relationship with the transmission type.

To answer this question we first need to look at the sample distribution of mpg, our outcome variable. We can see that the sample distribution is not markedly skewed, and it is reasonable to assume it comes from a normal distribution. In fact, in order to fit a Classical Linear Model we need the error term to satisfy \(mpg_i - E(mpg \mid X_i) \sim N(0, \sigma^2)\), which seems a reasonable assumption for this sample. In reality we cannot test this directly, since the errors are unobserved; it is an assumption we have to make. Let’s attach labels to the levels of the transmission type and plot our variables of interest, mpg and am. From the previous plot, there seems to be some linear relationship between mpg and transmission type. Let’s run our linear model.
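The labeling and the two plots described above might look like the following sketch. The plot types (a histogram and a box plot) are assumptions, since the original figures are not shown here; the labels `automatic` and `manual` follow the coding given in the Introduction.

```r
data(mtcars)

# Label the transmission levels: 0 = automatic, 1 = manual
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Sample distribution of the outcome variable
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles/(US) gallon")

# mpg by transmission type
boxplot(mpg ~ am, data = mtcars,
        xlab = "Transmission", ylab = "Miles/(US) gallon")
```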

## 
## Call:
## lm(formula = mpg ~ am, data = mtcars)
## 
## Coefficients:
## (Intercept)     ammanual  
##      17.147        7.245

There is a statistically significant effect of transmission type on mpg. In fact, cars with manual transmission get 7.245 more miles per gallon on average than cars with automatic transmission, with a p-value well below conventional significance levels.
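The printed output shows only the coefficients, so the significance claim can be checked with a sketch like the one below. The object name `fit_simple` is illustrative and does not come from the original analysis.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Simple regression of mpg on transmission type
fit_simple <- lm(mpg ~ am, data = mtcars)

# Estimates, standard errors, t values and p-values
summary(fit_simple)$coefficients
```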

In order to reach an unbiased conclusion about the effect of transmission type on mpg, the Linear Regression Model assumes that \(Cov(\text{error}, \text{transmission type}) = 0\), which means that there are no other relevant variables determining mpg that are also related to the transmission type. This is a very strong assumption and it is not realistic in our setting. Some manufacturers may choose to build their most powerful cars with a specific type of transmission, or automatic transmission may make the car heavier, and both power and weight are related to mpg.
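One quick way to see whether this concern applies to the sample is to compare weight and horsepower across transmission types. This is only an illustrative check, not a step from the original analysis, and `means_by_am` is a made-up name.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Average weight (1000 lbs) and gross horsepower by transmission type
means_by_am <- aggregate(cbind(wt, hp) ~ am, data = mtcars, FUN = mean)
means_by_am
```

If the group means differ noticeably, weight and horsepower are related to transmission type in the sample, and leaving them out of the model would violate the zero-covariance assumption.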

Our next strategy is to use every variable in the data set to construct our model.

## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         disp           hp         drat  
##    12.30337     -0.11144      0.01334     -0.02148      0.78711  
##          wt         qsec           vs     ammanual         gear  
##    -3.71530      0.82104      0.31776      2.52023      0.65541  
##        carb  
##    -0.19942

In this estimation none of the coefficients is significant at the 5% significance level. With 21 degrees of freedom left, it is very difficult to estimate the variances of our estimators precisely, let alone the effects themselves. We decided to compute the variance inflation factor (VIF) of each variable to see whether some variables are largely explained by a linear combination of the others. Normally we wouldn’t need to do this unless some variables were highly correlated with the transmission type, but we are aiming for parsimony in our strategy.
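As a sketch of how the significance statement can be verified, the full model can be refitted and its p-values inspected. The names `fit_full` and `pvals` are illustrative.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Regress mpg on every other variable in the data set
fit_full <- lm(mpg ~ ., data = mtcars)

# Residual degrees of freedom: 32 observations minus 11 coefficients
df.residual(fit_full)

# p-values of the slope coefficients (intercept excluded), smallest first
pvals <- summary(fit_full)$coefficients[-1, "Pr(>|t|)"]
sort(pvals)
```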

##      drat        am        vs      gear      qsec      carb        hp 
##  3.374620  4.648487  4.965873  5.357452  7.527958  7.908747  9.832037 
##        wt       cyl      disp 
## 15.164887 15.373833 21.620241
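The report does not show how these values were computed. A minimal sketch in base R, assuming the usual definition \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors:

```r
data(mtcars)

# All predictors of the full model (everything except the outcome)
predictors <- setdiff(names(mtcars), "mpg")

# For each predictor, regress it on the remaining ones and apply 1 / (1 - R^2)
vif_manual <- sapply(predictors, function(v) {
  aux <- lm(reformulate(setdiff(predictors, v), response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

sort(vif_manual)
```

Here `am` is left in its original 0/1 coding; because it has only two levels, its VIF is the same whether it is treated as numeric or as a factor.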

Displacement and number of cylinders have high VIFs, and we can hypothesize that displacement is already explained by the number of cylinders, the horsepower and the weight, just as the other variables with high VIFs are explained by the rest. We decided to drop these two variables (disp and cyl) and fit our linear model once again.

## 
## Call:
## lm(formula = mpg ~ . - disp - cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)           hp         drat           wt         qsec  
##    13.80810     -0.01225      0.88894     -2.60968      0.63983  
##          vs     ammanual         gear         carb  
##     0.08786      2.42418      0.69390     -0.61286

We end up with 23 degrees of freedom, and now one of our variables is statistically significant at the 5% significance level in a two-sided t test. The high number of insignificant variables is still a concern with this model, since the degrees of freedom are still very low.

We decided to follow a different approach to choosing which variables may be relevant for the model. We already know from general knowledge that mpg is driven by most, if not all, of the variables in our data set, but in order not to violate our assumption we need to identify which ones are likely to be related to the transmission type in our population. To shed some light on this, we made scatter plots of transmission type against all the variables in our data set, excluding the ones we already discarded and, of course, mpg. It seems from the plots that, at least in the sample, the transmission type is related to weight, quarter-mile time, number of carburetors and number of gears. We then fit our model with these variables.
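The original plots are not reproduced here. The sketch below shows one way to run the same screening; note that it swaps in box plots and simple correlations for the scatter plots mentioned above, and that `vars` and `cors` are illustrative names.

```r
data(mtcars)

# Candidate covariates: everything except mpg, am and the dropped disp and cyl
vars <- c("hp", "drat", "wt", "qsec", "vs", "gear", "carb")

# One box plot per covariate, split by transmission type (0/1 coding)
op <- par(mfrow = c(2, 4))
for (v in vars) {
  boxplot(split(mtcars[[v]], mtcars$am),
          names = c("automatic", "manual"), main = v)
}
par(op)

# Correlation of each covariate with the 0/1 transmission indicator
cors <- sapply(vars, function(v) cor(mtcars[[v]], mtcars$am))
round(cors, 2)
```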

## 
## Call:
## lm(formula = mpg ~ wt + qsec + carb + gear + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1008 -1.4091 -0.1297  1.2894  4.3129 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  10.9012     8.0844   1.348  0.18915   
## wt           -3.1456     0.9283  -3.389  0.00225 **
## qsec          0.9507     0.3553   2.676  0.01274 * 
## carb         -0.7094     0.5328  -1.332  0.19457   
## gear          0.8588     1.2477   0.688  0.49735   
## ammanual      2.8799     1.7602   1.636  0.11387   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.468 on 26 degrees of freedom
## Multiple R-squared:  0.8594, Adjusted R-squared:  0.8323 
## F-statistic: 31.77 on 5 and 26 DF,  p-value: 2.761e-10

Weight is again significantly different from zero, now at the 1% significance level, and qsec is now significant at the 5% level (p = 0.013). Our variable of interest, transmission type, is still not statistically significant. This might be our final specification. Let’s quickly look at some residual plots. From the Residuals vs Fitted plot there might be some heteroskedasticity in the data, but it is not very strong and, given the small number of observations, we think there is not enough evidence of it. The normal Q-Q plot seems to support our assumption of normality in the residuals, which supports the use of this linear model. In the Residuals vs Leverage plot, we can see that two cars, Merc 230 and Ford Pantera L, have high leverage, and they appear to be exerting that leverage, since their standardized residuals are between -1 and -2. We removed those two points to see what happens to the estimates and their variances.
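A sketch of the diagnostics discussed above, assuming the standard `plot()` method for `lm` objects was used to draw them. The name `fit_final` is illustrative.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

fit_final <- lm(mpg ~ wt + qsec + carb + gear + am, data = mtcars)

# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
op <- par(mfrow = c(2, 2))
plot(fit_final)
par(op)

# Cars with the highest leverage (hat values)
head(sort(hatvalues(fit_final), decreasing = TRUE), 3)

# Standardized residuals of the two cars singled out in the text
rstandard(fit_final)[c("Merc 230", "Ford Pantera L")]
```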

## 
## Call:
## lm(formula = mpg ~ wt + qsec + carb + gear + am, data = mtcars[!row.names(mtcars) == 
##     "Merc 230" & !row.names(mtcars) == "Ford Pantera L", ])
## 
## Coefficients:
## (Intercept)           wt         qsec         carb         gear  
##      7.0587      -2.6054       0.8817      -1.0048       2.1076  
##    ammanual  
##      2.3644

While the estimates change, our transmission type variable continues to be statistically insignificant, which means its 95% and even 90% confidence intervals include zero. We therefore decided to keep the previous model, fitted without dropping the two high-leverage points, as our final model.
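The refit without the two cars can be written a little more readably than in the call shown above. This is an equivalent sketch, with `drop_cars`, `cars_trim` and `fit_trim` as illustrative names.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Remove the two high-leverage cars by row name
drop_cars <- c("Merc 230", "Ford Pantera L")
cars_trim <- mtcars[!(rownames(mtcars) %in% drop_cars), ]

fit_trim <- lm(mpg ~ wt + qsec + carb + gear + am, data = cars_trim)

# Estimates with standard errors and p-values
summary(fit_trim)$coefficients

# 90% confidence interval for the transmission effect
confint(fit_trim, "ammanual", level = 0.90)
```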

Conclusion

Let’s finally make a confidence interval for our variable of interest.

##                  2.5 %     97.5 %
## (Intercept) -5.7165450 27.5188704
## wt          -5.0537402 -1.2375448
## qsec         0.2203138  1.6810164
## carb        -1.8044745  0.3857182
## gear        -1.7058344  3.4234115
## ammanual    -0.7383426  6.4980468
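The intervals above match what `confint()` returns for the final model at its default 95% level. A minimal sketch, with `fit_final` and `ci` as illustrative names:

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

fit_final <- lm(mpg ~ wt + qsec + carb + gear + am, data = mtcars)

# 95% confidence intervals for all coefficients
ci <- confint(fit_final, level = 0.95)
ci

# Interval for the transmission effect only
ci["ammanual", ]
```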

Our point estimate for transmission type is 2.8799, which means the average manual car gets 2.88 more miles per gallon than the average automatic car, controlling for weight, quarter-mile time, number of carburetors and number of gears. The 95% confidence interval for the change from automatic to manual is [-0.738, 6.498], which includes zero, so we cannot reject the null hypothesis that this change is zero.