Introduction
With the advent of devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is increasingly easy and cheap for people to monitor and report body movements in order to improve their health, find patterns in their habits, or simply because it is fun. With these devices comes an increasingly large and interesting bank of data on body movements, which we can use to classify a variety of daily tasks performed by humans. Such classifiers can be built to predict what a person is doing in real time and trigger a number of useful events, for example playing music that matches the kind of task, starting a caloric estimation count, or showing specific ads. One interesting application would be for the device to recognize that you are working out, predict what kind of workout you are doing, monitor whether you are doing it incorrectly, and raise an alert to prevent injuries.
In this document, we analyze the Weight Lifting Exercise Dataset [1] in order to build a classifier that predicts whether the exercise is being performed correctly. Each subject was asked to perform a Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). We built five classifiers using different machine learning techniques, each tuned with 5-fold cross-validation, and blended them together with a simple majority-vote blender.
Data set processing
We read the raw dataset and show summary statistics for the first 20 columns.
training <- read.csv("train.csv")
summary(training[,1:20])
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1 adelmo :3892 Min. :1.322e+09 Min. : 294
## 1st Qu.: 4906 carlitos:3112 1st Qu.:1.323e+09 1st Qu.:252912
## Median : 9812 charles :3536 Median :1.323e+09 Median :496380
## Mean : 9812 eurico :3070 Mean :1.323e+09 Mean :500656
## 3rd Qu.:14717 jeremy :3402 3rd Qu.:1.323e+09 3rd Qu.:751891
## Max. :19622 pedro :2610 Max. :1.323e+09 Max. :998801
##
## cvtd_timestamp new_window num_window roll_belt
## 28/11/2011 14:14: 1498 no :19216 Min. : 1.0 Min. :-28.90
## 05/12/2011 11:24: 1497 yes: 406 1st Qu.:222.0 1st Qu.: 1.10
## 30/11/2011 17:11: 1440 Median :424.0 Median :113.00
## 05/12/2011 11:25: 1425 Mean :430.6 Mean : 64.41
## 02/12/2011 14:57: 1380 3rd Qu.:644.0 3rd Qu.:123.00
## 02/12/2011 13:34: 1375 Max. :864.0 Max. :162.00
## (Other) :11007
## pitch_belt yaw_belt total_accel_belt kurtosis_roll_belt
## Min. :-55.8000 Min. :-180.00 Min. : 0.00 :19216
## 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00 #DIV/0! : 10
## Median : 5.2800 Median : -13.00 Median :17.00 -1.908453: 2
## Mean : 0.3053 Mean : -11.21 Mean :11.31 -0.016850: 1
## 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00 -0.021024: 1
## Max. : 60.3000 Max. : 179.00 Max. :29.00 -0.025513: 1
## (Other) : 391
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## :19216 :19216 :19216
## #DIV/0! : 32 #DIV/0!: 406 #DIV/0! : 9
## 47.000000: 4 0.000000 : 4
## -0.150950: 3 0.422463 : 2
## -0.684748: 3 -0.003095: 1
## -1.750749: 3 -0.010002: 1
## (Other) : 361 (Other) : 389
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## :19216 :19216 Min. :-94.300 Min. : 3.00
## #DIV/0! : 32 #DIV/0!: 406 1st Qu.:-88.000 1st Qu.: 5.00
## 0.000000 : 4 Median : -5.100 Median :18.00
## -2.156553: 3 Mean : -6.667 Mean :12.92
## -3.072669: 3 3rd Qu.: 18.500 3rd Qu.:19.00
## -6.324555: 3 Max. :180.000 Max. :30.00
## (Other) : 361 NA's :19216 NA's :19216
## max_yaw_belt
## :19216
## -1.1 : 30
## -1.4 : 29
## -1.2 : 26
## -0.9 : 24
## -1.3 : 22
## (Other): 275
The readings are aggregated over intervals ranging from 0.5 to 2.5 seconds with 0.5-second overlap, and for each interval several derived features were computed and reported, such as the kurtosis and the Euler angles. Unfortunately, several of these indicators either had no variation or consisted mostly of NA values. We performed some basic data cleaning to keep only statistically useful attributes for building our classifier.
training <- training[,-grep("skewness|kurtosis|yaw|pitch|roll|picth|var",names(training))]
We also dropped variables such as the timestamps, which are not particularly useful for building the prediction algorithm, but kept important attributes such as the window number of the reading and the user name of the person performing the task, since each person may perform the dumbbell curl slightly differently and we want to account for that variation.
training <- training[,c(2,7:48)]
This is the final training data set on which we built our classifier.
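As a quick sanity check, not part of the original output, the cleaned set should contain the 19622 observations and 42 predictors plus the classe outcome reported in the annex:
# Sanity check on the cleaned training set: expect 19622 rows and
# 43 columns (42 predictors plus the classe outcome), per the annex
dim(training)
table(training$classe)   # class balance across the five fashions A-E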
Training Phase
To build this classifier, we did not perform a data split because:
- We avoided overfitting, performed model selection, and tuned each classifier using 5-fold cross-validation.
- The out-of-sample accuracy measurement is not of the utmost importance to us, since it will be provided by a web-based algorithm using the provided test set (which has no classe outcome variable).
We trained 6 machine learning algorithms using the caret package [2]:
- Linear discriminant analysis
- Penalized Multinomial Regression
- Model Averaged Neural Network
- Naive Bayes
- Random Forests
- Bayesian Generalized Linear Model
Based on the cross-validated accuracy estimates, we decided not to use the Model Averaged Neural Network because it did not fit the data well. The other 5 classifiers were used in the majority-vote function.
library(caret)   # provides train(), trainControl() and the model wrappers used below
set.seed(574)
z <- trainControl(method = "cv", number = 5)
fit1 <- train(classe ~ ., data = training, method = "lda", trControl = z)
fit2 <- train(classe ~ ., data = training, method = "multinom", trControl = z)
fit3 <- train(classe ~ ., data = training, method = "avNNet", trControl = z)
fit6 <- train(classe ~ ., data = training, method = "nb", trControl = z)
fit9 <- train(classe ~ ., data = training, method = "rf", trControl = z)
fit10 <- train(classe ~ ., data = training, method = "bayesglm", trControl = z)
Other models were discarded because of the computational time constraint.
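Where training time is the binding constraint, caret can run the resampling iterations in parallel through any registered foreach backend. The sketch below was not used in the original run and assumes the doParallel package is available; the worker count is an arbitrary example.
library(doParallel)        # foreach backend; caret uses any registered backend
cl <- makePSOCKcluster(4)  # 4 workers is an arbitrary choice
registerDoParallel(cl)
# calls to train() placed here run their cross-validation folds in parallel
stopCluster(cl)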
Results
In this section we show the cross-validated accuracy estimates and build the blending algorithm used to predict on the testing set and feed the predictions to the web-based app.
fit1$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.6920287 0.6089305 0.003304142 0.004077175
fit2$results
## decay Accuracy Kappa AccuracySD KappaSD
## 1 0e+00 0.6317905 0.5345320 0.008711726 0.01027731
## 2 1e-04 0.6317905 0.5345320 0.008711726 0.01027731
## 3 1e-01 0.6318414 0.5345976 0.008754426 0.01033499
fit6$results
## usekernel fL adjust Accuracy Kappa AccuracySD KappaSD
## 1 FALSE 0 1 0.4267021 0.2945092 0.043800363 0.04386220
## 2 TRUE 0 1 0.7211282 0.6422316 0.007378415 0.01144043
fit9$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9923557 0.9903296 0.0016599164 0.0020999124
## 2 24 0.9967895 0.9959389 0.0007760752 0.0009816961
## 3 46 0.9946492 0.9932312 0.0013817935 0.0017480006
fit10$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.3902767 0.2203644 0.003932735 0.00461844
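The per-model results above can also be collected into a single comparison with caret's resamples() function; this is a sketch, not part of the original analysis, assuming all fits completed under the same 5-fold scheme.
# Gather the fold-level Accuracy/Kappa of every fit for a side-by-side view
cvResults <- resamples(list(lda = fit1, multinom = fit2, avNNet = fit3,
                            nb = fit6, rf = fit9, bayesglm = fit10))
summary(cvResults)                       # Accuracy and Kappa distributions per model
bwplot(cvResults, metric = "Accuracy")   # lattice box-and-whisker comparison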
We can see that the best accuracy comes from the random forest algorithm, with a final mtry of 24, yielding a cross-validated accuracy of 99.68% and \(\kappa\) = 99.59%. The remaining classifiers are not very good on their own, but they can be combined into a blended classifier that we expect to perform better out of sample than any single classifier alone.
# `testing` is the 20-case test set for the grader, read like the training file (filename assumed)
predictions <- data.frame(lda = predict(fit1, testing),
                          multinom = predict(fit2, testing),
                          nb = predict(fit6, testing),
                          rf = predict(fit9, testing),
                          bayesglm = predict(fit10, testing))
# Count how many classifiers voted for each class A-E per test case
counts <- matrix(nrow = 20, ncol = 5)
for (i in 1:5) {
  counts[, i] <- apply(predictions, 1, function(x) sum(x == LETTERS[i]))
}
counts <- data.frame(counts)
names(counts) <- LETTERS[1:5]
# Majority vote: the class with the most votes wins
finalPrediction <- names(counts)[max.col(counts)]
This blended final prediction will be fed to the web-based code to obtain the out-of-sample accuracy rate, but that is outside the scope of the present analysis.
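One simple way to hand the 20 blended predictions to the grading app is to write each one to its own text file. The sketch below assumes a "problem_id_<n>.txt" naming convention, which is not confirmed by the source.
# Write one plain-text file per test case; the file-naming convention is assumed
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(finalPrediction)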
References
[1]: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har
[2]: Max Kuhn, with contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt (2016). caret: Classification and Regression Training. R package version 6.0-73. https://CRAN.R-project.org/package=caret
Annex
Overview of the tuning algorithms and the final optimal parameters
fit1
## Linear Discriminant Analysis
##
## 19622 samples
## 42 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15695, 15698, 15699, 15699, 15697
## Resampling results:
##
## Accuracy Kappa
## 0.6920287 0.6089305
##
##
fit2
## Penalized Multinomial Regression
##
## 19622 samples
## 42 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15698, 15699, 15696, 15697, 15698
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6317905 0.5345320
## 1e-04 0.6317905 0.5345320
## 1e-01 0.6318414 0.5345976
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
fit6
## Naive Bayes
##
## 19622 samples
## 42 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15698, 15698, 15698, 15695
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa
## FALSE 0.4267021 0.2945092
## TRUE 0.7211282 0.6422316
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
## and adjust = 1.
fit9
## Random Forest
##
## 19622 samples
## 42 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15698, 15698, 15699, 15695, 15698
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9923557 0.9903296
## 24 0.9967895 0.9959389
## 46 0.9946492 0.9932312
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 24.
fit10
## Bayesian Generalized Linear Model
##
## 19622 samples
## 42 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15697, 15698, 15696, 15699, 15698
## Resampling results:
##
## Accuracy Kappa
## 0.3902767 0.2203644
##
##