Introduction

With the advent of devices such as the Jawbone Up, Nike FuelBand and Fitbit, it is increasingly easy and cheap for people to monitor and report their body movements, whether to improve their health, to find patterns in their habits or simply because it is fun. These devices generate an increasingly large and interesting bank of movement data, which we can use to classify a variety of daily tasks performed by humans. Such classifiers can predict what a person is doing in real time and trigger a wide range of useful events: playing music suited to the kind of task, starting a calorie count, showing specific ads, and so on. One particularly interesting application would be for the device to recognize that you are working out, predict what kind of exercise you are doing, monitor whether you are doing it incorrectly and raise an alert to prevent injuries.

In this document, we analyze the Weight Lifting Exercises Dataset1 to build a classifier that predicts whether the exercise is being performed correctly. Each subject was asked to perform a Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). We built five classifiers using different machine learning techniques, tuning each with 5-fold cross-validation, and blended them together with a simple majority-vote ensemble.

Data set processing

We read the raw dataset and show summary statistics for the first 20 columns.

training <- read.csv("train.csv")
summary(training[,1:20])
##        X            user_name    raw_timestamp_part_1 raw_timestamp_part_2
##  Min.   :    1   adelmo  :3892   Min.   :1.322e+09    Min.   :   294      
##  1st Qu.: 4906   carlitos:3112   1st Qu.:1.323e+09    1st Qu.:252912      
##  Median : 9812   charles :3536   Median :1.323e+09    Median :496380      
##  Mean   : 9812   eurico  :3070   Mean   :1.323e+09    Mean   :500656      
##  3rd Qu.:14717   jeremy  :3402   3rd Qu.:1.323e+09    3rd Qu.:751891      
##  Max.   :19622   pedro   :2610   Max.   :1.323e+09    Max.   :998801      
##                                                                           
##           cvtd_timestamp  new_window    num_window      roll_belt     
##  28/11/2011 14:14: 1498   no :19216   Min.   :  1.0   Min.   :-28.90  
##  05/12/2011 11:24: 1497   yes:  406   1st Qu.:222.0   1st Qu.:  1.10  
##  30/11/2011 17:11: 1440               Median :424.0   Median :113.00  
##  05/12/2011 11:25: 1425               Mean   :430.6   Mean   : 64.41  
##  02/12/2011 14:57: 1380               3rd Qu.:644.0   3rd Qu.:123.00  
##  02/12/2011 13:34: 1375               Max.   :864.0   Max.   :162.00  
##  (Other)         :11007                                               
##    pitch_belt          yaw_belt       total_accel_belt kurtosis_roll_belt
##  Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00             :19216   
##  1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00    #DIV/0!  :   10   
##  Median :  5.2800   Median : -13.00   Median :17.00    -1.908453:    2   
##  Mean   :  0.3053   Mean   : -11.21   Mean   :11.31    -0.016850:    1   
##  3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00    -0.021024:    1   
##  Max.   : 60.3000   Max.   : 179.00   Max.   :29.00    -0.025513:    1   
##                                                        (Other)  :  391   
##  kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
##           :19216            :19216              :19216   
##  #DIV/0!  :   32     #DIV/0!:  406     #DIV/0!  :    9   
##  47.000000:    4                       0.000000 :    4   
##  -0.150950:    3                       0.422463 :    2   
##  -0.684748:    3                       -0.003095:    1   
##  -1.750749:    3                       -0.010002:    1   
##  (Other)  :  361                       (Other)  :  389   
##  skewness_roll_belt.1 skewness_yaw_belt max_roll_belt     max_picth_belt 
##           :19216             :19216     Min.   :-94.300   Min.   : 3.00  
##  #DIV/0!  :   32      #DIV/0!:  406     1st Qu.:-88.000   1st Qu.: 5.00  
##  0.000000 :    4                        Median : -5.100   Median :18.00  
##  -2.156553:    3                        Mean   : -6.667   Mean   :12.92  
##  -3.072669:    3                        3rd Qu.: 18.500   3rd Qu.:19.00  
##  -6.324555:    3                        Max.   :180.000   Max.   :30.00  
##  (Other)  :  361                        NA's   :19216     NA's   :19216  
##   max_yaw_belt  
##         :19216  
##  -1.1   :   30  
##  -1.4   :   29  
##  -1.2   :   26  
##  -0.9   :   24  
##  -1.3   :   22  
##  (Other):  275

The sensors were read over sliding windows of 0.5 to 2.5 seconds with 0.5-second overlap, and for each window a set of derived features was computed and reported, such as the kurtosis and the Euler angles. Unfortunately, several of these indicators had either no variation at all or consisted mostly of NA's. We performed some basic data cleaning to keep only statistically useful attributes for building our classifier.
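Rather than eyeballing the summary, the degenerate columns can also be identified programmatically. A minimal sketch using caret's nearZeroVar() and a per-column NA fraction (the 90% cutoff is an illustrative assumption):

library(caret)

nzv <- nearZeroVar(training)                        # (near) zero-variance columns
mostlyNA <- which(colMeans(is.na(training)) > 0.9)  # columns that are >90% missing
names(training)[union(nzv, mostlyNA)]               # candidates for removal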

# drop the skewness, kurtosis, variance and angle columns
# (note the "picth" misspelling present in the raw column names)
training <- training[,-grep("skewness|kurtosis|yaw|pitch|roll|picth|var", names(training))]

We also dropped variables such as the timestamps, which were not particularly useful for building the prediction algorithm, but kept important attributes such as the window number of the reading and the name of the user performing the task, since each person may perform the dumbbell curl slightly differently and we want to account for that variation.

# keep user_name (column 2), num_window, the remaining sensor readings and the classe outcome
training <- training[,c(2,7:48)]

This is the final training data set on which we built our classifier.
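A quick sanity check (a sketch; output omitted) confirms that the outcome survived the filtering and that the dimensions match the 42 predictors reported by caret in the Annex:

stopifnot("classe" %in% names(training))  # the outcome must still be present
dim(training)  # 19622 rows, 43 columns: 42 predictors plus classe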

Training Phase

To build this classifier, we did not perform a train/validation split because:

  1. We avoided overfitting, performed model selection and tuned each classifier using 5-fold cross-validation.
  2. The out-of-sample accuracy measurement is not of the utmost importance to us, since it will be provided by a web algorithm using the supplied test set (which has no outcome variable classe). For completeness, a sketch of an explicit holdout split follows this list.
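
For readers who nevertheless want an explicit holdout estimate, a split is straightforward with caret's createDataPartition(). A minimal sketch; the 75/25 ratio is an illustrative choice, and this split was not used in our analysis:

set.seed(574)
inTrain <- createDataPartition(training$classe, p = 0.75, list = FALSE)
trainSet <- training[inTrain, ]   # fit the models here
validSet <- training[-inTrain, ]  # estimate out-of-sample accuracy here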

We trained 6 Machine Learning algorithms:

  • Linear discriminant analysis
  • Penalized Multinomial Regression
  • Model Averaged Neural Network
  • Naive Bayes
  • Random Forests
  • Bayesian Generalized Linear Model

Based on the cross-validated accuracy estimates, we dropped the Model Averaged Neural Network because it did not fit the data well. The remaining five classifiers were used in the majority vote.

library(caret)

set.seed(574)
z <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation for every model
fit1 <- train(classe~., data=training, method="lda", trControl=z)       # linear discriminant analysis
fit2 <- train(classe~., data=training, method="multinom", trControl=z)  # penalized multinomial regression
fit3 <- train(classe~., data=training, method="avNNet", trControl=z)    # model averaged neural network
fit6 <- train(classe~., data=training, method="nb", trControl=z)        # naive Bayes
fit9 <- train(classe~., data=training, method="rf", trControl=z)        # random forest
fit10 <- train(classe~., data=training, method="bayesglm", trControl=z) # Bayesian generalized linear model

Other models were discarded because of computational time constraints.
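Training time can be reduced considerably by registering a parallel backend before calling train(), since caret distributes the cross-validation folds through foreach. A minimal sketch using doParallel; the worker count of 4 is an arbitrary assumption:

library(doParallel)

cl <- makePSOCKcluster(4)  # 4 workers is an illustrative choice
registerDoParallel(cl)
# ... the train() calls above would then run their folds in parallel ...
stopCluster(cl)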

Results

In this section we show the cross-validated accuracy estimates, build the blending algorithm, predict on the testing set and feed the predictions to the web-based app.

fit1$results
##   parameter  Accuracy     Kappa  AccuracySD     KappaSD
## 1      none 0.6920287 0.6089305 0.003304142 0.004077175
fit2$results
##   decay  Accuracy     Kappa  AccuracySD    KappaSD
## 1 0e+00 0.6317905 0.5345320 0.008711726 0.01027731
## 2 1e-04 0.6317905 0.5345320 0.008711726 0.01027731
## 3 1e-01 0.6318414 0.5345976 0.008754426 0.01033499
fit6$results
##   usekernel fL adjust  Accuracy     Kappa  AccuracySD    KappaSD
## 1     FALSE  0      1 0.4267021 0.2945092 0.043800363 0.04386220
## 2      TRUE  0      1 0.7211282 0.6422316 0.007378415 0.01144043
fit9$results
##   mtry  Accuracy     Kappa   AccuracySD      KappaSD
## 1    2 0.9923557 0.9903296 0.0016599164 0.0020999124
## 2   24 0.9967895 0.9959389 0.0007760752 0.0009816961
## 3   46 0.9946492 0.9932312 0.0013817935 0.0017480006
fit10$results
##   parameter  Accuracy     Kappa  AccuracySD    KappaSD
## 1      none 0.3902767 0.2203644 0.003932735 0.00461844

We can see that the best accuracy comes from the random forest algorithm, whose final mtry of 24 yields a cross-validated accuracy of 99.68% and \(\kappa\) = 99.59%. The rest of the classifiers are not very good on their own, but we can use them to build a blended classifier with potentially better out-of-sample performance than any single classifier alone.
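caret also provides resamples() to collect these cross-validated estimates into a single comparison object. A sketch (output omitted; it assumes all five fits used the same number of folds, which they did):

resamps <- resamples(list(lda = fit1, multinom = fit2, nb = fit6,
                          rf = fit9, bayesglm = fit10))
summary(resamps)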

# the test set is assumed to live in test.csv; it gets the same cleaning as the training set
testing <- read.csv("test.csv")
testing <- testing[,-grep("skewness|kurtosis|yaw|pitch|roll|picth|var", names(testing))]
testing <- testing[,c(2,7:48)]

# one column of predicted classes per model
predictions <- data.frame(lda=predict(fit1, testing),
                          multinom=predict(fit2, testing),
                          nb=predict(fit6, testing),
                          rf=predict(fit9, testing),
                          bayesglm=predict(fit10, testing))

# count the votes each class (A-E) receives for each of the 20 test cases
counts <- matrix(nrow = 20, ncol = 5)
for (i in 1:5){
    counts[,i] <- apply(predictions, 1, function(x) sum(x==LETTERS[i]))
}

counts <- data.frame(counts)
names(counts) <- LETTERS[1:5]
finalPrediction <- names(counts)[max.col(counts)]  # majority vote
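
A caveat on the vote itself: max.col() resolves ties at random by default, so a 2-2-1 split among the five voters gives a non-deterministic answer. A minimal sketch of a deterministic fallback, deferring to the random forest (the strongest individual model) on ties; this is an illustration, not what was run:

# flag test cases where two or more classes share the maximum vote count
tied <- apply(counts, 1, function(x) sum(x == max(x)) > 1)
finalPrediction[tied] <- as.character(predictions$rf[tied])  # let the random forest decide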

This blended final prediction will be fed to the web-based code that computes the out-of-sample accuracy rate, but that is beyond the scope of the present analysis.
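For completeness, one way the predictions could be persisted before uploading, assuming the grader expects one text file per test case (the file-name pattern is hypothetical):

for (i in seq_along(finalPrediction)) {
    writeLines(finalPrediction[i], paste0("problem_id_", i, ".txt"))  # hypothetical naming scheme
}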

References

1: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har

2: Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and Tyler Hunt. (2016). caret: Classification and Regression Training. R package version 6.0-73. https://CRAN.R-project.org/package=caret

Annex

Overview of the tuning algorithms and the final optimal parameters

fit1
## Linear Discriminant Analysis 
## 
## 19622 samples
##    42 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15695, 15698, 15699, 15699, 15697 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.6920287  0.6089305
## 
## 
fit2
## Penalized Multinomial Regression 
## 
## 19622 samples
##    42 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15698, 15699, 15696, 15697, 15698 
## Resampling results across tuning parameters:
## 
##   decay  Accuracy   Kappa    
##   0e+00  0.6317905  0.5345320
##   1e-04  0.6317905  0.5345320
##   1e-01  0.6318414  0.5345976
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was decay = 0.1.
fit6
## Naive Bayes 
## 
## 19622 samples
##    42 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15699, 15698, 15698, 15698, 15695 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa    
##   FALSE      0.4267021  0.2945092
##    TRUE      0.7211282  0.6422316
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
fit9
## Random Forest 
## 
## 19622 samples
##    42 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15698, 15698, 15699, 15695, 15698 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9923557  0.9903296
##   24    0.9967895  0.9959389
##   46    0.9946492  0.9932312
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 24.
fit10
## Bayesian Generalized Linear Model 
## 
## 19622 samples
##    42 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15697, 15698, 15696, 15699, 15698 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.3902767  0.2203644
## 
##