With the advent of devices such as the Jawbone Up, Nike FuelBand and Fitbit, it is increasingly easy and cheap for people to monitor and report their body movements, whether to improve their health, to find patterns in their habits, or just because it is fun. These devices produce an increasingly large and interesting bank of data on body movements, which we can use to classify a variety of daily tasks performed by humans. Such classifiers can predict what a person is doing in real time and trigger a wide range of useful events: for example, playing music suited to the task, starting a calorie estimate, or showing specific ads. One interesting application would be for the device to recognize that you are working out, predict which kind of workout you are doing, detect whether you are doing it incorrectly, and raise an alert in order to prevent injuries.
In this document, we analyze the Weight Lifting Exercise Dataset [1] in order to build a classifier that predicts whether the exercise is being done correctly. Each subject was asked to perform a Dumbbell Biceps Curl exercise in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). We built 5 classifiers using different machine learning techniques with 5-fold cross-validation, each one built from a single kind of classifier (some of them ensembles that rely on bagging or boosting techniques), and blended them together with a simple majority-vote blender.
Data set processing
We read the raw dataset and show some summary statistics of the first 20 columns.
training <- read.csv("train.csv")
summary(training[, 1:20])
The readings are grouped into intervals of 0.5 to 2.5 seconds with 0.5 seconds of overlap, and for each of these intervals several transformations were computed and reported, such as the kurtosis and the Euler angles. Unfortunately, several of these indicators either had no variation or were NA for most of the samples. We performed some basic data cleaning in order to keep only the statistically useful attributes for building our classifier.
training <- training[,-grep("skewness|kurtosis|yaw|pitch|roll|picth|var",names(training))]
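The filter above selects columns by name. As a complementary check, the following minimal sketch selects columns by content instead, assuming a column is uninformative when more than 90% of its values are missing or when it takes a single value. The toy data frame, the 0.9 threshold and the helper name `keep_informative` are illustrative assumptions and are not part of the original analysis.

```r
# Hypothetical helper: keep columns that are mostly observed and not constant.
keep_informative <- function(df, max_na = 0.9) {
  # Share of missing values in each column
  na_frac <- colMeans(is.na(df))
  # Number of distinct non-missing values in each column
  n_unique <- sapply(df, function(x) length(unique(x[!is.na(x)])))
  df[, na_frac <= max_na & n_unique > 1, drop = FALSE]
}

# Toy data standing in for the raw sensor table (invented values)
toy <- data.frame(
  accel_x  = c(1.2, 0.8, 1.1, 0.9),
  kurtosis = c(NA, NA, NA, NA),   # entirely missing
  constant = c(0, 0, 0, 0),       # no variation
  classe   = c("A", "B", "A", "C")
)

cleaned <- keep_informative(toy)
names(cleaned)
```

On the toy data only `accel_x` and `classe` survive, since the other two columns are either entirely missing or constant.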
We also removed variables such as the timestamps, which were not particularly useful for building the prediction algorithm, but kept important attributes such as the window number of the reading and the name of the user performing the task, since each person may perform the dumbbell curl slightly differently and we want to account for that variation.
training <- training[,c(2,7:48)]
This is the final training data set on which we built our classifier.
Training Phase
To build this classifier, we did not split the data into separate training and validation sets because:
- We controlled overfitting, performed model selection and tuned each classifier using 5-fold cross-validation.
- The out-of-sample accuracy measurement is not of the utmost importance to us, since it will be provided by a web-based algorithm using the provided test set (which does not include the outcome variable).
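To illustrate what a 5-fold cross-validated accuracy estimate measures, here is a minimal base-R sketch. It uses the built-in `iris` data and a simple nearest-centroid classifier as stand-ins; neither is used in this analysis, and the function name `nearest_centroid` is hypothetical. In the analysis itself, this bookkeeping is delegated to the model-training calls.

```r
set.seed(1)
data(iris)
k <- 5

# Randomly assign each row to one of k folds of (nearly) equal size
fold <- sample(rep(1:k, length.out = nrow(iris)))

# Stand-in classifier: predict the class whose centroid is closest
nearest_centroid <- function(train, test) {
  centroids <- aggregate(. ~ Species, data = train, FUN = mean)
  # Squared Euclidean distance from every test row to every class centroid
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    rowSums(sweep(as.matrix(test[, 1:4]), 2, unlist(centroids[j, -1]))^2)
  })
  as.character(centroids$Species)[max.col(-d, ties.method = "first")]
}

# Hold out each fold once, train on the remaining folds,
# and record the accuracy on the held-out fold
acc <- sapply(1:k, function(f) {
  pred <- nearest_centroid(iris[fold != f, ], iris[fold == f, ])
  mean(pred == as.character(iris$Species[fold == f]))
})

# The cross-validated estimate is the average over the k held-out folds
cv_accuracy <- mean(acc)
```

Because every observation is held out exactly once, `cv_accuracy` estimates performance on unseen data without setting aside a separate validation set.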
We trained 6 Machine Learning algorithms:
- Linear discriminant analysis
- Penalized Multinomial Regression
- Model Averaged Neural Network
- Naive Bayes
- Random Forests
- Bayesian Generalized Linear Model
Based on the cross-validated accuracy estimates, we decided not to use the Model Averaged Neural Network because it did not fit the data very well. The other 5 classifiers were used in the majority vote function.
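library(caret)
# 5-fold cross-validation settings shared by all models
z <- trainControl(method = "cv", number = 5)
fit1 <- train(classe ~ ., data = training, method = "lda", trControl = z)       # linear discriminant analysis
fit2 <- train(classe ~ ., data = training, method = "multinom", trControl = z)  # penalized multinomial regression
fit3 <- train(classe ~ ., data = training, method = "avNNet", trControl = z)    # model averaged neural network
fit6 <- train(classe ~ ., data = training, method = "nb", trControl = z)        # naive Bayes
fit9 <- train(classe ~ ., data = training, method = "rf", trControl = z)        # random forest
fit10 <- train(classe ~ ., data = training, method = "bayesglm", trControl = z) # Bayesian generalized linear model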
Other models were discarded because of the computational time constraint.
In this section we show the cross-validated accuracy estimates and build the blending algorithm to predict on the testing set and feed the predictions to the web-based app.
We can see that the best accuracy comes from the random forest algorithm, with a final mtry of 24, giving a cross-validated accuracy of 99.64% and \(\kappa\) = 99.59%. The rest of the classifiers are not very good on their own, but we can combine them into a blended classifier intended to perform better out of sample than any of the classifiers alone.
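The \(\kappa\) statistic reported next to the accuracy corrects for the agreement expected by chance: \(\kappa = (p_o - p_e)/(1 - p_e)\), where \(p_o\) is the observed agreement (the accuracy) and \(p_e\) is the agreement expected from the row and column totals of the confusion matrix. The following sketch works through the arithmetic on an invented 3-class confusion matrix; the counts are illustrative and are not results from this analysis.

```r
# Invented 3-class confusion matrix (rows: predicted, columns: observed)
cm <- matrix(c(50,  2,  1,
                3, 45,  2,
                0,  4, 43),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("A", "B", "C"),
                             observed  = c("A", "B", "C")))

n   <- sum(cm)
# Observed agreement: share of cases on the diagonal (the accuracy)
p_o <- sum(diag(cm)) / n
# Chance agreement: product of the marginal shares, summed over classes
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2
# Cohen's kappa: agreement beyond chance, rescaled to at most 1
kappa_hat <- (p_o - p_e) / (1 - p_e)
```

Here the accuracy is 0.92 while `kappa_hat` is lower, because part of the observed agreement would also be obtained by guessing according to the class frequencies.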
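# `testing` is the provided test set, assumed to be already loaded
predictions <- data.frame(lda=predict(fit1, testing), multinom=predict(fit2, testing), nb=predict(fit6, testing), rf=predict(fit9, testing), bayesglm=predict(fit10, testing))
# counts[r, i] = number of models that voted for class LETTERS[i] on test case r
counts <- matrix(nrow = nrow(predictions), ncol = 5)
for (i in 1:5) {
  counts[, i] <- apply(predictions, 1, function(x) sum(x == LETTERS[i]))
}
# Majority vote: most-voted class per test case (max.col breaks ties at random by default)
finalPrediction <- LETTERS[max.col(counts)]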
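The vote-counting logic can be checked independently of the fitted models. The following minimal sketch applies a majority vote to an invented table of predictions from three hypothetical models. The data, the model names `m1` to `m3`, and the deterministic tie rule (`ties.method = "first"`) are assumptions chosen so that the result is reproducible.

```r
# Invented predictions from three hypothetical models on four cases
votes <- data.frame(
  m1 = c("A", "B", "C", "E"),
  m2 = c("A", "B", "D", "E"),
  m3 = c("B", "B", "C", "A"),
  stringsAsFactors = FALSE
)
classes <- LETTERS[1:5]

# counts[r, j] = number of models that predicted classes[j] for case r
counts <- t(apply(votes, 1, function(x) table(factor(x, levels = classes))))

# Most-voted class per case; ties go to the first class in alphabetical order
majority <- classes[max.col(counts, ties.method = "first")]
```

Each case is assigned the class that at least two of the three models agree on. With an odd number of models and five classes a tie is still possible (for example, votes of A, A, B, B, C from five models), which is why the tie rule matters.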
This blended final prediction will be fed to the web-based app to obtain the out-of-sample accuracy rate, but that is outside the scope of the present analysis.
