Abstract
In this document we present some applications of statistical inference methods to real data. First, we review the most important result in statistical inference, the Central Limit Theorem, and show some differences between theoretical and sample statistics using an exponential distribution. Second, we explore the ToothGrowth data set, which records the results of an A/B test (a randomized experiment) on 60 guinea pigs that studied the effect of three doses of vitamin C, given by one of two delivery methods, on the length of odontoblasts.
Sample statistics properties through simulation and the Central Limit Theorem.
At the foundation of statistical inference lies the definition of probability itself. Under the frequentist interpretation of probability, we assume that stochastic processes are repeatable and that the probability of an outcome is the frequency with which that outcome appears in a large number of trials. In reality, some processes are not repeatable and we are left with a limited realization of events. We rely on statistical inference to draw conclusions from sampled data, making assumptions about the underlying distributions and using the derived properties of sample statistics.
In this section we present simulated samples of size 40 from an exponential distribution with \(\lambda\) = 0.2 to study some properties of the sample mean and variance, and how we can use inference to draw conclusions about the population.
First we set the seed to make the simulations reproducible, since we use pseudo-random numbers. Then we plot 1000 random numbers drawn from the underlying exponential distribution. The previous plot shows the exponential density function with \(\lambda\) = 0.2, which has mean and standard deviation both equal to \(1/\lambda\) = 5. The simulated values are very close to the theoretical distribution, as expected.
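The output in this report comes from R, and the original code is not shown. As a minimal sketch of the same step, the following Python snippet (standard library only) draws 1000 exponential values with rate 0.2 and compares their mean and standard deviation with \(1/\lambda\). The seed value and variable names are hypothetical.

```python
import random
import statistics

LAMBDA = 0.2      # rate parameter used in the text
N_DRAWS = 1000    # number of random draws, as in the text
SEED = 1234       # hypothetical seed; the report does not state the one it used

# Seeding the generator makes the pseudo-random draws reproducible.
random.seed(SEED)
draws = [random.expovariate(LAMBDA) for _ in range(N_DRAWS)]

theoretical = 1 / LAMBDA                  # mean and sd of the exponential are both 1/lambda
sample_mean = statistics.mean(draws)
sample_sd = statistics.stdev(draws)       # uses the n - 1 denominator

print(f"theoretical mean and sd: {theoretical:.3f}")
print(f"sample mean:             {sample_mean:.3f}")
print(f"sample sd:               {sample_sd:.3f}")
```

Running the snippet twice with the same seed prints the same numbers, which is the point of fixing the seed.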
Next, we draw 1000 samples of size 40 from the same underlying exponential distribution and compute the mean and variance of each sample.
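A sketch of this simulation in Python (standard library only; the report itself was produced in R) could look as follows. The seed and variable names are hypothetical. The last lines compare the spread of the simulated means with \(\sigma/\sqrt{n}\), the value the Central Limit Theorem predicts.

```python
import math
import random
import statistics

LAMBDA = 0.2          # rate parameter used in the text
SAMPLE_SIZE = 40      # observations per sample
N_SIMULATIONS = 1000  # number of simulated samples
SEED = 1234           # hypothetical seed; not taken from the report

rng = random.Random(SEED)

sample_means = []
sample_variances = []
for _ in range(N_SIMULATIONS):
    sample = [rng.expovariate(LAMBDA) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))
    sample_variances.append(statistics.variance(sample))  # n - 1 denominator

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
clt_sd = (1 / LAMBDA) / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean {1 / LAMBDA:.1f})")
print(f"sd of sample means:   {sd_of_means:.3f} (sigma / sqrt(n) = {clt_sd:.3f})")
print(f"mean sample variance: {statistics.mean(sample_variances):.3f}")
```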
The Central Limit Theorem states that the mean of n independent, identically distributed random variables with finite variance has a limiting normal distribution, with mean equal to the population mean and standard deviation equal to the population standard deviation divided by the square root of n, regardless of the underlying distribution.
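In symbols, writing \(\mu\) and \(\sigma\) for the population mean and standard deviation (notation introduced here for illustration), the statement reads:

\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty .
\]

For the exponential distribution used in this section, \(\mu = \sigma = 1/\lambda = 5\), so with \(n = 40\) the sample mean is approximately normal with mean 5 and standard deviation \(5/\sqrt{40} \approx 0.79\).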
From our simulations we can plot the sample means and see how the distribution has changed. The histogram shows that the means closely follow a normal distribution with mean \(1/\lambda\) = 5 and standard deviation \((1/\lambda)/\sqrt{40} \approx 0.79\), which is what the Central Limit Theorem predicts. The blue line represents this normal distribution, the x axis shows the sample mean of each simulation, and the red line shows that the histogram is centered at the population mean.
We also plot the Student's t distribution to show that, since the population variance is known (\(1/\lambda^2\) = 25), it does not fit the distribution of sample means as closely as the normal distribution does. Finally, the orange points show the theoretical exponential distribution with \(\lambda\) = 0.2, to show that the sample means do not follow it, despite the fact that every sample comes from it.
Finally, we plot a histogram of the sample variances to explore their distribution. As we can see, the distribution of the sample variances resembles neither the normal distribution nor the underlying exponential distribution. The theoretical variance \(1/\lambda^2\) = 25 is plotted as the red line and the mean of the sample variances as the purple line. The two lines do not coincide exactly in the previous plot. Whether this reflects bias depends on the estimator: the sample variance that divides by n is a biased estimator of the population variance, whereas the version that divides by n - 1 is unbiased, so for the latter any remaining gap is simulation noise.
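To illustrate the question of bias, the following Python sketch (standard library only) averages two variance estimators over many simulated samples: one dividing by n and one dividing by n - 1. The number of simulations is deliberately larger than the 1000 used in the text, purely to reduce simulation noise. That choice, the seed and the helper name are assumptions made for this illustration.

```python
import random

LAMBDA = 0.2
SAMPLE_SIZE = 40
N_SIMULATIONS = 20000  # assumption: more than the 1000 in the text, to reduce noise
SEED = 2024            # hypothetical seed

rng = random.Random(SEED)


def sum_of_squared_deviations(sample):
    """Sum of squared deviations from the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


unbiased_total = 0.0
biased_total = 0.0
for _ in range(N_SIMULATIONS):
    sample = [rng.expovariate(LAMBDA) for _ in range(SAMPLE_SIZE)]
    ss = sum_of_squared_deviations(sample)
    unbiased_total += ss / (SAMPLE_SIZE - 1)  # divides by n - 1
    biased_total += ss / SAMPLE_SIZE          # divides by n

mean_unbiased = unbiased_total / N_SIMULATIONS
mean_biased = biased_total / N_SIMULATIONS
population_variance = 1 / LAMBDA ** 2
expected_biased = population_variance * (SAMPLE_SIZE - 1) / SAMPLE_SIZE

print(f"population variance:        {population_variance:.3f}")
print(f"mean of n - 1 estimator:    {mean_unbiased:.3f}")
print(f"mean of n estimator:        {mean_biased:.3f} (expected {expected_biased:.3f})")
```

The estimator that divides by n is smaller than the other by the exact factor (n - 1)/n in every sample, so its average settles near 24.375 rather than 25.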
Statistical Inference Using the ToothGrowth Data.
This section develops some tests for the difference of means of groups to see the effects of different doses and administration methods of Vitamin C on tooth growth of guinea pigs.
First we attach the data and call some summaries.
##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000
This is a very simple data set, so for exploratory analysis it is sufficient to plot each pair of attributes. The first inference we are interested in is the estimation of the Average Treatment Effect of the two delivery methods, orange juice and ascorbic acid.
##
## Two Sample t-test
##
## data: ToothGrowth$len by ToothGrowth$supp
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
## -1.062765 8.462765
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
Since we are interested in a difference of means, and since we estimate the variance from the sample, the standardized difference of means between groups follows a Student's t distribution. With a pooled variance estimate it has \(n_1 + n_2 - 2\) degrees of freedom, in this case 58, as the output shows. This result relies on the group means being approximately normal, which the Central Limit Theorem supports for samples of this size, rather than on the data themselves being exactly normal.
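For reference, the standard pooled two-sample statistic can be written as follows, where \(\bar{x}_i\), \(s_i^2\) and \(n_i\) denote the mean, variance and size of group \(i\) (notation introduced here for illustration):

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2} .
\]

With \(n_1 = n_2 = 30\), the statistic has \(30 + 30 - 2 = 58\) degrees of freedom, which matches the reported `df = 58`. The reported difference \(20.66333 - 16.96333 = 3.70\) and \(t = 1.9153\) together imply a standard error of about \(3.70 / 1.9153 \approx 1.93\).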
In our test, we define a two-sided alternative because we make no assumption about which group should have the greater mean. Our mu argument is equal to zero because the difference of means is zero under the null hypothesis. The data are unpaired because each group is a randomized sample of different guinea pigs. We assume equal variance between the groups, since the subjects are randomized and there is no reason to believe otherwise. The confidence level is chosen using a Type I error rate corrected with the Bonferroni correction. This correction is needed because we run three tests on the same data realization, and the probability of drawing a wrong conclusion accumulates with the number of tests; there is only so much inference we can draw from one set of data. The Bonferroni correction is nothing more than the significance level divided by the number of tests performed.
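The arithmetic of the correction can be sketched in a few lines of Python. The overall significance level of 0.05 is an assumption: the text does not state it, but it is the value consistent with the 98.33333 percent confidence level shown in the test output. The p-value is the one reported for the first test.

```python
ALPHA = 0.05   # assumed overall (family-wise) significance level
N_TESTS = 3    # number of tests run on the same data, as stated in the text

corrected_alpha = ALPHA / N_TESTS            # Bonferroni-corrected level per test
confidence_level = 1 - corrected_alpha       # confidence level for each interval

reported_p_value = 0.06039                   # p-value reported for the OJ vs VC test
reject_null = reported_p_value < corrected_alpha

print(f"corrected alpha:  {corrected_alpha:.5f}")
print(f"confidence level: {100 * confidence_level:.5f} percent")
print(f"reject null for OJ vs VC: {reject_null}")
```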
From the previous test, we cannot reject the null hypothesis that orange juice and ascorbic acid have equal mean effects on the length of odontoblasts, even though the sample means differ noticeably (20.66 versus 16.96). The initial plot already suggests this, since the variance of the ascorbic acid group is large relative to that of the orange juice group.
Next we run two tests, comparing the first dosage level with the second and the second with the third, using the same parameters as in the first test.
##
## Two Sample t-test
##
## data: ToothGrowth$len[ToothGrowth$dose == 0.5] and ToothGrowth$len[ToothGrowth$dose == 1]
## t = -6.4766, df = 38, p-value = 1.266e-07
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
## -12.660704 -5.599296
## sample estimates:
## mean of x mean of y
## 10.605 19.735
##
## Two Sample t-test
##
## data: ToothGrowth$len[ToothGrowth$dose == 1] and ToothGrowth$len[ToothGrowth$dose == 2]
## t = -4.9005, df = 38, p-value = 1.811e-05
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
## -9.618121 -3.111879
## sample estimates:
## mean of x mean of y
## 19.735 26.100
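We can infer from the data that the mean tooth length at dose level 1 is significantly greater than at dose level 0.5, and that the mean at dose level 2 is significantly greater than at dose level 1: both p-values are far below the corrected significance level, and both confidence intervals exclude zero.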
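As a quick consistency check on the three test outputs above, the following Python sketch takes the reported group means and confidence limits, confirms that each interval is centered on the difference of the sample means, and flags whether the interval excludes zero. All numbers are copied from the outputs. The labels and the helper function are hypothetical.

```python
# (label, mean of first group, mean of second group, CI lower limit, CI upper limit)
reported_tests = [
    ("OJ vs VC", 20.66333, 16.96333, -1.062765, 8.462765),
    ("dose 0.5 vs dose 1", 10.605, 19.735, -12.660704, -5.599296),
    ("dose 1 vs dose 2", 19.735, 26.100, -9.618121, -3.111879),
]


def summarize(mean_x, mean_y, ci_low, ci_high):
    """Return the difference of means, the CI midpoint, and whether the CI excludes zero."""
    difference = mean_x - mean_y
    midpoint = (ci_low + ci_high) / 2
    excludes_zero = ci_low > 0 or ci_high < 0
    return difference, midpoint, excludes_zero


results = {label: summarize(*values) for label, *values in reported_tests}

for label, (difference, midpoint, excludes_zero) in results.items():
    print(f"{label}: difference = {difference:.3f}, "
          f"CI midpoint = {midpoint:.3f}, CI excludes zero = {excludes_zero}")
```

An interval at the corrected confidence level that excludes zero corresponds to rejecting the null hypothesis of equal means at the corrected significance level, which holds for the two dose comparisons but not for the delivery-method comparison.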