Effect size and why it matters.

In this notebook I analyze diabetes data downloaded from NCD RisC. The dataset contains age-standardized diabetes prevalence estimates for every country in the world and spans the years 1980 to 2014.

Here I demonstrate that statistical hypothesis testing has limitations which, unless taken into consideration, can lead to erroneous conclusions even at a compellingly low p-value.

The notebook and part of the analysis were inspired by the work of Gail M. Sullivan and Richard Feinn published in the Journal of Graduate Medical Education in September 2012 under the title "Using Effect Size - or Why the P Value Is Not Enough" (and references therein).

Some feature names are a bit long and cumbersome to work with, so let's rename them to something shorter. Also, note that the prevalence is reported as a floating-point number ranging from 0 to 1; I will convert those values to percentages.
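
A minimal sketch of this step. The file name and the raw NCD-RisC column names below are assumptions for illustration; adjust them to the actual headers of the downloaded CSV.

```python
import pandas as pd

# Hypothetical file and column names; replace with the actual ones from NCD RisC.
df = pd.read_csv("NCD_RisC_diabetes.csv")

df = df.rename(columns={
    "Age-standardised diabetes prevalence": "prevalence",
    "Lower 95% uncertainty interval": "ci_low",
    "Upper 95% uncertainty interval": "ci_high",
})

# Prevalence is reported as a fraction of 1; convert to percent.
for col in ["prevalence", "ci_low", "ci_high"]:
    df[col] = 100 * df[col]
```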

It might be useful at some point to have the entire confidence interval at hand. Therefore, let's define its range, i.e. the difference between its upper and lower bounds.
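
With the renamed columns from the sketch above, the range is a one-liner:

```python
# Width of the 95% uncertainty interval (upper bound minus lower bound).
df["ci_range"] = df["ci_high"] - df["ci_low"]
```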

The data should be clean. But just in case, let's check.
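A few quick sanity checks, assuming the dataframe from the sketches above:

```python
# Missing values, duplicate rows, and the overall range of the prevalence values.
print(df.isna().sum())
print("duplicated rows:", df.duplicated().sum())
print(df["prevalence"].describe())
```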

We don't have any indication yet about the trend of the data over time, so a reasonable first step is to look at the distribution of prevalence values. We do so for men and women separately and compare.
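
A sketch of the comparison, assuming a Sex column with the values "Men" and "Women" and using matplotlib for the histograms:

```python
import matplotlib.pyplot as plt

men = df.loc[df["Sex"] == "Men", "prevalence"]
women = df.loc[df["Sex"] == "Women", "prevalence"]

plt.hist(men, bins=50, density=True, alpha=0.5, label="Men")
plt.hist(women, bins=50, density=True, alpha=0.5, label="Women")
plt.xlabel("Age-standardized diabetes prevalence (%)")
plt.ylabel("Density")
plt.legend()
plt.show()
```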

The distributions are reminiscent of Gaussians with a slight right skew, but we'd like to confirm this with rigorous statistical testing. The following function performs four such normality tests and summarizes the results in an output table. We apply the function to the data for men and for women.
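
A sketch of such a function; the exact choice of tests is an assumption, here Shapiro-Wilk, D'Agostino-Pearson, Jarque-Bera, and a Kolmogorov-Smirnov test against a standard normal:

```python
import pandas as pd
from scipy import stats

def normality_tests(sample, alpha=0.05):
    """Run four normality tests and summarize them in a small table."""
    sample = pd.Series(sample).dropna()
    z = (sample - sample.mean()) / sample.std()  # standardize for the KS test
    tests = {
        "Shapiro-Wilk": stats.shapiro(sample),
        "D'Agostino-Pearson": stats.normaltest(sample),
        "Jarque-Bera": stats.jarque_bera(sample),
        "Kolmogorov-Smirnov": stats.kstest(z, "norm"),
    }
    rows = [(name, stat, p, p < alpha) for name, (stat, p) in tests.items()]
    return pd.DataFrame(rows, columns=["test", "statistic", "p-value", "reject normality"])

normality_tests(men)
normality_tests(women)
```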

All tests consistently reject the null hypothesis that the distribution is Gaussian. One final check is QQ plots, which we produce with the probplot function from the scipy library.
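
A sketch of the QQ plots against a normal distribution:

```python
import matplotlib.pyplot as plt
from scipy import stats

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(men, dist="norm", plot=axes[0])
axes[0].set_title("QQ plot - men")
stats.probplot(women, dist="norm", plot=axes[1])
axes[1].set_title("QQ plot - women")
plt.tight_layout()
plt.show()
```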

The conclusion is that these are not Gaussian distributions.

The extensive overlap of the two distributions suggests that, given a single sample, it would be almost impossible to tell whether it came from a man or a woman. Nevertheless, we go ahead and run a hypothesis test. We will use a non-parametric test suitable for independent samples, the Mann-Whitney U test, which does not depend on the shape of the distributions (i.e. whether they are normal or not). The null hypothesis is that the medians of the two distributions are equal and the alternative is that one is higher than the other.
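
A minimal sketch of the test with scipy:

```python
from scipy import stats

# Two-sided Mann-Whitney U test on the men's and women's prevalence values.
u_stat, p_value = stats.mannwhitneyu(men, women, alternative="two-sided")
print(f"U = {u_stat:.0f}, p = {p_value:.3g}")
```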

Note that the two samples are quite large with a length of 7000 each!

The hypothesis test rejects H0! Our intuition, based on the distribution plots, was telling us the exact opposite!

Note that since the samples are quite large we could also run a two-sample t-test for independent samples. That would only result in an even lower p-value.
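
For reference, a sketch of that test (here Welch's variant, which does not assume equal variances):

```python
from scipy import stats

t_stat, p_t = stats.ttest_ind(men, women, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_t:.3g}")
```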

This is where effect size comes into play. Effect size is a statistical measure of the magnitude of the difference between the central tendencies of two groups' distributions.

To begin, we combine the two arrays into a new dataframe named comparison_df and then calculate means and standard deviations for men (M) and women (W).
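
A sketch of that step, reusing the men and women Series from above:

```python
import pandas as pd

comparison_df = pd.DataFrame({"M": men.reset_index(drop=True),
                              "W": women.reset_index(drop=True)})
print(comparison_df.agg(["mean", "std", "count"]))
```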

Effect Size Calculations

We are now ready to calculate effect sizes. We concentrate on the standardized mean difference between two groups and use two metrics, Cohen's d and Hedges' g, defined as follows.

$\text{Cohen's } d = \frac{M_{1}-M_{2}}{SD_{pooled}}$, where $SD_{pooled} = \sqrt{\frac{SD_{1}^{2} + SD_{2}^{2}}{2}}$

$\text{Hedges' } g = \frac{M_{1}-M_{2}}{SD_{pooled}^{*}}$, where $SD_{pooled}^{*} = \sqrt{\frac{(n_{1}-1)SD_{1}^{2} + (n_{2}-1)SD_{2}^{2}}{n_{1} + n_{2} - 2}}$

$M$ is the mean, $SD$ is the standard deviation, and $n$ is the sample size of each group.
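
A direct translation of the two formulas into code (a sketch, applied to comparison_df from above):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d with the simple pooled SD, sqrt((SD1^2 + SD2^2) / 2)."""
    sd_pooled = np.sqrt((np.var(x, ddof=1) + np.var(y, ddof=1)) / 2)
    return (np.mean(x) - np.mean(y)) / sd_pooled

def hedges_g(x, y):
    """Hedges' g with the sample-size weighted pooled SD."""
    n1, n2 = len(x), len(y)
    sd_pooled = np.sqrt(((n1 - 1) * np.var(x, ddof=1) + (n2 - 1) * np.var(y, ddof=1))
                        / (n1 + n2 - 2))
    return (np.mean(x) - np.mean(y)) / sd_pooled

print("Cohen's d:", cohens_d(comparison_df["M"], comparison_df["W"]))
print("Hedges' g:", hedges_g(comparison_df["M"], comparison_df["W"]))
```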

We will also use Pearson's r coefficient as a measure of association between the two groups (a sketch of one way to compute it follows the table below). For all metrics, the following table is a guide for interpreting the result.

| Value | Effect size |
|-------|-------------|
| 0.2   | low         |
| 0.5   | medium      |
| 0.8   | strong      |
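
One way to obtain Pearson's r here is the correlation between a binary group indicator and the prevalence values, i.e. the point-biserial correlation; the exact approach used is an assumption:

```python
import numpy as np
from scipy import stats

# 0 = men, 1 = women; with a binary variable this Pearson r is the point-biserial r.
values = np.concatenate([comparison_df["M"].values, comparison_df["W"].values])
groups = np.concatenate([np.zeros(len(comparison_df)), np.ones(len(comparison_df))])
r, p_r = stats.pointbiserialr(groups, values)
print(f"r = {r:.3f}")
```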

The conclusion is that the effect size analysis reveals negligible substantive (practical) significance in the difference between the distributions of the two groups. This was expected from the histogram plots but was not supported by the hypothesis testing.

Therefore, there are two key takeaways:

  1. Statistical plots, such as histograms, are a very useful way of identifying differences in distributions.
  2. Although a P value can be misleading, the effect size is not.

Map prevalence in time

Since the difference between men and women has negligible practical significance, I define a new mean of the age-standardized prevalence (averaging the values for men and women) and use plotly's choropleth function to visualize how that mean evolves over time. The visual is saved as a separate html file.
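
A sketch of the map, assuming columns named Country, ISO (three-letter country codes) and Year; the actual column names in the dataset may differ:

```python
import plotly.express as px

# Average the men's and women's prevalence per country and year.
mean_df = (df.groupby(["Country", "ISO", "Year"], as_index=False)["prevalence"]
             .mean())

fig = px.choropleth(mean_df,
                    locations="ISO",
                    color="prevalence",
                    hover_name="Country",
                    animation_frame="Year",
                    color_continuous_scale="Reds",
                    labels={"prevalence": "Prevalence (%)"})
fig.write_html("diabetes_prevalence_map.html")
```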

From the time-dependent map we see that diabetes prevalence increased markedly in the North African countries and the Arabian Peninsula.