The Games-Howell test is a post hoc analysis approach for performing multiple pairwise comparisons between two or more sample populations. The Games-Howell test is similar in spirit to Tukey's post hoc test but, unlike Tukey's test, it does not assume homogeneity of variances or equal sample sizes. Thus, the Games-Howell test can be applied in settings where the assumptions of Tukey's test do not hold. With data that has roughly equal variances and equal sample sizes, the Games-Howell test and Tukey's test will often report similar results.
The Games-Howell test employs the Welch-Satterthwaite equation to approximate the degrees of freedom of each pairwise comparison, and it bases its critical values and p-values on Tukey's studentized range distribution, denoted $q$. The test statistic for each pair of groups is computed from the groups' sample means, variances, and sizes.
The Games-Howell test is defined as:
$$ \left| \bar{x}_i - \bar{x}_j \right| > q_{\alpha,\, k,\, df}\, \sigma $$
where $\sigma$ is the standard error of the difference between the two group means:
$$ \sigma = \sqrt{\frac{1}{2} \left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)} $$
The degrees of freedom are calculated using the Welch-Satterthwaite equation:
$$ df = \frac{\left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)^2}{\frac{\left(\frac{s_i^2}{n_i}\right)^2}{n_i - 1} + \frac{\left(\frac{s_j^2}{n_j}\right)^2}{n_j - 1}} $$
Thus, confidence intervals can be formed with:
$$ \bar{x}_i - \bar{x}_j \pm q_{\alpha,\, k,\, df} \sqrt{\frac{1}{2} \left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)} $$
p-values are calculated from the upper tail of Tukey's studentized range distribution, evaluated at $t\sqrt{2}$ with $k$ groups and $df$ degrees of freedom, where $t = \left| \bar{x}_i - \bar{x}_j \right| \big/ \sqrt{\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}}$:
$$ p = P\!\left(q_{k,\, df} \geq t \sqrt{2}\right) $$
Games-Howell Test in Python
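To make the formulas above concrete, they can be collected into a single pairwise comparison before working through a full dataset. The following is a minimal sketch under that framing: the function name games_howell_pair, its interface, and the example samples are illustrative choices rather than part of the worked example below, and the psturng and qsturng helpers from statsmodels are assumed to be available.
import numpy as np
from statsmodels.stats.libqsturng import psturng, qsturng

def games_howell_pair(a, b, k, alpha=0.05):
    # Games-Howell comparison of two groups drawn from a design with k groups in total.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_i, n_j = len(a), len(b)
    s2_i, s2_j = a.var(ddof=1), b.var(ddof=1)
    diff = b.mean() - a.mean()
    # Standard error and Welch-type t statistic
    se = np.sqrt(0.5 * (s2_i / n_i + s2_j / n_j))
    t = np.abs(diff) / np.sqrt(s2_i / n_i + s2_j / n_j)
    # Welch-Satterthwaite degrees of freedom
    df = (s2_i / n_i + s2_j / n_j) ** 2 / (
        (s2_i / n_i) ** 2 / (n_i - 1) + (s2_j / n_j) ** 2 / (n_j - 1))
    # p-value from the upper tail of the studentized range distribution
    p = psturng(t * np.sqrt(2), k, df)
    # Confidence limits: difference plus/minus the critical value times the standard error
    q_crit = qsturng(1 - alpha, k, df)
    return diff, p, (diff - q_crit * se, diff + q_crit * se)

# Hypothetical samples purely for illustration
games_howell_pair([10, 12, 9, 14, 11], [22, 25, 19, 28, 24], k=3)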
%matplotlib inline
import numpy as np
import pandas as pd
import numpy_indexed as npi  # grouped aggregation over numpy arrays
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.libqsturng import qsturng, psturng  # studentized range quantiles and tail probabilities
from hypothetical.descriptive import var  # variance function from the hypothetical package
from itertools import combinations  # all pairwise combinations of groups
For this example, we will perform the Games-Howell test on the InsectSprays dataset, a standard dataset available in R. The dataset can be downloaded as a CSV here. According to the dataset's description, there are two columns:
- Count of insects
- Type of insecticide spray used.
Before beginning, we load the data using pandas's read_csv and inspect it to confirm the columns and data are what is described. After loading the data, we convert it to a numpy array using the .to_numpy method. The first three rows of the array are then displayed, as well as the unique values in the insecticide spray column.
sprays = pd.read_csv('../../data/InsectSprays.csv')
sprays = sprays.to_numpy()  # columns: row index, insect count, spray type
print(sprays[:3])
print(list(np.unique(sprays[:, 2])))  # unique spray types
The array contains the two data columns as expected, along with an index column that is added when the data is loaded by pandas. We also see there are six types of insecticide spray used in the experiment. A boxplot of the dataset is plotted using seaborn to get a better sense of the distribution of each group.
plt.figure(figsize=(8, 4))
box = sns.boxplot(x=sprays[:,2], y=sprays[:,1], hue=sprays[:,2], palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.tight_layout()
The boxplot shows the groups' distributions and variances are slightly different from each other. The Games-Howell test is therefore a good candidate for comparing the groups, as it does not assume equality of variances. We first find the total number of groups, $k$, as well as each group's mean, number of observations, and variance. The combinations function from Python's itertools library is used to get all possible pairs of groups; with six groups this yields $\binom{6}{2} = 15$ pairwise comparisons. The $\alpha$ used in this example is $0.05$.
alpha = 0.05
k = len(np.unique(sprays[:,2]))
group_means = dict(npi.group_by(sprays[:, 2], sprays[:, 1], np.mean))
group_obs = dict(npi.group_by(sprays[:, 2], sprays[:, 1], len))
group_variance = dict(npi.group_by(sprays[:, 2], sprays[:, 1], var))
combs = list(combinations(np.unique(sprays[:, 2]), 2))
combs
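As a side check, the same per-group summaries can also be computed directly from the original DataFrame with pandas' groupby. This is a sketch that assumes the CSV keeps R's count and spray column names, which are not shown above.
sprays_df = pd.read_csv('../../data/InsectSprays.csv')
# Per-group mean, number of observations, and sample variance (assumed column names)
print(sprays_df.groupby('spray')['count'].agg(['mean', 'count', 'var']))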
With the computed group descriptive statistics in hand, we can proceed to implement the Games-Howell test. The loop below iterates over each pair of groups listed above and calculates the mean difference, t-value, degrees of freedom, p-value, standard error, and confidence limits defined by the Games-Howell test. The psturng and qsturng functions from the statsmodels.stats.libqsturng module of statsmodels are used to evaluate Tukey's studentized range distribution: psturng returns the upper-tail probability of a studentized range statistic (used for the p-values) and qsturng returns its quantiles (used for the critical values).
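As a quick illustration of how these two functions relate (a sketch using arbitrary values of $k = 6$ and $df = 20$ rather than quantities taken from the data), the quantile returned by qsturng maps back to the corresponding upper-tail probability under psturng:
q_crit = qsturng(0.95, 6, 20)  # 95th percentile of the studentized range
print(q_crit, psturng(q_crit, 6, 20))  # the tail probability should be close to 0.05
With that relationship in mind, the main loop can now be run.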
group_comps = []
mean_differences = []
degrees_freedom = []
t_values = []
p_values = []
std_err = []
up_conf = []
low_conf = []
for comb in combs:
    # Mean difference of each group combination
    diff = group_means[comb[1]] - group_means[comb[0]]
    # t-value of each group combination
    t_val = np.abs(diff) / np.sqrt((group_variance[comb[0]] / group_obs[comb[0]]) +
                                   (group_variance[comb[1]] / group_obs[comb[1]]))
    # Numerator of the Welch-Satterthwaite equation
    df_num = (group_variance[comb[0]] / group_obs[comb[0]] + group_variance[comb[1]] / group_obs[comb[1]]) ** 2
    # Denominator of the Welch-Satterthwaite equation
    df_denom = ((group_variance[comb[0]] / group_obs[comb[0]]) ** 2 / (group_obs[comb[0]] - 1) +
                (group_variance[comb[1]] / group_obs[comb[1]]) ** 2 / (group_obs[comb[1]] - 1))
    # Degrees of freedom
    df = df_num / df_denom
    # p-value of the group comparison from the studentized range distribution
    p_val = psturng(t_val * np.sqrt(2), k, df)
    # Standard error of each group combination
    se = np.sqrt(0.5 * (group_variance[comb[0]] / group_obs[comb[0]] +
                        group_variance[comb[1]] / group_obs[comb[1]]))
    # Upper and lower confidence limits: the mean difference plus/minus the
    # studentized range critical value multiplied by the standard error
    upper_conf = diff + qsturng(1 - alpha, k, df) * se
    lower_conf = diff - qsturng(1 - alpha, k, df) * se
    # Append the computed values to their respective lists.
    mean_differences.append(diff)
    degrees_freedom.append(df)
    t_values.append(t_val)
    p_values.append(p_val)
    std_err.append(se)
    up_conf.append(upper_conf)
    low_conf.append(lower_conf)
    group_comps.append(str(comb[0]) + ' : ' + str(comb[1]))
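Before building the results table, a quick check confirms that the loop produced one row per pair of groups, that is, $\binom{k}{2}$ comparisons (a small sanity check added here for illustration).
assert len(group_comps) == k * (k - 1) // 2  # 15 comparisons for the 6 spray groups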
The lists appended during the iteration above are then collected into a pandas DataFrame to give a convenient table of the multiple comparisons computed by the Games-Howell test.
result_df = pd.DataFrame({'groups': group_comps,
                          'mean_difference': mean_differences,
                          'std_error': std_err,
                          't_value': t_values,
                          'p_value': p_values,
                          'upper_limit': up_conf,
                          'lower_limit': low_conf})
result_df
Let's filter the table to show only the group comparisons with a p-value at or below $0.05$.
result_df.loc[result_df['p_value'] <= 0.05]
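As noted in the introduction, the Games-Howell test and Tukey's test often agree when group variances and sample sizes are comparable. As an optional cross-check, a sketch using statsmodels' pairwise_tukeyhsd (assuming the count column converts cleanly to float) can be run on the same data for comparison:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Tukey's HSD on the same data; unlike Games-Howell, it assumes equal variances
tukey = pairwise_tukeyhsd(endog=sprays[:, 1].astype(float),
                          groups=sprays[:, 2], alpha=0.05)
print(tukey.summary())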