The Games-Howell test is a post hoc analysis approach for performing multiple pairwise comparisons between two or more sample populations. It is similar in spirit to Tukey's post hoc test, but unlike Tukey's test it does not assume homogeneity of variances or equal sample sizes. Thus, the Games-Howell test can be applied in settings where the assumptions of Tukey's test do not hold. When the groups do have equal variances and equal sample sizes, the two tests will often report similar results.

The Games-Howell test employs the Welch-Satterthwaite equation to approximate the degrees of freedom of each pairwise comparison, and compares the resulting test statistic against Tukey's studentized range distribution, denoted $q$. Unlike rank-based procedures, the test works directly on the means and variances of the raw sample observations.

The Games-Howell test is defined as:

$$ \left| \bar{x}_i - \bar{x}_j \right| > q_{\alpha,\, k,\, df} \, \sigma $$

where $\sigma$ is the standard error of the comparison:

$$ \sigma = \sqrt{\frac{1}{2} \left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)} $$

The degrees of freedom are calculated using Welch's correction:

$$ \large df = \frac{\left(\frac{s^2_i}{n_i} + \frac{s^2_j}{n_j}\right)^2}{\frac{\left(\frac{s_i^2}{n_i}\right)^2}{n_i - 1} + \frac{\left(\frac{s_j^2}{n_j}\right)^2}{n_j - 1}} $$

Confidence intervals can then be formed with:

$$ \bar{x}_i - \bar{x}_j \pm q_{\alpha,\, k,\, df} \sqrt{\frac{1}{2} \left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)} $$

p-values are calculated using Tukey's studentized range distribution:

$$ \large p = P\left(q_{k,\, df} \geq t \sqrt{2}\right) $$

where $t = \left| \bar{x}_i - \bar{x}_j \right| / \sqrt{s_i^2 / n_i + s_j^2 / n_j}$ is the Welch t-statistic of the comparison.

### Games-Howell Test in Python
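Before applying the test to a full dataset, the formulas above can be sketched for a single pair of groups. The sketch below uses synthetic data (the group sizes, means, and standard deviations are made up for illustration) and the `psturng` function from statsmodels for the studentized range p-value:

```
import numpy as np
from statsmodels.stats.libqsturng import psturng

# Two synthetic groups standing in for a single pair (i, j);
# the sizes, means, and standard deviations are made up for illustration.
rng = np.random.default_rng(0)
a = rng.normal(loc=10, scale=2, size=12)
b = rng.normal(loc=14, scale=5, size=12)

k = 2  # total number of groups in the experiment
# Per-group variance of the mean, s^2 / n
vi = a.var(ddof=1) / len(a)
vj = b.var(ddof=1) / len(b)
# Welch t-statistic of the comparison
t_val = np.abs(a.mean() - b.mean()) / np.sqrt(vi + vj)
# Welch-Satterthwaite degrees of freedom
df = (vi + vj) ** 2 / (vi ** 2 / (len(a) - 1) + vj ** 2 / (len(b) - 1))
# p-value from Tukey's studentized range distribution
p_val = psturng(t_val * np.sqrt(2), k, df)
```

With more than two groups, the same calculation is simply repeated for every pair of groups, which is what the walkthrough below does for the InsectSprays data.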

```
%matplotlib inline
import numpy as np
import pandas as pd
import numpy_indexed as npi
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.libqsturng import qsturng, psturng
from hypothetical.descriptive import var
from itertools import combinations
```

For this example, we will perform the Games-Howell test on the InsectSprays dataset, a standard dataset available in R that can also be downloaded as a CSV. According to the dataset's description, there are two columns:

- Count of insects
- Type of insecticide spray used

Before beginning, we load the data using `pandas`'s `read_csv` and inspect it to confirm the columns and data are what is described. After loading the data, we convert it to a `numpy` array using the `.to_numpy` method. The first three rows of the array are then displayed, as well as the unique values in the insecticide spray column.

```
sprays = pd.read_csv('../../data/InsectSprays.csv')
sprays = sprays.to_numpy()
print(sprays[:3])
print(list(np.unique(sprays[:,2])))
```

The array contains the two data columns as expected, plus an index column that is added when the data is loaded by `pandas`. We also see there are six types of insecticide sprays used in the experiment. A boxplot of the dataset is plotted using `seaborn` to get a better sense of the distribution of the groups.

```
plt.figure(figsize=(8, 4))
box = sns.boxplot(x=sprays[:,2], y=sprays[:,1], hue=sprays[:,2], palette="Set3")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.tight_layout()
```

The boxplot shows the groups' distributions and variances are slightly different from each other. The Games-Howell test is therefore a good candidate for comparing the groups, as it does not assume equality of variances. We first find the total number of groups, $k$, as well as each group's mean, number of observations, and variance. The `combinations` function from Python's `itertools` library is used to get all possible pairs of groups. The $\alpha$ used in this example is $0.05$.

```
alpha = 0.05
k = len(np.unique(sprays[:,2]))
group_means = dict(npi.group_by(sprays[:, 2], sprays[:, 1], np.mean))
group_obs = dict(npi.group_by(sprays[:, 2], sprays[:, 1], len))
group_variance = dict(npi.group_by(sprays[:, 2], sprays[:, 1], var))
combs = list(combinations(np.unique(sprays[:, 2]), 2))
combs
```

With the computed group descriptive statistics in hand, we can proceed to implement the Games-Howell test. The following block loops through each pair of groups listed above and calculates the mean difference, t-value, degrees of freedom, and p-value defined by the Games-Howell test. The `psturng` and `qsturng` functions from the `statsmodels.stats.libqsturng` module of the statsmodels library are used for the upper-tail probability (one minus the cumulative distribution function) and the quantile function of Tukey's studentized range, respectively.

```
group_comps = []
mean_differences = []
degrees_freedom = []
t_values = []
p_values = []
std_err = []
up_conf = []
low_conf = []

for comb in combs:
    # Mean difference of the group combination
    diff = group_means[comb[1]] - group_means[comb[0]]
    # t-value of the group combination
    t_val = np.abs(diff) / np.sqrt((group_variance[comb[0]] / group_obs[comb[0]]) +
                                   (group_variance[comb[1]] / group_obs[comb[1]]))
    # Numerator of the Welch-Satterthwaite equation
    df_num = (group_variance[comb[0]] / group_obs[comb[0]] +
              group_variance[comb[1]] / group_obs[comb[1]]) ** 2
    # Denominator of the Welch-Satterthwaite equation
    df_denom = ((group_variance[comb[0]] / group_obs[comb[0]]) ** 2 / (group_obs[comb[0]] - 1) +
                (group_variance[comb[1]] / group_obs[comb[1]]) ** 2 / (group_obs[comb[1]] - 1))
    # Degrees of freedom
    df = df_num / df_denom
    # p-value of the group comparison
    p_val = psturng(t_val * np.sqrt(2), k, df)
    # Standard error of the group combination
    se = np.sqrt(0.5 * (group_variance[comb[0]] / group_obs[comb[0]] +
                        group_variance[comb[1]] / group_obs[comb[1]]))
    # Upper and lower confidence limits: the critical q value scaled by the standard error
    upper_conf = diff + qsturng(1 - alpha, k, df) * se
    lower_conf = diff - qsturng(1 - alpha, k, df) * se
    # Append the computed values to their respective lists.
    mean_differences.append(diff)
    degrees_freedom.append(df)
    t_values.append(t_val)
    p_values.append(p_val)
    std_err.append(se)
    up_conf.append(upper_conf)
    low_conf.append(lower_conf)
    group_comps.append(str(comb[0]) + ' : ' + str(comb[1]))
```

The appended lists from the iteration above are then collected into a `pandas` `DataFrame` to give a quick and tidy table of the multiple comparisons computed by the Games-Howell test.

```
```
result_df = pd.DataFrame({'groups': group_comps,
                          'mean_difference': mean_differences,
                          'std_error': std_err,
                          't_value': t_values,
                          'p_value': p_values,
                          'upper_limit': up_conf,
                          'lower_limit': low_conf})
result_df
```

Let's filter the table to only group comparisons with a p-value below $0.05$.

```
result_df.loc[result_df['p_value'] <= 0.05]
```
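A useful follow-up is to visualize each comparison's mean difference together with its confidence limits. The sketch below builds a small hypothetical frame shaped like `result_df` (the numbers are made up for illustration) and plots each interval with `matplotlib`:

```
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical comparison results shaped like result_df above
demo = pd.DataFrame({'groups': ['A : B', 'A : C', 'B : C'],
                     'mean_difference': [2.5, -1.0, -3.5],
                     'lower_limit': [0.5, -3.2, -5.6],
                     'upper_limit': [4.5, 1.2, -1.4]})

fig, ax = plt.subplots(figsize=(6, 3))
# Asymmetric error bars: distance from the mean difference to each limit
xerr = [demo['mean_difference'] - demo['lower_limit'],
        demo['upper_limit'] - demo['mean_difference']]
ax.errorbar(demo['mean_difference'], demo['groups'], xerr=xerr, fmt='o', capsize=4)
ax.axvline(0, color='grey', linestyle='--')  # zero marks "no difference"
ax.set_xlabel('mean difference')
fig.tight_layout()
```

Intervals that do not cross the zero line correspond to comparisons whose difference is significant at the chosen $\alpha$.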
