Wald-Wolfowitz Two-Sample Runs Test

The Wald-Wolfowitz runs test is used to test the hypothesis that two independent samples have been drawn from the same population, against the alternative that the two samples differ in some way. The test is useful because it is sensitive to differences in more than one respect, whether in central tendency, variance, skewness, kurtosis, and so on. The test assumes that the data variable has a continuous underlying distribution.

Before the test is performed, the observations from both samples are combined into a single array and ordered by value, with each observation keeping its group membership. Depending on the total sample size $n_1 + n_2$, the test either uses a critical value table or is approximated using a normal distribution.
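The ordering step can be sketched as follows; the sample values below are made up purely for illustration:

```python
# A minimal sketch of the ordering step: pool two samples into one
# array, sort by value, and keep each observation's group label.
# The sample values here are invented for illustration.
sample_a = [4.1, 2.3, 3.7]
sample_b = [2.9, 5.0, 1.8]

pooled = [(v, 'A') for v in sample_a] + [(v, 'B') for v in sample_b]
pooled.sort(key=lambda pair: pair[0])

labels = [label for _, label in pooled]
print(labels)  # ['B', 'A', 'B', 'A', 'A', 'B']
```

Counting the maximal blocks of identical labels in `labels` gives the number of runs (five in this toy example).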

The sampling distribution of the observed number of runs, typically denoted $r$, follows from the fact that, once the two samples are ordered into a single array, the total number of possible arrangements is given by the binomial coefficient:

$$ \binom{n_1 + n_2}{n_1} = \binom{n_1 + n_2}{n_2} $$
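For instance, with $n_1 = n_2 = 2$ there are $\binom{4}{2} = 6$ distinguishable orderings, which we can confirm by enumeration (a quick sanity check, not part of the test itself):

```python
import itertools
from scipy.special import comb

# Count the distinct orderings of two groups of size 2 by enumerating
# the unique permutations of the group labels, and compare with C(4, 2).
labels = ['A', 'A', 'B', 'B']
arrangements = set(itertools.permutations(labels))
print(len(arrangements))           # 6
print(comb(2 + 2, 2, exact=True))  # 6
```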

It can then be shown that the probability of observing $r^{\prime}$ or fewer runs is the sum of two contributions. The terms from even values of $r$ give:

$$ p_{even} = \frac{1}{\binom{n_1 + n_2}{n_1}} \sum^{r^{\prime}}_{\substack{r=2 \\ r\,\mathrm{even}}} 2 \binom{n_1 - 1}{\frac{r}{2} - 1} \binom{n_2 - 1}{\frac{r}{2} - 1} $$

The terms from odd values of $r$ give:

$$ p_{odd} = \frac{1}{\binom{n_1 + n_2}{n_1}} \sum^{r^{\prime}}_{\substack{r=3 \\ r\,\mathrm{odd}}} \Bigg[ \binom{n_1 - 1}{k - 1} \binom{n_2 - 1}{k - 2} + \binom{n_1 - 1}{k - 2} \binom{n_2 - 1}{k - 1} \Bigg] $$

where $r = 2k - 1$, so that $p(r \leq r^{\prime}) = p_{even} + p_{odd}$.
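As a sketch, the even and odd terms can be combined into a small helper (`exact_runs_p` is not a library routine) that returns the exact probability of observing $r^{\prime}$ or fewer runs:

```python
from scipy.special import comb

def exact_runs_p(r_prime, n1, n2):
    # Exact probability of observing r_prime or fewer runs, summing the
    # even-r and odd-r terms given above. scipy's comb returns 0 when
    # the lower index is negative, so edge terms drop out naturally.
    total = comb(n1 + n2, n1)
    p = 0.0
    for r in range(2, r_prime + 1):
        if r % 2 == 0:
            k = r // 2
            p += 2 * comb(n1 - 1, k - 1) * comb(n2 - 1, k - 1)
        else:
            k = (r + 1) // 2
            p += (comb(n1 - 1, k - 1) * comb(n2 - 1, k - 2) +
                  comb(n1 - 1, k - 2) * comb(n2 - 1, k - 1))
    return p / total
```

As a check, with $n_1 = n_2 = 2$ the six equally likely arrangements yield two runs twice, so `exact_runs_p(2, 2, 2)` is $2/6$, and the cumulative probability over all possible run counts is 1.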

In the case of small samples ($n_1, n_2 \leq 20$), a critical value table is used to determine significance at an alpha of 0.05. If the observed runs value $r$ is equal to or less than the corresponding value in the critical value table, the null hypothesis $H_0$ may be rejected at a significance level of 0.05. Conversely, if the observed $r$ is greater than the tabled value, the null hypothesis cannot be rejected.
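The small-sample decision rule amounts to a table lookup, which can be sketched as follows. Note that the critical values in the dictionary below are placeholders for illustration, not values from the published table; a real application would use the full table:

```python
# Sketch of the small-sample decision rule. The dictionary maps
# (n1, n2) to a lower critical value at alpha = 0.05; the numbers
# here are placeholders, not values from the published table.
critical_values = {(10, 10): 6, (12, 10): 7}

def reject_h0(r_observed, n1, n2):
    # Reject H0 when the observed number of runs is at or below
    # the tabled critical value.
    return r_observed <= critical_values[(n1, n2)]

print(reject_h0(5, 10, 10))  # True: 5 <= 6, reject H0
print(reject_h0(9, 10, 10))  # False: 9 > 6, fail to reject
```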

When the small sample case does not apply, the sampling distribution of $r$ under the null hypothesis is approximately normal. This approximation becomes more accurate as the number of samples for one or both groups increases.

The mean of the runs, $\mu_r$, is defined as:

$$ \mu_r = \frac{2n_1 n_2}{n_1 + n_2} + 1 $$

With standard deviation, $\sigma_r$:

$$ \sigma_r = \sqrt{\frac{2n_1n_2(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2(n_1 + n_2 - 1)}} $$

The z-score, $z = \frac{r - \mu_r}{\sigma_r}$, is then defined as:

$$ z = \frac{r - \left(\frac{2n_1 n_2}{n_1 + n_2} + 1 \right)}{\sqrt{\frac{2n_1n_2(2n_1n_2 - n_1 - n_2)}{(n_1 + n_2)^2(n_1 + n_2 - 1)}}} $$

When the large-sample setting applies but the total sample size $N = n_1 + n_2$ is still modest (what counts as "large" is somewhat subjective; in practice it means the normal approximation to the sampling distribution of $r$ is not yet fully reliable), a continuity correction is recommended and in some cases required. The correction subtracts $0.5$ from the absolute difference between the observed runs $r$ and the mean $\mu_r$ in the z-score computation.

$$ z = \frac{|r - \mu_r| - 0.5}{\sigma_r} $$
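The corrected z-score and its one-sided p-value can be wrapped in a small helper (a sketch under the definitions above; `runs_z_test` is not a library function):

```python
import numpy as np
from scipy.stats import norm

def runs_z_test(r, n1, n2):
    # Normal approximation to the runs distribution with the
    # continuity correction described above. Returns the z statistic
    # and the one-sided p-value.
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    sigma = np.sqrt(2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
                    ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (abs(r - mu) - 0.5) / sigma
    return z, norm.sf(z)
```

For example, with $n_1 = n_2 = 10$ the mean is $\mu_r = 11$, so observing only 6 runs yields a z-score of roughly 2.07 and a one-sided p-value just under 0.02.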

Testing for Significant Differences in Cats' Weights by Gender with the Wald-Wolfowitz Runs Test

As a practical example, assume we are interested in identifying any significant differences in the weight of cats by their gender. We thus define the null and alternate hypotheses as:

$H_0$: Cat weights do not differ by gender in any respect

$H_A$: Cat weights differ by gender in at least one respect

The dataset we will analyze is available in the R MASS package.

Begin by importing the various packages and functions that will be used in the test.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.special import comb
import itertools

After downloading the dataset, we can read the CSV file using pandas's read_csv function.

In [2]:
cats = pd.read_csv('cats.csv')
cats.head()
   Unnamed: 0 Sex  Bwt  Hwt
0           1   F  2.0  7.0
1           2   F  2.0  7.4
2           3   F  2.0  9.5
3           4   F  2.1  7.2
4           5   F  2.1  7.3

Using the .loc method of the resulting DataFrame, we can extract the weights of each cat by gender into two new dataframes.

In [3]:
cats_m = cats.loc[cats['Sex'] == 'M'][['Sex', 'Bwt']]
cats_f = cats.loc[cats['Sex'] == 'F'][['Sex', 'Bwt']]


The number of samples differs between the two genders; since the Wald-Wolfowitz test does not require equal group sizes, it is well-suited to testing our hypothesis.

Performing the Wald-Wolfowitz Two-Samples Run Test

The first step in performing the Wald-Wolfowitz test is to order the cat weights as ascending values. This step can be accomplished by concatenating the two DataFrames with pandas's concat function and chaining the .sort_values method. (The DataFrame .append method used here in older pandas versions was removed in pandas 2.0.)

In [4]:
cats_new = pd.concat([cats_m, cats_f]).sort_values('Bwt')['Sex']

This step gives us the sorted order of the two samples' group membership, from which we can count the number of runs. To help visualize this, we print the first 15 group values of the resulting Series.

In [5]:
list(cats_new[:15])
['M', 'F', 'F', 'F', 'M', 'F', 'M', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F']

We see that this array contains six runs. To count the runs for the whole array without doing so manually, we can take advantage of the groupby function in the itertools standard library. This function yields one (key, group) pair per run, so iterating over it and counting gives the total number of runs in the combined array.

In [6]:
grouped = itertools.groupby(cats_new)

runs = 0
for t in grouped:
    runs += 1

A one-liner approach of the above can be written as the following:

In [7]:
runs = sum(1 for _ in itertools.groupby(cats_new))

We now have the number of runs in the array! With the number of runs in hand, we calculate the standard deviation, the mean of the runs, the z-score, and the associated p-value.

In [8]:
n1, n2 = len(cats_m), len(cats_f)

mean = (2 * n1 * n2) / (n1 + n2) + 1
sd = np.sqrt((2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) /
             ((n1 + n2) ** 2 * (n1 + n2 - 1)))

The continuity correction described above is applied when computing the z-score. As discussed, in the large-sample case the sampling distribution of $r$ under the null hypothesis is approximately normal. Thus we use the norm.sf function available in scipy, which computes the survival function $1 - F(z)$ of the standard normal distribution, to obtain the one-sided p-value.

In [9]:
z = (np.abs(runs - mean) - 0.5) / sd
p_val = norm.sf(z)

print('Z-score: ', z)
print('p-value: ', p_val)
Z-score:  6.247681901704195
p-value:  2.0829449738025176e-10

The p-value is far below the significance level of 0.05. Thus we reject the null hypothesis $H_0$ in favor of the alternate hypothesis $H_A$ and conclude body weights differ significantly in at least one regard depending on gender.

The number of runs also allows us to compute the exact probability of observing $r$ or fewer runs. We first collect the even and odd values in the range $[2, r]$. Using scipy's comb function, we evaluate the even and odd terms defined above and sum them.

In [10]:
r_range = np.arange(2, runs + 1)
evens = r_range[r_range % 2 == 0]
odds = r_range[r_range % 2 != 0]

p_even = 1 / comb(n1 + n2, n1) * \
         np.sum(2 * comb(n1 - 1, evens / 2 - 1) *
                comb(n2 - 1, evens / 2 - 1))

# The odd-r terms are written in terms of k, where r = 2k - 1
k = (odds + 1) / 2
p_odd = 1 / comb(n1 + n2, n1) * \
        np.sum(comb(n1 - 1, k - 1) * comb(n2 - 1, k - 2) +
               comb(n1 - 1, k - 2) * comb(n2 - 1, k - 1))

p = p_even + p_odd


Therefore, the exact probability of observing $r$ runs or fewer is vanishingly small, in close agreement with the p-value obtained from the normal approximation.
