Median Test

The median test, sometimes referred to as Mood's median test is a nonparametric procedure for investigating whether the medians of the populations from which $k$ sample groups are drawn are equal. The test is a particular case of the chi-square test of dependence. The null and alternative hypotheses when employing the median test may be written as:

$H_0$: All $k$ populations have the same median.

$H_A$: At least two of the $k$ populations have different medians.

Given $k$ samples with $n_1, n_2, \cdots, n_k$ data observations, the median test proceeds by computing the grand median of the combined observations. A $2 \times k$ contingency table is then constructed. The top row contains the number of total observations above the grand median for each of the $k$ sample groups, and the bottom row is the number of observations below the grand median. Ties between the individual observations and the grand median are either put in the top or bottom row or discarded entirely. A chi-square test of independence is then performed on the constructed $2 \times k$ contingency table.

The median test statistic, typically denoted $T$, is defined as:

$$ T = \frac{N^2}{ab} \sum \frac{\left ( O_{1i} - \frac{n_i a}{N} \right )^2}{n_i} $$

Where $a$ is the marginal total of the $2 \times k$ contingency table for observations above the grand median, and $b$ is the marginal total for those observations below the grand median. The test statistic is assumed to have a chi-square distribution where the degrees of freedom is defined as $k - 1$. The $T$ statistic is approximately equal to the $\chi^2$ value of the contingency table.

Median Test Example

The following is an example demonstrating how to perform the median test. Before beginning, import the libraries that will be used throughout the example.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

Consider four samples each taken from different populations. We first represent each sampled group as a list.

In [2]:
g1 = [83, 91, 94, 89, 89, 96, 91, 92, 90]
g2 = [91, 90, 81, 83, 84, 83, 88, 91, 89, 84]
g3 = [101, 100, 91, 93, 96, 95, 94]
g4 = [78, 82, 81, 77, 79, 81, 80, 81]

The first step in the median test is to compute the grand median. One approach to finding the grand median is to create a list of lists of the sampled data, and then leverage numpy's hstack function which stacks a given sequence of arrays horizontally, or column-wise. We can then get the grand median by employing numpy's median function on the new combined array. The stacked array also allows us to quickly grab the total number of observations $N$ and the degrees of freedom $k - 1$.

In [3]:
group_arr = [g1, g2, g3, g4]

stacked_array = np.hstack(group_arr)
stacked_array
Out[3]:
array([ 83,  91,  94,  89,  89,  96,  91,  92,  90,  91,  90,  81,  83,
        84,  83,  88,  91,  89,  84, 101, 100,  91,  93,  96,  95,  94,
        78,  82,  81,  77,  79,  81,  80,  81])
In [4]:
grand_median = np.median(stacked_array)

n = stacked_array.shape[0]
degrees_of_freedom = len(group_arr) - 1

The $2 \times k$ contingency table is ready for construction! Our approach to building the contingency table is to first initialize two lists, one for observations above the grand median and the other for values below the grand median. We then loop through each sample array and append the count of observations in the array that are above and below the grand median. Regarding ties, in this example, we are adding observations that are tied with the grand median to the bottom row of the contingency table. As mentioned above, tied observations may instead be added to the top row of the contingency table or discarded entirely. These approaches are also given below but commented out.

In [5]:
above = []
below = []

for vec in group_arr:
    vec_arr = np.array(vec)
    
    '''
    Observations tied with the grand median 
    are added to the bottom row of the contingency table
    '''
    
    above.append(len(vec_arr[vec_arr > grand_median]))
    below.append(len(vec_arr[vec_arr <= grand_median]))
    
    '''
    Observations tied with the grand median 
    are added to the top row of the contingency table
    '''
    
    #above.append(len(vec_arr[vec_arr >= self.grand_median]))
    #below.append(len(vec_arr[vec_arr < self.grand_median]))

    '''
    Ignores values that are equal to the grand median when compiling the 
    top and bottom rows of the contingency table
    '''
    
    #vec_arr = vec_arr[vec_arr != self.grand_median]
    
    #above.append(len(vec_arr[vec_arr > self.grand_median]))
    #below.append(len(vec_arr[vec_arr < self.grand_median]))

The $2 \times k$ contingency table can then be created by stacking the two appended lists above and below vertically. Similar to the hstack function used above, numpy also provides a vstack function which stacks two or more arrays row-wise.

In [6]:
cont_table = np.vstack((above, below))
cont_table
Out[6]:
array([[6, 3, 7, 0],
       [3, 7, 0, 8]])

The contingency table is now complete! We can see the count of observations from each group above and below (or tied, in this case) with the grand median. The last step in the median test is to perform a chi-square test of dependence on the constructed contingency table. For the chi-square test, we can employ scipy's chi2_contingency function.

In [7]:
chi2_contingency(cont_table)
Out[7]:
(17.543055555555558,
 0.000546370000565256,
 3,
 array([[4.23529412, 4.70588235, 3.29411765, 3.76470588],
        [4.76470588, 5.29411765, 3.70588235, 4.23529412]]))

The $\chi^2$ value is $17.54$, with a p-value of well below a significance level of $0.05$. Therefore, we should reject the null hypothesis $H_0$ in favor of the alternate hypothesis and conclude that two or more of the populations have different medians.

As noted above, the $\chi^2$ value will approximate the $T$ statistic of the median test. To verify this, we can compute the $T$ statistic by iterating over the top row of the contingency table.

In [8]:
a, b = np.sum(cont_table, axis=1)

t = 0

for i in range(0, len(above)):
    ni = len(group_arr[i])
    t += (above[i] - ni * a / n) ** 2 / ni
    
t *= n ** 2 / (a * b)

t
Out[8]:
17.543055555555558

Related Posts