Bartlett's Test for Equality of Variances with Python

Bartlett's test, developed by Maurice Stevenson Bartlett, is a statistical procedure for testing if $k$ population samples have equal variances. Equality of variances in population samples is assumed in commonly used comparison of means tests, such as Student's t-test and analysis of variance. Therefore, a procedure such as Bartlett's test can be conducted to accept or reject the assumption of equal variances across group samples.

Levene's test is an alternative to Barlett's test and is less sensitive to non-normal samples. Thus, Levene's test is generally preferred in most cases, particularly when the underlying distribution of the samples is known. Equality of variances is also known as homoscedasticity or homogeneity of variances.

Bartlett's test statistic, denoted $\chi^2$ (also sometimes denoted as $T$), is approximately chi-square distributed with $k - 1$ degrees of freedom, where $k$ is the number of sample groups. The chi-square approximation does not hold sufficiently when the sample size of a group is $n_i > 5$. The test statistic is defined as:

$$ \chi^2 = \frac{(n - k) \ln(S^2_p) - \sum^k_{i=1} (n_i - 1) \ln(S^2_i)}{1 + \frac{1}{3(k - 1)} \left(\sum^k_{i=1} (\frac{1}{n_i - 1}) - \frac{1}{n - k} \right)} $$

where,

  • $n$ is the total number of samples across all groups
  • $k$ is the number of groups
  • $S^2_i$ are the sample variances.

$S^2_p$, the pooled estimate of the samples' variance, is defined as:

$$ S^2_p = \frac{1}{n - k} \sum_i (n_i - 1) S^2_i $$

Bartlett's Test in Python

In [1]:
import numpy as np
import pandas as pd
from scipy.stats import chi2
import numpy_indexed as npi

The PlantGrowth dataset is available in R as part of its standard datasets and can also be downloaded here. After downloading the data, we load it into memory with pandas' read_csv function. Once the data is loaded, we transform the resulting DataFrame into a numpy array with the .to_numpy method. The first three rows of the dataset are then printed to get a sense of what the data contains. We also confirm there are indeed three sample groups in the data using numpy's unique function.

In [2]:
plants = pd.read_csv('../../data/PlantGrowth.csv')
plants = plants.to_numpy()

print(plants[:3])
print(list(np.unique(plants[:,2])))
[[1 4.17 'ctrl']
 [2 5.58 'ctrl']
 [3 5.18 'ctrl']]
['ctrl', 'trt1', 'trt2']

Implementing the test procedure in Python is comparatively straightforward. First, we require the total number of samples, $n$, and the number of sample groups, $k$. The number of samples can be found by indexing the first value returned from the plants array .shape method, while the number of unique groups is obtained by taking the length of the array returned by the unique function. We also require the total number of samples within each group and the group's variance, which we compute by grouping the plants array by len and numpy's var functions. The numpy-indexed library and its group_by function are convenient for performing this grouping operation on numpy arrays.

In [3]:
n = plants.shape[0]
k = len(np.unique(plants[:,2]))

group_n = np.array([i for _, i in npi.group_by(plants[:, 2], plants[:, 1], len)])
group_variance = np.array([i for _, i in npi.group_by(plants[:, 2], plants[:, 1], np.var)])

The sample pooled variance, $S^2_p$, is then computed.

In [4]:
pool_var = 1 / (n - k) * np.sum((group_n - 1) * group_variance)

As the Bartlett test statistic equation is rather hefty, we split the computation into two variables, the numerator and denominator of the test statistic, to help keep the code easier to read. The numerator and denominator are then divided, which returns the computed test statistic.

In [5]:
x2_num = (n - k) * np.log(pool_var) - np.sum((group_n - 1) * np.log(group_variance))
x2_den = 1 + 1 / (3 * (k - 1)) * (np.sum(1 / (group_n - 1)) - 1 / (n - k))

x2 = x2_num / x2_den
x2
Out[5]:
2.8785737872360935

The Bartlett test statistic, $\chi^2$, is approximately $2.279$. To find the associated p-value of the test statistic, we using the .cdf method from scipy.stats's chi2 variable with $k - 1$ degrees of freedom.

In [6]:
p = 1 - chi2.cdf(x2, k - 1)
p
Out[6]:
0.23709677363455817

The reported p-value is over $0.05$. Therefore we accept the null hypothesis that the group variances are equal.

Related Posts