The chi-square test is often used to assess the significance (if any) of the differences among $k$ different groups. The null and alternative hypotheses of the test are generally written as:
$H_0$: There is no significant difference between two or more groups.
$H_A$: There exists at least one significant difference between two or more groups.
The chi-square test statistic, denoted $\chi^2$, is defined as the following:
$$ \chi^2 = \sum^r_{i=1} \sum^k_{j=1} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} $$

where $O_{ij}$ is the $i$-th observed frequency in the $j$-th group and $E_{ij}$ is the corresponding expected frequency. The expected frequency of a cell, typically denoted $E_{cr}$ where $c$ is the column index and $r$ is the row index, is computed from the row and column totals of the table. Stated more formally, the expected frequency is defined as:

$$ E_{cr} = \frac{\left(\sum^{n_r}_{i=1} r_i\right)\left(\sum^{n_c}_{i=1} c_i\right)}{n} $$

where the two sums are the totals of the row and column containing the cell, $n$ is the total sample size, and $n_r, n_c$ are the number of cells in the row and column, respectively. The expected frequency is calculated for each 'cell' in the table.
For example, consider the following $2 \times 3$ contingency table of observed values:
group | col1 | col2 | col3 | total |
---|---|---|---|---|
cat1 | $10$ | $10$ | $20$ | $40$ |
cat2 | $20$ | $20$ | $10$ | $50$ |
total | $30$ | $30$ | $30$ | $90$ |
The expected frequencies would then be calculated as:
group | col1 | col2 | col3 |
---|---|---|---|
cat1 | $(40 \times 30)/90$ | $(40 \times 30)/90$ | $(40 \times 30)/90$ |
cat2 | $(50 \times 30)/90$ | $(50 \times 30)/90$ | $(50 \times 30)/90$ |
Thus, the expected frequencies of the contingency table are:
group | col1 | col2 | col3 |
---|---|---|---|
cat1 | $13.3333$ | $13.3333$ | $13.3333$ |
cat2 | $16.6667$ | $16.6667$ | $16.6667$ |
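As a quick check, the expected frequencies and the $\chi^2$ statistic for this $2 \times 3$ example can be reproduced in a few lines of numpy (the variable names here are purely illustrative):

```python
import numpy as np

# Observed 2x3 contingency table from the example above.
observed_2x3 = np.array([[10, 10, 20],
                         [20, 20, 10]])

row_totals = observed_2x3.sum(axis=1, keepdims=True)   # [[40], [50]]
col_totals = observed_2x3.sum(axis=0, keepdims=True)   # [[30, 30, 30]]
n = observed_2x3.sum()                                  # 90

# Each expected frequency is (row total * column total) / n.
expected_2x3 = row_totals * col_totals / n
chi_sq = np.sum((observed_2x3 - expected_2x3) ** 2 / expected_2x3)

print(expected_2x3)   # 13.33... in the first row, 16.66... in the second
print(chi_sq)
```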
The degrees of freedom are calculated as $(c - 1)(r - 1)$, where $c$ is the number of columns and $r$ is the number of rows in the contingency table. In the case of a $2 \times 2$ contingency table, Yates's continuity correction may be applied to reduce the error that arises from approximating the discrete observed frequencies with the continuous chi-square distribution. The continuity correction changes the computation of $\chi^2$ to the following:

$$ \chi^2 = \sum^r_{i=1} \sum^k_{j=1} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}} $$
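To make the correction concrete, here is a minimal sketch on a hypothetical $2 \times 2$ table; the counts in `small_table` are invented purely for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table (made-up counts).
small_table = np.array([[12, 7],
                        [5, 16]])

row_totals = small_table.sum(axis=1, keepdims=True)
col_totals = small_table.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / small_table.sum()

# Yates's correction subtracts 0.5 from each absolute deviation before squaring.
chi_corrected = np.sum((np.abs(small_table - expected) - 0.5) ** 2 / expected)

# scipy applies the same correction by default when the table has one degree of freedom.
print(chi_corrected, chi2_contingency(small_table, correction=True)[0])
```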
In addition to the test statistic, several measures of association are also provided. Measures of association are generally scaled from 0 to +1, with a value closer to 1 representing a stronger dependence between the two groups. The first is the phi coefficient, defined as:

$$ \phi = \sqrt{\frac{\chi^2}{N}} $$

where $N$ is the total sample size. The contingency coefficient, denoted as $C$, is defined as:

$$ C = \sqrt{\frac{\chi^2}{N + \chi^2}} $$

Cramer's $V$ is defined as:

$$ V = \sqrt{\frac{\chi^2}{N(k - 1)}} $$

where $k$ is the smaller of the number of rows and the number of columns. Lastly, Tschuprow's $T$ coefficient of association is defined as:

$$ T = \sqrt{\frac{\phi^2}{\sqrt{(r - 1)(c - 1)}}} = \sqrt{\frac{\frac{\chi^2}{n}}{\sqrt{(r - 1)(c - 1)}}} $$

## Example with a $3 \times 4$ Contingency Table
We will now consider performing a chi-square test of dependence on a $3 \times 4$ contingency table. Before beginning, import the libraries that will be used throughout the example.
from scipy.stats import chi2, chi2_contingency
import pandas as pd
import numpy as np
from functools import reduce
The $3 \times 4$ table that we will examine is constructed as a numpy array.
observed = np.array([[23, 40, 16, 2], [11, 75, 107, 14], [1, 31, 60, 10]])
observed
If the expected values of the observations are not known, which is typically the case in practical settings, an expected value table is constructed. As we saw above, each expected frequency is obtained by dividing the product of the cell's row total and column total by the total number of observations, for every cell in the table.
One possible approach for computing the expected values is to use numpy's handy `apply_over_axes` function, which allows us to apply a function over a specific axis of an array.
c = np.apply_over_axes(np.sum, observed, 0) # column totals of observations table
r = np.apply_over_axes(np.sum, observed, 1) # row totals of observations table
print(c)
print(r)
With the column and row totals in hand, we can then take advantage of the `reduce` function in the `functools` Python standard library. The `reduce` function applies a binary function cumulatively to a sequence of inputs. Here, we multiply the row and column totals of the observations table (numpy broadcasts the two vectors into a full table of products) and then divide by the total number of observations, all in one line!
exp_freq = reduce(np.multiply, (r, c)) / np.sum(observed)
exp_freq
Another possible approach for computing the expected frequencies is to use standard Python `for` loops.
exp_freq2 = []
for i in range(r.shape[0]):          # loop over the row totals
    for j in range(c.shape[1]):      # loop over the column totals
        exp = (r[i, 0] * c[0, j]) / np.sum(observed)
        exp_freq2.append(exp)
np.array(exp_freq2).reshape(observed.shape)
The table of absolute differences between the observed values and their respective expected frequencies can then be constructed for use in the chi-square statistic. numpy makes this easy for us by broadcasting arithmetic operations on arrays automatically.
cont_table = np.absolute(observed - exp_freq)   # absolute differences |O - E|
cont_table
With the absolute differences and the expected frequencies computed, we can calculate the $\chi^2$ value for the test of dependence.
chi_val = np.sum(cont_table ** 2 / exp_freq)
chi_val
The degrees of freedom, defined as $(c - 1)(r - 1)$ where $c$ is the number of columns and $r$ is the number of rows in the table, are then calculated. The degrees of freedom and the $\chi^2$ value give us what we need to compute the p-value. The p-value is calculated using scipy's `chi2.sf`, the survival function of the chi-square distribution.
degrees_of_freedom = (cont_table.shape[0] - 1) * (cont_table.shape[1] - 1)
pval = chi2.sf(chi_val, degrees_of_freedom)
pval
We can confirm our results by comparing them to the output given by the `chi2_contingency` function in scipy. According to the function's documentation, the function returns the $\chi^2$ value, the p-value, the degrees of freedom, and the expected frequency table, all of which match our results!
print(chi2_contingency(observed))
print()
print(chi_val)
print(pval)
print(degrees_of_freedom)
print(exp_freq)
## Measures of Association
Several measures of association, or dependence between two nominal variables, can be calculated to give the researcher more insight into how the two variables relate to one another, if at all.
### Cramer's $V$
One of the more commonly used and well-known measures of association is Cramer's $V$, published by Harald Cramér in 1946. The $V$ statistic is scaled from 0 to +1, meaning that a higher dependence between the two variables will result in a statistic closer to 1.
n = np.sum(observed)   # total number of observations
v = np.sqrt(chi_val / (n * (np.minimum(cont_table.shape[0], cont_table.shape[1]) - 1)))
v
### Phi Coefficient, $\phi$
The $\phi$ coefficient, introduced by Karl Pearson, is also a measure of association between two variables. Its interpretation is similar to that of the Pearson correlation coefficient: the $\phi$ coefficient can range from -1 to +1, where 0 indicates no relationship and negative and positive values indicate the direction of the association. Two variables are considered positively associated if most of the data falls along the diagonal of the contingency table; conversely, they are considered negatively associated when most of the data falls off the diagonal.
filled_diag = observed.copy()
np.fill_diagonal(filled_diag, 1)   # setting the diagonal to 1 leaves only the off-diagonal cells in the product below
phi_sign = np.prod(np.diagonal(observed)) - np.prod(filled_diag)   # sign of the association, generalizing ad - bc
phi_coeff = np.sqrt(chi_val / n)
if phi_sign < 0 and phi_coeff > 0:
    phi_coeff = -phi_coeff
phi_coeff
### Contingency Coefficient, $C$
The contingency coefficient, typically denoted $C$, is another measure of association that is based on the chi-square statistic but also adjusts for different sample sizes. Like Cramer's $V$, the contingency coefficient is scaled from 0 to +1, where a more dependent relationship between the two variables is indicated by a value closer to 1.
c = np.sqrt(chi_val / (n + chi_val))
c
### Tschuprow's Coefficient, $T$
Closely related to Cramer's $V$, but less well known and less widely used, is Tschuprow's coefficient, typically denoted $T$, introduced by Alexander Tschuprow in 1939. As with Cramer's $V$, $T$ is scaled from 0 to +1, where 0 indicates independence and values closer to 1 denote a more dependent relationship between the two variables.
t = np.sqrt(chi_val / (n * np.sqrt((cont_table.shape[0] - 1) * (cont_table.shape[1] - 1))))
t
### Aside: Equality of Tschuprow's $T$ and Cramer's $V$ in $2 \times 2$ Matrices
Due to the close relationship between Cramer's $V$ and Tschuprow's $T$, the two measures of association are equal when the contingency table is a $2 \times 2$ matrix. With two rows and two columns, the square-root term $\sqrt{(r - 1)(c - 1)}$ in the calculation of Tschuprow's $T$ reduces to 1, and the $\min(r, c) - 1$ term in Cramer's $V$ likewise reduces to 1.
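Concretely, substituting $r = c = 2$ into the two definitions collapses them to the same expression:

$$ T = \sqrt{\frac{\chi^2}{n\sqrt{(2 - 1)(2 - 1)}}} = \sqrt{\frac{\chi^2}{n}} = \sqrt{\frac{\chi^2}{n(\min(2, 2) - 1)}} = V $$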
We can verify this equality by slicing our original $3 \times 4$ table of observed values down to a $2 \times 2$ table and then performing the calculations for the chi-square test of dependence and the associated Cramer's $V$ and Tschuprow's $T$ measures of dependence.
cont_table2 = observed[0:2, 0:2]   # 2x2 slice of the observed frequencies
cc = chi2_contingency(cont_table2)
n2 = np.sum(cont_table2)
v2 = np.sqrt(cc[0] / (n2 * (np.minimum(cont_table2.shape[0], cont_table2.shape[1]) - 1)))
t2 = np.sqrt(cc[0] / (n2 * np.sqrt((cont_table2.shape[0] - 1) * (cont_table2.shape[1] - 1))))
print("Cramer's V 2x2 matrix: ", v2)
print("Tschuprow's T 2x2 matrix: ", t2)
## References
Wikipedia contributors. (2018, August 15). Contingency table. In Wikipedia, The Free Encyclopedia. Retrieved 12:08, August 28, 2018, from https://en.wikipedia.org/w/index.php?title=Contingency_table&oldid=854973657
Wikipedia contributors. (2020, April 14). Cramér's V. In Wikipedia, The Free Encyclopedia. Retrieved 13:41, August 12, 2020, from https://en.wikipedia.org/w/index.php?title=Cram%C3%A9r%27s_V&oldid=950837942
Wikipedia contributors. (2020, August 9). Phi coefficient. In Wikipedia, The Free Encyclopedia. Retrieved 13:40, August 12, 2020, from https://en.wikipedia.org/w/index.php?title=Phi_coefficient&oldid=971906217
Wikipedia contributors. (2019, January 14). Tschuprow's T. In Wikipedia, The Free Encyclopedia. Retrieved 13:40, August 12, 2020, from https://en.wikipedia.org/w/index.php?title=Tschuprow%27s_T&oldid=878279875
Wikipedia contributors. (2017, October 20). Yates's correction for continuity. In Wikipedia, The Free Encyclopedia. Retrieved 12:23, September 1, 2018, from https://en.wikipedia.org/w/index.php?title=Yates%27s_correction_for_continuity&oldid=806197753
https://www.empirical-methods.hslu.ch/decisiontree/relationship/chi-square-contingency/