How to Chi-squared test

Categories: r, statistics

Author: Paulius Alaburda

Published: May 16, 2020

As I am working through Statistical Rethinking and picking up new tools, I decided to rework my knowledge of null hypothesis testing. This is definitely not the worst tutorial on the Chi-squared test, but it is more of a write-up to wrap my head around it. I think it will be of interest[1] to first-year undergraduates or to students interested in competing in biology olympiads.

Premise of the Chi-squared test

Karl Pearson created the test in 1900 to answer the question of whether two or more categorical variables are associated with one another. It is built on the idea that the sum of squares of independent standard normal variables follows a Chi-squared distribution. If the variables were dependent, their sum of squares would not follow a Chi-squared distribution.

Now, we usually cannot observe the whole population and instead collect samples via surveys and experiments (as in, we usually don’t know the exact gender, political affiliation or age breakdown of the whole population we are studying). However, we do know what the Chi-squared distribution looks like under the null hypothesis of no association. So the test gives us a single Chi-squared value, and we check where that value falls in the Chi-squared distribution. From that, we can calculate the probability of getting a value equal to or greater than the one we observed, i.e. the p value.
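In R, that last lookup is a single function call: pchisq() returns the probability mass of a Chi-squared distribution to either side of a value. A minimal sketch with a made-up statistic and made-up degrees of freedom (we will calculate real ones below):

pchisq(7.8, df = 2, lower.tail = FALSE)  # upper-tail probability, roughly 0.02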

Step by step of the Chi-squared test

That sounds like a lot to take in, for me and probably for a lot of students out there as well. What usually happens is that a workflow is taught instead: gather categorical data, shove it into the Chi-squared test formula, and stick your Chi-squared value and your p value into the paper[2].

In my case, I was taught how to perform the Chi-squared test by hand. That’s actually great, because it lets us break down the Chi-squared test formula[3]:

\[\tilde{\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\]

Here \(O_k\) is the observed count for a category and \(E_k\) is the expected count.
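If it helps to see the formula as code, it is a one-liner in R; a minimal sketch (chisq.test() does this, and more, for you):

chisq_statistic <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}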

Let’s say you are given a contingency table of two variables (in this case, the number of cylinders and the number of gears from the mtcars dataset).

  1. First, create a contingency table. This shows our observed values \(O_k\) for each combination of gear and cylinder:

library(tidyverse)

observed <- table(mtcars$cyl, mtcars$gear)

observed %>% knitr::kable()

  2. Then, find the expected values \(E_k\) as though there were no association between the two variables. If you were forced to do this with pen and paper, you would have to calculate the proportion of each type of gear and each type of cylinder. Using R, here is the proportion of each type of cylinder:

cyl_summary <- mtcars %>% count(cyl) %>% mutate(Proportion = prop.table(n))

cyl_summary %>% knitr::kable()

And each type of gear:

gear_summary <- mtcars %>% count(gear) %>% mutate(Proportion = prop.table(n))

gear_summary %>% knitr::kable()

If there were no association between gear and cylinder, we would expect the probability of a 4 cylinder car with 3 gears to be the probability of a 4 cylinder car times the probability of a 3 gear car. You can have R do this for you for each combination:

tbl_prop <- cyl_summary$Proportion %o% gear_summary$Proportion
rownames(tbl_prop) <- c(4,6,8)
colnames(tbl_prop) <- c(3,4,5)

tbl_prop %>% knitr::kable()

To get the expected values \(E_k\), we can multiply the probabilities by the total number of observations:

expected <- tbl_prop*sum(observed)

expected %>% knitr::kable()
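As an aside, the same expected counts can be computed in one line from the row and column totals of the observed table (the classic “row total times column total divided by the grand total” rule); here is a small check, using the observed table from above:

expected_direct <- rowSums(observed) %o% colSums(observed) / sum(observed)

all.equal(as.vector(expected), as.vector(expected_direct))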

  3. Then calculate the difference between the observed and the expected values. This gives you a table of differences between \(O_k\) and \(E_k\):

(observed-expected) %>% knitr::kable()

For example, we did not observe any cars with 8 cylinders and 4 gears, but based on how frequent 8 cylinder cars and 4 gear cars are, we expected to see 5.25 such cars.

Now what’s interesting is that those differences always sum to 0. That is inconvenient, so what Pearson did in 1900 was square the differences, giving us this:

(observed-expected)^2 %>% knitr::kable()
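Incidentally, we can also verify the claim that the raw differences cancel out before squaring:

sum(observed - expected)  # effectively zero, up to floating point error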

  4. Divide by the expected value. Without this step, \((O_k-E_k)^2\) is simply proportional to the total number of counts, so comparing, for example, a sample of 1000 counts with a sample of 100 counts would be impossible. That is why you need to normalise the values by dividing by the expected value[4]:

((observed-expected)^2/expected) %>% knitr::kable()

  5. Sum the resulting values:

chisq_rs <- sum((observed-expected)^2/expected)
print(chisq_rs)

  6. Using a reference table (for example, this one), find where your Chi-squared value lands. The degrees of freedom for a contingency table are (rows - 1) × (columns - 1), which here gives 2 × 2 = 4. You then find the largest critical value in the table that is smaller than the Chi-squared value you calculated. In this case, our value is larger than the critical value of 13.28 for df = 4, so we can say that the p value is less than 0.01.
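Instead of a printed reference table, R can return the critical value directly:

qchisq(0.99, df = 4)  # roughly 13.28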

To be exact, let’s see where this value falls if we were to build a Chi-squared distribution with df = 4:

chisq_sample <- rchisq(10000, 4)

qplot(chisq_sample) + geom_vline(xintercept = chisq_rs, color = "red") + theme_bw()

chisq_ecdf <- ecdf(chisq_sample)
chisq_percentile <- chisq_ecdf(chisq_rs)
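Printing the percentile, and comparing it with the exact upper-tail probability from the theoretical distribution, gives the same picture:

chisq_percentile

pchisq(chisq_rs, df = 4, lower.tail = FALSE)  # the exact p value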

The value falls well into the highest percentiles of our simulated distribution. Since that percentile is clearly above 0.95, we can reject the null hypothesis and conclude that there is an association between the number of gears and the number of cylinders. Alternatively, this is what chisq.test() would have returned:

rs <- chisq.test(observed)
rs
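The returned object also stores the pieces we calculated by hand, which makes it easy to check our work:

rs$statistic  # the Chi-squared value (matches chisq_rs)
rs$expected   # the expected counts (matches our expected table)
rs$p.value    # the p value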

We could have reported the result as follows (while also describing the count data):

A chi-squared test was conducted to examine whether there is an association between the number of gears and the number of cylinders across different cars. The results were significant (\(\chi^2 = 18.04\), df = 4, p = .001).

But why does it work?

I’ll be honest, it takes me a while to grok distributions beyond the Binomial and the Gaussian. A good explanation of how the Chi-squared distribution is derived can be found here.

The Chi-squared (\(\chi^2\)) distribution, as mentioned earlier, is the distribution of a sum of squares of independent standard normal variables. In other words, if you square values drawn from one standard normal distribution and add them to squared values drawn from another, the result should approximate a Chi-squared distribution:

\[U\sim N(0,1)\] \[V\sim N(0,1)\] \[U^2+V^2\sim\chi_2^2\]

That subscript denotes that the sum of squares follows a \(\chi^2\) distribution with 2 degrees of freedom. Alternatively, the square of a single standard normal variable follows a \(\chi_1^2\) distribution, and in general, for \(k\) independent standard normal variables \(Z_1, \dots, Z_k\):

\[\sum_{j=1}^{k} Z_j^2\sim\chi_k^2\]
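A quick numerical sanity check: a \(\chi_k^2\) distribution has mean \(k\) and variance \(2k\), so the sum of two squared standard normal samples should have a mean close to 2 and a variance close to 4:

u <- rnorm(10000)
v <- rnorm(10000)

c(mean = mean(u^2 + v^2), variance = var(u^2 + v^2))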

Let’s plot these distributions!

df <- tibble(`Normal` = rnorm(10000, mean = 0, sd = 1),
             `Normal squared` = Normal^2,
             `Actual Chi-squared` = rchisq(10000,2),
             `Chi-squared with df = 2` = `Normal squared` + rnorm(10000, mean = 0, sd = 1)^2) %>%
  gather(key = "distribution", value = "value") %>%
  mutate(distribution = factor(distribution, levels = c("Normal","Normal squared","Chi-squared with df = 2","Actual Chi-squared")))

ggplot(df, aes(x = value, fill = distribution)) +
  geom_histogram() +
  facet_grid(~distribution, scales = "free_x") +
  guides(fill = FALSE) +
  theme_bw()

So… Now what? How does any of this relate back to the Chi-squared test formula? Here’s where the central limit theorem comes in: even though count data (whether we are counting blood cells, people or adverse events) follows a Poisson distribution, with large enough counts that distribution becomes approximately normal.
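A quick way to convince yourself of that, with an arbitrary rate of 10 picked purely for illustration, is to standardise a Poisson sample and compare a few of its quantiles with those of a standard normal distribution:

pois_sample <- rpois(10000, lambda = 10)
z <- (pois_sample - mean(pois_sample)) / sd(pois_sample)

# Quantiles should be reasonably close to the standard normal ones
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(rbind(standardised_poisson = quantile(z, probs), standard_normal = qnorm(probs)), 2)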

To understand this, I’ve simulated two Poisson distributions (as in, count data). If we tried to build a Chi-squared distribution out of this data, what percentage of values would be higher than the 95th percentile of an actual Chi-squared distribution?

df <- tibble(
  pois_1 = rpois(10000, 10),
  pois_2 = rpois(10000, 10),
  chisq_real = rchisq(10000, df = 2)
  ) %>%
  mutate(pois_1 = (pois_1-mean(pois_1))/sd(pois_1),
         pois_2 = (pois_2-mean(pois_2))/sd(pois_2)) %>%
  mutate(chisq_11 = pois_1^2+pois_1^2,
         chisq_12 = pois_1^2+pois_2^2) %>%
  select(contains("chisq_")) %>%
  gather(key = "distribution", value = "value") %>%
  mutate(over_95 = value > qchisq(.95, df=2))

ggplot(df, aes(x = value, fill = over_95)) +
  geom_histogram(alpha = 0.5, binwidth = 1) +
  facet_grid(~distribution) +
  theme_bw()

chisq_12 is built from two independent Poisson samples, while chisq_11 is just the same sample added to itself (making the two terms perfectly dependent). From the chart, chisq_12 does seem to track the real Chi-squared distribution, and if you look at the percentage of values higher than the 95th percentile:

df %>%
  group_by(distribution) %>%
  summarise(over_95 = mean(over_95)) %>%
  knitr::kable()

So chisq_12 and chisq_real make sense: around 5 per cent of values are higher than the 95th percentile. For chisq_11, that is not the case and the percentage is higher!

We can see that the Chi-squared distribution requires that the added distributions be independent. The final kicker comes in when we look back at the Chi-squared test formula:

\[\tilde{\chi}^2=\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}\]

If the difference between observed and expected values is small, then \(\chi^2\) is small; if the difference is large, then \(\chi^2\) is large. Each observed count is approximately normally distributed around its expected count (see footnote 4), so after dividing by \(E_k\), each squared difference behaves like the square of a standard normal variable, and their sum can be modeled with a Chi-squared distribution. And since it can be modeled, it can be turned into a probability of observing data this extreme or more extreme[5]!
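To tie it all together, here is a small simulation sketch (the proportions and sample size are made up for illustration, and it uses the goodness-of-fit version of the formula, as in footnote 3): draw counts under the null hypothesis, apply the formula to each draw, and check that the resulting values behave like a Chi-squared distribution with the right degrees of freedom.

p_null <- c(0.5, 0.3, 0.2)   # hypothetical proportions for three categories under the null
n_obs <- 200                 # hypothetical sample size

sims <- rmultinom(10000, size = n_obs, prob = p_null)
expected_counts <- n_obs * p_null

# Apply the Chi-squared formula to each simulated draw
null_stats <- colSums((sims - expected_counts)^2 / expected_counts)

# The simulated 95th percentile should sit close to the theoretical one (df = 3 - 1 = 2)
c(simulated = unname(quantile(null_stats, 0.95)), theoretical = qchisq(0.95, df = 2))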

To me, this makes enough sense to trust it. In summary, Pearson created a formula that returns a single value, and that value increases or decreases with the difference between observed and expected counts. The test statistic by itself does not tell us much, but we know that under the null hypothesis the formula generates values that follow a Chi-squared distribution, so we can translate the statistic into a probability by checking where it falls in that distribution.

Even though the test is pretty simple, it packs a lot of assumptions and nuance. For a really good explanation, I would recommend checking out Danielle Navarro’s book Learning Statistics with R.

Footnotes

  1. Not necessarily interesting, haha↩︎

  2. This would make for a great “HAHA CHI-SQUARED TEST GOES BRRR” meme.↩︎

  3. I am leaving out some nuance here by using the goodness of fit formula when I could also show the Test for Independence formula which looks like this: \(X^2=\sum\limits_{i=1}^I \sum\limits_{j=1}^J\dfrac{(O_{ij}-E_{ij})^2}{E_{ij}}\) Although it is more correct, it also takes more time to read and understand for people not used to indices.↩︎

  4. A really good technical explanation can be found here. Essentially, \(O_k\) is a count, so it can be modeled with a Poisson distribution (\(O_k \sim \text{Poisson}(E_k)\), which has mean and variance \(E_k\)), and with high enough counts that distribution becomes similar enough to a normal distribution with mean and variance \(E_k\).↩︎

  5. I am putting this clunky wording here just to be correct because the probability of observing an exact Chi-squared test value is always zero since the distribution is continuous.↩︎