log(x+1) and log(1+x)

statistics

Author

Paulius Alaburda

Published

May 26, 2020

I’ll be honest, my math skills are not great and I haven’t studied maths formally since high school. However, Lior Pachter’s preprint brings a methodological issue in science through a mathematical lens really well. I had to look up a few things and I think it might be worth exploring the paper through the lens of a simulation.

So the first equation in the paper looks like this:

\[E[X|X > 0] = \frac{\lambda}{1-e^{-\lambda}}\]

Essentially, what is the expected value of a Poisson random variable after you filter out any non-zero results? If \(\lambda\) is infinitely large, the expected value will approach \(\lambda\), however, what happens if it approaches zero? Using L’Hôpital’s rule we can estimate that the expected value will be roughly equal to 1. Here’s a quick simulation to confirm this:

library(purrr)
library(tidyverse)

df <- data.frame(Lambdas = c(0.00001,0.0001,0.0005,0.001,0.002,seq(from = 0.0001, to = 10, length.out = 30))) %>%
  mutate(sim_data = map(Lambdas, ~rpois(100,.))) %>%
  unnest(sim_data) %>%
  group_by(Lambdas) %>%
  mutate(`Expected value` = mean(sim_data)) %>%
  ungroup() %>%
  group_by(Lambdas,sim_data > 0) %>%
  mutate(`Expected value when X > 0` = mean(sim_data)) %>%
  ungroup() %>%
  filter(`sim_data > 0` == TRUE) %>%
  select(Lambdas,`Expected value`,`Expected value when X > 0`) %>%
  unique() %>%
  gather(key = "Measurement", value = "Mean", `Expected value`,`Expected value when X > 0`)

df %>%
  ggplot(aes(x = log(Lambdas), y = Mean, color = Measurement)) + geom_line(alpha = 0.5, size = 1) + theme_bw() + labs(title = "When lambda is sufficiently large, filtering zero counts has no effect") + scale_y_continuous(breaks = c(1:10))

The paper points out a great example - in cases of gene expression, \(\lambda\) can be small and filtering out zero counts (which could be wrongly interpreted as absence of the gene) will make lower gene expression indistinguishable from higher gene expression when both of them are low.

There’s a second issue when you try to log transform your counts. Log(0) is undefined, so what usually happens is that all the counts are incremented by one. This essentially means that we are filtering out zero counts and replacing them with ones. As we have seen for large lambdas, that does not really have an effect but it does have a large effect when we are dealing with rare events.

This is important in fields outside of RNA sequencing as well. For example, morphological studies can deal with cell count data as well as counts of certain structures of interest. I vaguely remember having to deal with count data during my undergraduate research years and failing to apply proper methods. If you are an undergrad reading this - do not log transform your count data if you think the event of interest is rare!

Do read the whole thing, it’s short and brings references to explore further. Quite recommended.