Unknown unknowns

The difference between censoring and truncation must be one of the most problematic areas of the BUGS language. A heuristic argument that I like to use is to imagine that the experiment is replicated. If your replicate data all fall inside (or outside) the same range as the real data, then you have truncation, and need to use the T(,) construct in JAGS.  If the replicate data do not necessarily fall in the same ranges then you have censoring, and need to use the dinterval distribution.

Apparently things are not so simple.  Stefano Andreon has sent me an example which appears to be neither censoring nor truncation, and for which this heuristic is not useful. Worse still, there does not seem to be a general way of tackling this problem in JAGS or BUGS.

Suppose we have an instrument that can only detect signals in a certain range. Any signal outside the range is never recorded. Even the fact that a signal has fallen outside the detection range is not recorded.  According to the epistemological classification made famous by the former US Secretary of Defence Donald Rumsfeld, these are unknown unknowns: things we do not know that we do not know.

Stefano’s example is a mixture model, but the mixture part is not necessary to explain the problem, so I will illustrate it with a simple example of estimating the mean of a normal distribution when we can only observe values in the range [-2,2].  We can generate such data in R as follows:

set.seed(1138)
mu <- 1.2
x <- rnorm(1000, mean=mu, sd=1)
x <- x[x >= -2 & x <= 2]

Here’s what the data look like. We started out with 1000 values for x and have kept the 807 values that lie in range.

Truncation histogram

Using my heuristic argument, this looks like a truncation problem. If we run the simulation experiment, we will get a different set of values, but they will all be in the range [-2,2]. If we do not know mu, we can try to estimate it using the following JAGS model:

model {
   for (i in 1:length(x)) {
      x[i] ~ dnorm(mu, 1) T(-2,2)
   }
   mu ~ dunif(-5,5)
}

In R you can use the following code to run the model

library(rjags)
m <- jags.model("trunc1.bug", data=list(x=x))
mu <- coda.samples(m, "mu", n.iter=1000)
densplot(mu, xlim=c(0,2))
abline(v=1.2, col="red")

This seems to work, as the true value, marked with a red vertical line is in the area with posterior support.

Truncation estimate

Now let us see what happens if we add some measurement error. We simulate some data where x is an error-prone measurement of an underlying variable z. The error is relatively small, with a standard deviation of 0.1:

set.seed(1138)
mu <- 1.2
z <- rnorm(1000, mean=mu, sd=1)
x <- rnorm(1000, mean=z, sd=0.1)
x <- x[x >= -2 & x <= 2]

Again, if we do not know mu we can try to estimate it with the following model, treating the x values as truncated:

model {
   for (i in 1:length(x)) {
      x[i] ~ dnorm(z[i], 100) T(-2,2)
      z[i] ~ dnorm(mu, 1)
   }
   mu ~ dunif(-5,5)
}

But unlike the previous example, this does not work:
Incorrect truncated estimate

What is going on here? If you write down the likelihood, you will see that we need to condition the probability statements on the fact that we only see observations in the range [-2,2]. So the likelihood for x[i] needs to be multiplied by

1/p(-2 <= x[i] <= 2 | mu).

What the T(,) construct does is multiply the likelihood by

1/p(-2 <= x[i] <= 2 | parents(x[i]))

These are the same in the first model – which is why the T trick works – but not in the second model – so T gives the wrong answer

In this simple example, we can do re-weighting of the likelihood by hand using the “zeros trick” and recover a consistent estimate of mu. But in more complex examples we cannot calculate the required probability weights because they involve calculating the marginal likelihood with the intermediate parameters integrated out.  If you can do that, you are not really in a situation where you need to use MCMC.  So it seems that there are two outstanding problems

  1. How can we describe this selective observation, which is apparently common in astronomy, which is neither censoring nor truncation?
  2.  Can such problems be analyzed in BUGS without calculating the marginal likelihood?

I am putting up this unusually long post in the hope that it will stimulate someone to think about the problem.  As usual, comments on this blog entry are disabled, but you can email me or put a trackback in your own blog.

About these ads