This tutorial offers an introduction to conformal inference, which is a method for constructing valid (with respect to coverage error) prediction bands for individual forecasts. The appeal of conformal inference is that it relies on few parametric assumptions. For formal treatments of conformal inference, refer to the following:
Shafer and Vovk (2007) “A Tutorial on Conformal Prediction”, arxiv.
Lei et al. (2017) “Distribution-Free Predictive Inference For Regression”, arxiv.
The current tutorial is meant to be a gentle introduction to complement these more complete treatments. It follows Lei et al. very closely—sort of a “Cliffs Notes” with some worked illustrations along the way.
The code that I have written below is not meant to be production level. If you actually want to use conformal methods, then you should use the conformalInference
package (github link). The code here is just to provide algorithmic illustrations of conformal methods.
Suppose a sample of \(n\) iid draws from \(P\), \(U(n) = U_1,...,U_n \sim P\). Consider this sample’s distribution:
set.seed(12345)
n <- 1000
U <- rnorm(n)
hist(U)
Now suppose a \(n+1\)th draw, \(U_{n+1}\), is to be taken from \(P\). What is the probability, with respect to the sample of \(N+1\) draws from \(P\), that \(U_{n+1}\) would exceed some value \(u\)? Given that all draws are iid, we can work with the empirical quantiles of the augmented sample, \((U(n), u)\) to see what mass resides to the right of \(u\). E.g., suppose that \(u = 2\). Then our augmented sample, with the associated finite sample probability, would be as follows:
u <- 2
Ua <- c(U, u)
hist(Ua)
abline(v=u)
print(round(mean(Ua > u), 3))
[1] 0.032
Suppose we set a miscoverage (“error”) rate of \(\alpha\). E.g., suppose that we choose \(\alpha = 0.05\). Then, we could do a line search of \(u\) values to define an interval, \(C\), for which \(\text{Pr}[u \in C] \ge 1-\alpha = 0.95\).
nEval <- 200
u.candidate <- seq(from=min(U), to=max(U), length = nEval)
Cbounds <- c(u.candidate[match(5, apply(as.matrix(u.candidate),
1,
function(x){floor(2.5+100*(mean(c(U, x) < x)))}))],
rev(u.candidate)[match(95, rev(apply(as.matrix(u.candidate),
1,
function(x){ceiling(-2.5+100*(mean(c(U, x) < x)))})))])
print(Cbounds)
[1] -1.980157 2.102782
hist(U)
abline(v=Cbounds)
This interval, \(C\) is a conformal interval.
This section and the next follow Lei et al. (2017).
Now suppose that we observe \(n\) iid draws of \(Z_i = (X_i, Y_i)\) from \(P\), and we can to construct a prediction band for \(Y_{n+1}\) given \(X_{n+1}\), also drawn from \(P\). A naive approach would first estimate a regression function \(\mu(X_i)\) for \(Y\) conditional on \(X\), generate a prediction \(\widehat{\mu}(X_{n+1})\) and then construct a \(1-\alpha\) band around \(\widehat{\mu}(X_{n+1})\) using \(\pm(1-\alpha/2)\) quantile of the distribution of fitted residuals \(|Y_i - \widehat{\mu}(X_i)|\):
X <- rnorm(n)
Y <- X + U
regData <- data.frame(X,Y)
fitlm <- lm(Y~X, data=regData)
eVec <- abs(fitlm$residuals)
hist(eVec)
plot(X,Y)
Xnew <- -1.5
abline(fitlm)
abline(v=Xnew, lty="dashed")
muHat <- predict(fitlm, newdata=data.frame(X=Xnew))
points(Xnew, muHat, pch=19, col="red")
C.X <- c(muHat-quantile(eVec, .975), muHat+quantile(eVec, .975))
points(rep(Xnew, 2), C.X, type="l", col="red", lwd=2)
While intuitive, such bands typically undercover, as the empirical residual distribution typically understates the true error variance. The goal of conformal inference is to correct on such undercoverage to deliver bands that have proper finite sample coverage.
One way to improve upon the naive bands is to use the same data augmentation method as we used in the “Basic Idea” section above. That is, we augment our data with \(X_{n+1}\) and a candidate \(y\), and then fit an augmented regression estimator \(\widehat{\mu}_y\) using \((Z_1,...,Z_n, (X_{n+1}, y))\). Then the fitted residuals are constructed as, \[ R_{y,i} = |Y_i - \widehat{\mu}_y(X_i)|, i=1,...,n \text{ and } R_{y,n+1} = |y-\widehat{\mu}_y(X_{n+1})|. \] Then, we rank \(R_{y,n+1}\) among the \(R_{y,i}\) to compute the proportion of points in the augmented sample with a fitted residual that is smaller than \(R_{y,n+1}\): \[ \pi(y) = \frac{1}{n+1}\sum_{i=1}^{n+1} 1\{ R_{y,i} \le R_{y,n+1}\} = \frac{1}{n+1} + \frac{1}{n+1}\sum_{i=1}^{n} 1\{ R_{y,i} \le R_{y,n+1}\} \] By exchangeabilty and by the fact that \(\widehat{\mu}\) is symmetric over the range of \(X\), we have that \(\pi(Y_{n+1})\) is uniform over \(\{1/(n+1),...,1\}\). As such, \[ \text{Pr}[(n+1)\pi(Y_{n+1}) \le \lceil (1-\alpha)(n+1)\rceil ] \ge 1-\alpha \]
We can thus construct the conformal band by evaluating \(\pi(y)\) over a range of candidate \(y\) values, in which case the interval is given by, \[ C(X_{n+1}) = \left\{ y \in\mathbb{R} : (n+1)\pi(y) \le \lceil(1-\alpha)(n-1) \rceil \right\}. \] Here is an illustration:
nEval <- 200
yCand <- seq(from=min(Y), to=max(Y), length=nEval)
confPredict <- function(y, Xin){
nData <- nrow(regData)
regData.a <- rbind(regData,c(Xin, y))
fitlm.a <- lm(Y~X, data=regData.a)
resOut <- abs(fitlm.a$residuals)
resOut_new <- resOut[length(resOut)]
pi.y <- mean(apply(as.matrix(resOut),
1,
function(x){x<=resOut_new}))
testResult <- pi.y*(nData+1) <= ceiling(.975*(nData+1))
return(testResult)
}
Cxa <- range(yCand[sapply(yCand, confPredict, Xin=Xnew)])
plot(X,Y)
abline(fitlm)
abline(v=Xnew, lty="dashed")
points(rep((Xnew+.05), 2), C.X, type="l", col="red", lwd=2)
points(rep(Xnew, 2), Cxa, type="l", col="blue", lwd=3)
The refined band (blue) is just a touch wider in this example.
Lei et al. (2017, Thm. 2.1) shows that when the residual distribution is continuous, this conformal band both covers marginally at the nominal rate, at least, and is tight on the order of \(1/(n+1)\). Stronger assumptions are needed for the band to cover conditionally (on \(X\)) at the nominal rate.
The augmented regression conformal bands are intensive to compute, as one needs to evaluate for a vector of \(y\) values. This could be costly if the regression model is complex (although in the case above things were quite simple).
Split conformal prediction has you fit on one random split (e.g., a half) of the data and then work with quantiles of the residual distribution from the other split, in a manner that resembles the naive the band above:
splitConfPredict <- function(Xin){
nData <- nrow(regData)
regData$index <- 1:nData
regData$split <- 1
regData$split[sample(regData$index, floor(nrow(regData)/2), replace=F)] <- 2
fitlm.spl <- lm(Y~X, data=subset(regData, split==1))
resOut <- abs(subset(regData, split==2)$Y - predict(fitlm.spl,
newdata=subset(regData, split==2)))
kOut <- ceiling(((nData/2)+1)*(.975))
resUse <- resOut[order(resOut)][kOut]
Y.hat <- predict(fitlm.spl, newdata=data.frame(X=Xin))
C.split <- c(Y.hat-resUse,Y.hat+resUse)
return(C.split)
}
plot(X,Y)
abline(fitlm)
abline(v=Xnew, lty="dashed")
points(rep((Xnew+.05), 2), C.X, type="l", col="red", lwd=2)
points(rep(Xnew, 2), Cxa, type="l", col="blue", lwd=3)
points(rep(Xnew-.05, 2), splitConfPredict(Xnew), type="l", col="green", lwd=3)
Lei et al. show that this split band is tight for marginal coverage at a rate that is similar to that of the augmented data band described in the previous section.
One problem with the bands presented above is that their width does not vary in \(X\). E.g., here is the augmented data conformal band over values of X:
aug.over.X <- function(Xval){range(yCand[sapply(yCand,
confPredict,
Xin=Xval)])}
Xvals <- -4:4
augBands <- matrix(unlist(lapply(Xvals, aug.over.X)), ncol=2, byrow=T)
plot(X,Y, type="n")
polygon(c(Xvals, rev(Xvals)),
c(augBands[,1],rev(augBands[,2])),
border=F,
col="gray")
points(X,Y)
abline(fitlm)
If the residual distribution has variance that varies in \(X\) (i.e., heteroskedasticity), however, this can produce bands that, while having good marginal coverage properties, may have poor conditional coverage properties.
Locally weighted conformal bands allows for the length of the bands to vary in \(X\) in a manner that accounts for heteroskedasticity. The idea is to model the conditional spread, and use estimates of the conditional spread to scale the bands.
We model the conditional spread of the absolute residuals using a local (kernel-weighted) polynomial regression, and then use that to scale the bands. I will modify the example so that heterskedasticity is indeed present:
Uhet <- rnorm(n, sd=exp(.5*X))
Y <- X + Uhet
regData <- data.frame(X,Y)
plot(X, Y)
fitlm.het <- lm(Y~X, data=regData)
regData$absRes <- abs(fitlm.het$residuals)
locFit <- locpol(absRes~X,
data=regData,
bw=2)
plot(regData$X, regData$absRes)
points(locFit$lpFit[,locFit$X],locFit$lpFit[,locFit$Y],type="l",col="blue")
Now we apply this version of locally weighted conformal prediction:
nEval <- 200
yCand <- seq(from=min(Y), to=max(Y), length=nEval)
confPredict.lw <- function(y, Xin, bwArg){
nData <- nrow(regData)
regData.a <- rbind(regData,c(Xin, y))
fitlm.a <- lm(Y~X, data=regData.a)
regData.a$resOut.sc <- abs(fitlm.a$residuals)
xevalVec <- c(rev(seq(from=Xin, to=min(X), by= -0.25)),
seq(from=Xin+.25, to=max(X), by=.25))
locFit <- locpol(resOut.sc~X,
data=regData.a,
bw=bwArg,
xeval=xevalVec)
resScale <- locFit$lpFit[,2][locFit$lpFit[,1]==Xin]
resOut <- regData.a$resOut.sc*resScale
resOut_new <- regData.a$resOut.sc[length(resOut)]
pi.y <- mean(apply(as.matrix(resOut),
1,
function(x){x<=resOut_new}))
testResult <- pi.y*(nData+1) <= ceiling(.975*(nData+1))
return(list(testResult,resScale))
}
aug.over.X.lw <- function(Xval){
Cout.lw <- range(yCand[unlist(sapply(yCand,
confPredict.lw,
Xin=Xval,
bwArg=2)[1,])])
return(Cout.lw)
}
Xvals <- -3:3
augBands.lw <- matrix(unlist(lapply(Xvals, aug.over.X.lw)), ncol=2, byrow=T)
plot(X,Y, type="n")
polygon(c(Xvals, rev(Xvals)),
c(augBands.lw[,1],rev(augBands.lw[,2])),
border=F,
col="gray")
points(X,Y)
abline(fitlm.het)