Replication

W. Christopher Carleton edited this page Sep 21, 2020 · 30 revisions

The code below replicates two of the analyses described in "Evaluating Bayesian Radiocarbon-dated Event-Count [REC] Models for the study of long-term human and environmental processes." The first analysis involves response data (event-counts based on randomly generated dates in years BP) with no chronological error, while the second involves a full multi-level model for the same data with chronological uncertainty. All other analyses described in the paper can be run using the same general process.

1: No Chronological Error

In this section, we run a simple Bayesian Negative-Binomial model that does not account for chronological uncertainty. The objective is to ensure the model proposed in the paper can be used to correctly estimate a target parameter when there is no chronological uncertainty in the response data.

R Requirements

First, we need to load required R libraries.

Libraries

library(nimble)
library(ggplot2)
library(ggpubr)

Simulate Data

Then, we need to simulate some response data. For this demo, we will use a simple monotonic (exponential) increasing process. We will pass that process to R's sample() function as the 'prob' argument. The idea is to use the function to weight the probability that a given date (year BP) will be randomly drawn from all of the possible dates in a predetermined range. These dates will then constitute the sample used in the rest of the analyses.

Some parameters need to be chosen. For consistency, we can use the same settings here, but other options could be explored, obviously. The first parameter is the rate of increase for the monotonic function. In the paper that rate was labelled β and set at ±0.004---of course, in R the Greek letter wasn't used; it's just B in the code below. The next parameter is the length of the span of time for the response data, which was set at 1000 years. Then, a start date was chosen at random from between 40000 and 2000 BP. This range was used to keep the simulated data well away from the ends of the radiocarbon calibration curve. The start date used in the paper was 6000 BP, which meant the end was 5001 BP. The last parameter is the number of dates to sample, which was set at 1000.

Parameters

B <- 0.004
span <- 1000
Ndates <- 1000
start <- 6000
end <- 5001

With the necessary parameters set, we can evaluate the monotonic process and pass it to the sample() function in order to generate a random sample of dates (years BP). These dates are simulated event times, which we can then count per unit of time (years, in this case) in order to create count-based time-series for exploring REC models.

Simulate Calendar Dates

prob_fun <- exp(0:(span-1) * B)
simdates <- sample(x=start:end,size=Ndates,replace=T,prob=prob_fun)
event_counts <- as.data.frame(table(simdates))
names(event_counts) <- c("YBP","Count")
event_counts$YBP <- as.numeric(levels(event_counts$YBP))
event_counts$Count <- as.numeric(event_counts$Count)

Next, we need to expand the event_counts dataframe to include all of the years in which the count should be zero. These zeros are simply a function of random sampling, but they need to be included so that the simulated time grid is consistent. They were only excluded above because of the way we collated the data using existing R functions. In addition, we should create a Time variable that represents the passage of time instead of calendar years Before Present. The reasoning is that years BP increase going backwards in time, which can easily lead to confusion. The following lines of code make these adjustments:

event_counts <- merge(data.frame(YBP=start:end),event_counts,by=1,all=T)
event_counts[which(is.na(event_counts[,2])),2] <- 0
event_counts$Time <- rev(1:span)

Before going on, we can plot the event-count series and make sure it's increasing with time as expected:

ggplot(data=event_counts) +
    geom_col(mapping=aes(y=Count,x=YBP),position="identity",alpha=0.8,colour=NA) +
    labs(y="Count",x="Year BP") +
    xlim(c(start,end)) +
    theme_minimal() +
    theme(text = element_text(family="Times", size=12))
## Warning: Removed 2 rows containing missing values (geom_col).

Nimble Model

As described in the paper, the proposed model is based on a Negative-Binomial regression. That core model has the following form:

Short Model

The parameter r in the second equation is the regression function and we are trying to estimate β, which we set above to 0.004. To estimate the model's parameters, we will use the R package Nimble---a package for writing Bayesian models and estimating their parameters with MCMC. I suggest you read through the documentation on that project's website since this is not intended to be a tutorial.
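The model image above may not render in all views. Reconstructed from the Nimble code that follows (a sketch; Normal priors are parameterized here by standard deviation, matching the sd argument in the code), the specification is:

```latex
\begin{aligned}
Y_n &\sim \mathrm{NegBin}(r_n,\ p_n) \\
r_n &= e^{\beta_0 + \beta X_n} \\
p_n &\sim \mathrm{Uniform}(\approx 0,\ \approx 1) \\
\beta_0,\ \beta &\sim \mathrm{Normal}(0,\ \mathrm{sd}=100)
\end{aligned}
```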

First, we need to set up the Nimble model.

nbCode <- nimbleCode({
   B0 ~ dnorm(mean=0,sd=100)
   B ~ dnorm(mean=0,sd=100)
   for (n in 1:N) {
      r[n] <- exp(B0 + X[n] * B)
      p[n] ~ dunif(1e-10,1-1e-10)
      Y[n] ~ dnegbin(size=r[n],prob=p[n])
      y[n] ~ dnegbin(size=r[n],prob=p[n])
   }
})

Nimble is an excellent tool, but it takes some getting used to because it requires inputs that can be very un-R-like. The code block above is used to establish nodes in a directed acyclic graph structure that will be used in an MCMC simulation. The declarations each create a node and the ~ function assigns stochastic nodes while the <- function assigns deterministic ones. Any variable, like X or N above, that hasn't been declared will have to be passed into the Nimble functions before the MCMC is run.

The loop in the code block is repeatedly assigning distributions---like the uniform (dunif) and Negative-Binomial (dnegbin)---to nodes in the graph. As you can see, the model (graph) will have one B0 node, one B node, N r nodes, N p nodes, and so on, all corresponding to the mathematical specification of the model given earlier. At this point, nothing has happened other than defining nodes and establishing the structure of the graph. The Y nodes will contain data (event counts), while all other stochastic nodes will be repeatedly resampled during the MCMC. The only deterministic node will also have its value changed throughout the simulation, of course, as the values for B0 and B are randomly sampled. The y nodes were included because we want to look at the posterior predictive distribution---these allow us to estimate Bayesian credible intervals for the regression function.

Now we can set some variables that Nimble will use:

Y <- rev(event_counts$Count)
N <- length(Y)
X <- 0:(N - 1)

The Y vector needed to be reversed because the calendar dates are all in years BP, which means they would be ordered in reverse-time by R going from the smallest value (most recent date) to the largest value (oldest date). We want them to line up with the passage of time (X) going in the correct direction. We can also reduce the impact of rounding/precision errors caused by using the log-link function (having the regression function in the exponent of e) by shifting the start date from 6000 BP to 0---the rate parameter (β) in question is a rate in time, meaning that the calendar value (e.g., 6000 BP) isn't relevant; a year is a year regardless of when in the calendar it occurs.
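As a quick numerical illustration of the origin shift (hypothetical values, not part of the original analysis), compare the regression function evaluated on a time axis starting at 0 with the same function evaluated on calendar-scale values:

```r
# Illustration: shifting the time origin to zero keeps the linear predictor
# small under the log-link. B is the target rate used in the simulation above.
B <- 0.004
r_shifted  <- exp(B * (0:999))      # X starting at 0: values stay modest
r_calendar <- exp(B * (5001:6000))  # calendar-scale X inflates the exponent
range(r_shifted)
range(r_calendar)
```

The shifted axis keeps r below about 55, while calendar-scale inputs push e to the 20th power and beyond, inviting precision problems.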

The rest of this code is setting parameters to be passed to Nimble in order to build the model given the specification above and setup the MCMC.

This block creates a nimble model object:

nbData <- list(Y=Y,
                X=X)

nbConsts <- list(N=N)

nbInits <- list(B=0,
                B0=0)

nbModel <- nimbleModel(code=nbCode,
                        data=nbData,
                        inits=nbInits,
                        constants=nbConsts)
## defining model...

## building model...

## setting data and initial values...

## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions... This model is not fully initialized. This is not an error. To see which variables are not initialized, use model$initializeInfo(). For more information on model initialization, see help(modelInitialization).
## model building finished.

This line then compiles the Nimble model to C++ code. Doing this massively accelerates the runtime.

C_nbModel <- compileNimble(nbModel, showCompilerOutput = FALSE)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.

## compilation finished.

Next, we configure the MCMC provided by Nimble:

nbModel_conf <- configureMCMC(nbModel)
## ===== Monitors =====
## thin = 1: B0, B, p
## ===== Samplers =====
## RW sampler (1002)
##   - B0
##   - B
##   - p[]  (1000 elements)
## posterior_predictive sampler (1000)
##   - y[]  (1000 elements)

In particular, we want to add a set of MCMC monitors for the y variable, which will allow us to estimate predictive error (posterior credible intervals) for the regression model. Other variables in the model specification are monitored by default, but which variables are monitored and output after the model has run can be changed (see the Nimble documentation).

nbModel_conf$addMonitors2(c("y"))
## thin = 1: B0, B, p
## thin2 = 1: y

We also want to change the samplers used in the MCMC. More advanced samplers tend to perform better for models with correlated parameters. This is the case in many regression models because the terms can trade off to produce similar results (a bigger intercept can compensate for a smaller covariate regression coefficient). Correlated parameters can lead to very inefficient sampling, and for some models this means the MCMC never seems to converge on stable posterior estimates even after very large numbers of iterations. So, it is a good idea to group parameters that are likely to be correlated. Nimble would do this automatically for some parameters (like regression coefficients, including an intercept term) if they had been declared in a certain way. In this code, however, we declared the regression parameters individually and so have to group them manually. The Nimble team have found that an adaptive factor slice sampler can drastically improve sampling for correlated parameters, so we will use it here. (Note that the samplers for the relevant parameters have to be removed first because Nimble automatically sets up samplers when the MCMC is configured.)
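The intercept/slope trade-off described above can be sketched with a toy example (illustrative numbers only, not from the paper): two different (B0, B) pairs produce nearly collinear regression curves, which is why sampling the pair jointly helps.

```r
# Two different (B0, B) pairs yield very similar curves under the log-link,
# so their posteriors are correlated; a joint sampler (e.g., AF_slice)
# explores that ridge more efficiently than independent updates.
x  <- 0:999
r1 <- exp(0.0 + 0.0040 * x)  # smaller intercept, larger slope
r2 <- exp(0.2 + 0.0036 * x)  # larger intercept, smaller slope
cor(r1, r2)                  # near 1: the curves are almost indistinguishable
```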

nbModel_conf$removeSamplers(c("B","B0"))
nbModel_conf$addSampler(target=c("B","B0"),type="AF_slice")

Now we build the MCMC and compile it to C++ code as well:

nbModelMCMC <- buildMCMC(nbModel_conf,thin=1,enableWAIC = TRUE)
## Monitored nodes are valid for WAIC
C_nbModelMCMC <- compileNimble(nbModelMCMC,project=nbModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.

## compilation finished.

Then, we set the number of iterations and the random seed (to make the output reproducible):

niter <- 100
set.seed(1)

It should be noted that the actual number of iterations needed scales in some way with N. Getting good, stable, unbiased samples (MCMC chains) of the relevant model posteriors required hundreds of thousands of iterations---this replication uses only 100 for demonstration purposes.

Lastly, we run the MCMC and store the MCMC chains (monitored variables) as a matrix in the variable samples:

samples <- runMCMC(C_nbModelMCMC, niter=niter)
## running chain 1...

## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

After a long wait for the simulation to finish, the variable samples will be a list. The list contains two elements, each of which is a matrix of MCMC chains. The columns of each matrix contain the monitored model variables and the rows contain values from each iteration of the simulation. Given the specification set out above, the first list element (samples$samples) contains the MCMC chains for the variables B0, B, and p, while the second element (samples$samples2) contains the MCMC chains for the variable y.

head(samples$samples[,c(1,2)])
##                  B         B0
## [1,] -3.633598e-05 -0.3715540
## [2,]  4.839602e-04 -0.5328079
## [3,]  5.264443e-04 -0.6319156
## [4,]  7.043548e-04 -0.8247751
## [5,]  7.698017e-04 -0.8338135
## [6,]  8.698414e-04 -0.8434628
head(samples$samples2[,c(1,2)])
##      y[1] y[2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    1
## [4,]    0    0
## [5,]    0    0
## [6,]    0    0

2: Radiocarbon-dated Event-Count Model

Up to this point, the model is just a variant of Bayesian Negative-Binomial (NB) regression with parameters estimated by MCMC simulation. The only novelty is that the p parameter is allowed to vary for every observation in the event-count sequence---which may, in fact, not be that novel in other domains (I haven't looked deeply into it outside of palaeo science). The point is to then add a level to this model that allows us to account for chronological uncertainty in event times. So, in this section, we introduce chronological uncertainty by 1) extending the regression model, 2) turning the calendar-dated count series into a set of calibrated radiocarbon dates, 3) building an ensemble of probable event-count sequences to use in the extended model, and 4) building the corresponding Nimble model.

First, we adjust the NB-regression model by adding a level to it---essentially, just creating probability models for the regression coefficients. The new model is then structured as follows:

Short Model

As you can see, the model now has two levels. At the top level (the bottom few equations) there are probability models describing the distributions of hyperparameters. These hyperparameters are the regression parameters we want to estimate while accounting for chronological uncertainty in individual event times. The lower level (the top few equations) contains regression models for each of the J probable event-count sequences. These sequences will come from the RECE (radiocarbon-dated event-count ensemble), but we first have to turn the sample of dates created in the first section into calibrated radiocarbon dates.
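Again, the model image above may not render here. Reconstructed from the Nimble code later in this section (a sketch; Normal distributions parameterized by standard deviation, as in the code), the two-level specification is roughly:

```latex
\begin{aligned}
Y_{n,j} &\sim \mathrm{NegBin}(r_{n,j},\ p_{n,j}) \\
r_{n,j} &= e^{b_{0,j} + b_j X_n} \\
p_{n,j} &\sim \mathrm{Uniform}(\approx 0,\ 1) \\
b_j &\sim \mathrm{Normal}(B,\ \sigma_B), \qquad
b_{0,j} \sim \mathrm{Normal}(B_0,\ \sigma_{B_0}) \\
B,\ B_0 &\sim \mathrm{Normal}(0,\ \mathrm{sd}=100), \qquad
\sigma_B,\ \sigma_{B_0} \sim \mathrm{Uniform}(\approx 0,\ 10)
\end{aligned}
```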

R Requirements

Again, we need to make sure we have loaded the required R libraries.

Libraries

library(clam)
library(nimble)
library(ggplot2)
library(ggpubr)

Using functions from the clam package, the simulated calendar-dated event series created in the first section can be turned into corresponding radiocarbon dates and then calibrated. We will use lapply() to create a list of calibrated radiocarbon date densities evaluated at an annual resolution. The function calBP.14C takes a calendar date (i.e., an event time with no chronological uncertainty, measured in real calendar years) and returns a date in radiocarbon years with an uncertainty value---it's a normally distributed measurement. The calibrate function then takes a radiocarbon date (mean and error) and returns the corresponding calibrated date density sampled annually (the function comes with many optional parameters; interested readers should see the Clam manual).

Get and Calibrate Radiocarbon Dates

simc14 <- t(sapply(simdates,calBP.14C))
c14post <- lapply(1:Ndates,function(x)calibrate(simc14[x,1],simc14[x,2],graph=F)$calib)

The first of these densities (calibrated radiocarbon dates) looks like this:

head(c14post[[1]])
##      [,1]         [,2]
## [1,] 5292 1.055034e-06
## [2,] 5293 1.433886e-06
## [3,] 5294 1.812739e-06
## [4,] 5295 2.191592e-06
## [5,] 5296 8.227856e-06
## [6,] 5297 1.426412e-05

The first column in the matrix above is the year BP and the second is the density of the relevant calibrated radiocarbon date---a higher density indicates a more likely year for the given event. There are 1000 of these 2-column matrices in the c14post list, corresponding to the 1000 dates we randomly generated.

Build a RECE

Next we need to use the list of calibrated radiocarbon dates to produce the RECE. The basic idea is to sample the radiocarbon dates (i.e., select a year at random) in proportion to the relevant calibrated date densities and then construct a count sequence from the sampled dates. This sampling process is repeated some number of times to create an ensemble of probable count sequences that reflects our uncertainty about the timing of the individual events. In the following example, we'll create a RECE with 1000 probable sequences and store it in a matrix. To make things simpler, the individual dates all need to be sampled onto the same temporal sequence (1-dimensional temporal grid).
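The weighted draw at the heart of this process can be sketched with a toy density (hypothetical numbers, not real calibration output):

```r
# One probable event year is drawn from a calibrated date density by
# weighting sample() with the density column, just as the RECE loop
# below does for every event in every iteration.
dens <- cbind(YBP = 5290:5299,
              density = c(1, 1, 2, 4, 8, 16, 8, 4, 2, 1) / 39)
sample(dens[, "YBP"], size = 1, prob = dens[, "density"])
```

Years with higher density (here, 5295) are drawn more often across repeated calls, so the resulting count sequences concentrate events where the calibrated dates say they are most likely.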

To create this grid, we need to define the interval of time that the grid will cover and the grid's temporal resolution. The dates in the c14post list have an annual resolution, and we can determine the maximum temporal span of all the dates with R's range function nested inside a vectorized loop, with another range applied to the resulting list of ranges:

resolution <- 1
sample_date_range <- range(unlist(lapply(c14post,function(x)range(x[,1]))))
sample_date_range
## [1] 4859 6189

It will be easier to produce the random samples of event times if we create a matrix of calibrated radiocarbon date densities aligned onto the grid. For this I created an approximation function that could conceivably resample (interpolate) any radiocarbon date density onto any grid, but in these examples (and those in the paper) the resolution of the grid (and the RECE) are the same as the resolution of the calibrated dates (i.e., annual).

calSampleApprox <- function(x,t1,t2,r){
    n <- length(x)
    funs <- lapply(x,approxfun)
    y_list <- lapply(1:n,function(j)funs[[j]](seq(t1,t2,r)))
    y_mat <- do.call(cbind,y_list)
    y_mat[which(is.na(y_mat))] <- 0
    return(y_mat)
}

Now we can pass the c14post list to this function in order to create a matrix of calibrated date densities wherein the columns are each a single event and the rows refer to the density of a given calibrated date for a given year:

c14_matrix <- calSampleApprox(c14post,sample_date_range[1],sample_date_range[2],resolution)
head(c14_matrix[,1:10])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    0    0    0     0
## [2,]    0    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    0    0    0    0    0     0
## [4,]    0    0    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     0
## [6,]    0    0    0    0    0    0    0    0    0     0

Then, we will build a data-frame to contain the RECE where each row corresponds to a single year in the grid and each column is going to be a single probable event-count sequence.

Dates <- seq(sample_date_range[1],sample_date_range[2],resolution)
rece_sample <- data.frame(Date=Dates)

Finally, we iterate through a loop to build the RECE. Each iteration, we will sample each event in the c14_matrix exactly once. The sampling is random and weighted by the calibrated date density---so more likely event dates will be sampled more frequently. Then, for each sample of event dates, we count the number of events occurring in each of the grid bins (i.e., every year between the earliest and latest dates covered by all of the calibrated distributions).

for(a in 1:1000){
    count_sample <- apply(c14_matrix,2,function(x)sample(Dates,size=1,prob=x))
    count_df <- as.data.frame(table(count_sample))
    names(count_df) <- c("Date","Count")
    rece_sample <- merge(rece_sample,count_df,by="Date",all=T)
}

Once this loop completes, the variable rece_sample contains the RECE matrix. This matrix needs to be cleaned up a bit before we can use it, though. We need to set any NAs to zero (they reflect instances where a date was outside the sampled range of a given calibrated radiocarbon date density but inside the span of the RECE---these should be effectively 0 because the relevant densities would be extremely low for those dates). We can also relabel the columns with clearer names, and reverse the vertical order of the RECE so that time increases downward. This tidying up is coded like this:

rece_sample <- as.matrix(rece_sample[,-1])
rece_sample[which(is.na(rece_sample))] <- 0
colnames(rece_sample) <- 1:1000
rece_sample <- as.data.frame(cbind(Dates,rece_sample))
rece_sample <- rece_sample[with(rece_sample,order(-Dates)),]
head(rece_sample[,1:10])
##      Dates 1 2 3 4 5 6 7 8 9
## 1331  6189 0 0 0 0 0 0 0 0 0
## 1330  6188 0 0 0 0 0 0 0 0 0
## 1329  6187 0 0 0 0 0 0 0 0 0
## 1328  6186 0 0 0 0 0 0 0 0 0
## 1327  6185 0 0 0 0 0 0 0 0 0
## 1326  6184 0 0 0 0 0 0 0 0 0

Nimble Model

We can now create and run the nimble model. As before, a nimble code object needs to be created. This object specifies a directed acyclic graph in which each node is a variable that has been defined mathematically above (see the model specification at the beginning of this section). Since the core NB-model has been extended, there is now a top level with hyperparameters (B, B0, sigB, sigB0) and a lower level containing individual regression parameters that ultimately define the distribution for the response variable---the individual members of the RECE we just created. So, the nimble model is specified like this:

nbCode <- nimbleCode({
   B ~ dnorm(mean=0,sd=100)
   B0 ~ dnorm(mean=0,sd=100)
   sigB ~ dunif(1e-10,10)
   sigB0 ~ dunif(1e-10,10)
   for (j in 1:J) {
      b[j] ~ dnorm(mean=B,sd=sigB)
      b0[j] ~ dnorm(mean=B0,sd=sigB0)
      for (n in 1:N){
        p[n,j] ~ dunif(1e-10,1)
        r[n,j] <- exp(b0[j] + X[n] * b[j])
        Y[n,j] ~ dnegbin(size=r[n,j],prob=p[n,j])
      }
   }
})

Next, we can set the variables that will be fed into the model:

Y <- as.matrix(rece_sample[,2:51])
N <- dim(Y)[1]
J <- dim(Y)[2]
X <- 0:(N-1)

Then, we can subsample the Y series to reduce memory use and computation time:

Nsub <- seq(1,N,10)

Now the nimble model object needs to be created (this includes setting initial variable values):

nbData <- list(Y=Y[Nsub,],
                X=X[Nsub])

nbConsts <- list(N=length(Nsub),
                    J=J)
nbInits <- list(B=0,
                B0=0,
                b=rep(0,J),
                b0=rep(0,J),
                sigB=0.0001,
                sigB0=0.0001)

nbModel <- nimbleModel(code=nbCode,
                        data=nbData,
                        inits=nbInits,
                        constants=nbConsts)
## defining model...

## building model...

## setting data and initial values...

## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions... This model is not fully initialized. This is not an error. To see which variables are not initialized, use model$initializeInfo(). For more information on model initialization, see help(modelInitialization).
## model building finished.

Then, we need to compile the model into C++ for faster runtimes:

C_nbModel <- compileNimble(nbModel, showCompilerOutput = FALSE)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.

## compilation finished.

Configure the MCMC and samplers as before:

nbModel_conf <- configureMCMC(nbModel)
## ===== Monitors =====
## thin = 1: B, B0, sigB, sigB0, p
## ===== Samplers =====
## RW sampler (6802)
##   - sigB
##   - sigB0
##   - p[]  (6700 elements)
##   - b[]  (50 elements)
##   - b0[]  (50 elements)
## conjugate sampler (2)
##   - B
##   - B0
nbModel_conf$monitors <- c("B","B0","sigB","sigB0")
nbModel_conf$addMonitors2(c("b","b0"))
## thin = 1: B, B0, sigB, sigB0
## thin2 = 1: b, b0
nbModel_conf$removeSamplers(c("B","B0","b","b0","sigB","sigB0"))
nbModel_conf$addSampler(target=c("B","B0","sigB","sigB0"),type="AF_slice")
for(j in 1:J){
   nbModel_conf$addSampler(target=c(paste("b[",j,"]",sep=""),paste("b0[",j,"]",sep="")),type="AF_slice")
}

Next, we build and compile the MCMC:

nbModelMCMC <- buildMCMC(nbModel_conf)
C_nbModelMCMC <- compileNimble(nbModelMCMC,project=nbModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.

## compilation finished.

Then we set the number of iterations for the MCMC and the random seed:

niter <- 100
set.seed(1)

It should be noted that the actual number of iterations needed scales in some way with N and J. Even after subsetting as above (Nsub), getting good, stable, unbiased samples of the relevant model posteriors required hundreds of thousands of iterations---this replication uses only 100 for demonstration purposes.

And, finally, we run the MCMC and save the chains to the samples variable:

samples <- runMCMC(C_nbModelMCMC, niter=niter)
## running chain 1...

## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

After another long wait for the simulation to finish, the variable samples will be a list. The list contains two elements, each of which is a matrix of MCMC chains. The columns of each matrix contain the monitored model variables and the rows contain values from each iteration of the simulation. Given the specification set out above, the first list element (samples$samples) contains the MCMC chains for the variables B0, B, sigB0, and sigB, while the second element (samples$samples2) contains the MCMC chains for the variables in the lower-level regressions. We did not include a monitor this time for a y variable that could be used to estimate posterior credible intervals for the lower-level regressions, but one could be added if needed.

head(samples$samples[,c(1,2)])
##                  B            B0
## [1,]  2.820959e-05 -3.551391e-06
## [2,]  1.862459e-05  4.993371e-06
## [3,]  1.399788e-05  1.467399e-05
## [4,] -7.583753e-07  1.899331e-05
## [5,] -5.882500e-06  2.396612e-05
## [6,] -1.909550e-05  1.687273e-05
head(samples$samples2[,c(1,2)])
##               b[1]          b[2]
## [1,]  3.469194e-05 -2.951469e-05
## [2,] -2.295985e-05  1.964932e-06
## [3,] -2.232041e-05  2.879162e-06
## [4,] -2.050705e-05 -2.132512e-05
## [5,]  1.898067e-06 -1.374691e-05
## [6,] -4.612460e-05 -7.605407e-05
