The dataset contains four different rodent species and their respective capture counts, which will be the response variable we would like to model and forecast. Each of the four species' time series contains the number of rodent captures made each lunar month (approximately 29 days long) from 2004 to 2020. Since modelling four time series simultaneously can quickly become cumbersome, we will limit ourselves to modelling only one of the four available species.
summary(portal)
## moon PP DM DO
## Min. :329.0 Min. : 0.00 Min. : 2.00 Min. : 0.000
## 1st Qu.:378.5 1st Qu.: 2.50 1st Qu.: 9.00 1st Qu.: 0.000
## Median :428.0 Median :12.00 Median :13.00 Median : 3.000
## Mean :428.0 Mean :15.14 Mean :14.21 Mean : 4.086
## 3rd Qu.:477.5 3rd Qu.:24.00 3rd Qu.:19.00 3rd Qu.: 7.000
## Max. :527.0 Max. :65.00 Max. :35.00 Max. :21.000
## NA's :36 NA's :36 NA's :36
## OT year month mintemp
## Min. :0.000 Min. :2004 Min. : 1.000 Min. :-24.000
## 1st Qu.:1.000 1st Qu.:2008 1st Qu.: 3.500 1st Qu.: -3.884
## Median :2.000 Median :2012 Median : 7.000 Median : 2.130
## Mean :1.982 Mean :2012 Mean : 6.503 Mean : 3.504
## 3rd Qu.:3.000 3rd Qu.:2016 3rd Qu.: 9.000 3rd Qu.: 12.310
## Max. :9.000 Max. :2020 Max. :12.000 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
We will focus the analysis on the time series of captures for one specific rodent species, the desert pocket mouse Chaetodipus penicillatus. The desert pocket mouse is a small rodent adapted to the arid regions of the southwestern United States and northern Mexico, known for its burrowing behavior, seed caching, and ability to survive with minimal water intake. A notable characteristic of C. penicillatus is its ability to enter a state of torpor during colder months, leading to significantly reduced activity and lower capture rates in winter.
We now convert the dataset to a long format. This step is necessary because the mvgam package expects this format: a series column containing all available series (just one in our case), and a time variable, which here corresponds to the moon column rescaled so that it starts from 1. We also create a factor version of year, which the first model will need for its yearly random intercepts.
portal %>%
  mutate(time = moon - (min(moon)) + 1) %>% # e.g. 329 - 329 + 1
  mutate(count = PP) %>%
  mutate(series = as.factor('PP')) %>%
  mutate(year_fac = factor(year)) %>% # needed below for the yearly random intercepts
  select(series, year, year_fac, time, count, mintemp, ndvi) -> model_data
head(model_data)
## series year time count mintemp ndvi
## 1 PP 2004 1 0 -9.710 1.465889
## 2 PP 2004 2 1 -5.924 1.558507
## 3 PP 2004 3 2 -0.220 1.337817
## 4 PP 2004 4 NA 1.931 1.658913
## 5 PP 2004 5 10 6.568 1.853656
## 6 PP 2004 6 NA 11.590 1.761330
summary(model_data)
## series year time count mintemp
## PP:199 Min. :2004 Min. : 1.0 Min. : 0.00 Min. :-24.000
## 1st Qu.:2008 1st Qu.: 50.5 1st Qu.: 2.50 1st Qu.: -3.884
## Median :2012 Median :100.0 Median :12.00 Median : 2.130
## Mean :2012 Mean :100.0 Mean :15.14 Mean : 3.504
## 3rd Qu.:2016 3rd Qu.:149.5 3rd Qu.:24.00 3rd Qu.: 12.310
## Max. :2020 Max. :199.0 Max. :65.00 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
The response variable represents the number of PP captured, so it is a non-negative count. As we can see, there are 36 NAs (missing values) in the response variable. In most traditional time series models this would pose a problem, and we would need to find a way to impute those missing values. However, for the models fitted with the mvgam package, this is not an issue.
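As a quick sanity check, we can count the missing responses directly:
sum(is.na(model_data$count)) # 36 missing counts, as in the summary above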
We can observe some seasonality in the top-left panel of the plot below, which is confirmed by the ACF plot in the bottom-left. Additionally, missing values are evident in the response variable “count”.
plot_mvgam_series(data = model_data, series = 1, y = 'count')
We can also look at the partial autocorrelation function of the counts.
pacf(model_data$count, na.action = na.pass, main = "PACF of Count")
As suggested by the histogram above, the data are not symmetric. With a standard Shapiro-Wilk test we can assess whether the data are normally distributed. Given the low p-value, we reject the null hypothesis of normality.
shapiro.test(model_data$count)
##
## Shapiro-Wilk normality test
##
## data: model_data$count
## W = 0.88543, p-value = 6.812e-10
Here we can visualize the correlations among the response and the covariates.
numeric_vars <- model_data[, c("count", "mintemp", "ndvi")]
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle")
Fitting a loess curve is useful for capturing potential non-linearities. From the two plots below we can see some nonlinear patterns, although not very strong.
ggplot(model_data, aes(x = mintemp, y = count)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "loess", color = "darkred", se = TRUE) +
  labs(x = "Minimum Temperature", y = "Count") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(model_data, aes(x = ndvi, y = count)) +
  geom_point(color = "forestgreen", alpha = 0.6) +
  geom_smooth(method = "loess", color = "darkred", se = TRUE) +
  labs(x = "NDVI", y = "Count") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
Given the non-Gaussian data, missing values, possible measurement errors, and non-linearities, we should move away from linear models.
The first fitted model is a hierarchical GLM. This model is just a starting point to explore whether we can explain the number of captures by differences between years. The number of rodent captures is assumed to follow a Poisson distribution (though we will revisit this assumption in more detail later on), a natural choice for non-negative count data since the Poisson models the probability of a certain number of events happening in a fixed time period. The model uses a log link function, which ensures that predictions are non-negative. We then place random (hierarchical) intercepts on the year factor, reflecting the prior idea that although each year's effect has unique characteristics (because different environmental conditions might affect rodent captures in each year), these yearly effects are not completely independent but are all sampled from the same underlying population. There is a kind of underlying connection between years, so that the information learned about one year can inform the expected number of captures in another.
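To build intuition for this hierarchical structure, here is a minimal prior-predictive sketch: hypothetical draws under the package's default priors (reported later via get_mvgam_priors), not estimates from any fitted model.
set.seed(42)
mu_year    <- rnorm(1, 0, 1)                 # population mean on the log scale
sigma_year <- abs(rt(1, df = 3)) * 2.5       # half-Student-t population sd
beta_year  <- rnorm(17, mu_year, sigma_year) # one random intercept per year
lambda     <- exp(beta_year)                 # log link keeps the rates positive
counts     <- rpois(17, lambda)              # one simulated capture count per year
Before fitting, a boxplot of counts by year gives a first impression of these between-year differences.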
ggplot(model_data, aes(x = factor(year), y = count)) +
geom_boxplot(fill = "grey", color = "darkred",
outlier.shape = NA,
width = 0.7) +
stat_boxplot(geom = "errorbar", width = 0.3, color = "darkred") +
labs(title = "Boxplot of Yearly Rodent Counts",
x = "Year",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In order to produce out-of-sample predictions, we need to divide the dataset into training and testing sets. The first 160 observations will be used for training the models, and the remaining 39 for evaluating predictions.
model_data %>%
dplyr::filter(time <= 160) -> data_train
model_data %>%
dplyr::filter(time > 160) -> data_test
Before actually running the sampler, we can take a look at the priors the function will use in this situation and edit them to represent our own prior beliefs, if any.
priors <- get_mvgam_priors(count ~ s(year_fac, bs = 're')-1,
family = poisson(),
data = model_data)
priors
## param_name param_length param_info
## 1 vector[1] mu_raw; 1 s(year_fac) pop mean
## 2 vector<lower=0>[1] sigma_raw; 1 s(year_fac) pop sd
## prior example_change
## 1 mu_raw ~ std_normal(); mu_raw ~ normal(-0.69, 0.95);
## 2 sigma_raw ~ student_t(3, 0, 2.5); sigma_raw ~ exponential(0.43);
## new_lowerbound new_upperbound
## 1 NA NA
## 2 NA NA
model_re <- mvgam(count ~ s(year_fac, bs = 're')-1,
family = poisson(),
data = data_train,
newdata = data_test)
## Compiling Stan program using rstan
##
## Start sampling
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
\[ Y_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_{\text{year}[t]} \]
\[ \beta_{\text{year}} \sim \text{Normal}(\mu_{\text{year}}, \sigma_{\text{year}}) \]
\[ \mu_{\text{year}} \sim \text{Normal}(0, 1) \]
\[ \sigma_{\text{year}} \sim \text{StudentT}(3, 0, 2.5) \]
summary(model_re)
## GAM formula:
## count ~ s(year_fac, bs = "re") - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 1000; warmup = 500; thin = 1
## Total post-warmup draws = 2000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## s(year_fac).1 1.8 2.1 2.3 1 2728
## s(year_fac).2 2.5 2.7 2.8 1 2494
## s(year_fac).3 3.0 3.1 3.2 1 2586
## s(year_fac).4 3.1 3.3 3.4 1 2492
## s(year_fac).5 1.9 2.1 2.3 1 2430
## s(year_fac).6 1.5 1.8 2.0 1 2576
## s(year_fac).7 1.8 2.1 2.3 1 2983
## s(year_fac).8 2.8 3.0 3.1 1 2636
## s(year_fac).9 3.1 3.2 3.4 1 2510
## s(year_fac).10 2.6 2.8 2.9 1 2483
## s(year_fac).11 3.0 3.1 3.2 1 2702
## s(year_fac).12 3.1 3.2 3.3 1 2334
## s(year_fac).13 2.1 2.3 2.4 1 2622
## s(year_fac).14 1.3 2.6 3.9 1 1408
## s(year_fac).15 1.3 2.6 3.8 1 1390
## s(year_fac).16 1.2 2.6 3.8 1 1234
## s(year_fac).17 1.3 2.6 3.9 1 1246
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n_eff
## mean(s(year_fac)) 2.20 2.60 2.90 1.02 327
## sd(s(year_fac)) 0.38 0.59 0.94 1.01 379
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(year_fac) 11.4 17 1382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 14:24:19 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
Using the ‘code’ function, we can display the generated Stan code.
code(model_re)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // random effect variances
## vector<lower=0>[1] sigma_raw;
## // random effect means
## vector[1] mu_raw;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1:17] = mu_raw[1] + b_raw[1:17] * sigma_raw[1];
## }
## model {
## // prior for random effect population variances
## sigma_raw ~ student_t(3, 0, 2.5);
## // prior for random effect population means
## mu_raw ~ std_normal();
## // prior (non-centred) for s(year_fac)...
## b_raw[1:17] ~ std_normal();
## {
## // likelihood functions
## flat_ys ~ poisson_log_glm(flat_xs,
## 0.0,b);
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## array[n, n_series] int ypred;
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
We can extract the posterior samples of the mean and standard deviation of population-level effects.
pop_params <- as.data.frame(model_re,
variable = c('mean(year_fac)',
'sd(year_fac)'))
head(pop_params)
## mean(year_fac) sd(year_fac)
## 1 2.291623 0.7410213
## 2 2.193049 0.8001934
## 3 2.353734 0.6042163
## 4 2.579334 0.6969573
## 5 2.589601 0.6617802
## 6 2.571173 0.6553844
As we said before, in this model each year has its own random effect, meaning that the effect of each year on the number of rodent captures is modeled as a deviation from an overall population-level mean. The posterior distribution of mean(year_fac) consists of samples from the model's estimate of \(\mu_{\text{year}}\), the population mean of the year effects. The mean of these posterior samples gives us the model's best estimate of the overall baseline effect of years on rodent captures. Similarly, the mean of the sd(year_fac) posterior samples represents the average spread of the year-specific random effects around the population-level mean. The two results shown below are on the link scale; to make them more interpretable we can transform them to the response scale.
# link scale
mean(pop_params[,1])
## [1] 2.587052
mean(pop_params[,2])
## [1] 0.6111257
# response scale
mean(exp(pop_params[,1]))
## [1] 13.4934
mean(exp(pop_params[,2]))
## [1] 1.864422
In order to account for the uncertainty in the estimation we can build and visualize 95% credible intervals of posterior samples.
mean_year_ci <- quantile(pop_params$`mean(year_fac)`, probs = c(0.025, 0.975))
sd_year_ci <- quantile(pop_params$`sd(year_fac)`, probs = c(0.025, 0.975))
# plot as histograms
hist(pop_params$`mean(year_fac)`, main = expression(mu[year]), col = "darkred", breaks = 20)
abline(v = mean_year_ci, col = "blue", lwd = 2, lty = 2)
hist(pop_params$`sd(year_fac)`, main = expression(sigma[year]), col = "darkred", breaks = 20)
abline(v = sd_year_ci, col = "blue", lwd = 2, lty = 2)
Among different ways of assessing goodness of fit, a natural one is to compare observed and expected frequencies from the model. A very useful tool in the case of non-negative count data is the rootogram, which plots “histogram-like bars for the observed frequencies and a curve for the fitted frequencies, all on a square-root scale […] to approximately adjust for scale differences across the values” (Kleiber & Zeileis, 2016). If the bars deviate from the curve (either above or below), it indicates areas where the model's expectations do not align with the observed data.
pp_check(object = model_re, type = 'rootogram')
## Using all posterior draws for ppc type 'rootogram' by default.
## Warning in pp_check.mvgam(object = model_re, type = "rootogram"): NA responses
## are not shown in 'pp_check'.
To see whether these yearly intercepts can result in good time-varying predictions, we can plot forecasts, which show that the model is not capable of producing accurate predictions.
plot(model_re, type = 'forecast')
## Out of sample DRPS:
## 183.26291725
It is evident that a substantial amount of variability remains unexplained. Let us now examine the residuals: as we can see, there is strong autocorrelation in the residuals, which means the model is not capturing the time-dependent structure in the data. The model may need to incorporate more complex components.
plot(model_re, type = "residuals")
## Theoretic detour

We will now move from the hierarchical GLM formulation to the use of GAMs. While with GLMs we assume linear relationships in the parameters, GAMs relax this assumption and deal with non-linearities in the data. A Generalized Additive Model can be represented as a sum of smooth functions, named splines, representing functional relationships between covariates and the response: \[ E(Y|X) = g^{-1} \Big(\beta_0 + \sum_{i=1}^{I} s_i(x_i)\Big) \] Splines are composed of simpler functions named basis functions. In particular, they are sums of weighted basis functions evaluated at the values of x: \[ s(x) = \sum_{k = 1}^{K}\beta_k b_k(x) \] In the frequentist framework, the \(\beta_k\) are estimated by starting off with the unweighted basis functions and choosing values for the weights so as to maximize the penalized log-likelihood: we force the spline to go as close to the data as we can, with the caveat that we don't want to overfit. In order to avoid overfitting we measure the wiggliness of the fitted function: the more the fitted function deviates from a flat line (unweighted basis functions), the more complicated the model is. One straightforward measure is \[ \int_\mathbb{R}[s''(x)]^2\,dx = \beta^T S\beta = W \] namely the integrated squared second derivative of the smooth function, which can be written as a function of the estimated basis coefficients \(\beta\) and the penalty matrix S. We penalize W to avoid overfitting through a smoothing parameter \(\lambda\): \[ \mathcal{L}_{pen} = \log(\text{Likelihood}) - \lambda W \] If we set \(\lambda\) to 0 we fit an unpenalized model; if we allow it to be very large, we pay an increasing cost for wiggliness and the model tends towards a linear model.
In the Bayesian framework the focus shifts from finding a single best set of coefficients to treating the coefficients as random variables drawn from a multivariate Gaussian distribution, with the penalty acting on the prior precision: a large \(\lambda\) results in a tight prior, while a smaller one allows for more wiggliness as the prior imposes fewer constraints on the coefficients.
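To make the wiggliness measure concrete, here is a small sketch that uses mgcv's smoothCon() to expose the penalty matrix S of a B-spline basis and evaluates \(W = \beta^TS\beta\); the coefficients are hypothetical draws for illustration, not fitted values.
library(mgcv)
sm   <- smoothCon(s(time, bs = 'bs', k = 15), data = model_data)[[1]]
S    <- sm$S[[1]]                    # penalty matrix of the smooth
beta <- rnorm(ncol(S))               # hypothetical basis coefficients
W    <- drop(t(beta) %*% S %*% beta) # the wiggliness measure from above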
In order to improve the model's performance we can replace the year variable with the time variable, hoping that temporal dependencies will be better captured. Below we can see what 15 basis functions look like using B-splines. The larger k is, the more complex the model will be.
basis_funs <- basis(s(time, bs = 'bs', k = 15),
data = model_data)
draw(basis_funs) +
labs(y = 'Basis function value', x = 'Time')
These are the priors that mvgam would use in this setting, but we can decide to place standard Normal priors on the fixed-effect coefficients.
get_mvgam_priors(count ~ s(time, bs = 'bs', k = 15)+
mintemp + ndvi,
family = poisson(),
data = data_train)
## param_name param_length param_info
## 1 (Intercept) 1 (Intercept)
## 2 mintemp 1 mintemp fixed effect
## 3 ndvi 1 ndvi fixed effect
## 4 vector<lower=0>[n_sp] lambda; 2 s(time) smooth parameters
## prior example_change
## 1 (Intercept) ~ student_t(3, 2.6, 2.5); (Intercept) ~ normal(0, 1);
## 2 mintemp ~ student_t(3, 0, 2); mintemp ~ normal(0, 1);
## 3 ndvi ~ student_t(3, 0, 2); ndvi ~ normal(0, 1);
## 4 lambda ~ normal(5, 30); lambda ~ exponential(1);
## new_lowerbound new_upperbound
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
time_smooth <- mvgam(count ~ s(time, bs = 'bs', k = 15)+
mintemp + ndvi,
family = poisson(),
data = data_train,
newdata = data_test,
priors = prior(normal(0, 1), class = b))
## Compiling Stan program using rstan
##
## Start sampling
## Warning in .local(object, ...): some chains had errors; consider specifying
## chains = 1 to debug
## here are whatever error messages were returned
## [[1]]
## Stan model 'anon_model' does not contain samples.
##
## [[2]]
## Stan model 'anon_model' does not contain samples.
##
## [[3]]
## Stan model 'anon_model' does not contain samples.
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + s(\text{time})_t + \beta_{\text{ndvi}} \cdot \text{ndvi}_t + \beta_{\text{mintemp}} \cdot \text{mintemp}_t \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
\[ s(\text{time})_t = \sum_{k=1}^{K} b_k(\text{time}_t) \cdot \beta_{\text{smooth}} \]
\[ \beta_{\text{smooth}} \sim \text{MVNormal}(0, (S \cdot \lambda)^{-1}) \]
\[ \beta_{\text{ndvi}} \sim \text{Normal}(0, 1) \]
\[ \beta_{\text{mintemp}} \sim \text{Normal}(0, 1) \]
As we can see from the plots below, the residuals still show autocorrelation, meaning that the GAM is also unable to capture the temporal structure of the data…
plot(time_smooth, type = 'residuals')
…which is confirmed by the bad forecasts.
plot(time_smooth, type = 'forecast')
## Out of sample DRPS:
## 262.16482
One key aspect that drives the forecasts is the choice of the penalty for the splines. A second-derivative penalty penalizes the curvature of the spline, producing linear extrapolations (the slope remains unchanged beyond the last boundary of the training data). A first-derivative penalty produces flat extrapolations, so the mean remains unchanged beyond the last boundary of the training data.
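In mgcv-style smooths the penalty order can be chosen through the m argument of s(); assuming the B-spline basis used here (first entry of m: basis order, later entries: derivative penalty orders), a sketch of the two options would be
s(time, bs = 'bs', k = 15, m = c(3, 2)) # cubic basis, 2nd-derivative penalty -> linear extrapolation
s(time, bs = 'bs', k = 15, m = c(3, 1)) # cubic basis, 1st-derivative penalty -> flat extrapolation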
Other than that, spline extrapolation is also highly sensitive to wiggliness. In order to model and capture autocorrelation in a series, one option is to let the spline wiggle more. Increased complexity (wiggliness) leads to more accurate inferences on historical patterns, giving more precise intervals and so on, but from a forecasting point of view splines still tend to be highly unpredictable, given that each basis function has only local knowledge of the interval it covers.
For forecasting we need a global understanding of the phenomenon. So we replace the spline of time with an actual time series process that can learn from the behaviour of the past and give better predictions into the future. A variety of processes can be used inside ‘mvgam’, such as autoregressive processes, Gaussian processes, random walks, and VAR processes for multivariate time series.
DGAMs are particularly suited for time series data as they account for temporal dependencies by modelling the series as a dynamic autocorrelated process. In the univariate case, a DGAM for discrete time series is specified as follows \[ E(Y_t|X_t) = g^{-1} \Big(\beta_0 + \sum_{i=1}^{I} s_{i,t}(x_{i,t}) + z_t\Big) \] \[ z_t = \phi z_{t-1} + e_t \] where \(z_t\) is the latent dynamic process estimate at time t. As highlighted by Clark & Wells (2022), separating the temporal process from the observation process offers several advantages. It reduces the impact of measurement errors or outliers, which can heavily distort the estimation of AR parameters. Additionally, using latent processes makes it easier to manage missing or irregularly sampled data, providing a more robust and flexible modeling approach: since the \(z_t\) are unobserved latent random variables, they evolve in time even when an observation \(Y_t\) is missing.
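A minimal simulation illustrates why the latent process handles gaps gracefully: the state \(z_t\) keeps evolving whether or not \(Y_t\) is observed (toy values, unrelated to the fitted model).
set.seed(1)
n <- 50; phi <- 0.8; sigma_err <- 0.5
z <- numeric(n)
for (t in 2:n) z[t] <- phi * z[t - 1] + rnorm(1, 0, sigma_err)
y <- rpois(n, exp(1 + z)) # Poisson observations of the latent state
y[sample(n, 10)] <- NA    # missing observations leave z untouched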
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + s(\text{mintemp})_t + s(\text{ndvi})_t + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ s(\text{ndvi})_t = \sum_{k=1}^{K} b_k(\text{ndvi}_t) \cdot \beta_{\text{smooth}} \]
\[ \beta_{\text{smooth}} \sim \text{MVNormal}(0, (S \cdot \lambda)^{-1}) \]
\[ \phi \sim \text{Normal}(0, 1) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
mod_notime2 <- mvgam(count ~ s(mintemp, bs = 'bs', k = 15) + s(ndvi, bs = 'bs', k = 8),
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
code(mod_notime2)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[14,28] S1; // mgcv smooth penalty matrix S1
## matrix[7,14] S2; // mgcv smooth penalty matrix S2
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // latent trend AR1 terms
## vector<lower=-1,upper=1>[n_series] ar1;
## // latent trend variance parameters
## vector<lower=0>[n_series] sigma;
## // raw latent trends
## matrix[n, n_series] trend_raw;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## // latent trends
## matrix[n, n_series] trend;
## trend = trend_raw .* rep_matrix(sigma', rows(trend_raw));
## for (s in 1 : n_series) {
## trend[2 : n, s] += ar1[s] * trend[1 : (n - 1), s];
## }
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // prior for (Intercept)...
## b_raw[1] ~ student_t(3, 2.6, 2.5);
## // prior for s(mintemp)...
## b_raw[2:15] ~ multi_normal_prec(zero[2:15],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
## // prior for s(ndvi)...
## b_raw[16:22] ~ multi_normal_prec(zero[16:22],S2[1:7,1:7] * lambda[3] + S2[1:7,8:14] * lambda[4]);
## // priors for AR parameters
## ar1 ~ std_normal();
## // priors for smoothing parameters
## lambda ~ normal(5, 30);
## // priors for latent trend variance parameters
## sigma ~ student_t(3, 0, 2.5);
## to_vector(trend_raw) ~ std_normal();
## {
## // likelihood functions
## vector[n_nonmissing] flat_trends;
## flat_trends = (to_vector(trend))[obs_ind];
## flat_ys ~ poisson_log_glm(append_col(flat_xs, flat_trends),
## 0.0,append_row(b, 1.0));
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## vector[n_series] tau;
## array[n, n_series] int ypred;
## rho = log(lambda);
## for (s in 1:n_series) {
## tau[s] = pow(sigma[s], -2.0);
## }
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]] + trend[1:n, s];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
The number \(k\) in the ‘mvgam’ function should be chosen by the user based on the assumed - possibly nonlinear - relationships in the data. However, in order to check whether the assumed complexity is adequate or excessive, we can look at the Effective Degrees of Freedom (edf) and the Reference Degrees of Freedom (Ref.df). The first tells us how complex the smooth term is in practice (the larger the edf, the more complex the smooth), while the second represents the maximum possible degrees of freedom allowed for the smooth term, based on the number of basis functions the user chose when setting up the model. As we can see here, the smooth term of ‘mintemp’ has an edf of 7.46, which is more than half of the corresponding Ref.df, while the smooth of ndvi has a much lower edf of 2.60. Bear in mind that an edf close to 1 implies an almost linear relationship.
summary(mod_notime2)
## GAM formula:
## count ~ s(mintemp, bs = "bs", k = 15) + s(ndvi, bs = "bs", k = 8)
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## AR(p = 1)
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 2000; warmup = 1000; thin = 1
## Total post-warmup draws = 4000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## (Intercept) 1.90 2.200 2.40 1.00 2390
## s(mintemp).1 -4.70 0.600 1.70 1.01 412
## s(mintemp).2 -3.90 -0.230 0.83 1.01 439
## s(mintemp).3 -3.20 -0.970 0.17 1.01 616
## s(mintemp).4 -2.50 -1.300 -0.33 1.00 1760
## s(mintemp).5 -2.40 -1.500 -0.19 1.01 739
## s(mintemp).6 -2.90 -1.800 1.20 1.01 439
## s(mintemp).7 -2.60 -1.400 1.60 1.01 429
## s(mintemp).8 -1.80 -0.700 2.50 1.01 431
## s(mintemp).9 -1.10 -0.180 2.40 1.01 429
## s(mintemp).10 -0.86 -0.011 2.00 1.01 452
## s(mintemp).11 -0.45 0.340 2.20 1.01 468
## s(mintemp).12 -0.51 0.400 3.10 1.01 432
## s(mintemp).13 -0.40 0.490 2.70 1.01 459
## s(mintemp).14 -1.60 0.550 2.10 1.00 2219
## s(ndvi).1 -0.75 -0.140 0.50 1.00 4442
## s(ndvi).2 -1.20 0.110 1.20 1.00 2679
## s(ndvi).3 -0.86 0.360 1.50 1.00 2652
## s(ndvi).4 -0.33 0.280 0.89 1.00 2887
## s(ndvi).5 -0.51 0.066 0.66 1.00 3702
## s(ndvi).6 -0.95 -0.170 0.55 1.00 3372
## s(ndvi).7 -1.50 -0.280 0.94 1.00 2598
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(mintemp) 7.46 14 99.28 0.44
## s(ndvi) 2.60 7 2.62 0.99
##
## Latent trend parameter AR estimates:
## 2.5% 50% 97.5% Rhat n_eff
## ar1[1] 0.65 0.88 0.99 1 2134
## sigma[1] 0.57 0.69 0.84 1 1745
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 4000 iterations ended with a divergence (0%)
## 0 of 4000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 14:28:34 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
In order to check the model's adequacy to the data, a posterior predictive check can be run. Notice the slight but systematic deviations of the observed data with respect to the generated data.
pp_check(mod_notime2, ndraws = 30)
## Warning in pp_check.mvgam(mod_notime2, ndraws = 30): NA responses are not shown
## in 'pp_check'.
The code below performs a posterior predictive check by comparing the empirical CDF of the observed data against the empirical CDFs of simulated data generated from the posterior distribution of the model. If the model fits well, the ECDF of the observed data should align closely with the ECDFs of the simulated data.
pp_check(mod_notime2, type = 'ecdf_overlay')
## Using 10 posterior draws for ppc type 'ecdf_overlay' by default.
## Warning in pp_check.mvgam(mod_notime2, type = "ecdf_overlay"): NA responses are
## not shown in 'pp_check'.
The residuals of this model don't show any evident pattern in the autocorrelation plot.
plot(mod_notime2, type = "residuals")
The forecasts on the held-out sample also look good.
plot(mod_notime2, type = "forecast")
## Out of sample DRPS:
## 125.8303123125
Now two other models will be fit:

- a dynamic GLM with just an intercept in the observation model;
- a dynamic GLM with additive linear effects of mintemp and ndvi in the observation model.
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
mod_notime3 <- mvgam(count ~ 1,
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + \beta_{\text{ndvi}} \cdot \text{ndvi}_t + \beta_{\text{mintemp}} \cdot \text{mintemp}_t + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
\[ \beta_{\text{ndvi}} \sim \text{StudentT}(3, 0, 2) \]
\[ \beta_{\text{mintemp}} \sim \text{StudentT}(3, 0, 2) \]
mod_notime4 <- mvgam(count ~ mintemp + ndvi,
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
## Warning in .local(object, ...): some chains had errors; consider specifying
## chains = 1 to debug
## here are whatever error messages were returned
## [[1]]
## Stan model 'anon_model' does not contain samples.
##
## [[2]]
## Stan model 'anon_model' does not contain samples.
The step of model comparison is fundamental for assessing which statistical model is more appropriate for the available data. One popular information criterion is the Deviance Information Criterion (DIC), which however has been criticized for not being fully Bayesian, because it uses posterior means (point estimates) instead of the entire posterior distributions available after MCMC simulation. Another widely used information criterion, which instead uses every single MCMC sample and observation, is the Watanabe-Akaike Information Criterion (WAIC). It uses the log pointwise predictive density as a raw measure of goodness of fit, which can be computed in practice by averaging the likelihood of each observation over all MCMC samples and then summing over all n observations \[ \sum_{i = 1}^{n}\log\Big( \frac{1}{S}\sum_{s=1}^{S}p(y_i|\theta^{(s)}) \Big) \] together with a penalty on the effective number of parameters, which sums, for each data point, the variance of the log-likelihood across MCMC samples.
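As a sketch of that computation, given a hypothetical S x n matrix log_lik of pointwise log-likelihood draws (filled with placeholder values here, not extracted from our models), WAIC can be assembled by hand:
log_lik <- matrix(rnorm(1000 * 50, -2, 0.5), nrow = 1000) # placeholder draws
lppd    <- sum(log(colMeans(exp(log_lik))))               # log pointwise predictive density
p_waic  <- sum(apply(log_lik, 2, var))                    # effective number of parameters
waic    <- -2 * (lppd - p_waic)
In practice we compare the three dynamic models below with loo_compare, which ranks them by expected log predictive density estimated via leave-one-out cross-validation, a close relative of WAIC.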
loo_compare(mod_notime2, mod_notime3, mod_notime4)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## elpd_diff se_diff
## mod_notime2 0.0 0.0
## mod_notime4 -0.5 3.8
## mod_notime3 -8.8 3.7
The model with smooths of the covariates (mod_notime2) and the model with linear effects of the covariates (mod_notime4) are comparable in terms of performance on the available data.
We would also like to compare models with respect to their predictive capabilities. One proper scoring rule for this evaluation is the Continuous Ranked Probability Score (CRPS), which behaves like a weighted absolute error computed over the full forecast distribution.
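For intuition, the standard sample estimator of the CRPS for a single observation y given forecast draws x fits in a few lines (the scoring performed by mvgam's score() is conceptually the same; this toy helper is only an illustration):
crps_draws <- function(y, x) {
  # CRPS = E|X - y| - 0.5 * E|X - X'|, estimated from draws
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}
crps_draws(y = 12, x = rpois(2000, lambda = 10)) # toy example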
fc2 <- forecast(mod_notime2)
fc3 <- forecast(mod_notime3)
fc4 <- forecast(mod_notime4)
s2 <- score(fc2, score = "crps")
s3 <- score(fc3, score = "crps")
s4 <- score(fc4, score = "crps")
sum(s2$PP$score, na.rm = TRUE) - sum(s3$PP$score, na.rm = TRUE) # model 2 is preferred
## [1] -40.10458
sum(s2$PP$score, na.rm = TRUE) - sum(s4$PP$score, na.rm = TRUE) # model 2 is preferred
## [1] -18.53619
sum(s3$PP$score, na.rm = TRUE) - sum(s4$PP$score, na.rm = TRUE) # model 4 is preferred
## [1] 21.56839
Model two is preferred in terms of predictive capabilities.
Up to now we have assumed a Poisson distribution for the data, which carries the strong assumption that the mean equals the variance. However, since overdispersion (i.e. the empirical evidence that the variance is greater than the mean) is very common when analysing count data in ecological time series, we should also consider using a Negative Binomial distribution. In the context of ecology, a more relevant parametrisation and interpretation of the negative binomial involves expressing the pmf in terms of a location parameter \(\mu\) (i.e. the mean) and a parameter k controlling overdispersion. In particular, if p is the probability of success and r the number of successes before which we want to count the number of failures, then \[ p = \frac{k}{k+\mu} \] where \(k = r\). The probability mass function of Y then becomes \[ p_Y(y)=\binom{k+y-1}{y} \Big(\frac{\mu}{\mu+k} \Big)^y \Big(\frac{k}{k+\mu}\Big)^k \] where the variance is a quadratic function of the mean, \(Var(Y) = \mu + \frac{\mu^2}{k}\), and the Poisson becomes a special case of the Negative Binomial as \(k\) grows to infinity. This is the parametrisation used by Stan inside the ‘mvgam’ function.
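This parametrisation maps directly onto R's built-in rnbinom(mu = , size = ), where size plays the role of k; a quick simulation confirms the quadratic mean-variance relationship:
mu <- 10; k <- 2
y  <- rnbinom(1e5, mu = mu, size = k)
c(mean = mean(y), var = var(y), mu_plus_mu2_over_k = mu + mu^2 / k) # var(y) ~ mu + mu^2/k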
nb_notime2 <- mvgam(count ~ s(mintemp, bs = 'bs', k = 15) + s(ndvi, bs = 'bs', k = 8),
family = nb(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
code(nb_notime2)
## // Stan model code generated by package mvgam
## functions {
## vector rep_each(vector x, int K) {
## int N = rows(x);
## vector[N * K] y;
## int pos = 1;
## for (n in 1:N) {
## for (k in 1:K) {
## y[pos] = x[n];
## pos += 1;
## }
## }
## return y;
## }
## }
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[14,28] S1; // mgcv smooth penalty matrix S1
## matrix[7,14] S2; // mgcv smooth penalty matrix S2
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // negative binomial overdispersion
## vector<lower=0>[n_series] phi_inv;
## // latent trend AR1 terms
## vector<lower=-1,upper=1>[n_series] ar1;
## // latent trend variance parameters
## vector<lower=0>[n_series] sigma;
## // raw latent trends
## matrix[n, n_series] trend_raw;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## // latent trends
## matrix[n, n_series] trend;
## trend = trend_raw .* rep_matrix(sigma', rows(trend_raw));
## for (s in 1 : n_series) {
## trend[2 : n, s] += ar1[s] * trend[1 : (n - 1), s];
## }
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // prior for (Intercept)...
## b_raw[1] ~ student_t(3, 2.6, 2.5);
## // prior for s(mintemp)...
## b_raw[2:15] ~ multi_normal_prec(zero[2:15],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
## // prior for s(ndvi)...
## b_raw[16:22] ~ multi_normal_prec(zero[16:22],S2[1:7,1:7] * lambda[3] + S2[1:7,8:14] * lambda[4]);
## // priors for AR parameters
## ar1 ~ std_normal();
## // priors for smoothing parameters
## lambda ~ normal(5, 30);
## // priors for overdispersion parameters
## phi_inv ~ student_t(3, 0, 0.1);
## // priors for latent trend variance parameters
## sigma ~ student_t(3, 0, 2.5);
## to_vector(trend_raw) ~ std_normal();
## {
## // likelihood functions
## vector[n_nonmissing] flat_trends;
## array[n_nonmissing] real flat_phis;
## flat_trends = (to_vector(trend))[obs_ind];
## flat_phis = to_array_1d(rep_each(phi_inv, n)[obs_ind]);
## flat_ys ~ neg_binomial_2(
## exp(append_col(flat_xs, flat_trends) * append_row(b, 1.0)),
## inv(flat_phis));
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## vector[n_series] tau;
## array[n, n_series] int ypred;
## matrix[n, n_series] phi_vec;
## vector[n_series] phi;
## phi = inv(phi_inv);
## for (s in 1:n_series) {
## phi_vec[1:n,s] = rep_vector(phi[s], n);
## }
## rho = log(lambda);
## for (s in 1:n_series) {
## tau[s] = pow(sigma[s], -2.0);
## }
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]] + trend[1:n, s];
## ypred[1:n, s] = neg_binomial_2_rng(exp(mus[1:n, s]), phi_vec[1:n, s]);
## }
## }
pp_check(nb_notime2, ndraws = 30)
## Warning in pp_check.mvgam(nb_notime2, ndraws = 30): NA responses are not shown
## in 'pp_check'.
pp_check(nb_notime2, type = 'ecdf_overlay')
## Using 10 posterior draws for ppc type 'ecdf_overlay' by default.
## Warning in pp_check.mvgam(nb_notime2, type = "ecdf_overlay"): NA responses are
## not shown in 'pp_check'.
On one side we have the model with Poisson observations, smooths of the covariates, and an AR1 latent process; on the other side we have the same model but with Negative Binomial observations. In terms of CRPS, they are practically equal.
fc5 <- forecast(nb_notime2)
s5 <- score(fc5, score = 'crps')
sum(s2$PP$score, na.rm = TRUE) - sum(s5$PP$score, na.rm = TRUE)
## [1] -3.153801