The dataset contains four different rodent species and their respective capture counts, which will be the response variable we would like to model and forecast. Each of the four species' time series contains the number of rodent captures made each lunar month (approximately 29 days long) from 2004 to 2020. Since modelling four time series simultaneously can quickly become cumbersome, we will limit ourselves to modelling only one of the four available species.
summary(portal)
## moon PP DM DO
## Min. :329.0 Min. : 0.00 Min. : 2.00 Min. : 0.000
## 1st Qu.:378.5 1st Qu.: 2.50 1st Qu.: 9.00 1st Qu.: 0.000
## Median :428.0 Median :12.00 Median :13.00 Median : 3.000
## Mean :428.0 Mean :15.14 Mean :14.21 Mean : 4.086
## 3rd Qu.:477.5 3rd Qu.:24.00 3rd Qu.:19.00 3rd Qu.: 7.000
## Max. :527.0 Max. :65.00 Max. :35.00 Max. :21.000
## NA's :36 NA's :36 NA's :36
## OT year month mintemp
## Min. :0.000 Min. :2004 Min. : 1.000 Min. :-24.000
## 1st Qu.:1.000 1st Qu.:2008 1st Qu.: 3.500 1st Qu.: -3.884
## Median :2.000 Median :2012 Median : 7.000 Median : 2.130
## Mean :1.982 Mean :2012 Mean : 6.503 Mean : 3.504
## 3rd Qu.:3.000 3rd Qu.:2016 3rd Qu.: 9.000 3rd Qu.: 12.310
## Max. :9.000 Max. :2020 Max. :12.000 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
We will focus the analysis on the time series of captures for one specific rodent species, the desert pocket mouse Chaetodipus penicillatus. The desert pocket mouse is a small rodent adapted to the arid regions of the southwestern United States and northern Mexico, known for its burrowing behavior, seed caching, and ability to survive with minimal water intake. A notable characteristic of C. penicillatus is its ability to enter a state of torpor during colder months, leading to significantly reduced activity and lower capture rates in winter.
We now convert the dataset to a long format. This step is necessary because the mvgam package expects this format: a series column containing all available series (just one in our case), and a time variable, which here corresponds to the moon column rescaled so that it starts from 1. We also create a factor version of year, which the first model will need for its yearly random intercepts.
portal %>%
  mutate(time = moon - (min(moon)) + 1) %>% # e.g. 329 - 329 + 1
  mutate(count = PP) %>%
  mutate(series = as.factor('PP')) %>%
  mutate(year_fac = factor(year)) %>% # needed below for the yearly random intercepts
  select(series, year, year_fac, time, count, mintemp, ndvi) -> model_data
head(model_data)
## series year time count mintemp ndvi
## 1 PP 2004 1 0 -9.710 1.465889
## 2 PP 2004 2 1 -5.924 1.558507
## 3 PP 2004 3 2 -0.220 1.337817
## 4 PP 2004 4 NA 1.931 1.658913
## 5 PP 2004 5 10 6.568 1.853656
## 6 PP 2004 6 NA 11.590 1.761330
summary(model_data)
## series year time count mintemp
## PP:199 Min. :2004 Min. : 1.0 Min. : 0.00 Min. :-24.000
## 1st Qu.:2008 1st Qu.: 50.5 1st Qu.: 2.50 1st Qu.: -3.884
## Median :2012 Median :100.0 Median :12.00 Median : 2.130
## Mean :2012 Mean :100.0 Mean :15.14 Mean : 3.504
## 3rd Qu.:2016 3rd Qu.:149.5 3rd Qu.:24.00 3rd Qu.: 12.310
## Max. :2020 Max. :199.0 Max. :65.00 Max. : 18.140
## NA's :36
## ndvi
## Min. :0.2817
## 1st Qu.:1.0741
## Median :1.3501
## Mean :1.4709
## 3rd Qu.:1.8178
## Max. :3.9126
##
The response variable represents the number of PP captured, so it is a non-negative count. As we can see, there are 36 NAs (missing values) in the response variable. In most traditional time series models this would pose a problem, and we would need to find a way to impute those missing values. However, for the models fitted with the mvgam package, this is not an issue.
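As a quick sanity check, we can count the missing responses directly:
sum(is.na(model_data$count)) # 36 missing counts, as in the summary above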
We can observe some seasonality in the top-left panel of the plot below, which is confirmed by the ACF plot in the bottom-left. Additionally, missing values are evident in the response variable “count”.
plot_mvgam_series(data = model_data, series = 1, y = 'count')
We can also look at the partial autocorrelation function of the counts.
pacf(model_data$count, na.action = na.pass, main = "PACF of Count")
As suggested by the histogram above, the data are not symmetric. With a standard Shapiro-Wilk test we can assess whether the data are normally distributed. Given the low p-value, we reject the null hypothesis of normality.
shapiro.test(model_data$count)
##
## Shapiro-Wilk normality test
##
## data: model_data$count
## W = 0.88543, p-value = 6.812e-10
Here we can visualize the correlations among the response and the covariates.
numeric_vars <- model_data[, c("count", "mintemp", "ndvi")]
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(cor_matrix, method = "circle")
Fitting a loess curve is useful for capturing potential non-linearities. From the two plots below we can see some nonlinear patterns, although not very strong.
ggplot(model_data, aes(x = mintemp, y = count)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "loess", color = "darkred", se = TRUE) +
  labs(x = "Minimum Temperature", y = "Count") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
ggplot(model_data, aes(x = ndvi, y = count)) +
  geom_point(color = "forestgreen", alpha = 0.6) +
  geom_smooth(method = "loess", color = "darkred", se = TRUE) +
  labs(x = "NDVI", y = "Count") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 36 rows containing missing values or values outside the scale range
## (`geom_point()`).
Given the non-Gaussian data, missing values, possible measurement errors, and non-linearities, we should move away from linear models.
The first fitted model is a hierarchical GLM. This model is just a starting point to explore whether we can explain the number of captures by differences between years. The number of rodent captures is assumed to follow a Poisson distribution (though we will revisit this assumption in more detail later on), a natural choice for non-negative count data since the Poisson models the probability of a certain number of events happening in a fixed time period. The model uses a log link function, which ensures that predictions are non-negative. We then place random (hierarchical) intercepts on the year factor, reflecting the prior idea that although each year's effect has unique characteristics (because different environmental conditions might affect rodent captures in each year), these yearly effects are not completely independent but are all sampled from the same underlying population. There is a kind of underlying connection between years, so that the information learned about one year can inform the expected number of captures in another.
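To build intuition for this hierarchical structure, here is a minimal prior-predictive sketch: hypothetical draws under the package's default priors (reported later via get_mvgam_priors), not estimates from any fitted model.
set.seed(42)
mu_year    <- rnorm(1, 0, 1)                 # population mean on the log scale
sigma_year <- abs(rt(1, df = 3)) * 2.5       # half-Student-t population sd
beta_year  <- rnorm(17, mu_year, sigma_year) # one random intercept per year
lambda     <- exp(beta_year)                 # log link keeps the rates positive
counts     <- rpois(17, lambda)              # one simulated capture count per year
Before fitting, a boxplot of counts by year gives a first impression of these between-year differences.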
ggplot(model_data, aes(x = factor(year), y = count)) +
geom_boxplot(fill = "grey", color = "darkred",
outlier.shape = NA,
width = 0.7) +
stat_boxplot(geom = "errorbar", width = 0.3, color = "darkred") +
labs(title = "Boxplot of Yearly Rodent Counts",
x = "Year",
y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning: Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Removed 36 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
In order to produce out-of-sample predictions, we need to divide the dataset into training and testing sets. The first 160 observations will be used for training the models, and the remaining 39 for evaluating predictions.
model_data %>%
dplyr::filter(time <= 160) -> data_train
model_data %>%
dplyr::filter(time > 160) -> data_test
Before actually running the sampler, we can take a look at the priors the function will use in this situation and edit them to represent our own prior beliefs, if any.
priors <- get_mvgam_priors(count ~ s(year_fac, bs = 're')-1,
family = poisson(),
data = model_data)
priors
## param_name param_length param_info
## 1 vector[1] mu_raw; 1 s(year_fac) pop mean
## 2 vector<lower=0>[1] sigma_raw; 1 s(year_fac) pop sd
## prior example_change
## 1 mu_raw ~ std_normal(); mu_raw ~ normal(-0.69, 0.95);
## 2 sigma_raw ~ student_t(3, 0, 2.5); sigma_raw ~ exponential(0.43);
## new_lowerbound new_upperbound
## 1 NA NA
## 2 NA NA
model_re <- mvgam(count ~ s(year_fac, bs = 're')-1,
family = poisson(),
data = data_train,
newdata = data_test)
## Compiling Stan program using rstan
##
## Start sampling
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
\[ Y_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_{\text{year}[t]} \]
\[ \beta_{\text{year}} \sim \text{Normal}(\mu_{\text{year}}, \sigma_{\text{year}}) \]
\[ \mu_{\text{year}} \sim \text{Normal}(0, 1) \]
\[ \sigma_{\text{year}} \sim \text{StudentT}(3, 0, 2.5) \]
summary(model_re)
## GAM formula:
## count ~ s(year_fac, bs = "re") - 1
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## None
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 1000; warmup = 500; thin = 1
## Total post-warmup draws = 2000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## s(year_fac).1 1.8 2.1 2.3 1 2728
## s(year_fac).2 2.5 2.7 2.8 1 2494
## s(year_fac).3 3.0 3.1 3.2 1 2586
## s(year_fac).4 3.1 3.3 3.4 1 2492
## s(year_fac).5 1.9 2.1 2.3 1 2430
## s(year_fac).6 1.5 1.8 2.0 1 2576
## s(year_fac).7 1.8 2.1 2.3 1 2983
## s(year_fac).8 2.8 3.0 3.1 1 2636
## s(year_fac).9 3.1 3.2 3.4 1 2510
## s(year_fac).10 2.6 2.8 2.9 1 2483
## s(year_fac).11 3.0 3.1 3.2 1 2702
## s(year_fac).12 3.1 3.2 3.3 1 2334
## s(year_fac).13 2.1 2.3 2.4 1 2622
## s(year_fac).14 1.3 2.6 3.9 1 1408
## s(year_fac).15 1.3 2.6 3.8 1 1390
## s(year_fac).16 1.2 2.6 3.8 1 1234
## s(year_fac).17 1.3 2.6 3.9 1 1246
##
## GAM group-level estimates:
## 2.5% 50% 97.5% Rhat n_eff
## mean(s(year_fac)) 2.20 2.60 2.90 1.02 327
## sd(s(year_fac)) 0.38 0.59 0.94 1.01 379
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(year_fac) 11.4 17 1382 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 14:24:19 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
Using the ‘code’ function, we can display the generated Stan code.
code(model_re)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // random effect variances
## vector<lower=0>[1] sigma_raw;
## // random effect means
## vector[1] mu_raw;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## b[1:17] = mu_raw[1] + b_raw[1:17] * sigma_raw[1];
## }
## model {
## // prior for random effect population variances
## sigma_raw ~ student_t(3, 0, 2.5);
## // prior for random effect population means
## mu_raw ~ std_normal();
## // prior (non-centred) for s(year_fac)...
## b_raw[1:17] ~ std_normal();
## {
## // likelihood functions
## flat_ys ~ poisson_log_glm(flat_xs,
## 0.0,b);
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## array[n, n_series] int ypred;
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
We can extract the posterior samples of the mean and standard deviation of population-level effects.
pop_params <- as.data.frame(model_re,
variable = c('mean(year_fac)',
'sd(year_fac)'))
head(pop_params)
## mean(year_fac) sd(year_fac)
## 1 2.291623 0.7410213
## 2 2.193049 0.8001934
## 3 2.353734 0.6042163
## 4 2.579334 0.6969573
## 5 2.589601 0.6617802
## 6 2.571173 0.6553844
As we said before, in this model each year has its own random effect, meaning that the effect of each year on the number of rodent captures is modeled as a deviation from an overall population-level mean. The posterior distribution of mean(year_fac) consists of samples from the model's estimate of \(\mu_{\text{year}}\), the population mean of the year effects. The mean of these posterior samples gives us the model's best estimate of the overall baseline effect of years on rodent captures. Similarly, the mean of the sd(year_fac) posterior samples represents the average spread of the year-specific random effects around the population-level mean. The two results shown below are on the link scale; to make them more interpretable we can transform them to the response scale.
# link scale
mean(pop_params[,1])
## [1] 2.587052
mean(pop_params[,2])
## [1] 0.6111257
# response scale
mean(exp(pop_params[,1]))
## [1] 13.4934
mean(exp(pop_params[,2]))
## [1] 1.864422
In order to account for the uncertainty in the estimation we can build and visualize 95% credible intervals of posterior samples.
mean_year_ci <- quantile(pop_params$`mean(year_fac)`, probs = c(0.025, 0.975))
sd_year_ci <- quantile(pop_params$`sd(year_fac)`, probs = c(0.025, 0.975))
# plot as histograms
hist(pop_params$`mean(year_fac)`, main = expression(mu[year]), col = "darkred", breaks = 20)
abline(v = mean_year_ci, col = "blue", lwd = 2, lty = 2)
hist(pop_params$`sd(year_fac)`, main = expression(sigma[year]), col = "darkred", breaks = 20)
abline(v = sd_year_ci, col = "blue", lwd = 2, lty = 2)
Among different ways of assessing goodness of fit, a natural one is to compare observed and expected frequencies from the model. A very useful tool in the case of non-negative count data is the rootogram, which plots “histogram-like bars for the observed frequencies and a curve for the fitted frequencies, all on a square-root scale […] to approximately adjust for scale differences across the values” (Kleiber & Zeileis, 2016). If the bars deviate from the curve (either above or below), it indicates areas where the model's expectations do not align with the observed data.
pp_check(object = model_re, type = 'rootogram')
## Using all posterior draws for ppc type 'rootogram' by default.
## Warning in pp_check.mvgam(object = model_re, type = "rootogram"): NA responses
## are not shown in 'pp_check'.
To see whether these yearly intercepts can result in good time-varying predictions, we can plot forecasts, which show that the model is not capable of producing accurate predictions.
plot(model_re, type = 'forecast')
## Out of sample DRPS:
## 183.26291725
It is evident that a substantial amount of variability remains unexplained. Let us now examine the residuals: as we can see, there is strong autocorrelation in the residuals, which means the model is not capturing the time-dependent structure in the data. The model may need to incorporate more complex components.
plot(model_re, type = "residuals")
## Theoretic detour

We will now move from the hierarchical GLM formulation to the use of GAMs. While with GLMs we assume linear relationships in the parameters, GAMs relax this assumption and deal with non-linearities in the data. A Generalized Additive Model can be represented as a sum of smooth functions, named splines, representing functional relationships between covariates and the response: \[ E(Y|X) = g^{-1} \Big(\beta_0 + \sum_{i=1}^{I} s_i(x_i)\Big) \] Splines are composed of simpler functions named basis functions. In particular, they are sums of weighted basis functions evaluated at the values of x: \[ s(x) = \sum_{k = 1}^{K}\beta_k b_k(x) \] In the frequentist framework, the \(\beta_k\) are estimated by starting off with the unweighted basis functions and choosing values for the weights so as to maximize the penalized log-likelihood: we force the spline to go as close to the data as we can, with the caveat that we don't want to overfit. In order to avoid overfitting we measure the wiggliness of the fitted function: the more the fitted function deviates from a flat line (unweighted basis functions), the more complicated the model is. One straightforward measure is \[ \int_\mathbb{R}[s''(x)]^2\,dx = \beta^T S\beta = W \] namely the integrated squared second derivative of the smooth function, which can be written as a function of the estimated basis coefficients \(\beta\) and the penalty matrix S. We penalize W to avoid overfitting through a smoothing parameter \(\lambda\): \[ \mathcal{L}_{pen} = \log(\text{Likelihood}) - \lambda W \] If we set \(\lambda\) to 0 we fit an unpenalized model; if we allow it to be very large, we pay an increasing cost for wiggliness and the model tends towards a linear model.
In the Bayesian framework the focus shifts from finding a single best set of coefficients to treating the coefficients as random variables drawn from a multivariate Gaussian distribution, with the penalty acting on the prior precision: a large \(\lambda\) results in a tight prior, while a smaller one allows for more wiggliness as the prior imposes fewer constraints on the coefficients.
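To make the wiggliness measure concrete, here is a small sketch that uses mgcv's smoothCon() to expose the penalty matrix S of a B-spline basis and evaluates \(W = \beta^TS\beta\); the coefficients are hypothetical draws for illustration, not fitted values.
library(mgcv)
sm   <- smoothCon(s(time, bs = 'bs', k = 15), data = model_data)[[1]]
S    <- sm$S[[1]]                    # penalty matrix of the smooth
beta <- rnorm(ncol(S))               # hypothetical basis coefficients
W    <- drop(t(beta) %*% S %*% beta) # the wiggliness measure from above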
In order to improve the model's performance we can replace the year variable with the time variable, hoping that temporal dependencies will be better captured. Below we can see what 15 basis functions look like using B-splines. The larger k is, the more complex the model will be.
basis_funs <- basis(s(time, bs = 'bs', k = 15),
data = model_data)
draw(basis_funs) +
labs(y = 'Basis function value', x = 'Time')
These are the priors that mvgam would use in this setting, but we can decide to place standard Normal priors on the fixed-effect coefficients.
get_mvgam_priors(count ~ s(time, bs = 'bs', k = 15)+
mintemp + ndvi,
family = poisson(),
data = data_train)
## param_name param_length param_info
## 1 (Intercept) 1 (Intercept)
## 2 mintemp 1 mintemp fixed effect
## 3 ndvi 1 ndvi fixed effect
## 4 vector<lower=0>[n_sp] lambda; 2 s(time) smooth parameters
## prior example_change
## 1 (Intercept) ~ student_t(3, 2.6, 2.5); (Intercept) ~ normal(0, 1);
## 2 mintemp ~ student_t(3, 0, 2); mintemp ~ normal(0, 1);
## 3 ndvi ~ student_t(3, 0, 2); ndvi ~ normal(0, 1);
## 4 lambda ~ normal(5, 30); lambda ~ exponential(1);
## new_lowerbound new_upperbound
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
time_smooth <- mvgam(count ~ s(time, bs = 'bs', k = 15)+
mintemp + ndvi,
family = poisson(),
data = data_train,
newdata = data_test,
priors = prior(normal(0, 1), class = b))
## Compiling Stan program using rstan
##
## Start sampling
## Warning in .local(object, ...): some chains had errors; consider specifying
## chains = 1 to debug
## here are whatever error messages were returned
## [[1]]
## Stan model 'anon_model' does not contain samples.
##
## [[2]]
## Stan model 'anon_model' does not contain samples.
##
## [[3]]
## Stan model 'anon_model' does not contain samples.
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + s(\text{time})_t + \beta_{\text{ndvi}} \cdot \text{ndvi}_t + \beta_{\text{mintemp}} \cdot \text{mintemp}_t \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
\[ s(\text{time})_t = \sum_{k=1}^{K} b_k(\text{time}_t) \cdot \beta_{\text{smooth}} \]
\[ \beta_{\text{smooth}} \sim \text{MVNormal}(0, (S \cdot \lambda)^{-1}) \]
\[ \beta_{\text{ndvi}} \sim \text{Normal}(0, 1) \]
\[ \beta_{\text{mintemp}} \sim \text{Normal}(0, 1) \]
As we can see from the plots below, the residuals still show autocorrelation, meaning that the GAM is also unable to capture the temporal structure of the data…
plot(time_smooth, type = 'residuals')
…which is confirmed by the bad forecasts.
plot(time_smooth, type = 'forecast')
## Out of sample DRPS:
## 262.16482
One key aspect that drives the forecasts is the choice of the penalty for the splines. A second-derivative penalty penalizes the curvature of the spline, producing linear extrapolations (the slope remains unchanged beyond the last boundary of the training data). A first-derivative penalty produces flat extrapolations, so the mean remains unchanged beyond the last boundary of the training data.
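In mgcv-style smooths the penalty order can be chosen through the m argument of s(); assuming the B-spline basis used here (first entry of m: basis order, later entries: derivative penalty orders), a sketch of the two options would be
s(time, bs = 'bs', k = 15, m = c(3, 2)) # cubic basis, 2nd-derivative penalty -> linear extrapolation
s(time, bs = 'bs', k = 15, m = c(3, 1)) # cubic basis, 1st-derivative penalty -> flat extrapolation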
Other than that, spline extrapolation is also highly sensitive to wiggliness. In order to model and capture autocorrelation in a series, one option is to let the spline wiggle more. Increased complexity (wiggliness) leads to more accurate inferences on historical patterns, giving more precise intervals and so on, but from a forecasting point of view splines still tend to be highly unpredictable, given that each basis function has only local knowledge of the interval it covers.
For forecasting we need a global understanding of the phenomenon. So we replace the spline of time with an actual time series process that can learn from the behaviour of the past and give better predictions into the future. A variety of processes can be used inside ‘mvgam’, such as autoregressive processes, Gaussian processes, random walks, and VAR processes for multivariate time series.
DGAMs are particularly suited for time series data as they account for temporal dependencies by modelling the series as a dynamic autocorrelated process. In the univariate case, a DGAM for discrete time series is specified as follows \[ E(Y_t|X_t) = g^{-1} \Big(\beta_0 + \sum_{i=1}^{I} s_{i,t}(x_{i,t}) + z_t\Big) \] \[ z_t = \phi z_{t-1} + e_t \] where \(z_t\) is the latent dynamic process estimate at time t. As highlighted by Clark & Wells (2022), separating the temporal process from the observation process offers several advantages. It reduces the impact of measurement errors or outliers, which can heavily distort the estimation of AR parameters. Additionally, using latent processes makes it easier to manage missing or irregularly sampled data, providing a more robust and flexible modeling approach: since the \(z_t\) are unobserved latent random variables, they evolve in time even when an observation \(Y_t\) is missing.
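A minimal simulation illustrates why the latent process handles gaps gracefully: the state \(z_t\) keeps evolving whether or not \(Y_t\) is observed (toy values, unrelated to the fitted model).
set.seed(1)
n <- 50; phi <- 0.8; sigma_err <- 0.5
z <- numeric(n)
for (t in 2:n) z[t] <- phi * z[t - 1] + rnorm(1, 0, sigma_err)
y <- rpois(n, exp(1 + z)) # Poisson observations of the latent state
y[sample(n, 10)] <- NA    # missing observations leave z untouched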
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + s(\text{mintemp})_t + s(\text{ndvi})_t + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ s(\text{ndvi})_t = \sum_{k=1}^{K} b_k(\text{ndvi}_t) \cdot \beta_{\text{smooth}} \]
\[ \beta_{\text{smooth}} \sim \text{MVNormal}(0, (S \cdot \lambda)^{-1}) \]
\[ \phi \sim \text{Normal}(0, 1) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
mod_notime2 <- mvgam(count ~ s(mintemp, bs = 'bs', k = 15) + s(ndvi, bs = 'bs', k = 8),
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
code(mod_notime2)
## // Stan model code generated by package mvgam
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[14,28] S1; // mgcv smooth penalty matrix S1
## matrix[7,14] S2; // mgcv smooth penalty matrix S2
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // latent trend AR1 terms
## vector<lower=-1,upper=1>[n_series] ar1;
## // latent trend variance parameters
## vector<lower=0>[n_series] sigma;
## // raw latent trends
## matrix[n, n_series] trend_raw;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## // latent trends
## matrix[n, n_series] trend;
## trend = trend_raw .* rep_matrix(sigma', rows(trend_raw));
## for (s in 1 : n_series) {
## trend[2 : n, s] += ar1[s] * trend[1 : (n - 1), s];
## }
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // prior for (Intercept)...
## b_raw[1] ~ student_t(3, 2.6, 2.5);
## // prior for s(mintemp)...
## b_raw[2:15] ~ multi_normal_prec(zero[2:15],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
## // prior for s(ndvi)...
## b_raw[16:22] ~ multi_normal_prec(zero[16:22],S2[1:7,1:7] * lambda[3] + S2[1:7,8:14] * lambda[4]);
## // priors for AR parameters
## ar1 ~ std_normal();
## // priors for smoothing parameters
## lambda ~ normal(5, 30);
## // priors for latent trend variance parameters
## sigma ~ student_t(3, 0, 2.5);
## to_vector(trend_raw) ~ std_normal();
## {
## // likelihood functions
## vector[n_nonmissing] flat_trends;
## flat_trends = (to_vector(trend))[obs_ind];
## flat_ys ~ poisson_log_glm(append_col(flat_xs, flat_trends),
## 0.0,append_row(b, 1.0));
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## vector[n_series] tau;
## array[n, n_series] int ypred;
## rho = log(lambda);
## for (s in 1:n_series) {
## tau[s] = pow(sigma[s], -2.0);
## }
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]] + trend[1:n, s];
## ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
## }
## }
The number \(k\) in the ‘mvgam’ function should be chosen by the user based on the assumed - possibly nonlinear - relationships in the data. However, in order to check whether the assumed complexity is adequate or excessive, we can look at the Effective Degrees of Freedom (edf) and the Reference Degrees of Freedom (Ref.df). The first tells us how complex the smooth term is in practice (the larger the edf, the more complex the smooth), while the second represents the maximum possible degrees of freedom allowed for the smooth term, based on the number of basis functions the user chose when setting up the model. As we can see here, the smooth term of ‘mintemp’ has an edf of 7.46, which is more than half of the corresponding Ref.df, while the smooth of ndvi has a much lower edf of 2.60. Bear in mind that an edf close to 1 implies an almost linear relationship.
summary(mod_notime2)
## GAM formula:
## count ~ s(mintemp, bs = "bs", k = 15) + s(ndvi, bs = "bs", k = 8)
##
## Family:
## poisson
##
## Link function:
## log
##
## Trend model:
## AR(p = 1)
##
## N series:
## 1
##
## N timepoints:
## 199
##
## Status:
## Fitted using Stan
## 4 chains, each with iter = 2000; warmup = 1000; thin = 1
## Total post-warmup draws = 4000
##
##
## GAM coefficient (beta) estimates:
## 2.5% 50% 97.5% Rhat n_eff
## (Intercept) 1.90 2.200 2.40 1.00 2390
## s(mintemp).1 -4.70 0.600 1.70 1.01 412
## s(mintemp).2 -3.90 -0.230 0.83 1.01 439
## s(mintemp).3 -3.20 -0.970 0.17 1.01 616
## s(mintemp).4 -2.50 -1.300 -0.33 1.00 1760
## s(mintemp).5 -2.40 -1.500 -0.19 1.01 739
## s(mintemp).6 -2.90 -1.800 1.20 1.01 439
## s(mintemp).7 -2.60 -1.400 1.60 1.01 429
## s(mintemp).8 -1.80 -0.700 2.50 1.01 431
## s(mintemp).9 -1.10 -0.180 2.40 1.01 429
## s(mintemp).10 -0.86 -0.011 2.00 1.01 452
## s(mintemp).11 -0.45 0.340 2.20 1.01 468
## s(mintemp).12 -0.51 0.400 3.10 1.01 432
## s(mintemp).13 -0.40 0.490 2.70 1.01 459
## s(mintemp).14 -1.60 0.550 2.10 1.00 2219
## s(ndvi).1 -0.75 -0.140 0.50 1.00 4442
## s(ndvi).2 -1.20 0.110 1.20 1.00 2679
## s(ndvi).3 -0.86 0.360 1.50 1.00 2652
## s(ndvi).4 -0.33 0.280 0.89 1.00 2887
## s(ndvi).5 -0.51 0.066 0.66 1.00 3702
## s(ndvi).6 -0.95 -0.170 0.55 1.00 3372
## s(ndvi).7 -1.50 -0.280 0.94 1.00 2598
##
## Approximate significance of GAM smooths:
## edf Ref.df Chi.sq p-value
## s(mintemp) 7.46 14 99.28 0.44
## s(ndvi) 2.60 7 2.62 0.99
##
## Latent trend parameter AR estimates:
## 2.5% 50% 97.5% Rhat n_eff
## ar1[1] 0.65 0.88 0.99 1 2134
## sigma[1] 0.57 0.69 0.84 1 1745
##
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 4000 iterations ended with a divergence (0%)
## 0 of 4000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior
##
## Samples were drawn using NUTS(diag_e) at Tue Sep 24 14:28:34 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split MCMC chains
## (at convergence, Rhat = 1)
In order to check the model's adequacy to the data, a posterior predictive check can be run. Notice the slight but systematic deviations of the observed data with respect to the generated data.
pp_check(mod_notime2, ndraws = 30)
## Warning in pp_check.mvgam(mod_notime2, ndraws = 30): NA responses are not shown
## in 'pp_check'.
The code below performs a posterior predictive check by comparing the empirical CDF of the observed data against the empirical CDFs of simulated data generated from the posterior distribution of the model. If the model fits well, the ECDF of the observed data should align closely with the ECDFs of the simulated data.
pp_check(mod_notime2, type = 'ecdf_overlay')
## Using 10 posterior draws for ppc type 'ecdf_overlay' by default.
## Warning in pp_check.mvgam(mod_notime2, type = "ecdf_overlay"): NA responses are
## not shown in 'pp_check'.
The residuals of this model don't show any evident pattern in the autocorrelation plot.
plot(mod_notime2, type = "residuals")
The forecasts on the held-out sample also look good.
plot(mod_notime2, type = "forecast")
## Out of sample DRPS:
## 125.8303123125
Now two other models will be fit:

- a dynamic GLM with just an intercept in the observation model;
- a dynamic GLM with additive linear effects of mintemp and ndvi in the observation model.
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
mod_notime3 <- mvgam(count ~ 1,
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
\[ \text{count}_t \sim \text{Poisson}(\lambda_t) \]
\[ \log(\lambda_t) = \beta_0 + \beta_{\text{ndvi}} \cdot \text{ndvi}_t + \beta_{\text{mintemp}} \cdot \text{mintemp}_t + z_t \]
\[ z_t \sim \text{Normal}(\phi \cdot z_{t-1}, \sigma_{\text{error}}) \]
\[ \sigma_{\text{error}} \sim \text{StudentT}(3, 0, 2.5) \]
\[ \beta_0 \sim \text{StudentT}(3, 2.6, 2.5) \]
\[ \beta_{\text{ndvi}} \sim \text{StudentT}(3, 0, 2) \]
\[ \beta_{\text{mintemp}} \sim \text{StudentT}(3, 0, 2) \]
mod_notime4 <- mvgam(count ~ mintemp + ndvi,
family = poisson(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
## Warning in .local(object, ...): some chains had errors; consider specifying
## chains = 1 to debug
## here are whatever error messages were returned
## [[1]]
## Stan model 'anon_model' does not contain samples.
##
## [[2]]
## Stan model 'anon_model' does not contain samples.
The step of model comparison is fundamental for assessing which statistical model is more appropriate for the available data. One popular information criterion is the Deviance Information Criterion (DIC), which however has been criticized for not being fully Bayesian, because it uses posterior means (point estimates) instead of the entire posterior distributions available after MCMC simulation. Another widely used information criterion, which instead uses every single MCMC sample and observation, is the Watanabe-Akaike Information Criterion (WAIC). It uses the log pointwise predictive density as a raw measure of goodness of fit, which can be computed in practice by averaging the likelihood of each observation over all MCMC samples and then summing over all n observations \[ \sum_{i = 1}^{n}\log\Big( \frac{1}{S}\sum_{s=1}^{S}p(y_i|\theta^{(s)}) \Big) \] together with a penalty on the effective number of parameters, which sums, for each data point, the variance of the log-likelihood across MCMC samples.
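As a sketch of that computation, given a hypothetical S x n matrix log_lik of pointwise log-likelihood draws (filled with placeholder values here, not extracted from our models), WAIC can be assembled by hand:
log_lik <- matrix(rnorm(1000 * 50, -2, 0.5), nrow = 1000) # placeholder draws
lppd    <- sum(log(colMeans(exp(log_lik))))               # log pointwise predictive density
p_waic  <- sum(apply(log_lik, 2, var))                    # effective number of parameters
waic    <- -2 * (lppd - p_waic)
In practice we compare the three dynamic models below with loo_compare, which ranks them by expected log predictive density estimated via leave-one-out cross-validation, a close relative of WAIC.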
loo_compare(mod_notime2, mod_notime3, mod_notime4)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## elpd_diff se_diff
## mod_notime2 0.0 0.0
## mod_notime4 -0.5 3.8
## mod_notime3 -8.8 3.7
The model with smooths of the covariates (mod_notime2) and the model with linear effects of the covariates (mod_notime4) are comparable in terms of performance on the available data.
We would also like to compare models with respect to their predictive capabilities. One proper scoring rule for this evaluation is the Continuous Ranked Probability Score (CRPS), which behaves like a weighted absolute error computed over the full forecast distribution.
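For intuition, the standard sample estimator of the CRPS for a single observation y given forecast draws x fits in a few lines (the scoring performed by mvgam's score() is conceptually the same; this toy helper is only an illustration):
crps_draws <- function(y, x) {
  # CRPS = E|X - y| - 0.5 * E|X - X'|, estimated from draws
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}
crps_draws(y = 12, x = rpois(2000, lambda = 10)) # toy example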
fc2 <- forecast(mod_notime2)
fc3 <- forecast(mod_notime3)
fc4 <- forecast(mod_notime4)
s2 <- score(fc2, score = "crps")
s3 <- score(fc3, score = "crps")
s4 <- score(fc4, score = "crps")
sum(s2$PP$score, na.rm = TRUE) - sum(s3$PP$score, na.rm = TRUE) # model 2 is preferred
## [1] -40.10458
sum(s2$PP$score, na.rm = TRUE) - sum(s4$PP$score, na.rm = TRUE) # model 2 is preferred
## [1] -18.53619
sum(s3$PP$score, na.rm = TRUE) - sum(s4$PP$score, na.rm = TRUE) # model 4 is preferred
## [1] 21.56839
Model two is preferred in terms of predictive capabilities.
Up to now we have assumed a Poisson distribution for the data, which carries the strong assumption that the mean equals the variance. However, since overdispersion (i.e. the empirical evidence that the variance is greater than the mean) is very common when analysing count data in ecological time series, we should also consider using a Negative Binomial distribution. In the context of ecology, a more relevant parametrisation and interpretation of the negative binomial involves expressing the pmf in terms of a location parameter \(\mu\) (i.e. the mean) and a parameter k controlling overdispersion. In particular, if p is the probability of success and r the number of successes before which we want to count the number of failures, then \[ p = \frac{k}{k+\mu} \] where \(k = r\). The probability mass function of Y then becomes \[ p_Y(y)=\binom{k+y-1}{y} \Big(\frac{\mu}{\mu+k} \Big)^y \Big(\frac{k}{k+\mu}\Big)^k \] where the variance is a quadratic function of the mean, \(Var(Y) = \mu + \frac{\mu^2}{k}\), and the Poisson becomes a special case of the Negative Binomial as \(k\) grows to infinity. This is the parametrisation used by Stan inside the ‘mvgam’ function.
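This parametrisation maps directly onto R's built-in rnbinom(mu = , size = ), where size plays the role of k; a quick simulation confirms the quadratic mean-variance relationship:
mu <- 10; k <- 2
y  <- rnbinom(1e5, mu = mu, size = k)
c(mean = mean(y), var = var(y), mu_plus_mu2_over_k = mu + mu^2 / k) # var(y) ~ mu + mu^2/k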
nb_notime2 <- mvgam(count ~ s(mintemp, bs = 'bs', k = 15) + s(ndvi, bs = 'bs', k = 8),
family = nb(),
data = data_train,
newdata = data_test,
trend_model = AR(p = 1),
chains = 4,
samples = 2000,
burnin = 1000,
noncentred = TRUE)
## Compiling Stan program using rstan
##
## Start sampling
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
code(nb_notime2)
## // Stan model code generated by package mvgam
## functions {
## vector rep_each(vector x, int K) {
## int N = rows(x);
## vector[N * K] y;
## int pos = 1;
## for (n in 1:N) {
## for (k in 1:K) {
## y[pos] = x[n];
## pos += 1;
## }
## }
## return y;
## }
## }
## data {
## int<lower=0> total_obs; // total number of observations
## int<lower=0> n; // number of timepoints per series
## int<lower=0> n_sp; // number of smoothing parameters
## int<lower=0> n_series; // number of series
## int<lower=0> num_basis; // total number of basis coefficients
## vector[num_basis] zero; // prior locations for basis coefficients
## matrix[total_obs, num_basis] X; // mgcv GAM design matrix
## array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
## matrix[14,28] S1; // mgcv smooth penalty matrix S1
## matrix[7,14] S2; // mgcv smooth penalty matrix S2
## int<lower=0> n_nonmissing; // number of nonmissing observations
## array[n_nonmissing] int<lower=0> flat_ys; // flattened nonmissing observations
## matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
## array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
## }
## parameters {
## // raw basis coefficients
## vector[num_basis] b_raw;
## // negative binomial overdispersion
## vector<lower=0>[n_series] phi_inv;
## // latent trend AR1 terms
## vector<lower=-1,upper=1>[n_series] ar1;
## // latent trend variance parameters
## vector<lower=0>[n_series] sigma;
## // raw latent trends
## matrix[n, n_series] trend_raw;
## // smoothing parameters
## vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
## // basis coefficients
## vector[num_basis] b;
## // latent trends
## matrix[n, n_series] trend;
## trend = trend_raw .* rep_matrix(sigma', rows(trend_raw));
## for (s in 1 : n_series) {
## trend[2 : n, s] += ar1[s] * trend[1 : (n - 1), s];
## }
## b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
## // prior for (Intercept)...
## b_raw[1] ~ student_t(3, 2.6, 2.5);
## // prior for s(mintemp)...
## b_raw[2:15] ~ multi_normal_prec(zero[2:15],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
## // prior for s(ndvi)...
## b_raw[16:22] ~ multi_normal_prec(zero[16:22],S2[1:7,1:7] * lambda[3] + S2[1:7,8:14] * lambda[4]);
## // priors for AR parameters
## ar1 ~ std_normal();
## // priors for smoothing parameters
## lambda ~ normal(5, 30);
## // priors for overdispersion parameters
## phi_inv ~ student_t(3, 0, 0.1);
## // priors for latent trend variance parameters
## sigma ~ student_t(3, 0, 2.5);
## to_vector(trend_raw) ~ std_normal();
## {
## // likelihood functions
## vector[n_nonmissing] flat_trends;
## array[n_nonmissing] real flat_phis;
## flat_trends = (to_vector(trend))[obs_ind];
## flat_phis = to_array_1d(rep_each(phi_inv, n)[obs_ind]);
## flat_ys ~ neg_binomial_2(
## exp(append_col(flat_xs, flat_trends) * append_row(b, 1.0)),
## inv(flat_phis));
## }
## }
## generated quantities {
## vector[total_obs] eta;
## matrix[n, n_series] mus;
## vector[n_sp] rho;
## vector[n_series] tau;
## array[n, n_series] int ypred;
## matrix[n, n_series] phi_vec;
## vector[n_series] phi;
## phi = inv(phi_inv);
## for (s in 1:n_series) {
## phi_vec[1:n,s] = rep_vector(phi[s], n);
## }
## rho = log(lambda);
## for (s in 1:n_series) {
## tau[s] = pow(sigma[s], -2.0);
## }
## // posterior predictions
## eta = X * b;
## for(s in 1:n_series){
## mus[1:n, s] = eta[ytimes[1:n, s]] + trend[1:n, s];
## ypred[1:n, s] = neg_binomial_2_rng(exp(mus[1:n, s]), phi_vec[1:n, s]);
## }
## }
pp_check(nb_notime2, ndraws = 30)
## Warning in pp_check.mvgam(nb_notime2, ndraws = 30): NA responses are not shown
## in 'pp_check'.
pp_check(nb_notime2, type = 'ecdf_overlay')
## Using 10 posterior draws for ppc type 'ecdf_overlay' by default.
## Warning in pp_check.mvgam(nb_notime2, type = "ecdf_overlay"): NA responses are
## not shown in 'pp_check'.
On one side we have the model with Poisson observations, smooths of the covariates, and an AR1 latent process; on the other side we have the same model but with Negative Binomial observations. In terms of CRPS, they are practically equal.
fc5 <- forecast(nb_notime2)
s5 <- score(fc5, score = 'crps')
sum(s2$PP$score, na.rm = TRUE) - sum(s5$PP$score, na.rm = TRUE)
## [1] -3.153801