The radiocarbon data are specified by phi_m, the vector of fraction modern values, and sig_m, the vector of associated standard deviations (both indexed by i). The likelihood involves an integral over calendar dates that is calculated using a Riemann integral at the points in the "sampling grid", tau, which ranges from tau_min to tau_max and has a spacing of dtau (tau is indexed by the variable g).

Constraints on parameters: The parameter vector, theta (th), has the ordering th = (pi_1,...,pi_K,mu_1,..mu_K,s_1,...s_K), where K is the number of mixture components and pi_k / mu_k / s_k are, respectively, the weighting / mean / standard deviation of the k-th mixture. Optimization is actually done using a "reduced" vector from which pi_1 has been removed, th = (pi_2,...), since the mixture weights must add to 1. Constraints are placed on each variable in the parameter vector. All the weightings must lie between 0 and 1, including pi_1 (even though it is not in th_reduced). The means must lie between tau_min and tau_max. Since a numerical integral with a defined spacing is used to calculate the likelihood, the standard deviations of the mixture components cannot be too small, and are constrained to be greater than 10*dtau; they are also constrained to be less than tau_max-tau_min. In summary:

  0 < pi_k < 1

tau_min < mu_k < tau_max 10*dtau < s_k < tau_max-tau_min.

The bounded Hooke-Jeeves algorithm in the dfoptim R package (dfoptim::hjkb) is used to do the optimization. The Hooke-Jeeves algorithm is perhaps the best known varient of derivative free pattern-searches. Using multiple restarts helps identify the global optimum (rather than a local optimum).

To ensure reproducibility, the input_seed can be specified in one of two ways. (1) input_seed is a single integer; if so, set the base_seed equal to equal to input_seed (see below for how base_seed is used). (2) A vector of integers of length 1 + num_restarts is specified; no base seed is needed if a vector is provided. If no input_seed is provided (input_seed = NA), the base seed is randomly generated and stored. (1) If a base_seed is used, the command set.seed(base_seed) is run, then a vector of integers is drawn to populate seed_vect. (2) If a vector is given for input_seed, seed_vect is set equal to it. The first entry in seed_vect is used with set.seed() prior to drawing the starting parameter vectors for the restarts. The subsequent entries are used with set.seed() prior to calling the bounded Hooke-Jeeves optimization algorithm for each restart. These additional seeds are necessary to ensure reproducibility is achieved when multiple cores are used to do the optimizations in a parallel for loop.

It can be desirable to regularize the probability density function so that it does not have periods of very fast decay or growth, which can be done using the optional input r_max. Setting a constraint should most likely be done if the density function is used for demographic reconstruction (i.e., if it is interpreted as the relative population size as a function of calendar date). If so, a value of r_max = 0.04 could be sensible. By default, however, r_max = Inf and no constraint is used.

fit_trunc_gauss_mix(
  K,
  phi_m,
  sig_m,
  tau_min,
  tau_max,
  dtau,
  calib_df,
  r_max = Inf,
  num_restarts = 100,
  maxfeval = (K - 1) * 10000,
  num_cores = NA,
  input_seed = NA
)

Arguments

K

The number of mixture components to use for the fit

phi_m

A vector of fraction moderns

sig_m

A vector of standard deviations for phi_m

tau_min

The minimum calendar date (AD) for the sampling grid

tau_max

The maximum calendar date (AD) for the sampling grid

dtau

The spacing of the sampling grid

calib_df

Calibration curve (see load_calib_curve)

r_max

Maximum value for the instantaneous growth rate of the density function (default: Inf, no constraint)

num_restarts

Number of restarts to use (default: 100)

maxfeval

Maximum number of function evaluations for each restart (default: (K-1)*10000)

num_cores

The number of cores to use for parallelization (default: NA, do not parallelize)

input_seed

Either a single integer or a vector of integers of length 1 + num_restarts that is used to ensure reproducilibity (default: NA, a the base seed will be generated and saved).

Value

A list containing:

  • th The best fit parameter vector

  • neg_log_lik The value of the negative log-likelihood for the best fit parameter vector.

  • tau The sampling grid used for the Riemann sum in the likelihood calculation.

  • tau The probability density given th evaluated at the points in tau.

  • aic The Akaike information criterion value.

  • bic the Bayesian information criterion value.

  • hjkb_fit_list A list containing the fits for each restart.

  • input_seed The value of the input parameter seed.

  • base_seed The value of the base seed, which is NA if input_seed is a vector.

  • seed_vect The vector of seed integers with length 1 + num_restarts.