3. User Customization¶

Implementing a new Model¶

One of the standard use cases of ABCpy is to do inference on a probabilistic model that is not part of ABCpy. We now go through the details of such a scenario using the (already implemented) Gaussian generative model to explain how to implement it from scratch.

There are two scenarios to use a model: First, we want to use our probabilistic model to explain a relationship between parameters* (considered random variables for inference) and observed data. This is for example the case when we want to do inference on mechanistic models that do not have a PDF. In this case, our model implementation has to derive from ProbabilisticModel and a few abstract methods have to be defined, as for example forward_simulate().

In the second scenario, we want to use the model to build a relationship between different parameters (between different random variables). Then our model is restricted to either output continuous or discrete parameters in form of a vector. Consequently, the model must derive from either from Continuous or Discrete and implement the required abstract methods. These two classes in turn derive from from ProbabilisticModel, such that the second scenario essentially extends the first.

Let us go through the implementation of a the Gaussian generative model. The model has to conform to the API specified by the base class ProbabilisticModels, and thus must implement at least the following methods:

ProbabilisticModels.__init__()
ProbabilisticModels._check_input()
ProbabilisticModels._check_output()
ProbabilisticModels.forward_simulate()
ProbabilisticModels.get_output_dimension()

We want our model to work in both described scenarios, so our model also has to conform to the API of Continuous since the model output, which is the resulting data from a forward simulation, is from a continuous domain. For completeness, here the abstract methods defined by Continuous and Discrete:

Initializing a New Model¶

Since a Gaussian model generates continous numbers, the newly implement class derives from Continuous and the header look as follows:

class Gaussian(ProbabilisticModel, Continuous):

A good way to start implementing a new model is to define a convenient way to initialize it with its input parameters. In ABCpy all input parameters are either independent ProbabilisticModels or Hyperparameters. Thus, they should not be stored within but rather referenced in the model we implement. This reference is handled by the InputConnector class and must be used in our model implementation. The required procedure is to call the init function of ProbabilisticModels and pass an InputConnector object to it.

ProbabilisticModel.__init__(input_connector, name='')[source]

This initializer must be called from any derived class to properly connect it to its input models.

It accepts as input an InputConnector object that fully specifies how to connect all parent models to the current model.

Parameters:	input_connector (list) – A list of input parameters. name (string) – A human readable name for the model. Can be the variable name for example.

However, it would be very inconvenient to initialize our Gaussian model with an InputConnector object. We rather like the init function to accept a list of parameters [mu, sigma], where mu is the mean and sigma is the standard deviation which are the sole two parameters of our generative Gaussian model. So the idea is to take a convenient input and transform it it an InputConnection object that in turn can be passed to the initializer of the super class. This leads to the following implementation:

def __init__(self, parameters, name='Gaussian'):
    # We expect input of type parameters = [mu, sigma]
    if not isinstance(parameters, list):
        raise TypeError('Input of Normal model is of type list')

    if len(parameters) != 2:
        raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

    input_connector = InputConnector.from_list(parameters)
    super().__init__(input_connector, name)

First, we do some basic syntactic checks on the input that throw exceptions if unreasonable input is provided. Line 9 is the interesting part: the InputConnector comes with a convenient set of factory methods that create InputConnector objects:

We use the factory method from_list. The resulting InputConnector creates links between our Gaussian model and the models (or hyperparameters) that are used for mu and sigma at initialization time. For example, if mu and sigma are initialized as hyperparameters like

model = Gaussian([0, 1])

the from_list() method will automatically create two HyperParameter objects HyperParameter(0) and HyperParameter(1) and will link the our current Gaussian model inputs to them. If we initialize mu and sigma with existing models like

uniform1 = Uniform([-1, 1])
uniform2 = Uniform([10,20])
model = Gaussian([uniform1, uniform2])

the from_list() method will link our inputs to the uniform models.

Additionally, every model instance should have a unique name, which should also be passed to the init function of the super class.

Checking the Input¶

The next function we implement is _check_input which should behave as described in the documentation:

ProbabilisticModel._check_input(input_values)[source]

Check whether the input parameters are compatible with the underlying model.

The following behavior is expected:

1. If the input is of wrong type or has the wrong format, this method should raise an exception. For example, if the number of parameters does not match what the model expects.

2. If the values of the input models are not compatible, this method should return False. For example, if an input value is not from the expected domain.

Background information: Many inference schemes modify the input slightly by applying a small perturbation during sampling. This method is called to check whether the perturbation yields a reasonable input to the current model. In case this function returns False, the inference schemes re-perturb the input and try again. If the check is not done properly, the inference computation might crash or not terminate.

Parameters:	input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
Returns:	True if the fixed value of the parameters can be used as input for the current model. False otherwise.
Return type:	boolean

This leads to the following implementation:

def _check_input(self, input_values):
    # Check whether input has correct type or format
    if len(input_values) != 2:
        raise ValueError('Number of parameters of Normal model must be 2.')

    # Check whether input is from correct domain
    mu = input_values[0]
    sigma = input_values[1]
    if sigma < 0:
        return False

    return True

Forward Simulation¶

At the core of our model lies the capability to forward simulate and create pseudo observations. To expose this functionality the following method has to be implemented:

ProbabilisticModel.forward_simulate(input_values, k, rng)[source]

Provides the output (pseudo data) from a forward simulation of the current model.

In case the model is intended to be used as input for another model, a forward simulation must return a list of k numpy arrays with shape (get_output_dimension(),).

In case the model is directly used for inference, and not as input for another model, a forward simulation also must return a list, but the elements can be arbitrarily defined. In this case it is only important that the used statistics and distance functions can read the input.

Parameters:	input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization. k (integer) – The number of forward simulations that should be run rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns:	A list of k elements, where each element is of type numpy arary and represents the result of a single forward simulation.
Return type:	list

A proper implementation look as follows:

def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
    # Extract the input parameters
    mu = input_values[0]
    sigma = input_values[1]

    # Do the actual forward simulation
    vector_of_k_samples = np.array(rng.normal(mu, sigma, k))

    # Format the output to obey API
    result = [np.array([x]) for x in vector_of_k_samples]
    return result

Note that both mu and sigma are stored in the list input values in the same order as we provided them to the InputConnector object in the init function. Futher note that the output is a list of vectors, each of dimension one, though the Gaussian generative model only produces real numbers.

Checking the Output¶

We also need to check the output of the model. This method is commonly used in case our model is used as an input for other models. When using an inference scheme that utilizes perturbation, the output of our model is slightly perturbed. We have to make sure that the perturbed output is still valid for our model. The details of implementing the method _check_output() can be found in the documentation:

ProbabilisticModel._check_output(values)[source]

Checks whether values contains a reasonable output of the current model.

Parameters:	values (numpy array) – Array of shape (get_output_dimension(),) that contains the model output.
Returns:	Return false if values cannot possibly be generated from the model and true otherwise.
Return type:	boolean

Since the output of a Gaussian generative model is a single number from the full real domain, we can restrict ourselves to syntactic checks. However, one could easily imagine models for which the output it restricted to a certain domain. Then, this function should return False as soon as values are out of the desired domain.

def _check_output(self, values):
    if not isinstance(values, Number):
        raise ValueError('Output of the normal distribution is always a number.')

    # At this point values is a number (int, float); full domain for Normal is allowed
    return True

Note that implementing this method is particularly important when using the current model as input for other models, hence in the second scenario described in Implementing a new Model. In case our model should only be used for the first scenario, it is safe to omit the check and return true.

Getting the Output Dimension¶

We have expose the dimension of the produced output of our model using the following method:

ProbabilisticModel.get_output_dimension()[source]

Provides the output dimension of the current model.

This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.

Returns:	The dimension of the output vector of a single forward simulation.
Return type:	int

Since our model generates a single float number in one forward simulation, the implementation looks is straight forward:

def get_output_dimension(self):
    return 1

Note that implementing this method is particularly important when using the current model as input for other models, hence in the second scenario described in Implementing a new Model. In case our model should only be used for the first scenario, it is safe to return 1.

Calculating the Probability Density Function¶

Since our model also derives from Continuous we also have to implement the following function that calculates the probability density function at specific point.

Continuous.pdf(input_values, x)[source]

Calculates the probability density function of the model.

Parameters:	input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization. x (float) – The location at which the probability density function should be evaluated.

As mentioned above, this is only required if one wants to use our model as input for other models. An implementation looks as follows:

def pdf(self, input_values, x):
    mu = input_values[0]
    sigma = input_values[1]
    pdf = np.norm(mu,sigma).pdf(x)
    return pdf

Our model now conforms to ABCpy and we can start inferring parameters in the same way (see Getting Started) as we would do with shipped models.

Wrap a Model Written in C++¶

There are several frameworks that help you integrating your C++/C code into Python. We showcase examples for

Swig

Using Swig¶

Swig is a tool that creates a Python wrapper for our C++/C code using an interface (file) that we have to specify. We can then import the wrapper and in turn use your C++ code with ABCpy as if it was written in Python.

We go through a complete example to illustrate how to use a simple Gaussian model written in C++ with ABCpy. First, have a look at our C++ model:

void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed) {
  boost::mt19937 rng(seed);
  boost::normal_distribution<> nd(mu, sigma);
  boost::variate_generator<boost::mt19937, boost::normal_distribution<> > sampler(rng, nd);
  
  for (int i=0; i<k; ++i) {
    result[i] = sampler();
  }
}

To use this code in Python, we need to specify exactly how to expose the C++ function to Python. Therefore, we write a Swig interface file that look as follows:

%module gaussian_model_simple
%{
  #define SWIG_FILE_WITH_INIT
  
  #include <iostream>
  #include <boost/random.hpp>
  #include <boost/random/normal_distribution.hpp>
  
  extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
%}

%include "numpy.i"

%init %{
  import_array();
%}

%apply (double* ARGOUT_ARRAY1, int DIM1 ) {(double* result, unsigned int k)};

extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);

In the first line we define the module name we later have to import in your ABCpy Python code. Then, in curly brackets, we specify which libraries we want to include and which function we want to expose through the wrapper.

Now comes the tricky part. The model class expects a method forward_simulate that forward-simulates our model and which returns an array of syntetic observations. However, C++/C does not know the concept of returning an array, instead in C++/C we would provide a memory position (pointer) where to write the results. Swig has to translate between the two concepts. We use actually an Swig interface definition from numpy called import_array. The line

%apply (double* ARGOUT_ARRAY1, int DIM1 ) {(double* result, unsigned int k)};

states that we want the two parameters result and k of the gaussian_model C++ function be interpreted as an array of length k that is returned. Have a look at the Python code below and observe how the wrapped Python function takes only two instead of four parameters and returns a numpy array.

The first stop to get everything running is to translate the Swig interface file to wrapper code in C++ and Python.

swig -python -c++ -o gaussian_model_simple_wrap.cpp gaussian_model_simple.i

This creates two wrapper files gaussian_model_simple_wrap.cpp and gaussian_model_simple.py. Now the C++ files can be compiled:

g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple.cpp -o gaussian_model_simple.o
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple_wrap.cpp -o gaussian_model_simple_wrap.o
g++ -shared gaussian_model_simple.o gaussian_model_simple_wrap.o -o _gaussian_model_simple.so

Note that the include paths might need to be adapted to your system. Finally, we can write a Python model which uses our C++ code:

from numbers import Number
from abcpy.probabilisticmodels import ProbabilisticModel, Continuous, InputConnector
from gaussian_model_simple import gaussian_model

class Gaussian(ProbabilisticModel, Continuous):

    def __init__(self, parameters, name='Gaussian'):
        # We expect input of type parameters = [mu, sigma]
        if not isinstance(parameters, list):
            raise TypeError('Input of Normal model is of type list')

        if len(parameters) != 2:
            raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

        input_connector = InputConnector.from_list(parameters)
        super().__init__(input_connector, name)

    def _check_input(self, input_values):
        # Check whether input has correct type or format
        if len(input_values) != 2:
            raise ValueError('Number of parameters of Normal model must be 2.')

        # Check whether input is from correct domain
        mu = input_values[0]
        sigma = input_values[1]
        if sigma < 0:
            return False

        return True

    def _check_output(self, values):
        if not isinstance(values, Number):
            raise ValueError('Output of the normal distribution is always a number.')

        # At this point values is a number (int, float); full domain for Normal is allowed
        return True

    def get_output_dimension(self):
        return 1

    def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
        # Extract the input parameters
        mu = input_values[0]
        sigma = input_values[1]
        seed = rng.randint(np.iinfo(np.int32).max)

        # Do the actual forward simulation
        vector_of_k_samples = gaussian_model(k, mu, sigma, seed)

        # Format the output to obey API
        result = [np.array([x]) for x in vector_of_k_samples]
        return result

    def pdf(self, input_values, x):
        mu = input_values[0]
        sigma = input_values[1]
        pdf = np.norm(mu, sigma).pdf(x)
        return pdf

The important lines are where we import the wrapper code as a module (line 3) and call the respective model function (line 48).

The full code is available in examples/extensions/models/gaussion_cpp/. To simplify compilation of SWIG and C++ code we created a Makefile. Note that you might need to adapt some paths in the Makefile.

Wrap a Model Written in R¶

Statisticians often use the R language to build statistical models. R models can be incorporated within the ABCpy language with the rpy2 Python package. We show how to use the rpy2 package to connect with a model written in R.

Continuing from the previous sections we use a simple Gaussian model as an example. The following R code is the contents of the R file gaussian_model.R:

simple_gaussian <- function(mu, sigma, k = 1){
	output <- rnorm(k, mu, sigma)
	return(output)
}

More complex R models are incorporated in the same way. To include this function within ABCpy we include the following code at the beginning of our Python file:

import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()

robjects.r('''
       source('gaussian_model.R')
''')

r_simple_gaussian = robjects.globalenv['simple_gaussian']

This imports the R function simple_gaussian into the Python environment. We need to build our own model to incorporate this R function as in the previous section. The only difference is in the forward_simulate method of the class :code:`Gaussian’.

vector_of_k_samples = list(r_simple_gaussian(mu, sigma, k))

The default output for R functions in Python is a float vector. This must be converted into a Python numpy array for the purposes of ABCpy.

Implementing a new Distance¶

We will now explain how you can implement your own distance measure. A new distance is implemented as a new class that derives from :py:class`Distance <abcpy.distance.Distance>` and for which the following three methods have to be implemented:

Let us first look at the initializer documentation:

Distance.__init__(statistics_calc)[source]

The constructor of a sub-class must accept a non-optional statistics calculator as a parameter. If stored to self.statistics_calc, the private helper method _calculate_summary_stat can be used.

Parameters:	statistics_calc (abcpy.stasistics.Statistics) – Statistics extractor object that conforms to the Statistics class.

Distances in ABCpy should act on summary statistics. Therefore, at initialization of a distance calculator, a statistics calculator should be provided. The following header conforms to this idea:

def __init__(self, statistics):
    self.statistics_calc = statistics

    # Since the observations do always stay the same, we can save the
    #  summary statistics of them and not recalculate it each time
    self.s1 = None
    self.data_set = None

Then, we need to define how the distance is calculated. First we compute the summary statistics from the datasets and then compute the distance between the summary statistics. Notice, while computing the summary statistics we save the first dataset and the corresponding summary statistics. This is since we always pass the observed dataset first to the distance function. The observed dataset does not change during an inference computation and thus it is efficient to compute it once and store it internally.

    # Extract summary statistics from the dataset
    if(self.s1 is None or self.data_set!=d1):
        self.s1 = self.statistics_calc.statistics(d1)
        self.data_set = d1
    s2 = self.statistics_calc.statistics(d2)

    # compute distance between the statistics
    dist = np.zeros(shape=(self.s1.shape[0],s2.shape[0]))
    for ind1 in range(0, self.s1.shape[0]):
        for ind2 in range(0, s2.shape[0]):
            dist[ind1,ind2] = np.sqrt(np.sum(pow(self.s1[ind1,:]-s2[ind2,:],2)))

    return dist.mean()

Finally, we need to define the maximal distance that can be obtained from this distance measure.

def dist_max(self):
    return np.inf

The newly defined distance class can be used in the same way as the already existing once. The complete example for this tutorial can be found in examples/extensions/distances/default_distance.py.

Implementing a new Perturbation Kernel¶

To implement a new kernel, we need to implement a new class that derives from abcpy.perturbationkernel.PerturbationKernel and that implements the following abstract methods:

Kernels in ABCpy can be of two types: they can either be derived from the class ContinuousKernel or from DiscreteKernel. In case a continuous kernel is required, the following method must be implemented:

    def pdf(self, accepted_parameters_manager, kernel_index, index, x):

On the other hand, if the kernel is a discrete kernel, we would need the following method:

    def pmf(self, accepted_parameters_manager, kernel_index, index, x):

As an example, we will implement a kernel which perturbs continuous parameters using a multivariate normal distribution (which is already implemented within ABCpy). First, we need to define a constructor.

PerturbationKernel.__init__(models)[source]

Parameters:	models (list) – The list of abcpy.probabilisticmodel objects that should be perturbed by this kernel.

Thus, ABCpy expects that the arguments passed to the initializer is of type ProbibilisticModel, which can be seen as the random variables that should be perturbed by this kernel. All these models should be saved on the kernel for future reference.

class MultivariateNormalKernel(PerturbationKernel, ContinuousKernel):
    def __init__(self, models):
        self.models = models

Next, we need the following method:

PerturbationKernel.calculate_cov(accepted_parameters_manager, kernel_index)[source]

Calculates the covariance matrix for the kernel.

Parameters:	accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects. kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel.
Returns:	The covariance matrix for the kernel.
Return type:	numpy.ndarray

This method calculates the covariance matrix for your kernel. Of course, not all kernels will have covariance matrices. However, since some kernels do, it is necessary to implement this method for all kernels. If your kernel does not have a covariance matrix, simply return an empty list.

The two arguments passed to this method are the accepted parameters manager and the kernel index. An object of type AcceptedParameterManager is always initialized when an inference method object is instantiated. On this object, the accepted parameters, accepted weights, accepted covariance matrices for all kernels and other information is stored. This is such that various objects can access this information without much hassle centrally. To access any of the quantities mentioned above, you will have to call the .value() method of the corresponding quantity.

The second parameter, the kernel index, specifies the index of the kernel in the list of kernels that the inference method will in the end obtain. Since the user is expected to collect all his kernels in one object, this index will automatically be provided. You do not need any knowledge of what the index actually is. However, it is used to access the values relevant to your kernel, for example the current calculated covariance matrix for a kernel.

Let us now look at the implementation of the method:

def calculate_cov(self, accepted_parameters_manager, kernel_index):
    if(accepted_parameters_manager.accepted_weights_bds is not None):
        weights = accepted_parameters_manager.accepted_weights_bds.value()
        cov = np.cov(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index], aweights=weights.reshape(-1), rowvar=False)
    else:
        cov = np.cov(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index], rowvar=False)
    return cov

Some of the implemented inference algorithms weigh different sets of parameters differently. Therefore, if such weights are provided, we would like to weight the covariance matrix accordingly. We, therefore, check whether the accepted parameters manager contains any weights. If it does, we retrieve these weights, and calculate the covariance matrix using numpy, the parameters relevant to this kernel and the weights. If there are no weights, we simply calculate an unweighted covariance matrix.

Next, we need the method:

PerturbationKernel.update(accepted_parameters_manager, row_index, rng)[source]

Perturbs the parameters for this kernel.

Parameters:	accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects. row_index (integer) – The index of the accepted parameters bds that should be perturbed. rng (random number generator) – The random number generator to be used.
Returns:	The perturbed parameters.
Return type:	numpy.ndarray

This method perturbs the parameters that are associated with the random variables the kernel should perturb. The method again requires an accepted parameters manager and a kernel index. These have the same meaning as in the last method. In addition to this, a row index is required, as well as a random number generator. The row index specifies which set of parameters should be perturbed. There are usually multiple sets, which should be perturbed by different workers during parallelization. We, again, need not to worry about the actual value of this index.

The random number generator should be a random number generator compatible with numpy. This is due to the fact that other methods will pass their random number generator to this method, and all random number generators used within ABCpy are provided by numpy. Also, note that even if your kernel does not require a random number generator, you still need to pass this argument.

Here the implementation for our kernel:

def update(self, accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState()):
    continuous_model_values = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index]
    continuous_model_values = np.array(continuous_model_values)
    cov = accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]
    perturbed_continuous_values = rng.multivariate_normal(correctly_ordered_parameters[row_index], cov)

    return perturbed_continuous_values

The first line shows how you obtain the values of the parameters that your kernel should perturb. These values are converted to a numpy array. Then, the covariance matrix is retrieved from the accepted parameters manager using a similar function call. Finally, the parameters are perturbed and returned.

Last but not least, each kernel requires a probability density or probability mass function depending on whether it is a Continous Kernel or a Discrete Kernel:

PerturbationKernel.pdf(accepted_parameters_manager, kernel_index, row_index, x)[source]

Calculates the pdf of the kernel at point x.

Parameters:	accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects. kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel. row_index (integer) – The index of the accepted parameters bds for which the pdf should be evaluated. x (list or float) – The point at which the pdf should be evaluated.
Returns:	The pdf evaluated at point x.
Return type:	float

This method is implemented as follows for the multivariate normal:

def pdf(self, accepted_parameters_manager, kernel_index, row_index, x):
    mean = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][row_index]

    cov = accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]

    return multivariate_normal(mean, cov).pdf(x)

We simply obtain the parameter values and covariance matrix for this kernel and calculate the probability density function using SciPy.

Note that after defining your own kernel, you will need to collect all your kernels in a JointPerturbationKernel object in order for inference to work. For an example on how to do this, check the Using perturbation kernels section.

The complete example used in this tutorial can be found examples/extensions/perturbationkernels/multivariate_normal_kernel.py.