Welcome to ABCpy’s documentation!¶
Release: 0.5.1
Date: Jul 11, 2018
1. Installation¶
ABCpy requires Python 3 and is not compatible with Python 2. The simplest way to install ABCpy is via PyPI, which is the recommended method.
Installation from PyPI¶
The simplest way to install is
pip3 install abcpy
This also works in a virtual environment.
Installation from Source¶
If you prefer to work on the source, clone the repository
git clone https://github.com/eth-cscs/abcpy.git
Make sure all requirements are installed
cd abcpy
pip3 install -r requirements.txt
To create a package and install it do
make package
pip3 install build/dist/abcpy-0.5.1-py3-none-any.whl
Note that ABCpy requires Python3.
2. Getting Started¶
Here, we explain how to use ABCpy to quantify parameter uncertainty of a probabilistic model given some observed dataset. If you are new to uncertainty quantification using Approximate Bayesian Computation (ABC), we recommend you start with the Parameters as Random Variables section.
Parameters as Random Variables¶
As an example, suppose we have measurements of the heights of a group of grown-up humans, and it is known that a Gaussian distribution is an appropriate probabilistic model for this kind of observation. Then our observed dataset would be the measurements of heights, and the probabilistic model would be Gaussian.
height_obs = [160.82499176, 167.24266737, 185.71695756, 153.7045709, 163.40568812, 140.70658699, 169.59102084, 172.81041696, 187.38782738, 179.66358934, 176.63417241, 189.16082803, 181.98288443, 170.18565017, 183.78493886, 166.58387299, 161.9521899, 155.69213073, 156.17867343, 144.51580379, 170.29847515, 197.96767899, 153.36646527, 162.22710198, 158.70012047, 178.53470703, 170.77697743, 164.31392633, 165.88595994, 177.38083686, 146.67058471763457, 179.41946565658628, 238.02751620619537, 206.22458790620766, 220.89530574344568, 221.04082532837026, 142.25301427453394, 261.37656571434275, 171.63761180867033, 210.28121820385866, 237.29130237612236, 175.75558340169619, 224.54340549862235, 197.42448680731226, 165.88273684581381, 166.55094082844519, 229.54308602661584, 222.99844054358519, 185.30223966014586, 152.69149367593846, 206.94372818527413, 256.35498655339154, 165.43140916577741, 250.19273595481803, 148.87781549665536, 223.05547559193792, 230.03418198709608, 146.13611923127021, 138.24716809523139, 179.26755740864527, 141.21704876815426, 170.89587081800852, 222.96391329259626, 188.27229523693822, 202.67075179617672, 211.75963110985992, 217.45423324370509]
The Gaussian or Normal model has two parameters: the mean, denoted by \(\mu\), and the standard deviation, denoted by \(\sigma\). We consider these parameters as random variables. The goal of ABC is to quantify the uncertainty of these parameters from the information contained in the observed data.
In ABCpy, an abcpy.probabilisticmodels.ProbabilisticModel object represents a probabilistic relationship between random variables or between random variables and observed data. Each ProbabilisticModel object has a number of input parameters: they are either random variables (the output of another ProbabilisticModel object) or constant values considered known to the user (Hyperparameters). If you are interested in implementing your own probabilistic model, please check the Implementing a new Model section.
To define a parameter of a model as a random variable, you start by assigning a prior distribution to it. We can use prior knowledge about these parameters as the prior distribution. In the absence of prior knowledge, we still need to provide prior information, and a non-informative flat distribution on the parameter space can be used. The prior distribution of a random variable is assigned via a probabilistic model, which can take other random variables or hyperparameters as input.
In our Gaussian example, providing prior information is quite simple. We know from experience that the average height should be somewhere between 150cm and 200cm, while the standard deviation is around 5 to 25. In code, this would look as follows:
from abcpy.continuousmodels import Uniform
mu = Uniform([[150], [200]], name='mu')
sigma = Uniform([[5], [25]], name='sigma')
# define the model
height = Gaussian([mu, sigma], name='height')
We have defined the parameters \(\mu\) and \(\sigma\) of the Gaussian model as random variables and assigned Uniform prior distributions to them. The parameters of the prior distributions \((150, 200, 5, 25)\) are assumed to be known to the user, hence they are called hyperparameters. Internally, hyperparameters are converted to Hyperparameter objects. Note that you are also required to pass a name string when defining a random variable. In the final output, you will see these names together with the corresponding outputs.
For uncertainty quantification, we follow the philosophy of Bayesian inference. We are interested in the distribution of the parameters obtained after incorporating the information that is implicit in the observed dataset with the prior information. This target distribution is called the posterior distribution of the parameters. For inference, we draw independent and identically distributed sample values from the posterior distribution. These sampled values are called posterior samples and are used either to approximate the posterior distribution or to integrate in a Monte Carlo style with respect to the posterior distribution. The posterior samples are the ones you get as a result of applying an ABC inference scheme.
The heart of the ABC inferential algorithm is a measure of discrepancy between the observed dataset and the synthetic dataset (simulated/generated from the model). Often, computing a discrepancy measure directly between the observed and synthetic datasets is not feasible (e.g., because of the high dimensionality of the dataset, or because it is computationally too complex), and the discrepancy measure is instead defined by computing a distance between relevant summary statistics extracted from the datasets. Here we first define a way to extract summary statistics from the dataset.
from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 2, cross = False)
Next, we define the discrepancy measure between the datasets by defining a distance function (the LogReg distance is chosen here) between the extracted summary statistics. If we want to define the discrepancy measure through a distance function between the datasets directly, we choose Identity as the summary statistics, which returns the original dataset as the extracted summary statistics. In any case, the distance object automatically extracts the statistics from the datasets and then computes the distance between the two sets of statistics.
from abcpy.distances import LogReg
distance_calculator = LogReg(statistics_calculator)
Algorithms in ABCpy often require a perturbation kernel, a tool to explore the parameter space. Here, we use the default kernel provided, which explores the parameter space of random variables, by using e.g. a multivariate Gaussian distribution or by performing a random walk depending on whether the corresponding random variable is continuous or discrete. For a more involved example, please consult Complex Perturbation Kernels.
from abcpy.perturbationkernel import DefaultKernel
kernel = DefaultKernel([mu, sigma])
Finally, we need to specify a backend that determines the parallelization framework to use. The example code here uses the dummy backend BackendDummy, which does not parallelize the computation of the inference schemes but is handy for prototyping and testing. For the more advanced parallelization backends available in ABCpy, please consult the Using Parallelization Backends section.
from abcpy.backends import BackendDummy as Backend
backend = Backend()
In this example, we choose the PMCABC algorithm (inference scheme) to draw posterior samples of the parameters. Therefore, we instantiate a PMCABC object by passing the random variable corresponding to the observed dataset, the distance function, the backend object, the perturbation kernel and a seed for the random number generator.
# define sampling scheme
from abcpy.inferences import PMCABC
sampler = PMCABC([height], [distance_calculator], backend, kernel, seed=1)
Finally, we can parametrize the sampler and start sampling from the posterior distribution of the parameters given the observed dataset:
# sample from scheme
import numpy as np

T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample([height_obs], T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
The above inference scheme gives us samples from the posterior distribution of the parameters of the height model, quantifying the uncertainty of the inferred parameters; they are stored in the journal object. See Post Analysis for further information on extracting results.
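For instance, a quick look at the result could be obtained as follows (a minimal sketch; the full set of journal methods is described in Post Analysis):
# quick look at the inferred parameters (minimal sketch)
print(journal.get_parameters())   # posterior samples per parameter name
print(journal.posterior_mean())   # posterior mean of the parameters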
Note that the model and the observations are given as lists. This is due to the fact that in ABCpy, it is possible to have hierarchical models, building relationships between co-occurring groups of datasets. To learn more, see the Hierarchical Model section.
The full source can be found in examples/extensions/models/gaussian_python/pmcabc_gaussian_model_simple.py. To execute the code you only need to run
python3 pmcabc_gaussian_model_simple.py
Probabilistic Dependency between Random Variables¶
Since release 0.5.0 of ABCpy, probabilistic dependency structures (e.g., a Bayesian network) between random variables can be modelled. Behind the scenes, ABCpy represents this dependency structure as a directed acyclic graph (DAG) so that inference can be done on the full graph. Further, we can also define new random variables through operations between existing random variables. To make this concept more approachable, we now exemplify an inference problem on a probabilistic dependency structure.
Let us consider the following example: assume we have a school with different classes. Each class has some number of students. All students of the school take an exam and receive some grade. We consider the grades of the students, grades_obs, as our observed dataset.
The grades of the students depend on several variables: the grade a student would receive if there were no bias, the size of the class a student is in, and the social background of the student. The dependency structure between these variables can be defined using the following Bayesian network:
(Figure: Bayesian network describing the dependency structure of the school example.)
Here we assume that the size of the class a student belongs to and his/her social background are normally distributed with some mean and variance. However, they both also depend on the location of the school: in certain neighbourhoods, the class size might be larger, or the social background might differ. We therefore have an additional parameter, the location of the school, and the means of these normal distributions vary uniformly between some lower and upper bounds. Finally, we can assume that the grade without any bias would be a normally distributed parameter around an average grade.
We can define these random variables and the dependencies between them (in a nutshell, a Bayesian network) in ABCpy in the following way:
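A sketch of these definitions is given below; the variable names match those used later in this section, while the numerical hyperparameter values are only illustrative.
from abcpy.continuousmodels import Uniform, Normal

# Location of the school (illustrative bounds)
school_location = Uniform([[0.2], [0.3]], name='school_location')

# Class size and social background both depend on the school location
class_size = Normal([[school_location], [0.1]], name='class_size')
background = Normal([[school_location], [0.1]], name='background')

# Grade a student would receive without any additional effects (illustrative values)
grade_without_additional_effects = Normal([[4.5], [0.25]], name='grade_without_additional_effects')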
So, each student will receive some grade without additional effects, which is normally distributed, but the final grade received will be a function of the grade without additional effects and the other random variables defined beforehand (e.g., school_location, class_size and background). The model for the final grade of the students can now be written as:
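A sketch of this relationship, using the random variables defined above:
# Final grade as an operation on previously defined random variables
final_grade = grade_without_additional_effects - class_size - background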
Notice here we created a new random variable final_grade by subtracting the random variables class_size and background from the random variable grade_without_additional_effects. In short, this illustrates that you can perform the standard operations "+", "-", "*", "/" and "**" (the power operator in Python) on any two random variables to get a new random variable. It is also possible to perform these operations between a random variable and the general data types of Python (integer, float, and so on), since the latter are converted to HyperParameters.
Please keep in mind that parameters defined via operations will not be included in your list of parameters in the journal file. However, all parameters that are part of the operation, and are not fixed, will be included, so you can easily perform the required operations on the final result to get these parameters, if necessary. In addition, you can also use the [] operator (the access operator in Python). This allows you to select single values or ranges of a multidimensional random variable as a parameter of a new random variable.
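As a hedged illustration (the multivariate random variable and the numerical values here are hypothetical), accessing a single component could look like this:
# Hypothetical example of the access operator on a multidimensional random variable
from abcpy.continuousmodels import MultivariateNormal, Normal
location = MultivariateNormal([[1, 1], [[0.1, 0], [0, 0.1]]], name='location')
# Use only the first component of 'location' as the mean of a new random variable
x = Normal([[location[0]], [0.5]], name='x')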
To do inference on the Bayesian network given the observed data, we again have to define some summary statistics of our data (similar to the Parameters as Random Variables section).
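For example, mirroring the earlier definition (the degree used here is illustrative):
# Define summary statistics extracted from the observed grades
from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 2, cross = False)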
And just as before, we need to provide a distance measure, a backend and a kernel to the inference scheme of our choice, which again is PMCABC.
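A sketch of these definitions, reusing the same components as in the Getting Started example (the choice of the LogReg distance and the seed are illustrative):
# Define a distance on the extracted statistics
from abcpy.distances import LogReg
distance_calculator = LogReg(statistics_calculator)

# Define a backend
from abcpy.backends import BackendDummy as Backend
backend = Backend()

# Define a kernel perturbing all free parameters of the graph
from abcpy.perturbationkernel import DefaultKernel
kernel = DefaultKernel([school_location, class_size, grade_without_additional_effects, background])

# Define the sampler on the root model final_grade
from abcpy.inferences import PMCABC
sampler = PMCABC([final_grade], [distance_calculator], backend, kernel, seed=1)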
Finally, we can parametrize the sampler and start sampling from the posterior distribution:
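A sketch of the sampling call, assuming grades_obs holds the observed grades introduced above and using the same tuning values as before:
import numpy as np

# Sample from the posterior of the free parameters given the observed grades
T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample([grades_obs], T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)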
The source code can be found in examples/hierarchicalmodels/pmcabc_inference_on_single_set_of_obs.py and can be run as shown above.
Hierarchical Model¶
As mentioned above, ABCpy supports inference on hierarchical models. Hierarchical models can share some of the parameters of interest and model co-occurring datasets. To illustrate how this is implemented, we will follow the example model given in the Probabilistic Dependency between Random Variables section and extend it to co-occurring datasets.
(Figure: Bayesian network of the hierarchical school example with two observed datasets.)
Let us assume that, in addition to the final grade of a student, we have (co-occurring, observed) data on the final scholarships given out by the school to the students. Whether a student gets a scholarship depends on their social background and on an independent score.
scholarship_obs = [2.7179657436207805, 2.124647285937229, 3.07193407853297, 2.335024761813643, 2.871893855192, 3.4332002458233837, 3.649996835818173, 3.50292335102711, 2.815638168018455, 2.3581613289315992, 2.2794821846395568, 2.8725835459926503, 3.5588573782815685, 2.26053126526137, 1.8998143530749971, 2.101110815311782, 2.3482974964831573, 2.2707679029919206, 2.4624550491079225, 2.867017757972507, 3.204249152084959, 2.4489542437714213, 1.875415915801106, 2.5604889644872433, 3.891985093269989, 2.7233633223405205, 2.2861070389383533, 2.9758813233490082, 3.1183403287267755, 2.911814060853062, 2.60896794303205, 3.5717098647480316, 3.3355752461779824, 1.99172284546858, 2.339937680892163, 2.9835630207301636, 2.1684912355975774, 3.014847335983034, 2.7844122961916202, 2.752119871525148, 2.1567428931391635, 2.5803629307680644, 2.7326646074552103, 2.559237193255186, 3.13478196958166, 2.388760269933492, 3.2822443541491815, 2.0114405441787437, 3.0380056368041073, 2.4889680313769724, 2.821660164621084, 3.343985964873723, 3.1866861970287808, 4.4535037154856045, 3.0026333138006027, 2.0675706089352612, 2.3835301730913185, 2.584208398359566, 3.288077633446465, 2.6955853384148183, 2.918315169739928, 3.2464814419322985, 2.1601516779909433, 3.231003347780546, 1.0893224045062178, 0.8032302688764734, 2.868438615047827]
We assume the score is normally distributed and we model it as follows:
scholarship_without_additional_effects = Normal([[2], [0.5]], name='scholarship_without_additional_effects')
We model the impact of the student's social background on the scholarship as follows:
final_scholarship = scholarship_without_additional_effects + 3*background
With this, we now have two root ProbabilisticModels (random variables), namely final_grade and final_scholarship, whose output can be directly compared to the observed datasets grades_obs and scholarship_obs. With this, we are able to do an inference computation on all free parameters of the hierarchical model (of the DAG) given our observations.
To infer the uncertainty of our parameters, we follow the same steps as in our previous examples: we choose summary statistics, distance, inference scheme, backend and kernel. We will skip the definitions that have not changed from the previous section. However, we would like to point out the difference in the definition of the distance. Since we are now considering two observed datasets, we need to define a distance on each of them separately. Here, we use the Euclidean distance for each observed dataset and the corresponding simulated dataset. You can use two different distances on two different observed datasets.
# Define a summary statistics for final grade and final scholarship
from abcpy.statistics import Identity
statistics_calculator_final_grade = Identity(degree = 2, cross = False)
statistics_calculator_final_scholarship = Identity(degree = 3, cross = False)
# Define a distance measure for final grade and final scholarship
from abcpy.distances import Euclidean
distance_calculator_final_grade = Euclidean(statistics_calculator_final_grade)
distance_calculator_final_scholarship = Euclidean(statistics_calculator_final_scholarship)
Using these two distance functions, the final code looks as follows:
# Define a backend
from abcpy.backends import BackendDummy as Backend
backend = Backend()
# Define a perturbation kernel
from abcpy.perturbationkernel import DefaultKernel
kernel = DefaultKernel([school_location, class_size, grade_without_additional_effects, \
background, scholarship_without_additional_effects])
# Define sampling parameters
T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
# Define sampler
from abcpy.inferences import PMCABC
sampler = PMCABC([final_grade, final_scholarship], \
[distance_calculator_final_grade, distance_calculator_final_scholarship], backend, kernel)
# Sample
journal = sampler.sample([grades_obs, scholarship_obs], \
T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
Observe that the lists given to the sampler and to the sampling method now contain two entries. These correspond to the two different observed datasets, respectively. Also notice that we now provide two different distances corresponding to the two different root models and their observed datasets. Presently ABCpy combines the distances by a linear combination; however, customized combination strategies can be implemented by the user.
The full source code can be found in examples/hierarchicalmodels/pmcabc_inference_on_multiple_sets_of_obs.py.
Complex Perturbation Kernels¶
As pointed out earlier, it is possible to define complex perturbation kernels that perturb different random variables in different ways. Let us take the same example as in the Hierarchical Model section and assume that we want to perturb the school's location as well as the scholarship score using a multivariate normal kernel, while the remaining parameters are perturbed using a multivariate Student's t kernel. This can be implemented as follows:
from abcpy.perturbationkernel import MultivariateNormalKernel, MultivariateStudentTKernel
kernel_1 = MultivariateNormalKernel([school_location, scholarship_without_additional_effects])
kernel_2 = MultivariateStudentTKernel([class_size, background, grade_without_additional_effects], df=3)
We have now defined how each set of parameters is perturbed on its own. The sampler object, however, needs to be provided with one single kernel. We therefore provide a class which groups the above kernels together. This class, abcpy.perturbationkernel.JointPerturbationKernel, knows how to perturb each set of parameters individually. It just needs to be provided with all the relevant kernels:
from abcpy.perturbationkernel import JointPerturbationKernel
kernel = JointPerturbationKernel([kernel_1, kernel_2])
This is all that needs to be changed. The rest of the implementation works exactly the same as in the previous example. If you would like to implement your own perturbation kernel, please check Implementing a new Perturbation Kernel. Please keep in mind that you can only perturb parameters. You cannot use the access operator to perturb one component of a multi-dimensional random variable differently than another component of the same variable.
The source code for this section can be found in examples/extensions/perturbationkernels/pmcabc_perturbation_kernels.py.
Inference Schemes¶
In ABCpy, we implement widely used and advanced variants of ABC inferential schemes:
- Rejection ABC (abcpy.inferences.RejectionABC),
- Population Monte Carlo ABC (abcpy.inferences.PMCABC),
- Sequential Monte Carlo ABC (abcpy.inferences.SMCABC),
- Replenishment sequential Monte Carlo ABC (RSMC-ABC) (abcpy.inferences.RSMCABC),
- Adaptive population Monte Carlo ABC (APMC-ABC) (abcpy.inferences.APMCABC),
- ABC with subset simulation (ABCsubsim) (abcpy.inferences.ABCsubsim), and
- Simulated annealing ABC (SABC) (abcpy.inferences.SABC).
To perform ABC algorithms, we provide different standard distance functions between datasets, e.g., the Euclidean distance (abcpy.distances.Euclidean) as well as discrepancies measured by the achievable classification accuracy between two datasets (abcpy.distances.LogReg, abcpy.distances.PenLogReg).
We have also implemented the population Monte Carlo algorithm (abcpy.inferences.PMC) to infer parameters when the likelihood or an approximate likelihood function is available. For approximation of the likelihood function we provide two methods:
- Synthetic likelihood approximation (abcpy.approx_lhd.SynLiklihood), and
- a method using penalized logistic regression (abcpy.approx_lhd.PenLogReg).
Next, we explain how we can use the PMC algorithm with an approximation of the likelihood function. As we are now considering two observed datasets corresponding to two root models, we need to define an approximation of the likelihood function for each of them separately. Here, we use abcpy.approx_lhd.SynLiklihood for each of the root models. It is also possible to use two different approximate likelihoods for two different root models.
# Define a summary statistics for final grade and final scholarship
from abcpy.statistics import Identity
statistics_calculator_final_grade = Identity(degree = 2, cross = False)
statistics_calculator_final_scholarship = Identity(degree = 3, cross = False)
# Define an approximate likelihood for final grade and final scholarship
from abcpy.approx_lhd import SynLiklihood
approx_lhd_final_grade = SynLiklihood(statistics_calculator_final_grade)
approx_lhd_final_scholarship = SynLiklihood(statistics_calculator_final_scholarship)
We then parametrize the sampler and sample from the posterior distribution.
# Define sampling parameters
T, n_sample, n_samples_per_param = 3, 250, 10
# Define sampler
from abcpy.inferences import PMC
sampler = PMC([final_grade, final_scholarship], \
[approx_lhd_final_grade, approx_lhd_final_scholarship], backend, kernel)
# Sample
journal = sampler.sample([grades_obs, scholarship_obs], T, n_sample, n_samples_per_param)
Observe that the lists given to the sampler and to the sampling method now contain two entries. These correspond to the two different observed datasets, respectively. Also notice that we now provide two different approximate likelihood functions corresponding to the two different root models and their observed datasets. Presently ABCpy combines the approximate likelihoods of the root models into a single one; further possibilities of combination will be made available in later versions of ABCpy.
The source code can be found in examples/approx_lhd/pmc_hierarchical_models.py.
Summary Selection¶
As we have noticed in the Parameters as Random Variables section, the discrepancy measure between two datasets is defined by a distance function between summary statistics extracted from the datasets. Hence, the ABC algorithms are sensitive to the choice of summary statistics. This subjectivity can be reduced by a data-driven choice of summary statistics from the available summary statistics of the dataset. In ABCpy we provide a semi-automatic summary selection procedure in abcpy.summaryselections.Semiautomatic.
Taking our initial example from Parameters as Random Variables, where we model the height of humans, we can have summary statistics defined as follows:
# define statistics
from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 3, cross = True)
Then we can learn the optimized summary statistics from the given list of summary statistics using the semi-automatic summary selection procedure as follows:
# Learn the optimal summary statistics using Semiautomatic summary selection
from abcpy.summaryselections import Semiautomatic
summary_selection = Semiautomatic([height], statistics_calculator, backend,
n_samples=1000,n_samples_per_param=1, seed=1)
# Redefine the statistics function
statistics_calculator.statistics = lambda x, f2=summary_selection.transformation, \
f1=statistics_calculator.statistics: f2(f1(x))
Then we can perform the inference as before, but the distances will be computed on the newly learned summary statistics using the semi-automatic summary selection procedure.
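A hedged sketch of this step, reusing the height model, the backend and the kernel from the Getting Started example:
# Run the inference with the redefined statistics calculator
import numpy as np
from abcpy.distances import LogReg
from abcpy.inferences import PMCABC

distance_calculator = LogReg(statistics_calculator)
sampler = PMCABC([height], [distance_calculator], backend, kernel, seed=1)
journal = sampler.sample([height_obs], 3, np.array([.75]), 250, 10, 10)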
Model Selection¶
A further extension of the inferential problem is the selection of a model (M), given an observed dataset, from a set of possible models. The package also includes a parallelized version of a random forest ensemble model selection algorithm (abcpy.modelselections.RandomForest).
Let us consider an array of two models, Normal and StudentT. We want to find out which of these two models is the most suitable for the observed dataset y_obs.
# Create an array of models
from abcpy.continuousmodels import Uniform, Normal, StudentT
model_array = [None]*2
#Model 1: Gaussian
mu1 = Uniform([[150], [200]], name='mu1')
sigma1 = Uniform([[5.0], [25.0]], name='sigma1')
model_array[0] = Normal([mu1, sigma1])
#Model 2: Student t
mu2 = Uniform([[150], [200]], name='mu2')
sigma2 = Uniform([[1], [30.0]], name='sigma2')
model_array[1] = StudentT([mu2, sigma2])
We first need to initiate the Model Selection scheme, for which we need to define the summary statistics and backend:
# define statistics
from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 2, cross = False)
# define backend
from abcpy.backends import BackendDummy as Backend
backend = Backend()
# Initiate the Model selection scheme
from abcpy.modelselections import RandomForest
modelselection = RandomForest(model_array, statistics_calculator, backend, seed = 1)
Now we can choose the most suitable model for the observed dataset y_obs,
# Choose the correct model
model = modelselection.select_model(y_obs, n_samples = 100, n_samples_per_param = 1)
or compute the posterior probability of each of the models given the observed dataset.
# Compute the posterior probability of each of the models
model_prob = modelselection.posterior_probability(y_obs)
3. User Customization¶
Implementing a new Model¶
One of the standard use cases of ABCpy is to do inference on a probabilistic model that is not part of ABCpy. We now go through the details of such a scenario using the (already implemented) Gaussian generative model to explain how to implement it from scratch.
There are two scenarios for using a model: First, we want to use our probabilistic model to explain a relationship between parameters (considered random variables for inference) and observed data. This is for example the case when we want to do inference on mechanistic models that do not have a PDF. In this case, our model implementation has to derive from ProbabilisticModel, and a few abstract methods have to be defined, as for example forward_simulate().
In the second scenario, we want to use the model to build a relationship between different parameters (between different random variables). Then our model is restricted to output either continuous or discrete parameters in the form of a vector. Consequently, the model must derive from either Continuous or Discrete and implement the required abstract methods. These two classes in turn derive from ProbabilisticModel, such that the second scenario essentially extends the first.
Let us go through the implementation of the Gaussian generative model. The model has to conform to the API specified by the base class ProbabilisticModels, and thus must implement at least the following methods:
ProbabilisticModels.__init__()
ProbabilisticModels._check_input()
ProbabilisticModels._check_output()
ProbabilisticModels.forward_simulate()
ProbabilisticModels.get_output_dimension()
We want our model to work in both described scenarios, so our model also has to conform to the API of Continuous, since the model output, which is the resulting data from a forward simulation, is from a continuous domain. For completeness, here are the abstract methods defined by Continuous and Discrete:
Continuous.pdf()
Discrete.pmf()
Initializing a New Model¶
Since a Gaussian model generates continuous numbers, the newly implemented class derives from Continuous, and the header looks as follows:
class Gaussian(ProbabilisticModel, Continuous):
A good way to start implementing a new model is to define a convenient way to initialize it with its input parameters. In ABCpy all input parameters are either independent ProbabilisticModels or Hyperparameters. Thus, they should not be stored within, but rather referenced in, the model we implement. This reference is handled by the InputConnector class and must be used in our model implementation. The required procedure is to call the init function of ProbabilisticModels and pass an InputConnector object to it.
However, it would be very inconvenient to initialize our Gaussian model with an InputConnector object. We would rather like the init function to accept a list of parameters [mu, sigma], where mu is the mean and sigma is the standard deviation, which are the sole two parameters of our generative Gaussian model. So the idea is to take a convenient input and transform it into an InputConnector object that in turn can be passed to the initializer of the super class. This leads to the following implementation:
def __init__(self, parameters, name='Gaussian'):
    # We expect input of type parameters = [mu, sigma]
    if not isinstance(parameters, list):
        raise TypeError('Input of Normal model is of type list')

    if len(parameters) != 2:
        raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

    input_connector = InputConnector.from_list(parameters)
    super().__init__(input_connector, name)
First, we do some basic syntactic checks on the input that throw exceptions if unreasonable input is provided. The interesting part is the call to InputConnector.from_list: the InputConnector comes with a convenient set of factory methods that create InputConnector objects:
abcpy.probabilisticmodels.InputConnector.from_number()
abcpy.probabilisticmodels.InputConnector.from_model()
abcpy.probabilisticmodels.InputConnector.from_list()
We use the factory method from_list. The resulting InputConnector creates links between our Gaussian model and the models (or hyperparameters) that are used for mu and sigma at initialization time. For example, if mu and sigma are initialized as hyperparameters like
model = Gaussian([0, 1])
the from_list() method will automatically create two HyperParameter objects HyperParameter(0) and HyperParameter(1) and will link our current Gaussian model inputs to them. If we initialize mu and sigma with existing models like
uniform1 = Uniform([-1, 1])
uniform2 = Uniform([10, 20])
model = Gaussian([uniform1, uniform2])
the from_list() method will link our inputs to the uniform models.
Additionally, every model instance should have a unique name, which should also be passed to the init function of the super class.
Checking the Input¶
The next function we implement is _check_input, which should behave as described in the documentation:
This leads to the following implementation:
def _check_input(self, input_values):
    # Check whether input has correct type or format
    if len(input_values) != 2:
        raise ValueError('Number of parameters of Normal model must be 2.')

    # Check whether input is from correct domain
    mu = input_values[0]
    sigma = input_values[1]
    if sigma < 0:
        return False

    return True
Forward Simulation¶
At the core of our model lies the capability to forward simulate and create pseudo observations. To expose this functionality the following method has to be implemented:
A proper implementation looks as follows:
def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
    # Extract the input parameters
    mu = input_values[0]
    sigma = input_values[1]

    # Do the actual forward simulation
    vector_of_k_samples = np.array(rng.normal(mu, sigma, k))

    # Format the output to obey API
    result = [np.array([x]) for x in vector_of_k_samples]
    return result
Note that both mu and sigma are stored in the list input_values in the same order as we provided them to the InputConnector object in the init function. Further note that the output is a list of vectors, each of dimension one, even though the Gaussian generative model only produces real numbers.
Checking the Output¶
We also need to check the output of the model. This method is commonly used in case our model is used as an input for other models. When using an inference scheme that utilizes perturbation, the output of our model is slightly perturbed. We have to make sure that the perturbed output is still valid for our model. The details of implementing the method _check_output() can be found in the documentation:
Since the output of a Gaussian generative model is a single number from the full real domain, we can restrict ourselves to syntactic checks. However, one could easily imagine models for which the output is restricted to a certain domain. Then, this function should return False as soon as values are out of the desired domain.
def _check_output(self, values):
    if not isinstance(values, Number):
        raise ValueError('Output of the normal distribution is always a number.')

    # At this point values is a number (int, float); full domain for Normal is allowed
    return True
Note that implementing this method is particularly important when using the current model as input for other models, hence in the second scenario described in Implementing a new Model. In case our model should only be used in the first scenario, it is safe to omit the check and return True.
Getting the Output Dimension¶
We have to expose the dimension of the produced output of our model using the following method:
Since our model generates a single float number in one forward simulation, the implementation is straightforward:
def get_output_dimension(self):
    return 1
Note that implementing this method is particularly important when using the current model as input for other models, hence in the second scenario described in Implementing a new Model. In case our model should only be used for the first scenario, it is safe to return 1.
Calculating the Probability Density Function¶
Since our model also derives from Continuous, we also have to implement the following function that calculates the probability density function at a specific point.
As mentioned above, this is only required if one wants to use our model as input for other models. An implementation looks as follows:
def pdf(self, input_values, x):
    # Requires: from scipy.stats import norm
    mu = input_values[0]
    sigma = input_values[1]
    pdf = norm(mu, sigma).pdf(x)
    return pdf
Our model now conforms to ABCpy and we can start inferring parameters in the same way (see Getting Started) as we would do with shipped models.
Wrap a Model Written in C++¶
There are several frameworks that help you integrate your C++/C code into Python. We showcase an example using Swig below.
Using Swig¶
Swig is a tool that creates a Python wrapper for our C++/C code using an interface (file) that we have to specify. We can then import the wrapper and in turn use our C++ code with ABCpy as if it was written in Python.
We go through a complete example to illustrate how to use a simple Gaussian model written in C++ with ABCpy. First, have a look at our C++ model:
void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed) {
    boost::mt19937 rng(seed);
    boost::normal_distribution<> nd(mu, sigma);
    boost::variate_generator<boost::mt19937, boost::normal_distribution<> > sampler(rng, nd);

    for (int i = 0; i < k; ++i) {
        result[i] = sampler();
    }
}
To use this code in Python, we need to specify exactly how to expose the C++ function to Python. Therefore, we write a Swig interface file that looks as follows:
%module gaussian_model_simple
%{
    #define SWIG_FILE_WITH_INIT

    #include <iostream>
    #include <boost/random.hpp>
    #include <boost/random/normal_distribution.hpp>

    extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
%}

%include "numpy.i"

%init %{
    import_array();
%}

%apply (double* ARGOUT_ARRAY1, int DIM1) {(double* result, unsigned int k)};

extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
In the first line we define the module name we later have to import in our ABCpy Python code. Then, in curly brackets, we specify which libraries we want to include and which function we want to expose through the wrapper.
Now comes the tricky part. The model class expects a method forward_simulate that forward-simulates our model and returns an array of synthetic observations. However, C++/C does not know the concept of returning an array; instead, in C++/C we would provide a memory position (pointer) where to write the results. Swig has to translate between the two concepts. Here we actually use a Swig interface definition from numpy called import_array. The line
%apply (double* ARGOUT_ARRAY1, int DIM1) {(double* result, unsigned int k)};
states that we want the two parameters result and k of the gaussian_model C++ function to be interpreted as an array of length k that is returned. Have a look at the Python code below and observe how the wrapped Python function takes only two instead of four parameters and returns a numpy array.
The first step to get everything running is to translate the Swig interface file to wrapper code in C++ and Python.
swig -python -c++ -o gaussian_model_simple_wrap.cpp gaussian_model_simple.i
This creates two wrapper files gaussian_model_simple_wrap.cpp and gaussian_model_simple.py. Now the C++ files can be compiled:
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple.cpp -o gaussian_model_simple.o
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple_wrap.cpp -o gaussian_model_simple_wrap.o
g++ -shared gaussian_model_simple.o gaussian_model_simple_wrap.o -o _gaussian_model_simple.so
Note that the include paths might need to be adapted to your system. Finally, we can write a Python model which uses our C++ code:
from numbers import Number
import numpy as np
from scipy.stats import norm
from abcpy.probabilisticmodels import ProbabilisticModel, Continuous, InputConnector
from gaussian_model_simple import gaussian_model

class Gaussian(ProbabilisticModel, Continuous):
    def __init__(self, parameters, name='Gaussian'):
        # We expect input of type parameters = [mu, sigma]
        if not isinstance(parameters, list):
            raise TypeError('Input of Normal model is of type list')

        if len(parameters) != 2:
            raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

        input_connector = InputConnector.from_list(parameters)
        super().__init__(input_connector, name)

    def _check_input(self, input_values):
        # Check whether input has correct type or format
        if len(input_values) != 2:
            raise ValueError('Number of parameters of Normal model must be 2.')

        # Check whether input is from correct domain
        mu = input_values[0]
        sigma = input_values[1]
        if sigma < 0:
            return False

        return True

    def _check_output(self, values):
        if not isinstance(values, Number):
            raise ValueError('Output of the normal distribution is always a number.')

        # At this point values is a number (int, float); full domain for Normal is allowed
        return True

    def get_output_dimension(self):
        return 1

    def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
        # Extract the input parameters
        mu = input_values[0]
        sigma = input_values[1]
        seed = rng.randint(np.iinfo(np.int32).max)

        # Do the actual forward simulation by calling the wrapped C++ function
        vector_of_k_samples = gaussian_model(k, mu, sigma, seed)

        # Format the output to obey API
        result = [np.array([x]) for x in vector_of_k_samples]
        return result

    def pdf(self, input_values, x):
        mu = input_values[0]
        sigma = input_values[1]
        pdf = norm(mu, sigma).pdf(x)
        return pdf
The important parts are where we import the wrapper code as a module (the gaussian_model_simple import) and where we call the wrapped model function inside forward_simulate.
The full code is available in examples/extensions/models/gaussian_cpp/. To simplify compilation of the SWIG and C++ code we created a Makefile. Note that you might need to adapt some paths in the Makefile.
Wrap a Model Written in R¶
Statisticians often use the R language to build statistical models. R models can be incorporated within ABCpy with the rpy2 Python package. We show how to use the rpy2 package to connect with a model written in R.
Continuing from the previous sections we use a simple Gaussian model as an example. The following R code is the contents of the R file gaussian_model.R:
simple_gaussian <- function(mu, sigma, k = 1){
    output <- rnorm(k, mu, sigma)
    return(output)
}
More complex R models are incorporated in the same way. To include this function within ABCpy we include the following code at the beginning of our Python file:
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()

robjects.r('''
    source('gaussian_model.R')
''')

r_simple_gaussian = robjects.globalenv['simple_gaussian']
This imports the R function simple_gaussian into the Python environment. We need to build our own model to incorporate this R function, as in the previous section. The only difference is in the forward_simulate method of the Gaussian class.
vector_of_k_samples = list(r_simple_gaussian(mu, sigma, k))
The default output for R functions in Python is a float vector. This must be converted into a Python numpy array for the purposes of ABCpy.
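A hedged sketch of how the forward_simulate method of such an R-backed Gaussian model might look (numpy is assumed to be imported as np, and r_simple_gaussian is the R function imported above):
def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
    # Extract the input parameters
    mu = input_values[0]
    sigma = input_values[1]

    # Call the R function and convert the returned R float vector to a Python list
    vector_of_k_samples = list(r_simple_gaussian(mu, sigma, k))

    # Format the output to obey the ABCpy API: a list of one-dimensional numpy arrays
    result = [np.array([x]) for x in vector_of_k_samples]
    return result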
Implementing a new Distance¶
We will now explain how you can implement your own distance measure. A new distance is implemented as a new class that derives from Distance (abcpy.distances.Distance) and for which the following three methods have to be implemented:
Distance.__init__()
Distance.distance()
Distance.dist_max()
Let us first look at the initializer documentation:
Distances in ABCpy should act on summary statistics. Therefore, at initialization of a distance calculator, a statistics calculator should be provided. The following header conforms to this idea:
def __init__(self, statistics):
    self.statistics_calc = statistics

    # Since the observations do always stay the same, we can save the
    # summary statistics of them and not recalculate it each time
    self.s1 = None
    self.data_set = None
Then, we need to define how the distance is calculated. First we compute the summary statistics from the datasets, and then we compute the distance between the summary statistics. Notice that while computing the summary statistics we save the first dataset and the corresponding summary statistics. This is because we always pass the observed dataset first to the distance function; the observed dataset does not change during an inference computation, and thus it is efficient to compute its statistics once and store them internally.
def distance(self, d1, d2):
    # Extract summary statistics from the dataset
    if(self.s1 is None or self.data_set != d1):
        self.s1 = self.statistics_calc.statistics(d1)
        self.data_set = d1
    s2 = self.statistics_calc.statistics(d2)

    # compute distance between the statistics
    dist = np.zeros(shape=(self.s1.shape[0], s2.shape[0]))
    for ind1 in range(0, self.s1.shape[0]):
        for ind2 in range(0, s2.shape[0]):
            dist[ind1, ind2] = np.sqrt(np.sum(pow(self.s1[ind1, :] - s2[ind2, :], 2)))

    return dist.mean()
Finally, we need to define the maximal distance that can be obtained from this distance measure.
def dist_max(self):
    return np.inf
The newly defined distance class can be used in the same way as the already existing ones. The complete example for this tutorial can be found in examples/extensions/distances/default_distance.py.
Implementing a new Perturbation Kernel¶
To implement a new kernel, we need to implement a new class that derives from abcpy.perturbationkernel.PerturbationKernel and that implements the following abstract methods:
PerturbationKernel.__init__()
PerturbationKernel.calculate_cov()
PerturbationKernel.update()
Kernels in ABCpy can be of two types: they can either be derived from the class ContinuousKernel or from DiscreteKernel. In case a continuous kernel is required, the following method must be implemented:
def pdf(self, accepted_parameters_manager, kernel_index, index, x):
On the other hand, if the kernel is a discrete kernel, we would need the following method:
def pmf(self, accepted_parameters_manager, kernel_index, index, x):
As an example, we will implement a kernel which perturbs continuous parameters using a multivariate normal distribution (which is already implemented within ABCpy). First, we need to define a constructor.
ABCpy expects that the arguments passed to the initializer are of type ProbabilisticModel, which can be seen as the random variables that should be perturbed by this kernel. All these models should be saved on the kernel for future reference.
class MultivariateNormalKernel(PerturbationKernel, ContinuousKernel):
    def __init__(self, models):
        self.models = models
Next, we need the following method:
This method calculates the covariance matrix for your kernel. Of course, not all kernels will have covariance matrices. However, since some kernels do, it is necessary to implement this method for all kernels. If your kernel does not have a covariance matrix, simply return an empty list.
The two arguments passed to this method are the accepted parameters manager and the kernel index. An object of type AcceptedParametersManager is always initialized when an inference method object is instantiated. On this object, the accepted parameters, accepted weights, accepted covariance matrices for all kernels and other information are stored. This is so that various objects can access this information centrally without much hassle. To access any of the quantities mentioned above, you will have to call the .value() method of the corresponding quantity.
The second parameter, the kernel index, specifies the index of the kernel in the list of kernels that the inference method will in the end obtain. Since the user is expected to collect all his kernels in one object, this index will automatically be provided. You do not need any knowledge of what the index actually is. However, it is used to access the values relevant to your kernel, for example the current calculated covariance matrix for a kernel.
Let us now look at the implementation of the method:
def calculate_cov(self, accepted_parameters_manager, kernel_index):
    if(accepted_parameters_manager.accepted_weights_bds is not None):
        weights = accepted_parameters_manager.accepted_weights_bds.value()
        cov = np.cov(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index], aweights=weights.reshape(-1), rowvar=False)
    else:
        cov = np.cov(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index], rowvar=False)
    return cov
Some of the implemented inference algorithms weigh different sets of parameters differently. Therefore, if such weights are provided, we would like to weight the covariance matrix accordingly. We, therefore, check whether the accepted parameters manager contains any weights. If it does, we retrieve these weights, and calculate the covariance matrix using numpy, the parameters relevant to this kernel and the weights. If there are no weights, we simply calculate an unweighted covariance matrix.
Next, we need the method:
This method perturbs the parameters that are associated with the random variables the kernel should perturb. The method again requires an accepted parameters manager and a kernel index. These have the same meaning as in the last method. In addition to this, a row index is required, as well as a random number generator. The row index specifies which set of parameters should be perturbed. There are usually multiple sets, which should be perturbed by different workers during parallelization. We, again, need not worry about the actual value of this index.
The random number generator should be a random number generator compatible with numpy. This is due to the fact that other methods will pass their random number generator to this method, and all random number generators used within ABCpy are provided by numpy. Also, note that even if your kernel does not require a random number generator, you still need to pass this argument.
Here is the implementation for our kernel:
def update(self, accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState()):
    continuous_model_values = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index]
    continuous_model_values = np.array(continuous_model_values)
    cov = accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]
    perturbed_continuous_values = rng.multivariate_normal(continuous_model_values[row_index], cov)
    return perturbed_continuous_values
The first line shows how you obtain the values of the parameters that your kernel should perturb. These values are converted to a numpy array. Then, the covariance matrix is retrieved from the accepted parameters manager using a similar function call. Finally, the parameters are perturbed and returned.
Last but not least, each kernel requires a probability density or probability mass function, depending on whether it is a continuous kernel or a discrete kernel:
This method is implemented as follows for the multivariate normal:
def pdf(self, accepted_parameters_manager, kernel_index, row_index, x):
    mean = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][row_index]
    cov = accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]
    return multivariate_normal(mean, cov).pdf(x)
We simply obtain the parameter values and covariance matrix for this kernel and calculate the probability density function using SciPy.
Note that after defining your own kernel, you will need to collect all your kernels in a JointPerturbationKernel object in order for inference to work. For an example of how to do this, check the Complex Perturbation Kernels section.
The complete example used in this tutorial can be found in examples/extensions/perturbationkernels/multivariate_normal_kernel.py.
4. Parallelization Backends¶
Using Parallelization Backends¶
Running ABC algorithms is often computationally expensive, thus ABCpy is built with parallelization in mind. In order to run your inference schemes in parallel on multiple nodes (computers) you can choose from the following backends.
Using the Spark Backend¶
To run ABCpy in parallel using Apache Spark, one only needs to use the provided Spark backend. Considering the example from above, the statements for the backend have to be changed to
import pyspark
sc = pyspark.SparkContext()
from abcpy.backends import BackendSpark as Backend
backend = Backend(sc, parallelism=4)
In words, a Spark context has to be created and passed to the Spark backend. Additionally, the level of parallelism can be provided, which defines in a sense in how many blocks the work should be split up. It corresponds to the parallelism of an RDD in Apache Spark terminology. A good value is usually a small multiple of the total number of available cores.
The standard way to run the script on Spark is via the spark-submit command:
PYSPARK_PYTHON=python3 spark-submit pmcabc_gaussian.py
Often Spark installations use Python 2 by default. To make Spark use the required Python 3 interpreter, the PYSPARK_PYTHON environment variable can be set.
The adapted python code can be found in examples/backend/apache_spark/pmcabc_gaussian.py.
Note that in order to run jobs in parallel you need to have Apache Spark installed on the system in question. The dependencies of the Spark backend can be installed with pip install -r requirements/backend-spark.txt.
Details on the installation can be found on the official homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.
Using the MPI Backend¶
To run ABCpy in parallel using MPI, one only needs to use the provided MPI backend. Using the same example as above, the statements for the backend have to be changed to
from abcpy.backends import BackendMPI as Backend
backend = Backend()
In words, one only needs to initialize an instance of the MPI backend. The number of ranks to spawn is specified at runtime through the way the script is run. A minimum of two ranks is required, since rank 0 (the master) is used to orchestrate the calculation and all other ranks (the workers) actually perform the calculation.
The standard way to run the script using Open MPI is directly via mpirun like below or on a cluster through a job scheduler like Slurm:
mpirun -np 4 python3 pmcabc_gaussian.py
The adapted Python code can be found in examples/backend/mpi/pmcabc_gaussian.py.
Note that in order to run jobs in parallel you need to have MPI installed on the system(s) in question, with the requisite Python bindings for MPI (mpi4py). The dependencies of the MPI backend can be installed with pip install -r requirements/backend-mpi.txt.
Details on the installation can be found on the official Open MPI homepage and the mpi4py homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.
Using Cluster Infrastructure¶
When your model is computationally expensive and/or other factors require compute infrastructure that goes beyond a single notebook or workstation you can easily run ABCpy on infrastructure for cluster or high-performance computing.
Running on Amazon Web Services¶
We show with high-level steps how to get ABCpy running on Amazon Web Services (AWS). Please note that this is not a complete guide to AWS, so we would like to refer you to the respective documentation. The first step is to set up an AWS Elastic Map Reduce (EMR) cluster, which comes with the option of a pre-configured Apache Spark. Then, we show how to run a simple inference code on this cluster.
Setting up the EMR Cluster¶
When we set up an EMR cluster we want to install ABCpy on every node of the cluster. Therefore, we provide a bootstrap script that does this job for us. On your local machine create a file named emr_bootstrap.sh with the following content:
#!/bin/sh
sudo yum -y install git
sudo pip-3.4 install ipython findspark abcpy
In AWS go to Services, then S3 under the Storage Section. Create a new bucket called abcpy and upload your bootstrap script emr_bootstrap.sh.
To create a cluster, in AWS go to Services and then EMR under the Analytics Section. Click ‘Create Cluster’, then choose ‘Advanced Options’. In Step 1 choose the emr-5.7.0 image and make sure only Spark is selected for your cluster (the other software packages are not required). In Step 2 choose for example one master node and 4 core nodes (16 vCPUs if you have 4-vCPU instances). In Step 3, under the bootstrap action, choose custom and select the script abcpy/emr_bootstrap.sh. In the last step (Step 4), choose a key to access the master node (we assume that you have already set up keys). Start the cluster.
Running ABCpy on AWS¶
Log in via SSH and run the following commands to get an example code from ABCpy running with Python3 support:
sudo bash -c 'echo export PYSPARK_PYTHON=python34 >> /etc/spark/conf/spark-env.sh'
git clone https://github.com/eth-cscs/abcpy.git
Then, to submit a job to the Spark cluster we run the following commands:
cd abcpy/examples/backends/
spark-submit --num-executors 16 pmcabc_gaussian.py
Clearly the setup can be extended and optimized. For this and basic information we refer you to the AWS documentation on EMR.
5. Post Analysis¶
The output of an inference scheme is a Journal (abcpy.output.Journal), which holds all the necessary results and convenient methods to do the post analysis.
For example, one can easily access the sampled parameters and corresponding weights using:
journal.get_parameters()
journal.get_weights()
The output of get_parameters() is a Python dictionary. The keys for this dictionary are the names you specified for the parameters. The corresponding values are the marginal posterior samples of that parameter. Here is a short example of what you would specify, and what would be the output in the end:
a = Normal([[1],[0.1]], name='parameter_1')
b = MultivariateNormal([[1,1],[[0.1,0],[0,0.1]]], name='parameter_2')
If one defined a model with these two parameters as inputs and n_sample=2, the following would be the output of journal.get_parameters():
{'parameter_1' : [[0.95],[0.97]], 'parameter_2': [[0.98,1.03],[1.06,0.92]]}
These are the samples at the final step of the ABC algorithm. If you want samples from the earlier steps, you can get a Python dictionary for that step by using:
journal.get_parameters(step_number)
Since this is a dictionary, you can also access the values for each step as:
journal.get_parameters(step_number)["name"]
For the post analysis, basic functions are provided:
# do post analysis
journal.posterior_mean()
journal.posterior_cov()
journal.posterior_histogram()
Also, to ensure reproducibility, every journal stores the parameters of the algorithm that created it:
print(journal.configuration)
And certainly, a journal can easily be saved to and loaded from disk:
journal.save("experiments.jnl")
new_journal = Journal.fromFile('experiments.jnl')
Deploy a new Release¶
This documentation is mainly intended for the core developers. The deployment of new releases is automated using Travis CI. However, there are still a few manual steps required in order to deploy a new release. Assume we want to deploy the new version M.m.b:
- Create a release branch release-M.m.b
- Adapt the VERSION file in the repository's root directory: echo M.m.b > VERSION
- Adapt README.md file: adapt links to correct version of User Documentation and Reference
- Merge all desired feature branches into the release branch
- Create a pull/merge request: release branch -> master
After a successful merge:
- Create tag vM.m.b (git tag vM.m.b)
- Retag tag stable to the current version
- Push the tag (git push --tags)
- Create a release in GitHub
The new tag on master will signal Travis to deploy a new package to PyPI, while the GitHub release is just for user documentation.
abcpy package¶
This reference gives details about the API of the modules, classes and functions included in ABCpy.