Welcome to ABCpy’s documentation!¶
Release: 0.6.0
Date: Jan 21, 2021
1. Installation¶
ABCpy requires Python3 and is not compatible with Python2. The simplest way to install ABCpy is via PyPI, and we recommend using this method.
Installation from PyPI¶
Simplest way to install
pip3 install abcpy
This also works in a virtual environment.
Installation from Source¶
If you prefer to work on the source, clone the repository
git clone https://github.com/eth-cscs/abcpy.git
Make sure all requirements are installed
cd abcpy
pip3 install -r requirements.txt
To create a package and install it, do
make package
pip3 install wheel
pip3 install build/dist/abcpy-0.6.1-py3-none-any.whl
Note that wheel is required to install in this way.
Note that ABCpy requires Python3.
Requirements¶
Basic requirements are listed in requirements.txt in the repository. These also include the packages required for MPI parallelization, which is very often used. However, we also provide support for parallelization with Apache Spark (see below).
Additional packages are required for additional features:
torch is needed in order to use neural networks to learn summary statistics. It can be installed by running:
pip install -r requirements/neural_networks_requirements.txt
In order to use Apache Spark for parallelization, findspark and pyspark are required; install them by:
pip install -r requirements/backend-spark.txt
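As a quick sanity check (a minimal sketch, assuming Spark itself is installed and SPARK_HOME is set), the following should run without errors once the two packages are installed:
import findspark
findspark.init()  # locate the local Spark installation so that pyspark can be imported

import pyspark
print(pyspark.__version__)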
Troubleshooting mpi4py installation¶
mpi4py requires a working MPI implementation to be installed; check the official docs for more info. On Ubuntu, that can be installed with:
sudo apt-get install libopenmpi-dev
Even when that is present, running pip install mpi4py can sometimes lead to errors. In fact, as specified in the official docs, the mpicc compiler needs to be in the search path. If that is not the case, a workaround is:
env MPICC=/path/to/mpicc pip install mpi4py
In some cases, even the above may not be enough. A possibility is using conda (conda install mpi4py), which usually handles package dependencies better than pip. Alternatively, you can try installing mpi4py directly from the system package manager; on Ubuntu, you can do:
sudo apt install python3-mpi4py
which, however, does not work with virtual environments.
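A minimal way to verify that mpi4py works (a hedged sketch, not part of ABCpy) is to save the following as check_mpi.py and run it with mpirun -np 2 python3 check_mpi.py; each rank should print its own rank number:
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank", comm.Get_rank(), "of", comm.Get_size())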
2. Getting Started¶
Here, we explain how to use ABCpy to quantify parameter uncertainty of a probabilistic model given some observed dataset. If you are new to uncertainty quantification using Approximate Bayesian Computation (ABC), we recommend starting with the Parameters as Random Variables section.
Moreover, we also provide an interactive notebook on Binder that guides you through the basics of ABC with ABCpy, without installing anything on your machine. Please find it here.
Parameters as Random Variables¶
As an example, if we have measurements of the height of a group of grown-up humans and it is also known that a Gaussian distribution is an appropriate probabilistic model for this kind of observations, then our observed dataset would be the measurements of heights and the probabilistic model would be Gaussian.
# define observation for true parameters mean=170, std=15
height_obs = [160.82499176, 167.24266737, 185.71695756, 153.7045709, 163.40568812, 140.70658699, 169.59102084,
172.81041696, 187.38782738, 179.66358934, 176.63417241, 189.16082803, 181.98288443, 170.18565017,
183.78493886, 166.58387299, 161.9521899, 155.69213073, 156.17867343, 144.51580379, 170.29847515,
197.96767899, 153.36646527, 162.22710198, 158.70012047, 178.53470703, 170.77697743, 164.31392633,
165.88595994, 177.38083686, 146.67058471763457, 179.41946565658628, 238.02751620619537,
206.22458790620766, 220.89530574344568, 221.04082532837026, 142.25301427453394, 261.37656571434275,
171.63761180867033, 210.28121820385866, 237.29130237612236, 175.75558340169619, 224.54340549862235,
197.42448680731226, 165.88273684581381, 166.55094082844519, 229.54308602661584, 222.99844054358519,
185.30223966014586, 152.69149367593846, 206.94372818527413, 256.35498655339154, 165.43140916577741,
250.19273595481803, 148.87781549665536, 223.05547559193792, 230.03418198709608, 146.13611923127021,
138.24716809523139, 179.26755740864527, 141.21704876815426, 170.89587081800852, 222.96391329259626,
188.27229523693822, 202.67075179617672, 211.75963110985992, 217.45423324370509]
# define the model
from abcpy.continuousmodels import Normal as Gaussian
height = Gaussian([mu, sigma], name='height')
The Gaussian or Normal model has two parameters: the mean, denoted by \(\mu\), and the standard deviation, denoted by \(\sigma\). We consider these parameters as random variables. The goal of ABC is to quantify the uncertainty of these parameters from the information contained in the observed data.
In ABCpy, an abcpy.probabilisticmodels.ProbabilisticModel object represents a probabilistic relationship between random variables, or between random variables and observed data. Each ProbabilisticModel object has a number of input parameters: they are either random variables (the output of another ProbabilisticModel object) or constant values considered known to the user (Hyperparameters). If you are interested in implementing your own probabilistic model, please check the implementing a new model section.
To define a parameter of a model as a random variable, you start by assigning a prior distribution to it. We can utilize prior knowledge about these parameters as the prior distribution. In the absence of prior knowledge, we still need to provide prior information, and a non-informative flat distribution on the parameter space can be used. The prior distribution on a random variable is assigned by a probabilistic model, which can take other random variables or hyperparameters as input.
In our Gaussian example, providing prior information is quite simple. We know from experience that the average height should be somewhere between 150cm and 200cm, while the standard deviation is around 5 to 25. In code, this would look as follows:
# define prior
from abcpy.continuousmodels import Uniform
mu = Uniform([[150], [200]], name="mu")
sigma = Uniform([[5], [25]], name="sigma")
# define the model
from abcpy.continuousmodels import Normal as Gaussian
height = Gaussian([mu, sigma], name='height')
We have defined the parameters \(\mu\) and \(\sigma\) of the Gaussian model as random variables and assigned Uniform prior distributions to them. The parameters of the prior distribution \((150, 200, 5, 25)\) are assumed to be known to the user, hence they are called hyperparameters. Internally, the hyperparameters are converted to Hyperparameter objects. Note that you are also required to pass a name string while defining a random variable. In the final output, you will see these names together with the relevant outputs corresponding to them.
For uncertainty quantification, we follow the philosophy of Bayesian inference. We are interested in the distribution of the parameters obtained after incorporating the information that is implicit in the observed dataset with the prior information. This target distribution is called the posterior distribution of the parameters. For inference, we draw independent and identically distributed sample values from the posterior distribution. These sampled values are called posterior samples and are used either to approximate the posterior distribution or to integrate in a Monte Carlo style w.r.t. the posterior distribution. The posterior samples are the ones you get as a result of applying an ABC inference scheme.
The heart of the ABC inferential algorithm is a measure of discrepancy between the observed dataset and the synthetic dataset (simulated/generated from the model). Often, computing a discrepancy measure directly between the observed and synthetic datasets is not feasible (e.g., due to high dimensionality of the dataset, or because it is computationally too complex), and the discrepancy measure is instead defined via a distance between relevant summary statistics extracted from the datasets. Here we first define a way to extract summary statistics from the dataset.
# define statistics
from abcpy.statistics import Identity
statistics_calculator = Identity(degree=2, cross=False)
Next we define the discrepancy measure between the datasets by defining a distance function (the LogReg distance is chosen here) between the extracted summary statistics. If we want to define the discrepancy measure through a distance function between the datasets directly, we choose Identity as summary statistics, which returns the original dataset as the extracted summary statistics. This essentially means that the distance object automatically extracts the statistics from the datasets and then computes the distance between the two statistics.
# define distance
from abcpy.distances import LogReg
distance_calculator = LogReg(statistics_calculator, seed=42)
Algorithms in ABCpy often require a perturbation kernel, a tool to explore the parameter space. Here, we use the default kernel provided, which explores the parameter space of random variables, by using e.g. a multivariate Gaussian distribution or by performing a random walk depending on whether the corresponding random variable is continuous or discrete. For a more involved example, please consult Composite Perturbation Kernels.
# define kernel
from abcpy.perturbationkernel import DefaultKernel
kernel = DefaultKernel([mu, sigma])
Finally, we need to specify a backend that determines the parallelization framework to use. The example code here uses the dummy backend BackendDummy, which does not parallelize the computation of the inference schemes but is handy for prototyping and testing. For more advanced parallelization backends available in ABCpy, please consult the Using Parallelization Backends section.
# define backend
# Note, the dummy backend does not parallelize the code!
from abcpy.backends import BackendDummy as Backend
backend = Backend()
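As a hedged sketch (not used in this example), the MPI backend is essentially a drop-in replacement for the dummy backend; the script then has to be launched through mpirun:
# drop-in replacement for BackendDummy; requires mpi4py and launching the script via mpirun
from abcpy.backends import BackendMPI as Backend
backend = Backend()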
In this example, we choose the PMCABC algorithm (inference scheme) to draw posterior samples of the parameters. Therefore, we instantiate a PMCABC object by passing the random variable corresponding to the observed dataset, the distance function, the backend object, the perturbation kernel and a seed for the random number generator.
# define sampling scheme
from abcpy.inferences import PMCABC
sampler = PMCABC([height], [distance_calculator], backend, kernel, seed=1)
Finally, we can parametrize the sampler and start sampling from the posterior distribution of the parameters given the observed dataset:
# sample from scheme
import numpy as np

steps, n_sample, n_samples_per_param = 3, 250, 10  # illustrative values for the sampler hyperparameters
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample([height_obs], steps, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
The above inference scheme gives us samples from the posterior distribution of the parameter of interest height quantifying the uncertainty of the inferred parameter, which are stored in the journal object. See Post Analysis for further information on extracting results.
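As a hedged teaser of what the Post Analysis section covers, the journal object typically exposes the accepted samples and simple summaries along these lines:
# sketch: inspect the inference results stored in the journal
print(journal.posterior_mean())     # posterior means of 'mu' and 'sigma'
samples = journal.get_parameters()  # dictionary mapping parameter names to accepted values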
Note that the model and the observations are given as a list. This is due to the fact that in ABCpy, it is possible to have hierarchical models, building relationships between co-occurring groups of datasets. To learn more, see the Hierarchical Model section.
The full source can be found in examples/extensions/models/gaussian_python/pmcabc_gaussian_model_simple.py. To execute the code you only need to run
python3 pmcabc_gaussian_model_simple.py
Probabilistic Dependency between Random Variables¶
Since release 0.5.0 of ABCpy, probabilistic dependency structures (e.g., a Bayesian network) between random variables can be modelled. Behind the scenes, ABCpy represents this dependency structure as a directed acyclic graph (DAG), such that inference can be done on the full graph. Further, we can also define new random variables through operations between existing random variables. To make this concept more approachable, we now exemplify an inference problem on a probabilistic dependency structure.
Students of a school took an exam and received some grade. The observed grades of the students are:
grades_obs = [3.872486707973337, 4.6735380808674405, 3.9703538990858376, 4.11021272048805, 4.211048655421368,
4.154817956586653, 4.0046893064392695, 4.01891381384729, 4.123804757702919, 4.014941267301294,
3.888174595940634, 4.185275142948246, 4.55148774469135, 3.8954427675259016, 4.229264035335705,
3.839949451328312, 4.039402553532825, 4.128077814241238, 4.361488645531874, 4.086279074446419,
4.370801602256129, 3.7431697332475466, 4.459454162392378, 3.8873973643008255, 4.302566721487124,
4.05556051626865, 4.128817316703757, 3.8673704442215984, 4.2174459453805015, 4.202280254493361,
4.072851400451234, 3.795173229398952, 4.310702877332585, 4.376886328810306, 4.183704734748868,
4.332192463368128, 3.9071312388426587, 4.311681374107893, 3.55187913252144, 3.318878360783221,
4.187850500877817, 4.207923106081567, 4.190462065625179, 4.2341474252986036, 4.110228694304768,
4.1589891480847765, 4.0345604687633045, 4.090635481715123, 3.1384654393449294, 4.20375641386518,
4.150452690356067, 4.015304457401275, 3.9635442007388195, 4.075915739179875, 3.5702080541929284,
4.722333310410388, 3.9087618197155227, 4.3990088006390735, 3.968501165774181, 4.047603645360087,
4.109184340976979, 4.132424805281853, 4.444358334346812, 4.097211737683927, 4.288553086265748,
3.8668863066511303, 3.8837108501541007]
which depend on several variables: whether there was a bias, the average size of the classes, and the number of teachers at the school. Here we assume the average class size and the number of teachers at the school are normally distributed with some mean depending on the budget of the school, and with variance 1. We further assume that the budget of the school is uniformly distributed between 1 and 10 million US dollars. Finally, we assume that the grade without any bias would be a normally distributed parameter around an average grade. The dependency structure between these variables can be defined using the following Bayesian network:
[Figure: Bayesian network of the school example]
We can define these random variables and the dependencies between them in ABCpy in the following way:
from abcpy.continuousmodels import Uniform, Normal
school_budget = Uniform([[1], [10]], name='school_budget')
class_size = Normal([[800 * school_budget], [1]], name='class_size')
no_teacher = Normal([[20 * school_budget], [1]], name='no_teacher')
grade_without_additional_effects = Normal([[4.5], [0.25]], name='grade_without_additional_effects')
So, each student will receive some grade without additional effects, which is normally distributed, but the final grade received will be a function of the grade without additional effects and the other random variables defined beforehand (e.g., school_budget, class_size and no_teacher). The model for the final grade of the students can now be written as:
final_grade = grade_without_additional_effects - .001 * class_size + .02 * no_teacher
Notice here we created a new random variable final_grade from the random variable grade_without_additional_effects by subtracting the random variable class_size multiplied by 0.001 and adding the random variable no_teacher multiplied by 0.02. In short, this illustrates that you can perform the standard operations "+", "-", "*", "/" and "**" (the power operator in Python) on any two random variables to get a new random variable. It is also possible to perform these operations between random variables and the general data types of Python (integer, float, and so on), since the latter are converted to HyperParameters.
Please keep in mind that parameters defined via operations will not be included in your list of parameters in the journal file. However, all parameters that are part of the operation, and are not fixed, will be included, so you can easily perform the required operations on the final result to obtain these parameters, if necessary. In addition, you can also use the [] operator (the access operator in Python). This allows you to select single values or ranges of a multidimensional random variable as a parameter of a new random variable.
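As a minimal sketch (the variable names are hypothetical and not part of the tutorial), selecting one component of a two-dimensional random variable as the mean of a new one could look as follows:
# two-dimensional uniform random variable; its first component is used as a mean below
bivariate = Uniform([[0, 0], [1, 10]], name='bivariate')
derived = Normal([[bivariate[0]], [0.5]], name='derived')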
Hierarchical Model¶
ABCpy also supports inference when co-occurring datasets are available. To illustrate how this is implemented, we consider the example from the Probabilistic Dependency between Random Variables section and extend it to co-occurring datasets: in addition to the final grade of a student, we also have data on the final scholarships given out by the school to the students.
[Figure: extended Bayesian network including the scholarship variables]
Whether a student gets a scholarship depends on the number of teachers in the school and on an independent score. Assuming the score is normally distributed, we can model the impact of the student's social background on the scholarship as follows:
scholarship_obs = [2.7179657436207805, 2.124647285937229, 3.07193407853297, 2.335024761813643, 2.871893855192,
3.4332002458233837, 3.649996835818173, 3.50292335102711, 2.815638168018455, 2.3581613289315992,
2.2794821846395568, 2.8725835459926503, 3.5588573782815685, 2.26053126526137, 1.8998143530749971,
2.101110815311782, 2.3482974964831573, 2.2707679029919206, 2.4624550491079225, 2.867017757972507,
3.204249152084959, 2.4489542437714213, 1.875415915801106, 2.5604889644872433, 3.891985093269989,
2.7233633223405205, 2.2861070389383533, 2.9758813233490082, 3.1183403287267755,
2.911814060853062, 2.60896794303205, 3.5717098647480316, 3.3355752461779824, 1.99172284546858,
2.339937680892163, 2.9835630207301636, 2.1684912355975774, 3.014847335983034, 2.7844122961916202,
2.752119871525148, 2.1567428931391635, 2.5803629307680644, 2.7326646074552103, 2.559237193255186,
3.13478196958166, 2.388760269933492, 3.2822443541491815, 2.0114405441787437, 3.0380056368041073,
2.4889680313769724, 2.821660164621084, 3.343985964873723, 3.1866861970287808, 4.4535037154856045,
3.0026333138006027, 2.0675706089352612, 2.3835301730913185, 2.584208398359566, 3.288077633446465,
2.6955853384148183, 2.918315169739928, 3.2464814419322985, 2.1601516779909433, 3.231003347780546,
1.0893224045062178, 0.8032302688764734, 2.868438615047827]
scholarship_without_additional_effects = Normal([[2], [0.5]], name='schol_without_additional_effects')
final_scholarship = scholarship_without_additional_effects + .03 * no_teacher
With this we now have two root ProbabilisticModels (random variables), namely final_grade and final_scholarship, whose outputs can be directly compared to the observed datasets grades_obs and scholarship_obs. With this we are able to do inference on all free parameters of the hierarchical model (of the DAG) given our observations.
To infer uncertainty of our parameters, we follow the same steps as in our previous examples: We choose summary statistics, distance, inference scheme, backend and kernel. We will skip the definitions that have not changed from the previous section. However, we would like to point out the difference in definition of the distance. Since we are now considering two observed datasets, we need to define a distance on each one of them separately. Here, we use the Euclidean distance for each observed data set and corresponding simulated dataset. You can use two different distances on two different observed datasets.
# Define a summary statistics for final grade and final scholarship
from abcpy.statistics import Identity
statistics_calculator_final_grade = Identity(degree=2, cross=False)
statistics_calculator_final_scholarship = Identity(degree=3, cross=False)
# Define a distance measure for final grade and final scholarship
from abcpy.distances import Euclidean
distance_calculator_final_grade = Euclidean(statistics_calculator_final_grade)
distance_calculator_final_scholarship = Euclidean(statistics_calculator_final_scholarship)
Using these two distance functions, the final code looks as follows:
# Define a backend
from abcpy.backends import BackendDummy as Backend
backend = Backend()
# Define a perturbation kernel
from abcpy.perturbationkernel import DefaultKernel
kernel = DefaultKernel([school_budget, class_size, grade_without_additional_effects,
no_teacher, scholarship_without_additional_effects])
import numpy as np

steps, n_sample, n_samples_per_param = 3, 250, 10  # illustrative values for the sampler hyperparameters
eps_arr = np.array([30])  # starting value of epsilon; the smaller, the slower the algorithm.
# at each iteration, take as epsilon the epsilon_percentile of the distances obtained by simulations at previous
# iteration from the observation
epsilon_percentile = 10
# Define sampler; note here how the two models are passed in a list, as well as the two corresponding distance
# calculators
from abcpy.inferences import PMCABC
sampler = PMCABC([final_grade, final_scholarship],
[distance_calculator_final_grade, distance_calculator_final_scholarship], backend, kernel, seed=1)
# Sample; again, here we pass the two sets of observations in a list
journal = sampler.sample([grades_obs, scholarship_obs],
steps, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)
Observe that the lists given to the sampler and the sampling method now contain two entries. These correspond to the two different observed data sets, respectively. Also notice that we now provide two different distances corresponding to the two different root models and their observed datasets. Presently, ABCpy combines the distances by a linear combination; however, customized combination strategies can be implemented by the user.
The full source code can be found in examples/hierarchicalmodels/pmcabc_inference_on_multiple_sets_of_obs.py.
Composite Perturbation Kernels¶
As pointed out earlier, it is possible to define composite perturbation kernels, perturbing different random variables in different ways. Let us take the same example as in the Hierarchical Model section and assume that we want to perturb the school budget, the grade without additional effects and the scholarship without additional effects using a multivariate normal kernel, while the remaining parameters are perturbed using a multivariate Student's t kernel. This can be implemented as follows:
from abcpy.perturbationkernel import MultivariateNormalKernel, MultivariateStudentTKernel
kernel_1 = MultivariateNormalKernel(
[school_budget, scholarship_without_additional_effects, grade_without_additional_effects])
kernel_2 = MultivariateStudentTKernel([class_size, no_teacher], df=3)
We have now defined how each set of parameters is perturbed on its own. The sampler object, however, needs to be provided with one single kernel. We therefore provide a class which groups the above kernels together. This class, abcpy.perturbationkernel.JointPerturbationKernel, knows how to perturb each set of parameters individually. It just needs to be provided with all the relevant kernels:
# Join the defined kernels
from abcpy.perturbationkernel import JointPerturbationKernel
kernel = JointPerturbationKernel([kernel_1, kernel_2])
This is all that needs to be changed. The rest of the implementation works the exact same as in the previous example. If you would like to implement your own perturbation kernel, please check Implementing a new Perturbation Kernel. Please keep in mind that you can only perturb parameters. You cannot use the access operator to perturb one component of a multi-dimensional random variable differently than another component of the same variable.
The source code for this section can be found in examples/extensions/perturbationkernels/pmcabc_perturbation_kernels.py.
Inference Schemes¶
In ABCpy, we implement widely used and advanced variants of ABC inferential schemes (a minimal usage sketch for one of them follows the list):
- Rejection ABC abcpy.inferences.RejectionABC,
- Population Monte Carlo ABC abcpy.inferences.PMCABC,
- Sequential Monte Carlo ABC abcpy.inferences.SMCABC,
- Replenishment sequential Monte Carlo ABC (RSMC-ABC) abcpy.inferences.RSMCABC,
- Adaptive population Monte Carlo ABC (APMC-ABC) abcpy.inferences.APMCABC,
- ABC with subset simulation (ABCsubsim) abcpy.inferences.ABCsubsim, and
- Simulated annealing ABC (SABC) abcpy.inferences.SABC.
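All of these schemes are instantiated and used analogously to PMCABC in the Getting Started section. As a hedged sketch (argument names may differ slightly between versions; consult the API reference), the simplest scheme, RejectionABC, would be used roughly as follows with the model, distance and backend defined there:
# sketch: rejection ABC reusing height, distance_calculator and backend from Getting Started
from abcpy.inferences import RejectionABC

sampler = RejectionABC([height], [distance_calculator], backend, seed=1)
journal = sampler.sample([height_obs], n_samples=250, n_samples_per_param=10, epsilon=20)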
To perform ABC algorithms, we provide different standard distance functions between datasets, e.g., a discrepancy measured by the achievable classification accuracy between two datasets.
We also have implemented the population Monte Carlo abcpy.inferences.PMC algorithm to infer parameters when the likelihood or an approximate likelihood function is available. For approximation of the likelihood function we provide two methods:
- Synthetic likelihood approximation abcpy.approx_lhd.SynLikelihood, and
- penalized logistic regression abcpy.approx_lhd.PenLogReg.
Next we explain how we can use the PMC algorithm with approximations of the likelihood functions. As we are now considering two observed datasets corresponding to two root models, we need to define an approximation of the likelihood function for each of them separately. Here, we use abcpy.approx_lhd.SynLikelihood for each of the root models. It is also possible to use two different approximate likelihoods for two different root models.
# Define a summary statistics for final grade and final scholarship
from abcpy.statistics import Identity
statistics_calculator_final_grade = Identity(degree=2, cross=False)
statistics_calculator_final_scholarship = Identity(degree=3, cross=False)
# Define an approximate likelihood for final grade and final scholarship
from abcpy.approx_lhd import SynLikelihood
approx_lhd_final_grade = SynLikelihood(statistics_calculator_final_grade)
approx_lhd_final_scholarship = SynLikelihood(statistics_calculator_final_scholarship)
We then parametrize the sampler and sample from the posterior distribution.
# Define sampler to use with the
from abcpy.inferences import PMC
sampler = PMC([final_grade, final_scholarship],
[approx_lhd_final_grade, approx_lhd_final_scholarship], backend, kernel, seed=1)
# Sample
journal = sampler.sample([grades_obs, scholarship_obs], steps, n_sample, n_samples_per_param)
Observe that the lists given to the sampler and the sampling method now contain two entries. These correspond to the two different observed data sets, respectively. Also notice that we now provide two different approximate likelihoods corresponding to the two different root models and their observed datasets. Presently, ABCpy combines them by multiplication; further possibilities of combination will be made available in later versions of ABCpy.
The source code can be found in examples/approx_lhd/pmc_hierarchical_models.py.
Statistics Learning¶
As we have discussed in the Parameters as Random Variables section, the discrepancy measure between two datasets is defined by a distance function between summary statistics extracted from the datasets. Hence, the ABC algorithms are sensitive to the choice of summary statistics. This subjectivity can be reduced by a data-driven choice of summary statistics from the available summary statistics of the dataset. In ABCpy we provide several statistics learning procedures, implemented in the subclasses of abcpy.statisticslearning.StatisticsLearning, that generate a training dataset of (parameter, sample) pairs and use it to learn a transformation of the data that will be used in the inference step. The following techniques are available:
- Semiautomatic abcpy.statisticslearning.Semiautomatic,
- SemiautomaticNN abcpy.statisticslearning.SemiautomaticNN,
- ContrastiveDistanceLearning abcpy.statisticslearning.ContrastiveDistanceLearning,
- TripletDistanceLearning abcpy.statisticslearning.TripletDistanceLearning.
The first two build a transformation that approximates the parameter that generated the corresponding observation, the first one by using a linear regression approach and the second one by using a neural network embedding. The other two use instead neural networks to learn an embedding of the data so that the distance between the embeddings is close to the distance between the parameter values that generated the data.
We remark that the techniques using neural networks require Pytorch to be installed. As this is an optional feature, Pytorch is not in the list of dependencies of ABCpy. Rather, when one of the neural-network-based routines is called, ABCpy checks whether Pytorch is present and, if not, asks the user to install it. Moreover, unless the user wants to provide a specific form of neural network, the neural-network-based implementations do not require any explicit neural network coding, handling all the necessary definitions and training internally.
It is also possible to provide a validation set to the neural-network-based training routines in order to monitor the loss on fresh samples. This can be done, for instance, by setting the parameter n_samples_val to a value larger than 0. Moreover, it is possible to use the validation set for early stopping, i.e., stopping the training as soon as the loss on the validation set starts increasing. You can do this by setting early_stopping=True. Please refer to the API documentation (for instance abcpy.statisticslearning.SemiautomaticNN) for further details.
We note finally that the statistics learning techniques are applied after computing a first set of statistics; therefore, the learned transformation will be applied on top of the original set of statistics. For instance, consider our initial example from Parameters as Random Variables where we model the height of humans. The original summary statistics were defined as follows:
# define statistics
from abcpy.statistics import Identity
statistics_calculator = Identity(degree=3, cross=True)
Then we can learn the optimized summary statistics from the above list of summary statistics using the semi-automatic summary selection procedure as follows:
# Learn the optimal summary statistics using Semiautomatic summary selection
from abcpy.statisticslearning import Semiautomatic
statistics_learning = Semiautomatic([height], statistics_calculator, backend,
n_samples=1000, n_samples_per_param=1, seed=1)
# Redefine the statistics function
new_statistics_calculator = statistics_learning.get_statistics()
We remark that the minimal amount of coding needed for using the neural network based regression does not change at all:
# Learn the optimal summary statistics using SemiautomaticNN summary selection;
# we use 200 samples as a validation set for early stopping:
from abcpy.statisticslearning import SemiautomaticNN
statistics_learning = SemiautomaticNN([height], statistics_calculator, backend,
n_samples=1000, n_samples_val=200,
n_samples_per_param=1, seed=1, early_stopping=True)
# Redefine the statistics function
new_statistics_calculator = statistics_learning.get_statistics()
And similarly for the other two approaches.
We can then perform the inference as before, but the distances will be computed on the newly learned summary statistics.
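For instance, a hedged sketch reusing the Euclidean distance introduced earlier:
# plug the learned statistics into a distance and run the inference scheme as before
from abcpy.distances import Euclidean
distance_calculator = Euclidean(new_statistics_calculator)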
Model Selection¶
A further extension of the inferential problem is the selection of a model (M), given an observed dataset, from a set of possible models. The package also includes a parallelized version of a random forest ensemble model selection algorithm (abcpy.modelselections.RandomForest).
Let us consider an array of two models, Normal and StudentT. We want to find out which of these two models is the most suitable one for the observed dataset y_obs.
# Create an array of models
from abcpy.continuousmodels import Uniform, Normal, StudentT
model_array = [None] * 2
# Model 1: Gaussian
mu1 = Uniform([[150], [200]], name='mu1')
sigma1 = Uniform([[5.0], [25.0]], name='sigma1')
model_array[0] = Normal([mu1, sigma1])
# Model 2: Student t
mu2 = Uniform([[150], [200]], name='mu2')
sigma2 = Uniform([[1], [30.0]], name='sigma2')
model_array[1] = StudentT([mu2, sigma2])
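Note that the observed dataset y_obs is assumed to be already available; as a hedged example, it could simply be the height measurements from the Getting Started section:
# hypothetical choice of observed data for the model selection example
y_obs = height_obs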
We first need to initiate the Model Selection scheme, for which we need to define the summary statistics and backend:
# define statistics
from abcpy.statistics import Identity
statistics_calculator = Identity(degree=2, cross=False)
# define backend
from abcpy.backends import BackendDummy as Backend
backend = Backend()
# Initiate the Model selection scheme
from abcpy.modelselections import RandomForest
modelselection = RandomForest(model_array, statistics_calculator, backend, seed=1)
Now we can choose the most suitable model for the observed dataset y_obs,
# Choose the correct model
model = modelselection.select_model(y_obs, n_samples=100, n_samples_per_param=1)
or compute posterior probability of each of the models given the observed dataset.
# Compute the posterior probability of the chosen model
model_prob = modelselection.posterior_probability(y_obs)
Logging¶
Sometimes, when running inference schemes, it is desirable to have more verbose logging output. This can be achieved by using Python's standard logger and setting it to INFO level at the beginning of the file.
import logging
logging.basicConfig(level=logging.INFO)
3. User Customization¶
Implementing a new Model¶
One of the standard use cases of ABCpy is to do inference on a probabilistic model that is not part of ABCpy. We now go through the details of such a scenario using the (already implemented) Gaussian generative model to explain how to implement it from scratch.
There are two scenarios for using a model: In the first, we want to use our probabilistic model to explain a relationship between parameters (considered random variables for inference). In the second, we use it to explain a relationship between parameters and observed data, for example when we want to do inference on mechanistic models that do not have a PDF. In both cases, our model implementation has to derive from the ProbabilisticModel class of ABCpy, and a few abstract methods have to be defined, as for example forward_simulate().
In the first scenario, we want to use the model to build a relationship between different parameters (between different random variables). Then our model is restricted to output either continuous or discrete parameters in the form of a vector. Consequently, the model must derive from either Continuous or Discrete and implement the required abstract methods. These two classes in turn derive from ProbabilisticModel, such that the second scenario essentially extends the first.
Let us go through the implementation of the Gaussian generative model. The model has to conform to the API specified by the base class ProbabilisticModel, and thus must implement at least the following methods:
ProbabilisticModel.__init__()
ProbabilisticModel._check_input()
ProbabilisticModel._check_output()
ProbabilisticModel.forward_simulate()
ProbabilisticModel.get_output_dimension()
If we want our model to work in both of the described scenarios, it also has to conform to the API of Continuous, since the model output, which is the resulting data from a forward simulation, is from a continuous domain. For completeness, the abstract methods defined by Continuous and Discrete are pdf() and pmf(), respectively.
If we want our model to work only for the second scenario (typically the case for mechanistic or simulator models) and not to be used to build priors on parameters, we do not need to write the above two abstract methods.
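As a hedged overview (a sketch, not code from ABCpy itself; the class name and method bodies are placeholders), a minimal simulator-only model therefore has roughly the following shape:
import numpy as np
from abcpy.probabilisticmodels import ProbabilisticModel, InputConnector


class MySimulator(ProbabilisticModel):
    """Skeleton of a model used only as a simulator (second scenario); pdf() is omitted."""

    def __init__(self, parameters, name='MySimulator'):
        super().__init__(InputConnector.from_list(parameters), name)

    def _check_input(self, input_values):
        return True  # accept any input, for illustration only

    def _check_output(self, values):
        return True  # safe choice in the simulator-only scenario

    def get_output_dimension(self):
        return 1

    def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
        # replace this placeholder with a call to the actual simulator
        return [np.array([0.0]) for _ in range(k)]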
Initializing a New Model¶
Since a Gaussian model generates continuous numbers, the newly implemented class derives from Continuous and the header looks as follows:
from abcpy.probabilisticmodels import ProbabilisticModel, Continuous, InputConnector
class Gaussian(ProbabilisticModel, Continuous):
A good way to start implementing a new model is to define a convenient way to initialize it with its input parameters. In ABCpy all input parameters are either independent ProbabilisticModels or Hyperparameters. Thus, they should not be stored within, but rather referenced in, the model we implement. This reference is handled by the InputConnector class and must be used in our model implementation. The required procedure is to call the init function of ProbabilisticModel and pass an InputConnector object to it.
ProbabilisticModel.__init__(input_connector, name='')

This initializer must be called from any derived class to properly connect it to its input models. It accepts as input an InputConnector object that fully specifies how to connect all parent models to the current model.

Parameters:
- input_connector (list) – A list of input parameters.
- name (string) – A human readable name for the model. Can be the variable name for example.
However, it would be very inconvenient to initialize our Gaussian model with an InputConnector object. We would rather like the init function to accept a list of parameters [mu, sigma], where mu is the mean and sigma is the standard deviation, which are the sole two parameters of our generative Gaussian model. So the idea is to take a convenient input and transform it into an InputConnector object that in turn can be passed to the initializer of the super class. This leads to the following implementation:
def __init__(self, parameters, name='Gaussian'):
    # We expect input of type parameters = [mu, sigma]
    if not isinstance(parameters, list):
        raise TypeError('Input of Normal model is of type list')

    if len(parameters) != 2:
        raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

    input_connector = InputConnector.from_list(parameters)
    super().__init__(input_connector, name)
First, we do some basic syntactic checks on the input that throw exceptions if unreasonable input is provided. The call to InputConnector.from_list is the interesting part: the InputConnector comes with a convenient set of factory methods that create InputConnector objects:
abcpy.probabilisticmodels.InputConnector.from_number()
abcpy.probabilisticmodels.InputConnector.from_model()
abcpy.probabilisticmodels.InputConnector.from_list()
We use the factory method from_list. The resulting InputConnector creates links between our Gaussian model and the models (or hyperparameters) that are used for mu and sigma at initialization time. For example, if mu and sigma are initialized as hyperparameters like

model = Gaussian([0, 1])

the from_list() method will automatically create two HyperParameter objects HyperParameter(0) and HyperParameter(1) and will link our current Gaussian model inputs to them. If we initialize mu and sigma with existing models like

uniform1 = Uniform([-1, 1])
uniform2 = Uniform([10, 20])
model = Gaussian([uniform1, uniform2])

the from_list() method will link our inputs to the uniform models.
Additionally, every model instance should have a unique name, which should also be passed to the init function of the super class.
Checking the Input¶
The next function we implement is _check_input, which should behave as described in the documentation:
ProbabilisticModel._check_input(input_values)

Check whether the input parameters are compatible with the underlying model. The following behavior is expected:

1. If the input is of wrong type or has the wrong format, this method should raise an exception. For example, if the number of parameters does not match what the model expects.
2. If the values of the input models are not compatible, this method should return False. For example, if an input value is not from the expected domain.

Background information: Many inference schemes modify the input slightly by applying a small perturbation during sampling. This method is called to check whether the perturbation yields a reasonable input to the current model. In case this function returns False, the inference schemes re-perturb the input and try again. If the check is not done properly, the inference computation might crash or not terminate.

Parameters: input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
Returns: True if the fixed value of the parameters can be used as input for the current model. False otherwise.
Return type: boolean
This leads to the following implementation:
def _check_input(self, input_values):
    # Check whether input has correct type or format
    if len(input_values) != 2:
        raise ValueError('Number of parameters of Normal model must be 2.')

    # Check whether input is from correct domain
    mu = input_values[0]
    sigma = input_values[1]
    if sigma < 0:
        return False

    return True
Forward Simulation¶
At the core of our model lies the capability to forward simulate and create pseudo observations. To expose this functionality the following method has to be implemented:
ProbabilisticModel.forward_simulate(input_values, k, rng, mpi_comm=None)

Provides the output (pseudo data) from a forward simulation of the current model.

In case the model is intended to be used as input for another model, a forward simulation must return a list of k numpy arrays with shape (get_output_dimension(),).

In case the model is directly used for inference, and not as input for another model, a forward simulation also must return a list, but the elements can be arbitrarily defined. In this case it is only important that the used statistics and distance functions can read the input.

Parameters:
- input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- k (integer) – The number of forward simulations that should be run.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.

Returns: A list of k elements, where each element is of type numpy array and represents the result of a single forward simulation.
Return type: list
A proper implementation looks as follows:
def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
    # Extract the input parameters
    mu = input_values[0]
    sigma = input_values[1]

    # Do the actual forward simulation
    vector_of_k_samples = np.array(rng.normal(mu, sigma, k))

    # Format the output to obey API
    result = [np.array([x]) for x in vector_of_k_samples]
    return result
Note that both mu and sigma are stored in the list input_values in the same order as we provided them to the InputConnector object in the init function. Further note that the output is a list of vectors, each of dimension one, even though the Gaussian generative model only produces real numbers.
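As a hedged usage sketch (the parameter values are illustrative), the method can also be called directly on fixed values:
import numpy as np

model = Gaussian([170, 15])
samples = model.forward_simulate([170, 15], k=3, rng=np.random.RandomState(1))
# samples is a list of 3 one-dimensional numpy arrays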
Checking the Output¶
We also need to check the output of the model. This method is commonly used in case our model is used as an input for other models. When using an inference scheme that utilizes perturbation, the output of our model is slightly perturbed. We have to make sure that the perturbed output is still valid for our model. The details of implementing the method _check_output() can be found in the documentation:
ProbabilisticModel._check_output(values)

Checks whether values contains a reasonable output of the current model.

Parameters: values (numpy array) – Array of shape (get_output_dimension(),) that contains the model output.
Returns: Return False if values cannot possibly be generated from the model and True otherwise.
Return type: boolean
Since the output of a Gaussian generative model is a single number from the full real domain, we can restrict ourselves to syntactic checks. However, one could easily imagine models for which the output is restricted to a certain domain. Then, this function should return False as soon as values are out of the desired domain.
def _check_output(self, values):
    # requires 'from numbers import Number' at module level
    if not isinstance(values, Number):
        raise ValueError('Output of the normal distribution is always a number.')

    # At this point values is a number (int, float); full domain for Normal is allowed
    return True
Note that implementing this method is particularly important when using the current model as input for other models, hence in the first scenario described in Implementing a new Model. In case our model should only be used for the second scenario, it is safe to omit the check and return true.
Getting the Output Dimension¶
We have expose the dimension of the produced output of our model using the following method:
ProbabilisticModel.get_output_dimension()

Provides the output dimension of the current model.

This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.

Returns: The dimension of the output vector of a single forward simulation.
Return type: int
Since our model generates a single float number in one forward simulation, the implementation is straightforward:
def get_output_dimension(self):
    return 1
Note that implementing this method is particularly important when using the current model as input for other models, hence in the first scenario described in Implementing a new Model. In case our model should only be used for the second scenario, it is safe to return 1.
Calculating the Probability Density Function¶
Since our model also derives from Continuous
we also have to implement
the following function that calculates the probability density function at specific point.
Continuous.pdf(input_values, x)

Calculates the probability density function of the model.

Parameters:
- input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- x (float) – The location at which the probability density function should be evaluated.
As mentioned above, this is only required if one wants to use our model as input for other models. An implementation looks as follows:
def pdf(self, input_values, x):
    # requires 'from scipy.stats import norm' at module level (np.norm does not exist)
    mu = input_values[0]
    sigma = input_values[1]
    pdf = norm(mu, sigma).pdf(x)
    return pdf
Our model now conforms to ABCpy and we can start inferring parameters in the same way (see Getting Started) as we would do with shipped models.
Wrap a Model Written in C++¶
There are several frameworks that help you integrate your C++/C code into Python. We showcase an example using Swig.
Using Swig¶
Swig is a tool that creates a Python wrapper for our C++/C code using an interface (file) that we have to specify. We can then import the wrapper and in turn use our C++ code with ABCpy as if it were written in Python.
We go through a complete example to illustrate how to use a simple Gaussian model written in C++ with ABCpy. First, have a look at our C++ model:
void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed) {
    boost::mt19937 rng(seed);
    boost::normal_distribution<> nd(mu, sigma);
    boost::variate_generator<boost::mt19937, boost::normal_distribution<> > sampler(rng, nd);

    for (int i = 0; i < k; ++i) {
        result[i] = sampler();
    }
}
To use this code in Python, we need to specify exactly how to expose the C++ function to Python. Therefore, we write a Swig interface file that looks as follows:
%module gaussian_model_simple
%{
    #define SWIG_FILE_WITH_INIT

    #include <iostream>
    #include <boost/random.hpp>
    #include <boost/random/normal_distribution.hpp>

    extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
%}

%include "numpy.i"

%init %{
    import_array();
%}

%apply (double* ARGOUT_ARRAY1, int DIM1) {(double* result, unsigned int k)};

extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
In the first line we define the module name we later have to import in our ABCpy Python code. Then, in curly brackets, we specify which libraries we want to include and which function we want to expose through the wrapper.
Now comes the tricky part. The model class expects a method forward_simulate that forward-simulates our model and returns an array of synthetic observations. However, C++/C does not know the concept of returning an array; instead, in C++/C we would provide a memory position (pointer) where to write the results. Swig has to translate between the two concepts. For this we use the Swig interface definitions that ship with numpy (numpy.i), initialized through import_array. The line
%apply (double* ARGOUT_ARRAY1, int DIM1) {(double* result, unsigned int k)};
states that we want the two parameters result and k of the gaussian_model C++ function to be interpreted as an array of length k that is returned. Have a look at the Python code below and observe how the wrapped Python function takes only two instead of four parameters and returns a numpy array.
The first step to get everything running is to translate the Swig interface file to wrapper code in C++ and Python.
swig -python -c++ -o gaussian_model_simple_wrap.cpp gaussian_model_simple.i
This creates two wrapper files gaussian_model_simple_wrap.cpp and gaussian_model_simple.py. Now the C++ files can be compiled:
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple.cpp -o gaussian_model_simple.o
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple_wrap.cpp -o gaussian_model_simple_wrap.o
g++ -shared gaussian_model_simple.o gaussian_model_simple_wrap.o -o _gaussian_model_simple.so
Note that the include paths might need to be adapted to your system. Finally, we can write a Python model which uses our C++ code:
import logging
from numbers import Number

import numpy as np
from scipy.stats import norm

from examples.extensions.models.gaussian_cpp.gaussian_model_simple import \
    gaussian_model  # this is the file produced upon compiling
from abcpy.probabilisticmodels import ProbabilisticModel, Continuous, InputConnector


class Gaussian(ProbabilisticModel, Continuous):
    def __init__(self, parameters, name='Gaussian'):
        # We expect input of type parameters = [mu, sigma]
        if not isinstance(parameters, list):
            raise TypeError('Input of Normal model is of type list')

        if len(parameters) != 2:
            raise RuntimeError('Input list must be of length 2, containing [mu, sigma].')

        input_connector = InputConnector.from_list(parameters)
        super().__init__(input_connector, name)

    def _check_input(self, input_values):
        # Check whether input has correct type or format
        if len(input_values) != 2:
            raise ValueError('Number of parameters of Normal model must be 2.')

        # Check whether input is from correct domain
        mu = input_values[0]
        sigma = input_values[1]
        if sigma < 0:
            return False

        return True

    def _check_output(self, values):
        if not isinstance(values, Number):
            raise ValueError('Output of the normal distribution is always a number.')

        # At this point values is a number (int, float); full domain for Normal is allowed
        return True

    def get_output_dimension(self):
        return 1

    def forward_simulate(self, input_values, k, rng=np.random.RandomState()):
        # Extract the input parameters
        mu = input_values[0]
        sigma = input_values[1]
        seed = rng.randint(np.iinfo(np.int32).max)

        # Do the actual forward simulation
        vector_of_k_samples = gaussian_model(k, mu, sigma, seed)  # call the C++ code

        # Format the output to obey API
        result = [np.array([x]) for x in vector_of_k_samples]
        return result

    def pdf(self, input_values, x):
        mu = input_values[0]
        sigma = input_values[1]
        pdf = norm(mu, sigma).pdf(x)  # scipy.stats.norm
        return pdf
The important lines are where we import the wrapper code as a module (the gaussian_model import at the top of the file) and where we call the respective model function inside forward_simulate.
The full code is available in examples/extensions/models/gaussian_cpp/. To simplify compilation of SWIG and C++ code we created a Makefile. Note that you might need to adapt some paths in the Makefile.
Wrap a Model Written in R¶
Statisticians often use the R language to build statistical models. R models can be incorporated within the ABCpy language with the rpy2 Python package. We show how to use the rpy2 package to connect with a model written in R.
Continuing from the previous sections we use a simple Gaussian model as an example. The following R code is the contents of the R file gaussian_model.R:
simple_gaussian <- function(mu, sigma, k = 1, seed = seed) {
    set.seed(seed)
    output <- rnorm(k, mu, sigma)
    return(output)
}
More complex R models are incorporated in the same way. To include this function within ABCpy we include the following code at the beginning of our Python file:
import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri

try:
    robjects.r('''
        source('examples/extensions/models/gaussian_R/gaussian_model.R')
    ''')
except:
    robjects.r('''
        source('gaussian_model.R')
    ''')

r_simple_gaussian = robjects.globalenv['simple_gaussian']
This imports the R function simple_gaussian into the Python environment. We need to build our own model to incorporate this R function, as in the previous section. The only difference is in the forward_simulate method of the class Gaussian.
vector_of_k_samples = list(r_simple_gaussian(mu, sigma, k, seed=seed))
The default output for R functions in Python is a float vector. This must be converted into a Python numpy array for the purposes of ABCpy.
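A hedged sketch of that conversion inside forward_simulate, mirroring the pattern used for the other wrapped models above (it assumes numpy is imported as np and that mu, sigma, k and seed are defined as before):
vector_of_k_samples = list(r_simple_gaussian(mu, sigma, k, seed=seed))
result = [np.array([x]) for x in vector_of_k_samples]  # list of k one-dimensional numpy arrays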
Wrap a Model Written in FORTRAN¶
FORTRAN is still a widely used language in some specific application domains. We show here how to wrap a FORTRAN model in ABCpy by exploiting the F2PY tool, which is part of Numpy.
Using this tool is quite simple; first, the FORTRAN code defining the model has to be written. Specifically, it needs to define a subroutine (here gaussian) inside a module (here gaussian_model):
module gaussian_model
contains
    subroutine gaussian(output, mu, sigma, k, seed)
Then, the FORTRAN code needs to be compiled in a way which can be linked to the Python one; by using F2PY, this is as simple as:
python -m numpy.f2py -c -m gaussian_model_simple gaussian_model_simple.f90
which produces a compiled Python extension module (with .so extension on Linux, for instance) with the same name as the FORTRAN file.
Finally, an ABCpy model in Python needs to be defined which calls the FORTRAN binary, similarly to what was done before.
Specifically, we import the FORTRAN model in the following way:
from examples.extensions.models.gaussian_f90.gaussian_model_simple import gaussian_model
Note that the name of the object to import is the same as the module name in the original FORTRAN code. Then, in the forward_simulate method of the ABCpy model, you can run the FORTRAN model and obtain its output with the following line:
vector_of_k_samples = np.array(gaussian_model(mu, sigma, k, seed))
A full reproducible example is available in examples/extensions/models/gaussian_f90/; a Makefile with the right compilation commands is also provided.
Implementing a new Distance¶
We will now explain how you can implement your own distance measure. A new distance is implemented as a new class that derives from Distance and for which the following three methods have to be implemented: __init__(), distance() and dist_max().
Let us first look at the initializer documentation:
Distance.__init__(statistics_calc)

The constructor of a sub-class must accept a non-optional statistics calculator as a parameter; then, it must call the __init__ method of the parent class. This ensures that the object is initialized correctly so that the _calculate_summary_stat private method can be called when computing the distances.

Parameters: statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
Distances in ABCpy should act on summary statistics. Therefore, at initialization of a distance calculator, a statistics calculator should be provided. The following header conforms to this idea:
def __init__(self, statistics_calc):
    self.statistics_calc = statistics_calc

    # Since the observations do always stay the same, we can save the
    # summary statistics of them and not recalculate it each time
    self.s1 = None
    self.data_set = None
    self.dataSame = False
Then, we need to define how the distance is calculated. We first need to compute the summary statistics from the datasets and then compute the distance between the summary statistics. Notice that we use the private method Distance._calculate_summary_stat to compute the statistics from the datasets; internally, this saves the first dataset and the corresponding summary statistics. In fact, we always pass the observed dataset first to the distance function during inference and, as this does not change, it is efficient to compute its statistics once and store them internally. At each call of the distance method, the first input is compared to the stored one and, only if they differ, the stored statistics are updated.
def distance(self, d1, d2):
    """Calculates the distance between two datasets, by computing Euclidean distance between each element of d1 and
    d2 and taking their average.

    Parameters
    ----------
    d1: Python list
        Contains n1 data points.
    d2: Python list
        Contains n2 data points.

    Returns
    -------
    numpy.float
        The distance between the two input data sets.
    """
    s1, s2 = self._calculate_summary_stat(d1, d2)

    # compute distance between the statistics
    dist = np.zeros(shape=(s1.shape[0], s2.shape[0]))
    for ind1 in range(0, s1.shape[0]):
        for ind2 in range(0, s2.shape[0]):
            dist[ind1, ind2] = np.sqrt(np.sum(pow(s1[ind1, :] - s2[ind2, :], 2)))

    return dist.mean()
Finally, we need to define the maximal distance that can be obtained from this distance measure.
def dist_max(self):
    """
    Returns
    -------
    numpy.float
        The maximal possible value of the desired distance function.
    """
    return np.inf
The newly defined distance class can be used in the same way as the already existing ones. The complete example for this tutorial can be found in examples/extensions/distances/default_distance.py.
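As a short usage sketch (assuming the class defined above is called DifferentDistance, a name chosen here purely for illustration), the new distance is instantiated with a statistics calculator and then applied to two datasets:
import numpy as np
from abcpy.statistics import Identity

# Identity statistics: use the (polynomially expanded) raw data as summary statistics
stat_calc = Identity(degree=1, cross=False)
dist_calc = DifferentDistance(stat_calc)  # hypothetical name of the class defined above

data_1 = [160.8, 167.2, 185.7]
data_2 = [158.1, 171.4, 180.3]
print(dist_calc.distance(data_1, data_2))
print(dist_calc.dist_max())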
Implementing a new Perturbation Kernel¶
To implement a new kernel, we need to implement a new class that derives from abcpy.perturbationkernel.PerturbationKernel and that implements the following abstract methods: calculate_cov, update, and pdf (or pmf for discrete kernels).
Kernels in ABCpy can be of two types: they can either be derived from the class ContinuousKernel
or from DiscreteKernel
. In case a continuous kernel is required, the following method must be
implemented:
def pdf(self, accepted_parameters_manager, kernel_index, index, x):
On the other hand, if the kernel is a discrete kernel, we would need the following method:
def pmf(self, accepted_parameters_manager, kernel_index, index, x):
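For the discrete case, the following is a rough sketch of what such a pmf could look like; the random-walk-style kernel, the assumption of a single non-negative integer parameter, and the way the current value is read from the accepted parameters manager are illustrative assumptions only, not an ABCpy built-in:
from scipy.stats import poisson

# Hypothetical discrete kernel whose proposal for a non-negative integer parameter
# is Poisson-distributed with mean (current value + 1)
def pmf(self, accepted_parameters_manager, kernel_index, index, x):
    # currently accepted value of the (single) integer parameter handled by this kernel
    current = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][index][0]
    # probability that a Poisson proposal with mean current + 1 equals x
    return float(poisson(mu=current + 1).pmf(int(x)))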
As an example, we will implement a kernel which perturbs continuous parameters using a multivariate normal distribution (which is already implemented within ABCpy). First, we need to define a constructor.
PerturbationKernel.__init__(models)[source]
Parameters: models (list) – The list of abcpy.probabilisticmodel objects that should be perturbed by this kernel.
Thus, ABCpy expects that the argument passed to the initializer is a list of ProbabilisticModel objects, which can be seen as the random variables that should be perturbed by this kernel. All these models should be saved on the kernel for future reference.
class MultivariateNormalKernel(PerturbationKernel, ContinuousKernel):
    def __init__(self, models):
        self.models = models
Next, we need the following method:
PerturbationKernel.calculate_cov(accepted_parameters_manager, kernel_index)[source]
Calculates the covariance matrix for the kernel.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel.
Returns: The covariance matrix for the kernel.
Return type: numpy.ndarray
This method calculates the covariance matrix for your kernel. Of course, not all kernels will have covariance matrices. However, since some kernels do, it is necessary to implement this method for all kernels. If your kernel does not have a covariance matrix, simply return an empty list.
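For instance, a hypothetical kernel that has no notion of a covariance matrix could simply do the following (sketch only):
def calculate_cov(self, accepted_parameters_manager, kernel_index):
    # this kernel has no covariance matrix, so return an empty list as required
    return []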
The two arguments passed to this method are the accepted parameters manager and the kernel index. An object of type AcceptedParametersManager is always initialized when an inference method object is instantiated. This object centrally stores the accepted parameters, accepted weights, accepted covariance matrices for all kernels and other information, so that the various components of ABCpy can access it without much hassle. To access any of the quantities mentioned above, you will have to call the .value() method of the corresponding quantity.
The second parameter, the kernel index, specifies the position of the kernel in the list of kernels that the inference method will eventually obtain. Since the user is expected to collect all their kernels in one object, this index is provided automatically; you do not need to know its actual value. However, it is used to access the values relevant to your kernel, for example the covariance matrix currently calculated for it.
Let us now look at the implementation of the method:
def calculate_cov(self, accepted_parameters_manager, kernel_index):
    """
    Calculates the covariance matrix relevant to this kernel.

    Parameters
    ----------
    accepted_parameters_manager: abcpy.AcceptedParametersManager object
        AcceptedParametersManager to be used.
    kernel_index: integer
        The index of the kernel in the list of kernels of the joint kernel.

    Returns
    -------
    list
        The covariance matrix corresponding to this kernel.
    """
    continuous_model = [[] for i in
                        range(len(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index]))]
    for i in range(len(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index])):
        if isinstance(accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][i][0],
                      (np.float, np.float32, np.float64, np.int, np.int32, np.int64)):
            continuous_model[i] = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][i]
        else:
            continuous_model[i] = np.concatenate(
                accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index][i])
    continuous_model = np.array(continuous_model).astype(float)

    if accepted_parameters_manager.accepted_weights_bds is not None:
        weights = accepted_parameters_manager.accepted_weights_bds.value()
        cov = np.cov(continuous_model, aweights=weights.reshape(-1).astype(float), rowvar=False)
    else:
        cov = np.cov(continuous_model, rowvar=False)
    return cov
Some of the implemented inference algorithms weigh different sets of parameters differently. Therefore, if such weights are provided, we would like to weight the covariance matrix accordingly. We, therefore, check whether the accepted parameters manager contains any weights. If it does, we retrieve these weights, and calculate the covariance matrix using numpy, the parameters relevant to this kernel and the weights. If there are no weights, we simply calculate an unweighted covariance matrix.
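As a quick, self-contained illustration (with made-up numbers) of what numpy's aweights argument does here, compare an unweighted and a weighted covariance estimate:
import numpy as np

params = np.array([[0.9], [1.1], [3.0]])   # three accepted parameter values, one per row
weights = np.array([0.45, 0.45, 0.10])     # normalized weights, e.g. from a PMC-type algorithm

print(np.cov(params, rowvar=False))                           # unweighted estimate
print(np.cov(params, aweights=weights, rowvar=False))         # weighted estimate, less influenced by the outlier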
Next, we need the method:
PerturbationKernel.update(accepted_parameters_manager, row_index, rng)[source]
Perturbs the parameters for this kernel.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- row_index (integer) – The index of the accepted parameters bds that should be perturbed.
- rng (random number generator) – The random number generator to be used.
Returns: The perturbed parameters.
Return type: numpy.ndarray
This method perturbs the parameters that are associated with the random variables the kernel should perturb. The method again requires an accepted parameters manager and a kernel index. These have the same meaning as in the last method. In addition, a row index is required, as well as a random number generator. The row index specifies which set of parameters should be perturbed. There are usually multiple sets, which should be perturbed by different workers during parallelization. We again need not worry about the actual value of this index.
The random number generator should be a random number generator compatible with numpy. This is due to the fact that other methods will pass their random number generator to this method, and all random number generators used within ABCpy are provided by numpy. Also, note that even if your kernel does not require a random number generator, you still need to pass this argument.
Here is the implementation for our kernel:
def update(self, accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState()):
    """
    Updates the parameter values contained in the accepted_parameters_manager using a multivariate normal distribution.

    Parameters
    ----------
    accepted_parameters_manager: abcpy.AcceptedParametersManager object
        Defines the AcceptedParametersManager to be used.
    kernel_index: integer
        The index of the kernel in the list of kernels in the joint kernel.
    row_index: integer
        The index of the row that should be considered from the accepted_parameters_bds matrix.
    rng: random number generator
        The random number generator to be used.

    Returns
    -------
    np.ndarray
        The perturbed parameter values.
    """
    # Get all current parameter values relevant for this model and the structure
    continuous_model_values = accepted_parameters_manager.kernel_parameters_bds.value()[kernel_index]

    if isinstance(continuous_model_values[row_index][0],
                  (np.float, np.float32, np.float64, np.int, np.int32, np.int64)):
        # Perturb
        cov = np.array(accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]).astype(float)
        continuous_model_values = np.array(continuous_model_values).astype(float)

        # Perturbed values, no splitting according to the structure needed
        perturbed_continuous_values = rng.multivariate_normal(continuous_model_values[row_index], cov)
    else:
        # Learn the structure
        struct = [[] for i in range(len(continuous_model_values[row_index]))]
        for i in range(len(continuous_model_values[row_index])):
            struct[i] = continuous_model_values[row_index][i].shape[0]
        struct = np.array(struct).cumsum()

        continuous_model_values = np.concatenate(continuous_model_values[row_index])

        # Perturb
        cov = np.array(accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]).astype(float)
        continuous_model_values = np.array(continuous_model_values).astype(float)

        # Perturbed values and split according to the structure
        perturbed_continuous_values = np.split(rng.multivariate_normal(continuous_model_values, cov), struct)[:-1]

    return perturbed_continuous_values
The first line shows how you obtain the values of the parameters that your kernel should perturb. These values are converted to a numpy array. Then, the covariance matrix is retrieved from the accepted parameters manager using a similar function call. Finally, the parameters are perturbed and returned.
Last but not least, each kernel requires a probability density or probability mass function depending on whether it is a Continuous Kernel or a Discrete Kernel:
PerturbationKernel.pdf(accepted_parameters_manager, kernel_index, row_index, x)[source]
Calculates the pdf of the kernel at point x.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel.
- row_index (integer) – The index of the accepted parameters bds for which the pdf should be evaluated.
- x (list or float) – The point at which the pdf should be evaluated.
Returns: The pdf evaluated at point x.
Return type: float
This method is implemented as follows for the multivariate normal:
# requires: from scipy.stats import multivariate_normal
def pdf(self, accepted_parameters_manager, kernel_index, mean, x):
    """Calculates the pdf of the kernel.
    Commonly used to calculate weights.

    Parameters
    ----------
    accepted_parameters_manager: abcpy.AcceptedParametersManager object
        The AcceptedParametersManager to be used.
    kernel_index: integer
        The index of the kernel in the list of kernels in the joint kernel.
    mean: list
        The accepted parameter values at which the perturbation is centred.
    x: The point at which the pdf should be evaluated.

    Returns
    -------
    float
        The pdf evaluated at point x.
    """
    if isinstance(mean[0], (np.float, np.float32, np.float64, np.int, np.int32, np.int64)):
        mean = np.array(mean).astype(float)
        cov = np.array(accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]).astype(float)
        return multivariate_normal(mean, cov, allow_singular=True).pdf(np.array(x).astype(float))
    else:
        mean = np.array(np.concatenate(mean)).astype(float)
        cov = np.array(accepted_parameters_manager.accepted_cov_mats_bds.value()[kernel_index]).astype(float)
        return multivariate_normal(mean, cov, allow_singular=True).pdf(np.concatenate(x))
We simply obtain the parameter values and covariance matrix for this kernel and calculate the probability density function using SciPy.
Note that after defining your own kernel, you will need to collect all your kernels in a
JointPerturbationKernel
object in order for inference to
work. For an example on how to do this, check the Using perturbation kernels section.
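As a rough sketch of that final step (assuming two random variables mu and sigma defined earlier, and the MultivariateNormalKernel class from this tutorial), collecting the kernels looks roughly like this:
from abcpy.perturbationkernel import JointPerturbationKernel

# mu and sigma are the ProbabilisticModel objects (random variables) to be perturbed
kernel = JointPerturbationKernel([MultivariateNormalKernel([mu, sigma])])

# the joint kernel is then handed to the inference scheme, e.g. (illustrative call):
# sampler = PMCABC([model], [distance_calculator], backend, kernel, seed=1)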
The complete example used in this tutorial can be found in the file examples/extensions/perturbationkernels/multivariate_normal_kernel.py.
4. Parallelization Backends¶
Using Parallelization Backends¶
Running ABC algorithms is often computationally expensive, thus ABCpy is built with parallelization in mind. In order to run your inference schemes in parallel on multiple nodes (computers) you can choose from the following backends.
Using the MPI Backend¶
To run ABCpy in parallel using MPI, one only needs to use the provided MPI backend. Using the same example as before, the statements for the backend have to be changed to
from abcpy.backends import BackendMPI as Backend
backend = Backend()
# The above line is equivalent to:
# backend = Backend(process_per_model=1)
# Notice: Models not parallelized by MPI should not be given process_per_model > 1
In words, one only needs to initialize an instance of the MPI backend. The number of ranks to spawn is specified at runtime through the way the script is run. A minimum of two ranks is required, since rank 0 (the master) is used to orchestrate the calculation and all other ranks (the workers) actually perform the calculation. (The default value of process_per_model is 1. If your simulator model is not parallelized using MPI, do not specify process_per_model > 1. The use of process_per_model for nested parallelization is explained below.)
The standard way to run the script using MPI is directly via mpirun like below or on a cluster through a job scheduler like Slurm:
mpirun -np 4 python3 pmcabc_gaussian.py
The adapted Python code can be found in examples/backend/mpi/pmcabc_gaussian.py.
Nested-MPI parallelization for MPI-parallelized simulator models¶
Sometimes, the simulator model itself has large compute requirements and needs parallelization. To achieve this parallelization using threads, the MPI backend needs to be configured such that each MPI rank can spawn multiple threads on a node. However, there might be situations where node-local parallelization using threads is not sufficient and parallelization across nodes is required.
Parallelization of the forward model across nodes is possible but limited to the MPI backend. Technically, this is implemented using individual MPI communicators for each forward model. The number of ranks per communicator (referred to as process_per_model) can be passed at the initialization of the backend as follows:
from abcpy.backends import BackendMPI as Backend
backend = Backend(process_per_model=2)
Here each model is assigned an MPI communicator with 2 ranks. Clearly, the MPI job has to be configured manually such that the total number of MPI ranks is a multiple of the ranks per communicator plus one additional rank for the master. For example, if we want to run n instances of an MPI model and allow m processes for each instance, we have to spawn (n*m)+1 ranks.
For instance, let's say you want to use n=3 (with m=2 as above). Therefore, we use the following command:
mpirun -n 7 python3 mpi/mpi_model_inferences.py
as (3*2) + 1 = 7. Note that, in this scenario, using only 6 tasks overall leads to failure of the script due to how the tasks are assigned to the model instances.
The forward_simulate method of the MPI-parallelized simulator model has to be able to take an MPI communicator as a parameter.
An example of an MPI-parallelized simulator model, which can be used with ABCpy nested parallelization, can be found in examples/backend/mpi/mpi_model_inferences.py. The forward_simulate method of the above model is as follows:
def forward_simulate(self, input_values, k, rng=np.random.RandomState(), mpi_comm=None):
    if mpi_comm is None:
        raise ValueError('MPI-parallelized simulator model needs to have access '
                         'to a MPI communicator object')
    rank = mpi_comm.Get_rank()

    # Extract the input parameters
    mu = input_values[rank]
    sigma = 1

    # Do the actual forward simulation
    vector_of_k_samples = np.array(rng.normal(mu, sigma, k))

    # Send everything back to rank 0
    data = mpi_comm.gather(vector_of_k_samples, root=0)

    # Format the output to obey the API and broadcast it before returning
    result = None
    if rank == 0:
        result = [None] * k
        for i in range(k):
            element0 = data[0][i]
            element1 = data[1][i]
            point = np.array([element0, element1])
            result[i] = point
        result = [np.array([result[i]]).reshape(-1, ) for i in range(k)]
        return result
    else:
        return None
Note that in order to run jobs in parallel you need to have MPI installed on the system(s) in question with the requisite Python bindings for MPI (mpi4py). The dependencies of the MPI backend can be installed with pip install -r requirements/backend-mpi.txt.
Details on the installation can be found on the official Open MPI homepage and the mpi4py homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.
Using the Spark Backend¶
To run ABCpy in parallel using Apache Spark, one only needs to use the provided Spark backend. Considering the example from before, the statements for the backend have to be changed to
import pyspark
sc = pyspark.SparkContext()
from abcpy.backends import BackendSpark as Backend
backend = Backend(sc, parallelism=4)
In words, a Spark context has to be created and passed to the Spark backend. Additionally, the level of parallelism can be provided, which defines in a sense in how many blocks the work should be split up. It corresponds to the parallelism of an RDD in Apache Spark terminology. A good value is usually a small multiple of the total number of available cores.
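For instance, instead of hard-coding the parallelism level, one could derive it from the number of cores Spark reports (sc.defaultParallelism); this is only a sketch of the rule of thumb above:
import pyspark
from abcpy.backends import BackendSpark as Backend

sc = pyspark.SparkContext()
# use a small multiple of the cores Spark sees as the level of parallelism
backend = Backend(sc, parallelism=2 * sc.defaultParallelism)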
The standard way to run the script on Spark is via the spark-submit command:
PYSPARK_PYTHON=python3 spark-submit pmcabc_gaussian.py
Often Spark installations use Python 2 by default. To make Spark use the required Python 3 interpreter, the PYSPARK_PYTHON environment variable can be set.
The adapted Python code can be found in examples/backend/apache_spark/pmcabc_gaussian.py.
Note that in order to run jobs in parallel you need to have Apache Spark installed on the system in question. The dependencies of the Spark backend can be installed with pip install -r requirements/backend-spark.txt.
Details on the installation can be found on the official homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.
Using Cluster Infrastructure¶
When your model is computationally expensive, and/or other factors require compute infrastructure that goes beyond a single laptop or workstation, you can easily run ABCpy on cluster or high-performance computing infrastructure.
Running on Amazon Web Services¶
We show with high-level steps how to get ABCpy running on Amazon Web Services (AWS). Please note that this is not a complete guide to AWS, so we would like to refer you to the respective documentation. The first step is to set up an AWS Elastic Map Reduce (EMR) cluster, which comes with the option of a pre-configured Apache Spark. Then, we show how to run a simple inference code on this cluster.
Setting up the EMR Cluster¶
When we set up an EMR cluster we want to install ABCpy on every node of the cluster. Therefore, we provide a bootstrap script that does this job for us. On your local machine, create a file named emr_bootstrap.sh with the following content:
#!/bin/sh
sudo yum -y install git
sudo pip-3.4 install ipython findspark abcpy
In AWS go to Services, then S3 under the Storage Section. Create a new bucket called abcpy and upload your bootstrap script emr_bootstrap.sh.
To create a cluster, in AWS go to Services and then EMR under the Analytics Section. Click ‘Create Cluster’, then choose ‘Advanced Options’. In Step 1 choose the emr-5.7.0 image and make sure only Spark is selected for your cluster (the other software packages are not required). In Step 2 choose for example one master node and 4 core nodes (16 vCPUs if you have 4 vCPUs instances). In Step 3, under the bootstrap action, choose custom, and select the script abcpy/emr_bootstrap.sh. In the last step (Step 4), choose a key to access the master node (we assume that you have already set up keys). Start the cluster.
Running ABCpy on AWS¶
Log in via SSH and run the following commands to get an example code from ABCpy running with Python3 support:
sudo bash -c 'echo export PYSPARK_PYTHON=python34 >> /etc/spark/conf/spark-env.sh'
git clone https://github.com/eth-cscs/abcpy.git
Then, to submit a job to the Spark cluster we run the following commands:
cd abcpy/examples/backends/
spark-submit --num-executors 16 pmcabc_gaussian.py
Clearly the setup can be extended and optimized. For this, and for more basic information, we refer you to the AWS documentation on EMR.
5. Post Analysis¶
The output of an inference scheme is a Journal (abcpy.output.Journal) which holds all the necessary results and convenient methods to do the post analysis.
For example, one can easily access the sampled parameters and corresponding weights using:
print(journal.get_parameters())
print(journal.get_weights())
The output of get_parameters()
is a Python dictionary. The keys for this dictionary are the names you specified for the parameters. The corresponding values are the marginal posterior samples of that parameter. Here is a short example of what you would specify, and what would be the output in the end:
a = Normal([[1],[0.1]], name='parameter_1')
b = MultivariateNormal([[1,1],[[0.1,0],[0,0.1]]], name='parameter_2')
If one defined a model with these two parameters as inputs and n_sample=2
, the following would be the output of journal.get_parameters()
:
{'parameter_1' : [[0.95],[0.97]], 'parameter_2': [[0.98,1.03],[1.06,0.92]]}
These are the samples at the final step of the ABC algorithm. If you want samples from an earlier step, you can get a Python dictionary for that step by using:
journal.get_parameters(step_number)
Since this is a dictionary, you can also access the values for each step as:
journal.get_parameters(step_number)["name"]
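As a minimal sketch (using the parameter name from the example above), these accessors can be combined to compute a weighted posterior mean by hand; equivalent information is also returned by the convenience functions shown next:
import numpy as np

params = journal.get_parameters()                       # dict: parameter name -> posterior samples
weights = np.array(journal.get_weights()).reshape(-1)   # importance weights of the samples

samples_p1 = np.array(params['parameter_1']).reshape(-1)
print(np.average(samples_p1, weights=weights))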
For the post analysis basic functions are provided:
# do post analysis
print(journal.posterior_mean())
print(journal.posterior_cov())
Also, to ensure reproducibility, every journal stores the parameters of the algorithm that created it:
print(journal.configuration)
Finally, you can plot the inferred posterior distribution of the parameters in the following way:
journal.plot_posterior_distr(path_to_save="posterior.png")
The above line plots the posterior distribution for all the parameters and stores it in posterior.png
; if you instead want to plot it for some
parameters only, you can use the parameters_to_show
argument; in addition, the ranges_parameters
argument can be
used to provide a dictionary specifying the limits for the axis in the plots:
journal.plot_posterior_distr(parameters_to_show='parameter_1',
ranges_parameters={'parameter_1': [0,2]})
For journals generated with sequential algorithms, we provide a way to check the convergence by plotting the estimated Effective Sample Size (ESS) at each iteration, as well as an estimate of the Wasserstein distance between the empirical distributions defined by the samples and weights at subsequent iterations:
journal.plot_ESS()
journal.Wass_convergence_plot()
And certainly, a journal can easily be saved to and loaded from disk:
journal.save("experiments.jnl")
new_journal = Journal.fromFile('experiments.jnl')
Deploy a new Release¶
This documentation is mainly intended for the main developers. The deployment of new releases is automated using Travis CI. However, there are still a few manual steps required in order to deploy a new release. Assume we want to deploy the new version M.m.b:
- Create a release branch release-M.m.b
- Adapt the VERSION file in the repo's root directory: echo M.m.b > VERSION
- Adapt README.md file: adapt links to correct version of User Documentation and Reference
- Adapt doc/source/installation.rst file: to install correct version of ABCpy
- Merge all desired feature branches into the release branch
- Create a pull/merge request: release branch -> master
After a successful merge:
- Create tag vM.m.b (git tag vM.m.b)
- Retag tag stable to the current version
- Push the tag (git push --tags)
- Create a release in GitHub
The new tag on master will signal Travis to deploy a new package to PyPI, while the GitHub release is just for user documentation.
abcpy package¶
This reference gives details about the API of modules, classes and functions included in ABCpy.
The following diagram shows selected classes with their most important
methods. Abstract classes, which cannot be instantiated, are highlighted in
dark gray and derived classes are highlighted in light gray. Inheritance is
shown by filled arrows. Arrows with no filling highlight associations, e.g.,
Distance
is associated with Statistics
because it calls a method of the instantiated class to translate the input data to summary statistics.

abcpy.acceptedparametersmanager module¶
-
class abcpy.acceptedparametersmanager.AcceptedParametersManager(model)[source]¶
Bases: object
-
__init__
(model)[source]¶ This class manages the accepted parameters and other bds objects.
Parameters: model (list) – List of all root probabilistic models
-
broadcast
(backend, observations)[source]¶ Broadcasts the observations to observations_bds using the specified backend.
Parameters: - backend (abcpy.backends object) – The backend used by the inference algorithm
- observations (list) – A list containing all observed data
-
update_kernel_values
(backend, kernel_parameters)[source]¶ Broadcasts new parameters for each kernel
Parameters: - backend (abcpy.backends object) – The backend used by the inference algorithm
- kernel_parameters (list) – A list, in which each entry contains the values of the parameters associated with the corresponding kernel in the joint perturbation kernel
-
update_broadcast
(backend, accepted_parameters=None, accepted_weights=None, accepted_cov_mats=None)[source]¶ Updates the broadcasted values using the specified backend
Parameters: - backend (abcpy.backend object) – The backend to be used for broadcasting
- accepted_parameters (list) – The accepted parameters to be broadcasted
- accepted_weights (list) – The accepted weights to be broadcasted
- accepted_cov_mats (np.ndarray) – The accepted covariance matrix to be broadcasted
-
get_mapping
(models, is_root=True, index=0)[source]¶ Returns the order in which the models are discovered during recursive depth-first search. Commonly used when returning the accepted_parameters_bds for certain models.
Parameters: - models (list) – List of the root probabilistic models of the graph.
- is_root (boolean) – Specifies whether the current list of models is the list of overall root models
- index (integer) – The current index in depth-first search.
Returns: The first entry corresponds to the mapping of the root model, as well as all its parents. The second entry corresponds to the next index in depth-first search.
Return type: list
-
get_accepted_parameters_bds_values
(models)[source]¶ Returns the accepted bds values for the specified models.
Parameters: models (list) – Contains the probabilistic models for which the accepted bds values should be returned Returns: The accepted_parameters_bds values of all the probabilistic models specified in models. Return type: list
-
abcpy.approx_lhd module¶
-
class abcpy.approx_lhd.Approx_likelihood(statistics_calc)[source]¶
Bases: object
This abstract base class defines the approximate likelihood function.
-
__init__
(statistics_calc)[source]¶ The constructor of a sub-class must accept a non-optional statistics calculator, which is stored to self.statistics_calc.
Parameters: statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
loglikelihood
(y_obs, y_sim)[source]¶ To be overwritten by any sub-class: should compute the approximate loglikelihood value given the observed data set y_obs and the data set y_sim simulated from model set at the parameter value.
Parameters: - y_obs (Python list) – Observed data set.
- y_sim (Python list) – Simulated data set from model at the parameter value.
Returns: Computed approximate loglikelihood.
Return type: float
-
-
class abcpy.approx_lhd.SynLikelihood(statistics_calc)[source]¶
Bases: abcpy.approx_lhd.Approx_likelihood
This class implements the approximate likelihood function which computes the approximate likelihood using the synthetic likelihood approach described in Wood [1]. For synthetic likelihood approximation, we compute the robust precision matrix using Ledoit and Wolf’s [2] method.
[1] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, Aug. 2010.
[2] O. Ledoit and M. Wolf, A Well-Conditioned Estimator for Large-Dimensional Covariance Matrices, Journal of Multivariate Analysis, Volume 88, Issue 2, pages 365-411, February 2004.
-
__init__
(statistics_calc)[source]¶ The constructor of a sub-class must accept a non-optional statistics calculator, which is stored to self.statistics_calc.
Parameters: statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
loglikelihood
(y_obs, y_sim)[source]¶ To be overwritten by any sub-class: should compute the approximate loglikelihood value given the observed data set y_obs and the data set y_sim simulated from model set at the parameter value.
Parameters: - y_obs (Python list) – Observed data set.
- y_sim (Python list) – Simulated data set from model at the parameter value.
Returns: Computed approximate loglikelihood.
Return type: float
-
-
class abcpy.approx_lhd.PenLogReg(statistics_calc, model, n_simulate, n_folds=10, max_iter=100000, seed=None)[source]¶
Bases: abcpy.approx_lhd.Approx_likelihood, abcpy.graphtools.GraphTools
This class implements the approximate likelihood function which computes the approximate likelihood up to a constant using penalized logistic regression described in Dutta et. al. [1]. It takes one additional function handler defining the true model and two additional parameters n_folds and n_simulate correspondingly defining number of folds used to estimate prediction error using cross-validation and the number of simulated dataset sampled from each parameter to approximate the likelihood function. For lasso penalized logistic regression we use glmnet of Friedman et. al. [2].
[1] Thomas, O., Dutta, R., Corander, J., Kaski, S., & Gutmann, M. U. (2020). Likelihood-free inference by ratio estimation. Bayesian Analysis.
[2] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Parameters: - statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
- model (abcpy.models.Model) – Model object that conforms to the Model class.
- n_simulate (int) – Number of data points to simulate for the reference data set; this has to be the same as n_samples_per_param when calling the sampler. The reference data set is generated by drawing parameters from the prior and samples from the model when PenLogReg is instantiated.
- n_folds (int, optional) – Number of folds for cross-validation. The default value is 10.
- max_iter (int, optional) – Maximum passes over the data. The default is 100000.
- seed (int, optional) – Seed for the random number generator. The used glmnet solver is not deterministic, this seed is used for determining the cv folds. The default value is None.
-
__init__
(statistics_calc, model, n_simulate, n_folds=10, max_iter=100000, seed=None)[source]¶ The constructor of a sub-class must accept a non-optional statistics calculator, which is stored to self.statistics_calc.
Parameters: statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
loglikelihood
(y_obs, y_sim)[source]¶ To be overwritten by any sub-class: should compute the approximate loglikelihood value given the observed data set y_obs and the data set y_sim simulated from model set at the parameter value.
Parameters: - y_obs (Python list) – Observed data set.
- y_sim (Python list) – Simulated data set from model at the parameter value.
Returns: Computed approximate loglikelihood.
Return type: float
abcpy.backends module¶
-
class abcpy.backends.base.Backend[source]¶
Bases: object
This is the base class for every parallelization backend. It essentially resembles the map/reduce API from Spark.
An idea for the future is to implement an MPI version of the backend with the hope to be more compliant with standard HPC infrastructure and to obtain a potential speed-up.
-
parallelize
(list)[source]¶ This method distributes the list on the available workers and returns a reference object.
The list should be split into number of workers many parts. Each part should then be sent to a separate worker node.
Parameters: list (Python list) – the list that should get distributed on the worker nodes Returns: A reference object that represents the parallelized list Return type: PDS class (parallel data set)
-
broadcast
(object)[source]¶ Send object to all worker nodes without splitting it up.
Parameters: object (Python object) – An arbitrary object that should be available on all workers Returns: A reference to the broadcasted object Return type: BDS class (broadcast data set)
-
map
(func, pds)[source]¶ A distributed implementation of map that works on parallel data sets (PDS).
On every element of pds the function func is called.
Parameters: - func (Python func) – A function that can be applied to every element of the pds
- pds (PDS class) – A parallel data set to which func should be applied
Returns: a new parallel data set that contains the result of the map
Return type: PDS class
-
-
class abcpy.backends.base.PDS[source]¶
Bases: object
The reference class for parallel data sets (PDS).
-
class abcpy.backends.base.BDS[source]¶
Bases: object
The reference class for broadcast data set (BDS).
-
class abcpy.backends.base.BackendDummy[source]¶
Bases: abcpy.backends.base.Backend
This is a dummy parallelization backend, meaning it doesn’t parallelize anything. It is mainly implemented for testing purpose.
-
parallelize
(python_list)[source]¶ This actually does nothing: it just wraps the Python list into dummy pds (PDSDummy).
Parameters: python_list (Python list) – Returns: Return type: PDSDummy (parallel data set)
-
broadcast
(object)[source]¶ This actually does nothing: it just wraps the object into BDSDummy.
Parameters: object (Python object) – Returns: Return type: BDSDummy class
-
map
(func, pds)[source]¶ This is a wrapper for the Python internal map function.
Parameters: - func (Python func) – A function that can be applied to every element of the pds
- pds (PDSDummy class) – A pseudo-parallel data set to which func should be applied
Returns: a new pseudo-parallel data set that contains the result of the map
Return type: PDSDummy class
-
-
class abcpy.backends.base.PDSDummy(python_list)[source]¶
Bases: abcpy.backends.base.PDS
This is a wrapper for a Python list to fake parallelization.
-
class abcpy.backends.base.BDSDummy(object)[source]¶
Bases: abcpy.backends.base.BDS
This is a wrapper for a Python object to fake parallelization.
-
class abcpy.backends.spark.BackendSpark(sparkContext, parallelism=4)[source]¶
Bases: abcpy.backends.base.Backend
A parallelization backend for Apache Spark. It is essentially a wrapper for the required Spark functionality.
-
__init__
(sparkContext, parallelism=4)[source]¶ Initialize the backend with an existing and configured SparkContext.
Parameters: - sparkContext (pyspark.SparkContext) – an existing and fully configured PySpark context
- parallelism (int) – defines on how many workers a distributed dataset can be distributed
-
parallelize
(python_list)[source]¶ This is a wrapper of pyspark.SparkContext.parallelize().
Parameters: list (Python list) – list that is distributed on the workers Returns: A reference object that represents the parallelized list Return type: PDSSpark class (parallel data set)
-
broadcast
(object)[source]¶ This is a wrapper for pyspark.SparkContext.broadcast().
Parameters: object (Python object) – An arbitrary object that should be available on all workers Returns: A reference to the broadcasted object Return type: BDSSpark class (broadcast data set)
-
map
(func, pds)[source]¶ This is a wrapper for pyspark.rdd.map()
Parameters: - func (Python func) – A function that can be applied to every element of the pds
- pds (PDSSpark class) – A parallel data set to which func should be applied
Returns: a new parallel data set that contains the result of the map
Return type: PDSSpark class
-
-
class abcpy.backends.spark.PDSSpark(rdd)[source]¶
Bases: abcpy.backends.base.PDS
This is a wrapper for Apache Spark RDDs.
-
class abcpy.backends.spark.BDSSpark(bcv)[source]¶
Bases: abcpy.backends.base.BDS
This is a wrapper for Apache Spark Broadcast variables.
abcpy.continuousmodels module¶
-
class abcpy.continuousmodels.Uniform(parameters, name='Uniform')[source]¶
Bases: abcpy.probabilisticmodels.ProbabilisticModel, abcpy.probabilisticmodels.Continuous
-
__init__
(parameters, name='Uniform')[source]¶ This class implements a probabilistic model following a uniform distribution.
Parameters: - parameters (list) – Contains two lists. The first list specifies the probabilistic models and hyperparameters from which the lower bound of the uniform distribution derive. The second list specifies the probabilistic models and hyperparameters from which the upper bound derives.
- name (string, optional) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a uniform distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x. Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (list) – The point at which the pdf should be evaluated.
Returns: The evaluated pdf at point x.
Return type: Float
-
-
class abcpy.continuousmodels.Normal(parameters, name='Normal')[source]¶
Bases: abcpy.probabilisticmodels.ProbabilisticModel, abcpy.probabilisticmodels.Continuous
-
__init__
(parameters, name='Normal')[source]¶ This class implements a probabilistic model following a normal distribution with mean mu and variance sigma.
Parameters: - parameters (list) – Contains the probabilistic models and hyperparameters from which the model derives. The list has two entries: from the first entry mean of the distribution and from the second entry variance is derived. Note that the second value of the list is strictly greater than 0.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a normal distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x. Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters of the from [mu, sigma]
- x (list) – The point at which the pdf should be evaluated.
Returns: The evaluated pdf at point x.
Return type: Float
-
-
class abcpy.continuousmodels.StudentT(parameters, name='StudentT')[source]¶
Bases: abcpy.probabilisticmodels.ProbabilisticModel, abcpy.probabilisticmodels.Continuous
-
__init__
(parameters, name='StudentT')[source]¶ This class implements a probabilistic model following the Student’s T-distribution.
Parameters: - parameters (list) – Contains the probabilistic models and hyperparameters from which the model derives. The list has two entries: from the first entry mean of the distribution and from the second entry degrees of freedom is derived. Note that the second value of the list is strictly greater than 0.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a Student’s T-distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x. Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters
- x (list) – The point at which the pdf should be evaluated.
Returns: The evaluated pdf at point x.
Return type: Float
-
-
class abcpy.continuousmodels.MultivariateNormal(parameters, name='Multivariate Normal')[source]¶
Bases: abcpy.probabilisticmodels.ProbabilisticModel, abcpy.probabilisticmodels.Continuous
-
__init__
(parameters, name='Multivariate Normal')[source]¶ This class implements a probabilistic model following a multivariate normal distribution with mean and covariance matrix.
Parameters: - parameters (list of length 2) – Contains the probabilistic models and hyperparameters from which the model derives. The first entry defines the mean, while the second entry defines the covariance matrix. Note that if the mean is n dimensional, the covariance matrix is required to be of dimension nxn, symmetric and positive-definite.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a multivariate normal distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x. Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters
- x (list) – The point at which the pdf should be evaluated.
Returns: The evaluated pdf at point x.
Return type: Float
-
-
class abcpy.continuousmodels.MultiStudentT(parameters, name='MultiStudentT')[source]¶
Bases: abcpy.probabilisticmodels.ProbabilisticModel, abcpy.probabilisticmodels.Continuous
-
__init__
(parameters, name='MultiStudentT')[source]¶ This class implements a probabilistic model following the multivariate Student-T distribution.
Parameters: - parameters (list) – All but the last two entries contain the probabilistic models and hyperparameters from which the model derives. The second to last entry contains the covariance matrix. If the mean is of dimension n, the covariance matrix is required to be nxn dimensional. The last entry contains the degrees of freedom.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a multivariate Student’s T-distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x. Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters
- x (list) – The point at which the pdf should be evaluated.
Returns: The evaluated pdf at point x.
Return type: Float
-
abcpy.discretemodels module¶
-
class abcpy.discretemodels.Bernoulli(parameters, name='Bernoulli')[source]¶
Bases: abcpy.probabilisticmodels.Discrete, abcpy.probabilisticmodels.ProbabilisticModel
-
__init__
(parameters, name='Bernoulli')[source]¶ This class implements a probabilistic model following a bernoulli distribution.
Parameters: - parameters (list) – A list containing one entry, the probability of the distribution.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from the Bernoulli distribution associated with the probabilistic model.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples to be drawn.
- rng (random number generator) – The random number generator to be used.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pmf
(input_values, x)[source]¶ Evaluates the probability mass function at point x.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (float) – The point at which the pmf should be evaluated.
Returns: The pmf evaluated at point x.
Return type: float
-
-
class abcpy.discretemodels.Binomial(parameters, name='Binomial')[source]¶
Bases: abcpy.probabilisticmodels.Discrete, abcpy.probabilisticmodels.ProbabilisticModel
-
__init__
(parameters, name='Binomial')[source]¶ This class implements a probabilistic model following a binomial distribution.
Parameters: - parameters (list) – Contains the probabilistic models and hyperparameters from which the model derives. Note that the first entry of the list, n, an integer and has to be larger than or equal to 0, while the second entry, p, has to be in the interval [0,1].
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples from a binomial distribution using the current values for each probabilistic model from which the model derives.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples that should be drawn.
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pmf
(input_values, x)[source]¶ Calculates the probability mass function at point x.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (list) – The point at which the pmf should be evaluated.
Returns: The evaluated pmf at point x.
Return type: Float
-
-
class abcpy.discretemodels.Poisson(parameters, name='Poisson')[source]¶
Bases: abcpy.probabilisticmodels.Discrete, abcpy.probabilisticmodels.ProbabilisticModel
-
__init__
(parameters, name='Poisson')[source]¶ This class implements a probabilistic model following a poisson distribution.
Parameters: - parameters (list) – A list containing one entry, the mean of the distribution.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Samples k values from the defined Poisson distribution.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples.
- rng (random number generator) – The random number generator to be used.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pmf
(input_values, x)[source]¶ Calculates the probability mass function of the distribution at point x.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (integer) – The point at which the pmf should be evaluated.
Returns: The evaluated pmf at point x.
Return type: Float
-
-
class abcpy.discretemodels.DiscreteUniform(parameters, name='DiscreteUniform')[source]¶
Bases: abcpy.probabilisticmodels.Discrete, abcpy.probabilisticmodels.ProbabilisticModel
-
__init__
(parameters, name='DiscreteUniform')[source]¶ This class implements a probabilistic model following a Discrete Uniform distribution.
Parameters: - parameters (list) – A list containing two entries, the upper and lower bound of the range.
- name (string) – The name that should be given to the probabilistic model in the journal file.
-
forward_simulate(input_values, k, rng=np.random.RandomState())[source]¶ Samples from the Discrete Uniform distribution associated with the probabilistic model.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- k (integer) – The number of samples to be drawn.
- rng (random number generator) – The random number generator to be used.
Returns: list – A list containing the sampled values as np-array.
Return type: [np.ndarray]
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pmf
(input_values, x)[source]¶ Evaluates the probability mass function at point x.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (float) – The point at which the pmf should be evaluated.
Returns: The pmf evaluated at point x.
Return type: float
-
abcpy.distances module¶
-
class abcpy.distances.Distance(statistics_calc)[source]¶
Bases: object
This abstract base class defines how the distance between the observed and simulated data should be implemented.
-
__init__
(statistics_calc)[source]¶ The constructor of a sub-class must accept a non-optional statistics calculator as a parameter; then, it must call the __init__ method of the parent class. This ensures that the object is initialized correctly so that the _calculate_summary_stat private method can be called when computing the distances.
Parameters: statistics_calc (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
distance
(d1, d2)[source]¶ To be overwritten by any sub-class: should calculate the distance between two sets of data d1 and d2 using their respective statistics.
Usually, calling the _calculate_summary_stat private method to obtain statistics from the datasets is handy; that also keeps track of the first provided dataset (which is the observation in ABCpy inference schemes) and avoids computing the statistics for that multiple times.
Notes
The data sets d1 and d2 are array-like structures that contain n1 and n2 data points each. An implementation of the distance function should work along the following steps:
1. Transform both input sets dX = [ dX1, dX2, …, dXn ] to sX = [sX1, sX2, …, sXn] using the statistics object. See _calculate_summary_stat method.
2. Calculate the desired distance, here denoted by ‘-’, between the statistics; for instance, dist = [s11 - s21, s12 - s22, …, s1n - s2n] (in some cases, however, you may want to compute all pairwise distances between the statistics elements).
Important: any sub-class must not calculate the distance between data sets d1 and d2 directly. This is the reason why any sub-class must be initialized with a statistics object.
Parameters: - d1 (Python list) – Contains n1 data points.
- d2 (Python list) – Contains n2 data points.
Returns: The distance between the two input data sets.
Return type: numpy.float
-
dist_max
()[source]¶ To be overwritten by sub-class: should return maximum possible value of the desired distance function.
Examples
If the desired distance maps to \(\mathbb{R}\), this method should return numpy.inf.
Returns: The maximal possible value of the desired distance function. Return type: numpy.float
-
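As an illustration of these steps, the following is a minimal sketch of a custom sub-class; here _calculate_summary_stat is assumed to return the statistics of both datasets, as described above, and the mean absolute difference of the statistics is only an example choice of distance:
import numpy as np
from abcpy.distances import Distance

class MeanAbsoluteDistance(Distance):
    def distance(self, d1, d2):
        # step 1: transform both datasets into summary statistics
        # (this also caches the statistics of the observed dataset d1)
        s1, s2 = self._calculate_summary_stat(d1, d2)
        # step 2: compare the statistics; here the mean absolute difference
        return np.mean(np.abs(np.mean(s1, axis=0) - np.mean(s2, axis=0)))

    def dist_max(self):
        # this distance maps to the non-negative reals and is unbounded
        return np.inf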
-
class
abcpy.distances.
Euclidean
(statistics)[source]¶ Bases:
abcpy.distances.Distance
This class implements the Euclidean distance between two vectors.
The maximum value of the distance is np.inf.
-
__init__
(statistics)[source]¶ Parameters: statistics (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
distance
(d1, d2)[source]¶ Calculates the distance between two datasets, by computing the Euclidean distance between each element of d1 and d2 and taking their average.
Parameters: - d1 (Python list) – Contains n1 data points.
- d2 (Python list) – Contains n2 data points.
Returns: The distance between the two input data sets.
Return type: numpy.float
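A short usage sketch (with illustrative data values and the Identity statistics):
from abcpy.statistics import Identity
from abcpy.distances import Euclidean

statistics_calc = Identity(degree=1, cross=False)
distance_calc = Euclidean(statistics_calc)
# distance between an observed and a simulated dataset
dist = distance_calc.distance([170.0, 168.5, 181.2], [172.3, 165.0, 179.9])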
-
-
class
abcpy.distances.
PenLogReg
(statistics)[source]¶ Bases:
abcpy.distances.Distance
This class implements a distance measure based on the classification accuracy.
The classification accuracy is calculated between two datasets d1 and d2 using lasso-penalized logistic regression and returned as a distance. The lasso-penalized logistic regression is fitted using the glmnet package of Friedman et al. [2]. While computing the distance, the algorithm automatically chooses the most relevant summary statistics, as explained in Gutmann et al. [1]. The maximum value of the distance is 1.0.
[1] Gutmann, M. U., Dutta, R., Kaski, S., & Corander, J. (2018). Likelihood-free inference via classification. Statistics and Computing, 28(2), 411-425.
[2] Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
-
__init__
(statistics)[source]¶ Parameters: statistics (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
-
-
class
abcpy.distances.
LogReg
(statistics, seed=None)[source]¶ Bases:
abcpy.distances.Distance
This class implements a distance measure based on the classification accuracy [1]. The classification accuracy is calculated between two datasets d1 and d2 using logistic regression and returned as a distance. The maximum value of the distance is 1.0. The logistic regression may not converge when using a single sample in each dataset (for instance when setting n_samples_per_param=1 in an inference routine).
[1] Gutmann, M. U., Dutta, R., Kaski, S., & Corander, J. (2018). Likelihood-free inference via classification. Statistics and Computing, 28(2), 411-425.
-
__init__
(statistics, seed=None)[source]¶ Parameters: - statistics (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
- seed (integer, optional) – Seed used to initialize the random number generator used to determine the (random) cross-validation split in the logistic regression classifier.
-
-
class
abcpy.distances.
Wasserstein
(statistics, num_iter_max=100000)[source]¶ Bases:
abcpy.distances.Distance
This class implements a distance measure based on the 2-Wasserstein distance, as used in [1]. It considers the simulations/observations in the two datasets as iid samples (from the model for a fixed parameter value and from the data-generating process, respectively) and computes the 2-Wasserstein distance between the empirical distributions those simulations/observations define.
[1] Bernton, E., Jacob, P.E., Gerber, M. and Robert, C.P. (2019), Approximate Bayesian computation with the Wasserstein distance. J. R. Stat. Soc. B, 81: 235-269. doi:10.1111/rssb.12312
-
__init__
(statistics, num_iter_max=100000)[source]¶ Parameters: - statistics (abcpy.statistics.Statistics) – Statistics extractor object that conforms to the Statistics class.
- num_iter_max (integer, optional) – The maximum number of iterations in the linear programming algorithm to estimate the Wasserstein distance. Default to 100000.
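A construction sketch, reusing the Identity statistics from the Euclidean example above:
from abcpy.statistics import Identity
from abcpy.distances import Wasserstein

distance_calc = Wasserstein(Identity(degree=1, cross=False), num_iter_max=100000)
# distance_calc.distance(d1, d2) can then be used as with any other Distance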
-
abcpy.graphtools module¶
-
class
abcpy.graphtools.
GraphTools
[source]¶ Bases:
object
This class implements all methods that will be called recursively on the graph structure.
-
sample_from_prior
(model=None, rng=np.random.RandomState())[source]¶ Samples values for all random variables of the model. Commonly used to sample new parameter values on the whole graph.
Parameters: - model (abcpy.ProbabilisticModel object) – The root model for which sample_from_prior should be called.
- rng (Random number generator) – Defines the random number generator to be used
-
pdf_of_prior
(models, parameters, mapping=None, is_root=True)[source]¶ Calculates the joint probability density function of the prior of the specified models at the given parameter values. Commonly used to check whether new parameters are valid given the prior, as well as to calculate acceptance probabilities.
Parameters: - models (list of abcpy.ProbabilisticModel objects) – Defines the models for which the pdf of their prior should be evaluated
- parameters (python list) – The parameters at which the pdf should be evaluated
- mapping (list of tuples) – Defines the mapping of probabilistic models and index in a parameter list.
- is_root (boolean) – A flag specifying whether the provided models are the root models. This is to ensure that the pdf is calculated correctly.
Returns: The resulting pdf, as well as the next index to be considered in the parameters list.
Return type: list
-
get_parameters
(models=None, is_root=True)[source]¶ Returns the current values of all free parameters in the model. Commonly used before perturbing the parameters of the model.
Parameters: - models (list of abcpy.ProbabilisticModel objects) – The models for which, together with their parents, the parameter values should be returned. If no value is provided, the root models are assumed to be the model of the inference method.
- is_root (boolean) – Specifies whether the current models are at the root. This ensures that the values corresponding to simulated observations will not be returned.
Returns: A list containing all currently sampled values of the free parameters.
Return type: list
-
set_parameters
(parameters, models=None, index=0, is_root=True)[source]¶ Sets new values for the currently used values of each random variable. Commonly used after perturbing the parameter values using a kernel.
Parameters: - parameters (list) – Defines the values to which the respective parameter values of the models should be set
- model (list of abcpy.ProbabilisticModel objects) – Defines all models for which, together with their parents, new values should be set. If no value is provided, the root models are assumed to be the model of the inference method.
- index (integer) – The current index to be considered in the parameters list
- is_root (boolean) – Defines whether the current models are at the root. This ensures that only values corresponding to random variables will be set.
Returns: list – Returns whether it was possible to set all parameters and the next index to be considered in the parameters list.
Return type: [boolean, integer]
-
get_correct_ordering
(parameters_and_models, models=None, is_root=True)[source]¶ Orders the parameters returned by a kernel in the order required by the graph. Commonly used when perturbing the parameters.
Parameters: - parameters_and_models (list of tuples) – Contains tuples containing as the first entry the probabilistic model to be considered and as the second entry the parameter values associated with this model
- models (list) – Contains the root probabilistic models that make up the graph. If no value is provided, the root models are assumed to be the model of the inference method.
Returns: The ordering which can be used by recursive functions on the graph.
Return type: list
-
simulate
(n_samples_per_param, rng=np.random.RandomState(), npc=None)[source]¶ Simulates data of each model using the currently sampled or perturbed parameters.
Parameters: rng (random number generator) – The random number generator to be used. Returns: Each entry corresponds to the simulated data of one model. Return type: list
-
abcpy.inferences module¶
-
class
abcpy.inferences.
InferenceMethod
[source]¶ Bases:
abcpy.graphtools.GraphTools
This abstract base class represents an inference method.
-
sample
()[source]¶ To be overwritten by any sub-class: Samples from the posterior distribution of the model parameter given the observed data observations.
-
model
¶ an attribute specifying the model to be used
Type: To be overwritten by any sub-class
-
rng
¶ an attribute specifying the random number generator to be used
Type: To be overwritten by any sub-class
-
backend
¶ an attribute specifying the backend to be used.
Type: To be overwritten by any sub-class
-
n_samples
¶ an attribute specifying the number of samples to be generated
Type: To be overwritten by any sub-class
-
n_samples_per_param
¶ an attribute specifying the number of data points in each simulated data set.
Type: To be overwritten by any sub-class
-
-
class
abcpy.inferences.
BaseMethodsWithKernel
[source]¶ Bases:
object
This abstract base class represents inference methods that have a kernel.
-
kernel
¶ an attribute specifying the transition or perturbation kernel.
Type: To be overwritten by any sub-class
-
perturb
(column_index, epochs=10, rng=np.random.RandomState())[source]¶ Perturbs all free parameters, given the current weights. Commonly used during inference.
Parameters: - column_index (integer) – The index of the column in the accepted_parameters_bds that should be used for perturbation
- epochs (integer) – The number of times perturbation should happen before the algorithm is terminated
Returns: Whether it was possible to set new parameter values for all probabilistic models
Return type: boolean
-
-
class
abcpy.inferences.
BaseLikelihood
[source]¶ Bases:
abcpy.inferences.InferenceMethod
,abcpy.inferences.BaseMethodsWithKernel
This abstract base class represents inference methods that use the likelihood.
-
likfun
¶ an attribute specifying the likelihood function to be used.
Type: To be overwritten by any sub-class
-
-
class
abcpy.inferences.
BaseDiscrepancy
[source]¶ Bases:
abcpy.inferences.InferenceMethod
,abcpy.inferences.BaseMethodsWithKernel
This abstract base class represents inference methods that use a discrepancy measure.
-
distance
¶ an attribute specifying the distance function.
Type: To be overwritten by any sub-class
-
-
class
abcpy.inferences.
RejectionABC
(root_models, distances, backend, seed=None)[source]¶ Bases:
abcpy.inferences.InferenceMethod
This class implements the rejection algorithm based inference scheme [1] for Approximate Bayesian Computation.
[1] Tavaré, S., Balding, D., Griffiths, R., Donnelly, P.: Inferring coalescence times from DNA sequence data. Genetics 145(2), 505–518 (1997).
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure to compare simulated and observed data sets.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
epsilon
= None¶
-
__init__
(root_models, distances, backend, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
backend
= None¶
-
rng
= None¶
-
sample
(observations, n_samples, n_samples_per_param, epsilon, full_output=0)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- n_samples (integer) – Number of samples to generate
- n_samples_per_param (integer) – Number of data points in each simulated data set.
- epsilon (float) – Value of threshold
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
Returns: a journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
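As an end-to-end sketch of this inference scheme (a Gaussian model for heights with uniform priors; all numerical values are illustrative):
from abcpy.continuousmodels import Uniform, Normal
from abcpy.statistics import Identity
from abcpy.distances import Euclidean
from abcpy.backends import BackendDummy
from abcpy.inferences import RejectionABC

height_obs = [160.8, 167.2, 185.7, 153.7, 163.4]  # observed data (illustrative)
mu = Uniform([[150], [200]], name="mu")
sigma = Uniform([[5], [25]], name="sigma")
height = Normal([mu, sigma], name="height")

sampler = RejectionABC([height], [Euclidean(Identity(degree=1))], BackendDummy(), seed=1)
journal = sampler.sample([height_obs], n_samples=100, n_samples_per_param=10, epsilon=20)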
-
class
abcpy.inferences.
PMCABC
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements a modified version of the Population Monte Carlo based inference scheme for Approximate Bayesian computation of Beaumont et al. [1]. Here the threshold value at the t-th generation is adaptively chosen as the maximum between the epsilon_percentile-th value of the discrepancies of the accepted parameters at the (t-1)-th generation and the threshold value provided for this generation by the user. If the value of epsilon_percentile is set to zero, this method becomes the inference scheme described in [1], where the threshold values considered at each generation are the ones provided by the user.
[1] M. A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41(1):379–406, Nov. 2010.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure to compare simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= 2¶
-
n_samples_per_param
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
rng
= None¶
-
sample
(observations, steps, epsilon_init, n_samples=10000, n_samples_per_param=1, epsilon_percentile=10, covFactor=2, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Number of iterations in the sequential algorithm (“generations”)
- epsilon_init (numpy.ndarray) – An array of proposed values of epsilon to be used at each step. Either a single value, used as the threshold in Step 1, or a steps-dimensional array of values, used as the threshold in every step, can be supplied.
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- epsilon_percentile (float, optional) – A value between [0, 100]. The default value is 10.
- covFactor (float, optional) – scaling parameter of the covariance matrix. The default value is 2 as considered in [1].
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
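A call sketch for PMCABC, reusing the model, distance and backend defined in the RejectionABC example above; epsilon_init can be a single value or one value per generation:
import numpy as np
from abcpy.inferences import PMCABC

sampler = PMCABC([height], [Euclidean(Identity(degree=1))], BackendDummy(), seed=1)
journal = sampler.sample([height_obs], steps=3, epsilon_init=np.array([30]),
                         n_samples=200, n_samples_per_param=10, epsilon_percentile=10)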
-
class
abcpy.inferences.
PMC
(root_models, loglikfuns, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseLikelihood
,abcpy.inferences.InferenceMethod
Population Monte Carlo based inference scheme of Cappé et. al. [1].
This algorithm assumes that a likelihood function is available and can be evaluated at any parameter value given the observed dataset. When the likelihood function is not available or cannot be evaluated at a reasonable computational expense, we use the approximated likelihood functions in the abcpy.approx_lhd module, for which the argument for the consistency of the inference scheme is based on Andrieu and Roberts [2].
[1] Cappé, O., Guillin, A., Marin, J.-M., and Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4), 907–929.
[2] C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725, 04 2009.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- loglikfun (abcpy.approx_lhd.Approx_likelihood) – Approx_loglikelihood object defining the approximated loglikelihood to be used.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
__init__
(root_models, loglikfuns, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
likfun
= None¶
-
kernel
= None¶
-
backend
= None¶
-
rng
= None¶
-
sample
(observations, steps, n_samples=10000, n_samples_per_param=100, covFactors=None, iniPoints=None, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – number of iterations in the sequential algorithm (“generations”)
- n_samples (integer, optional) – number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – number of data points in each simulated data set. The default value is 100.
- covFactors (list of float, optional) – Scaling parameters of the covariance matrices. The default is a p-dimensional array of ones, where p is the dimension of the parameter.
- iniPoints (numpy.ndarray, optional) – Parameter values from which the sampling starts. By default, sampled from the prior.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
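A call sketch for PMC with an approximate likelihood (SynLikelihood from abcpy.approx_lhd; model, statistics and backend as in the examples above):
from abcpy.approx_lhd import SynLikelihood
from abcpy.inferences import PMC

approx_lhd = SynLikelihood(Identity(degree=1))
sampler = PMC([height], [approx_lhd], BackendDummy(), seed=1)
journal = sampler.sample([height_obs], steps=3, n_samples=200, n_samples_per_param=50)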
-
class
abcpy.inferences.
SABC
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements a modified version of Simulated Annealing Approximate Bayesian Computation (SABC) of [1] when the prior is non-informative.
[1] C. Albert, H. R. Kuensch and A. Scheidegger. A Simulated Annealing Approach to Approximate Bayes Computations. Statistics and Computing, (2014).
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure used to compare simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
epsilon
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
rng
= None¶
-
smooth_distances_bds
= None¶
-
all_distances_bds
= None¶
-
sample
(observations, steps, epsilon, n_samples=10000, n_samples_per_param=1, beta=2, delta=0.2, v=0.3, ar_cutoff=0.1, resample=None, n_update=None, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Maximum number of iterations in the sequential algorithm (“generations”)
- epsilon (numpy.float) – A proposed value of threshold to start with.
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- beta (numpy.float, optional) – Tuning parameter of SABC, default value is 2. Used to scale up the covariance matrices obtained from data.
- delta (numpy.float, optional) – Tuning parameter of SABC, default value is 0.2.
- v (numpy.float, optional) – Tuning parameter of SABC, the default value is 0.3.
- ar_cutoff (numpy.float, optional) – Acceptance ratio cutoff: if the acceptance rate at some iteration of the algorithm is lower than that, the algorithm will stop. The default value is 0.1.
- resample (int, optional) – At any iteration, perform a resampling step if the number of accepted particles is larger than resample. When not provided, it assumes resample to be equal to n_samples.
- n_update (int, optional) – Number of perturbed parameters at each step. The default value is None, in which case the value is taken from n_samples.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
-
class
abcpy.inferences.
ABCsubsim
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements Approximate Bayesian Computation by subset simulation (ABCsubsim) algorithm of [1].
[1] M. Chiachio, J. L. Beck, J. Chiachio, and G. Rus., Approximate Bayesian computation by subset simulation. SIAM J. Sci. Comput., 36(3):A1339–A1358, 2014/10/03 2014.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance used to compare the simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
chain_length
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
rng
= None¶
-
anneal_parameter
= None¶
-
sample
(observations, steps, n_samples=10000, n_samples_per_param=1, chain_length=10, ap_change_cutoff=10, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Number of iterations in the sequential algorithm (“generations”)
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- chain_length (int, optional) – The length of the chains; the default value is 10. It should be a divisor of n_samples.
- ap_change_cutoff (float, optional) – The cutoff value for the percentage change in the anneal parameter. If the change is less than ap_change_cutoff the iterations are stopped. The default value is 10.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
-
class
abcpy.inferences.
RSMCABC
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements Replenishment Sequential Monte Carlo Approximate Bayesian computation of Drovandi and Pettitt [1].
[1] C. C. Drovandi and A. N. Pettitt, Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics 67(1):225–233, 2011.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure used to compare simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
alpha
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
R
= None¶
-
rng
= None¶
-
accepted_dist_bds
= None¶
-
sample
(observations, steps, n_samples=10000, n_samples_per_param=1, alpha=0.1, epsilon_init=100, epsilon_final=0.1, const=0.01, covFactor=2.0, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Number of iterations in the sequential algorithm (“generations”)
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- alpha (float, optional) – A parameter taking values between [0,1], the default value is 0.1.
- epsilon_init (float, optional) – Initial value of threshold, the default is 100
- epsilon_final (float, optional) – Terminal value of threshold, the default is 0.1
- const (float, optional) – A constant to compute acceptance probability, the default is 0.01.
- covFactor (float, optional) – scaling parameter of the covariance matrix. The default value is 2.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
-
class
abcpy.inferences.
APMCABC
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements Adaptive Population Monte Carlo Approximate Bayesian computation of M. Lenormand et al. [1].
[1] M. Lenormand, F. Jabot and G. Deffuant, Adaptive approximate Bayesian computation for complex models. Computational Statistics, 28:2777–2796, 2013.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure used to compare simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
alpha
= None¶
-
accepted_dist
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
epsilon
= None¶
-
rng
= None¶
-
sample
(observations, steps, n_samples=10000, n_samples_per_param=1, alpha=0.1, acceptance_cutoff=0.03, covFactor=2.0, full_output=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Number of iterations in the sequential algorithm (“generations”)
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- alpha (float, optional) – A parameter taking values between [0,1], the default value is 0.1.
- acceptance_cutoff (float, optional) – Acceptance ratio cutoff, should be chosen between 0.01 and 0.03
- covFactor (float, optional) – scaling parameter of the covariance matrix. The default value is 2.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
-
class
abcpy.inferences.
SMCABC
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Bases:
abcpy.inferences.BaseDiscrepancy
,abcpy.inferences.InferenceMethod
This class implements Sequential Monte Carlo Approximate Bayesian computation of Del Moral et al. [1].
[1] P. Del Moral, A. Doucet, A. Jasra, An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020, 2012.
[2] Lee, Anthony. “On the choice of MCMC kernels for approximate Bayesian computation with SMC samplers.” Proceedings of the 2012 Winter Simulation Conference (WSC). IEEE, 2012.
Parameters: - model (list) – A list of the Probabilistic models corresponding to the observed datasets
- distance (abcpy.distances.Distance) – Distance object defining the distance measure used to compare simulated and observed data sets.
- kernel (abcpy.distributions.Distribution) – Distribution object defining the perturbation kernel needed for the sampling.
- backend (abcpy.backends.Backend) – Backend object defining the backend to be used.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
n_samples
= None¶
-
n_samples_per_param
= None¶
-
__init__
(root_models, distances, backend, kernel=None, seed=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
model
= None¶
-
distance
= None¶
-
kernel
= None¶
-
backend
= None¶
-
epsilon
= None¶
-
rng
= None¶
-
accepted_y_sim_bds
= None¶
-
sample
(observations, steps, n_samples=10000, n_samples_per_param=1, epsilon_final=0.1, alpha=0.95, covFactor=2, resample=None, full_output=0, which_mcmc_kernel=0, journal_file=None)[source]¶ Samples from the posterior distribution of the model parameter given the observed data observations.
Parameters: - observations (list) – A list, containing lists describing the observed data sets
- steps (integer) – Number of iterations in the sequential algorithm (“generations”)
- n_samples (integer, optional) – Number of samples to generate. The default value is 10000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
- epsilon_final (float, optional) – The final threshold value of epsilon to be reached; if at some iteration you reach a lower epsilon than epsilon_final, the algorithm will stop and not proceed with further iterations. The default value is 0.1.
- alpha (float, optional) – A parameter taking values between [0,1], determining the rate of change of the threshold epsilon. The default value is 0.95.
- covFactor (float, optional) – scaling parameter of the covariance matrix. The default value is 2.
- resample (float, optional) – It defines the resample step: introduce a resample step, after the particles have been perturbed and the new weights have been computed, if the effective sample size is smaller than resample. If not provided, resample is set to 0.5 * n_samples.
- full_output (integer, optional) – If full_output==1, intermediate results are included in output journal. The default value is 0, meaning the intermediate results are not saved.
- which_mcmc_kernel (integer, optional) – Specifies which MCMC kernel is used: 0 uses the kernel suggested in [1]; any other value uses the r-hit kernel suggested by Anthony Lee [2]. The default value is 0.
- journal_file (str, optional) – Filename of a journal file to read an already saved journal file, from which the first iteration will start. The default value is None.
Returns: A journal containing simulation results, metadata and optionally intermediate results.
Return type: abcpy.output.Journal
abcpy.modelselections module¶
-
class
abcpy.modelselections.
ModelSelections
(model_array, statistics_calc, backend, seed=None)[source]¶ Bases:
object
This abstract base class defines a model selection rule of how to choose a model from a set of models given an observation.
-
__init__
(model_array, statistics_calc, backend, seed=None)[source]¶ Constructor that must be overwritten by the sub-class.
The constructor of a sub-class must accept an array of models from which to choose, and two non-optional parameters, a statistics calculator and a backend, stored in self.statistics_calc and self.backend, defining how to calculate summary statistics from data and what kind of parallelization to use.
Parameters: - model_array (list) – A list of models which are of type abcpy.probabilisticmodels
- statistics (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
select_model
(observations, n_samples=1000, n_samples_per_param=100)[source]¶ To be overwritten by any sub-class: returns the model selected by the model selection procedure as most suitable for the observed data set observations. Two further optional integer arguments, n_samples and n_samples_per_param, can be supplied, denoting the number of samples in the reference table and the number of data points in each simulated data set.
Parameters: - observations (python list) – The observed data set.
- n_samples (integer, optional) – Number of samples to generate for reference table.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set.
Returns: A model of type abcpy.probabilisticmodels
Return type: abcpy.probabilisticmodels
-
posterior_probability
(observations)[source]¶ To be overwritten by any sub-class: returns the approximate posterior probability of the chosen model given the observed data set observations.
Parameters: observations (python list) – The observed data set. Returns: A vector containing the approximate posterior probability of the model chosen. Return type: np.ndarray
-
-
class
abcpy.modelselections.
RandomForest
(model_array, statistics_calc, backend, N_tree=100, n_try_fraction=0.5, seed=None)[source]¶ Bases:
abcpy.modelselections.ModelSelections
,abcpy.graphtools.GraphTools
This class implements the model selection procedure based on the Random Forest ensemble learner as described in Pudlo et al. [1].
[1] Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M. and Robert, C. (2016). Reliable ABC model choice via random forests. Bioinformatics, 32 859–866.
-
__init__
(model_array, statistics_calc, backend, N_tree=100, n_try_fraction=0.5, seed=None)[source]¶ Parameters: - N_tree (integer, optional) – Number of trees in the random forest. The default value is 100.
- n_try_fraction (float, optional) – The fraction of number of summary statistics to be considered as the size of the number of covariates randomly sampled at each node by the randomised CART. The default value is 0.5.
-
select_model
(observations, n_samples=1000, n_samples_per_param=1)[source]¶ Parameters: - observations (python list) – The observed data set.
- n_samples (integer, optional) – Number of samples to generate for reference table. The default value is 1000.
- n_samples_per_param (integer, optional) – Number of data points in each simulated data set. The default value is 1.
Returns: A model of type abcpy.probabilisticmodels
Return type: abcpy.probabilisticmodels
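A sketch of a model selection run between two candidate models for the same observations (the StudentT alternative and all numerical values are illustrative; mu, sigma, height_obs, statistics and backend are as in the inference examples above):
from abcpy.continuousmodels import StudentT
from abcpy.modelselections import RandomForest

model_1 = Normal([mu, sigma], name="gaussian_height")
model_2 = StudentT([mu, 4], name="studentt_height")
modelselection = RandomForest([model_1, model_2], Identity(degree=1), BackendDummy(), seed=1)
best_model = modelselection.select_model([height_obs], n_samples=500, n_samples_per_param=1)
probability = modelselection.posterior_probability([height_obs])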
-
abcpy.NN_utilities module¶
Functions and classes needed for the neural network based summary statistics learning.
-
abcpy.NN_utilities.algorithms.
contrastive_training
(samples, similarity_set, embedding_net, cuda, batch_size=16, n_epochs=200, samples_val=None, similarity_set_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, positive_weight=None, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶ Implements the algorithm for the contrastive distance learning training of a neural network; it needs to be provided with a set of samples and the corresponding similarity matrix.
-
abcpy.NN_utilities.algorithms.
triplet_training
(samples, similarity_set, embedding_net, cuda, batch_size=16, n_epochs=400, samples_val=None, similarity_set_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶ Implements the algorithm for the triplet distance learning training of a neural network; it needs to be provided with a set of samples and the corresponding similarity matrix.
-
abcpy.NN_utilities.algorithms.
FP_nn_training
(samples, target, embedding_net, cuda, batch_size=1, n_epochs=50, samples_val=None, target_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, load_all_data_GPU=False, lr=0.001, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶ Implements the algorithm for the training of a neural network based on regressing the values of the parameters on the corresponding simulation outcomes; it is effectively a training with a mean squared error loss. Needs to be provided with a set of samples and the corresponding parameters that generated the samples. Note that in this case the network has to have same output size as the number of parameters, as the learned summary statistic will have the same dimension as the parameter.
-
class
abcpy.NN_utilities.datasets.
Similarities
(samples, similarity_matrix, device)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
A dataset class that considers a set of samples and pairwise similarities defined between them. Note that, for our application of computing distances, we are not interested in train/test split.
-
class
abcpy.NN_utilities.datasets.
SiameseSimilarities
(similarities_dataset, positive_weight=None)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
This class defines a dataset returning pairs of similar and dissimilar samples. It has to be instantiated with a dataset of the class Similarities
-
class
abcpy.NN_utilities.datasets.
TripletSimilarities
(similarities_dataset)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
This class defines a dataset returning triplets of anchor, positive and negative samples. It has to be instantiated with a dataset of the class Similarities.
-
class
abcpy.NN_utilities.datasets.
ParameterSimulationPairs
(simulations, parameters, device)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
A dataset class that consists of parameter-simulation pairs, in which the data contains the simulations, with shape (n_samples, n_features), and targets contains the ground truth of the parameters, with shape (n_samples, 2). Note that n_features could also have more than one dimension here.
-
class
abcpy.NN_utilities.losses.
ContrastiveLoss
(margin)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
Contrastive loss: takes embeddings of two samples and a target label == 1 if the samples are from the same class and label == 0 otherwise.
-
class
abcpy.NN_utilities.losses.
TripletLoss
(margin)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
Triplet loss: takes embeddings of an anchor sample, a positive sample and a negative sample.
-
class
abcpy.NN_utilities.networks.
SiameseNet
(embedding_net)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
This is used in the contrastive distance learning. It is a network wrapping a standard neural network and feeding two samples through it at once.
-
class
abcpy.NN_utilities.networks.
TripletNet
(embedding_net)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
This is used in the triplet distance learning. It is a network wrapping a standard neural network and feeding three samples through it at once.
-
abcpy.NN_utilities.networks.
createDefaultNN
(input_size, output_size, hidden_sizes=None, nonlinearity=None)[source]¶ Function returning a fully connected neural network class with a given input and output size, and optionally given hidden layer sizes (if these are not given, they are determined from the input and output size with some expression).
In order to instantiate the network, you need to write: createDefaultNN(input_size, output_size)() as the function returns a class, and () is needed to instantiate an object.
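A short sketch of that usage (torch needs to be installed for this to work):
from abcpy.NN_utilities.networks import createDefaultNN

net_class = createDefaultNN(input_size=10, output_size=2)  # returns a class
net = net_class()  # the trailing () instantiates the actual network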
-
class
abcpy.NN_utilities.networks.
ScalerAndNet
(net, scaler)[source]¶ Bases:
sphinx.ext.autodoc.importer._MockObject
Defines a nn.Module class that wraps a scaler and a neural network, and applies the scaler before passing the data through the neural network.
-
abcpy.NN_utilities.utilities.
dist2
(x, y)[source]¶ Computes the square of the Euclidean distance between two arrays of the same length.
-
abcpy.NN_utilities.utilities.
compute_similarity_matrix
(target, quantile=0.1, return_pairwise_distances=False)[source]¶ Computes the similarity matrix between some values, given a quantile of the Euclidean distances.
If return_pairwise_distances is True, it also returns the matrix of pairwise distances.
abcpy.output module¶
-
class
abcpy.output.
Journal
(type)[source]¶ Bases:
object
The journal holds information created by the run of inference schemes.
It can be configured to hold intermediate results as well.
-
parameters
¶ a nxpxt matrix
Type: numpy.array
-
weights
¶ a nxt matrix
Type: numpy.array
-
opt_value
¶ nxp matrix containing for each parameter the evaluated objective function for every time step
Type: numpy.array
-
configuration
¶ dictionary containing the schemes configuration parameters
Type: Python dictionary
-
__init__
(type)[source]¶ Initializes a new output journal of given type.
Parameters: type (int (identifying type)) – type=0 only logs final parameters and weights (production use); type=1 logs all generated information (reproducibility use).
-
classmethod
fromFile
(filename)[source]¶ This method reads a saved journal from disk and returns it as an object.
Notes
To store a journal use Journal.save(filename).
Parameters: filename (string) – The string representing the location of a file Returns: The journal object serialized in <filename> Return type: abcpy.output.Journal Example
>>> jnl = Journal.fromFile('example_output.jnl')
-
add_user_parameters
(names_and_params)[source]¶ Saves the provided parameters and names of the probabilistic models corresponding to them. If type==0, old parameters get overwritten.
Parameters: names_and_params (list) – Each entry is a tuple, where the first entry is the name of the probabilistic model, and the second entry is the parameters associated with this model.
-
add_accepted_parameters
(accepted_parameters)[source]¶ Saves provided accepted parameters by appending them to the journal. If type==0, old accepted parameters get overwritten.
Parameters: accepted_parameters (list) –
-
add_weights
(weights)[source]¶ Saves provided weights by appending them to the journal. If type==0, old weights get overwritten.
Parameters: weights (numpy.array) – vector containing n weights
-
add_distances
(distances)[source]¶ Saves provided distances by appending them to the journal. If type==0, old distances get overwritten.
Parameters: distances (numpy.array) – vector containing n distances
-
add_opt_values
(opt_values)[source]¶ Saves provided values of the evaluation of the scheme's objective function. If type==0, old values get overwritten.
Parameters: opt_value (numpy.array) – vector containing n evaluations of the scheme's objective function
-
add_ESS_estimate
(weights)[source]¶ Computes and saves the Effective Sample Size (ESS) estimate starting from the provided weights; ESS is estimated as the inverse of the sum of squared normalized weights. The provided weights are normalized before computing the ESS. If type==0, the old ESS estimate gets overwritten.
Parameters: weights (numpy.array) – vector containing n weights
-
save
(filename)[source]¶ Stores the journal to disk.
Parameters: filename (string) – the location of the file to store the current object to.
-
get_parameters
(iteration=None)[source]¶ Returns the parameters from a sampling scheme.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return parameters Returns: names_and_parameters – Samples from the specified iteration (last, if not specified) returned as a dictionary with names of the random variables Return type: dictionary
-
get_accepted_parameters
(iteration=None)[source]¶ Returns the accepted parameters from a sampling scheme.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return parameters Returns: accepted_parameters – Samples from the specified iteration (last, if not specified) returned as a dictionary with names of the random variables Return type: dictionary
-
get_weights
(iteration=None)[source]¶ Returns the weights from a sampling scheme.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return weights
-
get_distances
(iteration=None)[source]¶ Returns the distances from a sampling scheme.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return distances
-
get_ESS_estimates
(iteration=None)[source]¶ Returns the estimate of Effective Sample Size (ESS) from a sampling scheme.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return ESS
-
posterior_mean
(iteration=None)[source]¶ Computes the posterior mean from the samples drawn from the posterior distribution.
For intermediate results, pass the iteration.
Parameters: iteration (int) – specify the iteration for which to return posterior mean Returns: posterior mean – Posterior mean from the specified iteration (last, if not specified) returned as a dictionary with names of the random variables Return type: dictionary
-
posterior_cov
(iteration=None)[source]¶ Computes the posterior covariance from the samples drawn from the posterior distribution.
Returns: - np.ndarray – posterior covariance
- dict – order of the variables in the covariance matrix
-
posterior_histogram
(iteration=None, n_bins=10)[source]¶ Computes a weighted histogram of multivariate posterior samples and returns histogram H and a list of p arrays describing the bin edges for each dimension.
Returns: containing two elements (H = np.ndarray, edges = list of p arrays) Return type: python list
-
plot_posterior_distr
(parameters_to_show=None, ranges_parameters=None, iteration=None, show_samples=None, single_marginals_only=False, double_marginals_only=False, write_posterior_mean=True, show_posterior_mean=True, true_parameter_values=None, contour_levels=14, figsize=None, show_density_values=True, bw_method=None, path_to_save=None)[source]¶ Produces a visualization of the posterior distribution of the parameters of the model.
A Gaussian kernel density estimate (KDE) is used to approximate the density starting from the sampled parameters. Specifically, it produces a scatterplot matrix, where the diagonal contains single parameter marginals, while the off diagonal elements contain the contourplot for the paired marginals for each possible pair of parameters.
This visualization is not satisfactory for parameters that take on discrete values, especially when the number of values they can assume is small, as the posterior is obtained by KDE in this case as well. This could be improved by using histograms for discrete parameters.
Parameters: - parameters_to_show (list, optional) – a list of the parameters for which you want to plot the posterior distribution. For each parameter, you need to provide the name string as it was defined in the model. For instance, jrnl.plot_posterior_distr(parameters_to_show=[“mu”]) will only plot the posterior distribution for the parameter named “mu” in the list of parameters. If None, then all parameters will be displayed.
- ranges_parameters (Python dictionary, optional) – a dictionary in which you can optionally provide the plotting range for the parameters that you chose to display. You can use this even if parameters_to_show=None. The dictionary key is the name of parameter, and the range needs to be an array-like of the form [lower_limit, upper_limit]. For instance: {“theta” : [0,2]} specifies that you want to plot the posterior distribution for the parameter “theta” in the range [0,2].
- iteration (int, optional) – specify the iteration for which to plot the posterior distribution, in the case of a sequential algorithm. If None, then the last iteration will be used.
- show_samples (boolean, optional) – specifies if you want to show the posterior samples superimposed on the contour plots of the posterior distribution. If None, the default behaviour is the following: if the posterior samples are associated with importance weights, then the samples are not shown (in fact, the KDE for the posterior distribution takes into account the weights, and showing the samples may be misleading). Otherwise, if the posterior samples are not associated with weights, they are displayed by default.
- single_marginals_only (boolean, optional) – if True, the method does not show the paired marginals but only the single parameter marginals; otherwise, it shows the paired marginals as well. Default to False.
- double_marginals_only (boolean, optional) – if True, the method shows the contour plot for the marginal posterior for each possible pair of parameters in the parameters that have to be shown (all parameters of the model if parameters_to_show is None). Default to False.
- write_posterior_mean (boolean, optional) – Whether to write or not the posterior mean on the single marginal plots. Default to True.
- show_posterior_mean (boolean, optional) – Whether to display a line corresponding to the posterior mean value in the plot. Default to True.
- true_parameter_values (array-like, optional) – you can provide here the true values of the parameters, if known, and they will be displayed in the posterior plot. It has to be an array-like of the same length as parameters_to_show (if that is provided), otherwise of length equal to the number of parameters in the model, with entries corresponding to the true value of each parameter (in case parameters_to_show is not provided, the order of the parameters is the same order the model forward_simulate step takes).
- contour_levels (integer, optional) – The number of levels to be used in the contour plots. Default to 14.
- figsize (float, optional) – Denotes the size (in inches) of the smaller dimension of the plot; the other dimension is automatically determined. If None, then figsize is chosen automatically. Default to None.
- show_density_values (boolean, optional) – If True, the method displays the value of the density at each contour level in the contour plot. Default to True.
- bw_method (str, scalar or callable, optional) – The parameter of scipy.stats.gaussian_kde defining the method used to calculate the bandwidth in the Gaussian kernel density estimator. Please refer to the documentation therein for details. Default to None.
- path_to_save (string, optional) – if provided, save the figure in png format in the specified path.
Returns: a tuple containing the matplotlib “fig, axes” objects defining the plot. Can be useful for further modifications.
Return type: tuple
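A post-processing sketch, assuming journal is a Journal returned by one of the samplers above and that the model has parameters named "mu" and "sigma":
print(journal.posterior_mean())
fig, axes = journal.plot_posterior_distr(parameters_to_show=["mu", "sigma"],
                                         true_parameter_values=[170, 15],
                                         path_to_save="posterior.png")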
-
plot_ESS
()[source]¶ Produces a plot showing the evolution of the estimated ESS (from sample weights) across iterations; it also shows as a baseline the maximum possible ESS which can be achieved, corresponding to the case of independent samples, which is equal to the total number of samples.
Returns: a tuple containing the matplotlib “fig, ax” objects defining the plot. Can be useful for further modifications. Return type: tuple
-
Wass_convergence_plot
(num_iter_max=100000000.0, **kwargs)[source]¶ Computes the Wasserstein distance between the empirical distributions at subsequent iterations to see whether the approximation of the posterior is converging; then, it produces a plot displaying that. The approximation of the posterior is converging if the Wasserstein distance between subsequent iterations decreases with the iteration and gets close to 0, as that means there is no evolution of the posterior samples. The Wasserstein distance is estimated using the POT library.
This method only works when the Journal stores results from all the iterations (i.e., it was generated with full_output=1). Moreover, this only works when all the parameters in the model are univariate.
Parameters: - num_iter_max (integer, optional) – The maximum number of iterations in the linear programming algorithm to estimate the Wasserstein distance. Default to 1e8.
- kwargs – Additional arguments passed to the wass_dist calculation function.
Returns: a tuple containing the matplotlib “fig, ax” objects defining the plot and the list of the computed Wasserstein distances. “fig” and “ax” can be useful for further modifying the plot.
Return type: tuple
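A short sketch combining the two diagnostics above (assuming a journal generated with full_output=1, a model with univariate parameters only, and the POT package installed; the file name is illustrative):

from abcpy.output import Journal

journal = Journal.fromFile("gaussian_height.jnl")  # illustrative file name

# Effective sample size across iterations.
fig_ess, ax_ess = journal.plot_ESS()

# Wasserstein distance between posterior samples at subsequent iterations;
# a decreasing sequence approaching 0 indicates convergence.
fig_wass, ax_wass, wass_distances = journal.Wass_convergence_plot(num_iter_max=10 ** 7)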
-
abcpy.perturbationkernel module¶
-
class
abcpy.perturbationkernel.
PerturbationKernel
(models)[source]¶ Bases:
object
This abstract base class represents all perturbation kernels
-
__init__
(models)[source]¶ Parameters: models (list) – The list of abcpy.probabilisticmodel objects that should be perturbed by this kernel.
-
calculate_cov
(accepted_parameters_manager, kernel_index)[source]¶ Calculates the covariance matrix for the kernel.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel.
Returns: The covariance matrix for the kernel.
Return type: numpy.ndarray
-
update
(accepted_parameters_manager, row_index, rng)[source]¶ Perturbs the parameters for this kernel.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- row_index (integer) – The index of the accepted parameters bds that should be perturbed.
- rng (random number generator) – The random number generator to be used.
Returns: The perturbed parameters.
Return type: numpy.ndarray
-
pdf
(accepted_parameters_manager, kernel_index, row_index, x)[source]¶ Calculates the pdf of the kernel at point x.
Parameters: - accepted_parameters_manager (abcpy.acceptedparametersmanager object) – The accepted parameters manager that manages all bds objects.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint perturbation kernel.
- row_index (integer) – The index of the accepted parameters bds for which the pdf should be evaluated.
- x (list or float) – The point at which the pdf should be evaluated.
Returns: The pdf evaluated at point x.
Return type: float
-
-
class
abcpy.perturbationkernel.
ContinuousKernel
[source]¶ Bases:
object
This abstract base class represents all perturbation kernels acting on continuous parameters.
-
class
abcpy.perturbationkernel.
DiscreteKernel
[source]¶ Bases:
object
This abstract base class represents all perturbation kernels acting on discrete parameters.
-
class
abcpy.perturbationkernel.
JointPerturbationKernel
(kernels)[source]¶ Bases:
abcpy.perturbationkernel.PerturbationKernel
-
__init__
(kernels)[source]¶ This class joins different kernels to make up the overall perturbation kernel. Any user-implemented perturbation kernel should derive from this class. Any kernels defined on their own should be joined in the end using this class.
Parameters: kernels (list) – List of abcpy.PerturbationKernels
-
calculate_cov
(accepted_parameters_manager)[source]¶ Calculates the covariance matrix corresponding to each kernel. Commonly used before calculating weights to avoid repeated calculation.
Parameters: accepted_parameters_manager (abcpy.AcceptedParametersManager object) – The AcceptedParametersManager to be used. Returns: Each entry corresponds to the covariance matrix of the corresponding kernel. Return type: list
-
update
(accepted_parameters_manager, row_index, rng=np.random.RandomState())[source]¶ Perturbs the parameter values contained in accepted_parameters_manager. Commonly used while perturbing.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – Defines the AcceptedParametersManager to be used.
- row_index (integer) – The index of the row that should be considered from the accepted_parameters_bds matrix.
- rng (random number generator) – The random number generator to be used.
Returns: The list contains tuples. Each tuple contains as the first entry a probabilistic model and as the second entry the perturbed parameter values corresponding to this model.
Return type: list
-
pdf
(mapping, accepted_parameters_manager, mean, x)[source]¶ Calculates the overall pdf of the kernel. Commonly used to calculate weights.
Parameters: - mapping (list) – Each entry is a tuple of which the first entry is a abcpy.ProbabilisticModel object, the second entry is the index in the accepted_parameters_bds list corresponding to an output of this model.
- accepted_parameters_manager (abcpy.AcceptedParametersManager object) – The AcceptedParametersManager to be used.
- index (integer) – The row to be considered in the accepted_parameters_bds matrix.
- x (The point at which the pdf should be evaluated.) –
Returns: The pdf evaluated at point x.
Return type: float
-
-
class
abcpy.perturbationkernel.
MultivariateNormalKernel
(models)[source]¶ Bases:
abcpy.perturbationkernel.PerturbationKernel
,abcpy.perturbationkernel.ContinuousKernel
This class defines a kernel perturbing the parameters using a multivariate normal distribution.
-
__init__
(models)[source]¶ Parameters: models (list) – The list of abcpy.probabilisticmodel objects that should be perturbed by this kernel.
-
calculate_cov
(accepted_parameters_manager, kernel_index)[source]¶ Calculates the covariance matrix relevant to this kernel.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint kernel.
Returns: The covariance matrix corresponding to this kernel.
Return type: list
-
update
(accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState())[source]¶ Updates the parameter values contained in the accepted_parameters_manager using a multivariate normal distribution.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – Defines the AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels in the joint kernel.
- row_index (integer) – The index of the row that should be considered from the accepted_parameters_bds matrix.
- rng (random number generator) – The random number generator to be used.
Returns: The perturbed parameter values.
Return type: np.ndarray
-
pdf
(accepted_parameters_manager, kernel_index, mean, x)[source]¶ Calculates the pdf of the kernel. Commonly used to calculate weights.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – The AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels in the joint kernel.
- index (integer) – The row to be considered in the accepted_parameters_bds matrix.
- x (The point at which the pdf should be evaluated.) –
Returns: The pdf evaluated at point x.
Return type: float
-
-
class
abcpy.perturbationkernel.
MultivariateStudentTKernel
(models, df)[source]¶ Bases:
abcpy.perturbationkernel.PerturbationKernel
,abcpy.perturbationkernel.ContinuousKernel
-
__init__
(models, df)[source]¶ This class defines a kernel perturbing the parameters using a multivariate Student's T distribution.
Parameters: - models (list of abcpy.probabilisticmodel objects) – The models that should be perturbed using this kernel
- df (integer) – The degrees of freedom to be used.
-
calculate_cov
(accepted_parameters_manager, kernel_index)[source]¶ Calculates the covariance matrix relevant to this kernel.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint kernel.
Returns: The covariance matrix corresponding to this kernel.
Return type: list
-
update
(accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState())[source]¶ Updates the parameter values contained in the accepted_parameters_manager using a multivariate Student's T distribution.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – Defines the AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels in the joint kernel.
- row_index (integer) – The index of the row that should be considered from the accepted_parameters_bds matrix.
- rng (random number generator) – The random number generator to be used.
Returns: The perturbed parameter values.
Return type: np.ndarray
-
pdf
(accepted_parameters_manager, kernel_index, mean, x)[source]¶ Calculates the pdf of the kernel. Commonly used to calculate weights.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – The AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels in the joint kernel.
- index (integer) – The row to be considered in the accepted_parameters_bds matrix.
- x (The point at which the pdf should be evaluated.) –
Returns: The pdf evaluated at point x.
Return type: float
-
-
class
abcpy.perturbationkernel.
RandomWalkKernel
(models)[source]¶ Bases:
abcpy.perturbationkernel.PerturbationKernel
,abcpy.perturbationkernel.DiscreteKernel
-
__init__
(models)[source]¶ This class defines a kernel perturbing discrete parameters using a naive random walk.
Parameters: models (list) – List of abcpy.ProbabilisticModel objects
-
update
(accepted_parameters_manager, kernel_index, row_index, rng=np.random.RandomState())[source]¶ Updates the parameter values contained in the accepted_parameters_manager using a random walk.
Parameters: - accepted_parameters_manager (abcpy.AcceptedParametersManager object) – Defines the AcceptedParametersManager to be used.
- row_index (integer) – The index of the row that should be considered from the accepted_parameters_bds matrix.
- rng (random number generator) – The random number generator to be used.
Returns: The perturbed parameter values.
Return type: np.ndarray
-
calculate_cov
(accepted_parameters_manager, kernel_index)[source]¶ Calculates the covariance matrix of this kernel. Since there is no covariance matrix associated with this random walk, it returns an empty list.
-
pmf
(accepted_parameters_manager, kernel_index, mean, x)[source]¶ Calculates the pmf of the kernel. Commonly used to calculate weights.
Parameters: - cov (list) – The covariance matrix used for this kernel. This is a dummy input.
- accepted_parameters_manager (abcpy.AcceptedParametersManager object) – The AcceptedParametersManager to be used.
- kernel_index (integer) – The index of the kernel in the list of kernels of the joint kernel.
- index (integer) – The row to be considered in the accepted_parameters_bds matrix.
- x (The point at which the pmf should be evaluated.) –
Returns: The pmf evaluated at point x.
Return type: float
-
-
class
abcpy.perturbationkernel.
DefaultKernel
(models)[source]¶ Bases:
abcpy.perturbationkernel.JointPerturbationKernel
-
__init__
(models)[source]¶ This class implements a kernel that perturbs all continuous parameters using a multivariate normal, and all discrete parameters using a random walk. To be used as an example for user defined kernels.
Parameters: models (list) – List of abcpy.ProbabilisticModel objects, the models for which the kernel should be defined.
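To illustrate, here is a minimal sketch contrasting the DefaultKernel with an explicitly assembled JointPerturbationKernel (the models and priors are illustrative and not part of this module):

from abcpy.continuousmodels import Uniform
from abcpy.discretemodels import Binomial
from abcpy.perturbationkernel import (DefaultKernel, JointPerturbationKernel,
                                      MultivariateNormalKernel, RandomWalkKernel)

# Illustrative parameters: two continuous and one discrete.
mu = Uniform([[150], [200]], name="mu")
sigma = Uniform([[5], [25]], name="sigma")
n = Binomial([20, 0.5], name="n")

# Option 1: let ABCpy pick a multivariate normal kernel for the continuous
# parameters and a random walk kernel for the discrete one.
kernel = DefaultKernel([mu, sigma, n])

# Option 2: assemble the same joint kernel explicitly.
kernel_explicit = JointPerturbationKernel([MultivariateNormalKernel([mu, sigma]),
                                           RandomWalkKernel([n])])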
-
abcpy.probabilisticmodels module¶
-
class
abcpy.probabilisticmodels.
InputConnector
(dimension)[source]¶ Bases:
object
-
__init__
(dimension)[source]¶ Creates input parameters of given dimensionality. Each dimension needs to be specified using the set method.
Parameters: dimension (int) – Dimensionality of the input parameters.
-
from_number
()[source]¶ Convenient initializer that converts a number to a hyperparameter input parameter.
Parameters: number – Returns: Return type: InputConnector
-
from_model
()[source]¶ Convenient initializer that converts the full output of a model to input parameters.
Parameters: ProbabilisticModel – Returns: Return type: InputConnector
-
from_list
()[source]¶ Creates an InputParameters object from a list of ProbabilisticModels.
In this case, number of input parameters equals the sum of output dimensions of all models in the parameter list. Further, the output and models are connected to the input parameters in the order they appear in the parameter list.
For convenience:
- the parameter list can contain nested lists
- the method also accepts numbers instead of models, which are automatically converted to hyperparameters.
Parameters: parameters (list) – A list of ProbabilisticModels Returns: Return type: InputConnector
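A brief sketch of from_list (the parent model and the plain number are illustrative; numbers are turned into Hyperparameters automatically):

from abcpy.continuousmodels import Normal
from abcpy.probabilisticmodels import InputConnector

mu = Normal([170, 10], name="mu")

# Two input parameters: the output of the model mu and the fixed number 15.
input_connector = InputConnector.from_list([mu, 15])

# The first input parameter is wired to the model mu.
print(input_connector.get_model(0).name)  # "mu"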
-
get_model
(index)[source]¶ Returns the model at index.
Returns: Return type: ProbabilisticModel
-
set
(index, model, model_index)[source]¶ Sets for an input parameter index the input model and the model index to use.
For convenience, model can also be a number, which is automatically cast to a hyperparameter.
Parameters: - index (int) – Index of the input parameter to be set.
- model (ProbabilisticModel, Number) – The model to be set for the input parameter.
- model_index (int) – Index of model’s output to be used as input parameter.
-
all_models_fixed_values
()[source]¶ Checks whether all input models have fixed an output value (pseudo data).
In order to get a fixed output value (a realization of the random variable described by the model), a model has to run a forward simulation, which is not done automatically upon initialization.
Returns: Return type: boolean
-
-
class
abcpy.probabilisticmodels.
ProbabilisticModel
(input_connector, name='')[source]¶ Bases:
object
This abstract class represents all probabilistic models.
-
__init__
(input_connector, name='')[source]¶ This initializer must be called from any derived class to properly connect it to its input models.
It accepts as input an InputConnector object that fully specifies how to connect all parent models to the current model.
Parameters: - input_connector (list) – A list of input parameters.
- name (string) – A human readable name for the model. Can be the variable name for example.
-
get_input_values
()[source]¶ Returns the input values from the parent models as a list. Commonly used when sampling from the distribution.
Returns: Return type: list
-
get_stored_output_values
()[source]¶ Returns the stored sampled value of the probabilistic model after setting the values explicitly.
At initialization the function should return None.
Returns: Return type: numpy.array or None.
-
get_input_connector
()[source]¶ Returns the input connector object that connects the current model to its parents.
In case of no dependencies, this function should return None.
Returns: Return type: InputConnector, None
-
get_input_dimension
()[source]¶ Returns the input dimension of the current model.
Returns: Return type: int
-
set_output_values
(values)[source]¶ Sets the output values of the model. This method is commonly used to set new values after perturbing the old ones.
Parameters: values (numpy array of dimension equal to the output dimension) – Returns: Returns True if it was possible to set the values, False otherwise. Return type: boolean
-
__add__
(other)[source]¶ Overload the + operator for probabilistic models.
Parameters: other (probabilistic model or Hyperparameter) – The model to be added to self. Returns: A probabilistic model describing a model coming from summation. Return type: SummationModel
-
__sub__
(other)[source]¶ Overload the - operator for probabilistic models.
Parameters: other (probabilistic model or Hyperparameter) – The model to be subtracted from self. Returns: A probabilistic model describing a model coming from subtraction. Return type: SubtractionModel
-
__mul__
(other)[source]¶ Overload the * operator for probabilistic models.
Parameters: other (probabilistic model or Hyperparameter) – The model to be multiplied with self. Returns: A probabilistic model describing a model coming from multiplication. Return type: MultiplicationModel
-
__truediv__
(other)[source]¶ Overload the / operator for probabilistic models.
Parameters: other (probabilistic model or Hyperparameter) – The model by which self is divided. Returns: A probabilistic model describing a model coming from division. Return type: DivisionModel
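A short sketch of the operator overloading above (the models are illustrative); the resulting operation models can themselves be used as inputs of other models:

from abcpy.continuousmodels import Normal, Uniform

scale = Uniform([[0.5], [2.0]], name="scale")
noise = Normal([0, 1], name="noise")

shifted = noise + 170      # SummationModel
rescaled = scale * noise   # MultiplicationModel

# The derived models can be used as parameters of another model.
observation_model = Normal([shifted, 15], name="observation")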
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x.
Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (list) – The point at which the pdf should be evaluated.
Returns: The pdf evaluated at point x.
Return type: float
-
calculate_and_store_pdf_if_needed
(x)[source]¶ Calculates the probability density function at point x and stores the result internally for later use.
This function is intended to be used within the inference computation.
Parameters: x (list) – The point at which the pdf should be evaluated.
-
flush_stored_pdf
()[source]¶ This function flushes the internally stored value of a previously computed pdf.
-
get_stored_pdf
()[source]¶ Retrieves the value of a previously calculated pdf.
Returns: Return type: number
-
forward_simulate
(input_values, k, rng, mpi_comm=None)[source]¶ Provides the output (pseudo data) from a forward simulation of the current model.
In case the model is intended to be used as input for another model, a forward simulation must return a list of k numpy arrays with shape (get_output_dimension(),).
In case the model is directly used for inference, and not as input for another model, a forward simulation also must return a list, but the elements can be arbitrarily defined. In this case it is only important that the used statistics and distance functions can read the input.
Parameters: - input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- k (integer) – The number of forward simulations that should be run
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: A list of k elements, where each element is of type numpy array and represents the result of a single forward simulation.
Return type: list
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
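To make the contract above concrete, here is a minimal sketch of a user-defined continuous model (the Gaussian example and the _check_input/_check_output hooks follow the usual pattern for implementing new models; they are illustrative, not part of this class):

import numpy as np
from abcpy.probabilisticmodels import Continuous, InputConnector, ProbabilisticModel


class IllustrativeGaussian(ProbabilisticModel, Continuous):
    def __init__(self, parameters, name="IllustrativeGaussian"):
        # parameters = [mean, std]; entries may be numbers or other models.
        input_connector = InputConnector.from_list(parameters)
        super().__init__(input_connector, name)

    def _check_input(self, input_values):
        # Two inputs, with a strictly positive standard deviation.
        return len(input_values) == 2 and input_values[1] > 0

    def _check_output(self, values):
        return True

    def get_output_dimension(self):
        return 1

    def forward_simulate(self, input_values, k, rng=np.random.RandomState(), mpi_comm=None):
        mean, std = input_values[0], input_values[1]
        # A list of k numpy arrays of shape (get_output_dimension(),).
        return [np.array([s]) for s in rng.normal(mean, std, k)]

    def pdf(self, input_values, x):
        mean, std = input_values[0], input_values[1]
        return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))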
-
-
class
abcpy.probabilisticmodels.
Continuous
[source]¶ Bases:
object
This abstract class represents all continuous probabilistic models.
-
pdf
(input_values, x)[source]¶ Calculates the probability density function of the model.
Parameters: - input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- x (float) – The location at which the probability density function should be evaluated.
-
-
class
abcpy.probabilisticmodels.
Discrete
[source]¶ Bases:
object
This abstract class represents all discrete probabilistic models.
-
pmf
(input_values, x)[source]¶ Calculates the probability mass function of the model.
Parameters: - input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- x (float) – The location at which the probability mass function should be evaluated.
-
-
class
abcpy.probabilisticmodels.
Hyperparameter
(value, name='Hyperparameter')[source]¶ Bases:
abcpy.probabilisticmodels.ProbabilisticModel
This class represents all hyperparameters (i.e. fixed parameters).
-
__init__
(value, name='Hyperparameter')[source]¶ Parameters: value (list) – The values to which the hyperparameter should be set
-
set_output_values
(values, rng=np.random.RandomState())[source]¶ Sets the output values of the model. This method is commonly used to set new values after perturbing the old ones.
Parameters: values (numpy array of dimension equal to the output dimension) – Returns: Returns True if it was possible to set the values, False otherwise. Return type: boolean
-
get_input_dimension
()[source]¶ Returns the input dimension of the current model.
Returns: Return type: int
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
get_input_connector
()[source]¶ Returns the input connector object that connects the current model to its parents.
In case of no dependencies, this function should return None.
Returns: Return type: InputConnector, None
-
get_input_values
()[source]¶ Returns the input values from the parent models as a list. Commonly used when sampling from the distribution.
Returns: Return type: list
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Provides the output (pseudo data) from a forward simulation of the current model.
In case the model is intended to be used as input for another model, a forward simulation must return a list of k numpy arrays with shape (get_output_dimension(),).
In case the model is directly used for inference, and not as input for another model, a forward simulation also must return a list, but the elements can be arbitrarily defined. In this case it is only important that the used statistics and distance functions can read the input.
Parameters: - input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- k (integer) – The number of forward simulations that should be run
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: A list of k elements, where each element is of type numpy array and represents the result of a single forward simulation.
Return type: list
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x.
Commonly used to determine whether perturbed parameters are still valid according to the pdf.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (list) – The point at which the pdf should be evaluated.
Returns: The pdf evaluated at point x.
Return type: float
-
-
class
abcpy.probabilisticmodels.
ModelResultingFromOperation
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ProbabilisticModel
This class implements probabilistic models returned after performing an operation on two probabilistic models
-
__init__
(parameters, name='')[source]¶ Parameters: parameters (list) – List containing the two probabilistic models on which the operation is performed.
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Provides the output (pseudo data) from a forward simulation of the current model.
In case the model is intended to be used as input for another model, a forward simulation must return a list of k numpy arrays with shape (get_output_dimension(),).
In case the model is directly used for inference, and not as input for another model, a forward simulation also must return a list, but the elements can be arbitrarily defined. In this case it is only important that the used statistics and distance functions can read the input.
Parameters: - input_values (list) – A list of numbers that are the concatenation of all parent model outputs in the order specified by the InputConnector object that was passed during initialization.
- k (integer) – The number of forward simulations that should be run
- rng (Random number generator) – Defines the random number generator to be used. The default value uses a random seed to initialize the generator.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: A list of k elements, where each element is of type numpy array and represents the result of a single forward simulation.
Return type: list
-
get_output_dimension
()[source]¶ Provides the output dimension of the current model.
This function is in particular important if the current model is used as an input for other models. In such a case it is assumed that the output is always a vector of int or float. The length of the vector is the dimension that should be returned here.
Returns: The dimension of the output vector of a single forward simulation. Return type: int
-
pdf
(input_values, x)[source]¶ Calculates the probability density function at point x.
Parameters: - input_values (list) – List of input parameters, in the same order as specified in the InputConnector passed to the init function
- x (float or list) – The point at which the pdf should be evaluated.
Returns: The probability density function evaluated at point x.
Return type: float
-
sample_from_input_models
(k, rng=np.random.RandomState())[source]¶ Return for each input model k samples.
Parameters: k (int) – Specifies the number of samples to generate from each input model. Returns: A dictionary of type ProbabilisticModel:[], where the list contains k samples of the corresponding model. Return type: dict
-
-
class
abcpy.probabilisticmodels.
SummationModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from an addition of two probabilistic models
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Adds the sampled values of both parent distributions.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the sum of the parents' values.
Return type: list
-
-
class
abcpy.probabilisticmodels.
SubtractionModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from a subtraction of two probabilistic models
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Subtracts the sampled values of both parent distributions.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the difference of the parents' values.
Return type: list
-
-
class
abcpy.probabilisticmodels.
MultiplicationModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from a multiplication of two probabilistic models
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Multiplies the sampled values of both parent distributions element-wise.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the product of the parents' values.
Return type: list
-
-
class
abcpy.probabilisticmodels.
DivisionModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from a division of two probabilistic models
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Divides the sampled values of both parent distributions.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the quotient of the parents' values.
Return type: list
-
-
class
abcpy.probabilisticmodels.
ExponentialModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from an exponentiation of two probabilistic models
-
__init__
(parameters, name='')[source]¶ Specific initializer for exponential models that does additional checks.
Parameters: parameters (list) – List containing the two probabilistic models (base and exponent) to be combined.
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Raises the sampled values of the base to the power of the exponent.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the base raised to the exponent.
Return type: list
-
-
class
abcpy.probabilisticmodels.
RExponentialModel
(parameters, name='')[source]¶ Bases:
abcpy.probabilisticmodels.ModelResultingFromOperation
This class represents all probabilistic models resulting from an exponentiation of a Hyperparameter by another probabilistic model.
-
__init__
(parameters, name='')[source]¶ Specific initializer for exponential models that does additional checks.
Parameters: parameters (list) – List containing the two probabilistic models (base and exponent) to be combined.
-
forward_simulate
(input_values, k, rng=np.random.RandomState(), mpi_comm=None)[source]¶ Raises the base to the power of the sampled values of the exponent.
Parameters: - input_values (list) – List of input values
- k (integer) – The number of samples that should be sampled
- rng (random number generator) – The random number generator to be used.
- mpi_comm (MPI communicator object) – Defines the MPI communicator object for MPI parallelization. The default value is None, meaning the forward simulation is not MPI-parallelized.
Returns: The first entry is True (it is always possible to sample, given two parent values). The second entry is the base raised to the exponent.
Return type: list
-
abcpy.statistics module¶
-
class
abcpy.statistics.
Statistics
(degree=1, cross=False, previous_statistics=None)[source]¶ Bases:
object
This abstract base class defines how to calculate statistics from dataset.
The base class also implements a polynomial expansion with cross-product terms that can be used to get the desired polynomial expansion of the calculated statistics.
-
__init__
(degree=1, cross=False, previous_statistics=None)[source]¶ Constructor that must be overwritten by the sub-class.
The constructor of a sub-class must accept arguments for the polynomial expansion applied after extraction of the summary statistics: one has to define the degree of the polynomial expansion and cross, indicating whether cross-product terms are included.
Parameters: - degree (integer, optional) – Of polynomial expansion. The default value is 2 meaning second order polynomial expansion.
- cross (boolean, optional) – Defines whether to include the cross-product terms. The default value is True, meaning the cross product term is included.
- previous_statistics (Statistics class, optional) – It allows pipelining of Statistics. Specifically, if the final statistic to be used is determined by the composition of two Statistics, you can pass the first here; then, whenever the final statistic is needed, it is sufficient to call the statistics method of the second one, and that will automatically apply both transformations.
-
statistics
(data: object) → object[source]¶ To be overwritten by any sub-class: should extract statistics from the data set data. It is assumed that data is a list of n elements of the same type (e.g., the data can be a list containing n time series, n graphs or n np.ndarrays).
Parameters: data (python list) – Contains n data sets with length p. Returns: nxp matrix where for each of the n data points p statistics are calculated. Return type: numpy.ndarray
-
-
class
abcpy.statistics.
Identity
(degree=1, cross=False, previous_statistics=None)[source]¶ Bases:
abcpy.statistics.Statistics
This class implements identity statistics not applying any transformation to the data, before the optional polynomial expansion step. If the data set contains n numpy.ndarray of length p, it returns therefore an nx(p+degree*p+cross*nchoosek(p,2)) matrix, where for each of the n points with p statistics, degree*p polynomial expansion term and cross*nchoosek(p,2) many cross-product terms are calculated.
-
__init__
(degree=1, cross=False, previous_statistics=None)[source]¶ Parameters: - degree (integer, optional) – Degree of the polynomial expansion. The default value is 1, meaning no polynomial expansion is applied.
- cross (boolean, optional) – Defines whether to include the cross-product terms. The default value is False, meaning the cross-product terms are not included.
- previous_statistics (Statistics class, optional) – It allows pipelining of Statistics. Specifically, if the final statistic to be used is determined by the composition of two Statistics, you can pass the first here; then, whenever the final statistic is needed, it is sufficient to call the statistics method of the second one, and that will automatically apply both transformations.
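A brief usage sketch of the Identity statistics with a polynomial expansion (the data are illustrative):

import numpy as np
from abcpy.statistics import Identity

# Two data sets, each a 1D array of length 3.
data = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])]

# degree=2 adds the squared terms, cross=True adds the pairwise products.
stat_calc = Identity(degree=2, cross=True)
summaries = stat_calc.statistics(data)

# Each row contains the raw values, their squares and the cross-products.
print(summaries.shape)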
-
-
class
abcpy.statistics.
LinearTransformation
(coefficients, degree=1, cross=False, previous_statistics=None)[source]¶ Bases:
abcpy.statistics.Statistics
Applies a linear transformation to the data to get (usually) a lower dimensional statistics. Then you can apply an additional polynomial expansion step.
-
__init__
(coefficients, degree=1, cross=False, previous_statistics=None)[source]¶ Parameters: - coefficients (matrix with size d x p) – d is the dimension of the summary statistic that is obtained after applying the linear transformation (i.e. before a possible polynomial expansion is applied), while p is the dimension of each data point.
- degree (integer, optional) – Degree of the polynomial expansion. The default value is 1, meaning no polynomial expansion is applied.
- cross (boolean, optional) – Defines whether to include the cross-product terms. The default value is False, meaning the cross-product terms are not included.
- previous_statistics (Statistics class, optional) – It allows pipelining of Statistics. Specifically, if the final statistic to be used is determined by the composition of two Statistics, you can pass the first here; then, whenever the final statistic is needed, it is sufficient to call the statistics method of the second one, and that will automatically apply both transformations.
-
statistics
(data)[source]¶ Parameters: data (python list) – Contains n data sets with length p. Returns: nx(d+degree*d+cross*nchoosek(d,2)) matrix where for each of the n data points with length p you apply the linear transformation to get to dimension d, from where (d+degree*d+cross*nchoosek(d,2)) statistics are calculated. Return type: numpy.ndarray
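A brief sketch of LinearTransformation (the coefficient matrix is illustrative and chosen square here, so that the sketch does not depend on the row/column convention of the coefficients):

import numpy as np
from abcpy.statistics import LinearTransformation

# Two data sets of length p = 2 and a 2 x 2 coefficient matrix.
data = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
coefficients = np.array([[1.0, 0.5],
                         [0.0, 2.0]])

stat_calc = LinearTransformation(coefficients, degree=1, cross=False)
summaries = stat_calc.statistics(data)  # linearly transformed summaries, one row per data set
print(summaries.shape)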
-
-
class
abcpy.statistics.
NeuralEmbedding
(net, previous_statistics=None)[source]¶ Bases:
abcpy.statistics.Statistics
Computes the statistics by applying a neural network transformation.
It is essentially a wrapper for the application of a neural network transformation to the data. Note that the neural network must have been trained beforehand (for instance with the statistics learning routines) and that Pytorch is required for this part to work.
-
__init__
(net, previous_statistics=None)[source]¶ Parameters: - net (torch.nn object) – the embedding neural network. The input size of the neural network must coincide with the size of each of the datapoints.
- previous_statistics (Statistics class, optional) – It allows pipelining of Statistics. Specifically, if the final statistic to be used is determined by the composition of two Statistics, you can pass the first here; then, whenever the final statistic is needed, it is sufficient to call the statistics method of the second one, and that will automatically apply both transformations.
-
classmethod
fromFile
(path_to_net_state_dict, network_class=None, path_to_scaler=None, input_size=None, output_size=None, hidden_sizes=None, previous_statistics=None)[source]¶ If the neural network state_dict was saved to the disk, this method can be used to instantiate a NeuralEmbedding object with that neural network.
In order for the state_dict to be read correctly, the network class is needed. Therefore, we provide 2 options: 1) the Pytorch neural network class can be passed (if the user defined it, for instance) 2) if the neural network was defined by using the DefaultNN class in abcpy.NN_utilities.networks, you can provide arguments input_size, output_size and hidden_sizes (the latter is optional) that define the sizes of a fully connected network; then a DefaultNN is instantiated with those sizes. This can be used if for instance the neural network was trained using the utilities in abcpy.statisticslearning and you did not provide explicitly the neural network class there, but defined it through the sizes of the different layers.
In both cases, note that the input size of the neural network must coincide with the size of each of the datapoints generated from the model (unless some other statistics are computed beforehand).
Note that if the neural network was of the class ScalerAndNet, i.e. a scaler was applied before the data is fed through it, you need to pass path_to_scaler as well. Then this method will instantiate the network in the correct way.
Parameters: - path_to_net_state_dict (basestring) – the path where the state-dict is saved
- network_class (torch.nn class, optional) – if the neural network class is known explicitly (for instance if the user defined it), then it has to be passed here. This must not be provided together with input_size or output_size.
- path_to_scaler (basestring, optional) – The path where the scaler which was applied before the neural network is saved. Note that if the neural network was trained on scaled data and now you do not pass the correct scaler, the behavior will not be correct, leading to wrong inference. Default to None.
- input_size (integer, optional) – if the neural network is an instance of abcpy.NN_utilities.networks.DefaultNN with some input and output size, then you should provide here the input size of the network. It has to be provided together with the corresponding output_size, and it must not be provided with network_class.
- output_size (integer, optional) – if the neural network is an instance of abcpy.NN_utilities.networks.DefaultNN with some input and output size, then you should provide here the output size of the network. It has to be provided together with the corresponding input_size, and it must not be provided with network_class.
- hidden_sizes (array-like, optional) – if the neural network is an instance of abcpy.NN_utilities.networks.DefaultNN with some input and output size, then you can provide here an array-like with the size of the hidden layers (for instance [5,7,5] denotes 3 hidden layers with correspondingly 5,7,5 neurons). In case this parameter is not provided, the hidden sizes are determined from the input and output sizes as determined in abcpy.NN_utilities.networks.DefaultNN. Note that this must not be provided together with network_class.
- previous_statistics (Statistics class, optional) – It allows pipelining of Statistics. Specifically, if the final statistic to be used is determined by the composition of two Statistics, you can pass the first here; then, whenever the final statistic is needed, it is sufficient to call the statistics method of the second one, and that will automatically apply both transformations. In this case, this is the statistics that has to be computed before the neural network transformation is applied.
Returns: the NeuralEmbedding object with the neural network obtained from the specified file.
Return type: abcpy.statistics.NeuralEmbedding object
-
save_net
(path_to_net_state_dict, path_to_scaler=None)[source]¶ Method to save the neural network state dict to a file. If the network is of the class ScalerAndNet, i.e. a scaler is applied before the data is fed through the network, then you are also required to pass the path where you want the scaler to be saved.
Parameters: - path_to_net_state_dict (basestring) – Path where the state dict of the neural network is saved.
- path_to_scaler (basestring) – Path where the scaler is saved (with pickle); this is required if the neural network is of the class ScalerAndNet, and is ignored otherwise.
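A sketch of reconstructing a learned statistic from disk (the state dict and scaler are assumed to have been written previously with save_net; paths and sizes are illustrative, and Pytorch must be installed):

from abcpy.statistics import NeuralEmbedding

stat_calc = NeuralEmbedding.fromFile(
    "net_state_dict.pth",           # written earlier with save_net
    input_size=100,                 # size of each simulated data point
    output_size=2,                  # number of learned summaries
    hidden_sizes=[50, 50],          # layer widths used when training a DefaultNN
    path_to_scaler="scaler.pkl",    # only needed if a ScalerAndNet was used
)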
-
abcpy.statisticslearning module¶
-
class
abcpy.statisticslearning.
StatisticsLearning
(model, statistics_calc, backend, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, seed=None)[source]¶ Bases:
object
This abstract base class defines a way to choose the summary statistics.
-
__init__
(model, statistics_calc, backend, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, seed=None)[source]¶ The constructor of a sub-class must accept a non-optional model, statistics calculator and backend, which are stored to self.model, self.statistics_calc and self.backend. Further, it accepts two optional parameters, n_samples and seed, defining the number of simulated datasets used in the pilot step to decide the summary statistics and the integer used to initialize the random number generator.
This __init__ takes care of sample-statistics generation, with the parallelization; however, you can choose to provide simulations and corresponding parameters that have been previously generated, with the parameters parameters and simulations.
Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_val (int, optional) – The number of (parameter, simulated data) tuples to be generated to be used as a validation set in the pilot step. The default value is 0, which means no validation set is used. This is ignored if simulations_val and parameters_val are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided. Default to 1.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed to generate the training data. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed to generate the training data. Default value is None.
- parameters_val (array, optional) – A numpy array with shape (n_samples_val, n_parameters) that is used, together with simulations_val as a validation set in the summary selection learning algorithm. It has to be provided together with simulations_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- simulations_val (array, optional) – A numpy array with shape (n_samples_val, output_size) that is used, together with parameters_val as a validation set in the summary selection learning algorithm. It has to be provided together with parameters_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
-
-
class
abcpy.statisticslearning.
Semiautomatic
(model, statistics_calc, backend, n_samples=1000, n_samples_per_param=1, parameters=None, simulations=None, seed=None)[source]¶ Bases:
abcpy.statisticslearning.StatisticsLearning
,abcpy.graphtools.GraphTools
This class implements the semi automatic summary statistics learning technique described in Fearnhead and Prangle [1].
[1] Fearnhead P., Prangle D. 2012. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. J. Roy. Stat. Soc. B 74:419–474.
-
__init__
(model, statistics_calc, backend, n_samples=1000, n_samples_per_param=1, parameters=None, simulations=None, seed=None)[source]¶ Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed. Default value is None.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
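A minimal sketch of the semi-automatic summary selection (model, priors and settings are illustrative; it is assumed that the learned transformation is retrieved with get_statistics, as for the neural-network based classes documented below):

from abcpy.backends import BackendDummy
from abcpy.continuousmodels import Normal, Uniform
from abcpy.statistics import Identity
from abcpy.statisticslearning import Semiautomatic

mu = Uniform([[150], [200]], name="mu")
sigma = Uniform([[5], [25]], name="sigma")
height = Normal([mu, sigma], name="height")

backend = BackendDummy()

# Pilot run: simulate (parameter, data) pairs and regress the parameters
# on the (polynomially expanded) identity statistics.
learning = Semiautomatic([height], Identity(degree=2, cross=False), backend,
                         n_samples=1000, n_samples_per_param=1, seed=42)
new_stat_calc = learning.get_statistics()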
-
-
class
abcpy.statisticslearning.
StatisticsLearningNN
(model, statistics_calc, backend, training_routine, distance_learning, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, seed=None, cuda=None, scale_samples=True, quantile=0.1, **training_routine_kwargs)[source]¶ Bases:
abcpy.statisticslearning.StatisticsLearning
,abcpy.graphtools.GraphTools
This is the base class for all the statistics learning techniques involving neural networks. In most cases, you should not instantiate this directly. The actual classes instantiate this with the right arguments.
In order to use this technique, Pytorch is required to handle the neural networks.
-
__init__
(model, statistics_calc, backend, training_routine, distance_learning, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, seed=None, cuda=None, scale_samples=True, quantile=0.1, **training_routine_kwargs)[source]¶ Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- training_routine (function) – training routine to train the network. It has to take as first and second arguments the matrix of simulations and the corresponding targets (or the similarity matrix if distance_learning is True). It also needs to have as keyword parameters embedding_net and cuda.
- distance_learning (boolean) – this has to be True if the statistics learning technique is based on distance learning, in which case the __init__ computes the similarity matrix.
- embedding_net (torch.nn object or list) – it can be a torch.nn object with input size corresponding to size of model output, alternatively, a list with integer numbers denoting the width of the hidden layers, from which a fully connected network with that structure is created, having the input and output size corresponding to size of model output and number of parameters. In case this is None, the depth of the network and the width of the hidden layers is determined from the input and output size as specified in abcpy.NN_utilities.networks.DefaultNN.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_val (int, optional) – The number of (parameter, simulated data) tuples to be generated to be used as a validation set in the pilot step. The default value is 0, which means no validation set is used. This is ignored if simulations_val and parameters_val are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided. Default to 1.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed to generate the training data. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed to generate the training data. Default value is None.
- parameters_val (array, optional) – A numpy array with shape (n_samples_val, n_parameters) that is used, together with simulations_val as a validation set in the summary selection learning algorithm. It has to be provided together with simulations_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- simulations_val (array, optional) – A numpy array with shape (n_samples_val, output_size) that is used, together with parameters_val as a validation set in the summary selection learning algorithm. It has to be provided together with parameters_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
- cuda (boolean, optional) – If cuda=None, a GPU is selected if available; alternatively, you can specify True to use the GPU or False to use the CPU.
- scale_samples (boolean, optional) – If True, a scaler of the class sklearn.preprocessing.MinMaxScaler will be fit on the training data before neural network training, and the training and validation simulations will be rescaled accordingly. When calling the get_statistics method, a network of the class ScalerAndNet is used in instantiating the statistics; this network wraps a neural network and a scaler and transforms the data with the scaler before applying the neural network. It is highly recommended to use a scaler, as neural networks are sensitive to the range of input data. A case in which you may not want to use a scaler is timeseries data, as the scaler works independently on each feature of the data. Default value is True.
- quantile (float, optional) – quantile used to define the similarity set if distance_learning is True. Default to 0.1.
- training_routine_kwargs – additional kwargs to be passed to the underlying training routine.
-
get_statistics
()[source]¶ Returns a NeuralEmbedding Statistics implementing the learned transformation.
If a scaler was used, the net attribute of the returned object is of the class ScalerAndNet, which is a nn.Module object wrapping the scaler and the learned neural network and applies the scaler before the data is fed through the neural network.
Returns: a statistics object that implements the learned transformation. Return type: abcpy.statistics.NeuralEmbedding object
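For instance, with the SemiautomaticNN class documented below (model, priors and training settings are illustrative; Pytorch is required):

from abcpy.backends import BackendDummy
from abcpy.continuousmodels import Normal, Uniform
from abcpy.statistics import Identity
from abcpy.statisticslearning import SemiautomaticNN

mu = Uniform([[150], [200]], name="mu")
sigma = Uniform([[5], [25]], name="sigma")
height = Normal([mu, sigma], name="height")

backend = BackendDummy()

# A small fully connected network (hidden layer widths are illustrative)
# is trained to regress the parameters from the raw simulations.
learning = SemiautomaticNN([height], Identity(degree=1, cross=False), backend,
                           embedding_net=[30, 20], n_samples=500, n_epochs=50, seed=1)

# The learned transformation, wrapped as a Statistics object usable in inference;
# if a scaler was used, the stored network is of the class ScalerAndNet.
stat_calc = learning.get_statistics()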
-
-
class
abcpy.statisticslearning.
SemiautomaticNN
(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, batch_size=16, n_epochs=200, load_all_data_GPU=False, lr=0.001, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶ Bases:
abcpy.statisticslearning.StatisticsLearningNN
This class implements the semi automatic summary statistics learning technique as described in Jiang et al. 2017 [1].
In order to use this technique, Pytorch is required to handle the neural networks.
[1] Jiang, B., Wu, T.Y., Zheng, C. and Wong, W.H., 2017. Learning summary statistic for approximate Bayesian computation via deep neural network. Statistica Sinica, pp.1595-1618.
-
__init__
(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, batch_size=16, n_epochs=200, load_all_data_GPU=False, lr=0.001, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶ Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- embedding_net (torch.nn object or list) – It can be a torch.nn object with input size corresponding to the size of the model output and output size corresponding to the number of parameters; alternatively, a list of integers denoting the widths of the hidden layers, from which a fully connected network with that structure is created, with input and output sizes corresponding to the size of the model output and the number of parameters. If this is None, the depth of the network and the widths of the hidden layers are determined from the input and output sizes as specified in abcpy.NN_utilities.networks.DefaultNN.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_val (int, optional) – The number of (parameter, simulated data) tuples to be generated and used as a validation set in the pilot step. The default value is 0, which means no validation set is used. This is ignored if simulations_val and parameters_val are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided. Default to 1.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed to generate the training data. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed to generate the training data. Default value is None.
- parameters_val (array, optional) – A numpy array with shape (n_samples_val, n_parameters) that is used, together with simulations_val as a validation set in the summary selection learning algorithm. It has to be provided together with simulations_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- simulations_val (array, optional) – A numpy array with shape (n_samples_val, output_size) that is used, together with parameters_val as a validation set in the summary selection learning algorithm. It has to be provided together with parameters_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- early_stopping (boolean, optional) – If True, the validation set (which needs to be either provided through the arguments parameters_val and simulations_val or generated by setting n_samples_val to a value larger than 0) is used to early stop the training of the neural network as soon as the loss on the validation set starts to increase. Default value is False.
- epochs_early_stopping_interval (integer, optional) – The frequency at which the validation error is compared in order to decide whether to early stop the training or not. Namely, if epochs_early_stopping_interval=10, early stopping can happen only at epochs that are multiples of 10. Default value is 1.
- start_epoch_early_stopping (integer, optional) – The epoch after which early stopping can happen; in fact, as soon as training starts, there may be a transient period in which the loss increases. Default value is 10.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
- cuda (boolean, optional) – If cuda=None, a GPU is selected whenever one is available; alternatively, set this to True to force GPU use or to False to force CPU use.
- scale_samples (boolean, optional) – If True, a scaler of the class sklearn.preprocessing.MinMaxScaler is fit on the training data before neural network training, and the training and validation simulations are rescaled accordingly. When calling the get_statistics method, a network of the class ScalerAndNet is used in instantiating the statistics; this network wraps the neural network together with the scaler and transforms the data with the scaler before applying the neural network. Using a scaler is highly recommended, as neural networks are sensitive to the range of the input data. A case in which you may not want to use a scaler is timeseries data, as the scaler works independently on each feature of the data. Default value is True.
- batch_size (integer, optional) – The batch size used for training the neural network. Default value is 16.
- n_epochs (integer, optional) – The number of epochs used for training the neural network. Default value is 200.
- load_all_data_GPU (boolean, optional) – If True and a GPU is used, the whole dataset is loaded on the GPU before training begins; this may speed up training by avoiding transfers between CPU and GPU, but it is not guaranteed to do so. Note that if the dataset is too large to fit in GPU memory, setting this to True will cause a crash. Default value is False.
- lr (float, optional) – The learning rate to be used in the iterative training scheme of the neural network. Default to 1e-3.
- optimizer (torch Optimizer class, optional) – A torch Optimizer class, for instance SGD or Adam. Default to Adam. Additional parameters may be passed through the optimizer_kwargs parameter.
- scheduler (torch _LRScheduler class, optional) – A torch _LRScheduler class, used to modify the learning rate across epochs. By default, no scheduler is used. Additional parameters may be passed through the scheduler_kwargs parameter.
- start_epoch_training (integer, optional) – If a scheduler is provided, for the first start_epoch_training epochs the scheduler is applied to modify the learning rate without training the network. From then on, the training proceeds normally, applying both the scheduler and the optimizer at each epoch. Default to 0.
- verbose (boolean, optional) – if True, prints more information from the training routine. Default to False.
- optimizer_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the optimizer.
- scheduler_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the scheduler.
- loader_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the loader (that handles loading the samples from the dataset during the training phase).
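A minimal usage sketch follows; the Gaussian height model, the prior ranges and the hyperparameter values are illustrative assumptions only, chosen to mirror the Getting Started example, and are not prescribed by this class.

from abcpy.continuousmodels import Normal, Uniform
from abcpy.statistics import Identity
from abcpy.backends import BackendDummy
from abcpy.statisticslearning import SemiautomaticNN

# Illustrative model: Gaussian heights with uniform priors on mean and std.
mu = Uniform([[150], [200]], name='mu')
sigma = Uniform([[5], [25]], name='sigma')
height = Normal([mu, sigma], name='height')

statistics_calculator = Identity(degree=1, cross=False)
backend = BackendDummy()

# embedding_net=[50, 30] builds a fully connected network with two hidden
# layers of widths 50 and 30; input and output sizes are inferred.
statistics_learning = SemiautomaticNN([height], statistics_calculator, backend,
                                      embedding_net=[50, 30], n_samples=1000,
                                      n_samples_per_param=1, n_epochs=200, seed=1)
learned_statistics = statistics_learning.get_statistics()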
class abcpy.statisticslearning.TripletDistanceLearning(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, quantile=0.1, batch_size=16, n_epochs=200, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶
Bases: abcpy.statisticslearning.StatisticsLearningNN
This class implements the statistics learning technique that uses the triplet loss [1] for distance learning, as described in Pacchiardi et al. 2019 [2]; a minimal usage sketch is given after the parameter list below.
PyTorch is required to use this technique, as it handles the neural networks.
[1] Schroff, F., Kalenichenko, D. and Philbin, J., 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815-823).
[2] Pacchiardi, L., Kunzli, P., Schoengens, M., Chopard, B. and Dutta, R., 2019. Distance-learning For Approximate Bayesian Computation To Model a Volcanic Eruption. arXiv preprint arXiv:1909.13118.
__init__(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, quantile=0.1, batch_size=16, n_epochs=200, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶
Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- embedding_net (torch.nn object or list) – It can be a torch.nn object with input size corresponding to the size of the model output (the output size can be arbitrary); alternatively, a list of integers denoting the widths of the hidden layers, from which a fully connected network with that structure is created, with input and output sizes corresponding to the size of the model output and the number of parameters. If this is None, the depth of the network and the widths of the hidden layers are determined from the input and output sizes as specified in abcpy.NN_utilities.networks.DefaultNN.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_val (int, optional) – The number of (parameter, simulated data) tuples to be generated and used as a validation set in the pilot step. The default value is 0, which means no validation set is used. This is ignored if simulations_val and parameters_val are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided. Default to 1.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed to generate the training data. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed to generate the training data. Default value is None.
- parameters_val (array, optional) – A numpy array with shape (n_samples_val, n_parameters) that is used, together with simulations_val as a validation set in the summary selection learning algorithm. It has to be provided together with simulations_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- simulations_val (array, optional) – A numpy array with shape (n_samples_val, output_size) that is used, together with parameters_val as a validation set in the summary selection learning algorithm. It has to be provided together with parameters_val, in which case no other simulations are performed to generate the validation set. Default value is None.
- early_stopping (boolean, optional) – If True, the validation set (which needs to be either provided through the arguments parameters_val and simulations_val or generated by setting n_samples_val to a value larger than 0) is used to early stop the training of the neural network as soon as the loss on the validation set starts to increase. Default value is False.
- epochs_early_stopping_interval (integer, optional) – The frequency at which the validation error is compared in order to decide whether to early stop the training or not. Namely, if epochs_early_stopping_interval=10, early stopping can happen only at epochs that are multiples of 10. Default value is 1.
- start_epoch_early_stopping (integer, optional) – The epoch after which early stopping can happen; in fact, as soon as training starts, there may be a transient period in which the loss increases. Default value is 10.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
- cuda (boolean, optional) – If cuda=None, a GPU is selected whenever one is available; alternatively, set this to True to force GPU use or to False to force CPU use.
- scale_samples (boolean, optional) – If True, a scaler of the class sklearn.preprocessing.MinMaxScaler is fit on the training data before neural network training, and the training and validation simulations are rescaled accordingly. When calling the get_statistics method, a network of the class ScalerAndNet is used in instantiating the statistics; this network wraps the neural network together with the scaler and transforms the data with the scaler before applying the neural network. Using a scaler is highly recommended, as neural networks are sensitive to the range of the input data. A case in which you may not want to use a scaler is timeseries data, as the scaler works independently on each feature of the data. Default value is True.
- quantile (float, optional) – quantile used to define the similarity set if distance_learning is True. Default to 0.1.
- batch_size (integer, optional) – The batch size used for training the neural network. Default value is 16.
- n_epochs (integer, optional) – The number of epochs used for training the neural network. Default value is 200.
- load_all_data_GPU (boolean, optional) – If True and a GPU is used, the whole dataset is loaded on the GPU before training begins; this may speed up training by avoiding transfers between CPU and GPU, but it is not guaranteed to do so. Note that if the dataset is too large to fit in GPU memory, setting this to True will cause a crash. Default value is False.
- margin (float, optional) – margin defining the triplet loss. The larger it is, the further away dissimilar samples are pushed with respect to similar ones. Default to 1.
- lr (float, optional) – The learning rate to be used in the iterative training scheme of the neural network. Default to 1e-3.
- optimizer (torch Optimizer class, optional) – A torch Optimizer class, for instance SGD or Adam. Default to Adam. Additional parameters may be passed through the optimizer_kwargs parameter.
- scheduler (torch _LRScheduler class, optional) – A torch _LRScheduler class, used to modify the learning rate across epochs. By default, no scheduler is used. Additional parameters may be passed through the scheduler_kwargs parameter.
- start_epoch_training (integer, optional) – If a scheduler is provided, for the first start_epoch_training epochs the scheduler is applied to modify the learning rate without training the network. From then on, the training proceeds normally, applying both the scheduler and the optimizer at each epoch. Default to 0.
- verbose (boolean, optional) – if True, prints more information from the training routine. Default to False.
- optimizer_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the optimizer.
- scheduler_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the scheduler.
- loader_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the loader (that handles loading the samples from the dataset during the training phase).
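The construction mirrors SemiautomaticNN; the sketch below is illustrative only (it reuses the assumed Gaussian height model, statistics calculator and backend from the SemiautomaticNN sketch above) and additionally shows early stopping on a generated validation set.

from abcpy.statisticslearning import TripletDistanceLearning

# Same illustrative model, statistics_calculator and backend as above;
# quantile and margin are set to their default values for clarity.
statistics_learning = TripletDistanceLearning([height], statistics_calculator, backend,
                                              embedding_net=[50, 30], n_samples=1000,
                                              n_samples_val=200, early_stopping=True,
                                              quantile=0.1, margin=1.0,
                                              n_epochs=200, seed=1)
learned_statistics = statistics_learning.get_statistics()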
class abcpy.statisticslearning.ContrastiveDistanceLearning(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, quantile=0.1, batch_size=16, n_epochs=200, positive_weight=None, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶
Bases: abcpy.statisticslearning.StatisticsLearningNN
This class implements the statistics learning technique that uses the contrastive loss [1] for distance learning, as described in Pacchiardi et al. 2019 [2]; a minimal usage sketch is given after the parameter list below.
PyTorch is required to use this technique, as it handles the neural networks.
[1] Hadsell, R., Chopra, S. and LeCun, Y., 2006, June. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (Vol. 2, pp. 1735-1742). IEEE.
[2] Pacchiardi, L., Kunzli, P., Schoengens, M., Chopard, B. and Dutta, R., 2019. Distance-learning For Approximate Bayesian Computation To Model a Volcanic Eruption. arXiv preprint arXiv:1909.13118.
__init__(model, statistics_calc, backend, embedding_net=None, n_samples=1000, n_samples_val=0, n_samples_per_param=1, parameters=None, simulations=None, parameters_val=None, simulations_val=None, early_stopping=False, epochs_early_stopping_interval=1, start_epoch_early_stopping=10, seed=None, cuda=None, scale_samples=True, quantile=0.1, batch_size=16, n_epochs=200, positive_weight=None, load_all_data_GPU=False, margin=1.0, lr=None, optimizer=None, scheduler=None, start_epoch_training=0, optimizer_kwargs={}, scheduler_kwargs={}, loader_kwargs={})[source]¶
Parameters: - model (abcpy.models.Model) – Model object that conforms to the Model class.
- statistics_calc (abcpy.statistics.Statistics) – Statistics object that conforms to the Statistics class.
- backend (abcpy.backends.Backend) – Backend object that conforms to the Backend class.
- embedding_net (torch.nn object or list) – It can be a torch.nn object with input size corresponding to the size of the model output (the output size can be arbitrary); alternatively, a list of integers denoting the widths of the hidden layers, from which a fully connected network with that structure is created, with input and output sizes corresponding to the size of the model output and the number of parameters. If this is None, the depth of the network and the widths of the hidden layers are determined from the input and output sizes as specified in abcpy.NN_utilities.networks.DefaultNN.
- n_samples (int, optional) – The number of (parameter, simulated data) tuples to be generated to learn the summary statistics in the pilot step. The default value is 1000. This is ignored if simulations and parameters are provided.
- n_samples_per_param (int, optional) – Number of data points in each simulated data set. This is ignored if simulations and parameters are provided. Default to 1.
- parameters (array, optional) – A numpy array with shape (n_samples, n_parameters) that is used, together with simulations to fit the summary selection learning algorithm. It has to be provided together with simulations, in which case no other simulations are performed. Default value is None.
- simulations (array, optional) – A numpy array with shape (n_samples, output_size) that is used, together with parameters to fit the summary selection learning algorithm. It has to be provided together with parameters, in which case no other simulations are performed. Default value is None.
- seed (integer, optional) – Optional initial seed for the random number generator. The default value is generated randomly.
- cuda (boolean, optional) – If cuda=None, a GPU is selected whenever one is available; alternatively, set this to True to force GPU use or to False to force CPU use.
- scale_samples (boolean, optional) – If True, a scaler of the class sklearn.preprocessing.MinMaxScaler is fit on the training data before neural network training, and the training and validation simulations are rescaled accordingly. When calling the get_statistics method, a network of the class ScalerAndNet is used in instantiating the statistics; this network wraps the neural network together with the scaler and transforms the data with the scaler before applying the neural network. Using a scaler is highly recommended, as neural networks are sensitive to the range of the input data. A case in which you may not want to use a scaler is timeseries data, as the scaler works independently on each feature of the data. Default value is True.
- quantile (float, optional) – quantile used to define the similarity set if distance_learning is True. Default to 0.1.
- batch_size (integer, optional) – The batch size used for training the neural network. Default value is 16.
- n_epochs (integer, optional) – The number of epochs used for training the neural network. Default value is 200.
- positive_weight (float, optional) – The contrastive loss samples pairs of elements at random; if the majority of samples are labelled as dissimilar, the probability of drawing similar pairs is small. Setting this value to a number between 0 and 1 makes positive pairs be sampled with that probability during training.
- load_all_data_GPU (boolean, optional) – If True and a GPU is used, the whole dataset is loaded on the GPU before training begins; this may speed up training by avoiding transfers between CPU and GPU, but it is not guaranteed to do so. Note that if the dataset is too large to fit in GPU memory, setting this to True will cause a crash. Default value is False.
- margin (float, optional) – margin defining the contrastive loss. The larger it is, the further away dissimilar samples are pushed with respect to similar ones. Default to 1.
- lr (float, optional) – The learning rate to be used in the iterative training scheme of the neural network. Default to 1e-3.
- optimizer (torch Optimizer class, optional) – A torch Optimizer class, for instance SGD or Adam. Default to Adam. Additional parameters may be passed through the optimizer_kwargs parameter.
- scheduler (torch _LRScheduler class, optional) – A torch _LRScheduler class, used to modify the learning rate across epochs. By default, no scheduler is used. Additional parameters may be passed through the scheduler_kwargs parameter.
- start_epoch_training (integer, optional) – If a scheduler is provided, for the first start_epoch_training epochs the scheduler is applied to modify the learning rate without training the network. From then on, the training proceeds normally, applying both the scheduler and the optimizer at each epoch. Default to 0.
- verbose (boolean, optional) – if True, prints more information from the training routine. Default to False.
- optimizer_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the optimizer.
- scheduler_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the scheduler.
- loader_kwargs (Python dictionary, optional) – dictionary containing optional keyword arguments for the loader (that handles loading the samples from the dataset during the training phase).
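Again a hedged sketch, reusing the assumed model, statistics calculator and backend from the sketches above; here positive_weight is set so that similar pairs are oversampled, since randomly drawn pairs are mostly dissimilar.

from abcpy.statisticslearning import ContrastiveDistanceLearning

# positive_weight=0.5 makes similar pairs be drawn with probability 0.5 during
# training; margin keeps its default value of 1.0.
statistics_learning = ContrastiveDistanceLearning([height], statistics_calculator, backend,
                                                  embedding_net=[50, 30], n_samples=1000,
                                                  positive_weight=0.5, margin=1.0,
                                                  n_epochs=200, seed=1)
learned_statistics = statistics_learning.get_statistics()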