Installation

If you prefer to work on the source, clone the repository

git clone https://github.com/eth-cscs/abcpy.git

Make sure all requirements are installed

cd abcpy
pip3 install -r requirements.txt

To create a package and install it do

make package
pip3 install build/dist/abcpy-0.1-py3-none-any.whl

Note that ABCpy requires Python3.

Getting Started

Here we show how to use ABCpy to infer parameters of model, given observed some data. As a simple example we consider a Gaussian model, where we want to model the height of grown up humans given the following set of measurement (observation, observed data).

y_obs = [160.82499176, 167.24266737, 185.71695756, 153.7045709, 163.40568812, 140.70658699, 169.59102084, 172.81041696, 187.38782738, 179.66358934, 176.63417241, 189.16082803, 181.98288443, 170.18565017, 183.78493886, 166.58387299, 161.9521899, 155.69213073, 156.17867343, 144.51580379, 170.29847515, 197.96767899, 153.36646527, 162.22710198, 158.70012047, 178.53470703, 170.77697743, 164.31392633, 165.88595994, 177.38083686, 146.67058471763457, 179.41946565658628, 238.02751620619537, 206.22458790620766, 220.89530574344568, 221.04082532837026, 142.25301427453394, 261.37656571434275, 171.63761180867033, 210.28121820385866, 237.29130237612236, 175.75558340169619, 224.54340549862235, 197.42448680731226, 165.88273684581381, 166.55094082844519, 229.54308602661584, 222.99844054358519, 185.30223966014586, 152.69149367593846, 206.94372818527413, 256.35498655339154, 165.43140916577741, 250.19273595481803, 148.87781549665536, 223.05547559193792, 230.03418198709608, 146.13611923127021, 138.24716809523139, 179.26755740864527, 141.21704876815426, 170.89587081800852, 222.96391329259626, 188.27229523693822, 202.67075179617672, 211.75963110985992, 217.45423324370509]

Now, we want to model the height of humans by a Gaussian model which has parameters mean, denoted by \(\mu\), and standard deviation, denoted by \(\sigma\). The goal is to use ABC to infer these yet unknown parameters from the information contained in the observed data.

A pre-requisite for ABC is that we provide certain prior knowledge about the parameters we want to infer. In our case it is quite simple, we know from experience that the average height should be somewhere between 150cm and 200cm, and the standard deviation is around 5 to 25.

from abcpy.distributions import Uniform
prior = Uniform([150, 5],[200, 25], seed=1)

from abcpy.models import Gaussian
model = Gaussian(prior, seed=1)

Further, we need a means to quantify how close our observation is to synthetic data (generated by the model). Often the real and synthetic observations cannot compared directly in a reasonable of efficient way. Thus, summary statistics are used to extract relevant properties from the observations, with the idea the these stastistics then compared.

from abcpy.statistics import Identity
statistics_calculator = Identity(degree = 2, cross = False)

As a distance we chose the LogReg distance here. Note that in ABCpy distance functions operate not on the observations, but on summary statistice.

from abcpy.distances import LogReg
distance_calculator = LogReg(statistics_calculator)

We can now setup a inference scheme – let us chose PMCABC as our inference algorithm of choice. As a pre-requisit it requires a perturbation kernel and a backend. We define both in the following:

from abcpy.distributions import MultiStudentT
mean, cov, df = np.array([.0, .0]), np.eye(2), 3.
kernel = MultiStudentT(mean, cov, df, seed=1)

from abcpy.backends import BackendDummy as Backend
backend = Backend()

We instanciate an PMCABC object and pass the kernel and backend objects to the constructor:

from abcpy.inferences import PMCABC
sampler = PMCABC(model, distance_calculator, kernel, backend, seed=1)

Finally, we need to parametrize and start the actualy sampling:

T, n_sample, n_samples_per_param = 3, 250, 10
eps_arr = np.array([.75])
epsilon_percentile = 10
journal = sampler.sample(y_obs,  T, eps_arr, n_sample, n_samples_per_param, epsilon_percentile)

With this the inferrence process is done and the probabilities of the inferred parameters are stored in the journal object. See Post Analysis for further information on extracting results.

The code currently uses the dummy backend BackendDummy which does not parallelize the execution of the inference schemes, but is very handy quick prototyping and testing. To execute the code you only need to run

python3 gaussian.py

The full source can be found in examples/backends/dummy/pmcabc_gaussian.py.

Post Analysis

The output when sampling from an inferrence scheme is a Journal (abcpy.output.Journal) which holds all the necessary results and convenient methods to do the post analysis.

For example, one can easily access the sampled parameters and corresponding weights using:

print(journal.parameters)
print(journal.weights)

For the post analysis basic functions are provided:

print(journal.posterior_mean())
print(journal.posterior_cov())

Also, to ensure reproducibility, every journal stores the parameters of the algorithm that created it:

print(journal.configuration)

And certainly, a journal can easily be saved to and loaded from disk:

journal.save("experiments.jnl")
new_journal = Journal.fromFile('experiments.jnl')

Using Parallelization Backends

Running ABC algorithms is often computationally expensive, thus ABCpy is build with parallelization in mind. In order to run your inference schemes in parallel on multiple nodes (computers) you can choose from the following backends.

Using the Spark Backend

To run ABCpy in parallel using Apache Spark, one only needs to use the provided Spark backend. Considering the example from above, the statements for the backend have to be changed to

import pyspark
sc = pyspark.SparkContext()
from abcpy.backends import BackendSpark as Backend
backend = Backend(sc, parallelism=4)

In words, a Spark context has to be created and passed to the Spark backend. Additionally, the level of parallelism can be provided, which defines in a sense in how many blocks the work should be split up. It corresponds to the parallelism of an RDD in Apache Spark terminology. A good value is usually a small multiple of the total number of available cores.

The standard way to run the script on Spark is via the spark-submit command:

PYSPARK_PYTHON=python3 spark-submit pmcabc_gaussian.py

Often Spark installations use Python 2 by default. To make Spark use the required Python 3 interpreter, the PYSPARK_PYTHON environment variable can be set.

The adapted python code can be found in examples/backend/apache_spark/pmcabc_gaussian.py.

Note that in order to run jobs in parallel you need to have Apache Spark installed on the system in question. The dependencies of the MPI backend can be install with pip install -r requirements/backend-spark.txt.

Details on the installation can be found on the official homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.

Using the MPI Backend

To run ABCpy in parallel using MPI, one only needs to use the provided MPI backend. Using the same example as above, the statements for the backend have to be changed to

from abcpy.backends import BackendMPI as Backend
backend = Backend()

In words, one only needs to initialize an instance of the MPI backend. The number of ranks to spawn are specified at runtime through the way the script is run. A minimum of two ranks is required, since rank 0 (master) is used to orchestrade the calculation and all other ranks (workers) actually perform the calculation.

The standard way to run the script using Open MPI is directly via mpirun like below or on a cluster through a job scheduler like Slurm:

mpirun -np 4 python3 pmcabc_gaussian.py

The adapted Python code can be found in examples/backend/mpi/pmcabc_gaussian.py.

Note that in order to run jobs in parallel you need to have MPI installed on the system(s) in question with the requisite Python bindings for MPI (mpi4py). The dependencies of the MPI backend can be install with pip install -r requirements/backend-mpi.txt.

Details on the installation can be found on the official Open MPI homepage and the mpi4py homepage. Further, keep in mind that the ABCpy library has to be properly installed on the cluster, such that it is available to the Python interpreters on the master and the worker nodes.

Using Cluster Infrastructure

When your model is computationally expensive and/or other factors require compute infrastructure that goes beyond a single notebook or workstation you can easily run ABCpy on infrastructure for cluster or high-performance computing.

Running on Amazon Web Services

We show with high level steps how to get ABCpy running on Amazon Web Services (AWS). Please note, that this is not a complete guide to AWS, so we would like to refer you to the respective documentation. The first step would be to setup a AWS Elastic Map Reduce (EMR) cluster which comes with the option of a pre-configured Apache Spark. Then, we show how to run a simple inference code on this cluster.

Setting up the EMR Cluster

When we setup an EMR cluster we want to install ABCpy on every node of the cluster. Therefore, we provide a bootstrap script that does this job for us. On your local machine create a file named emr_bootstrap.sh with the following content:

#!/bin/sh
sudo yum -y install git
sudo pip-3.4 install ipython findspark abcpy

In AWS go to Services, then S3 under the Storage Section. Create a new bucket called abcpy and upload your bootstrap script emr_bootstap.sh.

To create a cluster, in AWS go to Services and then EMR under the Analytics Section. Click ‘Create Cluster’, then choose ‘Advanced Options’. In Step 1 choose the emr-5.7.0 image and make sure only Spark is selected for your cluster (the other software packages are not required). In Step 2 choose for example one master node and 4 core nodes (16 vCPUs if you have 4 vCPUs instances). In Step 3 under the boostrap action, choose custom, and select the script abcpy/emr_bootstrap.sh. In the last step (Step 4), choose a key to access the master node (we assume that you already setup keys). Start the cluster.

Running ABCpy on AWS

Log in via SSH and run the following commands to get an example code from ABCpy running with Python3 support:

sudo bash -c 'echo export PYSPARK_PYTHON=python34 >> /etc/spark/conf/spark-env.sh'
git clone https://github.com/eth-cscs/abcpy.git

Then, to submit a job to the Spark cluster we run the following commands:

cd abcpy/examples/backends/
spark-submit --num-executors 16 pmcabc_gaussian.py

Clearly the setup can be extended and optimized. For this and basic information we refer you to the AWS documentation on EMR.

Implementing a new Model

Often one wants to use one of the provided inference schemes on a new model, which is not part of ABCpy. We now go through the details of such a scenario using the Gaussian model to exemplify the mechanics.

Every model has to conform to the API specified by the abstract base class abcpy.models.Model. Thus, making a new model compatible with ABCpy, essentially boils down to implementing the following methods:

class Model(metaclass = ABCMeta):
    def __init__(self, prior, seed = None): 
    def set_parameters(self, theta):
    def sample_from_prior():
    def simulate(self, k):
    def get_parameters(self):

In the following we go through a few of the required methods, explain what is expected, and show how it would be implemented for the Gaussian model.

As a general note, one can say that it is always a good idea to consult the reference for implementation details. For the constructor, the reference states the following:

Consequently, we would implement a simple version of a Gaussian model as follows:

class Gaussian(Model):
    def __init__(self, prior, seed=None):
        self.prior = prior
        self.sample_from_prior()
        self.rng = np.random.RandomState(seed)

Here we actually initialize the model parameters by calling abcpy.models.Model.sample_from_prior, which is another functions that must be implemented. Its requirements are quite simple:

    def sample_from_prior(self):
        sample = self.prior.sample(1).reshape(-1)
        self.set_parameters(sample)

Let us have a look at the details on implementing abcpy.models.Model.set_parameters:

For a Gaussian model a simple implementation would look like the following:

    def set_parameters(self, theta):
        theta = np.array(theta)

        if theta.shape[0] > 2: return False
        if theta[1] <= 0: return False

        self.mu = theta[0]
        self.sigma = theta[1]
        return True

Note that abcpy.models.Model.set_parameters is expected to return a boolean dependent on whether the provided parameters are suitable for the model. Thus, we added a few tests to the method.

For the remaining methods that must be implemented, namely abcpy.models.Model.get_parameters and abcpy.models.Model.simulate, we proceed in exactly the same way. This leads to an implementation that might look like the following:

    def get_parameters(self):
        return np.array([self.mu, self.sigma])


    def simulate(self, k):
        return list((self.rng.normal(self.mu, self.sigma, k)).reshape(-1))

Our model now conforms to ABCpy and we can start inferring parameters in the same way (see Getting Started) as we would do with shipped models. The complete example code can be found here

Wrap a Model Written in C++

There are several frameworks that help you integrating your C++/C code into Python. We showcase examples for

Using Swig

Swig is a tool that creates a Python wrapper for our C++/C code using an interface (file) that we have to specify. We can then import the wrapper and in turn use your C++ code with ABCpy as if it was written in Python.

We go through a complete example to illustrate how to use a simple Gaussian model written in C++ with ABCpy. First, have a look at our C++ model:

void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed) {
  boost::mt19937 rng(seed);
  boost::normal_distribution<> nd(mu, sigma);
  boost::variate_generator<boost::mt19937, boost::normal_distribution<> > sampler(rng, nd);
  
  for (int i=0; i<k; ++i) {
    result[i] = sampler();
  }
}

To use this code in Python, we need to specify exactly how to expose the C++ function to Python. Therefore, we write a Swig interface file that look as follows:

%module gaussian_model_simple
%{
  #define SWIG_FILE_WITH_INIT
  
  #include <iostream>
  #include <boost/random.hpp>
  #include <boost/random/normal_distribution.hpp>
  
  extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);
%}

%include "numpy.i"

%init %{
  import_array();
%}

%apply (double* ARGOUT_ARRAY1, int DIM1 ) {(double* result, unsigned int k)};

extern void gaussian_model(double* result, unsigned int k, double mu, double sigma, int seed);

In the first line we define the module name we later have to import in your ABCpy Python code. Then, in curly brackets, we specify which libraries we want to include and which function we want to expose through the wrapper.

Now comes the tricky part. The model class expects a method simulate that forward-simulates our model and which returns an array of syntetic observations. However, C++/C does not know the concept of returning an array, instead in C++/C we would provide a memory position (pointer) where to write the results. Swig has to translate between the two concepts. We use actually an Swig interface definition from numpy called import_array. The line

%apply (double* ARGOUT_ARRAY1, int DIM1 ) {(double* result, unsigned int k)};

states that we want the two parameters result and k of the gaussian_model C++ function be interpreted as an array of length k that is returned. Have a look at the Python code below and observe how the wrapped Python function takes only two instead of four parameters and returns a numpy array.

The first stop to get everything running is to translate the Swig interface file to wrapper code in C++ and Python.

swig -python -c++ -o gaussian_model_simple_wrap.cpp gaussian_model_simple.i

This creates two wrapper files gaussian_model_simple_wrap.cpp and gaussian_model_simple.py. Now the C++ files can be compiled:

g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple.cpp -o gaussian_model_simple.o
g++ -fPIC -I /usr/include/python3.5m -c gaussian_model_simple_wrap.cpp -o gaussian_model_simple_wrap.o
g++ -shared gaussian_model_simple.o gaussian_model_simple_wrap.o -o _gaussian_model_simple.so

Note that the include paths might need to be adapted to your system. Finally, we can write a Python model which uses our C++ code:

from abcpy.models import Model
from gaussian_model_simple import gaussian_model

class Gaussian(Model):
    def __init__(self, prior, seed=None):
        self.prior = prior
        self.sample_from_prior()
        self.rng = np.random.RandomState(seed)
        

    def set_parameters(self, theta):
        theta = np.array(theta)
        if theta.shape[0] > 2: return False
        if theta[1] <= 0: return False

        self.mu = theta[0]
        self.sigma = theta[1]
        return True

    def get_parameters(self):
        return np.array([self.mu, self.sigma])

    def sample_from_prior(self):
        sample = self.prior.sample(1).reshape(-1)
        self.set_parameters(sample)

    def simulate(self, k):
        seed = self.rng.randint(np.iinfo(np.int32).max)
        result = gaussian_model(k, self.mu, self.sigma, seed)
        return list(result)

The important lines are where we import the wrapper code as a module (line 2) and call the respective model function (line -2).

The full code is available in examples/extensions/models/gaussion_cpp/. To simplify compilation of SWIG and C++ code we created a Makefile. Note that you might need to adapt some paths in the Makefile.

Wrap a Model Written in R

Statisticians often use the R language to build statistical models. R models can be incorporated within the ABCpy language with the rpy2 Python package. We show how to use the rpy2 package to connect with a model written in R.

Continuing from the previous sections we use a simple Gaussian model as an example. The following R code is the contents of the R file gaussian_model.R:

simple_gaussian <- function(mu, sigma, k = 1){
	output <- rnorm(k, mu, sigma)
	return(output)
}

More complex R models are incorporated in the same way. To include this function within ABCpy we include the following code at the beginning of our Python file:


import rpy2.robjects as robjects
import rpy2.robjects.numpy2ri
rpy2.robjects.numpy2ri.activate()

robjects.r('''
       source('gaussian_model.R')
''')

r_simple_gaussian = robjects.globalenv['simple_gaussian']

This imports the R function simple_gaussian into the Python environment. We need to build our own model to incorporate this R function as in the previous section. The only difference is the simulate method of the class `Gaussian’.

    def simulate(self, k):
        output = list(r_simple_gaussian(self.mu, self.sigma, k))
        return output

The default output for R functions in Python is a float vector. This must be converted into a Python list for the purposes of ABCpy.