pyBMA

Bayesian Model Averaging in Python

In a previous post we talking about how model averaging allows us to combine models together to produce estimates that are better than those of any individual model. There I stated that the actual mechanics are needlessly complicated to do by hand and that we should outsource to a package.

We use use pyBMA to do this for survival data in python. The module is based on the R package BMA. Through time more functionality will be added, but at the moment there is only survival data.

Installation

pyBMA can be installed from pypi using pip as normal

pip3.5 install pyBMA

How it works

Given a survival dataset, pyBMA does the following things:

1 - Generates a full set of possible models

2 - Uses lifelines to run Cox Proportional hazards and generate log-likihoods.

3 - Calculates the posterior likihood of each model given the data and some priors

4 - Performs a weighted average over the models based on the posterior model likihood

How to use it

The API of pyBMA is designed to mirror that of lifelines to allow as easy an integration as possible. Compare the two in the snippet below:

##pyBMA version
bma_cf = CoxPHFitter()
bma_cf.fit(rossi_dataset, 'week', event_col='arrest')

## Lifelines version
cf = CoxPHFitter()
cf.fit(rossi_dataset, 'week', event_col='arrest')

One addition is that you can now specify a prior for each variable. This should be inputted as a list of numbers between 0 and 1 in the same order as the covariate variables appear in the main dataframe. The prior for a variable is your belief about the probability that the variable will be included in the correct model. E.g. if you are certain that a variable must occur in a model for it to be correct, then set the prior for that variable to 1, while if you consider it as likely as not to be included then choose 0.5. The default sets all the priors to 0.5

##pyBMA version
bma_cf = CoxPHFitter()
bma_cf.fit(rossi_dataset, 'week', event_col='arrest', priors = np.array([0.3, 0.6, 0.7, 0.1, 0.9, 0.5, 0.03])

This setup will give more weight to those models which contain the 5th variable, and much less weight to the models including the last variable over and above the model’s likihood given just the data.