1. Quickstart
A Bayesian Belief Network (BBN) is defined as a pair (D, P), where D is a directed acyclic graph (DAG) and P is a joint distribution over a set of variables corresponding to the nodes in the DAG. Creating a reasoning model involves defining D and P. The BBN is then converted into a secondary structure called a join tree [HD99] for probabilistic and interventional queries. Internally, the reasoning model uses Structural Causal Models (SCMs) for counterfactual queries [PGJ16].
1.1. Creating a model
1.1.1. Create the structure, DAG
Simply define your structure using a dictionary. The nodes in this graph mean the following.
gender is male or female
drug is whether the patient took the medication
recovery is whether the patient recovered
[1]:
d = {
    'nodes': ['drug', 'gender', 'recovery'],
    'edges': [
        ['gender', 'drug'],
        ['gender', 'recovery'],
        ['drug', 'recovery']
    ]
}
[2]:
from pybbn.associational import dict_to_graph
import networkx as nx
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5, 5))
g = dict_to_graph(d)
pos = nx.nx_agraph.graphviz_layout(g, prog='dot')
nx.draw(g, pos=pos, with_labels=True, node_color='#e0e0e0')
fig.tight_layout()
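Because D must be acyclic, it can be worth checking the dictionary before building a model. The following is a minimal sketch that uses only the standard library (`graphlib`, Python 3.9+). The helper name `topological_order` is illustrative and is not part of the library described here.

```python
from graphlib import CycleError, TopologicalSorter

d = {
    'nodes': ['drug', 'gender', 'recovery'],
    'edges': [
        ['gender', 'drug'],
        ['gender', 'recovery'],
        ['drug', 'recovery']
    ]
}


def topological_order(structure):
    """Return nodes in parent-before-child order; raise ValueError on a cycle."""
    # TopologicalSorter expects a mapping of node -> set of predecessors (parents).
    parents = {node: set() for node in structure['nodes']}
    for parent, child in structure['edges']:
        parents[child].add(parent)
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as err:
        raise ValueError(f'structure is not a DAG: {err}') from err


order = topological_order(d)
print(order)  # ['gender', 'drug', 'recovery']
```

This structure has exactly one valid ordering, since gender precedes drug and both precede recovery.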
1.1.2. Create the parameters, CPTs
The parameters are conditional probability tables (CPTs). A CPT is defined for each node through dictionaries (inspired by Pandas split and records orientations).
[3]:
p = {
    'gender': {
        'columns': ['gender', '__p__'],
        'data': [
            ['male', 0.51], ['female', 0.49]
        ]
    },
    'drug': {
        'columns': ['gender', 'drug', '__p__'],
        'data': [
            ['female', 'no', 0.23],
            ['female', 'yes', 0.77],
            ['male', 'no', 0.76],
            ['male', 'yes', 0.24]
        ]
    },
    'recovery': {
        'columns': ['gender', 'drug', 'recovery', '__p__'],
        'data': [
            ['female', 'no', 'no', 0.31],
            ['female', 'no', 'yes', 0.69],
            ['female', 'yes', 'no', 0.27],
            ['female', 'yes', 'yes', 0.73],
            ['male', 'no', 'no', 0.13],
            ['male', 'no', 'yes', 0.87],
            ['male', 'yes', 'no', 0.07],
            ['male', 'yes', 'yes', 0.93]
        ]
    }
}
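A CPT is only valid if, for every combination of parent values, the probabilities over the node's own values sum to 1. The sketch below checks this for one table. It assumes the column order used in this section (parents first, then the node, then `__p__`). The helper `cpt_is_normalized` and the `broken` table are made up for illustration.

```python
import math
from collections import defaultdict

drug_cpt = {
    'columns': ['gender', 'drug', '__p__'],
    'data': [
        ['female', 'no', 0.23],
        ['female', 'yes', 0.77],
        ['male', 'no', 0.76],
        ['male', 'yes', 0.24]
    ]
}


def cpt_is_normalized(cpt, tol=1e-9):
    """Return True if each parent configuration's probabilities sum to 1."""
    totals = defaultdict(float)
    for row in cpt['data']:
        # Assumed layout: [parent values..., node value, probability].
        *parent_values, _node_value, prob = row
        totals[tuple(parent_values)] += prob
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())


ok = cpt_is_normalized(drug_cpt)

# A deliberately wrong root-node table (0.51 + 0.50 != 1), for contrast.
broken = {'columns': ['gender', '__p__'], 'data': [['male', 0.51], ['female', 0.50]]}
bad = cpt_is_normalized(broken)

print(ok, bad)  # True False
```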
1.1.3. Create the model
We use the create_reasoning_model() convenience function to create an inference model.
[4]:
from pybbn.factory import create_reasoning_model
model = create_reasoning_model(d, p)
1.2. Associational query
Associational queries are probabilistic queries. They can be executed with different types of evidence, or with a mixture of evidence types.
1.2.1. Query without evidence
We can query the model without any evidence as follows. The posteriors come back as Pandas dataframes.
[5]:
q = model.pquery()
[6]:
q['gender']
[6]:
| | gender | __p__ |
---|---|---|
0 | female | 0.49 |
1 | male | 0.51 |
[7]:
q['drug']
[7]:
| | drug | __p__ |
---|---|---|
0 | no | 0.5003 |
1 | yes | 0.4997 |
[8]:
q['recovery']
[8]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.195764 |
1 | yes | 0.804236 |
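
With no evidence, each posterior is simply a marginal of the joint distribution, so the numbers above can be reproduced by brute-force enumeration. This is a sketch for intuition only: the model itself uses a join tree, and the dictionaries and helper names below are an illustrative re-encoding of the CPTs, not library objects.

```python
from itertools import product

p_gender = {'male': 0.51, 'female': 0.49}
p_drug = {  # P(drug | gender), keyed by (gender, drug)
    ('female', 'no'): 0.23, ('female', 'yes'): 0.77,
    ('male', 'no'): 0.76, ('male', 'yes'): 0.24,
}
p_recovery = {  # P(recovery | gender, drug), keyed by (gender, drug, recovery)
    ('female', 'no', 'no'): 0.31, ('female', 'no', 'yes'): 0.69,
    ('female', 'yes', 'no'): 0.27, ('female', 'yes', 'yes'): 0.73,
    ('male', 'no', 'no'): 0.13, ('male', 'no', 'yes'): 0.87,
    ('male', 'yes', 'no'): 0.07, ('male', 'yes', 'yes'): 0.93,
}


def joint(g, d, r):
    """P(G=g, D=d, R=r) from the chain rule implied by the DAG."""
    return p_gender[g] * p_drug[(g, d)] * p_recovery[(g, d, r)]


def marginal(position):
    """Sum the joint over all assignments, grouped by one variable's value."""
    out = {}
    for assignment in product(['female', 'male'], ['no', 'yes'], ['no', 'yes']):
        key = assignment[position]
        out[key] = out.get(key, 0.0) + joint(*assignment)
    return out


drug_marginal = marginal(1)      # approximately {'no': 0.5003, 'yes': 0.4997}
recovery_marginal = marginal(2)  # approximately {'no': 0.195764, 'yes': 0.804236}
print(drug_marginal, recovery_marginal)
```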
1.2.2. Query with observation evidence
Arguably, observation evidence is the most common type of evidence. With observation evidence, exactly one value of the variable is set to 1 and the rest are set to 0. We can query the model with observation evidence as follows.
[9]:
evidences = {
    'gender': model.create_observation_evidences('gender', 'male')
}
q = model.pquery(evidences=evidences)
[10]:
q['gender']
[10]:
| | gender | __p__ |
---|---|---|
0 | female | 0.0 |
1 | male | 1.0 |
[11]:
q['drug']
[11]:
| | drug | __p__ |
---|---|---|
0 | no | 0.76 |
1 | yes | 0.24 |
[12]:
q['recovery']
[12]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.1156 |
1 | yes | 0.8844 |
1.2.3. Query with observation evidence, shortcut
There is a shortcut for creating observation evidence that avoids the verbose approach above.
[13]:
q = model.pquery(evidences=model.e({'gender': 'male'}))
[14]:
q['gender']
[14]:
| | gender | __p__ |
---|---|---|
0 | female | 0.0 |
1 | male | 1.0 |
[15]:
q['drug']
[15]:
| | drug | __p__ |
---|---|---|
0 | no | 0.76 |
1 | yes | 0.24 |
[16]:
q['recovery']
[16]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.1156 |
1 | yes | 0.8844 |
1.2.4. Query with finding evidence
Finding evidence generalizes observation evidence: each value is still in \(\{0, 1\}\), but more than one value can be set to 1. At least one value must be set to 1; otherwise, normalization would divide by zero.
[17]:
evidences = {
    'gender': model.create_finding_evidences('gender', [1, 0], ['male', 'female'])
}
q = model.pquery(evidences=evidences)
[18]:
q['gender']
[18]:
| | gender | __p__ |
---|---|---|
0 | female | 0.0 |
1 | male | 1.0 |
[19]:
q['drug']
[19]:
| | drug | __p__ |
---|---|---|
0 | no | 0.76 |
1 | yes | 0.24 |
[20]:
q['recovery']
[20]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.1156 |
1 | yes | 0.8844 |
1.2.5. Query with virtual evidence
Virtual evidence is the most general form of evidence, generalizing both observation and finding evidence. Each value can be any number in the range \([0, 1]\).
[21]:
evidences = {
    'gender': model.create_virtual_evidences('gender', [0.01, 0.99], ['male', 'female'])
}
q = model.pquery(evidences=evidences)
[22]:
q['gender']
[22]:
| | gender | __p__ |
---|---|---|
0 | female | 0.989596 |
1 | male | 0.010404 |
[23]:
q['drug']
[23]:
| | drug | __p__ |
---|---|---|
0 | no | 0.235514 |
1 | yes | 0.764486 |
[24]:
q['recovery']
[24]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.277498 |
1 | yes | 0.722502 |
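
One way to read the three evidence types is as likelihood vectors that are multiplied into a node's distribution and then renormalized. The sketch below applies that reading to gender alone. It is an interpretation that matches the gender posteriors shown above, not a description of the library's internals, and the helper `apply_evidence` is hypothetical.

```python
def apply_evidence(prior, likelihood):
    """Multiply prior by likelihood value-wise, then renormalize to sum to 1."""
    unnormalized = {value: prior[value] * likelihood[value] for value in prior}
    total = sum(unnormalized.values())
    if total == 0:
        # This is the division-by-zero case: the evidence rules out every value.
        raise ValueError('evidence assigns zero weight to every value')
    return {value: weight / total for value, weight in unnormalized.items()}


prior = {'male': 0.51, 'female': 0.49}

# Observation: exactly one value set to 1.
observation = apply_evidence(prior, {'male': 1, 'female': 0})
# Finding: values in {0, 1}; with both set to 1, the prior is unchanged.
finding = apply_evidence(prior, {'male': 1, 'female': 1})
# Virtual: any values in [0, 1].
virtual = apply_evidence(prior, {'male': 0.01, 'female': 0.99})

print(observation)
print(finding)
print(virtual)  # male is approximately 0.010404, female approximately 0.989596
```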
1.2.6. Query with mixed types of evidence
Here, we show how to issue an associational query with mixed types of evidence.
[25]:
evidences = {
    'gender': model.create_observation_evidences('gender', 'male'),
    'drug': model.create_virtual_evidences('drug', [0.60, 0.40], ['yes', 'no'])
}
q = model.pquery(evidences=evidences)
[26]:
q['gender']
[26]:
| | gender | __p__ |
---|---|---|
0 | female | 0.0 |
1 | male | 1.0 |
[27]:
q['drug']
[27]:
| | drug | __p__ |
---|---|---|
0 | no | 0.678571 |
1 | yes | 0.321429 |
[28]:
q['recovery']
[28]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.110714 |
1 | yes | 0.889286 |
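
The mixed-evidence posteriors can be checked by hand. Observing gender as male selects the male rows of the CPTs, the virtual evidence reweights drug, and the reweighted drug distribution is then pushed through to recovery. The sketch below is a hand calculation under that reading, with illustrative variable names.

```python
# CPT rows for gender = male, copied from the parameters in section 1.1.2.
p_drug_given_male = {'no': 0.76, 'yes': 0.24}
p_recovery_yes_given_male = {'no': 0.87, 'yes': 0.93}  # P(recovery=yes | male, drug)

# Virtual evidence on drug, as in the query above.
likelihood = {'yes': 0.60, 'no': 0.40}

# Reweight the drug distribution by the likelihood and renormalize.
weights = {d: p_drug_given_male[d] * likelihood[d] for d in p_drug_given_male}
normalizer = sum(weights.values())
drug_posterior = {d: w / normalizer for d, w in weights.items()}

# Propagate the updated drug distribution to recovery.
recovery_yes = sum(
    drug_posterior[d] * p_recovery_yes_given_male[d] for d in drug_posterior
)

print(drug_posterior)  # yes is approximately 0.321429
print(recovery_yes)    # approximately 0.889286
```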
1.3. Interventional query
To estimate causal effects, we can apply the do operator [PGJ16]. For brevity, in the running example, denote the following.
\(G\) is gender
\(D\) is drug
\(R\) is recovery
The (backdoor) adjustment formula sums over the values of the confounder \(G\).
\(P(R=r|\mathrm{do}(D=d)) = \sum_g P(R=r|D=d, G=g) P(G=g)\)
We can estimate the causal effects separately.
\(P(R=\mathrm{yes}|\mathrm{do}(D=\mathrm{yes})) = \sum_g P(R=\mathrm{yes}|D=\mathrm{yes}, G=g) P(G=g)\)
\(P(R=\mathrm{yes}|\mathrm{do}(D=\mathrm{no})) = \sum_g P(R=\mathrm{yes}|D=\mathrm{no}, G=g) P(G=g)\)
The average causal effect (ACE) can then be computed as follows.
\(\mathrm{ACE} = P(R=\mathrm{yes}|\mathrm{do}(D=\mathrm{yes})) - P(R=\mathrm{yes}|\mathrm{do}(D=\mathrm{no}))\)
[29]:
p_yes = model.iquery(Y=['recovery'], y=['yes'], X=['drug'], x=['yes'])
p_yes
[29]:
recovery 0.832
dtype: float64
[30]:
p_no = model.iquery(Y=['recovery'], y=['yes'], X=['drug'], x=['no'])
p_no
[30]:
recovery 0.7818
dtype: float64
[31]:
p_yes['recovery'] - p_no['recovery']
[31]:
0.05020000000000002
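The interventional results can be reproduced directly from the CPTs by summing over gender in the backdoor adjustment. The sketch below is a hand calculation. The dictionaries re-encode the parameters from section 1.1.2, and `p_do` is an illustrative helper rather than a library function.

```python
p_gender = {'male': 0.51, 'female': 0.49}
p_recovery_yes = {  # P(recovery=yes | gender, drug), keyed by (gender, drug)
    ('female', 'no'): 0.69, ('female', 'yes'): 0.73,
    ('male', 'no'): 0.87, ('male', 'yes'): 0.93,
}


def p_do(drug_value):
    """P(recovery=yes | do(drug=drug_value)), adjusting for gender."""
    return sum(
        p_recovery_yes[(g, drug_value)] * p_gender[g] for g in p_gender
    )


p_yes = p_do('yes')  # 0.73 * 0.49 + 0.93 * 0.51, approximately 0.832
p_no = p_do('no')    # 0.69 * 0.49 + 0.87 * 0.51, approximately 0.7818
ace = p_yes - p_no   # approximately 0.0502

print(p_yes, p_no, ace)
```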
1.4. Counterfactual query
In this example, we want to compute the counterfactual: Given that a male patient did not take the drug and did not recover, what would the probability of recovery be had the patient taken the drug?
The evidence is that the patient is male, did not take the drug and did not recover. The evidence is the factual (it actually did happen).
\(G=\mathrm{male}\)
\(D=\mathrm{no}\)
\(R=\mathrm{no}\)
The hypothetical is "had the patient taken the drug". The hypothetical is the counterfactual.
\(D^*=\mathrm{yes}\)
The probability of interest is recovery in the counterfactual.
\(P(R_{D=d^*} | G=g, D=d, R=r)\)
[32]:
Y = 'recovery'
e = {
    'gender': 'male',
    'drug': 'no',
    'recovery': 'no'
}
h = {
    'drug': 'yes'
}
The probability of recovery for the counterfactual is about 0.83.
[33]:
model.cquery(Y, e, h)
[33]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.173882 |
1 | yes | 0.826118 |
1.5. Graphical query
Below are some examples of graphical queries.
1.5.1. d-separation and conditional independence
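We can query whether two nodes are d-separated [Pea18], optionally given a conditioning set. In this graph, drug and recovery share a direct edge, so no conditioning set can d-separate them, and both queries return False.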
[34]:
model.is_d_separated('drug', 'recovery')
[34]:
False
[35]:
model.is_d_separated('drug', 'recovery', {'gender'})
[35]:
False
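For intuition, d-separation can be tested with the classic moralized ancestral graph construction: keep only the ancestors of the nodes involved, connect parents that share a child, drop edge directions, remove the conditioning set, and check whether a path remains. The sketch below is a standalone implementation of that construction, not the library's own code. The chain and collider graphs at the end are hypothetical examples added to show a True result.

```python
def is_d_separated(edges, x, y, z=frozenset()):
    """Return True if x and y are d-separated given z in the DAG with these edges."""
    z = set(z)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
        parents.setdefault(parent, set())
    # 1. Ancestral set of x, y, and z (each node together with its ancestors).
    keep, stack = set(), [x, y, *z]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: undirected parent-child links plus links between co-parents.
    neighbors = {node: set() for node in keep}
    for child in keep:
        node_parents = list(parents.get(child, ()))
        for p in node_parents:
            neighbors[p].add(child)
            neighbors[child].add(p)
        for i, a in enumerate(node_parents):
            for b in node_parents[i + 1:]:
                neighbors[a].add(b)
                neighbors[b].add(a)
    # 3. Search from x while never entering a node in z.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        for nxt in neighbors[node] - z - seen:
            seen.add(nxt)
            stack.append(nxt)
    return True


edges = [('gender', 'drug'), ('gender', 'recovery'), ('drug', 'recovery')]
print(is_d_separated(edges, 'drug', 'recovery'))              # False
print(is_d_separated(edges, 'drug', 'recovery', {'gender'}))  # False

# Hypothetical graphs, for contrast.
chain = [('a', 'b'), ('b', 'c')]
collider = [('a', 'c'), ('b', 'c')]
print(is_d_separated(chain, 'a', 'c', {'b'}))     # True: b blocks the chain
print(is_d_separated(collider, 'a', 'b'))         # True: unobserved collider blocks
print(is_d_separated(collider, 'a', 'b', {'c'}))  # False: conditioning opens it
```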
1.5.2. Confounders and backdoors
We can query for the minimal set of confounders between two nodes [PGJ16].
[36]:
model.get_minimal_confounders('drug', 'recovery')
[36]:
['gender']
1.5.3. Mediators and frontdoors
We can query for the minimal set of mediators between two nodes [PGJ16].
[37]:
model.get_minimal_mediators('drug', 'recovery')
[37]:
[]
1.6. Data sampling
Sampling is done through logic sampling [Hen88]. If evidence is provided, then rejection sampling is performed: samples that disagree with the evidence are discarded.
[38]:
sample_df = model.sample(max_samples=1_000)
sample_df.shape
[38]:
(1000, 3)
[39]:
sample_df.head()
[39]:
| | gender | drug | recovery |
---|---|---|---|
0 | female | yes | no |
1 | female | yes | yes |
2 | male | no | yes |
3 | female | yes | yes |
4 | male | yes | yes |
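
Logic sampling draws each node in topological order, conditioning on the values already drawn for its parents. With evidence, any sample that disagrees with the evidence is rejected. The sketch below is a standalone illustration using the `random` module. The function `sample`, its parameters, and the nested dictionaries are assumptions for this example and do not mirror the library's `sample()` signature.

```python
import random

p_gender = {'male': 0.51, 'female': 0.49}
p_drug = {  # P(drug | gender)
    'female': {'no': 0.23, 'yes': 0.77},
    'male': {'no': 0.76, 'yes': 0.24},
}
p_recovery = {  # P(recovery | gender, drug)
    ('female', 'no'): {'no': 0.31, 'yes': 0.69},
    ('female', 'yes'): {'no': 0.27, 'yes': 0.73},
    ('male', 'no'): {'no': 0.13, 'yes': 0.87},
    ('male', 'yes'): {'no': 0.07, 'yes': 0.93},
}


def draw(dist, rng):
    """Draw one value from a {value: probability} dictionary."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]


def sample(n, evidence=None, seed=37):
    """Return n accepted samples; loops forever if the evidence is impossible."""
    rng = random.Random(seed)
    evidence = evidence or {}
    rows = []
    while len(rows) < n:
        # Topological order: gender, then drug, then recovery.
        g = draw(p_gender, rng)
        d = draw(p_drug[g], rng)
        r = draw(p_recovery[(g, d)], rng)
        row = {'gender': g, 'drug': d, 'recovery': r}
        # Rejection step: keep the row only if it agrees with all evidence.
        if all(row[name] == value for name, value in evidence.items()):
            rows.append(row)
    return rows


rows = sample(1_000)
males = sample(50, evidence={'gender': 'male'})
print(len(rows), len(males))  # 1000 50
```

Rejection sampling becomes slow when the evidence is unlikely, because most draws are discarded.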
1.7. Serde
Saving and loading the model is easy.
1.7.1. Serialization
To persist the model, use model_to_dict() to create a Python dictionary and then serialize the dictionary as JSON data.
[40]:
import json
import tempfile

from pybbn.serde import model_to_dict

data1 = model_to_dict(model)
# delete=False keeps the file on disk after the with-block closes it
with tempfile.NamedTemporaryFile(mode='w', delete=False) as fp:
    json.dump(data1, fp)
    file_path = fp.name
print(f'{file_path=}')
file_path='/var/folders/vt/g8zbc68n2nj8dkk85n8b19440000gn/T/tmp3cb_zxtz'
1.7.2. Deserialization
To load the model, use the json module to deserialize the dictionary, and then use dict_to_model() to recreate the model.
[41]:
from pybbn.serde import dict_to_model

with open(file_path, 'r') as fp:
    data2 = json.load(fp)
model2 = dict_to_model(data2)
[42]:
q = model2.pquery()
[43]:
q['gender']
[43]:
| | gender | __p__ |
---|---|---|
0 | female | 0.49 |
1 | male | 0.51 |
[44]:
q['drug']
[44]:
| | drug | __p__ |
---|---|---|
0 | no | 0.5003 |
1 | yes | 0.4997 |
[45]:
q['recovery']
[45]:
| | recovery | __p__ |
---|---|---|
0 | no | 0.195764 |
1 | yes | 0.804236 |