4. Benchmarks
This page records local benchmark runs measured on March 27, 2026 for the
current checkout. The goal was to compare exact inference in pybbn against
other Bayesian-network toolkits on the same generated graphs.
These are machine-specific wall-clock timings. Treat the absolute times as local reference numbers; the speedup factors are the more portable part of the results.
All speedup factors on this page are defined as the other toolkit's time divided by the pybbn time for the same measurement, so any factor larger than 1.0x means pybbn was faster.
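As a small concrete check of that convention (`speedup_factor` is an illustrative helper name, not something from the benchmark scripts):

```python
def speedup_factor(other_seconds: float, pybbn_seconds: float) -> float:
    """Speedup as reported on this page: the other toolkit's time
    divided by the pybbn time for the same measurement."""
    return other_seconds / pybbn_seconds

# 20-node singly-connected build row below: bnlearn 0.024000 s, pybbn 0.002276 s.
factor = speedup_factor(0.024000, 0.002276)  # about 10.5x, i.e. pybbn was faster
```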
4.1. Methodology
Graph families came from
pybbn.generator.generate_singly_bbn() and
pybbn.generator.generate_multi_bbn(). Every generated graph used binary
domains with max_values=2, Dirichlet row sampling with max_alpha=10,
and random seed 37. The measured graph sizes were 20, 40, 60,
and 80 nodes with max_iter=40, 80, 120, and 160
respectively. The bnlearn comparisons were driven by
_profile/bench_associational_crosslang.py,
_profile/bench_interventional_crosslang.py, and
_profile/bench_counterfactual_crosslang.py. The pyAgrum comparisons
were driven by _profile/bench_associational_pyagrum.py and
_profile/bench_interventional_pyagrum.py, with shared conversion helpers in
_profile/pyagrum_benchmark_utils.py.
Targets, evidence nodes, and intervention nodes were chosen deterministically
by spreading selections across sorted node ids so that the same graph always
produced the same workload. Associational evidence nodes were clamped to their
first state, s0. Interventional nodes were also clamped to s0 and were
chosen from non-root nodes when possible so the workload exercised real graph
surgery. Counterfactual workloads picked one intervention node with descendants,
used its alternate state as the hypothetical intervention, and chose a
descendant target with factual evidence {X=x', Y=y} so the same graph
always produced the same exact counterfactual query shape.
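The deterministic spreading described above can be sketched as follows (an illustrative reconstruction; `spread_selection` is a hypothetical name, and the actual `_profile/` helpers may differ in detail):

```python
def spread_selection(node_ids, k):
    """Pick k nodes spread evenly across the sorted id list.

    Sorting first makes the choice a pure function of the graph's
    node set, so the same graph always yields the same workload.
    """
    ordered = sorted(node_ids)
    if k >= len(ordered):
        return ordered
    step = len(ordered) / k
    # Evenly spaced indices across the sorted list.
    return [ordered[int(i * step)] for i in range(k)]

# The same input always gives the same targets, regardless of set ordering.
targets = spread_selection({"n3", "n0", "n7", "n1", "n5", "n9"}, 3)
```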
Build timings are the median of 3 full observational model builds, and all
query timings use 5 repetitions. On the pybbn side, the associational
path measured create_reasoning_model(...) plus model.pquery(...). The
interventional path measured model.intervene({...}) plus
treated.pquery(...) and also split out compile-only and query-only costs.
The counterfactual path measured full model.cquery(...) time and also
split out shared twin-DP construction, twin-model compilation, and the final
exact query on an already compiled twin model. On the bnlearn side,
associational and interventional inference used
bnlearn::custom.fit(...), gRain::compile(as.grain(...)),
gRain::setEvidence(...), bnlearn::mutilated(...), and querygrain(...)
as appropriate. bnlearn has no native counterfactual API, so the
counterfactual benchmark explicitly factors out the shared exact twin-network
construction stage and then compares pybbn against bnlearn + gRain
on that same twin model. The pyAgrum runs on this page used
pyagrum==2.3.2 in an isolated Python 3.12 virtual environment because the
project checkout itself did not have pyagrum installed. The
pyAgrum associational path used gum.LazyPropagation, the
interventional path used pyagrum.causal.causalImpact(...), and the native
pyAgrum counterfactual API was excluded from the apples-to-apples tables
because its results on generated CPT-based BNs did not match the exact
probability-space twin-network semantics already validated for pybbn
against bnlearn + gRain.
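The timing protocol above (median of 3 full builds, 5 query repetitions) can be sketched with a generic harness. This is a sketch only, with hypothetical function names; it also assumes the first repetition is the cold number and the rest are warm, which matches how this page labels its tables:

```python
import statistics
import time

def time_calls(fn, repetitions):
    """Wall-clock each call and return the individual timings."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

def build_time(build_fn):
    """Median of 3 full observational model builds."""
    return statistics.median(time_calls(build_fn, 3))

def query_times(query_fn):
    """5 repetitions: the first is 'cold', the median of the rest is 'warm'."""
    timings = time_calls(query_fn, 5)
    return timings[0], statistics.median(timings[1:])

# Dummy workload standing in for a real build/query pair.
cold, warm = query_times(lambda: sum(range(1000)))
```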
Note
Warm repeated timings are intentionally included because they are relevant to
real workloads, but they are not the same thing as one-shot latency. Large
warm-query gains mainly come from pybbn caches: repeated associational
priors reuse cached unconditional marginals, repeated associational evidence
queries reuse cached calibrated cluster potentials, repeated interventional
queries reuse compiled intervened models, and repeated counterfactual
queries reuse cached counterfactual context and twin-model preparation on
the pybbn side.
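The cache reuse described above can be illustrated with a generic memoization pattern (a toy sketch of the idea only; pybbn's actual cache keys and internal structures are not shown here):

```python
class CachedQueryEngine:
    """Toy illustration of why warm repeated queries are near-free.

    The first identical query pays the full inference cost; repeats
    hit a cache keyed on the frozen evidence assignment.
    """

    def __init__(self, infer_fn):
        self._infer = infer_fn
        self._cache = {}
        self.misses = 0

    def query(self, evidence):
        key = frozenset(evidence.items())
        if key not in self._cache:
            # Cache miss: run the (expensive) exact inference once.
            self.misses += 1
            self._cache[key] = self._infer(evidence)
        return self._cache[key]

engine = CachedQueryEngine(lambda ev: {"target": 0.5})
engine.query({"a": "s0"})
engine.query({"a": "s0"})  # warm repeat: served from cache
```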
4.2. bnlearn Results
The bnlearn comparison on this page uses bnlearn together with
gRain for exact inference on the same generated graph families.
4.2.1. Associational Comparisons
The associational sweep compared exact marginal queries with and without evidence on both singly-connected and multi-connected generated graphs.
Singly-connected graphs had 19, 39, 59, and 79 edges with
maximum in-degree 3, 3, 4, and 4 respectively. Multi-connected
graphs had 31, 73, 116, and 158 edges with maximum in-degree
3, 3, 4, and 6.
The main pattern was that pybbn won build time everywhere, from about
4.9x up to 51.7x. On sparse graphs, bnlearn could still win the
first uncached prior query, but on the denser multi-connected graphs
pybbn pulled ahead even on cold one-shot exact queries. Repeated identical
workloads strongly favored pybbn because the cached exact path stayed
array-backed.
4.2.1.1. Singly-Connected Associational Cold Timings
All times are in seconds.
| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.002276 | 0.024000 | 10.55x | 0.001520 | 0.001000 | 0.66x | 0.000849 | 0.002000 | 2.36x |
| 40 | 0.004299 | 0.033000 | 7.68x | 0.003160 | 0.002000 | 0.63x | 0.002070 | 0.002000 | 0.97x |
| 60 | 0.011588 | 0.102000 | 8.80x | 0.008381 | 0.005000 | 0.60x | 0.005027 | 0.006000 | 1.19x |
| 80 | 0.008725 | 0.059000 | 6.76x | 0.006389 | 0.003000 | 0.47x | 0.003749 | 0.004000 | 1.07x |
4.2.1.2. Singly-Connected Associational Warm Timings
| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|---|---|---|---|---|---|---|
| 20 | 0.000004 | 0.001000 | 251.83x | 0.000088 | 0.001500 | 16.98x |
| 40 | 0.000005 | 0.001500 | 330.25x | 0.000107 | 0.002000 | 18.68x |
| 60 | 0.000014 | 0.003000 | 216.03x | 0.000290 | 0.005000 | 17.24x |
| 80 | 0.000007 | 0.002000 | 287.48x | 0.000173 | 0.003000 | 17.34x |
4.2.1.3. Multi-Connected Associational Cold Timings
| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.003115 | 0.028000 | 8.99x | 0.002131 | 0.001000 | 0.47x | 0.001456 | 0.003000 | 2.06x |
| 40 | 0.008160 | 0.040000 | 4.90x | 0.005409 | 0.004000 | 0.74x | 0.003889 | 0.005000 | 1.29x |
| 60 | 0.031002 | 0.150000 | 4.84x | 0.038639 | 0.105000 | 2.72x | 0.029857 | 0.123000 | 4.12x |
| 80 | 0.085349 | 4.416000 | 51.74x | 4.087615 | 8.714000 | 2.13x | 3.403747 | 11.024000 | 3.24x |
4.2.1.4. Multi-Connected Associational Warm Timings
| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|---|---|---|---|---|---|---|
| 20 | 0.000004 | 0.001500 | 402.09x | 0.000095 | 0.002000 | 21.11x |
| 40 | 0.000005 | 0.004000 | 855.89x | 0.000129 | 0.004000 | 31.08x |
| 60 | 0.000014 | 0.097500 | 6727.17x | 0.000325 | 0.121000 | 372.67x |
| 80 | 0.000007 | 8.754000 | 1271270.53x | 0.027301 | 11.197500 | 410.15x |
4.2.2. Interventional Comparisons
The interventional sweep compared exact do(...) marginals between
pybbn and bnlearn + gRain.
For interventional workloads, the benchmark reports three query views:
do total for end-to-end do(...) cost including intervened-model
compilation plus query execution, do compile for compilation of the
intervened model only, and do query for query execution on an already
compiled intervened model. The main pattern was that pybbn won build time
and cold end-to-end do(...) time across the whole sweep, and also won cold
intervened-model compilation across the whole sweep. On sparse graphs,
bnlearn still had an advantage on the first query against an already
compiled intervened model, but by the denser multi-connected graphs
pybbn also won the compiled-model query itself.
4.2.2.1. Singly-Connected Interventional Cold Timings
All times are in seconds.
| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.002206 | 0.020000 | 9.07x | 0.003673 | 0.016000 | 4.36x | 0.002037 | 0.014000 | 6.87x | 0.001415 | 0.001000 | 0.71x |
| 40 | 0.004271 | 0.031000 | 7.26x | 0.007470 | 0.023000 | 3.08x | 0.004460 | 0.020000 | 4.48x | 0.003231 | 0.001000 | 0.31x |
| 60 | 0.006375 | 0.041000 | 6.43x | 0.010266 | 0.028000 | 2.73x | 0.005685 | 0.026000 | 4.57x | 0.004438 | 0.001000 | 0.23x |
| 80 | 0.008494 | 0.054000 | 6.36x | 0.025261 | 0.036000 | 1.43x | 0.007907 | 0.030000 | 3.79x | 0.006293 | 0.002000 | 0.32x |
4.2.2.2. Singly-Connected Interventional Warm Timings
| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.000010 | 0.015500 | 1495.92x | 0.000005 | 0.014500 | 2669.61x | 0.000003 | 0.001000 | 295.90x |
| 40 | 0.000012 | 0.021000 | 1699.23x | 0.000005 | 0.020000 | 3911.61x | 0.000004 | 0.001000 | 228.31x |
| 60 | 0.000016 | 0.028000 | 1698.77x | 0.000006 | 0.026500 | 4137.08x | 0.000007 | 0.002000 | 301.98x |
| 80 | 0.000017 | 0.034500 | 1987.90x | 0.000007 | 0.032000 | 4431.82x | 0.000007 | 0.002000 | 301.18x |
4.2.2.3. Multi-Connected Interventional Cold Timings
| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.002811 | 0.026000 | 9.25x | 0.004643 | 0.018000 | 3.88x | 0.002697 | 0.014000 | 5.19x | 0.002047 | 0.001000 | 0.49x |
| 40 | 0.008155 | 0.040000 | 4.90x | 0.013016 | 0.033000 | 2.54x | 0.008327 | 0.026000 | 3.12x | 0.005985 | 0.004000 | 0.67x |
| 60 | 0.017649 | 0.071000 | 4.02x | 0.050510 | 0.109000 | 2.16x | 0.018501 | 0.041000 | 2.22x | 0.020000 | 0.058000 | 2.90x |
| 80 | 0.094908 | 4.398000 | 46.34x | 2.592297 | 5.843000 | 2.25x | 0.071562 | 0.710000 | 9.92x | 2.327604 | 5.102000 | 2.19x |
4.2.2.4. Multi-Connected Interventional Warm Timings
| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|---|---|---|---|---|---|---|---|---|---|
| 20 | 0.000010 | 0.019000 | 1897.91x | 0.000006 | 0.017000 | 2804.35x | 0.000003 | 0.001000 | 301.12x |
| 40 | 0.000012 | 0.028500 | 2316.70x | 0.000006 | 0.025500 | 4581.78x | 0.000004 | 0.003500 | 842.77x |
| 60 | 0.000016 | 0.107500 | 6532.36x | 0.000006 | 0.039000 | 6221.09x | 0.000007 | 0.059000 | 8957.73x |
| 80 | 0.000017 | 5.846000 | 335399.43x | 0.000008 | 0.713500 | 93018.84x | 0.000006 | 5.080500 | 794136.26x |
4.2.3. Counterfactual Comparisons
The counterfactual sweep compared exact counterfactual marginals on generated graphs.
For counterfactual workloads, the benchmark reports four timing views:
cf total for full end-to-end exact counterfactual cost, shared twin dp
for construction of the exact twin network that both engines need for a fair
comparison, twin compile for compilation of the exact twin network only,
and twin query for the final exact query on an already compiled twin model
with factual evidence. Because bnlearn does not expose a native
counterfactual API, the fair end-to-end number here is
bnlearn total + shared twin dp rather than the raw bnlearn twin
compile/query number alone.
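The fairness adjustment is simple arithmetic and can be checked against any row of the cold tables below; here using the 20-node singly-connected values (raw bnlearn twin time 0.020000 s, shared twin dp 0.000794 s, pybbn cf total 0.006013 s; the helper name is illustrative):

```python
def fair_bnlearn_total(raw_bnlearn_seconds, shared_twin_dp_seconds):
    """bnlearn lacks a native counterfactual API, so the comparable
    end-to-end number adds back the shared twin-network construction."""
    return raw_bnlearn_seconds + shared_twin_dp_seconds

total = fair_bnlearn_total(0.020000, 0.000794)  # 0.020794 s, as tabulated
factor = total / 0.006013                       # about 3.46x in pybbn's favor
```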
The main pattern was that pybbn won build time everywhere, from about
2.7x to 8.9x, and won cold end-to-end exact counterfactual queries
everywhere, from about 1.85x to 10.98x. Sparse singly-connected
counterfactuals stayed cheap for both engines, so the absolute cold gaps were
small there, while multi-connected exact counterfactuals widened the gap
sharply, especially on the 40-node graph. Repeated identical
counterfactual queries strongly favored pybbn, from about 250x up to
about 13,757x faster.
All returned marginals agreed closely across engines. The observed maximum
absolute difference over the completed sweep was between roughly 2.1e-12
and 4.5e-12.
Note
Counterfactual difficulty is driven by the size of the selected twin
submodel, not just the original graph node count. That is why some
80-node points are faster than some 60-node points.
4.2.3.1. Singly-Connected Counterfactual Cold Timings
All times are in seconds.
| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|---|---|---|---|---|---|---|---|
| 20 | 0.002348 | 0.021000 | 8.94x | 0.000794 | 0.006013 | 0.020794 | 3.46x |
| 40 | 0.004005 | 0.032000 | 7.99x | 0.000643 | 0.004948 | 0.018643 | 3.77x |
| 60 | 0.007316 | 0.048000 | 6.56x | 0.001529 | 0.011698 | 0.034529 | 2.95x |
| 80 | 0.009485 | 0.063000 | 6.64x | 0.000936 | 0.007356 | 0.026936 | 3.66x |
4.2.3.2. Singly-Connected Counterfactual Split Cold Timings
| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|---|---|---|---|---|---|---|
| 20 | 0.002876 | 0.017000 | 5.91x | 0.000993 | 0.001000 | 1.01x |
| 40 | 0.002188 | 0.016000 | 7.31x | 0.000890 | 0.001000 | 1.12x |
| 60 | 0.005082 | 0.029000 | 5.71x | 0.002260 | 0.002000 | 0.88x |
| 80 | 0.003174 | 0.021000 | 6.62x | 0.001303 | 0.001000 | 0.77x |
4.2.3.3. Singly-Connected Counterfactual Warm Timings
| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|---|---|---|---|
| 20 | 0.000069 | 0.019782 | 287.59x |
| 40 | 0.000070 | 0.017625 | 250.07x |
| 60 | 0.000117 | 0.032386 | 276.11x |
| 80 | 0.000071 | 0.023977 | 336.89x |
4.2.3.4. Multi-Connected Counterfactual Cold Timings
| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|---|---|---|---|---|---|---|---|
| 20 | 0.004470 | 0.025000 | 5.59x | 0.006776 | 0.067426 | 0.158776 | 2.35x |
| 40 | 0.008077 | 0.039000 | 4.83x | 0.005788 | 0.130693 | 1.434788 | 10.98x |
| 60 | 0.012910 | 0.061000 | 4.73x | 0.008233 | 0.512215 | 0.948233 | 1.85x |
| 80 | 0.027299 | 0.074000 | 2.71x | 0.009404 | 0.323123 | 0.747404 | 2.31x |
4.2.3.5. Multi-Connected Counterfactual Split Cold Timings
| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|---|---|---|---|---|---|---|
| 20 | 0.040335 | 0.123000 | 3.05x | 0.005370 | 0.006000 | 1.12x |
| 40 | 0.039044 | 0.236000 | 6.04x | 0.071891 | 0.802000 | 11.16x |
| 60 | 0.058453 | 0.149000 | 2.55x | 0.385814 | 0.391000 | 1.01x |
| 80 | 0.066303 | 0.175000 | 2.64x | 0.225300 | 0.313000 | 1.39x |
4.2.3.6. Multi-Connected Counterfactual Warm Timings
| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|---|---|---|---|
| 20 | 0.000075 | 0.144508 | 1938.50x |
| 40 | 0.000077 | 1.057110 | 13757.29x |
| 60 | 0.000089 | 0.571844 | 6445.30x |
| 80 | 0.000080 | 0.516321 | 6424.30x |
4.3. pyAgrum Results
The additional pyAgrum comparison covers exact associational and
interventional queries on the same generated graph families. Completed
associational and interventional points matched pybbn numerically.
For the completed pyAgrum points, the maximum absolute differences were:
- associational: about 1e-16 on both prior and evidence queries
- interventional: about 1e-16 to 1e-13 on completed points
Two important boundary conditions apply:
- the multi-connected interventional runs at 40, 60, and 80 nodes were skipped because this environment repeatedly hard-killed those jobs, including an isolated retry of the 40-node multi-connected case; kernel logs for those retries showed the Linux OOM killer terminating the benchmark Python process after it grew to about 57.997 GiB RSS for multi-40, 58.022 GiB for multi-60, 58.030 GiB for multi-80, and 58.048 GiB on the isolated multi-40 rerun
- the pyAgrum counterfactual API was excluded from direct speed comparison because it did not return the same distributions as pybbn on generated CPT-based BNs
The main pyAgrum pattern was:
- pybbn won associational build time on every completed point
- pyAgrum still won most first uncached associational prior queries, but pybbn won the 20-node multi-connected prior point and was nearly tied on the 40-node multi-connected prior point
- pybbn usually won associational evidence queries and won every completed warm associational evidence point, but pyAgrum still won the cold 80-node evidence rows
- pybbn won every completed interventional point in this rerun
4.3.1. Associational Comparisons
The pyAgrum associational comparison used the same generated graphs and
measured exact marginal queries with and without evidence.
4.3.1.1. Associational vs pyAgrum Cold Timings
All times are in seconds.
| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | Prior cold pybbn | Prior cold pyAgrum | Prior cold x | Evidence cold pybbn | Evidence cold pyAgrum | Evidence cold x |
|---|---|---|---|---|---|---|---|---|---|---|
| Singly | 20 | 0.002096 | 0.002614 | 1.25x | 0.001205 | 0.000881 | 0.73x | 0.001050 | 0.002659 | 2.53x |
| Singly | 40 | 0.004042 | 0.004733 | 1.17x | 0.001842 | 0.000688 | 0.37x | 0.001974 | 0.004605 | 2.33x |
| Singly | 60 | 0.006198 | 0.007714 | 1.24x | 0.002816 | 0.000775 | 0.28x | 0.002107 | 0.007184 | 3.37x |
| Singly | 80 | 0.008090 | 0.010317 | 1.28x | 0.003633 | 0.000706 | 0.19x | 0.010210 | 0.009764 | 0.96x |
| Multi | 20 | 0.002757 | 0.003692 | 1.34x | 0.001471 | 0.001611 | 1.09x | 0.001208 | 0.004324 | 3.58x |
| Multi | 40 | 0.007764 | 0.012481 | 1.61x | 0.003764 | 0.003654 | 0.97x | 0.002947 | 0.016492 | 5.60x |
| Multi | 60 | 0.016036 | 0.032312 | 2.01x | 0.014893 | 0.006947 | 0.47x | 0.011325 | 0.043087 | 3.80x |
| Multi | 80 | 0.082692 | 0.993471 | 12.01x | 3.760185 | 0.121709 | 0.03x | 3.312482 | 1.209022 | 0.36x |
4.3.1.2. Associational vs pyAgrum Warm Timings
| Graph | Nodes | Prior warm pybbn | Prior warm pyAgrum | Prior warm x | Evidence warm pybbn | Evidence warm pyAgrum | Evidence warm x |
|---|---|---|---|---|---|---|---|
| Singly | 20 | 0.000011 | 0.000174 | 16.50x | 0.000199 | 0.002635 | 13.25x |
| Singly | 40 | 0.000011 | 0.000159 | 15.01x | 0.000192 | 0.004605 | 24.00x |
| Singly | 60 | 0.000012 | 0.000168 | 14.34x | 0.000201 | 0.007184 | 35.82x |
| Singly | 80 | 0.000011 | 0.000171 | 15.34x | 0.000199 | 0.009802 | 49.37x |
| Multi | 20 | 0.000011 | 0.000163 | 15.30x | 0.000197 | 0.004296 | 21.79x |
| Multi | 40 | 0.000011 | 0.000160 | 13.97x | 0.000211 | 0.016616 | 78.61x |
| Multi | 60 | 0.000011 | 0.000170 | 15.24x | 0.000238 | 0.043247 | 181.85x |
| Multi | 80 | 0.000012 | 0.000249 | 21.19x | 0.028507 | 1.243487 | 43.62x |
4.3.2. Interventional Comparisons
The pyAgrum interventional comparison measured exact do(...) marginals
on the same generated causal graphs.
4.3.2.1. Interventional vs pyAgrum Cold Timings
All times are in seconds.
| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | do total cold pybbn | do total cold pyAgrum | do total cold x |
|---|---|---|---|---|---|---|---|
| Singly | 20 | 0.002126 | 0.002627 | 1.24x | 0.003318 | 0.010088 | 3.04x |
| Singly | 40 | 0.003905 | 0.004858 | 1.24x | 0.005465 | 0.009704 | 1.78x |
| Singly | 60 | 0.006378 | 0.007942 | 1.25x | 0.008404 | 0.037552 | 4.47x |
| Singly | 80 | 0.008242 | 0.009912 | 1.20x | 0.010970 | 0.012289 | 1.12x |
| Multi | 20 | 0.002770 | 0.003680 | 1.33x | 0.004126 | 0.058473 | 14.17x |
| Multi | 40 | not run | not run | not run | not run | not run | not run |
| Multi | 60 | not run | not run | not run | not run | not run | not run |
| Multi | 80 | not run | not run | not run | not run | not run | not run |
4.3.2.2. Interventional vs pyAgrum Warm Timings
| Graph | Nodes | do total warm pybbn | do total warm pyAgrum | do total warm x |
|---|---|---|---|---|
| Singly | 20 | 0.000019 | 0.010196 | 537.08x |
| Singly | 40 | 0.000019 | 0.009167 | 485.29x |
| Singly | 60 | 0.000020 | 0.036283 | 1805.04x |
| Singly | 80 | 0.000034 | 0.011851 | 350.40x |
| Multi | 20 | 0.000018 | 0.054234 | 3095.47x |
| Multi | 40 | not run | not run | not run |
| Multi | 60 | not run | not run | not run |
| Multi | 80 | not run | not run | not run |
Note
The missing 40-, 60-, and 80-node multi-connected
interventional points are not hidden negative results. Those jobs were
repeatedly hard-killed in this environment before completion, including an
isolated retry of the 40-node multi-connected case, so this page
excludes them instead of guessing or extrapolating.
Note
Those missing rows are also a practical scaling result. Kernel logs on this
machine showed the benchmark Python process being terminated by the Linux
OOM killer during the pyAgrum multi-connected interventional runs at
40, 60, and 80 nodes. The recorded anonymous resident set sizes
at kill time were 60813988 kB (57.997 GiB) for multi-40,
60840596 kB (58.022 GiB) for multi-60, 60848964 kB
(58.030 GiB) for multi-80, and 60868028 kB (58.048 GiB)
on the isolated multi-40 rerun. For context, this machine reports
65784396 kB of RAM (62.737 GiB) and 2097148 kB of swap
(2.000 GiB). These are still small exact-inference graph sizes compared
with the larger-node pybbn-only profiling path supported elsewhere in
the repo, including the optional bn-10k slot in
_profile/bench_matrix.py when that fixture is available.
Note
pyAgrum exposes a native causal.counterfactual API, but on these
generated CPT-based BNs it did not return the same distributions as the
exact probability-space twin-network counterfactuals already validated for
pybbn against bnlearn + gRain. For that reason, the
pyAgrum counterfactual timings are intentionally omitted from the
apples-to-apples benchmark tables on this page.
4.4. Interpretation
The strongest cross-toolkit pattern is that pybbn consistently won model
build time against both bnlearn and, on completed points, pyAgrum. On
the bnlearn side, pybbn also won cold end-to-end interventional and
counterfactual workloads across the sweep. On the pyAgrum side,
pybbn won every completed cold end-to-end interventional query in this
rerun, ranging from about 1.1x faster up to about 14.2x faster, and
it dominated warm repeated associational and interventional workloads.
The main exact associational caveat is still one-shot prior latency. Against
bnlearn, sparse graphs could still favor the first uncached prior query,
though pybbn pulled ahead once the multi-connected graphs became denser.
Against pyAgrum, pybbn won the sparse 20-node multi-connected
prior row outright and was nearly tied on the 40-node multi-connected
prior row, but pyAgrum remained strongest on most singly-connected and
denser multi-connected cold prior rows. pybbn won most cold associational
evidence rows against both toolkits, although pyAgrum still won the cold
80-node evidence rows.
The missing pyAgrum multi-connected interventional 40, 60, and
80 rows are themselves a practical scaling result. Those runs were
terminated by the Linux OOM killer, with the benchmark Python process reaching
about 57.997 GiB, 58.022 GiB, 58.030 GiB, and 58.048 GiB RSS
across the failed attempts on a machine with 62.737 GiB RAM and
2.000 GiB swap. Counterfactual inference remains a direct comparison only
between pybbn and bnlearn + gRain, because the native
pyAgrum counterfactual API did not align numerically with the exact
probability-space counterfactual semantics used here. The large warm-query
gaps elsewhere on the page are real for repeated workloads, but they mostly
reflect the effectiveness of pybbn cache reuse rather than one-shot
latency alone.