4. Benchmarks

This page records local benchmark runs measured on March 27, 2026 for the current checkout. The goal was to compare exact inference in pybbn against other Bayesian-network toolkits on the same generated graphs.

These are machine-specific wall-clock timings. The absolute times should be treated as local reference numbers, while the speedup factors are the more portable part of the result set.

All speedup factors on this page are defined as:

\[\text{speedup} = \frac{\text{comparison engine time}}{\text{pybbn time}}\]

So any factor larger than 1.0x means pybbn was faster.
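As a worked instance of this formula, the following sketch (helper name illustrative, not benchmark code) computes the speedup for one pair of timings:

```python
# speedup = comparison engine time / pybbn time, as defined above.
def speedup(comparison_time_s: float, pybbn_time_s: float) -> float:
    return comparison_time_s / pybbn_time_s

# E.g. a 0.024 s bnlearn build vs a 0.002276 s pybbn build gives about
# a 10.5x factor (tabulated factors use the unrounded underlying timings).
print(f"{speedup(0.024000, 0.002276):.2f}x")
```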

4.1. Methodology

Graph families came from pybbn.generator.generate_singly_bbn() and pybbn.generator.generate_multi_bbn(). Every generated graph used binary domains with max_values=2, Dirichlet row sampling with max_alpha=10, and random seed 37. The measured graph sizes were 20, 40, 60, and 80 nodes with max_iter=40, 80, 120, and 160 respectively. The bnlearn comparisons were driven by _profile/bench_associational_crosslang.py, _profile/bench_interventional_crosslang.py, and _profile/bench_counterfactual_crosslang.py. The pyAgrum comparisons were driven by _profile/bench_associational_pyagrum.py and _profile/bench_interventional_pyagrum.py, with shared conversion helpers in _profile/pyagrum_benchmark_utils.py.

Targets, evidence nodes, and intervention nodes were chosen deterministically by spreading selections across sorted node ids so that the same graph always produced the same workload. Associational evidence nodes were clamped to their first state, s0. Interventional nodes were also clamped to s0 and were chosen from non-root nodes when possible so the workload exercised real graph surgery. Counterfactual workloads picked one intervention node with descendants, used its alternate state as the hypothetical intervention, and chose a descendant target with factual evidence {X=x', Y=y} so the same graph always produced the same exact counterfactual query shape.
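The deterministic "spread selections across sorted node ids" rule can be sketched as below. The helper name and exact spacing rule are illustrative assumptions, not the actual benchmark code; the point is only that sorting plus index arithmetic yields the same picks for the same graph every run:

```python
# Pick k node ids spread evenly across the sorted id list.
# Deterministic: no RNG involved, so the same graph always
# produces the same workload.
def spread_selection(node_ids, k):
    ordered = sorted(node_ids)
    if k >= len(ordered):
        return ordered
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(k)]

print(spread_selection(range(20), 4))  # evenly spaced ids from a 20-node graph
```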

Build timings are the median of 3 full observational model builds, and all query timings use 5 repetitions. On the pybbn side, the associational path measured create_reasoning_model(...) plus model.pquery(...). The interventional path measured model.intervene({...}) plus treated.pquery(...), and also split out compile-only and query-only costs. The counterfactual path measured full model.cquery(...) time and also split out shared twin-DP construction, twin-model compilation, and the final exact query on an already compiled twin model.

On the bnlearn side, associational and interventional inference used bnlearn::custom.fit(...), gRain::compile(as.grain(...)), gRain::setEvidence(...), bnlearn::mutilated(...), and querygrain(...) as appropriate. bnlearn has no native counterfactual API, so the counterfactual benchmark explicitly factors out the shared exact twin-network construction stage and then compares pybbn against bnlearn + gRain on that same twin model.

The pyAgrum runs on this page used pyagrum==2.3.2 in an isolated Python 3.12 virtual environment because the project checkout itself did not have pyagrum installed. The pyAgrum associational path used gum.LazyPropagation, the interventional path used pyagrum.causal.causalImpact(...), and the native pyAgrum counterfactual API was excluded from the apples-to-apples tables because its results on generated CPT-based BNs did not match the exact probability-space twin-network semantics already validated for pybbn against bnlearn + gRain.

Note

Warm repeated timings are intentionally included because they are relevant to real workloads, but they are not the same thing as one-shot latency. Large warm-query gains mainly come from pybbn caches: repeated associational priors reuse cached unconditional marginals, repeated associational evidence queries reuse cached calibrated cluster potentials, repeated interventional queries reuse compiled intervened models, and repeated counterfactual queries reuse cached counterfactual context and twin-model preparation on the pybbn side.
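The warm-versus-cold distinction above can be illustrated with a generic memoization sketch. This stands in for pybbn's internal caches (cached marginals, calibrated cluster potentials, compiled intervened models); it is not pybbn code:

```python
# Illustration only: the first identical query pays the full exact-inference
# cost; warm repeats are served from a cache.
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def cached_query(evidence_key: frozenset) -> float:
    calls["n"] += 1   # a real engine would run exact inference here
    return 0.5        # placeholder marginal

cold = cached_query(frozenset({("a", "s0")}))  # computes
warm = cached_query(frozenset({("a", "s0")}))  # cache hit
print(calls["n"])  # the expensive path ran only once
```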

4.2. bnlearn Results

The bnlearn comparison on this page uses bnlearn together with gRain for exact inference on the same generated graph families.

4.2.1. Associational Comparisons

The associational sweep compared exact marginal queries with and without evidence on both singly-connected and multi-connected generated graphs.

Singly-connected graphs had 19, 39, 59, and 79 edges with maximum in-degree 3, 3, 4, and 4 respectively. Multi-connected graphs had 31, 73, 116, and 158 edges with maximum in-degree 3, 3, 4, and 6.

The main pattern was that pybbn won build time everywhere, from about 4.9x up to 51.7x. On sparse graphs, bnlearn could still win the first uncached prior query, but on the denser multi-connected graphs pybbn pulled ahead even on cold one-shot exact queries. Repeated identical workloads strongly favored pybbn because the cached exact path stayed array-backed.

4.2.1.1. Singly-Connected Associational Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.002276 | 0.024000 | 10.55x | 0.001520 | 0.001000 | 0.66x | 0.000849 | 0.002000 | 2.36x |
| 40 | 0.004299 | 0.033000 | 7.68x | 0.003160 | 0.002000 | 0.63x | 0.002070 | 0.002000 | 0.97x |
| 60 | 0.011588 | 0.102000 | 8.80x | 0.008381 | 0.005000 | 0.60x | 0.005027 | 0.006000 | 1.19x |
| 80 | 0.008725 | 0.059000 | 6.76x | 0.006389 | 0.003000 | 0.47x | 0.003749 | 0.004000 | 1.07x |

4.2.1.2. Singly-Connected Associational Warm Timings

| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.000004 | 0.001000 | 251.83x | 0.000088 | 0.001500 | 16.98x |
| 40 | 0.000005 | 0.001500 | 330.25x | 0.000107 | 0.002000 | 18.68x |
| 60 | 0.000014 | 0.003000 | 216.03x | 0.000290 | 0.005000 | 17.24x |
| 80 | 0.000007 | 0.002000 | 287.48x | 0.000173 | 0.003000 | 17.34x |

4.2.1.3. Multi-Connected Associational Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.003115 | 0.028000 | 8.99x | 0.002131 | 0.001000 | 0.47x | 0.001456 | 0.003000 | 2.06x |
| 40 | 0.008160 | 0.040000 | 4.90x | 0.005409 | 0.004000 | 0.74x | 0.003889 | 0.005000 | 1.29x |
| 60 | 0.031002 | 0.150000 | 4.84x | 0.038639 | 0.105000 | 2.72x | 0.029857 | 0.123000 | 4.12x |
| 80 | 0.085349 | 4.416000 | 51.74x | 4.087615 | 8.714000 | 2.13x | 3.403747 | 11.024000 | 3.24x |

4.2.1.4. Multi-Connected Associational Warm Timings

| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.000004 | 0.001500 | 402.09x | 0.000095 | 0.002000 | 21.11x |
| 40 | 0.000005 | 0.004000 | 855.89x | 0.000129 | 0.004000 | 31.08x |
| 60 | 0.000014 | 0.097500 | 6727.17x | 0.000325 | 0.121000 | 372.67x |
| 80 | 0.000007 | 8.754000 | 1271270.53x | 0.027301 | 11.197500 | 410.15x |

4.2.2. Interventional Comparisons

The interventional sweep compared exact do(...) marginals between pybbn and bnlearn + gRain.

For interventional workloads, the benchmark reports three query views:

  • do total: end-to-end do(...) cost, including intervened-model compilation plus query execution

  • do compile: compilation of the intervened model only

  • do query: query execution on an already compiled intervened model

The main pattern was that pybbn won build time, cold end-to-end do(...) time, and cold intervened-model compilation across the whole sweep. On sparse graphs, bnlearn still had an advantage on the first query against an already compiled intervened model, but by the denser multi-connected graphs pybbn also won the compiled-model query itself.

4.2.2.1. Singly-Connected Interventional Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.002206 | 0.020000 | 9.07x | 0.003673 | 0.016000 | 4.36x | 0.002037 | 0.014000 | 6.87x | 0.001415 | 0.001000 | 0.71x |
| 40 | 0.004271 | 0.031000 | 7.26x | 0.007470 | 0.023000 | 3.08x | 0.004460 | 0.020000 | 4.48x | 0.003231 | 0.001000 | 0.31x |
| 60 | 0.006375 | 0.041000 | 6.43x | 0.010266 | 0.028000 | 2.73x | 0.005685 | 0.026000 | 4.57x | 0.004438 | 0.001000 | 0.23x |
| 80 | 0.008494 | 0.054000 | 6.36x | 0.025261 | 0.036000 | 1.43x | 0.007907 | 0.030000 | 3.79x | 0.006293 | 0.002000 | 0.32x |

4.2.2.2. Singly-Connected Interventional Warm Timings

| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|-------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.000010 | 0.015500 | 1495.92x | 0.000005 | 0.014500 | 2669.61x | 0.000003 | 0.001000 | 295.90x |
| 40 | 0.000012 | 0.021000 | 1699.23x | 0.000005 | 0.020000 | 3911.61x | 0.000004 | 0.001000 | 228.31x |
| 60 | 0.000016 | 0.028000 | 1698.77x | 0.000006 | 0.026500 | 4137.08x | 0.000007 | 0.002000 | 301.98x |
| 80 | 0.000017 | 0.034500 | 1987.90x | 0.000007 | 0.032000 | 4431.82x | 0.000007 | 0.002000 | 301.18x |

4.2.2.3. Multi-Connected Interventional Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.002811 | 0.026000 | 9.25x | 0.004643 | 0.018000 | 3.88x | 0.002697 | 0.014000 | 5.19x | 0.002047 | 0.001000 | 0.49x |
| 40 | 0.008155 | 0.040000 | 4.90x | 0.013016 | 0.033000 | 2.54x | 0.008327 | 0.026000 | 3.12x | 0.005985 | 0.004000 | 0.67x |
| 60 | 0.017649 | 0.071000 | 4.02x | 0.050510 | 0.109000 | 2.16x | 0.018501 | 0.041000 | 2.22x | 0.020000 | 0.058000 | 2.90x |
| 80 | 0.094908 | 4.398000 | 46.34x | 2.592297 | 5.843000 | 2.25x | 0.071562 | 0.710000 | 9.92x | 2.327604 | 5.102000 | 2.19x |

4.2.2.4. Multi-Connected Interventional Warm Timings

| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|-------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.000010 | 0.019000 | 1897.91x | 0.000006 | 0.017000 | 2804.35x | 0.000003 | 0.001000 | 301.12x |
| 40 | 0.000012 | 0.028500 | 2316.70x | 0.000006 | 0.025500 | 4581.78x | 0.000004 | 0.003500 | 842.77x |
| 60 | 0.000016 | 0.107500 | 6532.36x | 0.000006 | 0.039000 | 6221.09x | 0.000007 | 0.059000 | 8957.73x |
| 80 | 0.000017 | 5.846000 | 335399.43x | 0.000008 | 0.713500 | 93018.84x | 0.000006 | 5.080500 | 794136.26x |

4.2.3. Counterfactual Comparisons

The counterfactual sweep compared exact counterfactual marginals on generated graphs.

For counterfactual workloads, the benchmark reports four timing views:

  • cf total: full end-to-end exact counterfactual cost

  • shared twin dp: construction of the exact twin network that both engines need for a fair comparison

  • twin compile: compilation of the exact twin network only

  • twin query: the final exact query on an already compiled twin model with factual evidence

Because bnlearn does not expose a native counterfactual API, the fair end-to-end number here is bnlearn total + shared twin dp rather than the raw bnlearn twin compile/query number alone.
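The fair-total accounting just described is simple arithmetic; a sketch (helper name illustrative) using the 20-node singly-connected row from the tables below:

```python
# bnlearn has no native counterfactual API, so its fair end-to-end number
# is its twin-model cost plus the shared twin-DP construction that both
# engines need.
def fair_bnlearn_total(bnlearn_twin_time_s: float, shared_twin_dp_s: float) -> float:
    return bnlearn_twin_time_s + shared_twin_dp_s

# 20-node singly-connected: 0.020000 s bnlearn + 0.000794 s shared twin dp.
total = fair_bnlearn_total(0.020000, 0.000794)
print(f"{total:.6f}")  # 0.020794, the "cf total cold bnlearn+shared" cell
```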

The main pattern was that pybbn won build time everywhere, from about 2.7x to 8.9x, and won cold end-to-end exact counterfactual queries everywhere, from about 1.85x to 10.98x. Sparse singly-connected counterfactuals stayed cheap for both engines, so the absolute cold gaps were small there, while multi-connected exact counterfactuals widened the gap sharply, especially on the 40-node graph. Repeated identical counterfactual queries strongly favored pybbn, from about 250x up to about 13,757x faster.

All returned marginals agreed closely across engines. The observed maximum absolute difference over the completed sweep was between roughly 2.1e-12 and 4.5e-12.

Note

Counterfactual difficulty is driven by the size of the selected twin submodel, not just the original graph node count. That is why some 80-node points are faster than some 60-node points.

4.2.3.1. Singly-Connected Counterfactual Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|-------|-------------|---------------|---------|----------------|---------------------|------------------------------|-----------------|
| 20 | 0.002348 | 0.021000 | 8.94x | 0.000794 | 0.006013 | 0.020794 | 3.46x |
| 40 | 0.004005 | 0.032000 | 7.99x | 0.000643 | 0.004948 | 0.018643 | 3.77x |
| 60 | 0.007316 | 0.048000 | 6.56x | 0.001529 | 0.011698 | 0.034529 | 2.95x |
| 80 | 0.009485 | 0.063000 | 6.64x | 0.000936 | 0.007356 | 0.026936 | 3.66x |

4.2.3.2. Singly-Connected Counterfactual Split Cold Timings

| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|-------|--------------------|----------------------|----------------|------------------|--------------------|--------------|
| 20 | 0.002876 | 0.017000 | 5.91x | 0.000993 | 0.001000 | 1.01x |
| 40 | 0.002188 | 0.016000 | 7.31x | 0.000890 | 0.001000 | 1.12x |
| 60 | 0.005082 | 0.029000 | 5.71x | 0.002260 | 0.002000 | 0.88x |
| 80 | 0.003174 | 0.021000 | 6.62x | 0.001303 | 0.001000 | 0.77x |

4.2.3.3. Singly-Connected Counterfactual Warm Timings

| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|-------|---------------------|------------------------------|-----------------|
| 20 | 0.000069 | 0.019782 | 287.59x |
| 40 | 0.000070 | 0.017625 | 250.07x |
| 60 | 0.000117 | 0.032386 | 276.11x |
| 80 | 0.000071 | 0.023977 | 336.89x |

4.2.3.4. Multi-Connected Counterfactual Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|-------|-------------|---------------|---------|----------------|---------------------|------------------------------|-----------------|
| 20 | 0.004470 | 0.025000 | 5.59x | 0.006776 | 0.067426 | 0.158776 | 2.35x |
| 40 | 0.008077 | 0.039000 | 4.83x | 0.005788 | 0.130693 | 1.434788 | 10.98x |
| 60 | 0.012910 | 0.061000 | 4.73x | 0.008233 | 0.512215 | 0.948233 | 1.85x |
| 80 | 0.027299 | 0.074000 | 2.71x | 0.009404 | 0.323123 | 0.747404 | 2.31x |

4.2.3.5. Multi-Connected Counterfactual Split Cold Timings

| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|-------|--------------------|----------------------|----------------|------------------|--------------------|--------------|
| 20 | 0.040335 | 0.123000 | 3.05x | 0.005370 | 0.006000 | 1.12x |
| 40 | 0.039044 | 0.236000 | 6.04x | 0.071891 | 0.802000 | 11.16x |
| 60 | 0.058453 | 0.149000 | 2.55x | 0.385814 | 0.391000 | 1.01x |
| 80 | 0.066303 | 0.175000 | 2.64x | 0.225300 | 0.313000 | 1.39x |

4.2.3.6. Multi-Connected Counterfactual Warm Timings

| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|-------|---------------------|------------------------------|-----------------|
| 20 | 0.000075 | 0.144508 | 1938.50x |
| 40 | 0.000077 | 1.057110 | 13757.29x |
| 60 | 0.000089 | 0.571844 | 6445.30x |
| 80 | 0.000080 | 0.516321 | 6424.30x |

4.3. pyAgrum Results

The additional pyAgrum comparison covers exact associational and interventional queries on the same generated graph families. On every completed associational and interventional point, pyAgrum's results matched pybbn's numerically.

For the completed pyAgrum points, the maximum absolute differences were:

  • associational: about 1e-16 on both prior and evidence queries

  • interventional: about 1e-16 to 1e-13 on completed points
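The agreement check behind these bounds can be sketched as a maximum-absolute-difference comparison of per-state marginals. The function name and sample values are illustrative, not the benchmark code:

```python
# Compare two engines' marginals for the same query over the same states
# and report the worst-case absolute disagreement.
def max_abs_diff(marginals_a: dict, marginals_b: dict) -> float:
    assert marginals_a.keys() == marginals_b.keys()
    return max(abs(marginals_a[s] - marginals_b[s]) for s in marginals_a)

a = {"s0": 0.6321, "s1": 0.3679}
b = {"s0": 0.6321 + 1e-16, "s1": 0.3679 - 1e-16}
print(max_abs_diff(a, b) <= 1e-13)  # within the bound reported above
```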

Two important boundary conditions apply:

  • the multi-connected interventional runs at 40, 60, and 80 nodes were skipped because this environment repeatedly hard-killed those jobs, including an isolated retry of the 40-node multi-connected case; kernel logs for those retries showed the Linux OOM killer terminating the benchmark Python process after it grew to about 57.997 GiB RSS for multi-40, 58.022 GiB for multi-60, 58.030 GiB for multi-80, and 58.048 GiB on the isolated multi-40 rerun

  • the pyAgrum counterfactual API was excluded from direct speed comparison because it did not return the same distributions as pybbn on generated CPT-based BNs

The main pyAgrum pattern was:

  • pybbn won associational build time on every completed point

  • pyAgrum still won most first uncached associational prior queries, but pybbn won the 20-node multi-connected prior point and was nearly tied on the 40-node multi-connected prior point

  • pybbn usually won associational evidence queries and won every completed warm associational evidence point, but pyAgrum still won the cold 80-node evidence rows

  • pybbn won every completed interventional point in this rerun

4.3.1. Associational Comparisons

The pyAgrum associational comparison used the same generated graphs and measured exact marginal queries with and without evidence.

4.3.1.1. Associational vs pyAgrum Cold Timings

All times are in seconds.

| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | Prior cold pybbn | Prior cold pyAgrum | Prior cold x | Evidence cold pybbn | Evidence cold pyAgrum | Evidence cold x |
|--------|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.002096 | 0.002614 | 1.25x | 0.001205 | 0.000881 | 0.73x | 0.001050 | 0.002659 | 2.53x |
| Singly | 40 | 0.004042 | 0.004733 | 1.17x | 0.001842 | 0.000688 | 0.37x | 0.001974 | 0.004605 | 2.33x |
| Singly | 60 | 0.006198 | 0.007714 | 1.24x | 0.002816 | 0.000775 | 0.28x | 0.002107 | 0.007184 | 3.37x |
| Singly | 80 | 0.008090 | 0.010317 | 1.28x | 0.003633 | 0.000706 | 0.19x | 0.010210 | 0.009764 | 0.96x |
| Multi | 20 | 0.002757 | 0.003692 | 1.34x | 0.001471 | 0.001611 | 1.09x | 0.001208 | 0.004324 | 3.58x |
| Multi | 40 | 0.007764 | 0.012481 | 1.61x | 0.003764 | 0.003654 | 0.97x | 0.002947 | 0.016492 | 5.60x |
| Multi | 60 | 0.016036 | 0.032312 | 2.01x | 0.014893 | 0.006947 | 0.47x | 0.011325 | 0.043087 | 3.80x |
| Multi | 80 | 0.082692 | 0.993471 | 12.01x | 3.760185 | 0.121709 | 0.03x | 3.312482 | 1.209022 | 0.36x |

4.3.1.2. Associational vs pyAgrum Warm Timings

| Graph | Nodes | Prior warm pybbn | Prior warm pyAgrum | Prior warm x | Evidence warm pybbn | Evidence warm pyAgrum | Evidence warm x |
|--------|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.000011 | 0.000174 | 16.50x | 0.000199 | 0.002635 | 13.25x |
| Singly | 40 | 0.000011 | 0.000159 | 15.01x | 0.000192 | 0.004605 | 24.00x |
| Singly | 60 | 0.000012 | 0.000168 | 14.34x | 0.000201 | 0.007184 | 35.82x |
| Singly | 80 | 0.000011 | 0.000171 | 15.34x | 0.000199 | 0.009802 | 49.37x |
| Multi | 20 | 0.000011 | 0.000163 | 15.30x | 0.000197 | 0.004296 | 21.79x |
| Multi | 40 | 0.000011 | 0.000160 | 13.97x | 0.000211 | 0.016616 | 78.61x |
| Multi | 60 | 0.000011 | 0.000170 | 15.24x | 0.000238 | 0.043247 | 181.85x |
| Multi | 80 | 0.000012 | 0.000249 | 21.19x | 0.028507 | 1.243487 | 43.62x |

4.3.2. Interventional Comparisons

The pyAgrum interventional comparison measured exact do(...) marginals on the same generated causal graphs.

4.3.2.1. Interventional vs pyAgrum Cold Timings

All times are in seconds.

| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | do total cold pybbn | do total cold pyAgrum | do total cold x |
|--------|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.002126 | 0.002627 | 1.24x | 0.003318 | 0.010088 | 3.04x |
| Singly | 40 | 0.003905 | 0.004858 | 1.24x | 0.005465 | 0.009704 | 1.78x |
| Singly | 60 | 0.006378 | 0.007942 | 1.25x | 0.008404 | 0.037552 | 4.47x |
| Singly | 80 | 0.008242 | 0.009912 | 1.20x | 0.010970 | 0.012289 | 1.12x |
| Multi | 20 | 0.002770 | 0.003680 | 1.33x | 0.004126 | 0.058473 | 14.17x |
| Multi | 40 | not run | not run | not run | not run | not run | not run |
| Multi | 60 | not run | not run | not run | not run | not run | not run |
| Multi | 80 | not run | not run | not run | not run | not run | not run |

4.3.2.2. Interventional vs pyAgrum Warm Timings

| Graph | Nodes | do total warm pybbn | do total warm pyAgrum | do total warm x |
|--------|-------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.000019 | 0.010196 | 537.08x |
| Singly | 40 | 0.000019 | 0.009167 | 485.29x |
| Singly | 60 | 0.000020 | 0.036283 | 1805.04x |
| Singly | 80 | 0.000034 | 0.011851 | 350.40x |
| Multi | 20 | 0.000018 | 0.054234 | 3095.47x |
| Multi | 40 | not run | not run | not run |
| Multi | 60 | not run | not run | not run |
| Multi | 80 | not run | not run | not run |

Note

The missing 40-, 60-, and 80-node multi-connected interventional points are not hidden negative results. Those jobs were repeatedly hard-killed in this environment before completion, including an isolated retry of the 40-node multi-connected case, so this page excludes them instead of guessing or extrapolating.

Note

Those missing rows are also a practical scaling result. Kernel logs on this machine showed the benchmark Python process being terminated by the Linux OOM killer during the pyAgrum multi-connected interventional runs at 40, 60, and 80 nodes. The recorded anonymous resident set sizes at kill time were 60813988 kB (57.997 GiB) for multi-40, 60840596 kB (58.022 GiB) for multi-60, 60848964 kB (58.030 GiB) for multi-80, and 60868028 kB (58.048 GiB) on the isolated multi-40 rerun. For context, this machine reports 65784396 kB of RAM (62.737 GiB) and 2097148 kB of swap (2.000 GiB). These are still small exact-inference graph sizes compared with the larger-node pybbn-only profiling path supported elsewhere in the repo, including the optional bn-10k slot in _profile/bench_matrix.py when that fixture is available.
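The GiB figures in this note follow directly from the kernel's kB values: the OOM killer logs anonymous RSS in kB (KiB), and dividing by 1024² converts to GiB. A minimal conversion check:

```python
# Convert kernel-reported kB (KiB) values to GiB.
def kb_to_gib(kb: int) -> float:
    return kb / (1024 ** 2)

# RSS at kill time for the failed runs, then total RAM and swap.
for kb in (60813988, 60840596, 60848964, 60868028, 65784396, 2097148):
    print(f"{kb} kB = {kb_to_gib(kb):.3f} GiB")
```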

Note

pyAgrum exposes a native causal.counterfactual API, but on these generated CPT-based BNs it did not return the same distributions as the exact probability-space twin-network counterfactuals already validated for pybbn against bnlearn + gRain. For that reason, the pyAgrum counterfactual timings are intentionally omitted from the apples-to-apples benchmark tables on this page.

4.4. Interpretation

The strongest cross-toolkit pattern is that pybbn consistently won model build time against both bnlearn and, on completed points, pyAgrum. On the bnlearn side, pybbn also won cold end-to-end interventional and counterfactual workloads across the sweep. On the pyAgrum side, pybbn won every completed cold end-to-end interventional query in this rerun, ranging from about 1.1x faster up to about 14.2x faster, and it dominated warm repeated associational and interventional workloads.

The main exact associational caveat is still one-shot prior latency. Against bnlearn, sparse graphs could still favor the first uncached prior query, though pybbn pulled ahead once the multi-connected graphs became denser. Against pyAgrum, pybbn won the sparse 20-node multi-connected prior row outright and was nearly tied on the 40-node multi-connected prior row, but pyAgrum remained strongest on most singly-connected and denser multi-connected cold prior rows. pybbn won most cold associational evidence rows against both toolkits, although pyAgrum still won the cold 80-node evidence rows.

The missing pyAgrum multi-connected interventional 40, 60, and 80 rows are themselves a practical scaling result. Those runs were terminated by the Linux OOM killer, with the benchmark Python process reaching about 57.997 GiB, 58.022 GiB, 58.030 GiB, and 58.048 GiB RSS across the failed attempts on a machine with 62.737 GiB RAM and 2.000 GiB swap. Counterfactual inference remains a direct comparison only between pybbn and bnlearn + gRain, because the native pyAgrum counterfactual API did not align numerically with the exact probability-space counterfactual semantics used here. The large warm-query gaps elsewhere on the page are real for repeated workloads, but they mostly reflect the effectiveness of pybbn cache reuse rather than one-shot latency alone.