5. Benchmarks

These are local wall-clock reference measurements for exact discrete Bayesian network queries. Absolute timings depend on the machine and runtime, so the relative comparisons are the useful part. Warm timings reuse an already prepared model. Cold timings include fresh model preparation before each query.

The native-port table uses the shared deterministic query corpus at 30 and 1000 nodes. Every listed port matched the Python reference output within tolerance across the full query set.
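The warm/cold distinction matters for reading the table below. As a minimal sketch (the actual benchmark driver is not shown here, and `prepare`/`query` are hypothetical placeholders), the two conditions differ only in whether model preparation sits inside the timed region:

```python
import time

def bench(prepare, query, repeats=5):
    """Hypothetical harness returning (warm_ms, cold_ms) mean per-query times."""
    # Cold: fresh model preparation is inside the timed region for every query.
    cold = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        model = prepare()
        query(model)
        cold.append((time.perf_counter() - t0) * 1000.0)

    # Warm: prepare once, then reuse the already prepared model for every timed query.
    model = prepare()
    warm = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        query(model)
        warm.append((time.perf_counter() - t0) * 1000.0)

    return sum(warm) / repeats, sum(cold) / repeats
```

Under this shape, the warm column isolates steady-state query cost, while the cold column adds per-query preparation overhead on top of it.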

Discrete Query Runtime By Port

| Language   | 30 warm ms | 30 cold ms | 1000 warm ms | 1000 cold ms | vs Python cold |
|------------|-----------:|-----------:|-------------:|-------------:|---------------:|
| C++        | 0.020 | 0.305 | 0.302 | 3.138 | 4.6x |
| Rust       | 0.056 | 0.305 | 0.311 | 3.184 | 4.5x |
| Ruby       | 0.138 | 0.323 | 0.314 | 3.196 | 4.5x |
| Lua        | 0.125 | 0.346 | 0.342 | 3.242 | 4.4x |
| Go         | 0.056 | 0.312 | 0.314 | 3.260 | 4.4x |
| Swift      | 0.055 | 0.343 | 0.338 | 3.288 | 4.4x |
| R          | 0.196 | 0.543 | 0.471 | 3.477 | 4.1x |
| Octave     | 0.322 | 0.538 | 0.516 | 3.513 | 4.1x |
| Java       | 0.039 | 1.379 | 0.136 | 3.705 | 3.9x |
| TypeScript | 0.052 | 1.064 | 0.285 | 5.153 | 2.8x |
| C#         | 0.015 | 1.870 | 0.100 | 10.253 | 1.4x |
| Python     | 0.034 | 3.392 | 0.042 | 14.351 | 1.0x |
| Julia      | 0.055 | 19.244 | 0.348 | 22.766 | 0.6x |

vs Python cold uses the 1000-node cold Python mean as the baseline. Values above 1.0x are faster than Python under that condition. The fastest ports on the 1000-node cold condition are tightly clustered: C++, Rust, Ruby, Lua, Go, and Swift all fall between 3.138 and 3.288 ms per query.
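The ratio column is straightforward to reproduce from the table itself. A small check, using the 1000-node cold means listed above:

```python
python_cold_ms = 14.351  # 1000-node cold Python mean from the table above

cold_ms = {"C++": 3.138, "Rust": 3.184, "C#": 10.253, "Julia": 22.766}

# vs Python cold = Python baseline divided by the port's mean, rounded to 1 decimal.
speedup = {lang: round(python_cold_ms / ms, 1) for lang, ms in cold_ms.items()}
# C++ -> 4.6 (faster than Python); Julia -> 0.6 (slower than Python)
```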

5.1. Query-Family Highlights

The operation-level view is more informative than a full language matrix because the fastest port changes by query shape. These rows use the 1000-node cold run.

1000-Node Cold Query-Family Highlights

| Query family | Fastest port | Fastest ms | Python ms | Python / fastest |
|--------------|--------------|-----------:|----------:|-----------------:|
| Marginal                   | C++  | 0.213 | 0.236 | 1.1x |
| Joint                      | C++  | 0.263 | 0.497 | 1.9x |
| Conditional                | C++  | 0.404 | 1.099 | 2.7x |
| Evidence likelihood        | C++  | 0.366 | 17.892 | 48.9x |
| Interventional             | C++  | 2.194 | 18.403 | 8.4x |
| Counterfactual             | C++  | 6.987 | 24.507 | 3.5x |
| Counterfactual probability | Java | 7.787 | 27.308 | 3.5x |
| Counterfactual joint       | Java | 5.647 | 22.547 | 4.0x |
| Counterfactual conditional | Java | 5.785 | 22.320 | 3.9x |
| Counterfactual evidence    | Java | 5.128 | 15.225 | 3.0x |

The repeated-measures analysis on log(mean_ms) shows strong language, graph-size, and warm/cold effects, with strong interaction terms. That means condition-specific tables are the right summary: there is no single global fastest port across all query shapes and temperatures.
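One way to see an interaction on the log scale directly from the table above is a difference-in-differences check. This is an illustrative two-language slice, not the full repeated-measures model; the numbers are the 1000-node warm and cold means from the first table:

```python
import math

# 1000-node means (ms) from the port table above.
ms = {
    ("C++", "warm"): 0.302, ("C++", "cold"): 3.138,
    ("Java", "warm"): 0.136, ("Java", "cold"): 3.705,
}

# Warm -> cold penalty on the log scale, per language.
penalty = {lang: math.log(ms[(lang, "cold")]) - math.log(ms[(lang, "warm")])
           for lang in ("C++", "Java")}

# Java is faster than C++ warm but slower cold: its cold penalty is larger,
# so the language and temperature effects are not additive (an interaction).
```

Because the penalties differ across languages, no single "language effect" summarizes both temperatures, which is why the section reports condition-specific tables.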

5.2. R Exact-Inference Parity

The R parity checks use bnlearn and gRain as independent exact validators for associational, interventional, and counterfactual queries. Counterfactual checks use an explicit twin-network validator.

R Parity Summary

| Suite          | Networks | Worst absolute difference | Result |
|----------------|---------:|--------------------------:|--------|
| Associational  | 6 | 6.106e-16 | Exact |
| Interventional | 5 | 1.110e-16 | Exact |
| Counterfactual | 3 | 3.331e-16 | Exact |

5.3. External Toolkit Comparisons

The external toolkit comparisons use generated binary Bayesian networks at 20, 40, 60, and 80 nodes, with both singly-connected and multi-connected graph families. Speedup means comparison-tool time divided by pybbn time, so values above 1.0x favor pybbn.
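Since the direction of this ratio is easy to misread, a one-line restatement of the convention used throughout this subsection:

```python
def speedup(tool_ms, pybbn_ms):
    # Comparison-tool time divided by pybbn time: values above 1.0
    # mean the other tool took longer, i.e. the row favors pybbn.
    return tool_ms / pybbn_ms

speedup(20.0, 10.0)  # 2.0 -> favors pybbn
speedup(5.0, 10.0)   # 0.5 -> favors the comparison tool
```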

R bnlearn + gRain

Associational, interventional, and counterfactual. Build time favored pybbn on the generated sweeps. Cold do(...) rows and cold exact counterfactual rows also favored pybbn; cold counterfactual speedups were about 1.85x-10.98x and warm repeated counterfactual speedups were about 250x-13,757x. Sparse first-hit prior queries can favor bnlearn. Counterfactual comparisons factor out the shared twin-network construction because bnlearn has no native counterfactual API.

pyAgrum

Associational and interventional. pybbn won associational build time on every completed point, usually won evidence queries, won every completed warm evidence point, and won every completed interventional point. pyAgrum won most first uncached associational prior queries. Multi-connected interventional runs at 40, 60, and 80 nodes reached about 58 GiB RSS and were killed by the OS. Native pyAgrum counterfactual output did not match the exact twin-network semantics used here.

pgmpy VariableElimination

Associational and graph-surgery interventional. Completed VE points matched numerically. pybbn won every cold prior row, every multi-connected cold evidence row, and every graph-surgery interventional row. VE won three singly-connected cold evidence rows. On the 80-node multi-connected graph, pgmpy built the observational VE model faster, but exact query times were still slower.

pgmpy BeliefPropagation

Public junction-tree path. The completed 20- and 40-node observational rows were slower than pybbn by one to four orders of magnitude. The 60-node row timed out before completion, 80-node rows were skipped after that timeout, and the 1000-node fixture did not construct within the benchmark timeout.

5.4. Interpretation

The native ports all agree on the shared query corpus, and the large cold discrete-query comparison shows several non-Python ports clustered closely around the fastest runtime. Python remains highly competitive for warm cached associational calls, but cold large-graph calls favor the native ports.

The external-toolkit results have a consistent shape. pybbn is strongest on model build, repeated exact queries, interventional queries, and exact counterfactual workflows. The main associational caveat is first-hit latency: some sparse prior or evidence rows favor engines that do less preparation for a single query. Once workloads become repeated, interventional, counterfactual, or denser multi-connected cases, the cached exact path is the stronger fit.