5. Benchmarks
These are local wall-clock reference measurements for exact discrete Bayesian network queries. Absolute timings depend on the machine and runtime, so the relative comparisons are the useful part. Warm timings reuse an already prepared model. Cold timings include fresh model preparation before each query.
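The warm/cold protocol can be sketched with a minimal timing harness. `build_model` and `run_query` below are hypothetical stand-ins for a port's preparation and query steps, not the actual benchmark code:

```python
import time

def build_model(n_nodes):
    # Stand-in for model preparation (parsing, structure build, caching).
    return {"nodes": list(range(n_nodes))}

def run_query(model):
    # Stand-in for one exact query against a prepared model.
    return sum(model["nodes"])

def time_cold_ms(n_nodes):
    # Cold: rebuild the model before the query, so preparation is included.
    start = time.perf_counter()
    model = build_model(n_nodes)
    run_query(model)
    return (time.perf_counter() - start) * 1000.0

def time_warm_ms(n_nodes, reps=10):
    # Warm: prepare once, then time repeated queries on the cached model.
    model = build_model(n_nodes)
    start = time.perf_counter()
    for _ in range(reps):
        run_query(model)
    return (time.perf_counter() - start) * 1000.0 / reps
```

The gap between the two numbers is the per-query cost of preparation, which is why cold timings separate the ports far more than warm ones.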
The native-port table uses the shared deterministic query corpus at 30 and
1000 nodes. Every listed port matched the Python reference output within
tolerance across the full query set.
| Language | 30 warm ms | 30 cold ms | 1000 warm ms | 1000 cold ms | vs Python cold |
|---|---|---|---|---|---|
| C++ | 0.020 | 0.305 | 0.302 | 3.138 | 4.6x |
| Rust | 0.056 | 0.305 | 0.311 | 3.184 | 4.5x |
| Ruby | 0.138 | 0.323 | 0.314 | 3.196 | 4.5x |
| Lua | 0.125 | 0.346 | 0.342 | 3.242 | 4.4x |
| Go | 0.056 | 0.312 | 0.314 | 3.260 | 4.4x |
| Swift | 0.055 | 0.343 | 0.338 | 3.288 | 4.4x |
| R | 0.196 | 0.543 | 0.471 | 3.477 | 4.1x |
| Octave | 0.322 | 0.538 | 0.516 | 3.513 | 4.1x |
| Java | 0.039 | 1.379 | 0.136 | 3.705 | 3.9x |
| TypeScript | 0.052 | 1.064 | 0.285 | 5.153 | 2.8x |
| C# | 0.015 | 1.870 | 0.100 | 10.253 | 1.4x |
| Python | 0.034 | 3.392 | 0.042 | 14.351 | 1.0x |
| Julia | 0.055 | 19.244 | 0.348 | 22.766 | 0.6x |
The vs Python cold column uses the 1000-node cold Python mean as the
baseline; values above 1.0x are faster than Python on that condition. The
fastest large-graph cold group is tightly clustered: C++, Rust, Ruby, Lua, Go,
and Swift all fall between 3.138 and 3.288 ms per query.
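The speedup column can be reproduced directly from the table by dividing the Python 1000-node cold mean by each port's 1000-node cold mean:

```python
# 1000-node cold means (ms) taken from the table above.
python_cold_ms = 14.351

ports = {"C++": 3.138, "Rust": 3.184, "Ruby": 3.196, "Go": 3.260}

# Speedup vs Python cold, rounded to one decimal as in the table.
speedups = {name: round(python_cold_ms / ms, 1) for name, ms in ports.items()}
print(speedups)
```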
5.1. Query-Family Highlights
The operation-level view is more informative than a full language matrix
because the fastest port changes by query shape. These rows use the
1000-node cold run.
| Query family | Fastest port | Fastest ms | Python ms | Python / fastest |
|---|---|---|---|---|
| Marginal | C++ | 0.213 | 0.236 | 1.1x |
| Joint | C++ | 0.263 | 0.497 | 1.9x |
| Conditional | C++ | 0.404 | 1.099 | 2.7x |
| Evidence likelihood | C++ | 0.366 | 17.892 | 48.9x |
| Interventional | C++ | 2.194 | 18.403 | 8.4x |
| Counterfactual | C++ | 6.987 | 24.507 | 3.5x |
| Counterfactual probability | Java | 7.787 | 27.308 | 3.5x |
| Counterfactual joint | Java | 5.647 | 22.547 | 4.0x |
| Counterfactual conditional | Java | 5.785 | 22.320 | 3.9x |
| Counterfactual evidence | Java | 5.128 | 15.225 | 3.0x |
The repeated-measures analysis on log(mean_ms) shows strong language,
graph-size, and warm/cold effects, with strong interaction terms. That means
condition-specific tables are the right summary: there is no single global
fastest port across all query shapes and temperatures.
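A minimal sketch of why the interaction terms matter, using four cells from the first table (C++ vs Python at 1000 nodes, warm vs cold): on the log scale, a difference-in-differences far from zero is exactly an interaction, meaning the language effect depends on temperature. This is an illustration of the idea, not the actual repeated-measures model.

```python
import math

# 1000-node means (ms) from the native-port table.
cells = {
    ("C++", "warm"): 0.302, ("C++", "cold"): 3.138,
    ("Python", "warm"): 0.042, ("Python", "cold"): 14.351,
}
log = {k: math.log(v) for k, v in cells.items()}

# Language effect (log ratio C++ / Python) at each temperature.
warm_effect = log[("C++", "warm")] - log[("Python", "warm")]   # Python faster warm
cold_effect = log[("C++", "cold")] - log[("Python", "cold")]   # C++ faster cold

# The sign of the effect flips between conditions, so the difference-in-
# differences is large: no single ranking holds across temperatures.
interaction = warm_effect - cold_effect
print(f"warm {warm_effect:+.2f}, cold {cold_effect:+.2f}, "
      f"interaction {interaction:+.2f}")
```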
5.2. R Exact-Inference Parity
The R parity checks use bnlearn and gRain as independent exact
validators for associational, interventional, and counterfactual queries.
Counterfactual checks use an explicit twin-network validator.
| Suite | Networks | Worst absolute difference | Result |
|---|---|---|---|
| Associational | 6 | 6.106e-16 | Exact |
| Interventional | 5 | 1.110e-16 | Exact |
| Counterfactual | 3 | 3.331e-16 | Exact |
5.3. External Toolkit Comparisons
The external toolkit comparisons use generated binary Bayesian networks at
20, 40, 60, and 80 nodes, with both singly-connected and
multi-connected graph families. Speedup means comparison-tool time divided by
pybbn time, so values above 1.0x favor pybbn.
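To keep the direction unambiguous (it is the inverse of the vs Python cold convention above), the speedup here is a one-liner:

```python
def speedup_vs_pybbn(tool_ms, pybbn_ms):
    # Comparison-tool time divided by pybbn time; values > 1.0 favor pybbn.
    return tool_ms / pybbn_ms

# Example: a tool taking 10 ms where pybbn takes 4 ms is a 2.5x win for pybbn.
print(speedup_vs_pybbn(10.0, 4.0))
```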
- `bnlearn` + `gRain` (R): Associational, interventional, and counterfactual.
  Build time favored `pybbn` on the generated sweeps. Cold `do(...)` rows and
  cold exact counterfactual rows also favored `pybbn`; cold counterfactual
  speedups were about 1.85x-10.98x, and warm repeated counterfactual speedups
  were about 250x-13,757x. Sparse first-hit prior queries can favor `bnlearn`.
  Counterfactual comparisons factor out the shared twin-network construction
  because `bnlearn` has no native counterfactual API.
- `pyAgrum`: Associational and interventional. `pybbn` won associational build
  time on every completed point, usually won evidence queries, won every
  completed warm evidence point, and won every completed interventional point.
  `pyAgrum` won most first uncached associational prior queries. Multi-connected
  interventional runs at 40, 60, and 80 nodes reached about 58 GiB RSS and were
  killed by the OS. Native `pyAgrum` counterfactual output did not match the
  exact twin-network semantics used here.
- `pgmpy` `VariableElimination`: Associational and graph-surgery interventional.
  Completed VE points matched numerically. `pybbn` won every cold prior row,
  every multi-connected cold evidence row, and every graph-surgery
  interventional row. VE won three singly-connected cold evidence rows. On the
  80-node multi-connected graph, `pgmpy` built the observational VE model
  faster, but exact query times were still slower.
- `pgmpy` `BeliefPropagation`: Public junction-tree path. The completed 20- and
  40-node observational rows were slower than `pybbn` by one to four orders of
  magnitude. The 60-node row timed out before completion, 80-node rows were
  skipped after that timeout, and the 1000-node fixture did not construct within
  the benchmark timeout.
5.4. Interpretation
The native ports all agree on the shared query corpus, and the large cold discrete-query comparison shows several non-Python ports clustered closely around the fastest runtime. Python remains highly competitive for warm cached associational calls, but cold large-graph calls favor the native ports.
The external-toolkit results have a consistent shape. pybbn is strongest
on model build, repeated exact queries, interventional queries, and exact
counterfactual workflows. The main associational caveat is first-hit latency:
some sparse prior or evidence rows favor engines that do less preparation for a
single query. Once workloads become repeated, interventional, counterfactual,
or denser multi-connected cases, the cached exact path is the stronger fit.