4. Benchmarks

This page records local benchmark runs measured on March 27, 2026 for the current checkout. The goal was to compare exact inference in pybbn against other Bayesian-network toolkits on the same generated graphs.

These are machine-specific wall-clock timings. The absolute times should be treated as local reference numbers, while the speedup factors are the more portable part of the result set.

All speedup factors on this page are defined as:

\[\text{speedup} = \frac{\text{comparison engine time}}{\text{pybbn time}}\]

So any factor larger than 1.0x means pybbn was faster.
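As a worked instance of this formula, the following sketch (helper name illustrative, not benchmark code) computes the speedup for one pair of timings:

```python
# speedup = comparison engine time / pybbn time, as defined above.
def speedup(comparison_time_s: float, pybbn_time_s: float) -> float:
    return comparison_time_s / pybbn_time_s

# E.g. a 0.024 s bnlearn build vs a 0.002276 s pybbn build gives about
# a 10.5x factor (tabulated factors use the unrounded underlying timings).
print(f"{speedup(0.024000, 0.002276):.2f}x")
```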

4.1. Methodology

Graph families came from pybbn.generator.generate_singly_bbn() and pybbn.generator.generate_multi_bbn(). Every generated graph used binary domains with max_values=2, Dirichlet row sampling with max_alpha=10, and random seed 37. The measured graph sizes were 20, 40, 60, and 80 nodes with max_iter=40, 80, 120, and 160 respectively. The bnlearn comparisons were driven by _profile/bench_associational_crosslang.py, _profile/bench_interventional_crosslang.py, and _profile/bench_counterfactual_crosslang.py. The pyAgrum comparisons were driven by _profile/bench_associational_pyagrum.py and _profile/bench_interventional_pyagrum.py, with shared conversion helpers in _profile/pyagrum_benchmark_utils.py.

Targets, evidence nodes, and intervention nodes were chosen deterministically by spreading selections across sorted node ids so that the same graph always produced the same workload. Associational evidence nodes were clamped to their first state, s0. Interventional nodes were also clamped to s0 and were chosen from non-root nodes when possible so the workload exercised real graph surgery. Counterfactual workloads picked one intervention node with descendants, used its alternate state as the hypothetical intervention, and chose a descendant target with factual evidence {X=x', Y=y} so the same graph always produced the same exact counterfactual query shape.
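The deterministic "spread selections across sorted node ids" rule can be sketched as below. The helper name and exact spacing rule are illustrative assumptions, not the actual benchmark code; the point is only that sorting plus index arithmetic yields the same picks for the same graph every run:

```python
# Pick k node ids spread evenly across the sorted id list.
# Deterministic: no RNG involved, so the same graph always
# produces the same workload.
def spread_selection(node_ids, k):
    ordered = sorted(node_ids)
    if k >= len(ordered):
        return ordered
    step = len(ordered) / k
    return [ordered[int(i * step)] for i in range(k)]

print(spread_selection(range(20), 4))  # evenly spaced ids from a 20-node graph
```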

Build timings are the median of 3 full observational model builds, and all query timings use 5 repetitions. On the pybbn side, the associational path measured create_reasoning_model(...) plus model.pquery(...). The interventional path measured model.intervene({...}) plus treated.pquery(...), and also split out compile-only and query-only costs. The counterfactual path measured full model.cquery(...) time and also split out shared twin-DP construction, twin-model compilation, and the final exact query on an already compiled twin model.

On the bnlearn side, associational and interventional inference used bnlearn::custom.fit(...), gRain::compile(as.grain(...)), gRain::setEvidence(...), bnlearn::mutilated(...), and querygrain(...) as appropriate. bnlearn has no native counterfactual API, so the counterfactual benchmark explicitly factors out the shared exact twin-network construction stage and then compares pybbn against bnlearn + gRain on that same twin model.

The pyAgrum runs on this page used pyagrum==2.3.2 in an isolated Python 3.12 virtual environment because the project checkout itself did not have pyagrum installed. The pyAgrum associational path used gum.LazyPropagation, the interventional path used pyagrum.causal.causalImpact(...), and the native pyAgrum counterfactual API was excluded from the apples-to-apples tables because its results on generated CPT-based BNs did not match the exact probability-space twin-network semantics already validated for pybbn against bnlearn + gRain.

Note

Warm repeated timings are intentionally included because they are relevant to real workloads, but they are not the same thing as one-shot latency. Large warm-query gains mainly come from pybbn caches: repeated associational priors reuse cached unconditional marginals, repeated associational evidence queries reuse cached calibrated cluster potentials, repeated interventional queries reuse compiled intervened models, and repeated counterfactual queries reuse cached counterfactual context and twin-model preparation on the pybbn side.
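The warm-versus-cold distinction above can be illustrated with a generic memoization sketch. This stands in for pybbn's internal caches (cached marginals, calibrated cluster potentials, compiled intervened models); it is not pybbn code:

```python
# Illustration only: the first identical query pays the full exact-inference
# cost; warm repeats are served from a cache.
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def cached_query(evidence_key: frozenset) -> float:
    calls["n"] += 1   # a real engine would run exact inference here
    return 0.5        # placeholder marginal

cold = cached_query(frozenset({("a", "s0")}))  # computes
warm = cached_query(frozenset({("a", "s0")}))  # cache hit
print(calls["n"])  # the expensive path ran only once
```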

4.2. bnlearn Results

The bnlearn comparison on this page uses bnlearn together with gRain for exact inference on the same generated graph families.

4.2.1. Associational Comparisons

The associational sweep compared exact marginal queries with and without evidence on both singly-connected and multi-connected generated graphs.

Singly-connected graphs had 19, 39, 59, and 79 edges with maximum in-degree 3, 3, 4, and 4 respectively. Multi-connected graphs had 31, 73, 116, and 158 edges with maximum in-degree 3, 3, 4, and 6.

The main pattern was that pybbn won build time everywhere, from about 4.9x up to 51.7x. On sparse graphs, bnlearn could still win the first uncached prior query, but on the denser multi-connected graphs pybbn pulled ahead even on cold one-shot exact queries. Repeated identical workloads strongly favored pybbn because the cached exact path stayed array-backed.

4.2.1.1. Singly-Connected Associational Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.002276 | 0.024000 | 10.55x | 0.001520 | 0.001000 | 0.66x | 0.000849 | 0.002000 | 2.36x |
| 40 | 0.004299 | 0.033000 | 7.68x | 0.003160 | 0.002000 | 0.63x | 0.002070 | 0.002000 | 0.97x |
| 60 | 0.011588 | 0.102000 | 8.80x | 0.008381 | 0.005000 | 0.60x | 0.005027 | 0.006000 | 1.19x |
| 80 | 0.008725 | 0.059000 | 6.76x | 0.006389 | 0.003000 | 0.47x | 0.003749 | 0.004000 | 1.07x |

4.2.1.2. Singly-Connected Associational Warm Timings

| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.000004 | 0.001000 | 251.83x | 0.000088 | 0.001500 | 16.98x |
| 40 | 0.000005 | 0.001500 | 330.25x | 0.000107 | 0.002000 | 18.68x |
| 60 | 0.000014 | 0.003000 | 216.03x | 0.000290 | 0.005000 | 17.24x |
| 80 | 0.000007 | 0.002000 | 287.48x | 0.000173 | 0.003000 | 17.34x |

4.2.1.3. Multi-Connected Associational Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | Prior cold pybbn | Prior cold bnlearn | Prior cold x | Evidence cold pybbn | Evidence cold bnlearn | Evidence cold x |
|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.003115 | 0.028000 | 8.99x | 0.002131 | 0.001000 | 0.47x | 0.001456 | 0.003000 | 2.06x |
| 40 | 0.008160 | 0.040000 | 4.90x | 0.005409 | 0.004000 | 0.74x | 0.003889 | 0.005000 | 1.29x |
| 60 | 0.031002 | 0.150000 | 4.84x | 0.038639 | 0.105000 | 2.72x | 0.029857 | 0.123000 | 4.12x |
| 80 | 0.085349 | 4.416000 | 51.74x | 4.087615 | 8.714000 | 2.13x | 3.403747 | 11.024000 | 3.24x |

4.2.1.4. Multi-Connected Associational Warm Timings

| Nodes | Prior warm pybbn | Prior warm bnlearn | Prior warm x | Evidence warm pybbn | Evidence warm bnlearn | Evidence warm x |
|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| 20 | 0.000004 | 0.001500 | 402.09x | 0.000095 | 0.002000 | 21.11x |
| 40 | 0.000005 | 0.004000 | 855.89x | 0.000129 | 0.004000 | 31.08x |
| 60 | 0.000014 | 0.097500 | 6727.17x | 0.000325 | 0.121000 | 372.67x |
| 80 | 0.000007 | 8.754000 | 1271270.53x | 0.027301 | 11.197500 | 410.15x |

4.2.2. Interventional Comparisons

The interventional sweep compared exact do(...) marginals between pybbn and bnlearn + gRain.

For interventional workloads, the benchmark reports three query views:

  • do total: end-to-end do(...) cost, including intervened-model compilation plus query execution

  • do compile: compilation of the intervened model only

  • do query: query execution on an already compiled intervened model

The main pattern was that pybbn won build time, cold end-to-end do(...) time, and cold intervened-model compilation across the whole sweep. On sparse graphs, bnlearn still had an advantage on the first query against an already compiled intervened model, but by the denser multi-connected graphs pybbn also won the compiled-model query itself.

4.2.2.1. Singly-Connected Interventional Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.002206 | 0.020000 | 9.07x | 0.003673 | 0.016000 | 4.36x | 0.002037 | 0.014000 | 6.87x | 0.001415 | 0.001000 | 0.71x |
| 40 | 0.004271 | 0.031000 | 7.26x | 0.007470 | 0.023000 | 3.08x | 0.004460 | 0.020000 | 4.48x | 0.003231 | 0.001000 | 0.31x |
| 60 | 0.006375 | 0.041000 | 6.43x | 0.010266 | 0.028000 | 2.73x | 0.005685 | 0.026000 | 4.57x | 0.004438 | 0.001000 | 0.23x |
| 80 | 0.008494 | 0.054000 | 6.36x | 0.025261 | 0.036000 | 1.43x | 0.007907 | 0.030000 | 3.79x | 0.006293 | 0.002000 | 0.32x |

4.2.2.2. Singly-Connected Interventional Warm Timings

| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|-------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.000010 | 0.015500 | 1495.92x | 0.000005 | 0.014500 | 2669.61x | 0.000003 | 0.001000 | 295.90x |
| 40 | 0.000012 | 0.021000 | 1699.23x | 0.000005 | 0.020000 | 3911.61x | 0.000004 | 0.001000 | 228.31x |
| 60 | 0.000016 | 0.028000 | 1698.77x | 0.000006 | 0.026500 | 4137.08x | 0.000007 | 0.002000 | 301.98x |
| 80 | 0.000017 | 0.034500 | 1987.90x | 0.000007 | 0.032000 | 4431.82x | 0.000007 | 0.002000 | 301.18x |

4.2.2.3. Multi-Connected Interventional Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | do total cold pybbn | do total cold bnlearn | do total cold x | do compile cold pybbn | do compile cold bnlearn | do compile cold x | do query cold pybbn | do query cold bnlearn | do query cold x |
|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.002811 | 0.026000 | 9.25x | 0.004643 | 0.018000 | 3.88x | 0.002697 | 0.014000 | 5.19x | 0.002047 | 0.001000 | 0.49x |
| 40 | 0.008155 | 0.040000 | 4.90x | 0.013016 | 0.033000 | 2.54x | 0.008327 | 0.026000 | 3.12x | 0.005985 | 0.004000 | 0.67x |
| 60 | 0.017649 | 0.071000 | 4.02x | 0.050510 | 0.109000 | 2.16x | 0.018501 | 0.041000 | 2.22x | 0.020000 | 0.058000 | 2.90x |
| 80 | 0.094908 | 4.398000 | 46.34x | 2.592297 | 5.843000 | 2.25x | 0.071562 | 0.710000 | 9.92x | 2.327604 | 5.102000 | 2.19x |

4.2.2.4. Multi-Connected Interventional Warm Timings

| Nodes | do total warm pybbn | do total warm bnlearn | do total warm x | do compile warm pybbn | do compile warm bnlearn | do compile warm x | do query warm pybbn | do query warm bnlearn | do query warm x |
|-------|---------------------|-----------------------|-----------------|-----------------------|-------------------------|-------------------|---------------------|-----------------------|-----------------|
| 20 | 0.000010 | 0.019000 | 1897.91x | 0.000006 | 0.017000 | 2804.35x | 0.000003 | 0.001000 | 301.12x |
| 40 | 0.000012 | 0.028500 | 2316.70x | 0.000006 | 0.025500 | 4581.78x | 0.000004 | 0.003500 | 842.77x |
| 60 | 0.000016 | 0.107500 | 6532.36x | 0.000006 | 0.039000 | 6221.09x | 0.000007 | 0.059000 | 8957.73x |
| 80 | 0.000017 | 5.846000 | 335399.43x | 0.000008 | 0.713500 | 93018.84x | 0.000006 | 5.080500 | 794136.26x |

4.2.3. Counterfactual Comparisons

The counterfactual sweep compared exact counterfactual marginals on generated graphs.

For counterfactual workloads, the benchmark reports four timing views:

  • cf total: full end-to-end exact counterfactual cost

  • shared twin dp: construction of the exact twin network that both engines need for a fair comparison

  • twin compile: compilation of the exact twin network only

  • twin query: the final exact query on an already compiled twin model with factual evidence

Because bnlearn does not expose a native counterfactual API, the fair end-to-end number here is bnlearn total + shared twin dp rather than the raw bnlearn twin compile/query number alone.
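The fair-total accounting just described is simple arithmetic; a sketch (helper name illustrative) using the 20-node singly-connected row from the tables below:

```python
# bnlearn has no native counterfactual API, so its fair end-to-end number
# is its twin-model cost plus the shared twin-DP construction that both
# engines need.
def fair_bnlearn_total(bnlearn_twin_time_s: float, shared_twin_dp_s: float) -> float:
    return bnlearn_twin_time_s + shared_twin_dp_s

# 20-node singly-connected: 0.020000 s bnlearn + 0.000794 s shared twin dp.
total = fair_bnlearn_total(0.020000, 0.000794)
print(f"{total:.6f}")  # 0.020794, the "cf total cold bnlearn+shared" cell
```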

The main pattern was that pybbn won build time everywhere, from about 2.7x to 8.9x, and won cold end-to-end exact counterfactual queries everywhere, from about 1.85x to 10.98x. Sparse singly-connected counterfactuals stayed cheap for both engines, so the absolute cold gaps were small there, while multi-connected exact counterfactuals widened the gap sharply, especially on the 40-node graph. Repeated identical counterfactual queries strongly favored pybbn, from about 250x up to about 13,757x faster.

All returned marginals agreed closely across engines. The observed maximum absolute difference over the completed sweep was between roughly 2.1e-12 and 4.5e-12.

Note

Counterfactual difficulty is driven by the size of the selected twin submodel, not just the original graph node count. That is why some 80-node points are faster than some 60-node points.

4.2.3.1. Singly-Connected Counterfactual Cold Timings

All times are in seconds.

| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|-------|-------------|---------------|---------|----------------|---------------------|------------------------------|-----------------|
| 20 | 0.002348 | 0.021000 | 8.94x | 0.000794 | 0.006013 | 0.020794 | 3.46x |
| 40 | 0.004005 | 0.032000 | 7.99x | 0.000643 | 0.004948 | 0.018643 | 3.77x |
| 60 | 0.007316 | 0.048000 | 6.56x | 0.001529 | 0.011698 | 0.034529 | 2.95x |
| 80 | 0.009485 | 0.063000 | 6.64x | 0.000936 | 0.007356 | 0.026936 | 3.66x |

4.2.3.2. Singly-Connected Counterfactual Split Cold Timings

| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|-------|--------------------|----------------------|----------------|------------------|--------------------|--------------|
| 20 | 0.002876 | 0.017000 | 5.91x | 0.000993 | 0.001000 | 1.01x |
| 40 | 0.002188 | 0.016000 | 7.31x | 0.000890 | 0.001000 | 1.12x |
| 60 | 0.005082 | 0.029000 | 5.71x | 0.002260 | 0.002000 | 0.88x |
| 80 | 0.003174 | 0.021000 | 6.62x | 0.001303 | 0.001000 | 0.77x |

4.2.3.3. Singly-Connected Counterfactual Warm Timings

| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|-------|---------------------|------------------------------|-----------------|
| 20 | 0.000069 | 0.019782 | 287.59x |
| 40 | 0.000070 | 0.017625 | 250.07x |
| 60 | 0.000117 | 0.032386 | 276.11x |
| 80 | 0.000071 | 0.023977 | 336.89x |

4.2.3.4. Multi-Connected Counterfactual Cold Timings

| Nodes | Build pybbn | Build bnlearn | Build x | Shared twin dp | cf total cold pybbn | cf total cold bnlearn+shared | cf total cold x |
|-------|-------------|---------------|---------|----------------|---------------------|------------------------------|-----------------|
| 20 | 0.004470 | 0.025000 | 5.59x | 0.006776 | 0.067426 | 0.158776 | 2.35x |
| 40 | 0.008077 | 0.039000 | 4.83x | 0.005788 | 0.130693 | 1.434788 | 10.98x |
| 60 | 0.012910 | 0.061000 | 4.73x | 0.008233 | 0.512215 | 0.948233 | 1.85x |
| 80 | 0.027299 | 0.074000 | 2.71x | 0.009404 | 0.323123 | 0.747404 | 2.31x |

4.2.3.5. Multi-Connected Counterfactual Split Cold Timings

| Nodes | Twin compile pybbn | Twin compile bnlearn | Twin compile x | Twin query pybbn | Twin query bnlearn | Twin query x |
|-------|--------------------|----------------------|----------------|------------------|--------------------|--------------|
| 20 | 0.040335 | 0.123000 | 3.05x | 0.005370 | 0.006000 | 1.12x |
| 40 | 0.039044 | 0.236000 | 6.04x | 0.071891 | 0.802000 | 11.16x |
| 60 | 0.058453 | 0.149000 | 2.55x | 0.385814 | 0.391000 | 1.01x |
| 80 | 0.066303 | 0.175000 | 2.64x | 0.225300 | 0.313000 | 1.39x |

4.2.3.6. Multi-Connected Counterfactual Warm Timings

| Nodes | cf total warm pybbn | cf total warm bnlearn+shared | cf total warm x |
|-------|---------------------|------------------------------|-----------------|
| 20 | 0.000075 | 0.144508 | 1938.50x |
| 40 | 0.000077 | 1.057110 | 13757.29x |
| 60 | 0.000089 | 0.571844 | 6445.30x |
| 80 | 0.000080 | 0.516321 | 6424.30x |

4.3. pyAgrum Results

The additional pyAgrum comparison covers exact associational and interventional queries on the same generated graph families. On every completed associational and interventional point, pyAgrum's results matched pybbn's numerically.

For the completed pyAgrum points, the maximum absolute differences were:

  • associational: about 1e-16 on both prior and evidence queries

  • interventional: about 1e-16 to 1e-13 on completed points
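The agreement check behind these bounds can be sketched as a maximum-absolute-difference comparison of per-state marginals. The function name and sample values are illustrative, not the benchmark code:

```python
# Compare two engines' marginals for the same query over the same states
# and report the worst-case absolute disagreement.
def max_abs_diff(marginals_a: dict, marginals_b: dict) -> float:
    assert marginals_a.keys() == marginals_b.keys()
    return max(abs(marginals_a[s] - marginals_b[s]) for s in marginals_a)

a = {"s0": 0.6321, "s1": 0.3679}
b = {"s0": 0.6321 + 1e-16, "s1": 0.3679 - 1e-16}
print(max_abs_diff(a, b) <= 1e-13)  # within the bound reported above
```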

Two important boundary conditions apply:

  • the multi-connected interventional runs at 40, 60, and 80 nodes were skipped because this environment repeatedly hard-killed those jobs, including an isolated retry of the 40-node multi-connected case; kernel logs for those retries showed the Linux OOM killer terminating the benchmark Python process after it grew to about 57.997 GiB RSS for multi-40, 58.022 GiB for multi-60, 58.030 GiB for multi-80, and 58.048 GiB on the isolated multi-40 rerun

  • the pyAgrum counterfactual API was excluded from direct speed comparison because it did not return the same distributions as pybbn on generated CPT-based BNs

The main pyAgrum pattern was:

  • pybbn won associational build time on every completed point

  • pyAgrum still won most first uncached associational prior queries, but pybbn won the 20-node multi-connected prior point and was nearly tied on the 40-node multi-connected prior point

  • pybbn usually won associational evidence queries and won every completed warm associational evidence point, but pyAgrum still won the cold 80-node evidence rows

  • pybbn won every completed interventional point in this rerun

4.3.1. Associational Comparisons

The pyAgrum associational comparison used the same generated graphs and measured exact marginal queries with and without evidence.

4.3.1.1. Associational vs pyAgrum Cold Timings

All times are in seconds.

| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | Prior cold pybbn | Prior cold pyAgrum | Prior cold x | Evidence cold pybbn | Evidence cold pyAgrum | Evidence cold x |
|--------|-------|-------------|---------------|---------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.002096 | 0.002614 | 1.25x | 0.001205 | 0.000881 | 0.73x | 0.001050 | 0.002659 | 2.53x |
| Singly | 40 | 0.004042 | 0.004733 | 1.17x | 0.001842 | 0.000688 | 0.37x | 0.001974 | 0.004605 | 2.33x |
| Singly | 60 | 0.006198 | 0.007714 | 1.24x | 0.002816 | 0.000775 | 0.28x | 0.002107 | 0.007184 | 3.37x |
| Singly | 80 | 0.008090 | 0.010317 | 1.28x | 0.003633 | 0.000706 | 0.19x | 0.010210 | 0.009764 | 0.96x |
| Multi | 20 | 0.002757 | 0.003692 | 1.34x | 0.001471 | 0.001611 | 1.09x | 0.001208 | 0.004324 | 3.58x |
| Multi | 40 | 0.007764 | 0.012481 | 1.61x | 0.003764 | 0.003654 | 0.97x | 0.002947 | 0.016492 | 5.60x |
| Multi | 60 | 0.016036 | 0.032312 | 2.01x | 0.014893 | 0.006947 | 0.47x | 0.011325 | 0.043087 | 3.80x |
| Multi | 80 | 0.082692 | 0.993471 | 12.01x | 3.760185 | 0.121709 | 0.03x | 3.312482 | 1.209022 | 0.36x |

4.3.1.2. Associational vs pyAgrum Warm Timings

| Graph | Nodes | Prior warm pybbn | Prior warm pyAgrum | Prior warm x | Evidence warm pybbn | Evidence warm pyAgrum | Evidence warm x |
|--------|-------|------------------|--------------------|--------------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.000011 | 0.000174 | 16.50x | 0.000199 | 0.002635 | 13.25x |
| Singly | 40 | 0.000011 | 0.000159 | 15.01x | 0.000192 | 0.004605 | 24.00x |
| Singly | 60 | 0.000012 | 0.000168 | 14.34x | 0.000201 | 0.007184 | 35.82x |
| Singly | 80 | 0.000011 | 0.000171 | 15.34x | 0.000199 | 0.009802 | 49.37x |
| Multi | 20 | 0.000011 | 0.000163 | 15.30x | 0.000197 | 0.004296 | 21.79x |
| Multi | 40 | 0.000011 | 0.000160 | 13.97x | 0.000211 | 0.016616 | 78.61x |
| Multi | 60 | 0.000011 | 0.000170 | 15.24x | 0.000238 | 0.043247 | 181.85x |
| Multi | 80 | 0.000012 | 0.000249 | 21.19x | 0.028507 | 1.243487 | 43.62x |

4.3.2. Interventional Comparisons

The pyAgrum interventional comparison measured exact do(...) marginals on the same generated causal graphs.

4.3.2.1. Interventional vs pyAgrum Cold Timings

All times are in seconds.

| Graph | Nodes | Build pybbn | Build pyAgrum | Build x | do total cold pybbn | do total cold pyAgrum | do total cold x |
|--------|-------|-------------|---------------|---------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.002126 | 0.002627 | 1.24x | 0.003318 | 0.010088 | 3.04x |
| Singly | 40 | 0.003905 | 0.004858 | 1.24x | 0.005465 | 0.009704 | 1.78x |
| Singly | 60 | 0.006378 | 0.007942 | 1.25x | 0.008404 | 0.037552 | 4.47x |
| Singly | 80 | 0.008242 | 0.009912 | 1.20x | 0.010970 | 0.012289 | 1.12x |
| Multi | 20 | 0.002770 | 0.003680 | 1.33x | 0.004126 | 0.058473 | 14.17x |
| Multi | 40 | not run | not run | not run | not run | not run | not run |
| Multi | 60 | not run | not run | not run | not run | not run | not run |
| Multi | 80 | not run | not run | not run | not run | not run | not run |

4.3.2.2. Interventional vs pyAgrum Warm Timings

| Graph | Nodes | do total warm pybbn | do total warm pyAgrum | do total warm x |
|--------|-------|---------------------|-----------------------|-----------------|
| Singly | 20 | 0.000019 | 0.010196 | 537.08x |
| Singly | 40 | 0.000019 | 0.009167 | 485.29x |
| Singly | 60 | 0.000020 | 0.036283 | 1805.04x |
| Singly | 80 | 0.000034 | 0.011851 | 350.40x |
| Multi | 20 | 0.000018 | 0.054234 | 3095.47x |
| Multi | 40 | not run | not run | not run |
| Multi | 60 | not run | not run | not run |
| Multi | 80 | not run | not run | not run |

Note

The missing 40-, 60-, and 80-node multi-connected interventional points are not hidden negative results. Those jobs were repeatedly hard-killed in this environment before completion, including an isolated retry of the 40-node multi-connected case, so this page excludes them instead of guessing or extrapolating.

Note

Those missing rows are also a practical scaling result. Kernel logs on this machine showed the benchmark Python process being terminated by the Linux OOM killer during the pyAgrum multi-connected interventional runs at 40, 60, and 80 nodes. The recorded anonymous resident set sizes at kill time were 60813988 kB (57.997 GiB) for multi-40, 60840596 kB (58.022 GiB) for multi-60, 60848964 kB (58.030 GiB) for multi-80, and 60868028 kB (58.048 GiB) on the isolated multi-40 rerun. For context, this machine reports 65784396 kB of RAM (62.737 GiB) and 2097148 kB of swap (2.000 GiB). These are still small exact-inference graph sizes compared with the larger-node pybbn-only profiling path supported elsewhere in the repo, including the optional bn-10k slot in _profile/bench_matrix.py when that fixture is available.
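The GiB figures in this note follow directly from the kernel's kB values: the OOM killer logs anonymous RSS in kB (KiB), and dividing by 1024² converts to GiB. A minimal conversion check:

```python
# Convert kernel-reported kB (KiB) values to GiB.
def kb_to_gib(kb: int) -> float:
    return kb / (1024 ** 2)

# RSS at kill time for the failed runs, then total RAM and swap.
for kb in (60813988, 60840596, 60848964, 60868028, 65784396, 2097148):
    print(f"{kb} kB = {kb_to_gib(kb):.3f} GiB")
```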

Note

pyAgrum exposes a native causal.counterfactual API, but on these generated CPT-based BNs it did not return the same distributions as the exact probability-space twin-network counterfactuals already validated for pybbn against bnlearn + gRain. For that reason, the pyAgrum counterfactual timings are intentionally omitted from the apples-to-apples benchmark tables on this page.

4.4. Interpretation

The strongest cross-toolkit pattern is that pybbn consistently won model build time against both bnlearn and, on completed points, pyAgrum. On the bnlearn side, pybbn also won cold end-to-end interventional and counterfactual workloads across the sweep. On the pyAgrum side, pybbn won every completed cold end-to-end interventional query in this rerun, ranging from about 1.1x faster up to about 14.2x faster, and it dominated warm repeated associational and interventional workloads.

The main exact associational caveat is still one-shot prior latency. Against bnlearn, sparse graphs could still favor the first uncached prior query, though pybbn pulled ahead once the multi-connected graphs became denser. Against pyAgrum, pybbn won the sparse 20-node multi-connected prior row outright and was nearly tied on the 40-node multi-connected prior row, but pyAgrum remained strongest on most singly-connected and denser multi-connected cold prior rows. pybbn won most cold associational evidence rows against both toolkits, although pyAgrum still won the cold 80-node evidence rows.

The missing pyAgrum multi-connected interventional 40, 60, and 80 rows are themselves a practical scaling result. Those runs were terminated by the Linux OOM killer, with the benchmark Python process reaching about 57.997 GiB, 58.022 GiB, 58.030 GiB, and 58.048 GiB RSS across the failed attempts on a machine with 62.737 GiB RAM and 2.000 GiB swap. Counterfactual inference remains a direct comparison only between pybbn and bnlearn + gRain, because the native pyAgrum counterfactual API did not align numerically with the exact probability-space counterfactual semantics used here. The large warm-query gaps elsewhere on the page are real for repeated workloads, but they mostly reflect the effectiveness of pybbn cache reuse rather than one-shot latency alone.