Benchmarking

The benchmark suite measures every cryptographic operation in the library and supplies the data behind the research paper's contributions. This guide explains how to run the benchmarks, what each harness covers, and how to interpret the output.

Authoritative results are stored in results/BENCHMARKS.md. That file is the canonical record; this guide explains how to reproduce and extend them.

Test environments

Two environments are used. Results differ primarily due to the liboqs build and the OS scheduler. ENV-2 (Docker/Linux) is the primary reference for paper claims.

Label	ENV-1 — Windows 11 Native	ENV-2 — Docker / WSL2 Linux (Primary)
OS	Windows 11 Home 10.0.26200	Linux 6.6.87.2-microsoft-standard-WSL2 (Hyper-V)
Python	3.12.7	3.12.13 (python:3.12-slim)
liboqs	0.15.0 MSYS2 DLL (generic build)	0.15.0 compiled from source (`-DOQS_DIST_BUILD=ON`)
Scheduler noise	15.6 ms NT timer resolution inflates CoV	WSL2 vCPU adds ~2–4% CoV above bare-metal Linux

Note

The from-source Docker build enables AVX2/AVX-512 CPUID detection at runtime (-DOQS_DIST_BUILD=ON), giving ML-KEM-768 keygen a 5.3× speedup over the Windows MSYS2 DLL. CoV analysis in the paper uses ENV-2 values exclusively.

Benchmark harnesses

Two harness scripts live in tests/bench/:

Script	What it measures
`bench_kem.py`	KEM operations: X25519 classical, mock-PQC hybrid, real ML-KEM-768 hybrid, hybrid decomposition (combiner overhead isolation), concurrent throughput curve
`bench_signatures.py`	Signature operations: Ed25519 baseline, ML-DSA-65 standalone, HybridSign (Ed25519+ML-DSA-65), X.509 hybrid certificate build and cosig verify

All harnesses share the same methodology:

1000 measurement iterations per operation
100 warmup iterations (discarded)
1% outlier trim from each tail (removes OS-scheduler spikes)
time.perf_counter for timing (the highest-resolution monotonic clock Python exposes; actual resolution is platform-dependent)
GC disabled during measurement
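
The methodology above can be sketched as a single timing loop. This is a minimal illustration under stated assumptions, not the harness code itself: the function name `measure`, its return keys, and the exact p95 and trimming index rules are hypothetical.

```python
import gc
import statistics
import time


def measure(op, iterations=1000, warmup=100, trim_pct=1.0):
    """Time `op` and return summary statistics in microseconds.

    Hypothetical helper; the real harness may differ in detail.
    """
    for _ in range(warmup):              # warmup runs are discarded
        op()

    gc_was_enabled = gc.isenabled()
    gc.disable()                         # keep GC pauses out of the samples
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            op()
            samples.append((time.perf_counter() - start) * 1e6)
    finally:
        if gc_was_enabled:
            gc.enable()

    samples.sort()
    k = int(len(samples) * trim_pct / 100)   # samples dropped from each tail
    trimmed = samples[k:len(samples) - k] if k else samples

    mean = statistics.fmean(trimmed)
    stdev = statistics.stdev(trimmed)
    # Simple index-based p95; an assumption, other percentile rules exist.
    p95 = trimmed[min(len(trimmed) - 1, int(0.95 * len(trimmed)))]
    return {
        "median_us": statistics.median(trimmed),
        "p95_us": p95,
        "mean_us": mean,
        "stdev_us": stdev,
        "cov_pct": stdev / mean * 100,
    }


if __name__ == "__main__":
    print(measure(lambda: sum(range(1000))))
```

Sorting before trimming means the 1% cut removes the fastest and slowest samples, not the first and last ones recorded.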

Running the benchmarks

Docker (recommended, ENV-2)

The Docker image compiles liboqs 0.15.0 from source with AVX2/AVX-512 support. This is the reproducible ENV-2 path used for all primary paper numbers.

# Build once — compiles liboqs 0.15.0 from source (~3 min on a modern laptop)
docker build -t quantum-safe-bench .

# KEM suite: classical + hybrid decomposition + concurrent load curve
docker run --rm -v ${PWD}/results:/app/results quantum-safe-bench \
  python -X utf8 tests/bench/bench_kem.py --with-pqc \
  --save results/bench_kem_$(date +%Y-%m-%d).json

# Signature suite: Ed25519 + ML-DSA-65 + HybridSign + X.509 certs
docker run --rm -v ${PWD}/results:/app/results quantum-safe-bench \
  python -X utf8 tests/bench/bench_signatures.py --with-pqc \
  --save results/bench_sigs_$(date +%Y-%m-%d).json

# Or run both in one go with Docker Compose
docker compose up

KEM benchmarks (native)

# Classical + mock PQC (no liboqs needed)
python -X utf8 tests/bench/bench_kem.py

# Full suite — real ML-KEM-768 + decomposition + extended concurrent load
python -X utf8 tests/bench/bench_kem.py --with-pqc

# Save JSON snapshot for statistical post-processing
python -X utf8 tests/bench/bench_kem.py --with-pqc \
  --save results/bench_kem_$(date +%Y-%m-%d).json

Signature benchmarks (native)

# Ed25519 baselines + HybridSign + X.509 certs
python -X utf8 tests/bench/bench_signatures.py

# Add standalone ML-DSA-65 (requires liboqs)
python -X utf8 tests/bench/bench_signatures.py --with-pqc

# Save JSON snapshot
python -X utf8 tests/bench/bench_signatures.py --with-pqc \
  --save results/bench_sigs_$(date +%Y-%m-%d).json

Note

-X utf8 forces UTF-8 output on Windows, which is required for the µ character in the timing output. On Linux/macOS this flag is harmless.

On Windows, oqs.dll must be discoverable. Set OQS_DLL_DIR to the directory containing oqs.dll (e.g. C:\Users\<you>\_oqs\bin), or place the DLL at ~\_oqs\bin\oqs.dll — the _oqs_path.py helper registers it automatically via os.add_dll_directory.
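
The lookup described above can be sketched as follows. This is not the contents of `_oqs_path.py`; the function name `register_oqs_dll_dir` and the search order (environment variable first, then the home-directory default) are assumptions made for illustration. Only `os.add_dll_directory` and the two locations come from the text.

```python
import os
import sys
from pathlib import Path


def register_oqs_dll_dir():
    """Register the directory holding oqs.dll; return it, or None.

    Hypothetical helper: tries OQS_DLL_DIR, then ~/_oqs/bin.
    """
    if sys.platform != "win32":
        return None                      # DLL directories are Windows-only

    candidates = []
    env_dir = os.environ.get("OQS_DLL_DIR")
    if env_dir:
        candidates.append(Path(env_dir))
    candidates.append(Path.home() / "_oqs" / "bin")

    for directory in candidates:
        if (directory / "oqs.dll").is_file():
            os.add_dll_directory(str(directory))
            return directory
    return None


if __name__ == "__main__":
    print(register_oqs_dll_dir())
```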

Reading the output

Each operation prints one line:

Ed25519 sign (32B)                            median=   41.4 µs  p95=   46.4 µs  CoV=6.9% *

Column	Meaning
`median`	50th-percentile latency — the headline number for the paper
`p95`	95th-percentile latency: 95% of calls complete within this time
`CoV`	Coefficient of variation (stdev / mean × 100) — side-channel proxy
`*`	CoV > 5% — high variance (see below)
`~`	CoV 3–5% — moderate variance
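
The flag rules in the table can be written as a small function. This is a sketch: the names `variance_flag` and `format_result` are hypothetical, the column widths only approximate the sample line above, and treating exactly 3% and exactly 5% as moderate is an assumption about the boundaries.

```python
def variance_flag(cov_pct):
    """Return the marker printed after the CoV column."""
    if cov_pct > 5.0:
        return "*"       # high variance
    if cov_pct >= 3.0:
        return "~"       # moderate variance (boundary handling assumed)
    return ""


def format_result(name, median_us, p95_us, cov_pct):
    """Format one result line in roughly the layout the harness prints."""
    line = (
        f"{name:<45} median={median_us:7.1f} µs"
        f"  p95={p95_us:7.1f} µs  CoV={cov_pct:.1f}%"
    )
    return f"{line} {variance_flag(cov_pct)}".rstrip()


if __name__ == "__main__":
    print(format_result("Ed25519 sign (32B)", 41.4, 46.4, 6.9))
```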

CoV as a side-channel proxy

The coefficient of variation (CoV) is the primary metric for the timing side-channel analysis (paper Contribution 4).

The baseline reference is AES-256-GCM, which is generally treated as constant-time when it runs on hardware AES instructions. Any operation with CoV ≤ AES-GCM’s CoV is considered timing-stable on that platform.

ENV-2 (Docker/WSL2) noise floor: ~2.1% (AES-256-GCM 1 KB encrypt). Operations whose CoV sits at roughly this level (~2%) are treated as timing-stable. The paper uses this per-environment floor rather than a fixed global threshold, because the WSL2 hypervisor adds residual jitter above bare-metal Linux (~0.5–1.5%).

ENV-2 CoV reference values (2026-03-28)
Operation	CoV	Assessment
AES-256-GCM 1 KB (baseline)	2.1%	Noise floor — constant-time reference
Ed25519 verify	2.2%	✓ Timing-stable
PublicKey.fingerprint()	2.0%	✓ Timing-stable
HKDF-SHA256	2.8%	✓ Within noise floor
ML-KEM-768 encapsulate	9.4%	WSL2 vCPU scheduler noise; no secret-dependent branching in FIPS 203
ML-DSA-65 sign	52.4%	✓ Expected — FIPS 204 hedged signing randomness
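
One way to read the table is to express each CoV as a multiple of the AES-256-GCM floor. The sketch below does only that arithmetic on the values listed above; the helper name `cov_vs_floor` is hypothetical and no pass/fail threshold is implied.

```python
NOISE_FLOOR_COV = 2.1  # AES-256-GCM 1 KB encrypt on ENV-2, from the table

# CoV values copied from the ENV-2 reference table.
ENV2_COV = {
    "Ed25519 verify": 2.2,
    "PublicKey.fingerprint()": 2.0,
    "HKDF-SHA256": 2.8,
    "ML-KEM-768 encapsulate": 9.4,
    "ML-DSA-65 sign": 52.4,
}


def cov_vs_floor(cov_pct, floor_pct=NOISE_FLOOR_COV):
    """Return the CoV as a multiple of the environment's noise floor."""
    return cov_pct / floor_pct


if __name__ == "__main__":
    for name, cov in sorted(ENV2_COV.items(), key=lambda kv: kv[1]):
        print(f"{name:<26} CoV={cov:5.1f}%  {cov_vs_floor(cov):5.2f}x floor")
```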

Why ML-DSA sign has high CoV (~52%)

ML-DSA-65 (FIPS 204) uses hedged signing: a fresh 32-byte random string is generated for every signing call and mixed into the seed for the lattice rejection-sampling loop. Different random draws make the loop run a different number of iterations, so signing time genuinely varies at the µs scale. This is not a timing side-channel; it is the intended behaviour of the algorithm.
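
A toy simulation shows why a retry loop alone is enough to produce a large CoV. This is not ML-DSA: the acceptance probability, per-attempt cost, and fixed cost below are illustrative assumptions, and the function names are hypothetical. Each simulated signature retries until an attempt is accepted, so the attempt count follows a geometric distribution.

```python
import random
import statistics


def simulated_sign_times(n, accept_prob, per_attempt_us, fixed_us, seed=0):
    """Simulate n signing latencies for a loop that retries until accepted."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        attempts = 1
        while rng.random() >= accept_prob:   # attempt rejected, try again
            attempts += 1
        times.append(fixed_us + attempts * per_attempt_us)
    return times


def cov_pct(samples):
    """Coefficient of variation in percent (stdev / mean x 100)."""
    return statistics.stdev(samples) / statistics.fmean(samples) * 100


if __name__ == "__main__":
    # All three parameters are made-up values for illustration only.
    variable = simulated_sign_times(10_000, accept_prob=0.2,
                                    per_attempt_us=15.0, fixed_us=25.0)
    constant = simulated_sign_times(10_000, accept_prob=1.0,
                                    per_attempt_us=15.0, fixed_us=25.0)
    print(f"retry loop:  CoV = {cov_pct(variable):.1f}%")
    print(f"single pass: CoV = {cov_pct(constant):.1f}%")
```

With `accept_prob=1.0` every call takes one attempt and the CoV collapses to zero, which isolates the retry loop as the source of the variance in this model.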

Why CoV is higher on Windows

The Windows NT scheduler has a default timer resolution of 15.6 ms. For sub-millisecond operations, a single scheduler interruption can spike a sample by 10–20×. This is why operations that show CoV ~2% in Docker/Linux show CoV ~5–10% on Windows. The paper reports ENV-2 (Linux) values for the CoV analysis and notes ENV-1 (Windows) values as calibration reference only.

Hybrid decomposition

The --with-pqc KEM run includes a decomposition table that isolates each component’s contribution:

Tier	Label	What it runs
①	X25519 only	Pure classical: keygen + DH exchange (no PQC)
②	ML-KEM-768 only	Pure PQC: keygen + encapsulate + decapsulate (liboqs, no classical)
③	HybridKEM full	Both combined: keygen + encapsulate + decapsulate

Combiner overhead ≈ ③ − ① − ② (per operation). This cost is dominated by HKDF-SHA256 and key serialisation, not by the algorithms themselves.
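
The subtraction can be written out as a one-line helper. The tier medians below are placeholders chosen to make the arithmetic easy to follow, not measured values, and the function name `combiner_overhead_us` is hypothetical.

```python
def combiner_overhead_us(hybrid_us, classical_us, pqc_us):
    """Overhead = tier 3 (hybrid) - tier 1 (classical) - tier 2 (PQC)."""
    return hybrid_us - classical_us - pqc_us


if __name__ == "__main__":
    # Placeholder medians in µs for one operation (illustrative only).
    tiers = {"x25519_only": 100.0, "mlkem768_only": 50.0, "hybrid_full": 200.0}
    overhead = combiner_overhead_us(
        tiers["hybrid_full"], tiers["x25519_only"], tiers["mlkem768_only"]
    )
    print(f"combiner overhead = {overhead:.1f} µs")
```

The subtraction is applied per operation, so keygen, encapsulate, and decapsulate each get their own overhead figure.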

ENV-2 combiner overhead (Docker/WSL2, 2026-03-28):

keygen combiner: ~94.0 µs (Python wiring + HKDF + key serialisation)
encapsulate combiner: ~57.0 µs
decapsulate combiner: ~51.0 µs

ENV-1 combiner overhead (Windows native, 2026-03-28) — for reference:

keygen combiner: ~110.6 µs
encapsulate combiner: ~74.1 µs
decapsulate combiner: ~90.6 µs

The combiner cost comes from HKDF-SHA256 and Python key serialisation (PEM/CBOR encoding) rather than from the cryptographic algorithms. The gap between ENV-2 and ENV-1 reflects the lower context-switch overhead of the Linux kernel for short Python calls.

Concurrent throughput curve

The --with-pqc KEM run measures throughput at four concurrency tiers using concurrent.futures.ThreadPoolExecutor. Each task = one complete hybrid KEM handshake (keygen + encapsulate + decapsulate) with real ML-KEM-768.
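
The shape of such a measurement can be sketched with the standard library alone. This is not the harness code: `stand_in_handshake` hashes a buffer with SHA-256 as a placeholder for the real hybrid handshake, the worker count of 32 is an arbitrary assumption, and treating "concurrent users" as the number of submitted tasks is also an assumption.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = b"\x00" * 65536  # large enough that hashlib releases the GIL


def stand_in_handshake(_):
    """Placeholder for keygen + encapsulate + decapsulate."""
    return hashlib.sha256(PAYLOAD).digest()


def throughput(n_tasks, max_workers=32):
    """Run n_tasks on a thread pool; return (wall seconds, ops per second)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        start = time.perf_counter()
        results = list(pool.map(stand_in_handshake, range(n_tasks)))
        wall_s = time.perf_counter() - start
    assert len(results) == n_tasks
    return wall_s, n_tasks / wall_s


if __name__ == "__main__":
    for users in (100, 500):
        wall_s, ops = throughput(users)
        print(f"{users:>5} tasks  {wall_s * 1000:8.1f} ms  {ops:8.0f} ops/s")
```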

ENV-2 results (Docker/WSL2, 2026-03-28):

Concurrent users	Wall-clock median	Throughput	Note
100	50.2 ms	~1,992 ops/s	Baseline
500	232.4 ms	~2,151 ops/s	Peak throughput
1,000	478.7 ms	~2,089 ops/s	Stable
5,000	2,487.8 ms	~2,009 ops/s	−6.6% vs peak; +0.9% vs 100-user baseline

Throughput stays within ~2,000–2,150 ops/s from 100 to 5,000 users, so the thread pool does not degrade as load grows. This is consistent with the GIL being released during liboqs C calls, which lets threads run the C code in parallel despite Python’s GIL. The −6.6% drop from peak to 5,000 users is within normal Python thread-pool scheduling variance.
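
The throughput column follows directly from the wall-clock medians: ops/s = users / wall-clock seconds. The sketch below recomputes it from the ENV-2 table, together with the relative change against the peak and against the 100-user tier. The variable and function names are hypothetical.

```python
# Wall-clock medians in ms per concurrency tier, from the ENV-2 table.
WALL_CLOCK_MS = {100: 50.2, 500: 232.4, 1000: 478.7, 5000: 2487.8}


def ops_per_second(users, wall_ms):
    """Throughput for one tier: completed handshakes per second."""
    return users / (wall_ms / 1000.0)


def pct_change(value, reference):
    """Relative change of value against reference, in percent."""
    return (value - reference) / reference * 100


throughput = {u: ops_per_second(u, ms) for u, ms in WALL_CLOCK_MS.items()}
peak = max(throughput.values())
baseline = throughput[100]

if __name__ == "__main__":
    for users, ops in throughput.items():
        print(
            f"{users:>5} users  {ops:7.0f} ops/s  "
            f"{pct_change(ops, peak):+5.1f}% vs peak  "
            f"{pct_change(ops, baseline):+5.1f}% vs 100 users"
        )
```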

Signature benchmark key numbers

ENV-2 headline latencies (Docker/WSL2, 2026-03-28):

Operation	Median	p95	CoV	Note
Ed25519 sign (32 B)	33.5 µs	42.0 µs	10.1%	WSL2 vCPU noise
Ed25519 verify (32 B)	106.9 µs	116.6 µs	4.0%	✓ Near noise floor
ML-DSA-65 sign (32 B)	100.5 µs	242.6 µs	52.4%	Expected — FIPS 204 hedged signing
ML-DSA-65 verify (32 B)	45.4 µs	53.7 µs	6.2%
HybridSign sign (32 B)	138.8 µs	253.6 µs	31.3%	Dominated by ML-DSA hedged signing
HybridSign verify (32 B)	133.2 µs	172.7 µs	13.4%	+25% vs Ed25519 verify alone
X.509 HybridCert build	313.8 µs	479.2 µs	23.1%	Ed25519 + ML-DSA-65 cosign
X.509 HybridCert verify_cosig	255.4 µs	300.3 µs	14.8%

Paper headline figures (ENV-2):

Full hybrid KEM handshake (keygen + encap + decap): ~301 µs
Full hybrid signature cycle (keygen + sign + verify): ~468 µs
Full hybrid cert issuance: ~314 µs
Throughput at 5,000 users: ~2,009 ops/s (−6.6% vs peak)

All latencies are sub-millisecond, confirming production viability at TLS handshake rates. See results/BENCHMARKS.md for complete tables, ENV-1 comparison, and the cross-environment speedup analysis.

Statistical post-processing

tests/bench/bench_stats.py provides pure-Python (no scipy) statistical utilities for converting raw samples to paper-quality numbers.

import sys
sys.path.insert(0, 'tests/bench')
from bench_stats import (
    bootstrap_ci, welch_t_test, cohens_d,
    latex_table, cov_stability_report, describe_samples,
)

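# samples_us, classical_samples and hybrid_samples are sequences of
# per-iteration latencies collected by the harness.

# 95% bootstrap confidence interval (Efron 1979 percentile method)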
lo, median, hi = bootstrap_ci(samples_us, confidence=0.95, n_resamples=2000)

# Welch's t-test (no equal-variance assumption)
result = welch_t_test(classical_samples, hybrid_samples)
print(f"p={result.p_value:.4f}  significant={result.significant}")
print(f"overhead={result.overhead_pct:.1f}%")

# Cohen's d effect size
d = cohens_d(classical_samples, hybrid_samples)

# LaTeX booktabs table (paste directly into paper)
table = latex_table(
    rows=[["X25519", "33", "35", "3.8%"],
          ["ML-KEM-768", "96", "130", "12.8%"]],
    columns=["Algorithm", "Median (µs)", "p95 (µs)", "CoV"],
    caption="KEM operation latency",
    label="tab:kem-latency",
)
print(table)
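
For readers without the repository at hand, the percentile bootstrap named above can be sketched with the standard library. This is an independent illustration, not the body of `bootstrap_ci`: the function name `percentile_bootstrap_median`, the index rule for the percentiles, and the default seed are assumptions.

```python
import random
import statistics


def percentile_bootstrap_median(samples, confidence=0.95,
                                n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the median: returns (lo, median, hi)."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(samples, k=n))
        for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return medians[lo_idx], statistics.median(samples), medians[hi_idx]


if __name__ == "__main__":
    data = [float(x) for x in range(1, 102)]   # synthetic sample
    print(percentile_bootstrap_median(data))
```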

Available functions:

Function	Purpose
`bootstrap_ci(samples, confidence, n_resamples, seed)`	Percentile bootstrap CI for the median
`welch_t_test(a, b, alpha)`	Welch’s t-test → p-value, df, significance, overhead%
`cohens_d(a, b)`	Pooled-SD effect size
`throughput_curve(points)`	ops/s and scaling efficiency per concurrency tier
`cov_stability_report(results, threshold)`	Flag operations above CoV threshold
`describe_samples(samples)`	Full summary: median, mean, p95, p99, stdev, CoV
`latex_table(rows, columns, caption, label)`	booktabs-formatted LaTeX table string
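
The two comparison statistics in the table have short closed forms. The sketch below computes Welch's t statistic with its Welch-Satterthwaite degrees of freedom and the pooled-SD Cohen's d. It is an illustration rather than the `bench_stats` implementation: the function names are hypothetical, the sign convention (second sample minus first) is an assumption, and the p-value step is omitted.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test of b against a."""
    na, nb = len(a), len(b)
    se2_a = statistics.variance(a) / na      # squared standard error of mean(a)
    se2_b = statistics.variance(b) / nb
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1)
    )
    return t, df


def pooled_cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of both samples."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    classical = [1.0, 2.0, 3.0, 4.0, 5.0]    # synthetic samples
    hybrid = [2.0, 3.0, 4.0, 5.0, 6.0]
    print(welch_t(classical, hybrid), pooled_cohens_d(classical, hybrid))
```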

Results storage

Benchmark runs produce two kinds of output:

File	Description
`results/BENCHMARKS.md`	Human-readable research record — tracked in git, canonical reference
`results/bench_*.json`	Machine-readable JSON snapshots — gitignored (large, machine-specific)

results/BENCHMARKS.md is the authoritative document. It records methodology, dual-environment descriptions (ENV-1 / ENV-2), full result tables, CoV analysis, cross-environment comparison, and paper headline numbers. Update it after every authoritative benchmark run.

The JSON structure:

{
  "generated_at": "2026-03-28T...",
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5, ...}
    ]
  }
}
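
A snapshot with this structure can be scanned for high-variance operations in a few lines. This is a sketch: it assumes every value under "results" is a list of entries shaped like the one shown, uses only the fields visible above, and the function name `high_variance` and the 3% default threshold are illustrative choices.

```python
import json

# Inline stand-in for a results/bench_*.json file, limited to the
# fields shown in the structure above.
SNAPSHOT = """
{
  "generated_at": "2026-03-28T...",
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5}
    ]
  }
}
"""


def high_variance(snapshot, threshold_pct=3.0):
    """Return (group, name, cov_pct) for entries at or above the threshold."""
    flagged = []
    for group, entries in snapshot["results"].items():
        for entry in entries:
            if entry["cov_pct"] >= threshold_pct:
                flagged.append((group, entry["name"], entry["cov_pct"]))
    return flagged


data = json.loads(SNAPSHOT)
flagged = high_variance(data)

if __name__ == "__main__":
    for group, name, cov in flagged:
        print(f"{group}: {name} CoV={cov:.1f}%")
```

To read a saved run instead, replace `json.loads(SNAPSHOT)` with `json.load` on an opened snapshot file.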