Benchmarking
The benchmark suite measures every cryptographic operation in the library and provides the data for the research paper contributions. This guide explains how to run the benchmarks, what each harness covers, and how to interpret the output.
Authoritative results are stored in results/BENCHMARKS.md. That file is the
canonical record; this guide explains how to reproduce and extend them.
Test environments
Two environments are used. Results differ primarily due to the liboqs build and the OS scheduler. ENV-2 (Docker/Linux) is the primary reference for paper claims.
| Label | ENV-1: Windows 11 Native | ENV-2: Docker / WSL2 Linux (Primary) |
|---|---|---|
| OS | Windows 11 Home 10.0.26200 | Linux 6.6.87.2-microsoft-standard-WSL2 (Hyper-V) |
| Python | 3.12.7 | 3.12.13 (python:3.12-slim) |
| liboqs | 0.15.0 MSYS2 DLL (generic build) | 0.15.0 compiled from source ( |
| Scheduler noise | 15.6 ms NT timer resolution inflates CoV | WSL2 vCPU adds ~2–4% CoV above bare-metal Linux |
Note
The from-source Docker build enables AVX2/AVX-512 CPUID detection at runtime
(-DOQS_DIST_BUILD=ON), giving ML-KEM-768 keygen a 5.3× speedup over the
Windows MSYS2 DLL. CoV analysis in the paper uses ENV-2 values exclusively.
Benchmark harnesses
Two harness scripts live in tests/bench/:
| Script | What it measures |
|---|---|
| bench_kem.py | KEM operations: X25519 classical, mock-PQC hybrid, real ML-KEM-768 hybrid, hybrid decomposition (combiner overhead isolation), concurrent throughput curve |
| bench_signatures.py | Signature operations: Ed25519 baseline, ML-DSA-65 standalone, HybridSign (Ed25519+ML-DSA-65), X.509 hybrid certificate build and cosig verify |
All harnesses share the same methodology:
1000 measurement iterations per operation
100 warmup iterations (discarded)
1% outlier trim from each tail (removes OS-scheduler spikes)
time.perf_counter for nanosecond-resolution timing
GC disabled during measurement
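The shared methodology can be sketched as a small timing loop. This is a minimal illustration and not the harness code itself: the function names `measure`, `trim_tails`, and `summarize` are hypothetical, and taking p95 from the trimmed samples is an assumption.

```python
import gc
import statistics
import time


def measure(fn, iterations=1000, warmup=100):
    """Time fn() and return per-call latencies in microseconds."""
    for _ in range(warmup):          # warmup calls are discarded
        fn()
    gc_was_enabled = gc.isenabled()
    gc.disable()                     # keep GC pauses out of the samples
    try:
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1e6)
    finally:
        if gc_was_enabled:
            gc.enable()
    return samples


def trim_tails(samples, pct=1.0):
    """Drop pct percent of the samples from each tail."""
    ordered = sorted(samples)
    k = int(len(ordered) * pct / 100)
    return ordered[k:len(ordered) - k] if k else ordered


def summarize(samples, trim_pct=1.0):
    """Median, p95 and CoV of the trimmed samples (assumed order of steps)."""
    kept = trim_tails(samples, trim_pct)
    mean = statistics.fmean(kept)
    p95_index = min(len(kept) - 1, (95 * len(kept)) // 100)
    return {
        "median_us": statistics.median(kept),
        "p95_us": kept[p95_index],
        "cov_pct": statistics.stdev(kept) / mean * 100.0,
    }
```

A call such as `summarize(measure(lambda: some_operation()))` would then yield the three columns printed by the harnesses, where `some_operation` stands for whatever is being timed.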
Running the benchmarks
Docker (recommended, ENV-2)
The Docker image compiles liboqs 0.15.0 from source with AVX2/AVX-512 support. This is the reproducible, ENV-2 path used for all primary paper numbers.
# Build once — compiles liboqs 0.15.0 from source (~3 min on a modern laptop)
docker build -t quantum-safe-bench .
# KEM suite: classical + hybrid decomposition + concurrent load curve
docker run --rm -v ${PWD}/results:/app/results quantum-safe-bench \
python -X utf8 tests/bench/bench_kem.py --with-pqc \
--save results/bench_kem_$(date +%Y-%m-%d).json
# Signature suite: Ed25519 + ML-DSA-65 + HybridSign + X.509 certs
docker run --rm -v ${PWD}/results:/app/results quantum-safe-bench \
python -X utf8 tests/bench/bench_signatures.py --with-pqc \
--save results/bench_sigs_$(date +%Y-%m-%d).json
# Or run both in one go with docker-compose
docker compose up
KEM benchmarks (native)
# Classical + mock PQC (no liboqs needed)
python -X utf8 tests/bench/bench_kem.py
# Full suite — real ML-KEM-768 + decomposition + extended concurrent load
python -X utf8 tests/bench/bench_kem.py --with-pqc
# Save JSON snapshot for statistical post-processing
python -X utf8 tests/bench/bench_kem.py --with-pqc \
--save results/bench_kem_$(date +%Y-%m-%d).json
Signature benchmarks (native)
# Ed25519 baselines + HybridSign + X.509 certs
python -X utf8 tests/bench/bench_signatures.py
# Add standalone ML-DSA-65 (requires liboqs)
python -X utf8 tests/bench/bench_signatures.py --with-pqc
# Save JSON snapshot
python -X utf8 tests/bench/bench_signatures.py --with-pqc \
--save results/bench_sigs_$(date +%Y-%m-%d).json
Note
-X utf8 forces UTF-8 output on Windows, which is required for the µ
character in the timing output. On Linux/macOS this flag is harmless.
On Windows, oqs.dll must be discoverable. Set OQS_DLL_DIR to the
directory containing oqs.dll (e.g. C:\Users\<you>\_oqs\bin), or place
the DLL at ~\_oqs\bin\oqs.dll — the _oqs_path.py helper registers it
automatically via os.add_dll_directory.
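The lookup described in this note could look roughly like the sketch below. It is not the contents of `_oqs_path.py`: the function name `register_oqs_dll_dir` and the order in which the two locations are tried are assumptions made for illustration.

```python
import os
import sys
from pathlib import Path


def register_oqs_dll_dir():
    """Register the directory holding oqs.dll; returns it, or None if not found."""
    if sys.platform != "win32":
        return None  # os.add_dll_directory only exists on Windows
    candidates = []
    env_dir = os.environ.get("OQS_DLL_DIR")
    if env_dir:
        candidates.append(Path(env_dir))           # explicit override first (assumed order)
    candidates.append(Path.home() / "_oqs" / "bin")  # default location from the note
    for directory in candidates:
        if (directory / "oqs.dll").is_file():
            os.add_dll_directory(str(directory))
            return directory
    return None
```

On Linux and macOS the function is a no-op, which matches the note's point that only Windows needs the extra step.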
Reading the output
Each operation prints one line:
Ed25519 sign (32B) median= 41.4 µs p95= 46.4 µs CoV=6.9% *
| Column | Meaning |
|---|---|
| median | 50th-percentile latency; the headline number for the paper |
| p95 | 95th-percentile latency; worst case for 95% of calls |
| CoV | Coefficient of variation (stdev / mean × 100), used as a side-channel proxy |
| * | CoV > 5%: high variance (see below) |
|  | CoV 3–5%: moderate variance |
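As a worked example of the CoV column, the sketch below applies the formula and the two thresholds from the table to the X25519 keygen entry of the JSON sample in this guide (mean 37.1 µs, stdev 1.5 µs). The function names and the "low" label for values under 3% are hypothetical.

```python
def cov_pct(mean, stdev):
    """Coefficient of variation as a percentage: stdev / mean * 100."""
    return stdev / mean * 100.0


def variance_band(cov):
    """Map a CoV percentage onto the bands used in the output table."""
    if cov > 5.0:
        return "high"
    if cov >= 3.0:
        return "moderate"
    return "low"  # label assumed; the table only names the two bands above


example = cov_pct(mean=37.1, stdev=1.5)
print(f"{example:.1f}% -> {variance_band(example)}")  # prints "4.0% -> moderate"
```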
CoV as a side-channel proxy
The coefficient of variation (CoV) is the primary metric for the timing side-channel analysis (paper Contribution 4).
The baseline reference is AES-256-GCM, which is widely treated as constant-time when backed by hardware AES instructions. Any operation with CoV ≤ AES-GCM’s CoV is considered timing-stable on that platform.
ENV-2 (Docker/WSL2) noise floor: ~2.1% CoV, measured on AES-256-GCM 1 KB encrypt. Operations whose CoV sits at roughly this level (~2%) are treated as timing-stable. The paper uses this per-environment floor rather than a fixed global threshold, since the WSL2 hypervisor adds residual jitter above bare-metal Linux (~0.5–1.5%).
| Operation | CoV | Assessment |
|---|---|---|
| AES-256-GCM 1 KB (baseline) | 2.1% | Noise floor; constant-time reference |
| Ed25519 verify | 2.2% | ✓ Timing-stable |
| PublicKey.fingerprint() | 2.0% | ✓ Timing-stable |
| HKDF-SHA256 | 2.8% | ✓ Within noise floor |
| ML-KEM-768 encapsulate | 9.4% | WSL2 vCPU scheduler noise; no secret-dependent branching in FIPS 203 |
| ML-DSA-65 sign | 52.4% | ✓ Expected; FIPS 204 hedged signing randomness |
Why ML-DSA sign has high CoV (~52%)
ML-DSA-65 (FIPS 204) uses hedged signing: a fresh 32-byte random string is generated per signing call and mixed into the lattice rejection-sampling loop. Different random draws cause the loop to run a different number of iterations, producing genuine timing variation at the µs scale. This is not a timing side-channel — it is the intended behaviour of the algorithm.
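The effect of a randomised retry loop on CoV can be illustrated with a toy simulation. This is not ML-DSA: the acceptance probability, fixed cost, and per-iteration cost below are arbitrary illustrative values, and each attempt is modelled as an independent coin flip.

```python
import random
import statistics


def iterations_until_accept(rng, accept_prob):
    """Number of attempts until one is accepted (at least 1)."""
    count = 1
    while rng.random() >= accept_prob:
        count += 1
    return count


def simulated_cov(accept_prob, fixed_cost_us, per_iter_us, n=20000, seed=0):
    """CoV (%) of a call whose cost is fixed work plus a retry loop."""
    rng = random.Random(seed)
    times = [
        fixed_cost_us + per_iter_us * iterations_until_accept(rng, accept_prob)
        for _ in range(n)
    ]
    return statistics.stdev(times) / statistics.fmean(times) * 100.0


# A loop that always accepts on the first try has no timing spread at all,
# while a retry loop produces large CoV even though every step is "constant-time".
print(simulated_cov(accept_prob=1.0, fixed_cost_us=5.0, per_iter_us=1.0))
print(simulated_cov(accept_prob=0.2, fixed_cost_us=0.0, per_iter_us=1.0))
print(simulated_cov(accept_prob=0.2, fixed_cost_us=20.0, per_iter_us=1.0))
```

In this model the spread comes entirely from how many retries a call happens to need, and any fixed per-call work dilutes it, so a measured CoV depends on the ratio of loop cost to fixed cost.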
Why CoV is higher on Windows
The Windows NT scheduler has a default timer resolution of 15.6 ms. For sub-millisecond operations, a single scheduler interruption can spike a sample by 10–20×. This is why operations that show CoV ~2% in Docker/Linux show CoV ~5–10% on Windows. The paper reports ENV-2 (Linux) values for the CoV analysis and notes ENV-1 (Windows) values as calibration reference only.
Hybrid decomposition
The --with-pqc KEM run includes a decomposition table that isolates
each component’s contribution:
| Tier | Label | What it runs |
|---|---|---|
| ① | X25519 only | Pure classical: keygen + DH exchange (no PQC) |
| ② | ML-KEM-768 only | Pure PQC: keygen + encapsulate + decapsulate (liboqs, no classical) |
| ③ | HybridKEM full | Both combined: keygen + encapsulate + decapsulate |
Combiner overhead ≈ ③ − ① − ②, computed separately for each operation (keygen, encapsulate, decapsulate).
ENV-2 combiner overhead (Docker/WSL2, 2026-03-28):
keygen combiner: ~94.0 µs (Python wiring + HKDF + key serialisation)
encapsulate combiner: ~57.0 µs
decapsulate combiner: ~51.0 µs
ENV-1 combiner overhead (Windows native, 2026-03-28) — for reference:
keygen combiner: ~110.6 µs
encapsulate combiner: ~74.1 µs
decapsulate combiner: ~90.6 µs
Combiner cost is dominated by HKDF-SHA256 and Python key serialisation (PEM/CBOR encoding), not by the cryptographic algorithms. The lower ENV-2 figures likely reflect the Linux kernel’s lower context-switch overhead for short Python calls.
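The subtraction behind the combiner figures can be written out as follows. The function names are hypothetical and the input medians are placeholders chosen only to show the arithmetic; they are not measured results.

```python
def combiner_overhead_us(hybrid_us, classical_us, pqc_us):
    """Tier 3 minus tier 1 minus tier 2 for a single operation."""
    return hybrid_us - classical_us - pqc_us


def decompose(tier_medians):
    """Apply the subtraction to every operation in a {op: {tier: median}} mapping."""
    return {
        op: combiner_overhead_us(m["hybrid"], m["classical"], m["pqc"])
        for op, m in tier_medians.items()
    }


# Placeholder medians in microseconds (illustrative only, not from a benchmark run).
placeholder = {
    "keygen": {"hybrid": 200.0, "classical": 40.0, "pqc": 70.0},
    "encapsulate": {"hybrid": 150.0, "classical": 50.0, "pqc": 60.0},
}
print(decompose(placeholder))  # {'keygen': 90.0, 'encapsulate': 40.0}
```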
Concurrent throughput curve
The --with-pqc KEM run measures throughput at four concurrency tiers using
concurrent.futures.ThreadPoolExecutor. Each task = one complete hybrid KEM
handshake (keygen + encapsulate + decapsulate) with real ML-KEM-768.
ENV-2 results (Docker/WSL2, 2026-03-28):
| Concurrent users | Wall-clock median | Throughput | Note |
|---|---|---|---|
| 100 | 50.2 ms | ~1,992 ops/s | Baseline |
| 500 | 232.4 ms | ~2,151 ops/s | Peak throughput |
| 1,000 | 478.7 ms | ~2,089 ops/s | Stable |
| 5,000 | 2,487.8 ms | ~2,009 ops/s | −6.6% vs peak; +0.9% vs 100-user baseline |
Throughput is near-constant at ~2,000–2,150 ops/s from 100 to 5,000 users. This is consistent with the GIL being released during liboqs C calls, which allows true thread parallelism despite Python’s GIL. The −6.6% drop from peak to 5,000 users is within normal Python thread-pool scheduling variance.
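The throughput column follows directly from the task count and the wall-clock time. The sketch below shows that arithmetic together with one way a tier might be timed using `ThreadPoolExecutor`; the function names and the `max_workers=32` default are assumptions, since the harness's pool size is not stated here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def ops_per_second(n_tasks, wall_clock_ms):
    """Throughput for n_tasks completed in wall_clock_ms milliseconds."""
    return n_tasks / (wall_clock_ms / 1000.0)


def pct_change(value, reference):
    """Relative change of value against reference, in percent."""
    return (value - reference) / reference * 100.0


def run_tier(task, n_tasks, max_workers=32):
    """Submit n_tasks copies of task to a thread pool and time the whole batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for _ in range(n_tasks)]
        for future in futures:
            future.result()  # re-raises any exception from the worker
    wall_ms = (time.perf_counter() - start) * 1000.0
    return wall_ms, ops_per_second(n_tasks, wall_ms)


print(round(ops_per_second(100, 50.2)))    # 1992, the 100-user row
print(round(ops_per_second(500, 232.4)))   # 2151, the 500-user row
print(round(pct_change(2009, 2151), 1))    # -6.6, the drop from peak
```

In a real run, `task` would be one complete hybrid KEM handshake as described above.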
Signature benchmark key numbers
ENV-2 headline latencies (Docker/WSL2, 2026-03-28):
| Operation | Median | p95 | CoV | Note |
|---|---|---|---|---|
| Ed25519 sign (32 B) | 33.5 µs | 42.0 µs | 10.1% | WSL2 vCPU noise |
| Ed25519 verify (32 B) | 106.9 µs | 116.6 µs | 4.0% | ✓ Near noise floor |
| ML-DSA-65 sign (32 B) | 100.5 µs | 242.6 µs | 52.4% | Expected; FIPS 204 hedged signing |
| ML-DSA-65 verify (32 B) | 45.4 µs | 53.7 µs | 6.2% | |
| HybridSign sign (32 B) | 138.8 µs | 253.6 µs | 31.3% | Dominated by ML-DSA hedged signing |
| HybridSign verify (32 B) | 133.2 µs | 172.7 µs | 13.4% | +25% vs Ed25519 verify alone |
| X.509 HybridCert build | 313.8 µs | 479.2 µs | 23.1% | Ed25519 + ML-DSA-65 cosign |
| X.509 HybridCert verify_cosig | 255.4 µs | 300.3 µs | 14.8% | |
Paper headline figures (ENV-2):
Full hybrid KEM handshake (keygen + encap + decap): ~301 µs
Full hybrid signature cycle (keygen + sign + verify): ~468 µs
Full hybrid cert issuance: ~314 µs
Throughput at 5,000 users: ~2,009 ops/s (−6.6% vs peak)
All values are sub-millisecond, confirming production viability at TLS handshake
rates. See results/BENCHMARKS.md for complete tables, ENV-1 comparison, and
the cross-environment speedup analysis.
Statistical post-processing
tests/bench/bench_stats.py provides pure-Python (no scipy) statistical
utilities for converting raw samples to paper-quality numbers.
import sys
sys.path.insert(0, 'tests/bench')
from bench_stats import (
    bootstrap_ci, welch_t_test, cohens_d,
    latex_table, cov_stability_report, describe_samples,
)
# 95% bootstrap confidence interval (Efron 1979 percentile method)
lo, median, hi = bootstrap_ci(samples_us, confidence=0.95, n_resamples=2000)
# Welch's t-test (no equal-variance assumption)
result = welch_t_test(classical_samples, hybrid_samples)
print(f"p={result.p_value:.4f} significant={result.significant}")
print(f"overhead={result.overhead_pct:.1f}%")
# Cohen's d effect size
d = cohens_d(classical_samples, hybrid_samples)
# LaTeX booktabs table (paste directly into paper)
table = latex_table(
    rows=[["X25519", "33", "35", "3.8%"],
          ["ML-KEM-768", "96", "130", "12.8%"]],
    columns=["Algorithm", "Median (µs)", "p95 (µs)", "CoV"],
    caption="KEM operation latency",
    label="tab:kem-latency",
)
print(table)
Available functions:
| Function | Purpose |
|---|---|
| bootstrap_ci | Percentile bootstrap CI for the median |
| welch_t_test | Welch’s t-test → p-value, df, significance, overhead% |
| cohens_d | Pooled-SD effect size |
|  | ops/s and scaling efficiency per concurrency tier |
| cov_stability_report | Flag operations above CoV threshold |
| describe_samples | Full summary: median, mean, p95, p99, stdev, CoV |
| latex_table | booktabs-formatted LaTeX table string |
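For readers who want to see what these statistics compute, the sketch below implements a percentile bootstrap for the median, the Welch t statistic with its Welch–Satterthwaite degrees of freedom, and a pooled-SD Cohen's d using only the standard library. It is an independent illustration, not the code in bench_stats.py; the function names, the `seed` parameter, and the return shapes are assumptions, and the p-value step is left out.

```python
import math
import random
import statistics


def percentile_bootstrap_median(samples, confidence=0.95, n_resamples=2000, seed=0):
    """Resample with replacement and take percentiles of the resampled medians."""
    rng = random.Random(seed)
    n = len(samples)
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo = medians[int(alpha * n_resamples)]
    hi = medians[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lo, statistics.median(samples), hi


def welch_t(a, b):
    """Welch t statistic (b minus a) and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def pooled_cohens_d(a, b):
    """Mean difference (b minus a) divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(pooled_var)
```

With `a = [1, 2, 3, 4, 5]` and `b = [3, 4, 5, 6, 7]`, `welch_t(a, b)` gives t = 2.0 with 8 degrees of freedom, and `pooled_cohens_d(a, b)` is about 1.26.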
Results storage
Benchmark runs produce two kinds of output:
| File | Description |
|---|---|
| results/BENCHMARKS.md | Human-readable research record; tracked in git, canonical reference |
|  | Machine-readable JSON snapshots; gitignored (large, machine-specific) |
results/BENCHMARKS.md is the authoritative document. It records methodology,
dual-environment descriptions (ENV-1 / ENV-2), full result tables, CoV analysis,
cross-environment comparison, and paper headline numbers. Update it after every
authoritative benchmark run.
The JSON structure:
{
  "generated_at": "2026-03-28T...",
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5, ...}
    ]
  }
}
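A snapshot with this shape can be post-processed with the standard `json` module. The sketch below assumes that every group under `results` is a list of entries carrying `name` and `cov_pct`, as in the sample; the helper name `high_cov_entries` is hypothetical, and the embedded text reproduces only the fields shown above.

```python
import json

# Inline copy of the documented sample, without the elided fields.
SNAPSHOT_TEXT = """
{
  "generated_at": "2026-03-28T...",
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5}
    ]
  }
}
"""


def high_cov_entries(snapshot, threshold_pct):
    """Return (group, name, cov_pct) for every entry above the threshold."""
    flagged = []
    for group, entries in snapshot["results"].items():
        for entry in entries:
            if entry["cov_pct"] > threshold_pct:
                flagged.append((group, entry["name"], entry["cov_pct"]))
    return flagged


snapshot = json.loads(SNAPSHOT_TEXT)
print(high_cov_entries(snapshot, threshold_pct=3.0))
# [('classical_baselines', 'X25519 keygen', 4.0)]
```

For a saved run, replace `json.loads(SNAPSHOT_TEXT)` with `json.load` on the file written by `--save`.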