Benchmarking

The benchmark suite measures every cryptographic operation in the library and provides the data for the research paper contributions. This guide explains how to run the benchmarks, what each harness covers, and how to interpret the output.

Authoritative results are stored in results/BENCHMARKS.md. That file is the canonical record; this guide explains how to reproduce and extend them.


Test environments

Two environments are used. Results differ primarily due to the liboqs build and the OS scheduler. ENV-2 (Docker/Linux) is the primary reference for paper claims.

| Label | ENV-1 (Windows 11 Native) | ENV-2 (Docker / WSL2 Linux, Primary) |
| --- | --- | --- |
| OS | Windows 11 Home 10.0.26200 | Linux 6.6.87.2-microsoft-standard-WSL2 (Hyper-V) |
| Python | 3.12.7 | 3.12.13 (python:3.12-slim) |
| liboqs | 0.15.0 MSYS2 DLL (generic build) | 0.15.0 compiled from source (-DOQS_DIST_BUILD=ON) |
| Scheduler noise | 15.6 ms NT timer resolution inflates CoV | WSL2 vCPU adds ~2–4% CoV above bare-metal Linux |

Note

The from-source Docker build enables AVX2/AVX-512 CPUID detection at runtime (-DOQS_DIST_BUILD=ON), giving ML-KEM-768 keygen a 5.3× speedup over the Windows MSYS2 DLL. CoV analysis in the paper uses ENV-2 values exclusively.


Benchmark harnesses

Two harness scripts live in tests/bench/:

| Script | What it measures |
| --- | --- |
| bench_kem.py | KEM operations: X25519 classical, mock-PQC hybrid, real ML-KEM-768 hybrid, hybrid decomposition (combiner overhead isolation), concurrent throughput curve |
| bench_signatures.py | Signature operations: Ed25519 baseline, ML-DSA-65 standalone, HybridSign (Ed25519+ML-DSA-65), X.509 hybrid certificate build and cosig verify |

All harnesses share the same methodology:

  • 1000 measurement iterations per operation

  • 100 warmup iterations (discarded)

  • 1% outlier trim from each tail (removes OS-scheduler spikes)

  • time.perf_counter (the highest-resolution monotonic clock Python exposes) for timing

  • GC disabled during measurement
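
The methodology above can be sketched roughly as follows. This is a minimal illustration, not the harness code in tests/bench/: the function name measure, its parameters, and the nearest-rank p95 are assumptions made for the example.

```python
import gc
import statistics
import time


def measure(op, iterations=1000, warmup=100, trim_pct=1.0):
    """Time op() repeatedly and return summary statistics in microseconds.

    Hypothetical helper: mirrors the listed methodology, not the real harness.
    """
    for _ in range(warmup):              # warmup runs are discarded
        op()

    samples = []
    gc_was_enabled = gc.isenabled()
    gc.disable()                         # keep collector pauses out of the samples
    try:
        for _ in range(iterations):
            start = time.perf_counter()
            op()
            samples.append((time.perf_counter() - start) * 1e6)
    finally:
        if gc_was_enabled:
            gc.enable()

    samples.sort()
    k = int(len(samples) * trim_pct / 100)        # samples dropped from each tail
    trimmed = samples[k:len(samples) - k] if k else samples

    mean = statistics.fmean(trimmed)
    # Nearest-rank approximation of the 95th percentile.
    p95 = trimmed[min(len(trimmed) - 1, int(0.95 * len(trimmed)))]
    return {
        "n": len(trimmed),
        "median_us": statistics.median(trimmed),
        "p95_us": p95,
        "cov_pct": statistics.stdev(trimmed) / mean * 100,
    }
```

For example, measure(lambda: hashlib.sha256(b"x" * 1024).digest()) (after importing hashlib) returns a dictionary with the same three headline quantities the harness prints. With the defaults, 10 samples are trimmed from each tail, leaving 980.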


Running the benchmarks

KEM benchmarks (native)

# Classical + mock PQC (no liboqs needed)
python -X utf8 tests/bench/bench_kem.py

# Full suite — real ML-KEM-768 + decomposition + extended concurrent load
python -X utf8 tests/bench/bench_kem.py --with-pqc

# Save JSON snapshot for statistical post-processing
python -X utf8 tests/bench/bench_kem.py --with-pqc \
  --save results/bench_kem_$(date +%Y-%m-%d).json

Signature benchmarks (native)

# Ed25519 baselines + HybridSign + X.509 certs
python -X utf8 tests/bench/bench_signatures.py

# Add standalone ML-DSA-65 (requires liboqs)
python -X utf8 tests/bench/bench_signatures.py --with-pqc

# Save JSON snapshot
python -X utf8 tests/bench/bench_signatures.py --with-pqc \
  --save results/bench_sigs_$(date +%Y-%m-%d).json

Note

-X utf8 forces UTF-8 output on Windows, which is required for the µ character in the timing output. On Linux/macOS this flag is harmless.

On Windows, oqs.dll must be discoverable. Set OQS_DLL_DIR to the directory containing oqs.dll (e.g. C:\Users\<you>\_oqs\bin), or place the DLL at ~\_oqs\bin\oqs.dll — the _oqs_path.py helper registers it automatically via os.add_dll_directory.
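
The lookup order described above could be implemented along these lines. This is a sketch under stated assumptions, not the contents of _oqs_path.py: the function name register_oqs_dll_dir and its env and home parameters are hypothetical and exist only to make the example testable.

```python
import os
import sys
from pathlib import Path


def register_oqs_dll_dir(env=None, home=None):
    """Return the directory holding oqs.dll, registering it on Windows.

    Hypothetical helper; returns None when no candidate contains the DLL.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    if env.get("OQS_DLL_DIR"):
        candidates.append(Path(env["OQS_DLL_DIR"]))   # explicit override first
    candidates.append(home / "_oqs" / "bin")          # default location

    for directory in candidates:
        if (directory / "oqs.dll").is_file():
            if sys.platform == "win32":
                # os.add_dll_directory exists only on Windows (Python 3.8+)
                # and requires an absolute path.
                os.add_dll_directory(str(directory.resolve()))
            return directory
    return None
```

On Linux and macOS the function only reports where the file was found, since os.add_dll_directory is not available there.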


Reading the output

Each operation prints one line:

Ed25519 sign (32B)                            median=   41.4 µs  p95=   46.4 µs  CoV=6.9% *

| Column | Meaning |
| --- | --- |
| median | 50th-percentile latency; the headline number for the paper |
| p95 | 95th-percentile latency; 95% of calls complete within this time |
| CoV | Coefficient of variation (stdev / mean × 100), used as a side-channel proxy |
| * | CoV > 5%: high variance (see below) |
| ~ | CoV 3–5%: moderate variance |
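
For post-processing captured console output, a line in this format can be parsed and its flag recomputed as sketched below. The regular expression is inferred from the single sample line shown above and assumes µs units; parse_line and cov_flag are hypothetical names, not part of the harness.

```python
import re

# Pattern inferred from the sample output line; assumes timings are printed in µs.
LINE_RE = re.compile(
    r"^(?P<name>.+?)\s+median=\s*(?P<median>[\d.]+) [µμ]s\s+"
    r"p95=\s*(?P<p95>[\d.]+) [µμ]s\s+"
    r"CoV=(?P<cov>[\d.]+)%\s*(?P<flag>[*~]?)\s*$"
)


def cov_flag(cov_pct):
    """Return '*' for CoV > 5%, '~' for CoV 3-5%, and '' otherwise."""
    if cov_pct > 5.0:
        return "*"
    if cov_pct >= 3.0:
        return "~"
    return ""


def parse_line(line):
    """Split one result line into its name, timings, CoV, and flag."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised benchmark line: {line!r}")
    return {
        "name": match["name"],
        "median_us": float(match["median"]),
        "p95_us": float(match["p95"]),
        "cov_pct": float(match["cov"]),
        "flag": match["flag"],
    }
```

Comparing parse_line(line)["flag"] with cov_flag(parse_line(line)["cov_pct"]) is a quick consistency check on a saved log.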


CoV as a side-channel proxy

The coefficient of variation (CoV) is the primary metric for the timing side-channel analysis (paper Contribution 4).

The baseline reference is AES-256-GCM, which is widely treated as constant-time when hardware-accelerated (AES-NI). Any operation with CoV ≤ AES-GCM’s CoV is considered timing-stable on that platform.

ENV-2 (Docker/WSL2) noise floor: ~2.1% (AES-256-GCM 1 KB encrypt). Operations whose CoV sits at about this level are treated as timing-stable. The paper uses this per-environment floor rather than a fixed global threshold, because the WSL2 hypervisor adds residual jitter above bare-metal Linux (~0.5–1.5%).

ENV-2 CoV reference values (2026-03-28)

| Operation | CoV | Assessment |
| --- | --- | --- |
| AES-256-GCM 1 KB (baseline) | 2.1% | Noise floor; constant-time reference |
| Ed25519 verify | 2.2% | ✓ Timing-stable |
| PublicKey.fingerprint() | 2.0% | ✓ Timing-stable |
| HKDF-SHA256 | 2.8% | ✓ Within noise floor |
| ML-KEM-768 encapsulate | 9.4% | WSL2 vCPU scheduler noise; no secret-dependent branching in FIPS 203 |
| ML-DSA-65 sign | 52.4% | ✓ Expected: FIPS 204 hedged signing randomness |
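
The per-environment comparison against the AES-GCM floor can be expressed as a small function. This is a sketch: classify_against_floor and its tolerance_pct parameter are hypothetical. The strict rule stated above corresponds to tolerance_pct=0; the 1.0-point tolerance in the example is an assumption chosen only to illustrate how values slightly above the floor could be grouped with it.

```python
def classify_against_floor(cov_by_op, baseline_op, tolerance_pct=0.0):
    """Label each operation relative to the baseline operation's CoV.

    tolerance_pct is a hypothetical allowance in percentage points above
    the floor; 0.0 applies the strict "CoV <= baseline CoV" rule.
    """
    floor = cov_by_op[baseline_op]
    return {
        op: "stable" if cov <= floor + tolerance_pct else "above floor"
        for op, cov in cov_by_op.items()
    }


# ENV-2 CoV reference values from this guide (percent).
env2_cov = {
    "AES-256-GCM 1 KB": 2.1,
    "Ed25519 verify": 2.2,
    "PublicKey.fingerprint()": 2.0,
    "HKDF-SHA256": 2.8,
    "ML-KEM-768 encapsulate": 9.4,
    "ML-DSA-65 sign": 52.4,
}

strict = classify_against_floor(env2_cov, "AES-256-GCM 1 KB")
relaxed = classify_against_floor(env2_cov, "AES-256-GCM 1 KB", tolerance_pct=1.0)
```

An "above floor" label is a prompt for investigation, not a verdict: ML-DSA-65 sign lands there for reasons unrelated to secret-dependent timing.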

Why ML-DSA sign has high CoV (~52%)

ML-DSA-65 (FIPS 204) uses hedged signing: a fresh 32-byte random string is generated per signing call and mixed into the seed that drives the lattice rejection-sampling loop. Different random draws cause the loop to run a different number of iterations, producing genuine timing variation at the µs scale. This is not a timing side-channel; it is the intended behaviour of the algorithm.
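
A toy model makes the effect concrete. The simulation below treats each signing call as a fixed cost plus a per-attempt cost, with attempts repeated until one is accepted. The acceptance probability (0.2) and the timing constants are illustrative assumptions, not measurements of liboqs or values taken from FIPS 204.

```python
import random
import statistics


def simulate_sign_times(n, accept_prob, fixed_us, per_iter_us, seed=0):
    """Simulate n signing latencies under a retry-until-accepted loop."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        attempts = 1
        while rng.random() >= accept_prob:   # rejected: draw again
            attempts += 1
        times.append(fixed_us + attempts * per_iter_us)
    return times


def cov_pct(samples):
    """Coefficient of variation in percent (stdev / mean * 100)."""
    return statistics.stdev(samples) / statistics.fmean(samples) * 100


# Illustrative parameters only.
looped = simulate_sign_times(20_000, accept_prob=0.2, fixed_us=0.0, per_iter_us=20.0)
single_pass = simulate_sign_times(20_000, accept_prob=1.0, fixed_us=0.0, per_iter_us=20.0)
```

With no fixed cost, the attempt count is geometrically distributed and its CoV approaches sqrt(1 − p), close to 90% for p = 0.2. Adding a fixed per-call cost lowers the CoV, which is one reason a measured figure can sit well below that bound. The single-pass case has zero variance in this model.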

Why CoV is higher on Windows

The Windows NT scheduler has a default timer resolution of 15.6 ms. For sub-millisecond operations, a single scheduler interruption can spike a sample by 10–20×. This is why operations that show CoV ~2% in Docker/Linux show CoV ~5–10% on Windows. The paper reports ENV-2 (Linux) values for the CoV analysis and notes ENV-1 (Windows) values as calibration reference only.


Hybrid decomposition

The --with-pqc KEM run includes a decomposition table that isolates each component’s contribution:

| Tier | Label | What it runs |
| --- | --- | --- |
| ① | X25519 only | Pure classical: keygen + DH exchange (no PQC) |
| ② | ML-KEM-768 only | Pure PQC: keygen + encapsulate + decapsulate (liboqs, no classical) |
| ③ | HybridKEM full | Both combined: keygen + encapsulate + decapsulate |

Combiner overhead ≈ ③ − ① − ② (per operation). This cost is dominated by HKDF-SHA256 and key serialisation, not by the algorithms themselves.
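
The subtraction can be written as a one-line helper. The function name combiner_overhead_us is hypothetical, and the numbers in the example are round placeholders chosen for readability, not benchmark results.

```python
def combiner_overhead_us(classical_us, pqc_us, hybrid_us):
    """Combiner overhead = hybrid (tier 3) - classical (tier 1) - PQC (tier 2)."""
    return hybrid_us - classical_us - pqc_us


# Placeholder medians in µs (NOT measured values): (classical, pqc, hybrid).
placeholder_medians = {
    "keygen": (30.0, 20.0, 140.0),
    "encapsulate": (40.0, 25.0, 120.0),
}

overheads = {
    op: combiner_overhead_us(*tiers) for op, tiers in placeholder_medians.items()
}
```

Because the three medians come from separate runs, the difference carries the combined noise of all three measurements and is best read as an estimate.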

ENV-2 combiner overhead (Docker/WSL2, 2026-03-28):

  • keygen combiner: ~94.0 µs (Python wiring + HKDF + key serialisation)

  • encapsulate combiner: ~57.0 µs

  • decapsulate combiner: ~51.0 µs

ENV-1 combiner overhead (Windows native, 2026-03-28) — for reference:

  • keygen combiner: ~110.6 µs

  • encapsulate combiner: ~74.1 µs

  • decapsulate combiner: ~90.6 µs

Combiner cost is dominated by HKDF-SHA256 and Python key serialisation (PEM/CBOR encoding), not by the cryptographic algorithms. The lower ENV-2 figures likely reflect the Linux kernel’s lower context-switch overhead for short Python calls.


Concurrent throughput curve

The --with-pqc KEM run measures throughput at four concurrency tiers using concurrent.futures.ThreadPoolExecutor. Each task = one complete hybrid KEM handshake (keygen + encapsulate + decapsulate) with real ML-KEM-768.
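
A single concurrency tier can be measured roughly as shown below. This is a sketch, not the harness implementation: measure_throughput is a hypothetical name, the SHA-256 workload is a stand-in for the real hybrid handshake, and the example takes one wall-clock reading where the harness reports a median.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor


def placeholder_handshake():
    """Stand-in workload; the real task is a full hybrid KEM handshake."""
    return hashlib.sha256(b"\x00" * 4096).digest()


def measure_throughput(task, n_tasks, max_workers=None):
    """Run n_tasks through a thread pool; return (wall-clock ms, ops/s)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for _ in range(n_tasks)]
        for future in futures:
            future.result()          # re-raises any exception from the task
    wall_s = time.perf_counter() - start
    return wall_s * 1e3, n_tasks / wall_s
```

Calling measure_throughput(placeholder_handshake, 100) corresponds to the 100-user tier; repeating the call and taking the median wall-clock time gives a more stable figure.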

ENV-2 results (Docker/WSL2, 2026-03-28):

| Concurrent users | Wall-clock median | Throughput | Note |
| --- | --- | --- | --- |
| 100 | 50.2 ms | ~1,992 ops/s | Baseline |
| 500 | 232.4 ms | ~2,151 ops/s | Peak throughput |
| 1,000 | 478.7 ms | ~2,089 ops/s | Stable |
| 5,000 | 2,487.8 ms | ~2,009 ops/s | −6.6% vs peak; +0.9% vs 100-user baseline |

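Throughput is near-constant at ~2,000–2,150 ops/s from 100 to 5,000 users, so the thread pool sustains load without degradation. This is consistent with the GIL being released during liboqs C calls, which would allow true thread parallelism despite Python’s GIL, although flat throughput alone does not prove it. The −6.6% drop from peak to 5,000 users is small enough to be attributed to ordinary Python thread-pool scheduling variance.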
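
The throughput column and the relative changes follow directly from the wall-clock medians. The short calculation below uses the ENV-2 values from this section; the helper names are illustrative.

```python
# Concurrent users -> wall-clock median in ms (ENV-2 values from this guide).
wall_clock_ms = {100: 50.2, 500: 232.4, 1000: 478.7, 5000: 2487.8}


def ops_per_second(users, wall_ms):
    """Completed handshakes per second for one concurrency tier."""
    return users / (wall_ms / 1000.0)


def pct_change(value, reference):
    """Signed relative change of value against reference, in percent."""
    return (value - reference) / reference * 100.0


throughput = {u: ops_per_second(u, ms) for u, ms in wall_clock_ms.items()}
peak = max(throughput.values())

change_vs_peak = pct_change(throughput[5000], peak)
change_vs_baseline = pct_change(throughput[100], throughput[100]) or pct_change(
    throughput[5000], throughput[100]
)
```

Printing throughput, change_vs_peak, and change_vs_baseline reproduces the derived columns, which is a convenient check whenever the table is updated from a new run.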


Signature benchmark key numbers

ENV-2 headline latencies (Docker/WSL2, 2026-03-28):

| Operation | Median | p95 | CoV | Note |
| --- | --- | --- | --- | --- |
| Ed25519 sign (32 B) | 33.5 µs | 42.0 µs | 10.1% | WSL2 vCPU noise |
| Ed25519 verify (32 B) | 106.9 µs | 116.6 µs | 4.0% | ✓ Near noise floor |
| ML-DSA-65 sign (32 B) | 100.5 µs | 242.6 µs | 52.4% | Expected: FIPS 204 hedged signing |
| ML-DSA-65 verify (32 B) | 45.4 µs | 53.7 µs | 6.2% | |
| HybridSign sign (32 B) | 138.8 µs | 253.6 µs | 31.3% | Dominated by ML-DSA hedged signing |
| HybridSign verify (32 B) | 133.2 µs | 172.7 µs | 13.4% | +25% vs Ed25519 verify alone |
| X.509 HybridCert build | 313.8 µs | 479.2 µs | 23.1% | Ed25519 + ML-DSA-65 cosign |
| X.509 HybridCert verify_cosig | 255.4 µs | 300.3 µs | 14.8% | |

Paper headline figures (ENV-2):

  • Full hybrid KEM handshake (keygen + encap + decap): ~301 µs

  • Full hybrid signature cycle (keygen + sign + verify): ~468 µs

  • Full hybrid cert issuance: ~314 µs

  • Throughput at 5,000 users: ~2,009 ops/s (−6.6% vs peak)

All latency figures are sub-millisecond, which supports production viability at TLS handshake rates. See results/BENCHMARKS.md for complete tables, ENV-1 comparison, and the cross-environment speedup analysis.


Statistical post-processing

tests/bench/bench_stats.py provides pure-Python (no scipy) statistical utilities for converting raw samples to paper-quality numbers.

import sys
sys.path.insert(0, 'tests/bench')
from bench_stats import (
    bootstrap_ci, welch_t_test, cohens_d,
    latex_table, cov_stability_report, describe_samples,
)

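# samples_us, classical_samples and hybrid_samples are lists of per-iteration
# latencies in µs, e.g. collected from a harness run.

# 95% bootstrap confidence interval (Efron 1979 percentile method)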
lo, median, hi = bootstrap_ci(samples_us, confidence=0.95, n_resamples=2000)

# Welch's t-test (no equal-variance assumption)
result = welch_t_test(classical_samples, hybrid_samples)
print(f"p={result.p_value:.4f}  significant={result.significant}")
print(f"overhead={result.overhead_pct:.1f}%")

# Cohen's d effect size
d = cohens_d(classical_samples, hybrid_samples)

# LaTeX booktabs table (paste directly into paper)
table = latex_table(
    rows=[["X25519", "33", "35", "3.8%"],
          ["ML-KEM-768", "96", "130", "12.8%"]],
    columns=["Algorithm", "Median (µs)", "p95 (µs)", "CoV"],
    caption="KEM operation latency",
    label="tab:kem-latency",
)
print(table)

Available functions:

| Function | Purpose |
| --- | --- |
| bootstrap_ci(samples, confidence, n_resamples, seed) | Percentile bootstrap CI for the median |
| welch_t_test(a, b, alpha) | Welch’s t-test → p-value, df, significance, overhead% |
| cohens_d(a, b) | Pooled-SD effect size |
| throughput_curve(points) | ops/s and scaling efficiency per concurrency tier |
| cov_stability_report(results, threshold) | Flag operations above CoV threshold |
| describe_samples(samples) | Full summary: median, mean, p95, p99, stdev, CoV |
| latex_table(rows, columns, caption, label) | booktabs-formatted LaTeX table string |
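
For readers who want to see what three of these statistics compute, the sketch below gives standard-library versions of the textbook definitions. These are not the bench_stats implementations; the _sketch suffixes mark them as illustrative, and the Welch function stops at the t statistic and degrees of freedom without deriving a p-value.

```python
import math
import random
import statistics


def bootstrap_ci_sketch(samples, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the median: returns (lo, median, hi)."""
    rng = random.Random(seed)
    n = len(samples)
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lo = medians[int(tail * n_resamples)]
    hi = medians[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, statistics.median(samples), hi


def welch_t_sketch(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d_sketch(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(b) - statistics.fmean(a)) / math.sqrt(pooled_var)
```

The sign convention here is b minus a, so passing classical samples first and hybrid samples second yields a positive value when the hybrid operation is slower.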


Results storage

Benchmark runs produce two kinds of output:

| File | Description |
| --- | --- |
| results/BENCHMARKS.md | Human-readable research record; tracked in git, canonical reference |
| results/bench_*.json | Machine-readable JSON snapshots; gitignored (large, machine-specific) |

results/BENCHMARKS.md is the authoritative document. It records methodology, dual-environment descriptions (ENV-1 / ENV-2), full result tables, CoV analysis, cross-environment comparison, and paper headline numbers. Update it after every authoritative benchmark run.

The JSON structure:

{
  "generated_at": "2026-03-28T...",
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5, ...}
    ]
  }
}
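
A snapshot with this shape can be scanned for high-variance operations as sketched below. The example assumes every value under "results" is a list of entries carrying "name" and "cov_pct", which is inferred from the excerpt above; flag_high_cov is a hypothetical helper, and the inline snapshot contains only the fields shown in this guide.

```python
import json


def flag_high_cov(snapshot, threshold_pct=5.0):
    """Return (group, name, cov_pct) for entries whose CoV exceeds the threshold."""
    flagged = []
    for group, entries in snapshot["results"].items():
        for entry in entries:
            if entry["cov_pct"] > threshold_pct:
                flagged.append((group, entry["name"], entry["cov_pct"]))
    return flagged


# Minimal snapshot built from the fields shown in the structure above.
example_snapshot = json.loads("""
{
  "harness": {"iterations": 1000, "warmup": 100, "outlier_trim_pct": 1},
  "results": {
    "classical_baselines": [
      {"name": "X25519 keygen", "median_us": 36.9, "p95_us": 40.5,
       "cov_pct": 4.0, "mean_us": 37.1, "stdev_us": 1.5}
    ]
  }
}
""")

above_3_pct = flag_high_cov(example_snapshot, threshold_pct=3.0)
```

For a saved file, replace the inline string with json.load on the opened snapshot; the threshold can be set to the environment’s noise floor instead of a fixed value.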