Benchmarks
License: CC BY 4.0. See LICENSE-docs.md.
The SKI Framework's latency story has to be honest about what the framework controls and what it does not. Per-verdict latency in a deployment decomposes as:
end-to-end = framework overhead + LLM inference + ledger round-trip
The framework controls the first term: jurisdiction + effective-date
scoping (scope_to), prompt rendering, citation validation, the
Symbolic Verifier's per-assertion cross-check, risk-tier policy, and
ed25519 transcript signing. LLM inference time is a property of the
model and hardware the operator chooses; the ledger round-trip is a
property of their Postgres deployment. "Sub-100ms" is therefore a
framework-overhead budget, not an end-to-end promise — end-to-end
latency must be validated per deployment, and the suite ships a mode
for exactly that.
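The decomposition can be made concrete with a small sketch. The class and field names below are hypothetical (they are not part of the framework), and the numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBreakdown:
    """Per-verdict latency components in milliseconds (illustrative only)."""
    framework_ms: float   # scoping, rendering, verification, signing
    inference_ms: float   # model + hardware, chosen by the operator
    ledger_ms: float      # Postgres round-trip, owned by the operator

    @property
    def end_to_end_ms(self) -> float:
        # end-to-end = framework overhead + LLM inference + ledger round-trip
        return self.framework_ms + self.inference_ms + self.ledger_ms

    @property
    def framework_share(self) -> float:
        # Fraction of the end-to-end time the framework is responsible for.
        return self.framework_ms / self.end_to_end_ms


# Placeholder values for a single request, not benchmark results.
sample = LatencyBreakdown(framework_ms=0.5, inference_ms=400.0, ledger_ms=4.5)
print(sample.end_to_end_ms)
print(f"{sample.framework_share:.2%}")
```

The sum holds for each individual request. Percentiles of the three terms do not add in general, which is one more reason to measure end-to-end latency directly instead of deriving it from component percentiles.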
The suite
benchmarks/ measures the production code path — the same
kg_loader.scope_to → V3Evaluator.aevaluate_with_transcript flow
server.py runs, with real transcript signing — never a simulation.
The workload is the SKI Evals golden dataset, cycled deterministically,
with dataset hashes recorded in the run provenance.
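As a rough illustration of "cycled deterministically, with dataset hashes recorded", the sketch below repeats a dataset in its original order and fingerprints it with SHA-256. The function names and the string-based case format are assumptions for this example; the suite's actual loader and hashing scheme may differ.

```python
import hashlib
import itertools
from typing import Iterable, Iterator, List


def dataset_digest(cases: Iterable[str]) -> str:
    """SHA-256 over the cases in order; any edit or reordering changes the hash."""
    h = hashlib.sha256()
    for case in cases:
        h.update(case.encode("utf-8"))
        h.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return h.hexdigest()


def cycled_workload(cases: List[str], n: int) -> Iterator[str]:
    """Yield exactly n cases by repeating the dataset in its original order."""
    return itertools.islice(itertools.cycle(cases), n)


cases = ["case-a", "case-b", "case-c"]  # placeholder cases, not the golden dataset
run = list(cycled_workload(cases, 7))
print(run)
print(dataset_digest(cases))
```

Because there is no shuffling or random sampling, two runs with the same `n` and the same dataset hash see identical inputs in identical order.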
```bash
# Framework overhead (in-process, deterministic FakeLLM, no infra):
python -m benchmarks.run --mode pipeline --n 2000 --warmup 200

# End-to-end against a live deployment (any backend it runs):
python -m benchmarks.run --mode http \
    --endpoint https://localhost:8000 --api-key "$SKI_API_KEY" \
    --n 200 --warmup 20

# CI gate (fail the build if the overhead budget is exceeded):
python -m benchmarks.run --mode pipeline --max-framework-p99-ms 100
```
Reports render as markdown (humans) and JSON (machines), both carrying full provenance: dataset hashes, git commit, Python version, platform, CPU count, and the verdict mix actually produced.
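Most of the provenance fields listed above can be gathered with the Python standard library. The following is a minimal sketch under stated assumptions: the key names, the `collect_provenance` helper, and the verdict labels are hypothetical and do not reflect the suite's actual report schema.

```python
import json
import os
import platform
import subprocess
from typing import Dict


def git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def collect_provenance(dataset_hashes: Dict[str, str],
                       verdict_mix: Dict[str, int]) -> dict:
    """Bundle run metadata into a JSON-serialisable dict (hypothetical schema)."""
    return {
        "dataset_hashes": dataset_hashes,
        "git_commit": git_commit(),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "verdict_mix": verdict_mix,
    }


# Placeholder inputs; real values come from the run itself.
prov = collect_provenance({"energy": "abc123"}, {"compliant": 3, "non_compliant": 1})
print(json.dumps(prov, indent=2))
```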
Reference numbers: framework overhead
Pipeline mode, FakeLLM backend (model inference ≈ 0), 2,000 samples
after 200 warmup, transcript signing enabled. Workload:
evals/datasets/energy (50 golden cases, 10-obligation KG).
| Stage | p50 | p90 | p95 | p99 | mean | verdicts/s |
|---|---|---|---|---|---|---|
| KG scoping (scope_to) | 0.01 ms | 0.01 ms | 0.01 ms | 0.03 ms | 0.01 ms | — |
| Evaluate + verify + sign | 0.09 ms | 0.13 ms | 0.14 ms | 0.29 ms | 0.11 ms | — |
| Framework total (per verdict) | 0.10 ms | 0.14 ms | 0.16 ms | 0.36 ms | 0.12 ms | ~8,500 |
Environment: Python 3.10, Linux aarch64, 2 vCPUs (deliberately modest —
commodity-edge-class, not a benchmarking rig). Reproduce with the first
command above; CI uploads a fresh report artifact on every build and
gates on framework_total p99 ≤ 100 ms.
The headline: framework overhead is sub-millisecond at p99 — about 250× inside the 100 ms budget on 2 vCPUs. In a real deployment, per-verdict latency is dominated by LLM inference (hundreds of ms to seconds for a 7B-class model on CPU; tens of ms on a GPU with vLLM) and the audit-ledger append (single-digit ms on a local Postgres). The framework does not meaningfully add to either.
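For readers who want to see how per-verdict samples turn into a percentile row and a pass/fail gate, here is a minimal sketch. It assumes the nearest-rank percentile definition; the suite may use a different interpolation method, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile: the smallest sample with >= q% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Compute the percentile columns used in the table above."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 90, 95, 99)}


def gate(samples_ms: Sequence[float], max_p99_ms: float = 100.0) -> bool:
    """True when the p99 stays within the budget (assumed gate semantics)."""
    return percentile(samples_ms, 99) <= max_p99_ms


# Synthetic samples (1.0 ms .. 100.0 ms), not benchmark data.
samples = [float(i) for i in range(1, 101)]
print(summarize(samples))
print(gate(samples), gate(samples, max_p99_ms=50.0))
```

A CI job would exit non-zero when `gate` returns `False`, which matches the coarse, order-of-magnitude intent of the 100 ms budget.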
Scope and caveats
- Single worker by design. The runtime enforces SKI_MODEL_WORKERS=1 (see docs/CONCURRENCY.md), so throughput scales by sharding deployments, not by adding workers. The verdicts/s figure above is the single-worker ceiling imposed by framework overhead alone; a deployment's real ceiling is its model's inference throughput.
- http mode measures everything. FastAPI, auth, TLS, scoping, inference, verification, signing, and the ledger append: this is the number an operator should validate and record per deployment.
- No KG-size scaling claims yet. The workload KG has 10 obligations. scope_to is a linear scan; numbers for real-sized KGs (hundreds to thousands of obligations) land with the sector KG work.
- Shared CI runners are noisy; the CI gate exists to catch order-of-magnitude regressions, not single-digit-percent drift. Trend numbers belong in the per-build artifacts, reference numbers in this page.
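To show why KG size matters for a linear scan, the sketch below filters obligations by jurisdiction and effective date in a single pass. The `Obligation` fields and the `scope_linear` function are hypothetical stand-ins for illustration; they are not the framework's kg_loader.scope_to implementation or data model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Obligation:
    """Hypothetical obligation record; the real KG schema may differ."""
    obligation_id: str
    jurisdiction: str
    effective_from: date
    effective_to: Optional[date] = None  # None means still in force


def scope_linear(obligations: List[Obligation],
                 jurisdiction: str, on: date) -> List[Obligation]:
    """One pass over every obligation: cost grows linearly with KG size."""
    return [
        o for o in obligations
        if o.jurisdiction == jurisdiction
        and o.effective_from <= on
        and (o.effective_to is None or on <= o.effective_to)
    ]


# Placeholder KG with made-up identifiers and dates.
kg = [
    Obligation("OB-1", "EU", date(2020, 1, 1)),
    Obligation("OB-2", "EU", date(2020, 1, 1), date(2021, 12, 31)),
    Obligation("OB-3", "US", date(2019, 6, 1)),
    Obligation("OB-4", "EU", date(2025, 1, 1)),
]
in_scope = scope_linear(kg, "EU", date(2023, 5, 1))
print([o.obligation_id for o in in_scope])
```

With 10 obligations the scan is negligible. At hundreds to thousands of obligations the per-verdict cost scales proportionally, which is why the reference numbers above should not be extrapolated to larger KGs.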