Compare & plot¶
Two (or more) saved runs in, one comparison out — as a table (benchmem compare) or an
interactive view (benchmem plot). Both take --metric time or any memory metric, and
group by the dims your tests carry.
Setup¶
A scratch dir, and a baseline run to diff against. plotly renders inline from the
CDN.
import os
import sys
import tempfile
from pathlib import Path
import plotly.io as pio
os.environ["FORCE_COLOR"] = "1"
os.environ["PATH"] = f"{Path(sys.executable).parent}{os.pathsep}{os.environ['PATH']}"
pio.renderers.default = "notebook_connected"
_tmp = Path(tempfile.mkdtemp(prefix="pytest-benchmem-"))
suite = _tmp / "test_sortbench.py"
suite.write_text("""
import pytest
@pytest.mark.parametrize("n", [10_000, 50_000, 200_000, 500_000])
def test_sort(benchmark_memory, n):
    benchmark_memory(sorted, list(range(n, 0, -1)))
""")
baseline, candidate = _tmp / "baseline.json", _tmp / "candidate.json"
!pytest {suite} --benchmark-only --benchmark-json={baseline} --benchmark-columns=min,median -q -p no:cacheprovider
.... [100%]
Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-mg6u7bxo/baseline.json'>
benchmark: 4 tests
Name (time in us) Min Median │ peak (MiB)
──────────────────────────────────────────────────────────────────────────────
test_sort[10000] 49.8410 (1.0) 56.7110 (1.0) │ 0.08
test_sort[50000] 260.0050 (5.22) 272.6160 (4.81) │ 0.38
test_sort[200000] 1,047.8020 (21.02) 1,062.5530 (18.74) │ 1.53
test_sort[500000] 2,876.9400 (57.72) 3,129.2860 (55.18) │ 3.81
memory (right of │): a separate, untimed pass, not the timed rounds • also
available via --benchmark-memory-columns: allocated, allocs
4 passed in 5.35s
On a real change you'd run the suite on main, then on your branch. Here we just run
it twice on the same code, so the deltas below are measurement noise rather than a
real difference.
!pytest {suite} --benchmark-only --benchmark-json={candidate} --benchmark-columns=min,median -q -p no:cacheprovider
.... [100%]
Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-mg6u7bxo/candidate.json'>
benchmark: 4 tests
Name (time in us) Min Median │ peak (MiB)
──────────────────────────────────────────────────────────────────────────────
test_sort[10000] 49.9810 (1.0) 51.2310 (1.0) │ 0.08
test_sort[50000] 257.8960 (5.16) 264.6160 (5.17) │ 0.38
test_sort[200000] 1,031.7010 (20.64) 1,058.7975 (20.67) │ 1.53
test_sort[500000] 2,768.9480 (55.40) 2,928.7870 (57.17) │ 3.81
memory (right of │): a separate, untimed pass, not the timed rounds • also
available via --benchmark-memory-columns: allocated, allocs
4 passed in 4.20s
benchmem compare — the delta table¶
A per-id delta table with percent change, for whichever --metric you ask for. Ids
in only one run show —.
!benchmem compare {baseline} {candidate} --metric peak
peak (MiB)
id baseline candidate change
───────────────────────────────────────────────────
test_sort[10000] 0.08 0.08 +0.0%
test_sort[200000] 1.53 1.53 +0.0%
test_sort[500000] 3.81 3.81 +0.0%
test_sort[50000] 0.38 0.38 +0.0%
!benchmem compare {baseline} {candidate} --metric time
time (s)
id baseline candidate change
────────────────────────────────────────────────────
test_sort[10000] 4.984e-05 4.998e-05 +0.3%
test_sort[200000] 0.001048 0.001032 -1.5%
test_sort[500000] 0.002877 0.002769 -3.8%
test_sort[50000] 0.00026 0.0002579 -0.8%
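The change column is a relative change against the baseline. The sketch below recomputes it from the rounded values printed above; `delta_rows` is a hypothetical helper for illustration, not benchmem's implementation, and it uses `None` where the table would show an id that exists in only one run.

```python
def delta_rows(baseline, candidate):
    """Return (id, baseline, candidate, percent_change) rows.

    percent_change is None when an id is missing from either run
    (or when the baseline is zero, so no ratio can be formed).
    """
    rows = []
    for test_id in sorted(set(baseline) | set(candidate)):
        old = baseline.get(test_id)
        new = candidate.get(test_id)
        if old is None or new is None or old == 0:
            change = None
        else:
            change = (new - old) * 100.0 / old
        rows.append((test_id, old, new, change))
    return rows


# Rounded timings (seconds) copied from the table above.
baseline_s = {
    "test_sort[10000]": 4.984e-05,
    "test_sort[200000]": 0.001048,
    "test_sort[500000]": 0.002877,
    "test_sort[50000]": 0.00026,
}
candidate_s = {
    "test_sort[10000]": 4.998e-05,
    "test_sort[200000]": 0.001032,
    "test_sort[500000]": 0.002769,
    "test_sort[50000]": 0.0002579,
}

for test_id, old, new, change in delta_rows(baseline_s, candidate_s):
    shown = "n/a" if change is None else f"{change:+.1f}%"
    print(f"{test_id:<20} {old!s:>12} {new!s:>12} {shown:>8}")
```

A negative change means the candidate was faster (or smaller) than the baseline.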
For timing comparisons you can also use pytest-benchmark's own tooling directly:
pytest-benchmark compare and --benchmark-histogram. pytest-benchmem doesn't reimplement those; it adds the memory-aware, dims-aware views.
Order the rows with --sort, which takes name, value (largest in the last run first),
or change, and write the raw numbers for another tool with --csv out.csv:
benchmem compare {baseline} {candidate} --metric peak --sort value --csv peak.csv
Gate on a regression with --fail-on — it exits non-zero past a threshold.
Here baseline and candidate are the same code, so nothing trips it (exit 0); on a
real regression the offending ids print and the command exits 1:
!benchmem compare {baseline} {candidate} --metric peak --fail-on peak:10% --fail-on allocations:5%; echo "exit: $?"
peak (MiB)
id baseline candidate change
───────────────────────────────────────────────────
test_sort[10000] 0.08 0.08 +0.0%
test_sort[200000] 1.53 1.53 +0.0%
test_sort[500000] 3.81 3.81 +0.0%
test_sort[50000] 0.38 0.38 +0.0%
no regressions over thresholds
exit: 0
Thresholds are percent (peak:10%) or absolute (peak:5MiB), on peak, allocated,
or allocations. The next section wires this into CI; the
reference has the full grammar.
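To make the two threshold forms concrete, here is a rough sketch of how a `metric:limit` spec could be parsed and checked. It is not benchmem's parser: `parse_threshold` and `exceeds` are hypothetical names, the unit table beyond MiB is an assumption, and so is the strict "greater than" comparison.

```python
import re

# Assumed unit set; only MiB appears in the text above.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def parse_threshold(spec):
    """Split 'metric:limit' into (metric, kind, value).

    kind is 'percent' (value in percent) or 'absolute' (value in bytes).
    """
    metric, sep, limit = spec.partition(":")
    if not sep or not metric or not limit:
        raise ValueError(f"expected metric:limit, got {spec!r}")
    if limit.endswith("%"):
        return metric, "percent", float(limit[:-1])
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]+)", limit)
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised limit {limit!r}")
    return metric, "absolute", float(match.group(1)) * _UNITS[match.group(2)]


def exceeds(threshold, baseline, candidate):
    """True when growth from baseline to candidate is past the threshold."""
    _, kind, value = threshold
    growth = candidate - baseline
    if kind == "percent":
        return baseline > 0 and growth * 100.0 / baseline > value
    return growth > value


MIB = 1024**2
print(parse_threshold("peak:10%"))
print(parse_threshold("peak:5MiB"))
print(exceeds(parse_threshold("peak:10%"), 100 * MIB, 120 * MIB))
```

Percent limits scale with the size of each benchmark, while absolute limits only trip on growth that is large in its own right.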
Gate CI on regressions¶
Two ways to fail a PR when memory regresses.
A) Two saved JSON files — save a baseline (e.g. on main), then compare the PR
run against it with benchmem compare --fail-on. The baseline file is just the
--benchmark-json from an earlier run, restored from cache or a base-branch build:
# on the PR branch:
pytest --benchmark-only --benchmark-json=pr.json
benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
B) Inline, via pytest-benchmark storage — no separate files. Save a baseline into
storage once with pytest-benchmark's own --benchmark-save (or --benchmark-autosave
every run), then gate the next run against it. --benchmark-memory-compare-fail
implies --benchmark-memory-compare, so the PR run compares against the latest saved
run automatically:
# on main — record the baseline into .benchmarks/ storage:
pytest --benchmark-only --benchmark-memory --benchmark-save=main
# on the PR branch — fail if peak grows >10% vs that baseline:
pytest --benchmark-only --benchmark-memory --benchmark-memory-compare-fail=peak:10%
Without a prior saved run, the inline gate is a no-op — it prints "no prior run with memory to compare against" and passes. Save a baseline first.
A minimal GitHub Actions job using approach A, caching the baseline across runs:
- uses: actions/cache@v4
  with:
    path: main.json
    key: benchmem-baseline-${{ github.base_ref }}
- run: pytest --benchmark-only --benchmark-json=pr.json
- run: benchmem compare main.json pr.json --fail-on peak:10% --fail-on allocations:5%
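With approach A the baseline file is restored from somewhere else, so it can help to check that it covers the same ids as the PR run before trusting the gate. A minimal sketch, assuming the usual pytest-benchmark JSON layout (a top-level `benchmarks` list whose entries carry `name` and `stats`). `load_min_times` and `id_drift` are hypothetical helpers, and the memory fields pytest-benchmem adds are not read here.

```python
import json
import tempfile
from pathlib import Path


def load_min_times(path):
    """Map benchmark name -> stats['min'] (seconds) from a pytest-benchmark JSON file."""
    data = json.loads(Path(path).read_text())
    return {bench["name"]: bench["stats"]["min"] for bench in data["benchmarks"]}


def id_drift(baseline_path, candidate_path):
    """Return (ids only in the baseline, ids only in the candidate)."""
    base = load_min_times(baseline_path)
    cand = load_min_times(candidate_path)
    return sorted(set(base) - set(cand)), sorted(set(cand) - set(base))


# Tiny stand-in files so the sketch runs on its own;
# real files come from --benchmark-json.
tmp = Path(tempfile.mkdtemp())
main_json, pr_json = tmp / "main.json", tmp / "pr.json"
main_json.write_text(json.dumps({"benchmarks": [
    {"name": "test_sort[10000]", "stats": {"min": 4.984e-05}},
    {"name": "test_sort[50000]", "stats": {"min": 0.00026}},
]}))
pr_json.write_text(json.dumps({"benchmarks": [
    {"name": "test_sort[10000]", "stats": {"min": 4.998e-05}},
]}))

only_main, only_pr = id_drift(main_json, pr_json)
print("only in baseline:", only_main)
print("only in PR run:", only_pr)
```

An id that appears in only one file has no delta to gate on, so a renamed or removed test is worth spotting separately.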
For which metric to gate on, see "Picking one for a gate"; allocations is usually
the steadiest tripwire.
benchmem plot — the interactive views¶
benchmem plot writes an interactive plotly view to standalone HTML. It picks the
view by run count — but each view answers a different question, so override with
--view when you want a specific one:
| Runs | Default view | Answers |
|---|---|---|
| 1 | scaling | how does cost grow with input size? |
| 2 | scatter | which ids moved, and were they already big? |
| 2 | compare (--view compare) | ranked: what moved most, in native units? |
| 3+ | sweep | fold-change across versions, one cell per (id, run) |
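The defaults in the table boil down to a small dispatch on the number of runs. The sketch below only restates the table; `default_view` is a hypothetical name, not a function benchmem is known to expose.

```python
def default_view(n_runs):
    """Pick the default view name for a given number of saved runs."""
    if n_runs < 1:
        raise ValueError("need at least one run")
    if n_runs == 1:
        return "scaling"
    if n_runs == 2:
        return "scatter"  # "compare" is the opt-in alternative for two runs
    return "sweep"


for n in (1, 2, 3, 5):
    print(n, default_view(n))
```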
!benchmem plot --metric peak {baseline} {candidate} -o {_tmp / "scatter.html"}
scatter (peak): 4 ids → /tmp/pytest-benchmem-mg6u7bxo/scatter.html
Every view is a plot_* function over the same load_long_df seam — call it directly
to render the same figure inline, no HTML round-trip. Each takes a metric, returns
(figure, n_ids), and shares three options: facet (small-multiple by a dim), labels
(name the series, defaulting to file stems), and clip (clamp the colour scale so one
outlier doesn't wash the rest out).
Scaling — a single run, cost vs. size. plot_scaling auto-infers the x-axis from
the numeric n dim (override with x=), and auto-picks log/linear (force with
log=). The baseline alone draws sorted's peak-memory curve:
from pytest_benchmem import plotting
plotting.plot_scaling([baseline], metric="peak")[0]
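One way to read a scaling curve numerically is to fit a straight line in log-log space: the slope estimates the growth exponent. The sketch below does that with the stdlib on the peak values from the baseline table. It is an illustration of the idea, not what plot_scaling computes, and `loglog_slope` is a hypothetical helper.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


# n values and peak (MiB) copied from the baseline run above.
sizes = [10_000, 50_000, 200_000, 500_000]
peak_mib = [0.08, 0.38, 1.53, 3.81]

print(f"growth exponent ~ {loglog_slope(sizes, peak_mib):.2f}")
```

A slope near 1 is consistent with linear growth, which is what you would expect from `sorted` building one new list of n references.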
Scatter — two runs. x = baseline cost (log), y = candidate/baseline ratio, colour = absolute Δ. The top-right is the "big and got bigger" corner — where a regression actually costs you. Here on memory:
plotting.plot_scatter([baseline, candidate], metric="peak")[0]
Compare — two runs, ranked. A bar per id sorted by absolute delta, diverging
colour around zero — the "did anything regress, biggest first" view. Pass
sort="relative" to rank by percent instead. On timing this time:
plotting.plot_compare([baseline, candidate], metric="time")[0]
Sweep — three or more runs. A heatmap of log₂ fold-change vs the first run, one column per run, one row per id — the natural picture for a version sweep. A third run to make one:
third = _tmp / "third.json"
!pytest {suite} --benchmark-only --benchmark-json={third} --benchmark-columns=min,median -q -p no:cacheprovider
plotting.plot_sweep([baseline, candidate, third], metric="peak")[0]
.... [100%]
Wrote benchmark data in: <_io.BufferedWriter name='/tmp/pytest-benchmem-mg6u7bxo/third.json'>
benchmark: 4 tests
Name (time in us) Min Median │ peak (MiB)
──────────────────────────────────────────────────────────────────────────────
test_sort[10000] 49.6110 (1.0) 50.2500 (1.0) │ 0.08
test_sort[50000] 259.7450 (5.24) 264.5750 (5.27) │ 0.38
test_sort[200000] 1,053.6020 (21.24) 1,082.8520 (21.55) │ 1.53
test_sort[500000] 2,898.7810 (58.43) 3,288.0200 (65.43) │ 3.81
memory (right of │): a separate, untimed pass, not the timed rounds • also
available via --benchmark-memory-columns: allocated, allocs
4 passed in 4.47s
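Each heatmap cell is a log₂ fold-change against the first run, so 0 means unchanged, +1 means doubled, and -1 means halved. A minimal sketch of that arithmetic follows. `fold_change_grid` is a hypothetical helper, and returning `None` for an id missing from a later run is an assumption about how gaps might be handled. Peak memory is identical across these three runs, so the example uses the median timings instead.

```python
import math


def fold_change_grid(runs):
    """Map id -> list of log2(value / first-run value), one entry per run.

    runs is a list of {id: value} dicts; the first dict is the reference.
    """
    reference = runs[0]
    grid = {}
    for test_id in sorted(reference):
        ref = reference[test_id]
        row = []
        for run in runs:
            value = run.get(test_id)
            if value is None or value <= 0 or ref <= 0:
                row.append(None)  # no ratio can be formed
            else:
                row.append(math.log2(value / ref))
        grid[test_id] = row
    return grid


# Median timings (microseconds) for two ids, copied from the three runs above.
median_us = [
    {"test_sort[10000]": 56.7110, "test_sort[500000]": 3129.2860},
    {"test_sort[10000]": 51.2310, "test_sort[500000]": 2928.7870},
    {"test_sort[10000]": 50.2500, "test_sort[500000]": 3288.0200},
]

for test_id, row in fold_change_grid(median_us).items():
    print(test_id, [f"{cell:+.3f}" for cell in row])
```

The first column is always 0 because every id is compared against itself there.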
Naming the series¶
By default each run is labelled by its file stem (baseline, candidate, …). Pass
labels= to name them yourself — the API behind plot's -l/--label — which is what
you want when the filenames are version numbers or commit shas:
plotting.plot_sweep([baseline, candidate, third], metric="peak",
                    labels=["v0.6", "v0.7", "v0.8"])[0]
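The file-stem default is easy to reproduce if you want to build labels yourself, for example to strip a prefix from commit-named files. A small sketch, where `series_labels` is a hypothetical helper and the length check is an assumption rather than documented behaviour:

```python
from pathlib import Path


def series_labels(paths, labels=None):
    """Return explicit labels if given, otherwise each path's file stem."""
    if labels is None:
        return [Path(p).stem for p in paths]
    if len(labels) != len(paths):
        raise ValueError("need exactly one label per run")
    return list(labels)


runs = ["/tmp/out/baseline.json", "/tmp/out/candidate.json", "/tmp/out/third.json"]
print(series_labels(runs))
print(series_labels(runs, ["v0.6", "v0.7", "v0.8"]))
```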