A benchmark for log reduction tools (RTK, grep, tail, hybrid routers, LLM-summary) — do they preserve enough evidence for LLM root-cause diagnosis?
LogDx-CI compares 12 context providers (raw, tail, grep,
three RTK modes, two real LLM
summarizers, three hybrid routers, plus
Headroom) by handing the
same CI failure log to three debugger families (Claude Haiku 4.5,
Claude Sonnet 4.6, OpenAI gpt-5-mini) and scoring the resulting
root-cause diagnoses against AI-drafted + author-verified ground
truths. It optimizes for method ranking stability across model
families, not "which LLM scored highest."
Have a function that takes a CI log and returns a reduced version? You can rank it against all 12 reference methods with zero setup — no clone, no API key, no money:

    pip install logdx-ci

Then, in Python:

    import logdx_ci

    def my_reducer(raw_log: str) -> str:
        # your logic here, e.g. keep only the last 200 lines
        return "\n".join(raw_log.split("\n")[-200:])

    result = logdx_ci.evaluate(reducer=my_reducer)  # all 35 cases, ~5 sec
    print(result.summary())  # score + table vs 12 baselines

The corpus + scoring code is auto-fetched (~20 MB) on first call and
cached. The default diagnoser="static-signal-recall" does not call
any LLM — it scores whether your reducer preserved the ground-truth
required_signals. Deterministic, free, runs in under a second.
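
As a rough illustration of what a signal-recall style score measures, the sketch below computes the fraction of required signals that survive reduction. It is not the SDK's scorer: the helper names `signal_recall` and `tail_reducer`, the plain substring matching, and the sample log are all assumptions made for illustration. Only the `required_signals` field name comes from the text above.

```python
def signal_recall(reduced_log: str, required_signals: list[str]) -> float:
    """Fraction of required signals found verbatim in the reduced log.

    Hypothetical helper: the real scorer may normalise or match differently.
    """
    if not required_signals:
        return 1.0  # assumption: nothing to preserve counts as perfect recall
    hits = sum(1 for signal in required_signals if signal in reduced_log)
    return hits / len(required_signals)


def tail_reducer(raw_log: str, n: int = 200) -> str:
    """Keep only the last n lines of the log."""
    return "\n".join(raw_log.split("\n")[-n:])


# Made-up log: the root cause appears on the first line, the exit code on the last.
sample_log = "\n".join(
    ["ModuleNotFoundError: No module named 'yaml'"]
    + [f"step {i} ok" for i in range(300)]
    + ["Process completed with exit code 1"]
)
signals = ["ModuleNotFoundError", "exit code 1"]

full_recall = signal_recall(sample_log, signals)
tail_recall = signal_recall(tail_reducer(sample_log), signals)
print(full_recall, tail_recall)
```

In this toy case the tail reducer drops the early root-cause line, so it keeps only one of the two signals. That is the kind of evidence loss a deterministic recall check is meant to surface.
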
For leaderboard-comparable diagnosis-quality scoring, pass
diagnoser="real-debugger-v2" (Claude Sonnet 4.6 via the claude CLI;
~$0.03 / case) or "real-debugger-v1" (Haiku 4.5, ~$0.005 / case) or
"real-debugger-v3" (gpt-5-mini, ~$0.006 / case, needs
OPENAI_API_KEY). Diagnoses are cached per (diagnoser, case_id, reduced_context_hash), so reruns are free.
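
To see why reruns cost nothing, here is a minimal sketch of caching keyed on (diagnoser, case_id, reduced_context_hash). The hash function (SHA-256 over UTF-8 bytes), the in-memory dict, the stubbed diagnoser call, and the case id `case-001` are assumptions for illustration; the SDK's actual cache format is not described here.

```python
import hashlib


def cache_key(diagnoser: str, case_id: str, reduced_context: str) -> tuple[str, str, str]:
    """Build a (diagnoser, case_id, reduced_context_hash) key.

    Assumption: SHA-256 over the UTF-8 bytes; the SDK may hash differently.
    """
    digest = hashlib.sha256(reduced_context.encode("utf-8")).hexdigest()
    return (diagnoser, case_id, digest)


cache: dict[tuple[str, str, str], str] = {}
calls = 0  # counts simulated paid diagnoser calls


def diagnose_cached(diagnoser: str, case_id: str, reduced_context: str) -> str:
    """Return a cached diagnosis, invoking the stubbed diagnoser only on a miss."""
    global calls
    key = cache_key(diagnoser, case_id, reduced_context)
    if key not in cache:
        calls += 1  # stands in for a paid LLM call
        cache[key] = f"diagnosis for {case_id}"
    return cache[key]


diagnose_cached("real-debugger-v2", "case-001", "error: build failed")
diagnose_cached("real-debugger-v2", "case-001", "error: build failed")    # cache hit
diagnose_cached("real-debugger-v2", "case-001", "error: build failed\n")  # new hash, new call
print(calls)
```

Under this scheme any change to the reducer's output, even a trailing newline, produces a new hash and therefore a new paid call, while an unchanged reducer is served entirely from cache.
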
Full SDK tutorial: logdx_ci/README.md. External
evaluation example (Headroom): docs/external-evaluations/headroom_logcompressor_default.md.
Across 35 real CI failure cases and 3 model families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), the intersection of the per-family top-3 sets is
{hybrid-grep-120k-rtk-tail, hybrid-grep-120k-tail}. The bottom-4 set is also stable across all three families.
Macro diagnosis_score_v1_1 aggregated case-count-weighted across
the 35-case corpus:
| Rank | Method | Haiku 4.5 | Sonnet 4.6 | gpt-5-mini | Overall |
|---|---|---|---|---|---|
| 1 | hybrid-grep-120k-rtk-tail | 0.624 | 0.679 | 0.706 | 0.670 |
| 2 | hybrid-grep-120k-tail | 0.610 | 0.730 | 0.658 | 0.666 |
| 3 | llm-summary-v1-gpt-5-mini (new in v1.2; agent-loop #1 at 0.749) | 0.654 | 0.686 | 0.652 | 0.664 |
| 4 | grep | 0.578 | 0.684 | 0.655 | 0.639 |
| 5 | llm-summary-v1-haiku (promoted to headline in v1.1) | 0.583 | 0.704 | 0.608 | 0.632 |
| 6 | tail-200 | 0.595 | 0.624 | 0.623 | 0.614 |
| 7 | hybrid-grep-4k-rtk-err-cat (earlier 4k-threshold hybrid; replaced) | 0.552 | 0.597 | 0.571 | 0.573 |
| 8 | headroom-LogCompressor (Headroom v0.24.0 defaults; evaluated via the SDK) | 0.548 | 0.601 | 0.534 | 0.561 |
| 9 | rtk-err-cat | 0.455 | 0.488 | 0.467 | 0.470 |
| 10 | raw | 0.324 | 0.368 | 0.367 | 0.353 |
| 11 | rtk-read | 0.329 | 0.369 | 0.349 | 0.349 |
| 12 | rtk-log | 0.238 | 0.262 | 0.249 | 0.249 |
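
The bottom-4 stability claim can be checked directly against the table. The sketch below copies the per-family scores from the table above, takes the four lowest-scoring methods for each family, and intersects the sets. The helper name `bottom_k` is an assumption, and this is a reader-side sanity check rather than the benchmark's own ranking code.

```python
# Scores copied from the table above: (Haiku 4.5, Sonnet 4.6, gpt-5-mini).
scores = {
    "hybrid-grep-120k-rtk-tail": (0.624, 0.679, 0.706),
    "hybrid-grep-120k-tail": (0.610, 0.730, 0.658),
    "llm-summary-v1-gpt-5-mini": (0.654, 0.686, 0.652),
    "grep": (0.578, 0.684, 0.655),
    "llm-summary-v1-haiku": (0.583, 0.704, 0.608),
    "tail-200": (0.595, 0.624, 0.623),
    "hybrid-grep-4k-rtk-err-cat": (0.552, 0.597, 0.571),
    "headroom-LogCompressor": (0.548, 0.601, 0.534),
    "rtk-err-cat": (0.455, 0.488, 0.467),
    "raw": (0.324, 0.368, 0.367),
    "rtk-read": (0.329, 0.369, 0.349),
    "rtk-log": (0.238, 0.262, 0.249),
}


def bottom_k(family_index: int, k: int) -> set[str]:
    """Return the k lowest-scoring methods for one model family."""
    ranked = sorted(scores, key=lambda method: scores[method][family_index])
    return set(ranked[:k])


# One bottom-4 set per family, then the methods common to all three.
bottom_sets = [bottom_k(i, 4) for i in range(3)]
stable_bottom = set.intersection(*bottom_sets)
print(sorted(stable_bottom))
```
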
The legacy llm-summary-v1-mock stub (used as the LLM-summary
representative through v1.1) is retained as an appendix entry on
the leaderboard, not in the headline. The top-2 hybrids replaced an
earlier 4k-threshold hybrid that was overfit during methodology
development. See the technical
report for the v1.2 paper, and
reports/legacy/e10_v1_3_to_v2_transition_study.md
for the original prototype-vs-formal corpus analysis.
Full leaderboard at https://logdx-bench.github.io/leaderboard.html.
| 🏠 Homepage | https://logdx-bench.github.io/ |
| 📊 Leaderboard | https://logdx-bench.github.io/leaderboard.html |
| 📄 arXiv preprint | https://arxiv.org/abs/2605.28876 |
| 📄 Full report | reports/technical_report.md |
| 🐍 Python SDK | logdx_ci/ — pip install logdx-ci, evaluate your reducer in 5 min |
| 📦 Cases corpus mirror | https://huggingface.co/datasets/eyuansu71/logdx-ci |
| 📋 Release notes | latest: RELEASE_NOTES_v1_2.md · history: RELEASE_NOTES.md (v1.0), v1.1.1, v1.1.2 |
| 📑 Cite | CITATION.cff · BibTeX |
Each case lives under cases/<split>/<case_id>/{raw.log, case.json, ground_truth.json, tags.json, privacy_audit.json}. Schema in the
dataset card.
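
For anyone working from a local checkout, a small loader along these lines can confirm that a case directory is complete before use. The file names come from the layout above; the helper names `missing_files` and `load_case`, the returned dict shape, and the throwaway `demo-case` directory with placeholder contents are assumptions. Consult the dataset card for the real schema.

```python
import json
import tempfile
from pathlib import Path

# File names taken from the documented case layout.
EXPECTED_FILES = ("raw.log", "case.json", "ground_truth.json", "tags.json", "privacy_audit.json")


def missing_files(case_dir: Path) -> list[str]:
    """List the expected files that are absent from a case directory."""
    return [name for name in EXPECTED_FILES if not (case_dir / name).is_file()]


def load_case(case_dir: Path) -> dict:
    """Load the raw log text and parsed ground truth from one case directory."""
    missing = missing_files(case_dir)
    if missing:
        raise FileNotFoundError(f"{case_dir} is missing: {', '.join(missing)}")
    return {
        "raw_log": (case_dir / "raw.log").read_text(encoding="utf-8", errors="replace"),
        "ground_truth": json.loads((case_dir / "ground_truth.json").read_text(encoding="utf-8")),
    }


# Demo on a temporary directory with placeholder contents (not real corpus data).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "cases" / "dev" / "demo-case"
    demo_dir.mkdir(parents=True)
    (demo_dir / "raw.log").write_text("error: build failed\n", encoding="utf-8")
    incomplete = missing_files(demo_dir)  # only raw.log exists at this point
    for name in EXPECTED_FILES[1:]:
        (demo_dir / name).write_text("{}", encoding="utf-8")
    case = load_case(demo_dir)

print(incomplete)
```
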
The SDK auto-downloads them on first use — the only reason to clone
the repo is if you're reproducing the leaderboard or contributing.
The numbers above were generated through the canonical pipeline in
tools/, which writes auditable manifest artifacts to results/
(committed to git for long-term reproducibility). To regenerate them
from a clean checkout:

    git clone https://github.com/eyuansu62/LogDx.git && cd LogDx
    python tools/run_baseline.py --method tail-200 --split dev
    python tools/run_diagnosis.py --diagnoser real-debugger-v2 \
        --split dev --method tail-200
    python tools/evaluate_diagnosis.py --split dev --diagnoser real-debugger-v2

The SDK (logdx_ci.evaluate(...)) re-uses the same scorer + diagnoser
shims, so SDK scores are bit-for-bit comparable to leaderboard numbers
— but the SDK doesn't write the committed-artifact audit trail. Use
the SDK to try a new reducer; use tools/ to enshrine a method on
the leaderboard.
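
When regenerating numbers for more than one method, it can help to build the three `tools/` invocations programmatically. The sketch below only assembles and prints the command lines shown above; it does not run them. The helper name `pipeline_commands` is hypothetical, and it is an assumption that the scripts accept other method and diagnoser names through the same flags.

```python
import shlex


def pipeline_commands(method: str, diagnoser: str, split: str = "dev") -> list[list[str]]:
    """Return argv lists for the baseline, diagnosis, and evaluation steps, in order."""
    return [
        ["python", "tools/run_baseline.py", "--method", method, "--split", split],
        ["python", "tools/run_diagnosis.py", "--diagnoser", diagnoser,
         "--split", split, "--method", method],
        ["python", "tools/evaluate_diagnosis.py", "--split", split, "--diagnoser", diagnoser],
    ]


# Print the commands for review; pass each argv to subprocess.run to execute them.
commands = pipeline_commands("grep", "real-debugger-v1")
for argv in commands:
    print(shlex.join(argv))
```
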
Current release: v1.2 (preprint). We'll add cases + model
families before calling it stable.
- 35 cases (target: 50+ with broader ecosystem coverage)
- Ground truth is AI-drafted + single-author verified (not independent human annotation)
- Three model families tested (Haiku / Sonnet / gpt-5-mini); GPT-4o / Gemini / Llama are the highest-leverage follow-ups
- 20 documented historical exclusions in configs/historical_provider_error_exclusions.json appear as zero-score abstentions in the eval denominator
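
The effect of counting exclusions as zero-score abstentions can be seen with a small arithmetic sketch. The helper name `mean_with_abstentions` and the per-case scores are made up for illustration; the point is only that an abstention adds to the denominator without adding to the numerator.

```python
def mean_with_abstentions(scores: list[float], abstentions: int) -> float:
    """Mean where each abstention contributes 0.0 but still counts as a case."""
    total_cases = len(scores) + abstentions
    if total_cases == 0:
        raise ValueError("no cases to average")
    return sum(scores) / total_cases


scored = [0.8, 0.6, 0.7]  # made-up per-case scores
with_abstention = mean_with_abstentions(scored, abstentions=1)
without_abstention = mean_with_abstentions(scored, abstentions=0)
print(round(with_abstention, 3), round(without_abstention, 3))
```

A method with many provider errors is therefore penalised relative to one that answered every case, which is worth keeping in mind when comparing scores across methods.
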
Full caveats in the technical report §5.

    @article{qin2026logdx,
      title         = {{LogDx-CI}: Benchmarking Log Reduction Tools
                       for LLM Root-Cause Diagnosis},
      author        = {Qin, Bowen},
      year          = {2026},
      eprint        = {2605.28876},
      archivePrefix = {arXiv},
      primaryClass  = {cs.SE},
      url           = {https://arxiv.org/abs/2605.28876},
      note          = {v1.2 release; cases corpus at
                       \url{https://huggingface.co/datasets/eyuansu71/logdx-ci}},
    }

- Code (tools/, examples/, schemas/, configs/, prompts/, tests, scripts): Apache-2.0 (LICENSE)
- Data + reports + protocol locks (cases/, results/, reports/, protocols/, docs/): CC-BY-4.0 (LICENSE-DATA)

LogDx-CI benchmarks third-party log-reduction tools alongside its own baselines. Specifically:
- RTK (Rust Token Killer) by rtk-ai: the rtk-read, rtk-log, and rtk-err-cat baselines are three different invocations of the rtk CLI binary. The hybrid routers hybrid-grep-120k-rtk-tail and hybrid-grep-4k-rtk-err-cat use rtk's err-cat mode as an intermediate / fallback context provider. See docs/methods/rtk.md for setup + invocation details.

CI failure logs are sourced from publicly visible GitHub Actions runs. Diagnoses are produced by Claude (Anthropic) and gpt-5-mini (OpenAI).
New context-provider methods, debugger families, and case
contributions are welcome — see CONTRIBUTING.md
for the dev environment, repo layout, validator scripts, and the
"add a new method" checklist.
Bowen Qin · National University of Singapore · contact via GitHub Issues