LogDx-CI

A benchmark for log reduction tools (RTK, grep, tail, hybrid routers, LLM-summary) — do they preserve enough evidence for LLM root-cause diagnosis?

LogDx-CI compares 12 context providers (raw, tail, grep, three RTK modes, two real LLM summarizers, three hybrid routers, plus Headroom) by handing the same CI failure log to three debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) and scoring the resulting root-cause diagnoses against AI-drafted + author-verified ground truths. It optimizes for method ranking stability across model families, not "which LLM scored highest."

Evaluate your own reducer in 5 minutes

Have a function that takes a CI log and returns a reduced version? You can rank it against all 12 reference methods with zero setup — no clone, no API key, no money:

pip install logdx-ci

import logdx_ci

def my_reducer(raw_log: str) -> str:
    # your logic here — e.g. keep only the last 200 lines
    return "\n".join(raw_log.split("\n")[-200:])

result = logdx_ci.evaluate(reducer=my_reducer)   # all 35 cases, ~5 sec
print(result.summary())                          # score + table vs 12 baselines

The corpus + scoring code is auto-fetched (~20 MB) on first call and cached. The default diagnoser="static-signal-recall" does not call any LLM — it scores whether your reducer preserved the ground-truth required_signals. Deterministic, free, runs in under a second.
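
As a rough illustration of what a signal-recall style score measures, the sketch below counts how many required signals survive reduction. It is not the SDK's scorer: the function name signal_recall, the case-sensitive substring match, and the toy log are all assumptions made for this example, and the real static-signal-recall diagnoser may normalize or match signals differently.

```python
def signal_recall(reduced_log: str, required_signals: list[str]) -> float:
    """Fraction of required signals found verbatim in the reduced log.

    Assumption: plain case-sensitive substring matching.
    """
    if not required_signals:
        return 1.0  # nothing to preserve, so nothing was lost
    hits = sum(1 for signal in required_signals if signal in reduced_log)
    return hits / len(required_signals)


def tail_200(raw_log: str) -> str:
    """Same reducer as the quick-start example: keep the last 200 lines."""
    return "\n".join(raw_log.split("\n")[-200:])


# Toy log (hypothetical): the real error appears early, followed by 300 noise lines.
raw_log = (
    "step 1 ok\n"
    "ModuleNotFoundError: No module named 'yaml'\n"
    + "\n".join(f"noise {i}" for i in range(300))
    + "\nProcess completed with exit code 1"
)
signals = ["ModuleNotFoundError", "exit code 1"]

print(signal_recall(raw_log, signals))            # 1.0
print(signal_recall(tail_200(raw_log), signals))  # 0.5
```

In this toy case the tail reducer keeps the exit code but drops the exception that explains it, which is the kind of evidence loss the benchmark is designed to surface.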

For leaderboard-comparable diagnosis-quality scoring, pass one of the LLM-backed diagnosers: diagnoser="real-debugger-v2" (Claude Sonnet 4.6 via the claude CLI, ~$0.03 / case), "real-debugger-v1" (Haiku 4.5, ~$0.005 / case), or "real-debugger-v3" (gpt-5-mini, ~$0.006 / case, needs OPENAI_API_KEY). Diagnoses are cached per (diagnoser, case_id, reduced_context_hash), so rerunning with an unchanged reducer incurs no further API cost.
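
The caching behaviour can be pictured with a small in-memory sketch keyed on the same triple. Everything here is illustrative rather than the SDK's implementation: the DiagnosisCache class, the choice of SHA-256 for the context hash, and the case id "case-001" are assumptions.

```python
import hashlib
from typing import Callable


def context_hash(reduced_context: str) -> str:
    # Assumption: SHA-256 over the UTF-8 bytes; the SDK's hash may differ.
    return hashlib.sha256(reduced_context.encode("utf-8")).hexdigest()


class DiagnosisCache:
    """In-memory cache keyed by (diagnoser, case_id, reduced_context_hash)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str, str], str] = {}

    def get_or_compute(
        self,
        diagnoser: str,
        case_id: str,
        reduced_context: str,
        compute: Callable[[str], str],
    ) -> str:
        key = (diagnoser, case_id, context_hash(reduced_context))
        if key not in self._store:
            # Only a cache miss pays for a diagnoser call.
            self._store[key] = compute(reduced_context)
        return self._store[key]


calls = []

def fake_diagnoser(context: str) -> str:
    """Stand-in for a paid LLM call; records each invocation."""
    calls.append(context)
    return f"diagnosis for {len(context)} chars"

cache = DiagnosisCache()
cache.get_or_compute("real-debugger-v2", "case-001", "error: boom", fake_diagnoser)
cache.get_or_compute("real-debugger-v2", "case-001", "error: boom", fake_diagnoser)
print(len(calls))  # 1
```

Because the reduced context is part of the key, changing the reducer's output for a case produces a new key and therefore a new diagnoser call, while an unchanged output is served from the cache.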

Full SDK tutorial: logdx_ci/README.md. External evaluation example (Headroom): docs/external-evaluations/headroom_logcompressor_default.md.

Headline finding

Across 35 real CI failure cases and 3 model families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), the top-3 intersection of the per-family rankings is {hybrid-grep-120k-rtk-tail, hybrid-grep-120k-tail}. The bottom-4 set is also stable across all three families.

Macro diagnosis_score_v1_1 aggregated case-count-weighted across the 35-case corpus:

| Rank | Method | Haiku 4.5 | Sonnet 4.6 | gpt-5-mini | Overall |
|---|---|---|---|---|---|
| 1 | `hybrid-grep-120k-rtk-tail` | 0.624 | 0.679 | 0.706 | 0.670 |
| 2 | `hybrid-grep-120k-tail` | 0.610 | 0.730 | 0.658 | 0.666 |
| 3 | `llm-summary-v1-gpt-5-mini` (new in v1.2; agent-loop #1 at 0.749) | 0.654 | 0.686 | 0.652 | 0.664 |
| 4 | `grep` | 0.578 | 0.684 | 0.655 | 0.639 |
| 5 | `llm-summary-v1-haiku` (promoted to headline in v1.1) | 0.583 | 0.704 | 0.608 | 0.632 |
| 6 | `tail-200` | 0.595 | 0.624 | 0.623 | 0.614 |
| 7 | `hybrid-grep-4k-rtk-err-cat` (earlier 4k-threshold hybrid; replaced) | 0.552 | 0.597 | 0.571 | 0.573 |
| 8 | `headroom-LogCompressor` (Headroom v0.24.0 defaults; evaluated via the SDK) | 0.548 | 0.601 | 0.534 | 0.561 |
| 9 | `rtk-err-cat` | 0.455 | 0.488 | 0.467 | 0.470 |
| 10 | `raw` | 0.324 | 0.368 | 0.367 | 0.353 |
| 11 | `rtk-read` | 0.329 | 0.369 | 0.349 | 0.349 |
| 12 | `rtk-log` | 0.238 | 0.262 | 0.249 | 0.249 |
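
The bottom-4 stability claim and the Overall column can be checked directly from the per-family numbers in the table above. The sketch below is one way to do that. The helper names are hypothetical, and treating "case-count-weighted" as a mean weighted by 35 cases per family is an assumption about the aggregation, so it matches the published Overall values only up to rounding.

```python
# Per-family scores copied from the table: (Haiku 4.5, Sonnet 4.6, gpt-5-mini).
SCORES = {
    "hybrid-grep-120k-rtk-tail": (0.624, 0.679, 0.706),
    "hybrid-grep-120k-tail": (0.610, 0.730, 0.658),
    "llm-summary-v1-gpt-5-mini": (0.654, 0.686, 0.652),
    "grep": (0.578, 0.684, 0.655),
    "llm-summary-v1-haiku": (0.583, 0.704, 0.608),
    "tail-200": (0.595, 0.624, 0.623),
    "hybrid-grep-4k-rtk-err-cat": (0.552, 0.597, 0.571),
    "headroom-LogCompressor": (0.548, 0.601, 0.534),
    "rtk-err-cat": (0.455, 0.488, 0.467),
    "raw": (0.324, 0.368, 0.367),
    "rtk-read": (0.329, 0.369, 0.349),
    "rtk-log": (0.238, 0.262, 0.249),
}
N_FAMILIES = 3


def bottom_k(family_index: int, k: int) -> set[str]:
    """The k lowest-scoring methods for one model family."""
    ranked = sorted(SCORES, key=lambda method: SCORES[method][family_index])
    return set(ranked[:k])


def stable_bottom_k(k: int) -> set[str]:
    """Methods that are in the bottom k for every family."""
    return set.intersection(*(bottom_k(i, k) for i in range(N_FAMILIES)))


def case_weighted_mean(scores: tuple[float, ...], case_counts: tuple[int, ...]) -> float:
    """Mean of per-family scores, weighted by how many cases each covers."""
    return sum(s * n for s, n in zip(scores, case_counts)) / sum(case_counts)


print(sorted(stable_bottom_k(4)))
# ['raw', 'rtk-err-cat', 'rtk-log', 'rtk-read']

overall = case_weighted_mean(SCORES["hybrid-grep-120k-rtk-tail"], (35, 35, 35))
print(round(overall, 3))  # 0.67
```

With equal case counts the weighted mean reduces to a plain average; the weighting only matters if some family or split covers fewer cases.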

The legacy llm-summary-v1-mock stub (used as the LLM-summary representative through v1.1) is retained as an appendix entry on the leaderboard, not in the headline. The top-2 hybrids replaced an earlier 4k-threshold hybrid that was overfit during methodology development. See the technical report for the v1.2 paper, and reports/legacy/e10_v1_3_to_v2_transition_study.md for the original prototype-vs-formal corpus analysis.

Full leaderboard at https://logdx-bench.github.io/leaderboard.html.

Quick links


| Resource | Link |
|---|---|
| 🏠 Homepage | https://logdx-bench.github.io/ |
| 📊 Leaderboard | https://logdx-bench.github.io/leaderboard.html |
| 📄 arXiv preprint | https://arxiv.org/abs/2605.28876 |
| 📄 Full report | `reports/technical_report.md` |
| 🐍 Python SDK | `logdx_ci/`: `pip install logdx-ci`, evaluate your reducer in 5 min |
| 📦 Cases corpus mirror | https://huggingface.co/datasets/eyuansu71/logdx-ci |
| 📋 Release notes | latest: `RELEASE_NOTES_v1_2.md` · history: `RELEASE_NOTES.md` (v1.0), v1.1.1, v1.1.2 |
| 📑 Cite | `CITATION.cff` · BibTeX |

Browse the cases directly

Each case lives under cases/<split>/<case_id>/{raw.log, case.json, ground_truth.json, tags.json, privacy_audit.json}. Schema in the dataset card. The SDK auto-downloads them on first use — the only reason to clone the repo is if you're reproducing the leaderboard or contributing.
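
For anyone working from a clone, the layout above can be read with a few lines of standard-library Python. This is a minimal sketch, not part of the SDK: the load_case and list_case_ids helpers are hypothetical, and the code assumes only that ground_truth.json is valid JSON, without relying on any particular fields in it.

```python
import json
from pathlib import Path


def list_case_ids(cases_root: Path, split: str) -> list[str]:
    """Case directories under cases/<split>/ that contain a raw.log."""
    split_dir = cases_root / split
    return sorted(p.name for p in split_dir.iterdir() if (p / "raw.log").is_file())


def load_case(cases_root: Path, split: str, case_id: str) -> dict:
    """Read the raw log and ground truth for one case."""
    case_dir = cases_root / split / case_id
    # errors="replace" keeps going if a CI log contains invalid UTF-8 bytes.
    raw_log = (case_dir / "raw.log").read_text(encoding="utf-8", errors="replace")
    ground_truth = json.loads((case_dir / "ground_truth.json").read_text(encoding="utf-8"))
    return {"case_id": case_id, "raw_log": raw_log, "ground_truth": ground_truth}
```

Typical use from a checkout would be load_case(Path("cases"), "dev", case_id) for each id returned by list_case_ids(Path("cases"), "dev"); the dev split name is taken from the maintainer commands in this README.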

For benchmark maintainers (reproduce the leaderboard)

The numbers above were generated through the canonical pipeline in tools/, which writes auditable manifest artifacts to results/ (committed to git for long-term reproducibility). To regenerate them from a clean checkout:

git clone https://github.com/eyuansu62/LogDx.git && cd LogDx
python tools/run_baseline.py     --method tail-200 --split dev
python tools/run_diagnosis.py    --diagnoser real-debugger-v2 \
                                 --split dev --method tail-200
python tools/evaluate_diagnosis.py --split dev --diagnoser real-debugger-v2

The SDK (logdx_ci.evaluate(...)) re-uses the same scorer + diagnoser shims, so SDK scores are bit-for-bit comparable to leaderboard numbers — but the SDK doesn't write the committed-artifact audit trail. Use the SDK to try a new reducer; use tools/ to enshrine a method on the leaderboard.

Caveats

Current release: v1.2 (preprint). We'll add cases + model families before calling it stable.

35 cases (target: 50+ with broader ecosystem coverage)
Ground truth is AI-drafted + single-author verified (not independent human annotation)
Three model families tested (Haiku / Sonnet / gpt-5-mini); adding GPT-4o, Gemini, and Llama is the highest-leverage follow-up
20 documented historical exclusions in configs/historical_provider_error_exclusions.json appear as zero-score abstentions in the eval denominator
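
The last caveat affects how averages should be read: an excluded run still counts in the denominator, with a score of zero. A minimal sketch of that accounting, with a hypothetical helper name and made-up scores, looks like this:

```python
def mean_with_abstentions(scores: list[float], n_abstentions: int) -> float:
    """Average score where each abstention contributes 0 but still counts."""
    denominator = len(scores) + n_abstentions
    if denominator == 0:
        raise ValueError("no scored runs and no abstentions")
    return sum(scores) / denominator


# Two scored runs plus two abstentions (illustrative numbers only).
print(mean_with_abstentions([1.0, 0.5], 0))  # 0.75
print(mean_with_abstentions([1.0, 0.5], 2))  # 0.375
```

Under this convention a method's macro score drops for every excluded run, so scores are not directly comparable to an average taken over completed runs only.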

Full caveats in the technical report §5.

Cite

@article{qin2026logdx,
  title         = {{LogDx-CI}: Benchmarking Log Reduction Tools
                  for LLM Root-Cause Diagnosis},
  author        = {Qin, Bowen},
  year          = {2026},
  eprint        = {2605.28876},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2605.28876},
  note          = {v1.2 release; cases corpus at
                  \url{https://huggingface.co/datasets/eyuansu71/logdx-ci}},
}

License

Code (tools/, examples/, schemas/, configs/, prompts/, tests, scripts) — Apache-2.0 (LICENSE)
Data + reports + protocol locks (cases/, results/, reports/, protocols/, docs/) — CC-BY-4.0 (LICENSE-DATA)

Acknowledgements

LogDx-CI benchmarks third-party log-reduction tools alongside its own baselines. Specifically:

RTK (Rust Token Killer) by rtk-ai — the rtk-read, rtk-log, and rtk-err-cat baselines are three different invocations of the rtk CLI binary. The hybrid routers hybrid-grep-120k-rtk-tail and hybrid-grep-4k-rtk-err-cat use rtk's err-cat mode as an intermediate / fallback context provider. See docs/methods/rtk.md for setup + invocation details.

CI failure logs are sourced from publicly visible GitHub Actions runs. Diagnoses are produced by Claude (Anthropic) and gpt-5-mini (OpenAI).

Contributing

New context-provider methods, debugger families, and case contributions are welcome — see CONTRIBUTING.md for the dev environment, repo layout, validator scripts, and the "add a new method" checklist.

Contact

Bowen Qin · National University of Singapore · contact via GitHub Issues
