LogDx-CI
A benchmark for CI log reduction tools (RTK, grep, tail, hybrid routers, LLM-summary) — do they preserve enough evidence for LLM root-cause diagnosis?
Current release: v1.2 (adds cross-family LLM-summary; new agent-loop #1) · Leaderboard · Citation · Technical report
v1.2 highlight: llm-summary-v1-gpt-5-mini (a real OpenAI gpt-5-mini map-reduce summarizer) becomes the new agent-loop #1 at 0.749 with only 0.37 tool calls/case, the lowest of any method. The cross-pair (gpt-5-mini summarizer → Haiku debugger) beats the self-pair (Haiku → Haiku) by +0.071, falsifying the self-call-bias hypothesis raised by v1.1’s reviewer. See v1.2 release notes and the v1.2 promotion section on the leaderboard.
v1.1 highlight (still load-bearing): in multi-turn agent usage, the choice of context method matters far less for quality: the score range collapses 7× (0.42 → 0.06) as the agent rescues weak contexts via tool calls. See the agent-loop leaderboard.
Recent finding (2026-05-21, from the 420-row agent-loop trajectory dataset): input tokens dominate output by 40× (97.6% vs 2.4%) in agent-loop diagnosis, and agents barely use read_file (2.2% of tool calls in CI diagnosis; the rest is grep / tail / view_log_lines). Worst-case agent runs cost 4-5× the median, all on v2/stress huge logs. Implications for reduction-tool design are in docs/analysis/agent-trajectory-token-anatomy.md.
What it measures
LogDx-CI compares 11 context providers — raw, tail, grep,
three RTK modes (rtk-read,
rtk-log, rtk-err-cat), two real LLM summarizers
(llm-summary-v1-haiku Anthropic + llm-summary-v1-gpt-5-mini
OpenAI), and three hybrid routers — by handing the same CI failure
log to three debugger families (Claude Haiku 4.5, Claude Sonnet
4.6, OpenAI gpt-5-mini) and scoring the resulting root-cause
diagnoses against AI-drafted + author-verified ground truth.
It optimizes for method ranking stability — the question is not “which LLM is smartest” but “which log reducer gives an LLM the best chance of finding the true root cause within a fixed token budget, ACROSS model families.”
Headline finding
Across 35 real CI failure cases and 3 model families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini), the per-family top-3 sets agree on
{hybrid-grep-120k-rtk-tail, hybrid-grep-120k-tail}. The bottom-4 set is also stable across all three families.
Case-count-weighted macro diagnosis_score_v1_1 aggregated across
the 35-case corpus:
| Rank | Method | Haiku 4.5 | Sonnet 4.6 | gpt-5-mini | Overall | conf_err (↓) | tokens per case (↓) |
|---|---|---|---|---|---|---|---|
| 1 | hybrid-grep-120k-rtk-tail | 0.624 | 0.679 | 0.706 | 0.670 | 0.000 | 19,844 |
| 2 | hybrid-grep-120k-tail | 0.610 | 0.730 | 0.658 | 0.666 | 0.010 | 19,753 |
| 3 | llm-summary-v1-gpt-5-mini (new in v1.2; agent-loop #1 at 0.749) | 0.654 | 0.686 | 0.652 | 0.664 | 0.010 | 537,638 |
| 4 | grep | 0.578 | 0.684 | 0.655 | 0.639 | 0.000 | 88,355 |
| 5 | llm-summary-v1-haiku (promoted to headline in v1.1) | 0.583 | 0.704 | 0.608 | 0.632 | 0.029 | 1,681,520 |
| 6 | tail-200 | 0.595 | 0.624 | 0.623 | 0.614 | 0.019 | 6,108 |
| 7 | hybrid-grep-4k-rtk-err-cat (replaced; see report) | 0.552 | 0.597 | 0.571 | 0.573 | 0.029 | 19,892 |
| 8 | rtk-err-cat | 0.455 | 0.488 | 0.467 | 0.470 | 0.029 | 19,850 |
| 9 | raw | 0.324 | 0.368 | 0.367 | 0.353 | 0.000 | 275,248 |
| 10 | rtk-read | 0.329 | 0.369 | 0.349 | 0.349 | 0.010 | 274,289 |
| 11 | rtk-log | 0.238 | 0.262 | 0.249 | 0.249 | 0.133 | 810 |
Footnote on llm-summary-v1-haiku: three of the 35 cases used
chunk_lines=100 instead of the default 500 because they contained
500-line windows exceeding Haiku’s effective input window after
Claude-Code session overhead. Same map-reduce algorithm, same model,
same temperature — only map-stage granularity differs; recorded in
per-case metadata.chunk_lines. The legacy llm-summary-v1-mock
stub (used as the LLM-summary representative through v1.1) has been
moved to the appendix
on the leaderboard page.
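The chunk_lines setting in the footnote controls how finely the log is split before the map stage. A minimal sketch of that map-reduce shape follows, assuming a hypothetical `summarize` callable in place of the real model call. The function names, the single-pass reduce step, and the keyword-filter stub are all illustrative assumptions, not the repository's implementation.

```python
from typing import Callable, List


def chunk_log(lines: List[str], chunk_lines: int = 500) -> List[List[str]]:
    """Split a log into consecutive windows of at most chunk_lines lines."""
    if chunk_lines <= 0:
        raise ValueError("chunk_lines must be positive")
    return [lines[i:i + chunk_lines] for i in range(0, len(lines), chunk_lines)]


def map_reduce_summary(
    lines: List[str],
    summarize: Callable[[str], str],
    chunk_lines: int = 500,
) -> str:
    """Map: summarize each window. Reduce: summarize the joined partials."""
    partials = [summarize("\n".join(chunk)) for chunk in chunk_log(lines, chunk_lines)]
    return summarize("\n".join(partials))


def stub_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep lines mentioning 'error'."""
    kept = [ln for ln in text.splitlines() if "error" in ln.lower()]
    return "\n".join(kept)


# A 1,200-line log with one failure line, summarized at both granularities.
log_lines = [f"line {i}" for i in range(1200)]
log_lines[777] = "ERROR: boom"
summary_default = map_reduce_summary(log_lines, stub_summarize, chunk_lines=500)
summary_fine = map_reduce_summary(log_lines, stub_summarize, chunk_lines=100)
```

Dropping chunk_lines from 500 to 100 only changes how many map calls are made (3 versus 12 windows here); the reduce step and the summarizer are untouched, which matches the footnote's point that only map-stage granularity differs.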
Three layers of finding:
- Quality: the top-2 are 120k-threshold hybrid routers, stable across all 3 model families. The real Haiku summarizer (llm-summary-v1-haiku, row 5) lands fifth, a +0.30 jump over the legacy mock stub that previously represented the LLM-summary class.
- Safety (conf_err): the top-3 methods produce ~zero confidently-wrong diagnoses. rtk-log and the legacy llm-summary-v1-mock mislead a confident LLM on ~13% of cases, the failure mode discussed in rtk-ai/rtk#1599.
- Cost (tokens): the top-2 hybrids dominate grep, with a same-ballpark score at 4.5× fewer tokens. llm-summary-v1-haiku is the most expensive method on the board (1.68M tokens/case end-to-end; the real summarizer’s Claude-Code-nested cached-prefix overhead is ~4× higher than the mock had estimated), so it remains a quality-over-cost choice. Full Pareto frontier on the leaderboard page.
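The cost claim can be checked directly from the headline table. The sketch below computes a two-axis Pareto frontier (higher Overall score, fewer tokens per case) over the published numbers. Restricting the frontier to these two axes is an assumption made for illustration; the leaderboard page may use different axes or per-split data.

```python
from typing import Dict, List, Tuple

# (Overall score, tokens per case), copied from the headline table.
METHODS: Dict[str, Tuple[float, int]] = {
    "hybrid-grep-120k-rtk-tail": (0.670, 19_844),
    "hybrid-grep-120k-tail": (0.666, 19_753),
    "llm-summary-v1-gpt-5-mini": (0.664, 537_638),
    "grep": (0.639, 88_355),
    "llm-summary-v1-haiku": (0.632, 1_681_520),
    "tail-200": (0.614, 6_108),
    "hybrid-grep-4k-rtk-err-cat": (0.573, 19_892),
    "rtk-err-cat": (0.470, 19_850),
    "raw": (0.353, 275_248),
    "rtk-read": (0.349, 274_289),
    "rtk-log": (0.249, 810),
}


def pareto_frontier(methods: Dict[str, Tuple[float, int]]) -> List[str]:
    """Return methods that no other method beats on both score and tokens."""
    frontier = []
    for name, (score, tokens) in methods.items():
        dominated = any(
            s >= score and t <= tokens and (s > score or t < tokens)
            for other, (s, t) in methods.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    # Order the frontier from cheapest to most expensive.
    return sorted(frontier, key=lambda n: methods[n][1])


frontier = pareto_frontier(METHODS)
grep_token_ratio = METHODS["grep"][1] / METHODS["hybrid-grep-120k-rtk-tail"][1]
```

On these two axes the non-dominated set is rtk-log, tail-200, and the two 120k hybrids; grep is dominated by both hybrids, at roughly 4.45× their token cost.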
The top-2 hybrids replaced an earlier 4k-threshold hybrid that was overfit during methodology development (see the technical report §3 for the prototype-vs-formal corpus analysis).
Full per-split + per-debugger breakdown → leaderboard.
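As a quick sanity check on the headline table, the Overall column appears consistent with an unweighted mean of the three per-debugger columns, to within rounding of the published three-decimal values. This is an observation from the published numbers only, not a statement of how the evaluator aggregates internally.

```python
from typing import Dict, Tuple

# (Haiku 4.5, Sonnet 4.6, gpt-5-mini, Overall), copied from the headline table.
ROWS: Dict[str, Tuple[float, float, float, float]] = {
    "hybrid-grep-120k-rtk-tail": (0.624, 0.679, 0.706, 0.670),
    "hybrid-grep-120k-tail": (0.610, 0.730, 0.658, 0.666),
    "llm-summary-v1-gpt-5-mini": (0.654, 0.686, 0.652, 0.664),
    "grep": (0.578, 0.684, 0.655, 0.639),
    "llm-summary-v1-haiku": (0.583, 0.704, 0.608, 0.632),
    "tail-200": (0.595, 0.624, 0.623, 0.614),
    "hybrid-grep-4k-rtk-err-cat": (0.552, 0.597, 0.571, 0.573),
    "rtk-err-cat": (0.455, 0.488, 0.467, 0.470),
    "raw": (0.324, 0.368, 0.367, 0.353),
    "rtk-read": (0.329, 0.369, 0.349, 0.349),
    "rtk-log": (0.238, 0.262, 0.249, 0.249),
}


def family_mean(haiku: float, sonnet: float, gpt: float) -> float:
    """Unweighted mean across the three debugger families."""
    return (haiku + sonnet + gpt) / 3


# Largest gap between the recomputed mean and the published Overall value.
max_gap = max(abs(family_mean(*row[:3]) - row[3]) for row in ROWS.values())
```

Every row agrees to better than 0.001, which is the level of disagreement expected when the inputs themselves are already rounded to three decimals.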
Corpus
35 real GitHub Actions failure cases across dev (8), holdout
(15), and stress (12) splits. Coverage:
- 8 failure categories: test_assertion, compile_error, type_error, lint_failure, dependency_install, docker_build, timeout_or_oom, multi_failure
- 7+ ecosystems: pytest, cargo, go test, Maven, pnpm + jest, docker buildx, helm/k8s, terraform, gradle, biome, mypy, tsc, etc.
Quick start
git clone https://github.com/eyuansu62/LogDx.git
cd LogDx
# (optional) mirror the cases corpus from HuggingFace
hf download --repo-type dataset \
eyuansu71/logdx-ci --local-dir cases-from-hf
# 165-test suite
python3 tools/tests/test_diagnosis_cache_key.py
python3 tools/tests/test_hybrid_router.py
# 3 release gates
python3 tools/validate_committed_diagnosis_provider_errors.py
python3 tools/validate_eval_manifest_consistency.py
python3 tools/validate_diagnosis_vs_context_consistency.py
# Validate the canonical protocol lock
python3 tools/validate_protocol_lock.py \
--protocol protocols/logdx-ci-v2-partial-2026-06-22.lock.json
Reproducibility infrastructure
Every release carries:
- Protocol lock (protocols/*.lock.json): SHA-pins 10 schemas + 3 prompts + 4 evaluators + 10 baselines + 35 case files at the release commit
- 3 release gates that fail CI when any committed artifact drifts:
  - validate_committed_diagnosis_provider_errors.py: no non-allowlisted provider_error rows in real-debugger-* manifests
  - validate_eval_manifest_consistency.py: eval files match manifests, with strict zero-score verification for excluded rows
  - validate_diagnosis_vs_context_consistency.py: diagnosis case sets ⊆ source context manifest, with an explicit historical-exclusion list for transparent gaps
- 165-test suite covering unit, integration, and end-to-end paths
- Cache identity validation: metadata.diagnoser_config_sha256 and metadata.shim_sha256 on every fresh row; the runner rejects stale cache hits on config/shim edits
- Secret redaction: URL / bearer / API-key / long-opaque-token / hostname redactors + hash-only summaries for model-controlled exception text
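The cache identity check can be pictured as follows. This is a minimal sketch: the two metadata field names come from the list above, while the hashing scheme (canonical JSON for the config, raw bytes for the shim) and the helper names `sha256_of_json` and `is_cache_hit_valid` are assumptions for illustration rather than the runner's actual code.

```python
import hashlib
import json
from typing import Any, Dict


def sha256_of_json(obj: Any) -> str:
    """Hash a JSON-serialisable object using a canonical key order."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_cache_hit_valid(
    row: Dict[str, Any],
    diagnoser_config: Dict[str, Any],
    shim_source: bytes,
) -> bool:
    """Accept a cached row only if both recorded hashes match the current inputs."""
    meta = row.get("metadata", {})
    return (
        meta.get("diagnoser_config_sha256") == sha256_of_json(diagnoser_config)
        and meta.get("shim_sha256") == hashlib.sha256(shim_source).hexdigest()
    )


# Hypothetical config and shim contents, used only to exercise the check.
config = {"model": "example-model", "temperature": 0}
shim = b"#!/bin/sh\nexec debugger \"$@\"\n"
cached_row = {
    "metadata": {
        "diagnoser_config_sha256": sha256_of_json(config),
        "shim_sha256": hashlib.sha256(shim).hexdigest(),
    }
}
```

Under this scheme, editing either the config or the shim changes its hash, so a row written before the edit no longer matches and would be recomputed instead of reused.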
Caveats
This is a v1.0 preprint release.
- 35 cases. Per-case variance can shift macro means by ±0.05 with future corpus expansion. The direction of the top-3 ∩ finding is robust; absolute magnitudes are preliminary.
- Ground truth is AI-drafted (Claude Opus 4.7) + single-author verified by Bowen Qin (NUS). Not independent human annotation.
- Three model families tested. Adding GPT-4o / Gemini / Llama is the most-leveraged follow-up.
- No independent human review of v1.0 diagnoses (an earlier 16-case prototype subset had E2/E2b model-as-judge + E9 AI-assisted human review; the full 35-case set has not been re-scored).
- 20 historical exclusions documented in configs/historical_provider_error_exclusions.json; the eval injects zero-score abstentions for those tuples so the denominator stays correct.
See the full §5 caveats for the complete list.
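To see why zero-score abstentions keep the denominator correct, consider the sketch below. The `(case_id, method)` tuple shape, the function name, and the example scores are hypothetical; only the idea of scoring excluded tuples as zero rather than dropping them comes from the caveat above.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # hypothetical tuple shape: (case_id, method)


def macro_with_abstentions(
    scores: Dict[Key, float],
    expected: Iterable[Key],
    excluded: Iterable[Key],
) -> float:
    """Mean over all expected tuples, counting excluded ones as 0.0."""
    expected_keys = list(expected)
    if not expected_keys:
        raise ValueError("expected must not be empty")
    excluded_keys = set(excluded)
    total = 0.0
    for key in expected_keys:
        if key in scores:
            total += scores[key]
        elif key not in excluded_keys:
            # A missing row that is not on the exclusion list is a real gap.
            raise KeyError(f"no score and no documented exclusion for {key}")
        # Excluded tuples contribute 0.0 but still count in the denominator.
    return total / len(expected_keys)


# Made-up example: four expected tuples, one of them historically excluded.
expected = [("c1", "grep"), ("c2", "grep"), ("c3", "grep"), ("c4", "grep")]
scores = {("c1", "grep"): 0.8, ("c2", "grep"): 0.6, ("c3", "grep"): 1.0}
with_abstention = macro_with_abstentions(scores, expected, [("c4", "grep")])
naive_mean = sum(scores.values()) / len(scores)
```

In this example the naive mean over scored rows is 0.8, while the abstention-aware mean is 0.6; dropping the excluded tuple would silently inflate the method's score.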
Roadmap
- v1.1: Corpus expansion (target 50+); fill remaining stress gaps (huge log + non-pytest); spot-checked human review of v1.0 diagnoses
- v2: Train/holdout split decoupling, GPT-4o + Gemini family additions, matrix_or_monorepo_failure as a first-class canonical category, optional Gradio leaderboard space on HF
Citation
@misc{qin2026logdx,
title = {{LogDx-CI}: Benchmarking CI Log Reduction Tools
for LLM Root-Cause Diagnosis},
author = {Qin, Bowen},
year = {2026},
howpublished = {\url{https://github.com/eyuansu62/LogDx}},
note = {v1.0 release; cases corpus at
\url{https://huggingface.co/datasets/eyuansu71/logdx-ci}},
}
Plain text → cite.
Acknowledgements
LogDx-CI benchmarks third-party log-reduction tools alongside its own baselines. Specifically:
- RTK (Rust Token Killer) by rtk-ai: the rtk-read, rtk-log, and rtk-err-cat baselines are three different invocations of the rtk CLI binary. The hybrid routers hybrid-grep-120k-rtk-tail and hybrid-grep-4k-rtk-err-cat use rtk’s err-cat mode as an intermediate / fallback context provider. See docs/methods/rtk.md for setup + invocation details.
CI failure logs are sourced from publicly visible GitHub Actions runs. Diagnoses are produced by Claude (Anthropic) and gpt-5-mini (OpenAI).
Contact
- Author: Bowen Qin (National University of Singapore)
- Issues: https://github.com/eyuansu62/LogDx/issues
Code under Apache-2.0; data, reports, and protocol locks under CC-BY-4.0. © 2026 Bowen Qin.