
OTA-1927: Eval cluster update prompts#2908

Open
fao89 wants to merge 2 commits into
openshift:mainfrom
fao89:OTA-1927

Conversation

@fao89

@fao89 fao89 commented Apr 29, 2026

Member

Add comprehensive MCP test scenarios to evaluation dataset for validating OpenShift cluster update workflow AI responses. These scenarios establish quality benchmarks for LLM outputs across different update phases.

Test Scenarios Added (conv_798-802):

  • Precheck: Pre-upgrade validation and readiness assessment. Comprehensive analysis of cluster health, available updates, and upgrade blockers before initiating updates.

  • Precheck-Specific: Targeted upgrade path validation. Validates specific version availability and upgrade feasibility for planned update targets.

  • No-Updates: Cluster health assessment at latest version. Health monitoring and operational status when no updates are available in the current channel.

  • Progress: Real-time upgrade progress monitoring. Tracks upgrade progress with component status, timeline analysis, and ETA calculations during active updates.

  • Troubleshoot: Upgrade failure diagnosis and remediation. Root cause analysis and conservative troubleshooting guidance for failed or stuck upgrade scenarios.

Each scenario includes:

  • Complete analysis prompts with constraints and requirements
  • Full ClusterVersion YAML data as attachments
  • Full ClusterOperator YAML data as attachments
  • Expected responses with Summary and TL;DR sections
  • Real cluster data from production-like scenarios

These scenarios mirror the CONSOLE-5118 OLS integration workflow phases and provide the evaluation baseline for cluster update AI assistance.

Co-Authored-By: Claude Sonnet 4.5 noreply@anthropic.com

Ref: openshift/console#16131

Summary by CodeRabbit

  • Documentation
    • Expanded evaluation setup and usage guidance, including datasets, “What’s Included” details for cluster-updates (tags and conversation ranges), and direct links to the evaluation tool.
  • Chores
    • Added a dedicated cluster-updates evaluation configuration with judge/metric settings and tuned output, CSV, and logging/telemetry behavior.
  • Tests / CI
    • Added make test-cluster-updates and a new end-to-end cluster-updates evaluation test, plus a CI script to run the suite and validate generated artifacts and error-free summaries.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 29, 2026


@fao89: This pull request references OTA-1927 which is a valid jira issue.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested review from blublinsky and raptorsun April 29, 2026 15:44
@openshift-ci

openshift-ci Bot commented Apr 29, 2026


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 16, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR adds end-to-end infrastructure for cluster-updates evaluation tests. A new YAML configuration file (eval/system_cluster_updates.yaml) defines the LightSpeed evaluation framework parameters: OpenAI judge LLM settings, API query configuration, turn- and conversation-level metrics, output/visualization options, and logging control. A pytest test harness (tests/e2e/evaluation/test_cluster_updates.py) bootstraps dependencies, discovers the OLS endpoint, runs the evaluation subprocess, and validates artifact output. A CI shell script (tests/scripts/test-cluster-updates.sh) orchestrates the full pipeline: installing dependencies and operator-sdk, deploying OLS, running the evaluation suite, and managing cleanup. The README documents usage commands, dataset details, test categories, and both system configuration presets. A Makefile target wires the test into the build system.

Changes

Cluster-Updates Evaluation Setup

Layer / File(s) Summary
Evaluation system configuration
eval/system_cluster_updates.yaml
Defines LightSpeed evaluation configuration with OpenAI judge LLM (gpt-4-turbo, temperature, token/timeout limits), query-style API targeting local HTTPS server with optional tool and system-prompt overrides, turn-level default correctness metric plus optional GEval criteria for Kubernetes condition interpretation and output format compliance (Summary/TL;DR sections), conversation-level optional DeepEval metrics (completeness, relevancy, knowledge retention disabled by default), CSV output columns and result directory settings, visualization figure sizing and enabled graph types, and environment/logging configuration to suppress telemetry and control per-package log levels.
README: setup and usage documentation
eval/README.md
Adds evaluation framework prerequisites (Python 3.11+) and setup link to Lightspeed evaluation tool; documents run commands for full, short, and cluster-updates evaluation variants with tag-based filtering example; expands "What's Included" section with explicit dataset file listings (short, full, cluster-updates), maps test-category tags to conversation ranges for cluster-updates, and provides detailed descriptions of both system.yaml and system_cluster_updates.yaml with their respective metrics and cluster-specific settings.
Pytest evaluation test harness
tests/e2e/evaluation/test_cluster_updates.py
Implements pytest module that ensures lightspeed-eval binary installation, discovers OLS base URL from pytest config or environment, extracts optional bearer token from pytest client fixture, loads system configuration and overrides API base URL, writes temporary config file, runs lightspeed-eval subprocess against fixed eval data YAML and output directory, and validates success by asserting subprocess exit status, CSV/JSON artifact presence, and zero error count in summary JSON.
CI shell script and orchestration
tests/scripts/test-cluster-updates.sh
Adds CI orchestration script with strict error handling that installs project/test dependencies, sources shared helper functions, detects host OS/arch and installs operator-sdk v1.36.1, reads OpenAI API key from environment, defines run_suites() function to deploy OLS and execute cluster_updates evaluation suite with OpenAI provider configuration, performs cleanup, manages artifact directories (creates temp directory with LOCAL_MODE=1 when outside Prow), and executes the full flow with cleanup-on-exit trap.
Makefile target and build integration
Makefile
Updates PHONY target list and adds test-cluster-updates make target that invokes pytest with lseval and evaluation extras against tests/e2e/evaluation, writes JUnit XML to ARTIFACT_DIR keyed by SUITE_ID, and sets eval output mode to cluster_updates with results directory in ARTIFACT_DIR.
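
The validation step that the test harness summary describes (subprocess exit status, CSV/JSON artifact presence, zero error count in the summary JSON) could look roughly like the sketch below. The function name `validate_eval_artifacts`, the `*_summary.json` file pattern, and the `error_count` key are assumptions made for illustration; the real file names and JSON schema written by lightspeed-eval are not shown on this page.

```python
import json
from pathlib import Path


def validate_eval_artifacts(returncode, result_dir):
    """Return a list of problems found; an empty list means the run looks good."""
    problems = []
    if returncode != 0:
        problems.append(f"evaluation exited with status {returncode}")

    out = Path(result_dir)
    csv_files = sorted(out.glob("*.csv"))
    # Hypothetical naming: a summary file ending in "_summary.json".
    summaries = sorted(out.glob("*_summary.json"))
    if not csv_files:
        problems.append("no CSV artifact found")
    if not summaries:
        problems.append("no summary JSON found")

    for summary in summaries:
        data = json.loads(summary.read_text(encoding="utf-8"))
        # Hypothetical key: the real schema may record errors differently.
        errors = data.get("error_count", 0)
        if errors:
            problems.append(f"{summary.name} reports {errors} error(s)")
    return problems
```

Returning a list of problems rather than asserting inside the helper lets a pytest test report every failed check at once, for example with `assert not validate_eval_artifacts(result.returncode, output_dir)`.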

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
No-Sensitive-Data-In-Logs | ❌ Error | The code logs subprocess output (stdout/stderr) that may contain sensitive API keys. In tests/e2e/evaluation/test_cluster_updates.py lines 100-104, the stdout and stderr from the lightspeed-eval su... | Filter sensitive data from subprocess output before printing. Either avoid printing stderr/stdout from subprocesses that receive API_KEY env vars, or redact sensitive patterns like "API_KEY=..." from output before logging.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The title refers to cluster update evaluation but the changeset primarily adds test infrastructure, configuration files, and documentation for cluster-updates evaluation, not just the prompts themselves. | Consider a more specific title like 'Add cluster-updates evaluation test infrastructure and data' or 'OTA-1927: Add cluster-updates evaluation tests and configuration' to accurately reflect the comprehensive nature of the changes.
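
One way to act on the No-Sensitive-Data-In-Logs resolution is to redact key-like assignments from subprocess output before printing it. The sketch below is only an illustration: the `redact` helper and its pattern (names containing `API_KEY`, `TOKEN`, or `SECRET`, followed by `=` or `:`) are assumptions, not code from this PR, and a real filter may need to cover more formats.

```python
import re

# Matches NAME=value or NAME: value where NAME contains API_KEY, TOKEN or SECRET.
_SECRET_PATTERN = re.compile(
    r"(?i)\b([A-Z0-9_]*(?:API_KEY|TOKEN|SECRET)[A-Z0-9_]*)(\s*[=:]\s*)(\S+)"
)


def redact(text):
    """Replace the value part of key-like assignments with a fixed mask."""
    return _SECRET_PATTERN.sub(r"\1\2***", text)
```

A test would then print `redact(result.stdout)` and `redact(result.stderr)` instead of the raw streams.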
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR contains no Ginkgo tests. The check for "Stable and Deterministic Test Names" applies to Ginkgo test patterns (It(), Describe(), etc.), but this PR only adds Python pytest tests and bash sc...
Test Structure And Quality ✅ Passed Custom check for Ginkgo test code quality is not applicable to this PR. The repository is Python-based using pytest for testing, not a Go project using Ginkgo. No Go or Ginkgo tests exist.
Microshift Test Compatibility ✅ Passed PR adds pytest and shell script tests, not Ginkgo e2e tests. The custom check applies only to Ginkgo tests (It(), Describe(), etc.), which are absent here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This PR does not add any Ginkgo e2e tests. All new tests are Python pytest tests (test_cluster_updates.py) and shell scripts (test-cluster-updates.sh), not Go/Ginkgo-based tests. The custom check i...
Topology-Aware Scheduling Compatibility ✅ Passed This PR adds test infrastructure and evaluation configuration only—no deployment manifests, operators, controllers, or scheduling constraints were introduced or modified.
Ote Binary Stdout Contract ✅ Passed OTE Binary Stdout Contract check is not applicable: PR adds Python pytest tests and shell CI scripts, not OTE (Go) binaries. Print statements are inside pytest test functions (allowed).
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests added in this PR. Changes include Python pytest tests, shell scripts, and config files without IPv4 assumptions or IPv6-incompatible networking patterns.
No-Weak-Crypto ✅ Passed No weak cryptographic algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons detected in PR-modified files.
Container-Privileges ✅ Passed No privileged container configurations found in any PR files. The modified files contain test infrastructure, documentation, and evaluation data—not Kubernetes manifests with container security con...

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@eval/README.md`:
- Around line 54-63: The cluster-updates example commands in the eval/README.md
reference system_cluster_updates.yaml which uses https://localhost:8080, but the
local setup starts OLS at http://localhost:8080, causing a TLS mismatch. Add a
clarifying note in the README near these example commands explaining that for
local runs, users need to either modify the api_base setting in
system_cluster_updates.yaml to use http instead of https, or provide
instructions pointing to a separate local cluster-updates configuration preset
that uses HTTP. This will prevent users from encountering immediate
connection/TLS failures when attempting to run these commands locally.
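
As a rough illustration of the http/https mismatch described above, the sketch below swaps only the URL scheme of an `api_base` value before a local run. The `with_scheme` helper and the nested `config` dictionary are hypothetical; the actual layout of system_cluster_updates.yaml is not shown here.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(api_base, scheme="http"):
    """Return api_base with its scheme replaced; host, port and path are kept."""
    parts = urlsplit(api_base)
    return urlunsplit(parts._replace(scheme=scheme))


# Hypothetical in-memory view of the relevant setting.
config = {"api": {"api_base": "https://localhost:8080"}}
config["api"]["api_base"] = with_scheme(config["api"]["api_base"])
```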

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8d9c1f51-f25b-40f6-b7f8-eba854c9da4a

📥 Commits

Reviewing files that changed from the base of the PR and between a8aa7a8 and 2064cd9.

📒 Files selected for processing (3)
  • eval/README.md
  • eval/eval_data_cluster_updates.yaml
  • eval/system_cluster_updates.yaml

Comment thread eval/README.md Outdated
@fao89 fao89 force-pushed the OTA-1927 branch 2 times, most recently from e1c10db to 47bc7ce Compare June 16, 2026 17:55
@openshift-ci openshift-ci Bot requested review from cambelem and sriroopar June 17, 2026 17:37
@fao89

fao89 commented Jun 17, 2026

Member Author

/cc @sriroopar @rioloc

@openshift-ci openshift-ci Bot requested a review from rioloc June 17, 2026 17:38
@sriroopar

Contributor
  1. Turn metrics need to be defined for every turn, as necessary.
  2. The provider name needs to be standardized to openai.
  3. https should be replaced with http.
  4. All conversations have a single tag, but the README suggests otherwise.

- Clear recommendation should be provided
- conversation_group_id: conv_800
tag: cluster-updates-scenarios
turns:

Contributor


Thank you very much for your PR, Fabricio :)

A major bug is that turn metrics are not set up for every turn, so the metrics we may want to analyze will not be captured. The rest looks okay; I dropped a couple of minor mismatches in a comment.
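
A small check along the following lines could flag turns that lack metrics before the dataset is run. It is a sketch under assumptions: the `turn_metrics` key name, the `query` field, and the `example:metric` value are placeholders, and the sample data is illustrative rather than taken from eval_data_cluster_updates.yaml.

```python
def turns_missing_metrics(conversations, key="turn_metrics"):
    """Return (conversation_group_id, turn_index) pairs for turns without metrics."""
    missing = []
    for conv in conversations:
        for index, turn in enumerate(conv.get("turns", [])):
            if not turn.get(key):
                missing.append((conv.get("conversation_group_id"), index))
    return missing


# Illustrative data only; field values are placeholders.
sample = [
    {
        "conversation_group_id": "conv_800",
        "tag": "cluster-updates-scenarios",
        "turns": [
            {"query": "first", "turn_metrics": ["example:metric"]},
            {"query": "second"},
        ],
    }
]
missing = turns_missing_metrics(sample)
```

Here `missing` lists the second turn of conv_800, which is the kind of gap the review comment describes.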

@fao89

fao89 commented Jun 18, 2026

Member Author

/cc @wking

@openshift-ci openshift-ci Bot requested a review from wking June 18, 2026 11:43
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/scripts/test-cluster-updates.sh`:
- Line 34: Separate the variable assignment from the export statement on the
line that sets OPENAI_API_KEY to avoid masking errors from the cat command.
First assign the output of cat "$OPENAI_PROVIDER_KEY_PATH" to a temporary
variable or directly capture it, then check that the command succeeded before
exporting OPENAI_API_KEY. This ensures that if the cat command fails due to a
missing file or permission issues, the error is immediately visible rather than
causing a cryptic authentication error later.
- Around line 24-25: The export statements on lines 24 and 25 combine variable
assignment with command substitution, which masks failures if the underlying
commands fail. Separate the command substitution from the export statement for
both ARCH and OS variables. First, assign the result of the command substitution
to the variable without exporting (e.g., ARCH=$(case $(uname -m) in ... esac)),
then add error checking to verify the command succeeded (e.g., using [ -z
"$ARCH" ] or checking the exit code with ||), and only then export the variable.
If the command fails, exit with an error message to prevent incorrect values
from being used when constructing OPERATOR_SDK_DL_URL on line 26.
- Around line 60-63: The export statement combined with the mktemp -d command
substitution masks failures. If mktemp -d fails, the export still succeeds with
an empty or invalid value. Separate the command substitution from the export by
first assigning the mktemp -d result to ARTIFACT_DIR variable, add error
checking to ensure mktemp succeeded before proceeding, and then export the
variable separately. This ensures that if mktemp -d fails (due to permission
issues or lack of disk space), the script properly detects and handles the error
instead of continuing with an invalid ARTIFACT_DIR path.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c5dea0a-d4ff-446e-98f7-eb2bcb4814a1

📥 Commits

Reviewing files that changed from the base of the PR and between 5cc6965 and ce2dccf.

📒 Files selected for processing (6)
  • Makefile
  • eval/README.md
  • eval/eval_data_cluster_updates.yaml
  • eval/system_cluster_updates.yaml
  • tests/e2e/evaluation/test_cluster_updates.py
  • tests/scripts/test-cluster-updates.sh
✅ Files skipped from review due to trivial changes (1)
  • eval/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • eval/system_cluster_updates.yaml

Comment thread tests/scripts/test-cluster-updates.sh Outdated
Comment thread tests/scripts/test-cluster-updates.sh Outdated
Comment thread tests/scripts/test-cluster-updates.sh
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 23, 2026
@fao89 fao89 force-pushed the OTA-1927 branch 6 times, most recently from db695e0 to d0cc961 Compare June 26, 2026 08:00
@fao89 fao89 force-pushed the OTA-1927 branch 2 times, most recently from 4bb0e70 to cc7c8aa Compare June 26, 2026 11:31
Fix 'set_session cannot be used inside a transaction' error that occurred
when storing multi-turn conversation history in PostgreSQL cache.

Problem:
- insert_or_append() and delete() methods set autocommit=False to start
  a transaction, then set it back to True in the finally block
- If an exception occurs, the connection may still be in a transaction
  when autocommit=True is called
- psycopg2 internally calls set_session() when changing autocommit, which
  fails if a transaction is active

Solution:
- Check connection transaction status before setting autocommit=True
- Rollback any active transaction before changing autocommit setting
- Ensures clean transition from transactional to autocommit mode

Impact:
- Multi-turn conversations now work correctly with PostgreSQL cache
- No functional change for single-turn conversations
- Evaluation tests can now test context retention and progressive refinement

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
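
The autocommit fix described in the first commit message above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `restore_autocommit` is a hypothetical helper name, and `FakeConnection` is a stand-in that only imitates the failure mode so the sketch runs without a database. `get_transaction_status()`, `rollback()`, and the `autocommit` attribute are the psycopg2 connection members the sketch relies on.

```python
try:
    from psycopg2.extensions import TRANSACTION_STATUS_IDLE
except ImportError:
    # Fallback so the sketch runs without psycopg2; 0 is psycopg2's "idle" value.
    TRANSACTION_STATUS_IDLE = 0


def restore_autocommit(connection):
    """Roll back any open transaction, then re-enable autocommit."""
    if connection.get_transaction_status() != TRANSACTION_STATUS_IDLE:
        connection.rollback()
    connection.autocommit = True


class FakeConnection:
    """Stand-in that raises when autocommit is changed inside a transaction."""

    def __init__(self):
        self._in_transaction = True
        self._autocommit = False

    def get_transaction_status(self):
        # 2 corresponds to "in transaction" in psycopg2's status constants.
        return 2 if self._in_transaction else TRANSACTION_STATUS_IDLE

    def rollback(self):
        self._in_transaction = False

    @property
    def autocommit(self):
        return self._autocommit

    @autocommit.setter
    def autocommit(self, value):
        if self._in_transaction:
            raise RuntimeError("set_session cannot be used inside a transaction")
        self._autocommit = value
```

In the cache code, a helper like this would be called from the `finally` block of `insert_or_append()` and `delete()` in place of a bare `autocommit = True` assignment.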
@fao89

fao89 commented Jun 26, 2026

Member Author

/retest-required

Add comprehensive MCP test scenarios to evaluation dataset for validating
OpenShift cluster update workflow AI responses. These scenarios establish
quality benchmarks for LLM outputs across different update phases.

Test Scenarios Added (conv_798-802):
- Precheck: Pre-upgrade validation and readiness assessment
  Comprehensive analysis of cluster health, available updates, and
  upgrade blockers before initiating updates

- Precheck-Specific: Targeted upgrade path validation
  Validates specific version availability and upgrade feasibility
  for planned update targets

- No-Updates: Cluster health assessment at latest version
  Health monitoring and operational status when no updates are
  available in current channel

- Progress: Real-time upgrade progress monitoring
  Tracks upgrade progress with component status, timeline analysis,
  and ETA calculations during active updates

- Troubleshoot: Upgrade failure diagnosis and remediation
  Root cause analysis and conservative troubleshooting guidance
  for failed or stuck upgrade scenarios

Each scenario includes:
- Complete analysis prompts with constraints and requirements
- Full ClusterVersion YAML data as attachments
- Full ClusterOperator YAML data as attachments
- Expected responses with Summary and TL;DR sections
- Real cluster data from production-like scenarios

These scenarios mirror the CONSOLE-5118 OLS integration workflow phases
and provide the evaluation baseline for cluster update AI assistance.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Fabricio Aguiar <fabricio.aguiar@gmail.com>

rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@openshift-ci

openshift-ci Bot commented Jun 26, 2026


@fao89: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants