feat: in-process LLM credential hot-reload #2955

Open
lalan7 wants to merge 1 commit into openshift:main from lalan7:feat/credential-hot-reload

Conversation

@lalan7 lalan7 commented Jun 16, 2026

Summary

When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator triggers a rolling restart of lightspeed-app-server. For customers using short-lived LLM tokens (1-hour validity), this causes hourly pod restarts and temporary capacity loss — including reloading RAG vector indexes and embedding models.

This PR makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, so rotated secrets are picked up without a pod restart. This follows the Prometheus credentials_file pattern: re-read on every request, no caching, no edge cases with K8s symlink swaps.

Changes

  • ols/utils/checks.py — New read_secret_from_path() helper for repeated file reads
  • ols/app/models/config.py — Store _credentials_path in ProviderConfig.__init__(), add get_credentials() method that re-reads the file when a path is configured
  • All LLM providers (openai, azure_openai, watsonx, rhoai_vllm, rhelai_vllm, google_vertex) — Use get_credentials() instead of .credentials
  • Unit tests — Tests for the helper, ProviderConfig.get_credentials() (cached, re-read, fallback, directory), and OpenAI provider credential rotation
  • docs/credential-hot-reload.md — Design rationale and test plan

Why this approach

| Concern | Answer |
| --- | --- |
| K8s symlink safety | kubelet uses atomic symlink swaps (..data); os.stat() mtime can be unreliable after swaps. Re-reading avoids the edge case. |
| Precedent | Prometheus credentials_file, Envoy, controller-runtime CertWatcher, kube-proxy all re-read files. |
| Dependencies | None: no watchdog, fsnotify, or background threads. |
| Performance | One open()+read() of a <100 byte file per LLM request is negligible vs LLM call latency (seconds). |
| Thread safety | Each request reads independently; no shared mutable state. |
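
As a rough illustration of the re-read-per-request idea in the table above, the sketch below simulates a kubelet-style atomic symlink swap in a temporary directory and reads the secret before and after it. The helper body is an approximation based on this description, not the PR's actual code; the `apitoken` default filename, the `v1`/`v2` directory names, and `demo_symlink_swap` are assumptions made for the example.

```python
import os
import tempfile
from typing import Optional, Tuple


def read_secret_from_path(path: str, default_filename: str = "apitoken") -> Optional[str]:
    """Read a secret from a file, or from default_filename inside a directory.

    Approximation of the helper described in the PR; returns None on OSError.
    """
    try:
        if os.path.isdir(path):
            path = os.path.join(path, default_filename)
        with open(path, encoding="utf-8") as f:
            return f.read().rstrip()
    except OSError:
        return None


def _write_version(root: str, name: str, value: str) -> str:
    """Create a versioned directory holding one secret file (assumed layout)."""
    version_dir = os.path.join(root, name)
    os.mkdir(version_dir)
    with open(os.path.join(version_dir, "apitoken"), "w", encoding="utf-8") as f:
        f.write(value + "\n")
    return version_dir


def demo_symlink_swap() -> Tuple[Optional[str], Optional[str]]:
    """Read a secret before and after an atomic ..data symlink swap."""
    with tempfile.TemporaryDirectory() as root:
        v1 = _write_version(root, "v1", "key-v1")
        v2 = _write_version(root, "v2", "key-v2")

        # Mounted-Secret-style layout: apitoken -> ..data/apitoken, ..data -> v1
        data_link = os.path.join(root, "..data")
        os.symlink(v1, data_link)
        os.symlink(os.path.join("..data", "apitoken"), os.path.join(root, "apitoken"))
        before = read_secret_from_path(root)

        # Rotation: build a new symlink, then rename it over ..data in one step.
        tmp_link = os.path.join(root, "..data_tmp")
        os.symlink(v2, tmp_link)
        os.replace(tmp_link, data_link)
        after = read_secret_from_path(root)
        return before, after
```

Because every call opens the path again, the second read follows the new `..data` target without any file watcher or mtime comparison. The sketch relies on POSIX symlink and rename semantics.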

Scope

Hot-reload applies to top-level credentials_path, which is the path used by operator-managed deployments (credentialsSecretRef). Provider-specific config blocks (openai_config.credentials_path, azure_openai_config.credentials_path, etc.) continue to cache credentials at startup — this matches the existing operator behavior, which does not generate these blocks.

Azure AD credentials (tenant_id, client_id, client_secret) are also read once at startup. These can be addressed in a follow-up if needed.

Test plan

  • read_secret_from_path() — file, directory, nonexistent, rotation
  • ProviderConfig.get_credentials() — cached fallback, disk re-read, read failure fallback, directory path
  • OpenAI provider — verifies rotated credential is picked up across two load_llm() calls
  • Full suite: 1067 tests pass, ruff check clean, mypy clean

Ref

Summary by CodeRabbit

  • New Features

    • Credentials are now re-read from disk each time LLM request parameters are built, enabling rotated secrets to take effect without service restarts.
    • Supported LLM providers automatically use the latest credentials from the configured path.
  • Bug Fixes

    • Prevents downtime and capacity impacts caused by credential-only restarts.
    • If a credential read fails, the system falls back to the most recently available valid credential.
  • Documentation

    • Added in-depth guidance to validate credential hot-reload behavior and rollout checks.
  • Tests

    • Added unit tests covering credential rotation and file/path read behaviors.

coderabbitai Bot commented Jun 16, 2026

Walkthrough

Adds in-process LLM credential hot-reload. Credential values are now re-read from disk through ProviderConfig.get_credentials(), provider call sites use the refreshed values, and tests plus documentation cover the new behavior.

Changes

Credential Hot-Reload

| Layer / File(s) | Summary |
| --- | --- |
| Secret file read helper: ols/utils/checks.py, tests/unit/utils/test_checks.py | read_secret_from_path(path, default_filename) reads a credential from a file or directory path, strips trailing whitespace, returns None on OSError, and is covered by tests for file, directory, missing, and rotation cases. |
| ProviderConfig credential lookup: ols/app/models/config.py, tests/unit/app/models/test_config.py | ProviderConfig stores the configured credential path and adds get_credentials(), which re-reads from disk when a path is configured and otherwise returns the cached credential value. Unit tests cover the cached, reread, fallback, and directory cases. |
| Provider credential sourcing: ols/src/llms/providers/openai.py, azure_openai.py, bedrock.py, rhoai_vllm.py, rhelai_vllm.py, watsonx.py, tests/unit/llms/providers/test_openai.py, tests/unit/llms/providers/test_bedrock.py | Provider implementations now call provider_config.get_credentials() instead of reading provider_config.credentials directly, and the provider tests verify rotated file contents are picked up for OpenAI and Bedrock. |
| Google Vertex credential handling: ols/src/llms/providers/google_vertex.py | Google Vertex providers now source credentials through get_credentials(), pass them into load_vertex_credentials(...), and raise InvalidConfigurationError when no credential is available. |
| Documentation: docs/credential-hot-reload.md | The new document describes the hot-reload behavior, the per-request re-read design, validation steps, and checklist items. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • openshift/lightspeed-service#2959: Updates the Bedrock provider’s credential source to use ProviderConfig.get_credentials(), which is part of the same provider-side hot-reload path.
  • The operator-side PR referenced in the new documentation: it is directly coupled to the documented rollout order and credentialsSecretRef update flow.

Suggested reviewers

  • onmete
  • tisnik
  • bparees
  • xrajesh
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: enabling in-process LLM credential hot-reload. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@openshift-ci openshift-ci Bot requested review from raptorsun and tisnik June 16, 2026 19:15
openshift-ci Bot commented Jun 16, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tisnik for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unit/app/models/test_config.py (1)

4357-4443: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add type hints to the new test signatures.

The new tests in this block omit parameter/return annotations, which conflicts with strict typing
requirements.

Suggested fix
+from pathlib import Path
@@
-def test_provider_config_get_credentials_returns_cached_when_no_path():
+def test_provider_config_get_credentials_returns_cached_when_no_path() -> None:
@@
-def test_provider_config_get_credentials_rereads_from_disk(tmp_path):
+def test_provider_config_get_credentials_rereads_from_disk(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path):
+def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path: Path) -> None:
@@
-def test_provider_config_get_credentials_with_directory(tmp_path):
+def test_provider_config_get_credentials_with_directory(tmp_path: Path) -> None:

As per coding guidelines, "All function signatures must include type hints".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/app/models/test_config.py` around lines 4357 - 4443, Add type
hints to all four test function signatures in this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/credential-hot-reload.md`:
- Around line 52-57: The fenced code block showing the Request flow diagram is
missing a language identifier, which triggers markdownlint rule MD040. Add the
language identifier "text" to the opening fence of the code block (the three
backticks before the Request line) so it reads ```text instead of just ```. This
will satisfy the markdown linting requirement while properly labeling the
content type.

In `@ols/app/models/config.py`:
- Around line 612-618: The get_credentials() method in ols/app/models/config.py
returns the fresh credential value without persisting it to the cached
self.credentials instance variable. When fresh credentials are successfully read
from the path (when fresh is not None), update self.credentials with this fresh
value before returning it. This ensures that subsequent calls to
get_credentials() will return the last known-good rotated key rather than
falling back to a stale startup value if a later read fails.

In `@ols/src/llms/providers/google_vertex.py`:
- Around line 35-38: Replace the `ValueError` exception with
`InvalidConfigurationError` in the credential validation logic of the Google
Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at
the top of the file if not already present. Then update the exception raised
when `creds_value is None` (at the location around lines 35-38) and apply the
same change to the second occurrence mentioned at lines 74-79 where similar
credential validation occurs, ensuring both locations use the domain-specific
exception type for consistent configuration error handling across the provider.

In `@tests/unit/llms/providers/test_openai.py`:
- Around line 275-302: The function `test_openai_picks_up_rotated_credentials`
is missing type hints on its parameters and return type, violating the
repository's typed-signature requirement. Add type annotations to both the
`tmp_path` parameter (use the appropriate Path type from pathlib) and the
`fake_certifi_store` parameter (determine the correct fixture type), and add a
return type hint of None for the test function.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64086aa0-5d90-4440-9524-4d7a4b23a9e5

📥 Commits

Reviewing files that changed from the base of the PR and between a8aa7a8 and bc5ee67.

📒 Files selected for processing (12)
  • docs/credential-hot-reload.md
  • ols/app/models/config.py
  • ols/src/llms/providers/azure_openai.py
  • ols/src/llms/providers/google_vertex.py
  • ols/src/llms/providers/openai.py
  • ols/src/llms/providers/rhelai_vllm.py
  • ols/src/llms/providers/rhoai_vllm.py
  • ols/src/llms/providers/watsonx.py
  • ols/utils/checks.py
  • tests/unit/app/models/test_config.py
  • tests/unit/llms/providers/test_openai.py
  • tests/unit/utils/test_checks.py

Comment thread docs/credential-hot-reload.md Outdated
Comment thread ols/app/models/config.py
Comment thread ols/src/llms/providers/google_vertex.py
Comment thread tests/unit/llms/providers/test_openai.py Outdated
lalan7 force-pushed the feat/credential-hot-reload branch from bc5ee67 to 9950ad6 on June 26, 2026 15:13
@joshuawilson

Contributor

Adversarial Review

Nice PR — the motivation is solid and the Prometheus credentials_file pattern is the right call. A few gaps surfaced during review:

1. Bedrock provider not updated (ols/src/llms/providers/bedrock.py:37)

Bedrock still uses self.credentials = self.provider_config.credentials (direct attribute access). All other 6 providers were updated. Credential rotation will silently fail for Bedrock deployments.

2. Provider-specific api_key silently overrides hot-reloaded credentials (ols/src/llms/providers/openai.py:33)

In default_params, line 28 calls get_credentials(), but lines 30–34 check openai_config.api_key and override with the static value read once at init (via read_api_key() at config.py:577–587). Same pattern in Azure OpenAI, WatsonX, RHOAI vLLM, and RHELAI vLLM. Deployments using provider-specific config blocks will never pick up rotated credentials. The PR description acknowledges the scope boundary, but there's no runtime warning when this override happens — a deployment could silently have hot-reload disabled.

3. Silent fallback to stale credentials masks persistent read failures (ols/app/models/config.py:612)

When _credentials_path is set but the file can't be read, get_credentials() logs a warning and falls back to self.credentials. Correct for transient K8s symlink swaps (milliseconds), but no mechanism to distinguish transient from permanent failure. If the file is permanently deleted, the service uses stale/expired credentials until LLM API calls return 401 — with only a warning log as signal.

Consider: logging at a higher level or incrementing an error counter after N consecutive failures, so ops teams can detect persistent credential-read failures.
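
A minimal sketch of the escalation idea, assuming a threshold of three consecutive failures. `CredentialSource`, `escalate_after`, and the injected `reader` callable are hypothetical and only stand in for the real read path; the counter is not protected by a lock, which a real implementation might want.

```python
import logging
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("credential_reload_sketch")


class CredentialSource:
    """Tracks consecutive read failures and escalates the log level (hypothetical)."""

    def __init__(self, reader: Callable[[], Optional[str]], initial: Optional[str],
                 escalate_after: int = 3) -> None:
        self._reader = reader            # returns the secret, or None on failure
        self._last_good = initial
        self._escalate_after = escalate_after
        self.consecutive_failures = 0

    def get(self) -> Optional[str]:
        fresh = self._reader()
        if fresh is not None:
            self.consecutive_failures = 0
            self._last_good = fresh
            return fresh
        self.consecutive_failures += 1
        # WARNING for a likely transient miss, ERROR once failures persist.
        level = (logging.ERROR if self.consecutive_failures >= self._escalate_after
                 else logging.WARNING)
        logger.log(level, "credential read failed %d time(s) in a row; "
                   "using last known-good value", self.consecutive_failures)
        return self._last_good


class _LevelRecorder(logging.Handler):
    """Collects log levels so the demo can show the escalation."""

    def __init__(self) -> None:
        super().__init__()
        self.levels: List[int] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.levels.append(record.levelno)


def demo_escalation() -> Tuple[List[Optional[str]], List[int]]:
    """Three failed reads, one success, then one more failure."""
    recorder = _LevelRecorder()
    logger.setLevel(logging.WARNING)
    logger.addHandler(recorder)
    try:
        reads = iter([None, None, None, "key-v2", None])
        source = CredentialSource(lambda: next(reads), initial="key-v1")
        values = [source.get() for _ in range(5)]
    finally:
        logger.removeHandler(recorder)
    return values, recorder.levels
```

In the demo, the third failure in a row is logged at ERROR, and the counter resets after the successful read, so the next isolated failure is a WARNING again. An alerting rule or metric could key off the ERROR level.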

4. Dual access pattern is a maintenance trap (ols/app/models/config.py:600)

.credentials (stale, cached at init) remains a public field alongside .get_credentials() (fresh). New providers or contributors will naturally use .credentials. Consider making .credentials a @property that delegates to get_credentials(), or at minimum adding a docstring/comment on the field warning about staleness.
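
The property approach might look roughly like the following. `ProviderConfigSketch` is a plain-Python stand-in, not the real class; if the real `ProviderConfig` is a Pydantic model, turning a field into a property would likely need more care than shown here.

```python
import os
import tempfile
from typing import List, Optional


class ProviderConfigSketch:
    """Plain-Python stand-in for ProviderConfig (hypothetical, not the real class)."""

    def __init__(self, credentials_path: Optional[str] = None,
                 credentials: Optional[str] = None) -> None:
        self._credentials_path = credentials_path
        self._last_good = credentials

    def get_credentials(self) -> Optional[str]:
        """Re-read the file when a path is configured, else return the cached value."""
        if self._credentials_path is not None:
            try:
                with open(self._credentials_path, encoding="utf-8") as f:
                    self._last_good = f.read().rstrip()
            except OSError:
                pass  # keep the last known-good value
        return self._last_good

    @property
    def credentials(self) -> Optional[str]:
        """Read-only alias so attribute access goes through the same re-read path."""
        return self.get_credentials()


def demo_property_access() -> List[Optional[str]]:
    """Show that plain attribute access observes a rotated file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "apitoken")
        with open(path, "w", encoding="utf-8") as f:
            f.write("key-v1\n")
        cfg = ProviderConfigSketch(credentials_path=path)
        seen = [cfg.credentials]
        with open(path, "w", encoding="utf-8") as f:
            f.write("key-v2\n")
        seen.append(cfg.credentials)
        return seen
```

With this shape, a new provider that reaches for `.credentials` out of habit still gets the fresh value, which removes the dual access pattern rather than documenting it.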

5. read_secret_from_path() duplicates read_secret() (ols/utils/checks.py:71)

Both implement directory-resolution (os.path.isdir + os.path.join) and file-reading (open + read + rstrip). They diverge only in error handling (print() vs logger.warning()). A shared inner helper would reduce duplication.

6. PR title missing OLS-XXXX prefix

Per repo conventions, PR titles should start with the Jira reference (OLS-XXXX description). Current title uses conventional-commit format instead.


Items 1–2 are the most actionable — they're correctness gaps where hot-reload silently doesn't work. Items 3–5 are design/cleanup suggestions. Item 6 is a convention nit.

@blublinsky blublinsky left a comment

Contributor

Overall this is a clean, well-scoped change. The Prometheus-style per-request re-read is the right pattern — no new dependencies, negligible overhead, and avoids the fsnotify/mtime edge cases with K8s symlink swaps.

Two items:

1. Stale fallback after successful rotation (ols/app/models/config.py:612-618)

When get_credentials() reads a rotated credential successfully, it returns the fresh value but doesn't update self.credentials. If a later read fails (transient I/O error), the fallback returns the original startup key — which may have been revoked hours ago — instead of the last-known-good rotated key.

Scenario with 1-hour token rotation on a long-running pod:

  1. Startup: self.credentials = "key-v1"
  2. Hour 1: reads "key-v2" from disk and returns it, but self.credentials is still "key-v1"
  3. Hour 2: transient read failure → falls back to self.credentials → returns revoked "key-v1"

Suggested fix — one line:

            if fresh is not None:
                self.credentials = fresh
                return fresh

The test test_provider_config_get_credentials_falls_back_on_read_failure should also cover this sequence: startup → successful rotation → file disappears → assert fallback returns the rotated key, not the startup key.
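
The sequence above can be sketched end to end as follows. `CachingProviderConfigSketch` and `rotation_then_failure` are hypothetical names, and the class is a simplified stand-in for `ProviderConfig` that only models the last-known-good behavior being proposed.

```python
import os
import tempfile
from typing import List, Optional


def _read(path: str) -> Optional[str]:
    """Return the stripped file contents, or None if the read fails."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().rstrip()
    except OSError:
        return None


class CachingProviderConfigSketch:
    """Simplified stand-in that remembers the last successfully read credential."""

    def __init__(self, credentials_path: str) -> None:
        self._credentials_path = credentials_path
        self.credentials = _read(credentials_path)  # startup value

    def get_credentials(self) -> Optional[str]:
        fresh = _read(self._credentials_path)
        if fresh is not None:
            self.credentials = fresh  # persist the last known-good value
            return fresh
        return self.credentials


def rotation_then_failure() -> List[Optional[str]]:
    """Startup, successful rotation, then the file disappears."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "apitoken")
        with open(path, "w", encoding="utf-8") as f:
            f.write("key-v1\n")
        cfg = CachingProviderConfigSketch(path)
        seen = [cfg.get_credentials()]

        with open(path, "w", encoding="utf-8") as f:
            f.write("key-v2\n")
        seen.append(cfg.get_credentials())

        os.remove(path)  # simulate a failed read after rotation
        seen.append(cfg.get_credentials())
        return seen
```

The third value is the rotated key rather than the startup key, which is the assertion the extended test would make. One trade-off to keep in mind is that `self.credentials` becomes shared state written on the request path; a single attribute assignment is generally safe in CPython, but that is worth confirming against how the service handles concurrency.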

2. Bedrock provider not updated

The Bedrock provider (ols/src/llms/providers/bedrock.py) was merged to main via #2959 after this branch was created. It uses self.provider_config.credentials directly in default_params (line 37), same pattern this PR replaces in all other providers. After rebase it will be the only provider without hot-reload — please update it to self.provider_config.get_credentials().

lalan7 force-pushed the feat/credential-hot-reload branch from 9950ad6 to b0f5ce9 on June 26, 2026 15:49
When credentialsSecretRef is updated by a CronJob or external rotation,
the lightspeed-operator currently triggers a rolling restart.  For
short-lived tokens (1-hour rotation) this causes hourly capacity loss.

This change makes ProviderConfig re-read the credential file on every
LLM request via a new get_credentials() method, following the same
Prometheus-style pattern (credentials_file re-read on every scrape).
This is safe with Kubernetes atomic symlink swaps, requires no new
dependencies, and adds negligible overhead vs the LLM call latency.

Changes:
- ols/utils/checks.py: add read_secret_from_path() helper
- ols/app/models/config.py: store _credentials_path, add get_credentials()
- All LLM providers: use get_credentials() instead of .credentials
- Unit tests for helper, ProviderConfig, and OpenAI provider rotation
- docs/credential-hot-reload.md: design rationale and test plan

Ref: https://issues.redhat.com/browse/RFE-9380
Co-authored-by: Cursor <cursoragent@cursor.com>
lalan7 force-pushed the feat/credential-hot-reload branch from b0f5ce9 to 1fcc71f on June 26, 2026 17:28

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/unit/llms/providers/test_bedrock.py (1)

497-522: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Rotation test exercises default_params, not the loaded client.

This validates that two freshly constructed instances read fresh file contents, which mainly re-proves ProviderConfig.get_credentials() behavior (already covered in test_config.py). It does not assert that the credential actually wired into the LangChain client (load() → bedrock_api_key/openai_api_key) reflects rotation. Consider asserting on load() output (e.g. the bedrock_api_key passed to a mocked ChatBedrockConverse) so the test guards the end-to-end path that callers depend on.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/llms/providers/test_bedrock.py` around lines 497 - 522, The
rotation test is only checking Bedrock.default_params, so it duplicates
ProviderConfig credential loading instead of verifying the provider/client
wiring. Update test_bedrock_picks_up_rotated_credentials to exercise
Bedrock.load() and assert the rotated secret is propagated into the LangChain
client path, specifically the bedrock_api_key/openai_api_key value passed when
constructing the mocked ChatBedrockConverse. Keep the existing Bedrock and
ProviderConfig setup, but make the assertion target the loaded client output
rather than default_params.
ols/utils/checks.py (1)

71-100: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoff

Consider extracting the shared file-read logic from read_secret.

read_secret_from_path duplicates the directory-resolution and open()/read().rstrip() block already present in read_secret (Lines 48-59). The only real differences are the input shape (concrete path vs dict lookup) and the error policy (warn-and-return-None vs raise/print). A small private helper that both functions delegate to would keep the read behavior consistent if it ever changes (encoding, stripping, directory handling).
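
A possible shape for that private helper is sketched below. `_read_secret_file` is a hypothetical name, and the `read_secret` signature shown here is a simplified assumption rather than the repository's actual one; the point is only that both entry points share the resolution and read logic while keeping their own error policy.

```python
import logging
import os
import tempfile
from typing import Optional, Tuple

logger = logging.getLogger("checks_sketch")


def _read_secret_file(path: str, default_filename: str) -> str:
    """Resolve a file-or-directory path and return the stripped contents.

    Raises OSError on failure; callers choose the error policy.
    """
    filename = os.path.join(path, default_filename) if os.path.isdir(path) else path
    with open(filename, encoding="utf-8") as f:
        return f.read().rstrip()


def read_secret_from_path(path: str, default_filename: str = "apitoken") -> Optional[str]:
    """Hot-reload policy: warn and return None so the caller can fall back."""
    try:
        return _read_secret_file(path, default_filename)
    except OSError as e:
        logger.warning("Could not read secret from %s: %s", path, e)
        return None


def read_secret(data: dict, path_key: str, default_filename: str,
                raise_on_error: bool = True) -> Optional[str]:
    """Startup policy (assumed signature): dict lookup, optionally raising."""
    path = data.get(path_key)
    if path is None:
        return None
    try:
        return _read_secret_file(path, default_filename)
    except OSError as e:
        if raise_on_error:
            raise FileNotFoundError(f"Missing secret file: {path}") from e
        return None


def demo_shared_helper() -> Tuple[Optional[str], Optional[str], Optional[str], bool]:
    """Exercise both entry points against a present and a missing secret."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "apitoken"), "w", encoding="utf-8") as f:
            f.write("secret\n")
        missing = os.path.join(d, "nope")
        try:
            read_secret({"credentials_path": missing}, "credentials_path", "apitoken")
            raised = False
        except FileNotFoundError:
            raised = True
        return (
            read_secret_from_path(d),
            read_secret({"credentials_path": d}, "credentials_path", "apitoken"),
            read_secret_from_path(missing),
            raised,
        )
```

Any later change to encoding, stripping, or directory handling would then land in one place, which is the consistency benefit this comment is after.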

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/utils/checks.py` around lines 71 - 100, `read_secret_from_path`
duplicates the same path resolution and file-reading logic already used by
`read_secret`; extract that shared open/read/rstrip and directory-handling
behavior into a small private helper and have both `read_secret` and
`read_secret_from_path` call it. Keep the distinct error handling in
`read_secret_from_path` (logging a warning and returning `None`) while
preserving `read_secret`’s existing lookup/exception behavior, using the
existing function names as the main entry points.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cf0a8cf8-1441-4a82-8fa0-65911ee50f09

📥 Commits

Reviewing files that changed from the base of the PR and between 9950ad6 and 1fcc71f.

📒 Files selected for processing (14)
  • docs/credential-hot-reload.md
  • ols/app/models/config.py
  • ols/src/llms/providers/azure_openai.py
  • ols/src/llms/providers/bedrock.py
  • ols/src/llms/providers/google_vertex.py
  • ols/src/llms/providers/openai.py
  • ols/src/llms/providers/rhelai_vllm.py
  • ols/src/llms/providers/rhoai_vllm.py
  • ols/src/llms/providers/watsonx.py
  • ols/utils/checks.py
  • tests/unit/app/models/test_config.py
  • tests/unit/llms/providers/test_bedrock.py
  • tests/unit/llms/providers/test_openai.py
  • tests/unit/utils/test_checks.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • ols/src/llms/providers/rhoai_vllm.py
  • ols/src/llms/providers/azure_openai.py
  • ols/src/llms/providers/rhelai_vllm.py
  • ols/src/llms/providers/google_vertex.py
  • ols/src/llms/providers/openai.py
  • tests/unit/utils/test_checks.py
  • ols/src/llms/providers/watsonx.py
  • ols/app/models/config.py

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

@lalan7: all tests passed!

