feat: in-process LLM credential hot-reload#2955
Conversation
📝 WalkthroughWalkthroughAdds in-process LLM credential hot-reload. Credential values are now re-read from disk through ChangesCredential Hot-Reload
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/app/models/test_config.py (1)
4357-4443:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd type hints to the new test signatures.
The new tests in this block omit parameter/return annotations, which conflicts with strict typing
requirements.Suggested fix
+from pathlib import Path @@ -def test_provider_config_get_credentials_returns_cached_when_no_path(): +def test_provider_config_get_credentials_returns_cached_when_no_path() -> None: @@ -def test_provider_config_get_credentials_rereads_from_disk(tmp_path): +def test_provider_config_get_credentials_rereads_from_disk(tmp_path: Path) -> None: @@ -def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path): +def test_provider_config_get_credentials_falls_back_on_read_failure(tmp_path: Path) -> None: @@ -def test_provider_config_get_credentials_with_directory(tmp_path): +def test_provider_config_get_credentials_with_directory(tmp_path: Path) -> None:As per coding guidelines, "All function signatures must include type hints".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/app/models/test_config.py` around lines 4357 - 4443, Add type hints to all four test function signatures in this block. For test_provider_config_get_credentials_returns_cached_when_no_path, add return type hint -> None. For test_provider_config_get_credentials_rereads_from_disk, test_provider_config_get_credentials_falls_back_on_read_failure, and test_provider_config_get_credentials_with_directory, add the parameter type hint for tmp_path as pathlib.Path and return type hint as -> None to each function signature.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/credential-hot-reload.md`:
- Around line 52-57: The fenced code block showing the Request flow diagram is
missing a language identifier, which triggers markdownlint rule MD040. Add the
language identifier "text" to the opening fence of the code block (the three
backticks before the Request line) so it reads ```text instead of just ```. This
will satisfy the markdown linting requirement while properly labeling the
content type.
In `@ols/app/models/config.py`:
- Around line 612-618: The get_credentials() method in ols/app/models/config.py
returns the fresh credential value without persisting it to the cached
self.credentials instance variable. When fresh credentials are successfully read
from the path (when fresh is not None), update self.credentials with this fresh
value before returning it. This ensures that subsequent calls to
get_credentials() will return the last known-good rotated key rather than
falling back to a stale startup value if a later read fails.
In `@ols/src/llms/providers/google_vertex.py`:
- Around line 35-38: Replace the `ValueError` exception with
`InvalidConfigurationError` in the credential validation logic of the Google
Vertex provider. Import `InvalidConfigurationError` from `ols.utils.checks` at
the top of the file if not already present. Then update the exception raised
when `creds_value is None` (at the location around lines 35-38) and apply the
same change to the second occurrence mentioned at lines 74-79 where similar
credential validation occurs, ensuring both locations use the domain-specific
exception type for consistent configuration error handling across the provider.
In `@tests/unit/llms/providers/test_openai.py`:
- Around line 275-302: The function `test_openai_picks_up_rotated_credentials`
is missing type hints on its parameters and return type, violating the
repository's typed-signature requirement. Add type annotations to both the
`tmp_path` parameter (use the appropriate Path type from pathlib) and the
`fake_certifi_store` parameter (determine the correct fixture type), and add a
return type hint of None for the test function.
---
Outside diff comments:
In `@tests/unit/app/models/test_config.py`:
- Around line 4357-4443: Add type hints to all four test function signatures in
this block. For
test_provider_config_get_credentials_returns_cached_when_no_path, add return
type hint -> None. For test_provider_config_get_credentials_rereads_from_disk,
test_provider_config_get_credentials_falls_back_on_read_failure, and
test_provider_config_get_credentials_with_directory, add the parameter type hint
for tmp_path as pathlib.Path and return type hint as -> None to each function
signature.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 64086aa0-5d90-4440-9524-4d7a4b23a9e5
📒 Files selected for processing (12)
docs/credential-hot-reload.mdols/app/models/config.pyols/src/llms/providers/azure_openai.pyols/src/llms/providers/google_vertex.pyols/src/llms/providers/openai.pyols/src/llms/providers/rhelai_vllm.pyols/src/llms/providers/rhoai_vllm.pyols/src/llms/providers/watsonx.pyols/utils/checks.pytests/unit/app/models/test_config.pytests/unit/llms/providers/test_openai.pytests/unit/utils/test_checks.py
bc5ee67 to
9950ad6
Compare
Adversarial ReviewNice PR — the motivation is solid and the Prometheus 1. Bedrock provider not updated (
|
blublinsky
left a comment
There was a problem hiding this comment.
Overall this is a clean, well-scoped change. The Prometheus-style per-request re-read is the right pattern — no new dependencies, negligible overhead, and avoids the fsnotify/mtime edge cases with K8s symlink swaps.
Two items:
1. Stale fallback after successful rotation (ols/app/models/config.py:612-618)
When get_credentials() reads a rotated credential successfully, it returns the fresh value but doesn't update self.credentials. If a later read fails (transient I/O error), the fallback returns the original startup key — which may have been revoked hours ago — instead of the last-known-good rotated key.
Scenario with 1-hour token rotation on a long-running pod:
- Startup:
self.credentials = "key-v1" - Hour 1: reads
"key-v2"from disk, returns it — butself.credentialsstill"key-v1" - Hour 2: transient read failure → falls back to
self.credentials→ returns revoked"key-v1"
Suggested fix — one line:
if fresh is not None:
self.credentials = fresh
return freshThe test test_provider_config_get_credentials_falls_back_on_read_failure should also cover this sequence: startup → successful rotation → file disappears → assert fallback returns the rotated key, not the startup key.
2. Bedrock provider not updated
The Bedrock provider (ols/src/llms/providers/bedrock.py) was merged to main via #2959 after this branch was created. It uses self.provider_config.credentials directly in default_params (line 37), same pattern this PR replaces in all other providers. After rebase it will be the only provider without hot-reload — please update it to self.provider_config.get_credentials().
9950ad6 to
b0f5ce9
Compare
When credentialsSecretRef is updated by a CronJob or external rotation, the lightspeed-operator currently triggers a rolling restart. For short-lived tokens (1-hour rotation) this causes hourly capacity loss. This change makes ProviderConfig re-read the credential file on every LLM request via a new get_credentials() method, following the same Prometheus-style pattern (credentials_file re-read on every scrape). This is safe with Kubernetes atomic symlink swaps, requires no new dependencies, and adds negligible overhead vs the LLM call latency. Changes: - ols/utils/checks.py: add read_secret_from_path() helper - ols/app/models/config.py: store _credentials_path, add get_credentials() - All LLM providers: use get_credentials() instead of .credentials - Unit tests for helper, ProviderConfig, and OpenAI provider rotation - docs/credential-hot-reload.md: design rationale and test plan Ref: https://issues.redhat.com/browse/RFE-9380 Co-authored-by: Cursor <cursoragent@cursor.com>
b0f5ce9 to
1fcc71f
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (2)
tests/unit/llms/providers/test_bedrock.py (1)
497-522: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick winRotation test exercises
default_params, not the loaded client.This validates that two freshly constructed instances read fresh file contents, which mainly re-proves
ProviderConfig.get_credentials()behavior (already covered intest_config.py). It does not assert that the credential actually wired into the LangChain client (load()→bedrock_api_key/openai_api_key) reflects rotation. Consider asserting onload()output (e.g. thebedrock_api_keypassed to a mockedChatBedrockConverse) so the test guards the end-to-end path that callers depend on.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/llms/providers/test_bedrock.py` around lines 497 - 522, The rotation test is only checking Bedrock.default_params, so it duplicates ProviderConfig credential loading instead of verifying the provider/client wiring. Update test_bedrock_picks_up_rotated_credentials to exercise Bedrock.load() and assert the rotated secret is propagated into the LangChain client path, specifically the bedrock_api_key/openai_api_key value passed when constructing the mocked ChatBedrockConverse. Keep the existing Bedrock and ProviderConfig setup, but make the assertion target the loaded client output rather than default_params.ols/utils/checks.py (1)
71-100: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚖️ Poor tradeoffConsider extracting the shared file-read logic from
read_secret.
read_secret_from_pathduplicates the directory-resolution andopen()/read().rstrip()block already present inread_secret(Lines 48-59). The only real differences are the input shape (concrete path vs dict lookup) and the error policy (warn-and-return-Nonevs raise/print). A small private helper that both functions delegate to would keep the read behavior consistent if it ever changes (encoding, stripping, directory handling).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/utils/checks.py` around lines 71 - 100, `read_secret_from_path` duplicates the same path resolution and file-reading logic already used by `read_secret`; extract that shared open/read/rstrip and directory-handling behavior into a small private helper and have both `read_secret` and `read_secret_from_path` call it. Keep the distinct error handling in `read_secret_from_path` (logging a warning and returning `None`) while preserving `read_secret`’s existing lookup/exception behavior, using the existing function names as the main entry points.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ols/utils/checks.py`:
- Around line 71-100: `read_secret_from_path` duplicates the same path
resolution and file-reading logic already used by `read_secret`; extract that
shared open/read/rstrip and directory-handling behavior into a small private
helper and have both `read_secret` and `read_secret_from_path` call it. Keep the
distinct error handling in `read_secret_from_path` (logging a warning and
returning `None`) while preserving `read_secret`’s existing lookup/exception
behavior, using the existing function names as the main entry points.
In `@tests/unit/llms/providers/test_bedrock.py`:
- Around line 497-522: The rotation test is only checking
Bedrock.default_params, so it duplicates ProviderConfig credential loading
instead of verifying the provider/client wiring. Update
test_bedrock_picks_up_rotated_credentials to exercise Bedrock.load() and assert
the rotated secret is propagated into the LangChain client path, specifically
the bedrock_api_key/openai_api_key value passed when constructing the mocked
ChatBedrockConverse. Keep the existing Bedrock and ProviderConfig setup, but
make the assertion target the loaded client output rather than default_params.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cf0a8cf8-1441-4a82-8fa0-65911ee50f09
📒 Files selected for processing (14)
docs/credential-hot-reload.mdols/app/models/config.pyols/src/llms/providers/azure_openai.pyols/src/llms/providers/bedrock.pyols/src/llms/providers/google_vertex.pyols/src/llms/providers/openai.pyols/src/llms/providers/rhelai_vllm.pyols/src/llms/providers/rhoai_vllm.pyols/src/llms/providers/watsonx.pyols/utils/checks.pytests/unit/app/models/test_config.pytests/unit/llms/providers/test_bedrock.pytests/unit/llms/providers/test_openai.pytests/unit/utils/test_checks.py
🚧 Files skipped from review as they are similar to previous changes (8)
- ols/src/llms/providers/rhoai_vllm.py
- ols/src/llms/providers/azure_openai.py
- ols/src/llms/providers/rhelai_vllm.py
- ols/src/llms/providers/google_vertex.py
- ols/src/llms/providers/openai.py
- tests/unit/utils/test_checks.py
- ols/src/llms/providers/watsonx.py
- ols/app/models/config.py
|
@lalan7: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
When
credentialsSecretRefis updated by a CronJob or external rotation, the lightspeed-operator triggers a rolling restart oflightspeed-app-server. For customers using short-lived LLM tokens (1-hour validity), this causes hourly pod restarts and temporary capacity loss — including reloading RAG vector indexes and embedding models.This PR makes
ProviderConfigre-read the credential file on every LLM request via a newget_credentials()method, so rotated secrets are picked up without a pod restart. This follows the Prometheuscredentials_filepattern: re-read on every request, no caching, no edge cases with K8s symlink swaps.Changes
ols/utils/checks.py— Newread_secret_from_path()helper for repeated file readsols/app/models/config.py— Store_credentials_pathinProviderConfig.__init__(), addget_credentials()method that re-reads the file when a path is configuredopenai,azure_openai,watsonx,rhoai_vllm,rhelai_vllm,google_vertex) — Useget_credentials()instead of.credentialsProviderConfig.get_credentials()(cached, re-read, fallback, directory), and OpenAI provider credential rotationdocs/credential-hot-reload.md— Design rationale and test planWhy this approach
kubeletuses atomic symlink swaps (..data);os.stat()mtime can be unreliable after swaps. Re-reading avoids the edge case.credentials_file, Envoy, controller-runtimeCertWatcher, kube-proxy all re-read files.open()+read()of a <100 byte file per LLM request is negligible vs LLM call latency (seconds).Scope
Hot-reload applies to top-level
credentials_path, which is the path used by operator-managed deployments (credentialsSecretRef). Provider-specific config blocks (openai_config.credentials_path,azure_openai_config.credentials_path, etc.) continue to cache credentials at startup — this matches the existing operator behavior, which does not generate these blocks.Azure AD credentials (
tenant_id,client_id,client_secret) are also read once at startup. These can be addressed in a follow-up if needed.Test plan
read_secret_from_path()— file, directory, nonexistent, rotationProviderConfig.get_credentials()— cached fallback, disk re-read, read failure fallback, directory pathload_llm()callsruff checkclean,mypycleanRef
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests