Add prompt caching and latest-model support to model latency benchmarking tool#714
Open
evgenisokolov wants to merge 2 commits into
Open
Add prompt caching and latest-model support to model latency benchmarking tool#714evgenisokolov wants to merge 2 commits into
evgenisokolov wants to merge 2 commits into
Conversation
Enhances model-latency-benchmarking/ (refs aws-samples#713): - Add opt-in, per-scenario prompt caching (prompt_caching, cache_ttl, cached_context) with a global PROMPT_CACHING default. Inserts a Converse cachePoint after the cached context; extended 1h TTL is applied only for Anthropic models, others use the default duration. - Capture cache metrics (cache_read_input_tokens, cache_write_input_tokens, Cache_Hit_Rate) and split TTFT into cached vs uncached in the analysis, preserving all existing columns and aggregates. - Add a boto3 minimum-version gate and confirm Converse streaming usage. - Fix inference config so configured TEMPERATURE/TOP_P/TOP_K are honored instead of hardcoded; send a single sampling parameter (INFERENCE_SAMPLING) since several models reject temperature and topP together. - Refresh the sample dataset to current models (Claude Opus/Sonnet/Haiku 4.5, Claude 3.7 Sonnet, Amazon Nova Pro) and add a dedicated caching demo dataset with cached/uncached pairs for Claude Haiku 4.5 and Amazon Nova Pro. - Update readme with the new fields, metrics, prerequisites, and caching notes. Existing datasets without the new fields run unchanged. Change is limited to the model-latency-benchmarking/ folder.
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
- Remove the large cached-context scenario from the sample dataset; the caching demo dataset already covers caching end to end. - Condense the readme caching docs into a single concise section that does not repeat the dataset field table, aligning with the original style.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Enhances the existing
model-latency-benchmarking/tool. Refs #713.This adds opt-in prompt caching, support for the latest Bedrock models, and SDK/API modernization, while keeping existing datasets working unchanged. The change is limited to the
model-latency-benchmarking/folder.Changes
prompt_caching,cache_ttl(5m/1h), andcached_context, plus a globalPROMPT_CACHINGdefault. A ConversecachePointis inserted after the cached context. Extended1hTTL is applied only for Anthropic models; other models use the default duration.cache_read_input_tokens,cache_write_input_tokens, and a derivedCache_Hit_Rate; the analysis reports TTFT split by cached vs uncached. All existing per-invocation columns and aggregated metrics are preserved (cache columns are added, not renamed).boto3version gate and confirm Converse streaming via thebedrock-runtimeclient.TEMPERATURE/TOP_P/TOP_Kinstead of hardcoded values. Because several current models rejecttemperatureandtopPtogether, the tool now sends a single sampling parameter selected byINFERENCE_SAMPLING(defaulttemperature).caching-demo-prompts-for-benchmarking.jsonl) with cached/uncached pairs for Claude Haiku 4.5 and Amazon Nova Pro.readme.mdwith the new fields, metrics, prerequisites, supported models/regions, the on-demand-only constraint, and the demo dataset.Testing
Backward compatibility
Datasets that contain only the original fields run unchanged; a scenario without the caching fields behaves exactly as before.
Note on scope
The repository's contribution guide mentions a website markdown mirror under
docs/. This tool is not currently published throughmkdocs.yml, so a website mirror is intentionally out of scope for this change. Happy to add one if maintainers prefer.