Skip to content

Add prompt caching and latest-model support to model latency benchmarking tool#714

Open
evgenisokolov wants to merge 2 commits into
aws-samples:mainfrom
evgenisokolov:latency-benchmarking-prompt-caching
Open

Add prompt caching and latest-model support to model latency benchmarking tool#714
evgenisokolov wants to merge 2 commits into
aws-samples:mainfrom
evgenisokolov:latency-benchmarking-prompt-caching

Conversation

@evgenisokolov

Copy link
Copy Markdown

Summary

Enhances the existing model-latency-benchmarking/ tool. Refs #713.

This adds opt-in prompt caching, support for the latest Bedrock models, and SDK/API modernization, while keeping existing datasets working unchanged. The change is limited to the model-latency-benchmarking/ folder.

Changes

  • Prompt caching (opt-in, per scenario): new optional fields prompt_caching, cache_ttl (5m/1h), and cached_context, plus a global PROMPT_CACHING default. A Converse cachePoint is inserted after the cached context. Extended 1h TTL is applied only for Anthropic models; other models use the default duration.
  • Cache metrics: capture cache_read_input_tokens, cache_write_input_tokens, and a derived Cache_Hit_Rate; the analysis reports TTFT split by cached vs uncached. All existing per-invocation columns and aggregated metrics are preserved (cache columns are added, not renamed).
  • SDK/API: add a minimum boto3 version gate and confirm Converse streaming via the bedrock-runtime client.
  • Inference config fix: honor the configured TEMPERATURE/TOP_P/TOP_K instead of hardcoded values. Because several current models reject temperature and topP together, the tool now sends a single sampling parameter selected by INFERENCE_SAMPLING (default temperature).
  • Datasets: refresh the sample dataset to current models (Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5, Claude 3.7 Sonnet, Amazon Nova Pro) and add a dedicated caching demo dataset (caching-demo-prompts-for-benchmarking.jsonl) with cached/uncached pairs for Claude Haiku 4.5 and Amazon Nova Pro.
  • Docs: update readme.md with the new fields, metrics, prerequisites, supported models/regions, the on-demand-only constraint, and the demo dataset.

Testing

  • Ran the notebook end-to-end in Amazon SageMaker Studio against the caching demo dataset. After the fixes above, cached and uncached scenarios for both Claude Haiku 4.5 and Amazon Nova Pro complete without errors and produce per-invocation CSV and aggregated analysis including the cache columns.
  • Verified the notebook is valid JSON and all Python cells compile.

Backward compatibility

Datasets that contain only the original fields run unchanged; a scenario without the caching fields behaves exactly as before.

Note on scope

The repository's contribution guide mentions a website markdown mirror under docs/. This tool is not currently published through mkdocs.yml, so a website mirror is intentionally out of scope for this change. Happy to add one if maintainers prefer.

Enhances model-latency-benchmarking/ (refs aws-samples#713):

- Add opt-in, per-scenario prompt caching (prompt_caching, cache_ttl,
  cached_context) with a global PROMPT_CACHING default. Inserts a Converse
  cachePoint after the cached context; extended 1h TTL is applied only for
  Anthropic models, others use the default duration.
- Capture cache metrics (cache_read_input_tokens, cache_write_input_tokens,
  Cache_Hit_Rate) and split TTFT into cached vs uncached in the analysis,
  preserving all existing columns and aggregates.
- Add a boto3 minimum-version gate and confirm Converse streaming usage.
- Fix inference config so configured TEMPERATURE/TOP_P/TOP_K are honored
  instead of hardcoded; send a single sampling parameter (INFERENCE_SAMPLING)
  since several models reject temperature and topP together.
- Refresh the sample dataset to current models (Claude Opus/Sonnet/Haiku 4.5,
  Claude 3.7 Sonnet, Amazon Nova Pro) and add a dedicated caching demo dataset
  with cached/uncached pairs for Claude Haiku 4.5 and Amazon Nova Pro.
- Update readme with the new fields, metrics, prerequisites, and caching notes.

Existing datasets without the new fields run unchanged. Change is limited to
the model-latency-benchmarking/ folder.
@review-notebook-app

Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

- Remove the large cached-context scenario from the sample dataset; the
  caching demo dataset already covers caching end to end.
- Condense the readme caching docs into a single concise section that does
  not repeat the dataset field table, aligning with the original style.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant