Add configurable cpu_work_us and bucket_lock_power for CPU tuning (#644) by charles-typ · Pull Request #644 · facebookresearch/DCPerf

charles-typ · 2026-05-29T16:49:39Z

Summary:

- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.
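To make the busy-work idea concrete, here is a minimal Python sketch of a per-request spin loop bounded by a microsecond budget. It is an illustration only: the function name `cpu_busy_work`, the mixing constants, and the choice to bound the loop by wall-clock time (rather than a calibrated iteration count) are assumptions, not the server's actual C++ implementation.

```python
import time


def cpu_busy_work(cpu_work_us: int, seed: int = 0) -> int:
    """Spin on integer hash mixing for roughly cpu_work_us microseconds.

    Hypothetical stand-in for the per-request busy-work described above.
    Returns the mixed state so the loop has an observable result.
    """
    if cpu_work_us <= 0:
        return seed  # feature disabled: no extra CPU cost
    deadline = time.perf_counter_ns() + cpu_work_us * 1000
    state = seed
    while time.perf_counter_ns() < deadline:
        # Do a small batch of mixing between clock reads so the clock
        # call does not dominate the work being simulated.
        for _ in range(16):
            state = (state * 0x9E3779B97F4A7C15 + 1) & 0xFFFFFFFFFFFFFFFF
            state ^= state >> 29
    return state
```

A value of 0 leaves the request path unchanged, which is one plausible way to keep the parameter opt-in.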

Differential Revision: D99338680

meta-codesync · 2026-05-29T16:50:17Z

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338680.

…cebookresearch#644)

Summary: Pull Request resolved: facebookresearch#644

- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680
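The effect of `bucket_lock_power` can be sketched as lock striping, assuming the parameter is the base-2 logarithm of the number of lock buckets (so 20 would mean 2^20 = 1,048,576 locks). That reading, the class name `StripedLocks`, and the use of `zlib.crc32` and `threading.Lock` in place of the server's key hash and fiber mutexes are all assumptions made for illustration.

```python
import threading
import zlib


class StripedLocks:
    """Hypothetical lock-striping table: 2**bucket_lock_power locks."""

    def __init__(self, bucket_lock_power: int):
        self.mask = (1 << bucket_lock_power) - 1
        self.locks = [threading.Lock() for _ in range(1 << bucket_lock_power)]

    def index_for(self, key: bytes) -> int:
        # A power-of-two table lets the bucket be picked with a bit mask.
        return zlib.crc32(key) & self.mask

    def lock_for(self, key: bytes) -> threading.Lock:
        return self.locks[self.index_for(key)]


# Small power for the demo; a power of 20 would allocate about a million locks.
table = StripedLocks(bucket_lock_power=4)
with table.lock_for(b"user42"):
    pass  # per-request critical section would go here
```

Under this reading, a larger power spreads keys over more locks, so two requests contend only when their keys map to the same bucket.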

Summary:

## Problem

When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024).
2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**

- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)

Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)

Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server.

### 3. Client: Adaptive load control during warmup (TCP congestion control)

Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease):

- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
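The warmup schedule described in this commit can be sketched as a small controller. This is a simplified single-threaded model, not the client's code: the class name `AimdController` and its field names are hypothetical, and two behaviours the summary does not specify are assumed here, namely that the limit holds steady when the error rate is between 1% and 5%, and that backoff never drops below the initial value.

```python
from dataclasses import dataclass


@dataclass
class AimdController:
    """Hypothetical model of the per-thread inflight limit during warmup."""

    max_inflight: int = 500
    initial_inflight: int = 2
    linear_step: int = 50              # additive increase per step
    slow_start_fraction: float = 0.25  # switch to linear growth past this share of max
    grow_below_error: float = 0.01
    backoff_above_error: float = 0.05

    def __post_init__(self) -> None:
        self.current = self.initial_inflight

    def step(self, error_rate: float) -> int:
        """Advance one control interval (every 2s in the summary)."""
        if error_rate > self.backoff_above_error:
            # Multiplicative decrease, floored at the initial value (assumption).
            self.current = max(self.initial_inflight, self.current // 2)
        elif error_rate < self.grow_below_error:
            if self.current <= self.slow_start_fraction * self.max_inflight:
                self.current = min(self.max_inflight, self.current * 2)  # slow start
            else:
                self.current = min(self.max_inflight, self.current + self.linear_step)
        # Otherwise hold the current limit (assumption).
        return self.current


ctrl = AimdController()
schedule = [ctrl.current] + [ctrl.step(error_rate=0.0) for _ in range(14)]
print(schedule)
```

With no errors, the model walks through the doubling phase up to 128 and then climbs in steps of 50 until it is capped at 500, matching the sequence listed in the summary.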

Summary:

- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871

Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead.

Changes:

- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:

- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676

Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:

- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config.

Differential Revision: D99338673
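The first three features in the list above can be pictured with a short Python sketch. Several things here are assumptions: the key layout is read as a literal `uc` prefix, then pool, key, and a version suffix; the hash shown is the common 32-bit MurmurHash2 variant, which may differ from whatever variant the production `getHashForKey` uses; and a plain `dict` stands in for `F14FastMap`. All function names are hypothetical.

```python
def murmurhash2_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash2 over little-endian 4-byte blocks."""
    m, r, mask = 0x5BD1E995, 24, 0xFFFFFFFF
    h = (seed ^ len(data)) & mask
    i = 0
    while len(data) - i >= 4:
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = ((h * m) & mask) ^ k
        i += 4
    tail = data[i:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def build_compound_key(pool: str, key: str, version: int = 1) -> str:
    # Assumed layout, modelled on the "uc:pool:key:v1" example in the summary.
    return f"uc:{pool}:{key}:v{version}"


def acl_allows(acl: dict[str, bool], pool: str) -> bool:
    # Prefix lookup; unknown prefixes are denied in this sketch.
    return acl.get(f"uc:{pool}", False)


acl = {"uc:poolA": True}
compound = build_compound_key("poolA", "user42")
key_hash = murmurhash2_32(compound.encode())
```

Each request would pay for one string build, one hash pass over the key bytes, and one map lookup, which is the kind of fixed per-request cost the commit is trying to reproduce.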

Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache:

- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
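For readers unfamiliar with the varint encoding mentioned above, the following sketch shows the two integer transforms the Thrift compact protocol relies on: zigzag mapping of signed values and base-128 variable-length encoding. The function names are hypothetical, and the sketch covers only integer encoding, not the field headers or whatever else the benchmark's simulation does.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit integer to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a little-endian base-128 varint."""
    if value < 0:
        raise ValueError("varint input must be non-negative; zigzag it first")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def encode_i32(n: int) -> bytes:
    """Signed 32-bit integer as it would appear on the wire: zigzag, then varint."""
    return encode_varint(zigzag32(n))
```

Small values cost one byte and larger ones cost up to five for a 32-bit integer, so the CPU cost per field depends on the magnitudes being serialized.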

Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache. Differential Revision: D99338674

Summary: Three additional production-like CPU overhead simulations:

- Egress rate limiting: thread-local F14 map simulating NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window + token bucket per-response checks
- KCB double-lookup: second CacheLib find for ~25% of requests, matching production Key Client Binding version mismatch path
- Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads matching production shouldLoadShed() per-request checks

Differential Revision: D99338675
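The token bucket part of the egress rate limiting item can be illustrated with a generic sketch. This is the textbook algorithm, not NetworkOverloadProtector's logic: the class `TokenBucket`, its fields, the per-client `dict` (standing in for the thread-local F14 map), and the rate values are all hypothetical, and the sliding-window component is omitted.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Generic token bucket; time is passed in so the logic is deterministic."""

    rate_per_sec: float   # tokens added per second
    capacity: float       # maximum burst size
    tokens: float = 0.0
    last_refill: float = 0.0

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per client key, looked up on every response.
buckets: dict[str, TokenBucket] = {}


def check_egress(client: str, now: float, nbytes: float) -> bool:
    bucket = buckets.setdefault(
        client, TokenBucket(rate_per_sec=1000.0, capacity=4000.0, tokens=4000.0)
    )
    return bucket.allow(now, cost=nbytes)
```

The per-response cost in this model is one map lookup plus a few floating-point operations, which is roughly the shape of overhead the commit says it is simulating.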


meta-cla Bot added the CLA Signed label May 29, 2026

meta-codesync Bot added fb-exported meta-exported labels May 29, 2026

charles-typ force-pushed the export-D99338680-to-v2-beta branch from 2d4cc67 to df1e471 May 29, 2026 18:39

charles-typ force-pushed the export-D99338680-to-v2-beta branch from df1e471 to 2425c4d May 29, 2026 18:46

meta-codesync Bot changed the title ~~Add configurable cpu_work_us and bucket_lock_power for CPU tuning~~ Add configurable cpu_work_us and bucket_lock_power for CPU tuning (#644) May 29, 2026

charles-typ added 8 commits May 29, 2026 18:04

charles-typ force-pushed the export-D99338680-to-v2-beta branch from 2425c4d to f02f28d May 30, 2026 01:05

charles-typ wants to merge 8 commits into facebookresearch:v2-beta from charles-typ:export-D99338680-to-v2-beta