Add configurable cpu_work_us and bucket_lock_power for CPU tuning (#644)
Open
charles-typ wants to merge 8 commits into
@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338680.
2d4cc67 to df1e471
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request on May 29, 2026
…cebookresearch#644)

Summary:
Pull Request resolved: facebookresearch#644
- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680
df1e471 to 2425c4d
Summary:

## Problem

When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024).
2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)

Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)

Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server.

### 3. Client: Adaptive load control during warmup (TCP congestion control)

Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
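The warmup ramp described in this commit can be sketched as a small controller. This is a minimal illustration in Python, not the client's actual code: the class name `AdaptiveInflight` and the method `step` are hypothetical, while the numbers (initial 2, max 500, +50 per step, the 25% switch point, the 1% and 5% error thresholds) come from the summary. The summary does not say what happens when the error rate sits between 1% and 5%; the sketch assumes the limit is held steady.

```python
class AdaptiveInflight:
    """Hypothetical AIMD controller mirroring the warmup description."""

    def __init__(self, initial=2, max_inflight=500, linear_step=50):
        self.current = initial
        self.max_inflight = max_inflight
        self.linear_step = linear_step
        # Slow start ends once inflight passes 25% of the maximum.
        self.slow_start_limit = max_inflight // 4

    def step(self, error_rate):
        """Update the shared limit once per interval (2s in the summary)."""
        if error_rate > 0.05:
            # Multiplicative decrease: halve, but never drop below 1.
            self.current = max(1, self.current // 2)
        elif error_rate < 0.01:
            if self.current <= self.slow_start_limit:
                # Slow start: double while still at or below the switch point.
                self.current *= 2
            else:
                # Congestion avoidance: additive increase.
                self.current += self.linear_step
            self.current = min(self.current, self.max_inflight)
        # Assumption: between 1% and 5% errors, hold the limit steady.
        return self.current


# Error-free ramp from the initial value to the maximum.
ctl = AdaptiveInflight()
ramp = [ctl.current]
while ctl.current < ctl.max_inflight:
    ramp.append(ctl.step(error_rate=0.0))
print(ramp)
```

With no errors the ramp follows the sequence given in the summary: doubling up to 128, then steps of 50 until the cap of 500 is reached.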
Summary:
- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676
Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config.

Differential Revision: D99338673
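For readers unfamiliar with the first two features, here is a rough Python sketch of compound key construction followed by a MurmurHash2 pass. It is only an illustration under assumptions: the helper names `compound_key` and `murmur_hash2` are hypothetical, the field order is read off the example string "uc:pool:key:v1", and the 32-bit MurmurHash2 variant with seed 0 is assumed because the summary does not say which variant or seed production's getHashForKey uses.

```python
def compound_key(pool: str, key: str, version: int = 1) -> str:
    """Build a key shaped like the "uc:pool:key:v1" example (assumed layout)."""
    return f"uc:{pool}:{key}:v{version}"


def murmur_hash2(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash2 over a byte string."""
    m = 0x5BD1E995
    r = 24
    mask = 0xFFFFFFFF
    length = len(data)
    h = (seed ^ length) & mask

    # Mix four bytes at a time, read as little-endian words.
    n_blocks = length // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k

    # Fold in the remaining 0 to 3 bytes.
    tail = data[4 * n_blocks:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask

    # Final avalanche so the last bytes affect all output bits.
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


stored_key = compound_key("pool", "key")
key_hash = murmur_hash2(stored_key.encode("utf-8"))
print(stored_key, hex(key_hash))
```

The per-request cost being simulated is the string concatenation plus one pass over the key bytes, so it scales with key length.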
Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
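To make the "varint encoding, field headers" item concrete, the sketch below shows the building blocks of Thrift compact protocol integer serialization as described in the public protocol specification: zigzag mapping, base-128 varints, and the one-byte short-form field header. It is written in Python for illustration only and does not reproduce the simulation code in this commit; the function names are hypothetical.

```python
def zigzag32(n: int) -> int:
    """Map a signed 32-bit int to unsigned so small magnitudes stay small."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a little-endian base-128 varint."""
    if n < 0:
        raise ValueError("varint input must be non-negative; zigzag it first")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def field_header_short(field_id_delta: int, type_id: int) -> bytes:
    """Short-form field header: 4-bit id delta (1..15) and 4-bit type id."""
    if not 1 <= field_id_delta <= 15:
        raise ValueError("short form only covers deltas 1..15")
    return bytes([(field_id_delta << 4) | (type_id & 0x0F)])


COMPACT_TYPE_I32 = 5  # i32 type id in the compact protocol

# Serialize one i32 field with value -3 whose id follows the previous by 1.
encoded = field_header_short(1, COMPACT_TYPE_I32) + encode_varint(zigzag32(-3))
print(encoded.hex())
```

Each encoded integer costs a handful of shifts and branches per output byte, which is the kind of per-field work the simulation adds to every response.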
Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache.

Differential Revision: D99338674
Summary: Three additional production-like CPU overhead simulations:
- Egress rate limiting: thread-local F14 map simulating NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window + token bucket per-response checks
- KCB double-lookup: second CacheLib find for ~25% of requests, matching production Key Client Binding version mismatch path
- Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads matching production shouldLoadShed() per-request checks

Differential Revision: D99338675
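The token bucket check and the per-thread CPU clock read mentioned above can be illustrated with a short Python sketch. It is not the NetworkOverloadProtector logic: the `TokenBucket` class, its parameters, and the `thread_cpu_seconds` helper are hypothetical names chosen for this example. Python's `time.thread_time()` is used as a portable stand-in for reading the calling thread's CPU-time clock.

```python
import time


class TokenBucket:
    """Hypothetical per-key egress limiter: refill by elapsed time, spend per response."""

    def __init__(self, rate_per_sec: float, burst: float, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, cost: float, now: float) -> bool:
        # Refill for the time elapsed since the last check, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def thread_cpu_seconds() -> float:
    """CPU time consumed by the calling thread so far."""
    return time.thread_time()


# 100 bytes/s sustained with a 200-byte burst; timestamps are passed in
# explicitly so the behaviour is deterministic.
bucket = TokenBucket(rate_per_sec=100, burst=200)
decisions = [
    bucket.allow(150, now=0.0),  # fits in the initial burst
    bucket.allow(100, now=0.0),  # only 50 tokens left
    bucket.allow(100, now=0.5),  # 50 more tokens refilled after 0.5s
]
print(decisions, thread_cpu_seconds())
```

Passing the timestamp in explicitly keeps the limiter testable; a per-request caller would read a monotonic clock once and reuse the value for both checks.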
…cebookresearch#644)

Summary:
- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680
2425c4d to f02f28d
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request on May 30, 2026
…cebookresearch#644)

Summary:
- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680
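Summary:
- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680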
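As a rough picture of what a cpu_work_us style knob does, the Python sketch below spins on integer hash mixing until a deadline passes and shows how such a value might be accepted on the command line. It is an illustration under assumptions, not the code in this diff: the function names, the mixing constants, the `--cpu_work_us` flag spelling, and the default of 0 (disabled) are all chosen for the example.

```python
import argparse
import time


def cpu_work(work_us: int, seed: int = 0) -> int:
    """Burn roughly work_us microseconds of CPU on hash-style mixing."""
    if work_us <= 0:
        return seed  # disabled: no extra work
    deadline = time.perf_counter_ns() + work_us * 1_000
    h = seed
    while time.perf_counter_ns() < deadline:
        # Several mixing rounds per clock read so the loop is dominated
        # by arithmetic rather than by timer calls.
        for _ in range(16):
            h = (h * 0x9E3779B97F4A7C15 + 1) & 0xFFFFFFFFFFFFFFFF
            h ^= h >> 29
    return h  # returned so the work cannot be treated as dead code


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="busy-work demo")
    parser.add_argument(
        "--cpu_work_us",
        type=int,
        default=0,
        help="microseconds of CPU busy-work per request (0 disables)",
    )
    return parser.parse_args(argv)


# An explicit argument list is used here instead of reading sys.argv.
args = parse_args(["--cpu_work_us", "50"])
start = time.perf_counter_ns()
cpu_work(args.cpu_work_us)
elapsed_us = (time.perf_counter_ns() - start) / 1_000
print(f"requested {args.cpu_work_us}us, spent about {elapsed_us:.0f}us")
```

A deadline-based loop keeps the added cost close to the requested duration regardless of CPU speed, whereas a fixed iteration count would need recalibrating per machine.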