Add egress rate limiting, KCB double-lookup, and CPU load measurement #643
Open
charles-typ wants to merge 12 commits into
@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338675.
Summary: The cachelib_num_shards parameter was parsed from gflags and stored in UcacheBenchConfig but never actually applied to the CacheAllocator::Config. This meant the config value was silently ignored and CacheLib used its default of 8192 shards. Now call setNumShards() when cachelib_num_shards > 0, allowing the benchmark to match production shard counts for more accurate CPU utilization profiling.

Differential Revision: D96087814
Summary: Add support for configuring ThriftServer's socketMaxReadsPerEvent via CLI flag. This controls how many reads a single connection can perform per event loop iteration, which affects multi-client scalability.

Changes:
- Add rpc_socket_max_reads_per_event gflag to UcacheBenchRpcServer.cpp
- Apply flag value to thriftServer_->setSocketMaxReadsPerEvent()
- Add parameter to benchmark configs (debug/large/medium/small) with default value of 1 matching production ucache
- Add --rpc-socket-max-reads-per-event CLI arg in jobs_internal.yml
- Add parameter to ALLOWED_PARAMS in ucache_bench_benchmark.py

Reviewed By: excelle08

Differential Revision: D96763733
Summary: Add support for fiber-based request processing and verbose error logging in ucache_bench server and client.

Fiber configuration changes:
- Add enable_fibers flag to enable fiber-based request processing
- Add fiber_stack_size for configuring IO thread fiber stack size (default 64KB)
- Add fiber_max_pool_size for max preallocated free fibers (default 1000)
- Add fiber_pool_resize_period_ms for fiber pool resize period (default 1000ms)

Verbose logging changes:
- Add verbose parameter to server and client configs (default 0)
- Print detailed error messages for SET/GET failures when verbose is enabled
- Include carbon::Result error codes in log output for debugging

Files modified:
- Config JSON files: Added verbose parameter to server configs
- ucache_bench_benchmark.py: Added fiber params to ALLOWED_PARAMS
- jobs_internal.yml: Added CLI args for fiber config and verbose flag
- run.py: Added fiber and verbose CLI argument parsing
- UcacheBenchClient.cpp: Added verbose error logging for warmup and benchmark ops

Reviewed By: excelle08

Differential Revision: D96763783
Summary: Add NIC IRQ affinity configuration to ucache_bench, ported from TaoBench. This feature distributes network interrupt processing across CPUs to prevent IRQ handling from bottlenecking on a few cores.

New parameters:
- nic_channel_ratio: Ratio of NIC channels to logical cores (0.0 = disabled)
- interface_name: Network interface for IRQ affinity tuning (default: eth0)
- hard_binding: Hard bind NIC channels to specific CPU cores (default: 0)

Changes:
- Add affinitize_nic() function to configure NIC channels via ethtool and redistribute IRQ affinity using affinitize_nic.py script
- Add new CLI arguments to server: --nic-channel-ratio, --interface-name, --hard-binding
- Update install script to copy affinitize_nic scripts for OSS builds
- Add NIC affinity params to benchmark configs and jobs_internal.yml
- Add ucache_bench_debug_nic_affinity_configs.json for testing

Differential Revision: D96763816
Summary: The affinitize_nic() function was computing n_channels = int(n_cores * ratio) which could exceed the NIC's maximum supported combined channels. On T2 Turin machines with 316 logical cores and ratio=0.5, this computed 158 channels, but the NIC (Mellanox) only supports 128 max. The ethtool command silently degraded to 79 channels, breaking network connectivity.

Fix: Query ethtool -l to get the pre-set maximum combined channels and clamp n_channels to that value before calling ethtool -L.

Differential Revision: D98269551
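The clamping fix described above can be sketched roughly as follows. This is an illustration only, not the code from the diff: `parse_max_combined`, `clamp_channels`, and `SAMPLE_ETHTOOL_OUTPUT` are hypothetical names, and the sample text assumes the usual `ethtool -l` layout with a "Pre-set maximums" section followed by "Current hardware settings".

```python
from typing import Optional

# Assumed shape of `ethtool -l <iface>` output; real output may differ per driver.
SAMPLE_ETHTOOL_OUTPUT = """\
Channel parameters for eth0:
Pre-set maximums:
RX:             0
TX:             0
Other:          0
Combined:       128
Current hardware settings:
RX:             0
TX:             0
Other:          0
Combined:       63
"""


def parse_max_combined(ethtool_output: str) -> Optional[int]:
    """Return the 'Combined' value from the 'Pre-set maximums' section, if any."""
    in_max_section = False
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Pre-set maximums"):
            in_max_section = True
        elif stripped.startswith("Current hardware settings"):
            in_max_section = False
        elif in_max_section and stripped.startswith("Combined:"):
            value = stripped.split(":", 1)[1].strip()
            # Some drivers report "n/a"; treat that as "no known maximum".
            return int(value) if value.isdigit() else None
    return None


def clamp_channels(n_cores: int, ratio: float, max_combined: Optional[int]) -> int:
    """Compute int(n_cores * ratio) and clamp it to the NIC maximum when known."""
    requested = int(n_cores * ratio)
    if max_combined is not None:
        requested = min(requested, max_combined)
    return requested


if __name__ == "__main__":
    nic_max = parse_max_combined(SAMPLE_ETHTOOL_OUTPUT)
    print(clamp_channels(316, 0.5, nic_max))
```

With the numbers from the summary (316 cores, ratio 0.5, NIC maximum 128), the unclamped request would be 158 and the clamped value is 128.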
Summary:

## Problem

When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024).
2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)

Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)

Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server.

### 3. Client: Adaptive load control during warmup (TCP congestion control)

Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
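As a rough illustration of the adaptive load control described above, the following sketch models only the per-interval update of the shared inflight limit. The class and field names are hypothetical; the thresholds (1%, 5%, 25% of max, +50 per step) come from the summary. Two details are assumptions not stated in the summary: the limit is held unchanged when the error rate falls between 1% and 5%, and backoff never drops below 1.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveInflight:
    """Per-thread inflight limit updated once per control interval (2s in the summary)."""

    max_inflight: int = 500
    current: int = 2             # mirrors --warmup_initial_inflight
    linear_step: int = 50        # congestion-avoidance increment
    low_error: float = 0.01      # grow only while error rate is below this
    high_error: float = 0.05     # halve when error rate exceeds this

    def step(self, error_rate: float) -> int:
        """Apply one update given the error rate observed in the last interval."""
        if error_rate > self.high_error:
            # Backoff: multiplicative decrease (floor of 1 is an assumption).
            self.current = max(1, self.current // 2)
        elif error_rate < self.low_error:
            if self.current <= self.max_inflight * 0.25:
                # Slow start: double while at or below 25% of the maximum.
                self.current *= 2
            else:
                # Congestion avoidance: linear increase.
                self.current += self.linear_step
        # Between the two thresholds the limit is held (assumption).
        self.current = min(self.current, self.max_inflight)
        return self.current


if __name__ == "__main__":
    ctl = AdaptiveInflight()
    print([ctl.step(0.0) for _ in range(14)])
```

With a zero error rate the limit follows the progression given in the summary: doubling up to 128, then increasing by 50 per step until it is capped at 500.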
Summary:
- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
…acebookresearch#638)

Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676
Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config.

Differential Revision: D99338673
…bookresearch#641)

Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
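For readers unfamiliar with two of the primitives named above, here is a small reference sketch of unsigned varint encoding (the base-128 scheme used by the Thrift compact protocol) and a CRC32C checksum. It is not the code from the diff: the summary describes a hardware-accelerated CRC32C, whereas this is a plain bitwise software version using the reflected Castagnoli polynomial, and both function names are chosen here for illustration.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 data bits per byte)."""
    if n < 0:
        raise ValueError("varint encoding here handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)         # final byte: continuation bit clear
            return bytes(out)


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; XOR in the polynomial if the dropped bit was 1.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(encode_varint(300).hex())
    print(hex(crc32c(b"123456789")))
```

The bitwise loop is far slower than the SSE4.2 or ARMv8 CRC instructions a production server would use; it is shown only to make the per-value work concrete.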
…ch#642)

Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache.

Differential Revision: D99338674
Summary: Three additional production-like CPU overhead simulations:
- Egress rate limiting: thread-local F14 map simulating NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window + token bucket per-response checks
- KCB double-lookup: second CacheLib find for ~25% of requests, matching production Key Client Binding version mismatch path
- Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads matching production shouldLoadShed() per-request checks

Differential Revision: D99338675
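The three simulations above can be pictured with a minimal sketch like the one below. Everything here is hypothetical and heavily simplified: a plain token bucket stands in for the sliding window plus token bucket check, a `dict` stands in for CacheLib, the rule that picks roughly one key in four for the second lookup (CRC32 of the key modulo 4) is an assumption since the summary does not say how requests are selected, and `time.thread_time()` is used as the portable Python way to read per-thread CPU time.

```python
import time
import zlib
from typing import Dict, Optional


class TokenBucket:
    """Refills at `rate` tokens per second up to `burst`; one token per egress byte."""

    def __init__(self, rate: float, burst: float, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = now

    def allow(self, nbytes: int, now: float) -> bool:
        """Refill for the elapsed time, then try to spend `nbytes` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


def needs_second_lookup(key: bytes) -> bool:
    """Hypothetical selection rule: about one key in four takes the double-lookup path."""
    return zlib.crc32(key) % 4 == 0


def find_with_double_lookup(cache: Dict[bytes, bytes], key: bytes) -> Optional[bytes]:
    """First lookup always; a second lookup for the selected subset of keys."""
    value = cache.get(key)
    if value is not None and needs_second_lookup(key):
        value = cache.get(key)  # simulated version-mismatch path
    return value


def thread_cpu_seconds() -> float:
    """CPU time consumed by the calling thread, in seconds."""
    return time.thread_time()


if __name__ == "__main__":
    bucket = TokenBucket(rate=100.0, burst=200.0, now=0.0)
    print(bucket.allow(150, now=0.0), bucket.allow(100, now=0.0))
    print(find_with_double_lookup({b"k": b"v"}, b"k"))
    print(thread_cpu_seconds() >= 0.0)
```

Passing `now` explicitly keeps the bucket deterministic for testing; a server would read a monotonic clock instead.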