
Add configurable cpu_work_us and bucket_lock_power for CPU tuning (#644) #644

Open: charles-typ wants to merge 8 commits into facebookresearch:v2-beta from charles-typ:export-D99338680-to-v2-beta


charles-typ commented May 29, 2026

Summary:

  • Add cpu_work_us parameter: configurable microseconds of CPU busy-work per
    request using hash computation loops. Allows tuning server CPU utilization
    to match production levels.
  • Enable bucket_lock_power=20 in config (production default), adding
    CacheTable-style fiber mutex contention per request.
  • Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.
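
As a rough illustration of the bullets above, here is a minimal Python sketch. It is not the PR's C++ code: `burn_cpu_us`, `lock_bucket_index`, and the `--cpu-work-us` spelling are hypothetical names, BLAKE2 stands in for whatever hash the server actually loops over, and the sketch assumes that `bucket_lock_power=20` means 2^20 lock stripes selected by key hash.

```python
import argparse
import hashlib
import time


def burn_cpu_us(work_us: int, seed: bytes = b"request-key") -> int:
    """Spin on hash computation for about work_us microseconds; return rounds done."""
    if work_us <= 0:
        return 0  # 0 disables the busy-work
    deadline = time.perf_counter_ns() + work_us * 1_000
    digest = seed
    rounds = 0
    while time.perf_counter_ns() < deadline:
        # Feed each digest back in so the loop cannot be skipped or cached.
        digest = hashlib.blake2b(digest, digest_size=8).digest()
        rounds += 1
    return rounds


def lock_bucket_index(key: bytes, lock_power: int = 20) -> int:
    """Map a key to one of 2**lock_power lock stripes (assumed meaning of the flag)."""
    key_hash = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "little")
    return key_hash & ((1 << lock_power) - 1)


def parse_args(argv=None) -> argparse.Namespace:
    """Hypothetical argparse wiring; the real flag spelling in run.py may differ."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu-work-us", type=int, default=0)
    return parser.parse_args(argv)
```

A deadline-based loop keeps the per-request cost roughly constant across machines, whereas a fixed iteration count would scale with CPU speed. Which of the two the PR uses is not stated here.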

Differential Revision: D99338680

meta-cla Bot added the CLA Signed label May 29, 2026

meta-codesync Bot commented May 29, 2026

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99338680.

charles-typ force-pushed the export-D99338680-to-v2-beta branch from 2d4cc67 to df1e471 May 29, 2026 18:39
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 29, 2026
charles-typ force-pushed the export-D99338680-to-v2-beta branch from df1e471 to 2425c4d May 29, 2026 18:46
meta-codesync Bot changed the title from "Add configurable cpu_work_us and bucket_lock_power for CPU tuning" to "Add configurable cpu_work_us and bucket_lock_power for CPU tuning (#644)" May 29, 2026
Summary:
## Problem

Running multiple clients with `additional_fanout=500`, which simulates production's high
connection count (`num_proxies=64` × 501 = 32,064, about 32K connections per client),
triggers a TKO cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first
   requests, overwhelming the server's TCP accept queue (default backlog=1024).

2. **Warmup burst**: After connections are established, each client's 128 threads fire up
   to 500 in-flight requests apiece (`max_inflight=500`) at the same moment. With 2 clients,
   that is 128 × 500 × 2 = 128K concurrent requests hitting the server at once, which
   overflows mcrouter's internal queues (`mc_res_local_error`) and causes the server to be
   marked TKO. Once TKO is set, all subsequent requests fail immediately.
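
The two figures quoted above can be checked with a few lines of arithmetic. The constant names are illustrative, and reading 501 as `additional_fanout + 1` is an assumption inferred from the numbers rather than something the description states.

```python
NUM_PROXIES = 64
ADDITIONAL_FANOUT = 500
THREADS_PER_CLIENT = 128
MAX_INFLIGHT = 500
NUM_CLIENTS = 2

# Assumption: each proxy opens additional_fanout + 1 connections (500 + 1 = 501).
connections_per_client = NUM_PROXIES * (ADDITIONAL_FANOUT + 1)

# Worst case when every thread on every client reaches max_inflight at once.
concurrent_requests = THREADS_PER_CLIENT * MAX_INFLIGHT * NUM_CLIENTS

print(connections_per_client)  # 32064, the "32K" in the text
print(concurrent_requests)     # 128000, the "128K" in the text
```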

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog to 65536
Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)
Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period
(default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming
the server.
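
A minimal sketch of what such a paced ramp-up could look like, assuming the requests are spread evenly across the ramp window (the description does not say how the pacing is computed). `ramp_send_offsets`, `run_ramp`, and `send_and_wait` are hypothetical names, not the client's API.

```python
import time
from typing import Callable, List


def ramp_send_offsets(num_connections: int, ramp_seconds: float = 10.0) -> List[float]:
    """Evenly spaced send times (seconds from ramp start), one per connection."""
    if num_connections <= 0:
        return []
    interval = ramp_seconds / num_connections
    return [i * interval for i in range(num_connections)]


def run_ramp(
    send_and_wait: Callable[[], None],
    num_connections: int,
    ramp_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Issue one blocking request per slot, so at most one is ever outstanding."""
    start = clock()
    for offset in ramp_send_offsets(num_connections, ramp_seconds):
        delay = start + offset - clock()
        if delay > 0:
            sleep(delay)
        send_and_wait()  # returns only after the reply arrives
```

Blocking on each reply is what stands in for `maxOutstanding=1` here: a new connection is only triggered after the previous request has completed.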

### 3. Client: Adaptive load control during warmup (TCP-style congestion control)
Instead of launching all 128 threads at `max_inflight=500` simultaneously, the client uses
AIMD (Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)
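
The control rule described above can be sketched as a small state machine. This is an illustration rather than the client's code: `AimdController` is a hypothetical name, the caller is assumed to invoke `step` once per 2s interval with the measured error rate, and holding the limit steady when the error rate falls between 1% and 5% is an assumption, since the description does not cover that range.

```python
from dataclasses import dataclass


@dataclass
class AimdController:
    max_inflight: int = 500
    current: int = 2              # mirrors --warmup_initial_inflight
    low_error: float = 0.01       # grow only below this error rate
    high_error: float = 0.05      # back off above this error rate
    linear_step: int = 50
    slow_start_fraction: float = 0.25

    def step(self, error_rate: float) -> int:
        """Update and return the per-thread inflight limit for the next interval."""
        if error_rate > self.high_error:
            # Multiplicative decrease.
            self.current = max(1, self.current // 2)
        elif error_rate < self.low_error:
            if self.current <= self.max_inflight * self.slow_start_fraction:
                # Slow start: double until past 25% of max.
                self.current = min(self.current * 2, self.max_inflight)
            else:
                # Congestion avoidance: additive increase, capped at max.
                self.current = min(self.current + self.linear_step, self.max_inflight)
        # Between the two thresholds: hold (assumed behavior).
        return self.current


controller = AimdController()
ramp = [controller.step(0.0) for _ in range(14)]
print(ramp)  # [4, 8, 16, 32, 64, 128, 178, 228, 278, 328, 378, 428, 478, 500]
```

With zero errors the sketch reproduces the 2→4→...→128→178→...→500 progression quoted in the list. In the real client the resulting value would be published through the shared `currentMaxInflight` atomic.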

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
Summary:
- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
Summary:
Add configurable per-request CPU overhead simulation to ucachebench server to
help close the CPU utilization gap between ucachebench (~35% idle) and
production ucache (~9% idle). The simulation includes hash computation,
clock_gettime calls, and memory allocations that mimic production ACL checks,
CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676
Summary:
Implement production-like per-request overhead features to close the CPU
utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime
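
To make the first three features concrete, here is a hedged Python sketch. The field order in `build_compound_key` is inferred from the quoted "uc:pool:key:v1" pattern, the hash is the public 32-bit MurmurHash2 (production `getHashForKey` may use a different variant or seed), and a plain dict stands in for F14FastMap. All function names and the example prefix table are hypothetical.

```python
def build_compound_key(pool: str, key: str, version: int = 1) -> str:
    """Assemble a key shaped like "uc:pool:key:v1" (assumed layout)."""
    return f"uc:{pool}:{key}:v{version}"


def murmurhash2_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash2 over little-endian 4-byte blocks."""
    m = 0x5BD1E995
    r = 24
    mask = 0xFFFFFFFF
    length = len(data)
    h = (seed ^ length) & mask
    full = length - (length % 4)
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    tail = data[full:]
    if tail:
        # Remaining 1 to 3 bytes are folded in little-endian order.
        h ^= int.from_bytes(tail, "little")
        h = (h * m) & mask
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


# Hypothetical ACL table keyed by "uc:<pool>" prefix.
ALLOWED_PREFIXES = {"uc:poolA": True}


def acl_allows(compound_key: str) -> bool:
    """Look up the first two key segments in the prefix table."""
    prefix = ":".join(compound_key.split(":", 2)[:2])
    return ALLOWED_PREFIXES.get(prefix, False)
```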

Also adds --production-features flag to run.py, jobs_internal.yml, and
server main.cpp to enable these features via automark config.

Differential Revision: D99338673
Summary:
Adds three new production-like CPU overhead simulations to close the CPU
utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)
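
The kind of per-request work listed above can be sketched in Python as follows. Two substitutions are made and should be read as such: `zlib.crc32` is plain CRC-32, used here only because the standard library has no CRC32C, and a list of `bytes` joined with `b"".join` stands in for an IOBuf chain being coalesced. The framing layout in `checksum_and_frame` is invented for illustration.

```python
import zlib


def encode_varint(value: int) -> bytes:
    """Unsigned LEB128 varint: 7 payload bits per byte, high bit marks continuation."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def zigzag_i32(n: int) -> int:
    """Zigzag-map a signed 32-bit integer so small magnitudes encode compactly."""
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def checksum_and_frame(value: bytes) -> bytes:
    """Checksum the value, build a header, then chain and coalesce header + value."""
    checksum = zlib.crc32(value)  # CRC-32 stand-in, not hardware CRC32C
    header = encode_varint(len(value)) + checksum.to_bytes(4, "little")
    chain = [header, value]       # two-element chain: header buffer, value buffer
    return b"".join(chain)        # coalesce into one contiguous buffer


print(encode_varint(300).hex())  # ac02
```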

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
Summary:
Adds thread-local HotHashDetector matching production TLHotKeyTracker.
Production maintains two detectors per IO thread (QPS + egress hotness),
calling bumpHash() on every request and response. Each bumpHash() does
L1 counter increment, conditional L2 probe, and periodic maintenance
(counter decay, threshold adjustment). This adds ~2-3% CPU overhead
matching production ucache.
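
A loose Python sketch of the two-level counting described above. It is not the production detector: `HotHashSketch`, its sizes, thresholds, and decay rule are all assumed values chosen for illustration, and the threshold adjustment step is omitted.

```python
class HotHashSketch:
    def __init__(self, l1_size=1024, l1_threshold=8, l2_threshold=4, decay_every=1000):
        self.l1 = [0] * l1_size   # cheap per-slot counters, shared by colliding hashes
        self.l2 = {}              # exact per-hash counts, probed only for busy slots
        self.l1_threshold = l1_threshold
        self.l2_threshold = l2_threshold
        self.decay_every = decay_every
        self.calls = 0

    def bump_hash(self, key_hash: int) -> bool:
        """Record one access; return True if the hash currently looks hot."""
        self.calls += 1
        slot = key_hash % len(self.l1)
        self.l1[slot] += 1                       # L1 counter increment
        hot = False
        if self.l1[slot] >= self.l1_threshold:   # conditional L2 probe
            count = self.l2.get(key_hash, 0) + 1
            self.l2[key_hash] = count
            hot = count >= self.l2_threshold
        if self.calls % self.decay_every == 0:   # periodic maintenance
            self._decay()
        return hot

    def _decay(self) -> None:
        """Halve every counter so stale traffic ages out."""
        self.l1 = [c // 2 for c in self.l1]
        self.l2 = {h: c // 2 for h, c in self.l2.items() if c // 2 > 0}
```

Keeping one instance per thread, as the commit message describes, avoids any locking on the per-request path.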

Differential Revision: D99338674
Summary:
Three additional production-like CPU overhead simulations:
- Egress rate limiting: thread-local F14 map simulating
  NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window
  + token bucket per-response checks
- KCB double-lookup: second CacheLib find for ~25% of requests,
  matching production Key Client Binding version mismatch path
- Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads
  matching production shouldLoadShed() per-request checks
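
Two of these checks are easy to sketch in Python. `TokenBucket` is a generic token bucket with hypothetical parameters, not NetworkOverloadProtector's logic, and `thread_cpu_us` uses `time.thread_time_ns`, which Python documents as backed by `CLOCK_THREAD_CPUTIME_ID` on Unix systems that support it.

```python
import time


class TokenBucket:
    def __init__(self, rate_per_s: float, capacity: float):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = capacity    # start full
        self.last = 0.0           # timestamp of the previous check, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def thread_cpu_us() -> int:
    """CPU time consumed by the calling thread, in microseconds."""
    return time.thread_time_ns() // 1_000
```

Passing `now` explicitly keeps the bucket deterministic in tests; a server would pass a monotonic clock reading taken once per response.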

Differential Revision: D99338675
charles-typ force-pushed the export-D99338680-to-v2-beta branch from 2425c4d to f02f28d May 30, 2026 01:05
charles-typ added a commit to charles-typ/DCPerf that referenced this pull request May 30, 2026