Make FakeValuesService expression caches static to share across Faker instances#1819
Make FakeValuesService expression caches static to share across Faker instances#1819mferretti wants to merge 6 commits into
Conversation
PR Summary
|
There was a problem hiding this comment.
Pull request overview
Adds an opt-in way to share a single FakeValuesService across threads to avoid repeated YAML loading, plus a Faker factory that pairs the shared service with a per-thread Random. New concurrency tests exercise the singleton identity, behavior under load, and parity with a normally constructed Faker.
Changes:
- New
FakeValuesService.getShared(Locale)lazily caches per-locale instances in aConcurrentHashMap. - New
Faker.withSharedService(FakeValuesService, Locale, Random)factory wires a shared service with a per-threadRandomService. - New
SharedFakeValuesServiceTestcovering singleton identity, concurrent usage, and output parity.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/main/java/net/datafaker/service/FakeValuesService.java | Adds SHARED_INSTANCES map and getShared(Locale) factory. |
| src/main/java/net/datafaker/Faker.java | Adds withSharedService static factory for thread-local Fakers reusing a shared service. |
| src/test/java/net/datafaker/SharedFakeValuesServiceTest.java | New tests for concurrent identity, no-errors under load, and output parity. |
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1819 +/- ##
============================================
- Coverage 92.52% 92.26% -0.26%
- Complexity 3561 3567 +6
============================================
Files 345 345
Lines 7037 7086 +49
Branches 686 703 +17
============================================
+ Hits 6511 6538 +27
- Misses 364 368 +4
- Partials 162 180 +18 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@mferretti Please don't take my words as critics, I am just researching the problem. I feel that this solution heavily depends on the internal structure of DF. Ideally, the end-user should use only The bottleneck is redundant YAML loading, right? If we want to avoid redundant YAML loading, then shouldn't we just extract the YML loading to some smaller class, and cache its results? |
|
@asolntsev , none taken :) YAML loading is not the bottleneck — it's already cached globally in You're right that exposing Would that make sense? If so, I can revise the PR along those lines :-) |
|
@mferretti Yes, it seems absolutely reasonable to make UPD I realized that it was previously static, and I made in non-static in commit 751ab16 :) That's an irony! :) The reason was memory leak. Static map Maybe we should investigate #1263 once again and find a better solution for it. Or just use some Cache-like map for |
|
@asolntsev thanks for pointing this out. I'll have a look at the bug that made you opt for non static and try to put everything together. it'll have to wait a couple of days as i am heading out and will be back tomorrow late evening/night. |
|
No rush, enjoy your weekend!! |
|
Hello @asolntsev and @kingthorin
What I propose, but i'd need some validation from you guys, is :
On point 2, Obviously, I can recycle this pr or close this one and open a new one. Looking forward for your input ! |
|
Hi @mferretti !
|
|
Hi. The original PR added a public This revised PR implements exactly that — but with the additional structural work required to do it correctly (naively making the map static brings back the memory leak from issue #1263 / PR #1271).
Also, I added an L2 cache: L1 — static L2 — per-instance Resolution order: L2 hit → L1 hit (materialize + store L2) → full discovery (store L1 + materialize L2). The benchmarks tests i did locally:
In all honesty, I would suggest that you run your tests too as the performance gain, on my side, is clear with complex regex and a property based test scenario, but non existent with simple regex; plus I have added the L2 cache complexity. |
Replaces the per-instance REGEXP2SUPPLIER_MAP with a two-level static+instance design: - L1 RECIPE_MAP (static): keyed by CacheKey(String exp, SingletonLocale locale) — no per-Faker references. Stores context-free "recipe" resolvers shared across all Fakers with the same locale. Fixes the memory leak that blocked making this cache static (issue datafaker-net#1263): the old RegExpContext key held a strong reference to the BaseFaker root, keeping every Faker alive indefinitely. - L2 instanceMap (per-instance): stores resolvers pre-bound to this Faker's concrete provider instances for fast repeated calls within the same Faker. ValueResolver gains materialize(ProviderRegistration) and cacheable() to support the two-level contract. New resolver types (ProviderMethodResolver, ChainedCoercedResolver, InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2. expression2generex (RgxGen compiled-regex cache) is also made static Adds SharedFakeValuesServiceTest covering concurrent multi-Faker usage and determinism under caching.
…owth note - SafeFetchResolver.resolve(): guard against null root - resolveFromMethodOn(): guard null root before NamedProviderCoercedResolver branch - resolveExpression inner: guard null root before dotIndex provider lookup - RECIPE_MAP: expand Javadoc noting bounded-in-practice growth - accessor(): fix dead-code `methods == null` check → `methods.isEmpty()` - accessor(): fix two "Didn't accessor" log typos → "Didn't find accessor" - SharedFakeValuesServiceTest: call shutdownNow() on awaitTermination timeout Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ProviderMethodResolver.resolve(): return null when root is null - ProviderMethodResolver.materialize(): return this when root is null - resolveExpression outer loop: skip L2 caching when root is null, call recipe directly without materializing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RootCoercedResolver, NamedProviderCoercedResolver, and ChainedCoercedResolver all called root.getProvider() / fakerAccessor.invoke(root) without guarding against null root in both resolve() and materialize(). Pattern matches the existing guards on SafeFetchResolver and ProviderMethodResolver. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Seems reasonable to me, but someone that knows the code better should review or merge. |
Resolves the conflict in FakeValuesService introduced by upstream datafaker-net#1849 ("Do not create RegExpContext each time"). Upstream's change is a smaller refinement of the single-map cache that this PR's two-level (L1 static RECIPE_MAP / L2 per-instance instanceMap) design already supersedes, so the PR's design is kept. While reconciling, closed a correctness gap: the L2 instanceMap was keyed by expression only. A FakeValuesService shared across Fakers (public BaseFaker(FakeValuesService, FakerContext) constructor) could therefore hand a resolver pre-bound to one root/context to a different root, breaking per-root determinism — exactly the case upstream/base guarded via root/context identity. L2 entries now carry (root, context) via the new L2Entry record and are reused only when both identities match the current call; on mismatch the entry is recomputed and overwritten. The hot path adds only two reference comparisons (benchmarked as no measurable steady-state cost). Adds SharedFakeValuesServiceTest#sharedServiceAcrossRootsKeepsPerRootDeterminism, which fails without the guard (root B observed root A's random stream). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Czwt9bXVETbU5vJg2UwFwm
|
@asolntsev , @kingthorin good morning.
Times = median ms, 10k fresh Fakers same locale (last row = one Faker × 10k). Identical outputs both sides. test for single locale: import net.datafaker.Faker;
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;
// Usage: java -cp <datafaker+deps> Bench <vin|bic|simple> <new|single> [n]
public class Bench {
static long work(Faker f, String w) {
return switch (w) {
case "vin" -> f.vehicle().vin().length();
case "bic" -> f.finance().bic().length();
default -> f.regexify("[a-z]{5}").length();
};
}
static long modeNew(String w, int n) { // n fresh Fakers, same locale
long s = 0;
for (int i = 0; i < n; i++) s += work(new Faker(Locale.ENGLISH, new Random(i)), w);
return s;
}
static long modeSingle(String w, int n) { // one Faker, steady state
long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
for (int i = 0; i < n; i++) s += work(f, w);
return s;
}
public static void main(String[] args) {
String w = args.length > 0 ? args[0] : "vin";
String mode = args.length > 1 ? args[1] : "new";
int n = args.length > 2 ? Integer.parseInt(args[2]) : 10000;
java.util.function.IntToLongFunction run =
"single".equals(mode) ? k -> modeSingle(w, k) : k -> modeNew(w, k);
long guard = 0;
for (int i = 0; i < 5; i++) guard += run.applyAsLong(n); // warmup
double[] ms = new double[9];
for (int t = 0; t < 9; t++) {
long start = System.nanoTime();
guard += run.applyAsLong(n);
ms[t] = (System.nanoTime() - start) / 1_000_000.0;
}
Arrays.sort(ms);
System.out.printf("%-7s %-7s n=%d median=%6.1f ms min=%6.1f ms [guard=%d]%n",
w, mode, n, ms[4], ms[0], guard);
}
}multiple locales: Fresh Fakers —
Single long-lived Faker — locale switched per call via
Median ms, one Faker × 10k, locale rotated over 8 locales per call. Identical outputs both sides. test: import net.datafaker.Faker;
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;
// Usage: java -cp <datafaker+deps> Bench <vin|bic|simple> <new|single|multi|singlemulti> [n]
// new : n fresh Fakers, single locale (warm-up amortization)
// single : one Faker, n iterations, single locale (steady state)
// multi : n fresh Fakers, locale rotated (multi-locale warm-up)
// singlemulti : one Faker, locale switched per call (multi-locale steady state)
public class Bench {
private static final Locale[] LOCALES = {
Locale.ENGLISH, Locale.FRENCH, Locale.GERMAN, Locale.ITALIAN,
Locale.forLanguageTag("es"), Locale.forLanguageTag("pt"),
Locale.forLanguageTag("nl"), Locale.forLanguageTag("pl")
};
static long work(Faker f, String w) {
return switch (w) {
case "vin" -> f.vehicle().vin().length();
case "bic" -> f.finance().bic().length();
default -> f.regexify("[a-z]{5}").length();
};
}
static long modeNew(String w, int n) {
long s = 0;
for (int i = 0; i < n; i++) s += work(new Faker(Locale.ENGLISH, new Random(i)), w);
return s;
}
static long modeSingle(String w, int n) {
long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
for (int i = 0; i < n; i++) s += work(f, w);
return s;
}
static long modeMulti(String w, int n) {
long s = 0;
for (int i = 0; i < n; i++) s += work(new Faker(LOCALES[i % LOCALES.length], new Random(i)), w);
return s;
}
static long modeSingleMulti(String w, int n) {
long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
for (int i = 0; i < n; i++) {
final Locale loc = LOCALES[i % LOCALES.length];
s += f.doWith(() -> work(f, w), loc);
}
return s;
}
public static void main(String[] args) {
String w = args.length > 0 ? args[0] : "vin";
String mode = args.length > 1 ? args[1] : "new";
int n = args.length > 2 ? Integer.parseInt(args[2]) : 10000;
java.util.function.IntToLongFunction run = switch (mode) {
case "single" -> k -> modeSingle(w, k);
case "multi" -> k -> modeMulti(w, k);
case "singlemulti" -> k -> modeSingleMulti(w, k);
default -> k -> modeNew(w, k);
};
long guard = 0;
for (int i = 0; i < 5; i++) guard += run.applyAsLong(n); // warmup
double[] ms = new double[9];
for (int t = 0; t < 9; t++) {
long start = System.nanoTime();
guard += run.applyAsLong(n);
ms[t] = (System.nanoTime() - start) / 1_000_000.0;
}
Arrays.sort(ms);
System.out.printf("%-7s %-12s n=%d median=%6.1f ms min=%6.1f ms [guard=%d]%n",
w, mode, n, ms[4], ms[0], guard);
}
}In the multiple locales scenario the gain is smaller but the winning key is the fact the the regular expression is compiled only once so, in a scenario with multiple Faker instances and locale, it's practically a constant; what actually changes in the multiple scenario case is the fact that YAML interface map AND L1 method resolution cache are locale aware so there we lose something. |
|
I appreciate you staying on this, but I still think this needs more history/insight than I have for this code base. |
|
Same here. @snuyanzin , could you help out? |
Problem
Every
new Faker()starts with empty expression-resolution and regex-compilation caches. In workloads that create many short-lived Fakers — property-based testing frameworks (jqwik, QuickCheck), seeded-per-record data generation — the warm-up cost is paid repeatedly, even when Fakers share the same locale.Why "just make the map static" needed structural work
The existing
REGEXP2SUPPLIER_MAPusedRegExpContext(String exp, BaseFaker root, FakerContext context)as its cache key. Making it static would re-introduce the memory leak fixed in #1263 / PR #1271: therootfield holds a strong reference to theFakerinstance, so everynew Faker()adds a unique entry the GC can never collect.Fixing this requires separating what method to call (shareable, context-free) from which provider instance to call it on (per-Faker, bound at first use).
What changed
Two-level expression cache:
RECIPE_MAP(static) — keyed byCacheKey(String exp, SingletonLocale locale). No per-Faker references; safe to share globally. Stores context-free "recipe" resolvers.instanceMap(per-instance) — stores "materialized" resolvers pre-bound to this Faker's concrete provider instances. Subsequent calls on the same Faker skipgetProvider()entirely.ValueResolvergainsmaterialize(ProviderRegistration root)andcacheable()to support the two-level contract. New resolver types (ProviderMethodResolver,ChainedCoercedResolver,InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2.expression2generexmade static — theRgxGencompiled-regex cache was per-instance. This is the primary performance win: every new Faker recompiled the same patterns; now compilation happens once globally.No public API changes
FakeValuesServiceremains internal.Fakerconstructors are unchanged. No new public methods.Tests
SharedFakeValuesServiceTestcovers: