Make FakeValuesService expression caches static to share across Faker instances #1819

Open
mferretti wants to merge 6 commits into datafaker-net:main from mferretti:feature/1814

@mferretti

@mferretti mferretti commented May 18, 2026

Problem

Every new Faker() starts with empty expression-resolution and regex-compilation caches. In workloads that create many short-lived Fakers — property-based testing frameworks (jqwik, QuickCheck), seeded-per-record data generation — the warm-up cost is paid repeatedly, even when Fakers share the same locale.

Why "just make the map static" needed structural work

The existing REGEXP2SUPPLIER_MAP used RegExpContext(String exp, BaseFaker root, FakerContext context) as its cache key. Making it static would re-introduce the memory leak fixed in #1263 / PR #1271: the root field holds a strong reference to the Faker instance, so every new Faker() adds a unique entry the GC can never collect.

Fixing this requires separating what method to call (shareable, context-free) from which provider instance to call it on (per-Faker, bound at first use).
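
As a rough illustration of why the key shape matters, here is a minimal sketch. `Root`, `LeakyKey` and `SharedKey` are hypothetical stand-ins (not Datafaker classes), and a plain `Locale` is used where the PR uses `SingletonLocale`. A key that embeds the root produces one entry per Faker, while a key made only of expression and locale collapses to a single entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CacheKeyLeakSketch {
    /** Hypothetical stand-in for a Faker root; equality is identity-based. */
    static final class Root {}

    /** Shaped like the old key: it drags the root into the map. */
    record LeakyKey(String exp, Root root) {}

    /** Shaped like the new key: expression and locale only. */
    record SharedKey(String exp, Locale locale) {}

    /** Simulates n fresh roots resolving one expression; returns {leakySize, sharedSize}. */
    static int[] cacheSizesAfter(int n) {
        Map<LeakyKey, String> leaky = new HashMap<>();
        Map<SharedKey, String> shared = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Root root = new Root(); // one per "new Faker()"
            leaky.put(new LeakyKey("#{Name.first_name}", root), "recipe");
            shared.put(new SharedKey("#{Name.first_name}", Locale.ENGLISH), "recipe");
        }
        return new int[] {leaky.size(), shared.size()};
    }

    public static void main(String[] args) {
        int[] sizes = cacheSizesAfter(1_000);
        System.out.println("leaky=" + sizes[0] + " shared=" + sizes[1]);
    }
}
```

In a static map, every `LeakyKey` entry would also keep its root reachable, which is the leak described above.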

What changed

Two-level expression cache:

  • L1 RECIPE_MAP (static) — keyed by CacheKey(String exp, SingletonLocale locale). No per-Faker references; safe to share globally. Stores context-free "recipe" resolvers.
  • L2 instanceMap (per-instance) — stores "materialized" resolvers pre-bound to this Faker's concrete provider instances. Subsequent calls on the same Faker skip getProvider() entirely.

ValueResolver gains materialize(ProviderRegistration root) and cacheable() to support the two-level contract. New resolver types (ProviderMethodResolver, ChainedCoercedResolver, InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2.

expression2generex made static — the RgxGen compiled-regex cache was per-instance. This is the primary performance win: every new Faker recompiled the same patterns; now compilation happens once globally.
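
A minimal sketch of the idea, assuming `java.util.regex.Pattern` as a stand-in for the RgxGen compiled form (the real field and its value type are not shown here). The class name and the `COMPILATIONS` counter are hypothetical and exist only to make the effect visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

final class CompiledRegexCacheSketch {
    // Static: one cache for the whole JVM, shared by every instance that uses it.
    private static final Map<String, Pattern> COMPILED = new ConcurrentHashMap<>();

    // Counts real compilations, only so the effect is observable in this sketch.
    static final AtomicInteger COMPILATIONS = new AtomicInteger();

    /** Returns the compiled form of regex, compiling at most once per distinct string. */
    static Pattern compiled(String regex) {
        return COMPILED.computeIfAbsent(regex, r -> {
            COMPILATIONS.incrementAndGet();
            return Pattern.compile(r);
        });
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            compiled("[A-HJ-NPR-Z0-9]{17}"); // same pattern for every simulated instance
        }
        System.out.println("compilations=" + COMPILATIONS.get());
    }
}
```

With a per-instance map, the same loop over fresh instances would compile the pattern on every iteration.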

No public API changes

FakeValuesService remains internal. Faker constructors are unchanged. No new public methods.

Tests

SharedFakeValuesServiceTest covers:

  • 16 threads × 10k iterations of concurrent Faker creation — no errors, no races
  • 4 locales × 4 threads concurrent — no errors
  • Same seed → same output regardless of static cache state
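
The shape of such checks can be sketched without Datafaker. Everything below is hypothetical: `generate` stands in for "create a seeded Faker and call it once", and the static `TEMPLATES` map stands in for the shared caches. It is not the code of SharedFakeValuesServiceTest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedCacheSmokeSketch {
    // Static cache shared by every caller, standing in for the PR's static caches.
    private static final Map<String, String> TEMPLATES = new ConcurrentHashMap<>();

    /** Stand-in for "create a seeded generator and use it once". */
    static String generate(long seed) {
        String template = TEMPLATES.computeIfAbsent("id", k -> "id-%04d");
        return String.format(template, new Random(seed).nextInt(10_000));
    }

    /** Runs threads x iterations calls concurrently; returns the number of mismatches. */
    static int mismatches(int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    int bad = 0;
                    for (int i = 0; i < iterations; i++) {
                        // Same seed must give the same output, warm cache or cold.
                        if (!generate(i).equals(generate(i))) bad++;
                    }
                    return bad;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches=" + mismatches(16, 10_000));
    }
}
```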

@what-the-diff

what-the-diff Bot commented May 18, 2026

PR Summary

  • Introduction of a new feature in Faker.java: A new method has been added which allows the configuration of shared instances. This optimizes the system when operated in multi-user scenarios, enhancing performance and efficiency.

  • Updates to FakeValuesService.java: A shared information storage, akin to a vault, has been included to hold instances of value-creating services, in a safe and secure manner across multiple users.

  • Added a retrieval feature in FakeValuesService.java: A new method is introduced to get the shared instance of service, if it exists, or create a new one. This ensures smooth flow and operation, reducing delays due to repetitive creation of services.

  • Inclusion of a new testing method: Ensuring the stability and correctness of the shared instance when being accessed by multiple users. This includes:

    • Verifying the same service instance is given to different users.
    • Asserting smooth operation when multiple users are using the service.
    • Checking that the results of shared service instances are consistent with individual instances.

Copilot AI left a comment

Contributor

Pull request overview

Adds an opt-in way to share a single FakeValuesService across threads to avoid repeated YAML loading, plus a Faker factory that pairs the shared service with a per-thread Random. New concurrency tests exercise the singleton identity, behavior under load, and parity with a normally constructed Faker.

Changes:

  • New FakeValuesService.getShared(Locale) lazily caches per-locale instances in a ConcurrentHashMap.
  • New Faker.withSharedService(FakeValuesService, Locale, Random) factory wires a shared service with a per-thread RandomService.
  • New SharedFakeValuesServiceTest covering singleton identity, concurrent usage, and output parity.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/main/java/net/datafaker/service/FakeValuesService.java | Adds SHARED_INSTANCES map and getShared(Locale) factory. |
| src/main/java/net/datafaker/Faker.java | Adds withSharedService static factory for thread-local Fakers reusing a shared service. |
| src/test/java/net/datafaker/SharedFakeValuesServiceTest.java | New tests for concurrent identity, no-errors under load, and output parity. |

Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
@codecov-commenter

codecov-commenter commented May 20, 2026

Codecov Report

❌ Patch coverage is 73.03371% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.26%. Comparing base (7492624) to head (5cfffe7).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../java/net/datafaker/service/FakeValuesService.java | 73.03% | 8 Missing and 16 partials ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1819      +/-   ##
============================================
- Coverage     92.52%   92.26%   -0.26%     
- Complexity     3561     3567       +6     
============================================
  Files           345      345              
  Lines          7037     7086      +49     
  Branches        686      703      +17     
============================================
+ Hits           6511     6538      +27     
- Misses          364      368       +4     
- Partials        162      180      +18     

@asolntsev

Collaborator

@mferretti Please don't take my words as criticism; I am just researching the problem.

I feel that this solution depends heavily on the internal structure of DF. Ideally, the end user should only use new Faker() and shouldn't even need to know about any of its internals, like FakeValuesService.

The bottleneck is redundant YAML loading, right? If we want to avoid redundant YAML loading, shouldn't we just extract the YAML loading into a smaller class and cache its results?

@mferretti

Author

@asolntsev, none taken :)
It was food for thought and made me go back to the code.

YAML loading is not the bottleneck — it's already cached globally in FakeValues.of(). The real issue is regex compilation: FakeValuesService has a REGEXP2SUPPLIER_MAP that is a per-instance field, so every new Faker() starts with an empty regex compilation cache.

You're right that exposing FakeValuesService is leaky. The cleaner fix is simply making REGEXP2SUPPLIER_MAP static (like EXPRESSION_2_SPLITTED already is) — users just call new Faker() and get shared regex caching for free, no API changes needed.

Would that make sense? If so, I can revise the PR along those lines :-)

@asolntsev

Collaborator

@mferretti Yes, it seems absolutely reasonable to make REGEXP2SUPPLIER_MAP static. Seems it should solve the initial problem?

UPD: I realized that it was previously static, and I made it non-static in commit 751ab16 :) That's ironic! :)

The reason was a memory leak: the static REGEXP2SUPPLIER_MAP was growing endlessly, collecting an unbounded number of unique regexps.

Maybe we should investigate #1263 once again and find a better solution for it. Or just use some Cache-like map for REGEXP2SUPPLIER_MAP (which removes unused records after some time).
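
For reference, one way a "cache-like map" could look is a size-bounded LRU map built on `LinkedHashMap`. This is only a sketch: it evicts by size and access order rather than by elapsed time, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedCacheSketch {
    /** Returns a thread-safe map that keeps at most maxEntries recently used entries. */
    static <K, V> Map<K, V> lruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently accessed.
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // drop the least recently used entry
            }
        });
    }

    public static void main(String[] args) {
        Map<String, String> cache = lruCache(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds the bound and evicts "b"
        System.out.println(cache.keySet());
    }
}
```

A bound like this caps memory but can evict hot entries under churn, so it trades the leak for occasional recomputation.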

@mferretti

Author

@asolntsev thanks for pointing this out. I'll have a look at the bug that made you opt for non-static and try to put everything together. It'll have to wait a couple of days, as I am heading out and will be back late tomorrow evening.

@kingthorin

Collaborator

No rush, enjoy your weekend!!

@mferretti

Author

Hello @asolntsev and @kingthorin,
here are my findings (I am not questioning the reasons behind the changes or the architecture, just describing what I saw and understood):

  • Every new Faker() starts with an empty REGEXP2SUPPLIER_MAP — the cache that records which method resolves each YAML expression (e.g. #{Name.firstName} → Name.firstName()). In high-volume scenarios where many Faker instances are created, method-resolution work is repeated from scratch for every instance, causing slowdown.
  • REGEXP2SUPPLIER_MAP was previously static but was made per-instance in PR #1271 ("Fix memory leak") to fix a memory leak (the HashMap grew without bound because every Faker produced unique cache keys and entries were never evicted).

What I propose, though I'd need some validation from you, is:

  1. expression2generex becomes static (does not help yaml resolution but helps regexify)
  2. add a static map and implement a "context free" ValueResolver

On point 2: resolve is private, so there shouldn't be any breaking changes.
On point 1: the main caveat is the assumption that a provider is registered under the same name across all Faker roots. It is probably an edge case, but custom providers could be registered under non-standard names, so I'd add a test case.

Obviously, I can reuse this PR or close it and open a new one.

Looking forward to your input!

@mferretti mferretti marked this pull request as draft May 28, 2026 13:09
@asolntsev

Collaborator

Hi @mferretti !
Yes, it seems like a totally valid plan:

  • expression2generex becomes static (does not help yaml resolution but helps regexify)
  • add a static map and implement a "context free" ValueResolver

@mferretti

Author

Hi.
Sorry for the delay... life happens.
I had to revert what I did in the first 2 commits, as we are radically changing the view of what needs to be done.

The original PR added a public FakeValuesService.getShared(Locale) API and Faker.withSharedService() to let callers share a single FakeValuesService across multiple Faker instances, reducing the cost of repeated cache warm-up. @asolntsev correctly pointed out that exposing FakeValuesService publicly is too leaky. His suggestion: make the internal map static instead.

This revised PR implements exactly that — but with the additional structural work required to do it correctly (naively making the map static brings back the memory leak from issue #1263 / PR #1271).
The fix requires two steps:

  1. Change the key to CacheKey(String exp, SingletonLocale locale) — no per-Faker references.
  2. Separate which method to call (shareable) from which provider instance to call it on (per-Faker). The ValueResolver implementations previously closed over per-Faker provider instances, so they could not be shared across Fakers.

Also, I added an L2 cache:

L1 — static RECIPE_MAP: Map<CacheKey, ValueResolver>
Stores context-free "recipes": which provider class and which method to invoke. Shared across all Fakers with the same locale. No per-Faker references → no memory leak.

L2 — per-instance instanceMap: Map<String, ValueResolver>
Stores "materialized" resolvers: the recipe pre-bound to this Faker's concrete provider object. Fast repeated calls on the same Faker skip the getProvider() reflective call entirely.

Resolution order: L2 hit → L1 hit (materialize + store L2) → full discovery (store L1 + materialize L2).
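
The resolution order can be sketched as follows. This is a simplified illustration, not the PR's code: `Provider`, `Recipe` and `discover` are hypothetical stand-ins, and L1 is keyed by the expression alone where the PR uses CacheKey(exp, locale).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

final class TwoLevelCacheSketch {
    /** Hypothetical per-instance provider (stand-in for a Faker's provider object). */
    record Provider(String owner) {
        String firstName() { return owner + ":firstName"; }
    }

    /** Context-free recipe: given a provider, yields a resolver bound to it. */
    interface Recipe extends Function<Provider, Supplier<String>> {}

    // L1: static and shared; entries hold no reference to any instance.
    static final Map<String, Recipe> L1 = new ConcurrentHashMap<>();
    // Counts full discoveries, only to make the caching observable.
    static final AtomicInteger DISCOVERIES = new AtomicInteger();

    // L2: per instance; entries are already bound to this instance's provider.
    private final Map<String, Supplier<String>> l2 = new HashMap<>();
    private final Provider provider;

    TwoLevelCacheSketch(String owner) { this.provider = new Provider(owner); }

    String resolve(String exp) {
        Supplier<String> bound = l2.get(exp);   // L2 hit: nothing else to do
        if (bound == null) {
            // L1 hit, or full discovery on a miss (stored in L1 either way).
            Recipe recipe = L1.computeIfAbsent(exp, TwoLevelCacheSketch::discover);
            bound = recipe.apply(provider);     // materialize against this instance
            l2.put(exp, bound);                 // store in L2
        }
        return bound.get();
    }

    /** Stand-in for the expensive reflective lookup; ignores exp for brevity. */
    private static Recipe discover(String exp) {
        DISCOVERIES.incrementAndGet();
        return p -> p::firstName;
    }

    public static void main(String[] args) {
        System.out.println(new TwoLevelCacheSketch("a").resolve("#{Name.first_name}"));
        System.out.println(new TwoLevelCacheSketch("b").resolve("#{Name.first_name}"));
        System.out.println("discoveries=" + DISCOVERIES.get());
    }
}
```

Both instances share one discovery, yet each call runs against its own provider, which is the split between "what to call" and "which instance to call it on".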

The benchmarks I ran locally:

| Scenario | main | This PR | Delta |
| --- | --- | --- | --- |
| 10k new Fakers × vehicle().vin() (heavy regex) | 75 ms | 7 ms | 10× faster |
| 10k new Fakers × finance().bic() (medium regex) | 29 ms | 8 ms | 3.6× faster |
| Single Faker × 10k vehicle().vin() | 6 ms | 5 ms | unchanged |
| 1000 new Fakers × 3 calls (name/address/internet) | 10,275 µs | 7,940 µs | 23% faster |
| 10k single-shot Fakers × 1 call | 25,222 µs | 19,112 µs | 24% faster |
| Single Faker × 1000 calls | 5,024 µs | 4,278 µs | unchanged |

In all honesty, I would suggest that you run your own tests too: on my side the performance gain is clear with complex regexes in a property-based testing scenario, but nonexistent with simple regexes, and I have added the complexity of the L2 cache.

@mferretti mferretti marked this pull request as ready for review June 4, 2026 14:39
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java Outdated
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
@mferretti mferretti changed the title Add shared FakeValuesService for multi-threaded Faker instances Make FakeValuesService expression caches static to share across Faker instances Jun 4, 2026
mferretti and others added 5 commits June 7, 2026 08:37
Replaces the per-instance REGEXP2SUPPLIER_MAP with a two-level static+instance design:

- L1 RECIPE_MAP (static): keyed by CacheKey(String exp, SingletonLocale locale) — no per-Faker references. Stores context-free "recipe" resolvers shared across all Fakers with the same locale. Fixes the memory leak that blocked making this cache static (issue datafaker-net#1263): the old RegExpContext key held a strong reference to the BaseFaker root, keeping every Faker alive indefinitely.
- L2 instanceMap (per-instance): stores resolvers pre-bound to this Faker's concrete provider instances for fast repeated calls within the same Faker.

ValueResolver gains materialize(ProviderRegistration) and cacheable() to support the two-level contract. New resolver types (ProviderMethodResolver, ChainedCoercedResolver, InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2.

expression2generex (RgxGen compiled-regex cache) is also made static

Adds SharedFakeValuesServiceTest covering concurrent multi-Faker usage and determinism under caching.
…owth note

- SafeFetchResolver.resolve(): guard against null root
- resolveFromMethodOn(): guard null root before NamedProviderCoercedResolver branch
- resolveExpression inner: guard null root before dotIndex provider lookup
- RECIPE_MAP: expand Javadoc noting bounded-in-practice growth
- accessor(): fix dead-code `methods == null` check → `methods.isEmpty()`
- accessor(): fix two "Didn't accessor" log typos → "Didn't find accessor"
- SharedFakeValuesServiceTest: call shutdownNow() on awaitTermination timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ProviderMethodResolver.resolve(): return null when root is null
- ProviderMethodResolver.materialize(): return this when root is null
- resolveExpression outer loop: skip L2 caching when root is null,
  call recipe directly without materializing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RootCoercedResolver, NamedProviderCoercedResolver, and ChainedCoercedResolver
all called root.getProvider() / fakerAccessor.invoke(root) without guarding
against null root in both resolve() and materialize(). Pattern matches the
existing guards on SafeFetchResolver and ProviderMethodResolver.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kingthorin

Collaborator

Seems reasonable to me, but someone that knows the code better should review or merge.

@asolntsev asolntsev added this to the 2.6.0 milestone Jun 7, 2026
@asolntsev asolntsev added the enhancement New feature or request label Jun 7, 2026
@asolntsev asolntsev added java Pull requests that update Java code refactoring labels Jun 7, 2026
@asolntsev asolntsev removed this from the 2.6.0 milestone Jun 18, 2026
Resolves the conflict in FakeValuesService introduced by upstream datafaker-net#1849
("Do not create RegExpContext each time"). Upstream's change is a smaller
refinement of the single-map cache that this PR's two-level (L1 static
RECIPE_MAP / L2 per-instance instanceMap) design already supersedes, so the
PR's design is kept.

While reconciling, closed a correctness gap: the L2 instanceMap was keyed by
expression only. A FakeValuesService shared across Fakers (public
BaseFaker(FakeValuesService, FakerContext) constructor) could therefore hand a
resolver pre-bound to one root/context to a different root, breaking per-root
determinism — exactly the case upstream/base guarded via root/context identity.

L2 entries now carry (root, context) via the new L2Entry record and are reused
only when both identities match the current call; on mismatch the entry is
recomputed and overwritten. The hot path adds only two reference comparisons
(benchmarked as no measurable steady-state cost).

Adds SharedFakeValuesServiceTest#sharedServiceAcrossRootsKeepsPerRootDeterminism,
which fails without the guard (root B observed root A's random stream).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Czwt9bXVETbU5vJg2UwFwm
@mferretti

Author

@asolntsev, @kingthorin good morning.
It's awfully hot here, so I woke up early to take advantage of the "cool 25C" early morning and took a look at the conflict.
The conflict comes from #1849, which reworked the same single-map cache that this PR replaces with the L1/L2 design. I kept the L1/L2 design and worked to stay compatible with upstream. In doing so I discovered that this PR had a gap: the L2 cache was keyed by expression only, so a FakeValuesService shared across Fakers could hand a resolver bound to one root to another. L2 entries now carry (root, context) and are reused only on identity match.
I added a test, SharedFakeValuesServiceTest#sharedServiceAcrossRootsKeepsPerRootDeterminism, that fails if the guard introduced in #1849 is missing; it checks whether root B observes root A's random stream.
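
To make the identity guard concrete, here is a minimal sketch. The `L2Entry` shape shown is a guess based on the description above, and `Root`, `resolve` and the `rebinds` counter are hypothetical; the real fields and types may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class IdentityGuardedL2Sketch {
    /** Hypothetical root; compared by reference, never by equals. */
    static final class Root {
        final String name;
        Root(String name) { this.name = name; }
    }

    /** Cached resolver plus the identities it was bound to. */
    record L2Entry(Root root, Object context, Supplier<String> bound) {
        boolean matches(Root r, Object c) { return root == r && context == c; }
    }

    private final Map<String, L2Entry> l2 = new HashMap<>();
    int rebinds = 0; // how often an entry had to be built or rebuilt

    String resolve(String exp, Root root, Object context) {
        L2Entry entry = l2.get(exp);
        if (entry == null || !entry.matches(root, context)) {
            rebinds++;
            entry = new L2Entry(root, context, () -> exp + "@" + root.name);
            l2.put(exp, entry); // overwrite the stale binding
        }
        return entry.bound().get();
    }

    public static void main(String[] args) {
        IdentityGuardedL2Sketch shared = new IdentityGuardedL2Sketch();
        Object context = new Object();
        Root a = new Root("A");
        Root b = new Root("B");
        System.out.println(shared.resolve("vin", a, context));
        System.out.println(shared.resolve("vin", b, context)); // rebinds instead of reusing A's resolver
        System.out.println("rebinds=" + shared.rebinds);
    }
}
```

Without the `matches` check, the second call would return the resolver bound to root A.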
I reran my performance tests, covering both a single-locale case and a multiple-locale case:

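| Workload | regex | this PR (ms) | main (ms) | speedup |
| --- | --- | --- | --- | --- |
| vehicle().vin() | complex | 6.6 | 65.1 | ~10× |
| finance().bic() | medium | 8.0 | 29.2 | ~3.6× |
| regexify("[a-z]{5}") | simple | 2.6 | 9.1 | ~3.5× |
| single Faker × 10k vin() | steady | 7.2 | 8.3 | ~parity |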

Times = median ms, 10k fresh Fakers same locale (last row = one Faker × 10k). Identical outputs both sides.

test for single locale:

```java
import net.datafaker.Faker;
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;

// Usage: java -cp <datafaker+deps> Bench <vin|bic|simple> <new|single> [n]
public class Bench {
    static long work(Faker f, String w) {
        return switch (w) {
            case "vin" -> f.vehicle().vin().length();
            case "bic" -> f.finance().bic().length();
            default    -> f.regexify("[a-z]{5}").length();
        };
    }
    static long modeNew(String w, int n) {           // n fresh Fakers, same locale
        long s = 0;
        for (int i = 0; i < n; i++) s += work(new Faker(Locale.ENGLISH, new Random(i)), w);
        return s;
    }
    static long modeSingle(String w, int n) {        // one Faker, steady state
        long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
        for (int i = 0; i < n; i++) s += work(f, w);
        return s;
    }
    public static void main(String[] args) {
        String w = args.length > 0 ? args[0] : "vin";
        String mode = args.length > 1 ? args[1] : "new";
        int n = args.length > 2 ? Integer.parseInt(args[2]) : 10000;
        java.util.function.IntToLongFunction run =
            "single".equals(mode) ? k -> modeSingle(w, k) : k -> modeNew(w, k);
        long guard = 0;
        for (int i = 0; i < 5; i++) guard += run.applyAsLong(n);   // warmup
        double[] ms = new double[9];
        for (int t = 0; t < 9; t++) {
            long start = System.nanoTime();
            guard += run.applyAsLong(n);
            ms[t] = (System.nanoTime() - start) / 1_000_000.0;
        }
        Arrays.sort(ms);
        System.out.printf("%-7s %-7s n=%d  median=%6.1f ms  min=%6.1f ms  [guard=%d]%n",
            w, mode, n, ms[4], ms[0], guard);
    }
}
```

multiple locales:

Fresh Fakers — new Faker(locale, …) per call (issue #1814: many short-lived Fakers)

| Workload | regex | this PR (ms) | main (ms) | speedup |
| --- | --- | --- | --- | --- |
| vehicle().vin() | complex | 9.4 | 68.4 | ~7.3× |
| finance().bic() | medium | 12.8 | 32.2 | ~2.5× |
| regexify("[a-z]{5}") | simple | 5.6 | 12.5 | ~2.2× |

Single long-lived Faker — locale switched per call via doWith (steady state)

| Workload | regex | this PR (ms) | main (ms) | delta |
| --- | --- | --- | --- | --- |
| vehicle().vin() | complex | 10.6 | 11.0 | ~parity |
| finance().bic() | medium | 7.6 | 12.2 | ~1.6x * |
| regexify("[a-z]{5}") | simple | 4.2 | 4.4 | ~parity |

Median ms, one Faker × 10k, locale rotated over 8 locales per call. Identical outputs both sides.
* bic() rebuilds its regex per country code → this PR's static compile-cache wins even on a single Faker.

test:

```java
import net.datafaker.Faker;
import java.util.Arrays;
import java.util.Locale;
import java.util.Random;

// Usage: java -cp <datafaker+deps> Bench <vin|bic|simple> <new|single|multi|singlemulti> [n]
//   new          : n fresh Fakers, single locale          (warm-up amortization)
//   single       : one Faker, n iterations, single locale (steady state)
//   multi        : n fresh Fakers, locale rotated          (multi-locale warm-up)
//   singlemulti  : one Faker, locale switched per call     (multi-locale steady state)
public class Bench {
    private static final Locale[] LOCALES = {
        Locale.ENGLISH, Locale.FRENCH, Locale.GERMAN, Locale.ITALIAN,
        Locale.forLanguageTag("es"), Locale.forLanguageTag("pt"),
        Locale.forLanguageTag("nl"), Locale.forLanguageTag("pl")
    };

    static long work(Faker f, String w) {
        return switch (w) {
            case "vin" -> f.vehicle().vin().length();
            case "bic" -> f.finance().bic().length();
            default    -> f.regexify("[a-z]{5}").length();
        };
    }

    static long modeNew(String w, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += work(new Faker(Locale.ENGLISH, new Random(i)), w);
        return s;
    }
    static long modeSingle(String w, int n) {
        long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
        for (int i = 0; i < n; i++) s += work(f, w);
        return s;
    }
    static long modeMulti(String w, int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += work(new Faker(LOCALES[i % LOCALES.length], new Random(i)), w);
        return s;
    }
    static long modeSingleMulti(String w, int n) {
        long s = 0; Faker f = new Faker(Locale.ENGLISH, new Random(1));
        for (int i = 0; i < n; i++) {
            final Locale loc = LOCALES[i % LOCALES.length];
            s += f.doWith(() -> work(f, w), loc);
        }
        return s;
    }

    public static void main(String[] args) {
        String w = args.length > 0 ? args[0] : "vin";
        String mode = args.length > 1 ? args[1] : "new";
        int n = args.length > 2 ? Integer.parseInt(args[2]) : 10000;
        java.util.function.IntToLongFunction run = switch (mode) {
            case "single"      -> k -> modeSingle(w, k);
            case "multi"       -> k -> modeMulti(w, k);
            case "singlemulti" -> k -> modeSingleMulti(w, k);
            default            -> k -> modeNew(w, k);
        };
        long guard = 0;
        for (int i = 0; i < 5; i++) guard += run.applyAsLong(n);   // warmup
        double[] ms = new double[9];
        for (int t = 0; t < 9; t++) {
            long start = System.nanoTime();
            guard += run.applyAsLong(n);
            ms[t] = (System.nanoTime() - start) / 1_000_000.0;
        }
        Arrays.sort(ms);
        System.out.printf("%-7s %-12s n=%d  median=%6.1f ms  min=%6.1f ms  [guard=%d]%n",
            w, mode, n, ms[4], ms[0], guard);
    }
}
```

In the multiple-locale scenario the gain is smaller, but the key win is that each regular expression is compiled only once, so with multiple Faker instances and locales that cost is practically constant. What changes in the multiple-locale case is that the YAML interface map and the L1 method-resolution cache are locale-aware, so we lose some of the benefit there.

@kingthorin

Collaborator

I appreciate you staying on this, but I still think this needs more history/insight than I have for this code base.

@bodiam

bodiam commented Jun 26, 2026

Contributor

Same here. @snuyanzin , could you help out?

Labels

enhancement New feature or request java Pull requests that update Java code refactoring

6 participants