Explainer: Built-in Embedding API

This proposal is an early design sketch by the Chrome Built-in AI team to describe the problem below and solicit feedback on the proposed solution. It has not been approved to ship in Chrome.

Proponents

Chrome Built-in AI Team

Participate

https://github.com/explainers-by-googlers/embedding-api/issues

Introduction

The Embedding API is a proposed Web Platform API that allows developers to generate high-dimensional vector representations (embeddings) of text content directly on the user's device. By leveraging an on-device model built into the browser, this API enables powerful semantic understanding features without the latency, cost, and privacy trade-offs associated with cloud-based embedding services, or the heavy network bandwidth and multiplied storage costs imposed on users by DIY client-side alternatives.

Goals

Privacy-First Design: Raw text can remain on-device. Embeddings are generated locally, enabling developers to process sensitive user data entirely on-device.
Low Latency: Enable real-time semantic features (e.g., proactive moderation while a user types) by eliminating network round-trips.
API Simplicity: Provide a high-level JavaScript interface that is easy for generalist web developers to use, without requiring expertise in Machine Learning.
Interoperability: Expose a standardized JavaScript interface that aligns with the design patterns established by other Web Machine Learning APIs (such as the Summarizer and Translation APIs).
- Note on Model Versioning and Selection: Because raw embedding vectors are inherently tied to the embedding space determined by the specific model that generated them, true interoperability requires a mechanism for vector space alignment. This might be provided by vector space identification, embedding invalidation, or stronger guarantees around model selection and versioning. Vectors generated by models targeting different embedding spaces cannot be directly compared. While standardizing the JS interface is the first step, we are actively exploring how to ensure the raw vectors exposed by this API are practically interoperable across different browser implementations and compatible with embeddings generated by servers.

Non-goals

Integrated Storage: This API does not provide a built-in vector database. Developers will use existing Web Platform storage (e.g., IndexedDB, OPFS) for storing and searching generated embeddings. However, we aim to guide users toward sensible patterns and best practices for managing this data locally, and we anticipate that the broader JavaScript ecosystem will build ergonomic libraries and tooling around this API to simplify vector search and storage.
Model Training: This API is for inference only. It does not support training or fine-tuning models on the device.
Complex ML Knobs: Providing advanced model parameters initially is out of scope. The focus is on the core functionality.

User research

Developers have shown strong interest in building contextual features like semantic search and recommendations. However, they are sometimes blocked by the cost and privacy risks of server-side AI, and the severe storage bloat of DIY client-side solutions where each site must download its own large model.

We hope these initial signals of developer demand will justify formal user research into the use cases that a semantic embedding API would support.

Use cases

Semantic Search

A note-taking web app could offer "semantic search" to find notes based on meaning, not just keywords, entirely on-device and private to the user.

Retrieval-Augmented Generation (RAG)

A documentation site could build a simple, offline-capable Q&A bot (RAG) that finds the most relevant passages to answer a user's question.

Real-time Content Intelligence

A small, community-run forum could offer real-time, on-device moderation hints, flagging potentially toxic comments as the user is typing, before the content is ever sent to a server.

Potential Solution

The API follows the standard availability -> create -> execute pattern of other Built-In AI APIs.

Note: The name "SemanticEmbedder" is currently an open question due to potential overloading. See "Considered alternatives" for other names being discussed.

Basic Usage and Cosine Similarity

In this example, we generate embeddings for two different strings and compare them using a simple cosine similarity function. This demonstrates how a developer can determine the semantic closeness of two inputs.

// 1. Check if the API is available (assumes this code runs inside an async function)
if (typeof SemanticEmbedder === "undefined" || (await SemanticEmbedder.availability()) === "unavailable") {
  console.error("Embedding model is not available on this device.");
  return;
}

// 2. Create the embedder
const semanticEmbedder = await SemanticEmbedder.create();

// 3. Embed two strings
const result1 = await semanticEmbedder.embed("The quick brown fox jumps over the lazy dog.");
const result2 = await semanticEmbedder.embed("A fast, dark-colored fox leaps over a resting hound.");

// 4. Extract the vectors
const vector1 = result1.embeddings[0].values;
const vector2 = result2.embeddings[0].values;

// 5. Compare their semantic similarity
const similarity = cosineSimilarity(vector1, vector2);
console.log(`Similarity score: ${similarity}`); // High similarity expected

// 6. Proactively release resources
semanticEmbedder.destroy();

// --- Utility Function ---
function cosineSimilarity(vecA, vecB) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

Batching for Document Retrieval

For RAG (Retrieval-Augmented Generation) or document search, developers will often need to embed multiple passages at once. The API supports batching via a polymorphic embed() method.

const semanticEmbedder = await SemanticEmbedder.create();

// A document that the developer has already chunked into passages
const passages = [
  "Built-in AI APIs use on-device models.",
  "Embeddings are high-dimensional vectors representing semantic meaning.",
  "The Prompt API facilitates direct usage of a language model."
];

// Embed the entire batch in one call; batchResult.embeddings.length === passages.length
const batchResult = await semanticEmbedder.embed(passages);

// Destroy the embedder immediately to free up memory
semanticEmbedder.destroy();

// Store the vectors in a local vector database (e.g., IndexedDB)
for (let i = 0; i < batchResult.embeddings.length; i++) {
  const vector = batchResult.embeddings[i].values;
  // statistics is a possible future extension, so it may be absent
  const tokenCount = batchResult.embeddings[i].statistics?.tokenCount;
  await myLocalVectorDB.insert({
    text: passages[i],
    embedding: vector,
    tokens: tokenCount
  });
}
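
Once passages are stored, a query can be answered by embedding the query text and ranking the stored records by similarity. The following is a minimal sketch of that ranking step, assuming records shaped like the ones inserted above ({ text, embedding }). The helper names cosine and rankBySimilarity are hypothetical and not part of the proposed API, and the toy 3-dimensional vectors stand in for real embeddings.

```javascript
// Hypothetical helper: cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the k stored records most similar to the query.
function rankBySimilarity(queryVector, records, k = 3) {
  return records
    .map((record) => ({ ...record, score: cosine(queryVector, record.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors for illustration only; real embeddings have many more dimensions.
const storedRecords = [
  { text: "on-device models", embedding: Float32Array.of(1, 0, 0) },
  { text: "semantic vectors", embedding: Float32Array.of(0, 1, 0) },
  { text: "language model prompts", embedding: Float32Array.of(0.9, 0.1, 0) },
];
const bestMatches = rankBySimilarity(Float32Array.of(1, 0, 0), storedRecords, 2);
console.log(bestMatches.map((r) => r.text));
```

A linear scan like this is only reasonable for small collections; larger ones would likely need an index provided by an ecosystem library.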

How this solution would solve the use cases

By returning embedding vectors (as Float32Array values within a structured result), developers can pass them into existing libraries to compare embeddings (e.g., via cosine similarity), store them locally in vector databases, or use them as input for client-side classifiers.

// 1. Generate embeddings for two strings
const semanticEmbedder = await SemanticEmbedder.create();
const resultA = await semanticEmbedder.embed("How to bake a cake");
const resultB = await semanticEmbedder.embed("Cake recipe");
const vectorA = resultA.embeddings[0].values;
const vectorB = resultB.embeddings[0].values;

// 2. Hand-waved utility function for cosine similarity
function computeCosineSimilarity(vecA, vecB) {
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] * vecA[i];
    normB += vecB[i] * vecB[i];
  }
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

const similarity = computeCosineSimilarity(vectorA, vectorB);
console.log(`Similarity score: ${similarity}`);

Embedding Space and Compatibility

For developers to utilize the generated Float32Array effectively (e.g., performing client-server hybrid retrieval, or clustering within a vector database), they must know the mathematical vector space the embeddings belong to.

Initial API prototypes should use spaces available from open-weight models, like EmbeddingGemma (open-weights google/embeddinggemma-300m) or Qwen3Embedding (open-weights Qwen/Qwen3-Embedding-0.6B), to maximize prospective compatibility. Because the underlying model might evolve, developers need a way to ensure the embeddings they compute today are comparable to embeddings they compute tomorrow. To let developers verify this consistency, the EmbedderResult object will include an EmbedderMetadata dictionary. This metadata will expose information such as the embeddingSpace or model identifier, allowing developers to safely version their local vector databases and know exactly which cloud-based models they are mathematically compatible with.

Detailed design discussion

Input Mapping and Truncation

Unlike generative APIs that might return simple strings, the embed() method returns a structured EmbedderResult object rather than a raw array. Currently, this object encapsulates the sequence of embeddings (as Float32Array values), which corresponds strictly 1:1 with the inputs provided in a batch. The API does not automatically chunk large text inputs. Developers must pre-chunk large documents themselves (for example, by using the model's open-sourced tokenizer to stay within the 2048-token limit) and pass the chunks as an array to process them in full.

Returning a dictionary rather than a raw array provides the flexibility to extend the API in the future—such as including token consumption statistics, truncation warnings, max input tokens, or multi-modal metadata—without breaking backwards compatibility for early adopters.

// Example of the structured EmbedderResult object returned by embed()
{
  embeddings: [
    {
      values: Float32Array(768) [0.0023, -0.0093, ...],
      // Future extensibility for usage requirements:
      // statistics: { tokenCount: 8, truncated: false } 
    }
  ],
  // Future extensibility for model compatibility:
  metadata: {
    embeddingSpace: "embeddinggemma-300m",
    maxInputTokens: 2048
  }
}

Model Versioning and Consistency

Embedding vectors are mathematically tied to the specific model version that generated them. If the underlying model is updated, old embeddings may become invalid. The API may need to expose information about the embedding space or model version, so developers can track when they need to re-index their data.
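
One way a developer might track this is to store the embedding space identifier next to each vector and compare it against the current one at startup. The sketch below assumes the tentative metadata.embeddingSpace field described earlier and a hypothetical record shape of { text, embeddingSpace }; the function name partitionBySpace and the "older-space" identifier are made up for illustration.

```javascript
// Hypothetical helper: split stored records into those still comparable with
// the current embedding space and those that need to be re-embedded.
function partitionBySpace(records, currentSpace) {
  const fresh = [];
  const stale = [];
  for (const record of records) {
    (record.embeddingSpace === currentSpace ? fresh : stale).push(record);
  }
  return { fresh, stale };
}

// "older-space" is a placeholder identifier, not a real model name.
const storedNotes = [
  { text: "note A", embeddingSpace: "embeddinggemma-300m" },
  { text: "note B", embeddingSpace: "older-space" },
];
const { fresh, stale } = partitionBySpace(storedNotes, "embeddinggemma-300m");
console.log(stale.map((r) => r.text)); // records to re-embed
```

Records in the stale list would be re-embedded with the current model before being searched alongside the fresh ones.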

Built-in Comparison Method

Currently, developers must manually calculate the distance between vectors (e.g., via a custom cosine similarity function) or rely on third-party libraries. We are actively discussing whether to provide a built-in static comparison utility (e.g., SemanticEmbedder.compare(resultA, resultB)). A native method would provide a well-lit path for ML novices to gauge semantic similarity. Furthermore, if the comparison method accepts strongly typed EmbedderResult objects rather than raw arrays, the browser could automatically validate the embedding space identifiers, throwing an error if a developer attempts to compare mathematically incompatible vectors.
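
To make the validation idea concrete, here is a rough userland sketch of what such a comparison could do. It is not the proposed built-in: compareResults is a hypothetical name, and the result objects are hand-built to match the EmbedderResult shape shown earlier, including the tentative metadata.embeddingSpace field.

```javascript
// Hypothetical stand-in for a built-in comparison utility.
// Refuses to compare results whose embedding spaces are missing or differ.
function compareResults(resultA, resultB) {
  const spaceA = resultA.metadata?.embeddingSpace;
  const spaceB = resultB.metadata?.embeddingSpace;
  if (spaceA === undefined || spaceA !== spaceB) {
    throw new TypeError(`Incompatible embedding spaces: ${spaceA} vs ${spaceB}`);
  }
  const a = resultA.embeddings[0].values;
  const b = resultB.embeddings[0].values;
  if (a.length !== b.length) {
    throw new RangeError("Vector lengths differ");
  }
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hand-built results with toy 2-dimensional vectors.
const space = "embeddinggemma-300m";
const mockA = { embeddings: [{ values: Float32Array.of(3, 4) }], metadata: { embeddingSpace: space } };
const mockB = { embeddings: [{ values: Float32Array.of(3, 4) }], metadata: { embeddingSpace: space } };
console.log(compareResults(mockA, mockB)); // identical vectors, prints 1
```

Passing results from two different spaces would throw instead of silently returning a meaningless score, which is the main benefit the typed approach is meant to provide.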

Built-in Chunking Method

Currently, the API does not automatically chunk large text inputs. For the DevTrial, we suggest developers utilize ecosystem libraries like LangChain's RecursiveCharacterTextSplitter. This chunking logic works by recursively splitting text along natural boundaries (e.g., paragraphs, sentences, words) and merging the segments back together until they reach a specified size limit. To stay within the token limit, developers have two options:

For maximum efficiency: Use an offline tokenizer to estimate a domain-specific character-to-token ratio, allowing them to rapidly chunk the text by character length in the browser.
For strict accuracy: Load the open-sourced tokenizer and pass it into the RecursiveCharacterTextSplitter's length function to count exact tokens during the merging phase.
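
The first option can be sketched roughly as follows. This is a simplified greedy splitter, not LangChain's RecursiveCharacterTextSplitter: it splits on sentence and paragraph boundaries, then merges pieces up to a character budget derived from an assumed characters-per-token ratio. The function name chunkByCharBudget and the ratio of 4 used in the example are placeholders; a real ratio would have to be measured offline for the target tokenizer and content.

```javascript
// Hypothetical helper: greedy chunking by an estimated character budget.
// charsPerToken is an assumed, offline-measured ratio; the API does not provide it.
function chunkByCharBudget(text, maxTokens, charsPerToken) {
  const budget = Math.floor(maxTokens * charsPerToken);
  if (budget < 1) {
    throw new RangeError("Character budget must be at least 1");
  }
  // Split after sentence-ending punctuation, or on blank lines.
  const pieces = text.split(/(?<=[.!?])\s+|\n{2,}/).filter((p) => p.length > 0);
  const chunks = [];
  let current = "";
  for (const piece of pieces) {
    // Hard-split any single piece that exceeds the budget on its own.
    for (let start = 0; start < piece.length; start += budget) {
      const part = piece.slice(start, start + budget);
      if (current && current.length + 1 + part.length > budget) {
        chunks.push(current);
        current = "";
      }
      current = current ? `${current} ${part}` : part;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// With 5 tokens at an assumed 4 characters per token, the budget is 20 characters.
const sampleText = "One two three. Four five six. Seven eight nine.";
const sampleChunks = chunkByCharBudget(sampleText, 5, 4);
console.log(sampleChunks);
```

Because the ratio is only an estimate, a conservative budget (below the real token limit) leaves headroom for text that tokenizes less efficiently than expected.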

Similar to the built-in comparison method, we are actively discussing whether to provide a built-in utility function for automatic chunking (e.g., SemanticEmbedder.chunk(text)). A native method would provide a well-lit path for handling large documents, offering the developer ergonomics of a "Browser-provided Vector DB" without tying the API to an internal storage mechanism.

Task Type Optimization Hints

Developers can optionally provide a taskType hint to optimize the embedding quality for specific use cases (e.g., retrieval or classification). This is strictly an optional hint that browsers can safely ignore if their underlying model does not support it. During the Developer Trial, we will gather metrics to evaluate how effectively these task hints improve embedding quality in practice.

Considered alternatives

WebAssembly/WebGPU (DIY)

Developers can ship their own models via WASM. However, this leads to significant storage bloat (each site downloading its own large model) and high memory usage. A browser-provided API amortizes model storage costs, balances compute resources across all sites, and offers a more approachable option for developers.

Cloud APIs

Cloud APIs provide the highest-quality models and additional computing resources, at the cost of user privacy and financial expense for the developer.

Browser-provided Vector DB

We considered designing this as a browser-provided vector database where developers could insert and search text, but never access the raw embedding vectors. While this approach completely avoids embedding space compatibility issues—since the browser could silently re-index the hidden vectors if the underlying model updates—and would provide the added benefit of automatically chunking large documents, it was rejected because it does not enable hybrid local/cloud RAG use cases. Developers have shown strong interest in using on-device inference to generate vectors that are then sent to their own cloud databases, which requires exposing the raw Float32Array.
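
For the hybrid case, the raw vector has to be serialized before it can be sent to a server. One possible approach, sketched below, encodes the Float32Array as base64 with an explicit little-endian byte order so that the result does not depend on the device. The function names and the choice of base64 are assumptions for illustration; the proposal does not prescribe a wire format.

```javascript
// Hypothetical helper: encode a Float32Array as little-endian bytes in base64.
function vectorToBase64(vector) {
  const bytes = new Uint8Array(vector.length * 4);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < vector.length; i++) {
    view.setFloat32(i * 4, vector[i], true); // true = little-endian
  }
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical helper: decode the base64 string back into a Float32Array.
function base64ToVector(encoded) {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  const view = new DataView(bytes.buffer);
  const vector = new Float32Array(bytes.length / 4);
  for (let i = 0; i < vector.length; i++) {
    vector[i] = view.getFloat32(i * 4, true);
  }
  return vector;
}

// Round-trip a toy vector.
const toyVector = Float32Array.of(0.5, -1.25, 3);
const encodedVector = vectorToBase64(toyVector);
const decodedVector = base64ToVector(encodedVector);
console.log(encodedVector, Array.from(decodedVector));
```

Sending the embedding space identifier along with the payload would let the server reject vectors it cannot compare against its own index.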

API Naming

We initially called this API Embedder, but that gets confusing on the web where "embedder" usually refers to the browser itself, or frames containing <iframe> or <embed> elements. Here are the alternatives we considered:

TextEmbedder: This is what MediaPipe uses. It’s simple, but not preferred because the API may extend support for multimodal embeddings (like images) in the future.
FeatureExtractor: A standard ML term, but it feels too generic and technical for a high-level web API.
SemanticEmbedder (Preferred): This is our current preference. It clearly states that the API is for vectorizing meaning, and it naturally covers multimodal inputs later.

We will use SemanticEmbedder for the developer trial to avoid the ambiguity of "embedder" while keeping the name future-proof for multimodal use cases. However, we still welcome feedback on this naming choice.

Ensuring an Interoperable API Design

We are actively seeking input from other browser vendors, the Web Machine Learning Community Group (WebML CG), and the broader web development community around the interoperable design aspects of this API. We do not intend to solve all of these challenges immediately, but rather aim to use the Developer Trial and discussions with other vendors to discover solutions collaboratively.

Key open questions we want to explore include:

Model and Space Choices: Should developers be able to specify their own embedding models, or should implementers be limited to specific open-weight models for compatibility with server-side embedding databases? We must also consider how to resolve the need for model updates and end-of-life for model support, and how this information is exposed if developers cannot choose the model. Finally, we need to evaluate if a standard effort is needed to define embedding spaces, similar to codecs, geographic coordinate systems, or language codes.
Content Mediation: What lessons can we apply from open web platform interop efforts in codecs regarding content mediation?
Data Representation: Is the Float32Array representation common enough across all major open-weight models, and how should we handle potential type differences in the future?
Task Types: Are task types interoperable, and can they be implemented over the top of any given model, or is it acceptable for them to be optional optimizations that can be safely ignored?
Context Windows and Chunking: How can we provide interoperable model information regarding max token lengths for an embedding, including potentially exposing API options when models support variable lengths? Furthermore, should the API offer interoperable helper methods for chunking, measuring, and tokenization?
Execution Modes: Should the API provide signaling for CPU versus GPU execution modes if those are more appropriate for specific batch or query clients?

Security Considerations

Permissions Policy: Access to the API is gated by an embedding permissions policy, restricted to top-level frames and same-origin iframes by default. Third-party contexts must be explicitly granted access.
Sandbox Isolation: Data processing and model execution occur in a sandboxed environment to mitigate the risks of malicious inputs.

Privacy Considerations

Statelessness: Other than the model download state, the API is stateless and does not maintain any memory or user data across sessions.

Stakeholder Feedback / Opposition

Initial interest from the web developer community
- webmachinelearning/prompt-api#16
- webmachinelearning/proposals#8

References & acknowledgements

Many thanks for valuable feedback and advice from the Chrome Built-in AI team.
