GitHub - pablo-codes/eny-consulting-embedder

🧠 JSON → Embeddings → MongoDB Pipeline

This project takes content crawled with Firecrawl.dev (specifically its extract service), processes it, generates embeddings, and stores them in MongoDB for vector search.

Workflow:

1. Crawl websites with Firecrawl.dev using the extract service.
2. Download the extracted data and save each result as a sequentially numbered JSON file (1.json, 2.json, 3.json, ...).
3. Run this pipeline to process, embed, and store the data in MongoDB.

✨ Features

📂 Automatic JSON ingestion – Reads all .json files from the /json folder.
🧹 Data cleaning & processing – Normalizes titles, content, and sources into a consistent format.
🧠 Vector embeddings – Generates embeddings with Together AI’s m2-bert-80M-32k-retrieval (default).
💾 MongoDB persistence – Saves documents & embeddings into MongoDB with duplicate handling.
📈 Vector search index creation – Builds a vector index in MongoDB for similarity search.
🔒 Environment-driven config – All keys and hosts are configurable via .env.
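
The duplicate handling listed above is not spelled out in this README. One common way to get it, sketched here as an assumption rather than a description of what database.js actually does, is to upsert each document keyed on its id, so that re-running the pipeline overwrites existing records instead of inserting them twice. The helper name toUpsertOps is hypothetical.

```javascript
// Hypothetical helper: turn processed documents into bulkWrite operations
// that upsert on `id`, so re-running the pipeline does not create duplicates.
// Assumption: `id` uniquely identifies a document (as in the sample document).
function toUpsertOps(docs) {
  return docs.map((doc) => ({
    updateOne: {
      filter: { id: doc.id },
      update: { $set: doc },
      upsert: true,
    },
  }));
}

// Usage with the official MongoDB Node.js driver (not executed here):
//   await collection.bulkWrite(toUpsertOps(docs), { ordered: false });
```

Keying on id keeps the write idempotent; a unique index on id would additionally guard against duplicates written by other code paths.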

🛠️ Tech Stack

Node.js (ESM)
MongoDB
Together AI
Ollama (local LLM support, optional)
dotenv (environment variables)

📂 Project Structure


.
├── config.js          # Configuration for MongoDB, Ollama, Together AI
├── crawler.js         # Reads and processes JSON files
├── database.js        # MongoDB connection, indexes, and persistence logic
├── embedder.js        # Embedding generation using Together AI
├── index.js           # Main pipeline entry point
├── /json              # JSON files to ingest (downloaded from Firecrawl.dev extract API)
└── .env               # Environment configuration

⚙️ Setup

1. Clone repository

git clone https://github.com/pablo-codes/eny-consulting-embedder.git
cd eny-consulting-embedder

2. Install dependencies

npm install

3. Environment variables

Create a .env file in the project root:

# MongoDB configuration
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=vector_db
MONGODB_COLLECTION=embeddings

# Together AI
TOGETHER_API_KEY=your_together_api_key
EMBEDDING_MODEL=togethercomputer/m2-bert-80M-32k-retrieval

# Ollama (optional, for local LLMs)
OLLAMA_HOST=http://localhost:11434
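
As a rough illustration of how these variables might be consumed, the sketch below maps them into one configuration object with the defaults shown above. The function names loadConfig and assertConfig and the object shape are assumptions; the actual config.js may be organized differently. In the real project, dotenv would populate process.env before this code runs.

```javascript
// Hypothetical shape for config.js: read settings from an env-like object,
// falling back to the defaults shown in the .env example.
function loadConfig(env = process.env) {
  return {
    mongodb: {
      uri: env.MONGODB_URI ?? "mongodb://localhost:27017",
      database: env.MONGODB_DATABASE ?? "vector_db",
      collection: env.MONGODB_COLLECTION ?? "embeddings",
    },
    together: {
      apiKey: env.TOGETHER_API_KEY ?? "",
      model: env.EMBEDDING_MODEL ?? "togethercomputer/m2-bert-80M-32k-retrieval",
    },
    ollama: {
      host: env.OLLAMA_HOST ?? "http://localhost:11434",
    },
  };
}

// Fail fast on the one value that has no usable default.
function assertConfig(config) {
  if (!config.together.apiKey) {
    throw new Error("TOGETHER_API_KEY is not set");
  }
  return config;
}
```

Passing the environment in as a parameter keeps the loader easy to exercise with a plain object instead of the real process environment.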

🚀 Running the Pipeline

1. Add JSON files

After crawling with Firecrawl.dev, download the extracted files and place them in the /json folder. Name them sequentially: 1.json, 2.json, 3.json, etc.
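
Sequential names have one pitfall: plain string sorting puts 10.json before 2.json. The sketch below shows one way to order the files numerically. The helper name orderJsonFiles is hypothetical, and it assumes only purely numeric file names should be picked up, which may be stricter than what crawler.js does.

```javascript
// Hypothetical helper: keep only numerically named .json files and order them
// by number, so 10.json comes after 2.json (string sorting would not do that).
function orderJsonFiles(fileNames) {
  return fileNames
    .filter((name) => /^\d+\.json$/.test(name))
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10));
}

// Usage (not executed here): pass in the result of fs.readdirSync("./json").
console.log(orderJsonFiles(["10.json", "2.json", "1.json", "notes.txt"]));
```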

2. Start the pipeline

node index.js

The pipeline will:

Read and process JSON files
Generate embeddings with Together AI
Save embeddings into MongoDB
Build vector indexes for similarity search
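
For the embedding step, a minimal sketch of a request to Together AI is shown below. It assumes the OpenAI-compatible embeddings endpoint at https://api.together.xyz/v1/embeddings and a response shaped like { data: [{ embedding: [...] }] }. The function names buildEmbeddingRequest and embed are hypothetical and not taken from embedder.js.

```javascript
// Assumption: Together AI's OpenAI-compatible embeddings endpoint.
const TOGETHER_EMBEDDINGS_URL = "https://api.together.xyz/v1/embeddings";

// Build the fetch() arguments separately so they can be inspected without
// making a network call.
function buildEmbeddingRequest(text, { apiKey, model }) {
  return {
    url: TOGETHER_EMBEDDINGS_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, input: text }),
    },
  };
}

// Hypothetical embed() helper; uses the global fetch available in Node.js 18+.
async function embed(text, settings) {
  const { url, options } = buildEmbeddingRequest(text, settings);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // Assumed response shape: { data: [{ embedding: number[] }, ...] }
  return payload.data[0].embedding;
}
```

A call would look like `await embed("some text", { apiKey: process.env.TOGETHER_API_KEY, model: process.env.EMBEDDING_MODEL })`.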

🔍 Querying Embeddings

Once the documents are stored in MongoDB, you can run vector similarity searches against the vector_index search index. For example:

db.embeddings.aggregate([
  {
    $vectorSearch: {
      // Name of the vector search index to query (required by $vectorSearch)
      index: "vector_index",
      queryVector: [
        /* your query embedding */
      ],
      path: "embedding",
      numCandidates: 100,
      limit: 5,
    },
  },
]);
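
To show how the index and the query fit together, the sketch below builds both as plain objects. The index name vector_index and the embedding field come from this README; cosine similarity, the projected fields, and the helper names are illustrative assumptions. The number of dimensions is passed in rather than hard-coded, since it must match the length of the vectors the embedding model returns.

```javascript
// Hypothetical helper: definition for a vector search index on `embedding`.
function vectorIndexDefinition(numDimensions) {
  return {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
      fields: [
        { type: "vector", path: "embedding", numDimensions, similarity: "cosine" },
      ],
    },
  };
}

// Hypothetical helper: aggregation pipeline that returns the closest
// documents together with their similarity score.
function vectorSearchPipeline(queryVector, limit = 5) {
  return [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector,
        // Consider more candidates than results; stay within the 10000 cap.
        numCandidates: Math.min(10000, Math.max(100, limit * 20)),
        limit,
      },
    },
    {
      $project: {
        _id: 0,
        title: 1,
        url: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}

// Usage with a recent MongoDB Node.js driver (not executed here):
//   await collection.createSearchIndex(vectorIndexDefinition(embedding.length));
//   const hits = await collection.aggregate(vectorSearchPipeline(queryVector)).toArray();
```

The query vector should be produced by the same embedding model used at ingestion time; vectors from different models are not comparable.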

🧪 Sample Document in MongoDB

{
  "id": "1",
  "url": "https://example.com/page",
  "title": "Sample Page Title",
  "content": "Full processed text content...",
  "embedding": [0.0123, -0.0456, ...],
  "embeddingModel": "togethercomputer/m2-bert-80M-32k-retrieval",
  "embeddingTimestamp": "2025-09-05T05:00:00Z",
  "meta": {
    "timestamp": "2025-09-05T05:00:00Z",
    "source": "json_file",
    "filePath": "./json/1.json",
    "originalId": "1"
  }
}
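
A record of this shape could be assembled roughly as follows. This is a sketch only: the input field names (url, title, content) are assumptions about the extracted JSON, not a documented Firecrawl schema, and the helpers toDocument and withEmbedding are hypothetical.

```javascript
// Hypothetical normalizer: builds the non-embedding part of the document.
function toDocument(raw, id, filePath, now = new Date()) {
  // Collapse runs of whitespace so titles and content are stored consistently.
  const collapse = (value) => String(value ?? "").replace(/\s+/g, " ").trim();
  return {
    id: String(id),
    url: raw.url ?? "",
    title: collapse(raw.title),
    content: collapse(raw.content),
    meta: {
      timestamp: now.toISOString(),
      source: "json_file",
      filePath,
      originalId: String(id),
    },
  };
}

// Hypothetical helper: attach the embedding and its provenance fields.
function withEmbedding(doc, embedding, model, now = new Date()) {
  return {
    ...doc,
    embedding,
    embeddingModel: model,
    embeddingTimestamp: now.toISOString(),
  };
}
```

Storing the model name next to each vector makes it possible to detect and re-embed stale records if the embedding model changes later.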
