pablo-codes/eny-consulting-embedder

🧠 JSON → Embeddings → MongoDB Pipeline

This project takes content crawled with Firecrawl.dev's extract service, processes it, generates embeddings, and stores them in MongoDB for vector search.

Workflow:

  1. Crawl websites with Firecrawl.dev using the extract service.
  2. Download the extracted data and save each result as a sequentially numbered JSON file (1.json, 2.json, 3.json, ...).
  3. Run this pipeline to process, embed, and store the data into MongoDB.

✨ Features

  • 📂 Automatic JSON ingestion – Reads all .json files from the /json folder.
  • 🧹 Data cleaning & processing – Normalizes titles, content, and sources into a consistent format.
  • 🧠 Vector embeddings – Generates embeddings with Together AI’s m2-bert-80M-32k-retrieval (default).
  • 💾 MongoDB persistence – Saves documents & embeddings into MongoDB with duplicate handling.
  • 📈 Vector search index creation – Builds a vector index in MongoDB for similarity search.
  • 🔒 Environment-driven config – All keys and hosts are configurable via .env.
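The cleaning step listed above could look roughly like the following. This is a minimal sketch, not the code in crawler.js: the names cleanText and normalizeRecord are hypothetical, and the input field names (title, content, url) are assumptions, since the exact shape of the Firecrawl extract output is not shown in this README.

```javascript
// Hypothetical helper: collapse runs of whitespace and trim a string field.
// Non-string values become an empty string.
function cleanText(value) {
  return typeof value === "string" ? value.replace(/\s+/g, " ").trim() : "";
}

// Hypothetical normalizer: maps one raw record to a consistent shape.
// The input field names (title, content, url) are assumed, not confirmed.
function normalizeRecord(raw, id) {
  return {
    id: String(id),
    url: cleanText(raw.url),
    title: cleanText(raw.title) || "Untitled",
    content: cleanText(raw.content),
  };
}
```

Calling `normalizeRecord(rawRecord, 1)` would then give every record the same four string fields, whatever whitespace or missing values the raw data had.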

🛠️ Tech Stack

  • Node.js – runtime for the pipeline scripts
  • Firecrawl.dev – source of the crawled, extracted JSON
  • Together AI – embedding generation
  • MongoDB – document storage and vector search
  • Ollama – optional, for local LLMs


📂 Project Structure


.
├── config.js          # Configuration for MongoDB, Ollama, Together AI
├── crawler.js         # Reads and processes JSON files
├── database.js        # MongoDB connection, indexes, and persistence logic
├── embedder.js        # Embedding generation using Together AI
├── index.js           # Main pipeline entry point
├── /json              # JSON files to ingest (downloaded from Firecrawl.dev extract API)
└── .env               # Environment configuration


⚙️ Setup

1. Clone repository

git clone https://github.com/pablo-codes/eny-consulting-embedder.git
cd eny-consulting-embedder

2. Install dependencies

npm install

3. Environment variables

Create a .env file in the project root:

# MongoDB configuration
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=vector_db
MONGODB_COLLECTION=embeddings

# Together AI
TOGETHER_API_KEY=your_together_api_key
EMBEDDING_MODEL=togethercomputer/m2-bert-80M-32k-retrieval

# Ollama (optional, for local LLMs)
OLLAMA_HOST=http://localhost:11434
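
For illustration, these variables might be read into a single config object as sketched below. The function name loadConfig and the object layout are hypothetical, and the real config.js may be organized differently. The sketch assumes the .env file has already been loaded into process.env, for example with the dotenv package or with `node --env-file=.env index.js` on recent Node versions.

```javascript
// Hypothetical config loader; defaults mirror the example .env values above.
// Assumes .env has already been loaded into process.env by some other means.
function loadConfig(env = process.env) {
  const config = {
    mongodb: {
      uri: env.MONGODB_URI || "mongodb://localhost:27017",
      database: env.MONGODB_DATABASE || "vector_db",
      collection: env.MONGODB_COLLECTION || "embeddings",
    },
    together: {
      apiKey: env.TOGETHER_API_KEY,
      model: env.EMBEDDING_MODEL || "togethercomputer/m2-bert-80M-32k-retrieval",
    },
    ollama: {
      host: env.OLLAMA_HOST || "http://localhost:11434",
    },
  };
  // Fail early: embeddings cannot be generated without an API key.
  if (!config.together.apiKey) {
    throw new Error("TOGETHER_API_KEY is not set");
  }
  return config;
}
```

Failing fast on a missing key is a design choice in this sketch; it surfaces a misconfigured .env before any files are read or requests are sent.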

🚀 Running the Pipeline

1. Add JSON files

After crawling with Firecrawl.dev, download the extracted files and place them in the /json folder. Name them sequentially: 1.json, 2.json, 3.json, etc.
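
A sketch of how sequentially numbered files could be picked up in numeric order is shown below. The helper names listJsonFiles and readJsonFiles are hypothetical, and restricting the match to `N.json` names is an assumption of this sketch; the actual crawler.js may simply read every .json file in the folder.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical helper: list N.json files in a folder, ordered by number,
// so that 10.json comes after 9.json (plain string sorting would not do that).
function listJsonFiles(dir) {
  return fs
    .readdirSync(dir)
    .filter((name) => /^\d+\.json$/.test(name))
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10))
    .map((name) => path.join(dir, name));
}

// Hypothetical helper: parse each file, skipping files that are not valid JSON.
function readJsonFiles(dir) {
  const results = [];
  for (const filePath of listJsonFiles(dir)) {
    try {
      const data = JSON.parse(fs.readFileSync(filePath, "utf8"));
      results.push({ filePath, data });
    } catch (err) {
      console.warn(`Skipping ${filePath}: ${err.message}`);
    }
  }
  return results;
}
```

For example, `readJsonFiles("./json")` would return one `{ filePath, data }` entry per readable file.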

2. Start the pipeline

node index.js

The pipeline will:

  • Read and process JSON files
  • Generate embeddings with Together AI
  • Save embeddings into MongoDB
  • Build vector indexes for similarity search
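
The embedding step above can be sketched as a single HTTP request to Together AI's OpenAI-compatible embeddings endpoint. This is an illustration under stated assumptions, not the contents of embedder.js: the function name embedText and the fetchImpl parameter (which lets the HTTP call be swapped out in tests) are hypothetical, and the sketch relies on the fetch built into Node 18 and later.

```javascript
// Minimal sketch of one embedding request to Together AI.
// embedText and fetchImpl are hypothetical names used for illustration.
async function embedText(text, { apiKey, model, fetchImpl = fetch }) {
  const response = await fetchImpl("https://api.together.xyz/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, input: text }),
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // The response lists one entry per input; a single string yields data[0].
  return payload.data[0].embedding;
}
```

A call might look like `await embedText(doc.content, { apiKey: process.env.TOGETHER_API_KEY, model: process.env.EMBEDDING_MODEL })`. A production version would likely add retries and rate limiting, which this sketch leaves out.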

🔍 Querying Embeddings

Once the embeddings are stored in MongoDB, you can run vector similarity searches against the vector_index index. For example:

db.embeddings.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      queryVector: [
        /* your query embedding */
      ],
      path: "embedding",
      numCandidates: 100,
      limit: 5,
    },
  },
]);
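
From Node.js, the same query could be assembled with a small helper like the one below, which also projects the similarity score. The function name buildVectorSearchPipeline and the projected fields are assumptions for illustration. The query vector has to come from the same embedding model that was used for the stored documents.

```javascript
// Hypothetical helper: build an aggregation pipeline for $vectorSearch.
// The index name and embedding path follow the names used in this README.
function buildVectorSearchPipeline(queryVector, { limit = 5, numCandidates = 100 } = {}) {
  return [
    {
      $vectorSearch: {
        index: "vector_index",
        path: "embedding",
        queryVector,
        numCandidates,
        limit,
      },
    },
    {
      // Return a few readable fields plus the similarity score.
      $project: {
        _id: 0,
        title: 1,
        url: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

With the official MongoDB driver, the results could then be fetched with `await collection.aggregate(buildVectorSearchPipeline(queryVector)).toArray()`, where `collection` is the embeddings collection.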

🧪 Sample Document in MongoDB

{
  "id": "1",
  "url": "https://example.com/page",
  "title": "Sample Page Title",
  "content": "Full processed text content...",
  "embedding": [0.0123, -0.0456, ...],
  "embeddingModel": "togethercomputer/m2-bert-80M-32k-retrieval",
  "embeddingTimestamp": "2025-09-05T05:00:00Z",
  "meta": {
    "timestamp": "2025-09-05T05:00:00Z",
    "source": "json_file",
    "filePath": "./json/1.json",
    "originalId": "1"
  }
}
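
One way a document of this shape could be built and saved without creating duplicates is sketched below. All three function names are hypothetical, and the upsert keyed on `id` is only one possible reading of "duplicate handling"; database.js may do it differently. The `collection` argument is assumed to be a Collection from the official mongodb driver, the cosine similarity setting is an assumption, and search indexes generally require an Atlas deployment.

```javascript
// Hypothetical helper: assemble a document shaped like the sample above.
function buildDocument(record, embedding, model, filePath, now = new Date()) {
  const timestamp = now.toISOString();
  return {
    id: record.id,
    url: record.url,
    title: record.title,
    content: record.content,
    embedding,
    embeddingModel: model,
    embeddingTimestamp: timestamp,
    meta: { timestamp, source: "json_file", filePath, originalId: record.id },
  };
}

// Upsert keyed on id, so re-running the pipeline updates the existing
// document instead of inserting a second copy (an assumed strategy).
async function saveDocument(collection, doc) {
  return collection.updateOne({ id: doc.id }, { $set: doc }, { upsert: true });
}

// Create the vector index; numDimensions must match the embedding length.
// Note: this call fails if an index with the same name already exists.
async function ensureVectorIndex(collection, numDimensions) {
  return collection.createSearchIndex({
    name: "vector_index",
    type: "vectorSearch",
    definition: {
      fields: [
        { type: "vector", path: "embedding", numDimensions, similarity: "cosine" },
      ],
    },
  });
}
```

Passing `doc.embedding.length` as `numDimensions` avoids hard-coding the model's output size.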
