This project takes content crawled with Firecrawl.dev (specifically its extract service), processes it, generates embeddings, and stores them in MongoDB for vector search.
Workflow:
- Crawl websites with Firecrawl.dev using the extract service.
- Download the extracted data and save each result as a sequentially numbered JSON file (1.json, 2.json, 3.json, ...).
- Run this pipeline to process, embed, and store the data into MongoDB.
- 📂 Automatic JSON ingestion – Reads all .json files from the /json folder.
- 🧹 Data cleaning & processing – Normalizes titles, content, and sources into a consistent format.
- 🧠 Vector embeddings – Generates embeddings with Together AI’s m2-bert-80M-32k-retrieval (default).
- 💾 MongoDB persistence – Saves documents & embeddings into MongoDB with duplicate handling.
- 📈 Vector search index creation – Builds a vector index in MongoDB for similarity search.
- 🔒 Environment-driven config – All keys and hosts are configurable via .env.
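As a rough illustration of the data cleaning step, the sketch below collapses whitespace and maps a raw record onto a consistent shape. The function name `normalizeRecord` and the input field names (`url`, `title`, `content`, `markdown`) are assumptions made for this example; the real logic in crawler.js and the exact Firecrawl output schema may differ.

```javascript
// Hypothetical helper, not the project's actual implementation.
// The fields read from `raw` are assumed, not taken from the Firecrawl schema.
function normalizeRecord(raw, filePath) {
  // Collapse runs of whitespace (including newlines) and trim the ends.
  const collapse = (value) => String(value ?? "").replace(/\s+/g, " ").trim();

  // Derive the id from the file name: "./json/1.json" becomes "1".
  const id = filePath.split("/").pop().replace(/\.json$/i, "");

  return {
    id,
    url: collapse(raw.url),
    title: collapse(raw.title) || "Untitled",
    content: collapse(raw.content ?? raw.markdown),
    meta: { source: "json_file", filePath, originalId: id },
  };
}
```

Keeping this step a pure function makes it easy to check in isolation before any embedding or database call is involved.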
- Node.js (ESM)
- MongoDB
- Together AI
- Ollama (local LLM support, optional)
- dotenv (environment variables)
.
├── config.js # Configuration for MongoDB, Ollama, Together AI
├── crawler.js # Reads and processes JSON files
├── database.js # MongoDB connection, indexes, and persistence logic
├── embedder.js # Embedding generation using Together AI
├── index.js # Main pipeline entry point
├── /json # JSON files to ingest (downloaded from Firecrawl.dev extract API)
└── .env # Environment configuration
git clone https://github.com/your-username/embedding-pipeline.git
cd embedding-pipeline
npm install
Create a .env file in the project root:
# MongoDB configuration
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=vector_db
MONGODB_COLLECTION=embeddings
# Together AI
TOGETHER_API_KEY=your_together_api_key
EMBEDDING_MODEL=togethercomputer/m2-bert-80M-32k-retrieval
# Ollama (optional, for local LLMs)
OLLAMA_HOST=http://localhost:11434
After crawling with Firecrawl.dev, download the extracted files and place them in the /json folder.
Name them sequentially: 1.json, 2.json, 3.json, etc.
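Sequential names sort differently as strings than as numbers (a plain string sort puts 10.json before 2.json). The sketch below shows one way to keep only numbered files and order them numerically; `sortNumberedJsonFiles` is a hypothetical helper, and whether the pipeline itself depends on this ordering is an assumption.

```javascript
// Hypothetical helper: keeps names like "12.json" and sorts them by number.
function sortNumberedJsonFiles(fileNames) {
  return fileNames
    .filter((name) => /^\d+\.json$/.test(name))
    // parseInt stops at the first non-digit, so "10.json" parses as 10.
    .sort((a, b) => parseInt(a, 10) - parseInt(b, 10));
}
```

In practice the input could come from `readdir("./json")` in `node:fs/promises`.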
node index.js
The pipeline will:
- Read and process JSON files
- Generate embeddings with Together AI
- Save embeddings into MongoDB
- Build vector indexes for similarity search
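The embedding step can be sketched as a single HTTP call. This example assumes Together AI's OpenAI-compatible embeddings endpoint and response shape (`data[0].embedding`); the names `buildEmbeddingRequest` and `embedText` are hypothetical, and embedder.js may be organized differently.

```javascript
// Assumption: Together AI's OpenAI-compatible embeddings endpoint.
const TOGETHER_EMBEDDINGS_URL = "https://api.together.xyz/v1/embeddings";

// Build the request separately so it can be inspected without network access.
function buildEmbeddingRequest(text, { apiKey, model }) {
  return {
    url: TOGETHER_EMBEDDINGS_URL,
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model, input: text }),
    },
  };
}

// Send the request and return the embedding vector for `text`.
// `fetchImpl` defaults to the global fetch available in Node.js 18+.
async function embedText(text, { apiKey, model, fetchImpl = fetch }) {
  const { url, options } = buildEmbeddingRequest(text, { apiKey, model });
  const response = await fetchImpl(url, options);
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const payload = await response.json();
  // Assumed response shape: { data: [{ embedding: number[] }] }
  return payload.data[0].embedding;
}
```

A call might look like `await embedText(doc.content, { apiKey: process.env.TOGETHER_API_KEY, model: process.env.EMBEDDING_MODEL })`.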
Once stored in MongoDB, you can run vector similarity searches using the vector_index.
For example:
db.embeddings.aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      queryVector: [
        /* your query embedding */
      ],
      path: "embedding",
      numCandidates: 100,
      limit: 5,
    },
  },
]);
Example stored document:
{
  "id": "1",
  "url": "https://example.com/page",
  "title": "Sample Page Title",
  "content": "Full processed text content...",
  "embedding": [0.0123, -0.0456, ...],
  "embeddingModel": "togethercomputer/m2-bert-80M-32k-retrieval",
  "embeddingTimestamp": "2025-09-05T05:00:00Z",
  "meta": {
    "timestamp": "2025-09-05T05:00:00Z",
    "source": "json_file",
    "filePath": "./json/1.json",
    "originalId": "1"
  }
}
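To query documents of this shape from application code, the search stage can be built by a small function and passed to the driver's `aggregate` call. This is a minimal sketch: `buildVectorSearchPipeline` is a hypothetical helper, the index name `vector_index` is taken from the text above, and the projected fields are an assumption about what a caller would want back.

```javascript
// Hypothetical helper: returns an aggregation pipeline for Atlas Vector Search.
function buildVectorSearchPipeline(queryVector, { limit = 5, numCandidates = 100 } = {}) {
  return [
    {
      $vectorSearch: {
        index: "vector_index", // index name as referenced in this README
        path: "embedding",
        queryVector,
        numCandidates,
        limit,
      },
    },
    {
      // Drop the large embedding array and expose the similarity score.
      $project: {
        _id: 0,
        url: 1,
        title: 1,
        score: { $meta: "vectorSearchScore" },
      },
    },
  ];
}
```

The query vector should come from the same embedding model that produced the stored vectors; otherwise the similarity scores are not meaningful.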