This project implements a distributed text mining pipeline for lexicon-based sentiment analysis using a MapReduce-style workflow.
The pipeline processes a labeled dataset of financial news statements, performs text preprocessing, applies a pre-trained sentiment lexicon, generates document-level sentiment predictions, aggregates results with mapper/reducer logic, benchmarks parallel execution, and validates the final outputs with summary metrics.
Data → Preprocessing → Lexicon Scoring → Mapper → Parallel Execution → Reducer → Metrics
The project uses the file:
data/all-data.csv
The dataset contains sentiment-labeled financial text statements and is used throughout the pipeline for preprocessing, sentiment prediction, and validation.
Owner: BearAx
Status: Completed
Responsibilities
- load the dataset;
- handle file encoding and format issues;
- clean the text;
- tokenize documents;
- generate vocabulary and preprocessing statistics.
Implemented outputs
- output_data/cleaned_dataset.csv
- output_data/tokens.json
- output_data/vocabulary.json
- output_data/top_words.csv
- output_data/summary.json
Owner: Telman3000
Status: Completed
Responsibilities
- define or provide a pre-trained sentiment lexicon;
- assign token-level sentiment scores;
- compute document-level sentiment scores;
- classify each document as positive, negative, or neutral.
Implemented outputs
- output_lexicon/scored_documents.json
- output_lexicon/sentiment_summary.json
Owner: LeoPython2006
Status: Completed
Responsibilities
- implement the mapper;
- implement the reducer;
- produce document-level map results;
- aggregate sentiment counts through reducer logic.
Implemented outputs
- output_mapreduce/mapper_results.json
- output_mapreduce/reducer_summary.json
Owner: uSs3ewa
Status: Completed
Responsibilities
- split the dataset into chunks;
- execute document scoring in parallel;
- compare sequential and parallel runtime;
- save chunk-level and benchmark results.
Implemented outputs
- output_parallel/parallel_scored_documents.json
- output_parallel/parallel_sentiment_summary.json
- output_parallel/chunk_level_results.json
- output_parallel/runtime_results.csv
Owner: Mysteri0K1ng
Status: Completed
Responsibilities
- verify correctness of predictions;
- compute accuracy if labels exist;
- generate summary statistics;
- save validated document-level results and evaluation metrics.
Implemented outputs
- output_validation/validated_predictions.json
- output_validation/metrics_summary.json
| Stage | Role | Main Work | Main Outputs | Owner | Status |
|---|---|---|---|---|---|
| 1 | Data Engineer | Dataset loading, cleaning, tokenization | output_data/* | BearAx | Done |
| 2 | Lexicon Specialist | Lexicon scoring and sentiment assignment | output_lexicon/* | Telman3000 | Done |
| 3 | MapReduce Developer | Mapper and reducer implementation | output_mapreduce/* | LeoPython2006 | Done |
| 4 | Parallelization Engineer | Chunking, multiprocessing, runtime benchmarking | output_parallel/* | uSs3ewa | Done |
| 5 | Validation & Metrics | Accuracy, class metrics, confusion matrix | output_validation/* | Mysteri0K1ng | Done |
Distributed-Text-Mining-and-Sentiment-Analysis/
│
├── data/
│ ├── all-data.csv
│ └── sentiment_lexicon.json
│
├── scripts/
│ ├── data_preprocessing.py
│ ├── lexicon_scoring.py
│ ├── map_reduce_developer.py
│ ├── parallel_runner.py
│ └── validation_metrics.py
│
├── output_data/
├── output_lexicon/
├── output_mapreduce/
├── output_parallel/
├── output_validation/
│
├── README.md
├── LICENSE
└── .gitignore
The preprocessing stage:
- loads all-data.csv;
- handles encoding and header format;
- converts text to lowercase;
- removes punctuation;
- normalizes whitespace;
- tokenizes text into word lists;
- builds a vocabulary and preprocessing summary.
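The cleaning and tokenization steps listed above can be sketched as follows. This is a minimal illustration, not the code in scripts/data_preprocessing.py: the function names `clean_text`, `tokenize`, and `build_vocabulary` and the sample sentences are hypothetical, and it assumes punctuation is deleted outright rather than replaced with a space.

```python
import re
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, delete ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Clean the text, then split it into word tokens on whitespace."""
    return clean_text(text).split()


def build_vocabulary(token_lists) -> Counter:
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


# Hypothetical sample statements, not taken from all-data.csv.
docs = ["Profit rose 12%, the company said.", "Sales  fell; outlook is weak."]
token_lists = [tokenize(doc) for doc in docs]
vocab = build_vocabulary(token_lists)
print(token_lists[0])
print(len(vocab), "unique tokens")
```

Deleting punctuation joins hyphenated words (for example, "year-on-year" becomes "yearonyear"); replacing punctuation with a space instead would split them into separate tokens.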
Observed preprocessing results
- documents after cleaning: 4838
- total tokens: 103049
- unique tokens: 10103
- average tokens per document: 21.3
The lexicon stage loads output_data/tokens.json and data/sentiment_lexicon.json, then:
- scores each token using the sentiment lexicon;
- sums token scores into a document score;
- assigns:
  - positive if score > 0;
  - negative if score < 0;
  - neutral if score = 0.
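The scoring rule above can be sketched in a few lines. This assumes the lexicon is a mapping from token to numeric score and that tokens missing from the lexicon contribute 0; the actual layout of data/sentiment_lexicon.json is not shown here, and the entries and function names below are hypothetical.

```python
def score_document(tokens, lexicon):
    """Sum the lexicon score of every token; unknown tokens contribute 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


def classify(score):
    """Map a document score to a label using the sign of the score."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical lexicon entries, for illustration only.
lexicon = {"profit": 1, "growth": 1, "loss": -1, "decline": -1}

tokens = ["company", "reported", "profit"]
score = score_document(tokens, lexicon)
print(score, classify(score))
```

With a small lexicon, most documents contain no scored token at all, which is consistent with the large neutral class reported in the results.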
Observed lexicon-stage results
- lexicon terms: 78
- positive documents: 952
- negative documents: 272
- neutral documents: 3614
- evaluated documents: 4838
- accuracy: 0.6807
The MapReduce stage implements:
- a mapper, which transforms one tokenized document into a structured prediction record;
- a reducer, which aggregates counts across all mapped records.
Mapper output example
{
"doc_id": 24,
"tokens": ["company", "reported", "profit"],
"score": 2,
"predicted_sentiment": "positive",
"true_label": "positive"
}
Reducer output fields
- documents_count
- positive_documents
- negative_documents
- neutral_documents
- accuracy
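A mapper and reducer that produce the record and summary fields listed above might look like the sketch below. It is an illustration under assumptions, not the code in scripts/map_reduce_developer.py: the function signatures, the lexicon entries, and the choice to compute accuracy only over records that carry a true label are all hypothetical.

```python
from collections import Counter


def mapper(doc_id, tokens, true_label, lexicon):
    """Turn one tokenized document into a structured prediction record."""
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        predicted = "positive"
    elif score < 0:
        predicted = "negative"
    else:
        predicted = "neutral"
    return {
        "doc_id": doc_id,
        "tokens": tokens,
        "score": score,
        "predicted_sentiment": predicted,
        "true_label": true_label,
    }


def reducer(records):
    """Aggregate sentiment counts and accuracy across all mapped records."""
    counts = Counter(r["predicted_sentiment"] for r in records)
    labeled = [r for r in records if r["true_label"] is not None]
    correct = sum(r["predicted_sentiment"] == r["true_label"] for r in labeled)
    return {
        "documents_count": len(records),
        "positive_documents": counts["positive"],
        "negative_documents": counts["negative"],
        "neutral_documents": counts["neutral"],
        "accuracy": round(correct / len(labeled), 4) if labeled else None,
    }


# Hypothetical lexicon and documents, for illustration only.
lexicon = {"profit": 2, "loss": -1}
documents = [
    (24, ["company", "reported", "profit"], "positive"),
    (25, ["net", "loss", "widened"], "negative"),
    (26, ["shares", "unchanged"], "positive"),
]
records = [mapper(doc_id, tokens, label, lexicon) for doc_id, tokens, label in documents]
summary = reducer(records)
print(summary)
```

Because the mapper depends only on its own document, the map step can be applied to documents in any order or in separate processes; only the reducer needs to see all records.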
Observed reducer-stage results
- documents_count: 4838
- positive_documents: 952
- negative_documents: 272
- neutral_documents: 3614
- accuracy: 0.6807
The parallel stage reuses the sentiment scoring logic and:
- splits tokenized documents into chunks;
- processes them with multiprocessing;
- computes chunk-level summaries;
- compares sequential and parallel runtime;
- optionally verifies result consistency.
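The chunk-and-score pattern described above can be sketched with the standard library's `multiprocessing.Pool`. This is a simplified illustration rather than the contents of scripts/parallel_runner.py: the helper names, the module-level `LEXICON`, and the sample documents are hypothetical.

```python
import time
from multiprocessing import Pool

# Hypothetical lexicon entries, for illustration only.
LEXICON = {"profit": 1, "loss": -1}


def chunked(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def score_chunk(chunk):
    """Score every tokenized document in one chunk."""
    return [sum(LEXICON.get(token, 0) for token in tokens) for tokens in chunk]


def run_sequential(docs, chunk_size):
    """Score all chunks one after another in the current process."""
    return [s for chunk in chunked(docs, chunk_size) for s in score_chunk(chunk)]


def run_parallel(docs, chunk_size, workers):
    """Score chunks in a worker pool; pool.map preserves chunk order."""
    with Pool(processes=workers) as pool:
        results = pool.map(score_chunk, chunked(docs, chunk_size))
    return [s for chunk_scores in results for s in chunk_scores]


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    docs = [["profit", "rose"], ["loss", "widened"], ["no", "change"]] * 1000
    seq, seq_time = timed(run_sequential, docs, 500)
    par, par_time = timed(run_parallel, docs, 500, 4)
    assert seq == par  # consistency check between the two runs
    print(f"sequential: {seq_time:.6f} s, parallel: {par_time:.6f} s")
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module; without it, each worker would try to start its own pool.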
Observed parallel-stage results
- workers: 8
- chunk size: 500
- sequential time: 0.021747 s
- parallel time: 0.245139 s
- speedup: 0.0887x
On this dataset and configuration, parallel execution is slower than sequential execution because multiprocessing overhead dominates the workload. This does not mean the stage is incorrect; it shows a realistic benchmark outcome on a relatively small task.
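The reported speedup follows directly from the two measured times, taking speedup as sequential time divided by parallel time. The short calculation below reproduces it and also shows the extra wall-clock time of the parallel run; the variable names are illustrative.

```python
sequential_s = 0.021747  # measured sequential time from the results above
parallel_s = 0.245139    # measured parallel time from the results above

speedup = sequential_s / parallel_s    # values below 1 mean parallel is slower
extra_time_s = parallel_s - sequential_s

print(f"speedup: {speedup:.4f}x")
print(f"extra time in the parallel run: {extra_time_s:.6f} s")
```

The scoring work itself takes about 0.02 s, so the roughly 0.22 s difference is an order of magnitude larger than the task it is meant to accelerate.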
The validation stage:
- reads prediction outputs;
- checks whether ground-truth labels are available;
- computes accuracy and summary counts;
- computes per-class precision, recall, and F1;
- builds a confusion matrix;
- saves validated document-level results.
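The metric computations listed above can be sketched without external libraries. This is a minimal version under assumptions: the function names and toy labels are hypothetical, the confusion matrix uses true labels as rows and predictions as columns, and a metric with a zero denominator is reported as 0.0.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    pairs = Counter(zip(true_labels, predicted_labels))
    return {t: {p: pairs[(t, p)] for p in LABELS} for t in LABELS}


def accuracy(matrix):
    """Share of documents on the diagonal of the confusion matrix."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in LABELS)
    return correct / total if total else 0.0


def class_metrics(matrix, label):
    """Return (precision, recall, F1) for one class."""
    tp = matrix[label][label]
    predicted = sum(matrix[t][label] for t in LABELS)  # column total
    actual = sum(matrix[label].values())               # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels, for illustration only.
true = ["positive", "positive", "negative", "neutral", "neutral"]
pred = ["positive", "neutral", "negative", "neutral", "neutral"]
matrix = confusion_matrix(true, pred)
print("accuracy:", accuracy(matrix))
for label in LABELS:
    print(label, class_metrics(matrix, label))
```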
Observed validation results
- evaluated documents: 4838
- correct predictions: 3293
- accuracy: 0.6807
- average tokens per document: 21.3
- average score: 0.1875
Class metrics
- positive: precision 0.6061, recall 0.4236, F1 0.4987
- negative: precision 0.6360, recall 0.2864, F1 0.3949
- neutral: precision 0.7037, recall 0.8854, F1 0.7842
The current project outputs are internally consistent:
- output_lexicon/sentiment_summary.json
- output_mapreduce/reducer_summary.json
- output_parallel/parallel_sentiment_summary.json
- output_validation/metrics_summary.json
All of them agree on the same document count and sentiment distribution:
- documents: 4838
- positive: 952
- negative: 272
- neutral: 3614
- accuracy: 0.6807
This is a strong sign that the pipeline stages are aligned correctly.
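A cross-stage consistency check of this kind could be automated along the following lines. This is a sketch under a significant assumption: only the reducer's field names are documented here, so the code assumes all four summary files expose the same keys, which may not hold for the actual files. The function names are hypothetical.

```python
import json
from pathlib import Path

SUMMARY_FILES = [
    "output_lexicon/sentiment_summary.json",
    "output_mapreduce/reducer_summary.json",
    "output_parallel/parallel_sentiment_summary.json",
    "output_validation/metrics_summary.json",
]

# Assumption: every summary uses the reducer's field names.
SHARED_KEYS = [
    "documents_count",
    "positive_documents",
    "negative_documents",
    "neutral_documents",
    "accuracy",
]


def load_summaries(paths):
    """Read each JSON summary into a dict keyed by its path."""
    return {p: json.loads(Path(p).read_text(encoding="utf-8")) for p in paths}


def find_mismatches(summaries, keys=SHARED_KEYS):
    """Return {key: {name: value}} for every key whose values differ."""
    mismatches = {}
    for key in keys:
        values = {name: summary.get(key) for name, summary in summaries.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# In-memory example using the counts reported above.
reference = {
    "documents_count": 4838,
    "positive_documents": 952,
    "negative_documents": 272,
    "neutral_documents": 3614,
    "accuracy": 0.6807,
}
example = {"lexicon": dict(reference), "reducer": dict(reference)}
print(find_mismatches(example))
```

To check the real outputs, `find_mismatches(load_summaries(SUMMARY_FILES))` would be run from the repository root after all stages have finished; an empty result means the stages agree on every shared field.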
Run all commands from the repository root.
python scripts/data_preprocessing.py
python scripts/lexicon_scoring.py
python scripts/map_reduce_developer.py
python scripts/parallel_runner.py --workers 4 --chunk-size 500 --verify
python scripts/validation_metrics.py
A pre-trained lexicon was chosen because:
- it is interpretable;
- it does not require model training;
- it fits a modular MapReduce pipeline;
- it is easy to validate and explain.
MapReduce matches this task naturally:
- Map: process each document independently;
- Reduce: aggregate all document-level outputs.
Chunking was chosen because:
- it is simple to implement;
- it clearly demonstrates distributed thinking;
- it allows measurable runtime benchmarking.
The preprocessing stage intentionally stays aligned with the assignment:
- lowercase conversion;
- punctuation removal;
- tokenization.
Strengths
- clear modular architecture;
- separate outputs for each stage;
- reproducible pipeline;
- consistent summary counts across stages;
- built-in validation and benchmarking.
Limitations
- the lexicon is relatively small (78 terms);
- the approach may miss context, sarcasm, and domain nuance;
- parallel processing is slower than sequential processing on this dataset due to overhead;
- preprocessing is intentionally basic and does not include lemmatization or stopword removal.
The current project is functionally complete for the required assignment scope.
It correctly implements:
- preprocessing;
- lexicon-based sentiment scoring;
- mapper/reducer aggregation;
- parallel execution benchmarking;
- validation and metrics.
The README is now aligned with the actual repository structure, actual script names, actual output folders, and the current measured results.