GeoParser API

Overview

The GeoParser API is a service that extracts and disambiguates geographic entities (cities, countries, and other locations) from text. It combines SpaCy NER models with a transformer embedding model and the GeoNames gazetteer to recognize and resolve locations across multiple languages. The service is containerized with Docker for easy deployment and scaling.

Features

Named Entity Recognition (NER) for Locations: Identifies geographic names in text.
Multi-Language Support: Configurable to support various languages (e.g., English, German, French, Chinese, Spanish).
Flexible Model Selection: Supports different SpaCy model sizes (sm, md, lg, trf) to balance performance and resource usage.
Transformer-Based Models: Utilizes transformer models for enhanced accuracy.
Gazetteer Integration: Uses GeoNames for disambiguation and rich location data.
Dockerized: Easy to deploy and manage using Docker and Docker Compose.
Batch Processing: Efficiently parse multiple texts in a single API call.
Caching: In-memory caching for frequently requested texts to improve response times.
Health Check Endpoint: Provides a health status for monitoring.
GPU Support: Can leverage NVIDIA GPUs for accelerated processing.
Comprehensive Configuration: Highly configurable via environment variables.

Tech Stack

Backend: Python, Flask
NLP Libraries:
- geoparser (core library)
- SpaCy
- Transformers (Hugging Face)
- PyTorch
Containerization: Docker, Docker Compose
WSGI Server: Gunicorn

Supported Models

The GeoParser API supports a wide range of SpaCy language models for geographic entity recognition. The service currently supports 24 languages with different model configurations:

Language Support Overview

| Language | Code | Model Pattern | Available Sizes | TRF Support | Notes |
| --- | --- | --- | --- | --- | --- |
| Catalan | `ca` | `ca_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Chinese | `zh` | `zh_core_web_{size}` | sm, md, lg, trf | ✅ | Web-based model |
| Croatian | `hr` | `hr_core_news_{size}` | sm, md, lg | ❌ | |
| Danish | `da` | `da_core_news_{size}` | sm, md, lg | ❌ | |
| Dutch | `nl` | `nl_core_news_{size}` | sm, md, lg | ❌ | |
| English | `en` | `en_core_web_{size}` | sm, md, lg, trf | ✅ | Web-based model |
| Finnish | `fi` | `fi_core_news_{size}` | sm, md, lg | ❌ | |
| French | `fr` | `fr_core_news_{size}` | sm, md, lg | ⚠️ | TRF: `fr_dep_news_trf` (dependency parsing only) |
| German | `de` | `de_core_news_{size}` | sm, md, lg | ⚠️ | TRF: `de_dep_news_trf` (dependency parsing only) |
| Greek | `el` | `el_core_news_{size}` | sm, md, lg | ❌ | |
| Italian | `it` | `it_core_news_{size}` | sm, md, lg | ❌ | |
| Japanese | `ja` | `ja_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Korean | `ko` | `ko_core_news_{size}` | sm, md, lg | ❌ | |
| Lithuanian | `lt` | `lt_core_news_{size}` | sm, md, lg | ❌ | |
| Macedonian | `mk` | `mk_core_news_{size}` | sm, md, lg | ❌ | |
| Norwegian | `nb` | `nb_core_news_{size}` | sm, md, lg | ❌ | |
| Polish | `pl` | `pl_core_news_{size}` | sm, md, lg | ❌ | |
| Portuguese | `pt` | `pt_core_news_{size}` | sm, md, lg | ❌ | |
| Romanian | `ro` | `ro_core_news_{size}` | sm, md, lg | ❌ | |
| Russian | `ru` | `ru_core_news_{size}` | sm, md, lg | ❌ | |
| Slovenian | `sl` | `sl_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Spanish | `es` | `es_core_news_{size}` | sm, md, lg | ⚠️ | TRF: `es_dep_news_trf` (dependency parsing only) |
| Swedish | `sv` | `sv_core_news_{size}` | sm, md, lg | ❌ | |
| Ukrainian | `uk` | `uk_core_news_{size}` | sm, md, lg, trf | ✅ | |

Model Size Recommendations

sm (Small): Fastest, minimal memory usage, good for basic NER
md (Medium): Balanced performance and accuracy - Recommended for most use cases
lg (Large): Higher accuracy, more memory intensive
trf (Transformer): Highest accuracy where a full pipeline exists, but not recommended for geo-parsing because availability is limited and compatibility issues can occur

⚠️ Important Note: Although some languages have trf (transformer) models, we do not recommend the trf size for geographic entity recognition. German, French, and Spanish only ship dependency-parsing transformer models (xx_dep_news_trf), which cannot perform the named entity recognition that geo-parsing requires.

Model Naming Convention

The service follows SpaCy's standard naming convention:

English & Chinese: Use xx_core_web_{size} (web-trained models)
All other languages: Use xx_core_news_{size} (news-trained models)

Where xx is the ISO 639-1 language code and {size} is one of: sm, md, lg, trf.
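
The convention above can be sketched as a small helper. This is an illustrative sketch, not the service's own code: the function name `spacy_model_name` and the sets below are assumptions built from the Language Support Overview table, and the helper does not check that the language is one of the 24 listed.

```python
# Illustrative sketch of the naming convention; not taken from the service code.
WEB_TRAINED = {"en", "zh"}                       # these use xx_core_web_{size}
FULL_TRF = {"ca", "zh", "en", "ja", "sl", "uk"}  # trf marked as supported in the table
VALID_SIZES = ("sm", "md", "lg", "trf")


def spacy_model_name(lang: str, size: str = "md") -> str:
    """Return the SpaCy package name for a language code and model size."""
    if size not in VALID_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if size == "trf" and lang not in FULL_TRF:
        # Covers languages with no trf model and those with only xx_dep_news_trf.
        raise ValueError(f"no NER-capable trf model listed for {lang!r}")
    corpus = "web" if lang in WEB_TRAINED else "news"
    return f"{lang}_core_{corpus}_{size}"


print(spacy_model_name("en", "md"))
print(spacy_model_name("de", "lg"))
```

Rejecting `trf` for languages such as German keeps a caller from requesting a dependency-only pipeline that cannot find locations.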

Prerequisites

Git
Docker
Docker Compose (usually included with Docker Desktop)
A shell environment (like Bash, Zsh, PowerShell).
(Optional) For GPU support:
- NVIDIA GPU drivers
- NVIDIA Container Toolkit

Setup and Installation

Clone the Repository:

git clone <repository-url>
cd GeoParser-API

Configure Environment: Create a .env file in the root of the project. You can start from the example below, or from an existing .env file if your checkout includes one (it is often gitignored). A minimal example looks like this:

# GeoParser API Configuration

# --------------------------------------------------------------------------
# Model Configuration
# --------------------------------------------------------------------------
# Transformer model for embeddings (from Hugging Face Model Hub)
TRANSFORMER_MODEL=dguzh/geo-all-MiniLM-L6-v2
# Gazetteer to use (geonames is standard for geoparser)
GAZETTEER=geonames
# Available SpaCy model sizes (e.g., sm, md, lg, trf for transformer models)
# The setup_models.sh script will try to download models for these sizes for each supported language.
# The first size in this list will be used as the default if not specified in API calls.
AVAILABLE_MODEL_SIZES=md,sm

# --------------------------------------------------------------------------
# Supported Languages
# --------------------------------------------------------------------------
# Comma-separated list of ISO 639-1 language codes (e.g., en, de, fr, zh, es)
# The setup_models.sh script will download models for these languages.
SUPPORTED_LANGUAGES=en,de

# --------------------------------------------------------------------------
# Model and Data Paths (within the Docker container)
# These should generally not be changed unless you modify docker-compose.yml volume mounts.
# --------------------------------------------------------------------------
SPACY_MODEL_PATH=/app/models/spacy
# Currently not used for pre-downloaded custom transformers, but reserved.
TRANSFORMERS_MODEL_PATH=/app/models/transformers
GEONAMES_DATA_PATH=/app/data/geonames

# --------------------------------------------------------------------------
# API Configuration
# --------------------------------------------------------------------------
# Maximum characters for input text
MAX_TEXT_LENGTH=10000
# Request timeout in seconds
TIMEOUT=30
# Enable/disable in-memory cache
ENABLE_CACHE=true
# Maximum items in a batch request
MAX_BATCH_SIZE=100

# --------------------------------------------------------------------------
# Logging Configuration
# --------------------------------------------------------------------------
# One of: DEBUG, INFO, WARNING, ERROR, CRITICAL
LOG_LEVEL=INFO

# --------------------------------------------------------------------------
# Server Configuration (for Flask/Gunicorn)
# --------------------------------------------------------------------------
HOST=0.0.0.0
PORT=5000
# Set to true for Flask debug mode (not recommended for Gunicorn production)
DEBUG=false

# Gunicorn worker settings (see docker-compose.yml command for how these are used)
WORKERS=2
WORKER_TIMEOUT=600
# Worker class: sync, or 'gthread', 'eventlet', 'gevent' for async workers
WORKER_CLASS=sync
MAX_REQUESTS=1000
MAX_REQUESTS_JITTER=100

# --------------------------------------------------------------------------
# GPU Configuration (Informational, actual GPU allocation is via Docker)
# --------------------------------------------------------------------------
# Specific GPU to use, if multiple are available
CUDA_VISIBLE_DEVICES=0
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility

# --------------------------------------------------------------------------
# Docker Resource Limits (Informational, actual limits are in docker-compose.yml)
# --------------------------------------------------------------------------
MEMORY_LIMIT=12G
MEMORY_RESERVATION=6G
CPU_LIMIT=2.0
CPU_RESERVATION=1.0

After creating/editing your .env file, ensure it reflects the languages and model sizes you intend to use. Key variables to customize initially:

SUPPORTED_LANGUAGES: Comma-separated list of languages to support (e.g., en,de,fr,zh).
AVAILABLE_MODEL_SIZES: Comma-separated list of SpaCy model sizes to make available (e.g., sm,md,lg,trf).
PORT: Port on which the API will be accessible.
TRANSFORMER_MODEL: The Hugging Face model to use for embeddings.
Paths for models and data (ensure these match your docker-compose.yml volumes if you customize them).
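
As a rough illustration of how the comma-separated values might be read, the sketch below splits a setting into a list and takes the first model size as the default, as the comments in the example .env describe. The helper names are hypothetical and the actual parsing in app/config.py may differ.

```python
def parse_csv_setting(raw: str) -> list[str]:
    """Split a comma-separated .env value into trimmed, non-empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def default_model_size(available_sizes: str) -> str:
    """Return the first listed size, which the .env comments name as the default."""
    sizes = parse_csv_setting(available_sizes)
    if not sizes:
        raise ValueError("AVAILABLE_MODEL_SIZES must list at least one size")
    return sizes[0]


print(parse_csv_setting("en, de,fr"))
print(default_model_size("md,sm"))
```

Stripping whitespace and skipping empty items makes values such as `en, de,` tolerant of small typos.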

Download Models and Data: This step is required. It downloads the SpaCy language models and GeoNames data that the GeoParser service needs, using the settings in your .env file to decide which models to fetch.
```
bash setup_models.sh
```
This script will:
- Build a temporary Docker image.
- Run a container to download SpaCy models for the languages and sizes specified in .env.
- Download GeoNames data used by the geoparser library.
- Place these assets into the local ./models and ./data directories, which will be mounted into the main service container.
- Clean up the temporary Docker image.
- Fix potential permission issues on the created directories.
Ensure this script completes successfully. If you change SUPPORTED_LANGUAGES or AVAILABLE_MODEL_SIZES in .env later, you may need to re-run this script to download any new required models.
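
To judge whether a re-run is needed after editing .env, one can compare the old and new settings. The sketch below assumes the script fetches one model per (language, size) pair, as the .env comments suggest; the function names are hypothetical and the script's real logic is not shown in this document.

```python
from itertools import product


def model_pairs(languages, sizes):
    """All (language, size) combinations implied by the two settings."""
    return set(product(languages, sizes))


def newly_required(old_langs, old_sizes, new_langs, new_sizes):
    """Pairs present in the new configuration but absent from the old one."""
    added = model_pairs(new_langs, new_sizes) - model_pairs(old_langs, old_sizes)
    return sorted(added)


# Adding French to SUPPORTED_LANGUAGES while keeping AVAILABLE_MODEL_SIZES=md,sm
print(newly_required(["en", "de"], ["md", "sm"], ["en", "de", "fr"], ["md", "sm"]))
```

An empty result suggests the existing ./models directory already covers the new configuration.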

Running the Application

You have two options to run the GeoParser API:

Option 1: Build from Source (Development)

Once the setup is complete, you can start the GeoParser API using Docker Compose:

docker-compose up -d

The -d flag runs the containers in detached mode.
The service will be available at http://localhost:<PORT> (e.g., http://localhost:5000 if PORT=5000).

Option 2: Use Pre-built Image from Docker Hub (Production)

For easier deployment, you can use the pre-built Docker image:

# Pull the latest image
docker pull realjensen/geoparser-api:latest

# Run with docker-compose using pre-built image
docker-compose -f docker-compose.prod.yml up -d

Or run directly with Docker:

# Create required directories
mkdir -p models data logs

# Run the container
docker run -d \
  --name geoparser-api \
  -p 5000:5000 \
  -v $(pwd)/models:/app/models \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/logs:/app/logs \
  --env-file .env \
  realjensen/geoparser-api:latest

Note: You still need to download models and data using bash setup_models.sh before running the container.

Common Commands

To view the logs:

docker-compose logs -f geoparser

To stop the application:

docker-compose down

Docker Hub Repository

The GeoParser API is available as a pre-built Docker image on Docker Hub:

🐳 Docker Hub: realjensen/geoparser-api

Available tags:

latest: Most recent stable version
v1.0: Specific version tags

Building and Pushing to Docker Hub

If you want to build and push your own version to Docker Hub:

Login to Docker Hub:
```
docker login
```

Build and push using the provided script:

./build_and_push.sh [version] [username]

Examples:

# Push as latest with default username
./build_and_push.sh

# Push specific version
./build_and_push.sh v1.1 your-username

Manual build and push:

# Build production image
docker build -f Dockerfile.prod -t your-username/geoparser-api:latest .

# Push to Docker Hub
docker push your-username/geoparser-api:latest

API Endpoints

The API provides several endpoints for interacting with the GeoParser service. All request and response bodies are in JSON format.

1. Parse Text

Endpoint: POST /api/parse
Description: Parses a single text string to extract geographic entities.

Request Body:

{
    "text": "I want to travel from Berlin to Paris next week.",
    "languages": ["en"], // Optional: list of language codes (e.g., "en", "de"). Uses default if not provided or model not available.
    "model_size": "md"   // Optional: "sm", "md", "lg", "trf". Uses default from .env if not provided.
}

Example Request (curl):

curl -X POST -H "Content-Type: application/json" \
-d '{
    "text": "I want to travel from Berlin to Paris next week.",
    "languages": ["en"],
    "model_size": "md"
}' \
http://localhost:5000/api/parse

Success Response (200 OK):

{
    "success": true,
    "language_detected": "en",
    "model_used": "en_core_web_md",
    "text_length": 48,
    "locations_found": 2,
    "locations": [
        {
            "name": "Berlin",
            "geonameid": "2950159",
            "feature_type": "PPLC",
            "latitude": 52.52437,
            "longitude": 13.41053,
            "elevation": null,
            "population": 3426354,
            "admin2_name": null,
            "admin1_name": "Berlin",
            "country_name": "Germany"
        },
        {
            "name": "Paris",
            "geonameid": "2988507",
            "feature_type": "PPLC",
            "latitude": 48.85341,
            "longitude": 2.3488,
            "elevation": null,
            "population": 2138551,
            "admin2_name": null,
            "admin1_name": "Île-de-France",
            "country_name": "France"
        }
    ],
    "processing_time": 0.8523,
    "parse_time": 0.7998,
    "from_cache": false
}

Error Responses:
- 400 Bad Request: Invalid input (e.g., missing text, text too long).
```
{
    "success": false,
    "error": "Text cannot be empty",
    "locations": []
}
```
- 503 Service Unavailable: If the GeoParserService is not initialized.
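
For callers who prefer Python over curl, a minimal client sketch using only the standard library follows. It assumes the service listens on http://localhost:5000 and mirrors the example MAX_TEXT_LENGTH value; the function names are hypothetical, and the client-side checks only approximate the 400 errors listed above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumes PORT=5000, as in the examples
MAX_TEXT_LENGTH = 10000             # mirrors the example .env value


def build_parse_request(text, languages=None, model_size=None, base_url=BASE_URL):
    """Build (but do not send) a POST /api/parse request."""
    if not text or not text.strip():
        raise ValueError("text cannot be empty")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text exceeds MAX_TEXT_LENGTH")
    payload = {"text": text}
    if languages is not None:
        payload["languages"] = list(languages)
    if model_size is not None:
        payload["model_size"] = model_size
    return urllib.request.Request(
        f"{base_url}/api/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_text(text, **kwargs):
    """Send the request and decode the JSON body (needs a running service)."""
    request = build_parse_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 400 and 503 responses, so callers should catch it if they want to read the error body.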

2. Parse Batch of Texts

Endpoint: POST /api/parse/batch
Description: Parses a list of text strings.

Request Body:

{
    "texts": [
        {
            "id": "doc1", // Optional: user-defined identifier for the text
            "text": "London is the capital of the United Kingdom.",
            "languages": ["en"] // Optional: per-item language
        },
        {
            "id": "doc2",
            "text": "Ich fahre nach München.",
            "languages": ["de"]
        }
    ],
    "model_size": "md" // Optional: applies to all texts in the batch (a per-item model_size is not shown in service.py)
}

Example Request (curl):

curl -X POST -H "Content-Type: application/json" \
-d '{
    "texts": [
        {"id": "doc1", "text": "London is the capital of the United Kingdom.", "languages": ["en"]},
        {"id": "doc2", "text": "Ich fahre nach München.", "languages": ["de"]}
    ],
    "model_size": "md"
}' \
http://localhost:5000/api/parse/batch

Success Response (200 OK):

{
    "success": true,
    "total_processed": 2,
    "successful_parses": 2,
    "failed_parses": 0,
    "results": [
        {
            "id": "doc1", // Included if provided in request
            "success": true,
            "language_detected": "en",
            // ... other fields similar to /api/parse response
            "locations": [ /* ... */ ]
        },
        {
            "id": "doc2",
            "success": true,
            "language_detected": "de",
            // ... other fields
            "locations": [ /* ... */ ]
        }
    ]
}

Error Responses:
- 400 Bad Request: Invalid input (e.g., texts not a list, batch size exceeded).
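
Because a batch larger than MAX_BATCH_SIZE is rejected, a client with many documents has to split them first. The sketch below shows one way to do that and to collect the ids of failed items; it assumes the example limit of 100 and that failed items carry "success": false in results. The helper names are hypothetical.

```python
def chunk_texts(items, max_batch_size=100):
    """Split a list of batch items into lists no longer than max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [items[i:i + max_batch_size] for i in range(0, len(items), max_batch_size)]


def batch_payloads(items, model_size=None, max_batch_size=100):
    """Build one /api/parse/batch request body per chunk."""
    payloads = []
    for chunk in chunk_texts(items, max_batch_size):
        payload = {"texts": chunk}
        if model_size is not None:
            payload["model_size"] = model_size
        payloads.append(payload)
    return payloads


def failed_ids(batch_response):
    """Ids of results whose per-item success flag is missing or false."""
    results = batch_response.get("results", [])
    return [item.get("id") for item in results if not item.get("success")]


docs = [{"id": f"doc{i}", "text": "London is the capital."} for i in range(250)]
print([len(p["texts"]) for p in batch_payloads(docs, model_size="md")])
```

Supplying an id for every item makes it possible to retry only the entries returned by `failed_ids`.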

3. Get Service Information

Endpoint: GET /api/info
Description: Provides information about the loaded models and service configuration.
Example Request (curl):
```
curl http://localhost:5000/api/info
```

Success Response (200 OK):

{
    "success": true,
    "info": {
        "loaded_models": ["en", "de"], // Actual loaded language codes
        "default_model_size": "md",
        "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
        "gazetteer": "geonames",
        "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
        "cache_enabled": true,
        "cache_size": 10,
        "max_text_length": 10000,
        "max_batch_size": 100
    }
}

Error Responses:
- 503 Service Unavailable: If the GeoParserService is not initialized.
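
The info response can reveal languages that are configured but whose models did not load. The following sketch compares the two lists, assuming the response shape shown above; the function name is hypothetical.

```python
def languages_not_loaded(info_response):
    """Configured language codes that are missing from loaded_models."""
    info = info_response.get("info", {})
    loaded = set(info.get("loaded_models", []))
    return [lang for lang in info.get("supported_languages", []) if lang not in loaded]


sample = {
    "success": True,
    "info": {
        "loaded_models": ["en", "de"],
        "supported_languages": ["en", "de", "fr", "zh", "es"],
    },
}
print(languages_not_loaded(sample))
```

A non-empty result is a hint to re-run setup_models.sh for the missing languages.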

4. Health Check

Endpoint: GET /api/health
Description: Checks the health of the service. Used by Docker for container health monitoring.
Example Request (curl):
```
curl http://localhost:5000/api/health
```

Success Response (200 OK):

{
    "status": "healthy",
    "models_loaded": 2,
    "test_parse_success": true,
    "config_valid": true
}

Error Response (503 Service Unavailable):

{
    "status": "unhealthy",
    "error": "GeoParser service is not available"
}
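
Since models can take a while to load, a deployment script may want to wait for the health endpoint before sending traffic. This sketch polls a caller-supplied function that returns the decoded health JSON; the names, attempt count, and delay are assumptions, not part of the service.

```python
import time


def is_healthy(payload):
    """True when the health payload reports status 'healthy'."""
    return payload.get("status") == "healthy"


def wait_until_healthy(fetch_health, attempts=10, delay=3.0, sleep=time.sleep):
    """Call fetch_health up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        try:
            if is_healthy(fetch_health()):
                return True
        except OSError:
            # urllib's URLError and HTTPError (e.g. a 503) derive from OSError,
            # as do refused connections while the container is starting.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

In practice `fetch_health` could wrap `urllib.request.urlopen("http://localhost:5000/api/health")` and decode the body with `json.load`; injecting it keeps the polling logic testable without a running container.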

5. Clear Cache

Endpoint: POST /api/cache/clear
Description: Clears the in-memory cache of the GeoParserService.

Example Request (curl):

curl -X POST http://localhost:5000/api/cache/clear

Success Response (200 OK):

{
    "success": true,
    "message": "Cache cleared successfully. Removed 10 entries."
}

Response if caching is disabled (200 OK but indicates no action):

{
    "success": false, // Or true with a different message
    "message": "Caching is not enabled. No cache to clear."
}
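
To make the cache behaviour concrete, here is a minimal in-memory cache sketch whose `clear` method reports how many entries it removed, similar to the message above. It is an illustration only: the class name, the key layout, and the lack of any eviction policy are assumptions, and the service's real cache may work differently.

```python
class ParseCache:
    """Tiny in-memory cache keyed by text, languages, and model size."""

    def __init__(self, enabled=True):
        self.enabled = enabled
        self._store = {}

    @staticmethod
    def make_key(text, languages=None, model_size=None):
        # Lists are not hashable, so languages are stored as a tuple.
        return (text, tuple(languages or ()), model_size)

    def get(self, key):
        return self._store.get(key) if self.enabled else None

    def put(self, key, result):
        if self.enabled:
            self._store[key] = result

    def clear(self):
        """Remove every entry and return how many were removed."""
        removed = len(self._store)
        self._store.clear()
        return removed


cache = ParseCache()
cache.put(ParseCache.make_key("Berlin", ["en"], "md"), {"locations_found": 1})
print(f"Removed {cache.clear()} entries.")
```

Including the languages and model size in the key prevents a result parsed with one model from being served for a request that asked for another.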

6. Get Supported Languages and Models

Endpoint: GET /api/languages
Description: Returns the list of languages and model sizes supported by the current configuration.

Example Request (curl):

curl http://localhost:5000/api/languages

Success Response (200 OK):

{
    "success": true,
    "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
    "default_model_size": "md", // From .env
    "available_model_sizes": ["sm", "md", "lg", "trf"] // From .env
}
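
A client can use this response to check a request before sending it. The sketch below returns a list of problems for a given language and model size, assuming the response shape shown above; the function name and message texts are hypothetical.

```python
def check_request(languages_response, lang, model_size=None):
    """Return a list of problems; an empty list means the request fits the config."""
    problems = []
    if lang not in languages_response.get("supported_languages", []):
        problems.append(f"language {lang!r} is not supported")
    # Fall back to the configured default when no size is requested.
    size = model_size or languages_response.get("default_model_size")
    if size not in languages_response.get("available_model_sizes", []):
        problems.append(f"model size {size!r} is not available")
    return problems


config = {
    "success": True,
    "supported_languages": ["en", "de", "fr", "zh", "es"],
    "default_model_size": "md",
    "available_model_sizes": ["sm", "md", "lg", "trf"],
}
print(check_request(config, "it", "xl"))
```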

Root Endpoint

Endpoint: GET /
Description: Provides basic service information and a list of available endpoints.
Example Request (curl):
```
curl http://localhost:5000/
```

Success Response (200 OK):

{
    "service": "GeoParser API",
    "version": "1.0.0",
    "status": "running",
    "endpoints": {
        "parse": "/api/parse",
        "batch_parse": "/api/parse/batch",
        "info": "/api/info",
        "health": "/api/health",
        "clear_cache": "/api/cache/clear",
        "languages": "/api/languages"
    },
    "documentation": "https://github.com/Jensen-JZ/GeoParser-API"
}

Configuration Options

The application is configured primarily through the .env file. Some key options include:

TRANSFORMER_MODEL: Specifies the Hugging Face transformer model for embeddings.
GAZETTEER: The gazetteer to use (default: geonames).
AVAILABLE_MODEL_SIZES: Comma-separated list of SpaCy model sizes (e.g., sm,md,lg,trf).
SUPPORTED_LANGUAGES: Comma-separated list of ISO language codes (e.g., en,de,fr,zh,es).
SPACY_MODEL_PATH, TRANSFORMERS_MODEL_PATH, GEONAMES_DATA_PATH: Paths within the container where models and data are stored. These are typically managed by docker-compose.yml volumes and the setup_models.sh script.
MAX_TEXT_LENGTH: Maximum characters allowed for input text.
TIMEOUT: Request timeout.
ENABLE_CACHE: Set to true to enable in-memory caching.
MAX_BATCH_SIZE: Maximum number of texts allowed in a batch request.
LOG_LEVEL: Logging level (e.g., INFO, DEBUG).
HOST, PORT: Server host and port.
WORKERS, WORKER_TIMEOUT, etc.: Gunicorn worker configuration.
MEMORY_LIMIT, CPU_LIMIT: Docker resource limits.

Refer to the .env file and app/config.py for a complete list of configurations.
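
As a rough picture of how such options could be read into typed values, the sketch below loads a few of them from environment variables. The `ApiSettings` class and the fallback defaults (copied from the example .env) are assumptions; app/config.py may define different names and defaults.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings used in .env files."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ApiSettings:
    max_text_length: int
    timeout: int
    enable_cache: bool
    max_batch_size: int
    port: int


def load_settings(env=None) -> ApiSettings:
    """Read settings from a mapping (os.environ by default) with example defaults."""
    env = os.environ if env is None else env
    return ApiSettings(
        max_text_length=int(env.get("MAX_TEXT_LENGTH", "10000")),
        timeout=int(env.get("TIMEOUT", "30")),
        enable_cache=_as_bool(env.get("ENABLE_CACHE", "true")),
        max_batch_size=int(env.get("MAX_BATCH_SIZE", "100")),
        port=int(env.get("PORT", "5000")),
    )


print(load_settings({"ENABLE_CACHE": "false", "PORT": "8080"}))
```

Converting values once at startup means a malformed number fails immediately with a `ValueError` rather than during a request.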

GPU Support

The service is configured to support NVIDIA GPUs for faster model inference. To enable GPU support:

Ensure you have NVIDIA drivers installed on the host machine.
Install the NVIDIA Container Toolkit on the host machine.

The docker-compose.yml file includes the necessary runtime: nvidia configuration.

services:
  geoparser:
    # ... other configurations
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1 # Or 'all'
              capabilities: [gpu]

Note: runtime: nvidia selects the NVIDIA container runtime, while deploy.resources.reservations.devices is the device-reservation syntax defined by the Compose specification. Which of the two your setup needs depends on your Docker and Compose versions. The .env file also contains GPU-related environment variables such as CUDA_VISIBLE_DEVICES.

If the NVIDIA runtime is correctly configured, PyTorch (the backend used by Transformers) should automatically detect and use available GPUs.
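
A quick way to confirm what PyTorch sees is to run a check like the one below inside the container. This is a generic sketch, not code from the service; the function name `pick_device` is hypothetical, and it falls back to CPU when PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # PyTorch is not installed in this environment
    torch = None


def pick_device() -> str:
    """Return 'cuda' when PyTorch can see a GPU, otherwise 'cpu'."""
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    return "cpu"


print(pick_device())
```

If this reports `cpu` on a GPU host, the NVIDIA Container Toolkit setup or CUDA_VISIBLE_DEVICES is the usual place to look.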

Troubleshooting

Model Download Issues (setup_models.sh):
- Ensure you have a stable internet connection.
- Check for typos in SUPPORTED_LANGUAGES or AVAILABLE_MODEL_SIZES in your .env file. SpaCy model names are specific (e.g., en_core_web_sm, de_core_news_md). The script attempts to derive these.
- If a specific model fails, try downloading it manually with python -m spacy download <model_name> inside a Python environment with spacy installed to see more detailed errors.
Port Conflicts: If another service is using the specified PORT (default 5000), change it in .env and restart the containers.
Docker Permission Issues:
- The setup_models.sh script attempts to fix permissions for ./models and ./data directories.
- If you encounter permission errors when Docker tries to write to mounted volumes, ensure the user running Docker has write access to these directories on the host or run sudo chown -R $(whoami):$(whoami) models/ data/ logs/ (be cautious with sudo).
Service Fails to Start (Check Logs):
- docker-compose logs -f geoparser
- Look for errors related to model loading (e.g., "No models were successfully loaded") or Python package issues.
- Ensure all models listed by SUPPORTED_LANGUAGES and DEFAULT_MODEL_SIZE (first of AVAILABLE_MODEL_SIZES) were successfully downloaded by setup_models.sh.
CUDA_ERROR_NO_DEVICE or similar GPU errors:
- Verify NVIDIA drivers and NVIDIA Container Toolkit are correctly installed and configured on the host.
- Ensure the runtime: nvidia is set in docker-compose.yml.
- Check CUDA_VISIBLE_DEVICES in .env.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs, feature requests, or improvements.

License

This project is licensed under the MIT License. See the LICENSE file for details.
