GeoParser API

Overview

The GeoParser API extracts and disambiguates geographic entities (such as cities, countries, and other locations) from text. It combines spaCy NER models with a transformer embedding model to recognize locations across multiple languages, and it is containerized with Docker for straightforward deployment and scaling.

Features

  • Named Entity Recognition (NER) for Locations: Identifies geographic names in text.
  • Multi-Language Support: Configurable to support various languages (e.g., English, German, French, Chinese, Spanish).
  • Flexible Model Selection: Supports different spaCy model sizes (sm, md, lg, trf) to balance performance and resource usage.
  • Transformer-Based Embeddings: Uses a Hugging Face transformer model (set via TRANSFORMER_MODEL) for embeddings.
  • Gazetteer Integration: Uses GeoNames for disambiguation and rich location data.
  • Dockerized: Easy to deploy and manage using Docker and Docker Compose.
  • Batch Processing: Efficiently parse multiple texts in a single API call.
  • Caching: In-memory caching for frequently requested texts to improve response times.
  • Health Check Endpoint: Provides a health status for monitoring.
  • GPU Support: Can leverage NVIDIA GPUs for accelerated processing.
  • Comprehensive Configuration: Highly configurable via environment variables.

Tech Stack

  • Backend: Python, Flask
  • NLP Libraries:
    • geoparser (core library)
    • spaCy
    • Transformers (Hugging Face)
    • PyTorch
  • Containerization: Docker, Docker Compose
  • WSGI Server: Gunicorn

Supported Models

The GeoParser API supports a wide range of spaCy language models for geographic entity recognition. The service currently supports 24 languages with different model configurations:

Language Support Overview

Language | Code | Model Pattern | Available Sizes | TRF Support | Notes
Catalan | ca | ca_core_news_{size} | sm, md, lg, trf | Yes |
Chinese | zh | zh_core_web_{size} | sm, md, lg, trf | Yes | Web-based model
Croatian | hr | hr_core_news_{size} | sm, md, lg | No |
Danish | da | da_core_news_{size} | sm, md, lg | No |
Dutch | nl | nl_core_news_{size} | sm, md, lg | No |
English | en | en_core_web_{size} | sm, md, lg, trf | Yes | Web-based model
Finnish | fi | fi_core_news_{size} | sm, md, lg | No |
French | fr | fr_core_news_{size} | sm, md, lg | ⚠️ | TRF: fr_dep_news_trf (dependency parsing only)
German | de | de_core_news_{size} | sm, md, lg | ⚠️ | TRF: de_dep_news_trf (dependency parsing only)
Greek | el | el_core_news_{size} | sm, md, lg | No |
Italian | it | it_core_news_{size} | sm, md, lg | No |
Japanese | ja | ja_core_news_{size} | sm, md, lg, trf | Yes |
Korean | ko | ko_core_news_{size} | sm, md, lg | No |
Lithuanian | lt | lt_core_news_{size} | sm, md, lg | No |
Macedonian | mk | mk_core_news_{size} | sm, md, lg | No |
Norwegian | nb | nb_core_news_{size} | sm, md, lg | No |
Polish | pl | pl_core_news_{size} | sm, md, lg | No |
Portuguese | pt | pt_core_news_{size} | sm, md, lg | No |
Romanian | ro | ro_core_news_{size} | sm, md, lg | No |
Russian | ru | ru_core_news_{size} | sm, md, lg | No |
Slovenian | sl | sl_core_news_{size} | sm, md, lg, trf | Yes |
Spanish | es | es_core_news_{size} | sm, md, lg | ⚠️ | TRF: es_dep_news_trf (dependency parsing only)
Swedish | sv | sv_core_news_{size} | sm, md, lg | No |
Ukrainian | uk | uk_core_news_{size} | sm, md, lg, trf | Yes |

Model Size Recommendations

  • sm (Small): Fastest, minimal memory usage, good for basic NER
  • md (Medium): Balanced performance and accuracy - Recommended for most use cases
  • lg (Large): Higher accuracy, more memory intensive
  • trf (Transformer): Highest accuracy but not recommended for geo-parsing due to limited availability and compatibility issues

⚠️ Important Note: While some languages have trf (transformer) models available, we do not recommend using the trf size for geographic entity recognition. Languages like German, French, and Spanish only have dependency parsing transformer models (xx_dep_news_trf) which cannot perform named entity recognition required for geo-parsing.

Model Naming Convention

The service follows spaCy's standard naming convention:

  • English & Chinese: Use xx_core_web_{size} (web-trained models)
  • All other languages: Use xx_core_news_{size} (news-trained models)

Where xx is the ISO 639-1 language code and {size} is one of: sm, md, lg, trf.
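
As a minimal sketch of this convention, the model name can be derived from a language code and a size. The helper name spacy_model_name is hypothetical and is not part of the service:

```python
# Languages whose pipelines are web-trained, per the convention above.
WEB_LANGUAGES = {"en", "zh"}
VALID_SIZES = {"sm", "md", "lg", "trf"}


def spacy_model_name(lang: str, size: str) -> str:
    """Build a model name such as 'en_core_web_md' (hypothetical helper)."""
    if size not in VALID_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    genre = "web" if lang in WEB_LANGUAGES else "news"
    return f"{lang}_core_{genre}_{size}"


print(spacy_model_name("en", "md"))  # en_core_web_md
print(spacy_model_name("de", "sm"))  # de_core_news_sm
```

A name built this way is not guaranteed to exist as a published model. For example, German with the trf size yields de_core_news_trf, while the only German transformer model listed above is de_dep_news_trf.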

Prerequisites

Setup and Installation

  1. Clone the Repository:

    git clone <repository-url>
    cd GeoParser-API
  2. Configure Environment: Create a .env file in the root of the project. If your checkout already includes a .env file (it is often gitignored), you can start from that; otherwise copy the example below. A minimal .env.example looks like this:

    # GeoParser API Configuration
    
    # --------------------------------------------------------------------------
    # Model Configuration
    # --------------------------------------------------------------------------
    # Transformer model for embeddings (from Hugging Face Model Hub)
    TRANSFORMER_MODEL=dguzh/geo-all-MiniLM-L6-v2
    # Gazetteer to use (geonames is standard for geoparser)
    GAZETTEER=geonames
    # Available spaCy model sizes (e.g., sm, md, lg, trf for transformer models)
    # The setup_models.sh script will try to download models for these sizes for each supported language.
    # The first size in this list will be used as the default if not specified in API calls.
    AVAILABLE_MODEL_SIZES=md,sm
    
    # --------------------------------------------------------------------------
    # Supported Languages
    # --------------------------------------------------------------------------
    # Comma-separated list of ISO 639-1 language codes (e.g., en, de, fr, zh, es)
    # The setup_models.sh script will download models for these languages.
    SUPPORTED_LANGUAGES=en,de
    
    # --------------------------------------------------------------------------
    # Model and Data Paths (within the Docker container)
    # These should generally not be changed unless you modify docker-compose.yml volume mounts.
    # --------------------------------------------------------------------------
    SPACY_MODEL_PATH=/app/models/spacy
    # Currently not used for pre-downloaded custom transformers, but reserved.
    TRANSFORMERS_MODEL_PATH=/app/models/transformers
    GEONAMES_DATA_PATH=/app/data/geonames
    
    # --------------------------------------------------------------------------
    # API Configuration
    # --------------------------------------------------------------------------
    # Maximum characters for input text
    MAX_TEXT_LENGTH=10000
    # Request timeout in seconds
    TIMEOUT=30
    # Enable/disable in-memory cache
    ENABLE_CACHE=true
    # Maximum items in a batch request
    MAX_BATCH_SIZE=100
    
    # --------------------------------------------------------------------------
    # Logging Configuration
    # --------------------------------------------------------------------------
    # One of: DEBUG, INFO, WARNING, ERROR, CRITICAL
    LOG_LEVEL=INFO
    
    # --------------------------------------------------------------------------
    # Server Configuration (for Flask/Gunicorn)
    # --------------------------------------------------------------------------
    HOST=0.0.0.0
    PORT=5000
    # Set to true for Flask debug mode (not recommended for Gunicorn production)
    DEBUG=false
    
    # Gunicorn worker settings (see docker-compose.yml command for how these are used)
    WORKERS=2
    WORKER_TIMEOUT=600
    # Use 'sync', or 'gthread', 'eventlet', 'gevent' for async workers
    WORKER_CLASS=sync
    MAX_REQUESTS=1000
    MAX_REQUESTS_JITTER=100
    
    # --------------------------------------------------------------------------
    # GPU Configuration (Informational, actual GPU allocation is via Docker)
    # --------------------------------------------------------------------------
    # Specific GPU to use, if multiple are available
    CUDA_VISIBLE_DEVICES=0
    NVIDIA_VISIBLE_DEVICES=all
    NVIDIA_DRIVER_CAPABILITIES=compute,utility
    
    # --------------------------------------------------------------------------
    # Docker Resource Limits (Informational, actual limits are in docker-compose.yml)
    # --------------------------------------------------------------------------
    MEMORY_LIMIT=12G
    MEMORY_RESERVATION=6G
    CPU_LIMIT=2.0
    CPU_RESERVATION=1.0

    After creating/editing your .env file, ensure it reflects the languages and model sizes you intend to use. Key variables to customize initially:

    • SUPPORTED_LANGUAGES: Comma-separated list of languages to support (e.g., en,de,fr,zh).
    • AVAILABLE_MODEL_SIZES: Comma-separated list of spaCy model sizes to make available (e.g., sm,md,lg,trf).
    • PORT: Port on which the API will be accessible.
    • TRANSFORMER_MODEL: The Hugging Face model to use for embeddings.
    • Paths for models and data (ensure these match your docker-compose.yml volumes if you customize them).
  3. Download Models and Data: This step is required. It downloads the spaCy language models and GeoNames data that the GeoParser service needs. The script uses the settings from your .env file to determine which models to fetch.

    bash setup_models.sh

    This script will:

    • Build a temporary Docker image.
    • Run a container to download spaCy models for the languages and sizes specified in .env.
    • Download GeoNames data used by the geoparser library.
    • Place these assets into the local ./models and ./data directories, which will be mounted into the main service container.
    • Clean up the temporary Docker image.
    • Fix potential permission issues on the created directories.

    Ensure this script completes successfully. If you change SUPPORTED_LANGUAGES or AVAILABLE_MODEL_SIZES in .env later, you may need to re-run this script to download any new required models.
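
To see which models a given .env implies before running the script, the two comma-separated values can be expanded by hand. The sketch below is illustrative only: the function names are hypothetical, and it assumes the naming convention described earlier rather than reproducing what setup_models.sh actually does.

```python
def parse_csv(value: str) -> list[str]:
    """Split a comma-separated .env value, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def expected_models(languages_csv: str, sizes_csv: str) -> tuple[str, list[str]]:
    """Return (default_size, model_names) for the given .env values."""
    languages = parse_csv(languages_csv)
    sizes = parse_csv(sizes_csv)
    if not languages or not sizes:
        raise ValueError("both lists need at least one entry")
    names = []
    for lang in languages:
        # English and Chinese use web-trained models; all others use news.
        genre = "web" if lang in {"en", "zh"} else "news"
        names.extend(f"{lang}_core_{genre}_{size}" for size in sizes)
    # The first size in the list is the default, as the .env comments state.
    return sizes[0], names


default_size, models = expected_models("en,de", "md,sm")
print(default_size)  # md
print(models)
```

With the example values SUPPORTED_LANGUAGES=en,de and AVAILABLE_MODEL_SIZES=md,sm, this lists four models, with md as the default size.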

Running the Application

You have two options to run the GeoParser API:

Option 1: Build from Source (Development)

Once the setup is complete, you can start the GeoParser API using Docker Compose:

docker-compose up -d
  • The -d flag runs the containers in detached mode.
  • The service will be available at http://localhost:<PORT> (e.g., http://localhost:5000 if PORT=5000).

Option 2: Use Pre-built Image from Docker Hub (Production)

For easier deployment, you can use the pre-built Docker image:

# Pull the latest image
docker pull realjensen/geoparser-api:latest

# Run with docker-compose using pre-built image
docker-compose -f docker-compose.prod.yml up -d

Or run directly with Docker:

# Create required directories
mkdir -p models data logs

# Run the container
docker run -d \
  --name geoparser-api \
  -p 5000:5000 \
  -v $(pwd)/models:/app/models \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/logs:/app/logs \
  --env-file .env \
  realjensen/geoparser-api:latest

Note: You still need to download models and data using bash setup_models.sh before running the container.

Common Commands

To view the logs:

docker-compose logs -f geoparser

To stop the application:

docker-compose down

Docker Hub Repository

The GeoParser API is available as a pre-built Docker image on Docker Hub:

🐳 Docker Hub: realjensen/geoparser-api

Available tags:

  • latest: Most recent stable version
  • v1.0: Specific version tags

Building and Pushing to Docker Hub

If you want to build and push your own version to Docker Hub:

  1. Login to Docker Hub:

    docker login
  2. Build and push using the provided script:

    ./build_and_push.sh [version] [username]

    Examples:

    # Push as latest with default username
    ./build_and_push.sh
    
    # Push specific version
    ./build_and_push.sh v1.1 your-username
  3. Manual build and push:

    # Build production image
    docker build -f Dockerfile.prod -t your-username/geoparser-api:latest .
    
    # Push to Docker Hub
    docker push your-username/geoparser-api:latest

API Endpoints

The API provides several endpoints for interacting with the GeoParser service. All request and response bodies are in JSON format.


1. Parse Text

  • Endpoint: POST /api/parse
  • Description: Parses a single text string to extract geographic entities.
  • Request Body:
    {
        "text": "I want to travel from Berlin to Paris next week.",
        "languages": ["en"], // Optional: list of language codes (e.g., "en", "de"). Uses default if not provided or model not available.
        "model_size": "md"   // Optional: "sm", "md", "lg", "trf". Uses default from .env if not provided.
    }
  • Example Request (curl):
    curl -X POST -H "Content-Type: application/json" \
    -d '{
        "text": "I want to travel from Berlin to Paris next week.",
        "languages": ["en"],
        "model_size": "md"
    }' \
    http://localhost:5000/api/parse
  • Success Response (200 OK):
    {
        "success": true,
        "language_detected": "en",
        "model_used": "en_core_web_md",
        "text_length": 48,
        "locations_found": 2,
        "locations": [
            {
                "name": "Berlin",
                "geonameid": "2950159",
                "feature_type": "PPLC",
                "latitude": 52.52437,
                "longitude": 13.41053,
                "elevation": null,
                "population": 3426354,
                "admin2_name": null,
                "admin1_name": "Berlin",
                "country_name": "Germany"
            },
            {
                "name": "Paris",
                "geonameid": "2988507",
                "feature_type": "PPLC",
                "latitude": 48.85341,
                "longitude": 2.3488,
                "elevation": null,
                "population": 2138551,
                "admin2_name": null,
                "admin1_name": "Île-de-France",
                "country_name": "France"
            }
        ],
        "processing_time": 0.8523,
        "parse_time": 0.7998,
        "from_cache": false
    }
  • Error Responses:
    • 400 Bad Request: Invalid input (e.g., missing text, text too long).
      {
          "success": false,
          "error": "Text cannot be empty",
          "locations": []
      }
    • 503 Service Unavailable: If the GeoParserService is not initialized.
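
The same request can be issued from Python with only the standard library. This is a minimal sketch, assuming the service listens on http://localhost:5000; the function names are hypothetical, and the response handling relies only on the fields shown in the example response above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: PORT=5000 as in the example .env


def build_parse_request(text, languages=None, model_size=None):
    """Build a POST request for /api/parse; optional fields are omitted when None."""
    payload = {"text": text}
    if languages is not None:
        payload["languages"] = languages
    if model_size is not None:
        payload["model_size"] = model_size
    return urllib.request.Request(
        f"{BASE_URL}/api/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def location_summary(response_body: dict) -> list[tuple]:
    """Reduce a success response to (name, latitude, longitude) tuples."""
    return [
        (loc["name"], loc["latitude"], loc["longitude"])
        for loc in response_body.get("locations", [])
    ]
```

Against a running service, location_summary(send(build_parse_request("I want to travel from Berlin to Paris next week.", ["en"], "md"))) would return one tuple per location found. Note that urlopen raises urllib.error.HTTPError for the 400 and 503 responses listed above.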

2. Parse Batch of Texts

  • Endpoint: POST /api/parse/batch
  • Description: Parses a list of text strings.
  • Request Body:
    {
        "texts": [
            {
                "id": "doc1", // Optional: user-defined identifier for the text
                "text": "London is the capital of the United Kingdom.",
                "languages": ["en"] // Optional: per-item language
            },
            {
                "id": "doc2",
                "text": "Ich fahre nach München.",
                "languages": ["de"]
            }
        ],
        "model_size": "md" // Optional: applies to all texts in the batch (a per-item model_size override does not appear to be implemented in service.py)
    }
  • Example Request (curl):
    curl -X POST -H "Content-Type: application/json" \
    -d '{
        "texts": [
            {"id": "doc1", "text": "London is the capital of the United Kingdom.", "languages": ["en"]},
            {"id": "doc2", "text": "Ich fahre nach München.", "languages": ["de"]}
        ],
        "model_size": "md"
    }' \
    http://localhost:5000/api/parse/batch
  • Success Response (200 OK):
    {
        "success": true,
        "total_processed": 2,
        "successful_parses": 2,
        "failed_parses": 0,
        "results": [
            {
                "id": "doc1", // Included if provided in request
                "success": true,
                "language_detected": "en",
                // ... other fields similar to /api/parse response
                "locations": [ /* ... */ ]
            },
            {
                "id": "doc2",
                "success": true,
                "language_detected": "de",
                // ... other fields
                "locations": [ /* ... */ ]
            }
        ]
    }
  • Error Responses:
    • 400 Bad Request: Invalid input (e.g., texts not a list, batch size exceeded).
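
Because a batch larger than MAX_BATCH_SIZE is rejected with a 400, a client with many documents needs to split them first. The following sketch shows one way to do that; batch_payloads is a hypothetical helper, and the default of 100 mirrors the example .env rather than a value read from the service.

```python
def batch_payloads(items, max_batch_size=100, model_size=None):
    """Yield /api/parse/batch request bodies with at most max_batch_size texts each."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        payload = {"texts": items[start:start + max_batch_size]}
        if model_size is not None:
            payload["model_size"] = model_size
        yield payload


documents = [{"id": f"doc{i}", "text": f"Text number {i}"} for i in range(250)]
sizes = [len(p["texts"]) for p in batch_payloads(documents, max_batch_size=100)]
print(sizes)  # [100, 100, 50]
```

Giving each item an id, as in the request body above, makes it possible to match results back to inputs after the batches are merged.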

3. Get Service Information

  • Endpoint: GET /api/info
  • Description: Provides information about the loaded models and service configuration.
  • Example Request (curl):
    curl http://localhost:5000/api/info
  • Success Response (200 OK):
    {
        "success": true,
        "info": {
            "loaded_models": ["en", "de"], // Actual loaded language codes
            "default_model_size": "md",
            "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
            "gazetteer": "geonames",
            "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
            "cache_enabled": true,
            "cache_size": 10,
            "max_text_length": 10000,
            "max_batch_size": 100
        }
    }
  • Error Responses:
    • 503 Service Unavailable: If the GeoParserService is not initialized.

4. Health Check

  • Endpoint: GET /api/health
  • Description: Checks the health of the service. Used by Docker for container health monitoring.
  • Example Request (curl):
    curl http://localhost:5000/api/health
  • Success Response (200 OK):
    {
        "status": "healthy",
        "models_loaded": 2,
        "test_parse_success": true,
        "config_valid": true
    }
  • Error Response (503 Service Unavailable):
    {
        "status": "unhealthy",
        "error": "GeoParser service is not available"
    }
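
A deployment script can poll this endpoint until the service reports healthy before sending traffic. The sketch below is one possible approach with hypothetical function names; the retry count and delay are arbitrary assumptions, and the URL assumes PORT=5000.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url="http://localhost:5000/api/health"):
    """Return the parsed health payload, or None if the call fails.

    A 503 response raises HTTPError (a URLError subclass), so an
    unhealthy or unreachable service both end up as None here.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=2.0):
    """Poll until the status is 'healthy'; return False if attempts run out."""
    for attempt in range(attempts):
        payload = fetch()
        if payload and payload.get("status") == "healthy":
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the fetch function as a parameter keeps the polling loop testable without a running container.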

5. Clear Cache

  • Endpoint: POST /api/cache/clear
  • Description: Clears the in-memory cache of the GeoParserService.
  • Example Request (curl):
    curl -X POST http://localhost:5000/api/cache/clear
  • Success Response (200 OK):
    {
        "success": true,
        "message": "Cache cleared successfully. Removed 10 entries."
    }
  • Response if caching is disabled (200 OK but indicates no action):
    {
        "success": false, // Or true with a different message
        "message": "Caching is not enabled. No cache to clear."
    }

6. Get Supported Languages and Models

  • Endpoint: GET /api/languages
  • Description: Returns the list of languages and model sizes supported by the current configuration.
  • Example Request (curl):
    curl http://localhost:5000/api/languages
  • Success Response (200 OK):
    {
        "success": true,
        "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
        "default_model_size": "md", // From .env
        "available_model_sizes": ["sm", "md", "lg", "trf"] // From .env
    }
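
A client can use this response to check a request before sending it. The sketch below is a hypothetical client-side helper, not part of the API. It is optional, since the parse endpoint is documented to fall back to defaults when a language or size is missing or unavailable.

```python
def check_request(languages, model_size, capabilities):
    """Return a list of problems found by comparing a request to /api/languages output."""
    problems = []
    supported = set(capabilities.get("supported_languages", []))
    sizes = set(capabilities.get("available_model_sizes", []))
    for lang in languages or []:
        if lang not in supported:
            problems.append(f"language {lang!r} is not in supported_languages")
    if model_size is not None and model_size not in sizes:
        problems.append(f"model size {model_size!r} is not in available_model_sizes")
    return problems


# Same shape as the example response above.
capabilities = {
    "success": True,
    "supported_languages": ["en", "de", "fr", "zh", "es"],
    "default_model_size": "md",
    "available_model_sizes": ["sm", "md", "lg", "trf"],
}
print(check_request(["en", "it"], "xl", capabilities))
```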

Root Endpoint

  • Endpoint: GET /
  • Description: Provides basic service information and a list of available endpoints.
  • Example Request (curl):
    curl http://localhost:5000/
  • Success Response (200 OK):
    {
        "service": "GeoParser API",
        "version": "1.0.0",
        "status": "running",
        "endpoints": {
            "parse": "/api/parse",
            "batch_parse": "/api/parse/batch",
            "info": "/api/info",
            "health": "/api/health",
            "clear_cache": "/api/cache/clear",
            "languages": "/api/languages"
        },
        "documentation": "https://github.com/Jensen-JZ/GeoParser-API"
    }

Configuration Options

The application is configured primarily through the .env file. Some key options include:

  • TRANSFORMER_MODEL: Specifies the Hugging Face transformer model for embeddings.
  • GAZETTEER: The gazetteer to use (default: geonames).
  • AVAILABLE_MODEL_SIZES: Comma-separated list of spaCy model sizes (e.g., sm,md,lg,trf).
  • SUPPORTED_LANGUAGES: Comma-separated list of ISO language codes (e.g., en,de,fr,zh,es).
  • SPACY_MODEL_PATH, TRANSFORMERS_MODEL_PATH, GEONAMES_DATA_PATH: Paths within the container where models and data are stored. These are typically managed by docker-compose.yml volumes and the setup_models.sh script.
  • MAX_TEXT_LENGTH: Maximum characters allowed for input text.
  • TIMEOUT: Request timeout.
  • ENABLE_CACHE: Set to true to enable in-memory caching.
  • MAX_BATCH_SIZE: Maximum number of texts allowed in a batch request.
  • LOG_LEVEL: Logging level (e.g., INFO, DEBUG).
  • HOST, PORT: Server host and port.
  • WORKERS, WORKER_TIMEOUT, etc.: Gunicorn worker configuration.
  • MEMORY_LIMIT, CPU_LIMIT: Docker resource limits.

Refer to the .env file and app/config.py for a complete list of configurations.
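
Environment variables arrive as strings, so numeric and boolean options have to be converted before use. The following sketch shows one way to do that for a few of the options above. It is not the project's app/config.py; the function names are hypothetical, and the defaults simply mirror the example .env.

```python
import os


def env_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env=os.environ) -> dict:
    """Read a subset of the options with typed values (illustrative only)."""
    return {
        "max_text_length": int(env.get("MAX_TEXT_LENGTH", "10000")),
        "max_batch_size": int(env.get("MAX_BATCH_SIZE", "100")),
        "timeout": int(env.get("TIMEOUT", "30")),
        "enable_cache": env_bool(env.get("ENABLE_CACHE", "true")),
        "port": int(env.get("PORT", "5000")),
    }


settings = load_settings({"ENABLE_CACHE": "false", "PORT": "8080"})
print(settings["enable_cache"], settings["port"])  # False 8080
```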

GPU Support

The service is configured to support NVIDIA GPUs for faster model inference. To enable GPU support:

  1. Ensure you have NVIDIA drivers installed on the host machine.
  2. Install the NVIDIA Container Toolkit on the host machine.
  3. The docker-compose.yml file includes the necessary runtime: nvidia configuration.
    services:
      geoparser:
        # ... other configurations
        runtime: nvidia
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: 1 # Or 'all'
                  capabilities: [gpu]
    (Note: deploy.resources.reservations.devices is the GPU reservation syntax defined by the Compose specification, while runtime: nvidia is the older mechanism and relies on the NVIDIA runtime being registered with the Docker daemon.) The .env file also contains GPU-related environment variables like CUDA_VISIBLE_DEVICES.

If the NVIDIA runtime is correctly configured, PyTorch (a dependency of Transformers) should automatically detect and use available GPUs.
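
To confirm what PyTorch actually sees, a short check can be run inside the container. This is a minimal sketch with a hypothetical function name; it uses only torch.cuda.is_available() and torch.cuda.device_count(), and reports a message instead of failing when PyTorch is not installed.

```python
def describe_gpu() -> str:
    """Report whether PyTorch can see a CUDA device in this environment."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed but sees no CUDA device"
    return f"CUDA devices visible to PyTorch: {torch.cuda.device_count()}"


print(describe_gpu())
```

If this reports no CUDA device inside the container while the host has a working GPU, the NVIDIA Container Toolkit setup and the CUDA_VISIBLE_DEVICES value are the first things to check.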

Troubleshooting

  • Model Download Issues (setup_models.sh):
    • Ensure you have a stable internet connection.
    • Check for typos in SUPPORTED_LANGUAGES or AVAILABLE_MODEL_SIZES in your .env file. spaCy model names follow a fixed pattern (e.g., en_core_web_sm, de_core_news_md), and the script attempts to derive them from these two variables.
    • If a specific model fails, try downloading it manually with python -m spacy download <model_name> inside a Python environment with spacy installed to see more detailed errors.
  • Port Conflicts: If another service is using the specified PORT (default 5000), change it in .env and restart the containers.
  • Docker Permission Issues:
    • The setup_models.sh script attempts to fix permissions for ./models and ./data directories.
    • If you encounter permission errors when Docker tries to write to mounted volumes, ensure the user running Docker has write access to these directories on the host or run sudo chown -R $(whoami):$(whoami) models/ data/ logs/ (be cautious with sudo).
  • Service Fails to Start (Check Logs):
    • docker-compose logs -f geoparser
    • Look for errors related to model loading (e.g., "No models were successfully loaded") or Python package issues.
    • Ensure all models listed by SUPPORTED_LANGUAGES and DEFAULT_MODEL_SIZE (first of AVAILABLE_MODEL_SIZES) were successfully downloaded by setup_models.sh.
  • CUDA_ERROR_NO_DEVICE or similar GPU errors:
    • Verify NVIDIA drivers and NVIDIA Container Toolkit are correctly installed and configured on the host.
    • Ensure the runtime: nvidia is set in docker-compose.yml.
    • Check CUDA_VISIBLE_DEVICES in .env.

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs, feature requests, or improvements.

License

This project is licensed under the MIT License. See the LICENSE file for details.

About

GeoParser-API is a modern, containerized reimplementation of the classic Irchel Geoparser, providing a plug-and-play geolocation parsing API built with spaCy and Transformers.
