The GeoParser API is a powerful service designed to extract and disambiguate geographic entities (like cities, countries, and other locations) from text. It leverages state-of-the-art NLP models to provide accurate location recognition across multiple languages. This service is containerized using Docker for easy deployment and scalability.
- Named Entity Recognition (NER) for Locations: Identifies geographic names in text.
- Multi-Language Support: Configurable to support various languages (e.g., English, German, French, Chinese, Spanish).
- Flexible Model Selection: Supports different SpaCy model sizes (sm, md, lg, trf) to balance performance and resource usage.
- Transformer-Based Models: Uses a Hugging Face transformer model (set via `TRANSFORMER_MODEL`) for embeddings to improve accuracy.
- Gazetteer Integration: Uses GeoNames for disambiguation and rich location data.
- Dockerized: Easy to deploy and manage using Docker and Docker Compose.
- Batch Processing: Efficiently parse multiple texts in a single API call.
- Caching: In-memory caching for frequently requested texts to improve response times.
- Health Check Endpoint: Provides a health status for monitoring.
- GPU Support: Can leverage NVIDIA GPUs for accelerated processing.
- Comprehensive Configuration: Highly configurable via environment variables.
- Backend: Python, Flask
- NLP Libraries: geoparser (core library), SpaCy, Transformers (Hugging Face), PyTorch
- Containerization: Docker, Docker Compose
- WSGI Server: Gunicorn
The GeoParser API supports a wide range of SpaCy language models for geographic entity recognition. The service currently supports 24 languages with different model configurations:
| Language | Code | Model Pattern | Available Sizes | TRF Support | Notes |
|---|---|---|---|---|---|
| Catalan | ca | `ca_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Chinese | zh | `zh_core_web_{size}` | sm, md, lg, trf | ✅ | Web-based model |
| Croatian | hr | `hr_core_news_{size}` | sm, md, lg | ❌ | |
| Danish | da | `da_core_news_{size}` | sm, md, lg | ❌ | |
| Dutch | nl | `nl_core_news_{size}` | sm, md, lg | ❌ | |
| English | en | `en_core_web_{size}` | sm, md, lg, trf | ✅ | Web-based model |
| Finnish | fi | `fi_core_news_{size}` | sm, md, lg | ❌ | |
| French | fr | `fr_core_news_{size}` | sm, md, lg | TRF: `fr_dep_news_trf` (dependency parsing only) | |
| German | de | `de_core_news_{size}` | sm, md, lg | TRF: `de_dep_news_trf` (dependency parsing only) | |
| Greek | el | `el_core_news_{size}` | sm, md, lg | ❌ | |
| Italian | it | `it_core_news_{size}` | sm, md, lg | ❌ | |
| Japanese | ja | `ja_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Korean | ko | `ko_core_news_{size}` | sm, md, lg | ❌ | |
| Lithuanian | lt | `lt_core_news_{size}` | sm, md, lg | ❌ | |
| Macedonian | mk | `mk_core_news_{size}` | sm, md, lg | ❌ | |
| Norwegian | nb | `nb_core_news_{size}` | sm, md, lg | ❌ | |
| Polish | pl | `pl_core_news_{size}` | sm, md, lg | ❌ | |
| Portuguese | pt | `pt_core_news_{size}` | sm, md, lg | ❌ | |
| Romanian | ro | `ro_core_news_{size}` | sm, md, lg | ❌ | |
| Russian | ru | `ru_core_news_{size}` | sm, md, lg | ❌ | |
| Slovenian | sl | `sl_core_news_{size}` | sm, md, lg, trf | ✅ | |
| Spanish | es | `es_core_news_{size}` | sm, md, lg | TRF: `es_dep_news_trf` (dependency parsing only) | |
| Swedish | sv | `sv_core_news_{size}` | sm, md, lg | ❌ | |
| Ukrainian | uk | `uk_core_news_{size}` | sm, md, lg, trf | ✅ | |

- sm (Small): Fastest, minimal memory usage, good for basic NER
- md (Medium): Balanced performance and accuracy - Recommended for most use cases
- lg (Large): Higher accuracy, more memory intensive
- trf (Transformer): Highest accuracy but not recommended for geo-parsing due to limited availability and compatibility issues
⚠️ Important Note: While some languages have `trf` (transformer) models available, we do not recommend using the `trf` size for geographic entity recognition. German, French, and Spanish only have dependency-parsing transformer models (`xx_dep_news_trf`), which cannot perform the named entity recognition required for geo-parsing.
The service follows SpaCy's standard naming convention:
- English & Chinese: Use `xx_core_web_{size}` (web-trained models)
- All other languages: Use `xx_core_news_{size}` (news-trained models)

Where xx is the ISO 639-1 language code and {size} is one of: sm, md, lg, trf.
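
The naming convention above can be sketched as a small helper. This is only an illustration: the function name `spacy_model_name` and the constants are hypothetical, and the helper does not check that a model of that name is actually published (for example, German, French, and Spanish have no NER-capable `trf` model).

```python
# Hypothetical helper illustrating the naming convention; not taken from the service code.
WEB_TRAINED_LANGUAGES = {"en", "zh"}  # languages that use xx_core_web_{size}
VALID_SIZES = ("sm", "md", "lg", "trf")


def spacy_model_name(language: str, size: str = "md") -> str:
    """Build a model name such as 'en_core_web_md' from a language code and a size."""
    language = language.strip().lower()
    size = size.strip().lower()
    if size not in VALID_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    # English and Chinese are web-trained; every other language is news-trained.
    genre = "web" if language in WEB_TRAINED_LANGUAGES else "news"
    return f"{language}_core_{genre}_{size}"


if __name__ == "__main__":
    for lang in ("en", "de", "zh"):
        print(lang, "->", spacy_model_name(lang, "md"))
```
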
- Git
- Docker
- Docker Compose (usually included with Docker Desktop)
- A shell environment (like Bash, Zsh, PowerShell).
- (Optional) For GPU support:
  - NVIDIA GPU drivers
  - NVIDIA Container Toolkit

- Clone the Repository:

      git clone <repository-url>
      cd GeoParser-API

- Configure Environment: Create a `.env` file in the root of the project. You can copy the structure from the example below, or from an existing `.env` file if your source included one (it is often gitignored). A minimal `.env.example` would look like this:

      # GeoParser API Configuration

      # --------------------------------------------------------------------------
      # Model Configuration
      # --------------------------------------------------------------------------
      # Transformer model for embeddings (from Hugging Face Model Hub)
      TRANSFORMER_MODEL=dguzh/geo-all-MiniLM-L6-v2
      # Gazetteer to use (geonames is standard for geoparser)
      GAZETTEER=geonames
      # Available SpaCy model sizes (e.g., sm, md, lg, trf for transformer models)
      # The setup_models.sh script will try to download models for these sizes for each supported language.
      # The first size in this list will be used as the default if not specified in API calls.
      AVAILABLE_MODEL_SIZES=md,sm

      # --------------------------------------------------------------------------
      # Supported Languages
      # --------------------------------------------------------------------------
      # Comma-separated list of ISO 639-1 language codes (e.g., en, de, fr, zh, es)
      # The setup_models.sh script will download models for these languages.
      SUPPORTED_LANGUAGES=en,de

      # --------------------------------------------------------------------------
      # Model and Data Paths (within the Docker container)
      # These should generally not be changed unless you modify docker-compose.yml volume mounts.
      # --------------------------------------------------------------------------
      SPACY_MODEL_PATH=/app/models/spacy
      # Currently not used for pre-downloaded custom transformers, but reserved.
      TRANSFORMERS_MODEL_PATH=/app/models/transformers
      GEONAMES_DATA_PATH=/app/data/geonames

      # --------------------------------------------------------------------------
      # API Configuration
      # --------------------------------------------------------------------------
      # Maximum characters for input text
      MAX_TEXT_LENGTH=10000
      # Request timeout in seconds
      TIMEOUT=30
      # Enable/disable in-memory cache
      ENABLE_CACHE=true
      # Maximum items in a batch request
      MAX_BATCH_SIZE=100

      # --------------------------------------------------------------------------
      # Logging Configuration
      # --------------------------------------------------------------------------
      # DEBUG, INFO, WARNING, ERROR, CRITICAL
      LOG_LEVEL=INFO

      # --------------------------------------------------------------------------
      # Server Configuration (for Flask/Gunicorn)
      # --------------------------------------------------------------------------
      HOST=0.0.0.0
      PORT=5000
      # Set to true for Flask debug mode (not recommended for Gunicorn production)
      DEBUG=false
      # Gunicorn worker settings (see docker-compose.yml command for how these are used)
      WORKERS=2
      WORKER_TIMEOUT=600
      # sync, or 'gthread', 'eventlet', 'gevent' for async workers
      WORKER_CLASS=sync
      MAX_REQUESTS=1000
      MAX_REQUESTS_JITTER=100

      # --------------------------------------------------------------------------
      # GPU Configuration (Informational, actual GPU allocation is via Docker)
      # --------------------------------------------------------------------------
      # Specific GPU to use, if multiple are available
      CUDA_VISIBLE_DEVICES=0
      NVIDIA_VISIBLE_DEVICES=all
      NVIDIA_DRIVER_CAPABILITIES=compute,utility

      # --------------------------------------------------------------------------
      # Docker Resource Limits (Informational, actual limits are in docker-compose.yml)
      # --------------------------------------------------------------------------
      MEMORY_LIMIT=12G
      MEMORY_RESERVATION=6G
      CPU_LIMIT=2.0
      CPU_RESERVATION=1.0

  After creating or editing your `.env` file, make sure it reflects the languages and model sizes you intend to use. Key variables to customize initially:
  - `SUPPORTED_LANGUAGES`: Comma-separated list of languages to support (e.g., `en,de,fr,zh`).
  - `AVAILABLE_MODEL_SIZES`: Comma-separated list of SpaCy model sizes to make available (e.g., `sm,md,lg,trf`).
  - `PORT`: Port on which the API will be accessible.
  - `TRANSFORMER_MODEL`: The Hugging Face model to use for embeddings.
  - Paths for models and data (ensure these match your `docker-compose.yml` volumes if you customize them).

- Download Models and Data: This step is crucial. It downloads the required SpaCy language models and GeoNames data that the GeoParser service needs. The script uses the settings from your `.env` file to determine which models to fetch.

      bash setup_models.sh

  This script will:
  - Build a temporary Docker image.
  - Run a container to download SpaCy models for the languages and sizes specified in `.env`.
  - Download GeoNames data used by the `geoparser` library.
  - Place these assets into the local `./models` and `./data` directories, which will be mounted into the main service container.
  - Clean up the temporary Docker image.
  - Fix potential permission issues on the created directories.

  Ensure this script completes successfully. If you change `SUPPORTED_LANGUAGES` or `AVAILABLE_MODEL_SIZES` in `.env` later, you may need to re-run this script to download any new required models.

You have two options to run the GeoParser API:
Once the setup is complete, you can start the GeoParser API using Docker Compose:

    docker-compose up -d

- The `-d` flag runs the containers in detached mode.
- The service will be available at `http://localhost:<PORT>` (e.g., `http://localhost:5000` if `PORT=5000`).

For easier deployment, you can use the pre-built Docker image:

    # Pull the latest image
    docker pull realjensen/geoparser-api:latest

    # Run with docker-compose using pre-built image
    docker-compose -f docker-compose.prod.yml up -d

Or run directly with Docker:

    # Create required directories
    mkdir -p models data logs

    # Run the container (paths are quoted so directories containing spaces still work)
    docker run -d \
      --name geoparser-api \
      -p 5000:5000 \
      -v "$(pwd)/models:/app/models" \
      -v "$(pwd)/data:/app/data" \
      -v "$(pwd)/logs:/app/logs" \
      --env-file .env \
      realjensen/geoparser-api:latest

Note: You still need to download models and data using `bash setup_models.sh` before running the container.

To view the logs:

    docker-compose logs -f geoparser

To stop the application:

    docker-compose down

The GeoParser API is available as a pre-built Docker image on Docker Hub:

🐳 Docker Hub: realjensen/geoparser-api

Available tags:
- `latest`: Most recent stable version
- `v1.0`: Specific version tags

If you want to build and push your own version to Docker Hub:
- Login to Docker Hub:

      docker login

- Build and push using the provided script:

      ./build_and_push.sh [version] [username]

  Examples:

      # Push as latest with default username
      ./build_and_push.sh

      # Push specific version
      ./build_and_push.sh v1.1 your-username

- Manual build and push:

      # Build production image
      docker build -f Dockerfile.prod -t your-username/geoparser-api:latest .

      # Push to Docker Hub
      docker push your-username/geoparser-api:latest

The API provides several endpoints for interacting with the GeoParser service. All request and response bodies are in JSON format.
- Endpoint: `POST /api/parse`
- Description: Parses a single text string to extract geographic entities.
- Request Body:

      {
        "text": "I want to travel from Berlin to Paris next week.",
        "languages": ["en"],
        "model_size": "md"
      }

  - `languages` (optional): list of language codes (e.g., "en", "de"). The default is used if it is not provided or the model is not available.
  - `model_size` (optional): "sm", "md", "lg", or "trf". The default from `.env` is used if it is not provided.
- Example Request (`curl`):

      curl -X POST -H "Content-Type: application/json" \
        -d '{
          "text": "I want to travel from Berlin to Paris next week.",
          "languages": ["en"],
          "model_size": "md"
        }' \
        http://localhost:5000/api/parse

- Success Response (200 OK):

      {
        "success": true,
        "language_detected": "en",
        "model_used": "en_core_web_md",
        "text_length": 48,
        "locations_found": 2,
        "locations": [
          {
            "name": "Berlin",
            "geonameid": "2950159",
            "feature_type": "PPLC",
            "latitude": 52.52437,
            "longitude": 13.41053,
            "elevation": null,
            "population": 3426354,
            "admin2_name": null,
            "admin1_name": "Berlin",
            "country_name": "Germany"
          },
          {
            "name": "Paris",
            "geonameid": "2988507",
            "feature_type": "PPLC",
            "latitude": 48.85341,
            "longitude": 2.3488,
            "elevation": null,
            "population": 2138551,
            "admin2_name": null,
            "admin1_name": "Île-de-France",
            "country_name": "France"
          }
        ],
        "processing_time": 0.8523,
        "parse_time": 0.7998,
        "from_cache": false
      }

- Error Responses:
  - `400 Bad Request`: Invalid input (e.g., missing `text`, text too long).

        { "success": false, "error": "Text cannot be empty", "locations": [] }

  - `503 Service Unavailable`: If the GeoParserService is not initialized.

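
As a rough client-side sketch, the `POST /api/parse` call can be made from Python with only the standard library. The function names `build_parse_payload` and `parse_text` are hypothetical, and the base URL assumes the default `PORT=5000`. Note that `urllib` raises `urllib.error.HTTPError` for the 400 and 503 responses described above.

```python
import json
import urllib.request
from typing import Optional, Sequence


def build_parse_payload(text: str,
                        languages: Optional[Sequence[str]] = None,
                        model_size: Optional[str] = None) -> dict:
    """Assemble the request body; optional fields are omitted when not given."""
    payload = {"text": text}
    if languages:
        payload["languages"] = list(languages)
    if model_size:
        payload["model_size"] = model_size
    return payload


def parse_text(text: str,
               languages: Optional[Sequence[str]] = None,
               model_size: Optional[str] = None,
               base_url: str = "http://localhost:5000",  # assumed default port
               timeout: float = 30.0) -> dict:
    """POST the text to /api/parse and return the decoded JSON response."""
    body = json.dumps(build_parse_payload(text, languages, model_size)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/parse",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires a running service; prints the name and coordinates of each location.
    result = parse_text("I want to travel from Berlin to Paris next week.", ["en"], "md")
    for location in result.get("locations", []):
        print(location["name"], location["latitude"], location["longitude"])
```
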
- Endpoint: `POST /api/parse/batch`
- Description: Parses a list of text strings.
- Request Body:

      {
        "texts": [
          {
            "id": "doc1",
            "text": "London is the capital of the United Kingdom.",
            "languages": ["en"]
          },
          {
            "id": "doc2",
            "text": "Ich fahre nach München.",
            "languages": ["de"]
          }
        ],
        "model_size": "md"
      }

  - `id` (optional): user-defined identifier for the text.
  - `languages` (optional): per-item language list.
  - `model_size` (optional): applies to all texts in the batch. A per-item `model_size` override is not explicitly shown in `service.py`; treat it as a possible future extension.
- Example Request (`curl`):

      curl -X POST -H "Content-Type: application/json" \
        -d '{
          "texts": [
            {"id": "doc1", "text": "London is the capital of the United Kingdom.", "languages": ["en"]},
            {"id": "doc2", "text": "Ich fahre nach München.", "languages": ["de"]}
          ],
          "model_size": "md"
        }' \
        http://localhost:5000/api/parse/batch

- Success Response (200 OK):

      {
        "success": true,
        "total_processed": 2,
        "successful_parses": 2,
        "failed_parses": 0,
        "results": [
          {
            "id": "doc1", // Included if provided in request
            "success": true,
            "language_detected": "en",
            // ... other fields similar to /api/parse response
            "locations": [ /* ... */ ]
          },
          {
            "id": "doc2",
            "success": true,
            "language_detected": "de",
            // ... other fields
            "locations": [ /* ... */ ]
          }
        ]
      }

- Error Responses:
  - `400 Bad Request`: Invalid input (e.g., `texts` not a list, batch size exceeded).

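
Because a batch that exceeds `MAX_BATCH_SIZE` is rejected with a 400, a client with many documents may want to split them into several requests. The sketch below shows one way to do that; the helper names are hypothetical and the limit of 100 simply mirrors the example `.env` value.

```python
from typing import Dict, List, Optional


def chunk_items(items: List[dict], max_batch_size: int = 100) -> List[List[dict]]:
    """Split items into consecutive chunks of at most max_batch_size elements."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [items[i:i + max_batch_size] for i in range(0, len(items), max_batch_size)]


def build_batch_payloads(texts: Dict[str, str],
                         model_size: Optional[str] = None,
                         max_batch_size: int = 100) -> List[dict]:
    """Turn an {id: text} mapping into request bodies for POST /api/parse/batch."""
    items = [{"id": doc_id, "text": text} for doc_id, text in texts.items()]
    payloads = []
    for chunk in chunk_items(items, max_batch_size):
        payload = {"texts": chunk}
        if model_size:
            payload["model_size"] = model_size
        payloads.append(payload)
    return payloads


if __name__ == "__main__":
    documents = {f"doc{i}": f"Sample text number {i}." for i in range(250)}
    for payload in build_batch_payloads(documents, model_size="md"):
        print(len(payload["texts"]), "texts in this request")
```

Each payload can then be sent as its own request, and the `id` values let the results be matched back to the source documents.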
- Endpoint: `GET /api/info`
- Description: Provides information about the loaded models and service configuration.
- Example Request (`curl`):

      curl http://localhost:5000/api/info

- Success Response (200 OK):

      {
        "success": true,
        "info": {
          "loaded_models": ["en", "de"], // Actual loaded language codes
          "default_model_size": "md",
          "transformer_model": "dguzh/geo-all-MiniLM-L6-v2",
          "gazetteer": "geonames",
          "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
          "cache_enabled": true,
          "cache_size": 10,
          "max_text_length": 10000,
          "max_batch_size": 100
        }
      }

- Error Responses:
  - `503 Service Unavailable`: If the GeoParserService is not initialized.

- Endpoint: `GET /api/health`
- Description: Checks the health of the service. Used by Docker for container health monitoring.
- Example Request (`curl`):

      curl http://localhost:5000/api/health

- Success Response (200 OK):

      {
        "status": "healthy",
        "models_loaded": 2,
        "test_parse_success": true,
        "config_valid": true
      }

- Error Response (503 Service Unavailable):

      { "status": "unhealthy", "error": "GeoParser service is not available" }

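
Since models can take a while to load after the container starts, a deployment script may want to poll the health endpoint before sending traffic. The following is a minimal sketch under stated assumptions: the function names, retry count, and delay are hypothetical, and only the `status` field documented above is inspected.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_health(base_url: str = "http://localhost:5000", timeout: float = 5.0) -> dict:
    """Return the JSON body of GET /api/health, including for 503 responses."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        # A 503 still carries a JSON body describing the problem.
        try:
            return json.load(error)
        except ValueError:
            return {"status": "unhealthy", "error": str(error)}
    except OSError as error:
        # Connection refused, DNS failure, timeout: the service is not reachable yet.
        return {"status": "unreachable", "error": str(error)}


def wait_until_healthy(fetch: Callable[[], dict] = fetch_health,
                       attempts: int = 30,
                       delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the service reports status 'healthy' or the attempts run out."""
    for attempt in range(attempts):
        if fetch().get("status") == "healthy":
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "service did not become healthy")
```

Passing `fetch` and `sleep` as parameters keeps the polling loop testable without a running service.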
- Endpoint: `POST /api/cache/clear`
- Description: Clears the in-memory cache of the GeoParserService.
- Example Request (`curl`):

      curl -X POST http://localhost:5000/api/cache/clear

- Success Response (200 OK):

      { "success": true, "message": "Cache cleared successfully. Removed 10 entries." }

- Response if caching is disabled (200 OK but indicates no action):

      {
        "success": false, // Or true with a different message
        "message": "Caching is not enabled. No cache to clear."
      }

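
To make the caching behaviour more concrete, here is an illustrative in-memory cache keyed by the request parameters. This is not the service's implementation, which the documentation does not describe in detail: the class name `ParseCache`, the least-recently-used eviction policy, and the `max_entries` limit are all assumptions made for the sketch.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Optional, Sequence


class ParseCache:
    """Small LRU cache for parse results (illustrative only)."""

    def __init__(self, max_entries: int = 1000):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    @staticmethod
    def make_key(text: str,
                 languages: Optional[Sequence[str]] = None,
                 model_size: Optional[str] = None) -> str:
        """Hash everything that influences the result, so different options never collide."""
        raw = json.dumps(
            {"text": text, "languages": list(languages or []), "model_size": model_size},
            sort_keys=True,
            ensure_ascii=False,
        )
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key: str) -> Optional[dict]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key: str, value: dict) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def clear(self) -> int:
        """Empty the cache and return how many entries were removed."""
        removed = len(self._entries)
        self._entries.clear()
        return removed


if __name__ == "__main__":
    cache = ParseCache(max_entries=2)
    key = cache.make_key("Berlin is nice.", ["en"], "md")
    cache.put(key, {"locations_found": 1})
    print(cache.get(key), cache.clear())
```
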
- Endpoint: `GET /api/languages`
- Description: Returns the list of languages and model sizes supported by the current configuration.
- Example Request (`curl`):

      curl http://localhost:5000/api/languages

- Success Response (200 OK):

      {
        "success": true,
        "supported_languages": ["en", "de", "fr", "zh", "es"], // From .env
        "default_model_size": "md", // From .env
        "available_model_sizes": ["sm", "md", "lg", "trf"] // From .env
      }

- Endpoint: `GET /`
- Description: Provides basic service information and a list of available endpoints.
- Example Request (`curl`):

      curl http://localhost:5000/

- Success Response (200 OK):

      {
        "service": "GeoParser API",
        "version": "1.0.0",
        "status": "running",
        "endpoints": {
          "parse": "/api/parse",
          "batch_parse": "/api/parse/batch",
          "info": "/api/info",
          "health": "/api/health",
          "clear_cache": "/api/cache/clear",
          "languages": "/api/languages"
        },
        "documentation": "https://github.com/Jensen-JZ/GeoParser-API"
      }

The application is configured primarily through the .env file. Some key options include:
- `TRANSFORMER_MODEL`: Specifies the Hugging Face transformer model for embeddings.
- `GAZETTEER`: The gazetteer to use (default: `geonames`).
- `AVAILABLE_MODEL_SIZES`: Comma-separated list of SpaCy model sizes (e.g., `sm,md,lg,trf`).
- `SUPPORTED_LANGUAGES`: Comma-separated list of ISO language codes (e.g., `en,de,fr,zh,es`).
- `SPACY_MODEL_PATH`, `TRANSFORMERS_MODEL_PATH`, `GEONAMES_DATA_PATH`: Paths within the container where models and data are stored. These are typically managed by `docker-compose.yml` volumes and the `setup_models.sh` script.
- `MAX_TEXT_LENGTH`: Maximum characters allowed for input text.
- `TIMEOUT`: Request timeout.
- `ENABLE_CACHE`: Set to `true` to enable in-memory caching.
- `MAX_BATCH_SIZE`: Maximum number of texts allowed in a batch request.
- `LOG_LEVEL`: Logging level (e.g., `INFO`, `DEBUG`).
- `HOST`, `PORT`: Server host and port.
- `WORKERS`, `WORKER_TIMEOUT`, etc.: Gunicorn worker configuration.
- `MEMORY_LIMIT`, `CPU_LIMIT`: Docker resource limits.

Refer to the .env file and app/config.py for a complete list of configurations.
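
As a rough illustration of how these variables might be read, the sketch below parses the comma-separated lists and treats the first entry of `AVAILABLE_MODEL_SIZES` as the default size, as described in the `.env` comments. It is not the contents of `app/config.py`: the function names and fallback defaults are assumptions, and values are expected to carry no trailing inline comments.

```python
import os
from typing import List, Mapping, Optional


def parse_csv(value: str) -> List[str]:
    """Split a comma-separated value and drop empty or whitespace-only items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read a subset of the documented variables from the environment."""
    env = os.environ if env is None else env
    sizes = parse_csv(env.get("AVAILABLE_MODEL_SIZES", "md"))
    if not sizes:
        raise ValueError("AVAILABLE_MODEL_SIZES must list at least one size")
    return {
        "available_model_sizes": sizes,
        "default_model_size": sizes[0],  # the first listed size is the default
        "supported_languages": parse_csv(env.get("SUPPORTED_LANGUAGES", "en")),
        "max_text_length": int(env.get("MAX_TEXT_LENGTH", "10000")),
        "max_batch_size": int(env.get("MAX_BATCH_SIZE", "100")),
        "enable_cache": env.get("ENABLE_CACHE", "true").strip().lower() == "true",
    }


if __name__ == "__main__":
    print(load_settings({"AVAILABLE_MODEL_SIZES": "md,sm", "SUPPORTED_LANGUAGES": "en,de"}))
```
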
The service is configured to support NVIDIA GPUs for faster model inference. To enable GPU support:
- Ensure you have NVIDIA drivers installed on the host machine.
- Install the NVIDIA Container Toolkit on the host machine.
- The `docker-compose.yml` file includes the necessary `runtime: nvidia` configuration:

      services:
        geoparser:
          # ... other configurations
          runtime: nvidia
          deploy:
            resources:
              reservations:
                devices:
                  - driver: nvidia
                    count: 1 # Or 'all'
                    capabilities: [gpu]

  Note: `runtime: nvidia` requires the NVIDIA runtime to be registered with the Docker daemon, while `deploy.resources.reservations.devices` is the device-reservation syntax defined by the Compose specification. The `.env` file also contains GPU-related environment variables like `CUDA_VISIBLE_DEVICES`.

If the NVIDIA runtime is correctly configured, PyTorch (a dependency of Transformers) should automatically detect and use available GPUs.
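
One quick way to confirm that the GPU is visible is to ask PyTorch from inside the container. The snippet below is a small diagnostic sketch, not part of the service; the function name `describe_device` is hypothetical, and it falls back gracefully when PyTorch is not installed.

```python
def describe_device() -> str:
    """Report whether PyTorch can see a CUDA device in the current environment."""
    try:
        import torch  # imported lazily so the check also runs where PyTorch is absent
    except ImportError:
        return "cpu (PyTorch is not installed)"
    if torch.cuda.is_available():
        # Index 0 is the first device left visible by CUDA_VISIBLE_DEVICES.
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu"


if __name__ == "__main__":
    print("Inference device:", describe_device())
```
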
- Model Download Issues (`setup_models.sh`):
  - Ensure you have a stable internet connection.
  - Check for typos in `SUPPORTED_LANGUAGES` or `AVAILABLE_MODEL_SIZES` in your `.env` file. SpaCy model names are specific (e.g., `en_core_web_sm`, `de_core_news_md`). The script attempts to derive these.
  - If a specific model fails, try downloading it manually with `python -m spacy download <model_name>` inside a Python environment with `spacy` installed to see more detailed errors.
- Port Conflicts: If another service is using the specified `PORT` (default 5000), change it in `.env` and restart the containers.
- Docker Permission Issues:
  - The `setup_models.sh` script attempts to fix permissions for the `./models` and `./data` directories.
  - If you encounter permission errors when Docker tries to write to mounted volumes, ensure the user running Docker has write access to these directories on the host or run `sudo chown -R $(whoami):$(whoami) models/ data/ logs/` (be cautious with `sudo`).
- Service Fails to Start (Check Logs):

      docker-compose logs -f geoparser

  - Look for errors related to model loading (e.g., "No models were successfully loaded") or Python package issues.
  - Ensure all models listed by `SUPPORTED_LANGUAGES` and `DEFAULT_MODEL_SIZE` (first of `AVAILABLE_MODEL_SIZES`) were successfully downloaded by `setup_models.sh`.
- `CUDA_ERROR_NO_DEVICE` or similar GPU errors:
  - Verify NVIDIA drivers and NVIDIA Container Toolkit are correctly installed and configured on the host.
  - Ensure `runtime: nvidia` is set in `docker-compose.yml`.
  - Check `CUDA_VISIBLE_DEVICES` in `.env`.

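
For the port-conflict case above, a quick check before starting the containers is to try binding the port yourself. This is a generic sketch using the standard library, not a tool shipped with the project; the function name `port_is_free` is hypothetical and 5000 is the default `PORT` from the example `.env`.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


if __name__ == "__main__":
    port = 5000
    state = "free" if port_is_free(port) else "already in use"
    print(f"Port {port} is {state}")
```

If the port is reported as in use, pick another value for `PORT` in `.env` and restart the containers.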
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs, feature requests, or improvements.
This project is licensed under the MIT License. See the LICENSE file for details.