Open
33 commits
e8ab070
harden GCP Cloud Run: Cloud SQL lockdown and Secret Manager credentials
jivanb7 May 31, 2026
511d858
secure FastAPI and Feast database credentials
jivanb7 May 31, 2026
8646ad9
bigquery: day-partition growth tables and allow clean destroy
jivanb7 May 31, 2026
8552b79
teardown: narrow the auto-teardown service account privileges
jivanb7 May 31, 2026
1395ff4
cli: deploy/destroy lifecycle robustness and minikube/GKE commands
jivanb7 May 31, 2026
17e1f5e
kubernetes: MLflow persistence, namespace isolation, GKE LB targeting
jivanb7 May 31, 2026
1092e39
docker build and FastAPI app robustness
jivanb7 May 31, 2026
9b6e678
docs: rewrite for the GCP Cloud Run path
jivanb7 May 31, 2026
a2749d2
example: guard env vars with actionable errors
jivanb7 May 31, 2026
d640b21
add unit tests and packaging config
jivanb7 May 31, 2026
83ec86d
fix #54: preflight gcloud auth and ADC in deploy
jivanb7 May 31, 2026
44cf09a
fix #53: validate config.stack shape to avoid set/non-dict crash
jivanb7 May 31, 2026
5ce4830
docs: document new CLI flags and teardown behaviors
jivanb7 May 31, 2026
1eefb2f
docs: correct destroy note in README and document the Kubernetes paths
jivanb7 May 31, 2026
be6a597
docs: surface minikube and GKE across the site overview pages
jivanb7 May 31, 2026
cdcaf9f
add platform_compat module for cross platform tool execution
May 31, 2026
1fd28a6
route external tool calls through run_tool (blocker 1)
May 31, 2026
d5f8ed3
force UTF-8 console output at CLI entry (blocker 2)
May 31, 2026
a875839
run Cloud SQL readiness provisioner under bash (blocker 3)
May 31, 2026
61aa842
use robust_rmtree for destroy workspace cleanup (blocker 4)
May 31, 2026
5c1f872
warn when gke-gcloud-auth-plugin is missing (blocker 5)
May 31, 2026
6bb3784
docs: native Windows setup and platform notes
May 31, 2026
1407ad0
prefer Git bash over WSL bash for terraform local-exec (blocker 3)
May 31, 2026
26a3d18
decode subprocess output as UTF-8 on Windows (blocker 2 read side)
Jun 1, 2026
8099dec
docs: minikube tunnel, memory, and gcloud component notes for Windows
Jun 1, 2026
7496629
remove duplicate subprocess and shutil imports in helpers
jivanb7 Jun 1, 2026
2cc4c30
skip docker permission check in doctor when docker is absent
jivanb7 Jun 1, 2026
51d690c
merge duplicate markdown_extensions key in mkdocs config
jivanb7 Jun 1, 2026
877d045
add unit tests for platform_compat
jivanb7 Jun 1, 2026
1bdd7bb
clean up orphaned PVC disk on gke-destroy --delete-cluster
jivanb7 Jun 1, 2026
b13e8fe
warn when gke-init image push fails instead of silently producing a b…
Jun 1, 2026
531e6cf
Merge pull request #58 from deployml-core/feat/windows_compat
jivanb7 Jun 1, 2026
85ede3c
Merge dev into feat/gcp: adopt the generate-command removal (PR #60) …
jivanb7 Jun 1, 2026
44 changes: 35 additions & 9 deletions .gitattributes
@@ -1,21 +1,47 @@
# Count Python and HCL only
*.py linguist-detectable=true
# Default: normalize line endings on commit, native on checkout
* text=auto

# Ignore notebooks
*.ipynb linguist-documentation
# Shell scripts MUST stay LF so they run in Linux containers.
# Without this, a Windows clone produces CRLF which causes
# "exec format error" inside Docker images.
*.sh text eol=lf
*.bash text eol=lf

# Dockerfile content and configs that may run in Linux
Dockerfile text eol=lf
*.dockerfile text eol=lf
.dockerignore text eol=lf

# Templates rendered into Terraform / Kubernetes / YAML
# also need stable LF since they may be consumed by Linux tools.
*.tf text eol=lf
*.tf.j2 text eol=lf
*.tfvars text eol=lf
*.yaml text eol=lf
*.yml text eol=lf
*.j2 text eol=lf
*.tpl text eol=lf
*.json text eol=lf

# Python and HCL source: native auto handling
*.py text=auto

# Ignore configs and data
# Linguist hints for repo language stats
*.py linguist-detectable=true
*.ipynb linguist-documentation
*.yaml linguist-documentation
*.yml linguist-documentation
*.json linguist-documentation
*.csv linguist-documentation
*.parquet linguist-documentation

# Ignore templates
*.jinja linguist-documentation
*.j2 linguist-documentation
*.tpl linguist-documentation
*.tf linguist-documentation

# Ignore misc
*.md linguist-documentation

# Binary files
*.png binary
*.jpg binary
*.gif binary
*.parquet binary
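
To illustrate what the `eol=lf` rules above guard against, here is a minimal Python sketch that detects and normalizes CRLF line endings in file content. The helper names `has_crlf` and `normalize_to_lf` and the sample bytes are hypothetical; nothing like this is part of the PR, and Git performs the real normalization itself once the attributes are in place.

```python
def has_crlf(data: bytes) -> bool:
    """Return True if the byte content contains at least one CRLF line ending."""
    return b"\r\n" in data


def normalize_to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF; lines that already end in LF are unchanged."""
    return data.replace(b"\r\n", b"\n")


# Hypothetical shell script content as it might look after a Windows checkout
# without the eol=lf attribute.
windows_script = b"#!/bin/sh\r\necho hello\r\n"
linux_script = normalize_to_lf(windows_script)
```

A check like this could run in CI over `*.sh` files as a safety net, under the assumption that the repository wants LF everywhere the attributes say so.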
29 changes: 22 additions & 7 deletions README.md
@@ -9,7 +9,7 @@ A CLI tool that deploys a production MLOps stack on GCP with a single command. B
- **Grafana** — monitoring dashboard connected to your metrics database
- **BigQuery** — `mlops` dataset with tables for features, predictions, ground truth, and drift metrics

All running on GCP Cloud Run — no servers to manage, scales to zero when idle.
All running on GCP Cloud Run. No servers to manage. Cloud Run services scale to zero when idle. Cloud SQL and BigQuery storage incur baseline cost. See Costs below.

## Quick Start

@@ -22,7 +22,7 @@ pip install deployml-core
**2. Initialize your GCP project** (enables APIs, creates Artifact Registry)

```bash
deployml init --provider gcp --project-id YOUR_PROJECT_ID
deployml init --provider gcp --project-id YOUR_GCP_PROJECT_ID
```

**3. Configure**
@@ -78,15 +78,30 @@ See [example/README.md](example/README.md) for details.
deployml destroy
```

Deletes all Cloud Run services, Cloud SQL, GCS bucket, and BigQuery dataset. Does not delete Artifact Registry images or the GCP project.
Deletes all Cloud Run services, Cloud SQL, the GCS bucket, and the BigQuery dataset, and also removes the Artifact Registry repo and the Cloud Build staging bucket that `build-images` created, so a destroyed project leaves no billing residue. Does not delete the GCP project itself.

## Full Tutorial

See [docs/tutorials/gcp-cloud-run.md](docs/tutorials/gcp-cloud-run.md) for a step-by-step walkthrough.

## Other deployment targets

Cloud Run is the primary, fully supported path. The CLI also supports Kubernetes for users who want a cluster:

- **Local minikube**, for testing without GCP: `mlflow-init` and `mlflow-deploy`, or `minikube-init` and `minikube-deploy`.
- **GKE** on GCP: `gke-cluster-create`, `gke-init`, then `gke-deploy` or `gke-apply`, torn down with `gke-destroy`.

MLflow keeps its data on a PersistentVolumeClaim in both, so experiments survive pod restarts. See [CLI Commands](docs/api/cli-commands.md) and the [GKE flow notes](docs/tutorials/gcp-cloud-run.md#gke-flow-notes).

## Requirements

- Python 3.10+
- `gcloud` CLI (authenticated)
- Docker (running)
- Terraform
- Python 3.11 or newer
- `gcloud` CLI, authenticated with `gcloud auth login`, `gcloud auth application-default login`, and `gcloud auth configure-docker us-west1-docker.pkg.dev`
- Docker, running
- Terraform 1.0 or newer

Run `deployml doctor --project-id YOUR_GCP_PROJECT_ID` to verify auth, ADC, tool versions, enabled APIs, and IAM roles on your project.

## Costs

Cloud Run scales to zero when idle. Cloud SQL Postgres and BigQuery storage do not. Expect roughly $30 to $80 per month while the stack is up. MLflow runs with `min_instances = 1` by default for snappy UI, which adds about $5 per month. Set `min_instances` to 0 if you want zero idle cost in exchange for cold starts. Always run `deployml destroy` when done.
1 change: 1 addition & 0 deletions config.example.yaml
@@ -3,6 +3,7 @@ provider:
  name: gcp
  project_id: YOUR_GCP_PROJECT_ID
  region: us-west1
  image_tag: v0.0.42 # pinned. Override to your build tag if you rebuild images.
deployment:
  type: cloud_run
stack:
120 changes: 93 additions & 27 deletions docs/api/cli-commands.md
@@ -1,88 +1,154 @@
# CLI Commands Reference

Commands you use most live at the top. Advanced and experimental commands are listed at the bottom.

## `deployml doctor`

Check that all required local tools are installed and authenticated.
Check that required tools are installed and authenticated. Optionally checks enabled APIs and IAM roles on a project.

```bash
deployml doctor
deployml doctor --project-id YOUR_GCP_PROJECT_ID
```

Run this before anything else. Checks for `gcloud`, `docker`, `terraform`, and `bq`.
**Options:**
- `--project-id`, `-j`: GCP project to probe for enabled APIs and IAM role coverage.

The doctor checks: tool versions for docker, terraform, gcloud, bq; gcloud auth and Application Default Credentials; Infracost (optional); and for a given project, required APIs and IAM roles.

---

## `deployml init`

Enable required GCP APIs for a project and create the Artifact Registry repository. Run once per project.
Enable required GCP APIs for a project and create a local `docker/` folder plus a runnable `config.yaml` starter. Run once per project.

```bash
deployml init --provider gcp --project-id YOUR_PROJECT_ID
deployml init --provider gcp --project-id YOUR_GCP_PROJECT_ID
```

**Options:**
- `--provider`, `-p`: Cloud provider — currently `gcp`
- `--project-id`, `-j`: GCP project ID
- `--provider`, `-p`: Cloud provider. Currently `gcp` is fully supported. `aws` and `azure` write skeleton configs.
- `--project-id`, `-j`: GCP project ID.
- `--path`: Directory where the project is initialized. Defaults to current directory.
- `--overwrite`: Overwrite an existing `docker/` folder or `config.yaml`.

The generated `config.yaml` includes a `provider.image_tag` field set to the deployml version. Override this to pin to a specific build tag.
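
The tag precedence this reference describes for `build-images` (an explicit `--tag`, then `provider.image_tag` from the config, then `v{deployml_version}`) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function name `resolve_image_tag` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_image_tag(cli_tag: Optional[str], config: dict, deployml_version: str) -> str:
    """Pick an image tag: CLI flag first, then config, then the version default."""
    # An explicit --tag value always wins.
    if cli_tag:
        return cli_tag

    # Otherwise read provider.image_tag, tolerating a missing or malformed provider block.
    provider = config.get("provider")
    if isinstance(provider, dict) and provider.get("image_tag"):
        return str(provider["image_tag"])

    # Fall back to the installed deployml version, prefixed with "v".
    return f"v{deployml_version}"
```

For example, with `provider.image_tag: v0.0.42` in the config and no flag, the sketch returns `v0.0.42`; with an empty config and version `1.2.3` it returns `v1.2.3`.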

---

## `deployml build-images`

Build Docker images and push them to GCP Artifact Registry. Reads project ID and region from `config.yaml` by default.
Build Docker images and push them to GCP Artifact Registry. Reads project ID, region, and image_tag from `config.yaml` by default.

```bash
deployml build-images
deployml build-images --create-repo
```

**Options:**
- `--config-path`, `-c`: Path to config YAML file (default: `config.yaml`)
- `--docker-root`, `-d`: Path to folder containing Dockerfiles (default: built-in package images)
- `--gcp-project`, `-p`: GCP project ID (default: inferred from config)
- `--region`: GCP region (default: inferred from config)
- `--repository`: Artifact Registry repository name (default: `mlops-images`)
- `--tag`: Image tag (default: `latest`)
- `--create-repo`: Create the Artifact Registry repository if it does not exist
- `--config-path`, `-c`: Path to config YAML file. Default `config.yaml`.
- `--docker-root`, `-d`: Folder containing subfolders with Dockerfiles. Default is the built-in deployml docker directory.
- `--gcp-project`, `-p`: GCP project ID. Default inferred from config.
- `--region`: GCP region. Default inferred from config.
- `--repository`: Artifact Registry repository name. Default `mlops-images`.
- `--tag`, `-t`: Image tag. Default reads `config.provider.image_tag`, falls back to `v{deployml_version}`.
- `--create-repo`: Create the Artifact Registry repository on first run.
- `--dry-run`: Print commands without executing.
- `--platform`: Local build platform. Defaults to the host architecture so images run on a local minikube node, including arm64 Macs. Pass `linux/amd64` only when building locally for a manual amd64 push. Ignored in GCP Cloud Build mode.

Builds run on Cloud Build, so a local Docker daemon is not required for GCP mode. In local mode a daemon probe runs first and the build targets the host architecture.
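
As a rough illustration of "targets the host architecture", the sketch below maps the machine name reported by Python to a Docker platform string. The mapping table and the function name `default_build_platform` are assumptions made for this example; the actual selection logic in deployml may differ.

```python
import platform
from typing import Optional

# Common machine names and the Docker platform they correspond to.
_ARCH_TO_DOCKER = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}


def default_build_platform(machine: Optional[str] = None) -> str:
    """Return a Docker --platform value for the given (or detected) machine name."""
    name = (machine or platform.machine()).lower()
    if name not in _ARCH_TO_DOCKER:
        raise ValueError(f"unrecognized host architecture: {name!r}")
    return _ARCH_TO_DOCKER[name]
```

On an Apple Silicon Mac, where `platform.machine()` typically reports `arm64`, this yields `linux/arm64`, which is what a local minikube node on that host would run.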

---

## `deployml deploy`

Deploy infrastructure from a YAML config file.
Deploy infrastructure from a YAML config file. Prompts for confirmation by default.

```bash
deployml deploy --verbose
deployml deploy --verbose --yes
```

**Options:**
- `--config-path`, `-c`: Path to config YAML file (default: `config.yaml`)
- `--verbose`, `-v`: Stream Terraform logs instead of showing a progress bar
- `--yes`, `-y`: Skip confirmation prompts
- `--config-path`, `-c`: Path to config YAML. Default `config.yaml`.
- `--verbose`, `-v`: Stream Terraform logs to stdout. Without this you get a progress bar.
- `--yes`, `-y`: Skip the `[y/N]` deploy confirmation. Required for non-interactive scripts.
- `--generate-only`, `-g`: For the GKE flow, render the Kubernetes manifests without applying them, so you can review and then apply with `gke-apply`.

Before any cloud work, deploy validates the config shape and, for GCP, preflights both gcloud auth and Application Default Credentials. A missing ADC fails fast with `gcloud auth application-default login` guidance instead of an opaque "default credentials not found" at apply. A malformed `stack` entry also fails here with a clear message rather than crashing mid-deploy.

First-time deploy takes about 20 minutes because Cloud SQL Postgres provisioning is slow.
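
A preflight of the kind described above might look like the following sketch. It is not the PR's implementation: the function name `preflight_gcp_auth` and the injectable `run` parameter are hypothetical, and the sketch skips the Windows-specific tool resolution that this PR routes through `run_tool`. The two `gcloud` invocations are standard gcloud CLI commands.

```python
import subprocess
from typing import Callable, List, Sequence


def _run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    """Run a command, capturing text output and never raising on a non-zero exit."""
    return subprocess.run(list(cmd), capture_output=True, text=True, check=False)


def preflight_gcp_auth(run: Callable[[Sequence[str]], subprocess.CompletedProcess] = _run) -> List[str]:
    """Return actionable problems with gcloud auth and ADC; an empty list means OK."""
    problems: List[str] = []

    # 1. Is there an active gcloud account?
    try:
        auth = run(["gcloud", "auth", "list",
                    "--filter=status:ACTIVE", "--format=value(account)"])
    except FileNotFoundError:
        return ["gcloud not found on PATH. Install the Google Cloud SDK."]
    if auth.returncode != 0 or not auth.stdout.strip():
        problems.append("No active gcloud account. Run: gcloud auth login")

    # 2. Are Application Default Credentials usable?
    adc = run(["gcloud", "auth", "application-default", "print-access-token"])
    if adc.returncode != 0:
        problems.append(
            "Application Default Credentials missing. "
            "Run: gcloud auth application-default login"
        )
    return problems
```

Returning a list rather than raising on the first failure lets the caller print every missing step at once, which fits the fail-fast-with-guidance behavior the paragraph describes.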

---

## `deployml get-urls`

Print service URLs from the last deployment and write them to a `.env` file.
Print service URLs from the last deployment and write them to a `.env` file. Database credentials are masked.

```bash
deployml get-urls
deployml get-urls --show-secrets
```

**Options:**
- `--config-path`, `-c`: Path to config YAML file (default: `config.yaml`)
- `--env-path`: Where to write the `.env` file (default: `.env`)
- `--config-path`, `-c`: Path to config YAML. Default `config.yaml`.
- `--env-path`: Where to write the `.env`. Default `.env`.
- `--show-secrets`: Additionally fetch and print the Grafana admin password and the Cloud SQL Auth Proxy connection command.
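
The reference does not say how credentials are masked. One plausible approach, sketched here purely as an assumption, is to replace the password portion of a database URL before printing it. The function name `mask_db_url` and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def mask_db_url(url: str, mask: str = "****") -> str:
    """Replace the password in a user:password@host URL with a mask string."""
    parts = urlsplit(url)

    # Split "user:password@host:port" into userinfo and host parts.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    user, colon, _password = userinfo.partition(":")

    # No userinfo or no password present: nothing to mask.
    if not at or not colon:
        return url

    masked_netloc = f"{user}:{mask}@{hostport}"
    return urlunsplit(parts._replace(netloc=masked_netloc))
```

Splitting on the last `@` keeps the host and port exactly as written, so only the secret changes in the printed output.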

---

## `deployml destroy`

Tear down all infrastructure for a given config.
Tear down all infrastructure for a given config. Also removes the Artifact Registry repo and the Cloud Build staging bucket created by build-images, so a destroyed project leaves no billing residue.

```bash
deployml destroy
deployml destroy --yes
```

**Options:**
- `--config-path`, `-c`: Path to config YAML file (default: `config.yaml`)
- `--clean-workspace`: Delete the local workspace folder after destroy
- `--yes`, `-y`: Skip confirmation prompts
- `--config-path`, `-c`: Path to config YAML. Default `config.yaml`.
- `--clean-workspace`: Remove the local `.deployml/` workspace folder after destroy.
- `--yes`, `-y`: Skip both the destroy confirmation and the Terraform state cleanup prompt.
- `--workspace`: Override the workspace name from config.

On partial failure, Terraform state is preserved and the command prints recovery instructions including `gcloud asset search-all-resources` for finding residual resources.

---

## Config file reference

Top-level fields used by deploy and destroy:

```yaml
name: string              # workspace name. defaults to "default" if omitted
provider:
  name: gcp | aws | azure
  project_id: string      # required for gcp
  region: string          # required for gcp
  image_tag: string       # optional. default v{deployml_version}
deployment:
  type: cloud_run | cloud_vm | gke   # required
stack:
  - <stage_name>:
      name: mlflow | fastapi | grafana | feast | cron
      params: {}          # tool-specific
```

Supported stage names: `experiment_tracking`, `artifact_tracking`, `model_registry`, `model_serving`, `model_monitoring`, `feature_store`, `workflow_orchestration`.
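
The shape check that `deploy` performs on `stack` could be approximated by the sketch below. It assumes, based on the reference above, that each stack entry is a mapping with exactly one stage name whose value carries a `name` field. The function name `validate_stack` and the error wording are hypothetical and not taken from the PR.

```python
SUPPORTED_STAGES = {
    "experiment_tracking",
    "artifact_tracking",
    "model_registry",
    "model_serving",
    "model_monitoring",
    "feature_store",
    "workflow_orchestration",
}


def validate_stack(config: dict) -> list:
    """Return human-readable problems with config['stack']; empty means valid."""
    stack = config.get("stack")
    if not isinstance(stack, list):
        return ["'stack' must be a list of single-key mappings"]

    problems = []
    for index, entry in enumerate(stack):
        # Guard against sets, strings, and other non-dict entries up front.
        if not isinstance(entry, dict) or len(entry) != 1:
            problems.append(
                f"stack[{index}] must be a mapping with exactly one stage name, "
                f"got {type(entry).__name__}"
            )
            continue

        stage, body = next(iter(entry.items()))
        if stage not in SUPPORTED_STAGES:
            problems.append(f"stack[{index}]: unknown stage '{stage}'")
        if not isinstance(body, dict) or "name" not in body:
            problems.append(f"stack[{index}].{stage} must be a mapping with a 'name' field")
    return problems
```

Running a check like this before any Terraform work is what turns a mid-deploy crash on a malformed entry into an early, readable error.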

---

## Advanced and experimental commands

These exist in the CLI but are not part of the documented happy path. Use at your own risk and inspect the source.

- `deployml generate`: Interactive YAML generator. Less useful since `init` now writes a runnable config. Pass `--force`, `-f` to overwrite an existing `config.yaml` without the confirm prompt.
- `deployml status`: Stub. Reports deployment status of current workspace.
- `deployml terraform`: Run raw terraform actions (plan, apply, destroy) on a rendered workspace.
- `deployml teardown`: Manage scheduled auto-teardown jobs.
- `deployml vm`: Placeholder for VM deployment.
- `deployml mlflow-init`, `deployml mlflow-deploy`: MLflow-only minikube flow. `mlflow-init` provisions a PersistentVolumeClaim by default with `--persistent-storage` so the sqlite backend and artifacts survive pod restarts, with `--pvc-size` to size it and `--ephemeral-storage` to opt out. `mlflow-deploy` takes `--namespace`, `-n` to isolate the stack.
- `deployml minikube-init`, `deployml minikube-deploy`: Local Kubernetes flow for testing without GCP. `minikube-deploy` takes `--namespace`, `-n`. Build local images with `build-images` in local mode, which targets the host architecture so they run on the minikube node.
- `deployml gke-cluster-create`: Create a GKE cluster. Autopilot by default with `--region`; pass `--standard` for a small zonal cluster.
- `deployml gke-init`, `deployml gke-deploy`, `deployml gke-apply`, `deployml gke-destroy`: GKE deployment path. MLflow on GKE provisions a PersistentVolumeClaim by default so experiment data survives pod restarts. Deploy commands take `--namespace`, `-n`, and MLflow and FastAPI must share a namespace for in-cluster service DNS to resolve. `gke-destroy` removes the deployed manifests including the PVC and the referenced `gcr.io` image so nothing keeps billing, with `--keep-images` to retain images for a quick redeploy and `--delete-cluster` to remove the cluster. See the [GKE flow notes](../tutorials/gcp-cloud-run.md#gke-flow-notes) in the tutorial.

The fully supported and tested path is GCP Cloud Run via `init`, `build-images`, `deploy`, `get-urls`, `destroy`.
15 changes: 14 additions & 1 deletion docs/api/overview.md
@@ -13,4 +13,17 @@ deployml provides a command-line interface (CLI) for deploying and managing MLOp
| `deployml get-urls` | Print service URLs and write `.env` file |
| `deployml destroy` | Tear down all infrastructure |

See [CLI Commands](cli-commands.md) for full usage details.
### Kubernetes commands

For the optional Kubernetes paths, local minikube and GKE:

| Command | Description |
|---|---|
| `deployml minikube-init` / `minikube-deploy` | Generate and deploy FastAPI manifests to a local minikube cluster |
| `deployml mlflow-init` / `mlflow-deploy` | Generate and deploy MLflow to minikube, with a PersistentVolumeClaim for data |
| `deployml gke-cluster-create` | Create a GKE cluster, Autopilot by default |
| `deployml gke-init` | Generate Kubernetes manifests for GKE |
| `deployml gke-deploy` / `gke-apply` | Apply manifests to a GKE cluster |
| `deployml gke-destroy` | Remove manifests, the PVC, and the gcr.io image, optionally the cluster |

See [CLI Commands](cli-commands.md) for full usage details and flags.
4 changes: 2 additions & 2 deletions docs/features/costs.md
@@ -33,14 +33,14 @@ Here are estimated typical costs for several **GCP** services, but please do not
- Google Cloud Storage costs approximately $0.020 per GB per month.
- BigQuery storage costs $0.020 per GB per month with query costs based on data scanned.
- Cloud VMs cost approximately $25 per month for medium instances.
- GKE clusters have no management fee, but you pay for VM instances and load balancers. Note that the GKE can get very expensive very quickly.
- GKE clusters have no management fee, but you pay for VM instances and load balancers. MLflow on GKE also provisions a small PersistentDisk for its data, a few cents per GB-month. Note that GKE can get expensive quickly.



## Cost Optimization

Here are some tips to keep the costs low while you are learning:

- Use SQLite instead of Cloud SQL whenever possible, particularly for development purposes and when your data is small.
- Use SQLite instead of Cloud SQL whenever possible, particularly for development purposes and when your data is small. The minikube and GKE MLflow paths already do this, sqlite on a PersistentVolumeClaim, so they avoid the always-on Cloud SQL cost.
- Enable auto-teardown to prevent forgotten deployments.
- Use Cloud Run for variable workloads to take advantage of scale-to-zero pricing.
10 changes: 10 additions & 0 deletions docs/features/overview.md
@@ -6,6 +6,16 @@ deployml is a Python library that deploys a complete MLOps infrastructure in GCP

You define your stack in a YAML config file, run `deployml deploy`, and Terraform provisions everything in GCP. When you're done, `deployml destroy` tears it all down cleanly.

## Deployment targets

Cloud Run is the primary, fully supported target and is what the rest of this page describes. The same MLflow and FastAPI stack can also run on Kubernetes, selected by `deployment.type` in your config:

- `cloud_run` — serverless on GCP Cloud Run, the default.
- `gke` — a Google Kubernetes Engine cluster, where MLflow gets a PersistentVolumeClaim so experiment data survives pod restarts.
- Local **minikube**, for testing without GCP, via the `minikube-*` and `mlflow-*` commands.

See [CLI Commands](../api/cli-commands.md) and the [GKE flow notes](../tutorials/gcp-cloud-run.md#gke-flow-notes) for the Kubernetes paths.

## What gets deployed

### Experiment Tracking, Artifact Storage, and Model Registry — MLflow