SF 311 — Latent Cluster Atlas

Turning 18 months of San Francisco 311 case photos into a clustered, browsable map of what the city is actually complaining about — and a causal estimate of where those complaints sit longest.

The pipeline pulls public 311 cases from the Socrata API, downloads each attached photo, captions it with a vision model, embeds the captions, and runs Leiden community detection over the kNN graph of embeddings. The resulting clusters get auto-named by a second model pass and projected into 2D / 3D with UMAP for an interactive browser atlas. Then we run an embedding-matched causal study on top of the same vectors to estimate which neighborhoods take longer to close cases that look the same to a model.

TL;DR results

355 cases · 982 photos · 8 discovered visual clusters, automatically named.
Mission cases take 2.21 hours longer to close than visually-matched controls in the rest of SF (95% CI [0.53, 3.95], p = 0.002).
South of Market cases close 1.79 hours faster than their matched controls (p < 0.001).
Mean cosine similarity between treated cases and their k=3 nearest matched controls: 0.81 — controls really are visually similar.

Pipeline

Six stages, each checkpointed to parquet so reruns are cheap:

stage	tech	output
`pull`	Socrata API + async httpx	`cases.parquet` (18 mo of 311 cases)
`download`	async httpx, on-disk cache	`photos_index.parquet` + `photos/*.jpg`
`caption`	`gpt-4o-mini` (low-detail vision)	`captions.parquet` (scene/issues/severity)
`embed`	`text-embedding-3-small` (1536d)	`embeddings.parquet`
`cluster`	Leiden + sub-Leiden + UMAP	`clusters.parquet`, `cluster_names.json`
`causal`	kNN matching on embeddings	`viz/causal.json` (ATT per neighborhood)

Run end-to-end:

uv sync
cp .env.example .env   # add OPENAI_API_KEY
uv run python pipeline.py --stages all

Or run any subset:

uv run python pipeline.py --stages caption,embed,cluster

Each stage skips work that's already cached on disk, so a rerun on the same data takes seconds, not hours.
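A minimal sketch of how `--stages` parsing and skip-if-cached could fit together. The stage names and output files come from the table above; `parse_stages`, `plan`, and the idea of checking a single output file per stage are assumptions for illustration. The real stages may cache at a finer grain (per photo or per row).

```python
from pathlib import Path

# Stage order and output names are taken from the pipeline table.
# Treating one file as "the" output of a stage is a simplifying assumption.
STAGES = ["pull", "download", "caption", "embed", "cluster", "causal"]
OUTPUTS = {
    "pull": "cases.parquet",
    "download": "photos_index.parquet",
    "caption": "captions.parquet",
    "embed": "embeddings.parquet",
    "cluster": "clusters.parquet",
    "causal": "viz/causal.json",
}


def parse_stages(spec: str) -> list[str]:
    """Turn 'all' or 'caption,embed' into an ordered list of stage names."""
    if spec.strip() == "all":
        return list(STAGES)
    wanted = {s.strip() for s in spec.split(",") if s.strip()}
    unknown = wanted - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stages: {sorted(unknown)}")
    # Keep pipeline order regardless of the order given on the command line.
    return [s for s in STAGES if s in wanted]


def plan(spec: str, data_dir: Path) -> list[tuple[str, bool]]:
    """Return (stage, needs_run) pairs; a stage is skipped if its output exists."""
    return [(s, not (data_dir / OUTPUTS[s]).exists()) for s in parse_stages(spec)]
```

With this shape, `plan("caption,embed,cluster", Path("data"))` would list the three stages in pipeline order and flag only those whose output file is missing.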

What the clusters look like

Cluster names are generated by a second LLM pass over the top-K captions in each Leiden community — no hand labels. The largest clusters (trash/disarray, parking, overgrowth) are exactly what a city ops team would expect; the smaller ones (mobility & maintenance) surface things that don't show up cleanly in 311's own service-name taxonomy.
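For orientation, here is a rough sketch of the graph-building step that feeds Leiden: a cosine kNN graph over caption embeddings, then one Leiden pass. It is dependency-free up to the final function, which imports `python-igraph` and `leidenalg` lazily. The helper names (`knn_edges`, `leiden_labels`), the choice of modularity as the quality function, and the single pass (no sub-Leiden step) are assumptions, not necessarily what `stages/cluster.py` does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_edges(vectors, k):
    """Undirected kNN edges as (i, j, similarity) with i < j, deduplicated."""
    edges = {}
    for i, u in enumerate(vectors):
        sims = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        for sim, j in sims[:k]:
            edges[(min(i, j), max(i, j))] = sim
    return [(i, j, s) for (i, j), s in sorted(edges.items())]


def leiden_labels(n_nodes, edges, seed=0):
    """One community label per node. Requires python-igraph and leidenalg."""
    import igraph as ig
    import leidenalg as la

    g = ig.Graph(n=n_nodes, edges=[(i, j) for i, j, _ in edges])
    # Similarities are used as edge weights; this assumes they are non-negative.
    g.es["weight"] = [s for _, _, s in edges]
    part = la.find_partition(
        g, la.ModularityVertexPartition, weights="weight", seed=seed
    )
    return part.membership
```

The brute-force neighbor search is quadratic in the number of captions, which is fine at a few hundred to a thousand points; a larger run would likely want an approximate or library-backed kNN instead.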

Causal estimate

Question. Holding the visual content of the complaint constant, does the neighborhood where the case sits change how long it takes to close?

Setup. For each treated case in neighborhood N, find the k=3 nearest controls by cosine distance on caption embeddings, drawn from cases in the rest of SF. The within-pair difference in response_hours, averaged across the treated set, estimates the ATT (average treatment effect on the treated) under unconfoundedness given the visual content captured by the embedding. 95% CIs come from a paired bootstrap; p-values come from a permutation test that reshuffles the treatment label.
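The matching step described in the setup can be sketched as follows. This is an illustrative, pure-Python version, not the code in `stages/causal_match.py`: the function names, the `(embedding, response_hours)` tuple layout, and matching with replacement (a control may serve several treated cases) are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def match_controls(treated_vec, control_vecs, k=3):
    """Return (control_index, similarity) for the k most similar controls."""
    scored = sorted(
        ((cosine(treated_vec, c), j) for j, c in enumerate(control_vecs)),
        reverse=True,
    )
    return [(j, s) for s, j in scored[:k]]


def att_diffs(treated, controls, k=3):
    """Per-treated-case difference in response hours versus matched controls.

    treated, controls: lists of (embedding, response_hours) tuples.
    Each difference is the treated case's hours minus the mean hours of
    its k matched controls.
    """
    control_vecs = [vec for vec, _ in controls]
    diffs = []
    for vec, hours in treated:
        matches = match_controls(vec, control_vecs, k)
        matched_mean = sum(controls[j][1] for j, _ in matches) / len(matches)
        diffs.append(hours - matched_mean)
    return diffs


def att(diffs):
    """ATT estimate: mean of the per-treated-case differences."""
    return sum(diffs) / len(diffs)
```

Averaging the similarities returned by `match_controls` over all treated cases gives a match-quality number of the same kind as the mean cosine similarity reported in the TL;DR.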

Mission is the headline: 35 treated cases, +2.2 hours slower than visually-matched controls, robust across k ∈ {1, 3, 5}. South of Market is the inverse — visually-similar complaints close noticeably faster there. The other neighborhoods have tiny samples (n ≤ 7) and CIs that cross zero, so they're descriptive only.
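Intervals and p-values of the kind quoted here could be computed along these lines. This is a sketch under assumptions: a percentile bootstrap over the per-treated-case differences, and a two-sided permutation test with add-one smoothing. `bootstrap_ci`, `permutation_pvalue`, and the `statistic` callback (which would re-run the matching for a given labelling) are hypothetical names, and the resampling counts are placeholders.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def permutation_pvalue(labels, statistic, n_perm=1000, seed=0):
    """Two-sided permutation p-value for statistic(labels).

    labels: list of booleans (True = treated).
    statistic: callable mapping a labelling to an effect estimate, for
    example by re-running the matching with the shuffled labels.
    """
    rng = random.Random(seed)
    observed = abs(statistic(labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps the number of treated cases fixed
        if abs(statistic(shuffled)) >= observed:
            hits += 1
    # Add-one smoothing so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because each permutation repeats the whole matching procedure, the permutation test dominates runtime; with samples this small that cost is negligible.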

Caveat that always applies to matched designs: matching rules out confounding only from what the caption embedding captures of the photo (severity, scene type, foreground objects). It cannot rule out confounding from anything the photo can't show: time of day the report came in, the specific responder dispatched, weather, etc.

Interactive viz

viz/index.html is a single-file local atlas: 3D UMAP scatter on the left (auto-rotating, click any point to see the photo), Mapbox map on the right. viz/causality.html is the matched-pairs explorer for the causal study.

cd viz && python3 -m http.server 8000
# then open http://localhost:8000/

No build step, no framework — just static HTML + the prebuilt data.json / causal.json.

Stack

Python 3.11, uv, httpx, polars, openai, leidenalg + python-igraph, umap-learn, scikit-learn, duckdb, pyarrow, scipy.

LLM choices were made for cost:

Captioning: gpt-4o-mini at low detail ≈ $0.0001 / image — fine for trash / encampment / overgrowth, misses small detail.
Embedding: text-embedding-3-small (1536d) ≈ $0.02 / 1M tokens.
Cluster naming: gpt-4o-mini again, one call per cluster.

Total OpenAI spend for the full 355-case run was under $0.50.
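As a back-of-the-envelope check on that total, the quoted unit prices can be multiplied out. The photo count and prices come from this README; the tokens-per-caption figure, the per-call naming cost, and the assumption of one embedding per caption are placeholders, since none of them is reported here.

```python
N_PHOTOS = 982                   # from the TL;DR
N_CLUSTERS = 8                   # from the TL;DR
CAPTION_USD_PER_IMAGE = 0.0001   # approximate price quoted above
EMBED_USD_PER_M_TOKENS = 0.02    # approximate price quoted above

# Assumptions for illustration only; not reported in the README.
ASSUMED_TOKENS_PER_CAPTION = 150
ASSUMED_NAMING_USD_PER_CALL = 0.001


def estimate_cost():
    """Rough spend breakdown in USD under the assumptions above."""
    caption = N_PHOTOS * CAPTION_USD_PER_IMAGE
    # Assumes one embedding per caption.
    embed_tokens = N_PHOTOS * ASSUMED_TOKENS_PER_CAPTION
    embed = embed_tokens / 1_000_000 * EMBED_USD_PER_M_TOKENS
    naming = N_CLUSTERS * ASSUMED_NAMING_USD_PER_CALL
    return {
        "caption": caption,
        "embed": embed,
        "naming": naming,
        "total": caption + embed + naming,
    }
```

Under these assumptions captioning dominates at roughly ten cents, and the estimated total stays well inside the stated $0.50 ceiling, leaving room for retries and reruns.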

Repo layout

pipeline.py                 # CLI entry; argparse --stages
stages/
  pull.py                   # Socrata API → cases.parquet
  download.py               # async photo download
  caption.py                # vision captions → structured tags
  embed.py                  # text-embedding-3-small
  cluster.py                # Leiden + sub-Leiden + UMAP + naming
  causal_match.py           # kNN embedding matching → ATT
common/
  io.py                     # parquet helpers, paths, schema contracts
  http.py                   # shared httpx.AsyncClient + retry
  env.py                    # .env loading
  progress.py               # throughput logging
viz/
  index.html                # 3D UMAP + map atlas
  causality.html            # matched-pairs explorer
  build_data.py             # parquet → viz/data.json
  build_readme_images.py    # regenerate the figures in this README
tests/                      # one offline smoke test per stage
docs/
  superpowers/specs/        # design doc
  superpowers/plans/        # implementation plan
  img/                      # README figures

Reproducing the figures

uv run python viz/build_readme_images.py

Regenerates docs/img/{umap,clusters,causal,pipeline}.png from viz/data.json and viz/causal.json.
