sohumt123/sf311-clusters


SF 311 — Latent Cluster Atlas

Turning 18 months of San Francisco 311 case photos into a clustered, browsable map of what the city is actually complaining about — and a causal estimate of where those complaints sit longest.

UMAP of SF 311 case-photo captions, colored by Leiden cluster

The pipeline pulls public 311 cases from the Socrata API, downloads each attached photo, captions it with a vision model, embeds the captions, and runs Leiden community detection over the kNN graph of embeddings. The resulting clusters get auto-named by a second model pass and projected into 2D / 3D with UMAP for an interactive browser atlas. Then we run an embedding-matched causal study on top of the same vectors to estimate which neighborhoods take longer to close cases that look the same to a model.
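As a rough illustration of the "kNN graph of embeddings" step, the sketch below builds a cosine-similarity neighbor graph with NumPy. It is not the code in stages/cluster.py: the function name `knn_edges` and the default `k=15` are assumptions made for this example, and a real run would hand the resulting edges to a Leiden implementation such as leidenalg.

```python
import numpy as np

def knn_edges(embeddings: np.ndarray, k: int = 15) -> list[tuple[int, int, float]]:
    """Return undirected (i, j, cosine_similarity) edges from each row to its k nearest rows.

    Hypothetical helper; requires k < number of rows.
    """
    # L2-normalize so that a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # never pick a point as its own neighbor
    # Column indices of the k most similar rows for every row (unordered within the top k).
    neighbors = np.argpartition(-sims, kth=k - 1, axis=1)[:, :k]
    edges: dict[tuple[int, int], float] = {}
    for i, row in enumerate(neighbors):
        for j in row:
            a, b = (i, int(j)) if i < j else (int(j), i)
            edges[(a, b)] = float(sims[a, b])  # de-duplicate mutual neighbors
    return [(a, b, w) for (a, b), w in sorted(edges.items())]
```

The edge list is symmetric and weighted, which is the shape most community-detection libraries expect as input.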


TL;DR results

  • 355 cases · 982 photos · 8 discovered visual clusters, automatically named.
  • Mission cases take +2.21 hours longer to close than visually-matched controls in the rest of SF (95% CI [0.53, 3.95], p = 0.002).
  • South of Market closes them 1.79 hours faster than matched controls (p < 0.001).
  • Mean cosine similarity between treated cases and their k=3 nearest matched controls: 0.81 — controls really are visually similar.

Pipeline

Pipeline diagram: pull → download → caption → embed → cluster → causal

Six stages, each checkpointed to parquet so reruns are cheap:

stage     tech                              output
pull      Socrata API + async httpx         cases.parquet (18 mo of 311 cases)
download  async httpx, on-disk cache        photos_index.parquet + photos/*.jpg
caption   gpt-4o-mini (low-detail vision)   captions.parquet (scene/issues/severity)
embed     text-embedding-3-small (1536d)    embeddings.parquet
cluster   Leiden + sub-Leiden + UMAP        clusters.parquet, cluster_names.json
causal    kNN matching on embeddings        viz/causal.json (ATT per neighborhood)
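For a sense of what the pull stage has to ask for, here is a hedged sketch of building one paged Socrata (SoQL) request URL. The `$where`, `$order`, `$limit` and `$offset` parameters are standard SoQL. The base URL placeholder, the `requested_datetime` column name, and the `socrata_page_url` helper are assumptions for illustration and may not match stages/pull.py.

```python
from datetime import datetime
from urllib.parse import urlencode

def socrata_page_url(base_url: str, since: datetime, limit: int = 1000, offset: int = 0) -> str:
    """Build the URL for one page of cases opened on or after `since` (hypothetical helper)."""
    params = {
        # Column name is an assumption; check the dataset's schema before relying on it.
        "$where": f"requested_datetime >= '{since.isoformat(timespec='seconds')}'",
        "$order": "requested_datetime",  # stable ordering so offset paging is consistent
        "$limit": limit,
        "$offset": offset,
    }
    return f"{base_url}?{urlencode(params)}"

# Placeholder endpoint: substitute the real dataset resource URL.
url = socrata_page_url("https://example.org/resource/xxxx-xxxx.json", datetime(2024, 1, 1))
```

Paging by increasing `offset` until a short page comes back is one common way to walk the full 18-month window.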

Run end-to-end:

uv sync
cp .env.example .env   # add OPENAI_API_KEY
uv run python pipeline.py --stages all

Or run any subset:

uv run python pipeline.py --stages caption,embed,cluster
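A minimal sketch of how a `--stages` flag like this could be parsed with argparse. The stage names come from the table above. The function name `parse_stages` and the choice to always run stages in pipeline order are assumptions, not a description of what pipeline.py actually does.

```python
import argparse

# Stage names in pipeline order, taken from the stage table.
STAGES = ["pull", "download", "caption", "embed", "cluster", "causal"]

def parse_stages(argv: list[str]) -> list[str]:
    """Turn `--stages all` or `--stages a,b,c` into an ordered list of stage names."""
    parser = argparse.ArgumentParser(description="Run pipeline stages")
    parser.add_argument("--stages", default="all", help="'all' or a comma-separated subset")
    args = parser.parse_args(argv)
    if args.stages == "all":
        return list(STAGES)
    requested = [s.strip() for s in args.stages.split(",") if s.strip()]
    unknown = [s for s in requested if s not in STAGES]
    if unknown:
        parser.error(f"unknown stage(s): {', '.join(unknown)}")  # exits with status 2
    # Run in pipeline order regardless of the order given on the command line.
    return [s for s in STAGES if s in requested]
```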

Each stage skips work that's already cached on disk, so a rerun on the same data takes seconds, not hours.
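The skip-if-cached behavior can be sketched as a small wrapper around each stage's output file. This is an illustrative pattern only: `run_if_missing` is a hypothetical name, and the real stages may cache per item (per photo, per caption) rather than per file.

```python
from pathlib import Path
from typing import Callable

def run_if_missing(output: Path, build: Callable[[Path], None]) -> bool:
    """Call build(path) only when `output` is absent; return True if work was done."""
    if output.exists():
        return False  # checkpoint already on disk, nothing to do
    output.parent.mkdir(parents=True, exist_ok=True)
    tmp = output.with_name(output.name + ".tmp")
    build(tmp)           # write to a temp path so a crash leaves no half-written checkpoint
    tmp.replace(output)  # rename into place once the write has finished
    return True
```

Writing to a temporary path and renaming means an interrupted run is simply retried, instead of being mistaken for a finished checkpoint.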


What the clusters look like

Discovered cluster sizes

Cluster names are generated by a second LLM pass over the top-K captions in each Leiden community — no hand labels. The largest clusters (trash/disarray, parking, overgrowth) are exactly what a city ops team would expect; the smaller ones (mobility & maintenance) surface things that don't show up cleanly in 311's own service-name taxonomy.
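One plausible way to pick the "top-K captions" for a naming call is to rank each cluster's members by similarity to the cluster centroid. The README does not say how the ranking is done, so the centroid criterion, the function names, and the prompt wording below are all assumptions.

```python
import numpy as np

def top_k_captions(embeddings, captions, labels, cluster_id, k=10):
    """Return the k captions whose embeddings lie closest to the cluster centroid."""
    members = np.flatnonzero(np.asarray(labels) == cluster_id)
    vecs = np.asarray(embeddings, dtype=float)[members]
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = unit @ unit.mean(axis=0)  # similarity of each member to the mean direction
    ranked = members[np.argsort(-scores)[:k]]
    return [captions[i] for i in ranked]

def naming_prompt(samples):
    """Format the sampled captions into a single naming request (wording is illustrative)."""
    bullets = "\n".join(f"- {s}" for s in samples)
    return "Give a short name (2-4 words) for the theme shared by these photo captions:\n" + bullets
```

The prompt string would then be sent once per cluster, which matches the "one call per cluster" cost note in the Stack section.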


Causal estimate

Question. Holding the visual content of the complaint constant, does the neighborhood where the case sits change how long it takes to close?

Setup. For each treated case in neighborhood N, find the k=3 nearest controls by cosine distance on caption embeddings, drawn from cases in the rest of SF. For each treated case, take its response_hours minus the mean response_hours of its matched controls; the average of these differences over the treated set estimates the ATT (average treatment effect on the treated), assuming unconfoundedness given the visual content captured by the embedding. 95% CIs come from a paired bootstrap; p-values come from a permutation test that reshuffles the treatment label.
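The setup above can be sketched in a few lines of NumPy: match each treated case to its k nearest controls, difference the response times, and bootstrap the per-case differences. This is a minimal sketch, not the code in stages/causal_match.py. The function names, the 2,000 bootstrap draws, and the use of matching with replacement (a control can serve several treated cases) are assumptions.

```python
import numpy as np

def matched_differences(treated_emb, treated_hours, control_emb, control_hours, k=3):
    """Per treated case: its hours minus the mean hours of its k most similar controls."""
    t = treated_emb / np.linalg.norm(treated_emb, axis=1, keepdims=True)
    c = control_emb / np.linalg.norm(control_emb, axis=1, keepdims=True)
    sims = t @ c.T                               # cosine similarity, shape (n_treated, n_control)
    nearest = np.argsort(-sims, axis=1)[:, :k]   # indices of the k closest controls
    return np.asarray(treated_hours) - np.asarray(control_hours)[nearest].mean(axis=1)

def att_with_bootstrap(diffs, n_boot=2000, seed=0):
    """ATT point estimate plus a 95% percentile CI from resampling treated cases."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)         # each row is one bootstrap replicate
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return float(diffs.mean()), (float(lo), float(hi))
```

Resampling the differences keeps each treated case tied to its own matched controls, which is what makes the bootstrap "paired".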

ATT by neighborhood — Mission slowest, South of Market fastest

Mission is the headline: 35 treated cases, 2.2 hours slower than visually-matched controls, robust across k ∈ {1, 3, 5}. South of Market is the inverse: visually similar complaints close noticeably faster there. The other neighborhoods have tiny samples (n ≤ 7) and CIs that cross zero, so they're descriptive only.
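The p-values behind these neighborhood comparisons come from a permutation test on the treatment label. A hedged sketch of one way to run it: shuffle which cases count as "treated", redo the matching, and see how often the shuffled ATT is at least as extreme as the observed one. The function names, the two-sided form, and `n_perm=1000` are assumptions for illustration.

```python
import numpy as np

def att(unit_emb, hours, treated, k=3):
    """ATT from k-nearest-control matching; `treated` is a boolean mask over all cases."""
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    sims = unit_emb[t_idx] @ unit_emb[c_idx].T
    nearest = c_idx[np.argsort(-sims, axis=1)[:, :k]]  # map back to original case indices
    return float((hours[t_idx] - hours[nearest].mean(axis=1)).mean())

def permutation_p_value(embeddings, hours, treated, k=3, n_perm=1000, seed=0):
    """Observed ATT and a two-sided permutation p-value under shuffled treatment labels."""
    rng = np.random.default_rng(seed)
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    observed = att(unit, hours, treated, k)
    null = np.array([att(unit, hours, rng.permutation(treated), k) for _ in range(n_perm)])
    # The +1 correction keeps the p-value from ever being exactly zero.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, float(p)
```

Calling the same function with k set to 1, 3 and 5 is a simple way to repeat the robustness check mentioned above.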

Caveat that always applies to matched designs: matching addresses confounding only from what the photo shows and the caption embedding retains (severity, scene type, foreground objects). It cannot rule out confounding from anything the photo can't see: time of day the report came in, the specific responder dispatched, weather, etc.


Interactive viz

viz/index.html is a single-file local atlas: 3D UMAP scatter on the left (auto-rotating, click any point to see the photo), Mapbox map on the right. viz/causality.html is the matched-pairs explorer for the causal study.

cd viz && python3 -m http.server 8000
# then open http://localhost:8000/

No build step, no framework — just static HTML + the prebuilt data.json / causal.json.


Stack

Python 3.11, uv, httpx, polars, openai, leidenalg + python-igraph, umap-learn, scikit-learn, duckdb, pyarrow, scipy.

LLM choices were made for cost:

  • Captioning: gpt-4o-mini at low detail ≈ $0.0001 / image — fine for trash / encampment / overgrowth, misses small detail.
  • Embedding: text-embedding-3-small (1536d) ≈ $0.02 / 1M tokens.
  • Cluster naming: gpt-4o-mini again, one call per cluster.

Total OpenAI spend for the full 355-case run was under $0.50.
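As a back-of-the-envelope check on that total, the listed unit prices can be multiplied out. The photo and cluster counts and the two unit prices come from this README. The tokens-per-caption figure and the per-call naming cost are rough assumptions, flagged in the code.

```python
def estimate_cost(
    n_photos: int = 982,
    n_clusters: int = 8,
    caption_cost_per_image: float = 0.0001,  # from the bullet list above
    embed_cost_per_mtok: float = 0.02,       # from the bullet list above
    tokens_per_caption: int = 100,           # ASSUMPTION: typical caption length
    naming_cost_per_call: float = 0.001,     # ASSUMPTION: rough gpt-4o-mini call cost
) -> dict[str, float]:
    """Rough OpenAI spend per stage, in USD."""
    caption = n_photos * caption_cost_per_image
    embed = n_photos * tokens_per_caption / 1_000_000 * embed_cost_per_mtok
    naming = n_clusters * naming_cost_per_call
    return {"caption": caption, "embed": embed, "naming": naming,
            "total": caption + embed + naming}

costs = estimate_cost()
```

Under these assumptions captioning dominates the bill, and the estimated total stays below the $0.50 figure quoted above.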


Repo layout

pipeline.py                 # CLI entry; argparse --stages
stages/
  pull.py                   # Socrata API → cases.parquet
  download.py               # async photo download
  caption.py                # vision captions → structured tags
  embed.py                  # text-embedding-3-small
  cluster.py                # Leiden + sub-Leiden + UMAP + naming
  causal_match.py           # kNN embedding matching → ATT
common/
  io.py                     # parquet helpers, paths, schema contracts
  http.py                   # shared httpx.AsyncClient + retry
  env.py                    # .env loading
  progress.py               # throughput logging
viz/
  index.html                # 3D UMAP + map atlas
  causality.html            # matched-pairs explorer
  build_data.py             # parquet → viz/data.json
  build_readme_images.py    # regenerate the figures in this README
tests/                      # one offline smoke test per stage
docs/
  superpowers/specs/        # design doc
  superpowers/plans/        # implementation plan
  img/                      # README figures

Reproducing the figures

uv run python viz/build_readme_images.py

Regenerates docs/img/{umap,clusters,causal,pipeline}.png from viz/data.json and viz/causal.json.
