
Kmonte/tb examples v2 #632

Draft
kmontemayor2-sc wants to merge 2 commits into main from kmonte/tb-examples-v2

Conversation

@kmontemayor2-sc (Collaborator)

Scope of work done

- Where is the documentation for this feature? N/A
- Did you add automated tests or write a test plan?
- Updated Changelog.md? NO
- Ready for code review? NO

kmontemayor and others added 2 commits May 11, 2026 17:18
…_metrics

Introduces ``gigl.utils.tensorboard_writer.TensorBoardWriter``, the
trainer/inferencer-side abstraction that writes scalar metrics to a
Vertex AI ``ExperimentRun`` synchronously via
``aiplatform.log_time_series_metrics``.

Key design points:

- ``TensorBoardWriter.create(resource_name=..., experiment_name=...,
  experiment_run_name=..., enabled=is_chief_process)`` is the only
  constructor entry point. No env-var contract, no proto fields on
  ``GiglResourceConfig`` — configuration is plumbed through the
  trainer/inferencer's argparse (typically populated from
  ``GbmlConfig.trainerConfig.trainerArgs`` /
  ``inferencerConfig.inferencerArgs``).
- ``enabled=False`` (non-chief ranks) returns a no-op writer. Chief
  ranks must supply all three string args; missing any of them raises
  ``RuntimeError`` so misconfiguration surfaces fast rather than
  producing a silent no-op.
- Logs the cross-job experiment URL on ``start_run`` success so
  engineers can find the comparison TB page from trainer stdout.
- Each ``log()`` is a single synchronous ``WriteTensorboardRunData``
  RPC. Failures propagate to the caller (no background uploader
  thread).
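
The contract in the bullets above can be sketched as follows. This is a hypothetical, simplified stand-in for `gigl.utils.tensorboard_writer.TensorBoardWriter`, with the Vertex AI call (`aiplatform.log_time_series_metrics`) stubbed out; only the `create()` signature and the enabled/fail-fast behavior come from the PR description:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TensorBoardWriter:
    """Sketch of the writer contract; not the real GiGL implementation."""

    enabled: bool
    resource_name: Optional[str] = None
    experiment_name: Optional[str] = None
    experiment_run_name: Optional[str] = None

    @classmethod
    def create(
        cls,
        *,
        resource_name: Optional[str],
        experiment_name: Optional[str],
        experiment_run_name: Optional[str],
        enabled: bool,
    ) -> "TensorBoardWriter":
        if not enabled:
            # Non-chief ranks get a no-op writer.
            return cls(enabled=False)
        missing = [
            name
            for name, value in (
                ("resource_name", resource_name),
                ("experiment_name", experiment_name),
                ("experiment_run_name", experiment_run_name),
            )
            if not value
        ]
        if missing:
            # Fail fast on the chief rank instead of silently no-op'ing.
            raise RuntimeError(
                f"TensorBoardWriter misconfigured; missing args: {missing}"
            )
        return cls(True, resource_name, experiment_name, experiment_run_name)

    def log(self, tag: str, value: float, step: int) -> None:
        if not self.enabled:
            return
        # The real writer issues one synchronous WriteTensorboardRunData
        # RPC here; failures propagate to the caller.
        print(f"{tag}={value} @ step {step}")
```

Non-chief ranks can call `log()` unconditionally; the no-op writer simply returns, so call sites need no rank checks beyond the `enabled=` argument at construction time.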

This PR introduces the writer with full test coverage but no callers;
example trainer/inferencer entrypoints get wired up in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds chief-rank TensorBoard logging to all 8 example link_prediction
entrypoints (4 trainers + 4 inferencers, single-pool and graph-store).
Each entrypoint:

- Adds ``--tensorboard_resource_name`` / ``--tensorboard_experiment_name``
  argparse flags (populated from
  ``trainerConfig.trainerArgs`` / ``inferencerConfig.inferencerArgs`` in
  the task config).
- Plumbs both args plus ``--job_name`` through the per-process dataclass
  (``TrainingProcessArgs`` / ``InferenceProcessArgs``).
- Constructs the writer once at the top of the per-process function
  with ``enabled=is_chief_process`` (or the graph-store equivalent
  ``args.cluster_info.compute_node_rank == 0 and local_rank == 0``).
  Misconfiguration on the chief rank fails fast inside ``create()``.
- Trainers log ``Loss/train`` / ``Loss/val`` / ``Loss/test`` inside the
  existing ``log_every_n_batch`` gates; inferencers log
  ``Inference/throughput_batches_per_sec``.
- Closes the writer at the end of the per-process function (paired
  ``aiplatform.end_run``).
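
The argparse plumbing described above might look like this sketch. The flag names (`--job_name`, `--tensorboard_resource_name`, `--tensorboard_experiment_name`) and the dataclass name `TrainingProcessArgs` come from the PR; the field layout and parse helper are simplified assumptions:

```python
import argparse
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TrainingProcessArgs:
    """Simplified stand-in for the per-process args dataclass."""

    job_name: str
    tensorboard_resource_name: Optional[str]
    tensorboard_experiment_name: Optional[str]


def parse_args(argv: List[str]) -> TrainingProcessArgs:
    parser = argparse.ArgumentParser()
    parser.add_argument("--job_name", required=True)
    # Populated from trainerConfig.trainerArgs in the task config;
    # optional so non-TensorBoard runs still parse cleanly.
    parser.add_argument("--tensorboard_resource_name", default=None)
    parser.add_argument("--tensorboard_experiment_name", default=None)
    ns = parser.parse_args(argv)
    return TrainingProcessArgs(
        job_name=ns.job_name,
        tensorboard_resource_name=ns.tensorboard_resource_name,
        tensorboard_experiment_name=ns.tensorboard_experiment_name,
    )
```

The per-process function would then construct the writer once from these fields with `enabled=is_chief_process`, so only the chief rank validates the three strings.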

Updates the four OSS example task configs (CORA and DBLP, single-pool
and graph-store) to set the GiGL OSS Tensorboard resource +
``gigl-oss-examples`` experiment, so the examples emit comparable
runs out of the box.
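
The task-config change might look like the fragment below. The key names mirror the flags above, and the experiment name comes from the PR; the Tensorboard resource ID is a placeholder, since the PR does not spell out the real GiGL OSS resource name:

```yaml
# Hypothetical fragment of one OSS example task config (e.g. CORA
# single-pool); the resource name is a placeholder, not the real ID.
trainerConfig:
  trainerArgs:
    tensorboard_resource_name: "projects/<gigl-oss-project>/locations/<region>/tensorboards/<id>"
    tensorboard_experiment_name: "gigl-oss-examples"
inferencerConfig:
  inferencerArgs:
    tensorboard_resource_name: "projects/<gigl-oss-project>/locations/<region>/tensorboards/<id>"
    tensorboard_experiment_name: "gigl-oss-examples"
```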

The single-line ruff reformat of ``tensorboard_writer_test.py`` is a
trivial cleanup that rides along.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>