Kmonte/tb examples v2 #632
Draft
kmontemayor2-sc wants to merge 2 commits into
…_metrics

Introduces ``gigl.utils.tensorboard_writer.TensorBoardWriter``, the trainer/inferencer-side abstraction that writes scalar metrics to a Vertex AI ``ExperimentRun`` synchronously via ``aiplatform.log_time_series_metrics``.

Key design points:

- ``TensorBoardWriter.create(resource_name=..., experiment_name=..., experiment_run_name=..., enabled=is_chief_process)`` is the only constructor entry point. There is no env-var contract and there are no proto fields on ``GiglResourceConfig``; configuration is plumbed through the trainer/inferencer's argparse (typically populated from ``GbmlConfig.trainerConfig.trainerArgs`` / ``inferencerConfig.inferencerArgs``).
- ``enabled=False`` (non-chief ranks) returns a no-op writer. Chief ranks must supply all three string args; missing any of them raises ``RuntimeError`` so misconfiguration surfaces fast rather than producing a silent no-op.
- Logs the cross-job experiment URL on ``start_run`` success so engineers can find the comparison TB page from trainer stdout.
- Each ``log()`` is a single synchronous ``WriteTensorboardRunData`` RPC. Failures propagate to the caller (no background uploader thread).

This PR introduces the writer with full test coverage but no callers; example trainer/inferencer entrypoints get wired up in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
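For readers skimming the design points above, the ``create()`` contract can be sketched roughly as follows. This is not the code in this PR: ``SketchWriter`` is a hypothetical stand-in that records metrics in memory instead of calling Vertex AI, and the ``log(metrics, step)`` signature is an assumption made for illustration.

```python
from typing import Dict, List, Optional, Tuple


class SketchWriter:
    """Hypothetical stand-in for TensorBoardWriter; keeps metrics in memory."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled
        # (step, metrics) pairs; the real writer would issue an RPC instead.
        self.records: List[Tuple[int, Dict[str, float]]] = []

    @classmethod
    def create(
        cls,
        resource_name: Optional[str] = None,
        experiment_name: Optional[str] = None,
        experiment_run_name: Optional[str] = None,
        enabled: bool = True,
    ) -> "SketchWriter":
        # Non-chief ranks get a no-op writer and skip validation entirely.
        if not enabled:
            return cls(enabled=False)
        required = {
            "resource_name": resource_name,
            "experiment_name": experiment_name,
            "experiment_run_name": experiment_run_name,
        }
        missing = [name for name, value in required.items() if not value]
        # Chief ranks fail fast instead of silently degrading to a no-op.
        if missing:
            raise RuntimeError(f"Missing TensorBoard args: {', '.join(missing)}")
        return cls(enabled=True)

    def log(self, metrics: Dict[str, float], step: int) -> None:
        # Assumed signature; synchronous, so any failure reaches the caller.
        if not self.enabled:
            return
        self.records.append((step, dict(metrics)))

    def close(self) -> None:
        self.enabled = False
```

The point of the sketch is the asymmetry: a disabled writer never validates its arguments, while an enabled one refuses to start with a partial configuration.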
Adds chief-rank TensorBoard logging to all 8 example link_prediction entrypoints (4 trainers + 4 inferencers, single-pool and graph-store).

Each entrypoint:

- Adds ``--tensorboard_resource_name`` / ``--tensorboard_experiment_name`` argparse flags (populated from ``trainerConfig.trainerArgs`` / ``inferencerConfig.inferencerArgs`` in the task config).
- Plumbs both args plus ``--job_name`` through the per-process dataclass (``TrainingProcessArgs`` / ``InferenceProcessArgs``).
- Constructs the writer once at the top of the per-process function with ``enabled=is_chief_process`` (or the graph-store equivalent ``args.cluster_info.compute_node_rank == 0 and local_rank == 0``). Misconfiguration on the chief rank fails fast inside ``create()``.
- Trainers log ``Loss/train`` / ``Loss/val`` / ``Loss/test`` inside the existing ``log_every_n_batch`` gates; inferencers log ``Inference/throughput_batches_per_sec``.
- Closes the writer at the end of the per-process function (paired ``aiplatform.end_run``).

Updates the four OSS example task configs (CORA and DBLP, single-pool and graph-store) to set the GiGL OSS Tensorboard resource + ``gigl-oss-examples`` experiment, so the examples emit comparable runs out of the box.

The single-line ruff reformat of ``tensorboard_writer_test.py`` is a trivial cleanup that rides along.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
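The per-entrypoint wiring described above might look roughly like the sketch below. Only the flag names, the chief-rank condition, the ``log_every_n_batch`` gate, and the ``Loss/train`` tag come from the commit message; the helper names (``build_parser``, ``is_chief``, ``run_training``) and the in-memory ``logged`` list standing in for the writer are hypothetical.

```python
import argparse
from typing import Dict, List, Tuple


def build_parser() -> argparse.ArgumentParser:
    # Flag names follow the commit message; defaults are an assumption.
    parser = argparse.ArgumentParser()
    parser.add_argument("--tensorboard_resource_name", type=str, default=None)
    parser.add_argument("--tensorboard_experiment_name", type=str, default=None)
    parser.add_argument("--job_name", type=str, default=None)
    return parser


def is_chief(compute_node_rank: int, local_rank: int) -> bool:
    # Graph-store form of the chief check quoted in the commit message.
    return compute_node_rank == 0 and local_rank == 0


def run_training(
    losses: List[float], log_every_n_batch: int, enabled: bool
) -> List[Tuple[int, Dict[str, float]]]:
    # `logged` stands in for writer.log(...) calls on the chief rank.
    logged: List[Tuple[int, Dict[str, float]]] = []
    for batch_idx, loss in enumerate(losses, start=1):
        # Metrics are emitted only inside the existing logging gate.
        if enabled and batch_idx % log_every_n_batch == 0:
            logged.append((batch_idx, {"Loss/train": loss}))
    return logged
```

With four batches and ``log_every_n_batch=2``, the chief rank records two points and every other rank records none, which keeps the number of synchronous RPCs proportional to the existing log cadence.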
Scope of work done
Where is the documentation for this feature?: N/A
Did you add automated tests or write a test plan?
Updated Changelog.md? NO
Ready for code review?: NO