
Add masked language modeling models #69

Merged
shujaatTracebloc merged 5 commits into master from feat/add-masked-language-modeling-models
May 18, 2026

@shujaatTracebloc shujaatTracebloc commented May 18, 2026

Summary

  • Add masked_language_modeling/pytorch/ directory with three model definitions for biomedical graph MLM pretraining
  • simple_mlm.py — ~30M param, 4-layer transformer encoder for smoke testing and pipeline validation
  • netmedgpt_style_scratch.py — ~110M param, 12-layer BERT-scale encoder for training from scratch on domain corpora (e.g. PrimeKG random walks)
  • netmedgpt_style_warmstart.py — BERT-base warm-started from HuggingFace pretrained weights, embedding layer resized for custom tokenizer vocab
  • Update test_model_contract.py to include masked_language_modeling in known categories

Test plan

  • pytest tests/test_model_contract.py passes with new category and all three model files
  • Verify simple_mlm forward pass produces correct output shape (batch, seq_len, vocab_size)
  • Verify netmedgpt_style_scratch forward pass produces correct output shape
  • Verify netmedgpt_style_warmstart loads pretrained weights and resizes embeddings
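
The forward-pass shape checks in the test plan could look roughly like the sketch below. `TinyMLM` is a hypothetical, much smaller stand-in written for illustration; it is not the PR's `SimpleMaskedLM`, and all sizes are assumptions.

```python
import torch
import torch.nn as nn


class TinyMLM(nn.Module):
    """Hypothetical stand-in for an MLM encoder (not the PR's SimpleMaskedLM)."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        # Position ids 0..seq_len-1, broadcast over the batch dimension
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        hidden = self.tok(input_ids) + self.pos(positions)
        # One logit per vocabulary entry at every sequence position
        return self.lm_head(self.encoder(hidden))


batch, seq_len, vocab_size = 2, 16, 100
model = TinyMLM(vocab_size=vocab_size).eval()
input_ids = torch.randint(0, vocab_size, (batch, seq_len))
with torch.no_grad():
    logits = model(input_ids)

# The contract stated in the test plan: (batch, seq_len, vocab_size)
assert logits.shape == (batch, seq_len, vocab_size)
```

The same assertion should apply unchanged to the larger models, since only the hidden size and layer count differ, not the output contract.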

🤖 Generated with Claude Code


Note

Medium Risk
Adds new PyTorch MLM model entrypoints, including one that imports HuggingFace transformers, which may break model-contract imports if that dependency isn’t present in relevant CI/runtime environments.
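
One common way to reduce this kind of import risk is to defer the `transformers` import until the factory is actually called, so that merely importing the model file does not require the dependency. The sketch below is an assumption about how that could be done, not a description of what this PR does; `build_warmstart_model`, `has_transformers`, and the default checkpoint name are hypothetical.

```python
import importlib.util


def has_transformers() -> bool:
    """Return True if the `transformers` package is importable."""
    return importlib.util.find_spec("transformers") is not None


def build_warmstart_model(vocab_size: int, checkpoint: str = "bert-base-uncased"):
    """Hypothetical factory: load pretrained BERT, then resize its embeddings.

    The checkpoint name is an assumption; the PR only says "BERT-base".
    """
    if not has_transformers():
        raise ImportError("transformers is required for the warm-start model")

    # Deferred import: runs only when the factory is called, not at module import
    from transformers import BertForMaskedLM

    model = BertForMaskedLM.from_pretrained(checkpoint)
    # Resize the token embedding matrix to match the custom tokenizer vocab
    model.resize_token_embeddings(vocab_size)
    return model
```

With this layout, an import-only contract check can load the module in an environment without `transformers`; the failure surfaces only when the model is built.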

Overview
Introduces a new masked_language_modeling task in the model zoo with three PyTorch model definitions: a small transformer encoder (SimpleMaskedLM), a BERT-base-scale from-scratch encoder with tied LM head (NetMedGPTScratch), and a warm-started HuggingFace BERT loader that resizes token embeddings (NetMedGPTWarmStart).

Updates the model contract test to recognize masked_language_modeling as a valid category so these new model files are included in the import/metadata checks.
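
On the tied LM head mentioned in the overview: tying means the output projection reuses the input embedding matrix instead of owning a second one of the same shape. A minimal sketch with small illustrative sizes (not the PR's dimensions) shows the effect on the parameter count; `count_unique_params` is a hypothetical helper.

```python
import torch.nn as nn

vocab_size, hidden = 1000, 64  # illustrative sizes, not the PR's

embeddings = nn.Embedding(vocab_size, hidden)
decoder = nn.Linear(hidden, vocab_size, bias=False)


def count_unique_params(*modules):
    """Count parameters, counting each shared tensor only once."""
    seen = {}
    for module in modules:
        for p in module.parameters():
            seen[id(p)] = p.numel()
    return sum(seen.values())


untied = count_unique_params(embeddings, decoder)

# nn.Linear stores its weight as (out_features, in_features), which matches
# the embedding's (num_embeddings, embedding_dim), so it can be shared directly.
decoder.weight = embeddings.weight
tied = count_unique_params(embeddings, decoder)

assert untied == 2 * vocab_size * hidden
assert tied == vocab_size * hidden
```

For scale, with the standard 30,522-token BERT vocabulary and a hidden size of 768, that shared matrix is 30,522 × 768 = 23,440,896 parameters, which is why tying matters for a ~110M total.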

Reviewed by Cursor Bugbot for commit 084cb95.

Three MLM model definitions for biomedical graph pretraining:
- simple_mlm.py: ~30M param transformer encoder for smoke testing
- netmedgpt_style_scratch.py: ~110M param BERT-scale, train from scratch
- netmedgpt_style_warmstart.py: BERT-base warm-started from HuggingFace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shujaatTracebloc shujaatTracebloc self-assigned this May 18, 2026
@LukasWodka

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

Comment thread model_zoo/masked_language_modeling/pytorch/simple_mlm.py Outdated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread model_zoo/masked_language_modeling/pytorch/netmedgpt_style_warmstart.py Outdated
NetMedGPTWarmStart is a factory function, not an nn.Module subclass,
so it must declare main_method per the metadata contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6576d4b.

- Tie lm_decoder weights to word_embeddings (BERT-style) to match
  the warmstart model architecture and get correct ~110M param count
- Initialize nn.MultiheadAttention in_proj_weight and out_proj params
  which are stored as raw Parameters, not nn.Linear submodules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
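
The attention initialization fix described in this commit can be sketched as follows. In `nn.MultiheadAttention`, `in_proj_weight` is a bare `nn.Parameter` on the attention module itself, so an init hook that only matches `nn.Linear` and `nn.Embedding` never touches it. The function name `init_weights` and the `std=0.02` value are assumptions for illustration, not taken from the PR.

```python
import torch
import torch.nn as nn


def init_weights(module: nn.Module, std: float = 0.02) -> None:
    """Hypothetical BERT-style init hook for use with `model.apply(...)`."""
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=std)
    elif isinstance(module, nn.MultiheadAttention):
        # Fused Q/K/V projection: a raw Parameter, not a Linear submodule
        if module.in_proj_weight is not None:
            nn.init.normal_(module.in_proj_weight, mean=0.0, std=std)
        if module.in_proj_bias is not None:
            nn.init.zeros_(module.in_proj_bias)
        # Initialize the output projection explicitly as well
        nn.init.normal_(module.out_proj.weight, mean=0.0, std=std)
        if module.out_proj.bias is not None:
            nn.init.zeros_(module.out_proj.bias)


torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
attn.apply(init_weights)  # visits submodules first, then `attn` itself
```

Without the `nn.MultiheadAttention` branch, `in_proj_weight` keeps PyTorch's default initialization, which has a noticeably larger spread than the 0.02 standard deviation used elsewhere in this sketch.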
@shujaatTracebloc shujaatTracebloc merged commit c0aa960 into master May 18, 2026
6 checks passed
@shujaatTracebloc shujaatTracebloc deleted the feat/add-masked-language-modeling-models branch May 18, 2026 10:30

3 participants