Add masked language modeling models by shujaatTracebloc · Pull Request #69 · tracebloc/model-zoo

shujaatTracebloc · 2026-05-18T08:40:13Z

Summary

- Add masked_language_modeling/pytorch/ directory with three model definitions for biomedical graph MLM pretraining:
  - simple_mlm.py: ~30M-parameter, 4-layer transformer encoder for smoke testing and pipeline validation
  - netmedgpt_style_scratch.py: ~110M-parameter, 12-layer BERT-scale encoder for training from scratch on domain corpora (e.g. PrimeKG random walks)
  - netmedgpt_style_warmstart.py: BERT-base warm-started from HuggingFace pretrained weights, with the embedding layer resized for a custom tokenizer vocabulary
- Update test_model_contract.py to include masked_language_modeling in known categories

Test plan

pytest tests/test_model_contract.py passes with new category and all three model files
Verify simple_mlm forward pass produces correct output shape (batch, seq_len, vocab_size)
Verify netmedgpt_style_scratch forward pass produces correct output shape
Verify netmedgpt_style_warmstart loads pretrained weights and resizes embeddings

🤖 Generated with Claude Code

Note

Medium Risk
Adds new PyTorch MLM model entrypoints, including one that imports HuggingFace transformers, which may break model-contract imports if that dependency isn’t present in relevant CI/runtime environments.
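
One common way to reduce this kind of import risk is to defer the optional import until the model is actually built. The sketch below is not taken from this PR: `is_installed` and `lazy_import` are hypothetical helper names, and it assumes the dependency is addressed by its top-level module name (for example `transformers`).

```python
import importlib
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def lazy_import(module_name: str):
    """Import a module at call time so that importing the model file itself
    does not fail when the optional dependency is missing."""
    if not is_installed(module_name):
        raise ImportError(
            f"'{module_name}' is required to build this model but is not installed"
        )
    return importlib.import_module(module_name)
```

With a pattern like this, a factory function could call `lazy_import("transformers")` inside its body, so an import-only check of the file would still succeed in an environment without that package.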

Overview
Introduces a new masked_language_modeling task in the model zoo with three PyTorch model definitions: a small transformer encoder (SimpleMaskedLM), a BERT-base-scale from-scratch encoder with tied LM head (NetMedGPTScratch), and a warm-started HuggingFace BERT loader that resizes token embeddings (NetMedGPTWarmStart).

Updates the model contract test to recognize masked_language_modeling as a valid category so these new model files are included in the import/metadata checks.

Reviewed by Cursor Bugbot for commit 084cb95. Bugbot is set up for automated code reviews on this repo.

Three MLM model definitions for biomedical graph pretraining:
- simple_mlm.py: ~30M param transformer encoder for smoke testing
- netmedgpt_style_scratch.py: ~110M param BERT-scale, train from scratch
- netmedgpt_style_warmstart.py: BERT-base warm-started from HuggingFace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

LukasWodka · 2026-05-18T08:40:44Z

👋 Heads-up — Code review queue is at 18 / 8

Above the WIP limit. The team convention is to review existing PRs before opening new work.

Open PRs currently in Code review (oldest first):

.github#41 — fix(profile): point install snippets to canonical i.sh / i.ps1 URLs · author: @saadqbal · reviewer: @LukasWodka
client#100 — feat(client): scaffolding for stateless requests-proxy auth (HC-1) · author: @saadqbal · no reviewer assigned
client-runtime#35 — feat(Develop #21): jobs-manager submit-ingestion-run HTTP endpoint · author: @saadqbal · no reviewer assigned
client-runtime#36 — Bump requests from 2.32.3 to 2.33.0 in /Node-deploy · author: @dependabot · no reviewer assigned
client-runtime#37 — Bump black from 25.1.0 to 26.3.1 in /Node-deploy · author: @dependabot · no reviewer assigned
docs#3 — Block bots from crawling Mintlify static assets · author: @LukasWodka · reviewer: @saadqbal
frontend-app#469 — chore(deps): bump postcss from 8.5.6 to 8.5.10 · author: @dependabot · no reviewer assigned
frontend-app#470 — chore(deps): bump dompurify from 3.3.1 to 3.4.0 · author: @dependabot · no reviewer assigned
frontend-app#471 — chore(deps): bump @babel/plugin-transform-modules-systemjs from 7.27.1 to 7.29.4 · author: @dependabot · no reviewer assigned
model-zoo#67 — Add 10 trending model architectures across all task families · author: @divyasinghds · no reviewer assigned

Pull from review before opening new work. (This is a nudge from the kanban WIP check, not a block.)

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6576d4b.

- Tie lm_decoder weights to word_embeddings (BERT-style) to match the warmstart model architecture and get correct ~110M param count
- Initialize nn.MultiheadAttention in_proj_weight and out_proj params which are stored as raw Parameters, not nn.Linear submodules

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
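
The effect of weight tying on the parameter count can be checked with back-of-the-envelope arithmetic. The sketch below assumes BERT-base-like dimensions (vocabulary 30522, hidden size 768, 12 layers, feed-forward size 3072, 512 positions). These values and the function name `encoder_param_count` are illustrative assumptions, not read from netmedgpt_style_scratch.py, and the model's actual layout may differ.

```python
def encoder_param_count(vocab_size=30522, hidden=768, layers=12,
                        ffn=3072, max_positions=512, tie_lm_head=True):
    """Count trainable parameters for an assumed BERT-like encoder with an LM head."""
    # Token and position embeddings plus one LayerNorm (weight + bias).
    embeddings = vocab_size * hidden + max_positions * hidden + 2 * hidden

    # Self-attention: fused Q/K/V projection plus output projection, with biases.
    attention = (3 * hidden * hidden + 3 * hidden) + (hidden * hidden + hidden)
    # Feed-forward: two linear layers with biases.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with weight + bias.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms

    # An untied decoder owns a second vocab_size x hidden matrix;
    # a tied decoder reuses the token embedding matrix and adds nothing.
    lm_head = 0 if tie_lm_head else vocab_size * hidden
    return embeddings + layers * per_layer + lm_head


tied = encoder_param_count(tie_lm_head=True)
untied = encoder_param_count(tie_lm_head=False)
print(f"tied:   {tied / 1e6:.1f}M")
print(f"untied: {untied / 1e6:.1f}M")
```

Under these assumptions the tied configuration comes to about 108.9M parameters and the untied one to about 132.3M. The gap is exactly one vocab_size × hidden matrix, which is consistent with the commit's point that tying is needed to land near 110M.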

shujaatTracebloc self-assigned this May 18, 2026

Merge remote-tracking branch 'origin/develop' into feat/add-masked-language-modeling-models

f2155fb

shujaatTracebloc requested a review from divyasinghds May 18, 2026 08:41

divyasinghds reviewed May 18, 2026

Comment thread model_zoo/masked_language_modeling/pytorch/simple_mlm.py Outdated

Remove unused math import from simple_mlm

d7c642e

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor Bot reviewed May 18, 2026

Comment thread model_zoo/masked_language_modeling/pytorch/netmedgpt_style_warmstart.py Outdated

Fix warmstart model to use main_method for factory function

6576d4b

NetMedGPTWarmStart is a factory function, not an nn.Module subclass, so it must declare main_method per the metadata contract. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
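
The distinction this commit message draws can be illustrated with a small check. This is a hypothetical sketch, not the contract test from this repository: `main_method` is the key named in the commit message, while `main_class` and the helper `expected_entrypoint_key` are assumed names used only for illustration.

```python
import inspect


def expected_entrypoint_key(entrypoint):
    """Return the metadata key a contract of this kind might expect.

    Classes map to 'main_class' (an assumed key name); plain functions,
    such as model factories, map to 'main_method'.
    """
    if inspect.isclass(entrypoint):
        return "main_class"
    if inspect.isfunction(entrypoint):
        return "main_method"
    raise TypeError(f"unsupported entrypoint type: {type(entrypoint).__name__}")


class ExampleModel:
    """Stand-in for an nn.Module subclass."""


def build_example_model():
    """Stand-in for a factory function that returns a ready-made model."""
    return ExampleModel()
```

Here `expected_entrypoint_key(ExampleModel)` yields the class key and `expected_entrypoint_key(build_example_model)` yields `main_method`, mirroring why a factory-style entrypoint needs different metadata than a class.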

cursor Bot reviewed May 18, 2026

Comment thread model_zoo/masked_language_modeling/pytorch/netmedgpt_style_scratch.py

Comment thread model_zoo/masked_language_modeling/pytorch/netmedgpt_style_scratch.py

divyasinghds approved these changes May 18, 2026

shujaatTracebloc merged commit c0aa960 into master May 18, 2026
6 checks passed

shujaatTracebloc deleted the feat/add-masked-language-modeling-models branch May 18, 2026 10:30

LukasWodka mentioned this pull request Jun 3, 2026

fix(keypoint): vitpose_plus dataset_index + pixel-space coords, sapiens HF-loadable backbone #84

Merged

2 tasks

shujaatTracebloc merged 5 commits into master from feat/add-masked-language-modeling-models

3 participants
