RadSEM is a semantic evaluation metric for radiology reports that breaks down reports into atomic sentences, aligns them between generated and reference reports, and computes detailed scores based on anatomical and abnormality relationships.
RadSEM evaluates radiology reports through three main steps:
- Step 1 (Report processing): Converts reports into atomic sentences following strict rules
- Step 2 (Sentence matching): Aligns sentences between generated and reference reports with detailed relationship labels
- Step 3 (Scoring): Computes weighted F1 scores for abnormal and normal findings
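The scoring idea in Step 3 can be illustrated with a minimal sketch. This is not the logic in step/step3.py: the function names, the count-based inputs, and the `abnormal_weight` value are all hypothetical, chosen only to show how an F1 score is derived from matched sentence counts and how two F1 scores could be combined with a weight.

```python
def f1_score(matched_generated, total_generated, matched_reference, total_reference):
    """Return F1 from counts of matched atomic sentences on each side.

    Precision is the share of generated sentences that found a match;
    recall is the share of reference sentences that found a match.
    """
    precision = matched_generated / total_generated if total_generated else 0.0
    recall = matched_reference / total_reference if total_reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def combined_score(abnormal_f1, normal_f1, abnormal_weight=0.8):
    """Combine two F1 scores linearly.

    abnormal_weight is a hypothetical illustration value, not a RadSEM setting.
    """
    return abnormal_weight * abnormal_f1 + (1 - abnormal_weight) * normal_f1


if __name__ == "__main__":
    # Hypothetical counts: 3 of 4 generated and 3 of 5 reference abnormal sentences matched.
    abnormal = f1_score(3, 4, 3, 5)
    normal = f1_score(2, 2, 2, 2)
    print(combined_score(abnormal, normal))
```

The actual weighting scheme, including how relationship labels from Step 2 affect each match, is defined by step3.py and may differ from this sketch.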
RadSEM/
├── l1_l5/ # L1–L5 evaluation data and filtered samples
├── step/
│ ├── step1.py # Report rewriting into atomic sentences
│ ├── step2.py # Sentence matching and tagging
│ └── step3.py # Score calculation
├── run_radsem.py # Main pipeline orchestrator
├── groundtruth.jsonl # Reference reports
└── model_output.jsonl # Generated reports to evaluate
The scripts use an API for LLM-based processing. Update the API endpoint and key in step/step1.py:
url = "http://your/API/base/url"
headers = {
"Authorization": "YOUR_API_KEY",
...
}
Run the complete pipeline:
python run_radsem.py
This will:
- Process model_output.jsonl through step1 → model_rewritten_res.jsonl
- Process groundtruth.jsonl through step1 → gt_rewritten_res.jsonl
- Align and tag both → tag.jsonl
- Compute scores → score.jsonl
Each line of the input JSONL files (groundtruth.jsonl and model_output.jsonl) should be a JSON object with:
{
"name": "sample_0001",
"Examined_Area": "CHEST",
"Examined_Type": "CT",
"English_Report": "Both lungs are clear..."
}
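Before running the pipeline, it can help to check that every input line matches this shape. The following is a minimal sketch, assuming all four fields shown above are required; `find_invalid_lines` and `REQUIRED_KEYS` are hypothetical names and are not part of RadSEM.

```python
import json

# Assumption: every field in the example record above is required.
REQUIRED_KEYS = ("name", "Examined_Area", "Examined_Type", "English_Report")


def find_invalid_lines(lines):
    """Return (line_number, problem) pairs for lines that do not fit the format."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((line_number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_number, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((line_number, "missing keys: " + ", ".join(missing)))
    return problems


if __name__ == "__main__":
    sample = [
        '{"name": "sample_0001", "Examined_Area": "CHEST", '
        '"Examined_Type": "CT", "English_Report": "Both lungs are clear..."}',
        '{"name": "sample_0002"}',
    ]
    for line_number, problem in find_invalid_lines(sample):
        print(f"line {line_number}: {problem}")
```

To check a real file, pass an open file handle instead of a list, for example `find_invalid_lines(open("groundtruth.jsonl", encoding="utf-8"))`.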