Version: graphifyy 0.9.5 (Windows 11, docs-heavy corpus: ~6800 nodes, 76% documents, ~800 files incl. many .docx/.xlsx)
Problem
A .docx/.xlsx that is modified after its first conversion never re-enters an --update.
convert_office_file (detect.py:604) names the sidecar {stem}_{sha256(NFC path)[:8]}.md — a hash of the path, not the content — and early-returns if the sidecar already exists (detect.py:636) without ever looking at the source file.
detect_incremental (detect.py:1389+) compares mtime/hash of the sidecar, never of the Office source.
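To make the naming problem concrete, here is a minimal sketch of the scheme described above. It is not the actual graphifyy code: `sidecar_name` is a hypothetical helper, and which string form of the path gets hashed (absolute, relative, separator style) is an assumption. It only shows that the name is a function of the path, so editing the file's content cannot change it.

```python
import hashlib
import unicodedata
from pathlib import Path


def sidecar_name(source: Path) -> str:
    """Hypothetical reconstruction of {stem}_{sha256(NFC path)[:8]}.md."""
    # Assumption: the hashed string is the NFC-normalized path text.
    normalized = unicodedata.normalize("NFC", str(source))
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:8]
    # Only the path goes into the name; the file's bytes are never read.
    return f"{source.stem}_{digest}.md"
```

Because the name is stable across edits, an "already exists" check on the sidecar says nothing about whether the source has changed since conversion.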
Repro
- Convert a .docx (first extract) → sidecar created.
- Edit the .docx.
- Run detect_incremental → the file is reported "unchanged"; the stale sidecar stays in place and is never re-extracted, on this run or any later one.
On our corpus this forces a manual pre-update pass that purges stale sidecars by comparing source.st_mtime > sidecar.st_mtime — workable, but any user with a living Office corpus silently gets a frozen graph otherwise.
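For anyone needing the same workaround, the purge pass looks roughly like this. It is a sketch, not our exact script: `purge_stale_sidecars` is a hypothetical name, and the caller is assumed to already know which sidecar belongs to which source (how that mapping is built depends on where graphifyy writes its sidecars).

```python
from pathlib import Path
from typing import Iterable


def purge_stale_sidecars(pairs: Iterable[tuple[Path, Path]]) -> list[Path]:
    """Delete every sidecar older than its Office source; return what was deleted."""
    purged = []
    for source, sidecar in pairs:
        # Skip pairs where either side is missing; nothing to compare.
        if not (source.exists() and sidecar.exists()):
            continue
        # Same test as above: source.st_mtime > sidecar.st_mtime.
        if source.stat().st_mtime > sidecar.stat().st_mtime:
            sidecar.unlink()
            purged.append(sidecar)
    return purged
```

With the stale sidecar gone, the early return on an existing sidecar should no longer trigger, so the next run converts the source again.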
Proposal
Store the md5 of the SOURCE file in the sidecar (header comment) or in the manifest entry, and re-convert when it differs. Minimal alternative: regenerate the sidecar when source.st_mtime > sidecar.st_mtime.
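A rough sketch of the header-comment variant, to show how small the change could be. Everything here is hypothetical: the `<!-- source-md5: ... -->` header format and the names `source_md5`, `write_sidecar` and `needs_reconversion` are illustrations, not existing graphifyy functions.

```python
import hashlib
import re
from pathlib import Path

# Assumed header format: an HTML comment on the first line of the sidecar.
HEADER_RE = re.compile(r"^<!-- source-md5: ([0-9a-f]{32}) -->$")


def source_md5(source: Path) -> str:
    """md5 of the Office source's bytes, read in 1 MiB chunks."""
    h = hashlib.md5()
    with source.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sidecar(sidecar: Path, source: Path, markdown: str) -> None:
    """Write the converted markdown with the source hash as a header line."""
    header = f"<!-- source-md5: {source_md5(source)} -->\n"
    sidecar.write_text(header + markdown, encoding="utf-8")


def needs_reconversion(source: Path, sidecar: Path) -> bool:
    """True if the sidecar is missing, has no header, or the hash differs."""
    if not sidecar.exists():
        return True
    with sidecar.open("r", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    match = HEADER_RE.match(first_line)
    return match is None or match.group(1) != source_md5(source)
```

The early return in convert_office_file would then be taken only when `needs_reconversion` is false. Sidecars written before the change have no header, so they would be regenerated once, which seems acceptable.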
Related (closed, different angles): #1226 (NFC/NFD duplicate sidecars), #861 (.graphifyignore bypass for sidecars).