
Office files (.docx/.xlsx): modified sources never re-enter --update (sidecar keyed on path hash, source content never checked) #1649

Description

@Ns2384-star

Version: graphifyy 0.9.5 (Windows 11, docs-heavy corpus: ~6800 nodes, 76% documents, ~800 files incl. many .docx/.xlsx)

Problem

A .docx/.xlsx that is modified after its first conversion never re-enters an --update.

  • convert_office_file (detect.py:604) names the sidecar {stem}_{sha256(NFC path)[:8]}.md — a hash of the path, not the content — and early-returns if the sidecar already exists (detect.py:636) without ever looking at the source file.
  • detect_incremental (detect.py:1389+) compares mtime/hash of the sidecar, never of the Office source.
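The naming scheme described in the first bullet can be sketched as below. This is an illustration, not the actual detect.py code: `sidecar_name` is a hypothetical helper, and exactly which form of the path gets hashed (absolute, relative, resolved) is an assumption. What it shows is that nothing in the name depends on the file's bytes, so an edited source maps to the same sidecar and hits the "already exists" early return.

```python
import hashlib
import unicodedata
from pathlib import Path


def sidecar_name(source: Path) -> str:
    """Hypothetical reconstruction of {stem}_{sha256(NFC path)[:8]}.md."""
    # Assumption: the path string is NFC-normalized before hashing.
    normalized = unicodedata.normalize("NFC", str(source))
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:8]
    # Only the path feeds the digest; file content is never read.
    return f"{source.stem}_{digest}.md"


if __name__ == "__main__":
    # Same path gives the same sidecar name, whatever the file contains.
    print(sidecar_name(Path("docs/report.docx")))
```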

Repro

  1. Convert a .docx (first extract) → sidecar created.
  2. Edit the .docx.
  3. Run detect_incremental → the file is reported "unchanged", so the stale sidecar is kept and never re-extracted, no matter how often the source changes afterwards.

On our corpus this forces a manual pre-update pass that purges every sidecar whose source is newer (source.st_mtime > sidecar.st_mtime). That is workable, but any other user with a living Office corpus silently gets a frozen graph.
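For reference, a minimal sketch of that pre-update pass. `purge_stale_sidecars` is a hypothetical helper, not part of graphifyy, and it assumes the caller already knows which sidecar belongs to which source (for example by recomputing the sidecar name from the path).

```python
from pathlib import Path


def purge_stale_sidecars(pairs):
    """Delete sidecars that are older than their Office source.

    pairs: iterable of (source_path, sidecar_path); hypothetical input,
    the mapping itself has to come from the caller.
    Returns the list of sidecar paths that were removed.
    """
    removed = []
    for source, sidecar in pairs:
        source, sidecar = Path(source), Path(sidecar)
        if not (source.exists() and sidecar.exists()):
            continue  # nothing to compare
        if source.stat().st_mtime > sidecar.stat().st_mtime:
            sidecar.unlink()  # the next run sees no sidecar and re-converts
            removed.append(sidecar)
    return removed
```

Note that this relies on timestamps only, so a source restored or copied with a preserved older mtime would be missed.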

Proposal

Store the md5 of the SOURCE file in the sidecar (header comment) or in the manifest entry, and re-convert when it differs. Minimal alternative: regenerate the sidecar when source.st_mtime > sidecar.st_mtime.
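A rough sketch of the header-comment variant, to make the proposal concrete. Everything here is an assumption for illustration: the `<!-- source-md5: ... -->` header format and the helpers `source_md5`, `write_sidecar`, and `needs_reconvert` are hypothetical and do not exist in detect.py today.

```python
import hashlib
import re
from pathlib import Path

# Hypothetical header format: first line of the sidecar.
HEADER_RE = re.compile(r"^<!-- source-md5: ([0-9a-f]{32}) -->$")


def source_md5(source: Path) -> str:
    """md5 of the Office file's bytes, read in chunks."""
    h = hashlib.md5()
    with source.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_sidecar(source: Path, sidecar: Path, markdown: str) -> None:
    """Write the converted markdown with the source hash as a header comment."""
    header = f"<!-- source-md5: {source_md5(source)} -->\n"
    sidecar.write_text(header + markdown, encoding="utf-8")


def needs_reconvert(source: Path, sidecar: Path) -> bool:
    """True if the sidecar is missing, has no header, or the hash differs."""
    if not sidecar.exists():
        return True
    with sidecar.open("r", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    match = HEADER_RE.match(first_line)
    return match is None or match.group(1) != source_md5(source)
```

With something like this, the early return in convert_office_file would become "return only if `needs_reconvert` is false". Sidecars written before the change have no header, so they would be re-converted once.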

Related (closed, different angles): #1226 (NFC/NFD duplicate sidecars), #861 (.graphifyignore bypass for sidecars).
