pfftools: add Maildir export format with MIME synthesis and contact deduplication#158

Open
KJ7LNW wants to merge 4 commits into
libyal:mainfrom
KJ7LNW:maildir

@KJ7LNW KJ7LNW commented May 14, 2026

Description

libpff's pffexport tool previously supported only flat directory exports with no standard mail layout. This adds a -f maildir export mode producing RFC 2822 messages in a standard Maildir tree, consumable directly by mail user agents.

Type of Change

  • Feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change that fixes an issue)
  • Breaking change (fix or feature that causes existing functionality to change)
  • Refactor (no functional changes)
  • Documentation

Implementation Details

The Maildir exporter creates cur/, new/, and tmp/ subdirectories for each exported PST folder and writes each email into cur/ as a structurally valid RFC 2822 message. A folder rule table skips synthetic containers (Common Views, Finder, NON_IPM_SUBTREE), passes through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renames others (Root - Public -> Public Folders). Non-email item types are silently skipped, so only RFC 2822-representable items appear in the output.
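
As an illustration of the skip/passthrough/rename behavior, here is a minimal Python sketch of such a rule table. It is not the C code in this PR: the names `Action`, `FOLDER_RULES`, and `map_folder_path` are hypothetical, only the folder names come from the description, and it assumes that a skipped folder drops its whole subtree.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"
    PASSTHROUGH = "passthrough"
    RENAME = "rename"


# Hypothetical rule table; only the folder names are taken from the PR text.
FOLDER_RULES = {
    "Common Views": (Action.SKIP, None),
    "Finder": (Action.SKIP, None),
    "NON_IPM_SUBTREE": (Action.SKIP, None),
    "Root - Mailbox": (Action.PASSTHROUGH, None),
    "IPM_SUBTREE": (Action.PASSTHROUGH, None),
    "Root - Public": (Action.RENAME, "Public Folders"),
}


def map_folder_path(pst_path):
    """Map a list of PST folder names to Maildir path components.

    Returns None when any component is skipped (assumption: the whole
    subtree under a skipped folder is dropped).
    """
    components = []
    for name in pst_path:
        action, new_name = FOLDER_RULES.get(name, (None, None))
        if action is Action.SKIP:
            return None
        if action is Action.PASSTHROUGH:
            continue  # contributes no directory level
        components.append(new_name if action is Action.RENAME else name)
    return components
```

Under these assumptions, `["Root - Mailbox", "IPM_SUBTREE", "Inbox"]` maps to a single `Inbox` directory, while anything under `Finder` is not exported.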

  • Cross-run deduplication: a djb2 hash table tracks seen Message-ID values. The table is persisted to .seen_message_ids in the export root and reloaded on subsequent runs, so multiple overlapping PST/OST archives can be exported into the same Maildir tree without duplicates and without holding all IDs in memory between invocations.
  • MIME synthesis: both plain-text and HTML bodies are retrieved independently; attachments are enumerated and classified as inline or regular by Content-ID. The correct multipart/mixed, multipart/related, or multipart/alternative structure is selected from what is actually present. Original transport headers have MIME envelope lines stripped before writing so the synthesized headers are authoritative.
  • scripts/contact-to-vcf.py: reads a list of Contact.txt paths (as exported by libpff) from stdin, merges duplicates keyed on primary email or display name, and emits a single vCard 3.0 .vcf file to stdout.
  • Manual page updated to document -f maildir, deduplication behavior, and the folder rule table.

Eric Wheeler added 4 commits May 3, 2026 14:32
Introduces a new -f maildir export mode that writes each email as an
RFC 2822 message into a Maildir tree (cur/, new/, tmp/ per folder),
producing output directly consumable by standard mail user agents.

To prevent duplicate messages when exporting multiple overlapping PST
or OST archives into the same Maildir tree, a djb2 hash table tracks
seen Message-ID values. The table is persisted to .seen_message_ids in
the export root and reloaded on subsequent runs, enabling cross-file
deduplication without holding all IDs in memory between invocations.
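
The deduplication scheme can be sketched in Python as follows. This is an illustration only, not the `seen_message_ids_table_t` implementation: the class name, the bucket count, the 32-bit hash width, and the one-ID-per-line file format are all assumptions.

```python
import os


def djb2(data: bytes) -> int:
    """Classic djb2 string hash (hash * 33 + byte), truncated to 32 bits."""
    h = 5381
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


class SeenMessageIds:
    """Chained hash table of Message-ID strings with load/save.

    Assumed on-disk format: one Message-ID per line.
    """

    def __init__(self, path, buckets=1024):
        self.path = path
        self.buckets = [[] for _ in range(buckets)]
        if os.path.exists(path):
            # Reload state left behind by a previous run.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    message_id = line.rstrip("\n")
                    if message_id:
                        self.add(message_id)

    def _bucket(self, message_id):
        index = djb2(message_id.encode("utf-8")) % len(self.buckets)
        return self.buckets[index]

    def __contains__(self, message_id):
        return message_id in self._bucket(message_id)

    def add(self, message_id):
        """Insert an ID; return True if it was new, False if already seen."""
        bucket = self._bucket(message_id)
        if message_id in bucket:
            return False
        bucket.append(message_id)
        return True

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            for bucket in self.buckets:
                for message_id in bucket:
                    f.write(message_id + "\n")
```

An exporter built this way would call `add()` only after a message has been written successfully and `save()` once at the end of the run, so an interrupted export does not mark unwritten messages as seen.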

Maildir mode also applies a rule table to PST/OST internal folder names,
skipping synthetic containers (Common Views, Finder, NON_IPM_SUBTREE,
etc.), passing through transparent wrappers (Root - Mailbox,
IPM_SUBTREE) without creating a directory level, and renaming others
(Root - Public -> Public Folders). Non-email item types are silently
skipped so only RFC 2822-representable items appear in the output.

- add EXPORT_FORMAT_MAILDIR enum value and "maildir" input recognition
- add seen_message_ids_table_t hash table with load/save to .seen_message_ids
- add export_handle_initialize_maildir to build dedup path and load prior state
- add export_handle_export_email_maildir writing Maildir filenames to cur/
- add maildir_folder_rules table with skip/passthrough/rename actions
- create cur/, new/, tmp/ subdirectories per exported folder
- allow appending to existing export path in Maildir mode
- wire initialization and directory-exists bypass in pffexport main

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

The Maildir exporter previously wrote a single body part with no MIME
envelope, choosing plain-text or HTML by fallback rather than capturing
both.  Attachments were not included in the output at all, and the
original transport headers were written verbatim, leaving conflicting
Content-Type and MIME-Version fields.

This rework makes the exporter produce structurally valid RFC 2822
messages.  Both plain-text and HTML bodies are retrieved independently.
Attachments are enumerated and their Content-ID, MIME type, and filename
are read to classify each as inline or regular.  The correct multipart
structure is then synthesised from what is actually present:
multipart/mixed wraps the body section and regular attachments,
multipart/related wraps an HTML body with its inline attachments, and
multipart/alternative wraps both body types when no attachments are
present.  Original transport headers have their MIME envelope lines
stripped before writing so the synthesised headers are authoritative.
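
A rough Python sketch of such a decision tree is shown below. It is not the C code from this commit: the function name and the tuple representation are hypothetical, and two behaviors are assumptions rather than facts from the commit message. First, when both bodies and inline attachments exist, the multipart/related part is nested inside multipart/alternative. Second, inline attachments without an HTML body are treated as regular attachments.

```python
def choose_structure(has_plain, has_html, n_inline, n_regular):
    """Return a nested description of the MIME tree to synthesise.

    A leaf is a string; a multipart node is (content_type, [children]).
    Returns None when the message has no body and no attachments.
    """
    if not has_html:
        # Assumption: nothing can reference inline parts without HTML.
        n_regular += n_inline
        n_inline = 0

    html_part = None
    if has_html:
        html_part = "text/html"
        if n_inline:
            # multipart/related wraps the HTML body with its inline parts.
            html_part = ("multipart/related",
                         ["text/html"] + ["inline"] * n_inline)

    if has_plain and html_part is not None:
        # multipart/alternative wraps both body types.
        body = ("multipart/alternative", ["text/plain", html_part])
    elif has_plain:
        body = "text/plain"
    else:
        body = html_part

    if n_regular:
        # multipart/mixed wraps the body section and regular attachments.
        children = [] if body is None else [body]
        return ("multipart/mixed", children + ["attachment"] * n_regular)
    return body
```

With only a plain-text body the result is a direct `text/plain` Content-Type, which corresponds to the non-multipart case.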

- add PR_ATTACH_MIME_TAG and PR_ATTACH_CONTENT_ID defines for MAPI
  properties absent from the shared entry-type enum
- add mime_base64_write_attachment() to stream attachment data as
  base64 with CRLF-terminated 76-character lines
- add maildir_strip_mime_headers() to remove Content-Type,
  Content-Transfer-Encoding, and MIME-Version from original headers
- retrieve plain-text and HTML bodies independently, removing the
  plain-text-or-HTML fallback
- enumerate attachments to collect content-id, MIME type, and filename,
  classifying each as inline or regular before writing begins
- replace flat body write with a MIME structure decision tree that
  selects multipart/mixed, multipart/related, multipart/alternative, or
  a direct Content-Type based on available content
- move file close, dedup table insert, and success log after all parts
  are written; extend on_error cleanup to cover attachment metadata

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

Documents the -f maildir option added in the Maildir export
commits: RFC 2822 output layout, cross-run Message-ID
deduplication via .seen_message_ids, and the folder
skip/passthrough/rename rule table.

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>

libpff exports one Contact.txt per item, leading to duplicate records
when processing large mailbox archives. This script reads a list of
Contact.txt paths from stdin, merges duplicates keyed on primary email
or display name, and emits a single vCard 3.0 .vcf file to stdout.

- parse Contact.txt fields into a structured dict per file
- filter Exchange internal addresses (/o=) from email fields
- key contacts on first real email, falling back to display name
- accumulate phones and emails when merging duplicate keys
- map phone label substrings to vCard TEL type tokens
- emit vCard 3.0 blocks with proper value escaping
- report input/error/skip/merge/unique tallies to stderr
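
The keying, merging, and escaping steps might look roughly like the sketch below. It is not the code of scripts/contact-to-vcf.py: it starts from already-parsed dicts because the Contact.txt layout is not shown here, the dict fields (`name`, `emails`, `phones`) and function names are hypothetical, and case-insensitive keys are an assumption. The emitted block is simplified; a complete vCard 3.0 entry also carries an N property and TEL type tokens.

```python
def vcard_escape(value):
    """Escape backslash, semicolon, comma, and newlines for vCard 3.0 text."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\r\n", "\\n")
                 .replace("\n", "\\n"))


def is_exchange_address(email):
    # Exchange internal addresses look like /o=Org/ou=.../cn=...
    return email.lower().startswith("/o=")


def contact_key(contact):
    """First real email, else display name; None if neither is present."""
    for email in contact.get("emails", []):
        if not is_exchange_address(email):
            return ("email", email.lower())
    name = contact.get("name", "").strip()
    return ("name", name.lower()) if name else None


def merge_contacts(contacts):
    merged = {}
    for contact in contacts:
        key = contact_key(contact)
        if key is None:
            continue  # nothing to key on; skipped
        entry = merged.setdefault(
            key, {"name": contact.get("name", ""), "emails": [], "phones": []})
        for email in contact.get("emails", []):
            if not is_exchange_address(email) and email not in entry["emails"]:
                entry["emails"].append(email)
        for phone in contact.get("phones", []):
            if phone not in entry["phones"]:
                entry["phones"].append(phone)
    return list(merged.values())


def to_vcard(contact):
    lines = ["BEGIN:VCARD", "VERSION:3.0", "FN:" + vcard_escape(contact["name"])]
    lines += ["EMAIL;TYPE=INTERNET:" + vcard_escape(e) for e in contact["emails"]]
    lines += ["TEL:" + vcard_escape(p) for p in contact["phones"]]
    lines.append("END:VCARD")
    return "\r\n".join(lines) + "\r\n"
```

Keying on the first non-Exchange address means two records for the same person merge even when their display names differ, and the first record seen supplies the name.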

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
