pfftools: add Maildir export format with MIME synthesis and contact deduplication #158
Open
KJ7LNW wants to merge 4 commits into
added 4 commits
May 3, 2026 14:32
Introduces a new -f maildir export mode that writes each email as an RFC 2822 message into a Maildir tree (cur/, new/, tmp/ per folder), producing output directly consumable by standard mail user agents.

To prevent duplicate messages when exporting multiple overlapping PST or OST archives into the same Maildir tree, a djb2 hash table tracks seen Message-ID values. The table is persisted to .seen_message_ids in the export root and reloaded on subsequent runs, enabling cross-file deduplication without holding all IDs in memory between invocations.

Maildir mode also applies a rule table to PST/OST internal folder names, skipping synthetic containers (Common Views, Finder, NON_IPM_SUBTREE, etc.), passing through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renaming others (Root - Public -> Public Folders). Non-email item types are silently skipped so only RFC 2822-representable items appear in the output.

- add EXPORT_FORMAT_MAILDIR enum value and "maildir" input recognition
- add seen_message_ids_table_t hash table with load/save to .seen_message_ids
- add export_handle_initialize_maildir to build dedup path and load prior state
- add export_handle_export_email_maildir writing Maildir filenames to cur/
- add maildir_folder_rules table with skip/passthrough/rename actions
- create cur/, new/, tmp/ subdirectories per exported folder
- allow appending to existing export path in Maildir mode
- wire initialization and directory-exists bypass in pffexport main

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
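The cross-run deduplication described in this commit can be illustrated with a minimal Python sketch. It is not the PR's C implementation: the class name `SeenMessageIds`, the bucket count, and the one-ID-per-line file layout are assumptions made for illustration, and only the djb2 hash itself (start at 5381, multiply by 33, add each byte) is the standard algorithm.

```python
import os


def djb2(text: str) -> int:
    """Classic djb2 string hash (h = h * 33 + byte), truncated to 32 bits."""
    h = 5381
    for byte in text.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


class SeenMessageIds:
    """Hypothetical stand-in for the PR's seen_message_ids_table_t."""

    def __init__(self, path: str, bucket_count: int = 1024):
        self.path = path
        self.buckets = [[] for _ in range(bucket_count)]
        # Reload IDs recorded by earlier runs (assumed format: one ID per line).
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    message_id = line.strip()
                    if message_id:
                        self.add(message_id)

    def add(self, message_id: str) -> bool:
        """Record an ID; return True if it is new, False if already seen."""
        bucket = self.buckets[djb2(message_id) % len(self.buckets)]
        if message_id in bucket:
            return False
        bucket.append(message_id)
        return True

    def save(self) -> None:
        """Persist every recorded ID so the next run can skip duplicates."""
        with open(self.path, "w", encoding="utf-8") as handle:
            for bucket in self.buckets:
                for message_id in bucket:
                    handle.write(message_id + "\n")
```

In this sketch an exporter would call `add()` with each Message-ID before writing a message, skip the message when it returns False, and call `save()` once at the end of the run.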
The Maildir exporter previously wrote a single body part with no MIME envelope, choosing plain-text or HTML by fallback rather than capturing both. Attachments were not included in the output at all, and the original transport headers were written verbatim, leaving conflicting Content-Type and MIME-Version fields.

This rework makes the exporter produce structurally valid RFC 2822 messages. Both plain-text and HTML bodies are retrieved independently. Attachments are enumerated and their Content-ID, MIME type, and filename are read to classify each as inline or regular. The correct multipart structure is then synthesised from what is actually present: multipart/mixed wraps the body section and regular attachments, multipart/related wraps an HTML body with its inline attachments, and multipart/alternative wraps both body types when no attachments are present. Original transport headers have their MIME envelope lines stripped before writing so the synthesised headers are authoritative.

- add PR_ATTACH_MIME_TAG and PR_ATTACH_CONTENT_ID defines for MAPI properties absent from the shared entry-type enum
- add mime_base64_write_attachment() to stream attachment data as base64 with CRLF-terminated 76-character lines
- add maildir_strip_mime_headers() to remove Content-Type, Content-Transfer-Encoding, and MIME-Version from original headers
- retrieve plain-text and HTML bodies independently, removing the plain-text-or-HTML fallback
- enumerate attachments to collect content-id, MIME type, and filename, classifying each as inline or regular before writing begins
- replace flat body write with a MIME structure decision tree that selects multipart/mixed, multipart/related, multipart/alternative, or a direct Content-Type based on available content
- move file close, dedup table insert, and success log after all parts are written; extend on_error cleanup to cover attachment metadata

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
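The MIME structure decision tree in this commit can be pictured with a small Python sketch. This is an illustration only, not the PR's C code: the function name `choose_mime_structure`, the tuple representation, and the placeholder leaf names are hypothetical. The nesting used when bodies and attachments are combined, and the choice to treat inline attachments as regular when there is no HTML body, are assumptions the commit message does not spell out.

```python
def choose_mime_structure(has_plain, has_html, inline_count, regular_count):
    """Return a nested (content_type, children) tuple describing the layout.

    Leaves use placeholder names ("inline-attachment", "attachment") rather
    than real MIME types, since those depend on each attachment.
    Returns None when the message has no body and no attachments.
    """
    # Assumption: without an HTML body there is nothing for inline parts to
    # relate to, so they are counted as regular attachments.
    if not has_html:
        regular_count += inline_count
        inline_count = 0

    plain_part = ("text/plain", []) if has_plain else None

    html_part = None
    if has_html:
        html_part = ("text/html", [])
        if inline_count:
            # multipart/related wraps the HTML body with its inline parts.
            inline_parts = [("inline-attachment", []) for _ in range(inline_count)]
            html_part = ("multipart/related", [html_part] + inline_parts)

    if plain_part and html_part:
        # multipart/alternative offers both renderings of the same body.
        body = ("multipart/alternative", [plain_part, html_part])
    else:
        body = plain_part or html_part

    if regular_count:
        # multipart/mixed wraps the body section and regular attachments.
        children = [body] if body else []
        children += [("attachment", []) for _ in range(regular_count)]
        return ("multipart/mixed", children)
    return body
```

A message with only a plain-text body falls through to a direct Content-Type, matching the last branch named in the commit's bullet list.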
Documents the -f maildir option added in the Maildir export commits: RFC 2822 output layout, cross-run Message-ID deduplication via .seen_message_ids, and the folder skip/passthrough/rename rule table.

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
libpff exports one Contact.txt per item, leading to duplicate records when processing large mailbox archives. This script reads a list of Contact.txt paths from stdin, merges duplicates keyed on primary email or display name, and emits a single vCard 3.0 .vcf file to stdout.

- parse Contact.txt fields into a structured dict per file
- filter Exchange internal addresses (/o=) from email fields
- key contacts on first real email, falling back to display name
- accumulate phones and emails when merging duplicate keys
- map phone label substrings to vCard TEL type tokens
- emit vCard 3.0 blocks with proper value escaping
- report input/error/skip/merge/unique tallies to stderr

Signed-off-by: Eric Wheeler <git-default@z.ewheeler.org>
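The keying and merging steps listed above can be sketched as follows. This is a rough illustration rather than the script's actual code: the dict field names (`name`, `emails`, `phones`), the function names, the case-insensitive key, and the prefix test for `/o=` addresses are all assumptions. Parsing of Contact.txt itself is left out because its layout is not shown here.

```python
def is_exchange_internal(address: str) -> bool:
    """Assumed test for Exchange internal addresses such as '/o=Org/cn=user'."""
    return address.lower().startswith("/o=")


def contact_key(contact: dict):
    """Key on the first real email, falling back to the display name."""
    real = [e for e in contact.get("emails", []) if not is_exchange_internal(e)]
    if real:
        return ("email", real[0].lower())
    name = contact.get("name", "").strip()
    return ("name", name.lower()) if name else None


def merge_contacts(contacts: list) -> list:
    """Merge contacts sharing a key, accumulating emails and phones."""
    merged = {}
    for contact in contacts:
        key = contact_key(contact)
        if key is None:
            continue  # nothing to key on, so the record is skipped
        entry = merged.setdefault(
            key, {"name": contact.get("name", ""), "emails": [], "phones": []}
        )
        for email in contact.get("emails", []):
            if not is_exchange_internal(email) and email not in entry["emails"]:
                entry["emails"].append(email)
        for phone in contact.get("phones", []):
            if phone not in entry["phones"]:
                entry["phones"].append(phone)
    # Dicts preserve insertion order, so output follows first appearance.
    return list(merged.values())
```

Keeping the first-seen display name on a merged record is one possible policy; the script may resolve conflicting names differently.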
Description
libpff's pffexport tool previously supported only flat directory exports with no standard mail layout. This adds a -f maildir export mode producing RFC 2822 messages in a standard Maildir tree, consumable directly by mail user agents.
Type of Change
Implementation Details
- The Maildir exporter writes each email as a structurally valid RFC 2822 message into cur/, new/, and tmp/ subdirectories per PST folder. A folder rule table skips synthetic containers (Common Views, Finder, NON_IPM_SUBTREE), passes through transparent wrappers (Root - Mailbox, IPM_SUBTREE) without creating a directory level, and renames others (Root - Public -> Public Folders).
- A djb2 hash table tracks seen Message-ID values, persisted to .seen_message_ids in the export root and reloaded on subsequent runs, enabling deduplication across multiple PST/OST archives without holding all IDs in memory.
- Plain-text and HTML bodies are retrieved independently and attachments are classified as inline or regular; the appropriate multipart/mixed, multipart/related, or multipart/alternative structure is selected from what is actually present. Original transport headers have MIME envelope lines stripped before writing so the synthesized headers are authoritative.
- scripts/contact-to-vcf.py: reads a list of Contact.txt paths (as exported by libpff) from stdin, merges duplicates keyed on primary email or display name, and emits a vCard 3.0 .vcf file to stdout.
- Documentation covers -f maildir, deduplication behavior, and the folder rule table.
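As an illustration of the folder rule table, the following Python sketch maps a PST folder path onto a Maildir directory. It is a sketch under stated assumptions, not the PR's maildir_folder_rules table in C: the names `FOLDER_RULES`, `map_folder_path`, and `make_maildir` are hypothetical, only the folder names listed above are included, and treating a skipped folder as dropping its whole subtree is an assumption.

```python
import os

SKIP, PASSTHROUGH, RENAME = "skip", "passthrough", "rename"

# Rules taken from the description; the real table may contain more entries.
FOLDER_RULES = {
    "Common Views": (SKIP, None),
    "Finder": (SKIP, None),
    "NON_IPM_SUBTREE": (SKIP, None),
    "Root - Mailbox": (PASSTHROUGH, None),
    "IPM_SUBTREE": (PASSTHROUGH, None),
    "Root - Public": (RENAME, "Public Folders"),
}


def map_folder_path(components):
    """Translate PST folder names into Maildir path components.

    Returns None when any component is a skipped container.
    """
    mapped = []
    for name in components:
        action, target = FOLDER_RULES.get(name, (None, None))
        if action == SKIP:
            return None  # assumption: the whole subtree is dropped
        if action == PASSTHROUGH:
            continue  # transparent wrapper: no directory level is created
        mapped.append(target if action == RENAME else name)
    return mapped


def make_maildir(root, components):
    """Create cur/, new/, tmp/ for the mapped folder; return its path or None."""
    mapped = map_folder_path(components)
    if mapped is None:
        return None
    path = os.path.join(root, *mapped)
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(path, sub), exist_ok=True)
    return path
```

Using `exist_ok=True` mirrors the ability to append to an existing export path in Maildir mode, so repeated runs over overlapping archives reuse the same directories.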