undoc

A high-performance Rust library for extracting content from Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown, plain text, and JSON.

Features

Multi-format support: DOCX (Word), XLSX (Excel), PPTX (PowerPoint)
Multiple output formats: Markdown, Plain Text, JSON (with full metadata)
Structure preservation: Headings, lists, tables, inline formatting
Smart heading detection: Style-based heading recognition (English/Korean)
Table cell alignment: Proper left/center/right alignment in Markdown
PPTX table extraction: Full table parsing from PowerPoint slides
CJK text support: Smart spacing for Korean, Chinese, Japanese content
Asset extraction: Images, charts, and embedded media with resolved paths (XLSX drawings included)
Rich content: Footnotes/endnotes, headers/footers, text boxes, cell comments, hyperlinks
Section markers: `<!-- slide N: Name -->` / `<!-- sheet N: Name -->` boundary markers for PPTX/XLSX
Text cleanup: Multiple presets for LLM training data preparation
Self-update: Built-in update mechanism via GitHub releases
C-ABI FFI: Native library for C#, Python, and other languages
Parallel processing: Uses Rayon for multi-section documents
Streaming pipeline (0.3.0+): parse_file_streaming yields sections as they are parsed, keeping peak memory bounded regardless of document size (PPTX and XLSX; DOCX is not yet supported)
WebAssembly (0.4.0+): @iyulab/undoc npm package — parse Office documents in the browser, no server required

WASM / Browser

Parse Office documents in the browser — no server, no upload:

import init, { parse } from '@iyulab/undoc';
await init();
const doc = parse(new Uint8Array(await file.arrayBuffer()));
console.log(doc.format());      // "docx" | "xlsx" | "pptx"
console.log(doc.toMarkdown());

Install: npm install @iyulab/undoc

Live Playground →

Installation

Pre-built Binaries (Recommended)

Download the latest release from GitHub Releases.

Windows (x64)

# Download and extract (VERSION is resolved from the latest release tag, e.g. v0.3.1)
$VERSION = (Invoke-RestMethod "https://api.github.com/repos/iyulab/undoc/releases/latest").tag_name
Invoke-WebRequest -Uri "https://github.com/iyulab/undoc/releases/latest/download/undoc-windows-x86_64-${VERSION}.zip" -OutFile "undoc.zip"
Expand-Archive -Path "undoc.zip" -DestinationPath "."

# Move to a directory in PATH (optional)
Move-Item -Path "undoc.exe" -Destination "$env:LOCALAPPDATA\Microsoft\WindowsApps\"

# Verify installation
undoc version

Linux (x64)

# Download and extract (VERSION is resolved from the latest release tag, e.g. v0.3.1)
VERSION=$(curl -s "https://api.github.com/repos/iyulab/undoc/releases/latest" | grep '"tag_name"' | cut -d'"' -f4)
curl -LO "https://github.com/iyulab/undoc/releases/latest/download/undoc-linux-x86_64-${VERSION}.tar.gz"
tar -xzf "undoc-linux-x86_64-${VERSION}.tar.gz"

# Install to /usr/local/bin (requires sudo)
sudo mv undoc /usr/local/bin/

# Or install to user directory
mkdir -p ~/.local/bin
mv undoc ~/.local/bin/
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Verify installation
undoc version

macOS

# Resolve the latest release tag (e.g. v0.3.1); needed for either architecture
VERSION=$(curl -s "https://api.github.com/repos/iyulab/undoc/releases/latest" | grep '"tag_name"' | cut -d'"' -f4)

# Intel Mac (run this pair OR the Apple Silicon pair, not both)
curl -LO "https://github.com/iyulab/undoc/releases/latest/download/undoc-macos-x86_64-${VERSION}.tar.gz"
tar -xzf "undoc-macos-x86_64-${VERSION}.tar.gz"

# Apple Silicon (M1/M2/M3/M4)
curl -LO "https://github.com/iyulab/undoc/releases/latest/download/undoc-macos-aarch64-${VERSION}.tar.gz"
tar -xzf "undoc-macos-aarch64-${VERSION}.tar.gz"

# Install
sudo mv undoc /usr/local/bin/

# Verify
undoc version

Available Binaries

Platform	Architecture	File
Windows	x64	`undoc-windows-x86_64-{version}.zip`
Linux	x64	`undoc-linux-x86_64-{version}.tar.gz`
macOS	Intel	`undoc-macos-x86_64-{version}.tar.gz`
macOS	Apple Silicon	`undoc-macos-aarch64-{version}.tar.gz`

Updating

undoc includes a built-in self-update mechanism:

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall (even if on latest)
undoc update --force

Install via Cargo

If you have Rust installed:

# Install CLI
cargo install undoc-cli

# Add library to your project
cargo add undoc

CLI Usage

Quick Start

# Convert to Markdown + extract media (default)
undoc document.docx

# Specify output directory
undoc document.docx ./output

# With text cleanup for LLM training
undoc document.docx --cleanup aggressive

Output Structure

document_output/
├── extract.md      # Markdown output
└── media/          # Extracted images and media
    └── image1.jpeg

Use undoc convert <file> --all to produce all three formats at once:

document_output/
├── extract.md      # Markdown output
├── extract.txt     # Plain text output
├── content.json    # Full structured JSON
└── media/

Commands

undoc <file> [output]              # Convert to Markdown + extract media (default)
undoc convert <file> [OPTIONS]     # Convert with full format/streaming control
undoc markdown <file> [OPTIONS]    # Convert to Markdown only (alias: md)
undoc text <file> [OPTIONS]        # Convert to plain text only
undoc json <file> [OPTIONS]        # Convert to JSON only
undoc info <file>                  # Show document information
undoc extract <file> [OPTIONS]     # Extract resources only
undoc update [OPTIONS]             # Self-update to latest version
undoc version                      # Show version information

Convert (multi-format streaming pipeline)

# Markdown only (default)
undoc convert document.docx

# All formats + media
undoc convert document.docx --all -o ./output

# Specific formats
undoc convert document.docx --formats md,txt,json

# Skip media extraction
undoc convert document.docx --no-images

# Lossless mode (page breaks + headers/footers)
undoc convert document.docx --lossless

# Insert section boundary markers for PPTX/XLSX
undoc convert presentation.pptx --section-markers

Convert Options

Option	Description	Default
`-o, --output`	Output directory	`<stem>_output/`
`--formats`	Comma-separated formats: `md,txt,json`	`md`
`--all`	Output all formats (MD + TXT + JSON)	false
`--no-images`	Skip media extraction	false
`--cleanup`	Text cleanup: `minimal`, `standard`, `aggressive`	none
`--section-markers`	Insert `<!-- slide/sheet N: Name -->` markers (PPTX/XLSX)	false
`--emit-page-breaks`	Emit `---` for hard page breaks	false
`--include-headers-footers`	Include section headers/footers as blockquotes	false
`--lossless`	Enable both `--emit-page-breaks` and `--include-headers-footers`	false
`-q, --quiet`	Suppress progress output	false
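
The `<stem>_output/` default in the table above can be reproduced in scripts that need to predict where output will land. The following is a minimal sketch using only the standard library. `default_output_dir` is a hypothetical helper, not part of the undoc API, and it assumes the directory is created next to the input file, which the table does not state.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not part of undoc): derive `<stem>_output`
/// for a given input path.
fn default_output_dir(input: &Path) -> PathBuf {
    // file_stem() drops the final extension: "document.docx" -> "document".
    let stem = input
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_else(|| "document".to_string());
    // Assumption: the directory sits beside the input file.
    input.with_file_name(format!("{}_output", stem))
}
```

Calling it with `Path::new("reports/q1.docx")` yields `reports/q1_output`; pass `-o` explicitly whenever the exact location matters.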

Convert to Markdown

# Basic conversion (output to stdout)
undoc markdown document.docx

# Save to file
undoc markdown document.docx -o output.md

# With YAML frontmatter
undoc markdown document.docx --frontmatter -o output.md

# With text cleanup for LLM training
undoc markdown document.docx --cleanup standard -o cleaned.md

# Table rendering options
undoc markdown spreadsheet.xlsx --table-mode html -o output.md

# Limit heading depth
undoc markdown document.docx --max-heading 3 -o output.md

# Insert section boundary markers for PPTX/XLSX
undoc markdown presentation.pptx --section-markers -o slides.md

# Lossless mode (preserve page breaks and headers/footers)
undoc markdown document.docx --lossless -o output.md

Markdown Options

Option	Description	Default
`-o, --output`	Output file path	stdout
`-f, --frontmatter`	Include YAML frontmatter	false
`--table-mode`	Table rendering: `markdown`, `html`, `ascii`	markdown
`--cleanup`	Text cleanup: `minimal`, `standard`, `aggressive`	none
`--max-heading`	Maximum heading level (1-6)	4
`--section-markers`	Insert `<!-- slide/sheet N: Name -->` markers (PPTX/XLSX)	false
`--emit-page-breaks`	Emit `---` for hard page breaks	false
`--include-headers-footers`	Include section headers/footers as blockquotes	false
`--lossless`	Enable both `--emit-page-breaks` and `--include-headers-footers`	false

Convert to Plain Text

# Basic extraction
undoc text document.docx

# With cleanup
undoc text document.docx --cleanup standard -o output.txt

Convert to JSON

# Pretty-printed JSON
undoc json document.docx -o output.json

# Compact JSON
undoc json document.docx --compact -o output.json

Show Document Information

undoc info document.docx

Output:

Document Information
────────────────────────────────────────
File: document.docx
Format: Docx
Sections: 5
Resources: 3
Title: My Document
Author: John Doe
Pages/Slides/Sheets: 10
Created: 2025-01-15T10:30:00Z
Modified: 2025-01-20T14:45:00Z

Content Statistics
────────────────────────────────────────
Words: 2500
Characters: 15000

Extract Resources

# Extract to current directory
undoc extract presentation.pptx

# Extract to specific directory
undoc extract presentation.pptx -o ./media

Self-Update

# Check for updates
undoc update --check

# Update to latest version
undoc update

# Force reinstall
undoc update --force

Examples

# Convert Word document to Markdown with frontmatter
undoc md report.docx --frontmatter -o report.md

# Convert Excel to Markdown tables
undoc md data.xlsx -o tables.md

# Convert PowerPoint to Markdown with slide markers
undoc md presentation.pptx --section-markers -o slides.md

# Extract all images from a document
undoc extract report.docx -o ./images

# Get document metadata
undoc info document.docx

# Convert with aggressive cleanup for AI training
undoc md document.docx --cleanup aggressive -o cleaned.md

# Batch conversion (shell)
for f in *.docx; do undoc md "$f" -o "${f%.docx}.md"; done

# Batch conversion (PowerShell)
Get-ChildItem *.docx | ForEach-Object { undoc md $_.FullName -o "$($_.BaseName).md" }

Rust Library Usage

Quick Start

use undoc::{parse_file, render};

fn main() -> undoc::Result<()> {
    // Parse document
    let doc = parse_file("document.docx")?;

    // Convert to Markdown
    let options = render::RenderOptions::default();
    let markdown = render::to_markdown(&doc, &options)?;
    println!("{}", markdown);

    // Get plain text
    let text = render::to_text(&doc, &options)?;

    // Get JSON
    let json = render::to_json(&doc, render::JsonFormat::Pretty)?;

    Ok(())
}

Convenience Functions

// One-shot conversions without building a Document first
let text     = undoc::extract_text("document.docx")?;
let markdown = undoc::to_markdown("document.docx")?;
let json     = undoc::to_json("document.docx", undoc::render::JsonFormat::Pretty)?;

// Parse from bytes
let doc = undoc::parse_bytes(&file_bytes)?;

Render Options

use undoc::render::{self, RenderOptions, CleanupPreset, TableFallback};
use undoc::SectionMarkerStyle;

let options = RenderOptions::new()
    .with_frontmatter(true)
    .with_table_fallback(TableFallback::Html)
    .with_cleanup_preset(CleanupPreset::Aggressive)
    .with_max_heading(3)
    .with_section_markers(SectionMarkerStyle::Comment); // <!-- slide N: Name -->

let markdown = render::to_markdown(&doc, &options)?;

// Lossless mode: preserve page breaks and headers/footers
let lossless = RenderOptions::lossless();
let markdown = render::to_markdown(&doc, &lossless)?;
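
Once section markers are enabled, downstream code can split the rendered Markdown into per-slide or per-sheet chunks. The sketch below uses only the standard library and assumes each marker sits on its own line in the form `<!-- slide N: Name -->` or `<!-- sheet N: Name -->`. The exact output layout is an assumption, and `Chunk`, `parse_marker`, and `split_by_markers` are illustrative names, not undoc APIs.

```rust
/// One chunk of rendered Markdown, delimited by a section marker.
#[derive(Debug, PartialEq)]
struct Chunk {
    kind: String, // "slide" or "sheet"
    index: u32,   // N in the marker
    name: String, // Name in the marker
    body: String, // Markdown between this marker and the next
}

/// Parse a line such as `<!-- slide 2: Agenda -->` into (kind, index, name).
fn parse_marker(line: &str) -> Option<(String, u32, String)> {
    let inner = line.trim().strip_prefix("<!--")?.strip_suffix("-->")?.trim();
    // Split at the first colon only, so names may themselves contain colons.
    let (head, name) = inner.split_once(':')?;
    let (kind, index) = head.trim().split_once(' ')?;
    if kind != "slide" && kind != "sheet" {
        return None;
    }
    Some((kind.to_string(), index.trim().parse().ok()?, name.trim().to_string()))
}

/// Split Markdown into chunks; text before the first marker is ignored.
fn split_by_markers(markdown: &str) -> Vec<Chunk> {
    let mut chunks: Vec<Chunk> = Vec::new();
    for line in markdown.lines() {
        if let Some((kind, index, name)) = parse_marker(line) {
            chunks.push(Chunk { kind, index, name, body: String::new() });
        } else if let Some(current) = chunks.last_mut() {
            current.body.push_str(line);
            current.body.push('\n');
        }
    }
    chunks
}
```

This kind of splitting is handy for feeding one slide or sheet at a time into an indexing or embedding step.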

Working with Document Structure

use undoc::parse_file;

let doc = parse_file("document.docx")?;

// Access metadata
println!("Title: {:?}", doc.metadata.title);
println!("Author: {:?}", doc.metadata.author);
println!("Created: {:?}", doc.metadata.created);

// Iterate sections
for section in &doc.sections {
    println!("Section: {:?}", section.name);
    for element in &section.elements {
        // Process paragraphs, tables, etc.
    }
}

// Extract resources
for (id, resource) in &doc.resources {
    let filename = resource.suggested_filename(id);
    std::fs::write(&filename, &resource.data)?;
}

Streaming Pipeline

Supported for PPTX (per slide) and XLSX (per sheet). DOCX is not yet supported.

use std::ops::ControlFlow;
use undoc::{parse_file_streaming, ParseEvent, SectionStreamOptions};

parse_file_streaming("slides.pptx", SectionStreamOptions::default(), |event| {
    match event {
        ParseEvent::DocumentStart { metadata, section_count, .. } => {
            println!("{} sections", section_count);
        }
        ParseEvent::SectionParsed(section) => {
            // section memory is freed after this callback returns
            println!("Slide: {:?}", section.name);
        }
        ParseEvent::ResourceExtracted { name, data } => {
            std::fs::write(&name, &data).ok();
        }
        _ => {}
    }
    ControlFlow::Continue(())
})?;
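
The callback above always returns `ControlFlow::Continue(())`. The standard `ControlFlow` type also has a `Break` variant, which is the usual way for a callback to ask a producer to stop early. The sketch below shows that pattern with a stand-in producer built on the standard library only. `stream_items` and `names_until` are hypothetical and do not call undoc; whether `parse_file_streaming` accepts `ControlFlow::Break(())` in exactly this form is an assumption to check against the crate documentation.

```rust
use std::ops::ControlFlow;

/// Stand-in for a streaming parser: hands items to `on_item` one at a time
/// and stops as soon as the callback returns `ControlFlow::Break`.
/// Returns how many items were delivered.
fn stream_items<T, F>(items: impl IntoIterator<Item = T>, mut on_item: F) -> usize
where
    F: FnMut(T) -> ControlFlow<()>,
{
    let mut delivered = 0;
    for item in items {
        delivered += 1;
        // The item is moved into the callback, so the producer keeps no copy
        // of it after the call returns.
        if on_item(item).is_break() {
            break;
        }
    }
    delivered
}

/// Collect slide names up to and including the one called `stop_at`.
fn names_until(slides: Vec<String>, stop_at: &str) -> Vec<String> {
    let mut seen = Vec::new();
    stream_items(slides, |name| {
        let done = name == stop_at;
        seen.push(name);
        if done {
            ControlFlow::Break(())
        } else {
            ControlFlow::Continue(())
        }
    });
    seen
}
```

Because each item is handed over by value, the producer never accumulates sections itself, which is the property that keeps peak memory bounded in a streaming design.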

Format Detection

use undoc::{detect_format_from_path, detect_format_from_bytes, FormatType};

// From file path
let format = detect_format_from_path("document.docx")?;
assert_eq!(format, FormatType::Docx);

// From bytes
let data = std::fs::read("document.docx")?;
let format = detect_format_from_bytes(&data)?;
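
For intuition about what byte-based detection has to work with: DOCX, XLSX, and PPTX files are ZIP packages, so they begin with the ZIP local-file-header signature `PK\x03\x04`, and entry names are stored uncompressed inside the archive. The sketch below is a rough standard-library heuristic built on those two facts. It is not undoc's implementation; `Format`, `format_from_extension`, and `format_from_bytes` are illustrative names, and the part names it searches for are the conventional ones written by Office rather than a guarantee.

```rust
/// Local stand-in for a format enum; not undoc's `FormatType`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Format {
    Docx,
    Xlsx,
    Pptx,
}

/// Guess the format from the file extension (case-insensitive).
fn format_from_extension(path: &str) -> Option<Format> {
    let ext = std::path::Path::new(path)
        .extension()?
        .to_str()?
        .to_ascii_lowercase();
    match ext.as_str() {
        "docx" => Some(Format::Docx),
        "xlsx" => Some(Format::Xlsx),
        "pptx" => Some(Format::Pptx),
        _ => None,
    }
}

/// True if `needle` occurs anywhere in `haystack` (needle must be non-empty).
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Heuristic guess from raw bytes: require the ZIP signature, then look for
/// the conventional main-part name of each format.
fn format_from_bytes(data: &[u8]) -> Option<Format> {
    if !data.starts_with(b"PK\x03\x04") {
        return None; // not a ZIP container
    }
    if contains(data, b"word/document.xml") {
        Some(Format::Docx)
    } else if contains(data, b"xl/workbook.xml") {
        Some(Format::Xlsx)
    } else if contains(data, b"ppt/presentation.xml") {
        Some(Format::Pptx)
    } else {
        None
    }
}
```

A robust detector would open the archive and read `[Content_Types].xml` instead of scanning raw bytes; the scan is only meant to show why the extension alone is not needed.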

C# / .NET Integration

undoc provides C-ABI compatible bindings for integration with C# and .NET applications.

Getting the Native Library

Download from GitHub Releases:

Platform	Library File
Windows x64	`undoc.dll`
Linux x64	`libundoc.so`
macOS	`libundoc.dylib`

Or build from source:

cargo build --release --features ffi

C# Wrapper Usage

using Iyulab.Undoc;

// Parse and convert to Markdown
using var doc = UndocDocument.FromFile("document.docx");
string markdown = doc.ToMarkdown(MarkdownFlags.Frontmatter);
Console.WriteLine(markdown);

// Get document metadata
Console.WriteLine($"Title: {doc.Title}");
Console.WriteLine($"Author: {doc.Author}");
Console.WriteLine($"Sections: {doc.SectionCount}");
Console.WriteLine($"Resources: {doc.ResourceCount}");

// Convert to other formats
string text = doc.ToText();
string json = doc.ToJson(JsonFormat.Pretty);

// Work with resources
string resourceIds = doc.GetResourceIds(); // JSON array: ["rId1", "rId2"]
string info = doc.GetResourceInfo("rId1"); // JSON metadata
byte[] imageData = doc.GetResourceData("rId1"); // Binary data

See bindings/csharp/Undoc.cs for the complete wrapper implementation.

Output Formats

Markdown

Structured Markdown with preserved formatting:

Headings: Document headings → #, ##, ###
Lists: Ordered and unordered with nesting
Tables: Markdown tables (with HTML/ASCII fallback for complex layouts)
Inline styles: Bold (**), italic (*), underline (<u>), strikethrough, superscript/subscript
Hyperlinks: Preserved as Markdown links (DOCX, XLSX, PPTX)
Images: Linked image references from document drawings
Footnotes: Markdown reference-style ([^N] / [^N]: text)
Headers/Footers: Rendered as blockquotes (opt-in via --include-headers-footers)
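
Cell alignment in Markdown tables is carried by the delimiter row of a pipe table: `:---` for left, `:---:` for center, and `---:` for right. The sketch below renders a small table that way using only the standard library. `Align` and `render_table` are illustrative names and this is not undoc's renderer; it does not escape `|` inside cells.

```rust
/// Column alignment, as expressed in a pipe-table delimiter row.
#[derive(Clone, Copy)]
enum Align {
    Left,
    Center,
    Right,
}

/// Join cells into one pipe-delimited table line.
fn table_line(cells: &[&str]) -> String {
    format!("| {} |\n", cells.join(" | "))
}

/// Render a pipe table whose delimiter row encodes per-column alignment.
fn render_table(header: &[&str], align: &[Align], rows: &[Vec<&str>]) -> String {
    let delims: Vec<&str> = align
        .iter()
        .map(|a| match a {
            Align::Left => ":---",
            Align::Center => ":---:",
            Align::Right => "---:",
        })
        .collect();
    let mut out = table_line(header);
    out.push_str(&table_line(&delims));
    for row in rows {
        out.push_str(&table_line(row));
    }
    out
}
```

Tables with merged cells or nested content cannot be expressed this way, which is where an HTML or ASCII fallback becomes useful.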

Plain Text

Pure text content without formatting markers.

JSON

Complete document structure with metadata:

{
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "created": "2025-01-15T10:30:00Z",
    "modified": "2025-01-20T14:45:00Z"
  },
  "sections": [...],
  "resources": [...]
}

Supported Formats

Format	Extension	Status
Word	.docx	Supported
Excel	.xlsx	Supported
PowerPoint	.pptx	Supported

Feature Flags

Feature	Description	Default
`ffi`	C-ABI foreign function interface	No

# Cargo.toml - enable FFI
[dependencies]
undoc = { version = "0.3", features = ["ffi"] }

Performance

Parallel section/sheet/slide processing with Rayon
Efficient XML parsing with quick-xml
Memory-efficient handling of large documents
Streaming pipeline for bounded peak memory on large files

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Related Projects

unhwp - Korean HWP document extraction
unpdf - PDF document extraction
