MarkItDown

MarkItDown is a Python library from Microsoft that converts a range of file formats (Office documents, PDFs, images, audio) to Markdown. It is well suited to preparing documents for LLMs and to building documentation pipelines.

Features

Supported Formats

Documents

  • Word: .docx files to markdown
  • PowerPoint: .pptx presentations with slide content
  • Excel: .xlsx spreadsheets to markdown tables
  • PDF: Extract text and structure from PDFs
  • HTML: Convert web pages to clean markdown

Media

  • Images: Extract EXIF metadata and text (OCR) from images
  • Audio: Transcribe speech from audio files such as .wav and .mp3
  • Video: Extract audio and transcribe

Code & Data

  • Jupyter Notebooks: .ipynb to markdown
  • CSV: Convert to markdown tables
  • JSON: Format as readable markdown
  • XML: Convert to structured markdown

Conversion Features

  • Clean Output: Well-formatted, readable markdown
  • Structure Preservation: Maintains headings, lists, tables
  • Image Handling: Embedded images extracted or referenced
  • Table Conversion: Excel/HTML tables to markdown tables
  • Metadata Extraction: Preserve document properties

Installation

pip install markitdown

With Optional Dependencies

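# For all features (quotes keep shells such as zsh from expanding the brackets)
pip install 'markitdown[all]'

# For specific features (extra names as of markitdown 0.1.x; check the project README for your version)
pip install 'markitdown[pdf,docx,pptx]'        # PDF, Word, and PowerPoint
pip install 'markitdown[audio-transcription]'  # Audio transcription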
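
Because the available extras and parts of the API differ between releases, it can help to confirm which version is installed. The following is a minimal sketch using only the standard library; the helper names installed_version and describe_install are illustrative and not part of MarkItDown.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_install(package: str = "markitdown") -> str:
    """Build a one-line, human-readable install status for `package`."""
    version = installed_version(package)
    if version is None:
        return f"{package} is not installed"
    return f"{package} {version} is installed"
```

Calling print(describe_install()) after installation shows which release the examples below will run against.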

Usage

Basic Conversion

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)

Convert Multiple Files

from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()

for file in Path("docs").glob("*.docx"):
    result = md.convert(str(file))  # str() also works on releases that expect a path string
    output = file.with_suffix(".md")
    output.write_text(result.text_content, encoding="utf-8")
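
In a larger batch, one unreadable file should not abort the whole run. The sketch below wraps the loop above with per-file error handling and writes the output to a separate directory. It assumes only that the converter object has a convert() method returning something with a text_content attribute, as in the examples on this page; convert_tree is a hypothetical helper, not a MarkItDown function.

```python
from pathlib import Path


def convert_tree(converter, src_dir, out_dir, pattern="*.docx"):
    """Convert every file matching `pattern` under src_dir.

    Returns (written, failed): the output paths that were written, and
    (source path, error text) pairs for files that raised an exception.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for path in sorted(src_dir.rglob(pattern)):
        try:
            result = converter.convert(str(path))
        except Exception as exc:  # keep going when a single file fails
            failed.append((path, repr(exc)))
            continue
        # Mirror the source folder layout under out_dir
        target = out_dir / path.relative_to(src_dir).with_suffix(".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written, failed
```

With MarkItDown installed, convert_tree(MarkItDown(), "docs", "out") would convert the tree and leave the failures available for logging or a retry.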

With OCR for Images

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("screenshot.png")  # Returns image metadata and any extracted text
markdown_text = result.text_content

Audio Transcription

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("meeting.mp3")  # Transcribes audio
transcript = result.text_content

Use Cases

LLM Processing

  • Convert documents before feeding to LLMs
  • Process entire document libraries
  • Extract text from scanned documents
  • Prepare training data
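
Converted documents are often too long for a single prompt, so a common next step is to split the markdown into sections. The following is a minimal sketch that splits at headings while leaving fenced code untouched; split_by_headings is an illustrative helper, not part of MarkItDown, and it only recognizes "#"-style headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def _flush(sections, title, lines):
    """Append the pending section unless it is completely empty."""
    if title is not None or any(line.strip() for line in lines):
        sections.append((title, "\n".join(lines).strip()))


def split_by_headings(markdown: str, max_level: int = 2):
    """Split markdown into (title, body) pairs at headings up to max_level.

    Text before the first heading gets the title None. Deeper headings
    stay inside the body of their parent section.
    """
    sections, title, lines = [], None, []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            _flush(sections, title, lines)
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    _flush(sections, title, lines)
    return sections
```

Each (title, body) pair can then be sent to an LLM on its own, with the title kept as context for the chunk.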

Documentation

  • Convert Word docs to markdown for version control
  • Migrate documentation to markdown-based systems
  • Generate docs from presentations
  • Archive documents in plain text format

Data Extraction

  • Extract tables from Excel to markdown
  • Process PDF reports
  • Convert forms and surveys
  • Extract content from presentations

Content Management

  • Bulk convert legacy documents
  • Prepare content for static site generators
  • Convert email attachments
  • Process uploaded documents

Output Format

Word Document

# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

- Bullet point 1
- Bullet point 2

## Section 2

| Header 1 | Header 2 |
|----------|----------|
| Cell 1   | Cell 2   |

Excel Spreadsheet

# Sheet: Sales Data

| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget  | 100| 150| 200| 250|
| Gadget  | 75 | 80 | 90 | 100|
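
When the goal is data extraction, a table like the one above can be read back into Python records. This is a minimal sketch for simple pipe tables of this shape; parse_markdown_table is a hypothetical helper, not part of MarkItDown, and it does not handle escaped pipes inside cells.

```python
def parse_markdown_table(text: str):
    """Parse the first pipe table in `text` into a list of dicts keyed by header."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|") and line.endswith("|"):
            # Drop the outer pipes, then split the remaining cells
            rows.append([cell.strip() for cell in line[1:-1].split("|")])
        elif rows:
            break  # first non-table line after the table ends the scan
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, row)) for row in body]
```

All values come back as strings, so numeric columns such as Q1 to Q4 still need an explicit int() or float() conversion.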

PowerPoint

# Slide 1: Title Slide

Presentation Title
Subtitle

---

# Slide 2: Content

- Point 1
- Point 2
- Point 3

Advanced Features

Custom Handlers

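from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult

# Sketch against the markitdown 0.1.x converter interface; older releases differ.
class CustomConverter(DocumentConverter):
    def accepts(self, file_stream, stream_info, **kwargs):
        # Claim only files with the .custom extension
        return (stream_info.extension or "").lower() == ".custom"

    def convert(self, file_stream, stream_info, **kwargs):
        # Custom conversion logic: here, decode the bytes as UTF-8 text
        text = file_stream.read().decode("utf-8", errors="replace")
        return DocumentConverterResult(markdown=text)

md = MarkItDown()
md.register_converter(CustomConverter())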

Batch Processing

from markitdown import MarkItDown
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

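def convert_file(file_path):
    md = MarkItDown()  # one instance per call, since workers run in separate processes
    result = md.convert(str(file_path))
    return result.text_content

if __name__ == "__main__":  # needed where worker processes are spawned (Windows, macOS)
    files = list(Path("documents").glob("**/*.docx"))

    with ProcessPoolExecutor() as executor:
        # map() returns results in the same order as files
        results = list(executor.map(convert_file, files))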

Best Practices

  • Test conversions with sample files first
  • Pass large files as streams rather than loading them fully into memory when possible
  • Cache OCR and transcription results when the same files are processed repeatedly
  • Validate markdown output for important conversions
  • Install only the optional dependencies your formats require
  • Set file size limits to keep memory usage bounded
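
The caching advice above can be sketched with a content hash as the cache key, so that a file is reconverted only when its bytes change. This is one possible approach under stated assumptions, not a MarkItDown feature: cached_convert, the convert_fn parameter, and the .markitdown_cache directory name are all illustrative.

```python
import hashlib
from pathlib import Path


def cached_convert(path, convert_fn, cache_dir=".markitdown_cache"):
    """Return markdown for `path`, reusing a cached copy when the bytes are unchanged.

    convert_fn is any callable that takes a Path and returns markdown text.
    """
    path, cache_dir = Path(path), Path(cache_dir)
    # Key the cache on file contents, so renames hit and edits miss
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert_fn(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

With MarkItDown, convert_fn could be lambda p: md.convert(str(p)).text_content. The sketch reads the whole file to hash it, so very large inputs would be better hashed in chunks.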

Limitations

  • Complex layouts may not convert perfectly
  • Some formatting may be lost (colors, fonts)
  • Large PDFs can be slow to process
  • OCR accuracy depends on image quality
  • Transcription accuracy depends on audio quality

Best For

  • AI/LLM Projects: Preprocessing documents for AI
  • Documentation Teams: Converting legacy docs to markdown
  • Data Scientists: Extracting data from reports
  • Content Managers: Bulk document conversion
  • Archivists: Long-term document preservation
  • Developers: Building document processing pipelines

Integration Examples

With LangChain

from markitdown import MarkItDown
from langchain_text_splitters import MarkdownTextSplitter  # pip install langchain-text-splitters

md = MarkItDown()
result = md.convert("document.pdf")

splitter = MarkdownTextSplitter()
chunks = splitter.split_text(result.text_content)

With FastAPI

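import io
from pathlib import Path

from fastapi import FastAPI, UploadFile
from markitdown import MarkItDown

app = FastAPI()
md = MarkItDown()

@app.post("/convert")
async def convert_document(file: UploadFile):
    content = await file.read()
    # convert_stream() takes a binary stream; the extension hint includes the dot, e.g. ".docx"
    extension = Path(file.filename or "").suffix
    result = md.convert_stream(io.BytesIO(content), file_extension=extension)
    return {"markdown": result.text_content}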
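
An upload endpoint usually needs to reject unexpected file types and oversized files before converting anything. The following is a minimal sketch of such a check; check_upload, the suffix allowlist, and the 20 MiB cap are hypothetical values chosen for illustration and should be adapted to the formats and limits of the actual service.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist and size cap; adjust to the service's needs
ALLOWED_SUFFIXES = {".docx", ".pptx", ".xlsx", ".pdf", ".html", ".csv", ".json", ".xml"}
MAX_UPLOAD_BYTES = 20 * 1024 * 1024


def check_upload(filename, size, allowed=ALLOWED_SUFFIXES, max_bytes=MAX_UPLOAD_BYTES):
    """Return the normalized suffix (for example ".docx") or raise ValueError."""
    suffix = PurePosixPath(filename or "").suffix.lower()
    if suffix not in allowed:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}")
    if size > max_bytes:
        raise ValueError(f"file too large: {size} bytes")
    return suffix
```

In the endpoint above, the check would run right after reading the upload, with a ValueError translated into an HTTP 400 response; the returned suffix can be passed on as the extension hint.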

MarkItDown is a practical tool for anyone working with documents and AI: it converts many common file formats into clean, processable markdown text in a few lines of code.
