MarkItDown
MarkItDown is a Python library from Microsoft that converts a range of file formats (Office documents, PDFs, images, audio) to Markdown. It is well suited to preparing documents for LLMs or building documentation pipelines.
Features
Supported Formats
Documents
- Word: .docx files to markdown
- PowerPoint: .pptx presentations with slide content
- Excel: .xlsx spreadsheets to markdown tables
- PDF: Extract text and structure from PDFs
- HTML: Convert web pages to clean markdown
Media
- Images: Extract text from images using OCR
- Audio: Transcribe speech in audio files (needs the audio transcription extra)
- Video: Extract audio and transcribe
Code & Data
- Jupyter Notebooks: .ipynb to markdown
- CSV: Convert to markdown tables
- JSON: Format as readable markdown
- XML: Convert to structured markdown
Conversion Features
- Clean Output: Well-formatted, readable markdown
- Structure Preservation: Maintains headings, lists, tables
- Image Handling: Embedded images extracted or referenced
- Table Conversion: Excel/HTML tables to markdown tables
- Metadata Extraction: Preserve document properties
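To make the table conversion feature concrete, here is a minimal sketch of how rows of cells map onto a Markdown table. It is an illustration only, not MarkItDown's own implementation; the function name `rows_to_markdown_table` is hypothetical.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows (first row is the header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes so a cell cannot break the table, then pad short rows.
        cells = [str(cell).replace("|", "\\|") for cell in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


print(rows_to_markdown_table([["Product", "Q1"], ["Widget", 100]]))
```

The separator row after the header is what makes Markdown renderers treat the block as a table.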
Installation
pip install markitdown
With Optional Dependencies
# For all features (quotes keep shells such as zsh from expanding the brackets)
pip install 'markitdown[all]'
# For specific features
pip install 'markitdown[ocr]' # Image OCR
pip install 'markitdown[audio-transcription]' # Audio transcription
Usage
Basic Conversion
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.docx")
print(result.text_content)
Convert Multiple Files
from markitdown import MarkItDown
from pathlib import Path
md = MarkItDown()
for file in Path("docs").glob("*.docx"):
    result = md.convert(str(file))
    # Write the Markdown next to the source file, e.g. docs/report.md
    output = file.with_suffix(".md")
    output.write_text(result.text_content, encoding="utf-8")
With OCR for Images
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("screenshot.png") # Uses OCR
markdown_text = result.text_content
Audio Transcription
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("meeting.mp3") # Transcribes audio
transcript = result.text_content
Use Cases
LLM Processing
- Convert documents before feeding to LLMs
- Process entire document libraries
- Extract text from scanned documents
- Prepare training data
Documentation
- Convert Word docs to markdown for version control
- Migrate documentation to markdown-based systems
- Generate docs from presentations
- Archive documents in plain text format
Data Extraction
- Extract tables from Excel to markdown
- Process PDF reports
- Convert forms and surveys
- Extract content from presentations
Content Management
- Bulk convert legacy documents
- Prepare content for static site generators
- Convert email attachments
- Process uploaded documents
Output Format
Word Document
# Document Title
## Section 1
This is a paragraph with **bold** and *italic* text.
- Bullet point 1
- Bullet point 2
## Section 2
| Header 1 | Header 2 |
|----------|----------|
| Cell 1 | Cell 2 |
Excel Spreadsheet
# Sheet: Sales Data
| Product | Q1 | Q2 | Q3 | Q4 |
|---------|----|----|----|----|
| Widget | 100| 150| 200| 250|
| Gadget | 75 | 80 | 90 | 100|
PowerPoint
# Slide 1: Title Slide
Presentation Title
Subtitle
---
# Slide 2: Content
- Point 1
- Point 2
- Point 3
Advanced Features
Custom Handlers
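from markitdown import MarkItDown, DocumentConverter, DocumentConverterResult
class CustomConverter(DocumentConverter):
    # Assumes the markitdown 0.1.x converter interface (accepts/convert on a binary stream)
    def accepts(self, file_stream, stream_info, **kwargs):
        return (stream_info.extension or "").lower() == ".custom"
    def convert(self, file_stream, stream_info, **kwargs):
        # Custom conversion logic; this placeholder just passes UTF-8 text through
        text = file_stream.read().decode("utf-8")
        return DocumentConverterResult(markdown=text)
md = MarkItDown()
md.register_converter(CustomConverter())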
Batch Processing
from markitdown import MarkItDown
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
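def convert_file(file_path):
    # Create the converter inside the worker so nothing unpicklable crosses processes
    md = MarkItDown()
    result = md.convert(str(file_path))
    return result.text_content

if __name__ == "__main__":  # guard is required where worker processes are spawned
    files = list(Path("documents").glob("**/*.docx"))
    with ProcessPoolExecutor() as executor:
        # executor.map is lazy; list() collects the results in input order
        results = list(executor.map(convert_file, files))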
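In the batch example above, one file that fails to convert raises as soon as its result is read, and the remaining results are lost. A minimal sketch of a more tolerant variant follows. The helper `convert_many` and its `convert` parameter are hypothetical: `convert` stands for any callable that takes a path and returns Markdown text, for instance `lambda p: MarkItDown().convert(str(p)).text_content`. Threads are used here only to keep the sketch simple.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def convert_many(paths, convert, max_workers=4):
    """Convert each path with `convert`; return (outputs, failures) dicts."""
    outputs, failures = {}, {}

    def run(path):
        try:
            return path, convert(path), None
        except Exception as exc:  # record the error and keep going
            return path, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for path, text, exc in pool.map(run, paths):
            if exc is None:
                outputs[path] = text
            else:
                failures[path] = exc
    return outputs, failures
```

After a run, `failures` maps each problem file to the exception it raised, so those files can be retried or reviewed by hand.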
Best Practices
- Test conversions with sample files first
- Handle large files with streaming when possible
- Cache OCR results for repeated processing
- Validate markdown output for important conversions
- Use appropriate optional dependencies for features needed
- Consider file size limits for memory usage
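Two of the practices above, caching results for repeated processing and respecting file size limits, can be sketched as a thin wrapper. This assumes a content hash is an acceptable cache key. The names `convert_with_cache`, `convert`, `cache_dir`, and `max_bytes` are hypothetical and not part of MarkItDown; `convert` is any callable that takes a path and returns Markdown text.

```python
import hashlib
from pathlib import Path


def convert_with_cache(path, convert, cache_dir, max_bytes=50 * 1024 * 1024):
    """Return cached Markdown for `path`, converting only on a cache miss."""
    path, cache_dir = Path(path), Path(cache_dir)
    if path.stat().st_size > max_bytes:
        raise ValueError(f"{path} exceeds the {max_bytes}-byte limit")

    # Key the cache on file contents so renamed copies reuse the same entry.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.md"
    if cached.exists():
        return cached.read_text(encoding="utf-8")

    text = convert(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(text, encoding="utf-8")
    return text
```

Hashing reads the whole file into memory, which is why the size check runs first.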
Limitations
- Complex layouts may not convert perfectly
- Some formatting may be lost (colors, fonts)
- Large PDFs can be slow to process
- OCR accuracy depends on image quality
- Audio transcription requires good audio quality
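Because complex layouts may not convert perfectly, a cheap structural check on the output can flag files worth reviewing by hand. The sketch below is one possible check, not a MarkItDown feature: the hypothetical `find_table_problems` reports table rows whose column count differs from the first row of the same table.

```python
def find_table_problems(markdown):
    """Return (line_number, message) pairs for table rows with a mismatched column count."""
    problems = []
    expected = None  # column count of the table currently being read
    for number, line in enumerate(markdown.splitlines(), start=1):
        # Drop escaped pipes so they are not counted as cell separators.
        row = line.strip().replace("\\|", "")
        if len(row) < 2 or not (row.startswith("|") and row.endswith("|")):
            expected = None  # a non-table line ends the current table
            continue
        count = len(row[1:-1].split("|"))
        if expected is None:
            expected = count
        elif count != expected:
            problems.append((number, f"expected {expected} columns, found {count}"))
    return problems
```

An empty result does not prove the conversion is faithful; it only rules out one common kind of table damage.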
Best For
- AI/LLM Projects: Preprocessing documents for AI
- Documentation Teams: Converting legacy docs to markdown
- Data Scientists: Extracting data from reports
- Content Managers: Bulk document conversion
- Archivists: Long-term document preservation
- Developers: Building document processing pipelines
Integration Examples
With LangChain
from markitdown import MarkItDown
from langchain_text_splitters import MarkdownTextSplitter  # older LangChain releases: langchain.text_splitter
md = MarkItDown()
result = md.convert("document.pdf")
splitter = MarkdownTextSplitter()
chunks = splitter.split_text(result.text_content)
With FastAPI
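import io
from pathlib import Path
from fastapi import FastAPI, UploadFile
from markitdown import MarkItDown
app = FastAPI()
md = MarkItDown()
@app.post("/convert")
async def convert_document(file: UploadFile):
    content = await file.read()
    # convert_stream expects a binary stream; the extension hint includes the leading dot
    extension = Path(file.filename or "").suffix
    result = md.convert_stream(io.BytesIO(content), file_extension=extension)
    return {"markdown": result.text_content}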
MarkItDown is a practical tool for anyone working with documents and AI: it converts many common file formats into clean, processable Markdown text.