Document Handler

Extract text and metadata from common document formats (PDF, Office Open XML, EPUB, RTF, OpenDocument).

Supported Formats

Format	Extensions	Text Extract	Metadata	Convert
PDF	.pdf	✅ pdftotext	✅ pdfinfo	✅ pdftoppm
Word	.docx	✅ unzip + xml	✅	✅
Excel	.xlsx	✅ unzip + xml	✅	✅
PowerPoint	.pptx	✅ unzip + xml	✅	✅
EPUB	.epub	✅ unzip + html	✅	✅
RTF	.rtf	✅ textutil (macOS only)	✅	✅
OpenDocument	.odt, .ods, .odp	✅ unzip + xml	✅	✅

Quick Commands

PDF

# Extract text
pdftotext -layout input.pdf output.txt

# Get metadata
pdfinfo input.pdf

# Convert to images (for OCR or viewing)
pdftoppm -png input.pdf output_prefix

# Extract specific pages
pdftotext -f 5 -l 10 -layout input.pdf output.txt
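
pdfinfo prints one "Key: value" pair per line, so a single field such as the page count can be picked out with awk. The following is a minimal sketch; the function name pdfinfo_field is made up for this example and is not part of poppler.

```shell
# pdfinfo_field KEY: read pdfinfo output on stdin and print the value of KEY.
# Hypothetical helper; only the "Key:   value" line layout of pdfinfo is assumed.
pdfinfo_field() {
  awk -v key="$1" '
    index($0, key ":") == 1 {        # line starts with "KEY:"
      sub(/^[^:]*:[ \t]*/, "")       # drop the key and the padding after it
      print
      exit
    }'
}

# Usage:
#   pdfinfo input.pdf | pdfinfo_field Pages
#   pdfinfo input.pdf | pdfinfo_field Title
```

Matching on the full key plus colon keeps "Pages" from also matching the "Page size" line.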

DOCX/XLSX/PPTX (Office Open XML)

# Extract text from DOCX
unzip -p input.docx word/document.xml | sed 's/<[^>]*>//g' | tr -s ' \n'

# Extract shared strings from XLSX (text cells from all sheets; numeric values are stored in xl/worksheets/*.xml)
unzip -p input.xlsx xl/sharedStrings.xml | sed 's/<[^>]*>//g' | tr -s '\n'

# Extract text from PPTX
unzip -p input.pptx 'ppt/slides/*.xml' | sed 's/<[^>]*>//g' | tr -s ' \n'

# Get metadata
unzip -p input.docx docProps/core.xml
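
Stripping every tag with sed also removes the paragraph boundaries, so a DOCX comes out as one run of text. A possible refinement, sketched here under the assumption that paragraphs end with </w:p> (as in WordprocessingML), turns paragraph ends into line breaks and decodes the basic XML entities. The name ooxml_to_text is hypothetical.

```shell
# ooxml_to_text: read WordprocessingML on stdin, print one paragraph per line.
# Hypothetical helper; awk is used so that "\n" in the replacement is portable.
ooxml_to_text() {
  awk -v sq="'" '{
    gsub(/<\/w:p>/, "\n")    # paragraph end becomes a line break
    gsub(/<[^>]*>/, "")      # drop all remaining tags
    gsub(/&lt;/, "<")        # decode the predefined XML entities
    gsub(/&gt;/, ">")
    gsub(/&quot;/, "\"")
    gsub(/&apos;/, sq)
    gsub(/&amp;/, "\\&")     # ampersand last, so "&amp;lt;" stays "&lt;"
    print
  }'
}

# Usage:
#   unzip -p input.docx word/document.xml | ooxml_to_text
```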

RTF (macOS)

# Convert RTF to plain text
textutil -convert txt input.rtf -output output.txt

# Convert RTF to HTML
textutil -convert html input.rtf -output output.html

EPUB

# Extract and read EPUB content
unzip -l input.epub                    # List contents
unzip -p input.epub "*.html" | lynx -stdin -dump  # Text via lynx
unzip -p input.epub "*.xhtml" | sed 's/<[^>]*>//g'  # Raw text

OpenDocument (ODT/ODS/ODP)

# Extract text from ODT
unzip -p input.odt content.xml | sed 's/<[^>]*>//g' | tr -s ' \n'

# Extract from ODS
unzip -p input.ods content.xml | sed 's/<[^>]*>//g'

# Get metadata
unzip -p input.odt meta.xml
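
Both meta.xml (OpenDocument) and docProps/core.xml (Office Open XML) store fields such as dc:title and dc:creator as simple elements, so one field can be pulled out without dumping the whole file. This is a rough sketch with a made-up helper name, xml_element_text; it uses pattern matching rather than a real XML parser and does not decode entities.

```shell
# xml_element_text TAG: read XML on stdin, print the text of the first <TAG>...</TAG>.
# Hypothetical helper; assumes the element contains plain text and no child elements.
xml_element_text() {
  awk -v tag="$1" '
    { doc = doc $0 " " }                     # join lines so matches can span them
    END {
      otag = "<" tag "[^>]*>"                # opening tag, attributes allowed
      if (match(doc, otag "[^<]*</" tag ">")) {
        el = substr(doc, RSTART, RLENGTH)
        sub(otag, "", el)                    # strip opening tag
        sub("</" tag ">", "", el)            # strip closing tag
        print el
      }
    }'
}

# Usage:
#   unzip -p input.odt meta.xml | xml_element_text dc:creator
#   unzip -p input.docx docProps/core.xml | xml_element_text dc:title
```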

Scripts

extract_document.sh

Extracts text and metadata from any supported document format.

~/Dropbox/jarvis/skills/document-handler/scripts/extract_document.sh <file>

Output:

Text content to stdout
Metadata as JSON comments

pdf_to_images.sh

Converts PDF pages to images for OCR or visual processing.

~/Dropbox/jarvis/skills/document-handler/scripts/pdf_to_images.sh <pdf> <output_dir> [dpi]

Workflow

Identify format — Check file extension
Extract text — Use appropriate tool
Get metadata — Author, date, pages, etc.
Process content — Summarize, search, transform
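
The first two workflow steps can be sketched as a small dispatcher that maps the file extension to one of the commands from Quick Commands. This is an illustration only and not the contents of extract_document.sh; the function names detect_format and extract_text are assumptions, and the RTF branch relies on the macOS textutil tool.

```shell
# detect_format FILE: print a format name based on the case-insensitive extension.
detect_format() {
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|docx|xlsx|pptx|epub|rtf) echo "$ext" ;;
    odt|ods|odp)                 echo odf ;;
    *)                           echo unknown; return 1 ;;
  esac
}

# strip_tags: crude tag removal, same approach as the quick commands.
strip_tags() {
  sed 's/<[^>]*>//g'
}

# extract_text FILE: write the document text to stdout.
extract_text() {
  case "$(detect_format "$1")" in
    pdf)  pdftotext -layout "$1" - ;;
    docx) unzip -p "$1" word/document.xml | strip_tags ;;
    xlsx) unzip -p "$1" xl/sharedStrings.xml | strip_tags ;;
    pptx) unzip -p "$1" 'ppt/slides/*.xml' | strip_tags ;;
    odf)  unzip -p "$1" content.xml | strip_tags ;;
    epub) unzip -p "$1" '*.html' '*.xhtml' 2>/dev/null | strip_tags ;;
    rtf)  textutil -convert txt -stdout "$1" ;;
    *)    echo "unsupported format: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   extract_text report.pdf > report.txt
```

Checking only the extension is cheap but trusts the file name; a renamed file would be routed to the wrong tool.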

Notes

PDFs with scanned images need OCR (pdftoppm + tesseract)
Encrypted PDFs require a password (pdftotext, pdfinfo, and pdftoppm accept -upw for the user password)
Complex formatting may be lost in text extraction
For tables in PDFs, consider tabula or camelot
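
The OCR note above could be wired up roughly as follows: try pdftotext first, and fall back to pdftoppm plus tesseract only when almost no text comes back. This is a sketch under assumptions; the function names, the 20-character threshold, and the 300 DPI setting are illustrative choices, not values from the skill's scripts.

```shell
# has_text_layer [MIN]: succeed when stdin holds at least MIN non-whitespace
# characters (default 20, an arbitrary threshold chosen for this sketch).
has_text_layer() {
  min_chars="${1:-20}"
  count=$(( $(tr -d '[:space:]' | wc -c) ))   # arithmetic strips wc padding
  [ "$count" -ge "$min_chars" ]
}

# pdf_text_or_ocr FILE: print the text layer, or OCR the rendered pages if
# the PDF appears to be scanned. Requires poppler-utils and tesseract.
pdf_text_or_ocr() {
  pdf="$1"
  text=$(pdftotext -layout "$pdf" - 2>/dev/null)
  if printf '%s' "$text" | has_text_layer; then
    printf '%s\n' "$text"
    return 0
  fi
  tmpdir=$(mktemp -d) || return 1
  pdftoppm -r 300 -png "$pdf" "$tmpdir/page"  # one PNG per page
  for img in "$tmpdir"/page-*.png; do
    [ -e "$img" ] || continue                 # no pages rendered
    tesseract "$img" stdout 2>/dev/null       # OCR text to stdout
  done
  rm -rf "$tmpdir"
}

# Usage:
#   pdf_text_or_ocr scanned.pdf > scanned.txt
```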
