officeparser (npm): latest version 7.0.3 · 419K weekly downloads · 1 maintainer

officeParser — Universal Office Document Parser & Generator

A robust, strictly-typed Node.js and Browser library for parsing office files into a rich Abstract Syntax Tree (AST) and generating high-fidelity output in multiple formats.

Parses: docx · pptx · xlsx · odt · odp · ods · pdf · rtf · csv · md · html

Generates: Markdown · HTML · CSV · RTF · PDF · Plain Text · RAG Chunks

🌟 Live Interactive AST Visualizer & Documentation 🌟

Upload any office file in your browser — inspect the AST, tweak config, and preview generated output in real-time.

  • AST Visualizer: Inspect the hierarchical node tree, metadata, and raw content
  • Config Configurator: Tweak options (ignoreNotes, ocr, newlineDelimiter) and see results instantly
  • Debugging: Identify exactly how nodes are interpreted
  • Format Specs: Read detailed specs for the AST structure and all config options

📝 Changelog

Install via npm

npm i officeparser

Command Line Usage

# Full AST as JSON (default)
npx officeparser /path/to/file.docx

# Plain text output
npx officeparser /path/to/file.docx --format=text

# Convert DOCX to Markdown and save
npx officeparser report.docx --format=md --output=report.md

# Convert PPTX to HTML
npx officeparser presentation.pptx --format=html --output=preview.html

# Convert XLSX to CSV
npx officeparser data.xlsx --format=csv

# Generate RAG chunks
npx officeparser document.pdf --format=chunks

CLI Options

| Flag | Values | Default | Description |
| --- | --- | --- | --- |
| --format | json \| text \| md \| html \| csv \| rtf \| pdf \| chunks | json | Output format |
| --output | path | | Write output to a file |
| --toText | true \| false | false | Deprecated. Use --format=text |
| --ignoreNotes | true \| false | false | Ignore speaker notes (PPTX/ODP) |
| --putNotesAtLast | true \| false | false | Collect notes at end of output |
| --newlineDelimiter | string | \n | Delimiter between lines |
| --extractAttachments | true \| false | false | Extract images/charts as Base64 |
| --ocr | true \| false | false | Enable OCR for images |
| --includeRawContent | true \| false | false | Include raw XML/RTF in nodes |
| --includeBreakNodes | true \| false | false | Include break nodes (DOCX only) |
| --outputErrorToConsole | true \| false | false | Deprecated. Use onWarning callback |
| --verbose | true \| false | false | Show full error stack traces |

Quick Decision Guide

| Goal | API to use |
| --- | --- |
| Extract text / AST from a file | OfficeParser.parseOffice(file) |
| Convert directly to another format | OfficeConverter.convert(file, 'md') |
| Parse first, then generate | parseOffice() → OfficeGenerator.generate(ast, 'html') |
| Convert on the AST itself (shorthand) | ast.to('md') |
| RAG pipeline chunking | OfficeConverter.convert(file, 'chunks', {...}) |

Library Usage: Parsing

Async/Await

const officeParser = require('officeparser');

const ast = await officeParser.parseOffice('/path/to/file.docx');

console.log(ast.type);       // 'docx'
console.log(ast.metadata);   // { author, title, created, ... }
console.log(ast.content);    // Array of hierarchical nodes
console.log(ast.attachments);// Images/charts (if extractAttachments: true)
console.log(ast.warnings);   // Non-fatal issues from parsing phase

TypeScript (named import):

import { OfficeParser } from 'officeparser';

const ast = await OfficeParser.parseOffice('report.docx', {
    extractAttachments: true,
    ocr: true,
});

Callback (Backward Compat)

officeParser.parseOffice('/path/to/file.docx', function(ast, err) {
    if (err) { console.error(err); return; }
    console.log(ast.toText());
});

File Buffers & ArrayBuffers

Pass a Buffer, ArrayBuffer, or Uint8Array instead of a file path:

const fs = require('fs');
const buffer = fs.readFileSync('/path/to/file.pdf');
const ast = await officeParser.parseOffice(buffer);

[!IMPORTANT] Text-based formats from buffers need a fileType hint. Formats like md, html, and csv have no magic bytes, so the parser cannot auto-detect them from a buffer. You must provide fileType in that case:

const ast = await officeParser.parseOffice(markdownBuffer, { fileType: 'md' });
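
When a buffer comes from an upload or download whose filename is still known, the hint can be derived from the extension. The following is a minimal sketch, not part of officeparser: `configForBuffer` is a hypothetical helper, and it assumes the `fileType` values are exactly the 'md', 'html', and 'csv' strings named above.

```javascript
// Hypothetical helper, not part of officeparser.
// Text-based formats without magic bytes, as listed in the note above.
const TEXT_FORMATS = new Set(['md', 'html', 'csv']);

// Returns a config with a fileType hint when the extension is one of the
// text-based formats, and an empty config for everything else.
function configForBuffer(filename) {
    const dot = filename.lastIndexOf('.');
    const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
    return TEXT_FORMATS.has(ext) ? { fileType: ext } : {};
}

console.log(configForBuffer('notes.MD'));    // { fileType: 'md' }
console.log(configForBuffer('report.docx')); // {}
```

The result could then be passed as the second argument, for example `parseOffice(buffer, configForBuffer(name))`.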

ast.to() — Generate from AST

The preferred way to convert a parsed AST to another format. Returns a ConversionResult.

// ConversionResult shape:
// { value: string | Uint8Array | OfficeChunk[], messages: OfficeIssue[] }

const { value: markdown, messages } = await ast.to('md');
const { value: html }               = await ast.to('html', { includeFormatting: false });
const { value: chunks }             = await ast.to('chunks', { strategy: 'fixed-size', chunkSize: 800 });
const { value: pdfBytes }           = await ast.to('pdf'); // Uint8Array

ast.toText() — Quick Text Extraction

[!NOTE] toText() is synchronous and deprecated in favour of the async ast.to('text'). It remains available for backward compatibility.

const text = ast.toText(); // synchronous, returns plain string

OfficeGenerator

Use OfficeGenerator.generate(ast, format, config?) when you need to produce output from an already-parsed AST:

import { OfficeParser, OfficeGenerator } from 'officeparser';

const ast = await OfficeParser.parseOffice('report.docx');

// Convert to Markdown
const { value: md } = await OfficeGenerator.generate(ast, 'md');

// Convert to HTML with style mapping
const { value: html } = await OfficeGenerator.generate(ast, 'html', {
    includeFormatting: true,
    styleMap: [
        {
            selector: { nodeType: 'paragraph', attributes: { style: 'Heading 1' } },
            output: { tag: 'h1', classes: ['main-title'] }
        }
    ]
});

// Convert to CSV (spreadsheets)
const { value: csv } = await OfficeGenerator.generate(ast, 'csv');

Supported destinations: 'text' · 'md' · 'html' · 'csv' · 'rtf' · 'pdf' · 'chunks'

[!NOTE] PDF generation requires the optional puppeteer peer dependency:

npm install puppeteer

OfficeConverter — One-Step API

OfficeConverter.convert() combines parsing and generation in a single call. It automatically syncs parser options from generator config (e.g., enables extractAttachments when images are requested).

import { OfficeConverter } from 'officeparser';

// Minimal usage
const { value: markdown } = await OfficeConverter.convert('report.docx', 'md');

// With config
const { value: html, messages } = await OfficeConverter.convert('data.xlsx', 'html', {
    parseConfig: {
        ignoreNotes: true,
        newlineDelimiter: '\n\n',
    },
    generatorConfig: {
        includeFormatting: true,
        styleMap: [
            {
                selector: { attributes: { style: { value: 'Header', operator: '~=' } } },
                output: { tag: 'h2', classes: ['data-header'] }
            }
        ]
    },
    onWarning: (issue) => console.warn(`[${issue.code}] ${issue.message}`)
});

[!IMPORTANT] The OfficeConverterConfig shape uses nested parseConfig and generatorConfig sub-objects. Do not put parser or generator options at the top level — only onWarning lives there.
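
A small guard can catch misplaced options before they are silently ignored. This is a sketch only: `findMisplacedKeys` is a hypothetical helper that is not part of officeparser, and the allowed key list is taken from the note above.

```javascript
// Hypothetical guard, not part of officeparser. The only top-level keys
// that the note above allows in an OfficeConverterConfig.
const ALLOWED_TOP_LEVEL = new Set(['parseConfig', 'generatorConfig', 'onWarning']);

// Returns the names of any options placed at the top level by mistake.
function findMisplacedKeys(config) {
    return Object.keys(config).filter(key => !ALLOWED_TOP_LEVEL.has(key));
}

const wrong = { ignoreNotes: true, includeFormatting: false };
const right = {
    parseConfig: { ignoreNotes: true },
    generatorConfig: { includeFormatting: false },
};

console.log(findMisplacedKeys(wrong)); // [ 'ignoreNotes', 'includeFormatting' ]
console.log(findMisplacedKeys(right)); // []
```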

Native RAG Chunking

officeParser provides native document chunking for Retrieval-Augmented Generation (RAG) pipelines with three strategies:

Strategy 1: Document Structure (Default)

Splits at natural AST boundaries (paragraphs, headings, pages, slides, sheets). Preserves logical flow.

const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
    generatorConfig: {
        chunksConfig: {
            strategy: 'document-structure',
            splitBy: 'heading',    // 'paragraph' | 'heading' | 'page' | 'slide' | 'sheet'
            maxChunkSize: 1500,
            tableSplitStrategy: 'row', // repeats header row in every chunk — ideal for RAG
        }
    }
});

Strategy 2: Fixed-Size (Recursive)

Splits by character count with overlap. Equivalent to LangChain's RecursiveCharacterTextSplitter.

const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
    generatorConfig: {
        chunksConfig: {
            strategy: 'fixed-size',
            chunkSize: 1000,
            chunkOverlap: 200,
        }
    }
});
console.log(`Generated ${chunks.length} chunks`);
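
To make the interaction between chunkSize and chunkOverlap concrete, here is a plain sliding-window splitter. It is a deliberate simplification and not the library's recursive implementation: it ignores separators entirely, and `slidingWindowChunks` is a hypothetical name.

```javascript
// Simplified illustration only: fixed windows of chunkSize characters,
// each starting (chunkSize - chunkOverlap) characters after the previous one.
function slidingWindowChunks(text, chunkSize, chunkOverlap) {
    if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
        throw new RangeError('need chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
    }
    const step = chunkSize - chunkOverlap;
    const chunks = [];
    for (let start = 0; start < text.length; start += step) {
        chunks.push(text.slice(start, start + chunkSize));
        // Stop once a window has reached the end of the text.
        if (start + chunkSize >= text.length) break;
    }
    return chunks;
}

console.log(slidingWindowChunks('abcdefghij', 4, 1)); // [ 'abcd', 'defg', 'ghij' ]
```

Each chunk repeats the last chunkOverlap characters of the previous one, which is what keeps context intact across chunk borders.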

Strategy 3: Semantic

Uses cosine similarity between sentence embeddings to find topic boundaries. Requires you to provide an embeddingFunction.

import OpenAI from 'openai';
const openai = new OpenAI();

const { value: chunks } = await OfficeConverter.convert('report.docx', 'chunks', {
    generatorConfig: {
        chunksConfig: {
            strategy: 'semantic',
            embeddingFunction: async (text) => {
                const res = await openai.embeddings.create({
                    input: text, model: 'text-embedding-3-small'
                });
                return res.data[0].embedding;
            },
            similarityThreshold: 0.8,
            maxChunkSize: 2000,
        }
    }
});
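
The idea behind similarityThreshold can be sketched with toy vectors. This is a simplified model under stated assumptions, not the library's algorithm: it compares only consecutive embeddings (no bufferSize, no maxChunkSize), and `findBoundaries` is a hypothetical name.

```javascript
// Cosine similarity of two equal-length, non-zero vectors.
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the indices at which a new chunk would start: every position whose
// similarity to the previous sentence falls below the threshold.
function findBoundaries(embeddings, similarityThreshold) {
    const boundaries = [];
    for (let i = 1; i < embeddings.length; i++) {
        if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < similarityThreshold) {
            boundaries.push(i);
        }
    }
    return boundaries;
}

// Hand-made 2D "embeddings": the first two point the same way, the third does not.
const toyEmbeddings = [[1, 0], [1, 0.1], [0, 1]];
console.log(findBoundaries(toyEmbeddings, 0.8)); // [ 2 ]
```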

The OfficeChunk Object

Every chunk contains text and rich metadata for citations and filtered retrieval:

interface OfficeChunk {
    text: string;
    /** Rich metadata for filtered retrieval */
    metadata: {
        sourceType: string;       // e.g., 'docx', 'pdf'
        pageNumber?: number;      // (PDF only)
        slideNumber?: number;     // (PPTX only)
        sheetName?: string;       // (XLSX only)
        closestHeading?: string;  // Nearest heading above this chunk
        isTableChunk?: boolean;   // True if part of a split table
    };
    startIndex?: number;          // Character offset (if addStartIndex: true)
    endIndex?: number;            // End character offset (if addStartIndex: true)
}
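
The metadata fields lend themselves to building citation strings for retrieved chunks. A minimal sketch, using only the fields from the interface above; `citationFor` is a hypothetical helper and the sample chunk is hand-written, not real library output.

```javascript
// Builds a human-readable citation from whichever metadata fields are present.
function citationFor(chunk) {
    const m = chunk.metadata;
    const parts = [m.sourceType];
    if (m.pageNumber !== undefined) parts.push(`page ${m.pageNumber}`);
    if (m.slideNumber !== undefined) parts.push(`slide ${m.slideNumber}`);
    if (m.sheetName !== undefined) parts.push(`sheet "${m.sheetName}"`);
    if (m.closestHeading !== undefined) parts.push(`under "${m.closestHeading}"`);
    return parts.join(', ');
}

// Hand-written sample chunk for illustration.
const sampleChunk = {
    text: 'Revenue grew in Q3.',
    metadata: { sourceType: 'pdf', pageNumber: 4, closestHeading: 'Results' },
};
console.log(citationFor(sampleChunk)); // pdf, page 4, under "Results"
```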

The AST Structure

OfficeParserAST is a format-agnostic document representation:

OfficeParserAST
├── type: 'docx' | 'pdf' | 'xlsx' | 'csv' | 'md' | ...  (11 formats)
├── metadata: { author, title, created, modified, customProperties, styleMap, ... }
├── content: [ OfficeContentNode ]
│   ├── type: 'paragraph' | 'heading' | 'table' | 'list' | 'image' | 'chart' | ...
│   ├── text: string  (concatenated text of node + all descendants)
│   ├── children: [ OfficeContentNode ]  (recursive)
│   ├── formatting: { bold, italic, underline, color, size, font, alignment, ... }
│   └── metadata: { level, listId, row, col, rowSpan, colSpan, style, ... }
├── attachments: [ OfficeAttachment ]  (populated when extractAttachments: true)
│   ├── type: 'image' | 'chart'
│   ├── name: string
│   ├── mimeType: string
│   ├── data: string  (Base64)
│   ├── ocrText?: string  (if ocr: true)
│   └── chartData?: { title, dataSets, labels }
├── warnings: OfficeIssue[]  (non-fatal issues from the parsing phase)
├── to(format, config?)  (format: 'html'|'md'|'text'|'csv'|'rtf'|'pdf'|'chunks', returns { value, messages })
└── toText()             (Deprecated: use .to('text') instead)
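
Because content nodes nest through children, most AST processing is a recursive walk. The sketch below counts node types over a hand-written sample that mirrors the documented shape; `countNodeTypes` is a hypothetical helper, and real ASTs carry more fields than shown here.

```javascript
// Depth-first walk that tallies how often each node type occurs.
function countNodeTypes(nodes, counts = {}) {
    for (const node of nodes) {
        counts[node.type] = (counts[node.type] || 0) + 1;
        if (node.children) countNodeTypes(node.children, counts);
    }
    return counts;
}

// Hand-written sample standing in for ast.content.
const sampleContent = [
    { type: 'heading', text: 'Intro', children: [{ type: 'text', text: 'Intro' }] },
    { type: 'paragraph', text: 'Hello world', children: [
        { type: 'text', text: 'Hello ' },
        { type: 'text', text: 'world' },
    ] },
];
console.log(countNodeTypes(sampleContent)); // { heading: 1, text: 3, paragraph: 1 }
```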

OfficeIssue — Warning / Error Object

All warnings and errors (from both parsing and generation) use this shape:

interface OfficeIssue {
    type: 'warning' | 'info' | 'error';
    code: OfficeWarningType | OfficeErrorType;  // typed enum, e.g. 'OCR_FAILED'
    message: string;
    node?: OfficeContentNode;  // the node that triggered the issue, if any
    details?: any;             // original error or extra context
}
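
Since parsing and generation report issues in the same shape, they can be merged and grouped by severity in one pass. A minimal sketch: `groupIssuesByType` is a hypothetical helper, and 'EXAMPLE_INFO' is an invented code used only for illustration.

```javascript
// Groups issues by their type field and formats each as "[CODE] message".
function groupIssuesByType(issues) {
    const groups = { error: [], warning: [], info: [] };
    for (const issue of issues) {
        if (!groups[issue.type]) groups[issue.type] = [];
        groups[issue.type].push(`[${issue.code}] ${issue.message}`);
    }
    return groups;
}

// Hand-written sample issues; 'EXAMPLE_INFO' is not a real officeparser code.
const sampleIssues = [
    { type: 'warning', code: 'OCR_FAILED', message: 'Could not read image 3' },
    { type: 'info', code: 'EXAMPLE_INFO', message: 'Hypothetical entry' },
];
console.log(groupIssuesByType(sampleIssues).warning); // [ '[OCR_FAILED] Could not read image 3' ]
```

In practice the input could be the concatenation of ast.warnings and the messages array of a ConversionResult.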

Deep Dive: Document Components

1. Lists

List Node
├── type: 'list'
├── metadata: {
│       listId: '1',          // items with the same listId belong to one logical list
│       listType: 'ordered' | 'unordered',
│       indentation: 0,       // nesting level (0-based)
│       itemIndex: 0,         // sequential position within the list level
│       paragraphIndentation: { left, hanging, right, firstLine }
│   }
└── children: [ Text content ]

[!TIP] Even if a list is interrupted by a regular paragraph, itemIndex keeps incrementing for the same listId, so numbering stays correct.
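
Grouping by listId is enough to reassemble a list that a paragraph interrupts. The following sketch works on hand-written top-level nodes that follow the shape above; `groupListItems` is a hypothetical helper and does not descend into nested children.

```javascript
// Collects list nodes by listId, preserving document order within each list.
function groupListItems(nodes) {
    const lists = new Map();
    for (const node of nodes) {
        if (node.type !== 'list') continue;
        const id = node.metadata.listId;
        if (!lists.has(id)) lists.set(id, []);
        lists.get(id).push(node);
    }
    return lists;
}

// Hand-written sample: one logical list interrupted by a paragraph.
const sampleNodes = [
    { type: 'list', text: 'First', metadata: { listId: '1', itemIndex: 0 } },
    { type: 'paragraph', text: 'An interrupting paragraph' },
    { type: 'list', text: 'Second', metadata: { listId: '1', itemIndex: 1 } },
];
const lists = groupListItems(sampleNodes);
console.log(lists.get('1').map(n => `${n.metadata.itemIndex + 1}. ${n.text}`));
// [ '1. First', '2. Second' ]
```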

2. Tables

Tables follow a strict table → row → cell hierarchy:

Table Node (type: 'table')
└── children: Row Nodes (type: 'row')
    └── children: Cell Nodes (type: 'cell')
        ├── metadata: { row, col, rowSpan?, colSpan? }
        └── children: [ Paragraph | List | Table | ... ]
  • row / col: zero-based grid position
  • rowSpan / colSpan: merged cells (primarily ODF formats)
  • Cells can contain nested tables
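
The row/col metadata makes it possible to rebuild a rectangular grid from a table node. In this sketch, repeating a merged cell's text in every position it covers is a design choice made for the example, not documented library behaviour; `tableToGrid` is a hypothetical helper.

```javascript
// Places each cell's text at its zero-based grid position, repeating the
// text of merged cells across every position they span.
function tableToGrid(table) {
    const grid = [];
    for (const row of table.children.filter(n => n.type === 'row')) {
        for (const cell of row.children.filter(n => n.type === 'cell')) {
            const { row: r, col: c, rowSpan = 1, colSpan = 1 } = cell.metadata;
            for (let dr = 0; dr < rowSpan; dr++) {
                for (let dc = 0; dc < colSpan; dc++) {
                    if (!grid[r + dr]) grid[r + dr] = [];
                    grid[r + dr][c + dc] = cell.text;
                }
            }
        }
    }
    return grid;
}

// Hand-written sample table with a header cell spanning two columns.
const sampleTable = {
    type: 'table',
    children: [
        { type: 'row', children: [
            { type: 'cell', text: 'Totals', metadata: { row: 0, col: 0, colSpan: 2 } },
        ] },
        { type: 'row', children: [
            { type: 'cell', text: 'Q1', metadata: { row: 1, col: 0 } },
            { type: 'cell', text: '42', metadata: { row: 1, col: 1 } },
        ] },
    ],
};
console.log(tableToGrid(sampleTable)); // [ [ 'Totals', 'Totals' ], [ 'Q1', '42' ] ]
```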

3. Images & OCR

Image Node (type: 'image')
├── metadata: { attachmentName: 'img1.png', altText: '...' }
└── → Attachment: { data: 'base64...', ocrText: '...' }
  • Set extractAttachments: true to populate attachment.data
  • Set ocr: true (requires extractAttachments: true) to populate ocrText

4. Charts

Chart Node (type: 'chart')
├── metadata: { attachmentName: 'chart1.xml' }
└── → Attachment: { chartData: { title, dataSets, labels } }

5. Text Formatting

formatting: {
    bold?: boolean
    italic?: boolean
    underline?: boolean
    strikethrough?: boolean
    color?: string          // '#RRGGBB'
    backgroundColor?: string
    size?: string           // e.g. '12pt'
    font?: string
    subscript?: boolean
    superscript?: boolean
    alignment?: 'left' | 'center' | 'right' | 'justify'
}

6. Break Nodes (DOCX only)

When includeBreakNodes: true, break elements appear as nodes:

Break Node (type: 'break')
└── metadata: {
        breakType: 'textWrapping' | 'page' | 'column' | 'lastRenderedPage' | 'carriageReturn',
        clear?: 'all' | 'left' | 'none' | 'right'
    }

[!NOTE] Break nodes have no text property, but ast.toText() and ast.to('text') automatically convert them to the configured newline delimiter.

7. Document Metadata

ast.metadata = {
    author?: string
    title?: string
    created?: Date
    modified?: Date
    description?: string
    customProperties?: Record<string, any>  // user-defined metadata from the document
    styleMap?: Record<string, TextFormatting>  // named styles → formatting definitions
    formatting?: TextFormatting              // document-wide defaults
}

Accessing custom properties:

const ast = await officeParser.parseOffice('contract.docx');
console.log(ast.metadata.customProperties);
// { "ProjectID": "ABC-123", "InternalReview": true }

Performance Highlights

Key internal optimizations shipped in recent versions:

  • OpenOffice (ODP): Up to 23× faster parsing via optimized XML pre-parsing and style caching
  • Excel Memory: Resolved O(n) memory overhead on large sparse spreadsheets using iterative stream-based parsing
  • RTF Parser: Rewrote string accumulation loop to eliminate O(n²) bottleneck in large files
  • Table Fidelity (DOCX): Native support for vertical cell merging (vMerge) and horizontal spanning (gridSpan)

Advanced AST Usage

Extract all level-1 headings

const headings = ast.content.filter(n => n.type === 'heading' && n.metadata?.level === 1);
console.log(headings.map(h => h.text));

Extract images with OCR text

const ast = await officeParser.parseOffice('report.docx', { extractAttachments: true, ocr: true });
ast.attachments.filter(a => a.mimeType?.startsWith('image/')).forEach(img => {
    console.log(`${img.name}: ${img.ocrText ?? 'no OCR'}`);
});

Extract tables to CSV manually

ast.content.filter(n => n.type === 'table').forEach((table, i) => {
    const csv = table.children
        .filter(r => r.type === 'row')
        .map(r => r.children.filter(c => c.type === 'cell')
            .map(c => `"${c.text.replace(/"/g, '""')}"`)
            .join(','))
        .join('\n');
    console.log(`Table ${i + 1}:\n${csv}`);
});

Find all bold text runs

function findBold(nodes) {
    return nodes.flatMap(n => [
        ...(n.type === 'text' && n.formatting?.bold ? [n.text] : []),
        ...(n.children ? findBold(n.children) : [])
    ]);
}
console.log(findBold(ast.content));

Extract footnotes / endnotes

function extractNotes(nodes) {
    return nodes.flatMap(n => [
        ...(n.type === 'note' ? [{ id: n.metadata.noteId, text: n.text, type: n.metadata.noteType }] : []),
        ...(n.children ? extractNotes(n.children) : [])
    ]);
}
console.log(extractNotes(ast.content));

Search for a term (TypeScript)

import { OfficeParser } from 'officeparser';

async function contains(filePath: string, term: string): Promise<boolean> {
    const ast = await OfficeParser.parseOffice(filePath);
    return (await ast.to('text')).value.includes(term);
}

Configuration Reference

OfficeParserConfig

Pass as the second argument to parseOffice(file, config).

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| newlineDelimiter | string | '\n' | Delimiter inserted between lines in text output |
| ignoreNotes | boolean | false | Ignore speaker notes (PPTX/ODP) |
| putNotesAtLast | boolean | false | Collect all notes at the end instead of inline |
| extractAttachments | boolean | false | Populate ast.attachments with Base64 images/charts |
| ocr | boolean | false | Run Tesseract OCR on images (requires extractAttachments: true) |
| ocrConfig | OcrConfig | {} | OCR worker pool settings; see the OCR section |
| includeRawContent | boolean | false | Attach raw XML/RTF source to each node |
| serializeRawContent | boolean | true | Re-serialize XML to clean strings (only if includeRawContent: true) |
| preserveXmlWhitespace | boolean | false | Preserve original XML whitespace during serialization |
| includeBreakNodes | boolean | false | Include w:br / w:cr as typed break nodes (DOCX only) |
| ignoreInternalLinks | boolean | false | Strip bookmarks and internal cross-references from AST |
| fileType | SupportedFileType \| null | null | Required for text-based binary data ('md', 'html', 'csv') as these lack magic bytes |
| csvDelimiter | string | ',' | Input delimiter when parsing CSV files |
| pdfWorkerSrc | string | CDN (jsDelivr) | Path/URL to pdf.worker.min.mjs (required in browser) |
| onWarning | (issue: OfficeIssue) => void | | Callback for non-fatal parsing issues |
| outputErrorToConsole | boolean | false | Deprecated. Use onWarning instead |

GeneratorConfig (Common)

Options shared by all generator formats. Pass to OfficeGenerator.generate(ast, format, config) or ast.to(format, config).

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| includeFormatting | boolean | true | Include bold/italic/colors/sizes in output |
| generateIds | boolean | true | Add slug-based id attributes to headings |
| renderMetadata | boolean | false | Render title/author as visible header block |
| includeImages | boolean | true | Include image nodes in output |
| includeCharts | boolean | true | Include interactive charts (HTML only) |
| ignoreInternalLinks | boolean | false | Strip bookmarks and internal anchors from output |
| ignoreDefaultStyleMap | boolean | false | Disable built-in style mappings (e.g., "Heading 1" → h1) |
| styleMap | string[] \| StructuredStyleMapping[] | [] | Custom semantic style mappings |
| onNode | (node) => string \| false \| void | | Per-node callback for filtering, overriding, or mutating |
| onWarning | (issue: OfficeIssue) => void | | Callback for non-fatal generation issues |

onNode Callback — Advanced Node Manipulation

Called for every node in the AST during generation. Can be async.

| Return value | Effect |
| --- | --- |
| false | Skip this node and all its children |
| string | Use this string as the output for this node, skip default logic |
| void | Proceed with default rendering (mutations to node are applied) |

const { value: md } = await ast.to('md', {
    onNode: async (node) => {
        // Skip all images
        if (node.type === 'image') return false;

        // Redact secrets (mutate then proceed)
        if (node.text?.includes('SECRET_KEY')) {
            node.text = node.text.replace(/SECRET_KEY: \w+/, 'SECRET_KEY: [REDACTED]');
        }

        // Custom rendering for a specific style
        if (node.metadata?.style === 'Callout') {
            return `> [!INFO]\n> ${node.text}`;
        }
    }
});
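
The three return values can be modelled with a toy renderer. This is an illustrative, synchronous model of the contract described above, not the library's generator: `render` is a hypothetical function whose "default rendering" is simply the concatenation of child output (or node.text for leaves).

```javascript
// Toy model of the onNode contract: false skips, a string overrides,
// and undefined falls through to default rendering.
function render(node, onNode) {
    const result = onNode(node);
    if (result === false) return '';
    if (typeof result === 'string') return result;
    if (node.children && node.children.length > 0) {
        return node.children.map(child => render(child, onNode)).join('');
    }
    return node.text ?? '';
}

// Hand-written sample paragraph.
const sampleParagraph = {
    type: 'paragraph',
    text: 'Status: draft',
    children: [
        { type: 'text', text: 'Status: ' },
        { type: 'image', text: '' },
        { type: 'text', text: 'draft' },
    ],
};

const rendered = render(sampleParagraph, node => {
    if (node.type === 'image') return false;   // skip the node entirely
    if (node.text === 'draft') return 'final'; // override the output
    // returning nothing proceeds with default rendering
});
console.log(rendered); // Status: final
```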

styleMap — Semantic Style Mapping

Maps document style names to semantic output elements. Two formats are supported: structured objects, shown here, and the legacy string DSL described further down.

styleMap: [
    {
        selector: { nodeType: 'paragraph', attributes: { style: 'Heading 1' } },
        output: { tag: 'h1', classes: ['main-title'], attributes: { id: 'top' } }
    },
    {
        // '~=' operator matches if the word 'Quote' appears anywhere in the style name
        selector: { attributes: { style: { value: 'Quote', operator: '~=' } } },
        output: { tag: 'blockquote', fresh: true }
    }
]

fresh: true prevents the generator from merging adjacent nodes of the same tag into one block.
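
The difference between an exact style match and the '~=' operator can be sketched as follows. This is an assumption-laden illustration: `matchesStyle` is a hypothetical function, and treating "word" as a whitespace-separated, case-sensitive token is a guess modelled on the CSS attribute selector of the same name. The library's real matching rules may differ.

```javascript
// A plain string selector must equal the style name exactly; a '~=' selector
// matches when its value is one of the whitespace-separated words in the name.
function matchesStyle(styleName, selector) {
    if (typeof selector === 'string') return styleName === selector;
    if (selector.operator === '~=') {
        return styleName.split(/\s+/).includes(selector.value);
    }
    return styleName === selector.value;
}

console.log(matchesStyle('Heading 1', 'Heading 1'));                              // true
console.log(matchesStyle('Intense Quote', { value: 'Quote', operator: '~=' }));  // true
console.log(matchesStyle('Quotes', { value: 'Quote', operator: '~=' }));         // false
```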

Legacy String DSL

Compatible with mammoth.js style maps:

styleMap: [
    "p[style-name='Heading 1'] => h1",
    "p[style~='Title'] => h2",
    "p[style-name='Quote'][lang='en'] => blockquote"
]

HtmlGeneratorConfig

Pass as htmlConfig inside GeneratorConfig.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| standalone | boolean | true | Wrap output in a full <html> document with CSS |
| chartJsSrc | string | jsDelivr CDN | URL for the Chart.js library |

MdGeneratorConfig

Pass as mdConfig inside GeneratorConfig.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| fallbackToHtml | boolean | true | Use HTML tags for features Markdown cannot represent (underlines, merged table cells, etc.) |

PdfGeneratorConfig

Pass as pdfConfig inside GeneratorConfig. Requires the optional puppeteer peer dependency.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| format | string | 'A4' | Paper format ('A4', 'Letter', 'Legal', etc.) |
| landscape | boolean | false | Landscape page orientation |
| printBackground | boolean | true | Print background graphics |
| margin | object | {0,0,0,0} | Page margins (top, right, bottom, left) |
| displayHeaderFooter | boolean | false | Show print header/footer |
| headerTemplate | string | '' | HTML template for the print header |
| footerTemplate | string | '' | HTML template for the print footer |
| scale | number | 1 | Rendering scale factor |
| launchOptions | object | headless defaults | Puppeteer launch options (e.g., executablePath) |

CsvGeneratorConfig

Pass as csvConfig inside GeneratorConfig.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| sheets | string | '' | Sheet range to export: '1', '1-3', '1,3' (1-based). Empty = all sheets |
| mergeSheets | boolean | true | Merge all sheets into one CSV. If false, returns a ZIP archive |
| columnDelimiter | string | ',' | Output column delimiter |
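
The sheets range syntax can be read as follows. This sketch is one plausible interpretation of the '1', '1-3', '1,3' forms described above, not the library's parser; `parseSheetRange` is a hypothetical function, and silently ignoring out-of-range numbers is a choice made for the example.

```javascript
// Expands a 1-based range spec into a sorted list of sheet numbers.
// An empty spec selects every sheet.
function parseSheetRange(spec, sheetCount) {
    if (spec.trim() === '') {
        return Array.from({ length: sheetCount }, (_, i) => i + 1);
    }
    const selected = new Set();
    for (const part of spec.split(',')) {
        // '3' yields first = last = 3; '1-3' yields first = 1, last = 3.
        const [first, last = first] = part.split('-').map(s => Number(s.trim()));
        for (let n = first; n <= last; n++) {
            if (n >= 1 && n <= sheetCount) selected.add(n);
        }
    }
    return [...selected].sort((a, b) => a - b);
}

console.log(parseSheetRange('1-3', 5)); // [ 1, 2, 3 ]
console.log(parseSheetRange('1,3', 5)); // [ 1, 3 ]
console.log(parseSheetRange('', 3));    // [ 1, 2, 3 ]
```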

TextGeneratorConfig

Pass as textConfig inside GeneratorConfig.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| newlineDelimiter | string | '\n' | String inserted between structural blocks |
| preserveLayout | boolean | false | Render tables with aligned columns using whitespace |

OfficeConverterConfig

Configuration for OfficeConverter.convert(file, format, config).

| Option | Type | Description |
| --- | --- | --- |
| parseConfig | OfficeParserConfig | Settings for the parsing phase |
| generatorConfig | GeneratorConfig | Settings for the generation phase |
| onWarning | (issue: OfficeIssue) => void | Global warning callback (overrides phase-specific ones) |

ChunkingConfig

ChunkingConfig is a discriminated union — the available options depend on the strategy field.

Common Options (all strategies)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | 'document-structure' | Chunking strategy |
| stripWhitespace | boolean | true | Trim leading/trailing whitespace from each chunk |
| includeMetadata | boolean | true | Include page/slide/heading metadata in each chunk |
| addStartIndex | boolean | false | Add startIndex and endIndex character offsets to each chunk |
| lengthFunction | (text) => number | text.length | Custom size measurer (e.g., token counter) |
| sentenceBoundaryRegex | string \| RegExp | /[.!?。!?]/ | Custom regex for sentence boundary detection |
| abbreviations | string[] | common list | Abbreviations to skip when splitting on . |
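
A custom lengthFunction only has to map a string to a number. The sketch below counts whitespace-separated words as a crude stand-in for a real tokenizer; `wordCount` is a hypothetical function, and the comment on chunkSize assumes that size limits are measured with lengthFunction, which the table implies but does not spell out.

```javascript
// Crude stand-in for a token counter: number of whitespace-separated words.
function wordCount(text) {
    const trimmed = text.trim();
    return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Example chunking options using the custom measurer.
const wordBasedChunking = {
    strategy: 'fixed-size',
    chunkSize: 200,     // assumed to be measured in words once lengthFunction is set
    chunkOverlap: 20,
    lengthFunction: wordCount,
};

console.log(wordCount('  three short words ')); // 3
```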

strategy: 'fixed-size'

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunkSize | number | 1000 | Maximum characters per chunk |
| chunkOverlap | number | 200 | Character overlap between consecutive chunks |
| separators | string[] | ['\n\n','\n',' ',''] | Ordered list of separators to try |

strategy: 'document-structure'

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| splitBy | string | 'paragraph' | 'paragraph' · 'heading' · 'page' · 'slide' · 'sheet' |
| maxChunkSize | number | 1000 | Max characters per chunk (oversized units are split recursively) |
| tableSplitStrategy | string | 'row' | 'row' (repeats header in each chunk) or 'flatten' |

strategy: 'semantic'

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| embeddingFunction | (text) => Promise<number[]> | required | Async embedding function |
| similarityThreshold | number | 0.8 | Cosine similarity threshold; lower = fewer boundaries |
| maxChunkSize | number | 2000 | Max characters even if similarity stays high |
| bufferSize | number | 1 | Surrounding sentences used when computing similarity |
| embeddingBatchSize | number | 50 | Sentences per embedding API batch |

OCR Scheduler & Resource Management

When ocr: true is set, officeParser maintains an intelligent Smart Worker Pool backed by Tesseract.js:

  • Dynamic Affinity: Workers persist with their last-used language, avoiding re-initialization overhead.
  • LRU Re-allocation: When a new language is requested and the pool is full, the Least Recently Used idle worker is re-initialized.
  • Auto-Termination: Workers shut down after 10 seconds of inactivity (configurable via ocrConfig.autoTerminateTimeout).
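
The affinity and LRU rules above can be illustrated with a toy model. This is not officeparser's scheduler: `ToyWorkerPool` is a hypothetical class that uses a logical clock, treats every worker as idle, and leaves out auto-termination and real Tesseract workers altogether.

```javascript
// Toy model of language affinity plus LRU re-allocation.
class ToyWorkerPool {
    constructor(maxWorkers) {
        this.maxWorkers = maxWorkers;
        this.workers = []; // entries: { id, language, lastUsed }
        this.clock = 0;    // logical clock instead of real timestamps
    }

    // Returns the id of the worker that would handle the given language.
    acquire(language) {
        this.clock += 1;
        // Dynamic affinity: reuse a worker already initialized for this language.
        let worker = this.workers.find(w => w.language === language);
        if (!worker && this.workers.length < this.maxWorkers) {
            worker = { id: this.workers.length, language, lastUsed: 0 };
            this.workers.push(worker);
        }
        if (!worker) {
            // LRU re-allocation: re-initialize the least recently used worker.
            worker = this.workers.reduce((lru, w) => (w.lastUsed < lru.lastUsed ? w : lru));
            worker.language = language;
        }
        worker.lastUsed = this.clock;
        return worker.id;
    }
}

const pool = new ToyWorkerPool(2);
console.log(pool.acquire('eng')); // 0 (new worker)
console.log(pool.acquire('fra')); // 1 (new worker)
console.log(pool.acquire('eng')); // 0 (reused: same language)
console.log(pool.acquire('deu')); // 1 (the 'fra' worker was least recently used)
```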

OCR Config (ocrConfig)

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| language | string | 'eng' | Tesseract language code(s), e.g. 'eng+fra' |
| workerPath | string | '' | Custom path to Tesseract worker script |
| corePath | string | '' | Custom path to Tesseract core script |
| langPath | string | '' | Custom path for language data files |
| autoTerminateTimeout | number | 10000 | Inactivity timeout in ms before auto-teardown (0 = disabled) |

See all language codes at tesseract-ocr.github.io.

OfficeParser.terminateOcr()

In short-lived scripts (CLI tools, one-off automation), call terminateOcr() after processing to bypass the idle timer and exit immediately:

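const officeParser = require('officeparser');

// CommonJS has no top-level await, so the calls are wrapped in an async function.
async function main() {
    const ast = await officeParser.parseOffice('file.pdf', { ocr: true });
    // ... process results ...
    await officeParser.terminateOcr(); // shut down OCR workers so the process can exit immediately
}

main().catch(console.error);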

[!TIP] The built-in CLI (npx officeparser ...) handles this automatically. Only call it manually in your own scripts.

Browser Usage

Two bundles are available in the dist/ directory:

| Bundle | Usage |
| --- | --- |
| officeparser.browser.mjs | ESM: use with import statements or modern bundlers (Vite, Webpack, Next.js) |
| officeparser.browser.iife.js | IIFE: use with a <script> tag; exposes the global officeParser object |

ESM (Vite / Webpack / Next.js)

import { OfficeParser } from 'officeparser';

const handleFile = async (event) => {
    const file = event.target.files[0];
    const buffer = await file.arrayBuffer();
    const ast = await OfficeParser.parseOffice(new Uint8Array(buffer));
    console.log(ast.toText());
};

Script Tag

<script src="dist/officeparser.browser.iife.js"></script>
<script>
    async function handleFile(event) {
        const file = event.target.files[0];
        const buffer = await file.arrayBuffer();
        const ast = await officeParser.parseOffice(new Uint8Array(buffer));
        console.log(ast.toText());
    }
</script>

[!NOTE] File paths don't work in the browser. Always pass a Buffer, ArrayBuffer, or Uint8Array. Passing a path string will throw a descriptive FEATURE_NOT_SUPPORTED_IN_BROWSER error.

PDF Worker Configuration

When parsing PDFs in the browser, a Web Worker is required. If pdfWorkerSrc is omitted, a jsDelivr CDN link is used automatically:

// Uses default CDN worker:
const ast = await officeParser.parseOffice(pdfArrayBuffer);

// Or specify your own:
const ast = await officeParser.parseOffice(pdfArrayBuffer, {
    pdfWorkerSrc: 'https://cdn.jsdelivr.net/npm/pdfjs-dist@5.6.205/build/pdf.worker.min.mjs'
});

[!NOTE] The pdfjs-dist worker version must match the version bundled with officeparser (currently 5.6.205).
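
Keeping the version in one constant makes it harder for the worker URL and the bundled pdfjs-dist version to drift apart. A minimal sketch: `pdfWorkerUrl` is a hypothetical helper, and the URL pattern is copied from the example above.

```javascript
// Version stated in the note above; update it when officeparser bumps pdfjs-dist.
const PDFJS_VERSION = '5.6.205';

// Builds the jsDelivr worker URL for a given pdfjs-dist version.
function pdfWorkerUrl(version = PDFJS_VERSION) {
    return `https://cdn.jsdelivr.net/npm/pdfjs-dist@${version}/build/pdf.worker.min.mjs`;
}

console.log(pdfWorkerUrl());
// https://cdn.jsdelivr.net/npm/pdfjs-dist@5.6.205/build/pdf.worker.min.mjs
```

The returned string can be passed as pdfWorkerSrc in the parser config.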

Troubleshooting & Common Issues

| Symptom | Fix |
| --- | --- |
| Node.js process stays alive after finishing | Call await officeParser.terminateOcr() at end of script when OCR was used |
| "Worker not found" in browser for PDF | Verify pdfWorkerSrc points to pdf.worker.min.mjs matching version 5.6.205 |
| Low OCR accuracy | Verify ocrConfig.language matches the document language; quality depends on image resolution |
| Out of memory on large Excel files | Call ast.toText() early and discard the AST object to allow garbage collection |
| md/html/csv buffer not detected | Add fileType: 'md' (or 'html', 'csv') to config; these formats have no magic bytes |
| IMPROPER_BUFFERS error | Usually means no file extension and no fileType hint was provided for a buffer input |
| PDF generation fails | Install the optional peer dependency: npm install puppeteer |

For a full debugging guide, visit the Live Documentation.

Known Limitations

  • ODT/ODS Charts: May show inaccurate data when the chart references external cell ranges or uses complex layout-based data.
  • PDF Images (Browser): Extracted as BMP files for cross-platform compatibility. Conversion is automatic.
  • RTF Notes: putNotesAtLast has no effect for RTF files; footnotes and endnotes are always appended at the end.

npm: https://npmjs.com/package/officeparser

GitHub: https://github.com/harshankur/officeParser

Support the Project

If officeParser has helped you save time, consider supporting its continued development. Your sponsorship helps maintain the project, add new features, and keep it robust for everyone.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for details.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Package last updated on 15 May 2026
