Destiny Ed

Posted on May 14

DocTextExtractor: A Flutter Package to Extract Text from Word, PDF, Google Docs, and Markdown

#programming #flutter #dart #google

How I built DocTextExtractor to power NotteChat's AI-powered document chat, and how you can integrate it into your own Flutter apps.

As a Flutter developer with a passion for simplifying complex problems, I created DocTextExtractor—a lightweight, open-source Dart package that extracts text from .doc, .docx, .pdf, Google Docs URLs, and .md files.

This tool was born from the challenges I faced while building NotteChat, an app that allows users to chat with document content using AI. In this article, I’ll share how I built DocTextExtractor, why it matters, and how you can integrate it into your own Flutter projects.

Why I Built DocTextExtractor

NotteChat lets students, professionals, and educators interact with documents conversationally. Users paste a document URL or upload a file, then summarize, explore, or ask questions about the content using AI.

However, supporting multiple document formats (.doc, .docx, PDF, Google Docs, Markdown) posed a serious challenge. Most existing Flutter solutions only worked for specific formats, and there was no unified solution.

So I built DocTextExtractor with these goals:

Support .doc, .docx, .pdf, .md, and Google Docs URLs
Handle both local files and URLs
Enable offline parsing for .doc and .md
Power AI by providing clean, structured text
Extract real filenames for better UX

Building DocTextExtractor

1. Identifying the Need

The core feature of NotteChat—chatting with document content—meant I needed a consistent way to extract text, regardless of format or source.

Key Requirements:

Unified API for all formats
Clean filename extraction
Minimal dependencies
Cross-platform support (iOS, Android, Web)

2. Choosing the Tech Stack

I relied on the trusted Flutter/Dart ecosystem with the following tools and packages:

http: Fetch documents via URLs
syncfusion_flutter_pdf: Parse and extract PDFs
archive + xml: Extract from .docx and .doc
markdown: Convert .md to plain text
VS Code + GitHub for development and version control

GitHub repo: github.com/Destiny-Ed/doc_text_extractor

3. Designing the Core Logic

At the heart of the package is the TextExtractor class with a single extractText() method.

Key Features:

Unified Return Type: A Dart record with text and filename fields for easy use
Smart Format Detection: Checks HTTP Content-Type or file extension
Offline Support: No internet required for .doc and .md
Error Handling: Friendly exceptions (e.g., "Unsupported document type")
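
To make the format detection idea concrete, here is a minimal sketch in plain Dart. It is not the package's actual code: the DocFormat enum and the detectFormat helper are hypothetical names, and the list of MIME types is my assumption about what a server would typically send.

```dart
/// Document kinds named in this article.
enum DocFormat { doc, docx, pdf, markdown, unknown }

// Assumed Content-Type values for each format (illustrative, not exhaustive).
const Map<String, DocFormat> _mimeToFormat = {
  'application/pdf': DocFormat.pdf,
  'application/msword': DocFormat.doc,
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
      DocFormat.docx,
  'text/markdown': DocFormat.markdown,
};

/// Hypothetical helper: checks the HTTP Content-Type value first, then
/// falls back to the file extension of [pathOrUrl].
DocFormat detectFormat(String pathOrUrl, {String? contentType}) {
  // Drop parameters such as "; charset=utf-8" before the lookup.
  final mime = contentType?.split(';').first.trim().toLowerCase();
  final byMime = _mimeToFormat[mime];
  if (byMime != null) return byMime;

  // Use only the path so a query string does not hide the extension.
  final path = (Uri.tryParse(pathOrUrl)?.path ?? pathOrUrl).toLowerCase();
  if (path.endsWith('.pdf')) return DocFormat.pdf;
  if (path.endsWith('.docx')) return DocFormat.docx;
  if (path.endsWith('.doc')) return DocFormat.doc;
  if (path.endsWith('.md')) return DocFormat.markdown;
  return DocFormat.unknown;
}
```

A caller would treat DocFormat.unknown as the trigger for the "Unsupported document type" error mentioned above.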

4. Format-Specific Logic

Each format was tackled with custom logic:

.doc: No existing Dart parser, so I created one using raw XML parsing
.docx: Unzipped and parsed word/document.xml
.md: Used markdown package for plain-text conversion
PDF: Parsed using syncfusion_flutter_pdf
Google Docs: Converted /edit URLs to /export?format=pdf and parsed as PDF
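
The Google Docs step is the easiest one to illustrate. The sketch below shows one way to rewrite an /edit link into an /export?format=pdf link using only dart:core. The function name googleDocsPdfExportUrl is hypothetical, and the real package may handle more URL shapes than this.

```dart
/// Hypothetical helper: rewrites a Google Docs link such as
/// https://docs.google.com/document/d/<id>/edit into its PDF export URL.
/// Returns null when [url] is not a Google Docs document link.
Uri? googleDocsPdfExportUrl(String url) {
  final uri = Uri.tryParse(url);
  if (uri == null || uri.host != 'docs.google.com') return null;

  // Expected path segments: ["document", "d", "<id>", "edit"].
  final segments = uri.pathSegments;
  if (segments.length < 3 || segments[0] != 'document' || segments[1] != 'd') {
    return null;
  }

  final id = segments[2];
  if (id.isEmpty) return null;

  // Any original query string (e.g. ?usp=sharing) is dropped on purpose.
  return Uri.https(
    'docs.google.com',
    '/document/d/$id/export',
    {'format': 'pdf'},
  );
}
```

The returned URL can then be downloaded and handed to the same PDF path as any other PDF. Note that the export only works without sign-in if the document is shared publicly.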

5. Filename Extraction

To enhance UX, I added a _extractFilename() method that pulls names from:

Content-Disposition headers (e.g., filename="report.docx")
URL segments (e.g., https://example.com/readme.md)
Google Docs metadata (fallback if unavailable)
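
Here is a rough sketch of that lookup order in plain Dart. It is not the package's _extractFilename() itself: guessFilename and its fallback parameter are hypothetical, and the regular expression only covers the simple filename=... form, not the extended filename*= form.

```dart
/// Hypothetical helper: picks a filename from a Content-Disposition header
/// value, then from the last URL path segment, else returns [fallback].
String guessFilename(
  String url, {
  String? contentDisposition,
  String fallback = 'document',
}) {
  if (contentDisposition != null) {
    // Matches filename="report.docx" as well as unquoted filename=report.docx.
    final match = RegExp(r'filename="?([^";]+)"?', caseSensitive: false)
        .firstMatch(contentDisposition);
    final name = match?.group(1)?.trim();
    if (name != null && name.isNotEmpty) return name;
  }

  final segments = Uri.tryParse(url)?.pathSegments ?? const <String>[];
  // Skip empty segments produced by a trailing slash.
  final last = segments.lastWhere((s) => s.isNotEmpty, orElse: () => '');
  // Treat the segment as a filename only if it has an extension.
  if (last.contains('.')) return last;
  return fallback;
}
```

For a Google Docs link the last segment is "edit", which has no extension, so the sketch falls through to the fallback name.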

6. Testing & Refinement

I tested with:

.doc: Legacy Word files
.docx: Modern reports
.md: GitHub README files
PDF: Academic papers
Google Docs: Shared documents

Edge cases included:

Missing headers
Large files (>10MB)
Unsupported formats

7. Publishing as a Package

To make DocTextExtractor reusable, I published it on pub.dev, which is also one of the reasons I'm writing this article.

Published on Pub.dev
MIT Licensed
Included example app
Wrote a detailed README with usage examples

Why It Matters to NotteChat

DocTextExtractor is the backbone of NotteChat's AI-powered chat with documents.

It enables:

AI Chat: Clean text fed into AI (e.g., "Summarize this PDF")
Offline Use: Great for areas with limited internet
Smart UX: Real filenames and helpful error messages
Versatile Support: For modern and legacy users

How to Use DocTextExtractor in Flutter

Step 1: Add the Dependency

Add doc_text_extractor to your app’s pubspec.yaml.

dependencies:
  doc_text_extractor: ^1.0.0

Run:
flutter pub get

Step 2: Import and Initialize

Import the package and create a TextExtractor instance:

import 'package:doc_text_extractor/doc_text_extractor.dart';

final extractor = TextExtractor();

Step 3: Extract Text from a URL

Use extractText to process URLs for .doc, .docx, .md, PDF, or Google Docs:

Future<void> processDocumentUrl(String url) async {
  try {
    final result = await extractor.extractText(url);
    final text = result.text;
    final filename = result.filename;
    print('Filename: $filename');
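    // Guard the preview: substring(0, 100) throws on texts shorter than 100 characters.
    print('Text: ${text.length > 100 ? text.substring(0, 100) : text}...');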
    // Pass text to AI service (e.g., for NotteChat’s AI chat)
  } catch (e) {
    print('Error: $e');
    // Show user-friendly error (e.g., "Please convert .doc to .docx")
  }
}

Example usage

processDocumentUrl('https://raw.githubusercontent.com/user/repo/main/README.md');
processDocumentUrl('https://docs.google.com/document/d/EXAMPLE_ID/edit');

Step 4: Extract Text from a Local File

For local files (e.g., user-uploaded .md or .doc), set isUrl: false:

import 'package:path_provider/path_provider.dart';
import 'dart:io';

Future<void> processLocalFile(String filePath) async {
  try {
    final result = await extractor.extractText(filePath, isUrl: false);
    final text = result.text;
    final filename = result.filename;
    print('Filename: $filename');
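    // Guard the preview: substring(0, 100) throws on texts shorter than 100 characters.
    print('Text: ${text.length > 100 ? text.substring(0, 100) : text}...');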
    // Use text in app logic
  } catch (e) {
    print('Error: $e');
  }
}

Example usage

// Inside an async function:
final dir = await getTemporaryDirectory();
await processLocalFile('${dir.path}/sample.md');

Step 5: Integrate with your preferred AI API

You can now use the extracted text in your app with AI tools like OpenAI, Gemini, or Sonar APIs.

class ChatScreen extends StatelessWidget {
  const ChatScreen({super.key});

  Future<void> _handleDocument(String url) async {
    final result = await extractor.extractText(url);
    final text = result.text;
    final filename = result.filename;

    // Update session title
    final sessionTitle = 'Session ${DateTime.now().toIso8601String().split('T')[0]} - $filename';

    // Summarize with AI (e.g., Sonar API).
    // SonarService is not defined in this article; substitute your own AI client.
    final sonarService = SonarService();
    final summary = await sonarService.queryDocument(text, 'Summarize this document');

    // Display in UI
    print('Session: $sessionTitle');
    print('Summary: $summary');
  }

  @override
  Widget build(BuildContext context) {
    // UI omitted for brevity; call _handleDocument from a button or form.
    return const SizedBox.shrink();
  }
}

Step 6: Enhance UX with Error Handling

Show a loading dialog for large files and turn exceptions into user-friendly messages. For example, inside a catch (e) block where a BuildContext is available:

if (e.toString().contains('Unsupported document type')) {
  ScaffoldMessenger.of(context).showSnackBar(
    SnackBar(content: Text('Unsupported format. Try converting to .docx or PDF.')),
  );
}
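
If you handle more than one kind of failure, it can help to keep the message mapping in one place. The sketch below is a plain Dart helper with a hypothetical name, friendlyErrorMessage. Only the "Unsupported document type" text comes from the package as described above; the timeout branch and the generic fallback are my own illustrative assumptions.

```dart
import 'dart:async';

/// Hypothetical helper: maps an extraction error to a message for the user.
String friendlyErrorMessage(Object error) {
  // The package reports unsupported formats through the exception text.
  if (error.toString().contains('Unsupported document type')) {
    return 'Unsupported format. Try converting to .docx or PDF.';
  }
  // Assumption: the caller wraps downloads in a timeout.
  if (error is TimeoutException) {
    return 'The document took too long to load. Please try again.';
  }
  // Generic fallback for anything else.
  return 'Could not read this document. Please try another file.';
}
```

In the catch block you would then pass friendlyErrorMessage(e) to the SnackBar's Text widget instead of checking strings inline.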

Final Thoughts

DocTextExtractor started as a necessity for NotteChat but evolved into a powerful, standalone Flutter package. It’s now available for anyone building document-based apps, AI tools, or productivity platforms.

Try it out: https://pub.dev/packages/doc_text_extractor
View the source: https://github.com/Destiny-Ed/doc_text_extractor

If you found this helpful or end up using the package, feel free to drop a ⭐ on GitHub or share your feedback. I’d love to hear how you’re using it!

Happy coding!