Destiny Ed

DocTextExtractor: A Flutter Package to Extract Text from Word, PDF, Google Docs, and Markdown

How I built DocTextExtractor to power NotteChat's AI-powered document chat, and how you can integrate it into your own Flutter apps.

As a Flutter developer with a passion for simplifying complex problems, I created DocTextExtractor—a lightweight, open-source Dart package that extracts text from .doc, .docx, .pdf, Google Docs URLs, and .md files.

This tool was born from the challenges I faced while building NotteChat, an app that allows users to chat with document content using AI. In this article, I’ll share how I built DocTextExtractor, why it matters, and how you can integrate it into your own Flutter projects.

Why I Built DocTextExtractor

NotteChat empowers students, professionals, and educators to interact with documents conversationally. Users simply paste a document URL or upload a file, and can then summarize, explore, or ask questions about the content using AI.

However, supporting multiple document formats (.doc, .docx, PDF, Google Docs, Markdown) posed a serious challenge. Most existing Flutter solutions only worked for specific formats, and there was no unified solution.

So I built DocTextExtractor, with these goals:

  • Support .doc, .docx, .pdf, .md, and Google Docs URLs
  • Handle both local files and URLs
  • Enable offline parsing for .doc and .md
  • Power AI by providing clean, structured text
  • Extract real filenames for better UX

Building DocTextExtractor

1. Identifying the Need

The core feature of NotteChat—chatting with document content—meant I needed a consistent way to extract text, regardless of format or source.

Key Requirements:

  • Unified API for all formats
  • Clean filename extraction
  • Minimal dependencies
  • Cross-platform support (iOS, Android, Web)

2. Choosing the Tech Stack

I relied on the trusted Flutter/Dart ecosystem with the following tools and packages:

  • http: Fetch documents via URLs
  • syncfusion_flutter_pdf: Parse and extract PDFs
  • archive + xml: Extract from .docx and .doc
  • markdown: Convert .md to plain text
  • VS Code + GitHub for development and version control

GitHub repo: github.com/Destiny-Ed/doc_text_extractor

3. Designing the Core Logic

At the heart of the package is the TextExtractor class with a single extractText() method.

Key Features:

  • Unified Return Type: A Dart record with text and filename fields, for easy use
  • Smart Format Detection: Checks HTTP Content-Type or file extension
  • Offline Support: No internet required for .doc and .md
  • Error Handling: Friendly exceptions (e.g., "Unsupported document type")
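
The format detection described above could be sketched roughly as follows. This is not the package's actual code: the DocFormat enum, the detectFormat name, and the exact MIME types checked are assumptions made for illustration.

```dart
/// Formats named in this article. The enum and function below are
/// illustrative assumptions, not the package's actual internals.
enum DocFormat { doc, docx, pdf, markdown }

/// Picks a format from an HTTP Content-Type header when one is given,
/// otherwise from the file extension of [source] (a URL or local path).
DocFormat detectFormat(String source, {String? contentType}) {
  // Drop parameters such as "; charset=utf-8" before matching.
  final mime = contentType?.split(';').first.trim().toLowerCase();
  switch (mime) {
    case 'application/pdf':
      return DocFormat.pdf;
    case 'application/msword':
      return DocFormat.doc;
    case 'application/vnd.openxmlformats-officedocument.wordprocessingml.document':
      return DocFormat.docx;
    case 'text/markdown':
      return DocFormat.markdown;
  }

  // Fall back to the extension; using the URI path ignores any query string.
  final path = (Uri.tryParse(source)?.path ?? source).toLowerCase();
  if (path.endsWith('.pdf')) return DocFormat.pdf;
  if (path.endsWith('.docx')) return DocFormat.docx;
  if (path.endsWith('.doc')) return DocFormat.doc;
  if (path.endsWith('.md')) return DocFormat.markdown;

  throw UnsupportedError('Unsupported document type: $source');
}
```

Checking the header first matters for download links that carry no extension, while the extension check covers local files, which have no headers at all.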

4. Format-Specific Logic

Each format was tackled with custom logic:

  • .doc: No existing Dart parser, so I created one using raw XML parsing
  • .docx: Unzipped and parsed word/document.xml
  • .md: Used markdown package for plain-text conversion
  • PDF: Parsed using syncfusion_flutter_pdf
  • Google Docs: Converted /edit URLs to /export?format=pdf and parsed as PDF
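
As an illustration of the Google Docs step, here is a minimal sketch of the URL rewrite using only dart:core. The function name toGoogleDocsPdfExportUrl is hypothetical, and the sketch assumes links of the form docs.google.com/document/d/<id>/...; the package may handle this differently.

```dart
/// Rewrites a Google Docs link such as
/// https://docs.google.com/document/d/<id>/edit
/// into its PDF export form; any other URL is returned unchanged.
/// The function name is hypothetical, not part of the package's API.
String toGoogleDocsPdfExportUrl(String url) {
  final uri = Uri.tryParse(url);
  if (uri == null || uri.host != 'docs.google.com') return url;

  // Expected path segments: ['document', 'd', '<id>', 'edit'].
  final segments = uri.pathSegments;
  if (segments.length < 3 ||
      segments[0] != 'document' ||
      segments[1] != 'd' ||
      segments[2].isEmpty) {
    return url;
  }

  // Rebuild the URL so existing query parameters (e.g. ?usp=sharing) are dropped.
  return Uri.https(
    'docs.google.com',
    '/document/d/${segments[2]}/export',
    {'format': 'pdf'},
  ).toString();
}
```

The resulting URL can then go through the same PDF path as any other PDF link. This presumably only works for documents shared so that they can be fetched without signing in.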

5. Filename Extraction

To enhance UX, I added a _extractFilename() method that pulls names from:

  • Content-Disposition headers (e.g., filename="report.docx")
  • URL segments (e.g., https://example.com/readme.md)
  • Google Docs metadata (with a generic fallback name if unavailable)
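
The lookup order above might be sketched like this. The function is a hypothetical stand-in for the private _extractFilename(), whose real signature is not shown in this article; the 'document' fallback value is also an assumption, and only the simple filename="..." header form is handled.

```dart
/// Resolves a display filename for a downloaded document.
/// A hypothetical stand-in for the package's private _extractFilename().
String extractFilename(
  String url, {
  String? contentDisposition,
  String fallback = 'document',
}) {
  // 1. Prefer the server-supplied name, e.g. attachment; filename="report.docx".
  if (contentDisposition != null) {
    final match = RegExp(r'filename="?([^";]+)"?', caseSensitive: false)
        .firstMatch(contentDisposition);
    final name = match?.group(1)?.trim();
    if (name != null && name.isNotEmpty) return name;
  }

  // 2. Otherwise use the last non-empty URL path segment if it looks like a file.
  final segments = Uri.tryParse(url)?.pathSegments ?? const <String>[];
  final last = segments.lastWhere((s) => s.isNotEmpty, orElse: () => '');
  if (last.contains('.')) return last;

  // 3. Nothing usable: return the caller's fallback name.
  return fallback;
}
```

A Google Docs link ends in a segment like edit, which has no extension, so it falls through to the fallback unless a header supplies a name.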

6. Testing & Refinement

I tested with:

  • .doc: Legacy Word files
  • .docx: Modern reports
  • .md: GitHub README files
  • PDF: Academic papers
  • Google Docs: Shared documents

Edge cases included:

  • Missing headers
  • Large files (>10MB)
  • Unsupported formats

7. Publishing as a Package

To make DocTextExtractor reusable, I published it on pub.dev, which is also one of the reasons I'm writing this article:

  • Published on Pub.dev
  • MIT Licensed
  • Included example app
  • Wrote a detailed README with usage examples

Why It Matters to NotteChat

DocTextExtractor is the backbone of NotteChat's AI-powered chat with documents.

It enables:

  • AI Chat: Clean text fed into AI (e.g., "Summarize this PDF")
  • Offline Use: Great for areas with limited internet
  • Smart UX: Real filenames and helpful error messages
  • Versatile Support: For modern and legacy users

How to Use DocTextExtractor in Flutter

Step 1: Add the Dependency

Add doc_text_extractor to your app’s pubspec.yaml.

dependencies:
  doc_text_extractor: ^1.0.0

Run:
flutter pub get

Step 2: Import and Initialize

Import the package and create a TextExtractor instance:

import 'package:doc_text_extractor/doc_text_extractor.dart';

final extractor = TextExtractor();

Step 3: Extract Text from a URL

Use extractText to process URLs for .doc, .docx, .md, PDF, or Google Docs:

Future<void> processDocumentUrl(String url) async {
  try {
    final result = await extractor.extractText(url);
    final text = result.text;
    final filename = result.filename;
    print('Filename: $filename');
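    // Guard the preview: substring(0, 100) throws on texts shorter than 100 characters.
    print('Text: ${text.length > 100 ? text.substring(0, 100) : text}...');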
    // Pass text to AI service (e.g., for NotteChat’s AI chat)
  } catch (e) {
    print('Error: $e');
    // Show user-friendly error (e.g., "Please convert .doc to .docx")
  }
}

Example usage

processDocumentUrl('https://raw.githubusercontent.com/user/repo/main/README.md');
processDocumentUrl('https://docs.google.com/document/d/EXAMPLE_ID/edit');

Step 4: Extract Text from a Local File

For local files (e.g., user-uploaded .md or .doc), set isUrl: false:

import 'package:path_provider/path_provider.dart';
import 'dart:io';

Future<void> processLocalFile(String filePath) async {
  try {
    final result = await extractor.extractText(filePath, isUrl: false);
    final text = result.text;
    final filename = result.filename;
    print('Filename: $filename');
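    // Guard the preview: substring(0, 100) throws on texts shorter than 100 characters.
    print('Text: ${text.length > 100 ? text.substring(0, 100) : text}...');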
    // Use text in app logic
  } catch (e) {
    print('Error: $e');
  }
}

Example usage

final dir = await getTemporaryDirectory();
processLocalFile('${dir.path}/sample.md');

Step 5: Integrate with your preferred AI API

You can now use the extracted text in your app with AI tools like OpenAI, Gemini, or Sonar APIs.

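// SonarService stands in for your own AI client wrapper; it is assumed
// not to ship with doc_text_extractor.
class ChatScreen extends StatelessWidget {
  const ChatScreen({super.key});

  Future<void> _handleDocument(String url) async {
    final result = await extractor.extractText(url);
    final text = result.text;
    final filename = result.filename;

    // Update session title, e.g. "Session 2025-01-31 - report.pdf"
    final sessionTitle = 'Session ${DateTime.now().toIso8601String().split('T')[0]} - $filename';

    // Summarize with AI (e.g., Sonar API)
    final sonarService = SonarService();
    final summary = await sonarService.queryDocument(text, 'Summarize this document');

    // Display in UI
    print('Session: $sessionTitle');
    print('Summary: $summary');
  }

  // A StatelessWidget must implement build(); this minimal UI just triggers the handler.
  @override
  Widget build(BuildContext context) {
    return Scaffold(
      body: Center(
        child: ElevatedButton(
          onPressed: () => _handleDocument('https://example.com/report.pdf'),
          child: const Text('Summarize document'),
        ),
      ),
    );
  }
}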

Step 6: Enhance UX with Error Handling

Add loading dialogs for large files and user-friendly errors:

// Inside the catch (e) block of a widget method where a BuildContext is available:
if (e.toString().contains('Unsupported document type')) {
  ScaffoldMessenger.of(context).showSnackBar(
    const SnackBar(content: Text('Unsupported format. Try converting to .docx or PDF.')),
  );
}
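
If several error types need friendly wording, the string check can be pulled into one helper so the UI code stays small. A minimal sketch follows; the function name friendlyErrorMessage, the FormatException branch, and the exact message wording are assumptions, and only the "Unsupported document type" text comes from the package as described here.

```dart
/// Maps an extraction error to a message suitable for a SnackBar or dialog.
/// Only the "Unsupported document type" text comes from the article; the
/// other branch and all message wording are illustrative assumptions.
String friendlyErrorMessage(Object error) {
  final details = error.toString();
  if (details.contains('Unsupported document type')) {
    return 'Unsupported format. Try converting to .docx or PDF.';
  }
  if (error is FormatException) {
    return 'The document could not be read. It may be corrupted.';
  }
  return 'Something went wrong while reading the document. Please try again.';
}
```

In a catch block this becomes SnackBar(content: Text(friendlyErrorMessage(e))), which keeps raw exception text away from users.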

Final Thoughts

DocTextExtractor started as a necessity for NotteChat but evolved into a powerful, standalone Flutter package. It’s now available for anyone building document-based apps, AI tools, or productivity platforms.

Try it out: https://pub.dev/packages/doc_text_extractor
View the source: github.com/Destiny-Ed/doc_text_extractor

If you found this helpful or end up using the package, feel free to drop a ⭐ on GitHub or share your feedback. I’d love to hear how you’re using it!

Happy coding!

#DocTextExtractor #NotteChat #Flutter #AI #DestinyEd #Dart
