We’re back with a substantial update to DocWire SDK, the modern C++ library for structured document parsing, data extraction and secure, high-performance back-end workflows.
Version 2025.06.19 focuses on sharper OCR, more faithful PDF layout reconstruction and a brand-new archive module, alongside testing and CI upgrades.
Full release notes: https://github.com/docwire/docwire/releases/tag/2025.06.19
What’s New
1 · OCR Enhancements
- Structured output with positional metadata: OCRParser now returns x, y, width, height plus line, paragraph and section grouping.
- Configurable confidence filter (0–100) to ignore low-confidence words.
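To give a feel for how positional metadata and a confidence threshold can be combined downstream, here is a minimal sketch. The `OcrWord` struct and `filter_by_confidence` function are hypothetical stand-ins, not DocWire SDK types. Only the field set (x, y, width, height, and a 0–100 confidence) mirrors what this release describes.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical word record for illustration; NOT a DocWire SDK type.
struct OcrWord {
    std::string text;
    double x;          // left edge of the bounding box
    double y;          // top edge of the bounding box
    double width;
    double height;
    float confidence;  // recognition confidence on a 0-100 scale
};

// Return only the words whose confidence is at or above min_confidence.
std::vector<OcrWord> filter_by_confidence(const std::vector<OcrWord>& words,
                                          float min_confidence) {
    std::vector<OcrWord> kept;
    std::copy_if(words.begin(), words.end(), std::back_inserter(kept),
                 [min_confidence](const OcrWord& w) {
                     return w.confidence >= min_confidence;
                 });
    return kept;
}
```

Calling `filter_by_confidence(words, 75.0f)` keeps words scored 75 and above and preserves their original order, so the bounding boxes can still be used for layout afterwards.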
2 · Higher-Fidelity PDF Parsing
- Refactored PDFParser to sort elements by position, yielding more accurate text flow and layout reconstruction.
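The general idea of position-based ordering can be sketched as follows. This is not PDFParser's actual implementation, which the release notes do not detail. `PageElement`, `sort_reading_order` and the `row_height` parameter are assumptions made for illustration, as is the coordinate system (origin at the top-left, y growing downward).

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical page element; NOT a DocWire SDK type.
// Assumes a top-left origin with y increasing downward.
struct PageElement {
    std::string text;
    double x;
    double y;
};

// Order elements top-to-bottom, then left-to-right.
// Elements are grouped into rows by quantizing y into buckets of
// row_height, which keeps the comparison a valid strict weak ordering.
void sort_reading_order(std::vector<PageElement>& elements, double row_height) {
    auto key = [row_height](const PageElement& e) {
        return std::make_tuple(std::floor(e.y / row_height), e.x);
    };
    std::stable_sort(elements.begin(), elements.end(),
                     [&key](const PageElement& a, const PageElement& b) {
                         return key(a) < key(b);
                     });
}
```

Fixed-size buckets are a simplification: two elements on the same visual line can land in different buckets if they straddle a boundary, so a production parser would likely need smarter line grouping.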
3 · Modern Archive Handling
- New docwire_archives library for modular, maintainable and faster archive processing.
- Archive detection is now MIME-based.
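As a rough illustration of what MIME-based detection means in practice, the sketch below maps well-known leading byte signatures to MIME types instead of trusting the file extension. The function name `sniff_archive_mime` is hypothetical, and DocWire's own detection logic is not shown in this post and may work differently. The signatures themselves (ZIP, gzip, 7z) are the standard ones.

```cpp
#include <array>
#include <string>
#include <string_view>

// Hypothetical helper, NOT part of docwire_archives: guess an archive
// MIME type from the first bytes of the data.
std::string sniff_archive_mime(std::string_view data) {
    using namespace std::string_view_literals;
    struct Signature {
        std::string_view magic;  // leading bytes that identify the format
        const char* mime;
    };
    static const std::array<Signature, 3> signatures = {{
        {"PK\x03\x04"sv, "application/zip"},
        {"\x1F\x8B"sv, "application/gzip"},
        {"7z\xBC\xAF\x27\x1C"sv, "application/x-7z-compressed"},
    }};
    for (const auto& s : signatures) {
        // substr clamps to the available length, so short inputs are safe.
        if (data.substr(0, s.magic.size()) == s.magic) {
            return s.mime;
        }
    }
    return "application/octet-stream";  // unknown or not an archive
}
```

Dispatching on the detected MIME type rather than the extension means a renamed or extension-less archive is still routed to the right handler.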
4 · Expanded Format Support
- Automatic detection for ASP and ASP.NET documents.
Developer-Centric Improvements
- Plain-text exporter handles page breaks more clearly.
- CI pipeline moves to windows-2025 runners; ASAN re-enabled on Windows.
- Broader automated test coverage (OCR, HTTP, CLI).
- Build fix on Windows via the NOMINMAX flag to resolve windows.h / PDFium conflicts.
- Spacing and line-break corrections in PDF and OCR outputs.
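For readers who have not run into the NOMINMAX issue: unless that macro is defined, windows.h defines function-like `min` and `max` macros that break code using `std::min` and `std::max`. The sketch below shows the usual pattern in source form. How DocWire applies the flag (for example through the build system) is not stated in this post, and `clamp_page_index` is a hypothetical helper used only to exercise the calls.

```cpp
// Define NOMINMAX before including <windows.h> so the header does not
// define the min/max macros that clash with the standard library.
#ifdef _WIN32
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#endif

#include <algorithm>

// Hypothetical helper: clamp a page index into [0, page_count - 1].
// Without NOMINMAX, these std::min/std::max calls could fail to compile
// on Windows because the macros would rewrite them.
int clamp_page_index(int index, int page_count) {
    return std::max(0, std::min(index, page_count - 1));
}
```

Defining the macro once as a compiler flag for the whole target is generally more robust than repeating the `#define` in each source file, because it does not depend on include order.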
Documentation
API docs and module dependency notes are fully up to date.
OCR’s vision, sharp and newly bright
PDF layouts, now a clearer sight
Archives rebuilt, with structure firm and new
DocWire advances, steady, strong, and true
Try It Now
- GitHub repo – https://github.com/docwire/docwire
- Latest release – https://github.com/docwire/docwire/releases/tag/2025.06.19
- SourceForge – https://sourceforge.net/projects/docwire/files/2025.06.19/
We welcome feedback, issues and PRs.
Next up: deeper LLM integration and vcpkg support.
— The DocWire Team