Academia.edu

Document Engineering

674 papers
72 followers
About this topic
Document Engineering is the interdisciplinary field focused on the design, creation, management, and analysis of documents and document-centric systems. It encompasses methodologies and technologies for structuring, processing, and utilizing documents to enhance information retrieval, communication, and knowledge management.

Key research themes

1. How can document structure modeling and ontology-based data organization reduce ambiguity and improve data management in digital documents?

This theme addresses methods for representing and managing the structure and semantics of digital documents to ensure data is stored and retrieved unambiguously. It tackles issues arising from traditional relational database models that often lose context due to normalization and lack of redundancy. By adopting ontology-oriented data management and advanced metamodels, this research area aims to enable compact, context-rich, and precise digital document aggregation, which is vital for efficient information processing and knowledge management.

Key finding: Introduces a novel metamodel that treats digital documents as compact aggregates classified as objects or event descriptions with explicit context, overcoming relational model drawbacks like data context loss and redundancy...
Key finding: Proposes a unifying document management model distinguishing intellectual content, logical layout, and physical presentation, explicitly separating semantics from presentation. This abstraction facilitates handling both...
Key finding: Presents an algebraic approach to document specification and processing where documents are formalized as algebraic terms and document types as algebraic models. Functions representing document processing tasks operate over...

2. What are effective computational methods for automatic document structure analysis and transformation to support information retrieval and reuse?

This theme focuses on algorithmic and representational techniques for automatic detection, recognition, and transformation of document structures—such as paragraphs, tables, and semantic zones—especially in formats like PDF or scanned images. It includes methodologies to convert unstructured or semi-structured digital documents into machine-readable, semantically enriched, and ontologically represented content, facilitating improved retrieval, reuse, and cross-application interoperability.

Key finding: Proposes an intelligent approach to automatically identify and recognize layout structures (paragraphs, tables) in PDF documents and transform them into ontological representations. Experimental evaluation on construction...
Key finding: Demonstrates that employing small-step pipelines (decomposing data curation and transformation into single-purpose, sequential XSLT/XPath programs) significantly reduces cyclomatic complexity (up to 2.5x) and increases...
Key finding: Presents the resource-selector-link (RSL) hypermedia metamodel as a general and flexible framework for document management that supports advanced hypermedia and cross-media functionalities at the operating system level. The...

3. How can user interactions and end-user behavior impact document quality and processing efficiency in natural language digital documents?

This research area studies the effects of end-user activities on the quality and resource costs of text-based digital documents, emphasizing the role of user proficiency in computational thinking and document handling. It investigates logging of user actions during document creation and editing to quantify the extra effort incurred by malformatted or erroneous documents, thereby proposing methodologies for pretesting, error detection, and improved user guidance to reduce inefficiencies and financial losses.

Key finding: Finds that editing erroneous, poorly formatted texts requires about five times more human and machine resources compared to properly formatted versions for short paragraphs. By logging every atomic user input, the study...
Key finding: Shows that integrating intranet technologies, databases, and CAD software to deliver facility documentation enhances accessibility, usability, and reuse of complex operation documents compared to traditional paper-based...
Key finding: Evaluates how controlled language checks improve source text translatability in technical documentation, showing that compliance with controlled language rules not only enhances readability and consistency but also...

All papers in Document Engineering

Aligning the software process and the documentation process is a recipe for having both software and documentation in synchrony where changes in software seamlessly ripple along its documentation counterpart. This paper focuses on...
Keywords: XSLT, SVG, XML trees, functional programming. XML is a tree-oriented meta-language and visual description of XML structures often involves the construction of visual trees. These trees may use a variety of graphics for chosen elements and...
In this paper we propose a technique of limiarization (also known as thresholding or binarization) tailored to improve the readability of degraded historical documents. Limiarization is a simple image processing technique, which is...
Scholarly reading often involves engaging with various supplementary materials beyond PDFs to support understanding. In practice, scholars frequently incorporate such external materials into their reading workflow through annotation....
This paper examines the document aspects of object-based broadcasting. Object-based broadcasting augments traditional video and audio broadcast content with additional (temporally-constrained) media objects. The content of these objects...
This paper explores the suitability of structured (and declarative) multimedia document formats for supporting a novel type of performing arts: distributed theatre. In distributed theatre, the actors are split between two (or more)...
In spite of the high profile of media types such as video, audio and images, many multimedia presentations rely extensively on text content. Text can be used for incidental labels, or as subtitles or captions that accompany other media...
Including content with temporal behavior in a structured document gives that document temporal semantics. If we then allow changes to the document during its presentation, this brings with it a number of fundamental...
Monochromatic documents require much less network bandwidth and storage space than their color or even grayscale equivalents. The binarization of historical documents is far more complex than that of recent ones as...
When a group of authors collaboratively edits interrelated documents, consistency problems occur almost immediately. Current document management systems (DMSs) provide useful mechanisms such as document locking and version control, but...
When a group of authors collaboratively edits interrelated documents, consistency problems occur almost immediately. Current document management systems (DMS) often lack adequate facilities for consistency management. We extend...
Office applications such as OpenOffice and Microsoft Office are widely used to edit the majority of today's business documents: office documents. Usually, version control systems consider office documents as binary objects, thus severely...
Whenever a group of authors collaboratively edits interrelated documents, semantic consistency is a major goal. Current document management systems (DMS) lack adequate consistency management facilities. We propose liberal use of formal...
Document binary images, created by different algorithms, are commonly evaluated based on a pre-existing ground truth. Previous research found several pitfalls in this methodology and suggested various approaches addressing the issue. This...
Discussions concerning companies' environmental aspects gained considerable importance and challenged business in recent years. Comprehensible and authentic as well as customised corporate environmental reporting requires...
Managing electronic documents demands new forms of management to ensure agility and meet legal and archival requirements. This article develops a case study carried out at a journalism company, which in its...
This article presents a study on detecting defects in the metadata records of learning objects. The study was carried out in two courses related to online learning and objects of...
We examine basic issues of glossary tools as part of a suite of annotational tools to help users make meaning from documents from unfamiliar realms of discourse. We specifically evaluated the performance of glossary tools for reading...
We study the analysis problem of XPath expressions with counting constraints. Such expressions are commonly used in document transformations or programs in which they select portions of documents subject to transformations. We explore how...
This work investigates the impacts of the Fourth Industrial Revolution on Brazilian procedural law, particularly regarding the adoption of electronic service of process through private applications. Starting from a historical-evolutionary analysis of the revolutions...
Scientific inquiry is at the core of the curricula of schools and universities across Europe. weSPOT is a new European initiative proposing a cloud-based approach for personal and social inquiry. weSPOT aims at enabling students...
I dedicate this thesis to all the researchers in the field of law who strive, in some way, to carry out work that is more rigorous and grounded in international scientific standards, instead of limiting themselves to doing the...
This book, Sob vigilância: fontes históricas policiais do Espírito Santo (século XIX), is the result of a project to inventory and digitally reproduce documents, situated in the field of nineteenth-century police bureaucracy in Espírito Santo...
Despite the wide-ranging recognition that paper remains a pervasive resource for human conduct and collaboration, there has been uncertain progress in developing technologies to bridge the paper-digital divide. In this essay we discuss...
In 2025, financial compliance remains an expensive, slow-moving process that often fails to prevent fraud and systemic risks. This article explores how blockchain technology, coupled with algorithmic real-time regulatory enforcement...
We describe one tool for Table of Content (ToC) identification and recognition from PDF books. This task is part of ongoing research on the development of tools for the semiautomatic conversion of PDF documents in the Epub format that can...
This demo presents the TextCoop platform and the Dislog language, based on logic programming, which have primarily been designed for discourse processing. The linguistic architecture and the basics of discourse analysis in TextCoop are...
While people often carry mobile phones for communication purposes, they are generally underutilized as productivity tools, especially in the workplace. In this paper we present Courier, a system that leverages the storage capacity and...
This paper investigates the advancements in subtitle generation and automatic speech recognition (ASR) technologies, emphasizing their role in enhancing accessibility and user engagement in multimedia content. By leveraging deep learning...
This paper proposes an extension of the XSL-FO standard which allows the specification of an unlimited number of arbitrarily shaped page regions. These extensions are built on top of XSL-FO 1.1 to enable flow content to be laid out into...
This document defines the standard layout to be used for CWI tracts, monographs, reports, etc. Furthermore, it details the text-processing procedures needed to generate this layout.
Giving me all the necessary support, not only as an advisor, but as a friend. Thank you for all the availability, assistance, advice and corrections. Thanks to all the other professors who contributed to my development as a student, as a...
Ubiquitous computing applications rely heavily on context information to work properly. Capture and access applications, a special type of ubiquitous computing applications, process context information during the live capture of...
Documents that are intended to be 'active', with high variability and context responsivity, are increasingly attractive building blocks for applications, inevitably defined in XML syntaxes. But many such documents within an application...
In some hypermedia system applications, like interactive digital TV applications, authoring and presentation of documents may have to be done concomitantly. This is the case of live programs, where not only some contents are not known a...
The Document Object Model (DOM) is a widely used approach for retrieving data from an XML document. If the XML document is very large, however, using the DOM approach to retrieve data from it may suffer from a...
DelosDLMS is a prototype of a next-generation Digital Library (DL) management system. It is realized by combining various specialized DL functionalities provided by partners of the DELOS network of excellence. Currently, DelosDLMS...
TeX is an ASCII text-based markup language. In a scheme of automated scientific document preparation, the foundation is provided by LaTeX, which is also a markup language, created from TeX. In this work a user-friendly editor was developed...
There are many perspectives on communication and journalism today, and this is a consequence of both technological and cultural changes. In this increasingly dynamic situation in the field of communication and media, what can still be...