Breaking News: Transform Your Audio with Docling’s New ASR Capabilities and Use It Within a RAG!
Introduction
Automatic Speech Recognition (ASR) is a revolutionary technology that converts spoken language into written text. Far beyond simple dictation, ASR acts as a crucial bridge, transforming the vast ocean of audio and video content into searchable, analyzable data. For applications like Retrieval Augmented Generation (RAG), this capability is transformative. Imagine being able to ingest not just written documents, but also meeting recordings, customer service calls, or lecture audio, and extract precise information to ground your AI’s responses. By converting these previously inaccessible data sources into text, ASR significantly enriches the knowledge base of a RAG system, allowing it to retrieve more comprehensive and accurate information, ultimately leading to more insightful and contextually relevant AI-generated content.
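To make the RAG angle concrete, here is a minimal, purely illustrative sketch (my own toy code, not Docling or any RAG framework): once ASR has turned audio into text, a RAG system can chunk the transcript and retrieve the best-matching passage to ground an answer. The transcript segments and the keyword-overlap scoring below are simplified assumptions for demonstration only.

```python
def chunk_transcript(segments, max_chars=80):
    """Group transcript segments into retrieval-sized chunks."""
    chunks, current = [], ""
    for seg in segments:
        if current and len(current) + len(seg) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += " " + seg
    if current:
        chunks.append(current.strip())
    return chunks


def retrieve(chunks, query):
    """Naive keyword-overlap retrieval: score chunks by shared words."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]  # the best-matching chunk grounds the LLM's answer


# Hypothetical ASR output from a meeting recording
segments = [
    "The quarterly meeting covered budget planning.",
    "Marketing asked for two new hires.",
    "Engineering reported the release is on schedule.",
]
chunks = chunk_transcript(segments)
print(retrieve(chunks, "is the release on schedule"))
```

A real system would use embeddings and a vector store instead of word overlap, but the pipeline shape is the same: transcribe, chunk, retrieve, then pass the retrieved chunk to the LLM as context.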
Breaking news on the document intelligence frontier: Docling has just unveiled its groundbreaking ASR capabilities, seamlessly integrating advanced speech-to-text functionality directly into its powerful document processing platform! This monumental enhancement means Docling users can now effortlessly convert audio content into structured, searchable text, unlocking new dimensions of data analysis and information retrieval. With ASR now at its core, Docling empowers businesses and developers to go beyond traditional text documents, leveraging spoken insights to fuel richer RAG systems, automate more complex workflows, and derive deeper intelligence from every piece of their multimedia data. The future of comprehensive document understanding is here, and it speaks volumes!
Test
Building upon the foundational example provided in Docling’s GitHub project, I implemented this solution with only minor modifications to tailor it to my specific test.
- Environment preparation and what you need to install to make the sample application work ⬇️
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling
pip install openai-whisper
- If you are running on a CPU like me, you should also install `ffmpeg`.
brew install ffmpeg
- I made very slight changes to the original sample application to produce a cleaner console output and to generate a markdown file.
from pathlib import Path

from docling_core.types.doc import DoclingDocument
from docling.datamodel import asr_model_specs
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.document import ConversionResult
from docling.datamodel.pipeline_options import AsrPipelineOptions
from docling.document_converter import AudioFormatOption, DocumentConverter
from docling.pipeline.asr_pipeline import AsrPipeline


def get_asr_converter():
    """Create a DocumentConverter configured for ASR with the whisper_turbo model."""
    pipeline_options = AsrPipelineOptions()
    pipeline_options.asr_options = asr_model_specs.WHISPER_TURBO

    converter = DocumentConverter(
        format_options={
            InputFormat.AUDIO: AudioFormatOption(
                pipeline_cls=AsrPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )
    return converter


def asr_pipeline_conversion(audio_path: Path) -> DoclingDocument:
    """ASR pipeline conversion using whisper_turbo."""
    # Check that the input audio file exists
    assert audio_path.exists(), f"Test audio file not found: {audio_path}"

    converter = get_asr_converter()

    # Convert the audio file
    result: ConversionResult = converter.convert(audio_path)

    # Verify that the conversion was successful
    assert result.status == ConversionStatus.SUCCESS, (
        f"Conversion failed with status: {result.status}"
    )

    # --- Debugging lines (can be removed if not needed, but useful for inspection) ---
    print("\n--- DoclingDocument Object Content ---")
    print(result.document)

    print("\n--- DoclingDocument Plain Text (from .texts attribute) ---")
    if hasattr(result.document, "texts") and isinstance(result.document.texts, list):
        if result.document.texts:
            # Extract the actual text from the TextItem objects
            text_segments = [item.text for item in result.document.texts]
            print(" ".join(text_segments))  # Concatenate all text segments for display
        else:
            print("[No text segments found]")
    else:
        print("DoclingDocument does not have a 'texts' attribute, or it is not a list.")
    print("------------------------------------\n")
    # --- End debugging lines ---

    return result.document


if __name__ == "__main__":
    audio_path = Path("./input/sample_10s.mp3")
    output_markdown_file = Path("output_asr.md")  # Define the output file path

    print(f"Attempting ASR conversion for: {audio_path}")
    doc = asr_pipeline_conversion(audio_path=audio_path)

    # Get the markdown content
    markdown_content = doc.export_to_markdown()

    # Write the markdown content to the specified file
    try:
        with open(output_markdown_file, "w", encoding="utf-8") as f:
            f.write(markdown_content)
        print(f"\nMarkdown content successfully written to: {output_markdown_file.absolute()}")
        print("\n--- Content of output_asr.md ---")
        print(markdown_content)
        print("--------------------------------")
    except IOError as e:
        print(f"\nError writing markdown to file {output_markdown_file}: {e}")

# Expected output:
#
# [time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
#
# [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
- The expected output ⬇️
python minimal_asr_pipeline_2.py
/Users/xxxx/Devs/docling-asr/venv/lib/python3.12/site-packages/whisper/transcribe.py:126: UserWarning: FP16 is not supported on CPU; using FP32 instead
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:04.000] Shakespeare on Scenery by Oscar Wilde
[00:05.280 --> 00:09.960] This is a LibriVox recording. All LibriVox recordings are in the public domain.
--- DoclingDocument Object Content ---
schema_name='DoclingDocument' version='1.4.0' name='sample_10s' origin=DocumentOrigin(mimetype='audio/x-wav', binary_hash=5988282427051697350, filename='sample_10s.mp3', uri=None) furniture=GroupItem(self_ref='#/furniture', parent=None, children=[], content_layer=<ContentLayer.FURNITURE: 'furniture'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) body=GroupItem(self_ref='#/body', parent=None, children=[RefItem(cref='#/texts/0'), RefItem(cref='#/texts/1')], content_layer=<ContentLayer.BODY: 'body'>, name='_root_', label=<GroupLabel.UNSPECIFIED: 'unspecified'>) groups=[] texts=[TextItem(self_ref='#/texts/0', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[], orig='[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde', text='[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde', formatting=None, hyperlink=None), TextItem(self_ref='#/texts/1', parent=RefItem(cref='#/body'), children=[], content_layer=<ContentLayer.BODY: 'body'>, label=<DocItemLabel.TEXT: 'text'>, prov=[], orig='[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.', text='[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.', formatting=None, hyperlink=None)] pictures=[] tables=[] key_value_items=[] form_items=[] pages={}
--- DoclingDocument Plain Text (from .texts attribute) ---
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde [time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
------------------------------------
--- Final Markdown Output ---
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
- And the content of the generated markdown file ⬇️
[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde
[time: 5.28-9.96] This is a LibriVox recording. All LibriVox recordings are in the public domain.
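As you can see, Docling prefixes each transcribed segment with a `[time: start-end]` marker. If you want the timestamps back as structured data (for time-aligned search, subtitles, etc.), a small helper can parse them out of the exported markdown. This is my own illustrative snippet, not part of the Docling API, and it assumes the segment format shown above.

```python
import re

# Matches lines of the form "[time: 0.0-4.0] some text"
SEGMENT_RE = re.compile(r"\[time:\s*([\d.]+)-([\d.]+)\]\s*(.*)")


def parse_segments(markdown: str):
    """Parse '[time: start-end] text' lines into (start, end, text) tuples."""
    segments = []
    for line in markdown.splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            segments.append((float(m.group(1)), float(m.group(2)), m.group(3)))
    return segments


# Sample taken from the ASR output above
sample = (
    "[time: 0.0-4.0] Shakespeare on Scenery by Oscar Wilde\n"
    "[time: 5.28-9.96] This is a LibriVox recording. "
    "All LibriVox recordings are in the public domain."
)
for start, end, text in parse_segments(sample):
    print(f"{start:>6.2f}s - {end:>6.2f}s  {text}")
```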
Et voilà, Docling nailed it again 🥇
Conclusion
In conclusion, the integration of advanced ASR capabilities into Docling marks a pivotal moment for document intelligence. By transforming spoken words into actionable text, Docling not only bridges a critical gap in data accessibility but also profoundly enriches the potential of applications like Retrieval Augmented Generation. What’s more, the ease with which these powerful new features can be adopted, building directly upon Docling’s accessible samples, underscores its commitment to user-centric innovation. This significant enhancement empowers users to unlock deeper insights from their diverse data streams, moving confidently towards a future where every piece of information, whether written or spoken, contributes to a more comprehensive and intelligent understanding.
Links
- Docling project: https://github.com/docling-project
- Docling documentation: https://docling-project.github.io/docling/
- Docling ASR: https://github.com/docling-project/docling/blob/main/docs/examples/minimal_asr_pipeline.py
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition: https://arxiv.org/abs/2401.10446#:~:text=Recent%20advances%20in%20large%20language,LLMs%20to%20improve%20recognition%20results.