
Alain Airom

From Speech to Text: A Guide to IBM Granite Speech for Audio Transcriptions

Harnessing IBM Granite for Accurate Audio Transcription


Introduction

In today’s fast-paced digital world, the ability to effortlessly convert spoken language into written text has become more crucial than ever. Speech-to-Text (STT) technology, also known as Automatic Speech Recognition (ASR), is no longer a niche feature but a fundamental tool driving innovation across diverse industries.

From enhancing accessibility and streamlining customer service to revolutionizing data analysis and content creation, STT empowers individuals and organizations by bridging the gap between spoken words and actionable insights. This versatile technology is rapidly transforming how we interact with devices, access information, and derive value from audio, making it an indispensable component of modern digital experiences.

Why use IBM Granite Speech Models?

Among the leading solutions in this domain, IBM’s Granite open-source models, readily available on Hugging Face, stand out for their enterprise-grade performance and flexibility. These models, including the granite-speech-3.3-2b and granite-speech-3.3-8b variants, are designed to deliver high accuracy in speech recognition and translation tasks, often matching or outperforming similarly sized models from other providers on standard benchmarks. A key advantage lies in their open-source Apache 2.0 license, which fosters community collaboration and allows developers to freely experiment with, modify, and fine-tune the models for specific business needs.

Furthermore, IBM has emphasized responsible AI practices in the models’ development, contributing to trustworthy and robust STT solutions for a wide array of applications. The Granite Speech models’ ability to handle varying audio lengths (beyond the typical 30-second window of some other models) and their competitive error rates across prominent public datasets underscore their strong position in the STT landscape.

You can find the comparative benchmarks on the Hugging Face site.

Simple Implementation

As the sample code below (based on the official IBM Granite examples on Hugging Face) demonstrates, developing and deploying an application that uses the Granite Speech models is remarkably straightforward. The transformers library abstracts away much of the underlying complexity, allowing developers to quickly load the model, process audio, and generate transcriptions with just a few lines of Python. This ease of integration significantly lowers the barrier to entry for incorporating powerful STT capabilities into applications of all kinds, from simple scripts to complex enterprise solutions.

It’s worth noting that for environments without a dedicated GPU, such as the one used for this demonstration, the ibm-granite/granite-speech-3.3-2b model offers an excellent balance of performance and efficiency. While larger models like granite-speech-3.3-8b provide even higher accuracy, the 2-billion parameter version still delivers impressive results with a significantly reduced computational footprint, making it a practical choice for CPU-only deployments where speed and resource consumption are key considerations.
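
For example, a small helper can pick which variant to load based on the available hardware. The following is a minimal sketch of my own; the helper name and selection logic are illustrative, not part of the official examples.

import torch

def pick_granite_speech_model() -> str:
    # Use the larger 8B variant when a GPU is available; otherwise
    # fall back to the lighter 2B variant for CPU-only machines.
    if torch.cuda.is_available():
        return "ibm-granite/granite-speech-3.3-8b"
    return "ibm-granite/granite-speech-3.3-2b"

print(pick_granite_speech_model())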


  • As always, prepare the environment and install all the requirements; a quick sanity check of the installed versions is shown after the commands.
python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip

pip install torch 'transformers>=4.49' peft torchaudio soundfile huggingface_hub
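
To confirm the environment is ready, check the installed versions. This is a tiny sketch of my own, not part of the original setup; the Granite Speech model classes require transformers 4.49 or later.

import torch
import transformers

# transformers must be >= 4.49 for the Granite Speech model classes
print(f"torch {torch.__version__}, transformers {transformers.__version__}")
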
  • Sample application.
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

# Determine the computation device: CUDA (GPU) if available, otherwise CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Define the model name (the 2B variant used for this CPU-only demonstration)
model_name = "ibm-granite/granite-speech-3.3-2b"

print(f"Loading processor for {model_name}...")
# Load the AutoProcessor for the model
# trust_remote_code=True is required for custom model code from Hugging Face
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name, trust_remote_code=True)
# The tokenizer is part of the processor
tokenizer = speech_granite_processor.tokenizer
print(f"Loading model {model_name}...")
# Load the AutoModelForSpeechSeq2Seq and move it to the determined device
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, trust_remote_code=True).to(device)
print("Model loaded successfully.")

# Prepare speech and text prompt, using the appropriate prompt template

print("Downloading audio file...")
# Download a sample audio file from the Hugging Face Hub
audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
print(f"Audio file downloaded to: {audio_path}")

# Load the audio waveform and sample rate
# normalize=True scales the audio to a standard range
wav, sr = torchaudio.load(audio_path, normalize=True)
# Assertions to ensure the audio is mono (1 channel) and at 16kHz sample rate
assert wav.shape[0] == 1 and sr == 16000, "Audio must be mono and 16kHz"

# Create the chat history for the prompt
# This mimics a conversational context for the model
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: December 19, 2024.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

# Apply the chat template to format the text prompt correctly for the model
text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

print("Computing audio embeddings...")
# Prepare the model inputs (text prompt and audio waveform)
# return_tensors="pt" ensures PyTorch tensors are returned
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device, # Ensure tensors are on the correct device for computation
    return_tensors="pt",
).to(device) # Move the generated inputs to the device as well

print("Generating transcription...")
# Generate the transcription using the model
# max_new_tokens: maximum length of the generated transcription
# num_beams: number of beams for beam search (higher means better quality but slower)
# do_sample=False: disables sampling, uses greedy decoding with beam search
# min_length, top_p, repetition_penalty, length_penalty, temperature: generation parameters
# bos_token_id, eos_token_id, pad_token_id: special token IDs for generation
model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Extract the newly generated tokens (excluding the input prompt tokens)
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

# Decode the generated tokens back into human-readable text
# add_special_tokens=False and skip_special_tokens=True remove special tokens like <s>, </s>
output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
  • Output ⬇️
python granite-stt-2.py
Using device: cpu
Loading processor for ibm-granite/granite-speech-3.3-2b...
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2.00/2.00 [00:00<00:00, 4.91kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 5.44k/5.44k [00:00<00:00, 9.16MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.42k/2.42k [00:00<00:00, 10.3MB/s]
vocab.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 777k/777k [00:00<00:00, 3.34MB/s]
merges.txt: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 442k/442k [00:00<00:00, 2.31MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.48M/3.48M [00:00<00:00, 8.22MB/s]
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 229/229 [00:00<00:00, 1.17MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 801/801 [00:00<00:00, 6.63MB/s]
chat_template.jinja: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4.57k/4.57k [00:00<00:00, 17.8MB/s]
Loading model ibm-granite/granite-speech-3.3-2b...
adapter_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 717/717 [00:00<00:00, 3.50MB/s]
model.safetensors.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 68.2k/68.2k [00:00<00:00, 842kB/s]
model-00002-of-00003.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1.99G/1.99G [03:15<00:00, 10.2MB/s]
model-00001-of-00003.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1.99G/1.99G [03:17<00:00, 10.1MB/s]
model-00003-of-00003.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1.70G/1.70G [03:17<00:00, 8.62MB/s]
Fetching 3 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:18<00:00, 66.00s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.89it/s]
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 996kB/s]
adapter_model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 68.2M/68.2M [00:03<00:00, 17.2MB/s]
Model loaded successfully.
Downloading audio file...
10226_10111_000000.wav: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 540k/540k [00:00<00:00, 874kB/s]
Audio file downloaded to: /Users/alainairom/.cache/huggingface/hub/models--ibm-granite--granite-speech-3.3-2b/snapshots/6ff8b9f6a889527e6f69b8b21f3ccf4c5037a077/10226_10111_000000.wav
Computing audio embeddings...
Generating transcription...
STT output = AFTER HIS NAP TIMOTHY LAZILY STRETCHED FIRST ONE GRAY VELVET FOOT THEN ANOTHER STROLLED INDOLENTLY TO HIS PLATE TURNING OVER THE FOOD CAREFULLY SELECTING CHOICE BITS NOSING OUT THAT WHICH HE SCORNED UPON THE CLEAN HEARTH

👍
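
One practical note on the sample above: it asserts that the input audio is mono and sampled at 16 kHz, which the bundled demo file already satisfies. For your own recordings, a small conversion step can normalize the input first. This is a minimal sketch of my own using standard torchaudio calls, not part of the official example:

import torchaudio
import torchaudio.functional as F

def load_as_mono_16k(path: str):
    # Load the file, downmix multi-channel audio to mono,
    # and resample to the 16 kHz rate the processor expects.
    wav, sr = torchaudio.load(path, normalize=True)
    if wav.shape[0] > 1:
        wav = wav.mean(dim=0, keepdim=True)
    if sr != 16000:
        wav = F.resample(wav, orig_freq=sr, new_freq=16000)
    return wav, 16000

Swapping this helper in for the torchaudio.load call in the sample lets you transcribe arbitrary recordings without tripping the assertion.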

Conclusion

The journey from raw audio to actionable text is being continually refined by advances in Speech-to-Text technology. IBM Granite’s open-source models on Hugging Face represent a significant stride in this evolution, offering both high performance and accessibility. As we’ve seen, integrating these models, particularly the efficient granite-speech-3.3-2b variant, into your applications is a straightforward process, enabling developers to leverage state-of-the-art transcription capabilities even on modest hardware. This reflects a broader trend in AI: bringing powerful, cutting-edge models into the hands of a wider developer community, paving the way for more innovative and inclusive applications across all sectors.
