How to Use OpenAI Whisper Voice-to-Text with NVIDIA GPU on Debian/Ubuntu

OpenAI Whisper is a powerful speech recognition system that can transcribe audio files with impressive accuracy. When combined with NVIDIA GPU acceleration through CUDA, Whisper can process audio files significantly faster than CPU-only processing. This guide demonstrates how to install and use Whisper with GPU support on Debian and Ubuntu Linux systems.

In this tutorial you will learn:

  • How to install OpenAI Whisper with GPU support
  • How to verify GPU acceleration is working
  • How to transcribe audio files using the command line
  • Basic Whisper usage examples and options
Software Requirements and Linux Command Line Conventions

Category      Requirements, Conventions or Software Version Used
System        Debian or Ubuntu Linux with an NVIDIA GPU
Software      NVIDIA drivers, CUDA toolkit, PyTorch with CUDA support, FFmpeg
Other         Python 3.8 or higher, pip package manager
Conventions   # – requires given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
              $ – requires given Linux commands to be executed as a regular non-privileged user

Prerequisites

Before installing Whisper with GPU support, you must have the following components installed and configured on your system:

  1. NVIDIA Drivers: Your system needs properly installed NVIDIA proprietary drivers. Follow the appropriate installation guide for your distribution.
  2. CUDA Toolkit: NVIDIA CUDA is required for GPU acceleration. Install it following NVIDIA's documentation for your distribution.
  3. PyTorch with CUDA Support: Whisper requires PyTorch built with CUDA enabled. The installation procedure is the same on Debian and Ubuntu.
  4. Configured Python Virtual Environment: On Ubuntu/Debian you will need to create a Python virtual environment to install all required Python modules, as sketched below.
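
For reference, a minimal sketch of creating and activating a virtual environment. The python3-venv package name and the whisper-env directory are assumptions; adjust them to your setup:

    # apt install python3-venv
    $ python3 -m venv whisper-env
    $ source whisper-env/bin/activate

Run all subsequent pip and whisper commands inside this activated environment.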

VERIFY GPU SETUP
Before proceeding, verify that your GPU is detected and CUDA is working by running nvidia-smi and checking that PyTorch can access CUDA with: python3 -c "import torch; print(torch.cuda.is_available())"

Verifying that Whisper successfully detects and will use the NVIDIA GPU (cuda:0) for transcription processing
Whisper Model GPU Requirements
Model    Parameters  VRAM Required  Relative Speed  Recommended GPU Examples
tiny     39M         ~1 GB          ~32x            GTX 1050, GT 1030, any modern GPU
base     74M         ~1 GB          ~16x            GTX 1050 Ti, GTX 1650, RTX 3050
small    244M        ~2 GB          ~6x             GTX 1060 (6GB), RTX 2060, RTX 3050
medium   769M        ~5 GB          ~2x             RTX 3060 (12GB), RTX 4060, RTX 2070
large    1550M       ~10 GB         1x (baseline)   RTX 3080 (10GB), RTX 3090, RTX 4070 Ti
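
To check the same figure programmatically, a one-liner that reports the total VRAM PyTorch sees on the first GPU (assuming PyTorch with CUDA is already installed):

    $ python3 -c "import torch; print(f'{torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')"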

Installing OpenAI Whisper

  1. Install FFmpeg: Whisper requires FFmpeg to process audio files in various formats
    # apt install ffmpeg

    This package provides the necessary audio codec support for Whisper to handle MP3, WAV, and other common audio formats.

  2. Install OpenAI Whisper: Use pip to install Whisper with all its dependencies
    $ pip install openai-whisper

    The installation will download and install Whisper along with required packages including tiktoken, numba, and other dependencies. This may take a few minutes depending on your internet connection.

  3. Verify Whisper GPU Access: Confirm that Whisper can detect and use your GPU
    $ python3 -c "import whisper; print(whisper.load_model('base').device)"

    The output should display cuda:0, indicating that Whisper will use your NVIDIA GPU for processing. If you see cpu instead, review your PyTorch CUDA installation.

INSTALLATION COMPLETE
Whisper is now installed and configured to use GPU acceleration. You can proceed to transcribe audio files.
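
Beyond the command line, Whisper also exposes a Python API. A minimal sketch, assuming an audio file named audio.mp3 in the current directory:

    import whisper

    # load_model places the model on the GPU automatically when CUDA is available
    model = whisper.load_model("base")

    # Transcribe the file and print the recognized text
    result = model.transcribe("audio.mp3")
    print(result["text"])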

Basic Whisper Usage

Whisper provides a simple command-line interface for transcribing audio files. The basic syntax is straightforward and accepts various audio formats including MP3, WAV, M4A, and others.

  1. Download Test Audio File: First, download a sample speech audio file for testing
    $ wget http://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav

    This downloads a free test audio file containing English speech from the Open Speech Repository.

  2. Basic Transcription: Transcribe the test audio file using the base model
    $ whisper OSR_us_000_0010_8k.wav --model base --device cuda

    This command transcribes the audio file using the base model with GPU acceleration. Whisper will automatically detect the language and create several output files including text, SRT subtitles, and VTT format. The transcription should complete in just a few seconds with GPU acceleration.

    Downloading the test audio file and transcribing with Whisper using GPU acceleration – total processing time of 6.13 seconds for 33 seconds of audio
  3. Specify Language: When you know the audio language, specifying it can improve accuracy and speed
    $ whisper OSR_us_000_0010_8k.wav --model base --device cuda --language English

    By specifying the language, Whisper skips the language detection phase and starts transcription immediately.

  4. Choose Output Format: Control which output formats are generated
    $ whisper OSR_us_000_0010_8k.wav --model base --device cuda --output_format txt

    Available formats include txt, srt, vtt, json, and tsv. Use --output_format all to generate every format at once.

  5. Use Different Models: Whisper offers several model sizes with different accuracy-speed tradeoffs
    $ whisper OSR_us_000_0010_8k.wav --model small --device cuda

    Available models from smallest to largest: tiny, base, small, medium, large. Larger models provide better accuracy but require more GPU memory and processing time.

  6. Translate to English: Automatically translate non-English audio to English text
    $ whisper your-audio-file.mp3 --model base --device cuda --task translate

    This is useful when you have audio in another language but want English text output. Replace your-audio-file.mp3 with your actual audio file. The same options are also available from Whisper's Python API, as sketched below.
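
A minimal Python sketch combining the language and task options from steps 3 and 6 (the language and task keyword arguments mirror the CLI flags; audio.mp3 is a placeholder):

    import whisper

    # Load the base model onto the GPU
    model = whisper.load_model("base", device="cuda")

    # Naming the language skips detection; task="translate" produces English text
    result = model.transcribe("audio.mp3", language="en", task="translate")
    print(result["text"])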

Understanding Whisper Models

Whisper provides five different model sizes, each offering different trade-offs between accuracy, speed, and memory requirements:

Model    Parameters  VRAM Required  Use Case
tiny     39M         ~1 GB          Fast processing, lower accuracy, real-time applications
base     74M         ~1 GB          Good balance for most tasks, recommended starting point
small    244M        ~2 GB          Better accuracy, still relatively fast
medium   769M        ~5 GB          High accuracy, slower processing
large    1550M       ~10 GB         Best accuracy, requires powerful GPU

CHECK AVAILABLE VRAM
To check how much VRAM your GPU has, run nvidia-smi and look at the “Memory-Usage” column. For example, “2048MiB / 8192MiB” means you have 8GB total VRAM. Choose models based on your available VRAM: tiny/base (1GB), small (2GB), medium (5GB), large (10GB).


MODEL SELECTION TIP
Start with the base model for testing. If accuracy is insufficient, try the small or medium model. Only use the large model if you have sufficient GPU memory and require the highest possible accuracy.

GPU vs CPU Performance Comparison

One of the main advantages of using GPU acceleration with Whisper is the dramatic speed improvement over CPU processing. To demonstrate this difference, we can transcribe the same audio file using both CPU and GPU, then compare the processing times.

  1. CPU Transcription: Time the transcription process using CPU
    $ time whisper OSR_us_000_0010_8k.wav --model base --device cpu 2> /dev/null

    This command transcribes the audio file using only the CPU and measures the total time taken. The 2> /dev/null suppresses the progress and warning output that Whisper writes to stderr, keeping the timing results clean.

  2. GPU Transcription: Time the same transcription using GPU acceleration
    $ time whisper OSR_us_000_0010_8k.wav --model base --device cuda 2> /dev/null

    This performs the identical transcription but leverages your NVIDIA GPU through CUDA. The speed difference is immediately noticeable.

The performance difference becomes even more pronounced with larger models and longer audio files. In the comparison below, the base model runs roughly three times faster on the GPU than on the CPU, and the gap widens considerably with the medium and large models. This speed advantage makes GPU acceleration essential for processing large volumes of audio files or when working with higher accuracy models; a Python sketch for timing both devices yourself follows the screenshot.

Performance comparison between CPU and GPU transcription – CPU processing takes 21.5 seconds while GPU acceleration completes the same task in just 6.5 seconds, demonstrating a 3.3x speedup
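
A minimal sketch for reproducing this comparison from Python. The time_transcription helper is hypothetical, and fp16 is disabled on the CPU run because CPUs do not support it:

    import time
    import whisper

    def time_transcription(device):
        # Load the base model onto the requested device, then time one transcription
        model = whisper.load_model("base", device=device)
        start = time.perf_counter()
        model.transcribe("OSR_us_000_0010_8k.wav", fp16=(device == "cuda"))
        return time.perf_counter() - start

    cpu_time = time_transcription("cpu")
    gpu_time = time_transcription("cuda")
    print(f"CPU: {cpu_time:.1f}s  GPU: {gpu_time:.1f}s  speedup: {cpu_time / gpu_time:.1f}x")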

Common Whisper Options

Whisper supports numerous command-line options to customize transcription behavior. The most useful ones are listed below, followed by a combined example:

  • --model MODEL: Choose the model size (tiny, base, small, medium, large)
  • --device cuda: Force GPU usage (though Whisper uses GPU by default when available)
  • --language LANGUAGE: Specify the audio language to skip detection
  • --task transcribe|translate: Either transcribe in original language or translate to English
  • --output_format FORMAT: Choose output format (txt, srt, vtt, json, tsv)
  • --output_dir DIRECTORY: Specify where to save output files
  • --verbose False: Reduce console output during processing
  • --temperature 0: Use deterministic decoding for consistent results
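
For instance, combining several of these options in a single command (lecture.mp3 and the transcripts directory are placeholders):

    $ whisper lecture.mp3 --model small --device cuda --language English --output_format srt --output_dir transcripts --verbose False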

Monitoring GPU Usage

To verify that Whisper is actually using your GPU during transcription, you can monitor GPU activity in real-time:

$ watch -n 1 nvidia-smi

Run this command in a separate terminal window while Whisper is processing audio. You should see GPU utilization increase and memory usage spike during transcription. This confirms that GPU acceleration is working properly.
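
Alternatively, nvidia-smi can log just the relevant fields once per second in a compact CSV format:

    $ nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1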

Monitoring GPU usage with nvidia-smi while Whisper transcribes audio using the medium model – shows active VRAM consumption of 4702MB on an NVIDIA GeForce RTX 3080

Troubleshooting

  1. Whisper Uses CPU Instead of GPU: If Whisper falls back to CPU processing despite having a GPU
    $ python3 -c "import torch; print(torch.cuda.is_available())"

    If this returns False, your PyTorch installation does not have CUDA support. Reinstall PyTorch with CUDA following the prerequisites guide.

  2. FFmpeg Not Found Error: If you encounter “No such file or directory: ‘ffmpeg’”
    # apt install ffmpeg

    Whisper requires FFmpeg to decode audio files. Install it using your distribution’s package manager.

  3. Out of Memory Error: If you see CUDA out of memory errors
    $ whisper audiofile.mp3 --model tiny --device cuda

    Try using a smaller model that requires less GPU memory. The tiny or base models work well on GPUs with 4GB or less VRAM. A Python sketch for falling back to smaller models automatically appears after this list.

  4. Model Download Issues: If model downloads fail or are interrupted, the cached files may be corrupted. Whisper downloads models to ~/.cache/whisper/ on first use; delete this directory and try again:
    $ rm -rf ~/.cache/whisper/
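
For recurring out-of-memory errors, a hypothetical fallback loop can pick the largest model that fits in VRAM. This sketch assumes a recent PyTorch that exposes torch.cuda.OutOfMemoryError:

    import torch
    import whisper

    # Try progressively smaller models until one fits on the GPU
    for name in ("medium", "small", "base", "tiny"):
        try:
            model = whisper.load_model(name, device="cuda")
            print(f"Loaded the '{name}' model")
            break
        except torch.cuda.OutOfMemoryError:
            # Release any partially allocated memory before the next attempt
            torch.cuda.empty_cache()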

Conclusion

OpenAI Whisper with GPU acceleration provides fast and accurate speech-to-text transcription on Debian and Ubuntu systems. By leveraging NVIDIA CUDA, transcription tasks that would take minutes on CPU can complete in seconds. The command-line interface is straightforward, making it accessible for users who need to transcribe audio files without complex setup or programming knowledge. Start with the base model and adjust to larger models if you need better accuracy or smaller models if you need faster processing.
