OpenAI Whisper is a powerful speech recognition system that can transcribe audio files with impressive accuracy. When combined with NVIDIA GPU acceleration through CUDA, Whisper can process audio files significantly faster than CPU-only processing. This guide demonstrates how to install and use Whisper with GPU support on Debian and Ubuntu Linux systems.
In this tutorial you will learn:
- How to install OpenAI Whisper with GPU support
- How to verify GPU acceleration is working
- How to transcribe audio files using the command line
- Basic Whisper usage examples and options

| Category | Requirements, Conventions or Software Version Used |
|---|---|
| System | Debian or Ubuntu Linux with NVIDIA GPU |
| Software | NVIDIA drivers, CUDA toolkit, PyTorch with CUDA support, FFmpeg |
| Other | Python 3.8 or higher, pip package manager |
| Conventions | # – requires given Linux commands to be executed with root privileges either directly as a root user or by use of the sudo command<br>$ – requires given Linux commands to be executed as a regular non-privileged user |
Prerequisites
Before installing Whisper with GPU support, you must have the following components installed and configured on your system:
- NVIDIA Drivers: Your system needs properly installed NVIDIA proprietary drivers. Follow the driver installation guide for your distribution.
- CUDA Toolkit: NVIDIA CUDA is required for GPU acceleration. Install it following these guides:
- Debian: How to Install NVIDIA CUDA on Debian
- Ubuntu: How to Install CUDA on Ubuntu
- PyTorch with CUDA Support: Whisper requires PyTorch with CUDA enabled; the same installation procedure applies to both distributions.
- Configured Python Virtual Environment: On Ubuntu/Debian you will need to create a Python virtual environment to install all required Python modules.
VERIFY GPU SETUP
Before proceeding, verify that your GPU is detected and CUDA is working by running nvidia-smi and checking that PyTorch can access CUDA with: python3 -c "import torch; print(torch.cuda.is_available())"
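If you prefer a single script, here is a minimal Python sketch that combines both checks and also reports the detected GPU and its total VRAM:

```python
# gpu_check.py - minimal sketch: confirm the GPU is visible to PyTorch
import torch

if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"CUDA available: {name} ({vram_gb:.1f} GB VRAM)")
else:
    print("CUDA not available - Whisper will fall back to CPU")
```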

The table below matches each Whisper model to the VRAM it requires and example GPUs capable of running it:

| Model | Parameters | VRAM Required | Relative Speed | Recommended GPU Examples |
|---|---|---|---|---|
| tiny | 39M | ~1 GB | ~32x | GTX 1050, GT 1030, any modern GPU |
| base | 74M | ~1 GB | ~16x | GTX 1050 Ti, GTX 1650, RTX 3050 |
| small | 244M | ~2 GB | ~6x | GTX 1060 (6GB), RTX 2060, RTX 3050 |
| medium | 769M | ~5 GB | ~2x | RTX 3060 (12GB), RTX 4060, RTX 2070 |
| large | 1550M | ~10 GB | 1x (baseline) | RTX 3080 (10GB), RTX 3090, RTX 4070 Ti |
Installing OpenAI Whisper
- Install FFmpeg: Whisper requires FFmpeg to process audio files in various formats
# apt install ffmpeg
This package provides the necessary audio codec support for Whisper to handle MP3, WAV, and other common audio formats.
- Install OpenAI Whisper: Use pip to install Whisper with all its dependencies
$ pip install openai-whisper
The installation will download and install Whisper along with required packages including tiktoken, numba, and other dependencies. This may take a few minutes depending on your internet connection.
- Verify Whisper GPU Access: Confirm that Whisper can detect and use your GPU
$ python3 -c "import whisper; print(whisper.load_model('base').device)"
The output should display cuda:0, indicating that Whisper will use your NVIDIA GPU for processing. If you see cpu instead, review your PyTorch CUDA installation.
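As a slightly more verbose check, a minimal Python sketch that loads the model and warns on CPU fallback:

```python
# verify_whisper.py - sketch: load a model and report the device it landed on
import whisper

model = whisper.load_model("base")  # moves to CUDA automatically when available
print(f"Model loaded on: {model.device}")

if model.device.type != "cuda":
    print("Model is on CPU - review the PyTorch CUDA installation")
```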
INSTALLATION COMPLETE
Whisper is now installed and configured to use GPU acceleration. You can proceed to transcribe audio files.
Basic Whisper Usage
Whisper provides a simple command-line interface for transcribing audio files. The basic syntax is straightforward and accepts various audio formats including MP3, WAV, M4A, and others.
- Download Test Audio File: First, download a sample speech audio file for testing
$ wget http://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
This downloads a free test audio file containing English speech from the Open Speech Repository.
- Basic Transcription: Transcribe the test audio file using the base model
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda
This command transcribes the audio file using the base model with GPU acceleration. Whisper will automatically detect the language and create several output files, including plain text, SRT subtitles, and VTT subtitles. The transcription should complete in just a few seconds with GPU acceleration.

Downloading test audio file and transcribing with Whisper using GPU acceleration – total processing time of 6.13 seconds for 33 seconds of audio
- Specify Language: When you know the audio language, specifying it can improve accuracy and speed
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda --language English
By specifying the language, Whisper skips the language detection phase and starts transcription immediately.
- Choose Output Format: Control which output formats are generated
$ whisper OSR_us_000_0010_8k.wav --model base --device cuda --output_format txt
Available formats include txt, srt, vtt, json, and tsv; pass all to generate every format at once.
- Use Different Models: Whisper offers several model sizes with different accuracy-speed tradeoffs
$ whisper OSR_us_000_0010_8k.wav --model small --device cuda
Available models from smallest to largest:
tiny, base, small, medium, and large. Larger models provide better accuracy but require more GPU memory and processing time.
- Translate to English: Automatically translate non-English audio to English text
$ whisper your-audio-file.mp3 --model base --device cuda --task translate
This is useful when you have audio in another language but want English text output. Replace your-audio-file.mp3 with your actual audio file. The same operations are also available from Python, as shown in the sketch after this list.
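Everything shown above can also be scripted. The following is a minimal sketch of the Whisper Python API using the test file downloaded earlier; adjust file and model names to suit:

```python
# transcribe.py - sketch of the Python equivalents of the CLI calls above
import whisper

model = whisper.load_model("base")  # tiny/base/small/medium/large

# language="en" skips detection; task="translate" would produce English output
result = model.transcribe("OSR_us_000_0010_8k.wav", language="en")

print(result["text"])  # the full transcript as plain text

# result["segments"] carries per-segment timestamps, handy for subtitles
for seg in result["segments"]:
    print(f"[{seg['start']:6.1f}s - {seg['end']:6.1f}s] {seg['text'].strip()}")
```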
Understanding Whisper Models
Whisper provides five different model sizes, each offering different trade-offs between accuracy, speed, and memory requirements:
| Model | Parameters | VRAM Required | Use Case |
|---|---|---|---|
| tiny | 39M | ~1 GB | Fast processing, lower accuracy, real-time applications |
| base | 74M | ~1 GB | Good balance for most tasks, recommended starting point |
| small | 244M | ~2 GB | Better accuracy, still relatively fast |
| medium | 769M | ~5 GB | High accuracy, slower processing |
| large | 1550M | ~10 GB | Best accuracy, requires powerful GPU |
CHECK AVAILABLE VRAM
To check how much VRAM your GPU has, run nvidia-smi and look at the “Memory-Usage” column. For example, “2048MiB / 8192MiB” means you have 8GB total VRAM. Choose models based on your available VRAM: tiny/base (1GB), small (2GB), medium (5GB), large (10GB).
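If you want to automate this decision, a small Python sketch (VRAM thresholds taken from the table above) could look like this:

```python
# pick_model.py - sketch: suggest a model size from total VRAM
# (thresholds taken from the model table above)
import torch

REQUIREMENTS = [("large", 10), ("medium", 5), ("small", 2), ("base", 1)]

if torch.cuda.is_available():
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    choice = next((m for m, need in REQUIREMENTS if vram_gb >= need), "tiny")
    print(f"{vram_gb:.1f} GB VRAM detected - try the '{choice}' model")
else:
    print("No CUDA GPU detected - use the tiny or base model on CPU")
```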
MODEL SELECTION TIP
Start with the base model for testing. If accuracy is insufficient, try the small or medium model. Only use the large model if you have sufficient GPU memory and require the highest possible accuracy.
GPU vs CPU Performance Comparison
One of the main advantages of using GPU acceleration with Whisper is the dramatic speed improvement over CPU processing. To demonstrate this difference, we can transcribe the same audio file using both CPU and GPU, then compare the processing times.
- CPU Transcription: Time the transcription process using CPU
$ time whisper OSR_us_000_0010_8k.wav --model base --device cpu 2> /dev/null
This command transcribes the audio file using only the CPU and measures the total time taken. The 2> /dev/null suppresses error output for cleaner timing results.
- GPU Transcription: Time the same transcription using GPU acceleration
$ time whisper OSR_us_000_0010_8k.wav --model base --device cuda 2> /dev/null
This performs the identical transcription but leverages your NVIDIA GPU through CUDA. The speed difference is immediately noticeable.
The performance difference becomes even more pronounced with larger models and longer audio files. For the base model, GPU processing is typically 10-15x faster than CPU. This speed advantage makes GPU acceleration essential for processing large volumes of audio files or when working with higher accuracy models like medium or large.
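If you prefer to benchmark from Python, here is a rough sketch of the same comparison; actual timings will vary with your hardware:

```python
# benchmark.py - sketch: time the same transcription on CPU and GPU
import time
import whisper

AUDIO = "OSR_us_000_0010_8k.wav"  # the test file downloaded earlier

for device in ("cpu", "cuda"):
    model = whisper.load_model("base", device=device)
    start = time.perf_counter()
    # fp16 is only supported on GPU; disable it on CPU to avoid a warning
    model.transcribe(AUDIO, fp16=(device == "cuda"))
    print(f"{device}: {time.perf_counter() - start:.2f}s")
```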

Common Whisper Options
Whisper supports numerous command-line options to customize transcription behavior. Here are the most useful ones:
- --model MODEL: Choose the model size (tiny, base, small, medium, large)
- --device cuda: Force GPU usage (though Whisper uses the GPU by default when available)
- --language LANGUAGE: Specify the audio language to skip detection
- --task transcribe|translate: Either transcribe in the original language or translate to English
- --output_format FORMAT: Choose the output format (txt, srt, vtt, json, tsv)
- --output_dir DIRECTORY: Specify where to save output files
- --verbose False: Reduce console output during processing
- --temperature 0: Use deterministic decoding for consistent results
Monitoring GPU Usage
To verify that Whisper is actually using your GPU during transcription, you can monitor GPU activity in real-time:
$ watch -n 1 nvidia-smi
Run this command in a separate terminal window while Whisper is processing audio. You should see GPU utilization increase and memory usage spike during transcription. This confirms that GPU acceleration is working properly.
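You can also measure the peak GPU memory a transcription used from Python. A minimal sketch using PyTorch's built-in memory statistics:

```python
# gpu_memory.py - sketch: report peak GPU memory used by one transcription
import torch
import whisper

torch.cuda.reset_peak_memory_stats()
model = whisper.load_model("base")
model.transcribe("OSR_us_000_0010_8k.wav")
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory used: {peak_gb:.2f} GB")
```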

Troubleshooting
- Whisper Uses CPU Instead of GPU: If Whisper falls back to CPU processing despite having a GPU
$ python3 -c "import torch; print(torch.cuda.is_available())"
If this returns False, your PyTorch installation does not have CUDA support. Reinstall PyTorch with CUDA following the prerequisites guide.
- FFmpeg Not Found Error: If you encounter “No such file or directory: ‘ffmpeg’”
# apt install ffmpeg
Whisper requires FFmpeg to decode audio files. Install it using your distribution’s package manager.
- Out of Memory Error: If you see CUDA out of memory errors
$ whisper audiofile.mp3 --model tiny --device cuda
Try using a smaller model that requires less GPU memory. The tiny or base models work well on GPUs with 4GB or less VRAM. A fallback pattern is sketched after this list.
- Model Download Issues: If model downloads fail or are interrupted, clear the model cache and try again. Whisper downloads models to ~/.cache/whisper/ on first use; delete this directory if a download is corrupted:
$ rm -rf ~/.cache/whisper/
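For scripted workflows, one defensive pattern against the out-of-memory errors above is to fall back to progressively smaller models. A sketch, assuming PyTorch 1.13 or newer (where torch.cuda.OutOfMemoryError exists):

```python
# safe_load.py - sketch: fall back to smaller models on CUDA out-of-memory
import torch
import whisper

def load_largest_model(candidates=("medium", "small", "base", "tiny")):
    for name in candidates:
        try:
            return whisper.load_model(name, device="cuda")
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()  # release the failed allocation first
    # last resort: smallest model on CPU
    return whisper.load_model("tiny", device="cpu")

model = load_largest_model()
print(f"Loaded model on {model.device}")
```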
Conclusion
OpenAI Whisper with GPU acceleration provides fast and accurate speech-to-text transcription on Debian and Ubuntu systems. By leveraging NVIDIA CUDA, transcription tasks that would take minutes on CPU can complete in seconds. The command-line interface is straightforward, making it accessible for users who need to transcribe audio files without complex setup or programming knowledge. Start with the base model and adjust to larger models if you need better accuracy or smaller models if you need faster processing.