Whisper.cpp#
Whisper.cpp is an automatic speech recognition tool that converts spoken audio into text. It is well suited to processing large collections of audio recordings, such as interviews, lectures, oral histories, and podcasts, and it produces transcripts in several formats at once:
Plain text (.txt): one file per recording, ready to read or search
SRT subtitles (.srt): compatible with most video players and editing tools
WebVTT subtitles (.vtt): used by web-based video players
Detailed JSON (.json): includes word-level timestamps and confidence scores
Running it on the BEDE HPC cluster lets you process dozens or hundreds of files in parallel overnight, rather than transcribing them one at a time on a laptop.
The software is available as a module on BEDE, so you do not need to install anything. The command-line tools used in this tutorial should be available in Microsoft PowerShell as well as in Linux and macOS shells.
Walkthrough#
This tutorial walks you through transcribing a folder of audio files on the BEDE HPC cluster using the provided scripts. It assumes you already have a BEDE account and are comfortable typing commands into a terminal. No programming experience is required.
You will need two terminal windows open: one connected to BEDE (for running jobs), and one on your own computer (for transferring files). Commands that must be typed on your local computer are shown with a different background colour; all other commands are run on BEDE.
Here is an overview of the steps:
Upload your audio files to BEDE, or download them there directly
Create a short job script that tells BEDE what to transcribe and where to save the results
Submit the job and wait for the transcripts to appear
To connect to BEDE, open a terminal on your local computer and run (as described in the “Using Bede” section):
ssh <bede-username>@bede.dur.ac.uk
Audio files can be large, so you should store them in the /nobackup area rather than your home directory. The /nobackup area has much more space and is the right place for large datasets. Note that files there are not backed up, so keep originals elsewhere.
In the shell you used to log in, navigate there (replace <project> with your project code):
cd /nobackup/projects/<project>
# create a user folder if you do not already have one
mkdir -p "$USER"
There are two ways to get audio files onto BEDE: uploading from your own computer, or downloading directly from the internet on the cluster. The tutorial assumes you keep the BEDE-connected terminal open throughout.
Transcribing local files#
If your audio files are on your own computer, you transfer them using scp (Secure Copy: a standard command-line tool for sending files over a network connection). Open a new terminal on your local computer. Do not use the one already connected to BEDE.
The commands differ slightly depending on your operating system:
On Linux and macOS, navigate into your audio folder:
cd /path/to/your/<audio-folder>
Move up one level so you can copy the whole folder across to BEDE. The -r option means “recursive”: it copies the folder and everything inside it.
cd .. # go up one level so the audio folder can be copied together with its contents
scp -r <audio-folder> <bede-username>@bede.dur.ac.uk:/nobackup/projects/<project>/<bede-username>
On Windows, your files may be on a different drive (e.g. D:). Switch to that drive first, then navigate into your audio folder:
D:
cd D:\path\to\your\<audio-folder>
Move up one level so you can copy the whole folder across to BEDE. The -r option means “recursive”: it copies the folder and everything inside it.
cd .. # go up one level so the audio folder can be copied together with its contents
scp -r <audio-folder> <bede-username>@bede.dur.ac.uk:/nobackup/projects/<project>/<bede-username>
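Once the copy has finished, one way to confirm that nothing was lost is to count the files that arrived and compare the number with what you expect. The sketch below is meant for the terminal connected to BEDE; the function name count_audio is purely illustrative and is not part of the provided scripts.

```shell
#!/bin/bash
# Count audio files with a given extension under a folder.
# The match is case-insensitive, so .mp3 and .MP3 are both counted.
# Usage: count_audio <folder> <extension>
count_audio() {
    local dir="$1" ext="$2"
    find "$dir" -type f -iname "*.${ext}" | wc -l | tr -d ' '
}

# Example (paths are placeholders, replace them with your own):
#   count_audio "/nobackup/projects/<project>/$USER/<audio-folder>" mp3
```

If the number is lower than the number of files on your own computer, rerun the scp command; it will copy the folder again and overwrite any partial files.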
Transcribing files from the internet#
Files from the internet can be downloaded directly on the cluster using wget or curl.
In the terminal connected to BEDE, create a folder inside your user folder and download the file into it. The example below downloads a Nobel lecture by Ada E. Yonath. Replace the URL with the address of your own file.
cd /nobackup/projects/<project>/<bede-username>
mkdir -p audio_files
cd audio_files
wget https://nobel-videocdn01.azureedge.net/video/lecture_2009_che_yonath-intro_01_496.mp4
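If you have several files to fetch, typing one wget command per file becomes tedious. A minimal sketch, assuming you have saved the addresses in a text file with one URL per line (the file name urls.txt and the function name download_list are illustrative):

```shell
#!/bin/bash
# Download every URL listed in a text file (one URL per line) into a folder.
# Usage: download_list <url-file> <destination-folder>
download_list() {
    local url_file="$1" dest="$2"
    mkdir -p "$dest"
    # -nc: skip files that already exist locally
    # -P:  save downloads into the destination folder
    # -i:  read the list of URLs from a file
    wget -nc -P "$dest" -i "$url_file"
}

# Example: download_list urls.txt audio_files
```

Because of -nc, the function can be rerun after an interruption without downloading completed files again.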
Either way, your audio files are now on BEDE and ready to be transcribed.
The next step is to create a job script — a short text file that tells BEDE’s job scheduler (called SLURM) what program to run, on which files, and for how long. SLURM queues your request, finds a free GPU node, runs it, and writes the output to a folder you choose. You do not need to stay logged in while it runs.
Choose the right script depending on how many files you have:
Fewer than ~50 files (up to ~10 hours of audio total): use the single-job script, which processes the whole folder in one go.
More than ~50 files: use the job array script, which splits the work across several parallel jobs and is much faster.
Creating a job script#
Create a new text file on BEDE using the nano editor (a simple terminal text editor; use the arrow keys to move, Ctrl+O to save, Ctrl+X to exit):
nano transcribe.sh
Copy the appropriate script below into the editor. You only need to edit the lines in the USER CONFIGURATION section: replace <project> with your project code (e.g. bddur99) and adjust the paths to your audio folder and desired output folder. The ENV_FILE line is optional and is explained in the next section.
BEDE currently offers two types of compute node. The #SBATCH --partition=gpu line selects which type your job runs on. The current setting (gpu) selects nodes with V100 GPUs, of which there are more. To run the job on a Grace Hopper node instead, change the partition to gh. These nodes are more powerful than most transcription jobs need and, because there are fewer of them, jobs often wait longer in the queue before starting.
If you have a small number of files to transcribe, use the following SLURM script. It processes all audio files in the specified folder in one job.
#!/bin/bash
################################################################################
# Whisper.cpp Folder Transcription — Single-node (no job array)
# Processes all audio files in a folder in one job.
################################################################################
# ── SLURM directives ──────────────────────────────────────────────────────────
#SBATCH --account=<project> # Run job under project <project>
#SBATCH --time=3:00:0
#SBATCH --job-name=whisper_folder
# GPU node
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# Log files (%j = job ID)
#SBATCH --output=logs/%j/folder.out
#SBATCH --error=logs/%j/folder.err
# ── USER CONFIGURATION ────────────────────────────────────────────────────────
PROJECT=<project> # Your BEDE project name (e.g. myproject)
# Folder containing your audio files
AUDIO_DIR=/nobackup/projects/$PROJECT/$USER/audio_files
# Audio file extension to process (e.g. mp3, wav, flac, m4a)
AUDIO_EXT=mp3
# Where transcription output will be written (txt/, vtt/, srt/, json/ auto-created)
OUTDIR=/nobackup/projects/$PROJECT/$USER/whisper_output
# Whisper parameters (see scripts/config/whisper.env to adjust quality/language/etc.)
ENV_FILE="/path/to/my.env"
# ── END USER CONFIGURATION ────────────────────────────────────────────────────
# Load whisper.cpp module
module load ai4science
module load whisper.cpp
# Load whisper parameters from config file
if [ -f "$ENV_FILE" ]; then
    set -a
    source <(sed 's/#.*//' "$ENV_FILE" | grep -v '^\s*$')
    set +a
    echo "Loaded config from: $ENV_FILE"
else
    echo "WARNING: Config file not found: $ENV_FILE (worker will use built-in defaults)"
fi
# Export variables needed by the worker
export AUDIO_DIR OUTDIR AUDIO_EXT MODEL_DIR
export EXTRA_ARGS="${EXTRA_ARGS:-}"
# Hand off all processing logic to the worker script
bash "$WORKER_PATH"
For a large number of files, it is more efficient to use a job array. The following SLURM script divides the audio files in the specified folder among the tasks of a job array, so that several batches of files are transcribed in parallel.
You also need to set #SBATCH --array=1-N, where N is the number of parallel jobs to run. A good rule of thumb is one job per 20 files. To count your files, run:
find /nobackup/projects/<project>/<bede-username>/audio_files -name "*.mp3" | wc -l
Then divide that number by 20 and round up; that is your N. For example, 75 files ÷ 20 = 3.75, so set --array=1-4. Each job automatically figures out which slice of files it is responsible for, so you do not need to split them manually.
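The rounding-up rule can be checked with a few lines of shell arithmetic. The sketch below also illustrates one way a task could pick its share of a file list from its task number. Both function names are hypothetical, and the round-robin split is only an illustration; the provided worker script may divide the files differently.

```shell
#!/bin/bash
# Number of array tasks: file count divided by files per task, rounded up.
# Usage: array_size <file-count> [files-per-task, default 20]
array_size() {
    local count="$1" per_task="${2:-20}"
    echo $(( (count + per_task - 1) / per_task ))
}

# Print the files one task would handle if the list were dealt out
# round-robin (illustration only, not the actual worker logic).
# Usage: task_slice <task-id, starting at 1> <task-count> <file>...
task_slice() {
    local task_id="$1" task_count="$2"
    shift 2
    local i=0 f
    for f in "$@"; do
        if (( i % task_count == task_id - 1 )); then
            echo "$f"
        fi
        i=$(( i + 1 ))
    done
}

array_size 75 20                                     # prints 4
task_slice 2 4 a.mp3 b.mp3 c.mp3 d.mp3 e.mp3 f.mp3   # prints b.mp3 and f.mp3
```

Inside a running array job, SLURM provides the task number and the total number of tasks in the environment variables SLURM_ARRAY_TASK_ID and SLURM_ARRAY_TASK_COUNT, which is what makes this kind of automatic split possible.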
#!/bin/bash
################################################################################
# Whisper.cpp Batch Transcription — SLURM Submit Script
################################################################################
# ── SLURM directives ──────────────────────────────────────────────────────────
#SBATCH --account=<project>
#SBATCH --time=2:00:0
#SBATCH --job-name=whisper_transcribe
#SBATCH --array=1-4
# GPU node
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# Log files (%A = array job ID, %a = array task ID)
#SBATCH --output=logs/%A/task_%a.out
#SBATCH --error=logs/%A/task_%a.err
# ── USER CONFIGURATION ────────────────────────────────────────────────────────
PROJECT=<project> # Your BEDE project name (e.g. myproject)
# Folder containing your audio files
AUDIO_DIR=/nobackup/projects/$PROJECT/$USER/audio_files
# Audio file extension to process (e.g. mp3, wav, flac, m4a)
AUDIO_EXT=mp3
# Where transcription output will be written (txt/, vtt/, srt/, json/ auto-created)
OUTDIR=/nobackup/projects/$PROJECT/$USER/whisper_output
# Whisper parameters (see scripts/config/whisper.env to adjust quality/language/etc.)
ENV_FILE=/path/to/my.env
# ── END USER CONFIGURATION ────────────────────────────────────────────────────
# Load whisper.cpp module
module load ai4science
module load whisper.cpp
# Load whisper parameters from config file
if [ -f "$ENV_FILE" ]; then
    set -a
    source <(sed 's/#.*//' "$ENV_FILE" | grep -v '^\s*$')
    set +a
    echo "Loaded config from: $ENV_FILE"
else
    echo "WARNING: Config file not found: $ENV_FILE (worker will use built-in defaults)"
fi
# Export variables needed by the worker
export AUDIO_DIR OUTDIR AUDIO_EXT MODEL_DIR
export EXTRA_ARGS="${EXTRA_ARGS:-}"
# Hand off all processing logic to the worker script
bash "$WORKER_PATH"
Save the file and submit it to the queue:
sbatch transcribe.sh
BEDE will print a job ID, for example Submitted batch job 123456. You can check whether your job is running or waiting in the queue with:
squeue --me
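squeue only lists jobs that are still waiting or running. To see how a job ended after it has left the queue, SLURM's accounting command sacct can help. This is a sketch that assumes job accounting is enabled on the cluster; the function name job_summary is illustrative.

```shell
#!/bin/bash
# Show the final state of a job, or of every task in a job array.
# Assumes SLURM job accounting is enabled on the cluster.
# Usage: job_summary <job-id>
job_summary() {
    local job_id="$1"
    # -X: one line per job or array task, without the internal job steps
    sacct -j "$job_id" -X --format=JobID,JobName,State,Elapsed,ExitCode
}

# Example: job_summary 123456
```

A State of COMPLETED means the job ran to the end; FAILED or TIMEOUT suggests looking at the log files or increasing the --time limit.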
When the job finishes, the transcripts will be in the output folder you set in OUTDIR, organised into subfolders:
OUTDIR/txt/: plain text transcripts
OUTDIR/srt/: SRT subtitle files
OUTDIR/vtt/: WebVTT subtitle files
OUTDIR/json/: detailed JSON with timestamps
If something went wrong, check the log files (.out and .err) in logs/<job-id>/ for error messages. The logs folder is relative to the directory from which you submitted the job.
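With many recordings it is easy to miss one that was not transcribed. The sketch below lists audio files that have no non-empty plain text transcript. It assumes each transcript is named after its recording (for example interview01.mp3 and txt/interview01.txt); this naming is an assumption, so check it against your own output folder first. The function name missing_transcripts is illustrative.

```shell
#!/bin/bash
# List audio files that have no matching, non-empty plain text transcript.
# Assumption: transcripts are named after the recording,
# e.g. interview01.mp3 -> <outdir>/txt/interview01.txt
# Usage: missing_transcripts <audio-dir> <audio-ext> <outdir>
missing_transcripts() {
    local audio_dir="$1" ext="$2" outdir="$3"
    local f name
    for f in "$audio_dir"/*."$ext"; do
        [ -e "$f" ] || continue              # pattern matched no files
        name=$(basename "$f" ."$ext")
        [ -s "$outdir/txt/$name.txt" ] || echo "$f"
    done
}

# Example (paths are placeholders):
#   missing_transcripts "$AUDIO_DIR" mp3 "$OUTDIR"
```

An empty result means every recording has a transcript.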
Getting the files back to your computer#
Once your job has finished, copy the output folder from BEDE to your local computer using scp.
In a terminal on your local computer (not the one connected to BEDE), go to the folder where you want to save the transcripts using cd, then run:
scp -r <bede-username>@bede.dur.ac.uk:/nobackup/projects/<project>/<bede-username>/whisper_output ./whisper_output
This will download the entire whisper_output folder into your current directory. Inside it you will find four subfolders: txt/, srt/, vtt/, and json/, each containing one file per recording.
If you only want the plain text transcripts and not the subtitle or JSON files, you can download just that subfolder:
scp -r <bede-username>@bede.dur.ac.uk:/nobackup/projects/<project>/<bede-username>/whisper_output/txt ./transcripts
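Once the plain text transcripts are on your computer, they can be searched with standard tools. A small sketch for Linux and macOS shells (grep is not available by default in PowerShell); the function name search_transcripts and the example phrase are illustrative.

```shell
#!/bin/bash
# List the transcripts that mention a phrase, ignoring upper/lower case.
# Usage: search_transcripts <phrase> <transcript-folder>
search_transcripts() {
    local phrase="$1" dir="$2"
    # -r: search the folder recursively   -i: ignore case
    # -l: print file names only           -F: treat the phrase as literal text
    grep -rilF -- "$phrase" "$dir"
}

# Example: search_transcripts "ribosome" ./transcripts
```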
This should enable you to produce transcripts. If you encounter any issues or want to customise the transcription process, read on for more details about the configuration options.
Converting to a compatible format with FFmpeg#
Whisper.cpp relies on FFmpeg to read audio files. If your files are in an unusual format, you may need to convert them first. You can either do that on your local computer or use the following SLURM script. It converts all files in a folder to MP3 files, which are compatible with Whisper. Adjust the INPUT_EXT and OUTPUT_EXT variables as needed.
#!/bin/bash
################################################################################
# Audio Format Conversion with FFmpeg — SLURM Submit Script
################################################################################
# ── SLURM directives ──────────────────────────────────────────────────────────
#SBATCH --account=<project>
#SBATCH --time=1:00:0
#SBATCH --job-name=ffmpeg_convert
# GPU node
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
# Log files (%j = job ID)
#SBATCH --output=logs/%j/ffmpeg.out
#SBATCH --error=logs/%j/ffmpeg.err
# ── USER CONFIGURATION ────────────────────────────────────────────────────────
PROJECT=<project> # Your BEDE project name (e.g. myproject)
# Folder containing your original audio files
INPUT_DIR=/nobackup/projects/$PROJECT/$USER/original_audio
# Folder where converted files will be saved
OUTPUT_DIR=/nobackup/projects/$PROJECT/$USER/converted_audio
# Original and target audio formats (e.g. wav, flac, m4a, mp3)
INPUT_EXT=wav
OUTPUT_EXT=mp3
# ── END USER CONFIGURATION ────────────────────────────────────────────────────
# Load the whisper.cpp module (used here for FFmpeg)
module load ai4science
module load whisper.cpp
# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"
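# Convert each file using FFmpeg
for file in "$INPUT_DIR"/*."$INPUT_EXT"; do
    # Skip the unexpanded pattern if no files match
    [ -e "$file" ] || continue
    filename=$(basename "$file" ."$INPUT_EXT")
    # -nostdin stops FFmpeg from waiting for keyboard input in a batch job
    ffmpeg -nostdin -i "$file" "$OUTPUT_DIR/${filename}.${OUTPUT_EXT}"
done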
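As an alternative target format, the upstream whisper.cpp examples convert recordings to 16 kHz, mono, 16-bit WAV before transcription. The sketch below does this for a single file and also works for video files such as MP4, because it ignores the video stream. The function name to_wav16k is illustrative; whether WAV or MP3 suits your workflow better on BEDE is something to test on a small sample.

```shell
#!/bin/bash
# Convert one recording (audio or video) to 16 kHz mono 16-bit WAV.
# Usage: to_wav16k <input-file> <output-folder>
to_wav16k() {
    local src="$1" outdir="$2"
    local name
    name=$(basename "$src")
    name="${name%.*}"                 # drop the file extension
    mkdir -p "$outdir"
    # -n: never overwrite an existing output file
    # -vn: ignore any video stream
    # -ar 16000: 16 kHz sample rate, -ac 1: mono, pcm_s16le: 16-bit samples
    ffmpeg -n -i "$src" -vn -ar 16000 -ac 1 -c:a pcm_s16le "$outdir/$name.wav"
}

# Example: to_wav16k audio_files/lecture.mp4 converted_audio
```

If you use this format, remember to set AUDIO_EXT=wav in the transcription script.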
More customisation with an env file#
The ENV_FILE is optional. The built-in defaults work well for most English-language recordings, so you can skip this section entirely on your first run.
If you do need to change something, most commonly the language, create a plain text file containing only the lines you want to override. Save it anywhere on BEDE and set ENV_FILE in your job script to point at it. Everything you leave out keeps its default value.
Common configurations#
Non-English audio
WHISPER_LANGUAGE=de # replace 'de' with your language code
Common codes: de (German), fr (French), es (Spanish), nl (Dutch), it (Italian), pt (Portuguese), ja (Japanese), zh (Chinese). Use auto to let Whisper detect the language automatically (slightly slower).
Translate speech into English (e.g. a French interview delivered as an English transcript)
WHISPER_LANGUAGE=fr
TRANSLATE=true
Faster processing, slightly lower accuracy (useful for a quick first pass)
BEST_OF=2
BEAM_SIZE=2
Plain text output only (skip subtitle and JSON files to save space)
OUTPUT_FORMATS=-otxt
Pointing your job script at the env file#
Save the file on BEDE, for example as my_project.env alongside your job script, then set ENV_FILE in the USER CONFIGURATION section of your script:
ENV_FILE=/nobackup/projects/<project>/<bede-username>/my_project.env
Alternatively, pass it at submission time without editing the script at all:
sbatch --export=ALL,ENV_FILE=/nobackup/projects/<project>/<bede-username>/my_project.env transcribe.sh
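To see how an env file is read, the sketch below writes a small example file and loads it with the same sed and grep pipeline that the job scripts above use. The settings are the ones shown under "Common configurations"; the temporary file location is only for demonstration.

```shell
#!/bin/bash
# Write a small example env file (the location is a throwaway temp folder).
env_file="$(mktemp -d)/my_project.env"
cat > "$env_file" <<'EOF'
# German interviews, quick first pass
WHISPER_LANGUAGE=de    # language code
BEST_OF=2
BEAM_SIZE=2

OUTPUT_FORMATS=-otxt
EOF

# Load it the same way the job scripts do: strip comments, drop blank
# lines, and export every variable that gets assigned.
set -a
source <(sed 's/#.*//' "$env_file" | grep -v '^\s*$')
set +a

echo "language=$WHISPER_LANGUAGE best_of=$BEST_OF beam=$BEAM_SIZE formats=$OUTPUT_FORMATS"
```

Two consequences of this loading method are worth knowing: everything after a # on a line is discarded, so a value cannot itself contain a # character, and a value containing spaces must be wrapped in quotes.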
Full settings reference#
The tables below list every available setting. You will not need most of these for typical transcription work.
Language
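| Variable | Default | Description |
|---|---|---|
| WHISPER_LANGUAGE | | Spoken language of your audio. Use auto to detect the language automatically. |
| TRANSLATE | | Set to true to translate the speech into English. |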
Output formatting
| Variable | Default | Description |
|---|---|---|
| OUTPUT_FORMATS | | Space-separated output format flags. Remove any you do not need. |
| | | Maximum characters per transcript segment. |
| | | Number of previous tokens Whisper remembers across segments. |
| | see below | Initial text prompt to steer Whisper’s style and vocabulary. Helpful for punctuation, capitalisation, and domain-specific terms. Default: |
Accuracy and speed
| Variable | Default | Description |
|---|---|---|
| BEST_OF | | Number of candidate transcriptions generated per chunk (best-of-N sampling). Higher values improve accuracy at the cost of speed. The whisper-cli default is |
| BEAM_SIZE | | Beam search width: how many paths are explored simultaneously. Higher values improve accuracy at the cost of speed. The whisper-cli default is |
| | | Sampling temperature (0.0 = fully deterministic). Higher values introduce more variation. The whisper-cli default is |
| | | Temperature increment applied on each fallback retry (only relevant when |
| | | When |
Voice Activity Detection (VAD)
VAD filters silence and background noise before passing audio to Whisper, preventing hallucinated text during quiet periods. It is strongly recommended for recordings with background noise or music.
| Variable | Default | Description |
|---|---|---|
| | | Silero VAD model file. Must be present in the module’s model directory. |
| | | Speech confidence threshold (0.0–1.0). Lower values retain more quiet speech but risk passing through background noise. The whisper-cli default is |
| | | Minimum duration (ms) a speech burst must last to be kept. Shorter bursts (coughs, clicks) are discarded. The whisper-cli default is |
| | | Minimum silence gap (ms) between speech segments before they are split into separate chunks. The whisper-cli default is |
Models
Whisper is available in several sizes trading off speed against accuracy. The scripts automatically try the highest-quality model first and fall back to the smaller one if the GPU runs out of memory.
| Variable | Default | Description |
|---|---|---|
| | | Highest-accuracy model (~3.9 GB VRAM). Used first. |
| | | Lighter model (~2.1 GB VRAM). Used when the primary model runs out of memory. |
Advanced: quality thresholds
These settings control when the script considers a transcribed segment unreliable and falls back to the smaller model. The defaults are well-tuned and rarely need changing.
| Variable | Default | Description |
|---|---|---|
| | | Maximum token entropy allowed before a segment is marked as failed. Higher values tolerate more uncertainty. The whisper-cli default is |
| | | Minimum average log-probability for a segment. |
| | | Probability above which a segment is classified as silence and discarded. Lower values keep more borderline speech. The whisper-cli default is |
| | | Minimum per-word timestamp confidence. The whisper-cli default is |