YouTube Transcriptions using OpenAI's Whisper & Python

In this blog, we’ll dive into a cool project where we’ll transform YouTube videos into text using OpenAI's Whisper model. Whether you’re a student trying to summarize a lecture, or a creator looking to add subtitles to your videos, this guide has got you covered.

What is Speech-to-Text Transcription?

Speech-to-text transcription is simply the conversion of spoken language into written text. All your voice assistants, be it Siri, Alexa, or even Ok Google, use speech-to-text.

Why Transcribe Speech?

  • Captions: It helps generate subtitles for videos.

  • Summarization: You can use transcription to create summaries or extract key information from long videos.

  • Transcribing Meetings: Meetings can be transcribed and saved for future reference.

  • Automatic Translations: Transcriptions can be easily translated into multiple languages, making content accessible to a global audience.

Know About OpenAI’s Whisper

Whisper is a general-purpose speech recognition model (and very UNDERRATED).

It is an Automatic Speech Recognition (ASR) model that works efficiently with various languages and accents.

It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

Here’s a compilation of the available models and languages supported by Whisper, taken from the official README:

  Size    Parameters  English-only  Multilingual  Required VRAM  Relative speed
  tiny    39 M        tiny.en       tiny          ~1 GB          ~32x
  base    74 M        base.en       base          ~1 GB          ~16x
  small   244 M       small.en      small         ~2 GB          ~6x
  medium  769 M       medium.en     medium        ~5 GB          ~2x
  large   1550 M      N/A           large         ~10 GB         1x

Project Setup

Before jumping into the code, let’s ensure you have a few things ready.

yt-dlp: A Python package that lets you download audio from YouTube videos.

Run this command to install yt-dlp:

pip install yt-dlp

whisper: To transcribe the audio into text.

Run the command below to install whisper:

pip install -U openai-whisper

It also requires the command-line tool ffmpeg to be installed on your system.

ffmpeg is a command-line tool to record, convert, and stream audio and video.

It is available to download via most package managers.

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg

# on Windows using Scoop
scoop install ffmpeg
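Since the script depends on ffmpeg plus the two Python packages, a quick pre-flight check can save you a confusing stack trace later. This is a minimal sketch; the helper name check_dependencies is my own, not part of either library:

```python
import shutil

def check_dependencies():
    """Return a list of missing dependencies (empty means all good)."""
    missing = []
    # ffmpeg must be on the PATH for both yt-dlp's audio extraction and Whisper
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    # the two Python packages installed above
    for module in ("yt_dlp", "whisper"):
        try:
            __import__(module)
        except ImportError:
            missing.append(module)
    return missing

print(check_dependencies())  # e.g. [] when everything is installed
```

Run this once before anything else and you'll know exactly which install step you skipped.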

Code Walkthrough

Step 1: Download Audio from YouTube

Let’s extract the audio from a YouTube video using yt-dlp.

Here’s the function to download audio from a YouTube video:

import yt_dlp as youtube_dl

def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',  # Get the best available audio
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Convert the audio to WAV format
            'preferredcodec': 'wav',
            'preferredquality': '192',  # Audio quality in kbps (mainly relevant for lossy codecs)
        }],
        'outtmpl': './audio.%(ext)s',  # Saved as 'audio.<ext>', then converted to 'audio.wav'
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    return "audio.wav"

In this function:

  • We request the best available audio stream (bestaudio/best).

  • A postprocessor converts the download to WAV, a format commonly used for transcription.

  • The output template saves the final file as audio.wav.
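One caveat: with a fixed audio.wav name, transcribing several videos overwrites the file on each run. A hedged variation (build_ydl_opts is a hypothetical helper, not part of yt-dlp) is to name files after the video ID using yt-dlp's output template:

```python
def build_ydl_opts(output_dir="."):
    """Build yt-dlp options that name each file after its video ID,
    so repeated downloads don't overwrite one another."""
    return {
        "format": "bestaudio/best",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "wav",
        }],
        # %(id)s expands to the YouTube video ID, %(ext)s to the extension
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
    }

opts = build_ydl_opts("downloads")
print(opts["outtmpl"])  # downloads/%(id)s.%(ext)s
```

The %(...)s placeholders are expanded by yt-dlp itself at download time, which is why they are left literal in the string.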

Step 2: Transcribe Audio Using Whisper

Here’s how to transcribe the audio file:

import whisper

def transcribe_audio(filepath):
    model = whisper.load_model("base")  # Load Whisper model
    result = model.transcribe(filepath, fp16=False)  # Transcribe the audio to text
    return result["text"]

The function transcribe_audio() loads the base model and converts the audio file into text.

The fp16 parameter in Whisper determines whether the model should use 16-bit floating-point precision (fp16) during processing. Setting fp16=False forces the model to use 32-bit floating-point precision (fp32).

Whisper's fp16 mode requires a GPU that supports half-precision computation. If you are running the model on a CPU (which doesn’t support fp16), you need to disable fp16 to avoid runtime errors.
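Rather than hard-coding fp16=False, you can pick the precision at runtime. Here's a small sketch; choose_fp16 is a hypothetical helper that falls back to fp32 whenever PyTorch or a CUDA GPU isn't available:

```python
def choose_fp16():
    """Return True only when a CUDA-capable GPU is available."""
    try:
        import torch  # installed alongside whisper
        return torch.cuda.is_available()
    except ImportError:
        return False  # no torch at all; fp32 is the safe default

# usage: result = model.transcribe(filepath, fp16=choose_fp16())
print(choose_fp16())
```

This way the same script runs on a GPU box with fp16 speed-ups and on a plain laptop CPU without the runtime warning.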

Step 3: Putting It All Together

Now, combine both the download and transcription steps in the main() function.

import argparse

def main(url):
    audio_path = download_audio(url)  # Download the audio
    text = transcribe_audio(audio_path)  # Transcribe the audio
    print(text)  # Print the transcribed text

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos")
    parser.add_argument("url", help="The URL of the YouTube video to transcribe")
    args = parser.parse_args()
    main(args.url)

Running the Code

Once you’ve got the script ready, you can run it directly from your terminal:

python transcribe_video.py https://youtu.be/Y8Tko2YC5hA

This will download the audio from the given YouTube URL, transcribe it, and print the text.
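Whisper's transcribe() also returns timestamped segments in result["segments"], which is exactly what the captions use case from earlier needs. Here is a minimal sketch (to_srt is my own helper, not part of Whisper) that turns those segments into the SRT subtitle format:

```python
def to_srt(segments):
    """Format Whisper segments (dicts with 'start', 'end', 'text') as SRT."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = round((seconds - int(seconds)) * 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

demo = [{"start": 0.0, "end": 2.5, "text": " Hello from Whisper"}]
print(to_srt(demo))
```

Write the returned string to a .srt file next to your video and most players will pick it up automatically.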


SpeechRecognition vs Whisper

If you’ve been writing Python for a while, you might be wondering why we should use Whisper when an existing solution, the SpeechRecognition library, is already available.

When to use each?

SpeechRecognition:

  • Quick and simple transcription tasks

  • Lightweight API-based applications

Whisper:

  • High accuracy needed

  • Noisy audio

  • Multilingual transcription

  • Offline or privacy-sensitive tasks (How? We’ll talk about it in the next blog)
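For comparison, here is roughly what the same task looks like with the SpeechRecognition library. This is a hedged sketch, assuming pip install SpeechRecognition; note that recognize_google() sends the audio to Google's web API, so unlike Whisper it needs an internet connection:

```python
def transcribe_with_sr(filepath):
    """Transcribe a WAV file using the SpeechRecognition library."""
    import speech_recognition as sr  # pip install SpeechRecognition
    recognizer = sr.Recognizer()
    with sr.AudioFile(filepath) as source:
        audio = recognizer.record(source)  # read the entire file
    # sends the audio to Google's free web API; needs internet
    return recognizer.recognize_google(audio)
```

Shorter, yes, but it hands your audio to a third party and struggles with noise and accents, which is where Whisper pulls ahead.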

Conclusion

With this, we’ve come to the end of this blog, where we built a script to transcribe YouTube videos easily using Whisper.

Whether you want to create subtitles for videos or extract key points from lectures or interviews, this approach can save time and effort.

You can easily extend this code to handle multiple languages, save the transcriptions to a file, or even build a full-fledged YouTube transcription service.
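As a starting point for those extensions, here is one hedged sketch (transcribe_to_file is my own helper): Whisper's task="translate" option outputs English regardless of the source language, and writing result["text"] to disk covers the save-to-file idea:

```python
def transcribe_to_file(filepath, outfile="transcript.txt", translate=False):
    """Transcribe (or translate to English) an audio file and save the text."""
    import whisper
    model = whisper.load_model("base")
    # task="translate" makes Whisper emit English for non-English audio
    task = "translate" if translate else "transcribe"
    result = model.transcribe(filepath, task=task, fp16=False)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(result["text"])
    return outfile
```

Swap this in for transcribe_audio() in main() and you get saved transcripts and rough multilingual support in one step.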

Head over to the official GitHub repository of Whisper to learn more about its architecture.

In the next blog post, I’ll share how you can build a tiny, privacy-friendly pet application using Whisper locally, without the internet.

Connect with me on Twitter and Bluesky.

Keep Learning 🚀 and Keep Building ❤️