YouTube Transcriptions using OpenAI's Whisper & Python

In this blog, we’ll dive into a cool project where we’ll transform YouTube videos into text using OpenAI's Whisper model. Whether you’re a student trying to summarize a lecture, or a creator looking to add subtitles to your videos, this guide has got you covered.

What is Speech-to-Text Transcription?

Speech-to-text transcription is simply the conversion of spoken language into written text. All your voice assistants, be it Siri, Alexa, or even Ok Google, use speech-to-text.

Why Transcribe Speech?

  • Captions: It helps generate subtitles for videos.

  • Summarization: You can use transcription to create summaries or extract key information from long videos.

  • Transcribing Meetings: Meetings can be transcribed and saved for future reference.

  • Automatic Translations: Transcriptions can be easily translated into multiple languages, making content accessible to a global audience.

Know About OpenAI’s Whisper

Whisper is a general-purpose speech recognition model (and very UNDERRATED).

It is an Automatic Speech Recognition (ASR) model that works efficiently with various languages and accents.

It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.

Here’s a compilation of the available models and languages supported by Whisper, taken from the official README:

  Size    Parameters  English-only  Multilingual  Required VRAM  Relative speed
  tiny    39 M        tiny.en       tiny          ~1 GB          ~32x
  base    74 M        base.en       base          ~1 GB          ~16x
  small   244 M       small.en      small         ~2 GB          ~6x
  medium  769 M       medium.en     medium        ~5 GB          ~2x
  large   1550 M      N/A           large         ~10 GB         1x

Project Setup

Before jumping into the code, let’s ensure you have a few things ready.

yt-dlp: A Python package that lets you download audio from YouTube videos.

Run this command to install yt-dlp:

pip install yt-dlp

whisper: To transcribe the audio into text.

Run the command below to install whisper:

pip install -U openai-whisper

It also requires the command-line tool ffmpeg to be installed on your system.

ffmpeg is a command-line tool to record, convert, and stream audio and video.

It is available to download via most package managers.

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew
brew install ffmpeg

# on Windows using Chocolatey
choco install ffmpeg

# on Windows using Scoop
scoop install ffmpeg
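Since the script depends on ffmpeg plus the two Python packages, a quick pre-flight check can save you a confusing stack trace later. This is a minimal sketch; the helper name check_dependencies is my own, not part of either library:

```python
import shutil

def check_dependencies():
    """Return a list of missing dependencies (empty means all good)."""
    missing = []
    # ffmpeg must be on the PATH for both yt-dlp's audio extraction and Whisper
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    # the two Python packages installed above
    for module in ("yt_dlp", "whisper"):
        try:
            __import__(module)
        except ImportError:
            missing.append(module)
    return missing

print(check_dependencies())  # e.g. [] when everything is installed
```

Run this once before anything else and you'll know exactly which install step you skipped.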

Code Walkthrough

Step 1: Download Audio from YouTube

Let’s extract the audio from a YouTube video using yt-dlp.

Here’s the function to download audio from a YouTube video:

import yt_dlp as youtube_dl

def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',  # Get the best available audio
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Convert the audio to WAV format
            'preferredcodec': 'wav',
            'preferredquality': '192',  # Audio quality in kbps (mainly relevant for lossy codecs)
        }],
        'outtmpl': './audio.%(ext)s',  # Saved as 'audio.<ext>', then converted to 'audio.wav'
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    return "audio.wav"

In this function:

  • We request the best available audio stream (bestaudio/best).

  • A postprocessor converts the download to WAV, a format commonly used for transcription.

  • The output template saves the final file as audio.wav.
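One caveat: with a fixed audio.wav name, transcribing several videos overwrites the file on each run. A hedged variation (build_ydl_opts is a hypothetical helper, not part of yt-dlp) is to name files after the video ID using yt-dlp's output template:

```python
def build_ydl_opts(output_dir="."):
    """Build yt-dlp options that name each file after its video ID,
    so repeated downloads don't overwrite one another."""
    return {
        "format": "bestaudio/best",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "wav",
        }],
        # %(id)s expands to the YouTube video ID, %(ext)s to the extension
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
    }

opts = build_ydl_opts("downloads")
print(opts["outtmpl"])  # downloads/%(id)s.%(ext)s
```

The %(...)s placeholders are expanded by yt-dlp itself at download time, which is why they are left literal in the string.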

Step 2: Transcribe Audio Using Whisper

Here’s how to transcribe the audio file:

import whisper

def transcribe_audio(filepath):
    model = whisper.load_model("base")  # Load Whisper model
    result = model.transcribe(filepath, fp16=False)  # Transcribe the audio to text
    return result["text"]

The function transcribe_audio() loads the base model and converts the audio file into text.

The fp16 parameter in Whisper determines whether the model should use 16-bit floating-point precision (fp16) during processing. Setting fp16=False forces the model to use 32-bit floating-point precision (fp32).

Whisper's fp16 mode requires a GPU that supports half-precision computation. If you are running the model on a CPU (which doesn’t support fp16), you need to disable fp16 to avoid runtime errors.
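Rather than hard-coding fp16=False, you can pick the precision at runtime. Here's a small sketch; choose_fp16 is a hypothetical helper that falls back to fp32 whenever PyTorch or a CUDA GPU isn't available:

```python
def choose_fp16():
    """Return True only when a CUDA-capable GPU is available."""
    try:
        import torch  # installed alongside whisper
        return torch.cuda.is_available()
    except ImportError:
        return False  # no torch at all; fp32 is the safe default

# usage: result = model.transcribe(filepath, fp16=choose_fp16())
print(choose_fp16())
```

This way the same script runs on a GPU box with fp16 speed-ups and on a plain laptop CPU without the runtime warning.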

Step 3: Putting It All Together

Now, combine both the download and transcription steps in the main() function.

import argparse

def main(url):
    audio_path = download_audio(url)  # Download the audio
    text = transcribe_audio(audio_path)  # Transcribe the audio
    print(text)  # Print the transcribed text

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos")
    parser.add_argument("url", help="The URL of the YouTube video to transcribe")
    args = parser.parse_args()
    main(args.url)

Running the Code

Once you’ve got the script ready, you can run it directly from your terminal:

python transcribe_video.py https://youtu.be/Y8Tko2YC5hA

This will download the audio from the given YouTube URL, transcribe it, and print the text.
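Whisper's transcribe() also returns timestamped segments in result["segments"], which is exactly what the captions use case from earlier needs. Here is a minimal sketch (to_srt is my own helper, not part of Whisper) that turns those segments into the SRT subtitle format:

```python
def to_srt(segments):
    """Format Whisper segments (dicts with 'start', 'end', 'text') as SRT."""
    def ts(seconds):
        # SRT timestamps look like HH:MM:SS,mmm
        h, rem = divmod(int(seconds), 3600)
        m, s = divmod(rem, 60)
        ms = round((seconds - int(seconds)) * 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

demo = [{"start": 0.0, "end": 2.5, "text": " Hello from Whisper"}]
print(to_srt(demo))
```

Write the returned string to a .srt file next to your video and most players will pick it up automatically.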


SpeechRecognition vs Whisper

If you’ve been writing Python for a while, you might be wondering why we should use Whisper when an existing solution, the SpeechRecognition library, is already available.

When to use each?

SpeechRecognition:

  • Quick and simple transcription tasks

  • Lightweight API-based applications

Whisper:

  • High accuracy needed

  • Noisy audio

  • Multilingual transcription

  • Offline or privacy-sensitive tasks (How? We’ll talk about it in the next blog)
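For comparison, here is roughly what the same task looks like with the SpeechRecognition library. This is a hedged sketch, assuming pip install SpeechRecognition; note that recognize_google() sends the audio to Google's web API, so unlike Whisper it needs an internet connection:

```python
def transcribe_with_sr(filepath):
    """Transcribe a WAV file using the SpeechRecognition library."""
    import speech_recognition as sr  # pip install SpeechRecognition
    recognizer = sr.Recognizer()
    with sr.AudioFile(filepath) as source:
        audio = recognizer.record(source)  # read the entire file
    # sends the audio to Google's free web API; needs internet
    return recognizer.recognize_google(audio)
```

Shorter, yes, but it hands your audio to a third party and struggles with noise and accents, which is where Whisper pulls ahead.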

Conclusion

With this, we’ve come to the end of this blog, where we built a script to transcribe YouTube videos easily using Whisper.

Whether you want to create subtitles for videos or extract key points from lectures or interviews, this approach can save time and effort.

You can easily extend this code to handle multiple languages, save the transcriptions to a file, or even build a full-fledged YouTube transcription service.
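As a starting point for those extensions, here is one hedged sketch (transcribe_to_file is my own helper): Whisper's task="translate" option outputs English regardless of the source language, and writing result["text"] to disk covers the save-to-file idea:

```python
def transcribe_to_file(filepath, outfile="transcript.txt", translate=False):
    """Transcribe (or translate to English) an audio file and save the text."""
    import whisper
    model = whisper.load_model("base")
    # task="translate" makes Whisper emit English for non-English audio
    task = "translate" if translate else "transcribe"
    result = model.transcribe(filepath, task=task, fp16=False)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(result["text"])
    return outfile
```

Swap this in for transcribe_audio() in main() and you get saved transcripts and rough multilingual support in one step.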

Head over to the official GitHub repository of Whisper to learn more about its architecture.

In the next blog post, I’ll share how you can build a tiny, privacy-friendly pet application using Whisper locally, without the internet.

Connect with me on Twitter and Bluesky.

Keep Learning 🚀 and Keep Building ❤️