YouTube Transcriptions using OpenAI's Whisper & Python
In this blog, we’ll dive into a cool project where we’ll transform YouTube videos into text using OpenAI's Whisper model. Whether you’re a student trying to summarize a lecture, or a creator looking to add subtitles to your videos, this guide has got you covered.
What is Speech-to-Text Transcription?
Speech-to-text transcription is nothing but converting spoken language into written text. All your voice assistants, be it Siri, Alexa, or Google Assistant, use speech-to-text.
Why Transcribe Speech?
Captions: It helps generate subtitles for videos.
Summarization: You can use transcription to create summaries or extract key information from long videos.
Transcribing Meetings: Meetings can be transcribed and saved for future reference.
Automatic Translations: Transcriptions can be easily translated into multiple languages, making content accessible to a global audience.
Know About OpenAI’s Whisper
Whisper is a general-purpose speech recognition model (and very UNDERRATED).
It is an Automatic Speech Recognition (ASR) model that works efficiently with various languages and accents.
It is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
Whisper ships in several model sizes (tiny, base, small, medium, and large); the full table of available models and supported languages is in the official repository.
Project Setup
Before jumping into the code, let’s ensure you have a few things ready.
yt-dlp: a Python package that lets you download audio from YouTube videos.
Run this command to install yt-dlp:
pip install yt-dlp
whisper: to transcribe the audio into text.
Run the command below to install whisper:
pip install -U openai-whisper
Whisper also requires the command-line tool ffmpeg to be installed on your system.
ffmpeg is a command-line tool to record, convert, and stream audio and video.
It is available to download via most package managers.
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on macOS using Homebrew
brew install ffmpeg
# on Windows using Chocolatey
choco install ffmpeg
# on Windows using Scoop
scoop install ffmpeg
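Because a missing ffmpeg only surfaces as a confusing error later, it can be worth checking for it up front. Here is a small sketch (the `ffmpeg_available` helper is my own addition, not part of the tutorial's script):

```python
import shutil

def ffmpeg_available():
    """Return True if an ffmpeg binary can be found on PATH."""
    return shutil.which("ffmpeg") is not None

if not ffmpeg_available():
    print("ffmpeg not found -- install it before running the transcriber")
```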
Code Walkthrough
Step 1: Download Audio from YouTube
Let’s extract the audio from a YouTube video using yt-dlp.
Here’s the function to download audio from a YouTube video:
import yt_dlp as youtube_dl

def download_audio(url):
    ydl_opts = {
        'format': 'bestaudio/best',  # Get the best available audio
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',  # Convert the audio to WAV format
            'preferredcodec': 'wav',
            'preferredquality': '192',
        }],
        'outtmpl': './audio.%(ext)s',  # Save the file with the name 'audio.wav'
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    return "audio.wav"
In this function, we:
Specify the format we want (bestaudio).
Set the output to WAV format because it’s commonly used for transcription.
Save the audio with the filename audio.wav.
Step 2: Transcribe Audio Using Whisper
Here’s how to transcribe the audio file:
import whisper

def transcribe_audio(filepath):
    model = whisper.load_model("base")  # Load Whisper model
    result = model.transcribe(filepath, fp16=False)  # Transcribe the audio to text
    return result["text"]
The function transcribe_audio() loads the base model and converts the audio file into text.
The fp16 parameter in Whisper determines whether the model should use 16-bit floating-point precision (fp16) during processing. Setting fp16=False forces the model to use 32-bit floating-point precision (fp32).
Whisper's fp16 mode requires a GPU that supports half-precision computation. If you are running the model on a CPU (which doesn’t support fp16), you need to disable fp16 to avoid runtime errors.
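If you want the same script to run on both CPU-only and GPU machines, you can pick the fp16 setting at runtime instead of hard-coding it. A minimal sketch, assuming PyTorch may or may not be installed (the `pick_fp16` helper name is my own):

```python
def pick_fp16():
    """Use fp16 only when a CUDA GPU is available; fall back to fp32 otherwise."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False  # no PyTorch -> assume CPU, so use fp32

# Usage: model.transcribe(filepath, fp16=pick_fp16())
```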
Step 3: Putting It All Together
Now, combine both the download and transcription steps in the main() function.
import argparse

def main(url):
    audio_path = download_audio(url)  # Download the audio
    text = transcribe_audio(audio_path)  # Transcribe the audio
    print(text)  # Print the transcribed text

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos")
    parser.add_argument("url", help="The URL of the YouTube video to transcribe")
    args = parser.parse_args()
    main(args.url)
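A natural extension is letting the user choose the Whisper model size from the command line. A sketch of the idea (the --model flag and build_parser helper are my additions, not part of the original script):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe YouTube videos")
    parser.add_argument("url", help="The URL of the YouTube video to transcribe")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model size to load")
    return parser

# e.g. `python transcribe_video.py <url> --model small` sets args.model to "small",
# which you would then pass to whisper.load_model(args.model)
```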
Running the Code
Once you’ve got the script ready, you can run it directly from your terminal:
python transcribe_video.py https://youtu.be/Y8Tko2YC5hA
This will download the audio from the given YouTube URL, transcribe it, and print the text.
SpeechRecognition vs Whisper
If you’ve been writing Python for a while, you might be wondering why we should use Whisper when an existing solution, the SpeechRecognition library, is already available.
When to use each?
SpeechRecognition:
Quick and simple transcription tasks
Lightweight API-based applications
Whisper:
High accuracy needed
Noisy audios
Multilingual transcription
Offline or privacy-sensitive tasks (How? We’ll talk about it in the next blog.)
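For comparison, here is roughly what the same transcription step looks like with SpeechRecognition. This is a sketch assuming you have run pip install SpeechRecognition, have a WAV file on disk, and have internet access (the helper name is mine):

```python
def transcribe_with_speechrecognition(wav_path):
    """Transcribe a WAV file via Google's free Web Speech API (needs internet)."""
    import speech_recognition as sr  # imported lazily: pip install SpeechRecognition
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire file
    return recognizer.recognize_google(audio)
```

The audio leaves your machine with this approach, which is exactly the privacy trade-off noted above.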
Conclusion
With this, we’ve come to the end of this blog, where we saw how to write a script to transcribe YouTube videos easily using Whisper.
Whether you want to create subtitles for videos or extract key points from lectures or interviews, this approach can save time and effort.
You can easily extend this code to handle multiple languages, save the transcriptions to a file, or even build a full-fledged YouTube transcription service.
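For instance, saving the transcription to a file takes only a couple of lines. A minimal sketch (the `save_transcript` helper is my own):

```python
from pathlib import Path

def save_transcript(text, out_path="transcript.txt"):
    """Write the transcribed text to a UTF-8 file and return its path."""
    Path(out_path).write_text(text, encoding="utf-8")
    return out_path
```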
Head over to the official GitHub repository of Whisper to learn more about its architecture.
In the next blog post, I’ll share how you can build a tiny, privacy-friendly pet application using Whisper locally, without the internet.
Connect with me on Twitter and Bluesky.
Keep Learning 🚀 and Keep Building ❤️