In this tutorial, I will show you how to use the OpenAI Whisper model to transcribe your audio files. Big thanks to Rainy Dong for showing me how easy it is to do this! ♥
About OpenAI Whisper model
Whisper is an automatic speech recognition (ASR) model trained on 680,000 hours of multilingual and multitask supervised data collected from the internet. You can find more information about it on the OpenAI Whisper website.
Using Google Colab to run your code
To run code in Google Colab, simply click the arrow at the top-left corner of the code chunk/cell. Depending on what the code does, it may take a few seconds (or minutes, or even hours!) for the code to run. Once executed, the arrow will turn into a number (numbers correspond to the order in which you ran the cells), and you'll see the execution time displayed next to it.
Things to remember when using Colab:
- Please do not run multiple chunks at the same time, because things will get stuck.
- Most code chunks will produce some output, always check that the output is what you expect.
- Warnings are usually fine, but please read them to make sure you're not missing anything (e.g., deprecated functions).
Mounting Google Drive
To analyze files from your Google Drive, you have to let the script know where your data is located. Execute the chunk below. A pop-up window will open; follow the instructions to grant Colab access to your files in Google Drive.
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive') # say where your drive is
Reading audio files
Once you've done that, indicate which folder on Google Drive contains the audio files you wish to transcribe. You can name your folder however you like, but please do not use spaces.
import os
# Change the path below to the folder containing your audio files
wav_folder = "/content/drive/MyDrive/speechcues/data/"
os.chdir(wav_folder)
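If the path has a typo (or Drive isn't mounted yet), `os.chdir` fails with a bare `FileNotFoundError`. A small guard like this gives a friendlier hint; the helper function is my own addition, not part of the original script:

```python
import os

def enter_audio_folder(path):
    # Check the folder exists before changing into it, so a typo in the
    # path produces a helpful message instead of a bare error
    # (this helper is illustrative, not part of the original tutorial)
    if not os.path.isdir(path):
        raise FileNotFoundError(
            f"Folder not found: {path} - check the spelling and make sure "
            "Google Drive is mounted")
    os.chdir(path)

# In Colab you would call, for example:
# enter_audio_folder("/content/drive/MyDrive/speechcues/data/")
```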
Now, let's check if your files are being read correctly. Execute the chunk below. The output should show a list of the files that are currently in the folder you specified above.
- You can change the format of your files from mp3 to wav if needed
- If you add new files to an already mounted drive, you might need to re-read the folder
- If you add files in a new folder, you need to re-mount the drive; otherwise the script won't see it
# Get audio files
wav_files = [f for f in os.listdir(wav_folder) if f.endswith('.mp3')]
# Check that all files loaded correctly
print(wav_files)
len(wav_files)
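The chunk above only picks up `.mp3` files. If your folder mixes formats, a small variation like this (my own sketch, not part of the original tutorial) matches both `.mp3` and `.wav`, ignoring case:

```python
import os

def list_audio_files(folder, extensions=(".mp3", ".wav")):
    # Case-insensitive match, so files like "TALK.MP3" are found too;
    # sorted() keeps the order stable across runs
    return sorted(f for f in os.listdir(folder)
                  if f.lower().endswith(extensions))

# Then, for example:
# wav_files = list_audio_files(wav_folder)
# print(wav_files)
```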
Transcribing audio files
Once you set up everything correctly, running the model is super easy.
Loading the model
Run the following code to install Whisper (and ffmpeg) in your Colab workspace. It will produce a lot of output - that's normal. Most messages will say things like installing, collecting, or successfully installed/uninstalled. This might take some time to finish because the packages are large.
# Import model
!pip install git+https://github.com/openai/whisper.git
!sudo apt update && sudo apt install ffmpeg
You can use the command below to check that Whisper was installed successfully. If everything installed correctly, this should print usage notes for all available options.
# Print model help
!whisper -h
Running the model
The code below uses the OpenAI model to generate transcriptions with the default settings for the large model in English and saves them as .txt files. This can take a long time to run (especially if you are not connected to a Colab GPU runtime, which I don't recommend unless you have a really powerful machine), so please be patient and do not close your browser.
- If you're running this using your CPU, it will take forever (depending on your hardware, even several hours!)
- If you're running this using a Colab GPU, it will be 100 times faster, and it won't drain your computer's resources, so you can do other things in the meantime
# Transcribe all audio files in folder
for wav_file in wav_files:
    !whisper "{wav_file}" --model large --language en --output_format txt
That's it! You can easily change any of the settings (e.g., model size, language, values of various parameters, file output format) by including the appropriate flags in your code.
After you run this, your transcribed files will appear in your folder in the specified format as soon as each transcription is ready.
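As a sketch of how those flags fit together (the flag names come from `whisper -h`; the helper function and the default values here are my own illustration, not part of the tutorial), you could build the command like this:

```python
import subprocess

def whisper_cmd(audio_file, model="small", language="en", output_format="srt"):
    # Build the whisper CLI invocation as an argument list
    # (helper name and defaults are illustrative, not part of Whisper)
    return ["whisper", audio_file,
            "--model", model,
            "--language", language,
            "--output_format", output_format]

# In Colab you could then run, for example:
# subprocess.run(whisper_cmd("interview.mp3", model="medium"), check=True)
print(" ".join(whisper_cmd("interview.mp3")))
```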
FAQ
Which model to choose?
It depends on your application and the amount of data. Smaller models are much faster, but usually much less accurate. Larger models provide better accuracy, but require more computational power and are slower.
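For a rough sense of scale, here are the approximate parameter counts from the OpenAI Whisper README (figures may change between releases, so treat them as a guide, not exact values):

```python
# Approximate parameter counts from the OpenAI Whisper README;
# these are a rough guide, not exact or necessarily up-to-date values
whisper_model_sizes = {
    "tiny":   "~39 M parameters",
    "base":   "~74 M parameters",
    "small":  "~244 M parameters",
    "medium": "~769 M parameters",
    "large":  "~1550 M parameters",
}
for name, size in whisper_model_sizes.items():
    print(f"{name:>6}: {size}")
```

You select one of these names with the `--model` flag, as in the transcription loop above.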
First published on: June 26, 2025