Transcription Made Easy: Exploring Whisper’s Transcription Features
Amruta Agnihotri
Senior Software Architect
Divya Gupta
Software Engineer
Introduction
Transcription is the process of converting spoken words into text,
making information more accessible, searchable, and easier to
manage. In everyday life, transcription simplifies tasks by
allowing people to revisit and analyze conversations, meetings, or
lectures without relying solely on memory. It is therefore widely
used in fields like medicine, law, academia, business, media, and
market research for better accessibility, analysis, and
documentation. Transcription also powers applications like voice
assistants, where it enables voice-command recognition,
personalized responses, language translation, and accessibility
features, making them more intuitive and user-friendly.
When it comes to implementing a transcription solution in your
system, there are broadly two approaches you can take. One is
integrating with a cloud transcription service such as Amazon
Transcribe (AWS), Google Cloud Speech-to-Text, Microsoft Azure
Speech to Text, OpenAI Whisper, and so on. The other is to deploy
and maintain an open-source solution on your own. There are many
alternatives in this space as well, such as Vosk, Whisper,
DeepSpeech, Athena, and TensorFlowASR, to name a few.
When choosing a transcription solution, you need to weigh several
factors: accuracy, the amount of training data used (which
directly affects accuracy), deployment cost, supported languages,
and latency, to name a few important ones.
In this article, we will take a closer look at OpenAI’s Whisper
model, as its accuracy is among the best of the options available
so far and it can be used in both flavors: on-premise and
API-based.
About OpenAI Whisper
OpenAI Whisper is an advanced speech recognition system developed
by OpenAI, designed to transcribe spoken language into written
text with high accuracy. It is one of the most accurate automatic
speech recognition models available, and it stands out from other
tools on the market because of the sheer volume of its training
data: it was trained on 680,000 hours of audio collected from the
internet. Whisper can be used through the OpenAI Whisper API
directly or by downloading the open-source Whisper library.
Integrating with OpenAI’s Whisper APIs
Integrating with the API-based Whisper model has some advantages:
It can scale effortlessly to handle large volumes of
transcription tasks, making it suitable for businesses of
all sizes.
The client can run on an ordinary machine at high speed
because the model is hosted by OpenAI, i.e. no external
instance is required to host the service, making the
integration easy to implement and maintain.
By using the API, businesses can avoid the overhead costs
of maintaining in-house infrastructure for speech
recognition, opting instead for a pay-as-you-go model that
scales with their needs. The API costs $0.006 per 60
seconds of audio as of September 2024, when this article
was written.
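To make the integration concrete, here is a minimal sketch of calling the hosted API with the official openai Python package (v1.x interface); the file name is hypothetical and OPENAI_API_KEY is assumed to be set in the environment.

```python
# Minimal sketch: transcribing a local audio file with OpenAI's
# hosted Whisper API (openai Python package, v1.x interface).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting.mp3", "rb") as audio_file:  # hypothetical file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # hosted Whisper model
        file=audio_file,
    )

print(transcript.text)
```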
While it’s a convenient choice for quick prototyping, the API may
not suit every use case: its cloud-based nature means sharing data
with OpenAI, which can be a concern in some scenarios.
Additionally, the service’s performance can be affected by
external factors such as server downtime and variable API costs,
which means you never have full control over it. In such use
cases, owning your AI solution becomes critical for ensuring data
privacy and control.
Own your Whisper Transcription Solution
Whisper is an open-source model that comes primarily in six
variants, named ‘tiny’, ‘base’, ‘small’, ‘medium’, ‘large’, and
‘turbo’. Hosting it yourself has the following benefits:
The model can be deployed in an environment of your
choice, on-premise or on your own cloud.
Running the Whisper library locally keeps the data under
your control, providing stronger security and data
privacy.
The service can be started on an ordinary machine at no
cost, provided you accept the trade-off between accuracy
and speed: either low accuracy with high speed or high
accuracy with low speed.
A version of the model can be chosen to match your
specific requirements for transcription accuracy and
acceptable response time (i.e. whether you need real-time
transcription or batch processing). For example, on an
instance with 2 vCPUs and 4 GiB of RAM, transcribing 6
seconds of raw PCM audio takes 2-3 seconds with the tiny
model, whereas the small model takes 11-12 seconds and
about 2 GB of RAM.
Vertical and horizontal scalability of the deployment can
be managed according to your needs.
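To show how little code the self-hosted route requires, here is a minimal sketch using the open-source library (installed with pip install openai-whisper); the file name is hypothetical.

```python
# Minimal sketch: running the open-source Whisper model locally.
import whisper

model = whisper.load_model("tiny")        # or "base", "small", "medium", ...
result = model.transcribe("meeting.mp3")  # hypothetical input file
print(result["text"])
```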
Getting the best results with the Whisper Model
Whichever flavor you choose, the hosted API or a self-deployed
model, the following practices will help you get the best results:
Ensure the audio clip is processed correctly to eliminate
or reduce noise and silent stretches. This not only
produces more accurate transcriptions but also speeds up
processing, since the model receives only meaningful data
to work on. In the case of the cloud Whisper API, it also
results in cost savings, because the cost is calculated
per second of audio processed (see the trimming sketch
after this list).
The Whisper model performs automatic language
identification, which allows it to detect the spoken
language without explicit specification. However, for
optimal results, it’s recommended to set the spoken
language when you know it, especially in noisy
environments (the sketch after this list shows how).
Pay attention to the configurable parameters the Whisper
model provides. For example, log_prob_threshold is used to
determine whether a segment should be considered failed,
based on the average log probability of the sampled
tokens. Lowering this value makes Whisper more sensitive
to quiet sounds.
Another such example is no_speech_threshold. This
parameter helps Whisper decide whether a segment is
silent: if the no-speech probability is higher than this
value and the average log probability is below
log_prob_threshold, the segment is considered silent (both
thresholds appear in the tuning sketch after this list).
The Whisper model can also translate speech from various
languages directly into English. This is particularly
useful in multilingual settings, such as meetings where
participants speak different languages but you need a
unified transcription in English (see the translation
sketch below).
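As an illustration of the preprocessing tip above, the following sketch trims leading and trailing silence before transcription. It assumes the third-party pydub library (which requires ffmpeg); the file names are hypothetical.

```python
# Minimal sketch: trim leading/trailing silence before handing the
# audio to Whisper. Uses the third-party pydub library
# (pip install pydub, requires ffmpeg); file names are hypothetical.
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

audio = AudioSegment.from_file("raw_input.wav")
# Find spans louder than -40 dBFS that last at least 300 ms.
nonsilent = detect_nonsilent(audio, min_silence_len=300, silence_thresh=-40)
if nonsilent:
    start, end = nonsilent[0][0], nonsilent[-1][1]
    audio[start:end].export("trimmed.wav", format="wav")
```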
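The language hint and the two thresholds discussed above are all arguments to the open-source library’s transcribe call. Here is a minimal sketch with a hypothetical input file; note that the reference openai-whisper package spells the first parameter logprob_threshold, while some ports such as faster-whisper call it log_prob_threshold.

```python
# Minimal sketch: pinning the spoken language and tuning the
# silence-detection thresholds with the open-source whisper library.
import whisper

model = whisper.load_model("base")
result = model.transcribe(
    "trimmed.wav",            # hypothetical input file
    language="en",            # explicit hint; omit to auto-detect
    logprob_threshold=-1.0,   # library default; lower it to keep quieter segments
    no_speech_threshold=0.6,  # library default; works together with the above
)
print(result["text"])
```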
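Finally, translation into English is selected through the task parameter, as in this sketch (the file name is hypothetical).

```python
# Minimal sketch: translating non-English speech directly into English.
import whisper

model = whisper.load_model("small")
result = model.transcribe("spanish_interview.mp3", task="translate")
print(result["text"])  # English translation of the source audio
```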