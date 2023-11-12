Sam Altman spent almost no time on it during OpenAI DevDay. All the attention went to GPT-4 Turbo and GPTs. However, for those of us who do not pay for artificial intelligence and have not yet gotten used to creating with prompts, there is a much simpler and more effective tool.

We are talking about Whisper, which reached its third generation this week. It is a speech recognition model that not only understands and translates dozens of languages, but can also transcribe entire conversations with surprising accuracy.

Unlike ChatGPT or DALL·E, Whisper V3 is open source. Its code is already published on GitHub, and it can be used for free through Hugging Face or Replicate. Using Whisper there is as simple as uploading an audio file and clicking a button.

Whisper V3 gets the commas right

Whisper V3 was trained on over a million hours of labeled audio and more than 4 million hours of pseudo-labeled audio. Compared to the previous model, Whisper now makes 10-20% fewer errors. In the case of Spanish, the error rate is below 5%, making it one of the languages the model understands best.
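To make those percentages concrete, here is a minimal sketch of how an error rate like this is commonly computed. It assumes the metric is word error rate (WER), which is the usual choice for speech recognition; the article's figures do not name the metric explicitly. The function names are hypothetical, and real evaluations normalize punctuation and casing more carefully than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplification: lowercase and split on whitespace only.
    # Punctuation stays attached to words, so "nine," differs from "nine".
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction of the old model's errors that the new model no longer makes."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate
```

For example, scoring the hypothesis "the meeting starts at five tomorrow" against the reference "the meeting starts at nine tomorrow morning" gives two errors (one wrong word, one missing word) over seven reference words, so a WER of 2/7, roughly 29%. With made-up numbers, a model going from a 5% to a 4% error rate would show a relative reduction of 20%, which is the kind of figure the "10-20% fewer errors" claim refers to.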

In my case, I have been using Whisper V2 for months to help me transcribe interviews, both in English and Spanish. I quickly tested Whisper V3 and the result is even better. The text itself is practically the same, because Whisper V2 already understood speech very well, but the difference with Whisper V3 is that it gets even the pauses in the conversation right, placing commas and periods much more accurately.





Whisper can be used directly as a translator or to transcribe a language. It is also capable of automatically identifying when you switch from one language to another in the same conversation. Because it is released as a model, OpenAI's goal is for other companies or developers to use it in their own voice assistants.
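For developers, the two modes map onto the open-source Python package. The following is a minimal sketch, assuming the `openai-whisper` package and `ffmpeg` are installed and that a file called `interview.mp3` exists (the file name and the helper names `build_options` and `run_whisper` are hypothetical). Note that, as far as I know, the `translate` task always produces English text.

```python
from typing import Optional


def build_options(task: str = "transcribe", language: Optional[str] = None) -> dict:
    """Collect the keyword arguments passed on to model.transcribe()."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")
    options = {"task": task}
    if language is not None:
        # Language code such as "es"; leave it out to let Whisper detect it.
        options["language"] = language
    return options


def run_whisper(audio_path: str, model_name: str = "large-v3",
                task: str = "transcribe", language: Optional[str] = None) -> dict:
    """Run Whisper on one audio file and return the detected language and text."""
    # Imported lazily so build_options() can be used without the package installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path, **build_options(task, language))
    return {"language": result["language"], "text": result["text"]}
```

Calling `run_whisper("interview.mp3")` would transcribe the audio in its original language, while `run_whisper("interview.mp3", task="translate")` would return an English translation instead. Smaller model names such as `"tiny"` or `"base"` trade accuracy for speed and memory.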

As in previous generations, Whisper is available in various sizes to fit different applications. They range from a tiny version, which has 39 million parameters and requires about 1 GB of VRAM, to the large model, which has 1.55 billion parameters and requires about 10 GB of VRAM. This large model is the one available directly through Hugging Face or Replicate.
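As a rough illustration of how those sizes translate into a practical choice, this sketch picks the largest model that fits a given amount of GPU memory. The tiny and large figures come from the text above; the intermediate sizes are the approximate values listed in the project's README as I recall them, so treat them as assumptions. The names `ModelSize` and `largest_model_that_fits` are hypothetical.

```python
from typing import NamedTuple, Optional


class ModelSize(NamedTuple):
    name: str
    parameters_millions: int
    approx_vram_gb: int


# Approximate figures; only "tiny" and "large" are stated in the article.
MODEL_SIZES = [
    ModelSize("tiny", 39, 1),
    ModelSize("base", 74, 1),
    ModelSize("small", 244, 2),
    ModelSize("medium", 769, 5),
    ModelSize("large", 1550, 10),
]


def largest_model_that_fits(vram_gb: float) -> Optional[ModelSize]:
    """Return the biggest model whose approximate VRAM need fits, or None."""
    fitting = [m for m in MODEL_SIZES if m.approx_vram_gb <= vram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.parameters_millions)
```

With an 8 GB graphics card, for instance, this would suggest the medium model, while a card with 12 GB or more leaves room for the large one.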

Until now, transcribing audio to text had always been a disaster. Most free tools made too many errors: misplaced words, wrong figures or missing expressions. In the end you had to review all the audio carefully, so you didn't save much time.

Whisper V2 was the first free tool whose results truly convinced me. With Whisper V3 I have the feeling that this model is here to stay. It has just what we ask of technology: that it be easy to use, fast, effective and also free. Altman, we want more models like this.

