Reading List
Apple’s New Foundation Model Speech APIs Outpace Whisper for Transcription (via the Daring Fireball RSS feed)
John Voorhees, writing at MacStories, regarding a new command-line transcription tool, cleverly named Yap, that his son Finn wrote during WWDC last week:
On the way, Finn filled me in on a new class in Apple’s Speech framework called SpeechAnalyzer and its SpeechTranscriber module. Both the class and module are part of Apple’s OS betas that were released to developers last week at WWDC. My ears perked up immediately when he told me that he’d tested SpeechAnalyzer and SpeechTranscriber and was impressed with how fast and accurate they were. [...]
What stood out above all else was Yap’s speed. By harnessing SpeechAnalyzer and SpeechTranscriber on-device, the command line tool tore through the 7GB video file a full 2.2× faster than MacWhisper’s Large V3 Turbo model, with no noticeable difference in transcription quality.
At first blush, the difference between 0:45 and 1:41 may seem insignificant, and it arguably is, but those are the results for just one 34-minute video. Extrapolate that to running Yap against the hours of Apple Developer videos released on YouTube with the help of yt-dlp, and suddenly you’re talking about a significant amount of time. Like all automation, picking up a 2.2× speed gain one video or audio clip at a time, multiple times each week, adds up quickly.
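The back-of-the-envelope math here checks out, and it's easy to see how it compounds. A minimal sketch, using only the timings quoted above (0:45 vs. 1:41 for one 34-minute video); the 100-video batch size is a hypothetical, not a figure from the article:

```python
# Timings quoted in the article for one 34-minute video:
yap_seconds = 45        # Yap (SpeechAnalyzer/SpeechTranscriber), 0:45
whisper_seconds = 101   # MacWhisper Large V3 Turbo, 1:41

speedup = whisper_seconds / yap_seconds          # ~2.24x, matching the quoted "2.2x faster"
saved_per_video = whisper_seconds - yap_seconds  # 56 seconds saved per video

# Hypothetical batch: 100 session videos of similar length
videos = 100
total_saved_minutes = videos * saved_per_video / 60

print(f"{speedup:.1f}x speedup, about {total_saved_minutes:.0f} minutes saved over {videos} videos")
```

Per clip the difference is under a minute, but across a batch of conference videos it works out to well over an hour and a half of wall-clock time.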
Apple’s Foundation Models sure seem to be the sleeper hit from WWDC this year. This bodes very well for all sorts of use cases where transcription would be helpful, like third-party podcast players.