At Last: Transcription Software That Works
It has only taken a few decades, but finally a company has fulfilled one of my most fervent technology hopes.
This is an installment in the “Interesting Software” chronicles. The topic has fascinated me ever since I patched together my first computer and its programs in the late 1970s. The hardware was a Processor Technology SOL-20, designed by Lee Felsenstein and renowned in the computer world as the machine that came in a lovely burled-walnut case. The software was a writing program called The Electric Pencil, invented by the filmmaker Michael Shrayer. The SOL-20 is still in a closet. I dust off the walnut every so often.
At the end of this post I’ll include links to some entries from that era onward, about programs, machinery, advances, and setbacks I have found notable through the personal-computing age.
Today’s entry is about a recent advance that has substantially changed my working life for the better. Everything that follows is an elaboration on these next two sentences: If your professional or personal duties involve converting voice recordings to text, you’ll want to check out a new service called Otter.ai. I’ve tried countless applications in this field, and for me this is the first one that works. (For the record, I have no connection with the company except as a customer.)
Now, the details.
Back in 2000 I wrote an article about the state of speech-recognition software. The headline on that article was “From Your Lips to Your Printer,” and it began this way:
For years I knew exactly what a computer would have to do to make itself twice as useful as it already was. It would have to show that it could accurately convert the sound of spoken language to typed-up text.
I had a specific chore in mind for such a machine. I would give it the tape recordings I make during interviews or while attending speeches, and it would give me back a transcript of who said what. This would save the two or three hours it takes to listen to and type up each hour's worth of recorded material.
This machine would have advantages for other people, too. It would help groups that want minutes of their meetings or brainstorming sessions, legal professionals who need quick transcripts of what just happened at trials, students in big lecture halls, people who want to dictate e-mail while stuck in traffic, and those who, owing to disability or stress injury, are not able to type.
That was the dream. The article laid out some halting early steps toward its realization. For instance, that era’s most effective dictation systems, like Dragon NaturallySpeaking and IBM’s ViaVoice, could be “trained” to do a more-or-less good job of recognizing and understanding your own voice, after enough patient trial-and-error correction.
I revealed at the end of the article that, as a stunt, I had composed the whole piece by dictation with Dragon, rather than by typing. As a technical challenge that wasn’t so hard, but it required a much greater cognitive leap than I had anticipated. It made me realize that after tens of thousands of hours at a keyboard, when writing I essentially think through my fingers. More details on that in the original piece.
Of course for circumstances where people can’t think through their fingers, the programs were a huge advance in accessibility and convenience. But they were a long, long distance from the free-form, non-dictated, un-trained, “speaker independent” transcription of meetings or interviews that I’d been dreaming of.
As I mentioned in an earlier post, going out and seeing and hearing things is the fun part of reporting. Converting what you’ve learned into something people can read or listen to is hard work, but it is the price you pay for the privilege of learning. Transcribing recorded interviews? That’s just a giant headache, for which the software of 20 years ago offered no relief. For later discussion: there are rare occasions when you learn new things by listening again to the recording. Usually it’s just drudgery.
That year-2000 article also explained why voice-to-text translation was so difficult for computers. Voice recordings come into a computer as a stream of sound waves. Effective speech-recognition requires parsing the sound waves into phonemes; grouping those phonemes into likely words (and choosing among homonyms); piecing the words into plausible phrases; allocating those phrases into reasonable sentences; coping with uhh-type fillers and the other shagginess of real language; and more. And, doing all this despite differences in accent, and speaking pace, and recording volume, and voice pitch, and specialized vocabulary, and background noise, and interruptions and cross-talk, et cetera.
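The stages described above can be caricatured in code. What follows is a toy sketch only; real recognizers use statistical and neural models trained on huge corpora, and the phoneme table, word lists, and filler set here are invented for illustration:

```python
# Toy sketch of the stages described above. Real systems score candidates
# with acoustic and language models; this uses invented lookup tables.
PHONEME_WORDS = {
    ("DH", "EH", "R"): ["their", "there", "they're"],  # homonyms to choose among
    ("K", "AE", "T"): ["cat"],
}
FILLERS = {"uh", "um", "er"}  # the "shagginess of real language"

def phonemes_to_candidates(phoneme_groups):
    """Group phonemes into likely candidate words."""
    return [PHONEME_WORDS.get(tuple(g), ["<unk>"]) for g in phoneme_groups]

def pick_words(candidates):
    """Choose among homonyms. A real system scores each option against
    the surrounding context; this toy version just takes the first."""
    return [options[0] for options in candidates]

def drop_fillers(words):
    """Cope with uhh-type fillers by removing them."""
    return [w for w in words if w not in FILLERS]

groups = [("DH", "EH", "R"), ("K", "AE", "T")]
words = drop_fillers(pick_words(phonemes_to_candidates(groups)))
print(" ".join(words))  # their cat
```

Even this caricature shows where the difficulty lives: every stage involves choosing among plausible alternatives, and a wrong choice early on cascades through everything downstream.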
What’s more, “close enough” is not very good for speech recognition. Back in those days, programs would advertise themselves as having “90% accuracy!” or something in that range. That’s like saying, “this food tastes 90% free of putrid rot.” Unless the transcript was much, much better than that, you’d spend so much time fixing typos, making sense of garble, and rolling your eyes at ludicrous errors, that the whole exercise was pointless. You’d have saved time by just sitting down to transcribe—which is what I usually did.
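To make the “close enough” problem concrete, here is a small sketch of the standard word-error-rate calculation used to score transcripts. The sample sentences and the 150-words-per-minute speaking pace are my own illustrative assumptions, not figures from the original article:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) needed
    to turn the hypothesis into the reference, divided by the reference
    word count. Advertised "accuracy" is roughly 1 minus this."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word in ten: the advertised "90% accuracy".
ref = "please send the revised transcript before the meeting starts tomorrow"
hyp = "please send the revised transcripts before the meeting starts tomorrow"
print(word_error_rate(ref, hyp))  # 0.1

# At an assumed speaking pace of 150 words per minute, an hour of
# recording is about 9,000 words -- so "90% accuracy" means roughly
# 900 garbled words to hunt down and fix.
print(int(150 * 60 * 0.1))  # 900
```

Nine hundred corrections per recorded hour is why “90% accuracy” was, in practice, no better than no transcript at all.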
Ten cycles of Moore’s Law
But that was a long time ago. One illustration: when that article came out, the world market for smartphones was zero. They did not exist. The “modern” phone of the era was the Nokia 3310, which now would seem too primitive even for “burner” use. The first BlackBerry had just been introduced.
More important, in Moore’s Law terms, that era was ten generations in the past. (Moore’s Law started out with a different technical meaning but has evolved into the concept that computing power will double every two years.) This means that, in rough terms, today’s “standard” computer is at least a thousand times more powerful than one 20 years ago.
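The arithmetic behind that “thousand times” figure is simple compounding: ten doublings.

```python
# Ten Moore's-Law generations (a doubling every two years, per the
# popular formulation above) compound to roughly a thousandfold gain.
years = 20
doubling_period = 2
generations = years // doubling_period  # 10
growth = 2 ** generations
print(generations, growth)  # 10 1024
```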
Everything about speech-recognition depends on computing power. Collecting enormous amounts of spoken-data-with-matching-transcripts, on which the recognition programs can “train.” Storing those enormous volumes. (“Big Data.”) Processing, analyzing, indexing, and parsing them. Using the analytical algorithms to convert raw sound into prose. Applying these steps to new streams of sound as they come in. Constantly improving and self-correcting, which is what “deep learning” boils down to.
These days, all of this happens at least a thousand times faster than back when I was complaining it didn’t work well enough. Over time, a difference in degree becomes a difference in kind. What was too slow and clumsy is now fast and, if not perfect, usably precise.
This brings us to Otter.ai.
Otter is an online voice-to-text service that made its debut a few years ago. It has received a lot of attention in the tech world. For instance, you can read John Markoff’s New York Times article from 2019, and find podcasts and other coverage on the company’s site and elsewhere.
Why “Otter”? The company started out in 2016 with a more generic, tech-sounding name. “Otters have a friendly image,” I heard back from the company spokesperson when I asked about the name. “They're adorable, and you often see pictures of otters holding hands and floating on water. Otters project a collaborative image and we are building a collaboration product. Otters are also one of the smartest animals in the world, many people don't know that, they can learn a lot of skills and have a good memory.”
As for paying, the details are at the Otter site. In essence: there’s a free option, which I used for a while. It allows up to 600 minutes of transcription per month, and up to 40 minutes per audio session. This is a no-risk way for interested potential users to give it a try.
After a week or two on the free plan, I found that I was using and relying on it so much that I switched to the “Pro” option, for about $100 a year. It allows 6,000 minutes per month, and 4 hours per recording. There are other enterprise-scale options available. I haven’t come close to hitting the “Pro” limits.
The co-founders of Otter are Sam Liang and Yun Fu, both long-time Silicon Valley figures originally from China. I will say more about their personal and technology-world backgrounds another time. And also about the political, journalistic, personal, and other implications of technological advances like this.
For the moment I want to use a conversation I had with Liang last week to illustrate exactly what I mean by “It works.”
Does it work? Take a look and see what you think.
During the Zoom call I had with Liang and some of his colleagues, Otter was providing a real-time transcription of what all of us were saying. Otter’s recent business push has been for integration with Zoom and many other online meeting platforms, to create meeting notes.
I want to share part of the automated transcript of what Liang said, not so much for its content (which is interesting to me) but for its accuracy.
I stress that what I have quoted below is the raw version of what Otter produced. The capitalization, commas, periods, and sentence breaks are shown as they appeared. I have altered this passage in only two ways: I’ve added some extra paragraph separations, for clarity; and I’ve added some comments in brackets-and-bold, [like this], to register tiny inaccuracies I noted.
Update: As my friends and journalistic comrades Walter Shapiro and Joe Nocera have pointed out, Otter also allows you to click on any line in the transcript and hear the original voice recording of that moment. It’s an easy way to clarify anything that’s confusing, or confirm important passages. I have seen similar features in other applications over the years, but here, once again, it works.
Here is what Otter gave us, from a Zoom call:
One thing we recognized was that, you know boys [“voice”] is such an important form for communication. However, there is no really easy way to search voice information.
First of all, most of the voice data was not even captured on it, looking back into history actually before 1877 When Thomas Edison invented audio recorder, no noise [“voice”] was ever kept. If you think about it, we'd love to have the, You know, anything, spoken by Shakespeare or Charles Darwin when he was doing his research right before that time there no voice [the system may be adapting; from here on, it’s all “voice”] was given say, which is actually a huge loss of human knowledge.
But then even after the recorder was he invented most of the voice data was not even captured, and for the first few percentage of data that was captured, was actually very hard to search, very hard to access.
Because, right, you have to rewind, fast forward multiple tries to find the part you're interested in. So all those combined motivated us to build something that can help with this problem, you know capture human knowledge. Make information searchable.
So then we look at the technology available at that time, we look at the Google API Microsoft API and a few other API's, we tested it with human conversations and we found that the system at that time actually worked pretty badly on accuracy was low. It couldn't understand my accent it….
It was traced in the complexity of English language, right, it's just so many variations of saying the same word or something that words have similar pronunciations it's an How do you handle those so that we thought that wow, you know, people have been working on speed tracking for so long, you mentioned Kai Fuli. [Usually spelled in English as Kai-fu Lee. But this is an arbitrary choice of English spelling; and, more broadly, this and other “errors” in the transcript are fewer than I would make in the standard typing session.].
He, you know, was one of the pioneers of speech recognition when he was, was doing his PhD exam here. However, the problem was so hard after so many years, it's still not doing very well, especially for conversations.
You may say: Sure, Otter does a good job — of recognizing the voice of its own CEO! Fair point. It’s also true that its renderings of my questions are just as accurate as for Liang’s replies. (I’m not including my questions because Otter too faithfully rendered my meanders and run-ons.) Also: anyone listening to Liang would know that he has spent his professional life working in English but that he did not grow up as a native speaker. The several people on the call were speaking different-sounding versions of English, and the system rendered them all.
I’m writing this item because what I’ve learned might be useful to others. In a future installment I’ll say more about what I learned in my talk with Liang.
Check it out.
“Interesting Software” links
Here are a few, from the archives: