One Voice, One Internet!
What first comes to mind when you hear “Text-to-Speech” or “Speech-to-Text”? We have a multitude of digital assistants for such tasks, making our daily lives easier and hands-free. And for many of us, it is sometimes a frustrating experience when the AI fails to recognize our voice (or accent) correctly. WHY does such a bummer occur? WHY does the AI muddle up our voice? So, let’s dig into what happens behind the scenes and how this amazing tech works…
How does a Speech-to-Text system work?
We don’t always say words in the same way. Our voices rise and fall in pitch and volume to convey different meanings or emotions. Sometimes we talk quickly, sometimes slowly. If you’re as logical as a computer and you have to scrutinize something as variable as the human voice, things can get pretty convoluted. So, voice recognition programs use a variety of statistical, pattern recognition techniques to help them. The more sophisticated programs are “taught” about the grammatical structure of the language, so they know which sorts of sentences make sense. Most of them have built-in dictionaries and know which words tend to follow one another. For example, if you say “thank” and then a short word that sounds like “dew”, the computer can guess that you mean “you” because “thank” and “you” often go together. Some programs also learn to recognize words you say often, so they can guess that you mean “Constantinople” and not “Can’t stand the noble”, lol, yeah sometimes it can be that naive.

Computer programs can pull off these neat tricks using mathematical “guessing processes” called Hidden Markov Models. A more sophisticated approach is to use a neural network, a computer program that learns to recognize recurring patterns in a broadly similar way to the human brain.
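To make the “thank you” trick concrete, here’s a minimal sketch of how a word-pair (bigram) model picks the most likely next word among acoustically similar candidates. The counts here are made up purely for illustration; real systems learn them from huge text corpora.

```python
# Toy bigram "language model": given the previous word and a few
# acoustically plausible candidates, pick the one that most often
# follows it. Counts are invented for illustration only.
bigram_counts = {
    ("thank", "you"): 9500,
    ("thank", "dew"): 3,
    ("thank", "ewe"): 1,
}

def best_guess(prev_word, candidates):
    # Choose the candidate with the highest count after prev_word;
    # unseen pairs default to a count of zero.
    return max(candidates, key=lambda w: bigram_counts.get((prev_word, w), 0))

print(best_guess("thank", ["dew", "you", "ewe"]))  # you
```

A real recognizer weighs these language-model scores against acoustic scores (how well each candidate matches the sound), which is essentially what a Hidden Markov Model combines.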
Typically, most STT systems record audio on the local device and send it to the cloud, where machine learning models process the audio and generate the text, which is then sent back to the local device for use in tasks ranging from querying Google Assistant to playing your favourite MKBHD, Darknet Diaries, Unbox Therapy or The Verge episode. STT APIs are provided by Google Cloud, Azure, AWS, IBM Watson, Speechmatics and others.
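As a sketch of that round trip, here’s roughly what the request a device sends to a cloud STT service looks like. The payload shape below follows Google Cloud Speech-to-Text’s REST API (`speech:recognize`); the audio bytes are a stand-in for a real recording, and we only build the request body rather than actually sending it.

```python
import base64
import json

def build_stt_request(audio_bytes, language="en-US", sample_rate=16000):
    # The cloud service needs to know how the audio is encoded,
    # its sample rate, and which language to transcribe.
    return json.dumps({
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": sample_rate,
            "languageCode": language,
        },
        "audio": {
            # Raw audio bytes are base64-encoded so they can travel inside JSON.
            "content": base64.b64encode(audio_bytes).decode("ascii"),
        },
    })

# Stand-in for one second of 16 kHz, 16-bit audio from the microphone.
body = build_stt_request(b"\x00\x01" * 16000)
# A real client would POST this body to the recognize endpoint with an
# API key, then read the transcript out of the JSON response.
```

The response carries the transcript (and usually alternatives with confidence scores), which the device then hands to whatever app asked for it.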
About Mozilla’s Common Voice Initiative:
Voice is natural, voice is human. That’s why Mozilla is excited about creating usable voice technology for machines. But to create voice systems, developers need an extremely large amount of voice data.
Most of the data used by large companies is unavailable to most people, and that stifles innovation. So Mozilla has launched Common Voice, a project to help make voice recognition open and accessible to everyone.
Now everyone can donate their voice to help Mozilla build an open-source voice database that anyone can use to make innovative apps for devices and the web. Read a sentence to help machines learn how real people speak. Check the work of other contributors to improve the quality. It’s that simple!
But why Common Voice?
- Privacy
How the machine learning models work is out of scope for this article, so we won’t cover it here. But you should know that the audio sent from your device to the cloud stays there forever, where it is used to make the STT system more accurate.
Most users don’t care about this since they’re getting free services, but there have been a number of documented cases where STT systems were eavesdropping and recording audio even when not in use. If you’re a company that is vigilant about your customers’ privacy, you’d want to avoid that.
- Internet Connectivity
Even though we’re living in an era where 5G technology is knocking at our doors, 52% of the world’s population still doesn’t have access to the internet, mainly due to a lack of infrastructure. So, there should be a way to use cutting-edge technology offline as well.
- Humongous collection of different languages
If language is an impediment for you, don’t worry, Mozilla has got you covered. You can contribute in any language from the long list of supported languages. If you can’t find the language you want to contribute in, you can even request it, and Mozilla will try to include it in their catalogue.
There’s a roadblock, though: training a Speech-to-Text engine requires a huge amount of voice data, along with its text annotation. Even if we’re just talking about English, there are 160 distinct dialects of English spoken throughout the world. That’s where Common Voice comes into the picture: anyone can donate their voice or even just validate short audio clips (1–3 seconds). To make it accessible to everyone, Mozilla has launched 35 languages on the Common Voice portal as of now.
How can you contribute?
- Create an account at https://voice.mozilla.org
- You should see two options in the middle of the screen: Speak and Listen.
Speak and Donate your Voice:
Step up and speak in whatever dialects and accents you can. All you need to do is hit that microphone button and start speaking. It’s as simple as that!
Listen, be Mozilla’s ears and help them validate the voices:
Well, if not speaking, you can contribute by listening to other contributions and validating whether the recorded voice matches the sentence and sounds as it is written. It couldn’t get any simpler than this…
Sounds fun, right? So, what are you waiting for? Start contributing today.
Ahoy mate, congrats on making your contribution!
You just contributed to open source; pat yourself on the back. All the recorded voice data is publicly available for free. If you’re a developer who’d like to make use of this data, just go to this link and you can download the voice data for your own speech recognition projects. Don’t forget to share your work with the community 😉.
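If you do grab a Common Voice release, the clips come with TSV metadata files (such as `validated.tsv`) that map each audio file to its sentence and vote counts. Here’s a minimal sketch of filtering clips by community up-votes; it parses an in-memory sample with the same layout instead of a real download, and the exact column set in a given release may differ.

```python
import csv
import io

# Sample rows in the Common Voice TSV layout: one clip per line,
# tab-separated, with the sentence and validation votes.
sample_tsv = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "clip_0001.mp3\tVoice is natural, voice is human.\t3\t0\n"
    "clip_0002.mp3\tRead a sentence to help machines learn.\t2\t1\n"
    "clip_0003.mp3\tDonate your voice today.\t1\t2\n"
)

def load_clips(tsv_text, min_up_votes=2):
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Keep only clips that enough reviewers have validated.
    return [(row["path"], row["sentence"])
            for row in reader
            if int(row["up_votes"]) >= min_up_votes]

for path, sentence in load_clips(sample_tsv):
    print(path, "->", sentence)
```

For a real release you’d point `csv.DictReader` at the downloaded TSV file and pair each `path` with its audio clip when feeding a training pipeline.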
A glimpse at a Common Voice Sprint that we Mozillians did while pulling an all-nighter.
So, we, a small but amazing group of members of the Mozilla Gujarat Community who call ourselves “The MozFam”, did a Common Voice Sprint at a night-in, because that’s how we roll!
We decided that each of us would make a total of 400 contributions: 200 each in the Speaking and Listening sections. And hell yeah, it was amazeballs.
And here are a few memes I made to show how the Common Voice Sprint got us like:
Hey there fellas, I’m Pranit Brahmbhatt, and I am “One in a Mozillian”.
So, that was all for this time. Catch y’all later. Ciao Ciao!