How to convert your voice to text data

speech to text

Nowadays, people are so busy with their work that nobody wants to spend extra time typing out text. Google Translate is a great tool that lets you translate text by voice. When we use voice as the input for translation, it relies on a technology called speech-to-text conversion.

First, we will see how it works and how speech is converted to text data.

The first step in speech recognition is obvious — we need to feed sound waves into a computer.

Everybody knows that sound is transmitted as waves, but a computer only understands numbers. So the first thing we need to do is convert the sound waves to numbers. Sound waves are one-dimensional. At every moment in time, they have a single value based on the height of the wave. Let’s zoom in on one tiny part of the sound wave and take a look:

To turn this sound wave into numbers, we just record the height of the wave at equally spaced points:

Sampling wave

This is called sampling. We take a reading thousands of times a second and record a number representing the height of the sound wave at that point in time.
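As a rough illustration of the sampling idea above, here is a minimal Python sketch. The function names `sample_wave` and `tone` are made up for this example, and the 440 Hz sine wave is only a stand-in for a real recording (a microphone and sound card would normally do this step in hardware).

```python
import math

def sample_wave(wave_fn, sample_rate, duration_s):
    """Record the height of wave_fn at equally spaced points in time."""
    n_samples = round(sample_rate * duration_s)
    return [wave_fn(i / sample_rate) for i in range(n_samples)]

# Hypothetical stand-in for a recorded sound: a pure 440 Hz tone.
def tone(t):
    return math.sin(2 * math.pi * 440 * t)

# 10 ms of audio sampled 16,000 times per second gives 160 readings.
samples = sample_wave(tone, sample_rate=16000, duration_s=0.01)
print(len(samples))
print(samples[:5])
```

Each entry in `samples` is one height reading, so a higher sample rate simply means more numbers per second of sound.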

Let's sample our “Hello” sound wave 16,000 times per second. Here are the first 100 samples:

Each number represents the amplitude of the sound wave at 1/16000th of a second intervals
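If you want to look at such numbers yourself, Python's standard `wave` module can read them from a WAV file. The sketch below is only an illustration: since no real "Hello" recording is available here, it first writes a short synthetic tone into an in-memory buffer (an assumption made for this example) and then reads the first 100 samples back.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16000  # samples per second, as in the text

# Build a short 16-bit mono WAV in memory. This 0.1 s, 440 Hz tone is a
# hypothetical stand-in for a real recording of someone saying "Hello".
tone = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE))
        for i in range(SAMPLE_RATE // 10)]
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)           # mono
    wav_out.setsampwidth(2)           # 2 bytes = 16 bits per sample
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(struct.pack(f"<{len(tone)}h", *tone))

# Read the audio back as plain integers.
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    rate = wav_in.getframerate()
    raw = wav_in.readframes(100)      # the first 100 samples
first_100 = list(struct.unpack(f"<{len(raw) // 2}h", raw))

print(rate)
print(first_100[:10])
```

With a real file you would pass its path to `wave.open` instead of the buffer. The unpacking shown here assumes 16-bit mono audio.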

Recognizing Characters from Short Sounds

Now that we have our audio in a format that’s easy to process, we will feed it into a deep neural network. The input to the neural network will be 20-millisecond audio chunks. For each little audio slice, it will try to figure out the letter that corresponds to the sound currently being spoken.
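Cutting the audio into 20-millisecond slices is simple arithmetic: at 16,000 samples per second, each slice holds 320 samples. Here is a small sketch of that step. The helper name `split_into_chunks` is invented for this example, and dropping a trailing partial chunk is just one possible choice.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_ms=20):
    """Split a list of samples into fixed-length chunks of chunk_ms each."""
    chunk_size = sample_rate * chunk_ms // 1000  # 320 samples at 16 kHz
    # Step through the audio one chunk at a time; a leftover piece that is
    # shorter than a full chunk is dropped.
    return [samples[i:i + chunk_size]
            for i in range(0, len(samples) - chunk_size + 1, chunk_size)]

one_second = [0] * 16000  # placeholder for one second of audio samples
chunks = split_into_chunks(one_second)
print(len(chunks), len(chunks[0]))  # 50 chunks of 320 samples each
```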

We’ll use a recurrent neural network, that is, a neural network that has a memory that influences future predictions. That’s because each letter it predicts should affect the likelihood of the next letter it will predict too. For example, if we have said “HEL” so far, it’s very likely we will say “LO” next to finish out the word “Hello”. It’s much less likely that we will say something unpronounceable next like “XYZ”. So having that memory of previous predictions helps the neural network make more accurate predictions going forward.

Wait a second!
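To make the "memory" idea a bit more concrete, here is a toy sketch of a single recurrent step in plain Python. Everything in it is an assumption for illustration: the weights are random and untrained, the four-letter alphabet and the three-number feature vector are made up, and a real speech model would be far larger and built with a deep learning library. The only point is that the same audio input can lead to different letter probabilities depending on what the network remembers.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

INPUT_SIZE, HIDDEN_SIZE = 3, 4
ALPHABET = "HELO"  # hypothetical tiny alphabet

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_in = random_matrix(HIDDEN_SIZE, INPUT_SIZE)       # input -> hidden
W_mem = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)     # previous hidden -> hidden
W_out = random_matrix(len(ALPHABET), HIDDEN_SIZE)   # hidden -> letter scores

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(features, hidden):
    """One time step: mix the new audio features with the remembered state."""
    mixed = [a + b for a, b in zip(matvec(W_in, features), matvec(W_mem, hidden))]
    new_hidden = [math.tanh(x) for x in mixed]
    letter_probs = softmax(matvec(W_out, new_hidden))
    return letter_probs, new_hidden

chunk = [0.2, -0.1, 0.4]       # made-up features for one audio slice
fresh = [0.0] * HIDDEN_SIZE    # empty memory at the start

probs_first, memory = rnn_step(chunk, fresh)
probs_second, _ = rnn_step(chunk, memory)  # same input, but with memory

print(dict(zip(ALPHABET, probs_first)))
print(dict(zip(ALPHABET, probs_second)))
```

The two printed distributions differ even though `chunk` is identical in both calls, because the second call also sees the hidden state left behind by the first.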

You might be thinking “But what if someone says ‘Hullo’? It’s a valid word. Maybe ‘Hello’ is the wrong transcription!”

Try it out! If your phone is set to American English, try to get your phone’s digital assistant to recognize the word “Hullo.” You can’t! It refuses! It will always understand it as “Hello.”

Not recognizing “Hullo” is reasonable behavior, but sometimes you’ll find annoying cases where your phone just refuses to understand something valid you are saying. That’s why these speech recognition models are always being retrained with more data to fix these edge cases.

the flow of speech to text converter

For a company like Google or Amazon, hundreds of thousands of hours of spoken audio recorded in real-life situations is gold. That’s the single biggest thing that separates their world-class speech recognition system from your hobby system. The whole point of putting Google Now!

So if you are looking for a start-up idea, I wouldn’t recommend trying to build your own speech recognition system to compete with the Google Cloud Speech-to-Text API. Instead, figure out a way to get people to give you recordings of themselves talking for hours, and try those recordings with DeepSpeech. The data can be your product instead.

In the next post, I will write about how to use the Google Cloud Speech API to convert speech to text.

Happy learning!


