Build Speech-Enabled Apps and Services Using Azure Cognitive Services: Speech APIs

OVERVIEW
These days, a great user experience is one of the best ways to make your app stand out!
User experience is all about ease of use and intelligence, and both can be achieved by giving apps a human side, aka Artificial Intelligence.
What if you could build an app that turns a user’s speech into text, translates it in real time into a variety of languages, produces a response based on that, and speaks it out loud?
What if you could tailor your app based on your user’s identity?
This has never been as easy as it is nowadays!
In this blog post, I will walk you through the latest AI-powered Speech APIs available at your fingertips!
Speech APIs Breakdown
Speech Services
- Speech to Text
- Text to Speech
- Speech Translation
Speaker Recognition
- Speaker Identification
- Speaker Verification
Demo
The best way to showcase an API is to actually try it, so I figured, why not walk through each Speech API using the demos available through the Azure portal?
Speech Services
Whether you want to convert audio to text, convert text to audio, or translate speech in real time, Speech Services is the way to go!
Speech to Text
The Speech to Text (STT) API offers a range of capabilities.
The following demo (which you can try by clicking the previous header) shows how the API can take an audio stream and generate a complete dialogue between multiple people, identifying who is saying what in real time. This comes in handy for meetings and conferences, for example.
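To make the "who is saying what" idea concrete, here is a minimal sketch of turning speaker-labelled segments into a readable dialogue. The `Segment` type, the `format_dialogue` helper, and the speaker labels are all hypothetical names made up for illustration; the real service returns its own result objects, so treat this only as a picture of the post-processing step.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical speaker label, e.g. "Guest-1"
    text: str     # text recognized for this chunk of audio


def format_dialogue(segments):
    """Merge consecutive segments from the same speaker, one line per turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            # Same speaker kept talking: extend the current turn.
            turns[-1] = Segment(seg.speaker, turns[-1].text + " " + seg.text)
        else:
            # A different speaker started: open a new turn.
            turns.append(Segment(seg.speaker, seg.text))
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)


meeting = [
    Segment("Guest-1", "Good morning."),
    Segment("Guest-1", "Shall we start?"),
    Segment("Guest-2", "Yes, go ahead."),
]
dialogue = format_dialogue(meeting)
print(dialogue)
```

Merging consecutive segments keeps the transcript readable when the recognizer splits one speaker's turn into several short results.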

The following transcription (which you can try by clicking the previous header) shows a demo of me quoting Tony Stark, aka Iron Man, replying to Steve Rogers, aka Captain America, with “Genius, billionaire, playboy, philanthropist.” in The Avengers (2012).
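If you would rather do the same transcription from code than from the portal demo, a rough sketch using the Speech SDK for Python (the `azure-cognitiveservices-speech` package) could look like the following. The environment variable names `SPEECH_KEY` and `SPEECH_REGION`, the helper names, and the file name in the usage note are my own assumptions, not something the demo prescribes.

```python
import os


def load_credentials(env=os.environ):
    """Read the key and region from the environment (variable names are assumptions)."""
    key = env.get("SPEECH_KEY")
    region = env.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION first")
    return key, region


def transcribe_file(path, key, region):
    """Recognize a single utterance from a WAV file and return its text (or None)."""
    # Imported lazily so the helper above works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    audio_config = speechsdk.audio.AudioConfig(filename=path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )
    result = recognizer.recognize_once()  # stops after the first utterance
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None  # no speech matched, or the request was canceled
```

With a short recording saved as, say, `quote.wav`, calling `transcribe_file("quote.wav", *load_credentials())` should return the recognized sentence. Note that `recognize_once` only handles a single utterance, so longer recordings need continuous recognition instead.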

The Speech to Text API also supports customization to increase transcription accuracy when the baseline model isn’t enough for a given scenario, as the following demo shows (which you can try by clicking the previous header).

Text to Speech
The Text to Speech (TTS) API provides Neural TTS, Standard TTS, and customization options so you can create your own unique brand voice.
Neural TTS is pretty much what the name suggests: it uses neural networks to provide more human-like, natural prosody and clear articulation of words!
Standard TTS still sounds amazing, but it is more digital and doesn’t have that human-like feel.
You can click the previous header to try the following demos for yourself and spot the difference!
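In code, the voice is typically selected in the SSML document you send to the service. Below is a small sketch that builds such a document with the standard library only. The `build_ssml` helper is hypothetical, and the voice name `en-US-JennyNeural` is just an example I am assuming is available in your region; check the voice list for your own resource.

```python
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"  # W3C SSML namespace


def build_ssml(text, voice, lang="en-US"):
    """Wrap plain text in a minimal SSML document that selects one voice."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


# The voice name below is an assumption used for illustration.
ssml = build_ssml("Genius, billionaire, playboy & philanthropist.", "en-US-JennyNeural")
print(ssml)
```

Escaping the text matters because characters such as `&` would otherwise produce invalid XML. The resulting string can then be handed to a synthesizer, for example via the Speech SDK's `speak_ssml_async` method.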


Speaker Recognition
The Speaker Recognition API allows you to identify who is speaking in an audio sample, or to verify a person’s identity through their voice.
The Speaker Identification feature, demoed below with a recording of President Barack Obama speaking (which you can try by clicking the previous header), can be used, for example, to detect which user is using your app so you can tailor the visuals to that specific user.

The API returns its results in JSON format, so they are easy to consume from any language!
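As an illustration of how easily that JSON can be consumed, here is a sketch that maps an identification result to a user of your app. The payload shape (`identifiedProfile`, `profileId`, `score`), the profile IDs, and the `identify_user` helper are assumptions for the sake of the example; inspect the actual response from the demo before relying on any field names.

```python
import json

# Hypothetical response body; the real field names may differ.
SAMPLE_RESPONSE = """
{
  "identifiedProfile": {"profileId": "profile-001", "score": 0.92}
}
"""

# Hypothetical mapping from enrolled voice profiles to app users.
KNOWN_USERS = {"profile-001": "Barack", "profile-002": "Michelle"}


def identify_user(response_json, known_users, min_score=0.5):
    """Return the matching user name, or None if unknown or low confidence."""
    profile = json.loads(response_json).get("identifiedProfile") or {}
    if profile.get("score", 0.0) < min_score:
        return None
    return known_users.get(profile.get("profileId"))


user = identify_user(SAMPLE_RESPONSE, KNOWN_USERS)
print(user)
```

Returning `None` for low scores gives the app a clear fallback path, such as showing the default visuals instead of a personalized view.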

The Speaker Verification feature demoed below (which you can try by clicking the previous header) can be used, for example, to verify that the user trying to access a certain account is actually its owner.
You first need to enroll your voice by reading a specific phrase a few times.

Then, once your voice is enrolled, the API can easily verify your identity.
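The enroll-then-verify flow can be sketched as two small checks on the JSON responses. Everything here is an assumption made for illustration: the field names `remainingEnrollmentsCount`, `recognitionResult`, and `score`, the `"Accept"` value, and both helper functions. Compare them against the responses you actually get from the demo.

```python
import json


def enrollment_complete(response_json):
    """True once the (assumed) counter of remaining enrollment phrases hits zero."""
    data = json.loads(response_json)
    # If the field is missing, assume enrollment is not finished yet.
    return data.get("remainingEnrollmentsCount", 1) == 0


def is_verified(response_json, min_score=0.5):
    """Accept only if the service accepted the voice and the score is high enough."""
    data = json.loads(response_json)
    accepted = data.get("recognitionResult") == "Accept"
    return accepted and data.get("score", 0.0) >= min_score


# Hypothetical responses for a voice that needs one more enrollment, then passes.
print(enrollment_complete('{"remainingEnrollmentsCount": 1}'))
print(enrollment_complete('{"remainingEnrollmentsCount": 0}'))
print(is_verified('{"recognitionResult": "Accept", "score": 0.87}'))
```

Adding your own `min_score` threshold on top of the service's decision is a design choice: a stricter value trades a few more rejected owners for fewer accepted impostors.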

Resources
You can start building solutions right away using the available SDKs and the examples that explain how to use them.
Summary
This was just a quick play-around with the latest Speech APIs, and it is amazing that all of this is publicly available for everyone to innovate and build awesome solutions with ease and speed.