From the course: Azure AI Services Essential Training (2023)

Implement speech to text

- [Instructor] In this demo, I'm going to show you one of the capabilities of the Speech service in Cognitive Services. You can see over here that I have a file called audio.wav. I'm going to play this for you. Ready? Hello, this is a test of the speech recognition API. Okay, so I said, "Hello, this is a test of the speech recognition API," and I recorded that in my own voice. Let me play it one more time. - Hello, this is a test of the speech recognition API. All right, so I've already included that audio.wav file in my code, and that is what I want to convert into text using the Speech service.
Now, before we look at the code, let's look at the Speech service first. Just like before, I have created an instance of the Cognitive Services Speech service in a resource group called COG Services in the East US location. The thing we are interested in over here is, again, the keys and endpoint portion. I've taken these keys and already pasted them into my code, so grab these, and also grab the location, because you're going to need that in a second. Now let's examine our code. The things that I grabbed from the Azure portal, I have put over here. These are the keys, and the location I have put over here, as you can see. Let me scroll to the right so you can see this full URL.
Okay, now let's dive into index.js, where the actual code lives. Again, this code is really simple. Really, all of Cognitive Services is a matter of making a REST call and getting the results. For some scenarios they've also created SDKs, but underneath, the SDKs are just making REST calls. So let's just look at the REST calls. Again, this code is written in Node.js, but feel free to follow along in any platform of your choice. Let's start at the very bottom. I'm going to call a method called convert speech to text, which is at the top, and which relies on a method called Read Bytes.
What Read Bytes does is accept a file path, in my case audio.wav, the file that I played for you a moment ago, and return a byte array representation of that file. Now let's go to the top and look at the convert speech to text method. First, I just want to show you in my file structure that I have an audio.wav file already. Okay, perfect.
Now let's look at convert speech to text. In convert speech to text, I am creating an object called request options. Request options is sent to request.post, and it allows me to specify the headers and the body. Now, the authentication method that most Cognitive Services accept is this Ocp-Apim-Subscription-Key header, and I have to specify a content type as well. A lot of Cognitive Services also work with Azure Active Directory authentication. I have other courses on LinkedIn Learning where I show you how to acquire an access token from Azure Active Directory. In this case, we'll keep it simple and use this key. Once I have that key and I have created the endpoint, all I need to do is post this object over to Cognitive Services, and on line 18, I should get an output.
So all that's left to do is run this code example. To run it, I'm going to hit F5 and choose Node.js. I have already done an npm install, and it looks like I got some results back. If we hover over this, this is what it looks like. Let's look at this in the debug console output. I'm going to hit F10 to get this output here, and I'm going to copy this and paste it over here so we don't lose it and can examine it. Stop debugging. Now let's give ourselves some space over here. Let me format this, and this is what it looks like.
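For readers following along without the video, the formatted output has roughly the shape sketched here. This is a minimal sketch, assuming the short-audio REST endpoint's default "simple" output format with its RecognitionStatus, DisplayText, Offset, and Duration fields. The Offset and Duration numbers are placeholders rather than values from the demo, and extractText is a hypothetical helper name.

```javascript
// Illustrative response body; shape assumed from the "simple" output format.
// Offset and Duration are placeholder values (100-nanosecond units), not demo output.
const sampleResponse = JSON.stringify({
  RecognitionStatus: "Success",
  DisplayText: "Hello, this is a test of the speech recognition API.",
  Offset: 1000000,
  Duration: 30000000,
});

// Hypothetical helper: return the recognized text, or throw if recognition failed.
function extractText(responseBody) {
  const result = JSON.parse(responseBody);
  if (result.RecognitionStatus !== "Success") {
    throw new Error(`Recognition failed: ${result.RecognitionStatus}`);
  }
  return result.DisplayText;
}

console.log(extractText(sampleResponse));
```

Checking RecognitionStatus before reading DisplayText matters because a request can succeed at the HTTP level and still report that no speech was recognized.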
As you can see, the text that I was speaking was, "Hello, this is a test of the speech recognition API," and it looks like Cognitive Services has done a pretty good job of converting that into typed text. There are equivalent capabilities that, say, allow you to do text to speech, for example, and all the other capabilities I've talked about in this module. So feel free to explore this API further, but really, they're all simple REST APIs.
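The REST call walked through in this demo can be sketched roughly as follows. This is not the instructor's index.js: it swaps the request.post call for Node's built-in fetch (Node 18 or later), and the function names readBytes, buildEndpoint, buildRequestOptions, and convertSpeechToText are assumptions based on the spoken names. The endpoint URL pattern, the SPEECH_KEY and SPEECH_REGION environment variables, and the 16 kHz PCM content type are likewise assumptions.

```javascript
const fs = require("node:fs");

// Assumed placeholders; the demo pastes the real key and location into the code.
const SPEECH_KEY = process.env.SPEECH_KEY || "<your-key>";
const SPEECH_REGION = process.env.SPEECH_REGION || "eastus";

// Short-audio speech-to-text endpoint for a region (URL pattern assumed).
function buildEndpoint(region, language = "en-US") {
  return `https://${region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=${language}`;
}

// Read the audio file into a Buffer, Node's byte array type.
function readBytes(filePath) {
  return fs.readFileSync(filePath);
}

// Headers and body for the POST, mirroring the request options described above.
function buildRequestOptions(key, audioBytes) {
  return {
    method: "POST",
    headers: {
      "Ocp-Apim-Subscription-Key": key,
      "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
      Accept: "application/json",
    },
    body: audioBytes,
  };
}

// Post the audio bytes and resolve with the parsed JSON result.
async function convertSpeechToText(filePath) {
  const options = buildRequestOptions(SPEECH_KEY, readBytes(filePath));
  const response = await fetch(buildEndpoint(SPEECH_REGION), options);
  if (!response.ok) {
    throw new Error(`Speech service returned HTTP ${response.status}`);
  }
  return response.json();
}
```

With a real key and region set, calling `convertSpeechToText("audio.wav").then(console.log)` would print the parsed result. Keeping the option building separate from the network call makes the headers easy to inspect without sending a request.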
