Azure AI Speech
My Journey of Building a Voice Bot from Scratch
My Journey in Building a Voice Bot for Production

The world of artificial intelligence is buzzing with innovations, and one of its most captivating branches is the development of voice bots. These digital entities have the power to transform user interactions, making them more natural and intuitive. In this blog post, I want to take you on a journey through my experience of building a voice bot from scratch using Azure's cutting-edge technologies: OpenAI GPT-4o-Realtime, Azure Text-to-Speech (TTS), and Speech-to-Text (STT).

Key Features for Building an Effective Voice Bot

- Natural Interaction: A voice agent's ability to converse naturally is paramount. The goal is to create interactions that mirror human conversation, avoiding robotic or scripted responses. This naturalism fosters user comfort, leading to a more seamless, engaging experience.
- Context Awareness: True sophistication in a voice agent comes from its ability to understand context and retain information. This capability allows it to provide tailored responses and actions based on user history, preferences, and specific queries.
- Multi-Language Support: One of the significant hurdles in developing a comprehensive voice agent is the need for multi-language support. As brands cater to diverse markets, ensuring clear and contextually accurate communication across languages is vital.
- Real-Time Processing: The real-time capabilities of voice agents allow for immediate responses, enhancing the customer experience. This feature is crucial for tasks like booking, purchasing, and inquiries where time sensitivity matters.

Furthermore, the opportunities are immense. When implemented successfully, a robust voice agent can revolutionize customer engagement. Consider a scenario where a business uses an AI-driven voice agent to reach out to potential customers in a marketing campaign. This approach can greatly enhance efficiency, allowing the business to manage high volumes of prospects and providing a vastly improved return on investment compared to traditional methods.

Before diving into the technicalities, it's crucial to have a clear vision of what you want to achieve with your voice bot. For me, the goal was to create a bot that could engage users in seamless conversations, understand their needs, and provide timely responses. I envisioned a bot that could be integrated into various platforms, offering flexibility and adaptability.

Azure provides a robust suite of tools for AI development, and choosing it was an easy decision due to its comprehensive offerings and strong integration capabilities. Here's how I began:

- Text-to-Speech (TTS): This service converts the bot's text responses into human-like speech. Azure TTS offers a range of customizable voices, allowing me to choose one that matched the bot's personality.
- Speech-to-Text (STT): To understand user inputs, the bot needs to convert spoken language into text. Azure STT was instrumental in achieving this, providing real-time transcription with high accuracy.
- Foundational Model: A large language model (LLM) that powers the bot's understanding of language and generation of text responses. One example of a foundational model is GPT-4, a powerful LLM developed by OpenAI that can generate human-quality text, translate languages, write different kinds of creative content, and answer questions in an informative way.
- Foundation Speech-to-Speech Model: A model that directly works on speech from one language to another, without the need for text as an intermediate step. Such a model could be used for real-time translation or for generating speech in a language different from the input language.
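To make the first approach (STT + LLM + TTS) concrete, here is a minimal sketch of a turn-based loop. This is an illustrative example only, not code from the author's repository: it assumes the azure-cognitiveservices-speech and openai Python packages, and the keys, region, voice name, API version, and deployment name are placeholders you would replace with your own.

import os
import azure.cognitiveservices.speech as speechsdk
from openai import AzureOpenAI

# Placeholder credentials; set these to your own Speech and Azure OpenAI resources.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # any prebuilt neural voice

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)    # microphone is the default input
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)  # speaker is the default output

llm = AzureOpenAI(
    api_key=os.environ["AOAI_KEY"],
    azure_endpoint=os.environ["AOAI_ENDPOINT"],
    api_version="2024-06-01",
)

history = [{"role": "system", "content": "You are a concise, friendly voice assistant."}]

while True:
    # 1) Speech-to-Text: capture one utterance from the microphone
    result = recognizer.recognize_once()
    if result.reason != speechsdk.ResultReason.RecognizedSpeech:
        continue
    history.append({"role": "user", "content": result.text})

    # 2) Foundational model: generate the bot's reply
    reply = llm.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3) Text-to-Speech: speak the reply back to the user
    synthesizer.speak_text_async(reply).get()

This turn-by-turn structure is exactly what the Duplex-style architecture described next automates and overlaps (listening while speaking), whereas the GPT-4o-Realtime approach collapses steps 1–3 into a single speech-in, speech-out model.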
As voice technology continues to evolve, different types of voice bots have emerged to cater to varying user needs. In this analysis, we will explore three prominent types: Voice Bot Duplex, GPT-4o-Realtime, and GPT-4o-Realtime + TTS. This detailed comparison covers their architecture, strengths, weaknesses, best practices, challenges, and potential opportunities for implementation.

Type 1: Voice Bot Duplex

A Duplex bot is an advanced AI system that conducts phone conversations and completes tasks using Voice Activity Detection (VAD), Speech-to-Text (STT), a Large Language Model (LLM), and Text-to-Speech (TTS). Azure's automatic speech recognition (ASR) technology turns spoken language into text. This text is analysed by an LLM to generate responses, which are then converted back to speech by Azure Text-to-Speech (TTS). A Duplex bot can listen and respond simultaneously, improving interaction fluidity and reducing response time. This integration enables Duplex to autonomously manage tasks like booking appointments with minimal human intervention.

- Strengths: Low operational cost. Suitable for straightforward use cases with basic conversational requirements. Easily customizable on both the STT and TTS side.
- Weaknesses: Complex architecture with multiple processing hops, making it difficult to implement. Higher latency compared to advanced models, limiting real-time capabilities. Limited ability to perform complex actions or maintain context over longer conversations. Does not capture human emotion from the speech. Switching between languages is difficult during the conversation; you have to choose the language beforehand for better output.

Type 2: GPT-4o-Realtime

GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model: a model that takes speech directly as input and generates speech as output, without the need for text as an intermediate step. The architecture is very simple: the speech byte array goes directly to the foundational speech model, which processes it, reasons, and responds with speech as a byte array.

- Strengths: Simplest architecture with no processing hops, making it easier to implement. Low latency and high reliability. Suitable for use cases with complex conversational requirements. Switching between languages is very easy. Captures the emotion of the user.
- Weaknesses: High operational cost. You cannot customize the synthesized voice. You cannot add business-specific abbreviations for the model to handle separately. It hallucinates a lot during number input; if you say 123456, the model sometimes hears 123435. Support for different languages may be an issue, as there is no official documentation of language-specific support.

Type 3: GPT-4o-Realtime + TTS

As noted above, GPT-4o-Realtime-based voice bots are the simplest to implement because they use a foundational speech model that takes speech directly as input and returns speech as output, with no text as an intermediate step and no extra processing hops.
But if you want to customize the speech synthesis, there are no fine-tuning options available for the foundational speech model. Hence, we came up with an option where we plugged GPT-4o-Realtime into Azure TTS, which gives us advanced voice options such as the built-in neural voices covering a range of Indic languages, and also lets you fine-tune a custom neural voice (CNV).

Custom neural voice (CNV) is a text to speech feature that lets you create a one-of-a-kind, customized, synthetic voice for your applications. With custom neural voice, you can build a highly natural-sounding voice for your brand or characters by providing human speech samples as training data. Out of the box, text to speech can be used with prebuilt neural voices for each supported language. The prebuilt neural voices work well in most text to speech scenarios if a unique voice isn't required. Custom neural voice is based on the neural text to speech technology and the multilingual, multi-speaker, universal model. You can create synthetic voices that are rich in speaking styles, or adaptable across languages. The realistic and natural-sounding voice of custom neural voice can represent brands, personify machines, and allow users to interact with applications conversationally. See the supported languages for custom neural voice.

- Strengths: Simple architecture with only one processing hop, making it easier to implement. Low latency and high reliability. Suitable for use cases with complex conversational requirements and a customized voice. Switching between languages is very easy. Captures the emotion of the user.
- Weaknesses: High operational cost, although still lower than GPT-4o-Realtime alone. You cannot add business-specific abbreviations for the model to handle separately. It hallucinates a lot during number input; if you say 123456, the model sometimes hears 123435. Does not support custom phrases.

Conclusion

Building a voice bot is an exciting yet challenging journey. As we've seen, leveraging Azure's advanced tools like GPT-4o-Realtime, Text-to-Speech, and Speech-to-Text can provide the foundation for creating a voice bot that understands, engages, and responds with human-like fluency. Throughout this journey, key aspects like natural interaction, context awareness, multi-language support, and real-time processing were vital in ensuring the bot's effectiveness across various scenarios.

While each voice bot model, from Voice Bot Duplex to GPT-4o-Realtime and GPT-4o-Realtime + TTS, offers its strengths and weaknesses, they all highlight the importance of carefully considering the specific needs of the application. Whether aiming for simple conversations or more sophisticated interactions, the choice of model will directly impact the bot's performance, cost, and overall user satisfaction.

Looking ahead, the potential for AI-driven voice bots is immense. With ongoing advancements in AI, voice bots are bound to become even more integrated into our daily lives, transforming the way we interact with technology. As this field continues to evolve, the combination of innovative tools and strategic thinking will be key to developing voice bots that not only meet but exceed user expectations.

My Previous Blog: From Zero to Hero: Building Your First Voice Bot with GPT-4o Real-Time API using Python
GitHub Link: https://github.com/monuminu/rag-voice-bot

Azure AI voices in Arabic improved pronunciation
This blog introduces our work on improving Arabic TTS (Text to Speech) pronunciation with Azure AI Speech. A key component in Arabic TTS is the diacritic model, which represents a challenging task. In written Arabic, diacritics, which indicate vowel sounds, are typically omitted. The diacritic task involves predicting the diacritic for each Arabic character in the written form. We enhanced diacritic prediction by utilizing a base model pre-trained with Machine Translation and other NLP tasks, then fine-tuning it on a comprehensive diacritics corpus. This approach reduced word-level pronunciation errors by 78%. Additionally, we improved the reading of English words in Arabic texts: English words transcribed using the Arabic alphabet can now be read as standard English.

Pronunciation improvement

Below are sample scripts demonstrating the diacritic improvement (baseline vs. improved audio) on the Microsoft Ar-SA HamedNeural voice. Other Ar-SA and Ar-EG voices also benefit, and this improvement is now online for all Ar-SA and Ar-EG voices.

- Proper noun: الهيئة الوطنية للامن الالكتروني نيسا
- Short sentence: ويحتل بطل أفريقيا المركز الثالث في المجموعة، وسيلتقي مع الإكوادور في آخر مبارياته بالمجموعة، يوم الثلاثاء المقبل.
- Long sentence: العالم كله أدرك أن التغيرات المناخية قريبة وأضاف خلال مداخلة هاتفية لبرنامج “في المساء مع قصواء”، مع الإعلامية قصواء الخلالي، والمذاع عبر فضائية CBC، أن العالم كله أدرك أن التغيرات المناخية قريبة من كل فرد على وجه الكرة الأرضية، مشيرًا إلى أن مصر تستغل الزخم الموجود حاليا، وبخاصة أنها تستضيف قمة المناخ COP 27 في شرم الشيخ بنوفمبر المقبل.

Our service was compared with two other popular services (referred to as Company A and Company B) using 400 general scripts, measuring word-level pronunciation accuracy. The results indicate that HamedNeural outperforms Company A by 1.49% and Company B by 3.88%. Below are some of the sample scripts (compared across Azure Ar-SA HamedNeural, Company A, and Company B) that show the differences.

- أوتوفيستر: أتذكر العديد من لاعبي نادي الزمالك وتابع: "بالتأكيد أتذكر العديد من لاعبي نادي الزمالك في ذلك التوقيت، عبد الواحد السيد، وبشير التابعي، وحسام حسن، وشقيقه إبراهيم، حازم إمام، ميدو، ومباراة الإسماعيلي التي شهدت 7 أهداف".
- ويشار إلى أن جرعات اللقاح وأعداد السكان الذين يتم تطعيمهم هي تقديرات تعتمد على نوع اللقاح الذي تعطيه الدولة، أي ما إذا كان من جرعة واحدة أو جرعتين.
- وتتكامل هذه الخطوة مع جهود إدارة البورصة المستمرة لرفع مستويات وعي ومعرفة المجتمع المصري، وخاصة فئة الشباب منهم، بأساسيات الاستثمار والادخار من خلال سوق الأوراق المالية، وذلك بالتوازي مع جهود تعريف الكيانات الاقتصادية العاملة بمختلف القطاعات الإنتاجية بإجراءات رحلة القيد والطرح والتداول بسوق الأوراق المالية، وذلك للوصول إلى التمويل اللازم للتوسع والنمو ومن ثم التشغيل وزيادة الإنتاجية، ذات مستهدفات خطط الحكومة المصرية التنموية.

English word reading

The samples below demonstrate the enhancement in reading English words (transcribed using the Arabic alphabet) with the Microsoft Ar-SA HamedNeural voice. This feature will be available online soon.

- ذهبتُ إلى كوفي شوب مع أصدقائي لتناول القهوة والتحدث.
- اشترى أخي هاتفًا جديدًا من هواوي تك لأنه يحتوي على ميزات متقدمة.

Get started

Microsoft offers over 600 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice.
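If you want to try one of these voices in your own application, the snippet below is a minimal, illustrative sketch (not taken from this post). It assumes the azure-cognitiveservices-speech Python package and placeholder key/region environment variables; the ar-SA-HamedNeural voice short name is an assumption you should confirm in the voice gallery.

import os
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials for your Azure AI Speech resource.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
# Assumed voice short name; verify the exact name in the Speech voice gallery.
speech_config.speech_synthesis_voice_name = "ar-SA-HamedNeural"

# Write the synthesized speech to a local WAV file instead of the default speaker.
audio_config = speechsdk.audio.AudioOutputConfig(filename="hamed_sample.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=audio_config)

result = synthesizer.speak_text_async("مرحباً! كيف يمكنني مساعدتك اليوم؟").get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Saved synthesized Arabic audio to hamed_sample.wav")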
With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information
- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback

Inuktitut: A Milestone in Indigenous Language Preservation and Revitalization via Technology
Project Overview

The Power of Indigenous Languages

Inuktitut, an official language of Nunavut and a cornerstone of Inuit identity, is now at the forefront of technological innovation. This project demonstrates the resilience and adaptability of Indigenous languages in the digital age. By integrating Inuktitut into modern technology, we affirm its relevance and vitality in contemporary Canadian society.

Collaboration with Inuit Communities

Central to this project is the partnership between the Government of Nunavut and Microsoft. This collaboration exemplifies the importance of Indigenous leadership in technological advancements. The Government of Nunavut, representing Inuit interests, has been instrumental in guiding this project to ensure it authentically serves the Inuit community.

Inuktitut by the Numbers

Inuktitut is the language of many Inuit communities – foundational to their way of life. Approximately 24,000 Inuit speak Inuktitut, with 80% using it as their primary language. The 2016 Canadian census reported around 37,570 individuals identifying Inuktitut as their mother tongue, highlighting its significance in Canada's linguistic landscape.

New Features Honoring Inuktitut

We're excited to introduce two neural voices, "SiqiniqNeural" and "TaqqiqNeural," supporting both Roman and Syllabic orthography. These voices, developed with careful consideration of Inuktitut's unique sounds and rhythms, are now available across various Microsoft applications (Microsoft Translator app, Bing Translator, Clipchamp, Edge Read Aloud, and more to come). You can also integrate these voices into your own application through Azure AI Speech services. You can listen to these voices in the samples below:

- iu-Cans-CA-SiqiniqNeural / iu-Latn-CA-SiqiniqNeural: ᑕᐃᒫᒃ ᐅᒥᐊᓪᓘᓐᓃᑦ ᑲᓅᓪᓘᓐᓃᑦ, ᐊᖁᐊᓂ ᑕᕝᕙᓂ ᐊᐅᓚᐅᑏᑦ ᐊᑕᖃᑦᑕᕐᖓᑕ,ᖃᐅᔨᒪᔭᐃᓐᓇᕆᒐᔅᓯᐅᒃ. (Taimaak umialluunniit kanuulluunniit, aquani tavvani aulautiit ataqattarngata, qaujimajainnarigassiuk.) English translation: The boat or the canoes, the outboard motors, are attached to the motors.
- iu-Cans-CA-TaqqiqNeural / iu-Latn-CA-TaqqiqNeural: ᑐᓴᐅᒪᔭᑐᖃᕆᓪᓗᒋᑦ ᓇᓄᐃᑦ ᐃᓄᑦᑎᑐᒡᒎᖅ ᐃᓱᒪᓖᑦ ᐅᑉᐱᓕᕆᐊᒃᑲᓐᓂᓚᐅᖅᓯᒪᕗᖓ ᑕᐃᔅᓱᒪᓂ. (Tusaumajatuqarillugit nanuit inuttitugguuq isumaliit uppiliriakkannilauqsimavunga taissumani.) English translation: I have heard that the polar bears have Inuit ideas and I re-believed in them at that time.

Preserving Language Through Technology

The Government of Nunavut has generously shared an invaluable collection of linguistic data, forming the foundation of our text-to-speech models. This rich repository includes 11,300 audio files from multiple speakers, totaling approximately 13 hours of content. These recordings capture a diverse range of Inuktitut expression, from the Bible to traditional stories, and even some contemporary novels written by Inuktitut speakers.

Looking Forward

This project is more than a technological advancement; it's a step towards digital Reconciliation. By ensuring Inuktitut's presence in the digital realm, we're supporting the language's vitality and accessibility for future generations of Inuit.

Global Indigenous Language Revitalization

The groundbreaking work with Inuktitut has paved the way for a broader, global initiative to support Indigenous languages worldwide. This expansion reflects Microsoft's commitment to Reconciliation and puts us on the path as a leader in combining traditional knowledge with cutting-edge technology.
While efforts began here in Canada with Inuktitut, Microsoft recognizes the global need for Indigenous language revitalization. We're now working with more Indigenous communities across the world, from Māori in New Zealand to Cherokee in North America, always guided by the principle of Indigenous-led collaboration that was fundamental to the success of the Inuktitut project.

Our aim is to co-create AI tools that not only translate languages but truly capture the essence of each Indigenous culture. This means working closely with elders, language keepers, and community leaders to ensure our technology respects and accurately reflects the unique linguistic features, cultural contexts, and traditional knowledge systems of each language. These AI tools are designed to empower Indigenous communities in their own language revitalization efforts. From interactive language learning apps to advanced text-to-speech systems, we're providing technological support that complements grassroots language programs and traditional teaching methods.

Conclusion

We are particularly proud to celebrate this milestone in Indigenous language revitalization in partnership with the Government of Nunavut. This project stands as a testament to what can be achieved when Indigenous knowledge and modern technology come together in a spirit of true partnership and respect, fostering the continued growth and use of Indigenous languages. Find more information about the project in the video below.

Press release from Government of Nunavut: Language Preservation and Promotion Through Technology: MS Translator Project | Government of Nunavut

Get started

In our ongoing quest to enhance multilingual capabilities in text-to-speech (TTS) technology, our goal is to bring the best voices to our product. Our voices are designed to be incredibly adaptive, seamlessly switching languages based on the text input. They deliver natural-sounding speech with precise pronunciation and prosody, making them invaluable for applications such as language learning, travel guidance, and international business communication.

Microsoft offers over 500 neural voices covering more than 140 languages and locales. These TTS voices can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots, providing a richer conversational experience for users. Additionally, with the Custom Neural Voice capability, businesses can easily create a unique brand voice. With these advancements, we continue to push the boundaries of what is possible in TTS technology, ensuring that our users have access to the most versatile and high-quality voices available.

For more information
- Try our demo to listen to existing neural voices
- Add Text-to-Speech to your apps today
- Apply for access to Custom Neural Voice
- Join Discord to collaborate and share feedback

Boost Your Holiday Spirit with Azure AI
🎄✨ Boost Your Holiday Spirit with Azure AI! ✨🎄

As we gear up for the holiday season, what better way to bring innovation to your business than by using cutting-edge Azure AI technologies? From personalized customer experiences to festive-themed data insights, here's how Azure AI can help elevate your holiday initiatives:

🎅 1. Azure OpenAI Service for Creative Content
Kickstart the holiday cheer by using Azure OpenAI to create engaging holiday content. From personalized greeting messages to festive social media posts, the GPT models can assist you in generating creative text in a snap.
🎨 Step-by-step: Use GPT to draft festive email newsletters, promotions, or customer-facing messages. Train models on your specific brand voice for customized holiday greetings.

🎁 2. Azure AI Services for Image Recognition and Generation
Enhance your holiday product offerings by leveraging image recognition to identify and categorize holiday-themed products. Additionally, create stunning holiday-themed visuals with DALL-E: generate unique images from text descriptions to make your holiday marketing materials stand out.
📸 Step-by-step: Use Azure Computer Vision to analyze product images and automatically categorize seasonal items. Implement the AI model in e-commerce platforms to help customers find holiday-specific products faster. Use DALL-E to generate holiday-themed images based on your descriptions. Customize and refine the images to fit your brand's style. Incorporate these visuals into your marketing campaigns.

✨ 3. Azure AI Speech Services for Holiday Customer Interaction and Audio Generation
Transform your customer service experience with Azure's Speech-to-Text and Text-to-Speech services. You can create festive voice assistants or add holiday-themed voices to your customer support lines for a warm, personalized experience. Additionally, add a festive touch to your audio content with Azure OpenAI. Use models like Whisper for high-quality speech-to-text and text-to-speech conversions, perfect for creating holiday-themed audio messages and voice assistants.
🎙️ Step-by-step: Use Speech-to-Text to transcribe customer feedback or support requests in real time. Build a holiday-themed voice model using Text-to-Speech for interactive voice assistants. Use Whisper to transcribe holiday messages or convert text to festive audio. Customize the audio to match your brand's tone and style. Implement these audio clips in customer interactions or marketing materials.

🎄 4. Azure Machine Learning for Predictive Holiday Trends
Stay ahead of holiday trends with Azure ML models. Use AI to analyze customer behavior, forecast demand for holiday products, and manage stock levels efficiently. Predict what your customers need before they even ask!
📊 Step-by-step: Use Azure ML to train models on historical sales data to predict trends in holiday shopping. Build dashboards using Power BI integrated with Azure for real-time tracking of holiday performance metrics.

🔔 5. Azure AI for Sentiment Analysis
Understand the holiday mood of your customers by implementing sentiment analysis on social media, reviews, and feedback. Gauge the public sentiment around your brand during the festive season and respond accordingly.
📈 Step-by-step: Use Text Analytics for sentiment analysis on customer feedback, reviews, or social media posts. Generate insights and adapt your holiday marketing based on customer sentiment trends.
🌟 6. Latest Azure AI Open Models
Explore the newest Azure AI models to bring even more innovation to your holiday projects:
- GPT-4o and GPT-4 Turbo: These models offer enhanced capabilities for understanding and generating natural language and code, perfect for creating sophisticated holiday content.
- Embeddings: Use these models to convert holiday-related text into numerical vectors for improved text similarity and search capabilities.

🔧 7. Azure AI Foundry
Leverage Azure AI Foundry to build, deploy, and scale AI-driven applications. This platform provides everything you need to customize, host, run, and manage AI applications, ensuring your holiday projects are innovative and efficient.

🎉 Conclusion
With Azure AI, the possibilities to brighten your business this holiday season are endless! Whether it's automating your operations or delivering personalized customer experiences, Azure's AI models can help you stay ahead of the game and spread holiday joy. Wishing everyone a season filled with innovation and success! 🎄✨

Building custom AI Speech models with Phi-3 and Synthetic data
Introduction In today’s landscape, speech recognition technologies play a critical role across various industries—improving customer experiences, streamlining operations, and enabling more intuitive interactions. With Azure AI Speech, developers and organizations can easily harness powerful, fully managed speech functionalities without requiring deep expertise in data science or speech engineering. Core capabilities include: Speech to Text (STT) Text to Speech (TTS) Speech Translation Custom Neural Voice Speaker Recognition Azure AI Speech supports over 100 languages and dialects, making it ideal for global applications. Yet, for certain highly specialized domains—such as industry-specific terminology, specialized technical jargon, or brand-specific nomenclature—off-the-shelf recognition models may fall short. To achieve the best possible performance, you’ll likely need to fine-tune a custom speech recognition model. This fine-tuning process typically requires a considerable amount of high-quality, domain-specific audio data, which can be difficult to acquire. The Data Challenge: When training datasets lack sufficient diversity or volume—especially in niche domains or underrepresented speech patterns—model performance can degrade significantly. This not only impacts transcription accuracy but also hinders the adoption of speech-based applications. For many developers, sourcing enough domain-relevant audio data is one of the most challenging aspects of building high-accuracy, real-world speech solutions. Addressing Data Scarcity with Synthetic Data A powerful solution to data scarcity is the use of synthetic data: audio files generated artificially using TTS models rather than recorded from live speakers. Synthetic data helps you quickly produce large volumes of domain-specific audio for model training and evaluation. By leveraging Microsoft’s Phi-3.5 model and Azure’s pre-trained TTS engines, you can generate target-language, domain-focused synthetic utterances at scale—no professional recording studio or voice actors needed. What is Synthetic Data? Synthetic data is artificial data that replicates patterns found in real-world data without exposing sensitive details. It’s especially beneficial when real data is limited, protected, or expensive to gather. Use cases include: Privacy Compliance: Train models without handling personal or sensitive data. Filling Data Gaps: Quickly create samples for rare scenarios (e.g., specialized medical terms, unusual accents) to improve model accuracy. Balancing Datasets: Add more samples to underrepresented classes, enhancing fairness and performance. Scenario Testing: Simulate rare or costly conditions (e.g., edge cases in autonomous driving) for more robust models. By incorporating synthetic data, you can fine-tune custom STT(Speech to Text) models even when your access to real-world domain recordings is limited. Synthetic data allows models to learn from a broader range of domain-specific utterances, improving accuracy and robustness. Overview of the Process This blog post provides a step-by-step guide—supported by code samples—to quickly generate domain-specific synthetic data with Phi-3.5 and Azure AI Speech TTS, then use that data to fine-tune and evaluate a custom speech-to-text model. 
We will cover steps 1–4 of the high-level architecture:
End-to-End Custom Speech-to-Text Model Fine-Tuning Process – Custom Speech with Synthetic Data

Hands-on Labs: GitHub Repository

Step 0: Environment Setup

First, configure a .env file based on the provided sample.env template to suit your environment. You’ll need to:
- Deploy the Phi-3.5 model as a serverless endpoint on Azure AI Foundry.
- Provision Azure AI Speech and an Azure Storage account.

Below is a sample configuration focusing on creating a custom Italian model:

# this is a sample for keys used in this code repo.
# Please rename it to .env before you can use it
# Azure Phi3.5
AZURE_PHI3.5_ENDPOINT=https://aoai-services1.services.ai.azure.com/models
AZURE_PHI3.5_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_PHI3.5_DEPLOYMENT_NAME=Phi-3.5-MoE-instruct
#Azure AI Speech
AZURE_AI_SPEECH_REGION=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_AI_SPEECH_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt
CUSTOM_SPEECH_LANG=Italian
CUSTOM_SPEECH_LOCALE=it-IT
# https://speech.microsoft.com/portal?projecttype=voicegallery
TTS_FOR_TRAIN=it-IT-BenignoNeural,it-IT-CalimeroNeural,it-IT-CataldoNeural,it-IT-FabiolaNeural,it-IT-FiammaNeural
TTS_FOR_EVAL=it-IT-IsabellaMultilingualNeural
#Azure Account Storage
AZURE_STORAGE_ACCOUNT_NAME=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_ACCOUNT_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
AZURE_STORAGE_CONTAINER_NAME=stt-container

Key Settings Explained:
- AZURE_PHI3.5_ENDPOINT / AZURE_PHI3.5_API_KEY / AZURE_PHI3.5_DEPLOYMENT_NAME: Access credentials and the deployment name for the Phi-3.5 model.
- AZURE_AI_SPEECH_REGION: The Azure region hosting your Speech resources.
- CUSTOM_SPEECH_LANG / CUSTOM_SPEECH_LOCALE: Specify the language and locale for the custom model.
- TTS_FOR_TRAIN / TTS_FOR_EVAL: Comma-separated voice names (from the Voice Gallery) for generating synthetic speech for training and evaluation.
- AZURE_STORAGE_ACCOUNT_NAME / KEY / CONTAINER_NAME: Configuration for your Azure Storage account, where training/evaluation data will be stored.

Azure AI Speech Studio > Voice Gallery

Step 1: Generating Domain-Specific Text Utterances with Phi-3.5

Use the Phi-3.5 model to generate custom textual utterances in your target language and English. These utterances serve as a seed for synthetic speech creation. By adjusting your prompts, you can produce text tailored to your domain (such as call center Q&A for a tech brand).

Code snippet (illustrative):

topic = f"""
Call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create 10 lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required.
use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result.
"""

response = client.complete(
    messages=[
        SystemMessage(content="""
        Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
        Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized.
        Use text data that's close to the expected spoken utterances.
        The number of utterances per line should be 1.
        """),
        UserMessage(content=f"""
        #topic#: {topic}
        Question: {question}
        """),
    ],
    ...
)
content = response.choices[0].message.content
print(content)  # Prints the generated JSONL with no, locale, and content keys

Sample Output (Contoso Electronics in Italian):

{"no":1,"it-IT":"Come posso risolvere un problema con il mio televisore Contoso?","en-US":"How can I fix an issue with my Contoso TV?"}
{"no":2,"it-IT":"Qual è la garanzia per il mio smartphone Contoso?","en-US":"What is the warranty for my Contoso smartphone?"}
{"no":3,"it-IT":"Ho bisogno di assistenza per il mio tablet Contoso, chi posso contattare?","en-US":"I need help with my Contoso tablet, who can I contact?"}
{"no":4,"it-IT":"Il mio laptop Contoso non si accende, cosa posso fare?","en-US":"My Contoso laptop won't turn on, what can I do?"}
{"no":5,"it-IT":"Posso acquistare accessori per il mio smartwatch Contoso?","en-US":"Can I buy accessories for my Contoso smartwatch?"}
{"no":6,"it-IT":"Ho perso la password del mio router Contoso, come posso recuperarla?","en-US":"I forgot my Contoso router password, how can I recover it?"}
{"no":7,"it-IT":"Il mio telecomando Contoso non funziona, come posso sostituirlo?","en-US":"My Contoso remote control isn't working, how can I replace it?"}
{"no":8,"it-IT":"Ho bisogno di assistenza per il mio altoparlante Contoso, chi posso contattare?","en-US":"I need help with my Contoso speaker, who can I contact?"}
{"no":9,"it-IT":"Il mio smartphone Contoso si surriscalda, cosa posso fare?","en-US":"My Contoso smartphone is overheating, what can I do?"}
{"no":10,"it-IT":"Posso acquistare una copia di backup del mio smartwatch Contoso?","en-US":"Can I buy a backup copy of my Contoso smartwatch?"}

These generated lines give you a domain-oriented textual dataset, ready to be converted into synthetic audio.

Step 2: Creating the Synthetic Audio Dataset

Using the generated utterances from Step 1, you can now produce synthetic speech WAV files using Azure AI Speech’s TTS service. This bypasses the need for real recordings and allows quick generation of numerous training samples.

Core Function:

def get_audio_file_by_speech_synthesis(text, file_path, lang, default_tts_voice):
    ssml = f"""<speak version='1.0' xmlns="http://www.w3.org/2001/10/synthesis" xml:lang='{lang}'>
                   <voice name='{default_tts_voice}'>
                       {html.escape(text)}
                   </voice>
               </speak>"""
    speech_synthesis_result = speech_synthesizer.speak_ssml_async(ssml).get()
    stream = speechsdk.AudioDataStream(speech_synthesis_result)
    stream.save_to_wav_file(file_path)

Execution: For each generated text line, the code produces multiple WAV files (one per specified TTS voice). It also creates a manifest.txt for reference and a zip file containing all the training data.

Note: If DELETE_OLD_DATA = True, the training_dataset folder resets each run. If you’re mixing synthetic data with real recorded data, set DELETE_OLD_DATA = False to retain previously curated samples.
Code snippet (illustrative):

import zipfile
import shutil

DELETE_OLD_DATA = True

train_dataset_dir = "train_dataset"
if not os.path.exists(train_dataset_dir):
    os.makedirs(train_dataset_dir)

if(DELETE_OLD_DATA):
    for file in os.listdir(train_dataset_dir):
        os.remove(os.path.join(train_dataset_dir, file))

timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
zip_filename = f'train_{lang}_{timestamp}.zip'
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for file in files:
        zipf.write(os.path.join(output_dir, file), file)

print(f"Created zip file: {zip_filename}")

shutil.move(zip_filename, os.path.join(train_dataset_dir, zip_filename))
print(f"Moved zip file to: {os.path.join(train_dataset_dir, zip_filename)}")

train_dataset_path = os.path.join(train_dataset_dir, zip_filename)
%store train_dataset_path

You’ll also similarly create evaluation data, using a different TTS voice than the one used for training to ensure a meaningful evaluation scenario.

Example Snippet to create the synthetic evaluation data:

import datetime

print(TTS_FOR_EVAL)
languages = [CUSTOM_SPEECH_LOCALE]
eval_output_dir = "synthetic_eval_data"
DELETE_OLD_DATA = True

if not os.path.exists(eval_output_dir):
    os.makedirs(eval_output_dir)

if(DELETE_OLD_DATA):
    for file in os.listdir(eval_output_dir):
        os.remove(os.path.join(eval_output_dir, file))

eval_tts_voices = TTS_FOR_EVAL.split(',')
for tts_voice in eval_tts_voices:
    with open(synthetic_text_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                expression = json.loads(line)
                no = expression['no']
                for lang in languages:
                    text = expression[lang]
                    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
                    file_name = f"{no}_{lang}_{timestamp}.wav"
                    get_audio_file_by_speech_synthesis(text, os.path.join(eval_output_dir, file_name), lang, tts_voice)
                    with open(f'{eval_output_dir}/manifest.txt', 'a', encoding='utf-8') as manifest_file:
                        manifest_file.write(f"{file_name}\t{text}\n")
            except json.JSONDecodeError as e:
                print(f"Error decoding JSON on line: {line}")
                print(e)

Step 3: Creating and Training a Custom Speech Model

To fine-tune and evaluate your custom model, you’ll interact with Azure’s Speech-to-Text APIs:
- Upload your dataset (the zip file created in Step 2) to your Azure Storage container.
- Register your dataset as a Custom Speech dataset.
- Create a Custom Speech model using that dataset.
- Create evaluations using that custom model, polling with asynchronous calls until they complete.

You can also use UI-based approaches to customize a speech model with fine-tuning in the Azure AI Foundry portal, but in this hands-on we'll use the Azure Speech-to-Text REST APIs to iterate through the entire process.

Key APIs & References:
- Azure Speech-to-Text REST APIs (v3.2)
- The provided common.py in the hands-on repo abstracts API calls for convenience.
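Before a dataset can be registered, its zip file has to be reachable by the Speech service. As a reference only, here is a hypothetical sketch of what a helper like upload_dataset_to_storage could do with the azure-storage-blob package; the repo's common.py is the authoritative implementation, and the function name, parameters, and return shape below are illustrative assumptions.

# Hypothetical sketch of an upload helper (assumes the azure-storage-blob package).
import os
from datetime import datetime, timedelta
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas

def upload_zip_with_sas(local_path, container_name, account_name, account_key):
    service = BlobServiceClient(
        account_url=f"https://{account_name}.blob.core.windows.net",
        credential=account_key,
    )
    blob_name = os.path.basename(local_path)
    container_client = service.get_container_client(container_name)

    # Upload the zipped training/evaluation data
    with open(local_path, "rb") as data:
        container_client.upload_blob(name=blob_name, data=data, overwrite=True)

    # Generate a short-lived, read-only SAS token so the Speech dataset API can pull the file
    sas_token = generate_blob_sas(
        account_name=account_name,
        container_name=container_name,
        blob_name=blob_name,
        account_key=account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=12),
    )
    return f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"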
Example Snippet to create training dataset:

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

kind = "Acoustic"
display_name = "acoustic dataset(zip) for training"
description = f"[training] Dataset for fine-tuning the {CUSTOM_SPEECH_LANG} base model"

zip_dataset_dict = {}
for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)

You can monitor training progress using the monitor_training_status function, which polls the model’s status and updates you once training completes.

Core Function:

def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while(pbar.n < 3):
            pbar.update(1)
        print("Training Completed")

Step 4: Evaluate the Trained Custom Speech Model

After training, create an evaluation job using your synthetic evaluation dataset. With the custom model now trained, compare its performance (measured by Word Error Rate, WER) against the base model’s WER.

Key Steps:
- Use the create_evaluation function to evaluate the custom model against your test set.
- Compare evaluation metrics between the base and custom models.
- Check WER to quantify accuracy improvements.

After evaluation, you can view the evaluation results of the base model and the fine-tuned model on the evaluation dataset created in the 1_text_data_generation.ipynb notebook, in either Speech Studio or the AI Foundry fine-tuning section, depending on the resource location you specified in the configuration file.

Example Snippet to create evaluation:

description = f"[{CUSTOM_SPEECH_LOCALE}] Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = create_evaluation(base_url, headers, project_id, dataset_ids[display_name], base_model_id, custom_model_with_acoustic_id, f'vi_eval_base_vs_custom_{display_name}', description, CUSTOM_SPEECH_LOCALE)

Also, you can see a simple Word Error Rate (WER) number in the code below, which you can utilize in 4_evaluate_custom_model.ipynb.

Example Snippet to create WER dataframe:

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })
# Create a DataFrame to display the results
print(eval_info)
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)

About WER: WER is computed as (Insertions + Deletions + Substitutions) / Total Words. A lower WER signifies better accuracy. Synthetic data can help reduce WER by introducing more domain-specific terms during training.
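For intuition about the metric itself, here is a small, generic reference implementation (not part of the hands-on repo) that computes WER between a reference transcript and a hypothesis using the standard edit-distance formulation:

# Generic WER sketch: (substitutions + deletions + insertions) / reference word count.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost,  # substitution (or match)
                dp[i - 1][j] + 1,             # deletion
                dp[i][j - 1] + 1,             # insertion
            )
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of five reference words -> WER = 0.2
print(word_error_rate("come posso accendere il televisore", "come posso accendi il televisore"))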
You’ll also similarly create a WER result markdown file using the md_table_scoring_result method below.

Core Function:

# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)

Implementation Considerations

The provided code and instructions serve as a baseline for automating the creation of synthetic data and fine-tuning Custom Speech models. The WER numbers you get from model evaluation will also vary depending on the actual domain. Real-world scenarios may require adjustments, such as incorporating real data or customizing the training pipeline for specific domain needs. Feel free to extend or modify this baseline to better match your use case and improve model performance.

Conclusion

By combining Microsoft’s Phi-3.5 model with Azure AI Speech TTS capabilities, you can overcome data scarcity and accelerate the fine-tuning of domain-specific speech-to-text models. Synthetic data generation makes it possible to:
- Rapidly produce large volumes of specialized training and evaluation data.
- Substantially reduce the time and cost associated with recording real audio.
- Improve speech recognition accuracy for niche domains by augmenting your dataset with diverse synthetic samples.

As you continue exploring Azure’s AI and speech services, you’ll find more opportunities to leverage generative AI and synthetic data to build powerful, domain-adapted speech solutions—without the overhead of large-scale data collection efforts. 🙂

Reference
- Azure AI Speech Overview
- Microsoft Phi-3 Cookbook
- Text to Speech Overview
- Speech to Text Overview
- Custom Speech Overview
- Customize a speech model with fine-tuning in the Azure AI Foundry
- Scaling Speech-Text Pre-Training with Synthetic Interleaved Data (arXiv)
- Training TTS Systems from Synthetic Data: A Practical Approach for Accent Transfer (arXiv)
- Generating Data with TTS and LLMs for Conversational Speech Recognition (arXiv)

Make your voice chatbots more engaging with new text to speech features
Today we're thrilled to announce Azure AI Speech's latest updates, enhancing text to speech capabilities for a more engaging and lifelike chatbot experience. These updates include:
- A wider range of multilingual voices for natural and authentic interactions;
- More prebuilt avatar options, with the latest sample code for seamless GPT-4o integration; and
- A new text stream API that significantly reduces latency for ChatGPT integration, ensuring smoother and faster responses.

Azure AI Speech launches new zero-shot TTS models for Personal Voice
Azure AI Speech Service has upgraded its Personal Voice feature with new zero-shot TTS models. Compared to the initial model, these new models improve the naturalness of synthesized voices and better resemble the speech characteristics of the voice in the prompt.

Announcing GA of new Indian voices
TTS requirements for a modern business have evolved significantly. Businesses now require more natural, conversational, and diverse voices that can cater to high-value scenarios like call center automation, voice assistants, chatbots, and others. We are pleased to announce the GA of a host of new Indian locale voices that cater to these requirements.