azure ai vision
Microsoft Computer Vision Test
Please refer to my repo for more AI resources; you are welcome to star it: https://github.com/xinyuwei-david/david-share.git
This article is from one of my repos: https://github.com/xinyuwei-david/david-share/tree/master/Multimodal-Models/Computer-Vision
I have developed two Python programs that run on Windows and use Azure Computer Vision (Azure CV):
1. Perform object recognition on images selected by the user. After the recognition is complete, the user can choose the objects they wish to retain (one or more). The selected objects are then cropped and saved locally.
2. Remove the background based on the image and the objects the user selects.
Object detection and image segmentation: please refer to my demo video on YouTube: https://youtu.be/edjB-PDapN8
Currently, the background removal API of Azure CV has been discontinued. In the future, this functionality can be achieved through the region-to-segmentation feature of Florence-2. For a detailed implementation, please refer to: https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb
Object recognition and background removal: please refer to my demo video on YouTube: https://youtu.be/6x49D3YUTGA

Code for Object detection and image segmentation

import requests
from PIL import Image, ImageTk, ImageDraw
import tkinter as tk
from tkinter import messagebox, filedialog
import threading

# Azure Computer Vision API information
subscription_key = "o"
endpoint = "https://cv-2.cognitiveservices.azure.com/"

# Image analysis function
def analyze_image(image_path):
    analyze_url = endpoint + "vision/v3.2/analyze"
    headers = {
        'Ocp-Apim-Subscription-Key': subscription_key,
        'Content-Type': 'application/octet-stream'
    }
    params = {'visualFeatures': 'Objects'}
    try:
        with open(image_path, 'rb') as image_data:
            response = requests.post(
                analyze_url,
                headers=headers,
                params=params,
                data=image_data,
                timeout=10  # 10-second timeout
            )
            response.raise_for_status()
            analysis = response.json()
            print("Image analysis complete")
            return analysis
    except requests.exceptions.Timeout:
        print("The request timed out. Please check your network connection or try again later.")
        messagebox.showerror("Error", "The request timed out. Please check your network connection or try again later.")
    except Exception as e:
        print("Exception in analyze_image:", e)
        messagebox.showerror("Error", f"An error occurred: {e}")

# Background removal function
def remove_background(image_path, objects_to_keep):
    print("remove_background called")
    try:
        image = Image.open(image_path).convert("RGBA")
        width, height = image.size
        # Create an image with a transparent background
        new_image = Image.new("RGBA", image.size, (0, 0, 0, 0))
        # Create a mask the same size as the image
        mask = Image.new("L", (width, height), 0)
        draw = ImageDraw.Draw(mask)
        # Draw the regions of the objects to keep onto the mask
        for obj in objects_to_keep:
            x1, y1, x2, y2 = obj['coords']
            # Convert the coordinates to integers
            x1, y1, x2, y2 = map(int, [x1, y1, x2, y2])
            # Draw the rectangle filled with white (meaning: keep this region)
            draw.rectangle([x1, y1, x2, y2], fill=255)
        # Apply the mask to the original image
        new_image.paste(image, (0, 0), mask)
        print("Background removal complete, showing result")
        new_image.show()
        # Save the result
        save_path = filedialog.asksaveasfilename(
            defaultextension=".png",
            filetypes=[('PNG image', '*.png')],
            title='Save result image'
        )
        if save_path:
            new_image.save(save_path)
            messagebox.showinfo("Info", f"Processing complete. The result has been saved to: {save_path}")
    except Exception as e:
        print("Exception in remove_background:", e)
        messagebox.showerror("Error", f"An error occurred: {e}")
    print("remove_background finished")

# GUI
def create_gui():
    # Create the main window
    root = tk.Tk()
    root.title("Select the objects to keep")

    # Button for selecting an image
    def select_image():
        image_path = filedialog.askopenfilename(
            title='Select an image',
            filetypes=[('Image files', '*.png;*.jpg;*.jpeg;*.bmp'), ('All files', '*.*')]
        )
        if image_path:
            show_image(image_path)
        else:
            messagebox.showwarning("Warning", "No image file was selected.")

    def show_image(image_path):
        analysis = analyze_image(image_path)
        if analysis is None:
            print("The analysis result is empty; cannot build the GUI")
            return
        # Load the image
        pil_image = Image.open(image_path)
        img_width, img_height = pil_image.size
        tk_image = ImageTk.PhotoImage(pil_image)
        # Create the Canvas
        canvas = tk.Canvas(root, width=img_width, height=img_height)
        canvas.pack()
        # Display the image on the Canvas
        canvas.create_image(0, 0, anchor='nw', image=tk_image)
        canvas.tk_image = tk_image  # Keep a reference to the image
        # Track each object's rectangle, label, and selection state
        object_items = []
        # Process every detected object
        for obj in analysis['objects']:
            rect = obj['rectangle']
            x = rect['x']
            y = rect['y']
            w = rect['w']
            h = rect['h']
            obj_name = obj['object']
            # Draw the object's bounding box
            rect_item = canvas.create_rectangle(
                x, y, x + w, y + h,
                outline='red', width=2
            )
            # Show the object name
            text_item = canvas.create_text(
                x + w/2, y - 10,
                text=obj_name, fill='red'
            )
            # Initialize the object's selection state to unselected
            selected = False
            # Add the object's information to the list
            object_items.append({
                'rect_item': rect_item,
                'text_item': text_item,
                'coords': (x, y, x + w, y + h),
                'object': obj_name,
                'selected': selected
            })

        # Click-event handler
        def on_canvas_click(event):
            for item in object_items:
                x1, y1, x2, y2 = item['coords']
                if x1 <= event.x <= x2 and y1 <= event.y <= y2:
                    # Toggle the selection state
                    item['selected'] = not item['selected']
                    if item['selected']:
                        # Selected: draw the outline in green
                        canvas.itemconfig(item['rect_item'], outline='green')
                        canvas.itemconfig(item['text_item'], fill='green')
                    else:
                        # Unselected: draw the outline in red
                        canvas.itemconfig(item['rect_item'], outline='red')
                        canvas.itemconfig(item['text_item'], fill='red')
                    break

        canvas.bind("<Button-1>", on_canvas_click)

        # Submit button
        def on_submit():
            print("on_submit called")
            selected_objects = []
            for item in object_items:
                if item['selected']:
                    # If the object is selected, keep its information
                    selected_objects.append(item)
            if not selected_objects:
                messagebox.showwarning("Warning", "Please select at least one object.")
            else:
                # Call the background removal function
                threading.Thread(target=remove_background, args=(image_path, selected_objects)).start()
            print("on_submit finished")

        submit_button = tk.Button(root, text="Submit", command=on_submit)
        submit_button.pack()

    # Add the button for selecting an image
    select_button = tk.Button(root, text="Select image", command=select_image)
    select_button.pack()

    root.mainloop()

# Example usage
if __name__ == "__main__":
    create_gui()

Demo result:

Code for Object recognition and background removal (on a GPU VM):

from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image, ImageDraw, ImageChops
import torch
import numpy as np
import ipywidgets as widgets
from IPython.display import display, clear_output
import io

# Load the model
model_id = 'microsoft/Florence-2-large'
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype='auto'
).to(device)
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True
)

def run_example(task_prompt, image, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    # Process inputs
    inputs = processor(
        text=prompt,
        images=image,
        return_tensors="pt"
    )
    # Move inputs to the device with appropriate data types
    inputs = {
        "input_ids": inputs["input_ids"].to(device),  # input_ids are integers (int64)
        "pixel_values": inputs["pixel_values"].to(device, torch.float16)  # pixel_values need to be float16
    }
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            early_stopping=False,
            do_sample=False,
            num_beams=3,
        )
    generated_text = processor.batch_decode(
        generated_ids,
        skip_special_tokens=False
    )[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer

def create_mask(image_size, prediction):
    mask = Image.new('L', image_size, 0)
    mask_draw = ImageDraw.Draw(mask)
    for polygons in prediction['polygons']:
        for _polygon in polygons:
            _polygon = np.array(_polygon).reshape(-1, 2)
            if len(_polygon) < 3:
                continue
            _polygon = _polygon.flatten().tolist()
            mask_draw.polygon(_polygon, outline=255, fill=255)
    return mask

def combine_masks(masks):
    combined_mask = Image.new('L', masks[0].size, 0)
    for mask in masks:
        combined_mask = ImageChops.lighter(combined_mask, mask)
    return combined_mask

def apply_combined_mask(image, combined_mask):
    # Convert the image to RGBA
    image = image.convert('RGBA')
    result_image = Image.new('RGBA', image.size, (255, 255, 255, 0))
    result_image = Image.composite(image, result_image, combined_mask)
    return result_image

def process_image_multiple_objects(image, descriptions):
    """
    Process the image for multiple object descriptions.

    Parameters:
    - image: PIL.Image object.
    - descriptions: list of strings, descriptions of objects to retain.

    Returns:
    - output_image: Processed image with the specified objects retained.
    """
    masks = []
    for desc in descriptions:
        print(f"Processing description: {desc}")
        results = run_example('<REFERRING_EXPRESSION_SEGMENTATION>', image, text_input=desc.strip())
        prediction = results['<REFERRING_EXPRESSION_SEGMENTATION>']
        if not prediction['polygons']:
            print(f"No objects found for description: {desc}")
            continue
        # Generate mask for this object
        mask = create_mask(image.size, prediction)
        masks.append(mask)
    if not masks:
        print("No objects found for any of the descriptions.")
        return image.convert('RGBA')
    # Combine all masks
    combined_mask = combine_masks(masks)
    # Apply the combined mask
    output_image = apply_combined_mask(image, combined_mask)
    return output_image

def on_file_upload(change):
    # Clear any previous output (except for the upload widget)
    clear_output(wait=True)
    display(widgets.HTML("<h3>Please upload an image file using the widget below:</h3>"))
    display(upload_button)
    # Check if a file has been uploaded
    if upload_button.value:
        # Get the first uploaded file
        uploaded_file = upload_button.value[0]
        # Access the content of the file
        image_data = uploaded_file.content
        image = Image.open(io.BytesIO(image_data)).convert('RGB')
        # Display the uploaded image
        print("Uploaded Image:")
        display(image)
        # Create a text box for object descriptions
        desc_box = widgets.Text(
            value='',
            placeholder='Enter descriptions of objects to retain, separated by commas',
            description='Object Descriptions:',
            disabled=False,
            layout=widgets.Layout(width='80%')
        )
        # Create a button to submit the descriptions
        submit_button = widgets.Button(
            description='Process Image',
            disabled=False,
            button_style='primary',
            tooltip='Click to process the image',
            icon='check'
        )
        # Function to handle the button click
        def on_submit_button_click(b):
            object_descriptions = desc_box.value
            if not object_descriptions.strip():
                print("Please enter at least one description.")
                return
            # Disable the button to prevent multiple clicks
            submit_button.disabled = True
            # Clear previous output
            clear_output(wait=True)
            print("Processing the image. This may take a few moments...")
            # Split the descriptions by commas
            descriptions_list = [desc.strip() for desc in object_descriptions.split(',') if desc.strip()]
            if not descriptions_list:
                print("No valid descriptions entered. Exiting the process.")
                return
            # Process the image
            output_image = process_image_multiple_objects(image, descriptions_list)
            # Display the result
            display(output_image)
            # Optionally, save the output image
            # Uncomment the lines below to save the image
            # output_image.save('output_image.png')
            # print("The image with background removed has been saved as 'output_image.png'")

        submit_button.on_click(on_submit_button_click)
        # Display the text box and submit button
        display(widgets.VBox([desc_box, submit_button]))

# Create the upload widget
upload_button = widgets.FileUpload(
    accept='image/*',
    multiple=False
)
display(widgets.HTML("<h3>Please upload an image file using the widget below:</h3>"))
display(upload_button)
# Observe changes in the upload widget
upload_button.observe(on_file_upload, names='value')

GPU resource needed during inference:

Enhancing Workplace Safety and Efficiency with Azure AI Foundry's Content Understanding
Discover how Azure AI Foundry's Content Understanding service, featuring the Video Shot Analysis template, revolutionizes workplace safety and efficiency. By leveraging Generative AI to analyze video data, businesses can gain actionable insights into worker actions, posture, safety risks, and environmental conditions. Learn how this cutting-edge tool transforms operations across industries like manufacturing, logistics, and healthcare.

Boost Your Holiday Spirit with Azure AI
🎄✨ Boost Your Holiday Spirit with Azure AI! 🎄✨
As we gear up for the holiday season, what better way to bring innovation to your business than by using cutting-edge Azure AI technologies? From personalized customer experiences to festive-themed data insights, here's how Azure AI can help elevate your holiday initiatives:

🎅 1. Azure OpenAI Service for Creative Content
Kickstart the holiday cheer by using Azure OpenAI to create engaging holiday content. From personalized greeting messages to festive social media posts, the GPT models can assist you in generating creative text in a snap.
🎨 Step-by-step:
Use GPT to draft festive email newsletters, promotions, or customer-facing messages.
Train models on your specific brand voice for customized holiday greetings.

🎁 2. Azure AI Services for Image Recognition and Generation
Enhance your holiday product offerings by leveraging image recognition to identify and categorize holiday-themed products. Additionally, create stunning holiday-themed visuals with DALL-E. Generate unique images from text descriptions to make your holiday marketing materials stand out.
📸 Step-by-step:
Use Azure Computer Vision to analyze product images and automatically categorize seasonal items.
Implement the AI model in e-commerce platforms to help customers find holiday-specific products faster.
Use DALL-E to generate holiday-themed images based on your descriptions.
Customize and refine the images to fit your brand's style.
Incorporate these visuals into your marketing campaigns.

✨ 3. Azure AI Speech Services for Holiday Customer Interaction and Audio Generation
Transform your customer service experience with Azure's Speech-to-Text and Text-to-Speech services. You can create festive voice assistants or add holiday-themed voices to your customer support lines for a warm, personalized experience. Additionally, add a festive touch to your audio content with Azure OpenAI. Use models like Whisper for high-quality speech-to-text, together with Text-to-Speech voices, perfect for creating holiday-themed audio messages and voice assistants.
🎙️ Step-by-step:
Use Speech-to-Text to transcribe customer feedback or support requests in real time.
Build a holiday-themed voice model using Text-to-Speech for interactive voice assistants.
Use Whisper to transcribe holiday messages, and Text-to-Speech to convert text to festive audio.
Customize the audio to match your brand's tone and style.
Implement these audio clips in customer interactions or marketing materials.

🎄 4. Azure Machine Learning for Predictive Holiday Trends
Stay ahead of holiday trends with Azure ML models. Use AI to analyze customer behavior, forecast demand for holiday products, and manage stock levels efficiently. Predict what your customers need before they even ask!
📊 Step-by-step:
Use Azure ML to train models on historical sales data to predict trends in holiday shopping.
Build dashboards using Power BI integrated with Azure for real-time tracking of holiday performance metrics.

🔔 5. Azure AI for Sentiment Analysis
Understand the holiday mood of your customers by implementing sentiment analysis on social media, reviews, and feedback. Gauge the public sentiment around your brand during the festive season and respond accordingly.
📈 Step-by-step:
Use Text Analytics for sentiment analysis on customer feedback, reviews, or social media posts (see the sketch below).
Generate insights and adapt your holiday marketing based on customer sentiment trends.
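To make the sentiment-analysis step concrete, here is a minimal sketch using the azure-ai-textanalytics Python package; the endpoint, key, and sample reviews are placeholders for illustration, not values from this post.

# Minimal sketch: score holiday-season feedback with Azure AI Language sentiment analysis.
# Endpoint, key, and review text are placeholders. Requires: pip install azure-ai-textanalytics
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",  # placeholder
    credential=AzureKeyCredential("<your-key>"),  # placeholder
)

# Example customer feedback to score (placeholder data).
reviews = [
    "Loved the festive packaging, it made a perfect gift!",
    "Delivery took two weeks in December, very disappointing.",
]

for doc in client.analyze_sentiment(documents=reviews):
    if not doc.is_error:
        # Overall label (positive/neutral/negative) plus per-class confidence scores.
        print(doc.sentiment, doc.confidence_scores)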
🌟 6. Latest Azure AI Open Models
Explore the newest Azure AI models to bring even more innovation to your holiday projects:
GPT-4o and GPT-4 Turbo: These models offer enhanced capabilities for understanding and generating natural language and code, perfect for creating sophisticated holiday content.
Embeddings: Use these models to convert holiday-related text into numerical vectors for improved text similarity and search capabilities.

🔧 7. Azure AI Foundry
Leverage Azure AI Foundry to build, deploy, and scale AI-driven applications. This platform provides everything you need to customize, host, run, and manage AI applications, ensuring your holiday projects are innovative and efficient.

🎉 Conclusion:
With Azure AI, the possibilities to brighten your business this holiday season are endless! Whether it's automating your operations or delivering personalized customer experiences, Azure's AI models can help you stay ahead of the game and spread holiday joy. Wishing everyone a season filled with innovation and success! 🎄✨

Announcing the General Availability of Document Intelligence v4.0 API
The Document Intelligence v4.0 API is now generally available! This latest version of the Document Intelligence API brings new and updated capabilities across the entire product, including updates to the Read and Layout APIs for content extraction, prebuilt and custom extraction models for schema extraction from documents, and classification models. Document Intelligence has all the tools to enable RAG and document automation solutions for structured and unstructured documents.

Enhanced Layout capabilities
This release brings significant updates to our Layout capabilities, making it the default choice for document ingestion with enhanced support for Retrieval-Augmented Generation (RAG) workflows. The Layout API now offers a markdown output format that provides a better representation of document elements such as headers, footers, sections, section headers, and tables when working with Gen AI models. This structured output enables semantic chunking of content, making it easier to ingest documents into RAG workflows and generate more accurate results. Try Layout in the Document Intelligence Studio or use Layout as a skill in your RAG pipelines with Azure Search.

Searchable PDF output
Document Intelligence no longer outputs only JSON! With the 4.0 release, you can now generate a searchable PDF output from an input document. The recognized text is overlaid on the scanned content, making all the content in the documents instantly searchable. This feature enhances the accessibility and usability of your documents, allowing for quick and efficient information retrieval. Try the new searchable PDF output in the Studio or learn more. Searchable PDF is available as an output from the Read API at no additional cost. This release also includes several updates to the OCR model to better handle complex text recognition challenges.

New and updated Prebuilt models
Prebuilt models offer a simple API to extract a defined schema from known document types. The v4.0 release adds new prebuilt models for mortgage processing, bank document processing, paystub, credit/debit card, check, marriage certificate, and prebuilt models for processing variants of the 1095, W4, and 1099 tax forms for US tax processing scenarios. These models are ideal for extracting specific details from documents like bank statements, checks, paystubs, and various tax forms. With over 22 prebuilt model types, Document Intelligence has models for common documents in procurement, tax, mortgage, and financial services. See the models overview for a complete list of document types supported with prebuilt models.

Query field add-on capability
Query field is an add-on capability to extend the schema extracted from any prebuilt model. This add-on capability is ideal when you have simple fields that need to be extracted. Query fields also work with Layout, so for simple documents, you don't need to train a custom model and can just define the query fields to begin processing the document with no training. Query field supports a maximum of 20 fields per request. Try query field in the Document Intelligence Studio with Layout or any prebuilt model.

Document classification model
The custom classification models are updated to improve the classification process and now support multi-language documents and incremental training. This allows you to update the classifier model with additional samples or classes without needing the entire training dataset. Classifiers also support analyzing Office document types (.docx, .pptx, and .xls).
Version 4.0 adds a classifier copy operation for copying your classifier across resources, regions, or subscriptions, making model management easier. This version also introduces some changes in the splitting behavior: by default, the custom classification model no longer splits documents during analysis. Learn more about the classification and splitting capabilities.

Improvements to Custom Extraction models
Custom extraction models now output confidence scores for tables, table rows, and cells. This makes the process of validating model results much easier and provides the tools to trigger human reviews. Custom model capabilities have also improved with the addition of signature detection to neural models and support for overlapping fields. Neural models now include a paid training tier for when you have a large dataset of labeled documents to train. Paid training enables longer training to ensure you have a model that performs better on the different variations in your training dataset. Learn more about improvements to custom extraction models.

New implementation of model compose for greater flexibility
With custom extraction models in the past, you could compose multiple models into a single composed model. When a document was analyzed with a composed model, the service picked the model best suited to process the document. With this version, model compose introduces a new implementation requiring a classification model in addition to the extraction models. This enables processing multiple instances of the same document with splitting, conditional routing, and more. Learn more about the new model compose implementation.

Get started with the v4.0 API today
The Document Intelligence v4.0 API is packed with many more updates. Start with the what's new page to learn more. You can try all of the new and updated capabilities in the Document Intelligence Studio. Explore the new REST API or the language-specific SDKs to start building or updating your document workflows.
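As a hedged illustration of the REST surface described above, the sketch below calls the prebuilt-layout model with markdown output plus a couple of query fields. The endpoint, key, api-version string, document URL, and field names are assumptions to verify against the current reference documentation, not values taken from this announcement.

# Minimal sketch: Layout analysis with markdown output and query fields via the REST API.
# Endpoint, key, api-version, and query-field names below are illustrative assumptions.
import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<your-key>"  # placeholder

url = (
    f"{endpoint}/documentintelligence/documentModels/prebuilt-layout:analyze"
    "?api-version=2024-11-30"           # check the current GA api-version
    "&outputContentFormat=markdown"     # markdown output for semantic chunking / RAG
    "&features=queryFields"
    "&queryFields=VendorName,DueDate"   # hypothetical query fields
)
headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}
body = {"urlSource": "https://example.com/sample-invoice.pdf"}  # placeholder document

# The analyze call is asynchronous: poll the Operation-Location URL until it completes.
operation = requests.post(url, headers=headers, json=body)
operation.raise_for_status()
result_url = operation.headers["Operation-Location"]

while True:
    result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
    if result.get("status") in ("succeeded", "failed"):
        break
    time.sleep(2)

# The markdown representation of the document is returned in analyzeResult.content.
print(result["analyzeResult"]["content"][:500])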
Multimodal video search powered by Video Retrieval in Azure

Video content is becoming increasingly central to business operations, from training materials to safety monitoring. As part of Azure's comprehensive video analysis capabilities, we're excited to discuss Azure Video Retrieval, a powerful service that enables natural language search across your video and image content. This service makes it easier than ever to locate exactly what you need within your media assets.

What is Azure Video Retrieval?
Azure Video Retrieval allows you to create a search index and populate it with both videos and images. Using natural language queries, you can search through this content to identify visual elements (like objects and safety events) and speech content without requiring manual transcription or specialized expertise. The service offers powerful customization options - developers can define metadata schemas for each index, ingest custom metadata, and specify which features (vision, speech) to extract and filter during search operations. Whether you're looking for specific spoken phrases or visual occurrences, the service pinpoints exact timestamps where your search criteria appear.

Key Features
Multimodal Search: Search across both visual and audio content using natural language
Custom Metadata Support: Define and ingest metadata schemas for enhanced retrieval
Flexible Feature Extraction: Specify which features (vision, speech) to extract and search
Precise Timestamp Matching: Get exact frame locations where your search criteria appear
Multiple Content Types: Index and search both videos and images
Simple Integration: Easy implementation with Azure Blob Storage
Comprehensive API: Full REST API support for custom implementations

Getting Started
Prerequisites
Before you begin, you'll need:
An Azure Cognitive Services multi-service account
An Azure Blob Storage account for video content

Setting Up Video Indexing
The indexing process is straightforward. Here's how to create an index and upload videos:

# Iterate through blobs and build the index
for blob in blob_service_client.get_container_client(az_storage_container_name).list_blobs():
    blob_name = blob.name
    blob_url = f"https://{az_storage_account_name}.blob.core.windows.net/{az_storage_container_name}/{blob_name}"
    # Generate SAS URL for secure access
    sas_url = blob_url + "?" + sas_token
    # Add video to index
    payload["videos"].append({
        "mode": "add",
        "documentId": str(uuid.uuid4()),
        "documentUrl": sas_url,
        "metadata": {
            "cameraId": "video-indexer-demo-camera1",
            "timestamp": datetime.datetime.now(datetime.UTC).strftime("%Y-%m-%d %H:%M:%S")
        }
    })

# Create index
response = requests.put(url, headers=headers, json=payload)

Searching Videos
The service supports two primary search modes:

# Query templates for searching by text or speech
query_by_text = {
    "queryText": "<user query>",
    "filters": {
        "featureFilters": ["vision"],
    },
}

query_by_speech = {
    "queryText": "<user query>",
    "filters": {
        "featureFilters": ["speech"],
    },
}

The search input is passed to the REST API based on the mode chosen.
# Function to search for video frames based on user input, from the Azure Video Retrieval service
def search_videos(query, query_type):
    url = f"https://{az_video_indexer_endpoint}/computervision/retrieval/indexes/{az_video_indexer_index_name}:queryByText?api-version={az_video_indexer_api_version}"
    headers = {
        "Ocp-Apim-Subscription-Key": az_video_indexer_key,
        "Content-Type": "application/json",
    }
    input_query = None
    if query_type == "Speech":
        query_by_speech["queryText"] = query
        input_query = query_by_speech
    else:
        query_by_text["queryText"] = query
        input_query = query_by_text
    try:
        response = requests.post(url, headers=headers, json=input_query)
        response.raise_for_status()
        print("search response \n", response.json())
        return response.json()
    except Exception as e:
        print("error", e.args)
        print("error", e)
        return None

The REST APIs that are required to complete the steps in this process are covered here.

Use Cases
Azure Video Retrieval can transform how organizations work with video content across various scenarios:
Training and Education: Quickly locate specific topics or demonstrations within training videos
Content Management: Efficiently organize and retrieve media assets
Safety and Compliance: Find specific safety-related content or incidents
Media Production: Locate specific scenes or dialogue across video libraries

Demo
Watch this sample application that uses Video Retrieval to let users search frames across multiple videos in an index. The source code of the sample application can be accessed here.

Resources:
Video Retrieval API
Video Retrieval API reference
Azure AI Video Indexer overview
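The snippets above assume the index already exists. As a hedged sketch (reusing the same placeholder variable names as the snippets above; the api-version, schema fields, and feature names should be checked against the Video Retrieval API reference), creating the index with a custom metadata schema and feature selection might look like this:

# Hedged sketch: create a Video Retrieval index with a metadata schema and feature selection.
# Endpoint host, key, index name, api-version, and schema fields are illustrative assumptions.
import requests

az_video_indexer_endpoint = "<your-computer-vision-resource>.cognitiveservices.azure.com"  # placeholder host
az_video_indexer_key = "<your-key>"  # placeholder
az_video_indexer_index_name = "my-video-index"  # placeholder
az_video_indexer_api_version = "2023-05-01-preview"  # verify against current docs

url = (
    f"https://{az_video_indexer_endpoint}/computervision/retrieval/indexes/"
    f"{az_video_indexer_index_name}?api-version={az_video_indexer_api_version}"
)
headers = {
    "Ocp-Apim-Subscription-Key": az_video_indexer_key,
    "Content-Type": "application/json",
}
index_definition = {
    # Custom metadata fields that can later be filtered on at query time.
    "metadataSchema": {
        "fields": [
            {"name": "cameraId", "searchable": False, "filterable": True, "type": "string"},
            {"name": "timestamp", "searchable": False, "filterable": True, "type": "datetime"},
        ]
    },
    # Which insights to extract from the ingested videos: visual content and/or speech.
    "features": [
        {"name": "vision"},
        {"name": "speech"},
    ],
}

response = requests.put(url, headers=headers, json=index_definition)
response.raise_for_status()
print(response.json())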
Phi-3 Vision – Catalyzing Multimodal Innovation

Microsoft's Phi-3 Vision is a new AI model that combines text and image data to deliver smart and efficient solutions. With just 4.2 billion parameters, it offers high performance and can run on devices with limited computing power. From describing images to analyzing documents, Phi-3 Vision is designed to make advanced AI accessible and practical for everyday use. Explore how this model is set to change the way we interact with AI, offering powerful capabilities in a small and efficient package.
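As a hedged sketch of getting started (the model id, prompt format, and image URL below follow the public Hugging Face model card and are assumptions worth double-checking, not details from this post), loading Phi-3 Vision with transformers might look like this:

# Hedged sketch: image + text prompt with Phi-3 Vision via Hugging Face transformers.
# Model id, chat format, and image URL are illustrative; verify against the model card.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto"
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image URL; swap in your own photo or document page.
image = Image.open(requests.get("https://example.com/sample.jpg", stream=True).raw)

# The model card uses an <|image_1|> placeholder inside a chat template.
messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image in one sentence."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens before decoding the generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)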
Transforming Video into Value with Azure AI Content Understanding

Unlocking Value from Unstructured Video
Every minute, social video sharing platforms see over 500 hours of video uploads [1], and 91% of businesses leverage video as a key tool [2]. From media conglomerates managing extensive archives to enterprises producing training and marketing materials, organizations are overwhelmed with video. Yet, despite this abundance, video remains inherently unstructured and difficult to utilize effectively. While the volume of video content continues to grow exponentially, its true value often remains untapped due to the friction involved in making video useful. Organizations grapple with several pain points:
Inaccessibility of Valuable Content Archives: Massive video archives sit idle because finding the right content to reuse requires extensive manual effort.
The Impossibility of Personalization Without Metadata: Personalization holds the key to unlocking new revenue streams and increasing engagement. However, without reliable and detailed metadata, it's cost-prohibitive to tailor content to specific audiences or individuals.
Missed Monetization Opportunities: For media companies, untapped archives mean missed chances to monetize content through new formats or platforms.
Operational Bottlenecks: Enterprises struggle with slow turnaround times for training materials, compliance checks, and marketing campaigns due to inefficient video workflows, leading to delays and increased expenses.
Many video processing applications rely on purpose-built, frame-by-frame analysis to identify objects and key elements within video content. While this method can detect a specific list of objects, it is inherently lossy, struggling to capture actions, events, or uncommon objects. It is also expensive and time-consuming to customize for specific tasks. Generative AI promises to revolutionize video content analysis, with GPT-4o topping leaderboards for video understanding tasks, but finding a generative model that processes video is just the first step. Creating video pipelines with generative models is hard. Developers must invest significant effort in infrastructure to create custom video processing pipelines to get good results. These systems need optimized prompts, integrated transcription, smart handling of context-window limitations, shot-aligned segmentation, and much more. This makes them expensive to optimize and hard to maintain over time.

Introducing Azure AI Content Understanding for video
This is where Azure AI Content Understanding transforms the game. By offering an integrated video pipeline that leverages advanced foundational models, you can effortlessly extract insights from both the audio and visual elements of your videos. This service transforms unstructured video into structured, searchable knowledge, enabling powerful use cases like media asset management and highlight reel generation. With Content Understanding, you can automatically identify key moments in a video to extract highlights and summarize the full context. For example, for corporate events and conferences you can quickly produce same-day highlight reels. This capability not only reduces the time and cost associated with manual editing but also empowers organizations to deliver timely, professional reaction videos that keep audiences engaged and informed. In another case, a news broadcaster can create a new personalized viewing experience for news by recommending stories of interest.
This is achieved by automatically tagging segments with relevant metadata like topic and location, enabling the delivery of content personalized to individual interests, driving higher engagement and viewer satisfaction. By generating specific metadata on a segment-by-segment basis, including chapters, scenes, and shots, Content Understanding provides a detailed outline of what's contained in the video, facilitating these workflows. This is enabled by a streamlined pipeline for video that starts with content extraction tasks like transcription, shot detection, key frame extraction, and face grouping to create grounding data for analysis. Then, generative models use that information to extract the specific fields you request for each segment of the video. This generative field extraction capability enables customers to:
Customize Metadata: Tailor the extracted information to focus on elements important to your use case, such as key events, actions, or dialogues.
Create Detailed Outlines: Understand the structure of your video content at a granular level.
Automate Repetitive Editing Tasks: Quickly pinpoint important segments to create summaries, trailers, or compilations that capture the essence of the full video.
By leveraging these capabilities, organizations can automate many video creation tasks, including creating highlight reels and repurposing content across formats, saving time and resources while delivering compelling content to their audiences. Whether it's summarizing conference keynotes, capturing the essence of corporate events, or showcasing the most exciting moments in sports, Azure AI Content Understanding makes video workflows efficient and scalable. But how do these solutions perform in real-world scenarios?

Customer Success Stories

IPV Curator: Transforming Media Asset Management
IPV Curator, a leader in media asset management solutions, assists clients in managing and monetizing extensive video libraries across various industries, including broadcast, sports, and global enterprises. It enables seamless, zero-download editing of video in the Azure cloud using Adobe applications. Their customers needed an efficient way to search, repurpose, and produce vast amounts of video content with data extraction tailored to specific use cases. IPV integrated Azure AI Content Understanding into their Curator media asset management platform. They found that it provided a step-function improvement in metadata extraction for their clients. It was particularly beneficial as it enabled:
Industry Specific Metadata: Allowed clients to extract metadata tailored to their specific needs by using simple prompts and without the need for domain-specific training of new AI models. For example:
Broadcast: Rapidly identified key scenes for promo production and efficiently identified their highest-value content for Free ad-supported streaming TV (FAST) channels.
Travel Marketing Content: Automatically tagged geographic locations, landmarks, shot types (e.g., aerial, close-up), and highlighted scenic details.
Shopping Channel Content: Detected specific products, identified demo segments, product categories, and key selling points.
Advanced Action and Event Analysis: Enabled detailed analysis of a set of frames in a video segment to identify actions and events. This provides a new level of insights compared to frame-by-frame analysis of objects.
Segmentation Aligned to Shots: Detected shot boundaries in produced videos and in-media edit points, enabling easy reuse by capturing full shots in segments.
As a result, IPV's clients can quickly find and repurpose content, significantly reducing editing time and accelerating video production at scale.
IPV Curator enables search across industry-specific metadata extracted from videos.
"IPV's collaboration with Microsoft transforms media stored in Azure into an easily accessible, streaming, and highly searchable active archive. The powerful search engine within IPV's new generation of Media Asset Management uses Azure AI Content Understanding to accurately surface any archived video clip, driving users to their highest value content in seconds." —Daniel Mathew, Chief Revenue Officer, IPV

Cognizant: Innovative Ad Moderation
Cognizant, a global leader in consulting and professional services, has identified a challenge in moderating advertising content for its media customers. Their customers' traditional methods are heavily reliant on manual review and struggle to scale with the increasing volume of content requiring assessment. The Cognizant Ad Moderation solution framework leverages Content Understanding to create a more accurate, cost-effective approach to ad moderation that results in a 96% reduction in review time. It allows customers to automate ad reviews to ensure cultural sensitivity and regulatory compliance and to optimize programming placement, ultimately reducing manual review efforts. Cognizant achieves these results by leveraging Content Understanding for multimodal field extraction, tailored output, and native generative AI video processing.
Multimodal Field Extraction: Extracts key attributes from both the audio and visual elements, allowing for a more comprehensive analysis of the content. This analysis is critical to get a holistic view of suitability for various audiences.
Tailored Output Schema: Outputs a custom structured schema that detects content directly relevant to the moderation task. This includes detecting specific risky attributes like prohibited language, potentially banned topics, violations of content restrictions, and sensitive products like alcohol or smoking.
Native Generative AI Video Processing: Content Understanding natively processes video files with generative AI to provide the detailed insights requested in the schema, capturing context, actions, and events over entire segments of the video.
This optimized video pipeline provides Cognizant with a detailed analysis of videos to ground an automated decision. It allows them to quickly green-light compliant ads and flag others for rejection or human review. Content Understanding empowers Cognizant to focus on solving business challenges rather than managing the underlying infrastructure for video processing and integrating generative models.
"I'm absolutely thrilled about the Azure AI Content Understanding service! It's a game-changer that accelerates processing by integrating multiple AI capabilities into a single service call, delivering combined audio and video transcription in one JSON output with incredibly detailed results. The ability to add custom fields that integrate with an LLM provides even more detailed, meaningful, and flexible output." - Rushil Patel, Developer @ Cognizant

The Broader Impact: Transformation across industries
The transformative power of Azure AI Content Understanding extends far beyond these specific use cases, offering significant benefits across various industries and workflows.
By leveraging advanced AI capabilities on video, organizations have been able to unlock new opportunities and drive innovation in several key areas:
Social Media Listening and Consumer Insights: Analyze video content across social platforms to understand how products are perceived and discussed online. Gain valuable consumer insights to inform product development, marketing strategies, and brand management.
Unlocking Video for AI Assistants and Agents: Enable AI assistants and agents to access and utilize information from video content, transforming meeting recordings, training videos, and events into valuable data sources for Retrieval-Augmented Generation (RAG). Enhance customer support and knowledge management by integrating video insights into AI-driven interactions.
Enhancing Accessibility with Audio Descriptions: Generate draft audio descriptions for video content to provide a starting point for human editors. This streamlines the creation of accessible content for visually impaired audiences, reducing effort and accelerating compliance with accessibility standards.
Marketing and Advertising Workflows: Automate content analysis to ensure brand alignment and effective advertising. Understand and optimize the content within video advertisements to maintain consistent branding and enhance audience engagement.
The business value of Azure AI Content Understanding is clear. By addressing core challenges in video content management with generative AI, customization, and native video processing, it enhances operational efficiencies and unlocks new opportunities for monetization and innovation. Organizations can now turn dormant video archives into valuable assets, deliver personalized content to engage audiences effectively, and automate manual, time-consuming workflows.

Ready to Transform Your Video Content?
For more details on how to use Content Understanding for video, check out the Video Solution Overview. If you are at Microsoft Ignite 2024 or are watching online, check out this breakout session. Try this new service in Azure AI Foundry. For documentation, please refer to the Content Understanding Overview. For a broader perspective, see Announcing Azure AI Content Understanding: Transforming Multimodal Data into Insights and discover how it extends these capabilities across all content formats.
-----
[1] According to Statista in 2022 - Hours of video uploaded every minute 2022 | Statista
[2] According to a Wyzowl survey in 2024 - Video Marketing 2024 (10 Years of Data) | Wyzowl

Announcing Azure AI Content Understanding: Transforming Multimodal Data into Insights
Solve Common GenAI Challenges with Content Understanding
As enterprises leverage foundation models to extract insights from multimodal data and develop agentic workflows for automation, it's common to encounter issues like inconsistent output quality, ineffective pre-processing, and difficulties in scaling out the solution. Organizations often find that to handle multiple types of data, the effort is fragmented by modality, increasing the complexity of getting started. Azure AI Content Understanding is designed to eliminate these barriers, accelerating success in Generative AI workflows.
Handling Diverse Data Formats: By providing a unified service for ingesting and transforming data of different modalities, businesses can extract insights from documents, images, videos, and audio seamlessly and simultaneously, streamlining workflows for enterprises.
Improving Output Data Accuracy: Deriving high-quality output for their use cases requires practitioners to ensure the underlying AI is customized to their needs. Using advanced AI techniques like intent clarification and a strongly typed schema, Content Understanding can effectively parse large files to extract values accurately.
Reducing Costs and Accelerating Time-to-Value: Using confidence scores to trigger human review only when needed minimizes the total cost of processing the content. Integrating the different modalities into a unified workflow and grounding the content when applicable allows for faster reviews.

Core Features and Advantages
Azure AI Content Understanding offers a range of innovative capabilities that improve efficiency, accuracy, and scalability, enabling businesses to unlock deeper value from their content and deliver a superior experience to their end users.
Multimodal Data Ingestion and Content Extraction: The service ingests a variety of data types such as documents, images, audio, and video, transforming them into a structured format that can be easily processed and analyzed. It instantly extracts core content from your data, including transcriptions, text, faces, and more.
Data Enrichment: Content Understanding offers additional features that enhance content extraction results, such as layout elements, barcodes, and figures in documents, speaker recognition and diarization in audio, and more.
Schema Inferencing: The service offers a set of prebuilt schemas and allows you to build and customize your own to extract exactly what you need from your data. Schemas allow you to extract a variety of results, generating task-specific representations like captions, transcripts, summaries, thumbnails, and highlights. This output can be consumed by downstream applications for advanced reasoning and automation.
Post Processing: Enhances service capabilities with generative AI tools that ensure the accuracy and usability of extracted information. This includes providing confidence scores for minimal human intervention and enabling continuous improvement through user feedback.

Transformative Applications Across Industries
Azure AI Content Understanding is ideal for a wide range of use cases and industries, as it is fully customizable and allows for the input of data from multiple modalities.
Here are just a few examples of scenarios Content Understanding is powering today:
Post call analytics: Customers utilize Azure AI Content Understanding to extract analytics on call center or recorded meeting data, allowing you to aggregate data on the sentiment, speakers, and content discussed, including specific names, companies, user data, and more.
Media asset management and content creation assistance: Extract key features from images and videos to better manage media assets and enable search on your data for entities like brands, setting, key products, people, and more.
Insurance claims: Analyze and process insurance claims and other low-latency batch processing scenarios to automate previously time-intensive processes.
Highlight video reel generation: With Content Understanding, you can automatically identify key moments in a video to extract highlights and summarize the full content. For example, automatically generate a first draft of highlight reels from conferences, seminars, or corporate events by identifying key moments and significant announcements.
Retrieval Augmented Generation (RAG): Ingest and enrich content of any modality to effectively find answers to common questions in scenarios like customer service agents, or power content search scenarios across all types of data.

Customer Success with Content Understanding
Customers all over the world are already finding unique and powerful ways to accelerate their inferencing and unlock insights on their data by leveraging the multimodal capabilities of Content Understanding. Here are a few examples of how customers are unlocking greater value from their data:
Philips: Philips Speech Processing Solutions (SPS) is a global leader in dictation and speech-to-text solutions, offering innovative hardware and software products that enhance productivity and efficiency for professionals worldwide. Content Understanding enables Philips to power their speech-to-result solution, allowing customers to use voice to generate accurate, ready-to-use documentation.
"With Azure AI Content Understanding, we're taking Philips SpeechLive, our speech-to-result solution to a whole new level. Imagine speaking, and getting fully generated, accurate documents—ready to use right away, thanks to powerful AI speech analytics that work seamlessly with all the relevant data sources." – Thomas Wagner, CTO Philips Dictation Services
WPP: WPP, one of the world's largest advertising and marketing services providers, is revolutionizing website experiences using Azure AI Content Understanding. SJR, a content tech firm within WPP, is leveraging this technology for SJR Generative Experience Manager (GXM), which extracts data from all types of media on a company's website—including text, audio, video, PDFs, and images—to deliver intelligent, interactive, and personalized web experiences, with the support of WPP's AI technology company, Satalia. This enables them to convert static websites into dynamic, conversational interfaces, unlocking information buried deep within websites and presenting it as if spoken by the company's most knowledgeable salesperson. Through this innovation, WPP's SJR is enhancing customer engagement and driving conversion for their clients.
ASC: ASC Technologies is a global leader in providing software and cloud solutions for omni-channel recording, quality management, and analytics, catering to industries such as contact centers, financial services, and public safety organizations.
ASC utilizes Content Understanding to enhance their compliance analytics solution, streamlining processes and improving efficiency.
"ASC expects to significantly reduce the time-to-market for its compliance analytics solutions. By integrating all the required capture modalities into one request, instead of customizing and maintaining various APIs and formats, we can cover a wide range of use cases in a much shorter time." - Tobias Fengler, Chief Engineering Officer
Numonix: Numonix AI specializes in capturing, analyzing, and managing customer interactions across various communication channels, helping organizations enhance customer experiences and ensure regulatory compliance. They are leveraging Content Understanding to capture insights from recorded call data from both audio and video to transcribe, analyze, and summarize the contents of calls and meetings, allowing them to ensure compliance across all conversations.
"Leveraging Azure AI Content Understanding across multiple modalities has allowed us to supercharge the value of the recorded data Numonix captures on behalf of our customers. Enabling smarter communication compliance and security in the financial industry to fully automating quality management in the world's largest call centers." – Evan Kahan, CTO & CPO Numonix
IPV Curator: A leader in media asset management solutions, IPV is leveraging Content Understanding to improve their metadata extraction capabilities to produce stronger industry-specific metadata and advanced action and event analysis, and to align video segmentation to specific shots in videos. IPV's clients are now able to accelerate their video production, reduce editing time, and access their content more quickly and easily. To learn more about how Content Understanding empowers video scenarios as well as how our customers such as IPV are using the service to power their unique media applications, check out Transforming Video Content into Business Value.

Robust Security and Compliance
Built using Azure's industry-leading enterprise security, data privacy, and Responsible AI guidelines, Azure AI Content Understanding ensures that your data is handled with the utmost care and compliance and generates responses that align with Microsoft's principles for responsible use of AI. We are excited to see how Azure AI Content Understanding will empower organizations to unlock their data's full potential, driving efficiency and innovation across various industries. Stay tuned as we continue to develop and enhance this groundbreaking service.

Getting Started
If you are at Microsoft Ignite 2024 or are watching online, check out this breakout session on Content Understanding. Learn more about the new Azure AI Content Understanding service here. Build your own Content Understanding solution in the Azure AI Foundry. For all documentation on Content Understanding, please refer to this page.
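To make the idea of a custom field schema concrete, here is a hedged sketch of defining and running a custom analyzer for video segments; the analyzer id, base analyzer, field names, endpoint paths, and api-version below are assumptions to verify against the Content Understanding documentation, not details taken from this announcement.

# Hedged sketch: register a custom video analyzer schema, then submit a video for analysis.
# Analyzer id, base analyzer, field names, api-version, and paths are illustrative assumptions.
import requests

endpoint = "https://<your-ai-services-resource>.cognitiveservices.azure.com"  # placeholder
key = "<your-key>"  # placeholder
api_version = "2024-12-01-preview"  # verify against current docs
analyzer_id = "highlight-segments"  # hypothetical analyzer name

# Fields requested per video segment; these drive the generative field extraction step.
analyzer = {
    "description": "Segment-level metadata for highlight reel selection",
    "baseAnalyzerId": "prebuilt-videoAnalyzer",
    "fieldSchema": {
        "fields": {
            "segmentSummary": {"type": "string", "description": "One-sentence summary of the segment"},
            "topics": {"type": "array", "items": {"type": "string"}, "description": "Topics discussed or shown"},
            "isKeyMoment": {"type": "boolean", "description": "Whether this segment belongs in a highlight reel"},
        }
    },
}
headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}

# Register the analyzer, then submit a video URL for analysis (both calls are asynchronous in practice).
requests.put(
    f"{endpoint}/contentunderstanding/analyzers/{analyzer_id}?api-version={api_version}",
    headers=headers, json=analyzer,
).raise_for_status()
response = requests.post(
    f"{endpoint}/contentunderstanding/analyzers/{analyzer_id}:analyze?api-version={api_version}",
    headers=headers, json={"url": "https://example.com/conference-keynote.mp4"},  # placeholder video
)
print(response.headers.get("Operation-Location"))  # poll this URL for the segment-level results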
Announcing New Fine-Tuning Capabilities with Images on Azure OpenAI Service

We are excited to introduce a groundbreaking feature in the Azure OpenAI Service that allows you to fine-tune models with images in your JSONL files. This enhancement opens new possibilities for creating more dynamic and interactive AI applications.

Fine-Tuning with Images
You can now include images in your training data, just as you can send image inputs to chat completions. Images can be provided either as publicly accessible URLs or data URIs containing base64 encoded images. This feature allows you to create more comprehensive training datasets that include visual elements, enhancing the model's ability to understand and generate content based on images (a sample training record is sketched after the requirements below).

Use cases
In the retail and e-commerce sector, vision fine-tuning can significantly enhance product recommendations by analyzing images of products that customers have viewed or purchased. This leads to higher conversion rates and increased customer loyalty by creating personalized shopping experiences. Additionally, automating the tagging and categorization of product images simplifies inventory management, especially for large inventories.
In agriculture, fine-tuning models with images of crops can help identify diseases, pests, and nutrient deficiencies early, saving crops and reducing losses. This is particularly effective when combined with drones and satellite imagery for large-scale monitoring. For example, a model fine-tuned with images of different stages of crop growth can provide insights into the health and development of the crops.
In the manufacturing industry, vision fine-tuning is invaluable for quality control and defect detection. By training models with images of products at various stages of production, manufacturers can identify specific defects such as cracks, misalignments, or surface imperfections early in the process. This ensures that only high-quality products reach the market, reducing waste and improving efficiency.
For security and surveillance, fine-tuning models with images from security cameras enhances the ability to detect and recognize suspicious activities or objects. This is particularly useful in monitoring public spaces, airports, and critical infrastructure. Integrating these models with other security systems, such as alarms or access control, provides a more comprehensive security solution.
In healthcare, beyond diagnosing diseases from medical images, vision fine-tuning can be used to monitor patient progress over time. For instance, models can be trained with images of wounds or skin conditions to track healing and provide recommendations for treatment. This continuous monitoring helps healthcare providers offer personalized care and improve patient outcomes. Additionally, remote consultations and telemedicine benefit from these capabilities, making healthcare more accessible.
These use cases demonstrate the versatility and potential of vision fine-tuning across various industries.

Image Dataset Requirements
To ensure the best performance and compliance, there are specific requirements for your image datasets:
Size: Your training file can contain up to 50,000 examples with images, with each example having a maximum of 64 images. Each image can be up to 10 MB.
Format: Images must be in JPEG, PNG, or WEBP format and in RGB or RGBA mode. Images cannot be included as output from messages with the assistant role.
Content Moderation: Images are scanned before training to ensure compliance with our usage policy. Images containing people, faces, or CAPTCHAs will be excluded from the dataset.
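As a hedged illustration of what a training example with an image can look like (the file name, image URL, prompt, and label below are placeholders, not values from this announcement), the snippet writes one JSONL line in the chat format described above, with the detail parameter set to low (discussed under Reducing Training Costs below) to keep token costs down:

# Hedged sketch: write one vision fine-tuning example to a JSONL training file.
# The image URL, prompt, and assistant label are placeholders; the message layout mirrors
# the chat-completions format with image_url content parts described above.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a quality-control assistant for a production line."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Does this part show any visible defects?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/images/part-001.jpg",  # placeholder image
                        "detail": "low",  # resized to 512x512 / 85 tokens to reduce training cost
                    },
                },
            ],
        },
        # Note: images may appear only in user or system content, never in assistant outputs.
        {"role": "assistant", "content": "Yes - there is a hairline crack near the upper mounting hole."},
    ]
}

with open("vision-finetune-train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")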
Handling Skipped Images
If your images are skipped during the training process, it could be due to several reasons, such as images containing CAPTCHAs, people, or faces, inaccessible URLs, large file sizes, or invalid modes or formats. Ensure your images meet the specified requirements to avoid these issues.

Uploading Large Files
For large training files, you can upload files up to 8 GB in multiple parts using the Uploads API. This is particularly useful for extensive datasets that exceed the 512 MB limit of the Files API.

Reducing Training Costs
To optimize training costs, you can set the detail parameter for an image to low, which resizes the image to 512 by 512 pixels and represents it by 85 tokens regardless of its size. This reduces the cost of training while maintaining the quality of the model.

Additional Considerations
To control the fidelity of image understanding, you can set the detail parameter of image_url to low, high, or auto for each image. This affects the number of tokens per image that the model sees during training and impacts the cost of training.
We are thrilled to see how you will leverage these new capabilities to create innovative and engaging AI applications. For more detailed information, please refer to our documentation on Azure OpenAI Service. Stay tuned for more updates and happy fine-tuning!
Ready to get started?
Learn more about Azure OpenAI Service
Watch this Ignite session about new fine-tuning capabilities in Azure OpenAI Service
Check out our How-To Guide for Fine Tuning with Azure OpenAI
Try it out with Azure AI Foundry

Share Your Experience with Azure AI and Support a Charity
AI is transforming how leaders tackle problem-solving and creativity across different industries. From creating realistic images to generating human-like text, the potential of large and small language model-powered applications is vast. Our goal at Microsoft is to continuously enhance our offerings and provide the best safe, secure, and private AI services and machine learning platform for developers, IT professionals and decision-makers who are paving the way for AI transformations. Are you using Azure AI to build your generative AI apps? We're excited to invite our valued Azure AI customers to share their experiences and insights on Gartner Peer Insights. Your firsthand review not only helps fellow developers and decision-makers navigate their choices but also influences the evolution of our AI products.
Write a Review: Microsoft Gartner Peer Insights
https://gtnr.io/JK8DWRoL0