WTD Analytics

Data Infrastructure and Analytics

Data Engineering and Analytics Agency. Databricks MVP | Databricks Partner

About us

A Databricks MVP and Databricks Partner, we provide analytics and data engineering implementation services.

Website
https://wtdanalytics.com
Industry
Data Infrastructure and Analytics
Company size
2-10 employees
Headquarters
Mumbai
Type
Privately Held
Founded
2024
Specialties
Data Engineering, Analytics, Data lakehouse, Databricks, and Data Infrastructure

Updates

  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What is the ai_summarize function in Databricks SQL, and how does it help us generate concise summaries of text data?

    What is ai_summarize?
    - Purpose: This function uses state-of-the-art generative AI to summarize text data within SQL queries.
    - Best For: Testing on small datasets (<100 rows) due to rate limiting in preview mode.

    How Does It Work?
    Syntax: ai_summarize(content[, max_words])

    Why Use It?
    - Saves time by summarizing long text.
    - Enhances productivity by providing quick insights from unstructured data.
    - Useful in scenarios like summarizing support tickets, product reviews, or meeting notes.

    Key Benefit: A scalable approach to text summarization directly integrated with SQL workflows.

    #WhatsTheData #DataEngineering #Databricks

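    A minimal sketch of how ai_summarize could be called from a notebook; the support_tickets table and description column are hypothetical placeholders.

      # Hypothetical example: summarize free-text ticket descriptions.
      # ai_summarize is a Databricks SQL AI function; preview rate limits may apply.
      summaries = spark.sql("""
          SELECT
              ticket_id,
              ai_summarize(description, 50) AS summary  -- cap each summary at ~50 words
          FROM support_tickets
          LIMIT 100                                     -- keep the sample small while in preview
      """)
      summaries.show(truncate=False)
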
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What is DLT in Databricks and How Can It Simplify Your Data Pipelines?

    Managing data can be tricky, but Delta Live Tables (DLT) in Databricks is here to help. Let's break down what it is and how it can make your life easier.

    What is DLT?
    DLT (Delta Live Tables) is a tool in Databricks that helps you easily build and manage data pipelines. It automates data tasks, ensuring your data is always clean, up to date, and ready for analysis. Whether you're working with large datasets or just a few tables, DLT simplifies the process.

    How does DLT work?
    - Simple Setup: Use basic SQL or Python to define your data pipelines. No complex coding required.
    - Automated Data Management: DLT takes care of cleaning, organizing, and updating your data without you having to lift a finger.
    - Built-in Data Checks: It ensures your data meets quality standards by running checks automatically.
    - Data Versioning: Easily track changes in your data and see how it has evolved over time.

    #DataEngineering #Databricks #WhatsTheData
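
    A minimal Python sketch of a DLT pipeline with an automated data check; the source path, table names, and expectation rule are hypothetical.

      import dlt
      from pyspark.sql.functions import col

      # Bronze: incrementally ingest raw order files (path is a placeholder).
      @dlt.table(comment="Raw orders ingested with Auto Loader")
      def raw_orders():
          return (
              spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .load("/Volumes/main/demo/raw_orders/")
          )

      # Silver: cleaned orders with a built-in quality check that drops bad rows.
      @dlt.table(comment="Orders with valid amounts only")
      @dlt.expect_or_drop("positive_amount", "amount > 0")
      def clean_orders():
          return dlt.read_stream("raw_orders").where(col("amount").isNotNull())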

  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What is Unity Catalog Tagging, and How Does It Help Us Track Data Ownership and Project Allocation in Databricks?

    Unity Catalog's tagging system simplifies data governance by allowing us to apply key-value pairs to data assets. These tags help track data ownership, sensitivity, and project allocation.

    How it helps:
    - Tags can be applied across catalogs, schemas, tables, views, and more.
    - Use Catalog Explorer to add and manage tags, or use SQL commands for automated tagging.
    - Tags improve searchability and organization while supporting cost management.

    Example in Databricks:
    In Databricks, tags help categorize datasets for better management. For example, a cost_center tag like Marketing helps track all costs related to the marketing department, enabling finance teams to allocate expenses accurately. Similarly, a sensitivity tag like High allows security teams to easily identify and apply extra protections to sensitive datasets.

    #DataEngineering #WhatsTheData #Databricks

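    A minimal sketch of tag management with SQL, mirroring the cost_center and sensitivity examples above; the catalog, schema, and table names are hypothetical, and the exact tag views available may vary by workspace version.

      # Apply key-value tags to a table with ALTER ... SET TAGS.
      spark.sql("""
          ALTER TABLE main.marketing.campaign_spend
          SET TAGS ('cost_center' = 'Marketing', 'sensitivity' = 'High')
      """)

      # Inspect applied tags through the information_schema tag views.
      spark.sql("""
          SELECT schema_name, table_name, tag_name, tag_value
          FROM main.information_schema.table_tags
          WHERE tag_name IN ('cost_center', 'sensitivity')
      """).show()
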
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    How to Set Up Backfilling Historical Data Alongside Streaming with Databricks Delta Live Tables

    Sometimes, historical data needs to be included in streaming tables without interrupting the current streaming pipeline. Backfilling ensures that historical data gets integrated seamlessly while keeping active data flows uninterrupted. This is particularly useful when data ingestion pipelines need to be updated to account for historical records or changes in data sources over time.

    How to Set Up Backfill in DLT:
    1. Define a streaming table in DLT: Set up a regular streaming table to receive live data streams.
    2. Implement the backfill function: Use the `@dlt.append_flow` decorator to define a function that backfills historical data into the streaming table.

    Benefits of Backfilling in DLT:
    - Keeps historical and live data synchronized.
    - No need to stop the current data streams.
    - Allows for streamlined integration of legacy data.

    Real-Life Example:
    Imagine you have a sales dashboard that tracks real-time orders from an online store. Last month, a new data source was added, but there's valuable historical sales data stored in an older database that wasn't yet integrated. Backfilling allows you to pull in this historical sales data without pausing or disrupting the live stream of new orders.

    #Databricks #DataEngineering #WhatsTheData

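    A minimal sketch of the backfill pattern with `@dlt.append_flow`, assuming hypothetical live and historical source paths; the backfill flow can be removed from the pipeline once the historical files have been processed.

      import dlt

      # Target streaming table that both flows append into.
      dlt.create_streaming_table("orders")

      # Flow 1: ongoing live ingestion (placeholder cloud storage path).
      @dlt.append_flow(target="orders")
      def live_orders():
          return (
              spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .load("/Volumes/main/demo/live_orders/")
          )

      # Flow 2: backfill of historical records from an older export location.
      @dlt.append_flow(target="orders")
      def backfill_orders():
          return (
              spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .load("/Volumes/main/demo/historical_orders/")
          )
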
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What is a Flow in Delta Live Tables (DLT), and How Does It Help Us Process Data Efficiently?

    A flow in Databricks DLT is essentially a streaming query that incrementally updates target tables by processing only new or changed data, which leads to faster execution and optimal resource use. This approach is ideal for scalable data engineering workflows.

    How Flows Optimize Our Data Processing:
    - Processes only new or modified records, avoiding full reprocessing.
    - Reduces system load on memory and CPU, making it ideal for high-volume streams.
    - Supports complex cases with explicit flows for tasks like merging multiple data sources or backfilling data.

    Real-Life Example: Customer Order Processing in E-commerce
    An e-commerce platform can use implicit flows to:
    - Filter orders over a certain amount to prioritize high-value customers.
    - Standardize customer names for consistent downstream analytics.
    - Update the orders table with only new or modified orders in real time.

    #Databricks #DataEngineering #WhatsTheData

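    A minimal sketch of an implicit flow: the streaming table's own query defines the flow, so each update processes only new or changed records. The upstream table and column names follow the hypothetical e-commerce example above.

      import dlt
      from pyspark.sql.functions import col, initcap

      # Defining a streaming table creates an implicit flow; DLT processes only
      # new records from the upstream table on each pipeline update.
      @dlt.table(comment="High-value orders with standardized customer names")
      def priority_orders():
          return (
              dlt.read_stream("raw_orders")                 # hypothetical upstream table
              .where(col("order_amount") > 500)             # keep high-value orders only
              .withColumn("customer_name", initcap(col("customer_name")))
          )
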
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    How to Use Kafka with Streaming Tables in Databricks SQL

    What?
    Kafka integration with streaming tables in Databricks SQL enables you to ingest and process real-time data streams from Kafka topics directly into Databricks for analysis.

    How?
    - Set up Kafka: Connect Databricks SQL to your Kafka topic.
    - Create a streaming table: Use SQL to create a streaming table that ingests data from the Kafka topic.
    - Process data: Continuously process and analyze the data as it streams in, using materialized views or other SQL operations.

    Real-Life Example
    Suppose you're tracking real-time transactions in an e-commerce application using Kafka. With streaming tables, you can ingest these transactions as they happen and generate real-time sales reports.

    #WhatsTheData #DataEngineering #Databricks

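    The post describes the Databricks SQL streaming-table route; below is a hedged PySpark/DLT sketch of the same idea, with a placeholder broker address and topic name (a real deployment would also need authentication options).

      import dlt
      from pyspark.sql.functions import col

      # Streaming table fed directly from a Kafka topic.
      @dlt.table(comment="Raw e-commerce transactions from Kafka")
      def ecommerce_transactions():
          return (
              spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker-1:9092")  # placeholder broker
              .option("subscribe", "transactions")                 # placeholder topic
              .option("startingOffsets", "earliest")
              .load()
              .select(
                  col("value").cast("string").alias("raw_event"),  # Kafka payload as text
                  col("timestamp"),
              )
          )
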
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    How Can We Manage Schema Inference and Evolution with Databricks Auto Loader?

    Schema inference and evolution in Auto Loader simplify managing data schemas over time, especially when working with dynamically changing datasets. Here's how Auto Loader handles schema detection, evolution, and unexpected data, all while keeping your data streams running smoothly.

    Schema Inference:
    - Automatically detects schemas when loading data.
    - Handles JSON, CSV, XML, Parquet, and Avro formats.
    - Saves schema history in the schema location.
    - Infers all columns as strings for untyped formats like JSON and CSV.

    Schema Evolution:
    - Detects and manages new columns as they appear.
    - Options to fail, rescue, or ignore new columns during schema evolution.
    - Default behavior is to stop the stream on encountering new columns and add them to the schema.

    Real-Life Example
    A retail company ingests JSON data for online orders. Initially, the schema includes order_id, customer_id, and order_date. Over time, new columns like coupon_code and delivery_time are added. With Auto Loader:
    - Schema inference detects new columns automatically.
    - Schema evolution adds coupon_code and delivery_time without manual intervention.
    - Unexpected data, like malformed records, is rescued for further analysis.

    #WhatsTheData #DataEngineering #Databricks

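    A minimal sketch of Auto Loader with schema inference and evolution enabled, following the retail orders example; all paths and the target table name are hypothetical.

      # With addNewColumns (the default), the stream stops when new columns such as
      # coupon_code appear, updates the schema, and picks them up on restart;
      # malformed values are rescued into the _rescued_data column.
      (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/Volumes/main/demo/_schemas/orders")
          .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
          .load("/Volumes/main/demo/orders_json/")
          .writeStream
          .option("checkpointLocation", "/Volumes/main/demo/_checkpoints/orders")
          .trigger(availableNow=True)
          .toTable("main.demo.online_orders"))
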
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What is Databricks Auto Loader, and How Can It Help You Improve Your Data Ingestion Process?

    Databricks Auto Loader is a tool that simplifies the process of bringing new data into your data lake. It automatically finds and loads files as they arrive, making it much easier to handle streaming data without a lot of manual work.

    Why Use Auto Loader?
    - It keeps an eye on your data storage and ingests new files as soon as they appear.
    - It can figure out the structure of your data (like Parquet or JSON) and adjust as your data changes.
    - It can manage large amounts of data without any hassle.

    Before Auto Loader:
    - You had to write custom scripts to manually load files.
    - There was a higher chance of missing data or making mistakes.

    After Auto Loader:
    - Files are ingested automatically without you lifting a finger.
    - The process is more reliable, and you spend less time on manual tasks.

    #Databricks #dataengineering #WhatsTheData

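    A minimal "after Auto Loader" sketch: one streaming query that picks up new files as they land, with hypothetical paths and table name.

      # Auto Loader watches the landing path and ingests new Parquet files
      # incrementally; the checkpoint tracks which files were already loaded.
      (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .option("cloudFiles.schemaLocation", "/Volumes/main/demo/_schemas/events")
          .load("/Volumes/main/demo/landing/events/")
          .writeStream
          .option("checkpointLocation", "/Volumes/main/demo/_checkpoints/events")
          .toTable("main.demo.events"))
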
  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    What are the Key Highlights of the Databricks November Updates?

    Here's a quick summary of the November updates in Databricks:
    - Databricks Runtime 16.0 is GA: Introduced JDK 17 as default, ended Hosted RStudio, and disabled DBFS library installations for better security.
    - Schema Evolution in MERGE: Automatic schema evolution is now supported during MERGE operations.
    - Reclustering for Liquid Tables: You can now force reclustering for improved table performance.
    - Identity Columns in Delta APIs: Python and Scala APIs now support identity columns.
    - New Functions Added: Functions like try_url_decode, zeroifnull, and nullifzero help handle data efficiently.
    - Query History for DLT Pipelines: Access and analyze query performance metrics directly from the DLT UI or the Query History page.
    - Emergency Access for Admins: Prevent lockouts by enabling password + MFA access for up to 20 users.
    - Predictive Optimization for Unity Catalog: Automates maintenance tasks like OPTIMIZE, VACUUM, and ANALYZE for managed tables.
    - Service Credentials in Unity Catalog: Simplified AWS IAM role configuration for secure and seamless integration.

    #Databricks #DataEngineering #WhatsTheData
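
    A small, hedged illustration of the new helper functions called out above; exact availability depends on your runtime or DBSQL version.

      # try_url_decode returns NULL instead of failing on malformed input;
      # zeroifnull/nullifzero simplify NULL and zero handling.
      spark.sql("""
          SELECT
              try_url_decode('price%3D100%26cur%3DUSD') AS decoded_query,
              zeroifnull(NULL)                          AS zero_instead_of_null,
              nullifzero(0)                             AS null_instead_of_zero
      """).show()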

  • WTD Analytics reposted this

    View profile for Vishal Waghmode

    Founder @ WTD Analytics | Databricks MVP & Partner | Data Engineering Consulting

    Databricks November Updates: Databricks Runtime 16.0 is Generally Available

    Databricks Runtime 16.0, powered by Apache Spark 3.5.0, introduces several new features, improvements, and breaking changes to enhance performance and developer experience. Let's dive into what's new and how it helps you.

    What is Databricks Runtime 16.0?
    Databricks Runtime 16.0 is the latest runtime environment for Databricks clusters, offering:
    - Upgrades to JDK 17 for better security and performance.
    - Support for advanced Delta Lake operations like automatic schema evolution.
    - Improved reliability and error reporting for Structured Streaming.

    Bug Fixes
    - Auto Loader now handles Avro files with empty schemas by rescuing these records.
    - Fixed incorrect timestamps with time zones containing a second offset in JSON, XML, or CSV outputs.

    Real-Life Example with DBR 16.0:
    Imagine a retail company using Databricks to manage its product inventory with a Delta Lake table. With automatic schema evolution, adding new fields (like discount_rate) to their incoming data can now be handled seamlessly during MERGE. By using OPTIMIZE FULL, they can recluster all records to improve query performance.

    #WhatsTheData #DataEngineering #Databricks

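    Hedged sketches of the two features highlighted in the example above; the inventory table and the updates source are hypothetical placeholders (the source could be a table or a temp view of incoming changes).

      # Automatic schema evolution during MERGE: new source columns such as
      # discount_rate are added to the target table as part of the merge.
      spark.sql("""
          MERGE WITH SCHEMA EVOLUTION INTO main.retail.inventory AS t
          USING updates AS s
          ON t.product_id = s.product_id
          WHEN MATCHED THEN UPDATE SET *
          WHEN NOT MATCHED THEN INSERT *
      """)

      # Force reclustering of all records in a liquid-clustered table.
      spark.sql("OPTIMIZE main.retail.inventory FULL")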
