Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand

Soumil S.

Sr. Software Engineer | Big Data & AWS Expert | Apache Hudi Specialist | Spark & AWS Glue| Data Lake Specialist | YouTuber

Published Nov 22, 2024

This AWS architecture keeps tables in sync across multiple formats (Hudi, Delta, Iceberg). It uses Apache XTable, AWS Lambda, and API Gateway to give you control over how and when your tables are synced. Let’s dive into the details of this architecture and explore how it works.

Overview of the Architecture

This setup syncs tables across three formats: Hudi, Delta, and Iceberg. It supports:

Scheduled syncs using CRON jobs.
Manual syncs triggered via an API Gateway.
Process-driven triggers, where another job or pipeline calls the same API whenever it needs a sync.

How It Works

CRON Configuration:

A CRON job is set up to point to a config.yaml file stored in an S3 bucket.
The CRON job triggers an AWS Lambda function, which reads the configuration and runs the Apache XTable sync command.
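
As a rough illustration of the "reads the configuration" step, the sketch below splits an S3 URI into bucket and key and copies the object to Lambda's writable /tmp directory. The function names, the `s3://` URI form, and the local path are assumptions made for this example; the actual code in the linked repository may be organized differently.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_config(uri: str, local_path: str = "/tmp/config.yaml") -> str:
    """Copy the config object to Lambda's writable /tmp directory."""
    # Imported lazily so split_s3_uri can be used without the AWS SDK installed.
    import boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

Keeping the URI parsing separate from the download makes the parsing easy to unit test without AWS credentials.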

Manual Sync:

Users or processes can initiate a manual sync by making a POST request to the API Gateway.
The API Gateway sends the request to a Lambda function, which runs the sync command with the specified configuration.
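
Because the same Lambda function can be invoked by a schedule or by API Gateway, the handler has to cope with two event shapes. The sketch below shows one way to do that, assuming an API Gateway proxy integration (payload delivered as a JSON string in `body`) and a hypothetical `config_path` field; neither the field name nor the helper names come from the original project.

```python
import base64
import json


def extract_config_path(event: dict) -> str:
    """Return the config location from a scheduled or an API Gateway event."""
    # Scheduled invocation: assume the scheduler passes the path as a top-level field.
    if "config_path" in event:
        return event["config_path"]

    # API Gateway proxy integration: the POST payload arrives as a JSON string.
    body = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        body = base64.b64decode(body).decode("utf-8")
    payload = json.loads(body)
    if "config_path" not in payload:
        raise KeyError("request must include 'config_path'")
    return payload["config_path"]


def api_response(status: int, message: str) -> dict:
    """Build a response in the shape API Gateway proxy integrations expect."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": message}),
    }
```

A handler could call `extract_config_path(event)`, run the sync, and return `api_response(200, "sync completed")`, returning a 4xx response when the path is missing.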

Serverless Scalability:

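AWS Lambda scales automatically with the number of sync requests, so there are no servers to size or manage. Keep in mind that a single Lambda invocation can run for at most 15 minutes, which may become a limit when syncing very large tables.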

Technical Details

Dockerized Lambda Function

We leverage Docker to bundle all necessary dependencies, Java libraries, and Python code into a single, reusable container image.

Dockerfile

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/Dockerfile

requirements.txt

Python Lambda Code

The Lambda function is written in Python and uses the JPype library to interact with Apache XTable's Java classes.

https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/lambda_function.py
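
For readers who have not used JPype, here is a minimal sketch of the general pattern: start a JVM inside the Python process with the XTable utilities JAR on the classpath, then call the `RunSync` entry point with the same `--datasetConfig` flag the XTable command line uses. The helper names and the `jar_path` parameter are assumptions for illustration; see the linked lambda_function.py for the author's actual implementation.

```python
def build_sync_args(config_path: str) -> list[str]:
    """Arguments for XTable's RunSync entry point (same flag as the CLI)."""
    return ["--datasetConfig", config_path]


def run_xtable_sync(config_path: str, jar_path: str) -> None:
    """Start a JVM inside the Python process and call RunSync.main()."""
    # Lazy import: needs the JPype1 package and a Java runtime in the image.
    import jpype

    # A JVM can be started only once per process; warm invocations reuse it.
    if not jpype.isJVMStarted():
        jpype.startJVM(classpath=[jar_path])

    run_sync = jpype.JClass("org.apache.xtable.utilities.RunSync")
    # JPype converts the Python list of strings to a Java String[] for main().
    run_sync.main(build_sync_args(config_path))
```

Calling the Java class in-process avoids spawning a separate `java -jar` subprocess on every invocation, which is one plausible reason to prefer JPype here.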

Testing the Setup

Step 1: Build the Docker Image

Step 2: Run the Docker Container

Step 3: Trigger a Lambda Function locally
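
One way to trigger the function locally is to POST a test event to the Lambda Runtime Interface Emulator that ships with the AWS Lambda base images. The sketch below assumes the container was started with port 9000 mapped to the container's port 8080, and it uses a hypothetical `config_path` event field; adjust both to match your own setup.

```python
import json
import urllib.request

# Default invoke path exposed by the Lambda Runtime Interface Emulator.
# Port 9000 assumes the container was started with `-p 9000:8080`.
LOCAL_INVOKE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invoke_request(event: dict, url: str = LOCAL_INVOKE_URL) -> urllib.request.Request:
    """Wrap a test event in a POST request for the local Lambda endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(event: dict) -> dict:
    """Send the event to the running container and decode the JSON reply."""
    with urllib.request.urlopen(build_invoke_request(event)) as resp:
        return json.loads(resp.read())
```

With the container running, `invoke_local({"config_path": "s3://my-bucket/config.yaml"})` would send the event to the function and return whatever the handler responds with.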

Why Choose This Architecture?

Flexibility: Sync tables automatically on a schedule or manually as needed.
Scalability: Built on AWS Lambda, the architecture adjusts seamlessly to workloads.
Ease of Use: Centralized configuration management with config.yaml in S3.
Future-Proof: Supports multiple table formats (Hudi, Delta, Iceberg), making it adaptable to evolving data needs.

Labs: https://github.com/soumilshah1995/xtable-sync-lambda

Conclusion

This architecture demonstrates how to combine the power of Apache XTable, AWS Lambda, and API Gateway for a robust table-syncing solution. Whether you need automated CRON jobs or manual sync triggers, this setup is a reliable and scalable choice.

Happy syncing!

#AWS #ApacheXTable #Serverless #Lambda #DataSync #CloudArchitecture
