Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand
Manage table syncing across multiple formats (Hudi, Delta, Iceberg) with this AWS architecture. The solution uses Apache XTable, AWS Lambda, and API Gateway to give you control over how and when your tables are synced. Let’s walk through the architecture and how it works.
Overview of the Architecture
This setup syncs tables across three formats: Hudi, Delta, and Iceberg. It supports:
Scheduled Syncs using CRON jobs.
Manual Syncs triggered via an API Gateway.
Process-Driven Triggers, where another job or pipeline requests a sync as soon as it needs one.
How It Works
CRON Configuration:
A CRON job is set up to point to a config.yaml file stored in an S3 bucket.
The CRON job triggers an AWS Lambda function, which reads the configuration and executes the Apache XTable sync command.
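To make the scheduled path concrete, here is a minimal sketch of how a Lambda function might locate and download config.yaml before running the sync. It is not the code from the linked repository. The bucket name, paths, and table name are placeholders, and the helper names split_s3_uri and download_config are hypothetical. The example config follows the dataset config layout described in the XTable documentation.

```python
from typing import Tuple
from urllib.parse import urlparse

# Illustrative config.yaml content in XTable's dataset config layout.
# The bucket, path, and table name are placeholders, not values from the article.
EXAMPLE_CONFIG = """\
sourceFormat: HUDI
targetFormats:
  - DELTA
  - ICEBERG
datasets:
  - tableBasePath: s3://my-bucket/warehouse/people
    tableName: people
"""


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key); hypothetical helper."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid s3://bucket/key URI: {uri!r}")
    return parsed.netloc, key


def download_config(uri: str, local_path: str = "/tmp/config.yaml") -> str:
    """Download the config to Lambda's writable /tmp directory."""
    import boto3  # imported lazily so split_s3_uri works without AWS dependencies

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path
```

Writing to /tmp matters here because it is the only writable path inside a Lambda execution environment.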
Manual Sync:
Users or processes can initiate a manual sync by making a POST request to the API Gateway.
The API Gateway sends the request to a Lambda function, which runs the sync command with the specified configuration.
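The manual path could be handled roughly as follows. This sketch assumes an API Gateway proxy integration, where the POST body arrives as a JSON string under the "body" key, and it assumes a request field named config_uri. Both the field name and the default URI are hypothetical, so check the repository's lambda_function.py for the payload it actually expects.

```python
import json

# Placeholder default; the article does not specify where the config lives.
DEFAULT_CONFIG_URI = "s3://my-bucket/configs/config.yaml"


def extract_config_uri(event: dict) -> str:
    """Return the config location from either an API Gateway or a direct event."""
    body = event.get("body")
    if isinstance(body, str) and body:
        payload = json.loads(body)  # API Gateway proxy: body is a JSON string
    elif isinstance(body, dict):
        payload = body
    else:
        payload = event  # scheduled or direct invocation: fields sit at the top level
    if not isinstance(payload, dict):
        raise ValueError("request payload must be a JSON object")
    return payload.get("config_uri", DEFAULT_CONFIG_URI)


def make_response(status: int, message: str) -> dict:
    """Build a response in the shape API Gateway proxy integrations expect."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": message}),
    }
```

Handling both event shapes in one function lets the same Lambda serve the scheduled and the on-demand path.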
Serverless Scalability:
AWS Lambda scales the number of concurrent invocations automatically, so several syncs can run at once without manual capacity management. Keep in mind that each invocation is limited to 15 minutes.
Technical Details
Dockerized Lambda Function
We leverage Docker to bundle all necessary dependencies, Java libraries, and Python code into a single, reusable container image.
Dockerfile
https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/Dockerfile
requirements.txt
Python Lambda Code
The Lambda function is written in Python and uses the JPype library to interact with Apache XTable's Java classes.
https://github.com/soumilshah1995/xtable-sync-lambda/blob/main/lambda_function.py
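For readers who have not used JPype, the general pattern looks roughly like this. It is a sketch under assumptions, not a copy of the linked lambda_function.py: it assumes the bundled XTable utilities jar is on disk inside the image, that the entry point is org.apache.xtable.utilities.RunSync, and that it accepts --datasetConfig, as the XTable documentation describes for its command-line runner. Verify these against the XTable version you package. The function names are hypothetical.

```python
from typing import List

# XTable's command-line entry point per its documentation; verify for your version.
RUN_SYNC_CLASS = "org.apache.xtable.utilities.RunSync"


def build_sync_args(config_path: str) -> List[str]:
    """Arguments equivalent to: java -jar <bundled jar> --datasetConfig <path>."""
    return ["--datasetConfig", config_path]


def run_sync(config_path: str, jar_path: str) -> None:
    """Start a JVM in-process (once) and call RunSync.main with the config path."""
    import jpype  # imported lazily; only needed when a sync actually runs

    # A warm Lambda container reuses the process, and a JVM can only be
    # started once per process, so guard the startup call.
    if not jpype.isJVMStarted():
        jpype.startJVM(classpath=[jar_path])

    run_sync_cls = jpype.JClass(RUN_SYNC_CLASS)
    java_args = jpype.JArray(jpype.JString)(build_sync_args(config_path))
    run_sync_cls.main(java_args)
```

Calling the Java class in-process avoids spawning a separate java subprocess on every invocation.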
Testing the Setup
Step 1: Build the Docker Image
Step 2: Run the Docker Container
Step 3: Trigger the Lambda function locally
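One way to perform the local trigger from Python is sketched below. It assumes the image is based on an AWS Lambda base image, which ships with the Runtime Interface Emulator, and that the container was started with host port 9000 mapped to the emulator's port 8080. The config_uri payload field is hypothetical, so adjust it to whatever the function expects.

```python
import json
import urllib.request
from typing import Any

# Invocation endpoint exposed by the Lambda Runtime Interface Emulator,
# assuming host port 9000 is mapped to container port 8080.
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invoke_request(payload: dict, url: str = RIE_URL) -> urllib.request.Request:
    """Build the POST request that carries the test event to the local container."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke_local(payload: dict, timeout: float = 900.0) -> Any:
    """Send the event and return the function's decoded JSON result."""
    with urllib.request.urlopen(build_invoke_request(payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical event; requires the container from Step 2 to be running.
    print(invoke_local({"config_uri": "s3://my-bucket/configs/config.yaml"}))
```

The generous timeout mirrors Lambda's 15-minute ceiling so a slow sync is not cut off by the client first.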
Why Choose This Architecture?
Flexibility: Sync tables automatically on a schedule or manually as needed.
Scalability: Built on AWS Lambda, the architecture scales with the number of sync requests, with no servers to manage.
Ease of Use: Centralized configuration management with config.yaml in S3.
Future-Proof: Supports multiple table formats (Hudi, Delta, Iceberg), making it adaptable to evolving data needs.
Labs: https://github.com/soumilshah1995/xtable-sync-lambda
Conclusion
This architecture demonstrates how to combine the power of Apache XTable, AWS Lambda, and API Gateway for a robust table-syncing solution. Whether you need automated CRON jobs or manual sync triggers, this setup is a reliable and scalable choice.
Happy syncing!
#AWS #ApacheXTable #Serverless #Lambda #DataSync #CloudArchitecture
Really good use case, but I have one question: since you are using Lambda, which has a 15-minute maximum runtime, don't you think that will be a bottleneck for bigger tables?
Great blog, Soumil S. I feel the Lambda function would be a good contribution to the XTable project too; it could help AWS users get started. Your thoughts? We can discuss how to package it, etc.
I like the usage of JPype! Nice blog, Soumil S.