"Run Apache XTable in AWS Lambda for background conversion of open table formats" In this new blog, the AWS team (Stephen Said, Matthias Rudolph) along with Dipankar Mazumdar, explores how Apache XTable, combined with the AWS Glue Data Catalog, enables background conversions between Apache Hudi, Apache Iceberg & Delta Lake residing on Amazon S3 based data lakes, with minimal to no changes to existing pipelines in a scalable and cost-effective way. 👉 Blog Link: https://lnkd.in/dhvXBeDP 🌟 Github repo: https://lnkd.in/dnEsU4Wx #dataengineering #lakehouse
Apache XTable (Incubating)
Data Infrastructure and Analytics
Menlo Park, CA 5,820 followers
Seamless cross-table interop between Apache Hudi, Delta Lake, and Apache Iceberg
About us
Apache XTable (Incubating) provides cross-table, omni-directional interoperability between the lakehouse table formats Apache Hudi, Apache Iceberg, and Delta Lake. XTable was formerly known as OneTable and was recently renamed. XTable is NOT a new or separate format; it provides abstractions and tools for translating lakehouse table format metadata. Choosing a table format is a costly evaluation. Each project has rich features that may fit different use cases. Some vendors use a table format as a point of lock-in. Your data should be UNIVERSAL! https://github.com/apache/incubator-xtable
- Website
- https://xtable.apache.org
- Industry
- Data Infrastructure and Analytics
- Company size
- 11-50 employees
- Headquarters
- Menlo Park, CA
- Type
- Partnership
- Founded
- 2023
- Specialties
- Data Lakehouse, Data Engineering, Lakehouse, Apache Iceberg, Apache Hudi, Delta Lake, Apache Spark, Trino, Apache Flink, and Presto
Locations
- Primary: Menlo Park, CA 94025, US
Updates
-
Apache XTable (Incubating) reposted this
NEW BLOG: Apache XTable with AWS Lambda and Glue for Lakehouse Interoperability 🎉

Apache XTable (Incubating) enables omni-directional interoperability between lakehouse table formats such as Apache Hudi, Apache Iceberg & Delta Lake. This gives you the flexibility to write data in the format of your choice (depending on your workloads) and then, as needed, do a lightweight metadata translation and query from any compatible engine.

In its current form, XTable is a lightweight standalone JAR that takes a source table format and translates its metadata into a target format. One of the questions I typically receive when talking about XTable is how/where to plug XTable into existing data pipelines. In this new Amazon Web Services (AWS) blog that I collaborated on with Stephen Said & Matthias Rudolph, we present a very practical approach that requires no changes to existing data pipelines.

Let's say you use a catalog like #AWS Glue to store different types of open table formats, which serves your analytical workloads. Your source table is a Hudi table, used for low-latency, write-heavy workloads. On the read side, you want to use Iceberg with specific compute engines. You also want to run this as a 'continuous process'. This is what is being done:

✅ The idea is to periodically scan the Glue catalog to identify tables requiring conversion.
✅ A 'detector' Lambda function scans the catalog for tables that need conversion & invokes the 'converter' function.
✅ A 'converter' Lambda function runs XTable to do the translation.
✅ XTable enables Incremental Sync, which translates only new, un-synced commits.
✅ An EventBridge schedule invokes the 'detector' Lambda on an hourly basis.

This approach of using XTable within the AWS ecosystem, without altering existing data pipelines, provides a seamless interoperability process. All of the code and infrastructure is available in the GitHub repo (link in comments). #dataengineering #softwareengineering
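Purely for illustration (this is not the code from the blog's repo), here is a minimal Python sketch of what such a 'detector' Lambda could look like: it scans a Glue database and invokes a 'converter' function for matching tables. The database name, converter function name, and table parameters (table_format, xtable_target) are all hypothetical.

```python
import json
import boto3

glue = boto3.client("glue")
lambda_client = boto3.client("lambda")

# Hypothetical names for illustration; the blog's repo uses its own conventions.
DATABASE_NAME = "lakehouse_db"
CONVERTER_FUNCTION = "xtable-converter"

def handler(event, context):
    """Scan the Glue Data Catalog and invoke the converter for tables flagged for conversion."""
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=DATABASE_NAME):
        for table in page["TableList"]:
            params = table.get("Parameters", {})
            # Assumption: source Hudi tables carry parameters marking them for Iceberg conversion.
            if params.get("table_format") == "hudi" and params.get("xtable_target") == "iceberg":
                payload = {
                    "database": DATABASE_NAME,
                    "table": table["Name"],
                    "location": table.get("StorageDescriptor", {}).get("Location"),
                    "target_format": "ICEBERG",
                }
                # Asynchronously invoke the 'converter' Lambda for this table.
                lambda_client.invoke(
                    FunctionName=CONVERTER_FUNCTION,
                    InvocationType="Event",
                    Payload=json.dumps(payload),
                )
    return {"status": "scan complete"}
```

An EventBridge schedule (hourly, as described above) would simply invoke this handler on a timer.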
-
Apache XTable (Incubating) reposted this
Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand
Blog: https://lnkd.in/ehgPfJhG
GH: https://lnkd.in/eGxwxFSr
Apache XTable (Incubating)
-
Apache XTable (Incubating) reposted this
AWS Architecture for Syncing Tables in Multiple Formats

Check out this reference architecture on AWS that allows you to sync your tables across three formats! Whether you prefer a scheduled sync via CRON, a manual trigger, or a push mechanism, this solution has you covered.

🔄 Key Features:
CRON Jobs: Set up with an input path pointing to the config.yaml file stored in S3.
Lambda Functions: The CRON schedule triggers Lambda, which pulls the config and runs the Apache XTable sync command.
Manual Trigger: Users can also trigger the sync process manually via an API Gateway; by passing a JSON body, the sync is initiated via Lambda.
Scalability: With AWS Lambda's serverless nature, your system can scale up or down as needed without worrying about infrastructure management.

This architecture is efficient, scalable, and flexible to fit your needs. Whether automatic or on-demand, syncing tables across formats has never been easier!

#AWS #Serverless #Lambda #DataSync #CloudArchitecture #API #Automation #ScalableSolutions Apache XTable (Incubating) Dipankar Mazumdar, M.Sc 🥑 Sagar Lakshmipathy Amazon Web Services (AWS) #lambda Zeta Global
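As a rough sketch of the Lambda piece (not the actual code from the linked GitHub repo), the handler below pulls config.yaml from S3 and shells out to the standalone XTable utilities JAR. The JAR path, bucket, and key are assumptions, and the Lambda runtime (or container image) is assumed to bundle Java alongside Python.

```python
import subprocess
import boto3

s3 = boto3.client("s3")

# Hypothetical locations; adjust to wherever the bundled XTable utilities JAR
# and the dataset config actually live in your deployment.
XTABLE_JAR = "/opt/xtable-utilities-bundled.jar"
CONFIG_BUCKET = "my-xtable-bucket"
CONFIG_KEY = "configs/config.yaml"
LOCAL_CONFIG = "/tmp/config.yaml"

def handler(event, context):
    """Download the dataset config from S3 and run one XTable sync."""
    # An API Gateway / manual trigger can override the config location via the JSON body.
    bucket = event.get("config_bucket", CONFIG_BUCKET)
    key = event.get("config_key", CONFIG_KEY)
    s3.download_file(bucket, key, LOCAL_CONFIG)

    # Run the standalone XTable utilities JAR against the downloaded config.
    # Assumes Java is available in the runtime/container image.
    result = subprocess.run(
        ["java", "-jar", XTABLE_JAR, "--datasetConfig", LOCAL_CONFIG],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(f"XTable sync failed: {result.stderr}")
    return {"status": "sync complete"}
```

The same handler can sit behind both the CRON/EventBridge schedule and the API Gateway manual trigger, since only the event payload differs.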
-
We kicked off the very first episode of 'Apache XTable in the Lakehouse' 🎉

In this episode, Dipankar Mazumdar goes over:
✅ An overview of Lakehouse architecture & open table formats
✅ The importance of interoperability in modern data systems
✅ A deep dive into XTable internals/architecture

If you missed this one, check out the recording on our YouTube channel.
Link 👉 https://lnkd.in/d-CZN2Sp
And subscribe! #dataengineering #lakehouse
-
So close! 🚀 We are just a few 🌟 away from hitting 1,000 on our Apache XTable GitHub repo. Let's make that 1K milestone happen! If you haven't already, show us some love 💙 GitHub: https://lnkd.in/geVGqRvE #lakehouse #dataengineering
-
Apache XTable (Incubating) reposted this
Apache XTable in Production with Microsoft Azure OneLake! 🎉

Since Apache XTable (Incubating) is still in its early stages, I often get asked how I see it being adopted, and how this open-source solution can be applied to different use cases with lakehouse table formats.

XTable started with the core idea of "interoperability": you should be able to write data in any format of your choice, whether it's Apache Iceberg, Apache Hudi, or Delta Lake. You can then bring any compute engine of your choice that works well with a particular format (performance- and integration-wise) & run analytics on top. Each of these formats shines in specific use cases depending on its unique features! So, based on your use case & technical fit (in your data architecture), you should be free to use anything without being married to just one.

On the query engine side (warehouse, lake compute), more & more vendors are now looking at integrating with these open formats. In reality, it is tough to have robust support for every single format. By robust I mean full write support, schema evolution, and compaction. And even if they do work with multiple formats, it is practically tough to build optimization capabilities for each of them.

So, to summarize, I see XTable having 2 major applications:
✅ On the compute side, with vendors using XTable as the interoperability layer
✅ Customers using multiple formats adding XTable to their existing data pipelines (say, an Apache Airflow operator or a Lambda function)

Yesterday's announcement on Fabric OneLake-Snowflake interoperability is a critical example that solidifies point (1). With this feature, users can use OneLake shortcuts to point to an Iceberg table written using Snowflake (or another engine), and it will present that table as a Delta Lake table, which works well within the Fabric ecosystem. This is powered by XTable 🚀

This abstraction at the user level will allow disparate data sources to work together as "one single copy", irrespective of the table format. Having an open table format is a start toward an open architecture, but you also need "interoperability" standards, because your tool stack/ecosystem can evolve over time. I elaborate on these aspects in the blog linked in the comments! #dataengineering #softwareengineering
-
Apache XTable in Production! 🎉

So amazing to see this come out for public preview. Customers can now use Azure OneLake shortcuts to simply point to an Apache Iceberg table written using Snowflake or another Iceberg writer, and OneLake does the magic of virtualizing that table as a Delta Lake table for broad compatibility across Microsoft Fabric engines.

This "metadata virtualization" is powered by Apache XTable, which takes the source Iceberg tables and atomically generates the corresponding Delta Lake metadata. It shows the robust capabilities of XTable in production use cases like this one, serving tens of thousands of users at scale.

"Interoperability" is key to a lakehouse architecture's openness, providing flexible access across multiple compute engines and catalogs.

Blog: https://lnkd.in/dr-h6KMn #dataengineering #lakehouse
-
Apache XTable (Incubating) reposted this
Apache XTable's architecture!

XTable is an omni-directional translation layer on top of open table formats such as Apache Hudi, Apache Iceberg & Delta Lake. It is NOT ❌ a new table format!

Essentially, what we are doing is this:
SOURCE ---> (read metadata) ---> XTable's Model ---> write into TARGET

We read the metadata from the SOURCE table format, put it into a unified representation & write the metadata out in the TARGET format.

* Note that with XTable we are only touching metadata, not the actual data files (such as #Parquet).

Let's break down its architecture. XTable's architecture consists of three key components:

1. Conversion Source:
✅ These are table-format-specific modules responsible for reading metadata from the source
✅ They extract information like schema, transactions & partitions and translate it into XTable's unified internal representation

2. Conversion Logic:
✅ This is the central processing unit of XTable
✅ It orchestrates the entire translation process, including initializing all components and managing sources and targets, among other critical things

3. Conversion Target:
✅ These mirror the source readers
✅ They take the internal representation of the metadata & map it to the target format's metadata structure

Read the paper (link in comments) for details. #dataengineering #softwareengineering
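To make the three components concrete, here is a minimal, purely illustrative Python sketch of the source → internal model → target flow. The real XTable codebase is Java and uses its own interfaces and class names; everything below is a simplified stand-in.

```python
from dataclasses import dataclass, field
from typing import List, Protocol

# Simplified stand-in for XTable's unified internal representation of table metadata.
@dataclass
class InternalTable:
    schema: dict
    partitions: List[str] = field(default_factory=list)
    commits: List[str] = field(default_factory=list)

class ConversionSource(Protocol):
    def read_metadata(self, table_path: str) -> InternalTable:
        """Read source-format metadata (schema, transactions, partitions) into the model."""
        ...

class ConversionTarget(Protocol):
    def write_metadata(self, table: InternalTable, table_path: str) -> None:
        """Map the internal model onto the target format's metadata layout."""
        ...

def sync(source: ConversionSource, targets: List[ConversionTarget], table_path: str) -> None:
    """Conversion logic: orchestrate one source read and N target writes.
    Only metadata is touched; the underlying Parquet data files are left alone."""
    internal = source.read_metadata(table_path)
    for target in targets:
        target.write_metadata(internal, table_path)
```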