Want to dive deep into Apache Hudi 1.0, the origins of Hudi, and the future of #datalakehouses and #datacatalogs? We have a webinar for you! Join this fireside chat next week with Ananth P. of Data Engineering Weekly and Vinoth Chandar, founder and CEO of Onehouse and Apache Hudi PMC Chair.
Bridging the Gap: A Database Experience on the Data Lake
Time for the party to start
Can you talk a bit more about the metadata table and how it's going to evolve? Most of the new indexing features are tied to it, so I'd like to learn more about this.
True databases have a lot of running parts: "much more than just a format."
This seems to be Databricks' strategy: build another layer and accept diverse file formats.
A DebeziumSource implementation is available for MySQL and Postgres. What's the plan for other databases like MongoDB?
Maybe the community would benefit from a deep dive into the metadata table (MT).
Hi
Hi
Attending
A simple issue: let's say I have a streaming pipeline running in Spark that is writing to a Hudi table, and now I want to insert data into this table by selecting some columns from another table, or back-populate it from a historical table. I have to create another Spark/Flink pipeline for this rather than firing a simple Trino SQL query: insert into hudi_table select x, y, z from historical_table. Another point: table maintenance operations like compaction, clustering, and cleaning would be very easy to schedule as SQL queries automated via Trino plus a workflow tool like Airflow/Dagster.
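For illustration, a minimal sketch of the SQL experience this comment is asking for; the table and column names (hudi_table, historical_table, x, y, z) are hypothetical, and Trino's Hudi connector does not support this today, so treat it as the desired workflow rather than a working example:

    -- Back-populate the Hudi table from a historical table in one statement,
    -- instead of building a separate Spark/Flink pipeline (names are hypothetical).
    INSERT INTO hudi_table
    SELECT x, y, z
    FROM historical_table;

    -- Maintenance (compaction, clustering, cleaning) would similarly be fired as
    -- SQL statements or procedure calls from Airflow/Dagster (exact syntax undefined today).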