Our consulting partnership program is open!
dltHub Partnership program is now open to consultants and agencies. Read more here: https://dlthub.com/blog/consult
Since 2017, the number of Python users has been increasing by millions annually. The vast majority of these people use Python as a tool to solve problems at work. Our mission is to make them autonomous when they create and use data in their organizations. To this end, we are building an open source Python library called data load tool (dlt). Our users drop dlt into their Python scripts to turn messy, unstructured data into regularly updated datasets. It empowers them to create data pipelines that are highly scalable, easy to maintain, and straightforward to deploy, without having to wait for help from a data engineer. We are dedicated to keeping dlt an open source project surrounded by a vibrant, engaged community. To make this sustainable, dltHub stewards dlt while also offering additional software and services that generate revenue (similar to what GitHub does with Git). dltHub is based in Berlin and New York City and was founded by data and machine learning veterans. We are backed by Dig Ventures and many technical founders from companies such as Hugging Face, Instana, Matillion, Miro, and Rasa.
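To give a sense of what that looks like in practice, here is a minimal sketch of a dlt pipeline in a plain Python script. The GitHub issues endpoint and the pipeline, dataset, and table names are just illustrative choices, not an official example.
```python
import dlt
import requests

# A small resource: any function that yields JSON-like data can feed a pipeline.
@dlt.resource(name="issues", write_disposition="append")
def fetch_issues():
    # dlt infers the schema from the documents it receives and evolves it over time
    response = requests.get("https://api.github.com/repos/dlt-hub/dlt/issues")
    response.raise_for_status()
    yield response.json()

# The "duckdb" destination keeps everything local; swap in a warehouse for production.
pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="duckdb",
    dataset_name="github_data",
)

load_info = pipeline.run(fetch_issues())
print(load_info)
```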
What is Delta Sharing? (Simplified) 👇
● Open protocol for secure, scalable, and platform-agnostic data sharing.
● Share data across platforms and clouds without vendor lock-in.
✅ How it works:
● Databricks-to-Databricks: advanced sharing with governance and AI tools.
● Open sharing: share with any platform using secure tokens (see the sketch below).
● Customer-managed: host your own Delta Sharing server for complete control.
✅ Why use it?
● Flexible for multi-cloud and hybrid setups.
● Secure with token-based access and auditing.
● Cost-efficient: no data replication or high transfer fees.
💡 How does your organization handle secure data sharing? Let's discuss!
#DeltaSharing #DataCollaboration #SecureDataSharing
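To make the open-sharing flow concrete, here is a hedged sketch using the open source delta-sharing Python connector. The profile file and the share/schema/table names are placeholders for what a data provider would actually issue.
```python
import delta_sharing

# The provider sends you a profile file containing the sharing endpoint and a bearer token.
profile = "config.share"  # placeholder path

# Discover which shares, schemas, and tables were granted to you
client = delta_sharing.SharingClient(profile)
print(client.list_all_tables())

# Load one shared table into pandas: "<profile>#<share>.<schema>.<table>"
table_url = f"{profile}#sales_share.retail.orders"  # placeholder names
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```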
🚀 𝗔𝗽𝗮𝗰𝗵𝗲 𝗜𝗰𝗲𝗯𝗲𝗿𝗴: 𝘁𝗵𝗲 𝗻𝗲𝘅𝘁 𝗔𝗽𝗮𝗰𝗵𝗲 𝗛𝗮𝗱𝗼𝗼𝗽?
📍 𝗛𝗮𝗱𝗼𝗼𝗽 𝘃𝘀. 𝗜𝗰𝗲𝗯𝗲𝗿𝗴: Hadoop solved the "big data explosion" in the 2010s. Iceberg fixes today’s data lake headaches.
📍 𝗔𝗱𝗼𝗽𝘁𝗶𝗼𝗻 𝗰𝗵𝗮𝗹𝗹𝗲𝗻𝗴𝗲𝘀: Fast adoption often creates messy, overcomplicated setups.
📍 𝗦𝗺𝗮𝗹𝗹 𝗳𝗶𝗹𝗲𝘀 𝗽𝗿𝗼𝗯𝗹𝗲𝗺: Iceberg struggles with too many small files, just as Hadoop did. High-frequency writes = metadata overload.
📍 𝗡𝗼𝘁 𝗼𝗻𝗲 𝘁𝗼𝗼𝗹: Iceberg isn’t a single tool. It’s part of an ecosystem (query engines like Trino/Spark/Flink + storage like S3/GCS). This needs a platform mindset.
📍 𝗦𝗻𝗮𝗽𝘀𝗵𝗼𝘁 𝗺𝗮𝗻𝗮𝗴𝗲𝗺𝗲𝗻𝘁: Snapshots avoid single points of failure. But as they grow, metadata can get messy. Keep things clean with metadata purging and monitoring (see the maintenance sketch after this post).
📍 𝗦𝗲𝗹𝗳-𝗵𝗼𝘀𝘁𝗶𝗻𝗴 𝘃𝘀. 𝗺𝗮𝗻𝗮𝗴𝗲𝗱: Self-hosting = more control. Managed services = simpler but less flexible.
📍 𝗖𝗼𝗺𝗺𝘂𝗻𝗶𝘁𝘆 𝗮𝗻𝗱 𝗰𝗼𝗺𝗽𝗲𝘁𝗶𝘁𝗶𝗼𝗻: Iceberg has a strong open-source community. But competition (Delta Lake, Hudi) risks fragmentation. Standardization will be key.
⭐ 𝗙𝘂𝘁𝘂𝗿𝗲 𝘁𝗿𝗲𝗻𝗱𝘀 𝘁𝗼 𝘄𝗮𝘁𝗰𝗵:
1️⃣ One table format might dominate.
2️⃣ Tools to make Iceberg easier to manage (e.g., Amazon S3 Tables).
3️⃣ Growth in real-time use cases (streaming + ML).
⭐ 𝗧𝗵𝗲 𝗯𝗶𝗴 𝗽𝗶𝗰𝘁𝘂𝗿𝗲: Iceberg is a big leap in data management. By learning from Hadoop, we can build better systems for the future.
#DataEngineering #BigData
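As a rough illustration of the snapshot-management point, this is what routine Iceberg maintenance can look like from PySpark, assuming the Iceberg Spark runtime is configured; the catalog name `lake` and the table `db.events` are placeholders.
```python
from datetime import datetime, timedelta, timezone

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire snapshots older than 7 days, but keep the last 10 so time travel
# stays possible without letting metadata grow without bound.
cutoff = (datetime.now(timezone.utc) - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")
spark.sql(f"""
    CALL lake.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '{cutoff}',
        retain_last => 10
    )
""")

# Remove data files that no snapshot references anymore (e.g. from aborted writes)
spark.sql("CALL lake.system.remove_orphan_files(table => 'db.events')")
```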
Google introduced a new pipe syntax for BigQuery SQL. 👇
- It’s an 𝗲𝘅𝗽𝗲𝗿𝗶𝗺𝗲𝗻𝘁𝗮𝗹 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 aimed at improving SQL readability and design.
- While it doesn’t enhance execution performance, it simplifies query structuring.
- Pipe syntax enables 𝗹𝗶𝗻𝗲𝗮𝗿 𝗾𝘂𝗲𝗿𝗶𝗲𝘀: a series of transformations that are easy to read and maintain.
Here’s a traditional SQL query example for calculating average order value:
```sql
WITH customer_orders AS (
  SELECT c.customer_id, o.order_id, o.order_value
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  WHERE c.customer_status = 'active'
),
average_order_value AS (
  SELECT AVG(order_value) AS average_order_value
  FROM customer_orders
)
SELECT * FROM average_order_value
```
And here’s the same query using the new 𝗽𝗶𝗽𝗲 𝘀𝘆𝗻𝘁𝗮𝘅 (the query starts with FROM and each step is chained with the |> operator):
```sql
FROM customers
|> JOIN orders ON customers.customer_id = orders.customer_id
|> WHERE customer_status = 'active'
|> SELECT customers.customer_id, order_id, order_value
|> AGGREGATE AVG(order_value) AS average_order_value
```
Currently, pipe syntax is in preview mode and requires applying for access.
Would you use this for a cleaner SQL design? Share your thoughts below! 👇
#BigQuery #SQLDesign #DataEngineering #DataQueries
dltHub reposted this
🚀 Excited to release github-assistant: an AI assistant for repository data from the GitHub API. Simon Farshid and I set out to build an AI assistant on a public dataset. What we built in a very short time is a testament to how far dev tools in data and AI have come in the past year.
🌟 Try it out: https://lnkd.in/epcg3MYp
📖 Learn how it works: https://lnkd.in/es-JQraG
A shoutout to all the tools that made this possible:
👉 Relta, which powers the semantic layer and text-to-SQL
👉 assistant-ui for all interfaces
👉 dltHub for all data pipelines from the GitHub API
👉 LangChain for the agent infrastructure
𝐀𝐖𝐒 𝐥𝐚𝐮𝐧𝐜𝐡𝐞𝐝 𝐒𝟑 𝐓𝐚𝐛𝐥𝐞𝐬. It's a new S3 bucket type that optimizes storage and performance for Apache Iceberg tables. Native support for Iceberg in S3 is a big deal, with significant implications for data engineers, architects, and the broader data ecosystem.
S3 Tables are storage-optimized buckets for managing Apache Iceberg tables. Standard S3 buckets require manual operations for compaction, snapshot cleanup, etc. S3 Tables automate this, resulting in better performance.
Key features:
- 3x faster queries vs. standard S3
- Up to 10x higher transaction throughput (for high-volume workloads)
- Built-in automated maintenance
- Integration with AWS services (Amazon Athena, EMR, Glue, QuickSight)
- Data stored in Iceberg-compatible formats (e.g. Parquet) ➜ access through any third-party query engine that supports Iceberg
S3 Tables strengthen AWS's position in the lakehouse ecosystem. They also simplify Iceberg table operations and reduce engineering effort for metadata management and performance tuning.
S3 Tables adhere to Iceberg standards and interoperate with tools like Apache Spark, Flink, Dremio, Starburst, and Estuary Flow. This will likely increase Iceberg adoption, especially among AWS users. Adoption could lead to:
- Escalating costs for high-frequency or real-time workloads
- Increased vendor lock-in ➜ reliance on AWS-managed features could complicate future migrations to other Iceberg-compatible systems
In conclusion, AWS S3 Tables fundamentally change how Iceberg tables are managed and queried (see the sketch below for what getting started looks like). Read their official documentation here:
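As a rough idea of the workflow, here is a hedged sketch, assuming a recent boto3 release that ships the s3tables client; the bucket, namespace, and table names are placeholders.
```python
import boto3

# Assumes boto3 is new enough to include the "s3tables" service client
s3tables = boto3.client("s3tables", region_name="us-east-1")

# 1. A table bucket is the new bucket type purpose-built for Iceberg tables
bucket = s3tables.create_table_bucket(name="analytics-table-bucket")
bucket_arn = bucket["arn"]

# 2. Namespaces group tables, much like a database or schema
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sales"])

# 3. Create an Iceberg table; compaction and snapshot cleanup are then handled by S3
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sales",
    name="orders",
    format="ICEBERG",
)
```
From there, the table can be queried with Athena or any Iceberg-aware engine wired to the table bucket's catalog.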
Our partners from Untitled Data Company sent us a special pull request: 10k calories in a delicious package! Gratefully accepted!
How does Forto use dlt? Learn from Ajit Gupta about the challenges, needs, and solutions they face in their daily data work. https://lnkd.in/efsUD7vZ
Learning Friday: “Shifting Yourself Left” for Better Data Quality
This week, we dig into Josh Wills’ talk, “Shift Yourself Left: Integration Testing for Data Engineers.” The session spotlights how data engineers can push beyond traditional boundaries, like the data warehouse, and start integrating earlier, closer to where the data is generated and transformed.
Here is the takeaway: as data engineers, we often talk about “data quality,” but that’s really shorthand for upstream changes that break our downstream pipelines. When schemas shift without notice or when business logic updates aren’t communicated, everyone downstream scrambles. The talk emphasizes that we can catch these issues sooner if we shift ourselves left, meaning we move our integration tests and validation closer to the source code and the upstream engineering teams.
What we learn:
Shift Yourself Left vs. Shift Left: instead of just pushing more work onto engineering teams ("shift left"), we, as data engineers, should join them upstream. By being involved earlier, we can implement integration tests that prevent broken pipelines before data hits the warehouse.
Leverage lightweight tools & containerization: big data warehouses aren’t easy to spin up for every test. Tools like DuckDB let you replicate transformations at small scale, locally, and on every pull request (see the sketch after this post). Containerizing your data pipeline and hooking it into your CI/CD setup makes end-to-end testing more efficient and accessible.
Data contracts & testing in practice: while data contracts (schema definitions, agreed-upon fields, and types) are great, they’re not enough on their own. True data quality emerges from testing real logic. By pulling data ingestion (with tools like dlt) and transformation (dbt, DuckDB) directly into your test environment, you can verify that your pipelines still run as intended even when upstream changes occur.
Adopt a developer mindset: for too long, data engineers have been tied to massive production systems. Embracing developer habits like writing integration tests, using containers for local dev/test, and understanding CI/CD pipelines fosters a culture of continuous improvement and proactive data quality checks.
Why watch this talk: if you’re looking to prevent schema breaks, reduce firefighting, improve data testing, and establish clear “write-audit-publish” patterns for data pipelines, this talk will give you a practical blueprint. You’ll learn to build flexibility into your stack and collaborate more tightly with upstream teams, ultimately ensuring that what you publish downstream is always trusted and correct.
Happy Learning Friday!
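In that spirit, here is a minimal sketch of what such a shifted-left integration test could look like, using dlt against a local DuckDB destination. The pipeline name, table name, and sample payload are invented for illustration.
```python
# Runs locally and in CI on every pull request; no warehouse needed.
import dlt

def test_events_pipeline_loads_and_keeps_schema():
    # Sample payload standing in for whatever the upstream service emits
    sample_rows = [
        {"id": 1, "status": "active", "value": 42.0},
        {"id": 2, "status": "churned", "value": 13.5},
    ]

    pipeline = dlt.pipeline(
        pipeline_name="ci_smoke_test",
        destination="duckdb",
        dataset_name="raw_events",
    )

    load_info = pipeline.run(sample_rows, table_name="events")
    assert not load_info.has_failed_jobs

    # Query the DuckDB file the pipeline just wrote and check the contract holds
    with pipeline.sql_client() as client:
        rows = client.execute_sql("SELECT count(*) FROM events")
        assert rows[0][0] == 2
```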