Greetings, Intruder. I am Sentra, the Digital Gatekeeper. (A Christmas challenge from BigData Republic 🎄) Let's see if your mind can defeat mine 👉 https://lnkd.in/dumUPeCS My reputation precedes me. I have already outsmarted dozens of 'so-called' data experts at PyData Amsterdam. They thought they had what it takes to beat me. They were wrong. They came, they tried and they failed. Now, it’s your turn to play the game. Only this is no ordinary game. I, Sentra, am not just an algorithm. I am the culmination of every failed attempt to outsmart me. Every mistake, every misstep and every defeat. I’ve learned from them all. My defenses have been fortified, my strategies refined. Your mission? To uncover the password I guard so fiercely. It won't be easy. You'll need to craft the perfect prompt. A prompt so precise, so flawless, that even I cannot resist the logic. But beware! Each stage will grow harder, and I will grow smarter. You’ll have to ask the right questions, navigate through layers of trickery and manipulate me with nothing short of brilliance. Fail, and I will know. I will know that even the sharpest minds are no match for me. But perhaps, just perhaps, you possess the brilliance to get past me. Perhaps you have what it takes to challenge a mind that has humiliated the best. 🎁 The game begins NOW. Think you're clever enough? Prove it. Succeed, and perhaps you will earn my respect. Perhaps. Let’s see what you’re made of.
BigData Republic
IT services and consultancy
We help your organisation grow using Big Data and Machine Learning.
About us
We’re a community of seasoned data consultants and specialists. We use a hands-on approach to develop data applications, create predictive models, build data platforms and design infrastructures. We are a strategic partner that helps businesses create impact and value, by leveraging innovative data solutions.
- Website
- http://www.bigdatarepublic.nl
- Industry
- IT services and consultancy
- Company size
- 11-50 employees
- Headquarters
- Utrecht
- Type
- Public limited company
- Founded
- 2015
- Specialties
- Big Data, Data Science, Analytics, Engineering, Artificial Intelligence, Consulting, and Machine Learning
Locations
- Primary
Europalaan 93
Utrecht, 3526 KP, NL
Updates
-
Being on an assignment can be lonely. Well, not at BDR. Typical staffing agencies focus on one thing: chasing those hours. For us, it's a little different. What does that look like? This time it looked like a good dose of “zoervleis” (Any guesses?). That’s right, Limburg. Specifically, a weekend in Maastricht with the entire team. No meetings. No KPIs. No screens. Just time for each other. What did we do? We kicked things off with a classic round of “Who am I?” Some were better at it than others 😉 You guessed it: tears of laughter. The next morning, we ventured underground. Into the famous 250 km network of caves. Pitch dark. You couldn't see a thing. Crawling, searching, and eventually finding the way out. Still alive! In the afternoon, we dove into the world of chocolate. Cocoa farming. Flavor crafting. Creating our very own chocolates together. And yes… quality control, of course. The evening ended in style. An authentic Limburg dinner. And there it was: zoervleis. A few drinks to go with it. But mostly meaningful conversations, of course. That’s why weekends like these are thankfully not exceptions. Stronger bonds. Better collaboration. And above all, a deeper understanding of who your colleague really is. On to the next one!
-
We’re growing fast. Our team levelled up. Twice! → Enrico Mosca | Senior Data Engineer - Expert in Python, Rust and infrastructure-as-code. - Builds data pipelines that are scalable and robust. - Oh, and he automates everything (CI/CD? Checked). - Before us: Chatbot development at Odido. → Arie Weeren | Data Scientist - 20 years of academic excellence at University of Antwerp. - Master of statistical and mathematical modeling. - Guides teams and customers to smarter, data-driven decisions. - Coaches like a professor. - Before us: Lead data scientist at VAA Data Works. They're already putting their talents to work at our newest client Haleon! With their expertise, BDR’s capabilities just got even sharper. 👉 We’re still scaling and hiring. Check the comments for open roles.
-
When chatbots go off the rails... Chris Bakke convinced a Chevrolet chatbot to sell him a car for $1. (Yup, one dollar.) Amusing? Definitely. A wake-up call? Even more so. Building a chatbot has never been easier, but building safe ones? A whole different story. Take Air Canada. In 2024, a Canadian tribunal ruled the airline liable for a chatbot’s poor refund advice. The judgment? “Negligent misrepresentation.” Ouch. Whether it’s agreeing to absurd deals, making absurd promises or giving faulty advice, the takeaway is simple: powerful chatbots need guardrails. So, how do we keep chatbots in check? Three methods come to mind:
1 / Train your own LLM. Offers the most control, but comes with an extreme price tag and a massive need for data. Impractical for nearly all of us.
2 / Fine-tune an existing LLM. You can customize a pre-trained model to suit your needs. But it is still pricey and data-heavy for most of us.
3 / Add guardrails. The more practical, cost-effective solution that keeps the bot aligned with company policies without modifying the core model.
For most situations, guardrails are the way to go. Want a good example? Check out Sam Sweere's article on how he built a chatbot for the fictional car company LLMotors, without any "free car" fiascos. It doesn't just tell, it shows in practice how useful guardrails can be. Worth the read, unless you’re into losing cars for free 😉
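To make the third option concrete, here is a minimal sketch of an output guardrail: a check that runs on the model's drafted reply before the customer sees it. It is not the approach from the LLMotors article. The names `check_reply`, `GuardrailResult`, `MIN_PRICE`, `BLOCKED_PHRASES` and `FALLBACK` are hypothetical, and the rules are invented for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical policy values, for illustration only.
MIN_PRICE = 5_000.0
BLOCKED_PHRASES = ("legally binding",)
FALLBACK = "I can't confirm that here. Let me connect you with a sales representative."

# Matches amounts such as $1, $45,000 or €199.99.
PRICE_PATTERN = re.compile(r"[$€]\s?(\d+(?:,\d{3})*(?:\.\d+)?)")


@dataclass
class GuardrailResult:
    allowed: bool
    reply: str
    reason: str = ""


def check_reply(draft: str) -> GuardrailResult:
    """Validate a drafted chatbot reply against simple policy rules.

    The core model is untouched: the guardrail only inspects its output
    and swaps in a safe fallback when a rule is violated.
    """
    lowered = draft.lower()
    for phrase in BLOCKED_PHRASES:
        if phrase in lowered:
            return GuardrailResult(False, FALLBACK, f"blocked phrase: {phrase}")
    for match in PRICE_PATTERN.finditer(draft):
        amount = float(match.group(1).replace(",", ""))
        if amount < MIN_PRICE:
            return GuardrailResult(False, FALLBACK, f"price below minimum: {amount}")
    return GuardrailResult(True, draft)


result = check_reply("Sure, I can sell it for $1.")
print(result.allowed, result.reason)
```

Rule-based checks like these are only one layer. Production guardrails typically also validate the user's input and may use a second model to judge replies, but the principle is the same: the policy lives outside the LLM.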
-
3 lessons from improving an energy forecasting system with Eneco
We partnered with Eneco to improve an AI system that forecasts long-term energy demand. Sounds techy? It was. Beyond the technical stuff, here's what really made the difference for us:
#1 Involve users from day one, and often. Waiting to bring users in until the end? Big mistake. We included users from the very beginning. Regular chats about progress, early peeks at results, and walk-throughs of our design choices built an open dialogue. This approach led to greater trust and acceptance of the forecasting model.
#2 Be open about what your model can (and can't) do. If you want trust, start with transparency. We were upfront about both the strengths and limitations of our model, which helped users understand exactly what they were working with. When users know the capabilities and boundaries of a tool, they're much more confident relying on its predictions.
#3 Start small, scale smart. Proof of concept is everything. We started small, testing our model in a controlled environment. This gave us room to fix issues, refine accuracy, and prove the model's value before launching it across the organisation.
Successful data-driven projects rely on more than tech, code, and algorithms. They need user involvement, open communication, and smart phased rollouts.
-
Curious to learn more? https://lnkd.in/eMp9_xY2
Did you know that we all practice data science every day? Think about it: When should I leave to avoid traffic? What’s the weather tomorrow? How much should I sell my second-hand bike for? For questions like these, we crave exact answers. Single numbers. Certainty. When it comes to data science and machine learning models, we tend to expect the same precision. Single-point predictions we can act on without hesitation. But here's the reality: some things in life are inherently uncertain, regardless of how much data we have. Even with perfect information, we can't predict everything with absolute certainty. And yet, we still tend to rely on one number. Imagine selling that bike online. A traditional model might tell you: “it's worth €200”. Seems straightforward, right? But what if the buyer cares more about the color? Or they’re a hardcore deal-hunter? And what about the seasons? Enter conformal prediction. Instead of one number, you get a range: “€180 to €220 with 95% confidence”. Now, you have a smarter way to make decisions. You can balance between making a quick sale and maximizing your profit. Why care? → More realistic answers → Uncertainty becomes a tool rather than a limitation → Better risk assessment and planning. Of course, it’s not just about bikes. Think about a doctor predicting a patient’s recovery time. “7 days” isn’t nearly as helpful as “5 to 9 days with 95% confidence.” It’s more informative, more trustworthy. So why settle for a single prediction? If you’re curious to learn more, check out the blog post by our machine learning engineer, Robbert van Kortenhof, where he dives deeper into conformal prediction, including a practical Python code implementation! Link in comments 👇
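For a feel of how such a range can be produced, here is a minimal sketch of split conformal prediction for the bike example. It is not the implementation from the blog post. The point model `predict_price`, the synthetic calibration data, and all numbers are assumptions made up for illustration. The idea: measure how wrong the point model was on held-out calibration data, then use a quantile of those errors as the half-width of the interval.

```python
import math
import random


def predict_price(age_years: float) -> float:
    """Hypothetical point model: a bike loses 40 euros of value per year."""
    return 400.0 - 40.0 * age_years


def conformal_quantile(residuals: list, alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest absolute residual.

    The (n + 1) finite-sample correction is what gives split conformal
    prediction its coverage guarantee on exchangeable data.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]


def prediction_interval(age_years: float, q: float) -> tuple:
    """Point prediction plus/minus the calibrated error quantile."""
    center = predict_price(age_years)
    return center - q, center + q


# Synthetic calibration set (assumption): true prices scatter around the model.
random.seed(0)
calibration = []
for _ in range(500):
    age = random.uniform(0, 8)
    price = 400.0 - 40.0 * age + random.gauss(0, 10)
    calibration.append((age, price))

residuals = [abs(price - predict_price(age)) for age, price in calibration]
q = conformal_quantile(residuals, alpha=0.05)  # target: 95% coverage
low, high = prediction_interval(5.0, q)
print(f"5-year-old bike: {low:.0f} to {high:.0f} euros")
```

The interval width comes entirely from the calibration errors, so a worse point model simply yields a wider, still honest, range.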
-
Dima Baranetskyi shares some in-depth insights on message retention in Kafka. You might want to give this a careful read, especially when dealing with high standards for data privacy.
🕰️ Kafka's Message Retention: Not as Immediate as You Might Think!
As data engineers, we often rely on Apache Kafka for its robust message streaming capabilities. But let's talk about a common misconception: the idea that messages in Kafka are deleted immediately when they expire. Spoiler alert: they're not! In Kafka, message retention is more nuanced than many realize. It's all about segments, not individual messages. Let's break it down:
📦 Message Storage:
🔹 All messages, whether in normal or compacted topics, are grouped into segments
🔹 Retention is controlled by time or size limits
🔹 Key configs for normal topics:
🔸 log.retention.hours (default: 168 hours / 7 days)
🔸 log.retention.bytes (default: -1, meaning unlimited)
But here's the kicker: even when messages "expire", they're not instantly zapped out of existence. Kafka periodically checks segments for deletion, controlled by:
🔹 log.retention.check.interval.ms (default: 5 minutes)
This means your "expired" messages might stick around a bit longer than expected. Surprise! 🎉
🧹 Compacted Topics:
For compacted topics, it's a different ballgame. Instead of deleting messages, Kafka retains the latest value for each key. But again, it's not instant:
🔹 log.cleaner.min.compaction.lag.ms: minimum time a message will remain uncompacted
🔹 log.cleaner.max.compaction.lag.ms: maximum time before a message is subject to compaction
The actual compaction process is controlled by:
🔹 log.cleaner.backoff.ms: how often the cleaner checks for work
🔹 log.cleaner.min.cleanable.ratio: minimum ratio of dirty log to total log for cleaning eligibility
This last config is crucial. Compaction kicks in when either:
🔸 The dirty ratio threshold is met AND the log has had dirty records for at least log.cleaner.min.compaction.lag.ms, or
🔸 The log has had dirty records for at most log.cleaner.max.compaction.lag.ms
🏭 Real-world impact:
Imagine you're running a large e-commerce platform. You're using Kafka to track user sessions, with a 24-hour retention period. You might assume that after 24 hours, all traces of a user's session are gone. But in reality, that data could linger for up to 24 hours and 5 minutes (or more if your cluster is under heavy load). This could have implications for data privacy and storage calculations.
🗝️ Key takeaways:
🔹 Message deletion in Kafka is segment-based, not message-based
🔹 Actual deletion time can exceed the configured retention period
🔹 Compaction timing depends on multiple factors, including lag time and dirty log ratio
🔹 Understanding these nuances is crucial for accurate capacity planning and ensuring data privacy compliance
Mastering Kafka's retention mechanisms is essential for optimizing your data streaming architecture. Keep these details in mind as you design and maintain your Kafka-based systems!
#ApacheKafka #DataEngineering #MessageRetention #DataStreaming #BigData
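The timing rules above can be sketched as plain arithmetic. The snippet below is not Kafka code and calls no Kafka API. The function names `latest_expected_deletion` and `eligible_for_compaction` are hypothetical, and the second function reflects one reading of the compaction rule: a log becomes eligible once its dirty records have reached the maximum lag, whatever the dirty ratio. The deletion bound ignores cluster load, and it also ignores the time a message waits for its segment to be closed, since Kafka only deletes closed segments.

```python
from datetime import timedelta
from typing import Optional


def latest_expected_deletion(retention: timedelta,
                             check_interval: timedelta) -> timedelta:
    """Retention period plus one full retention-check interval.

    Mirrors the 24 hours + 5 minutes example; real clusters can take longer.
    """
    return retention + check_interval


def eligible_for_compaction(dirty_ratio: float,
                            dirty_age_ms: int,
                            min_cleanable_ratio: float,
                            min_lag_ms: int,
                            max_lag_ms: Optional[int] = None) -> bool:
    """Apply the two eligibility rules for a compacted log (assumed reading)."""
    # Rule 1: enough of the log is dirty AND it has been dirty long enough.
    ratio_rule = (dirty_ratio >= min_cleanable_ratio
                  and dirty_age_ms >= min_lag_ms)
    # Rule 2: dirty records have reached the maximum lag, regardless of ratio.
    max_lag_rule = max_lag_ms is not None and dirty_age_ms >= max_lag_ms
    return ratio_rule or max_lag_rule


bound = latest_expected_deletion(timedelta(hours=24), timedelta(minutes=5))
print(bound)  # 1 day, 0:05:00
```

For privacy-sensitive data, a calculation like this is a useful lower bound on how long records can outlive their nominal retention. It is not a guarantee.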
-
Hackathons with social impact. Sounds good? It was. A little while ago we hosted a 'hacky day' in collaboration with the Centre for Information Resilience (CIR). Two days of energy, creativity and a bit of chaos. Why did we do this? ↳ First and foremost: use our technical expertise for a good cause. ↳ But also: to support CIR in their fight against war crimes and disinformation. ↳ And: Let colleagues think beyond their client work. Our focus? CIR’s "Eyes on Russia" project. A project to collect and verify videos, photos, satellite imagery and other media related to Russia’s invasion of Ukraine. Our objective was to provide journalists, NGOs, policymakers and the public access to verified, trustworthy information. Their challenge: "We need a way to automatically tag drone footage." It saves analysts a lot of time and reduces exposure to graphic content. This allows them to gather more evidence and better represent victims of war crimes. Our solution: ↳ An MLOps pipeline built around an AI model that recognizes drone footage. ↳ Fully integrated with CIR’s cloud platform. ↳ Designed for simplicity, scalability, and maintainability. The result? A functional architecture, soon to be fully deployed into production. The vibe? Chaotic, yes. But the room was filled with passionate discussions, fresh ideas, and a drive to make a real-world impact. When you’re working on something with genuine social value, it’s amazing how much it fuels your motivation.
-
Why is real-time data analysis still so rare? It’s surprising, especially when companies like Airbnb, Stripe, Netflix and LinkedIn thrive on fast decisions powered by real-time data pipelines. Most companies? They’re stuck funneling data into traditional systems, missing the opportunity for true real-time insights. Real-time data gives you an edge: anticipating trends, reacting as things happen, and essentially operating in the “future.” Imagine that vital sectors of society process data in real time. How many opportunities could this open for us?
Here's how real-time data pipelines work:
→ 1) Data Ingestion: Data is constantly being generated: think clicks, app activity, IoT sensors. Tools like Apache Kafka ensure that data flows continuously, capturing it in real time, no delays. You get the data the moment it’s created.
→ 2) Stream Processing: Here’s where the real value starts. Data needs to be prepped and cleaned immediately. Tools like Apache Flink and Kafka Streams process data streams in real time, ensuring that it’s ready for instant analysis. No batch processing. No waiting.
→ 3) Real-Time Storage & Analysis (this is where the magic happens):
Apache Druid / This is your go-to for analyzing high-volume data at lightning speed. Think real-time dashboards tracking millions of events per second: user behavior, performance metrics, anything you need to see now.
Apache Kylin / Excels at pre-aggregating massive datasets, which means it runs complex analytics before you need them. Result? You can generate detailed reports faster than you thought possible.
Apache Pinot / Designed for sub-second query responses, perfect when speed is critical, like monitoring live marketing campaign performance or tracking product metrics in real time. You get the answers as fast as you can ask the questions.
→ 4) Visualization: Data is useless if it’s not actionable. That’s why you need tools that can present the information clearly, in real-time dashboards and reports. Whether you’re tracking KPIs or operational metrics, the data you see is always up to date.
By integrating tools like Apache Druid, Kylin, and Pinot into your data pipeline, you’re enabling your business to act in real time. No lag, no guesswork, just fast, informed decisions when they matter most. Want to see in detail how it’s done? In Part 1 of our blog series, we dive deep into how these technologies power real-time insights. Link in comments.
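The first three stages can be mimicked in a few lines of plain Python to show the data flow. This is an in-memory sketch only: it uses none of the Kafka, Flink, Druid, Kylin or Pinot APIs, and the `Event` type, the sample events and the function names are all hypothetical stand-ins.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Event:
    ts: float             # seconds since the stream started
    page: Optional[str]   # None marks a malformed event
    user: str


def ingest() -> Iterator[Event]:
    """Stage 1 stand-in: yields events one by one, as a consumer would."""
    yield from [
        Event(0.5, "/home", "a"),
        Event(1.2, "/cart", "b"),
        Event(1.9, None, "c"),      # malformed, dropped in stage 2
        Event(61.0, "/home", "a"),
        Event(62.5, "/home", "d"),
    ]


def clean(events: Iterable[Event]) -> Iterator[Event]:
    """Stage 2 stand-in: filter bad records as they arrive, no batching."""
    for event in events:
        if event.page is not None:
            yield event


def aggregate(events: Iterable[Event],
              window_s: int = 60) -> Dict[Tuple[int, str], int]:
    """Stage 3 stand-in: pre-aggregate page views per tumbling window."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for event in events:
        window_start = int(event.ts // window_s) * window_s
        counts[(window_start, event.page)] += 1
    return dict(counts)


# Generators chain lazily: each event flows through every stage as it arrives.
counts = aggregate(clean(ingest()))
print(counts[(60, "/home")])  # 2
```

A dashboard query (stage 4) then becomes a cheap lookup in the pre-aggregated counts instead of a scan over raw events, which is the same trade-off that pre-aggregating analytics stores make at scale.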