Ramin Mehran’s Post

Tech Lead @ Google DeepMind Multi-Modal perception/generation, AI Breakdown Podcaster

3mo

In this episode, we discuss "Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources" by Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. The paper presents Source2Synth, a method for improving Large Language Models (LLMs) by generating synthetic data with intermediate reasoning steps, grounded in real-world sources, without costly human annotations. Source2Synth also filters out low-quality data points to keep the resulting dataset high quality. The method improves multi-hop question answering by 22.57% on HotPotQA and tool usage in tabular question answering by 25.51% on WikiSQL, relative to fine-tuned baselines.

arxiv preprint - Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

podbean.com


More Relevant Posts

SciBite

3,988 followers
1mo
What is Retrieval Augmented Generation and why is the data you feed it so important? http://spkl.io/6041fVMXS Read on to understand the importance of the information retrieval steps in such a workflow and how vector and ontology-based methods compare in our blog by Joseph Mullen, Director of Data Science & Professional Service at SciBite. http://spkl.io/6041fVMXS #LargeLanguageModels #LLMs #Ontologies
Aryn

619 followers
8mo
#RAG has fundamental limitations that prevent it from generating accurate answers for diverse queries on complex documents or on datasets at scale. Mehul Shah writes about the need for a new approach: LLM-Powered Unstructured Analytics. We call it LUnA, and this architecture is at the core of #Sycamore. Read about LUnA here: https://lnkd.in/gThndXVf #Sycamore #Aryn #Analytics #UnstructuredData #Search #GenAI

RAG is a band-aid; we need LLM-powered Unstructured Analytics — LUnA

aryn.ai
SciBite

3,988 followers
5mo
What is Retrieval Augmented Generation and why is the data you feed it so important? http://spkl.io/60484CS7A Read on to understand the importance of the information retrieval steps in such a workflow and how vector and ontology-based methods compare in our latest blog by Joseph Mullen, Director of Data Science & Professional Service at SciBite. http://spkl.io/60484CS7A #LargeLanguageModels #LLMs #Ontologies
Nattapong Thanngam

Data Science Team Lead at Data Cafe
7mo
📚 Dive into "Exploring Summarization Techniques with LLMs" in our Data Mastery Series! Discover how to leverage Large Language Models (LLMs) to generate concise and insightful summaries from extensive texts. This episode covers various advanced summarization methods, including Map-Reduce, Custom Prompts, and Cluster-Based Summarization. Perfect for anyone looking to enhance the efficiency of information processing and ensure critical details are captured effectively. Join us for an insightful journey into the world of LLM summarization techniques!

Introduction to LLMs: Exploring Summarization Techniques

link.medium.com
Vincent Giordano

Economist at the New York City Council
7mo
Always remember folks: data rarely fits our core assumptions in regression analysis. This is why you should always make sure to run your diagnostic tests and provide as much error specification to your model as needed. *PS, if your data somehow satisfies all assumptions of regression analysis, then chances are you’ve engaged in data torture (big no-no). #statistics #regression #datascience

Jarom Hulet

Data Science Manager at Toyota Financial Services
7mo

Do you remember learning about the homoscedastic (what a word!) errors assumption for linear regression in stats 101? I recently wrote an article about how and why that assumption can be loosened by using heteroscedastic robust errors on Towards Data Science! If heteroscedasticity is keeping you up at night you should check it out!

Bite Size Data Science: Heteroscedastic Robust Errors

towardsdatascience.com
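
For readers who want to see what loosening the homoscedasticity assumption looks like in practice, here is a minimal sketch for a one-predictor regression. It uses the HC0 (White) estimator, which is an assumption on my part: it is one common heteroscedasticity-robust variant, not necessarily the one the linked article uses. The function name `ols_with_robust_se` and the toy data are hypothetical.

```python
import math

def ols_with_robust_se(x, y):
    """Fit y = a + b*x by OLS; return slope, classical SE, and HC0 robust SE."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)

    slope = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes every error shares one variance (homoscedasticity).
    sigma2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)

    # HC0 (White) SE: each observation contributes its own squared residual,
    # so the error variance is allowed to differ across observations.
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx

    return slope, classical_se, robust_se

# Toy data (made up): the noise grows with x, i.e. heteroscedastic errors.
x = [1, 2, 3, 4, 5, 6]
y = [2.5, 3.0, 7.5, 6.0, 12.5, 9.0]
slope, classical_se, robust_se = ols_with_robust_se(x, y)
print(f"slope={slope:.3f} classical_se={classical_se:.3f} robust_se={robust_se:.3f}")
```

The slope estimate is identical under both approaches; only the standard error, and therefore any confidence interval or p-value built on it, changes.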
Wow Development Quality Assurance

17,919 followers
7mo
Causal Validation: A Unified Theory of Everything. How to detect and fix any type of error in a directed acyclic graph so that it is a valid representation of the underlying data. By Graham Harrison, published in Towards Data Science (28 min read).

Introduction: Causal inference is an emerging field within machine learning that can move beyond predicting what could happen to explaining why it will happen, and in doing so offers the promise of permanently resolving the underlying problem rather than dealing with the potential fallout. One of the key components of a causal model is a “Directed Acyclic Graph” (DAG), which captures the cause-and-effect relationships between variables and events in a simple visual format. The main issue with DAGs is that they are typically constructed subjectively by domain experts, so there is no guarantee that the DAG will be correct, and if it is incorrect the calculations and conclusions of a causal inference model are likely to be wrong. Causal Validation is the term used to describe the process of checking a DAG against the underlying data it represents, with the objective of identifying any mistakes or inconsistencies and fixing them. If this can be done reliably it will ensure that the conclusions of causal inference and the associated… https://lnkd.in/dRxJ67WW

Causal Validation: A Unified Theory of Everything

towardsdatascience.com
Nick Tarazona, MD
1mo
👉🏼 Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study
🤓 Carl Ehrett
👇🏻 https://lnkd.in/e86rvfKq
🔍 Focus on data insights:
- 📊 The study identifies that the best-performing generative model for data augmentation is LLaMA 7B, utilized at temperature 0.7 with 100 synthetic augmentations.
- ✅ Robustly Optimized BERT Pretraining Approach
Eric Yang

Senior Director of Data Science @ Medidata (Acorn.Ai)
3mo
Random Data Science pro-tip. Stop using one-hot encoding for high cardinality categorical data, in which each category can only be present once per row of data. If you're careful about splitting your train-test-validation sets, Target Encoding provides a much more straightforward way of encoding categorical data without blowing up the size of your data frame.
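
A minimal sketch of the tip above, in plain Python: each category is replaced by a single number (the mean target for that category) instead of one column per category. The encoding is learned from the training rows only and then applied to the validation rows, which is the "careful about splitting" part. The smoothing toward the global mean and the fallback for unseen categories are my own assumptions rather than part of the tip, and the function names and toy data are hypothetical.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed mean-target value per category from TRAINING rows only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean (assumed smoothing);
    # categories with few rows are pulled toward it the most.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean

def apply_target_encoding(categories, encoding, global_mean):
    """Map categories to learned values; unseen categories get the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]

# Toy data (made up). The encoding never sees the validation targets.
train_city = ["NYC", "NYC", "LA", "LA", "LA", "SF"]
train_y = [1, 0, 1, 1, 1, 0]
valid_city = ["NYC", "SF", "Boston"]

encoding, global_mean = fit_target_encoding(train_city, train_y, smoothing=2.0)
valid_encoded = apply_target_encoding(valid_city, encoding, global_mean)
print(valid_encoded)
```

However many distinct categories there are, the result is one numeric column, whereas one-hot encoding would add one column per category.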

