👉🏼 Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study 🤓 Carl Ehrett 👇🏻 https://lnkd.in/e86rvfKq
🔍 Focus on data insights:
- 📊 The study identifies LLaMA 7B, run at temperature 0.7 with 100 synthetic augmentations, as the best-performing generative model for data augmentation.
- ✅ Robustly Optimized BERT Pretraining Approach
Nick Tarazona, MD’s Post
-
Delighted to announce the publication of my latest article, "Instance Selection Techniques for Large Volumes of Data." 📰 In this research, I explore advanced strategies for instance selection in extensive datasets, highlighting effective methodologies and their practical implications. It has been a pleasure collaborating with Antonio J. Tallón-Ballesteros. Read more about our findings in the full article here. https://lnkd.in/djSeQCph #ML #IA #BigData #machinelearning #artificialintelligence #data #Research #datascience #technology #dataanalytics
-
In this episode, we discuss Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources by Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. The paper presents Source2Synth, a method designed to enhance Large Language Models (LLMs) by generating synthetic data with intermediate reasoning steps, grounded in real-world sources, to improve performance without costly human annotations. Source2Synth also filters out low-quality data points to ensure high-quality datasets. The method demonstrates significant improvements in performance for multi-hop question answering and tool usage in tabular question answering, with respective boosts of 22.57% on HotPotQA and 25.51% on WikiSQL.
arxiv preprint - Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
podbean.com
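The generate-then-filter loop described above can be sketched in a few lines. Note that `draft_example` and `quality_score` below are hypothetical stand-ins for the model calls the paper uses, not the authors' actual API; only the overall shape (generate candidates grounded in sources, then keep the top-scoring fraction) follows the summary.

```python
import random

def draft_example(source_doc, rng):
    # Hypothetical stand-in for an LLM call that drafts a synthetic
    # question/answer pair grounded in a real source document.
    return {"source": source_doc,
            "question": f"Q about {source_doc}",
            "score": rng.random()}

def quality_score(example):
    # Hypothetical stand-in for the model-based filtering step:
    # Source2Synth discards low-quality synthetic data points.
    return example["score"]

def build_dataset(sources, per_source=5, keep_fraction=0.5, seed=0):
    rng = random.Random(seed)
    # Generate several synthetic candidates per real source document...
    candidates = [draft_example(s, rng) for s in sources for _ in range(per_source)]
    # ...then keep only the top-scoring fraction of them.
    candidates.sort(key=quality_score, reverse=True)
    return candidates[: int(len(candidates) * keep_fraction)]

dataset = build_dataset(["doc_a", "doc_b"])
print(len(dataset))  # 5 of the 10 candidates survive filtering
```

The filtering threshold (here, the top half by score) is the knob that trades dataset size against quality.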
-
Can a single Graph Foundation Model be pre-trained on arbitrary graph data and benefit graphs from all domains across different downstream tasks? We present UniAug as a showcase of cross-domain graph data scaling through diffusion models. See details at https://lnkd.in/eh37NW5Y UniAug is a structure-only graph diffusion model pre-trained on graph structures from 33 domains, aiming to capture the complicated structural patterns of graphs from various domains. The pre-trained model is then fine-tuned to generate data augmentation for the downstream task. UniAug achieves performance gains across node classification, link prediction, and graph classification, and we observe positive transfer across domains. UniAug can even outperform domain-specific pre-trained models in some cases. Notably, UniAug uses only structures, with no domain-specific design. UniAug has not converged yet! Performance may continue to improve with more compute and larger datasets. UniAug also hints at the existence of a universal structure space.
-
Don't know what to read but want to get extra knowledge about things? Don't worry, I got you!😉 Here is a recommended article series about Finding Outliers in Your Time-Series Data by Sara Nóbrega. "Remember: Dataset size, computation resources, interpretability, and the nature of your task are key to choose the appropriate outlier detection methods. It can be beneficial to experiment with various methods and metrics to evaluate their performance accurately. If possible, consider using ensembles of methods to boost accuracy. Also, using what you or domain experts know about the field can guide your choice of method." -Sara Nóbrega - https://lnkd.in/e5Ahwh5F - https://lnkd.in/eaTuvBEy - https://lnkd.in/eSx5G6nx
The Ultimate Guide to Finding Outliers in Your Time-Series Data (Part 2)
towardsdatascience.com
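A minimal sketch of the ensembling idea the quote recommends: combine two simple detectors (z-score and IQR, my choice of detectors for illustration, not necessarily the article's) and flag a point only when they agree.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    # Flag points more than `threshold` standard deviations from the mean.
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    # Flag points outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def ensemble_outliers(x, min_votes=2):
    # A point is an outlier only if enough detectors agree.
    votes = zscore_outliers(x).astype(int) + iqr_outliers(x).astype(int)
    return votes >= min_votes

# 200 well-behaved points plus one injected spike at index 200.
series = np.concatenate([np.random.default_rng(0).normal(0, 1, 200), [12.0]])
print(np.where(ensemble_outliers(series))[0])  # indices where both detectors agree
```

Requiring agreement suppresses the false positives that any single detector produces on its own; for seasonal or trending series, you would detrend first, as the article discusses.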
-
It's hiding in plain sight. I have so many conversations with people about uncovering all the extra (mindblowing) insight they could have in their OWN businesses with their OWN data (no hallucinating!) Very few are even aware of these capabilities... #futureofwork #generativeAI #humanplusAI
If LLMs were good at nothing else beyond taking unstructured data & structuring it and effective summarization & compression of text, that would still be a pretty big deal for many industries and researchers.
-
Check out the latest from Endava: 🔍 Synthetic Data: Driving Machine Learning Innovation 🔍 Our experts break down the following: 📈 Importance of Synthetic Data: Fueling model accuracy and privacy in AI training. ⚖️ Synthetic vs. Conventional Training: Boosts speed and flexibility, enabling data-driven advancements. 🧩 Integration Methodology: Implemented seamlessly to enhance model reliability. 📊 Case Studies: Proven results from Endava's own projects showcase the power of synthetic data in action! Feel free to share your thoughts below. . . . . . #MachineLearning #SyntheticData #AIInnovation #DataScience
Synthetic data is becoming a cornerstone of machine learning. https://okt.to/5Pc3UF Is your organisation prepared for it? Get an in-depth guide to this data acquisition method in our whitepaper.
-
Successful modelling of a complex data set is part science, part statistical methods, and part experience and common sense. The quote is due to Hosmer and Lemeshow (2013) in their book on applied logistic regression, but it applies to any model.
-
I’ve just published my first Medium article on K-means clustering. If you're interested in learning how this algorithm works and how it can help with data analysis, check it out!
A Deep Dive in k-means Intuition
link.medium.com
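For readers who want the intuition in code: a minimal NumPy sketch of Lloyd's algorithm, the standard k-means loop of alternating assignment and centroid-update steps. This is a generic illustration, not the article's implementation.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # 2. Assignment step: each point joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update step: move each centroid to the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments no longer change
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs: k-means should recover them cleanly.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Because the objective is non-convex, the result depends on initialization; in practice you run it several times (or use k-means++ seeding) and keep the best within-cluster sum of squares.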
-
This gets to what I believe is the core of LLM tech: probabilistic structuring and unstructuring of data. It seems “intelligent” to us because we have never before seen any *thing* that can do that other than a human. And you can call it intelligent or probabilistic or stupid-as-cat or whatever, but it’s certainly doing work. The problem is always that last 10%, because probabilistic means there is inherent randomness. So most of what we do in software engineering around LLMs is accounting for that. Unfortunately, the demos and hype never have to solve that last 10%, because they can just hide it. So we don’t see the blood and sweat. This still isn’t magic.
If LLMs were good at nothing else beyond taking unstructured data & structuring it and effective summarization & compression of text, that would still be a pretty big deal for many industries and researchers.
-
You can summarize across the last 5 years of earnings transcripts for a company using dafinchiAI. What’s more, we have a feature that lets you create a collection of responses, then prompt our LLM with all of those responses as context and have it summarize them. Users get access to Claude 3.5, but in our experiments we saw great results with o1-preview from OpenAI. If you would like to use o1-preview, reach out to us at contact@dafinchi.ai
If LLMs were good at nothing else beyond taking unstructured data & structuring it and effective summarization & compression of text, that would still be a pretty big deal for many industries and researchers.