Could this be the ImageNet moment for scientific AI?

Today, as part of a large international collaboration, we're releasing two massive datasets that span dozens of fields, from bacterial growth to supernovae! We want this to enable multi-disciplinary foundation model research.

You might ask: why would such diverse training data help AI? Well, as we've seen over the past few years, breadth of training can lead to stronger performance. We want AI to exploit phenomena common across the sciences. Take waves, for example, which appear in many systems!

The first dataset, "The Well", contains curated simulations from 16 scientific domains, each capturing fundamental equations that appear throughout nature:
- Fluid dynamics & turbulence
- Supernova explosions
- Biological pattern formation
- Acoustic wave propagation
- Magnetohydrodynamics
All validated and/or generated by domain experts.

The second dataset, "Multimodal Universe", brings together astronomical data from major observatories and surveys:
- Hundreds of millions of observations
- Multiple modalities, object types, wavelengths
- Data from JWST, HST, Gaia, and several other major surveys
All unified in a single, ML-ready format.

For ML researchers, this is:
- 115TB of validated, well-understood scientific data
- Clean train/test splits
- Benchmark tasks ready
- Available now on HuggingFace!

We are thankful that both papers were accepted to NeurIPS 2024, in the Datasets & Benchmarks track.

Our code is open-sourced here, with easy-to-use APIs:
- https://lnkd.in/eCz8BmqN
- https://lnkd.in/e6Vv82P7

I'm excited about this both for ML research and for the problems it will enable us to solve in science!
This was a collaboration with a 'dream team' of researchers around the world, including:

- Multimodal Universe: Eirini Angeloudi, Jeroen Audenaert, Micah Bowles, Ben Boyd, David Chemaly, Brian Cherinka, Ioana Ciucă, Aaron Do, Matt Grayling, Erin Hayes, Tom Hehir, Shirley Ho, Marc Huertas-Company, Kartheik Iyer, Maja Jabłońska, Francois Lanusse, Henry Leung, Kaisey Mandel, Juan Rafael Martínez Galarza, Peter Melchior, Lucas Meyer, Liam Parker, Helen Qu, Jeff Shen, Mike Smith, Connor Stone, Mike Walmsley, John F. Wu, PhD

- The Well: Ruben Ohana, Michael McCabe, Lucas Meyer, Rudy Morel, Fruzsina Agocs, Miguel Beneitez, Marsha Berger, Blakesley Burkhart, Stuart Dalziel, Drummond Fielding, Daniel Fortunato, Jared Goldberg, Keiya Hirashima, Yan-Fei Jiang, Rich Kerswell, Suryanarayana Maddu, Jonah Miller, Payel Mukhopadhyay, Stefan Nixon, Jeff Shen, Romain Watteux, Bruno Régaldo-Saint Blancard, François Rozet, Liam Parker, Shirley Ho.

I'm very proud of the team and we are all excited to see what the community can do with these amazing datasets!