Toloka’s Post


Researchers from Toloka and CERN evaluated LLMs on complex science questions, using a new benchmark dataset created by domain experts.

Highlights:
🏆 Llama outperformed every model in Bioinformatics
🏆 GPT-4o won overall

Summary of the benchmark:
- 10 subjects in the natural sciences
- 10 criteria evaluated
- 5 LLMs tested: Qwen2-7B-Instruct, Llama-3-8B-Instruct, Mixtral-8x7B, Gemini-1.0-pro, and GPT-4o

Where all LLMs struggle:
- Depth and breadth
- Reasoning and problem-solving
- Conceptual and factual accuracy

What does it mean?
- Accuracy varies across science domains.
- All tested LLMs underperform on complex questions.
- LLM responses can be misleading to non-experts.

👉 Read the article to find out more: https://lnkd.in/g-dWGtsP

#AI #STEM #NaturalSciences #LLM #Benchmarking #GPT4o #Llama #GeminiPro #Mixtral #Qwen2

[Image: bar chart]
