Artificial Analysis

Technology, Information and Internet

Independent analysis of AI models and hosting providers: https://artificialanalysis.ai/

About us

Leading provider of independent analysis of AI models and providers. Understand the AI landscape to choose the best AI technologies for your use-case.

Website: https://artificialanalysis.ai/
Industry: Technology, Information and Internet
Company size: 11-50 employees
Type: Privately Held

Updates

  • Thanks for the support Andrew Ng! Completely agree, faster token generation will become increasingly important as a greater proportion of output tokens are consumed by models, such as in multi-step agentic workflows, rather than being read by people.

    Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of Landing AI

    Shoutout to the team that built https://lnkd.in/g3Y-Zj3W . Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!

    Model & API Providers Analysis | Artificial Analysis

    artificialanalysis.ai

  • OpenAI today previewed their new o3 and o3-mini models - eval performance marks a significant leap forward

    OpenAI has claimed huge leaps on today’s leading evaluation datasets:
    ➤ GPQA Diamond: 87.7% (vs. 78.0% for o1)
    ➤ SWE-Bench Verified: 71.7% (vs. 48.9% for o1)
    ➤ AIME: 96.7% (vs. 83.3% for o1)
    ➤ EpochAI’s FrontierMath: 25.2% (vs. 2.0% for SOTA)

    While eval scores claimed by labs should always be taken with a grain of salt, we have generally been able to replicate all OpenAI claims for the o1 series. These leaps put OpenAI firmly back as the clear leader of the AI frontier.

  • Announcing Speech to Speech benchmarking and releasing Big Bench Audio - the first dedicated dataset for evaluating reasoning performance of speech models

    Hugging Face Blog: https://lnkd.in/gkcZ7mT8
    Speech to Speech page on Artificial Analysis: https://lnkd.in/gNkkzszJ

    The era of native Speech to Speech models has arrived: OpenAI’s GPT-4o and Google’s Gemini 2.0 Flash start a new paradigm of models that you can directly speak to.

    To evaluate reasoning performance for this new paradigm, we're releasing Big Bench Audio - a 1,000-question dataset adapted from Big Bench Hard for Speech to Speech models. We’re launching it today on Hugging Face - see our full post on the Hugging Face Blog for details!

    Our initial results show a significant "speech reasoning gap": while GPT-4o achieves 92% accuracy on a text-only version of the dataset, its Speech to Speech performance drops to 66%.

    Traditional speech pipeline approaches (using Whisper for transcription, GPT-4o (Aug '24) for reasoning, and OpenAI’s TTS-1 for voice generation) are currently outperforming Speech to Speech models in reasoning. We saw almost no reasoning drop-off from GPT-4o’s Text to Text score when comparing to the speech pipeline described above.

    Evaluating Audio Reasoning with Big Bench Audio

    huggingface.co

  • Prompt Caching can offer 50% to 90% discounts on repeated input tokens

    We have launched comprehensive coverage of prompt caching on Artificial Analysis. Getting your approach to prompt caching right can have dramatic results - cost savings on input tokens and significant performance benefits, especially when using long context inputs.

    Summary of Caching Support by Model Family

    Anthropic Claude (including on Google Vertex and Amazon Bedrock):
    ➤ Manual activation required
    ➤ Up to 90% discount on cached input tokens
    ➤ Cache write price 25% higher than standard inputs
    ➤ 5 minute TTL

    Google Gemini:
    ➤ Manual activation required
    ➤ Up to 80% discount on cached input tokens
    ➤ Customizable TTL with per-hour storage pricing

    OpenAI (including on Microsoft Azure):
    ➤ Activated automatically
    ➤ 50% discount on some models
    ➤ 5-10 minute TTL

    DeepSeek:
    ➤ Activated automatically
    ➤ 90% discount on cached input tokens
    ➤ Stores the KV cache on disk, allowing the cache to persist much longer, but it does not provide the same speed benefits as caching in faster memory (HBM/DRAM)

    Link to our analysis below

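As a rough illustration of the caching figures in the post above, the sketch below estimates input-token cost for repeated requests that share a long prefix. The function name `input_cost`, the $3.00 per million token base price, and the token counts are all hypothetical; only the discount and write-premium percentages come from the post. It assumes the first request writes the prefix to the cache and every later request hits it before the TTL expires, which real traffic may not do.

```python
def input_cost(price_per_mtok, prefix_tokens, suffix_tokens, n_requests,
               read_discount=0.0, write_premium=0.0):
    """Estimate total input-token cost (in dollars) for n_requests.

    Every request sends the same `prefix_tokens` followed by
    `suffix_tokens` of fresh input. Assumption: request 1 writes the
    prefix to the cache and requests 2..n all read it from the cache.
    """
    if n_requests <= 0:
        return 0.0
    per_token = price_per_mtok / 1_000_000
    # First request pays the (possibly marked-up) cache write price.
    first = prefix_tokens * per_token * (1 + write_premium)
    # Later requests pay the discounted cached-read price for the prefix.
    later = (n_requests - 1) * prefix_tokens * per_token * (1 - read_discount)
    # The non-repeated part of each prompt is always billed at full price.
    fresh = n_requests * suffix_tokens * per_token
    return first + later + fresh


if __name__ == "__main__":
    # Hypothetical workload: 100k-token shared prefix, 1k-token question,
    # 10 requests, $3.00 per million input tokens (illustrative price only).
    base = dict(price_per_mtok=3.0, prefix_tokens=100_000,
                suffix_tokens=1_000, n_requests=10)
    scenarios = {
        "no caching": {},
        "90% read discount, 25% write premium": dict(read_discount=0.90,
                                                     write_premium=0.25),
        "50% read discount, no write premium": dict(read_discount=0.50),
    }
    for label, extra in scenarios.items():
        print(f"{label}: ${input_cost(**base, **extra):.3f}")
```

The write premium matters most when a prefix is reused only once or twice; with many cache hits inside the TTL, the read discount dominates the total.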
  • Launch week, Day 3/5: Announcing Multilingual Benchmarking on Artificial Analysis 🇬🇧 🇪🇸 🇫🇷 🇩🇪 🇹🇿 🇧🇩 🇨🇳 🇯🇵

    Key conclusions:
    ➤ Most frontier LLMs hold up impressively well across languages
    ➤ Out of the models we tested, only smaller models, like Llama 3.1 8B, show serious drop-off beyond English
    ➤ GPT-4o and Gemini 1.5 Pro demonstrate particularly impressive multilingual performance

    Our new Multilingual Reasoning analysis lets you compare per-language results from Multilingual MMLU and Multilingual GSM across 🇬🇧 English, 🇪🇸 Spanish, 🇫🇷 French, 🇩🇪 German, 🇹🇿 Swahili, 🇧🇩 Bengali, 🇨🇳 Chinese and 🇯🇵 Japanese. More languages (e.g. 🇮🇳 Hindi, 🇪🇬 Arabic) coming soon!

    See the article below for deep dives on each language and a link to our analysis 👇

    Announcing Multilingual Benchmarking on Artificial Analysis 🇬🇧 🇪🇸 🇫🇷 🇩🇪 🇹🇿 🇧🇩 🇨🇳 🇯🇵

    Artificial Analysis on LinkedIn

  • Artificial Analysis reposted this

    Mark Kovarski

    Responsible AI | Co-Founder | CTO | Enterprise | Automation

    Artificial Analysis AI Review 2024 Highlights

    Artificial Analysis has released the Artificial Analysis AI Review - 2024. This insightful 18-page year-end overview highlights key developments and trends that defined AI in 2024:
    🤖 Labs worldwide have caught up to or surpassed GPT-4, with models like GPT-4o offering 100x cheaper inference costs.
    🌍 The US leads in AI innovation, with China in a strong second position.
    🔓 Open-source models are closing the gap with proprietary counterparts.
    💰 Inference costs fell 75x, led by smarter, smaller models.
    📏 Context lengths expanded 32x to 128k tokens (and even bigger), enabling multimodal tasks and agent workflows.
    🖼️ Models like Recraft V3 reached an ELO score of 1161 in the Artificial Analysis Image Arena, based on 1.5M user votes.
    🎙️ Transcription costs dropped to $0.33 per 1,000 minutes, with some models transcribing an hour of audio in ~10 seconds.

    Falling costs and improved capabilities are democratizing AI & driving broader adoption. What trends do you see shaping 2025?

    Home ➡️ https://lnkd.in/gqXxgzu2

  • Announcing the Artificial Analysis AI Review - 2024 Highlights Release

    For Day 2 of our Launch Week, we have put together key themes from our AI benchmarks & analysis for our first public release of content from the Artificial Analysis AI Review.

    Key topics in our 2024 Highlights Release:
    ➤ 2024 saw multiple labs catch up to OpenAI’s GPT-4, and the emergence of the first models to push beyond GPT-4’s level of intelligence
    ➤ The US dominates the intelligence frontier - for now…
    ➤ The performance gap between open source and proprietary models has decreased significantly
    ➤ Language model inference pricing fell dramatically for all levels of intelligence
    ➤ A key driver of the decline in inference pricing and increase in speed has been small models
    … and much more!

    See the article below for further excerpts and a link to download the report 👇

    Announcing the Artificial Analysis AI Review - 2024 Highlights Release

    Artificial Analysis on LinkedIn

  • Announcement: we’re doing five launches in five days! Inspired by OpenAI, we’re wrapping up the year by launching a handful of projects we can’t wait to share.

    Day 1/5: Image Arena Categories!

    With over 1.5 million votes cast in the Artificial Analysis Image Arena, we’re excited to break down performance of Text to Image models by style and subject. Want to know which model is best for text rendering, photorealism, or anime? We’re now calculating individual ELO scores for a range of styles and subjects.

    We’ve added hundreds of new prompts to the Arena, expanding coverage across more diverse categories. Each category needs a minimum number of prompts & votes to display an ELO score, so start voting to see expanded coverage of more diverse categories!

    The ranking of models for specific styles and subjects can change substantially from the overall ranking. For images with Text & Typography, while Recraft's Recraft V3 remains the top model, Ideogram's Ideogram v2 shows its strength in text rendering by increasing from 6th to 2nd place.

