Ragas

Software Development

Building the open-source framework for testing and evaluating AI applications

About us

Website
https://github.com/explodinggradients
Industry
Software Development
Company size
2-10 employees
Type
Privately Held

Updates

  • Ragas reposted this

    Rachitt Shah

    Applied AI Consultant | Past: Sequoia, Founder, Quant, SRE, Google OSS

    Was reading Shahul's excellent blog on aligning LLM judges with human experts, and came across Shreya's paper on EvalGen. TL;DR:

    Problem: LLMs are increasingly used to evaluate other LLM outputs, but these LLM-based evaluators can be unreliable and require validation. Existing tools lack sufficient support for verifying the quality of LLM-generated evaluations, and users struggle to define evaluation metrics for custom tasks.

    Proposed solution: EvalGen, a mixed-initiative interface that assists users in creating and validating LLM-based evaluations.

    Workflow:
    1. The LLM suggests evaluation criteria based on the prompt under test.
    2. The LLM generates candidate assertions (code or LLM prompts) for each criterion.
    3. Users grade a subset of LLM outputs, providing feedback.
    4. EvalGen selects the assertions that best align with the user grades.
    5. A report card shows the alignment between the chosen assertions and the user grades.

    Key features:
    • Criteria generation: LLM-powered suggestions for evaluation criteria.
    • Assertion synthesis: LLM-generated candidate implementations (code or LLM prompts).
    • Active learning: user grades guide the selection of aligned assertions.
    • Alignment measurement: a report card showing how well the chosen assertions match user preferences.
    • Mixed-initiative: combines automated assistance with user control.

    Evaluation:
    • Offline evaluation: compared EvalGen's algorithm with SPADE (a fully automated assertion generation tool). EvalGen achieved better alignment with fewer assertions thanks to human input in the criteria selection stage.
    • Qualitative user study: nine industry practitioners used EvalGen to build evaluators for LLM pipelines.

    As businesses scale AI-first workflows, evals grow even more critical. Combining humans and LLMs seems like an excellent fit.

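    To make the selection step concrete, here is a minimal, self-contained sketch of the idea behind EvalGen's assertion selection (not code from the paper): given a handful of human-graded outputs and several candidate assertions per criterion, keep the assertion whose verdicts agree most often with the human grades. The names candidate_assertions and human_grades are illustrative.

        from typing import Callable, Dict, List

        def alignment(assertion: Callable[[str], bool], outputs: List[str], grades: List[bool]) -> float:
            # Fraction of graded outputs where the assertion's verdict matches the human grade.
            verdicts = [assertion(out) for out in outputs]
            return sum(v == g for v, g in zip(verdicts, grades)) / len(grades)

        def select_assertions(candidates: Dict[str, List[Callable[[str], bool]]],
                              outputs: List[str], grades: List[bool]) -> Dict[str, Callable[[str], bool]]:
            # For each criterion, keep the candidate assertion best aligned with the human grades.
            return {criterion: max(cands, key=lambda a: alignment(a, outputs, grades))
                    for criterion, cands in candidates.items()}

        # Toy example: two candidate assertions for a "conciseness" criterion.
        candidate_assertions = {
            "conciseness": [lambda o: len(o.split()) < 50, lambda o: len(o.split()) < 500],
        }
        outputs = ["a short answer", "a very long rambling answer " * 40]
        human_grades = [True, False]  # the human grader fails the long answer on conciseness
        chosen = select_assertions(candidate_assertions, outputs, human_grades)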
  • We are launching something new this week! 🚀 Fix your broken AI product evals by aligning LLM-based evaluators with human evaluators.

    👉🏽 In this blog, we cover:
    1. Why your AI product evals are broken
    2. How to fix them with ragas
    3. The mechanism behind our solution

    Check out the blog: https://lnkd.in/g_VY7AB8

  • Ragas

    1,360 followers

    Now align your LLM-based evaluators with human evaluators. 🚀

    LLM-as-judge metrics often fail to give the desired results because they are misaligned with human evaluators. We have taken the first step toward solving this problem by introducing a new evaluation workflow:
    1️⃣ Evaluate using LLM-based metrics
    2️⃣ Review the results and give feedback
    3️⃣ Automatically train and align your evaluators with the collected data

    This creates a data flywheel where your evaluators improve continuously as you perform more evaluations and reviews. A detailed blog covering our experiments and optimization algorithm is coming soon.

    👉🏽 Get started now: https://lnkd.in/gKujRja5
    👉🏽 Watch the video: https://lnkd.in/gjWYadBb
    ⭐️ Star us on GitHub: https://lnkd.in/drY7MQHW

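    As a rough illustration of the review-and-align loop above (this is not the ragas API, just the idea): score your outputs with an LLM judge, record human verdicts during review, measure how often the two agree, and treat the disagreements as the most valuable data for re-training or re-prompting the evaluator. All names here are assumptions made for the sketch.

        from dataclasses import dataclass
        from typing import List, Optional

        @dataclass
        class ReviewedExample:
            question: str
            answer: str
            judge_verdict: bool                   # what the LLM-based metric decided
            human_verdict: Optional[bool] = None  # filled in during human review

        def alignment_rate(examples: List[ReviewedExample]) -> float:
            # Share of human-reviewed examples where the LLM judge agreed with the reviewer.
            labelled = [e for e in examples if e.human_verdict is not None]
            return sum(e.judge_verdict == e.human_verdict for e in labelled) / len(labelled)

        def disagreements(examples: List[ReviewedExample]) -> List[ReviewedExample]:
            # Judge/human disagreements: the examples worth feeding back into the evaluator
            # (e.g. as few-shot corrections or training data) to close the gap.
            return [e for e in examples if e.human_verdict is not None
                    and e.judge_verdict != e.human_verdict]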
  • Weekly release update 🎉 v0.2.7 release highlights:
    • Added multi-language support in test set generation: https://lnkd.in/gk-XSBKK
    • Bug fixes for test data generation
    • Basic building blocks for something new we are working on 🤫

    Thanks to our contributors:
    • @bmerkle: fixed critical documentation issues in RAG test set generation
    • @ayulockin: resolved a key system bug, improving stability

    Thank you to all contributors for making this release possible! 🙌
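    For context, a rough sketch of generating a test set with ragas around the 0.2.x line is shown below; the exact class and parameter names (TestsetGenerator, generate_with_langchain_docs, testset_size) and the new language option vary between versions, so check the linked docs for your installed release.

        from langchain_community.document_loaders import DirectoryLoader
        from langchain_openai import ChatOpenAI, OpenAIEmbeddings
        from ragas.embeddings import LangchainEmbeddingsWrapper
        from ragas.llms import LangchainLLMWrapper
        from ragas.testset import TestsetGenerator

        # Load the source documents the questions should be generated from.
        docs = DirectoryLoader("docs/", glob="**/*.md").load()

        generator = TestsetGenerator(
            llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
            embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
        )
        testset = generator.generate_with_langchain_docs(docs, testset_size=10)
        print(testset.to_pandas().head())  # inspect the generated questions and references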

  • Full house for the Evaluation-driven development with Ragas workshop at the @awscloud re:Invent conference ❤️ In this workshop, we covered:
    1. Basics of Ragas
    2. Evaluating RAG workflows with ragas
    3. Evaluating agentic workflows with ragas

    Special thanks to Aris Tsakpinis for making this happen.

  • Ragas

    1,360 followers

    🚀 Synthetic data is reshaping the way we train and evaluate AI models. But how do you tailor high-quality synthetic data to fit your unique needs? 🤔

    Our latest blog explores synthetic data generation for:
    👉 Pre-training
    👉 Fine-tuning
    👉 Model alignment and safety
    👉 Evaluating LLM applications

    https://lnkd.in/gFv5xqcU

  • Ragas reposted this

    Sarthak Rastogi

    AI engineer experienced in agents, advanced RAG, LLMs and software engineering | Prev: ML research in multiple labs

    There are 4 metrics you should use to evaluate your RAG pipeline. Here's how to easily calculate them in Python, using the Ragas library:

    1. Faithfulness: measures how accurately the generated answer aligns with the given context, which indicates factual consistency. It is scored from 0 to 1, with higher values indicating better faithfulness.
    2. Answer relevance: assesses how directly and appropriately the generated answer addresses the original question. It uses the mean cosine similarity between the original question and questions generated from the answer, with higher scores indicating better relevance.
    3. Context precision: evaluates whether all relevant items in the contexts are ranked higher. Scores range from 0 to 1, with higher values indicating better precision.
    4. Context recall: measures how well the retrieved context matches the ground-truth answer. It ranges from 0 to 1, with higher scores indicating better alignment with the ground truth.

    Ragas makes it easy to evaluate your RAG pipeline on these metrics: you can create a dataset of QAs with their respective contexts and choose multiple metrics to evaluate them on. The link to the library docs is in the comments.

    #AI #LLMs #RAG

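    A minimal sketch of scoring these four metrics with ragas is shown below; the dataset column names and import paths follow the long-standing datasets-based API and may differ slightly in newer ragas releases, so verify them against the docs for your version.

        from datasets import Dataset
        from ragas import evaluate
        from ragas.metrics import (
            answer_relevancy,
            context_precision,
            context_recall,
            faithfulness,
        )

        # One toy row: a question, the pipeline's answer, the retrieved contexts,
        # and a reference (ground-truth) answer.
        data = {
            "question": ["When did the first crewed Moon landing happen?"],
            "answer": ["Apollo 11 landed the first crew on the Moon in 1969."],
            "contexts": [["Apollo 11 landed on the Moon on July 20, 1969."]],
            "ground_truth": ["The first crewed Moon landing was Apollo 11, in 1969."],
        }

        result = evaluate(
            Dataset.from_dict(data),
            metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
        )
        print(result)  # per-metric scores between 0 and 1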
  • Ragas reposted this

    Meri Nova

    ML/AI Engineer | Community Builder | Founder @Break Into Data | ADHD + C-PTSD advocate

    Don't make the mistake 80% of AI engineers make when building RAG evaluations: many forget to measure the individual components of RAG and instead focus only on output accuracy or relevance. To get truly consistent results with RAG, you need to evaluate these systems at multiple stages.

    Retrieval stage:
    - Context precision: what percentage of the retrieved documents are actually relevant to the query?
    - Context recall: out of all relevant documents, what percentage does the system successfully retrieve?

    If document ranking is important, consider metrics like:
    - NDCG (Normalized Discounted Cumulative Gain)
    - MRR (Mean Reciprocal Rank)

    Overall performance: evaluate the system end-to-end to ensure all components work harmoniously. Think about these questions when evaluating your RAG system:
    - How scalable is it, both in terms of data storage and query traffic?
    - How much data can you process in bulk at once when indexing?
    - What is the query latency? And more...

    For a comprehensive evaluation framework, consider using ragas, an open-source LLM evaluation library. This tool is specifically designed to assess both the retrieval and generation components of any RAG application.

    If you're eager to learn more about optimizing RAG systems, check out my course "RAG with Langchain" on DataCamp's platform. You can take the first chapter for free here: https://lnkd.in/gnGXMkTe. And if you want full access to all of their courses, you can get 50% off within the next 36 hours using this link: https://lnkd.in/guVFxeQq. And start building RAG projects!
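    For the retrieval-stage metrics above, here are library-agnostic illustrations of precision@k, recall@k and MRR computed from retrieved versus relevant document IDs; NDCG follows the same pattern with a position-discounted gain. Function and variable names are made up for this sketch.

        from typing import List, Set

        def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
            # Fraction of the top-k retrieved documents that are relevant.
            return sum(doc in relevant for doc in retrieved[:k]) / k

        def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
            # Fraction of all relevant documents that appear in the top-k results.
            return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

        def mean_reciprocal_rank(runs: List[List[str]], relevants: List[Set[str]]) -> float:
            # Average of 1/rank of the first relevant document across queries (0 if none found).
            scores = []
            for retrieved, relevant in zip(runs, relevants):
                rank = next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)
                scores.append(1.0 / rank if rank else 0.0)
            return sum(scores) / len(scores)

        # Example: one query where the only relevant document is ranked second.
        print(precision_at_k(["d3", "d1", "d7"], {"d1"}, k=3))       # ~0.33
        print(recall_at_k(["d3", "d1", "d7"], {"d1"}, k=3))          # 1.0
        print(mean_reciprocal_rank([["d3", "d1", "d7"]], [{"d1"}]))  # 0.5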
