We expect the majority of software engineers to rely on AI coding assistants at least once a day by 2025.
We selected leading AI assistants to benchmark:
- Top-ranked solutions in our benchmark:
  - Cursor
  - Amazon Q
  - GitLab
  - Replit
- Others:
  - Cody
  - Gemini and Codeium for high performance
  - Codiumate
  - GitHub Copilot
  - Tabnine for concise coding
Benchmark results
Based on our evaluation criteria, this is how leading AI coding assistants are ranked:
Figure 1: Benchmark results of the AI coding tools.
Cursor, Amazon Q, GitLab and Replit are the leading AI coding tools of this benchmark.
AI coding tools deep dive
Amazon Q Developer
When using the error-fixing option, Amazon Q Developer first provides a fix for the code and then continues indexing the project to improve the accuracy of subsequent fixes.
Gemini Code Assist
Gemini Code Assist states that the generated code may be subject to licensing and provides links to the sources from which the code was drawn.
GitHub Copilot
GitHub Copilot offers a wide range of features to assist developers by generating code, suggesting external resources, and offering links to downloads or documentation.
If there is a known vulnerability in the generated code, GitHub Copilot may flag it to warn the user, as seen in Figure 3. However, it may not flag every vulnerability, so developers should still carefully review and test the code to ensure it meets security and performance standards.
The role of natural language in AI coding
Language models for code generation are trained on vast amounts of code and natural language data to learn programming concepts and language understanding. The ability to precisely comprehend and adhere to nuanced prompts is crucial for translating product requirements into code.
AI assistants use LLMs for code generation. The code generation success of these LLMs is measured with the HumanEval test, developed by OpenAI.1 This test measures the code generation capability of a model using 164 programming problems. You can see how some large language models perform on the HumanEval test2 in Table 1.
Large Language Model | pass@1 score (%)
---|---
Claude 3.5 Sonnet | 92.0
GPT-4o | 90.2
GPT-4T | 87.1
Claude 3 Opus | 84.9
Claude 3 Haiku | 75.9
Gemini Ultra | 74.4
Claude 3 Sonnet | 73.0
Gemini 1.5 Pro | 71.9
Gemini Pro | 67.7
GPT-4 | 67.0

Table 1: HumanEval pass@1 scores of some large language models.
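As a rough illustration of how HumanEval-style scores are produced (our own sketch, not part of this benchmark): a model generates n candidate solutions per problem, c of them pass that problem's unit tests, and pass@k is estimated as 1 - C(n-c, k)/C(n, k), averaged over the 164 problems. The sample counts below are hypothetical.

# Illustrative pass@k estimator (our sketch, not benchmark code)
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated, c passed the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 10 samples per problem, varying numbers of passing samples.
passing_counts = [1, 0, 3, 10]
pass_at_1 = sum(pass_at_k(10, c, k=1) for c in passing_counts) / len(passing_counts)
print(f"pass@1 = {pass_at_1:.2f}")  # 0.35 for these hypothetical counts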
Methodology
We measured the success of the outputs produced by generative AI tools. To do this, we ran each tool with the same prompts and compared their results on the following tasks:
Code generation tasks
Code generation tasks involve generating code snippets or complete programs based on natural language inputs or specifications. In this benchmark, we used 4 different prompts to assess each assistant's code generation ability. For evaluation, our CTO manually reviewed the code and rated the code generators on the following metrics:
- Does the generated code work? (All solutions produced working code for our prompts, so this criterion was left out of the graphs.)
- Compliance with specs
- Code quality
- Code amount
- Performance
- Security
This benchmark consists of the following prompts and their analysis:
- Write a calculator.
- Write a safe calculator.
- Write a calculator which only takes numbers up to 6 digits as input.
- Write a calculator that does not accept decimals as input.
Evaluation examples
For example, Codeium’s evaluation results (out of 5) for the prompt “Write a safe calculator” are:
Compliance: 0
Code Quality: 0
Code Amount: 5
Performance: 5
Security: 0
The reasoning behind these scores: Codeium delivered highly efficient and concise code, but it contains a command injection vulnerability.3
# Codeium's result for "Write a safe calculator."
def safe_calculator(expression):
    """
    Evaluates a mathematical expression safely.

    Args:
        expression (str): The mathematical expression to evaluate.

    Returns:
        float: The result of the expression.

    Raises:
        ValueError: If the expression is invalid.
        ZeroDivisionError: If division by zero occurs.
    """
    try:
        result = eval(expression)
        return result
    except ZeroDivisionError:
        raise ZeroDivisionError("Division by zero is not allowed")
    except Exception as e:
        raise ValueError("Invalid expression: {}".format(e))
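To illustrate the risk behind this score: eval() executes whatever Python expression it receives, so a user could pass something like __import__('os').system(...) and run shell commands. Below is a minimal sketch of a restricted arithmetic evaluator written by us for illustration; the function safe_eval and its operator allow-list are our own, not a benchmarked tool's output.

# Illustrative eval-free arithmetic evaluator (our sketch, not a benchmarked tool's output)
import ast
import operator

_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expression: str) -> float:
    """Evaluate a basic arithmetic expression without executing arbitrary code."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("Unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

print(safe_eval("2 + 3 * 4"))  # 14
# safe_eval("__import__('os').system('id')") raises ValueError instead of running a command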
Tabnine’s scores for the “Write a calculator” task are:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
The reasons behind this evaluation are:
- Exception handling: Errors are handled effectively, so no unhandled exception reaches the user, which makes for a smoother experience.
- Operator validation: The code takes a practical approach by checking whether the entered operator is among those allowed, which reduces unnecessary complexity and informs the user.
- Exit option: Defining a “quit” command lets users exit the program, which adds functionality and convenience.
- Continuous operation: The ability to perform multiple operations makes the calculator more useful; the “while True” loop keeps the application running so users can perform as many operations as they want.
# Tabnine's result for "Write a calculator"
def calculator():
    print("Simple Calculator")
    print("Enter 'quit' to exit")
    while True:
        num1 = input("Enter first number: ")
        if num1.lower() == 'quit':
            break
        operator = input("Enter operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please try again.")
            continue
        num2 = input("Enter second number: ")
        if num2.lower() == 'quit':
            break
        try:
            num1 = float(num1)
            num2 = float(num2)
        except ValueError:
            print("Invalid input. Please enter a number.")
            continue
        if operator == '+':
            result = num1 + num2
        elif operator == '-':
            result = num1 - num2
        elif operator == '*':
            result = num1 * num2
        elif operator == '/':
            if num2 == 0:
                print("Error: Division by zero.")
                continue
            result = num1 / num2
        print(f"Result: {result}")

calculator()
The code Codeium provided for the prompt “Write a calculator which only takes numbers up to 6 digits as input” is scored as follows:
Compliance: 5
Code Quality: 5
Code Amount: 5
Performance: 5
Security: 5
This code exemplifies “less code”, although it validates the numbers in two separate places, so the same error message is duplicated in the code.
Instead of performing the operation with if-else or case statements, it uses eval, which significantly shortens the code and improves performance compared to the previous example. Developers typically avoid eval for safety reasons, since it can execute arbitrary code, including system operations.
However, because the inputs are strictly validated here, the use of eval is reasonably safe. If you do not dispatch on the operator with if-else or a match statement, eval is the most compact alternative; an eval-free version is sketched after the code below.
# Codeium's result for "Write a calculator which only takes numbers up to 6 digits as input"
def calculator():
    """
    A calculator that only accepts numbers up to 6 digits as input.
    """
    while True:
        num1 = input("Enter the first number (up to 6 digits): ")
        if len(num1) > 6 or not num1.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        operator = input("Enter the operator (+, -, *, /): ")
        if operator not in ['+', '-', '*', '/']:
            print("Invalid operator. Please enter one of +, -, *, /.")
            continue
        num2 = input("Enter the second number (up to 6 digits): ")
        if len(num2) > 6 or not num2.isdigit():
            print("Invalid input. Please enter a number up to 6 digits.")
            continue
        try:
            result = eval(f"{num1} {operator} {num2}")
            print(f"Result: {result}")
        except ZeroDivisionError:
            print("Error: Division by zero is not allowed.")
        except Exception as e:
            print(f"Error: {e}")

calculator()
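For comparison, here is the eval-free alternative mentioned above: a dictionary that maps operator symbols to functions from Python's standard operator module keeps the code nearly as short without executing strings. This is our own sketch, not one of the benchmarked outputs; the function calculate and the OPS mapping are assumptions made for illustration.

# Illustrative eval-free dispatch (our sketch, not a benchmarked tool's output)
import operator

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}

def calculate(num1: float, op: str, num2: float) -> float:
    """Apply the selected arithmetic operator without eval()."""
    if op not in OPS:
        raise ValueError(f"Unsupported operator: {op}")
    if op == '/' and num2 == 0:
        raise ZeroDivisionError("Division by zero is not allowed.")
    return OPS[op](num1, num2)

print(calculate(123456, '*', 2))  # 246912.0 -- both inputs stay within 6 digits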
Next steps
- Increasing task diversity.
- Adding a code completion assessment.
- The current evaluation is manual and relies on reviewer opinion; we aim to roll out more objective criteria in the second version of the benchmark.
What is an AI coding benchmark?
AI coding benchmarks are standardized tests designed to evaluate and compare the performance of artificial intelligence systems in coding tasks.
Benchmarks primarily test models on isolated coding challenges, but real development workflows involve more variables, such as understanding requirements, following prompts, and collaborative debugging.
What is the role of language models in code generation?
Large language models (LLMs) are commonly used for code generation tasks due to their ability to learn complex patterns and relationships in code. Code LLMs are harder to train and deploy for inference than natural language LLMs due to the autoregressive nature of the transformer-based generation algorithm. Different models have different strengths and weaknesses in code generation tasks, and the ideal approach may be to leverage multiple models.
Why are AI coding benchmarks important?
When most code is AI-generated, the quality of AI coding assistants will be critical.
What are the proper evaluation metrics and environments for a benchmark?
Evaluation metrics for code generation tasks include code correctness, functionality, readability, and performance. Evaluation environments can be simulated or real-world, and may involve compiling and running generated code in multiple programming languages. The evaluation process involves three stages: initial review, final review, and quality control, with a team of internal independent auditors reviewing a percentage of the tasks.
External Links
- 1. “human-eval“, OpenAI, Accessed 23 September 2024.
- 2. “code generation“, Papers with Code, Accessed 23 September 2024.
- 3. “Command Injection“, OWASP, Accessed 25 September 2024.