The challenge of effective prompting
AI is reshaping industries—from education to healthcare—thanks to advancements in large language models (LLMs). These models rely on prompts, carefully crafted inputs that guide them to produce relevant and meaningful outputs. While the impact of prompts is profound, creating prompts that can help with complex tasks is a time-intensive and expertise-heavy process, often involving months of trial and error.
This challenge grows as new tasks arise and models evolve rapidly, making manual methods for prompt engineering increasingly unsustainable. The question then becomes: How can we make prompt optimization faster, more accessible, and more adaptable across diverse tasks?
To address this challenge, we developed PromptWizard (PW), a research framework that automates and streamlines the process of prompt optimization. We are open-sourcing the PromptWizard codebase to foster collaboration and innovation within the research and development community.
Introducing PromptWizard
PromptWizard (PW) is designed to automate and simplify prompt optimization. It combines iterative feedback from LLMs with efficient exploration and refinement techniques to create highly effective prompts within minutes.
PromptWizard optimizes both the instruction and the in-context learning examples. Central to PW is its self-evolving and self-adaptive mechanism, where the LLM iteratively generates, critiques, and refines prompts and examples in tandem. This process ensures continuous improvement through feedback and synthesis, achieving a holistic optimization tailored to the specific task at hand. By evolving both instructions and examples simultaneously, PW ensures significant gains in task performance.
Three key insights behind PromptWizard:
- Feedback-driven refinement: At its core, PW leverages an iterative feedback loop where the LLM generates, critiques, and refines its own prompts and examples. This continuous improvement mechanism ensures that each iteration is better than the last, leading to highly effective prompts and examples.
- Joint optimization and synthesis of diverse examples: PW generates synthetic examples that are not only robust and diverse but also task-aware. By optimizing prompts and examples together, it ensures they work in tandem to address specific task requirements effectively.
- Self-generated chain-of-thought (CoT) steps: Incorporating CoT reasoning improves the problem-solving capabilities of the model. By using selected few-shot examples, PW generates a detailed reasoning chain for each example, facilitating nuanced and step-by-step problem-solving approaches.
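The third insight can be illustrated with a minimal sketch. This is not PromptWizard's code: the function name `add_reasoning_chains`, the prompt wording, and the `llm` callable (any function that takes prompt text and returns a completion) are assumptions made for illustration.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in for a model call


def add_reasoning_chains(
    instruction: str,
    examples: list[tuple[str, str]],
    llm: LLM,
) -> list[str]:
    """Return few-shot blocks, each carrying a model-written reasoning chain."""
    blocks = []
    for question, answer in examples:
        # Ask the model to explain how the known answer is reached.
        reasoning = llm(
            f"{instruction}\nQuestion: {question}\nKnown answer: {answer}\n"
            "Explain step by step how to reach this answer."
        )
        blocks.append(
            f"Question: {question}\nReasoning: {reasoning}\nAnswer: {answer}"
        )
    return blocks
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client.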
How PromptWizard works
PromptWizard begins with a user input: a problem description, an initial prompt instruction, and a few training examples that serve as a foundation for the task at hand.
Its output is a refined, optimized set of prompt instructions paired with carefully curated in-context few-shot examples. These outputs are enriched with detailed reasoning chains, task intent, and an expert profile that bridges human-like reasoning with the AI’s responses.
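One way to picture these inputs and outputs is as plain data structures. The sketch below is illustrative only; the class names, field names, and the layout produced by `render` are assumptions, not PromptWizard's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    """One training or few-shot example (hypothetical structure)."""
    question: str
    answer: str
    reasoning: str = ""  # reasoning chain, filled in during optimization


@dataclass
class OptimizationInput:
    """What the user supplies: description, starting instruction, examples."""
    problem_description: str
    initial_instruction: str
    training_examples: list[Example]


@dataclass
class OptimizedPrompt:
    """What comes back: instruction, few-shot examples, intent, expert profile."""
    instruction: str
    few_shot_examples: list[Example] = field(default_factory=list)
    task_intent: str = ""
    expert_profile: str = ""

    def render(self) -> str:
        """Assemble the final prompt text (this layout is an assumption)."""
        parts = [self.expert_profile, self.instruction]
        for ex in self.few_shot_examples:
            parts.append(
                f"Q: {ex.question}\nReasoning: {ex.reasoning}\nA: {ex.answer}"
            )
        if self.task_intent:
            parts.append(f"Keep in mind: {self.task_intent}")
        # Skip empty sections so the prompt has no blank blocks.
        return "\n\n".join(p for p in parts if p)
```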
Stage 1: Refinement of prompt instruction
The first stage focuses on refining the task instructions of a prompt. PromptWizard generates multiple candidate instructions, evaluates them using feedback from the LLM, and iteratively synthesizes improved versions. This process balances exploration—trying diverse ideas—and exploitation—refining the most promising ones.
For example, if an initial instruction yields suboptimal results, PW uses the LLM’s feedback to identify its shortcomings and generates an improved version. Repeating this cycle for three to five iterations is typically enough for the instruction to converge.
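The Stage 1 loop can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the framework's implementation: `llm` and `score` are hypothetical callables (a model call, and a function that scores an instruction on a small batch of training examples), and the prompt wording and default counts are invented for the example.

```python
from typing import Callable

LLM = Callable[[str], str]       # hypothetical: prompt text in, completion out
Scorer = Callable[[str], float]  # hypothetical: instruction in, score out


def refine_instruction(
    instruction: str,
    llm: LLM,
    score: Scorer,
    n_candidates: int = 3,
    n_iterations: int = 3,
) -> str:
    """Generate candidates, critique the best one, and keep any improvement."""
    best, best_score = instruction, score(instruction)
    for _ in range(n_iterations):
        # Exploration: ask for several differently worded candidates.
        candidates = [
            llm(f"Rewrite this task instruction in a new style ({i}):\n{best}")
            for i in range(n_candidates)
        ]
        scored = [(score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        # Exploitation: critique the top candidate, then synthesize a revision.
        critique = llm(f"List the weaknesses of this instruction:\n{top}")
        improved = llm(
            "Improve the instruction using the feedback.\n"
            f"Instruction:\n{top}\nFeedback:\n{critique}"
        )
        # Keep whichever version scores strictly higher than the current best.
        for cand_score, cand in ((top_score, top), (score(improved), improved)):
            if cand_score > best_score:
                best, best_score = cand, cand_score
    return best
```

Because a candidate replaces the current best only when it scores strictly higher, the returned instruction never scores below the starting one under the given scorer.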
Stage 2: Joint optimization of instructions and examples
The refined prompt obtained from Stage 1 is combined with carefully selected examples, and both are optimized together. Through the critique-and-synthesis mechanism, PromptWizard ensures alignment between the prompt and examples, simultaneously synthesizing new examples to enhance task performance.
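A simplified sketch of this joint critique-and-synthesis loop is shown below. The function name `jointly_optimize`, the prompt wording, the number of rounds, and the `llm` callable are all assumptions for illustration; the actual framework may select, score, and filter examples differently.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: prompt text in, completion out


def jointly_optimize(
    instruction: str,
    examples: list[str],
    llm: LLM,
    n_rounds: int = 2,
) -> tuple[str, list[str]]:
    """Alternate between improving the examples and improving the instruction."""
    for _ in range(n_rounds):
        shots = "\n".join(examples)
        # Critique the examples in light of the current instruction...
        example_feedback = llm(
            "Critique these examples for the instruction below.\n"
            f"Instruction: {instruction}\nExamples:\n{shots}"
        )
        # ...and synthesize a new example that addresses the critique.
        new_example = llm(
            "Write one new example that addresses this feedback.\n"
            f"Feedback: {example_feedback}"
        )
        examples = examples + [new_example]  # new list; caller's list is untouched
        shots = "\n".join(examples)
        # Then critique the instruction in light of the updated examples...
        instruction_feedback = llm(
            "Critique this instruction given the examples below.\n"
            f"Instruction: {instruction}\nExamples:\n{shots}"
        )
        # ...and rewrite it so that the two stay aligned.
        instruction = llm(
            "Rewrite the instruction to address this feedback.\n"
            f"Instruction: {instruction}\nFeedback: {instruction_feedback}"
        )
    return instruction, examples
```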
This structured approach makes PromptWizard highly versatile, adapting to tasks as varied as solving math problems or generating creative content.
Results
PromptWizard stands out for its feedback-driven refinement and systematic exploration, delivering exceptional results across a wide variety of tasks while maintaining computational efficiency.
Comprehensive evaluation across tasks
PromptWizard was rigorously evaluated on over 45 tasks, spanning both general and domain-specific challenges. Benchmarked against state-of-the-art techniques—including Instinct, InstructZero, APE, PromptBreeder, EvoPrompt, DSPy, APO, and PromptAgent—PW consistently outperformed competitors in accuracy, efficiency, and adaptability. Please see detailed results in our paper.
- Accuracy: PW achieved the best or near-best accuracy across tasks. Figure 4 shows the performance profile curve for the BigBench Instruction Induction (BBII) dataset, which indicates how frequently each method comes close to the best accuracy and highlights PW’s reliability.
- Efficiency: PW is also computationally efficient. Many baseline methods require extensive API calls and computational resources, whereas PW achieves superior results with minimal overhead by balancing exploration and exploitation. Table 1 shows the difference: PW used 69 API calls and 24k tokens, compared with 1,730 to 18,600 calls and 115k to 1,488k tokens for the baselines.
Methods | API calls | Total tokens |
---|---|---|
Instinct | 1730 | 115k |
PromptBreeder | 18600 | 1488k |
EvoPrompt | 5000 | 400k |
PW | 69 | 24k |
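To put the table in relative terms, the short calculation below divides each baseline's cost by PW's. The figures are copied from Table 1; the variable names are illustrative.

```python
# (API calls, total tokens in thousands), copied from Table 1.
costs = {
    "Instinct": (1730, 115),
    "PromptBreeder": (18600, 1488),
    "EvoPrompt": (5000, 400),
    "PW": (69, 24),
}

pw_calls, pw_tokens = costs["PW"]
# Ratio of each baseline's cost to PW's cost.
ratios = {
    method: (calls / pw_calls, tokens / pw_tokens)
    for method, (calls, tokens) in costs.items()
    if method != "PW"
}
for method, (call_ratio, token_ratio) in ratios.items():
    print(f"{method}: {call_ratio:.0f}x the API calls, {token_ratio:.1f}x the tokens")
```

On these figures, the baselines use roughly 25 to 270 times as many API calls as PW and roughly 5 to 62 times as many tokens.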
We have also conducted numerous experiments to highlight PromptWizard’s efficacy with limited training data and smaller LLMs.
Resilience with limited data
Real-world scenarios often lack abundant training data. PW excels in such conditions, requiring as few as five examples to produce effective prompts. Across five diverse datasets, average accuracy dropped by only about 5 percentage points (from 87 to 81.9) when PW used five examples instead of 25, highlighting its adaptability and efficiency (see Table 2).
Datasets | 5 Examples | 25 Examples |
---|---|---|
MMLU | 80.4 | 89.5 |
GSM8k | 94 | 95.4 |
Ethos | 86.4 | 89.4 |
PubMedQA | 68 | 78.2 |
MedQA | 80.4 | 82.9 |
Average | 81.9 | 87 |
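The per-dataset gap can be read off the table with a few lines of arithmetic. The scores below are copied from Table 2; the variable names are illustrative.

```python
# dataset: (accuracy with 5 examples, accuracy with 25 examples), from Table 2
scores = {
    "MMLU": (80.4, 89.5),
    "GSM8k": (94.0, 95.4),
    "Ethos": (86.4, 89.4),
    "PubMedQA": (68.0, 78.2),
    "MedQA": (80.4, 82.9),
}

# Accuracy lost, in points, when going from 25 examples down to 5.
drops = {name: round(full - few, 1) for name, (few, full) in scores.items()}
mean_drop = sum(drops.values()) / len(drops)

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: {drop:.1f} points lower with 5 examples")
print(f"Mean drop: {mean_drop:.1f} points")
```

The drop is smallest on GSM8k (1.4 points) and largest on PubMedQA (10.2 points), with a mean of roughly 5 points across the five datasets.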
Leveraging smaller models for optimization
PromptWizard also reduces computational costs by using smaller LLMs for prompt generation, reserving more powerful models for inference. For example, using Llama-70B instead of GPT-4 for prompt generation lowered accuracy by less than one point on both GSM8k and Ethos, while significantly lowering resource usage (see Table 3).
Dataset | Prompt Gen: Llama-70B | Prompt Gen: GPT4 |
---|---|---|
GSM8k | 94.6 | 95.4 |
Ethos | 89.2 | 89.4 |
Average | 91.9 | 92.4 |
PromptWizard shows that effective prompts combine optimized instructions refined through iterative feedback, thoughtfully chosen in-context examples, and a modular design that incorporates expert knowledge and task-specific intent. This approach enables the framework to handle a broad range of tasks, from simple to highly complex, with exceptional efficiency and flexibility.
Whether you are a researcher addressing cutting-edge challenges or an organization looking to streamline workflows, PromptWizard provides a practical, scalable, and impactful solution for enhancing model performance.