Prompt Engineering Tutorial

Learn how to use LLM-driven evolution to optimize prompt templates for better downstream task performance.

Academic Citation

If you use EvoToolkit in your research, please cite:

@article{guo2025evotoolkit,
title={evotoolkit: A Unified LLM-Driven Evolutionary Framework for Generalized Solution Search},
author={Guo, Ping and Zhang, Qingfu},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2025},
note={Submitted to arXiv}
}

Complete Example Code

This tutorial provides complete, runnable examples.

Run locally:

cd examples/prompt_optimization
python basic_example.py


Overview

This tutorial demonstrates:

  • Creating prompt optimization tasks
  • Using LLM-driven evolution to improve prompt templates
  • Testing prompts on specific downstream tasks
  • Evolving high-quality prompts automatically

Installation

Install EvoToolkit:

pip install evotoolkit

Prerequisites:

  • Python >= 3.11
  • LLM API access (OpenAI, Claude, or other compatible providers)
  • Basic understanding of prompt engineering

Understanding Prompt Optimization Tasks

What is a Prompt Optimization Task?

A prompt optimization task evolves string templates to maximize performance on downstream tasks. Unlike Python tasks that evolve code, prompt tasks evolve prompt text directly.

Aspect             Python Task                 Prompt Task
Solution type      Python code                 String template
Evolution target   Function/algorithm          Prompt text
Evaluation         Execute code                Test template with LLM
Example            def func(x): return x**2    "Solve: {question}\nAnswer:"
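
Concretely, a prompt template is an ordinary Python format string: the task fills the {question} placeholder before sending the prompt to the LLM. For example:

template = "Solve: {question}\nAnswer:"
prompt = template.format(question="What is 2+2?")
print(prompt)
# Solve: What is 2+2?
# Answer: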

Task Components

A prompt optimization task requires:

  • Test cases: Question-answer pairs for evaluation
  • Template syntax: String with {question} placeholder
  • LLM API: For testing prompt templates (or use mock mode)
  • Evaluation metric: Accuracy on test cases

Creating Your First Prompt Task

Step 1: Define Test Cases

Create test cases with questions and expected answers:

test_cases = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "What is 5*3?", "expected": "15"},
    {"question": "What is 10-7?", "expected": "3"},
    {"question": "What is 12/4?", "expected": "3"},
    {"question": "What is 7+8?", "expected": "15"},
]

Step 2: Create the Task

from evotoolkit.task import PromptOptimizationTask
from evotoolkit.tools.llm import HttpsApi

# Configure LLM API
llm_api = HttpsApi(
    api_url="your_api_url",  # e.g., "ai.api.example.com"
    key="your_api_key",       # Your API key
    model="gpt-4o"
)

task = PromptOptimizationTask(
    test_cases=test_cases,
    llm_api=llm_api,
    use_mock=False
)

Step 3: Test Initial Template

# Get initial solution
init_sol = task.make_init_sol_wo_other_info()

print(f"Initial template: {init_sol.sol_string}")
print(f"Accuracy: {init_sol.evaluation_res.score:.2%}")
print(f"Correct: {init_sol.evaluation_res.additional_info['correct']}/{init_sol.evaluation_res.additional_info['total']}")

Output:

Initial template: "Answer this question: {question}"
Accuracy: 100.00%
Correct: 5/5

Step 4: Test Custom Templates

# Test your own template
custom_template = "Solve this math problem and give only the number: {question}"
result = task.evaluate_code(custom_template)

print(f"Custom template: {custom_template}")
print(f"Accuracy: {result.score:.2%}")
print(f"Correct: {result.additional_info['correct']}/{result.additional_info['total']}")

Running Evolution to Optimize Prompts

Step 1: Create Interface

import evotoolkit
from evotoolkit.task import EvoEngineerStringInterface

# Create interface
interface = EvoEngineerStringInterface(task)

Step 2: Run Evolution

# Run evolution with LLM
result = evotoolkit.solve(
    interface=interface,
    output_path='./prompt_results',
    running_llm=llm_api,
    max_generations=10,
    pop_size=5,
    max_sample_nums=20
)

print(f"Best template found: {result.sol_string}")
print(f"Accuracy: {result.evaluation_res.score:.2%}")

Try Different Algorithms

EvoToolkit supports multiple evolutionary algorithms for prompt optimization:

# Using EoH
from evotoolkit.task import EoHStringInterface
interface = EoHStringInterface(task)

# Using FunSearch
from evotoolkit.task import FunSearchStringInterface
interface = FunSearchStringInterface(task)

# Using EvoEngineer (default)
from evotoolkit.task import EvoEngineerStringInterface
interface = EvoEngineerStringInterface(task)

Then use the same evotoolkit.solve() call to run evolution. Different interfaces may perform better on different tasks.
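
For example, the following sketch runs each interface on the same task and compares the best scores (the output paths are arbitrary names; relative performance is task-dependent):

import evotoolkit
from evotoolkit.task import (
    EoHStringInterface,
    EvoEngineerStringInterface,
    FunSearchStringInterface,
)

# Run each algorithm on the same task and report its best accuracy.
for name, interface_cls in [
    ("EoH", EoHStringInterface),
    ("FunSearch", FunSearchStringInterface),
    ("EvoEngineer", EvoEngineerStringInterface),
]:
    result = evotoolkit.solve(
        interface=interface_cls(task),
        output_path=f"./compare_{name}",
        running_llm=llm_api,
        max_generations=10,
        pop_size=5,
        max_sample_nums=20,
    )
    print(f"{name}: {result.evaluation_res.score:.2%}")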


Understanding Template Format

Valid Templates

Prompt templates must include the {question} placeholder:

# ✅ Good templates
"Answer this question: {question}"
"Solve this math problem: {question}\nGive only the number."
"Question: {question}\nThink step by step and provide only the final answer."
"Let's solve: {question}\nFirst, analyze the problem..."

# ❌ Bad templates (missing placeholder)
"Solve this problem"     # No {question} placeholder
"Answer: 42"            # No {question} placeholder

Template Evolution Example

During evolution, the LLM generates improved templates:

# Generation 1
"Answer: {question}"
# Accuracy: 60%

# Generation 3
"Solve this math problem: {question}\nProvide only the numerical answer."
# Accuracy: 85%

# Generation 7
"Calculate: {question}\nShow only the final number, no explanation."
# Accuracy: 100%

Use Cases and Applications

1. Math Problem Solving

test_cases = [
    {"question": "What is 15 * 7?", "expected": "105"},
    {"question": "What is 144 / 12?", "expected": "12"},
    # ...
]

task = PromptOptimizationTask(test_cases=test_cases, llm_api=llm_api)

2. Text Classification

test_cases = [
    {"question": "This movie is amazing!", "expected": "positive"},
    {"question": "This movie is terrible!", "expected": "negative"},
    {"question": "I loved this film!", "expected": "positive"},
    # ...
]

task = PromptOptimizationTask(test_cases=test_cases, llm_api=llm_api)

3. Information Extraction

test_cases = [
    {"question": "Extract the date: The meeting is on 2024-03-15", "expected": "2024-03-15"},
    {"question": "Extract the date: We'll meet on March 20th, 2024", "expected": "2024-03-20"},
    # ...
]

task = PromptOptimizationTask(test_cases=test_cases, llm_api=llm_api)

4. Translation Tasks

test_cases = [
    {"question": "Translate to French: Hello", "expected": "Bonjour"},
    {"question": "Translate to French: Thank you", "expected": "Merci"},
    # ...
]

task = PromptOptimizationTask(test_cases=test_cases, llm_api=llm_api)

Customizing Evolution Behavior

The quality of evolved prompts is controlled by the evolution method and its internal prompt design. To improve results:

  • Adjust prompts: Inherit existing Interface classes and customize LLM prompts
  • Develop new algorithms: Create entirely new evolutionary strategies

Learn More

These are general techniques that apply to all tasks; the quick example below applies them to prompt optimization.

Quick Example - Custom Prompts for Prompt Optimization:

from evotoolkit.task import EvoEngineerStringInterface

class CustomPromptInterface(EvoEngineerStringInterface):
    """Interface optimized for prompt template evolution."""

    def get_operator_prompt(self, operator_name, selected_individuals,
                           current_best_sol, random_thoughts, **kwargs):
        """Customize mutation prompt to emphasize clarity and structure."""

        if operator_name == "mutation":
            task_description = self.task.get_base_task_description()
            individual = selected_individuals[0]

            prompt = f"""# Prompt Template Optimization

{task_description}

## Current Best Template
**Accuracy:** {current_best_sol.evaluation_res.score:.2%}
**Template:** {current_best_sol.sol_string}

## Template to Mutate
**Accuracy:** {individual.evaluation_res.score:.2%}
**Template:** {individual.sol_string}

## Optimization Guidelines
Focus on improving the template by:
- Adding clear instructions
- Specifying output format explicitly
- Including relevant context or examples
- Using appropriate tone and style
- Ensuring the {{question}} placeholder is preserved

Generate an improved template that increases accuracy.

## Response Format:
name: [descriptive_name]
code:
[Your improved template with {{question}} placeholder]
thought: [reasoning for changes]
"""
            return [{"role": "user", "content": prompt}]

        # Use default for other operators
        return super().get_operator_prompt(operator_name, selected_individuals,
                                          current_best_sol, random_thoughts, **kwargs)

# Use custom interface
interface = CustomPromptInterface(task)
result = evotoolkit.solve(
    interface=interface,
    output_path='./custom_results',
    running_llm=llm_api,
    max_generations=10
)

Understanding Evaluation

Scoring Mechanism

  1. Template Testing: Each template is tested on all test cases
  2. LLM Response: The LLM generates answers using the template
  3. Answer Checking: Responses are compared to expected answers
  4. Accuracy Calculation: Score = (correct answers) / (total test cases)
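
In outline, the scoring loop looks roughly like this (a sketch of the mechanism above, not EvoToolkit's actual implementation; llm stands in for any callable that sends a prompt and returns the response text):

# Sketch only: illustrates how accuracy is computed from test cases.
def score_template(template, test_cases, llm):
    correct = 0
    for case in test_cases:
        prompt = template.format(question=case["question"])  # fill placeholder
        response = llm(prompt)                               # query the LLM
        if response.strip() == case["expected"]:             # exact-match check
            correct += 1
    return correct / len(test_cases)                         # accuracy in [0, 1]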

Evaluation Output

result = task.evaluate_code(template)

if result.valid:
    print(f"Accuracy: {result.score:.2%}")
    print(f"Correct: {result.additional_info['correct']}/{result.additional_info['total']}")
    print(f"Details: {result.additional_info['details']}")
else:
    print(f"Error: {result.additional_info['error_msg']}")

Mock Mode for Testing

Use mock mode to test without LLM API costs:

# Mock mode always returns correct answers for testing
task = PromptOptimizationTask(
    test_cases=test_cases,
    use_mock=True  # No actual LLM calls
)

# Good for:
# - Testing task setup
# - Debugging template format
# - Understanding the workflow
# - Developing custom interfaces
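
For example, a quick end-to-end smoke test with no API calls:

# Smoke-test the workflow in mock mode; no LLM requests are made.
mock_task = PromptOptimizationTask(test_cases=test_cases, use_mock=True)
result = mock_task.evaluate_code("Answer: {question}")
print(f"Mock accuracy: {result.score:.2%}")  # mock answers are always correct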

Custom Evaluation Logic

For specialized tasks, you can customize answer checking:

from evotoolkit.task import PromptOptimizationTask

class CustomPromptTask(PromptOptimizationTask):
    """Custom task with specialized answer checking."""

    def _check_answer(self, response: str, expected: str) -> bool:
        """Custom evaluation logic."""
        # Example: Case-insensitive comparison
        return response.strip().lower() == expected.strip().lower()

        # Example: Fuzzy matching
        # from difflib import SequenceMatcher
        # similarity = SequenceMatcher(None, response, expected).ratio()
        # return similarity > 0.8

        # Example: Regex matching
        # import re
        # return bool(re.search(expected, response))

# Use custom task
test_cases = [
    {"question": "Capital of France?", "expected": "paris"},
    # "Paris", "PARIS", "paris" all accepted
]

task = CustomPromptTask(test_cases=test_cases, llm_api=llm_api)
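
Because _check_answer is an ordinary method, you can sanity-check it directly in mock mode before spending API calls:

# Verify the custom matcher defined above; no LLM needed.
t = CustomPromptTask(test_cases=test_cases, use_mock=True)
assert t._check_answer("Paris", "paris")     # case-insensitive match
assert t._check_answer("  PARIS ", "paris")  # surrounding whitespace stripped
assert not t._check_answer("Lyon", "paris")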

Complete Example

Here's a full working example:

import evotoolkit
from evotoolkit.task import PromptOptimizationTask, EvoEngineerStringInterface
from evotoolkit.tools.llm import HttpsApi

# 1. Define test cases
test_cases = [
    {"question": "What is 2+2?", "expected": "4"},
    {"question": "What is 5*3?", "expected": "15"},
    {"question": "What is 10-7?", "expected": "3"},
    {"question": "What is 12/4?", "expected": "3"},
    {"question": "What is 7+8?", "expected": "15"},
]

# 2. Configure LLM API
llm_api = HttpsApi(
    api_url="your_api_url",  # e.g., "ai.api.example.com"
    key="your_api_key",       # Your API key
    model="gpt-4o"
)

# 3. Create task
task = PromptOptimizationTask(
    test_cases=test_cases,
    llm_api=llm_api,
    use_mock=False
)

# 4. Create interface
interface = EvoEngineerStringInterface(task)

# 5. Run evolution
result = evotoolkit.solve(
    interface=interface,
    output_path='./prompt_optimization_results',
    running_llm=llm_api,
    max_generations=10,
    pop_size=5,
    max_sample_nums=20
)

# 6. Show results
print(f"Best template found:")
print(f"  {result.sol_string}")
print(f"Accuracy: {result.evaluation_res.score:.2%}")
print(f"Correct: {result.evaluation_res.additional_info['correct']}/{result.evaluation_res.additional_info['total']}")

Next Steps

Explore Different Optimization Strategies

  • Try different evolutionary algorithms (EvoEngineer variants, EoH, FunSearch)
  • Compare results across different interfaces
  • Experiment with different test case sets
  • Test on various downstream tasks

Customize and Improve Evolution

  • Examine prompt designs in existing Interface classes
  • Inherit and override Interfaces to customize prompts
  • Design specialized prompts for different task types
  • Develop new evolutionary algorithms if needed

Learn More