# Examples
This page provides comprehensive examples of how to configure and run evaluations with Eve-evalkit. All examples use the YAML configuration format with the `evaluate.py` script.
## Basic Structure
Every configuration file has the following structure:
```yaml
constants:    # Define reusable values
  # ...
wandb:        # Optional: WandB integration
  # ...
models:       # One or more models to evaluate
  # ...
output_dir:   # Where to save results
```
## Example 1: EVE Earth Observation Tasks
Evaluate a model on Earth Observation-specific tasks:
```yaml
constants:
  judge_api_key: sk-or-v1-xxxxx
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411
  tasks:
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 10000
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    - name: open_ended
      num_fewshot: 5
      max_tokens: 40000
      judge_api_key: !ref judge_api_key
      judge_base_url: !ref judge_base_url
      judge_name: !ref judge_name

wandb:
  enabled: true
  project: eve-evaluations
  entity: LLM4EO
  run_name: eve-model-v1

models:
  - name: eve-esa/eve_v0.1
    base_url: https://api.runpod.ai/v2/endpoint-id/openai/v1/chat/completions
    api_key: your-runpod-api-key
    temperature: 0.1
    num_concurrent: 10
    timeout: 600
    tasks: !ref tasks

output_dir: evals_outputs
```
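The `!ref` tag reuses a value defined under `constants`. As a rough sketch of that substitution semantics (a plain-Python stand-in; the `Ref` class and `resolve` helper are illustrative, not Eve-evalkit's actual loader):

```python
class Ref:
    """Marker standing in for a `!ref some_constant` YAML tag."""
    def __init__(self, key):
        self.key = key

def resolve(value, constants):
    """Recursively replace Ref markers with the named constant's value."""
    if isinstance(value, Ref):
        return resolve(constants[value.key], constants)  # refs may chain
    if isinstance(value, dict):
        return {k: resolve(v, constants) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, constants) for v in value]
    return value

constants = {
    "judge_name": "mistralai/mistral-large-2411",
    "tasks": [{"name": "open_ended", "judge_name": Ref("judge_name")}],
}
model = {"name": "eve-esa/eve_v0.1", "tasks": Ref("tasks")}
resolved = resolve(model, constants)
print(resolved["tasks"][0]["judge_name"])  # mistralai/mistral-large-2411
```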
## Example 2: Using LM-Eval-Harness Tasks
Eve-evalkit supports all tasks from `lm-evaluation-harness`. Here's an example using MMLU-Pro, GSM8K, and HellaSwag:
```yaml
constants:
  tasks:
    - name: mmlu_pro
      num_fewshot: 5
      max_tokens: 1000
    - name: gsm8k
      num_fewshot: 8
      max_tokens: 512
    - name: hellaswag
      num_fewshot: 10
      max_tokens: 100

models:
  - name: gpt-4
    base_url: https://api.openai.com/v1/chat/completions
    api_key: your-openai-api-key
    temperature: 0.0
    num_concurrent: 3
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 3: Mixed EVE and Standard Tasks
Combine Earth Observation tasks with standard benchmarks:
```yaml
constants:
  judge_api_key: your-judge-api-key
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411
  tasks:
    # EVE Earth Observation tasks
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    # Standard benchmark tasks
    - name: mmlu_pro
      num_fewshot: 5
      max_tokens: 1000
    - name: arc_challenge
      num_fewshot: 25
      max_tokens: 100

wandb:
  enabled: true
  project: comprehensive-eval
  entity: your-org

models:
  - name: your-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 4: Multiple Models
Evaluate multiple models on the same tasks:
```yaml
constants:
  tasks:
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 1000

wandb:
  enabled: true
  project: model-comparison

models:
  - name: model-a
    base_url: https://api.provider-a.com/v1/chat/completions
    api_key: api-key-a
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks
  - name: model-b
    base_url: https://api.provider-b.com/v1/chat/completions
    api_key: api-key-b
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 5: Using Environment Variables
Instead of hardcoding API keys, use environment variables:
```yaml
constants:
  judge_api_key: ${JUDGE_API_KEY}
  tasks:
    - name: mcqa_single_answer
      num_fewshot: 0
      max_tokens: 20000
      judge_api_key: !ref judge_api_key
      judge_base_url: https://openrouter.ai/api/v1
      judge_name: mistralai/mistral-large-2411

models:
  - name: my-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: ${MODEL_API_KEY}
    tasks: !ref tasks

wandb:
  enabled: true
  api_key: ${WANDB_API_KEY}
  project: my-project

output_dir: evals_outputs
```
Set environment variables before running:
```bash
export JUDGE_API_KEY=your-judge-key
export MODEL_API_KEY=your-model-key
export WANDB_API_KEY=your-wandb-key
python evaluate.py evals.yaml
```
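For illustration, `${VAR}` placeholders of this kind are typically expanded from the process environment before the config is used. A minimal sketch of such expansion (the `expand_env` helper is hypothetical, not Eve-evalkit's implementation):

```python
import os
import re

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def expand_env(value):
    """Recursively replace ${NAME} placeholders with environment values."""
    if isinstance(value, str):
        # Leave the placeholder intact if the variable is unset
        return _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value

os.environ["MODEL_API_KEY"] = "sk-demo"
cfg = {"models": [{"api_key": "${MODEL_API_KEY}", "timeout": 600}]}
expanded = expand_env(cfg)
print(expanded["models"][0]["api_key"])  # sk-demo
```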
## Example 6: Testing with Limited Samples
Test your configuration on a small subset before running full evaluation:
```yaml
constants:
  tasks:
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
      limit: 10  # Only evaluate first 10 samples
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 1000
      limit: 5  # Only evaluate first 5 samples

models:
  - name: test-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 2
    tasks: !ref tasks

output_dir: test_outputs
```
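Conceptually, `limit` truncates each task's sample list before any requests are made; a tiny illustrative sketch (the `apply_limit` helper is not Eve-evalkit's code):

```python
def apply_limit(samples, limit=None):
    """Keep at most `limit` samples; None means evaluate everything."""
    return list(samples) if limit is None else list(samples)[:limit]

dataset = [f"sample_{i}" for i in range(100)]
subset = apply_limit(dataset, limit=10)
print(len(subset))  # 10
```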
## Example 7: Using Seed for Reproducibility
Control randomness in evaluations by setting a seed value. This is especially useful for open-ended tasks with multiple judges, where answer order is randomized:
```yaml
constants:
  judge_api_key: ${JUDGE_API_KEY}
  judge_base_url: https://openrouter.ai/api/v1/
  concurrent_requests: 20
  open_ended_judges:
    - name: qwen3-235b
      model: qwen/qwen3-235b-a22b-2507
      api_key: ${JUDGE_API_KEY}
      base_url: https://openrouter.ai/api/v1/
      prompt_path: metrics/prompts/llm_judge_qa.yaml
    - name: mistral-large
      model: mistral-large-2512
      api_key: ${MISTRAL_API_KEY}
      base_url: https://api.mistral.ai/v1
      prompt_path: metrics/prompts/llm_judge_qa.yaml
  tasks:
    # Same task with different seeds for variance analysis
    - name: open_ended_0_shot_seed_1234
      task_name: open_ended
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      batch_size: !ref concurrent_requests
      seed: 1234  # Fixed seed for reproducibility
    - name: open_ended_0_shot_seed_5678
      task_name: open_ended
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      batch_size: !ref concurrent_requests
      seed: 5678  # Different seed
    # Other tasks with seeds
    - name: mcqa_single_answer_0_shot_seed_1234
      task_name: mcqa_single_answer
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 15
      seed: 1234

wandb:
  enabled: true
  project: seed-reproducibility-test
  entity: your-org

models:
  - name: your-model
    base_url: http://localhost:8010/v1/
    api_key: EMPTY
    temperature: 0.0
    num_concurrent: !ref concurrent_requests
    tasks: !ref tasks
    timeout: 180

output_dir: evals_outputs
```
**Why use seeds?**

- Reproduce the exact same results across runs
- Compare model performance with controlled randomness
- Debug evaluation issues with consistent behavior
- Analyze variance by running the same task with different seeds
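As an illustration of the answer-order randomization mentioned above, a per-task seed can drive an isolated RNG so the shuffle repeats exactly across runs (the `shuffle_choices` helper is illustrative, not Eve-evalkit's code):

```python
import random

def shuffle_choices(choices, seed):
    """Shuffle answer choices deterministically for a given seed."""
    rng = random.Random(seed)  # isolated RNG; global random state untouched
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled

choices = ["Option A", "Option B", "Option C", "Option D"]
# Identical seeds give identical orderings on every run
assert shuffle_choices(choices, 1234) == shuffle_choices(choices, 1234)
print(shuffle_choices(choices, 1234))
```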
## Example 8: Using EVE API for RAG-Enhanced Evaluation
Evaluate using the EVE API, which provides RAG (Retrieval-Augmented Generation) responses enriched with Earth Observation context:
```yaml
constants:
  # EVE API credentials
  eve_email: your-email@example.com
  eve_password: your-eve-password
  eve_base_url: http://0.0.0.0:8000/
  eve_public_collections: ['qwen-512-filtered', 'Wikipedia EO', 'Wiley AI Gateway']
  eve_k: 10  # Number of documents to retrieve
  eve_threshold: 0.5  # Similarity threshold
  concurrent_requests: 7
  # Judge configuration for open-ended tasks
  open_ended_judges:
    - name: qwen3-235b
      model: qwen/qwen3-235b-a22b-2507
      api_key: ${OPENROUTER_API_KEY}
      base_url: https://openrouter.ai/api/v1/
      prompt_path: metrics/prompts/llm_judge_qa.yaml
    - name: mistral-large
      model: mistral-large-2512
      api_key: ${MISTRAL_API_KEY}
      base_url: https://api.mistral.ai/v1
      prompt_path: metrics/prompts/llm_judge_qa.yaml
  tasks:
    - name: open_ended_0_shot
      task_name: open_ended
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      model_type: eve-api  # Use EVE API model type
    - name: mcqa_multiple_answer_0_shot
      task_name: mcqa_multiple_answer
      num_fewshot: 0
      max_tokens: 10000
      model_type: eve-api
    - name: mcqa_single_answer_0_shot
      task_name: mcqa_single_answer
      model_type: eve-api
      num_fewshot: 0
      max_tokens: 1000
    - name: hallucination_detection_0_shot
      task_name: hallucination_detection
      num_fewshot: 0
      max_tokens: 10000
      model_type: eve-api

wandb:
  enabled: true
  project: eve-api-evaluation
  entity: LLM4EO

models:
  - name: eve-api
    # EVE API configuration
    email: !ref eve_email
    password: !ref eve_password
    base_url: !ref eve_base_url
    public_collections: !ref eve_public_collections
    k: !ref eve_k
    threshold: !ref eve_threshold
    # General settings
    temperature: 0.0
    num_concurrent: !ref concurrent_requests
    tasks: !ref tasks
    timeout: 180

output_dir: evals_outputs_eve_api
```
**EVE API Configuration Parameters:**

- `email`: Email for EVE API authentication
- `password`: Password for EVE API authentication
- `base_url`: Base URL for the EVE API endpoint
- `public_collections`: List of document collections to search for RAG
- `k`: Number of documents to retrieve (default: 5)
- `threshold`: Similarity threshold for document retrieval (default: 0.5)
- `model_type: eve-api`: Must be specified in each task configuration
**Important Notes:**

- The EVE API automatically retrieves relevant context for each query
- Retrieved documents are used to enhance the model's responses
- Especially useful for Earth Observation domain-specific questions
- Requires a running EVE API server
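For a sense of what the retrieval parameters control, here is a hypothetical client-side sketch that assembles them into a query payload; the field names mirror the config keys above but are not the actual EVE API schema:

```python
def build_rag_query(question, public_collections, k=5, threshold=0.5):
    """Assemble an illustrative retrieval-augmented query payload."""
    return {
        "query": question,
        "public_collections": list(public_collections),
        "k": k,                  # number of documents to retrieve
        "threshold": threshold,  # minimum similarity score to keep a document
    }

payload = build_rag_query(
    "What does NDVI measure?",
    ["qwen-512-filtered", "Wikipedia EO"],
    k=10,
    threshold=0.5,
)
print(payload["k"])  # 10
```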
## Running Examples
To run any of these examples:
1. Save the configuration to a file (e.g., `evals.yaml`)
2. Replace placeholder values (API keys, URLs, etc.) with your actual values
3. Run the evaluation:

```bash
python evaluate.py evals.yaml
```
## Next Steps
- Learn more about EO Tasks
- Review the Getting Started guide for detailed configuration options
- Check the Code Reference for API documentation