# Examples
This page provides comprehensive examples of how to configure and run evaluations with Eve-evalkit. All examples use the YAML configuration format with the `evaluate.py` script.
## Basic Structure
Every configuration file has the following structure:
```yaml
constants:    # Define reusable values
  # ...
wandb:        # Optional: WandB integration
  # ...
models:       # One or more models to evaluate
  # ...
output_dir:   # Where to save results
```
## Example 1: EVE Earth Observation Tasks
Evaluate a model on Earth Observation-specific tasks:
```yaml
constants:
  judge_api_key: sk-or-v1-xxxxx
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411
  tasks:
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 10000
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    - name: open_ended
      num_fewshot: 5
      max_tokens: 40000
      judge_api_key: !ref judge_api_key
      judge_base_url: !ref judge_base_url
      judge_name: !ref judge_name

wandb:
  enabled: true
  project: eve-evaluations
  entity: LLM4EO
  run_name: eve-model-v1

models:
  - name: eve-esa/eve_v0.1
    base_url: https://api.runpod.ai/v2/endpoint-id/openai/v1/chat/completions
    api_key: your-runpod-api-key
    temperature: 0.1
    num_concurrent: 10
    timeout: 600
    tasks: !ref tasks

output_dir: evals_outputs
```
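The `!ref` tag reuses a value defined under `constants`. As a rough sketch of that substitution semantics (a plain-Python stand-in; the `Ref` class and `resolve` helper are illustrative, not Eve-evalkit's actual loader):

```python
class Ref:
    """Marker standing in for a `!ref some_constant` YAML tag."""
    def __init__(self, key):
        self.key = key

def resolve(value, constants):
    """Recursively replace Ref markers with the named constant's value."""
    if isinstance(value, Ref):
        return resolve(constants[value.key], constants)  # refs may chain
    if isinstance(value, dict):
        return {k: resolve(v, constants) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, constants) for v in value]
    return value

constants = {
    "judge_name": "mistralai/mistral-large-2411",
    "tasks": [{"name": "open_ended", "judge_name": Ref("judge_name")}],
}
model = {"name": "eve-esa/eve_v0.1", "tasks": Ref("tasks")}
resolved = resolve(model, constants)
print(resolved["tasks"][0]["judge_name"])  # mistralai/mistral-large-2411
```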
## Example 2: Using LM-Eval-Harness Tasks
Eve-evalkit supports all tasks from `lm-evaluation-harness`. Here's an example using MMLU-Pro, GSM8K, and HellaSwag:
```yaml
constants:
  tasks:
    - name: mmlu_pro
      num_fewshot: 5
      max_tokens: 1000
    - name: gsm8k
      num_fewshot: 8
      max_tokens: 512
    - name: hellaswag
      num_fewshot: 10
      max_tokens: 100

models:
  - name: gpt-4
    base_url: https://api.openai.com/v1/chat/completions
    api_key: your-openai-api-key
    temperature: 0.0
    num_concurrent: 3
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 3: Mixed EVE and Standard Tasks
Combine Earth Observation tasks with standard benchmarks:
```yaml
constants:
  judge_api_key: your-judge-api-key
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411
  tasks:
    # EVE Earth Observation tasks
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    # Standard benchmark tasks
    - name: mmlu_pro
      num_fewshot: 5
      max_tokens: 1000
    - name: arc_challenge
      num_fewshot: 25
      max_tokens: 100

wandb:
  enabled: true
  project: comprehensive-eval
  entity: your-org

models:
  - name: your-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 4: Multiple Models
Evaluate multiple models on the same tasks:
```yaml
constants:
  tasks:
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 1000

wandb:
  enabled: true
  project: model-comparison

models:
  - name: model-a
    base_url: https://api.provider-a.com/v1/chat/completions
    api_key: api-key-a
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks
  - name: model-b
    base_url: https://api.provider-b.com/v1/chat/completions
    api_key: api-key-b
    temperature: 0.1
    num_concurrent: 5
    tasks: !ref tasks

output_dir: evals_outputs
```
## Example 5: Using Environment Variables
Instead of hardcoding API keys, use environment variables:
```yaml
constants:
  judge_api_key: ${JUDGE_API_KEY}
  tasks:
    - name: mcqa_single_answer
      num_fewshot: 0
      max_tokens: 20000
      judge_api_key: !ref judge_api_key
      judge_base_url: https://openrouter.ai/api/v1
      judge_name: mistralai/mistral-large-2411

models:
  - name: my-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: ${MODEL_API_KEY}
    tasks: !ref tasks

wandb:
  enabled: true
  api_key: ${WANDB_API_KEY}
  project: my-project

output_dir: evals_outputs
```
Set environment variables before running:
```bash
export JUDGE_API_KEY=your-judge-key
export MODEL_API_KEY=your-model-key
export WANDB_API_KEY=your-wandb-key
python evaluate.py evals.yaml
```
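For illustration, `${VAR}` placeholders of this kind are typically expanded from the process environment before the config is used. A minimal sketch of such expansion (the `expand_env` helper is hypothetical, not Eve-evalkit's implementation):

```python
import os
import re

_ENV_VAR = re.compile(r"\$\{(\w+)\}")

def expand_env(value):
    """Recursively replace ${NAME} placeholders with environment values."""
    if isinstance(value, str):
        # Leave the placeholder intact if the variable is unset
        return _ENV_VAR.sub(lambda m: os.environ.get(m.group(1), m.group(0)), value)
    if isinstance(value, dict):
        return {k: expand_env(v) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v) for v in value]
    return value

os.environ["MODEL_API_KEY"] = "sk-demo"
cfg = {"models": [{"api_key": "${MODEL_API_KEY}", "timeout": 600}]}
expanded = expand_env(cfg)
print(expanded["models"][0]["api_key"])  # sk-demo
```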
## Example 6: Testing with Limited Samples
Test your configuration on a small subset before running full evaluation:
```yaml
constants:
  tasks:
    - name: hallucination_detection
      num_fewshot: 0
      max_tokens: 100
      limit: 10  # Only evaluate first 10 samples
    - name: mcqa_single_answer
      num_fewshot: 2
      max_tokens: 1000
      limit: 5  # Only evaluate first 5 samples

models:
  - name: test-model
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 2
    tasks: !ref tasks

output_dir: test_outputs
```
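Conceptually, `limit` truncates each task's sample list before any requests are made; a tiny illustrative sketch (the `apply_limit` helper is not Eve-evalkit's code):

```python
def apply_limit(samples, limit=None):
    """Keep at most `limit` samples; None means evaluate everything."""
    return list(samples) if limit is None else list(samples)[:limit]

dataset = [f"sample_{i}" for i in range(100)]
subset = apply_limit(dataset, limit=10)
print(len(subset))  # 10
```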
## Example 7: Using Seed for Reproducibility
Control randomness in evaluations by setting a seed value. This is especially useful for open-ended tasks with multiple judges, where answer order is randomized:
```yaml
constants:
  judge_api_key: ${JUDGE_API_KEY}
  judge_base_url: https://openrouter.ai/api/v1/
  concurrent_requests: 20
  open_ended_judges:
    - name: qwen3-235b
      model: qwen/qwen3-235b-a22b-2507
      api_key: ${JUDGE_API_KEY}
      base_url: https://openrouter.ai/api/v1/
      prompt_path: metrics/prompts/llm_judge_qa.yaml
    - name: mistral-large
      model: mistral-large-2512
      api_key: ${MISTRAL_API_KEY}
      base_url: https://api.mistral.ai/v1
      prompt_path: metrics/prompts/llm_judge_qa.yaml
  tasks:
    # Same task with different seeds for variance analysis
    - name: open_ended_0_shot_seed_1234
      task_name: open_ended
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      batch_size: !ref concurrent_requests
      seed: 1234  # Fixed seed for reproducibility
    - name: open_ended_0_shot_seed_5678
      task_name: open_ended
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      batch_size: !ref concurrent_requests
      seed: 5678  # Different seed
    # Other tasks with seeds
    - name: mcqa_single_answer_0_shot_seed_1234
      task_name: mcqa_single_answer
      model_type: local-chat-completions
      num_fewshot: 0
      max_tokens: 15
      seed: 1234

wandb:
  enabled: true
  project: seed-reproducibility-test
  entity: your-org

models:
  - name: your-model
    base_url: http://localhost:8010/v1/
    api_key: EMPTY
    temperature: 0.0
    num_concurrent: !ref concurrent_requests
    tasks: !ref tasks
    timeout: 180

output_dir: evals_outputs
```
**Why use seeds?**

- Reproduce the exact same results across runs
- Compare model performance with controlled randomness
- Debug evaluation issues with consistent behavior
- Analyze variance by running the same task with different seeds
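As an illustration of the answer-order randomization mentioned above, a per-task seed can drive an isolated RNG so the shuffle repeats exactly across runs (the `shuffle_choices` helper is illustrative, not Eve-evalkit's code):

```python
import random

def shuffle_choices(choices, seed):
    """Shuffle answer choices deterministically for a given seed."""
    rng = random.Random(seed)  # isolated RNG; global random state untouched
    shuffled = list(choices)
    rng.shuffle(shuffled)
    return shuffled

choices = ["Option A", "Option B", "Option C", "Option D"]
# Identical seeds give identical orderings on every run
assert shuffle_choices(choices, 1234) == shuffle_choices(choices, 1234)
print(shuffle_choices(choices, 1234))
```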
## Example 8: Using EVE API for RAG-Enhanced Evaluation
Evaluate using the EVE API, which provides RAG (Retrieval-Augmented Generation) responses enriched with Earth Observation context:
```yaml
constants:
  # EVE API credentials
  eve_email: your-email@example.com
  eve_password: your-eve-password
  eve_base_url: http://0.0.0.0:8000/
  eve_public_collections: ['qwen-512-filtered', 'Wikipedia EO', 'Wiley AI Gateway']
  eve_k: 10  # Number of documents to retrieve
  eve_threshold: 0.5  # Similarity threshold
  concurrent_requests: 7
  # Judge configuration for open-ended tasks
  open_ended_judges:
    - name: qwen3-235b
      model: qwen/qwen3-235b-a22b-2507
      api_key: ${OPENROUTER_API_KEY}
      base_url: https://openrouter.ai/api/v1/
      prompt_path: metrics/prompts/llm_judge_qa.yaml
    - name: mistral-large
      model: mistral-large-2512
      api_key: ${MISTRAL_API_KEY}
      base_url: https://api.mistral.ai/v1
      prompt_path: metrics/prompts/llm_judge_qa.yaml
  tasks:
    - name: open_ended_0_shot
      task_name: open_ended
      num_fewshot: 0
      max_tokens: 10000
      judges: !ref open_ended_judges
      model_type: eve-api  # Use EVE API model type
    - name: mcqa_multiple_answer_0_shot
      task_name: mcqa_multiple_answer
      num_fewshot: 0
      max_tokens: 10000
      model_type: eve-api
    - name: mcqa_single_answer_0_shot
      task_name: mcqa_single_answer
      model_type: eve-api
      num_fewshot: 0
      max_tokens: 1000
    - name: hallucination_detection_0_shot
      task_name: hallucination_detection
      num_fewshot: 0
      max_tokens: 10000
      model_type: eve-api

wandb:
  enabled: true
  project: eve-api-evaluation
  entity: LLM4EO

models:
  - name: eve-api
    # EVE API configuration
    email: !ref eve_email
    password: !ref eve_password
    base_url: !ref eve_base_url
    public_collections: !ref eve_public_collections
    k: !ref eve_k
    threshold: !ref eve_threshold
    # General settings
    temperature: 0.0
    num_concurrent: !ref concurrent_requests
    tasks: !ref tasks
    timeout: 180

output_dir: evals_outputs_eve_api
```
**EVE API Configuration Parameters:**

- `email`: Email for EVE API authentication
- `password`: Password for EVE API authentication
- `base_url`: Base URL for the EVE API endpoint
- `public_collections`: List of document collections to search for RAG
- `k`: Number of documents to retrieve (default: 5)
- `threshold`: Similarity threshold for document retrieval (default: 0.5)
- `model_type: eve-api`: Must be specified in each task configuration
**Important Notes:**

- The EVE API automatically retrieves relevant context for each query
- Retrieved documents are used to enhance the model's responses
- Especially useful for Earth Observation domain-specific questions
- Requires a running EVE API server
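For a sense of what the retrieval parameters control, here is a hypothetical client-side sketch that assembles them into a query payload; the field names mirror the config keys above but are not the actual EVE API schema:

```python
def build_rag_query(question, public_collections, k=5, threshold=0.5):
    """Assemble an illustrative retrieval-augmented query payload."""
    return {
        "query": question,
        "public_collections": list(public_collections),
        "k": k,                  # number of documents to retrieve
        "threshold": threshold,  # minimum similarity score to keep a document
    }

payload = build_rag_query(
    "What does NDVI measure?",
    ["qwen-512-filtered", "Wikipedia EO"],
    k=10,
    threshold=0.5,
)
print(payload["k"])  # 10
```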
## Running Examples
To run any of these examples:
1. Save the configuration to a file (e.g., `evals.yaml`)
2. Replace placeholder values (API keys, URLs, etc.) with your actual values
3. Run the evaluation:

```bash
python evaluate.py evals.yaml
```
## Next Steps
- Learn more about EO Tasks
- Review the Getting Started guide for detailed configuration options
- Check the Code Reference for API documentation