Getting Started with Eve-evalkit
Eve-evalkit is built on top of the EleutherAI Language Model Evaluation Harness, which means it supports all tasks available in the lm-evaluation-harness in addition to the custom Earth Observation tasks.
Quick Start
1. Installation
Follow the installation instructions in the README:
# Clone the repository
git clone https://github.com/eve-esa/evalkit.git
cd evalkit
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv sync
2. Running Evaluations
The recommended way to run evaluations is using the YAML configuration file. Create an evals.yaml file:
constants:
  judge_api_key: your-judge-api-key
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411

tasks:
  - name: hallucination_detection
    num_fewshot: 0
    max_tokens: 100
  - name: mcqa_single_answer
    num_fewshot: 2
    max_tokens: 1000

wandb:
  enabled: true
  project: eve-evaluations
  entity: your-wandb-entity
  run_name: my-evaluation
  api_key: your-wandb-api-key

models:
  - name: your-model-name
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 5
    timeout: 180
    tasks: !ref tasks

output_dir: evals_outputs
Run the evaluation:
python evaluate.py evals.yaml
Configuration File Structure
Constants Section
Define reusable values that can be referenced throughout the config using !ref:
constants:
  judge_api_key: your-judge-api-key
  judge_base_url: https://openrouter.ai/api/v1
  judge_name: mistralai/mistral-large-2411
  hf_token: your-huggingface-token  # Optional: for private datasets

tasks:
  - name: task_name
    num_fewshot: 0
    max_tokens: 1000
    judge_api_key: !ref judge_api_key  # Reference to constant
    judge_base_url: !ref judge_base_url
    judge_name: !ref judge_name
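Conceptually, !ref substitutes a constant's value wherever the tag appears. The sketch below illustrates that substitution logic on an already-parsed config; it is not the evalkit's actual parser, and the dictionary literal (with refs kept as plain strings) stands in for a loaded YAML document.

```python
def resolve_refs(node, constants):
    """Recursively replace '!ref NAME' strings with values from constants.
    Illustrative stand-in for the YAML-tag resolution the evalkit performs."""
    if isinstance(node, str) and node.startswith("!ref "):
        return constants[node.split(" ", 1)[1]]
    if isinstance(node, list):
        return [resolve_refs(v, constants) for v in node]
    if isinstance(node, dict):
        return {k: resolve_refs(v, constants) for k, v in node.items()}
    return node

# A parsed config, with !ref tags kept as plain strings for illustration
config = {
    "constants": {"judge_name": "mistralai/mistral-large-2411"},
    "tasks": [{"name": "open_ended", "judge_name": "!ref judge_name"}],
}
resolved = resolve_refs(config, config["constants"])
print(resolved["tasks"][0]["judge_name"])  # mistralai/mistral-large-2411
```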
Tasks Configuration
Each task can have the following parameters:
tasks:
  - name: task_name                      # Required: Task identifier
    task_name: base_task                 # Optional: Base task name (for custom naming)
    num_fewshot: 0                       # Number of few-shot examples (default: 0)
    max_tokens: 1000                     # Maximum tokens for generation (default: 512)
    temperature: 0.0                     # Sampling temperature (default: 0.0)
    limit: 100                           # Optional: Limit number of samples to evaluate
    seed: 1234                           # Optional: Random seed for reproducibility
    model_type: local-chat-completions   # Optional: Model type (local-chat-completions, eve-api)
    judge_api_key: api-key               # Required for single-judge LLM-as-judge tasks
    judge_base_url: base-url             # Required for single-judge LLM-as-judge tasks
    judge_name: model-name               # Required for single-judge LLM-as-judge tasks
    judges: []                           # Optional: List of judges for multi-judge evaluation
Random Seed for Reproducibility
The seed parameter allows you to control randomness in evaluations for reproducible results:
tasks:
  - name: open_ended_0_shot_seed_1234
    task_name: open_ended
    num_fewshot: 0
    max_tokens: 10000
    seed: 1234  # Fixed seed ensures same results across runs
Benefits of using seeds:
- Reproduce the exact same results across multiple runs
- Debug evaluation issues with consistent behavior
- Compare model performance with controlled randomness
- Run variance analysis by using multiple different seeds
Example: Running same task with different seeds
tasks:
  - name: open_ended_seed_1234
    task_name: open_ended
    seed: 1234
  - name: open_ended_seed_5678
    task_name: open_ended
    seed: 5678
  - name: open_ended_seed_9012
    task_name: open_ended
    seed: 9012
This configuration runs the same task three times with different random seeds to analyze variance in results.
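Once the three runs finish, the per-seed scores can be summarized with a mean and standard deviation. A quick standard-library sketch; the scores below are hypothetical, stand-in values:

```python
import statistics

# Hypothetical judge scores from the three seed runs above
scores = {1234: 0.81, 5678: 0.79, 9012: 0.83}

mean = statistics.mean(scores.values())
stdev = statistics.stdev(scores.values())  # sample standard deviation
print(f"mean={mean:.3f} stdev={stdev:.3f}")  # mean=0.810 stdev=0.020
```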
Multi-Judge Evaluation
For more robust evaluation, you can use multiple judges to evaluate each answer. This is particularly useful for open-ended tasks where a single judge might introduce bias.
constants:
  judges:
    - name: qwen3
      model: qwen/qwen3-235b-a22b-2507
      api_key: your_openrouter_api_key
      base_url: https://openrouter.ai/api/v1/
    - name: mistral_large
      model: mistralai/mistral-large-2411
      api_key: your_openrouter_api_key
      base_url: https://openrouter.ai/api/v1/
    - name: claude_sonnet
      model: anthropic/claude-3.5-sonnet
      api_key: your_openrouter_api_key
      base_url: https://openrouter.ai/api/v1/

tasks:
  - name: open_ended_multi_judge
    task_name: open_ended
    num_fewshot: 0
    max_tokens: 10000
    judges: !ref judges  # Use multiple judges
    batch_size: 15
Multi-Judge Metrics:
- llm_as_judge_{judge_name}: Individual score from each judge
- judge_voting: Majority vote result (recommended primary metric)
- llm_as_judge_avg: Average score across all judges
- judge_agreement: Percentage of samples where all judges agree
Recommendations:
- Use 3 judges for a good balance between cost and reliability
- Use 5 judges for high-stakes evaluations
- Avoid 2 judges (risk of ties)
- Mix different model architectures and providers for diversity
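The voting and agreement metrics are simple aggregations over per-judge verdicts. A minimal sketch of how they can be computed, assuming binary verdicts (the judge names and verdict values below are hypothetical examples, not evalkit internals):

```python
from collections import Counter

# Hypothetical per-sample verdicts from three judges (1 = pass, 0 = fail)
verdicts = [
    {"qwen3": 1, "mistral_large": 1, "claude_sonnet": 0},
    {"qwen3": 1, "mistral_large": 1, "claude_sonnet": 1},
    {"qwen3": 0, "mistral_large": 0, "claude_sonnet": 0},
]

def majority_vote(sample):
    # Most common verdict across judges; an odd judge count avoids ties
    return Counter(sample.values()).most_common(1)[0][0]

judge_voting = sum(majority_vote(s) for s in verdicts) / len(verdicts)
judge_agreement = sum(len(set(s.values())) == 1 for s in verdicts) / len(verdicts)
print(f"judge_voting={judge_voting:.2f} judge_agreement={judge_agreement:.2f}")
```

With three judges, every sample has a strict majority, which is why the recommendations above favor odd judge counts.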
Models Configuration
Configure one or more models to evaluate:
models:
  - name: model-identifier
    base_url: https://api.provider.com/v1/chat/completions
    api_key: your-api-key
    temperature: 0.1
    num_concurrent: 5  # Concurrent API requests (default: 3)
    timeout: 180       # Request timeout in seconds (default: 300)
    tasks: !ref tasks  # Reference to tasks list
EVE API Model Configuration
The EVE API provides Retrieval-Augmented Generation (RAG) responses grounded in Earth Observation context. To use the EVE API:
constants:
  # EVE API credentials
  eve_email: your-email@example.com
  eve_password: your-password
  eve_base_url: http://0.0.0.0:8000/
  eve_public_collections: ['qwen-512-filtered', 'Wikipedia EO', 'Wiley AI Gateway']
  eve_k: 10
  eve_threshold: 0.5

tasks:
  - name: open_ended_0_shot
    task_name: open_ended
    num_fewshot: 0
    max_tokens: 10000
    model_type: eve-api  # Specify eve-api model type

models:
  - name: eve-api
    # EVE API configuration
    email: !ref eve_email
    password: !ref eve_password
    base_url: !ref eve_base_url
    public_collections: !ref eve_public_collections
    k: !ref eve_k
    threshold: !ref eve_threshold
    # General settings
    temperature: 0.0
    num_concurrent: 7
    tasks: !ref tasks
    timeout: 180
EVE API Parameters:
- email: Email for EVE API authentication
- password: Password for EVE API authentication
- base_url: Base URL for the EVE API endpoint
- public_collections: List of document collections to search for RAG context
- k: Number of documents to retrieve (default: 5)
- threshold: Similarity threshold for document retrieval (default: 0.5)
Important Notes:
- Tasks using EVE API must specify model_type: eve-api in the task configuration
- The EVE API automatically retrieves relevant context documents for each query
- Retrieved documents are used to enhance the model's responses
- Especially useful for Earth Observation domain-specific questions
- Requires a running EVE API server at the specified base_url
Example Use Case:
The EVE API is particularly valuable for:
- Evaluating models on Earth Observation tasks with factual grounding
- Comparing RAG-enhanced responses vs. non-RAG responses
- Testing model performance with domain-specific context retrieval
- Hallucination detection where factual context is critical
Weights & Biases (WandB) Logging
Enable experiment tracking with WandB:
wandb:
  enabled: true                # Enable/disable WandB logging
  project: project-name        # WandB project name
  entity: organization-name    # WandB entity/organization
  run_name: custom-run-name    # Optional: Custom run name prefix
  api_key: your-wandb-api-key  # WandB API key
When enabled, the evaluation will log:
- Evaluation metrics (accuracy, F1, IoU, etc.)
- Individual sample predictions
- Task configurations
- Model metadata
- Evaluation duration and timestamps
Output Directory
Specify where evaluation results should be saved:
output_dir: evals_outputs # Default: eval_results
Example Configurations
See the dedicated Examples page for comprehensive configuration examples, including:
- EVE Earth Observation tasks
- LM-Eval-Harness standard benchmarks
- Mixed evaluations
- Multiple model comparisons
- Environment variable usage
- Testing with limited samples
Output Structure
After running evaluations, results are saved in a directory tree organized first by task, then by model:
{output_dir}/
├── {task_name_1}/
│   ├── {model_name_sanitized}/
│   │   ├── results_{timestamp}.json
│   │   └── samples_{task_name}_{timestamp}.jsonl
│   ├── {another_model_name_sanitized}/
│   │   ├── results_{timestamp}.json
│   │   └── samples_{task_name}_{timestamp}.jsonl
│   └── ...
└── {task_name_2}/
    └── ...
Example Structure:
evals_outputs/
├── hallucination_detection/
│   ├── eve-esa__eve_v0.1/
│   │   ├── results_2025-12-01T10-17-45.479920.json
│   │   └── samples_hallucination_detection_2025-12-01T10-17-45.479920.jsonl
│   └── gpt-4/
│       ├── results_2025-12-01T10-20-15.123456.json
│       └── samples_hallucination_detection_2025-12-01T10-20-15.123456.jsonl
├── mcqa_single_answer/
│   ├── eve-esa__eve_v0.1/
│   │   ├── results_2025-12-01T11-23-12.123456.json
│   │   └── samples_mcqa_single_answer_2025-12-01T11-23-12.123456.jsonl
│   └── gpt-4/
│       ├── results_2025-12-01T11-25-30.789012.json
│       └── samples_mcqa_single_answer_2025-12-01T11-25-30.789012.jsonl
└── open_ended/
    ├── eve-esa__eve_v0.1/
    │   ├── results_2025-12-01T12-34-56.789012.json
    │   └── samples_open_ended_2025-12-01T12-34-56.789012.jsonl
    └── gpt-4/
        ├── results_2025-12-01T12-40-10.456789.json
        └── samples_open_ended_2025-12-01T12-40-10.456789.jsonl
This structure makes it easy to compare multiple models on the same task.
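Because the layout is uniform, the result files can also be gathered programmatically. A standard-library sketch that assumes the directory layout shown above:

```python
from pathlib import Path

def collect_results(output_dir):
    """Map (task_name, model_name) to the newest results_*.json file."""
    results = {}
    for task_dir in sorted(Path(output_dir).iterdir()):
        if not task_dir.is_dir():
            continue
        for model_dir in sorted(task_dir.iterdir()):
            if not model_dir.is_dir():
                continue
            files = sorted(model_dir.glob("results_*.json"))
            if files:
                # ISO-style timestamps in the filenames sort chronologically
                results[(task_dir.name, model_dir.name)] = files[-1]
    return results
```

Picking the lexicographically last file works because the timestamp format sorts in chronological order.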
Results File Format
The results_{timestamp}.json file contains:
{
  "results": {
    "task_name": {
      "alias": "task_name",
      "metric_1,none": 0.85,
      "metric_1_stderr,none": 0.02,
      "metric_2,none": 0.78,
      "metric_2_stderr,none": 0.03
    }
  },
  "group_subtasks": {},
  "configs": {
    "task_name": {
      "task": "task_name",
      "dataset_path": "dataset-path",
      "num_fewshot": 0,
      "metadata": {}
    }
  },
  "versions": {},
  "n-shot": {},
  "n-samples": {},
  "config": {},
  "git_hash": "abc123",
  "date": 1701234567.89,
  "total_evaluation_time_seconds": "123.45"
}
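To pull the headline numbers out of such a file, something like the following works. This is a sketch based only on the structure shown above: it drops the ",none" filter suffix and the stderr entries, and keeps numeric values.

```python
import json

def load_metrics(results_path):
    """Return {task: {metric: value}} from a results_*.json file,
    dropping the ',none' filter suffix and *_stderr entries."""
    with open(results_path) as f:
        data = json.load(f)
    metrics = {}
    for task, values in data["results"].items():
        metrics[task] = {
            key.split(",")[0]: val
            for key, val in values.items()
            if isinstance(val, (int, float)) and "_stderr" not in key
        }
    return metrics
```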
Samples File Format
The samples_{task_name}_{timestamp}.jsonl file contains individual predictions:
{"doc_id": 0, "doc": {...}, "target": "expected", "arguments": [...], "resps": [["predicted"]], "filtered_resps": ["predicted"], "doc_hash": "abc123", "prompt_hash": "def456", "task_name": "task_name"}
{"doc_id": 1, "doc": {...}, "target": "expected", "arguments": [...], "resps": [["predicted"]], "filtered_resps": ["predicted"], "doc_hash": "ghi789", "prompt_hash": "jkl012", "task_name": "task_name"}
...
Each line contains:
- doc: The input document/question
- target: Expected answer
- resps: Raw model response
- filtered_resps: Processed model response
- Metadata for reproducibility
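Since the samples file is plain JSONL, re-scoring or error analysis needs nothing beyond the standard library. For instance, a sketch of recomputing exact-match accuracy from the fields listed above (the actual per-task scoring may be more involved):

```python
import json

def exact_match_accuracy(samples_path):
    """Fraction of samples whose filtered response equals the target."""
    total = correct = 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            total += 1
            predicted = str(sample["filtered_resps"][0]).strip()
            if predicted == str(sample["target"]).strip():
                correct += 1
    return correct / total if total else 0.0
```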
WandB Integration
When WandB logging is enabled, the following information is automatically logged:
Metrics Logged
- Aggregate Metrics: Final scores for each metric (accuracy, F1, IoU, etc.)
- Per-Sample Metrics: Individual predictions and correctness
- Task Metadata: Dataset paths, splits, versions
- Model Configuration: API endpoints, temperatures, timeouts
- Evaluation Metadata: Git hash, timestamps, duration
Viewing Results
After evaluation completes, visit your WandB project to:
- Compare Models: View metrics across different models side-by-side
- Analyze Samples: Inspect individual predictions and failures
- Track Progress: Monitor evaluation progress in real-time
- Visualize Trends: Plot metric distributions and comparisons
Example WandB Output
Run: eve-model-v1-20251201
├── Summary Metrics
│   ├── hallucination_detection/acc: 0.822
│   ├── hallucination_detection/f1: 0.841
│   ├── hallucination_detection/precision: 0.869
│   ├── mcqa_single_answer/acc: 0.756
│   └── open_ended/llm_judge: 0.834
├── Config
│   ├── model: eve-esa/eve_v0.1
│   ├── temperature: 0.1
│   └── num_concurrent: 10
└── Samples
    ├── hallucination_detection_samples.csv
    ├── mcqa_single_answer_samples.csv
    └── open_ended.csv
Advanced Usage
Using Environment Variables
Instead of hardcoding API keys, use environment variables. See Examples for detailed configuration.
Limiting Samples for Testing
Test your configuration on a small subset. See Examples for detailed configuration.
Direct Command Line
For quick tests, you can use the lm_eval command directly:
lm_eval --model openai-chat-completions \
  --model_args base_url=https://api.provider.com,model=model-name,num_concurrent=5 \
  --tasks hallucination_detection,mcqa_single_answer \
  --include_path tasks \
  --num_fewshot 0 \
  --output_path ./outputs \
  --log_samples \
  --apply_chat_template
Available Tasks
EVE Earth Observation Tasks
See the EO Tasks page for detailed information about:
- mcqa_multiple_answer
- mcqa_single_answer
- open_ended
- open_ended_w_context
- refusal
- hallucination_detection
LM-Evaluation-Harness Tasks
All tasks from the lm-evaluation-harness are supported, including:
Popular Benchmarks:
- mmlu_pro - MMLU-Pro (challenging multiple-choice)
- gsm8k - Grade School Math
- hellaswag - Commonsense reasoning
- arc_challenge - AI2 Reasoning Challenge
- truthfulqa - Truthfulness evaluation
- winogrande - Commonsense reasoning
- piqa - Physical commonsense
- And more...
To list all available tasks:
lm_eval --tasks list
Troubleshooting
Common Issues
1. API Timeout Errors
Increase the timeout value:
models:
  - name: your-model
    timeout: 600  # Increase to 10 minutes
2. Rate Limiting
Reduce concurrent requests:
models:
  - name: your-model
    num_concurrent: 1  # Reduce concurrency
3. Judge Model Errors
Ensure judge credentials are set for tasks that require them:
tasks:
  - name: open_ended
    judge_api_key: !ref judge_api_key  # Required!
    judge_base_url: !ref judge_base_url
    judge_name: !ref judge_name
4. WandB Login Issues
Login before running:
wandb login your-api-key
Next Steps
- Explore Tasks: Check out the EO Tasks page for details on Earth Observation evaluation tasks
Support
For issues or questions:
- GitHub Issues: eve-esa/evalkit
- Documentation: https://docs.eve-evaluation.org
- LM-Eval-Harness: EleutherAI Documentation