
Eve-evalkit

Welcome to the Eve-evalkit documentation. This framework provides comprehensive tools for evaluating language models on tasks and benchmarks specific to Earth Observation (EO).

What is Eve-evalkit?

Eve-evalkit is built on top of the EleutherAI Language Model Evaluation Harness, providing:

  • Custom EO Tasks: Specialized evaluation tasks for the Earth Observation domain, including MCQA, hallucination detection, and more
  • Full LM-Eval-Harness Support: Access to all standard benchmarks (MMLU-Pro, GSM8K, HellaSwag, etc.)
  • WandB Integration: Automatic experiment tracking and metric logging
  • Flexible Configuration: YAML-based configuration for easy experiment management (a sketch follows this list)
  • Production Ready: Built for evaluating models via API endpoints with concurrent requests

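To give a sense of what the YAML-based configuration can look like, here is a minimal sketch. The field names (`model`, `model_args`, `tasks`, `wandb`) and task IDs are illustrative assumptions, not the framework's actual schema; see Getting Started for the real configuration reference.

```yaml
# Minimal, hypothetical evaluation config. Field names and task IDs are
# illustrative assumptions, not Eve-evalkit's actual schema.
model: local-completions          # OpenAI-compatible API backend
model_args:
  base_url: http://localhost:8000/v1
  num_concurrent: 8               # parallel API requests
tasks:
  - eo_mcqa                       # custom Earth Observation task
  - mmlu_pro                      # standard lm-eval-harness benchmark
wandb:
  project: eo-evals               # WandB experiment tracking
```
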
Key Features

Earth Observation Tasks

Evaluate models on specialized EO capabilities:

  • Multiple-Choice QA: Single and multiple-answer questions from EO curricula
  • Open-Ended QA: Free-form question answering with and without context
  • Hallucination Detection: Identify fabricated or unsupported information
  • Refusal Testing: Assess appropriate refusal behavior when context is insufficient

Comprehensive Metrics

  • LLM-as-Judge: Score free-form responses using judge models
  • Traditional Metrics: Accuracy, F1, Precision, Recall, IoU
  • Semantic Metrics: BERTScore, Cosine Similarity (see the worked example after this list)
  • Generation Metrics: BLEU, ROUGE

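To show what a semantic metric computes, here is a self-contained BERTScore example using the open-source `bert-score` package. It illustrates the metric itself, not Eve-evalkit's internal evaluation code; the candidate and reference sentences are made up for the example.

```python
# Standalone BERTScore illustration using the open-source bert-score package.
# This shows the metric itself, not Eve-evalkit's internal implementation.
from bert_score import score

candidates = ["Sentinel-2 provides optical imagery at up to 10 m resolution."]
references = ["Sentinel-2 acquires optical images with 10 m spatial resolution."]

# score() returns precision, recall, and F1 tensors, one entry per pair.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```
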
Production Features

  • API Model Support: Evaluate models via OpenAI-compatible endpoints
  • Concurrent Requests: Speed up evaluations with parallel API calls
  • Timeout Handling: Graceful handling of slow or failed requests (both patterns are sketched after this list)
  • Result Logging: Comprehensive JSON and JSONL outputs
  • WandB Integration: Track experiments and visualize metrics

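The following sketch shows the concurrency-plus-timeout pattern these features describe, using the `openai` Python client against an OpenAI-compatible endpoint. The endpoint URL, model name, and prompts are placeholders, and this mirrors the general pattern rather than Eve-evalkit's actual implementation.

```python
# Sketch of concurrent, timeout-guarded requests to an OpenAI-compatible
# endpoint. Illustrative pattern only; URL and model name are placeholders.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
semaphore = asyncio.Semaphore(8)  # cap the number of in-flight requests

async def ask(prompt: str, timeout: float = 30.0) -> str | None:
    async with semaphore:
        try:
            resp = await asyncio.wait_for(
                client.chat.completions.create(
                    model="my-eo-model",
                    messages=[{"role": "user", "content": prompt}],
                ),
                timeout=timeout,
            )
            return resp.choices[0].message.content
        except asyncio.TimeoutError:
            return None  # record the failure and keep the run going

async def main() -> None:
    prompts = ["What does NDVI measure?", "Define radiometric resolution."]
    answers = await asyncio.gather(*(ask(p) for p in prompts))
    print(answers)

asyncio.run(main())
```
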
Example Use Cases

Research & Development

  • Benchmark Earth Observation models against established tasks
  • Compare model performance across different architectures
  • Identify strengths and weaknesses in EO domain understanding

Model Selection

  • Evaluate multiple models on EO-specific capabilities
  • Compare general-purpose models vs. domain-specific models
  • Assess trade-offs between performance and cost

Documentation Structure

  • Getting Started: Installation, configuration, and basic usage
  • Examples: Comprehensive configuration examples and use cases
  • EO Tasks: Detailed task descriptions, metrics, and examples
  • Code Reference: API documentation and programmatic usage

Support & Contributing

  • GitHub: eve-esa/evalkit
  • Issues: Report bugs or request features on GitHub

Citation

If you use this evaluation framework in your research, please cite: