Eve-evalkit
Welcome to the Eve-evalkit documentation. This framework provides comprehensive tools for evaluating language models on Earth Observation (EO)-specific tasks and benchmarks.
What is Eve-evalkit?
Eve-evalkit is built on top of the EleutherAI Language Model Evaluation Harness, providing:
- Custom EO Tasks: Specialized evaluation tasks for the Earth Observation domain, including MCQA, hallucination detection, and more
- Full LM-Eval-Harness Support: Access to all standard benchmarks (MMLU-Pro, GSM8K, HellaSwag, etc.)
- WandB Integration: Automatic experiment tracking and metric logging
- Flexible Configuration: YAML-based configuration for easy experiment management (see the sketch after this list)
- Production Ready: Built for evaluating models via API endpoints with concurrent requests
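As a sketch of what the YAML-based configuration mentioned above might look like, here is a minimal, hypothetical config. The field names (`model`, `tasks`, `wandb`, and so on) are illustrative assumptions, not the authoritative Eve-evalkit schema; see Getting Started for the actual options.

```yaml
# Hypothetical evaluation config -- field names are illustrative,
# not the authoritative Eve-evalkit schema (see Getting Started).
model:
  type: openai-compatible        # evaluate via an API endpoint
  base_url: http://localhost:8000/v1
  name: my-eo-model
  max_concurrent_requests: 8     # parallel API calls
  timeout: 120                   # seconds per request

tasks:
  - eo_mcqa                      # custom EO task (name assumed)
  - mmlu_pro                     # standard lm-eval-harness benchmark

wandb:
  project: eo-evals
  name: baseline-run

output_dir: results/
```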
Quick Links
- Getting Started: Installation, configuration, and running your first evaluation
- Examples: Comprehensive configuration examples and use cases
- EO Tasks: Detailed information about Earth Observation evaluation tasks
- Code Reference: API documentation and code examples
Key Features
Earth Observation Tasks
Evaluate models on specialized EO capabilities:
- Multiple-Choice QA: Single and multiple-answer questions from EO curricula
- Open-Ended QA: Free-form question answering with and without context
- Hallucination Detection: Identify fabricated or unsupported information
- Refusal Testing: Assess appropriate refusal behavior when context is insufficient
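Because Eve-evalkit builds on the LM Evaluation Harness, its custom EO tasks can presumably be listed and selected the same way harness tasks are. The snippet below is a sketch under that assumption; the include path and the `eo_` name prefix are hypothetical.

```python
# Illustrative only: since Eve-evalkit builds on the LM Evaluation Harness,
# custom EO tasks can presumably be discovered the same way harness tasks are.
# The include path and task name prefix below are hypothetical.
from lm_eval.tasks import TaskManager

task_manager = TaskManager(include_path="eve_evalkit/tasks")  # assumed location of EO task configs
eo_tasks = [name for name in task_manager.all_tasks if name.startswith("eo_")]
print(eo_tasks)  # e.g. MCQA, hallucination-detection, and refusal tasks
```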
Comprehensive Metrics
- LLM-as-Judge: Sophisticated evaluation using judge models
- Traditional Metrics: Accuracy, F1, Precision, Recall, IoU
- Semantic Metrics: BERTScore, Cosine Similarity
- Generation Metrics: BLEU, ROUGE
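For orientation, this is how the traditional and generation metrics above are commonly computed with off-the-shelf libraries; Eve-evalkit's own implementations and metric names may differ.

```python
# Illustration of how traditional and generation metrics are typically computed
# with standard libraries; Eve-evalkit's own implementations may differ.
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer

# MCQA-style predictions vs. gold labels
gold = ["B", "A", "D", "C"]
pred = ["B", "A", "C", "C"]
print("accuracy:", accuracy_score(gold, pred))
print("macro F1:", f1_score(gold, pred, average="macro"))

# Open-ended answer scored against a reference with ROUGE-L
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score("Sentinel-2 revisits the same location every 5 days.",
                      "Sentinel-2 has a 5-day revisit time.")
print("ROUGE-L F1:", scores["rougeL"].fmeasure)
```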
Production Features
- API Model Support: Evaluate models via OpenAI-compatible endpoints
- Concurrent Requests: Speed up evaluations with parallel API calls (see the sketch after this list)
- Timeout Handling: Graceful handling of slow or failed requests
- Result Logging: Comprehensive JSON and JSONL outputs
- WandB Integration: Track experiments and visualize metrics
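The concurrent-request and timeout handling referenced above follows a standard pattern for OpenAI-compatible endpoints. The sketch below illustrates that pattern only; it is not Eve-evalkit's internal client, and the endpoint URL and model name are placeholders.

```python
# Sketch of the concurrent-request pattern for an OpenAI-compatible chat
# completions endpoint. Illustrative only -- not Eve-evalkit's internal client.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
MODEL = "my-eo-model"                                    # placeholder model name

def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send one chat completion request; return the answer or an error marker."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    try:
        resp = requests.post(ENDPOINT, json=payload, timeout=timeout)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
    except requests.RequestException as exc:   # timeouts, connection errors, HTTP errors
        return f"<request_failed: {exc}>"

prompts = ["What does NDVI measure?", "Which band combination highlights water?"]
with ThreadPoolExecutor(max_workers=8) as pool:  # parallel API calls
    answers = list(pool.map(ask, prompts))
```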
Example Use Cases
Research & Development
- Benchmark Earth Observation models against established tasks
- Compare model performance across different architectures
- Identify strengths and weaknesses in EO domain understanding
Model Selection
- Evaluate multiple models on EO-specific capabilities
- Compare general-purpose models with domain-specific models
- Assess trade-offs between performance and cost
Documentation Structure
- Getting Started: Installation, configuration, and basic usage
- Examples: Comprehensive configuration examples and use cases
- EO Tasks: Detailed task descriptions, metrics, and examples
- Code Reference: API documentation and programmatic usage
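Since the framework is built on the LM Evaluation Harness, programmatic usage likely mirrors the harness's `simple_evaluate` entry point; the Code Reference documents the framework's actual API. A minimal sketch under that assumption (the EO task name and model arguments are placeholders):

```python
# Hypothetical programmatic run mirroring the upstream LM Evaluation Harness API,
# on which Eve-evalkit is built. Task and model argument values are placeholders;
# consult the Code Reference for the framework's actual entry points.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="local-chat-completions",  # harness backend for OpenAI-compatible endpoints
    model_args="model=my-eo-model,base_url=http://localhost:8000/v1/chat/completions",
    tasks=["eo_mcqa", "gsm8k"],      # custom EO task name is assumed
)
print(results["results"])            # per-task metric dictionary
```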
Support & Contributing
- GitHub: eve-esa/evalkit
- Issues: Report bugs or request features on GitHub
Citation
If you use this evaluation framework in your research, please cite: