NorEval is a comprehensive evaluation benchmark for Norwegian language models, described in Mikhailov et al. (2025). NorEval version 1.1 consists of 35 tasks spanning linguistic knowledge, language understanding, world knowledge & reasoning, generation & summarization, and translation. All evaluations are conducted using lm-eval-harness.
Error bars on this dashboard show the combined uncertainty from two independent sources: the standard error of the score itself (SE) and the standard error associated with the prompt (prompt_SE).
The two components are added in quadrature, giving √(SE² + prompt_SE²). In aggregate views, SE is propagated as √(Σ SE²) / N across the N benchmarks being averaged.
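The two formulas above can be illustrated with a minimal Python sketch. The function names `combined_se` and `aggregate_se` are hypothetical and are not taken from the dashboard's code; the sketch assumes the per-benchmark errors are independent, which is what makes adding them in quadrature appropriate.

```python
import math


def combined_se(se: float, prompt_se: float) -> float:
    """Combine two independent standard errors in quadrature."""
    return math.sqrt(se ** 2 + prompt_se ** 2)


def aggregate_se(ses: list[float]) -> float:
    """Standard error of the mean of N independent benchmark scores:
    sqrt(sum of squared SEs) divided by N."""
    if not ses:
        raise ValueError("need at least one standard error")
    return math.sqrt(sum(s ** 2 for s in ses)) / len(ses)


if __name__ == "__main__":
    # Illustrative values only, not real NorEval results.
    print(combined_se(3.0, 4.0))     # 5.0
    print(aggregate_se([3.0, 4.0]))  # 2.5
```

Because the squared errors are summed before dividing by N, the aggregate error bar shrinks as more benchmarks are averaged, provided their errors are of similar size.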
Aggregate views normalize scores before averaging to put different metrics on a common scale. The default random baseline normalization maps each score to:
normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100
where 0 = random chance and 100 = perfect performance. Other normalization options (min-max, z-score, percentile) are available from the Normalization dropdown.
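As a worked example of the random baseline normalization, the following Python sketch applies the formula to a single score. The function name `normalize_random_baseline` and the sample numbers (a four-way multiple-choice task with a 25% random baseline) are assumptions chosen for illustration, not values from the dashboard.

```python
def normalize_random_baseline(
    raw: float, random_baseline: float, max_performance: float = 100.0
) -> float:
    """Map a raw score so that random chance is 0 and perfect performance is 100."""
    span = max_performance - random_baseline
    if span == 0:
        raise ValueError("max_performance must differ from random_baseline")
    return (raw - random_baseline) / span * 100


if __name__ == "__main__":
    # Hypothetical task: 62.5% accuracy with a 25% random baseline.
    print(normalize_random_baseline(62.5, 25.0))  # 50.0
```

A raw score below the random baseline yields a negative normalized value under this formula; the sketch does not clip it.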
@inproceedings{mikhailov-etal-2025-noreval,
title = "{N}or{E}val: A {N}orwegian Language Understanding and Generation Evaluation Benchmark",
author = "Mikhailov, Vladislav and Enstad, Tita and Samuel, David and
Farseth{\r{a}}s, Hans Christian and Kutuzov, Andrey and
Velldal, Erik and {\O}vrelid, Lilja",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
year = "2025",
url = "https://aclanthology.org/2025.findings-acl.181/",
}