NorEval is a comprehensive evaluation benchmark for Norwegian language models, described in Mikhailov et al. (2025). NorEval version 1.1 consists of 35 tasks spanning linguistic knowledge, language understanding, world knowledge & reasoning, generation & summarization, and translation. All evaluations are conducted using lm-eval-harness.
This view compares instruction-tuned models: Norwegian fine-tunes and multilingual instruct models. Models are evaluated 0-shot by default, since instruction tuning typically improves zero-shot performance.
Error bars on this dashboard show combined uncertainty from two independent sources, added in quadrature:
The two components are combined as √(SE² + prompt_SE²).
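As a rough illustration of the quadrature rule, the sketch below combines two standard errors into a single error-bar half-width. The function name `combined_error` and the example values are hypothetical and not taken from the dashboard's code; the only thing assumed from the text is that `SE` and `prompt_SE` are independent and expressed on the same scale.

```python
import math


def combined_error(se: float, prompt_se: float) -> float:
    """Combine two independent standard errors in quadrature.

    Hypothetical helper: computes sqrt(se**2 + prompt_se**2).
    """
    if se < 0 or prompt_se < 0:
        raise ValueError("standard errors must be non-negative")
    # math.hypot(a, b) returns sqrt(a*a + b*b) and avoids intermediate overflow.
    return math.hypot(se, prompt_se)


# Illustrative values only (not real NorEval results).
print(combined_error(3.0, 4.0))  # 5.0
```

Because the terms are squared before summing, the larger component dominates: if one source of uncertainty is much smaller than the other, the combined error is close to the larger one.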
Aggregate views normalize scores before averaging. The default random baseline normalization maps each score to:
normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100
where 0 = random chance and 100 = perfect performance.
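The normalization can be sketched in a few lines. This is a minimal example under stated assumptions, not the dashboard's actual implementation: the names `normalize` and `aggregate` are hypothetical, scores are assumed to be on a 0 to 100 scale, and the aggregate is assumed to be a plain unweighted mean of the normalized scores.

```python
from statistics import mean


def normalize(raw: float, random_baseline: float, max_performance: float = 100.0) -> float:
    """Map a raw score so that random chance is 0 and perfect performance is 100."""
    span = max_performance - random_baseline
    if span <= 0:
        raise ValueError("max_performance must exceed random_baseline")
    return (raw - random_baseline) / span * 100


def aggregate(scores: list[tuple[float, float]]) -> float:
    """Average normalized scores; each item is (raw, random_baseline).

    Assumes an unweighted mean and a maximum of 100 for every task.
    """
    return mean(normalize(raw, baseline) for raw, baseline in scores)


# Illustrative values only: a 4-way multiple-choice task has a 25% random baseline.
print(normalize(62.5, 25.0))                      # 50.0
print(aggregate([(62.5, 25.0), (75.0, 50.0)]))    # 50.0
```

Note that a raw score below the random baseline yields a negative normalized value, so tasks where a model performs worse than chance pull the aggregate down.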
@inproceedings{mikhailov-etal-2025-noreval,
title = "{N}or{E}val: A {N}orwegian Language Understanding and Generation Evaluation Benchmark",
author = "Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\r{a}}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
year = "2025",
url = "https://aclanthology.org/2025.findings-acl.181/",
}