NorEval dashboard

Norwegian instruction-tuned model evaluation

About NorEval

NorEval is a comprehensive evaluation benchmark for Norwegian language models, described in Mikhailov et al. (2025). NorEval version 1.1 consists of 35 tasks spanning linguistic knowledge, language understanding, world knowledge & reasoning, generation & summarization, and translation. All evaluations are run with lm-evaluation-harness.

This view compares instruction-tuned models: Norwegian fine-tunes and multilingual instruct models. Models are evaluated zero-shot by default, since instruction tuning typically improves zero-shot performance.

Error bars

Error bars on this dashboard show combined uncertainty from two independent sources, added in quadrature:

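  • Sampling error (±1 SE). For classification metrics, SE = √(v·(1−v)/n), where v is the score expressed as a proportion between 0 and 1 and n is the number of evaluation examples. For corpus-level metrics, SE is estimated via bootstrap resampling.
  • Prompt deviation (prompt_SE). prompt_SE = SD(scores across prompt variants) / √k, where k is the number of prompt variants.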
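
For corpus-level metrics, the bootstrap estimate of the sampling error might be sketched as follows. This is a minimal illustration rather than the dashboard's actual code: the function name bootstrap_se, the number of resamples, the seed, and the toy per-example scores are all assumptions made for the example.

```python
import random
import statistics
from typing import Callable, Sequence


def bootstrap_se(
    items: Sequence[float],
    metric: Callable[[Sequence[float]], float],
    n_resamples: int = 1000,  # assumed value, not taken from the dashboard
    seed: int = 0,
) -> float:
    """Estimate the standard error of `metric` by resampling items with replacement."""
    rng = random.Random(seed)
    n = len(items)
    replicates = []
    for _ in range(n_resamples):
        # Draw n items with replacement and recompute the metric on the resample.
        resample = [items[rng.randrange(n)] for _ in range(n)]
        replicates.append(metric(resample))
    # The spread (SD) of the bootstrap replicates approximates the SE of the metric.
    return statistics.stdev(replicates)


# Toy per-example scores (made up). A real corpus-level metric would be
# recomputed over each resampled set of predictions and references.
toy_scores = [0.2, 0.4, 0.4, 0.5, 0.7, 0.9, 0.3, 0.6]
se = bootstrap_se(toy_scores, statistics.mean)
```

With the mean as the metric, the result should land close to the analytic standard error of the mean, which is a convenient sanity check before swapping in a real corpus-level metric.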

The two components are combined as √(SE² + prompt_SE²).
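
As a rough sketch, the error-bar computation for a classification metric could look like the code below. The function names and the example numbers are hypothetical. The sketch also assumes that v is the mean score across prompt variants, that SD is the sample standard deviation, and that the prompt term is zero when only one prompt variant exists; the dashboard does not state any of these choices.

```python
import math
import statistics
from typing import Sequence


def sampling_se(v: float, n: int) -> float:
    """Binomial standard error for a proportion v in [0, 1] over n examples."""
    return math.sqrt(v * (1.0 - v) / n)


def prompt_se(prompt_scores: Sequence[float]) -> float:
    """SD of the scores across k prompt variants, divided by sqrt(k)."""
    k = len(prompt_scores)
    if k < 2:
        return 0.0  # assumption: no prompt term with a single variant
    # Assumption: sample SD (n - 1 denominator); the source does not specify.
    return statistics.stdev(prompt_scores) / math.sqrt(k)


def combined_error(se: float, p_se: float) -> float:
    """Combine two independent error sources in quadrature."""
    return math.sqrt(se ** 2 + p_se ** 2)


# Made-up accuracies for four prompt variants on a 1000-example task.
scores = [0.70, 0.74, 0.72, 0.68]
v = statistics.mean(scores)
se = sampling_se(v, n=1000)
p_se = prompt_se(scores)
total = combined_error(se, p_se)
```

Because the terms add in quadrature, the larger of the two dominates the total: on a large test set the prompt term tends to matter more, while on a small one the sampling term does.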

Normalization

Aggregate views normalize scores before averaging. The default random baseline normalization maps each score to:

normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100

where 0 = random chance and 100 = perfect performance.
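
The normalization formula above translates directly into a few lines of code. The function name and the example baseline (0.25, as for a hypothetical four-way multiple-choice task scored by accuracy) are assumptions for illustration only.

```python
def normalize(raw: float, random_baseline: float, max_performance: float) -> float:
    """Rescale a raw score so that random chance maps to 0 and a perfect score to 100."""
    if max_performance == random_baseline:
        raise ValueError("max_performance must differ from random_baseline")
    return (raw - random_baseline) / (max_performance - random_baseline) * 100.0


# Hypothetical four-way multiple-choice task: random chance is 0.25 accuracy.
at_chance = normalize(0.25, random_baseline=0.25, max_performance=1.0)
halfway = normalize(0.625, random_baseline=0.25, max_performance=1.0)
perfect = normalize(1.0, random_baseline=0.25, max_performance=1.0)
below_chance = normalize(0.20, random_baseline=0.25, max_performance=1.0)
```

A raw score below the random baseline yields a negative normalized value, so an aggregate computed this way can drop below 0 for models that perform worse than chance.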

Citation

@inproceedings{mikhailov-etal-2025-noreval,
    title = "{N}or{E}val: A {N}orwegian Language Understanding and Generation Evaluation Benchmark",
    author = "Mikhailov, Vladislav  and Enstad, Tita  and Samuel, David  and Farseth{\r{a}}s, Hans Christian  and Kutuzov, Andrey  and Velldal, Erik  and {\O}vrelid, Lilja",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    url = "https://aclanthology.org/2025.findings-acl.181/",
}
 

Tasks included in aggregation

Norwegian language models

Multilingual & non-Norwegian language models