NorEval dashboard

Norwegian base-model evaluation

About NorEval

NorEval is a comprehensive evaluation benchmark for Norwegian language models, described in Mikhailov et al. (2025). NorEval version 1.1 consists of 35 tasks spanning linguistic knowledge, language understanding, world knowledge & reasoning, generation & summarization, and translation. All evaluations are conducted using lm-eval-harness.

Design principles

  • Native, non-translated datasets. The benchmarks are built from natively Norwegian sources rather than machine-translated English datasets. This avoids translationese artifacts and better reflects genuine Norwegian language use.
  • Both written standards. Norwegian has two official written standards: Bokmål (nob) and Nynorsk (nno). NorEval includes parallel benchmark variants for both, enabling direct comparison of model performance across the two standards.
  • Northern Sámi. Two translation tasks (nob↔sme) and one linguistic acceptability task (MultiBLiMP) cover Northern Sámi, an endangered Uralic language spoken in Norway, Sweden, and Finland.
  • Multiple prompt templates. Most tasks are evaluated with 4–6 different prompt formulations to measure sensitivity to prompt phrasing. The dashboard reports a configurable summary statistic across prompts (max, mean, median, or min).
  • Few-shot evaluation. Each task is evaluated in 0-shot, 1-shot, and 5-shot settings to assess in-context learning.
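
As an illustration of the configurable summary statistic across prompts, the following minimal sketch collapses the per-prompt scores of one task into a single number. The function name, the option mapping, and the example scores are hypothetical and are not taken from the dashboard's code or results.

```python
import statistics

# Hypothetical mapping from the dashboard's summary options to Python functions.
SUMMARIES = {
    "max": max,
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
}


def summarize_prompts(scores, how="max"):
    """Collapse per-prompt scores for one task and shot setting into one number."""
    if not scores:
        raise ValueError("need at least one prompt score")
    if how not in SUMMARIES:
        raise ValueError(f"unknown summary statistic: {how}")
    return SUMMARIES[how](scores)


# Made-up scores for five prompt variants of a single task.
prompt_scores = [0.60, 0.64, 0.58, 0.70, 0.62]
for how in SUMMARIES:
    print(how, summarize_prompts(prompt_scores, how))
```

Choosing max rewards a model's best prompt, while mean and median describe typical behaviour and min describes the worst case.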

Error bars

Error bars on this dashboard show combined uncertainty from two independent sources, added in quadrature:

  • Sampling error (±1 SE). For classification metrics (accuracy, F1, exact match), SE = √(v·(1−v)/n), where v is the score and n is the number of samples. For corpus-level metrics (BLEU, chrF, ROUGE), SE is estimated via bootstrap resampling (100 iterations).
  • Prompt deviation. Computed as SD(scores across prompt variants) / √(k), where k is the number of prompt variants. This captures uncertainty due to prompt formulation and contributes nothing for single-prompt benchmarks.
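
The two components can be sketched in a few lines of Python. This is an illustrative sketch and not the dashboard's implementation: the function names are hypothetical, the bootstrap resamples individual examples with replacement and takes the standard deviation of the recomputed metric, and the prompt deviation assumes the sample standard deviation (n − 1 denominator), which the description above does not specify.

```python
import math
import random
import statistics


def binomial_se(v, n):
    """Sampling SE for a score v in [0, 1] computed over n samples."""
    return math.sqrt(v * (1.0 - v) / n)


def bootstrap_se(samples, metric, iterations=100, seed=0):
    """Bootstrap SE of a corpus-level metric.

    `samples` is a list of per-example items and `metric` maps such a list
    to a single score; both are placeholders for real BLEU/chrF/ROUGE inputs.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(iterations):
        # Resample n examples with replacement and recompute the metric.
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(metric(resample))
    return statistics.stdev(estimates)


def prompt_se(prompt_scores):
    """SD across prompt variants divided by sqrt(k); zero for a single prompt."""
    k = len(prompt_scores)
    if k < 2:
        return 0.0
    # Assumption: sample standard deviation (n - 1 denominator).
    return statistics.stdev(prompt_scores) / math.sqrt(k)
```

For example, a score of 0.5 on 100 samples gives binomial_se(0.5, 100) ≈ 0.05.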

The two components are combined as √(SE² + prompt_SE²), where SE is the sampling error and prompt_SE is the prompt deviation. In aggregate views, SE is propagated as √(Σ SE²) / N across the N benchmarks being averaged.
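
A minimal sketch of the combination and propagation steps follows. The function names are hypothetical, and the sketch assumes that the aggregate formula is applied to the per-benchmark error values and that the benchmarks are independent, which is the condition under which √(Σ SE²) / N is the standard error of a mean.

```python
import math


def combined_error(sampling_se, prompt_se):
    """Add sampling error and prompt deviation in quadrature."""
    return math.sqrt(sampling_se ** 2 + prompt_se ** 2)


def aggregate_se(ses):
    """Propagate per-benchmark errors to the mean over N benchmarks."""
    if not ses:
        raise ValueError("need at least one benchmark")
    n = len(ses)
    return math.sqrt(sum(se ** 2 for se in ses)) / n


# Made-up values: one benchmark with sampling SE 0.03 and prompt deviation 0.04.
print(combined_error(0.03, 0.04))
# Made-up values: averaging two benchmarks with errors 0.03 and 0.04.
print(aggregate_se([0.03, 0.04]))
```

Because the errors add in quadrature, the larger of the two components dominates the combined error bar.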

Normalization

Aggregate views normalize scores before averaging to put different metrics on a common scale. The default option, random-baseline normalization, maps each score to:

normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100

where 0 = random chance and 100 = perfect performance. Other normalization options (min-max, z-score, percentile) are available from the Normalization dropdown.
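
The random-baseline formula can be sketched as follows. The function name and the example values are hypothetical; the example assumes a four-way multiple-choice task with accuracy on a 0 to 1 scale, so the random baseline is 0.25 and the maximum is 1.0.

```python
def normalize(raw, random_baseline, max_performance):
    """Map a raw score so that 0 is random chance and 100 is perfect performance."""
    span = max_performance - random_baseline
    if span <= 0:
        raise ValueError("max_performance must exceed random_baseline")
    return (raw - random_baseline) / span * 100.0


# Made-up example: accuracy 0.625 on a four-way multiple-choice task.
print(normalize(0.625, random_baseline=0.25, max_performance=1.0))  # 50.0
```

Under this formula, a score below the random baseline maps to a negative value.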

Dashboard features

  • Filter by size or openness. The bottom row of controls lets you restrict to a model-size range or limit to fully-open-weight models.
  • Select/deselect models & tasks. The two sections below the plot let you choose exactly which models and tasks contribute to each view.
  • Task grouping. Related benchmarks (e.g., Bokmål/Nynorsk pairs, translation direction pairs) are shown as grouped bar charts for easy comparison.
  • Metric selector. Individual task views offer a dropdown (top control row) to switch between all available metrics for that benchmark.
  • Chart export. Charts can be downloaded as PNG or SVG, and the underlying data exported as JSON, via the toolbar buttons.

Citation

@inproceedings{mikhailov-etal-2025-noreval,
    title = "{N}or{E}val: A {N}orwegian Language Understanding and Generation Evaluation Benchmark",
    author = "Mikhailov, Vladislav  and Enstad, Tita  and Samuel, David  and
              Farseth{\r{a}}s, Hans Christian  and Kutuzov, Andrey  and
              Velldal, Erik  and {\O}vrelid, Lilja",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    url = "https://aclanthology.org/2025.findings-acl.181/",
}
 
