NorEval dashboard

Norwegian benchmark of base language models

About NorEval

NorEval is a comprehensive evaluation benchmark for Norwegian language models, described in Mikhailov et al. (2025). NorEval version 1.1 consists of 35 tasks spanning linguistic knowledge, language understanding, world knowledge & reasoning, generation & summarization, and translation. All evaluations are conducted using lm-eval-harness.

Design principles

Native, non-translated datasets. The benchmarks are built from natively Norwegian sources rather than machine-translated English datasets. This avoids translationese artifacts and better reflects genuine Norwegian language use.
Both written standards. Norwegian has two official written standards: Bokmål (nob) and Nynorsk (nno). NorEval includes parallel benchmark variants for both, enabling direct comparison of model performance across the two standards.
Northern Sámi. Two translation tasks (nob↔sme) and one linguistic acceptability task (MultiBLiMP) cover Northern Sámi, an endangered Uralic language spoken in Norway, Sweden, and Finland.
Multiple prompt templates. Most tasks are evaluated with 4–6 different prompt formulations to measure sensitivity to prompt phrasing. The dashboard reports a configurable summary statistic across prompts (max, mean, median, or min).
Few-shot evaluation. Each task is evaluated at 0-shot, 1-shot, and 5-shot to assess in-context learning capabilities.
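
The prompt aggregation described above can be sketched in a few lines of Python. This is only an illustration, not the dashboard's actual code: the function name `aggregate_prompts` and the example scores are hypothetical.

```python
from statistics import mean, median

# The four summary statistics named above, keyed by the dropdown label.
AGGREGATORS = {"max": max, "mean": mean, "median": median, "min": min}

def aggregate_prompts(scores, how="max"):
    """Collapse the per-prompt scores of one task into a single number."""
    if not scores:
        raise ValueError("need at least one prompt score")
    return AGGREGATORS[how](scores)

# Hypothetical accuracies of one model on one task under four prompts.
prompt_scores = [0.61, 0.58, 0.66, 0.63]
summary = {how: aggregate_prompts(prompt_scores, how) for how in AGGREGATORS}
spread = summary["max"] - summary["min"]  # a rough measure of prompt sensitivity
```

A large gap between max and min suggests the model is sensitive to prompt phrasing on that task.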

Confidence intervals

Error bars are asymmetric 95% CIs for the score under the selected prompt aggregation (max by default).

For max. The interval combines per-prompt sampling noise with the uncertainty over which prompt is best. The lower bound has a Bonferroni correction across the $k$ prompts; the upper bound does not. CIs are therefore wider below the bar than above.

$$L = \max_i \ell\!\left(c_i,\, n_i,\, \tfrac{\alpha}{2k}\right), \qquad U = \max_i u\!\left(c_i,\, n_i,\, \tfrac{\alpha}{2}\right)$$

Here $c_i$ is the number of correct answers out of $n_i$ examples under prompt $i$, and $\alpha = 0.05$. Per-prompt bounds $\ell$ and $u$ are exact Clopper–Pearson binomial quantiles for accuracy and exact-match metrics; for corpus-level metrics (BLEU, ROUGE, chrF, token-F1) we use the normal approximation with the harness bootstrap SE.
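
For accuracy-type metrics, the max interval can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the dashboard's implementation: $c_i$ is taken to be the number of correct answers out of $n_i$ examples, the Clopper–Pearson bounds are found by bisection on the binomial CDF rather than through a statistics library, and all function names are hypothetical.

```python
from math import exp, lgamma, log, log1p

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = log(p), log1p(-p)
    total = 0.0
    for j in range(k + 1):
        log_coef = lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
        total += exp(log_coef + j * log_p + (n - j) * log_q)
    return min(1.0, total)

def _solve(f, target, increasing, iters=50):
    """Bisection on (0, 1) for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def cp_lower(c, n, tail):
    """Lower Clopper-Pearson bound: the p at which P(X >= c) equals `tail`."""
    if c == 0:
        return 0.0
    return _solve(lambda p: 1.0 - binom_cdf(c - 1, n, p), tail, increasing=True)

def cp_upper(c, n, tail):
    """Upper Clopper-Pearson bound: the p at which P(X <= c) equals `tail`."""
    if c == n:
        return 1.0
    return _solve(lambda p: binom_cdf(c, n, p), tail, increasing=False)

def max_prompt_ci(counts, alpha=0.05):
    """(L, U) for the best-of-k score; `counts` is a list of (c_i, n_i) pairs."""
    k = len(counts)
    lower = max(cp_lower(c, n, alpha / (2 * k)) for c, n in counts)  # Bonferroni
    upper = max(cp_upper(c, n, alpha / 2) for c, n in counts)        # uncorrected
    return lower, upper

# Hypothetical counts for three prompts on a 100-example task.
L, U = max_prompt_ci([(70, 100), (74, 100), (66, 100)])
```

Because only the lower tail is divided by $k$, the distance from the best score down to `L` comes out larger than the distance up to `U`.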

For mean. Symmetric CI from the Welch–Satterthwaite combination of sampling SE and the SD across prompts.

For median and first. Sampling CI of the prompt that achieved that statistic.

Caveats. Three benchmarks (norec_document, norec_sentence, ask_gec) have no harness-supplied SE, so we apply the binomial bounds as a coarse placeholder. F1 and ERRANT $F_{0.5}$ are not proper Bernoulli proportions, so the CIs for these benchmarks are indicative rather than exact. When averaging across $N$ benchmarks, lower and upper distances are propagated independently as $\sqrt{\sum_i d_i^2}/N$, where $d_i$ is the distance from benchmark $i$'s score to the corresponding bound.
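
The propagation rule for averaged scores can be written out as a short Python sketch. The function name `average_with_ci` and the example numbers are hypothetical; the only thing taken from the text is the formula $\sqrt{\sum d_i^2}/N$, applied separately to the lower and upper distances.

```python
from math import sqrt

def average_with_ci(scores, lower_dists, upper_dists):
    """Mean score over N benchmarks with independently propagated error bars.

    lower_dists[i] / upper_dists[i] are the distances from scores[i] down to
    its lower bound and up to its upper bound.
    """
    n = len(scores)
    avg = sum(scores) / n
    lower = sqrt(sum(d * d for d in lower_dists)) / n
    upper = sqrt(sum(d * d for d in upper_dists)) / n
    return avg, avg - lower, avg + upper

# Two hypothetical benchmarks with asymmetric error bars.
avg, lo, hi = average_with_ci([60.0, 80.0], [3.0, 4.0], [6.0, 8.0])
```

With these inputs the lower distance is $\sqrt{3^2 + 4^2}/2 = 2.5$ and the upper distance is $\sqrt{6^2 + 8^2}/2 = 5$, so an asymmetry in the per-benchmark bars carries over to the average.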

Normalization

Aggregate views normalize scores before averaging to put different metrics on a common scale. The default random baseline normalization maps each score to:

normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100

where 0 = random chance and 100 = perfect performance. Other normalization options (min-max, z-score, percentile) are available from the Normalization dropdown.
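
As a small worked example, the random baseline normalization can be expressed as a Python function. The function name and the baseline value below are illustrative assumptions (a four-way multiple-choice task with chance accuracy 0.25), not values taken from the benchmark.

```python
def normalize(raw, random_baseline, max_performance):
    """Map a raw score so that random chance is 0 and a perfect score is 100."""
    return (raw - random_baseline) / (max_performance - random_baseline) * 100

# Hypothetical four-way multiple-choice task: chance accuracy is 0.25.
halfway = normalize(0.625, random_baseline=0.25, max_performance=1.0)
below_chance = normalize(0.20, random_baseline=0.25, max_performance=1.0)
```

A raw accuracy of 0.625 sits halfway between chance and perfect, so it maps to 50. Scores below chance map to negative values, since the formula itself does not clip.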

Dashboard features

Filter by size or openness. The bottom row of controls lets you restrict to a model-size range or limit to fully-open-weight models.
Select/deselect models & tasks. The two sections below the plot let you choose exactly which models and tasks contribute to each view.
Task grouping. Related benchmarks (e.g., Bokmål/Nynorsk pairs, translation direction pairs) are shown as grouped bar charts for easy comparison.
Metric selector. Individual task views offer a dropdown (top control row) to switch between all available metrics for that benchmark.
Chart export. Charts can be downloaded as PNG or SVG, and the underlying data exported as JSON, via the toolbar buttons.

Citation

@inproceedings{mikhailov-etal-2025-noreval,
    title = "{N}or{E}val: A {N}orwegian Language Understanding and Generation Evaluation Benchmark",
    author = "Mikhailov, Vladislav  and Enstad, Tita  and Samuel, David  and
              Farseth{\r{a}}s, Hans Christian  and Kutuzov, Andrey  and
              Velldal, Erik  and {\O}vrelid, Lilja",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    url = "https://aclanthology.org/2025.findings-acl.181/",
}
