NorOLMo dashboard

Norwegian training-progress evaluation

About NorOLMo

NorOLMo is a fully open Norwegian language model, continually trained from OLMo2 (13B stage 1) by the Language Technology Group at the University of Oslo. This view tracks NorOLMo's performance across 33 training checkpoints (steps 1k–33k) plus several late-stage ablation runs, all evaluated on the NorEval benchmark (Mikhailov et al., 2025).

Ablations

The dashboard includes several ablations of NorOLMo's late training stages:

Stage 2 (stage 1 data, full decay). No length extension; the stage-2 schedule continues training on stage-1 data with full learning-rate (LR) decay.
Stage 2 (stage 1/2 data, ½ decay). Variants that halve the LR decay rate during stage 2, so that stage 3 can then smoothly continue the decay.
Stage 3 (RoPE scaling / no RoPE scaling). Length-extension experiments at stage 3, run with and without RoPE scaling.

Confidence intervals

Shaded bands are asymmetric 95% confidence intervals (CIs) for the score under the selected prompt aggregation (max by default).

For max. The lower bound is Bonferroni-corrected across the $k$ prompts; the upper bound is not, so the band extends further below the line than above it.

$$L = \max_i \ell\!\left(c_i,\, n_i,\, \tfrac{\alpha}{2k}\right), \qquad U = \max_i u\!\left(c_i,\, n_i,\, \tfrac{\alpha}{2}\right)$$

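Here $\ell$ and $u$ are the per-prompt lower and upper bounds at the given tail probability, with $\alpha = 0.05$ for a 95% interval. Per-prompt bounds use exact Clopper–Pearson quantiles for accuracy / exact-match metrics and a normal approximation with the harness bootstrap standard error (SE) otherwise.

For mean. Welch–Satterthwaite combination of sampling variance and between-prompt variance.

For median / first. Sampling CI of the selected prompt.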
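
The max-aggregation interval can be illustrated with a minimal sketch for the accuracy / exact-match case. It is not the dashboard's own code: it assumes $c_i$ is the number of correct answers and $n_i$ the number of examples for prompt $i$, and the helper names (`binom_cdf`, `cp_lower`, `cp_upper`, `max_ci`) are hypothetical. The Clopper–Pearson bounds are found by bisection on the binomial tail probabilities so that only the standard library is needed.

```python
from math import comb


def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(c + 1))


def cp_lower(c, n, tail):
    """Clopper-Pearson lower bound: the p at which P(X >= c | p) equals `tail`."""
    if c == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; P(X >= c | p) increases with p
        mid = (lo + hi) / 2
        if 1.0 - binom_cdf(c - 1, n, mid) < tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def cp_upper(c, n, tail):
    """Clopper-Pearson upper bound: the p at which P(X <= c | p) equals `tail`."""
    if c == n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection; P(X <= c | p) decreases with p
        mid = (lo + hi) / 2
        if binom_cdf(c, n, mid) > tail:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def max_ci(prompts, alpha=0.05):
    """CI for the max over prompts; `prompts` is a list of (c_i, n_i) pairs.

    Assumption: c_i = correct answers, n_i = examples for prompt i.
    Only the lower bound uses the Bonferroni-corrected tail alpha / (2k).
    """
    k = len(prompts)
    lower = max(cp_lower(c, n, alpha / (2 * k)) for c, n in prompts)
    upper = max(cp_upper(c, n, alpha / 2) for c, n in prompts)
    return lower, upper


# Hypothetical counts for three prompts on a 100-example task.
low, high = max_ci([(70, 100), (80, 100), (75, 100)])
print(f"max score 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

Because the lower tail is split across the $k$ prompts while the upper tail is not, the lower bound sits further from the best prompt's score than the upper bound does.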

Normalization

Aggregate views normalize scores before averaging, so that tasks with different chance levels are comparable. The default random-baseline normalization maps each score to:

normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100

where 0 = random chance and 100 = perfect performance.
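
As a worked illustration of the formula above, here is a short sketch. The function name `normalize` and the example numbers are hypothetical, and scores are assumed to be on a 0 to 1 scale with a perfect score of 1.0 unless stated otherwise.

```python
def normalize(raw, random_baseline, max_performance=1.0):
    """Rescale a raw score so that random chance maps to 0 and a perfect score to 100."""
    span = max_performance - random_baseline
    if span <= 0:
        raise ValueError("max_performance must exceed random_baseline")
    return (raw - random_baseline) / span * 100.0


# Hypothetical 4-way multiple-choice task: chance accuracy is 0.25.
print(normalize(0.625, random_baseline=0.25))  # halfway between chance and perfect
print(normalize(0.10, random_baseline=0.25))   # below chance gives a negative value
```

Scores below the random baseline come out negative, which is why a normalized aggregate can dip under 0 early in training.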
