NorOLMo dashboard

Norwegian training-progress evaluation

About NorOLMo

NorOLMo is a fully open Norwegian language model, continually trained from OLMo2 (13B, stage 2) by the Language Technology Group at the University of Oslo. This view tracks NorOLMo's performance across 33 training checkpoints (steps 1k–33k) plus several late-stage ablation runs, all evaluated on the NorEval benchmark (Mikhailov et al., 2025).

Ablations

The dashboard includes several ablations of NorOLMo's late training stages:

  • Stage 2 (stage 1 data, full decay). No length extension; the stage-2 schedule continues training on stage-1 data with full learning-rate (LR) decay.
  • Stage 2 (stage 1/2 data, ½ decay). Variants that halve the LR decay rate during stage 2.
  • Stage 3 (RoPE scaling / no RoPE scaling). Length-extension experiments at stage 3, run with and without RoPE scaling.

Ablation lines start at different training steps, reflecting where each experiment diverges from the main run.

Error bars

Shaded bands show combined uncertainty from two independent sources, added in quadrature:

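  • Sampling error. For classification metrics, SE = √(v·(1−v)/n), where v is the score as a proportion (e.g. accuracy) and n is the number of evaluated examples. For corpus-level metrics, the SE is estimated via bootstrap resampling.
  • Prompt deviation. SD(scores across prompt variants) / √k, where k is the number of prompt variants.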
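The two error sources can be sketched in Python as below. This is a minimal illustration, not the dashboard's actual code: the function names are hypothetical, the `metric` callback interface for corpus-level scores is an assumption, and using the sample standard deviation (`statistics.stdev`, n − 1 denominator) for both the bootstrap spread and the prompt spread is an assumed convention.

```python
import math
import random
import statistics


def binomial_se(v: float, n: int) -> float:
    """Sampling SE for a classification metric: sqrt(v * (1 - v) / n)."""
    return math.sqrt(v * (1.0 - v) / n)


def bootstrap_se(items, metric, n_resamples=1000, seed=0):
    """Sampling SE for a corpus-level metric via bootstrap resampling.

    Hypothetical interface: `items` are per-example records and `metric`
    maps a list of them to a single corpus-level score.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        # Resample the evaluation set with replacement, same size as the original.
        sample = [rng.choice(items) for _ in range(len(items))]
        scores.append(metric(sample))
    # The spread of the resampled scores estimates the standard error.
    return statistics.stdev(scores)


def prompt_deviation(prompt_scores):
    """SD of the scores across k prompt variants, divided by sqrt(k)."""
    k = len(prompt_scores)  # needs k >= 2 for a sample SD
    return statistics.stdev(prompt_scores) / math.sqrt(k)


def combined_error(sampling_se, prompt_scores):
    """Add the two independent error sources in quadrature."""
    return math.hypot(sampling_se, prompt_deviation(prompt_scores))


# Example: three prompt variants on a 500-example classification task.
prompt_scores = [0.62, 0.58, 0.60]
v = statistics.mean(prompt_scores)  # assumption: SE uses the mean score
band = combined_error(binomial_se(v, 500), prompt_scores)
```

With these inputs the sampling term is about 0.022 and the prompt term about 0.012, so the combined half-width of the band is about 0.025. Adding in quadrature (rather than summing) reflects the assumption that the two sources are independent.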

Normalization

Aggregate views normalize scores before averaging. The default random-baseline normalization maps each score to:

normalized = (raw − random_baseline) / (max_performance − random_baseline) × 100

where 0 = random chance and 100 = perfect performance.
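A minimal Python sketch of this normalization followed by averaging across tasks is shown below. The function names and the `(raw, random_baseline, max_performance)` tuple layout are hypothetical, and the default `max_performance=100.0` assumes scores are reported on a 0–100 scale.

```python
def normalize(raw, random_baseline, max_performance=100.0):
    """Random-baseline normalization: 0 = random chance, 100 = perfect."""
    return (raw - random_baseline) / (max_performance - random_baseline) * 100.0


def aggregate(task_scores):
    """Average normalized scores across tasks.

    Hypothetical input layout: task name -> (raw, random_baseline, max_performance).
    """
    normalized = [normalize(r, b, m) for r, b, m in task_scores.values()]
    return sum(normalized) / len(normalized)


# Example: a 4-way multiple-choice task (random baseline 25) and a
# generative task whose random baseline is assumed to be 0.
overall = aggregate({
    "mc_task": (55.0, 25.0, 100.0),
    "gen_task": (40.0, 0.0, 100.0),
})
```

Here the multiple-choice score of 55 normalizes to 40, because it covers 30 of the 75 points between chance and perfect. The generative score stays at 40, so `overall` is 40. Note that a model scoring below chance receives a negative normalized score. This is intended: it keeps tasks with very different baselines comparable before they are averaged.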


Tasks included in aggregation