MultiSynt tracks the training trajectories of several pretraining runs across four languages (Spanish, French, Finnish, and Norwegian), evaluated at multiple checkpoints. Each language has its own set of benchmarks, and each benchmark is run with several prompt variants.
The "Signal-filtered tasks" option in the task dropdown applies HPLT-E-style quality criteria (monotonicity, signal-to-noise ratio, prompt sensitivity, cross-model ranking consistency, etc.) to identify benchmarks with reliable training signal.
Shaded bands show the combined uncertainty from two sources, added in quadrature: sampling error (derived from each metric's variance) and prompt-template deviation (the standard deviation across a benchmark's k prompt variants, divided by √k). Toggle the prompt uncertainty checkbox to exclude the prompt-deviation component.
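As a rough illustration of how the two components might combine, here is a minimal Python sketch. The function name, its parameters, and the use of the sample standard deviation are assumptions made for this example; the dashboard's actual code may differ.

```python
import math
import statistics


def combined_uncertainty(sampling_se, prompt_scores, include_prompt=True):
    """Combine sampling error and prompt-template deviation in quadrature.

    sampling_se    -- standard error derived from the metric's variance
    prompt_scores  -- one score per prompt variant (k values)
    include_prompt -- mirrors the prompt uncertainty checkbox (hypothetical)
    """
    k = len(prompt_scores)
    # With fewer than two variants there is no spread to estimate.
    if not include_prompt or k < 2:
        return sampling_se
    # Assumption: sample SD (n - 1 denominator) across prompt variants.
    prompt_se = statistics.stdev(prompt_scores) / math.sqrt(k)
    # Quadrature: square root of the sum of squared components.
    return math.hypot(sampling_se, prompt_se)
```

For example, with a sampling error of 3.0 and two prompt variants scoring 40 and 48, the prompt component is 4.0 and the combined band half-width is approximately 5.0. With `include_prompt=False` the function returns the sampling error alone.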
The default random-baseline normalization subtracts each task's random baseline (clamped at 0) and rescales the result to a 0–100 range, so that different tasks can be averaged on a common scale.
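One plausible reading of this normalization is sketched below in Python. The function name, the `max_score` parameter, and the choice to clamp the baseline-subtracted result (rather than some other quantity) at 0 are assumptions for illustration, not a description of the dashboard's exact implementation.

```python
def normalize_score(score, random_baseline, max_score=100.0):
    """Map a raw score so the random baseline becomes 0 and max_score becomes 100.

    Assumption: scores below the random baseline are clamped to 0.
    """
    if max_score <= random_baseline:
        raise ValueError("max_score must exceed random_baseline")
    # Subtract the baseline, then rescale by the remaining headroom.
    rescaled = (score - random_baseline) / (max_score - random_baseline) * 100.0
    # Clamp at 0 so below-chance scores do not go negative.
    return max(0.0, rescaled)


def average_normalized(task_results):
    """Average normalized scores over (score, random_baseline) pairs."""
    values = [normalize_score(s, b) for s, b in task_results]
    return sum(values) / len(values)
```

Under these assumptions, a four-way multiple-choice task with a 25% random baseline and a raw accuracy of 62.5% normalizes to 50.0, while a raw accuracy of 20% normalizes to 0.0. Because every task then shares the same 0–100 scale, a plain mean across tasks is meaningful.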