Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
-
Finetuning on bitext (bilingual finetuning) to translate from one language to another does not leverage the full capacity of multilingual pretraining.
-
Multilingual translation models can be created through multilingual finetuning. Starting from pretrained models incorporates the benefits of large quantities of unlabeled monolingual data, which is particularly important for low-resource languages where bitext is not available.
-
Multilingual translation models are built with multilingual pretraining (on monolingual data) followed by multilingual finetuning (on parallel data).
Core Concepts:
- mBART is trained as a denoising autoencoder: the model is trained to predict the original text from a noised version of it.
- Random span masking and sentence order permutation are used to create the noised input during pretraining (see the noising sketch after this list).
- Instead of training a separate model from language i to language j, a single model is trained to translate between N languages in all directions (see the direction sketch below).
- Trained with temperature upsampling, which upsamples lower-resource pairs so that high-resource languages do not dominate the training data (see the sampling sketch below).
- On average, the multilingual models improve over bilingual baselines by around 5.7 to 7 BLEU points.
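
A minimal sketch of the denoising objective's noising step, assuming whitespace tokenization and a literal "<mask>" string (the actual model works on subword tokens); span lengths drawn from a Poisson distribution and sentence permutation follow the mBART recipe:

```python
import random
import numpy as np

def noise_document(sentences, mask_ratio=0.35, poisson_lambda=3.5):
    # Permute sentence order within the document (order permutation).
    sentences = list(sentences)
    random.shuffle(sentences)

    noised = []
    for sent in sentences:
        tokens = sent.split()
        num_to_mask = int(round(len(tokens) * mask_ratio))
        masked = 0
        while masked < num_to_mask and len(tokens) > 0:
            # Draw a span length and replace the whole span with one "<mask>".
            span = max(1, min(int(np.random.poisson(poisson_lambda)), len(tokens)))
            start = random.randrange(len(tokens) - span + 1)
            tokens[start:start + span] = ["<mask>"]
            masked += span
        noised.append(" ".join(tokens))
    return noised

# The autoencoder is trained to reconstruct the original sentences
# from the noised document produced above.
```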
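A hedged sketch of how many-to-many training examples can be built: each parallel pair is tagged with language-ID tokens (e.g. "en_XX", "fr_XX") so one model covers every direction; the exact token placement here only approximates the mBART convention.

```python
def make_example(src_text, tgt_text, src_lang, tgt_lang):
    # Source side carries its language ID; the decoder is primed with the
    # target language ID so the same parameters translate in any direction.
    encoder_input = f"{src_text} </s> {src_lang}"
    decoder_input = f"{tgt_lang} {tgt_text} </s>"
    return encoder_input, decoder_input

pairs = [
    ("Hello world", "Bonjour le monde", "en_XX", "fr_XX"),
    ("Bonjour le monde", "Hello world", "fr_XX", "en_XX"),  # same model, reverse direction
]
examples = [make_example(*p) for p in pairs]
```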
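A minimal sketch of temperature-based upsampling: sampling probabilities are proportional to (dataset share)^(1/T), so a higher T flattens the distribution and low-resource pairs are sampled more often. The temperature and dataset sizes below are illustrative, not the paper's settings.

```python
def temperature_sampling_probs(sizes, T=5.0):
    # Raise each language pair's data share to the power 1/T, then renormalize.
    total = sum(sizes.values())
    weights = {pair: (n / total) ** (1.0 / T) for pair, n in sizes.items()}
    z = sum(weights.values())
    return {pair: w / z for pair, w in weights.items()}

# Example: one high-resource and one low-resource pair.
print(temperature_sampling_probs({"en-fr": 40_000_000, "en-ne": 500_000}))
# With T > 1, the en-ne share rises well above its raw proportion (~1.2%).
```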
However, multilingual finetuning would mean that the same model capacity must model many directions rather than just one, which could decrease performance.
Ref: