Key developments:
- Many-to-Many multilingual translation model (a non-English-centric model) that can translate directly between any pair of 100 languages.
- Covers thousands of language directions in training data.
- Transformer-based neural machine translation models.
- Controls the distribution of word tokens across languages when building the SentencePiece vocabulary from the multilingual dataset, so that low-resource languages are not under-represented.
- Added a special token to the encoder input indicating the source language and a special token to the decoder input indicating the target language.
- Evaluated translation quality with BLEU and human evaluation.
- Built a parallel corpus of multilingual translation text using LASER embeddings, FAISS indexing (for fast semantic-similarity search), and mined data from the CCMatrix and CCAligned projects.
- Bitext data were mined based on language families and bridge languages.
- Languages were grouped by linguistic similarity and geographic/cultural proximity.
- Selectively augmented bitext data with backtranslation.
- Added language-specific layers (at the end of the decoder) to pre-trained Transformers to improve performance.
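The token-distribution control above is typically done with temperature sampling, where language l is sampled with probability proportional to (D_l / Σ D)^(1/T). A minimal sketch in plain Python; the corpus sizes are made up for illustration:

```python
# Temperature-based sampling over languages; corpus sizes are illustrative only.
sizes = {"en": 1_000_000, "fr": 200_000, "sw": 5_000}

def sampling_probs(sizes, T=5.0):
    """p_l proportional to (D_l / sum(D))**(1/T); T > 1 upweights low-resource languages."""
    total = sum(sizes.values())
    weights = {lang: (d / total) ** (1.0 / T) for lang, d in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

probs = sampling_probs(sizes)          # rebalanced distribution
uniform = sampling_probs(sizes, T=1.0)  # T=1 recovers the raw proportions
```

With T=5, the low-resource language (`sw`) gets a much larger share of sampled tokens than its raw corpus proportion would give it.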
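The source/target language tokens can be pictured as prefixes on the encoder and decoder inputs. A toy sketch; the exact token format here (`__xx__`) is hypothetical, not necessarily the paper's:

```python
# Hypothetical language-token scheme: the encoder sees a source-language token,
# the decoder is primed with a target-language token.
def add_language_tokens(src_tokens, tgt_tokens, src_lang, tgt_lang):
    enc_input = [f"__{src_lang}__"] + src_tokens
    dec_input = [f"__{tgt_lang}__"] + tgt_tokens
    return enc_input, dec_input

enc, dec = add_language_tokens(["Bonjour", "le", "monde"], ["Hello", "world"], "fr", "en")
```

Because the target language is given only to the decoder, the same encoder representation can be decoded into any of the 100 languages.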
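The mining step above can be illustrated with a toy nearest-neighbour search over sentence embeddings: each source sentence is paired with the most similar target sentence. This sketch uses raw cosine similarity (the paper uses margin-based scoring over LASER embeddings, with FAISS accelerating the search at scale); the embeddings below are made up:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_embs, tgt_embs, threshold=0.8):
    # For each source embedding, keep its nearest target if similarity clears the threshold.
    pairs = []
    for i, s in enumerate(src_embs):
        j, score = max(((j, cosine(s, t)) for j, t in enumerate(tgt_embs)),
                       key=lambda x: x[1])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs

src = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]  # stand-ins for LASER sentence embeddings
tgt = [[0.0, 0.9, 0.1], [1.0, 0.1, 0.0]]
pairs = mine_pairs(src, tgt)
```

The threshold filters out pairs that merely happen to be each other's nearest neighbours without being true translations.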
More Study Resources:
- Beyond English-Centric Multilingual Machine Translation, by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
- CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web, by Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin.
- CCAligned: A Massive Collection of Cross-Lingual Web-Document Pairs, by Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzman, Philipp Koehn.