Posted by Isaac Caswell and Ankur Bapna, Research Scientists, Google Translate

Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT has soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people worldwide, they include only around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas.

There are two key bottlenecks towards building functioning translation models for the long tail of languages. The first arises from data scarcity: digitized data for many languages is limited and can be difficult to find on the web due to quality issues with Language Identification (LangID) models. The second arises from modeling limitations: MT models usually train on large amounts of parallel (translated) text, but without such data, models must learn to translate from limited amounts of monolingual text, which is a novel area of research. Both of these challenges need to be addressed for translation models to reach sufficient quality.

In “Building Machine Translation Systems for the Next Thousand Languages”, we describe how to build high-quality monolingual datasets for over a thousand languages that do not have translation datasets available and demonstrate how one can use monolingual data alone to train MT models. As part of this effort, we are expanding Google Translate to include 24 under-resourced languages. For these languages, we created monolingual datasets by developing and using specialized neural language identification models combined with novel filtering approaches. The techniques we introduce supplement massively multilingual models with a self-supervised task to enable zero-resource translation. Finally, we highlight how native speakers have helped us realize this accomplishment.

Automatically gathering usable textual data for under-resourced languages is much more difficult than it may seem. Tasks like LangID, which work well for high-resource languages, are unsuccessful for under-resourced languages, and many publicly available datasets crawled from the web often contain more noise than usable data for the languages they attempt to support. In our early attempts to identify under-resourced languages on the web by training a standard Compact Language Detector v3 (CLD3) LangID model, we too found that the dataset was too noisy to be usable.

As an alternative, we trained a Transformer-based, semi-supervised LangID model on over 1000 languages. This model supplements the LangID task with the MAsked Sequence-to-Sequence (MASS) task to better generalize over noisy web data. MASS simply garbles the input by randomly removing sequences of tokens from it, and trains the model to predict these sequences; a minimal sketch of the masking scheme follows below. We applied this Transformer-based model to a dataset that had already been filtered with a CLD3 model trained to recognize clusters of similar languages.
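To make the MASS objective concrete, here is a minimal Python sketch: it hides one contiguous span of tokens and pairs the corrupted sentence with the hidden span as the prediction target. The `mass_example` helper, the `<mask>` token string, and the 50% span ratio are illustrative assumptions on our part, not details from the post.

```python
import random

MASK = "<mask>"

def mass_example(tokens, mask_ratio=0.5):
    """Build one MASS-style training pair from a non-empty token list:
    replace a contiguous span of tokens with mask symbols and return the
    corrupted source together with the hidden span the model must predict.
    Hypothetical helper for illustration only.
    """
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = random.randint(0, len(tokens) - span_len)
    target = tokens[start:start + span_len]  # span the decoder must reconstruct
    source = tokens[:start] + [MASK] * span_len + tokens[start + span_len:]
    return source, target

src, tgt = mass_example("the cat sat on the mat".split())
print(src)  # e.g. ['the', '<mask>', '<mask>', '<mask>', 'the', 'mat']
print(tgt)  # e.g. ['cat', 'sat', 'on']
```

Because the model can only recover the hidden span by exploiting the surrounding context, the task rewards learning language-specific structure rather than surface cues, which is what helps it generalize over noisy web data.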
We then applied the open-sourced Term Frequency-Inverse Internet Frequency (TF-IIF) filtering to the resulting dataset to find and discard sentences that were actually in related high-resource languages, and developed a variety of language-specific filters to eliminate specific pathologies. A simplified sketch of the TF-IIF idea appears below.
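The sketch below captures the intuition: a token that is frequent in the candidate corpus but also frequent across the web at large is probably borrowed from a high-resource language, while a token that is frequent in the corpus and rare elsewhere is good evidence of in-language text. The function names, the log weighting, and the averaging threshold are our own simplifications, not the released implementation.

```python
import math
from collections import Counter

def tfiif_scores(corpus_tokens, internet_freq, total_internet_tokens):
    """Score tokens by term frequency in the candidate corpus, weighted by
    inverse frequency on the broader web. Distinctive in-language tokens
    score high; tokens common everywhere score low. Simplified sketch.
    """
    tf = Counter(corpus_tokens)
    n = len(corpus_tokens)
    return {
        tok: (count / n) * math.log(total_internet_tokens / (1 + internet_freq.get(tok, 0)))
        for tok, count in tf.items()
    }

def keep_sentence(sentence_tokens, scores, threshold):
    """Keep a sentence only if its average token score clears a threshold;
    low-scoring sentences are likely in a related high-resource language.
    """
    avg = sum(scores.get(t, 0.0) for t in sentence_tokens) / max(1, len(sentence_tokens))
    return avg >= threshold
```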
The result of this effort was a dataset with monolingual text in over 1000 languages, of which 400 had over 100,000 sentences. We performed human evaluations on samples of 68 of these languages and found that the majority (>70%) reflected high-quality, in-language content.

*The amount of monolingual data per language versus the amount of parallel (translated) data per language. A small number of languages have large amounts of parallel data, but there is a long tail of languages with only monolingual data.*
Once we had a dataset of monolingual text in over 1000 languages, we then developed a simple yet practical approach for zero-resource translation, i.e., translation for languages with no in-language parallel text and no language-specific translation examples. Rather than limiting our model to an artificial scenario with only monolingual text, we also include all available parallel text data with millions of examples for higher-resource languages to enable the model to learn the translation task. Simultaneously, we train the model to learn representations of under-resourced languages directly from monolingual text using the MASS task. In order to solve this task, the model is forced to develop a sophisticated representation of the language in question, developing a complex understanding of how words relate to other words in a sentence. A sketch of this mixed training setup follows below.
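One way to picture the setup is as a single model fed a mixed stream of examples: supervised translation pairs wherever parallel data exists, and MASS reconstruction examples built from monolingual text everywhere else. The generator below, including the task tags and the mixing probability, is a hypothetical sketch that reuses the `mass_example` helper from earlier; the actual sampling strategy is not described in this post.

```python
import random

def training_stream(parallel_pairs, monolingual_sentences, p_parallel=0.5):
    """Yield an endless mixed stream of (task, source, target) examples:
    translation pairs for high-resource languages, and MASS pairs built
    from monolingual text for under-resourced ones. Illustrative only.
    """
    while True:
        if parallel_pairs and random.random() < p_parallel:
            src, tgt = random.choice(parallel_pairs)
            yield "translate", src.split(), tgt.split()
        else:
            sent = random.choice(monolingual_sentences)
            src, tgt = mass_example(sent.split())
            yield "mass", src, tgt
```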