2022 NAACL On Synthetic Data for Back Translation #32

@IsaacJ60

Main Problem

This work addresses how synthetic data should be generated for back translation in Neural Machine Translation (NMT), and which factors determine back-translation performance.

Proposed Method

The authors propose two methods to improve the synthetic data used for back translation: Data Manipulation and Gamma Score. In Data Manipulation, they combine synthetic corpora generated by beam search and by sampling to balance the trade-off between importance and quality, and they tune the combination ratio to optimize back-translation performance. In Gamma Score, they introduce a score that accounts for both quality and importance when generating translations. The score interpolates between the importance weight and the probability of the translation given the source sentence. The final translation is either the candidate with the highest score or a candidate sampled from the score distribution.
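The Gamma Score selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice to interpolate in log space, and the convention that `gamma = 1` means "quality only" are all assumptions made for the example, since the summary above does not give the exact formula.

```python
import math
import random


def gamma_scores(log_quality, log_importance, gamma):
    """Interpolate between a quality term and an importance term.

    Assumption: both terms are log-values and are mixed linearly,
    with gamma weighting quality and (1 - gamma) weighting importance.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be in [0, 1]")
    return [gamma * q + (1.0 - gamma) * w
            for q, w in zip(log_quality, log_importance)]


def select_best(candidates, scores):
    """Return the candidate with the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]


def sample_by_score(candidates, scores, rng):
    """Sample one candidate from a softmax over the scores."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical candidates for one monolingual sentence.
candidates = ["cand_a", "cand_b", "cand_c"]
log_quality = [-1.0, -2.0, -3.0]      # log P(translation | sentence), made up
log_importance = [-3.0, -1.0, -2.0]   # log importance weight, made up

scores = gamma_scores(log_quality, log_importance, gamma=0.5)
best = select_best(candidates, scores)
sampled = sample_by_score(candidates, scores, random.Random(0))
```

With these made-up numbers, `gamma = 1.0` reduces to picking the most probable candidate and `gamma = 0.0` to picking the one with the largest importance weight; intermediate values trade the two off.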

Input/Output

The input to the proposed methods is a monolingual corpus (the text that the back-translation model translates from, which is the target side of the final translation system) and a pretrained NMT model. The output is a synthetic corpus generated through either Data Manipulation or the Gamma Score method.
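For the Data Manipulation route, one way to produce the output corpus from two already-generated corpora is sketched below. The helper `mix_synthetic_corpora` and its per-sentence mixing scheme are hypothetical; the summary only states that beam-search and sampling corpora are combined with a tuned ratio, so the exact procedure here is an assumption.

```python
import random


def mix_synthetic_corpora(beam_outputs, sampled_outputs, beam_ratio, seed=0):
    """Build one synthetic corpus from two aligned candidate corpora.

    beam_outputs[i] and sampled_outputs[i] are back translations of the
    same monolingual sentence i. A fraction beam_ratio of the sentences
    (chosen at random, reproducibly) keeps the beam-search output; the
    rest keep the sampled output.
    """
    if len(beam_outputs) != len(sampled_outputs):
        raise ValueError("corpora must be aligned and of equal length")
    if not 0.0 <= beam_ratio <= 1.0:
        raise ValueError("beam_ratio must be in [0, 1]")

    n = len(beam_outputs)
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    from_beam = set(indices[:round(beam_ratio * n)])

    return [beam_outputs[i] if i in from_beam else sampled_outputs[i]
            for i in range(n)]


# Placeholder strings stand in for real back translations.
beam = ["beam_0", "beam_1", "beam_2", "beam_3"]
sampled = ["samp_0", "samp_1", "samp_2", "samp_3"]
mixed = mix_synthetic_corpora(beam, sampled, beam_ratio=0.5)
```

In this sketch `beam_ratio` plays the role of the combination ratio that would be tuned on a development set.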

Example

In an experiment on the WMT14 DE-EN dataset, the authors compared their proposed methods with baseline methods. With Data Manipulation, they achieved BLEU scores similar to sampling back translation, even without using bitext, and improved over beam search back translation. With the Gamma Score method, they achieved significantly better results than both sampling and beam search back translation. Results were measured with SacreBLEU and COMET.

Related Works & Their Gaps

The related works discussed include the initial proposal of back translation by Bojar and Tamchyna, its extension to NMT by Sennrich et al., and the exploration of various back-translation generation methods by Imamura et al., Edunov et al., and others. Data augmentation methods for NMT, such as token frequency balancing and SwitchOut, are also mentioned, as are the use of monolingual data in semi-supervised machine translation and the improvement of translation quality through back translation. The gaps in the related works include limited exploration of how to balance importance and quality in synthetic data, inconsistent improvements from data augmentation across translation tasks, and the need for more efficient methods of leveraging monolingual data in NMT.

Labels

literature-review: Summary of the paper related to the work
