Cross-Lingual Data Augmentation Techniques: Insights from Multilingual Back Translation

Authors

  • Harini Champooranan, Coppell High School
  • Dr. Solomon Ubani

DOI:

https://doi.org/10.47611/jsrhs.v13i3.7305

Keywords:

data augmentation, natural language processing, back translation, training datasets, multilingual back translation

Abstract

This paper investigates whether chaining back translation through multiple languages enhances data diversity in natural language processing (NLP) more effectively than the traditional single-chain method. We explore how multiple rounds of translation and back translation across different languages enrich a training dataset with diverse linguistic variations. We evaluate multilingual back translation by reporting the BLEU scores obtained with different back translation techniques. Additionally, we investigate how using pivot languages from different language families affects the diversity of the augmented data. Our findings highlight the importance of using multiple back translation chains and multiple language families when augmenting datasets, and they provide insights for future research on data augmentation techniques for NLP.
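The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_translate`, its word-swap table, and the pivot codes `"fr"` and `"ja"` are hypothetical stand-ins for a real machine translation system, and `bleu` is a simplified, unsmoothed sentence-level version of the metric from Papineni et al. (2002). The sketch also assumes that a lower BLEU score against the original sentence indicates more surface variation; the paper's exact evaluation protocol is not stated on this page.

```python
import math
from collections import Counter
from typing import Callable, Sequence

# Assumed translator signature: (text, source_lang, target_lang) -> text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, pivots: Sequence[str], translate: Translator,
                   source: str = "en") -> str:
    """Pass text through each pivot language in order, then back to source."""
    current, current_lang = text, source
    for lang in pivots:
        current = translate(current, current_lang, lang)
        current_lang = lang
    return translate(current, current_lang, source)


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: clipped n-gram precision, no smoothing."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates no longer than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


def toy_translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical stand-in for an MT system: each pivot swaps one word."""
    swaps = {"fr": {"quick": "fast"}, "ja": {"lazy": "sleepy"}}
    table = swaps.get(tgt, {})  # translating back to "en" changes nothing
    return " ".join(table.get(word, word) for word in text.split())


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    single = back_translate(original, ["fr"], toy_translate)
    multi = back_translate(original, ["fr", "ja"], toy_translate)
    print("single-chain:", single, round(bleu(original, single), 3))
    print("multi-chain: ", multi, round(bleu(original, multi), 3))
```

With this toy translator, the two-pivot chain changes more words than the single-pivot chain, so its BLEU score against the original is lower. A real experiment would replace `toy_translate` with an actual translation model and average scores over a corpus.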

References or Bibliography

Hayashi, T., Watanabe, S., Zhang, Y., Toda, T., Hori, T., Astudillo, R., & Takeda, K. (2018, December). Back-translation-style data augmentation for end-to-end ASR. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 426-433). IEEE.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1631-1642).

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002, July). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311-318).

Published

08-31-2024

How to Cite

Champooranan, H., & Ubani, S. (2024). Cross-Lingual Data Augmentation Techniques: Insights from Multilingual Back Translation. Journal of Student Research, 13(3). https://doi.org/10.47611/jsrhs.v13i3.7305

Section

HS Research Projects