In this paper, we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible number of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.
Publication details
2019, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Pages 3226-3234
Misspelling Oblivious Word Embeddings (04b Conference paper in volume)
Piktus Aleksandra, Bora Edizel Necati, Bojanowski Piotr, Grave Edouard, Ferreira Rui, Silvestri Fabrizio
Research group: Algorithms and Data Science, Research group: Theory of Deep Learning