Generation and Comparison of Distributed Representations of Words from Japanese Wiktionary with BERT Sentence-Embedding

  • Ryota Nishiura, Graduate School of Doshisha University
  • Seiji Tsuchiya, Doshisha University
  • Hirokazu Watabe, Doshisha University
Keywords: word embedding, Wiktionary, Wikipedia, BERT, Word2Vec, distributed representations of words

Abstract

In this paper, we introduce a novel approach to generating distributed representations of words. Our approach makes use of the structured content of Wiktionary, in contrast to existing methods such as Word2Vec, which learn from unstructured plain-text corpora. It generates a distributed representation for each headword in Wiktionary by obtaining a sentence embedding of the corresponding entry content with a pre-trained BERT model. We demonstrate that the proposed method outperforms a Word2Vec model in an XABC test, potentially because it leverages both the quality of a pre-trained BERT model and the expert-moderated knowledge in Wiktionary. We also conduct a comparative study of a variety of BERT models to identify the model conditions most suitable for our purpose.
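The core step, obtaining a sentence embedding of a headword's entry content with a pre-trained BERT model, can be illustrated with a minimal Python sketch. This is not the authors' code: the specific model name, the mean-pooling strategy, and the example glosses are illustrative assumptions (the paper compares several Japanese BERT variants).

```python
# Minimal sketch of embedding a Wiktionary gloss with a pre-trained Japanese BERT.
# Assumptions: the Tohoku NLP model below and mean pooling over tokens; the paper
# may use a different model or pooling. The Japanese tokenizer additionally
# requires the fugashi and ipadic packages to be installed.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed_gloss(gloss: str) -> torch.Tensor:
    """Return one fixed-size sentence embedding for a dictionary gloss."""
    inputs = tokenizer(gloss, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state             # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    # Mean-pool token embeddings over the attention mask to get one vector.
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical usage: embed two headwords via their glosses and compare them.
vec_dog = embed_gloss("ネコ目イヌ科の哺乳類。古くから家畜化された。")       # gloss for 犬
vec_cat = embed_gloss("ネコ目ネコ科の哺乳類。愛玩動物として飼われる。")     # gloss for 猫
print(float(torch.nn.functional.cosine_similarity(vec_dog, vec_cat)))
```

The resulting vectors play the same role as Word2Vec word vectors, so they can be compared with cosine similarity in evaluations such as the XABC test mentioned above.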


Published: 2024-12-20
Section: Review Papers