Japanese Tokenization using Simple Associative Arrays
DOI:
https://doi.org/10.52731/lir.v005.414
Keywords:
bibliometrics, management of research projects, open science, text mining
Abstract
This study addresses a technical challenge in text analysis for the practice of Institutional Research (IR). Research in Natural Language Processing (NLP) focuses on accuracy and speed, and therefore relies on datasets suited to evaluation and on efficient programming languages. IR, by contrast, also involves tasks outside the scope of NLP: numerical data must be analyzed alongside textual data, and in some cases the data to be analyzed is stored on web servers. Because NLP and IR involve different tasks, they also require different tools. One serious problem for text analysis in IR tasks is the limited development of Japanese tokenization. In this study, we implement a Japanese tokenizer, focusing on its applicability to web servers. Because the function libraries available in scripting languages for web environments are limited, our main technical challenge is to tokenize efficiently by combining the functions that are available. To address this problem, our method uses associative arrays to eliminate unnecessary steps in tokenization. Evaluation experiments on Japanese text data showed that our tokenizer is viable in web server environments.
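The abstract does not spell out the algorithm, so the following is only a minimal sketch of how associative arrays can cut unnecessary steps in dictionary-based tokenization. It is not the authors' implementation. It assumes a greedy longest-match strategy and uses Python dictionaries in place of a web scripting language's associative arrays. The word list and the names `build_tables`, `tokenize`, `LEXICON`, and `PREFIXES` are hypothetical. One table records complete words; a second records every prefix of a word, so the scan at each position can stop as soon as no dictionary word can still match.

```python
def build_tables(words):
    """Build two associative arrays from a word list (hypothetical helper).

    lexicon  : complete dictionary words
    prefixes : every leading substring of every word, used for early exit
    """
    lexicon = {w: True for w in words}
    prefixes = {}
    for w in words:
        for i in range(1, len(w) + 1):
            prefixes[w[:i]] = True
    return lexicon, prefixes


def tokenize(text, lexicon, prefixes):
    """Greedy longest-match tokenization (an assumed strategy, for illustration)."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = text[pos]  # fall back to one character when nothing matches
        end = pos + 1
        while end <= len(text):
            candidate = text[pos:end]
            if candidate not in prefixes:
                break  # no dictionary word starts with this string: stop early
            if candidate in lexicon:
                best = candidate  # remember the longest complete word so far
            end += 1
        tokens.append(best)
        pos += len(best)
    return tokens


# Toy dictionary, invented for this example only.
WORDS = ["東京", "東京都", "京都", "に", "住む"]
LEXICON, PREFIXES = build_tables(WORDS)

print(tokenize("東京都に住む", LEXICON, PREFIXES))
```

Both lookups are single hash-table accesses, which is the property that makes associative arrays attractive when no trie library is available. A production tokenizer would also need part-of-speech information and a way to choose among competing segmentations, which this sketch omits.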