Automatic Identification of Dataset Names in Scholarly Articles of Various Disciplines
Abstract
Although the number of freely accessible scholarly articles is increasing, non-experts find them difficult to understand because they are written for experts and require background knowledge. Our overarching goal is to facilitate open innovation based on scholarly articles by developing methods that automatically extract their essential elements. If articles could be understood in this way, they would become primary resources for institutional research. To this end, this paper develops a method for automatically identifying dataset names in articles. Because evaluation requires a dictionary of datasets, existing methods have focused on a specific discipline. To achieve applicability to any discipline, we adopt a machine learning approach using a large number of papers. Since the papers span multiple disciplines, the authors are not familiar with all the dataset names in them. We therefore evaluate the experimental results quantitatively with precision@N, which does not require knowing all the datasets in the papers, and qualitatively check whether candidate tokens are dataset names using a GUI tool we have developed. Experimental results show that precision@N is 0.450 and nDCG is 0.458. However, the outputs include names of methods and software. Removing these noise tokens is important future work.
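As an illustration of the two measures named above, the following is a minimal sketch of how precision@N and nDCG could be computed over a ranked list of candidate tokens. It is not the authors' implementation. The function names, the binary judgments in `judged`, and the choice of the common log2(rank + 1) discount are assumptions made for this example. The ideal ranking for nDCG is built only from the judged candidates, so neither measure requires knowing every dataset in the papers.

```python
import math


def precision_at_n(relevance, n):
    """Fraction of the top-n ranked candidates judged to be dataset names."""
    if n <= 0:
        return 0.0
    return sum(1 for r in relevance[:n] if r > 0) / n


def dcg(relevance, n):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:n]))


def ndcg(relevance, n):
    """DCG normalized by the DCG of the best reordering of the judged list."""
    ideal = dcg(sorted(relevance, reverse=True), n)
    return dcg(relevance, n) / ideal if ideal > 0 else 0.0


# Hypothetical judgments for five ranked candidates:
# 1 = judged a dataset name, 0 = not (e.g. a method or software name).
judged = [1, 0, 1, 1, 0]

print(f"precision@5 = {precision_at_n(judged, 5):.3f}")
print(f"nDCG@5      = {ndcg(judged, 5):.3f}")
```

Precision@N counts only how many of the top N candidates are correct, while nDCG also rewards placing correct candidates nearer the top of the ranking.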