Automatic Summarization Considering Thread Structure and Time Series in Electronic Bulletin Board System for Discussion
Abstract
On electronic bulletin board systems for discussion, a topic under debate often branches into multiple subtopics, and the overall thread structure becomes complicated. Showing users summaries of the arguments helps them understand the content without reading the discussion from beginning to end. This paper proposes an automatic summarization method for a single thread that considers time series, reply relationships, and user information. In the proposed method, a thread is restructured into several clusters by hierarchical clustering, and important sentences, compressed using the linguistic relationships of predicate-argument structures, are selected within each cluster using LexRank, a stochastic graph-based method for computing the relative importance of textual units. We conducted quantitative and qualitative analyses comparing the proposed method with MMR as the baseline. Both analyses show that the proposed method reduces redundancy more and extracts fewer sentences unrelated to the overall context of the summary than the baseline. However, summaries produced by the proposed method included fewer important words than those of the baseline.
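As a rough illustration of the sentence-selection step, the following minimal sketch computes LexRank-style centrality scores for the sentences of one cluster. It is not the implementation used in this paper. The function name `lexrank_scores`, the whitespace tokenization, the raw term counts (in place of tf-idf weights), and the threshold and damping values are all assumptions made for the example, and sentence compression and clustering are omitted.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Return one centrality score per sentence; the scores sum to 1.

    Hypothetical helper: threshold, damping, and iterations are
    illustrative defaults, not values taken from the paper.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Unweighted similarity graph: link two sentences when their
    # cosine similarity exceeds the threshold (self-links included).
    adjacency = [
        [1.0 if cosine(bags[i], bags[j]) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each row is a probability distribution.
    transition = []
    for row in adjacency:
        total = sum(row)
        if total == 0:
            transition.append([1.0 / n] * n)  # empty sentence: jump uniformly
        else:
            transition.append([v / total for v in row])
    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


cluster = [
    "the park needs new benches and lights",
    "new benches cost money",
    "lights improve safety at night",
]
for sentence, score in zip(cluster, lexrank_scores(cluster)):
    print(f"{score:.3f}  {sentence}")
```

In this toy cluster the first sentence shares words with both of the others, so it receives the highest score and would be the first candidate for the summary of that cluster.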
References
M. Asahara, M. Sugi, and S. Yanagino. BCCWJ-SUMM: A summarization corpus of the ‘Balanced Corpus of Contemporary Written Japanese’. In Proceedings of The 7th Workshop of the Japanese Corpus Linguistics, pages 285–292, 2015.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of The 21st Annual International Association for Computing Machinery Special Interest Group on Information Retrieval (ACM SIGIR) Conference on Research and Development in Information Retrieval, SIGIR ’98, pages 335–336, New York, NY, USA, 1998. ACM.
G. Erkan and D. R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22(1):457–479, 2004.
T. Ito, M. Okumura, T. Ito, and E. Hideshima. Implementation of a large-scale discussion support system COLLAGREE - large-scale discussion support based on a weakly structured discussion process -. Journal of Japan Industrial Management Association, 66(2):83–108, 2015.
H. Jun and A. Murakami. Extraction of important sentences and topics from online discussion using the thread structure and lexical chain. In Proceedings of The 16th Annual Meeting of The Association for Natural Language Processing, pages 290–293, 2010.
D. Kawahara and S. Kurohashi. Japanese morphological analyzer JUMAN, 2012. http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN.
D. Kawahara and S. Kurohashi. Japanese syntax / case structure / anaphora analyzer KNP, 2012. http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP.
M. Kikuchi, M. Okamoto, and T. Yamazaki. Extraction of topic transition from document stream based on hierarchical clustering. Database Society of Japan (DBSJ) Journal, 7(1):85–90, 2008.
R. Kitajima and I. Kobayashi. Graph based multi-document summarization with latent topics. Intelligence and Information, 25(6):914–923, 2013.
T. Kudo, K. Yamamoto, and Y. Matsumoto. Applying conditional random fields to Japanese morphological analysis. In Proceedings of The 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237, 2004.
Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of The 31st International Conference on Machine Learning, pages 1188–1196, 2014.
Y. Matsuo, Y. Ohsawa, and M. Ishizuka. Mining and summarizing conversational data on electrical message boards. The 16th Annual Conference of the Japan Society of Artificial Intelligence, 16:1–4, 2002.
R. Mojena. Hierarchical grouping methods and stopping rules: An evaluation. The Computer Journal, 20(4):359–363, 1977.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.
D. R. Radev, H. Jing, M. Stys, and D. Tam. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938, 2004.
K. Yoshioka and M. Koeda. Extraction of related words and classification of words using BM25. In Proceedings of The 78th National Convention of Information Processing Society of Japan, volume 6, page 1, 2012.