Fourth International Conference on Communication & Media Studies

Abstract

Digital technologies have made it easy and cheap to generate, store, and publish many kinds of data. As a result, in almost every discipline, people use automated systems that generate information as text in different natural languages, and there is growing interest in better solutions for finding, organizing, and analyzing these text documents. In this paper, we propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding. EK enables the representation of related concepts, and neural word embedding is used to capture the contexts of this relatedness. During the clustering process, all text documents first pass through pre-processing stages. Enriched features are then extracted from each document by mapping it to EK and to a trained word embedding model. Finally, the text documents are clustered using the popular spherical K-means algorithm. To test the feasibility of the proposed system, an Amharic text corpus and Amharic Wikipedia data were used. The study shows that using EK with word embedding for Amharic text document clustering improves average accuracy over using EK alone. Furthermore, changing the class size has a significant effect on accuracy: as the cluster size increases, the gap in clustering accuracy between using EK with and without word embedding widens.
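The final clustering step can be illustrated with a minimal sketch of spherical K-means, which groups unit-length document vectors by cosine similarity. This is not the system's actual implementation: the function name `spherical_kmeans`, its parameters, and the toy input are assumptions made for illustration, and the pre-processing and EK/word-embedding feature extraction are assumed to have already produced one numeric vector per document.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Illustrative spherical K-means: cluster rows of X by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every document vector onto the unit sphere.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Initialise centroids from k distinct documents (an assumed, simple choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old centroid for an empty cluster
            total = members.sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[j] = total / length
    return labels, centroids

# Hypothetical toy input: four "documents" pointing in two rough directions.
docs = [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]]
labels, centroids = spherical_kmeans(docs, k=2)
```

Because the vectors are normalised before clustering, document length does not influence the grouping; only the direction of each feature vector matters, which is the usual motivation for the spherical variant in text clustering.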