Text Clustering Using Encyclopedic Knowledge with Word Embedding

Abstract

Digital technologies have made it very easy and cheap to generate, store, and publish different kinds of data. Thus, in almost every discipline, people use automated systems that generate information represented as text in different natural languages. As a result, there is growing interest in better solutions for finding, organizing, and analyzing these text documents. In this paper, we propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding. EK enables the representation of related concepts, and neural word embedding is used to handle the context of this relatedness. During the clustering process, all text documents pass through pre-processing stages. Enriched features are then extracted from each document by mapping it to EK and to a trained word embedding model. Finally, the text documents are clustered using the popular spherical K-means algorithm. To test the feasibility of the proposed system, an Amharic text corpus and Amharic Wikipedia data were used. The study shows that using EK with word embedding for Amharic text document clustering improves average accuracy over using EK alone. Furthermore, changing the size of the class has a significant effect on accuracy: as the cluster size increases, the gap in clustering accuracy between using EK with and without word embedding widens.
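
The final clustering step named in the abstract, spherical K-means, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `spherical_kmeans`, its parameters, and the random initialisation are assumptions made for illustration, and the input matrix stands in for whatever enriched document features the EK mapping and the word embedding model would produce. The sketch shows only the core idea, which is to normalise document vectors to unit length, assign each document to the centroid with the highest cosine similarity, and re-normalise the centroids after each update.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50, seed=0):
    """Cluster the rows of X by cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Project every document vector onto the unit sphere.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Assumption: centroids start from k distinct, randomly chosen documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # For unit vectors, the dot product equals the cosine similarity.
        new_labels = np.argmax(X @ centroids.T, axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the previous centroid for an empty cluster
            total = members.sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                # Re-normalise so the centroid stays on the unit sphere.
                centroids[j] = total / length
    return labels, centroids
```

Calling `spherical_kmeans(features, k)` on a documents-by-features matrix returns one cluster label per document and the unit-length centroids. Because documents are compared by direction rather than magnitude, document length has less influence on the grouping than it would under Euclidean K-means.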

Presenters

Dessalew Yohannes

Details

Presentation Type

Poster Session

Theme

Media Technologies

Keywords

Concept-based Text Clustering, Feature Enrichment, Word Embedding, Encyclopedia
