A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts

Chumwatana, Todsanai and Wong, Kok Wai and Xie, Hong (2010) A SOM-Based Document Clustering Using Frequent Max Substrings for Non-Segmented Texts. Journal of Intelligent Learning Systems and Applications, 02 (03). pp. 117-125. ISSN 2150-8402

Abstract

This paper proposes a document clustering method for non-segmented texts that combines the self-organizing map (SOM) with the frequent max substring technique to improve the efficiency of information retrieval. SOM has been widely used for document clustering and has been successful in many applications. However, when it is applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. The proposed method has two main phases: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to the non-segmented texts to discover the patterns of interest, called Frequent Max substrings, which are long and frequent substrings rather than individual words. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, SOM is used to generate the document cluster map from the feature vectors of Frequent Max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
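
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function `frequent_max_substrings` is a brute-force stand-in that assumes a "frequent max substring" is a substring occurring at least `min_freq` times that is not contained in a longer frequent substring with the same count. The parameters `min_freq`, `min_len`, `max_len`, the 2x2 map size, the learning schedule, and the toy Latin-alphabet strings (used here in place of Thai text) are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def frequent_max_substrings(texts, min_freq=2, min_len=2, max_len=6):
    """Brute-force stand-in (assumption): keep substrings seen >= min_freq times
    that no longer frequent substring contains with the same count."""
    counts = Counter()
    for text in texts:
        for n in range(min_len, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_freq}
    return sorted(
        s for s, c in frequent.items()
        if not any(s != t and s in t and frequent[t] == c for t in frequent)
    )


def document_vector(text, terms):
    """Count (possibly overlapping) occurrences of each indexing term."""
    return [sum(text.startswith(t, i) for i in range(len(text))) for t in terms]


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], vector)),
    )


def train_som(vectors, rows=2, cols=2, epochs=50, lr=0.5, seed=0):
    """Train a very small SOM with a shrinking learning rate and radius."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        rate = lr * decay
        radius = max(rows, cols) / 2 * decay
        for v in vectors:
            bmu = best_matching_unit(weights, v)
            for node, w in weights.items():
                # Manhattan distance on the map grid defines the neighbourhood.
                if abs(node[0] - bmu[0]) + abs(node[1] - bmu[1]) <= radius:
                    for k in range(dim):
                        w[k] += rate * (v[k] - w[k])
    return weights


# Toy, unsegmented placeholder documents (hypothetical data).
docs = ["abcabcabc", "abcabc", "xyzxyzxyz", "xyzxyz"]
terms = frequent_max_substrings(docs)
vectors = [document_vector(d, terms) for d in docs]
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(doc, vec, best_matching_unit(som, vec))
```

Each document is mapped to the map node whose weight vector is nearest to its term-count vector, so documents that share many frequent substrings tend to land on the same or neighbouring nodes. Enumerating every substring is quadratic in text length, so this sketch only suits small inputs.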

Item Type: Article
Subjects: Pustakas > Engineering
Depositing User: Unnamed user with email support@pustakas.com
Date Deposited: 11 Feb 2023 09:05
Last Modified: 15 Feb 2024 04:24
URI: http://archive.pcbmb.org/id/eprint/115
