Adaptive Text Clustering Based on Statistical Learning

Research of Adaptive Text Clustering Based on the Statistics of the Datasets

Authors: Wang Zonghu (School of Computer Science and Technology, Xidian University); Liu Zhijing (School of Computer Science and Technology, Xidian University); Chen Donghui (School of Computer Science and Technology, Xidian University)

Received: 2011-05-25. Year, volume (issue), pages: 2012, 44(1): 106-111

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: clustering; vector space model (VSM); similarity; partition; threshold

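Funding: National Key Technology R&D Program of China (2007BAH08802); Major Special Project of the Shaanxi Province "13115" Science and Technology Innovation Program (2007ZDKG-57)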

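Abstract (translated from the Chinese)

The high dimensionality and sparseness of text data make traditional clustering algorithms perform unsatisfactorily in text clustering. To address this problem, the method computes a document similarity matrix and, during clustering, dynamically learns statistics about the partitioned and unpartitioned document sets. It detects the largest dense region in the remaining unpartitioned data that has low coverage with the already partitioned clusters, and in this way gradually generates a predetermined number of initial cluster center sets. Finally, the remaining documents are assigned to the most similar initial cluster center set to complete the clustering, which effectively reduces the sensitivity of partitional clustering algorithms to the initial cluster centers. Some threshold parameters in the algorithm are obtained by dynamically learning statistics from the dataset during clustering, which avoids the blindness of setting threshold parameters by experience or experiment, as most clustering algorithms do; on different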

English Abstract

Because text data are high-dimensional and sparse, traditional clustering algorithms often perform unsatisfactorily on them. By learning the similarity information between the partitioned and remaining sets, the algorithm gradually selects the largest dense region that has a small coverage rate with the partitioned clusters as an initial cluster centroid set. After the predetermined number of initial cluster centroid sets has been generated, the remaining documents are assigned to their nearest clusters. In this way, the sensitivity of the clustering algorithm to the initial cluster centroids is reduced. Some threshold values used in the algorithm are computed dynamically from statistics of the dataset during clustering, which avoids the blindness of setting threshold parameters by experience or experiment, as most clustering algorithms do. Experiments on several Chinese and English datasets showed that this algorithm performs better than the clustering algorithms in CLUTO.
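The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definitions, so the term-frequency vectors, the threshold (taken here as the mean pairwise similarity), the "dense region" (documents at least threshold-similar to a candidate), and the coverage penalty (highest similarity to any already selected document) are all assumptions chosen for illustration. The function names `tf_vectors`, `cosine`, `similarity_matrix`, and `adaptive_cluster` are hypothetical.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Plain term-frequency vectors: a simple vector space model (no IDF)."""
    return [Counter(doc.lower().split()) for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(vecs):
    """Symmetric document-by-document cosine similarity matrix."""
    n = len(vecs)
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


def adaptive_cluster(sim, k):
    """Pick up to k dense seed regions, then assign the leftover documents."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(sim)
    # Assumption: the threshold is the mean pairwise similarity of the dataset.
    pairs = [sim[i][j] for i in range(n) for j in range(i + 1, n)]
    threshold = sum(pairs) / len(pairs) if pairs else 0.0

    unassigned = set(range(n))
    seeds = []
    while len(seeds) < k and unassigned:
        covered = [j for seed in seeds for j in seed]
        best_region, best_score = None, -1.0
        for i in sorted(unassigned):
            # Assumption: the dense region around i is every unassigned
            # document whose similarity to i reaches the threshold.
            region = {j for j in unassigned if sim[i][j] >= threshold}
            region.add(i)
            # Assumption: coverage is the highest similarity between i and
            # any document already placed in a seed region.
            overlap = max((sim[i][j] for j in covered), default=0.0)
            score = len(region) * (1.0 - overlap)
            if score > best_score:
                best_region, best_score = region, score
        seeds.append(best_region)
        unassigned -= best_region

    # Assign each remaining document to the seed set it is most similar to
    # on average.
    clusters = [set(seed) for seed in seeds]
    for i in sorted(unassigned):
        avg = [sum(sim[i][j] for j in seed) / len(seed) for seed in seeds]
        clusters[avg.index(max(avg))].add(i)
    return clusters


docs = [
    "apple banana fruit", "apple fruit juice", "banana fruit salad",
    "python code bug", "python code test", "code bug fix",
]
clusters = adaptive_cluster(similarity_matrix(tf_vectors(docs)), k=2)
```

Because the threshold is derived from the similarity matrix itself, the only value supplied by the caller is the number of clusters `k`, which mirrors the abstract's point about avoiding hand-tuned thresholds. A realistic pipeline would likely use weighted features (for example TF-IDF) and proper tokenization for Chinese text, neither of which is shown here.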
