An Improved Initial Cluster Centers Selection Algorithm for <i>K</i>-means Based on Features Correlative Degree

Authors: Chen Xingshu, Wu Xiaosong, Wang Wenxian, Wang Haizhou (all with the Network and Trusted Computing Institute, College of Computer Science, Sichuan University)

Received: 2014-06-23. Year, volume (issue), pages: 2015, 47(1): 13-19

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: <i>K</i>-means; feature correlative degree; initial cluster center; text clustering

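Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Research Startup Fund for Young Teachers (2013SCU11017)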

Abstract

To address the sensitivity of the K-means algorithm to its initial cluster centers in text clustering, an initial cluster center selection algorithm based on the correlative degree of features is proposed. Because it is difficult to find, in the original document set, several independent documents that are each strongly representative of a category and could serve as initial centers, features with a high correlative degree are first chosen from the dimension-reduced feature set and used to construct a new document set. A candidate set of initial cluster centers is then built by merging similar documents in the new set with an "OR operation". Finally, the best initial centers are selected from the candidate set by computing document density and applying the minimax principle. In comparative experiments on five datasets, the algorithm achieved F-scores above 90% and entropy values below 0.5 in most clustering results, clearly outperforming the K-means algorithm provided by Mahout. This indicates that the algorithm can choose high-quality initial cluster centers and produce better clustering results.
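The last two steps of the abstract (merging similar documents with an OR operation, then choosing centers by density and a max-min distance rule) can be illustrated with a minimal Python sketch. The abstract does not define the similarity measure, the density formula, or how density and the minimax principle are combined, so everything below is an assumption made for illustration only: documents are binary feature vectors, similarity is Jaccard, density is the number of candidates within a radius, and the function names `or_merge` and `select_centers` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two binary feature vectors (assumed measure)."""
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0


def or_merge(docs, threshold=0.5):
    """Greedily merge similar binary documents with an element-wise OR.

    The threshold and the greedy order are illustrative assumptions.
    """
    merged = []
    for doc in docs:
        for i, cand in enumerate(merged):
            if jaccard(doc, cand) >= threshold:
                merged[i] = tuple(x | y for x, y in zip(doc, cand))
                break
        else:
            # No similar candidate found: the document starts a new one.
            merged.append(tuple(doc))
    return merged


def density(idx, cands, radius):
    """Count the other candidates within `radius` (distance = 1 - Jaccard)."""
    return sum(
        1
        for j, c in enumerate(cands)
        if j != idx and 1 - jaccard(cands[idx], c) <= radius
    )


def select_centers(cands, k, radius=0.8):
    """Pick the densest candidate first, then repeatedly pick the candidate
    whose minimum distance to the already chosen centers is largest."""
    first = max(range(len(cands)), key=lambda i: density(i, cands, radius))
    chosen = [first]
    while len(chosen) < min(k, len(cands)):
        rest = [i for i in range(len(cands)) if i not in chosen]
        nxt = max(
            rest,
            key=lambda i: min(1 - jaccard(cands[i], cands[j]) for j in chosen),
        )
        chosen.append(nxt)
    return [cands[i] for i in chosen]


docs = [
    (1, 1, 0, 0, 0, 0),
    (1, 1, 1, 0, 0, 0),
    (0, 0, 0, 1, 1, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 1, 0, 0, 1),
]
candidates = or_merge(docs)
centers = select_centers(candidates, k=2)
```

In this toy input the five documents collapse into three candidates, and the resulting centers could then be passed to any K-means implementation as its starting points. The paper's actual definitions may differ from the ones assumed here.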
