Chinese Text Categorization Based on Multi-Pooling

Authors: YANG Xin (Sichuan Water Conservancy Vocational College); JIANG Wei (Sichuan Water Conservancy Vocational College); LIU Xiao-Ling (Sichuan Water Conservancy Vocational College)

Received: 2016-06-12; Year, Volume(Issue): Pages: 2017, 54(2): 287-292

Journal Name: Journal of Sichuan University (Natural Science Edition)

Key words: Chinese text categorization; Pooling; Classification algorithm; Skip-gram; Softmax

Abstract

Text classification is one area of text mining, with wide application in information retrieval, e-mail filtering, web page classification, and related fields. Current text classification algorithms still capture too little information in their feature representations. To address this, this paper proposes a text classification algorithm based on multiple types of feature pooling. The algorithm first applies the skip-gram model to the word-segmented text to obtain word vectors, then applies several pooling operations to the word vectors of the entire text, and finally feeds the pooled features, combined as a single input, into a Softmax regression model to obtain the category of the text. Experiments on the test set of the text classification corpus provided by Fudan University show that the proposed multi-pooling features improve classification accuracy, which demonstrates the effectiveness of the algorithm.
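The pipeline described in the abstract (word vectors, several pooling operations over the whole text, then Softmax regression on the combined features) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the pooling operators, so max, min, and mean pooling are assumed here; random vectors stand in for skip-gram embeddings of segmented text; and the names `multi_pool`, `train_softmax_regression`, `predict`, and `make_toy_corpus` are hypothetical.

```python
import numpy as np


def multi_pool(word_vectors):
    """Pool a (num_tokens, dim) matrix into one (3 * dim,) document vector.

    Assumption: max, min, and mean pooling are used for illustration;
    the abstract does not say which pooling operators the paper combines.
    """
    return np.concatenate([
        word_vectors.max(axis=0),
        word_vectors.min(axis=0),
        word_vectors.mean(axis=0),
    ])


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def train_softmax_regression(X, y, n_classes, lr=0.1, epochs=200):
    """Fit softmax regression with batch gradient descent on cross-entropy."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        grad = (softmax(X @ W + b) - Y) / n_samples
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    """Return the most probable class index for each row of X."""
    return softmax(X @ W + b).argmax(axis=-1)


def make_toy_corpus(rng, n_docs=40, dim=8):
    """Hypothetical stand-in for skip-gram vectors of segmented documents.

    Each document is a (length, dim) matrix of noisy vectors centred on
    +1 (class 0) or -1 (class 1); real input would come from a trained
    skip-gram model.
    """
    docs, labels = [], []
    for i in range(n_docs):
        label = i % 2
        length = int(rng.integers(5, 15))  # documents vary in length
        center = 1.0 if label == 0 else -1.0
        docs.append(center + 0.5 * rng.standard_normal((length, dim)))
        labels.append(label)
    return docs, np.array(labels)


rng = np.random.default_rng(0)
docs, labels = make_toy_corpus(rng)
# Pooling turns variable-length documents into fixed-length feature vectors.
X = np.stack([multi_pool(doc) for doc in docs])
W, b = train_softmax_regression(X, labels, n_classes=2)
accuracy = (predict(X, W, b) == labels).mean()
print(f"training accuracy on toy data: {accuracy:.2f}")
```

Concatenating the pooled vectors gives the classifier a fixed-length input regardless of document length, which is what allows a simple Softmax regression to sit directly on top of the word vectors. The toy accuracy printed here says nothing about results on the Fudan corpus.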
