Multi-label Emotion Classification for Microblog Based on CNN Feature Space

Authors: SUN Songtao (School of Computer, Wuhan Univ., Wuhan 430072, China); HE Yanxiang (School of Computer, Wuhan Univ., Wuhan 430072, China; State Key Lab. of Software Eng., Wuhan Univ., Wuhan 430072, China)

Received: 2016-08-07; Year, Volume (Issue), Pages: 2017, 49(3): 162-169

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: emotion classification; multi-label classification; word embedding; convolutional neural network; semantic compositionality

Funding: National Natural Science Foundation of China (61303115; 61373039; 61472290); Specialized Research Fund for the Doctoral Program of Higher Education (2013014111002512)

Abstract

Emotion evaluation for microblogs is a multi-label classification problem, and traditional text representations based on the vector space model fail to provide effective semantic features for it. Word embeddings learned with deep learning capture the syntactic and semantic relations between words well, and sentence representations can be built from them effectively according to the principle of semantic compositionality. The authors propose a multi-label emotion classification system for microblog sentences. First, word embeddings for Chinese words are learned from a large-scale unlabeled Chinese microblog text dataset. Second, a convolutional neural network (CNN) model is trained as a supervised multi-emotion classifier. Third, the learned CNN model is used to compose the word embeddings of each microblog sentence into a sentence vector. Finally, these sentence vectors serve as semantic features for training a multi-label classifier, which performs the multi-label emotion classification. On the open dataset of the microblog emotion evaluation task of the 2013 NLPCC (Natural Language Processing and Chinese Computing) conference, the best performance of the proposed system improves on the best evaluation result by 19.16% on the loose metric and 17.75% on the strict metric. It also outperforms the best result previously reported in the literature, obtained by composing sentence vectors with a Recursive Neural Tensor Network model, by 3.66% and 2.89% on the two metrics, respectively. Several multi-label classifiers were used to compare feature representations, and the sentence vectors from the CNN feature space showed the most discriminative emotion semantics. An analysis of the iterative training of the CNN model reveals regularities of pattern recognition in the semantic composition process. Future work includes introducing more suitable deep learning models and further exploring semantic composition based on word embeddings.
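The pipeline described in the abstract (word vectors, then CNN composition into a sentence vector, then a multi-label classifier) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the single convolution window with tanh and max-over-time pooling, the binary-relevance logistic scorers standing in for the multi-label classifier, the toy tokens, the two label names, and the random untrained parameters. It only shows how per-word vectors become one fixed-length feature vector that can carry several labels at once.

```python
import math
import random

def conv_max_pool(word_vectors, filters, biases, window):
    """Compose one sentence vector: slide each filter over every window of
    `window` consecutive word vectors, apply tanh, keep the max per filter."""
    dim = len(word_vectors[0])
    # Zero-pad short sentences so that at least one window exists.
    padded = list(word_vectors)
    while len(padded) < window:
        padded.append([0.0] * dim)
    sentence_vector = []
    for weights, bias in zip(filters, biases):
        best = -math.inf
        for start in range(len(padded) - window + 1):
            # Concatenate the window's word vectors into one flat input.
            flat = [x for vec in padded[start:start + window] for x in vec]
            activation = math.tanh(sum(w * x for w, x in zip(weights, flat)) + bias)
            best = max(best, activation)
        sentence_vector.append(best)
    return sentence_vector

def predict_labels(sentence_vector, label_weights, label_biases, threshold=0.5):
    """Binary relevance (an assumed stand-in for the multi-label classifier):
    one independent logistic scorer per emotion label."""
    predicted = []
    for label, weights in label_weights.items():
        score = sum(w * x for w, x in zip(weights, sentence_vector)) + label_biases[label]
        if 1.0 / (1.0 + math.exp(-score)) >= threshold:
            predicted.append(label)
    return predicted

# Toy setup with random (untrained) parameters, purely to show the shapes.
rng = random.Random(0)
EMBED_DIM, WINDOW, NUM_FILTERS = 4, 2, 3
# Hypothetical segmented tokens; real embeddings would be learned from
# a large unlabeled microblog corpus.
embeddings = {w: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
              for w in ["jintian", "xinqing", "hen", "hao"]}
filters = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM * WINDOW)]
           for _ in range(NUM_FILTERS)]
biases = [0.0] * NUM_FILTERS

sentence = ["jintian", "xinqing", "hen", "hao"]
vector = conv_max_pool([embeddings[w] for w in sentence], filters, biases, WINDOW)

# Illustrative label names only; weights are random, so the output is arbitrary.
label_weights = {name: [rng.uniform(-1, 1) for _ in range(NUM_FILTERS)]
                 for name in ["happiness", "sadness"]}
label_biases = {"happiness": 0.0, "sadness": 0.0}
print(vector)
print(predict_labels(vector, label_weights, label_biases))
```

Because each label has its own scorer and threshold, a sentence may receive zero, one, or several emotion labels, which is what distinguishes the multi-label setting from ordinary single-label classification.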
