Classification model based on mRMR and factorization machines algorithm

Authors: Wang Mei (Kunming University of Science and Technology); Long Hua (Kunming University of Science and Technology, Faculty of Information Engineering and Automation); Shao Yubin (Kunming University of Science and Technology, Faculty of Information Engineering and Automation); Du Qingzhi (Kunming University of Science and Technology, Faculty of Information Engineering and Automation)

Received: 2019-04-03. Year, volume (issue), pages: 2020, 57(1): 96-102

Journal: Journal of Sichuan University (Natural Science Edition)

Keywords: minimal redundancy maximal relevance (mRMR); GTD; factorization machines; MCC; TFM classification model

Funding: National Natural Science Foundation of China (61761025)

Chinese abstract

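Many researchers have analyzed the clustering of terrorist events in the Global Terrorism Database (GTD) using game theory, k-nearest neighbors, and support vector machines, with some success. Earlier work, however, did not adequately account for data sparsity or for high dimensionality with heavy feature redundancy, both of which lower classification accuracy. This paper proposes a TFM classification model that combines minimal redundancy maximal relevance (mRMR) with factorization machines (FM): an incremental search finds a near-optimal feature subset to handle the high-dimensional, redundant features, FM handles the sparse data, and the TFM model then performs quantitative classification of the preprocessed terrorist attack data. In the experiments, the optimal feature set is chosen by minimizing the loss function; the minimum loss converges to 6.0442573, which corresponds to 37 features. With the same selected features, the paper compares naive Bayes (NB), support vector machine (SVM), logistic regression (LR), and TFM on the GTD dataset using the Matthews correlation coefficient (MCC). The MCC of TFM is 107.6%, 2.6%, and 2.4% higher than that of NB, SVM, and LR, respectively, which indicates that the TFM model is feasible to some extent. As a trade-off for the improved model, the experiments show that TFM takes longer to compute, probably because of the auxiliary vectors it introduces; addressing this is left for future work.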
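The incremental search mentioned in the abstract can be illustrated with a greedy mRMR selection. The sketch below is not the authors' implementation: it assumes discrete-valued features, estimates mutual information from raw counts, and uses the common "relevance minus mean redundancy" criterion. The names `mutual_information` and `mrmr_select` are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedy (incremental) mRMR: return k feature names from `columns`.

    columns: dict mapping feature name -> list of discrete values
    labels:  list of class labels, same length as each column
    """
    # Relevance of each feature: mutual information with the class labels.
    relevance = {
        name: mutual_information(col, labels) for name, col in columns.items()
    }
    selected = []
    remaining = set(columns)

    def score(name):
        if not selected:
            return relevance[name]
        # Redundancy: mean mutual information with already selected features.
        redundancy = sum(
            mutual_information(columns[name], columns[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Calling `mrmr_select(columns, labels, 37)` would return 37 feature names under these assumptions; how the paper ties the subset size to its loss function is not described in the abstract, so that step is left out here.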

English abstract

Many researchers have analyzed the clustering of terrorist events in the Global Terrorism Database (GTD) using game theory, the k-nearest neighbor method, and support vector machines, with some success. However, previous work did not adequately consider data sparsity or high dimensionality with heavy redundancy, which can lower classification accuracy. This paper proposes a TFM classification model that combines minimal redundancy maximal relevance (mRMR) with factorization machines (FM): an incremental search finds approximately optimal features to address the high-dimensional, redundant feature space, and FM handles the data sparsity. The TFM model is then used for quantitative classification of the preprocessed terrorist attack data. In the experiments, the Matthews correlation coefficient (MCC) of the proposed TFM model is 49.9%, 2.5%, and 2.3% higher than that of naive Bayes (NB), support vector machine (SVM), and logistic regression (LR), respectively. The comparison indicates that the TFM model is feasible to some extent.
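For readers unfamiliar with the evaluation metric, the following is a minimal sketch of the Matthews correlation coefficient computed from confusion-matrix counts. It assumes the binary form of the metric; the abstract does not say whether the paper uses the binary or a multiclass variant, and the function name `mcc_from_counts` is hypothetical.

```python
from math import sqrt


def mcc_from_counts(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns a value in [-1, 1]. When any marginal total is zero the
    coefficient is undefined; 0.0 is returned in that case by convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A perfect classifier gives 1, a classifier that is always wrong gives -1, and values near 0 indicate performance no better than chance, which is why the metric is often preferred over accuracy on imbalanced data.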
