Malware family classification based on text embedding feature representation

Authors: Zhang Tao (Sichuan University); Wang Junfeng (Sichuan University)

Received: 2018-08-09. Year, volume (issue), pages: 2019, 56(3): 441-449

Journal: Journal of Sichuan University (Natural Science Edition)

Keywords: Malware; Classification; Text embedding; Neural probabilistic language model (NNLM)

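Funding: National Key Research and Development Program of China (2016YFB0800605, 2016QY06X1205); Ministry of Education Joint Fund for Equipment Pre-Research (6141A02033304, 6141A02011607); Sichuan Province Key Research and Development Program (18ZDYF3867, 18ZDYF2039)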

Abstract

Automation, efficiency, and fine granularity are the main challenges currently facing malware detection and classification. Deep learning, which has been applied successfully in image processing, speech recognition, and natural language processing, has to some extent relieved the heavy manpower and time costs of traditional analysis methods. This paper therefore proposes mal2vec, an automatic, efficient, and fine-grained malware analysis method. It treats each malware sample as a text rich in behavioral semantic information, where the content of the text is the sequence of API calls recorded during the sample's dynamic execution, and trains the classical neural probabilistic model Doc2Vec on the resulting text collection. Experimental results show that the method classifies markedly better than the approach of Rieck et al. [1]. In particular, unlike other deep learning methods, it can extract the intermediate results of model training as an explicit representation. This explicit representation is interpretable, which makes it possible to analyze the behavior patterns of malware families at a fine-grained level.
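The idea of treating each sample's API call trace as a document and learning one vector per document can be illustrated with a small sketch. What follows is not the authors' implementation: it is a toy, standard-library-only version of the PV-DBOW variant of paragraph vectors with negative sampling, which ignores token order, and the abstract does not say which Doc2Vec variant or settings the paper uses. The sample names, the API traces, and every hyperparameter (`dim`, `epochs`, `lr`, `negatives`) are invented for illustration.

```python
import math
import random


def train_pv_dbow(docs, dim=8, epochs=200, lr=0.05, negatives=3, seed=0):
    """Toy PV-DBOW: learn one vector per document by predicting its tokens.

    docs maps a sample id to its list of API-call tokens (hypothetical data).
    Returns (document vectors, token output vectors).
    """
    rng = random.Random(seed)
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    # Document vectors start small and random; token vectors start at zero.
    doc_vecs = {d: [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for d in docs}
    tok_vecs = {t: [0.0] * dim for t in vocab}

    def sigmoid(x):
        # Clamp the input so math.exp cannot overflow.
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

    for _ in range(epochs):
        for doc_id, toks in docs.items():
            dv = doc_vecs[doc_id]
            for tok in toks:
                # One positive target plus a few uniformly sampled negatives.
                targets = [(tok, 1.0)]
                targets += [(rng.choice(vocab), 0.0) for _ in range(negatives)]
                grad_doc = [0.0] * dim
                for target, label in targets:
                    if label == 0.0 and target == tok:
                        continue  # skip a negative that equals the positive
                    tv = tok_vecs[target]
                    score = sigmoid(sum(a * b for a, b in zip(dv, tv)))
                    g = lr * (label - score)
                    for i in range(dim):
                        grad_doc[i] += g * tv[i]
                        tv[i] += g * dv[i]
                for i in range(dim):
                    dv[i] += grad_doc[i]
    return doc_vecs, tok_vecs


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(doc_id, doc_vecs):
    """Return the id of the other document with the most similar vector."""
    others = [d for d in doc_vecs if d != doc_id]
    return max(others, key=lambda d: cosine(doc_vecs[doc_id], doc_vecs[d]))


# Invented traces: two file/registry-style samples, two network-style samples.
traces = {
    "sample_a1": ["CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"],
    "sample_a2": ["CreateFileW", "WriteFile", "RegSetValueExW", "RegCloseKey"],
    "sample_b1": ["socket", "connect", "send", "recv", "closesocket"],
    "sample_b2": ["socket", "connect", "recv", "send", "closesocket"],
}

doc_vecs, tok_vecs = train_pv_dbow(traces)
print(nearest("sample_a1", doc_vecs))
```

In a setting like the one the abstract describes, the learned document vectors would serve as features for a family classifier, and a maintained Doc2Vec implementation would replace this toy training loop.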
