Research on Classifying Phishing URLs Using Hybrid Architecture of LSTM and Random Forest

Authors: FANG Yong (College of Cybersecurity, Sichuan Univ., Chengdu 610065, China); LONG Xiao (College of Electronics and Information Engineering, Sichuan Univ., Chengdu 610065, China); HUANG Cheng (College of Cybersecurity, Sichuan Univ., Chengdu 610065, China); LIU Liang (College of Cybersecurity, Sichuan Univ., Chengdu 610065, China)

Received: 2017-09-19. Year, volume (issue), pages: 2018, 50(5): 196-201

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: long short-term memory; recurrent neural networks; random forest; phishing attack detection


Abstract

To address the high latency, low efficiency, and complex feature extraction of traditional phishing-site detection models, a hybrid model combining a long short-term memory (LSTM) network with a random forest is proposed. The model consists of two parts: URL context feature extraction and hybrid feature classification. First, a deep network with 128 time steps is built according to the characteristics of recurrent neural networks. The experimental data are phishing URLs and benign URLs drawn from intelligence published by open-source communities. Natural language processing techniques encode the URL data into URL sequences that carry local features. The LSTM network extracts character-context features from these sequences, and these are combined with the non-character-sequence features used in traditional detection methods to form the experimental feature set. Next, a random forest finds the best split point for each feature and builds the hybrid feature classification model. The model takes URL data as its detection source. On the one hand, it reduces the dimensionality of the character-sequence features passed to the random forest; on the other hand, by incorporating the non-sequence features of traditional phishing URL detection, it compensates for the LSTM algorithm's reliance on a single kind of feature. To verify the model, it is compared experimentally with the random forest algorithm and the LSTM algorithm, and the time cost of LSTMs at different training scales is further analyzed. The hybrid model reaches an accuracy of 98.52%, which is 3% higher than an LSTM at the same training scale and 7% higher than the single random forest used in the experiments. Moreover, for the same gain in accuracy, the hybrid model incurs a smaller time cost than the LSTM. The results indicate that the hybrid model overcomes the feature-extraction and efficiency problems of traditional recognition models and is suitable for rapid detection of phishing websites at large scale.
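
The data preparation that the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary, the reading of "128 steps" as a padded URL length of 128 characters, the four hand-crafted features, and the function names `encode_url`, `lexical_features`, and `hybrid_feature_vector` are all assumptions made for illustration. The LSTM and random forest themselves are left out; only the encoding and the feature concatenation are shown.

```python
import string

# Assumption: the "128-step" network is read here as a padded URL length of 128.
SEQ_LEN = 128
# Hypothetical vocabulary: printable ASCII characters mapped to ids 1..100.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
PAD_ID = 0                  # id used for right-padding
UNK_ID = len(VOCAB) + 1     # id used for characters outside the vocabulary


def encode_url(url, seq_len=SEQ_LEN):
    """Map a URL to a fixed-length list of integer character ids."""
    ids = [VOCAB.get(ch, UNK_ID) for ch in url[:seq_len]]  # truncate long URLs
    return ids + [PAD_ID] * (seq_len - len(ids))           # right-pad short URLs


def lexical_features(url):
    """Illustrative non-sequence features (assumed, not the paper's list)."""
    return [
        len(url),                         # total URL length
        url.count("."),                   # number of dots
        sum(ch.isdigit() for ch in url),  # number of digit characters
        int("@" in url),                  # 1 if the URL contains '@'
    ]


def hybrid_feature_vector(lstm_features, url):
    """Concatenate LSTM-derived features with the hand-crafted ones."""
    return list(lstm_features) + lexical_features(url)
```

In a full pipeline, the output of `encode_url` would feed the LSTM, and the vector returned by `hybrid_feature_vector` would be one training row for the random forest.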
