Paper Abstract

A method for solving class-label imbalance in named entity recognition datasets

Authors: Xu Lidan (College of Cybersecurity, Sichuan University); Liu Jiayong (College of Cybersecurity, Sichuan University); He Xiang (College of Electronics and Information Engineering, Sichuan University)

Received: 2019-05-11          Year, Volume(Issue): Pages: 2020, 57(1): 82-88

Journal Name: Journal of Sichuan University (Natural Science Edition)

Key words: Named entity recognition; Class imbalance; Data synthesis; Statistical learning model; Genetic algorithm

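Funding: Open Project Fund of the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, "Construction of a threat intelligence knowledge graph for unstructured data" (NST 18 001)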

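Chinese Abstract (English translation)

Publicly available datasets commonly used in named entity recognition research generally suffer from imbalanced class labels, which limits further performance gains for methods based on statistical learning models. To address this problem, a class-label balancing method based on a genetic algorithm is proposed. Starting from the labeled data already present in the original dataset, the method modifies the fitness function and the gene combination rules of the genetic algorithm to synthesize text with a balanced class distribution. The synthesized text is used to augment the original dataset, reducing label imbalance and thereby improving named entity recognition. To verify its effectiveness, the method was compared with balanced undersampling and random oversampling, using a Bi-LSTM-CRF model on the CoNLL 2003 and JNLPBA datasets. In these experiments, the proposed method improved model recall by 3.26% and F1 by 1.70% on CoNLL 2003, and improved recall by 2.44% and F1 by 1.03% on JNLPBA. The results indicate that the proposed method effectively alleviates class-label imbalance and thereby improves named entity recognition performance.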

English Abstract

Public datasets in named entity recognition research often have imbalanced class labels, which limits further performance improvement of methods based on statistical learning models. To address this problem, a class-label balancing method based on a genetic algorithm is proposed. It modifies the fitness function and gene combination rules to generate new samples that augment and balance the original dataset. To verify its validity, the proposed method was compared with balanced undersampling and random oversampling using the Bi-LSTM-CRF model on the CoNLL 2003 and JNLPBA datasets. The results show that the proposed method increased recall by 3.26% and the F1 value by 1.70% on the CoNLL 2003 dataset, and recall by 2.44% and the F1 value by 1.03% on the JNLPBA dataset. The experimental results demonstrate that the proposed method can effectively alleviate class imbalance and improve named entity recognition performance.
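As an illustration only, the idea of synthesizing sentences that favor under-represented entity classes with a genetic algorithm could be sketched as follows. The abstract does not specify the modified fitness function or the gene combination rules, so the `fitness` and `crossover` functions below, the BIO tagging scheme, the toy corpus, and all parameter names are assumptions made for this sketch, not the authors' method.

```python
import random
from collections import Counter

# A sentence is a list of (token, BIO tag) pairs. Toy data, not from the paper.
toy_corpus = [
    [("John", "B-PER"), ("lives", "O"), ("in", "O"), ("Paris", "B-LOC")],
    [("Mary", "B-PER"), ("met", "O"), ("Paul", "B-PER")],
    [("UN", "B-ORG"), ("staff", "O"), ("arrived", "O")],
    [("Anna", "B-PER"), ("Smith", "I-PER"), ("spoke", "O")],
]

def entity_counts(sentences):
    """Count entities per class by counting B- tags."""
    counts = Counter()
    for sent in sentences:
        for _, tag in sent:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

def is_valid_bio(sent):
    """True if every I-X tag directly follows a B-X or I-X tag."""
    prev = "O"
    for _, tag in sent:
        if tag.startswith("I-") and prev[2:] != tag[2:]:
            return False
        prev = tag
    return True

def cut_points(sent):
    """Positions where cutting the sentence does not split an entity span."""
    return [i for i in range(1, len(sent)) if not sent[i][1].startswith("I-")]

def crossover(a, b, rng):
    """Hypothetical combination rule: prefix of `a` joined to a suffix of `b`."""
    ca, cb = cut_points(a), cut_points(b)
    if not ca or not cb:
        return list(a)
    return a[:rng.choice(ca)] + b[rng.choice(cb):]

def fitness(sent, counts):
    """Hypothetical fitness: rare-class entities score higher, per token."""
    rare = sum(1.0 / max(counts[tag[2:]], 1)
               for _, tag in sent if tag.startswith("B-"))
    return rare / len(sent)

def synthesize(corpus, n_new, generations=10, pop_size=40, seed=0):
    """Evolve recombined sentences and return up to `n_new` that are not in `corpus`."""
    rng = random.Random(seed)
    counts = entity_counts(corpus)
    original = {tuple(s) for s in corpus}
    population = [list(s) for s in corpus]

    def score(s):
        return fitness(s, counts)

    for _ in range(generations):
        if len(population) < 2:
            break
        children = []
        for _ in range(pop_size):
            # Tournament selection: the fitter of two random sentences.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            children.append(crossover(p1, p2, rng))
        # Drop duplicates, then keep the fittest sentences.
        unique = {tuple(s): s for s in population + children}
        population = sorted(unique.values(), key=score, reverse=True)[:pop_size]
    synthetic = [s for s in population if tuple(s) not in original]
    return synthetic[:n_new]
```

Calling `synthesize(toy_corpus, n_new=5)` returns up to five recombined sentences, ranked so that those dense in the rarer LOC and ORG entities come first. The sketch makes no attempt to keep the generated text fluent, and it does not reproduce the experimental setup or results reported above.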
