
Paper Abstract

Multi-source Topic Detection Analysis Based on Improved ccLDA Model

Authors: CHEN Xingshu (College of Cybersecurity, Sichuan Univ., Chengdu 610065, China; College of Computer Sci., Sichuan Univ., Chengdu 610065, China); MA Chenxi (College of Computer Sci., Sichuan Univ., Chengdu 610065, China); WANG Wenxian (Cybersecurity Research Inst., Sichuan Univ., Chengdu 610065, China); GAO Yue (College of Computer Sci., Sichuan Univ., Chengdu 610065, China); WANG Haizhou (College of Cybersecurity, Sichuan Univ., Chengdu 610065, China)

Received: 2017-08-05          Year, Volume(Issue): Pages: 2018, 50(2): 141-147

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: topic detection; topic model; LDA; multi-source; IccLDA

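Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Science and Technology Department of Sichuan Province Program (16ZHSF0483)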

Abstract

The cross-collection LDA (ccLDA) topic model is currently applicable only when the topics of the individual data sources are highly similar. Moreover, it forces the global topics and the local topics of each data source to align, which leads to word sparsity. To address these shortcomings, an improved cross-collection topic model (improved ccLDA, IccLDA) is proposed. During sampling, the model first decides whether a word belongs to a global topic or a local topic and then samples accordingly. This removes the requirement in ccLDA that global and local topics be aligned and reduces the dispersion of words across global and local topics, making the model suitable for multi-source scenarios. Topic discovery experiments on multi-source text collections were conducted on public data sets, together with a comparative analysis of the topics. The results show that, for every number of topics tested, the perplexity of IccLDA is lower than that of both LDA and ccLDA, indicating better modeling ability. Finally, further experiments on real-world data sets show that the improved model not only outperforms the original model in modeling ability, but also effectively discovers the common topics discussed across data sources and the local topics discussed within each data source, making it better suited to topic discovery in multi-source text.
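
The abstract compares the models by perplexity but does not give the formula. The sketch below assumes the standard topic-model definition, exp(-Σ log p(w) / N), where p(w) is obtained by marginalising over topics. The function name `perplexity`, the arguments `theta` and `phi`, and all numbers are illustrative assumptions, not the authors' code or data.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a topic model on a corpus (standard definition, assumed here).

    docs  : list of documents, each a list of word ids
    theta : theta[d][k] = P(topic k | document d)
    phi   : phi[k][w]   = P(word w | topic k)
    """
    log_likelihood = 0.0
    token_count = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Marginalise over topics to get the probability of word w in document d.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p_w)
            token_count += 1
    return math.exp(-log_likelihood / token_count)


# Toy corpus with made-up numbers: 2 documents, 4-word vocabulary, 2 topics.
docs = [[0, 1, 0], [3, 2, 3]]
theta = [[0.9, 0.1], [0.1, 0.9]]

# A model that knows nothing: every topic is uniform over the vocabulary.
phi_uniform = [[0.25, 0.25, 0.25, 0.25],
               [0.25, 0.25, 0.25, 0.25]]

# A model whose topics match the documents' word usage more closely.
phi_sharp = [[0.4, 0.4, 0.1, 0.1],
             [0.1, 0.1, 0.4, 0.4]]

uniform_ppl = perplexity(docs, theta, phi_uniform)
sharp_ppl = perplexity(docs, theta, phi_sharp)
print(uniform_ppl, sharp_ppl)
```

In this toy setting the uniform model's perplexity equals the vocabulary size, and the better-fitting model scores lower. This is the sense in which a lower perplexity is read as better modeling ability.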

