Cross-Language Topic Discovery in Chinese and English Based on the ICE-LDA Model
Analysis and Research on Cross Language Topic Discovery in Chinese and English
Authors: CHEN Xingshu (Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China; College of Computer Science, Sichuan University, Chengdu 610065, China); LUO Liang (College of Computer Science, Sichuan University, Chengdu 610065, China); WANG Haizhou (Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China; College of Computer Science, Sichuan University, Chengdu 610065, China); WANG Wenxian (Cybersecurity Research Institute, Sichuan University, Chengdu 610065, China; College of Computer Science, Sichuan University, Chengdu 610065, China); GAO Yue (College of Computer Science, Sichuan University, Chengdu 610065, China)
Received: 2016-09-18. Year, volume (issue), pages: 2017, 49(2): 100-106
Journal: Advanced Engineering Sciences (工程科学与技术)
Keywords: topic discovery; cross-language (Chinese-English) text; ICE-LDA model; TF-IDF feature word extraction; co-occurrence topic
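Funding: National Key Technology R&D Program of China (2012BAH18B05); National Natural Science Foundation of China (61272447); Sichuan University Young Teachers' Start-up Fund (2015SCU11079)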
Abstract
With the Internet developing rapidly under globalization, mining cross-language web data has become a focus of public opinion analysis in China and abroad. Detecting hot topics across Chinese and English web content effectively and in real time is crucial for tracking how public opinion develops. Online news, an important component of online public opinion, has become a major and convenient source of information as Internet access has spread. First, this paper collects Chinese and English online news as the data source and proposes ICE-LDA, a model that improves on LDA, to discover co-occurrence topics across Chinese and English web content. The topics produced by the model are represented as vectors and compared using the Jensen-Shannon (JS) distance and the cosine similarity of their topic-text distributions. Second, a comparable parallel corpus and a non-comparable corpus are built from the mixed Chinese-English news collected by the crawler, and topics are modeled on each. During modeling, the TF-IDF algorithm extracts feature words from each document, which removes meaningless noise words and improves the topic feature representation. Finally, two different topic vectorization schemes are used to discover cross-language co-occurrence topics. Experiments on a real dataset built with the crawler designed in this paper show that the improved topic model can discover cross-language co-occurrence topics in the comparable corpus without requiring prior topic pairs, and can also discover co-occurrence topics when the corpus is imbalanced.
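The topic comparison step mentioned in the abstract (JS distance plus cosine similarity between topic vectors) might be sketched as follows. This is only an illustration, not the authors' ICE-LDA implementation: it assumes each topic has already been turned into a probability vector over a shared index (for example, the documents of the mixed corpus), and the function name `match_topics` and the thresholds `max_js` and `min_cos` are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two probability vectors (base 2, in [0, 1])."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of the two distributions
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_topics(zh_topics, en_topics, max_js=0.2, min_cos=0.8):
    """Pair Chinese and English topics whose vectors are close.

    max_js and min_cos are assumed, illustrative thresholds.
    Returns a list of (chinese_index, english_index) pairs.
    """
    pairs = []
    for i, p in enumerate(zh_topics):
        for j, q in enumerate(en_topics):
            if js_divergence(p, q) <= max_js and cosine_similarity(p, q) >= min_cos:
                pairs.append((i, j))
    return pairs


# Toy topic vectors over three shared documents (made-up numbers).
zh = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
en = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
pairs = match_topics(zh, en)
```

With these toy vectors, the first Chinese topic pairs with the second English topic and vice versa. Requiring both criteria to agree is one possible design choice; the abstract does not state how the two measures are combined.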