Deep Web Information Extraction Based on Semantic Match over Pseudo Attributes

Authors: Zheng Jiaoling (郑皎凌), Tang Changjie (唐常杰), Jiang Yue (姜玥)

Affiliation: Institute of Database and Knowledge Engineering, School of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China

Received: 2008-01-23; year, volume (issue), pages: 2009, 41(2): 173-178

Journal: Advanced Engineering Sciences (工程科学与技术)

Keywords: deep web; information extraction; pseudo attributes; semantic match

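Funding: National Natural Science Foundation of China (60473071); Chengdu University of Information Technology institute-selected research project (CRF200819)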

Abstract

Existing deep web information extraction algorithms mainly extract templates from well-structured pages. Most deep web pages, however, are ill-structured: the attribute fields of a record may be missing or repeated, and an atomic attribute field may be split apart by HTML tags. To extract such pages correctly, a new method is proposed. It introduces the concepts of a record's pseudo attributes and their semantic match, and performs extraction by semantically matching the pseudo-attribute sequences of different records. A model of pseudo-attribute sequences with its semantic matching algorithm is presented, followed by a record wrapper model with its generation algorithm. Experiments show that, on ill-structured deep web pages, the method achieves a recall of 91% and a precision of 93%, which compares favorably with other algorithms.
