Overcoming Semantic drift in information extraction

Li, Z; Li, H; Wang, H; Yang, Y; Zhang, X; Zhou, X

Publication Type: Conference Proceeding
Citation: Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology, Proceedings, 2014, pp. 169 - 180
Issue Date: 2014-01-01

Closed Access

	Filename	Description	Size	Format
	paper_281.pdf	Published version	779.39 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, Z	en_US
dc.contributor.author	Li, H	en_US
dc.contributor.author	Wang, H	en_US
dc.contributor.author	Yang, Y https://orcid.org/0000-0001-5528-0546	en_US
dc.contributor.author	Zhang, X	en_US
dc.contributor.author	Zhou, X	en_US
dc.date.issued	2014-01-01	en_US
dc.identifier.citation	Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology, Proceedings, 2014, pp. 169 - 180	en_US
dc.identifier.isbn	9783893180653	en_US
dc.identifier.uri	http://hdl.handle.net/10453/120922
dc.description.abstract	Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are the "causes" of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by the DPs. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, where 90.5 million isA pairs are automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method enables us to clean more than 90% of incorrect instances with 95% precision, which outperforms the previous approaches we compare against. As a result, our method greatly improves the precision of this large isA data set from less than 50% to over 90%.	en_US
dc.relation.ispartof	Advances in Database Technology - EDBT 2014: 17th International Conference on Extending Database Technology, Proceedings	en_US
dc.relation.isbasedon	10.5441/002/edbt.2014.16	en_US
dc.title	Overcoming Semantic drift in information extraction	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

Semantic drift is a common problem in iterative information extraction. Previous approaches for minimizing semantic drift may incur substantial loss in recall. We observe that most semantic drifts are introduced by a small number of questionable extractions in the earlier rounds of iterations. These extractions subsequently introduce a large number of questionable results, which lead to the semantic drift phenomenon. We call these questionable extractions Drifting Points (DPs). If erroneous extractions are the "symptoms" of semantic drift, then DPs are the "causes" of semantic drift. In this paper, we propose a method to minimize semantic drift by identifying the DPs and removing the effect introduced by the DPs. We use isA (concept-instance) extraction as an example to demonstrate the effectiveness of our approach in cleaning information extraction errors caused by semantic drift. We perform experiments on an iterative isA relation extraction, where 90.5 million isA pairs are automatically extracted from 1.6 billion web documents with low precision. The experimental results show that our DP cleaning method enables us to clean more than 90% of incorrect instances with 95% precision, which outperforms the previous approaches we compare against. As a result, our method greatly improves the precision of this large isA data set from less than 50% to over 90%.
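
The abstract does not say how DPs are detected or how their effect is removed, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes each extracted instance records which earlier instances triggered it (a hypothetical `supports` mapping), and that the set of drifting points is already known. An instance is dropped when every one of its supports is a drifting point or was itself dropped. The function name, the `drop_points` flag, and the example data for the concept "animal" are all invented for illustration.

```python
def remove_drift_effects(supports, drifting_points, drop_points=False):
    """Return the instances that survive once DP-only support is discarded.

    supports: dict mapping each instance to the set of earlier instances
              that triggered its extraction (empty set = seed instance).
    drifting_points: instances assumed to have been flagged as DPs.
    drop_points: if True, the DPs themselves are removed as well.
    """
    # Support that passes through a tainted instance is not trusted.
    tainted = set(drifting_points)
    changed = True
    while changed:  # propagate until a fixed point is reached
        changed = False
        for item, parents in supports.items():
            if item in tainted or not parents:
                continue  # already tainted, or a seed with no parents
            if parents <= tainted:  # every supporting instance is tainted
                tainted.add(item)
                changed = True
    dropped = tainted if drop_points else tainted - set(drifting_points)
    return {item for item in supports if item not in dropped}


# Hypothetical extraction history for the concept "animal".
supports = {
    "dog": set(),                 # seed
    "cat": set(),                 # seed
    "chicken": {"dog"},           # triggered by "dog"
    "pork": {"chicken"},          # triggered only by the DP
    "beef": {"chicken", "pork"},  # triggered only by the DP and its effects
    "duck": {"chicken", "cat"},   # also supported by the seed "cat"
}

print(sorted(remove_drift_effects(supports, {"chicken"})))
# ['cat', 'chicken', 'dog', 'duck']
print(sorted(remove_drift_effects(supports, {"chicken"}, drop_points=True)))
# ['cat', 'dog', 'duck']
```

In this sketch "duck" survives because it has independent support from a seed, which hints at how removing only the effect of a DP, rather than everything extracted after it, could limit the loss in recall that the abstract attributes to previous approaches.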

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/120922