ANEEC: A quasi-automatic system for massive named entity extraction and categorization

Peng, B; Wu, J; Yuan, H; Guo, Q; Tao, D

Publication Type: Journal Article
Citation: Computer Journal, 2013, 56 (11), pp. 1328 - 1346
Issue Date: 2013-11-01

Closed Access

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Peng, B	en_US
dc.contributor.author	Wu, J	en_US
dc.contributor.author	Yuan, H	en_US
dc.contributor.author	Guo, Q	en_US
dc.contributor.author	Tao, D https://orcid.org/0000-0001-7225-5449	en_US
dc.date.issued	2013-11-01	en_US
dc.identifier.citation	Computer Journal, 2013, 56 (11), pp. 1328 - 1346	en_US
dc.identifier.issn	0010-4620	en_US
dc.identifier.uri	http://hdl.handle.net/10453/27133
dc.relation.ispartof	Computer Journal	en_US
dc.relation.isbasedon	10.1093/comjnl/bxs114	en_US
dc.subject.classification	Computation Theory & Mathematics	en_US
dc.title	ANEEC: A quasi-automatic system for massive named entity extraction and categorization	en_US
dc.type	Journal Article
utslib.citation.volume	11	en_US
utslib.citation.volume	56	en_US
utslib.for	0802 Computation Theory and Mathematics	en_US
utslib.for	08 Information and Computing Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access
pubs.issue	11	en_US
pubs.publication-status	Published	en_US
pubs.volume	56	en_US

Abstract:

Named entity recognition seeks to locate atomic elements in text and classify them into predefined categories. It is useful for many applications, including microblog analysis and query suggestion. In recent years, with the explosion of Web 2.0, extracting large-scale, high-quality entities from structured web content has emerged as a promising approach. However, existing studies seldom provide an integrated system that simultaneously extracts and categorizes both head and tail entities, and identifying ambiguous entities remains a challenging task. In light of this, we propose a system named quasi-Automatic Named Entity Extraction and Categorization (ANEEC) for massive named-entity management. Specifically, ANEEC first identifies representative websites by using a small seed set of entities and the query logs of a search engine, and then extracts high-quality entities from the parallel structures in the webpages. ANEEC then employs the extracted entities and their corresponding atom-level groups to establish an entity taxonomy as well as a hierarchical classifier ensemble. Two problems, i.e. definition abnormality and granularity unfitness, are also addressed to further improve the quality of the taxonomy. An application case using 932 seed entities and the query logs of the search engine Bing demonstrates that ANEEC can effectively identify over 870 000 named entities in 32 bottom-level categories, and that the resulting taxonomy achieves a classification performance of F1 = 85.17%, provided that the entity features are properly preprocessed and weighted. In particular, ANEEC shows potential for tail entity recognition and ambiguous entity detection.

© The Author 2012. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.
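
The abstract's idea of extracting entities from parallel structures using a small seed set can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is not described in this record. It assumes that a "parallel structure" is a flat HTML list, and that a list containing at least `min_hits` seed entities yields its remaining items as candidate entities. The names `ListCollector`, `harvest`, and `min_hits` are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Collect the text of <li> items, grouped by enclosing <ul>/<ol>.

    Assumption: lists are flat (no nested lists inside items).
    """

    def __init__(self):
        super().__init__()
        self.groups = []       # finished lists, each a list of item strings
        self._current = None   # items of the list currently being read
        self._buf = None       # text fragments of the item being read

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._current = []
        elif tag == "li" and self._current is not None:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buf is not None:
            # Join fragments and normalize whitespace.
            text = " ".join("".join(self._buf).split())
            if text and self._current is not None:
                self._current.append(text)
            self._buf = None
        elif tag in ("ul", "ol") and self._current is not None:
            self.groups.append(self._current)
            self._current = None
            self._buf = None


def harvest(html, seeds, min_hits=2):
    """Return non-seed items from lists that contain >= min_hits seeds."""
    parser = ListCollector()
    parser.feed(html)
    parser.close()
    seeds_lower = {s.lower() for s in seeds}
    found = set()
    for group in parser.groups:
        hits = sum(1 for item in group if item.lower() in seeds_lower)
        if hits >= min_hits:
            found.update(i for i in group if i.lower() not in seeds_lower)
    return found


page = """
<ul><li>Paris</li><li>Berlin</li><li>Lisbon</li><li>Oslo</li></ul>
<ul><li>Home</li><li>About</li><li>Paris</li></ul>
"""
print(sorted(harvest(page, {"Paris", "Berlin"})))  # prints ['Lisbon', 'Oslo']
```

The second list is skipped because it contains only one seed, which is how the threshold keeps navigation menus from being harvested as entities. A real system would also need the website selection, taxonomy, and classifier steps that the abstract mentions, none of which are sketched here.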

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/27133