ANEEC: A quasi-automatic system for massive named entity extraction and categorization
- Publication Type: Journal Article
- Citation: Computer Journal, 2013, 56 (11), pp. 1328-1346
- Issue Date:
Named entity recognition seeks to locate atomic elements in texts and classify them into predefined categories. It is particularly useful for many applications, including microblog analysis and query suggestion. In recent years, with the explosion of Web 2.0, extracting large-scale, high-quality entities from structured web content has emerged as a promising approach. However, existing studies seldom provide an integrated system for simultaneously extracting and categorizing both head and tail entities, and the identification of ambiguous entities remains a challenging task. In light of this, we propose a system named quasi-Automatic Named Entity Extraction and Categorization (ANEEC) for massive named-entity management. Specifically, ANEEC first identifies representative websites by using a small seed set of entities and the query logs of a search engine, and then extracts high-quality entities from the parallel structures in the webpages. ANEEC then employs the extracted entities and their corresponding atom-level groups to establish an entity taxonomy as well as a hierarchical classifier ensemble. Two problems, namely definition abnormality and granularity unfitness, are also addressed to further improve the quality of the taxonomy. An application case using 932 seed entities and the query logs of the search engine Bing demonstrates that ANEEC can effectively identify over 870 000 named entities in 32 bottom-level categories, and that the resulting taxonomy achieves excellent classification performance with F1 = 85.17%, provided that the entity features are properly preprocessed and weighted. In particular, ANEEC shows potential for tail entity recognition and ambiguous entity detection. © The Author 2012. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved.