Efficient approximate entity extraction with edit distance constraints

dc.contributor.author Wang, W
dc.contributor.author Xiao, C
dc.contributor.author Lin, X
dc.contributor.author Zhang, C
dc.contributor.editor Çetintemel, U
dc.contributor.editor Zdonik, SB
dc.contributor.editor Kossmann, D
dc.contributor.editor Tatbul, N
dc.date.accessioned 2010-06-17T04:37:15Z
dc.date.issued 2009-01
dc.identifier.citation Proceedings of the 35th SIGMOD international conference on Management of data, 2009, pp. 759-770
dc.identifier.isbn 978-1-60558-551-2
dc.identifier.other E1 en_US
dc.identifier.uri http://hdl.handle.net/10453/12354
dc.description.abstract Named entity recognition aims to extract named entities from unstructured text. A recent trend in named entity recognition is to find approximate matches in the text against a large dictionary of known entities, since the domain knowledge encoded in the dictionary helps improve extraction performance. In this paper, we study the problem of approximate dictionary matching with edit distance constraints. Compared with existing studies that use token-based similarity constraints, our problem definition captures typographical and orthographic errors, both of which are common in entity extraction tasks yet may be missed by token-based similarity constraints. The problem is technically challenging because existing approaches based on q-gram filtering perform poorly when the dictionary contains many short entities. Our proposed solution is based on an improved neighborhood generation method that employs novel partitioning and prefix pruning techniques. We also propose an efficient document processing algorithm that minimizes unnecessary comparisons and enumerations and hence scales well. We have conducted extensive experiments on several publicly available named entity recognition datasets. The proposed algorithm outperforms alternative approaches by up to an order of magnitude.
dc.publisher ACM
dc.title Efficient approximate entity extraction with edit distance constraints
dc.type Conference Proceeding
dc.parent Proceedings of the 35th SIGMOD international conference on Management of data
dc.journal.number en_US
dc.publocation Rhode Island, USA en_US
dc.identifier.startpage 759 en_US
dc.identifier.endpage 770 en_US
dc.cauo.name FEIT.Faculty of Engineering & Information Technology en_US
dc.conference Verified OK en_US
dc.conference ACM Special Interest Group on Management of Data Conference
dc.for 080101 Adaptive Agents and Intelligent Robotics
dc.for 080109 Pattern Recognition and Data Mining
dc.personcode 011221
dc.percentage 70 en_US
dc.classification.name Pattern Recognition and Data Mining en_US
dc.classification.type FOR-08 en_US
dc.edition en_US
dc.custom ACM Special Interest Group on Management of Data Conference en_US
dc.date.activity 20090629 en_US
dc.date.activity 2009-06-29
dc.location.activity Rhode Island, USA en_US
dc.description.keywords NA en_US
pubs.embargo.period Not known
pubs.organisational-group /University of Technology Sydney
pubs.organisational-group /University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group /University of Technology Sydney/Strength - Quantum Computation and Intelligent Systems
utslib.copyright.status Closed Access
utslib.copyright.date 2015-04-15 12:17:09.805752+10
utslib.collection.history Closed (ID: 3)