An efficient Wikipedia semantic matching approach to text document classification

Wu, Z; Zhu, H; Li, G; Cui, Z; Huang, H; Li, J; Chen, E; Xu, G

Publication Type: Journal Article
Citation: Information Sciences, 2017, 393, pp. 15-28
Issue Date: 2017-07-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, Z	en_US
dc.contributor.author	Zhu, H	en_US
dc.contributor.author	Li, G	en_US
dc.contributor.author	Cui, Z	en_US
dc.contributor.author	Huang, H	en_US
dc.contributor.author	Li, J	en_US
dc.contributor.author	Chen, E	en_US
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663	en_US
dc.date.available	2020-05-25T19:03:43Z
dc.date.issued	2017-07-01	en_US
dc.identifier.citation	Information Sciences, 2017, 393 pp. 15 - 28	en_US
dc.identifier.issn	0020-0255	en_US
dc.identifier.uri	http://hdl.handle.net/10453/105006
dc.description.abstract	© 2017 Elsevier Inc. A traditional classification approach based on keyword matching represents each text document as a set of keywords without considering semantic information, which reduces classification accuracy. To solve this problem, a classification approach based on Wikipedia matching was proposed; it represents each document as a concept vector in the Wikipedia semantic space so as to capture the text semantics, and it has been demonstrated to improve classification accuracy. However, the immense size of the Wikipedia semantic space makes generating a concept vector slow, which limits the usability of the approach in an online environment. In this paper, we propose an efficient Wikipedia semantic matching approach to document classification. First, we define several heuristic selection rules that quickly pick out the concepts related to a document from the Wikipedia semantic space, so that it is no longer necessary to match the document against every concept in the space, which greatly improves the efficiency of concept vector generation. Second, based on the semantic representation of each text document, we compute the similarity between documents so as to classify them accurately. Finally, evaluation experiments demonstrate the effectiveness of our approach: it improves the classification efficiency of Wikipedia matching without compromising classification accuracy.	en_US
dc.relation.ispartof	Information Sciences	en_US
dc.relation.isbasedon	10.1016/j.ins.2017.02.009	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	An efficient Wikipedia semantic matching approach to text document classification	en_US
dc.type	Journal Article
utslib.citation.volume	393	en_US
utslib.for	0102 Applied Mathematics	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0999 Other Engineering	en_US
utslib.for	01 Mathematical Sciences	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	09 Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	393	en_US

Abstract:

© 2017 Elsevier Inc. A traditional classification approach based on keyword matching represents each text document as a set of keywords without considering semantic information, which reduces classification accuracy. To solve this problem, a classification approach based on Wikipedia matching was proposed; it represents each document as a concept vector in the Wikipedia semantic space so as to capture the text semantics, and it has been demonstrated to improve classification accuracy. However, the immense size of the Wikipedia semantic space makes generating a concept vector slow, which limits the usability of the approach in an online environment. In this paper, we propose an efficient Wikipedia semantic matching approach to document classification. First, we define several heuristic selection rules that quickly pick out the concepts related to a document from the Wikipedia semantic space, so that it is no longer necessary to match the document against every concept in the space, which greatly improves the efficiency of concept vector generation. Second, based on the semantic representation of each text document, we compute the similarity between documents so as to classify them accurately. Finally, evaluation experiments demonstrate the effectiveness of our approach: it improves the classification efficiency of Wikipedia matching without compromising classification accuracy.
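
The general idea in the abstract (select a small set of candidate concepts, build a concept vector over them, then compare documents by similarity) can be illustrated with a minimal Python sketch. This is not the authors' method: the heuristic selection rules are not given in this record, so the sketch substitutes one simple assumed rule (keep only concepts that share at least one term with the document, found through an inverted index). The toy concept space `CONCEPTS`, the term-overlap scoring, and the nearest-neighbour `classify` function are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy stand-in for the Wikipedia concept space:
# concept title -> a few words from its article text.
CONCEPTS = {
    "Machine learning": "algorithm model training data learning classification",
    "Football": "team goal match player league ball",
    "Cooking": "recipe ingredient oven heat flavor",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def build_index(concepts):
    """Return per-concept term counts and an inverted index term -> concept titles."""
    term_counts = {}
    inverted = defaultdict(set)
    for title, text in concepts.items():
        counts = Counter(tokenize(text))
        term_counts[title] = counts
        for term in counts:
            inverted[term].add(title)
    return term_counts, inverted


def concept_vector(doc, term_counts, inverted):
    """Build a sparse concept vector, scoring only candidate concepts.

    Assumed selection rule: a concept is a candidate only if it shares
    at least one term with the document, so the rest of the concept
    space is never touched.
    """
    doc_counts = Counter(tokenize(doc))
    candidates = set()
    for term in doc_counts:
        candidates |= inverted.get(term, set())
    vector = {}
    for title in candidates:
        counts = term_counts[title]
        # Simple term-overlap weight; Counter returns 0 for missing terms.
        vector[title] = sum(n * counts[term] for term, n in doc_counts.items())
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(key, 0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(doc, labeled_docs, term_counts, inverted):
    """Assign the label of the most similar labeled document (1-nearest neighbour)."""
    target = concept_vector(doc, term_counts, inverted)
    best_label, best_score = None, -1.0
    for text, label in labeled_docs:
        score = cosine(target, concept_vector(text, term_counts, inverted))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With `term_counts, inverted = build_index(CONCEPTS)`, a call such as `concept_vector("the player scored a goal", term_counts, inverted)` scores only the "Football" concept and never visits the other two. That is the efficiency argument in miniature: the cost of building a vector depends on the number of candidates rather than on the size of the whole concept space.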

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/105006