Instance-based and feature-based classification enhancement for short & sparse texts

Long, G

Publication Type: Thesis
Issue Date: 2014

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Long, G
dc.date.accessioned	2014-08-27T09:50:23Z
dc.date.available	2014-08-27T09:50:23Z
dc.date.issued	2014
dc.identifier.uri	http://hdl.handle.net/10453/29252
dc.description	University of Technology, Sydney. Faculty of Engineering and Information Technology.	en_US
dc.format	Thesis (PhD)	en_US
dc.language.iso	en	en_US
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/29252/2/02whole.pdf
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	au.edu.uts.lib/ppc
dc.subject	Short text.	en
dc.subject	Classification.	en
dc.subject	Text messages.	en
dc.subject	Text processing.	en
dc.title	Instance-based and feature-based classification enhancement for short & sparse texts	en_US
dc.type	Thesis
utslib.copyright.status	open_access

Abstract:

Short, sparse texts are becoming increasingly prevalent as a result of the growing popularity of social networking web sites, such as micro-blogs, Twitter and Flickr, and sites offering online product reviews. These short & sparse texts usually consist of a dozen or more words, or a few sentences, which we represent as a sparse document-term matrix. Compared to normal texts, short & sparse texts have three specific characteristics: (1) insufficient word co-occurrence to measure similarity, (2) low-quality data resulting from spelling errors, acronyms and slang, and (3) data sparseness. Normal classification methods therefore fail to achieve the desired level of accuracy when classifying short & sparse texts.

In this thesis, we present a series of novel approaches to enhance the performance of short & sparse text classification. Most texts can be represented as a two-dimensional matrix, and we use the terms “instance” and “feature” to denote the “row” and “column” concepts of the matrix, respectively. Corresponding to the matrix’s two dimensions, we design an instance-based and a feature-based framework to expand the rows and columns of the matrix:

• For the instance-based framework, we extract an auxiliary dataset from an external online source (i.e. Wikipedia) with predefined class information, and integrate the target and auxiliary datasets with an instance-based transfer learning tool to enhance the classification performance of the target short text domain. Moreover, we propose a sampling framework to handle the challenge of low-quality data in the auxiliary dataset.
• For the feature-based framework, we infer two kinds of feature sets from the given short texts, and then combine them with a multi-view learning tool to enhance the classification performance. To handle the view disagreement challenge, we integrate a bagging framework with multi-view learning.

The aim of the proposed algorithms is to improve classification performance (i.e. accuracy).

To evaluate the proposed algorithms, we test them using a variety of benchmark datasets and real-world datasets, such as sentiment texts from Twitter, pre-processed 20 Newsgroups data, review texts for seminars, and search snippets. Moreover, we compare the proposed algorithms with other benchmark algorithms on all datasets. The results of our experiments demonstrate that the accuracy of our proposed algorithms is superior to that of other similar algorithms.
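
The matrix view described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the example texts, the helper names (`words`, `char_bigrams`, `build_matrix`, `density`) and the choice of character bigrams as a second feature set are all assumptions made here for illustration. It only shows how short texts yield a sparse document-term matrix, how auxiliary texts add rows (instances), and how a second feature set adds columns (features). Plain column concatenation stands in for the feature-based step purely to show the matrix growing; the transfer learning, sampling, bagging and multi-view learning steps are not implemented.

```python
from collections import Counter


def words(text):
    """View 1: lowercase whitespace tokens (a deliberately naive tokenizer)."""
    return text.lower().split()


def char_bigrams(text):
    """View 2 (illustrative assumption): overlapping character bigrams."""
    s = text.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_matrix(texts, tokenizer):
    """Return (vocabulary, rows); each row is a sparse {column_index: count} dict."""
    docs = [tokenizer(t) for t in texts]
    vocabulary = sorted({tok for doc in docs for tok in doc})
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = [dict(Counter(index[tok] for tok in doc)) for doc in docs]
    return vocabulary, rows


def density(rows, n_columns):
    """Fraction of non-zero cells in the matrix."""
    return sum(len(r) for r in rows) / (len(rows) * n_columns)


# Made-up target short texts (instances) and made-up auxiliary texts that
# stand in for documents drawn from an external source.
target = ["gr8 phone luv it", "battery died again", "love the camera"]
auxiliary = ["the phone has a camera and a battery", "people love a great phone"]

# Baseline: the target texts alone form a small, sparse matrix.
word_vocab, word_rows = build_matrix(target, words)
base_density = density(word_rows, len(word_vocab))

# Instance-based expansion: auxiliary texts add rows over a shared vocabulary.
expanded_vocab, expanded_rows = build_matrix(target + auxiliary, words)

# Feature-based expansion: a second feature set adds columns to each target row.
bigram_vocab, bigram_rows = build_matrix(target, char_bigrams)
offset = len(word_vocab)  # bigram columns are placed after the word columns
combined_rows = [
    {**w, **{j + offset: c for j, c in b.items()}}
    for w, b in zip(word_rows, bigram_rows)
]
n_columns = offset + len(bigram_vocab)

print("target matrix:", len(word_rows), "x", len(word_vocab))
print("with auxiliary rows:", len(expanded_rows), "x", len(expanded_vocab))
print("with extra feature columns:", len(combined_rows), "x", n_columns)
```

In this toy data the slang tokens "gr8" and "luv" share no column with "great" and "love", which is the co-occurrence problem the abstract lists as characteristic (1).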

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/29252