Instance-based and feature-based classification enhancement for short & sparse texts

Publication Type: Thesis
Issue Date: 2014
Short, sparse texts are becoming increasingly prevalent as a result of the growing popularity of social networking websites, such as micro-blogs, Twitter and Flickr, and of sites offering online product reviews. These short & sparse texts usually consist of a dozen or more words, or a few sentences, which we represent as a sparse document-term matrix. Compared to normal texts, short & sparse texts have three specific characteristics: (1) insufficient word co-occurrence to measure similarity, (2) low-quality data resulting from spelling errors, acronyms and slang, and (3) data sparseness. Conventional classification methods therefore fail to achieve the desired level of accuracy when classifying short & sparse texts. In this thesis, we present a series of novel approaches to enhance the performance of short & sparse text classification. Most texts can be represented as a two-dimensional matrix, and we use the terms “instance” and “feature” to denote the “row” and “column” of the matrix respectively. Corresponding to the matrix’s two dimensions, we design an instance-based framework and a feature-based framework to expand the rows and columns of the matrix:
• For the instance-based framework, we extract an auxiliary dataset with predefined class information from an external online source (i.e. Wikipedia), and integrate the target and auxiliary datasets using an instance-based transfer learning tool to enhance classification performance in the target short-text domain. Moreover, we propose a sampling framework to handle the challenge of low-quality data in the auxiliary dataset.
• For the feature-based framework, we infer two kinds of feature sets from the given short texts, and then combine them using a multi-view learning tool to enhance classification performance. To handle the challenge of view disagreement, we integrate a Bagging framework with multi-view learning.
The aim of the proposed algorithms is to improve classification performance (i.e. accuracy).
To evaluate the proposed algorithms, we test them on a variety of benchmark and real-world datasets, such as sentiment texts from Twitter, pre-processed 20 Newsgroups data, review texts for seminars, and search snippets. Moreover, we compare the proposed algorithms with other benchmark algorithms on all datasets. The results of our experiments demonstrate that the accuracy of our proposed algorithms is superior to that of other similar algorithms.
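The document-term matrix representation described in the abstract, with instances as rows and features as columns, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `build_document_term_matrix` and `sparsity`, the whitespace tokenisation, and the three example texts are all assumptions made for illustration only.

```python
from collections import Counter


def build_document_term_matrix(texts):
    """Return (vocabulary, rows), where rows[i][j] counts term j in text i.

    Each row is an "instance" (one short text) and each column is a
    "feature" (one vocabulary term). Tokenisation is a plain lowercase
    whitespace split, which is an assumption of this sketch.
    """
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(tokens).items():
            row[column[term]] = count
        rows.append(row)
    return vocabulary, rows


def sparsity(rows):
    """Return the fraction of zero cells in the matrix."""
    total = sum(len(row) for row in rows)
    zeros = sum(cell == 0 for row in rows for cell in row)
    return zeros / total if total else 0.0


# Hypothetical short texts, invented for illustration only.
texts = ["great phone great price", "battery died fast", "love this phone"]
vocabulary, rows = build_document_term_matrix(texts)

for text, row in zip(texts, rows):
    print(row, "<-", text)
print("sparsity:", sparsity(rows))
```

Even in this tiny example most cells are zero and the texts share almost no terms, which is the sparseness and lack of word co-occurrence that motivates expanding the rows (extra instances) or the columns (extra features).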