Weighted kernel method for text categorization

Publication Type:
Thesis
Issue Date:
2011
Files in This Item:
Filename, Size, Format
01Front.pdf, 2.48 MB, Adobe PDF
02Whole.pdf, 27.53 MB, Adobe PDF
NO FULL TEXT AVAILABLE. This thesis contains 3rd party copyright material.

Text categorization (or classification) is the task of assigning natural-text or hypertext documents to a fixed number of predefined categories based on their content. In the 1990s, machine learning became the dominant approach to text categorization. Popular machine learning approaches rely on kernel methods to build an automatic text classifier by learning from a set of pre-classified documents. These methods operate on either word frequency or word sequences, but not both. This thesis presents a new kernel method that operates on both word frequency and word sequences. The method comprises a new kernel model and a critical vector learning algorithm that works on the model to perform the text document classification task. The proposed model, called the Weighted Kernel Model (WKM), represents a text document by both its word-frequency and word-sequence information, combined through a weighting algorithm. The motivation for the WKM is that text documents differ from one another, whereas traditional research focused on large collections in which each document was itself large. The learning algorithm based on the WKM is shown to be effective, accurate and computationally efficient for various types of text documents, especially short and medium-length ones. The thesis demonstrates this by applying the WKM to two text data sets: the Reuters news data set and the Enron email data set. It concludes with the strengths and limitations of the proposed method.
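The abstract does not give the WKM's formulation, so the following is only a minimal sketch of the general idea of one kernel that uses both word frequency and word sequences. It assumes a convex combination of two cosine-normalized kernels, one over word counts and one over word bigrams. The function names and the fixed `weight` parameter are hypothetical and stand in for, rather than reproduce, the thesis's weighting algorithm.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Word-frequency features: token -> count."""
    return Counter(text.lower().split())


def word_bigrams(text):
    """Word-sequence features: adjacent word pair -> count."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (a normalized linear kernel)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # An empty feature vector (e.g. bigrams of a one-word document) scores 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_kernel(doc_x, doc_y, weight=0.5):
    """Convex combination of a frequency kernel and a sequence kernel.

    `weight` is an illustrative mixing parameter in [0, 1]; it is an
    assumption of this sketch, not the weighting scheme used by the WKM.
    """
    freq = cosine(bag_of_words(doc_x), bag_of_words(doc_y))
    seq = cosine(word_bigrams(doc_x), word_bigrams(doc_y))
    return weight * freq + (1.0 - weight) * seq
```

With `weight=0.5`, two documents that contain the same words in reversed order score 1 on the frequency term and 0 on the sequence term, so the combined value is about 0.5. A frequency-only kernel would treat them as identical. Because each term is a valid kernel and the weights are non-negative, the combination is also a valid kernel and could be passed to any kernel-based classifier.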