Weighted kernel method for text categorization
- Publication Type: Thesis
- Issue Date: 2011
Closed Access
Filename | Description | Size
---|---|---
01Front.pdf | contents and abstract | 2.48 MB
02Whole.pdf | thesis | 27.53 MB
This item is closed access and not available.
NO FULL TEXT AVAILABLE. This thesis contains third-party copyright material.

Text categorization (or classification) is the task of assigning natural-text or
hypertext documents to a fixed number of predefined categories based on their
content. In the 1990s, machine learning became the dominant approach to text
categorization. Popular machine learning approaches relied on kernel methods to
build an automatic text classifier by learning from a set of pre-classified
documents. These methods operate either on word frequencies or on word
sequences, but not on both.
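To make the distinction between the two kinds of information concrete, the following minimal Python sketch contrasts a word-frequency view of a document with a word-sequence view. It is an illustration only: the function names `word_counts` and `word_bigrams`, the whitespace tokenization, and the choice of adjacent word pairs as the sequence unit are assumptions made here, not details taken from the thesis.

```python
from collections import Counter

def word_counts(text):
    """Frequency view: how often each word occurs, ignoring word order."""
    return Counter(text.lower().split())

def word_bigrams(text):
    """Sequence view: counts of adjacent word pairs, which keep local order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

a = "the dog bit the man"
b = "the man bit the dog"

# Both sentences contain the same words the same number of times...
print(word_counts(a) == word_counts(b))
# ...but their word order, and therefore their bigram counts, differ.
print(word_bigrams(a) == word_bigrams(b))
```

The two sentences are indistinguishable by word frequency alone but differ in their word sequences, which is the kind of information a frequency-only method cannot use.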
This thesis presents a new kernel method that operates with both word frequency
and word sequences. The method includes a new kernel model and a critical
vector learning algorithm that works on the model to perform the text document
classification task.
The proposed model, called the Weighted Kernel Model (WKM), represents a text
document by both its word-frequency and its word-sequence information, combined
through a weighting algorithm.
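Because the full text is not available, the exact form of the WKM and its weighting algorithm cannot be reproduced here. As a rough sketch of the general idea only, the Python code below combines a word-frequency kernel and a word-sequence kernel with a fixed weight. Everything in it is an assumption made for illustration: the function names, the cosine normalization, the use of word bigrams, and the single mixing parameter `alpha` are hypothetical and are not the thesis's method.

```python
import math
from collections import Counter

def tokens(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()

def dot(c1, c2):
    # Inner product of two sparse count vectors.
    return sum(v * c2[k] for k, v in c1.items())

def cosine(c1, c2):
    # Normalized inner product; 0.0 if either vector is empty.
    denom = math.sqrt(dot(c1, c1) * dot(c2, c2))
    return dot(c1, c2) / denom if denom else 0.0

def frequency_kernel(x, y):
    # Similarity based on word counts only (word order ignored).
    return cosine(Counter(tokens(x)), Counter(tokens(y)))

def sequence_kernel(x, y, n=2):
    # Similarity based on counts of contiguous word n-grams.
    def ngrams(text):
        toks = tokens(text)
        return Counter(zip(*(toks[i:] for i in range(n))))
    return cosine(ngrams(x), ngrams(y))

def combined_kernel(x, y, alpha=0.5):
    # Hypothetical weighting: a convex combination of the two kernels.
    # A non-negative weighted sum of valid kernels is itself a valid kernel.
    return alpha * frequency_kernel(x, y) + (1 - alpha) * sequence_kernel(x, y)

a = "the dog bit the man"
b = "the man bit the dog"
print(frequency_kernel(a, b), sequence_kernel(a, b), combined_kernel(a, b))
```

In this sketch, `alpha` near 1 makes the combined similarity depend mostly on word frequency, while `alpha` near 0 makes it depend mostly on word order. A function of this kind could be used to fill the precomputed Gram matrix expected by a kernel classifier.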
The motivation for the WKM is that text documents vary, notably in length.
Traditional research focused on large collections of documents in which each
document was itself large. The learning algorithm based on the proposed WKM
demonstrates its effectiveness, accuracy and computational efficiency for various
types of text documents, especially short and medium-length documents. The thesis
demonstrates this by applying the WKM to two different text data sets: the
Reuters news data set and the Enron email data set.
The thesis concludes with the strengths and limitations of the proposed method.