THESIS
1996
xiii, 102 leaves : ill. ; 30 cm
Abstract
Text categorization is the classification of unstructured text documents with respect to a set of one or more pre-defined categories. This task is often performed in automatic text indexing systems to assign subject categories to text documents. The benefit of text categorization is that once the documents are categorized, users can limit the scope of search by concentrating on a few categories relevant to their information needs.
In this thesis, we propose a text categorization model that uses an artificial neural network, trained by the Backpropagation learning algorithm, as the text classifier. Because the feature space of textual data is typically very high-dimensional, the model scales poorly if the neural network is trained directly on the raw data. To improve the scalability of the proposed model, we propose and compare four dimensionality reduction techniques that reduce the feature space to an input space of much lower dimension for the neural network classifier. The first three techniques are domain-dependent term selection methods, namely the DF method, the CF-DF method and the TFxIDF method. The fourth technique is a domain-independent feature extraction method based on a statistical multivariate data analysis technique called Principal Component Analysis.
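To make the idea of term selection concrete, the following is a minimal sketch of two of the methods named above. The abstract does not give the exact formulas, so this assumes the common readings: DF as document frequency (the number of documents containing a term) and TFxIDF as collection-wide term frequency multiplied by log(N / DF). The function names, the toy documents and the tie-breaking rule are illustrative assumptions, not the thesis's implementation.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def select_by_df(docs, k):
    """Keep the k terms with the highest document frequency (assumed DF method)."""
    df = document_frequency(docs)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(df, key=lambda t: (-df[t], t))[:k]


def select_by_tfidf(docs, k):
    """Keep the k terms with the highest TF x IDF score (assumed formula)."""
    df = document_frequency(docs)
    tf = Counter(term for doc in docs for term in doc)
    n = len(docs)
    score = {t: tf[t] * math.log(n / df[t]) for t in df}
    return sorted(score, key=lambda t: (-score[t], t))[:k]


def to_vector(doc, vocabulary):
    """Map a tokenized document to term counts over the selected vocabulary."""
    counts = Counter(doc)
    return [counts[t] for t in vocabulary]


# Hypothetical toy collection of tokenized documents.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "rise"],
    ["wheat", "export", "price"],
    ["oil", "oil", "barrel"],
]

vocabulary = select_by_df(docs, 2)          # reduced input space of size 2
vector = to_vector(docs[3], vocabulary)     # classifier input for one document
```

Under these assumptions the two methods rank terms differently: "oil" appears in three of the four documents, so DF selection ranks it first, while its low IDF gives it the lowest TFxIDF score in this toy collection. The reduced vectors, not the full vocabulary, would then be fed to the neural network.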
To test the effectiveness of the proposed model, experiments were conducted using a subset of the Reuters-22173 test collection for text categorization. The results showed that the proposed model was able to achieve high categorization effectiveness as measured by precision and recall. Among the four dimensionality reduction techniques proposed, Principal Component Analysis was found to be the most effective in reducing the dimensionality of the feature space.
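For readers unfamiliar with the two effectiveness measures mentioned above, here is a short sketch of how precision and recall are conventionally computed for a single category. The document identifiers are made up for illustration; the abstract does not state how the thesis averages these measures across categories, so that step is omitted.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one category.

    assigned: documents the classifier placed in the category.
    relevant: documents that truly belong to the category.
    """
    assigned, relevant = set(assigned), set(relevant)
    true_positives = len(assigned & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents assigned, 3 truly relevant, 2 in common.
precision, recall = precision_recall(
    assigned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
```

Here precision is 2/4 (half of the assigned documents are correct) and recall is 2/3 (two of the three relevant documents were found).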