It has been estimated that the amount of information in the world doubles every twenty months. The size and number of databases probably increase even faster. Today, corporate, governmental, and scientific communities are being overwhelmed with an influx of data increasingly available in semi-structured and textual forms from the World Wide Web or via an Intranet. Knowledge discovery, or data mining, is the emerging field that aims at analysing massive amounts of data and extracting meaningful and comprehensible patterns, called knowledge. This thesis examines the problems associated with knowledge discovery, focusing specifically on issues arising from the construction of a categorical classifier using distributed and textual data sources.
Textual documents, generally speaking, contain information that is richer than, and complementary to, purely numerical data. For example, numbers tell us only that a stock (index) moved up and by how much; they do not tell us why this happened, whereas the reasons can be found in text. Textual documents are thus a better resource for data mining, provided that techniques are available to fully exploit their content. Potentially, text mining can extract more useful and relevant knowledge than traditional numeric data mining techniques.
There are at least four major steps in any knowledge discovery process: information selection, data preprocessing, mining, and interpretation of the findings. These steps provide the organizational foundation for this thesis.
Selecting high-quality data sources and features is the first problem addressed in this study. The study introduces the notion of data quality. Given a set of sources and features, we define an information matrix. It is argued that a quantitative measure, which we call data quality, can be derived from the information matrix. Data quality is independent of any particular classifier and is a proxy for the ability to solve a classification problem using the pertinent data sources. Experimental results show that this notion of quality correlates positively with the corresponding solution accuracy.
In the data preprocessing stage, we investigate ways of transforming unstructured textual data into structured feature weightings. In contrast to the methods presented in the literature, our transformation methods are specifically tailored towards solving categorical prediction problems.
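To make the idea of turning text into feature weightings concrete, the following is a minimal sketch of a generic TF-IDF weighting of the kind found in the literature. It is not the tailored transformation proposed in this thesis, which is not specified here; the function name `tfidf_weights` and the whitespace tokenization are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Map each document to a dict of term -> TF-IDF weight.

    Generic baseline only (hypothetical helper, not the thesis's method):
    tf = term count / document length, idf = log(N / document frequency).
    """
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Count in how many documents each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights
```

Under this scheme a term that occurs in every document receives a weight of zero, since it does not help to distinguish one document from another.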
As far as mining is concerned, a common situation is that there are many data sources from which to mine knowledge. One possibility is to bring together all the data sources to form a single large database. This, however, is not only costly but also infeasible. The second possibility is to mine only a single data source and then to assume that the knowledge found also applies to all the other data sources. This, however, might bias the result. We therefore investigate a third way of mining knowledge. From each data source, only one rule is generated. The individual rules are then brought together to represent adequately the knowledge of the complete data.
Finally, we suggest and compare various ways of combining individual categorical predictions to come up with a consensus prediction. This allows distributed predictions, that is, the generation of one opinion from each data source and combination of all the opinions to form the final result.
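As an illustration of combining per-source opinions, the sketch below shows two common combination schemes: a plain majority vote and a weighted vote. These are generic examples and not necessarily among the schemes compared in this thesis; the function names, the labels "up" and "down", and the weights are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label seen first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    The weights are assumed to reflect how much each source is trusted,
    for example its accuracy on held-out data (an illustrative assumption).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have equal length")
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)
```

With a plain majority vote every source counts equally, whereas a weighted vote lets a single highly trusted source outweigh several weaker ones.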
Most of the experiments are carried out on the application of predicting the Hang Seng Index, Hong Kong's stock market index, based on financial news available from web sources such as the Wall Street Journal and CNN. In addition, some publicly available benchmark data sets are used to verify the proposed techniques.
In general, our discovery framework is applicable to a set of text documents, each describing a different instance or object. The task is then to classify these objects. For example, suppose that medical diagnoses of patients are given in the form of text; the patients then have to be categorized (e.g. as mentally healthy or not) according to the doctors' evaluation reports.