Data mining in finance (multiple features & leading stocks based prediction)
by Wong Pong Ching
M.Phil. Computer Science and Engineering
xvi, 87 p. : ill. ; 30 cm
With the rapid digitalization of mass media and the continuing evolution of computational modeling, applying data mining techniques to stock forecasting has become a prevailing topic explored by numerous computer scientists. At present, researchers focus on using more powerful modeling techniques to improve forecasting accuracy, but relatively few consider the importance of correct feature selection (the identification of effective factors).
In the financial market, whether a stock price rises or falls is mainly driven by the expectations of investors and speculators. Formulating the problem correctly by considering effective factors is superior to increasing model complexity. In the real financial world, most investment decisions rely on textual information (for instance, news articles) and human perceptions (trends in stock prices) to predict market price movements. Most of these features, especially textual information, are unstructured and fuzzy in nature and cannot be readily processed by computational models. Text mining methods also still face many technical challenges in the literature. In this thesis, we propose feasible solutions for processing and selecting useful features in our cutting-edge forecasting models. A support vector machine (SVM) based prediction system, built on a multiple feature based forecasting framework (MFF), has been developed to process news, stock trend components, technical indicators, and the Volatility Index (VIX) in order to improve prediction performance. In addition, to address the shortcomings of current approaches to text mining and trend component extraction, some new methods are discussed and verified.
Apart from processing and utilizing additional features, we also attempt to improve stock forecasting accuracy by studying the inter-relationships among multiple financial time series, which are traditionally believed to help investors optimize portfolios and speculators capture opportunities for statistical arbitrage (most likely, pair trading). Existing researchers advocate utilizing statistical correlations (co-movement in prices) between different stock entities as a factor mining method. However, high correlation in financial time series data does not imply causality. Therefore, we should reasonably address the latter rather than the former. Notwithstanding the foreseeable improvement from modeling causalities, relatively few works study them and thereby explore the potential of lagging effects to boost the accuracy of stock prediction. In this thesis, we propose a novel leading stock based prediction framework (LSPF), dedicated to mining leading stocks. In this study, a stock is defined as a leader if it precedes other stocks in rising or falling. In other words, the predictive power of a data model for the stocks that are led can be raised by considering these leading stocks as factors in the modeling process. LSPF tracks the leading and lagging relationships between stock entities by investigating three feasible leading stock mining models: the linear Granger causality test, the non-linear Granger causality test, and lagged correlation measurement. A leadership ranking approach is suggested to weigh the importance of the discovered leading and lagging stocks after the mining processes.
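As a rough illustration of the lagged correlation measurement named above, the following minimal sketch scores how strongly one return series precedes another. It is not the implementation used in the thesis: the function names (pearson, lagged_correlation, best_lag), the use of Pearson correlation, and the selection of the lag with the largest absolute score are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def lagged_correlation(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(leader, follower)
    # Drop the last `lag` leader values and the first `lag` follower values
    # so that each leader observation is paired with a later follower one.
    return pearson(leader[:-lag], follower[lag:])


def best_lag(leader, follower, max_lag):
    """Return (lag, score) with the largest absolute lagged correlation.

    Hypothetical selection rule: the thesis's own ranking approach may differ.
    """
    scores = {lag: lagged_correlation(leader, follower, lag)
              for lag in range(max_lag + 1)}
    lag = max(scores, key=lambda k: abs(scores[k]))
    return lag, scores[lag]
```

Calling best_lag(leader, follower, max_lag) on two return series gives the delay at which the candidate leader is most strongly associated with the follower. As the abstract itself cautions, a high lagged correlation is only a hint of a leading relationship and does not establish causality.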
In the studies of multiple features, our extensive experiments, using daily data for the Dow Jones constituent stocks on the New York Stock Exchange (NYSE), show that our approaches with additional features clearly outperform those using price and volume only. More importantly, the simulated trading is profitable (reaching over 200% annual return on several stock entities, compared with a Dow Jones Index performance of -25% over the same period) during the sub-prime mortgage crisis, which supports the effectiveness and robustness of our system in an economic downturn.
LSPF, in turn, is evaluated in terms of the accuracy gains it brings to different prediction models, including neural networks (NN) and support vector regression (SVR). Tested on high-frequency market microstructure data from the Hong Kong Stock Exchange (HKEX), LSPF proves robust in volatile stock markets and yields promising improvements in prediction accuracy, which confirms the presence and significance of leading stocks.