042 Time Series Basics with Pandas and Finance Data



Project Plan

Ho Ka Leung (3035106449)
Supervisor: Dr. Anthony Tam | Company: Microsoft Hong Kong Limited

Background
Making consistent profits in the securities market has long been a goal of investors around the world. To achieve it, investors hope for some “magic” that foresees every major change in the market so that they can place the corresponding trading orders in advance. Many mathematical and statistical models have been developed to predict the future of a single stock, or of the whole market, as accurately as possible.

With the recent rapid development and popularity of deep learning methods, more and more researchers and investors have started applying these techniques to stock market forecasting. Numerous studies have been done on similar topics. However, the common approach focuses on short-term change and profitability. This project takes a different approach by focusing on the long-term growth of a company.

This project aims to predict the market value of a new company or a new industry in 10 years based on current financial data, hoping to filter out short-term noise and identify the next unicorn worth investing in.

Objectives

Scope
This project will focus on 15-20 technology company stocks in the Hong Kong and US stock markets, including both successful and failed ones. By analyzing 2 to 3 years of financial data, a deep learning model will predict the market value in 10 years. The models and algorithms used are limited to those that have been shown to be effective in time series analysis.

The project must involve Microsoft Cognitive Toolkit (CNTK) and pandas (a Python data library); this is a project requirement from Microsoft.

Focuses
There will be two major focuses of this project: data and evaluation.

Data
The focus of the earlier phase is to understand financial data.
Different types of financial data will be studied to determine their roles in evaluating a company’s performance. Subsequently, the more useful metrics will be selected and processed to produce clean time series data that are ready for training. The data processing techniques will differ depending on the metric and the type of stock.

Evaluation
The focus of the later phase is to evaluate the performance of the neural network model. The difference between the network's predicted result and the actual result should be measured accurately to facilitate improvements to the network. The final performance of the network should also be used to demonstrate the feasibility of the proposed approach. Different evaluation methods should be studied thoroughly in order to choose the best one.

Deliverables
Based on the two focuses defined above, there will be two deliverables: a data processor and a functioning neural network model.

Data Processor
The data processor, the deliverable of the earlier stage, should accept raw input from a financial data API, process the data according to given instructions, and return time series data ready to be fed to the neural network model.

Deep Neural Network Model
The final product of the project is a well-trained deep neural network model that takes 2 to 3 years of pre-processed financial data of a given stock as input and predicts the market value in 10 years.

Expected Outcome
The final neural network is expected to achieve an accuracy of more than 90%.

Project Methodologies

Data

Source
Google Finance will be the major data source for this project. An API call in Google Spreadsheet will return an attribute, as an array of data, for the given stock code, time and frequency.
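As a rough illustration of how such an array might become time series data in pandas, the sketch below assumes each returned row holds a date string and a closing price. The column names, the helper `rows_to_series` and the sample values are hypothetical and are not taken from the plan.

```python
import pandas as pd


def rows_to_series(rows):
    """Turn [date_string, close] rows into a price series sorted by date."""
    frame = pd.DataFrame(rows, columns=["date", "close"])
    # Parse the date strings so the series gets a DatetimeIndex.
    frame["date"] = pd.to_datetime(frame["date"])
    return frame.set_index("date")["close"].sort_index()


# Hypothetical sample rows, deliberately out of order.
rows = [
    ["2017-10-02", 10.0],
    ["2017-10-04", 11.0],
    ["2017-10-03", 10.5],
]

prices = rows_to_series(rows)
# Day-over-day relative change; the first row has no predecessor and is dropped.
returns = prices.pct_change().dropna()
print(prices)
print(returns)
```

Sorting by date before computing changes matters here, because `pct_change` compares each value with the row directly above it.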
The data can then be exported to .csv for further processing.

Structure and Processing
Pandas, an open-source Python library providing data structures and data analysis tools, will be used to grab, parse, process and store the data.

Visualization
Matplotlib, a Python plotting library, will be used for data visualization to help present the final results of the model.

Deep Learning

Model
Long Short-Term Memory (LSTM), a recurrent neural network that uses a complex structure to resemble the human brain’s mechanism of remembering and forgetting information, has been shown to be suitable for classifying, processing and predicting time series.

Training
Backpropagation, a major training algorithm for neural networks, is chosen as the training algorithm. By calculating the error between the predicted output and the desired output, the algorithm adjusts the connection weights between nodes in consecutive layers. Backpropagation can thus minimize the error and allow the neural network to return the correct output. A genetic algorithm is another candidate training algorithm.

Framework
Microsoft Cognitive Toolkit (CNTK), developed by Microsoft, is the deep learning framework. It supports a wide range of deep learning algorithms, including LSTM and backpropagation.
It is also optimized for different computing architectures and takes advantage of the Azure environment.

Programming

Language
Python will be the programming language used in this project because it supports a functional programming style and has extensive machine learning support.

Environment
As CNTK is not natively supported on macOS, Microsoft recommended Azure Notebooks, an online Jupyter Notebook service that uses libraries instead of a file system.

Project Schedule and Milestones

Timeline

Activity                                            Date
Preliminary Research and Studying                   1-13 October 2017
Developing Data Processor                           14 October - 24 November 2017
Preparing Data Set                                  25 November 2017 - 8 December 2017
Constructing Neural Network                         9-22 December 2017
Testing Neural Network                              23-31 December 2017
First Training & Testing and First Presentation     1-20 January 2018
Deliverables of Phase 2                             21 January 2018
More Training & Testing                             22 January 2018 - 6 April 2018
Final Evaluation                                    7-14 April 2018
Deliverables of Phase 3                             15 April 2018
Final Presentation                                  16-20 April 2018
Project Exhibition                                  2 May 2018

Details

Preliminary Research and Studying
This activity is to fully understand the domains of financial markets, data analytics and machine learning.

Developing Data Processor
The development will be split into 3 two-week iterations, each delivering one working prototype. Different data structures and processing techniques will be explored in each iteration.

Preparing Data Set
This activity is to select stocks and prepare the data set using the constructed data processor.

Constructing Neural Network
This activity is to construct a workable LSTM neural network.

Testing Neural Network
This activity is to test the functionality of the neural network using simple data.

Training & Testing
This stage will be split into 5 iterations, each consisting of the sub-activities of training, verifying, studying, and fine-tuning the neural network.
Training
Use 60% of the dataset to train the neural network.

Verifying
Use 30% of the dataset to test the performance of the trained neural network.

Studying
Study the weaknesses and potential improvements of the neural network. This should be done concurrently with training and testing.

Fine-Tuning
Reconfigure the neural network based on the weaknesses and potential improvements identified.

Final Evaluation
Use 10% of the dataset to evaluate the performance of the final neural network. The results will be visualized and used to draw conclusions.

Milestones

Confirmation of Data Structure and Processing Details
The data structure and processing details will greatly affect the design of the neural network. They should be finalized at the start of the third iteration of developing the data processor, ready for final implementation.

Dataset Ready
The dataset is critical to machine learning. Thus, completing the processing of raw data into a dataset marks a major milestone of the project.

Production of Neural Network
This marks the end of coding work. The focus can then shift to studying the problem.

First Full Run
After the neural network has undergone one round of training and testing, optimization work can start.

Reaching 68% Accuracy
While the desired accuracy for the neural network is 90%, an accuracy of 68%, a threshold taken from the share of a normal distribution lying within 1 standard deviation of the mean, will be considered a minor success.
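The 60% / 30% / 10% division of the dataset described under Training & Testing and Final Evaluation could be sketched in pandas as follows. This is a minimal sketch under one assumption: the dataset is a single time-ordered table that is split chronologically without shuffling. The plan does not say whether the split is made by time or by stock, and the helper `chronological_split` and the sample data are hypothetical.

```python
import pandas as pd


def chronological_split(frame, train_pct=60, verify_pct=30):
    """Split a time-ordered frame into training, verifying and evaluation parts.

    Rows are not shuffled, so later observations never leak into earlier sets.
    The evaluation part receives whatever remains after the first two parts.
    """
    n = len(frame)
    # Integer arithmetic avoids floating point rounding on the boundaries.
    train_end = n * train_pct // 100
    verify_end = train_end + n * verify_pct // 100
    return (
        frame.iloc[:train_end],
        frame.iloc[train_end:verify_end],
        frame.iloc[verify_end:],
    )


# Hypothetical sample data: 100 consecutive daily observations.
dates = pd.date_range("2017-01-01", periods=100, freq="D")
data = pd.DataFrame({"close": range(100)}, index=dates)

train_set, verify_set, eval_set = chronological_split(data)
print(len(train_set), len(verify_set), len(eval_set))
```

Keeping the three parts in chronological order is one common choice for time series, since a random split would let the network train on observations that come after the ones it is later tested on.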