Www.leephd.com



Introduction to Data Science I
Exploratory Data Analysis
Kihwan Lee
Due by 2017/7/29

Table of Contents
Chapter 1. Data Set Definition
Chapter 2. Problem Definition
Chapter 3. Discussion and Future Work
Appendix A. Python Script

Chapter 1: Data Set Definition

The data sets were obtained from Quandl, a major source of financial, economic, and alternative datasets that serves investment professionals. Quandl was originally introduced in class as part of an exercise that introduced students to Python scripting in Data Science. After spending some time on it, it became clear that Quandl provides a wide variety of data in DataFrame format, accessible through several popular languages and formats such as Python, R, and JSON. After exploring data acquisition through Python and considering the time limitation, three categories of data were chosen as baseline exploratory data sets:

National Unemployment Rate in the U.S. Provides a relative reference point for the current economic state.
Stock
    Apple
    Google
    J. P. Morgan
Foreign Exchange Rate
    Korea

The data were obtained from the Quandl website at weekly frequency for dates ranging from 1/1/1989 to 12/31/2017. One difficulty in obtaining usable data lies in the frequency of the available data. The National Unemployment Rate had the smallest number of available data points while the Apple stock price had the largest, as shown in Table 1, which gives detailed statistics for the three data sets. To obtain the correlation between sets, a JOIN operation was applied to the relevant sets so that only data on matching dates were retained. Normalization was applied to all data sets in the correlation study to avoid numerical error arising from their different scales. Additional stock prices were obtained for Google and J. P. Morgan for the purpose of investigating their trends against the Bollinger Band.
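The JOIN and normalization steps described above can be sketched on made-up data. This is only an illustration under stated assumptions: the two frames below are hypothetical stand-ins (a 52-point weekly series and a 4-point sparser series), not the Quandl data, and only the column names 'Adj. Close' and 'Value' are taken from the script in Appendix A.

```python
import pandas as pd

# Hypothetical weekly series (52 points) standing in for a stock price.
weekly_index = pd.date_range("2017-01-01", periods=52, freq="W-SUN")
stock = pd.DataFrame({"Adj. Close": range(100, 152)}, index=weekly_index)

# Hypothetical sparser series: one value every 13th week (4 points).
rate = pd.DataFrame({"Value": [4.7, 4.6, 4.5, 4.4]}, index=weekly_index[::13])

# An inner join keeps only the dates that appear in both frames.
joined = stock.join(rate, how="inner")

# Divide each column by its own maximum so that both peak at 1.0.
normalized = joined / joined.max()

print(len(stock), len(rate), len(joined))
print(normalized)
```

The joined frame can never have more rows than the sparser of the two inputs, which is why the data set with the fewest points limits the size of the common set.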
Chapter 2: Problem Definition

The purpose of this project was to increase familiarity with Python scripting as applied in Data Science analysis. As such, emphasis was placed on getting acquainted with the following:

Installation of a Python development environment
Learning how to use DataFrame in Pandas
Learning how to use Matplotlib for plotting
Learning how to obtain statistical data from the data
Learning where to find credible data, how to obtain them, and how to process them
Learning how to conduct exploratory data analysis

Python was installed through Anaconda, a suite of development tools tailored for scientific development, including R and Python. Among the available development platforms, the Scientific PYthon Development EnviRonment (Spyder) was used. NumPy, SciPy, and Matplotlib were already available with the Anaconda installation. Pandas provided a convenient scripting interface that makes complex array and hash operations relatively easy for the average user. Data are presented in the DataFrame class, which offers a number of pre-defined functions for array operations as well as statistical calculations. For graphics, Matplotlib was used. Beyond basic plotting, it offers histograms and the plots used here to visualize correlation, which are basic tools for exploratory data analysis. Matplotlib works similarly to the plot function in MATLAB, a popular platform for scientific development, which makes it easier to adopt for users who are already familiar with MATLAB.

For the data source, Quandl was chosen, since it offers convenient methods to access a variety of data using Python and R, among other ways. With the available data, exploratory data analysis was conducted. Looking at the statistics (minimum, maximum, mean, median, and standard deviation) was the first step toward an overall picture of the data, as shown in Table 1. Histograms were plotted for additional insight into the data, as shown in Figures 1 to 3.
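As a minimal sketch of this first step, the summary statistics and histogram bin counts can be computed as follows. The six values are invented for illustration and are not the Quandl data. The bin counts come from numpy.histogram, which is the routine Matplotlib's plt.hist relies on for binning.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, for illustration only.
values = pd.Series([4.7, 4.8, 5.0, 5.0, 5.2, 5.9], name="Value")

# Summary statistics of the kind reported in Table 1.
# Note that pandas computes the sample standard deviation (ddof=1).
summary = values.agg(["min", "max", "mean", "median", "std", "count"])
print(summary.round(2))

# Histogram counts over 3 equal-width bins between the min and the max.
counts, edges = np.histogram(values, bins=3)
print(counts, edges)
```

Printing the counts before plotting is a quick way to check that the chosen number of bins is reasonable for the size of the data set.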
Since the data size differed between sets, a common set of data was pulled by using a JOIN operation between the relevant data sets. With the common data set, correlation was computed, giving further insight into the relationship between the chosen data sets. The National Unemployment Rate was chosen as the baseline data set against which the Apple stock price and the Foreign Exchange Rate with South Korea were correlated, as shown in Table 2 and Figures 4 and 5. The correlation result could help investors in future decision making. Additional stock prices were pulled for Google and J. P. Morgan. The stock prices for the three major companies were plotted against the Bollinger Band to examine their behavior relative to the upper and lower bands, as shown in Figures 6 to 8. The comparison could offer historical evidence on how well a given stock price follows the principle of the Bollinger Band.

Chapter 3: Discussion and Future Work

Data statistics are shown in Table 1. The first thing to note is the variation in the number of data points between sets (116, 1513, and 348). Further processing was required to put them in a format suitable for data analysis.

Table 1. Data Statistics

Data Statistics             Unit         Min    Max     Mean    Median  Standard Deviation  Data Set Number
National Unemployment Rate  %            4.7    5.9     5.1     5.0     0.31                116
Apple                       U.S. dollar  13.1   700.1   112.0   55.0    147.23              1513
FX, Korea                   Korean Won   669.2  1707.3  1030.9  1084.5  202.0               348

Histograms for the National Unemployment Rate (NUR), the Apple stock closing price, and the Foreign Exchange rate with South Korea are shown in Figures 1 to 3.

Figure 1. Histogram of National Unemployment Rate
Figure 2. Histogram of Apple Stock Closing Price
Figure 3. Histogram of Foreign Exchange Rate with South Korea

To proceed with correlation, a common set of data was obtained through a JOIN operation on each pair of data sets. Because their scales differ by two orders of magnitude, both sets were normalized to enhance numerical accuracy.
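The correlation step can be sketched as follows, with the Pearson coefficient also computed by hand from its definition. The six data points are invented for illustration. The sketch also checks a general property of the Pearson coefficient: rescaling a column by a positive constant does not change it, so dividing by the column maximum matters mostly when both series are drawn on one axis.

```python
import numpy as np
import pandas as pd

# Hypothetical joined data: one falling series and one rising series.
frame = pd.DataFrame({
    "Value": [5.9, 5.6, 5.3, 5.0, 4.8, 4.7],
    "Adj. Close": [20.0, 35.0, 60.0, 90.0, 150.0, 170.0],
})

# Pearson correlation on the raw columns.
raw_corr = frame.corr(method="pearson").loc["Value", "Adj. Close"]

# Same calculation after dividing each column by its maximum.
scaled = frame / frame.max()
scaled_corr = scaled.corr(method="pearson").loc["Value", "Adj. Close"]

# Pearson coefficient from its definition: the sum of products of the
# deviations, divided by the square root of the product of the sums of
# squared deviations.
dx = frame["Value"] - frame["Value"].mean()
dy = frame["Adj. Close"] - frame["Adj. Close"].mean()
manual_corr = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(raw_corr, scaled_corr, manual_corr)
```

All three printed values agree to within floating-point rounding, and the sign is negative because one series falls while the other rises.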
The correlation between the National Unemployment Rate (NUR) and the Apple stock closing price shows a mild negative dependence (-0.50), which is plausible. The correlation between the NUR and the Foreign Exchange rate with South Korea shows a stronger negative dependence (-0.73).

Table 2. Correlation Data

                                        National Unemployment Rate
Apple Stock Closing Price               -0.50
Foreign Exchange Rate with South Korea  -0.73

Correlation plots are shown in Figures 4 and 5. The Apple stock price shows a strong bias in the second half of the time span.

Figure 4. Correlation Plot Between National Unemployment Rate and Apple Stock
Figure 5. Correlation Plot Between National Unemployment Rate and FX Rate with South Korea

Stock prices for Apple, Google, and J. P. Morgan are plotted against the Bollinger Band in Figures 6 to 8. They all fall reasonably well within the upper and lower bands. The strong performance of J. P. Morgan in recent years is noteworthy.

Figure 6. Apple Stock Price within Bollinger Band
Figure 7. Google Stock Price within Bollinger Band
Figure 8. J. P. Morgan Stock Price within Bollinger Band

The exploratory analysis shows that the Apple stock price and the Foreign Exchange rate each have a negative dependence on the NUR, while the stock prices generally lie within the Bollinger Band. It would be interesting to run additional correlations on more sets of data, such as house price increase, minimum income, and GDP. Applying classification and regression to the available data sets might produce useful results as well.

Appendix A: Python Script

"""
Created on Sat Jul 7 00:59:44 2018
@author: Kihwan Lee
Copyright © 2018 by Kihwan Lee.
All Rights Reserved.
"""
import quandl
import matplotlib.pyplot as plt
import pandas as pd
# import numpy as np

# Quandl API key: replace the placeholder with your own key.
quandl.ApiConfig.api_key = "YOUR_QUANDL_API_KEY"

START_DATE = "1989-01-01"
END_DATE = "2017-12-31"

apple = quandl.get("WIKI/AAPL", start_date=START_DATE, end_date=END_DATE, collapse="weekly")
NUR = quandl.get("FRED/NROU", start_date=START_DATE, end_date=END_DATE, collapse="weekly")
FX1 = quandl.get("FRED/EXKOUS", start_date=START_DATE, end_date=END_DATE, collapse="weekly")

# Note: despite its name, apple_vol holds the adjusted closing price, not volume.
apple_vol = pd.DataFrame(apple['Adj. Close'])
apple_cp = pd.DataFrame(apple['Close'])
nur_value = pd.DataFrame(NUR['Value'])
fx1_value = pd.DataFrame(FX1['Value'])

# ======================================================================
# 1) Data summary
def print_stats(title, series):
    """Print min, max, mean, median, and standard deviation of a Series."""
    print(" " + title)
    print(" Min = %.4f, Max = %.4f, Mean = %.4f, Median = %.4f, Standard Deviation = %.4f" % (
        series.min(), series.max(), series.mean(), series.median(), series.std()))

# Passing a single column (a Series) yields scalars that %-formatting accepts.
print_stats("National Unemployment Rate Statistics", NUR['Value'])
print_stats("Apple Stock Closing Price Statistics", apple['Close'])
print_stats("Foreign Exchange Rate with South Korea", FX1['Value'])

# ======================================================================
# 2) Histogram
def plot_histogram(values, title, xlabel, bins=20):
    """Plot a histogram of the given values."""
    plt.figure(figsize=(7, 5))
    plt.hist(values, bins=bins)
    plt.title(title)
    plt.ylabel('Number')
    plt.xlabel(xlabel)
    plt.grid()
    plt.show()

plot_histogram(nur_value.values, 'Histogram of National Unemployment Rate, Bin = 20', 'Percent (%)')
plot_histogram(apple_cp.values, 'Histogram of Apple Stock Closing Price, Bin = 20', 'U.S. Dollar ($)')
plot_histogram(fx1_value.values, 'Histogram of Foreign Exchange Rate with South Korea, Bin = 20', 'Korean Won/U.S. Dollar')

# ======================================================================
# 3) Correlation between NUR and Apple Stock
def plot_pair(frame, first, second, labels, title):
    """Plot two normalized columns of a joined frame against the date index."""
    plt.figure(figsize=(15, 10))
    plt.plot(frame[first], 'bo')
    plt.plot(frame[second], 'r+')
    plt.legend(labels)
    plt.title(title)
    plt.xlabel('Year')
    plt.grid()
    plt.show()

# The inner join keeps only dates present in both sets; each column is then
# divided by its own maximum.
join1_apple_nur = apple_vol.join(nur_value, how='inner')
join1_apple_nur = join1_apple_nur / join1_apple_nur.max()
join1_corr = join1_apple_nur.corr(method='pearson')
print(join1_corr)
plot_pair(join1_apple_nur, 'Adj. Close', 'Value',
          ['Apple Stock Closing Price', 'National Unemployment Rate'],
          'Correlation Between Apple Stock Closing Price And National Unemployment Rate')

# ======================================================================
# 4) Correlation between NUR and FX Rate with South Korea
# Rename the FX column so that it does not clash with the NUR 'Value' column.
fx1_value.columns = ['FX1Value']
join2_fx1_nur = fx1_value.join(nur_value, how='inner')
join2_fx1_nur = join2_fx1_nur / join2_fx1_nur.max()
join2_corr = join2_fx1_nur.corr(method='pearson')
print(join2_corr)
plot_pair(join2_fx1_nur, 'FX1Value', 'Value',
          ['FX Rate with S. Korea', 'National Unemployment Rate'],
          'Correlation Bet. National Unemployment Rate And S. Korea FX Rate')

# ======================================================================
# 5) Obtain rolling average and plot against Bollinger band
def plot_bollinger(code, title, start_date, collapse, window=20, num_std=2.0):
    """Plot the adjusted close of a Quandl dataset against its Bollinger Band."""
    data = quandl.get(code, start_date=start_date, end_date=END_DATE, collapse=collapse)
    price = pd.DataFrame(data['Adj. Close'])
    roll_std = price.rolling(window=window).std()
    roll_mean = price.rolling(window=window).mean()
    bollinger_ub = roll_mean + roll_std * num_std
    bollinger_lb = roll_mean - roll_std * num_std
    plt.figure(figsize=(15, 10))
    plt.plot(bollinger_ub, 'b-')
    plt.plot(roll_mean, 'k-')
    plt.plot(price, 'mo', markersize=2.0)
    plt.plot(bollinger_lb, 'r-')
    plt.legend(['Bollinger UB', 'Mean', 'Data', 'Bollinger LB'])
    plt.title(title)
    plt.xlabel('Year')
    plt.grid()
    plt.show()

# Apple: daily data for 2016-2017, so the window spans 20 trading days.
plot_bollinger("WIKI/AAPL", 'Apple Stock Closing Price Within Bollinger Band', "2016-01-01", "daily")
# Google and J. P. Morgan: weekly data for 1989-2017, so the window spans 20 weeks.
plot_bollinger("WIKI/GOOGL", 'Google Stock Closing Price Within Bollinger Band', START_DATE, "weekly")
plot_bollinger("WIKI/JPM", 'J.P. Morgan Stock Closing Price Within Bollinger Band', START_DATE, "weekly")
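The Bollinger Band calculation in step 5 of the script can also be tried without a Quandl connection. The sketch below is an illustration under stated assumptions: the price series is a synthetic random walk rather than real stock data, and the helper name bollinger_bands is hypothetical. The 20-point window and the factor of 2.0 standard deviations are the values used in the script.

```python
import numpy as np
import pandas as pd

def bollinger_bands(prices, window=20, num_std=2.0):
    """Return the rolling mean and the upper and lower Bollinger bands."""
    mean = prices.rolling(window=window).mean()
    std = prices.rolling(window=window).std()
    return pd.DataFrame({
        "mean": mean,
        "upper": mean + num_std * std,
        "lower": mean - num_std * std,
    })

# Synthetic random-walk prices on business days (not real stock data).
rng = np.random.default_rng(0)
index = pd.bdate_range("2016-01-04", periods=250)
prices = pd.Series(100 + rng.normal(0.0, 1.0, size=250).cumsum(), index=index)

bands = bollinger_bands(prices)

# The first window - 1 rows are NaN because the window is not yet full.
valid = bands.dropna()

# Fraction of prices that lie between the lower and upper bands.
inside = prices.loc[valid.index].between(valid["lower"], valid["upper"])
print(f"{inside.mean():.1%} of points lie within the bands")
```

Counting the fraction of points inside the bands gives a number to go with the visual impression from Figures 6 to 8 that the prices fall reasonably well within the bands.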