Python - Data Science Tutorial

Data is the new oil. This statement shows how every modern IT system is driven by capturing, storing and analysing data for various needs, be it making business decisions, forecasting weather, studying protein structures in biology or designing a marketing campaign. All of these scenarios involve a multidisciplinary approach of using mathematical models, statistics, graphs, databases and, of course, the business or scientific logic behind the data analysis. So we need a programming language which can cater to all these diverse needs of data science. Python shines as one such language, as it has numerous libraries and built-in features which make it easy to tackle the needs of data science. In this tutorial we will cover the various techniques used in data science using the Python programming language.

Audience

This tutorial is designed for Computer Science graduates as well as software professionals who are willing to learn data science in simple and easy steps using Python as a programming language.

Prerequisites

Before proceeding with this tutorial, you should have a basic knowledge of writing code in the Python programming language, using a Python IDE, and executing Python programs. If you are completely new to Python, please refer to our Python tutorial first to get a sound understanding of the language.

Execute Python Programs

Try the following example:

#!/usr/bin/python
print "Hello, Python!"

Python - Data Science Introduction

Data science is the process of deriving knowledge and insights from a huge and diverse set of data through organizing, processing and analysing the data. It involves many different disciplines like mathematical and statistical modelling, extracting data from its source and applying data visualization techniques. Often it also involves handling big data technologies to gather both structured and unstructured data. Below we will see some example scenarios where data science is used.

Recommendation Systems

As online shopping becomes more prevalent, e-commerce platforms are able to capture users' shopping preferences as well as the performance of various products in the market. This leads to the creation of recommendation systems, which build models predicting what the shopper needs and show the products the shopper is most likely to buy.

Financial Risk Management

The financial risk involved in loans and credits is better analysed by using the customer's past spending habits, past defaults, other financial commitments and many socio-economic indicators. These data are gathered from various sources in different formats. Organising them together and getting insight into the customer's profile needs the help of data science. The outcome is minimized loss for the financial organization by avoiding bad debt.

Improvement in Health Care Services

The health care industry deals with a variety of data which can be classified into technical data, financial data, patient information, drug information and legal rules. All this data needs to be analysed in a coordinated manner to produce insights that save cost both for the health care provider and the care receiver, while remaining legally compliant.

Computer Vision

The advancement in recognizing an image by a computer involves processing large sets of image data from multiple objects of the same category, for example, face recognition. These data sets are modelled, and algorithms are created to apply the model to newer images to get a satisfactory result. Processing these huge data sets and creating the models needs various tools used in data science.
Efficient Management of Energy

As the demand for energy soars, energy-producing companies need to manage the various phases of energy production and distribution more efficiently. This involves optimizing the production methods, the storage and distribution mechanisms, as well as studying the customers' consumption patterns. Linking the data from all these sources and deriving insight seems a daunting task. This is made easier by using the tools of data science.

Python in Data Science

The programming requirements of data science demand a very versatile yet flexible language, one in which code is simple to write but which can handle highly complex mathematical processing. Python is most suited for such requirements, as it has already established itself both as a language for general computing and for scientific computing. Moreover, it is continuously being upgraded in the form of new additions to its plethora of libraries aimed at different programming requirements. Below are the features of Python which make it the preferred language for data science.

- A simple and easy-to-learn language which achieves results in fewer lines of code than other similar languages like R. Its simplicity also makes it robust enough to handle complex scenarios with minimal code and much less confusion about the general flow of the program.
- It is cross-platform, so the same code works in multiple environments without needing any change. That makes it perfect for a multi-environment setup.
- It executes faster than other similar languages used for data analysis, like R and MATLAB.
- Its excellent memory management capability, especially garbage collection, makes it versatile in gracefully managing very large volumes of data transformation, slicing, dicing and visualization.
- Most importantly, Python has a very large collection of libraries which serve as special-purpose analysis tools. For example, the NumPy package deals with scientific computing, and its array needs much less memory than the conventional Python list for managing numeric data (see the sketch after this list). And the number of such packages is continuously growing.
- Python has packages which can directly use code from other languages like Java or C. This helps in optimizing code performance by reusing existing code from other languages, whenever that gives a better result.
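The memory claim above can be checked directly. Here is a minimal, illustrative sketch (our own example, assuming a standard NumPy installation); the exact byte counts will vary by platform and Python version, but the array's buffer is consistently much smaller than the list plus the integer objects it references.

import sys
import numpy as np

values = list(range(1000))
arr = np.arange(1000)

# Size of the list object itself plus the int objects it references
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# Size of the array's raw data buffer
array_bytes = arr.nbytes

print("list: %d bytes, array: %d bytes" % (list_bytes, array_bytes))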
In the subsequent chapters we will see how we can leverage these features of Python to accomplish all the tasks needed in the different areas of data science.

Python - Data Science Environment Setup

To successfully create and run the example code in this tutorial, we need an environment which has both general-purpose Python and the special packages required for data science. We will first look at installing general-purpose Python, which can be Python 2 or Python 3. We will prefer Python 2 for this tutorial, mainly because of its maturity and the wider support of external packages.

Getting Python

The most up-to-date source code, binaries, documentation and news are available on the official website of Python, www.python.org. Python documentation can be downloaded from www.python.org/doc/ and is available in HTML, PDF and PostScript formats.

Installing Python

Python distributions are available for a wide variety of platforms. You need to download only the binary code applicable to your platform and install Python. If the binary code for your platform is not available, you need a C compiler to compile the source code manually. Compiling the source code offers more flexibility in the choice of features that you require in your installation. Here is a quick overview of installing Python on various platforms.

Unix and Linux Installation

Here are the simple steps to install Python on a Unix/Linux machine.

- Open a web browser and go to www.python.org/downloads/.
- Follow the link to download the zipped source code available for Unix/Linux.
- Download and extract the files.
- Edit the Modules/Setup file if you want to customize some options.
- Run the ./configure script.
- make
- make install

This installs Python at the standard location /usr/local/bin and its libraries at /usr/local/lib/pythonXX, where XX is the version of Python.

Windows Installation

Here are the steps to install Python on a Windows machine.

- Open a web browser and go to www.python.org/downloads/.
- Follow the link for the Windows installer python-XYZ.msi file, where XYZ is the version you need to install.
- To use this installer, the Windows system must support Microsoft Installer 2.0. Save the installer file to your local machine and then run it to find out if your machine supports MSI.
- Run the downloaded file. This brings up the Python install wizard, which is really easy to use. Just accept the default settings, wait until the install is finished, and you are done.

Macintosh Installation

Recent Macs come with Python installed, but it may be several years out of date. See the Mac section of the official Python download page for instructions on getting the current version, along with extra tools to support development on the Mac. For older Mac OS versions before Mac OS X 10.3 (released in 2003), MacPython is available. Jack Jansen maintains it, and complete installation details and documentation are available on his website.

Setting up PATH

Programs and other executable files can be in many directories, so operating systems provide a search path that lists the directories the OS searches for executables. The path is stored in an environment variable, which is a named string maintained by the operating system. This variable contains information available to the command shell and other programs. The path variable is named PATH in Unix and Path in Windows (Unix is case-sensitive; Windows is not). In Mac OS, the installer handles the path details. To invoke the Python interpreter from any particular directory, you must add the Python directory to your path.

Setting Path at Unix/Linux

To add the Python directory to the path for a particular session in Unix:

- In the csh shell: type setenv PATH "$PATH:/usr/local/bin/python" and press Enter.
- In the bash shell (Linux): type export PATH="$PATH:/usr/local/bin/python" and press Enter.
- In the sh or ksh shell: type PATH="$PATH:/usr/local/bin/python" and press Enter.

Note: /usr/local/bin/python is the path of the Python directory.

Setting Path at Windows

To add the Python directory to the path for a particular session in Windows, at the command prompt type path %path%;C:\Python and press Enter.

Note: C:\Python is the path of the Python directory.

Python Environment Variables

Here are the important environment variables recognized by Python:
- PYTHONPATH: Has a role similar to PATH. This variable tells the Python interpreter where to locate the module files imported into a program. It should include the Python source library directory and the directories containing Python source code. PYTHONPATH is sometimes preset by the Python installer.
- PYTHONSTARTUP: Contains the path of an initialization file containing Python source code. It is executed every time you start the interpreter. It is named .pythonrc.py in Unix and contains commands that load utilities or modify PYTHONPATH.
- PYTHONCASEOK: Used on Windows to instruct Python to find the first case-insensitive match in an import statement. Set this variable to any value to activate it.
- PYTHONHOME: An alternative module search path. It is usually embedded in the PYTHONSTARTUP or PYTHONPATH directories to make switching module libraries easy.

Running Python

There are three different ways to start Python.

Interactive Interpreter

You can start Python from Unix, DOS, or any other system that provides a command-line interpreter or shell window. Enter python at the command line and start coding right away in the interactive interpreter.

$python             # Unix/Linux
or
python%             # Unix/Linux
or
C:> python          # Windows/DOS

Here is the list of all the available command-line options:

- -d: provides debug output.
- -O: generates optimized bytecode (resulting in .pyo files).
- -S: does not run import site to look for Python paths on startup.
- -v: verbose output (detailed trace on import statements).
- -X: disables class-based built-in exceptions (just use strings); obsolete starting with version 1.6.
- -c cmd: runs the Python script passed in as the cmd string.
- file: runs the Python script from the given file.

Script from the Command Line

A Python script can be executed at the command line by invoking the interpreter on your application, as in the following:

$python script.py       # Unix/Linux
or
python% script.py       # Unix/Linux
or
C:> python script.py    # Windows/DOS

Note: Be sure the file permission mode allows execution.

Integrated Development Environment

You can run Python from a Graphical User Interface (GUI) environment as well, if you have a GUI application on your system that supports Python.

- Unix: IDLE is the very first Unix IDE for Python.
- Windows: PythonWin is the first Windows interface for Python and is an IDE with a GUI.
- Macintosh: The Macintosh version of Python along with the IDLE IDE is available from the main website, downloadable as either MacBinary or BinHex'd files.

Installing the SciPy Pack

The best way to enable the required packages is to use an installable binary package specific to your operating system. These binaries contain the full SciPy stack (inclusive of NumPy, SciPy, matplotlib, IPython, SymPy and nose packages along with core Python).

Windows

- Anaconda (from continuum.io) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.
- Canopy, from Enthought, is available as both a free and a commercial distribution with the full SciPy stack for Windows, Linux and Mac.
- Python (x,y) is a free Python distribution with the SciPy stack and the Spyder IDE for Windows (downloadable from python-xy.github.io).
Linux

Package managers of the respective Linux distributions are used to install one or more packages in the SciPy stack.

For Ubuntu:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

For Fedora:

sudo yum install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel

Building from Source

Core Python (2.6.x, 2.7.x and 3.2.x onwards) must be installed with distutils, and the zlib module should be enabled. The GNU gcc (4.2 and above) C compiler must be available. To install NumPy, run the following command:

python setup.py install

To test whether the NumPy module is properly installed, try to import it from the Python prompt. If it is not installed, the following error message will be displayed.

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import numpy
ImportError: No module named 'numpy'

Similarly, we can check the installation of all the required data science packages shown in the next chapters.

Python - Pandas

Pandas is an open-source Python library used for high-performance data manipulation and data analysis using its powerful data structures. Python with Pandas is in use in a variety of academic and commercial domains, including finance, economics, statistics, advertising, web analytics, and more. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, organize, manipulate, model, and analyse. Below are some of the important features of Pandas which are used specifically for data processing and data analysis work.

Key Features of Pandas

- Fast and efficient DataFrame object with default and customized indexing.
- Tools for loading data into in-memory data objects from different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns can be deleted from or inserted into a data structure.
- Group-by operations on data for aggregation and transformations.
- High-performance merging and joining of data.
- Time series functionality.

Pandas deals with the following three data structures:

- Series
- DataFrame
- Panel

These data structures are built on top of the NumPy array, making them fast and efficient.

Dimension & Description

The best way to think of these data structures is that the higher-dimensional data structure is a container of its lower-dimensional data structure. For example, a DataFrame is a container of Series, and a Panel is a container of DataFrames.

- Series: 1 dimension. A 1D labeled homogeneous array; size-immutable.
- DataFrame: 2 dimensions. A general 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.

The DataFrame is the most widely used and most important of these data structures.

Series

A Series is a one-dimensional, array-like structure with homogeneous data. For example, the following series is a collection of integers: 10, 23, 56, 17, 52, 61, 73, 90, 26, 72.

Key Points of Series

- Homogeneous data
- Size immutable
- Values of data mutable
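As a quick illustration of these points, the minimal sketch below (our own example, not part of the original series description) builds the series shown above and prints its default integer index and its single dtype:

import pandas as pd

s = pd.Series([10, 23, 56, 17, 52, 61, 73, 90, 26, 72])
print(s)        # values shown with their default integer index labels
print(s.dtype)  # one dtype for the whole series (homogeneous data)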
DataFrame

A DataFrame is a two-dimensional structure with heterogeneous data. For example:

Name    Age   Gender   Rating
Steve   32    Male     3.45
Lia     28    Female   4.6
Vin     45    Male     3.9
Katie   38    Female   2.78

The table represents the data of a sales team of an organization with their overall performance rating. The data is represented in rows and columns. Each column represents an attribute and each row represents a person.

Data Type of Columns

The data types of the four columns are as follows:

Column   Type
Name     String
Age      Integer
Gender   String
Rating   Float

Key Points of DataFrame

- Heterogeneous data
- Size mutable
- Data mutable

We will see lots of examples of using the Pandas library of Python in data science work in the next chapters.

Python - NumPy

NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

Operations using NumPy

Using NumPy, a developer can perform the following operations:

- Mathematical and logical operations on arrays.
- Fourier transforms and routines for shape manipulation.
- Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.

NumPy - A Replacement for MATLAB

NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing. The Python alternative to MATLAB is now seen as a more modern and complete programming language. Being open source is an added advantage of NumPy.

ndarray Object

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type, which can be accessed using a zero-based index. Every item in an ndarray takes the same size block of memory, and each element is an object of a data-type object (called dtype). Any item extracted from an ndarray object (by slicing) is represented by a Python object of one of the array scalar types. We will see lots of examples of using the NumPy library of Python in data science work in the next chapters.

Python - SciPy

The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems, are quick to install and are free of charge. NumPy and SciPy are easy to use, yet powerful enough to be depended on by some of the world's leading scientists and engineers.

SciPy Sub-packages

SciPy is organized into sub-packages covering different scientific computing domains. These are summarized in the following table:

scipy.constants     Physical and mathematical constants
scipy.fftpack       Fourier transform
scipy.integrate     Integration routines
scipy.interpolate   Interpolation
scipy.io            Data input and output
scipy.linalg        Linear algebra routines
scipy.optimize      Optimization
scipy.signal        Signal processing
scipy.sparse        Sparse matrices
scipy.spatial       Spatial data structures and algorithms
scipy.special       Special mathematical functions
scipy.stats         Statistics

Data Structure

The basic data structure used by SciPy is the multidimensional array provided by the NumPy module. NumPy provides some functions for linear algebra, Fourier transforms and random number generation, but not with the generality of the equivalent functions in SciPy. We will see lots of examples of using the SciPy library of Python in data science work in the next chapters.
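To make the sub-package idea concrete, here is a minimal sketch (our own example) using scipy.integrate to evaluate a definite integral numerically:

import scipy.integrate as integrate

# Integrate x^2 from 0 to 1; quad returns the value and an error estimate
result, error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)
print(result)   # approximately 0.3333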
Python - Matplotlib

Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc. It supports a wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, etc. It is used along with NumPy to provide an environment that is an effective open-source alternative to MATLAB. It can also be used with graphics toolkits like PyQt and wxPython.

Conventionally, the package is imported into a Python script by adding the following statement:

from matplotlib import pyplot as plt

Matplotlib Example

The following script produces a sine wave plot using Matplotlib.

Example

import numpy as np
import matplotlib.pyplot as plt

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)
plt.title("sine wave form")

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()

Its output is the sine wave plot drawn in a graph window. We will see lots of examples of using the Matplotlib library of Python in data science work in the next chapters.

Python - Data Operations

Python handles data of various formats mainly through two libraries, Pandas and NumPy. We have already seen the important features of these two libraries in the previous chapters. In this chapter we will see some basic examples from each of the libraries on how to operate on data.

Data Operations in NumPy

The most important object defined in NumPy is an N-dimensional array type called ndarray. It describes a collection of items of the same type, which can be accessed using a zero-based index. An instance of the ndarray class can be constructed by the different array creation routines described later in the tutorial. The basic ndarray is created using the array function in NumPy:

numpy.array

Following are some examples of NumPy data handling.

Example 1

# more than one dimension
import numpy as np
a = np.array([[1, 2], [3, 4]])
print a

The output is as follows:

[[1 2]
 [3 4]]

Example 2

# minimum dimensions
import numpy as np
a = np.array([1, 2, 3, 4, 5], ndmin = 2)
print a

The output is as follows:

[[1 2 3 4 5]]

Example 3

# dtype parameter
import numpy as np
a = np.array([1, 2, 3], dtype = complex)
print a

The output is as follows:

[ 1.+0.j  2.+0.j  3.+0.j]

Data Operations in Pandas

Pandas handles data through Series, DataFrame, and Panel. We will see some examples of each of these.

Pandas Series

A Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A pandas Series can be created using the following constructor:

pandas.Series(data, index, dtype, copy)

Example

Here we create a Series from a NumPy array.

# import the pandas library and alias it as pd
import pandas as pd
import numpy as np

data = np.array(['a','b','c','d'])
s = pd.Series(data)
print s

Its output is as follows:

0    a
1    b
2    c
3    d
dtype: object

Pandas DataFrame

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A pandas DataFrame can be created using the following constructor:

pandas.DataFrame(data, index, columns, dtype, copy)

Let us now create an indexed DataFrame using arrays.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df

Its output is as follows:

       Age   Name
rank1   28    Tom
rank2   34   Jack
rank3   29  Steve
rank4   42  Ricky
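Building on the DataFrame above, a short sketch of our own shows how individual columns and labeled rows can be picked out: columns by name, rows via the loc indexer.

import pandas as pd

data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])

print(df['Age'])         # one column, returned as a Series
print(df.loc['rank1'])   # one row, selected by its index label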
Pandas Panel

A Panel is a 3D container of data. The term "panel data" is derived from econometrics and is partially responsible for the name pandas: pan(el)-da(ta)-s. A Panel can be created using the following constructor:

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

In the below example we create a Panel from a dict of DataFrame objects.

# creating a panel from a dict of DataFrames
import pandas as pd
import numpy as np

data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
p = pd.Panel(data)
print p

Its output is as follows:

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

Handling Missing Data

Missing data is always a problem in real-life scenarios. Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of the poor quality of data caused by missing values. In these areas, missing-value treatment is a major point of focus to make models more accurate and valid.

When and Why Is Data Missed?

Let us consider an online survey for a product. Many a time, people do not share all the information related to them. A few people share their experience but not how long they have been using the product; a few share how long they have been using the product and their experience, but not their contact information. Thus, in one way or another, a part of the data is always missing, and this is very common in real life.

Let us now see how we can handle missing values (say NA or NaN) using Pandas.

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df

Its output is as follows:

        one       two     three
a  0.077988  0.476149  0.965836
b       NaN       NaN       NaN
c -0.390208 -0.551605 -2.301950
d       NaN       NaN       NaN
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g       NaN       NaN       NaN
h  0.085100  0.532791  0.887415

Using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number.

Check for Missing Values

To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df['one'].isnull()

Its output is as follows:

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

Cleaning / Filling Missing Data

Pandas provides various methods for cleaning missing values. The fillna function can "fill in" NA values with non-null data in a couple of ways, which we illustrate in the following sections.

Replace NaN with a Scalar Value

The following program shows how you can replace "NaN" with "0".

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print df
print ("NaN replaced with '0':")
print df.fillna(0)

Its output is as follows:

        one       two     three
a -0.576991 -0.741695  0.553172
b       NaN       NaN       NaN
c  0.744328 -1.735166  1.749580

NaN replaced with '0':
        one       two     three
a -0.576991 -0.741695  0.553172
b  0.000000  0.000000  0.000000
c  0.744328 -1.735166  1.749580

Here we are filling with the value zero; instead, we can also fill with any other value.
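For instance, a common alternative (our own sketch, not from the example above) is to fill a column's missing entries with the mean of its non-missing values:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c'])

# Replace NaN in column 'one' with the column's mean
print(df['one'].fillna(df['one'].mean()))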
Fill NA Forward and Backward

Using the concepts of filling discussed in the reindexing chapter, we will fill the missing values.

- pad/ffill: fill values forward
- bfill/backfill: fill values backward

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df.fillna(method='pad')

Its output is as follows:

        one       two     three
a  0.077988  0.476149  0.965836
b  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
d -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
g -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415

Drop Missing Values

If you want to simply exclude the missing values, use the dropna function along with the axis argument. By default, axis=0, i.e., along rows, which means that if any value within a row is NA, the whole row is excluded.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print df.dropna()

Its output is as follows:

        one       two     three
a  0.077988  0.476149  0.965836
c -0.390208 -0.551605 -2.301950
e -2.000303 -0.788201  1.510072
f -0.930230 -0.670473  1.146615
h  0.085100  0.532791  0.887415

Replace Missing (or) Generic Values

Many times we have to replace a generic value with some specific value. We can achieve this by applying the replace method. Replacing NA with a scalar value is equivalent behavior to the fillna() function.

Example

import pandas as pd
import numpy as np

df = pd.DataFrame({'one':[10,20,30,40,50,2000], 'two':[1000,0,30,40,50,60]})
print df.replace({1000:10, 2000:60})

Its output is as follows:

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60

Python - Processing CSV Data

Reading data from CSV (comma-separated values) files is a fundamental necessity in data science. Often, we get data from various sources which can be exported to CSV format so that it can be used by other systems. The Pandas library provides features using which we can read a CSV file in full as well as in parts, for a selected group of columns and rows.

Input as CSV File

A CSV file is a text file in which the values in the columns are separated by commas. Let's consider the following data, present in a file named input.csv. You can create this file using Windows Notepad by copying and pasting this data. Save the file as input.csv using the Save As "All Files (*.*)" option in Notepad.

id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

Reading a CSV File

The read_csv function of the pandas library is used to read the content of a CSV file into the Python environment as a pandas DataFrame. The function can read files from the OS by using the proper path to the file.

import pandas as pd
data = pd.read_csv('path/input.csv')
print (data)

When we execute the above code, it produces the following result. Please note how an additional column starting with zero has been created by the function as the index.

   id    name  salary  start_date        dept
0   1    Rick  623.30  2012-01-01          IT
1   2     Dan  515.20  2013-09-23  Operations
2   3   Tusar  611.00  2014-11-15          IT
3   4    Ryan  729.00  2014-05-11          HR
4   5    Gary  843.25  2015-03-27     Finance
5   6   Rasmi  578.00  2013-05-21          IT
6   7  Pranab  632.80  2013-07-30  Operations
7   8    Guru  722.50  2014-06-17     Finance
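The reverse direction works the same way. As a small sketch of our own (the output path is just an illustrative placeholder), the DataFrame can be exported back to CSV with to_csv:

import pandas as pd

data = pd.read_csv('path/input.csv')
# Write the DataFrame back out; index=False omits the auto-generated index column
data.to_csv('path/output.csv', index=False)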
Reading Specific Rows

The read_csv function of the pandas library can also be used to read specific rows for a given column. We slice the result from read_csv using the code shown below to get the first 5 rows of the column named salary.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Slice the result for the first 5 rows
print (data[0:5]['salary'])

When we execute the above code, it produces the following result.

0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

Reading Specific Columns

The read_csv function of the pandas library can also be used to read specific columns. We use the multi-axes indexing method .loc for this purpose. Here we choose to display the salary and name columns for all the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print (data.loc[:,['salary','name']])

When we execute the above code, it produces the following result.

   salary    name
0  623.30    Rick
1  515.20     Dan
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
7  722.50    Guru

Reading Specific Columns and Rows

The read_csv function can also be used to read specific columns for specific rows. Again we use the multi-axes indexing method .loc, here displaying the salary and name columns for some of the rows.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print (data.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Specific Columns for a Range of Rows

The read_csv function can also be used to read specific columns for a range of rows. Here we display the salary and name columns for rows 2 to 6.

import pandas as pd
data = pd.read_csv('path/input.csv')

# Use the multi-axes indexing function
print (data.loc[2:6,['salary','name']])

When we execute the above code, it produces the following result.

   salary    name
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab

Python - Processing JSON Data

A JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Pandas can read JSON files using the read_json function.

Input Data

Create a JSON file by copying the data below into a text editor like Notepad. Save the file with the .json extension, choosing the file type as "All Files (*.*)".

{
   "ID":["1","2","3","4","5","6","7","8"],
   "Name":["Rick","Dan","Tusar","Ryan","Gary","Rasmi","Pranab","Guru"],
   "Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
   "StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015",
                "5/21/2013","7/30/2013","6/17/2014"],
   "Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}

Read the JSON File

The read_json function of the pandas library can be used to read the JSON file into a pandas DataFrame.

import pandas as pd
data = pd.read_json('path/input.json')
print (data)

When we execute the above code, it produces the following result.

         Dept  ID    Name  Salary   StartDate
0          IT   1    Rick  623.30    1/1/2012
1  Operations   2     Dan  515.20   9/23/2013
2          IT   3   Tusar  611.00  11/15/2014
3          HR   4    Ryan  729.00   5/11/2014
4     Finance   5    Gary  843.25   3/27/2015
5          IT   6   Rasmi  578.00   5/21/2013
6  Operations   7  Pranab  632.80   7/30/2013
7     Finance   8    Guru  722.50   6/17/2014
Reading Specific Columns and Rows

Similar to what we saw for the CSV file, once the JSON file is read into a DataFrame we can display specific columns for specific rows using the multi-axes indexing method .loc. Here we display the Salary and Name columns for some of the rows.

import pandas as pd
data = pd.read_json('path/input.json')

# Use the multi-axes indexing function
print (data.loc[[1,3,5],['Salary','Name']])

When we execute the above code, it produces the following result.

   Salary   Name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading a JSON File as Records

We can also apply the to_json function along with parameters to render the file's content as individual records.

import pandas as pd
data = pd.read_json('path/input.json')
print (data.to_json(orient='records', lines=True))

When we execute the above code, it produces the following result.

{"Dept":"IT","ID":1,"Name":"Rick","Salary":623.3,"StartDate":"1\/1\/2012"}
{"Dept":"Operations","ID":2,"Name":"Dan","Salary":515.2,"StartDate":"9\/23\/2013"}
{"Dept":"IT","ID":3,"Name":"Tusar","Salary":611.0,"StartDate":"11\/15\/2014"}
{"Dept":"HR","ID":4,"Name":"Ryan","Salary":729.0,"StartDate":"5\/11\/2014"}
{"Dept":"Finance","ID":5,"Name":"Gary","Salary":843.25,"StartDate":"3\/27\/2015"}
{"Dept":"IT","ID":6,"Name":"Rasmi","Salary":578.0,"StartDate":"5\/21\/2013"}
{"Dept":"Operations","ID":7,"Name":"Pranab","Salary":632.8,"StartDate":"7\/30\/2013"}
{"Dept":"Finance","ID":8,"Name":"Guru","Salary":722.5,"StartDate":"6\/17\/2014"}

Python - Processing XLS Data

Microsoft Excel is a very widely used spreadsheet program. Its user-friendliness and appealing features make it a very frequently used tool in data science. The Pandas library provides features using which we can read an Excel file in full as well as in parts, for a selected group of data. We can also read an Excel file with multiple sheets in it, using the read_excel function.

Input as Excel File

We create an Excel file with multiple sheets in Windows. The data in the different sheets is as shown below. You can create this file using the Excel program in Windows. Save the file as input.xlsx.

# Data in Sheet1
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Tusar,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578,2013-05-21,IT
7,Pranab,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

# Data in Sheet2
id,name,zipcode
1,Rick,301224
2,Dan,341255
3,Tusar,297704
4,Ryan,216650
5,Gary,438700
6,Rasmi,665100
7,Pranab,341211
8,Guru,347480

Reading an Excel File

The read_excel function of the pandas library is used to read the content of an Excel file into the Python environment as a pandas DataFrame. The function can read files from the OS by using the proper path to the file. By default, the function reads Sheet1.

import pandas as pd
data = pd.read_excel('path/input.xlsx')
print (data)

When we execute the above code, it produces the following result. Please note how an additional column starting with zero has been created by the function as the index.

   id    name  salary  start_date        dept
0   1    Rick  623.30  2012-01-01          IT
1   2     Dan  515.20  2013-09-23  Operations
2   3   Tusar  611.00  2014-11-15          IT
3   4    Ryan  729.00  2014-05-11          HR
4   5    Gary  843.25  2015-03-27     Finance
5   6   Rasmi  578.00  2013-05-21          IT
6   7  Pranab  632.80  2013-07-30  Operations
7   8    Guru  722.50  2014-06-17     Finance
Reading Specific Columns and Rows

Similar to what we have already seen for the CSV file, the read_excel function of the pandas library can also be used to read specific columns for specific rows. We use the multi-axes indexing method .loc for this purpose. Here we display the salary and name columns for some of the rows.

import pandas as pd
data = pd.read_excel('path/input.xlsx')

# Use the multi-axes indexing function
print (data.loc[[1,3,5],['salary','name']])

When we execute the above code, it produces the following result.

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi

Reading Multiple Excel Sheets

Multiple sheets with different data formats can also be read using the read_excel function with the help of a wrapper class named ExcelFile. It reads the multiple sheets into memory only once. In the below example we read Sheet1 and Sheet2 into two DataFrames and print them out individually.

import pandas as pd

with pd.ExcelFile('C:/Users/Rasmi/Documents/pydatasci/input.xlsx') as xls:
    df1 = pd.read_excel(xls, 'Sheet1')
    df2 = pd.read_excel(xls, 'Sheet2')

print("****Result Sheet 1****")
print (df1[0:5]['salary'])
print("")
print("***Result Sheet 2****")
print (df2[0:5]['zipcode'])

When we execute the above code, it produces the following result.

****Result Sheet 1****
0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64

***Result Sheet 2****
0    301224
1    341255
2    297704
3    216650
4    438700
Name: zipcode, dtype: int64

Python - Relational Databases

We can connect to relational databases to analyse data using the pandas library, together with an additional library for implementing database connectivity. This package is named sqlalchemy, and it provides full SQL language functionality for use in Python.

Installing SQLAlchemy

The installation is very straightforward using Anaconda, which we discussed in the chapter on the data science environment. Assuming you have installed Anaconda as described there, run the following command in the Anaconda Prompt window to install the SQLAlchemy package.

conda install sqlalchemy

Reading Relational Tables

We will use SQLite as our relational database, as it is very lightweight and easy to use (the SQLAlchemy library can also connect to a variety of relational sources, including MySQL, Oracle, PostgreSQL and MSSQL). We first create a database engine and then store a DataFrame in it as a relational table using the to_sql function. In the below example we create the relational table by applying to_sql to a DataFrame already created by reading a CSV file.
Then we use the read_sql_query function from pandas to execute and capture the results of various SQL queries.

from sqlalchemy import create_engine
import pandas as pd

data = pd.read_csv('/path/input.csv')

# Create the db engine
engine = create_engine('sqlite:///:memory:')

# Store the dataframe as a table
data.to_sql('data_table', engine)

# Query 1 on the relational table
res1 = pd.read_sql_query('SELECT * FROM data_table', engine)
print('Result 1')
print(res1)
print('')

# Query 2 on the relational table
res2 = pd.read_sql_query('SELECT dept, sum(salary) FROM data_table GROUP BY dept', engine)
print('Result 2')
print(res2)

When we execute the above code, it produces the following result.

Result 1
   index  id    name  salary  start_date        dept
0      0   1    Rick  623.30  2012-01-01          IT
1      1   2     Dan  515.20  2013-09-23  Operations
2      2   3   Tusar  611.00  2014-11-15          IT
3      3   4    Ryan  729.00  2014-05-11          HR
4      4   5    Gary  843.25  2015-03-27     Finance
5      5   6   Rasmi  578.00  2013-05-21          IT
6      6   7  Pranab  632.80  2013-07-30  Operations
7      7   8    Guru  722.50  2014-06-17     Finance

Result 2
         dept  sum(salary)
0     Finance      1565.75
1          HR       729.00
2          IT      1812.30
3  Operations      1148.00

Inserting Data into Relational Tables

We can also insert data into relational tables using the sql.execute function available in pandas. In the below code we use the previous CSV file as the input data set, store it in a relational table, and then insert another record using sql.execute.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')

# Store the data in a relational table
data.to_sql('data_table', engine)

# Insert another row
sql.execute('INSERT INTO data_table VALUES(?,?,?,?,?,?)', engine,
            params=[('id', 9, 'Ruby', 711.20, '2015-03-27', 'IT')])

# Read from the relational table
res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

   id        dept    name  salary  start_date
0   1          IT    Rick  623.30  2012-01-01
1   2  Operations     Dan  515.20  2013-09-23
2   3          IT   Tusar  611.00  2014-11-15
3   4          HR    Ryan  729.00  2014-05-11
4   5     Finance    Gary  843.25  2015-03-27
5   6          IT   Rasmi  578.00  2013-05-21
6   7  Operations  Pranab  632.80  2013-07-30
7   8     Finance    Guru  722.50  2014-06-17
8   9          IT    Ruby  711.20  2015-03-27

Deleting Data from Relational Tables

We can also delete data from relational tables using the sql.execute function available in pandas. The below code deletes a row based on the given input condition.

from sqlalchemy import create_engine
from pandas.io import sql
import pandas as pd

data = pd.read_csv('C:/Users/Rasmi/Documents/pydatasci/input.csv')
engine = create_engine('sqlite:///:memory:')
data.to_sql('data_table', engine)

sql.execute('DELETE FROM data_table WHERE name = (?)', engine, params=[('Gary')])
res = pd.read_sql_query('SELECT ID, Dept, Name, Salary, start_date FROM data_table', engine)
print(res)

When we execute the above code, it produces the following result.

   id        dept    name  salary  start_date
0   1          IT    Rick   623.3  2012-01-01
1   2  Operations     Dan   515.2  2013-09-23
2   3          IT   Tusar   611.0  2014-11-15
3   4          HR    Ryan   729.0  2014-05-11
4   6          IT   Rasmi   578.0  2013-05-21
5   7  Operations  Pranab   632.8  2013-07-30
6   8     Finance    Guru   722.5  2014-06-17

Python - NoSQL Databases

As more and more data becomes available in unstructured or semi-structured form, the need to manage it through NoSQL databases increases. Python can interact with NoSQL databases in a similar way as it interacts with relational databases. In this chapter we will use Python to interact with MongoDB as a NoSQL database.
In case you are new to MongoDB, you can learn it in our MongoDB tutorial. In order to connect to MongoDB, Python uses a library known as pymongo. You can add this library to your Python environment using the below command from the Anaconda environment.

conda install pymongo

This library enables Python to connect to MongoDB using a db client. Once connected, we select the db name to be used for the various operations.

Inserting Data

To insert data into MongoDB we use the insert_one() method available in the database environment. First we connect to the db using the Python code shown below, then we provide the document details as a series of key-value pairs.

# Import the python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to the test db
db = client.test

# Use the employee collection
employee = db.employee

employee_details = {
    'Name': 'Raj Kumar',
    'Address': 'Sears Streer, NZ',
    'Age': '42'
}

# Use the insert method
result = employee.insert_one(employee_details)

# Query for the inserted document
Queryresult = employee.find_one({'Age': '42'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{u'Address': u'Sears Streer, NZ',
 u'Age': u'42',
 u'Name': u'Raj Kumar',
 u'_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Updating Data

Updating existing MongoDB data is similar to inserting. We use the update_one() method, which is native to MongoDB. In the below code we are replacing the existing record with new key-value pairs. Please note how we use the condition criteria to decide which record to update.

# Import the python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the update method
db.employee.update_one(
    {"Age": '42'},
    {
        "$set": {
            "Name": "Srinidhi",
            "Age": '35',
            "Address": "New Omsk, WC"
        }
    }
)

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

{u'Address': u'New Omsk, WC',
 u'Age': u'35',
 u'Name': u'Srinidhi',
 u'_id': ObjectId('5adc5a9f84e7cd3940399f93')}

Deleting Data

Deleting a record is also straightforward, using the delete_one() method. Here also we specify the condition used to choose the record to be deleted.

# Import the python libraries
from pymongo import MongoClient
from pprint import pprint

# Choose the appropriate client
client = MongoClient()

# Connect to db
db = client.test
employee = db.employee

# Use the condition to choose the record
# and use the delete method
db.employee.delete_one({"Age": '35'})

Queryresult = employee.find_one({'Age': '35'})
pprint(Queryresult)

When we execute the above code, it produces the following result.

None

So we see that the particular record no longer exists in the db.
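The examples above always fetch a single document with find_one. To iterate over every matching document, pymongo's find() method returns a cursor; a minimal sketch of our own:

from pymongo import MongoClient

client = MongoClient()
db = client.test
employee = db.employee

# find() with an empty filter returns a cursor over all documents
for doc in employee.find({}):
    print(doc)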
Python - Date and Time

Often in data science we need analysis which is based on temporal values. Python can handle the various formats of date and time gracefully. The datetime library provides the necessary methods and functions to handle the following scenarios:

- Date-time representation
- Date-time arithmetic
- Date-time comparison

We will study them one by one.

Date Time Representation

A date and its various parts are represented by using different datetime functions. There are also format specifiers which play a role in displaying the alphabetical parts of a date, like the name of the month or week day. The following code shows today's date and various parts of the date.

import datetime

print 'The Date Today is :', datetime.datetime.today()

date_today = datetime.date.today()
print date_today
print 'This Year :', date_today.year
print 'This Month :', date_today.month
print 'Month Name:', date_today.strftime('%B')
print 'This Week Day :', date_today.day
print 'Week Day Name:', date_today.strftime('%A')

When we execute the above code, it produces the following result.

The Date Today is : 2018-04-22 15:38:35.835000
2018-04-22
This Year : 2018
This Month : 4
Month Name: April
This Week Day : 22
Week Day Name: Sunday

Date Time Arithmetic

For calculations involving dates, we store the various dates in variables and apply the relevant mathematical operators to them.

import datetime

# Capture the first date
day1 = datetime.date(2018, 2, 12)
print 'day1:', day1.ctime()

# Capture the second date
day2 = datetime.date(2017, 8, 18)
print 'day2:', day2.ctime()

# Find the difference between the dates
print 'Number of Days:', day1 - day2

date_today = datetime.date.today()

# Create a delta of four days
no_of_days = datetime.timedelta(days=4)

# Use the delta for a past date
before_four_days = date_today - no_of_days
print 'Before Four Days:', before_four_days

# Use the delta for a future date
after_four_days = date_today + no_of_days
print 'After Four Days:', after_four_days

When we execute the above code, it produces the following result.

day1: Mon Feb 12 00:00:00 2018
day2: Fri Aug 18 00:00:00 2017
Number of Days: 178 days, 0:00:00
Before Four Days: 2018-04-18
After Four Days: 2018-04-26

Date Time Comparison

Dates and times are compared using logical operators, but we must be careful to compare the right parts of the dates with each other. In the below example we take future and past dates and compare them using the Python if clause along with logical operators.

import datetime

date_today = datetime.date.today()
print 'Today is: ', date_today

# Create a delta of four days
no_of_days = datetime.timedelta(days=4)

# Use the delta for a past date
before_four_days = date_today - no_of_days
print 'Before Four Days:', before_four_days

after_four_days = date_today + no_of_days

date1 = datetime.date(2018, 4, 4)
print 'date1:', date1

if date1 == before_four_days:
    print 'Same Dates'
if date_today > date1:
    print 'Past Date'
if date1 < after_four_days:
    print 'Future Date'

When we execute the above code, it produces the following result.

Today is: 2018-04-22
Before Four Days: 2018-04-18
date1: 2018-04-04
Past Date
Future Date
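The format specifiers mentioned above also work in the opposite direction: datetime.strptime parses a string into a date object using the same codes. A small sketch of our own:

import datetime

# %Y-%m-%d matches strings like '2018-04-22'
d = datetime.datetime.strptime('2018-04-22', '%Y-%m-%d')
print(d.strftime('%A, %B %d'))   # e.g. 'Sunday, April 22'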
Python - Data Wrangling

Data wrangling involves processing data in various ways, such as merging, grouping and concatenating, for the purpose of analysing it or getting it ready to be used with another data set. Python has built-in features to apply these wrangling methods to various data sets to achieve the analytical goal. In this chapter we will look at a few examples describing these methods.

Merging Data

The Pandas library in Python provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True)

Let us now create two different DataFrames and perform the merging operation on them (the merge call itself is sketched right after this example).

# import the pandas library
import pandas as pd

left = pd.DataFrame({
    'id':[1,2,3,4,5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
    'id':[1,2,3,4,5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5']})

print left
print right

Its output is as follows:

     Name  id subject_id
0    Alex   1       sub1
1     Amy   2       sub2
2   Allen   3       sub4
3   Alice   4       sub6
4  Ayoung   5       sub5

    Name  id subject_id
0  Billy   1       sub2
1  Brian   2       sub4
2   Bran   3       sub3
3  Bryce   4       sub6
4  Betty   5       sub5
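The snippet above only builds the two DataFrames; the join itself is performed by merge. As a minimal sketch of our own (the exact column ordering may vary by pandas version), joining on the common id column gives one row per matching id, with the overlapping Name and subject_id columns disambiguated by suffixes:

import pandas as pd

left = pd.DataFrame({
    'id':[1,2,3,4,5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
    'id':[1,2,3,4,5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5']})

# Inner join on the shared 'id' column; pandas appends _x/_y to clashing names
print(pd.merge(left, right, on='id'))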
Grouping Data

Grouping data sets is a frequent need in data analysis, where we need the results in terms of the various groups present in the data set. Pandas has built-in methods which can roll the data up into various groups. In the below example we group the data by year and then get the result for a specific year.

# import the pandas library
import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points':[876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

grouped = df.groupby('Year')
print grouped.get_group(2014)

Its output is as follows:

   Points  Rank    Team  Year
0     876     1  Riders  2014
2     863     2  Devils  2014
4     741     3   Kings  2014
9     701     4  Royals  2014

Concatenating Data

Pandas provides various facilities for easily combining Series, DataFrame, and Panel objects. In the below example the concat function performs a concatenation operation along an axis. Let us create different objects and do the concatenation.

import pandas as pd

one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5'],
    'Marks_scored':[98,90,87,69,78]},
    index=[1,2,3,4,5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5'],
    'Marks_scored':[89,80,79,97,88]},
    index=[1,2,3,4,5])

print pd.concat([one,two])

Its output is as follows:

   Marks_scored    Name subject_id
1            98    Alex       sub1
2            90     Amy       sub2
3            87   Allen       sub4
4            69   Alice       sub6
5            78  Ayoung       sub5
1            89   Billy       sub2
2            80   Brian       sub4
3            79    Bran       sub3
4            97   Bryce       sub6
5            88   Betty       sub5

Python - Data Aggregation

Python has several methods available to perform aggregations on data, using the pandas and numpy libraries. The data must be available as, or converted to, a DataFrame in order to apply the aggregation functions.

Applying Aggregations on a DataFrame

Let us create a DataFrame and apply aggregations on it.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index = pd.date_range('1/1/2000', periods=10),
                  columns = ['A', 'B', 'C', 'D'])
print df

r = df.rolling(window=3, min_periods=1)
print r

Its output is as follows:

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  0.790670 -0.387854 -0.668132  0.267283
2000-01-03 -0.575523 -0.965025  0.060427 -2.179780
2000-01-04  1.669653  1.211759 -0.254695  1.429166
2000-01-05  0.100568 -0.236184  0.491646 -0.466081
2000-01-06  0.155172  0.992975 -1.205134  0.320958
2000-01-07  0.309468 -0.724053 -1.412446  0.627919
2000-01-08  0.099489 -1.028040  0.163206 -1.274331
2000-01-09  1.639500 -0.068443  0.714008 -0.565969
2000-01-10  0.326761  1.479841  0.664282 -1.361169

Rolling [window=3,min_periods=1,center=False,axis=0]

We can aggregate by passing a function to the entire DataFrame, or select a column via the standard get-item method.

Apply Aggregation on a Whole DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index = pd.date_range('1/1/2000', periods=10),
                  columns = ['A', 'B', 'C', 'D'])
print df

r = df.rolling(window=3, min_periods=1)
print r.aggregate(np.sum)

Its output first shows the raw DataFrame (as above), followed by the rolling sums:

                   A         B         C         D
2000-01-01  1.088512 -0.650942 -2.547450 -0.566858
2000-01-02  1.879182 -1.038796 -3.215581 -0.299575
2000-01-03  1.303660 -2.003821 -3.155154 -2.479355
2000-01-04  1.884801 -0.141119 -0.862400 -0.483331
2000-01-05  1.194699  0.010551  0.297378 -1.216695
2000-01-06  1.925393  1.968551 -0.968183  1.284044
2000-01-07  0.565208  0.032738 -2.125934  0.482797
2000-01-08  0.564129 -0.759118 -2.454374 -0.325454
2000-01-09  2.048458 -1.820537 -0.535232 -1.212381
2000-01-10  2.065750  0.383357  1.541496 -3.201469

Apply Aggregation on a Single Column of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index = pd.date_range('1/1/2000', periods=10),
                  columns = ['A', 'B', 'C', 'D'])
print df

r = df.rolling(window=3, min_periods=1)
print r['A'].aggregate(np.sum)

Its output first shows the raw DataFrame, followed by the rolling sums of column A:

2000-01-01    1.088512
2000-01-02    1.879182
2000-01-03    1.303660
2000-01-04    1.884801
2000-01-05    1.194699
2000-01-06    1.925393
2000-01-07    0.565208
2000-01-08    0.564129
2000-01-09    2.048458
2000-01-10    2.065750
Freq: D, Name: A, dtype: float64

Apply Aggregation on Multiple Columns of a DataFrame

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index = pd.date_range('1/1/2000', periods=10),
                  columns = ['A', 'B', 'C', 'D'])
print df

r = df.rolling(window=3, min_periods=1)
print r[['A','B']].aggregate(np.sum)

Its output first shows the raw DataFrame, followed by the rolling sums of columns A and B:

                   A         B
2000-01-01  1.088512 -0.650942
2000-01-02  1.879182 -1.038796
2000-01-03  1.303660 -2.003821
2000-01-04  1.884801 -0.141119
2000-01-05  1.194699  0.010551
2000-01-06  1.925393  1.968551
2000-01-07  0.565208  0.032738
2000-01-08  0.564129 -0.759118
2000-01-09  2.048458 -1.820537
2000-01-10  2.065750  0.383357
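aggregate also accepts several functions at once, or a custom function. A small sketch of our own applying both to the same rolling window:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4),
                  index=pd.date_range('1/1/2000', periods=10),
                  columns=['A', 'B', 'C', 'D'])
r = df.rolling(window=3, min_periods=1)

# Several aggregations at once on one column
print(r['A'].aggregate([np.sum, np.mean]))

# A custom aggregation: the range (max - min) of each window
print(r['A'].aggregate(lambda x: x.max() - x.min()))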
Python - Reading HTML Pages

To read and parse HTML pages, Python uses a library known as BeautifulSoup. Using this library, we can search for the values of HTML tags and get specific data, like the title of the page and the list of headers in the page.

Install BeautifulSoup

Use the Anaconda package manager to install the required package and its dependent packages.

conda install beautifulsoup4

Reading the HTML File

In the below example we request a URL to be loaded into the Python environment. Then we use the html.parser parameter to read the entire HTML file. Next, we print the first few lines of the HTML page.

import urllib2
from bs4 import BeautifulSoup

# Fetch the html file
response = urllib2.urlopen('')   # URL of the page to fetch
html_doc = response.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print (strhtm[:225])

When we execute the above code, it produces the following result.

<!DOCTYPE html>
<!--[if IE 8]><html class="ie ie8"> <![endif]-->
<!--[if IE 9]><html class="ie ie9"> <![endif]-->
<!--[if gt IE 9]><!--><html> <!--<![endif]-->
 <head>
  <!-- Basic -->
  <meta charset="utf-8"/>
  <title>

Extracting Tag Values

We can extract the tag value from the first instance of a tag using the following code.

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('')   # URL of the page to fetch
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

print (soup.title)
print (soup.title.string)
print (soup.a.string)
print (soup.b.string)

When we execute the above code, it produces the following result.

<title>Python Overview</title>
Python Overview
None
Python is Interpreted

Extracting All Tags

We can extract the tag values from all instances of a tag using the following code.

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('')   # URL of the page to fetch
html_doc = response.read()
soup = BeautifulSoup(html_doc, 'html.parser')

for x in soup.find_all('b'):
    print(x.string)

When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
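A common related task is collecting every hyperlink on a page. A minimal sketch of our own, using the same parsing approach; get('href') reads the tag's href attribute:

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('')   # URL of the page to fetch
soup = BeautifulSoup(response.read(), 'html.parser')

# List the target of every anchor tag on the page
for link in soup.find_all('a'):
    print(link.get('href'))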
Data that is already present in a row-and-column format, or that can easily be converted to rows and columns so that it fits nicely into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, where missing values are represented as blanks between the delimiters. But sometimes we get data where the lines are not of fixed width, or are just HTML, image or PDF files. Such data is known as unstructured data. While an HTML file can be handled by processing its HTML tags, a feed from Twitter or a plain-text document from a news feed has neither a delimiter nor tags to handle. In such scenarios we use different built-in functions from various Python libraries to process the file.

Reading Data
In the below example we take a text file and read it, segregating each of the lines in it. Next we can divide the output into further lines and words. The original file is a text file containing some paragraphs describing the Python language.

filename = r'path\input.txt'

with open(filename) as fn:
   # Read each line
   ln = fn.readline()
   # Keep count of lines
   lncnt = 1
   while ln:
      print("Line {}: {}".format(lncnt, ln.strip()))
      ln = fn.readline()
      lncnt += 1

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.

Counting Word Frequency
We can count the frequency of the words in the file using the Counter class as follows.

from collections import Counter

with open(r'path\input2.txt') as f:
   p = Counter(f.read().split())
   print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, ...})

Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis, such as classifying and counting them for a particular sentiment. The Natural Language Toolkit (NLTK) is a library used to achieve this. Install NLTK before proceeding with the Python program for word tokenization.

conda install -c anaconda nltk

Next we use the word_tokenize method to split the paragraph into individual words.

import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']
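Note that word_tokenize and sent_tokenize rely on NLTK's punkt tokenizer models, which are downloaded separately from the library itself. If the tokenizer raises a lookup error, download the models once as shown below.

import nltk

# One-time download of the tokenizer models used by
# word_tokenize and sent_tokenize
nltk.download('punkt')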
Tokenizing Sentences
We can also tokenize the sentences in a paragraph, just as we tokenized the words. We use the method sent_tokenize to achieve this. Below is an example.

import nltk

sentence_data = "Sun rises in the east. Sun sets in the west."
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['Sun rises in the east.', 'Sun sets in the west.']

In the areas of natural language processing we come across situations where two or more words have a common root. For example, the three words agreed, agreeing and agreeable have the same root word agree. A search involving any of these words should treat them as the same word, namely the root word. So it becomes essential to link all the words to their root word. The NLTK library has methods to do this linking and to show the root word of each word. The below program uses the Porter stemming algorithm.

import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First word tokenization
nltk_tokens = nltk.word_tokenize(word_data)

# Next find the root of each word
for w in nltk_tokens:
   print("Actual: %s  Stem: %s" % (w, porter_stemmer.stem(w)))

When we execute the above code, it produces the following result.

Actual: It  Stem: It
Actual: originated  Stem: origin
Actual: from  Stem: from
Actual: the  Stem: the
Actual: idea  Stem: idea
Actual: that  Stem: that
Actual: there  Stem: there
Actual: are  Stem: are
Actual: readers  Stem: reader
Actual: who  Stem: who
Actual: prefer  Stem: prefer
Actual: learning  Stem: learn
Actual: new  Stem: new
Actual: skills  Stem: skill
Actual: from  Stem: from
Actual: the  Stem: the
Actual: comforts  Stem: comfort
Actual: of  Stem: of
Actual: their  Stem: their
Actual: drawing  Stem: draw
Actual: rooms  Stem: room

Lemmatization is similar to stemming, but it brings context to the words. It goes a step further by linking words with similar meanings to one word. For example, if a paragraph has words like cars, trains and automobile, it will link all of them to automobile. In the below program we use the WordNet lexical database for lemmatization.

import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)

for w in nltk_tokens:
   print("Actual: %s  Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))

When we execute the above code, it produces the following result.

Actual: It  Lemma: It
Actual: originated  Lemma: originated
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: idea  Lemma: idea
Actual: that  Lemma: that
Actual: there  Lemma: there
Actual: are  Lemma: are
Actual: readers  Lemma: reader
Actual: who  Lemma: who
Actual: prefer  Lemma: prefer
Actual: learning  Lemma: learning
Actual: new  Lemma: new
Actual: skills  Lemma: skill
Actual: from  Lemma: from
Actual: the  Lemma: the
Actual: comforts  Lemma: comfort
Actual: of  Lemma: of
Actual: their  Lemma: their
Actual: drawing  Lemma: drawing
Actual: rooms  Lemma: room
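In the output above, learning is left unchanged because lemmatize() treats every word as a noun unless told otherwise. Below is a minimal sketch of the pos parameter; the WordNet database itself is downloaded once with nltk.download.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')   # one-time download of the WordNet database
wordnet_lemmatizer = WordNetLemmatizer()

# Without a part-of-speech tag the word is treated as a noun
print(wordnet_lemmatizer.lemmatize('learning'))            # learning
# Tagged as a verb, it is reduced to its lemma
print(wordnet_lemmatizer.lemmatize('learning', pos='v'))   # learn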
Python has excellent libraries for data visualization. A combination of Pandas, numpy and matplotlib can help in creating nearly all types of visualization charts. In this chapter we will get started by looking at a simple chart and the various properties of the chart.

Creating a Chart
We use the numpy library to create the numbers to be mapped in the chart, and the pyplot module of matplotlib to draw the actual chart.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = x ** 2   # use **, not ^, which is bitwise XOR

# Simple plot
plt.plot(x, y)
plt.show()

Its output is as follows −

Labeling the Axes
We can apply labels to the axes, as well as a title for the chart, using appropriate methods from the library as shown below.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = x ** 2

# Labeling the axes and title
plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")

# Simple plot
plt.plot(x, y)
plt.show()

Its output is as follows −

Formatting Line Type and Colour
The style as well as the colour of the line in the chart can be specified using appropriate methods from the library as shown below.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = x ** 2

plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")

# Formatting the line colour
plt.plot(x, y, 'r')

# Formatting the line type
plt.plot(x, y, '>')
plt.show()

Its output is as follows −

Saving the Chart File
The chart can be saved in different image file formats using appropriate methods from the library as shown below.

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = x ** 2

plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")

# Formatting the line colour
plt.plot(x, y, 'r')

# Formatting the line type
plt.plot(x, y, '>')

# Save in pdf format
plt.savefig('timevsdist.pdf', format='pdf')

The above code creates the PDF file in the current working directory of the Python program.
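savefig is not limited to PDF; raster formats work the same way. A small sketch saving the same chart as a PNG, where the dpi parameter controls the resolution (the filename is simply reused from the example above):

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 10)
y = x ** 2
plt.plot(x, y)

# Save as a PNG image; a higher dpi gives a sharper image
plt.savefig('timevsdist.png', format='png', dpi=300)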
The charts created in Python can be styled further by using appropriate methods from the charting libraries. In this lesson we will see the implementation of annotations, legends and chart backgrounds. We will continue to use the code from the last chapter and modify it to add these styles to the chart.

Adding Annotations
Many times we need to annotate the chart by highlighting specific locations on it. In the below example we indicate the sharp change in values in the chart by adding annotations at those points.

import numpy as np
from matplotlib import pyplot as plt

x = np.arange(0, 10)
y = x ** 2

# Labeling the axes and title
plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")
plt.plot(x, y)

# Annotate
plt.annotate('Second Entry', xy=(2, 4))
plt.annotate('Third Entry', xy=(4, 16))
plt.show()

Its output is as follows −

Adding Legends
We sometimes need a chart with multiple lines being plotted. A legend indicates the meaning associated with each line. In the below chart we have 3 lines with appropriate legends.

import numpy as np
from matplotlib import pyplot as plt

x = np.arange(0, 10)
y = x ** 2
z = x ** 3
t = x ** 4

plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")
plt.plot(x, y)

# Annotate
plt.annotate('Second Entry', xy=(2, 4))
plt.annotate('Third Entry', xy=(4, 16))

# Adding legends
plt.plot(x, z)
plt.plot(x, t)
plt.legend(['Race1', 'Race2', 'Race3'], loc=4)
plt.show()

Its output is as follows −

Chart Presentation Style
We can modify the presentation style of the chart by using different methods from the style package. Note that the style must be set before the lines are plotted.

import numpy as np
from matplotlib import pyplot as plt

x = np.arange(0, 10)
y = x ** 2
z = x ** 3
t = x ** 4

# Style the background
plt.style.use('fast')

plt.title("Graph Drawing")
plt.xlabel("Time")
plt.ylabel("Distance")
plt.plot(x, y)

plt.annotate('Second Entry', xy=(2, 4))
plt.annotate('Third Entry', xy=(4, 16))

plt.plot(x, z)
plt.plot(x, t)
plt.legend(['Race1', 'Race2', 'Race3'], loc=4)
plt.show()

Its output is as follows −

Boxplots show how well the data in a data set is distributed. A boxplot divides the data set into quartiles and represents the minimum, maximum, median, first quartile and third quartile of the data. It is also useful for comparing the distribution of data across data sets, by drawing a boxplot for each of them.

Drawing a Box Plot
A boxplot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column. For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0, 1).

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box(grid=True)

Its output is as follows −

A heatmap represents each value to be plotted as a shade of a colour. Usually the darker shades of the chart represent higher values than the lighter shades; a completely different colour can also be used for outlying values. The below example is a two-dimensional plot of values which are mapped to the indices and columns of the chart.

from pandas import DataFrame
import matplotlib.pyplot as plt

data = [[2, 3, 4, 1],
        [6, 3, 5, 2],
        [6, 3, 5, 4],
        [3, 7, 5, 4],
        [2, 8, 1, 5]]
Index = ['I1', 'I2', 'I3', 'I4', 'I5']
Cols = ['C1', 'C2', 'C3', 'C4']

df = DataFrame(data, index=Index, columns=Cols)

plt.pcolor(df)
plt.show()

Its output is as follows −

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables: one variable is chosen for the horizontal axis and the other for the vertical axis.

Drawing a Scatter Plot
A scatter plot can be created using the DataFrame.plot.scatter() method.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')

Its output is as follows −

Bubble charts display data as a cluster of circles. The data needed to create a bubble chart must have the x-y coordinates, the size of each bubble and the colour of each bubble. The colours can be supplied by the library itself.

Drawing a Bubble Chart
A bubble chart can be created using the pyplot scatter() function, where the s parameter sets the bubble sizes and c the colours.

import matplotlib.pyplot as plt
import numpy as np

# Create data
x = np.random.rand(40)
y = np.random.rand(40)
z = np.random.rand(40)
colors = np.random.rand(40)

# Use the scatter function; s sets bubble sizes, c the colours
plt.scatter(x, y, s=z * 1000, c=colors)
plt.show()

Its output is as follows −
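The same bubble chart can also be drawn directly from a DataFrame with the DataFrame.plot.scatter() method. A minimal sketch, assuming random demo data, where column c scales the bubble sizes and column d drives the colours:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])

# s sets the bubble sizes, c (with a colormap) the bubble colours
df.plot.scatter(x='a', y='b', s=df['c'] * 200, c='d', colormap='viridis')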
Python is also capable of creating 3D charts. This involves adding a subplot to an existing two-dimensional plot and assigning the projection parameter as 3d.

Drawing a 3D Plot
The 3D plot is drawn using mpl_toolkits.mplot3d, which adds a subplot with a 3d projection to an existing figure.

from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt

chart = plt.figure()
chart3d = chart.add_subplot(111, projection='3d')

# Create some test data
X, Y, Z = axes3d.get_test_data(0.08)

# Plot a wireframe
chart3d.plot_wireframe(X, Y, Z, color='r', rstride=15, cstride=10)
plt.show()

Its output is as follows −

A time series is a series of data points in which each data point is associated with a timestamp. A simple example is the price of a stock at different points of time on a given day. Another example is the amount of rainfall in a region in different months of the year. In the below example we take the daily price of a particular stock over a quarter. We capture these values in a CSV file, organize them into a DataFrame using the pandas library, and then set the date field as the index of the DataFrame by converting the ValueDate column to datetime, assigning it as the index and deleting the original column.

Sample Data
Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a file named stock.csv.

ValueDate,Price
01-01-2018,1042.05
02-01-2018,1033.55
03-01-2018,1029.7
04-01-2018,1021.3
05-01-2018,1015.4
...
23-03-2018,1161.3
26-03-2018,1167.6
27-03-2018,1155.25
28-03-2018,1154

Creating Time Series

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])

# Set the date as index; the dates are day-first (dd-mm-yyyy)
df['ValueDate'] = pd.to_datetime(df['ValueDate'], dayfirst=True)
df.index = df['ValueDate']
del df['ValueDate']

df.plot(figsize=(15, 6))
plt.show()

Its output is as follows −
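Once the DataFrame has a datetime index, pandas can also change the sampling frequency of the series. A minimal sketch, assuming the same stock.csv as above, that aggregates the daily prices into weekly means:

import pandas as pd

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])

# The dates are day-first (dd-mm-yyyy)
df['ValueDate'] = pd.to_datetime(df['ValueDate'], dayfirst=True)
df = df.set_index('ValueDate')

# Resample the daily series to weekly mean prices
print(df.resample('W').mean())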
Many open source Python libraries have been created to render geographical maps. They are highly customizable and offer a variety of maps depicting areas in different shapes and colours. One such package is Cartopy, which you can download and install in your local environment from the Cartopy website; you can find numerous examples in its gallery. In the below example we show a portion of the world map covering parts of Asia and Australia. You can adjust the values of the parameters of the set_extent method to locate different areas of the world map.

import matplotlib.pyplot as plt
import cartopy.crs as ccrs

fig = plt.figure(figsize=(15, 10))
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())

# Make the map global rather than having it zoom in to
# the extents of any plotted data
ax.set_extent((60, 150, 55, -25))

ax.stock_img()
ax.coastlines()
ax.tissot(facecolor='purple', alpha=0.8)

plt.show()

Its output is as follows −

CSGraph stands for Compressed Sparse Graph. It focuses on fast graph algorithms based on sparse matrix representations.

Graph Representations
To begin with, let us understand what a sparse graph is and how it helps in graph representations.

What exactly is a sparse graph?
A graph is just a collection of nodes which have links between them. Graphs can represent nearly anything: social network connections, where each node is a person and is connected to acquaintances; images, where each node is a pixel and is connected to neighbouring pixels; points in a high-dimensional distribution, where each node is connected to its nearest neighbours; and practically anything else you can imagine.

One very efficient way to represent graph data is in a sparse matrix: let us call it G. The matrix G is of size N x N, and G[i, j] gives the value of the connection between node i and node j. A sparse graph contains mostly zeros; that is, most nodes have only a few connections. This property turns out to be true in most cases of interest.

The creation of the sparse graph submodule was motivated by several algorithms used in scikit-learn, including the following −

Isomap − a manifold learning algorithm which requires finding the shortest paths in a graph.
Hierarchical clustering − a clustering algorithm based on a minimum spanning tree.
Spectral decomposition − a projection algorithm based on sparse graph Laplacians.

As a concrete example, imagine that we would like to represent the following undirected graph: it has three nodes, where nodes 0 and 1 are connected by an edge of weight 2, and nodes 0 and 2 are connected by an edge of weight 1. We can construct the dense, masked and sparse representations as shown in the following example, keeping in mind that an undirected graph is represented by a symmetric matrix.

import numpy as np
from scipy.sparse import csr_matrix

G_dense = np.array([[0, 2, 1],
                    [2, 0, 0],
                    [1, 0, 0]])
G_masked = np.ma.masked_values(G_dense, 0)
G_sparse = csr_matrix(G_dense)

print(G_sparse.data)

The above program will generate the following output.

[2 1 2 1]

Now consider a graph that is identical to the previous one, except that nodes 0 and 2 are connected by an edge of zero weight. In this case, the dense representation leads to ambiguities: how can non-edges be represented, if zero is a meaningful value? Here, either a masked or a sparse representation must be used to eliminate the ambiguity. Let us consider the following example.

import numpy as np
from scipy.sparse.csgraph import csgraph_from_dense

G2_data = np.array([[np.inf, 2,      0     ],
                    [2,      np.inf, np.inf],
                    [0,      np.inf, np.inf]])
G2_sparse = csgraph_from_dense(G2_data, null_value=np.inf)

print(G2_sparse.data)

The above program will generate the following output.

[2. 0. 2. 0.]
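Once a graph is stored as a sparse matrix, the algorithms in scipy.sparse.csgraph can work on it directly. A minimal sketch, reusing the three-node graph above, that computes the matrix of shortest-path distances between every pair of nodes:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

G_dense = np.array([[0, 2, 1],
                    [2, 0, 0],
                    [1, 0, 0]])
G_sparse = csr_matrix(G_dense)

# directed=False treats each stored edge as two-way;
# e.g. the path from node 1 to node 2 goes through node 0 (2 + 1 = 3)
print(shortest_path(G_sparse, directed=False))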
Mathematically, central tendency means measuring the center of the distribution of the values of a data set. It gives an idea of the average value of the data in the data set, and also an indication of how widely the values are spread. That in turn helps in evaluating the chances of a new input fitting into the existing data set, and hence its probability of success.

There are three main measures of central tendency, which can be calculated using the methods in the pandas library.

Mean − the average value of the data, i.e. the sum of the values divided by the number of values.
Median − the middle value of the distribution when the values are arranged in ascending or descending order.
Mode − the most commonly occurring value in the distribution.

Calculating Mean and Median
The pandas functions can be used directly to calculate these values.

import pandas as pd

# Create a dictionary of series
d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

# Create a DataFrame
df = pd.DataFrame(d)

print("Mean Values in the Distribution")
print(df.mean(numeric_only=True))
print("*******************************")
print("Median Values in the Distribution")
print(df.median(numeric_only=True))

Its output is as follows −

Mean Values in the Distribution
Age       31.833333
Rating     3.743333
dtype: float64
*******************************
Median Values in the Distribution
Age       29.50
Rating     3.79
dtype: float64

Calculating Mode
The mode may or may not be available in a distribution, depending on whether the data is continuous or whether there are values that have maximum frequency. We take a simple distribution below to find out the mode. Here we have a value that has the maximum frequency in the distribution.

import pandas as pd

d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46])}

df = pd.DataFrame(d)
print(df.mode())

Its output is as follows −

     Age      Name
0   25.0    Andres
1    NaN  Chanchal
2    NaN    Gasper
3    NaN      Jack
4    NaN     James
5    NaN       Lee
6    NaN    Naviya
7    NaN     Ricky
8    NaN     Smith
9    NaN     Steve
10   NaN       Tom
11   NaN       Vin

In statistics, variance is a measure of how far the values in a data set lie from the mean value. In other words, it indicates how dispersed the values are, and it is measured using the standard deviation. Another commonly used measure is skewness. Both of these are calculated using functions available in the pandas library.

Measuring Standard Deviation
The standard deviation is the square root of the variance, where the variance is the average of the squared differences of the values from the mean. In Python we calculate this value by using the std() function from the pandas library.

import pandas as pd

d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

df = pd.DataFrame(d)

# Calculate the standard deviation of the numeric columns
print(df.std(numeric_only=True))

Its output is as follows −

Age       7.265527
Rating    0.661628
dtype: float64
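Since the standard deviation is defined as the square root of the variance, pandas' var() function can be used to verify the result. A small sketch with the same Age column:

import pandas as pd

df = pd.DataFrame({'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46])})

# df.std() is the square root of df.var()
print(df.var())
print(df.var() ** 0.5)   # matches df.std()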
Measuring Skewness
Skewness is used to determine whether the data is symmetric or skewed. If the index is between -1 and 1, the distribution is symmetric; if it is less than -1, the distribution is skewed to the left, and if it is greater than 1, it is skewed to the right.

import pandas as pd

d = {'Name': pd.Series(['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'Chanchal', 'Gasper', 'Naviya', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 25, 23, 34, 40, 30, 25, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])}

df = pd.DataFrame(d)
print(df.skew(numeric_only=True))

Its output is as follows −

Age       1.443490
Rating   -0.153629
dtype: float64

So the distribution of Rating is symmetric, while the distribution of Age is skewed to the right.

The normal distribution presents data by arranging the probability distribution of each value in the data. Most values remain around the mean value, making the arrangement symmetric. We use various functions from the numpy library to calculate the values of a normal distribution mathematically; a histogram is created, over which we plot the probability distribution curve.

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Plot the distribution curve
plt.plot(bins, 1 / (sigma * np.sqrt(2 * np.pi)) *
         np.exp(-(bins - mu) ** 2 / (2 * sigma ** 2)),
         linewidth=3, color='y')
plt.show()

Its output is as follows −

The binomial distribution model deals with finding the probability of success of an event which has only two possible outcomes in a series of experiments. For example, tossing a coin always gives a head or a tail. The probability of finding exactly 3 heads when tossing a coin 10 times is estimated with the binomial distribution. We use the seaborn library, which has built-in functions to plot such probability distributions, and the scipy package to generate the binomially distributed data.

from scipy.stats import binom
import seaborn as sb

# Draw 1000 samples from a binomial distribution with 20 trials
# and success probability 0.8
data_binom = binom.rvs(n=20, p=0.8, loc=0, size=1000)

ax = sb.distplot(data_binom, kde=True, color='blue',
                 hist_kws={"linewidth": 25, 'alpha': 1})
ax.set(xlabel='Binomial', ylabel='Frequency')

Its output is as follows −

A Poisson distribution shows the likely number of times that an event will occur within a predetermined period of time. It is used for independent events which occur at a constant rate within a given interval of time. The Poisson distribution is a discrete distribution, meaning the variable can only take whole-number values. We again use the seaborn library for plotting and the scipy package to generate the Poisson-distributed data.

from scipy.stats import poisson
import seaborn as sb

data_poisson = poisson.rvs(mu=4, size=10000)

ax = sb.distplot(data_poisson, kde=True, color='green',
                 hist_kws={"linewidth": 25, 'alpha': 1})
ax.set(xlabel='Poisson', ylabel='Frequency')

Its output is as follows −
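Besides generating random samples with rvs, scipy can also evaluate the Poisson probabilities directly. A minimal sketch: the probability of observing exactly 3 events in an interval that averages 4 events, which is roughly 0.195.

from scipy.stats import poisson

# P(X = 3) when the average event rate mu is 4
print(poisson.pmf(3, mu=4))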
The Bernoulli distribution is a special case of the binomial distribution in which a single experiment is conducted, so the number of observations is 1. The Bernoulli distribution therefore describes events having exactly two outcomes. We use the scipy package to generate Bernoulli-distributed data and seaborn to plot the histogram with the probability distribution curve.

from scipy.stats import bernoulli
import seaborn as sb

data_bern = bernoulli.rvs(size=1000, p=0.6)

ax = sb.distplot(data_bern, kde=True, color='crimson',
                 hist_kws={"linewidth": 25, 'alpha': 1})
ax.set(xlabel='Bernoulli', ylabel='Frequency')

Its output is as follows −

The p-value measures the strength of evidence against a hypothesis. We build a hypothesis based on some statistical model and assess the model's validity using the p-value. One way to get the p-value is by using a t-test. ttest_1samp is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations is equal to the given population mean, popmean. Let us consider the following example.

from scipy import stats

rvs = stats.norm.rvs(loc=5, scale=10, size=(50, 2))
print(stats.ttest_1samp(rvs, 5.0))

The above program will generate the following output.

Ttest_1sampResult(statistic=array([-1.40184894, 2.70158009]),
pvalue=array([0.16726344, 0.00945234]))

Comparing Two Samples
In the following example there are two samples, which can come either from the same or from different distributions, and we want to test whether they have the same statistical properties.

ttest_ind calculates the t-test for the means of two independent samples of scores. This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. By default, the test assumes that the populations have identical variances. We can use this test if we observe two independent samples from the same or different populations. Let us consider the following example.

from scipy import stats

rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)
print(stats.ttest_ind(rvs1, rvs2))

The above program will generate the following output.

Ttest_indResult(statistic=-0.67406312233650278, pvalue=0.50042727502272966)

You can test the same with a new array of the same length but with a different mean: use a different value for loc and run the test again.

Correlation refers to statistical relationships involving dependence between two data sets. Simple examples of dependent phenomena include the correlation between the physical appearance of parents and their offspring, and the correlation between the price of a product and its supplied quantity. We take the example of the iris data set available in the seaborn library. In it we try to establish the correlation between the length and the width of the sepals and petals of three species of iris flower. Based on the correlation found, a strong model can be created which easily distinguishes one species from another.

import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset('iris')

# Pairwise scatterplots, without regression lines
sns.pairplot(df, kind="scatter")
plt.show()

Its output is as follows −
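The pairplot gives a visual impression; the correlation coefficients themselves can be computed with DataFrame.corr(). A minimal sketch on the same iris data set, dropping the non-numeric species column first:

import seaborn as sns

df = sns.load_dataset('iris')

# Pairwise Pearson correlation coefficients of the four measurements;
# values near +1 or -1 indicate a strong linear relationship
print(df.drop(columns='species').corr())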
The Chi-square test is a statistical method to determine whether two categorical variables have a significant association between them. Both variables should be from the same population and should be categorical, like Yes/No, Male/Female or Red/Green. For example, we can build a data set with observations on people's ice-cream buying pattern and try to relate the gender of a person with the flavour of ice-cream they prefer. If an association is found, we can plan an appropriate stock of flavours according to the number of people of each gender visiting. The below example uses scipy and matplotlib to plot the chi-square probability density function for several degrees of freedom.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots(1, 1)
linestyles = [':', '--', '-.', '-']
deg_of_freedom = [1, 4, 7, 6]

# Plot the chi-square density for each number of degrees of freedom
for df, ls in zip(deg_of_freedom, linestyles):
    ax.plot(x, stats.chi2.pdf(x, df), linestyle=ls, label='df=%d' % df)

plt.xlim(0, 10)
plt.ylim(0, 0.4)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Chi-Square Distribution')
plt.legend()
plt.show()

Its output is as follows −
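The plot above shows the shape of the chi-square distribution; the test itself can be carried out with scipy's chi2_contingency function on a contingency table of observed counts. A minimal sketch with hypothetical counts for the ice-cream example (rows are genders, columns are flavours):

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 2 genders x 3 flavours
observed = np.array([[30, 10, 15],
                     [20, 25, 10]])

chi2, p, dof, expected = chi2_contingency(observed)
print("chi2 = %.3f, p-value = %.4f, degrees of freedom = %d" % (chi2, p, dof))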
In linear regression, two variables are related through an equation where the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph; a non-linear relationship, where the exponent of a variable is not equal to 1, creates a curve. The function in seaborn for visualizing a linear regression relationship is regplot. The below example shows its use.

import seaborn as sb
from matplotlib import pyplot as plt

df = sb.load_dataset('tips')

# Scatter plot with a fitted regression line
sb.regplot(x="total_bill", y="tip", data=df)
plt.show()

Its output is as follows −
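regplot only draws the fitted line; to obtain the line's parameters numerically, scipy's linregress function can be used on the same two columns. A minimal sketch:

from scipy import stats
import seaborn as sb

df = sb.load_dataset('tips')

# Slope and intercept of tip as a function of total_bill,
# plus the correlation coefficient r
result = stats.linregress(df['total_bill'], df['tip'])
print("slope = %.4f, intercept = %.4f, r = %.4f"
      % (result.slope, result.intercept, result.rvalue))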

