Www.ryangineer.com

D209_Data_Mining_I

/

Performance_Assessment

Ryan L. Buchanan Student ID: 001826691 Masters Data Analytics (12/01/2020) Program Mentor: Dan Estes 385-432-9281 (MST) rbuch49@wgu.edu

JUPYTER FAQ

A1. Proposal of Question: Which customers are at high risk of churn? And, which customer features/variables are most significant to churn?

This question will be answered using the K-Nearest Neighbors algorithm.

A2. Defined Goal:

Stakeholders in the company will benefit by knowing, with some measure of confidence, which customers are at highest risk of churn because this will provide weight for decisions in marketing improved services to customers with these characteristics and past user experiences.

Part II: Method Justification B. Explain the reasons for your chosen classification method from part A1 by doing the following:

1. Explain how the classification method you chose analyzes the selected data set. Include expected outcomes. 2. Summarize one assumption of the chosen classification method. 3. List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

B1. Explanation of Classification Method:

The algorithm stores all available cases & classifies new cases by a "majority vote" of its K-nearest neighbors. KNN will find the most similar data points in the training data. A k number of data points will be chosen by the model. The dominant classes of the closest data points will suggest how a data point of interest should be classified. Test data will then be used to test the generalizability of the models outcomes.

Expected outcomes include our test data points to be classified in accordance with their closest neighbors in hyperspace.

B2. Summary of Method Assumption: Given a specified Euclidean distance, the method assumes that the closest neighbors are similar enough to classified the data point of interest as the same (Grant, p. 1).

B3. Packages or Libraries List:

The packages or libraries I have chosen for Python include:

Pandas Numpy Matplotlib Seaborn Scikit-learn

Pandas, Numpy & Matplotlib are considered standard imports in a data science project, providing methods & statistical packages for reading, scoring & visualizing the data. The Seaborn package provides more descriptive & visually intuitive graphs, matrices & plots. The Scikit-learn packages efficiently implements methods for splitting, fitting, predicting & applying metrics for many machine learning models.

Also, IPython Jupyter notebooks will be used to support this analysis. Python offers very intuitive, simple & versatile programming style & syntax, as well as a large system of mature packages for data science & machine learning. Since, Python is cross-platform, it will work well whether consumers of the analysis are using Windows PCs or a MacBook laptop. It is fast when compared with other possible programming languages like R or MATLAB (Massaron, p. 8).

Also, there is strong support for Python as the most popular data science programming language in popular literature & media (CBTNuggets, p. 1).

C1. Data Preprocessing: Many of the features that will be used are binary (yes/no) variables must be accordingly encoded with dummy variables (1/0).

C2. Dataset Variables: Identify the initial dataset variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.

C3. Steps for Analysis:

The steps used to prepare the dataset will include:

1. Back up my data and the process I am following as a copy to my machine and, since this is a manageable dataset, to GitHub using command line and gitbash.

2. Read the data set into Python using Pandas' read_csv command.

3. Evaluate the data struture to better understand input data using info & describe methods.

4. Naming the dataset as a the variable "churn_df" & subsequent useful slices of the dataframe as "df".

5. Examine potential misspellings, awkward variable naming & missing data.

6. Explore descriptive statistics for outliers that may create or hide statistical significance using histograms & box plots.

7. Where necessary, impute records missing data with meaningful measures of central tendency (mean, median or mode) or simply remove outliers that are several standard deviations above the mean.

8. Remove less meaningful categorical variables from dataset to provide fully numerical dataframe for further analysis.

9. Extract cleaned dataset as "churn_df_prepared.csv" for use in K-Nearest Neighbor model.

Most relevant to our decision making process is the dependent variable of "Churn" which is binary categorical with only two values, "Yes" or "No". "Churn" will be our categorical target variable.

In cleaning the data, we may discover relevance of the continuous predictor variables:

Children Income Outage_sec_perweek Email Contacts Yearly_equip_failure Tenure (the number of months the customer has stayed with the provider) MonthlyCharge Bandwidth_GB_Year

Likewise, we may discover relevance of the categorical predictor variables (all binary categorical with only two values, "Yes" or "No", except where noted) The following will be encoded as dummy variables with 1/0:

Techie: Whether the customer considers themselves technically inclined (based on

customer questionnaire when they signed up for services) (yes, no) Contract: The contract term of the customer (month-to-month, one year, two year) Port_modem: Whether the customer has a portable modem (yes, no) Tablet: Whether the customer owns a tablet such as iPad, Surface, etc. (yes, no) InternetService: Customer's internet service provider (DSL, fiber optic, None) Phone: Whether the customer has a phone service (yes, no) Multiple: Whether the customer has multiple lines (yes, no) OnlineSecurity: Whether the customer has an online security add-on (yes, no) OnlineBackup: Whether the customer has an online backup add-on (yes, no) DeviceProtection: Whether the customer has device protection add-on (yes, no) TechSupport: Whether the customer has a technical support add-on (yes, no) StreamingTV: Whether the customer has streaming TV (yes, no) StreamingMovies: Whether the customer has streaming movies (yes, no)

Finally, discrete ordinal predictor variables from the survey responses from customers regarding various customer service features may be relevant in the decision-making process. In the surveys, customers provided ordinal numerical data by rating 8 customer service factors on a scale of 1 to 8 (1 = most important, 8 = least important):

Item1: Timely response Item2: Timely fixes Item3: Timely replacements Item4: Reliability Item5: Options Item6: Respectful response Item7: Courteous exchange Item8: Evidence of active listening

In [1]: # Standard data science imports

import numpy as np

import pandas as pd

from pandas import Series, DataFrame

# Visualization libraries

import seaborn as sns

import matplotlib.pyplot as plt

%matplotlib inline

# Scikit-learn

import sklearn

from sklearn import datasets

from sklearn import preprocessing

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split

from sklearn import metrics

from sklearn.metrics import classification_report

In [2]: # Change color of Matplotlib font

import matplotlib as mpl

COLOR = 'white'

mpl.rcParams['text.color'] = COLOR

mpl.rcParams['axes.labelcolor'] = COLOR

mpl.rcParams['xtick.color'] = COLOR

mpl.rcParams['ytick.color'] = COLOR

In [3]: # Increase Jupyter display cell-width

from IPython.core.display import display, HTML

display(HTML(".container { width:75% !important; }"))

In [4]: # Ignore Warning Code

import warnings

warnings.filterwarnings('ignore')

In [5]: # Load data set into Pandas dataframe

churn_df = pd.read_csv('data/churn_clean.csv', index_col=0)

In [6]: # Examine the features of the dataset

churn_df.columns

Out[6]: Index(['Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Zip',

'Lat', 'Lng', 'Population', 'Area', 'TimeZone', 'Job', 'Children',

'Age', 'Income', 'Marital', 'Gender', 'Churn', 'Outage_sec_perweek',

'Email', 'Contacts', 'Yearly_equip_failure', 'Techie', 'Contract',

'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple',

'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',

'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 'PaymentMethod',

'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year', 'Item1', 'Item2',

'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'],

dtype='object')

In [7]: # Get an idea of dataset size

churn_df.shape

Out[7]: (10000, 49)

In [8]: # Examine first few records of dataset

churn_df.head()

Out[8]:

Customer_id

Interaction

UID

City State County Zip

Lat

Lng Population ... MonthlyCharge Bandwidth_GB_Year Item1

CaseOrder

1

K409198

aa90260b4141-4a24-

8e36b04ce1f4f77b

e885b299883d4f9fb18e39c75155d990

Point Baker

Prince of AK Wales- 99927 56.25100 -133.37571

Hyder

2

fb76459f-c047-

S120509

4a9d-8af9-

e0f7d4ac2524

f2de8bef964785f41a2959829830fb8a

West Branch

MI Ogemaw 48661 44.32893 -84.24080

344d114c-

3

K191035

3736-4be598f7-

f1784cfa9f6d92ae816197eb175d3c71

Yamhill

OR

Yamhill 97148 45.35589 -123.24657

c72c281e2d35

38 ...

172.455519

10446 ...

242.632554

3735 ...

159.947583

904.536110

5

800.982766

3

2054.706961

4

abfa2b40-

4

D90850

2d43-4994b15a-

dc8a365077241bb5cd5ccd305136b05e

Del Mar

CA

San Diego

92014

32.96687

-117.24798

13863 ...

119.956840

989b8c79e311

2164.579412

4

68a861fd-

5

K662701

0d20-4e51a587-

aabb64a116e83fdc4befc1fbab1663f9 Needville

TX

Fort Bend

77461

29.38012

-95.80673

8a90407ee574

11352 ...

149.948316

271.493436

4

5 rows ? 49 columns

In [9]: # View DataFrame info

churn_

Out[9]:

In [10]: # Provide an initial look at extant dataset

churn_df.head()

Out[10]:

Customer_id

Interaction

UID

City State County Zip

Lat

Lng Population ... MonthlyCharge Bandwidth_GB_Year Item1

CaseOrder

1

K409198

aa90260b4141-4a24-

8e36b04ce1f4f77b

e885b299883d4f9fb18e39c75155d990

Point Baker

Prince of AK Wales- 99927 56.25100 -133.37571

Hyder

38 ...

172.455519

904.536110

5

2

fb76459f-c047-

S120509

4a9d-8af9-

e0f7d4ac2524

f2de8bef964785f41a2959829830fb8a

West Branch

MI Ogemaw 48661 44.32893 -84.24080

10446 ...

242.632554

800.982766

3

344d114c-

3

K191035

3736-4be598f7-

f1784cfa9f6d92ae816197eb175d3c71

Yamhill

OR

Yamhill 97148 45.35589 -123.24657

c72c281e2d35

3735 ...

159.947583

2054.706961

4

abfa2b40-

4

D90850

2d43-4994b15a-

dc8a365077241bb5cd5ccd305136b05e

Del Mar

CA

San Diego

92014

32.96687

-117.24798

13863 ...

119.956840

989b8c79e311

2164.579412

4

68a861fd-

5

K662701

0d20-4e51a587-

aabb64a116e83fdc4befc1fbab1663f9 Needville

TX

Fort Bend

77461

29.38012

-95.80673

8a90407ee574

11352 ...

149.948316

271.493436

4

5 rows ? 49 columns

In [11]: # Get an overview of descriptive statistics

churn_df.describe()

Out[11]:

Zip

Lat

Lng

Population Children

Age

Income Outage_sec_perweek

Email

Contacts ... MonthlyCharge Bandwidth_GB

count 10000.000000 10000.000000 10000.000000 10000.000000 10000.0000 10000.000000 10000.000000

10000.000000 10000.000000 10000.000000 ... 10000.000000

10000.0

mean 49153.319600

38.757567 -90.782536 9756.562400

2.0877

53.078400 39806.926771

10.001848

12.016000

0.994200 ...

172.624816

3392.3

std 27532.196108

5.437389

15.156142 14432.698671

2.1472

20.698882 28199.916702

2.976019

3.025898

0.988466 ...

42.943094

2185.2

min 601.000000

17.966120 -171.688150

0.000000

0.0000

18.000000

348.670000

0.099747

1.000000

0.000000 ...

79.978860

155.5

25% 26292.500000

35.341828 -97.082813

738.000000

0.0000

35.000000 19224.717500

8.018214

10.000000

0.000000 ...

139.979239

1236.4

50% 48869.500000

39.395800 -87.918800 2910.500000

1.0000

53.000000 33170.605000

10.018560

12.000000

1.000000 ...

167.484700

3279.5

75% 71866.500000

42.106908 -80.088745 13168.000000

3.0000

71.000000 53246.170000

11.969485

14.000000

2.000000 ...

200.734725

5586.1

max 99929.000000

70.640660 -65.667850 111850.000000

10.0000

89.000000 258900.700000

21.207230

23.000000

7.000000 ...

290.160419

7158.9

8 rows ? 22 columns

In [12]: # Get data types of features

churn_df.dtypes

Out[12]: Customer_id Interaction UID City State County Zip Lat Lng Population Area TimeZone Job Children Age Income Marital Gender Churn Outage_sec_perweek Email Contacts Yearly_equip_failure Techie Contract Port_modem Tablet InternetService Phone Multiple OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling PaymentMethod Tenure MonthlyCharge Bandwidth_GB_Year Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 dtype: object

object

object

object

object

object

object

int64

float64

float64

int64

object

object

object

int64

int64

float64

object

object

object

float64

int64

int64

int64

object

object

object

object

object

object

object

object

object

object

object

object

object

object

object

float64

float64

float64

int64

int64

int64

int64

int64

int64

int64

int64

In [13]: # Rename last 8 survey columns for better description of variables

churn_df.rename(columns = {'Item1':'TimelyResponse',

'Item2':'Fixes',

'Item3':'Replacements',

'Item4':'Reliability',

'Item5':'Options',

'Item6':'Respectfulness',

'Item7':'Courteous',

'Item8':'Listening'},

inplace=True)

In [14]: # Create histograms of contiuous variables & categorical variables

churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email',

'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge',

'Bandwidth_GB_Year', 'TimelyResponse', 'Courteous']].hist()

plt.savefig('churn_pyplot.jpg')

plt.tight_layout()

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

In [15]: # Create a scatterplot to get an idea of correlations between potentially related variables

sns.scatterplot(x=churn_df['Outage_sec_perweek'], y=churn_df['Churn'], color='blue')

plt.show();

In [16]: # Create a scatterplot to get an idea of correlations between potentially related variables

sns.scatterplot(x=churn_df['Tenure'], y=churn_df['Churn'], color='blue')

plt.show();

In [17]: # Create a scatterplot to get an idea of correlations between potentially related variables

sns.scatterplot(x=churn_df['MonthlyCharge'], y=churn_df['Outage_sec_perweek'], color='blue')

plt.show();

In [18]: # Provide a scatter matrix of numeric variables for high level overview of potential relationships & distributions

churn_numeric = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts','Yearly_equip_failure', 'Tenure',

'MonthlyCharge', 'Bandwidth_GB_Year', 'Replacements',

'Reliability', 'Options', 'Respectfulness', 'Courteous',

'Listening']]

pd.plotting.scatter_matrix(churn_numeric, figsize = [15, 15]);

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

In [19]: # Create individual scatterplot for viewing relationship of key financial featurte against target variable

sns.scatterplot(x = churn_df['MonthlyCharge'], y = churn_df['Churn'], color='red')

plt.show();

In [20]: # Set plot style to ggplot for aesthetics & R style

plt.style.use('ggplot')

# Countplot more useful than scatter_matrix when features of dataset are binary

plt.figure()

sns.countplot(x='Techie', hue='Churn', data=churn_df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes']) plt.show()

In [21]: # Countplot more useful than scatter_matrix when features of dataset are binary

plt.figure()

sns.countplot(x='PaperlessBilling', hue='Churn', data=churn_df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes']) plt.show()

In [22]: # Countplot more useful than scatter_matrix when features of dataset are binary

plt.figure()

sns.countplot(x='InternetService', hue='Churn', data=churn_df, palette='RdBu')

plt.xticks([0,1], ['No', 'Yes']) plt.show()

In [23]: # Create multiple boxplots for continuous & categorical variables

churn_df.boxplot(column=['MonthlyCharge','Bandwidth_GB_Year'])

Out[23]:

In [24]: # Create Seaborn boxplots for continuous & categorical variables

sns.boxplot('MonthlyCharge', data = churn_df)

plt.show()

In [25]: # Create Seaborn boxplots for continuous & categorical variables

sns.boxplot('Bandwidth_GB_Year', data = churn_df)

plt.show()

In [26]: # Create Seaborn boxplots for continuous variables

sns.boxplot('Tenure', data = churn_df)

plt.show()

Anomalies It appears that anomolies have been removed from the supplied dataset, churn_clean.csv. There are no remaining outliers.

In [27]: # Discover missing data points within dataset

data_nulls = churn_df.isnull().sum()

print(data_nulls)

Customer_id

0

Interaction

0

UID

0

City

0

State

0

County

0

Zip

0

Lat

0

Lng

0

Population

0

Area

0

TimeZone

0

Job

0

Children

0

Age

0

Income

0

Marital

0

Gender

0

Churn

0

Outage_sec_perweek

0

Email

0

Contacts

0

Yearly_equip_failure 0

Techie

0

Contract

0

Port_modem

0

Tablet

0

InternetService

0

Phone

0

Multiple

0

OnlineSecurity

0

OnlineBackup

0

DeviceProtection

0

TechSupport

0

StreamingTV

0

StreamingMovies

0

PaperlessBilling

0

PaymentMethod

0

Tenure

0

MonthlyCharge

0

Bandwidth_GB_Year

0

TimelyResponse

0

Fixes

0

Replacements

0

Reliability

0

Options

0

Respectfulness

0

Courteous

0

Listening

0

dtype: int64

In [28]: # Check for missing data & visualize missing values in dataset

# Install appropriate library

!pip install missingno

# Importing the libraries

import missingno as msno

# Visualize missing values as a matrix

msno.matrix(churn_df);

"""(GeeksForGeeks, p. 1)"""

Requirement already satisfied: missingno in c:\users\vreed\anaconda3\lib\site-packages (0.5.0) Requirement already satisfied: seaborn in c:\users\vreed\anaconda3\lib\site-packages (from missingno) (0.10.0)

Requirement already satisfied: scipy in c:\users\vreed\anaconda3\lib\site-packages (from missingno) (1.4.1)

Requirement already satisfied: numpy in c:\users\vreed\anaconda3\lib\site-packages (from missingno) (1.18.1)

Requirement already satisfied: matplotlib in c:\users\vreed\anaconda3\lib\site-packages (from missingno) (3.1.3)

Requirement already satisfied: pandas>=0.22.0 in c:\users\vreed\anaconda3\lib\site-packages (from seaborn->missingno) (1.0.1)

Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\vreed\anaconda3\lib\site-packages (from matplotlib->missin Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\vreed\anaconda3\lib\site-packages (from matplotlib->missingno) (1.1.0)

Requirement already satisfied: cycler>=0.10 in c:\users\vreed\anaconda3\lib\site-packages (from matplotlib->missingno) (0.10.0)

Requirement already satisfied: python-dateutil>=2.1 in c:\users\vreed\anaconda3\lib\site-packages (from matplotlib->missingno) (2.8.1)

Requirement already satisfied: pytz>=2017.2 in c:\users\vreed\anaconda3\lib\site-packages (from pandas>=0.22.0->seaborn->missingno) (2019.3)

Requirement already satisfied: setuptools in c:\users\vreed\anaconda3\lib\site-packages (from kiwisolver>=1.0.1->matplotlib->missingno) (45.2. Requirement already satisfied: six in c:\users\vreed\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib->missingno) (1.14.0)

Out[28]: '(GeeksForGeeks, p. 1)'

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans.

In [29]: # Encode binary categorical variables with dummies

churn_df['DummyGender'] = [1 if v == 'Male' else 0 for v in churn_df['Gender']]

churn_df['DummyChurn'] = [1 if v == 'Yes' else 0 for v in churn_df['Churn']] ### If the customer left (churned) they get a '1'

churn_df['DummyTechie'] = [1 if v == 'Yes' else 0 for v in churn_df['Techie']]

churn_df['DummyContract'] = [1 if v == 'Two Year' else 0 for v in churn_df['Contract']]

churn_df['DummyPort_modem'] = [1 if v == 'Yes' else 0 for v in churn_df['Port_modem']]

churn_df['DummyTablet'] = [1 if v == 'Yes' else 0 for v in churn_df['Tablet']]

churn_df['DummyInternetService'] = [1 if v == 'Fiber Optic' else 0 for v in churn_df['InternetService']]

churn_df['DummyPhone'] = [1 if v == 'Yes' else 0 for v in churn_df['Phone']]

churn_df['DummyMultiple'] = [1 if v == 'Yes' else 0 for v in churn_df['Multiple']]

churn_df['DummyOnlineSecurity'] = [1 if v == 'Yes' else 0 for v in churn_df['OnlineSecurity']]

churn_df['DummyOnlineBackup'] = [1 if v == 'Yes' else 0 for v in churn_df['OnlineBackup']]

churn_df['DummyDeviceProtection'] = [1 if v == 'Yes' else 0 for v in churn_df['DeviceProtection']]

churn_df['DummyTechSupport'] = [1 if v == 'Yes' else 0 for v in churn_df['TechSupport']]

churn_df['DummyStreamingTV'] = [1 if v == 'Yes' else 0 for v in churn_df['StreamingTV']]

churn_df['StreamingMovies'] = [1 if v == 'Yes' else 0 for v in churn_df['StreamingMovies']]

churn_df['DummyPaperlessBilling'] = [1 if v == 'Yes' else 0 for v in churn_df['PaperlessBilling']]

In [30]: # Drop original categorical features from dataframe

churn_df = churn_df.drop(columns=['Gender', 'Churn', 'Techie', 'Contract', 'Port_modem', 'Tablet',

'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',

'OnlineBackup', 'DeviceProtection', 'TechSupport',

'StreamingTV', 'StreamingMovies', 'PaperlessBilling'])

In [31]: churn_df.head()

Out[31]:

Customer_id

Interaction

UID

City State County Zip

Lat

Lng Population ... DummyTablet DummyInternetService Dumm

CaseOrder

1

K409198

aa90260b4141-4a24-

8e36b04ce1f4f77b

e885b299883d4f9fb18e39c75155d990

Point Baker

Prince of AK Wales- 99927 56.25100 -133.37571

Hyder

38 ...

1

1

2

fb76459f-c047-

S120509

4a9d-8af9-

e0f7d4ac2524

f2de8bef964785f41a2959829830fb8a

West Branch

MI Ogemaw 48661 44.32893 -84.24080

10446 ...

1

1

344d114c-

3

K191035

3736-4be598f7-

f1784cfa9f6d92ae816197eb175d3c71

Yamhill

OR

Yamhill 97148 45.35589 -123.24657

3735 ...

0

0

c72c281e2d35

abfa2b40-

4

D90850

2d43-4994b15a-

dc8a365077241bb5cd5ccd305136b05e

Del Mar

CA

San Diego

92014

32.96687

-117.24798

13863 ...

0

0

989b8c79e311

68a861fd-

5

K662701

0d20-4e51a587-

aabb64a116e83fdc4befc1fbab1663f9 Needville

TX

Fort Bend

77461

29.38012

-95.80673

11352 ...

0

1

8a90407ee574

5 rows ? 48 columns

In [32]: # Remove less meaningful categorical variables from dataset to provide fully numerical dataframe for further analysis

churn_df = churn_df.drop(columns=['Customer_id', 'Interaction', 'UID',

'City', 'State', 'County', 'Zip', 'Lat', 'Lng',

'Area', 'TimeZone', 'Job', 'Marital', 'PaymentMethod'])

churn_df.head()

Out[32]:

Population Children Age Income Outage_sec_perweek Email Contacts Yearly_equip_failure Tenure MonthlyCharge ... DummyTablet DummyInternetService DummyPhone

CaseOrder

1

38

0 68 28561.99

7.978323

10

0

1 6.795513

172.455519 ...

1

1

1

2

10446

1 27 21704.77

11.699080

12

0

1 1.156681

242.632554 ...

1

1

1

3

3735

4 50 9609.57

10.752800

9

0

1 15.754144

159.947583 ...

0

0

1

4

13863

1 48 18925.23

14.913540

15

2

0 17.087227

119.956840 ...

0

0

1

5

11352

0 83 40074.19

8.147417

16

2

1 1.670972

149.948316 ...

0

1

0

5 rows ? 34 columns

In [33]: # Move DummyChurn to end of dataset to set as target

churn_df = churn_df[['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts',

'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',

'TimelyResponse', 'Fixes', 'Replacements',

'Reliability', 'Options', 'Respectfulness', 'Courteous', 'Listening',

'DummyGender', 'DummyTechie', 'DummyContract',

'DummyPort_modem', 'DummyTablet', 'DummyInternetService', 'DummyPhone',

'DummyMultiple', 'DummyOnlineSecurity', 'DummyOnlineBackup',

'DummyDeviceProtection', 'DummyTechSupport', 'DummyStreamingTV',

'DummyPaperlessBilling', 'DummyChurn',]]

churn_df.head()

Out[33]:

Children Age Income Outage_sec_perweek Email Contacts Yearly_equip_failure Tenure MonthlyCharge Bandwidth_GB_Year ... DummyInternetService DummyPhone Dum

CaseOrder

1

0 68 28561.99

7.978323 10

0

1 6.795513

172.455519

904.536110 ...

1

1

2

1 27 21704.77

11.699080 12

0

1 1.156681

242.632554

800.982766 ...

1

1

3

4 50 9609.57

10.752800

9

0

1 15.754144

159.947583

2054.706961 ...

0

1

4

1 48 18925.23

14.913540 15

2

0 17.087227

119.956840

2164.579412 ...

0

1

5

0 83 40074.19

8.147417 16

2

1 1.670972

149.948316

271.493436 ...

1

0

5 rows ? 33 columns

In [34]: # List features for analysis

features = (list(churn_df.columns[:-1]))

print('Features for analysis include: \n', features)

Features for analysis include:

['Children', 'Age', 'Income', 'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Yea

C4. Cleaned Dataset: Cleaned data set is extracted as "churn_prepared.csv."

In [35]: # Extract Clean dataset

churn_df.to_csv('data/churn_prepared.csv')

Part IV: Analysis

In [36]: # Re-read fully numerical prepared dataset

churn_df = pd.read_csv('data/churn_prepared.csv')

# Set predictor features & target variable

X = churn_df.drop('DummyChurn', axis=1).values

y = churn_df['DummyChurn'].values

In [37]: # Import model, splitting method & metrics from sklearn

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score

from sklearn.model_selection import cross_val_score, train_test_split

D1. Splitting the Data

In [38]: # Set seed for reproducibility

SEED = 1

# Create training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = SEED)

In [39]: # Instantiate KNN model

knn = KNeighborsClassifier(n_neighbors = 7)

# Fit data to KNN model

knn.fit(X_train, y_train)

# Predict outcomes from test set

y_pred = knn.predict(X_test)

D2. Output & Intermediate Calculations

In [40]: # Print initial accuracy score of KNN model

print('Initial accuracy score KNN model: ', accuracy_score(y_test, y_pred))

Initial accuracy score KNN model: 0.7145

In [41]: # Compute classification metrics

print(classification_report(y_test, y_pred))

0 1

accuracy

macro avg weighted avg

precision

0.78 0.49

0.63 0.70

recall f1-score support

0.83 0.40

0.81 0.44

1442

558

0.62 0.71

0.71 0.62 0.71

2000

2000

2000

D3. Code Execution

In [42]: # Create pipeline object & scale dataframe

from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline

from sklearn.metrics import accuracy_score

# Set steps for pipeline object

steps = [('scaler', StandardScaler()),

('knn', KNeighborsClassifier())]

# Instantiate pipeline

pipeline = Pipeline(steps)

# Split dataframe

X_train_scaled, X_test_scaled, y_train_scaled, y_test_scaled = train_test_split(X, y, test_size = 0.2, random_state = SEED)

# Scale dateframe with pipeline object

knn_scaled = pipeline.fit(X_train_scaled, y_train_scaled)

# Predict from scaled dataframe

y_pred_scaled = pipeline.predict(X_test_scaled)

In [43]: # Print new accuracy score of scaled KNN model

print('New accuracy score of scaled KNN model: {:0.3f}'.format(accuracy_score(y_test_scaled, y_pred_scaled)))

New accuracy score of scaled KNN model: 0.790

In [44]: # Compute classification metrics after scaling

print(classification_report(y_test_scaled, y_pred_scaled))

0 1

accuracy

macro avg weighted avg

precision

0.84 0.64

0.74 0.78

recall f1-score support

0.88 0.56

0.86 0.60

1442

558

0.72 0.79

0.79 0.73 0.79

2000

2000

2000

In [45]: # Import sklearn confusion_matrix & generate results

from sklearn.metrics import confusion_matrix

cf_matrix = confusion_matrix(y_test, y_pred)

print(cf_matrix)

[[1204 238]

[ 333 225]]

In [46]: # Create a visually more intuitive confusion matrix

"""(Dennis, pg. 1)"""

group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']

group_counts = ["{0:0.0f}".format(value) for value in

cf_matrix.flatten()]

group_percentages = ["{0:.2%}".format(value) for value in

cf_matrix.flatten()/np.sum(cf_matrix)]

labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in

zip(group_names,group_counts,group_percentages)]

labels = np.asarray(labels).reshape(2,2)

sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Out[46]:

E1. Accuracy & AUC

Model Comparison

It appears that scaling improved model performance from an Accuracy of 0.71 to 0.79 & Precision of 0.78 to 0.84. The area under the curve is a decent score at 0.7959.

In [47]: # Import GridSearchCV for cross validation of model

from sklearn.model_selection import GridSearchCV

# Set up parameters grid

param_grid = {'n_neighbors': np.arange(1, 50)}

# Re-intantiate KNN for cross validation

knn = KNeighborsClassifier()

# Instantiate GridSearch cross validation

knn_cv = GridSearchCV(knn , param_grid, cv=5)

# Fit model to

knn_cv.fit(X_train, y_train)

# Print best parameters

print('Best parameters for this KNN model: {}'.format(knn_cv.best_params_))

Best parameters for this KNN model: {'n_neighbors': 6}

In [48]: # Generate model best score

print('Best score for this KNN model: {:.3f}'.format(knn_cv.best_score_))

Best score for this KNN model: 0.735

In [49]: # Import ROC AUC metrics for explaining the area under the curve

from sklearn.metrics import roc_auc_score

# Fit it to the data

knn_cv.fit(X, y)

# Compute predicted probabilities: y_pred_prob

y_pred_prob = knn_cv.predict_proba(X_test)[:,1]

# Compute and print AUC score

print("The Area under curve (AUC) on validation dataset is: {:.4f}".format(roc_auc_score(y_test, y_pred_prob)))

The Area under curve (AUC) on validation dataset is: 0.7959

In [50]: # Compute cross-validated AUC scores: cv_auc

cv_auc = cross_val_score(knn_cv, X, y, cv=5, scoring='roc_auc')

# Print list of AUC scores

print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))

AUC scores computed using 5-fold cross-validation: [0.68120909 0.17406045 0.96370684 0.96560711 0.58834745]

E2. Results & Implications

E3. Limitation

"When using the k-nearest neighbors algorithm you have the ability to change k, potentially yielding dramatically different results. You choose the value of k by trying different values and testing the prediction capabilities of the model. This means you must develop, validate, and test several models" (Grant, pg. 1).

What this means to our analysis here is that the relatively arbitrary choice of k = 7 nearest neighbors might yield dramatically different results if we chose a different k number of neighbors. As discovered in our cross validation grid search, perhaps it should be the 6 nearest neighbors.

Also, it appears to be memory intensive & computationally expensive. Therefore, simply, it takes a long time to compute.

E4. Course of Action

It is critical that decision-makers & marketers understand that our predictor variables create a relatively low accuracy score with the results of an 0.84 after scaling. We should analyse the features that are in common among those leaving the company & attempt to reduce their likelihood of occuring with any given customer in the future. This suggests that as a customer subscribes to more services that the company provided, an additional port modem or online backup for example, they are less likely to leave the company. Clearly, it is the best interest of retaining customers to provide them with more services & improve their experience with the company by helping customers understand all the services that are available to them as a subscriber, not simple mobile phone service.

F. Video



G. Sources for Third-Party Code

GeeksForGeeks. (2019, July 4). Python | Visualize missing values (NaN) values using Missingno Library. GeeksForGeeks.

Dennis, T. (2019, July 25). Confusion Matrix Visualization. Medium.

H. Sources

CBTNuggets. (2018, September 20). Why Data Scientists Love Python. CBTNuggets.

Grant, P. (2019, July 21). Introducing k-Nearest Neighbors. TowardDataScience.

Massaron, L. & Boschetti, A. (2016). Regression Analysis with Python. Packt Publishing.

In [ ]: !wget -nc

from colab_pdf import colab_pdf

colab_pdf('D209 Data Mining 1 - NVM2 - Classification Analysis.ipynb')

This website does not host notebooks, it only renders notebooks available on other websites.

Delivered by Fastly,

Rendered by OVHcloud nbviewer GitHub repository.

nbviewer version:

90c61cc nbconvert version: 5.6.1 Rendered

a few seconds ago

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download