


PREDICTING CUSTOMER SATISFACTION USING DATA MINING AND MACHINE LEARNING TECHNIQUES

TABLE OF CONTENTS
Abstract
Introduction
Related Work
Data
Data Cleaning and Transformation
Research Objectives and Key Findings
Methodology and Data Modelling
Logistic Regression and Classification
Random Forest Classification
K-Means Clustering
Results and Conclusion
References
Appendix

I. ABSTRACT
Airline customer surveys contain many parameters which, if analysed correctly, can offer insights that help aviation companies improve and provide better, more dedicated services. The objective of this project is to identify the differences in results obtained from different machine learning algorithms and to gauge their accuracies. In this study we analyse a customer satisfaction survey collected from the IBM Watson Analytics community and build a customer model from this data. The study measures passenger satisfaction based on the responses of 129,889 passengers who flew within the United States on different airlines; the survey was fielded over the first three months of 2014. We apply multivariate logistic regression to model customers as satisfied or unsatisfied. This analysis lays the basis for a model that airline carriers can implement to predict, from a handful of significant features, whether a customer is content with their services. In the later parts of the study, we perform dimensionality reduction followed by cluster-based modelling.

II. INTRODUCTION
Data mining comprises a variety of techniques that can uncover meaningful patterns in tremendous amounts of data. Using these patterns and insights, data models can be constructed to expedite business processes such as marketing strategy development and decision making. Airline carriers, operating in a highly competitive market, can benefit greatly from such models and can refine strategies that help them maintain the highest level of satisfaction among their customers. In this study we discuss which factors are significant when customers rate the services provided by different airlines, and we construct our data model on those factors. Finally, in the results section we discuss the empirical analysis conducted on the survey responses and draw conclusions based on our assumptions and methodology.

The main hypothesis of this study is that we can build models that enable airline marketers to predict whether a survey respondent will be satisfied with the airline's services.
It will also help the airline industry analyse historical data it has already collected, such as customer surveys on the same attributes. One interesting finding was that although price appears to be a statistically significant factor for the overall satisfaction of the customer, RFE (recursive feature elimination) showed that it is not an important feature. This suggests airline marketers are better off building a brand than focusing on providing cheaper services.

III. RELATED WORK
Many data mining techniques have been used effectively by airlines to develop strategies based on customer satisfaction surveys. Interestingly, no prior work has been done on the dataset we are working with, but there are studies broadly similar to the one implemented in this project. In a study by Liou and Tzeng [1], the Taiwanese airline market was evaluated through a multiple-choice survey; they concluded that safety and price are the two paramount factors affecting customer satisfaction in the airline market. Another study on airline satisfaction, by Tamilla Curtis, Dawna L. Rhoades and Blaise P. Waguespack Jr. [2], investigated the satisfaction levels of frequent versus infrequent flyers using ANOVA. Their results indicate that satisfaction with overall airline quality and other important parameters decreases as passengers fly more; their work leant on statistical procedures to analyse different research objectives. A further line of work built clusters for CRM strategies by mining airline customer data, implementing business strategy with inputs from sales, marketing and customer services. It evaluated three clustering algorithms, namely k-means, SOM and HSOM, on responses from 20,000 customers; k-means performed best, with SOM giving broadly similar results.

Our study differs from these in that we classify customers based on different parameters, construct several models and compare them. Our results show that classifying passengers as satisfied or unsatisfied is done best by a Random Forest classifier, with significantly higher accuracy and precision than logistic regression. In addition, k-means clustering failed to give us any business insight on the dataset we worked with. We also referred to work on increasing airline customer satisfaction by Kathryn Bryant [3], in which she scraped data from the Skytrax website and suggested models for increasing satisfaction by targeting particular parameters. Another data mining study on airline data, by Ibrahim Yakut, Tugba Turkoglu and Fikriye Yakut [4], set out to dig deeper into customer satisfaction in the airline industry; it applied Principal Component Analysis to the data and then performed k-means clustering to surface business insights.

IV. DATA
The airlines dataset has:
Number of responses: 129,889
Number of attributes: 30

Attribute descriptions:
Satisfaction – measures how satisfied a respondent is, as a rating from 1 to 5.
We assume that 5 denotes the highest level of satisfaction and 1 the lowest. This column is later converted to a binary variable, with 0 and 1 depicting unsatisfied and satisfied respectively.
Airline Status – the type of airline status held by each customer: platinum, gold, silver or blue. This column is later expanded into dummy columns to represent the categorical data.
Age – a continuous measure of the customer's age, ranging from 15 to 85 years.
Age Range – the customer's age bucketed into 10-year intervals.
Gender – male or female, converted to a binary column with 0 representing female and 1 representing male.
Price Sensitivity – the degree to which the customer's purchasing behaviour is affected by the ticket price, ranging from 0 to 5. It is ordinal; we again assume that 5 means highly price sensitive and 0 the lowest level.
Year of First Flight – the year of each customer's first flight.
No of Flights p.a. – the count of flights each customer takes in a year, ranging from 0 to 100.
No of Flights p.a. grouped – the above column grouped into bins of 10.
% of Flight with other Airlines – if we were Southeast Airlines, we would want to know how often a customer flies with other airlines.
No. of other Loyalty Cards – the number of loyalty or membership cards the customer holds at retail establishments for benefits such as discounts.
Type of Travel – the traveller's purpose of travel: business travel, mileage tickets or personal travel. This column is later expanded into dummy columns.
Shopping Amount at Airport – the amount of products purchased by the customer at the airport, ranging from 0 to 875.
Eating and Drinking at Airport – the amount each customer consumes in eating and drinking at the airport, ranging from 0 to 895.
Class – the kind of cabin the customer travelled in: business, economy plus or economy. This column is later expanded into dummy columns.
Day of Month – the day of the month on which the customer travelled.
Flight date – the date of the customer's flight; all flights are from January, February and March of 2014.
Airline Code – the designated code for each of the 14 airlines.
Airline Name – the name of the airline flown, e.g. West Airways, Southeast Airlines Co.
Origin City – the city the customer boarded the flight from; there are 295 unique cities.
Origin State – the state from which the flight departed; there are 15 unique states in the data.
Destination City – the city the passenger travelled to; there are 296 unique values.
Destination State – the state the passenger travelled to.
Scheduled Departure Hour – the scheduled departure hour of the flight, from hour 1 to hour 23.
Departure Delay in Minutes – the departure delay in minutes per passenger relative to schedule, ranging from 0 to 1128 minutes.
Arrival Delay in Minutes – the arrival delay of the flight in minutes per passenger, ranging from 0 to 1115 minutes.
Flight Cancelled – whether the airline did not operate the flight at all, for whatever reason. Converted to a binary column, with 0 for no cancellation and 1 for a cancelled flight.
Flight time in minutes – the time in minutes it took to reach the destination.
Flight Distance – the distance between the two places, ranging from 31 to 4983.
Arrival Delay greater 5 Mins – whether the arrival was delayed by more than 5 minutes for the passenger. Converted to a binary column, with 0 for no delay greater than 5 minutes and 1 otherwise.

Data source: the data was collected from the IBM Watson Analytics community blog page.

V. DATA CLEANING AND DATA TRANSFORMATION
Preliminary analysis showed that the data required pre-processing and transformation before modelling:

Removing incorrect data entries – after visualising the satisfaction variable, we found that 9 responses were either incorrect or stored as floats. Since the majority of entries were integers, we removed these 9 values; from here on the dataset has 129,880 records.

Filling in missing values – instead of filling NaN values with the attribute mean, we used a learning algorithm to predict them. We treated each attribute with missing values as the dependent variable and ran a random forest regressor to predict and fill the gaps (a minimal sketch follows this section). In Fig. 1, the yellow lines next to the attributes indicate the missing values.
Fig1- Variables with NA values Fig2- Handling missing values

Data transformation:
We created contrasts (dummy columns) for the categorical variables Airline Status, Type of Travel, Class and Airline Code.
We rounded the number of flights per annum to the nearest integer, since this value should be a whole number.
The Satisfaction variable was converted from ordinal to binary: passengers giving ratings of 3 and above are labelled satisfied (1), the rest unsatisfied (0).
Arrival Delay greater 5 Mins was converted from Yes/No to 1/0.
Flight Cancelled was converted from Yes/No to 1/0.
Airline Code was converted from nominal data into contrasts.
We dropped the columns derived from other columns in our dataset in order to reduce multicollinearity: Age Range, Year of First Flight, No of Flights p.a. grouped and Airline Name.
We also dropped columns with little business implication that were irrelevant to the model: Flight date, Origin City, Origin State, Destination City and Destination State.
For dimensionality reduction we used PCA after standardising the dataset.

Data exploration was then carried out to test various research objectives through visualisation and to see the patterns in our data.
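As an illustration, here is a minimal sketch of the imputation step, assuming the cleaned dataframe Data from the appendix; the same loop is repeated for each of the four columns with missing values, as in the appendix code.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Columns with missing values, per the heatmap in Fig1
na_cols = ['No of Flights p.a.', 'Departure Delay in Minutes',
           'Arrival Delay in Minutes', 'Flight time in minutes']

target = na_cols[0]                            # impute one gappy column at a time
features = [c for c in Data.columns if c not in na_cols]

known = Data[Data[target].notnull()]           # rows that can train the regressor
unknown = Data[Data[target].isnull()]          # rows whose value we predict

rf = RandomForestRegressor(n_estimators=10)
rf.fit(known[features], known[target])         # the gappy column acts as the label
Data.loc[unknown.index, target] = rf.predict(unknown[features])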
VI. RESEARCH OBJECTIVES AND KEY FINDINGS

Research Objective 1:
Hypothesis – the count of satisfied customers is significantly lower for higher age ranges.
Conclusion: As the figure below shows, the count of satisfied customers is lower as the age range increases. This can be attributed to requirements increasing with age. Furthermore, the odds ratios obtained from the logistic regression in MODEL2 support this hypothesis. Another interesting insight from the Recursive Feature Elimination method is that age is not considered an important attribute, even though it is statistically significant according to the MODEL2 statistical summary report.
Fig3- Age Range vs Satisfaction

Research Objective 2:
Hypothesis – passengers travelling during the day (7 am onwards) are much more satisfied.
Conclusion: The satisfaction count differs by hour, but interestingly the number of satisfied respondents is higher in the later part of the day; passengers apparently dislike travelling at odd hours (early morning, before 7). This is validated in the logistic modelling conclusion, where the odds of being satisfied compared to unsatisfied rise with each later departure hour.
Fig4- Scheduled Departure Hour vs Satisfaction

Research Objective 3:
Hypothesis – the number of unsatisfied customers is significantly higher for high price sensitivity ratings.
Conclusion: The graph shows that for highly price sensitive customers (rating greater than 2) the rate of dissatisfaction increases. However, after performing Recursive Feature Elimination (to get the 10 most important variables), price sensitivity was not an important factor impacting respondent satisfaction. This contrasts with the logistic regression model, where feature selection based on significance values kept price sensitivity as a significant parameter. These were surprising disagreements between the feature-based modelling techniques.
Fig5- Price Sensitivity vs Satisfaction

Research Objective 4:
Hypothesis – males are more satisfied than females.
Fig6- Gender vs Satisfaction
Conclusion: The graph shows a significantly higher count of satisfied male passengers than female. This is further supported by the summary outputs of MODEL1, MODEL2 and MODEL3, in which gender is a significant and important parameter for measuring satisfaction among the airline passengers in our data. Moreover, the odds ratio supports this hypothesis: the odds of males being satisfied are higher.

Research Objective 5:
Hypothesis – people who are satisfied have a higher shopping amount.
Fig7- Shopping Amount at Airport vs Satisfaction
Conclusion: The graph shows an increasing trend between shopping amount and satisfaction. This is further supported by the summary outputs of MODEL1, MODEL2 and MODEL3, where shopping at the airport is a significant parameter for measuring satisfaction in our data. Moreover, the odds ratio supports this hypothesis: with a one unit increase in shopping amount, the probability of being satisfied increases.

The countplot-style EDA behind these objectives is sketched below.
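A minimal sketch of that EDA, assuming the dataframe Data from the appendix (run before derived columns such as Age Range are dropped):

import seaborn as sns
import matplotlib.pyplot as plt

# One satisfied-vs-unsatisfied count plot per research objective variable
for col in ['Age Range', 'Scheduled Departure Hour', 'Price Sensitivity', 'Gender']:
    plt.figure(figsize=(10, 4))
    sns.countplot(x=col, data=Data, hue='Satisfaction')
    plt.title(f'{col} vs Satisfaction')
    plt.show()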
VII. METHODOLOGY: DATA ANALYSIS AND MODELLING
Fig8. Methodology Deployed for MODEL1, MODEL2 & MODEL3

VIII. LOGISTIC REGRESSION ANALYSIS
Since the class variable Satisfaction is dichotomous in nature, we use multivariate logistic regression as the predictive analysis technique. It is a special case of the linear model in which the class variable is categorical, and it gives us the log odds of a person being satisfied via the logit function. In linear regression the output is simply the weighted sum of the inputs; logistic regression instead passes that weighted sum through a function that maps any real value into the interval between 0 and 1. We therefore do not use linear regression, whose output can exceed 1, but multivariate logistic regression, whose output lies between 0 and 1.

We carried out data modelling and compared different models based on certain feature selection criteria. Each model aims to improve on the previous one, and insights for every model are provided alongside it.

MODEL1: In the first model we include all the parameters. In subsequent iterations we eliminate insignificant features based on their p-values. This gives us a baseline against which to compare the upcoming models. The LHS of the model equation gives the logit of the odds of a customer being satisfied, and the RHS contains the coefficients together with the features. This model gives a function value of 0.642, meaning that with the features mentioned, the logit of the odds of a customer in the test set being satisfied is 0.642.
Fig9: Logistic model summary output

When the logistic regression model is fitted with all the parameters, we see many NaN values in the summary. This is due to multicollinearity among the independent variables, so we eliminate those variables from the model to remove the dependence. The following variables were removed due to multicollinearity or low significance: Type of Travel_Business travel, Type of Travel_Mileage travel, Type of Travel_Personal travel, Airline Code_AA, Airline Code_AS, Airline Code_B6, Airline Code_DL, Airline Code_EV, Airline Code_F9, Airline Code_FL, Airline Code_HA, Airline Code_MQ, Airline Code_OO, Airline Code_OU, Airline Code_US, Airline Code_VX, Airline Code_WN, Eating and Drinking at Airport, Day of Month, Departure Delay in Minutes, Arrival Delay in Minutes, Flight time in minutes, Flight Distance, Airline Status_Blue, Airline Status_Gold, Airline Status_Platinum, Airline Status_Silver, Class_Business, Class_Eco, Class_Eco Plus.
Fig10. Model 1 – eliminating all features with multicollinearity

Eliminating these variables gives the above summary, in which 'No. of other Loyalty Cards' now becomes insignificant (p-value > 0.05). Removing it as well leaves us with the following features: Gender, Price Sensitivity, % of Flight with other Airlines, Shopping Amount at Airport, Scheduled Departure Hour, Flight Cancelled, Arrival Delay greater than 5 mins.

The multivariate logistic regression model with these features has the summary and interpretation below, together with a classification report and a confusion matrix explaining which performance evaluation measures we consider when suggesting this model to airline carriers. The AIC value for this model is 133489; for a good model this value should be low. All other models will therefore be evaluated on these two measures:
Precision – the percentage of positive predictions that are relevant.
AIC – an in-sample fit estimate of how well a model will predict future values, measuring the trade-off between model fit and complexity; a good model has a lower AIC.

A minimal fitting sketch follows the figure captions.
Fig11 - Model 1 - Final Output
Fig12. Classification Report
Fig13. Confusion Matrix
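A minimal sketch of the MODEL1 fit, assuming the imputed dataframe Data_updated4 and the feature list that survived the p-value elimination in the appendix:

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

features3 = ['Age', 'Gender', 'Price Sensitivity', 'No of Flights p.a.',
             '% of Flight with other Airlines', 'Shopping Amount at Airport',
             'Scheduled Departure Hour', 'Flight cancelled',
             'Arrival Delay greater 5 Mins']

# 80% training and 20% test, as in the appendix
X_train, X_test, y_train, y_test = train_test_split(
    Data_updated4[features3], Data_updated4['Satisfaction'],
    test_size=0.2, random_state=1)

result = sm.Logit(y_train, X_train).fit()   # coefficient table with p-values
print(result.summary2())                    # Fig11-style output
print('AIC:', result.aic)                   # the 133489 figure reported above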
Model1 performance evaluation is based on the classification report and the confusion matrix. Interpreting these results:
Out of the 12690 actual unsatisfied customers, the classifier predicted 7634 correctly, a recall of 0.60.
Out of the 13286 actual satisfied customers, the classifier predicted 9185 correctly, a recall of 0.69.
Out of all 25976 customers, the classifier predicted 16819 correctly, an accuracy of 0.65, i.e. 65%.

There are several parameters for evaluating how the model performs, but we take precision as our primary performance measure. Of the 14241 customers predicted as satisfied, the classifier identifies 9185 correctly (actually satisfied), a precision of 64%, which is reasonably good; airline carriers could use this model to gauge their standing in the industry. Similarly, of the 11735 customers predicted as unsatisfied, the classifier identifies 65% correctly. Carriers can use these metrics to adjust their decision-making strategies, targeting the customers who were not satisfied with them and thereby improving their performance in the industry.

MODEL2: We take MODEL1 as the baseline and try to improve on it in subsequent models. In this model we standardise the features that were significant in MODEL1 (see the sketch at the end of this subsection). This helps because the columns span very different ranges of values. Standardising the predictors is also an easy way to reduce the multicollinearity problems caused by higher-order terms; without it, we risk missing statistically significant results or producing misleading ones. Although the accuracy is the same, the AIC is lower than the baseline's, and interestingly the pseudo R-squared is also noticeably better, so MODEL2 improves on the baseline MODEL1.
Fig14. MODEL2 Final summary

From the MODEL2 summary we can see that standardising the predictors has improved on the baseline model: the AIC score has dropped from 133489 to 130358. The classification report and confusion matrix for this model follow, along with the performance evaluation.
Fig15. Classification Report MODEL2
Fig16. Confusion Matrix MODEL2

Model2 performance evaluation is based on the classification report and the confusion matrix. Interpreting these results:
Out of the 12690 actual unsatisfied customers, the classifier predicted 7631 correctly, a recall of 0.60.
Out of the 13286 actual satisfied customers, the classifier predicted 9188 correctly, a recall of 0.69.
Out of all 25976 customers, the classifier predicted 16819 correctly, an accuracy of 0.65, i.e. 65%.
The classification report and confusion matrix are almost identical for MODEL1 and MODEL2.
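A minimal sketch of the MODEL2 standardisation step, assuming Data_updated4 and the features3 list from the previous sketch; the standardised matrix then feeds the same train/test split and Logit fit as MODEL1:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Rescale every predictor to mean 0 and unit variance
X_std = pd.DataFrame(StandardScaler().fit_transform(Data_updated4[features3]),
                     columns=features3)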
Given MODEL2's lower AIC, we move forward with this improved model. The final MODEL2 obtained after eliminating parameters is given below along with the model equation:

ln(p/(1-p)) = 0.065 - 0.3565*Age + 0.2819*Gender - 0.02128*Price Sensitivity - 0.4626*No. of Flights p.a. - 0.0275*% of Flights with other Airlines + 0.021*Shopping Amount at Airport + 0.0641*Scheduled Departure Hour - 0.1137*Flight Cancelled - 0.3256*Arrival Delay greater than 5 mins

MODEL3: To compare the results under one more feature selection technique, we used RFE (Recursive Feature Elimination), again with standardised features. RFE fits the model and removes the weakest features until the specified number remains (10 in our case): the features are ranked by importance and eliminated recursively in a loop. This differs from MODEL2, where features were selected purely on statistical significance (p-value < 0.05). RFE is iterative and computationally expensive, but superior to the feature selection followed for MODEL1 and MODEL2. RFE also reports support for each feature, with TRUE meaning the feature is relevant. The performance measures were somewhat inferior to MODEL2's, but an improvement over MODEL1 (the baseline). RFE with a logistic regression classifier gives the following summary, with 3 features eliminated from the model due to multicollinearity.
Fig17. MODEL3 Summary
Fig18. Classification Report MODEL3
Fig19. Confusion Matrix MODEL3

MODEL3 performance evaluation is based on the classification report and the confusion matrix. Interpreting these results:
Out of the 12690 actual unsatisfied customers, the classifier predicted 8318 correctly, a recall of 0.66.
Out of the 13286 actual satisfied customers, the classifier predicted 8272 correctly, a recall of 0.62, lower than MODEL2's.
Out of all 25976 customers, the classifier predicted 16590 correctly, an accuracy of 0.64, i.e. 64%.

MODEL3 gives an overall precision of 64%, inferior to MODEL2. In addition, MODEL3 interestingly has a higher AIC score than MODEL2, signifying that the model has not improved on MODEL2. The lower precision may come down to pairing a logistic regression classifier with the RFE technique.

K-fold cross validation on MODEL2: performing 10-fold cross validation on MODEL2 gives a mean cross validation score of 0.65. Cross validation checks the performance of the model on data it has never seen. The pipeline used in the cross validation pre-processes the data by scaling the feature values to zero mean and unit variance. It reduces the problem of overfitting and is an effective way to measure a model's performance. To get a good measure of the model's accuracy, we take the mean of all the fold scores; hence our model identifies 65% of the data points correctly.

Conclusion of the logistic models: of MODEL1, MODEL2 and MODEL3, the best model is MODEL2, which has better precision and a lower AIC score than the others. For further interpretation of the logistic model we calculate the odds ratios, obtained as sketched below.
Fig: Odds ratio
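The odds ratios come from exponentiating the fitted coefficients; a one-line sketch, assuming result is the fitted statsmodels object for MODEL2 (as in the appendix):

import numpy as np
import pandas as pd

# exp(coefficient) turns a log-odds effect into a multiplicative
# odds ratio per unit increase in that predictor
model2_odds = pd.DataFrame(np.exp(result.params), columns=['OR'])
print(model2_odds)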
When interpreting odds ratios, any value greater than 1 indicates an increase in the odds, i.e. an increase in the likelihood of that group being in the outcome class, and any value less than 1 indicates a decrease in the odds, i.e. a decrease in the likelihood. For every unit increase in age, the odds of being satisfied decrease by a factor of 0.7; for every unit increase in shopping amount at the airport, the odds of being satisfied increase by a factor of 1.02. The odds of a male being satisfied are 1.32 times those of a female, which is what we examined in Research Objective 4. Similarly, the odds of being satisfied are higher for people travelling later in the day, in line with what we visualised in Research Objective 2.

Comparing precision-recall curves and ROC curves, we decided the ROC curve is more suitable here because we have a balanced dataset. ROC curves summarise the trade-off between the true positive rate and the false positive rate of a predictive model across different probability thresholds. The area under the ROC curve (AUC) indicates how well the model is classifying; since the value is greater than 0.5, the classifier is doing well. ROC curves are appropriate when the observations are balanced between the classes, whereas precision-recall curves are appropriate for imbalanced datasets; thus we go ahead with the ROC curve.

IX. RANDOM FOREST CLASSIFICATION
MODEL4: Another modelling technique we applied to our data is the Random Forest machine learning algorithm. It is a supervised learning algorithm usable for both regression and classification; since our main objective is to classify customers as satisfied or unsatisfied, we use Random Forest as a classifier. It is a flexible and easy-to-interpret algorithm: it builds decision trees on randomly selected samples of the data, gets a prediction from each tree, and provides us with the best solution. Furthermore, it is a good indicator of feature importance. The algorithm works as follows:
Select random samples from the given dataset.
Construct a decision tree for each sample.
Collect and rank the prediction of each decision tree.
Output the best-supported prediction as the final result.
Unlike a single decision tree, a random forest is not likely to overfit the data and is a robust method, because many decision trees participate. Applied to our dataset, the random forest generates the following output:
Fig20. Classification Report MODEL4
Fig21. Confusion Matrix MODEL4

Random Forest performance evaluation is based on the classification report, the confusion matrix and the accuracy. Interpreting these results:
Out of the 12690 actual unsatisfied customers, the classifier predicted 9245 correctly, a recall of 0.73.
Out of the 13286 actual satisfied customers, the classifier predicted 10702 correctly, a recall of 0.81.
Out of all 25976 customers, the classifier predicted 19947 correctly, an accuracy of 0.767, i.e. 77%.

This model performed better than MODEL2 in every respect: precision jumped from 65% to 77%, and accuracy surged from 65% to 77%. This model, with Random Forest as the classifier, has outperformed every other model. It could be improved further with techniques such as hyperparameter tuning, in which many models are trained with different settings and the one that does best on the validation data is chosen. A minimal sketch of MODEL4 follows.
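A minimal sketch of MODEL4, assuming the imputed dataframe Data_updated4 and the full feature matrix used in the appendix:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

features = [c for c in Data_updated4.columns if c != 'Satisfaction']
X_train, X_test, y_train, y_test = train_test_split(
    Data_updated4[features], Data_updated4['Satisfaction'],
    test_size=0.2, random_state=1)

clf = RandomForestClassifier(n_estimators=10)   # 10 trees, as in the appendix
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(classification_report(y_test, pred))      # Fig20-style precision/recall
print(confusion_matrix(y_test, pred))           # Fig21-style counts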
MODEL5: In this model we used RFE together with the Random Forest classifier, to tune our model for better performance evaluation parameters. As with the logistic regression, the features are selected recursively. The model gave us the following summary:
Fig22. Classification Report MODEL5
Fig23. Confusion Matrix MODEL5

The evaluation of Random Forest with RFE is again based on the classification report, the confusion matrix and the accuracy. Interestingly, the overall performance figures are inferior to running the Random Forest classifier without RFE. This may be because the Random Forest algorithm's own feature selection is superior to RFE's.
Fig24. Random Forest Feature Selection MODEL4

Cross validation of MODEL4: since MODEL4 performed better than MODEL5, we take it forward and apply cross validation to it. The cross validation score for MODEL4, expressed as a mean precision score, is 0.767.
Fig25. Random Forest Cross Validation Results

CONCLUSION: Having built models with two different machine learning algorithms, we see that Random Forest classifies better than logistic regression. The mean precision score of MODEL2 after 10-fold cross validation is 64.6%, whereas for MODEL4 it is 75.7%; in other words, on unseen data the Random Forest classifier would give relevant results 75.7% of the time. One possible reason it outperformed logistic regression is that our data contains many categorical variables. More fundamentally, logistic regression produces a single probability estimate of belonging to a specific class, whereas the random forest builds decision trees on randomly selected data, gets a prediction from each tree, and aggregates them into its final answer.

X. K-MEANS CLUSTERING
To perform k-means clustering on our dataset, we first carry out dimensionality reduction using PCA, i.e. Principal Component Analysis. K-means is an unsupervised machine learning algorithm that groups data points together on the basis of similarity. Its main goal is to minimise the intra-cluster distances while maximising the inter-cluster distances.

PCA: PCA is used to choose the 2 components that explain the most variance in the data. Since our dataset has 30 variables (40 after creating dummies), PCA is a good way to project it from a high-dimensional space to a low-dimensional one for effective data visualisation. Furthermore, PCA also reduces the multicollinearity between the independent variables. Although the explained-variance curve suggests 31 principal components, for the sake of visualisation we proceed with 2 components, which explain 11% of the variance.
Fig26. PCA components vs Explained Variance
We can also inspect the 2 PCA components via a heatmap.
Fig27. Heatmap of PCA components

A sketch of the PCA reduction and the subsequent search for k follows.
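A sketch of the PCA reduction together with the elbow and silhouette scan over k, assuming the standardised feature matrix; names follow the appendix:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_std = StandardScaler().fit_transform(Data_updated4.drop('Satisfaction', axis=1))
pcs = PCA(n_components=2).fit_transform(X_std)   # 2 components for visualisation

inertias, silhouettes = [], []
ks = range(2, 15)
for k in ks:
    km = KMeans(n_clusters=k).fit(pcs)
    inertias.append(km.inertia_)                 # elbow criterion
    silhouettes.append(silhouette_score(pcs, km.labels_))

plt.plot(list(ks), inertias, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method For Optimal k')
plt.show()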
BEST VALUE OF K FOR K-MEANS:
The best value of k in k-means clustering can be found by the elbow method. Reading off the elbow point of the sum-of-squared-distances curve below suggests k = 11. However, for the purpose of tuning our model we also examined other values (k = 3, 4, 5) and plotted their centres, and we conclude that the ideal value is k = 9: with higher k values the cluster centres tend to overlap, and the silhouette score drops after k = 9.
Fig28. Elbow method showing ideal components
Fig29. K-Means clustering comparison for different k values

K-MEANS PERFORMANCE EVALUATION:
We evaluate the performance of k-means by the silhouette score, which measures the consistency of points within their clusters. It ranges from -1 to 1; a higher value indicates that a data point is well matched to its own cluster and poorly matched to the others. The clustering is interesting in the sense that it is not performing great at prediction, which may be because k-means is sensitive to outliers. Since a silhouette score of 0 is bad, our model is performing fairly well. The silhouette score is 0.457 for k=2, 0.51 for k=3, 0.561 for k=4 and 0.593 for k=5; it increases to 0.68 at k=9 and then falls to around 0.63 at k=10. This supports our choice of k=9; so, based on the 2 PCA components explaining 11% of the variance, our data is best clustered into 9 segments, at which point the model performs quite well. This segment analysis is not pursued further here, as it is not our research objective.

Conclusion of k-means clustering:
Fig30. K-Means clustering comparison for different silhouette score values
This yields a quite interesting insight: there are 14 airline carriers, and the 9 clusters may indicate that some of them form alliances.

XI. RESULTS AND CONCLUSION
Airline marketers should benefit from these statistical studies and should focus on the services they provide. Our models make clear that although the obvious parameters of price and airline matter, customers ultimately end up more satisfied if they are given better services. Comparing the different models, the Random Forest classifier performed better than the logistic classifier, and we evaluated the models by tuning different parameters. The best model was chosen and can be used to predict the satisfaction of a new survey respondent with approximately 76% precision. For assigning a data point to a cluster, k-means performed fairly well, given the silhouette scores we obtained across different values of k. This study, which set out to compare different classification models, has produced a model that any airline carrier could adopt; for a real-world business problem it does fairly well, giving relevant results 76% of the time. Finally, k-means clustering gave us the interesting insight that these airlines might be part of alliances.

XII. REFERENCES
References to the scholarly articles cited above are listed below:
[1] J. J. H. Liou and G.-H. Tzeng, "A non-additive model for evaluating airline service quality."
[2] T. Curtis, D. L. Rhoades and B. P. Waguespack Jr. (2012), "Satisfaction with Airline Service Quality: Familiarity Breeds Contempt."
[3] K. Bryant, "Increasing Airline Customer Satisfaction." (works/increasing-airline-customer-satisfaction/)
[4] I. Yakut, T. Turkoglu and F. Yakut, "Understanding Customers' Evaluations Through Mining Airline Reviews."
APPENDIX (SOURCE CODE)import?pandas?as?pd??import?numpy?as?np??import?seaborn?as?sns?#?for?visualisation??from?sklearn.tree?import?DecisionTreeClassifier?#?Import?Decision?Tree?Classifier??from?sklearn.model_selection?import?train_test_split?#?Import?train_test_split?function??from?sklearn?import?metrics?#Import?scikit-learn?metrics?module?for?accuracy?calculation??from?sklearn.linear_model?import?LogisticRegression?#Import?logistic?regression?module??from?sklearn.metrics?import?confusion_matrix,classification_report,accuracy_score,roc_curve,?roc_auc_score??#?for?precision,recall?and?F1?score?calculations??from?sklearn?import?tree??from?sklearn?import?preprocessing???import?matplotlib.pyplot?as?plt?#?for?visualisation??from?sklearn.preprocessing?import?StandardScaler??from?sklearn.decomposition?import?PCA???#?for?PCA??from?sklearn.ensemble?import?RandomForestClassifier??import?datetime??%matplotlib?inline????#Reading?the?data?into?a?Data?dataframe?object??Data=pd.read_excel("F:/Waterloo/Spring/623/project/big?data?final?term?project/Satisfaction-Survey.xlsx/IBM?Satisfaction?Survey.xlsx")????#?Exploratory?data??Data.describe()????Data.Age.hist()??plt.title('Histogram?of?Age')??plt.xlabel('Age')??plt.ylabel('Frequency')??#?From?the?histogram?age?group?from?40-50?is?the?highest????Data['No?of?Flights?p.a.'].hist()??plt.title('Histogram?of?Flights?per?Annum')??plt.xlabel('Flight?per?annum')??plt.ylabel('Count')??#?The?histogram?shows?10-20?range?is?the?highest?flights?taken?in?a?year????#?Removing?incorrect?Satisfaction?values.?From?the?above?graph?satisfaction?rating?is?integer?only??incorrect_data?=?Data['Satisfaction'].astype(str).str.len()?>?1??Data_corrected?=?Data[~incorrect_data]??Data=Data_corrected.drop_duplicates(keep='first')????#?Converting?Satisfaction(class?variable)?to?a?binary?column??#?We?make?the?satifaction?category?as?a?rating?equal?to?and?higher?than?3,?lower?than?3?are?considered?as?unsatisfied??Data['Satisfaction']?=?np.where(Data['Satisfaction'].astype(str).astype(float)?>?3,1,0)????#?Exploratory?graphs??g=sns.countplot(x='Class',data?=?Data,hue='Satisfaction')????g=sns.countplot(x='Price?Sensitivity',data?=?Data,hue?=?'Satisfaction')????fig?=?plt.figure(figsize=(12,6))??g=sns.countplot(x='Airline?Code',data?=?Data,hue='Class')??plt.show()??#?Clearly?Economy?class?has?the?maximum?passengers??plt.savefig("F:/Waterloo/Spring/623/project/Final/Airline_vs_Class")????fig?=?plt.figure(figsize=(12,6))??g=sns.countplot(x='Airline?Code',data?=?Data,hue='Satisfaction')??plt.show()??#?Airline?having?code?WN(Cheapest?Airlines)?has?the?maximum?number?of?Satisfied?and?unsatisfied?customers??plt.savefig("F:/Waterloo/Spring/623/project/Final/_Satisfaction_per_Airline")????fig?=?plt.figure(figsize=(12,6))??g=sns.countplot(x='Airline?Status',data?=?Data,hue='Satisfaction')??plt.show()??#?Airline?having?code?WN(Cheapest?Airlines)?has?the?maximum?number?of?Satisfied?and?unsatisfied?customers??plt.savefig("F:/Waterloo/Spring/623/project/Final/_Satisfaction_per_Airline")????#?Data?cleaning????#?No?of?flights?taken?per?annum?cannot?be?in?decimals??Data['No?of?Flights?p.a.']=round(Data['No?of?Flights?p.a.'],0)????#?Number?of?flights?in?a?year?should?not?be?zero?if?a?person?is?giving?a?satisfaction?rating.?This?can?be?due?incorrect?data?entry??#?We?will?be?replacing?such?entries?with?nan?and?then?replacing?them?with?values?by?the?machine?learning?algorithm??col=['No?of?Flights?p.a.']??Data[col]=Data[col].replace(0,np.nan)????#?Making?the?airline?status?into?dummy?columns.?The?original?categories
?were?Blue,Gold,SIlver?and?Platinum??Data?=?pd.get_dummies(Data,?columns=["Airline?Status"])??Data?=?pd.get_dummies(Data,?columns=["Type?of?Travel"])??Data?=?pd.get_dummies(Data,?columns=["Class"])??Data?=?pd.get_dummies(Data,?columns=["Airline?Code"])????#?Making?gender?column?as?binary??Gender_Replace?=?{'Gender':{'Male':1,'Female':0}}??Data.replace(Gender_Replace,?inplace=?True)????#?Making?flight?cancelled?as?a?binary?column??Flight_cancelled?=?{'Flight?cancelled':{'Yes':1,'No':0}}??Data.replace(Flight_cancelled,?inplace=?True)????#?Making?Arrival?delay?greater?than?5?mins?to?a??categorical?column??Arrival_Delay?=?{'Arrival?Delay?greater?5?Mins':{'yes':1,'no':0}}??Data.replace(Arrival_Delay,?inplace=?True)??????????sns.countplot(x='Age?Range',data?=?Data,hue='Satisfaction')????#?Finding?the?number?on?na?values?in?each?column??len(Data)?-?Data[Data.columns].count()????#?visualising?na?values?in?columns??plt.figure(figsize=(18,8))??sns.heatmap(Data.isnull(),yticklabels=False,cbar=False,cmap='viridis')????#?The?yellow?marks?represent?the?missing?values.???#?The?columns?with?missing?values?are?No?Of?Flights?p.a,Departure?Delay?in?Min,?Arrival?delay?in?Mins?and?flight?time?in?mins????#?dropping?grouped?columns?and?flight?date?as?we?are?doing?an?anlysis?from?2003-2012.?The?business?sense?allows?us?to?drop?these?columns??Data.drop(['Age?Range','Year?of?First?Flight','No?of?Flights?p.a.?grouped',?????????????'Flight?date','Airline?Name','Orgin?City','Origin?State','Destination?City','Destination?State'],axis?=?1,inplace?=?True)????sns.lineplot(y='Shopping?Amount?at?Airport',data?=?Data,x=?'Satisfaction')????sns.barplot(x=?'Satisfaction',y='Shopping?Amount?at?Airport',data?=?Data)????sns.set_style('whitegrid')??sns.countplot(x='Satisfaction',data?=?Data)????sns.countplot(x='Satisfaction',data?=?Data,hue='Gender')????sns.countplot(x='Gender'?,data?=?Data,hue='Satisfaction')????sns.countplot(x='Scheduled?Departure?Hour',data?=?Data,hue='Satisfaction')????#?removing?columns?that?have?na?values?in?order?to?fill?them?with?values?from?random?forest??features?=?list(Data.columns)??features.remove('No?of?Flights?p.a.')??features.remove('Departure?Delay?in?Minutes')??features.remove('Arrival?Delay?in?Minutes')??features.remove('Flight?time?in?minutes')??????#?Removing?na?values?from?number?of?flight?p.a??X?=?Data[features]??y=?Data['No?of?Flights?p.a.']????Data_without_flight_pa?=?Data[pd.isnull(Data['No?of?Flights?p.a.'])]??Data_with_flight_pa?=?Data[pd.isnull(Data['No?of?Flights?p.a.'])?==?False]??????from?sklearn.ensemble?import?RandomForestRegressor????rfModel_flight_pa?=?RandomForestRegressor(n_estimators=10)??rfModel_flight_pa.fit(Data_with_flight_pa[features],Data_with_flight_pa['No?of?Flights?p.a.'])??????generated_flight_pa_values?=?rfModel_flight_pa.predict(X?=?Data_without_flight_pa[features])????Data_without_flight_pa['No?of?Flights?p.a.']?=?generated_flight_pa_values.astype(float)??Data_updated1?=?Data_with_flight_pa.append(Data_without_flight_pa)??Data_updated1.reset_index(inplace=True)??Data_updated1.drop('index',inplace=True,axis=1)????#?removing?na?values?in?Departure?Delay?in?Minutes??features?=?list(Data.columns)??features.remove('Departure?Delay?in?Minutes')??features.remove('Arrival?Delay?in?Minutes')??features.remove('Flight?time?in?minutes')????X?=?Data_updated1[features]??y=?Data_updated1['Departure?Delay?in?Minutes']????Data_without_dep_delay?=?Data_updated1[pd.isnull(Data_updated1['Departure?Delay?in?Minutes'])]??Data_with_dep_delay?=?Data_updated1[pd.isnull(Data_updated1['Depar
ture?Delay?in?Minutes'])?==?False]????rfModel_dep_delay?=?RandomForestRegressor(n_estimators=10)??rfModel_dep_delay.fit(Data_with_dep_delay[features],Data_with_dep_delay['Departure?Delay?in?Minutes'])??generated_dep_delay_values?=?rfModel_dep_delay.predict(X?=?Data_without_dep_delay[features])????Data_without_dep_delay['Departure?Delay?in?Minutes']?=?generated_dep_delay_values.astype(float)??Data_updated2?=?Data_with_dep_delay.append(Data_without_dep_delay)??Data_updated2.reset_index(inplace=True)??Data_updated2.drop('index',inplace=True,axis=1)????#?removing?na?values?in?Arrival?Delay?in?Minutes??features?=?list(Data.columns)??features.remove('Arrival?Delay?in?Minutes')??features.remove('Flight?time?in?minutes')????X?=?Data_updated2[features]??y=?Data_updated2['Arrival?Delay?in?Minutes']????Data_without_arrival_delay?=?Data_updated2[pd.isnull(Data_updated2['Arrival?Delay?in?Minutes'])]??Data_with_arrival_delay?=?Data_updated2[pd.isnull(Data_updated2['Arrival?Delay?in?Minutes'])?==?False]????rfModel_arrival_delay?=?RandomForestRegressor(n_estimators=10)??rfModel_arrival_delay.fit(Data_with_arrival_delay[features],Data_with_arrival_delay['Arrival?Delay?in?Minutes'])??generated_arrival_delay_values?=?rfModel_arrival_delay.predict(X?=?Data_without_arrival_delay[features])????Data_without_arrival_delay['Arrival?Delay?in?Minutes']?=?generated_arrival_delay_values.astype(float)??Data_updated3?=?Data_with_arrival_delay.append(Data_without_arrival_delay)??Data_updated3.reset_index(inplace=True)??Data_updated3.drop('index',inplace=True,axis=1)????#?removing?na?values?in?Flight?time?in?Minutes??features?=?list(Data.columns)??features.remove('Flight?time?in?minutes')??X?=?Data_updated3[features]??y=?Data_updated3['Flight?time?in?minutes']????Data_without_flight_mins?=?Data_updated3[pd.isnull(Data_updated3['Flight?time?in?minutes'])]??Data_with_flight_mins?=?Data_updated3[pd.isnull(Data_updated3['Flight?time?in?minutes'])?==?False]????rfModel_flight_mins?=?RandomForestRegressor(n_estimators=10)??rfModel_flight_mins.fit(Data_with_flight_mins[features],Data_with_flight_mins['Flight?time?in?minutes'])??generated_flight_mins_values?=?rfModel_flight_mins.predict(X?=?Data_without_flight_mins[features])????Data_without_flight_mins['Flight?time?in?minutes']?=?generated_flight_mins_values.astype(float)??Data_updated4?=?Data_with_flight_mins.append(Data_without_flight_mins)??Data_updated4.reset_index(inplace=True)??Data_updated4.drop('index',inplace=True,axis=1)????len(Data_updated4)?-?Data_updated4[Data_updated4.columns].count()????#?visualising?that?all?na?values?in?columns?are?filled??plt.figure(figsize=(18,8))??sns.heatmap(Data_updated4.isnull(),yticklabels=False,cbar=False,cmap='viridis')??????#?Doing?classification?with?2?machine?learning?techniques????#?logistic?regression?with?all?features??features?=?list(Data_updated4.columns)??features.remove('Satisfaction')??X?=?Data_updated4[features]??y=?Data_updated4['Satisfaction']????#?Split?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test????logmodel?=?LogisticRegression()??logmodel.fit(X_train,y_train)??predictions?=?logmodel.predict(X_test)????print('Confusion?Matrix:\n',confusion_matrix(y_test,predictions))??print('\n')????print('Classification?Report:\n',classification_report(y_test,predictions))????import?statsmodels.api?as?sm??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())??#?nan?value?rep
resents?multi-collinearity????#keeping?all?the?features?with?significance?value?<0.05??features2=['Age',?'Gender',?'Price?Sensitivity',?????????'No?of?Flights?p.a.',?'%?of?Flight?with?other?Airlines',?????????'No.?of?other?Loyalty?Cards',?'Shopping?Amount?at?Airport',?????????'Scheduled?Departure?Hour',?'Flight?cancelled',?????????'Arrival?Delay?greater?5?Mins']??X?=?Data_updated4[features2]??y=?Data_updated4['Satisfaction']??#?Split?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test??logmodel?=?LogisticRegression(solver='liblinear')??logmodel.fit(X_train,y_train)??predictions?=?logmodel.predict(X_test)????#import?statsmodels.api?as?sm??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())????#?since?the?significance?value?of?No.?of?other?Loyalty?Cards?is?greater?than?0.05,eliminating?this?from?features??features3=['Age',?'Gender',?'Price?Sensitivity',?????????'No?of?Flights?p.a.',?'%?of?Flight?with?other?Airlines',?'Shopping?Amount?at?Airport',?????????'Scheduled?Departure?Hour',?'Flight?cancelled',?????????'Arrival?Delay?greater?5?Mins']??X?=?Data_updated4[features3]??y=?Data_updated4['Satisfaction']??#?Split?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test??logmodel?=?LogisticRegression(solver='liblinear')??logmodel.fit(X_train,y_train)??predictions?=?logmodel.predict(X_test)????#?MODEL?1?LOGISTIC?REGRESSION??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())??????print('Classification?Report:\n',classification_report(y_test,predictions))????print('Confusion?Matrix:\n',confusion_matrix(y_test,predictions))??print('\n')????sns.heatmap(pd.DataFrame(confusion_matrix(y_test,predictions)),annot=?True,cmap?=?'plasma')??plt.show()????#?Model2?standardising?the?features?????#?Standardised?logistic?regression??X?=?Data_updated4[features3]????#?Standardizing?the?features??X_Standardized=?StandardScaler().fit_transform(X)????Standardized_data?=?pd.DataFrame(data?=?X_Standardized,?columns=?features3)??finalDf?=?pd.concat([Standardized_data,?Data_updated4[['Satisfaction']]],?axis?=?1)????X_new?=?finalDf[features3]??y=finalDf['Satisfaction']????#?Split?standardized?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X_new,?y,?test_size=0.2,?random_state=1)?#?70%?training?and?30%?test????logmodel?=?LogisticRegression(solver='liblinear')??logmodel.fit(X_train,y_train)??predictions2?=?logmodel.predict(X_test)????#?This?is?our?improved?Logistic?regression?model?with?standardising?the?parameters??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())????print('Classification?Report:\n',classification_report(y_test,predictions2))????sns.heatmap(pd.DataFrame(confusion_matrix(y_test,predictions2)),annot=?True,cmap?=?'plasma')??plt.show()????#?GETTING?THE?ODDS?RATIOS??model2_odds?=?pd.DataFrame(np.exp(result.params),?columns=?['OR'])??model2_odds????#?Recursive?Feature?Elimination?to?select?all?important?features(?Top?10?in?our?case)?for?LOGISTIC?REGRESSION?classifier??from?sklearn.feature_selection?import?RFE??from?sklearn.linear_model?import?LogisticRegression??logreg?=?LogisticRegression(solver='liblinear')??#?create?the?RFE?model?for?the?Random?classifier???#?and?select?attributes??all_features?=
?list(Data_updated4.columns)??all_features.remove('Satisfaction')??X?=?Data_updated4[all_features]??y=?Data_updated4['Satisfaction']??????rfe?=?RFE(logreg,?10?)??rfe?=?rfe.fit(X,?y)??#?print?summaries?for?the?selection?of?attributes??print(rfe.support_)??print(rfe.ranking_)????important_features_logistic?=?[?'Gender','Flight?cancelled',?'Arrival?Delay?greater?5?Mins',???'Airline?Status_Blue',?'Airline?Status_Platinum','Airline?Status_Silver','Type?of?Travel_Business?travel',???'Type?of?Travel_Mileage?tickets','Type?of?Travel_Personal?Travel','Class_Business']????#Model?3?building??#?Logistic?model?on?the?RFE?variables??X?=?Data_updated4[important_features_logistic]????#?Standardizing?the?features??X_Standardized=?StandardScaler().fit_transform(X)????Standardized_data?=?pd.DataFrame(data?=?X_Standardized,?columns=?important_features_logistic)??finalDf?=?pd.concat([Standardized_data,?Data_updated4[['Satisfaction']]],?axis?=?1)????X_new?=?finalDf[important_features_logistic]??y=finalDf['Satisfaction']????#?Split?standardized?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X_new,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test????logmodel?=?LogisticRegression(solver='liblinear')??logmodel.fit(X_train,y_train)??predictions3?=?logmodel.predict(X_test)????#?This?is?our?improved?Logistic?regression?model?with?standardising?the?parameters??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())??#?nan?value?represents?multi-collinearity????#?After?removing?insignificant?features??important_features_logistic_updated?=?['Gender',???'Flight?cancelled','Arrival?Delay?greater?5?Mins','Airline?Status_Blue',???'Airline?Status_Platinum','Airline?Status_Silver','Class_Business']????X?=?Data_updated4[important_features_logistic_updated]????#?Standardizing?the?features??X_Standardized=?StandardScaler().fit_transform(X)????Standardized_data?=?pd.DataFrame(data?=?X_Standardized,?columns=?important_features_logistic_updated)??finalDf?=?pd.concat([Standardized_data,?Data_updated4[['Satisfaction']]],?axis?=?1)????X_new?=?finalDf[important_features_logistic_updated]??y=finalDf['Satisfaction']????#?Split?standardized?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X_new,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test????logmodel?=?LogisticRegression(solver='liblinear')??logmodel.fit(X_train,y_train)??predictions3?=?logmodel.predict(X_test)????#?This?is?our?improved?Logistic?regression?model?with?standardising?the?parameters??logit_model=sm.Logit(y_train,X_train)??result=logit_model.fit(solver='liblinear')??print(result.summary2())??print('The?intercept?of?this?model?is:',logmodel.intercept_)????print('Classification?Report:\n',classification_report(y_test,predictions3))????sns.heatmap(pd.DataFrame(confusion_matrix(y_test,predictions3)),annot=?True,cmap?=?'plasma')??plt.show()????#?ROC-AUC?curve?for?the?best?logistic?model?i.e?model2????fpr,?tpr,_=roc_curve(y_test,predictions2,drop_intermediate=False)????import?matplotlib.pyplot?as?plt??plt.figure()??##Adding?the?ROC??plt.plot(fpr,?tpr,?color='red',???lw=2,?label='ROC?curve')??##Random?FPR?and?TPR??plt.plot([0,?1],?[0,?1],?color='blue',?lw=2,?linestyle='--')??##Title?and?label??plt.xlabel('FPR')??plt.ylabel('TPR')??plt.title('ROC?curve')??plt.show()????auc?=?roc_auc_score(y_test,predictions2)??auc????#?Cross?validation?for?logistic?regression?on?model2??import?numpy?as?np??from?sklearn?import?datasets?
?from?sklearn?import?metrics??from?sklearn.model_selection?import?KFold,?cross_val_score??from?sklearn.pipeline?import?make_pipeline??from?sklearn.linear_model?import?LogisticRegression??from?sklearn.preprocessing?import?StandardScaler????X?=?Data_updated4[features3]??y=?Data_updated4['Satisfaction']????#?Create?standardizer??standardizer?=?StandardScaler()????#?Create?logistic?regression??logit?=?LogisticRegression()????#?Create?a?pipeline?that?standardizes,?then?runs?logistic?regression??pipeline?=?make_pipeline(standardizer,?logit)????#?Create?k-Fold?cross-validation??kf?=?KFold(n_splits=10,?shuffle=True,?random_state=1)????#?Do?k-fold?cross-validation??cv_results?=?cross_val_score(pipeline,?#?Pipeline???????????????????????????????X,?#?Feature?matrix???????????????????????????????y,?#?Target?vector???????????????????????????????cv=kf,?#?Cross-validation?technique???????????????????????????????scoring="precision",????????????????????????????????n_jobs=-1)?#?Use?all?CPU?scores????#?Calculate?mean??print('The?mean?precision?score?after?10?fold?cross?validation?is:',cv_results.mean())????#Model?4?building?RANDOM?FOREST?CLASSIFIER??features?=?list(Data_updated4.columns)??features.remove('Satisfaction')??X?=?Data_updated4[features]??y=?Data_updated4.Satisfaction????#?Split?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X,?y,?test_size=0.2,?random_state=1)?#?80%?training?and?20%?test????#Import?Random?Forest?Model??from?sklearn.ensemble?import?RandomForestClassifier????#Create?a?Gaussian?Classifier??clf=RandomForestClassifier(n_estimators=10)????#Train?the?model?using?the?training?sets?y_pred=clf.predict(X_test)??clf.fit(X_train,y_train)????predictions4=clf.predict(X_test)????#?Model?Accuracy,?how?often?is?the?classifier?correct???#print("Accuracy:",metrics.accuracy_score(y_test,?y_pred))????print(classification_report(y_test,predictions4))????sns.heatmap(pd.DataFrame(confusion_matrix(y_test,predictions4)),annot=?True,cmap?=?'plasma')??plt.show()????#?Using?Recursive?Feature?Elimination?to?select?the?important?features?using?RANDOM?FOREST?CLASSIFIIER??from?sklearn.ensemble?import?RandomForestClassifier???from?sklearn.feature_selection?import?RFE??#creating?RandomForest?Classifier??object??clf=RandomForestClassifier(n_estimators=10)??#?create?the?RFE?model?for?the?Random?classifier?and?select?attributes??all_features?=?list(Data_updated4.columns)??all_features.remove('Satisfaction')??X?=?Data_updated4[all_features]??y=?Data_updated4['Satisfaction']??????rfe?=?RFE(clf,?10?)??rfe?=?rfe.fit(X,?y)??#?print?summaries?for?the?selection?of?attributes??print(rfe.support_)??print(rfe.ranking_)????important_variable_random_forest?=?['Age','No?of?Flights?p.a.','%?of?Flight?with?other?Airlines','Eating?and?Drinking?at?Airport',???'Day?of?Month',?'Arrival?Delay?in?Minutes','Flight?time?in?minutes',???'Flight?Distance','Type?of?Travel_Business?travel',?'Type?of?Travel_Personal?Travel']?????#?Random?forest?on?RFE?features??X?=?Data_updated4[important_variable_random_forest]????#?Standardizing?the?features??X_Standardized=?StandardScaler().fit_transform(X)????Standardized_data?=?pd.DataFrame(data?=?X_Standardized,?columns=?important_variable_random_forest)??finalDf?=?pd.concat([Standardized_data,?Data_updated4[['Satisfaction']]],?axis?=?1)????X_new?=?finalDf[important_variable_random_forest]??y=finalDf['Satisfaction']????#?Split?standardized?dataset?into?training?set?and?test?set??X_train,?X_test,?y_train,?y_test?=?train_test_split(X_new,?y,?test_size=0.2,?random_st
important_variable_random_forest = ['Age', 'No of Flights p.a.',
    '% of Flight with other Airlines', 'Eating and Drinking at Airport',
    'Day of Month', 'Arrival Delay in Minutes', 'Flight time in minutes',
    'Flight Distance', 'Type of Travel_Business travel',
    'Type of Travel_Personal Travel']

# Random forest on the RFE-selected features
X = Data_updated4[important_variable_random_forest]

# Standardizing the features
X_Standardized = StandardScaler().fit_transform(X)

Standardized_data = pd.DataFrame(data=X_Standardized, columns=important_variable_random_forest)
finalDf = pd.concat([Standardized_data, Data_updated4[['Satisfaction']]], axis=1)

X_new = finalDf[important_variable_random_forest]
y = finalDf['Satisfaction']

# Split the standardized dataset into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=1)

# Create and train a random forest classifier with 10 trees
clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train, y_train)
predictions5 = clf.predict(X_test)

print(classification_report(y_test, predictions5))

sns.heatmap(pd.DataFrame(confusion_matrix(y_test, predictions5)), annot=True, cmap='plasma')
plt.show()

# Cross-validation for the random forest on Model 4
X = Data_updated4[features]
y = Data_updated4.Satisfaction

clf_cv_score = cross_val_score(clf, X, y, cv=10, scoring='precision')

print(clf_cv_score)
print('\n')
print("=== Mean Precision Score ===")
print("Mean Precision Score - Random Forest: ", clf_cv_score.mean())

# Plot the Gini importance of each feature; clf was last fitted on the
# RFE-selected features, so pair the importances with those feature names
plt.style.use('seaborn')
feats = {}
for feature, importance in zip(important_variable_random_forest, clf.feature_importances_):
    feats[feature] = importance

importances = pd.DataFrame.from_dict(feats, orient='index').rename(columns={0: 'Gini-importance'})
importances.sort_values(by='Gini-importance').plot(kind='bar', rot=90, figsize=(10, 6))

# K-MEANS CLUSTERING

# PCA with 2 components
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

PCA_features = list(Data_updated4.columns)
PCA_features.remove('Satisfaction')
X = Data_updated4[PCA_features]

# Standardizing the features
X_Standardized = StandardScaler().fit_transform(X)
Standardized_data = pd.DataFrame(data=X_Standardized, columns=PCA_features)

X_new2 = Standardized_data[PCA_features]

# Fitting the PCA algorithm to our data
pca = PCA().fit(X_Standardized)

# Plotting the cumulative sum of the explained variance
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)')  # cumulative, per number of components
plt.title('Airline Survey Dataset Explained Variance')
plt.show()

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X_new2)
principalDf2 = pd.DataFrame(data=principalComponents,
                            columns=['principal component 1', 'principal component 2'])

# Heatmap of the PCA component loadings
df_comp = pd.DataFrame(pca.components_, columns=PCA_features)

plt.figure(figsize=(12, 6))
sns.heatmap(df_comp, cmap='plasma')

# Finding the value of k by the elbow method on the standardized data
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X_Standardized)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

sns.scatterplot(x='principal component 1', y='principal component 2', data=principalDf2)
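The original appendix then repeats a nearly identical block for each candidate cluster count (k = 2, 3, 4, 5, 6, 8, 9 and 10): fit k-means on the two principal components, plot the cluster assignments with their centroids, and compute the average silhouette score. Those repeated blocks are condensed into the single loop below over the same k values; clustering is done on the two principal-component columns only, so the label column appended for plotting does not leak into the next fit as a feature.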
# Performing k-means clustering on the PCA components for each candidate k
from sklearn.metrics import silhouette_samples, silhouette_score

pc_cols = ['principal component 1', 'principal component 2']
silhouette_by_k = {}

for k in [2, 3, 4, 5, 6, 8, 9, 10]:
    kmeans = KMeans(n_clusters=k)
    # Cluster on the two principal components only
    labels = kmeans.fit_predict(principalDf2[pc_cols])
    centers = kmeans.cluster_centers_
    print("cluster memberships for k={}:\n{}".format(k, labels))

    # Attach the labels for plotting only
    principalDf2['coord_cluster'] = labels
    sns.scatterplot(x='principal component 1', y='principal component 2',
                    data=principalDf2, hue='coord_cluster')
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200)
    plt.show()

    # Average silhouette score for this k
    silh_val = silhouette_samples(principalDf2[pc_cols], labels)
    silhouette_by_k[k] = np.mean(silh_val)
    print("average silhouette score for k={}: {:.4f}".format(k, silhouette_by_k[k]))
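As a convenience (an assumption about how the scores would be used, not something the original code does), the dictionary built in the loop makes it straightforward to pick the k with the highest average silhouette score and refit a final model:

# Hypothetical follow-up: select the best-scoring k and refit
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
final_kmeans = KMeans(n_clusters=best_k).fit(principalDf2[pc_cols])
print("best k by average silhouette score:", best_k)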