Daniel Yerelian



Excel DashboardsBullet Graphs in excel and combo boxes: Cleaning and Descriptive Statistics:The measures from a population are called parameters.The measures from a sample are called statistics.?Type of sampling?Probabilistic sampling (assume there is no order, no bias against any characteristic, and they are all heterogeneously mixed)simple random sampling with our without replacement (random number gerenrator in Excel, use function randtween (1, 1638) - the numbers being the size of the populationproportional random samplingLet's say you want to do proportional random sampling on books on south Americaby viewing the number of books in each country and their percentage of total, you can figure out, for instance, that for Argentina the proportion is 578/3192, so your sample needs to have the same proportionfor example, if you have 200 samples, then you need to have 200 *(578/3192) from ArgentinaHow to pick the samples, e.g. with flight delay dataEnter observation IDs for each recordsort your file on delaynote the number of delays vs. non delays. Let's stay the first 400,000 records are non-delaysrandbetween (1, 400,000) to sample delay = 0. ?Keep the ratio of delays vs. non delays in your sample same as original datarandbetween (400,001, 500,000) to sample delay = 1Copy all of the resulting numbers and paste special – valuethen do ?vlookup, to get the data from the other sheetOutliersAnything more than 3 times the standard deviationYour range is mean + 3 sigma, mean -3 sigma?exclude any values outside of this range (e.g. conditional formatting to highlight offending values, filter, etc)keep deleting until there are no values highlighted (as it deletes, it will recalculate the mean + 3 sigma and it will highlight new values, that's ok, just keep deleting)Frequency Charts (e.g. frequency of house size from real estate)Insert Pivot TableHouse size is rowCount (Price) as a column (can be anything else)Right click house size – group. Enter Bin size in “by”Right click resulting table – Insert Chart – Column ChartEyeball data to determine outliersHistograms: Data Data Analysis HistogramsDescriptive Stats: Data Data Analysis Descriptive StatisticsBox Plots: Add Ins – Data Analysis – XLMiner – Charts – Box PlotPrediction and Classification Methods: You will have three data sets - training, validation and test setsTraining set is what you build the model on Validation is used for validating the quality of the modelTest is testing the accuracy of the modelData Partitioning:Open Data Set in ExcelAdd Ins --> XLMiner --> Partition Data --> Standard PartitionSelect all the variables and put them on the rightPartitioning by default is set to Automatic - 60% training, 40% validationBut you can also say “use partition variable”There, you say what kind of set you want by putting in a variable - t(test), s (training), v (validation)This generated the partitions and you can use the hyperlinks at the top to stitch between training and validation dataNa?ve BayesPartition DataClick on Training SetXML Miner - Classification - Naive BayesSelect input variables ?and output variables (flight status)?NextNextCheck Summary Reports Detailed report, score validation data summary report, Lift ChartsGo to Prior Class Probability The 0.8 and 0.2 on top are the prior class probabilitiesCopy data from other spreadsheet – paste – special - transposeUse VLOOKUP On the conditional probabilitiesThen PRODUCT(I43:I49, I41) (multiply each one of the probabilities by the overall probability Then Na?ve Bayes formula = Prob of on time / (prob of on time + prob of delay)Probability of each condition give on time * Overall on time probabilitydivided by?Probability of each condition given on time * overall on time probability +?Probability of each condition given dealyed * overall delayed probability?Multiple RegressionYou are trying to find how a dependent variable is related to independent variable. You want to check:?whether the dependent variable has a linear relationship with the independent variablewhether the independent variable is indeed independentto make sure that it is a continuous relationship rather than a discrete one (e.g. one-bedroom, 2 bedroom, 3 bedroom is discrete)Regression equation: Y = Alpha + Beta*X + errorIn other words: Dependent variable = constant + the contribution of an independent variable + something randomFor example: House price = 3000 + 600* sq ft + EYou can say Y hat = a +bx ?<-- this is an estimate (you drop the random part)??? ???? ???? ???? ???? ???? ???? ???? ???? <-- a is an estimate of Alpha and b is an estimate of Beta?? ???? ???? ???? ???? ???? ???? ???? ???? <-- The error in your estimate is Y - Y hat??? ???? ???? ???? ???? ???? ???? ???? ???? <-- of you square that, you will get Error squaredIf you add all the errors squared (for each error) (call it i) that is the total error?There are models that minimize this error --> you use derivativesIf you want to minimize the error, you need to find a and b that minimize y hat.?Before you run the regression, you want to make sure that there are no correlated variables (they are truly independent).?To find out, you go to XLMiner - Charts - Matrix Plot - pick all the variables you are interested in (Beds, baths, sq ft, price)Price and Sq Ft has almost linear relationship (look in lower right corner)Lower left - discontinuous and you also see that price and beds and sq ft and bed are also positively corelated...?So here you will pick sq ft because?it means I will get more bedrooms and bathroomsIt's the continuous variable where beds and baths are discreteYou also know that you have to separate them because beds and baths are discontinuous (lower right corner)So let's say you can't decide which variable to use. Run regression for all three variables independently (In Excel --> Data --> Data Analysis --> Regression. Check residual plots).?Since bed vs. price residual plot is discontinuous, you can tell that beds is not a good variable to use.?Alpha is what you set (it's your tolerance for error, typically it's 0.05 or less), p is what you get.?Lower P --> Better result.?P =< Alpha – independent variable is significantP > Alpha – independent variable not significant and can be removed from regressionT > 2 – significantAlso, you look at the adjusted R Square to see the explanatory power of the model. Lower R means worse.?Check the standard error – make sure it’s low. Look at Correlation (Excel Data Analysis Correlation)Highlight results, do a conditional formatting - color bar (home conditional formatting - color scales).? Do absolute value first This is another way to determine multi-colinearity (in addition to doing Matrix Plot in XLMiner)What if I run multiple regression on all variables (using XLMiner).Fitted values will give you the predicted valueCheck Unstandardized, summary reportYou know this is a problem because the coefficient for the bedroom is negative So adding a bedroom reduces your house value????The good news is, XlMiner will determine the best variable for you.XLMiner --> Mutliple Linear regressionAt Step 2, click Best Subset. Backwards elimination (it takes all the variables, and eliminates the least significant first)Look at the adjusted r Square - where is it tapering off? No more improvement between 5 and 6.Also, look at CP - highest R Square and CP = total number of predictors. Pick # 12 because R square is higher and CP is close to the total number of predictors.?Principle of parsimony - if you can do the job with two variables, don't use 3.?In fact, if you include too many variables, you overfit the data- you match the model perfectly to the data and there is no predictive power.?From the output from XLMiner, click Subset Selection, then choose subset?It will automatically select the subset for you, but then you have to rerun the regression on just this data.The regression equation is under "Reg Model"Prediction = constant ?+ coeff.*input variable + coeff. *input variable....K Nearest Neighborwhen you get a new record, you compare it to existing recordsyou find the "distance" between this new record and the existing records=SQRT(SUMXMY2($H$3, A2) + ($I$3-B2)^2)You decide to use the k number of records with the smallest distance (has to be odd but you set it)see Excel example in MBAD 698 folderTo do this in XLMinerPartition data firstOn the training set, click inside the set?Add Ins --> XLMiner --> Classification - K Nearest NeighborOn Step 2, select score on best K between 1 and specified value (will let you do 19 max)Click on Prior class Probabilities. Best K will be highlighted there.?Classification TreePartition DataXLMiner - Classification - Classification TreeSelect input and output variablesRunWhen you look at the output, less is on left, more is on rightThe number in between is the number of records in the existing set that fall into that categorye.g. someone with less than 100.5K and less than 2.95 CC average ?--> not worth personal loanClassification Errors and CostsMisclassification - how many are placed in the incorrect category on the test/validation dataTwo kinds of Errors:Individual Misclassification Error - this is for each category itself (you think that a mailing will generate business but it does not); ssually associated with false positive or false negativeOverall Misclassification Error: useful for evaluating the overall modelBehavior of the errors with cut-off probability valuesIf you provide a cut-off probability, then the classification algorithm will reclassify according to the cut-off. Typical default cut-off is 50%Cut-off probability is dependent on misclassification cost and business contextData Table (What-If Analysis) can be used to plot the behaviorFor calculations and decision making for the future records, typically validation results are used.Lift Chart (or gains chart) is a graphical way to see the effectiveness of the classification model. If you do not use any classification and just send an offer to everyone, then your response rate will be whatever is the underlying probability. However, when you use a classification scheme, and then sort the target records accordingly and send the offer, then your response rate should be much higher. The ratio of gain is the lift.The Decile chart shows the same information, only in blocks of 10% of the records. Allows you to know when to stop targeting.Let Us Recreate the Lift and Decile Chart for the Universal Bank ExampleSort the records in the validation score in descending order of the classification probabilitiesCreate a new column on the left to number the cases serially from 1 to 1000Create a column to count the cumulative number of 1’s (successes) in the actual columnComplete the entries for all columns using appropriate formulasFind out the actual number of 1s and 0s in the validation data set (hint: can be easily done from classification confusion matrix) and create the overall prob of 1 and 0In a new sheet create a table that would show the number of success from every 50 records as per the probability and from your actual cumulative column in the validation score worksheet. ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download