Datamined.files.wordpress.com



1) a) Given the data visualized below, sketch the best-fit linear regression line.
   b) For the data above, is the correlation coefficient between latitude and temperature negative, zero, or positive?
   c) Draw the best polynomial fit on the figure below.
   d) Draw an overfitting curve on the figure below.
   e) Draw an underfitting curve on the figure below.

2) You observe that the training accuracy of your KNN classifier is much higher than the test accuracy. What is this phenomenon called?

3) The following dataset contains eight items representing colored points on the x-y plane. Using this data as the training set, run the k-nearest-neighbors classification algorithm (manually) to decide the most likely color for a new item with x = 3 and y = 3. The distance between points is the actual distance on the x-y plane (also called Euclidean distance).
   a) If you run the algorithm with k=1, what color is assigned to the new item?
   b) If you run the algorithm with k=4, what color is assigned to the new item?

4) Given the following data, calculate the:
   a) Accuracy
   b) Precision
   c) Recall
   d) F1 score
   e) Why might using accuracy as the only metric not be ideal?
   f) Create the confusion matrix for this data.

5) Data pre-processing and conditioning is one of the key factors that determine whether a data mining project will be a success. For each of the following topics, describe the effect the issue can have on a data mining session and the techniques that can be used to counter the problem.
   a) Noisy data
   b) Missing data
   c) Data normalization and scaling
   d) Data type conversion
   e) Attribute and instance selection

6) We've covered several data mining techniques in this course. For each technique identified below, describe the technique, identify which problems it is best suited for, identify which problems it has difficulties with, and describe any issues or limitations of the technique.
   a) Decision trees
   b) K-nearest neighbors algorithm
   c) Linear regression
   d) Bayes classifiers
   e) Support vector machines
   f) Rule-based classifiers

7) To the best of your ability, draw the best SVM line (with the largest "soft margin") separating the two sets of data (squares and diamonds).

8) Given the following data, determine the best indicator function for the first data split using the AdaBoost algorithm.

9) Use entropy to find the best root split (among Outlook, Temperature, Humidity, Windy) for a decision tree given the following data.
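
For checking hand-worked answers to the KNN, metrics, and entropy questions, a minimal Python sketch may help. It is not part of the original sheet: the function names (`knn_predict`, `binary_metrics`, `entropy`, `information_gain`) and the input layouts are assumptions made for illustration, and the sheet's own datasets are not reproduced here, so any data passed in must come from the exam figures and tables.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`.

    `train` is assumed to be a list of ((x, y), label) pairs.
    Distance is Euclidean. Ties in the vote fall to whichever label
    Counter.most_common returns first, so state your own tie rule
    when answering by hand.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix counts plus accuracy, precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of `target` minus its weighted entropy after splitting on `attribute`.

    `rows` is assumed to be a list of dicts keyed by attribute name.
    """
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder
```

For the decision-tree question, one would call `information_gain` once per candidate attribute (Outlook, Temperature, Humidity, Windy) and pick the attribute with the largest gain as the root split.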

