Assignment 4



Task 0: Visualize the complex9_gn8 dataset; visualize the third attribute using different colors, similar to the supervised scatterplots used in Assignment 1.

Task 1: Develop a 2D spatial outlier detection technique of your own preference that identifies abnormal data in datasets that contain pairs of numbers. Completed.

Task 2: Implement the chosen outlier detection technique for the complex9_gn8 dataset. Completed. As explained earlier, the implementation of the outlier detection technique should add a column/attribute ols to the dataset and fill this column with numbers, as explained above.

Task 3: Evaluation

a. Apply the outlier detection to the complex9_GN8.data dataset, obtaining a new file X; the outlier detection method is applied only to attributes x and y of the dataset and ignores the attribute named class. Completed.

b. Sort X in descending order based on the values of attribute ols (the example with the highest ols value should be the first entry in X)! Completed.

c. Visualize the first 2% of the observations in X in one display, showing only their x and y values and the class (using a different color for each class), and the remaining 98% of the observations in a second display. In general, the first display visualizes the outliers and the second display visualizes the normal observations in the dataset.

d. Visualize the first 5% of the observations in X, showing only the x and y values and the class (using a different color for each class), and the remaining 95% of the observations in a second display.

e. Visualize the first 10% of the observations in X, showing only the x and y values and the class (using a different color for each class), in one display and the remaining 90% of the observations in a second display.

f.
Interpret the 6 displays you generated in steps c-e; in particular, assess how well your outlier detection method works: intuitively, observations that are quite far away from the 9 natural clusters of the original Complex9 dataset should be outliers. Also try to characterize which points are picked as outliers first (the top 2%), second (the 2% to 5% percentile range), and third (the 5% to 10% percentile range).

Overall, the outlier detection method performed quite well. The visualization of the first 2% shows that all records included in the top 2% by outlier score are outliers. This group consists almost entirely of outliers on the outer bounds of the set: all of their nearest neighbors tend to lie in one direction and are farther away than the mean distance, which drives up the outlier score. The sole exception is one point in the center of cluster six, though it is far enough from its neighbors to be captured. The 2% limit appears to be where the outliers on the fringe are exhausted.

The 2% to 5% range looks very similar to the original Complex9 dataset. The points removed as outliers in this group are mainly those lying between the clusters, along with a small number of fringe points that were relatively close to the natural clusters. At this point it becomes obvious that some of the noise points cannot be removed as outliers because they are embedded in other clusters; the specks of different colors within the clusters represent these points. Using only the spatial information, no method can discern these points from the cluster itself. It also appears that true members of the natural clusters are beginning to be removed at this point, especially just below the y = 200 line for class 5 and around the edges of class 6.

From 5% to 10%, nearly all of the points removed appear to be members of the natural clusters. A few noise points may have been removed as well; however, the holes beginning to appear in the clusters show that too many points were removed.
The noise points that embedded themselves in the natural clusters remain, as shown by their having a different color than their surroundings. Removing any additional noise points at this step would require too great a loss of original points to continue.

g. Create a histogram for the ols values of the top 10% entries in file X. Interpret the results. Moreover, look for gaps in the ols values in the file X; if you observe any gaps, try your best to interpret why they occur!

The top ten percent appear to have a bimodal distribution, with peaks at roughly 0-10 and 60-90. The peak at the low end likely represents points that are part of the natural clusters yet fall within the top 10% of outlier score values; they are not noise, hence the low outlier score. The second peak likely consists of true outliers. If the points were generated by adding one normally distributed variation to the x coordinate and another to the y coordinate, this peak could represent the new mean distance from the original cluster about which the outliers are centered. It appears that any score above 50 denotes one of the fringe points. This is confirmed by checking the minimum ols for the top 2% range, which is 42.24; the visualization of the top 2% showed that the points removed were the fringe points.

Gaps do not occur until the 180-190 mark, soon followed by the 200-230 gap and the 240-280 gap. The gaps are on the high end and are likely caused by the fact that drawing large deviations from two independent samples of a normal distribution is less likely than drawing one or no large deviations. These combinations of two large deviations are what increase the distance and consequently the outlier score. Since higher distances are less frequent, gaps begin to occur in this range. Additionally, the highest outlier scores belong to the fringe points that were moved away from all clusters.
This makes scores in this range even less frequent, increasing the likelihood of gaps.

Task 4 (optional): Enhance your outlier detection technique based on the feedback of Task 3 and redo Task 3.

The boxplot to the left shows a good representation of all outlier scores due to the extreme concentration of values toward the low end of the range. A summary of the scores is shown immediately below. The 3rd quartile is 4.766, the 1st quartile is 1.831, and all values above 9.169 (3rd quartile + 1.5 * IQR) are shown as outliers. This agrees with the visualizations of the remaining 95% and 90% shown previously. The minimum outlier score for the top 5% was 10.18, slightly above the boxplot cutoff, which agrees with the assessment that nearly all outliers that could be removed using this method had been removed. For the top 10%, the minimum outlier score was 7.209, well below the boxplot cutoff, supporting the conclusion that non-outliers had been removed in that visualization.

Since the visualizations show the outlier detection to be effective, and the boxplot and summary data of the outlier scores support the conclusions made, the outlier detection technique was not modified for enhancement.

Task 5: Write 2-5 paragraphs explaining how your outlier detection technique works and how it was implemented. If you enhanced your approach based on feedback to get better results, also describe how you enhanced your technique. If your outlier detection technique needs the selection of parameter values before it can be run, describe how you selected those parameter values for conducting Task 3. Moreover, mention in a paragraph what (if any) external software packages you used in the project!

The outlier detection technique used was based on the distances to each point's first k nearest neighbors. Two weighting factors were considered. First, the distance to the k-th nearest neighbor is assumed to be less important than the distance to the (k + 1)-th nearest neighbor.
Using the definitions from DBSCAN, this is because core points have many points surrounding them in all directions, border points tend to have neighbors within an arc of 180 degrees or less, and true outliers have few points that are close. Assigning the weight k to the score for the k-th nearest neighbor gives more weight to the higher-k scores, in turn raising the outlier score of outliers drastically, of border points marginally, and of core points minimally.

Second, most points in dense clusters are relatively close to one another, so points with z-scored distance values above a certain threshold are more likely to be outliers. Outliers skew the mean distances for the k, k + 1, ..., k + n nearest neighbors higher; therefore, this threshold should be in the negative range. Any z-score below the threshold is assigned a minimum multiplier; any z-score above it is assigned a multiplier equal to the z-score shifted by the difference between the minimum multiplier and the threshold. The overall outlier score is the sum of the adjusted z-scores of the distances to the k nearest neighbors, each weighted by its respective neighbor rank:

    t = threshold
    m = minimum multiplier
    z_i = z-score of the distance to the i-th nearest neighbor

    f(z) = m,           if z <= t
    f(z) = z + (m - t), if z > t

    ols = sum_{i=1}^{k} i * f(z_i)

A k value of 5 was selected to achieve an appropriate number of comparisons while keeping the maximum outlier score out of extreme ranges. Obvious outliers are assigned high values, yet there should still be a differentiation between the scores of border and core points. The threshold was chosen so that the minimum multiplier applies to distances below the median value for each value of k, yet still above the 1st quartile. A threshold of -0.25 achieved this objective, as shown in the summary and cutoff tables below.
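The scoring rule above can be sketched as a small self-contained function. This is a minimal illustration of the formula only, not the submitted implementation; `f.scorePoint` and its arguments are hypothetical names, and `zscores` stands for the z-scored kNN distances of a single point, ordered from nearest to farthest.

```r
# Adjusted z-score outlier scoring for one point, following the
# f(z) and ols definitions above (illustrative sketch only)
f.scorePoint <- function(zscores, threshold = -0.25, minMult = 0.1) {
  # f(z): clamp scores at or below the threshold to the minimum
  # multiplier; shift the rest by (minMult - threshold) so that
  # f is continuous at z = threshold
  fz <- ifelse(zscores <= threshold, minMult,
               zscores + (minMult - threshold))
  # weight the i-th nearest neighbor's adjusted score by i and sum
  sum(seq_along(fz) * fz)
}

# A point whose five neighbor distances all z-score below the threshold
# gets the minimum possible score: 0.1 * (1 + 2 + 3 + 4 + 5) = 1.5
f.scorePoint(rep(-1, 5))
```

The full implementation in Task 6 computes these z-scores from the dataset's distance matrix before applying the same rule.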
The minimum multiplier was set at 0.1 to keep the core point scores low, the minimum possible score being 1.5 when k = 5.

Summary of the First 5 Nearest Neighbors, with Cutoff Values for the -0.25 Threshold:

            1st NN    2nd NN    3rd NN    4th NN    5th NN
    Mean    4.4650    6.6072    8.1930    9.5270   10.5940
    SD      7.2217    9.5744   11.1643   13.0238   13.7709
    Cutoff  2.6596    4.2136    5.4019    6.2711    7.1513

The implementation first creates a distance matrix for the dataset using the rdist() function. The rdist() function was chosen because it returns a full matrix instead of a dist object, so the result is easier to manipulate. Each row is then sorted, and the k lowest values (excluding the point's zero distance to itself) are placed into a new matrix by the f.distToFirstkNN() function. The f.calcOLS() function then uses these distances and the equation above to calculate the outlier scores, which are returned in a vector that may be appended to the existing data frame. The only outside package used in the implementation was "fields", from which the rdist() function was used.
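Each entry in the Cutoff row above is simply mean + threshold * SD for that neighbor rank, i.e. the raw distance whose z-score equals the threshold. A short sketch (variable names are illustrative; the means and SDs are copied from the table) reproduces the row:

```r
# Means and SDs of the distances to the 1st..5th nearest neighbors,
# taken from the summary table above
nn.mean <- c(4.4650, 6.6072, 8.1930, 9.5270, 10.5940)
nn.sd   <- c(7.2217, 9.5744, 11.1643, 13.0238, 13.7709)

# Raw-distance cutoff corresponding to a z-score of -0.25: distances
# below this value receive the minimum multiplier
threshold <- -0.25
cutoff <- nn.mean + threshold * nn.sd
cutoff  # approximately 2.6596 4.2136 5.4019 6.2711 7.1513
```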
Task 6: Submit the code of the implementation of your outlier detection technique!

# import dataset from url
# names(complex9_gn8) <- c("X", "Y", "Class")

#####################################################################
# Task 0
require("lattice")
require("ggplot2")

f.gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}

colors9 <- f.gg_color_hue(9)
my.settings9 <- list(
  superpose.symbol = list(col = colors9, border = "transparent"))

d <- data.frame(complex9_gn8[, 1:2], Class = factor(complex9_gn8$Class))
xyplot(Y ~ X | Class, d, groups = d$Class, pch = 20,
       main = "Visualization with Clusters Separated",
       par.settings = my.settings9)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization with Clusters Combined") +
  theme(plot.title = element_text(face = "bold"))
# End Task 0
#####################################################################

#####################################################################
# Task 1 & 2
install.packages("fields")
require("fields")

# For each row of a distance matrix, return the distances to the
# k nearest neighbors (skipping the zero distance to the point itself)
f.distToFirstkNN <- function(distances, k = 1) {
  numRows <- nrow(distances)
  returnValue <- matrix(0, nrow = numRows, ncol = k)
  for (i in 1:numRows) {
    sortedRow <- sort(distances[i, ])
    for (j in 1:k) {
      returnValue[i, j] <- sortedRow[j + 1]
    }
  }
  returnValue
}

# Outlier score: weighted sum of adjusted z-scores of the kNN distances
f.calcOLS <- function(dataset, k = 1, threshold = -0.25, baseMult = 0.1) {
  numRows <- nrow(dataset)
  Mdist <- rdist(dataset)
  firstkNNDistances <- f.distToFirstkNN(Mdist, k)
  sds <- apply(firstkNNDistances, 2, sd)
  means <- apply(firstkNNDistances, 2, mean)
  ols <- vector("numeric", numRows)
  zShift <- baseMult - threshold
  for (i in 1:numRows) {
    score <- 0
    for (j in 1:k) {
      z <- (firstkNNDistances[i, j] - means[j]) / sds[j]
      if (z < threshold) {
        z <- baseMult
      } else {
        z <- z + zShift
      }
      score <- score + (z * j)
    }
    ols[i] <- as.numeric(score)
  }
  ols
}

complex9_gn8$ols <- f.calcOLS(complex9_gn8[, 1:2], k = 5)
# End Task 1 & 2
#####################################################################

#####################################################################
# Task 3
# a) & b)
X <- complex9_gn8[with(complex9_gn8, order(-ols)), ]

# c)
n <- nrow(X)
cutoff2 <- floor(n * .02)
d <- data.frame(X[1:cutoff2, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of First 2%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))
d <- data.frame(X[cutoff2:n, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of Remaining 98%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))

# d)
cutoff5 <- floor(n * .05)
d <- data.frame(X[1:cutoff5, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of First 5%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))
d <- data.frame(X[cutoff5:n, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of Remaining 95%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))

# e)
cutoff10 <- floor(n * .1)
d <- data.frame(X[1:cutoff10, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of First 10%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))
d <- data.frame(X[cutoff10:n, 1:3])
d$Class <- as.factor(d$Class)
ggplot(d, aes(x = X, y = Y, colour = Class)) +
  geom_point() +
  ggtitle("Visualization of Remaining 90%") +
  xlim(-240, 860) + ylim(-140, 640) +
  theme(plot.title = element_text(face = "bold"))

# g)
d <- data.frame(X[1:cutoff10, ])
hist(d$ols, freq = TRUE,
     main = "Outlier Score Histogram for Top 10% ols Values",
     col = "red", xlab = "Outlier Score", breaks = 30)
# End Task 3
#####################################################################
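As a possible follow-up to the gap analysis in Task 3 g), score gaps like those described (180-190, 200-230, 240-280) could also be located mechanically rather than read off the histogram, by sorting the ols values and inspecting consecutive differences. The sketch below is a hypothetical helper, not part of the submitted code:

```r
# Locate gaps of at least min.width in a vector of outlier scores
f.findGaps <- function(ols, min.width = 10) {
  s <- sort(ols)
  d <- diff(s)                  # spacing between consecutive sorted scores
  idx <- which(d >= min.width)  # positions where a gap starts
  data.frame(from = s[idx], to = s[idx + 1], width = d[idx])
}

# Toy example: one wide gap between 42 and 180
f.findGaps(c(1, 3, 5, 42, 180, 185), min.width = 50)
#   from  to width
# 1   42 180   138
```

Applied as f.findGaps(X$ols, 10), this would list the high-end gaps in the sorted dataset directly.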
