


COP5577 Midterm Exam Answers to Selected Questions

Question I: Short Answers (30pts, 2 pts for each question)

1. Describe the major tasks in data preprocessing.

data cleaning, data integration, data transformation, data reduction, data discretization

2. How is a data warehouse different from a database? How are they similar?

Differences between a data warehouse and a database: A data warehouse is a repository of information collected from multiple sources over a period of time, stored under a unified schema, and used for data analysis and decision support. A database, in contrast, is a collection of interrelated data that represents the current state of the stored data; there may be multiple heterogeneous databases, where the schema of one database need not agree with the schema of another. A database system supports ad-hoc queries and on-line transaction processing.

Similarities between a data warehouse and a database: Both are repositories of information, storing huge amounts of persistent data.

3. Discuss issues to consider during data integration.

Data integration involves combining data from multiple sources into a coherent data store. Issues that must be considered during such integration include:

Schema integration: The metadata from the different data sources must be integrated in order to match up equivalent real-world entities. This is referred to as the entity identification problem.

Handling redundant data: Derived attributes may be redundant, and inconsistent attribute naming may also lead to redundancies in the resulting data set. Duplications at the tuple level may occur and thus need to be detected and resolved.

Detection and resolution of data value conflicts: Differences in representation, scaling, or encoding may cause the same real-world entity attribute values to differ in the data sources being integrated.

4. What is the main idea of principal component analysis (PCA)?

The basic idea of PCA is to find the axes that show the greatest variation and to project all points onto those axes. In particular, it first computes the eigenvectors of the covariance matrix. These eigenvectors define the new space, and the eigenvalues rank them in order of "goodness." Each data vector is then represented as a linear combination of the selected principal component vectors.
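
As an illustration, a minimal PCA sketch in Python (NumPy assumed; the function and variable names are our own, not part of the exam):

import numpy as np

def pca(X, k):
    """Project the rows of X (n_samples x n_features) onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)              # center the data
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigen-decomposition of the symmetric matrix
    order = np.argsort(eigvals)[::-1]            # rank eigenvectors by decreasing eigenvalue ("goodness")
    components = eigvecs[:, order[:k]]           # the k axes showing the greatest variation
    return X_centered @ components               # each sample expressed in the new space

X = np.random.rand(100, 5)                       # example: 100 points in 5 dimensions
Z = pca(X, 2)                                    # reduced to 2 principal components
print(Z.shape)                                   # (100, 2)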

5. Describe the three important issues in visualization and give two visualization techniques to display high-dimensional data.

• Representation. How will you map objects, attributes, and relationships to visual elements?

• Arrangement. Are there any special considerations that need to be taken into account with respect to how visual elements are displayed? Specific examples might be the choice of viewpoint, the use of transparency, or the separation of certain groups of objects.

• Selection. How will you handle a large number of attributes and data objects?

Matrix Plot, Parallel Coordinates, Star Plots

6. Why is Naïve Bayesian classification called “naïve”? Briefly outline the major ideas of Naïve Bayesian Classification.

Ans: Naïve Bayesian classification is called "naïve" because it assumes that the attributes are conditionally independent given the class label.

Based on this assumption, compute P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci), where x1, …, xn are the attribute values of X.

Once the probability P(X|Ci) is known, assign X to the class Ci that maximizes P(X|Ci) × P(Ci).
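
For illustration, a minimal sketch of Naïve Bayesian classification for categorical attributes in Python (the function names are hypothetical; a practical implementation would also add Laplace smoothing to avoid zero probabilities):

from collections import Counter, defaultdict

def nb_train(records, labels):
    prior = Counter(labels)                      # class counts, used for P(Ci)
    cond = defaultdict(Counter)                  # (class, attribute index) -> value counts
    for x, c in zip(records, labels):
        for j, v in enumerate(x):
            cond[(c, j)][v] += 1
    return prior, cond, len(labels)

def nb_classify(x, prior, cond, n):
    best, best_score = None, -1.0
    for c, nc in prior.items():
        score = nc / n                           # P(Ci)
        for j, v in enumerate(x):
            score *= cond[(c, j)][v] / nc        # P(xj | Ci), using the independence assumption
        if score > best_score:
            best, best_score = c, score
    return best                                  # class maximizing P(X|Ci) * P(Ci)

records = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
labels = ["no", "yes", "yes"]
model = nb_train(records, labels)
print(nb_classify(("sunny", "cool"), *model))    # prints "yes"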

7. What’s the generalization error? How do you get a good estimate of the generalization error?

Generalization error is the expected error of the model on previously unseen records. Methods for estimating generalization errors:

a. Training error (not a good estimate; it is overly optimistic)

b. Pessimistic estimate

c. Test sample estimate

d. Cross-validation

8. What’s overfitting? How do you overcome overfitting in decision tree construction?

Ans: Overfitting occurs when a model that fits the training data well (low training error) has a poorer generalization error than a model with a higher training error. Overfitting can be caused by noise and by insufficiently representative samples. To overcome overfitting in decision tree construction, we can use a pessimistic error estimate, take model complexity into account, and apply pre-pruning or post-pruning.

9. Explain the basic ideas of Bagging and Boosting respectively

Bagging (bootstrap aggregation): First generate a sequence of component classifiers, each trained on a bootstrap sample drawn from the original training set. The final classifier decides by voting among the component classifiers (see the sketch below).

Boosting: Generate component classifiers sequentially so that each one does well where the previous ones do badly. The final classifier classifies by weighted voting among the component classifiers, with weights determined by their accuracy on the training set.
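
A minimal sketch of the bagging idea in Python (scikit-learn's DecisionTreeClassifier is assumed as the component classifier, and X, y are assumed to be NumPy arrays with integer class labels; the function names are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    models, n = [], len(X)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)         # bootstrap sample: n records drawn with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])     # one row of predictions per component classifier
    # final decision: majority vote per record (assumes integer class labels)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)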

10. Compare the advantages and disadvantages of eager classification (e.g., decision tree, Bayesian, neural network) versus lazy classification (e.g., k-nearest neighbor, case-based reasoning).

Eager classification is faster at classification than lazy classification because it constructs a generalization model before receiving any new tuples to classify. Weights can be assigned to attributes, which can improve classification accuracy. Disadvantages of eager classification are that it must commit to a single hypothesis that covers the entire instance space, which can decrease classification accuracy, and that more time is needed for training. Lazy classification uses a richer hypothesis space, which can improve classification accuracy, and it requires less time for training than eager classification. A disadvantage of lazy classification is that all training tuples need to be stored, which leads to expensive storage costs and requires efficient indexing techniques. Another disadvantage is that it is slower at classification because classifiers are not built until new tuples need to be classified. Furthermore, attributes are all equally weighted, which can decrease classification accuracy (problems may arise due to irrelevant attributes in the data).

11. What are the main challenges of mining association rules with an item taxonomy?

Difficulty of deciding the right support and confidence thresholds: items residing at higher levels of the taxonomy have higher support than those residing at lower levels of the taxonomy. Many of the rules may also be redundant.

12. Explain the two important ideas behind the success of SVMs

Margin maximization and the kernel trick.

13. A characteristic rule is a rule of the form {p} → {q1, q2, . . . , qn}, where the rule antecedent contains only a single item. An itemset of size k can produce up to k characteristic rules. Let ζ be the minimum confidence of all characteristic rules generated from a given itemset: ζ({p1, p2, . . . , pk}) = min[ c({p1} → {p2, p3, . . . , pk}), . . . , c({pk} → {p1, p2, . . . , pk−1}) ]. Is ζ monotone, anti-monotone, or non-monotone? Provide your justifications.

ζ is anti-monotone, because ζ({A1, A2, . . . , Ak}) ≥ ζ({A1, A2, . . . , Ak, Ak+1}); the reasoning is sketched below.
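
A sketch of the reasoning in LaTeX notation, with s for support, c for confidence, I = {p1, . . . , pk} and I′ = I ∪ {p_{k+1}}:

c\bigl(\{p_i\} \rightarrow I' \setminus \{p_i\}\bigr)
  = \frac{s(I')}{s(\{p_i\})}
  \le \frac{s(I)}{s(\{p_i\})}
  = c\bigl(\{p_i\} \rightarrow I \setminus \{p_i\}\bigr),
  \qquad \text{since } s(I') \le s(I).

Every term inside the minimum can therefore only decrease when the itemset grows (and one extra rule is added to the minimum), so ζ(I′) ≤ ζ(I).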

14. List all the 4-subsequences contained in the following data sequence: < {1, 3} {2} {2, 3} {4} >,

< {1, 3} {2} {2}>

< {1, 3} {2} {4}>

< {1, 3} {3} {4}>

< {1} {2} {2} {4}>

< {1} {2, 3} {4}>

< {3} {2} {2} {4}>

< {3} {2, 3} {4}>

15. What are the value ranges of the following normalization methods? (a) min-max normalization; (b) z-score normalization; (c) normalization by decimal scaling

(a) min-max normalization: The range is [new_min, new_max].

(b) z-score normalization: The range is [(old_min − mean)/std, (old_max − mean)/std]; in general the values are unbounded, i.e., they can fall anywhere in (−∞, +∞).

(c) normalization by decimal scaling: The range is (−1, 1).
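
A minimal Python sketch of the three methods (NumPy assumed; the function names are our own):

import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    # maps [min(x), max(x)] onto [new_min, new_max]
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

def z_score(x):
    # result has mean 0 and standard deviation 1; individual values are unbounded
    return (x - x.mean()) / x.std()

def decimal_scaling(x):
    # divide by the smallest power of 10 that brings every value into (-1, 1)
    j = int(np.ceil(np.log10(np.abs(x).max())))
    return x / (10 ** j)

x = np.array([-991.0, 15.0, 320.0, 777.0])
print(min_max(x), z_score(x), decimal_scaling(x))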

Consider the following set of frequent 3-itemsets:

{1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {3, 4, 5}, {3, 4, 6}, {4, 5, 6}.

Assume that there are only six items in the data set.

16. List all candidate 4-itemsets obtained by the candidate generation procedure in Apriori.

{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {3, 4, 5, 6}.

17. List all candidate 4-itemsets that survive the candidate pruning step of the Apriori algorithm.

{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}
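
For illustration, a short Python sketch that reproduces both answers for the frequent 3-itemsets above: the F_{k-1} × F_{k-1} candidate generation (merge two frequent 3-itemsets sharing their first two items) and the candidate pruning step (every 3-subset of a candidate must be frequent). The variable names are our own:

from itertools import combinations

F3 = [(1,2,3), (1,2,4), (1,2,5), (1,2,6), (1,3,4), (1,3,5), (1,3,6),
      (2,3,4), (2,3,5), (2,3,6), (3,4,5), (3,4,6), (4,5,6)]

# Candidate generation: merge itemsets that share their first two items.
candidates = sorted({a[:2] + (a[2], b[2]) for a in F3 for b in F3
                     if a[:2] == b[:2] and a[2] < b[2]})

# Candidate pruning: keep a candidate only if all of its 3-subsets are frequent.
frequent3 = set(F3)
survivors = [c for c in candidates
             if all(s in frequent3 for s in combinations(c, 3))]

print(candidates)   # the 13 candidate 4-itemsets listed in question 16
print(survivors)    # [(1, 2, 3, 4), (1, 2, 3, 5), (1, 2, 3, 6)]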

18. The optimal solution w in SVM can be expressed as w = Σ_{i=1..N} αi yi xi, where (xi, yi) represents the i-th training sample and N is the sample size. Define support vectors and explain why the resulting hyperplane classifier is insensitive to the number and position of non-support vectors.

Support vectors are the training samples xi whose coefficients αi are strictly positive (αi > 0). Since the optimal solution w is a linear combination of the support vectors only (samples with αi = 0 contribute nothing to the sum), the resulting hyperplane is insensitive to the number and position of the non-support vectors.
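
Restated in LaTeX under the usual notation, where the αi are the dual (Lagrange) multipliers and b is the bias term (a sketch, not copied from the exam):

\mathbf{w} \;=\; \sum_{i=1}^{N} \alpha_i\, y_i\, \mathbf{x}_i,
\qquad \alpha_i \ge 0,\ \ \alpha_i > 0 \text{ only for support vectors},

f(\mathbf{x}) \;=\; \operatorname{sign}\!\Bigl(\sum_{i:\,\alpha_i>0} \alpha_i\, y_i\, \mathbf{x}_i^{\top}\mathbf{x} + b\Bigr),

so the decision function depends only on the support vectors; adding, removing, or moving non-support vectors (as long as they remain outside the margin) leaves w and the hyperplane unchanged.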

Consider a binary classification problem with the following set of attributes and attribute values:

• Air Conditioner = {Working, Broken},

• Engine = {Good, Bad},

• Mileage = {High, Medium, Low},

• Rust = {Yes, No}

Suppose a rule-based classifier produces the following rule set:

a) Mileage = High → Value = Low

b) Rust = No → Value = High

c) Air Conditioner = Working, Engine = Good → Value = High

d) Air Conditioner = Broken → Value = Low

19. Are the rules mutually exclusive?

No. For example, a car with Mileage = High and Rust = No triggers both rule (a) and rule (b), which assign conflicting values.

20. Is the rule set exhaustive?

No. For example, a car with Mileage = Low, Rust = Yes, Air Conditioner = Working, and Engine = Bad is not covered by any of the rules.

Question II: Short Answers II (40 pts; 3 pts for each of the first 12 questions and 4 pts for the last question)

1. Let X = (x1, x2) and Y = (y1, y2) be two-dimensional vectors. Given K(X, Y) = (3 + X·Y)², where · is the dot product, show the transformation implied by the kernel function.

K(X, Y) = (3 + x1y1 + x2y2)² = 9 + 6x1y1 + 6x2y2 + x1²y1² + x2²y2² + 2x1x2y1y2 = Φ(X)·Φ(Y), so the implied transformation is Φ(X) = (3, √6·x1, √6·x2, x1², x2², √2·x1x2).
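
A quick numeric check of this expansion in Python (illustrative only; NumPy assumed):

import numpy as np

def K(X, Y):
    return (3 + X @ Y) ** 2

def phi(X):
    x1, x2 = X
    return np.array([3.0,
                     np.sqrt(6) * x1, np.sqrt(6) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

X = np.array([1.0, 2.0])
Y = np.array([0.5, -1.0])
print(K(X, Y), phi(X) @ phi(Y))   # both print 2.25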

2. Suppose the confidence of the rules A −→ B and B −→ C are larger than some threshold, minconf. Is it possible that A −→ C has a confidence less than minconf?

Yes. It depends on the supports of the items A, B, and C.

For example:

s(A) = 90%, s(B) = 70%, s(C) = 60%

s(A,B) = 60%, s(B,C) = 50%, s(A,C) = 20%

Let minconf = 50%. Then:

c(A → B) = 60%/90% ≈ 66.7% > minconf

c(B → C) = 50%/70% ≈ 71.4% > minconf

But c(A → C) = 20%/90% ≈ 22.2% < minconf

3. Given a classification model, sensitivity is defined as the fraction of positive examples predicted correctly by the model and specificity is defined as the fraction of negative examples predicted correctly by the model. Show that accuracy is a function of sensitivity and specificity.

Accuracy = (TP + TN)/(TP + FP + TN + FN) = TP/(TP + FP + TN + FN) + TN/(TP + FP + TN + FN)

= TP/(TP + FN) × (TP + FN)/N + TN/(TN + FP) × (TN + FP)/N, where N = TP + FP + TN + FN

= sensitivity × (number of positive samples / total number of samples) + specificity × (number of negative samples / total number of samples)

4. In many applications, new data sets are incrementally added to the existing large data sets. Thus an important consideration for computing descriptive data summary is whether a measure can be computed efficiently in incremental manner. Can the following summary measures be computed efficiently in incremental manner: (1) Mean; (2) Median? If yes, please show how; and if no, please give justifications.

Mean: Yes. Store the running sum and the count of the values seen so far; when new values are added, add them to the sum, increase the count, and recompute mean = sum / count, without revisiting the old data (see the sketch below).

Median: To accurately calculate the median, we have to look at every value in the data set. When new values are added, we have to merge them into the sorted set and then find the median based on that new sorted set. This is much harder and thus makes the incremental computation of the exact median difficult.
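
A minimal Python sketch of an incrementally maintained mean (the class and method names are our own): only a running sum and count are stored, so new batches are absorbed without revisiting old data, whereas no analogous constant-size summary yields the exact median.

class RunningMean:
    def __init__(self):
        self.total = 0.0   # running sum of all values seen so far
        self.count = 0     # number of values seen so far

    def add(self, values):
        self.total += sum(values)
        self.count += len(values)

    @property
    def mean(self):
        return self.total / self.count

m = RunningMean()
m.add([4, 8, 15])
m.add([16, 23, 42])
print(m.mean)   # 18.0, the same as the mean of all six values computed at once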

5. Given a frequent itemset A and a subset B of A, prove that the confidence of the rule B′ → (A\B′) cannot be more than the confidence of B → (A\B), where B′ is a subset of B. (\ is the set difference operation.)

B′ is a subset of B ⇒ s(B′) ≥ s(B), by the anti-monotonicity of support.

Conf(B → (A\B)) = s(B ∪ (A\B)) / s(B) = s(A) / s(B), and likewise Conf(B′ → (A\B′)) = s(A) / s(B′).

⇒ s(A)/s(B′) ≤ s(A)/s(B), i.e., Conf(B′ → (A\B′)) ≤ Conf(B → (A\B)).

6. A partitioning variation of Apriori subdivides the transactions of a database D into n non-overlapping partitions. Prove that any itemset that is frequent in D must be frequent in at least one partition of D.

Hint: Proof by contradiction.

Assume that the itemset F is not frequent in any of the partitions of D. Let D be the task-relevant data, a set of database transactions, let C be the total number of transactions in D, let A be the number of transactions in D containing the itemset F, and let min_sup be the minimum support. Since F is frequent in D, A ≥ C × min_sup. Partition D into n non-overlapping partitions d1, d2, ..., dn, so that D = d1 ∪ d2 ∪ ... ∪ dn. Let c1, c2, ..., cn be the numbers of transactions in d1, d2, ..., dn, respectively; then C = c1 + c2 + ... + cn. Let a1, a2, ..., an be the numbers of transactions in d1, d2, ..., dn containing F, respectively; then A = a1 + a2 + ... + an. By the assumption, F is not frequent in any partition di, i.e., ai < ci × min_sup for every i = 1, ..., n. Summing these inequalities gives A = a1 + a2 + ... + an < (c1 + c2 + ... + cn) × min_sup = C × min_sup, which contradicts A ≥ C × min_sup. Therefore F must be frequent in at least one partition of D.
