Mr. Ghanshyam Dhomse (घनश्याम ढोमसे)
SNJB’s Late Sau. K.B. Jain College of Engineering, Chandwad
Department Of Computer Engineering
Subject- Machine Learning (410250)
Mock End Sem Paper Solution 2019-20 SEM-II
Q-1 A) Explain the role of machine learning algorithms in the following applications: 1. Visual tracking with a webcam or a smartphone 2. Predictive analytics 6 Marks
Ans- Visual object tracking is a fundamental task within the field of computer vision. The use of visual tracking is pertinent in a variety of applications, such as automated surveillance, video indexing, human-computer interaction, vehicle navigation, and augmented reality. In general, tracking is a challenging task which consists in generating an inference about the motion of an object given a sequence of images.
Visual tracking is a process that detects, extracts, identifies and locates an object in a video sequence. Specifically, once the object in the initial frame of the video sequence is determined, the position and attributes of the object in the following frames are given by the tracker automatically, or a prompt is given if the object moves out of view.
A visual tracking method includes not only the algorithms for object detection, extraction, recognition and localization, but also the scheme for organizing information and making decisions at the overall system level. All the procedures of tracking are interdependent and mutually constrained. For instance, the methods of object detection and extraction and the form of object representation determine the method of recognition and the efficiency and robustness of tracking, so the problem needs to be considered at the system level.
Visual tracking methods can be divided into two categories according to the observation model: generative methods and discriminative methods. A generative method uses a generative model to describe the appearance of the object and minimizes the reconstruction error to search for the object (PCA-based trackers are an example). A discriminative method learns to distinguish between the object and the background; its performance is more robust, and it has gradually become the dominant approach in tracking. Discriminative methods are also referred to as tracking-by-detection, and deep learning based trackers belong to this category.
Predictive analytics:- Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse data sets.
There are four type of data analytics:
1. Predictive (forecasting)
2. Descriptive (business intelligence and data mining)
3. Prescriptive (optimization and simulation)
4. Diagnostic analytics
Predictive Analytics: Predictive analytics turns data into valuable, actionable information. It uses data to determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, data mining and game theory that analyze current and historical facts to make predictions about future events.
There are three basic cornerstones of predictive analytics-
• Predictive modeling
• Decision Analysis and optimization
• Transaction profiling
B) Differentiate with an example Label encoder and one Hot encoder for managing categorical data. 6 Marks
Ans-
Label Encoding
To begin with, the scikit-learn documentation describes the LabelEncoder class in detail. Now, let’s consider the following data:
[pic]
In this example, the first column is the country column, which is all text. As you might know by now, we can’t have text in our data if we’re going to run any kind of model on it. So before we can run a model, we need to make this data ready for the model.
And to convert this kind of categorical text data into model-understandable numerical data, we use the Label Encoder class. So all we have to do, to label encode the first column, is import the LabelEncoder class from the sklearn library, fit and transform the first column of the data, and then replace the existing text data with the new encoded data.
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
We’ve assumed that the data is in a variable called ‘x’. After running this piece of code, if you check the value of x, you’ll see that the three countries in the first column have been replaced by the numbers 0, 1, and 2.
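For clarity, here is a minimal, self-contained sketch of the same idea. The country values France/Germany/Spain are taken from the example above; the array itself and the other columns are made up purely for illustration:
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: [country, age, salary], as in the example described above
x = np.array([['France', 44, 72000],
              ['Spain', 27, 48000],
              ['Germany', 30, 54000]], dtype=object)

labelencoder = LabelEncoder()
# Fit and transform only the first (categorical) column
x[:, 0] = labelencoder.fit_transform(x[:, 0])
print(x[:, 0])   # e.g. [0 2 1] -- France=0, Germany=1, Spain=2 (alphabetical order)
Note that the integer codes are assigned in alphabetical order of the categories, which is why a model might wrongly interpret them as ordered values; this is exactly the problem one hot encoding addresses next.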
One Hot Encoder :-
What one hot encoding does is, it takes a column which has categorical data, which has been label encoded, and then splits the column into multiple columns. The numbers are replaced by 1s and 0s, depending on which column has what value. In our example, we’ll get three new columns, one for each country — France, Germany, and Spain.
For rows which have the first column value as France, the ‘France’ column will have a ‘1’ and the other two columns will have ‘0’s. Similarly, for rows which have the first column value as Germany, the ‘Germany’ column will have a ‘1’ and the other two columns will have ‘0’s.
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()
As you can see in the constructor, we specify which column has to be one hot encoded, [0] in this case. Then we fit and transform the array ‘x’ with the onehotencoder object we just created. (Note that the categorical_features argument was later deprecated in scikit-learn in favour of ColumnTransformer; a sketch of that approach follows at the end of this answer.) And that’s it, we now have three new columns in our dataset:
[pic]
As you can see, we have three new columns with 1s and 0s, depending on the country that the rows represent.
So, that’s the difference between Label Encoding and One Hot Encoding.
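As noted above, recent scikit-learn releases removed the categorical_features argument of OneHotEncoder; the same result is obtained by routing the column through a ColumnTransformer. A minimal sketch under that assumption, reusing the hypothetical array x from the label-encoding example (OneHotEncoder accepts the raw strings or the label-encoded integers equally well):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (the country column) and pass the remaining columns through unchanged
ct = ColumnTransformer([('country_onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)
print(x)   # three 0/1 country columns followed by the original numeric columns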
C) Explain with reference to machine learning finding the optimal hyper-parameters through grid search with example. 8 Marks
Ans-
Finding the best hyperparameters (called this because they influence the parameters learned during the training phase) is not always easy and there are seldom good methods to start from. The personal experience (a fundamental element) must be aided by an efficient tool such as GridSearchCV, which automates the training process of different models and provides the user with optimal values using cross-validation.
As an example, we show how to use it to find the best penalty and strength factors for a linear regression with the Iris toy dataset.
A hyperparameter is a parameter whose value is set before the learning process begins.
Some examples of hyperparameters include penalty in logistic regression and loss in stochastic gradient descent.
In sklearn, hyperparameters are passed in as arguments to the constructor of the model classes.
Tuning Strategies
We will explore two different methods for optimizing hyperparameters:
• Grid SearchCV
• Random Search
Grid SearchCV
In the GridSearchCV approach, a machine learning model is evaluated for a range of hyperparameter values. The approach is called GridSearchCV because it searches for the best set of hyperparameters from a grid of hyperparameter values.
For example, suppose we want to set two hyperparameters, C and Alpha, of a Logistic Regression classifier, each with a different set of values. The grid search technique constructs many versions of the model with all possible combinations of the hyperparameters and returns the best one.
As in the image, take C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4].
For the combination C = 0.3 and Alpha = 0.2, the performance score comes out to be 0.726 (the highest), so that combination is selected.
[pic]
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# X, y: feature matrix and labels of the dataset (assumed to be loaded beforehand)
# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}
# Instantiating the logistic regression classifier
logreg = LogisticRegression()
# Instantiating the GridSearchCV object with 5-fold cross-validation
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
Output:
Tuned Logistic Regression Parameters: {‘C’: 3.7275937203149381}
Best score is 0.7708333333333334
Drawback: GridSearchCV goes through all the combinations of hyperparameters in the grid, which makes grid search computationally very expensive.
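Random Search, the second strategy listed above, addresses this drawback by sampling only a fixed number of hyperparameter combinations instead of trying them all. A minimal sketch, assuming the same logreg classifier and param_grid defined above:
from sklearn.model_selection import RandomizedSearchCV

# Sample only 5 values of C at random instead of trying all 15
logreg_rs = RandomizedSearchCV(logreg, param_distributions=param_grid,
                               n_iter=5, cv=5, random_state=0)
logreg_rs.fit(X, y)
print("Tuned Logistic Regression Parameters: {}".format(logreg_rs.best_params_))
print("Best score is {}".format(logreg_rs.best_score_))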
OR
Q-2 A) Explain data formats for supervised learning problem with example. 6 Marks
Ans-
Labeled data: Data consisting of a set of training examples, where each example is a pair consisting of an input and a desired output value (also called the supervisory signal, labels, etc)
Classification: The goal is to predict discrete values, e.g. {1,0}, {True, False}, {spam, not spam}.
Regression: The goal is to predict continuous values, e.g. home prices.
In a supervised learning problem, there will always be a dataset, defined as a finite set of real vectors with m features each:
[pic]
Feature vector: A typical setting for machine learning is to be given a collection of objects (or data points), each of which is characterised by several different features.
Features can be of different sorts: e.g., they might be continuous (say, real- or integer-valued) or categorical (for instance, a feature for colour can have values like green, blue, red ).
A vector containing all of the feature values for a given data point is called the feature vector;
if this is a vector of length m, then one can think of each data point as being mapped to an m-dimensional vector space (in the case of real-valued features, this is R^m), called the feature space.
This means all variables belong to the same distribution D, and considering an arbitrary subset of m values, it happens that:
[pic]
The corresponding output values can be both numerical-continuous or categorical. In the first case, the process is called regression, while in the second, it is called classification. Examples of numerical outputs are:
[pic]
Categorical examples are
[pic]
We define a generic regressor as a vector-valued function which associates an input value with a continuous output, and a generic classifier as a vector-valued function whose predicted output is categorical (discrete).
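As a concrete illustration of this data format, a labeled dataset can be held as a feature matrix X of shape (n, m) together with an output vector y; the values below are made up purely for illustration:
import numpy as np

# Feature matrix: n = 4 samples, m = 3 real-valued features each
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.4],
              [6.2, 2.9, 4.3],
              [5.9, 3.0, 5.1]])

y_class = np.array(['spam', 'not spam', 'spam', 'not spam'])  # categorical output -> classification
y_reg = np.array([250000.0, 180000.0, 320000.0, 290000.0])    # continuous output  -> regression

print(X.shape)   # (4, 3): each row is one m-dimensional feature vector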
B) Write a short note on Sparse PCA with example. 6 Marks
Ans- Sparse PCA finds the set of sparse components that can optimally reconstruct the data. The amount of sparseness is controllable by the coefficient of the L1 penalty, given by the parameter alpha.
If you think about the handwritten digits or other images that must be classified, their initial dimensionality can be quite high (a 10x10 image has 100 features).
However, a standard PCA selects only the components that are, on average, the most important, assuming that every sample can be rebuilt using the same components.
[pic]
On the other hand, we can always use a limited number of components, but without the limitation given by a dense projection matrix.
This can be achieved by using sparse matrices (or vectors), where the number of non-zero elements is quite low. In this way, each element can be rebuilt using its specific components
Here the non-null components have been put into the first block (they don't have the same order as the previous expression), while all the other zero terms have been separated. In terms of linear algebra, the vectorial space now has the original dimensions.
[pic]
>>> from sklearn.datasets import load_digits
>>> from sklearn.decomposition import SparsePCA
>>> digits = load_digits()
>>> spca = SparsePCA(n_components=60, alpha=0.1)
>>> X_spca = spca.fit_transform(digits.data / 255)
>>> spca.components_.shape
(60L, 64L)
The preceding snippet shows a sparse PCA with 60 components. In this context, the components are usually called atoms, and the amount of sparsity can be controlled via L1-norm regularization (higher alpha parameter values lead to sparser results).
C) Explain the Ridge and Lasso types of regression with example. 8 Marks
Ans- Ridge and Lasso regression are powerful techniques generally used for creating parsimonious models in presence of a ‘large’ number of features. Here ‘large’ can typically mean either of two things:
• Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting)
• Large enough to cause computational challenges. With modern systems, this situation might arise in case of millions or billions of features
• Though Ridge and Lasso might appear to work towards a common goal, the inherent properties and practical use cases differ substantially. If you’ve heard of them before, you must know that they work by penalizing the magnitude of coefficients of features along with minimizing the error between predicted and actual observations. These are called ‘regularization’ techniques.
Ridge regression imposes an additional shrinkage penalty to the ordinary least squares loss function to limit its squared L2 norm:
[pic]
Ridge Regression: Performs L2 regularization, i.e. adds penalty equivalent to square of the magnitude of coefficients
Minimization objective = LS Obj + α * (sum of square of coefficients)
Note that here ‘LS Obj’ refers to ‘least squares objective’, i.e. the linear regression objective without regularization.
α can take various values:
• α = 0:
– The objective becomes same as simple linear regression.
– We’ll get the same coefficients as simple linear regression.
• α = ∞:
– The coefficients will be zero. Why? Because of the infinite weightage on the square of the coefficients, any non-zero coefficient would make the objective infinite.
• 0 < α < ∞:
– The magnitude of α will decide the weightage given to different parts of objective.
– The coefficients will be somewhere between 0 and those of simple linear regression.
Lasso Regression: Performs L1 regularization, i.e. adds penalty equivalent to absolute value of the magnitude of coefficients
• Minimization objective = LS Obj + α * (sum of absolute value of coefficients)
• A Lasso regressor imposes a penalty on the L1 norm of w to determine a potentially higher number of null coefficients:
[pic]
Here, α (alpha) works similarly to ridge and provides a trade-off between balancing RSS and the magnitude of coefficients. Like ridge, α can take various values. Let's iterate them briefly (a short code sketch follows the list):
• α = 0: Same coefficients as simple linear regression
• α = ∞: All coefficients zero (same logic as before)
• 0 < α < ∞: coefficients between 0 and that of simple linear regression
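A minimal sketch of both regressors in scikit-learn, assuming a synthetic regression dataset generated with make_regression (the alpha values are arbitrary, chosen only to show the shrinkage effect):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

lr = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # L2 penalty: shrinks coefficients towards zero
lasso = Lasso(alpha=10.0).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print(np.abs(lr.coef_).sum(), np.abs(ridge.coef_).sum(), np.abs(lasso.coef_).sum())
print("Null coefficients in Lasso:", np.sum(lasso.coef_ == 0))
Comparing the summed coefficient magnitudes shows the shrinkage produced by both penalties, while the last line shows Lasso's ability to zero out coefficients entirely.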
Q-3 A) Justify with example, The following Statement : SVM is an example of a large margin classifier. 8 Marks
Ans-
[pic]
• SVM is a type of classifier which classifies positive and negative examples, here blue and red data points
• As shown in the image, the largest margin is found in order to avoid overfitting, i.e., the optimal hyperplane is at the maximum distance from the positive and negative examples (equidistant from the boundary lines).
• To satisfy this constraint, and also to classify the data points accurately, the margin is maximised; that is why this is called a large margin classifier.
Maximal Margin Classifier: - If we have p-dimensional space, a hyperplane is a flat subspace with dimension p-1. For example, in two-dimensional space a hyperplane is a straight line, and in three-dimensional space, a hyperplane is a two-dimensional subspace. Imagine a knife cutting through a piece of cheese that is in cubical shape and dividing it into two parts. Then put both pieces back and observe that space in between both pieces, that is a two-dimensional subspace in a three-dimensional space. For p greater than 3, the imagination can be harder, but the intuition remains the same.
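The statement can also be checked with a small scikit-learn sketch: on a linearly separable toy set, a linear SVM exposes the support vectors that lie on the margin and the hyperplane parameters it maximised the margin for (the dataset and C value below are illustrative assumptions):
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs of points (the blue/red classes in the figure above)
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

svm = SVC(kernel='linear', C=1000)   # large C: (nearly) hard margin
svm.fit(X, y)

# Only the points lying on the margin boundaries define the separating hyperplane
print("Support vectors:\n", svm.support_vectors_)
print("w =", svm.coef_, "b =", svm.intercept_)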
B) Write short notes on 1. Bernoulli Naive Bayes 2. Multinomial Naïve Bayes 3. Gaussian Naive Bayes 9 Marks
Ans-
Bernoulli naive Bayes
• If X is random variable and is Bernoulli-distributed, it can assume only two values (for simplicity, let's call them 0 and 1) and their probability is:
• [pic]
We're going to generate a dummy dataset. Bernoulli naive Bayes expects binary feature vectors; however, the class BernoulliNB has a binarize parameter, which allows us to specify a threshold that will be used internally to transform the features:
from sklearn.datasets import make_classification
>>> nb_samples = 300
>>> X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0)
• We have decided to use 0.0 as a binary threshold, so each point can be characterized by the quadrant where it's located.
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
>>> bnb = BernoulliNB(binarize=0.0)
>>> bnb.fit(X_train, Y_train)
>>> bnb.score(X_test, Y_test)
0.85333333333333339
[pic]
Now, checking the naive Bayes predictions, we obtain:
>>> import numpy as np
>>> data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
>>> bnb.predict(data)
array([0, 0, 1, 1])
Multinomial Naïve Bayes :-
Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution. This is the event model typically used for document classification.
• A multinomial distribution is useful to model feature vectors where each value represents, for example, the number of occurrences of a term or its relative frequency.
• If the feature vectors have n elements and each of them can assume k different values with probability pk, then:
[pic]
• The conditional probabilities P(xi|y) are computed with a frequency count (which corresponds to applying a maximum likelihood approach), but in this case, it's important to consider the alpha parameter (called Laplace smoothing factor). Its default value is 1.0 and it prevents the model from setting null probabilities when the frequency is zero.
• It's possible to assign all non-negative values; however, larger values will assign higher probabilities to the missing features and this choice could alter the stability of the model. In our example, we're going to consider the default value of 1.0.
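A minimal sketch of multinomial naive Bayes for document classification, assuming a tiny made-up corpus and term-frequency features produced by CountVectorizer (alpha is left at its default Laplace value of 1.0, as discussed above):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: 1 = spam, 0 = regular mail
corpus = ["win money now", "cheap money offer", "meeting schedule today", "project meeting notes"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # term-frequency feature vectors

mnb = MultinomialNB(alpha=1.0)         # Laplace smoothing factor
mnb.fit(X, labels)
print(mnb.predict(vectorizer.transform(["cheap offer today"])))   # -> array([1])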
Gaussian Naive Bayes classifier :-
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called Normal distribution.
When plotted, it gives a bell shaped curve which is symmetric about the mean of the feature values as shown below:
[pic]
Gaussian naive Bayes is useful when working with continuous values whose probabilities can be modeled using a Gaussian distribution:
[pic]
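A minimal Gaussian naive Bayes sketch on continuous features, using the Iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

gnb = GaussianNB()                  # fits a per-class Gaussian to every feature
gnb.fit(X_train, y_train)
print(gnb.score(X_test, y_test))    # typically around 0.9 or higher on Iris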
Q-4 A) Justify with example, the following statement:
Naive Bayes are a family of powerful and easy-to-train classifiers that determine the probability of an outcome given a set of conditions using Bayes' theorem. 8 Marks
Ans-
Naive Bayes classifiers: -
Naive Bayes are a family of powerful and easy-to-train classifiers, which determine the probability of an outcome, given a set of conditions using the Bayes’ theorem. In other words, the conditional probabilities are inverted so that the query can be expressed as a function of measurable quantities. The approach is simple and the adjective naive has been attributed not because these algorithms are limited or less efficient, but because of a fundamental assumption about the causal factors that we will discuss.
Naive Bayes are multi-purpose classifiers and it’s easy to find their application in many different contexts. However, the performance is particularly good in all those situations where the probability of a class is determined by the probabilities of some causal factors. A good example is given by natural language processing, where a text can be considered as a particular instance of a dictionary and the relative frequencies of all terms provide enough information to infer a belonging class. Our examples are kept generic, so that you can understand the application of naive Bayes in various contexts.
Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
This formula has very deep philosophical implications and it’s a fundamental element of statistical learning. First of all, let’s consider the marginal probability P(A): this is normally a value that determines how probable a target event is, like P(Spam) or P(Rain). As there are no other elements, this kind of probability is called Apriori, because it’s often determined by mathematical considerations or simply by a frequency count. For example, imagine we want to implement a very simple spam filter and we’ve collected 100 emails. We know that 30 are spam and 70 are regular. So we can say that P(Spam) = 0.3.
However, we’d like to evaluate using some criteria (for simplicity, let’s consider a single one), for example, e-mail text is shorter than 50 characters. Therefore, our query becomes:
[pic]
A naive Bayes classifier is called in this way because it’s based on a naive condition, which implies the conditional independence of causes. This can seem very difficult to accept in many contexts where the probability of a particular feature is strictly correlated to another one. For example, in spam filtering, a text shorter than 50 characters can increase the probability of the presence of an image, or if the domain has been already blacklisted for sending the same spam emails to million users, it’s likely to find particular keywords. In other words, the presence of a cause isn’t normally independent from the presence of other ones.
Let’s consider a dataset:
[pic]
Every feature vector, for simplicity, will be represented as:
[pic]
We need also a target dataset:
[pic]
where each y can belong to one of P different classes. Considering the Bayes’ theorem under conditional independence, we can write:
[pic]
The values of the marginal a priori probability P(y) and of the conditional probabilities P(xi|y) are obtained through a frequency count; therefore, given an input vector x, the predicted class is the one whose a posteriori probability is maximum.
B) Write a short note on: 1. Linear classification 2. Kernel based classification 9 Marks
Ans-
Classification: Use an object's characteristics to identify which class/category it belongs to.
Example: a new email is ‘spam’ or ‘non-spam’
✓ A patient diagnosed with a disease or not
✓ Classification is an example of pattern recognition
Linear classification :- A classification algorithm (classifier) that makes its classification based on a linear predictor function combining a set of weights with the feature vector.
✓ [pic]
A popular class of procedures for solving classification tasks are based on linear models. What this means is that they aim at dividing the feature space into a collection of regions labelled according to the values the target can take, where the decision boundaries between those regions are linear: they are lines in 2D, planes in 3D and hyperplanes with more features.
Popular linear models for classification include the following approaches; a short Python sketch follows the list:
• Linear Discriminant Analysis,
• Quadratic Discriminant Analysis,
• Regularized Discriminant Analysis,
• Logistic Regression.
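As a small illustration of a linear classifier, the sketch below fits a Logistic Regression model (the last method in the list above) and prints the weights and intercept that define its linear decision boundary; the dataset is synthetic and generated only for this example:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

clf = LogisticRegression()
clf.fit(X, y)

# The decision boundary is the line w1*x1 + w2*x2 + b = 0
print("weights w:", clf.coef_, "intercept b:", clf.intercept_)
print("training accuracy:", clf.score(X, y))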
Kernel Based Classification-
When working with non-linear problems, it's useful to transform the original vectors by projecting them into a higher dimensional space where they can be linearly separated. We saw a similar approach when we discussed polynomial regression. SVMs also adopt the same approach, even if there's now a complexity problem that we need to overcome. Our mathematical formulation becomes:
[pic]
Every feature vector is now filtered by a non-linear function that can completely reshape the scenario. However, the introduction of such a function increases the computational complexity.
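A minimal sketch of kernel-based classification, assuming the classic two-moons toy problem, which is not linearly separable in the original space but is handled well by an RBF (Gaussian) kernel SVM:
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)       # a linear boundary struggles here
rbf_svm = SVC(kernel='rbf', gamma=2.0).fit(X, y)  # implicit projection to a higher-dimensional space

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))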
Q-5 A) Explain with example, the process building a decision tree. 8 Marks
Ans- A decision tree is a tree-like structure that is used as a model for classifying data. A decision tree decomposes the data into sub-trees made of other sub-trees and/or leaf nodes.
A decision tree is made up of three types of nodes
• Decision Nodes: These type of node have two or more branches
• Leaf Nodes: The lowest nodes which represents decision
• Root Node: This is also a decision node but at the topmost level
Step by Step Procedure for Building a Decision Tree
Step 1: Determine the Classes and Build the Frequency Table
Since decision trees are used for classification, you need to determine the classes which are the basis for the decision.
In this case, it is the last column, that is, the Play Golf column with classes Yes and No.
To determine the root node we need to compute the entropy.
To do this, we create a frequency table for the classes (the Yes/No column).
[pic]
Step 2: Calculating Entropy for the classes (Play Golf)
In this step, you need to calculate the entropy for the Play Golf column and the calculation step is given below. Entropy(PlayGolf) = E(5,9)
[pic]
Step 3: Calculate Entropy for Other Attributes After Split
For the other four attributes, we need to calculate the entropy after each of the split.
E(PlayGolf, Outloook)
E(PlayGolf, Temperature)
E(PlayGolf, Humidity)
E(PlayGolf,Windy)
The entropy for two variables is calculated using the formula.
[pic]
Then, to calculate E(PlayGolf, Outlook), we would use the formula below:
[pic]
which is the same as:
E(PlayGolf, Outlook) = P(Sunny) E(3,2) + P(Overcast) E(4,0) + P(Rainy) E(2,3)
This formula may look unfriendly, but it is quite clear. The easiest way to approach this calculation is to create a frequency table for the two variables, that is PlayGolf and Outlook.
This frequency table is given below:
[pic]
Using this table, we can then calculate E(PlayGolf, Outlook), which would then be given by the formula below
[pic]
Let’s go ahead to calculate E(3,2)
We would not need to calculate the second and the third terms! This is because
E(4, 0) = 0
E(2,3) = E(3,2)
[pic]
Just for clarification, let’s show the calculation steps.
The calculation steps for E(4,0):
[pic]
The calculation step for E(2,3) is given below
[pic]
Time to put it all together.
We go ahead to calculate the E(PlayGolf, Outlook) by substituting the values we calculated from E(Sunny),E(Overcast) and E(Rainy) in the equation:
E(PlayGolf, Outlook) = P(Sunny) E(3,2) + P(Overcast) E(4,0) + P(Rainy) E(2,3)
[pic]
E(PlayGolf, Temperature) Calculation
Just like in the previous calculation, the calculation of E(PlayGolf, Temperature) is given below.
It is easier to do if you form the frequency table for the split on Temperature as shown.
[pic]
E(PlayGolf, Temperature) = P(Hot) E(2,2) + P(Cool) E(3,1) + P(Mild) E(4,2)
[pic]
E(PlayGolf, Humidity) Calculation
Just like in the previous calculation, the calculation of E(PlayGolf, Humidity) is given below.
It is easier to do if you form the frequency table for the split for Humidity as shown.
[pic]
[pic]
E(PlayGolf, Windy) Calculation
Just like in the previous calculation, the calculation of E(PlayGolf, Windy) is given below.
It is easier to do if you form the frequency table for the split on Windy as shown.
[pic]
[pic]
So now that we have the entropies for all four attributes, let’s go ahead and summarize them as shown below:
E(PlayGolf, Outloook) = 0.693
E(PlayGolf, Temperature) = 0.911
E(PlayGolf, Humidity) = 0.788
E(PlayGolf,Windy) = 0.892
Step 4: Calculating Information Gain for Each Split
The next step is to calculate the information gain for each of the attributes. The information gain is calculated from the split using each of the attributes. Then the attribute with the largest information gain is used for the split.
The information gain is calculated using the formula:
Gain(S,T) = Entropy(S) – Entropy(S,T)
For example, the information gain after splitting using the Outlook attribute is given by:
Gain(PlayGolf, Outlook) = Entropy(PlayGolf) – Entropy(PlayGolf, Outlook)
So let’s go ahead to do the calculation
Gain(PlayGolf, Outlook) = Entropy(PlayGolf) – Entropy(PlayGolf, Outlook)
= 0.94 – 0.693 = 0.247
Gain(PlayGolf, Temperature) = Entropy(PlayGolf) – Entropy(PlayGolf, Temparature)
= 0.94 – 0.911 = 0.029
Gain(PlayGolf, Humidity) = Entropy(PlayGolf) – Entropy(PlayGolf, Humidity)
= 0.94 – 0.788 = 0.152
Gain(PlayGolf, Windy) = Entropy(PlayGolf) – Entropy(PlayGolf, Windy)
= 0.94 – 0.892 = 0.048
Having calculated all the information gain, we now choose the attribute that gives the highest information gain after the split.
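Before moving on, the hand calculations above can be checked with a few lines of Python; the counts come directly from the frequency tables (9 Yes / 5 No overall, and the Outlook split 3/2, 4/0, 2/3):
import math

def entropy(a, b):
    # Entropy E(a, b) of a two-class split with counts a and b
    total = a + b
    e = 0.0
    for c in (a, b):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

e_play = entropy(5, 9)                                                            # 0.940
e_outlook = 5/14 * entropy(3, 2) + 4/14 * entropy(4, 0) + 5/14 * entropy(2, 3)    # 0.693
gain_outlook = e_play - e_outlook                                                 # 0.247
print(round(e_play, 3), round(e_outlook, 3), round(gain_outlook, 3))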
Step 5: Perform the First Split
Draw the First Split of the Decision Tree
Now that we have all the information gain, we then split the tree based on the attribute with the highest information gain.
From our calculation, the highest information gain comes from Outlook. Therefore the split will look like this:
[pic]
Figure 2: Decision Tree after first split
Now that we have the first stage of the decision tree, we see that we have one leaf node. But we still need to split the tree further.
To do that, we also need to split the original table to create sub-tables.
These sub-tables are given below.
[pic]
From Table 3, we could see that the Overcast outlook requires no further split because it is just one homogeneous group. So we have a leaf node.
Step 6: Perform Further Splits
The Sunny and the Rainy branches need to be split further.
The Sunny outlook can be split using either Temperature, Humidity or Windy.
Quiz 1: Which attribute would best be used for this split? Why?
Answer: Humidity, because it produces homogeneous groups.
[pic]
Table 8: Split using Humidity
The Sunny branch can therefore be split using the High and Normal values of Humidity, and that gives us the tree below.
[pic]
Figure 3: Split using the Humidity Attribute
Let’s now go ahead and do the same thing for the Rainy outlook.
The Rainy outlook can be split using either Temperature, Humidity or Windy.
Which attribute would best be used for this split? Why?
Answer: Windy, because it produces homogeneous groups.
[pic]
Table 9: Split using Windy Attribute
If we do the split using the Windy attribute, we would have the final tree that would require no further splitting! This is shown in Figure 4
Step 7: Complete the Decision Tree
The complete tree is shown in Figure 4.
Note that the same calculation that was used initially could also be used for the further splits. But that would not be necessary since you could just look at the sub table and be able to determine which attribute to use for the split.
[pic]
B) Comment on evaluation methods of clustering. With reference to the same, explain in brief the adjusted Rand index, completeness and homogeneity. 9 Marks
Ans- First approach (A): use the technique that relies on classes-to-clusters evaluation. In this technique you can rely on a method called "brute force" to find the minimum-error assignment of class labels to clusters, with the constraint that a specific class label can be assigned to only one cluster. If a cluster receives "no class", then all instances in that cluster are misclassified. On the other hand, if the error is equal between the assignment of a particular class to one of several clusters, then the first cluster considered during the search receives the assignment. Hence, in this approach, you can rely on the number of "incorrectly clustered instances" when comparing several clustering methods.
Second approach (B): convert the clustering technique into a classification one by letting the clustering method (e.g., K-means) be used within a classification process, where the previous approach (A) is used to evaluate the assigned clusters. Then use, for instance, cross-validation to evaluate the overall learning process. By doing so, you can use the evaluation metrics of classification (i.e., confusion matrix, F-measure, etc.) to evaluate the K-means clusterer.
The idea, however, is not to treat clustering as a classification problem, so classification metrics like ROC, the confusion matrix and log loss are not applicable, because clustering means something entirely different from classification. Dedicated clustering evaluation measures include:
1. Silhouette Coefficient
2. Adjusted Rand index
3. Mutual Information based scores
4. Calinski-Harabaz Index
5. Homogeneity, completeness and V-measure
6. Fowlkes-Mallows scores
Silhouette and the Dunn index, for example, measure how compact your clusters are. So, if your first intention was to get well-separated clusters, they will indicate the quality of your method (K-means is one of many methods that find compact clusters).
Accuracy, precision, recall, F-measure and MCC are better if you want a "statistical" approach. They all need a ground truth to run, i.e., if you're running clustering over a brand new data set, you'll need to evaluate your method over a well-known data set (choose the best you can and keep it for the first data set). MCC is great for an unbalanced data set, which is often problematic because the predominant class influences the final result above the others.
Entropy, homogeneity, completeness and V-measure are further options. They are all based on Shannon's entropy and measure the uncertainty in clusters and classes. Unfortunately, they are not the best choice for an unbalanced data set.
Moreover, it's always possible to adapt those measures to your semantics.
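For the three measures named in the question: the adjusted Rand index measures the agreement between the predicted clustering and the ground-truth labels, corrected for chance (1.0 means a perfect match, values near 0 mean essentially random labelling); homogeneity is satisfied when each cluster contains only members of a single class; completeness is satisfied when all members of a given class end up in the same cluster. A minimal scikit-learn sketch, assuming ground-truth labels are available (here taken from the Iris dataset purely for illustration):
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn import metrics

X, y_true = load_iris(return_X_y=True)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X)

print("Adjusted Rand index:", metrics.adjusted_rand_score(y_true, y_pred))
print("Homogeneity:", metrics.homogeneity_score(y_true, y_pred))
print("Completeness:", metrics.completeness_score(y_true, y_pred))
print("V-measure:", metrics.v_measure_score(y_true, y_pred))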
Q-6 A) With reference to clustering, Explain the issue of finding the optimal number of clusters with example 8 Marks
Ans- Determining the optimal number of clusters in a data set is a fundamental issue in partitioning clustering, such as k-means clustering, which requires the user to specify the number of clusters k to be generated.
A simple and popular solution consists of inspecting the dendrogram produced using hierarchical clustering to see if it suggests a particular number of clusters.
These methods include direct methods and statistical testing methods:
1. Direct methods: consist of optimizing a criterion, such as the within-cluster sum of squares or the average silhouette. The corresponding methods are named the elbow and silhouette methods, respectively.
2. Statistical testing methods: consist of comparing evidence against a null hypothesis. An example is the gap statistic.
A core problem in applying many of the existing clustering methods is that the number of clusters (the k parameter) needs to be specified before the clustering is carried out. The parameter k is either identified by users based on prior information or determined in a certain way. Clustering results may largely depend on the number of clusters specified. It is necessary to provide educated guidance for determining the number of clusters in order to achieve appropriate clustering results. Since the number of clusters is rarely known in advance, the usual approach is to run the clustering algorithm several times with a different k value for each run.
Validity indices are typically classified by researchers into two groups, i.e., internal or external. The external indices validate a partition by comparing it with the external information (true cluster labels), whereas the internal indices focus on the partitioned data and measure the compactness and separation of the clusters.
1. Knowledge driven: you should have some idea of how many clusters you need from a business point of view. For example, if you are clustering customers, you should ask yourself: after getting these clusters, what should I do next? Maybe you will have a different treatment for different clusters (e.g., advertising by email or phone). Then how many possible treatments are you planning? In this example, selecting, say, 100 clusters would not make much sense.
2. Data driven: too many clusters leads to over-fitting and too few clusters to under-fitting. You can always split the data in half and run cross-validation to see how many clusters are appropriate. Note that in clustering you still have a loss function, similar to the supervised setting (see the sketch below).
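A minimal data-driven sketch of the elbow and silhouette ideas mentioned above, assuming a synthetic blob dataset in which the "true" number of clusters is 4:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squares (elbow method);
    # silhouette_score peaks for the most compact, best-separated clustering
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))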
B) Write short notes on: (i) Bagging (ii) Boosting (iii) Random Forests 9 Marks
Ans-
Bagging :- Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of the subsets is the same as the size of the original set.
Bagging (or Bootstrap Aggregating) technique uses these subsets (bags) to get a fair idea of the distribution (complete set). The size of subsets created for bagging may be less than the original set.
[pic]
• Multiple subsets are created from the original dataset, selecting observations with replacement.
• A base model (weak model) is created on each of these subsets.
• The models run in parallel and are independent of each other.
• The final predictions are determined by combining the predictions from all the models.
Bagged (or Bootstrap) trees: In this case, the ensemble is built completely. The training process is based on a random selection of the splits and the predictions are based on a majority vote. Random forests are an example of bagged tree ensembles.
[pic]
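A minimal sketch of the bagging procedure with scikit-learn's BaggingClassifier, which by default bootstraps the training data and fits a decision tree on each subset; the dataset is synthetic and used only for illustration:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# 10 base models, each trained on a bootstrap sample; predictions are combined by voting
bag = BaggingClassifier(n_estimators=10, random_state=1)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))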
Boosting
• Before we go further, here’s another question for you: If a data point is incorrectly predicted by the first model, and then the next (probably all models), will combining the predictions provide better results? Such situations are taken care of by boosting.
• Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. The succeeding models are dependent on the previous model. Let’s understand the way boosting works in the below steps.
• A subset is created from the original dataset.
• Initially, all data points are given equal weights.
• A base model is created on this subset.
• This model is used to make predictions on the whole dataset.
• [pic]
• Similarly, multiple models are created, each correcting the errors of the previous model.
• The final model (strong learner) is the weighted mean of all the models (weak learners).
• [pic]
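A minimal sketch of the boosting idea using AdaBoost, in which each subsequent weak learner gives more weight to the points the previous learners got wrong; the synthetic data reuses the same illustrative setup as the bagging sketch above:
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# 50 weak learners fitted sequentially; misclassified points are re-weighted at every step
boost = AdaBoostClassifier(n_estimators=50, random_state=1)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))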
Random Forests :-
Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an extension of the bagging estimator algorithm.
The base estimators in random forest are decision trees.
Unlike bagging meta estimator, random forest randomly selects a set of features which are used to decide the best split at each node of the decision tree.
• Looking at it step-by-step, this is what a random forest model does:
• Random subsets are created from the original dataset (bootstrapping).
• At each node in the decision tree, only a random set of features are considered to decide the best split.
• A decision tree model is fitted on each of the subsets.
• The final prediction is calculated by averaging the predictions from all decision trees.
• To sum up, Random forest randomly selects data points and features, and builds multiple trees (a forest).
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)   # 0.77
Q 7 A) Elaborate Agglomerative clustering using scikit-learn library. 8 Marks
Ans-
Agglomerative Clustering
Recursively merges the pair of clusters that minimally increases a given linkage distance.
The AgglomerativeClustering object performs a hierarchical clustering using a bottom-up approach: each observation starts in its own cluster, and clusters are successively merged together. The linkage criterion determines the metric used for the merge strategy:
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
• Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of clusters.
• Single linkage minimizes the distance between the closest observations of pairs of clusters.
AgglomerativeClustering can also scale to a large number of samples when it is used jointly with a connectivity matrix, but it is computationally expensive when no connectivity constraints are added between samples: it considers all the possible merges at each step.
Agglomerative Clustering supports Ward, single, average, and complete linkage strategies.
[pic]
Scikit Code-
>>> from sklearn.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
>>> clustering.labels_
array([1, 1, 1, 0, 0, 0])
B) With reference to Deep Learning, Explain the concept of fully connected layers with example 8 Marks
Ans- A Convolutional Neural Network (CNN) is a type of neural network that specializes in image recognition and computer vision tasks.
CNNs have two main parts:
• A convolution/pooling mechanism that breaks up the image into features and analyzes them
• A fully connected layer that takes the output of convolution/pooling and predicts the best label to describe the image
[pic]
As part of the convolutional network, there is also a fully connected layer that takes the end result of the convolution/pooling process and reaches a classification decision.
Convolutional layer:- a “filter” passes over the image, scanning a few pixels at a time and creating a feature map that predicts the class to which each feature belongs.
• Pooling layer (downsampling): reduces the amount of information in each feature obtained in the convolutional layer while maintaining the most important information (there are usually several rounds of convolution and pooling).
• Fully connected input layer (flatten): takes the output of the previous layers, “flattens” them and turns them into a single vector that can be an input for the next stage.
• The first fully connected layer: takes the inputs from the feature analysis and applies weights to predict the correct label.
• Fully connected output layer: gives the final probabilities for each label.
Role of a fully connected layer in a CNN architecture
The objective of a fully connected layer is to take the results of the convolution/pooling process and use them to classify the image into a label (in a simple classification example).
The output of convolution/pooling is flattened into a single vector of values, each representing a probability that a certain feature belongs to a label. For example, if the image is of a cat, features representing things like whiskers or fur should have high probabilities for the label “cat”.
The image below illustrates how the input values flow into the first layer of neurons. They are multiplied by weights and pass through an activation function (typically ReLu), just like in a classic artificial neural network. They then pass forward to the output layer, in which every neuron represents a classification label.
[pic]
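To make the role of a fully connected layer concrete, here is a small NumPy sketch of the "flatten + fully connected + output" stage described above; the shapes and random weights are illustrative assumptions, not part of any real network:
import numpy as np

rng = np.random.default_rng(0)

# Output of the last convolution/pooling stage: 8 feature maps of size 4x4
feature_maps = rng.standard_normal((8, 4, 4))

# Flatten layer: turn the feature maps into a single vector of length 8*4*4 = 128
x = feature_maps.reshape(-1)

# First fully connected layer: every input is connected to each of 64 neurons
W1, b1 = rng.standard_normal((64, 128)), np.zeros(64)
h = np.maximum(0, W1 @ x + b1)          # ReLU activation

# Fully connected output layer: one neuron per classification label (say 10 classes)
W2, b2 = rng.standard_normal((10, 64)), np.zeros(10)
logits = W2 @ h + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: final probability for each label
print(probs.round(3))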
Q 8 A) Justify with elaboration the following statement :
Hierarchical clustering is based on the general concept of finding a hierarchy of partial clusters, built using either a bottom-up or a top down approach. 8 Marks
Ans- Hierarchical clustering is the hierarchical decomposition of the data based on group similarities
Finding hierarchical clusters
There are two top-level methods for finding these hierarchical clusters:
• Agglomerative clustering uses a bottom-up approach, wherein each data point starts in its own cluster. These clusters are then joined greedily, by taking the two most similar clusters together and merging them.
• Divisive clustering uses a top-down approach, wherein all data points start in the same cluster. You can then use a parametric clustering algorithm like K-Means to divide the cluster into two clusters. For each cluster, you further divide it down to two clusters until you hit the desired number of clusters.
Agglomerative Hierarchical Clustering
The Agglomerative Hierarchical Clustering is the most common type of hierarchical clustering used to group objects in clusters based on their similarity. It’s also known as AGNES (Agglomerative Nesting). It's a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
How does it work?
1. Make each data point a single-point cluster → forms N clusters
2. Take the two closest data points and make them one cluster → forms N-1 clusters
3. Take the two closest clusters and make them one cluster → Forms N-2 clusters.
4. Repeat step-3 until you are left with only one cluster.
Have a look at the visual representation of Agglomerative Hierarchical Clustering for better understanding:
[pic]
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. Some of the common linkage methods are:
• Complete-linkage: the distance between two clusters is defined as the longest distance between two points in each cluster.
• Single-linkage: the distance between two clusters is defined as the shortest distance between two points in each cluster. This linkage may be used to detect high values in your dataset which may be outliers as they will be merged at the end.
• Average-linkage: the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
• Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.
The choice of linkage method entirely depends on you and there is no hard and fast method that will always give you good results. Different linkage methods lead to different clusters.
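The effect of hierarchical merging can be inspected with SciPy's hierarchical-clustering utilities; the sketch below builds a dendrogram on a small synthetic dataset, with Ward linkage chosen as an illustrative assumption:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# linkage() accepts 'ward', 'complete', 'average' or 'single' as the method
Z = linkage(X, method='ward')
dendrogram(Z)
plt.title("Ward-linkage dendrogram")
plt.show()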
Divisive Hierarchical Clustering
Divisive clustering, or DIANA (DIvisive ANAlysis Clustering), is a top-down clustering method where we assign all of the observations to a single cluster and then partition the cluster into the two least similar clusters. Finally, we proceed recursively on each cluster until there is one cluster for each observation. This clustering approach is exactly the opposite of agglomerative clustering.
[pic]
There is evidence that divisive algorithms produce more accurate hierarchies than agglomerative algorithms in some circumstances but is conceptually more complex.
In both agglomerative and divisive hierarchical clustering, users need to specify the desired number of clusters as a termination condition (when to stop merging or splitting).
Hierarchical clustering is a very useful way of segmentation. The advantage of not having to pre-define the number of clusters gives it quite an edge over k-means. However, it doesn't work well with huge amounts of data.
B) Explain with example Naive User based Recommendation systems. 8 Marks
Ans- We assume that we have a set of users represented by m-dimensional feature vectors:
[pic]
Typical features are age, gender, interests, and so on. All of them must be encoded using one of the techniques discussed in the previous chapters (for example, they can be binarized, normalized in a fixed range, or transformed into one-hot vectors). However, in general, it's useful to avoid different variances that can negatively impact the computation of the distance between neighbors.
We have a set of k items:
[pic]
Let's also assume that there is a relation that associates each user with a subset of items (bought or positively reviewed), items for which an explicit action or feedback has been performed:
[pic]
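A minimal sketch of a naive user-based system built on this idea: encode each user as an m-dimensional feature vector, find the k nearest users with scikit-learn's NearestNeighbors, and recommend the items associated with those neighbours. The user features and the item relation below are entirely made up for illustration:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical users: [age (scaled), gender (0/1), interest_sports, interest_music]
users = np.array([[0.25, 0, 1, 0],
                  [0.30, 0, 1, 0],
                  [0.60, 1, 0, 1],
                  [0.55, 1, 0, 1],
                  [0.28, 0, 1, 1]])

# Relation user -> items bought or positively reviewed (item ids are arbitrary)
user_items = {0: {'i1', 'i2'}, 1: {'i2', 'i3'}, 2: {'i4'}, 3: {'i4', 'i5'}, 4: {'i1'}}

nn = NearestNeighbors(n_neighbors=3).fit(users)
_, neighbors = nn.kneighbors(users[[4]])   # neighbours of user 4 (includes user 4 itself)

# Recommend items owned by the neighbours that user 4 does not have yet
candidate_items = set().union(*(user_items[i] for i in neighbors[0] if i != 4))
print(candidate_items - user_items[4])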
******************************** THE END ****************************************************