


Department of Computer Engineering
Lab Manual
Final Year, Semester VIII
Subject: Data Warehouse and Mining
Even Semester

Institutional Vision, Mission and Quality Policy

Our Vision
To foster and permeate higher and quality education with value-added engineering and technology programs, providing all facilities in terms of technology and platforms for all-round development with societal awareness, and to nurture the youth with international competencies and an exemplary level of employability even under a highly competitive environment, so that they are innovative, adaptable and capable of handling problems faced by our country and the world at large.
RAIT's firm belief in a new form of engineering education that lays equal stress on academics and leadership-building extracurricular skills has been a major contribution to the success of RAIT as one of the most reputed institutions of higher learning. The challenges faced by our country and the world in the 21st century need a whole new range of thought and action leaders, which a conventional educational system in engineering disciplines is ill equipped to produce. Our reputation for providing good engineering education with additional life skills ensures that high-grade and highly motivated students join us. Our laboratories and practical sessions reflect the latest that is being followed in the industry. The project works and summer projects make our students adept at handling real-life problems and industry ready. Our students are well placed in the industry, and their performance makes reputed companies visit us with renewed demands and vigour.

Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of excellence, thereby ensuring that the Institution becomes a pivotal center of service to industry, academia and society with the latest technology. RAIT engages different platforms such as technology-enhancing Student Technical Societies, cultural platforms, sports excellence centers, an Entrepreneurial Development Center and a Societal Interaction Cell. To develop the college to become an autonomous institution and deemed university at the earliest, with facilities for advanced research and development programs on par with international standards. To invite international and reputed national institutions and universities to collaborate with our institution on issues of common interest in teaching and learning sophistication.
RAIT's mission is to produce engineering and technology professionals who are innovative and inspiring thought leaders, adept at solving problems faced by our nation and the world, by providing quality education.
The Institute is working closely with all stakeholders, like industry and academia, to foster knowledge generation, acquisition and dissemination, using the best available resources to address the great challenges being faced by our country and the world. RAIT is fully dedicated to providing its students skills that make them leaders and solution providers who are industry ready when they graduate from the Institution.
We at RAIT assure our main stakeholders, the students, 100% quality for the programmes we deliver. This quality assurance stems from the teaching and learning processes we have at work at our campus and the teachers, who are handpicked from reputed institutions such as IIT/NIT/MU, etc., and who inspire the students to be innovative in thinking and practical in approach.
We have installed internal procedures to better the skill set of instructors by sending them to training courses, workshops, seminars and conferences. We also have a full-fledged course curriculum and deliveries planned in advance for a structured, semester-long programme. We have a well-developed feedback system from employers, alumni, students and parents to fine-tune the learning and teaching processes. These tools help us ensure the same quality of teaching independent of any individual instructor. Each classroom is equipped with Internet and other digital learning resources.
The effective learning process on the campus comprises a clean and stimulating classroom environment and the availability of lecture notes and digital resources prepared by the instructor, accessible from the comfort of home. In addition, the student is provided with a good number of assignments that trigger the thinking process. The testing process involves an objective test paper that gauges the students' understanding of concepts. The quality assurance process also ensures that the learning process is effective. The summer internships and project-work-based training ensure that the learning process includes practical and industry-relevant aspects. Various technical events, seminars and conferences make the student learning complete.

Our Quality Policy
It is our earnest endeavour to produce high-quality engineering professionals who are innovative and inspiring, thought and action leaders, competent to solve problems faced by society, nation and world at large, by striving towards very high standards in learning, teaching and training methodologies.
Our Motto: If it is not of quality, it is NOT RAIT!
Dr. Vijay D. Patil
President, RAES

Departmental Vision, Mission

Vision
To impart higher and quality education in computer science with value-added engineering and technology programs to prepare technically sound, ethically strong engineers with social awareness. To extend the facilities to meet the fast-changing requirements and nurture the youth with international competencies and an exemplary level of employability and research under highly competitive environments.

Mission
To mobilize the resources and equip the institution with men and materials of excellence to provide knowledge and develop technologies in the thrust areas of computer science and engineering. To provide diverse platforms of sports, technical, co-curricular and extracurricular activities for the overall development of students with an ethical attitude. To prepare the students to sustain the impact of computer education for social needs encompassing industry, educational institutions and public service.
To collaborate with IITs, reputed universities and industries for the technical and overall upliftment of students for continuing learning and entrepreneurship.

Departmental Program Educational Objectives (PEOs)
Learn and Integrate - To provide Computer Engineering students with a strong foundation in the mathematical, scientific and engineering fundamentals necessary to formulate, solve and analyze engineering problems, and to prepare them for graduate studies.
Think and Create - To develop an ability to analyze the requirements of the software and hardware, understand the technical specifications, create a model, and design, implement and verify a computing system to meet specified requirements while considering real-world constraints, in order to solve real-world problems.
Broad Base - To provide the broad education necessary to understand the science of computer engineering and its impact in a global and social context.
Techno-leader - To provide exposure to emerging cutting-edge technologies, adequate training and opportunities to work as teams on multidisciplinary projects with effective communication skills and leadership qualities.
Practice Citizenship - To provide knowledge of professional and ethical responsibility and to contribute to society through active engagement with professional societies, schools, civic organizations or other community activities.
Clarify Purpose and Perspective - To provide strong in-depth education through electives and to promote student awareness of life-long learning, to adapt to innovation and change, and to be successful in professional work or graduate studies.

Departmental Program Outcomes (POs)
Foundation of Computing - An ability to apply knowledge of computing, applied mathematics, and fundamental engineering concepts appropriate to the discipline.
Experiments & Data Analysis - An ability to understand, identify, analyze and design the problem, and to implement and validate the solution, including both hardware and software.
Current Computing Techniques - An ability to use the current techniques, skills, and tools necessary for computing practice.
Teamwork - An ability to have leadership and management skills to accomplish a common goal.
Engineering Problems - An ability to identify, formulate, and solve engineering problems.
Professional Ethics - An understanding of professional, ethical, legal, security and social issues and responsibilities.
Communication - An ability to communicate effectively with a range of audiences in both verbal and written form.
Impact of Technology - An ability to analyze the local and global impact of computing on individuals, organizations, and society.
Life-long Learning - An ability to recognize the need for, and an ability to engage in, life-long learning.
Contemporary Issues - An ability to exploit gained skills and knowledge of contemporary issues.
Professional Development - Recognition of the need for, and an ability to engage in, continuing professional development and higher studies.
Employment - An ability to gain employment in industries of international repute through the training programs, internships, projects, workshops and seminars.

Index
1. List of Experiments
2. Course Objective, Course Outcomes and Experiment Plan
3. CO-PO Mapping
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Experiment No. 10
15. Experiment No. 11

List of Experiments
Sr. No. / Experiment Name
1. To study and implement all basic HTML tags.
2. To implement Cascading Style Sheets.
3. To implement a bank transaction form using JavaScript.
4. To design an email registration form and validate it using JavaScript.
5. To implement JavaScript document and window objects.
6. To design a home page for RAIT using Kompozer.
7. To design an online examination form using Kompozer.
8. To design a home page for online mobile shopping using Kompozer.
9. To design an XML document using XML Schema for representing your semester marksheet using PHP.
10. To design a DTD for representing your semester marksheet.
11. To design an XML Schema and DTD for a railway reservation system.
12. Design an HTML form to accept two numbers N1 and N2. Display the prime numbers between N1 and N2 using PHP.
13. Design a login form to add username, id and password into a database and validate it (use PHP).
14. Design a course registration form and perform various database operations using PHP and MySQL database connectivity.
15. Mini Project.

Course Objectives & Course Outcomes, Experiment Plan

Course Objectives:
1. To study the methodology of engineering legacy databases for data warehousing.
2. To study the design modeling of a data warehouse.
3. To study the preprocessing and online analytical processing of data.
4. To study the methodology of engineering legacy data mining to derive business rules for decision support systems.
5. To analyze the data, identify the problems, and choose the relevant models and algorithms to apply.

Course Outcomes:
CO1: Students will be able to understand the data warehouse and the design model of a data warehouse.
CO2: Students will be able to learn the steps of preprocessing.
CO3: Students will be able to understand the analytical operations on data.
CO4: Students will be able to discover patterns and knowledge from a data warehouse.
CO5: Students will be able to understand and implement classical algorithms in data mining and data warehousing.

Experiment Plan (Module No. / Week No. / Experiment Name / Course Outcome / Weightage):
1. W1, W2 - One case study given to a group of 3/4 students of a data mart / data warehouse. (CO1, Weightage 10)
2. W3 - Implementation of a classifier like Decision Tree using Java. (CO5, Weightage 1)
3. W4 - Use WEKA to implement a classifier like Decision Tree. (CO5, Weightage 1)
4. W5 - Implementation of a clustering algorithm like K-means using Java. (CO5, Weightage 2)
5. W6 - Use WEKA to implement the K-means clustering algorithm. (CO5, Weightage 2)
6. W7 - Implementation of association mining like Apriori using Java. (CO5, Weightage 2)
7. W8 - Use WEKA to implement association mining like Apriori. (CO5, Weightage 2)
8. W9 - Use the R tool to implement clustering / association rule / classification algorithms. (CO3, Weightage 5)
9. W10 - Detailed study of a BI tool - SPSS Clementine. (CO3, Weightage 5)
10. W11 - Study different OLAP operations. (CO4, Weightage 10)
11. W12 - Study different pre-processing steps for a DWH. (CO2, Weightage 10)

Mapping Course Outcomes (CO) - Program Outcomes (PO)
Subject Weight: PRACTICAL 50%. The mapping lists each Course Outcome and its contribution to the Program Outcomes Pa, Pb, Pc, Pd, Pe, Pf, Pg, Ph, Pi, Pj, Pk, Pl.
CO1: Students will be able to understand the data warehouse and the design model of a data warehouse.
Contribution to POs: 1 1 2 1 2 1 1 1
CO2: Students will be able to learn the steps of pre-processing.
Contribution to POs: 3 1 1 1 1 2 1
CO3: Students will be able to understand the analytical operations on data.
Contribution to POs: 1 1 1 1 1 2 3
CO4: Students will be able to discover patterns and knowledge from a data warehouse.
Contribution to POs: 1 1 1 1 1 2 3
CO5: Students will be able to understand and implement classical algorithms in data mining and data warehousing; students will be able to assess the strengths and weaknesses of the algorithms, identify the application area of algorithms, and apply them.
Contribution to POs: 1 1 1 2 1 2 2

Study and Evaluation Scheme
Course Code: CPC801. Course Name: Data Warehouse and Mining.
Teaching Scheme: Theory 04, Practical 02, Tutorial --. Credits Assigned: Theory 04, Practical 01, Tutorial --, Total 05.
Examination Scheme: Term Work 25, Practical 25, Total 50.
Term Work: Internal Assessment consists of two tests. Test 1, an institution-level central test, is for 20 marks and is to be based on a minimum of 40% of the syllabus. Test 2 is also for 20 marks and is to be based on the remaining syllabus. Test 2 may be either a class test, or an assignment on live problems, or a course project.
Practical & Oral: The oral examination is to be conducted by a pair of internal and external examiners based on the syllabus.

Data Warehouse and Mining
Experiment No. 1: Case study on a Data Warehouse System.

Experiment No. 1
Aim: One case study on a data warehouse system.
1. Write a detailed problem statement and create the dimensional model (star and snowflake schema).
2. Implementation of dimensional modeling.
3. Implementation of all dimension tables and the fact table.
4. Implementation of OLAP operations.
Objectives: From this experiment, the student will be able to
1. Understand the basics of a data warehouse.
2. Understand the design model of a data warehouse.
3. Study the methodology of engineering legacy databases for data warehousing.
Outcomes: The learner will be able to
1. Apply knowledge of legacy databases in creating a data warehouse.
2. Understand, identify, analyse and design the warehouse.
3. Use current techniques, skills and tools necessary for designing a data warehouse.
Software Required: Oracle 11g
Theory:
In computing, online analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and similar areas, with new applications coming up, such as agriculture. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).
Dimensional modeling: Dimensional modeling (DM) names a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship (ER) modeling. Dimensional modeling does not necessarily involve a relational database; the same modeling approach, at the logical level, can be used for any physical form, such as a multidimensional database or even flat files. DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance.
Star Schema: The fact table is in the middle and the dimension tables are arranged around the fact table.
Snowflake Schema: Normalization and expansion of the dimension tables in a star schema result in the implementation of a snowflake design.
Snowflaking in the dimensional model can impact the understandability of the dimensional model and result in a decrease in performance, because more tables need to be joined to satisfy queries.
Conclusion: We have studied the different schemas of a data warehouse, and using the methodology of engineering a legacy database, a new data warehouse was built. Normalization was applied wherever required on the star schema, and the snowflake schema was designed.
Viva Questions:
1. What is a data warehouse?
2. What is multi-dimensional data?
3. What is the difference between the star and snowflake schema?
References:
1. Paulraj Ponniah, "Data Warehousing: Fundamentals for IT Professionals", Wiley India.
2. Reema Thareja, "Data Warehousing", Oxford University Press.

Data Warehouse and Mining
Experiment No. 2: Implementation of the decision tree algorithm in Java.

Experiment No. 2
Aim: Implementation of the decision tree algorithm in Java.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical algorithms in data mining.
3. Identify the application of classification algorithms in data mining.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: JDK for Java
Theory:
Decision tree learning is one of the most widely used and practical methods for inductive inference over supervised data. A decision tree represents a procedure for classifying categorical data based on their attributes. It is also efficient for processing large amounts of data, so it is often used in data mining operations. The construction of a decision tree does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.
The core algorithm for building decision trees, called ID3 and due to J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses entropy and information gain to construct a decision tree.
Entropy: A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one. To build a decision tree, we need to calculate two types of entropy using frequency tables: (a) entropy using the frequency table of one attribute (the target itself), and (b) entropy using the frequency table of two attributes (the target and the attribute on which the data is split).
Information Gain: The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
Procedure/Program:
1. Calculate the entropy of the target.
2. The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split (a small illustrative sketch of this computation is given below).
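To make these two quantities concrete, the following is a minimal Java sketch of how the entropy and information gain used by ID3 can be computed. It is only an illustration and not the full experiment program; the class and method names (EntropyUtil, entropy, informationGain) and the toy data are invented for this example, and the class-label list and attribute-value list are assumed to be aligned record by record.

    import java.util.*;

    public class EntropyUtil {

        // Shannon entropy of a list of class labels, in bits.
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : labels) {
                counts.merge(label, 1, Integer::sum);
            }
            double h = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / labels.size();
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        // Information gain obtained by splitting the labels on one attribute.
        // attributeValues.get(i) is the attribute value of record i, labels.get(i) its class.
        static double informationGain(List<String> attributeValues, List<String> labels) {
            Map<String, List<String>> branches = new HashMap<>();
            for (int i = 0; i < labels.size(); i++) {
                branches.computeIfAbsent(attributeValues.get(i), k -> new ArrayList<>()).add(labels.get(i));
            }
            double splitEntropy = 0.0;
            for (List<String> branch : branches.values()) {
                splitEntropy += ((double) branch.size() / labels.size()) * entropy(branch);
            }
            // entropy before the split minus the weighted entropy after the split
            return entropy(labels) - splitEntropy;
        }

        public static void main(String[] args) {
            List<String> outlook = Arrays.asList("sunny", "sunny", "overcast", "rain", "rain");
            List<String> play    = Arrays.asList("no",    "no",    "yes",      "yes",  "no");
            System.out.println("Gain(outlook) = " + informationGain(outlook, play));
        }
    }

ID3 evaluates this gain for every candidate attribute and, as the next step states, splits on the attribute with the largest value.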
The result is the information gain, or decrease in entropy.
3. Choose the attribute with the largest information gain as the decision node.
4. (a) A branch with an entropy of 0 is a leaf node. (b) A branch with an entropy of more than 0 needs further splitting.
5. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Results:
Decision Tree to Decision Rules: A decision tree can easily be transformed into a set of rules by mapping the paths from the root node to the leaf nodes one by one.
Conclusion: The different classification algorithms of data mining were studied, and one among them, the decision tree (ID3) algorithm, was implemented using Java. The need for a classification algorithm was recognized and understood.
Viva Questions:
1. What are the various classification algorithms?
2. What is entropy?
3. How do you find the information gain?
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 3: Implementation of the ID3 algorithm using the WEKA tool.

Experiment No. 3
Aim: Implementation of the ID3 algorithm using the WEKA tool.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical algorithms in data mining.
3. Identify the application of classification algorithms in data mining.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: WEKA tool
Theory:
Decision tree learning is a method for assessing the most likely outcome value by taking into account the known values of the stored data instances. This learning method is among the most popular of inductive inference algorithms and has been successfully applied to a broad range of tasks, such as assessing the credit risk of applicants and improving the loyalty of regular customers.
Procedure:
1. Download a dataset for the implementation of the ID3 algorithm (.csv or .arff file). Here the bank-data.csv dataset has been taken for decision tree analysis.
2. Load the data in the WEKA tool.
3. Select the "Classify" tab and click the "Choose" button to select the ID3 classifier.
4. Specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform error pruning.
5. Under the "Test options" in the main panel we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model. We now click "Start" to generate the model.
6. We can view this information in a separate window by right-clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop-up menu. WEKA also provides a graphical rendition of the classification tree. This can be viewed by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu.
7. We will now use our model to classify the new instances. However, in the data section, the value of the "pep" attribute is "?" (i.e., unknown).
8. In the main panel, under "Test options", click the "Supplied test set" radio button, and then click the "Set..." button.
This will pop up a window which allows you to open the file containing the test instances. In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new, unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute.
9. The summary of the results in the right panel does not show any statistics. This is because in our test instances the value of the class attribute ("pep") was left as "?", so WEKA has no actual values to which it can compare the predicted values of the new instances.
10. The GUI version of WEKA is used to create a file containing all the new instances along with their predicted class value resulting from the application of the model. First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up window select the menu item "Visualize classifier errors". This brings up a separate window containing a two-dimensional graph.
11. To save the file: in the new window, we click on the "Save" button and save the result as the file "bank-predicted.arff". This file contains a copy of the new instances along with an additional column for the predicted value of "pep". The top portion of the file can be seen in the figure below.
Conclusion: The different classification algorithms of data mining were studied, and one among them, the decision tree (ID3) algorithm, was implemented using the WEKA tool. The need for a classification algorithm was recognized and understood.
Viva Questions:
1. What is the use of the WEKA tool?
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 4: Implementation of K-means clustering in Java.

Experiment No. 4
Aim: Implementation of K-means clustering in Java.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical clustering algorithms in data mining.
3. Identify the application of clustering algorithms in data mining.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: JDK for Java
Theory:
Clustering is dividing data points into homogeneous classes or clusters:
- Points in the same group are as similar as possible.
- Points in different groups are as dissimilar as possible.
When a collection of objects is given, we put the objects into groups based on similarity.
Clustering Algorithms: A clustering algorithm tries to analyse natural groups of data on the basis of some similarity. It locates the centroid of the group of data points. To carry out effective clustering, the algorithm evaluates the distance of each point from the centroid of the cluster. The goal of clustering is to determine the intrinsic grouping in a set of unlabelled data.
Theory: K-means Clustering
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
Procedure:
Input:
K: the number of clusters
D: a data set containing n objects.
Output: A set of k clusters.
1. Arbitrarily choose K objects from D as the initial cluster centers.
2. Partition the objects into k non-empty subsets.
3. Identify the cluster centroids (mean points) of the current partition.
4. Assign each point to a specific cluster.
5. Compute the distances from each point and allot points to the cluster where the distance from the centroid is minimum.
6. After re-allotting the points, find the centroid of the new cluster formed.
Conclusion: The different clustering algorithms of data mining were studied, and one among them, the k-means clustering algorithm, was implemented using Java. The need for a clustering algorithm was recognized and understood.
Viva Questions:
1. What are the different clustering techniques?
2. What is the difference between K-means and K-medoids?
3. What is a dendrogram?
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 5: To implement the clustering algorithm K-means using the WEKA tool.

Experiment No. 5
Aim: To implement the clustering algorithm K-means using the WEKA tool.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical clustering algorithms in data mining.
3. Identify the application of clustering algorithms in data mining.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: WEKA tool
Theory:
Weka is a landmark system in the history of the data mining and machine learning research communities, because it is the only toolkit that has gained such widespread adoption and survived for an extended period of time. The key features responsible for Weka's success are:
- It provides many different algorithms for data mining and machine learning.
- It is open source and freely available.
- It is platform-independent.
- It is easily usable by people who are not data mining specialists.
- It provides flexible facilities for scripting experiments.
- It has been kept up to date, with new algorithms added over time.
WEKA INTERFACE
The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:
- Explorer: An environment for exploring data with WEKA.
- Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
- KnowledgeFlow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
- SimpleCLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface.
WEKA CLUSTERER
It contains "clusterers" for finding groups of similar instances in a dataset. Some implemented schemes are: k-Means, EM, Cobweb, X-means and FarthestFirst. Clusters can be visualized and compared to "true" clusters.
Procedure:
The basic step of k-means clustering is simple. In the beginning, we determine the number of clusters K and we assume the centroid or center of these clusters. We can take any random objects as the initial centroids, or the first K objects can also serve as the initial centroids. (A rough Java sketch of this iterative process, in the spirit of Experiment 4, is given below; the WEKA walkthrough then continues.)
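The following is a minimal, illustrative Java sketch of K-means over one-dimensional data. It is not the full experiment program; the class and method names (KMeansSketch, kMeans) and the sample data are invented for this example, and a fixed number of iterations stands in for a convergence test.

    import java.util.*;

    public class KMeansSketch {

        // One-dimensional K-means: returns the centroids after a fixed number of iterations.
        static double[] kMeans(double[] data, int k, int iterations) {
            double[] centroids = Arrays.copyOf(data, k);          // first K objects as initial centroids
            for (int it = 0; it < iterations; it++) {
                double[] sum = new double[k];
                int[] count = new int[k];
                for (double x : data) {                            // assign each point to the nearest centroid
                    int best = 0;
                    for (int c = 1; c < k; c++) {
                        if (Math.abs(x - centroids[c]) < Math.abs(x - centroids[best])) best = c;
                    }
                    sum[best] += x;
                    count[best]++;
                }
                for (int c = 0; c < k; c++) {                      // recompute each centroid as the mean of its cluster
                    if (count[c] > 0) centroids[c] = sum[c] / count[c];
                }
            }
            return centroids;
        }

        public static void main(String[] args) {
            double[] ages = {23, 25, 24, 44, 46, 45};
            System.out.println(Arrays.toString(kMeans(ages, 2, 10)));
        }
    }

In practice the loop would stop when no point changes its cluster, which is exactly the convergence condition used in the three steps that follow.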
Then the K-means algorithm will perform the three steps below until convergence. Iterate until stable (i.e., no object moves between groups):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance (find the closest centroid).
K-means in WEKA 3.7
The sample data set used is based on the "bank data" available in comma-separated format as bank-data.csv. The resulting data file is "bank.arff" and includes 600 instances. As an illustration of performing clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the customers in this bank data set, and to characterize the resulting customer segments.
1. To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This results in a drop-down list of available clustering algorithms. In this case we select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window shown below, for editing the clustering parameters.
2. In the pop-up window we enter 2 as the number of clusters and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters.
3. Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. We can even visualize the assigned clusters, as shown below. You can choose the cluster number and any of the other attributes for each of the three different dimensions available (x-axis, y-axis, and color). Different combinations of choices will result in a visual rendering of different relationships within each cluster.
4. Note that in addition to the "instance_number" attribute, WEKA has also added a "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value (as shown below).
Conclusion: The different clustering algorithms of data mining were studied, and one among them, the k-means clustering algorithm, was implemented using the WEKA tool. The need for a clustering algorithm was recognized and understood.
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 6: To study and implement the Apriori algorithm.

Experiment No. 6
Aim: To study and implement the Apriori algorithm.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical association mining algorithms.
3. Identify the application of association mining algorithms.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: JDK for Java
Theory:
The Apriori algorithm is a well-known association rule algorithm used in most commercial products. It uses the itemset property: any subset of a large itemset must itself be large.
Procedure:
Input:
I = // itemset
D = // database of transactions
S = // support
Output: L // large itemsets
Apriori algorithm:
    k = 0;
    L = ∅;
    C1 = I;
    repeat
        k = k + 1;
        Lk = ∅;
        for each Ii ∈ Ck do
            ci = 0;
        for each tj ∈ D do
            for each Ii ∈ Ck do
                if Ii ∈ tj then
                    ci = ci + 1;
        for each Ii ∈ Ck do
            if ci >= (S * |D|) do
                Lk = Lk ∪ Ii;
        L = L ∪ Lk;
        Ck+1 = Apriori-Gen(Lk);
    until Ck+1 = ∅;
Conclusion: The different association mining algorithms of data mining were studied, and one among them, the Apriori association mining algorithm, was implemented using Java. The need for an association mining algorithm was recognized and understood.
Viva Questions:
1. What are support and confidence?
2. What are the different types of association mining algorithms?
3. What is the disadvantage of the Apriori algorithm?
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.

Data Warehouse and Mining
Experiment No. 7: Implementation of the Apriori algorithm in WEKA.

Experiment No. 7
Aim: Implementation of the Apriori algorithm in WEKA.
Objectives: From this experiment, the student will be able to
1. Analyse the data, identify the problem and choose the relevant algorithm to apply.
2. Understand and implement classical association mining algorithms.
3. Identify the application of association mining algorithms.
Outcomes: The learner will be able to
1. Assess the strengths and weaknesses of algorithms.
2. Identify, formulate and solve engineering problems.
3. Analyse the local and global impact of data mining on individuals, organizations and society.
Software Required: WEKA tool
Theory:
The Apriori algorithm is an influential algorithm for mining frequent itemsets for boolean association rules. Some key concepts for the Apriori algorithm are:
Frequent Itemsets: The sets of items which have minimum support (denoted by Li for the i-th itemset).
Apriori Property: Any subset of a frequent itemset must be frequent.
Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
Procedure (WEKA implementation):
To learn the system, TEST_ITEM_TRANS.arff has been used. Using the Apriori algorithm we want to find the association rules that have minSupport = 50% and minimum confidence = 50%. We launch the WEKA application and open the TEST_ITEM_TRANS.arff file as shown in the figure below. Then we move to the Associate tab and set up the configuration as shown below.
After the algorithm is finished, we get the following results:
=== Run information ===
Scheme: weka.associations.Apriori -N 20 -T 0 -C 0.5 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: TEST_ITEM_TRANS
Instances: 15
Attributes: 8 (A B C D E F G H)
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.5 (7 instances)
Minimum metric: 0.5
Number of cycles performed: 10
Generated sets of large itemsets:
Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 12
Size of set of large itemsets L(3): 3
Best rules found:
1. E=TRUE 11 ==> H=TRUE 11 conf:(1)
2. B=TRUE 10 ==> H=TRUE 10 conf:(1)
3. C=TRUE 10 ==> H=TRUE 10 conf:(1)
4. A=TRUE 9 ==> H=TRUE 9 conf:(1)
5. G=FALSE 9 ==> H=TRUE 9 conf:(1)
6. D=TRUE 8 ==> H=TRUE 8 conf:(1)
7. F=FALSE 8 ==> H=TRUE 8 conf:(1)
8. D=FALSE 7 ==> H=TRUE 7 conf:(1)
9. F=TRUE 7 ==> H=TRUE 7 conf:(1)
10. B=TRUE E=TRUE 7 ==> H=TRUE 7 conf:(1)
11. C=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
12. E=TRUE G=FALSE 7 ==> H=TRUE 7 conf:(1)
13. G=FALSE 9 ==> C=TRUE 7 conf:(0.78)
14. G=FALSE 9 ==> E=TRUE 7 conf:(0.78)
15. G=FALSE H=TRUE 9 ==> C=TRUE 7 conf:(0.78)
16. G=FALSE 9 ==> C=TRUE H=TRUE 7 conf:(0.78)
17. G=FALSE H=TRUE 9 ==> E=TRUE 7 conf:(0.78)
18.
G=FALSE 9 ==> E=TRUE H=TRUE 7 conf:(0.78) 19. H=TRUE 15 ==> E=TRUE 11 conf:(0.73) 20. B=TRUE 10 ==> E=TRUE 7 conf:(0.7)Conclusion:The different association mining algorithms of data mining were studied and one among them named Apriori association mining algorithm was implemented using JAVA. The need for association mining algorithm was recognized and understood.References:Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd Edition M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson EducationData Warehouse and MiningExperiment No. : 8Study of R ToolExperiment No. 8Aim: Study of R Tool.Objectives: From this experiment, the student will be able to Learn basics of mining toolCreate web page for mobile shopping using editor toolStudy the methodology of engineering legacy of data miningOutcomes: The learner will be able toUse current techniques, skills and tools for mining.Engage them in life-long learning.Able to match industry requirements in domains of data miningSoftware Required :R toolTheory: R tool is "a programming “environment”, object-oriented similar to S-Plus freeware that provides calculations on matrices, excellent graphics capabilities and supported by a large user network.Installing R: Download from CRANSelect a download siteDownload the base package at a minimumDownload contributed packages as neededR Basics / Components Of R:ObjectsNaming conventionAssignmentFunctionsWorkspaceHistoryObjectsnamestypes of objects: vector, factor, array, matrix, data.frame, ts, listattributesmode: numeric, character, complex, logicallength: number of elements in objectcreationassign a valuecreate a blank objectNaming Conventionmust start with a letter (A-Z or a-z)can contain letters, digits (0-9), and/or periods “.”case-sensitiveeg. mydata different from MyDatado not use underscore “_”Assignment“<-” used to indicate assignmentegs. <-c(1,2,3,4,5,6,7)<-c(1:7)<-1:4Functionsactions can be performed on objects using functions (note: a function is itself an object)have arguments and options, often there are defaultsprovide a resultparentheses () are used to specify that a function is being called.Workspaceduring an R session, all objects are stored in a temporary, working memorylist objectsls()remove objectsrm()objects that you want to access later must be saved in a “workspace”from the menu bar: File->save workspacefrom the command line: save(x,file=“MyData.Rdata”)Historycommand line historycan be saved, loaded, or displayedsavehistory(file=“MyData.Rhistory)loadhistory(file=“MyData.Rhistory)history(max.show=Inf)during a session you can use the arrow keys to review the command historyTwo most common object types for statistics: A. Matrixa matrix is a vector with an additional attribute (dim) that defines the number of columns and rowsonly one mode (numeric, character, complex, or logical) allowedcan be created using matrix()x<-matrix(data=0,nr=2,nc=2)orx<-matrix(0,2,2)B. Data Frameseveral modes allowed within a single data framecan be created using data.frame()L<-LETTERS[1:4] #A B C Dx<-1:4 #1 2 3 4data.frame(x,L) #create data frameattach() and detach()the database is attached to the R search path so that the database is searched by R when it is evaluating a variable.objects in the database can be accessed by simply giving their namesData Elements:select only one elementeg. x[2]select range of elementseg. x[1:3]select all but one elementeg. x[-3]slicing: including only part of the objecteg. x[c(1,2,5)]select elements based on logical operatoreg. 
x[x>3]
Data Import & Entry:
Importing Data
- read.table(): reads in data from an external file
- data.entry(): create the object first, then enter data
- c(): concatenate
- scan(): prompted data entry
- R has ODBC for connecting to other programs.
Data entry & editing
- start editor and save changes: data.entry(x)
- start editor, changes not saved: de(x)
- start text editor: edit(x)
Useful Functions
length(object)              # number of elements or components
str(object)                 # structure of an object
class(object)               # class or type of an object
names(object)               # names
c(object, object, ...)      # combine objects into a vector
cbind(object, object, ...)  # combine objects as columns
rbind(object, object, ...)  # combine objects as rows
ls()                        # list current objects
rm(object)                  # delete an object
newobject <- edit(object)   # edit a copy and save as newobject
fix(object)                 # edit the object in place
Exporting Data
To a tab-delimited text file: write.table(mydata, "c:/mydata.txt", sep="\t")
To an Excel spreadsheet: library(xlsReadWrite); write.xls(mydata, "c:/mydata.xls")
To SAS: library(foreign); write.foreign(mydata, "c:/mydata.txt", "c:/mydata.sas", package="SAS")
Viewing Data
There are a number of functions for listing the contents of an object or dataset:
- list objects in the working environment: ls()
- list the variables in mydata: names(mydata)
- list the structure of mydata: str(mydata)
- list levels of factor v1 in mydata: levels(mydata$v1)
- dimensions of an object: dim(object)
- class of an object (numeric, matrix, data frame, etc.): class(object)
- print mydata: mydata
- print the first 10 rows of mydata: head(mydata, n=10)
- print the last 5 rows of mydata: tail(mydata, n=5)
Interfacing with R: CSV files, Excel files, binary files, XML files, JSON files, web data, databases.
We can also create pie charts, bar charts, box plots, histograms, line graphs and scatterplots.
Data Types in the R Tool: Vectors, Lists, Matrices, Arrays, Factors, Data Frames.
Input
The source() function runs a script in the current session. If the filename does not include a path, the file is taken from the current working directory.
# input a script
source("myfile")
Output
The sink() function defines the direction of the output.
# direct output to a file
sink("myfile", append=FALSE, split=FALSE)
# return output to the terminal
sink()
The append option controls whether output overwrites or adds to a file. The split option determines if output is also sent to the screen as well as to the output file.
Creating new variables
Use the assignment operator <- to create new variables. A wide array of operators and functions are available here.
# Three examples for doing the same computations
mydata$sum <- mydata$x1 + mydata$x2
mydata$mean <- (mydata$x1 + mydata$x2)/2
attach(mydata)
mydata$sum <- x1 + x2
mydata$mean <- (x1 + x2)/2
detach(mydata)
mydata <- transform(mydata, sum = x1 + x2, mean = (x1 + x2)/2)
Renaming variables
You can rename variables programmatically or interactively.
# rename interactively
fix(mydata)  # results are saved on close
# rename programmatically
library(reshape)
mydata <- rename(mydata, c(oldname="newname"))
Sorting
To sort a data frame in R, use the order() function. By default, sorting is ASCENDING. Prepend the sorting variable with a minus sign to indicate DESCENDING order.
Merging
To merge two data frames (datasets) horizontally, use the merge function. In most cases, you join two data frames by one or more common key variables (i.e., an inner join).
Examples: # merge two dataframes by ID -- total <- merge(dataframeA,dataframeB,by="ID")# merge two dataframes by ID and Country --total<- merge(dataframeA,dataframeB,by=c("ID","Country")) Conclusion:R tool, free software environment for statistical computing and graphics is studied. Using R tool, various data mining algorithms were implemented. R and its packages, functions and task views for data mining process and popular data mining techniques were learnt.Viva Questions:How R tool is used for mining big data?References:Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd Edition M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson EducationData Warehouse and MiningExperiment No. : 9Study ofBI ToolExperiment No. 9Aim:Study of Business Intelligence Tool, SPSS Clementine, and XL Miner etcObjectives: From this experiment, the student will be able to Learn basics of business intelligenceCreate web page for mobile shopping using editor toolStudy the methodology of engineering legacy of data miningOutcomes: The learner will be able toUse current techniques, skills and tools for mining.Engage them in life-long learning.Able to match industry requirements in domains of data miningSoftware Required :BI tool - SPSS ClementineTheory: IBM SPSS Modeler is a data mining and text analytics software application built by IBM. It is used to build predictive models and conduct other analytic tasks. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming. IBM SPSS Modeler was originally named Clementine by its creators,Applications:SPSS Modeler has been used in these and other industries:?Customer analytics and Customer relationship management (CRM)?Fraud detection and prevention?Optimizing insurance claims?Risk management?Manufacturing quality improvement?Healthcare quality improvement?Forecasting demand or sales?Law enforcement and border security?Education?Telecommunications?Entertainment: e.g., predicting movie box office receiptsSPSS is available in two separate bundles of features called editions.1.SPSS Modeller Professional2.SPSS Modeller PremiumIt all includes:text analyticsentity analyticssocial network analysisBoth the editions are available in desktop and server configurations.Earlier it was Unix based and designed as a consulting tool and not for sale to the customers. Originally developed by a UK Company called Integral Solutions in collaboration with Artificial Intelligence researchers at Sussex University. It mainly uses two of the Poplog languages, Pop11 and Prolog. It was the first data mining tool to use an icon based graphical user interface rather than writing programming languages. Clementine is a data mining software for business solutions.Previous version was a stand alone application architecture while new version is a distributed architecture.Fig. Previous version (stand alone)Distributed ArchitectureFig. 
New version (Distributed architecture)Multiple model building techniques in Clementine: Rule InductionGraphClusteringAssociation RulesLinear RegressionNeural NetworksFunctionalities: Classification: Rule Induction, neural NetworksAssociation: Rule Induction, AprioriClustering: Kohonen Networks, Rule InductionSequence: Rule Induction, Neural Networks, Linear RegressionPrediction: Rule Induction, Neural Networks Applications:Predict market shareDetect possible fraudLocate new retail sitesAssess financial riskAnalyze demographic trends and patternsConclusion:IBM SPSS Modeler is a data mining and text analytics software application is studied. It has a visual interface which allows users to leverage statistical and data mining algorithms without programming is understoodViva Questions:What are the functionalities of SPSS Clementine?References:Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd Edition M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson EducationData Warehouse and MiningExperiment No. : 10Study different OLAP operationsExperiment No. 10Aim:Study different OLAP operationsObjectives: From this experiment, the student will be able to Discover patterns from data warehouseOnline analytical processing of dataObtain knowledge from data warehouseOutcomes: The learner will be able toRecognize the need of online analytical processing.Identify, formulate and solve engineering problems.Able to match industry requirements in domains of data warehouseTheory: Following are the different OLAP operationsRoll up (drill-up): summarize databy climbing up hierarchy or by dimension reductionDrill down (roll down): reverse of roll-upfrom higher level summary to lower level summary or detailed data, or introducing new dimensionsSlice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planesFact table View Multi-dimensional cube Dimension = 3ExampleCube aggregation – roll up and drill downExample – slicingExample – slicing and pivotingConclusion:OLAP, which performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelingis studied. Viva Questions:What are OLAP operations?What is difference between OLTP and OLAP?What is difference between slicing and dicing?References:Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann 3rd Edition M.H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson EducationData Warehouse and MiningExperiment No. : 11Study different pre-processing steps of data warehouseExperiment No. 11Aim:Study different pre-processing steps of data warehouseObjectives: From this experiment, the student will be able to Discover patterns from data warehouseLearn steps of pre-processing of dataObtain knowledge from data warehouseOutcomes: The learner will be able toRecognize the need of data pre-processing.Identify, formulate and solve engineering problems.Able to match industry requirements in domains of data warehouseTheory: Data pre-processing is an often neglected but important step in the data mining process. The phrase "Garbage In, Garbage Out" is particularly applicable to data mining and machine learning. 
Data gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income:-100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), missing values, etcIf there is much irrelevant and redundant information present or noisy and unreliable data, then knowledge discovery during the training phase is more difficult. Data preparation and filtering steps can take considerable amount of processing time. Data pre-processing includes cleaning, normalization, transformation, feature extraction and selection, etc. The product of data pre-processing is the final training set.Data Pre-processing MethodsRaw data is highly susceptible to noise, missing values, and inconsistency. The quality of data affects the data mining results. In order to help improve the quality of the data and, consequently, of the mining results raw data is pre-processed so as to improve the efficiency and ease of the mining process. Data pre-processing is one of the most critical steps in a data mining process which deals with the preparation and transformation of the initial dataset. Data pre-processing methods are divided into following categories:Data Cleaning 2)Data Integration 3)Data Transformation 4)Data Reduction Fig. Forms of data PreprocessingData CleaningData that is to be analyze by data mining techniques can be incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors, or outlier values which deviate from the expected), and inconsistent (e.g., containing discrepancies in the department codes used to categorize items).Incomplete, noisy, and inconsistent data are commonplace properties of large, real -world databases and data warehouses. Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because it was not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding, or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the history or modifications to the data may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred. Data can be noisy, having incorrect attribute values, owing to the following. The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Errors in data transmission can also occur. There may be technology limitations, such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes used. Duplicate tuples also require data cleaning. Data cleaning routines work to “clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Dirty data can cause confusion for the mining procedure. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding over fitting the data to the function being modelled. 
Therefore, a useful pre-processing step is to run your data through some data cleaning routines.Missing Values: If it is noted that there are many tuples that have no recorded value forseveral attributes, then the missing values can be filled in for the attribute by various methods described below:Ignore the tuple: This is usually done when the class label is missing (assuming themining task involves classification or description). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: In general, this approach is time-consuming andmay not be feasible given a large data set with many missing values. Use a global constant to fill in the missing value: Replace all missing attributevalues by the same constant, such as a label like \Unknown", or -∞. If missing values are replaced by, say, \Unknown", then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common | that of \Unknown". Hence, although this method is simple, it is not recommended. Use the attribute mean to fill in the missing value Use the attribute mean for all samples belonging to the same class as the given tuple.Use the most probable value to fill in the missing value: This may be determinedwith inference-based tools using a Bayesian formalism or decision tree induction. Inconsistent data: There may be inconsistencies in the data recorded for some transactions.Some data inconsistencies may be corrected manually using external references. For example, errors made at data entry may be corrected by performing a paper trace. This may be coupled with routines designed to help correct the inconsistent use of codes. Knowledge engineering tools may also be used to detect the violation of known data constraints. For example, known functional dependencies between attributes can be used to find values contradicting the functional constraints.Data IntegrationIt is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. There are a number of issues to consider during data integration. Schema integration can be tricky. How can like real world entities from multiple data sources be 'matched up'? This is referred to as the entity identification problem. For example, how can the data analyst or the computer be sure that customer id in one database, and cust_number in another refer to the same entity? Databases and data warehouses typically have metadata - that is, data about the data. Such metadata can be used to help avoid errors in schema integration. Redundancy is another important issue. An attribute may be redundant if it can be “derived" from another table, such as annual revenue. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.Data TransformationIn data transformation, the data are transformed or consolidated into forms appropriate for mining. Data transformation can involve the following:Normalization, where the attribute data are scaled so as to fall within a small specifiedrange, such as -1.0 to 1.0, or 0 to 1.0. Smoothing works to remove the noise from data. Such techniques include binning,clustering, and regression. 
Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
Generalization of the data, where low-level or 'primitive' (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numeric attributes, like age, may be mapped to higher-level concepts, like young, middle-aged, and senior.
Data Reduction
Data reduction techniques have been helpful in analyzing a reduced representation of the dataset without compromising the integrity of the original data and yet producing quality knowledge. The concept of data reduction is commonly understood as either reducing the volume or reducing the dimensions (the number of attributes). There are a number of methods that facilitate analyzing a reduced volume or dimension of data and yet yield useful knowledge. Certain partition-based methods work on a partition of the data tuples; that is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Dimension reduction, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Data compression, where encoding mechanisms are used to reduce the data set size. The methods used for data compression are the wavelet transform and Principal Component Analysis.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data, e.g. regression and log-linear models), or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Concept hierarchies allow the mining of data at multiple levels of abstraction, and are a powerful tool for data mining.
Conclusion: Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues. Such pre-processing has thus been studied.
Viva Questions:
1. What is pre-processing of data?
2. What is the need for data pre-processing?
3. What kind of data can be cleaned?
References:
1. Han, Kamber, "Data Mining Concepts and Techniques", Morgan Kaufmann, 3rd Edition.
2. M. H. Dunham, "Data Mining Introductory and Advanced Topics", Pearson Education.
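To connect the cleaning and transformation steps of this experiment with the implementation-oriented experiments earlier in the manual, here is a minimal Java sketch of two of the techniques mentioned above: filling missing values with the attribute mean, and min-max normalization to the range [0, 1]. It is illustrative only; the class and method names (PreprocessSketch, fillMissingWithMean, minMaxNormalize) and the sample data are invented for this example, and missing values are assumed to be encoded as NaN.

    import java.util.Arrays;

    public class PreprocessSketch {

        // Replace missing values (encoded here as Double.NaN) with the attribute mean.
        static double[] fillMissingWithMean(double[] values) {
            double sum = 0; int n = 0;
            for (double v : values) if (!Double.isNaN(v)) { sum += v; n++; }
            double mean = sum / n;
            double[] out = values.clone();
            for (int i = 0; i < out.length; i++) if (Double.isNaN(out[i])) out[i] = mean;
            return out;
        }

        // Min-max normalization of an attribute to the range [0, 1].
        static double[] minMaxNormalize(double[] values) {
            double min = Arrays.stream(values).min().getAsDouble();
            double max = Arrays.stream(values).max().getAsDouble();
            double[] out = new double[values.length];
            for (int i = 0; i < values.length; i++) out[i] = (values[i] - min) / (max - min);
            return out;
        }

        public static void main(String[] args) {
            double[] income = {20000, Double.NaN, 35000, 50000};
            double[] cleaned = fillMissingWithMean(income);
            System.out.println(Arrays.toString(minMaxNormalize(cleaned)));
        }
    }

The general min-max formula scales a value v to v' = (v - min) / (max - min) * (new_max - new_min) + new_min; the sketch uses new_min = 0 and new_max = 1, matching the small specified range described in the Normalization step above.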