CP5293 BIG DATA ANALYTICS                                L T P C
                                                         3 0 0 3
OBJECTIVES:
- To understand the competitive advantages of big data analytics
- To understand the big data frameworks
- To learn data analysis methods
- To learn stream computing
- To gain knowledge on Hadoop related tools such as HBase, Cassandra, Pig, and Hive for big data analytics

UNIT I INTRODUCTION TO BIG DATA 7
Big Data – Definition, Characteristic Features – Big Data Applications – Big Data vs Traditional Data – Risks of Big Data – Structure of Big Data – Challenges of Conventional Systems – Web Data – Evolution of Analytic Scalability – Evolution of Analytic Processes, Tools and Methods – Analysis vs Reporting – Modern Data Analytic Tools.

UNIT II HADOOP FRAMEWORK 9
Distributed File Systems – Large-Scale File System Organization – HDFS Concepts – MapReduce Execution, Algorithms Using MapReduce, Matrix-Vector Multiplication – Hadoop YARN.

UNIT III DATA ANALYSIS 13
Statistical Methods: Regression Modelling, Multivariate Analysis – Classification: SVM and Kernel Methods – Rule Mining – Cluster Analysis, Types of Data in Cluster Analysis, Partitioning Methods, Hierarchical Methods, Density Based Methods, Grid Based Methods, Model Based Clustering Methods, Clustering High Dimensional Data – Predictive Analytics – Data Analysis Using R.

UNIT IV MINING DATA STREAMS 7
Streams: Concepts – Stream Data Model and Architecture – Sampling Data in a Stream – Mining Data Streams and Mining Time-Series Data – Real Time Analytics Platform (RTAP) Applications – Case Studies – Real Time Sentiment Analysis, Stock Market Predictions.

UNIT V BIG DATA FRAMEWORKS 9
Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase Clients – Examples – Cassandra: Data Model – Examples – Cassandra Clients – Hadoop Integration. Pig – Grunt – Pig Data Model – Pig Latin – Developing and Testing Pig Latin Scripts. Hive – Data Types and File Formats – HiveQL Data Definition – HiveQL Data Manipulation – HiveQL Queries.

TOTAL: 45 PERIODS

OUTCOMES:
At the end of this course, the students will be able to:
- Understand how to leverage the insights from big data analytics
- Analyze data by utilizing various statistical and data mining approaches
- Perform analytics on real-time streaming data
- Understand the various NoSQL alternative database models

REFERENCES:
1. Bill Franks, "Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics", Wiley and SAS Business Series, 2012.
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph", 2013.
3. Richard Cotton, "Learning R – A Step-by-Step Function Guide to Data Analysis", O'Reilly Media, 2013.
4. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, Second Edition, 2007.
5. Michael Minelli, Michelle Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.
6. P. J. Sadalage and M. Fowler, "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence", Addison-Wesley Professional, 2012.

Unit 1
Part A

1. What is big data?
Big data refers to data sets that are so voluminous and complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating and information privacy. There are three dimensions to big data, known as Volume, Variety and Velocity.
2. What are the characteristics of big data?
Big data can be described by the following characteristics:
- Volume: the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
- Variety: the type and nature of the data. This helps people who analyze it to effectively use the resulting insight.
- Velocity: the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
- Variability: inconsistency of the data set can hamper processes to handle and manage it.
- Veracity: the quality of captured data can vary greatly, affecting the accuracy of analysis.

3. What is the difference between big data and traditional data?
Traditional database systems are based on structured data, i.e. traditional data is stored in fixed formats or fields in a file. Big data uses semi-structured and unstructured data as well, and improves the variety of the data gathered from different sources such as customers, audience or subscribers.

4. Explain the risks of big data.
Anticipation of harmful effects could prompt public and government scrutiny, leading to regulation that could constrain the use of big data for positive purposes. Questions about big data and analytics raise risks that have three components: risk of error, legal impact, and ethical breach.

5. What is big data analytics?
Big data analytics is the process of examining large and varied data sets, i.e. big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful information that can help organizations make more informed business decisions.

6. What is the main source of big data?
Big data sources are repositories of large volumes of data. They bring more information to users' applications without requiring that the data be held in a single repository or a cloud vendor's proprietary data store. Examples of big data sources are Amazon Redshift, HP Vertica, and MongoDB.

7. What is web data?
Web data is data that comes from a large or diverse number of sources. Web data is developed with the help of Semantic Web tools such as RDF, OWL, and SPARQL. Web data also allows sharing of information through the HTTP protocol or a SPARQL endpoint.

8. List out the data analytic tools.
- Trifacta
- RapidMiner
- Rattle GUI
- QlikView
- Weka
- KNIME
- Orange

9. What are the challenges of big data?
- Data challenges: volume, velocity, veracity, variety; data discovery and comprehensiveness; scalability.
- Process challenges: capturing data; aligning data from different sources; transforming data into a form suitable for analysis; modeling data (mathematically, via simulation); understanding output, visualizing results and display issues on mobile devices.
- Management challenges: security, privacy, governance, ethical issues.

10. What are the trends in big data analytics?
- Big data analytics in the cloud
- Hadoop: the new enterprise data operating system
- Big data lakes
- More predictive analytics
- SQL on Hadoop: faster, better
- More, better NoSQL
- Deep learning
- In-memory analytics
Unit II
Part A

1. What is Hadoop YARN?
Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology. YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework, created to overcome some performance issues in Hadoop's original design. MapReduce Version 2 is a rewrite of the original MapReduce code that runs as an application on top of YARN.

2. What is DFS?
A distributed file system is a client/server-based application that allows clients to access and process data stored on the server as if it were on their own computer. When a user accesses a file on the server, the server sends the user a copy of the file, which is cached on the user's computer while the data is being processed and is then returned to the server.

3. What is Hadoop?
Hadoop is an open source big data framework deployed on a distributed cluster of nodes that allows processing of big data. Hadoop uses commodity hardware for large scale computation, hence it provides a cost benefit to enterprises.

4. Define the MapReduce concept.
MapReduce is the heart of Apache Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of information. A MapReduce program has two components: one that implements the mapper, and another that implements the reducer, as shown in the sketch below.
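To make the mapper/reducer split concrete, here is a minimal word-count sketch in plain Python. It only imitates the MapReduce contract (map emits key-value pairs, the framework groups them by key, reduce aggregates each group); the in-memory grouping is a stand-in for what a real Hadoop job would do through the Java API or Hadoop Streaming.

    from collections import defaultdict

    def map_fn(document):
        # Emit a (word, 1) pair for every word in one input record.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_fn(word, counts):
        # Aggregate all values that share the same key.
        return (word, sum(counts))

    def run_job(documents):
        # Stand-in for the shuffle phase: group intermediate pairs by key.
        groups = defaultdict(list)
        for doc in documents:
            for key, value in map_fn(doc):
                groups[key].append(value)
        return [reduce_fn(k, v) for k, v in groups.items()]

    print(run_job(["big data is big", "data beats opinions"]))
    # [('big', 2), ('data', 2), ('is', 1), ('beats', 1), ('opinions', 1)]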
5. What is HDFS?
HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big data analytics applications.

6. List out the core components in Hadoop.
- MapReduce: a software programming model for processing large sets of data in parallel.
- HDFS: the Java-based distributed file system that can store all kinds of data without prior organization.
- YARN: a resource management framework for scheduling and handling resource requests from distributed applications.

7. What are the key advantages of Hadoop?
A key advantage of using Hadoop is its fault tolerance. When it comes to handling large data sets in a safe and cost-effective manner, Hadoop has the advantage over relational database management systems, and its value for a business of any size will continue to increase as unstructured data continues to grow.

8. List out the Hadoop applications.
- Making Hadoop applications more widely accessible: Apache Hadoop, the open source MapReduce framework, has dramatically lowered the cost barriers to processing and analyzing big data. ...
- A graphical abstraction layer on top of Hadoop applications. ...
- Hadoop applications, seamlessly integrated.

9. What are the characteristics of Hadoop?
The prominent characteristics of Hadoop: Hadoop provides a reliable shared storage system (HDFS) and analysis system (MapReduce). Hadoop is highly scalable and, unlike relational databases, Hadoop scales linearly. Due to linear scaling, a Hadoop cluster can contain tens, hundreds, or even thousands of servers.

10. What is matrix inversion?
Matrix inversion is a fundamental operation for solving linear equations in many computational applications, especially various emerging big data applications. However, it is a challenging task to invert large-scale matrices of extremely high order (several thousands or millions), which are common in most web-scale systems such as social networks and recommendation systems. One published approach is an LU decomposition-based block-recursive algorithm for large-scale matrix inversion.
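The block-recursive distributed algorithm itself is beyond a short note, but the LU idea at its core can be sketched on a single machine with SciPy: factor the matrix once, then solve against the columns of the identity. This is an illustration of the principle, not the web-scale method.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def invert_via_lu(a):
        # Factor A = P L U once, then solve A X = I for X column by column.
        lu, piv = lu_factor(a)
        return lu_solve((lu, piv), np.eye(a.shape[0]))

    a = np.array([[4.0, 3.0], [6.0, 3.0]])
    a_inv = invert_via_lu(a)
    print(np.allclose(a @ a_inv, np.eye(2)))  # True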
Unit III
Part A

1. What is classification?
Classification is a general process related to categorization, the process in which ideas and objects are recognized, differentiated, and understood. A classification system is an approach to accomplishing classification.

2. What is clustering?
Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering is "the process of organizing objects into groups whose members are similar in some way". A cluster is therefore a collection of objects which are "similar" to each other and "dissimilar" to the objects belonging to other clusters.

3. What are the different types of regression models?
- Linear regression (one of the most widely known modeling techniques)
- Logistic regression
- Polynomial regression
- Stepwise regression
- Ridge regression
- Lasso regression
- Elastic net regression

4. What is the difference between correlation and regression?
Correlation and linear regression are not the same. Correlation quantifies the degree to which two variables are related. Correlation does not fit a line through the data points; you simply compute a correlation coefficient (r) that tells you how much one variable tends to change when the other one does.
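As a concrete instance of the linear regression named in question 3, the least-squares fit can be computed directly with NumPy; the data points below are invented for illustration.

    import numpy as np

    # Illustrative data: y is roughly 2x + 1 plus noise.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # Design matrix with an intercept column; solve min ||Xb - y||^2.
    X = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

    print(round(slope, 2), round(intercept, 2))  # close to 2 and 1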
5. What is rule mining?
Association rule mining is a procedure which is meant to find frequent patterns, correlations, associations, or causal structures in data sets found in various kinds of databases such as relational databases, transactional databases, and other forms of data repositories.

6. What is predictive analytics?
Predictive analytics encompasses a variety of statistical techniques from predictive modelling, machine learning, and data mining that analyze current and historical facts to make predictions about future or otherwise unknown events.

7. List out the clustering methods.
- Partitioning methods
- Hierarchical methods
- Density based methods
- Grid based methods
- Model based clustering methods

8. What is R?
Programming with Big Data in R (pbdR) is a series of R packages and an environment for statistical computing with big data using high-performance statistical computation. Two main implementations in R using MPI are Rmpi and pbdMPI of pbdR.

9. What are the characteristics of data analysis?
There are five data characteristics that are the building blocks of an efficient data analytics solution: accuracy, completeness, consistency, uniqueness, and timeliness.

10. What is data analysis?
Data analysis, also known as analysis of data or data analytics, is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. It is the process of evaluating data using analytical and logical reasoning to examine each component of the data provided. ... There are a variety of specific data analysis methods, some of which include data mining, text analytics, business intelligence, and data visualization.

Unit IV
Part A

1. What do you mean by a data stream?
In connection-oriented communication, a data stream is a sequence of digitally encoded coherent signals (packets of data, or data packets) used to transmit or receive information that is in the process of being transmitted.

2. Differentiate between a DBMS and a DSMS.
    Database systems (DBS)                    Data stream management systems (DSMS)
    Persistent relations (static, stored)     Transient streams (on-line analysis)
    One-time queries                          Continuous queries (CQs)
    Random access                             Sequential access
    "Unbounded" disk store                    Bounded main memory
    Only current state matters                Historical data is important
    No real-time services                     Real-time requirements
    Relatively low update rate                Possibly multi-GB arrival rate
    Data at any granularity                   Data at fine granularity
    Assumes precise data                      Data may be stale or imprecise
    Access plan fixed by query processor      Unpredictable, variable data arrival
    and physical DB design                    and characteristics

3. List out the applications of a DSMS.
- Sensor networks: monitoring of sensor data from many sources, complex filtering, activation of alarms, aggregation and joins over single or multiple streams.
- Network traffic analysis: analyzing Internet traffic in near real time to compute traffic statistics and detect critical conditions.
- Financial tickers: on-line analysis of stock prices, discovering correlations, identifying trends.
- On-line auctions.
- Transaction log analysis, e.g., web logs, telephone calls.

4. What is data stream mining?
Data stream mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities.

5. What is meant by real-time analytics?
Real-time analytics is the use of, or the capacity to use, data and related resources as soon as the data enters the system. Real-time analytics is also known as dynamic analysis, real-time analysis, real-time data integration and real-time intelligence.

6. What is the definition of real-time data?
Real-time data (RTD) is information that is delivered immediately after collection. There is no delay in the timeliness of the information provided. Real-time data is often used for navigation or tracking. Some uses of the term "real-time data" confuse it with the term dynamic data.

7. Why do we need RTAP?
RTAP addresses the following issues in traditional or existing RDBMS systems:
- Server-based licensing is too expensive for large DB servers
- Slow processing speed
- Few support tools for data extraction outside the data warehouse
- Copying large datasets into the system is too slow
- Workload differences among jobs
- Data kept in files and folders is difficult to manage

8. What is regression?
Regression predicts the quantity or probability of an outcome. For example: what is the likelihood of a heart attack, given age, weight, and so on? What is the expected profit a customer will generate? What is the forecasted price of a stock? Algorithms: logistic, linear, polynomial, transform.

9. What is a Real Time Analytics Platform (RTAP)?
A Real Time Analytics Platform (RTAP) analyzes data, correlates and predicts outcomes on a real-time basis. The platform enables enterprises to track things in real time on a worldwide basis and helps in timely decision making. It also lets us build a range of powerful analytic applications.

10. What is sampling data in a stream?
Sampling from a finite stream is a special case of sampling from a stationary window in which the window boundaries correspond to the first and last stream elements. The foregoing schemes fall into the category of equal-probability sampling, because each window element is equally likely to be included in the sample. One standard technique is sketched below.
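The answer above does not fix a particular scheme, so as one common equal-probability technique, here is a sketch of reservoir sampling, which keeps a uniform sample of k elements from a stream of unknown length in a single pass.

    import random

    def reservoir_sample(stream, k):
        # Keep the first k items, then replace survivors with decreasing
        # probability, so every item seen so far remains in the sample
        # with probability k/n after n items.
        sample = []
        for n, item in enumerate(stream, start=1):
            if n <= k:
                sample.append(item)
            else:
                j = random.randrange(n)  # uniform in [0, n)
                if j < k:
                    sample[j] = item
        return sample

    print(reservoir_sample(range(1_000_000), 5))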
Unit V
Part A

1. What is NoSQL?
A NoSQL (originally referring to "non SQL" or "non relational") database provides a mechanism for the storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. NoSQL databases are increasingly used in big data and real-time web applications.

2. Why do we need NoSQL?
A relational database may require vertical and, sometimes, horizontal expansion of servers to grow as data or processing requirements grow. An alternative, more cloud-friendly approach is to employ NoSQL. NoSQL is a whole new way of thinking about a database; it is not a relational database.

3. What is HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's Bigtable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.

4. What is the difference between HBase and Hive?
Despite providing SQL functionality, Hive does not yet provide interactive querying; it only runs batch processes on Hadoop. Apache HBase is a NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs.

5. What is the difference between Pig and Hive?
Depending on the purpose and type of data, you can choose either the Hive Hadoop component or the Pig Hadoop component based on this difference: the Hive Hadoop component is used mainly by data analysts, whereas the Pig Hadoop component is generally used by researchers and programmers.

6. What is Pig in Hadoop?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

7. What is Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems.

8. What are Pig, Hive and HBase?
Pig is used for data transformation tasks: if you have a file and want to extract useful information from it, join two files, or perform any other transformation, use Pig. Hive is used to query such files by defining a "virtual" table and running SQL-like queries on those tables. HBase is a full-fledged NoSQL database.

9. What is a Cassandra client?
cassandra-client is a Node.js CQL 2 driver for Apache Cassandra 0.8 and later. CQL is a query language for Apache Cassandra; you use it in much the same way you would use SQL for a relational database. The Cassandra documentation can help you learn the syntax.
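The client described in question 9 is for Node.js; purely as an illustration of the same CQL workflow from Python, here is a sketch using the DataStax cassandra-driver package. The package choice, the local contact point, and the demo keyspace and table are assumptions for the example, not part of the syllabus.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])  # assumed local Cassandra node
    session = cluster.connect()

    # CQL reads much like SQL: define a keyspace and table, then query.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS demo.users (user_id int PRIMARY KEY, name text)
    """)
    session.execute("INSERT INTO demo.users (user_id, name) VALUES (%s, %s)",
                    (1, "Ada"))
    for row in session.execute("SELECT user_id, name FROM demo.users"):
        print(row.user_id, row.name)

    cluster.shutdown()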
10. List out the types of built-in operators in Hive.
- Relational operators
- Arithmetic operators
- Logical operators
- Operators on complex types
- Complex type constructors

Unit 1
Part B

1. Explain the structure of big data.

As you read about big data, you will come across a lot of discussion of the concept of data being structured, unstructured, semi-structured, or even multi-structured. Big data is often described as unstructured and traditional data as structured. The lines aren't as clean as such labels suggest, however. Let's explore these three types of data structure from a layman's perspective. Highly technical details are out of scope for this book.

Most traditional data sources are fully in the structured realm. This means traditional data sources come in a clear, predefined format that is specified in detail. There is no variation from the defined formats on a day-to-day or update-to-update basis. For a stock trade, the first field received might be a date in MM/DD/YYYY format. Next might be an account number in a 12-digit numeric format. Next might be a stock symbol that is a three- to five-character field. And so on. Every piece of information included is known ahead of time, comes in a specified format, and occurs in a specified order. This makes it easy to work with.

Unstructured data sources are those that you have little or no control over. You are going to get what you get. Text data, video data, and audio data all fall into this classification. A picture has a format of individual pixels set up in rows, but how those pixels fit together to create the picture seen by an observer is going to vary substantially in each case.

There are sources of big data that are truly unstructured, such as those preceding. However, most data is at least semi-structured. Semi-structured data has a logical flow and format to it that can be understood, but the format is not user-friendly. Sometimes semi-structured data is referred to as multi-structured data. There can be a lot of noise or unnecessary data intermixed with the nuggets of high value in such a feed. Reading semi-structured data to analyze it isn't as simple as specifying a fixed file format. To read semi-structured data, it is necessary to employ complex rules that dynamically determine how to proceed after reading each piece of information. Web logs are a perfect example of semi-structured data. Web logs are pretty ugly when you look at them; however, each piece of information does, in fact, serve a purpose of some sort. Whether any given piece of a web log serves your purposes is another question. See Figure 1.1 for an example of a raw web log.

WHAT STRUCTURE DOES YOUR BIG DATA HAVE?
Many sources of big data are actually semi-structured or multi-structured, not unstructured. Such data does have a logical flow to it that can be understood so that information can be extracted from it for analysis. It just isn't as easy to deal with as traditional structured data sources. Taming semi-structured data is largely a matter of putting in the extra time and effort to figure out the best way to process it.

There is logic to the information in the web log even if it isn't entirely clear at first glance. There are fields, there are delimiters, and there are values just like in a structured source. However, they do not follow each other consistently or in a set way. The log text generated by a click on a web site right now can be longer or shorter than the log text generated by a click from a different page one minute from now. In the end, however, it's important to understand that semi-structured data does have an underlying logic. It is possible to develop relationships between various pieces of it. It simply takes more effort than structured data.

Analytic professionals will be more intimidated by truly unstructured data than by semi-structured data. They may have to wrestle with semi-structured data to bend it to their will, but they can do it. Analysts can get semi-structured data into a form that is well structured and can incorporate it into their analytical processes. Truly unstructured data can be much harder to tame and will remain a challenge for organizations even as they tame semi-structured data.
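Figure 1.1 is not reproduced here, but the "complex rules" point can be illustrated with a short sketch that parses one line of the Apache common log format, one widespread semi-structured web-log layout. The sample line is invented.

    import re

    # Apache common log format: host, identity, user, timestamp,
    # request, status, size. Fields exist, but widths vary per record.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    )

    line = ('192.0.2.1 - - [10/Oct/2020:13:55:36 +0000] '
            '"GET /index.html HTTP/1.1" 200 2326')

    match = LOG_PATTERN.match(line)
    if match:
        fields = match.groupdict()
        print(fields["host"], fields["status"], fields["request"])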
Algorithms Using MapReduce

MapReduce is not a solution to every problem, not even every problem that can profitably use many compute nodes operating in parallel. As we mentioned in Section 2.1.2, the entire distributed-file-system milieu makes sense only when files are very large and are rarely updated in place. Thus, we would not expect to use either a DFS or an implementation of MapReduce for managing online retail sales, even though a large on-line retailer such as Amazon uses thousands of compute nodes when processing requests over the Web. The reason is that the principal operations on Amazon data involve responding to searches for products, recording sales, and so on: processes that involve relatively little calculation and that change the database. On the other hand, Amazon might use MapReduce to perform certain analytic queries on large amounts of data, such as finding, for each user, those users whose buying patterns were most similar.

The original purpose for which the Google implementation of MapReduce was created was to execute very large matrix-vector multiplications as are needed in the calculation of PageRank (see Chapter 5). We shall see that matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing. Another important class of operations that can use MapReduce effectively are the relational-algebra operations. We shall examine the MapReduce execution of these operations as well.

2.3.1 Matrix-Vector Multiplication by MapReduce

Suppose we have an n×n matrix M, whose element in row i and column j is denoted mij. Suppose we also have a vector v of length n, whose jth element is vj. Then the matrix-vector product is the vector x of length n, whose ith element xi is given by

    xi = Σ(j=1..n) mij vj

If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n is in the tens of billions. Let us first assume that n is large, but not so large that vector v cannot fit in main memory and thus be available to every Map task. The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file, or because it is stored with explicit coordinates, as a triple (i, j, mij). We also assume the position of element vj in the vector v will be discoverable in the analogous way.

The Map Function: The Map function is written to apply to one element of M. However, if v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task. Each Map task will operate on a chunk of the matrix M. From each matrix element mij it produces the key-value pair (i, mij vj). Thus, all terms of the sum that make up the component xi of the matrix-vector product will get the same key, i.

The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, xi).

2.3.2 If the Vector v Cannot Fit in Main Memory

However, it is possible that the vector v is so large that it will not fit in its entirety in main memory. It is not required that v fit in main memory at a compute node, but if it does not, then there will be a very large number of disk accesses as we move pieces of the vector into main memory to multiply components by elements of the matrix. Thus, as an alternative, we can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes of the same height. Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into main memory at a compute node. Figure 2.4 suggests what the partition looks like if the matrix and vector are each divided into five stripes.
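Here is a minimal single-process sketch of the Section 2.3.1 algorithm, mirroring the (i, mij·vj) pairs described above. The in-memory grouping stands in for Hadoop's shuffle, and the matrix arrives as explicit (i, j, mij) triples, as the text assumes; striping for an oversized v is not shown.

    from collections import defaultdict

    def map_fn(i, j, m_ij, v):
        # One matrix element produces the pair (i, m_ij * v_j).
        return (i, m_ij * v[j])

    def reduce_fn(i, partial_products):
        # Sum all values sharing row key i, yielding (i, x_i).
        return (i, sum(partial_products))

    def matrix_vector(m_triples, v):
        groups = defaultdict(list)  # stand-in for the shuffle phase
        for i, j, m_ij in m_triples:
            key, value = map_fn(i, j, m_ij, v)
            groups[key].append(value)
        return sorted(reduce_fn(i, vals) for i, vals in groups.items())

    # M = [[1, 2], [3, 4]] as (i, j, m_ij) triples, v = [1, 1].
    triples = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
    print(matrix_vector(triples, [1, 1]))  # [(0, 3), (1, 7)]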