Washington State University



Business Intelligence/Analytics Glossary

Aggregating or grouping data
A central theme in analytics is condensing data into groups based on some attribute (or set of attributes) so that you can more easily analyze the data and discover patterns. When the data is too granular (too detailed, such as every size and color combination of a product in a model), it is hard to analyze and to find patterns. It is often useful to analyze sales performance at a more summary level, such as by product model, before diving into the different colors and sizes that make up the product line. Grouping data enables analysis. The measures (the numbers, such as the number of units produced on a production order) are grouped by dimensions so they can be analyzed more easily. Analysts often group, compile, or aggregate data before slicing it in different ways and performing descriptive analytics or other analysis. You can put data into groups based on the values in one or more columns. For example, a CASE() statement can fill a new column with textual values indicating whether a student is a sophomore, junior, etc., based on the number of units completed. Data is put into different groups to look for patterns and correlations with an important outcome variable. CASE() statements are great for organizing data into groups so that you can reduce the granularity of the data and present the groups in a column chart or slicer. NTILE() is another useful SQL function that quickly puts records into, for example, quartiles. You can also use a SQL GROUP BY or PIVOT() query to aggregate and summarize data on one set of criteria (store, city, state) and, for the crosstab or PIVOT(), a second criterion such as month.
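As a sketch of the CASE() and NTILE() techniques described above (the Students table and its column names are hypothetical, chosen only for illustration):

```sql
-- Bucket students into class standings with CASE(),
-- and into quartiles by units completed with NTILE().
SELECT
    StudentID,
    UnitsCompleted,
    CASE
        WHEN UnitsCompleted < 30 THEN 'Freshman'
        WHEN UnitsCompleted < 60 THEN 'Sophomore'
        WHEN UnitsCompleted < 90 THEN 'Junior'
        ELSE 'Senior'
    END AS ClassStanding,
    NTILE(4) OVER (ORDER BY UnitsCompleted DESC) AS UnitsQuartile
FROM Students;
```

The derived ClassStanding column can then go on the X axis of a column chart or into a slicer, exactly as described above.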
Automation
One of the goals of analytics is to automate reports and dashboards so that when new transaction data is moved to the data warehouse (upon which the reports are based), the reports and dashboards automatically refresh with the new data. It is no fun to create the same report every week in Excel, copying and pasting data to merge it into one dataset, then building pivot charts and tables on top of that. An industry-standard approach is to use SQL stored procedures to pull, clean, and merge data into datasets. The reports and dashboards execute the stored procedures and display the data. The best part is that next week, when the data refreshes, the reports and dashboards auto-refresh and publish the latest visualizations. While it takes time to set up the database infrastructure and stored procedures, once the reports and dashboards are approved as accurate, there is no more work to be done. A consultant can therefore set up the databases, create the stored procedures, reports, and dashboards, and then no further work is needed.

Cardinality
The number of distinct values in a column of data. For example, a month column has a cardinality of 12. You would normally only care about cardinality for dimension fields. When the cardinality of a column is too high, it is hard to use that dimension to analyze the measures. For example, if you have 100 US cities in a column, you could reduce them to a few groups with a SQL CASE statement.

DBA
A database administrator (DBA) is an IT professional responsible for managing an organization's data. With the need to continually improve operational efficiency and seize opportunities or go out of business, managers need reports. DBAs provide the data to analysts, or create the reports and dashboards themselves, that keep managers informed and supported in their decision making (such as the decision whether to expand or contract the product line).
DBA has been one of the top 10 most needed jobs in the USA for the past 20 years, and should remain so for the foreseeable future.

Data Model
A data model is a methodology for segmenting data into tables that organize the attributes (columns of data) for a specific entity (such as region, customer, or product). The data model is the plan for connecting the tables using primary key and foreign key relationships. The model ties the tables together for accurate storage and efficient retrieval. Data stored in Excel files has no rules about what data can go into each column, row, or cell. That freedom creates disorganization, however, and typically hours of cut/copy/paste are needed to create reports. It is important to note that a data model created for an operational database is much different from a data model for a data warehouse (DW). Operational databases have many more tables and no flattened dimension tables. A data model for a DW segments the (often flattened) dimension data from the transaction data (the table of measures). Data organized into a data model can be accessed accurately using SQL or any report-writing software. Organization enables accuracy of data and reports, and improved decision making. When you write a SQL query with INNER JOINs, you are leveraging the organization provided by the data model.

Database
A storage place for data. It is both an art and a branch of computer science to build a relational database (there are many forms of data storage; the most popular is the relational database, which stores data in tables that are related to each other on common fields, e.g., EmployeeID or StudentID). Data in the form of documents, ledgers, spreadsheets, cross-tabulated tables, and lists used to be stored in file cabinets. Now information systems that use the web, phones, and tablets store data in a database. Data that conceptually fits into tabular rows and columns is best kept in a database.
Tabular data is organized into separate concepts (entities) stored as either dimension tables (text), hybrid dimension tables (mostly words plus some summary numbers or metrics), or fact tables (mostly numbers and dates). You can think of a database as an Excel file where each tab is a grid of data about a different entity (such as employees or products). Databases are stored on permanent storage (disk or chip) and enabled by database management system (DBMS) software such as SQL Server, MySQL, and Oracle. Data belongs in databases, not in Excel. Other forms of data are also gathered and can provide insight into business phenomena (such as brands or target markets). There are also document databases (think contracts) and unstructured data, such as comments and posts from a Twitter feed or Facebook (useful for sentiment analysis). Sensor data from RFID chips or toll roads is also unstructured.

Data Condensing
Because there is so much data, it needs to be condensed and compiled to make sense of it. For example, while you can pay a marketer to read the thousands of tweets and posts about your product line, it is also helpful to have a computer program condense the data into metrics such as "70% of our customers are satisfied with our product, and the majority of the dissenting customers dislike model X." There are simply too many rows of data, columns of data, and disparate data sources. The data needs to be pulled together into reports (in 5 seconds or faster, please) when the manager presses the button to see the report. To speed up the processing, compiling, and interpretation of data, and most importantly management action, a series of data-condensing steps normally occurs during the ETL process. GROUP BY queries condense numeric fields into categories based on important dimensions (vendor, product, store).
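A minimal sketch of such a condensing GROUP BY query (the Sales table and column names here are hypothetical):

```sql
-- Condense line-item sales down to one row per store and product.
SELECT
    StoreID,
    ProductID,
    SUM(UnitsSold)   AS TotalUnits,
    SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY StoreID, ProductID
ORDER BY StoreID, ProductID;
```

A million detail rows can collapse to a few hundred summary rows this way, which is what makes the downstream reports fast.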
PIVOT() queries or pivot tables further slice the grouped measures (i.e., total sales) by another dimension such as week or month. NTILE() and CASE() functions create new dimension columns that put individual records into groups so that the measures can be condensed.

DW - Data Warehouse
A read-only database that is optimized for data retrieval. The data is read-only and cannot be changed. This is an important defining attribute, because any report generated should not have its values changed by later updates. DWs typically use a star schema, which connects each dimension table to the fact table, or a combination star/snowflake schema, where some dimension tables connect to other dimension tables rather than to the fact table. It is common to flatten hierarchies of dimension tables (e.g., city, state, country) into denormalized, flattened tables to speed up queries by requiring fewer inner joins.

Dimensions
Dimensions are the fields in the tables of data that are used to create the categories, hierarchies, and filters needed for analysis and reporting. Dimensions are either textual (as in the geographic categories of city, state, region, and country) or time/date based (as in day of week, day of month, week number, month, year). Dimensions are best organized into dimension tables, which may contain only columns of text, or a separate dates table. A hybrid dimension table keeps the focus on the dimension of interest but may include some running totals or performance KPIs (such as LVC, the lifetime value of the customer).

Dimension tables
Tables in a database that store mostly textual information used to categorize or classify the measures. Some common examples of dimension tables are employee, vendor, geography, and product. These dimensions are used to categorize the measures, such as grouping sales by employee or product. Dimension tables do not change much and do not grow very tall over time.
For example, the employee or product data for your company will change somewhat, but hopefully not radically, every year.

Discretizing data, aka binning data
The process or outcome of putting records into discrete groups or categories, such as using CASE() statements to put customers into groups based on demographic age (i.e., 20s, 30s, etc.) or purchase frequency (i.e., new customer, monthly customer, frequent flier). Similarly, you can use NTILE(4) to put records into 4 groups with an equal number of records in each. When using CASE() or NTILE() you are adding a new column of derived data to the database table (or to the resultset from running a query). The new column shows which group the record belongs to, such as "40s" or "top 25%" on some metric. The purpose of discretizing the data is to reduce its granularity (put it into groups) to help with data analysis and visualization. Often the data is too granular and you have a hard time seeing patterns, but when you reduce the number of data points, you can make sense of the data. The new column can be used on the X axis of a column chart or in a slicer.

ETL
Extract, transform, and load. ETL is the process of extracting data from different transaction systems (often with SQL queries) and other data repositories; cleaning and transforming the data (changing data types and column names, splitting columns, concatenating/merging other columns); adding new derived metrics and other calculated fields; and finally loading the cleaned and merged data into a new database, called a data warehouse, that uses a simplified star data structure. You can perform ETL many ways: in Power BI for small datasets, using specialized software, or using SQL queries run in sequence. The trick is to automate the ETL processes so that the dashboards and reports update automatically.
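One transform-and-load step in such a SQL-based ETL sequence might look like the following sketch (the staging and warehouse table names, and the column names, are hypothetical):

```sql
-- Transform staged raw sales and load them into the warehouse fact table,
-- condensing away the invoice-level detail as described under ETL above.
INSERT INTO dw.FactSales (OrderDate, ProductKey, StoreKey, UnitsSold, SalesAmount)
SELECT
    CAST(s.OrderDate AS date),           -- transform: normalize the data type
    p.ProductKey,
    st.StoreKey,
    SUM(s.Quantity),                     -- condense: drop invoice granularity
    SUM(s.Quantity * s.UnitPrice)        -- derived metric
FROM staging.SalesRaw AS s
INNER JOIN dw.DimProduct AS p  ON p.ProductID = s.ProductID
INNER JOIN dw.DimStore   AS st ON st.StoreID  = s.StoreID
GROUP BY CAST(s.OrderDate AS date), p.ProductKey, st.StoreKey;
```

Saved as a stored procedure and scheduled, a chain of steps like this is what lets the dashboards refresh with no manual work.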
Individuals who can perform complex ETL processes are paid well, are heroic, and have interesting work. ETL is often a data-compaction and compiling process in which the granularity of the data is reduced, such as showing only the total units sold for each ProductID at each store rather than including the POS invoice number that would show which invoice each ProductID was sold on. This compiling alone can reduce the size of the data 100:1.

Fact table
In a data warehouse, fact tables hold the many data records that are captured as a result of business operations. While dimension tables remain small, fact tables grow very tall over time; fact tables with tens of millions of rows are common. The fact table captures mostly numbers that record an event occurring. Fact tables have foreign keys to the dimension tables to enable the merging of data from different tables, such as in a GROUP BY query.

Flattened data table
Operational databases that store transaction data often have categorical (dimension) data spread across many tables. The goal of data storage for operational DBs is to remove as much redundancy as possible. For example, an operational database could have separate tables for cities, states, and regions, which improves data accuracy when saving transaction data. Operational DBs are optimized for speed and accuracy of data entry and updating. Data warehouses, by contrast, are designed to speed up reporting, data analysis, and data exploration. The truth is that when data is spread across too many tables, any query to build a dataset requires many INNER JOINs to pull the data together into the needed resultset. There is no better way to reduce query speed than to require multiple INNER JOINs.
The joins all have to be tested for accuracy, and INNER JOINs slow down the processor because values must be looked up from different tables. To partially solve this data-dispersion problem, speed up data retrieval, and reduce the number of INNER JOINs necessary, DBAs are fond of performing data-flattening operations. For example, the important columns of data from the abovementioned city, state, and region tables could simply be combined into a new table, perhaps named dimGeography. The technical term for flattening data is denormalization, because it introduces redundancy in the values in the columns of data; in the example table, the state column will have many repeating values. A flattened dimension table with redundancies in some of its columns can add extra expense, however, when gigabytes of hard-drive space are consumed, since cloud data-storage plans are metered and invoiced by the gigabyte per month. Flattening data is similar to flattening a very tall cheeseburger with your hands until it is the right size to eat: you compile the data into an easy set of tables using the classic star schema, flattening it until you can consume it.

Data Granularity, aka Level of Detail (LOD)
Granularity is an attribute of data that refers to the data's level of detail. It is also related to the number of rows retrieved by a query. When you have too much detail and cannot see the pattern, reduce the level of detail (reduce the granularity). Businesses have operational employees, operational management, mid-level management, and the set of big bosses. Each of these management levels needs its reports and analysis at the appropriate level. At higher levels of management, the KPIs used to run the business must be highly summarized (e.g., statewide). As data gets summarized at higher levels (first store, then city, then state, etc.), the data is said to become less granular and more summary.
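The dimGeography flattening described above might be sketched as follows (the source tables and key names are hypothetical):

```sql
-- Flatten the city/state/region hierarchy into one denormalized table,
-- trading some redundant values for fewer joins at query time.
SELECT
    c.CityID,
    c.CityName,
    s.StateName,   -- repeats for every city in the state
    r.RegionName   -- repeats for every state in the region
INTO dimGeography
FROM Cities AS c
INNER JOIN States  AS s ON s.StateID  = c.StateID
INNER JOIN Regions AS r ON r.RegionID = s.RegionID;
```

After this step, reporting queries join only dimGeography instead of three separate tables.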
Operational employees, on the other hand, need the most detailed data to do their jobs (often lists of transactions, products, customers, etc.); this highest level of detail in the data is called the highest level of granularity. The highest level of granularity means the ability to see the most detail in the data. Similar to cane sugar or flaky rice, you can see every grain of the food. In data, the highest level of granularity is the transaction level, such as a sales transaction with each line item, a production or inventory order showing every line item and serial number, or each individual sales quote. Data is reported at different levels of summarization (lower levels of granularity) and is compiled for different levels of analysis and provided to different levels of management. The higher in the management chain your report or dashboard is targeted, the higher the summarization level of the data will be. Low-level managers care about one zip code, middle-level managers care about a district or state, and high-level managers want data by region or country (highest summarization level = lowest granularity). Data of low granularity is like sticky rice balls: you can't really see each grain of rice; rather, the rice is clumped into groups called rice balls or mochi. You just pick up and eat the whole thing, perhaps with chopsticks or fingers, rather than capturing each grain of rice. How do you set the level of detail? It depends on the fields you use in the GROUP BY statement. Using our sample dataset and filtering on USA, if you put only state in the GROUP BY statement, you should retrieve 50 rows of data. If you were to add city to the GROUP BY statement, your results would be more detailed, more granular, and you might retrieve 300 rows of data. If you add the reseller name to the GROUP BY statement, you are again increasing the granularity of the results, and 500 rows of data might be retrieved (500 bike shops).
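The different granularity levels just described can be sketched as a pair of queries (FactSales, dimGeography, and dimReseller are hypothetical names standing in for the sample dataset):

```sql
-- Low granularity: one row per state (about 50 rows for the USA).
SELECT g.StateName, SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
INNER JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey
WHERE g.Country = 'USA'
GROUP BY g.StateName;

-- Higher granularity: one row per state/city/reseller (hundreds of rows).
SELECT g.StateName, g.CityName, r.ResellerName,
       SUM(f.SalesAmount) AS TotalSales
FROM FactSales AS f
INNER JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey
INNER JOIN dimReseller  AS r ON r.ResellerKey  = f.ResellerKey
WHERE g.Country = 'USA'
GROUP BY g.StateName, g.CityName, r.ResellerName;
```

Each column added to the GROUP BY list increases the granularity, and therefore the row count, of the resultset.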
When choosing the desired level of granularity, you have to think about how the data will be used. If you are copying the data into analytics or data visualization software (Excel for pivot-chart analysis, SSRS, Power BI, etc.), then you might want a high level of detail (i.e., high granularity), so that you have all the data needed to look at the results at three levels of analysis (reseller, city, state). On the other hand, if you are pitching a report or dashboard to mid-level or higher management, you would not include data columns with high cardinality (many, many different values), such as reseller name, customer name, or product ID, and you would not put these high-cardinality columns in your GROUP BY statement.

GROUP BY
The most common SQL query used to condense data. For example, you can condense a million rows of data down to 500 rows, or down to 5. ETL routines often use several layers of GROUP BY queries to compile and condense data so that the reporting software does not get bogged down and slow. SSMS has the amazing ability to take gigabytes of data and condense it quickly.

Hierarchy
A conceptually vertical grouping of dimension data. Some examples are a) town, city, state, country; b) product, sub-category, category; c) date, week, month, year. You can easily create hierarchies in Power BI. Separate dimension tables can exist in a hierarchy, such as separate tables for towns, states, and countries. Because this data is slow to change, the DBA can choose to flatten it into one denormalized table (meaning the data has lots of duplication and redundancy, but since all the data about one entity, e.g., geography, is in one table, queries run faster).

Operational Database
A database that is optimized for speed and accuracy of data entry.
Data can be inserted, updated, or deleted (in contrast to data warehouses, which are read-only, since you do not want the reports that measure past performance to change). Operational DBs capture transaction data about business processes such as inventory movements, quotes issued, POS data, purchase-order completion data, warehouse receipts, etc. Transaction data is extracted from operational databases and then cleaned, transformed, and loaded into staging databases or the data warehouse.

Parameter
A filter that is used in a query to restrict the data to a desired category. Parameter values are often calendar periods (e.g., March) or a product/customer/employee. In SQL you can have a WHERE statement that is frozen to one value, for example Country = 'France'. Freezing (aka hard-coding) a resultset to one value will, over time, cause you to have too many reports. It is better to let the program user specify a value for the parameter from a list, for example a list of countries. Then one report can replace ten hard-coded reports.

Parameterized query
A parameterized query is a SQL and reporting concept where the program user can pass a value into the WHERE statement of a query. A parameterized query can accept a variety of values from the program user, typically by selecting one or more values from a list or a set of calendars. The SQL query could have a statement such as

WHERE Country = @Country

The @ sign tells you that @Country is a variable that can hold different values.

Pivot table, aka cross-tabulated data
A pivot table is a table of data where the cells are created by the combination of two dimensions. For example, countries might run down the rows and foods across the columns. The numbers in each cell are the result of categorizing the data using country and food type.
For example, the number of units sold might be calculated for each combination of country and food type. Most importantly, pivot tables, aka crosstabs (cross-tabulated data tables), can condense a great deal of operational data into a simple and intuitive table. One difference from GROUP BY query results is that pivoted data displays a single measure, while grouped data can have many columns of metrics. In SQL you can create a pivot table using a PIVOT() query; in Excel you use a pivot table; in SSRS you can use a matrix.

Primary key
One field (or sometimes several fields together) is typically needed to uniquely identify each row in a database table. Even if a record uses its row number to identify it, each row of data must be addressable so that it can be updated. If a table has unique rows of data and no duplicates (i.e., no two people with the same StudentID), the table is said to have entity integrity.

Query
A query is a question you ask of data. We use structured data (i.e., data in the rows and columns of a table) and the Structured Query Language (SQL). Queries are used to: a) retrieve a filtered set of data records; b) transform and compile detail-level transaction data into summary-level data (for example, totaling the units sold by day of week by state); the query compiles and aggregates the data; c) create new columns of calculated, derived data, such as profit margin; d) pull data together from different tables and databases; e) move data from one database or format to another. SQL was invented in the early 1970s by engineers at IBM and is improved every year. Examples of useful SQL Server functions include DATEPART() and DATENAME().
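A small sketch of the date functions just mentioned (the Sales table and its OrderDate column are hypothetical):

```sql
-- Derive day-of-week and month-name dimension columns from a date.
SELECT
    OrderDate,
    DATEPART(weekday, OrderDate) AS DayOfWeekNumber,  -- e.g., 1 through 7
    DATENAME(weekday, OrderDate) AS DayOfWeekName,    -- e.g., 'Tuesday'
    DATENAME(month,   OrderDate) AS MonthName         -- e.g., 'March'
FROM Sales;
```

Derived columns like these are what make "units sold by day of week" summaries possible.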
(SQL is pronounced "sequel," like the second movie in a trilogy.) SQL can be used in every modern relational database that uses tables (i.e., Oracle, SQL Server, IBM's DB2, MySQL, etc.). You use queries to generate the datasets that are fed into data visualization software (i.e., dashboards, pivot charts, maps, and reports). It is common to save queries as "stored procedures" that can be called and run from the data visualization software. You can save a query and email it to another analyst or employee. It makes sense to save queries; there are two approaches: a) views and b) stored procedures. Views are commonly used to fill employees' screens with all the data they need for transaction processing or reporting. Views pull together, format, and present data on screens or reports. Views can be used inside Excel or other report-writing software to retrieve data, especially when the parameters of the report do not change (i.e., production by week, by machine). Similarly, stored procedures are saved queries that can accept criteria used for filtering the resultset, such as returning only the summarized data for France. Queries are used to retrieve and compile data from data warehouses. Writing queries with SQL is both a science and an art, and has always been a useful job skill. The query produces the data; then you need to visualize and analyze the data.

RFM
A framework for business analysis: Recency, Frequency, and Monetary (revenue and profitability). You can analyze a dataset to see, for example: a) recency: which products were sold in the last week? b) frequency: which products are sold most often? c) monetary (revenue): what are the total sales by product for the last week? d) monetary (profitability): what profits are generated by each ProductID?

Resultset
After running a query against a database, the data returned is called the resultset or dataset.
The query results returned when using a query editor such as SQL Server Management Studio (SSMS) are called the resultset, which is typically one table.

SQL Joins
Good database structures provide a way to connect the database tables that store the data needed for reporting. The tables of data need to be connected (i.e., joined) so that you can retrieve data columns from each of the tables. In a typical example, one fact table of numerical data is joined to two dimension tables (resellers and geography), which provide the categories used to calculate the totals. Data in a data warehouse is segmented into dimension tables (words and dates to categorize the data) and fact tables (transaction data with mostly numbers and dates). The star schema organizes the data into dimension tables and fact tables, and the INNER JOIN is used to connect these tables and pull data together from different tables into a coherent dataset for subsequent reporting. You join the tables ON a field they have in common. Typically a table has a foreign key that can be connected to the primary key of another table; for example, a sales table connects to a reseller table, which in turn connects to a geography table.

SSMS
SQL Server Management Studio (SSMS) is Microsoft's flagship database management software. Analysts and DBAs use SSMS to query data, update data, manage data, move data, create tables, save queries into views and stored procedures, build out data structures (tables and the relationships among tables of data), etc.

SSRS
SQL Server Reporting Services (SSRS) is robust report-writing software that comes free with SQL Server.
You download SSDT (SQL Server Data Tools) as an add-on to the database, and then you have access to industry-standard report-writing software. SSRS is used for regular weekly reports and reports that take a few filters. SSRS is not particularly suited to data exploration, as pivot charts, Power BI, and Tableau are; rather, SSRS is best suited for productionized (regularly scheduled) reports that are delivered on screen, printed, or sent by email or PDF. One advantage of SSRS is that it is free and closely integrated with SQL Server queries and stored procedures; they work together very well.

Star Schema
Just like any physical warehouse, a data warehouse stores gigabytes of data for later usage. While personal data may be a mess of files, organizational data had better be organized and accessible within seconds by many business members simultaneously, with the goals of data accuracy and organization. The star schema is the most popular data-management structure for data warehouses, made popular by Ralph Kimball, wherein dimension tables are connected to a fact table. The dimensions are used to slice up and group (aggregate) the transaction data. The DBA decides how the tables are connected in the data model and what level of data flattening is needed in the categorical data. With so many data needs, it is typical for larger companies to hire many DBAs. DBAs bring order to the data of an organization and keep it safe.

Stored Procedure
A SQL query can take hours to perfect. These queries can be saved inside the database and used whenever needed to perform database management. Stored procedures are used to move data, clean data, merge data into datasets, update data, etc. We use stored procedures from within Power BI, Excel, and SSRS to easily pull datasets into the reporting software for further formatting and chart-making. Employees can use stored procedures created by others, and one analyst/DBA can create a library of stored procedures to refresh dashboards and reports and to run ETL tasks.
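As a sketch tying together the Parameterized query and Stored Procedure entries above (the procedure, table, and column names are hypothetical):

```sql
-- A saved, parameterized query: one procedure replaces many
-- hard-coded, single-country reports.
CREATE PROCEDURE dbo.GetSalesByCountry
    @Country nvarchar(50)
AS
BEGIN
    SELECT g.StateName, SUM(f.SalesAmount) AS TotalSales
    FROM FactSales AS f
    INNER JOIN dimGeography AS g ON g.GeographyKey = f.GeographyKey
    WHERE g.Country = @Country          -- the user-supplied parameter
    GROUP BY g.StateName;
END;
GO

-- A report or dashboard would call it with the user's selection:
EXEC dbo.GetSalesByCountry @Country = N'France';
```

Reporting tools such as SSRS, Power BI, and Excel can execute a procedure like this and auto-refresh when the warehouse data refreshes.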