Washington State University



Featherman's T-SQL Primer (Revised 12.27.16)

Congratulations for reaching out to improve your data management and visualization skills. These are the skills that will transform your business career. Your author survived his corporate baptism by using the MBWA (management by walking around) method and by tapping into production and quality databases, writing T-SQL queries to create reports of production status and histograms of production quality errors (followed by corrective action). T-SQL is the language used to talk to, manipulate, and retrieve datasets from databases. The data of business processes resides in transaction databases (rows and rows of big data stored in tables). The analyst's and manager's job is to find the data you need, merge it with other data, strip off the data you don't need, clean up the data you have, then add new columns of derived metrics. This document only shows how to get the data; the next module shows how to easily perform aggregations for analytics and subsequent data visualization (with pivot charts, Tableau, SSRS, etc.). Other modules show more data transformations. If you want to perform business research and analysis, you need data management and retrieval skills. This series of modules will rapidly grow your skills. The goal at the end of learning how to manipulate data is the automation of a dashboard reporting and alerting system, such that the dashboard of reports is constantly refreshed with new transaction data.

This document is intended to provide a live learning environment that will help you understand databases and how to retrieve data from them. In general a business analyst retrieves data from a database, in the process filtering, sorting, and providing new columns of derived data (metrics). Next the analyst pulls the data into a data visualization tool to explore and make sense of it, perhaps identifying trends and outliers. Finally the analyst uses their understanding of the business context to perform further queries, aggregations, etc., and then uses the data visualization tool to build a slideshow to support storytelling and management reporting.

To perform the learning:
1. Open a digital copy of this document in MS-Word.
2. Open SQL Server Management Studio (SSMS) 2012, 2014 or newer.
3. Log in to Featherman's public database using the credentials given below.
4. Copy each query shown below into a new SSMS query window, one at a time.
5. Run the query; the query specifies which database to interact with.
6. Compare the results you receive to those displayed.
7. Read the explanation that accompanies each query to see how the SQL code created the results.
8. Modify the query in SSMS to explore and learn!

The goal of business intelligence is to build a data pipeline from an often complex IT infrastructure, which is a set of hopefully integrated systems. The analyst works with database administrators to build a series of scripted stored procedures that can reliably and repeatedly turn mountains of data into dashboards of actionable information. The trick is to automate as much of the process as possible. Another goal is to simplify managerial decision making. For example, if the question is which product lines need to be expanded, contracted, or promoted, you would want to turn a gigabyte of sales information into a 4-byte Boolean yes/no answer, along with other summary supporting information. The process is to aggregate data and apply business rules to answer business problems.
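To make that last idea concrete, here is a minimal sketch (not one of the primer's queries) of aggregating reseller sales by product subcategory and applying a simple business rule with a CASE expression. The GROUP BY and INNER JOIN syntax is covered later in this module; the 1,000,000 threshold and the 'Expand'/'Review' labels are made-up assumptions, and the ProductKey join column is the usual AdventureWorksDW key.

USE [AdventureWorksDW2012];
SELECT p.[ProductSubcategoryKey]
, FORMAT(SUM(rs.[SalesAmount]), 'N0') AS [Total Sales]
-- arbitrary illustrative business rule, not an AdventureWorks rule
, CASE WHEN SUM(rs.[SalesAmount]) > 1000000 THEN 'Expand' ELSE 'Review' END AS [Recommendation]
FROM [dbo].[FactResellerSales] AS rs
INNER JOIN [dbo].[DimProduct] AS p
ON p.[ProductKey] = rs.[ProductKey]   -- assumed join column
GROUP BY p.[ProductSubcategoryKey]
ORDER BY SUM(rs.[SalesAmount]) DESC

The point is not the threshold itself, but that a gigabyte of line items can be collapsed into a short, rule-driven answer inside the database rather than in a spreadsheet.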
The field of operational analytics uses live monitoring of sensor data to detect out-of-tolerance conditions. Often the information is formatted into routine and exception reports, trend reports, dashboards, charts, and scorecards. The easy part is making the charts and dashboards. The work (and therefore the higher pay scale) is in understanding the data, creating data models and, like a sculptor or blacksmith, pounding the data into shape until it shines in its insightfulness. You also need to know SQL so that you can perform the ETL (extract, transform, load) process and automate it. By knowing SQL you can be a better analyst and support your co-workers as well.

The insight needed to make better decisions often starts with data management: capturing, cleaning, filtering, and aggregating the data. This document is designed to provide you an introduction to this process. It's very fun to gain mastery over data; if you don't, the alternative is a lot of copying and pasting in Excel! Your goal is to gain mastery over data and become self-sufficient, so that the hands-on process of extracting and compiling data does not slow down your inquisitive mind.

We have created database tables, learned about data types, and even implemented primary and foreign keys with different graphical designers. We have learned that there are different ways to generate and run SQL queries:
- In Access you can use the GUI Query by Example (QBE) facility.
- In SSRS you can use the GUI query designer to build a SQL query.
- In SQL Server Management Studio you can use the Query | Design Query in Editor option.
- You can use a Microsoft Query inside Excel's external data connection in a QBE facility.
- In Access you can write SQL code.
- In SQL Server Management Studio you can write SQL code.
- In a Microsoft Query inside Excel's external data connection you can write SQL.
- Using an external data connection in Excel and adding some filters runs SQL behind the scenes.
- Excel is adding new query facilities.
- Power BI Desktop also allows you to run SQL queries.

So what's so great about writing SQL code? It is time-consuming to learn and you must follow the syntax rules, so no one will ever say it's easy to learn. But SQL is powerful, it's the lingua franca of data, and if you don't learn how to use it, you are stunting your growth as a BI analyst. So now it is time to learn more SQL tricks and tips. This document is designed to help you move to the next step of control and mastery of SQL SELECT statements. Learning SQL is a path to a higher pay grade for the business analyst.

First, a bit about databases. Analysts must become very adept at reading database structures and making sense of the tables, their columns, their derived data, and how the tables of a database fit together (or not). A database is a collection of data segmented into tables, where the tables are akin to the tabs of a spreadsheet workbook. While often little care is spent on deciding what content goes into each spreadsheet, the creation and design of tables and databases requires a great deal of care. The goal of database design (to optimize the speed and accuracy of data input) is to store each fact only once, in one table. To do this you have to segment the data into different tables and then join the tables using a system of keys (related fields), as sketched below.
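For example, here is a minimal sketch of that segmentation using two hypothetical tables (the names and columns are illustrative, not from the AdventureWorks or Featherman databases): each customer is stored exactly once, and every sale row points back to its customer through a key.

CREATE TABLE Customers
(
    CustomerID   INT PRIMARY KEY,        -- each customer stored only once
    CustomerName NVARCHAR(100)
);

CREATE TABLE Sales
(
    SaleID      INT PRIMARY KEY,
    CustomerID  INT FOREIGN KEY REFERENCES Customers(CustomerID),  -- key relating each sale to its customer
    InvoiceDate DATE,
    TotalSale   MONEY
);

The customer's name never has to be re-typed on each sale; it is looked up through the CustomerID key when the tables are joined.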
While data entry is made accurate and fast, data retrieval is slowed down: you will have to learn how to join tables (it's not too hard). Good database design requires that data about an organization's transactions is segmented into separate tables based on a logical grouping into entities (i.e., customers, sales, products, vendors, etc.). Some data changes rapidly (with table inserts, updates, and deletes), such as inventory levels, warehouse locations, and sales records; this type of data is called transaction data. Other data changes less often (i.e., vendor and customer information) and is called master data.

Database administrators (DBAs) spend a great deal of effort segmenting data into dozens of tables. This segmenting of data for proper storage is critical. Many database users shove all their data into one large spreadsheet, which makes it hard or impossible to retrieve data. The problem with this 'one spreadsheet approach' is that many of the records have duplicate, erroneous, or missing data (i.e., for each sale to the same customer, do you need to type the customer info every single time?). The consequence of duplicate or erroneous data is that you can't pull the data you need to create the reports you need. Data analysts and data warehouse designers pull data out of the transaction database to build data warehouses or data cubes (pre-aggregated data), creating logical database designs, flattening out data, and combining data to make it easier to query for reporting and analysis. You can build different data models based on the problems you are solving, which drives the data that is needed. While database designers and web developers focus on building systems to store master and transaction data, the business analyst needs to understand database design and datatypes so they can pull together data from different tables into datasets for further analysis. Analysts often need to clean up the data while merging it. Getting the data is the second step in decision making. What is step one? Knowing the problem.

If you the analyst are faced with data bunched together into one spreadsheet or database table, talk to the business manager about starting a database project to turn the data into third normal form (3NF). Only when the data is in good shape in database tables, and is cleaned for accuracy as the transactions occur (or as data is edited), does the analyst have a fighting chance to find the data they need to investigate and report on, and the data may even be accurate (which makes the reports more likely to be accurate, and the decisions correct). If the data is in good shape then the analyst can work quickly. If you the analyst have to perform data manipulation miracles to re-shape the data before analysis and reporting, then try to change the business rules governing the data going into the database.

[Figure: Simplified AdventureWorks 2012 data warehouse logical data model. A star schema centered on FactInternetSales (> 60,000 sales line items of 158 different products, grouped on > 27,000 orders over a 3-year period), connected to: DimProducts (> 500 products such as Mountain-100 Silver, 48 - the WHAT), which rolls up to DimProductSubCategory (> 35 sub-categories, such as Mountain Bikes) and DimProductCategory (4 categories, such as Bikes); DimCustomers (> 18,000 different customers - the WHO); DimTerritories (sales are made in 10 different geographic territories - the WHERE); and DimPromotions (4 different promotions executed to sell more product - the WHY).]
FYI: Queries shown in this document can be saved, packaged, and re-used. A saved query that is stored within SQL Server is called a stored procedure. This packaging, deployment, and leveraging of stored procedures into organizational procedures is reviewed at a later date. Many queries within this packet use the FactResellerSales table to analyze the sales made to bicycle shops. In that context we use the DimReseller table rather than DimCustomers, and DimGeography rather than DimTerritories. Some queries may use a table that collapses the Products hierarchy into one table, saving the need to join three tables together (and saving typing).

Retrieving and formatting filtered data

You are asked to start up SQL Server Management Studio, create a new query, and copy in each query shown below (one at a time) and run it. Next, experiment with the functionality using different field names, filters, etc. It is important to keep the blue capitalized keywords in the order that you see them. Note: please google 'SQL statements' to see helpful readings from your friendly online DBAs. A YouTube search for 'useful SQL SELECT statements' will also reap many good resources. Please open up a new query, select the AdventureWorksDW2012 database, and paste in each query one at a time. Practice reduces errors and increases talent level.

USE [AdventureWorksDW2012];
SELECT [ProductAlternateKey], [EnglishProductName], [StandardCost]
FROM [dbo].[DimProduct]

This is the most basic of SQL statements, where we want to build a dataset of discrete rows of data with no summarization and with only a few columns (here just three). Typically you would copy the resultset into Excel or some other program for further analysis or data visualization. We should filter the products though, since many of them are not for sale (i.e., nuts, bolts, decals); rather, they are used internally in the production process.
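One way to express 'only products that are for sale' is a flag filter. A minimal sketch, assuming DimProduct's FinishedGoodsFlag column (a standard column in the AdventureWorksDW sample, though not one used elsewhere in this primer):

USE [AdventureWorksDW2012];
SELECT [ProductAlternateKey], [EnglishProductName], [StandardCost]
FROM [dbo].[DimProduct]
WHERE [FinishedGoodsFlag] = 1   -- keep only sellable finished goods, not internal parts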
The next query in the primer solves this problem by filtering on the StandardCost column. First, though, note the anatomy of the basic query. The USE [database name] line is critical and specifies which database you will run the query against. The SELECT line specifies which columns to retrieve (no typos please). The FROM line specifies which table to retrieve the data from. Here we select three columns from the DimProduct table.

Let's take a look at the product line of a bicycle company called AdventureWorks. That first query selects three columns from the DimProduct table. When you press F5 or the red exclamation point (the Execute button on the toolbar) you should see a list of 509 products. We start with the idea of running a SQL query in SSMS, then clicking the box in the top-left corner of the results grid and copying the results to Excel for further formatting and visualization. There are other methods to pull and display data from a database. You can use reporting or visualization software, which often has easy ways to graphically choose the data. You can also package and save the SQL into a stored query called a stored procedure. All of these methodologies of working with data are explored in later modules.

USE [AdventureWorksDW2012];
SELECT [ProductAlternateKey] as [Product ID]
, [EnglishProductName] as [Product Name]
, FORMAT([StandardCost], 'N2') as [Product Cost]
FROM [dbo].[DimProduct]
WHERE [StandardCost] IS NOT NULL
ORDER BY [Product ID]

-- also can use:
-- WHERE [StandardCost] IS NOT NULL
-- OR [StandardCost] <> ' '

(Credentials for Featherman's public database, referenced in the setup steps above: server cb-ot-devst05.ad.wsu.edu, login mfstudent, password BIanalyst.)

Because you cannot have a comma before the FROM line of code, developers often put the comma at the start of each line that specifies a new column. A few improvements are provided in this query:
- We filter out the rows with NULL values (now 395 products are displayed, the ones sold to customers).
- We provide better field names using the as [field name] syntax. If you are using spaces in field names, be sure to enclose the field name in brackets. These [field names] will become the column headers if you copy the results to Excel, or create a dataset in SSRS.
- We also use the FORMAT(fieldname, 'format code') function to format the number. Here comma notation with 2 decimals is produced with the 'N2' code; 'C2' is also common.
- Sorting of the output is handled by the ORDER BY statement. You can use several levels of sorting, to sort sub-groups within a larger grouping (DESC is the added term to sort descending).

The as [field name] code gives the column a new name. This renaming of the column is usually optional, but it is required if you have a calculated field.

USE [AdventureWorksDW2012];
SELECT [ProductAlternateKey] as [Product ID]
, [EnglishProductName] AS [Product Name]
, FORMAT([StandardCost], 'N2') as [Product Cost]
FROM [dbo].[DimProduct]
WHERE [StandardCost] IS NOT NULL
AND [ProductSubcategoryKey] = 3
AND [StandardCost] > 500
ORDER BY [Product ID]

Notice that you can use many criteria in the WHERE filter statement. In fact, two of the three filters here are not displayed as columns. You can include fields in the WHERE clause without displaying them; you can use them just to filter. Here we certainly could display a ProductSubcategoryKey column that repeats 3's, but what use is that? Presumably the filter terms are displayed in the analytics-level report, spreadsheet, website, or dashboard. Also, the more filters you set, the fewer records will be retrieved, as the search is more specific.
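Before adding more filters, here is the multi-level sorting mentioned above. A minimal sketch (a variation on the queries above, not one of the primer's own): sort the products by Color, and within each color by cost, most expensive first.

USE [AdventureWorksDW2012];
SELECT [Color], [EnglishProductName], [StandardCost]
FROM [dbo].[DimProduct]
WHERE [StandardCost] IS NOT NULL
ORDER BY [Color], [StandardCost] DESC   -- group the rows by color, costliest first within each color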
You can add one or many different filters to the WHERE statement, which is pretty important because you rarely want ALL the data rows from a table. Usually the analyst has to sift through the data and retrieve a narrow dataset (to solve the problem at hand) from a mountain of data. Detailed WHERE clauses perform this functionality.

When specifying filters in the WHERE clause, dates and words have to be in 'single quotes' and numbers do not. You have to know the datatype of the column that you are filtering on before you craft your WHERE statement. You can do that by using SSMS's Object Explorer to find the database and table, then right-clicking the table name and examining its columns to see the column names and their datatypes.

The BETWEEN keyword can also be used to filter by a range, such as BETWEEN 500 AND 1000. This is the same as typing fieldname >= 500 AND fieldname <= 1000 (both ends are inclusive). You have to be careful though, because you may not know the correct upper range. Try changing the subcategory filter criteria to see that the records retrieved and displayed will change. How many expensive products are in each sub-category?

USE [AdventureWorksDW2012];
SELECT [ProductAlternateKey] AS [Part #]
, [EnglishProductName] AS [Product]
, [ProductSubcategoryKey]
, [Color]
FROM [dbo].[DimProduct] as p
WHERE EnglishProductName LIKE '%Mountain%'
ORDER BY [ProductAlternateKey]

Here the WHERE clause uses the keyword LIKE to create a wildcard filter that retrieves all records close to a search term. Any product with the word Mountain in its name will be selected. You can, for example, retrieve all items in a product line that follows a similar naming convention (LIKE 'BK%').

SELECT [ProductAlternateKey] AS [Part #]
, [EnglishProductName] AS [Product]
, [ProductSubcategoryKey]
, [Color]
FROM [dbo].[DimProduct] as p
WHERE EnglishProductName NOT LIKE '%Mountain%'
AND [ProductSubcategoryKey] IS NOT NULL
ORDER BY [ProductAlternateKey]

Here are the results of using WHERE EnglishProductName NOT LIKE '%Mountain%' to get all the other products, the ones without the word Mountain. The ORDER BY statement could also have used the [Part #] alias.

SELECT [EnglishProductName], [Class], FORMAT([ListPrice], 'N0') as [List Price]
FROM [dbo].[DimProduct]
WHERE [EnglishProductName] LIKE '%Road%'
ORDER BY [List Price] DESC

Here is another wildcard search: any product that has the word Road in it. Try some wildcard searches of your own. A first shortcoming of T-SQL is introduced here. When the FORMAT() function is used, the datatype of the numeric column is changed to a character field, and the sorting stops working correctly (it sorts as text). Just remove the FORMAT() syntax to fix the problem.
It is widely recommended to perform the formatting in the reporting or data visualization software, rather than in SQL.

SELECT [EnglishProductName], [Class], FORMAT([ListPrice], 'N0') as [List Price]
FROM [dbo].[DimProduct]
WHERE Class = 'M' AND [EnglishProductName] LIKE '%Red%'
ORDER BY [List Price] DESC

-- a different resultset is retrieved if you change the criteria
SELECT [EnglishProductName], [Class], FORMAT([ListPrice], 'N0') as [List Price]
FROM [dbo].[DimProduct]
WHERE Class = 'L' AND [EnglishProductName] LIKE '%Silver%'
ORDER BY [List Price] DESC

Just changing the Class and the color term of course produces a completely different list.

USE [AdventureWorksDW2012];
SELECT [EnglishProductName], [Class], ListPrice as [List Price]
FROM [dbo].[DimProduct]
WHERE [EnglishProductName] LIKE 'Road%Red%'
-- you can also substitute in the following line to retrieve a different resultset
-- WHERE [EnglishProductName] LIKE '%52%'

What if you want to see the product line for the Road bikes? What if you want to see just the red road bikes? What if the customer has been measured and sized for the 52 cm model? Can you show a list of products that they can further examine? Can you retrieve a list of seats? Of bicycle helmets? Take some time to try out different wildcard search terms.

USE [AdventureWorksDW2012];
SELECT CONCAT([FirstName], ' ', [LastName]) AS Employee
, [Title]
, [EmailAddress]
, FORMAT([BaseRate], 'N2') as [Base rate]
FROM [dbo].[DimEmployee]
WHERE [LastName] BETWEEN 'A%' AND 'M%'
ORDER BY LastName

How can HR send an email to approximately half of the employees? Here we pull a list of employees whose last names start with the letters A through L. Notice the use of M as the ending range; the ending range is not inclusive. This is how you can concatenate two columns of data, adding a space between the FirstName and LastName columns. Using the CONCAT() function rather than + to concatenate columns is considered more secure against SQL injection attacks. While we use concatenation to improve the output, we still sort by one column, the LastName (not the concatenated term). The outcome of the query is a list of employees. Presumably the list could be created within a stored procedure that is fed to an SMTP server for email distribution. Of course the list could be further sorted by job title or other HR information. Notice also that you can order (sort) by a column that isn't in the resultset. Can you retrieve a list of employees whose last name begins with the same letter as your last name?

USE [AdventureWorksDW2012];
SELECT [EnglishProductName], Color, [ListPrice]
FROM [dbo].[DimProduct]
WHERE Color <> 'Silver' AND Color <> 'Red'
AND ListPrice IS NOT NULL
AND [EnglishProductName] LIKE '%Frame%'

Sometimes you need numerous filters to cut the data down to only what you need, and sometimes you need to exclude one (or more) categories or values from a column. You can use fieldname <> 'value' to accomplish this; the syntax is column name <> 'criteria'. This syntax of excluding values can be faster than specifying the values to include in an OR statement or in an IN statement (shown next). Here pay particular attention to the compound WHERE clause. When analysts drown in data, they need to employ good filters to extract from the database just the rows they need.
So the SELECT statement filters the columns, and the WHERE statement filters the rows. You need to practice both to reduce information overload. When pulling data into a data visualization tool, you do not want extra columns and rows that will not be included in the analysis. Good filtering reduces data obfuscation.

USE [AdventureWorksDW2012];
SELECT [EnglishProductName], Color, [ListPrice]
FROM [dbo].[DimProduct]
WHERE Color IN ('Black', 'Red', 'Blue')
AND ListPrice IS NOT NULL
ORDER BY Color

/* Rather than write the filter:
   WHERE Color = 'Black' OR Color = 'Red' OR Color = 'Blue'
   we use the easier-to-write:
   WHERE Color IN ('Black', 'Red', 'Blue') */

You can use the most excellent WHERE fieldname IN (criteria list) syntax. Of course you need to know the filter values. The use of AND means that the returned rows must satisfy both conditions; you can string together many filter criteria using AND. We will use this IN statement later in SSRS reporting. Use the IN statement to specify several criteria to be used as filters for the same field (here Color). Try a different field to filter on: first look at the values in the field, then build your filter.

USE [AdventureWorksDW2012];
SELECT MIN([ListPrice]) AS [Min Sales Price]
, MAX([ListPrice]) AS [Max Sales Price]
, AVG([ListPrice]) AS [Average Sales Price]
, COUNT([ListPrice]) AS [Num Products]
FROM [dbo].[DimProduct]
WHERE [ProductSubcategoryKey] = 3

You can perform some interesting statistics, such as the minimum and maximum values in a column, the column's average value, and the count of the rows. Notice these metrics are for the entire dataset specified by the FROM and WHERE clauses; there is no grouping of data. For the count, you can name any column or use COUNT(*). Take a look at the measures for several of the product sub-categories. Wouldn't it be helpful to see the name of the product subcategory rather than just a number? We will have to learn how to retrieve data from multiple tables at the same time.

USE [AdventureWorksDW2012];
SELECT DISTINCT [SalesOrderNumber]
FROM FactResellerSales
ORDER BY [SalesOrderNumber]

Sometimes you have a large table and you want to quickly pull out a list of unique values from one column in the table. You can use the DISTINCT keyword to do the trick. Here each sales order number has many line items. Can you retrieve unique values for other fields? Can you retrieve the customer ID#s from the USA that had an order in 2008? (We need to join in the customer table later on.)

Creating new columns of data from existing columns of data

Reporting requirements given to the data analyst will eventually require them to generate new analytics, or perhaps simply to combine columns of data into one nicely formatted column. At some point the analyst will not find columns of data to support the business need, so they have to perform calculations to make new columns of metrics. The analyst needs more analytical measures (analytics) than are currently stored in the database tables. We turn our focus to performing calculations, just as you would in Excel by adding a new column and referencing other columns. You can create calculated columns of derived data; the calculation is performed once for every row in the resultset. Note that these new columns of data are virtual and therefore not stored in a database table; rather, they are calculated 'on-the-fly', as sketched below.
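For instance, a minimal sketch of a derived column (a simpler variation than the primer's next query): subtract one existing column from another and give the result a name.

USE [AdventureWorksDW2012];
SELECT [EnglishProductName]
, [StandardCost]
, [ListPrice]
, [ListPrice] - [StandardCost] AS [Margin]   -- derived column, computed on the fly for every row
FROM [dbo].[DimProduct]
WHERE [StandardCost] IS NOT NULL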
You can create calculated fields within a report or spreadsheet, but that methodology introduces errors. For governance and consistency, it's best to drive the calculations down into the SQL (rather than into calculated fields in some report or spreadsheet, which can introduce errors, human bias, and fudging). The custom SQL with the calculated fields is then used as the datasource for reports and spreadsheets (the SQL gets packaged up as a stored procedure, is tested, and is reused). Packaging up a SQL SELECT statement that performs complex calculations and creates new columns of measures into a stored procedure helps automate the data management and business analytics creation process. This idea of automating report generation and analytics is very important. Also, the calculation is performed once, rather than recreated in different spreadsheets. Tables never have all the measures and values that analysts need, so creating new columns is a very important and common task.

In summary, keep the data and calculations as much as possible in the database and in the stored query. When data is moved to Excel and calculations are made there, the organization loses quality control over the analysis and then suffers from 'version hell', meaning that different players in the organization have different versions of documents, and can disagree. It's better for all managers and analysts to retrieve data that is pre-compiled with the needed metrics. Again, keep the data on the server to reduce versioning errors. Also, the data on the server should be constantly updated, which you can't say is common in Excel.

USE [AdventureWorksDW2012];
SELECT [ProductSubcategoryKey] as [Sub-Category]
, [ProductAlternateKey] AS [Part #]
, [EnglishProductName] AS [Product]
, [Color]
, FORMAT([StandardCost], 'N2') as [Product Cost]
, FORMAT([ListPrice], 'N2') as [List Price]
, (([ListPrice] - [StandardCost])/[StandardCost]) * 100 as [Profit Pct.]
-- this next line is another methodology that provides the same results as the prior calculation
, FORMAT((([ListPrice] - [StandardCost])/[StandardCost]), 'P2') as [Profit %]
FROM [dbo].[DimProduct] as p
WHERE EnglishProductName LIKE '%Mountain%'
ORDER BY [Profit %] DESC

It's a good idea to start the syntax for each new calculated column of derived data on a new line in your query. A great many calculations can be created, similar to creating formulas in Excel. The FORMAT clause takes the form FORMAT(expression, 'format code'), where the popular format codes are P2, N0, and C0: percent with 2 decimals, numeric with 0 decimals, and currency with 0 decimals, respectively.

Notice that the profit margin for each product is calculated. Be sure to count the number of left and right parentheses. This is a good example of deriving new information from the current set of columns. This is where the genius comes in. Multiply formulas by 100 if you want to see a number > 1. Please make a mental note: if you perform some calculation and your result is a column of zeros, try multiplying the result by 100.

Remove the WHERE statement above and replace it with:

WHERE [ProductSubcategoryKey] = 3 AND [StandardCost] BETWEEN 500 AND 1000

Go ahead and experiment by changing the sub-category to a different number in the range 1 - 35.
You can also change the standard cost range to see a different set of products in the resultset.

USE [AdventureWorksDW2012];
SELECT ResellerKey
, [SalesOrderNumber]
, [SalesOrderLineNumber] AS [Line #]
, (([UnitPrice] * [OrderQuantity]) - [DiscountAmount]) AS [Line Total]
FROM [dbo].[FactResellerSales]
WHERE [OrderDate] BETWEEN '1/1/2008' AND '12/31/2008'
ORDER BY [SalesOrderNumber], [SalesOrderLineNumber]

Related questions are:
- What is the average # of line items that each reseller purchases on their sales orders?
- Does the average # of line items differ by season, by region, or by year?
- Which customers are expanding (or contracting) the variety of their purchases (increasing or decreasing the # of line items)?
- Do promotions increase the volume of sales, or the breadth?

This query returns 13,283 rows of data and is another example of creating a calculated column of derived data. Notice that one sales order # has many line items. The line total for each line item of each sales order in 2008 is calculated. Note: 1) the dates for the BETWEEN clause here are hard-coded; we can pass in variables here (covered later). 2) The DATEPART() function can also simplify this kind of date filter (a short sketch appears below, just before the star schema is described in detail). We really need to bring in the reseller name and aggregate these line totals for each order. Here we go; the rationale is explained below.

Getting more columns from more than one table

This topic of joining tables of data together for improved report writing is covered here and in the next module. Linked tables can bring in the interesting categorical and nominal descriptors (called dimensions) for grouping and categorization. The data we need to create reports is usually spread across multiple tables of the data warehouse. Analysts manipulate data stored in the data warehouse or a regional data mart. Specialized data warehouse DBAs called data modelers or ETL (extract, transform and load) programmers write stored procedures that are scheduled to run at hourly intervals to pull and consolidate data from transaction processing systems into the data warehouse, thereby refreshing each dashboard that is connected to the data warehouse.

The focus of Featherman's T-SQL programming series is pulling data into report-writing software for managerial analysis. Now that you have learned some basics of data retrieval (Bravo!), let us turn to a fundamental concept (meaning you need to understand this concept expertly in job interviews): JOINING tables. Report writers and analysts use JOINs hundreds of times daily to connect dimension tables of master data (smaller tables which don't change rapidly) to larger fact tables of transaction data (these transaction tables grow into the millions and hundreds of millions of rows of data). Next follows some background information on star schemas. Don't worry, you don't need to memorize the information, but recognize that we are leveraging it. Our new use of INNER JOIN queries below is made possible by a prior effort to create a coherent data management structure. In SSMS we continue to write our queries, adding just one new feature: simple INNER JOINs to connect two tables of data. After we connect the tables, we can retrieve columns of data from both, usually making the report more interesting and rich. To slice the big transaction data into logical groups based on reporting dimensions (i.e., time, geography, promotion, customer demographics), database architects create data models to connect the data tables into a star format (see the figure earlier in this document).
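Here is the DATEPART() alternative promised above: a minimal sketch (a variation on the reseller line-total query, not one of the primer's own) that filters the same 2008 rows without hard-coding the two boundary dates.

USE [AdventureWorksDW2012];
SELECT [SalesOrderNumber]
, [SalesOrderLineNumber] AS [Line #]
, (([UnitPrice] * [OrderQuantity]) - [DiscountAmount]) AS [Line Total]
FROM [dbo].[FactResellerSales]
WHERE DATEPART(YEAR, [OrderDate]) = 2008   -- pull out just the year portion of each order date
ORDER BY [SalesOrderNumber], [SalesOrderLineNumber]

DATEPART() accepts other date parts as well (QUARTER, MONTH, WEEK, etc.), which is what makes it so useful later on for slicing the time dimension.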
In the star schema, each data table uses a primary key to uniquely identify each record. The fact table has its own set of fields that uniquely identify each record (in the example of sales transactions, two fields, SalesOrderNumber and SalesOrderLineNumber, are needed to make each row unique in the table), so rows are stored safely without a chance of being edited in error. The fact table also needs to be designed with the plan of being connected to many dimension tables (the tables with the slicers used to focus in on and analyze data patterns), so it holds a foreign key relationship out to each of the dimensions. This prior data warehouse design work (here using the star approach) is what makes the following joins possible. OK, now a few simple demonstrations of how to connect two tables to pull data from either table. We finish with a GROUP BY query to give you a hint of the even more powerful content coming in the next module. If you see the power of a GROUP BY query, you should become a little inspired to continue your self-study.

USE [AdventureWorksDW2012];
SELECT [EnglishProductSubcategoryName] AS [Product Sub-Category]
, [ProductAlternateKey] as [Model#]
, [EnglishProductName] AS [Product]
, FORMAT([ListPrice], 'N0') AS [Sales Price]
FROM [dbo].[DimProduct] as p
INNER JOIN [dbo].[DimProductSubcategory] as s
ON s.ProductSubcategoryKey = p.ProductSubcategoryKey
WHERE s.[ProductSubcategoryKey] = 3
ORDER BY EnglishProductName

Here we join two tables together, DimProduct and DimProductSubcategory, so that while we filter the retrieved data by the subcategory number, we display the sub-category name, which greatly improves the usefulness of the dataset. You have to join two tables together on their common field, a field that exists in both tables (here ProductSubcategoryKey), so that you can pull data from both tables. FYI: the FROM [dbo].[DimProduct] as p means we are pulling columns of data from the DimProduct table. We also create an alias for the table so that we can refer to a column with the shortcut alias; rather than typing [dbo].[DimProduct].ProductSubcategoryKey we can type p.ProductSubcategoryKey. The output here is enriched by adding the product sub-category names, and the rows are sorted using the ORDER BY statement. Run the query now. Hmm, is there a price differential for the bicycles when they are different colors? It would be great to add a column showing how many of each product are sold! It would also be helpful if this query were attached to an Excel list or webpage which could pass in the sub-category #.

USE [Featherman_Analytics];
SELECT [Customers].[CustomerID]
, [CustomerName]
, [Sale_ID]
, CONVERT(VARCHAR(10), [Invoice_Date], 110) as [Invoice Date]
, FORMAT([Total_Sale], 'N0') as [Invoice Total]
FROM [featherman].[Customers]
INNER JOIN [featherman].[sales]
ON [featherman].[sales].[CustomerID] = [featherman].[Customers].[CustomerID]
ORDER BY [Customers].[CustomerID], [Invoice_Date]

This query pulls data together from Featherman's customers and sales tables. Can you see that the alias shortcut mentioned above was not utilized? Can you run the query and then modify it to implement the aliases? Notice the useful CONVERT() function used to format the Invoice Date column. Perhaps less useful is FORMAT([measure]), which tames the data into a nicely formatted number if you just need a screenshot, but unfortunately turns the data into a text string. If you copy the data to Excel or Power BI you have to reconvert the column to numeric.
A common rule is to do the data formatting in the top-level program. The query returns a dataset which is approximately 2,000 rows of data and 5 columns wide. In SQL Server Management Studio you can run the query above, retrieve the results, and copy the results (with headers) into Excel. Next add a pivot chart, with customer name as the slicer, and you can easily create a customer-level sales visual, and others like it. This database-to-reporting synergy has been the bread and potatoes for MIS graduates (now referred to as analysts) for years. SQL with Excel is now being joined by SQL with Power BI Desktop; Amazon Redshift and Tableau are also becoming common.

USE [Featherman_Analytics];
SELECT c.[CustomerID]
, [CustomerName]
, COUNT([Sale_ID]) as [# Invoices]
, FORMAT(AVG([Total_Sale]), 'N0') as [Invoice Avg.]
, FORMAT(SUM([Total_Sale]), 'N0') as [Grand Total]
FROM [featherman].[Customers] as c
INNER JOIN [featherman].[sales] as s
ON s.[CustomerID] = c.[CustomerID]
GROUP BY c.[CustomerID], [CustomerName]
ORDER BY c.[CustomerID]

This very simple-looking query collapses the 2,000 rows of data retrieved by the previous query down to 21 rows, again with 5 columns. Three columns of analytics are calculated for each customer; in truth there is no limit to the number of calculated columns that you can generate for each row. Since the GROUP BY column specified is at the customer level, the aggregate functions demonstrated provide counts, averages, and totals at that level. If we had US state descriptors, then the sales records could be grouped (rolled up) by state, which is a higher level of aggregation. The COUNT function just counts up the number of rows (invoices) for each customer. Next, the values in the Total_Sale column are grouped and a) averaged and b) totaled by customer. The final output is very neat and easy to copy into Excel to produce a visual. A next step would be to cross-tabulate the data, for example breaking the data into years for each customer (a sketch of this appears at the end of the module).

To finish this module we review the functions COUNT and COUNT(DISTINCT), which are needed when you need to pull out unique or grouped values from large datasets. Usually you will use an INNER JOIN to select only the data from a fact table that has values in the dimension table, for example showing production orders for only those products that had production activity. We introduce a LEFT JOIN here to show each reseller whether they had sales activity or not. The convention is to type the dimension table name first in your SQL statement, followed by a LEFT JOIN, to ensure ALL values from the dimension table are shown whether they have related records in the fact table or not. The LEFT JOIN still allows us to access columns of data from both tables.

USE [AdventureWorksDW2012];
SELECT [BusinessType]
, [ResellerName]
, COUNT(DISTINCT([SalesOrderNumber])) as [# Sales]
, COUNT([SalesOrderNumber]) as [# Lines]
, FORMAT(SUM([SalesAmount]), 'N0') as [Total Sales]
FROM [dbo].[DimReseller] as r
LEFT JOIN [dbo].[FactResellerSales] as rs
ON r.ResellerKey = rs.ResellerKey
GROUP BY [BusinessType], [ResellerName]
HAVING SUM([SalesAmount]) > 0
ORDER BY [BusinessType], [ResellerName]

To make data retrieval more understandable, you can think of the dimension tables as parent (or master) tables and the fact table (with all the numbers) as the child (or detail) table. In this example there are 699 resellers in the DimReseller table.
The DimReseller table is a dimension (parent) table for which there are potentially many related rows in the child table (sales in the FactResellerSales fact table, which here has more than 60,000 records). The LEFT JOIN used here ensures that all resellers appear in the created list whether or not they had a corresponding sale in the FactResellerSales table. For every reseller we total up the related rows of data (totaled here because we GROUP BY reseller). Sometimes you will want to generate a list of all the values in the parent table even if there are no related records in the fact table; here we want to include the resellers even if they do not have any sales (perhaps to highlight that there weren't any sales). When you want to include all the records from the master table regardless of matches in the detail table, you change INNER JOIN to LEFT JOIN to include ALL of the values from the first table. This is a useful tool. The HAVING clause allows us to filter the groups that were made. Do you want to run the query, copy the dataset into Excel (or Tableau), and make some pivot charts? Sounds like a job interview question. The skills shown here are fundamental for analysts.

Congratulations on completing this most epic and useful course module. Even the basic skills learned here can help you survive on the job. You can pull large tables of data into Excel and let the pivot chart tool do the slicing, dicing, and totaling of the data. The next module goes into condensing and compiling the data retrieved from the data warehouse; the GROUP BY functionality was foreshadowed here. By far the most useful topic touched on here is the slicing up of the time dimension into hours, days, weeks, months, quarters, and years. Doing the date manipulation in SQL with the DATEPART() function makes the Excel or Tableau analysis much richer and more insightful.

The contents of this module should prepare the business analyst for an internship and the first week on the job (as long as you have access to pivot charts or Tableau to back up this rudimentary use of T-SQL). To rise in the analyst profession (i.e., triple your salary), the analyst needs to become self-reliant and gain mastery over data: able to create datasets, find insights and secrets in the data, even perform predictive analysis. There are more advanced procedures that analysts use to select, retrieve, and filter data. For example, it is very common to need to join 5 tables together to gain access to the fields you need to provide the management insight. If you are lucky, a DBA will write a view or stored procedure for you to use. For now, please practice and master different combinations of filtering and creating new columns of derived data.

While there is much more to learn, you have already learned a ton. I hope you see the power of SQL and its usefulness as well. This author himself survived his first production management job because of his ability to extract data and create status reports, exception reports, and Pareto histograms for quality control. This author has also been sending graduating university seniors off to MIS jobs in corporate America since 1993. The story is the same as it ever was: recruiters want applicants who can pull a dataset together from a database, then manipulate and aggregate the data to create the datasets and dashboards upon which managerial decisions are made (perhaps using Excel pivot charts, Tableau, or Power BI Desktop). SQL is the IT tool that all businesspeople use to retrieve and aggregate data.
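As a closing exercise, here is the cross-tabulation idea mentioned earlier (breaking each customer's sales into years). This is a minimal sketch, assuming the same featherman Customers and sales tables used above and assuming Invoice_Date is stored as a date/datetime column.

USE [Featherman_Analytics];
SELECT c.[CustomerID]
, [CustomerName]
, DATEPART(YEAR, [Invoice_Date]) AS [Year]          -- slice the time dimension by year
, COUNT([Sale_ID]) AS [# Invoices]
, FORMAT(SUM([Total_Sale]), 'N0') AS [Yearly Total]
FROM [featherman].[Customers] AS c
INNER JOIN [featherman].[sales] AS s
ON s.[CustomerID] = c.[CustomerID]
GROUP BY c.[CustomerID], [CustomerName], DATEPART(YEAR, [Invoice_Date])
ORDER BY c.[CustomerID], [Year]

Copy the resultset into Excel and pivot with the years across the columns to get the cross-tab view.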
First you run a SQL query, then you use a data visualization tool to make the dashboards that are presented at meetings (dashboards with live slicers run today's business meeting, not PowerPoint). SQL is used to turn data into pivot charts, pivot tables, dashboards, and maps to facilitate informed, fact-based decision making.

Mauricio Featherman, Ph.D.

PS - The next module is actually easier than this one.

