Name: Deepa Vaidhyanathan



Data warehousing Project

1.Synopsis: 2

2.Software and Database Specification: 2

3.Features of MS SQL server 2008: 2

4.Project Design: 3

5.Software Installation: 3

6.Creation of Fact and Dimenions Tables: 4

7.Report Generation using the SSRS wizard of SQL server 2008 4

8.Data Extraction from a Flat file using the SSIS tool: 9

9.Populating the Datawarehouse using the SSIS integration services tool: 14

10.Data Normalization: 19

11.Assuring Data Quality: 19

12.Books and references 19

DATA WAREHOUSING OLAP PROJECT DESCRIPTION

2 Synopsis:

The main aim of doing this project is to understand the basic concepts of how the data warehouse is built from scratch using an external data file. The project is to be done using MS SQL server 2008 which is a commercial software being used for building and maintaining warehouse. This tool can also be used for generation of reports for analysis purposes.

3 Software and Database Specification:

Software used:

The MS SQL server 2008 was downloaded from the following site:



This tool was a free download and the trial version is applicable for a period of 180 days.

Database tables and Schema:

Microsoft is also offering free database tables with data, so that we can use these samples as our database for building the data warehouse and also for reporting purposes. The databases use MySQL as the querying script. The databases are to be downloaded from the following link:



4 Features of MS SQL server 2008:

This is a commercial tool being used in industry and it has all the features required to do all the OLAP activities (example data mining, data warehousing, reporting etc). Some of the main features of this tool are as follows:

Server Management Studio: This is a graphical tool used for all the data management activities and it also helps us to build a query in a very user friendly development environment.

Analysis Services: Analysis Services makes it easy to create sophisticated on line analytical processing (OLAP) and data mining solutions. The Analysis Services tools provide the capability to design, create, and manage cubes and data-mining models from data warehouses and to provide client access to OLAP data and data mining data

Integration services: Microsoft Integration Services makes it easy to create robust and complex solutions for extracting, transforming, and loading data (ETL). The Integration Services tools let you design, create, deploy, and manage packages that address everyday business requirements. This was the main tool which is used by the data warehouse developer for designing data warehouse

Reporting Services: This tool is used to configure a report server, manage report server content and operations, and create and publish reports and report models

5 Project Design:

Create a small datawarehouse with the following tables:

Employee table which contains the employee information with attributes the employee id, First name, Last name. Empid is the primary key of this table.

Department table which contains the name of the department and the department id. In this table Dept_id would be the primary key.

Salary table: This table will have the dno which would be the foreign key form the department table along with the empid. This table will have the details like the rate the employee charges for a day and the frequency of how many times in a month he/she is paid.

Steps for the creation of datawarehouse

Step1 : Data analysis

Step2: Data Extraction from an external source and tranforming the data into a data source(NDS/ODS). In our case, it is NDS as the data is already normalized. The granularity is at the lowest level as the amount of data is very less.

Step3: Creation of data and fact tables and populating the data.

Step4: Quality assurance and data check done using reporting.

Project execution

In our case, as we do not have the raw data, we would do the data extraction from our available database using the reporting tool and then upload the data from the flat file into the fact and dimension tables using the SSIS software of MS Sql server which is used to extract the data from the operational envionment and upload the same into the Data Source.

We will also be manipulating the columns like how the data are intergrated in a datawarehouse using the SSIS tool. Hence, all the operations of ETL, i.e. the extraction, tranforming and loading of data would be done using this SSIS tool.

6 Software Installation:

Once the tools and database are downloaded, they are to be installed on your PC and the installations are done using the MSDN help, so that the PC becomes both the server and the database. Once the installation is done, do connect database with the SQL server and test the connectivity. Once the connection is established all the reports and integration activities can be done on the same PC.

Testing data connectivity:

Open the SQL server management studio. To check the database engine connectivity, select database under server type give the server name. If all the credentials are correct, the database would get connected and the below screen would be displayed.

7 Creation of Fact and Dimensions Tables:

The fact and the dimensions tables form the basis of the data warehousing. In our scenario, we will have three dimensions tables, namely dim_emp, dim_dept, dim_salary.

We will have only one fact table which will have columns like empid,dno,rate,Asalary and Msalary where emp id is the key form the emp table, dno is the key from the department table and rate is a column from the salary table. The Asalary and Msalary would be calculated as the annual and Monthly salary for each and every employee.

8 Report Generation using the SSRS wizard of SQL server 2008

Using this tool we can create reports and do necessary analysis with them. This wizard is very efficient from an end user perspective as the user does not need to know the complete schema of the tables. He just needs to know the tables and what type of data it contains.

Follow the steps given below for generating a report.

Click on Start-> programs-> MS SQL server 2008->Server integration Business Intelligence Studio

The next screen which we need to select would be the project type. In the above case, we would select the “Reporting using Wizard” option. There is also another option “Reporting services project” that can be used if we didn’t have to use the wizard.

After we give all the necessary server and database details a connection would be established. Also do check if the details are right by clicking on the test connection button. The pop up “Test connection successful” confirms the establishing of the connection.

When we click on OK the Query builder page would open as below

Build the query by selection the table option on the left hand side.

Specify the columns which needs to be displayed in the report the query is automatically generated by query builder.. The Report type also needs to be selected . This decides if the report is to be in tabular or in matrix from. After all the necessary details have been given the report is generated as shown below.

To make the data from the database to be displayed in the table format of the report, from the toolbox click on “create table option” options and match each and every cell of the table with the individual columns of the report. If the mapping is not done correctly the report would not be displayed.

When we click on preview we get the below report view. Click on the Save button and export the data into an TXT file so that we can use the same as a external flat file to build a data warehouse Save the TXT file in our PC so that it can be later used to build the data warehouse

9 Data Extraction from a Flat file using the SSIS tool:

In our case, we would be extracting data from the report which we have got from the database and we would insert the same into our new database.

This is one of the ETL operations Extract transform and then Load activities and hence this is done by the Integration services of the SQL Server. This data extraction forms the base for constructing the data warehouse

Click on Start-> programs-> MS SQL Server 2008-> SQL Server Business Development intelligence studio.

Select the option “Integration services” using wizard from the project type option and then give the name of the Integration services project.

Initially the Integration services tool would ask us to specify the type of data we would select the flat file option as the external data is the flat file. Also select the “OLE DB/ SQL server” option as the destination is the table in the database (i.e. the data is loaded form flat file to table). Please specify the necessary database and connection details and to check the connectivity click on test connection. The screens shown below would appear.

Once the connection is tested we have to select which of the two would be the source and destination. In our case the source would be a flat file and the destination would be a OLE DB. After selecting all of these options click on OK. The data flow object comes on the screen, as shown below.

First, when the source and destination comes in the screen click on the green arrow and connect the source to that of the destination just to indicate that the connection will be established between both. Double click on the flat file source option and then select the flat file which needs to be loaded from the PC. Also modify the column values to their respective data types so that they don't suffer any data type discrepancy while inserting the data into the database.

Then double click on the OLE DB destination task and give the specific server and database details. Also, do test the connection by clicking on the test connection button.

Then we need to map the table column names to that of the column names of the flat file. This is done when we click on the mappings tab. When we select the appropriate match the mappings are done accordingly and then click on OK.

Once the page is reloaded, click on the small green triangle option which means load the package. If the package is without any error, then the data gets inserted into the respective specified tables.

The next operation would be check if the records are inserted into the database. Connect to the database server manager and execute a select query to the table. The query should return records, which proves that the data is loaded into the database.

In this manner, all the tables are loaded from a flat file and populated as NDS (Normalized data source) in the database. These is the normalized data source which would be used to create the fact and dimensions tables of the data warehouse

10 Populating the Data warehouse using the SSIS integration services tool:

Create a new SSIS package in the same data integration project

Toolbox->control flow items->data flow task

Double click on the data flow task to edit the same and then create a OLDB source from the toolbox. The give the connectivity details for the OLEDB source so that it points to the master database of our server. Then from the tool box select the Derived column task.

Then we double click on the derived column the following screen comes.

Specify the columns which need to be modified/transformed or inserted. In our case, we would be modifying two columns and we would be inserting one column.

Then connect the derived column with the slowly changing attribute task which is also taken from the toolbox. Then double click on the slowing changing attribute where we would specify the details of the attributes

Select the business key for the table.

Also add a new column Msalary in our table, so that it can be mapped with the new column which is been created.

Once we debug the package, we would get the screen below.

Check the database, we would see the changes being modified to the table.

In this manner the NDS or normalized data source is populated. In our case as it is a very small database, normalization techniques are not required. Using the SSIS tool we can also normalize the data, so that there is no redundancy in the data and the database is also accurate.

11 Data Normalization:

Once the data is populated the data needs to be normalized. There is a separate method or way of data normalization which is also done using the SSIS tool. In our case there is no data normalization which is required.

12 Assuring Data Quality:

The data quality process included the activities to make sure that the data in the data warehouse is correct and complete. This process includes the mechanism to report the data and check the quality of the report and do the checking and modification so that consistency is maintained throughout the data warehouse

The main rules of data quality are given below. These rules essentially filter to prevent the dirty data from getting into the data warehouse. Based on the data location, we have three kinds of data quality rules:

Incoming data validation: The validation of the incoming data is usually applied on the fly when the data is being loaded into the data warehouse. It is usually verified whether the data from the source system are valid. If there is any discrepancy with the data type, etc., the data will not even get inserted into the data base

Cross reference Validation: Cross verification is the checking/comparing which is done on the incoming data to the data which is already present in the data warehouse. The main aim of this is to see if the incoming data is within the range of the data which is already present in the data warehouse

Data warehouse Internal validation: This is a type of validation which is done on the data which is present inside the data warehouse. This is basically done to maintain the quality of data in the data warehouse

13 Books and references

• Wikipedia

• MSDN for MS SQL server 2008

• Building a data warehouse by Vincent Rainardi

• Building the data warehouse by W.H Inmon.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download