CS590W Project 1 – An MS Access Database Design



[pic]

Project 2: Mining Wall-Mart Sales Data using

the NCR Teradata Machine

Part 1 – Data Preprocessing

[pic]

1. Purpose

This part of project 2 aims at applying some of the data preprocessing principles and techniques learnt in class (data cleaning, data integration and transformation, data reduction) on the Wal-Mart data available on the CS NCR Teradata machine. Section 2 describes the configuration steps to follow in order to access the Teradata machine.

2. Accounts and Configuration

Your accounts have been created on the NCR Teradata database:

Login = CS login if one exists, PH alias if no CS login

Password = First letter of your login + 9 digits of student ID

This account allows you to access the SQL based database on the machine teradata.cs.purdue.edu. You will need to use client software provided by the CS department to be able to access this account. At this time, a Unix client much like sqlplus for oracle is available. To use the Unix client software you will need to connect to a SPARC Solaris server. The one you will be using is cybil.cs.purdue.edu. Accounts on cybil have also been set for all CS 590W students.

Once you have connected to a Unix server you need to set up your environment. Add:

/p/tuf-5.3/bin

to your path variable.

Read the file called "README_CS" in the directory /p/tuf-5.3. It lists the environment settings required to be able to connect to the Teradata. Also on Unix there are libraries provided for accessing the Teradata from within code you write using ODBC compliant mechanisms. In the /p/tuf-5.3 directory are sample source files, include files, and the library files necessary to write your own code.

The command you will be using to connect on Unix is 'bteq' and is located in the directory you were instructed to add to your path earlier. You can also run this command directly by /p/tuf-5.3/bin/bteq. 'bteq' is the primary command line interface to Teradata sql server. It also can act as your batch job handler.

All commands directed at bteq need to begin with a '.' and all sql statements should begin the line with just the sql statement. It is important to properly exit bteq as it will save you time reconnecting rather than having your last connection 'recovered'. To exit bteq type ".exit"

To connect to the Teradata, type:

bteq

There will be no prompt, after the line:

" Teradata BTEQ 05.03.00.00 for UNIX5. Enter your logon or BTEQ command:"

type:

.logon YourLogin

You will be prompted for a password, use the one provided for you.

You should now be connected and see a message similar to:

" BTEQ -- Enter your DBC/SQL request or BTEQ command:"

At this point you are connect and can begin issuing sql commands or further bteq commands. Please, first change your Teradata password by typing the sql command:

modify user YourLogin as password=NewPassword;

3. Running Unix scripts

A more convenient way of accessing the Teradata machine is using Unix scripts. You can run a Unix script directly from your cybil account. Here is a simple example:

------------------------------------------------- test.sql ------------------------------------------------

.logon your_login,your_password;

select count(*) from Univ_Class_Tables.Calendar_Day;

.quit

Then, from your Unix command line, type:

bteq < test.sql

Or, from within bteq, type:

.run file=test.sql

In addition to this Unix software there is a jdbc driver provided for NT/98 and a client of sorts available. Please notify software@cs.purdue.edu if you wish to have this client installed on your CS Windows machine. For those of you outside the CS department, we should be able to provide you with the PC software to install yourself or have installed by your support staff.

3. Description of the Wal-Mart Data

You will be using part of Wal-Mart sales data for the years 1999 and 2000. The schema of the tables is given below (the row counts is indicated):

(21,915 rows)

Univ_Class_Tables.Calendar_Day

Gregorian_Date date not null

, Julian_Date integer not null

, Day_Of_Wk byteint not null

, Fiscal_Year byteint not null

, WM_Week byteint not null

, LY_Comp_Week byteint not null

, WM_Year byteint not null default 0

, LY_Comp_Year byteint not null default 0

, WM_Yr_Wk smallint not null

, Sun_Mult byteint not null default 0

, Mon_Mult byteint not null default 0

, Tue_Mult byteint not null default 0

, Wed_Mult byteint not null default 0

, Thu_Mult byteint not null default 0

, Fri_Mult byteint not null default 0

, Sat_Mult byteint not null default 0

Unique Primary Index (Gregorian_Date);

(99 rows)

Table Univ_Class_Tables.Dept_Desc

ORDER_DEPT_NBR SMALLINT

, Dept_Desc CHAR(30)

UNIQUE PRIMARY INDEX(ORDER_DEPT_NBR);

(99 rows)

TABLE Univ_Class_Tables.GMM_DMM_XREF

Category_Nbr SMALLINT NOT NULL

, CAT_MDS_FAM_ID INTEGER NOT NULL

, GMM_NAME CHAR(30) NOT NULL

, DMM_NAME CHAR(30) NOT NULL

UNIQUE PRIMARY INDEX (CATEGORY_NBR);

(432,223 rows)

TABLE Univ_Class_Tables.Item_Description

Item_Nbr INTEGER NOT NULL

, Category_Nbr SMALLINT

, Sub_Category_Nbr SMALLINT

, Primary_Desc CHAR(20)

, Secondary_Desc CHAR(20)

, Color_Desc CHAR(6)

, Size_Desc CHAR(6)

, Status_Code CHAR(1)

, Type_Code CHAR(2)

, Fineline CHAR(4)

, UPC CHAR(13)

, UPC_Desc CHAR(12)

, PLU_Nbr SMALLINT

, Create_Date DATE Format 'YYYYMMDD'

, Effective_Date DATE Format 'YYYYMMDD'

, Obsolete_Date DATE Format 'YYYYMMDD'

, Vendor_Nbr CHAR(6) NOT NULL

, Vendor_Nbr_Seq CHAR(1)

, Vendor_Name CHAR(30)

, Vendor_Stock_Nbr CHAR(15)

, VNPK_Qty INTEGER

, VNPK_Cubic_Ft DECIMAL(3,1)

, WHPK_Qty INTEGER

, WHPK_Cubic_Ft DECIMAL(3,1)

, Return_Sheet_Nbr INTEGER

, UOM_Code CHAR(2)

, Fact_Flag CHAR(1)

, DSD_Flag CHAR(1)

, Scale_Flag CHAR(1)

, RX_Flag CHAR(1)

, Catalog_Code CHAR(1)

, Ordbk_Stars_Qty SMALLINT

, Multi_Hdl_Code CHAR(1)

, Seasonal_Code CHAR(1)

, Promo_Ordbk_Flag CHAR(1)

, Xref_From_Nbr INTEGER

, Xref_To_Nbr INTEGER

, Trait_Flag CHAR(1)

, Whse_Align CHAR(2)

, Last_Change_Date DATE Format 'YYYYMMDD'

, Last_Change_Time INTEGER

, Expiration_Date DATE Format 'YYYYMMDD'

, Status_Chg_Date DATE Format 'YYYYMMDD'

, Mfg_Nbr CHAR(6) NOT NULL DEFAULT ' '

, International_Code CHAR(1) NOT NULL

, Product_Code INTEGER

, Activity_CD CHAR(1) NOT NULL DEFAULT 'y'

, Activity_TY_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_LY_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_TY_Q1_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_TY_Q2_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_TY_Q3_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_TY_Q4_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_LY_Q1_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_LY_Q2_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_LY_Q3_IND CHAR(1) NOT NULL DEFAULT 'N'

, Activity_LY_Q4_IND CHAR(1) NOT NULL DEFAULT 'N'

, Catg_User_Key DECIMAL(13,0)

, Shelf_User_Key DECIMAL(13,0)

, Division_Nbr SMALLINT

, Dept_Desc_ID INTEGER

, Catg_Desc_ID INTEGER

, Fineline_Desc_ID INTEGER

, Product_Desc_Id INTEGER

, Shelf_Desc_ID INTEGER

, ITEM_STORE_TYPE CHAR(1)

, Brand_Name CHAR(25)

, majority_status_cd char(1)

, WHPK_WGT_QTY DECIMAL(5,1)

, VNPK_WGT_QTY DECIMAL(5,1)

, FRSH_ITM_SHELF_LIFE DECIMAL(3,0)

, VENDOR_NBR_DEPT CHAR(2)

UNIQUE PRIMARY INDEX (Item_Nbr);

(842,556,378 rows)

Table Univ_Class_Tables.Item_Scan

Visit_Nbr Integer Not Null

, Store_Nbr Smallint Not Null

, Item_Nbr Integer Not Null

, Item_Quantity Decimal(9,2) Not Null

, Total_Scan_Amount Decimal(9,2) Not Null

, transaction_Date Date Format 'YYYYMMDD' Not Null

, Unit_Cost_Amount Decimal(9,4) Not Null

, Unit_Retail_Amount Decimal(9,2) Not Null

, Tax_Collect_Code Char(1) Not Null Compress '1'

Primary Index (Visit_Nbr);

(5,668,375 rows)

Table Univ_Class_Tables.Member_Index

MEMBERSHIP_NBR INTEGER NOT NULL

, ZIP_CODE CHAR(5) NOT NULL

, SIC CHAR(6) NOT NULL

, MEMBER_TYPE CHAR(1) NOT NULL

, MEMBER_STATUS_CD CHAR(1) NOT NULL

, ISSUING_CLUB_NBR SMALLINT NOT NULL

, RENEWAL_DATE DATE NOT NULL FORMAT 'YYYYMMDD'

, JOIN_DATE DATE NOT NULL FORMAT 'YYYYMMDD' DEFAULT 19500101

, ELITE_STAT_CODE BYTEINT NOT NULL DEFAULT 0

, BUS_CR_TYP_STAT_CD BYTEINT NOT NULL DEFAULT 0

, SECONDARY_CARD_CNT BYTEINT NOT NULL DEFAULT 0

, QUALIFY_ORG_CODE CHAR(4) NOT NULL DEFAULT ' '

, CMPLMNTRY_CARD_CNT BYTEINT NOT NULL DEFAULT 0

, LAST_RENEWAL_DATE DATE NOT NULL FORMAT 'YYYYMMDD' DEFAULT 19500101

Unique Primary Index (Membership_Nbr);

(4,606,516 rows)

Table Univ_Class_Tables.Secondary_Tender

Visit_Nbr Integer Not Null

, Tender_Type Byteint Not Null

, Tender_Amt Decimal(7,2)

Unique Primary Index (Visit_Nbr, Tender_Type);

(112 rows)

Table Univ_Class_Tables.Standard_Industry_Code

SIC char(4) Not Null

, Description Char(80) Not Null

Unique Primary Index (SIC);

(8,218 rows)

Table Univ_Class_Tables.Store_Area_Zip

Store_Nbr Smallint not null

, Zip_Code Char(5) not null

Unique Primary Index ( Store_Nbr,Zip_Code );

(150 rows)

TABLE Univ_Class_Tables.store_Information

Store_Nbr Smallint Not Null,

, Store_Name CHAR(30),

, Align_Sub_Division_Nbr CHAR(1),

, Region_Nbr SMALLINT,

, District_Nbr SMALLINT,

, Open_Date DATE Format 'YYYYMMDD',

, Store_Type CHAR(1),

, Street_Addr CHAR(30),

, City CHAR(23),

, State CHAR(2),

, ZIP_Code INTEGER,

, Phone_Nbr CHAR(12),

, Manager_Name CHAR(29),

, Open_Sunday_Flag CHAR(1),

, Geographic_Zone BYTEINT,

, Apparel_Zone SMALLINT,

, SizeClass SMALLINT,

, SalesClass SMALLINT,

, Store_Code CHAR(1),

, Delivery_Type CHAR(3),

, Mdse_Major_Zone SMALLINT Not Null,

, Mdse_Sub_Zone SMALLINT Not Null

UNIQUE PRIMARY INDEX(Store_Nbr );

(136,410,119 rows)

Table Univ_Class_Tables.Store_Visits

Visit_Nbr Integer Not Null

, Store_Nbr Smallint Not Null

, Register_Nbr Byteint Not Null

, Card_Holder_Nbr Byteint Not Null

, Membership_Nbr Integer Not Null

, Member_Code Char(1) Not Null

, Tender_Type Byteint Not Null

, Tender_Amt Decimal(9,2) Not Null

, Sales_Tax_Amt Decimal(9,2) Not Null

, Total_Visit_Amt Decimal(9,2) Not Null

, Transaction_Date Date Format 'YYYYMMDD' Not Null

, Transaction_Time Integer Not Null

, Refund_Code Char(1) Compress '0'

, Upcharge_Code Char(1) Not Null Compress '0'

, Tot_Unit_Cost Decimal(9,2) NOT NULL Default 0.00

, Tot_Unique_Itm_Cnt Smallint NOT NULL Default 0

, Tot_Scan_Cnt Integer NOT NULL Default 0

, Operator_nbr Smallint

Unique Primary Index (Visit_Nbr);

(9,894 rows)

TABLE Univ_Class_Tables.SUB_CATEGORY_DESC

CATEGORY_NBR SMALLINT NOT NULL

, SUB_CATEGORY_NBR SMALLINT NOT NULL

, SUB_CATEGORY_DESC VARCHAR(30) NOT NULL

, BUYER_ID VARCHAR(30) NOT NULL

PRIMARY INDEX (CATEGORY_NBR, SUB_CATEGORY_NBR);

(835,760,698 rows)

Table Univ_Class_Tables.Visit_Scan_lookup

Store_nbr SMALLINT NOT NULL,

, Transaction_date DATE Format 'YYYYMMDD' NOT NULL,

, Item_Nbr INTEGER NOT NULL,

, Visit_nbr INTEGER NOT NULL

Primary Index (Store_Nbr, Transaction_Date, Item_Nbr);

4. Project Description

4.1 Aggregation

Create the tables that give the following aggregated information:

1. The total sales per year and month.

2. Use the result of 1 to compute the total sales per year and by quarter.

3. The total number of store visits per month

4. The total sales per store and per month

5. The total sales for each product category

6. The average number of visits per month for a given customer

7. The average total visit amount for a (same) customer

8. The average number of items that a customer buys per visit

9. The best 10 Wal-Mart members.

4.2 Normalization methods

Consider the Univ_Class_Tables.Store_Visits:

Apply and compare the three normalization methods:

• min-max normalization,

• z-score normalization

• normalization by decimal scaling

on the attribute Total_Visit_Amt.

4.3 Histograms

a) Using Univ_Class_Tables.Store_Visits, Construct an Equiwidth Histogram giving the total sales per each hour of the day (for both years 1999 and 2000)

How different is this histogram from the ones that could have been obtained using the three normalized tables obtained from question 4.2 ?

b) Repeat the same question using a MaxDiff histogram method.

4.4 Clustering

Consider the table Univ_Class_Tables.Item_Scan. Using the centroid distance as a measure of cluster quality, partition the sold items into clusters. Give the mean value of each cluster. Compute the percentage of outliers.

4.5 Sampling

Apply the SRSWOR, SRSWR and stratified sample techniques on the table Univ_Class_Tables.Store_Visits. Then, using the obtained tables, answer the same questions given in Q1. Compare and comment these results to those previously obtained.

5. NCR Teradata Documents

For those of you who did not attend the 2-day Teradata Course, a copy of NCR Student Guide is available at the Math Library. Two volumes are available:

1. Teradata Basics

2. Teradata Structured Query Language.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download