A PROC SQL+Data Step Alternative to the Hash Object for ...

NESUG 2009

Coders' Corner

A PROC SQL+Data Step Alternative to the Hash Object for Creating Multiple, Data-Driven Data Sets

Mike Zdeb, University@Albany School of Public Health, Rensselaer, NY

ABSTRACT Creating multiple data sets from a single data set based on the value of one or variables is a common task. If the number of data sets to be created is small and all the values of the variable(s) to be used are know, one could write a data step that specified the names of the new data sets and use OUTPUT statements to direct observations to the appropriate data sets. However, if the values of the variable(s) to be used to create the data sets are unknown, other approaches can be used. One that has been suggested by others is use of the HASH OBJECT. Another method that uses PROC SQL to create a macro variable prior to a data step is suggested in this paper and compared to the HASH OBJECT method. The SQL method presents an alternative to the HASH OBJECT method when sorting the original data set, as required by the hash technique, may be impractical or costly.

VALUES KNOWN

If the number of data sets to be created is small and the values to be used to place observations in the

data sets are known, there is no need to resort to anything more complicated than a short data step.

Examples 1 and 2 use the SASHELP.CLASS data set to show such a situation. Both are used to create

two new data sets based on the value of

the variable SEX.

Example 1 offers more flexibility in that it allows a user to specify what to do in a situation when a value of a variable is encountered that is not supposed to be in the original data set. If that occurs, a note is written to the LOG. Also, since the data sets are gender-specific, there is

* example 1; data males females; set sashelp.class; select (sex);

when ('M') output males; when ('F') output females; otherwise put 'BAD OBSERVATION: ' _n_ sex; end; drop sex; run;

no need for the variable SEX in either data set, so it is dropped. What example

2 loses is flexibility it mak?es up for with

'elegance' in that the SAS code is shorter and the data statement makes it clear as to what the contents of each

* example 2; data males (where=(sex eq 'M')) females (where=(sex eq 'F')) ; set sashelp.class; run;

data set will be.

VALUES UNKNOWN When the values of the variable to be used to place observations in the data sets are unknown, there are several approaches that can be taken. One could run PROC FREQ and create a table of the values of the variable, then use either of the approaches taken in examples 1 and 2. If the number of unique values is small, that might be an acceptable approach.

Assume that you want to create a data set for each different value of the variable AGE in the data set SASHELP.CLASS. Using PROC FREQ, you see the ages shown on the right. Though you now know all the variable values as you did when creating the gender-specific data sets, you think it would be tedious to write either of the data steps shown in examples 1 and 2.

-1-

NESUG 2009

Coders' Corner

* Hash Object Hamilton1 reviewed a number of methods for creating data-driven data sets, i.e. data sets based on the values of a variable within a single data set. One of them was the hash object, shown in example3 (with code is similar to that described by Hamilton). In this example, a new data set is created for each distinct value of the variable AGE in the data set SASHELP.CLASS. Rather than hard-code the data set names (AGE11, AGE12, etc.) as was done with the new data set names in examples 1 and 2, assume that you do not know the values if the variable AGE and want SAS to both create the data set names and write the SAS code that directs the appropriate observations to each new data set.

* example 3; proc sort data=sashelp.class out=class; by age; run;

proc sql; select quote(strip(name)) into :vars separated by ',' from dictionary.columns where libname eq 'WORK' and memname eq 'CLASS'; quit;

data _null_; * define hash object; dcl hash a (); a.definekey('key'); a.definedata(&vars); a.definedone();

* add observations to hash object; do until (last.age);

set class; by age; key+1; a.add(); end;

The real work of creating the new data sets is done in the last data step in example 3, data _null_. The use of the

* write contents of hash object to data set delete hash object from memory ;

hash object in that data step requires a data set sorted in order by the variable

a.output(dataset:catt("age",age)); a.delete(); run;

being used to create the new data sets, in

this case AGE. PROC SORT creates a new

data set in the WORK library named CLASS, sorted in ascending order of age. The use of the hash

object in the data _null_ step also requires specification of the names of the variables to be placed in

the new data sets. Instead of hard-coding the variable names in that step, PROC SQL is used to place

the names of all the variable names in data set CLASS into a macro variable. Notice that they are

comma-separated and quoted as required when creating the hash object in the last data step.

If you are not familiar with data step hash objects, the data _null_ data step in example 3 will look a bit mysterious to you. Reading the paper by Secosky2 and Bloom will remove much of that mystery (the reference section of this paper lists a few other papers for learning more about hash objects). The previously cited paper by Hamilton contains a detailed explanation of this method and the reader is referred to that paper for more information. Basically, there are three parts to the data step: define a hash object (notice the use of the macro variables &VARS previously created in the PROC SQL step); read the observations in data set CLASS grouped by values of the variable AGE; each time the last observation in each age group is encountered, write all the observations to a data set then delete the age-specific observations from memory. Notice that the new data sets will have names that comprise the text AGE followed by the value of the variable age used to put observations in the data set (AGE11, AGE12, ..., AGE16 - see the table on the bottom-right of page 1).

* PROC SQL One disadvantage of the hash object method shown in example 3 is that it requires a sorted data set. If you do not know much (or for that matter, anything) about hash objects, another disadvantage may be complexity. An alternative to the hash object is using PROC SQL to write both the data set names and the rule used to create each new data set.

-2-

NESUG 2009

Coders' Corner

PROC SQL can be used to create a portion of the SAS code similar to that used in example 2. The macro variable &DATASETS is created in example 4 and

* example 4; proc sql noprint; select distinct cat(catt("age",age)," (where=(age eq ",age,"))") into :datasets separated by ' '

its value is shown on the right, below example 4. The CAT and CATT functions

from sashelp.class; quit;

create text containing both a data set

data &datasets;

name and a WHERE data set option. The DISTINCT option used with the SELECT

set sashelp.class; run;

statement ensures that only one new

entry will be made to the macro variable for each value of the variable AGE in the SASHELP.CLASS data set. When &DATASETS

contents of macro variable &DATASETS ...

is used in the data step of example 4, that data step has the same form as that used in example 2. However, SAS has written both the data set names and the rule for each data set. Also, since SAS wrote all that SAS code, no prior knowledge of the values of

age11 (where=(age eq 11)) age12 (where=(age eq 12)) age13 (where=(age eq 13)) age14 (where=(age eq 14))

AGE in the data set SASHELP.CLASS is required (as was true in the hash object method).

age15 (where=(age eq 15)) age16 (where=(age eq 16))

Hamilton1 also showed how to use PROC SQL and a macro variable to create data-driven data sets, but used a SAS-written SELECT statement within a data step to write the appropriate observations to each of the new data sets. Both the SAS code shown in this paper and that in Hamilton's show that the PROC SQL method is much less complex than the hash object. Also, no sort is required though the SQL method does require a pass through the data set to find the distinct values of the variable used to create the data sets.

HASH VERSUS PROC SQL: "REAL LIFE" EXAMPLE With a small data set, it is difficult to compare the performance (elapsed and CPU times) of the hash object and PROC SQL methods. Therefore, both methods were applied to the following problem: given a data set with approximately 2.5 million observations and 40 variables, create one new data set for each value of a variable within that data set. The name of the data set is TEMP and the contents of the data set are shown on the right. The new data sets are to be based on the value of the variable HOSP and there are 247 unique hospitals in the data set.

The hash object method requires that the data set be sorted in order of the variable HOSP. The total elapsed and CPU times for that method will be the sort plus the data step that creates the data sets. Note, there is also one PROC SQL step that puts the variable names from data set TEMP in a macro variable, but the time for that step is negligible.

The PROC SQL method first requires that the data set names and WHERE data set options be written to a macro variable. In example 4, the DISTINCT option was used in PROC SQL to ensure that only one unique value for each variable value was used to

CONTENTS OF DATA SET TEMP

Alphabetic List of Variables and Attributes

# Variable Type Len

9 adx

Char

5

3 age

Num

8

4 county

Char

2

8 ethnic

Char

1

6 gender

Char

1

1 hosp

Char

4

11 odx1

Char

5

12 odx2

Char

5

13 odx3

Char

5

14 odx4

Char

5

15 odx5

Char

5

16 odx6

Char

5

17 odx7

Char

5

18 odx8

Char

5

19 odx9

Char

5

20 odx10

Char

5

21 odx11

Char

5

22 odx12

Char

5

23 odx13

Char

5

24 odx14

Char

5

26 opr1

Char

4

27 opr2

Char

4

28 opr3

Char

4

29 opr4

Char

4

30 opr5

Char

4

31 opr6

Char

4

32 opr7

Char

4

33 opr8

Char

4

34 opr9

Char

4

35 opr10

Char

4

36 opr11

Char

4

37 opr12

Char

4

38 opr13

Char

4

39 opr14

Char

4

40 payor

Char

2

10 pdx

Char

5

25 ppr

Char

4

7 race

Char

1

2 year

Num

8

5 zip

Char

5

-3-

NESUG 2009

Coders' Corner

create the macro variable &DATASETS. With a large data set, making even one pass through a that data set to find the unique values of variable can take a long time. An alternative method is to first use PROC FREQ to find those unique values, followed by PROC SQL to create the required macro variable. In this "real life" example, PROC FREQ processes the 2.5 million observations with PROC SQL then reading only the 247 observations output from PROC FREQ.

* "real life" PROC SQL; proc freq data=temp; table hosp / noprint out=hospnums (keep=hosp); run;

proc sql; select cat(catt("hospital_",hosp),

" (where=(hosp eq ",quote(hosp),"))") into :hospitals separated by ' ' from hospnums; quit;

data &hospitals; set temp; run;

The code used for both methods is shown on the right. The combination of PROC FREQ and PROC SQL produces the macro variable &HOSPITALS. The subsequent data step produces 247 hospital-specific data sets. In the hash object method, the NOEQUALS option is added to PROC SORT. Since there is no need to retain the relative order of the original data set within each hospital, the NOEQUALS option can be used to speed up the sort. PROC SQL writes the names of the 40 variables to a macro variable and the subsequent data step produces 247 hospital- specific data sets.

* "real life" hash object; proc sort data=temp noequals; by hosp; run;

proc sql noprint; select quote(strip(name)) into :vars separated by ',' from dictionary.columns where libname eq 'WORK' and memname eq 'TEMP'; quit;

data _null_; dcl hash a (); a.definekey('key'); a.definedata(&vars); a.definedone();

do until (last.hosp); set temp; by hosp; key+1; a.add();

end;

The comparison of times for the two methods is shown in the table at the bottom of the page. The times are based on using

a.output(dataset:catt("hospital_",hosp)); a.delete(); run;

both methods with SAS version 9.2 on a PC

running windows XP. The hash object method takes only approximately 10% of the CPU time used by

the PROC SQL method. The elapsed times are closer, but the hash object method still tales only

approximately 60% of the elapsed time of the PROC SQL method.

METHOD PROC SQL

HASH OBJECT

STEP PROC FREQ PROC SQL DATA _NULL_ TOTAL PROC SORT PROC SQL DATA _NULL_ TOTAL

ELAPSED 1.45 0

1:58.20 1:59.65 32.35

0 42.51 1:14.86

CPU 1.40

0 1:57.95 1:59.35

6.78 0

7.50 14.28

-4-

NESUG 2009

Coders' Corner

When the data are sorted without the NOEQUALS option in the hash object method, the elapsed time increases to almost 1 minute (56.34 seconds) with a slight increase in CPU time to 7.87 seconds. This clearly shows the worth of the NOEQUALS option when sorting large data sets.

When PROC FREQ is not used in the PROC SQL method and PROC SQL is forced to process the original 2.5 million records (versus the 247 produced using PROC FREQ), the PROC SQL step uses 39.17 seconds of elapsed time and 8.78 seconds of CPU time. Notice in the table at the bottom of page 3 that the combination of PROC FREQ and PROC SQL took less than 2 seconds. This clearly shows the worth of using PROC FREQ prior to PROC SQL for finding the distinct values of the variable used to create the new data sets.

The two methods were further compared using data set TEMP and varying numbers of observations. The results are shown below.

NUMBER OF OBSERVATIONS

100,000 500,000 1,000,000 1,500,000

PROC SQL

ELAPSED

CPU

0.39

0.33

10.13

5.93

28.79

27.44

52.32

51.67

HASH OBJECT

ELAPSED

CPU

2.75

0.42

19.55

2.53

34.49

4.96

43.43

6.76

Up to 1 million observations, the PROC SQL method takes less elapsed time. This is due mainly to the time taken by the PROC SORT that is required as part of the hash object method. The CPU time required by the hash object method is much less than PROC SQL as the size of the data set increases. However, since elapsed time measures how long a user actually must wait until task completion, the CPU time advantage is of interest, but not a compelling reason to use the hash object method.

CONCLUSION Both the hash object and PROC SQL methods can be used to produce multiple, data-driven data sets. The hash object is more complex than the PROC SQL method, but takes less elapsed and CPU time with a large (2.5 million observation) data set. When using a data set with up to 1 million observations, the PROC SQL method takes less elapsed time than the hash object. Given the simplicity of the PROC SQL method, it provides a good alternative for many situations of creating data -driven data sets.

REFERENCES 1/ Hamilton J: Creating Data-Driven Data Set Names in a Single Pass Using Hash Objects, Proceedings of the 2007 SESUG Conference ( )

2/ Secosky J, Bloom J: Getting Started with the DATA Step Hash Object, Proceedings of the 2007 SAS Global Forum Conference ( )

-5-

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download