Mimicking the DATA Step Dash and Double Dash in PROC SQL

NESUG 2012

Coders' Corner

Mimicking the Data Step Dash and Double Dash in PROC SQL

Arlene Amodeo, Law School Admission Council, Newtown, PA

ABSTRACT

The SQL procedure is a powerful and versatile procedure in SAS that allows the programmer to utilize SQL syntax to perform data set merges (joins), update data sets (tables), and update observations (records). Its main use is to duplicate the functionality of the MERGE...BY statements in the DATA step, but most of its worth comes from its expansion upon these functionalities and its shorter processing time. It allows the user to utilize a wide range of summary functions without having to code PROCs and to merge on variables without the restrictions inherent in the BY statement.

However, there are two shortcuts in the DATA step that are not available in PROC SQL: the dash (-) for selecting a range of suffixed variables, and the double dash (--) for selecting a range of variables from a data set. This paper shows how to integrate PROC SQL, dictionary members, and character functions to mimic the functionality of the dash and double dash . This method can be utilized easily to gather several ranges of variable names and rearrange the ordering of columns in the output data set. Examples of how to use this code are given in the context of the LSAC National Longitudinal Bar Passage Study data. Macro programs are provided in the Appendices to enable the reader to implement these methods quickly without hard-coding.

This code was developed using SAS 9.2 for PC; execution of this code requires licenses for Base SAS and SAS/STAT. The intended audience of this paper has some awareness of SAS macro programming and PROC SQL syntax.

INTRODUCTION

The inconceivably large potential for data manipulation is a chief advantage of choosing SAS over similar software: If you can think it, you can probably do it in SAS--and usually with half the battle you'd face with other software. Two basic procedures for manipulating data sets are available in SAS: the DATA step (native to SAS), and the SQL procedure (based on the Structured Query Language used by several other commercial software, such as Oracle).

PROC SQL incorporates most of the functionality of the DATA step. It allows the programmer to utilize all SAS character functions and also perform the same types of merges (joins) as with DATA steps, using a more universal nomenclature. In many situations, PROC SQL has several advantages (and relatively few disadvantages) over the DATA step.

PROC SQL not only performs functions across rows (as the DATA step does), but it can also perform functions down observations, an endeavor that is typically code-intensive and requires caution when implemented in the DATA step. PROC SQL also makes merging data sets a less stressful task; it does not have the strict restrictions on the merging variables that the MERGE-BY method in the DATA step has. Data merges (joins) in PROC SQL can be performed on merging variables with different names (e.g. NAME and FULL_NAME), different cases (join records on LOWCASE(FIRST_NAME)), and even with several variables serving as one merging variable (CATX( , FIRST_NAME, LAST_NAME) matched to FULL_NAME)).

In addition to having powerful and versatile data manipulation capabilities, the SAS program contains several shortcuts to help eliminate tedious coding. Two of these shortcuts are the dash (-) and the double dash (--), each of which generates lists of variables. The (single) dash allows the programmer to specify that all variables that are similarly structured in name--with a prefix and a numeric suffix--should be processed (e.g., ANSWER1 through ANSWER30). The double dash allows the programmer to quickly specify that all variables in the program data vector that lie between two variables should be processed (e.g., NAME through PHONE_NUMBER, for specifying that NAME, ADDRESS, and PHONE_NUMBER should be included).1

Unfortunately, the dash and double dashes can only be used in DATA steps, data set options, and SAS statements; they cannot be used within the SELECT clause of PROC SQL. This is a rare example of a functionality that is lost when moving from the DATA step to PROC SQL. A consequence of this is that when a user utilizes PROC SQL to select variables, he or she must type each variable individually, even in cases where the dash or double dash would automatically generate the list in another context. However, as another example of SAS's versatility, a combination of procedures and data views can be integrated to duplicate the functionality of the dash and double dash for use in PROC SQL.

The first four sections of this paper introduce the basic nomenclature, concepts, and code that can be utilized to mimic the variable list shortcuts in PROC SQL. The last three sections describe code for each of the shortcuts (the dash for a range of suffixed variables, and the double dash for a range of variables between two specified names). The motivation is discussed for developing a method of mimicking both the single and double dashes; a short example of the code to accomplish this is provided; and macros are provided in the appendices, along with short descriptions in the text. The Bar Passage Study data-- collected and published by LSAC--is cited in short examples.

1 Typically, this means that the variable list will contain all variables that were created between NAME and PHONE_NUMBER chronologically when those variables were first created.

1

NESUG 2012

Coders' Corner

PROC SQL: BASIC SYNTAX

The SQL procedure is based on the Structured Query Language (SQL) used by products such as MySQL, Oracle, and Microsoft Access. The most basic purpose of the language is to query data from tables--sometimes imposing particular criteria--and to join one or more tables into one table (or temporary view). When SQL code is written to act upon a table (data set), it is said that the table is being queried.

PROC SQL can be used to create data sets and views through typical methods such as concatenation, subsetting, and merging. For several key components of the DATA step, there are analogous features in PROC SQL that have their own names unique to PROC SQL. Table 1 shows the equivalent names for DATA step components and associated data set attributes whose essences are also mimicked in PROC SQL.

Table 1: Equivalent SAS terms Between the DATA step and PROC SQL

DATA step term/method PROC SQL term/method

Data set

Table

Statement

Clause (CREATE TABLE/CREATE VIEW, SELECT, INTO, FROM, WHERE, GROUP BY,

HAVING, ORDER BY)

Observation

Row

Variable

Column

Merge

Join

RENAME oldvarname =

SELECT oldvarname AS newvarname

newvarname

...

newvarname is referred to as an alias

CALL SYMPUT(macro-

In general, the INTO clause is used within the SELECT statement. This can be utilized to

variable, value);

create one or more macro variables very efficiently. An example appears below.

This example creates one macro variable that holds each value encountered in column, where the elements within the variable are separated by a comma and a space:

Williams, SESUG 2008

SELECT DISTINCT columnname1 INTO : macrovariable SEPARATED BY , FROM datasetname; QUIT;

Figure 1 shows the syntax for PROC SQL. Only the SELECT and FROM clauses are required. In general, the SELECT clause indicates which variables should be placed into the table (or view) from the table(s) indicated in the FROM statement; in contrast to the DATA step, multiple variables are separated by commas, not spaces. The WHERE statement is used to select only certain rows and to execute joins between multiple data sets specified in the FROM statement: the variable that would be in the BY statement of a MERGE...BY would appear in the WHERE statement. Additional code can be used to specify what type of join(s) should be performed, using specific SQL syntax and the ON keyword rather than the WHERE clause. For more information on SQL joins, reference the paper PROC SQL: Tips and Translations for Data Step Users (Marcella and Jorgensen, 2010).

Just as each DATA step completes with a RUN statement, each CREATE...ORDER BY block ends with a semicolon, and each SQL block ends with a QUIT statement. The RUN Statement is not necessary. Several CREATE TABLE...ORDER BY blocks can be created within one SQL procedure, each ending with a semicolon. It is important to note that by default, the results of a query that does not use the CREATE clause will be printed to output; to prevent this, use the NOPRINT option in the PROC SQL statement.

Fig. 1 PROC SQL syntax

PROC SQL noprint/print; CREATE TABLE/ CREATE VIEW newtablename AS SELECT tablename.var1 AS alias1, tablename.var2 AS alias2 FROM tablename WHERE GROUP BY varname HAVING ORDER BY varname ;

QUIT;

2

NESUG 2012

Coders' Corner

Just as the DATA step can create macro variables with CALL SYMPUT, PROC SQL can also create macro variables. Within the SELECT clause, the user can specify that a column value (or a function of a column value) should be inserted into one or more macro variables; this is done using the INTO clause.

For example, if a programmer wanted to have one macro variable that contained the average value of age from the SASHELP.CLASS data set, he or she could code as in Figure 2. In this code, AVG is called an aggregate function; it works down rows in the table to find the average value of age, just as PROC MEANS would when the VAR statement is specified as VAR age; The average value for age is inserted into the macro variable average_age.

Fig. 2 Example PROC SQL code

PROC SQL; SELECT AVG(age) INTO: average_age FROM SASHELP.CLASS ;

QUIT;

Table 1 above briefly shows how one macro variable can be created to hold all of the values found in a column. In conjunction with automatic data sets called dictionary members, this code can be utilized to help mimic the dashes in PROC SQL. Dictionary members are described briefly in the next section, and the last three sections of this paper discuss how to put it all together.

In addition to several NESUG papers, the manual SAS SQL Procedure User's Guide -- published by the SAS Institute ? gives an excellent summary of the various ways in which PROC SQL can be utilized.

DICTIONARY MEMBERS AND THE DICTIONARY.COLUMNS TABLE

During each SAS session there exist automatic data views called dictionary members. These data sets contain information about the data sets that are in the current SAS session. For example, DICTIONARY.TABLES has one row to describe each data set in the current SAS session (one row for SASHELP.CLASS, one row for SASHELP.SALES, etc.). The data set DICTIONARY.COLUMNS contains one row to describe every variable in the data sets (one row for each variable in SASHELP.CLASS, one row for each variable in SASHELP.SALES, etc.). Table 2 displays the contents of this data set.

Table 2: Contents of DICTIONARY.COLUMNS

Column Name

Column Label

libname

Library Name

memname

Member Name

memtype

Member Type

name

Column Name

type

Column Type

length

Column Length

npos

Column Position

varnum

Column Number in Table

label

Column Label

format

Column Format

informat

Column Informat

idxusage

Column Index Type

sortedby

Order in Key Sequence

xtype

Extended Type

notnull

Not NULL?

precision

Precision

scale

Scale

transcode

Transcoded?

3

NESUG 2012

Coders' Corner

Marcella and Jorgensen, NESUG 2010

As seen in Table 2, DICTIONARY.COLUMNS contains the library name (libname) and data set name (memname) in which each column resides. It also contains the position of the variable in the data set in the context of a table with rows and columns; for example, varnum = 3 indicates that the variable is in the third column from the left.

In conjunction with PROC SQL, the DICTIONARY.COLUMNS table can be used to generate a single macro variable that contains a list of column/variable names from a certain data set/table, where the column/variable names have certain characteristics, or are located in a certain position in the SAS data set. This forms a foundation for mimicking the dash shortcuts in PROC SQL. It is worth noting that the columns in the dictionary members may have trailing blanks in their values, and this needs to be taken into consideration when generating macro variables whose values are intended to be used as SAS code. Each variable list shortcut is discussed in turn in the remaining sections of this paper.

SAS CHARACTER FUNCTIONS

SAS contains a wide variety of character functions to manipulate character variables, extract substrings from character variables, and count occurrences of substrings and characters within character variables. These three tasks are conducted by many more than just three functions, and many more tasks on character functions can be performed. The book SAS Functions by Example by Ronald Cody details many of these with excellent clarity and organization.

For the purposes of mimicking the dash and double dash, this paper will utilize the LENGTH, CAT, and SUBSTR functions, as described generally below.

1. LENGTH : counts the number of characters used in a character variable, ignoring trailing blanks. For each observation, the result varies according to the value of the character variable for that observation. This does not return the length property of the character variable as created in the SAS data set (that can be determined by querying the DICTIONARY.COLUMNS view, as seen in the previous section).

2. CAT(string1, string2, ... stringN) : concatenates string1, string2, ..., stringN with each substring taken verbatim (i.e. the strings are not trimmed for leading or trailing blanks). Each string can be a SAS variable (numerical or character) or a string expressed within quotation marks. CATT also concatenates strings with the same syntax as CAT, but trims the leading blanks from each element mentioned in the list; CATX will trim and concatenate each entry and place a specified delimiter between each entry.

3. SUBSTR(varname, start_position, length) : creates a substring from string, that is extracted from starting at the position start_position of string, and having length length. String can be a SAS character variable or a string within quotation marks, while start_position and length must be positive integer numbers (e.g. a SAS numeric variable containing such a value, a SAS function that resolves to resolves to such a value, or simply a constant positive integer number). For example, SUBSTR(LSAC, 3, 2) would resolve to AC, and SUBSTR(LSAC, 1, 1 + 2) would resolve to LSA.

METHOD FOR MIMICKING THE VARIABLE SHORTCUTS IN PROC SQL

The code below describes how PROC SQL, dictionary members, and character functions can be integrated to mimic the variable shortcuts in the SELECT clause. This basic structure is used throughout this paper.

PROC SQL; SELECT INTO: macrovar FROM DICTIONARY.COLUMNS WHERE libname = "libraryname" AND memname = "datasetname" AND memtype = "data";

*We will call this our "target" select clause in which we want to use the dash/double dash shortcuts;

SELECT ¯ovar. FROM libraryname.datasetname; QUIT;

DATA FOR EXAMPLES: LSAC NATIONAL LONGITUDINAL BAR PASSAGE STUDY

From 1989 through 1996, the Law School Admission Council (LSAC) conducted a longitudinal study of law school matriculates that collected data regarding a variety of subjects relevant to aspiring lawyers, such as their Law School Selection and Expectations and their Aspirations in their law career. Four surveys were administered throughout the legal education careers of these matriculates: an Entering Student Questionnaire (ESQ), a First-Year Follow-up Questionnaire (FSQ), a

4

NESUG 2012

Coders' Corner

Second-Year Follow-up Questionnaire (SFQ), and a Third-Year Follow-up Questionnaire (TFQ). Being a longitudinal study, many of the items on the surveys were identical or nearly identical in order to allow analyses of changes in the goals, educational statuses, and employment statuses; and the progress in passing the State Bar Exams of aspiring lawyers. The data collected during this study was released in 1999 and is available upon request from LSAC.

Variables (columns) from the Bar Passage Study are used in the examples in this paper. Figure 3 shows the relevant excerpts of the ESQ that are referenced (repeated on the subsequent three surveys as well) and the demographic variables that are also used in examples. Figure 4 shows some of the contents of DICTIONARY.COLUMNS for a subset of the Bar Passage Study data, inputted from a flat data file by means of a DATA step. Readers can explore further examples by using the SASHELP.PRICEDATA data set.

Fig. 3 Select Demographic Variables and Responses 67a Through 67k to Question 67 of the Entering Student Questionnaire of the Longitudinal Bar Passage Study (also duplicated in the other three surveys of this study)

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download