Automatically Converting Character Variables That Store ...

NESUG 2010

Foundations and Fundamentals

Automatically Converting Character Variables

That Store Numbers to Numeric Variables

Christopher J. Bost, MDRC, New York, NY

ABSTRACT

A character variable can store numbers, but many analyses require numeric variables. This paper shows how to

convert character variables that store numbers to numeric variables. The DATA step, PROC SQL, and the SAS?

macro language are used to automate the process. The result is a new data set that stores all variables with the

same names, labels, and order, but with character variables that stored numbers converted to numeric variables.

INTRODUCTION

Numbers can be stored in numeric variables or character variables. For example, a dummy variable that takes

the values 1 and 0 could be numeric or character. SAS? will even perform an automatic character-to-numeric

conversion as needed in specific situations (e.g., when a character variable is compared to a numeric variable).

System resources, however, are required to convert variables. It is not efficient to have SAS perform automatic

conversions. Moreover, if SAS tries to convert a character value that contains a non-numeric value to a number

it produces an error. In addition, the majority of statistical procedures require analysis variables to be numeric.

Character variables that store numbers and that are treated like numeric variables can be explicitly converted

to numeric variables.

SAMPLE DATA

SAS data set EARNINGS is used in this paper. It contains six variables: ID, EARNQ1 through EARNQ4 (four

quarterly earnings variables, in dollars), and EMPLOYED (a dummy variable for any earnings in quarters 1-4):

CONTENTS of data set EARNINGS

#

Variable

Type

Len

1

2

3

4

5

6

id

earnq1

earnq2

earnq3

earnq4

employed

Char

Num

Num

Char

Char

Char

2

8

8

4

5

3

Label

Unique identifier

Earnings in Quarter

Earnings in Quarter

Earnings in Quarter

Earnings in Quarter

Ever employed

1

2

3

4

Note that ID, EARNQ3, EARNQ4, and EMPLOYED are character variables.

PRINT of data set EARNINGS

Obs

id

earnq1

earnq2

earnq3

earnq4

employed

1

2

3

4

5

01

02

03

04

05

1000

0

0

1400

0

1300

2000

0

1400

.

1400

1250

2300

0

900.0

0

1

1

UNK

1

UNK

0

1500

0

ID values contain leading zeroes that should be preserved. EMPLOYED contains the character value ¡®UNK¡¯ (for

¡®Unknown¡¯). EARNQ3 and EARNQ4, however, store only numbers and could be converted to numeric variables.

1

NESUG 2010

Foundations and Fundamentals

CONVERTING VARIABLES

The following steps are required to convert character variables that store numbers to numeric variables:

1. Identifying character variables to convert

2. Creating equivalent numeric variables

3. Dropping the original character variables

4. Renaming the equivalent numeric variables

5. Reordering variables to the original sequence

6. Labeling the equivalent numeric variables

The required statements are hard-coded in the next section to illustrate this process. The required statements

are then generated by SAS in the subsequent section to automate this process.

CONVERTING VARIABLES MANUALLY

STEP 1: IDENTIFYING CHARACTER VARIABLES TO CONVERT

Use PROC FREQ to produce frequency tables for all character variables:

proc freq data=earnings;

tables _character_;

run;

The TABLES statement uses the special SAS name list _CHARACTER_ to reference all character variables.

Inspect the results for non-numeric values. If a variable contains only numeric values, it could be converted

to a numeric variable:

The FREQ Procedure

Unique identifier

Cumulative

Cumulative

id

Frequency

Percent

Frequency

Percent

------------------------------------------------------01

1

20.00

1

20.00

02

1

20.00

2

40.00

03

1

20.00

3

60.00

04

1

20.00

4

80.00

05

1

20.00

5

100.00

Earnings in Quarter 3

Cumulative

Cumulative

earnq3

Frequency

Percent

Frequency

Percent

----------------------------------------------------------0

2

50.00

2

50.00

1400

1

25.00

3

75.00

1500

1

25.00

4

100.00

Frequency Missing = 1

(continued)

ID contains no non-numeric values. It is a unique identifier, however, and its leading zeroes should be preserved.

EARNQ3 contains only numeric values (plus a missing value). It could be converted to a numeric variable.

2

NESUG 2010

Foundations and Fundamentals

(continued)

Earnings in Quarter 4

Cumulative

Cumulative

earnq4

Frequency

Percent

Frequency

Percent

----------------------------------------------------------0

2

40.00

2

40.00

1250

1

20.00

3

60.00

2300

1

20.00

4

80.00

900.0

1

20.00

5

100.00

Ever employed

Cumulative

Cumulative

employed

Frequency

Percent

Frequency

Percent

------------------------------------------------------------1

3

60.00

3

60.00

UNK

2

40.00

5

100.00

EARNQ4 also contains only numeric values (including a value with a decimal point). It could be converted to a

numeric variable. EMPLOYED contains the value ¡®UNK¡¯. It could not be converted to a numeric variable.

STEP 2: CREATING EQUIVALENT NUMERIC VARIABLES

Use the INPUT function in an assignment statement to convert a character variable to a numeric variable:

x1=input(earnq3,best12.);

x2=input(earnq4,best12.);

The INPUT function reads each value of character variable EARNQ3 with the numeric informat BEST12.,

converts it to a numeric value, and stores it in numeric variable X1. The same applies to EARNQ4 and X2.

STEP 3: DROPPING THE ORIGINAL CHARACTER VARIABLES

Use the DROP statement to drop the original character variables that stored numbers:

drop earnq3 earnq4;

STEP 4: RENAMING THE EQUIVALENT NUMERIC VARIABLES

Use the RENAME statement to rename the equivalent numeric variables to the original character variable names:

rename x1=earnq3;

rename x2=earnq4;

Note that a variable¡¯s type, character or numeric, cannot be changed once the variable is created. A new SAS data

set needs to be created to store the equivalent numeric variables. Steps 2-4 can be consolidated in one DATA step:

data earnings2;

set earnings;

x1=input(earnq3,best12.);

x2=input(earnq4,best12.);

drop earnq3 earnq4;

rename x1=earnq3;

rename x2=earnq4;

run;

Also note that the equivalent numeric variables are added after the original variables and do not have labels:

3

NESUG 2010

Foundations and Fundamentals

CONTENTS of data set EARNINGS2

#

Variable

Type

Len

1

2

3

4

5

6

id

earnq1

earnq2

employed

earnq3

earnq4

Char

Num

Num

Char

Num

Num

2

8

8

3

8

8

Label

Unique identifier

Earnings in Quarter 1

Earnings in Quarter 2

Ever employed

STEP 5: REORDERING VARIABLES TO THE ORIGINAL SEQUENCE

Use the RETAIN statement to reorder variables to the original sequence:

data earnings3;

retain id earnq1-earnq4 employed;

set earnings2;

run;

Note that SAS adds variables to the data set being created in the order in which it encounters them. Placing the

RETAIN statement after the DATA statement, but before the SET statement, effectively adds the variables named

on the RETAIN statement to the data set being created in the specified order. Numbered range lists of variables

(EARNQ1-EARNQ4) can be used.

SAS picks up the variable attributes (type and length, plus format and/or label, if assigned) when it executes the

SET statement and reads the descriptor portion of data set EARNINGS2.

STEP 6: LABELING THE EQUIVALENT NUMERIC VARIABLES

Use the LABEL statement to label the equivalent numeric variables with the original character variable labels:

label earnq3='Earnings in Quarter 3';

label earnq4='Earnings in Quarter 4';

Note that a new SAS data set needs to be created to reorder variables. Variables can be labeled with the LABEL

statement at the same time. Steps 5-6 can be consolidated in one DATA step:

data earnings3;

retain id earnq1-earnq4 employed;

set earnings2;

label earnq3='Earnings in Quarter 3';

label earnq4='Earnings in Quarter 4';

run;

Character variables that stored numbers have now been converted to numeric variables with the same names,

labels, and order:

CONTENTS of data set EARNINGS3

#

Variable

Type

Len

1

2

3

4

5

6

id

earnq1

earnq2

earnq3

earnq4

employed

Char

Num

Num

Num

Num

Char

2

8

8

8

8

3

Label

Unique identifier

Earnings in Quarter

Earnings in Quarter

Earnings in Quarter

Earnings in Quarter

Ever employed

4

1

2

3

4

NESUG 2010

Foundations and Fundamentals

PRINT of data set EARNINGS3

Obs

id

earnq1

earnq2

earnq3

earnq4

employed

1

2

3

4

5

01

02

03

04

05

1000

0

0

1400

0

1300

2000

0

1400

.

1400

.

0

1500

0

1250

2300

0

900

0

1

1

UNK

1

UNK

The above six steps produce the desired result but require coding that would be prohibitive with a larger number

of variables. The next section describes how to generate the same code using SAS.

CONVERTING VARIABLES AUTOMATICALLY

STEP 1: IDENTIFYING CHARACTER VARIABLES TO CONVERT

First, identify character variables in SAS data set EARNINGS. Use PROC SQL to retrieve their names from a

DICTIONARY table.

DICTIONARY.COLUMNS stores the names and types of all variables in all data sets with one variable per row.

Do not retrieve ID; although it stores only numeric values, they contain leading zeroes that should be preserved:

%let excludevars=id;

proc sql noprint;

select name into :charvars separated by ' '

from dictionary.columns

where libname="WORK" and memname="EARNINGS" and type="char"

and not indexw(upcase("&excludevars"),upcase(name));

quit;

The %LET statement assigns the value ID to a macro variable named EXCLUDEVARS.

The PROC SQL statement starts the procedure. The NOPRINT option suppresses printed output.

The SELECT clause picks up values of NAME (variable name). The INTO clause stores values in a macro variable

named CHARVARS, separated by blanks.

The FROM clause retrieves rows from DICTIONARY.COLUMNS.

The WHERE clause subsets rows where four conditions are met:

1. the value of LIBNAME equals ¡°WORK¡±

2. the value of MEMNAME equals ¡°EARNINGS¡±

3. the value of TYPE equals ¡°char¡±

4. the value of NAME is not in the list of variables to be excluded

(Note: Values of LIBNAME and MEMNAME are stored in uppercase and values of TYPE are stored in lowercase.)

In other words, the WHERE clause subsets rows for character variables in SAS data set WORK.EARNINGS that

are not specifically excluded.

The INDEXW function returns the starting position of the second argument if it is found in the first argument or 0

if it is not found. Variable names may be in uppercase, lowercase, or mixed case; the UPCASE function is used on

both arguments to compare values in the same case. When the expression is NOT true (i.e., when the result is 0,

meaning the variable name was not found in the list of variables to exclude), the row is selected.

Second, determine the number of character variables:

%let ncharvars=%sysfunc(countw(&charvars));

The %LET statement uses the COUNTW function to count the number of words (i.e., variable names) stored in macro

variable CHARVARS. COUNTW is a DATA step function; it must be called by %SYSFUNC in a %LET statement.

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download