%MISSING: A SAS® Macro to Report Missing Value Percentages for a Multi ...

[Pages:6]NESUG 2006

Coders' Corner

%MISSING: A SAS? Macro to Report Missing Value Percentages for a Multi-Year Multi-File Information System

Rushi Patel, Creative Information Technology, Inc., Arlington, VA

ABSTRACT

It is common to find missing values in datasets. An analyst deals with missing values in a variety of ways that range from simply ignoring these records to robust imputations. The decision to choose a strategy to tackle missing values depends upon their magnitude and relevance of the corresponding variable. This paper presents a macro that outputs the missing value percentages for all variables in all files (or for selected variables in selected files) in a multi-year information system. The focus is on using SAS? Macro Language to dynamically write pieces of code, MISSING and NMISS functions, iterative macro execution and ODS to export the final report. The intended audience is beginner to intermediate programmers interested in automating tasks through utility macros.

INTRODUCTION

Missing values are a reality of datasets. An analyst has several options to tackle missing values. However, the option chosen is largely a function of two factors. First, is the variable in question and second is the extent to which that variable is coded as missing. This paper describes a macro coded to generate a report that enlists the percentage of missing values for each variable present in Highway Safety Information System (HSIS), a multi-state, multi-file, multi-year information system. HSIS is comprised of nine geographic states and each of these states has 4 to 10 files. Again, each file is present for any or all years from 1985 onward. Initially the paper gives an overview of the layout of files, followed by a brief description of the basic procedures and functions used to compute the missing percentages. The focus then shifts to describing a routine that generates the required pieces of code through a series of macros that are invoked iteratively within a blanket macro (%MISSING).

LAYOUT OF DATA FILES IN THE INFORMATION SYSTEM

Broadly speaking, each file in HSIS can either be a crash (accident) based file or a segment based file. A segment based file has a specific length for each observation, while crash based files are point files. For segment based files, the desired output is a percentage of missing values as a function of total length within the file, while for crash based files the desired output is a function of total number of observations. Both sample data files are listed below.

/* crash based file */

Data __;

Input $a b;

Cards;

a

1

b

.

2

C

3

;

Run;

/* segment based file */

Data __;

Input $a c length;

Cards;

a

4

10

3

10

c

.

20

;

Run;

The output for a crash based file will show 25% missing (1 in 4 observations) for variables "a" and "b". While for segment based file, the output for variable "a" will be 25% missing (10 units of length out of a total of 40) and for variable "c" will be 50% missing (20 units of length out of a total of 40).

BASIC PROCEDURES AND FUNCTIONS

NMISS() and COUNT() functions do the trick for crash based files. NMISS() returns the number of missing values for a variable or a list of variables passed through it. COUNT() returns the number of observations in any dataset. The following template of code computes the missing value percentages for the specified variables in a dataset.

proc sql; create table as select nmiss() / count(*) as miss__pct ,

nmiss()/ count(*) as miss__pct .......

from __ quit ;

1

NESUG 2006

Coders' Corner

MISSING() and SUM() functions are used to compute the percentage of missing values for segment based files. MISSING() returns 0 or 1 for a data point that is present or missing. SUM() returns the summation of a variable across the entire dataset. The following template computes the required percentages for segment-based files.

proc sql; create table as select sum(missing() *seg_lng) / sum(seg_lng) as

miss__pct, sum(missing() *seg_lng) / sum(seg_lng) as miss__pct

....... from __ quit;

DESIRED OUTPUT

Table 1 shows a sample of the desired output. %MISSING generates this output for all years that a file is available for all statefile combinations. The value in a year-variable cell represents the missing value percentage depending upon whether a file is crash based or segment based. A "." indicates a variable is not present for that year.

Table 1: Report for Minnesota Accident data

Variables Variable 1 Variable 2 Variable 3 Variable 4

Year 2000 10.4 . 12 0

Year 2001 13.5 . 12 0

Year 2002 0 100 12 .

Year 2003 12.1 100 12 .

MACRO STRUCTURE

%MISSING consists of a series of macros that are iteratively invoked within it. The steps involved are ? 1. For each file available for a particular state, determine whether that file is segment based or crash based. This is done within %MISSING and a global macro variable "rdcheck" is assigned a value `Y' (indicating a segment based file) or `N' (indicating a crash based file). 2. Based on whether "rdcheck" is `Y or N', generate the relevant select statements within a PROC SQL routine. This is done by two support macros, %CRASH_BASED or %SEGMENT_BASED, which are invoked conditionally within another macro %COMPUTE_PCT. 3. Invoke %COMPUTE_PCT for all state-file-year combinations. This is done by populating a series of macro variable arrays and stepping into those iteratively through %do loops within %MISSING. 4. Collect, organize and restructure the output generated from each execution of %COMPUTE_PCT. This is also done within %MISSING using PROC TRANSPOSE, APPEND and DATA steps. 5. Generate the desired reports through ODS ? done within %MISSING.

A pre-requisite for executing %MISSING is a dataset "names" that has three variables ? state, file and year. This dataset has all the possible state-file-year combinations. The first PROC SQL routine in %MISSING refers to this dataset. Also, by

screening dataset "names", a user can execute %MISSING on a limited part of HSIS. Libname "data" points to an

ORACLE database, on which HSIS resides.

proc contents data = data._all_ noprint out=temp (keep=memname); data names (keep = st yr file); set temp;

by memname; if first.memname;

len=length(memname); st=substr(memname,1,2); yr=substr(memname,3,2); start=5; run=len-start+1; file = substr(memname,start,run); run;

2

NESUG 2006

Coders' Corner

SUPPORT MACROS - %CRASH_BASED and %SEGMENT_BASED

options mprint; %macro crash_based (vars = ) ;

%local i v list ; %let i = 1 ; %let v = %scan(&vars,&i) ; %do %while (%length(&v) ^= 0) ;

%let list = &list , (nmiss(&v)/count(*)) * 100 as miss_&v._pct ;

%let i = %eval( &i + 1 ) ; %let v = %scan(&vars,&i) ; %end ; %substr(%superq(list),2) %mend crash_based ;

A macro variable array consisting of all variable names present in a particular state-year-file combination is passed to either of these macros (parameter vars=). This macro variable array is populated within %COMPUTE_PCT through a PROC SQL routine before executing the support macros. The macro quoting function %superq() is used to mask the special characters generated within the code. The details of this and related quoting functions are beyond the scope of this paper and the reader is referred to some prior work by Whitlock (1). %SEGMENT_BASED is similar to %CRASH_BASED and is not listed here.

%COMPUTE_PCT

%macro compute_pct ( root = , file = , year = ) ;

proc sql noprint ; select lowcase(name) into :vars separated by " " from dictionary.columns where libname = "DATA" and memname = "%upcase(&root&year&branch)" ;

quit;

proc sql noprint; create table __temp as %if &rdcheck = Y %then %do; select %segment_based( vars = &vars ) from data.&root&year&branch ; %end; %if &rdcheck = N %then %do; select %crash_based( vars = &vars ) from data.&root&year&branch ; %end;

quit ;

proc transpose data = __temp out = __temp ;

data __temp ; format miss_pct 4.2; length varname $17; set __temp ( rename = ( _name_ = varname col1 = miss_pct ) ) ; label varname = "Variable" ;

3

NESUG 2006

Coders' Corner

varname = substr( varname , 6 ) ; varname = substr ( varname , 1 , length ( varname ) - 4 ) ; year = &year ; run ;

%mend compute_pct ;

Parameters passed to this macro are root= (state), file= (name of the file) and year= (each year for a state-file combination), the values of which are generated in %MISSING. It checks the "rdcheck" variable value and executes either of the two support macros based on its value. It populates a macro variable array "vars" which is passed to the support macros. PROC TRANSPOSE and the following DATA step is for organizing the output.

%MISSING

This macro manages the execution of the %COMPUTE_PCT and is referred in this paper as the blanket macro. This macro creates the macro variables that are passed to %COMPUTE_PCT, through a series of %do loops. It takes all the output from %COMPUTE_PCT and appends it in a file, creating a separate report for each state-file combination. Finally, it outputs the report as a MS Word file for each state-file combination to the designated directories using Output Delivery System (ODS).

%macro MISSING;

%global rdcheck;

proc sql noprint; select distinct(st) into: state separated by ` ` from names ; quit ;

%let i=1; %let root=%scan(&state,&i);

%do %while(%length(&root) > 0);

proc sql noprint; select distinct(file) into: FILES&root separated by ' ' from names where st="&root"; quit; run;

%let j=1; %let branch=%scan(&&FILES&root,&j);

%do %while(%length(&branch) > 0);

proc sql noprint; select yr into: &root.&branch separated by ' ' from names where st="&root" and file = "&branch"; quit; run;

%let k=1; %let getyr = %scan(&&&root.&branch,&k);

%do %while (%length(&getyr) > 0);

proc sql noprint;

4

NESUG 2006

Coders' Corner

select quote(trim(lowcase(name))) into:dum separated by ","

from dictionary.columns where libname = "DATA" and memname = "%upcase(&root&year&branch)";

quit;

proc sql noprint; select distinct

case when "seg_lng" in (&dum) then 'Y' else 'N'

end as var1 into : rdcheck from dictionary.columns where libname = "DATA"

and memname = "%upcase(&root&year&branch)"; quit;

%compute_pct( root=&root, file=&branch, year=&getyr );

proc append base = dat1 data = __temp ; run ;

%let k = %eval(&k + 1); %let getyr = %scan(&&&root.&branch, &k); %end;

proc sort data = dat1 ; by varname year ;

proc transpose data = dat1 out = dat1 ( drop = _name_ )

by varname ; id year ;

prefix = year ;

ods rtf file = "d:\Missing Drive\&root\&branch..rtf"; proc print;

title "Percent Missing for &root. &branch. files"; ods rtf close;

proc datasets lib= work; delete dat1 __temp;

run;

%let j=%eval(&j+1); %let branch=%scan(&&FILES&root,&j); %end;

%let i=%eval(&i+1); %let root=%scan(&state,&i); %end;

%mend MISSING;

CONCLUSION

The main feature of this macro is that it can generate the required report automatically, with minimal user intervention. Also, with simple modifications, it can be easily executed on a limited part of the information system. Along with determining all variables in any dataset in the information system, it also determines the type of the dataset. The output allows the user to

5

NESUG 2006

Coders' Corner

identify variables with high percentage missing observations and also variables with an abrupt change in percentage of missing observations. Finally, it outputs the report as a MS Word file in a specified directory for future reference and sharing.

ACKNOWLEDGEMENTS

I thank Creative Information Technology, Inc. (CITI) for sponsoring this paper. CITI is currently providing SAS services through engagements with U.S. Federal Government clients at the Departments of Housing and Urban Development and Transportation. I am grateful to Mr. Ian Whitlock for his immensely useful suggestions in response to my inquiry on SAS-L. The basic macro design discussed here was suggested by Mr. Whitlock. Many thanks, to Mr. Yusuf Mohamedshah and Dr. Forrest Council at Highway Safety Information System (HSIS) for their continued encouragement to develop macros to streamline HSIS data analysis tasks. And last but not the least; I thank my wife and my mother for being a constant source of

support and motivation.

REFERENCES

1. Whitlock, Ian. A serious look at Macro Quoting. Paper 11, Proceedings of twenty eighth annual SAS-User Group

International Conference. Available online at www2.proceedings/sugi28/011-28.pdf

2. Highway Safety Information System website. .

AUTHOR CONTACT

Your comments and questions are valued and encouraged. Contact the author at:

Rushi B. Patel Creative Information Technology, Inc., 1010 N. Glebe Road, Suite 710, Arlington, VA 22201 E-mail 1: rushi.b.patel@ E-mail 2: rpatel@citi- Phone: 240-383-6207

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ? indicates USA registration.

6

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download