No More Split Ends: Outputting Multiple CSV Files and ...

[Pages:7]NESUG 2011

Applications

No More Split Ends: Outputting Multiple CSV Files and Keeping Related Records Together

Gayle Springer, JHU Bloomberg School of Public Health, Baltimore, MD

ABSTRACT

The EXPORT Procedure allows us to output CSV files easily. For our application, we need to limit the number of records in our output CSV files. Using FIRSTOBS= and OBS= options in PROC EXPORT, we can manipulate the number of records in each file. Additionally, when we need to keep related records together in the output CSV file, we can manipulate the data in a DATA step and use the WHERE statement in PROC EXPORT to ensure the number of observations in each CSV file is not larger than our requirements and related data remains in the same CSV file.

INTRODUCTION

In our NIH-funded study, many different types of specimens are provided by study participants, including serum, plasma, and cervical-vaginal lavage, and placed in a central biorepository collection of over 2 million vials. This biorepository has an Internet-based data management system where we specify the vials to be shipped to labs for testing.

The central biorepository has a limit of 300 vials per withdrawal requisition. If we have a larger number of vials to select, we must split the request into smaller batches containing no more than 300 vials. At times, we must also ensure that related vials are sent in the same requisition.

The biorepository's system allows us to import comma-delimited files to help with specimen selection. We use the EXPORT procedure to copy our data from SAS? data sets to CSV files. Before writing the data, we format dates, times and other values as needed in a DATA step so that the values display properly in the CSV file.

In this paper, we will build the macro step-by-step, highlighting each function. A basic familiarity with SAS Macro language is assumed.

EXPORTING A SAS DATA SET TO A CSV FILE

The following syntax copies the SAS data set WORK.MYPROJECT to the CSV file "myproject.ddmonyyyy.csv" in the current directory. The REPLACE option tells SAS to overwrite the CSV file if it already exists.

PROC EXPORT DATA= myproject OUTFILE = "myproject.&sysdate..csv" DBMS = CSV REPLACE;

RUN;

We can generalize this code by placing it in a simple macro, with the data set name to be provided as a macro parameter by the user.

%MACRO makecsv (dsn); PROC EXPORT DATA= &dsn. OUTFILE = "&dsn..&sysdate..csv" DBMS = CSV REPLACE; RUN;

%MEND makecsv;

Assuming we have a specimen request of fewer than 300 vials, we can simply run the macro with this syntax:

%makecsv(dsn=myproject)

BREAKING THE SAS DATA SET INTO MULTIPLE CSV FILES When a specimen request contains more than 300 vials, we must create multiple CSV files, each with a maximum of 300 observations. In order to do this, we first determine the the number of observations in the data set. Then we can use options FIRSTOBS= and OBS= with PROC EXPORT to output all the needed CSV files.

We begin by performing SAS functions on the [input] data set using the macro command %SYSFUNC. In lines (1)(3), we determine the number of observations in the original data set In (1), the data set is opened. In (2), the number of observations is copied to the macro variable NUMOBS In (3), the data set is closed.

1

NESUG 2011

Applications

%MACRO makecsv (dsn) (1) %LET dsid=%SYSFUNC(OPEN(&dsn)); (2) %LET numobs=%SYSFUNC(ATTRN(&dsid, NLOBS)); (3) %LET rc=%SYSFUNC(CLOSE(&dsid));

We continue by creating our output CSV files. Each iteration of the %DO loop copies 300 observations into separate output files numbered sequentially. In (5), we create a file number counter to use with our output CSV files. In (6)-(11), we use a %DO loop to create our output CSV files. In (6), the %DO loop will begin with the first observation and continue through the number of observations in the data set (&NUMOBS), incremented by the number of observations we want in our output data set. In (7), increment &K by one. In (8), use FIRSTOBS=&I to tell SAS the first observation to write to the file. The last record to be written is noted by OBS=(&I.+300-1) (8), where &I is the first observation to be written + the number of records to include in the data set, decreased by one. We need to subtract one from the OBS= option because our loop begins with the number one, e.g., beginning at obs = 1 would give us (300+1)=301 observations instead of the required 300 observations. In (9), the OUTFILE= filename includes our file counter macro variable &K. In (10), the DBMS= and REPLACE options are kept the same.

(5) (6) (7) (8) (9) (10)

(11)

%LET k = 0; %DO i = 1 %TO &numobs %BY 300 ;

%LET k = %EVAL(&k+1); PROC EXPORT DATA=&dsn (FIRSTOBS=&i OBS=%eval(&i +300 -1))

OUTFILE = "batch&dsn&k..&SYSDATE..csv" DBMS = CSV REPLACE; RUN; %END; * end of DO loop *; %MEND makecsv;

CSV WITH USER-SPECIFIED NUMBER OF OBSERVATIONS We can make this macro more general by adding a macro parameter NVIALS that allows the number of observations in the output CSV file to be specified. In (1), the default value of 300 for NVIALS is specified. In (2), the %DO loop is incremented by NVIALS. In (3), the number of observations to be written is indicated by NVIALS.

(1) %MACRO makecsv (dsn, nvials=300)

* Determine number of records in &DSN *;

%LET dsid=%SYSFUNC(OPEN(&dsn));

%LET numobs=%SYSFUNC(ATTRN(&dsid, NLOBS));

%LET rc=%SYSFUNC(CLOSE(&dsid));

%LET k = 0;

(2) %DO i = 1 %TO &numobs %BY &nvials ;

%LET k = %EVAL(&k+1);

(3)

PROC EXPORT DATA=&dsn (FIRSTOBS=&i OBS=%EVAL(&i +&nvials -1))

OUTFILE = "batch&dsn&k..&SYSDATE..csv"

DBMS = CSV REPLACE;

RUN;

%END;

%MEND makecsv;

By providing the default value of 300 for NVIALS, the macro will function exactly as in the last step by calling with %makecsv(myproject).

Alternatively, to call the macro and specify 250 records per output CSV file, use:

%makecsv(dsn=myproject, nvials=250);

KEEPING RELATED RECORDS (VIALS) TOGETHER IN A REQUEST

Our next concern is keeping related vials together in the same withdrawal request. Note that this next extension of our macro will allow grouping of related specimen vials by the value (category) of a single or of multiple variables in the data set.

To handle this, we add a macro parameter KEEPTOGETHER containing the variable(s) that identify related groups. This macro variable will be used in a BY statement that in effect defines these groups. In our example, we wish to keep all vials for a specific id together.

2

NESUG 2011

Applications

The steps we need to take are: 1) Determine the number of observations in each group of related vials. If our groups are determined by a single variable, the number of observations in each group is the frequency of each value of the variable. If our groups are determined by more than one variable, the number of observations in each group is the frequency of each distinct combination of the variables. 2) If &KEEPTOGETHER includes more than one variable, obtain the last variable in the list (i.e., in the BY statement) so it can be used with FIRST. statements. 3) Using the steps described previously, limit the output file to no more than 300 observations and ensure our grouped records remain together.

Before showing the full macro, we will show an example of how the macro is invoked using our new macro variable KEEPTOGETHER.

%makecsv(myproject, nvials = 300, keeptogether= caseno id);

We sort the data set (1) by KEEPTOGETHER and use PROC MEANS to obtain the number of observations in each group (2), which is copied to the variable _COUNT in the data set _NIDS (3).

(1) PROC SORT DATA=&dsn; BY &keeptogether; RUN;

(2) PROC MEANS NOPRINT N DATA=&dsn;

VAR &keeptogether ;

BY &keeptogether ;

(3)

OUTPUT OUT = _nids (KEEP= &keeptogether _count) N=_count;

RUN;

Next, merge data set _NIDS with the original data set (4) by the variables in &KEEPTOGETHER (5).

DATA _combine;

(4)

MERGE &dsn _nids ;

(5)

BY &keeptogether ;

RUN;

Set the combined data set by the &KEEPTOGETHER variables so that automatic variables FIRST. may be used during the count of observations (6,7)

DATA &dsn.;

(6)

SET _combine ;

(7)

BY &keeptogether;

RUN;

For each group, we need to know the number of observations in the group to ensure that we do not exceed the maximum number of records in each file. To do this, we will capture the last variable in the &KEEPTOGETHER list for use later in the program.

In (8), use the macro function %SCAN with a negative number as the second argument reads from the end of the string and searches backward.

(8) %LET lastvar = %SCAN(&keeptogether, -1) ;

Then, in a DATA step, create a variable (NEXTFILE) that will keep count of the number of sets of &VNAILS records, e.g., NEXTFILE=1 for the first set of &NVIALS records, NEXTFILE=2 for the second set of &NVIALS records, through NEXTFILE=n for the last set of records to be written to the output CSV file.

In (9) and (10), SET the data set BY the &KEEPTOGETHER variables so automatic variable FIRST. may be used In (11), RETAIN _TOTAL and NEXTFILE so that their values are carried forward for each record. Variable _TOTAL will keep a running count of the total number of observations to be written to the output CSV file. In (12), increment _TOTAL by _COUNT at the beginning of each BY group. _COUNT, the number of observations in the &KEEPTOGETHER group, is repeated on all records within that BY group. Hence, _COUNT needs to be added to _TOTAL only once per group. In (13)-(16), when _TOTAL becomes larger than the number of observations specified in &NVIALS, reset the _total counter to _COUNT and change the value for NEXTFILE. In (14), _TOTAL is reset to _COUNT. In (15), increment NEXTFILE by one for the next set of &NVIALS observations to be written to a new CSV file. In (17), CALL SYMPUT is executed to assign the value of NEXTFILE to the macro parameter _MYLASTFILE. Although line 17 is executed with each record, the "final" value for _MYLASTFILE will be the value of NEXTFILE in the last observation of WORK._BATCH. In (18), write the value of &_MYLASTFILE to the SAS LOG

3

NESUG 2011

Applications

(9) (10) (11) (12) (13) (14) (15) (16) (17)

(18)

DATA _batch ; SET &dsn. ; BY &keeptogether. ; RETAIN _total 0 nextfile 1 ; IF FIRST.&lastvar. THEN _total = _total + _count; IF _total > &nvials THEN DO; _total = _count; nextfile+1; END; CALL SYMPUT ( '_mylastfile', nextfile);

RUN; %PUT The number of files to be written is &_mylastfile= ;

Now we get back to the heart of our previous versions of our macro. The number of records for each value of NEXTFILE was ensured above (lines 13-15) to be less than or equal to the value specified in &NVIALS.

In our %DO loop, we are not counting the number of observations this time. Instead, the value of NEXTFILE may be used to indicate which records will be output to each CSV file. By using a WHERE statement in our procedure, we can output our files.

In (19), assign the macro variable to run from 1 to the number of files to be written (&_MYLASTFILE) In (20), restrict the output CSV file to with a WHERE= statement specifying NEXTFILE = &K. The %DO loop is repeated until all files have been written.

(19) (20)

%DO k = 1 %TO &_mylastfile. ; PROC EXPORT DATA=_batch (WHERE= (nextfile = &k.)) OUTFILE = "batch&dsn.&k..&sysdate..csv" DBMS = CSV REPLACE; RUN;

%END;

The parameters in the above macro allow the user to specify the data set name (&DSN), the maximum number of records in the output data set (&NVIALS), and the grouping variables (&KEEPTOGETHER).

ADDING ERROR CHECKS

If the data set does not exist or a variable has been misspelled, our macro will not work. We can make the application more robust by adding some error checks.

In (1), we create the local macro variable &BAD that will be used to flag a macro parameter error. In (2), we write a message to the log indicating what check will be performed. In (3) ? (5), we write an error to the log and set &BAD to t(rue) if the data set is not found In (6) ? (9), we loop through the variables listed in KEEPTOGETHER, and verify that they exist. The variable check only occurs if (as verified in (8)) the data set exists. In (10) ? (11), we write an error message to the log and stop processing if the error flag (&BAD) has been turned on. Otherwise, we continue processing.

%makecsv(dsn=myproject, nvials = 300, keeptogether= case id);

(1) %LOCAL bad;

* check if DSN exists *;

(2)

%PUT Check for DSN (&dsn) existence ;

(3)

%IF %SYSFUNC(EXIST(WORK.&dsn))=0 %THEN %DO;

(4)

%PUT ERROR: data set &dsn does not exist ;

(5)

%LET bad=t;

%END;

* check if variables exists *;

%PUT Check for keeptogether (&keeptogether.) existence ;

%LET _vnum=1;

(6)

%DO %WHILE(%SCAN(&keeptogether., &_vnum., %STR( )) ne );

(7)

%IF %SYSFUNC(EXIST(WORK.&dsn))=1 %THEN %DO;

%LET dsnid = %SYSFUNC(OPEN(&dsn)) ;

%LET varck = %SCAN(&keeptogether, &_vnum, %STR( ));

%PUT Checking if &varck exists ;

(8)

%IF %SYSFUNC(VARNUM(&dsnid,&varck)) = 0 %THEN %DO;

%PUT ERROR: Variable in KEEPTOGETHER &varck does not exist ;

%LET bad=t;

%END;

%LET rc=%SYSFUNC(CLOSE(&dsnid));

4

NESUG 2011

Applications

%END;

%LET _vnum = %EVAL(&_vnum +1);

(9)

%END; * end of do while loop *;

(10)

(11) (12)

%IF &bad.=t %THEN %DO; %PUT Ending macro due to ERROR(S) listed above ; %GOTO bottom;

%END; /* more macro code */

%bottom: %MEND makecsv ;

CONCLUSION

Using PROC EXPORT, PROC MEANS and some basic DATA steps, we created a program that limits the number of records in each CSV file to a specified number and prevents related records from being split between files. The entire macro is shown in Appendix A.

Additional options we could add to this macro include location of the output data set (directory), other output destinations as specified with DBMS in PROC EXPORT, and even choosing to output the information to the listing file using a PROC (e.g., PRINT, REPORT) instead of writing the data to a file.

APPENDIX A:

%MACRO makecsv (dsn, nvials = 300, keeptogether= ); %LOCAL bad; * check if DSN exists *; %PUT Check for DSN (&dsn) existence ; %IF %SYSFUNC(EXIST(WORK.&dsn))=0 %THEN %DO; %PUT ERROR: data set &dsn does not exist ; %LET bad=t; %END; * check if variables exists *; %PUT Check for keeptogether (&keeptogether.) existence ; %LET _vnum=1; %DO %WHILE(%SCAN(&keeptogether., &_vnum., %STR( )) ne ); %IF %SYSFUNC(EXIST(WORK.&dsn))=1 %THEN %DO; %LET dsnid = %SYSFUNC(OPEN(&dsn)) ; %LET varck = %SCAN(&keeptogether, &_vnum, %STR( )); %PUT Checking if &varck exists ; %IF %SYSFUNC(VARNUM(&dsnid,&varck)) = 0 %THEN %DO; %PUT ERROR: Variable in KEEPTOGETHER &varck does not exist ; %LET bad=t; %END; %LET rc=%SYSFUNC(CLOSE(&dsnid)); %END; %LET _vnum = %EVAL(&_vnum +1); %END; * end of do while loop *;

%IF &bad.=t %THEN %DO; %PUT Ending macro due to ERROR(S) listed above ; %GOTO bottom;

%END;

%IF &KEEPTOGETHER = %THEN %DO; * Determine number of records in &DSN *; %LET dsid=%SYSFUNC(OPEN(&dsn)); %LET numobs=%SYSFUNC(ATTRN(&dsid, NLOBS)); %LET rc=%SYSFUNC(CLOSE(&dsid));

%LET k = 0; %DO i = 1 %TO &numobs %BY &nvials ;

%LET k = %EVAL(&k+1); PROC EXPORT DATA=&dsn (FIRSTOBS=&i OBS=%eval(&i +&nvials -1))

OUTFILE = "batch&dsn&k..&SYSDATE..csv" DBMS = CSV REPLACE; RUN; %END; %END ;

5

NESUG 2011

Applications

/* end of writing files without KEEPTOGETHER variables */ %ELSE %DO;

PROC SORT DATA=&dsn; BY &keeptogether; RUN; PROC MEANS NOPRINT N DATA=&dsn;

VAR &keeptogether ; BY &keeptogether ; OUTPUT OUT = _nids (KEEP= &keeptogether _count) N=_count; RUN;

DATA _combine; MERGE &dsn (IN=a) _nids ; BY &keeptogether ; IF a;

RUN;

DATA &dsn.; SET _combine ; BY &keeptogether;

RUN;

%LET lastvar = %SCAN(&keeptogether, -1) ; %PUT &lastvar is the last variable in keeptogether (&keeptogether). ;

DATA _batch ; SET &dsn ; BY &keeptogether ; RETAIN _total 0 nextfile 1 ; IF FIRST.&lastvar THEN _total = _total + _count; IF _total > &nvials THEN DO; _total = _count; nextfile+1; END; CALL SYMPUT ( '_mylastfile', nextfile);

RUN; %PUT The number of files to be written is &_mylastfile ;

%DO k = 1 %TO &_mylastfile ; PROC EXPORT DATA=_batch (WHERE= (nextfile = &k)) OUTFILE = "batch&dsn&k..&sysdate..csv" DBMS = CSV REPLACE; RUN;

%END; %END; %bottom: %MEND makecsv ;

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ? indicates USA registration.

Other brand and product names are registered trademarks or trademarks of their respective companies.

Data in this manuscript were collected by the Women's Interagency HIV Study (WIHS) Collaborative Study Group with centers (Principal Investigators) at New York City/Bronx Consortium (Kathryn Anastos); Brooklyn, NY (Howard Minkoff); Washington DC, Metropolitan Consortium (Mary Young); The Connie Wofsy Study Consortium of Northern California (Ruth Greenblatt); Los Angeles County/Southern California Consortium (Alexandra Levine); Chicago Consortium (Mardge Cohen); Data Coordinating Center (Stephen Gange). The WIHS is funded by the National Institute of Allergy and Infectious Diseases (UO1-AI-35004, UO1-AI-31834, UO1-AI-34994, UO1-AI-34989, UO1-AI-34993, and UO1-AI-42590) and by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (UO1-HD-32632). The study is co- funded by the National Cancer Institute, the National Institute on Drug Abuse, and the National Institute on Deafness and Other Communication Disorders. Funding is also provided by the National Center for Research Resources (UCSF-CTSI Grant Number UL1 RR024131). The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the National Institutes of Health.

Special thanks to Bruce Gilsen, Johanna Goderre, and Jonathan Kerman for reviewing my drafts and providing constructive feedback.

6

NESUG 2011

Applications

I would also like to thank the many talented individuals whose written and presented work has helped me with my SAS programming. I cannot recall all the papers I referenced while creating this program, but I owe you all a big debt.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at: Gayle Springer Johns Hopkins University Bloomberg School of Public Health 111 Market Street, Suite 906 Baltimore, MD 21202 410-223-1657 fax 410-223-1666 gspringe@jhsph.edu

7

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download