Getting started - School of Informatics



SAS Programming Notes

For

Data Mining and Exploration

Lecturer: Amos Storkey

School Of Informatics

University of Edinburgh

Acknowledgements: These notes are extensively based on notes developed over a long period by the School of Accounting, Economics & Statistics, Napier University. People who have worked on or contributed to these notes over that time include Amos Storkey, Ana Costa Da Silva, Phil Darby, Helen Storkey, Jeff Dodgson, Dorothy Currie, Kate Houston and Kirsty Davidson.

I am very grateful for permission to use and develop these notes for the Data Mining and Exploration course.

First published September 2000

Updated September 2001 (SAS version 8) and February, July

September 2002

October 2004 (SAS version 8.1) and September 2005

December 2006 (SAS version 9.1.3)

January 2008 (SAS 9.2 and linux differences)

File: SASv9.2.doc

1. Getting started 7

1.1 What is the SAS system? 7

1.2 The SAS Workspace 7

1.3 Creating and running a SAS program 9

1.4 Submitting and correcting your program 10

1.5 Saving files and clearing text from windows 10

1.6 Reading a saved program 11

1.7 A Data Analysis Flow Chart 11

1.8 Importing data using a wizard 12

1.9 Viewing a data set 13

1.10 Creating a SAS Program 14

1.11 Rules for entering SAS statements 15

1.12 Adding comments to a program 16

1.13 Including titles in your SAS output 16

1.14 Creating new variables 17

1.15 Printing and saving SAS output 17

2. Data files and SAS data sets 21

2.1 Reading data files using the INFILE statement 21

2.2 LIBNAME and permanent SAS Data Sets 22

2.3 Referencing a permanent SAS data set 24

2.4 Contents of a file 24

2.5 Importing data from other packages 25

2.6 Missing values 26

2.7 The INPUT statement 27

3. SAS procedures 33

3.1 Structure of a SAS program 33

3.2 Sample program 33

4. Summarising data 37

4.1 SAS System Options 37

4.2 HTML output 38

4.3 Summary Procedures 38

4.4 PROC SORT 39

4.5 PROC MEANS 40

4.6 PROC UNIVARIATE 41

4.7 PROC FREQ 41

4.8 General syntax for a procedure 43

4.9 Help 44

5. Graphs and charts 47

5.1 Graphics procedures 47

5.2 PROC PLOT 47

5.3 PROC CHART 48

6. Correlation and Regression 51

6.1 PROC CORR 51

6.2 PROC REG 52

7. Exploratory data analysis 55

7.1 SAS/INSIGHT 55

7.2 Accessing SAS/INSIGHT 55

7.3 Features of SAS/INSIGHT 57

7.4 Using SAS/INSIGHT 57

Tools 59

8. Modifying data and output 61

8.1 Introduction 61

8.2 SET statement 61

8.3 DROP and KEEP 62

8.4 Labelling output 63

8.5 PROC PRINT 64

8.6 PROC FORMAT 64

8.7 Recoding data 65

8.8 Conditional statements 65

8.9 VALUE statement 66

8.10 OUTPUT 67

9. PROC TABULATE 69

10. Functions and formats 73

10.1 MEAN function 73

10.2 NMISS function 73

10.3 N function 74

10.4 Functions to handle character variables 74

10.5 Date and Time Formats 75

11. Iterative processing 79

11.1 Do loops and arrays 79

11.2 Reading data in repeated patterns 79

11.3 Arrays 80

11.4 Generating random numbers 81

11.5 Random numbers from a uniform distribution 82

11.6 Random numbers from a normal distribution 83

11.7 The SAS Program Data Vector 83

11.8 The RETAIN and Sum statements 83

12. Further topics 87

12.1 Combining Data Sets 87

12.2 Hints on Using Word with SAS and SAS/INSIGHT 88

Solutions to exercises 93

Various files are referred to in these notes. These can be found in a zip file on the Data Mining and Exploration web site

inf.ed.ac.uk/teaching/courses/dme/

Getting started

1 Introduction

The SAS system is a widely used resource for statistical analysis and data mining. It is rare to find a job advert for a data mining practitioner that does not ask for SAS skills. The main positive points of SAS are its ability to handle large files fairly transparently, the ease and comprehensive way that standard analyses can be done, the interactive way that analyses can be built alongside a systematic programming environment, and the data handling capabilities. Its main negative points are its graphical capabilities, and that adding your own extensions to the techniques using macros and the interactive matrix language are slightly more cumbersome than other languages (e.g. matlab, R) and than more modern language constructs.

This tutorial will introduce you to the SAS System. This tutorial should be suitable for those working on either a Linux or Windows system. Interface tools in SAS for Windows are much better and so where there are differences these will also be mentioned.

SAS is, at its heart a piece of software for data handling and storage, statistical and data analysis, data mining decision support and report writing. It has been extended to a whole business intelligence package, but the best way of understanding SAS is from the inside out, and so this tutorial will teach the base SAS software to get you started. With base SAS software you can store data values and retrieve them, modify data, compute simple statistics, and create reports all in one SAS session. The difference between SAS and most statistical packages is that SAS incorporates both a database management system and a high-level programming language. There is also SAS software which provides graphics, forecasting, data entry, and statistics. The SAS system also contains other sophisticated applications that are valuable to large enterprises. All are available in one system.

2 The SAS Workspace

To start SAS on a linux system type SAS at the command prompt. On windows, select SAS from the start menu.

When you go into SAS, the first thing you see is a set of windows as shown in Figure 1. Your display may appear a little different since this has been adjusted to allow all the windows to be seen at once. There are five different windows shown in this figure. Two further windows are available in SAS version 9, you can switch between them by clicking on the buttons at the bottom of the SAS window.

Figure 1 SAS window on opening in Windows

Figure 2 SAS window on opening in Linux

The five windows are:

the EDITOR window where you enter the SAS statements you wish to execute. The EDITOR has handy features like colour coding and expandable and collapsible sections.

the LOG window which contains information on your SAS run, e.g. date and time of run, a listing of your SAS statements as they are executed and any errors which have occurred during processing.

the OUTPUT window which displays the actual results of the program.

the EXPLORER window, which allows you to view and manage your SAS files and create shortcuts to non-SAS files. For example you can use this window to create new libraries or to open any SAS file.

the RESULTS window helps you navigate and manage output from SAS programs you submit. You can view, save, and print individual items of output. (By default, the Results window is positioned behind the Explorer window but when you submit a SAS program that creates output it moves to the front of your display)

The two windows not shown are:

the GRAPH window, will appear when graphical output is to be displayed.

A seventh window will appear when html output is used. The output delivery system (ODS) can be turned on using programming code or by using the menu options.

You may turn on or turn off a window by using View from the main menu. Just choose the window you need (use this if you ‘loose’ a window).

Task 1

Resize the 3 windows on the right hand side so that you see the OUTPUT as well as the EDITOR and the LOG. Make the EDITOR the largest window.

You can activate any of the windows by

clicking on the window (Windows or Linux)

selecting Window from the menu, then the window you want (Windows)

selecting View from the menu (Windows or Linux)

3 Creating and running a SAS program

The following lines of code are a simple SAS program.

When they are typed into the editor window the words will become colour coded.

Reserved words appear blue (e.g. proc, print, input)

Comments appear green in Windows and in black in Linux (See below for details of entering comments.

Errors appear red.

data class1;

input height weight sex $;

datalines;

152 45.4 F

178 73.0 M

178 68.8 M

175 59.7 M

157 44.5 F

165 61.7 M

175 74.1 M

160 49.5 F

run;

proc print;

run;

Task 2

Enter the SAS program in the EDITOR window.

4 Submitting and correcting your program

There are several methods of submitting your program.

1. Highlight the section of code you wish to run and press the running man icon (in Windows).

2. Ensure that your cursor is in the EDITOR window, then select Run ( Submit (in Windows or Linux).

3. You can also run just a few lines of code by selecting Run -> Submit top line or Submit N lines (in Windows or Linux).

Right click with the mouse and select Submit All or press the man running icon

An alternative to pressing the man running icon is to press the key F3 in Windows or the key End in Linux.

Examine your LOG window to check that there were no error messages: if all is well examine your output in the OUTPUT window.

If you have error messages in your log file you will need to correct the mistakes and resubmit it.

After submitting your code you may find that it has disappeared from the editor window. To overcome this problem select Run ( Recall Last Submit.

5 Saving files and clearing text from windows

When you have succeeded in getting your program to run you can save it as filename.sas ( SAS automatically gives it a .sas ending to remind you that it is a SAS program). Make sure your EDITOR window is active before doing File ( Save (or pressing the floppy disk icon). Otherwise you might be saving the contents of your log or output window instead of your program.

Save log files as filename.log and output files as filename.lst if you want to save them too. It is usually not necessary to save the log file.

Important: - In order to avoid getting confused about which output and log refers to which program or version of a program, make it a habit to clear your windows before submitting a new program. Do this by selecting Edit ( Clear All.

Run ( Recall Last Submit returns the program you have just run to the EDITOR window. This is useful if you have cleared the program by mistake.

Task 3

Create a new folder in your personal disk space called MA71064 Statistical computing.

Submit the SAS program from the program editor. When it is working satisfactorily save the file as class1.sas in the folder you have just created.

6 Reading a saved program

A SAS program needs to be in an EDITOR window before it can run. To open a saved SAS program activate the EDITOR window and use File(Open. The program can then be submitted in the usual way.

You can have more than one EDITOR window open at the same time. However this can be confusing and it is easiest at first to have only one program open at a time.

7 A Data Analysis Flow Chart

Data analysis can be thought of in terms of a process flow. Actions proceed in a sequence. Often the output from one action leads to the input of another. A simple flow chart is given below.

Data step

Proc Print

Proc Means

Figure 3 A simple data analysis flow diagram

SAS programs can contain combinations of DATA steps and PROCEDURES. The SAS program you used above executed the first 2 blocks in the flow diagram. Quite quickly you will be producing more complicated programs that will have many DATA steps and PROCS.

8 Importing data using a wizard

The next example reads the excel file Class0 into the temporary SAS library called Work. The format of the data is displayed and the summary statistics (count, average and standard deviation) of the height readings is calculated.

The simplest method of entering data into SAS is using the import wizard.

File ( Import Data will display the dialogue shown in Figure 4.

A.

The source type default is Excel but others are available from the pull down menu.

Next

B. Locate the source file by pressing the Browse button

OK

C. Select the appropriate worksheet

From the options ensure that ‘Use data in the first row as SAS names’ is ticked.

OK

Next

D. Enter the Member as Class0.

Finish

Check the log window for errors.

Figure 4 Import data wizard

In Linux, there is naturally not the option of importing from Excel. However, there is the option of importing csv files. An Excel file can be opened using Open Office and can be saved as a Comma Separated File (csv). It can then be imported straight into SAS.

In Linux steps B and C above are replaced by the dialogue in Figure 4. Similar options are available when pressing the respective button. The remainder dialogue is similar to that in Figure 4.

Figure 5 Import data wizard in Linux

The final step of the wizard, in both Linux and Windows, is optional and offers the possibility of saving the importation command in a specified file, which can be opened with the Program Editor. This can be copy-pasted into any program and be run, without need to follow the steps of the wizard again.

9 Viewing a data set

Once the data is into the SAS format you can look at it in a variety of ways.

1. Proc print; run;

2. From the explorer window, double click on the libraries icon to reveal libraries that are present. These libraries are simply pointers to Windows XP folders where the data sets are stored. Double clicking on the work library reveals the data set Class0.

3. Double click on the data set to open the data set.

4. Right click on the data set to display a set of options. These include;

Open,

View the Columns

View in Excel (only in Windows).

Task 4

Import the excel file class0.xls into SAS using the import wizard then display the imported data set using excel.

10 Creating a SAS Program

You have already submitted a simple SAS program which created and then printed out a set of data. The following is an extension of that program. The line numbers have been included to help explain the structure of the program: they are not part of the program itself and should not be typed.

|Line number |Program |

|001 |data class2; |

|002 | input height weight sex $ bends pulse1 pulse2; |

|003 |datalines; |

|004 |152 |45.4 |F |6 |61 |84 | |

|005 |178 |7.0 |M |8 |59 |102 | |

|006 |178 |68.8 |M |12 |58 |95 | |

|007 |175 |59.7 |M |5 |76 |83 | |

|008 |157 |44.5 |F |5 |53 |102 | |

|009 |165 |61.7 |M |10 |70 |110 | |

|010 |175 |74.1 |M |5 |76 |102 | |

|011 |160 |49.5 |F |2 |67 |118 | |

|012 |161 |52.6 |M |5 |80 |103 | |

|013 |180 |85.4 |M |7 |84 |102 | |

|014 |160 |57.2 |F |7 |98 |115 | |

|015 |170 |69.9 |M |7 |69 |102 | |

|016 |178 |67.0 |M |11 |60 |79 | |

|017 |163 |57.0 |F |8 |70 |98 | |

|018 |160 |60.9 |F |12 |57 |84 | |

|019 |185 |73.1 |M |5 |68 |. | |

|020 |188 |79.1 |M |3 |53 |69 | |

|021 |159 |49.5 |F |6 |69 |112 | |

|022 | run; |

|023 |proc print; |

|024 | run; |

|025 |proc means; |

|026 | run; |

|Line number |Explanation |

|001 |The DATA statement tells SAS to create a data set called class2. |

|002 |The INPUT statement names the variables in the order they appear in the data lines. Variable names |

| |must start with a letter, be no more than 32 characters in length (eight characters in version 6) |

| |and must not contain blanks, commas and so on. To read data as characters, rather than numbers, a |

| |dollar sign is put after the variable name. |

|003 |The DATALINES statement indicates that the next lines are data. |

|004 |The data are entered with a space(s) separating each item. The data must be in the same order as |

|to |declared in the input statement. A new line is used for each record. |

|021 | |

|019 |A full stop indicates a missing numerical value. |

|022 |RUN tells SAS to execute the preceding statements |

|023 |PROC PRINT is a procedure to print data in the Output Window |

|025 & 026 |PROC MEANS is a procedure to calculate the mean and other statistics of all the numeric variables, |

| |RUN completes the procedure. |

This example illustrates the basic structure of a SAS program:

A DATA step consisting of a DATA statement and other statements that form part of this step

SAS PROCECURES begin with a PROC statement. Procedure statements may also be followed by statements that are part of the procedure step, although there are none in these two examples

11 Rules for entering SAS statements

SAS statements:

usually begin with an identifying keyword

always end with a semicolon

(check carefully before you submit any program!)

can be in uppercase or lowercase letters

SAS statements are free format.

they can begin and end in any column

one statement can continue over several lines

several statements can be on one line

Readability is improved if you add comments and leave spaces between the DATA and PROC steps and perhaps also indent code within a DATA or PROC step. Develop your own style and stick with it.

12 Adding comments to a program

There are two ways of writing comments in a SAS program:

begin the comment line with an asterisk and end with a semi-colon

e.g. *This program was developed by J Smith;

begin with a forward slash asterisk and end with an asterisk forward slash

e.g. /* J Smith February 2005 */

Inserting comments is essential if you are doing any serious programming.

The /* style */ is also useful for ‘commenting out’ blocks of a program when testing or debugging.

Task 5

Read the file class1.sas into the program editor. Edit the program so that it is the same as the sample program in section 1.10 but with the addition of a comment which gives your name and today’s date. Submit the program and when it is working properly save it as class2.sas.

What information did the Proc Means procedure give you?

13 Including titles in your SAS output

The TITLE statement is used to provide titles on your output. The TITLE statement can appear anywhere in a program (an example of a global statement) and subsequently each page of output (and each graph) will have the title until it is reset. For example program class2.sas could be enhanced as follows:

. . .

proc print;

title ‘Information on Students in Class’;

run;

proc means;

title ‘Summary Statistics of Students in Class’;

run;

However the final title will adorn all future output until it is reset with another title or ‘cancelled’ with

title;

run;

Task 6

Experiment with the TITLE statement in program class2.sas.

14 Creating new variables

If you need to analyse variables, that are derived from the input variables, then you must create these variables in the DATA step. For example, if you want to use two new variables ‘the difference in pulse rates’ and ‘the log of the number of bends’ then these variables must be defined before the lines of data are read in. The rules about naming new variables are the same as for input variables.

data class2;

input height weight sex $ bends pulse1 pulse2;

diff=pulse2-pulse1;

lnbends=log(bends);

datalines;

...

Run;

Some commonly used operators and functions are as follows:

|Operator |Meaning | |Function | |

|* |multiplication | |log( ) |natural log |

|/ |division | |exp( ) |exponential |

|** |exponentiation | |sqrt( ) |square root |

15 Printing and saving SAS output

The contents of the OUTPUT window may be sent to a printer using the OUTPUT window print command. You can change the way the output looks using, for example, the LINESIZE and PAGESIZE options (see Section 4.1 or SAS Help).The whole of the OUTPUT window listing may be saved as filename.lst using the OUTPUT window save command.

In WINDOWS, it is often convenient to copy all or part of the OUTPUT window into a Word document or another text processing software. This can be achieved with the copy and paste operation. However the results might be disappointing. The appearance may be improved by

• using a fixed space font such as SAS Monospace (available if SAS is running)

• avoiding ‘wrap around’ by reducing the font size and avoiding unnecessary leading spaces.

See Section 12.2 for further advice on incorporating SAS numeric and graphical output into a Word document.

Task 7 (WINDOWS ONLY)

Create a Word document (using copy and paste) which consists of program class2.sas and the output it produces. Experiment with improving the layout of the document.

Exercises

1.1 The following table shows the heights and weights of 16 eleven-year-old girls.

|ID no |Height(cm) |Weight(kg) | |ID no |Height(cm) |Weight(kg) |

|59 |135 |25 | |71 |133 |30 |

|82 |146 |33 | |78 |149 |35 |

|27 |153 |56 | |12 |141 |33 |

|52 |154 |51 | |37 |164 |48 |

|55 |139 |31 | |28 |146 |37 |

|13 |131 |25 | |48 |149 |45 |

|01 |149 |43 | |69 |147 |36 |

|15 |137 |32 | |16 |152 |47 |

(a) Write the SAS statements to create a SAS data set called ELEVEN. The ID number should be stored as a character variable.

(b) Insert a comment statement to indicate that this data came from Exercise 1.1.

(c) Create a new variable which is the ratio of weight to height.

(d) Produce a printout of the data and a table which shows the mean, standard deviation, the maximum and the minimum values for the variables height, weight and the ratio of weight to height. The output should be suitable labelled.

(e) (WINDOWS ONLY) Copy and paste your program and its output into a word document (edit to ensure an attractive appearance).

(f) How would the output have differed if you had input the ID number as a numeric variable?

1.2 Modify your program in 1.1 above in order to determine the body mass index (BMI = weight in kilograms/(height in metres)2 or BMI = W/(H * H) ).

1.3 Several measurements of water quality were taken at eight different sites along the Firth of Forth. The data are shown below.

|Site |Salinity |Phosphate |Nitrogen |Chlorophyll |Faecal |

| | | | | |Coliforms |

|CR |30.11 |0.068 |0.297 |1.693 |2.917 |

|WG |31.48 |0.059 |0.165 |1.464 |3.149 |

|EG |31.79 |0.068 |0.144 |1.100 |3.196 |

|SF |31.37 |0.185 |0.278 |1.787 |3.418 |

|PB |31.50 |0.116 |0.223 |2.099 |3.049 |

|JO |31.60 |0.106 |0.207 |1.067 |2.903 |

|SS |30.50 |0.047 |0.162 |1.563 |2.895 |

|FN |31.96 |0.060 |0.130 |0.753 |2.797 |

(a) Write the statements to create a SAS data set called FORTH.

(b) The units of phosphate are mg/litre. Create a new variable which gives phosphate in units of (g/litre where 1mg = 1000(g (1 milligram = 1,000 micrograms).

(c) Produce a table which shows summary values for each variable.

(d) Save the program in a file called forth.sas.

Data files and SAS data sets

1 Reading data files using the INFILE statement

In the examples in the previous section you created temporary SAS data sets from data, which were included in the program, with a DATALINES statement. In practice, a large set of data is more likely to be available as a raw data file (known as an ASCII or text file) and it will be more convenient to read the external data directly into SAS.

To illustrate this we will create a small ASCII data set using the Notepad editor, read the data file into SAS and then print the contents.

Task 1

Open Notepad / text editor and type in the following data set. Save it as blood.txt on your floppy disk. (Note that Notepad automatically gives the extension .txt.)

1 107 100

2 110 114

3 123 105

4 129 112

5 112 115

6 111 116

7 107 106

8 112 102

9 136 125

10 102 104

The variables in the data set are patient number and blood pressure measurements before and after treatment.

The code required to input the data into SAS and get a printout in the Output Window is as follows.

data blood;

infile 'a:\blood.txt';

input patient $ before after;

run;

proc print;

run;

The only changes that are required to the previous method of data input are that:

the INPUT statement is preceded by an INFILE statement to tell SAS where to find the external data file.

the DATALINES statement and the lines of data are omitted.

Task 2

Type the above SAS program into the EDITOR window and save the program under a suitable name. Submit the program and confirm that the values of the variables together with variable names have been printed in the OUTPUT window.

You can verify that the data is stored in the correct location on your hard drive. An example is given in Figure 6.

Figure 6 The data set Blood in the SAS Work library and temporary directory

2 LIBNAME and permanent SAS Data Sets

In the programs you have written so far the data set used in any analysis has been created in the data step. Such a data set is described as temporary in the sense that it only exists during your current SAS session and will be deleted when you close SAS.

This kind of temporary file is stored in a SAS library called WORK. You can check what files you have created in the current session by going to the EXPLORER window and clicking on Libraries and then on the library WORK.

(Use View ( Up One Level

or View ( Show Tree

to navigate back to the original EXPLORER window)

SAS files are given a two part name. The first part of the name is the library name in which the file is stored and the second part is the name of the particular file. You probably noticed that when you created the previous data sets, for example class2, that SAS referred to this file in the log window as WORK.CLASS2.

If you wanted to do further analyses on this type of data set in a different session you would need to recreate the data set by running the data step once again. This can be a time-consuming process especially if you have a large amount of data and have created many new variables, changed the format of variables and so on. The alternative approach is to create a permanent SAS data set. This is a special type of file, unique to SAS, which stores the data, variable names and other information such as formats.

You can set up a library to store your data sets and save them so that they can be used in another SAS session. The SAS LIBNAME statement defines the name of the library where the file is to be stored. For example, if you want to store your data on your own disk in drive A, then you need to give this a SAS libname using a statement like the one below. The actual name of the library, in this case mydisk, is chosen by the programmer and is just a convenient name that can be referred to later in the program.

LIBNAME myadisk ‘a:\’;

LIBNAME mydisk ‘h:\MA71064 Statistical computing\’;

LIBNAME myhomedisk ‘c:\My documents\Napier\MA71064 Stat Comp\’;

You can then save your SAS data sets in this library using a two level SAS name. The first part of the name is the libname and the second part is the name given to the SAS data set. So to create the permanent SAS data set called class2, on the H: drive, would require the following SAS code.

libname mydisk ‘h:\MA71064 Statistical computing\’;

* The data library called ‘mydisk’ will;

* be located on the H: drive;

data mydisk.class2; *Create the new data set class2;

input height weight sex $ bends pulse1 pulse2;

datalines;

152 45.4 F 6 61 84

178 53.0 M 8 59 102

165 61.7 M 10 70 110

...................

175 74.1 M 5 76 102

160 49.5 F 2 67 118

run;

Task 3

Modify your program, class2.sas, to create a permanent data set. (If you want the data set stored on a hard drive make sure you give the full path name of the required directory.) Check the messages in the LOG Window and check you can see the permanent data set in the EXPLORER window.

SAS version 8 puts an automatic SAS7BDAT extension on permanent data sets (version 6 uses SD2).

Task 4

Go to Windows explorer and check that you have a file class2.SAS7BDAT in the appropriate directory.

3 Referencing a permanent SAS data set

Suppose that you have a permanent SAS data set stored in a particular directory. You may have created this yourself or possibly have downloaded it from the web. You may carry out procedures on the data set directly by using the DATA option in the procedure statement. All the details of variable names and so on will be held in the data set.

In the following example permanent SAS data set prac1 is stored in directory h:\sas\sasdata.

libname xyz ‘h:\sas\sasdata’;

proc print data = xyz.prac1;

proc means data = xyz.prac1;

run;

Note that the first part of the name given by LIBNAME is a pointer to a directory and does not have to be the same name as was used when the data set was created. It is the second part of the name that refers to the particular data set.

4 Contents of a file

You can use the SAS procedure CONTENTS to get information about a data set and a list of the variables it contains. This procedure is useful for larger data sets that would be too long or have too many variables to list completely, and it gives you information about when and where the data set was last modified. For example,

proc contents data = xyz.prac1;

run;

Note: if you have already submitted a LIBNAME statement in the current SAS session, it is not necessary to do so again. You can simply refer to the two-level data set name.

Task 5

Get information about the SAS data set COMPANY stored in SASHELP. What information is given in each of the columns?

An alternative way of inspecting what variables are in a large file is to print out only the first few observations. This can be done using an option in the PROC PRINT statement.

proc print data=pany (obs=6);

run;

Remember you can also view data sets from the EXPLORER window

5 Importing data from other packages

Software such as Excel, Minitab and SPSS store data in file types unique to themselves. Some packages have the ability to export into or import out of other formats.

Use the import wizard or PROC IMPORT (see SAS Help).

A safe approach to importing data from such application software into SAS is to export from the other package into ASCII format and input the resulting file into SAS in a DATA step.

Large data sets from outside sources (other companies or organisations) are usually supplied in ASCII format since such data is often held in proprietary databases. Most software allows data to be written to an ASCII or raw data file. This approach is illustrated bellow:

| |export | |INFILE statement | |

|Application data file |( |ASCII data file |( |SAS data set |

It is a good idea to check the ASCII file with an editor such as WordPad (and possibly ‘tidy up’ if necessary). The ASCII file can then be read into SAS using the INFILE statement assigning variable names with the INPUT statement (as explained in Section 2.1).

Advice on importing data into SAS from popular applications software is summarised below.

Excel (WINDOWS)

For a spreadsheet containing data only (values of variables in columns):

Right align columns (necessary for character data)

File (Save As(Formatted Text (Space delimited)

to give ASCII file filename.prn.

Edit filename.prn with WordPad if column headings need deleting, missing values need replacing with ‘.’ etc.

Minitab

File(Other Files(Export Special Text

Specify columns (accept Period Decimel Separator).

Results in ASCII file filename.dat.

Note that Minitab’s missing value symbol is ‘*’. SAS will find this invalid and replace by ‘.’.

SPSS

File(Other Files(Fixed ASCII

Results in filename.dat.

Note that SPSS’s missing value symbol is ‘.’ However this will be blank in filename.dat and cause SAS to misread the data set when using simple list input.

Alternative: SPSS allows data to be saved directly as a permanent SAS data set:

File (Save As(SASv7 Windows long extension

In recent versions of SAS (e.g. 9.2), SPSS files can be imported directly.

6 Missing values

Uncoded missing values present special problems for using list input. To provide some protection for the integrity of your output data set when input data contain uncoded missing input values, use the MISSOVER or STOPOVER options in the INFILE statement. Use the MISSOVER option to set all remaining variables in the INPUT statement to missing. Use the STOPOVER option to prevent an observation from being written to the data set when the input line does not contain a value for each variable in the INPUT statement and to stop the DATA step from further processing.

e.g. the program

data test1;

input id $ var1 var2 var3 var4 var5;

datalines;

1001 115 45 65 83 78

1002 86 27 55 86

1004 93 52 63 76 88

1015 73 35 43 112 108

;

run;

would result in the following inaccurate data set

|obs |id |var1 |var2 |var3 |var4 |var5 |

|1 |1001 |115 |45 |65 |83 |78 |

|2 |1002 |86 |27 |55 |86 |1004 |

|3 |1015 |73 |35 |43 |112 |108 |

If we use the MISSOVER option i.e.

data test1;

infile cards missover;

input id $ var1 var2 var3 var4 var5;

cards;

1001 115 45 65 83 78

1002 86 27 55 86

1004 93 52 63 76 88

1015 73 35 43 112 108

;

run;

we will get the following data set

|obs |id |var1 |var2 |var3 |var4 |var5 |

|1 |1001 |115 |45 |65 |83 |78 |

|2 |1002 |86 |27 |55 |86 |. |

|3 |1004 |93 |52 |63 |76 |88 |

|4 |1015 |73 |35 |43 |112 |108 |

Using the MISSOVER option prevents the uncoded missing value in the second data line from causing the third record to be read incorrectly as well. The second observation is still incorrect, but the errors have been restricted to one observation. The STOPOVER option would prevent observation 2 from being written to the data set at all. In order to read the data in properly, either column input or formatted input would have to be used. (See next section)

7 The INPUT statement

The INPUT statement names the variables being read in via a DATALINES or INFILE statement and tells SAS where on the DATALINES, or on the lines of INFILE, the values of the variables can be found. There are three main types of INPUT that you can use to describe a record’s values : LIST, COLUMN and FORMATTED. The choice of which type of input you use will depend on the type and arrangement of the incoming data. The $ symbol is placed after a variable name to indicate a character variable.

In the previous examples you have used only the simplest type of INPUT, LIST INPUT. List INPUT is seldom useful for large commercial or scientific work because it is too easy to get missing values or errors in big files. It is commoner for real data to come in fixed column format, where the fields on each line are aligned in columns one under each other.

1 LIST INPUT

- the values are separated by spaces

- missing values must be represented by full stops

- by default, character values cannot be longer than 8 characters

- character values cannot contain embedded blanks

- fields must be read in order

e.g.

data one;

input height weight name $ age;

datalines;

65 150 Chris 50

60 125 Kelly 35

68 180 Leslie 29

;

run;

2 COLUMN INPUT

- data must be aligned within the column positions specified

- character values can contain embedded blanks

- input values can be read in any order

- character values can be of length 1 to 200 characters

- leading and trailing blanks within a field are ignored

e.g.

data two;

input name $ 1-7 age 9-10 birthdate $ 11-22 sport $ 23-30

;

datalines;

Ronald 40Dec 3 1954 golf

Michael 37Jul 4 1957 fishing

Laurel 33Jun 23 1961 softball

;

run;

3 FORMATTED INPUT

- character values can be of length 1 to 200 characters

- a full stop is not needed for numeric missing values

- nonstandard data, such as dates or numbers can be read in

- with the use of pointer controls, values can be read in any order

This method of input uses pointer controls and informats for reading in nonstandard data from external data files. An informat is used for reading in data containing dates, numbers with commas, etc.

The informat w.d after a variable specifies the width w and the number of decimal places d to be used in reading in a number.

e.g. for the number 2346,

• the informat 4.2 would result in the number 23.46 being read in.

• the informat 4. with no ‘d’ specified would result in the number 2346 being read in

The informat $w. after a variable specifies the length of a character variable

Dates such as 21/10/89 can be read using the informat DDMMYY8. (Note the full stop at the end of the informat)

Pointers indicate the position of a variable e.g.

@n go to column n

+n move the pointer on n positions

e.g.

/* A line of place counters is often useful to put to help alignment

000000000111111111122222222223333333333444444444455555555

1234567890123456789012345678901234567890123456789012345*/

data three;

input @1 name $7. @10 age 2.0 @14 birthdate $11.

@28 sport $8. / @9 gradyr 4.0 @16 numchild 1.0

@20 occupation $20.;

datalines;

Ronald 40 Dec 3 1954 golf

1973 2 masonry contractor

Michael 37 Jul 4 1957 fishing

1975 2 bricklayer

Laurel 33 Jun 23 1961 softball

1979 0 attorney

;

run;

/ tells the pointer to go to the next line. Once you go to the next line, you cannot move back to the previous line.

Exercises

2.1 Create a permanent data set of the data given in Exercise 1.1. How are you going to retrieve this data without having to retype it? You should be able to modify the program you have saved.

2.2 (a) Download the pulse data file Minitab version from the web or WebCT (pulse.mtw not pulse.prn). Open Minitab load pulse.mtw using File ( Open Worksheet (not Open Project). Use File ( Other Files ( Export Special Text (not File ( Save Current Worksheet as) to export the PULSE file as an ASCII file. You have to highlight the variables to export then press select. Press OK. Enter a suitable file name. Change the file type to ANSI Text Files (*.TXT). Finally press save.

(b) Create a permanent SAS data set of the data.

2.3 To illustrate the dangers of list format input, take the data file blood.dat and edit it with a text editor (notepad or Word). Make one or two mistakes in it by removing some of the entries in one or more lines. Now save it as a text file and use it to input and print a SAS data set. Examine your log file and output, to see what has gone wrong, and how you are warned.

2.4 (a) (WINDOWS ONLY) Create an Excel file containing the following data where column1 is size, column 2 is colour, column 3 is price and column 4 is transport cost. Save it as a formatted text space delimited file.

Large Red 18.97 0.25

Medium Blue 24.68 1.10

X-Large Black 29.99 1.75

Small Orange 15.89 0.90

(b) Write and submit a SAS program to read in the data using list input and print the variables colour, size and price in that order.

(c) Redo (b) using column input

(d) Redo (b) using formatted input

5. Copy the text file houses.dat from the web. The file contains the following five variables for each of the 120 houses in a survey of house prices. Examine it with an editor (Wordpad or Word).

|VARIABLE |CONTENTS |COLUMN LOCATION |

|style |Type of house |1 |

|sqfeet |Floor area |3-6 |

|bedroom |Number of bedrooms |8 |

|baths |Number of bathrooms |10-12 |

|Price |Price of house |14-19 |

Use column input to create a permanent SAS data set for the housing data and print the contents.

6. (a) Download the cars Excel file from the web.

To create a file of raw data for reading into a SAS data set:-

Open the file up in Excel.

Right align the columns.

Delete the coding information about the origin variable (in column L).

Save the data as a formatted text space delimited file (.prn extension), or as a csv file in Linux.

(b) Use this file to create a SAS data set (use column input).

To identify the column location for each variable, open the .prn file up in Notepad. Move the cursor along the row of data, taking a note of the column locations.

SAS procedures

1 Structure of a SAS program

Once you have got the data organised a simple SAS program consists of a series of procedures. You have already used three of these procedures. PROC PRINT, PROC MEANS and PROC CONTENTS. Apart from specifying which data set to use you had no control on the type of output that SAS produced. This may have given the impression that SAS is rather inflexible. However, this is far from the truth. Most procedures have several options which can be invoked and in addition there are statements which can be incorporated into a program (which themselves have options). The procedures and subsequent statements determine the nature of the output produced. Most SAS procedures use the following syntax:

PROC PROCNAME options;

STATEMENTS / statement options;

RUN;

A program will typically consist of several such blocks of code.

2 Sample program

libname unit3 'c:\sas\sasdata';

proc sort data=unit3.pulse out=sorted;

by activity;

run;

proc print data=sorted noobs N; *NOOBS removes observation numbers;

format height 6.0;

title 'Pulse data from Minitab sorted by activity';

var pulse1 pulse2 weight height;

by activity;

run;

proc freq;

tables ran smokes activity;

tables sex/nocum nopercent;

tables sex*smokes;

run;

proc means maxdec=2 mean std;

title 'Pulse rates before and after exercise';

var pulse1 pulse2;

run;

Task 1

Print the pulse data that you saved as a permanent SAS data set in Exercise 2.2. Now run the first two parts of the sample program (PROC SORT and PROC PRINT) and compare the output. Remember to specify an appropriate library.

One way of printing separate tables for different subgroups is to use a BY statement. In order to do this the data set must be already sorted by this BY variable. If you do not want to overwrite the original file then the sorted data must be stored in a new file. The statements:

proc sort data=unit3.pulse out=sorted;

by activity;

run;

sort the pulse data by activity level and store the sorted data in a new file called sorted.

Task2

Look in the libraries to see where this file is stored. Is it a permanent or a temporary data set?

The option NOOBS suppresses the observation numbers and the option N allows the sample size to be printed at the end of each table. The format statement gives an instruction to print the values of height with a maximum of six characters and no decimal places.

Task 3

Type in the rest of the sample program and see if you can work out what the remaining statements and options are doing. Look carefully at the titles. What happens if no title statement is made in a procedure?

Individual procedures will be looked at in more detail in the next few sections. Information about the options available for individual procedures is given in the SAS help though it is not always very easy to follow!

It is not strictly necessary to have a run statement between each procedure. SAS recognises that a new procedure statement indicates that the previous statements refer to the preceding procedure. However, it is generally advisable to include additional run statements and it is essential to put a run statement at the end of the program.

Summarising data

1 SAS System Options

You have probably noticed that the date and a page number are included on all the output produced by SAS. The type of output produced by SAS is determined by the system but may be changed by making use of SAS System Options.

There are dozens of options available which deal with hardware and software interfacing, and the input and processing as well as just the output of jobs.

A list of the options may be found in help. . The following are some commonly used options which may be used to change the output.

|Option |Action |

|CENTRE/NOCENTRE |Output centred / left aligned |

|DATE/NODATE |Date shown / date not shown |

|NUMBER/NONUMBER |Pages numbered / not numbered |

|PAGESIZE= |Determines the number of lines per page |

|LINESIZE= |Determines the printer line width |

|FIRSTOBS= |Specifies the first observation to include from the data set |

|OBS = |Specifies the last observation to include. This is useful for testing code |

| |using large data sets. |

|OBS = max |Includes all observations |

The following lines of code will produce a print out of observations 20 to 45 inclusively of the pulse data, with no page numbering, no date, left aligned and with 20 rows on the page.

options nonumber nodate nocentre pagesize=20 linesize=80 firstobs=20 obs=45;

libname unit4 'c:\sas\sasdata';

proc print data=unit4.pulse;

run;

Options firstobs = 1 obs = max; /* Uses all observations in

any analysis that follows./*;

SAS system options remain in place for the whole of a SAS session unless subsequently changed. If an OPTIONS statement is entered within a DATA or PROC step then it takes effect immediately. An OPTIONS statement entered outside of a step takes effect with the following step.

2 HTML output

HTML output can be turned on from the menu Tools ( Options ( Preferences.

Select the Results tab and select the Create HTML box. The dialogue windows are shown below, for both Linux and Windows. It includes an option to write the output into a specified folder. If this option is not used the output file is written into the folder specified for the work library. I

[pic][pic]

Figure 7 Dialogue to turn on HTML output (in Windows / in Linux)

3 Summary Procedures

Four procedures PROC SORT, PROC MEANS, PROC UNIVARIATE and PROC FREQ may be used to summarise data. The most commonly used options and statements for these procedures together with sample programmes are given below. The complete set of options can be obtained in SAS help,

Help(SAS Help and Documentation

Choose the SAS Products, Base SAS, SAS Procedures then Procedures. From there you should click on the procedure you require.

4 PROC SORT

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|OUT= |Specifies the name of file to store the sorted data. If no OUT option is used the |

| |original file will be overwritten. |

|Statements | |

|BY |A list of variables to sort by must be specified. DESCENDING placed before a variable |

| |name will sort the data in descending order for that variable. |

options centre pagesize=50 firstobs=1 obs=92;

proc sort data=unit4.pulse out=sorted;

by sex descending ran;

run;

proc print;

run;

The lines of code sort the pulse data by sex (males first, followed by females) and within sex by whether the students ran. Those who did not run (coded 2) are placed before those that did run (coded 1) because DESCENDING has been specified. Note that no data statement is used with PROC PRINT. SAS automatically uses the sorted data set because that was the last data set created.

Task 1

Using the first 50 observations only of the pulse data, create a data set sorted by smoking (non-smokers first) and by activity. Print out the sorted data set. Check carefully that the output is what you expect.

5 PROC MEANS

|Options |Description |

|DATA= |Data set to be used, by default will use the last data set created |

|MAXDEC= |Gives the maximum number of decimals to be used in the output (must be between 0 and 8) |

|NOPRINT |Suppresses the printing if the procedure is only being used to send summary output to a file (see OUTPUT |

| |statement). |

|ALPHA |Gives value for confidence limits (ALPHA=0.05 for 95% C.I.) |

|statistic keyword |By default PROC MEANS prints out the variable name, count, mean, std dev, min and max values. Particular |

|list |statistics may be requested. |

|Procedure options |

|N |Number of non-missing observations in a subgroup |

|NMISS |Number of missing observations |

|MEAN |Mean |

|STD |Standard deviation |

|MIN |Minimum value |

|MAX |Maximum value |

|RANGE |Range |

|STDERR |Standard error |

|CLM |Confidence limits for the mean |

| |(For additional keywords see SAS Help) |

|Statements | |

|VAR |Specify a list of numeric variables for which statistics are required. |

|BY |Specify a list of alphanumeric variables (data must be sorted by these variables). Descriptive statistics|

| |are given for each subgroup. |

|CLASS |Specify a list of alphanumeric variables. Descriptive statistics are given for each subgroup. Uses more |

| |memory than the BY command but does not need the data to be sorted. |

|OUTPUT |There are various ways of storing all or some of the summary statistics requested. Need to specify a file|

| |name using OUT=filename and which variables/statistics are required. See specimen programme for a simple |

| |example of how this can be done. |

The following lines of code may be used to get summary statistics (the means, standard deviations and standard errors) for pulse1 and pulse2 in subgroups defined by sex and whether the students ran. These summary values are stored in a file named summary.

proc means data=unit4.pulse maxdec=2 mean std stderr;

var pulse1 pulse2;

class sex ran;

output out=summary

mean = mean_p1 mean_p2

std = std_p1 std_p2

stderr = se_p1 se_p2;

run;

proc print data=summary;

run;

Task 2

Submit the previous sample program. (Remember you may need to change the library name and file name of the data set.) What data has been stored in the file ‘summary’? What does the TYPE variable indicate in the print output?

6 PROC UNIVARIATE

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|PLOT |Produces stem-and-leaf plots, boxplot and normal probability plots of the data. |

|NOPRINT |Suppresses all printing. |

|Statements |

|VAR |Specify a list of numeric variables for which statistics are required. |

|BY |Specify a list of character variables (data must be sorted). |

|OUTPUT |Need to specify a file name using OUT=filename and which statistics/variable names are required. |

Task 3

Submit the following program and see how the printout differs from that produced by PROC MEANS. How does the output file containing summary values differ?

proc univariate data=sorted plot;

var height weight;

by sex;

output out=summary

mean = mean_ht mean_wt

std = std_ht std_wt;

run;

proc print data=summary;

run;

7 PROC FREQ

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|Statements | |

|TABLES |Specify a list of alphanumeric variables for which tallies are required. Smaller |

| |subgroups may be defined by the use of an * e.g. sex*ran*activity |

|Table statements options | |

|NOCOL |Does not show column percentages |

|NOCUM |Does not show cumulative frequencies or percentages |

|NOFREQ |Does not show cell frequencies |

|NOPERCENT |Does not show cell percentages |

|NOROW |Does not show row percentages |

|CHISQ |Gives results of chi-squared tests of independence |

The following code produces frequency tables for sex and smoking habit separately and a two way table of sex and smoking habit. The output also includes the results of a chi-squared test of independence for these two variables.

proc freq data=unit4.pulse;

tables sex smokes smokes*sex/nocol norow nocum chisq;

run;

1 Chi-square test

Proc freq is used to carryout a chi-square test for the association of 2 categorical variables. In this case the null hypothesis is that there is no association between smoking and sex. The same proportion of smokers should be found amongst males and females.

It is convenient to add the row percentage to the cross tabulation as an easy way to look for a possible association. This is achieved by removing the option “norow”. The options “nocol” and “nopercent” have been left in the statement to remove clutter from the output.

proc freq data=unit4.pulse;

tables sex*smokes/ chisq nocol nopercent;

run;

The output from SAS gives

The FREQ Procedure

Table of Smokes by Sex

Smokes Sex

Frequency‚

Row Pct ‚1 ‚2 ‚ Total

ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ

1 ‚ 20 ‚ 8 ‚ 28

‚ 71.43 ‚ 28.57 ‚

ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ

2 ‚ 37 ‚ 27 ‚ 64

‚ 57.81 ‚ 42.19 ‚

ƒƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆƒƒƒƒƒƒƒƒˆ

Total 57 35 92

Statistics for Table of Smokes by Sex

Statistic DF Value Prob

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Chi-Square 1 1.5321 0.2158

Likelihood Ratio Chi-Square 1 1.5699 0.2102

Continuity Adj. Chi-Square 1 1.0089 0.3152

Mantel-Haenszel Chi-Square 1 1.5154 0.2183

Phi Coefficient 0.1290

Contingency Coefficient 0.1280

Cramer's V 0.1290

The probability of the chi-square statistic being as large as 1.5321 by chance alone is 0.2158. This indicates that there is not an association between sex and smoking in this sample.

Fisher's Exact Test

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Cell (1,1) Frequency (F) 20

Left-sided Pr = F 0.1576

Table Probability (P) 0.0886

Two-sided Pr 1000 ; * Only uses certain cases;

TABLE CatVar * NumVar ; * Specifies an output table;

FREQ NumVar3 ; * Variable giving the observation

Frequency;

MODEL YVar= XVar + . .; * Fits models;

PLOT YVar * XVar ; * Plots a scatter plot;

RUN;

9 Help

Extensive documentation on each procedure can be found by using Help ( SAS Help and Documentation. An example of the help screen for the BASE SAS procedures is shown in Figure 8. Other useful help modules are SAS/STAT and SAS/GRAPH

Figure 8 Base SAS Procedures help

Exercises

4.1 (a) For the pulse data, get a printout which shows the number of observations, the mean, standard deviation, maximum and minimum values of pulse2 in each of the four subgroups defined by whether the student smoked/did not smoke and ran/did not run. Make the printout left-aligned with the summary values shown to one decimal place.

(b) Obtain confidence limits for the four means produced in part (a). Does it appear that smoking or running on the spot had any effect on the second pulse rate?

(c) Obtain comparative boxplots of the second pulse rate in each of the four subgroups.

(d) What percentage of smokers were made to run on the spot? Get suitable SAS output to give you this information.

4.2 For the SAS data set RETAIL (From the explorer tab look in the library SASHELP) get a printout which shows the number of observations and the mean and standard deviation of retail sales in each year. Print the mean and standard deviation to two decimal places. Output the mean sales for each year to a new file called ‘summary’ and get a print out of this file

Graphs and charts

1 Graphics procedures

SAS can produce two types of graphics, high or low resolution. High resolution graphics are sent to a special graphics windows where the graphs can be edited and copied into Word documents. GPLOT and GCHART are two procedures which produce high resolution graphics: the equivalent low resolution procedures are PLOT and CHART.

2 PROC PLOT

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|Statements | |

|PLOT |Specify yvariable*xvariable. Can produce several plots with a single statement by |

| |including a list of variables in parentheses e.g. (list of n yvariables)*(list of m |

| |xvariables) will produce nm separate plots. |

|BY |Specify a list of character variables (data must be sorted) to produce separate graphs|

| |for subgroups. |

|PLOT options | |

|=‘symbol’ |Specify a symbol to be used for plotting |

|=variable |Identifies each point by the value of another variable |

The following code produces a plot of weight against height for all students, separate plots of weight against height for each sex and a single plot with a different symbol for males and females.

proc sort data=unit5.pulse out=sorted;

by sex;

proc plot data=sorted;

plot weight*height;

proc plot;

plot weight*height=’*’;

by sex;

run;

proc plot;

plot weight*height=sex;

run;

Task 1

Input the program into SAS and examine the output. (Remember to assign a LIBNAME as the first statement.) Resubmit the program using PROC GPLOT instead of PROC PLOT. What differences in the output do you observe?

When it has run successfully in PROC GPLOT, you will find that you are in a graph window. The graph can be edited in SAS by clicking on the painting icon. To come out of editing the plot, click on file and then down to end. You can save the graph to a file or cut and paste it into Word where it can be further edited if required. The graph window must be closed down (by clicking on () before another SAS program can be run.

3 PROC CHART

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|Statements | |

|HBAR |Specify variable to produce a frequency bar chart (horizontal bars) |

|VBAR |Specify variable to produce a frequency bar chart (vertical bars) |

|PIE |Specify variable to produce a pie chart. |

|BLOCK |Specify variable to use on the x-axis. Used in conjunction with GROUP and SUMVAR options to produce |

| |three-dimensional bar charts. |

|BY |Specify a list of character variables (data must be sorted) to produce separate charts for subgroups. |

|HBAR/VBAR/PIE/options |

|SUMVAR= |Specify an analysis variable the sum of which is to be shown on the y-axis |

|TYPE= |May be used on its own or in conjunction with SUMVAR to produce statistics other than the frequency or |

| |sum on the y-axis. The options for TYPE are FREQ (frequency counts), PCT (percentages), CFREQ (Cumulative|

| |frequencies), CPCT (Cumulative percentages), SUM (Totals), MEAN (Means) |

| |The default is TYPE=SUM if SUMVAR is used otherwise the default is TYPE=FREQ. |

|LEVELS= |Specifies the number of equal width classes for numeric variables. |

|MIDPOINTS= |Specifies the midpoints of classes for numeric variables |

| |MIDPOINTS=lower_limit TO upper_limit BY interval |

|DISCRETE |Prevents SAS from dividing a discrete variable into inappropriate intervals e.g ensures a variable coded |

| |from 1 to 5 will produce 5 classes. |

|GROUP= |Produces separate bar charts on the same graph for different discrete values of the GROUP variable. |

The following program illustrates some of the features of PROC CHART.

proc chart data=sorted;

hbar smokes;

by sex;

run;

proc chart;

vbar height/levels=6 group=sex sumvar=pulse2 type=mean;

run;

Task 2

Run this program and look carefully at the output obtained. Adapt the code to show other statistics on the y-axis of the vertical bar chart, a different number of bars and so on to familiarise yourself with the procedure.

Exercises

1 Plot weight against height for the data from Exercise 1.1. This should be stored somewhere as a permanent SAS data set. Use a plus sign as your plotting symbol.

2 For the pulse data:

(i) Plot the second pulse rate against weight, using a different symbol for males and females.

(ii) Produce a pie chart which shows the percentage of students who usually have particular levels of activity.

(iii) Produce a horizontal bar chart which shows the mean of the second pulse rate for those students that did and did not run.

(iv) Produce vertical bar charts side by side which show the percentage of students who have different levels of physical activity for males and females separately.

(v) Produce a three dimensional bar chart which shows the mean of the first pulse rate for subgroups defined by level of activity and smoking habit.

Correlation and Regression

Two procedures may be used to obtain information about the relationship between two or more continuous variables. PROC CORR determines the correlation coefficients between selected variables and PROC REG fits a regression model to data and allows output to be saved for further analyses. Remember that the statements and options given in these notes are only a very small subset of those that can be used with particular procedures. The help facility may be used to investigate further possibilities.

1 PROC CORR

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|SPEARMAN |Calculates the Spearman rank correlation coefficient. By default the Pearson product |

| |moment correlation is calculated. |

|Statements | |

|VAR |Specify a variable list (essential) |

|WITH |Specify a variable list to be used with VAR. The VAR variables are given at the top of|

| |the table of correlations and the WITH variables at the side. If WITH is not used a |

| |matrix of the correlations between all pairs of variables is produced. |

|BY |Specify a list of character variables (data must be sorted) to produce separate tables|

| |of correlations. |

An example of the use of PROC CORR is shown below using the data concerned with heights and weights of eleven-year-olds (Exercise 1.1).

proc corr data=unit6.eleven;

var height weight;

run;

proc corr data=unit6.eleven;

var height ;

with weight;

run;

Task 1

Try running this program and look at the difference in output produced when the WITH statement is included or not included. What do the values under the correlation coefficients indicate?

2 PROC REG

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|CORR |Prints a correlation matrix for all variables listed in the MODEL statement. |

|Statements | |

|MODEL |For simple linear regression, specify a response variable and an independent variable,|

| |response=independent variable |

|OUTPUT |Need to specify an output file using OUT=filename and a list of |

| |keywords(statistics)/names required. Commonly used keywords are p (predicted values), |

| |r (residuals), student (standardised residuals). See the specimen program for syntax. |

|PLOT |Allows scatters plots to be produced using any variables in the model or keywords in |

| |the OUTPUT statement. Note the keywords in the OUTPUT must be followed by a full-stop |

| |when used as variables in the PLOT statement. |

|BY |Specify a list of character variables (data must be sorted) to produce separate tables|

| |of correlations. |

|PLOT options | |

|OVERLAY |Superimposes several scatter plots on the same graph. |

The following code produces output which shows:

1. the correlation matrix of weight and height

2. output from a simple linear regression analysis of weight on height

3. a plot of weight against height with the predicted values (line) superimposed

4. a plot of the standardised residuals against height

5. a printout of the output file

6. a normal probability plot for each of the residuals and standardised residuals together with the results of normal probability test.

proc reg data=unit6.eleven corr;

model weight=height;

output out=regout

p=pred

r=resid

student=stdresid;

plot weight*height='*' p.*height= 'P'/overlay;

plot student.*height;

run;

proc print data=regout;

proc univariate plot normal data=regout;

var resid stdresid;

run;

Task2

Obtain the output from the previous program. Why is it important to produce residual plots and normal probability plots for the residuals from a linear regression model?

Exercises

6.1 Data set beetle, available as a permanent SAS data set on the web page, gives information on a sample of beetles and the damage they cause.

(a) Download the file and examine the data.

(b) (i) plot area against male beetle length

(ii) regress area against male beetle length

(iii) obtain a plot of the standardised residuals against length

(iv) obtain a normal probability plot of the standardised residuals.

(c) Why isn’t a simple linear regression model satisfactory? What might be a useful thing to try in order to improve the model?

(d) Similarly, investigate the relationship between the amount of frass produced and female beetle length.

Exploratory data analysis

1 SAS/INSIGHT

SAS Insight software is a tool for data exploration and analysis. It is interactive which means you can explore data through graphs and analyses linked across multiple windows. This facility can be used to identify outliers, highlight subgroups on a graph and so on. SAS/INSIGHT allows you to analyse univariate distributions, investigate multivariate distributions and fit explanatory models such as a simple linear regression model to your data.

2 Accessing SAS/INSIGHT

SAS/INSIGHT may be used with any SAS data sets that have been created previously. If you wish to investigate a permanent SAS data set make sure that you have set up an appropriate library using a LIBNAME statement before invoking SAS/INSIGHT. It is also possible to enter data directly into the SAS/INSIGHT data window.

Within SAS, select Solutions(Analyse(Interactive data analysis. The data of interest can then be accessed using Library(Data Set(Open or a new data set created in the data window using New.

3 Creating a Scatter Plot

To investigate SAS/INSIGHT initially, choose a data set you are fairly familiar with and try out some of the features. This example used the Pulse data set.

Analyse > Interactive Analysis

Figure 9 Opening the SAS/INSIGHT dialogue in Windows

[pic]

Figure 10. Opening the SAS/INSIGHT dialogue in Linux

You can create charts using the analyze menu. An example of creating a scatter plot is shown in Figure 11.

Analyse > Scatter Plot (X/Y)

Figure 11 The Scatter Plot ( X / Y ) dialogue

Figure 12 Two SAS/INSIGHT scatter plots with 3 points highlighted

A key feature of SAS/INSIGHT graphs is that they are interactive. Click on a point on one chart and all corresponding points become highlighted. An example of this interactivity is shown in Figure 1. The 3 points labelled with ‘1’ use a larger symbol and indicate the level of the variable ran (1 = yes). To click on several points just hold down the ctrl key while clicking. To highlight an area of points left click and hold in one corner of the region then drag the cursor to the opposite corner. All points in the rectangle will be highlighted. You can then turn off the points by clicking on a point ouside the rectangle.

4 Features of SAS/INSIGHT

SAS/INSIGHT software provides an extensive range of tools for investigating data and carrying out analyses. Some of the activities that you can carry out using SAS/INSIGHT are shown below.

7. enter data from the keyboard

8. identify observations in plots

9. examine all values for selected observations

10. brush observations in graphs

11. create overlaid line plots

12. rotate data in three dimensional plots

13. manipulate histograms to explore the distribution of data

14. compare distributions in box plots and mosaic plots

15. compute descriptive statistics

16. fit parametric (normal, lognormal, exponential, Weibull) and kernel density estimates

17. fit parametric cumulative distribution functions

18. create quantile-quantile plots

19. calculate correlations and principal components to find the structure of your data

20. fit a general linear model

21. create residual and leverage plots

22. transform variables

23. process data by groups for every analysis

5 Using SAS/INSIGHT

Once you have a data set in SAS/INSIGHT, manipulations and analyses are carried out by using either Edit or Analyse on the main menu.

Operations are also available from pop-up menus by:

clicking the left mouse button in the corners of graphs and tables

pressing the right mouse button over an appropriate object.

Variables to be analysed may either be selected before clicking on Analyse or entered as requested within the particular analysis window.

In the data set window:

a variable is selected by clicking on the name

several variables can be selected by holding down the left mouse button and dragging across the selection

non-contiguous variables or observations may be selected by holding down the control button and clicking on individual names or row numbers.

In WINDOWS, any plots produced can be printed directly or copied and pasted into Word documents.

Tabular output can be saved into the normal SAS Output Window using File(Save(Tables

Commands from your SAS/INSIGHT session can be recorded and later resubmitted. The FILE and INFILE options allow you to produce a file containing commands to document and reproduce a SAS/INSIGHT session. This is very useful for exploratory analysis that you need to interrupt or repeat on different sets of data.

Example

filename note ‘h:\MA71064 Statistical Computing\insight.txt’;

proc insight file = note;

run;

After doing your analysis and then exiting, your file will contain the commands that were used to create and close your Insight session. You can begin your second Insight session from where you left your first session with the following code:

filename note ‘h:\MA71064 Statistical Computing\insight.txt’;

proc insight infile = note;

run;

Alternatively, just code the FILE keyword without a filename specified and your commands will be recorded in the SAS Log window.

Task 1

To get a feel for what SAS/INSIGHT can do, work through the following exercises which are based on the pulse data.

(a) Obtain a histogram of height.

Obtain a histogram of sex in a new window.

Click on the bar representing males on the sex histogram.

Look at the histogram of height. What has changed on this histogram?

(b) Obtain comparative boxplots of weight for each sex. (Input weight as the Y-variable and sex as the X-variable.)

Click on the outlier for male weights. Which observation number is this?

Double click on this outlier. What information do you get?

(c) Highlight observation numbers 1, 31 and 67 in the data window..

Press the right mouse button and click ‘Label in plots’.

Obtain a scatter plot using pulse2 as the Y-variable, pulse1 as the X-variable, sex as the group variable and ran as the label. What information is shown on these plots?

Double click on one of the points. What information do you get?

(d) Highlight the variable names for pulse1, pulse2, height and weight in the data window.

Obtain a scatter plot.

What plots do you get? What are the values shown in each plot?

(e) Using the Fit option in the Analyse menu, input pulse2 as the Y-variable, weight as the X-variable and ran as the group variable.

Look carefully at the output obtained.

Redo the analysis to show Residual Normal QQ plots and store predicted values and standardised residuals in the data sheet.

6 Tools

Edit > Windows > Tools turns on a menu that allows data points to be coloured or selects different symbols by data value.

Figure 13 The tools dialogue labelling observations by sex

Press the coloured square from the tools window to set a particular set of points to that colour, eg colour red all those points representing people who smoke.

Similarly press one of the symbol buttons to select a given symbol for a set of observations.

Other features of SAS/INSIGHT can be investigated using the help facility.

Task 2

(a) Find out what is meant by ‘brushing observations’.

(b) Produce summary statistics and graphs for each of the continuous variables in the pulse data.

(c) Input, into the data sheet, the following data which are chloride content  (mg/l) of waters draining from a particular type of rock.

|6.0 |5.0 |0.5 |0.5 |0.6 |10.0 |

|0.4 |6.0 |1.2 |0.2 |0.7 |0.3 |

|0.2 |0.8 |0.2 |1.7 |0.5 |6.0 |

(i) Produce a boxplot of the data and comment on the distribution.

(ii) Create a new variable which is the log of the chloride content.

(iii) Check whether the log values could reasonably be assumed to have come from a normal distribution.

Modifying data and output

1 Introduction

In the previous units you have discovered how to create permanent SAS data sets and produce output using a variety of procedures. The OPTIONS statement was introduced in the unit on Describing Data which allows you to make some changes to the output produced. Generally speaking though, there has been very little flexibility in either changing the style of the output or modifying the data set that has been used. In this unit you will learn how to make changes to a permanent SAS data set and to customise some types of output.

2 SET statement

The SET statement is a very versatile statement which is used in a DATA step and enables a variety of tasks to be carried out depending on which options are used. One of its most common uses is reading observations and variables from existing SAS data sets so that further processing can take place. Another use is combining two or more data sets so that analyses can be carried out on a larger set of variables or observations.

The same operations can often be done in different ways because some SAS statements can be incorporated as options into either the DATA or SET statements. Have a look at the following examples of code which both achieve the same thing.

data unit8.beetles1 (drop=site);

set unit8.beetles;

where site=’1’;

run;

data unit8.beetles1 (drop=site);

set unit8.beetles (where=(site=’1’));

run;

Both pieces of code produce a new permanent SAS data set called beetles1 which has data from site one only and does not include site as a variable. The number 1 is shown in single quotes because site is a character variable in this data set. The DROP option allows variables not required to be omitted from the new data set. It can be included in either the DATA or SET statement but you have to be a bit careful. If DROP is used in the SET statement then the variables involved cannot be used for further processing. In this example using the DROP option with the variable site in the SET statement would result in an error message – try it!

The following table shows some of the commonly used options in the DATA and SET statements.

|Data Set Options |Description |

|DROP= |Specify one or more variables to exclude either from further processing or from the |

| |new data set |

|FIRSTOBS= |Specify the first observation required for processing |

|KEEP= |Specify one or more variables to include in further processing or in the new data set |

|OBS= |Specify the last observation required for processing |

|LABEL= |Specify names to be given to variables (see section Labelling output) |

|RENAME= |Specify new names for variables |

|WHERE= |Specify a condition to select certain observations from a SAS data set |

3 DROP and KEEP

The following code shows how the DROP and KEEP options may be used in a program.

data lengths (keep=height cond mlength flength)

damage (drop=mlength flength);

set unit8.beetles (drop=site);

run;

proc print data=lengths;

proc print data=damage;

run;

Note that more than one data set can be specified in a single DATA statement. In this example two temporary data sets are produced lengths and damage.

Task 1

Look at the preceding code and see if you can work out which variables will be contained in each data set. Submit the code and see if you are correct. If you wanted to create permanent data sets what changes to the code would you have to make?

4 Labelling output

Variable names in SAS are restricted to being eight characters in length, and by default these variable names are used as column headings. A LABEL statement may be used to associate a descriptive label with a variable. If the labels are required in the output then either a LABEL or SPLIT option must be used in the PROC PRINT statement. If a LABEL statement is made in a DATA statement then the labels are permanently associated with the variables in that SAS data set.

An example of the use of labels follows.

proc sort data=unit8.beetles1 out=sorted;

by cond height;

proc print data=sorted split=’*’;

var height mlength area frass;

by cond;

label

mlength='male*length'

area='leaf area*consumed'

frass='number of*frass*pellets';

pageby cond;

sum frass;

run;

Task 2

Submit the code in SAS and inspect the output. What effect have the PAGEBY and SUM statements had?

If you had used label names without the asterisks and used the LABEL option in the PROC PRINT statement then SAS would split the label names automatically at a suitable place but you have no control over the process.

5 PROC PRINT

|Options |Description |

|DATA= |Data set to be used, uses the last data set created by default |

|LABEL |Ensures that column labels are used in the output |

|SPLIT= |Specify a character in the label names which splits the column headings onto two or more lines|

|NOOBS |Suppresses the printing of observation numbers in the output |

|Statements | |

|VAR |Specify variables to be printed |

|ID |Specify variable to use as identification instead of the observation number |

|BY |Specify a list of character variables (data must be sorted) to produce separate tables. |

|PAGEBY |Used with BY statement to output each table on a separate page. |

|SUMBY |Prints subtotals for the specified BY variable |

|SUM |Specify numeric variables for which the sum of the values is required. |

6 PROC FORMAT

The LABEL statement allows you to give longer names to variables so that any output is easier to interpret. It is also possible to assign names to individual categories for character variables and to save these names as permanent formats. This is done using PROC FORMAT. The permanent formats are saved in a location which is specified using the LIBNAME statement with the special libref name LIBRARY. For example,

Libname library ‘C:\sasdata’;

will store the formats in a directory sasdata on the hard disk when the LIBRARY option is used in the FORMAT procedure.

Task 3

Assign a library called LIBRARY in a suitable location. (The location where you have stored your SAS permanent data sets is probably the most appropriate.)

The individual names are assigned using a value statement. These formats are independent of any particular data set and if appropriate may be used with any variable. The following example shows suitable labels for the plant height and plant condition categories from the beetles data but the format $health, for example, could be used for any variable where 1, 2 and 3 represent poor, satisfactory and good respectively.

proc format library=library;

value $height '1'='less than 10cm'

'2'='10cm ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download