PROC BOXPLOT - Using the CLIPFACTOR Option to Produce More Readable Figures

Heidi Nasizadeh, Sangart Inc., San Diego, CA

Michelle Wang, Sangart Inc., San Diego, CA

ABSTRACT

The BOXPLOT procedure in SAS/STAT® displays summary statistics such as the maximum, third quartile, median, mean, first quartile, and minimum. The BOXSTYLE=SCHEMATIC option shows the outlier values within a group as separate points. In exploratory analysis of "dirty data" or data prone to extreme outliers, the box and whisker elements can become too compressed for the figures to be easily interpreted. PROC BOXPLOT shows all the values in all groups and does not allow us to limit the y-axis to a value that is within the data range. Using the CLIPFACTOR option to clip the extreme values produces a more readable and useful plot.

INTRODUCTION

A box-and-whisker plot is an effective way to display the distribution of groups of numeric data. The most common boxplot draws a box that extends from the first quartile to the third quartile, marks the median as a horizontal line inside the box, extends the "whiskers" to the farthest points that are not outliers, and draws a dot for each outlier. An outlier is any point that lies more than 1.5 times the interquartile range (the distance between the first and third quartiles) beyond either end of the box.
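Written out, the outlier rule takes the following form. Here $Q_1$ and $Q_3$ denote the first and third quartiles of a group, and the names "lower fence" and "upper fence" are our own labels rather than terms used elsewhere in this paper:

```latex
\[
\mathrm{IQR} = Q_3 - Q_1, \qquad
\text{lower fence} = Q_1 - 1.5\,\mathrm{IQR}, \qquad
\text{upper fence} = Q_3 + 1.5\,\mathrm{IQR}
\]
```

A point is drawn as an outlier when it falls below the lower fence or above the upper fence. For example, with hypothetical quartiles $Q_1 = 20$ and $Q_3 = 30$, the interquartile range is 10 and the fences are 5 and 45.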

This paper describes how to use PROC BOXPLOT's clipping options when a few extreme outliers cause the box and whisker elements to become so compressed that the figure is no longer useful for interpreting the data.

The data used in this paper are not real patient data; they were programmatically generated to illustrate the various options. They are assumed to be the results of a lab test for 2 groups over time. These results contain 2 extreme values, which in turn produce compressed boxes and whiskers.
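The paper does not show how its test data were generated. The following is a minimal sketch of one way to build a comparable dataset. It reuses the names that appear in the PROC BOXPLOT code (final, lbstresn, visitnum, trt01pn); the seed, the distribution, the visit and treatment codes, the subjid variable, and the two planted extreme values are all assumptions made for illustration.

```sas
/* Illustrative data only: seed, distribution, and codes are assumptions, */
/* not the authors' values.                                               */
data final;
   call streaminit(2024);
   do trt01pn = 1 to 2;              /* assumed coding: 1 = TRT A, 2 = TRT B */
      do visitnum = 0 to 3;          /* assumed coding: 0 = Screen/Baseline, 1-3 = Day 1-3 */
         do subjid = 1 to 20;        /* hypothetical subject counter */
            lbstresn = round(rand('normal', 30, 8), 0.1);
            output;
         end;
      end;
   end;
run;

/* Plant 2 extreme values so that the default plot becomes compressed */
data final;
   set final;
   if trt01pn = 1 and visitnum = 1 and subjid = 1 then lbstresn = 450;
   if trt01pn = 2 and visitnum = 3 and subjid = 1 then lbstresn = 380;
run;

/* PROC BOXPLOT expects the input to be sorted by the group variable */
proc sort data=final;
   by visitnum;
run;
```

With this layout each visit forms one group. The paper does not show how the boxes for the two treatments were placed side by side within a visit, so this sketch should not be expected to reproduce Figures 1 and 2 exactly.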

COMMON BOX PLOT VS. A CLIPPED BOX PLOT

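A general PROC BOXPLOT step such as:

proc boxplot data=final;
   plot lbstresn*visitnum=trt01pn /
      boxstyle=schematic          /* draw outliers as separate points */
      boxwidth=3                  /* width of the boxes */
      waxis=2                     /* line thickness of the axes */
      symbollegend=legend1        /* legend for the symbol variable */
      cboxfill=white              /* fill color of the boxes */
      haxis=axis1
      vaxis=axis2 vminor=5;       /* minor tick marks on the vertical axis */
run;
quit;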

creates Figure 1, which demonstrates the effect of 2 extreme outliers. The boxes and whiskers are compressed, and the figure does not help us understand the data. PROC BOXPLOT is designed to show all the values in all groups and will ignore code that limits the y-axis to a value within the data range.
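The PLOT statement refers to axis1, axis2, and legend1, which the paper does not show. The following is a minimal sketch of global statements that would let the step run. The label text is taken from what is visible in the figures; every other choice here is an assumption, not the authors' definition.

```sas
/* Hypothetical definitions: the paper does not show its AXIS or LEGEND statements. */
/* Submit these before the PROC BOXPLOT step.                                       */
axis1 label=none;                                  /* horizontal axis (visits) */
axis2 label=(angle=90 'Standard Lab Result');      /* vertical axis label seen in the figures */
legend1 label=('Planned Treatment (N)');           /* legend title seen in the figures */
```

These statements only affect traditional SAS/GRAPH output and remain in effect until they are redefined or cleared.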

Figure 2 shows how adding the CLIPFACTOR option improves this display. The 2 extreme values are clipped, a legend at the bottom right notes the clipping, and it is easier to see how the summary statistics of the 2 groups change over time because the plot is now zoomed in on the section with the most relevant data.

An example of the benefit is visible when analyzing the data from Treatments A and B on Day 1. Figure 1's compressed boxes and whiskers give us very little understanding of the maximum, minimum, Q1, Q3, median, and mean, while Figure 2 clearly displays these values and the position of the next outliers.

Figure 1. General Box and Whisker Plot

[Figure: vertical axis "Standard Lab Result", 0 to 500; horizontal axis visits Screen/Baseline, Day 1, Day 2, Day 3; legend "Planned Treatment (N)": TRT A, TRT B]

Figure 2. Box and Whisker Plot Using the CLIPFACTOR Option

[Figure: vertical axis "Standard Lab Result", 0 to 80; horizontal axis visits Screen/Baseline, Day 1, Day 2, Day 3; legend "Planned Treatment (N)": TRT A, TRT B; clipping legend "Boxes clipped=2"]

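The SAS code generating Figure 2 is:

proc boxplot data=final;
   plot lbstresn*visitnum=trt01pn /
      clipfactor=20                     /* clipping factor, must be greater than 1 */
      clipsymbol=dot                    /* marker for clipped points */
      cliplegpos=bottom                 /* position of the clipping legend */
      cliplegend='Boxes clipped=#'      /* text of the clipping legend */
      clipsubchar='#'                   /* replaced by the number of boxes clipped */
      boxstyle=schematic
      boxwidth=3
      waxis=2
      symbollegend=legend1
      cboxfill=white
      haxis=axis1
      vaxis=axis2 vminor=5;
run;
quit;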

The CLIPFACTOR should be a value greater than one.

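Clipping is applied as follows:

1- The mean of the first quartile ($\bar{Q}_1$) and the mean of the third quartile ($\bar{Q}_3$) across all groups are calculated.

2- Values outside the range of $\bar{Q}_3 + (\bar{Q}_1 - \bar{Q}_3) \times \text{factor}$ and $\bar{Q}_1 + (\bar{Q}_3 - \bar{Q}_1) \times \text{factor}$ will be clipped.

3- Any statistics outside $\bar{Q}_3 + (\bar{Q}_1 - \bar{Q}_3) \times \text{factor}$ and $\bar{Q}_1 + (\bar{Q}_3 - \bar{Q}_1) \times \text{factor}$ will be clipped.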
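To see roughly where the clipping limits fall for a given dataset, the calculation can be approximated outside PROC BOXPLOT. The sketch below assumes that the limits are (mean Q1) + (mean Q3 - mean Q1) × factor and (mean Q3) + (mean Q1 - mean Q3) × factor, and that the groups are defined by visitnum alone. The dataset and variable names quartiles, avg_quartiles, clip_limits, q1_mean, q3_mean, lower_limit, and upper_limit are ours.

```sas
/* Step 1: first and third quartile of the analysis variable within each group */
proc means data=final noprint nway;
   class visitnum;
   var lbstresn;
   output out=quartiles q1=q1 q3=q3;
run;

/* Step 2: average the group quartiles across all groups */
proc means data=quartiles noprint;
   var q1 q3;
   output out=avg_quartiles mean=q1_mean q3_mean;
run;

/* Step 3: apply the clipping factor (20, as in the Figure 2 code) */
data clip_limits;
   set avg_quartiles;
   factor      = 20;
   upper_limit = q1_mean + (q3_mean - q1_mean) * factor;
   lower_limit = q3_mean + (q1_mean - q3_mean) * factor;
run;

proc print data=clip_limits noobs;
   var q1_mean q3_mean factor lower_limit upper_limit;
run;
```

For instance, with hypothetical mean quartiles of 20 and 30 and a factor of 20, the limits would be 30 - 200 = -170 and 20 + 200 = 220. The quartiles computed by PROC MEANS may differ slightly from those PROC BOXPLOT uses if the two procedures are run with different percentile definitions, so treat the result as an approximation.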

The clipping is applied only to the plot, not to the actual statistical values, and a legend in the chart indicates the number of boxes clipped. CLIPSYMBOL specifies how the clipped points are marked. CLIPLEGPOS positions the clipping legend, and the CLIPLEGEND and CLIPSUBCHAR options specify its content: the CLIPSUBCHAR character in the CLIPLEGEND text ('#' in the code above) is replaced by the number of boxes clipped.
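Because clipping changes only the display, it can help to produce the unclipped and clipped versions together and to note the clipping in a footnote. The following is a sketch of one way to do that. The macro name lab_boxplot, its clipopts parameter, and the footnote wording are hypothetical, and the PLOT options are reduced to the essentials.

```sas
/* Hypothetical helper macro: CLIPOPTS carries any clipping options to append */
%macro lab_boxplot(clipopts=);
   proc boxplot data=final;
      plot lbstresn*visitnum=trt01pn /
         boxstyle=schematic
         &clipopts;
   run;
%mend lab_boxplot;

/* Original, unclipped plot with no footnote */
footnote;
%lab_boxplot()

/* Clipped plot, with a footnote stating what was done */
footnote1 'Extreme values clipped with CLIPFACTOR=20; see the unclipped figure for the full range.';
%lab_boxplot(clipopts=%str(clipfactor=20 cliplegpos=bottom cliplegend='Boxes clipped=#' clipsubchar='#'))

/* Clear the footnote so it does not carry over to later output */
footnote;
```

Keeping both calls in one macro means the two figures differ only in the clipping options, which makes them easier to compare.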

CONCLUSION

Boxplots are very helpful for visualizing the distribution of data and the variations in mean, range, quartiles, and outliers over time and/or between groups. However, in exploratory analysis of 'dirty data' or data prone to extreme outliers, the elements can become compressed and the figures less useful. This paper demonstrates how the CLIPFACTOR option can help overcome this problem by clipping extreme values so that the y-axis is rescaled. Since this option is not widely known or used, we recommend producing both sets of boxplots (original and clipped) for the reviewers, with a footnote stating what has been clipped and, if applicable, an explanation.

ACKNOWLEDGMENTS

We would like to thank Mohamed Darif, Sangart's Director of Biostatistics & Programming, for his review and suggestions.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Heidi Nasizadeh

Sangart Inc.

6175 Lusk Blvd

San Diego, CA 92121

858.458.2379

hnasizadeh@

Michelle Wang

Sangart Inc.

6175 Lusk Blvd

San Diego, CA 92121

858.458.2303

mwang@

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.
