SAS-JMP Tutorial ~ Univariate Displays and Summary ...



4 - Graphical Displays and Summary Statistics for Numeric Data (i.e. Descriptive Methods for Numeric Data)

4.1 ~ EXAMPLE: Monitoring DDT Levels Found in Fish in the Tennessee River Near Triana, Alabama

Background:

Olin Agrees to Clean Up DDT in Triana, Alabama Area

[EPA press release - April 21, 1983]

The U.S. Environmental Protection Agency announced today that the Olin Corporation has formally agreed to a multi-million dollar cleanup of DDT contamination around its former manufacturing facility, the Redstone Arsenal in Alabama, and to provide for health care for the residents of the nearby community of Triana.

"The agreement marks the first time an EPA enforcement action has provided for health care for an affected population," said EPA Acting Administrator Lee Verstandig.

Verstandig added, "This unique health care provision provides $5 million to establish the Triana Area Medical Fund, Inc., which will provide primary health care and monitoring of the residents. This is a non-profit corporation whose Board of Trustees will consist of representatives of both local citizens groups and the federal government."

Many of Triana's 1000 residents have been found to have elevated levels of DDT in their blood, primarily from the consumption of DDT-contaminated fish caught in Indian Creek, which runs near the plant site. The small community is about 12 miles from the Arsenal.

Felix Wynn, an 85-year-old resident of Triana, has 3,300 parts per billion of DDT in his blood. That is more DDT than has ever been found in any human being.

EPA announced general terms of the agreement on December 30, 1982. The final agreement was lodged with the U.S. District Court in Birmingham, Alabama, April 15, 1983. A notice announcing a 30-day comment period was published in the Federal Register the same day.

DDT was manufactured at the Redstone Arsenal site from 1947 to 1971 by Olin and the predecessor lessee of the site, the Calabama Chemical Co. In 1972, former EPA Administrator William D. Ruckelshaus, currently the EPA Administrator-Designate, banned the use of DDT in this country, except for limited situations.

In the late 1970s, widespread DDT contamination was discovered at the plant site, in nearby waterways (the Huntsville Spring Branch-Indian Creek, a tributary of the Tennessee River), and on more than 1400 acres of the Wheeler National Wildlife Refuge, the largest and oldest national refuge in Alabama. Elevated levels of DDT have also been detected in wildlife in the area.

In 1980, the Justice Department, at EPA's request, sued Olin, asking them to clean up the contamination. In October 1981, the site was designated one of EPA's top-priority hazardous waste sites for cleanup under the new Superfund program. Superfund is the $1.6 billion fund authorized under the Comprehensive Environmental Response, Compensation, and Liability Act of 1980 (CERCLA) which gives EPA the resources to clean up abandoned hazardous waste sites.

Under the settlement, Olin will clean up the DDT residue from the nearby Wheeler refuge and from the sediment of the Huntsville Spring Branch-Indian Creek tributary to reduce DDT levels in fish within a 10-year period.

In addition to paying for the cleanup, Olin will provide $24 million to assist residents of the contaminated area. This includes $19 million to satisfy personal injury claims of over 1000 private parties including local residents and local commercial fishermen.

Under the agreement, Olin must submit a comprehensive remedial cleanup plan for the area to a review panel of federal, state and local representatives by June 1, 1984. The review panel will approve or recommend changes to the plan and will provide complete oversight of the cleanup.

As part of the environmental cleanup, Olin also will provide short and long-term environmental monitoring for all affected areas. The EPA-Olin agreement also settles related suits against Olin by the State of Alabama and three citizens groups.

[pic]

4.2 ~ Graphical Displays for Numeric Data: The Histogram

Histograms and Outlier Boxplots

To obtain a histogram and outlier boxplot for numeric variable(s) select Distribution from the Analyze pull down menu and place the variable(s) that you wish to examine in the right-hand box. We will begin by examining a histogram for the weight of the fish sampled as part of the study. 

[pic]
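For readers who want to see what a histogram is doing behind the scenes, the following is a minimal Python sketch that counts observations per bin and prints a text histogram. The weights, the bin width of 250 g and the starting value of 500 g are made-up assumptions for illustration; they are not the Catfish.JMP data, and JMP chooses its own bins.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [start + i*w, start + (i+1)*w)."""
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# Hypothetical fish weights in grams (NOT the Catfish.JMP data).
weights = [540, 620, 790, 850, 910, 980, 1010, 1050, 1120, 1190, 1300, 1480, 1900]

counts = histogram_counts(weights, bin_width=250, start=500)
for i, c in enumerate(counts):
    lower = 500 + 250 * i
    # One asterisk per observation in the bin.
    print(f"{lower:>5}-{lower + 250:<5} {'*' * c}")
```

The long, thin tail of bars toward the larger weights is what a right-skewed distribution looks like in this kind of display.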

Three key features of a histogram

The Horizontal Layout, Prob Axis, Normal Curve & Smooth Curve options have been used in constructing the histogram for weight of the fish sampled. The locations of these options are illustrated in the graphics below:

[pic][pic]

The density or distribution curves are added by selecting the options shown below.

[pic]

[pic]

Histograms for length, weight and DDT level found in the tissue of the fish sampled are shown below.

[pic] [pic]

[pic]

We can see that the lengths of the fish sampled appear to have a left-skewed distribution with several outliers (i.e. unusual observations) on the low end. These outliers are all largemouth bass, which evidently are generally shorter than the other fish species sampled. The typical length appears to be somewhere between 42 and 45 cm. The weight distribution appears to be slightly skewed to the right, but is not far from normal, as evidenced by the fairly close agreement between the normal curve and the smooth curve density estimate. There are also a couple of outliers flagged in the boxplot. We will discuss the boxplot in more detail later. A typical weight for the fish sampled is approximately 1000 grams. The DDT concentrations of the fish sampled follow a severely right-skewed distribution with many obvious outliers on the high end, with one fish having a DDT concentration of approximately 1100 ppm.

Using the Location and Species columns to label the points in succession shows that these observations correspond to catfish and smallmouth buffalo sampled from locations 1, 8, and 13. Examination of the map shows that locations 1 and 13 are in close proximity to the plant on Indian Creek that was the source of the DDT contamination of the ecosystem. To obtain the labeling feature in JMP, right-click on the variable name in the column list on the left-hand side of the spreadsheet and select the Label/Unlabel option.

[pic]

Use Label/Unlabel to assign variables to be used as point labels when interacting with plots.

[pic]

4.3 ~ Transformations to Improve Normality

When the distribution of a variable is markedly skewed (left or right), we can often use a transformation to obtain approximate normality. The common remedy is to raise the variable to some power; this type of transformation is known as a power transformation. To remove right skewness we consider powers less than 1, such as 1/2 (i.e. square root), 1/3 (i.e. cube root), 0 (which corresponds to a log transformation), -1/2 (i.e. reciprocal square root), -1 (i.e. reciprocal), etc. As a rule of thumb, we often avoid negative power transformations because they reverse the ordering of the data, i.e. the largest observed value will become the smallest and vice versa. Also, the units of a variable transformed by a negative power can be difficult to explain. To remove left skewness, which is less common, we typically raise the variable to a power greater than 1 (e.g. 1.5, 2 or 3).

Ladder of Powers

[pic]
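As a rough illustration of the ladder of powers, the Python sketch below applies a few rungs to a small set of positive values and checks what happens to their ordering. The function name `power_transform` and the concentrations are assumptions made for this example, not part of JMP or of the study data.

```python
import math

def power_transform(x, p):
    """Ladder-of-powers transform of a positive value x; p = 0 means log10."""
    if x <= 0:
        raise ValueError("this sketch assumes strictly positive data")
    return math.log10(x) if p == 0 else x ** p

# Hypothetical right-skewed concentrations, sorted smallest to largest
# (NOT the tutorial's data).
ddt = [0.5, 2.0, 4.0, 9.0, 100.0]

sqrt_ddt = [power_transform(x, 0.5) for x in ddt]
log_ddt = [power_transform(x, 0) for x in ddt]
recip_ddt = [power_transform(x, -1) for x in ddt]

# Positive powers and the log keep the ordering; a negative power reverses it.
print(sqrt_ddt == sorted(sqrt_ddt))
print(log_ddt == sorted(log_ddt))
print(recip_ddt == sorted(recip_ddt, reverse=True))
```

All three checks come out true, which is the ordering behavior described above: the reciprocal turns the largest value into the smallest.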

From the histogram and boxplot above we see that the distribution of the DDT concentrations is extremely skewed to the right. To improve normality we will consider a transformation to the log scale. (Note: This is a very common transformation to use when working with toxicity data. The logarithmic transformation is one of the most commonly employed transformations in statistics!)

To do this in JMP you must use the JMP Calculator which allows you to perform a variety of data transformations and manipulations. To create a column containing a function of another column, double-click to the right of the last column to add a new column to the spreadsheet. Next double-click at the top of the column to obtain the column information window. In the window change the name of the new column to log10(DDT) and select Formula from the New Property pull-down menu and click Edit Formula as shown below.

[pic]

The JMP Calculator should then appear on the screen. To take the base 10 logarithm of the DDT levels, select Transcendental from the menu to the right of the calculator keypad, because the logarithm is a transcendental (non-algebraic) function. In the list that appears in the rightmost menu, select the base 10 logarithm (i.e. log10). In the formula window you should see log10( ). Now you need to supply the name of the variable you wish to take the logarithm of, which is DDT in this case, by selecting it from the list of variables to the left of the calculator keypad. Finally, click Apply and close the calculator window. The new column you created should now contain the base 10 logarithm of the DDT concentrations. The histogram and boxplot for the log base 10 scale DDT readings are shown on the following page. We can clearly see that approximate normality has been achieved through transformation of the DDT levels to the log base 10 scale.

[pic]
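Outside of JMP, the same log10 column can be sketched in a few lines of Python. The example below also computes a simple moment-based skewness before and after the transformation. The concentrations are hypothetical, and `sample_skewness` is a helper written for this illustration; JMP's reported skewness uses a slightly different small-sample formula.

```python
import math
import statistics

def sample_skewness(values):
    """Moment-based skewness: the average cubed z-score (using the population SD)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

# Hypothetical DDT concentrations in ppm (NOT the Catfish.JMP data).
ddt = [1.2, 2.5, 3.1, 4.8, 6.0, 9.5, 14.0, 33.0, 120.0]

# Equivalent of the JMP formula column log10(DDT).
log10_ddt = [math.log10(x) for x in ddt]

print("skewness, original scale:", round(sample_skewness(ddt), 2))
print("skewness, log10 scale:   ", round(sample_skewness(log10_ddt), 2))
```

For these made-up values the skewness drops sharply on the log scale, which is the numerical counterpart of the more symmetric histogram.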

4.4 ~ Types of Summary Statistics

• Measures of Central Tendency, Typical, or “Average” Value

• Measures of Spread/Variability

• Measures of Location/Relative Standing

4.5 ~ Measures of Central Tendency (mean, median, and mode)

Notation for Observations or Data

$x_1, x_2, \ldots, x_n$ where $x_i$ = ith observed value of the variable x and n = sample size

Mean

Sample Mean ($\bar{x}$) Population Mean ($\mu$)

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (N = population size)

Example:

Median

Middle value when observations are ranked from smallest to largest.

Sample Median (Med) Population Median (M)

Example:

Mode

Most frequently observed value. For data with few or no repeated values, we can think of the mode as the midpoint of the modal class in a histogram.

[pic]
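The three measures of central tendency can be tried out with Python's standard `statistics` module. The lengths below are invented for illustration (they are not the study data) and include one unusually short fish so that the effect of an outlier on the mean is visible.

```python
import statistics

# Hypothetical fish lengths in cm (NOT the Catfish.JMP data).
lengths = [30.0, 41.5, 42.0, 43.0, 43.0, 44.5, 46.0, 48.0]

mean_length = statistics.mean(lengths)      # sum of the values divided by n
median_length = statistics.median(lengths)  # average of the two middle values when n is even
mode_length = statistics.mode(lengths)      # most frequently observed value

print(mean_length, median_length, mode_length)
```

Here the single short fish pulls the mean below the median, which is one reason the median is often preferred as the "typical" value for skewed data.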

4.6 ~ Measures of Variability (range, variance/standard deviation, CV, and interquartile range)

Range

Range = Maximum Value – Minimum Value

Example:

Variance and Standard Deviation

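Sample Variance ($s^2$) Population Variance ($\sigma^2$)

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Sample Standard Deviation ($s$) Population Standard Deviation ($\sigma$)

$s = \sqrt{s^2}$ $\sigma = \sqrt{\sigma^2}$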

Example:
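As a small sketch using hypothetical weights (not the study data), the range, the sample variance and standard deviation (dividing by n - 1), and the population variance (dividing by N) can be computed with Python's `statistics` module:

```python
import statistics

# Hypothetical sample of fish weights in grams (NOT the Catfish.JMP data).
weights = [800, 900, 1000, 1100, 1200]

value_range = max(weights) - min(weights)   # maximum value minus minimum value
sample_var = statistics.variance(weights)   # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(weights)       # square root of the sample variance
pop_var = statistics.pvariance(weights)     # sum of squared deviations / N

print(value_range, sample_var, round(sample_sd, 2), pop_var)
```

The sample variance is always a little larger than the population-style variance computed from the same numbers, because it divides by n - 1 rather than n.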

Chebyshev’s Theorem and the Empirical Rule

These are used to determine the percentage of observations that lie within certain intervals centered about the mean. The intervals have the form:

mean ± k × standard deviation

where k is a positive integer. As an example consider the gestational age of infants at the time of birth.

[pic]

Chebyshev’s theorem applies to any distribution, normal or not, while the empirical rule applies only to distributions which are approximately normal.

Interval Chebyshev’s Thm Empirical Rule

In 1949, a divorce case was heard where the husband filed for divorce on the grounds of his wife’s adultery. The only evidence he had was the fact that she gave birth to a child 50 weeks (350 days) after he had gone abroad on military service. The judge hearing the case agreed that, although it was improbable that a woman would carry a baby for 350 days, it was scientifically possible and the child could have been his. Thus the judge did not grant him a divorce. What do these rules say about the likelihood of a gestational age greater than 350 days?
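The two rules can be tabulated with a short Python sketch. Chebyshev's bound, at least 1 - 1/k² of the observations within k standard deviations of the mean, and the empirical-rule percentages (about 68%, 95% and 99.7%) are standard results. The mean and standard deviation used at the end are placeholder assumptions, not the values from the gestational age figure, so substitute the values read from that display.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k SDs of the mean (any distribution)."""
    return 1 - 1 / k**2

# Empirical-rule fractions, valid only for approximately normal data.
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (1, 2, 3):
    print(f"mean ± {k} SD: Chebyshev at least {chebyshev_lower_bound(k):.1%}, "
          f"empirical rule about {EMPIRICAL_RULE[k]:.1%}")

# ASSUMED placeholder values in days; replace with the mean and SD from the display.
assumed_mean, assumed_sd = 270.0, 15.0
k_350 = (350 - assumed_mean) / assumed_sd   # how many SDs above the mean 350 days lies
max_fraction_outside = 1 / k_350**2         # Chebyshev: at most this fraction lies beyond k SDs
print(round(k_350, 2), round(max_fraction_outside, 4))
```

Note that for k = 1 Chebyshev's bound is 0%, so the theorem is only informative for k greater than 1.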

Coefficient of Variation (CV)

Measures spread relative to the size of the mean.

$CV = \frac{s}{\bar{x}} \times 100\%$

Example: Which has more variation, length (in.) or weight (g)?
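Because the CV divides the standard deviation by the mean, it has no units, which is what makes a comparison between inches and grams meaningful. A minimal sketch, using invented lengths and weights rather than the study data:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the sample mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical measurements (NOT the Catfish.JMP data).
lengths_in = [15.0, 16.5, 17.0, 17.5, 19.0]
weights_g = [700.0, 900.0, 1000.0, 1100.0, 1300.0]

cv_length = coefficient_of_variation(lengths_in)
cv_weight = coefficient_of_variation(weights_g)
print(f"CV length: {cv_length:.1f}%   CV weight: {cv_weight:.1f}%")

# Changing units (inches to cm) rescales the mean and the SD equally, so the CV is unchanged.
cv_length_cm = coefficient_of_variation([2.54 * x for x in lengths_in])
```

For these made-up numbers the weights vary more, relative to their mean, than the lengths do.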

4.7 ~ Measures of Location/Relative Standing

(Percentile/Quantiles and z-scores/Standardized Variables)

Percentiles/Quantiles

Quartiles

Interquartile Range (IQR) (another measure of variability)

Outlier Boxplots

[pic]

Any observations lying more than 1.5 × IQR below $Q_1$ or more than 1.5 × IQR above $Q_3$ are classified as outliers.
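The outlier rule can be sketched in Python as follows. The lengths are hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is an assumption made for this example; JMP may use a different quantile definition and so give slightly different fences.

```python
import statistics

def outlier_fences(values):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical fish lengths in cm (NOT the Catfish.JMP data).
lengths = [20.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0]

lower, upper = outlier_fences(lengths)
outliers = [x for x in lengths if x < lower or x > upper]
print(lower, upper, outliers)
```

Only the 20 cm fish falls outside the fences here, so it is the one point an outlier boxplot of these values would flag.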

Standardized Variables (z-scores)

The z-score for an observation $x_i$ is $z_i = \frac{x_i - \bar{x}}{s}$ (sample) or $z_i = \frac{x_i - \mu}{\sigma}$ (population).
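A short Python sketch of standardizing a sample is given below. The lengths are invented for illustration, and `z_scores` is a helper defined here, not a JMP function.

```python
import statistics

def z_scores(values):
    """Standardize a sample: subtract the sample mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical fish lengths in cm (NOT the Catfish.JMP data).
lengths = [30.0, 40.0, 42.0, 44.0, 44.0]

z = z_scores(lengths)
for length, score in zip(lengths, z):
    print(f"{length:5.1f} cm -> z = {score:+.2f}")
```

The standardized values always have mean 0 and standard deviation 1, so z-scores from different variables or species can be compared on a common scale.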

It tells us…

Example:

Which is more extreme: a catfish 24 inches in length or a smallmouth buffalo 13.5 inches in length?

Standardizing Variables in JMP

[pic]

Histogram of standardized lengths.

[pic]

4.8 ~ Summary Statistics - Measures of Central Tendency, Variability and Location in JMP

When we examine the distribution of a numeric variable in JMP (Analyze > Distribution), we automatically obtain basic summary statistics. The summary statistics for length, weight, and DDT level for the fish sampled as part of the Tennessee River study are shown below.

[pic]

To obtain the variance and coefficient of variation you also need to select More Moments from the Display Options pull-out menu.

[pic]

4.9 ~ Comparative Displays

In this study we could compare the DDT levels of the different fish species and also compare DDT levels of fish by location. We first consider the potential difference in the DDT levels in catfish found at different river locations by using comparative boxplots. Because of the profound right skewness in the DDT levels, we will use the DDT levels transformed to the logarithmic scale. To obtain a basic comparative display in JMP, select Fit Y by X from the Analyze menu and put Location in the X, Factor box and log(DDT) in the Y, Response box. The resulting display will show the log(DDT) levels plotted versus the location number. To add boxplots or other items to this plot, use the Display Options menu located within the main pull-down menu.

[pic]

The display below shows comparative boxplots for log(DDT) level across location with the X-axis proportional option turned off.

[pic]

Here we can clearly see that the fish from Flint Creek (309 miles) & Tennessee River (320 miles) have the highest DDT levels, and fish from Tenn. River (285 miles) & Tenn. River (345 miles) appear to have the lowest. It is important to note that the latter two locations are the only locations where largemouth bass were sampled.

Sample Percentiles/Quantiles by Location

[pic]

To convert these summary statistics back to the original scale use the following:

[pic]

e.g.

We can construct a similar display for comparing the log DDT measurements across species by placing Species Name instead of location in the X box.

[pic]

To obtain summary statistics for the log(DDT) levels within each species type select Quantiles and Means and Std Dev from the Oneway Analysis pull-down menu. The results are shown on the following page.

Summary Statistics for log10(DDT) by Species

[pic]
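The by-group summaries that JMP produces can be mimicked with a short Python sketch that splits the observations by species and summarizes each group separately. The species labels follow the data file's coding, but the log10(DDT) values are invented for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical (species, log10(DDT)) pairs (NOT the Catfish.JMP data).
records = [
    ("CATFISH", 0.9), ("CATFISH", 1.1), ("CATFISH", 1.3),
    ("BUFFALO", 0.2), ("BUFFALO", 0.8), ("BUFFALO", 1.4),
    ("BASS", -0.5), ("BASS", -0.2), ("BASS", 0.1),
]

# Group the values by species.
by_species = defaultdict(list)
for species, value in records:
    by_species[species].append(value)

# Median, mean and sample standard deviation within each group.
summary = {
    species: {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }
    for species, values in by_species.items()
}

for species, stats in summary.items():
    print(species, {name: round(v, 2) for name, v in stats.items()})
```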

How do different species compare in terms of summary statistics? 

Catfish clearly have the highest mean and median DDT levels in the log scale while largemouth bass have the smallest. Catfish have the smallest amount of variation, as seen by comparing the standard deviations or the coefficients of variation. [pic]

To convert these summary statistics back to the original scale use the following:

$x = 10^{\log_{10}(x)}$

For example, when converting the median DDT level found in catfish in the log 10 scale back to the original scale we have

[pic]
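The back-transformation is simply 10 raised to the log-scale value. The Python sketch below, using invented concentrations, also illustrates a point worth keeping in mind: back-transforming a median (or any other quantile) recovers the original-scale median, but back-transforming the mean of the logs gives the geometric mean, not the ordinary arithmetic mean.

```python
import math
import statistics

# Hypothetical DDT concentrations in ppm (NOT the Catfish.JMP data).
ddt = [2.0, 5.0, 10.0, 40.0, 200.0]
log10_ddt = [math.log10(x) for x in ddt]

# Median: 10 ** (median of the logs) equals the median of the original values.
median_back = 10 ** statistics.median(log10_ddt)

# Mean: 10 ** (mean of the logs) is the geometric mean of the original values.
mean_back = 10 ** statistics.mean(log10_ddt)

print(median_back, statistics.median(ddt))
print(round(mean_back, 2), round(statistics.geometric_mean(ddt), 2), statistics.mean(ddt))
```

For right-skewed data the geometric mean is smaller than the arithmetic mean, so a back-transformed log-scale mean should be reported as a geometric mean.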

CDF Plots

A CDF plot shows the estimated probability or chance that we observe a value less than or equal to a given value. The CDF plot for the fish lengths is shown below.

[pic]

For example:

We estimate that the chance a randomly selected fish is less than 40 cm is _________

We estimate that the chance a randomly selected fish is less than 50 cm is _________
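The height of a CDF plot at a given value is just the proportion of observations at or below that value. A minimal Python sketch, with hypothetical lengths rather than the study data (so the numbers it prints are not the answers to the blanks above):

```python
def ecdf(values, x):
    """Estimated probability of observing a value less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Hypothetical fish lengths in cm (NOT the Catfish.JMP data).
lengths = [30.0, 36.0, 39.5, 41.0, 42.0, 43.0, 44.0, 45.5, 47.0, 52.0]

print(ecdf(lengths, 40))   # proportion of fish at most 40 cm long
print(ecdf(lengths, 50))   # proportion of fish at most 50 cm long
```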

The plot below gives the CDF plots for the DDT levels found in each of the fish species in this study. To obtain these, select CDF Plots from the Oneway Analysis... pull-down menu. We can clearly see that we are much more likely to find a catfish with a high DDT level, e.g. there is approximately a 50% chance that we sample a catfish with a log10(DDT) level exceeding 1, which is 10 ppm in the original scale. This same chance for smallmouth buffalo is less than 25% and is estimated to be 0% for largemouth bass.

[pic]

-----------------------

The normal curve and smooth curve density estimate are added by selecting these options from the Fit Distribution pull-out menu.

[pic]

The data we will be examining come from a study conducted by the U.S. Army Corps of Engineers in the summer of 1980. The data are contained in the file Catfish.JMP.

The variables in this data file are:

• Fish ID – number ID for fish (1 – 144)

• Location - location on the river from which the fish were sampled. (FCM5 = Flint Creek 5 miles from mouth, LCM3 = Limestone Creek 3 miles from mouth, SCM1 = Spring Creek 1 mile from mouth, TRM### = Tennessee River ### miles from mouth).

Note: The source of the DDT contamination was a plant located on Indian Creek which flowed into the Tennessee River 321 miles from the mouth.

• Distance from Mouth – approximate distance of the sample location from the mouth of the Tennessee River.

• Species - fish species (CATFISH, Smallmouth BUFFALO, Largemouth BASS)

• Length - length of fish sampled (cm)

• Weight - weight of fish sampled (g)

• DDT - DDT concentration found in a fillet of the fish (parts per million - ppm)

• log10(DDT) – log base 10 of the DDT concentration

• ln(DDT) – natural log of the DDT concentration.

Quantiles – gives quantile summary statistics and adds boxplots to the display.

Means and Std Dev – gives means and standard deviations by location and adds mean/SD lines to the plot.

The options and their effects are summarized below...

Box Plots - adds quantile boxplots to the display

Mean Diamonds - adds mean diamonds to the plot

Mean Lines – adds a horizontal line showing the mean for each group/population.

Mean CI Lines – adds lines depicting the 95% confidence interval for the mean to the plot.

Mean Error Bars - adds the means and standard errors (Ch. 6) to the plot

Std Dev Lines - adds lines one standard deviation above and below the mean.

Connect Means - adds line segments connecting the individual means.

X-Axis Proportion - if checked, the space allocated to the groups will be proportional to the sample size for that group.

Points Jittered – “jitters” the points so individual observations are more easily seen.

Points Spread – staggers the points much more than jittering.

[pic]

Why use the median rather than the mean to measure typical?

Examples:

Others:

Gestational Age (days)

[pic]

Length (in.) Weight (g) [pic]

[pic]

All of the observations with extreme z-scores for length are Largemouth Bass.

To obtain z-scores associated with each observation select Standardized from the Save menu which is located within the main pull-down menu for the variable.

Three new columns labeled Std length, Std weight, and Std DDT will appear in the original spreadsheet containing the z-scores. You could examine the distribution of the z-scores themselves by using the Distribution command. Any observations with z-scores exceeding 3 in absolute value could be classified as potential outliers.

The histogram below is for length standardized using z-scores.
