Statistical Graphics Intro - with ggplot2

Presented Fall 2014

Ben Ward Department of Statistics University of California, Irvine

Read Data In
One Numerical Variable
    Summary Statistics
    Histogram
    Boxplot
One Categorical Variable
    Bar Chart and Table with Counts
Two Numerical Variables
    Scatterplot, Line, Correlation
One Numerical & One Categorical
    Side-by-side Boxplots
    Stacked Histograms
    Hollow Histograms
Two Categorical
    Two Way Table
    Bar Charts
    Mosaic Plot
Remove Unwanted Values
Labels
Saving a Plot

Stat 67

Read data into R and view the first few lines

This data set contains information about 26,058 movies that users rated on IMDB.

With the data stored in a data frame named d:

> head(d)
   X                 title year length budget rating votes  mpaa cat.rating    cat.year  Genre
1  2               $windle 2002     93     NA    5.3   200     R      (5,6] (2000,2020] Action
2  5          'A' gai waak 1983    106     NA    7.1  1259 PG-13      (7,8] (1980,2000] Action
3  6 'A' gai waak juk jaap 1987    101     NA    7.2   614 PG-13      (7,8] (1980,2000] Action
4  8 'Crocodile' Dundee II 1988    110     NA    5.0  7252            (4,5] (1980,2000] Action
5 10           'Gator Bait 1974     88     NA    3.5   100            (3,4] (1960,1980] Action
6 23         'Sheba, Baby' 1975     90     NA    5.5    91            (5,6] (1960,1980] Action
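The file that holds the movie data is not shown on the slide, so the following is only a sketch of the read-then-inspect step using a small made-up data frame and a temporary file. The names movies, path and d_toy are hypothetical. It also illustrates one plausible origin of a column named X: a CSV saved with row names comes back from read.csv() with that extra first column.

```r
# Hypothetical rows and file; the slide's actual file path is not shown
movies <- data.frame(
  title  = c("Movie A", "Movie B", "Movie C"),
  year   = c(1983, 1987, 2002),
  rating = c(7.1, 7.2, 5.3)
)
path <- tempfile(fileext = ".csv")
write.csv(movies, path)     # row names are written as an unnamed first column

d_toy <- read.csv(path)     # the unnamed column is read back as "X"
head(d_toy, n = 2)          # first two rows
str(d_toy)                  # column names and types
```

For a real file, replace path with the location of the CSV on your machine.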

In this slideshow, the graphics are made using ggplot2. You need to install the package from a repository once. After that, load it in every R session where you want to use it by typing:

library("ggplot2")
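A minimal sketch of the one-time install followed by the per-session load. The repository URL is an assumption (any CRAN mirror works), and the requireNamespace() check is simply a convenience to skip the install when the package is already present.

```r
# Install only if ggplot2 is missing; the mirror URL is one possible choice
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Load the package for the current session
library("ggplot2")
```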

IMDB Average User Rating

> summary(d$rating)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.300   5.500   6.400   6.226   7.200   9.800 
> sd(d$rating)
[1] 1.275851
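To see what these functions report, here is a small sketch on an invented vector of ratings (the values are hypothetical, not drawn from the IMDB data). It also compares the mean with the median, which is one rough hint about the direction of skew.

```r
# Hypothetical ratings for illustration
ratings <- c(2.1, 4.8, 5.5, 6.0, 6.4, 6.9, 7.2, 7.5, 8.1)

summary(ratings)   # min, quartiles, median, mean, max
sd(ratings)        # sample standard deviation (n - 1 in the denominator)
IQR(ratings)       # 3rd quartile minus 1st quartile

# A mean below the median suggests a longer tail on the low side
mean(ratings) < median(ratings)
```

In the movie data the mean (6.226) is also below the median (6.400).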

IMDB Average User Rating

ggplot(d)+geom_histogram(aes(x=rating))

[Figure: histogram with rating on the x-axis and count on the y-axis]

Is this skewed? How do the summary statistics relate to this picture?
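The apparent shape of a histogram depends on the bin width. geom_histogram() uses 30 bins unless told otherwise, so it can help to set binwidth explicitly. The sketch below uses simulated ratings (the data frame toy is an assumption standing in for d).

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the movie ratings, clipped to the 1-10 scale
toy <- data.frame(rating = pmin(pmax(rnorm(500, mean = 6.2, sd = 1.3), 1), 10))

# binwidth = 0.5 makes each bar cover half a rating point
p <- ggplot(toy) + geom_histogram(aes(x = rating), binwidth = 0.5)
print(p)
```

Trying a few bin widths is a quick way to check that any skew you see is not an artifact of the binning.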

IMDB Average User Rating

ggplot2 is not designed for univariate boxplots, but you can force one by supplying a constant x value:

ggplot(d)+geom_boxplot(aes(y=rating, x=factor(1)))

[Figure: boxplot of rating (y-axis) for the single group factor(1) (x-axis)]

Are there any outliers?
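One common way to answer this is the 1.5 x IQR rule, which is also how geom_boxplot() decides which points to draw individually. The sketch below applies the rule to the quartiles reported by summary(d$rating). Treat it as an approximation: the hinges used by the boxplot can differ slightly from the quartiles that summary() prints.

```r
# Quartiles and extremes copied from the summary(d$rating) output
q1 <- 5.5
q3 <- 7.2
lowest  <- 1.3
highest <- 9.8

iqr <- q3 - q1                   # interquartile range
lower_fence <- q1 - 1.5 * iqr    # points below this are flagged
upper_fence <- q3 + 1.5 * iqr    # points above this are flagged

lowest < lower_fence             # is the minimum flagged?
highest > upper_fence            # is the maximum flagged?
```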

Genre

> table(d$Genre)

     Action   Animation      Comedy Documentary       Drama     Romance       Short 
       2709         802        7624         712       10284        2769        1158 

ggplot(d)+geom_bar(aes(x=Genre))

[Figure: bar chart of count (y-axis) by Genre (x-axis): Action, Animation, Comedy, Documentary, Drama, Romance, Short]
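Counts are often easier to compare as proportions. As a sketch, the counts from the table can be typed in by hand and converted. The vector name counts is an assumption, and on the real data prop.table(table(d$Genre)) would give the same proportions directly.

```r
# Counts copied from the table(d$Genre) output
counts <- c(Action = 2709, Animation = 802, Comedy = 7624,
            Documentary = 712, Drama = 10284, Romance = 2769, Short = 1158)

sum(counts)                     # total number of movies: 26058
props <- counts / sum(counts)   # share of each genre
round(100 * props, 1)           # percentages by genre
names(which.max(counts))        # most common genre
```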

Rating and Budget

ggplot(d)+geom_point(aes(y=rating, x=budget))

[Figure: scatterplot of rating (y-axis, 2.5 to 10.0) against budget (x-axis, 0.0e+00 to 2.0e+08)]

Do you see a relationship between rating and budget? Is the association positive, negative, or absent?
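Budget is missing (NA) for many movies, including all six rows shown by head(d), and geom_point() drops such rows with a warning. The sketch below, on invented rows, counts how many movies have both values and uses a partly transparent point so that overlapping points are easier to see. The data frame toy and the alpha value are assumptions.

```r
library(ggplot2)

# Toy rows for illustration only; the budgets are invented
toy <- data.frame(
  rating = c(5.3, 7.1, 7.2, 5.0, 3.5, 5.5),
  budget = c(NA, 2e6, NA, 1.5e7, 5e5, NA)
)

# Keep only rows where both variables are present
ok <- complete.cases(toy[, c("rating", "budget")])
sum(ok)   # number of movies that can actually be plotted

# alpha below 1 makes overlapping points easier to see
p <- ggplot(toy[ok, ]) + geom_point(aes(y = rating, x = budget), alpha = 0.5)
print(p)
```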

Rating and Budget

You can also add a smoothed line that models y at each x value, which can help you see trends.

ggplot(d)+geom_point(aes(y=rating, x=budget))+geom_smooth(aes(y=rating, x=budget))

[Figure: scatterplot of rating (y-axis) against budget (x-axis) with a smoothed trend line]
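When several layers share the same variables, the mapping can be written once inside ggplot() instead of being repeated in each layer. The sketch below does this on simulated data (toy is an assumption, with no real trend built in) and asks geom_smooth() for a straight-line fit with method = "lm" rather than the default smoother.

```r
library(ggplot2)

set.seed(2)
# Simulated budgets and ratings; not the real movie data
toy <- data.frame(budget = runif(200, min = 0, max = 2e8))
toy$rating <- 6 + rnorm(200)

# x and y are mapped once and inherited by both layers
p <- ggplot(toy, aes(x = budget, y = rating)) +
  geom_point() +
  geom_smooth(method = "lm")   # linear fit with a confidence band
print(p)
```

Leaving out method lets ggplot2 choose a flexible smoother, which is what the slide's code does.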

You can find the correlation with

cor(x=d$rating, y=d$budget, use="complete.obs")

The correlation between rating and budget is -0.063. Is that high or low? What does it tell you about the relationship between rating and budget?
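The use="complete.obs" argument matters because budget contains missing values. A small sketch on invented numbers (the vectors rating and budget below are hypothetical) shows that cor() returns NA without it, and that the argument is equivalent to dropping incomplete pairs by hand.

```r
# Hypothetical values; NA marks a missing budget
rating <- c(5.3, 7.1, 7.2, 5.0, 3.5, 5.5)
budget <- c(NA, 2e6, 9e6, NA, 5e5, 1.2e7)

r_all <- cor(rating, budget)                         # NA, because of the missing budgets
r_ok  <- cor(rating, budget, use = "complete.obs")   # uses only complete pairs

# Same result as removing incomplete pairs manually
keep <- !is.na(rating) & !is.na(budget)
all.equal(r_ok, cor(rating[keep], budget[keep]))
```

A correlation always lies between -1 and 1, and values near 0 indicate little linear association.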
