06 - Intro to graphics (with ggplot2) (part 2)

ST 597 | Spring 2017 University of Alabama

Contents

1 Cleveland Dot Plot
    1.1 Your Turn
    1.2 Cleveland Dot Plot Aesthetics
2 Line Graphs
    2.1 economics data
    2.2 Your Turn: Stock Price
3 Estimating Distributions
    3.1 Discrete Data
    3.2 Continuous Data
4 Histograms
    4.1 geom_hist()
5 Kernel Density Estimation
    5.1 geom_density()
    5.2 Some Useful Settings: geom_histogram, geom_density
6 Boxplot and Violin Plot
    6.1 Boxplot
    6.2 Violin Plot
    6.3 Discretizing continuous variables for boxplot
    6.4 Sequentile Quantiles (Fanchart, Fanplot)
    6.5 Your Turn: Old Faithful

Required Packages and Data

library(tidyverse)
library(gcookbook)
library(Lahman)   # may need to: install.packages("Lahman")

1 Cleveland Dot Plot

William Cleveland wrote a popular book on visualizing data, The Elements of Graphing Data, that has many useful suggestions. One element he stressed was reducing the cognitive strain on the viewer. One way to do this is to use as little ink as possible. The Cleveland dot plot contains the same information as a bar graph, but instead of using all the ink needed for the bar, it removes the bar altogether and places a dot at the bar height (using geom_point()).

Consider the baseball data Batting from the Lahman package. We want to display the number of home runs for each team during the 2014 season.

library(Lahman)   # load the Lahman package
data(Batting)     # load the Batting data

H = Batting %>%
  filter(yearID == 2014) %>%
  group_by(teamID) %>%
  summarize(teamHR = sum(HR), teamBA=sum(H)/sum(AB), teamR = sum(R))

glimpse(H)
#> Observations: 30
#> Variables: 4
#> $ teamID ARI, ATL, BAL, BOS, CHA, CHN, CIN, CLE, COL, DET, HOU,...
#> $ teamHR 118, 123, 211, 123, 155, 157, 131, 142, 186, 155, 163, ...
#> $ teamBA 0.2484, 0.2407, 0.2563, 0.2441, 0.2526, 0.2387, 0.2376,...
#> $ teamR 615, 573, 705, 634, 660, 614, 595, 669, 755, 757, 629, ...

I added team batting average (teamBA) and runs (teamR) to dress up our plot.

Compare the bar graph with the dot plot.

#- (left) bar graph
ggplot(H) + geom_col(aes(x=reorder(teamID, -teamHR), y=teamHR))

#- (right) corresponding dot plot
ggplot(H) + geom_point(aes(x=reorder(teamID, -teamHR), y=teamHR))

[Figure: bar graph (left) and dot plot (right) of teamHR against reorder(teamID, -teamHR), with teams ordered from BAL (most home runs) to KCA (fewest). The bar graph's y-axis starts at 0; the dot plot's y-axis runs from about 90 to 210.]

1.1 Your Turn

Your Turn #1: Dot Plot vs. Bar Plot
1. What are the differences between the two plots?
2. What aspects can be improved with the dot plot?

1.2 Cleveland Dot Plot Aesthetics

The real strength is in adding additional aesthetics, like size and color:

ggplot(H) +
  geom_point(aes(x=reorder(teamID, -teamHR), y=teamHR,
                 size=teamR, color=teamBA>.260))

[Figure: dot plot of teamHR against reorder(teamID, -teamHR), with point size mapped to teamR (legend from 550 to 750) and point color mapped to teamBA > 0.26 (FALSE/TRUE).]

Final touches include putting team on the y-axis, changing the theme, and adding a title.

#- new theme
dot_theme = theme_bw() +
  theme(panel.grid.major.x=element_blank(),
        panel.grid.minor.x=element_blank(),
        panel.grid.major.y=element_line(color="grey60", linetype="dotted"))

#- Cleveland dot plot
ggplot(H) +
  geom_point(aes(x=teamHR, y=reorder(teamID, teamHR), size=teamR, color=teamBA>.260)) +
  dot_theme +
  labs(title = "2014 MLB", x="home runs", y="team") +
  scale_color_manual(name="BA>.260", values=c("blue", "orange")) +
  scale_size(name="Runs", range=c(1,6))

[Figure: Cleveland dot plot titled "2014 MLB", with home runs on the x-axis (about 90 to 210) and team on the y-axis, ordered from BAL (most home runs) to KCA (fewest). Color legend "BA>.260" (FALSE/TRUE); size legend "Runs" (550 to 750).]

The Cleveland Dot Plot is an alternative to a bar plot. There is also a dot plot (geom_dotplot()) that is an alternative to a histogram.
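To see what the histogram-style dot plot looks like, here is a minimal sketch with geom_dotplot(). It assumes the built-in faithful data (Old Faithful waiting times) rather than the baseball data, and the 2-minute binwidth is only an illustrative choice.

```r
library(ggplot2)

# faithful ships with base R; waiting = minutes between eruptions
p_dot = ggplot(faithful, aes(x = waiting)) +
  geom_dotplot(binwidth = 2) +               # one dot per observation, stacked within 2-minute bins
  scale_y_continuous(NULL, breaks = NULL)    # the default y-axis of a dot plot is not meaningful, so hide it
p_dot
```

Each dot is a single eruption, so the stacks show the same shape a histogram would, while keeping individual observations visible.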

2 Line Graphs

2.1 economics data

The economics data from the ggplot2 package contains several monthly US economic time series.

library(dplyr)
library(ggplot2)
data(economics)
glimpse(economics)
#> Observations: 574
#> Variables: 6
#> $ date 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967...
#> $ pce 507.4, 510.5, 516.3, 512.9, 518.1, 525.8, 531.5, 534....
#> $ pop 198712, 198911, 199113, 199311, 199498, 199657, 19980...
#> $ psavert 12.5, 12.5, 11.7, 12.5, 12.5, 12.1, 11.7, 12.2, 11.6,...
#> $ uempmed 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4...
#> $ unemploy 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877,...

We can plot the number of unemployed over time with a line plot (using geom_line())

ggplot(economics, aes(date, unemploy)) + geom_line()

[Figure: line graph of unemploy (y-axis, about 4000 to 12000) against date (x-axis, ticks at 1970, 1980, 1990, 2000, 2010).]

ggplot recognizes the Date class and automatically labels the x-axis ticks with years (here, one tick per decade).
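The tick spacing on a date axis can also be set by hand with scale_x_date(). This is a minimal sketch; the 5-year spacing and the "%Y" label format are our choices for illustration, not settings taken from the plot above.

```r
library(ggplot2)

# economics ships with ggplot2; its date column has class Date
p_date = ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  scale_x_date(date_breaks = "5 years",   # place a tick every 5 years
               date_labels = "%Y")        # label each tick with the 4-digit year
p_date
```

The date_labels argument takes strftime-style codes, so "%b %Y" would give labels such as "Jan 1990" instead.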

We can fancy it up, maybe add some points:

ggplot(economics, aes(date, unemploy)) +
  geom_line(size=2, color="orange") +
  geom_point(shape=21, color='blue', fill='white', size= 1)

[Figure: the same line graph of unemploy against date, drawn as a thick orange line with small white-filled blue points.]

We can shade the region under the line with geom_area()
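A minimal sketch of the shaded version follows. The fill color, transparency, and the extra geom_line() layer are our assumptions for illustration.

```r
library(ggplot2)

p_area = ggplot(economics, aes(date, unemploy)) +
  geom_area(fill = "orange", alpha = 0.4) +   # shade from y = 0 up to the series
  geom_line(color = "blue")                   # redraw the series as a line on top
p_area
```

Because geom_area() fills down to zero, the y-axis now starts at 0 rather than near the minimum of the series.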
