Exploratory Data Analysis, Week 2

2.1 The Lattice Plotting System

The lattice plotting system is implemented using the following packages:

  • lattice: contains code for producing Trellis graphics, which are independent of the “base” graphics system; includes functions like xyplot, bwplot, and levelplot

  • grid: implements a different graphing system independent of the “base” system; the lattice package builds on top of grid

    • We seldom call functions from the grid package directly
  • The lattice plotting system does not have a "two-phase" aspect with separate plotting and annotation like in base plotting

  • All plotting/annotation is done at once with a single function call

Lattice Functions

  • xyplot: this is the main function for creating scatterplots
  • bwplot: box-and-whiskers plots (“boxplots”)
  • histogram: histograms
  • stripplot: like a boxplot but with actual points
  • dotplot: plot dots on "violin strings"
  • splom: scatterplot matrix; like pairs in base plotting system
  • levelplot, contourplot: for plotting "image" data

Lattice Functions

Lattice functions generally take a formula for their first argument, usually of the form

xyplot(y ~ x | f * g, data)
  • We use the formula notation here, hence the ~.

  • On the left of the ~ is the y-axis variable, on the right is the x-axis variable

  • f and g are conditioning variables — they are optional

    • the * indicates an interaction between two variables
  • The second argument is the data frame or list from which the variables in the formula should be looked up

    • If no data frame or list is passed, then the parent frame is used.
  • If no other arguments are passed, there are defaults that can be used.
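
As a minimal sketch of the formula interface described above, the following uses the airquality data set that ships with R (the choice of data set and of the layout argument is an assumption made for illustration, not something taken from these notes). Ozone is the y-axis variable, Wind the x-axis variable, and Month the conditioning variable.

```r
library(lattice)

# Ozone vs. wind speed, one panel per month.
# Month is stored as a number in airquality, so it is converted to a
# factor before being used as a conditioning variable.
p <- xyplot(Ozone ~ Wind | factor(Month), data = airquality,
            layout = c(5, 1))   # 5 columns, 1 row of panels
print(p)
```

Dropping the `| factor(Month)` part gives a single scatterplot of all observations.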

Lattice Behavior

Lattice functions behave differently from base graphics functions in one critical way.

  • Base graphics functions plot data directly to the graphics device (screen, PDF file, etc.)

  • Lattice graphics functions return an object of class trellis

  • The print methods for lattice functions actually do the work of plotting the data on the graphics device.

  • Lattice functions return "plot objects" that can, in principle, be stored (but it’s usually better to just save the code + data).

  • On the command line, trellis objects are auto-printed so that it appears the function is plotting the data
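
The difference between building and printing a lattice plot can be sketched as follows, again with the built-in airquality data (an illustrative choice).

```r
library(lattice)

# Nothing is drawn here: the call only builds and stores a trellis object
p <- xyplot(Ozone ~ Wind, data = airquality)
class(p)   # "trellis"

# The drawing happens when the object is printed
print(p)
```

At the console, typing `xyplot(...)` on its own triggers auto-printing. Inside a function, a loop, or a script run with source(), auto-printing does not happen, so an explicit print() is needed.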


Lattice Panel Functions

  • Lattice functions have a panel function which controls what happens inside each panel of the plot.

  • The lattice package comes with default panel functions, but you can supply your own if you want to customize what happens in each panel

  • Panel functions receive the x/y coordinates of the data points in their panel (along with any optional arguments)


Note: if a custom panel function does not call panel.xyplot(x, y, ...), the original data points are not drawn.
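
A minimal sketch of a custom panel function, using simulated data (the variables x, y, and f below are made up for illustration). The panel function first calls panel.xyplot() to draw the points and then adds a dashed horizontal line at the median of y within each panel.

```r
library(lattice)

# Simulated data: two groups of 50 observations each
set.seed(10)
x <- rnorm(100)
f <- rep(0:1, each = 50)
y <- x + f - f * x + rnorm(100, sd = 0.5)
f <- factor(f, labels = c("Group 1", "Group 2"))

p <- xyplot(y ~ x | f, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)                # draw the data points first
  panel.abline(h = median(y), lty = 2)   # then add a dashed median line
})
print(p)
```

Removing the panel.xyplot() line leaves only the median lines in each panel.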


Many Panel Lattice Plot: Example from MAACS

  • Study: Mouse Allergen and Asthma Cohort Study (MAACS)

  • Study subjects: Children with asthma living in Baltimore City, many allergic to mouse allergen

  • Design: Observational study, baseline home visit + every 3 months for a year.

  • Question: How does indoor airborne mouse allergen vary over time and across subjects?

Ahluwalia et al., Journal of Allergy and Clinical Immunology, 2013


Summary

  • Lattice plots are constructed with a single function call to a core lattice function (e.g. xyplot)

  • Aspects like margins and spacing are automatically handled and defaults are usually sufficient

  • The lattice system is ideal for creating conditioning plots where you examine the same kind of plot under many different conditions

  • Panel functions can be specified/customized to modify what is plotted in each of the plot panels

2.2 ggplot2


What is ggplot2?

  • An implementation of The Grammar of Graphics by Leland Wilkinson
  • Written by Hadley Wickham (while he was a graduate student at Iowa State)
  • A “third” graphics system for R (along with base and lattice)
  • Available from CRAN via install.packages()
  • Web site: http://ggplot2.org (better documentation)

What is ggplot2?

  • Grammar of graphics represents an abstraction of graphics ideas/objects
  • Think “verb”, “noun”, “adjective” for graphics
  • Allows for a “theory” of graphics on which to build new graphics and graphics objects
  • “Shorten the distance from mind to page”

Grammar of Graphics

“In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars). The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system”

  • from ggplot2 book

The Basics: qplot()

  • Works much like the plot function in base graphics system
  • Looks for data in a data frame, similar to lattice, or in the parent environment
  • Plots are made up of aesthetics (size, shape, color) and geoms (points, lines)

The Basics: qplot()

  • Factors are important for indicating subsets of the data (if they are to have different properties); they should be labeled
  • The qplot() hides what goes on underneath, which is okay for most operations
  • ggplot() is the core function and very flexible for doing things qplot() cannot do
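
A short sketch comparing the two entry points, using the mpg data set that ships with ggplot2 (the data set and the variables plotted are illustrative choices). Recent ggplot2 releases mark qplot() as deprecated, so the first call may emit a warning.

```r
library(ggplot2)

# qplot(): quick, plot()-like interface; the data are looked up in mpg
p1 <- qplot(displ, hwy, data = mpg, color = drv)

# The same scatterplot written with the core ggplot() function
p2 <- ggplot(mpg, aes(displ, hwy, color = drv)) + geom_point()

print(p2)
```

Here drv is a categorical variable, so each drive type gets its own color and a legend is added automatically.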

Note: the section with the worked applications seems to be broken or missing online, so ...

Basic Components of a ggplot2 Plot

  • data frame
  • aesthetic mappings: how data are mapped to color, size
  • geoms: geometric objects like points, lines, shapes.
  • facets: for conditional plots.
  • stats: statistical transformations like binning, quantiles, smoothing.
  • scales: what scale an aesthetic map uses (example: male = red, female = blue).
  • coordinate system
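
One way to see how the components fit together is a single plot that touches each of them. This is only a sketch on the mpg data set bundled with ggplot2; the particular colors and axis limits are arbitrary choices for illustration.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy, color = drv)) +   # data frame + aesthetic mappings
  geom_point() +                                   # geom: points
  geom_smooth(method = "lm", se = FALSE) +         # stat: fitted linear trend
  facet_wrap(~ drv) +                              # facets: one panel per drive type
  scale_color_manual(                              # scale: data values -> colors
    values = c("4" = "red", "f" = "blue", "r" = "darkgreen")
  ) +
  coord_cartesian(ylim = c(10, 45))                # coordinate system
print(p)
```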

Building Plots with ggplot2

  • When building plots in ggplot2 (rather than using qplot) the “artist’s palette” model may be the closest analogy
  • Plots are built up in layers
    • Plot the data
    • Overlay a summary
    • Metadata and annotation


Example: BMI, PM2.5, Asthma

  • Mouse Allergen and Asthma Cohort Study
  • Baltimore children (age 5-17)
  • Persistent asthma, exacerbation in past year
  • Does BMI (normal vs. overweight) modify the relationship between PM2.5 and asthma symptoms?

Building Up in Layers

head(maacs)
  logpm25        bmicat NocturnalSympt logno2_new
1  1.5362 normal weight              1      1.299
2  1.5905 normal weight              0      1.295
3  1.5218 normal weight              0      1.304
4  1.4323 normal weight              0         NA
5  1.2762    overweight              8      1.108
6  0.7139    overweight              0      0.837
g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
summary(g)
data: logpm25, bmicat, NocturnalSympt, logno2_new [554x4]
mapping:  x = logpm25, y = NocturnalSympt
faceting: facet_null() 

No Plot Yet!

g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
print(g)
Error: No layers in plot

This alone does not produce a plot.

But this does:

g <- ggplot(maacs, aes(logpm25, NocturnalSympt))
g + geom_point()

Annotation

  • Labels: xlab(), ylab(), labs(), ggtitle()
  • Each of the “geom” functions has options to modify
  • For things that only make sense globally, use theme()
    • Example: theme(legend.position = "none")
  • Two standard appearance themes are included
    • theme_gray(): The default theme (gray background)
    • theme_bw(): More stark/plain
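
The annotation options above can be sketched on the mpg data set from ggplot2 (the labels, color, and point size below are illustrative choices, not taken from the original example).

```r
library(ggplot2)

g <- ggplot(mpg, aes(displ, hwy))

p <- g +
  geom_point(color = "steelblue", size = 4, alpha = 1/2) +  # options set on the geom
  labs(title = "Engine size and fuel economy",
       x = "Displacement (litres)",
       y = "Highway miles per gallon") +                    # labels
  theme_bw() +                                              # plainer built-in theme
  theme(legend.position = "none")                           # a global setting
print(p)
```

Because color is set to a constant inside geom_point() rather than mapped inside aes(), no legend is created for it.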

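alpha controls transparency.

In this example, the outlier at y = 100 is what makes the second plot look so distorted, while the first plot looks as if its top has been cut off. The point here is only to understand the difference between the two approaches; in practice you sometimes do not need the full picture, only the core of the data.

The figure shows that setting ylim drops the outlier from the plot, whereas the second approach does not drop it.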
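
The original figures are not reproduced here. The sketch below assumes the two approaches being compared are ylim() and coord_cartesian(ylim = ...), and uses simulated data with a single outlier at y = 100 (the data frame testdat is made up for illustration).

```r
library(ggplot2)

# Simulated series with one extreme outlier
set.seed(1)
testdat <- data.frame(x = 1:100, y = rnorm(100))
testdat[50, "y"] <- 100

g <- ggplot(testdat, aes(x, y))

# ylim() sets the scale limits: values outside (-3, 3) are treated as
# missing, so the outlier disappears from the plotted line
p_ylim <- g + geom_line() + ylim(-3, 3)

# coord_cartesian() only zooms the view: the outlier stays in the data,
# and the line runs off the top of the panel around x = 50
p_zoom <- g + geom_line() + coord_cartesian(ylim = c(-3, 3))

print(p_ylim)
print(p_zoom)
```

ggplot2 typically warns about removed or missing values when printing the ylim() version, which is a useful reminder that data were dropped.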

More Complex Example

  • How does the relationship between PM$_{2.5}$ and nocturnal symptoms vary by BMI and NO$_2$?
  • Unlike our previous BMI variable, NO2 is continuous
  • We need to make NO2 categorical so we can condition on it in the plotting
  • Use the cut() function for this

Making NO$_2$ Tertiles

## Calculate the tertiles of the data
cutpoints <- quantile(maacs$logno2_new, seq(0, 1, length = 4), na.rm = TRUE)

## Cut the data at the tertiles and create a new factor variable
maacs$no2tert <- cut(maacs$logno2_new, cutpoints)

## See the levels of the newly created factor variable
levels(maacs$no2tert)
[1] "(0.378,1.2]" "(1.2,1.42]"  "(1.42,2.55]"
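
The maacs data frame is not included in these notes, so the sketch below builds a synthetic stand-in with the same column names (all values are random and purely illustrative) and shows one way the tertile factor could then be used for conditioning. The choice of facet_grid() and a linear smoother is an assumption. Note that include.lowest = TRUE is added here so that the minimum value falls into the first tertile instead of becoming NA.

```r
library(ggplot2)

# Synthetic stand-in for maacs: random values, same column names
set.seed(42)
n <- 300
fake <- data.frame(
  logpm25        = rnorm(n, mean = 1.2, sd = 0.5),
  bmicat         = factor(sample(c("normal weight", "overweight"), n,
                                 replace = TRUE)),
  NocturnalSympt = rpois(n, lambda = 1),
  logno2_new     = rnorm(n, mean = 1.3, sd = 0.3)
)

# Tertile cutpoints, then a three-level factor
cutpoints <- quantile(fake$logno2_new, seq(0, 1, length.out = 4), na.rm = TRUE)
fake$no2tert <- cut(fake$logno2_new, cutpoints, include.lowest = TRUE)

# One panel per combination of BMI category and NO2 tertile
p <- ggplot(fake, aes(logpm25, NocturnalSympt)) +
  geom_point(alpha = 1/3) +
  facet_grid(bmicat ~ no2tert) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue") +
  labs(x = "log PM2.5", y = "Nocturnal symptoms")
print(p)
```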


Summary

  • ggplot2 is very powerful and flexible if you learn the “grammar” and the various elements that can be tuned/modified
  • Many more types of plots can be made; explore and mess around with the package (references mentioned in Part 1 are useful)
