[Exploratory Data Analysis] Week 1


Principles of analysis graphics

  • Principle 1: Show comparisons
    • Evidence for a hypothesis is always relative to another competing hypothesis.
    • Always ask “Compared to What?”
  • Principle 2: Show causality, mechanism, explanation, systematic structure
    • What is your causal framework for thinking about a question?
  • Principle 3: Show multivariate data
    • multivariate = more than 2 variables
    • the real world is multivariate
    • need to ‘escape flatland’
  • Principle 4: Integration of evidence
    • completely integrate words, numbers, images, diagrams
    • data graphics should make use of many modes of data presentation.
    • don’t let the tool drive the analysis.
  • Principle 5: Describe and document the evidence with appropriate labels, scales, sources, etc.
    • a data graphic should tell a complete story that is credible.
  • Principle 6: Content is king
    • analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

Exploratory Graphics

Why do we use graphics in data analysis?

  • To understand data properties
  • To find patterns in data
  • To suggest modeling strategies
  • To ‘debug’ analysis
  • To communicate results

Characteristics of exploratory graphics

  • They are made quickly
  • A large number are made
  • The goal is for personal understanding
  • Axes/legends are generally cleaned up later, not at this stage
  • Colors/size are primarily used for information
# download the data set from the internet
setwd('E:\\Dropbox\\coursera\\Exploratory Data Analysis')
if(!file.exists('data')) dir.create('data')
fileUrl <- 'https://raw.githubusercontent.com/jtleek/modules/master/04_ExploratoryAnalysis/exploratoryGraphs/data/avgpm25.csv'
download.file(fileUrl, './data/pollution.csv')
# read data
pollution <- read.csv('./data/pollution.csv', header = TRUE)
head(pollution)

Simple Summary of Data

  • One dimension
    • Five-number summary
    • Boxplot
    • Histograms
    • Density plot
    • Barplot
  • Two dimensions
    • Multiple/overlaid 1-D plots (Lattice/ggplot2)
    • Scatterplot
    • Smooth scatter plot
  • More than two dimensions
    • Overlaid/multiple 2-D plots, coplots
    • Use color, size, shape to add dimensions
    • Spinning plots
    • Actual 3-D plots(not that useful)
# Five number summary
summary(pollution$pm25)
# Boxplot
boxplot(pollution$pm25, col = 'blue')
# Histograms
hist(pollution$pm25, col = 'green')
rug(pollution$pm25)
hist(pollution$pm25, col = 'green', breaks = 100)
rug(pollution$pm25)
# Overlaying features
boxplot(pollution$pm25, col = 'blue')
abline(h = 12)
hist(pollution$pm25, col = 'green')
abline(v = 12, lwd = 2)
abline(v = median(pollution$pm25), col = 'red')
# Barplot
barplot(table(pollution$region), col = 'wheat', main = 'Number of counties in each region')
# Multiple boxplot
boxplot(pm25 ~ region, data = pollution, col = 'red')
# multiple histograms
par(mfrow = c(2,1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == 'east')$pm25, col = 'green')
hist(subset(pollution, region == 'west')$pm25, col = 'green')
# Scatter plot
par(mfrow = c(1,1))
with(pollution, plot(latitude, pm25))
abline(h = 12, lwd = 2, lty = 2)
# scatter plot using color
# factor() is needed because read.csv() returns region as character in R >= 4.0
with(pollution, plot(latitude, pm25, col = factor(region)))
abline(h = 12, lwd = 2, lty = 2)
# multiple scatterplots
par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
# main must be an argument of plot(), not of with()
with(subset(pollution, region == 'east'), plot(latitude, pm25, main = 'East'))
with(subset(pollution, region == 'west'), plot(latitude, pm25, main = 'West'))
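
The list of one-dimensional summaries mentions a density plot, which the code above does not show. A minimal sketch follows. It uses simulated values as a stand-in for pollution$pm25 (the name pm25_sim and the numbers are assumptions for illustration, not the course data), so it runs without the download.

```r
# Simulated stand-in for pollution$pm25 (hypothetical values, not the course data)
set.seed(1)
pm25_sim <- rnorm(500, mean = 10, sd = 2.5)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(pm25_sim)
print(five)

# Kernel density estimate: a smooth alternative to the histogram
dens <- density(pm25_sim)
plot(dens, main = 'Density of simulated PM2.5', xlab = 'pm25')
rug(pm25_sim)                       # individual observations along the x axis
abline(v = five[3], col = 'red')    # mark the median
```

With the real data, replacing pm25_sim by pollution$pm25 should give the corresponding plot.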

Summary

  • Exploratory plots are ‘quick and dirty’
  • Let you summarize the data (usually graphically) and highlight any broad features
  • Explore basic questions and hypotheses(and perhaps rule them out)
  • Suggest modeling strategies for the ‘next step’

Plotting

There are three plotting systems in R

  • The base plotting system

    • ‘Artist’s palette’ model
    • Start with blank canvas and build up from there
    • Start with plot function (or similar)
    • Use annotation functions to add/modify(text, lines, points, axis)

    • (Pros) Convenient, mirrors how we think of building plots and analysis

    • (Cons) Can’t go back once a plot has started (e.g. to adjust margins)
    • (Cons) Difficult to ‘translate’ to others once a new plot has been created
    • (Cons) Plot is just a series of R commands
  • The Lattice system

    • (Pros) Plots are created with a single function call (xyplot, bwplot, etc.)
    • (Pros) Most useful for conditioning types of plots: looking at how y changes with x across levels of z
    • (Pros) Things like margins/spacing are set automatically because the entire plot is specified at once
    • (Pros) Good for putting many plots on a screen
    • (Cons) Sometimes awkward to specify an entire plot in a single function call
    • (Cons) Annotation in a plot is not especially intuitive
    • (Cons) Use of panel functions and subscripts is difficult to wield and requires intense preparation
    • (Cons) Cannot ‘add’ to the plot once it is created
  • The ggplot2 system

    • Splits the difference between base and lattice in a number of ways
    • Automatically deals with spacing, text, titles but also allows you to annotate by ‘adding’ to a plot
    • Superficial similarity to lattice but generally easier/more intuitive to use
    • Default mode makes many choices for you (but you can still customize to your heart’s desire)
  • Summary

    • Base: ‘artist’s palette’ model
    • Lattice: Entire plot specified by one function; conditioning
    • ggplot2: Mixed elements of Base and Lattice
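
To make the comparison concrete, here is a rough sketch that draws the same Ozone-versus-Wind relationship with each system, using the built-in airquality data. The object names p_lattice and p_gg are illustrative. The ggplot2 part assumes that package is installed and is skipped otherwise.

```r
library(datasets)   # airquality
library(lattice)    # ships with standard R installations

# Base: draw the plot, then add annotation step by step
with(airquality, plot(Wind, Ozone, main = 'Base'))
abline(lm(Ozone ~ Wind, data = airquality), lwd = 2)

# Lattice: the whole conditioned plot is one call; print() draws it
p_lattice <- xyplot(Ozone ~ Wind | factor(Month), data = airquality,
                    layout = c(5, 1), main = 'Lattice')
print(p_lattice)

# ggplot2: build a plot object, then add layers to it
if (requireNamespace('ggplot2', quietly = TRUE)) {
    library(ggplot2)
    p_gg <- ggplot(airquality, aes(Wind, Ozone)) +
        geom_point(na.rm = TRUE) +       # drop rows with missing Ozone silently
        facet_wrap(~ Month, nrow = 1) +  # one panel per month
        ggtitle('ggplot2')
    print(p_gg)
}
```

The base version is built up by successive calls, the lattice version is fully specified in one call, and the ggplot2 version sits in between: one object that can still be extended with `+`.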

Base Plotting System

  • The core plotting and graphics engine in R is encapsulated in the following two packages
    • graphics: contains plotting functions for the ‘base’ graphing system, plot, boxplot and many others
    • grDevices: contains all the code implementing the various graphics devices, including X11, PDF, PostScript, PNG, etc.
  • The lattice plotting system is implemented using the following packages
    • lattice: contains code for producing Trellis graphics, which are independent of the ‘base’ graphics system; including functions like xyplot, bwplot, levelplot
    • grid: implements a different graphing system independent of the ‘base’ system; the lattice package builds on top of grid; we seldom call functions from grid package directly.
# Simple Base Graphics: histogram
par(mfrow = c(1,1))
library(datasets)
hist(airquality$Ozone)
# Simple Base Graphics: scatterplot
with(airquality, plot(Wind, Ozone))
# Simple Base Graphics: boxplot
airquality <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, airquality, xlab = 'month', ylab = 'Ozone(ppb)')
boxplot(Ozone~Month, airquality, xlab="Month", ylab="Ozone (ppb)",col.axis="blue",col.lab="red")

par() function is used to specify global graphics parameters

  • las: the orientation of the axis labels on the plot
  • bg: the background color
  • mar: the margin size
  • oma: the outer margin size
  • mfrow: plot layout as c(rows, columns); plots are filled row-wise
  • mfcol: plot layout as c(rows, columns); plots are filled column-wise
# type = 'n' sets up the plot without drawing points; it must be passed to plot(), not with()
with(airquality, plot(Wind, Ozone, main = 'Ozone and Wind in New York City', type = 'n'))
with(subset(airquality, Month == 5), points(Wind, Ozone, col = 'blue'))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = 'red'))
legend('topright', pch = 1, col = c('blue', 'red'), legend = c('May', 'Other months'))
# pch must be passed to plot(), not with()
with(airquality, plot(Wind, Ozone, main = 'Ozone and Wind in New York City', pch = 5))
model <- lm(Ozone ~ Wind, data = airquality)
abline(model, lwd = 2)
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1), oma = c(2, 0, 2, 0)) # bottom left top right
with(airquality, {
        plot(Wind, Ozone, main = 'Ozone and Wind')
        plot(Solar.R, Ozone, main = 'Ozone and Solar.R')
        plot(Temp, Ozone, main = 'Ozone and temperature')
        mtext('Ozone and Weather in New York City', outer = TRUE)
})
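
Because par() changes global state, settings such as mfrow and mar stay in effect for later plots on the same device. One common pattern, sketched here with illustrative variable names, is to keep the list that par() returns and pass it back to par() afterwards.

```r
library(datasets)   # airquality

# par() returns the previous values of the parameters it changes
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

with(airquality, {
    plot(Wind, Ozone, main = 'Ozone and Wind')
    plot(Temp, Ozone, main = 'Ozone and Temperature')
})

during <- par('mfrow')   # the 1 x 2 layout is active here
par(old_par)             # restore the previous layout and margins
after <- par('mfrow')    # back to whatever was set before
```

This avoids the explicit par(mfrow = c(1,1)) resets used earlier and also restores the margins.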

Graphics Devices

  • A graphics device is something where you can make a plot appear
    • A window on your computer(screen device)
    • A PDF file(file device)
    • A PNG or JPEG file(file device)
    • A scalable vector graphics (SVG) file(file device)
  • Screen devices
    • Mac: quartz()
    • Windows: windows()
    • Unix/Linux: x11()
  • Plot on the screen
    • Call a plotting function like plot, xyplot, qplot
    • The plot appears on the screen device
    • Annotate plot if necessary
    • Enjoy
  • Plot to a file
    • Explicitly launch a graphics device
    • Call a plotting function to make a plot
    • Annotate plot if necessary
    • Explicitly close graphics device with dev.off()(this is very important!!!!)
if(!file.exists('figures')) dir.create('figures')
pdf('./figures/myplot.pdf')
with(pollution, plot(pm25, latitude))
title(main = 'PM2.5 and Latitude')
dev.off()
  • Graphics File Devices: There are two basic types of file devices: vector and bitmap
    • Vector formats:
      • pdf: useful for line-type graphics, resizes well, usually portable, not efficient if a plot has many objects or points
      • svg: XML-based scalable vector graphics; supports animation and interactivity, potentially useful for web-based plots
      • win.metafile: Windows metafile format (only on Windows)
      • postscript: older format, also resizes well, usually portable, can be used to create encapsulated postscript files; Windows systems often don’t have a postscript viewer
    • Bitmap formats:
      • png: bitmapped format, good for line drawings or images with solid colors, uses lossless compression (like the old GIF format), most web browsers can read this format natively, good for plotting many points, does not resize well
      • jpeg: good for photographs or natural scenes, uses lossy compression, good for plotting many points, does not resize well, can be read by almost any computer and any web browser, not great for line drawings
      • tiff: creates bitmap files in the TIFF format; supports lossless compression
      • bmp: a native Windows bitmapped format
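
As a rough illustration of the two families, the sketch below writes the same plot once to a vector device and once to a bitmap device. The output paths (under tempdir()) and the helper draw() are assumptions for the example. The PNG step uses the cairo backend and is skipped when that is unavailable.

```r
library(datasets)   # faithful

out_dir  <- tempdir()                          # hypothetical output location
pdf_file <- file.path(out_dir, 'faithful.pdf')
png_file <- file.path(out_dir, 'faithful.png')

# Hypothetical helper so both devices receive the same plot
draw <- function() {
    with(faithful, plot(eruptions, waiting, main = 'Old Faithful Geyser data'))
}

# Vector device: width and height are given in inches
pdf(pdf_file, width = 6, height = 4)
draw()
dev.off()

# Bitmap device: width and height are in pixels by default
if (capabilities('png') && capabilities('cairo')) {
    png(png_file, width = 600, height = 400, type = 'cairo')
    draw()
    dev.off()
}

file.exists(pdf_file)
```

Zooming into the two files shows the practical difference: the PDF stays sharp at any size, while the PNG is fixed at the pixel dimensions chosen when the device was opened.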

Copy plots
  • dev.copy: copy a plot from one device to another
  • dev.copy2pdf: specifically copy a plot to a PDF file

with(faithful, plot(eruptions, waiting))
title(main = 'Old Faithful Geyser data')
dev.copy(png, file = './figures/geyserplot.png')
dev.off()