R: Exploring Multiple Variables

salt2020

Article link: https://blog.csdn.net/Guo_ya_nan/article/details/81193567

3) Third Qualitative Variable

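library(ggplot2)

# Box plot of age by gender; the x marks each group's mean
# (fun.y is the pre-3.3.0 ggplot2 spelling; newer versions prefer fun)
ggplot(aes(x = gender, y = age),
       data = subset(pf, !is.na(gender))) +
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = 'point', shape = 4)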


4) Plotting Conditional Summaries

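# Method 1: let ggplot2 compute the median friend count at each age
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = 'summary', fun.y = median)

# Method 2: summarise with dplyr first, then plot the summary
library(dplyr)
# Mean, median, and number of users for each (age, gender) pair
pf.fc_by_age_gender <- pf %>%
  filter(!is.na(gender)) %>%
  group_by(age, gender) %>%
  summarise(mean_friend_count = mean(friend_count),
            median_friend_count = median(friend_count),
            n = n())

ggplot(data = pf.fc_by_age_gender,
       aes(x = age, y = median_friend_count)) +
  geom_line(aes(color = gender))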

You can include multiple variables to split the data frame when using the group_by() function in the dplyr package:
new_groupings <- group_by(data, variable1, variable2)
or, using chained commands:
new_data_frame <- data_frame %>%
  group_by(variable1, variable2)

Repeated use of summarise() and group_by(): the summarise() function automatically removes one level of grouping (the last group it collapsed).
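To make the grouping behaviour above concrete, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and are not taken from the Facebook dataset; only `group_by()`, `summarise()`, `n()`, and `group_vars()` from dplyr are used.

```r
library(dplyr)

# Hypothetical toy data standing in for pf; the values are invented.
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14, 14),
  gender       = c("female", "male", "male", "female", "female", "male"),
  friend_count = c(10, 20, 40, 5, 15, 30)
)

# First summarise(): one row per (age, gender) pair.
by_age_gender <- toy %>%
  group_by(age, gender) %>%
  summarise(median_friend_count = median(friend_count),
            n = n())

# The last grouping level (gender) has been peeled off; only age remains.
group_vars(by_age_gender)

# Second summarise(): collapses over gender, leaving one row per age.
by_age <- by_age_gender %>%
  summarise(n_total = sum(n))
by_age
```

Recent dplyr versions also print a message about the remaining grouping after the first `summarise()`; that message is informational, not an error.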

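Note how this plot differs from the histogram. Earlier, a histogram/frequency polygon grouped by gender showed friend_count for each of the two genders; in essence it examined a single variable, friend_count.
In that plot, the number of female users below 500 was clearly lower than the number of male users, which looks like the opposite of what is shown here and is confusing at first. On reflection, though, the two plots really should differ:
A histogram explores only one variable, namely how that variable is distributed. If a group has very high counts in the bins with small friend_count, its mean or median will naturally be small; a group that does not have such high counts in the small friend_count bins will naturally have a somewhat larger mean or median.
In other words, for a histogram, the more of a group is piled up on the left, the smaller its mean tends to be.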
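The following sketch illustrates this reasoning with invented numbers (the two vectors are hypothetical, not the real friend_count data). The female group has fewer users below 500 in absolute terms, yet a higher median and mean, because a smaller share of that group sits in the lowest bin.

```r
# Hypothetical friend counts; the numbers are invented for illustration.
male   <- c(rep(50, 60), rep(300, 30), rep(900, 10))  # 100 users, 60% in the lowest bin
female <- c(rep(50, 20), rep(300, 20), rep(900, 10))  # 50 users, 40% in the lowest bin

# Raw counts per bin, which is what a histogram displays.
bins <- c(0, 100, 500, 1000)
table(cut(male, bins))
table(cut(female, bins))

# Share of each group per bin, which is what drives the mean and median.
prop.table(table(cut(male, bins)))
prop.table(table(cut(female, bins)))

# Summary statistics per group.
median(male); median(female)
mean(male);   mean(female)
```

So a histogram of raw counts and a plot of per-group medians answer different questions, and the two can point in opposite directions when the groups differ in size.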

Your code should look similar to the code we used to make the plot the first time, but it will not need the stat and fun.y parameters, because the medians are already computed in the summary data frame. For reference, the first version was:

ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = 'summary', fun.y = median)

You could also use color = gender inside the aes() wrapper of ggplot().

6) Wide and Long Format

install.packages("tidyr")
library(tidyr)

spread(subset(pf.fc_by_age_gender, 
       select = c('gender', 'age', 'median_friend_count')), 
       gender, median_friend_count)

You will find the tidyr package easier to use than the reshape2 package. Both packages can get the job done.
An Introduction to reshape2 by Sean Anderson
Converting Between Long and Wide Format
Melt Data Frames
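As a side note, tidyr 1.0.0 and later mark spread() as superseded in favour of pivot_wider(). The sketch below shows the same long-to-wide conversion with the newer function; the `long` data frame and its values are hypothetical stand-ins for pf.fc_by_age_gender.

```r
library(tidyr)

# Hypothetical long-format summary: one row per (age, gender) pair.
long <- data.frame(
  age                 = c(13, 13, 14, 14),
  gender              = c("female", "male", "female", "male"),
  median_friend_count = c(148, 55, 224, 92)
)

# Wide format: one row per age, one column per gender.
wide <- pivot_wider(long,
                    names_from  = gender,
                    values_from = median_friend_count)
wide

# And back to long format again.
long_again <- pivot_longer(wide,
                           cols      = c(female, male),
                           names_to  = "gender",
                           values_to = "median_friend_count")
long_again
```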

7) Reshaping Data

library(reshape2)

pf.fc_by_age_gender.wide <- dcast(pf.fc_by_age_gender,
                                  age ~ gender,
                                  value.var = 'median_friend_count')


8) Ratio Plot

The linetype parameter can take the values 0-6:
0 = blank, 1 = solid, 2 = dashed
3 = dotted, 4 = dotdash, 5 = longdash
6 = twodash
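The post gives no code for this step, so the following is only a sketch under assumptions: it assumes the ratio of interest is the female to male median friend count at each age, and it uses a hypothetical `wide` data frame (invented values) shaped like pf.fc_by_age_gender.wide. The dashed reference line uses linetype = 2 from the list above.

```r
library(ggplot2)

# Hypothetical wide-format data: one row per age, one column per gender.
wide <- data.frame(
  age    = 13:17,
  female = c(148, 224, 276, 258, 245),
  male   = c(55, 92, 106, 136, 125)
)

# Ratio of female to male median friend count at each age.
ratio_plot <- ggplot(wide, aes(x = age, y = female / male)) +
  geom_line() +
  # Dashed (linetype = 2) reference line at a ratio of 1.
  geom_hline(yintercept = 1, linetype = 2, alpha = 0.3)

ratio_plot
```

Points above the dashed line are ages where the female median exceeds the male median.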

9) Third Quantitative Variable

# Derive the year each user joined from tenure (in days), counting back from 2014
pf$year_joined <- floor(2014 - pf$tenure / 365)
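A quick worked example of this formula on a few hypothetical tenure values (the tenures are invented; the reference year 2014 and the divisor 365 come from the line above):

```r
# Hypothetical tenures in days.
tenure <- c(0, 200, 365, 400, 1000, 3650)

# Same formula as in the post: whole years counted back from 2014.
year_joined <- floor(2014 - tenure / 365)

data.frame(tenure, year_joined)
```

A tenure of 400 days gives 2014 - 1.096 = 2012.904, which floor() turns into 2012.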

10) Cut a Variable


pf$year_joined.bucket <- cut(pf$year_joined, breaks = c(2004, 2009, 2011, 2012, 2014))

The Cut Function
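To see which bucket each year lands in, here is a small sketch on hypothetical year values, using the same breaks as above. By default cut() builds intervals that are open on the left and closed on the right, so a value equal to the lowest break (2004 here) falls outside every bucket and becomes NA unless include.lowest = TRUE is set.

```r
# Hypothetical join years, plus one missing value.
year_joined <- c(2004, 2005, 2009, 2010, 2011, 2012, 2013, 2014, NA)

bucket <- cut(year_joined, breaks = c(2004, 2009, 2011, 2012, 2014))

# The four interval labels produced by the breaks.
levels(bucket)

# Year next to its bucket.
data.frame(year_joined, bucket)

# Count members per bucket; useNA = "ifany" adds an NA entry when needed.
table(bucket, useNA = "ifany")
```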

11) Plotting It All Together

# Count users per bucket, including NA if there are any
table(pf$year_joined.bucket, useNA = 'ifany')

# Median friend count by age, one line per gender
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(gender))) +
  geom_line(aes(color = gender), stat = 'summary', fun.y = median)

# Median friend count by age, one line per year_joined bucket
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket))) +
  geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = median)

12) Plot the Grand Mean

# Mean friend count by age, one line per year_joined bucket,
# plus the grand mean across all buckets as a dashed line
ggplot(aes(x = age, y = friend_count),
       data = subset(pf, !is.na(year_joined.bucket))) +
  geom_line(aes(color = year_joined.bucket), stat = 'summary', fun.y = mean) +
  geom_line(stat = 'summary', fun.y = mean, linetype = 2)

13) Friending Rate

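Compute the ratio of friend_count to tenure. Every user yields one such ratio; then summarise the result.
Questions to answer: what is the median of the ratio? What is the maximum?

# Method 1: keeps users with tenure 0, so the ratio can be Inf or NaN
pf$fc_tenure_ratio <- pf$friend_count / pf$tenure
summary(pf$fc_tenure_ratio)

# Method 2: only users with at least one day of tenure
with(subset(pf, tenure >= 1), summary(friend_count / tenure))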
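The reason for restricting to tenure >= 1 can be seen in a small sketch with invented values: dividing by a tenure of 0 yields Inf or NaN, which distorts the summary.

```r
# Hypothetical users; the second and third have a tenure of 0 days.
friend_count <- c(100, 10, 0, 50)
tenure       <- c(50, 0, 0, 25)

ratio <- friend_count / tenure
ratio  # 10 / 0 is Inf and 0 / 0 is NaN

# Keep only users with at least one day of tenure before summarising.
keep <- tenure >= 1
summary(ratio[keep])
```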

14) Friendships Initiated

friendships_initiated: the number of friend requests initiated by this user.

ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure >= 1)) +
  geom_point(aes(color = year_joined.bucket))

I keep feeling that the resulting plot should be a frequency polygon; I have not fully understood this part.

15) Bias Variance Trade off Revisited

I did not really understand round() here. Is it a moving average?

ggplot(aes(x = 7 * round(tenure / 7), y = friendships_initiated / tenure),
       data = subset(pf, tenure > 0)) +
  geom_line(aes(color = year_joined.bucket),
            stat = "summary",
            fun.y = mean)


Understanding the Bias-Variance Tradeoff
NOTE: The code changing the binning is substituting x = tenure in the plotting expressions with x = 7 * round(tenure / 7), etc., binning values by the denominator in the round function and then transforming back to the natural scale with the constant in front.
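A small sketch of what the expression in the note does, on hypothetical tenure values. Each tenure is snapped to the nearest multiple of 7, so every observation falls into exactly one non-overlapping bin, and stat = "summary" then averages y within each bin. This is binning rather than a moving average, where the windows would overlap. The `rate` values below are invented.

```r
# Hypothetical tenures, in days.
tenure <- 1:21

# Snap each tenure to the nearest multiple of 7.
week_bin <- 7 * round(tenure / 7)
data.frame(tenure, week_bin)

# How many tenure values fall into each bin.
table(week_bin)

# Invented y values; the per-bin mean mirrors what stat = "summary" computes.
rate <- 10 / tenure
binned_mean <- tapply(rate, week_bin, mean)
binned_mean
```

A larger divisor (30, 90) gives wider bins: a smoother line with less variance, at the cost of more bias.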

ggplot(aes(x = tenure, y = friendships_initiated / tenure),
       data = subset(pf, tenure > 1)) +
  geom_smooth(aes(color = year_joined.bucket))


16) Sean’s NFL Fan Sentiment Study

I did not follow what story this part was telling.

17) Introducing the Yogurt Dataset

Analyze the yogurt dataset (a record of each yogurt purchase made by customers).

18) Histograms Revisited

yo <- read.csv('yogurt.csv')

# Convert id to a factor variable
yo$id <- factor(yo$id)
str(yo$id)

# Histogram of the price distribution
# qplot(data = yo, x = price, fill = I('orange'))

ggplot(data = yo, aes(x = price)) +
  geom_histogram(fill = 'orange') +
  scale_x_continuous(breaks = seq(20, 70, 2))