WEEK2-Descriptive statistics and data cleaning


Explore Statistics with R (edX)

WEEK2-Descriptive statistics and data cleaning: video notes

Example 1:

1. Load part of the data

obesity <- read.csv("http://www.hscic.gov.uk/catalogue/PUB13648/Obes-phys-acti-diet-eng-2014-tab_CSV.csv", skip=4, nrows=12)

#skip the first 4 rows; import 12 rows
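
To see what skip= and nrows= do without downloading anything, here is a minimal sketch on a made-up inline file. The text in raw and the name demo are invented for illustration; only read.csv() and its arguments come from the notes above.

```r
# Hypothetical inline file: two lines of preamble, a header, then data.
raw <- "Table 1: made-up example
Source: invented for illustration
Year,Total
2002,10
2003,20
2004,30
2005,40"

# skip = 2 drops the two preamble lines, so the third line becomes the header;
# nrows = 3 then reads only the first three data rows.
demo <- read.csv(text = raw, skip = 2, nrows = 3)
print(demo)
```

The real file presumably has four such lines before the header, which would explain skip=4.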


2. Look at the structure of the data

str(obesity)


The data turns out to be messy, so the next step is to tidy it up.


3. Keep only three columns: date, males, females

obesity$Males <- as.numeric(as.character(gsub(",","",obesity$Males)))
obesity$Females <- as.numeric(as.character(gsub(",","",obesity$Females)))
obesity<- obesity[-1,c(-2, -5:-12)]
obesity

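#Remove the thousands-separator commas from Males and Females (gsub is a global substitution: replace the comma with nothing);
#change the data type from factor to numeric (change the factor to a character and then change the character to numeric)
#Drop the first row, and drop column 2 and columns 5 to 12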
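
A small sketch of the same cleaning step on made-up values, and of why the detour through character matters. The vectors counts, no_commas, clean and codes are hypothetical names, not from the course.

```r
# Hypothetical values formatted like the Males/Females columns.
counts <- factor(c("1,275", "427", "10,957"))

# gsub() replaces every comma with nothing and returns character strings.
no_commas <- gsub(",", "", counts)

# as.numeric() on the cleaned strings gives the numbers the labels spell out.
clean <- as.numeric(no_commas)
print(clean)

# Pitfall: as.numeric() applied directly to a factor returns the internal
# level codes, not the values written in the labels.
codes <- as.numeric(factor(c("30", "10", "20")))
print(codes)
```

Since gsub() already returns character, the extra as.character() in the course code is harmless but probably not required.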


Now it looks much cleaner and tidier.

#This is the so-called wide format.
#We would like to have the long format: one row, one observation


4. If the package is not installed yet, install it first, then activate it

install.packages("reshape2")
library("reshape2")

(The instructor's installation output looked so neat and mine was a mess. Never mind, as long as it works.)


5. Use melt from the reshape2 package; in every new session the package has to be loaded with library() before it can be used

obesitylong <- melt(obesity)
obesitylong #long format
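
To see what "wide to long" means without installing anything, here is a rough base R sketch using stack(). This is a swapped-in technique, not what melt() does internally, and the table wide with its numbers is made up for illustration.

```r
# Hypothetical wide table: one row per year, one column per sex.
wide <- data.frame(Year = c("year_a", "year_b"),
                   Males = c(10, 20),
                   Females = c(30, 40))

# stack() puts the measured columns on top of each other and records
# which column each value came from in a factor called 'ind'.
stacked <- stack(wide[c("Males", "Females")])

# Repeat the identifier column once per stacked column.
long <- data.frame(Year = rep(wide$Year, times = 2),
                   variable = stacked$ind,
                   value = stacked$values)
print(long)
```

Each row of long is now a single observation: one year, one sex, one value.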



The data now has three columns: date, sex, and value.


6. Draw a plot to take a look

plot(obesitylong$value~obesitylong$variable)


(The default behavior of the function plot(): if I ask it to plot the value in obesitylong dependent on the variable in obesitylong, in this case the sex, what I get is a boxplot.) (I do not fully understand this part.)
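
My understanding of why a boxplot appears: the variable column produced by melt() is a factor, and when the right-hand side of the formula is a factor, plot() draws one box per level. A minimal sketch on made-up data (demo, group and value are hypothetical names):

```r
set.seed(1)
demo <- data.frame(
  group = factor(rep(c("Males", "Females"), each = 20)),
  value = c(rnorm(20, mean = 10), rnorm(20, mean = 12))
)

# Factor on the right-hand side: plot() draws one box per level.
plot(value ~ group, data = demo)

# The same kind of picture, requested explicitly.
boxplot(value ~ group, data = demo)

# Numeric on the right-hand side: plot() draws a scatter plot instead.
demo$x <- seq_len(nrow(demo))
plot(value ~ x, data = demo)
```

So the type of the predictor, not the function name, decides which kind of plot comes out.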


The instructor then recommended a package and gave an import tip:

# install.packages("lubridate")

# setting the argument colClasses= in read.table() can reduce import time of large datasets
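
A minimal sketch of how colClasses= is written, on a made-up two-column file (raw, guessed and declared are hypothetical names). A toy example this small cannot show the speed gain; it only shows that declaring the types up front replaces R's guessing.

```r
raw <- "id,score
001,1.5
002,2.5"

# Without colClasses, R guesses the types and reads id as a number.
guessed <- read.csv(text = raw)

# With colClasses, the types are declared up front, so R skips the guessing
# and the leading zeros in id survive.
declared <- read.csv(text = raw, colClasses = c("character", "numeric"))
str(declared)
```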


Example 2:

1. First repeat steps 1 and 2 from above, i.e. read the data and look at it:

body <- read.table("http://www.amstat.org/publications/jse/datasets/body.dat.txt")
dim(body)
str(body)


It turns out the variables have no names at all, just V1, V2 and so on, which will not do.


2. Give them names

BodyMeasurements <- c("Biacromial_diameter","Biiliac_diameter","Bitrochanteric_diameter","Chest_depth","Chest_diameter","Elbow_diameter","Wrist_diameter","Knee_diameter","Ankle_diameter","Shoulder_girth","Chest_girth","Waist_girth","Navel_girth","Hip_girth","Thigh_girth","Bicep_girth","Forearm_girth","Knee_girth","Calf_max_girth","Ankle_min_girth","Wrist_min_girth","Age","Weight","Height","Gender")
names(body) <- BodyMeasurements


3. Summarize the data and plot it

summary(body)
boxplot(body)

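#summary gives the minimum, maximum, median, mean and quartiles of each variable; there are too many variables, so the screenshot shows only some of them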




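The variable names on the horizontal axis are impossible to read.

Next, par is called in to make the axis labels visible. (I do not fully understand this part either.)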

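keep.par <- par(no.readonly = TRUE) #save only the settable parameters, so par(keep.par) restores them without warnings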
par(mar = c(10,4,4,2)+0.1)
boxplot(body, las=3)

# to restore parameters to default, use: par(keep.par)
#or close your plotting window
#A few examples of visualization: position, colour, size, plot character
#You can visualize many different variables in the same graph.


Yes, it worked for the instructor, but half of my variable names are still cut off and I do not know what to do about it.
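
One thing that might help with cut-off labels, as a sketch only: make the bottom margin even larger and shrink the label text. The data frame demo and the name old.par are made up; the margin of 12 lines and cex.axis = 0.8 are guesses that would need adjusting for the real plot window.

```r
# Hypothetical data frame with deliberately long column names.
set.seed(2)
demo <- data.frame(matrix(rnorm(60), ncol = 3))
names(demo) <- c("Bitrochanteric_diameter", "Ankle_min_girth", "Age")

# Save only the settable parameters so they can be restored without warnings.
old.par <- par(no.readonly = TRUE)

# mar = c(bottom, left, top, right), measured in lines of text.
# A bigger first number leaves more room under the plot for the labels.
par(mar = c(12, 4, 4, 2) + 0.1)

# las = 3 turns the labels vertical; cex.axis < 1 shrinks them.
boxplot(demo, las = 3, cex.axis = 0.8)

# Put the previous settings back.
par(old.par)
```

Enlarging the plotting window before drawing is another option, since the margins are measured in text lines while the window size is not.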


Example 3:

1. Some plotting basics:

x<- 1:10
set.seed(23)
y <- x + rnorm(10)

#generate some x values and noisy y values to plot


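plot(x,y) #position #an ordinary scatter plot
plot(x,y, col=x) #colour #scatter plot where the point colour changes
plot(x,y, col=x, cex=x) #size #scatter plot where the point size changes; x runs from 1 to 10 here, so the points get progressively larger
plot(x,y, col=x, cex=x, pch=x) #plot character #scatter plot where the point shape changes

Each comment above describes only the feature that was newly added. The instructor then gave another example for pch, which was terrifying, and it even produced error messages.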

x <- rep(1:10, 10)
y <- rep(1:10, each=10)
z <- 1:100
plot(x,y,pch =z)
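
As far as I can tell from the R documentation, pch codes 0 to 25 are the built-in symbols, 26 to 31 are unused, and 32 to 127 are ASCII characters, so the messages in the 1:100 example may come from the unused codes. A sketch that draws only the 26 built-in symbols, with the code printed under each one (codes, gx and gy are made-up names):

```r
# Symbol codes 0 to 25, laid out on a grid with the code printed underneath.
codes <- 0:25
gx <- codes %% 7          # column position
gy <- 3 - codes %/% 7     # row position, first row on top

plot(gx, gy, pch = codes, cex = 2, xlim = c(-0.5, 6.5), ylim = c(-0.5, 3.5),
     xlab = "", ylab = "", axes = FALSE)
text(gx, gy - 0.35, labels = codes, cex = 0.8)

# A single character also works as a plotting symbol.
points(5.5, 0, pch = "A", cex = 2)
```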


2. Now let us put this into practice:

plot(body$Thigh_girth,body$Bicep_girth) #an ordinary scatter plot
plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender) #point shape varies with gender, i.e. males and females get different symbols
plot(body$Thigh_girth,body$Bicep_girth, col=body$Gender+1) #point colour varies with gender; I prefer this one, the difference is easier to see

A fairly clear linear relationship is visible.


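#Summarize a variable by binning
#extend the breaks past max(Age) so the oldest ages also fall into a bin
breaks <- seq(min(body$Age), max(body$Age) + 5, 5)
#include.lowest=TRUE keeps the youngest age, which the default (a,b] bins would turn into NA
Age_group <- cut(body$Age, breaks, include.lowest=TRUE)
body$Age_group <- Age_group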
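
A small sketch of how cut() treats the edges of the breaks, using made-up ages (age, default_groups, wide_breaks and all_groups are hypothetical names). It is only meant to show where NA values can sneak into a binned variable.

```r
# Hypothetical ages, chosen so the minimum and maximum sit at the edges.
age <- c(18, 21, 23, 40, 62, 67)

# Breaks from min to max in steps of 5 stop at 63 here, below the maximum.
breaks <- seq(min(age), max(age), 5)

# Default cut(): intervals are (lower, upper], so the minimum falls outside
# the first bin, and anything above the last break has no bin at all.
default_groups <- cut(age, breaks)
print(default_groups)

# One possible adjustment: extend the breaks past the maximum and close
# the first interval on the left with include.lowest = TRUE.
wide_breaks <- seq(min(age), max(age) + 5, 5)
all_groups <- cut(age, wide_breaks, include.lowest = TRUE)
print(all_groups)
```

Points whose group is NA get no colour when the factor is passed to col=, so they silently disappear from the plot.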

plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender, col=body$Age_group)
plot(body$Thigh_girth,body$Bicep_girth, pch=body$Gender, col=body$Age_group, cex=(body$Weight/10))



My eyes hurt from looking at this.

