Data Cleaning and Transformation in R

love others as self

Categories: R, Big Data

Article link: https://blog.csdn.net/sinat_29581293/article/details/51290639


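1 Handling missing values
# Find which positions in the data are missing
> which(is.na(a),arr.ind=TRUE)

# Remove the rows that contain missing values
> a<-na.omit(a)
> which(is.na(a),arr.ind=TRUE) # no missing values remain
     row col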

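which(x, arr.ind = FALSE, ...)
which finds the positions at which a logical value is TRUE. x is a logical vector or array. arr.ind is a logical flag: if it is TRUE and x is an array, which returns the array position (row and column) of each TRUE value; otherwise it returns the single linear index into the underlying storage.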

> (x<-matrix(1:10,nrow=2))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> which(x==5) # returns only the single (linear) index
[1] 5
> which(x==5,arr.ind = TRUE) # returns the row and column indices
     row col
[1,]   1   3
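
As a small illustration of the missing-value steps above, the sketch below builds a toy data frame (the columns id, height and weight are invented for this example and are not the author's data set a), locates the NA cells with which, and then drops incomplete rows with na.omit.

```r
# Hypothetical data frame, used only for illustration
a <- data.frame(id     = 1:4,
                height = c(170, NA, 165, 180),
                weight = c(60, 72, NA, 81))

# Row/column position of every missing cell
na_pos <- which(is.na(a), arr.ind = TRUE)
print(na_pos)

# Number of missing values in each column
na_per_col <- colSums(is.na(a))
print(na_per_col)

# Keep only the rows that have no missing value
a_clean <- na.omit(a)
print(a_clean)
```

Note that na.omit removes a whole row as soon as one of its cells is missing, so it is worth checking colSums(is.na(a)) first to see how much data would be lost.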

2 Creating new variables

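> a.Range<- a$a.checkyear-a$a.checkyear # always prefix the column with a$, otherwise R does not know which data set to look in

> a.Range
numeric(0)
The result is numeric(0) because a has no column named a.checkyear, so a$a.checkyear is NULL; the name after a$ must match an existing column of a.
New variables can also be created with transform:

> transform(a,a.Range=a.checkyear-a.checkyear)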
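
The two ways of deriving a column can be sketched as follows. The column names first_year and last_year are hypothetical and chosen only for this example; the last lines reproduce why a misspelled column name silently yields numeric(0).

```r
# Hypothetical data frame with two year columns
a <- data.frame(first_year = c(1990, 2001, 2010),
                last_year  = c(2015, 2016, 2012))

# Option 1: explicit a$ prefix on every column
a$range1 <- a$last_year - a$first_year

# Option 2: transform() looks the column names up inside a
a2 <- transform(a, range2 = last_year - first_year)
print(a2)

# A column that does not exist is NULL, and NULL - NULL is numeric(0)
missing_col <- a$no_such_column
empty_diff  <- missing_col - missing_col
print(empty_diff)
```

transform returns a modified copy, so its result has to be assigned back (for example a <- transform(a, ...)) if the new column should be kept.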

Type conversion

a<- transform(a,Date=as.Date(Date))
as.Date() converts data to the Date class.

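The console session below records the attempts that failed before the working call; the comment after each command explains the result.

> transform(a,a.Range=a.checkyear-a.checkyear) # fails: a has no column named a.checkyear
Error in eval(expr, envir, enclos) : object 'a.checkyear' not found
> a<- tarnsform(a,Date=as.Date(Date)) # fails: "tarnsform" is a typo for transform
Error: could not find function "tarnsform"
> a<- transform(a,Date=as.Date(Date)) # fails: a has no column named Date
Error in as.Date(Date) : object 'Date' not found
> a<- transform(a,Date=as.Date(checkyear)) # fails: checkyear is numeric, and converting a number requires an origin date
Error in as.Date.numeric(checkyear) : 'origin' must be supplied
> (d<- as.Date(x="15/06/2016",fomat="%d/%m/%Y")) # "fomat" is a typo for format, so the format is ignored and the default "%Y/%m/%d" is applied
[1] "0015-06-20"
> (d<- as.Date(x="15/06/2013",fomat="%d/%m/%Y")) # same typo, same wrong result
[1] "0015-06-20"
> (d<- as.Date(x="15/06/2013",fomat="%d/%m/%y"))
[1] "0015-06-20"
> (d<- as.Date(x="15$06$2013",fomat="%d$%m$%y")) # with the format ignored, neither default format matches this string
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date(x="15$06$2013",fomat="%d$%m$%y")
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date(x="15$06$13",fomat="%d$%m$%y")
Error in charToDate(x) : character string is not in a standard unambiguous format
> as.Date(x="15/May/13",fomat="%d/%b/%y")
Error in charToDate(x) : character string is not in a standard unambiguous format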
> Sys.setlocale("LC_TIME","USA")
[1] "English_United States.1252"
> as.Date(x="15/May/13",format="%d/%B/%y")
[1] "2013-05-15"
> Sys.setlocale("LC_TIME","Chinese")
[1] "Chinese (Simplified)_China.936"
> as.numeric(d)
[1] -713879
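
With the argument name spelled format, the same kinds of strings parse as intended. The sketch below is a minimal example using only numeric date fields, so it does not depend on the locale; the variable names are invented for illustration.

```r
# Day/month/four-digit year
d1 <- as.Date("15/06/2016", format = "%d/%m/%Y")

# Any literal separator works as long as the format repeats it;
# %y reads a two-digit year
d2 <- as.Date("15$06$13", format = "%d$%m$%y")

# A Date is stored as the number of days since 1970-01-01
days <- as.numeric(d1)
back <- as.Date(days, origin = "1970-01-01")

# Subtracting two dates gives the difference in days
span <- as.numeric(d1 - d2)

# When an explicit format does not match, the result is NA, not an error
bad <- as.Date("15/06/2016", format = "%Y-%m-%d")

print(d1); print(d2); print(span); print(bad)
```

Supplying origin explicitly, as in the back example, is also what the failed as.Date(checkyear) call in the session above was missing.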

In R, as.POSIXlt provides a more complete date-time representation.
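
A brief sketch of what as.POSIXlt adds over Date: it keeps the time of day and exposes each component as a list element. The timestamp and the UTC time zone below are arbitrary choices for the example.

```r
# Parse a date-time string; tz is fixed so the result does not depend on the machine
tm <- as.POSIXlt("2016-06-15 10:30:45", tz = "UTC")

# Components are stored with offsets: year counts from 1900, mon from 0
year_full  <- tm$year + 1900
month_num  <- tm$mon + 1
day_of_mon <- tm$mday
hour_val   <- tm$hour
week_day   <- tm$wday   # 0 is Sunday

# Drop the time of day again
date_only <- as.Date(tm)

print(c(year_full, month_num, day_of_mon, hour_val, week_day))
print(date_only)
```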

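Sorting

The three related functions are sort, rank and order:
> x<- c(19,84,64,2)
> order(x) # returns the indices that would put the vector in sorted order
[1] 4 1 3 2
> rank(x) # returns the position of each value in the sorted sequence
[1] 2 4 3 1
> sort(x) # returns the values themselves in sorted order
[1] 2 19 64 84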
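
The relationship between the three functions can be checked directly. This is a small sketch; the second vector y is added here only to show how rank treats ties.

```r
x <- c(19, 84, 64, 2)

# Indexing by order(x) gives the same result as sort(x)
same_as_sort <- identical(x[order(x)], sort(x))

# Descending variants
desc_values  <- sort(x, decreasing = TRUE)
desc_indices <- order(x, decreasing = TRUE)

# Ties: by default rank() averages the tied positions
y <- c(5, 2, 5, 1)
rank_avg <- rank(y)
rank_min <- rank(y, ties.method = "min")

print(same_as_sort); print(desc_values); print(desc_indices)
print(rank_avg); print(rank_min)
```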

Multi-key sorting with order
> d<-data.frame(x=c(19,84,64,2,2),y=c(20,13,5,40,21))
> d
   x  y
1 19 20
2 84 13
3 64  5
4  2 40
5  2 21

> d[order(d$x,d$y),] # ascending x; ties in x are broken by ascending y
   x  y
5  2 21
4  2 40
1 19 20
3 64  5
2 84 13
> d[order(d$x,-d$y),] # ascending x; ties in x are broken by descending y
   x  y
4  2 40
5  2 21
1 19 20
3 64  5
2 84 13
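
Negating a key only works for numeric columns. For character keys, one option is the radix method of order, which accepts one decreasing flag per key. The sketch below uses a made-up data frame people to show this; it is an alternative to the minus-sign trick, not something taken from the original post.

```r
# Hypothetical data with a character key and a numeric key
people <- data.frame(name  = c("b", "a", "b", "a"),
                     score = c(1, 2, 3, 4),
                     stringsAsFactors = FALSE)

# name ascending, then score descending within each name
idx <- order(people$name, people$score,
             decreasing = c(FALSE, TRUE), method = "radix")
by_name <- people[idx, ]
print(by_name)
```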

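Selecting specific rows and subsets

> suba<- subset(a,subset = a.Range>40)
> length(a) # for a data frame, length() counts columns, not rows
[1] 29
> length(suba) # still 29 columns; use nrow() to compare row counts
[1] 29
> head(suba)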

> set.seed(42)
> (x<- data.frame(x=1:18,id=LETTERS[sample(rep(c(1,2,3),6),18)]))
x id
1 1 B
2 2 A
3 3 B
4 4 A
5 5 C
6 6 A
7 7 B
8 8 B
9 9 C
10 10 A
11 11 A
12 12 C
13 13 C
14 14 B
15 15 C
16 16 C
17 17 B
18 18 A
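> (new.x1<- subset(x=x,subset=id=c("A","B"))) # a single = is not a comparison, hence the syntax error
Error: unexpected '=' in "(new.x1<- subset(x=x,subset=id="
> (new.x1<- subset(x=x,subset=id==c("A","B"))) # == recycles c("A","B") along id (odd rows are compared with "A", even rows with "B"), so most matching rows are missed
x id
8 8 B
11 11 A
14 14 B
> (new.x2<- subset(x=x,subset=id%in%c("A","B"))) # note: when the condition is membership in a set, use %in%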
x id
1 1 B
2 2 A
3 3 B
4 4 A
6 6 A
7 7 B
8 8 B
10 10 A
11 11 A
14 14 B
17 17 B
18 18 A
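
The difference between == and %in% is easier to see on a fixed data frame that does not depend on the random seed. The data below are invented for this sketch.

```r
x <- data.frame(x  = 1:6,
                id = c("A", "A", "C", "B", "B", "C"),
                stringsAsFactors = FALSE)

# == recycles c("A", "B"): row 1 vs "A", row 2 vs "B", row 3 vs "A", ...
wrong <- subset(x, id == c("A", "B"))

# %in% tests each id for membership in the set
right <- subset(x, id %in% c("A", "B"))

# The same selection with plain bracket indexing
right2 <- x[x$id %in% c("A", "B"), ]

# nrow() counts rows; length() of a data frame counts columns
print(nrow(wrong)); print(nrow(right)); print(length(x))
```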

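Merging data

To combine two data sets by matching keys (for example row names), use merge; rbind and cbind only bind data frames together by rows or by columns, without any matching.
For detailed usage, search online (for example on Baidu).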
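
A minimal sketch of the three functions, using small invented data frames (left, right, m1 and m2 are hypothetical names): merge matches rows by a key, while rbind and cbind simply append rows or columns.

```r
left  <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
right <- data.frame(id = c(2, 3, 4), score = c(90, 80, 70))

# Keep only ids present in both data frames
inner <- merge(left, right, by = "id")

# Keep every id; unmatched cells become NA
full <- merge(left, right, by = "id", all = TRUE)

# Merge on row names instead of a column
m1 <- data.frame(u = 1:2, row.names = c("r1", "r2"))
m2 <- data.frame(v = 3:4, row.names = c("r2", "r1"))
by_rn <- merge(m1, m2, by = "row.names")

# rbind appends rows (same columns); cbind appends columns (same row count)
stacked <- rbind(left, data.frame(id = 5, name = "e"))
side    <- cbind(left, flag = c(TRUE, FALSE, TRUE))

print(inner); print(full); print(by_rn)
```

Because cbind pairs rows purely by position, it is only safe when both inputs are already in the same order; merge is the better choice whenever a key is available.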

Another way to manipulate data frames
