Follow on WeChat: 小程在线
Follow the CSDN blog: 程志伟的博客
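Data preprocessing covers:
1. Data sampling: sample()
2. Modifying variable names: tolower(), strsplit()
3. Creating new variables: cut()
4. Data discretization
5. Date handling: lubridate package, paste(), ymd()
6. Data binarization
7. Merging datasets: merge()
8. Sorting datasets: order()
9. Reshaping datasets: melt()
10. Data manipulation with dplyr
11. Missing data handling: e1071 package, impute()
12. Feature scaling: scale()
13. Dimensionality reduction: PCA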
-----------------------------------------------------------------------------------------------------
1. Data sampling
> sample_index <- sample(1:nrow(iris), 10, replace=TRUE)
> sample_index
[1] 75 32 39 145 138 134 16 114 64 75
> sample_set <- iris[sample_index,]
> sample_set
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
75            6.4         2.9          4.3         1.3 versicolor
32            5.4         3.4          1.5         0.4     setosa
39            4.4         3.0          1.3         0.2     setosa
145           6.7         3.3          5.7         2.5  virginica
138           6.4         3.1          5.5         1.8  virginica
134           6.3         2.8          5.1         1.5  virginica
16            5.7         4.4          1.5         0.4     setosa
114           5.7         2.5          5.0         2.0  virginica
64            6.1         2.9          4.7         1.4 versicolor
75.1          6.4         2.9          4.3         1.3 versicolor
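The indices in the transcript change on every run because sample() draws at random. Row name 75.1 appears because index 75 was drawn twice and R makes duplicated row names unique. The following is a minimal sketch, not part of the original post, of how the draw could be made reproducible and how sampling with and without replacement differ. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
# Fix the random seed so the same rows are drawn on every run
# (42 is an arbitrary value chosen for illustration).
set.seed(42)

# With replacement: the same row index may be drawn more than once.
idx_with <- sample(seq_len(nrow(iris)), 10, replace = TRUE)

# Without replacement: every drawn index is distinct.
idx_without <- sample(seq_len(nrow(iris)), 10, replace = FALSE)

sample_with <- iris[idx_with, ]
sample_without <- iris[idx_without, ]

# Repeated draws get suffixed row names such as "75.1".
rownames(sample_with)

# anyDuplicated() returns 0 when no index is repeated.
anyDuplicated(idx_without)
```

Sampling without replacement is usually the safer default when the subset is meant to be a hold-out or training split, since no observation is counted twice.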
2. Modifying variable names
> df <- data.frame("Address 1"=character(0), direction=
+ character(0), street=character(0), CrossStreet=character(0),
+ intersection=character(0), Location.1=character(0))
> names(df)
[1] "Address.1" "direction" "street" "CrossStreet" "intersection"
[6] "Location.1"
#tolower() converts uppercase letters to lowercase
> names(df) <- tolower(names(df))
> names(df)
[1] "address.1" "direction" "street" "crossstreet" "intersection"
[6] "location.1"
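> #strsplit() returns an R list. This base R function works as follows:
> #when a string contains a period, it is split into a character vector
> #whose 1st element is the text before the period and whose 2nd element is the text after it.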
> splitnames <- strsplit(names(df), "\\.")
> splitnames
[[1]]
[1] "address" "1"
[[2]]
[1] "direction"
[[3]]
[1] "street"
[[4]]
[1] "crossstreet"
[[5]]
[1] "intersection"
[[6]]
[1] "location" "1"
> splitnames[6]
[[1]]
[1] "location" "1"
> splitnames[[6]][1]
[1] "location"
> splitnames[[6]][2]
[1] "1"
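One common way to use the split result is to keep only the part before the period as the cleaned column name. This is sketched here as an assumption about where the step leads, not as the author's stated method. The helper name first_element and the variable clean_names are hypothetical.

```r
# Rebuild the example data frame from the transcript (zero rows, six columns).
df <- data.frame("Address 1" = character(0), direction = character(0),
                 street = character(0), CrossStreet = character(0),
                 intersection = character(0), Location.1 = character(0))
names(df) <- tolower(names(df))

# Split each name at a literal period; "\\." escapes the regex wildcard ".".
splitnames <- strsplit(names(df), "\\.")

# Hypothetical helper: keep the first piece of each split name.
first_element <- function(x) x[1]

# sapply() applies the helper to every list element and
# simplifies the result to a character vector.
clean_names <- sapply(splitnames, first_element)
names(df) <- clean_names
names(df)
```

Names without a period pass through unchanged, because strsplit() returns them as a one-element vector.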
3. Creating new variables
> airquality$Ozone[1:10]
[1] 41 36 12 18 NA 28 23 19 8 NA
> ozoneRanges <- cut(airquality$Ozone, seq(0,200,by=25))
> ozoneRanges[1:10]
[1] (25,50] (25,50] (0,25] (0,25] <NA> (25,50