A Guide to Efficient Data Processing in R: Basic Data Handling

m0_61027476

Article link: https://blog.csdn.net/m0_61027476/article/details/120355174

1 Basic dataset exploration

The three functions str, summary, and head are the go-to tools for a first exploratory look at a data frame.

> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
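
Beyond these three, a few other base R functions are often handy for a first look. The sketch below is an addition to the original text and uses only the built-in iris dataset; the variable names are illustrative.

```r
# Quick structural checks on a data frame (base R only)
dims <- dim(iris)            # number of rows and columns
cols <- names(iris)          # column names
last_rows <- tail(iris, 3)   # last three rows, the counterpart of head()

print(dims)
print(cols)
print(last_rows)
```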

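2 Implementing the basic operations

2.1 Creating (read.csv/data.frame)

2.1.1 Importing from an external file

First we need a CSV file. To keep the steps self-contained, we write a table out with write.csv and then read it back in with read.csv. Taking the iris dataset as an example, first write it to the root of drive D: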

> write.csv(iris,file = "D:/iris.csv")
# The file = part can be omitted, so the call can also be written as:
#  write.csv(iris,"D:/iris.csv")

Once that is done, the data can be read back in from the file and assigned to iris2:

> iris2 = read.csv("D:/iris.csv")
> iris2
      X Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1     1          5.1         3.5          1.4         0.2     setosa
2     2          4.9         3.0          1.4         0.2     setosa
3     3          4.7         3.2          1.3         0.2     setosa
4     4          4.6         3.1          1.5         0.2     setosa
5     5          5.0         3.6          1.4         0.2     setosa
…………
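
The extra X column appears because write.csv also writes the row names by default. A minimal sketch of avoiding it, assuming a temporary file created with tempfile rather than the D: drive path used above (csv_path and iris_back are names made up for this example):

```r
# Write iris without row names, then read it back.
# csv_path is a hypothetical temporary location, not the path used in the article.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, file = csv_path, row.names = FALSE)

iris_back <- read.csv(csv_path)
names(iris_back)   # the "X" column is no longer present
```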

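By default, read.csv treats the first line as a header and uses it for the column names of the data frame. If you do not want this, set header = FALSE. Using header = FALSE on a file that does contain column names is a mistake, however, because the names are then read in as the first row of data.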
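
The effect of the header argument can be sketched with a tiny file. The file contents and the names hdr_path, with_header, and no_header are invented for this illustration:

```r
# Illustrative file with a header line; path and contents are made up for this sketch.
hdr_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), hdr_path)

with_header <- read.csv(hdr_path)                  # header = TRUE is the default
no_header   <- read.csv(hdr_path, header = FALSE)  # the "x,y" line is read as data

nrow(with_header)  # 2 data rows
nrow(no_header)    # 3 rows, because the header line is counted as data
names(no_header)   # automatically generated names V1, V2
```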

2.1.2 Creating inside R (data.frame)

A data frame can be created directly in R with the data.frame function:

> df = data.frame(x = 1:3,y = c("a","b","c"))
> df
  x y
1 1 a
2 2 b
3 3 c

2.2 Deleting (rm)

If you no longer want to use a variable in R, you can delete it with the rm function:

> rm(df)
> df
function (x, df1, df2, ncp, log = FALSE) 
{
    if (missing(ncp)) 
        .Call(C_df, x, df1, df2, log)
    else .Call(C_dnf, x, df1, df2, ncp, log)
}
<bytecode: 0x0000000004c0a8c0>
<environment: namespace:stats>

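After rm(df), the name df no longer refers to our data frame. What gets printed instead is the built-in function df from the stats package (the density of the F distribution), which happens to share the name. To see which variables exist in the environment, use the ls function: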

> ls()
[1] "iris2"

To clear every variable from the environment, you can do this:

> rm(list = ls())

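Note, however, that built-in datasets such as iris cannot be deleted this way, because they live in the datasets package rather than in your workspace.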
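
A small sketch of checking what rm does, using the base function exists. The name scratch_df is a throwaway chosen for this example:

```r
# scratch_df is a throwaway name used only for this sketch
scratch_df <- data.frame(x = 1:3)
before <- exists("scratch_df")       # TRUE, the variable is defined

rm(scratch_df)
after <- exists("scratch_df")        # FALSE, the variable is gone

iris_still_there <- exists("iris")   # TRUE, iris is still found in the datasets package
```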

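2.3 Indexing (DF[i, j])

Indexing can be done by row or by column. Keep a few points in mind: use square brackets; rows go before the comma and columns after it; use a colon for a consecutive range; use a vector c(*, *, *) for non-consecutive positions.

2.3.1 Row indexing

Row 33 of iris: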

> iris[33,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
33          5.2         4.1          1.5         0.1  setosa

Selecting several rows, for example rows 33 to 35:

> iris[33:35,]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
33          5.2         4.1          1.5         0.1  setosa
34          5.5         4.2          1.4         0.2  setosa
35          4.9         3.1          1.5         0.2  setosa

Selecting non-consecutive rows, for example rows 33, 36, and 38:

> iris[c(33,36,38),]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
33          5.2         4.1          1.5         0.1  setosa
36          5.0         3.2          1.2         0.2  setosa
38          4.9         3.6          1.4         0.1  setosa

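2.3.2 Column indexing

Columns work the same way. Here we first take rows 2 to 5 of iris as iris1, then select columns 2 to 4: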

> iris1 = iris[2:5,]
> iris1[,2:4]
  Sepal.Width Petal.Length Petal.Width
2         3.0          1.4         0.2
3         3.2          1.3         0.2
4         3.1          1.5         0.2
5         3.6          1.4         0.2

Because columns have names, they can also be selected by name, for example the Petal.Length column:

> iris1[,"Petal.Length"]
[1] 1.4 1.3 1.5 1.4

A column can also be selected with the $ operator, so the example above can be written as:

> iris1$Petal.Length
[1] 1.4 1.3 1.5 1.4

To select several columns, pass a vector of column names:

> iris1[,c("Sepal.Length","Petal.Length")]
  Sepal.Length Petal.Length
2          4.9          1.4
3          4.7          1.3
4          4.6          1.5
5          5.0          1.4
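
Note that selecting a single column returned a plain vector, while selecting two columns returned a data frame. The sketch below, which is an addition to the original text, shows the drop = FALSE argument that keeps a single column as a data frame, plus negative indices for excluding columns:

```r
iris1 <- iris[2:5, ]

as_vector <- iris1[, "Petal.Length"]                 # numeric vector
as_frame  <- iris1[, "Petal.Length", drop = FALSE]   # one-column data frame
without_species <- iris1[, -5]                       # exclude the 5th column by position

class(as_vector)
class(as_frame)
names(without_species)
```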

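2.4 Inserting (rbind/cbind)

As with indexing, insertion is done either by row or by column.

2.4.1 Row insertion (rbind)

When adding rows to a data frame, the two data frames must have the same number of columns and the same column names. Below, I take two subsets of iris and then combine them, which appends the second data frame to the first: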

> iris[1:3,] -> i1
> iris[4,] -> i2
> i1
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> i2
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4          4.6         3.1          1.5         0.2  setosa
> rbind(i1,i2)->i
> i
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

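Note that the "->" arrow assigns in the direction it points: the value on the left is assigned to the name on the right.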
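
The requirement that column names match can be checked with a small sketch. The renamed column SepalLen is invented for illustration, and tryCatch is used so that the error is captured rather than stopping the script:

```r
i1 <- iris[1:3, ]
i2 <- iris[4, ]
names(i2)[1] <- "SepalLen"   # deliberately rename one column for illustration

result <- tryCatch(
  rbind(i1, i2),
  error = function(e) conditionMessage(e)   # capture the error message instead of stopping
)
result   # a character string holding the error message, not a data frame
```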

2.4.2 Column insertion (cbind)

The c in cbind stands for column, just as the r in rbind stands for row.

> i1[,1:2]->i3
> i1[,3]->i4
> cbind(i3,i4)->i5
> i3
  Sepal.Length Sepal.Width
1          5.1         3.5
2          4.9         3.0
3          4.7         3.2
> i4
[1] 1.4 1.4 1.3
> i5
  Sepal.Length Sepal.Width  i4
1          5.1         3.5 1.4
2          4.9         3.0 1.4
3          4.7         3.2 1.3

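Note that i4 is a plain vector with no column name, so cbind automatically uses the variable name i4 as the column name in the new data frame. To change the column names, use the colnames function or the names function: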

> names(i5) = c("a","b","c")
> i5
    a   b   c
1 5.1 3.5 1.4
2 4.9 3.0 1.4
3 4.7 3.2 1.3
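
As an alternative to renaming afterwards, the new column can be named inside the cbind call itself. A minimal sketch that rebuilds the same objects as above:

```r
i1 <- iris[1:3, ]
i3 <- i1[, 1:2]
i4 <- i1[, 3]

# Name the new column directly in the call instead of renaming later
i5 <- cbind(i3, Petal.Length = i4)
names(i5)
```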

2.5 Sorting (order)

How order works: it takes a vector and returns the index positions that would put that vector in sorted order. For example:

> c(3,5,2,6,4,8)->a
> order(a)
[1] 3 1 5 2 4 6

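To interpret the output: the first number, 3, means that the 3rd element of the vector should come first, and so on. Indexing the vector with this result therefore sorts it: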

> a[order(a)]
[1] 2 3 4 5 6 8

Taking the first six rows of iris as a demonstration:

> test<-iris[1:6,]
> test[order(test$Sepal.Length),]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
4          4.6         3.1          1.5         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
1          5.1         3.5          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

For descending order, put a minus sign in front of the (numeric) argument to order:

> test[order(-test$Sepal.Length),]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
6          5.4         3.9          1.7         0.4  setosa
1          5.1         3.5          1.4         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

The order function can take several arguments; later arguments are used to break ties among the earlier ones.
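
Sorting by more than one key can be sketched as follows. The choice of keys (Petal.Length ascending, then Sepal.Length descending) is an example picked for this illustration:

```r
test <- iris[1:6, ]

# First key: Petal.Length ascending; ties broken by Sepal.Length descending
sorted <- test[order(test$Petal.Length, -test$Sepal.Length), ]
rownames(sorted)   # "3" "1" "5" "2" "4" "6"
```

Rows 1, 2, and 5 all have Petal.Length 1.4, so the second key decides their order.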

2.6 Filtering (DF[condition,])

Filtering a data frame is again an operation on rows. Row indexing can in fact use logical values, for example:

> test[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE),]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

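To select the records with Sepal.Length greater than 5, first check which rows satisfy the condition, then use the resulting logical vector to filter the rows: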

> test$Sepal.Length > 5
[1]  TRUE FALSE FALSE FALSE FALSE  TRUE
> test[test$Sepal.Length > 5,]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
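
Conditions can be combined with & (and) or | (or), and base R also provides subset for the same purpose. The thresholds in this sketch are arbitrary values chosen for illustration:

```r
test <- iris[1:6, ]

# Logical indexing with two conditions combined by &
by_index <- test[test$Sepal.Length > 4.8 & test$Petal.Width < 0.3, ]

# The same filter with subset(), which lets column names be used directly
by_subset <- subset(test, Sepal.Length > 4.8 & Petal.Width < 0.3)

rownames(by_index)    # "1" "2" "5"
rownames(by_subset)   # same rows
```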

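2.7 Summarizing (apply)

The last column of iris is the species name, which is meaningless to summarize, so we drop it and then use the apply function:

apply(the data frame to operate on, 1 to summarize by row / 2 to summarize by column, the summary function)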

> iris0 <- iris[,-5]
> apply(iris0,2,mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333
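
For column means specifically, base R offers a couple of equivalent routes. The sketch below is an addition to the original text and compares them with the apply call above; it also shows a row-wise use of apply:

```r
iris0 <- iris[, -5]

via_apply    <- apply(iris0, 2, mean)   # 2 = by column, as in the text
via_colMeans <- colMeans(iris0)         # dedicated function for column means
via_sapply   <- sapply(iris0, mean)     # applies mean to each column of the data frame

row_sums <- apply(iris0, 1, sum)        # 1 = by row: one value per observation

all.equal(via_apply, via_colMeans)
length(row_sums)
```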

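2.8 Grouping (aggregate)

The iris dataset contains three species. To compute the mean for each of them separately, we need to group the data. For example, to find the mean Sepal.Length of the 3 species of iris: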

> aggregate(Sepal.Length ~ Species, data = iris, mean)
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588

To summarize all attributes by group, use . to stand for every attribute other than Species:

> aggregate(. ~ Species,data = iris,mean)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa        5.006       3.428        1.462       0.246
2 versicolor        5.936       2.770        4.260       1.326
3  virginica        6.588       2.974        5.552       2.026
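
Two related sketches, added here for illustration: tapply computes the same group means as a named vector, and aggregate accepts any summary function through its FUN argument (max is just an example choice):

```r
# Group means as a named vector rather than a data frame
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means

# aggregate with a different summary function
group_max <- aggregate(Sepal.Length ~ Species, data = iris, FUN = max)
group_max
```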

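2.9 Joining (merge)

In base R, the merge function joins multiple tables. We construct new datasets to serve as an example.

> # Build the customer transactions data frame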
> df1 = data.frame(CustomerID = c(1:6) , Product = c(rep("Oven", 3), rep("Television", 3)))
> df1
  CustomerID    Product
1          1       Oven
2          2       Oven
3          3       Oven
4          4 Television
5          5 Television
6          6 Television
> # Build the customer address data frame
> df2 = data.frame(CustomerID = c(2, 4, 6),State = c(rep("California", 2),rep("Texas", 1)))
> df2
  CustomerID      State
1          2 California
2          4 California
3          6      Texas
> # Both tables contain CustomerID, which is what makes the join possible (the default is an inner join)
> df <- merge(x=df1,y=df2,by="CustomerID")
> df
  CustomerID    Product      State
1          2       Oven California
2          4 Television California
3          6 Television      Texas
> # For a left join, set all.x = TRUE
> df <- merge(x=df1,y=df2,by="CustomerID",all.x=T)
> df0 <- merge(x=df1,y=df2,by="CustomerID",all.x=T)
> df0
  CustomerID    Product      State
1          1       Oven       <NA>
2          2       Oven California
3          3       Oven       <NA>
4          4 Television California
5          5 Television       <NA>
6          6 Television      Texas
> # For a right join, set all.y = TRUE
> df3 <- merge(x=df1,y=df2,by="CustomerID",all.y=T)
> df3
  CustomerID    Product      State
1          2       Oven California
2          4 Television California
3          6 Television      Texas
> # For a full outer join, set all = TRUE
> df4 <- merge(x=df1,y=df2,by="CustomerID",all=T)
> df4
  CustomerID    Product      State
1          1       Oven       <NA>
2          2       Oven California
3          3       Oven       <NA>
4          4 Television California
5          5 Television       <NA>
6          6 Television      Texas
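
When the key column has different names in the two tables, merge takes by.x and by.y instead of by. A minimal sketch, assuming a hypothetical address table whose key column is called ID (the names orders, address, and joined are invented for this example):

```r
orders <- data.frame(CustomerID = 1:6,
                     Product = c(rep("Oven", 3), rep("Television", 3)))

# Hypothetical table whose key column is named ID rather than CustomerID
address <- data.frame(ID = c(2, 4, 6),
                      State = c("California", "California", "Texas"))

# Inner join on columns with different names; the result keeps the name from x
joined <- merge(orders, address, by.x = "CustomerID", by.y = "ID")
joined
```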
