Data Manipulation in R

Latest recommended article published on 2023-11-16 18:58:20

鲁鲁酱1996

Category: Machine Learning: R Language Basics
Tags: R language, data

Article link: https://blog.csdn.net/lulujiang1996/article/details/78897492


Reading and Writing Data

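For discrete variables, we look at the frequency of observations at each level, cross-tabulate two variables, or draw a bar chart.
For continuous variables, we look at the mean, standard deviation, quantiles, and so on.
In addition, functions such as summary(), str(), and describe() (from the psych package) summarize a whole data frame.
These are the most basic methods, but they are not very flexible and their output is fixed, so we often need to reshape the data ourselves.
Before turning to aggregation and reshaping, we introduce tibble, a newer object that can replace the data frame, and readr, a package for reading data sets efficiently. Finally, we will introduce two packages for reshaping data: reshape2 and tidyr.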
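
The basic summaries described above can be sketched with base R alone. This is a minimal illustration on a made-up data frame (the column names and values are assumptions, not the data set used later in this post); describe() from psych is left out so that the example needs no extra packages.

```r
# Toy data; names and values are invented for illustration only
dat <- data.frame(
  gender = c("Female", "Male", "Male", "Female", "Male"),
  house  = c("Yes", "No", "Yes", "Yes", "No"),
  income = c(120, 95, 110, 130, 105)
)

# Discrete variables: frequency per level and a two-way cross table
freq  <- table(dat$gender)
cross <- table(dat$gender, dat$house)
print(freq)
print(cross)

# Continuous variables: mean, standard deviation, quantiles
inc_mean <- mean(dat$income)
inc_sd   <- sd(dat$income)
inc_q    <- quantile(dat$income, probs = c(0.25, 0.5, 0.75))

# Whole-data-frame summaries
print(summary(dat))
str(dat)
```

A bar chart of the discrete variable could then be drawn with barplot(freq).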

tibble: a replacement for the traditional data frame

> library(tibble)
> library(readr)
> library(ggplot2)
> sim.dat=read.csv("https://raw.githubusercontent.com/happyrabbit/DataScientistR/master/Data/SegData.csv")
> df=data.frame(x=c(1:5),y=rep("a",5))
> as_tibble(df)
# A tibble: 5 x 2
      x      y
  <int> <fctr>
1     1      a
2     2      a
3     3      a
4     4      a
5     5      a
> tibble(x=1:5,y=rep("a",5))
# A tibble: 5 x 2
      x     y
  <int> <chr>
1     1     a
2     2     a
3     3     a
4     4     a
5     5     a
> 
> tibble(x=1:5,y=1,z=x^2+y)
# A tibble: 5 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
> tb=tibble(':)'="smile",' '="space",'2000'="number")
> print(tb)
# A tibble: 1 x 3
   `:)`   ` ` `2000`
  <chr> <chr>  <chr>
1 smile space number
>

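As the last example shows, a tibble accepts column names that are not valid R names. To refer to such variables, including when you use them in other packages, wrap the name in backticks.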
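
A minimal sketch of how such non-syntactic column names can be accessed, reusing the same three names as the tibble above. The variable names smile, number and space are only for illustration.

```r
library(tibble)

# Column names that are not valid R identifiers (same names as above)
tb <- tibble(`:)` = "smile", ` ` = "space", `2000` = "number")

# Backticks are needed wherever the name is written as a symbol
smile  <- tb$`:)`
number <- tb$`2000`

# With "[[" the name is an ordinary string, so normal quotes are enough
space <- tb[[" "]]
```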
tibbles differ from traditional data frames mainly in two respects: how they print and how they are subset.
1. Printing

> print(as_tibble(sim.dat))
# A tibble: 1,000 x 19
     age gender   income  house store_exp online_exp store_trans online_trans
   <int> <fctr>    <dbl> <fctr>     <dbl>      <dbl>       <int>        <int>
 1    57 Female 120963.4    Yes  529.1344   303.5125           2            2
 2    63 Female 122008.1    Yes  478.0058   109.5297           4            2
 3    59   Male 114202.3    Yes  490.8107   279.2496           7            2
 4    60   Male 113616.3    Yes  347.8090   141.6698          10            2
 5    51   Male 124252.6    Yes  379.6259   112.2372           4            4
 6    59   Male 107661.5    Yes  338.3154   195.6870           4            5
 7    57   Male 120483.3    Yes  482.5445   284.5363           5            3
 8    57   Male 110542.0    Yes  340.7368   135.2556          11            5
 9    61 Female 132060.5    Yes  608.2310   142.5503           6            1
10    60   Male 105048.8    Yes  470.3190   163.4663          12            1
# ... with 990 more rows, and 11 more variables: Q1 <int>, Q2 <int>, Q3 <int>,
#   Q4 <int>, Q5 <int>, Q6 <int>, Q7 <int>, Q8 <int>, Q9 <int>, Q10 <int>,
#   segment <fctr>

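As shown above, printing a tibble displays only the first 10 rows and only as many columns as fit the screen width. The type of each column is shown under its name, which makes the output easier to read.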
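
If the default of 10 rows is not what you want, the print method for tibbles takes n and width arguments. The following is a small sketch on a toy tibble (the columns id and value are made up):

```r
library(tibble)

# Toy tibble with more rows than the default print limit
tb <- tibble(id = 1:100, value = sqrt(id))

print(tb)                      # default: first 10 rows
print(tb, n = 3)               # show only the first 3 rows
print(tb, n = 3, width = Inf)  # show all columns regardless of screen width
```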
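2. Subsetting
To extract a single variable from a tibble, use "$" or "[[":
"[[" can extract by variable name or by position.
"$" can extract by variable name only.
Both can also be used in a pipeline with the pipe operator "%>%", where "." stands for the data being piped.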

sim.dat$age
sim.dat[["age"]]
sim.dat[[1]]

library(dplyr)
sim.dat%>%.$age
sim.dat%>%.[["age"]]
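
The extraction styles above all return the same vector. The sketch below checks this on a small toy tibble rather than on sim.dat, so it runs without downloading anything. pull() from dplyr is not mentioned in the text and is added here only as a common alternative.

```r
library(tibble)
library(dplyr)

# Toy tibble; not the SegData.csv file used in the text
tb <- tibble(age = c(57L, 63L, 59L),
             gender = c("Female", "Female", "Male"))

by_dollar   <- tb$age
by_name     <- tb[["age"]]
by_position <- tb[[1]]
by_pipe     <- tb %>% .$age
by_pull     <- tb %>% pull(age)   # dplyr::pull(), an addition to the text

# All five approaches give the same integer vector
same <- identical(by_dollar, by_name) &&
  identical(by_name, by_position) &&
  identical(by_position, by_pipe) &&
  identical(by_pipe, by_pull)
```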

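Subsetting a data frame with "[" sometimes returns a data frame and sometimes a vector, which can cause errors later in a program, whereas subsetting a tibble with "[" always returns a tibble. With "$" or "[[", both data frames and tibbles return the variable as a vector.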
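
A short sketch of this difference on a toy data frame (the names df and tb are illustrative): single-column "[" subsetting drops a data frame to a vector but keeps a tibble as a tibble, while "[[" returns a plain vector for both.

```r
library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

df_col <- df[, "x"]   # data frame: a single column is dropped to a vector
tb_col <- tb[, "x"]   # tibble: still a tibble with one column
tb_vec <- tb[["x"]]   # "[[" returns the underlying vector

print(class(df_col))
print(class(tb_col))
print(class(tb_vec))
```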
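Note: tibble is relatively new, and some older functions may not work with it. If you want to fit models after cleaning the data, you can convert the tibble back to a plain data frame: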

sim.dat=as.data.frame(sim.dat)
class(sim.dat)

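Efficient data reading and writing: the readr package
readr provides the following functions for reading data:
read_csv() reads comma-delimited files
read_csv2() reads semicolon-delimited files
read_tsv() reads tab-delimited files
read_delim() reads files with an arbitrary delimiter
Of these, read_csv() covers most data-import needs.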
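
The functions listed above are closely related: read_csv() behaves like read_delim() with a comma as the delimiter. A minimal sketch on an inline string (csv_text is a made-up example; readr treats a string containing a newline as literal data rather than a file path):

```r
library(readr)

# A string containing a newline is read as data, not as a file name
csv_text <- "x,y,z\n1,2,3"

from_csv   <- read_csv(csv_text)
from_delim <- read_delim(csv_text, delim = ",")

print(from_csv)
print(from_delim)
```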

# skip=2 skips the first two lines
> dat=read_csv("This line describes the sample data
+ This line is just a comment
+ x,y,z
+ 1,2,3",skip=2)
> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3
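
When the leading lines to be ignored share a marker character, the comment argument of read_csv() is an alternative to counting lines with skip. A small sketch, assuming the metadata lines start with "#" (the input text here is invented):

```r
library(readr)

# Lines starting with "#" are dropped before the header is read
dat <- read_csv("# sample data\n# just a note\nx,y,z\n1,2,3", comment = "#")
print(dat)
```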

> dat=read_csv("1,2,3\n4,5,6",col_names=FALSE)
> print(dat)
# A tibble: 2 x 3
     X1    X2    X3
  <int> <int> <int>
1     1     2     3
2     4     5     6
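
With col_names=FALSE the columns get the default names X1, X2, X3 as above. col_names also accepts a character vector, so a header-less file can be given meaningful names directly. A sketch with made-up names a, b, c:

```r
library(readr)

# Supply the column names yourself instead of the default X1, X2, X3
dat <- read_csv("1,2,3\n4,5,6", col_names = c("a", "b", "c"))
print(dat)
```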

For semicolon-delimited files, use read_csv2():

> dat=read_csv2("x;y;z\n1;2;3")

> print(dat)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3
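
Besides using ";" as the field separator, read_csv2() also treats "," as the decimal mark, which is the convention in many European locales. A minimal sketch with made-up columns price and qty:

```r
library(readr)

# "1,5" is parsed as the number 1.5 because "," is the decimal mark here
dat <- read_csv2("price;qty\n1,5;2\n2,25;4")
print(dat)
```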

For tab-delimited files, use read_tsv():

> dat1=read_tsv("x\ty\tz\n1\t2\t3")
> print(dat1)
# A tibble: 1 x 3
      x     y     z
  <int> <int> <int>
1     1     2     3

For an arbitrary delimiter, use read_delim():

> dat2=read_delim("x|y|z\n1|2|3",delim="|")
> print(dat2)
# A tibble: 1 x 3
      x     y     z
  <int> <
