R Basics 2: Data Types

shlay

Article link: https://blog.csdn.net/s1164548515/article/details/88623960

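This post covers five data types that come up regularly in data processing and analysis: Numeric, Logical, Character, Factor, and Dates and Times. It describes the properties of each type, explains how missing values behave and how to detect them, and finally shows how to test an object's type and how to coerce one type into another.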

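1. Numeric

Numeric values are stored as one of two types: double or integer.

> # trim - If FALSE right justified with common width
> format(c(1,10.123456,100.1,1000.1), trim = FALSE) # right-justified: shorter values are padded with spaces on the left
[1] "   1.00000" "  10.12346" " 100.10000" "1000.10000"
> format(c(1,10.123456,100,1000), trim = TRUE) # no padding: each string is only as wide as its value
[1] "1.00000"    "10.12346"   "100.00000"  "1000.00000"

> format(13.7, nsmall = 3) # at least 3 decimal places: zeros are appended if there are fewer, and extra decimals are kept as they are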
[1] "13.700"

> format(13.70001, nsmall = 3)
[1] "13.70001"

> format(2^16, scientific = TRUE) # use scientific notation
[1] "6.5536e+04"
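
The difference between the two numeric storage types can be checked directly. The following is a minimal sketch; the variable names `x_dbl` and `x_int` are made up for illustration and do not appear in the examples above.

```r
# A literal without a suffix is a double; the L suffix makes it an integer.
x_dbl <- 1
x_int <- 1L

type_dbl <- typeof(x_dbl)   # "double"
type_int <- typeof(x_int)   # "integer"

# Both storage types count as numeric.
both_numeric <- is.numeric(x_dbl) && is.numeric(x_int)

# Mixing the two in arithmetic promotes the result to double.
type_sum <- typeof(x_int + x_dbl)   # "double"

# The colon operator yields integers when the start is a whole number.
type_seq <- typeof(1:3)             # "integer"
```
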

2. Logical

A logical value is either TRUE or FALSE.

2.1 Common logical expressions

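
As a rough illustration, here is a sketch of logical operators that appear frequently in R code. The selection and the variable names are assumptions made for this example, not a list taken from the post.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

not_a   <- !a          # element-wise NOT
a_and_b <- a & b       # element-wise AND
a_or_b  <- a | b       # element-wise OR
a_xor_b <- xor(a, b)   # element-wise exclusive OR

# && and || compare single values and short-circuit,
# which makes them the usual choice inside if().
first_both <- a[1] && b[1]

# Comparison operators also return logical vectors.
v   <- c(3, 5, 8)
cmp <- v >= 5
```

The element-wise forms `&` and `|` are the ones to use when subsetting a vector, as in the examples in the next subsection.
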

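2.2 Examples of logical operations:

> x <- 1:10

> (x%%2==0) | (x > 5) # TRUE where x is even or greater than 5, FALSE otherwise
 [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

> x[(x%%2==0) | (x > 5)] # indexing with the logical vector returns the elements that satisfy the condition
[1]  2  4  6  7  8  9 10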

%in%

> x <- 1:10 ; y <- 5:15 

> x %in% y # TRUE for each element of x that also occurs in y, FALSE otherwise
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

> x[x %in% y] # the elements that satisfy the condition
[1]  5  6  7  8  9 10

> any(x>5) # is any element of x greater than 5?
[1] TRUE

> all(x>5) # are all elements of x greater than 5?
[1] FALSE

==, != and all.equal()

> name <- "Nick"
> if(name=="Nick") TRUE else FALSE # == and != work when name holds a non-missing value; with NA, if() raises an error
[1] TRUE

> name <- NA
> if(name=="Nick") TRUE else FALSE
Error in if (name == "Nick") TRUE else FALSE :
missing value where TRUE/FALSE needed

> if(identical(name, "Nick")) TRUE else FALSE
[1] FALSE

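all.equal() is usually not used directly in a condition, because when the values differ it returns a description of the difference instead of FALSE. Wrap it in isTRUE(), or use identical() for an exact comparison.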

> (x <- sqrt(2))
[1] 1.414214
> x^2
[1] 2
> x^2==2
[1] FALSE
> all.equal(x^2, 2)
[1] TRUE
> all.equal(x^2, 1)
[1] "Mean relative difference: 0.5"
> isTRUE(all.equal(x^2, 1))
[1] FALSE
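
The pattern can be wrapped in a small helper. This is only a sketch: `nearly_equal` is a hypothetical name introduced here, and it relies on the default tolerance of all.equal().

```r
# Hypothetical helper (not from the post): TRUE only when x and y
# agree within all.equal()'s default numeric tolerance.
nearly_equal <- function(x, y) isTRUE(all.equal(x, y))

r <- sqrt(2)
exact  <- r^2 == 2               # exact comparison, FALSE because of rounding error
approx <- nearly_equal(r^2, 2)   # tolerant comparison

# identical() always returns a single TRUE or FALSE, even when an
# argument is NA, so it is safe to use inside if().
name    <- NA
is_nick <- identical(name, "Nick")
```
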

3. Character

Strings are written between single or double quotes.

3.1 Summary of character functions

3.2 Examples:

> animals <- c("bird", "horse", "fish")
> length(animals) # number of strings in the vector
[1] 3
> nchar(animals) # number of characters in each string
[1] 4 5 4
> cat("Animals:", animals)
Animals: bird horse fish

> home <- c("tree", "barn", "lake")
> cat(animals, home, "\n") # home is printed right after animals
bird horse fish tree barn lake

> cat(animals,"\n", home, "\n")  # "\n" is the newline character
bird horse fish 
 tree barn lake 
 
> paste(animals, collapse=" ") # join all strings in the vector into one string
[1] "bird horse fish" 

> a.h=paste(animals, home, sep=".") # pair animals and home element by element, joined with the separator given in sep
> a.h
[1] "bird.tree"  "horse.barn" "fish.lake" 

> unlist(strsplit(a.h, ".", fixed=TRUE)) # split a.h on a literal "."; fixed=TRUE is required here
[1] "bird"  "tree"  "horse" "barn"  "fish"  "lake" 
> unlist(strsplit(a.h, ".", fixed=FALSE)) # without fixed=TRUE, "." is a regular expression that matches any character
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
 
> substr(animals, 2, 4) # characters 2 through 4 of each string
[1] "ird" "ors" "ish"

> strtrim(animals, 3) # the first 3 characters of each string
[1] "bir" "hor" "fis"

> toupper(animals) # convert all strings to upper case
[1] "BIRD"  "HORSE" "FISH"

3.3 Regular expressions

Symbol	Meaning
^	matches at the start of the string
$	matches at the end of the string
.	any single character
.{n}	any n characters
[x-y]	any character in the given range, e.g. [a-c] matches a, b or c

Examples:
grep

> col = c("red1","red2","red3","red4","violetred","darkred","indianred","red","sgreen")

> col[grep("red", col)] # contains "red"
[1] "red1"      "red2"      "red3"      "red4"      "violetred"    "darkred"   "indianred" "red"    
  
> col[grep("^red", col)] # starts with "red"
[1] "red1" "red2" "red3" "red4" "red" 

> col[grep("red$", col)] # ends with "red"
[1] "violetred" "darkred"   "indianred" "red"      

> col[grep("red.", col)]  # "red" followed by at least one more character
[1] "red1" "red2" "red3" "red4"

> col[grep("^[s-t]", col)] # starts with s or t
[1] "sgreen"

gsub replaces every match; sub replaces only the first match in each string.

> places <- c("home", "zoo", "school", "work", "park")

> gsub("o", "O", places)
[1] "hOme"   "zOO"    "schOOl" "wOrk"   "park"  

> sub("o", "O", places)
[1] "hOme"   "zOo"    "schOol" "wOrk"   "park" 

> gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", places, perl=TRUE) # capitalize the first letter; \\U and \\L need perl=TRUE
[1] "Home"   "Zoo"    "School" "Work"   "Park"
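
The `.{n}` pattern from the table is not used in the examples above, so here is a short sketch that exercises it together with a character range. The shortened `col` vector and the result names are chosen for illustration only.

```r
col <- c("red1", "red2", "violetred", "darkred", "red", "sgreen")

# grepl() returns a logical vector instead of the indices grep() gives.
has_red <- grepl("red", col)

# "^.{4}$" matches strings that are exactly four characters long.
four_chars <- col[grepl("^.{4}$", col)]

# "[0-9]$" matches strings whose last character is a digit.
ends_in_digit <- col[grepl("[0-9]$", col)]

# value = TRUE makes grep() return the matching strings directly.
ends_in_red <- grep("red$", col, value = TRUE)
```
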

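4. Factor

A factor is a categorical variable and can be either ordered or unordered. Factors are created with the function factor().

4.1 Creating factors

Examples:

# unordered
> factor(rep(1:2, 4), labels=c("hello", "hey"))
[1] hello hey   hello hey   hello hey   hello hey  
Levels: hello hey
# ordered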
> factor(rep(1:3, 3), labels=c("low", "med", "high"), ordered=TRUE)
 [1] low  med  high low  med  high low  med  high
Levels: low < med < high

4.2 Factor-related functions

Examples:

> f <- gl(3, 2, labels=paste("trt",1:3, sep="."))
> f
[1] trt.1 trt.1 trt.2 trt.2 trt.3 trt.3
Levels: trt.1 trt.2 trt.3

> levels(f)
[1] "trt.1" "trt.2" "trt.3"

> nlevels(f)
[1] 3

> relevel(f, "trt.2")
[1] trt.1 trt.1 trt.2 trt.2 trt.3 trt.3
Levels: trt.2 trt.1 trt.3

> x <- runif(10)
> x
 [1] 0.65165567 0.56773775 0.11350898 0.59592531 0.35804998 0.42880942
 [7] 0.05190332 0.26417767 0.39879073 0.83613414
 
> cut(x, 3)
 [1] (0.575,0.837]  (0.313,0.575]  (0.0511,0.313] (0.575,0.837] 
 [5] (0.313,0.575]  (0.313,0.575]  (0.0511,0.313] (0.0511,0.313]
 [9] (0.313,0.575]  (0.575,0.837] 
Levels: (0.0511,0.313] (0.313,0.575] (0.575,0.837]

> cut(x, c(0,.25,.5,.75,1))
 [1] (0.5,0.75] (0.5,0.75] (0,0.25]   (0.5,0.75] (0.25,0.5] (0.25,0.5]
 [7] (0,0.25]   (0.25,0.5] (0.25,0.5] (0.75,1]  
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
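
One thing worth knowing when working with factors is that they are stored as integer codes with a set of labels. The sketch below, with made-up data, shows how the codes differ from the labels and one common way to get the original numbers back.

```r
f <- factor(c("10", "20", "10", "30"))

lv     <- levels(f)                       # the labels: "10" "20" "30"
codes  <- as.integer(f)                   # the internal codes: 1 2 1 3
values <- as.numeric(as.character(f))     # the original numbers: 10 20 10 30
counts <- table(f)                        # how often each level occurs
```

Calling as.numeric() directly on a factor returns the codes, not the labels, which is why the conversion goes through as.character() first.
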

5. Dates and Times

5.1 Summary of date and time functions

5.2 Converting strings to dates

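Code	Meaning	Code	Meaning
%a	abbreviated weekday name	%A	full weekday name
%y	year without century	%Y	year with century
%b	abbreviated month name	%B	full month name
%m	month as a number (01-12)	%d	day of the month (01-31)

Example:
> dates_2 <- c("5-1-2008", "19-8-2008", "2-2-2009", "29-9-2009")
> as.Date(dates_2, format="%d-%m-%Y")
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"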
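
A few more conversions, as a sketch. Only numeric format codes are used so that the result does not depend on the locale; the input strings are made up for illustration.

```r
d1 <- as.Date("19/08/08", format = "%d/%m/%y")     # two-digit year
d2 <- as.Date("2008.08.19", format = "%Y.%m.%d")   # four-digit year
same_day <- d1 == d2

# format() goes the other way, from Date to character.
iso <- format(d1, "%Y-%m-%d")

# A string that does not describe a valid date gives NA.
bad <- as.Date("2008-13-01", format = "%Y-%m-%d")
```
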

5.3 Sequences of dates

Creating a sequence of dates
Syntax: seq.Date(from, to, by, length.out = NULL)
# by = "day" / "week" / "month" / "year" / "3 days" / "2 weeks" / "4 months" / "2 years"

Examples:

> seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="week")
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"

> seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="3 days")
 [1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
 [6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"

> seq.Date(as.Date("2011/1/1"), by="week", length.out=10)
 [1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
 [6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"
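
The other values of by work in the same way. A brief sketch with month and two-week steps; calling the generic seq() on a Date dispatches to seq.Date, and the variable names are illustrative.

```r
# Four dates, one calendar month apart.
monthly <- seq(as.Date("2011-01-01"), by = "month", length.out = 4)

# Every second week between two dates.
biweekly <- seq(as.Date("2011-01-01"), as.Date("2011-02-01"), by = "2 weeks")

monthly_chr  <- format(monthly)    # dates as "YYYY-MM-DD" strings
biweekly_chr <- format(biweekly)
```
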

5.4 Cutting dates

Dividing a sequence of dates into levels:
cut.Date(x, breaks, start.on.monday = TRUE)
Example:

> jan <- seq.Date(as.Date("2011/1/1"), as.Date("2011/1/15"), by="days")
> cut(jan, breaks="weeks")
 [1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
 [7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10
Levels: 2010-12-27 2011-01-03 2011-01-10
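
Because the result of cut() is a factor, it can be passed to table() to count the dates in each week. A small sketch along those lines, rebuilding `jan` so that the block runs on its own:

```r
jan <- seq(as.Date("2011-01-01"), as.Date("2011-01-15"), by = "day")

# Each date is labelled with the Monday that starts its week.
wk <- cut(jan, breaks = "weeks")

# Number of dates falling into each week.
per_week <- table(wk)
week_labels <- names(per_week)
```

The first level is 2010-12-27 because 1 January 2011 was a Saturday, and with start.on.monday = TRUE its week begins on the preceding Monday.
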


5.5 Operations with Dates

Examples:

> mar1 <- as.Date("2019/3/18")
> mar1
[1] "2019-03-18"

> (mar8<-mar1+7)
[1] "2019-03-25"

> mar1-14
[1] "2019-03-04"

> mar8>mar1
[1] TRUE

> format.Date(mar8, "%Y")
[1] "2019"

> format.Date(mar1, "%b-%d") # %b depends on the locale; the output below comes from a Chinese locale, where "3月" is March
[1] "3月-18"
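
Subtracting one date from another gives a time difference. The sketch below uses made-up dates and variable names to show how to turn that difference into a plain number.

```r
start <- as.Date("2019-03-18")
end   <- as.Date("2019-04-01")

gap <- end - start                             # a difftime object, in days
gap_days  <- as.numeric(gap, units = "days")   # plain number of days
gap_weeks <- as.numeric(difftime(end, start, units = "weeks"))
```
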

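6. Missing data

R represents a missing value as NA.
When a data structure contains missing values, summary functions such as mean() return NA; passing na.rm = TRUE makes them ignore the missing values.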

> x <- c(4, 7, 2, 0, 1, NA)
> mean(x)
[1] NA
> mean(x,na.rm = TRUE) # mean of the 5 non-missing values
[1] 2.8

> x1 <- matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))
> x1
      c.1 c.2
[1,]   1   3
[2,]   2   4

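Detecting missing values
is.na( ) # tests each element: TRUE if it is NA (or NaN), FALSE otherwise
any(is.na( )) # TRUE if the object contains at least one missing value, FALSE otherwise
is.null( ) # TRUE if the object is NULL, FALSE otherwise
is.nan() # tests each element: TRUE if it is NaN, FALSE otherwise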

> x2 <- c(4, 7, 2, 0, 1, NA)

> is.na(x2)
[1] FALSE FALSE FALSE FALSE FALSE  TRUE

> any(is.na(x2))
[1] TRUE

> is.null(x2)
[1] FALSE

> (y <- x/0)
[1] Inf Inf Inf NaN Inf  NA

> is.na(y)
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE

> is.nan(y)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE
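
Once missing values have been found, they are usually counted, dropped or replaced. The following sketch shows one way to do each; replacing NA with the mean is used purely as an illustration, not as a recommendation.

```r
x <- c(4, 7, 2, 0, 1, NA)

n_missing <- sum(is.na(x))    # TRUE counts as 1, so this counts the NAs
clean     <- x[!is.na(x)]     # keep only the non-missing values

# Replace each NA with the mean of the observed values (illustration only).
filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# NA == NA is itself NA, so test for missing values with is.na(), not ==.
na_cmp <- NA == NA
```
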

7. Testing and coercing data types

Testing and Coercing Functions
Examples:

> x <- 1:10
> x > 5
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 
> sum(x>5)
[1] 5

> is.vector(x)
[1] TRUE

> is.numeric(x)
[1] TRUE

> y <- as.list(x)
> is.list(y)
[1] TRUE

> as.numeric("123")
[1] 123
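
Some further coercions are sketched below, including cases where the conversion loses information or fails. The variable names are illustrative.

```r
int_val <- as.integer(3.9)                         # truncates toward zero: 3
chr_val <- as.character(c(1.5, 2))                 # "1.5" "2"
lgl_val <- as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
num_val <- as.numeric(c(TRUE, FALSE))              # 1 0

# Text that is not a number becomes NA; R also issues a warning,
# which is suppressed here to keep the output clean.
bad <- suppressWarnings(as.numeric("abc"))

# c() silently coerces mixed elements to the most general type present.
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)                        # "character"
```
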