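R Basics 2: Data Types
This post introduces five data types commonly encountered in data processing and analysis: Numeric, Logical, Character, Factor, and Dates and Times. It describes the properties of each type, the behavior of missing values and how to detect them, and finally how to check a value's type and coerce between types.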
1. Numeric
The numeric type covers both double and integer values.
> # trim - If FALSE right justified with common width
> format(c(1,10.123456,100.1,1000.1), trim = FALSE) # right-justified: padded with spaces on the left to a common width
[1] "   1.00000" "  10.12346" " 100.10000" "1000.10000"
> format(c(1,10.123456,100,1000), trim = TRUE) # trim = TRUE suppresses the padding
[1] "1.00000" "10.12346" "100.00000" "1000.00000"
> format(13.7, nsmall = 3) # keep at least 3 decimal places: pads with zeros if there are fewer, leaves the value unchanged if there are more
[1] "13.700"
> format(13.70001, nsmall = 3)
[1] "13.70001"
> format(2^16, scientific = TRUE) # use scientific notation
[1] "6.5536e+04"
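As a small illustration of the double/integer distinction mentioned at the start of this section, the sketch below uses only base R. The variable names are made up for the example and do not come from the post.

```r
# Illustrative values; the names are hypothetical.
x_double <- 3    # a plain numeric literal is stored as a double
x_int <- 3L      # the L suffix requests an integer

typeof(x_double)        # "double"
typeof(x_int)           # "integer"
is.numeric(x_int)       # TRUE: integers are numeric too
typeof(x_int + 0.5)     # "double": mixing the two promotes to double
typeof(1:10)            # "integer": the colon operator yields integers
```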
2. Logical
A logical value is either TRUE or FALSE.
2.1 Common logical expressions
2.2 Examples of logical operations:
> x <- 1:10
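> (x%%2==0) | (x > 5) # TRUE for elements of x that are even or greater than 5, FALSE otherwise
[1] FALSE TRUE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
> x[(x%%2==0) | (x > 5)] # indexing x with the logical vector returns the elements that satisfy the condition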
[1] 2 4 6 7 8 9 10
%in%
> x <- 1:10 ; y <- 5:15
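> x %in% y # TRUE for each element of x that also occurs in y, FALSE otherwise
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> x[x %in% y] # return the elements that satisfy the condition
[1] 5 6 7 8 9 10
> any(x>5) # is at least one element of x greater than 5?
[1] TRUE
> all(x>5) # are all elements of x greater than 5?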
[1] FALSE
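The examples above combine conditions with `|`. A minimal sketch of the other element-wise operators, using the same `x`, might look like this; the helper names `even` and `big` are introduced only for illustration.

```r
x <- 1:10
even <- x %% 2 == 0      # logical vector marking the even elements
big  <- x > 5            # logical vector marking the elements above 5

x[even & big]            # both conditions hold: 6 8 10
x[!even]                 # negation: 1 3 5 7 9
which(big)               # positions of the TRUE values: 6 7 8 9 10
sum(even)                # TRUE counts as 1, so this is 5
```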
== ; != ; all.equal()
> name <- "Nick"
> if(name=="Nick") TRUE else FALSE # == and != work when name holds an actual value; with NA the condition is NA and if() raises an error
[1] TRUE
> name <- NA
> if(name=="Nick") TRUE else FALSE
Error in if (name == "Nick") TRUE else FALSE :
missing value where TRUE/FALSE needed
> if(identical(name, "Nick")) TRUE else FALSE
[1] FALSE
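The error above arises because comparing with NA yields NA rather than FALSE. As a rough sketch (base R only), these are some ways to obtain a definite TRUE/FALSE when the value may be missing:

```r
name <- NA

name == "Nick"             # NA, not FALSE: a comparison with NA is "unknown"
isTRUE(name == "Nick")     # FALSE: isTRUE() only accepts a single TRUE
name %in% "Nick"           # FALSE: %in% never returns NA
identical(name, "Nick")    # FALSE

if (isTRUE(name == "Nick")) "match" else "no match"   # "no match", no error
```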
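all.equal() is usually not used directly in a condition, because on a mismatch it returns a character string describing the difference instead of FALSE. Wrap it in isTRUE(), or use identical() instead.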
> (x <- sqrt(2))
[1] 1.414214
> x^2
[1] 2
> x^2==2
[1] FALSE
> all.equal(x^2, 2)
[1] TRUE
> all.equal(x^2, 1)
[1] "Mean relative difference: 0.5"
> isTRUE(all.equal(x^2, 1))
[1] FALSE
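The same pattern applies to any floating-point comparison. The sketch below wraps it in a helper; `nearly_equal` is a hypothetical name chosen for this example, not a base R function.

```r
a <- 0.1 + 0.2

a == 0.3                      # FALSE: 0.1 and 0.2 are not stored exactly in binary
isTRUE(all.equal(a, 0.3))     # TRUE: equal within the default tolerance
abs(a - 0.3) < 1e-9           # TRUE: an explicit tolerance check

# Hypothetical helper that is safe to use inside if()
nearly_equal <- function(x, y) isTRUE(all.equal(x, y))
nearly_equal(sqrt(2)^2, 2)    # TRUE
nearly_equal(sqrt(2)^2, 1)    # FALSE
```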
3. Character
Strings are defined using either single or double quotes.
3.1 Summary of character functions
3.2 Examples:
> animals <- c("bird", "horse", "fish")
> length(animals) # number of strings in the vector
[1] 3
> nchar(animals) # number of characters in each string
[1] 4 5 4
> cat("Animals:", animals)
Animals: bird horse fish
> home <- c("tree", "barn", "lake")
> cat(animals, home, "\n") # home is printed right after animals on the same line
bird horse fish tree barn lake
> cat(animals,"\n", home, "\n") # "\n" is the newline character
bird horse fish
tree barn lake
> paste(animals, collapse=" ") # join all strings in the vector into one long string
[1] "bird horse fish"
> a.h <- paste(animals, home, sep=".") # pair up animals and home element by element, joined by the separator given in sep
> a.h
[1] "bird.tree" "horse.barn" "fish.lake"
> unlist(strsplit(a.h, ".", fixed=TRUE)) # split a.h on a literal "."; fixed=TRUE is required because "." is a regex metacharacter
[1] "bird" "tree" "horse" "barn" "fish" "lake"
> unlist(strsplit(a.h, ".", fixed=FALSE)) # as a regex, "." matches every character, so only empty strings remain
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
> substr(animals, 2, 4) # extract characters 2 through 4 of each string
[1] "ird" "ors" "ish"
> strtrim(animals, 3) # keep the first 3 characters of each string
[1] "bir" "hor" "fis"
> toupper(animals) # convert all strings to upper case
[1] "BIRD" "HORSE" "FISH"
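The strsplit() calls above flatten everything with unlist(). When the pieces should stay grouped per input string, one possible approach is to keep the list and index into it, as in this sketch (it rebuilds `a.h` so that it runs on its own):

```r
a.h <- c("bird.tree", "horse.barn", "fish.lake")

parts <- strsplit(a.h, ".", fixed = TRUE)   # a list with one character vector per input string
sapply(parts, `[`, 1)                       # first piece of each: "bird" "horse" "fish"
sapply(parts, `[`, 2)                       # second piece of each: "tree" "barn" "lake"

# Escaping the dot is an alternative to fixed = TRUE
unlist(strsplit(a.h, "\\."))                # the same six pieces as with fixed = TRUE

tolower("FISH")                             # "fish", the counterpart of toupper()
```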
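3.3 Regular expressions
Symbol | Meaning |
---|---|
^ | matches at the start of the string |
$ | matches at the end of the string |
. | any single character |
.{n} | any n characters |
[x-y] | any character in the given range, e.g. [a-c] matches a, b, or c |
Examples: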
grep
> col = c("red1","red2","red3","red4","violetred","darkred","indianred","red","sgreen")
> col[grep("red", col)] # contains "red"
[1] "red1" "red2" "red3" "red4" "violetred" "darkred" "indianred" "red"
> col[grep("^red", col)] # starts with "red"
[1] "red1" "red2" "red3" "red4" "red"
> col[grep("red$", col)] # ends with "red"
[1] "violetred" "darkred" "indianred" "red"
> col[grep("red.", col)] # "red" followed by at least one more character
[1] "red1" "red2" "red3" "red4"
> col[grep("^[s-t]", col)] # strings that start with s or t
[1] "sgreen"
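The grep examples do not exercise `.{n}` from the table, and grep() returns positions rather than TRUE/FALSE. A short sketch with grepl(), using a made-up vector `words`, could fill in those cases:

```r
words <- c("cat", "cart", "bat", "dog")   # illustrative data

grepl("^c", words)                # TRUE TRUE FALSE FALSE: starts with "c"
grepl("^.{3}$", words)            # TRUE FALSE TRUE TRUE: exactly three characters
grepl("^[a-c]", words)            # TRUE TRUE TRUE FALSE: first letter is a, b, or c
grep("t$", words)                 # 1 2 3: positions of the strings ending in "t"
grep("t$", words, value = TRUE)   # "cat" "cart" "bat": the strings themselves
```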
gsub replaces every match; sub replaces only the first match in each string
> places <- c("home", "zoo", "school", "work", "park")
> gsub("o", "O", places)
[1] "hOme" "zOO" "schOOl" "wOrk" "park"
> sub("o", "O", places)
[1] "hOme" "zOo" "schOol" "wOrk" "park"
> gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", places, perl=TRUE) # capitalize the first letter; \\U and \\L require perl=TRUE
[1] "Home" "Zoo" "School" "Work" "Park"
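The capitalization example relies on the backreferences `\\1` and `\\2`. As a further sketch of the same mechanism, the code below swaps the two halves of a dotted string and strips surrounding whitespace; the input vector `pairs` is illustrative.

```r
pairs <- c("bird.tree", "horse.barn", "fish.lake")

# \\1 and \\2 refer back to the first and second parenthesised groups
sub("(\\w+)\\.(\\w+)", "\\2.\\1", pairs)    # "tree.bird" "barn.horse" "lake.fish"

# Strip leading and trailing whitespace with an alternation
gsub("^\\s+|\\s+$", "", "  zoo  ")          # "zoo"
trimws("  zoo  ")                           # "zoo": the built-in equivalent
```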
4. Factor
A factor is a categorical variable and can be defined as ordered or unordered. Use the function **factor()** to create one.
4.1 Creating factors
Examples:
# unordered
> factor(rep(1:2, 4), labels=c("hello", "hey"))
[1] hello hey hello hey hello hey hello hey
Levels: hello hey
# ordered
> factor(rep(1:3, 3), labels=c("low", "med", "high"), ordered=TRUE)
[1] low med high low med high low med high
Levels: low < med < high
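When the input is already text, the level order can be set with the `levels` argument instead of `labels`. The following sketch, with a made-up vector `sizes`, shows what an ordered factor then allows:

```r
sizes <- factor(c("med", "low", "high", "low"),
                levels = c("low", "med", "high"), ordered = TRUE)

levels(sizes)        # "low" "med" "high": the order given in `levels`
as.integer(sizes)    # 2 1 3 1: the underlying integer codes
sizes < "high"       # TRUE TRUE FALSE TRUE: comparisons work on ordered factors
max(sizes)           # high
table(sizes)         # counts per level: low 2, med 1, high 1
```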
4.2 Functions for working with factors
Examples:
> f <- gl(3, 2, labels=paste("trt",1:3, sep="."))
> f
[1] trt.1 trt.1 trt.2 trt.2 trt.3 trt.3
Levels: trt.1 trt.2 trt.3
> levels(f)
[1] "trt.1" "trt.2" "trt.3"
> nlevels(f)
[1] 3
> relevel(f, "trt.2")
[1] trt.1 trt.1 trt.2 trt.2 trt.3 trt.3
Levels: trt.2 trt.1 trt.3
> x <- runif(10)
> x
[1] 0.65165567 0.56773775 0.11350898 0.59592531 0.35804998 0.42880942
[7] 0.05190332 0.26417767 0.39879073 0.83613414
> cut(x, 3)
[1] (0.575,0.837] (0.313,0.575] (0.0511,0.313] (0.575,0.837]
[5] (0.313,0.575] (0.313,0.575] (0.0511,0.313] (0.0511,0.313]
[9] (0.313,0.575] (0.575,0.837]
Levels: (0.0511,0.313] (0.313,0.575] (0.575,0.837]
> cut(x, c(0,.25,.5,.75,1))
[1] (0.5,0.75] (0.5,0.75] (0,0.25] (0.5,0.75] (0.25,0.5] (0.25,0.5]
[7] (0,0.25] (0.25,0.5] (0.25,0.5] (0.75,1]
Levels: (0,0.25] (0.25,0.5] (0.5,0.75] (0.75,1]
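cut() labels each bin with its interval by default. A minimal sketch of custom labels and of the boundary behavior follows; the `scores` data and the Q1 to Q4 labels are invented for the example.

```r
scores <- c(0.10, 0.30, 0.55, 0.90)
breaks <- c(0, 0.25, 0.5, 0.75, 1)

grp <- cut(scores, breaks = breaks, labels = c("Q1", "Q2", "Q3", "Q4"))
as.character(grp)                                   # "Q1" "Q2" "Q3" "Q4"

# Intervals are open on the left by default, so the lowest break itself is excluded
cut(0, breaks = breaks)                             # NA
cut(0, breaks = breaks, include.lowest = TRUE)      # [0,0.25]
```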
5. Dates and Times
5.1 Summary of date and time functions
5.2 Converting strings to dates
Symbol | Meaning | Symbol | Meaning |
---|---|---|---|
%a | abbreviated weekday name | %A | full weekday name |
%y | year without century | %Y | year with century |
%b | abbreviated month name | %B | full month name |
%m | month as a number (01-12) | %d | day of the month (01-31) |
Examples:
> dates_2 <- c("5-1-2008", "19-8-2008", "2-2-2009", "29-9-2009")
> as.Date(dates_2, format="%d-%m-%Y")
[1] "2008-01-05" "2008-08-19" "2009-02-02" "2009-09-29"
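A few more parsing cases, sketched with only the numeric format codes from the table so that the results do not depend on the locale:

```r
as.Date("2008-01-05")                          # ISO format needs no format string
as.Date("05/01/08", format = "%d/%m/%y")       # "2008-01-05": %y reads a two-digit year
as.Date("not a date", format = "%d-%m-%Y")     # NA: unparseable input yields NA
format(as.Date("2008-08-19"), "%d-%m-%Y")      # "19-08-2008": format() goes the other way
```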
5.3 Sequence of Dates:
Create a sequence of dates.
Syntax: seq.Date(from, to, by, length.out = NULL)
# by = "day" / "week" / "month" / "year", optionally with a multiplier: "3 days" / "2 weeks" / "4 months" / "2 years"
Examples:
> seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="week")
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
> seq.Date(as.Date("2011/1/1"), as.Date("2011/1/31"), by="3 days")
[1] "2011-01-01" "2011-01-04" "2011-01-07" "2011-01-10" "2011-01-13"
[6] "2011-01-16" "2011-01-19" "2011-01-22" "2011-01-25" "2011-01-28"
[11] "2011-01-31"
> seq.Date(as.Date("2011/1/1"), by="week", length.out=10)
[1] "2011-01-01" "2011-01-08" "2011-01-15" "2011-01-22" "2011-01-29"
[6] "2011-02-05" "2011-02-12" "2011-02-19" "2011-02-26" "2011-03-05"
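The generic seq() dispatches to seq.Date() for Date inputs, so the shorter name works as well. A brief sketch with monthly and backward steps (the `start` variable is illustrative):

```r
start <- as.Date("2011-01-01")

seq(start, by = "month", length.out = 3)       # "2011-01-01" "2011-02-01" "2011-03-01"
seq(start, by = "-1 week", length.out = 3)     # "2011-01-01" "2010-12-25" "2010-12-18"
length(seq(start, as.Date("2011-12-31"), by = "day"))   # 365 days in 2011
```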
5.4 Cutting Dates
Divide a sequence of dates into intervals, returned as the levels of a factor:
cut.Date(x, breaks, start.on.monday = TRUE)
Examples:
> jan <- seq.Date(as.Date("2011/1/1"), as.Date("2011/1/15"), by="days")
> cut(jan, breaks="weeks")
[1] 2010-12-27 2010-12-27 2011-01-03 2011-01-03 2011-01-03 2011-01-03
[7] 2011-01-03 2011-01-03 2011-01-03 2011-01-10 2011-01-10 2011-01-10
[13] 2011-01-10 2011-01-10 2011-01-10
Levels: 2010-12-27 2011-01-03 2011-01-10
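The same idea works with other units. As a sketch, cutting by month and tabulating the result counts the days that fall in each month; the date range is chosen arbitrarily.

```r
days <- seq(as.Date("2011-01-25"), as.Date("2011-02-05"), by = "day")
by_month <- cut(days, breaks = "month")

levels(by_month)            # "2011-01-01" "2011-02-01": each level is the first day of a month
table(by_month)             # 7 days fall in January, 5 in February
```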
5.5 Operations with Dates
Examples:
> mar1 <- as.Date("2019/3/18")
> mar1
[1] "2019-03-18"
> (mar8<-mar1+7)
[1] "2019-03-25"
> mar1-14
[1] "2019-03-04"
> mar8>mar1
[1] TRUE
> format.Date(mar8, "%Y")
[1] "2019"
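> format.Date(mar1, "%b-%d") # %b is locale-dependent: this output comes from a Chinese locale; an English locale prints "Mar-18"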
[1] "3月-18"
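Subtracting two dates yields a difftime object rather than a plain number. A minimal sketch of working with such differences, using two dates one week apart:

```r
d1 <- as.Date("2019-03-18")
d2 <- as.Date("2019-03-25")

d2 - d1                                         # Time difference of 7 days
as.numeric(d2 - d1)                             # 7
as.numeric(difftime(d2, d1, units = "weeks"))   # 1
format(d1, "%Y-%m")                             # "2019-03"
diff(c(d1, d2, d2 + 3))                         # successive gaps of 7 and 3 days
```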
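6. Missing data
R uses NA to represent missing values.
If a data structure contains missing values, functions that compute on it, such as mean(), return NA. Passing na.rm = TRUE makes the function ignore the missing values.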
> x <- c(4, 7, 2, 0, 1, NA)
> mean(x)
[1] NA
> mean(x,na.rm = TRUE) # mean of the 5 non-missing values only
[1] 2.8
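The same propagation rule holds for other functions and operators. A short sketch on the vector from the example above:

```r
x <- c(4, 7, 2, 0, 1, NA)

sum(x)                    # NA
sum(x, na.rm = TRUE)      # 14
max(x, na.rm = TRUE)      # 7
NA > 1                    # NA: the result of the comparison is unknown
NA + 1                    # NA
```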
> x1 <- matrix(1:4, ncol=2, dimnames=list(NULL, c("c.1", "c.2")))
> x1
     c.1 c.2
[1,]   1   3
[2,]   2   4
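Detecting missing values
is.na( ) # TRUE for each element that is NA (NaN also counts), FALSE otherwise
any(is.na( )) # TRUE if the object contains at least one missing value, FALSE otherwise
is.null( ) # TRUE if the object is NULL (empty), FALSE otherwise
is.nan() # TRUE for each element that is NaN, FALSE otherwise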
> x2 <- c(4, 7, 2, 0, 1, NA)
> is.na(x2)
[1] FALSE FALSE FALSE FALSE FALSE TRUE
> any(is.na(x2))
[1] TRUE
> is.null(x2)
[1] FALSE
> (y <- x2/0)
[1] Inf Inf Inf NaN Inf NA
> is.na(y)
[1] FALSE FALSE FALSE TRUE FALSE TRUE
> is.nan(y)
[1] FALSE FALSE FALSE TRUE FALSE FALSE
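Building on these detection functions, the sketch below counts, locates, drops, and replaces missing entries. The replacement value 0 is an arbitrary placeholder chosen for the example, not a recommendation.

```r
y <- c(Inf, NaN, 3, NA)

sum(is.na(y))             # 2: counts both NaN and NA
which(is.na(y))           # 2 4: their positions
is.finite(y)              # FALSE FALSE TRUE FALSE
y[!is.na(y)]              # Inf 3: drop the missing entries

y[is.na(y)] <- 0          # replace missing entries (0 is an arbitrary placeholder)
y                         # Inf 0 3 0

is.null(NULL)             # TRUE: NULL is an empty object, not a missing value
```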
7. Type checking and coercion
Examples:
> x <- 1:10
> x > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
> sum(x>5) # TRUE is coerced to 1, so this counts the elements greater than 5
[1] 5
> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] TRUE
> y <- as.list(x)
> is.list(y)
[1] TRUE
> as.numeric("123")
[1] 123
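A few further coercion cases, sketched in base R. The factor `f` is a made-up example of a common pitfall: converting a factor directly to numeric returns its internal codes rather than its labels.

```r
class(1:10)                          # "integer"
class("a")                           # "character"
as.integer(TRUE)                     # 1
as.character(3.5)                    # "3.5"
suppressWarnings(as.numeric("abc"))  # NA: a failed conversion gives NA (with a warning)
c(1, "a", TRUE)                      # "1" "a" "TRUE": mixed vectors coerce to character

# Factors store integer codes, so convert through character first
f <- factor(c("10", "20", "10"))
as.numeric(f)                        # 1 2 1: the codes, not the labels
as.numeric(as.character(f))          # 10 20 10
```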