R for Data Science Summary: Vectors
The most important data structure in the tidyverse is the tibble, and tibbles are built on top of vectors.
library(tidyverse)
There are two kinds of vectors:
- Atomic vectors: logical, numeric (integer, double), character, complex, raw
- Lists: recursive vectors
The difference is that atomic vectors are homogeneous, while lists can be heterogeneous.
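A small sketch of what homogeneous versus heterogeneous means in practice, using base R only. The variable names `v` and `l` are made up for this illustration.

```r
# An atomic vector can hold only one type, so mixed inputs are coerced
# to the most general type present (here: character).
v <- c(1L, "a", TRUE)
v_type <- typeof(v)            # "character"

# A list keeps each element's own type.
l <- list(1L, "a", TRUE)
l_types <- sapply(l, typeof)   # "integer" "character" "logical"
```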
Every vector has two key properties, its type and its length:
typeof(letters)
#> [1] "character"
typeof(1:10)
#> [1] "integer"
x <- list("a", "b", 1:10)
length(x)
#> [1] 3
Augmented vectors are built on top of these, and include:
- Factors are built on top of integer vectors.
- Dates and date-times are built on top of numeric vectors.
- Data frames and tibbles are built on top of lists.
Logical
1:10 %% 3 == 0
#> [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
c(TRUE, TRUE, FALSE, NA)
#> [1] TRUE TRUE FALSE NA
Numeric
typeof(1)
#> [1] "double"
typeof(1L)
#> [1] "integer"
1.5L
#> [1] 1.5
x <- sqrt(2) ^ 2
x
#> [1] 2
x - 2
#> [1] 4.44e-16
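Because doubles are approximations, exact equality tests on them can fail. A minimal sketch, using only base R, of comparing within a tolerance instead. The `1e-8` threshold is an arbitrary choice for illustration.

```r
x <- sqrt(2) ^ 2

# Exact comparison fails because of the tiny rounding error.
exact <- x == 2

# Compare within an explicit tolerance instead.
close_enough <- abs(x - 2) < 1e-8

# all.equal() applies a default numeric tolerance; wrap it in isTRUE()
# because it returns a character description when the values differ.
base_check <- isTRUE(all.equal(x, 2))
```

`dplyr::near()` offers the same kind of tolerant comparison if dplyr is already loaded.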
c(-1, 0, 1) / 0
#> [1] -Inf NaN Inf
| | 0 | Inf | NA | NaN |
|---|---|---|---|---|
| is.finite() | x | | | |
| is.infinite() | | x | | |
| is.na() | | | x | x |
| is.nan() | | | | x |
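The table can be checked directly by running each helper on a vector that contains all four special cases. This is a quick base R sketch; `vals` is a name chosen for the example.

```r
vals <- c(0, Inf, NA, NaN)

finite   <- is.finite(vals)    # only 0 is finite
infinite <- is.infinite(vals)  # only Inf is infinite
missing_ <- is.na(vals)        # both NA and NaN count as missing
not_num  <- is.nan(vals)       # only NaN is "not a number"
```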
Character
x <- "This is a reasonably long string."
pryr::object_size(x)
#> 152 B
y <- rep(x, 1000)
pryr::object_size(y)
#> 8.14 kB
Missing values
NA # logical
#> [1] NA
NA_integer_ # integer
#> [1] NA
NA_real_ # double
#> [1] NA
NA_character_ # character
#> [1] NA
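All four values print as `NA`, so the difference is easiest to see with `typeof()`. The sketch below also shows that a plain `NA` is usually enough, because it is coerced to the type of the vector it joins.

```r
# Each typed missing value has a different underlying type.
na_types <- c(
  typeof(NA),
  typeof(NA_integer_),
  typeof(NA_real_),
  typeof(NA_character_)
)

# A plain (logical) NA is coerced to match its neighbours.
int_with_na <- typeof(c(NA, 1L))
chr_with_na <- typeof(c(NA, "a"))
```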
Coercion
To convert between types explicitly, use functions such as as.logical(), as.integer(), as.double(), and as.character(), or specify col_types when reading data with readr. Implicit coercion also happens:
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y) # how many are greater than 10?
#> [1] 44
mean(y) # what proportion are greater than 10?
#> [1] 0.44
if (length(x)) {
# do something
}
typeof(c(TRUE, 1L))
#> [1] "integer"
typeof(c(1L, 1.5))
#> [1] "double"
typeof(c(1.5, "a"))
#> [1] "character"
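For comparison with the implicit cases above, here is a short sketch of explicit coercion with the base R `as.*()` functions. The inputs are invented for the example.

```r
int_val <- as.integer("42")                    # string to integer
dbl_val <- as.double("3.14")                   # string to double
lgl_val <- as.logical(c("TRUE", "false", "T")) # several spellings are accepted
chr_val <- as.character(c(1.5, 2))             # numbers to strings

# Values that cannot be converted become NA (normally with a warning,
# silenced here so the example runs quietly).
bad <- suppressWarnings(as.integer(c("1", "x")))

# Converting a double to integer truncates toward zero.
truncated <- as.integer(2.9)
```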
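Sometimes you need to test the type of a vector. You can use typeof(), is.vector(), is.atomic(), and similar functions. Note that is.vector() returns FALSE for a vector that carries any attribute other than names.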
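A brief base R sketch of these type tests. It includes the case where `is.vector()` stops returning `TRUE` once an extra attribute is attached; the `"unit"` attribute is hypothetical and used only for illustration.

```r
x <- c(a = 1, b = 2)

x_type    <- typeof(x)      # "double"
is_dbl    <- is.double(x)   # TRUE
is_num    <- is.numeric(x)  # TRUE for both integer and double
is_atom   <- is.atomic(x)   # TRUE
is_vec    <- is.vector(x)   # TRUE: names are the only attribute

# Attach a non-name attribute: still atomic, but no longer a "vector"
# in the sense that is.vector() tests.
attr(x, "unit") <- "cm"
is_vec_after  <- is.vector(x)
is_atom_after <- is.atomic(x)
```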
Scalars and recycling rules
sample(10) + 100
#> [1] 109 108 104 102 103 110 106 107 105 101
runif(10) > 0.5
#> [1] TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
1:10 + 1:2
#> [1] 2 4 4 6 6 8 8 10 10 12
1:10 + 1:3
#> Warning in 1:10 + 1:3: longer object length is not a multiple of shorter
#> object length
#> [1] 2 4 6 5 7 9 8 10 12 11
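What recycling does can be spelled out by hand: the shorter vector is repeated until it matches the length of the longer one. A minimal base R sketch:

```r
# Implicit recycling of 1:2 to length 10 ...
implicit <- 1:10 + 1:2
# ... is equivalent to repeating it explicitly.
manual <- 1:10 + rep(1:2, length.out = 10)
same_even <- identical(implicit, manual)

# When the lengths do not divide evenly, the last repetition is cut short
# (R warns about this; the warning is silenced here).
uneven <- suppressWarnings(1:10 + 1:3)
same_uneven <- identical(uneven, 1:10 + rep(1:3, length.out = 10))
```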
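In the tidyverse, vectors are not recycled implicitly (only length-1 vectors are). To recycle a longer vector you must do it explicitly with rep():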
tibble(x = 1:4, y = 1:2)
#> Error: Column `y` must be length 1 or 4, not 2
tibble(x = 1:4, y = rep(1:2, 2))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 2
tibble(x = 1:4, y = rep(1:2, each = 2))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 2
#> 4 4 2
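A short sketch of the one case tibble does recycle, a length-1 value, next to the failing case, assuming the tibble package is installed. The error is caught with `tryCatch()` so the example runs to completion.

```r
library(tibble)

# A single value is recycled to every row.
ok <- tibble(x = 1:4, y = 0L)

# A length-2 column alongside a length-4 column is rejected.
res <- tryCatch(
  tibble(x = 1:4, y = 1:2),
  error = function(e) "error"
)
```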
Naming vectors
A vector can be named when it is created, or afterwards with set_names():
c(x = 1, y = 2, z = 4)
#> x y z
#> 1 2 4
set_names(1:3, c("a", "b", "c"))
#> a b c
#> 1 2 3
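Subsetting
dplyr::filter() filters the rows of a tibble. The analogous tool for vectors is the subsetting operator [, called as x[a]. The simplest form subsets with a numeric vector of positions: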
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
#> [1] "three" "two" "five"
x[c(1, 1, 5, 5, 5, 2)]
#> [1] "one" "one" "five" "five" "five" "two"
x[c(-1, -3, -5)]
#> [1] "two" "four"
x[c(1, -1)]
#> Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts
x[0]
#> character(0)
Subsetting with a logical vector:
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
#> [1] 10 3 5 8 1
# All even (or missing!) values of x
x[x %% 2 == 0]
#> [1] 10 NA 8 NA
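The `NA`s in the last result appear because a missing value in the logical index yields a missing value in the output. Two base R ways to drop them, as a sketch:

```r
x <- c(10, 3, NA, 5, 8, 1, NA)

# The comparison is NA wherever x is NA.
keep <- x %% 2 == 0

# which() returns only the positions that are TRUE, so NAs are dropped.
evens_which <- x[which(keep)]

# Alternatively, rule out missing values explicitly in the condition.
evens_and <- x[!is.na(x) & x %% 2 == 0]
```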
Subsetting a named vector with a character vector of names:
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
#> xyz def
#> 5 2
Recursive vectors(lists)
x <- list(1, 2, 3)
x
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
Because lists have a tree-like structure, str() is usually the better tool, since it focuses on structure rather than contents:
str(x)
#> List of 3
#> $ : num 1
#> $ : num 2
#> $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
#> List of 3
#> $ a: num 1
#> $ b: num 2
#> $ c: num 3
Unlike atomic vectors, a list can contain elements of different types:
y <- list("a", 1L, 1.5, TRUE)
str(y)
#> List of 4
#> $ : chr "a"
#> $ : int 1
#> $ : num 1.5
#> $ : logi TRUE
A list can even contain other lists:
z <- list(list(1, 2), list(3, 4))
str(z)
#> List of 2
#> $ :List of 2
#> ..$ : num 1
#> ..$ : num 2
#> $ :List of 2
#> ..$ : num 3
#> ..$ : num 4
To visualize how lists nest, consider these three lists:
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
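One way to see how these three lists differ, sketched in base R, is to look at the type of each top-level element and then drill into the most deeply nested one:

```r
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

# x1 holds two atomic vectors; x2 holds two lists;
# x3 holds a number followed by a nested list.
t1 <- sapply(x1, typeof)
t2 <- sapply(x2, typeof)
t3 <- sapply(x3, typeof)

# Reaching the innermost value of x3 takes three steps.
innermost <- x3[[2]][[2]][[1]]
```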
List-Subsetting
Subsetting with [ extracts a sub-list. The result is always a list:
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
str(a[1:2])
#> List of 2
#> $ a: int [1:3] 1 2 3
#> $ b: chr "a string"
str(a[4])
#> List of 1
#> $ d:List of 2
#> ..$ : num -1
#> ..$ : num -5
Subsetting with [[ extracts a single component and removes one level of hierarchy. The result may be a list or an atomic vector:
str(a[[1]])
#> int [1:3] 1 2 3
str(a[[4]])
#> List of 2
#> $ : num -1
#> $ : num -5
$ extracts a named element and works like [[, except that the name does not need quotes:
a$a
#> [1] 1 2 3
a[["a"]]
#> [1] 1 2 3
The difference between [ and [[ is that [[ returns the element itself, whereas [ returns a new, smaller list.
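The distinction shows up clearly in the type of the result. A small self-contained sketch that redefines the list `a` used in this section:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

# [ keeps the list wrapper, even for a single element.
single_bracket <- typeof(a["a"])     # "list"

# [[ returns the element itself.
double_bracket <- typeof(a[["a"]])   # "integer"

# [[ can be chained to reach into nested lists.
nested <- a[["d"]][[1]]              # -1
```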
Attributes
Attributes can be thought of as a named list of metadata that can be attached to any object. Use attr() to get or set a single attribute, and attributes() to see them all at once:
x <- 1:10
attr(x, "greeting")
#> NULL
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
#> $greeting
#> [1] "Hi!"
#>
#> $farewell
#> [1] "Bye!"
Three attributes are especially important in R:
- Names are used to name the elements of a vector.
- Dimensions (dims, for short) make a vector behave like a matrix or array.
- Class is used to implement the S3 object oriented system.
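Each of these three attributes can be set on a plain vector to change how it behaves. A minimal base R sketch; the values are chosen only for illustration:

```r
# Names: stored in the "names" attribute.
v <- c(a = 1, b = 2)
v_names <- attributes(v)$names

# Dimensions: giving a vector a "dim" attribute makes it a matrix
# (filled column by column).
m <- 1:6
dim(m) <- c(2, 3)
m_is_matrix <- is.matrix(m)
m_corner <- m[2, 3]

# Class: tagging a number with class "Date" changes how it is formatted.
d <- structure(365, class = "Date")
d_text <- format(d)
```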
Generic functions
as.Date
#> function (x, ...)
#> UseMethod("as.Date")
#> <bytecode: 0x5653be8>
#> <environment: namespace:base>
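The call to UseMethod() means this is a generic function, which dispatches to a specific method based on the class of its first argument. methods() lists the available methods: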
methods("as.Date")
#> [1] as.Date.character as.Date.default as.Date.factor as.Date.numeric
#> [5] as.Date.POSIXct as.Date.POSIXlt
#> see '?methods' for accessing help and source code
To see the implementation of a specific method, use getS3method():
getS3method("as.Date", "default")
#> function (x, ...)
#> {
#> if (inherits(x, "Date"))
#> x
#> else if (is.logical(x) && all(is.na(x)))
#> .Date(as.numeric(x))
#> else stop(gettextf("do not know how to convert '%s' to class %s",
#> deparse(substitute(x)), dQuote("Date")), domain = NA)
#> }
#> <bytecode: 0x593f2e8>
#> <environment: namespace:base>
getS3method("as.Date", "numeric")
#> function (x, origin, ...)
#> {
#> if (missing(origin))
#> stop("'origin' must be supplied")
#> as.Date(origin, ...) + x
#> }
#> <bytecode: 0x59414e0>
#> <environment: namespace:base>
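The same dispatch mechanism can be reproduced with a toy generic. In this sketch, `describe()` and its methods are hypothetical names invented for the example and are not part of base R.

```r
# The generic: UseMethod() picks a method based on the class of x.
describe <- function(x, ...) UseMethod("describe")

# Methods are named generic.class; "default" is the fallback.
describe.numeric   <- function(x, ...) "a number"
describe.character <- function(x, ...) "text"
describe.default   <- function(x, ...) "something else"

r_dbl <- describe(1.5)   # dispatches to describe.numeric
r_int <- describe(1L)    # integers also fall back to the numeric method
r_chr <- describe("a")   # dispatches to describe.character
r_lgl <- describe(TRUE)  # no logical method, so the default is used
```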
Augmented vectors
- Factors
- Dates
- Date-times
- Tibbles
factor
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "ab" "cd" "ef"
#>
#> $class
#> [1] "factor"
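Since a factor is an integer vector plus a `levels` attribute, the labels can be recovered by indexing the levels with the integer codes. A short base R sketch:

```r
f <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

# The underlying integer codes point into the levels.
codes <- as.integer(f)
labels_from_codes <- levels(f)[codes]

# Levels that never occur are still part of the factor.
n_levels <- nlevels(f)
ef_count <- table(f)[["ef"]]
```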
Dates and date-times
x <- as.Date("1971-01-01")
unclass(x)
#> [1] 365
typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "Date"
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"
typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "POSIXct" "POSIXt"
#>
#> $tzone
#> [1] "UTC"
attr(x, "tzone") <- "US/Pacific"
x
#> [1] "1969-12-31 17:00:00 PST"
attr(x, "tzone") <- "US/Eastern"
x
#> [1] "1969-12-31 20:00:00 EST"
y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
#> $names
#> [1] "sec" "min" "hour" "mday" "mon" "year" "wday"
#> [8] "yday" "isdst" "zone" "gmtoff"
#>
#> $class
#> [1] "POSIXlt" "POSIXt"
#>
#> $tzone
#> [1] "US/Eastern" "EST" "EDT"
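The numbers underneath dates and date-times can be converted in both directions. A base R sketch, fixed to UTC so the results do not depend on the local time zone:

```r
# A Date is the number of days since 1970-01-01.
d <- as.Date("1971-01-01")
days <- as.numeric(d)
d_back <- as.Date(365, origin = "1970-01-01")

# A POSIXct is the number of seconds since 1970-01-01 00:00:00 UTC.
tm <- as.POSIXct("1970-01-01 01:00:00", tz = "UTC")
secs <- as.numeric(tm)

# A POSIXlt stores the same instant as a list of components.
lt <- as.POSIXlt(tm, tz = "UTC")
hour <- lt$hour
year <- lt$year + 1900   # the year field counts from 1900
```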
Tibbles
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
#> [1] "list"
attributes(tb)
#> $names
#> [1] "x" "y"
#>
#> $row.names
#> [1] 1 2 3 4 5
#>
#> $class
#> [1] "tbl_df" "tbl" "data.frame"
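The difference between a tibble and a list is that all the vectors in a tibble must have the same length. A traditional data.frame has a very similar structure and differs mainly in its class: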
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
#> [1] "list"
attributes(df)
#> $names
#> [1] "x" "y"
#>
#> $class
#> [1] "data.frame"
#>
#> $row.names
#> [1] 1 2 3 4 5
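The equal-length constraint is what separates a data frame from a plain list. A base R sketch, with the failing call wrapped in `tryCatch()` so the example runs through:

```r
df <- data.frame(x = 1:5, y = 5:1)

# A data frame is a list of columns, each of the same length.
df_is_list <- is.list(df)
col_lengths <- lengths(df)

# A plain list accepts elements of different lengths ...
plain <- list(x = 1:5, y = 1:2)

# ... but data.frame() rejects lengths that cannot be reconciled.
res <- tryCatch(
  data.frame(x = 1:5, y = 1:2),
  error = function(e) "error"
)
```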
All of the code has been uploaded to GitHub.