R for Data Science Summary: Vectors

我要养只哈士奇

Article link: https://blog.csdn.net/weixin_38423453/article/details/84336461

Across the tidyverse, the most important data structure is the tibble, and the foundation of the tibble is the vector.

library(tidyverse)

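Vectors come in two kinds:

Atomic vectors: logical, numeric (integer, double), character, complex, raw
Lists: recursive vectors, so called because a list can contain other lists

The difference between them is that atomic vectors are homogeneous, while lists can be heterogeneous.

Every vector has two key properties, its type and its length: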

typeof(letters)
#> [1] "character"
typeof(1:10)
#> [1] "integer"

x <- list("a", "b", 1:10)
length(x)
#> [1] 3
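
To make the homogeneous/heterogeneous distinction concrete, here is a minimal sketch (the variable names v and l are only illustrative). The same three values go into an atomic vector and into a list: c() coerces everything to one common type, while list() keeps each element's own type.

```r
# c() builds an atomic vector, so every element is coerced to one type
v <- c(1, "a", TRUE)
typeof(v)
#> [1] "character"
v
#> [1] "1"    "a"    "TRUE"

# list() keeps the type of each element
l <- list(1, "a", TRUE)
sapply(l, typeof)
#> [1] "double"    "character" "logical"
length(l)
#> [1] 3
```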

Augmented vectors are built on top of these basic vectors, including:

Factors are built on top of integer vectors.
Dates and date-times are built on top of numeric vectors.
Data frames and tibbles are built on top of lists.

Logical

1:10 %% 3 == 0
#>  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

c(TRUE, TRUE, FALSE, NA)
#> [1]  TRUE  TRUE FALSE    NA

Numeric

typeof(1)
#> [1] "double"
typeof(1L)
#> [1] "integer"
1.5L
#> [1] 1.5

x <- sqrt(2) ^ 2
x
#> [1] 2
x - 2
#> [1] 4.44e-16
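
Because doubles are approximations, comparing them with == can give surprising results. The sketch below uses base R only and compares within a tolerance instead; the tolerance 1e-8 is an arbitrary choice for illustration. dplyr::near() provides the same kind of comparison.

```r
x <- sqrt(2) ^ 2

# Exact comparison fails because of floating point rounding
x == 2
#> [1] FALSE

# Compare within a small tolerance instead (1e-8 is an arbitrary choice)
abs(x - 2) < 1e-8
#> [1] TRUE

# all.equal() is tolerance-based; wrap it in isTRUE() before using it in if()
isTRUE(all.equal(x, 2))
#> [1] TRUE
```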

c(-1, 0, 1) / 0
#> [1] -Inf  NaN  Inf

                 0    Inf   NA    NaN
is.finite()      x
is.infinite()         x
is.na()                     x     x
is.nan()                          x
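
The table can be checked directly by applying the four helper functions to a vector that holds one of each special value:

```r
x <- c(0, Inf, NA, NaN)

is.finite(x)
#> [1]  TRUE FALSE FALSE FALSE
is.infinite(x)
#> [1] FALSE  TRUE FALSE FALSE
is.na(x)
#> [1] FALSE FALSE  TRUE  TRUE
is.nan(x)
#> [1] FALSE FALSE FALSE  TRUE
```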

Character

x <- "This is a reasonably long string."
pryr::object_size(x)
#> 152 B

y <- rep(x, 1000)
pryr::object_size(y)
#> 8.14 kB

Missing values

NA            # logical
#> [1] NA
NA_integer_   # integer
#> [1] NA
NA_real_      # double
#> [1] NA
NA_character_ # character
#> [1] NA
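
All four missing values print the same way, so the difference is only visible through typeof(). A short check, which also shows that a plain NA is usually enough because it is coerced to the type of the surrounding vector:

```r
typeof(NA)
#> [1] "logical"
typeof(NA_integer_)
#> [1] "integer"
typeof(NA_real_)
#> [1] "double"
typeof(NA_character_)
#> [1] "character"

# A plain NA takes on the type of the vector it is combined with
typeof(c(1L, NA))
#> [1] "integer"
typeof(c("a", NA))
#> [1] "character"
```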

Coercion

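Types can be converted explicitly with as.logical(), as.integer(), as.double(), or as.character(), or fixed at read time by specifying col_types when reading data with readr. Implicit coercion also happens, for example when a logical vector is used in a numeric context: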

x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y)  # how many are greater than 10?
#> [1] 44
mean(y) # what proportion are greater than 10?
#> [1] 0.44

if (length(x)) {
  # do something
}

typeof(c(TRUE, 1L))
#> [1] "integer"
typeof(c(1L, 1.5))
#> [1] "double"
typeof(c(1.5, "a"))
#> [1] "character"
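
For comparison with the implicit cases above, here is a small sketch of explicit coercion with the as.*() functions. The variable name bad is only illustrative.

```r
as.integer(c("1", "2", "3"))
#> [1] 1 2 3
as.double("1.5")
#> [1] 1.5
as.logical(c(0, 1, 2))
#> [1] FALSE  TRUE  TRUE
as.character(c(TRUE, FALSE))
#> [1] "TRUE"  "FALSE"

# Values that cannot be converted become NA (R also emits a warning,
# suppressed here to keep the output clean)
bad <- suppressWarnings(as.integer(c("10", "ten")))
bad
#> [1] 10 NA
```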

Sometimes you need to test what type a vector is; for that you can use typeof(), is.vector(), is.atomic(), and similar functions.
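
These test functions do not always behave as their names suggest, so it is worth trying them on a few objects. A minimal sketch with illustrative variable names:

```r
x <- 1:3
f <- factor("a")
l <- list(1, "a")

is.atomic(x)   # TRUE: an integer vector
is.atomic(l)   # FALSE: a list is not atomic
is.list(l)     # TRUE
is.vector(l)   # TRUE: lists count as vectors
is.vector(f)   # FALSE: a factor carries attributes other than names
is.atomic(f)   # TRUE: underneath, a factor is an integer vector
```

Note that is.vector() returns TRUE only when the object has no attributes other than names, which is why it rejects the factor.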

Scalars and recycling rules

sample(10) + 100
#>  [1] 109 108 104 102 103 110 106 107 105 101
runif(10) > 0.5
#>  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

1:10 + 1:2
#>  [1]  2  4  4  6  6  8  8 10 10 12

1:10 + 1:3
#> Warning in 1:10 + 1:3: longer object length is not a multiple of shorter
#> object length
#>  [1]  2  4  6  5  7  9  8 10 12 11

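The tidyverse does not recycle vectors implicitly (other than values of length 1). If you want recycling, you have to do it yourself with rep():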

tibble(x = 1:4, y = 1:2)
#> Error: Column `y` must be length 1 or 4, not 2

tibble(x = 1:4, y = rep(1:2, 2))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     1
#> 4     4     2

tibble(x = 1:4, y = rep(1:2, each = 2))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     2
#> 4     4     2
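
As a quick reference, the sketch below shows the main ways rep() can spell out the recycling explicitly (times, each, and length.out are all standard arguments of base::rep):

```r
rep(1:2, times = 2)
#> [1] 1 2 1 2
rep(1:2, each = 2)
#> [1] 1 1 2 2
rep(1:3, length.out = 5)
#> [1] 1 2 3 1 2
rep(c("a", "b"), times = c(3, 1))
#> [1] "a" "a" "a" "b"
```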

Naming vectors

A vector can be named when it is created, or afterwards with purrr::set_names():

c(x = 1, y = 2, z = 4)
#> x y z 
#> 1 2 4

set_names(1:3, c("a", "b", "c"))
#> a b c 
#> 1 2 3
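
The same can be done in base R without loading purrr. A minimal sketch using names<-, stats::setNames(), and unname():

```r
x <- c(1, 2, 4)
names(x) <- c("x", "y", "z")
names(x)
#> [1] "x" "y" "z"

# stats::setNames() is the base R counterpart of set_names()
y <- setNames(1:3, c("a", "b", "c"))
y[["b"]]
#> [1] 2

# unname() drops the names again
unname(y)
#> [1] 1 2 3
```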

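Subsetting

dplyr::filter() filters the rows of a tibble; for vectors, the counterpart is the subsetting operator, x[a]. The simplest form subsets with a numeric vector of positions: positive values keep elements, negative values drop them.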

x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
#> [1] "three" "two"   "five"

x[c(1, 1, 5, 5, 5, 2)]
#> [1] "one"  "one"  "five" "five" "five" "two"

x[c(-1, -3, -5)]
#> [1] "two"  "four"

x[c(1, -1)]
#> Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

x[0]
#> character(0)

Subsetting with a logical vector keeps the values at the positions where the condition is TRUE:

x <- c(10, 3, NA, 5, 8, 1, NA)

# All non-missing values of x
x[!is.na(x)]
#> [1] 10  3  5  8  1

# All even (or missing!) values of x
x[x %% 2 == 0]
#> [1] 10 NA  8 NA
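
A missing value in the logical index produces a missing value in the result, which is why the NAs appear above. Two ways to avoid that, sketched with base R:

```r
x <- c(10, 3, NA, 5, 8, 1, NA)

# which() returns only the positions where the condition is TRUE, so NAs drop out
x[which(x %% 2 == 0)]
#> [1] 10  8

# Equivalent: require the value to be non-missing explicitly
x[!is.na(x) & x %% 2 == 0]
#> [1] 10  8
```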

Subsetting a named vector with a character vector of names:

x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
#> xyz def 
#>   5   2

Recursive vectors (lists)

x <- list(1, 2, 3)
x
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 2
#> 
#> [[3]]
#> [1] 3

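Because lists can be nested like a tree, str() is usually the most useful tool for inspecting them: it focuses on the structure rather than the contents: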

str(x)
#> List of 3
#>  $ : num 1
#>  $ : num 2
#>  $ : num 3

x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
#> List of 3
#>  $ a: num 1
#>  $ b: num 2
#>  $ c: num 3

Unlike atomic vectors, a list can hold elements of different types:

y <- list("a", 1L, 1.5, TRUE)
str(y)
#> List of 4
#>  $ : chr "a"
#>  $ : int 1
#>  $ : num 1.5
#>  $ : logi TRUE

A list can even contain other lists:

z <- list(list(1, 2), list(3, 4))
str(z)
#> List of 2
#>  $ :List of 2
#>   ..$ : num 1
#>   ..$ : num 2
#>  $ :List of 2
#>   ..$ : num 3
#>   ..$ : num 4

Three lists with different nesting structures, useful for visualising how lists are laid out:

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

List-Subsetting

[ extracts a sub-list. The result is always a list:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

str(a[1:2])
#> List of 2
#>  $ a: int [1:3] 1 2 3
#>  $ b: chr "a string"
str(a[4])
#> List of 1
#>  $ d:List of 2
#>   ..$ : num -1
#>   ..$ : num -5

[[ extracts a single component and removes one level of hierarchy, so the result may be either a list or an atomic vector:

str(a[[1]])
#>  int [1:3] 1 2 3
str(a[[4]])
#> List of 2
#>  $ : num -1
#>  $ : num -5

$ extracts a named element of a list. It works like [[, except that the name does not need to be quoted:

a$a
#> [1] 1 2 3
a[["a"]]
#> [1] 1 2 3

The difference between [ and [[ is that [[ returns the element inside the list itself, while [ returns a new, smaller list.
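
The distinction matters most in practice when the extracted value is passed on to another function. A small sketch reusing the list a from above:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

class(a[1])      # [ keeps the list wrapper
#> [1] "list"
class(a[[1]])    # [[ returns the element itself
#> [1] "integer"

# Chain [[ to drill down into nested lists
a[[4]][[1]]
#> [1] -1
a$d[[2]]
#> [1] -5

# Arithmetic works on the extracted element, not on the sub-list
sum(a[[1]])
#> [1] 6
```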

Attributes

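Attributes can be thought of as a named list of metadata that can be attached to any object. Get or set an individual attribute with attr(), and retrieve all of them at once with attributes():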

x <- 1:10
attr(x, "greeting")
#> NULL
attr(x, "greeting") <- "Hi!"
attr(x, "farewell") <- "Bye!"
attributes(x)
#> $greeting
#> [1] "Hi!"
#> 
#> $farewell
#> [1] "Bye!"

Three attributes are especially important in R:

Names are used to name the elements of a vector.
Dimensions (dims, for short) make a vector behave like a matrix or array.
Class is used to implement the S3 object oriented system.
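
As a small illustration of the first two attributes, the sketch below (variable names are illustrative) turns a plain integer vector into a matrix just by setting its dim attribute, and shows that names are stored as an attribute too:

```r
x <- 1:6
dim(x) <- c(2, 3)   # same as attr(x, "dim") <- c(2, 3)
x
#>      [,1] [,2] [,3]
#> [1,]    1    3    5
#> [2,]    2    4    6
is.matrix(x)
#> [1] TRUE
x[2, 3]
#> [1] 6

# Names live in the "names" attribute
y <- c(a = 1, b = 2)
attr(y, "names")
#> [1] "a" "b"
```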

Generic functions

as.Date
#> function (x, ...) 
#> UseMethod("as.Date")
#> <bytecode: 0x5653be8>
#> <environment: namespace:base>

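The call to UseMethod() means this is a generic function: it calls a specific method based on the class of its first argument. methods() lists all the methods of a generic: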

methods("as.Date")
#> [1] as.Date.character as.Date.default   as.Date.factor    as.Date.numeric  
#> [5] as.Date.POSIXct   as.Date.POSIXlt  
#> see '?methods' for accessing help and source code

To see the implementation of a specific method, use getS3method():

getS3method("as.Date", "default")
#> function (x, ...) 
#> {
#>     if (inherits(x, "Date")) 
#>         x
#>     else if (is.logical(x) && all(is.na(x))) 
#>         .Date(as.numeric(x))
#>     else stop(gettextf("do not know how to convert '%s' to class %s", 
#>         deparse(substitute(x)), dQuote("Date")), domain = NA)
#> }
#> <bytecode: 0x593f2e8>
#> <environment: namespace:base>
getS3method("as.Date", "numeric")
#> function (x, origin, ...) 
#> {
#>     if (missing(origin)) 
#>         stop("'origin' must be supplied")
#>     as.Date(origin, ...) + x
#> }
#> <bytecode: 0x59414e0>
#> <environment: namespace:base>
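
To see the dispatch mechanism from the other side, here is a minimal sketch of a user-defined generic. The name describe and its methods are hypothetical, made up for this illustration; they are not part of base R or any package.

```r
# `describe` is a hypothetical generic used only for illustration
describe <- function(x, ...) UseMethod("describe")

# One method per class, named generic.class
describe.numeric <- function(x, ...) "a number"
describe.character <- function(x, ...) "a string"
# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "something else"

describe(1.5)
#> [1] "a number"
describe("hi")
#> [1] "a string"
describe(TRUE)
#> [1] "something else"
```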

Augmented vectors

Factors
Dates
Date-times
Tibbles

factor

x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
#> [1] "integer"
attributes(x)
#> $levels
#> [1] "ab" "cd" "ef"
#> 
#> $class
#> [1] "factor"
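
Since a factor is just an integer vector with levels and class attributes, it can be taken apart and rebuilt by hand. A sketch (the name manual is illustrative):

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

# The underlying integer codes index into the levels
as.integer(x)
#> [1] 1 2 1
levels(x)[as.integer(x)]
#> [1] "ab" "cd" "ab"

# Rebuilding the same factor from its parts
manual <- structure(c(1L, 2L, 1L), levels = c("ab", "cd", "ef"), class = "factor")
identical(manual, x)
#> [1] TRUE
```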

Dates and date-times

x <- as.Date("1971-01-01")
unclass(x)
#> [1] 365

typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "Date"

x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
#> [1] 3600
#> attr(,"tzone")
#> [1] "UTC"

typeof(x)
#> [1] "double"
attributes(x)
#> $class
#> [1] "POSIXct" "POSIXt" 
#> 
#> $tzone
#> [1] "UTC"

attr(x, "tzone") <- "US/Pacific"
x
#> [1] "1969-12-31 17:00:00 PST"

attr(x, "tzone") <- "US/Eastern"
x
#> [1] "1969-12-31 20:00:00 EST"

y <- as.POSIXlt(x)
typeof(y)
#> [1] "list"
attributes(y)
#> $names
#>  [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"  
#>  [8] "yday"   "isdst"  "zone"   "gmtoff"
#> 
#> $class
#> [1] "POSIXlt" "POSIXt" 
#> 
#> $tzone
#> [1] "US/Eastern" "EST"        "EDT"
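
Going the other way, a Date or POSIXct value can be built from a bare number by adding the class attribute. This is only a sketch to show the underlying representation; in real code as.Date() or lubridate are the usual tools.

```r
# 365 days after 1970-01-01
d <- structure(365, class = "Date")
format(d)
#> [1] "1971-01-01"
as.numeric(as.Date("1971-01-01"))
#> [1] 365

# 3600 seconds after 1970-01-01 00:00:00 UTC
dt <- structure(3600, class = c("POSIXct", "POSIXt"), tzone = "UTC")
format(dt, "%Y-%m-%d %H:%M:%S")
#> [1] "1970-01-01 01:00:00"
```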

Tibbles

tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
#> [1] "list"
attributes(tb)
#> $names
#> [1] "x" "y"
#> 
#> $row.names
#> [1] 1 2 3 4 5
#> 
#> $class
#> [1] "tbl_df"     "tbl"        "data.frame"

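The difference between a tibble and a list is that all the vectors in a tibble must have the same length. A traditional data.frame has a very similar structure; the main difference is the class: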

df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
#> [1] "list"
attributes(df)
#> $names
#> [1] "x" "y"
#> 
#> $class
#> [1] "data.frame"
#> 
#> $row.names
#> [1] 1 2 3 4 5
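
Because a data frame is a list of equal-length columns, list tools work on it column by column. The sketch below uses base R only, and also shows one place where data.frame() differs from the tibble() behaviour described earlier: it silently recycles a shorter column when the lengths divide evenly.

```r
df <- data.frame(x = 1:5, y = 5:1)

length(df)            # number of columns, i.e. list elements
#> [1] 2
is.list(df)
#> [1] TRUE
sapply(df, typeof)    # applies typeof() to each column
#>         x         y 
#> "integer" "integer" 
df[["y"]]
#> [1] 5 4 3 2 1

# data.frame() recycles y to length 4 without complaint
nrow(data.frame(x = 1:4, y = 1:2))
#> [1] 4
```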

All of the code has been uploaded to GitHub.