Introduction to R: Strings

Latest recommended article published on 2024-06-18 09:31:32

名本无名

Article link: https://blog.csdn.net/dxs18459111694/article/details/139738307

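Preface

String handling is not R's strongest or most elegant feature, but it makes up a large share of data processing and cleaning work. R offers both basic and more advanced tools for it, chiefly the base R functions and packages such as stringr.

Basic operations

In R, a string is written between a pair of single quotes ('') or a pair of double quotes ("") and can contain arbitrary character data.

A quote character of the same kind as the delimiters must be escaped inside the string, for example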

'abc\''
# [1] "abc'"
"cccc\"ddd"
# [1] "cccc\"ddd"

A double quote inside single quotes, or a single quote inside double quotes, needs no escaping

"can't"
# [1] "can't"
'as the saying goes ""'
# [1] "as the saying goes \"\""
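
The backslashes that print() shows belong to the escaped display form, not to the string itself. As a small illustrative sketch (the variable names are made up for this example), cat() and nchar() reveal the actual content:

```r
s <- "cccc\"ddd"                     # one escaped double quote inside the string
print(s)                             # escaped display form: [1] "cccc\"ddd"
cat(s, "\n")                         # actual content: cccc"ddd
n_chars <- nchar(s)                  # 8, the backslash is not part of the string
same <- identical('it\'s', "it's")   # TRUE, both quoting styles give the same string
```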

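Basic functions

The main functions for working with strings are

Function	Purpose	Function	Purpose
`nchar`	length of a string	`sub`	replace within a string
`paste`	concatenate strings	`strsplit`	split a string
`substr`	extract a substring	`strrep`	repeat a string
`toupper`	convert to upper case	`tolower`	convert to lower case
`startsWith`	test the prefix	`endsWith`	test the suffix
`sort`	sort strings	`order`	indices of the sorted order
`strwrap`	wrap text into paragraph lines	`trimws`	strip whitespace
`strtrim`	trim to a display width	`format`	format as a string
`sprintf`	formatted output	`strtoi`	convert to integer

Usage examples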

nchar("I have a dream!")
# [1] 15
toupper("I have a dream!")
# [1] "I HAVE A DREAM!"
strsplit("I have a dream!", split = " ")
# [[1]]
# [1] "I"      "have"   "a"      "dream!"
paste(c("I", "have", "a", "dream"))
# [1] "I"     "have"  "a"     "dream"
paste("A", "dream")
# [1] "A dream"
paste("A", "dream", sep = "-")  # join with -
# [1] "A-dream"
paste0("Symbol: ", "TP53")      # no separator
# [1] "Symbol: TP53"
strwrap("Stopping distance of cars (ft) vs. speed (mph) from Ezekiel (1930)", width = 20)
# [1] "Stopping distance" "of cars (ft) vs."  "speed (mph) from"  "Ezekiel (1930)" 
trimws("    have dream     ")
# [1] "have dream"
trimws("    have dream     ", which = "right")
# [1] "    have dream"
strrep("go", 3)
# [1] "gogogo"
strrep(x = c("go", "do"), 3)         # repeat each element
# [1] "gogogo" "dododo"
sub("can", "can't" , "I can do it")  # replace the first match
# [1] "I can't do it"
sort(c("one", "two", "three"))
# [1] "one"   "three" "two"
order(c("one", "two", "three"))
# [1] 1 3 2
strtoi('1234')
# [1] 1234
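
Several functions from the table (substr, strtrim, gsub, the collapse argument of paste) are not exercised above. The following is a minimal sketch of them on the same sentence; the variable names are illustrative only:

```r
s <- "I have a dream!"
part <- substr(s, 3, 6)          # characters 3 to 6: "have"
first_only <- sub("a", "A", s)   # sub() replaces only the first match
all_hits <- gsub("a", "A", s)    # gsub() replaces every match
clipped <- strtrim(s, 6)         # keep at most 6 columns of width: "I have"
# collapse joins the elements of one vector into a single string
joined <- paste(c("I", "have", "a", "dream"), collapse = " ")
```

Note the difference between sep, which joins separate arguments, and collapse, which joins the elements of a single vector.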

Testing string prefixes and suffixes

x <- c("demo.pdf", "demo.R", "demo.py", "test.R")
startsWith(x, "demo")
# [1]  TRUE  TRUE  TRUE FALSE
endsWith(x, "R")
# [1] FALSE  TRUE FALSE  TRUE

Formatting strings

sprintf wraps the C function of the same name, so it is used in much the same way

sprintf("age: %d", 20)             # %d formats an integer
# [1] "age: 20"
sprintf("score: %f", 92.5)         # %f formats a floating-point number
# [1] "score: 92.500000"
sprintf("percent: %8.2f%%", 99.5)  # %% is a literal percent sign; 8 is the field width, 2 the number of decimal places
# [1] "percent:    99.50%"
sprintf("name: %10s", 'tom')       # %s formats a string; width 10, right-aligned
# [1] "name:        tom"
sprintf("name: %-10s", 'tom')      # left-aligned
# [1] "name: tom       "
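
sprintf is also vectorized over its arguments, which makes it handy for building labels. A short sketch with made-up inputs, assuming only base R:

```r
ids <- sprintf("sample_%03d", c(1, 25, 300))    # zero-padded to width 3
sci <- sprintf("%.2e", 12345.678)               # scientific notation, 2 decimals
# two vectors are formatted element by element
rows <- sprintf("%-6s|%6.1f", c("a", "bb"), c(1.5, 10))
ids
rows
```

Here ids holds "sample_001", "sample_025" and "sample_300", and each element of rows has a left-aligned name and a right-aligned number.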

The format function also formats R objects, and it takes many arguments that control the output

format(x, trim = FALSE, digits = NULL, nsmall = 0L,
       justify = c("left", "right", "centre", "none"),
       width = NULL, na.encode = TRUE, scientific = NA,
       big.mark   = "",   big.interval = 3L,
       small.mark = "", small.interval = 5L,
       decimal.mark = getOption("OutDec"),
       zero.print = NULL, drop0trailing = FALSE, ...)

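The most commonly used arguments are

digits: the number of significant digits to use for numbers
nsmall: the minimum number of digits after the decimal point
scientific: whether to use scientific notation
width: the minimum width of the resulting string
justify: how strings are aligned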

format(1:10)
# [1] " 1" " 2" " 3" " 4" " 5" " 6" " 7" " 8" " 9" "10"
format(1:10, trim = TRUE)
# [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
format(c(6.0, 13.1), digits = 2)
# [1] " 6" "13"
format(12.3456789, nsmall = 10)
# [1] "12.3456789000"
format(3.1415, width = 8)
# [1] "  3.1415"
format("R", width = 8, justify = 'l')
# [1] "R       "
format("R", width = 8, justify = 'c')
# [1] "   R    "
format("R", width = 8, justify = 'r')
# [1] "       R"
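
A few more format() calls, as a sketch of arguments that appear in the signature above but are not demonstrated; the input values are made up for illustration:

```r
fmt_int <- format(1234567L, big.mark = ",")          # thousands separator: "1,234,567"
fmt_sci <- format(0.00001234, scientific = TRUE)     # force scientific notation
fmt_dec <- format(2.5, nsmall = 2)                   # at least 2 decimals: "2.50"
fmt_chr <- format(c("a", "bbb"), width = 5)          # strings are left-justified by default
```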

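String interpolation

Many scripting languages let you embed variable values directly in a string. sprintf and format can do something similar, but they are limited and not very flexible, so this section introduces glue, a third-party package that provides string interpolation.

glue offers lightweight, fast, dependency-free interpolated strings: R expressions embedded in curly braces are evaluated and their results are inserted into the string.

Installation and usage

# install the released version from CRAN
install.packages("glue")
# or install the development version from GitHub
# install.packages("devtools")
devtools::install_github("tidyverse/glue")

The version installed here is 1.6.2. Load the package

library(glue)

Put a variable name between a pair of curly braces and glue replaces it with the variable's value

name <- "Tom"
glue('My name is {name}.')
# My name is Tom.

A string can be split over several arguments, which are joined together at the end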

name <- "Tom"
age <- 50
today <- Sys.Date()

glue(
  'My name is {name},',
  ' my age next year is {age + 1},',
  ' today is {format(today, "%A, %B %d, %Y")}.'
)
# My name is Tom, my age next year is 51, today is Sunday, May 15, 2022.

The .sep argument sets the separator placed between the strings

glue(
  'My name is {name}',
  'my age is {age}',
  .sep = ", "
)
# My name is Tom, my age is 50

Named arguments to glue define temporary variables

glue(
  'My name is {name},',
  ' my age next year is {age + 1},',
  ' today is {format(today, "%A, %B %d, %Y")}.',
  name = "Joe",
  age = 40,
  today = Sys.Date()
)
# My name is Joe, my age next year is 41, today is Sunday, May 15, 2022.

glue_data takes a list, data frame or environment as input data and passes its variables into the string

glue_data(
  .x = list(name = "Tom", age = 20),
  'My name is {name}',
  'my age is {age}',
  .sep = ", "
)
# My name is Tom, my age is 20

String formatting

Leading whitespace and the newlines on the first and last lines are trimmed automatically

glue("
      formatted string
      multiple lines
        indention preserved
     ")
# formatted string
# multiple lines
#   indention preserved

To add a blank line, add an extra newline

glue("\n\nabc\n\nDEF\n")
# 
# abc
# 
# DEF

Ending a line with \\ lets you write one long line across several source lines, for example

glue(
  "this is a very \\
  long long long \\
  line"
)
# this is a very long long long line

To use literal curly braces in a string, double them

name <- "Tom"
glue("My name is {name}, not {{name}}.")
# My name is Tom, not {name}.

Compare how the position of the braces changes the output below

a <- "foo"
glue("{{a}}")
# {a}
glue("{{a} }")
# {a} }
glue("{ {a}}")
# foo
glue("{ {a} }")
# foo
glue("{
      a
     }")
# foo
glue("{
      {a}
     }")
# foo
glue("{
      {a}}")
# foo

You can use + to join a string to an object returned by glue

x <- 1
y <- 3
glue("x + y") + " = {x + y}"
# x + y = 4

Specifying delimiters

By default glue treats the text between curly braces as a variable name or expression. The .open and .close arguments set different delimiters

percent <- 0.9899123
glue("Percent: <<round(percent * 100, 2)>>%", .open = "<<", .close = ">>")
# Percent: 98.99%

Collapsing a character vector

glue_collapse collapses a character vector of any length into a character vector of length 1

glue_collapse(x, sep = "", width = Inf, last = "")

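x: the character vector
sep: the string used to separate the elements of the vector
width: the maximum length of the result once the ... ellipsis has been appended
last: if x has at least 2 elements, the string used to separate the last two elements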

glue_collapse(glue("{1:10}"))
# 12345678910
glue_collapse(glue("{1:10}"), width = 7)
# 1234...
glue_collapse(1:4, ", ", last = " and ")
# 1, 2, 3 and 4
glue_collapse(glue("{1:10}"), sep = ",")
# 1,2,3,4,5,6,7,8,9,10

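Quoting individual elements

The following three functions quote individual elements and combine well with glue_collapse

single_quote(x): wraps each element in single quotes
double_quote(x): wraps each element in double quotes
backtick(x): wraps each element in backticks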

x <- 1:5
glue('Values of x: {glue_collapse(backtick(x), sep = ", ", last = " and ")}')
# Values of x: `1`, `2`, `3`, `4` and `5`
glue('Values of x: {glue_collapse(single_quote(x), sep = ", ", last = " and ")}')
# Values of x: '1', '2', '3', '4' and '5'
glue('Values of x: {glue_collapse(double_quote(x), sep = ", ", last = " and ")}')
# Values of x: "1", "2", "3", "4" and "5"

Coloring output

glue can be combined with the terminal coloring functions defined in the crayon package to color the output text. First load crayon (version 1.4.2)

require(crayon)

glue's glue_col() function applies the colors

glue_col("{blue 2 * 2 = {green {2 * 2}}}")
glue_col("{yellow don't} know!", .literal = TRUE)  # .literal = TRUE: quotes are treated as ordinary characters
glue_col(
  "A URL: {magenta https://github.com/tidyverse/glue#readme}",
  .literal = TRUE
)

We can define a style with a black background and white text (bgBlack $ white)

white_on_black <- bgBlack $ white
glue_col("{white_on_black
  {red red} flowers,
  {green green} tree,
  `glue_col()` can show \\
  {red c}{yellow o}{green l}{cyan o}{blue r}{magenta s}
  and {bold bold} A {underline underline} B!
}")

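Regular expressions

What is a regular expression? It is a pattern built from characters with predefined special meanings, combined according to a set of rules so that it matches strings of a certain form. Put simply, it defines a rule for how characters are arranged, and that rule is used to search text for matches and to replace them.

R has three regular expression modes

Extended regular expressions: the default
Perl-style regular expressions: set perl = TRUE
Fixed strings: set fixed = TRUE; the pattern is taken literally and no character has a special meaning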
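
As a minimal sketch of how the three modes differ, the same test vector (made up for this example) is searched once in each mode:

```r
x <- c("a.b", "axb", "a+b")
ere <- grepl("a.b", x)                       # default: . matches any character
lit <- grepl("a.b", x, fixed = TRUE)         # fixed: only the literal text a.b
pcre <- grepl("a(?=\\+)", x, perl = TRUE)    # lookahead syntax needs perl = TRUE
ere    # TRUE TRUE TRUE
lit    # TRUE FALSE FALSE
pcre   # FALSE FALSE TRUE
```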

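Basic characters

A few characters have a special meaning in R regular expressions. They are called metacharacters:

 . \ | ( ) [ ] { } ^ $ * + ?

Metacharacter	Meaning	Usage
`.`	any single character, including a newline	`a.` matches `aa`, `az`
`\`	escapes the following character	`\n` is a newline, not two characters
`|`	alternation (or)	`A|B` matches `A` or `B`
`()`	group; the pattern inside the parentheses can be captured	`(ab)+` matches `ab`, `abab`
`[]`	defines a character set	`[a-z]` is the `26` lowercase English letters
`^`	start of the string; negation inside `[]`	`^a` matches strings that start with `a`, `[^a]` is any character other than `a`
`$`	end of the string	`a$` matches strings that end with `a`

Quantifiers say how many times the preceding rule is repeated

Quantifier	Meaning	Usage
`*`	preceding rule repeated `0` or more times	`a*` is any number of `a`, possibly `0`
`+`	preceding rule repeated `1` or more times	`a+` matches `aa`, `a`, but not `bbb`
`?`	preceding rule repeated `0` or `1` times	`cba?` matches `cba` and `cb`, but not all of `cbaa`
`{m}`	preceding rule repeated exactly `m` times	`a{2}` matches the `aa` in `caa`, nothing in `ca`
`{m,n}`	preceding rule repeated `m` to `n` times, as many as possible	`a{2,4}` matches the run of `a` in `caa`, `caaa`, nothing in `ca`, and at most four `a` of `caaaaa`
`{m,}`	preceding rule repeated `m` or more times, as many as possible	`a{2,}` matches the run of `a` in `caa`, `caaaaa`, nothing in `ca`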

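These quantifiers are greedy by default, matching as many characters as possible. Appending ? to a quantifier makes it non-greedy, so it matches as few characters as still satisfy the rule. For the string znaaaaaaa, a{3,} matches all seven a characters (aaaaaaa), while a{3,}? matches only aaa. A quantifier with an exact count such as a{3} matches three characters either way.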
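
The difference between greedy and non-greedy matching is easiest to see by extracting the matched text. A small sketch using regexpr and regmatches from base R; perl = TRUE is used here as a conservative choice, and the HTML-like string is a made-up example:

```r
x <- "znaaaaaaa"
greedy <- regmatches(x, regexpr("a{3,}", x, perl = TRUE))    # takes all seven a
lazy <- regmatches(x, regexpr("a{3,}?", x, perl = TRUE))     # stops after three a

tag <- "<b>bold</b>"
tag_greedy <- regmatches(tag, regexpr("<.*>", tag, perl = TRUE))    # "<b>bold</b>"
tag_lazy <- regmatches(tag, regexpr("<.*?>", tag, perl = TRUE))     # "<b>"
```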

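The built-in function grepl is used below to test regular expressions. It and the related functions grep and gsub are called as follows

# returns the indices of the elements that match
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
# returns TRUE for a match and FALSE otherwise
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, 
      fixed = FALSE, useBytes = FALSE)
# replaces every match with replacement
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)

Main arguments:

pattern: the matching rule
x: the character vector to search
ignore.case: ignore case when matching
perl: use Perl-compatible regular expressions
fixed: treat pattern as a literal string with no special meaning

The dot (`.`) and alternation (`|`)

The dot matches any single character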

grepl('ab.', 'abc')
# [1] TRUE
grepl('abc', 'Abc', ignore.case = TRUE)
# [1] TRUE
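grepl('ab.*', 'abcdba', fixed = TRUE)  # fixed = TRUE: the literal text ab.* does not occur
# [1] FALSE
grepl('ab.*', 'ab.*a', fixed = TRUE)
# [1] TRUE

Alternation works like a logical or: a match with either of the two expressions is enough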

grepl('ab|c', c('ab', 'ac', 'a'))
# [1] TRUE TRUE FALSE

String boundaries (`^ $`)

^ anchors the match to the start of the string and $ anchors it to the end

grepl('[Tt]he', c('the', 'The', 'other'))
# [1] TRUE TRUE TRUE

grepl('^[Tt]he', c('the', 'The', 'other'))
# [1]  TRUE  TRUE FALSE

grepl('[Tt]he$', c('the', 'The', 'other'))
# [1]  TRUE  TRUE FALSE

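The backslash (`\`)

Inside an R string the backslash is the escape character: a backslash plus the character after it stands for something special, for example:

Character	Meaning	Character	Meaning
`\n`	newline	`\r`	carriage return
`\t`	tab	`\b`	backspace
`\a`	bell	`\f`	form feed
`\v`	vertical tab	`\\`	backslash
`\'`	single quote	`\"`	double quote
`\nnn`	character with an octal code (`n` is an octal digit)	`\xnn`	character with a hexadecimal code (`n` is a hexadecimal digit)

To put a literal backslash in a character constant you type two backslashes, \\. A backslash that belongs to a regular expression is therefore doubled in R source code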

grepl('ab\\[', 'ab[]')
# [1] TRUE

grepl('ab\\\\', 'ab\\')
# [1] TRUE

grepl('ab\.', 'ab.')
# Error: '\.' is an unrecognized escape in character string starting "'ab\."

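Special characters

Special character	Meaning	Special character	Meaning
`\\b`	word boundary	`\\B`	not a word boundary
`\\d`	a digit, `0-9`	`\\D`	a non-digit
`\\s`	whitespace: space, newline, etc.	`\\S`	non-whitespace
`\\w`	a word character: letters `a-z`, `A-Z`, digits `0-9` and `_`	`\\W`	a non-word character
`\\<`	start of a word	`\\>`	end of a word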

x <- c("apple pie", "banana", "cookie")
grepl('\\ba', x)  # a at the start of a word
# [1]  TRUE FALSE FALSE
grepl('\\Ba', x)  # a that is not at a word boundary
# [1] FALSE  TRUE FALSE
grepl('\\<b.*a\\>', x)  # a word that starts with b and ends with a
# [1] FALSE  TRUE FALSE
grepl('^b.*a$', x)
# [1] FALSE  TRUE FALSE
x <- c('123', 'bac', 'bcd234')
grepl('\\d', x)   # strings that contain a digit
# [1]  TRUE FALSE  TRUE
grep('\\w', x)
# [1] 1 2 3
grepl('\\w', x)
# [1] TRUE TRUE TRUE

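Character sets (`[]`)

Square brackets [] define a character set. R also predefines some commonly used sets, written as [:keyword:]

Class	Meaning	Class	Meaning
`[:digit:]`	digits, same as `\\d`	`[:xdigit:]`	hexadecimal digits
`[:lower:]`	lowercase letters	`[:upper:]`	uppercase letters
`[:blank:]`	blank characters: `\t` and space	`[:space:]`	whitespace characters, same as `\\s`
`[:punct:]`	punctuation	`[:cntrl:]`	control characters
`[:alpha:]`	letters, `[:upper:]` + `[:lower:]`	`[:alnum:]`	letters and digits (`\\w` without `_`)
`[:graph:]`	graphical characters, `[:punct:]` + `[:alnum:]`	`[:print:]`	printable characters, `[:graph:]` + the space character

Custom character sets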

grepl('[Tt]he', c('the', 'The'))
# [1] TRUE TRUE
grepl('[a-z]he', c('the', 'she', 'The'))  # range: a-z is the 26 lowercase letters
# [1]  TRUE  TRUE FALSE
grepl('[^t]he', c('the', 'she', 'The'))   # negation: any character other than t
# [1] FALSE  TRUE  TRUE
grepl('[.]he', c('the', '.he'))           # inside [] a metacharacter is literal, so . no longer means any character
# [1] FALSE  TRUE

Using the predefined classes

x <- "fat Cat sat mat cat"
gsub("[[:lower:]]at",  'l', x)
# [1] "l Cat l l l"
gsub("[a-z]at",  'l', x)
# [1] "l Cat l l l"
gsub("[[:space:]]",  '+', x)
# [1] "fat+Cat+sat+mat+cat"
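
The predefined classes can be negated and combined like any other set. A short sketch on a made-up contact string, assuming only base R:

```r
x <- "Tel: 010-1234, Email: a_b@qq.com!"
digits_only <- gsub("[^[:digit:]]", "", x)    # drop everything that is not a digit
no_punct <- gsub("[[:punct:]]", "", x)        # drop punctuation, including _ and @
# [[:alnum:]_] spells out what \\w matches: letters, digits and underscore
word_like <- grepl("^[[:alnum:]_]+$", c("a_b", "a-b"))
```

Note the double brackets: the outer pair defines the set and the inner [:keyword:] names the class.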

Quantifiers (`* + ? {}`)

grepl('.*', 'abcdba')
# [1] TRUE
grepl('.*', '')
# [1] TRUE
grepl('.+', '')
# [1] FALSE
grepl('ab?', 'a')
# [1] TRUE
grepl('ab?', 'ab')
# [1] TRUE
grepl('ab{3}', 'abbb')
# [1] TRUE
grepl('ab{3,5}', 'abbbb')
# [1] TRUE
grepl('ab{3,}', 'abbbb')
# [1] TRUE
grepl('ab{,4}', 'abbbb')
# [1] TRUE

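Non-greedy mode matches as little of the string as possible

gsub('ab{3,8}', '', 'abbbbbbbc')
# [1] "c"
gsub('ab{3,8}?', '', 'abbbbbbbc')
# [1] "bbbbc"

Groups (`()`)

Sometimes we want not only to match a string but also to pull out the part that fits our rule. Wrapping part of the regular expression in parentheses captures the text it matches.

The built-in function gsub replaces what was matched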

gsub('(a.)+c', 'AB', c('ac', 'acc'))
# [1] "ac" "AB"
gsub('(a.)+', 'AB', c('ac', 'acc'))
# [1] "AB"  "ABc"
gsub('(abc){2}', 'ABC', 'abcabcabc')
# [1] "ABCabc"

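A backreference refers to the text captured by an earlier pair of parentheses. It is written as a double backslash followed by a number (\\n), where n is the index of the capturing group, for example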

grepl('c(..) s\\1', c('cat sat', 'cat saa'))
# [1] TRUE FALSE
gsub('(abc)-(xyz)-\\2-\\1', 'ABC', '@abc-xyz-xyz-abc@')
# [1] "@ABC@"
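
Backreferences are just as useful in the replacement string, where they rearrange the captured pieces. A minimal sketch with made-up inputs:

```r
# reorder year-month-day into day/month/year
date_dmy <- sub("(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1", "2024-06-18")
# swap the two sides of every key=value pair
swapped <- gsub("(\\w+)=(\\w+)", "\\2=\\1", "a=1, b=2")
date_dmy   # "18/06/2024"
swapped    # "1=a, 2=b"
```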

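Lookaround

When matching we may want the text just before or just after some specific string. Say we want to extract a gene name and a gene ID from text in which the gene name is preceded by Symbol: and the ID by Gene_ID: . Lookaround comes in two directions, lookahead and lookbehind, and each can be positive or negative, meaning the given pattern must or must not be present

Expression	Meaning	Expression	Meaning
`(?=pattern)`	positive lookahead: `pattern` must follow this position	`(?!pattern)`	negative lookahead: `pattern` must not follow this position
`(?<=pattern)`	positive lookbehind: `pattern` must precede this position	`(?<!pattern)`	negative lookbehind: `pattern` must not precede this position

Ahead and behind are relative to the direction of the scan: text already scanned is behind, text not yet scanned is ahead.

regexpr together with regmatches extracts the matched text; gregexpr matches globally and returns every match. Lookaround requires perl = TRUE

# capture the gene name and ID that follow the labels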
x <- 'Symbol: TP53, Gene_ID: 7157'
m <- regexpr('(?<=Symbol: )(\\w+)', x, perl = TRUE)
regmatches(x, m)
# [1] "TP53"
m <- regexpr('(?<=Gene_ID: )(\\d+)', x, perl = TRUE)
regmatches(x, m)
# [1] "7157"

# digits before a $ sign; note that $ must be escaped
x <- '1000$'
m <- regexpr('\\d+(?=\\$)', x, perl = TRUE)
regmatches(x, m)
# [1] "1000"
# digits that are not followed by a $ sign
x <- c('1000$', '111abc', 'acbd123')
m <- gregexpr('\\d+(?!\\$)', x, perl = TRUE)
regmatches(x, m)
# [[1]]
# [1] "100"
# 
# [[2]]
# [1] "111"
# 
# [[3]]
# [1] "123"
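
The examples above cover lookbehind and both kinds of lookahead. For completeness, here is a sketch of negative lookbehind next to positive lookbehind, on a made-up price string:

```r
x <- "price: $100, qty: 25, discount: $5"
# numbers that directly follow a $ sign
after_dollar <- regmatches(x, gregexpr("(?<=\\$)\\d+", x, perl = TRUE))[[1]]
# whole numbers that are not preceded by a $ sign;
# \\b stops the match from starting in the middle of 100
no_dollar <- regmatches(x, gregexpr("(?<!\\$)\\b\\d+", x, perl = TRUE))[[1]]
after_dollar   # "100" "5"
no_dollar      # "25"
```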

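stringr

R's built-in string functions are fairly basic and not very elegant to use, so this section introduces stringr, a third-party package with a richer set of functions that simplifies string handling. It is organized into four main families of functions:

Character manipulation: functions that operate on the individual characters of each string in a character vector
Whitespace tools: functions that add, remove and manipulate whitespace
Locale-sensitive operations: functions whose behavior varies from locale to locale
Pattern matching: four pattern-matching engines, of which regular expressions are the most commonly used

Installation and usage

# install the released version from CRAN
install.packages("stringr")

# install the latest development version from GitHub
# this needs the devtools package, which is installed first if it is missing
if (!requireNamespace("devtools", quietly = TRUE)) 
  install.packages("devtools")
devtools::install_github("tidyverse/stringr")

The version installed here is 1.4.0. Load the package

library(stringr)

All stringr functions start with str_ and take a character vector as their first argument

Operations on individual strings

Function	Purpose	Function	Purpose
`str_length`	length of a string	`str_sub`	extract and modify substrings
`str_dup`	duplicate a string	`str_c`	concatenate strings

Computing string length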

s <- c("why", "video", "cross", "extra", "deal", "authority")
str_length(s)
# [1] 3 5 5 5 4 9
str_length("love")
# [1] 4
str_length(c("love", NA))
# [1]  4 NA

Extracting and modifying substrings

str_sub(string, start = 1L, end = -1L)

x <- c("abcdef", "ghifjk")
# the third character
str_sub(x, 3, 3)
# [1] "c" "i"
# from the 2nd character to the 2nd-to-last character
str_sub(x, 2, -2)
# [1] "bcde" "hifj"

Modifying a string

str_sub(x, 3, 3) <- "X"
x
# [1] "abXdef" "ghXfjk"

Duplicating strings

str_dup(string, times)

str_dup(x, c(2, 3))
# [1] "abcdefabcdef"       "ghifjkghifjkghifjk"
str_dup(x, c(2))
# [1] "abcdefabcdef" "ghifjkghifjk"
str_dup(x, 2)
# [1] "abcdefabcdef" "ghifjkghifjk"

Concatenating strings

str_c(..., sep = "", collapse = NULL)

x <- c("why", "video", "cross", "extra", "deal", "authority")
str_c(x, collapse = ", ")
# [1] "why, video, cross, extra, deal, authority"
str_c("why", "not", "my", sep = "-")
# [1] "why-not-my"

Whitespace operations

Function	Purpose	Function	Purpose
`str_pad`	pad a string with a character	`str_trunc`	limit the length of a string
`str_trim`	remove whitespace	`str_wrap`	wrap a string for output

Pad a string with a character, which is useful for aligning strings

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

x <- c("abc", "defghi")
str_pad(x, 10)             # pads on the left by default
# [1] "       abc" "    defghi"
str_pad(x, 10, "both")     # pad on both sides
# [1] "   abc    " "  defghi  "
str_pad(x, 10, pad = "=")  # pad with =
# [1] "=======abc" "====defghi"

If the requested width is shorter than the string itself, padding has no effect

str_pad(x, 4)
# [1] " abc"   "defghi"

A string that is too long can be shown truncated, with an ellipsis marking the cut

str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...")

x <- "This string is moderately long"
rbind(
  str_trunc(x, 20, "right"),  # ellipsis on the right
  str_trunc(x, 20, "left"),   # ellipsis on the left
  str_trunc(x, 20, "center")  # ellipsis in the middle
)
#      [,1]                  
# [1,] "This string is mo..."
# [2,] "...s moderately long"
# [3,] "This stri...ely long"

Removing whitespace from the ends of a string

str_trim(string, side = c("both", "left", "right"))

x <- c("  a   ", "b   ",  "   c")
str_trim(x)
# [1] "a" "b" "c"
str_trim(x, "left")
# [1] "a   " "b   " "c"
str_trim(x, "right")
# [1] "  a"  "b"    "   c"

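Controlling how a string is wrapped for output

str_wrap(string, width = 80, indent = 0, exdent = 0)

Here width is the width of a line, indent is the indentation of the first line of each paragraph, and exdent is the indentation of the following lines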

jabberwocky <- str_c(
  "`Twas brillig, and the slithy toves ",
  "did gyre and gimble in the wabe: ",
  "All mimsy were the borogoves, ",
  "and the mome raths outgrabe. "
)
cat(str_wrap(jabberwocky, width = 40))
# `Twas brillig, and the slithy toves did
# gyre and gimble in the wabe: All mimsy
# were the borogoves, and the mome raths
# outgrabe.

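Locale-sensitive operations

Function	Purpose	Function	Purpose
`str_to_upper`	convert to upper case	`str_to_lower`	convert to lower case
`str_to_title`	capitalize the first letter of each word	`str_conv`	specify the encoding of a string
`str_sort`	sort a character vector	`str_order`	indices that sort the vector

Case conversion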

x <- "I like horses."
str_to_upper(x)
# [1] "I LIKE HORSES."
str_to_title(x)
# [1] "I Like Horses."
str_to_lower(x)
# [1] "i like horses."
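str_to_lower(x, locale = "tr")  # locale sets the language: Turkish
# [1] "ı like horses."

Sorting strings: the default order is ascending (decreasing = FALSE). na_last = TRUE puts NA values last, FALSE puts them first, and NA drops them. locale selects the language whose collation rules are used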

x <- c("y", "i", "k")
str_order(x)
# [1] 2 3 1

str_sort(x)
# [1] "i" "k" "y"
str_sort(x, locale = "lt")  # Lithuanian
# [1] "i" "y" "k"

Setting the encoding

str_conv(string, encoding)

# turn Chinese characters into raw bytes
x <- charToRaw('你好')
x
# [1] c4 e3 ba c3

# on Windows the default character set is GBK, and GB2312 is a subset of GBK, so decoding works
str_conv(x, "GBK")
# [1] "你好"
str_conv(x, "GB2312")
# [1] "你好"

x <- charToRaw('你好') # on macOS
str_conv(x, "GBK")
# [1] "浣犲ソ"
str_conv(x, "GB2312")
# [1] "浣\032濂\032"
# Warning messages:
# 1: In stri_conv(string, encoding, "UTF-8") :
str_conv(x, "UTF-8")  # macOS
# [1] "你好"

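Pattern-matching functions

Function	Purpose	Function	Purpose
`str_locate`	position of the match, first match only	`str_locate_all`	positions of all matches
`str_extract`	extract the matched text, first match only	`str_extract_all`	extract all matches
`str_match`	like `str_extract`, but returns a matrix	`str_match_all`	return all matches
`str_replace`	replace the first match	`str_replace_all`	replace all matches
`str_replace_na`	replace `NA` with the string `"NA"`	`str_count`	count the matches in each string
`str_split`	split a string, returns a `list` of character vectors	`str_split_fixed`	split a string, returns a character matrix
`str_detect`	test whether there is a match	`str_subset`	keep the strings that match
`word`	extract words

The first two arguments are the same for every pattern-matching function

string: a string or a character vector
pattern: the pattern to match, a regular expression by default

Use str_locate to get the position of a match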

fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, 'pp')
#      start end
# [1,]     2   3
# [2,]    NA  NA
# [3,]    NA  NA
# [4,]     6   7

Use str_extract to extract the matched text from a string

val <- c("a1", 467, "ab2")
# extract digits
str_extract(val, "\\d")
# [1] "1" "4" "2"
str_extract_all(val, "\\d")
# [[1]]
# [1] "1"

# [[2]]
# [1] "4" "6" "7"

# [[3]]
# [1] "2"

# extract letters
str_extract(val, "[a-z]+")
# [1] "a"  NA   "ab"
str_extract_all(val, "\\w+")
# [[1]]
# [1] "a1"

# [[2]]
# [1] "467"

# [[3]]
# [1] "ab2"

str_match is similar to str_extract but returns a character matrix

val <- c("a1", 467, "ab2")
str_match(val, '[0-9]+')
#      [,1] 
# [1,] "1"  
# [2,] "467"
# [3,] "2" 
str_match_all(val, "\\w+")
# [[1]]
#      [,1]
# [1,] "a1"
# 
# [[2]]
#      [,1] 
# [1,] "467"
# 
# [[3]]
#      [,1] 
# [1,] "ab2"

Replacing matches

val <- c('abc1', "123F", "139Q1")
str_replace(val, '\\d+', 'Z')
# [1] "abcZ" "ZF"   "ZQ1" 
str_replace_all(val, '\\d+', 'Z')
# [1] "abcZ" "ZF"   "ZQZ" 
str_replace_na(c(NA, "sss", "tom"))
# [1] "NA"  "sss" "tom"

Splitting strings

str_split("a-b-c", "-")
# [[1]]
# [1] "a" "b" "c"
str_split_fixed("a-b-c", "-", n = 2)
#      [,1] [,2] 
# [1,] "a"  "b-c"

Counting the matches in each string

str_count(val, '\\d+')
# [1] 1 1 2

Testing whether a string contains a match

val <- c("ABC", 123, "A@qq.com", "aa")
str_detect(val, "[[:upper:]]")
# [1]  TRUE FALSE  TRUE FALSE
str_detect(val, "\\d")
# [1] FALSE  TRUE FALSE FALSE

Use str_subset to keep the strings that match

str_subset(val, "[[:upper:]]")
# [1] "ABC"      "A@qq.com"
str_subset(val, "\\d")
# [1] "123"

word extracts words from a sentence. Words are separated by spaces by default, and without a position the first word is returned

val <- c("hello world", "this is my, girl")
word(val, 1)
# [1] "hello" "this"
word(val, -1)
# [1] "world" "girl"   
word(val, 2, -1)
# [1] "world"       "is my, girl"  

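# split on ,
val <- '111,222,333,444'
word(val, 1, sep = fixed(','))
# [1] "111"
word(val, 3, sep = fixed(','))
# [1] "333"

The four pattern engines

All stringr functions use regular expression matching by default. Four functions wrap a matching rule in a different engine before it is passed to the pattern argument

boundary: matches boundaries between characters (character), lines (line_break), sentences (sentence) or words (word)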

x <- "This is a sentence."
str_split(x, boundary("word"))
# [[1]]
# [1] "This"     "is"       "a"        "sentence"
str_count(x, boundary("sentence"))
# [1] 1
str_extract_all(x, boundary("word"))
# [[1]]
# [1] "This"     "is"       "a"        "sentence"

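coll: compares strings using locale-aware collation rules, mainly useful for case-insensitive matching and language-specific conventions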

i <- c("I", "İ", "i", "ı")
i
# [1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
# [1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
# [1] "İ" "i"

fixed: characters stand only for themselves, with no special meaning

str_count(c("a.", ".", ".a.",NA), ".")
# [1]  2  1  3 NA
# match the literal character with fixed
str_count(c("a.", ".", ".a.",NA), fixed("."))
# [1]  1  1  2 NA

regex: defines a regular expression; this is the default

val <- c("a1", 467, "ab2")
str_extract(val, regex("\\w+"))
# [1] "a1"  "467" "ab2"

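File I/O

For file I/O in R we usually use fairly high-level functions such as read.csv and read.table, which import a file straight into the matching data structure, or write results to a file, without calling lower-level functions.

This approach is memory-hungry, though, and R crashing for lack of memory is not rare. How can we avoid that when reading a large file? One option is to load on demand, also called lazy loading: rather than reading everything into memory at once, data is loaded when it is used, which saves memory and speeds up reading. That takes more control and is more complex to do by hand; the third-party package vroom is one option. The method covered here is to read the file line by line and keep only the data that is needed.

First, the basics of file I/O in R. R's I/O mechanism is built on connections. Depending on the kind of source, the main constructors are: file, url, gzfile, bzfile, xzfile, unz, pipe, fifo and socketConnection

Connecting to a file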

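demo <- file("~/Downloads/demo.txt", "w")     # connect to a txt file, opened for writing
cat("Hello R", file = demo, sep = "\n")    
seek(con = demo, 0)                           # move the write position to the start of the file
# [1] 8
cat("Hello again!", file = demo, sep = "\n")  # write again from the start
isOpen(demo)                                  # is the connection open?
# [1] TRUE
isOpen(demo, "r")                             # open for reading?
# [1] FALSE
isOpen(demo, "w")                             # open for writing?
# [1] TRUE
flush(demo)                                   # flush the output stream so the content reaches the file
close(demo)                                   # close the connection

If we now open demo.txt it contains only Hello again!, which may not be what we expected. After seek moved the write position back to the start, the second write overwrote the first line. To keep both lines, open the file in a (append) mode, in which output is added at the end of the file

demo <- file("~/Downloads/demo.txt", "a")     # connect to the txt file, opened for appending
cat("Hello R", file = demo, sep = "\n")    
seek(con = demo, 0)                           # try to move the write position to the start of the file
cat("Hello again!", file = demo, sep = "\n")  # in append mode this should still be added at the end
isIncomplete(demo)                            # was the last read incomplete, or is there unflushed output?
# [1] FALSE
showConnections()                             # list all open connections
#   description            class  mode text   isopen   can read can write
# 3 "~/Downloads/demo.txt" "file" "a"  "text" "opened" "no"     "yes" 
getConnection(demo)                           # show information about the connection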
# A connection with                                  
# description "~/Downloads/demo.txt"
# class       "file"                
# mode        "a"                   
# text        "text"                
# opened      "opened"              
# can read    "no"                  
# can write   "yes"
closeAllConnections()                         # close all connections

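Read and write functions

By file type, the read and write functions fall into two groups:

Text files: readLines, scan, parse and read.dcf (read); writeLines, cat, sink, dput and dump (write)
Binary files: readBin, readChar and load (read); writeBin, writeChar and save (write)

We use writeLines to write data to a file row by row

demo <- file("~/Downloads/demo.txt", "w")
set.seed(100)
data <- matrix(sample(1:200, 50), nrow = 10)
for (i in seq_len(nrow(data))) {
  row <- as.character(data[i, ])
  # join the fields with tabs and write them as one line
  writeLines(paste(row, collapse = "\t"), con = demo)
}
close(demo)

Use readLines to read the file line by line and compute the sum of log2 of the values in each row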

demo <- file("~/Downloads/demo.txt", "r")
total <- c()
while (TRUE) {
  line <- readLines(demo, n = 1)
  if (length(line) == 0) {
    break
  }
  row <- as.numeric(unlist(strsplit(line, "\t")))
  total <- c(total, sum(log2(row)))
}
close(demo)
total
# [1] 32.66736 34.35637 33.97802 27.67451 32.79030 31.04216 28.28763 30.54825 27.89171 35.65607
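
Reading one line per call is the simplest form of this idea; asking readLines for a small batch of lines per call usually means fewer calls while memory use stays bounded. The following is a sketch under stated assumptions: the file is a temporary one created just for the example, and the batch size of 4 is arbitrary.

```r
path <- tempfile(fileext = ".txt")        # temporary file for the example
writeLines(as.character(1:10), path)      # ten lines: "1" to "10"

con <- file(path, "r")
total <- 0
repeat {
  chunk <- readLines(con, n = 4)          # read at most 4 lines per call
  if (length(chunk) == 0) break           # nothing left to read
  total <- total + sum(as.numeric(chunk)) # keep only the running result
}
close(con)
unlink(path)                              # remove the temporary file
total
```

Only the current batch and the running total are held in memory, never the whole file.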

Other modes

We usually connect to files, but there are two special kinds of connection:

Reading console input

readLines(stdin(), n = 2)  # n is the number of lines to read
Hello
World
# [1] "Hello" "World"

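Reading the clipboard

R provides this, but how it is used differs between operating systems, so it is not recommended. You can simply paste the copied text into a string and turn the string into a connection with textConnection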
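
A minimal sketch of the textConnection approach; the tab-separated gene table is made-up data standing in for pasted clipboard content:

```r
txt <- c("gene\tcount", "TP53\t10", "BRCA1\t7")   # one element per line of pasted text
con <- textConnection(txt)                        # treat the character vector as a file
lines <- readLines(con)
close(con)

# read.table accepts the same text directly through its text argument
df <- read.table(text = txt, header = TRUE, sep = "\t")
total_count <- sum(df$count)
```

The connection behaves like a read-only text file, so any function that accepts a connection can consume it.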

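Directory operations

R has many built-in functions for working with directories, which make handling files quicker and more convenient. The common ones are

Function	Purpose	Function	Purpose
`getwd`	get the current working directory	`setwd`	set the current working directory
`dirname`	directory part of a path	`basename`	last component of a path, a directory or file name
`normalizePath`	absolute path of a directory	`dir`	list the subdirectories and files in a directory
`list.dirs`	list the subdirectories of a directory	`list.files`	same as `dir`
`dir.exists`	does the directory exist	`file.exists`	does the file exist
`dir.create`	create a directory	`file.create`	create an empty file
`unlink`	delete a file or directory	`file.remove`	delete a file
`file.rename`	rename a file	`file.copy`	copy a file
`file.path`	build a file path	`file_test`	test whether a path is a file or a directory

Listing directories and files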

getwd()
# [1] "/Users/dengxsh"
setwd("~/Downloads/")
getwd()
# [1] "/Users/dengxsh/Downloads"
basename(getwd())
# [1] "Downloads"
dirname(getwd())
# [1] "/Users/dengxsh"
normalizePath(".")
# [1] "/Users/dengxsh/Downloads"
path <- "/Users/dengxsh/Documents/WorkSpace/"
dir(path)
# [1] "CLion"    "Go"       "image"    "IntelliJ"  "PyCharm"  "Qt5"      "RStudio" 
list.dirs(path, full.names = FALSE, recursive = FALSE)
# [1] "CLion"    "Go"       "image"    "IntelliJ"  "PyCharm"  "Qt5"      "RStudio"
list.files(path, pattern = "py", recursive = FALSE, full.names = TRUE)
# [1] "/Users/dengxsh/Documents/WorkSpace//Jupyter"

Creating and modifying directories and files

dir.create("test")
setwd(file.path(getwd(), 'test'))
getwd()
# [1] "/Users/dengxsh/Downloads/test"
file.create("demo.txt")
# [1] TRUE
file.exists("demo.txt")
# [1] TRUE
list.files()
# [1] "demo.txt"
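file.rename("demo.txt", "test")   # rename
# [1] TRUE
file_test('-f', "test")           # test the file type: -f means it exists and is not a directory
# [1] TRUE
file_test('-d', "test")           # -d means it exists and is a directory
# [1] FALSE
dir()
# [1] "test"
unlink("test", recursive = TRUE)  # delete it; recursive = TRUE is only required when the target is a directory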
dir.exists("test")
# [1] FALSE
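
The same workflow can be written without changing the working directory by building paths with file.path. This is a sketch that works inside R's session temporary directory; the directory name demo_dir and the file names are hypothetical:

```r
root <- file.path(tempdir(), "demo_dir")            # hypothetical directory name
dir.create(root)
file.create(file.path(root, c("a.txt", "b.R")))     # create two empty files
r_files <- list.files(root, pattern = "\\.R$")      # only names that end in .R
is_dir <- file_test("-d", root)                     # TRUE: root is a directory
unlink(root, recursive = TRUE)                      # a directory needs recursive = TRUE
gone <- !dir.exists(root)
```

Because every path is built from root, the script leaves the current working directory untouched.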