Contents
2. Combining strings: str_c(..., sep = "", collapse = NULL)
3. Substrings: str_sub(). Usage: str_sub(string, start = 1L, end = -1L)
4. Locating matches: str_locate(), str_locate_all()
5. Extracting matches: str_extract(), str_extract_all()
6. Duplicating strings: str_dup(). Usage: str_dup(string, times)
[1] "%>%" "boundary" "coll" "fixed" "fruit"
[6] "invert_match" "regex" "sentences" "str_c" "str_conv"
[11] "str_count" "str_detect" "str_dup" "str_ends" "str_extract"
[16] "str_extract_all" "str_flatten" "str_glue" "str_glue_data" "str_interp"
[21] "str_length" "str_locate" "str_locate_all" "str_match" "str_match_all"
[26] "str_order" "str_pad" "str_remove" "str_remove_all" "str_replace"
[31] "str_replace_all" "str_replace_na" "str_sort" "str_split" "str_split_fixed"
[36] "str_squish" "str_starts" "str_sub" "str_sub<-" "str_subset"
[41] "str_to_lower" "str_to_sentence" "str_to_title" "str_to_upper" "str_trim"
[46] "str_trunc" "str_view" "str_view_all" "str_which" "str_wrap"
[51] "word" "words"
[1] "str_detect_multiple" "str_detect_multiple_and" "str_detect_multiple_or" "str_extract_after"
[5] "str_extract_after_date" "str_extract_before" "str_extract_before_date" "str_extract_between"
[9] "str_extract_context" "str_extract_context_all"
1. String length: str_length()
str_length(letters[1:10])
# [1] 1 1 1 1 1 1 1 1 1 1
str_length(NA)
# [1] NA
str_length(c("i", "like", "programming", NA))
# [1] 1 4 11 NA
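2. Combining strings: str_c(..., sep = "", collapse = NULL)
1. str_c() is vectorized: shorter vectors are recycled to the length of the longest one.
2. The sep argument (default "") sets the separator placed between the strings being combined.
3. To merge a character vector into a single string, use the collapse argument.
# The first two statements produce the same result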
str_c("Letter: ", letters[1:5])
# [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d" "Letter: e"
str_c("Letter", letters[1:5], sep = ": ")
# [1] "Letter: a" "Letter: b" "Letter: c" "Letter: d" "Letter: e"
# Any number of strings can be combined
str_c(letters[1:5], " is for", "...")
# [1] "a is for..." "b is for..." "c is for..." "d is for..." "e is for..."
# Without collapse the result is still a character vector
str_c(letters[1:5])
# [1] "a" "b" "c" "d" "e"
# With collapse the character vector is merged into a single string
str_c(letters[1:5], collapse = "")
# [1] "abcde"
str_c(letters[1:5], collapse = ",")
# [1] "a,b,c,d,e"
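Note that, as in most R functions, missing values are contagious: any NA in the input makes the corresponding output NA. To print missing values as the string "NA" instead, use str_replace_na().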
# Missing inputs give missing outputs
str_c(c("a", NA, "b"), "-d")
# [1] "a-d" NA "b-d"
# Use str_replace_na() to display literal NAs:
str_c(str_replace_na(c("a", NA, "b")), "-d")
# [1] "a-d" "NA-d" "b-d"
3. Substrings: str_sub(). Usage: str_sub(string, start = 1L, end = -1L)
hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
# [1] "Hadley"
str_sub(hw, end = 6)
# [1] "Hadley"
str_sub(hw, 8)
# [1] "Wickham"
# Positive positions count from the start of the string, negative ones from the end
str_sub(hw, -7)
# [1] "Wickham"
# 1 and 8 are the start positions of the two substrings; 6 and 14 are their end positions
str_sub(hw, c(1, 8), c(6, 14))
# [1] "Hadley" "Wickham"
The assignment form of str_sub() can be used to modify a string in place.
x <- "BBCDEF"
str_sub(x, 1, 1) <- "A"
x
# [1] "ABCDEF"
str_sub(x, -1, -1) <- "K"
x
# [1] "ABCDEK"
str_sub(x, 2, -2) <- ""
x
# [1] "AK"
4. Locating matches: str_locate(), str_locate_all()
Usage: str_locate(string, pattern)
Usage: str_locate_all(string, pattern)
# str_locate() returns a matrix
# When a string contains several matches, only the position of the first one is returned
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "e")
#      start end
# [1,]     5   5
# [2,]    NA  NA
# [3,]     2   2
# [4,]     4   4
str_locate(fruit, c("a", "b", "p", "p"))
#      start end
# [1,]     1   1
# [2,]     1   1
# [3,]     1   1
# [4,]     1   1
# str_locate_all() returns a list of matrices
# When a string contains several matches, the position of every match is returned
str_locate_all(fruit, "e")
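# [[1]]
#      start end
# [1,]     5   5
# [[2]]
#      start end
# [[3]]
#      start end
# [1,]     2   2
# [[4]]
#      start end
# [1,]     4   4
# [2,]     9   9
P.S. pattern can be a string of any length. The examples above use single-character patterns, which is why start and end are equal in every row.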
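To make the note above concrete, here is a small sketch with a two-character pattern. The pattern "pp" is chosen purely for illustration and does not appear in the original examples.

```r
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")

# With a two-character pattern, end is start + 1 wherever a match exists;
# strings without a match get NA in both columns
loc <- str_locate(fruit, "pp")
loc

# The start/end columns can be passed straight back to str_sub()
matched <- str_sub(fruit, loc[, "start"], loc[, "end"])
matched
```

Rows for "banana" and "pear" are NA because neither contains "pp".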
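5. Extracting matches: str_extract(), str_extract_all()
Usage: str_extract(string, pattern)
Usage: str_extract_all(string, pattern, simplify = FALSE)
str_extract() is used in much the same way as str_locate(), but where str_locate() returns the positions that match pattern, str_extract() returns the matched text itself. Both functions are usually combined with regular expressions.
The simplify argument of str_extract_all() determines whether the result is a list (FALSE, the default) or a character matrix (TRUE).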
fruit <- c("apple", "banana", "pear", "pineapple")
str_extract(fruit, "e")
# [1] "e" NA "e" "e"
str_extract(fruit, c("a", "b", "p", "p"))
# [1] "a" "b" "p" "p"
# simplify = FALSE returns a list
str_extract_all(fruit, "e")
# [[1]]
# [1] "e"
# [[2]]
# character(0)
# [[3]]
# [1] "e"
# [[4]]
# [1] "e" "e"
# simplify = TRUE returns a matrix
str_extract_all(fruit, "e", simplify = TRUE)
#      [,1] [,2]
# [1,] "e"  ""
# [2,] ""   ""
# [3,] "e"  ""
# [4,] "e"  "e"
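Since section 5 notes that these functions are usually paired with regular expressions, the following sketch shows that pairing. The input strings are made up for illustration.

```r
library(stringr)

# Hypothetical inputs: some contain digits, one does not
notes <- c("order 123", "no digits here", "a1b22")

# First run of digits in each string (NA when there is none)
first_num <- str_extract(notes, "\\d+")
first_num

# Every run of digits, one list element per input string
all_nums <- str_extract_all(notes, "\\d+")
all_nums
```

The first call returns one value per input; the second returns a list because the number of matches differs from string to string.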
6. Duplicating strings: str_dup(). Usage: str_dup(string, times)
fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)
# [1] "appleapple" "pearpear" "bananabanana"
str_dup(fruit, 1:3)
# [1] "apple" "pearpear" "bananabananabanana"
str_c("ba", str_dup("na", 0:5))
# [1] "ba" "bana" "banana" "bananana" "banananana" "bananananana"
7. Counting matches: str_count()
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")
# [1] 1 3 1 1
str_count(fruit, c("a", "b", "p", "p"))
# [1] 1 1 1 3
8. Trimming whitespace
space <- "space "
space <- str_trim(space)
> space
[1] "space"
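str_trim() can also be restricted to one side, and str_squish() (which appears in the function list at the top) additionally collapses repeated whitespace inside the string. A short sketch with a made-up input:

```r
library(stringr)

padded <- "  too   many   spaces  "

# Trim only the left side; trailing and inner whitespace are kept
left_only <- str_trim(padded, side = "left")

# Default: trim both sides, inner whitespace untouched
both <- str_trim(padded)

# Trim both sides and reduce inner runs of whitespace to a single space
squished <- str_squish(padded)
```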
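9. Splitting with str_split and joining with str_c
Scenario: a data frame column holds values joined by a delimiter and needs to be split into several columns, or several columns need to be joined into one with a delimiter.
a <- "a_b_c_d"
# Split
a.split <- str_split(a, "_")
a.split
[[1]]
[1] "a" "b" "c" "d"
# simplify = TRUE returns a matrix; [, 1] keeps only the first piece
a.first <- str_split(a, "_", simplify = TRUE)[, 1]
> a.first
[1] "a"
# Join, which is simply string concatenation
# unlist() turns the list a.split into a character vector
# With a single vector as input, use the collapse argument
a.unite <- str_c(unlist(a.split), collapse = "_")
> a.unite
[1] "a_b_c_d"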
fruits <- c(
"apples and oranges and pears and bananas",
"pineapples and mangos and guavas"
)
str_split(fruits, " and ")
# [[1]]
# [1] "apples"  "oranges" "pears"   "bananas"
# [[2]]
# [1] "pineapples" "mangos"     "guavas"
# Return a matrix instead
str_split(fruits, " and ", simplify = TRUE)
#      [,1]         [,2]      [,3]     [,4]
# [1,] "apples"     "oranges" "pears"  "bananas"
# [2,] "pineapples" "mangos"  "guavas" ""
# The n argument caps the number of pieces returned
str_split(fruits, " and ", n = 3)
# [[1]]
# [1] "apples"            "oranges"           "pears and bananas"
# [[2]]
# [1] "pineapples" "mangos"     "guavas"
str_split(fruits, " and ", n = 2)
# [[1]]
# [1] "apples"                        "oranges and pears and bananas"
# [[2]]
# [1] "pineapples"        "mangos and guavas"
# str_split_fixed() is equivalent to str_split(simplify = TRUE), except that n has no default and must be supplied.
str_split_fixed(fruits, " and ", 4)
#      [,1]         [,2]      [,3]     [,4]
# [1,] "apples"     "oranges" "pears"  "bananas"
# [2,] "pineapples" "mangos"  "guavas" ""
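The scenario at the start of this section talks about data frame columns, while the examples work on plain vectors. The sketch below applies the same two functions to a data frame; the data frame and its column names (code, group, num, tag, rejoined) are hypothetical.

```r
library(stringr)

# Hypothetical data frame with one delimited column
df <- data.frame(code = c("a_1_x", "b_2_y"), stringsAsFactors = FALSE)

# Split every value into exactly 3 pieces; the result is a character matrix
# with one row per input string
parts <- str_split_fixed(df$code, "_", 3)
df$group <- parts[, 1]
df$num   <- parts[, 2]
df$tag   <- parts[, 3]

# Join the columns back together row by row; sep goes between the columns
df$rejoined <- str_c(df$group, df$num, df$tag, sep = "_")
df
```

Here sep is the right argument for joining several columns element by element, whereas collapse (used earlier) merges one vector into a single string.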
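Extracting the string before/after a given character
a <- c("x1_y", "x2_y", "x3_y")
# simplify = TRUE returns a matrix: column 1 holds the part before "_", column 2 the part after
parts <- str_split(a, "_", simplify = TRUE)
a <- parts[, 1]
a
[1] "x1" "x2" "x3"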
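10. Replacing
Scenario: when importing csv files in bulk, some files use a comma as the decimal mark. read.csv2() reads those correctly, but you first have to work out which files use the comma; I find it easier to read everything in and fix the values afterwards.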
comma <- "7,99"
replace.comma <- str_replace(comma, ",", ".")
> replace.comma
[1] "7.99"
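As a follow-up to the decimal-comma scenario, this sketch converts the cleaned text to numbers and shows why fixed() matters when the character being replaced is a regex metacharacter. The input values are made up.

```r
library(stringr)

# Hypothetical values read in as text, some with a decimal comma
raw <- c("7,99", "12,5", "3")
values <- as.numeric(str_replace(raw, fixed(","), "."))
values

# Going the other way, "." must be matched literally:
# as a regex, "." matches any character, so the first character is replaced
wrong <- str_replace("7.99", ".", ",")
# fixed(".") matches only a literal dot
right <- str_replace("7.99", fixed("."), ",")
```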
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "a", "-")
# [1] "one -pple" "two pe-rs" "three b-nanas"
str_replace_all(fruits, "a", "-")
# [1] "one -pple" "two pe-rs" "three b-n-n-s"
str_replace_all(fruits, "a", toupper)
# [1] "one Apple" "two peArs" "three bAnAnAs"
str_replace_all(fruits, "a", NA_character_)
# [1] NA NA NA
str_replace_all(fruits,c("one" = "1", "two" = "2", "three" = "3"))
# [1] "1 apple" "2 pears" "3 bananas"
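Replacing values with base R code
# Recode "M" as "1" in column metas of data frame d, then convert the column to numeric
d$metas[which(d$metas == "M")] <- "1"
d$metas <- as.numeric(d$metas)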
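11. Slicing
Scenario: a fairly specific one. A table at work has one column holding the spec and another holding the spec joined with the color, but the joined column uses no consistent separator, and the color needs to be cut out.
# guige = spec
guige <- c("iphone x", "mate 10")
# guige.yanse = spec + color, joined with different separators
guige.yanse <- c("iphone x_black", "mate 10 white")
# Use str_length() to get the length of the spec; the color starts 2 characters after it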
yanse <- str_sub(guige.yanse, str_length(guige)+2, str_length(guige.yanse))
> yanse
[1] "black" "white"
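12. Extracting
Scenario: similar to slicing, but regular expressions can be used for matching, which is more powerful. At work I often use it to pull out csv file names.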
filepath <- c(
"C:/Users/Administrator/Desktop/a-2017.csv",
"C:/Users/Administrator/Desktop/b-2017.csv"
)
# Extract the file names with a regular expression
filenames <- str_extract(filepath, "\\w-\\d+")
> filenames
[1] "a-2017" "b-2017"
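The pattern above works because each file name has a single character before the hyphen: \\w matches exactly one word character. The sketch below, using a hypothetical path with a longer file name, shows the difference and one possible alternative built on base R's basename().

```r
library(stringr)

# Hypothetical path with a multi-character file name
path <- "C:/data/sales-2017.csv"

# \\w matches a single character, so only the last letter before "-" is kept
one_char <- str_extract(path, "\\w-\\d+")

# \\w+ matches the whole run of word characters before "-"
whole <- str_extract(path, "\\w+-\\d+")

# Alternative: drop the directory with basename(), then strip the extension
stem <- str_remove(basename(path), "\\.csv$")
```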
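13. Converting letter case
Scenario: lookups in Excel are case-insensitive, but matching in R is case-sensitive, so a value that can be found in Excel often fails to match in R. Normalize the case first, then match.
r.letter <- "YdJDdaLA"
upper.letter <- str_to_upper(r.letter) # upper case
lower.letter <- str_to_lower(r.letter) # lower case
title.letter <- str_to_title(r.letter) # first letter of each word capitalized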
> upper.letter
[1] "YDJDDALA"
> lower.letter
[1] "ydjddala"
> title.letter
[1] "Ydjddala"
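The Excel-versus-R scenario above can be sketched as follows. The keys and lookup vectors are hypothetical; match() is base R.

```r
library(stringr)

# Hypothetical lookup: keys typed with inconsistent case
keys   <- c("Apple", "BANANA", "cherry")
lookup <- c("apple", "banana")

# Case-sensitive matching finds nothing
raw_match <- match(keys, lookup)

# Normalizing both sides to lower case first makes the lookup succeed
norm_match <- match(str_to_lower(keys), str_to_lower(lookup))
```

Lowercasing both sides, not just one, keeps the comparison symmetric when the lookup table itself has mixed case.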
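14. Detecting
Scenario: often combined with ifelse() to test whether each string in a column matches a pattern, typically when creating a new column.
df <- data.frame(a = c(1, 2, 3, "a", "b", "c"))
df$b <- ifelse(str_detect(df$a, "\\d"), "number", "non-number")
df
  a          b
1 1     number
2 2     number
3 3     number
4 a non-number
5 b non-number
6 c non-number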
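Used together with which()
# Put the pattern to match between the quotes; str_detect() already returns TRUE/FALSE, so no "== TRUE" is needed
data[which(str_detect(data, ""))]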
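A runnable version of the idiom above, with a made-up vector and pattern. It also shows str_subset() and str_which() from the function list, which wrap the same idea, and the one case where which() changes the result: missing values.

```r
library(stringr)

x <- c("apple", "banana", "pear")

by_which  <- x[which(str_detect(x, "an"))]
by_subset <- str_subset(x, "an")   # matching strings
positions <- str_which(x, "an")    # indices of matching strings

# With NA in the input, logical indexing keeps an NA element,
# while which() drops it
y <- c("apple", NA, "banana")
with_na    <- y[str_detect(y, "an")]
without_na <- y[which(str_detect(y, "an"))]
```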
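3. What each str_ function does
3.1 String concatenation functions
str_c: concatenate strings.
str_join: concatenate strings; an older name for str_c that is absent from the function list above, so use str_c.
str_trim: remove leading and trailing whitespace, including tabs (\t)
str_pad: pad a string to a given width
str_dup: duplicate a string
str_wrap: wrap text to control its output format
str_sub: extract a substring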
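str_pad() is the one function in this group without an example earlier in the notes. A minimal sketch, with made-up inputs:

```r
library(stringr)

ids <- c("7", "42", "123")

# Pad on the left (the default side) with "0" up to width 3;
# strings already at the target width are returned unchanged
padded_ids <- str_pad(ids, width = 3, pad = "0")

# Pad on the right with the default pad character, a space
right_padded <- str_pad("ab", width = 4, side = "right")
```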
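3.2 String calculation functions
str_count: count the matches in a string
str_length: string length
str_sort: sort strings by value
str_order: return the indices that would sort the strings; same rules as str_sort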
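The relationship between str_sort() and str_order() can be sketched as follows; the input vectors are made up, and the results assume the default locale ("en").

```r
library(stringr)

x <- c("banana", "apple", "cherry")

sorted <- str_sort(x)    # the sorted values
ord    <- str_order(x)   # the indices that produce that order

# Indexing with the order reproduces the sorted vector
same <- identical(x[ord], sorted)

# numeric = TRUE compares embedded digits as numbers rather than character by character
labels <- str_sort(c("x10", "x2", "x1"), numeric = TRUE)
```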
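3.3 String matching functions
str_split: split a string
str_split_fixed: split a string into a fixed number of pieces, returned as a matrix
str_subset: return the strings that match
word: extract words from text
str_detect: test whether a string matches a pattern
str_match: extract the matched groups of the first match.
str_match_all: extract the matched groups of every match
str_replace: replace the first match
str_replace_all: replace every match
str_replace_na: turn NA into the string "NA"
str_locate: find the position of the first match.
str_locate_all: find the positions of every match
str_extract: extract the first match from a string
str_extract_all: extract every match from a string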
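str_match() and word() have no example earlier in these notes, so here is a brief sketch. The inputs are made up, and the pattern reuses the file-name shape from section 12.

```r
library(stringr)

files <- c("a-2017", "b-2018", "no match")

# Column 1 is the whole match, columns 2 and 3 are the two capture groups;
# a string without a match yields a row of NA
m <- str_match(files, "(\\w)-(\\d+)")
m

# word() picks words by position; negative positions count from the end
second <- word("one apple pie", 2)
last   <- word("one apple pie", -1)
```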
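3.4 String transformation functions
str_conv: convert character encoding
str_to_upper: convert a string to upper case
str_to_lower: convert a string to lower case; same rules as str_to_upper
str_to_title: capitalize the first letter of each word; same rules as str_to_upper
3.5 Pattern-control functions; they only build the pattern argument of other functions and are not used on their own
boundary: match at boundaries (between characters, words, lines, or sentences)
coll: compare strings using standard collation rules.
fixed: match the given characters literally, so regex special characters and escapes lose their special meaning
regex: define a regular expression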
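A short sketch of how these four helpers change what a pattern means when passed to other stringr functions. The inputs are made up for illustration.

```r
library(stringr)

# A plain string pattern is treated as a regex: "." matches any character
any_char <- str_detect("ab", ".")

# fixed() matches the characters literally
no_dot  <- str_detect("ab", fixed("."))
has_dot <- str_detect("a.b", fixed("."))

# regex() exposes matching options such as ignore_case
ci_regex <- str_detect("ABC", regex("abc", ignore_case = TRUE))

# coll() compares using collation rules and also accepts ignore_case
ci_coll <- str_detect("ABC", coll("abc", ignore_case = TRUE))

# boundary() matches at boundaries; counting word boundaries counts words
n_words <- str_count("one two three", boundary("word"))
```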