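Using regular expressions in R
# Base R ships with the regular-expression functions grep() and grepl(): grep() returns the indices of the matching elements, grepl() returns a logical vector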
telephone=c("123-23451", "1225-3123", "121-45672", "1332-1231", "1212-3212" ,"123456789")
grep('^[0-9]{4}-[0-9]{4}$',telephone)
output:[1] 2 4 5
telephone[grep('^[0-9]{3}-[0-9]{5}$',telephone)]
output:[1] "123-23451" "121-45672"
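# Indexing with the result of grep() can also be written with the value = TRUE argument of grep(), which returns the matching elements themselves instead of their indices. A minimal sketch that reuses the telephone vector from above:

```r
telephone <- c("123-23451", "1225-3123", "121-45672", "1332-1231", "1212-3212", "123456789")

# value = TRUE returns the matching strings rather than their positions
three_five <- grep('^[0-9]{3}-[0-9]{5}$', telephone, value = TRUE)
three_five
```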
# To select the elements that match neither of the two formats above, use grepl() (handy for spotting anomalous values)
telephone[!grepl('^[0-9]{4}-[0-9]{4}$',telephone) & !grepl('^[0-9]{3}-[0-9]{5}$',telephone)]
output:[1] "123456789"
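# The two negated grepl() calls could also be folded into one pattern with alternation. This is only a sketch of that alternative; the names valid_pattern, is_valid and anomalies are made up for illustration.

```r
telephone <- c("123-23451", "1225-3123", "121-45672", "1332-1231", "1212-3212", "123456789")

# One pattern covering both accepted formats: 4 digits-4 digits, or 3 digits-5 digits
valid_pattern <- '^([0-9]{4}-[0-9]{4}|[0-9]{3}-[0-9]{5})$'

# TRUE where the element has one of the accepted formats
is_valid <- grepl(valid_pattern, telephone)

# Everything that fails both formats is treated as an anomaly
anomalies <- telephone[!is_valid]
anomalies
```

# Keeping is_valid as a separate logical vector makes it easy to count the anomalies with sum(!is_valid) or to locate them with which(!is_valid).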
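# The str_match() function from the stringr package
# str_match(string, pattern) takes two arguments: string is the data (a character vector) and pattern is the regular expression. It returns a character matrix with one row per element of string: the first column holds the full match, and each parenthesised capture group in the pattern adds one more column. If an element does not match, its whole row is NA.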
fruits
[1] "apple:20" "orange:missing" "banana:30"
[4] "pear:sent to Jerry" "watermelon:2" "blueberry:12"
[7] "strawberry:sent to James"
library(stringr)
matches <- str_match(fruits,'^(\\w+):\\s?(\\d+)$')
matches
output:
     [,1]           [,2]         [,3]
[1,] "apple:20"     "apple"      "20"
[2,] NA             NA           NA
[3,] "banana:30"    "banana"     "30"
[4,] NA             NA           NA
[5,] "watermelon:2" "watermelon" "2"
[6,] "blueberry:12" "blueberry"  "12"
[7,] NA             NA           NA
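# Now convert the valid (matched) rows of matches into a data frame
# Drop the full-match column, remove the all-NA rows, then name the two capture-group columns
fruits_df <- data.frame(na.omit(matches[, -1]), stringsAsFactors = FALSE)
names(fruits_df) <- c("name", "number")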
fruits_df
output:
        name number
1      apple     20
2     banana     30
3 watermelon      2
4  blueberry     12
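# Because str_match() returns a character matrix, the number column is still text at this point. A minimal sketch of converting it before doing arithmetic, with the data frame rebuilt by hand here so the snippet runs on its own:

```r
# Same content as fruits_df above, typed in directly for a self-contained example
fruits_df <- data.frame(
  name = c("apple", "banana", "watermelon", "blueberry"),
  number = c("20", "30", "2", "12"),
  stringsAsFactors = FALSE
)

# Convert the captured digits from character to integer
fruits_df$number <- as.integer(fruits_df$number)

# The column can now be used in numeric operations
total <- sum(fruits_df$number)
total
```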