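This post works through an example to show how R handles strings.
The dataset used is USArrests, which ships with R.
First, look at the first few rows of the dataset: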
> head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
Getting abbreviated state names
> states = rownames(USArrests);head(states) # collect the state names
[1] "Alabama" "Alaska" "Arizona" "Arkansas"
[5] "California" "Colorado"
> substr(states,start = 1,stop = 4) # Method 1
[1] "Alab" "Alas" "Ariz" "Arka" "Cali" "Colo" "Conn" "Dela"
[9] "Flor" "Geor" "Hawa" "Idah" "Illi" "Indi" "Iowa" "Kans"
[17] "Kent" "Loui" "Main" "Mary" "Mass" "Mich" "Minn" "Miss"
[25] "Miss" "Mont" "Nebr" "Neva" "New " "New " "New " "New "
[33] "Nort" "Nort" "Ohio" "Okla" "Oreg" "Penn" "Rhod" "Sout"
[41] "Sout" "Tenn" "Texa" "Utah" "Verm" "Virg" "Wash" "West"
[49] "Wisc" "Wyom"
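Method 1 uses substr(x, start, stop). The argument x is the character vector to operate on, and start and stop give the first and last character positions to extract. Note that substr is vectorized over x: it handles one string at a time, but it is applied to every element. In Method 1, x is states, a vector of many strings, so the result contains characters 1 through 4 of each state name. Names with four or fewer characters, such as "Iowa", "Ohio", and "Utah", are returned whole.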
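To make the element-by-element behaviour concrete, here is a minimal sketch. The vector x is a made-up sample for illustration (it is not taken from the original post), and substring() is shown only as a related base R function for comparison.

```r
# Hypothetical sample vector, chosen for illustration only
x <- c("Alabama", "Ohio", "New York")

# substr() is applied to each element of the vector in turn
first4 <- substr(x, start = 1, stop = 4)
print(first4)

# If stop is past the end of a string, the result is just shorter
short <- substr("Ohio", start = 1, stop = 10)
print(short)

# substring() defaults its end position so it runs to the end of each string
tail_part <- substring(x, 5)
print(tail_part)
```

Spaces count as characters, which is why "New York" is cut to "New " with a trailing space, just like the "New " entries in the output of Method 1.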
> abbreviate(states,minlength = 5) # Method 2
Alabama Alaska
"Alabm" "Alask"
Arizona Arkansas
"Arizn" "Arkns"
California Colorado
"Clfrn" "Colrd"
Connecticut Delaware
"Cnnct" "Delwr"
Florida Georgia
"Flord" "Georg"
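Method 2 uses abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE).
Its main purpose is to produce abbreviations. Usually only the first two arguments need to be set, and the result then looks like a conventional abbreviation. Note that minlength is a minimum, not a fixed width: although Method 2 sets minlength = 5, the returned strings are not all 5 characters long. Names shorter than minlength, such as "Iowa", come back unchanged, and with the default strict = FALSE an abbreviation may be longer than minlength when that is needed to keep it unique. This differs from Method 1, which always cuts at a fixed position.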
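The point about minlength can be checked directly. The sketch below is illustrative rather than part of the original post; the variable names abbr and abbr_plain are assumptions introduced here.

```r
# State names from the built-in USArrests dataset
states <- rownames(USArrests)

# Same call as Method 2; the result is a character vector named by the originals
abbr <- abbreviate(states, minlength = 5)

# Tabulate the lengths of the abbreviations to see that they are not all equal
print(table(nchar(abbr)))

# Names already shorter than minlength are returned unchanged
print(abbr[c("Iowa", "Ohio", "Utah")])

# Check whether any abbreviation is duplicated (0 means all are distinct)
print(anyDuplicated(abbr))

# named = FALSE returns a plain character vector without the original names
abbr_plain <- abbreviate(states, minlength = 5, named = FALSE)
print(head(abbr_plain))
```

Keeping the default named = TRUE is convenient when the abbreviations are used as a lookup table, since each one can then be retrieved by its full state name.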
Reference: 正则表达式及R字符串处理之终结版 (posted by yphuang on March 15, 2016)