R语言笔记一

常用函数

object.size() ##查询数据大小
names() ##查询数据变量名称
head(x, 10) ,tail(x, 10) ##查询数据前/后10行
summary() ##对数据集的详细统计呈现
table(x$y) ##对y值出现次数统计
str() ##查询数据集/函数的详细结构
nrow(),ncol() ##查询行列数
sqrt(x) ##square root取x的平方根
abs(x) ##absolute value取x的绝对值
names(vect2)<-c(“foo”,”bar”,”norf”) ##给向量命名
identical(vect,vect2) ##TRUE 检查两个向量是否一样
vect[c(“foo”,”bar”)] ##用名字选取向量
colnames(my_data)<-cnames ##修改数据框的列名
t() ##互换数据框的行列
length(“”)统计字符数,空字符时计数为1
nchar(“”)统计字符数,空字符时计数为0
tolower()将字符转换为小写
toupper()将字符转换为大写
chartr(“A”,”B”,x):字符串x中使用B替换A
na.omit(),移除所有含有缺失值的观测(行删除,listwise deletion)
paste()

paste("Var",1:5,sep="")
[1] "Var1" "Var2" "Var3" "Var4" "Var5"

> x<-list(a='aaa',b='bbb',c="ccc")
> y<-list(d="163.com",e="qq.com")
> paste(x,y,sep="@")
[1] "aaa@163.com" "bbb@qq.com"  "ccc@163.com"

#增加collapse参数,设置分隔符
> paste(x,y,sep="@",collapse=';')
[1] "aaa@163.com;bbb@qq.com;ccc@163.com"
> paste(x,collapse=';')
[1] "aaa;bbb;ccc"

strsplit()字符串拆分

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)
x为需要拆分的字串向量
split为拆分位置的字串向量,默认为正则表达式匹配(fixed=FALSE),
设置fixed=TRUE,表示使用普通文本匹配或正则表达式的精确匹配。普通文本的运算速度快
perl=TRUE/FALSE的设置和perl语言版本有关,如果正则表达式很长,正确设置表达式并且使用perl=TRUE可以提高运算速度。
useBytes设置是否逐个字节进行匹配,默认为FALSE,即按字符而不是字节进行匹配。
strsplit得到的结果是列表,后面要怎么处理就得看情况而定了

字符串替换:sub(),gsub()

严格地说R语言没有字符串替换的函数
R语言对参数都是传值不传址
sub和gsub的区别是前者只做一次替换,gsub把满足条件的匹配都做替换
> text<-c("Hello, Adam","Hi,Adam!","How are you,Ava")
> sub(pattern="Adam",replacement="word",text)
[1] "Hello, word"     "Hi,word!"        "How are you,Ava"
> sub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word"      "Hi,word!"         "How are you,word"
> gsub(pattern="Adam|Ava",replacement="word",text)
[1] "Hello, word"      "Hi,word!"         "How are you,word"

字符串提取substr(), substring()

substr和substring函数通过位置进行字符串拆分或提取,它们本身并不使用正则表达式
结合正则表达式函数regexpr、gregexpr或regexec使用可以非常方便地从大量文本中提取所需信息
语法格式
substr(x, start, stop) 
substring(text, first, last = 1000000L)
第 1个参数均为要拆分的字串向量,第2个参数为截取的起始位置向量,第3个参数为截取字串的终止位置向量
substr返回的字串个数等于第一个参数的长度
substring返回字串个数等于三个参数中最长向量长度,短向量循环使用
> x <- "123456789" 
> substr(x, c(2,4), c(4,5,8)) 
[1] "234" 
> substring(x, c(2,4), c(4,5,8)) 
[1] "234"     "45"      "2345678"
因为x的向量长度为1,substr获得的结果只有1个字串,
即第2和第3个参数向量只用了第一个组合:起始位置2,终止位置4。
substring的语句三个参数中最长的向量为c(4,5,8),执行时按短向量循环使用的规则第一个参数事实上就是c(x,x,x),
第二个参数就成了c(2,4,2),最终截取的字串起始位置组合为:2-4, 4-5和2-8。

Workspace and Files

ls() ##查询工作区对象
list.files(), dir() ##列出工作目录所有文件
dir.create(“testdir”) ##创建testdir目录
file.create(“mytest.R”) ##创建mytest.R文件
file.exists(“mytest.R”) ##查询文件是否存在
file.info(“mytest.R”) , file.info(“mytest.R”)$mode ##查询文件包含信息,或特定信息
file.rename(“mytest.R”,”mytest2.R”) ##重命名为mytest2.R
file.remove(“mytest.R”) ##删文件
file.copy(“mytest2.R”,”mytest3.R”) ##复制为mytest3.R文件
file.path(“mytest3.R”) ##在众多工作文件中,指定提供某个文件的相对路径。
file.path(“folder1”,”folder2”) ##”folder1/folder2”也能创建独立于系统的路径供R工作。?

Create a directory in the current working directory called “testdir2” and a subdirectory for it called “testdir3”, all in one command by using dir.create() and file.path().

 dir.create(file.path('testdir2','testdir3'),recursive = TRUE)

 unlink("testdir2", recursive = TRUE)    ##删除目录及所有(没有recursive=T,R会阻止)。名称源于unix命令。
setwd('testdir')     ##设testdir目录,为工作目录
> old.dir <- getwd()
args()  ##查询函数参数构成
sample(x) ##也可以对x重新排序
> sample(1:6, 4, replace = TRUE)
[1] 4 5 1 3

>flips <- sample(c(0,1),100,replace = TRUE, prob = c(0.3,0.7)) #prob设定0和1出现的概率

> flips
  [1] 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
 [47] 1 0 1 1 1 1 1 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 1 0 0 1 1 1
 [93] 1 1 1 1 1 0 1 1

Sequence of Numbers

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10

>pi:10   ##real numbers 实数
[1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593

?‘:’查询操作符号:

> seq(1,10)
 [1]  1  2  3  4  5  6  7  8  9 10

> seq(0, 10, by=0.5)
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0  8.5
[19]  9.0  9.5 10.0

>my_seq<- seq(5,10,length=30)  ##在区间(5, 10)等距生成30个数
> 1:length(my_seq)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
> seq(along.with = my_seq)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

> seq_along(my_seq)  **
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

>rep(c(0,1,2),times=10)
 [1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
>rep(c(0,1,2),each=10)
 [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Vector

> paste(1:3,c("X", "Y", "Z"),sep="")
[1] "1X" "2Y" "3Z"

* Vector recycling!*

> paste(LETTERS, 1:4, sep = "-")
 [1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4" "I-1" "J-2" "K-3" "L-4" "M-1" "N-2" "O-3"
[16] "P-4" "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2" "W-3" "X-4" "Y-1" "Z-2"

数据类型

对象与属性 Objects and Attributes

Objects

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (True/False)

The most basic object is a vector

  • A vector can only contain objects of the same class
  • BUT: The one exception is a list, which is represented as a vector but can contain objects of different classes (indeed, that’s usually why we use them)

Empty vectors can be created with the vector() function.

Numbers

  • Numbers in R a generally treated as numeric objects (i.e. double precision real numbers)
  • If you explicitly want an integer, you need to specify the L suffix
  • Ex: Entering *1* gives you a numeric object; entering *1L* explicitly gives you an integer **
  • There is also a special number *Inf* which represents infinity; e.g. 1 / 0; Inf can be used in ordinary calculations; e.g. 1 / Inf is 0
  • The value *NaN* represents an undefined value (“not a number”); e.g. 0 / 0; *NaN* can also be thought of as a missing value (more on that later)

Attributes

R objects can have attributes

  • names, dimnames
  • dimensions (e.g. matrices, arrays)
  • class
  • length
  • other user-defined attributes/metadata
    Attributes of an object can be accessed using the attributes() function

向量与列表 Vectors and Lists

Creating Vectors

The c() function can be used to create vectors of objects.

> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex

Using the vector() function

> x <- vector("numeric", length = 10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

Mixing Objects

When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class.

> y <- c(1.7, "a") ## character
> y <- c(TRUE, 2) ## numeric
> y <- c("a", TRUE) ## character

Explicit Coercion 强制明确

Objects can be explicitly coerced from one class to another using the as.* functions, if available.

> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"

Nonsensical coercion results in NAs

> x <- c("a", "b", "c")
> as.numeric(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion
> as.logical(x)
[1] NA NA NA
> as.complex(x)
[1] NA NA NA
Warning message:
NAs introduced by coercion 

Lists

Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them well.

> x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i

矩阵 Matrices

Matrices

Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of length 2 (nrow, ncol)

> m <- matrix(nrow = 2, ncol = 3)
> m
 [,1] [,2] [,3]
[1,] NA NA NA
[2,] NA NA NA
> dim(m)
[1] 2 3
> attributes(m) **
$dim
[1] 2 3

Matrices (cont’d)

Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.

> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
 [,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6

Matrices can also be created directly from vectors by adding a dimension attribute.**

> m <- 1:10
> m
[1] 1 2 3 4 5 6 7 8 9 10
> dim(m) <- c(2, 5)  **
> m
 [,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10

cbind-ing and rbind-ing

Matrices can be created by column-binding or row-binding with cbind() and rbind().

> x <- 1:3
> y <- 10:12
> cbind(x, y)
 x y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
 [,1] [,2] [,3]
x 1 2 3
y 10 11 12

因子 Factors

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

  • Factors are treated specially by modelling functions like *lm()* and *glm()*
  • Using factors with labels is *better* than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

     x <- factor(c("yes", "yes", "no", "yes", "no"))
     x
    [1] yes yes no yes no
    Levels: no yes
     table(x)
    x
    no yes
    2 3
     unclass(x)
    [1] 2 2 1 2 1
    attr(,"levels")
    [1] "no" "yes"
    

The order of the levels can be set using the levels argument to factor(). This can be important in linear modelling because the first level is used as the baseline level.

> x <- factor(c("yes", "yes", "no", "yes", "no"),
 levels = c("yes", "no")) **
> x
[1] yes yes no yes no
Levels: yes no

缺失值 Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations.

  • is.na() is used to test objects if they are NA
  • is.nan() is used to test for NaN
  • NA values have a class also, so there are integer NA, character NA, etc
  • A NaN value is also NA but the converse is not true

    > x <- c(1, 2, NA, 10, 3)
    > is.na(x)
    [1] FALSE FALSE TRUE FALSE FALSE
    > is.nan(x)
    [1] FALSE FALSE FALSE FALSE FALSE
    > x <- c(1, 2, NaN, NA, 4)
    > is.na(x)
    [1] FALSE FALSE TRUE TRUE FALSE
    > is.nan(x)
    [1] FALSE FALSE TRUE FALSE FALSE
    

数据框 Data Frames

Data frames are used to store tabular data (表格数据)

  • They are represented as a special type of list where every element of the list has to have the same length
  • Each element of the list can be thought of as a column and the length of each element of the list is the number of rows
  • Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class
  • Data frames also have a special attribute called *row.names*
  • Data frames are usually created by calling *read.table()* or *read.csv()*
  • Can be converted to a matrix by calling *data.matrix()* *

    > x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
    > x
     foo bar
    1 1 TRUE
    2 2 TRUE
    3 3 FALSE
    4 4 FALSE
    > nrow(x)
    [1] 4
    > ncol(x)
    [1] 2
    

Names Attribute 名字属性

Names

R objects can also have names, which is very useful for writing readable code and self-describing objects.

> x <- 1:3
> names(x)
NULL
> names(x) <- c("foo", "bar", "norf")
> x
foo bar norf
 1 2 3
> names(x)
[1] "foo" "bar" "norf"

Lists can also have names.

> x <- list(a = 1, b = 2, c = 3)
> x
$a
[1] 1
$b
[1] 2
$c
[1] 3

And matrices.

> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d")) ***
> m
 c d
a 1 3
b 2 4

Summary

Data Types

  • atomic classes: numeric, logical, character, integer, complex \
  • vectors, lists
  • factors
  • missing values
  • data frames
  • names

Reading Writing Data

Reading Data

There are a few principal functions reading data into R.

  • *read.table()*, *read.csv()*, for reading tabular data
  • *readLines()*, for reading lines of a text file
  • *source()*, for reading in R code files (inverse of dump)**
  • *dget()*, for reading in R code files (inverse of dput)**
  • *load()*, for reading in saved workspaces
  • *unserialize()*, for reading single R objects in binary form

Writing Data

There are analogous functions for writing data to files.

  • write.table()
  • writeLines()
  • dump()
  • dput()
  • save()
  • serialize()

Reading Data Files with read.table *

The read.table function is one of the most commonly used functions for reading data. It has a few important arguments:

  • *file*, the name of a file, or a connection
  • *header*, logical indicating if the file has a header line
  • *sep*, a string indicating how the columns are separated
  • *colClasses*, a character vector indicating the class of each column in the dataset
  • *nrows*, the number of rows in the dataset
  • *comment.char()*, a character string indicating the comment character
  • *skip*, the number of lines to skip from the beginning
  • *stringsAsFactors*, should character variables be coded as factors?

read.table
For small to moderately sized datasets, you can usually call read.table without specifying any other arguments.

data <- read.table("foo.txt")

R will automatically

  • skip lines that begin with a #
  • figure out how many rows there are (and how much memory needs to be allocated
  • figure what type of variable is in each column of the table Telling R all these things directly makes R run faster and more efficiently.
  • *read.csv* is identical to *read.table* except that the default separator is a comma.

Reading in Larger Datasets with read.table

With much larger datasets, doing the following things will make your life easier and will prevent R from choking.

  • Read the help page for read.table, which contains many hints
  • Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
  • Set comment.char = "" if there are no commented lines in your file. **
  • Use the *colClasses* argument. Specifying this option instead of using the default can make ’read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set *colClasses = "numeric"*. A quick an dirty way to figure out the classes of each column is the following:
initial <- read.table("datatable.txt", nrows = 100) ***
classes <- sapply(initial, class)
tabAll <- read.table("datatable.txt",
                      colClasses = classes)
  • Set *nrows*. This doesn’t make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool *wc* to calculate the number of lines in a file.

Know Thy System

In general, when using R with larger datasets, it’s useful to know a few things about your system.

  • How much memory is available?
  • What other applications are in use?
  • Are there other users logged into the same system?
  • What operating system?
  • Is the OS 32 or 64 bit?

Calculating Memory Requirements

I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?
1,500,000 × 120 × 8 bytes/numeric

= 1440000000 bytes
= 1440000000 / bytes/MB
= 1,373.29 MB
= 1.34 GB

Textual Formats

  • *dumping* and *dputing* are useful because the resulting textual format is edit-able, and in the case of corruption, potentially recoverable.
  • *Unlike* writing out a table or csv file, *dump* and *dput* preserve the *metadata* (sacrificing some readability), so that another user doesn’t have to specify it all over again.
  • *Textual* formats can work much better with version control programs like subversion or git which can only track changes meaningfully in text files
  • Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem
  • Textual formats adhere to the “Unix philosophy”
  • Downside: The format is not very space-efficient

dput-ting R Objects ?

Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.

> y <- data.frame(a = 1, b = "a")
> dput(y)
structure(list(a = 1,
                 b = structure(1L, .Label = "a",
                                        class = "factor")),
            .Names = c("a", "b"), row.names = c(NA, -1L),
            class = "data.frame")
> dput(y, file = "y.R")
> new.y <- dget("y.R")
> new.y
     a    b
1   1    a

Dumping R Objects ?

Multiple objects can be deparsed(逆分析) using the dump function(转储功能) and read back in using source.

> x <- "foo"
> y <- data.frame(a = 1, b = "a")
> dump(c("x", "y"), file = "data.R")
> rm(x, y)
> source("data.R")
> y
    a  b
1  1  a
> x
[1] "foo"

Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

  • *file*, opens a connection to a file
  • *gzfile*, opens a connection to a file compressed with gzip
  • *bzfile*, opens a connection to a file compressed with bzip2
  • *url*, opens a connection to a webpage

File Connections **

> str(file)
function (description = "", open = "", blocking = TRUE,
            encoding = getOption("encoding"))

 1. *description* is the name of the file
 2. *open* is a code indicating
    - “r” read only
    - “w” writing (and initializing a new file)
    - “a” appending
    - “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)

Connections

In general, connections are powerful tools that let you navigate files or other external objects. In practice, we often don’t need to deal with the connection interface directly.

con <- file("foo.txt", "r") **
data <- read.csv(con)
close(con)

is the same as

data <- read.csv("foo.txt")

Reading Lines of a Text File

> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
 [1] "1080"        "10-point"   "10th"         "11-point"
 [5] "12-point"  "16-point"   "18-point"  "1st"
 [9] "2"              "20-point"

writeLines takes a character vector and writes each element one line at a time to a text file.
readLines can be useful for reading in lines of webpages

## This might take time
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
> head(x)
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\">"
[2] ""
[3] "<html>"
[4] "<head>"
[5] "\t<meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8

Subsetting

There are a number of operators that can be used to extract subsets of R objects.

  • [ always returns an object of the same class as the original; can be used to select more than one element (there is one exception)
  • [[ is used to extract elements of a list or a data frame; it can only be used to extract a single element and the class of the returned object will not necessarily be a list or data frame
  • $ is used to extract elements of a list or data frame by name; semantics are similar to that of [[.

    x <- c(“a”, “b”, “c”, “c”, “d”, “a”)
    x[1]
    [1] “a”
    x[2]
    [1] “b”
    x[1:4]
    [1] “a” “b” “c” “c”
    x[x > “a”]
    [1] “b” “c” “c” “d”
    u <- x > “a”
    u
    [1] FALSE TRUE TRUE TRUE TRUE FALSE
    x[u]
    [1] “b” “c” “c” “d”

Subsetting Lists

> x <- list(foo = 1:4, bar = 0.6)
> x[1]
$foo
[1] 1 2 3 4
> x[[1]]
[1] 1 2 3 4
> x$bar
[1] 0.6
> x[["bar"]]
[1] 0.6
> x["bar"]
$bar
[1] 0.6

> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> x[c(1, 3)]
$foo
[1] 1 2 3 4
$baz
[1] "hello"

The [[ operator can be used with computed indices; $ can only be used with literal names.

> x <- list(foo = 1:4, bar = 0.6, baz = "hello")
> name <- "foo"
> x[[name]]     ## computed index for ‘foo’ **
[1] 1 2 3 4
> x$name       ## element ‘name’ doesn’t exist!
NULL
> x$foo
[1] 1 2 3 4       ## element ‘foo’ does exist

Subsetting Nested Elements of a List
The [[ can take an integer sequence.

> x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
> x[[c(1, 3)]]    **
[1] 14
> x[[1]][[3]]
[1] 14
> x[[c(2, 1)]]
[1] 3.14

Subsetting a Matrix

Matrices can be subsetted in the usual way with (i,j) type indices.

> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[2, 1]
[1] 2

Indices can also be missing. **

> x[1, ]
[1] 1 3 5
> x[, 2]
[1] 3 4

By default, when a single element of a matrix is retrieved, it is returned as a vector of length 1 rather than a 1 × 1 matrix. This behavior can be turned off by setting drop = FALSE.

> x <- matrix(1:6, 2, 3)
> x[1, 2]
[1] 3
> x[1, 2, drop = FALSE] **
    [,1] 
[1,]   3

Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).

> x <- matrix(1:6, 2, 3)
> x[1, ]
[1] 1 3 5
> x[1, , drop = FALSE]
  [,1]    [,2]    [,3]
[1,]   1       3       5

Partial Matching

Partial matching of names is allowed with [[ and $

> x <- list(aardvark = 1:5)
> x$a
[1] 1 2 3 4 5
> x[["a"]]
NULL
> x[["a", exact = FALSE]] ***
[1] 1 2 3 4 5 

Removing NA Values *

A common task is to remove missing values (NAs).

> x <- c(1, 2, NA, 4, NA, 5)
> bad <- is.na(x)
> x[!bad]
[1] 1 2 4 5

What if there are multiple things and you want to take the subset with no missing values?

> x <- c(1, 2, NA, 4, NA, 5)
> y <- c("a", "b", NA, "d", NA, "f")
> good <- complete.cases(x, y) ***
> good
[1] TRUE TRUE FALSE TRUE FALSE TRUE
> x[good]
[1] 1 2 4 5
> y[good]
[1] "a" "b" "d" "f"

> airquality[1:6, ]
      Ozone     Solar.R    Wind       Temp     Month   Day
1       41       190       7.4         67       5       1
2       36       118       8.0         72       5       2
3       12       149       12.6       74       5       3
4       18       313       11.5       62       5       4
5       NA       NA       14.3       56       5       5
6       28       NA 14.9 66 5 6
> good <- complete.cases(airquality)
> airquality[good, ] [1:6, ]   ***
         Ozone Solar.R   Wind      Temp       Month     Day
1       41       190       7.4         67       5       1
2       36       118       8.0         72       5       2
3       12       149       12.6       74       5       3
4       18       313       11.5       62       5       4
7       23       299       8.6         65       5       7

Vectorized Operations 向量化操作

Many operations in R are vectorized making code more efficient, concise, and easier to read.

> x <- 1:4; y <- 6:9
> x + y
[1] 7 9 11 13
> x > 2
[1] FALSE FALSE TRUE TRUE
> x >= 2
[1] FALSE TRUE TRUE TRUE
> y == 8
[1] FALSE FALSE TRUE FALSE
> x * y
[1] 6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444

Vectorized Matrix Operations

> x <- matrix(1:4, 2, 2); y <- matrix(rep(10, 4), 2, 2) ?
> x * y             ## element-wise multiplication
        [,1]    [,2]
[1,]    10    30
[2,]    20    40
> x / y
     [,1]    [,2]
[1,]    0.1    0.3
[2,]    0.2    0.4
> x %*% y       ## true matrix multiplication
          [,1]    [,2]
[1,]      40    40
[2,]      60    60

Missing Value

is.na(mydata) 与 mydata == NA 结果一样

R uses ‘one-based indexing‘, which (you
| guessed it!) means the first element of a vector is considered element 1.

x[c(2, 10)] ##取x的第2个和第10个数
x[c(-2, -10)] ##取除去第2个和第10个的所有数
x[-c(2, 10)] ##同上

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值