R - "R in Action, 2nd Edition" - Robert I. Kabacoff - Chapter 2. Creating a dataset

R in Action

Data analysis and graphics with R - SECOND EDITION

ROBERT I. KABACOFF

ISBN: 9781617291388


Chapter 2. Creating a dataset

2.1 Table 2.1

## R contains a wide variety of "structures" for holding data, 
   including "scalars", "vectors", "arrays", "data frames", and "lists".
## The "data types" or "modes" that R can handle include "numeric", 
   "character", "logical (TRUE/FALSE)", "complex (imaginary numbers)", and "raw (bytes)". 
## R refers to "case identifiers" as "rownames"
   and to "categorical variables" (nominal, ordinal) as "factors".
## "Factors" are "nominal" or "ordinal" variables. They’re stored and treated specially in R.


2.2.1 Vector

## "Vectors" are one-dimensional arrays that can hold "numeric data",
  "character data", or "logical data".
> a <- c(1, 2, 5, 3, 6, -2, 4)
> b <- c("one", "two", "three")
> c <- c(TRUE, TRUE, FALSE, FALSE)
## Note that the data in a vector must be of "only one type or mode" (numeric,
   character, or logical).
   You can’t mix modes in the same vector.
## Accessing elements by position
> b[c(1, 3)]
[1] "one"   "three"
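
The single-mode rule above can be checked directly. The following is a minimal sketch (the variable names `mixed_chr` and `mixed_num` are made up for illustration): `c()` does not raise an error when modes are mixed, it silently coerces every element to one common mode, so the resulting vector still holds only one type.

```r
## Mixing modes in c() does not fail; R coerces to a single common mode.
mixed_chr <- c(1, "two", TRUE)   ## numeric + character + logical
mixed_num <- c(1, TRUE, FALSE)   ## numeric + logical

class(mixed_chr)   ## "character"
mixed_chr          ## "1" "two" "TRUE"
class(mixed_num)   ## "numeric"
mixed_num          ## 1 1 0
```

This is why a stray character value in numeric data turns the whole vector into text.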

2.2.0 Scalars

## "Scalars" are "one-element vectors".
> h <- 3
> h2 <- TRUE

2.2.2 Matrix

## Example 01
> y <- matrix(1:20, nrow = 5, ncol = 4)
> y
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
## Example 02
> cells <- c(1, 26, 24, 68)
> rnames <- c("R1", "R2")
> cnames <- c("C1", "C2")
> mymatrix <- matrix(cells, 
                     nrow = 2, ncol = 2, byrow = TRUE, 
                     dimnames = list(rnames, cnames)
                     )
> mymatrix
   C1 C2
R1  1 26
R2 24 68
## "Matrices" are "two-dimensional" and, like vectors, can contain "only one"
    data type.
   When there are "more than two dimensions", you use "arrays" (section 2.2.3)
## Indexing a matrix
> x <- matrix(1:10, nrow = 2)
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> x[2, ]
[1]  2  4  6  8 10
> x[ ,2]
[1] 3 4
> x[1, 4]
[1] 7
> x[1, c(4,5)]
[1] 7 9
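
One detail the indexing examples above show only implicitly: selecting a single row or column returns a plain vector, not a matrix. A small sketch, assuming you want to keep the matrix shape, uses the `drop = FALSE` argument of `[` (the names `row_vec`, `row_mat`, and `col_mat` are illustrative only):

```r
x <- matrix(1:10, nrow = 2)

row_vec <- x[2, ]                 ## dimensions dropped: a plain integer vector
row_mat <- x[2, , drop = FALSE]   ## dimensions kept: a 1 x 5 matrix
col_mat <- x[, 2, drop = FALSE]   ## dimensions kept: a 2 x 1 matrix

is.matrix(row_vec)   ## FALSE
dim(row_mat)         ## 1 5
dim(col_mat)         ## 2 1
```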

2.2.3 Array

> dimx <- c("A1", "A2")
> dimy <- c("B1", "B2", "B3")
> dimz <- c("C1", "C2", "C3", "C4")
> A <- array(1:24, 
            c(2, 3, 4), 
            dimnames = list(dimx, dimy, dimz)
            )
> A
, , C1

   B1 B2 B3
A1  1  3  5
A2  2  4  6

, , C2

   B1 B2 B3
A1  7  9 11
A2  8 10 12

, , C3

   B1 B2 B3
A1 13 15 17
A2 14 16 18

, , C4

   B1 B2 B3
A1 19 21 23
A2 20 22 24
> A[1,1,3]
[1] 13
## Like matrices, arrays must contain data of a single mode.
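
Because the array above was created with `dimnames`, its elements can also be addressed by name instead of by position. A brief sketch that rebuilds the same array so the block runs on its own (`slice` is an illustrative name):

```r
A <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

dim(A)                  ## 2 3 4
A["A1", "B1", "C3"]     ## 13, the same element as A[1, 1, 3]
slice <- A[, , "C3"]    ## the whole third layer, a 2 x 3 matrix
dim(slice)              ## 2 3
```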

2.2.4 Data Frame

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
> patientdata
  patientID age diabetes    status
1         1  25    Type1      Poor
2         2  34    Type2  Improved
3         3  28    Type1 Excellent
4         4  52    Type1      Poor
## Specifying elements of a data frame
> patientdata[1:2]
  patientID age
1         1  25
2         2  34
3         3  28
4         4  52
> patientdata[c("diabetes", "status")]
  diabetes    status
1    Type1      Poor
2    Type2  Improved
3    Type1 Excellent
4    Type1      Poor
> patientdata$age
[1] 25 34 28 52
> table(patientdata$diabetes, patientdata$status)
       
        Excellent Improved Poor
  Type1         1        0    2
  Type2         0        1    0
## Method 01: refer to variables with the $ notation
> summary(mtcars$mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.43   19.20   20.09   22.80   33.90 
> plot(mtcars$mpg,mtcars$disp)
> plot(mtcars$mpg,mtcars$wt)
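
The bracket and `$` notations shown above select columns. As a complement, here is a minimal sketch of selecting rows by a condition, using the same patient data (rebuilt here so the block is self-contained; `older` and `older_status` are illustrative names):

```r
patientdata <- data.frame(
  patientID = c(1, 2, 3, 4),
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type1", "Type2", "Type1", "Type1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

## Rows go before the comma, columns after it.
older        <- patientdata[patientdata$age > 30, ]
older_status <- patientdata[patientdata$age > 30, c("patientID", "status")]

older$patientID       ## 2 4
names(older_status)   ## "patientID" "status"
```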

2.2.4.2 Method 02: The “attach()” and “detach()” functions are best used when you’re analyzing “a single data frame” and you’re unlikely to have multiple objects with the same name.

> attach(mtcars)
> summary(mpg)
   Min. 1st Qu.  Median 
  10.40   15.43   19.20 
   Mean 3rd Qu.    Max. 
  20.09   22.80   33.90 
> plot(mpg, disp)
> plot(mpg, wt)
> detach(mtcars) ## The statement is optional but is good 
  programming practice and should be included routinely.
## Drawback of Method 02:
> mpg <- c(25, 36,47)
> attach(mtcars)
The following object is masked _by_ .GlobalEnv:

    mpg

> mpg
[1] 25 36 47
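## Explanation: the global object mpg has the same name as the mpg column in mtcars,
   so the global object masks the column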
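
One way to sidestep the name clash shown above is to qualify the column explicitly rather than rely on the search path. A minimal sketch (`mean_global` and `mean_column` are illustrative names):

```r
mpg <- c(25, 36, 47)              ## a global object that shares a column name

mean_global <- mean(mpg)          ## uses the global vector: 36
mean_column <- mean(mtcars$mpg)   ## always uses the mtcars column

rm(mpg)                           ## remove the global object when done
```

The `$` form is more verbose, but it never depends on which objects happen to exist in the workspace.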

2.2.4.3 with()

## Method 03
> with(mtcars, {
+     print(summary(mpg))
+     plot(mpg, disp)
+     plot(mpg, wt)
+ })
   Min. 1st Qu.  Median 
  10.40   15.43   19.20 
   Mean 3rd Qu.    Max. 
  20.09   22.80   33.90 
## Drawback of Method 03: objects assigned inside with() exist only inside the braces
> with(mtcars, {
+     st <- summary(mpg)
+     plot(mpg, disp)
+     plot(mpg, wt)
+ })
> st
Error: object 'st' not found
## Method 03: working around the drawback
> with(mtcars, {
+     st <- summary(mpg)  ## <- assigns inside with() only
+     stt <<- summary(mpg)  ## <<- saves stt to the global environment, outside with()
+     plot(mpg, disp)
+     plot(mpg, wt)
+ })
> stt
   Min. 1st Qu.  Median 
  10.40   15.43   19.20 
   Mean 3rd Qu.    Max. 
  20.09   22.80   33.90
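
An alternative to the `<<-` operator, not taken from the book's listing, is to use the value that `with()` returns: it evaluates to the result of its last expression, so that result can be assigned in the usual way. A minimal sketch (`st` and `stats` are illustrative names):

```r
## Assign the result of with() instead of assigning inside it.
st <- with(mtcars, summary(mpg))
st

## Several results can be returned together as a named list.
stats <- with(mtcars, {
  list(mpg_summary = summary(mpg), mean_wt = mean(wt))
})
names(stats)   ## "mpg_summary" "mean_wt"
```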

2.2.4.4 Case identifiers

## The row.names option specifies "patientID" as the variable to use in labeling
    cases on various printouts and graphs produced by R
> patientdata <- data.frame(patientID, age, diabetes, status, 
                            row.names = patientID )
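
Once case identifiers are set, rows can be looked up by identifier. The sketch below assumes hypothetical character IDs ("P01" to "P04") rather than the numeric IDs used above, purely to make the lookup easier to see; `pd` is an illustrative name.

```r
patientID <- c("P01", "P02", "P03", "P04")   ## hypothetical character IDs
age       <- c(25, 34, 28, 52)

pd <- data.frame(patientID, age, row.names = patientID)

rownames(pd)       ## "P01" "P02" "P03" "P04"
pd["P03", "age"]   ## 28
```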

2.2.5 Factors

## Categorical (nominal) variables
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> class(diabetes)  ## Note the difference 01: a character vector before conversion
[1] "character"
> diabetes <- factor(diabetes)
> diabetes
[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2
> class(diabetes)  ## Note the difference 02: a factor after conversion
[1] "factor"
> diabeteses <- c("Type0")
## Ordered categorical (ordinal) variables
> status <- c("Poor", "Improved", "Excellent", "Poor")
> class(status)
[1] "character"
> status <- factor(status, ordered = TRUE)
> class(status)
[1] "ordered" "factor" 
> status
[1] Poor      Improved  Excellent Poor     
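Levels: Excellent < Improved < Poor  ## By default R sorts levels alphabetically, which does not match the intended ranking
## Specify the level order explicitly so it matches the intended ranking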
> status <- factor(status, ordered = TRUE,
+                  levels = c("Poor", "Improved", "Excellent"))
> status
[1] Poor      Improved  Excellent Poor     
Levels: Poor < Improved < Excellent  ## Levels now follow the specified order
## Assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent. 
   Be sure the specified levels match your actual data values. 
   Any data values not in the list will be set to "missing".
## Labels: map coded values to descriptive level names
> sex <- c("1", "2", "3")
> sex <- factor(sex, levels = c(1, 2), labels = c("Male", "Female"))
> sex
[1] Male   Female <NA>  
Levels: Male Female
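
To make the "1 = Poor, 2 = Improved, 3 = Excellent" assignment described above concrete, the integer codes behind a factor can be inspected with `as.integer()`. A short sketch using the same status values (`codes` and `below_excellent` are illustrative names):

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels  = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)

levels(status)                          ## "Poor" "Improved" "Excellent"
codes <- as.integer(status)             ## 1 2 3 1

## Ordered factors support comparisons that follow the level order.
below_excellent <- status < "Excellent" ## TRUE TRUE FALSE TRUE
```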

> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent", "Poor")
> patientdata <- data.frame(patientID, age, diabetes, status)
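> str(patientdata) ## Compare 01: character vectors passed in (output from R < 4.0.0, where data.frame() converts strings to factors by default)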
'data.frame':	4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Factor w/ 3 levels "Excellent","Improved",..: 3 2 1 3
> diabetes.fac <- factor(diabetes)
> status.ord <- factor(status, ordered = TRUE) 
> patientdata.fac <- data.frame(patientID, age, diabetes.fac, status.ord)
> str(patientdata.fac)  ## Compare 02: factors created explicitly beforehand
'data.frame':	4 obs. of  4 variables:
 $ patientID   : num  1 2 3 4
 $ age         : num  25 34 28 52
 $ diabetes.fac: Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status.ord  : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
> summary(patientdata.fac)
   patientID         age        diabetes.fac     status.ord
 Min.   :1.00   Min.   :25.00   Type1:3      Excellent:1   
 1st Qu.:1.75   1st Qu.:27.25   Type2:1      Improved :1   
 Median :2.50   Median :31.00                Poor     :2   
 Mean   :2.50   Mean   :34.75                              
 3rd Qu.:3.25   3rd Qu.:38.50                              
 Max.   :4.00   Max.   :52.00
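
The first `str()` output above depends on the R version: since R 4.0.0, `data.frame()` keeps character vectors as characters unless told otherwise. A minimal sketch of requesting the conversion explicitly through the `stringsAsFactors` argument (`df_default` and `df_factor` are illustrative names):

```r
diabetes <- c("Type1", "Type2", "Type1", "Type1")

df_default <- data.frame(diabetes)                           ## version-dependent
df_factor  <- data.frame(diabetes, stringsAsFactors = TRUE)  ## always a factor

class(df_default$diabetes)   ## "character" in R >= 4.0.0, "factor" before
class(df_factor$diabetes)    ## "factor"
```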

2.2.6 Lists

## Lists are "the most complex" of the R data types. 
> g <- "My"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow = 5)
> k <- c("one", "two")
> mylist <- list(title = g, ages = h, j, k)
> mylist
$title
[1] "My"

$ages
[1] 25 26 18 39

[[3]]
     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

[[4]]
[1] "one" "two"


> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39
> mylist$ages
[1] 25 26 18 39
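
The three access forms above all return the component itself. Single brackets behave differently: they return a shorter list. A small sketch with a reduced version of the list (`sub_list` and `ages_vec` are illustrative names):

```r
mylist <- list(title = "My", ages = c(25, 26, 18, 39))

sub_list <- mylist["ages"]     ## a list of length 1 that contains the vector
ages_vec <- mylist[["ages"]]   ## the numeric vector itself

class(sub_list)   ## "list"
class(ages_vec)   ## "numeric"
names(mylist)     ## "title" "ages"
```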

2.3 Data input

2.3.1 Entering data from the keyboard

> mydata <- data.frame(age = numeric(0),
+                      gender = character(0),
+                      weight = numeric(0)
+ )
> mydata <- edit(mydata) ## Opens an interactive data editor window
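
Since `edit()` needs an interactive session, small datasets can also be embedded in the script as text and parsed with `read.table()`. A minimal sketch; the three data rows are made-up sample values:

```r
## The data are written as a character string, one case per line.
mydatatxt <- "
age gender weight
25 m 166
30 f 115
18 f 120
"

## text = parses the string as if it were the contents of a file.
mydata <- read.table(header = TRUE, text = mydatatxt)
str(mydata)
```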

2.3.2 Importing data from Excel

## The "best" way to read an Excel file is to export it to "a comma-delimited"
   file from Excel and import it into R as a delimited text file
   (for example, with read.csv() or read.table()).
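
The import half of that workflow might look like the sketch below. No real spreadsheet is available here, so a temporary file with made-up rows stands in for the CSV exported from Excel; `csv_path` and `patients` are illustrative names.

```r
## Stand-in for a file exported from Excel as comma-delimited text.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("patientID,age,diabetes",
             "1,25,Type1",
             "2,34,Type2"),
           csv_path)

## read.csv() is read.table() with defaults suited to comma-delimited files.
patients <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(patients)

unlink(csv_path)   ## remove the temporary file
```

With a real export, `csv_path` would simply be the path to the saved `.csv` file.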