1. Reading and writing CSV files
2. Filtering a dataset
3. Simple random sampling with the sample function
Body:
1. Reading and writing CSV files
- Reading a file: df2 <- read.table("C:\\Users\\Lee\\Desktop\\R语言\\dummyData.csv", header=TRUE, sep=",")
- Writing a file: write.table(df1, "C:\\Users\\Lee\\Desktop\\R语言\\dummyData.csv", sep=",", row.names=FALSE)
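The read/write pair above can be checked with a quick round trip. The following is a minimal sketch, not the original workflow: it writes to a temporary file rather than the hard-coded path, and the small df1 built here is made-up sample data standing in for the real dummyData.csv.

```r
# Hypothetical sample data standing in for df1
df1 <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(90.5, 80, 72.25))

# Write to a temporary CSV file (the path is illustrative only)
csv_path <- tempfile(fileext = ".csv")
write.table(df1, csv_path, sep = ",", row.names = FALSE)

# Read it back; header = TRUE treats the first line as column names
df2 <- read.table(csv_path, header = TRUE, sep = ",")

# read.csv is a wrapper around read.table with header = TRUE and sep = "," preset
df3 <- read.csv(csv_path)
```

Setting row.names=FALSE keeps R from writing the row numbers as an extra unnamed first column, which would otherwise appear as an additional column when the file is read back.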
2. Filtering a dataset
Method 1: data frame indexing, i.e. dataframe[row condition, column condition]
> leadership <- read.table("C:\\Users\\Lee\\Desktop\\R语言\\leadership.csv", header= TRUE, sep=",")
> leadership
manager date country gender age q1 q2 q3 q4 q5
1 1 2014/10/27 US M 32 5 4 5 5 5
2 2 2014/10/28 US F 45 3 5 2 5 5
3 3 2014/10/29 UK F 25 3 5 5 5 2
4 4 2014/10/30 UK M 39 3 3 4 NA NA
5 5 2014/10/31 UK F 99 2 2 1 2 1
> newdata <- leadership[with(leadership, which(gender=="M")),]
> newdata
manager date country gender age q1 q2 q3 q4 q5
1 1 2014/10/27 US M 32 5 4 5 5 5
4 4 2014/10/30 UK M 39 3 3 4 NA NA
> newdata <- leadership[with(leadership, which(gender=="M" & age>34)),]
> newdata
manager date country gender age q1 q2 q3 q4 q5
4 4 2014/10/30 UK M 39 3 3 4 NA NA
Note
> newdata<- leadership[which(gender=="M" & age>34),]
Error in which(gender == "M" & age > 34) : object 'gender' not found
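# You must state which data frame gender belongs to (for example via with() or leadership$gender); otherwise R cannot find it and raises an error.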
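The error occurs because gender and age are columns of leadership, not variables in the workspace. Below is a minimal sketch of two equivalent ways to qualify them. The data frame is rebuilt from the values printed above so that the example runs on its own; the variable names a, b and q4_is_5 are illustrative.

```r
# Rebuild the leadership data from the printed table above
leadership <- data.frame(
  manager = 1:5,
  date    = c("2014/10/27", "2014/10/28", "2014/10/29", "2014/10/30", "2014/10/31"),
  country = c("US", "US", "UK", "UK", "UK"),
  gender  = c("M", "F", "F", "M", "F"),
  age     = c(32, 45, 25, 39, 99),
  q1 = c(5, 3, 3, 3, 2),
  q2 = c(4, 5, 5, 3, 2),
  q3 = c(5, 2, 5, 4, 1),
  q4 = c(5, 5, 5, NA, 2),
  q5 = c(5, 5, 2, NA, 1)
)

# Option A: with() evaluates the condition inside the data frame
a <- leadership[with(leadership, which(gender == "M" & age > 34)), ]

# Option B: qualify each column with leadership$
b <- leadership[which(leadership$gender == "M" & leadership$age > 34), ]

# which() returns only the positions that are TRUE, so a comparison
# that yields NA (row 4 of q4) does not add an all-NA row to the result
q4_is_5 <- leadership[which(leadership$q4 == 5), ]
```

Both options select the same single row (manager 4). Wrapping the condition in which() matters mainly when the columns contain NA, as q4 and q5 do here.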
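Method 2: filtering with the subset function
subset(dataset, condition, select=columns) # the condition filters rows, select picks columns
newdata <- subset(leadership, gender=="M" & age>25, select=c(gender:q2)) # keep only the columns from gender through q2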
> newdata
gender age q1 q2
1 M 32 5 4
4 M 39 3 3
> newdata <- subset(leadership, gender=="M" & age>25,select=gender:q5)
> newdata
gender age q1 q2 q3 q4 q5
1 M 32 5 4 5 5 5
4 M 39 3 3 4 NA NA
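For comparison, the subset call can be written with bracket indexing as well. This is a sketch on a cut-down stand-in for the leadership data (only the columns needed here, with values copied from the table above); by_subset, rows and by_index are illustrative names.

```r
# Stand-in for the leadership data (a subset of the printed columns)
leadership <- data.frame(
  manager = 1:5,
  gender  = c("M", "F", "F", "M", "F"),
  age     = c(32, 45, 25, 39, 99),
  q1 = c(5, 3, 3, 3, 2),
  q2 = c(4, 5, 5, 3, 2)
)

# subset(): the row condition comes first, then select= names the columns
by_subset <- subset(leadership, gender == "M" & age > 25, select = gender:q2)

# The same selection with bracket indexing and explicit column names
rows <- which(leadership$gender == "M" & leadership$age > 25)
by_index <- leadership[rows, c("gender", "age", "q1", "q2")]
```

Inside subset, column names can be used directly and a range such as gender:q2 is allowed in select, which is why it tends to be shorter than the bracket form.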
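3. Simple random sampling with the sample function
id <- sample(1:2, nrow(iris), replace=TRUE, prob=c(0.7,0.3))
# 1:2 means the values are drawn from 1 and 2; replace=TRUE samples with replacement.
# nrow(iris) is a single number, the count of observations (rows) in iris, so that many values are drawn.
# prob=c(0.7,0.3) draws 1 with probability 0.7 and 2 with probability 0.3.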
> id
[1] 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 2 1 1 1 2 2 1 2
[40] 1 1 1 2 1 1 2 1 1 1 2 1 1 2 1 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 2
[79] 1 1 2 1 1 1 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 2 2 2 1 1 1
[118] 2 1 2 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 2 2 2 1 1
traindata <- iris[id==1,] # training set
testdata <- iris[id==2,] # test set
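Because each row is assigned to group 1 or 2 independently, the split above is only approximately 70/30 and the sizes change from run to run. If an exact split is wanted, one alternative is to sample row numbers without replacement. This is a sketch of that different technique, not the method used above; the seed value and the name train_idx are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, used here only to make the draw repeatable

n <- nrow(iris)  # 150 rows in the built-in iris data

# Draw 70% of the row numbers without replacement (105 distinct rows)
train_idx <- sample(n, size = round(0.7 * n))

traindata <- iris[train_idx, ]   # training set
testdata  <- iris[-train_idx, ]  # test set: every row not chosen for training
```

Negative indexing with -train_idx guarantees that the two sets do not overlap and together cover all rows.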
[Example]
> mysample <- binary[sample(1:nrow(binary),3,replace=FALSE),]
> mysample
admit gre gpa rank
390 0 640 3.51 2
225 0 800 2.90 2
44 0 500 3.31 3
> mysample <- binary[sample(1:nrow(binary),3,replace=FALSE),]
> mysample
admit gre gpa rank
60 0 600 2.82 4
213 0 460 2.87 2
25 1 760 3.35 2
> mysample <- binary[sample(1:nrow(binary),3,replace=FALSE),]
> mysample
admit gre gpa rank
30 0 520 3.29 1
303 1 400 3.15 2
395 1 460 3.99 3
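Each call in the example returns different rows because sample makes a fresh random draw every time. When a repeatable draw is needed, the seed can be fixed first. The sketch below uses the built-in iris data as a stand-in, since the binary data frame is not defined in these notes; the seed value and the names first and second are arbitrary.

```r
# iris stands in for the binary data frame, which is not defined in these notes
set.seed(123)  # arbitrary seed
first <- iris[sample(1:nrow(iris), 3, replace = FALSE), ]

set.seed(123)  # resetting the same seed reproduces the same draw
second <- iris[sample(1:nrow(iris), 3, replace = FALSE), ]
```

With replace = FALSE, no row number can be drawn twice, so the three sampled rows are always distinct.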