This post follows Chapter 2 of Data Mining with R.

Reading in the data

Method 1: install the companion package for Data Mining with R.

install.packages('DMwR')
library(DMwR)  # load the package, which provides the algae data frame

Method 2: download the data as a txt file and read it in. See the previous post for the method.

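As a minimal sketch of Method 2, the snippet below writes a tiny stand-in file and reads it back with read.table(). The file contents, the column names, and the 'XXXXXXX' missing-value marker are assumptions made for illustration only; adjust them to match the file you actually downloaded.

```r
# Write a tiny stand-in data file (hypothetical contents, not the real data set).
tmp <- tempfile(fileext = ".txt")
writeLines(c("winter small medium 8.00 9.8",
             "spring small medium 8.35 XXXXXXX",
             "autumn large high XXXXXXX 11.4"), tmp)

# Read it in: no header row, '.' as the decimal mark, and an assumed
# 'XXXXXXX' marker for missing values.
demo <- read.table(tmp, header = FALSE, dec = ".",
                   col.names = c("season", "size", "speed", "mxPH", "mnO2"),
                   na.strings = "XXXXXXX",
                   stringsAsFactors = TRUE)

str(demo)              # text columns become factors, the rest numeric
colSums(is.na(demo))   # number of NAs in each column
```

Naming the columns at read time keeps later calls such as summary() readable, and declaring the missing-value marker up front means the numeric columns are not silently read as text.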
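summary() # returns a summary of the data: minimum, maximum, median, mean, and quartiles for numeric columns, level counts for factors, plus the number of NAs
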
summary(algae)
    season       size       speed         mxPH            mnO2
 autumn:40   large :45   high  :84   Min.   :5.600   Min.   : 1.500
 spring:53   medium:84   low   :33   1st Qu.:7.700   1st Qu.: 7.725
 summer:45   small :71   medium:83   Median :8.060   Median : 9.800
 winter:62                           Mean   :8.012   Mean   : 9.118
                                     3rd Qu.:8.400   3rd Qu.:10.800
                                     Max.   :9.700   Max.   :13.400
                                     NA's   :1       NA's   :2
       Cl               NO3              NH4                oPO4
 Min.   :  0.222   Min.   : 0.050   Min.   :    5.00   Min.   :  1.00
 1st Qu.: 10.981   1st Qu.: 1.296   1st Qu.:   38.33   1st Qu.: 15.70
 Median : 32.730   Median : 2.675   Median :  103.17   Median : 40.15
 Mean   : 43.636   Mean   : 3.282   Mean   :  501.30   Mean   : 73.59
 3rd Qu.: 57.824   3rd Qu.: 4.446   3rd Qu.:  226.95   3rd Qu.: 99.33
 Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :564.60
 NA's   :10        NA's   :2        NA's   :2          NA's   :2
      PO4              Chla               a1              a2
 Min.   :  1.00   Min.   :  0.200   Min.   : 0.00   Min.   : 0.000
 1st Qu.: 41.38   1st Qu.:  2.000   1st Qu.: 1.50   1st Qu.: 0.000
 Median :103.29   Median :  5.475   Median : 6.95   Median : 3.000
 Mean   :137.88   Mean   : 13.971   Mean   :16.92   Mean   : 7.458
 3rd Qu.:213.75   3rd Qu.: 18.308   3rd Qu.:24.80   3rd Qu.:11.375
 Max.   :771.60   Max.   :110.456   Max.   :89.80   Max.   :72.600
 NA's   :2        NA's   :12
       a3               a4               a5               a6
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000
 Median : 1.550   Median : 0.000   Median : 1.900   Median : 0.000
 Mean   : 4.309   Mean   : 1.992   Mean   : 5.064   Mean   : 5.964
 3rd Qu.: 4.925   3rd Qu.: 2.400   3rd Qu.: 7.500   3rd Qu.: 6.925
 Max.   :42.800   Max.   :44.600   Max.   :44.400   Max.   :77.600
       a7
 Min.   : 0.000
 1st Qu.: 0.000
 Median : 1.000
 Mean   : 2.495
 3rd Qu.: 2.400
 Max.   :31.600

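To make the output above easier to read, here is a small sketch that computes the same pieces by hand. The vector x and the factor f are made-up illustration data, not values from algae.

```r
# A small made-up numeric vector with one missing value.
x <- c(5.6, 7.7, 8.1, 8.4, 9.7, NA)

# The pieces that summary() reports for a numeric column:
quantile(x, na.rm = TRUE)   # min, 1st quartile, median, 3rd quartile, max
mean(x, na.rm = TRUE)       # mean of the non-missing values
sum(is.na(x))               # number of NAs

# For a factor, summary() reports how many times each level occurs.
f <- factor(c("small", "large", "small", "medium"))
table(f)
summary(f)
```

Note that quantile() and mean() need na.rm = TRUE to skip missing values, whereas summary() skips them itself and reports their count in a separate NA's row.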
hist() # draws a histogram of the data

hist(algae$mxPH, prob=T) # prob=T plots probability densities; the default plots frequency counts

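The difference between the two modes can be checked numerically. The sketch below uses randomly generated stand-in data (an assumption, not algae$mxPH) and computes the bins without drawing anything.

```r
set.seed(1)                          # reproducible made-up data
x <- rnorm(200, mean = 8, sd = 0.6)

h <- hist(x, plot = FALSE)           # compute the bins without drawing

h$counts    # bar heights used by the default (frequency) plot
h$density   # bar heights used when prob = TRUE

# With prob = TRUE the bar areas (height * bin width) add up to 1,
# which is why a density curve can be overlaid on the same axes.
sum(h$density * diff(h$breaks))
```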
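A more detailed display
> library(car) # load the car package
> par(mfrow=c(1,2)) # split the plotting area into two side-by-side panels; the next plot goes on the left
> hist(algae$mxPH, prob=T, xlab='',
+ main='Histogram of maximum pH value',ylim=0:1) # draw the histogram
> lines(density(algae$mxPH,na.rm=T)) # overlay the estimated probability density curve
> rug(jitter(algae$mxPH)) # draw the rug marks along the x-axis
> qqPlot(algae$mxPH,main='Normal QQ plot of maximum pH') # normal QQ plot in the right panel; qq.plot() was renamed qqPlot() in newer versions of car
> par(mfrow=c(1,1)) # restore the single-panel layout
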
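If the car package is not installed, a similar two-panel figure can be sketched with base R alone, using qqnorm() and qqline(). This is an alternative, not the book's code: it gives a plain normal QQ plot without the confidence envelope that car draws, and the data here is randomly generated for illustration.

```r
set.seed(1)
x <- c(rnorm(100, mean = 8, sd = 0.6), NA)   # made-up stand-in for algae$mxPH

op <- par(mfrow = c(1, 2))                   # two panels side by side
hist(x, prob = TRUE, xlab = "", main = "Histogram")
lines(density(x, na.rm = TRUE))              # smoothed density estimate
rug(jitter(x))                               # one tick per observation
qq <- qqnorm(x, main = "Normal QQ plot")     # sample vs. theoretical quantiles
qqline(x)                                    # reference line through the quartiles
par(op)                                      # restore the previous layout
```

Saving the old settings in op and restoring them with par(op) has the same effect as the final par(mfrow=c(1,1)) call, but it also works when the previous layout was something other than a single panel.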
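Drawing a box plot

boxplot(algae$oPO4,ylab='Orthophosphate (oPO4)') # draw the box plot; the upper whisker ends at the largest observation no greater than the third quartile + 1.5 * IQR, and the lower whisker at the smallest observation no less than the first quartile - 1.5 * IQR
For the definitions of quartile and interquartile range (IQR), see Baidu Baike: http://baike.baidu.com/link?url=v0bXCf9-Pg-1oC-v2JMzcjx7PzehHQ-iwhAIvS6G_Yg1v0x-XkRo_dqr7309MRam and http://baike.baidu.com/view/1376569.htm

rug(jitter(algae$oPO4),side=2) # draw the rug marks along the y-axis; jitter() adds a little noise so that overlapping values stay visible

abline(h=mean(algae$oPO4,na.rm=T),lty=2) # draw a dashed horizontal line at the mean
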
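The whisker rule can be reproduced by hand with boxplot.stats(), which returns the numbers that boxplot() draws. The vector below is made-up data chosen so that one value falls outside the limits. Note that boxplot.stats() works from hinges, which can differ slightly from the quartiles that quantile() returns.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up data with one extreme value

s <- boxplot.stats(x)   # the numbers boxplot() draws
s$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out                   # points drawn individually beyond the whiskers

# Rebuild the whisker limits by hand from the hinges.
hinges <- s$stats[c(2, 4)]
iqr    <- diff(hinges)
upper.fence <- hinges[2] + 1.5 * iqr
lower.fence <- hinges[1] - 1.5 * iqr

# The whiskers end at the most extreme observations inside the fences.
max(x[x <= upper.fence])
min(x[x >= lower.fence])
```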
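Finding outliers

plot(algae$NH4,xlab='') # plot the variable against the observation index

clicked.lines <- identify(algae$NH4) # mark outliers by hand: each click labels the point with its row number, and when you finish, the selected row numbers are stored in clicked.lines

algae[clicked.lines,] # show the rows of the marked outliers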
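identify() needs an interactive graphics device. When working in a script, the same rows can be picked out with a logical condition instead. The sketch below uses a made-up data frame and an assumed cutoff of 19000, both chosen only for illustration.

```r
# Made-up stand-in for the algae data frame (values are illustrative only).
demo <- data.frame(season = c("winter", "spring", "autumn", "summer"),
                   NH4    = c(38.3, 24064.0, NA, 226.9))

threshold <- 19000   # assumed cutoff, chosen only for this example

# NA > threshold is NA, and an NA index would return an all-NA row,
# so missing values are excluded explicitly.
demo[!is.na(demo$NH4) & demo$NH4 > threshold, ]

# which() drops the NAs on its own and returns the row numbers.
rows <- which(demo$NH4 > threshold)
demo[rows, ]
```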