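Today I went back through the R exercises from Head First Data Analysis, and R turned out to be quite interesting. It comes with very specialized statistics libraries, so computing things like a standard deviation or a regression slope takes very little code, and a grouped scatter plot I generated with R looked especially good. However, the generated images do not support drilling down into the data the way Excel charts do. I hope further study will answer the questions I have right now.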
Loading the data file
The R source file contains a single statement: employees <- read.csv("c:\\hfda_ch09_employees.csv",header=TRUE)
source("R source file path")
Help function
help(command) e.g. help(sd)
Standard deviation function (sd returns the standard deviation, not the variance)
sd(X)
[1] 2.432138
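As a quick check that sd() gives the standard deviation rather than the variance, the sketch below compares it with var() on a small made-up vector (the values are illustrative only and do not come from the book's data).

```r
# Made-up sample values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

s <- sd(x)    # sample standard deviation (n - 1 in the denominator)
v <- var(x)   # sample variance

# The standard deviation is the square root of the variance
print(all.equal(s, sqrt(v)))
```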
Summary function
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -1.800   4.600   5.500   6.028   6.700  25.900
1st Qu. means that about 25% of the observations are below this value
3rd Qu. means that about 75% of the observations are below this value
Median is the middle value
Mean is the average value
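The quartiles reported by summary() can also be requested directly with quantile(). A minimal sketch on made-up numbers (not the book's data):

```r
# Made-up values, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Ask for the 25%, 50% and 75% points explicitly
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# summary() reports the same quartiles alongside min, mean and max
s <- summary(x)
print(s)
```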
Histogram
hist(employees$received[employees$negotiated==TRUE],50) # with a filter condition
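The second positional argument of hist() is breaks. A single number there is treated as a suggested number of cells, and R rounds the breakpoints to "pretty" values, so the actual bin count may differ. The sketch below uses made-up data and a hypothetical filter, with plot=FALSE so that only the computed bins are returned.

```r
set.seed(42)                             # only so the example is repeatable
values <- rnorm(200, mean = 5, sd = 2)   # made-up data, not the book's
keep   <- values > 0                     # hypothetical filter condition

# breaks = 50 is a suggestion; plot = FALSE returns the bins without drawing
h <- hist(values[keep], breaks = 50, plot = FALSE)

print(length(h$counts))   # number of bins R actually used
print(sum(h$counts))      # every filtered value falls into some bin
```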
Scatter plot
plot(employees$requested[employees$negotiated==TRUE],employees$received[employees$negotiated==TRUE])
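The filter condition must be the same for both vectors (here employees$negotiated==TRUE); otherwise the x and y values no longer come from the same rows.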
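One way to keep the condition consistent is to compute the logical mask once and reuse it. This is a minimal sketch: the data frame below is a hypothetical stand-in for the employees file, and the name neg is made up for illustration.

```r
# Hypothetical stand-in for the employees data frame
employees <- data.frame(
  requested  = c(5, 8, 12, 3, 15, 7),
  received   = c(4.5, 7.9, 10.1, 0, 12.3, 0),
  negotiated = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Build the filter once so x and y are guaranteed to use the same rows
neg <- employees$negotiated == TRUE
x <- employees$requested[neg]
y <- employees$received[neg]

print(length(x) == length(y))
```

Calling plot(x, y) afterwards draws the same scatter plot as the one-line version.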
Correlation coefficient (cor returns the correlation coefficient r, not the slope)
cor(employees$requested[employees$negotiated==TRUE],employees$received[employees$negotiated==TRUE])
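The correlation coefficient and the regression slope are related but not the same number: for a simple linear regression, slope = r * sd(y) / sd(x). A small sketch on made-up values:

```r
# Made-up values, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

r <- cor(x, y)                                 # correlation coefficient
slope_from_r  <- r * sd(y) / sd(x)             # slope derived from r
slope_from_lm <- unname(coef(lm(y ~ x))[2])    # slope fitted by lm()

print(all.equal(slope_from_r, slope_from_lm))
```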
Computing the intercept and slope
> mylm <- lm(received[negotiated==TRUE]~requested[negotiated==TRUE],data=employees)
> mylm$coefficients
                  (Intercept) requested[negotiated == TRUE]
                    2.3121277                     0.7250664
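The same filter can also be passed through lm()'s subset argument, which keeps the formula short and gives the coefficient a plain name. This is an alternative spelling, not the book's version; the data frame below is a hypothetical stand-in for the employees file.

```r
# Hypothetical stand-in for the employees data frame
employees <- data.frame(
  requested  = c(5, 8, 12, 3, 15, 7),
  received   = c(4.5, 7.9, 10.1, 0, 12.3, 0),
  negotiated = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Filter written inside the formula, as in the notes above
fit_index <- lm(received[negotiated == TRUE] ~ requested[negotiated == TRUE],
                data = employees)

# Same filter expressed once through the subset argument
fit_subset <- lm(received ~ requested, data = employees,
                 subset = negotiated == TRUE)

# Both fits use the same rows, so the coefficients agree
print(all.equal(unname(coef(fit_index)), unname(coef(fit_subset))))
print(names(coef(fit_subset)))
```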
Computing the intercept and slope (multiple filter conditions)
Defining the lines
myLMBig <- lm(received[negotiated==TRUE & requested >10]~requested[negotiated==TRUE & requested >10],data=employees)
> myLMSmall <- lm(received[negotiated==TRUE & requested <=10]~requested[negotiated==TRUE & requested<=10],data=employees)
Computing the slopes and intercepts
> summary(myLMBig)$coefficients
                                                Estimate Std. Error  t value     Pr(>|t|)
(Intercept)                                     7.813403  1.8760371 4.164845 4.997597e-05   # intercept
requested[negotiated == TRUE & requested > 10]  0.302609  0.1420151 2.130824 3.457618e-02   # slope
> summary(myLMBig)$sigma
[1] 4.544424 (residual standard error)
> summary(myLMSmall)$coefficients
                                                  Estimate Std. Error   t value      Pr(>|t|)
(Intercept)                                      0.7933468 0.22472009  3.530378  4.378156e-04
requested[negotiated == TRUE & requested <= 10]  0.9424946 0.03151835 29.903041 6.588020e-134
> summary(myLMSmall)$sigma
[1] 1.374526
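The two fitted lines can be combined into one piecewise prediction. The sketch below copies the intercepts and slopes from the two coefficient tables above. The function name predict_received is made up for illustration, and the split at requested > 10 follows the filter used when fitting.

```r
# Hypothetical helper: coefficients copied from the two summaries above
predict_received <- function(requested) {
  ifelse(requested > 10,
         7.813403  + 0.302609  * requested,   # myLMBig line (requested > 10)
         0.7933468 + 0.9424946 * requested)   # myLMSmall line (requested <= 10)
}

small <- predict_received(5)    # uses the myLMSmall coefficients
big   <- predict_received(20)   # uses the myLMBig coefficients
print(c(small, big))
```

The sigma values give a rough sense of how far actual values scatter around each line, so predictions for requests above 10 are noticeably less precise.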
Scatter plot
dispath2 <- read.csv("dispatch analysis.csv",header=TRUE)
plot(Sales~jitter(Article.count),data=dispath2) # jitter adds a small amount of random noise so that overlapping points separate, which makes the plot easier to read
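To see what jitter() does to the data itself, the sketch below applies it to a made-up vector with repeated values and measures how far each value moved. The numbers are illustrative only.

```r
set.seed(1)                        # only so the example is repeatable
counts <- c(3, 3, 3, 5, 5, 8)      # made-up counts with ties

jittered <- jitter(counts)         # each value is nudged by a small random amount

print(jittered)
print(max(abs(jittered - counts))) # largest shift applied to any point
```

Because the shifts are small relative to the gaps between distinct values, the overall shape of the plot is preserved while tied points stop hiding each other.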
Grouped scatter plot
> articleHitsComments <- read.csv("hfda_ch12_articleHitsComments.csv",header=TRUE)
> library(lattice) # load the lattice package
> head(articleHitsComments,10)
   articleID      authorName webHits commentCount
1          1   Destiny Adams    2019           14
2          2 Jon Radermacher    1421            6
3          3     Matt Janney    1174            8
4          4     Matt Janney    1613           26
5          5    Paul Semenec    1099           10
6          6   Destiny Adams    1903           26
7          7      Nicole Fry    1718           21
8          8  Jason Wightman     642            8
9          9 Jon Radermacher    1616            7
10        10     Matt Janney    1233           12
> xyplot(webHits~commentCount | authorName, data=articleHitsComments) # "|" is the conditioning (grouping) operator; here the plot is split into one panel per authorName
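A self-contained version of the grouped plot, assuming the lattice package is available (it normally ships with R as a recommended package). The data frame and author names below are made up. The plot object is only constructed here; printing it is what draws the panels.

```r
library(lattice)

# Made-up stand-in for the article data
articles <- data.frame(
  authorName   = c("Author A", "Author A", "Author B", "Author B"),
  webHits      = c(1200, 1800, 900, 1500),
  commentCount = c(5, 12, 3, 9)
)

# One panel per author: y ~ x | grouping variable
p <- xyplot(webHits ~ commentCount | authorName, data = articles)

# print(p) would draw the panels on the active graphics device
print(class(p))
```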
Data cleaning
Using regular expressions
hfhh <- read.csv("hfda_ch13_data_for_R.csv",header=TRUE)
NewLastName <- sub("\\(.*\\)","",hfhh$LastName)
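The pattern "\\(.*\\)" matches a literal opening parenthesis, everything after it, and a closing parenthesis, and sub() replaces that match with an empty string. The sketch below tries it on made-up names; the exact format of the LastName column in the book's file is an assumption here.

```r
# Made-up names, for illustration only
last_names <- c("Smith (ID 12)", "Jones", "Lee (temp)")

# Remove the parenthesized part; names without parentheses are left unchanged
cleaned <- sub("\\(.*\\)", "", last_names)
print(cleaned)

# Any space that preceded the parenthesis is still there, so trim it
trimmed <- trimws(cleaned)
print(trimmed)
```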
Sorting
hfhhSorted <- hfhh[order(hfhh$PersonID,decreasing=FALSE),]
Removing duplicates
hfhhNameOnly <- unique(hfhhNameOnly)
Deleting unneeded columns
> hfhhNameOnly$CallID <-NULL
> hfhhNameOnly$Time <-NULL
Writing a CSV file
write.csv(hfhhNameOnly,file="output from R.csv")
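The cleaning steps can be chained on a small made-up data frame to see how they interact. Everything here is a sketch under assumptions: the column names mirror the ones used above, the values and variable names are hypothetical, and the output goes to a temporary file instead of "output from R.csv".

```r
# Made-up call log with two people and two calls each
calls <- data.frame(
  PersonID = c(2, 1, 2, 1),
  LastName = c("Lee", "Smith", "Lee", "Smith"),
  CallID   = c(10, 11, 12, 13),
  Time     = c("09:00", "09:05", "09:10", "09:15")
)

# Sort by PersonID in ascending order
sorted <- calls[order(calls$PersonID, decreasing = FALSE), ]

# Work on a copy, then drop the per-call columns
names_only <- sorted
names_only$CallID <- NULL
names_only$Time   <- NULL

# Deduplicate after dropping columns: rows that differed only in
# CallID or Time are now identical and collapse to one row each
names_only <- unique(names_only)

# row.names = FALSE keeps R's row numbers out of the file
out <- tempfile(fileext = ".csv")
write.csv(names_only, file = out, row.names = FALSE)

read_back <- read.csv(out, header = TRUE)
print(read_back)
```

The order matters: calling unique() before the CallID and Time columns are removed would keep every row, because each call has a distinct CallID.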
Assignment (copies the sorted data frame to the name used in the steps above)
hfhhNameOnly <- hfhhSorted