Data science note 2

Latest recommended article published on 2022-12-01 17:23:02

upright man

Category: short-term data science course. Tags: R, ggplot2

Article link: https://blog.csdn.net/uprightman_/article/details/107050019

Summaries and Indices

When summarizing data we may focus on different aspects: center, spread, asymmetry.
Measures of center: mean, mode, median.
Some further options: weighted average, trimmed mean.

Properties of an Average

Average: a number $\bar{x}$ that substitutes for the entire data $x_1,x_2,...,x_n$ of one variable.
Properties: internality, monotonicity, symmetry, associativity.

It makes sense to consider the average $\bar{x}$ such that
$f(x_1,x_2,...,x_n)=f(\bar{x},\bar{x},...,\bar{x})$

We can substitute the single number $\bar{x}$ for all the values of the variable and get the same result.
Different functions f are desirable for different purposes and lead to different averages (arithmetic mean, geometric mean).

the average interest rate

$\text{income}(r_1,r_2,...,r_{12})=\text{income}(\bar{r},\bar{r},...,\bar{r})$
$(1+r_1)(1+r_2)...(1+r_{12})=(1+\bar{r})^{12}$
$\bar{r}=(\prod^{12}_{i=1}(1+r_i))^{\frac1{12}}-1$

The geometric mean of $(1+r_i)$ is a meaningful summary.
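The interest-rate average can be checked numerically. The sketch below uses made-up monthly rates (the vector `r` is purely illustrative, not data from these notes) and confirms that a constant rate `r_bar` reproduces the same final income.

```r
# Hypothetical monthly interest rates (illustrative values only)
r <- c(0.010, 0.020, -0.005, 0.015, 0.000, 0.010,
       0.030, -0.010, 0.020, 0.005, 0.010, 0.015)

# Total growth factor over the 12 months
growth <- prod(1 + r)

# Average rate: geometric mean of (1 + r_i), minus 1
r_bar <- growth^(1 / length(r)) - 1

# Defining property: a constant rate r_bar gives the same final income
all.equal((1 + r_bar)^length(r), growth)

# The arithmetic mean of the rates is a different number in general
c(geometric = r_bar, arithmetic = mean(r))
```

By the inequality between geometric and arithmetic means, `r_bar` can never exceed `mean(r)`.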

average speed

$Time=\sum^n_{i=1}\frac{d_i}{v_i}=\sum^n_{i=1}\frac{d_i}{\bar{v}}$

$\bar{v}=\frac{\sum^n_{i=1}d_i}{\sum^n_{i=1}\frac{d_i}{v_i}}$
We find that the harmonic mean (here with weights $d_i$) is a meaningful summary.
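A small numerical sketch of the average speed, with hypothetical distances and speeds chosen only for illustration: two legs of 30 km each, driven at 30 km/h and 60 km/h.

```r
# Hypothetical trip: distances (km) and speeds (km/h) for each leg
d <- c(30, 30)
v <- c(30, 60)

# Total travel time is the sum of the leg times
total_time <- sum(d / v)

# Average speed: distance-weighted harmonic mean of the speeds
v_bar <- sum(d) / sum(d / v)

# Driving the whole trip at v_bar takes the same time
all.equal(sum(d / v_bar), total_time)

# The arithmetic mean of the speeds would overstate the average speed
c(harmonic = v_bar, arithmetic = mean(v))
```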

another approach to defining an average

Let $g(z,x_1,x_2,...,x_n)$ be a function that describes the loss we incur when substituting $x_1,x_2,...,x_n$ with z. Then $\text{average}(x_1,x_2,...,x_n)=\bar{x}=\arg\min_z g(z,x_1,...,x_n)$

different loss functions

This approach to defining an average says that the average is our “best guess” for a value in the data, and the different loss functions specify how we evaluate the goodness of a guess.
The square loss penalizes large discrepancies more heavily and down-weights small ones.
The absolute loss takes all discrepancies at face value.
The 0-1 loss treats all discrepancies as the same, with the exception of no error.

square loss: arithmetic mean
absolute loss: median
0-1 loss: mode [to find the z that minimizes this loss we look for the z such that the number of $x_i$ equal to z is maximal; this is called the mode, and it is useful as a measure of the “center” for qualitative variables]
$MAD=\frac1n\sum_{i=1}^n|x_i-M|,\quad M:\text{median}$
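The link between loss functions and averages can be illustrated by brute force. This is a minimal sketch on a hypothetical vector `x`: it minimizes each loss over a grid of candidate values and compares the minimizer with the mean, median and mode. The grid search is just an illustrative device, not an efficient method.

```r
# Hypothetical data with one large value
x <- c(1, 2, 2, 3, 10)

# Loss incurred when z replaces every observation
square_loss   <- function(z) sum((x - z)^2)
absolute_loss <- function(z) sum(abs(x - z))
zero_one_loss <- function(z) sum(x != z)

# Minimise the first two losses over a fine grid of candidate values
z_grid     <- seq(min(x), max(x), by = 0.01)
z_square   <- z_grid[which.min(sapply(z_grid, square_loss))]
z_absolute <- z_grid[which.min(sapply(z_grid, absolute_loss))]

# For the 0-1 loss only observed values can match any data point
candidates <- unique(x)
z_zero_one <- candidates[which.min(sapply(candidates, zero_one_loss))]

c(z_square, mean(x))      # minimiser of square loss vs arithmetic mean
c(z_absolute, median(x))  # minimiser of absolute loss vs median
z_zero_one                # minimiser of 0-1 loss: the mode

# Mean absolute deviation from the median, as in the MAD formula
M   <- median(x)
MAD <- mean(abs(x - M))
```

Note that base R's `mad()` computes a scaled median absolute deviation, which is a different quantity from the mean absolute deviation defined here.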

measures of spread/dispersion/variability

Variance: average squared distance from the mean
$V(x_1,...,x_n)=\frac1n\sum^n_{i=1}(x_i-\bar{x})^2$

Standard deviation: $\sqrt{\frac1n\sum^n_{i=1}(x_i-\bar{x})^2}$
Note that R's var() and sd() actually divide by n-1 rather than n.
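The difference between the two divisors is easy to see on a small example. The data vector below is hypothetical; `pop_var` and `pop_sd` are illustrative names for the divide-by-n versions defined in the formulas.

```r
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Variance and standard deviation with divisor n, as in the formulas
pop_var <- mean((x - mean(x))^2)
pop_sd  <- sqrt(pop_var)

# R's built-in var() uses divisor n - 1
r_var <- var(x)

# Rescaling var() by (n - 1) / n recovers the divisor-n variance
all.equal(r_var * (n - 1) / n, pop_var)
```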

Often data is summarized so that we have counts of occurrences of the same values: we have a set $v_1,v_2,...,v_m$ of possible values, with their frequencies $f_i$.
Calculating averages and standard deviations has to adapt to this different set-up:
$\bar{v}=\frac1{\sum^m_{i=1}f_i}\sum^m_{i=1}v_if_i$
$Variance=\frac1{\sum^m_{i=1}f_i}\sum^m_{i=1}(v_i-\bar{v})^2f_i$
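As a sketch of the frequency-table formulas, the example below uses a made-up table (`v` and `f` are hypothetical) and checks the results against the expanded raw data.

```r
# Hypothetical frequency table: distinct values and their counts
v <- c(0, 1, 2, 3)
f <- c(5, 10, 3, 2)

# Mean and variance computed from the table
v_bar <- sum(v * f) / sum(f)
v_var <- sum((v - v_bar)^2 * f) / sum(f)

# Expand the table back into raw observations and compare
x <- rep(v, times = f)
all.equal(v_bar, mean(x))
all.equal(v_var, mean((x - mean(x))^2))

# weighted.mean() gives the same average directly
weighted.mean(v, f)
```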

Consider some restrictions that make the question of how large the variance can be meaningful:
$x_i\geq 0$, with the total sum of values fixed, $\sum^n_{i=1}x_i=n\bar{x}$

$\sum^n_{i=1}(x_i-\bar{x})^2=\sum^n_{i=1}x_i^2-n\bar{x}^2$
Since the $x_i$ are non-negative, $\sum^n_{i=1}x_i^2\leq(\sum^n_{i=1}x_i)^2$, so
$\sum^n_{i=1}x_i^2-n\bar{x}^2\leq(\sum^n_{i=1}x_i)^2-n\bar{x}^2=n(n-1)\bar{x}^2$
$V(x_1,...,x_n)\leq(n-1)\bar{x}^2$
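The bound can be checked numerically. In this sketch the total and the sample size are arbitrary choices; the bound is attained when a single observation holds the whole total, and any other non-negative split of the same total stays below it.

```r
# Divisor-n variance, as defined in these notes
pop_var <- function(x) mean((x - mean(x))^2)

# Hypothetical set-up: n non-negative values with a fixed total
n     <- 5
total <- 100
x_bar <- total / n
bound <- (n - 1) * x_bar^2

# Extreme case: one observation carries the whole total
extreme <- c(rep(0, n - 1), total)
pop_var(extreme)

# A random non-negative split of the same total
set.seed(1)
w <- runif(n)
x <- total * w / sum(w)
pop_var(x) <= bound
```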

measure income inequality

$x_1\leq x_2 \leq ... \leq x_n$
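We now calculate two quantities, the cumulative share of units and the cumulative share of income:
$F_i=\frac in ,Q_i=\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j}$
In general $Q_i\leq F_i$, because
$\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} \leq \frac in$
is equivalent to
$\frac{\sum^i_{j=1}x_j}i \leq \frac{\sum^n_{j=1}x_j}n$
that is, the average of the i smallest values is at most the overall average.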

Area under the bottom (Lorenz) curve: sum of the areas of trapezoids. Thus, with $F_0=Q_0=0$, the area between the equality line and the curve is
$A=\frac12-\sum^n_{i=1}\frac{(F_i-F_{i-1})(Q_i+Q_{i-1})}2$
Gini's index:
$G=\frac{A}{1/2}=1-\sum^n_{i=1}(F_i-F_{i-1})(Q_i+Q_{i-1})$

We can also calculate the following summary of “mutual variability” (with $n_i$ the frequency of the value $x_i$):
$\Delta=\sum^k_{i=1}\sum^k_{j=1}|x_i-x_j|\frac {n_i}n\frac{n_j}n$
One can show that
$G=\frac{\Delta}{2\bar{x}}$
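The two routes to the Gini index can be compared on a small data set. This sketch reuses the example vector from the Lorenz code in these notes, treats every observation as having frequency 1, and pairs each $Q_i$ with its predecessor $Q_{i-1}$ (with $F_0=Q_0=0$) in the trapezoid sum. The names `gini_trapezoid` and `gini_delta` are illustrative.

```r
# Incomes sorted in increasing order
x <- sort(c(1, 2, 3, 10, 15, 15, 30, 50))
n <- length(x)

# Lorenz curve points, with the origin prepended
F <- c(0, (1:n) / n)
Q <- c(0, cumsum(x) / sum(x))

# Route 1: one minus twice the area under the Lorenz curve (trapezoids)
last <- length(F)
gini_trapezoid <- 1 - sum((F[-1] - F[-last]) * (Q[-1] + Q[-last]))

# Route 2: mean absolute difference over all pairs, divided by twice the mean
delta      <- sum(abs(outer(x, x, "-"))) / n^2
gini_delta <- delta / (2 * mean(x))

# Both routes should agree
all.equal(gini_trapezoid, gini_delta)
```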

An Index of Diversity

$D=1-\sum_{i=1}^mp_i^2$
D is the probability that, if you capture two fish, they are not of the same kind, where $p_i$ is the proportion of kind i.
$p_1+...+p_m=1$
If $p_1=p_2=...=p_m=\frac1m$, then
$D_1=1-\frac1m$
so the diversity $1-\frac1m$ is larger the larger m is. For example, with k equally frequent kinds and m>k:
$D_2=1-\frac1k<D_1$
This index is known as the Gini or Simpson's diversity index (be careful: there are actually multiple versions of the Simpson index).
There are other measures of diversity, for example Shannon's index, which is based on entropy:
$H=-\sum^m_{i=1}p_i\log p_i$
When analysing data on the frequency of different alleles in genetics, $D=1-\sum_{i=1}^mp_i^2$ is preferred.

This is because it has a very easy genetic interpretation: it represents the probability of a heterozygous genotype.
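Both indices are short to compute. The sketch below defines two illustrative helper functions, `simpson()` and `shannon()` (names chosen here, not taken from any package), and applies them to hypothetical proportion vectors.

```r
# Simpson (Gini) diversity: probability that two draws differ
simpson <- function(p) 1 - sum(p^2)

# Shannon index: entropy of the proportions (zero proportions contribute 0)
shannon <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Hypothetical proportion vectors
p_even5  <- rep(1 / 5, 5)                      # 5 equally frequent kinds
p_even3  <- rep(1 / 3, 3)                      # 3 equally frequent kinds
p_uneven <- c(0.90, 0.05, 0.03, 0.01, 0.01)    # 5 kinds, one dominant

# More equally frequent kinds means more diversity
c(simpson(p_even3), simpson(p_even5))

# With the same number of kinds, uneven proportions are less diverse
c(simpson(p_uneven), simpson(p_even5))
c(shannon(p_uneven), shannon(p_even5))
```

For m equally frequent kinds the Shannon index equals log(m), so both indices grow with m in that case.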

Rstudio

robust

sample=runif(20,2,5)
sample_mean=mean(sample)
#how many numbers in the sample do we need to change to make the sample mean equal to 4?
sample1=rnorm(30,0,1)
sample1
sample1_mean=mean(sample1)
sample1_mean
#how many numbers in the sample do we need to change to make the sample mean equal to 2?

One number is enough, so the mean is not robust at all.

sample_median=median(sample)
sort(sample)
#how many numbers in the sample do we need to change to make the sample median equal to 4?

The median is more robust than the mean.
mean(sample,trim=0)
mean(sample,trim=0.1) #the highest 10% and the lowest 10% are trimmed
mean(sample,trim=0.2) #20% is trimmed from each end
mean(sample,trim=0.5) #50% is trimmed from each end, which gives the median
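The robustness questions above can be answered with a short experiment. This is a sketch under assumed settings (the seed, the target value 4 and the variable names are arbitrary): changing a single observation is enough to move the mean anywhere, while the median barely reacts to one wild value.

```r
set.seed(42)
x      <- runif(20, 2, 5)
n      <- length(x)
target <- 4

# Change one observation so that the mean hits the target exactly
x_changed    <- x
x_changed[1] <- x[1] + n * (target - mean(x))
mean(x_changed)

# Replace one observation by a huge value: the median stays within
# the range of the original data
x_outlier    <- x
x_outlier[1] <- 1e6
c(median(x), median(x_outlier))

# Trimming 50% from each end reduces the trimmed mean to the median
all.equal(mean(x, trim = 0.5), median(x))
```

To move the median of 20 values above the largest original value, roughly half of the observations would have to be changed, which is why it is considered more robust.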

lorenz

x=c(1,2,3,10,15,15,30,50)
n=length(x)
F=(1:n)/n
Q=cumsum(x)/sum(x)
F=c(0,F)
Q=c(0,Q)
lorenz=data.frame(F,Q)
library(ggplot2)
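g=ggplot(lorenz,aes(x=F))+geom_line(aes(x=F,y=Q),color=I("blue"))+
geom_point(aes(x=F,y=Q),color=I("blue"))+
geom_abline(intercept=0,slope=1)
F=c(0,F) and Q=c(0,Q) prepend 0 to the data so that the curve passes through the origin. lorenz=data.frame(F,Q) puts the two vectors into one data frame so that they can be plotted against each other.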

#perfect equality
x=c(1,2,3,10,15,15,30,50)
tot=sum(x)
n=length(x)
F=(1:n)/n
x=rep(tot/8,8)
Q=cumsum(x)/sum(x)
F=c(0,F)
Q=c(0,Q)
lorenz=data.frame(F,Q)
library(ggplot2)
p1=ggplot

#Shading the region between the Lorenz curve and the equality line
ggplot(lorenz,aes(x=F))+
geom_line(aes(x=F,y=Q),color=I("blue")) +
geom_point(aes(x=F,y=Q),color=I("blue")) +
geom_abline(intercept=0, slope=1) +
geom_ribbon(aes(ymin = Q, ymax = F),fill="cyan",alpha=0.5)
geom_ribbon(aes(ymin = Q, ymax = F),fill="cyan",alpha=0.5) fills a region: aes(ymin = Q, ymax = F) gives its range, fill="cyan" is the fill colour, and alpha is the transparency.


#load some income data
library(readr)
library(dplyr)
library(ggplot2)
income <- read_csv("D:\\专业英文拓展\\hinc06.csv", na=c("(B)"),
col_names = FALSE, skip = 9,col_types = cols(X2 = col_number(),X3 = col_number(),X4 =col_number(), X5 = col_number(),X6 = col_number(),X7 = col_number(),X8 = col_number(), X9 = col_number(),X10 = col_number(),X11 = col_number(), X12 = col_number(),X13 = col_number(),X14 = col_number(), X15 = col_number(),X16 = col_number(),X17 = col_number(),X18 = col_number(),
X19 = col_number(),X20 = col_number(),X21 = col_number(),X22 = col_number(),X23 = col_number(),X24 = col_number(), X25 = col_number(),X26 = col_number(), X27 = col_number(), X28 = col_number()))
#replace missing values in every third column (X3, X6, ..., X27) with the X3 value of the same row
for(i in 1:9)
{
col=paste("X",as.character(3*i),sep="")
missing=is.na(income[[col]])
income[[col]][missing]<-income$X3[missing]
}
income <- rename(income, Bracket=X1)
income <- rename(income, Frequency=X2)
income <-rename(income, Income=X3)
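In read_csv("D:\\专业英文拓展\\hinc06.csv", na=c("(B)"), ...), backslashes in the file name must be doubled (\\); skip = 9 skips the first 9 lines; col_names = FALSE means the file has no column names; na=c("(B)") treats "(B)" as a missing value; X2 = col_number() sets the type of a column to numeric. The for loop replaces the missing values in every third column with the value from column X3 of the same row. rename(income, Bracket=X1) renames a column.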

p <- ggplot() +
geom_point(data = income, aes(x = Income, y = Frequency/sum(Frequency), colour = "All"),pch=1) +
geom_point(data = income, aes(x = X9, y = X8/sum(X8), colour = "White")) +
geom_point(data = income, aes(x = X18, y = X17/sum(X17), colour = "Black")) +
geom_point(data = income, aes(x = X24, y = X23/sum(X23), colour = "Asian")) +
geom_point(data = income, aes(x = X27, y = X26/sum(X26), colour = "Hispanic"))+
scale_colour_manual("",breaks = c("All", "White", "Black","Asian","Hispanic"), values = c("All"="black", "White"="blue", "Black"="green", "Asian"="yellow","Hispanic"="red"))+
scale_y_continuous(name="Proportion in Income Bracket")+
scale_x_continuous(name="Average Income in Bracket")+ggtitle("US 2015 household income, CPS")
p
With colour = "All" inside aes(), a colour is assigned automatically and the legend label shows that this colour stands for "All". pch sets the point shape; pch=1 is a hollow circle.
Usage of scale_colour_manual:
p + scale_colour_manual(values = c("red","blue", "green")) #assign colours manually
p + scale_colour_manual(values = c("8" = "red","4" ="blue","6" = "green")) #assign colours by level
cols <- c("8" = "red","4" = "blue","6" = "darkgreen", "10"= "orange") #define your own colour vector
p + scale_colour_manual(values = cols)
p + scale_colour_manual(values = cols, breaks = c("8", "6","4")) #controls the order of the legend
p + scale_colour_manual(values = cols, breaks = c("4", "6","8"),labels =c("four", "six", "eight")) #controls the legend labels
p + scale_colour_manual(values = cols, limits = c("4", "8")) #controls which legend entries are shown
#Notice that the values are matched with limits, and not breaks
p + scale_colour_manual(limits = c(6, 8, 4), breaks = c(8, 4,6),values =c("grey50", "grey80", "black"))
scale_x_continuous(name="") sets the x-axis label.

#creating lorenz curve
#Creating Lorenz Curve for all races together
F <- cumsum(income$Frequency)/sum(income$Frequency)
Q <- cumsum(income$Income*income$Frequency)/sum(income$Income*income$Frequency)
lorenz <- data.frame(c(0,F),c(0,Q))
#Creating Lorenz Curve for Whites
F <- cumsum(income$X8)/sum(income$X8)
Q <- cumsum(income$X9*income$X8)/sum(income$X9*income$X8)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenz Curve for Blacks
F <- cumsum(income$X17)/sum(income$X17)
Q <- cumsum(income$X18*income$X17)/sum(income$X18*income$X17)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenz Curve for Asians
F <- cumsum(income$X23)/sum(income$X23)
Q <- cumsum(income$X24*income$X23)/sum(income$X24*income$X23)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenz Curve for Hispanics
F <- cumsum(income$X26)/sum(income$X26)
Q <- cumsum(income$X27*income$X26)/sum(income$X27*income$X26)
lorenz<-data.frame(lorenz,c(0,F),c(0,Q))
names(lorenz)<-c("F","Q","F.w","Q.w","F.b","Q.b","F.a","Q.a","F.h","Q.h")
lorenz

names(lorenz) assigns a name to each column of the data frame.

#Compute Gini Index

#Define a new function to compute Gini Index
mygini=function(F,Q){
n=length(F)
#sum over adjacent pairs: (F_i - F_{i-1}) * (Q_i + Q_{i-1})
gini=1-sum((F[-1]-F[-n])*(Q[-1]+Q[-n]))
return(gini)
}
GiniL=c(mygini(lorenz$F,lorenz$Q),mygini(lorenz$F.w,lorenz$Q.w),mygini(lorenz$F.b,lorenz$Q.b),mygini(lorenz$F.a,lorenz$Q.a),mygini(lorenz$F.h,lorenz$Q.h))
Gini=data.frame(c("All","White","Black","Asian","Hispanic"),GiniL)
names(Gini)=c("Race","Index")
Gini

#Make a plot
ggplot(Gini)+ geom_point(aes(x=Race,y=Index)) +
coord_flip() +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(linetype=3, color="darkgray"))+
ylab("Gini Index") +
ggtitle("2015 US household income, CPS")
In sum((F[-1]-F[-n])*(Q[-1]+Q[-n])), F[-1] is the vector without its first element and F[-n] is the vector without its n-th element, so the expression combines each entry with its neighbour. Gini=data.frame(c("All","White","Black","Asian","Hispanic"),GiniL) builds a two-column data frame, and names(Gini)=c("Race","Index") names the two columns. coord_flip() swaps the x and y axes. In panel.grid.major.y = element_line(linetype=3, color="darkgray"), linetype=3 is a dotted line.


#Load the data on genetic diversity
diversitycodes <- read_delim("D:\\专业英文拓展\\diversitycodes", " ", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
apply(diversitycodes,2,unique) # unique values in each column
names(diversitycodes)<-c("Code","Population","Location","Continent")
#Read the frequency data
diversitydata <- read_delim("D:\\专业英文拓展\\diversitydata.freqs",
" ", escape_double = FALSE, col_names = FALSE,
col_types = cols(X2 = col_character(),
X4 = col_double()), trim_ws = TRUE)
names(diversitydata)<-c("Marker","Allele","Population","Frequency")
filter(diversitydata, Marker == "D10S1208", Allele== "167")
MARKER<-unique(diversitydata$Marker)
POP<-unique(diversitydata$Population)
#compute the diversity index
Diversity=matrix(NA,ncol=377,nrow=52)
for(i in 1:52)
{for(j in 1:377)
{Diversity[i,j]=1-sum(diversitydata[diversitydata$Marker==MARKER[j]&diversitydata$Population==POP[i],4]^2)
}
}
colnames(Diversity)=MARKER
rownames(Diversity)=POP
markdiv<-(apply(Diversity,2,mean))
markdiv<-sort(markdiv)
hist(markdiv,xlab="Diversity",
main="Average Marker diversity across populations")

trim_ws = TRUE removes leading and trailing whitespace. In apply(diversitycodes,2,unique), the 2 means the function is applied column by column. names(diversitycodes) names each column.


Diversity<-data.frame(POP,apply(Diversity,1,mean),Diversity)
names(Diversity)<-c("Population","Average",MARKER)
Diversity<-inner_join(diversitycodes,Diversity,by=c("Population"))
sort(unique(Diversity$Continent))
contcol<-data.frame(sort(unique(Diversity$Continent)),c("green","red","orange","yellow","azure","beige","blue"))
names(contcol)=c("Continent","Color")
rownames(contcol)<-contcol$Continent
Diversity<-inner_join(Diversity,contcol,by=c("Continent"))

Diversity<-Diversity[order(Diversity$Continent,Diversity$Location),]

par(mar=c(8,4,4,2))
barplot(Diversity$Average,names=Diversity$Population,col=as.character(Diversity$Color),cex.lab=.3,
las=3,main="Average diversity")
contdiv<-tapply(Diversity$Average,Diversity$Continent,mean)
The first line combines three parts into one data frame (the population names, the average diversity of each population, and the diversity matrix), and the second line names the columns. inner_join joins the two tables on the Population values they share. sort returns the sorted vector, whereas order() returns the positions of the elements in sorted order. mar: a numeric vector giving the margin sizes in the order bottom, left, top, right, measured in lines of text; the default is c(5, 4, 4, 2)+0.1. cex.lab changes the font size of the axis titles. las: orientation of the tick labels; las=3 makes them vertical.


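par(mar=c(5,15,4,2))
barplot(contdiv[order(contdiv)],col=as.character(contcol[names(contdiv[order(contdiv)]),2]),
main="Continental Diversity",horiz = TRUE,las=1)

The bars are drawn in sorted order. The las parameter sets the orientation of the axis tick labels: 0 means always parallel to the axis, 1 means always horizontal, 2 means always perpendicular to the axis.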

