Data science note 2

Summaries and Indices

when summarizing we may choose different aspects:center,spread,asymmetry
mean,mode(众数),median
some further options:weighted average,trimmed mean

Properties of an Average

average:a number x ˉ \bar{x} xˉ that substitutes the entire data x1,x2,…xn for one variable
Internality.monotonicity.symmetry.associativity

It makes sense to consider the average x ˉ \bar{x} xˉ such that
f ( x 1 , x 2 , . . . , x n ) = f ( x ˉ , x ˉ , . . . , x ˉ ) f(x_1,x_2,...,x_n)=f(\bar{x},\bar{x},...,\bar{x}) f(x1,x2,...,xn)=f(xˉ,xˉ,...,xˉ)

we can substitute the one number x ˉ \bar{x} xˉ in place of all the values for the variable and get the same result.
different functions f are disirable based on various purpose and lead to different averages(arithmetic mean, geometric mean)

the average interest rate

income ( r 1 , r 2 , . . . , r 12 ) (r_1,r_2,...,r_{12}) (r1,r2,...,r12)=income ( r ˉ , r ˉ , . . . , r ˉ ) (\bar{r},\bar{r},...,\bar{r}) (rˉ,rˉ,...,rˉ)
( 1 + r 1 ) ( 1 + r 2 ) . . . ( 1 + r 12 ) = ( 1 + r ˉ ) 12 (1+r_1)(1+r_2)...(1+r_{12})=(1+\bar{r})^{12} (1+r1)(1+r2)...(1+r12)=(1+rˉ)12
r ˉ = ( ∏ i = 1 12 ( 1 + r i ) ) 1 12 − 1 \bar{r}=(\prod^{12}_{i=1}(1+r_i))^{\frac1{12}}-1 rˉ=(i=112(1+ri))1211

the geometic mean of (1+ri) is a meaningful summary.

average speed

T i m e = ∑ i = 1 n d i v i = ∑ i = 1 n d i v ˉ Time=\sum^n_{i=1}\frac{d_i}{v_i}=\sum^n_{i=1}\frac{d_i}{\bar{v}} Time=i=1nvidi=i=1nvˉdi

v ˉ = ∑ i = 1 n d i ∑ i = 1 n d i v i \bar{v}=\frac{\sum^n_{i=1}d_i}{\sum^n_{i=1}\frac{d_i}{v_i}} vˉ=i=1nvidii=1ndi
we find that the harmonic mean(here with weights di)is a meaningful summary

another approach to defining an average

let g(z,x1,x2,…,xn) be a function that describes the loss we incur when substituing x1,x2,…,xn with z,then a v e r a g e ( x 1 , x 2 , . . . , x n ) = x ˉ = arg ⁡ min ⁡ z g ( z , x 1 , . . . , x n ) average(x_1,x_2,...,x_n)=\bar{x}=\arg\min_zg(z,x_1,...,x_n) average(x1,x2,...,xn)=xˉ=argzming(z,x1,...,xn)

different loss functions

this approach to defining average says that the average is our “best guess” for a value in the data,and the different loss functions specify how we evaluate the goodness of a guess
the square loss penalizes more large discrepancies and down weighs small discrepancies
the absolute loss considers all discrepancies at their face values
the 0-1 loss consider all discrepancies to be the same ,with the exception of no error

square loss:arithmetic mean
absolute loss:medium
0-1 loss:mode [to find the z that minimizes this loss we need to look for the z such that the number of xi equal to z is maximal ;this is called the mode and it is useful as a measure of the “center” for qualitatibe variables]
M A D = 1 n ∑ i = 1 n ∣ x − M ∣ , M : m e d i a n MAD=\frac1n\sum_{i=1}^n|x-M|,M:median MAD=n1i=1nxM,M:median

measures of spread/dispersion/variability

variance:average square distance from the mean
V ( x 1 , . . . , x n ) = 1 n ∑ i = 1 n ( x i − x ˉ ) 2 V(x_1,...,x_n)=\frac1n\sum^n_{i=1}(x_i-\bar{x})^2 V(x1,...,xn)=n1i=1n(xixˉ)2

standad deviation: 1 n ∑ i = 1 n ( x i − x ˉ ) 2 \sqrt{\frac1n\sum^n_{i=1}(x_i-\bar{x})^2} n1i=1n(xixˉ)2
note that R actually divides by n-1 rather than n.

often data is summarized so that we have counts of occurrences of the same values:we have a set v1,v2,…,vm of possible values,with their frequencies fi
calculating averages and standard deviations has to adapt to this different set -up
v ˉ = 1 ∑ i = 1 m f i ∑ i = 1 m v i f i \bar{v}=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}v_if_i vˉ=i=1mfi1i=1mvifi
V a r i a n c e = 1 ∑ i = 1 m f i ∑ i = 1 m ( v i − v ˉ ) 2 f i Variance=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}(v_i-\bar{v})^2f_i Variance=i=1mfi1i=1m(vivˉ)2fi

let’s consider some restriction that make the statement meaningful
xi>=0,fix the total sum of values, ∑ i = 1 m x i = n x ˉ \sum^m_{i=1}x_i=n\bar{x} i=1mxi=nxˉ

∑ i = 1 m ( x i − x ˉ ) 2 = ∑ i = 1 m x i 2 − n x ˉ 2 \sum^m_{i=1}(x_i-\bar{x})^2=\sum^m_{i=1}x_i^2-n\bar{x}^2 i=1m(xixˉ)2=i=1mxi2nxˉ2
∑ i = 1 m x i 2 − n x ˉ 2 ≤ ( ∑ i = 1 m x i ) 2 − n x ˉ 2 = n ( n − 1 ) x ˉ 2 \sum^m_{i=1}x_i^2-n\bar{x}^2\leq(\sum^m_{i=1}x_i)^2-n\bar{x}^2=n(n-1)\bar{x}^2 i=1mxi2nxˉ2(i=1mxi)2nxˉ2=n(n1)xˉ2
V ( x 1 , . . . , x n ) ≤ ( n − 1 ) x ˉ 2 V(x_1,...,x_n)\leq(n-1)\bar{x}^2 V(x1,...,xn)(n1)xˉ2

measure income inequality

x 1 ≤ x 2 ≤ . . . ≤ x n x_1\leq x_2 \leq ... \leq x_n x1x2...xn
we now calculate two quantities:
F i = i n , Q i = ∑ j = 1 i x j ∑ j = 1 n x j F_i=\frac in ,Q_i=\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} Fi=ni,Qi=j=1nxjj=1ixj
in general, Q i ≤ F i Q_i\leq F_i QiFi
∑ j = 1 i x j ∑ j = 1 n x j ≤ i n \frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} \leq \frac in j=1nxjj=1ixjni
∑ j = 1 i x j i ≤ ∑ j = 1 n x j n \frac{\sum^i_{j=1}x_j}i \leq \frac{\sum^n_{j=1}x_j}n ij=1ixjnj=1nxj

Area under bottom curve:sum of areas of trapezoids.Thus
A = 1 2 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) 2 A=\frac12-\sum^n_{i=1}\frac{(F_i-F_{i-1})(Q_i+Q_{i+1})}2 A=21i=1n2(FiFi1)(Qi+Qi+1)
gini’s index
G = A 1 / 2 = 1 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) G=\frac{A}{1/2}=1-\sum^n_{i=1}(F_i-F_{i-1})(Q_i+Q_{i+1}) G=1/2A=1i=1n(FiFi1)(Qi+Qi+1)

we can calculate the following summary of “mutual variability”
Δ = ∑ i = 1 k ∑ j = 1 k ∣ x i − x j ∣ n i n n j n \Delta=\sum^k_{i=1}\sum^k_{j=1}|x_i-x_j|\frac {n_i}n\frac{n_j}n Δ=i=1kj=1kxixjnninnj
one can show that
G = Δ 2 x ˉ G=\frac{\Delta}{2\bar{x}} G=2xˉΔ

An Index of Diversity

D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1i=1mpi2
probability that if you capture two fishes they are not the same.
p 1 + . . . + p m = 1 p1+...+pm=1 p1+...+pm=1
p 1 = p 2 = . . . = p m = 1 m p1=p2=...=pm=\frac1m p1=p2=...=pm=m1
D 1 = 1 − 1 m D1=1-\frac1m D1=1m1
so that the diversity 1- 1 m \frac1m m1is larger the larger is m
m>k
D 2 = 1 − 1 k < D 1 D2=1-\frac1k<D1 D2=1k1<D1
This index is known as GINI or Simpson’s diversity index(be careful that actually there are multiple versions of the Simpson index)
There are other measures of diveisity
shannon’s index that is based on entropy
H = − ∑ i = 1 m p i log ⁡ p i H=-\sum^m_{i=1}p_i\log p_i H=i=1mpilogpi
When analysing data relative to the frequency of different alleles in genetics,the D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1i=1mpi2 is preferred.

this is because it has a very easy genetic interpretation:it represents the probability of an heterozygous genotype

Rstudio
robust

sample=runif(20,2,5)
sample_mean=mean(sample)
#how many numbers in the sample we need to change to make the sample mean equal to 4?
sample1=rnorm(30,0,1)
sample1
sample1_mean=mean(sample1)
sample1
#how many numbers in the sample we need to change to make the sample mean equal to 2?

1 number ,so the mean does not have robust at all

sample_median=median(sample)
sort(sample)
#how many numbers in the sample we need to change to make the sample median equal to 4?

median is more robust than mean.
mean(sample,trim=0)
mean(sample,trim=0.1)(highest 10% and lowest 10% is trimmed)
mean(sample,trim=0.2)(20% is trimmed)
mean(sample,trim=0.5)(50% is trimmed)

lorenz

x=c(1,2,3,10,15,15,30,50)
n=length(x)
F=(1:n)/n
Q=cumsum(x)/sum(x)
F=c(0,F)
Q=c(0,Q)
lorenz=data.frame(F,Q)
library(ggplot2)
g=ggplot(lorenz,aes(x=F))+geom_line(aes(x=F,y=Q),color=I(“blue”))+
geom_point(aes(x=F,y=Q),color=I(“blue”))+
geom_abline(intercept=0,slope=1)
F=c(0,F) Q=c(0,Q)在数据中添加0 过原点。
lorenz=data.frame(F,Q)形成数据结构才能一一对应被plot

#perfect equality
x=c(1,2,3,10,15,15,30,50)
tot=sum(x)
n=length(x)
F=(1:n)/n
x=rep(tot/8,8)
Q=cumsum(x)/sum(x)
F=c(0,F)
Q=c(0,Q)
lorenz=data.frame(F,Q)
library(ggplot2)
p1=ggplot

#Shading the region between Lorenze curve and equality line
ggplot(lorenz,aes(x=F))+
geom_line(aes(x=F,y=Q),color=I(“blue”)) +
geom_point(aes(x=F,y=Q),color=I(“blue”)) +
geom_abline(intercept=0, slope=1) +
geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5)
geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5)
填充,(aes(ymin = Q, ymax = F)为范围,fill="cyan"为填充的颜色,alpha为透明度

在这里插入图片描述

#load some income data
library(readr)
library(dplyr)
library(ggplot2)
income <- read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"),
col_names = FALSE, skip = 9,col_types = cols(X2 = col_number(),X3 = col_number(),X4 =col_number(), X5 = col_number(),X6 = col_number(),X7 = col_number(),X8 = col_number(), X9 = col_number(),X10 = col_number(),X11 = col_number(), X12 = col_number(),X13 = col_number(),X14 = col_number(), X15 = col_number(),X16 = col_number(),X17 = col_number(),X18 = col_number(),
X19 = col_number(),X20 = col_number(),X21 = col_number(),X22 = col_number(),X23 = col_number(),X24 = col_number(), X25 = col_number(),X26 = col_number(), X27 = col_number(), X28 = col_number()))
for(i in 1:9)
{
income[is.na(income[,paste(“X”,as.character(3i),sep="")]),paste(“X”,as.character(3i),sep="")]<-income[is.na(income[,paste(“X”,as.character(3*i),sep="")]),“X3”]
}
income <- rename(income, Bracket=X1)
income <- rename(income, Frequency=X2)
income <-rename(income, Income=X3)
read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"),文件名中\要\
skip = 9跳过了9行,col_names = FALSE,无列名
na=c("(B)")为缺失值;X2 = col_number()把每列的值的种类为数值
for循环将每第三列数据的缺失值换成对应行的第一项第三列数据
rename(income, Bracket=X1)将列重命名

p <- ggplot() +
geom_point(data = income, aes(x = Income, y = Frequency/sum(Frequency), colour = “All”),pch=1) +
geom_point(data = income, aes(x = X9, y = X8/sum(X8), colour = “White”)) +
geom_point(data = income, aes(x = X18, y = X17/sum(X17), colour = “Black”)) +
geom_point(data = income, aes(x = X24, y = X23/sum(X23), colour = “Asian”)) +
geom_point(data = income, aes(x = X27, y = X26/sum(X26), colour = “Hispanic”))+
scale_colour_manual("",breaks = c(“All”, “White”, “Black”,“Asian”,“Hispanic”), values = c(“All”=“black”, “White”=“blue”, “Black”=“green”, “Asian”=“yellow”,“Hispanic”=“red”))+ scale_y_continuous(name=“Proportion in Income Bracket”)+
scale_x_continuous(name=“Average Income in Bracket”)+ggtitle(“US 2015 household income, CPS”)
p
colour = “All”,系统自动分配颜色,旁边的标签标记该颜色代表“All”,pch为点的种类,pch=1为点的种类为空心圆点
scale_colour_manual用法:
p + scale_colour_manual(values = c(“red”,“blue”, “green”))#手动分配颜色
p + scale_colour_manual(values = c(“8” = “red”,“4” =“blue”,“6” = “green”)) #根据level分配颜色
cols <- c(“8” = “red”,“4” = “blue”,“6” = “darkgreen”, “10”= “orange”) #自己定义一个颜色向量
p + scale_colour_manual(values = cols)
p + scale_colour_manual(values = cols, breaks = c(“8”, “6”,“4”)) #可以控制图例的顺序
p + scale_colour_manual(values = cols, breaks = c(“4”, “6”,“8”),labels =c(“four”, “six”, “eight”)) #可以控制图例的标签
p + scale_colour_manual(values = cols, limits = c(“4”, “8”))#控制显示哪些图例
#Notice that the values are matched with limits, and not breaks
p + scale_colour_manual(limits = c(6, 8, 4), breaks = c(8, 4,6),values =c(“grey50”, “grey80”, “black”))
scale_x_continuous(name="")可设定x轴坐标轴标签
在这里插入图片描述

#creating lorenz curve
#Creating Lorenze Curve for all races together
F <- cumsum(income F r e q u e n c y ) / s u m ( i n c o m e Frequency)/sum(income Frequency)/sum(incomeFrequency)
Q <- cumsum(income I n c o m e ∗ i n c o m e Income*income IncomeincomeFrequency)/sum(income I n c o m e ∗ i n c o m e Income*income IncomeincomeFrequency)
lorenz <- data.frame(c(0,F),c(0,Q))
#Creating Lorenz Curve for Whites
F <- cumsum(income X 8 ) / s u m ( i n c o m e X8)/sum(income X8)/sum(incomeX8)
Q <- cumsum(income X 9 ∗ i n c o m e X9*income X9incomeX8)/sum(income X 9 ∗ i n c o m e X9*income X9incomeX8)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenze Curve for Blacks
F <- cumsum(income X 17 ) / s u m ( i n c o m e X17)/sum(income X17)/sum(incomeX17)
Q <- cumsum(income X 18 ∗ i n c o m e X18*income X18incomeX17)/sum(income X 18 ∗ i n c o m e X18*income X18incomeX17)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenze Curve for Asians
F <- cumsum(income X 23 ) / s u m ( i n c o m e X23)/sum(income X23)/sum(incomeX23)
Q <- cumsum(income X 24 ∗ i n c o m e X24*income X24incomeX23)/sum(income X 24 ∗ i n c o m e X24*income X24incomeX23)
lorenz <- data.frame(lorenz,c(0,F),c(0,Q))
#Creating Lorenze Curve for Hispanics
F <- cumsum(income X 26 ) / s u m ( i n c o m e X26)/sum(income X26)/sum(incomeX26)
Q <- cumsum(income X 27 ∗ i n c o m e X27*income X27incomeX26)/sum(income X 27 ∗ i n c o m e X27*income X27incomeX26)
lorenz<-data.frame(lorenz,c(0,F),c(0,Q))
names(lorenz)<-c(“F”,“Q”,“F.w”,“Q.w”,“F.b”,“Q.b”,“F.a”,“Q.a”,“F.h”,“Q.h”)
lorenz

names(lorenz),对每个元素变量进行命名

#Compute Gini Index

#Define a new function to compute Gini Index
mygini=function(F,Q){
n=length(F)}
gini=1-sum((F[-1]-F[-N])*(Q[-1]+Q[-n]))
return(gini)
)
GiniL=c(mygini(lorenz$F,lorenz$Q),mygini(lorenz$F.w,lorenz$Q.w),mygini(lorenz$F.b,lorenz$Q.b),mygini(lorenz$F.a,lorenz$Q.a),mygini(lorenz$F.h,lorenz$Q.h))
Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL)
names(Gini)=c(“Race”,“Index”)
Gini

#Make a plot
ggplot(Gini)+ geom_point(aes(x=Race,y=Index)) +
coord_flip() +
theme( # remove the vertical grid lines
panel.grid.major.x = element_blank() ,
# explicitly set the horizontal lines (or they will disappear too)
panel.grid.major.y = element_line(linetype=3, color=“darkgray”))+
ylab(“Gini Index”) +
ggtitle(“2015 US household income, CPS”)
sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) F[-1]删去第一列剩下的元素;F[-n]删去第n列剩下的元素,即对应相邻两列相减
Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL)将数据结构又变成两列,names(Gini)=c(“Race”,“Index”)将两列命名
coord_flip() xy坐标轴转换
panel.grid.major.y = element_line(linetype=3, color=“darkgray”))linetype=3是虚线

在这里插入图片描述

#Load the data on genetic diversity
diversitycodes <- read_delim(“D:\专业英文拓展\diversitycodes”, " ", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
apply(diversitycodes,2,unique) # unique values in each column
names(diversitycodes)<-c(“Code”,“Population”,“Location”,“Continent”)
#Read the frequency data
diversitydata <- read_delim(“D:\专业英文拓展\diversitydata.freqs”,
" ", escape_double = FALSE, col_names = FALSE,
col_types = cols(X2 = col_character(),
X4 = col_double()), trim_ws = TRUE)
names(diversitydata)<-c(“Marker”,“Allele”,“Population”,“Frequency”)
filter(diversitydata, Marker == “D10S1208”, Allele== “167”)
MARKER<-unique(diversitydata$Marker)
POP<-unique(diversitydata$Population)
#compute the diversity index
Diversity=matrix(NA,ncol=377,nrow=52)
for(i in 1:52)
{for(j in 1:377)
{Diversity[i,j]=1-sum(diversitydata[diversitydataKaTeX parse error: Expected 'EOF', got '&' at position 18: …rker==MARKER[j]&̲diversitydataPopulation==POP[i],4]^2)
}
}
colnames(Diversity)=MARKER
rownames(Diversity)=POP
markdiv<-(apply(Diversity,2,mean))
markdiv<-sort(markdiv)
hist(markdiv,xlab=“Diversity”,
main=“Average Marker diversity across populations”)

trim_ws = TRUE去掉头尾的空格
apply(diversitycodes,2,unique) 2为按列apply ,names(diversitycodes) 对每列进行命名

在这里插入图片描述

Diversity<-data.frame(POP,apply(Diversity,1,mean),Diversity)
names(Diversity)<-c(“Population”,“Average”,MARKER)
Diversity<-inner_join(diversitycodes,Diversity,by=c(“Population”))
sort(unique(Diversity$Continent))
contcol<data.frame(sort(unique(Diversity$Continent)),c(“green”,“red”,“orange”,“yellow”,“azure”,“beige”,“blue”))
names(contcol)=c(“Continent”,“Color”)
rownames(contcol)<-contcol$Continent
Diversity<-inner_join(Diversity,contcol,by=c(“Continent”))

Diversity<-Diversity[order(Diversity$Continent,Diversity$Location),]

par(mar=c(8,4,4,2))
barplot(Diversity$Average,names=Diversity$Population,col=as.character(Diversity$Color),cex.lab=.3,
las=3,main=“Average diversity”)
contdiv<-tapply(Diversity A v e r a g e , D i v e r s i t y Average,Diversity Average,DiversityContinent,mean)
第一行将数据结构分为三个部分,第二行将其命名
inner_join将Population中共同的部分连在一起,sort 返回排序后的数值向量
order()的返回值是对应“排名”的元素所在向量中的位置。
mar:以数值向量表示边界大小,顺序为"下、左、上、右",单位为英分,默认值c(5, 4, 4, 2)+0.1
cex.lab表示修改坐标轴名称字体大小 las:刻度显示形式,las=3为竖着

在这里插入图片描述

par(mar=c(5,15,4,2))
barplot(contdiv[order(contdiv)],col=as.character(contcol[names(contdiv[order(contdiv)]),2]),
main=“Continental Diversity”,horiz = T,las=1)

按排序画图
las参数
坐标刻度标签的方向。0表示总是平行于坐标轴,1表示总是水平,2表示总是垂直于坐标轴。

在这里插入图片描述

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值