when summarizing we may choose different aspects:center,spread,asymmetry mean,mode(众数),median some further options:weighted average,trimmed mean
average:a number x ˉ \bar{x} xˉ that substitutes the entire data x1,x2,…xn for one variable Internality.monotonicity.symmetry.associativity
It makes sense to consider the average x ˉ \bar{x} xˉ such that f ( x 1 , x 2 , . . . , x n ) = f ( x ˉ , x ˉ , . . . , x ˉ ) f(x_1,x_2,...,x_n)=f(\bar{x},\bar{x},...,\bar{x}) f(x1,x2,...,xn)=f(xˉ,xˉ,...,xˉ)
we can substitute the one number x ˉ \bar{x} xˉ in place of all the values for the variable and get the same result. different functions f are disirable based on various purpose and lead to different averages(arithmetic mean, geometric mean)
income ( r 1 , r 2 , . . . , r 12 ) (r_1,r_2,...,r_{12}) (r1,r2,...,r12)=income ( r ˉ , r ˉ , . . . , r ˉ ) (\bar{r},\bar{r},...,\bar{r}) (rˉ,rˉ,...,rˉ) ( 1 + r 1 ) ( 1 + r 2 ) . . . ( 1 + r 12 ) = ( 1 + r ˉ ) 12 (1+r_1)(1+r_2)...(1+r_{12})=(1+\bar{r})^{12} (1+r1)(1+r2)...(1+r12)=(1+rˉ)12 r ˉ = ( ∏ i = 1 12 ( 1 + r i ) ) 1 12 − 1 \bar{r}=(\prod^{12}_{i=1}(1+r_i))^{\frac1{12}}-1 rˉ=(i=1∏12(1+ri))121−1
the geometic mean of (1+ri) is a meaningful summary.
T i m e = ∑ i = 1 n d i v i = ∑ i = 1 n d i v ˉ Time=\sum^n_{i=1}\frac{d_i}{v_i}=\sum^n_{i=1}\frac{d_i}{\bar{v}} Time=i=1∑nvidi=i=1∑nvˉdi
v ˉ = ∑ i = 1 n d i ∑ i = 1 n d i v i \bar{v}=\frac{\sum^n_{i=1}d_i}{\sum^n_{i=1}\frac{d_i}{v_i}} vˉ=∑i=1nvidi∑i=1ndi we find that the harmonic mean(here with weights di)is a meaningful summary
let g(z,x1,x2,…,xn) be a function that describes the loss we incur when substituing x1,x2,…,xn with z,then a v e r a g e ( x 1 , x 2 , . . . , x n ) = x ˉ = arg min z g ( z , x 1 , . . . , x n ) average(x_1,x_2,...,x_n)=\bar{x}=\arg\min_zg(z,x_1,...,x_n) average(x1,x2,...,xn)=xˉ=argzming(z,x1,...,xn)
this approach to defining average says that the average is our “best guess” for a value in the data,and the different loss functions specify how we evaluate the goodness of a guess the square loss penalizes more large discrepancies and down weighs small discrepancies the absolute loss considers all discrepancies at their face values the 0-1 loss consider all discrepancies to be the same ,with the exception of no error
square loss:arithmetic mean absolute loss:medium 0-1 loss:mode [to find the z that minimizes this loss we need to look for the z such that the number of xi equal to z is maximal ;this is called the mode and it is useful as a measure of the “center” for qualitatibe variables] M A D = 1 n ∑ i = 1 n ∣ x − M ∣ , M : m e d i a n MAD=\frac1n\sum_{i=1}^n|x-M|,M:median MAD=n1i=1∑n∣x−M∣,M:median
variance:average square distance from the mean V ( x 1 , . . . , x n ) = 1 n ∑ i = 1 n ( x i − x ˉ ) 2 V(x_1,...,x_n)=\frac1n\sum^n_{i=1}(x_i-\bar{x})^2 V(x1,...,xn)=n1i=1∑n(xi−xˉ)2
standad deviation: 1 n ∑ i = 1 n ( x i − x ˉ ) 2 \sqrt{\frac1n\sum^n_{i=1}(x_i-\bar{x})^2} n1i=1∑n(xi−xˉ)2 note that R actually divides by n-1 rather than n.
often data is summarized so that we have counts of occurrences of the same values:we have a set v1,v2,…,vm of possible values,with their frequencies fi calculating averages and standard deviations has to adapt to this different set -up v ˉ = 1 ∑ i = 1 m f i ∑ i = 1 m v i f i \bar{v}=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}v_if_i vˉ=∑i=1mfi1i=1∑mvifi V a r i a n c e = 1 ∑ i = 1 m f i ∑ i = 1 m ( v i − v ˉ ) 2 f i Variance=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}(v_i-\bar{v})^2f_i Variance=∑i=1mfi1i=1∑m(vi−vˉ)2fi
let’s consider some restriction that make the statement meaningful xi>=0,fix the total sum of values, ∑ i = 1 m x i = n x ˉ \sum^m_{i=1}x_i=n\bar{x} ∑i=1mxi=nxˉ
∑ i = 1 m ( x i − x ˉ ) 2 = ∑ i = 1 m x i 2 − n x ˉ 2 \sum^m_{i=1}(x_i-\bar{x})^2=\sum^m_{i=1}x_i^2-n\bar{x}^2 i=1∑m(xi−xˉ)2=i=1∑mxi2−nxˉ2 ∑ i = 1 m x i 2 − n x ˉ 2 ≤ ( ∑ i = 1 m x i ) 2 − n x ˉ 2 = n ( n − 1 ) x ˉ 2 \sum^m_{i=1}x_i^2-n\bar{x}^2\leq(\sum^m_{i=1}x_i)^2-n\bar{x}^2=n(n-1)\bar{x}^2 i=1∑mxi2−nxˉ2≤(i=1∑mxi)2−nxˉ2=n(n−1)xˉ2 V ( x 1 , . . . , x n ) ≤ ( n − 1 ) x ˉ 2 V(x_1,...,x_n)\leq(n-1)\bar{x}^2 V(x1,...,xn)≤(n−1)xˉ2
x 1 ≤ x 2 ≤ . . . ≤ x n x_1\leq x_2 \leq ... \leq x_n x1≤x2≤...≤xn we now calculate two quantities: F i = i n , Q i = ∑ j = 1 i x j ∑ j = 1 n x j F_i=\frac in ,Q_i=\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} Fi=ni,Qi=∑j=1nxj∑j=1ixj in general, Q i ≤ F i Q_i\leq F_i Qi≤Fi ∑ j = 1 i x j ∑ j = 1 n x j ≤ i n \frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} \leq \frac in ∑j=1nxj∑j=1ixj≤ni ∑ j = 1 i x j i ≤ ∑ j = 1 n x j n \frac{\sum^i_{j=1}x_j}i \leq \frac{\sum^n_{j=1}x_j}n i∑j=1ixj≤n∑j=1nxj
Area under bottom curve:sum of areas of trapezoids.Thus A = 1 2 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) 2 A=\frac12-\sum^n_{i=1}\frac{(F_i-F_{i-1})(Q_i+Q_{i+1})}2 A=21−i=1∑n2(Fi−Fi−1)(Qi+Qi+1) gini’s index G = A 1 / 2 = 1 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) G=\frac{A}{1/2}=1-\sum^n_{i=1}(F_i-F_{i-1})(Q_i+Q_{i+1}) G=1/2A=1−i=1∑n(Fi−Fi−1)(Qi+Qi+1)
we can calculate the following summary of “mutual variability” Δ = ∑ i = 1 k ∑ j = 1 k ∣ x i − x j ∣ n i n n j n \Delta=\sum^k_{i=1}\sum^k_{j=1}|x_i-x_j|\frac {n_i}n\frac{n_j}n Δ=i=1∑kj=1∑k∣xi−xj∣nninnj one can show that G = Δ 2 x ˉ G=\frac{\Delta}{2\bar{x}} G=2xˉΔ
D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1−i=1∑mpi2 probability that if you capture two fishes they are not the same. p 1 + . . . + p m = 1 p1+...+pm=1 p1+...+pm=1 p 1 = p 2 = . . . = p m = 1 m p1=p2=...=pm=\frac1m p1=p2=...=pm=m1 D 1 = 1 − 1 m D1=1-\frac1m D1=1−m1 so that the diversity 1- 1 m \frac1m m1is larger the larger is m m>k D 2 = 1 − 1 k < D 1 D2=1-\frac1k<D1 D2=1−k1<D1 This index is known as GINI or Simpson’s diversity index(be careful that actually there are multiple versions of the Simpson index) There are other measures of diveisity shannon’s index that is based on entropy H = − ∑ i = 1 m p i log p i H=-\sum^m_{i=1}p_i\log p_i H=−i=1∑mpilogpi When analysing data relative to the frequency of different alleles in genetics,the D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1−∑i=1mpi2 is preferred.
this is because it has a very easy genetic interpretation:it represents the probability of an heterozygous genotype
sample=runif(20,2,5) sample_mean=mean(sample) #how many numbers in the sample we need to change to make the sample mean equal to 4? sample1=rnorm(30,0,1) sample1 sample1_mean=mean(sample1) sample1 #how many numbers in the sample we need to change to make the sample mean equal to 2?
1 number ,so the mean does not have robust at all
sample_median=median(sample) sort(sample) #how many numbers in the sample we need to change to make the sample median equal to 4?
median is more robust than mean. mean(sample,trim=0) mean(sample,trim=0.1)(highest 10% and lowest 10% is trimmed) mean(sample,trim=0.2)(20% is trimmed) mean(sample,trim=0.5)(50% is trimmed)
x=c(1,2,3,10,15,15,30,50) n=length(x) F=(1:n)/n Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) g=ggplot(lorenz,aes(x=F))+geom_line(aes(x=F,y=Q),color=I(“blue”))+ geom_point(aes(x=F,y=Q),color=I(“blue”))+ geom_abline(intercept=0,slope=1) F=c(0,F) Q=c(0,Q)在数据中添加0 过原点。 lorenz=data.frame(F,Q)形成数据结构才能一一对应被plot
#perfect equality x=c(1,2,3,10,15,15,30,50) tot=sum(x) n=length(x) F=(1:n)/n x=rep(tot/8,8) Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) p1=ggplot
#Shading the region between Lorenze curve and equality line ggplot(lorenz,aes(x=F))+ geom_line(aes(x=F,y=Q),color=I(“blue”)) + geom_point(aes(x=F,y=Q),color=I(“blue”)) + geom_abline(intercept=0, slope=1) + geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) 填充,(aes(ymin = Q, ymax = F)为范围,fill="cyan"为填充的颜色,alpha为透明度
#load some income data library(readr) library(dplyr) library(ggplot2) income <- read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"), col_names = FALSE, skip = 9,col_types = cols(X2 = col_number(),X3 = col_number(),X4 =col_number(), X5 = col_number(),X6 = col_number(),X7 = col_number(),X8 = col_number(), X9 = col_number(),X10 = col_number(),X11 = col_number(), X12 = col_number(),X13 = col_number(),X14 = col_number(), X15 = col_number(),X16 = col_number(),X17 = col_number(),X18 = col_number(), X19 = col_number(),X20 = col_number(),X21 = col_number(),X22 = col_number(),X23 = col_number(),X24 = col_number(), X25 = col_number(),X26 = col_number(), X27 = col_number(), X28 = col_number())) for(i in 1:9) { income[is.na(income[,paste(“X”,as.character(3i),sep="")]),paste(“X”,as.character(3i),sep="")]<-income[is.na(income[,paste(“X”,as.character(3*i),sep="")]),“X3”] } income <- rename(income, Bracket=X1) income <- rename(income, Frequency=X2) income <-rename(income, Income=X3) read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"),文件名中\要\ skip = 9跳过了9行,col_names = FALSE,无列名 na=c("(B)")为缺失值;X2 = col_number()把每列的值的种类为数值 for循环将每第三列数据的缺失值换成对应行的第一项第三列数据 rename(income, Bracket=X1)将列重命名
p <- ggplot() + geom_point(data = income, aes(x = Income, y = Frequency/sum(Frequency), colour = “All”),pch=1) + geom_point(data = income, aes(x = X9, y = X8/sum(X8), colour = “White”)) + geom_point(data = income, aes(x = X18, y = X17/sum(X17), colour = “Black”)) + geom_point(data = income, aes(x = X24, y = X23/sum(X23), colour = “Asian”)) + geom_point(data = income, aes(x = X27, y = X26/sum(X26), colour = “Hispanic”))+ scale_colour_manual("",breaks = c(“All”, “White”, “Black”,“Asian”,“Hispanic”), values = c(“All”=“black”, “White”=“blue”, “Black”=“green”, “Asian”=“yellow”,“Hispanic”=“red”))+ scale_y_continuous(name=“Proportion in Income Bracket”)+ scale_x_continuous(name=“Average Income in Bracket”)+ggtitle(“US 2015 household income, CPS”) p colour = “All”,系统自动分配颜色,旁边的标签标记该颜色代表“All”,pch为点的种类,pch=1为点的种类为空心圆点 scale_colour_manual用法: p + scale_colour_manual(values = c(“red”,“blue”, “green”))#手动分配颜色 p + scale_colour_manual(values = c(“8” = “red”,“4” =“blue”,“6” = “green”)) #根据level分配颜色 cols <- c(“8” = “red”,“4” = “blue”,“6” = “darkgreen”, “10”= “orange”) #自己定义一个颜色向量 p + scale_colour_manual(values = cols) p + scale_colour_manual(values = cols, breaks = c(“8”, “6”,“4”)) #可以控制图例的顺序 p + scale_colour_manual(values = cols, breaks = c(“4”, “6”,“8”),labels =c(“four”, “six”, “eight”)) #可以控制图例的标签 p + scale_colour_manual(values = cols, limits = c(“4”, “8”))#控制显示哪些图例 #Notice that the values are matched with limits, and not breaks p + scale_colour_manual(limits = c(6, 8, 4), breaks = c(8, 4,6),values =c(“grey50”, “grey80”, “black”)) scale_x_continuous(name="")可设定x轴坐标轴标签
#creating lorenz curve #Creating Lorenze Curve for all races together F <- cumsum(income F r e q u e n c y ) / s u m ( i n c o m e Frequency)/sum(income Frequency)/sum(incomeFrequency) Q <- cumsum(income I n c o m e ∗ i n c o m e Income*income Income∗incomeFrequency)/sum(income I n c o m e ∗ i n c o m e Income*income Income∗incomeFrequency) lorenz <- data.frame(c(0,F),c(0,Q)) #Creating Lorenz Curve for Whites F <- cumsum(income X 8 ) / s u m ( i n c o m e X8)/sum(income X8)/sum(incomeX8) Q <- cumsum(income X 9 ∗ i n c o m e X9*income X9∗incomeX8)/sum(income X 9 ∗ i n c o m e X9*income X9∗incomeX8) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Blacks F <- cumsum(income X 17 ) / s u m ( i n c o m e X17)/sum(income X17)/sum(incomeX17) Q <- cumsum(income X 18 ∗ i n c o m e X18*income X18∗incomeX17)/sum(income X 18 ∗ i n c o m e X18*income X18∗incomeX17) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Asians F <- cumsum(income X 23 ) / s u m ( i n c o m e X23)/sum(income X23)/sum(incomeX23) Q <- cumsum(income X 24 ∗ i n c o m e X24*income X24∗incomeX23)/sum(income X 24 ∗ i n c o m e X24*income X24∗incomeX23) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Hispanics F <- cumsum(income X 26 ) / s u m ( i n c o m e X26)/sum(income X26)/sum(incomeX26) Q <- cumsum(income X 27 ∗ i n c o m e X27*income X27∗incomeX26)/sum(income X 27 ∗ i n c o m e X27*income X27∗incomeX26) lorenz<-data.frame(lorenz,c(0,F),c(0,Q)) names(lorenz)<-c(“F”,“Q”,“F.w”,“Q.w”,“F.b”,“Q.b”,“F.a”,“Q.a”,“F.h”,“Q.h”) lorenz
names(lorenz),对每个元素变量进行命名
#Compute Gini Index
#Define a new function to compute Gini Index mygini=function(F,Q){ n=length(F)} gini=1-sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) return(gini) ) GiniL=c(mygini(lorenz$F,lorenz$Q),mygini(lorenz$F.w,lorenz$Q.w),mygini(lorenz$F.b,lorenz$Q.b),mygini(lorenz$F.a,lorenz$Q.a),mygini(lorenz$F.h,lorenz$Q.h)) Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL) names(Gini)=c(“Race”,“Index”) Gini
#Make a plot ggplot(Gini)+ geom_point(aes(x=Race,y=Index)) + coord_flip() + theme( # remove the vertical grid lines panel.grid.major.x = element_blank() , # explicitly set the horizontal lines (or they will disappear too) panel.grid.major.y = element_line(linetype=3, color=“darkgray”))+ ylab(“Gini Index”) + ggtitle(“2015 US household income, CPS”) sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) F[-1]删去第一列剩下的元素;F[-n]删去第n列剩下的元素,即对应相邻两列相减 Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL)将数据结构又变成两列,names(Gini)=c(“Race”,“Index”)将两列命名 coord_flip() xy坐标轴转换 panel.grid.major.y = element_line(linetype=3, color=“darkgray”))linetype=3是虚线
#Load the data on genetic diversity diversitycodes <- read_delim(“D:\专业英文拓展\diversitycodes”, " ", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) apply(diversitycodes,2,unique) # unique values in each column names(diversitycodes)<-c(“Code”,“Population”,“Location”,“Continent”) #Read the frequency data diversitydata <- read_delim(“D:\专业英文拓展\diversitydata.freqs”, " ", escape_double = FALSE, col_names = FALSE, col_types = cols(X2 = col_character(), X4 = col_double()), trim_ws = TRUE) names(diversitydata)<-c(“Marker”,“Allele”,“Population”,“Frequency”) filter(diversitydata, Marker == “D10S1208”, Allele== “167”) MARKER<-unique(diversitydata$Marker) POP<-unique(diversitydata$Population) #compute the diversity index Diversity=matrix(NA,ncol=377,nrow=52) for(i in 1:52) {for(j in 1:377) {Diversity[i,j]=1-sum(diversitydata[diversitydataKaTeX parse error: Expected 'EOF', got '&' at position 18: …rker==MARKER[j]&̲diversitydataPopulation==POP[i],4]^2) } } colnames(Diversity)=MARKER rownames(Diversity)=POP markdiv<-(apply(Diversity,2,mean)) markdiv<-sort(markdiv) hist(markdiv,xlab=“Diversity”, main=“Average Marker diversity across populations”)
trim_ws = TRUE去掉头尾的空格 apply(diversitycodes,2,unique) 2为按列apply ,names(diversitycodes) 对每列进行命名
Diversity<-data.frame(POP,apply(Diversity,1,mean),Diversity) names(Diversity)<-c(“Population”,“Average”,MARKER) Diversity<-inner_join(diversitycodes,Diversity,by=c(“Population”)) sort(unique(Diversity$Continent)) contcol<data.frame(sort(unique(Diversity$Continent)),c(“green”,“red”,“orange”,“yellow”,“azure”,“beige”,“blue”)) names(contcol)=c(“Continent”,“Color”) rownames(contcol)<-contcol$Continent Diversity<-inner_join(Diversity,contcol,by=c(“Continent”))
Diversity<-Diversity[order(Diversity$Continent,Diversity$Location),]
par(mar=c(8,4,4,2)) barplot(Diversity$Average,names=Diversity$Population,col=as.character(Diversity$Color),cex.lab=.3, las=3,main=“Average diversity”) contdiv<-tapply(Diversity A v e r a g e , D i v e r s i t y Average,Diversity Average,DiversityContinent,mean) 第一行将数据结构分为三个部分,第二行将其命名 inner_join将Population中共同的部分连在一起,sort 返回排序后的数值向量 order()的返回值是对应“排名”的元素所在向量中的位置。 mar:以数值向量表示边界大小,顺序为"下、左、上、右",单位为英分,默认值c(5, 4, 4, 2)+0.1 cex.lab表示修改坐标轴名称字体大小 las:刻度显示形式,las=3为竖着
par(mar=c(5,15,4,2)) barplot(contdiv[order(contdiv)],col=as.character(contcol[names(contdiv[order(contdiv)]),2]), main=“Continental Diversity”,horiz = T,las=1)
按排序画图 las参数 坐标刻度标签的方向。0表示总是平行于坐标轴,1表示总是水平,2表示总是垂直于坐标轴。