Data science note 2

    技术2022-07-10  106

    Summaries and Indices

    when summarizing we may choose different aspects:center,spread,asymmetry mean,mode(众数),median some further options:weighted average,trimmed mean

    Properties of an Average

    average:a number x ˉ \bar{x} xˉ that substitutes the entire data x1,x2,…xn for one variable Internality.monotonicity.symmetry.associativity

    It makes sense to consider the average x ˉ \bar{x} xˉ such that f ( x 1 , x 2 , . . . , x n ) = f ( x ˉ , x ˉ , . . . , x ˉ ) f(x_1,x_2,...,x_n)=f(\bar{x},\bar{x},...,\bar{x}) f(x1,x2,...,xn)=f(xˉ,xˉ,...,xˉ)

    we can substitute the one number x ˉ \bar{x} xˉ in place of all the values for the variable and get the same result. different functions f are disirable based on various purpose and lead to different averages(arithmetic mean, geometric mean)

    the average interest rate

    income ( r 1 , r 2 , . . . , r 12 ) (r_1,r_2,...,r_{12}) (r1,r2,...,r12)=income ( r ˉ , r ˉ , . . . , r ˉ ) (\bar{r},\bar{r},...,\bar{r}) (rˉ,rˉ,...,rˉ) ( 1 + r 1 ) ( 1 + r 2 ) . . . ( 1 + r 12 ) = ( 1 + r ˉ ) 12 (1+r_1)(1+r_2)...(1+r_{12})=(1+\bar{r})^{12} (1+r1)(1+r2)...(1+r12)=(1+rˉ)12 r ˉ = ( ∏ i = 1 12 ( 1 + r i ) ) 1 12 − 1 \bar{r}=(\prod^{12}_{i=1}(1+r_i))^{\frac1{12}}-1 rˉ=(i=112(1+ri))1211

    the geometic mean of (1+ri) is a meaningful summary.

    average speed

    T i m e = ∑ i = 1 n d i v i = ∑ i = 1 n d i v ˉ Time=\sum^n_{i=1}\frac{d_i}{v_i}=\sum^n_{i=1}\frac{d_i}{\bar{v}} Time=i=1nvidi=i=1nvˉdi

    v ˉ = ∑ i = 1 n d i ∑ i = 1 n d i v i \bar{v}=\frac{\sum^n_{i=1}d_i}{\sum^n_{i=1}\frac{d_i}{v_i}} vˉ=i=1nvidii=1ndi we find that the harmonic mean(here with weights di)is a meaningful summary

    another approach to defining an average

    let g(z,x1,x2,…,xn) be a function that describes the loss we incur when substituing x1,x2,…,xn with z,then a v e r a g e ( x 1 , x 2 , . . . , x n ) = x ˉ = arg ⁡ min ⁡ z g ( z , x 1 , . . . , x n ) average(x_1,x_2,...,x_n)=\bar{x}=\arg\min_zg(z,x_1,...,x_n) average(x1,x2,...,xn)=xˉ=argzming(z,x1,...,xn)

    different loss functions

    this approach to defining average says that the average is our “best guess” for a value in the data,and the different loss functions specify how we evaluate the goodness of a guess the square loss penalizes more large discrepancies and down weighs small discrepancies the absolute loss considers all discrepancies at their face values the 0-1 loss consider all discrepancies to be the same ,with the exception of no error

    square loss:arithmetic mean absolute loss:medium 0-1 loss:mode [to find the z that minimizes this loss we need to look for the z such that the number of xi equal to z is maximal ;this is called the mode and it is useful as a measure of the “center” for qualitatibe variables] M A D = 1 n ∑ i = 1 n ∣ x − M ∣ , M : m e d i a n MAD=\frac1n\sum_{i=1}^n|x-M|,M:median MAD=n1i=1nxM,M:median

    measures of spread/dispersion/variability

    variance:average square distance from the mean V ( x 1 , . . . , x n ) = 1 n ∑ i = 1 n ( x i − x ˉ ) 2 V(x_1,...,x_n)=\frac1n\sum^n_{i=1}(x_i-\bar{x})^2 V(x1,...,xn)=n1i=1n(xixˉ)2

    standad deviation: 1 n ∑ i = 1 n ( x i − x ˉ ) 2 \sqrt{\frac1n\sum^n_{i=1}(x_i-\bar{x})^2} n1i=1n(xixˉ)2 note that R actually divides by n-1 rather than n.

    often data is summarized so that we have counts of occurrences of the same values:we have a set v1,v2,…,vm of possible values,with their frequencies fi calculating averages and standard deviations has to adapt to this different set -up v ˉ = 1 ∑ i = 1 m f i ∑ i = 1 m v i f i \bar{v}=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}v_if_i vˉ=i=1mfi1i=1mvifi V a r i a n c e = 1 ∑ i = 1 m f i ∑ i = 1 m ( v i − v ˉ ) 2 f i Variance=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}(v_i-\bar{v})^2f_i Variance=i=1mfi1i=1m(vivˉ)2fi

    let’s consider some restriction that make the statement meaningful xi>=0,fix the total sum of values, ∑ i = 1 m x i = n x ˉ \sum^m_{i=1}x_i=n\bar{x} i=1mxi=nxˉ

    ∑ i = 1 m ( x i − x ˉ ) 2 = ∑ i = 1 m x i 2 − n x ˉ 2 \sum^m_{i=1}(x_i-\bar{x})^2=\sum^m_{i=1}x_i^2-n\bar{x}^2 i=1m(xixˉ)2=i=1mxi2nxˉ2 ∑ i = 1 m x i 2 − n x ˉ 2 ≤ ( ∑ i = 1 m x i ) 2 − n x ˉ 2 = n ( n − 1 ) x ˉ 2 \sum^m_{i=1}x_i^2-n\bar{x}^2\leq(\sum^m_{i=1}x_i)^2-n\bar{x}^2=n(n-1)\bar{x}^2 i=1mxi2nxˉ2(i=1mxi)2nxˉ2=n(n1)xˉ2 V ( x 1 , . . . , x n ) ≤ ( n − 1 ) x ˉ 2 V(x_1,...,x_n)\leq(n-1)\bar{x}^2 V(x1,...,xn)(n1)xˉ2

    measure income inequality

    x 1 ≤ x 2 ≤ . . . ≤ x n x_1\leq x_2 \leq ... \leq x_n x1x2...xn we now calculate two quantities: F i = i n , Q i = ∑ j = 1 i x j ∑ j = 1 n x j F_i=\frac in ,Q_i=\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} Fi=ni,Qi=j=1nxjj=1ixj in general, Q i ≤ F i Q_i\leq F_i QiFi ∑ j = 1 i x j ∑ j = 1 n x j ≤ i n \frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} \leq \frac in j=1nxjj=1ixjni ∑ j = 1 i x j i ≤ ∑ j = 1 n x j n \frac{\sum^i_{j=1}x_j}i \leq \frac{\sum^n_{j=1}x_j}n ij=1ixjnj=1nxj

    Area under bottom curve:sum of areas of trapezoids.Thus A = 1 2 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) 2 A=\frac12-\sum^n_{i=1}\frac{(F_i-F_{i-1})(Q_i+Q_{i+1})}2 A=21i=1n2(FiFi1)(Qi+Qi+1) gini’s index G = A 1 / 2 = 1 − ∑ i = 1 n ( F i − F i − 1 ) ( Q i + Q i + 1 ) G=\frac{A}{1/2}=1-\sum^n_{i=1}(F_i-F_{i-1})(Q_i+Q_{i+1}) G=1/2A=1i=1n(FiFi1)(Qi+Qi+1)

    we can calculate the following summary of “mutual variability” Δ = ∑ i = 1 k ∑ j = 1 k ∣ x i − x j ∣ n i n n j n \Delta=\sum^k_{i=1}\sum^k_{j=1}|x_i-x_j|\frac {n_i}n\frac{n_j}n Δ=i=1kj=1kxixjnninnj one can show that G = Δ 2 x ˉ G=\frac{\Delta}{2\bar{x}} G=2xˉΔ

    An Index of Diversity

    D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1i=1mpi2 probability that if you capture two fishes they are not the same. p 1 + . . . + p m = 1 p1+...+pm=1 p1+...+pm=1 p 1 = p 2 = . . . = p m = 1 m p1=p2=...=pm=\frac1m p1=p2=...=pm=m1 D 1 = 1 − 1 m D1=1-\frac1m D1=1m1 so that the diversity 1- 1 m \frac1m m1is larger the larger is m m>k D 2 = 1 − 1 k < D 1 D2=1-\frac1k<D1 D2=1k1<D1 This index is known as GINI or Simpson’s diversity index(be careful that actually there are multiple versions of the Simpson index) There are other measures of diveisity shannon’s index that is based on entropy H = − ∑ i = 1 m p i log ⁡ p i H=-\sum^m_{i=1}p_i\log p_i H=i=1mpilogpi When analysing data relative to the frequency of different alleles in genetics,the D = 1 − ∑ i = 1 m p i 2 D=1-\sum_{i=1}^mp_i^2 D=1i=1mpi2 is preferred.

    this is because it has a very easy genetic interpretation:it represents the probability of an heterozygous genotype

    Rstudio

    robust

    sample=runif(20,2,5) sample_mean=mean(sample) #how many numbers in the sample we need to change to make the sample mean equal to 4? sample1=rnorm(30,0,1) sample1 sample1_mean=mean(sample1) sample1 #how many numbers in the sample we need to change to make the sample mean equal to 2?

    1 number ,so the mean does not have robust at all

    sample_median=median(sample) sort(sample) #how many numbers in the sample we need to change to make the sample median equal to 4?

    median is more robust than mean. mean(sample,trim=0) mean(sample,trim=0.1)(highest 10% and lowest 10% is trimmed) mean(sample,trim=0.2)(20% is trimmed) mean(sample,trim=0.5)(50% is trimmed)

    lorenz

    x=c(1,2,3,10,15,15,30,50) n=length(x) F=(1:n)/n Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) g=ggplot(lorenz,aes(x=F))+geom_line(aes(x=F,y=Q),color=I(“blue”))+ geom_point(aes(x=F,y=Q),color=I(“blue”))+ geom_abline(intercept=0,slope=1) F=c(0,F) Q=c(0,Q)在数据中添加0 过原点。 lorenz=data.frame(F,Q)形成数据结构才能一一对应被plot

    #perfect equality x=c(1,2,3,10,15,15,30,50) tot=sum(x) n=length(x) F=(1:n)/n x=rep(tot/8,8) Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) p1=ggplot

    #Shading the region between Lorenze curve and equality line ggplot(lorenz,aes(x=F))+ geom_line(aes(x=F,y=Q),color=I(“blue”)) + geom_point(aes(x=F,y=Q),color=I(“blue”)) + geom_abline(intercept=0, slope=1) + geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) 填充,(aes(ymin = Q, ymax = F)为范围,fill="cyan"为填充的颜色,alpha为透明度

    #load some income data library(readr) library(dplyr) library(ggplot2) income <- read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"), col_names = FALSE, skip = 9,col_types = cols(X2 = col_number(),X3 = col_number(),X4 =col_number(), X5 = col_number(),X6 = col_number(),X7 = col_number(),X8 = col_number(), X9 = col_number(),X10 = col_number(),X11 = col_number(), X12 = col_number(),X13 = col_number(),X14 = col_number(), X15 = col_number(),X16 = col_number(),X17 = col_number(),X18 = col_number(), X19 = col_number(),X20 = col_number(),X21 = col_number(),X22 = col_number(),X23 = col_number(),X24 = col_number(), X25 = col_number(),X26 = col_number(), X27 = col_number(), X28 = col_number())) for(i in 1:9) { income[is.na(income[,paste(“X”,as.character(3i),sep="")]),paste(“X”,as.character(3i),sep="")]<-income[is.na(income[,paste(“X”,as.character(3*i),sep="")]),“X3”] } income <- rename(income, Bracket=X1) income <- rename(income, Frequency=X2) income <-rename(income, Income=X3) read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"),文件名中\要\ skip = 9跳过了9行,col_names = FALSE,无列名 na=c("(B)")为缺失值;X2 = col_number()把每列的值的种类为数值 for循环将每第三列数据的缺失值换成对应行的第一项第三列数据 rename(income, Bracket=X1)将列重命名

    p <- ggplot() + geom_point(data = income, aes(x = Income, y = Frequency/sum(Frequency), colour = “All”),pch=1) + geom_point(data = income, aes(x = X9, y = X8/sum(X8), colour = “White”)) + geom_point(data = income, aes(x = X18, y = X17/sum(X17), colour = “Black”)) + geom_point(data = income, aes(x = X24, y = X23/sum(X23), colour = “Asian”)) + geom_point(data = income, aes(x = X27, y = X26/sum(X26), colour = “Hispanic”))+ scale_colour_manual("",breaks = c(“All”, “White”, “Black”,“Asian”,“Hispanic”), values = c(“All”=“black”, “White”=“blue”, “Black”=“green”, “Asian”=“yellow”,“Hispanic”=“red”))+ scale_y_continuous(name=“Proportion in Income Bracket”)+ scale_x_continuous(name=“Average Income in Bracket”)+ggtitle(“US 2015 household income, CPS”) p colour = “All”,系统自动分配颜色,旁边的标签标记该颜色代表“All”,pch为点的种类,pch=1为点的种类为空心圆点 scale_colour_manual用法: p + scale_colour_manual(values = c(“red”,“blue”, “green”))#手动分配颜色 p + scale_colour_manual(values = c(“8” = “red”,“4” =“blue”,“6” = “green”)) #根据level分配颜色 cols <- c(“8” = “red”,“4” = “blue”,“6” = “darkgreen”, “10”= “orange”) #自己定义一个颜色向量 p + scale_colour_manual(values = cols) p + scale_colour_manual(values = cols, breaks = c(“8”, “6”,“4”)) #可以控制图例的顺序 p + scale_colour_manual(values = cols, breaks = c(“4”, “6”,“8”),labels =c(“four”, “six”, “eight”)) #可以控制图例的标签 p + scale_colour_manual(values = cols, limits = c(“4”, “8”))#控制显示哪些图例 #Notice that the values are matched with limits, and not breaks p + scale_colour_manual(limits = c(6, 8, 4), breaks = c(8, 4,6),values =c(“grey50”, “grey80”, “black”)) scale_x_continuous(name="")可设定x轴坐标轴标签

    #creating lorenz curve #Creating Lorenze Curve for all races together F <- cumsum(income F r e q u e n c y ) / s u m ( i n c o m e Frequency)/sum(income Frequency)/sum(incomeFrequency) Q <- cumsum(income I n c o m e ∗ i n c o m e Income*income IncomeincomeFrequency)/sum(income I n c o m e ∗ i n c o m e Income*income IncomeincomeFrequency) lorenz <- data.frame(c(0,F),c(0,Q)) #Creating Lorenz Curve for Whites F <- cumsum(income X 8 ) / s u m ( i n c o m e X8)/sum(income X8)/sum(incomeX8) Q <- cumsum(income X 9 ∗ i n c o m e X9*income X9incomeX8)/sum(income X 9 ∗ i n c o m e X9*income X9incomeX8) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Blacks F <- cumsum(income X 17 ) / s u m ( i n c o m e X17)/sum(income X17)/sum(incomeX17) Q <- cumsum(income X 18 ∗ i n c o m e X18*income X18incomeX17)/sum(income X 18 ∗ i n c o m e X18*income X18incomeX17) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Asians F <- cumsum(income X 23 ) / s u m ( i n c o m e X23)/sum(income X23)/sum(incomeX23) Q <- cumsum(income X 24 ∗ i n c o m e X24*income X24incomeX23)/sum(income X 24 ∗ i n c o m e X24*income X24incomeX23) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Hispanics F <- cumsum(income X 26 ) / s u m ( i n c o m e X26)/sum(income X26)/sum(incomeX26) Q <- cumsum(income X 27 ∗ i n c o m e X27*income X27incomeX26)/sum(income X 27 ∗ i n c o m e X27*income X27incomeX26) lorenz<-data.frame(lorenz,c(0,F),c(0,Q)) names(lorenz)<-c(“F”,“Q”,“F.w”,“Q.w”,“F.b”,“Q.b”,“F.a”,“Q.a”,“F.h”,“Q.h”) lorenz

    names(lorenz),对每个元素变量进行命名

    #Compute Gini Index

    #Define a new function to compute Gini Index mygini=function(F,Q){ n=length(F)} gini=1-sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) return(gini) ) GiniL=c(mygini(lorenz$F,lorenz$Q),mygini(lorenz$F.w,lorenz$Q.w),mygini(lorenz$F.b,lorenz$Q.b),mygini(lorenz$F.a,lorenz$Q.a),mygini(lorenz$F.h,lorenz$Q.h)) Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL) names(Gini)=c(“Race”,“Index”) Gini

    #Make a plot ggplot(Gini)+ geom_point(aes(x=Race,y=Index)) + coord_flip() + theme( # remove the vertical grid lines panel.grid.major.x = element_blank() , # explicitly set the horizontal lines (or they will disappear too) panel.grid.major.y = element_line(linetype=3, color=“darkgray”))+ ylab(“Gini Index”) + ggtitle(“2015 US household income, CPS”) sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) F[-1]删去第一列剩下的元素;F[-n]删去第n列剩下的元素,即对应相邻两列相减 Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL)将数据结构又变成两列,names(Gini)=c(“Race”,“Index”)将两列命名 coord_flip() xy坐标轴转换 panel.grid.major.y = element_line(linetype=3, color=“darkgray”))linetype=3是虚线

    #Load the data on genetic diversity diversitycodes <- read_delim(“D:\专业英文拓展\diversitycodes”, " ", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) apply(diversitycodes,2,unique) # unique values in each column names(diversitycodes)<-c(“Code”,“Population”,“Location”,“Continent”) #Read the frequency data diversitydata <- read_delim(“D:\专业英文拓展\diversitydata.freqs”, " ", escape_double = FALSE, col_names = FALSE, col_types = cols(X2 = col_character(), X4 = col_double()), trim_ws = TRUE) names(diversitydata)<-c(“Marker”,“Allele”,“Population”,“Frequency”) filter(diversitydata, Marker == “D10S1208”, Allele== “167”) MARKER<-unique(diversitydata$Marker) POP<-unique(diversitydata$Population) #compute the diversity index Diversity=matrix(NA,ncol=377,nrow=52) for(i in 1:52) {for(j in 1:377) {Diversity[i,j]=1-sum(diversitydata[diversitydataKaTeX parse error: Expected 'EOF', got '&' at position 18: …rker==MARKER[j]&̲diversitydataPopulation==POP[i],4]^2) } } colnames(Diversity)=MARKER rownames(Diversity)=POP markdiv<-(apply(Diversity,2,mean)) markdiv<-sort(markdiv) hist(markdiv,xlab=“Diversity”, main=“Average Marker diversity across populations”)

    trim_ws = TRUE去掉头尾的空格 apply(diversitycodes,2,unique) 2为按列apply ,names(diversitycodes) 对每列进行命名

    Diversity<-data.frame(POP,apply(Diversity,1,mean),Diversity) names(Diversity)<-c(“Population”,“Average”,MARKER) Diversity<-inner_join(diversitycodes,Diversity,by=c(“Population”)) sort(unique(Diversity$Continent)) contcol<data.frame(sort(unique(Diversity$Continent)),c(“green”,“red”,“orange”,“yellow”,“azure”,“beige”,“blue”)) names(contcol)=c(“Continent”,“Color”) rownames(contcol)<-contcol$Continent Diversity<-inner_join(Diversity,contcol,by=c(“Continent”))

    Diversity<-Diversity[order(Diversity$Continent,Diversity$Location),]

    par(mar=c(8,4,4,2)) barplot(Diversity$Average,names=Diversity$Population,col=as.character(Diversity$Color),cex.lab=.3, las=3,main=“Average diversity”) contdiv<-tapply(Diversity A v e r a g e , D i v e r s i t y Average,Diversity Average,DiversityContinent,mean) 第一行将数据结构分为三个部分,第二行将其命名 inner_join将Population中共同的部分连在一起,sort 返回排序后的数值向量 order()的返回值是对应“排名”的元素所在向量中的位置。 mar:以数值向量表示边界大小,顺序为"下、左、上、右",单位为英分,默认值c(5, 4, 4, 2)+0.1 cex.lab表示修改坐标轴名称字体大小 las:刻度显示形式,las=3为竖着

    par(mar=c(5,15,4,2)) barplot(contdiv[order(contdiv)],col=as.character(contcol[names(contdiv[order(contdiv)]),2]), main=“Continental Diversity”,horiz = T,las=1)

    按排序画图 las参数 坐标刻度标签的方向。0表示总是平行于坐标轴,1表示总是水平,2表示总是垂直于坐标轴。

    Processed: 0.038, SQL: 9