Data science note 2

技术2022-07-10 210

Summaries and Indices

when summarizing we may choose different aspects:center,spread,asymmetry mean，mode(众数)，median some further options:weighted average,trimmed mean

Properties of an Average

average:a number $\bar{x}$ that substitutes the entire data x1,x2,…xn for one variable Internality.monotonicity.symmetry.associativity

It makes sense to consider the average $\bar{x}$ such that $f(x_1,x_2,...,x_n)=f(\bar{x},\bar{x},...,\bar{x})$

we can substitute the one number $\bar{x}$ in place of all the values for the variable and get the same result. different functions f are disirable based on various purpose and lead to different averages(arithmetic mean, geometric mean)

the average interest rate

income $r_1,r_2,...,r_{12})$ =income $(\bar{r},\bar{r},...,\bar{r})$ $(1+r_1)(1+r_2)...(1+r_{12})=(1+\bar{r})^{12}$ $\bar{r}=(\prod^{12}_{i=1}(1+r_i))^{\frac1{12}}-1$

the geometic mean of (1+ri) is a meaningful summary.

average speed

$Time=\sum^n_{i=1}\frac{d_i}{v_i}=\sum^n_{i=1}\frac{d_i}{\bar{v}}$

$\bar{v}=\frac{\sum^n_{i=1}d_i}{\sum^n_{i=1}\frac{d_i}{v_i}}$ we find that the harmonic mean(here with weights di)is a meaningful summary

another approach to defining an average

let g(z,x1,x2,…,xn) be a function that describes the loss we incur when substituing x1,x2,…,xn with z,then $average(x_1,x_2,...,x_n)=\bar{x}=\arg\min_zg(z,x_1,...,x_n)$

different loss functions

this approach to defining average says that the average is our “best guess” for a value in the data,and the different loss functions specify how we evaluate the goodness of a guess the square loss penalizes more large discrepancies and down weighs small discrepancies the absolute loss considers all discrepancies at their face values the 0-1 loss consider all discrepancies to be the same ,with the exception of no error

square loss:arithmetic mean absolute loss:medium 0-1 loss:mode [to find the z that minimizes this loss we need to look for the z such that the number of xi equal to z is maximal ;this is called the mode and it is useful as a measure of the “center” for qualitatibe variables] $MAD=\frac1n\sum_{i=1}^n|x-M|,M:median$

measures of spread/dispersion/variability

variance:average square distance from the mean $V(x_1,...,x_n)=\frac1n\sum^n_{i=1}(x_i-\bar{x})^2$

standad deviation: $\sqrt{\frac1n\sum^n_{i=1}(x_i-\bar{x})^2}$ note that R actually divides by n-1 rather than n.

often data is summarized so that we have counts of occurrences of the same values:we have a set v1,v2,…,vm of possible values,with their frequencies fi calculating averages and standard deviations has to adapt to this different set -up $\bar{v}=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}v_if_i$ $Variance=\frac1{\sum^m_{i=1}fi}\sum^m_{i=1}(v_i-\bar{v})^2f_i$

let’s consider some restriction that make the statement meaningful xi>=0,fix the total sum of values, $\sum^m_{i=1}x_i=n\bar{x}$

$\sum^m_{i=1}(x_i-\bar{x})^2=\sum^m_{i=1}x_i^2-n\bar{x}^2$ $\sum^m_{i=1}x_i^2-n\bar{x}^2\leq(\sum^m_{i=1}x_i)^2-n\bar{x}^2=n(n-1)\bar{x}^2$ $V(x_1,...,x_n)\leq(n-1)\bar{x}^2$

measure income inequality

$x_1\leq x_2 \leq ... \leq x_n$ we now calculate two quantities: $F_i=\frac in ,Q_i=\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j}$ in general, $Q_i\leq F_i$ $\frac{\sum^i_{j=1}x_j}{\sum^n_{j=1}x_j} \leq \frac in$ $\frac{\sum^i_{j=1}x_j}i \leq \frac{\sum^n_{j=1}x_j}n$

Area under bottom curve:sum of areas of trapezoids.Thus $A=\frac12-\sum^n_{i=1}\frac{(F_i-F_{i-1})(Q_i+Q_{i+1})}2$ gini’s index $G=\frac{A}{1/2}=1-\sum^n_{i=1}(F_i-F_{i-1})(Q_i+Q_{i+1})$

we can calculate the following summary of “mutual variability” $\Delta=\sum^k_{i=1}\sum^k_{j=1}|x_i-x_j|\frac {n_i}n\frac{n_j}n$ one can show that $G=\frac{\Delta}{2\bar{x}}$

An Index of Diversity

$D=1-\sum_{i=1}^mp_i^2$ probability that if you capture two fishes they are not the same. $p 1 + . . . + p m = 1$ $p1=p2=...=pm=\frac1m$ $D1=1-\frac1m$ so that the diversity 1- $\frac1m$ is larger the larger is m m>k $D2=1-\frac1k<D1$ This index is known as GINI or Simpson’s diversity index(be careful that actually there are multiple versions of the Simpson index) There are other measures of diveisity shannon’s index that is based on entropy $H=-\sum^m_{i=1}p_i\log p_i$ When analysing data relative to the frequency of different alleles in genetics,the $D=1-\sum_{i=1}^mp_i^2$ is preferred.

this is because it has a very easy genetic interpretation:it represents the probability of an heterozygous genotype

Rstudio

robust

sample=runif(20,2,5) sample_mean=mean(sample) #how many numbers in the sample we need to change to make the sample mean equal to 4? sample1=rnorm(30,0,1) sample1 sample1_mean=mean(sample1) sample1 #how many numbers in the sample we need to change to make the sample mean equal to 2?

1 number ,so the mean does not have robust at all

sample_median=median(sample) sort(sample) #how many numbers in the sample we need to change to make the sample median equal to 4?

median is more robust than mean. mean(sample,trim=0) mean(sample,trim=0.1)(highest 10% and lowest 10% is trimmed) mean(sample,trim=0.2)(20% is trimmed) mean(sample,trim=0.5)(50% is trimmed)

lorenz

x=c(1,2,3,10,15,15,30,50) n=length(x) F=(1:n)/n Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) g=ggplot(lorenz,aes(x=F))+geom_line(aes(x=F,y=Q),color=I(“blue”))+ geom_point(aes(x=F,y=Q),color=I(“blue”))+ geom_abline(intercept=0,slope=1) F=c(0,F) Q=c(0,Q)在数据中添加0 过原点。 lorenz=data.frame(F,Q)形成数据结构才能一一对应被plot

#perfect equality x=c(1,2,3,10,15,15,30,50) tot=sum(x) n=length(x) F=(1:n)/n x=rep(tot/8,8) Q=cumsum(x)/sum(x) F=c(0,F) Q=c(0,Q) lorenz=data.frame(F,Q) library(ggplot2) p1=ggplot

#Shading the region between Lorenze curve and equality line ggplot(lorenz,aes(x=F))+ geom_line(aes(x=F,y=Q),color=I(“blue”)) + geom_point(aes(x=F,y=Q),color=I(“blue”)) + geom_abline(intercept=0, slope=1) + geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) geom_ribbon(aes(ymin = Q, ymax = F),fill=“cyan”,alpha=0.5) 填充，(aes(ymin = Q, ymax = F)为范围，fill="cyan"为填充的颜色，alpha为透明度

#load some income data library(readr) library(dplyr) library(ggplot2) income <- read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"), col_names = FALSE, skip = 9,col_types = cols(X2 = col_number(),X3 = col_number(),X4 =col_number(), X5 = col_number(),X6 = col_number(),X7 = col_number(),X8 = col_number(), X9 = col_number(),X10 = col_number(),X11 = col_number(), X12 = col_number(),X13 = col_number(),X14 = col_number(), X15 = col_number(),X16 = col_number(),X17 = col_number(),X18 = col_number(), X19 = col_number(),X20 = col_number(),X21 = col_number(),X22 = col_number(),X23 = col_number(),X24 = col_number(), X25 = col_number(),X26 = col_number(), X27 = col_number(), X28 = col_number())) for(i in 1:9) { income[is.na(income[,paste(“X”,as.character(3i),sep="")]),paste(“X”,as.character(3i),sep="")]<-income[is.na(income[,paste(“X”,as.character(3*i),sep="")]),“X3”] } income <- rename(income, Bracket=X1) income <- rename(income, Frequency=X2) income <-rename(income, Income=X3) read_csv(“D:\\专业英文拓展\\hinc06.csv”, na=c("(B)"),文件名中\要\ skip = 9跳过了9行，col_names = FALSE,无列名 na=c("(B)")为缺失值；X2 = col_number()把每列的值的种类为数值 for循环将每第三列数据的缺失值换成对应行的第一项第三列数据 rename(income, Bracket=X1)将列重命名

p <- ggplot() + geom_point(data = income, aes(x = Income, y = Frequency/sum(Frequency), colour = “All”),pch=1) + geom_point(data = income, aes(x = X9, y = X8/sum(X8), colour = “White”)) + geom_point(data = income, aes(x = X18, y = X17/sum(X17), colour = “Black”)) + geom_point(data = income, aes(x = X24, y = X23/sum(X23), colour = “Asian”)) + geom_point(data = income, aes(x = X27, y = X26/sum(X26), colour = “Hispanic”))+ scale_colour_manual("",breaks = c(“All”, “White”, “Black”,“Asian”,“Hispanic”), values = c(“All”=“black”, “White”=“blue”, “Black”=“green”, “Asian”=“yellow”,“Hispanic”=“red”))+ scale_y_continuous(name=“Proportion in Income Bracket”)+ scale_x_continuous(name=“Average Income in Bracket”)+ggtitle(“US 2015 household income, CPS”) p colour = “All”,系统自动分配颜色，旁边的标签标记该颜色代表“All”，pch为点的种类，pch=1为点的种类为空心圆点 scale_colour_manual用法： p + scale_colour_manual(values = c(“red”,“blue”, “green”))#手动分配颜色 p + scale_colour_manual(values = c(“8” = “red”,“4” =“blue”,“6” = “green”)) #根据level分配颜色 cols <- c(“8” = “red”,“4” = “blue”,“6” = “darkgreen”, “10”= “orange”) #自己定义一个颜色向量 p + scale_colour_manual(values = cols) p + scale_colour_manual(values = cols, breaks = c(“8”, “6”,“4”)) #可以控制图例的顺序 p + scale_colour_manual(values = cols, breaks = c(“4”, “6”,“8”),labels =c(“four”, “six”, “eight”)) #可以控制图例的标签 p + scale_colour_manual(values = cols, limits = c(“4”, “8”))#控制显示哪些图例 #Notice that the values are matched with limits, and not breaks p + scale_colour_manual(limits = c(6, 8, 4), breaks = c(8, 4,6)，values =c(“grey50”, “grey80”, “black”)) scale_x_continuous(name="")可设定x轴坐标轴标签

#creating lorenz curve #Creating Lorenze Curve for all races together F <- cumsum(income $F r e q u e n c y) / s u m (i n c o m e$ Frequency) Q <- cumsum(income $I n c o m e * i n c o m e$ Frequency)/sum(income $I n c o m e * i n c o m e$ Frequency) lorenz <- data.frame(c(0,F),c(0,Q)) #Creating Lorenz Curve for Whites F <- cumsum(income $X 8) / s u m (i n c o m e$ X8) Q <- cumsum(income $X 9 * i n c o m e$ X8)/sum(income $X 9 * i n c o m e$ X8) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Blacks F <- cumsum(income $X 17) / s u m (i n c o m e$ X17) Q <- cumsum(income $X 18 * i n c o m e$ X17)/sum(income $X 18 * i n c o m e$ X17) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Asians F <- cumsum(income $X 23) / s u m (i n c o m e$ X23) Q <- cumsum(income $X 24 * i n c o m e$ X23)/sum(income $X 24 * i n c o m e$ X23) lorenz <- data.frame(lorenz,c(0,F),c(0,Q)) #Creating Lorenze Curve for Hispanics F <- cumsum(income $X 26) / s u m (i n c o m e$ X26) Q <- cumsum(income $X 27 * i n c o m e$ X26)/sum(income $X 27 * i n c o m e$ X26) lorenz<-data.frame(lorenz,c(0,F),c(0,Q)) names(lorenz)<-c(“F”,“Q”,“F.w”,“Q.w”,“F.b”,“Q.b”,“F.a”,“Q.a”,“F.h”,“Q.h”) lorenz

names(lorenz),对每个元素变量进行命名

#Compute Gini Index

#Define a new function to compute Gini Index mygini=function(F,Q){ n=length(F)} gini=1-sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) return(gini) ) GiniL=c(mygini(lorenz$F,lorenz$Q),mygini(lorenz$F.w,lorenz$Q.w),mygini(lorenz$F.b,lorenz$Q.b),mygini(lorenz$F.a,lorenz$Q.a),mygini(lorenz$F.h,lorenz$Q.h)) Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL) names(Gini)=c(“Race”,“Index”) Gini

#Make a plot ggplot(Gini)+ geom_point(aes(x=Race,y=Index)) + coord_flip() + theme( # remove the vertical grid lines panel.grid.major.x = element_blank() , # explicitly set the horizontal lines (or they will disappear too) panel.grid.major.y = element_line(linetype=3, color=“darkgray”))+ ylab(“Gini Index”) + ggtitle(“2015 US household income, CPS”) sum((F[-1]-F[-N])*(Q[-1]+Q[-n])) F[-1]删去第一列剩下的元素；F[-n]删去第n列剩下的元素，即对应相邻两列相减 Gini=data.frame(c(“All”,“White”,“Black”,“Asian”,“Hispanic”),GiniL)将数据结构又变成两列，names(Gini)=c(“Race”,“Index”)将两列命名 coord_flip() xy坐标轴转换 panel.grid.major.y = element_line(linetype=3, color=“darkgray”))linetype=3是虚线

#Load the data on genetic diversity diversitycodes <- read_delim(“D:\专业英文拓展\diversitycodes”, " ", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE) apply(diversitycodes,2,unique) # unique values in each column names(diversitycodes)<-c(“Code”,“Population”,“Location”,“Continent”) #Read the frequency data diversitydata <- read_delim(“D:\专业英文拓展\diversitydata.freqs”, " ", escape_double = FALSE, col_names = FALSE, col_types = cols(X2 = col_character(), X4 = col_double()), trim_ws = TRUE) names(diversitydata)<-c(“Marker”,“Allele”,“Population”,“Frequency”) filter(diversitydata, Marker == “D10S1208”, Allele== “167”) MARKER<-unique(diversitydata$Marker) POP<-unique(diversitydata$Population) #compute the diversity index Diversity=matrix(NA,ncol=377,nrow=52) for(i in 1:52) {for(j in 1:377) {Diversity[i,j]=1-sum(diversitydata[diversitydata $KaTeX parse error: Expected 'EOF', got '&' at position 18: \dotsrker==MARKER[j]&̲diversitydata$ Population==POP[i],4]^2) } } colnames(Diversity)=MARKER rownames(Diversity)=POP markdiv<-(apply(Diversity,2,mean)) markdiv<-sort(markdiv) hist(markdiv,xlab=“Diversity”, main=“Average Marker diversity across populations”)

trim_ws = TRUE去掉头尾的空格 apply(diversitycodes,2,unique) 2为按列apply ，names(diversitycodes) 对每列进行命名

Diversity<-data.frame(POP,apply(Diversity,1,mean),Diversity) names(Diversity)<-c(“Population”,“Average”,MARKER) Diversity<-inner_join(diversitycodes,Diversity,by=c(“Population”)) sort(unique(Diversity$Continent)) contcol<data.frame(sort(unique(Diversity$Continent)),c(“green”,“red”,“orange”,“yellow”,“azure”,“beige”,“blue”)) names(contcol)=c(“Continent”,“Color”) rownames(contcol)<-contcol$Continent Diversity<-inner_join(Diversity,contcol,by=c(“Continent”))

Diversity<-Diversity[order(Diversity$Continent,Diversity$Location),]

par(mar=c(8,4,4,2)) barplot(Diversity$Average,names=Diversity$Population,col=as.character(Diversity$Color),cex.lab=.3, las=3,main=“Average diversity”) contdiv<-tapply(Diversity $A v e r a g e, D i v e r s i t y$ Continent,mean) 第一行将数据结构分为三个部分，第二行将其命名 inner_join将Population中共同的部分连在一起，sort 返回排序后的数值向量 order()的返回值是对应“排名”的元素所在向量中的位置。 mar：以数值向量表示边界大小，顺序为"下、左、上、右"，单位为英分，默认值c(5, 4, 4, 2)+0.1 cex.lab表示修改坐标轴名称字体大小 las：刻度显示形式，las=3为竖着

par(mar=c(5,15,4,2)) barplot(contdiv[order(contdiv)],col=as.character(contcol[names(contdiv[order(contdiv)]),2]), main=“Continental Diversity”,horiz = T,las=1)

按排序画图 las参数坐标刻度标签的方向。0表示总是平行于坐标轴，1表示总是水平，2表示总是垂直于坐标轴。

Processed: 0.016, SQL: 9