Data science note 3

技术2022-07-11 109

each sample gives us potentially a different result:there is sampling variability we need to remember this sampling variability when interpreting the sampling results . A larger sample size resulted in smaller variability.

R play roulette

values=c(“00”,0:36) play=sample(values,1) play

#bet on red red_values=c(1,3,5 ,…) red_bet=function(spin_value){ return (spin_value %in% red_values)} red_bet(play)

normal distribution

x=rnorm(10000,mean=70,sd=4) t=seq(min(x),max(x),length.out=1000) hist(x,probability=TRUE,ylim=c(0,0.1)) lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

seq(from,to,length.out=) 表示生成一组从from到to的数量为num的数 by = ((to - from)/(length.out - 1)) 附seq的其它用法： seq(from, to) seq(from, to, by= ) seq(from, to, length.out= ) seq(along.with= ) seq(from) seq(length.out= )

interpreting the result from one sample

ssize=50 actualsample=sample(UGRace,ssize,replace=F) sum(actualsample==“Black or African American,non-Hispanic”)/ssize

in reality ,we have only one sample and we do not get to see the entire population.How can we get an idea of how wrong we could be ? to repeat the random process,we need a population to sample from. as a “best guess” for the population we can use the sample we actually observed. repeated sampling with replacement for the population defined by the sample is an idea.

bootstrap

code

ssize=50 actualsample=sample(UGRace,ssize,replace=F) B=10000 bsampleprop=NULL for(i in 1:B) { bootsample=sample(actualsample,ssize,replace=T) bsampleprop=c(bsampleprop,sum(bootsample==“Black or African American,non-Hispanic”)/ssize) }

a toy example

population=c(3,3,3,5,5,5,7,7,7,9,9,9,10,10,10) mean(population)

#here is our sample with its sample mean actualsample=sample(population,5,replace=F) mean(actualsample) #we can calculate the standard deviation of the bootstrap sample to estimate the standard error of the sample estimator sqrt(var(bsamplemean)) #or we can calculate a range of possible estimated value using quantiles quantile(bsamplemean,c(.05,.95))

bootstrap principle

the bootstrap works here because we are able to replicate the chance process of drawing from an unknown population by drawing with replacement from our observed sample works in the sense that it gives a pretty good estimate of the sampling variability.even when we do not know the exact population generating our sample. also works in that we did not have to use any complicated calculations to get this estimate.

bootstrap doesn’t always work

1.if our chance process were different,then drawing repeatedly from our observed sample will not help us. 2.we have to believe that the resampling process we use is a reasonable approximation to the way our data was generated. 3.we might call this no free lunch principle: we can use the computer to do calculations for us about the sampling processes only when we know how to tell the computer to mimic the sampling process

bootstrap summary

we have a sample to infer about the population. our estimate or summary from a sample will be different for different samples we need to be aware of sampling variability Bootstrap principle is a way to get to know sampling variability

step

1.start with data sample $x_1,...,x_n)$

2.draw a sample of size n with replacement from $x_1,...,x_n)$ call it $x_1^{*},...,x_n^{*})$ compute the summary statistics $S(x_1^{*},...,x_n^{*})$ repeat these steps for B times

3.finally we have B of variables from $S(x_1^{*},...,x_n^{*})$

1.drawback:it must replicate the same chance mechanism used to get the sample from the population 2.computation complexity

we have not commented on this before ,but a number of the histograms of the estimates derived by multiple samples look like normal distribution.

looking at the problem of estimating th mean for U[50,100].We have noted that as the sample size increases the variability of the sample estimates decreases

Uniform[a,b] mean=(b+a)/2 variance=(b^2 - a^2)/12 it depends on the length of the interval

causes of variability 1.sample size 2.variation in the population

data:x1,…,xn average $\bar{x}=\frac1n\sum^n_{i=1}x_i$ xi=1 if i th person is black =0 otherwise xbar=proportion of black students 1.taking sums(or averages) of many independent quantities we obtain a normal distribution(this is known at the central limit theorem) 2.this can be used as a definition of what the normal distribution is and where it comes from 3. it is also known as Gaussian,as Gauss derives it in Theoria motus corporum coelestium in sectionibus conicis solem ambientium(1809)(a fairly impressive work)

sampling variance I:the bootstrap

The variance of the bootstrap samples is an estimate of the variance of S(X1,…,Xn)across all possible samples average $((S(X_1,...,X_n)-S_b)^2)≈\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2$ standard error $((S(X_1,...,X_n))≈\sqrt{\frac1{B}\sum^{B}_{b=1}(S_b-\bar{S})^2}$

sampling variance II:

sample(X1,…,Xn) Sample summmary:S(X1,…,Xn)which we use to estimate the corresponding value in the population Spop the variance of the sample gives us a way to estimate of the variance of S(X1,…,Xn) across all possible samples $Average((S(X_1,...,X_n)-S_{pop})^2)≈\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n$ $Error(S(X_1,...,X_n))≈\sqrt{\frac1{n}\sum^{n}_{i=1}\frac{(X_i-\bar{X})^2}n}$

interval estimation the idea is to report not one number estimate ,but a range of possible values that you expect to cover the true value of the population summary two rules that we can write out and guarantee that ,in repeated experiments,the intervals cover the true value of the population summary 95% of the times For averages $(\bar{X}-\frac{2SD(X_1,....,X_n)}{\sqrt{n}},\bar{X}+\frac{2SD(X_1,....,X_n)}{\sqrt{n}}$

For Bootstrap $Quantile_{Bootsample}(0.025),Quantile_{Bootsample}(0.975))$

Rstudio

library(readr) stanford =read_delim(“D:\专业英文拓展\data\Stanford.txt”,"\t",escape_double=FALSE) stanford=stanford[,c(1,3)] names(stanford)=c(“Race/Ethnicity”,“Number”) stanford=stanford[-10,] UGRace=rep(Stanford$“Race/Ethnicity”,stanford$Number) par(mar=c(5,22,4,2)) barplot(sort(summary(as.factor(UGRace)))/length(UGRace),horiz=T,las=1)

"\t"默认分隔符为制表符，stanford=stanford[-10,]删去第十行 par(mar=c(5,22,4,2))# base R style graphics command for margins #par allows you to have multiple graphs side by side summary(as.factor(UGRace)))计算频数 ,horiz=T，坐标轴旋转，las=1字符均水平。 horizotal参数指定是否将图旋转90度使得x轴平行于纸的长边。horizontal=T,就是要旋转90°,从而绘制横向柱状图

ssize=100 B=1000 SamplePropB=NULL SamplePropH=NULL for(i in 1:B){observation=sample(UGRace,ssize,replace=FALSE) SamplePrppB=c(SamplePropB,sum(observation==“Black or African American,non-Hispanic”)/ssize) SamplePropB=c(SamplePropH,sum(obervation==“Hispanic/Llatino”)/ssize) } par(mfrow=c(1,2)) hist(SamplePropB,main=“Black or African American”,xlab=“Sample Proportion”,xlim=c(0,.3)) abline(v=0.064,col=2,lwd=2) hist(SamplePropH,main=“Hispanic/Latino”,xlab=“Sample proportion”,xlim=c(0,.4)) abline(v=0.16,col=2,lwd=2)

进行B=1000次模拟获得均值分布

#Game of Roullette values=c(“00”,0:36) play=sample(values,1) play

ssize1=100 ssize2=1000 B=1000 x=NULL y=NULL for(i in 1:B){ x=c(x,mean(red_bet(sample(value,ssize1,replace=TRUE)))) y=c(y,mean(red_bet(sample(value,ssize2,replace=TRUE)))) } par(mfrow=c(1,2)) hist(x,main=“100 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62)) abline(v=18/38,col=2,lwd=1) hist(y,main=“1000 spins”,xlab=“Proportion of red”,xlim=c(0.28,0.62)) abline(v=18/38,col=2,lwd=1)

#normal distribution t=seq(min(X),max(X),length.out=1000) hist(X,probability=TRUE,ylim=c(0,0.1)) lines(t,dnorm(t,mean=70,sd=4),col=2,lwd=2)

par(mfrow=c(1,2)) x=seq(-1,30,length,out=1000) plot(x,dnorm(x,mean=15,sd=4),main=“density of a normal,mean=15 and sd=4”) X=rnorm(1000,mean=15,sd=4) hist(X,main=“Histogram of 1000 samples from N(15,4)”,freq=FALSE) lines(x,dnorm(x,mean=15,sd=4),col=2)

#uniform distribution par(mfrow=c(1,2)) x=seq(-1,30,length,out=1000) plot(x,dunif(x,min=10,max=20),main=“density of a uniform between 10 and 20”) X=runif(1000,min=10,max=20) hist(X,main=“Histogram of 1000 samples from U[10,20]”,freq=FALSE) lines(x,dunif(x, min=10,max=20),col=2)

#chi-square distribution par(mfrow=c(1,2)) x=seq(0,30,length.out=1000) plot(x,dchisq(x,df=10),main=“density of a chi-square with 10 df”,ylab=“density”,pty=“l”) X=rchisq(1000,df=10) hist(X,main=“Histogram of 1000 samples from chi^2(10)”,freq=FALSE) lines(x,dchisq(x,df=10),col=2)

#Empirical density plots ssize=50 X=rnorm(ssize,mean=5,sd=2) hist(X,freq=FALSE) lines(density(X),col=“blue”,lwd=2) lines(sort(X),dnorm(sort(X),mean=5,sd=2),col=“red”,lwd=2)

#ssize=1000 #ssize=10000 ssize <- 100000 X <- rnorm(ssize,mean=5,sd=2) par(mfrow=c(1,2)) hist(X, freq=FALSE) lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2) hist(X, freq=FALSE,breaks=70) lines(sort(X), dnorm(sort(X),mean=5,sd=2), col=‘red’, lwd=2) hist(X, freq=FALSE,breaks=70) breaks 代表间隔点的个数，个数越多越密，越逼近正态

#Question :what do you observe from the above three plots?

#sample mean examples X=runif(10000,min=50,max=100) par(mfrow=c(1,1)) hist(X,freq=FALSE) abline(v=mean(X),col=2,lwd=2) mean(X) lines(sort(X),dunif(sort(X),min=50,max=100),col=“blue”)

ssize=c(10,100,1000,5000) B=1000 smean=matrix(nrow=B,ncol=length(ssize)) for(i in 1:B){ for(j in 1:length(ssize)){ x=runif(ssize[j],min=50,max=100) smean[i,j]=mean(X)}} par(mfrow=c(2,2)) for(i in 1:4){ hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab=paste(“Mean of a sample of size”,ssize[i]),xlim=c(60,90))} par(mfrow=c(2,2),mar=c(5,4,4,2)) for(i in 1:4){ hist(smean[,i],freq=FALSE,main=“Histogram of 1000 sample means”,xlab =paste(“Mean if a sample of size”,ssize[i]),xlim=c(60,90))}

#poisson distribution library(stats) X=rpois(10000,lambda=15) summary(X) par(mfrow=c(1,1)) hist(X,freq=FALSE) abline(v=mean(X),col=2,lwd=2) points(sort(X),dpois(sort(X),15),col=“blue”,pch=20)

ssize=c(10,100,1000,5000) B=1000 smean=matrix(nrow=B,ncol=length(ssize)) for(i in 1:B){ for(j in 1:length(ssize)){ X=rpois(ssize[j],15) smean[i,j]=mean(X) }} par(mfrow=c(2,2)) for(i in 1:4){ hist(smean[,i],freq=FALSE,main=“histogram of 1000 sample means”,xlab=paste(“mean of a sample of size”,ssize[i]),xlim=c(11,19)) }

#Effect of sample size on sample mean and sampling variability ssize=c(10,25,50,75,100,150,500,1000,5000,10000) B=1000 smean=matrix(NA,nrow=B,ncol=length(ssize)) for(i in 1:B) { for(j in 1:length(ssize)) { X=runif(ssize[j],min=50,max=100) smean[i,j]=mean(X) } } mean.sample=apply(smean,2,mean) sd.sample=sqrt(apply(smean,2,var)) par(mfrow=c(1,2)) plot(ssize,mean.sample,xlab=“sample size”,ylab="",main=“average value of estimate”) plot(ssize,sd.sample,xlab=“sample size”,ylab="",main=“standard deviation of estimate”) #plot in the log scale par(mfrow=c(1,2)) plot(ssize,log(mean.sample),xlab=“log(sample size)”,ylab="",main=“average value of estimate”) plot(ssize,log(sd.sample),xlab=“sample size”,ylab="",main=“standard deviation of estimate”)

#bootstrap example library(boot) data(“claridge”) summary(claridge) #let us have a look library(ggplot2) ggplot(claridge,aes(x=dnan,y=hand))+geom_count(alpha=0.5,col=“blue”)+ylab(“propensity to use left hand”)+xlab(“DNA variant”)+ggtitle(“claridge data”)

geom_count为计算重叠的点的计数图重叠的点越多点越大

B=10000 ssize=nrow(claridge) bcorr=NULL for(i in 1:B) { bsample=claridge[sample(1:ssize,ssize,replace=TRUE),] bcorr=c(bcorr,cor(bsample[,1],bsample[,2])) } par(mfrow=c(1,1)) hist(bcorr,main=“Bootstrap variability”,xlab="") nrow(claridge)计算行数，运用bootstrap，从样本中抽取样本：bsample=claridge[sample(1:ssize,ssize,replace=TRUE),]

library(dplyr) correlations=vector() for(i in 1:10000){ sample=sample_n(claridge,37,replace=T) correlations[i]=cor(samples[,1],samples[,2]) } correlations=data.frame(correlations) plot5=ggplot(data=correlations,aes(x=correlations))+geom_histogram(bins=20,fill=“cornflowerblue”,alpha=0.8)+labs(x=“correlation in Bootstrap sample”,title=“Bootstrap variability”)+theme_bw() plot5

mean(bcorr) sd(bcorr)

ggplot内data需为dataframe，bins=20柱状20条，fill为填充颜色，alpha为透明度

Processed: 0.012, SQL: 12