On the relationship of the sample size and the correlation

This has bugged me for some time now. There is "common knowledge" that the size of a correlation depends on the variability, i.e. the higher the variability, the higher the correlation. However, when you put this into practice, there seems to be confusion about what this really means.

To analyse this I have divided this issue into two approaches – the wrong one and the right one 🙂

The wrong approach

I have heard this too many times: "The variability depends on the sample size". This reasoning comes from the idea that the smaller the sample, the fewer distinct data values (data points) you will have, and thus the smaller the sensitivity and the smaller the variability.

The error here is that the variability (variance) of the sample does not in reality depend on the sample size, but on how representative the sample is of the population. However, the sampling error becomes higher when the sample is smaller, and because of this we have a higher probability of making a mistake when we try to infer the population parameter. So theoretically, if we choose the right (representative) sample, we will get the same variance as in the population, but if the sample is not representative, the sample variance will be higher or lower than the population parameter.
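Before the full analysis, this claim is easy to check with a quick sketch (the object names here are my own): the average sample variance stays near the population variance at any sample size; only its spread around that value changes.

```r
set.seed(1)
pop <- rnorm(1e5, mean = 100, sd = 20)          # population with variance ~400

# sample variances over many draws, for a small and a large sample
small <- replicate(500, var(sample(pop, 10)))
large <- replicate(500, var(sample(pop, 1000)))

mean(small)  # both means sit near 400...
mean(large)
sd(small)    # ...but the small-sample variances scatter much more widely
sd(large)
```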

Let's go back to the beginning: the idea that as the sample gets larger the correlation gets higher is wrong. In reality, what happens is that as the sample gets larger the error gets lower, and the sample correlation converges to the population parameter.

So let’s try this out in R

To put this into perspective, I have generated two random normal variables with 1 000 000 cases and correlation ≈ 0.6 that represent the population.


library(MASS)      #for mvrnorm
library(rockchalk) #for lazyCov
library(ggplot2)   #for plotting

set.seed(12345) #set seed for reproducibility

myCov=lazyCov(Rho=0.6, Sd=20, d=2) #define covariance matrix for two variables with sd=20 and correlation 0.6
myData=data.frame(mvrnorm(n=1000000, mu=c(100,100), Sigma=myCov)) #create two variables with specified covariance matrix and M=100 --> Population = 1 000 000

pop_cor=cor(myData[,1], myData[,2]) #calculate population correlation

Now I create 1000 samples that go from 5 to 5000 cases in steps of 5 (5, 10, 15, …), and for each sample I save the sample size, the correlation between the two variables and the absolute deviation of the sample correlation from the population correlation.

rez=data.frame() # result data frame

for (i in (1:1000)){ #iterate through samples
  sampleData=myData[sample(nrow(myData),i*5),] #select random sample from myData with size i*5
  q=cor(sampleData$X1,sampleData$X2, method = "pearson") #calculate correlation of the sample
  rez[i,1]=i*5 #sample size - V1
  rez[i,2]=q #sample correlation - V2
  rez[i,3]=abs(q-pop_cor) #absolute deviation from the population correlation - V3
}

To present the results I have created two charts using ggplot2. The first chart presents the sample correlation in relation to the sample size.

#plot chart of correlation and sample size
ggplot(data=rez, aes(x=V1, y=V2))+
  geom_point()+ #add points
  labs(title="Correlation and sample size", x="Sample size", y="Correlation")+
  annotate("text", x = 4000, y = 0.75, label = paste("Population r = ", round(pop_cor,3)))

Although the stabilization of the correlation is evident, it is also visible that the correlation doesn't get higher as the sample size grows; the sample correlations fall equally above and below the population parameter.

Another way to look at this is to show the absolute deviation from the population correlation (abs(r_sample – r_population)).

#plot chart of absolute deviation from population r and sample size (include this correlation in annotation)
ggplot(data=rez, aes(x=V1, y=V3))+
  geom_point()+ #add points
  labs(title="Absolute deviation from population correlation and sample size", x="Sample size", y="Absolute deviation")+
  annotate("text", x = 4000, y = 0.2, label = paste("r = ", round(cor(rez$V1,rez$V3),3)))

The results show that as the sample gets larger the error gets lower, and with a simple correlation (although a linear model is not quite appropriate here) we see that the correlation between the error and the sample size is -0.41.

To conclude, it is clearly evident both logically and from the simulation that the size of the sample correlation does not depend on the sample size (in my example r = -0.016).

The right approach

The reasoning here is that in reality we have two variabilities. The first one is the phenomenon variability, and the second one is the measurement variability. The phenomenon variability happens in nature, and it is something that we try to capture by measurement. For example, if you measured my height in centimetres, you would say my height is 180 cm; if you used millimetres, it would be 1805 mm; and if you used metres, it would be 2 m. What this means is that when you use centimetres, you say that all people with a height between 179.5 and 180.5 cm have the same height, and thus you ignore their natural variability.
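The height example can be mimicked in R (a small sketch with made-up heights): measuring in coarser units collapses distinct values, which is exactly the loss of measurement variability described above.

```r
set.seed(42)
height_mm <- rnorm(1000, mean = 1800, sd = 70)  # hypothetical heights in millimetres

length(unique(round(height_mm)))         # mm scale: many distinct values
length(unique(round(height_mm / 10)))    # cm scale: far fewer distinct values
length(unique(round(height_mm / 1000)))  # m scale: almost everyone is "2 m"
```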

The relationship between the correlation size and the variability comes from the measurement variability. In psychology and similar social sciences, it is quite common to simplify things for the participants by using 1-5 scales (Likert type). It is also possible to use 1-7, 1-9 etc. scales to measure the same phenomenon. The point here is that by choosing the smaller scales (fewer distinct values) we directly influence the measurement variability (again, not the phenomenon variability). The other scenario is when we organize continuous data into "bins" (groups), such as deciles, percentages, etc. This is also a reduction of the measurement variability, because we put participants who are different into the same category.

A simulation in R

To prove the point, I have used the same dataset as in the first example. The original data is simulated with 1 million cases, and since the values are numeric and random, we have 1 million different data points.

For the simulation, I have divided the data from the X1 variable into bins of varying width. The first simulation has a bin width of 1 and 184 unique data points, the second simulation has a bin width of 2 and 97 unique data points, and so on until a bin width of 100, where all of the 1 million cases are sorted into 3 unique data points. This means that when the bin width is 1, all of the participants in the range 98.5-99.49 get the same value; when the bin width is 2, all participants with values 97.5-99.5 get the same value, etc.

For each simulation I save the number of data points, the standard deviation of the grouping variable and its correlation with the X2 variable (remember that the original X1-X2 correlation was 0.599).

rez2=data.frame() #result data frame

for (i in 1:100) {
  myData$Group1=as.numeric(cut_width(myData$X1,i, center=100)) #transform X1 variable by creating bins of width i
  rez2[i,1]=length(unique(myData$Group1)) #number of distinct data points in the transformed X1 variable - V1
  rez2[i,2]=sd(myData$Group1)  #standard deviation of transformed X1 variable - V2
  rez2[i,3]=cor(myData$Group1,myData$X2)  #correlation of transformed X1 and X2 variable - V3
}

Before plotting the chart, I calculated a linear model predicting the number of unique data points from the standard deviation. This was necessary because I wanted to include a secondary x axis in the chart, and the sec_axis command works only as a transformation of the primary axis. The simplest solution was a linear model, because the correlation between the standard deviation and the number of unique data points was 0.999.

tr_form=lm(rez2$V1~rez2$V2) #linear model for the transformation of the secondary x axis

ggplot(data=rez2, aes(x=V2, y=V3))+ #basic mapping
  geom_point()+ #adding points
  scale_x_continuous(    #definition of x axis
    name = "Standard deviation",
    trans="reverse",  #reverse the order of points on x axis
    sec.axis = sec_axis( #definition of secondary x axis
      ~ tr_form$coefficients[1] + . * tr_form$coefficients[2] , #transformation formula inherited from linear model
      name = "Number of distinct data points"))+
  labs(title="Correlation and the reduction of variability", y="Correlation")+ #title and y axis name
  geom_hline(yintercept = pop_cor, color="red")+ #add the red line for the original correlation
  annotate("text", x=15,y=pop_cor+0.015, label="Original correlation") #annotate the red line

The chart shows that as the number of unique data points becomes smaller, the correlation also becomes smaller. The original SD of X1 was about 20, and we see that in the beginning (bin width = 1) the SD of the grouping variable is similar to the original SD and the correlation is almost equal to the original correlation. But once the number of unique data points falls below 50 the correlation starts to decrease faster, and below 20 we see a sharp drop.

The conclusion

So to conclude, the two key points here are that it is really important to have representative samples regardless of their size (but larger is better 🙂 ), and that when calculating correlations you should refrain from grouping the values.

Happy plotting 🙂

A bit more understanding of Cronbach’s alpha

Cronbach's alpha reliability coefficient is one of the most widely used indicators of scale reliability. It is often used without concern for the data (that will be a different text) because it is simple to calculate and requires only one administration of a single scale. The aim of this article is to provide some more insight into the functioning of this reliability coefficient without going into heavy mathematics. The main formula for the structure of the variance is

V_total = V_real + V_error + 2·Cov(real, error)

That is – the total variance is composed of the variance of the real results, the error variance and their covariance. Since the error is random and does not correlate with anything, we can drop the covariance part of the formula. Reliability is by definition the ratio of the variance of the real results to the total variance and is usually represented with the symbol rxx. With this information we can reformulate the structure of the variance as

V_total = rxx·V_total + V_error

Therefore, to find out about the reliability, we are searching for the proportion of the error variance. Now to Cronbach's alpha. The formula for alpha is

α = (k/(k−1)) · (1 − ΣVi / V_total)

where k is the number of items and ΣVi is the sum of the variances of the individual items.

Let’s dissect this formula

In the first part we will concentrate on the rightmost part of this formula. The variance of any linear combination is the sum of the variances of the individual variables plus a double sum of the covariances between the variables. If the intercorrelation between the variables is low, the test is not homogeneous and the variance of the linear combination is composed mainly of the sum of the variances of the individual variables. If the intercorrelation is zero, the total variance is equal to the sum of the variances of the individual variables. The

ΣVi / V_total

part gives us the information of how much of the total variance is attributed to the sum of the individual variances. This ratio has a maximum value of 1, which represents the situation when there is no intercorrelation between the variables. Therefore, if we subtract it from 1, we get the proportion of the variance attributed to the covariances of the individual variables.

To simplify things a bit, let's work with standardized variables (M=0, V=1). In the formula for the variance of the linear combination

VLC = ΣVi + 2·ΣΣ Cov(Xi, Xj)

when we work with standardized variables, all individual variances are 1 and the covariances become correlations, so the formula can be replaced with

VLC = k + 2·ΣΣ rij

with k being the number of variables.

If the correlation between all pairs of variables is 1, the number of correlations is k(k−1)/2, so the formula for the total variance would be

VLC = k + 2·(k(k−1)/2)·1 = k + k² − k = k²

So to conclude: if the intercorrelations are zero, VLC = k, and if the intercorrelations are 1, VLC = k².
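These two endpoints can be verified numerically (a small sketch of my own; rmat is just a constant-correlation matrix built for the check): the variance of the sum of k standardized variables equals the sum of all entries of their correlation matrix.

```r
k <- 5
vlc <- function(r) {            # variance of the sum of k standardized variables
  rmat <- matrix(r, k, k)       # constant intercorrelation r between all pairs
  diag(rmat) <- 1               # unit variances on the diagonal
  sum(rmat)                     # V = sum of variances + 2 * sum of covariances
}

vlc(0)  # 5  = k:   no intercorrelation
vlc(1)  # 25 = k^2: perfect intercorrelation
```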

Therefore the importance of the covariances within the total variance grows as the intercorrelations between items become higher, and when the number of items gets larger.

In the next part we can concentrate on the k/(k-1) part of the formula for alpha coefficient.

As we have seen in the previous part, the ratio of the sum of the individual variances to the total variance depends on the intercorrelations between the items, but also on the number of items. If we did not correct this number, the maximum reliability would depend on the number of items, and scales with fewer items would be considered unreliable even if the intercorrelation between all items were 1. Let's see the example of a scale with two items with correlation 1, without the correction:

1 − ΣVi/V_total = 1 − 2/4 = 0.5

As the number of items becomes larger, this matters less, since the weight of the number of items grows within the formula for the total variance. If the number of items is 100 and the intercorrelation is 1, this expression gives us 1 − 100/10000 = 0.99. Therefore it is necessary to introduce a correction factor that removes the penalization related to the number of items. The k/(k−1) part of the alpha formula serves that purpose: it has a maximum value of 2 when the number of items is 2 and approaches 1 as the number of items grows towards infinity, so it corrects the second part of the alpha formula.
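A quick numeric check of the correction (my own sketch of alpha for the constant-correlation, standardized case): with the k/(k−1) factor, perfectly correlated items reach alpha = 1 regardless of how many there are.

```r
alpha_const <- function(k, r) {        # alpha for k standardized items with equal intercorrelation r
  total_var <- k + k * (k - 1) * r     # variance of the sum of the items
  (k / (k - 1)) * (1 - k / total_var)  # Cronbach's alpha
}

alpha_const(2, 1)    # 1: the correction removes the penalty for few items
alpha_const(100, 1)  # also 1 (up to floating point)
```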

To conclude, we have seen that the alpha coefficient is a simple way of measuring the homogeneity of a scale, and that we can improve this coefficient either by using variables that correlate highly with the other variables, or by adding more variables to the scale, or preferably – both.

The simulation in R

We will now use R to see what a simulation of the data shows. We will create sets of standardized variables with a given intercorrelation, and vary this intercorrelation and the number of variables.

In the first part we have to load the packages

library(MASS)    #for mvrnorm
library(psych)   #for the alpha coefficient
library(ggplot2) #for plotting
Now we will iterate through sets of n_var variables with intercorrelation t_cor. The minimum number of variables is 3, because R produces a warning when calculating alpha on 2 variables while trying to compute alpha with an item deleted.

The results data frame has three columns that are necessary for plotting – the intercorrelation, the number of variables in the scale and the obtained alpha coefficient. For each intercorrelation we create 48 different scales, with the number of items going from 3 to 50. To create the results we use this code

res=data.frame() #the results data frame
step=0 #row offset within the results data frame
for (t_cor in seq(from=0.1, to=0.9, by=0.2)){ #correlations go from 0.1 to 0.9
  for (n_var in 3:50){ #number of variables goes from 3 to 50
    rmat=diag(1,n_var) #create identity correlation matrix
    for (i in 1:n_var){
      for (j in 1:n_var) {
        if (i!=j) {rmat[i,j]=t_cor} #fill the matrix with intercorrelations
      }
    }
    mu <- rep(0,n_var) #create vector of means for variables
    mat <- mvrnorm(100, Sigma = rmat, mu = mu, empirical = TRUE) #create variables using means and the correlation matrix
    res[step+n_var-2,1]=t_cor #save target intercorrelation
    res[step+n_var-2,2]=n_var #save number of variables in the scale
    res[step+n_var-2,3]=psych::alpha(mat)$total$raw_alpha #save Cronbach's alpha
    #both ggplot2 and psych packages have an alpha function so it is explicitly declared
  }
  step=step+48 #this creates a new start for the next target intercorrelation
}

And finally we plot the data

ggplot(data = res, aes(x=V2, y=V3)) + geom_line(aes(colour=as.factor(V1)))+
  xlab("Number of items")+
  ylab("Cronbach's alpha")+
  labs(colour="Intercorrelation")

So what we see here is that the value of the alpha coefficient indeed depends on the intercorrelation and the number of items in the scale. The growth of alpha is quick at first, but as the number of items gets larger, the growth slows down.

In memory of Monty Hall

Some find it common knowledge, some find it weird. As a professor, I usually teach the Monty Hall problem, and year after year I see puzzled looks from students regarding the solution.

Image taken from http://media.graytvinc.com/images/690*388/mon+tyhall.jpg

The original and most simple scenario of the Monty Hall problem is this: you are in a prize contest, and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (a car), while behind the others is a loss (a goat). You first choose a door (let's say door A). The contest host then opens another door behind which there is a goat (let's say door B), and asks whether you will stay with your original choice or switch to the other door. The question is: which strategy is better?

image taken from https://rohanurich.files.wordpress.com/2013/03/mhp-agc2.png

The basis of the answer lies in related and unrelated events. The most common answer is that it doesn't matter which strategy you choose because it is a 50/50 chance – but it is not. The 50/50 assumption is based on the idea that the first choice (one of three doors) and the second choice (stay or switch) are unrelated events, like flipping a coin twice. But in reality these are related events, and the second event depends on the first one.

At the first step, when you choose one of three doors, the probability that you picked the right door is 33.33%; in other words, there is a 66.67% chance that you are on the wrong door. The fact that in the second step you are given a choice between your door and the other one doesn't change the fact that you most likely started with the wrong door. Therefore, it is better to switch doors in the second step.
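This argument can be checked by plain enumeration before any simulation (a minimal sketch): the switcher wins exactly when the initial pick is wrong, which happens in 6 of the 9 equally likely (car, pick) combinations.

```r
combos <- expand.grid(car = 1:3, pick = 1:3)  # all equally likely starting situations
mean(combos$car != combos$pick)               # P(switcher wins) = 6/9 = 2/3
```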

Simulation using R

To explore this a bit further, and as a nice exercise in R, let's create a small simulation of games.

First we load the necessary packages

library(ggplot2) #for plotting
Then we create the possible door combinations

#create door combinations
a<-c(123,132,213,231,312,321) #each permutation: car door first, then the two goat doors
So what I did was generate three-digit numbers. The first digit says behind which door the car is, and the other two digits say where the goats are.

Now let's prepare the vectors for the simulation

#create results vectors
car<-vector(length=100000) #winning door in each game
goat1<-vector(length=100000) #first wrong door
goat2<-vector(length=100000) #second wrong door
initial_choice<-vector(length=100000) #door initially chosen by the player
open_door<-vector(length=100000) #door opened by the host
who_wins<-vector(length=100000) #winning strategy in each game

Now we are ready for the simulation

#create 100,000 games
for (i in 1:100000){

  #set up a situation
  doors<-sample(a,1) #randomly pick a door combination
  car[i]<-doors %/% 100 #the first digit is the right door
  goat1[i]<-(doors-car[i]*100)%/%10 #where is the first wrong door
  goat2[i]<-doors-car[i]*100-goat1[i]*10 #where is the second wrong door

  #have a person select a random door
  initial_choice[i]<-sample(1:3,1)

  #now we open a wrong door
  if (initial_choice[i]==car[i]){
    open_door[i]=sample(c(goat1[i],goat2[i]),1) #if the person is initially on the right door we randomly open one of the two wrong doors
  } else if (initial_choice[i]==goat1[i]) {
    open_door[i]=goat2[i] #if the person is on the first wrong door, we open the second wrong door
  } else {open_door[i]=goat1[i]} #if the person is on the second wrong door, we open the first wrong door

  #stayer remains with his initial choice and switcher changes his choice
  if (initial_choice[i]==car[i]){who_wins[i]="Stayer"} else {who_wins[i]="Switcher"}
}

monty_hall=data.frame(car, goat1,goat2,initial_choice,open_door,who_wins)

And now we have a nice analysis of 100,000 games. To put the most important result into a chart, we use ggplot2

ggplot(data=monty_hall, aes(who_wins, fill=who_wins)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) + #crude but effective
  xlab("Who wins?")+
  theme(legend.position = "none")


So it is definitely better to switch door!

For more reading refer to https://en.wikipedia.org/wiki/Monty_Hall_problem

Happy coding 🙂

The Illusion of linearity trap when making bubble charts in ggplot2

Bubble charts are a great way to represent data where instances vary greatly, e.g. the size of Canada compared to the size of Austria. However, this type of chart introduces a new dimension into the interpretation of the data, because the data is read from the bubble size (area), and not linearly. The mistake when building such charts is to ignore what is known as the illusion of linearity. This illusion (see this article for more) is the effect that people tend to judge proportions linearly even when they are not linear. For example, a common mistake is to think that a pizza with a diameter of 20 cm is two times larger than a pizza with a diameter of 10 cm, while in fact the first pizza is 4 times larger than the second one, because we judge the size of an area and not the diameter. The first pizza has an area of 314 cm² (r²π) and the second one 78.5 cm² → 314/78.5 = 4. Now back to bubble charts…
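The pizza arithmetic in one line:

```r
area <- function(d) pi * (d / 2)^2  # circle area from diameter
area(20) / area(10)                 # 4: doubling the diameter quadruples the area
```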

For this example I have loaded ggplot2 and created a simple dataset with three variables – x, y and size (illustrative values, chosen so that the largest size is three times the smallest):

library(ggplot2) #for plotting
dat<-data.frame(x=c(0,2,4), y=c(0,2,4), size=c(3,6,9)) #coordinates and bubble sizes
q<-ggplot() #empty chart to which the layers are added
The resulting dataset provides the coordinates and the bubble sizes for the chart. Now let's create the chart with annotated bubble sizes.

q<-q + geom_point(aes(x=x,y=y), size=dat$size) #create bubbles
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme

The chart looks like this:


The basic issue is that the smallest bubble looks as if it is 9 times smaller than the largest bubble instead of 3 times smaller, because the size parameter of geom_point is determined by the diameter and not by the area.

To correct for this and make the chart interpretable, we will use a simple transformation of the size parameter in geom_point: the square root.

q<-q + geom_point(aes(x=x,y=y), size=sqrt(dat$size)*10) #create bubbles with a correct scale
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme

The multiplication of the square-rooted size by a factor of 10 is just to make the bubbles large enough compared to the limits of the axes.

The chart now looks like this:


The areas are now in the correct scale and the bubbles are proportional to the size variable.

Of course, if we were to use three-dimensional shapes, the correction factor would be the cube root, because when the diameter is increased by a factor of n, the volume is increased by a factor of n³.
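The 3D case can be sketched the same way (hypothetical values of my own): scaling the diameters by the cube root makes the volumes proportional to the data.

```r
size <- c(1, 8, 27)       # hypothetical data values to encode as sphere volumes
diameter <- size^(1/3)    # cube-root scaling: diameters 1, 2, 3
diameter^3                # volumes recover the original 1:8:27 ratios
```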

Happy charting 🙂


This post was motivated by a lot of wonderful blogs on http://www.R-bloggers.com