A bit more understanding of Cronbach’s alpha

Cronbach’s alpha is one of the most widely used indicators of scale reliability. It is often used without much concern for the data (that will be the subject of a different text) because it is simple to calculate and requires only a single administration of a single scale. The aim of this article is to provide some more insight into how this reliability coefficient works, without going into heavy mathematics. The main formula for the structure of variance is

V_total = V_real + V_error + 2 * Cov(real, error)

That is – the total variance is composed of the variance of the real results, the error variance and their covariance. Since the error is random and does not correlate with anything, we can drop the covariance part of the formula. Reliability is by definition the ratio of the variance of the real results to the total variance and is usually denoted r_xx. With this information we can reformulate the structure of the variance as

r_xx = V_real / V_total = 1 - V_error / V_total

Therefore, to find out about the reliability, we are really looking for the proportion of the error variance in the total variance. Now to Cronbach’s alpha. The formula for alpha is

alpha = (k / (k - 1)) * (1 - ΣV_i / V_total)

where k is the number of items and V_i are the variances of the individual items.

Let’s dissect this formula

In the first part we will concentrate on the rightmost part of this formula. The variance of any linear combination is the sum of the variances of the individual variables plus twice the sum of the covariances between the variables. If the intercorrelation between the variables is low, the test is not homogeneous and the variance of the linear combination consists mainly of the sum of the variances of the individual variables. If the intercorrelation is zero, the total variance is exactly equal to the sum of the variances of the individual variables. The

ΣV_i / V_total

part tells us how much of the total variance is attributable to the sum of the individual variances. This ratio has a maximum value of 1, which represents the situation where there is no intercorrelation between the variables. Therefore, if we subtract it from 1, we get the proportion of the variance attributable to the covariances of the individual variables.

To simplify things a bit, let’s work with standardized variables (M=0, V=1). In the formula for the variance of the linear combination

V_LC = ΣV_i + 2 * ΣCov_ij (the second sum runs over all pairs of variables i < j)

when we work with standardized variables, all individual variances are 1 and the covariances become correlations, so the formula can be replaced with

V_LC = k + 2 * Σr_ij

with k being the number of variables.

If the correlation between all pairs of variables is 1, the number of correlations is k(k-1)/2, so the formula for the total variance becomes

V_LC = k + 2 * (k(k-1)/2) * 1 = k + k(k-1) = k²

So to conclude: if the intercorrelations are zero, V_LC = k, and if the intercorrelations are 1, V_LC = k².

Therefore the importance of the covariances within the total variance grows as the intercorrelations between items become higher and as the number of items gets larger.
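
For example, with standardized variables and an intercorrelation of 0.5: for k = 2 the total variance is 2 + 2(0.5) = 3, so the covariances contribute 1/3 of it; for k = 10 the total variance is 10 + 90(0.5) = 55, so the covariances contribute 45/55 ≈ 0.82 of it.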

In the next part we can concentrate on the k/(k-1) part of the formula for alpha coefficient.

As we saw in the previous part, the ratio of the sum of the individual variances to the total variance depends on the intercorrelations between the items, but also on the number of items. If we did not correct for this, the maximum reliability would depend on the number of items, and scales with few items would be considered unreliable even if the intercorrelation between all items were 1. Let’s look at a scale with two items with correlation 1, without the correction:

1 - ΣV_i / V_total = 1 - 2/(2 + 2*1) = 1 - 2/4 = 0.5

so even with a perfect intercorrelation the uncorrected value is only 0.5.

As the number of items becomes larger, this matters less, because the weight of the number of items grows within the formula for the total variance. If the number of items is 100 and the intercorrelation is 1, the uncorrected part gives 1 - 100/100² = 0.99. Therefore it is necessary to introduce a correction factor that removes this penalization related to the number of items. The k/(k-1) part of the alpha formula serves that purpose: it has a maximum value of 2 when the number of items is 2 and approaches its minimum value of 1 as the number of items goes to infinity, so it corrects the second part of the alpha formula.
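
To see the correction at work, here is a minimal R sketch (my own helper, not part of the original derivation), assuming k standardized items that all share the same intercorrelation r; in that case the alpha formula simplifies to k*r / (1 + (k-1)*r):

#alpha for k standardized items with a common intercorrelation r
#(k/(k-1)) * (1 - k / (k + k*(k-1)*r)) simplifies to k*r / (1 + (k-1)*r)
alpha_equal_r <- function(k, r) k * r / (1 + (k - 1) * r)

alpha_equal_r(2, 1)   #returns 1: the correction restores the maximum for 2 perfectly correlated items
alpha_equal_r(100, 1) #returns 1: the same holds for 100 items
1 - 2/4               #0.5: the uncorrected part for 2 items with r = 1
1 - 100/100^2         #0.99: the uncorrected part for 100 items with r = 1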

To conclude, we have seen that the alpha coefficient is a simple way of measuring the homogeneity of a scale, and that we can increase this coefficient either by using variables that correlate highly with the other variables, or by adding more variables to the scale, or preferably – both.

The simulation in R

We will now use R to see what a simulation of such data shows. We will create sets of standardized variables with a given intercorrelation, and vary this intercorrelation and the number of variables.

In the first part we have to load the packages

library(MASS)
library(psych)
library(ggplot2)

Now we will iterate through sets of n_var variables with intercorrelation t_cor. The minimum number of variables is 3 because R produces a warning when calculating alpha on 2 variables while trying to compute alpha with an item deleted.

The results data frame has three columns that are needed for plotting: the intercorrelation, the number of variables in the scale and the obtained alpha coefficient. For each intercorrelation we create 48 different scales, with the number of items going from 3 to 50. To create the results we use this code:

res=data.frame() #the results data frame
step=0
for (t_cor in seq(from=0.1, to=0.9, by=0.2)){ #correlations go from 0.1 to 0.9
  for (n_var in 3:50){ #number of variables goes from 3 to 50
    rmat=diag(1,n_var) #create empty correlation matrix 
    for (i in 1:n_var){
      for (j in 1:n_var) {
        if (i!=j) {rmat[i,j]=t_cor} #fill the matrix with intercorrelations
      }
    }
    mu <- rep(0,n_var) #create vector of means for variables
    mat <- mvrnorm(100, Sigma = rmat, mu = mu, empirical = TRUE) #create variables using means and the correlation matrix
    res[step+n_var-2,1]=t_cor #save target intercorrelation
    res[step+n_var-2,2]=n_var #save number of variables in the scale 
    res[step+n_var-2,3]=psych::alpha(mat)$total$raw_alpha #save Cronbach's alpha
    #both ggplot2 and psych packages have an alpha function so it is explicitly declared
  }
  step=step+48 #this creates a new start for the next  target intercorrelation
}

And finally we plot the data

ggplot(data = res, aes(x=V2, y=V3)) + geom_line(aes(colour=as.factor(V1)))+
  xlab("Number of items")+
  ylab("Alpha")+
  labs(color="Intercorrelation")

So what we see here is that the value of the alpha coefficient indeed depends on the intercorrelation and on the number of items in the scale. The growth of alpha is quick at first, but as the number of items gets larger, the growth slows down.

In memory of Monty Hall

Some consider it common knowledge, some find it weird. As a professor I teach the Monty Hall problem every year, and year after year I see puzzled looks from students when we get to the solution.

Image taken from http://media.graytvinc.com/images/690*388/mon+tyhall.jpg

The original and simplest scenario of the Monty Hall problem is this: you are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (a car), while behind the others is a loss (a goat). You first choose a door (let’s say door A). The contest host then opens another door behind which there is a goat (let’s say door B), and asks whether you will stay with your original choice or switch to the remaining door. The question is: which is the better strategy?

Image taken from https://rohanurich.files.wordpress.com/2013/03/mhp-agc2.png

The basis of the answer lies in related and unrelated events. The most common answer is that it doesn’t matter which strategy you choose because the chance is 50/50 – but it is not. The 50/50 assumption is based on the idea that the first choice (one of three doors) and the second choice (stay or switch) are unrelated events, like flipping a coin twice. But in reality these are related events, and the second event depends on the first.

At the first step, when you choose one of the three doors, the probability that you picked the right door is 33.33%; in other words, there is a 66.67% chance that you are standing at the wrong door. The fact that in the second step you are given a choice between your door and the other remaining one doesn’t change the fact that you most likely started at the wrong door. Therefore, it is better to switch doors in the second step.
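
A quick way to write this down: P(win if you stay) = P(your first pick was the car) = 1/3, while P(win if you switch) = P(your first pick was a goat) = 2/3, because the host always opens a goat door, so whenever you started on a goat the remaining door hides the car.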

Simulation using R

To explore this a bit further, and as a nice exercise in R, let’s build a small simulation of the game.

First we load the necessary packages

library(ggplot2)
library(scales)

Then we create the possible door combinations

#create door combinations
a<-c(123,132,213,231,312,321)
 

What I did here was generate three-digit numbers. The first digit always says which door hides the car, and the other two digits say where the goats are.
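
For example, the combination 231 decodes like this, using the same integer arithmetic as in the simulation below (the variable names here are just for illustration):

doors <- 231
car_pos <- doors %/% 100                        #2: the car is behind door 2
goat1_pos <- (doors - car_pos*100) %/% 10       #3: one goat is behind door 3
goat2_pos <- doors - car_pos*100 - goat1_pos*10 #1: the other goat is behind door 1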

Now let’s prepare the vectors for the simulation

#create results vectors
car=integer(length=100000)
goat1=integer(length=100000)
goat2=integer(length=100000)
initial_choice=integer(length=100000)
open_door=integer(length=100000)
who_wins=character(length=100000)
 

Now we are ready for the simulation

#create 100.000 games
for (i in 1:100000){

  #set up a situation

  doors<-sample(a,1) #randomly pick a door combination
  car[i]<-doors %/% 100 #the first number is which door is the right door
  goat1[i]<-(doors-car[i]*100)%/%10 #where is the first wrong door
  goat2[i]<-doors-car[i]*100-goat1[i]*10 #where is the second wrong door

  #have a person select a random door
  initial_choice[i]<-sample(c(1,2,3),1)
  

#now we open the wrong door
  if (initial_choice[i]==car[i]){
    open_door[i]=sample(c(goat1[i],goat2[i]),1) #if the person is initially on the right door we randomly select one of the two wrong doors
  } else if (initial_choice[i]==goat1[i]) {
    open_door[i]=goat2[i]
  } else {open_door[i]=goat1[i]} #if the person is initially on the wrong door, we open the other wrong door  

  #stayer remains by his initial choice and switcher changes his choice
  if (initial_choice[i]==car[i]){who_wins[i]="Stayer"} else {who_wins[i]="Switcher"}


}
monty_hall=data.frame(car, goat1,goat2,initial_choice,open_door,who_wins)
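
Before plotting, a quick numeric check of the win proportions (they should come out close to 1/3 for the stayer and 2/3 for the switcher):

prop.table(table(monty_hall$who_wins))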

And now we have a nice record of 100,000 games. To put the most important result into a chart, we use ggplot2:

ggplot(data=monty_hall, aes(who_wins, fill=who_wins)) + 
  geom_bar(aes(y = (..count..)/sum(..count..))) + #crude but effective
  ylim(0,1)+
  ylab("Ratio")+
  xlab("Who wins?")+
  theme(legend.position = "none")

So it is definitely better to switch doors!

For more reading refer to https://en.wikipedia.org/wiki/Monty_Hall_problem

Happy coding 🙂

The Illusion of linearity trap when making bubble charts in ggplot2

Bubble charts are a great way to represent data when the instances vary greatly, e.g. the size of Canada compared to the size of Austria. However, this type of chart introduces a new dimension into the interpretation, because the data is read from the bubble size (area), not linearly. The mistake when building such charts is ignoring what is known as the illusion of linearity. This illusion (see this article for more) is the effect that people tend to judge proportions linearly even when they are not linear. For example, a common mistake is to think that a pizza with a diameter of 20 cm is two times larger than a pizza with a diameter of 10 cm, while in fact the first pizza is 4 times larger, because we should judge the area and not the diameter. The first pizza has an area of 314 cm² (r²Π) and the second one 78.5 cm² → 314/78.5 = 4. Now back to bubble charts…
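
The same arithmetic in R, just to make the ratio explicit:

pi * (20/2)^2            #area of the 20 cm pizza, about 314 cm2
pi * (10/2)^2            #area of the 10 cm pizza, about 78.5 cm2
(pi * 10^2) / (pi * 5^2) #4: doubling the diameter quadruples the area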

For this example I have loaded ggplot2 and created a simple dataset with three variables – x, y and size.

library(ggplot2)

dat=data.frame(c(1,2,3),c(2,3,1),c(10,20,30))
colnames(dat)<-c("x","y","size")
dat

The resulting dataset provides the coordinates and the bubble sizes for the chart. Now let’s create the chart with annotated bubble sizes.

q<-ggplot(dat,aes())
q<-q + geom_point(aes(x=x,y=y), size=dat$size) #create bubbles
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme
q

The chart looks like this:

[Rplot01: bubble chart with the size variable mapped directly to point size]

The basic issue is that the smallest bubble looks as if it were 9 times smaller than the largest bubble instead of 3 times smaller, because the size parameter of geom_point controls the diameter, not the area.
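
In numbers: the size parameters are 10 and 30, so the drawn diameters differ by a factor of 3 and the perceived areas by a factor of 3² = 9, while the underlying values differ only by a factor of 3.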

To correct for this and make the chart interpretable, we will apply a simple transformation to the size parameter in geom_point: the square root.

q<-ggplot(dat,aes())
q<-q + geom_point(aes(x=x,y=y), size=sqrt(dat$size)*10) #create bubbles with a correct scale
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme
q

The multiplication of the square root of the size by a factor of 10 is just to make the bubbles large enough relative to the axis limits.

The chart now looks like this:

[Rplot02: bubble chart with point size proportional to the square root of the size variable]

The areas are now in the correct scale and the bubbles are proportional to the size variable.

Of course, if we wanted to draw three-dimensional shapes, the correction factor would be the cube root, because when the diameter is increased by a factor of n, the volume is increased by a factor of n³ (e.g. a sphere with double the diameter has 8 times the volume).

Happy charting 🙂


This post was motivated by a lot of wonderful blogs on http://www.R-bloggers.com