In memory of Monty Hall

Some find it common knowledge, others find it weird. As a professor I regularly teach the Monty Hall problem, and year after year I see puzzled looks from students when they hear the solution.

Image taken from http://media.graytvinc.com/images/690*388/mon+tyhall.jpg

The original and simplest scenario of the Monty Hall problem is this: you are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (a car), while behind each of the others is a loss (a goat). You first choose a door (let’s say door A). The contest host, who knows where the car is, then opens another door behind which is a goat (let’s say door B), and asks whether you will stay with your original choice or switch to the other closed door. The question is: which is the better strategy?

image taken from https://rohanurich.files.wordpress.com/2013/03/mhp-agc2.png

The basis of the answer lies in related and unrelated events. The most common answer is that it doesn’t matter which strategy you choose because it is a 50/50 chance, but it is not. The 50/50 assumption is based on the idea that the first choice (one of three doors) and the second choice (stay or switch) are unrelated events, like flipping a coin two times. In reality they are related events, and the second event depends on the first.

At the first step, when you choose one of three doors, the probability that you picked the right door is 33.33%, or in other words, there is a 66.67% chance that you are on the wrong door. The fact that in the second step you are given a choice between your door and the other one doesn’t change the fact that you are most likely starting with the wrong door. And because the host has just removed a goat, whenever you started on a wrong door the remaining closed door must hide the car. Therefore, it is better to switch doors in the second step.
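The argument can also be checked by listing every equally likely case. The following is only a minimal sketch, not part of the original simulation: the names games, p_stay and p_switch are made up for illustration, and it assumes the host always opens a goat door.

```r
# Enumerate every equally likely (car position, first pick) pair.
games <- expand.grid(car = 1:3, pick = 1:3)

# Staying wins only when the first pick was already the car.
games$stay_wins <- games$car == games$pick

# Assumption: the host always reveals a goat, so the one remaining closed
# door hides the car whenever the first pick was wrong.
games$switch_wins <- !games$stay_wins

p_stay   <- mean(games$stay_wins)    # 3 of the 9 cases
p_switch <- mean(games$switch_wins)  # 6 of the 9 cases
print(c(stay = p_stay, switch = p_switch))
```

Three of the nine cases favour staying and six favour switching, which matches the 33.33% and 66.67% above.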

Simulation using R

To explore this a bit further, and to have a nice exercise with R, let’s create a small simulation of the game.

First we load the necessary packages

library(ggplot2)
library(scales)

Then we create the possible door combinations

#create door combinations
 a<-c(123,132,213,231,312,321)
 

What I did here was generate three-digit numbers. The first digit always says which door hides the car, and the other two digits say where the goats are.
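To see how such a number can be taken apart, here is a small sketch for one example value. The variable names are hypothetical, and it uses the modulo operator %% as one possible way to isolate the digits.

```r
doors <- 231  # example code: car behind door 2, goats behind doors 3 and 1

car_door   <- doors %/% 100          # hundreds digit -> 2
goat1_door <- (doors %% 100) %/% 10  # tens digit     -> 3
goat2_door <- doors %% 10            # units digit    -> 1

print(c(car = car_door, goat1 = goat1_door, goat2 = goat2_door))
```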

Now let’s prepare the vectors for the simulation

#create results vectors
 car=integer(length=100000)
 goat1=integer(length=100000)
 goat2=integer(length=100000)
 initial_choice=integer(length=100000)
 open_door=integer(length=100000)
 who_wins=character(length=100000)
 

Now we are ready for the simulation

#create 100,000 games
for (i in 1:100000){

  #set up a situation

  doors<-sample(a,1) #randomly pick a door combination
  car[i]<-doors %/% 100 #the first number is which door is the right door
  goat1[i]<-(doors-car[i]*100)%/%10 #where is the first wrong door
  goat2[i]<-doors-car[i]*100-goat1[i]*10 #where is the second wrong door

  #have a person select a random door
  initial_choice[i]<-sample(c(1,2,3),1)
  

#now we open the wrong door
  if (initial_choice[i]==car[i]){
    open_door[i]=sample(c(goat1[i],goat2[i]),1) #if the person is initially on the right door we randomly select one of the two wrong doors
  } else if (initial_choice[i]==goat1[i]) {
    open_door[i]=goat2[i]
  } else {open_door[i]=goat1[i]} #if the person is initially on the wrong door, we open the other wrong door  

  #stayer remains by his initial choice and switcher changes his choice
  if (initial_choice[i]==car[i]){who_wins[i]="Stayer"} else {who_wins[i]="Switcher"}


}
monty_hall=data.frame(car, goat1,goat2,initial_choice,open_door,who_wins)

We now have a nice record of 100,000 games. To put the most important result into a chart we use ggplot2

ggplot(data=monty_hall, aes(who_wins, fill=who_wins)) + 
  geom_bar(aes(y = after_stat(count / sum(count)))) + #share of games; after_stat() replaces the deprecated ..count.. (ggplot2 >= 3.3.0)
  ylim(0,1)+
  ylab("Ratio")+
  xlab("Who wins?")+
  theme(legend.position = "none")


So it is definitely better to switch doors: the switcher wins about two thirds of the games.
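As a cross-check, the same result can be obtained without a loop. This is only a sketch under the same rules as the simulation above; the seed and the names n_games, pick and win_share are arbitrary choices made for this example.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
n_games <- 100000

car  <- sample(1:3, n_games, replace = TRUE)  # door hiding the car
pick <- sample(1:3, n_games, replace = TRUE)  # contestant's first choice

# Same rule as in the loop: the stayer wins when the first pick is the car,
# otherwise the switcher wins.
who_wins <- ifelse(pick == car, "Stayer", "Switcher")

# Share of games won by each strategy
win_share <- prop.table(table(who_wins))
print(round(win_share, 3))
```

The door opened by the host does not need to be simulated here, because it never changes who wins.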

For more reading refer to https://en.wikipedia.org/wiki/Monty_Hall_problem

Happy coding 🙂


The Illusion of linearity trap when making bubble charts in ggplot2

Bubble charts are a great way to represent data when you have instances that vary greatly, e.g. the size of Canada compared to the size of Austria. However, this type of chart introduces a new dimension in the interpretation of data, because the data is interpreted by the bubble size (area) and not linearly. A common mistake when building such charts is to ignore what is known as the illusion of linearity. This illusion (see this article for more) is the tendency of people to judge proportions linearly even when they are not linear. For example, a common mistake is to think that a pizza with a diameter of 20 cm is two times larger than a pizza with a diameter of 10 cm, while in fact the first pizza is 4 times larger than the second one, because we judge the size of an area and not the diameter. The first pizza has an area of 314 cm² (πr²) and the second one has 78.5 cm² → 314/78.5 = 4. Now back to bubble charts…
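The pizza arithmetic is easy to reproduce in R. This is just a small sketch, with variable names invented for the example:

```r
diameter <- c(small = 10, large = 20)  # pizza diameters in cm
area <- pi * (diameter / 2)^2          # area of a circle: pi * r^2

print(round(area, 1))                  # areas in cm^2

# Doubling the diameter quadruples the area
area_ratio <- area[["large"]] / area[["small"]]
print(area_ratio)
```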

For this example I have loaded ggplot2 and created a simple dataset with three variables – x, y and size.

library(ggplot2)

dat=data.frame(c(1,2,3),c(2,3,1),c(10,20,30))
colnames(dat)<-c("x","y","size")
dat

The resulting dataset provides the coordinates and the bubble sizes for the chart. Now let’s create the chart with annotated bubble sizes.

q<-ggplot(dat,aes())
q<-q + geom_point(aes(x=x,y=y), size=dat$size) #create bubbles
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme
q

The chart looks like this:

Rplot01

The basic issue is that the smallest bubble looks as if it is 9 times smaller than the largest bubble instead of 3 times smaller, because the size parameter of geom_point sets the diameter of the point and not its area.

To correct for this and to make the chart interpretable, we transform the size parameter in geom_point by taking its square root.

q<-ggplot(dat,aes())
q<-q + geom_point(aes(x=x,y=y), size=sqrt(dat$size)*10) #create bubbles with a correct scale
q<-q + xlim(-1,5)+ylim(-1,5) #add limits to axes
q<-q+ annotate("text",x=dat$x,y=dat$y,label=dat$size, color="white",size=6) #add size annotations
q<-q + theme_void() + theme(panel.background = element_rect(fill="lightblue")) #create a simple theme
q

The multiplication of the square root of size by a factor of 10 only serves to make the bubbles large enough relative to the axis limits.
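A quick numeric check shows why the constant factor is harmless. The sketch below assumes that the drawn area of a point grows with the square of its size value and ignores the point outline; area_raw and area_sqrt are names made up for the example.

```r
size <- c(10, 20, 30)

# Relative bubble area when `size` is used directly as the diameter
area_raw <- size^2
print(area_raw / area_raw[1])    # 1, 4, 9: the largest bubble looks 9 times bigger

# Relative bubble area after the square-root transformation
area_sqrt <- (sqrt(size) * 10)^2
print(area_sqrt / area_sqrt[1])  # 1, 2, 3: proportional to `size`
```

The factor of 10 cancels out in the ratios, so it changes only the overall scale of the bubbles.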

The chart now looks like this:

Rplot02

The areas are now in the correct scale and the bubbles are proportional to the size variable.
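As an alternative to transforming the values by hand, ggplot2 can do the area scaling itself when size is mapped inside aes() and scale_size_area() is added. This is a different technique from the one used in this post, shown only as a sketch; max_size = 55 is an arbitrary value chosen to roughly resemble the manual chart.

```r
library(ggplot2)

dat <- data.frame(x = c(1, 2, 3), y = c(2, 3, 1), size = c(10, 20, 30))

# Map `size` inside aes() and let scale_size_area() make the point AREA
# proportional to the value. max_size = 55 is an arbitrary choice.
p <- ggplot(dat, aes(x = x, y = y)) +
  geom_point(aes(size = size)) +
  scale_size_area(max_size = 55) +
  geom_text(aes(label = size), color = "white", size = 6) +
  xlim(-1, 5) + ylim(-1, 5) +
  theme_void() +
  theme(panel.background = element_rect(fill = "lightblue"),
        legend.position = "none")
p
```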

Of course, if we wanted to draw three-dimensional shapes, the correction would be the cube root, because when the diameter is increased by a factor of n, the volume is increased by a factor of n³.
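The same kind of check works for volumes. The sketch below assumes the shapes are spheres; the variable names are again only illustrative.

```r
size <- c(10, 20, 30)

# Diameter proportional to the cube root of the value
diameter <- size^(1/3)

# Volume of a sphere: (4/3) * pi * r^3
volume <- 4 / 3 * pi * (diameter / 2)^3
print(volume / volume[1])  # 1, 2, 3: volumes proportional to `size`
```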

Happy charting 🙂

 

This post was motivated by a lot of wonderful blogs on http://www.R-bloggers.com