Who is more corrupt? Correlating politicians and corruption using Google

Posted: Wednesday, August 31, 2011 | Posted by Debajyoti Datta

This post was inspired by a post on The Madura Beats.

Let me start off by saying that this should not be taken seriously. It’s a quick-and-dirty half hour’s work, not a rigorous study. If you want a rigorous study, a political science journal would probably be a good place to start. That being said, let me begin the post.

There is a nifty tool from Google called Google Insights for Search that lets you see the interest in particular search terms over time. You can also compare different search terms and see how they vary over time. So I began my experiment. I looked up the search terms “lalu prasad yadav” and “corrupt”, restricted to searches originating from India, from 2004 to date. The graph shows the Google output.

I then downloaded the data and fired up R (a very powerful, free statistical software package). If you look carefully you will see that the interest in “corrupt” went up sharply towards the end, probably because of the recent interest in the anti-corruption movement. Hence I removed the outliers from the data using the rm.outlier() function of the outliers package in R. Then I calculated Pearson’s correlation coefficient to see whether the searches for “lalu prasad yadav” and “corrupt” are correlated, and plotted a scatter graph. The results are below.
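The post does not show the commands used, so here is only a rough sketch of the two steps described above, written in plain Python rather than R. The numbers are made up for illustration and are not the Google data. The function `drop_most_extreme` is a hypothetical stand-in that assumes the outlier step drops the single observation farthest from the mean; the actual rm.outlier() call may have been used differently.

```python
import math

def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def drop_most_extreme(xs, ys):
    """Drop the one (x, y) pair whose y lies farthest from the mean of ys.

    Assumption: this mimics a single outlier-removal pass; it is not a
    port of the R function mentioned in the post.
    """
    mean_y = sum(ys) / len(ys)
    worst = max(range(len(ys)), key=lambda i: abs(ys[i] - mean_y))
    keep = [i for i in range(len(ys)) if i != worst]
    return [xs[i] for i in keep], [ys[i] for i in keep]

# Hypothetical weekly interest values (NOT the real data); note the spike of 90.
politician = [10, 12, 9, 14, 11, 13, 12, 15]
corrupt = [20, 22, 19, 25, 21, 23, 90, 24]

politician_clean, corrupt_clean = drop_most_extreme(politician, corrupt)
r_clean = pearson(politician_clean, corrupt_clean)
print(len(corrupt_clean), round(r_clean, 4))
```

Dropping the whole week, rather than just the one spiking value, keeps the two series aligned so that each remaining pair still refers to the same week.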

Lalu Prasad Yadav and Corrupt
Pearson's product-moment correlation
data:  lpydata$lalu.prasad.yadav and lpydata$corrupt 
t = -6.4745, df = 398, p-value = 2.806e-10
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval: -0.3947899 -0.2172100 
sample estimates: cor = -0.3086874 
Scattergram of interest in Lalu Prasad Yadav and Corrupt
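As a sanity check on the output above, the t statistic and the confidence interval can be recomputed from just the reported correlation and degrees of freedom. The sketch below is plain Python, not the R code behind the post. It uses the standard formula t = r·sqrt(df / (1 − r²)) and assumes the interval comes from Fisher's z-transform with n = df + 2 observations. The function names are made up for illustration, and the p-value is left out because it needs the t distribution, which the Python standard library does not provide.

```python
import math
from statistics import NormalDist

def t_from_r(r, df):
    """t statistic for testing whether a Pearson correlation is zero."""
    return r * math.sqrt(df / (1.0 - r * r))

def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for r via Fisher's z-transform."""
    z = math.atanh(r)
    # Normal quantile (about 1.96 for 95%) times the standard error of z.
    half_width = NormalDist().inv_cdf(0.5 + level / 2.0) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Values copied from the output above.
r, df = -0.3086874, 398
t_stat = t_from_r(r, df)
ci_low, ci_high = fisher_ci(r, df + 2)  # n = df + 2 for a correlation test

# These should land close to the reported t and confidence interval.
print(round(t_stat, 4), round(ci_low, 4), round(ci_high, 4))
```

The same two functions can be fed the correlation and degrees of freedom from any of the other outputs in this post.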
I did the same using the search terms “manmohan singh” and “corrupt”. The results are below –

Manmohan Singh and Corrupt
Pearson's product-moment correlation
data:  newdata1$manmohan.singh and newdata1$corrupt 
t = 7.7384, df = 398, p-value = 8.371e-14
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval: 0.2732764 0.4439476 
sample estimates: cor = 0.361638
Scattergram of Manmohan Singh and Corrupt
The findings are counterintuitive. Though searches for both Manmohan Singh and Lalu Prasad Yadav are weakly correlated with searches for “corrupt”, they are correlated in opposite directions. Manmohan Singh has a positive correlation with “corrupt” (r ≈ 0.36) while Lalu Prasad Yadav has a negative correlation (r ≈ -0.31). This means the more interest there is in the search term “corrupt”, the more interest there is in the search term “Manmohan Singh”, but the less interest there is in the search term “Lalu Prasad Yadav”. Not what you expect, eh?

What does this mean? It means that correlation does not equal causation and that my model is too simple to capture reality! Interestingly, at the beginning of the time series there was almost no interest in the search term “Lalu Prasad Yadav” but high interest in “corrupt”. This probably screwed up the results.
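To see how an early stretch like that could flip the sign of a correlation, here is a toy illustration in plain Python. The numbers are invented and are not the Google data. It shows that the effect is possible, not that it is what actually happened here.

```python
import math

def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical series: the first four weeks have zero interest in the
# politician but high interest in "corrupt"; afterwards the two rise together.
politician = [0, 0, 0, 0, 10, 12, 14, 16]
corrupt = [50, 52, 51, 53, 20, 22, 24, 26]

r_full = pearson(politician, corrupt)          # whole series
r_late = pearson(politician[4:], corrupt[4:])  # later weeks only

print(round(r_full, 3), round(r_late, 3))
```

In this made-up case the later weeks are perfectly positively correlated, yet the whole series comes out negative, purely because of the early block of zeros.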

Edited to Add -
Spurred on by the comment from confusedyuppie, I did a bit more calculation, this time comparing "lalu prasad yadav" and "manmohan singh" with the search term "honest". The results:

Lalu Prasad Yadav and Honest


Pearson's product-moment correlation
data:  lpynew$lalu.prasad.yadav and lpynew$honest 
t = 0.1226, df = 59, p-value = 0.9028
alternative hypothesis: true correlation is not equal to 0 
95 percent confidence interval: -0.2368133  0.2667080 
sample estimates: cor = 0.01595912 
Lalu Prasad Yadav and honest
Manmohan Singh and Honest -

And then I ran into problems. Somehow, on downloading the .CSV file, the data for “honest” is not there, so I can't test for correlation. If anyone is able to find the data for “honest”, please let me know.

At least the interest in Lalu Prasad Yadav shows no significant correlation with “honest” (r ≈ 0.016, p ≈ 0.90).

2 comments:

  1. confusedyuppie said...
     really? wow-what an analysis. although since the recent interest in dr singh and corruption, the results are altogether not surprising

  2. Debajyoti Datta said...
     @confusedyuppie
     You are right, there are too many unaccounted confounding factors.

     I have done some additional work and compared the interest in them with "Honest"
