Data Scientist’s take on the US Election results

It would be an understatement to say that the outcome of the recently concluded US election came as a shock to many people, both in the US and across the world. As various media outlets have pointed out, these results have exposed the failure of the political polling and predictive analytics industries, and called into question the power of data and data science.

In this post I share my thoughts on this matter.
From a Data Science perspective, there are two possible reasons why the predictions were so far off the mark:
a) The predictive models were wrong
b) The data used in the models was bad
Let's look at both these possibilities in detail.

a) Predictive Models were wrong

  • There is an adage widely accepted in the statistics world that “all models are wrong”. The reasoning behind it is that data beats algorithms: a model is only as good as the data used to build and validate it. In this particular case, however, the models have been used in polling predictions for decades, and it’s not clear to me what went wrong with them this time.
  • Having said that, there has definitely been some interesting work published in the last few weeks showing the use of inference and regression models to understand the outcome of this election. Here is a whitepaper published by professors in the Dept. of Statistics at Oxford University. To summarize the paper:

We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al, KDD 2015) with a multinomial-logit regression so as to model the vote outcome Trump, Clinton, Other / Didn’t vote as a function of demographic and socioeconomic features. Ecological inference allows us to estimate “exit poll” style results like what was Trump’s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in python and R and are available online for replication.
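The multinomial-logit core of the paper's approach can be sketched in a few lines. Below is a minimal, self-contained illustration (not the authors' implementation): it fits a softmax regression by gradient descent on synthetic "census tract" features and predicts the outcome distribution for a hypothetical tract. The feature names, weights, and data are all invented for illustration only.

```python
import math
import random

random.seed(0)

CLASSES = ["Trump", "Clinton", "Other/Didn't vote"]

def softmax(zs):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def make_data(n=500):
    """Synthetic tracts with features [intercept, pct_college, urban_density]."""
    true_w = [[0.5, -1.5, -1.0],   # Trump
              [-0.2, 1.2, 1.5],    # Clinton
              [0.0, 0.0, 0.0]]     # Other (reference class)
    data = []
    for _ in range(n):
        x = [1.0, random.gauss(0, 1), random.gauss(0, 1)]
        probs = softmax([sum(w * xi for w, xi in zip(wk, x)) for wk in true_w])
        r, y, acc = random.random(), 0, 0.0
        for k, p in enumerate(probs):   # sample an outcome from the true model
            acc += p
            if r <= acc:
                y = k
                break
        data.append((x, y))
    return data

def fit(data, epochs=300, lr=0.5):
    """Multinomial-logit fit by batch gradient descent on the log-loss."""
    K, d = len(CLASSES), len(data[0][0])
    w = [[0.0] * d for _ in range(K)]
    for _ in range(epochs):
        grad = [[0.0] * d for _ in range(K)]
        for x, y in data:
            p = softmax([sum(wi * xi for wi, xi in zip(wk, x)) for wk in w])
            for k in range(K):
                err = p[k] - (1.0 if k == y else 0.0)  # dLoss/dlogit_k
                for j in range(d):
                    grad[k][j] += err * x[j]
        for k in range(K):
            for j in range(d):
                w[k][j] -= lr * grad[k][j] / len(data)
    return w

data = make_data()
w = fit(data)
# Predicted outcome distribution for a hypothetical high-education urban tract
probs = softmax([sum(wi * xi for wi, xi in zip(wk, [1.0, 1.5, 1.0])) for wk in w])
for name, p in zip(CLASSES, probs):
    print(f"{name}: {p:.2f}")
```

In the actual paper the features come from fine-grained census data and the estimation is much more sophisticated (distribution regression for ecological inference); the sketch above only shows the multinomial-logit piece.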

b) Data used in the models was bad

  • Not everyone will be open about their opinion, especially if that opinion is not aligned with the general consensus. Such opinions are usually not welcome in our society; a recent example is Mark Zuckerberg reprimanding employees for writing “All Lives Matter” on a Black Lives Matter posting inside the Facebook headquarters. So there is a good chance such opinions never made it into the data sets used by the models.
  • Groupthink also played a major role in skewing the data set. When most media outlets and polling agencies were predicting a landslide victory for Hillary over Trump, only a courageous pollster would contradict the widely supported and predicted poll results. The result was that everybody misread the data.
  • Incomplete analysis methods relied only on traditional means of collecting data, such as surveys and polls, instead of using important signals from social media platforms, especially Twitter. The candidates’ social media engagement with voters was a data set that was definitely ignored, in spite of social media analysts sounding the alarm that the polls were not reflecting the actual situation on the ground in the pre-election landscape. Clinton outspent Trump on TV ads, set up more field offices, and sent staff to swing states earlier, but Trump simply leveraged social media better to both reach and grow his audience, and he clearly benefitted from that old adage, “any press is good press.”

To summarize…

Data science has limitations

  • Data is powerful only when used correctly. As I noted above, biased data was the biggest spoilsport in this election’s predictions.
  • Variety of data is more important than volume. There is a constant race these days to collect as much data as possible from various sources, with Google and Facebook as leading examples. As I noted above, drawing on a wider variety of data sets, including social media, could definitely have brought the predictive models closer to reality. Simply put, the key is using the right “big data”.

Should we be surprised?

  • To twist the perspective a little: if we keep in mind how probabilistic predictions work, the outcome shouldn’t surprise us. For example, if I said, “I am 99% sure it’s going to be sunny tomorrow”, and you offered to bet on it at odds of 99 to 1, I might reply, “I didn’t mean it literally; I just meant it will probably be sunny.” Would you be surprised if I tossed a coin twice and got heads both times? Not at all, right? After all, that outcome has a 25% chance of happening.
  • This New York Times article captures very well the gist of what actually went wrong with the use of data and probabilities in this election. I think the following lines say it all:
The danger, data experts say, lies in trusting the data analysis too much without grasping its limitations and the potentially flawed assumptions of the people who build predictive models.
The technology can be, and is, enormously useful. “But the key thing to understand is that data science is a tool that is not necessarily going to give you answers, but probabilities,” said Erik Brynjolfsson, a professor at the Sloan School of Management at the Massachusetts Institute of Technology.
Probabilistic prediction is a very interesting topic, but it can also be very misleading if the probabilities are not presented correctly (a 70% chance of Clinton winning versus a 30% chance for Trump).
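The coin-toss intuition is easy to check numerically. The quick simulation below (illustrative only) shows that an event given a 30% probability, roughly what many forecasters assigned to a Trump win, happens about three times out of ten, nearly as often as two coin tosses both coming up heads (25%):

```python
import random

random.seed(42)
TRIALS = 100_000

# How often does a "30% chance" event actually happen?
upsets = sum(random.random() < 0.30 for _ in range(TRIALS))
print(f"30%-probability event occurred in {upsets / TRIALS:.1%} of trials")

# Two fair coin tosses both landing heads: 0.5 * 0.5 = 25%
double_heads = sum(random.random() < 0.5 and random.random() < 0.5
                   for _ in range(TRIALS))
print(f"Two heads in a row: {double_heads / TRIALS:.1%} of trials")
```

A 30% event is not a rare event; a forecast of "70% Clinton" was never a guarantee.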
I shall delve deeper into this in a follow-up post…



Title image courtesy:

Two Centuries of Population, Animated, using R

An interesting visualisation of the history of a growing United States, mapped using R.

The animated map above shows population density by decade, going back to 1790 and up to recent estimates for 2015. The time in between each time period represents a smoothed transition. This is approximate, but it gives a better idea of how the distribution of population changed.

The data used for this mapping is from the Census Bureau and made more easily accessible by NHGIS.

R moves up to 5th place in IEEE language rankings

IEEE has published its annual ranking of the top programming languages. The report opens with the line “C is No. 1, but big data is still the big winner”, pointing to the rise of R, the de facto programming language for big data analytics, including in the cyber security domain.

I think this is an extraordinary result for a domain-specific language (big data and data science). Compared to the other four languages in the top 5, which are general-purpose (C, Java, Python and C++), it’s a great feat, and a clear indication of the adoption, heavy use, and relevance of R in today’s Information Age, where every device, system, or “thing” (IoT) generates some form of data (logs). It also reflects the critical importance of data science (where R is the de facto programming language used by data scientists) as a discipline today.

Some interesting lines from the report:

Another language that has continued to move up the rankings since 2014 is R, now in fifth place. R has been lifted in our rankings by racking up more questions on Stack Overflow—about 46 percent more since 2014. But even more important to R’s rise is that it is increasingly mentioned in scholarly research papers. The Spectrum default ranking is heavily weighted toward data from IEEE Xplore, which indexes millions of scholarly articles, standards, and books in the IEEE database. In our 2015 ranking there were a mere 39 papers talking about the language, whereas this year we logged 244 papers.

R’s steady growth in this and numerous other surveys and rankings over time reflects the growing importance of data science applied using R. And the application of data science concepts in cyber security, especially in detecting cyber attacks, is only becoming more and more relevant.
Using conventional security monitoring tools that rely on rule-based detection engines (yes, they are called SIEMs!) to detect cyber attacks is not working anymore. Let’s face it: SIEM has had its day. Using machine learning to detect cyber attacks has become one of the most important developments in the cyber security domain in the last 10 years. And its relevance in today’s world, where vast amounts of data (also called “big data”) are being churned out by all forms of computer systems, is at its peak. And R is playing a very important role in helping security data scientists build “algorithmic models” that can better detect cyber attacks.
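To make the contrast with rule-based detection concrete, here is a minimal sketch of the idea on simulated log data. It is written in Python purely for illustration (the same model is a few lines of R), and the host names, counts, and threshold are all invented: instead of a hand-written rule like “alert if failures > 10”, the model learns what “normal” looks like from the data and flags statistical outliers.

```python
import random
import statistics

random.seed(7)

# Simulated hourly login-failure counts per host: mostly benign noise,
# plus one host exhibiting brute-force-like behaviour.
baseline = {f"host-{i:02d}": [random.gauss(5, 1.5) for _ in range(24)]
            for i in range(20)}
baseline["host-13"] = [random.gauss(40, 5) for _ in range(24)]  # the attacker

# Learn the "normal" profile from the data rather than from hand-written rules.
all_counts = [c for counts in baseline.values() for c in counts]
mu = statistics.mean(all_counts)
sigma = statistics.stdev(all_counts)

def anomalous(counts, threshold=3.0):
    """Flag a host whose mean activity sits > threshold sigmas above normal."""
    return (statistics.mean(counts) - mu) / sigma > threshold

flagged = [host for host, counts in baseline.items() if anomalous(counts)]
print("Flagged hosts:", flagged)
```

Real-world models are far richer (many features, seasonality, supervised labels from past incidents), but the principle is the same: the threshold is derived from the data, not hard-coded into a rule.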

So I am very excited and happy to see R’s popularity and adoption growing year on year.

This is a core area of study I am currently focusing on, and I will be writing more about this here on my blog, in the coming months. 

Picture Courtesy:

Interesting Data Science projects of 2015

Here is a list of some really interesting Data Science projects of 2015. Thanks to Jeff Leek from @simplystatistics for putting this together. 
Some of my picks from the list are:

* I’m excited about the new R Consortium and the idea of having more organizations that support folks in the R community.

* Emma Pierson’s blog and writeups in various national-level news outlets continue to impress. I thought this one on changing the incentives for sexual assault surveys was particularly interesting/good.

* As usual Philip Guo was producing gold over on his blog. I appreciate this piece on twelve tips for data driven research.

* I am really excited about the new field of adaptive data analysis. Basically understanding how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. This paper from Cynthia Dwork and co was one of the initial salvos that came out this year.

* Karl Broman’s post on why reproducibility is hard is a great introduction to the real issues in making data analyses reproducible.

* Datacamp incorporated Python into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.

Picture Courtesy: