Data Scientist’s take on the US Election results

It would be an understatement to say that the outcome of the recently concluded US election shocked many people in the US and across the world. As various media outlets have called out, the results have also been read as a failure of the political polling and predictive analytics industries, and have raised doubts about the power of data and data science.

In this post I share my thoughts on this matter.

From a data science perspective, there are two possible reasons why the predictions were so far off:

a) The predictive models were wrong

b) The data used in the models was bad

Let’s look at both these possibilities in detail.

a) Predictive Models were wrong

There is a widely accepted adage in the statistics world that “All models are wrong”. The reasoning behind it is that ‘data’ beats ‘algorithms’, and that models are only as good as the data used to validate them. But in this particular case, the models have been used in polling predictions for decades, and it’s not clear to me what went wrong with them.
Having said that, some interesting work has been published in the last few weeks that shows the use of inference and regression models in understanding the outcome of this election. Here is a whitepaper published by professors in the Dept. of Statistics at Oxford University. To summarize the paper:

We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al, KDD 2015) with a multinomial-logit regression so as to model the vote outcome Trump, Clinton, Other / Didn’t vote as a function of demographic and socioeconomic features. Ecological inference allows us to estimate “exit poll” style results like what was Trump’s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in python and R and are available online for replication.

b) Data used in the models was bad

Not everyone will be open about their opinion, especially if it is not aligned with the general consensus among the public. Such opinions are usually not welcome in our society. A recent example of this is Mark Zuckerberg reprimanding employees for writing “All Lives Matter” on a Black Lives Matter posting inside the Facebook headquarters. So there is a good chance such opinions never made it into the datasets used in the models.
Groupthink also played a major role in skewing the dataset. When most media outlets and news agencies were predicting Hillary’s landslide victory over Trump, only a courageous pollster would contradict the widely supported poll results. This resulted in everybody misreading the data.
The analysis methods were incomplete as well. They relied only on traditional ways of collecting data, such as surveys and polls, instead of also using important signals from social media platforms, especially Twitter. The candidates’ social media engagement with voters was a data set that was largely ignored, in spite of social media analysts sounding the alarm that the polls were not reflecting the actual situation on the ground in the pre-election landscape. Clinton outspent Trump on TV ads, set up more field offices, and sent staff to swing states earlier, but Trump simply leveraged social media better to both reach and grow his audience, and he clearly benefitted from that old adage, “any press is good press.”
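To make the first point concrete, here is a minimal sketch of how unequal willingness to answer a poll can skew its result. The function name, the 50% true support and the response rates are all hypothetical numbers chosen for illustration; they are not taken from any actual 2016 poll.

```python
def observed_share(true_share, response_rate_a, response_rate_b):
    """Share of poll respondents backing candidate A when supporters of
    A and B agree to answer the poll at different rates."""
    responders_a = true_share * response_rate_a
    responders_b = (1 - true_share) * response_rate_b
    return responders_a / (responders_a + responders_b)


# Hypothetical electorate: candidate A truly has 50% support.
true_share = 0.50

# Case 1: both groups answer pollsters at the same (assumed) 10% rate.
equal = observed_share(true_share, 0.10, 0.10)

# Case 2: A's supporters are slightly less willing to answer (8% vs 10%).
skewed = observed_share(true_share, 0.08, 0.10)

print(f"equal response rates:   {equal:.1%}")
print(f"unequal response rates: {skewed:.1%}")
```

In this toy setup a two-point gap in response rates is enough to make an evenly split electorate look like a comfortable lead for one side, even though the poll itself is carried out correctly.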

To summarize…

Data science has limitations

Data is powerful only when used correctly. As I called out above, biased data was the biggest spoilsport for the predictions in this election.
Variety of data is more important than volume. There is a constant rage these days to collect as much data as possible, from various sources. Google and Facebook are leading examples. As I called out above, drawing on different data sets, including social media, could have helped bring the predictive models closer to reality. Simply put, the key is using the right “big data”.

Should we be surprised?

To twist the perspective a little bit, if we look at this keeping in mind how probabilistic predictions work, the outcome wouldn’t be a surprise to us. For example, if I said that “I am 99% sure that it’s going to be a sunny day tomorrow”, and you offered to bet on it at odds of 99 to 1, I might say, “I didn’t mean it literally; I just meant it will ‘probably’ be a sunny day”. Would you be surprised if I tossed a coin twice and got heads both times? Not at all, right?
This New York Times article captures very well the gist of what actually went wrong with the use of data and probabilities in this election. I think the following lines say it all:

The danger, data experts say, lies in trusting the data analysis too much without grasping its limitations and the potentially flawed assumptions of the people who build predictive models.

The technology can be, and is, enormously useful. “But the key thing to understand is that data science is a tool that is not necessarily going to give you answers, but probabilities,” said Erik Brynjolfsson, a professor at the Sloan School of Management at the Massachusetts Institute of Technology.

Probabilistic prediction is a very interesting topic, but it can also be very misleading if the probabilities are not presented correctly (for example, a 70% chance of Clinton winning against a 30% chance for Trump).
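To put the 70/30 figure next to the coin-toss and betting examples from earlier, here is a small sketch of the arithmetic. The helper names are my own and only for illustration; the numbers are the ones already mentioned in this post.

```python
from fractions import Fraction


def implied_probability(odds_for, odds_against):
    """Probability implied by betting odds of `odds_for` to `odds_against`."""
    return Fraction(odds_for, odds_for + odds_against)


# Two fair coin tosses both landing heads: 1/2 * 1/2.
p_two_heads = Fraction(1, 2) ** 2

# The underdog's chance in a 70/30 forecast.
p_underdog = Fraction(30, 100)

# Odds of 99 to 1 correspond to being "99% sure".
p_sunny = implied_probability(99, 1)

# Out of 10 independent contests forecast at 70/30, this many upsets are expected.
expected_upsets = p_underdog * 10

print(f"two heads in a row: {float(p_two_heads):.0%}")
print(f"30% underdog wins:  {float(p_underdog):.0%}")
print(f"99 to 1 odds imply: {float(p_sunny):.0%}")
print(f"expected upsets in 10 such contests: {expected_upsets}")
```

Seen this way, a 30% outcome is more likely than two heads in a row, which nobody would call a shock.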

I shall delve deeper into this in a follow-up post…

Title image courtesy: http://www.probabilisticworld.com

RStudio v1.0 is out

RStudio finally moved out of “beta” status last week, and the first official production version is now available. This is great news for all of us who use RStudio as the primary IDE for R programming.

Check out this link for the release history of RStudio and all the changes it has gone through over the last 6 years.

Some of the major new features added in this release are:

Support for R Notebooks, a new interactive document format combining R code and output. It’s similar to (but not based on) Jupyter Notebooks, in that an R Notebook includes chunks of R code that can be processed independently (as opposed to R Markdown documents that are processed all at once in batch mode).
GUI support for the sparklyr package, with menus and dialogs for connecting to a Spark cluster, and for browsing and previewing the available Spark Dataframe objects.
Profiling tools for measuring which parts of your R code are consuming the most processing time, based on the profvis package.
Dialogs to import data from file formats including Excel, SAS and SPSS, based on the readr, readxl and haven packages.

Check out the official blog for more information about this release.

An R based analysis of Cubs and Indians performance

Here is a great use of the Lahman package in R to analyse the historical performance of two teams, the Chicago Cubs and the Cleveland Indians.

This comes at the right time, after yesterday’s nail-biting game.

In recognition of the event, and the fact that simple data analysis is all I can muster today, I thought I’d use the excellent Lahman package, which provides a trove of baseball statistics for R, to have a look at the historical performance of the two teams.

Two Centuries of Population, Animated, using R

An interesting visualisation of the history of a growing United States, with the mapping built using R.

The animated map above shows population density by decade, going back to 1790 and up to recent estimates for 2015. The time in between each time period represents a smoothed transition. This is approximate, but it gives a better idea of how the distribution of population changed.

The data used for this mapping is from the Census Bureau and made more accessible by NHGIS.

All the ways to map election results

An interesting take using R. Do check this out.

Uber to partner with Maruti Suzuki

A very interesting move by Uber indeed, and very much in line with Modi’s push for creating local jobs and opportunities.

Top sources in know of Uber’s plans told The Economic Times, “Uber has around 200,000 active driver partners on their platform currently and they want to increase this to a million by 2018. They are beginning with this pilot with Maruti Suzuki and will extend this going ahead”, said an executive in know of developments.

Ola – what are you up to?

Picture Courtesy: financialexpress.com

What to expect in Apple’s big event tomorrow

It’s time for the biggest Tech event of the year – Apple’s product (hardware) launch event tomorrow.

WWDC in June is when we find out about the latest and greatest software Apple has built. But when it comes to how that software blends seamlessly with hardware, resulting in some of the best designed, engineered and built tech products on the planet, it’s at the Fall event that they are launched.

Key expectations are the next iPhone, the iPhone 7, and the next big leap in wearables, the Apple Watch 2.

If there is one preview of tomorrow’s event that you would want to read, make it Jason Snell’s:

The devil’s in the details, though. This event is Apple’s big chance to put all of its fall product offerings in context, to tell stories that explain why these products do what they do (or in some cases, don’t do what they don’t). This is product marketing at its highest level, and the way Apple introduces a product can be enlightening.

Apple getting rid of the headphone jack, its take on wireless audio, the best camera on a smartphone getting even better (a two-lens camera), the positioning of the Apple Watch: Jason has it all in his post.

On Apple’s Bug bounty program

The Head of Security Engineering and Architecture at Apple, Ivan Krstić, announced to Black Hat attendees last week that Apple will begin offering cash bounties of up to $200,000 to researchers who discover vulnerabilities in its products.

Krstić’s talk at Black Hat was definitely interesting and covered a good breadth of the technical measures that Apple has been taking to make iOS secure from the ground up. The presentation also included a level of technical detail and disclosure of security (here, related to AutoUnlock, HomeKit, and iCloud Keychain) that has been mostly absent in the past at conferences, according to those present.

Apple being so open and forthcoming about their security architecture is somewhat unusual, but definitely welcome.

Now, about the bounty program itself: it will initially be limited to about two dozen researchers whom Apple will invite to help discover difficult-to-uncover security bugs in five specific categories:

Each of these aspects represents a key threat vector for attacks by governments and criminals alike. While iOS has never had exploits spread significantly in the wild, jailbreaking the software has made use of various methods of running arbitrary code in iOS. In another Black Hat presentation, the makers of the Pangu jailbreak for iOS 9 (fixed in 9.2) described how they achieved that kind of code execution.

Until now, there’s been no known extraction of data from the Secure Enclave, the dedicated hardware in iOS devices with an A7 or newer processor that acts as a one-way valve to store fingerprint characteristics and certain data associated with Apple Pay. It is also used to prevent downgrading iOS to exploit a bug in a previous release. And while iCloud, which has sometimes been in the media for the wrong reasons, has had some accounts compromised in the past through certain weak password entry endpoints and social engineering of celebrity accounts, there has been no reported breach of the iCloud servers themselves.

Going by these clearly laid out vulnerability categories and qualification parameters, I see that Apple’s program sets clear objectives: find exploitable bugs in key areas. It makes complete sense, because proving exploitability with a repeatable proof of concept takes a lot more effort than merely finding a vulnerability. If the bug is found to have a significant impact on security, then Apple will pay the researchers a fair value for their work. By doing this, Apple aims to learn how to improve its bug bounty program over time and derive maximum value out of it.

The end result is high-quality vulnerabilities (and their respective exploits), discovered by researchers and developers who Apple believes have the skills and the right intentions to help advance product security. Bounties at other companies start at anywhere from $100 to $500, and are capped at amounts ranging from $20,000 at Google to $100,000 at Microsoft, which suggests a focus on quantity, unlike Apple’s focus on quality: vulnerabilities that are difficult to discover, exploitable and reproducible.

Many major tech companies, like Google, Facebook, Microsoft, Adobe, and SAP, have been running bounty programs for years. But there is a reason Apple did not get into the bounty business until now, even though security has always been a priority for them and iOS is, from the ground up, way more secure than other competing mobile OS platforms today. That reason is primarily to ward off governments and underground hackers who merely want to make money, by not being in a position to negotiate with them. The United States government’s disclosure last week that an unknown third party, and not Apple, had approached it to help open a controversial iPhone only highlights how the giant company approaches bug-hunting efforts and security differently from the rest of the tech industry.

Asked by the audience at Black Hat why Apple waited so long to launch a bounty program, Krstić said the company has heard from researchers that finding critical vulnerabilities is increasingly difficult, and it wanted to reward those who take the time to do it.

I have been following Apple closely since 2009, when I bought my first Apple product, an iPhone 4 (the last phone Steve Jobs personally launched). Being a security consultant myself, I have always wondered how Apple builds its software to be far more secure than other operating system platforms. This has been true from the very beginning of Mac (built on a strong Unix base). And so I have always tried to understand iOS and Mac security a bit deeper, but Apple has always been secretive about sharing information, just the way they are about their product strategy and roadmap. So this new development, with the bounty program and the overall head of product security at Apple making a presentation at Black Hat, is very exciting to me.

I am looking forward to understanding how Operating System security is best handled, from a company that makes the best software and hardware in the world today.

Notes:

Krstić’s presentation at Black Hat is available here
The video of the talk has been published recently on YouTube

Feature Image courtesy: blackhat.com

Gartner publishes Hype Cycle for Emerging Technologies 2016

Gartner has just released their annual Hype Cycle for Emerging Technologies, for 2016.

16 new technologies were added to the Hype Cycle for the first time this year, including blockchain, machine learning, general-purpose machine intelligence and smart workspace.

Interestingly, 14 technologies were taken off the Hype Cycle this year including Hybrid Cloud Computing, Consumer 3D Printing, and Enterprise 3D Printing.

Do check out the report here. Definitely worth a read.
Image source: Gartner

Need a security expert? You got to hire a coder!

As cyber security becomes more and more important to businesses, governments, and our personal lives, the need for good security engineers and researchers is increasing at a rapid pace.

This is true whether one is working in an entry-level position or is already a senior researcher.

It is often said in the security industry that “It is easier to teach a developer about security than it is to teach a security researcher about development (coding).”

Information security professionals are used to seeing, experiencing and talking about failures in the industry. This usually leads them to assume that badly written (vulnerable) code is always the product of unskilled developers. If these professionals have never been exposed to software development, even at a small scale, then they do not have a fair understanding of the complex challenges that developers face in secure code development. And I think that a security professional cannot be effective in designing detective and preventive security controls (tools, architectures, processes) if he or she doesn’t appreciate these challenges.

Let me illustrate that with an example: “code injection” attacks against NoSQL databases versus SQL databases. Simply put, SQL and NoSQL databases both collect and organize information and accept queries for it, and so both are exposed to malicious code injections. So, when NoSQL databases became popular, people were quick to predict that NoSQL injection would become as common as SQL injection. Though that is theoretically true, developers know that it’s not that simple.

If you take some time to understand NoSQL databases, you will quickly realize that there is a wide variety of query formats, from SQL-like queries (Cassandra), to JSON-based queries (MongoDB, DynamoDB), to assembly-like queries (Redis). And so security recommendations and tools for a NoSQL environment have to be targeted to the individual server underneath. Your security testing tools must also carry injection attacks in the format of that specific database. One cannot blindly recommend controls or preventive measures without understanding that the same vulnerabilities are not present on all platforms. Encoding recommendations for data will be specific to the database type as well. This OWASP article explains how one can test for NoSQL injection vulnerabilities.
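As a rough illustration of why the payload depends on the query format, here is a minimal sketch contrasting the two cases. The SQL half runs against an in-memory SQLite database from the Python standard library. The NoSQL half only builds a MongoDB-style filter document as a plain dictionary and does not talk to a real MongoDB server; the table, the function names and the payloads are all hypothetical, chosen just for this example.

```python
import json
import sqlite3


def sql_lookup_unsafe(conn, name):
    # Vulnerable: user input is concatenated straight into the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def sql_lookup_safe(conn, name):
    # Parameterized query: the input is always treated as a value.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


def build_filter_unsafe(raw_json_value):
    # Vulnerable: whatever JSON the client sends becomes part of the filter,
    # including objects that carry query operators.
    return {"name": json.loads(raw_json_value)}


def build_filter_safe(raw_json_value):
    # Accept only a plain string, so an operator object cannot slip in.
    value = json.loads(raw_json_value)
    if not isinstance(value, str):
        raise ValueError("name must be a plain string")
    return {"name": value}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

# SQL injection payload: breaks out of the quoted string.
sql_payload = "x' OR '1'='1"
unsafe_rows = sql_lookup_unsafe(conn, sql_payload)  # matches every row
safe_rows = sql_lookup_safe(conn, sql_payload)      # matches nothing

# JSON-style payload: no quotes to escape, just an operator object.
nosql_payload = '{"$ne": null}'
unsafe_filter = build_filter_unsafe(nosql_payload)

print(len(unsafe_rows), len(safe_rows))
print(unsafe_filter)
```

The SQL payload would be meaningless against a JSON-based query interface, and the `$ne` operator object would be meaningless against SQL. This is why testing tools and input validation have to speak the format of the specific database underneath.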

This is all knowledge that one gains by digging deep into a subject and experimenting with technologies at a developer level. And so people with development backgrounds can often give better technical advice.

If one looks at the people leading security programs or initiatives at companies like Apple, Facebook, Google, and other large successful tech companies, many of them are respected because they are also keeping their hands on the keyboards and are speaking from direct knowledge. They not only provide advice and research but also tools and techniques to empower others in the same industry.

So to summarise, I would like to say that whether one is a newly graduated engineer or a senior security professional or a security researcher, one should never lose sight of the code, as that is where it all begins!

Picture courtesy: http://www.icd10forpt.com

Hari Notes