Deep Learning – Limitations and its future

Introduction

This blog post originates from the recent debate sparked by Elon Musk and Mark Zuckerberg on whether AI is good or bad for humanity.

Elon is an inspiration to many of us around the world, especially to entrepreneurs and to machine learning enthusiasts like us, through self-driving cars and his thoughts on AI and its applications in his companies (e.g., Tesla’s autonomous driving via its Autopilot capabilities).

Image: Autopilot in action in a Tesla Model S

But sometimes I differ with his points of view on certain topics. For example, that we have to leave Earth and go to Mars to sustain humanity (SpaceX was founded primarily to make this possible sooner), and that robots with superintelligence will take over the planet soon. Elon is a visionary and, like Stephen Hawking, he believes that AI could one day supersede humans. He is right; or rather, he ‘could’ be right. But the point I would like to make in this blog post is that AI, and deep learning in particular, is at a very nascent stage now. Considering the capabilities that we have built into AI systems so far, I am fairly confident that the doomsday scenario Elon describes is not in the near future.

As Andrew Ng, the cofounder of Coursera and former chief scientist at Chinese technology powerhouse Baidu, recently pointed out at a Harvard Business Review event, the more immediate problem we need to address is job displacement due to automation. This is the area we must focus on, rather than being distracted by science-fiction-style dystopian scenarios.

Limitations of Deep learning

At the recently held AI By The Bay conference, Francois Chollet, an AI researcher at Google and creator of the widely used deep learning library Keras, spoke about the limitations of deep learning. He said that deep learning is simply a more powerful pattern recognition system compared with previous statistical and machine learning methods. “The most important problem for A.I today is abstraction and reasoning”, said Chollet.

Current supervised perception and reinforcement learning algorithms require lots of data, are terrible at planning, and merely perform straightforward pattern recognition. By contrast, humans are able to learn from very few examples and can do very long-term planning. We are also capable of forming abstract models of a situation and manipulating these models to achieve “extreme generalization”.

Let’s take an example of how difficult it is to teach simple human behaviours to a deep learning algorithm. Consider the task of not being hit by a car as you attempt to cross a road. With supervised learning, we would need huge datasets of these (vehicular movement) situations with clearly labeled actions to take, such as “stop” or “move”. We would then need to train a neural network to learn the mapping between the situation and the appropriate response. If we go the reinforcement learning route, where we give an algorithm a goal and let it independently determine the appropriate actions to take, the computer would need to die thousands of times before it learns to avoid vehicles in different situations. Humans, by contrast, only need to be told once to avoid cars. We are equipped with the ability to generalize from just a few examples and are capable of imagining (modeling) the dire consequences of being run over by a vehicle. And so, without (in most cases) ever losing our lives or hurting ourselves significantly, most of us quickly learn to avoid being run over by motor vehicles.

On the subject of anthropomorphizing machine learning models, Francois Chollet makes a very interesting observation in his recent blog post:

“A fundamental feature of the human mind is our “theory of mind”, our tendency to project intentions, beliefs and knowledge on the things around us. Drawing a smiley face on a rock suddenly makes it “happy”—in our minds. Applied to deep learning, this means that when we are able to somewhat successfully train a model to generate captions to describe pictures, for instance, we are led to believe that the model “understands” the contents of the pictures, as well as the captions it generates. We then proceed to be very surprised when any slight departure from the sort of images present in the training data causes the model to start generating completely absurd captions.”

Another difference between how we humans interpret our surroundings and how these models do is the ‘extreme generalisation’ that we are good at, versus the ‘local generalisation’ that machine learning models can do.

Image: **Can humans’ ‘extreme generalisation’ abilities be ported into a machine learning model?** (Picture source: https://blog.keras.io/the-limitations-of-deep-learning.html)

Let’s take an example to understand this difference. If we take a young and smart six-year-old boy from Bangalore and leave him in the town of Siem Reap in Cambodia, he will, within a few hours, manage to find a place to eat and start communicating with the people around him, and will make ends meet within a couple of days. This ability to handle new situations that we have never experienced before (language, people, surroundings, etc.), that is, to perform abstraction and reasoning far beyond what we have experienced so far, is arguably the defining characteristic of human cognition. In other words, this is “extreme generalization”: our ability to adapt to completely new, never-before-experienced situations using very little data or even no new data at all. This is in sharp contrast with what deep neural nets can do, which can be referred to as “local generalization”: the mapping from inputs to outputs performed by deep nets quickly starts to fall apart if the new inputs differ even slightly from what they were trained on.
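To make “local generalization” a little more concrete, here is a minimal Python sketch. It is not a deep net: I am assuming a simple nearest-neighbour lookup as a stand-in for any model that only memorises a mapping over the region covered by its training data, and the target function and input values are made up for illustration.

```python
def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def target(x):
    """The 'true' relationship the model is supposed to capture."""
    return x * x


# Training data only covers inputs between 0.0 and 3.0.
train_x = [i / 10 for i in range(0, 31)]
train_y = [target(x) for x in train_x]

# A query close to the training data, and one far outside it.
inside = nearest_neighbour_predict(train_x, train_y, 1.52)
outside = nearest_neighbour_predict(train_x, train_y, 10.0)

inside_error = abs(inside - target(1.52))
outside_error = abs(outside - target(10.0))

print(f"error near the training data:     {inside_error:.3f}")
print(f"error far from the training data: {outside_error:.3f}")
```

Near the training data the error is small; far away from it the model simply repeats the closest thing it has seen and the error becomes very large. A human who understood that the rule is “square the number” would have no such problem.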

How Deep learning should evolve

A necessary transformational development that we can expect in the field of machine learning is a move away from models that merely perform pattern recognition and can only achieve local generalization, towards models capable of abstraction and reasoning that can achieve extreme generalization. While moving towards this goal, it will also be important for the models to require minimal intervention from human engineers. Today, most AI programs that are capable of basic reasoning are written by human programmers; for example, software that relies on search algorithms. All this will result in deep learning models that are not heavily dependent on supervised learning, as is the case today, and that become truly self-supervised and independent.

As Francois calls out in his blog post, “we will move away from having on one hand “hard-coded algorithmic intelligence” (handcrafted software) and on the other hand “learned geometric intelligence” (deep learning). We will have instead a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern recognition capabilities. The whole system would be learned with little or no human involvement.”

Will this result in machine learning engineers losing jobs? Not really; engineers will move higher up the value chain. They will start focusing on crafting complex loss functions to meet business goals and use cases, and spend more time understanding how the models they have built impact the digital ecosystems in which they are deployed. For example, they will engage with and understand the users that consume the model’s predictions and the sources that generate the model’s training data. Today, only the largest companies can afford to have their data scientists spend time in these areas.

Another area of development could be models becoming more modular, much as the advent of OOP (object-oriented programming) helped software development, and as the concept of ‘functions’ helps in reusing key functionality in a software program. This will pave the way for models becoming reusable. What we do today, along the lines of model reuse across different tasks, is to leverage pre-trained weights for models that perform common functions, like visual feature extraction (for image recognition). Once models become modular in this way, we would leverage not only previously learned features (submodel weights or hyperparameters), but also model architectures and training procedures. And as models become more like programs, we would start reusing program subroutines, like the functions and classes found in our regular programming languages today.

Image (Picture source: http://www.astuta.com/how-artificial-intelligence-big-data-will-transform-the-workplace/)

Closing thoughts…

The result of these developments in deep learning models would be a system that attains Artificial General Intelligence (AGI).

So, does all this take us back to the debate I started with: a singularitarian robot apocalypse that takes over planet Earth? For now, and for the near-term future, I think it’s pure fantasy that originates from a profound misunderstanding of intelligence and technology.

This post began with Elon Musk’s comments, and I will end it with a recent comment from him: “AI will be the best or worst thing ever for humanity…”.

If you want to delve deeper into this fantasy of superintelligence, then do read Max Tegmark’s book Life 3.0, published last month. Elon himself highly recommends it. :)

Another classic is Nick Bostrom’s Superintelligence – Paths, Dangers, Strategies. And if you are a Python programmer and want to start building deep learning models, then I strongly recommend Francois Chollet’s book Deep Learning with Python.

Visualising the performance of Machine learning models

Evaluating the performance of machine learning models using metrics like accuracy, precision and recall is straightforward, but visualising them has never been easy. Ben Bengfort at District Data Labs has developed a Python library for this purpose, called Yellowbrick.

It definitely looks interesting. Our very own Charles Givre shows how this package can be used in his blog.

It’s definitely worth a read.
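For readers who want to see what these metrics boil down to before reaching for a library, here is a small standard-library sketch. It does not use Yellowbrick or its API; the function names, the toy labels and the text-based bars are my own assumptions, meant only to show the idea of turning metric values into something visual.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def text_bar(value, width=20):
    """Render a value between 0 and 1 as a fixed-width text bar."""
    filled = round(value * width)
    return "#" * filled + "." * (width - filled)


# Toy labels, invented for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:<10} {text_bar(value)} {value:.2f}")
```

A dedicated visualisation library goes much further than this (per-class reports, curves, residual plots), but the numbers underneath are the same simple counts.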

Machine Learning and EU GDPR

In this post, I share my thoughts on the impact of using machine learning to conduct profiling of individuals in the context of the EU General Data Protection Regulation (hereafter referred to as GDPR). My analysis is based specifically on Article 22 of the GDPR, which can be found here, and which covers the “automated-processing and profiling of data subjects” requirement.

One of the arguments I discuss is that, although using machine learning for profiling (of users/consumers, hereafter referred to as ‘data subjects’) may complicate data controllers’ compliance with their obligations under the GDPR, it may at the same time lead to fairer decisions for data subjects. Human intervention in classifying data or people is flawed and subject to various factors, whereas machines can remove the subjectivity and bias that humans bring.

Lawful, Fair and Transparent

One of the fundamental principles of EU data protection law is that personal data must be processed lawfully, fairly and in a transparent manner.

GDPR’s definition of ‘processing’ is as follows:
‘any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction’

‘Profiling’ is a subset of automated processing, and GDPR defines it as:
‘the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements’.

Now, let’s analyse the three key tenets of the GDPR requirement: personal data must be processed lawfully, fairly and transparently.

Lawfulness

If we break down the GDPR definition of ‘profiling’ in the context of machine learning, the process has three key elements:

Data profiling – key elements:

Data collection
Model development
Decision making

The outcome of these steps is that machine learning is used for:

Automated data processing for profiling purposes
Automated decision making, based on the profiles built

Data collection

The regulation says that the collection of personal data should comply with the data protection principles and there must be a lawful ground for processing of this data. This means that personal data should only be collected for specified, explicit, and legitimate purposes and should not be processed subsequently in a manner that is incompatible with those purposes.

A machine learning algorithm may build a profile of a subject based on data provided by the ‘data controller’, by a third party, or by both. Many organisations use cloud computing services for these activities, as the process may require significant computational power and storage. Depending on the nature of the business, application or use case of such profiling, this processing may take place locally on the data controller’s machines, while a copy of the data is also sent to the cloud to continue the dynamic training of the algorithm.

Elaborating on the “lawfulness” of this profiling, an individual’s personal data are processed not only to create descriptive profiles about them but also to check against predefined patterns of normal behaviour and to detect anomalies. This stage of profile construction will be subject to the GDPR rules governing the processing of personal data, including the legal grounds for processing this data.

An interesting point to note is that the final text of Article 22 of the GDPR refers to a ‘data subject’ and not a ‘natural person’. One interpretation is that the protection against solely automated decision-making might not apply if the data processed are anonymized. That is, if profiling does not involve the processing of data relating to identifiable individuals, the protection against decisions based on automated profiling may not be applicable, even if such decisions may impact upon a person’s behaviour or autonomy. Moreover, as Article 22 seems to apply only to profiling of individual data subjects and not groups, the question arises whether data subjects are protected against decisions that have significant effects on them but are based on group profiling.

This can be an issue because, if inferences about individuals are made based on characteristics shared with other members of a group, there may be a significant number of false positives or false negatives. A good example of this “anonymised” data collection for machine learning applications is Apple’s approach, which they refer to as ‘differential privacy’.
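To give a feel for how data can be collected in a privacy-preserving way while still supporting aggregate statistics, here is a minimal sketch of randomized response, a classic local differential privacy technique. To be clear, this is not Apple’s implementation; the population size, the true rate of 30% and the epsilon value are assumptions I picked for illustration.

```python
import math
import random


def randomized_response(truth, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth


def estimate_true_rate(reports, epsilon):
    """Undo the known noise to estimate the population-level rate."""
    p = math.exp(epsilon) / (1 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # observed = p * rate + (1 - p) * (1 - rate), solved for rate:
    return (observed - (1 - p)) / (2 * p - 1)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: roughly 30% of users have some sensitive attribute.
true_bits = [rng.random() < 0.3 for _ in range(20000)]

# Each user randomises their own answer before it leaves their device.
reports = [randomized_response(bit, 1.0, rng) for bit in true_bits]

estimate = estimate_true_rate(reports, 1.0)
print(f"estimated rate: {estimate:.3f}")
```

The collector can recover a useful group-level estimate, yet any single report is deniable. That is exactly the situation discussed above: the inference is about the group, not about an identifiable individual.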

Decision making

When it comes to decision making based on the ‘processing’ of personal data described above, does ‘automated individual decision-making’ only cover situations where a machine makes decisions without any involvement by human actors? This may not hold in most situations, as some human intervention is likely to occur at some point in the automated decision-making process. And so, I think the scope of the protection is broader than just covering wholly automated decision-making. Also, human intervention would have to be actual and substantive, i.e. humans would have to exercise ‘real influence on the outcome’ of a particular decision-making process, in order to lead to the inapplicability of this protection.

In addition, the GDPR does not specify whether the decision itself has to be made by a human or whether it could potentially be made by a machine. Nevertheless, as I mentioned above, it is highly likely that one or more humans will be involved in designing the model, training it with data, and testing a system that incorporates machine learning.

Legal impact

Another important element of the decision is that it has to produce legal effects or similarly significantly affect the data subject. Some examples could be an automatic refusal for an online credit application or e-recruitment practices without human intervention. The effects can be both material and / or immaterial, potentially affecting the data subject’s dignity, integrity or reputation. And so the requirement that ‘effects’ be ‘legal’ means that a decision must be binding or that the decision creates legal obligations for a data subject.

Potential consequences of non-compliance

It is important to bear in mind that if data controllers violate the rights of data subjects under Article 22, they shall ‘be subject to administrative fines up to 20,000,000 EUR, or in the case of an undertaking, up to 4 % of the total worldwide annual turnover of the preceding financial year, whichever is higher’. In the face of potential penalties of this magnitude, and considering the complexities of machine learning, data controllers may be apprehensive about using the technology for automated decision making in certain situations. Moreover, data controllers may insist that contractual arrangements be put in place with providers that are part of the machine learning supply chain, containing very specific provisions regarding the design, training, testing, operation and outputs of the algorithms, as well as the relevant technical and organisational security measures to be incorporated.
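The “whichever is higher” wording is easy to misread, so here is the arithmetic as a small sketch. The function name and the `is_undertaking` flag are my own; the two thresholds come straight from the text quoted above, and the result is the upper limit of the fine, not the fine a regulator would actually impose.

```python
FIXED_CAP_EUR = 20_000_000   # fixed upper limit from the quoted text
TURNOVER_PERCENT = 4         # percentage of worldwide annual turnover


def max_administrative_fine(annual_turnover_eur, is_undertaking=True):
    """Upper limit of the fine: the higher of the fixed cap and 4% of turnover."""
    if not is_undertaking:
        return FIXED_CAP_EUR
    turnover_cap = annual_turnover_eur * TURNOVER_PERCENT / 100
    return max(FIXED_CAP_EUR, turnover_cap)


# Hypothetical turnovers, for illustration only.
for turnover in (100_000_000, 1_000_000_000):
    cap = max_administrative_fine(turnover)
    print(f"turnover {turnover:>13,} EUR -> maximum fine {cap:,.0f} EUR")
```

For a company with 100 million EUR turnover, 4% is only 4 million EUR, so the 20 million EUR figure is the ceiling; at 1 billion EUR turnover the 4% figure takes over.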

Fairness

Let’s now turn to the meaning of ‘fairness’ in the context of using machine learning either to carry out automated processing, including profiling, or to make automated decisions based on such processing. Whether personal data will be processed in a fair way may depend on a number of factors. Machine learning processes may be biased to produce the results pursued by the person who built the model. Also, the quantity and quality of data used to train the algorithm, including the reliability of their sources and labelling, may have a significant impact on the construction of profiles.
For example, an indirect bias may arise where data relate to a minority group that has been treated unfairly in the past in such a way that the group is underrepresented in specific contexts or overrepresented in others. Also, in case of a hiring application, if fewer women have been hired previously, data about female employees might be less reliable than data about male employees.

So the point is that reliability, when using machine learning for automated decision-making, will depend on the techniques and the training data used. Further, machine learning techniques often perform better when the training data set is large (more data about data subjects) and covers a wide range of variation. However, this may collide with the data minimisation principle in EU data protection law, a strict interpretation of which is that ‘the data collected on the data subject should be strictly necessary for the specific purpose previously determined by the data controller’.

And so it is very important that the data controllers decide, at the time of collection, which personal data they are going to process for profiling purposes. Then, they will also have to provide the algorithm with only the data that are strictly necessary for the specific profiling purpose, even if that leads to a narrower representation of the data subject and possibly a less fair decision for him/her.

Transparency

Machine learning algorithms may be based on very different computational learning models. Some are more amenable to allowing humans to track the way they work, while others may operate as a ‘black box’. For example, where a process utilises a decision tree, it may be easier to generate an explanation (in a human-readable form) of how and why the algorithm reached a particular conclusion, though this very much depends on the size and complexity of the tree. The situation may be very different in relation to neural network-type algorithms, such as deep learning algorithms. This is because the conclusions reached by neural networks are ‘non-deductive and thus cannot be legitimated by a deductive explanation of the impact various factors at the input stage have on the ultimate outcome’.
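To show why a decision tree lends itself to explanation, here is a tiny hand-written tree that records every test it applies on the way to a decision. The features (`income`, `missed_payments`), the thresholds and the outcomes are all hypothetical; a real credit model would be learned from data and far larger.

```python
def explain_credit_decision(applicant):
    """Walk a tiny hand-written decision tree and record each test taken."""
    trail = []
    if applicant["income"] >= 30000:
        trail.append("income >= 30000")
        if applicant["missed_payments"] == 0:
            trail.append("missed_payments == 0")
            decision = "approve"
        else:
            trail.append("missed_payments > 0")
            decision = "refer to human reviewer"
    else:
        trail.append("income < 30000")
        decision = "decline"
    return decision, trail


decision, trail = explain_credit_decision({"income": 45000, "missed_payments": 2})
print(f"decision: {decision}")
print("because: " + " and ".join(trail))
```

The trail is the explanation: it is the exact sequence of conditions that led to the outcome. A neural network offers no equivalent path to read off, which is the contrast drawn above.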

This opacity of machine learning techniques might have an impact on a data controller’s obligation to process a data subject’s personal data in a transparent way. Whether personal data are obtained directly from the data subject or from an indirect source, the GDPR imposes on the data controller the obligation, at the time when personal data are obtained, to provide the data subject with information regarding:

‘the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.’

Does this mean that whenever machine learning is used to conduct profiling the data controller must provide information regarding the existence and type of machine learning algorithms used? If so, to what does the term ‘logic’ refer and what would constitute ‘meaningful information’ about that logic? Does the term ‘logic’ refer to the data set used to train the algorithm, or to the way the algorithm itself works in general, for example the mathematical / statistical theories on which the design of the algorithm is based? And what about the criteria fed into the algorithm, the variables, and the weights attributed to those variables? And how does this relate to the role of different service providers forming part of the ‘machine learning’ supply chain? All these are important clarifications to be sought.

Given all the above complexities, it appears that transparency might not be the most appropriate way of seeking to ensure legal fairness, and that compliance should instead be verified, for instance, through the use of technical tools. For example, such tools could reveal bias towards a particular attribute, such as the use of race in credit decisions, or check the requirement that a certain class of analysis be applied for certain decisions. This might also be achieved by testing the trained model for unfair discrimination against a number of ‘discrimination testing’ datasets, or by assessing the actual outcomes of the machine learning process to prove that they comply with the lawfulness and fairness requirements.
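One simple form such outcome testing could take is comparing approval rates across groups in a test dataset. The sketch below is only one possible check, under my own assumptions: the group labels, the toy outcomes and the choice of a min/max ratio as the summary statistic are illustrative, and what ratio counts as acceptable is a policy question, not something the code decides.

```python
from collections import defaultdict


def selection_rates(records):
    """Fraction of positive decisions per group, from (group, approved) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in records:
        totals[group] += 1
        positives[group] += int(approved)
    return {group: positives[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means equal treatment."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group
    return min(rates.values()) / highest


# Hypothetical model decisions on a 'discrimination testing' dataset.
test_outcomes = (
    [("A", True)] * 8 + [("A", False)] * 2
    + [("B", True)] * 4 + [("B", False)] * 6
)

rates = selection_rates(test_outcomes)
ratio = disparate_impact_ratio(rates)
print(rates, f"ratio = {ratio:.2f}")
```

Here group B is approved half as often as group A, which would be a prompt to investigate the model and its training data rather than proof of unlawful discrimination on its own.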

Conclusion

According to Article 22 of the GDPR, data subjects have a right not to be subject to a decision based solely on automated processing, including profiling, that produces legal effects concerning them or significantly affects them. When data controllers use machine learning to carry out automated processing, including profiling of data subjects, they must comply with the requirement of lawful, fair and transparent processing. This may be difficult to achieve due to the way in which machine learning works and/or the way machine learning is integrated into a broader workflow that might involve the use of data of different origins and reliability, specific interventions by human operators, and the deployment of machine learning products and services, including ‘Machine Learning as a Service’ offerings (provided by Amazon, Google, Microsoft, and others).

In order to be compliant, data controllers must assess how using machine learning to carry out automated processing affects the different stages of profiling and the level of risk to data subjects’ rights, and how they can produce evidence of compliance for the regulator and the data subject. In some cases where automated processing, including profiling, is permitted by law, data controllers still have to implement appropriate measures to protect the data subjects’ rights. The underlying objective of GDPR is that a decision significantly affecting a person cannot just be based on a fully automated assessment of his or her personal characteristics. However, as I called out at the very beginning of this post, in the context of machine learning it might in some cases be more beneficial for data subjects if a final decision is based on an automated assessment, as it is devoid of the prejudices induced by human intervention.

Whether a decision about us is being made by a human or by a machine/computer, right now the best we can hope for is that such a decision, which can produce legal effects or significantly affect us in any manner, will be as fair as humans can be. Eventually, we, as machine learning practitioners, must aim to build machine learning models whose decisions are far fairer than those humans can make.

This takes into account that machines may soon be able to overcome the limitations of human decision makers and provide us with decisions that are demonstrably fair. Indeed, it may already make sense in some contexts to replace the current model, whereby individuals can appeal to a human against a machine decision, with the reverse, whereby individuals would have a right to appeal to a machine against a decision made by a human!

Well, that sounds a bit weird, doesn’t it? Has the time for Skynet to take over planet Earth finally arrived?

I am sure that many of the questions that we, the machine learning enthusiasts and practitioners, have about the implications of GDPR for machine learning will eventually be answered after GDPR becomes enforceable in May 2018. We will also see interesting changes to how machine learning models are designed and applied, especially in the context of personal data processing.

Title image courtesy: LinkedIn.com

Securely store API keys in R scripts with the “secret” package

When we use an API key to access a secure service through R, or when we need to authenticate in order to access a protected database, we need to store this sensitive information somewhere in our R code. The typical practice is to include those keys as strings in the R code itself, but as you guessed, that is not secure. By doing that, we are also storing our private keys and passwords in plain text on our hard drive. And as most of us use GitHub to collaborate on our code, we may also end up unknowingly including those keys in a public repo.

There is now a solution to this: the “secret” package for R, developed by Gábor Csárdi and Andrie de Vries. This package integrates with OpenSSH, providing R functions that allow us to create a vault to store keys on our local hard drive, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by the person who wrote the code or by people he/she trusts.

Here is the presentation by Andrie de Vries at useR!2017, where they demoed this package, and here is the package itself.

AI powered Cyber Security startups

Artificial Intelligence (AI) and Machine Learning have become mainstream these days, but at the same time, they are some of the most used (and abused) terms of the last 2-3 years.

Last year’s Gartner hype cycle report (2016 Hype Cycle for Emerging Technologies – shown below) shows this trend clearly.

Why do we need AI in Cyber security

The biggest challenge in the Cybersecurity Threat Management space today is the ability (or lack thereof) to effectively “detect” cyber attacks. One of the key levers in making “detection” work is reducing the dependency on the “human” element across the entire threat management lifecycle:

Be it the detection techniques (signatures, patterns, and for that matter ML models and their hyper-parameters), or
The incident “response” techniques:
- involving human security analysts to analyse the detections, or
- involving human security administrators to remediate/block the attacks at the network or system level

Introducing automation and bringing in cognitive methods in each of these areas is the only way forward to take the adversaries head-on. There have been numerous articles, presentations and whitepapers published on why Machine Learning (ML) and AI will play a key role in addressing the cyber threat management challenge.

In my pursuit of understanding how AI can be used effectively in the cybersecurity space, I have come across products developed by some of the leading startups in this domain. And in this blog post, I attempt to share my thoughts on 10 of these products, chosen primarily on their market cap/revenue, IP (intellectual property) potential, and any reference materials available about their successful detections so far.

Note:

I have tried to cover as much breadth as I can, with products falling under various domains of cybersecurity: network detection, UEBA, application security and data security. Even so, there is a good chance I have missed some contenders in this area. AI in cybersecurity is a rapidly growing field, and I hope to cover more ground in the coming months.
These products are listed below in no particular order.

Let’s get started.

1. PatternEx

Founded 2013, San Jose, California
https://www.patternex.com/
@patternex

PatternEx’s Threat Prediction Platform is designed to create “virtual security analysts” that mimic the intuition of human security analysts in real time and at scale. The platform reportedly detects ten times more threats with five times fewer false positives compared with approaches based on Machine Learning-Anomaly Detection technology. Using a new technology called “Active Contextual Modeling” or ACM, the product synthesizes analyst intuition into predictive models. These models, when deployed across global customers, can reportedly learn from each other and achieve a network effect in detecting attack patterns.

The process of Active Contextual Modeling (ACM) facilitates communication between the artificial intelligence platform and the human analyst. Raw data is ingested, transformed into behaviors, and run through algorithms to find rare events for an analyst to review. After investigation, the analyst attaches an appropriate label to each event. The system learns from these labels and automatically improves detection efficacy. Data models created through this process are flexible and adaptive. Event accuracy is continuously improved. Historic data is retrospectively analyzed as new knowledge is added to the system.

Training happens when the AI presents a set of alerts to human analysts, who review the alerts and classify them as attacks or not. The analyst applies a label to each alert, which trains a supervised learning model that automatically adapts and improves. This trained AI is an interesting concept: it attempts to simulate a security analyst, helping the system improve its detection over time.
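The loop described above (surface rare events, have an analyst label them, learn from the labels) can be sketched in a few lines. This is my own toy illustration of an analyst-in-the-loop workflow, not PatternEx’s ACM: the behaviour names, the `analyst_verdict` stand-in for the human, and the simple label-counting “model” are all assumptions.

```python
from collections import Counter


def rarest_events(events, k):
    """Rank events by how rarely their behaviour occurs and return the top k."""
    counts = Counter(event["behaviour"] for event in events)
    return sorted(events, key=lambda event: counts[event["behaviour"]])[:k]


class FeedbackModel:
    """Remembers analyst verdicts per behaviour and scores new events."""

    def __init__(self):
        self.attack = Counter()
        self.benign = Counter()

    def learn(self, behaviour, is_attack):
        target = self.attack if is_attack else self.benign
        target[behaviour] += 1

    def attack_score(self, behaviour):
        # Smoothed fraction of analyst labels that said "attack";
        # an unseen behaviour scores a neutral 0.5.
        a, b = self.attack[behaviour], self.benign[behaviour]
        return (a + 1) / (a + b + 2)


# Hypothetical event stream: lots of routine activity, two rare behaviours.
events = (
    [{"id": i, "behaviour": "login_office_hours"} for i in range(50)]
    + [{"id": 50, "behaviour": "bulk_download_3am"}]
    + [{"id": 51, "behaviour": "new_admin_account"}]
)


def analyst_verdict(event):
    """Stand-in for the human analyst's judgement (hypothetical)."""
    return event["behaviour"] == "bulk_download_3am"


model = FeedbackModel()
for event in rarest_events(events, k=2):
    model.learn(event["behaviour"], analyst_verdict(event))

for behaviour in ("bulk_download_3am", "new_admin_account", "login_office_hours"):
    print(f"{behaviour:<20} attack score {model.attack_score(behaviour):.2f}")
```

The important design point is that the analyst only looks at the handful of rare events, and each verdict changes how similar events are scored afterwards.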

PatternEx was founded by Kalyan Veeramachaneni, Uday Veeramachaneni, Vamsi Korrapati, and Costas Bassias.

PatternEx has received funding of about $7.8M so far.

2. Vectra Networks

Founded 2011, USA
http://www.vectranetworks.com/
@Vectra_Networks

Vectra Networks’ platform is designed to instantly identify cyber attacks while they are happening as well as what the attacker is doing. Vectra automatically prioritizes attacks that pose the greatest business risk, enabling organizations to quickly make decisions on where to focus their time and resources. The company says that the platform uses a next-generation compute architecture and combines data analytics and machine learning to detect attacks on every device, application and operating system. To do this, the system uses the most reliable source of information: network traffic. Logs only provide low-fidelity summaries of events that have already been seen, not what has been missed. Likewise, endpoint security is easy to compromise during an active intrusion.

The Vectra Networks approach to threat detection blends human expertise with a broad set of data science and machine learning techniques. This model, known as Automated Threat Management, delivers a continuous cycle of threat intelligence and learning based on cutting-edge research, global learning models, and local learning models. With Vectra, all of these different perspectives combine to provide an ongoing, complete and integrated view that reveals complex multistage attacks as they unfold inside your network.

They have an interesting approach that uses both supervised and unsupervised ML models to detect cyber attacks. A “Global Learning” element uses supervised ML algorithms to build models that detect “generic” and “new known” attack patterns. A “Local Learning” element uses unsupervised ML algorithms to learn the local norms in an enterprise, and then detects deviations from those norms.
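To make the “local learning” idea concrete, here is a minimal sketch that learns a per-host baseline and flags deviations from it. The host names, the traffic numbers and the threshold of three standard deviations are assumptions chosen for illustration; this is not Vectra's algorithm.

```python
import statistics

# Hypothetical history: outbound connections per hour for each host.
history = {
    "workstation-1": [12, 15, 11, 14, 13, 12, 16],
    "db-server": [200, 210, 190, 205, 198, 202, 207],
}


def learn_norms(history):
    """Learn a per-host baseline (mean and sample standard deviation)."""
    return {
        host: (statistics.mean(values), statistics.stdev(values))
        for host, values in history.items()
    }


def is_deviation(norms, host, observed, threshold=3.0):
    """Flag an observation more than `threshold` standard deviations from the host's norm."""
    mean, std = norms[host]
    return abs(observed - mean) / std > threshold


norms = learn_norms(history)
print(is_deviation(norms, "workstation-1", 200))
print(is_deviation(norms, "db-server", 200))
```

The same value (200 connections per hour) is an anomaly for the workstation but perfectly normal for the database server, which is why a locally learned norm can catch things a single global rule would miss.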

Vectra Networks has received funding of about $87M so far, and has seen very good traction in the Enterprise Threat Detection space, where ML models are a lot more effective than conventional signature/pattern based detection.

3. Darktrace

Founded 2013, UK
https://www.darktrace.com/
@Darktrace

Darktrace is inspired by the self-learning intelligence of the human immune system; its Enterprise Immune System technology iteratively learns a pattern of life for every network, device and individual user, correlating this information in order to spot subtle deviations that indicate in-progress threats. The system is powered by machine learning and mathematics developed at the University of Cambridge. Some of the world’s largest corporations rely on Darktrace’s self-learning appliance in sectors including energy and utilities, financial services, telecommunications, healthcare, manufacturing, retail and transportation.

Darktrace has a set of products that use ML and AI to detect and block cyber attacks:

Darktrace (Core) is the Enterprise Immune System’s flagship threat detection and defense capability, based on unsupervised machine learning and probabilistic mathematics. It works by analyzing raw network data, creating unique behavioral models for every user and device, and for the relationships between them.

The Threat Visualizer is Darktrace’s real-time, 3D threat notification interface. As well as displaying threat alerts, the Threat Visualizer provides a graphical overview of the day-to-day activity of your network(s), which is easy to use, and accessible for both security specialists and business executives.

Darktrace ICS retains all of the capabilities of Darktrace in the corporate environment, creating a unique behavioral understanding of the ‘self’ for each user and device within an industrial control system’s network, and detecting threats that cannot be defined in advance by identifying even subtle shifts in expected behavior in the OT space.

Darktrace Antigena is capable of taking a range of measured, automated actions in the face of confirmed cyber threats detected in real time by Darktrace. Because Darktrace understands the ‘pattern of life’ of users, devices, and networks, Darktrace Antigena is able to take action in a highly targeted manner, mitigating threats while avoiding over-reactions. Once a cyber attack is detected by the Darktrace Core, it can take three kinds of action:

Stop or slow down activity related to a specific threat
Quarantine or semi-quarantine people, systems, or devices
Mark specific pieces of content, such as email, for further investigation or tracking

Darktrace has received funding of about $105M so far.

4. StatusToday

Founded 2015, UK
http://www.statustoday.com/
@statustodayhq

StatusToday was founded by Ankur Modi and Mircea Danila-Dumitrescu. It is a SaaS based AI-powered Insights Platform that understands human behavior in the workplace, helping organizations ensure security, productivity and communication.
Through patent-pending AI that understands human behavior, StatusToday maps out human threats and key behavior patterns internal to the company.

In a nutshell, this product collects user activity log data from various IT systems, applications, servers and even everyday cloud services like Google Apps or Dropbox. After collecting this metadata, the tool extracts as many functional parameters as possible and presents them in easily understood reports and graphs. I think they use one of the link analysis ML models to plot the relationships between all these user attributes.

The core solution provides direct integrations with Office 365, Exchange, CRMs, Company Servers and G-Suite (upcoming) to enable a seamless no-effort Technology Intelligence Center.

StatusToday has been identified as one of the UK’s top 10 AI startups by Business Insider, TechWorld, VentureRadar and other forums in the EU region.

StatusToday has received funding of about $1.2M so far.

5. Jask

Founded 2015, USA
http://jask.io/
@jasklabs

Jask aims to use AI to solve an age-old problem: the tsunami of logs fed into SIEM tools, which then generate the events, alerts and other indicators that security analysts face every day. This never-ending flood of unknowns forces analysts to spend their valuable time sorting through indicators in an endless hunt for real threats.

At the heart is their product Trident, which is a big data platform for real time and historical analysis over an unlimited amount of stored security telemetry data. Trident collects all this data directly from the network and complements that with the ability to fuse other data sources such as threat intelligence (through STIX and TAXII), providing context into real threats. Once Trident identifies a sequence that indicates an attack, it generates SmartAlerts, which analysts can use to have the full picture of an attack, also allowing them to spend their time on real analysis instead of an endless hunt for the attack story.

They have really interesting blog posts on their site, which are worth a read.

Jask has received funding of about $2M so far.

6. Fortscale

Founded 2012, Israel
https://fortscale.com/
@fortscale

Fortscale uses a machine learning system to detect abnormal account behavior indicative of credential compromise or abuse. The company was founded by security engineers from the Israeli Defense Force’s elite security unit. The product’s key ability is to rapidly detect and eliminate insider threats. From rogue employees to hackers with stolen credentials, Fortscale is designed to automatically and dynamically identify anomalous behaviors and prioritize the highest-risk activities within any application, anywhere in the enterprise network.

Behavioral data is automatically ingested from SIEM tools and enriched with contextual data. Multi-dimensional baselines are then created autonomously, and statistical analysis reveals any deviations, which are captured in SMART Alerts. All of this can be viewed and analysed in the Fortscale Console.

Fortscale was named a Gartner Cool Vendor (2016) in the UEBA, Fraud Detection and User Authentication category.

More info about the product can be found here.

Fortscale has received funding of about $40 million so far.

7. Neokami

Founded 2014, Germany & USA
https://www.neokami.com/
@neokami_tech

Neokami attempts to tackle a very important problem we all face today: keeping track of where all our own and our enterprise’s sensitive information resides. Neokami’s CyberVault uses AI to discover, secure and govern sensitive data in the cloud, on premise, or across physical assets. It can also scan images to detect sensitive information, as it uses highly optimized NLP for text analytics and Convolutional Neural Networks for image data analytics.
In a nutshell, Neokami uses a multi-layer decision pipeline that takes in data streams or files and performs pattern matching, text analytics, image recognition, N-gram modelling and topic detection, using ML methods like Random Forest to learn user-specific sensitivity over time. After this analysis, a percentage sensitivity score is generated and assigned to the data, which can then be picked up for further analysis and investigation.
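As a rough illustration of how layered checks can be combined into a single sensitivity score, here is a toy sketch with a pattern-matching layer and a keyword layer. The regular expressions, keywords and weights are all made up for this example, and a fixed weighted sum stands in for the learned models (such as Random Forest) mentioned above; it is not how Neokami's product works internally.

```python
import re

# Hypothetical detectors and weights, chosen only for illustration.
PATTERNS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 30),
    "card_number": (re.compile(r"\b(?:\d[ -]?){13,16}\b"), 50),
}
KEYWORDS = {"confidential": 20, "salary": 10, "passport": 20}


def sensitivity_score(text):
    """Return a 0-100 score from a pattern-matching layer and a keyword layer."""
    score = 0
    # Layer 1: structured patterns such as e-mail addresses or card-like numbers.
    for pattern, weight in PATTERNS.values():
        if pattern.search(text):
            score += weight
    # Layer 2: simple keyword spotting on the lower-cased text.
    lowered = text.lower()
    for word, weight in KEYWORDS.items():
        if word in lowered:
            score += weight
    return min(score, 100)


print(sensitivity_score("Lunch at noon?"))
print(sensitivity_score("CONFIDENTIAL: card 4111 1111 1111 1111, contact jo@example.com"))
```

A learned model would replace the fixed weights with ones fitted to how each user or organization actually labels documents, which is the “user-specific sensitivity” part of the description.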

Some key use cases Neokami tackles are: isolating PII to meet regulations such as GDPR, HIPAA, etc.; discovering a company’s confidential information and intellectual property; scanning images for sensitive information; and protecting information in Hadoop clusters, cloud, endpoints or mainframes.

Neokami was acquired by Relayr in Feb this year, and has received $1.1 million in funding so far, from three investors.

8. Cyberlytic

Founded 2013, UK
https://www.cyberlytic.com/
@CyberlyticUK

Cyberlytic call themselves the ‘Intelligent Web application security’ product. Their elevator pitch is they provide advanced web-application security using AI to classify attack data, identify threat characteristics and prioritize high-risk attacks.

The founders have had a stint with the UK Ministry of Defence, where this product was first used and has since been used to support critical cybersecurity research projects in the department.

Cyberlytic analyzes web server traffic in real time, and determines the sophistication, capability and effectiveness of each attack. This information is translated into a risk score, to prioritize incident response and prevent dangerous web attacks. The underlying ML models adapt to new and evolving threats without requiring the creation or management of firewall rules. The key to their detection is their patented ML classification approach, which appears to be more effective in detecting web application attacks than conventional signature/pattern based detection.

Cyberlytic is a combination of two products: the Profiler and the Defender. The Profiler provides real-time risk assessment of web-based attacks by connecting to the web server and analyzing web traffic to determine the capability, sophistication and effectiveness of each attack. The Defender is deployed on web servers and acts on the assessment performed by the Profiler, blocking web-based cyber attacks before they reach critical web applications or the underlying data layer.

Cyberlytic has also been gaining a lot of attention in the UK and EU region; Real Business, an established publication in the UK, has named Cyberlytic as one of the UK’s 50 most disruptive tech companies in 2017.

Cyberlytic has received funding of about $1.24 million.

9. harvest.ai

Founded 2014, USA
http://www.harvest.ai/
@harvest_ai

Harvest.ai aims at detecting and stopping data breaches, by using AI-based algorithms to learn the business value of critical documents across an organization, and offer what it describes as an industry-first ability to detect and stop data breaches. In a nutshell, Harvest.ai is an AI powered advanced DLP system having the ability to perform UEBA.

Key features of their product, MACIE, include:

Use AI to track intellectual property across an organization’s network, including emails and other content derived from IP.
MACIE understands the business value of all data across a network and whether it makes sense for a user to be accessing certain documents, a key indicator of a targeted attack.
MACIE can automatically identify risk to the business of data that is being exposed or shared outside the organization and remediate based on policies in near real-time. It not only classifies documents but can identify true IP matches to protect sensitive documents that exist for an organization, whether it be technology, brand marketing campaigns or the latest pharmaceutical drug.
MACIE not only detects changes in a single user’s behavior, but also has the unique ability to detect minor shifts in groups of users, which can indicate an attack.

Their blog has some interesting analysis of some of the recent APT attacks, and how MACIE detected them. Definitely worth a read.

Harvest.ai has received funding of about $2.71 million so far, and interestingly, they have been acquired by Amazon in Jan this year, for reportedly $20 million.

10. Deep Instinct

Founded 2014, Israel
http://www.deepinstinct.com/
@DeepInstinctSec

Deep Instinct focuses on the endpoint as the pivot point for detecting and blocking cyber attacks, and thus falls under the category of EDR. Something has been going on in Israel for the last few years: many cybersecurity startups (Cybereason, Demisto, Intsights, etc.) are being founded by ex-IDF engineers, and a good portion of these startups have to do with Endpoint Detection and Response (EDR).

Deep Instinct uses deep learning to detect unknown malware in real time, just by analysing the raw bytes of a binary picked up by the system. The software runs efficiently on a combination of central processing units (CPUs) and graphics processing units (GPUs), using Nvidia’s CUDA software for running non-graphics workloads on graphics chips. The GPUs enable the company to do in a day what would take three months on a CPU.

I couldn’t find enough documentation on their website to understand how this deep learning system actually works, but their website has a link to register for an online demo, so it is probably worth a try.

They are also gaining a lot of attention in the EDR space, and NVIDIA has selected Deep Instinct as one of the 5 most disruptive AI startups this year.

Deep Instinct has raised $50 million so far, from Blumberg Capital, UST Global, CNTP, and Cerracap.

Model Evaluation in Machine Learning

One of the most important activities for a Data Scientist is to measure and optimize the prediction accuracy of the Machine Learning models they have built. Though there are various approaches to do this, they can be grouped into three major steps.

Sebastian Raschka, the author of the bestselling book “Python Machine Learning”, who is a Ph.D. candidate at Michigan State University, developing new computational methods in the field of computational biology, has published an excellent article describing these steps.

In a nutshell, he breaks down the evaluation process into three main steps:

Data generalisation – ensure that the training data and the test data have good ‘variance’ and a fair proportion of the various classes. This can be achieved with several techniques:
- Stratification
- Cross validation – k fold or bootstrap
- Hold out method – training data set, hold out data set, test data set
- Bias variance trade-off
Algorithm selection – picking the right algorithm that is best suited for the use case in hand
Model selection
1. Hyper parameters tuning – cross validation techniques
2. ‘Model parameters’ belong to a fitted model, whereas ‘hyper parameters’ (also called tuning parameters) belong to the learning algorithm; for ex., the depth of trees in Random Forests
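To show what stratification combined with k-fold cross validation looks like in practice, here is a small plain-Python sketch. It is my own illustration rather than code from Sebastian's articles; the `stratified_k_fold` helper and the 80/20 toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_indices, test_indices) with class proportions kept in each fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical imbalanced data set: 80% "benign", 20% "attack".
labels = ["benign"] * 16 + ["attack"] * 4

for train, test in stratified_k_fold(labels, k=4):
    attacks_in_test = sum(labels[i] == "attack" for i in test)
    print(len(train), len(test), attacks_in_test)
```

Every fold ends up with the same 80/20 class mix as the full data set, so a model is never evaluated on a test fold that happens to contain no attacks at all. Hyper parameter tuning then amounts to repeating this loop for each candidate setting and keeping the one with the best average score.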

Sebastian has put together a detailed three-part tutorial where he goes into the details of each of these steps.

These are great reads for anyone who is having a tough time picking the right model for their ML project, and also having difficulty measuring its efficiency and accuracy.

Title Image courtesy: biguru.wordpress.com

Data Scientist’s take on the US Election results

It would be an understatement if I say that the outcome of the recently concluded US election has been a shocker for many people in the US, and across the world. Also, as called out by various media outlets, these results have indicated the failure of the political polling/predictive analytics industries and the power of data and data science.

In this post I share my thoughts on this matter.

From a Data Science perspective, there are two possibilities of why the predictions were so off the charts:

a) The predictive models were wrong

b) The data used in the models was bad

Let’s look at both these possibilities in detail.

a) Predictive Models were wrong

There is this adage which is widely accepted in the statistics world that “All models are wrong”. The reason for this stand is that ‘data’ beats ‘algorithms’, and that the models are only as good as the data used to validate them. But in this particular case, the models used have been in use in polling predictions for decades, and it’s not clear to me what went wrong with them.
Having said that, there is definitely some interesting work published in the last few weeks that shows the use of inference and regression models in understanding the outcome of this election. Here is a whitepaper published by professors in the Dept. of Statistics at Oxford University. To summarize the paper:

We combine fine-grained spatially referenced census data with the vote outcomes from the 2016 US presidential election. Using this dataset, we perform ecological inference using distribution regression (Flaxman et al, KDD 2015) with a multinomial-logit regression so as to model the vote outcome Trump, Clinton, Other / Didn’t vote as a function of demographic and socioeconomic features. Ecological inference allows us to estimate “exit poll” style results like what was Trump’s support among white women, but for entirely novel categories. We also perform exploratory data analysis to understand which census variables are predictive of voting for Trump, voting for Clinton, or not voting for either. All of our methods are implemented in python and R and are available online for replication.

b) Data used in the models was bad

Not everyone will be open about their opinion, especially if it is not aligned with the general consensus among the public. Such opinions are usually not welcome in our society. A recent example of this is Mark Zuckerberg reprimanding employees for stating that “All Lives Matter” on a Black Lives Matter posting inside the Facebook headquarters. So there is a good chance such opinions did not make it into the data sets used in the models.
Groupthink also played a major role in skewing the data set. When most media and journalist agencies were predicting Hillary’s landslide victory over Trump, only a courageous pollster would contradict the widely supported poll results. This resulted in everybody misreading the data.
Analysis methods were also incomplete: they relied only on traditional ways of collecting data, such as surveys and polls, and ignored important signals from social media platforms, especially Twitter. The candidates’ social media engagement with voters was a largely ignored data set, in spite of social media analysts sounding the alarm that the polls were not reflecting the actual situation on the ground before the election. Clinton outspent Trump on TV ads, set up more field offices and sent staff to swing states earlier, but Trump simply made better use of social media to both reach and grow his audience, and he clearly benefitted from that old adage, “any press is good press.”

To summarize…

Data science has limitations

Data is powerful only when used correctly. As I called out above, biased data was the biggest spoilsport in the predictions for this election.
Variety of data is more important than volume. There is a constant craze these days to collect as much data as possible, from various sources. Google and Facebook are leading examples. As I called out above, drawing on different data sets, including social media, could definitely have helped bring the predictive models closer to reality. Simply put, the key is using the right “big data”.

Should we be surprised?

To twist the perspective a little bit: if we keep in mind how probabilistic predictions work, the outcome shouldn’t be a surprise to us. For ex., if I said “I am 99% sure that it’s going to be a sunny day tomorrow” and you offered to bet on it at odds of 99 to 1, I might say, “I didn’t mean it literally; I just meant it will ‘probably’ be a sunny day”. Would you be surprised if I tossed a coin twice and got heads both times? Not at all, right?
This New York Times article captures the gist of what actually went wrong with the use of data and probabilities in this election very well. I think the following lines say it all:

The danger, data experts say, lies in trusting the data analysis too much without grasping its limitations and the potentially flawed assumptions of the people who build predictive models.

The technology can be, and is, enormously useful. “But the key thing to understand is that data science is a tool that is not necessarily going to give you answers, but probabilities,” said Erik Brynjolfsson, a professor at the Sloan School of Management at the Massachusetts Institute of Technology.

Probabilistic prediction is a very interesting topic, but it can also be very misleading if the probabilities are not presented correctly (for ex., a 70% chance of Clinton winning versus a 30% chance for Trump).
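A quick calculation makes the point. Assuming the 70/30 forecast was well calibrated (an assumption, not something the polls guarantee), the “unlikely” outcome should still occur in roughly 3 out of every 10 such elections, which is slightly more often than getting two heads in a row. The simulation below is a minimal sketch; `simulate_upsets` is a name I made up for this example.

```python
import random

# Probability of two heads in two fair coin tosses.
p_two_heads = 0.5 * 0.5


def simulate_upsets(p_underdog, elections, seed=0):
    """Count how often an outcome with probability `p_underdog` occurs."""
    rng = random.Random(seed)
    return sum(rng.random() < p_underdog for _ in range(elections))


upsets = simulate_upsets(p_underdog=0.30, elections=10_000)
print(p_two_heads)
print(upsets / 10_000)
```

So a forecast of “70% Clinton” is not a claim that Clinton will win; it is a claim that in about 30% of comparable situations she will not.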

I shall delve deeper into this in a follow up post…

Title image courtesy: http://www.probabilisticworld.com

RStudio v1.0 is out

RStudio finally moved out of “beta” status last week, and the first official production version is now available. This is great news for all of us who use RStudio as the primary IDE for R programming.

Check out this link for the release history of RStudio and all the changes it has gone through over the last 6 years.

Some of the major new features added in this release are:

Support for R Notebooks, a new interactive document format combining R code and output. It’s similar to (but not based on) Jupyter Notebooks, in that an R Notebook includes chunks of R code that can be processed independently (as opposed to R Markdown documents, which are processed all at once in batch mode).
GUI support for the sparklyr package, with menus and dialogs for connecting to a Spark cluster, and for browsing and previewing the available Spark Dataframe objects.
Profiling tools for measuring which parts of your R code are consuming the most processing time, based on the profvis package.
Dialogs to import data from file formats including Excel, SAS and SPSS, based on the readr, readxl and haven packages.

Check out the official blog for more information about this release.

R moves up to 5th place in IEEE language rankings

IEEE has published its annual Top Computer programming languages rankings report. It starts with the line “C is No. 1, but big data is still the big winner”, indicating the rise of R, the de facto programming language used in Big Data analytics, including in the Cyber Security domain.

I think this is an extraordinary result for a language which is domain-specific (big data and data science). The other four languages in the Top 5 (C, Java, Python and C++) are all general purpose languages, so this is a great feat, and a clear indication of the adoption, heavy use and relevance of R in today’s Information Age, where every device, system, or “thing” (IoT) generates some form of data (logs). This also reflects the critical importance of Data Science (where R is the de facto programming language used by Data Scientists) as a discipline today.

Some interesting lines from the report:

Another language that has continued to move up the rankings since 2014 is R, now in fifth place. R has been lifted in our rankings by racking up more questions on Stack Overflow – about 46 percent more since 2014. But even more important to R’s rise is that it is increasingly mentioned in scholarly research papers. The Spectrum default ranking is heavily weighted toward data from IEEE Xplore, which indexes millions of scholarly articles, standards, and books in the IEEE database. In our 2015 ranking there were a mere 39 papers talking about the language, whereas this year we logged 244 papers.

R’s steady growth in this and numerous other surveys and rankings over time reflects the growing importance of Data Science applied using R. And the application of Data Science concepts in cyber security, especially in detecting cyber attacks, is only becoming more relevant.
Detecting cyber attacks with conventional security monitoring tools that rely on rule based detection engines (yes, they are called SIEM!) is not working anymore. Let’s face it: SIEM is showing its age. The machine learning approach to detecting cyber attacks has become one of the most important developments in the cyber security domain in the last 10 years. Its relevance is at its peak in today’s world, where a surplus of data (also called “Big Data”) is being churned out by all forms of computer systems. And R is playing a very important role in helping Security Data Scientists build “algorithmic models” that can better detect cyber attacks.

So I am very excited and happy to see R’s popularity and adoption growing year on year.

This is a core area of study I am currently focusing on, and I will be writing more about this here on my blog, in the coming months.

Picture Courtesy: ieee.org

Live transcription of OpenVis Conference

OpenVis Conference is a 2 day annual conference, held in Boston, about the practice of visualising data on the web. A must for all the Data Visualisation professionals amongst us.

What is interesting this time is that they are live streaming the conference as a transcript on their site, as shown below.

The conference is being held today (Apr 25) and tomorrow, and there are some really interesting talks lined up. Some of these concepts have direct implications for Cyber/Information Security too.

I am hoping that the Presentations will be made available for people who couldn’t make it to the conference.
