Christmas Markets opening up

It’s the time of the year when the whole world gets busy enjoying and celebrating the Christmas season.

In Germany, the Christmas markets (called Weihnachtsmärkte) start opening up in late November every year, and close just before Christmas Eve (usually by Dec 22).

The market in Frankfurt, called the Frankfurter Weihnachtsmarkt in German, is one of the oldest in Germany, with origins going back to the 14th century.

I am looking forward to exploring these Weihnachtsmärkte in the coming days and hope to share my experiences here.

A view from atop Atomium (Brussels)

An interesting landmark near the city of Brussels (Belgium 🇧🇪). It’s a structure built in 1958, with nine stainless-steel-clad spheres connected to form the shape of a unit cell of an iron crystal magnified 165 billion times. It is now a museum with a very nice viewpoint at the top of the building (335 ft tall).

Interesting design for sure, and a technical feat by the architects in the 1950s.

Quick thoughts on Apple’s WWDC 2019 announcements

Another year has passed, and it’s about time for another WWDC and its key takeaways for developers.

There were a lot of interesting announcements by Apple. Six announcements in yesterday’s keynote, when seen together, show where Apple as a company is moving.

1. Mac Pro – it’s a powerhouse. But at a $6,000 base price, barring a few big-budget content creators, I wonder how many will buy or need it.

2. iOS 13 – I did not see any major groundbreaking features. I am definitely looking forward to the increase in speed and performance, though (the last few months of my honourable and trusty but ageing 6s, before I upgrade later this year).

3. Sign in with Apple – this is a big one. Again, I really appreciate Apple’s push for privacy here. The random email address generation, to keep our mailboxes from being spammed, is cool!

4. watchOS 6 – new watch faces are always welcome; I have grown really bored of the limited watch faces on my Series 3. Native apps like Calculator and Voice Memos are definitely welcome. The App Store on the Watch takes it one step closer to being an independent device.

5. iPadOS – I see this as a marketing move, to get the message out that Apple does care about products beyond the iPhone (which still contributes over 50% of its revenue). But the iPad is definitely becoming worth considering for folks who do not want a laptop as their primary desktop-class computer.

6. Privacy – some great privacy features. This is what I love most about being part of the Apple ecosystem.

Cybersecurity in 2017 and going forward…

2017 has come to an end, and it’s time to reflect on the year gone by and look forward to what is in store for us, cybersecurity professionals, in 2018.

To start with, let’s look at some of the major security events and incidents of 2017. The following five security and data breaches made headlines all over the globe:

Equifax

This breach was publicly disclosed in September 2017. It was a truly vast breach: the stolen data included Social Security and driver’s license numbers of some 143 million US consumers. Credit card numbers and other personally identifying information were also compromised for a smaller number of consumers. With this sensitive data now exposed, the operational impact on businesses is that many organizations, including banks, that rely on such data to prove the identity of online users may need to implement additional, expensive and cumbersome authentication procedures.

Apparently, the attack vector was a simple one: the cybercriminals leveraged the critical remote code execution vulnerability CVE-2017-5638 in Apache Struts 2. Ironically, this wasn’t a zero-day; a patch for the vulnerability had been available since March 2017.

Yahoo

Although the attack occurred, or at least began, in 2013 (the same year Target was also exposed to a cyber attack), it only came to light this year, when parent company Verizon announced in October that every one of Yahoo’s 3 billion accounts was compromised in 2013. That’s more than three times the initial assessment made last year. Beyond the massive size of the attack, what is astonishing is that it remained largely hidden for so many years. It makes me wonder: how many other huge attacks have occurred that we still don’t know about?

Uber

Last month, Uber CEO Dara Khosrowshahi revealed that two hackers broke into the company in late 2016 and stole personal data, including phone numbers, email addresses, and names, of 57 million Uber users. Among those, the hackers stole 600,000 driver’s license numbers of drivers for the company. Instead of disclosing the breach, as the law requires, Uber paid $100,000 to the hackers to conceal the fact that a breach had occurred. Why is this attack significant?

  • The vast number of records compromised
  • The fact that the attackers extorted a payment to keep quiet – data-theft extortion rather than classic ransomware
  • The company paid the attackers (thus encouraging the illegal industry), and,
  • Nobody at such a large company disclosed the breach.

Shadow Brokers leak of NSA/CIA files

In 2013, a mysterious group of hackers calling itself the Shadow Brokers stole a few disks’ worth of National Security Agency secrets. Since the beginning of 2017, they’ve been dumping these secrets on the internet. They have publicly embarrassed the NSA and damaged its intelligence-gathering capabilities, while putting sophisticated cyber-weapons in the hands of anyone who wants them. This hack is significant because, with all this information now in the hands of cybercriminals, we are already seeing crimes committed by smaller groups that used to be the preserve of well-funded, state-sponsored attackers. The level of sophistication among attackers has taken a giant leap forward.

WannaCry

Enough has been said and written about WannaCry, which turned out to be the most widespread ransomware attack of the year. It plagued hundreds of thousands of machines in massive global cyberattacks. The widespread impact of WannaCry can be attributed to the NSA losing control of its key hacking tools to the Shadow Brokers group, which enabled hackers to install backdoors that distributed the ransomware to millions of computers.

A key outcome of these incidents has been organisations shifting focus to incident and breach detection and response. More importantly, the need for automation in these two areas, powered by machine learning techniques, has gained a lot of momentum.

Looking back through 2017, there was significant progress in the effective use of machine learning techniques for detecting and responding to cyber attacks. Following are some examples that demonstrate this.

I have broken these examples into two categories: the offensive side (the cybercriminal’s perspective) and the defensive side (the security architect’s and incident analyst’s perspective).

Developments in Offensive security

Attackers have started leveraging machine learning more actively to improve their attacks. There is not much public evidence of this in the breaches I called out above, so I pick a few examples from the recently held Black Hat (US) conference.

One Black Hat talk, Bot vs Bot: Evading Machine Learning Malware Detection, explored how adversaries could use ML to figure out what ML-based malware detection mechanisms were “looking” for. They could then create malware that avoided those things and thus evade detection. Another talk, Wire Me Through Machine Learning, investigated how spammers might improve the success rate of their phishing campaigns by leveraging ML to improve their phishing emails.

At DEF CON, researchers shared how to “Weaponize machine learning (humanity is overrated anyway)”. They introduced a tool called DeepHack, an open-source AI that hacks web applications. Meanwhile, ML was an underlying subject in many other talks that weren’t directly about it. It’s clear that cybersecurity researchers and attackers alike are leveraging ML and AI to speed up and improve their projects.

Developments in Defensive security

1. Let’s start with the Equifax breach. As mentioned earlier, the attackers exploited the Apache Struts Jakarta Multipart Parser vulnerability, CVE-2017-5638. In this particular case, we could look at various anomaly detection techniques. Examples include suspicious process/service activity anomalies (e.g., a rare process or MD5 for a given user or host), suspicious network activity anomalies (e.g., traffic to rare domains), and suspicious Tomcat web server access anomalies. Such anomaly detection rules can be a starting point for a machine-learning-based intrusion detection tool.
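As a toy illustration of one such rule, here is a minimal sketch, in Python, of a “rare process for host” anomaly check; the event fields (`host`, `process`) and the threshold are invented for illustration:

```python
from collections import Counter

def rare_process_anomalies(events, threshold=0.01):
    """Flag (host, process) pairs whose relative frequency in the
    event log falls below `threshold` -- a toy 'rare process for
    host' anomaly rule."""
    counts = Counter((e["host"], e["process"]) for e in events)
    total = sum(counts.values())
    return sorted(pair for pair, n in counts.items() if n / total < threshold)

# Invented event log: webserver01 almost always runs tomcat; a single
# event shows an unexpected shell spawn -- the kind of child process a
# Struts RCE like CVE-2017-5638 would produce.
events = ([{"host": "webserver01", "process": "tomcat"}] * 199
          + [{"host": "webserver01", "process": "/bin/sh"}])
print(rare_process_anomalies(events))  # [('webserver01', '/bin/sh')]
```

A production rule would of course baseline per host over a rolling window, but the principle, frequency-based rarity scoring, is the same.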

2. Detecting web application attacks: Web applications are a primary target for cybercriminals, as they are mostly exposed to the internet and, in many cases (as also seen in the Equifax attack), are not effectively configured to prevent exploitation of vulnerabilities. One of the most widely used web application attacks is SQL injection. There are many methods of detecting it without depending on signature-based systems, using machine learning algorithms alone. One such approach, described in detail in this white paper, identifies SQL injection attempts from the attributes of HTTP parameters using a Bayesian classifier. Such methods can be considerably more effective than traditional web application firewalls.
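The white paper’s exact feature set isn’t reproduced here, but the core idea, a Naive Bayes classifier over features extracted from HTTP parameter values, can be sketched in a few lines of Python; the tokenizer and the tiny training set are invented for illustration:

```python
import math
import re
from collections import Counter

def tokens(param):
    # Crude lexer over an HTTP parameter value: words, '--', and SQL metacharacters.
    return re.findall(r"[A-Za-z_]+|--|[';=()]", param.lower())

class BayesSQLiClassifier:
    """Tiny Naive Bayes with Laplace smoothing -- a sketch of the idea only."""
    def fit(self, samples, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = {0: 0, 1: 0}
        for s, y in zip(samples, labels):
            self.priors[y] += 1
            self.counts[y].update(tokens(s))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def predict(self, sample):
        scores = {}
        for y in (0, 1):
            total = sum(self.counts[y].values())
            score = math.log(self.priors[y] / sum(self.priors.values()))
            for t in tokens(sample):
                # Laplace-smoothed token likelihood
                score += math.log((self.counts[y][t] + 1) / (total + len(self.vocab)))
            scores[y] = score
        return max(scores, key=scores.get)

clf = BayesSQLiClassifier().fit(
    ["alice", "bob42", "q=shoes", "1' or '1'='1", "admin'--", "x'; drop table users--"],
    [0, 0, 0, 1, 1, 1],
)
print(clf.predict("1' or 1=1 --"))  # 1 (flagged as injection)
print(clf.predict("carol"))         # 0 (benign)
```

Real deployments would use far richer attributes (parameter length, entropy, keyword n-grams) and far more training data, but the Bayesian scoring step works exactly like this.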

3. A deep learning approach to network intrusion detection: Over the last two years, there have been many developments in using conventional machine learning algorithms to build network intrusion detection systems (NIDS). These tools are essentially classifiers that differentiate normal traffic from anomalous traffic. Such NIDSs perform a feature selection task to extract a subset of relevant features from the traffic dataset and enhance the classification result. Feature selection helps eliminate the possibility of incorrect training by removing redundant features and noise.

More recently, however, deep learning methods have been successfully applied in audio, image and speech processing. These methods aim to learn a good feature representation from a large amount of unlabelled, unstructured data and then apply the learned features to a limited amount of labelled data in supervised classification. The labelled and unlabelled data may come from different distributions, but they must be relevant to each other. By combining signals from unlabelled, unstructured data with labelled, structured (log) data, we can significantly improve the odds of detecting anomalous behaviour, and in turn a real cyber incident. Here is an interesting white paper that describes one such system in detail.
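As a small illustration of the feature selection step described above, here is a sketch that drops near-useless traffic features: constant columns (no information) and one of each highly correlated pair (redundant). The feature names and flow values are made up:

```python
import math

def column(rows, j):
    return [r[j] for r in rows]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_features(rows, names, max_corr=0.95):
    keep = []
    for j, name in enumerate(names):
        col = column(rows, j)
        if variance(col) == 0:            # constant column: no information
            continue
        if any(abs(pearson(col, column(rows, k))) > max_corr
               for k, _ in keep):         # redundant with an already-kept feature
            continue
        keep.append((j, name))
    return [name for _, name in keep]

# Hypothetical flow records: bytes_out is roughly 2x bytes_in (redundant),
# proto_flag never varies (constant), duration is independent.
flows = [
    [100, 201, 1, 0.5],
    [250, 498, 1, 1.2],
    [400, 805, 1, 0.2],
    [120, 240, 1, 3.1],
]
print(select_features(flows, ["bytes_in", "bytes_out", "proto_flag", "duration"]))
# ['bytes_in', 'duration']
```

Deep-learning NIDSs effectively learn this representation step automatically instead of hand-running filters like these.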

4. Detecting WannaCry using machine learning: Ransomware has exploded in the past two years, as programs with names like Locky and WannaCry infect hosts in high-profile environments on a weekly basis. From power utilities to healthcare systems, ransomware indiscriminately encrypts all files on the victim’s computer and demands payment (usually in cryptocurrency, like Bitcoin). Conventional detection techniques routinely fail, as new variants of these malware families are released daily or even hourly. One potentially useful anti-ransomware tool that uses machine learning is ShieldFS, created by a group of researchers from Politecnico di Milano and Trend Micro and presented at Black Hat 2017. The key to this technique is applying machine learning to operating-system-level file access patterns.

Implemented as a Windows filesystem filter running in the kernel, ShieldFS isn’t a filesystem proper; instead, it adds functionality to the underlying filesystem. Two of the most common challenges in machine learning are feature engineering (coming up with a list of “features” describing the input) and feature selection (figuring out which of those features productively contribute to generating the correct answer). Feature engineering in ShieldFS seemed straightforward to me, since many of the features were simple counts of the types of events the filter observed, such as directory listings and writes. The researchers were also fortunate that so many of the features showed obvious qualitative differences between malicious and benign programs, making feature selection a high-confidence process as well.

Using binary inspection (“static analysis”), they were able to supplement results based on operation statistics (“dynamic analysis”). The team implemented a multi-tiered machine learning model that preserves long-term trends while still reacting to new behavioural patterns. By using a copy-on-write policy, if a process started to exhibit ransomware behaviour, they could kill it and restore all the affected files. The system detected ransomware with a 96.9% success rate, and even in the other 3.1% of cases the original content was still stored, so 100% of encrypted files could be restored. This is unheard of in the world of signature-based malware detection tools.
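ShieldFS’s actual models (trained over sliding windows of low-level I/O features) are far more sophisticated, but the basic idea, classifying a process by counts of its file-system operations, can be sketched as follows. All data here is invented, and a simple nearest-centroid rule stands in for the real classifier:

```python
import math
from collections import Counter

FEATURES = ["dir_listings", "file_reads", "file_writes", "file_renames"]

def featurize(ops):
    """Turn a process's file-operation log into ShieldFS-style counts."""
    c = Counter(ops)
    return [c[f] for f in FEATURES]

def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def classify(vec, benign_centroid, ransom_centroid):
    return ("ransomware"
            if math.dist(vec, ransom_centroid) < math.dist(vec, benign_centroid)
            else "benign")

# Invented training observations.
benign_c = centroid([featurize(["file_reads"] * 5 + ["file_writes"] * 2),
                     featurize(["dir_listings"] + ["file_reads"] * 8)])
ransom_c = centroid([featurize(["dir_listings"] * 20 + ["file_reads"] * 50
                               + ["file_writes"] * 50 + ["file_renames"] * 50),
                     featurize(["dir_listings"] * 25 + ["file_reads"] * 60
                               + ["file_writes"] * 55 + ["file_renames"] * 60)])

# A process that enumerates directories, then rewrites and renames nearly
# every file it reads -- classic ransomware I/O behaviour.
suspect = featurize(["dir_listings"] * 15 + ["file_reads"] * 40
                    + ["file_writes"] * 45 + ["file_renames"] * 38)
editor = featurize(["file_reads"] * 6 + ["file_writes"])

print(classify(suspect, benign_c, ransom_c))  # ransomware
print(classify(editor, benign_c, ransom_c))   # benign
```

The copy-on-write trick is what makes a behavioural detector like this safe to deploy: even a late detection costs nothing, because every overwritten file can be rolled back.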

How will 2018 turn out?

Based on the cybersecurity events we saw in 2017, following are some trends to watch out for in 2018. Though not intended to be a comprehensive overview, these are some of the key areas in cybersecurity that will undoubtedly shape the security conversations of 2018.

GDPR

Data privacy and data security have long been considered two separate missions with two separate and distinct objectives. But all that will change in 2018. With serious global regulations kicking into effect, especially in Europe, and with regulatory responses to data breaches increasing, organizations will build new data management frameworks centered on controlling data: who sees what data, in what state, and for what purpose. 2018 will prove that cybersecurity without privacy is a thing of the past.

Ransomware will continue its run

Ransomware will continue to represent the most dangerous threat to organizations and end users. The number of new ransomware families will continue to increase; authors will focus more on mobile devices, implementing new evasion techniques that make these threats even more efficient and difficult to eradicate.

New ransomware-as-a-service platforms will be available on the dark web, making it very easy for wannabe crooks to run their own ransomware campaigns.

IoT, a privileged target of hackers

During 2017, botnets recruited over 122,000 compromised IP cameras for DDoS attacks, and IoT attacks on wireless routers virtually shut down the internet for several hours in a day. Baby and pet monitors, medical devices, and dozens of other gadgets were hacked. Although we are a long way from securing the IoT, these incidents served as a wake-up call, and many organizations have added IoT security to their agendas and are talking seriously about securing it moving forward.

Critical infrastructure to include social media too

Until the recent past, social media was just a fun way to communicate and stay up to date with friends, family and the latest viral videos. But along the way, as we started to follow various influencers and use Facebook, Twitter and other platforms as curators for our news consumption, social media became inextricably linked with how we experience and perceive our democracy.

The definition of critical infrastructure, previously limited to areas like power grids and sea ports, will likely expand to include social networks. While a downed social network will not prevent society from functioning, these platforms have the ability to shape public opinion generally, and elections in particular, making their security essential to preserving our democracy. Protecting them from cyberattacks will become a necessity.

Standardised hacking techniques

In 2018, more threat actors will adopt plain-vanilla tool sets, designed to remove any tell-tale signs of their attacks. As mentioned earlier, this can partly be attributed to the NSA and CIA toolkits now available to rookies, thanks to the Shadow Brokers.

For example, we will see backdoors sporting fewer features and becoming more modular, creating smaller system footprints and making attribution more difficult than ever.

And so, as accurate attribution becomes more challenging, the door is opened for even more ambitious cyberattacks and influence campaigns from both nation-states and cybercriminals alike.

Crypto currencies

The rapid and sustained increase in the value of some cryptocurrencies will push crooks to intensify their fraudulent activities against virtual currency schemes.

Cyber criminals will continue to use malware to steal funds from victims’ computers or to deploy hidden mining tools on machines.

Another perspective: cryptocurrencies including Bitcoin, Ethereum, Litecoin and Monero maintain total market capital of over $1 billion, which makes them ever more appealing targets for hackers as their market value increases. Several hacks against Ethereum have temporarily dropped its value in the past few years, so chances are that in 2018 a major hack against one of these cryptocurrencies will damage public confidence.

Artificial Intelligence as a double-edged sword

Across the board, more criminals will use AI and machine learning to conduct their crimes. Ransomware will become automated; bank theft will be conducted by organized gangs using machine learning to attack in more intelligent ways. Smaller groups of criminals will be able to cause greater damage by using these new technologies to breach companies and steal data.

At the same time, as I mentioned above, research and practical applications of machine learning and AI in detecting and responding to cyber attacks are improving month over month. Large enterprises will therefore turn to AI to detect and protect against new, sophisticated threats. AI and machine learning will enable them to increase their detection rates and dramatically decrease the false alarms that so easily lead to alert fatigue and cause incident responders to miss real threats, resulting in significantly reduced MTTD (mean time to detect) and MTTR (mean time to respond).

Thanks for reading.

I look forward to your comments and points of view as well.

Title image courtesy: https://www.askideas.com

Deep Learning – Limitations and its future

Introduction

The origin of this blog post is the recent debate between Elon Musk and Mark Zuckerberg on whether AI is good or bad for humanity.

Elon is an inspiration to many of us around the world, especially anyone entrepreneurial, and also to us machine learning enthusiasts: self-driving cars, and his thoughts on AI and its applications in his companies (e.g., Tesla’s autonomous driving via its Autopilot capabilities).

Auto-pilot in action, in a Tesla Model S

But sometimes I tend to differ with his point of view on certain topics: for example, that we have to leave Earth and go to Mars to sustain humanity (SpaceX was founded primarily to make this possible sooner), and that robots with superintelligence will take over the planet soon. Elon is a visionary, and like Stephen Hawking he believes that AI could one day supersede humans. He is right; or rather, he ‘could’ be right. But the point I would like to make in this blog post is that AI, and deep learning in particular, is at a very nascent stage, and considering the capabilities we have built into AI systems so far, I am pretty sure the doomsday scenario Elon points to is definitely not in the near future.

As Andrew Ng, the co-founder of Coursera and former chief scientist at the Chinese technology powerhouse Baidu, recently pointed out at a Harvard Business Review event, the more immediate problem we need to address is job displacement due to automation; this is the area we must focus on, rather than being distracted by science-fiction-ish, dystopian elements.

Limitations of Deep learning

At the recently held AI By The Bay conference, Francois Chollet, an AI researcher at Google and creator of the widely used deep learning library Keras, spoke about the limitations of deep learning. He said that deep learning is simply a more powerful pattern recognition system than previous statistical and machine learning methods. “The most important problem for AI today is abstraction and reasoning”, said Chollet.

Current supervised perception and reinforcement learning algorithms require lots of data, are terrible at planning, and merely do straightforward pattern recognition. By contrast, humans are able to learn from very few examples and can do very long-term planning. We are also capable of forming abstract models of a situation and manipulating these models to achieve “extreme generalization”.

Let’s take an example of how difficult it is to teach simple human behaviours to a deep learning algorithm: the task of not being hit by a car as you attempt to cross a road. With supervised learning, we would need huge datasets of vehicular-movement situations with clearly labelled actions to take, such as “stop” or “move”, and then train a neural network to learn the mapping from situation to appropriate response. If we go the reinforcement learning route, where we give an algorithm a goal and let it independently determine the appropriate actions, the computer would need to die thousands of times before it learns to avoid vehicles in different situations. Humans, in summary, only need to be told once to avoid cars. We’re equipped with the ability to generalize from just a few examples and are capable of imagining (modelling) the dire consequences of being run over. And so, in most cases without ever losing our life or being hurt significantly, most of us quickly learn to avoid being run over by motor vehicles.
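The reinforcement learning route described above can be sketched with a tiny Q-learning loop on a toy “road crossing” problem. The states, rewards and hyperparameters are all invented for illustration; the point is that the agent must get hit at least once before it learns:

```python
import random

random.seed(0)
# Toy MDP: state = whether a car is near (0/1); actions: 0 = wait, 1 = cross.
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, epsilon, episodes = 0.5, 0.1, 2000
deaths = 0

for _ in range(episodes):
    s = random.randint(0, 1)                         # is a car near?
    if random.random() < epsilon:
        a = random.randint(0, 1)                     # explore
    else:
        a = max((0, 1), key=lambda act: Q[(s, act)]) # exploit current estimate
    if a == 1 and s == 1:
        r = -100                                     # stepped in front of a car
        deaths += 1
    elif a == 1:
        r = 10                                       # crossed safely
    else:
        r = -1                                       # waited, lost a little time
    Q[(s, a)] += alpha * (r - Q[(s, a)])             # one-step Q update

print(deaths > 0)                # True: it had to get hit to learn
print(Q[(1, 0)] > Q[(1, 1)])     # True: learned to wait when a car is near
print(Q[(0, 1)] > Q[(0, 0)])     # True: learned to cross when the road is clear
```

A human needs zero “deaths” to learn this policy; the agent needs at least one, and realistic state spaces need thousands.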

Talking of anthropomorphizing machine learning models, Francois Chollet makes a very interesting observation in his recent blog post:

“A fundamental feature of the human mind is our “theory of mind”, our tendency to project intentions, beliefs and knowledge on the things around us. Drawing a smiley face on a rock suddenly makes it “happy”—in our minds. Applied to deep learning, this means that when we are able to somewhat successfully train a model to generate captions to describe pictures, for instance, we are led to believe that the model “understands” the contents of the pictures, as well as the captions it generates. We then proceed to be very surprised when any slight departure from the sort of images present in the training data causes the model to start generating completely absurd captions.”

Another difference between how we humans interpret our surroundings and how these models do is the ‘extreme generalisation’ that we are good at, versus the ‘local generalisation’ that machine learning models are limited to.

Can humans’ ‘extreme generalisation’ abilities be ported into a machine learning model? (Picture source: https://blog.keras.io/the-limitations-of-deep-learning.html)

Let’s take an example to understand this difference. If we take a young, smart six-year-old boy from Bangalore and leave him in the town of Siem Reap in Cambodia, he will, within a few hours, manage to find a place to eat and start communicating with the people around him, and within a couple of days he will make ends meet. This ability to handle new situations we have never experienced before (language, people, surroundings), that is, to perform abstraction and reasoning far beyond prior experience, is arguably the defining characteristic of human cognition. This is “extreme generalization”: our ability to adapt to completely new, never-before-experienced situations using very little data, or even no new data at all. It is in sharp contrast with what deep neural nets can do, which can be referred to as “local generalization”: the mapping from inputs to outputs performed by deep nets quickly starts to fall apart if new inputs differ even slightly from what the nets were trained on.
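A minimal sketch of local generalization: a nearest-neighbour “model” trained on y = x² over x in [0, 1] does fine near its training data but falls apart as soon as the input leaves that range (the function and ranges here are arbitrary, chosen just to make the failure visible):

```python
# 1-nearest-neighbour "model" trained on y = x**2 for x in [0, 1].
train = [(i / 100, (i / 100) ** 2) for i in range(101)]

def predict(x):
    # Pure pattern matching: answer with the nearest training example's output.
    return min(train, key=lambda p: abs(p[0] - x))[1]

inside = abs(predict(0.557) - 0.557 ** 2)   # input within the training range
outside = abs(predict(1.5) - 1.5 ** 2)      # input just beyond it
print(round(inside, 4), round(outside, 4))  # 0.0034 1.25
```

Deep nets interpolate far more cleverly than this lookup table, but the qualitative failure mode off the training distribution is the same.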

How Deep learning should evolve

A necessary, transformational development that we can expect in machine learning is a move away from models that merely perform pattern recognition and can only achieve local generalization, towards models capable of abstraction and reasoning that can achieve extreme generalization. While moving towards this goal, it will also be important for models to require minimal intervention from human engineers. Today, most AI programs capable of basic reasoning are written by human programmers, for example software that relies on search algorithms. All this should result in deep learning models that are not heavily dependent on supervised learning, as is the case today, and that truly become self-supervised and independent.

As Francois calls out in his blog post, “we will move away from having on one hand “hard-coded algorithmic intelligence” (handcrafted software) and on the other hand “learned geometric intelligence” (deep learning). We will have instead a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern recognition capabilities. The whole system would be learned with little or no human involvement.”

Will this result in machine learning engineers losing their jobs? Not really; engineers will move higher up the value chain. They will focus on crafting complex loss functions that meet business goals and use cases, and spend more time understanding how the models they build impact the digital ecosystems in which they are deployed: for example, the users that consume a model’s predictions and the sources that generate its training data. Today, only the largest companies can afford to have their data scientists spend time in these areas.

Another area of development could be models becoming more modular, much as object-oriented programming (OOP) helped software development and the concept of ‘functions’ enabled re-use of key functionality within a program. This will pave the way for models becoming reusable. What we do today in the way of model reuse across different tasks is leveraging pre-trained weights for models that perform common functions, like visual feature extraction for image recognition. At a later stage, we would not only leverage previously learned features (sub-model weights or hyperparameters) but also model architectures and training procedures. And as models become more like programs, we would start reusing program subroutines, like the functions and classes found in our regular programming languages today.

(Picture source: http://www.astuta.com/how-artificial-intelligence-big-data-will-transform-the-workplace/)

Closing thoughts…

The result of these developments in the deep learning models, would be a system that attains the state of Artificial General Intelligence (AGI).

So, does all this take us back to the debate I started with: a singularitarian robot apocalypse taking over planet Earth? For now, and for the near-term future, I think it is pure fantasy, originating from a profound misunderstanding of intelligence and technology.

This post began with Elon Musk’s comments; and I will end it with a recent comment from him – “AI will be the best or worst thing ever for humanity…”.

If you want to delve deeper into this fantasy of superintelligence, do read Max Tegmark’s book Life 3.0, published last month. Elon himself highly recommends it. :)

Another classic is Nick Bostrom’s Superintelligence: Paths, Dangers, Strategies. And if you are a Python programmer and want to start building deep learning models, I strongly recommend Francois Chollet’s book Deep Learning with Python.

Further reading

Here are some recent research papers highlighting limitations of Deep Learning:

Cover image source: https://regmedia.co.uk/

Visualising the performance of Machine learning models

Evaluating the performance of machine learning models using metrics like accuracy, precision and recall is straightforward, but visualising them has never been easy. Ben Bengfort at District Data Labs has developed a Python library for this purpose, called Yellowbrick.

It definitely looks interesting. Our very own Charles Givre shows how this package can be used, in his blog.

It’s definitely worth a read.

Busting the Citadel Trojan developers

Brian Krebs recently reported about the Citadel developers getting busted by the FBI. What is most interesting is the bait that FBI used to trap them.

Citadel boasted an online tech support system for customers designed to let them file bug reports, suggest and vote on new features in upcoming malware versions, and track trouble tickets that could be worked on by the malware developers and fellow Citadel users alike. Citadel customers also could use the system to chat and compare notes with fellow users of the malware.

It was this very interactive nature of Citadel’s support infrastructure that FBI agents ultimately used to locate and identify Vartanyan, the developer of Citadel.

Machine Learning and EU GDPR

In this post, I share my thoughts on the impact of using machine learning to profile individuals in the context of the EU General Data Protection Regulation (hereafter GDPR). My analysis is based specifically on Article 22 of the regulation, which can be found here, and which covers the “automated processing and profiling of data subjects” requirement.

One of the arguments I discuss is that although using machine learning for profiling (of users/consumers, hereafter ‘data subjects’) may complicate data controllers’ compliance with their obligations under the GDPR, it may at the same time lead to fairer decisions for data subjects, because human intervention in classifying data or people is flawed and subject to various factors, whereas machines can eliminate the subjectivity and bias of human approaches.

Lawful, Fair and Transparent

One of the fundamental principles of EU data protection law is that personal data must be processed lawfully, fairly and in a transparent manner.

GDPR’s definition of ‘processing’, is as follows:
‘any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction’

‘Profiling’ is a subset of automated processing, and GDPR defines it as:
‘the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements’.

Now, let’s analyse the three key tenets of the GDPR requirement: personal data must be processed lawfully, fairly and transparently.

Lawfulness

If we break down the GDPR’s definition of ‘profiling’ in the context of machine learning, the process involves three key elements:

  • Data collection
  • Model development
  • Decision making

The outcome of these steps is that machine learning is used for:

  • Automated data processing for profiling purposes
  • Automated decision making, based on the profiles built
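This two-step flow, building a profile from collected data and then making an automated decision from it, can be sketched in code. The following is a hypothetical illustration only: the features, weights, and threshold are invented, and in a real system the weights would come from a trained model.

```python
# Hypothetical sketch of automated profiling and decision making.
# Feature names, weights, and threshold are illustrative assumptions.

def build_profile(raw_data: dict) -> dict:
    """Steps 1-2: derive a profile (here, a single risk score) from collected data."""
    # A trained model would normally supply these weights.
    weights = {"late_payments": 0.6, "account_age_years": -0.1}
    score = sum(weights.get(k, 0.0) * v for k, v in raw_data.items())
    return {"risk_score": score}

def automated_decision(profile: dict, threshold: float = 1.0) -> str:
    """Step 3: a decision based solely on the profile, with no human in the loop."""
    return "reject" if profile["risk_score"] > threshold else "accept"

applicant = {"late_payments": 3, "account_age_years": 2}
profile = build_profile(applicant)
print(profile["risk_score"], automated_decision(profile))  # 1.6 reject
```

A decision produced this way, with no human intervention, is exactly the kind of “solely automated” processing Article 22 is concerned with.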

Data collection

The regulation says that the collection of personal data should comply with the data protection principles and there must be a lawful ground for processing of this data. This means that personal data should only be collected for specified, explicit, and legitimate purposes and should not be processed subsequently in a manner that is incompatible with those purposes.

A machine learning algorithm may build a profile of a subject based on data provided by the ‘data controller’, by a third party, or by both. Many organisations use Cloud Computing services for these activities, as the process may require significant resources in terms of computational power and storage. Depending on the nature of the business/application/use case of such profiling, this processing may take place locally on the data controller’s machines, while a copy of this data is also sent to the Cloud to continue the dynamic training of the algorithm.

Elaborating on the “lawfulness” of this profiling: an individual’s personal data are not only processed to create descriptive profiles about them, but also to check against predefined patterns of normal behaviour and to detect anomalies. This stage of profile construction will be subject to the GDPR rules governing the processing of personal data, including the legal grounds for processing this data.

An interesting point to note is that the final text of Article 22 of the GDPR refers to a ‘data subject’ and not a ‘natural person’. This could be interpreted to mean that the protection against solely automated decision-making might not apply if the data processed are anonymised. That is, if profiling does not involve the processing of data relating to identifiable individuals, the protection against decisions based on automated profiling may not apply, even if such decisions impact a person’s behaviour or autonomy. However, as Article 22 seems to apply only to profiling of individual data subjects and not groups, the question arises whether data subjects are protected against decisions that have significant effects on them but are based on group profiling.

This can be an issue because, if inferences about individuals are made based on characteristics shared with other members of a group, there may be a significant number of false positives or false negatives. A good example of this “anonymised” data collection for machine learning is Apple’s approach, which they refer to as ‘differential privacy’.
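Apple’s implementation is proprietary, but the core idea behind differential privacy can be illustrated with the classic randomised-response technique. This is a textbook sketch, not Apple’s actual mechanism: each individual’s report is randomised (so no single report reveals their true answer), yet the aggregate can still be debiased to estimate the population rate.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Flip a fair coin: heads, report the truth; tails, report a random
    answer. Any individual report is plausibly deniable."""
    if random.random() < 0.5:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(reports) -> float:
    """Debias the aggregate: E[observed] = 0.5 * true + 0.25,
    so true = 2 * observed - 0.5."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

random.seed(42)
truths = [True] * 300 + [False] * 700        # real population: 30% "yes"
reports = [randomized_response(t) for t in truths]
print(round(estimate_true_rate(reports), 2))  # close to 0.30
```

The data controller learns a useful population-level statistic while each data subject retains deniability about their individual answer, which is the trade-off differential privacy formalises.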

Decision making

When it comes to decision making based on the ‘processing’ of personal data described above, does ‘automated individual decision-making’ only cover situations where a machine makes decisions without any involvement by human actors? This may not be true in most situations, as some human intervention is likely to occur at some point in the automated decision-making process. And so, I think the scope of the protection is broader than just wholly automated decision-making. Also, human intervention would have to be actual and substantive, i.e. humans would have to exercise ‘real influence on the outcome of a particular decision-making process’, in order for this protection not to apply.

In addition, the GDPR does not specify whether the decision itself has to be made by a human or whether it could potentially be made by a machine. Nevertheless, as I mentioned above, it is highly likely that one or more humans will be involved in the design of the model, training it with data, and testing of a system incorporating machine learning.

Legal impact

Another important element is that the decision has to produce legal effects for, or similarly significantly affect, the data subject. Examples could be the automatic refusal of an online credit application, or e-recruitment practices without human intervention. The effects can be material and/or immaterial, potentially affecting the data subject’s dignity, integrity or reputation. The requirement that ‘effects’ be ‘legal’ means that a decision must be binding, or must create legal obligations for the data subject.

Potential consequences of non-compliance

It is important to bear in mind that if data controllers violate the rights of data subjects under Article 22, they shall ‘be subject to administrative fines up to 20,000,000 EUR, or in the case of an undertaking, up to 4 % of the total worldwide annual turnover of the preceding financial year, whichever is higher’. In the face of potential penalties of this magnitude, and considering the complexities of machine learning, data controllers may be apprehensive about using the technology for automated decision making in certain situations. Moreover, data controllers may insist on contractual arrangements with providers in the machine learning supply chain that contain very specific provisions regarding the design, training, testing, operation and outputs of the algorithms, as well as the relevant technical and organisational security measures to be incorporated.

Fairness

Let’s now turn to the meaning of ‘fairness’ in the context of using machine learning either to carry out automated processing, including profiling, or to make automated decisions based on such processing. Whether personal data will be processed fairly may depend upon a number of factors. Machine learning processes may be biased to produce the results pursued by the person who built the model. Also, the quantity and quality of the data used to train the algorithm, including the reliability of their sources and labelling, may have a significant impact on the construction of profiles.
For example, an indirect bias may arise where data relate to a minority group that has been treated unfairly in the past in such a way that the group is underrepresented in specific contexts or overrepresented in others. Also, in case of a hiring application, if fewer women have been hired previously, data about female employees might be less reliable than data about male employees.

So the point is that reliability, when using machine learning for automated decision-making, will depend on the techniques and the training data used. Further, machine learning techniques often perform better when the training data is large (more data about data subjects) and widely varied. However, this may collide with the data minimisation principle in EU data protection law, a strict interpretation of which is that ‘the data collected on the data subject should be strictly necessary for the specific purpose previously determined by the data controller’.

And so it is very important that data controllers decide, at the time of collection, which personal data they are going to process for profiling purposes. They will then also have to provide the algorithm with only the data strictly necessary for the specific profiling purpose, even if that leads to a narrower representation of the data subject and possibly a less fair decision for him/her.

Transparency

Machine learning algorithms may be based on very different computational learning models. Some are more amenable to allowing humans to track the way they work; others may operate as a ‘black box’. For example, where a process utilises a decision tree, it may be easier to generate an explanation (in a human-readable form) of how and why the algorithm reached a particular conclusion, though this very much depends on the size and complexity of the tree. The situation may be very different in relation to neural network-type algorithms, such as deep learning algorithms. This is because the conclusions reached by neural networks are ‘non-deductive and thus cannot be legitimated by a deductive explanation of the impact various factors at the input stage have on the ultimate outcome’.
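To illustrate why tree-based models are more amenable to explanation, here is a toy, hand-rolled decision tree (the features and thresholds are invented for illustration) that records the path it takes, yielding a human-readable justification of its conclusion:

```python
# A toy decision tree that records the path it took, producing a
# human-readable explanation of its conclusion. The features and
# thresholds are purely illustrative assumptions.

TREE = {
    "question": ("income", 30000),            # (feature, threshold)
    "yes": {"answer": "approve"},             # income >= 30000
    "no": {
        "question": ("existing_debt", 5000),
        "yes": {"answer": "reject"},          # existing_debt >= 5000
        "no": {"answer": "approve"},
    },
}

def decide(node: dict, subject: dict, trail=None):
    """Walk the tree, appending each comparison made to an explanation trail."""
    trail = [] if trail is None else trail
    if "answer" in node:
        return node["answer"], trail
    feature, threshold = node["question"]
    branch = "yes" if subject[feature] >= threshold else "no"
    trail.append(f"{feature} = {subject[feature]} is "
                 f"{'>=' if branch == 'yes' else '<'} {threshold}")
    return decide(node[branch], subject, trail)

decision, trail = decide(TREE, {"income": 20000, "existing_debt": 7000})
print(decision)           # reject
print("; ".join(trail))   # the explanation a data subject could be given
```

A neural network offers no analogous trail of discrete, human-readable comparisons, which is precisely the transparency problem described above.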

This opacity of machine learning techniques might have an impact on a data controller’s obligation to process a data subject’s personal data in a transparent way. Whether personal data are obtained directly from the data subject or from an indirect source, the GDPR imposes on the data controller the obligation, at the time when personal data are obtained, to provide the data subject with information regarding:

‘the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.’

Does this mean that whenever machine learning is used to conduct profiling the data controller must provide information regarding the existence and type of machine learning algorithms used? If so, to what does the term ‘logic’ refer and what would constitute ‘meaningful information’ about that logic? Does the term ‘logic’ refer to the data set used to train the algorithm, or to the way the algorithm itself works in general, for example the mathematical / statistical theories on which the design of the algorithm is based? And what about the criteria fed into the algorithm, the variables, and the weights attributed to those variables? And how does this relate to the role of different service providers forming part of the ‘machine learning’ supply chain? All these are important clarifications to be sought.

Due to all the above complexities, transparency might not be the most appropriate way of seeking to ensure legal fairness; instead, compliance should be verified, for instance, through the use of technical tools, for example tools that detect bias towards a particular attribute (such as the use of race in credit decisions) or that require a certain class of analysis for certain decisions. This might also be achieved by testing the trained model against a number of ‘discrimination testing’ datasets, or by assessing the actual outcomes of the machine learning process to show that they comply with the lawfulness and fairness requirements.
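As a concrete illustration of one such technical check, the sketch below computes the disparate impact ratio: the selection rate of a protected group divided by that of a reference group. The 0.8 cut-off is the well-known ‘four-fifths rule’ heuristic, used here purely as an illustrative threshold; the outcome data are invented.

```python
# A minimal sketch of one 'discrimination testing' check on a model's
# outputs: the disparate impact ratio. Data and threshold are illustrative.

def selection_rate(outcomes):
    """Fraction of positive outcomes (e.g. 1 = 'credit approved')."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(protected, reference):
    """Ratio of selection rates; values well below 1 suggest the
    protected group is selected less often than the reference group."""
    return selection_rate(protected) / selection_rate(reference)

# Hypothetical model decisions split by a protected attribute.
group_a = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]   # 30% approved
group_b = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]   # 70% approved

ratio = disparate_impact(group_a, group_b)
print(round(ratio, 2), "flagged" if ratio < 0.8 else "ok")  # 0.43 flagged
```

Checks of this kind operate on outcomes rather than on the model’s internals, so they work even when the algorithm itself is a black box.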

Conclusion

According to Article 22 of the GDPR, data subjects have a right not to be subject to a decision based solely on automated processing, including profiling that produces legal effects concerning them or significantly affects them. When data controllers use machine learning to carry out automated processing, including profiling of data subjects, they must comply with the requirement of lawful, fair and transparent processing. This may be difficult to achieve due to the way in which machine learning works and / or the way machine learning is integrated into a broader workflow that might involve the use of data of different origins and reliability, specific interventions by human operators, and the deployment of machine learning products and services, including ‘Machine Learning as a Service’ services (provided by Amazon, Google, Microsoft, and others).

In order to be compliant, data controllers must assess how using machine learning for automated processing affects the different stages of profiling and the level of risk to data subjects’ rights, and how they can produce evidence of compliance for the regulator and the data subject. In some cases where automated processing, including profiling, is permitted by law, data controllers still have to implement appropriate measures to protect data subjects’ rights. The underlying objective of the GDPR is that a decision significantly affecting a person cannot be based solely on a fully automated assessment of his or her personal characteristics. However, as I called out at the very beginning of this post, in the context of machine learning it might in some cases be more beneficial for data subjects if a final decision is based on an automated assessment, as it is devoid of prejudices induced by human intervention.

Whether a decision about us is being made by a human or by a machine, right now the best we can hope for is that such a decision, one which can produce legal effects or significantly affect us in any manner, will be as fair as humans can be. Eventually, we, as machine learning practitioners, must aim to build models whose decisions are fairer than humans can be.

This takes into account that machines may soon be able to overcome the limitations of human decision makers and provide us with decisions that are demonstrably fair. Indeed, in some contexts it may already make sense to go beyond the current model, whereby individuals can appeal to a human against a machine decision, and give individuals a right to appeal to a machine against a decision made by a human!

Well, that sounds a bit weird, doesn’t it! Has the time finally arrived for Skynet to take over planet Earth?

I am sure that many of the questions that we, machine learning enthusiasts and practitioners, have about the GDPR’s implications will eventually be answered after the regulation comes into force in May 2018. We will also see interesting changes to how machine learning models are designed and applied, especially in the context of personal data processing.

Title image courtesy: LinkedIn.com

Securely store API keys in R scripts with the “secret” package

When we use an API key to access a secure service through R, or when we need to authenticate in order to access a protected database, we need to store this sensitive information in our R code somewhere. The typical practice is to include those keys as strings in the R code itself, but as you may have guessed, that is not secure: we end up storing our private keys and passwords in plain text on our hard drive. And since most of us use GitHub to collaborate on code, we may also end up, unknowingly, committing those keys to a public repo.

Now there is a solution to this: the “secret” package for R, developed by Gábor Csárdi and Andrie de Vries. This package integrates with OpenSSH, providing R functions that allow us to create a vault for keys on our local hard drive, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by the person who wrote the code, or by people he/she trusts.
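Based on the package’s documentation, the typical workflow looks roughly like the sketch below. Treat it as illustrative pseudocode: function signatures may differ between versions, and the email address, secret name, and key locations are placeholders.

```r
# Illustrative sketch of the "secret" package workflow; names and
# signatures follow the package docs but should be verified against
# the version you install.
library(secret)

vault <- file.path("~", ".vault")
create_vault(vault)

# Register a trusted user via their public SSH key.
add_user("alice@example.com", local_key(), vault = vault)

# Store an API key, encrypted so only the listed users can read it.
add_secret("payment_api_key", "sk-123456",
           users = "alice@example.com", vault = vault)

# Later, in any script: decrypt with your private key.
api_key <- get_secret("payment_api_key", key = local_key(), vault = vault)
```

The vault directory and the encrypted secrets are safe to commit to a repo, since decryption requires a trusted user’s private key.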

Here is the presentation by Andrie de Vries at useR!2017, where they demoed this package, and here is the package itself.