In this post, I share my thoughts on the impact of using machine learning to conduct profiling of individuals in the context of the EU General Data Protection Regulation (hereafter referred to as the GDPR). My analysis focuses specifically on Article 22 of the GDPR, which addresses the automated processing and profiling of data subjects.
One of the arguments I discuss is that, although using machine learning for profiling (of users/consumers, hereafter referred to as 'data subjects') may complicate data controllers' compliance with their obligations under the GDPR, it may at the same time lead to fairer decisions for data subjects. Human judgement in classifying data or people is flawed and subject to various factors, whereas machines can eliminate much of the subjectivity and bias that humans introduce.
Lawful, Fair and Transparent
One of the fundamental principles of EU data protection law is that personal data must be processed lawfully, fairly and in a transparent manner.
The GDPR's definition of 'processing' is as follows:
‘any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction’
‘Profiling’ is a subset of automated processing, and GDPR defines it as:
‘the use of personal data to evaluate certain personal aspects relating to a natural person, in particular to analyse or predict aspects concerning that natural person’s performance at work, economic situation, health, personal preferences, interests, reliability, behaviour, location or movements’.
Now, let's analyse the three key tenets of the GDPR requirement: personal data must be processed lawfully, fairly and transparently.
If we break down the GDPR's definition of 'profiling' in the context of machine learning, the process has three key elements:
Data profiling – key elements:
- Data collection
- Model development
- Decision making
The outcome of these steps is that machine learning is used for:
- Automated data processing for profiling purposes
- Automated decision making, based on the profiles built
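As a toy illustration only, the stages and outcomes above can be sketched in a few lines of Python. Every field name, record and threshold here is an invented assumption for illustration, not anything prescribed by the GDPR:

```python
# Minimal sketch of the three profiling stages: collection, model
# development, and decision making. All data here is illustrative.

def collect(raw_records):
    """Data collection: keep only fields needed for the specified,
    explicit purpose (purpose limitation / data minimisation)."""
    allowed_fields = {"age", "purchases_per_month"}
    return [{k: r[k] for k in allowed_fields} for r in raw_records]

def train(profiles):
    """Model development: a trivial 'model', the mean purchase rate."""
    rates = [p["purchases_per_month"] for p in profiles]
    return sum(rates) / len(rates)

def decide(threshold, profile):
    """Automated decision making, based on the profile built."""
    if profile["purchases_per_month"] >= threshold:
        return "offer_discount"
    return "no_action"

records = [
    {"name": "A", "age": 30, "purchases_per_month": 5},
    {"name": "B", "age": 41, "purchases_per_month": 1},
]
profiles = collect(records)           # note: 'name' is dropped
threshold = train(profiles)
decisions = [decide(threshold, p) for p in profiles]
```

Note how `collect` drops the identifying `name` field: even in this toy pipeline, the collection step is where purpose limitation is enforced.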
The regulation says that the collection of personal data should comply with the data protection principles and there must be a lawful ground for processing of this data. This means that personal data should only be collected for specified, explicit, and legitimate purposes and should not be processed subsequently in a manner that is incompatible with those purposes.
A machine learning algorithm may build a profile of a subject based on data provided by the 'data controller', by a third party, or by both. Many organisations use cloud computing services for these activities, as the process may require significant resources in terms of computational power and storage. Depending on the nature of the business, application, or use case of such profiling, this processing may take place locally on the data controller's machines, while a copy of the data is also sent to the cloud to continue the dynamic training of the algorithm.
Elaborating on the "lawfulness" of this profiling, an individual's personal data are not only processed to create descriptive profiles about them but also to check against predefined patterns of normal behaviour, and to detect anomalies. This stage of profile construction will be subject to the GDPR rules governing the processing of personal data, including the legal grounds for processing this data.
An interesting point to note is that the final text of Article 22 refers to a 'data subject' and not a 'natural person'. This could be interpreted to mean that the protection against solely automated decision-making might not apply if the data processed are anonymised. In other words, if profiling does not involve the processing of data relating to identifiable individuals, the protection against decisions based on automated profiling may not be applicable, even if such decisions impact a person's behaviour or autonomy. However, as Article 22 seems to apply only to the profiling of individual data subjects and not of groups, the question arises whether data subjects are protected against decisions that have significant effects on them but are based on group profiling.
This can be an issue, because if inferences about individuals are made based on characteristics shared with other members of a group, there may be a significant number of false positives or false negatives. A good example of this "anonymised" data collection for a machine learning application is Apple's approach, which they refer to as 'differential privacy'.
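Apple's actual mechanism is considerably more sophisticated, but the core idea behind local differential privacy can be illustrated with classic randomized response: each individual's true answer is perturbed before collection, so no single report identifies anyone, and only the aggregate is debiased. All parameters below are illustrative assumptions:

```python
import random

def randomized_response(true_value: bool, p_truth: float = 0.75) -> bool:
    """Report the true bit with probability p_truth, otherwise a
    uniformly random bit -- giving each individual plausible deniability."""
    if random.random() < p_truth:
        return true_value
    return random.random() < 0.5

def estimate_true_rate(reports, p_truth: float = 0.75) -> float:
    """Debias the aggregate:
    observed = p_truth * true + (1 - p_truth) * 0.5, solved for true."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

random.seed(0)  # fixed seed so the sketch is reproducible
true_bits = [i < 300 for i in range(1000)]   # true rate: 0.30
reports = [randomized_response(b) for b in true_bits]
estimate = estimate_true_rate(reports)
```

The controller can still learn population-level statistics (the estimated rate lands close to the true 30%), while any individual report could have been noise.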
When it comes to decision making based on the 'processing' of personal data described above, does 'automated individual decision-making' only cover situations where a machine makes decisions without any involvement by human actors? This seems unlikely in most situations, as some human intervention will probably occur at some point in the automated decision-making process. I therefore think the scope of the protection is broader than wholly automated decision-making alone. Moreover, human intervention would have to be actual and substantive, i.e. humans would have to exercise 'real influence on the outcome of a particular decision-making process', in order for this protection to become inapplicable.
In addition, the GDPR does not specify whether the decision itself has to be made by a human or whether it could potentially be made by a machine. Nevertheless, as I mentioned above, it is highly likely that one or more humans will be involved in the design of the model, training it with data, and testing of a system incorporating machine learning.
Another important element of the decision is that it has to produce legal effects on, or similarly significantly affect, the data subject. Examples include an automatic refusal of an online credit application, or e-recruitment practices without human intervention. The effects can be material and/or immaterial, potentially affecting the data subject's dignity, integrity or reputation. The requirement that 'effects' be 'legal' means that the decision must be binding, or must create legal obligations for the data subject.
Potential consequences of non-compliance
It is important to bear in mind that if data controllers violate the rights of data subjects under Article 22, they shall 'be subject to administrative fines up to 20,000,000 EUR, or in the case of an undertaking, up to 4 % of the total worldwide annual turnover of the preceding financial year, whichever is higher'. In the face of potential penalties of this magnitude, and considering the complexities of machine learning, data controllers may be apprehensive about using the technology for automated decision making in certain situations. Moreover, data controllers may insist that contractual arrangements be put in place with providers that are part of the machine learning supply chain, containing very specific provisions regarding the design, training, testing, operation and outputs of the algorithms, as well as the relevant technical and organisational security measures to be incorporated.
Let's now turn to the meaning of 'fairness' in the context of using machine learning either to carry out automated processing, including profiling, or to make automated decisions based on such processing. Whether personal data will be processed fairly may depend on a number of factors. Machine learning processes may be biased towards producing the results pursued by the person who built the model. The quantity and quality of the data used to train the algorithm, including the reliability of their sources and labelling, may also have a significant impact on the construction of profiles.
For example, an indirect bias may arise where data relate to a minority group that has been treated unfairly in the past in such a way that the group is underrepresented in specific contexts or overrepresented in others. Also, in case of a hiring application, if fewer women have been hired previously, data about female employees might be less reliable than data about male employees.
So the point is that the reliability of machine learning for automated decision-making will depend on the techniques and the training data used. Further, machine learning techniques often perform better when the training data is large (more data about data subjects) and exhibits wide variance. However, this may collide with the data minimisation principle in EU data protection law, a strict interpretation of which is that 'the data collected on the data subject should be strictly necessary for the specific purpose previously determined by the data controller'.
And so it is very important that data controllers decide, at the time of collection, which personal data they are going to process for profiling purposes. They will then also have to provide the algorithm with only the data that are strictly necessary for the specific profiling purpose, even if that leads to a narrower representation of the data subject and possibly a less fair decision for him or her.
Machine learning algorithms may be based on very different computational learning models. Some are more amenable to allowing humans to track the way they work; others may operate as a 'black box'. For example, where a process utilises a decision tree it may be easier to generate an explanation (in a human-readable form) of how and why the algorithm reached a particular conclusion, though this very much depends on the size and complexity of the tree. The situation may be very different in relation to neural network-type algorithms, such as deep learning algorithms. This is because the conclusions reached by neural networks are 'non-deductive and thus cannot be legitimated by a deductive explanation of the impact various factors at the input stage have on the ultimate outcome'.
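To illustrate why tree-based models lend themselves to human-readable explanations, here is a hand-rolled toy decision tree (the features, thresholds and outcomes are purely hypothetical) that records every comparison made on the way to its decision:

```python
# A toy decision tree for a credit decision, hand-rolled so that the
# decision path can be read back in human-readable form.
# All features and thresholds are illustrative assumptions.

TREE = {
    "feature": "income", "threshold": 30000,
    "left": {
        "feature": "debt_ratio", "threshold": 0.4,
        "left": {"leaf": "approve"},
        "right": {"leaf": "refuse"},
    },
    "right": {"leaf": "approve"},
}

def decide_with_explanation(node, applicant):
    """Walk the tree, recording each comparison as a readable rule."""
    path = []
    while "leaf" not in node:
        f, t = node["feature"], node["threshold"]
        if applicant[f] <= t:
            path.append(f"{f} = {applicant[f]} <= {t}")
            node = node["left"]
        else:
            path.append(f"{f} = {applicant[f]} > {t}")
            node = node["right"]
    return node["leaf"], path

decision, explanation = decide_with_explanation(
    TREE, {"income": 25000, "debt_ratio": 0.5})
```

Each entry in `explanation` is a rule a human can inspect, which is exactly the kind of 'meaningful information about the logic involved' that a neural network cannot produce so directly.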
This opacity of machine learning techniques might have an impact on a data controller’s obligation to process a data subject’s personal data in a transparent way. Whether personal data are obtained directly from the data subject or from an indirect source, the GDPR imposes on the data controller the obligation, at the time when personal data are obtained, to provide the data subject with information regarding:
‘the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.’
Does this mean that whenever machine learning is used to conduct profiling the data controller must provide information regarding the existence and type of machine learning algorithms used? If so, to what does the term ‘logic’ refer and what would constitute ‘meaningful information’ about that logic? Does the term ‘logic’ refer to the data set used to train the algorithm, or to the way the algorithm itself works in general, for example the mathematical / statistical theories on which the design of the algorithm is based? And what about the criteria fed into the algorithm, the variables, and the weights attributed to those variables? And how does this relate to the role of different service providers forming part of the ‘machine learning’ supply chain? All these are important clarifications to be sought.
Due to all the above complexities, transparency might not be the most appropriate way of seeking to ensure legal fairness; compliance might instead be verified through the use of technical tools, for example tools that detect bias towards a particular attribute (such as the use of race in credit decisions) or that check that a certain class of analysis is applied for certain decisions. This might also be achieved by testing the trained model for unfair discrimination against a number of 'discrimination testing' datasets, or by assessing the actual outcomes of the machine learning process to show that they comply with the lawfulness and fairness requirements.
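One such technical check could be as simple as measuring the gap in favourable-outcome rates between groups, sometimes called the demographic parity difference. The sketch below uses synthetic decisions and group labels; a real 'discrimination testing' exercise would of course be far more involved:

```python
# Sketch of a simple fairness check: demographic parity difference,
# i.e. the gap between groups' favourable-outcome rates.
# The decisions and group labels below are synthetic test data.

def positive_rate(decisions, groups, target_group):
    """Favourable-outcome rate within one group."""
    in_group = [d for d, g in zip(decisions, groups) if g == target_group]
    return sum(in_group) / len(in_group)

def demographic_parity_diff(decisions, groups):
    """Max gap between any two groups' favourable-outcome rates."""
    rates = [positive_rate(decisions, groups, g) for g in set(groups)]
    return max(rates) - min(rates)

decisions = [1, 1, 0, 1, 0, 0, 1, 0]          # 1 = favourable outcome
groups    = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = demographic_parity_diff(decisions, groups)
```

Here group "a" receives the favourable outcome 75% of the time against 25% for group "b", a gap of 0.5 that a compliance check could flag for investigation.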
According to Article 22 of the GDPR, data subjects have a right not to be subject to a decision based solely on automated processing, including profiling that produces legal effects concerning them or significantly affects them. When data controllers use machine learning to carry out automated processing, including profiling of data subjects, they must comply with the requirement of lawful, fair and transparent processing. This may be difficult to achieve due to the way in which machine learning works and / or the way machine learning is integrated into a broader workflow that might involve the use of data of different origins and reliability, specific interventions by human operators, and the deployment of machine learning products and services, including ‘Machine Learning as a Service’ services (provided by Amazon, Google, Microsoft, and others).
In order to be compliant, data controllers must assess how using machine learning to carry out automated processing affects the different stages of profiling and the level of risk to data subjects' rights, and how they can produce evidence of compliance to the regulator and the data subject. In cases where automated processing, including profiling, is permitted by law, data controllers still have to implement appropriate measures to protect the data subjects' rights. The underlying objective of the GDPR is that a decision significantly affecting a person cannot be based solely on a fully automated assessment of his or her personal characteristics. However, as I noted at the very beginning of this post, in the context of machine learning it might in some cases be more beneficial for data subjects if a final decision is based on an automated assessment, as it is devoid of the prejudices that human intervention can introduce.
Whether a decision about us is being made by a human or by a machine, right now the best we can hope for is that such a decision, one that can produce legal effects or significantly affect us in any manner, will be as fair as humans can be. Eventually, we as machine learning practitioners must aim to build models whose decisions are far fairer than humans can be.
This assumes that machines may soon be able to overcome the limitations of human decision-makers and provide us with decisions that are demonstrably fair. Indeed, in some contexts it may already make sense to replace the current model, whereby individuals can appeal to a human against a machine's decision, with one where individuals would have a right to appeal to a machine against a decision made by a human!
Well, that sounds a bit weird, doesn't it! Has the time finally arrived for Skynet to take over planet Earth?
I am sure that many of the questions we machine learning enthusiasts and practitioners have about the implications of the GDPR will eventually be answered after the regulation becomes enforceable in May 2018. We will also see interesting changes to how machine learning models are designed and applied, especially in the context of personal data processing.