How To Train Your AI Dragon (Safely, Legally And Without Bias)

Image: How to Train Your Dragon character. Credit: Dreamworks Animation

Untrained dragons can cause a lot of damage. Likewise, as AI systems spread further and have more influence over our lives, it is becoming far more important to make sure they are properly trained. Bias can creep into the reasoning of AI very easily, either via datasets that are not diverse enough or through irrelevant data attached to valid data points, leading to flawed results and, in some cases, prejudiced or dangerous conclusions.

Despite regulations like GDPR to protect the privacy of our data, personal consumer data is increasingly being used by companies to improve services or to gain customer insight. Ironically, these regulations also make it more difficult for companies to gather enough data to train an AI system or to prove how their AI reaches its decisions (an impossible task for many deep learning systems).

Therefore, as AI develops and its abilities grow, collecting useful data without breaching data regulations will be crucial to ensure that AI can make the right decisions, and that personal and sensitive data isn’t used in the wrong context. 

Safely sourcing data

With so much data flowing through cyberspace, companies are employing ever more granular metrics to measure our behavior and improve their services. However, the General Data Protection Regulation (GDPR) only allows companies to collect a person’s personal information with their explicit consent, or “if it is necessary for the purposes of legitimate interests pursued by the company,” says Sebastian Weyer, CEO of data anonymization company Statice. Because Article 6 of the GDPR (which outlines the requirements for compliant data processing) leaves the phrase ‘legitimate interests’ open to interpretation, companies “are safest when they obtain direct consent from data subjects,” according to Weyer.

However, given an air of mistrust around companies’ use of our data, Weyer points out that the majority of customers “will not consent to the use of their data for product tests and innovation,” which limits the amount of useful data available to train AI and improve products. This “lack of education around the importance of data in building personalized products and services” can stifle AI innovation, Weyer argues, and fears around data breaches and commercial misuse could in fact restrict the ability of AI to tackle pressing societal issues. Companies building AI products and services also need to be transparent about their use of data for automation, which is not always as easy as it sounds. Machine learning systems use data in incredibly complex ways, and the most advanced algorithms often “sacrifice interpretability as a price for performance,” says Weyer. Article 15 of the GDPR requires, however, that companies be able to explain the basic functionality of their algorithms to a data subject, to give an indication of how their data was used.

Removing all identifying information from a dataset, known as data anonymization, is therefore incredibly important when collecting data, as it allows usable information to be gleaned from a dataset without breaching data privacy regulations. Statice, for example, creates a synthetic dataset that preserves the structural and statistical properties of the original but contains no identifying information. Proper data anonymization is not only a GDPR requirement when collecting data but also helps to accurately train an AI system. “If the data is not correctly anonymized before being used to build machine learning models, the learned patterns could involve sensitive information,” says Weyer. This is because algorithms work by recognizing patterns in data: if extraneous information such as a person’s age, race, or address is present in the dataset, patterns could be drawn between those factors rather than the relevant data.
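To make this concrete, here is a minimal sketch of de-identification before training. It is not Statice’s actual method (they generate synthetic data); it simply drops direct identifiers and generalizes quasi-identifiers such as age, a common first step toward anonymization. The field names are assumptions for illustration.

```python
# Minimal de-identification sketch (illustrative only, not Statice's method):
# drop direct identifiers, then generalize quasi-identifiers like exact age.

DIRECT_IDENTIFIERS = {"name", "email", "address"}  # assumed field names

def generalize_age(age: int) -> str:
    """Replace an exact age with a coarse band to reduce re-identification risk."""
    lower = (age // 10) * 10
    return f"{lower}-{lower + 9}"

def deidentify(record: dict) -> dict:
    """Strip direct identifiers and generalize the 'age' quasi-identifier."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = generalize_age(cleaned["age"])
    return cleaned

record = {"name": "Ada", "email": "ada@example.com", "age": 36, "purchases": 12}
print(deidentify(record))  # {'age': '30-39', 'purchases': 12}
```

Note that dropping and generalizing fields like this reduces, but does not eliminate, re-identification risk; robust anonymization also has to consider combinations of remaining attributes.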

Training data

Aside from properly anonymizing sensitive information, getting the right training data for a particular algorithm is vitally important; in fact, data can be seen as the most important part of an AI system. Datasets that are incomplete, have over- or under-represented elements, or contain too much irrelevant information can easily skew an AI system’s reasoning. This was notably demonstrated by flawed criminal recidivism systems that suggested African-Americans were more likely to reoffend than their white counterparts, due to historically biased training data. But removing bias from a dataset isn’t easy, partly because of issues such as historical inequality or a lack of diversity in the data. “You will almost always start with an overrepresentation of some elements and underrepresentation of others,” says Leila Janah, founder and CEO of Samasource, but without proper testing and review “data sets that are not inclusive and diverse can lead to issues with bias involving race, gender and culture.”
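A simple way to surface the over- and under-representation Janah describes is to count each group’s share of the dataset and compare it against an expected baseline. The sketch below assumes a uniform baseline for illustration; in practice the baseline would come from the population the system will serve.

```python
# Illustrative representation check: flag groups whose share of the data
# deviates strongly from a uniform split (assumed baseline for this sketch).
from collections import Counter

def representation_report(labels):
    """Return each group's count and its relative deviation from a uniform split."""
    counts = Counter(labels)
    expected = len(labels) / len(counts)  # uniform baseline (an assumption)
    report = {}
    for group, n in counts.items():
        deviation = (n - expected) / expected  # e.g. 1.1 = 110% over-represented
        report[group] = {"count": n, "deviation": round(deviation, 2)}
    return report

labels = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
print(representation_report(labels))
```

Here group A is over-represented by 110% relative to a uniform split and group C is under-represented by 70%, the kind of skew that would prompt collecting more data for C or reweighting during training.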

Issues with bias are not just skin-deep, however, and for image recognition systems such as those in self-driving cars, diverse training data is a prime safety concern. “The data used to train an algorithm is a large component in ensuring it is able to appropriately identify a pedestrian from a stop sign and a stop sign from a tree,” says Janah. For example, a dataset in which people with darker skin tones are under-represented could make a self-driving vehicle less likely to ‘see’ a pedestrian with darker skin crossing the road. While this may seem an extreme example, it is worth considering the importance of representative datasets now, as AI is used in ever more mission-critical applications. Employing a diverse team to annotate training data (as Samasource does) helps to ensure that all relevant metrics are accounted for, that cultural bias does not inadvertently enter the system, and that datasets are representative of the general population.

Correctly annotated and properly anonymized training data is also just good practice when training AI. Removing irrelevant information from a model and ensuring that training data is as diverse and representative as possible gives an algorithm the best tools to make appropriate decisions. “The more variety of situations you can capture for a given problem, the more chances you have to build a comprehensive, reliable model,” says Janah, and this holds true for any AI application. When looking for appropriate training data, it is also a question of knowing what you want from a system and ensuring that misleading data is not present: if you are training an AI to look for lung cancer nodules, for example, it isn’t helpful to include liver cancer screenings. Overall, employing an appropriate bias-prevention strategy when selecting training data is just as important as the quantity of data gathered, given the damage that bias can cause throughout an AI’s computations. Janah argues that “thoughtfully testing your model for bias before, after and throughout production will help move your model to maturity.”
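One concrete form the testing Janah recommends can take is a demographic parity check: comparing how often a model produces a positive outcome for each group. The sketch below is one simple pre-deployment check, not a complete fairness audit, and the group labels are hypothetical.

```python
# Illustrative bias check: the largest gap in positive-prediction rate
# between any two groups (demographic parity difference). A large gap is
# a signal to investigate, not proof of bias on its own.

def demographic_parity_gap(predictions, groups):
    """Return max difference in positive-prediction rate across groups."""
    totals = {}  # group -> [positives, count]
    for pred, group in zip(predictions, groups):
        stats = totals.setdefault(group, [0, 0])
        stats[0] += pred
        stats[1] += 1
    rates = {g: pos / n for g, (pos, n) in totals.items()}
    return max(rates.values()) - min(rates.values())

preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["x", "x", "x", "x", "y", "y", "y", "y"]
print(demographic_parity_gap(preds, groups))  # 0.5 (group x: 0.75 vs group y: 0.25)
```

Running a check like this before deployment, after retraining, and periodically in production mirrors the “before, after and throughout production” cadence Janah describes.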

The devil’s in the data

While an AI company’s performance is often put down to the complexity of its algorithm, the power to make or break an AI system lies with the data it is trained on. Mishandling sensitive data can not only lead to a PR nightmare but can also fundamentally flaw an algorithm’s reasoning by allowing it to draw patterns between irrelevant data points. Notwithstanding the legal requirement to properly anonymize data (at least in the EU), it is also good practice to ensure that identifying data is removed from a dataset before training an algorithm, so that bias does not creep in.

Our lives are becoming more automated, and most of us now interact with AI systems on an hourly basis, whether we are aware of it or not. In this context, we must remain vigilant about protecting an individual’s right to data privacy and ensure that discriminatory AI is not set loose upon the world due to biased training data. AI is getting more powerful every day, and proper data management and assessment will be the check and balance against the harmful consequences of poorly trained systems.



I have been working in the M2M, IoT, and data space since founding Pod Group (a provider of IoT connectivity & billing software) in 1999, and have become greatly int...