Teaching machines to create: The hidden hazards of AI training


Some of the biggest record labels in the world made headlines recently when they filed a lawsuit against two companies that provide music-related AI tools. They allege that Suno, Inc. and Udio AI committed copyright infringement by using the labels’ songs to train their AIs, even if those AIs are used to generate original material. But the training of AIs has long been an ethical minefield, and in this post, we’ll explore why it can be such a fraught topic.


Why do AIs need to be trained?

How generative AI works

Generative AI is all about making predictions. Let’s say you want it to create a picture of a car driving down the street. The AI has to do a lot of guesswork to grant your request. But because you asked for a ‘car,’ it can assume that you want to see a four-wheeled vehicle. And since cars are a modern phenomenon, the street shouldn’t look like something out of ancient Rome. These are very simplistic examples, but you get the general idea. Of course, the AI needs to be able to make good predictions. It needs to realize that not all four-wheeled vehicles are the same, and that when someone asks for a ‘car,’ they’re seeking something different from a ‘bus.’

The AI training process

This is where training comes in. Michael Chen of Oracle has compared the process of training an AI to parenthood. If you want your child to understand the distinction between cats and dogs, you might start out by showing them images. Then, you might provide further context by telling them that cats meow while dogs bark. The more information you provide, the easier it becomes for the child to distinguish between the two animals. As Chen points out, training an AI follows the same basic paradigm.

Building AI understanding

So if the developers of a generative AI wanted to teach it what a car should look like, they might start by showing it images of different vehicles, making sure to tag photos of cars either in captions or alt text. Once it was reasonably good at differentiating cars from buses, trains, and airplanes, the developers might show it images of different types of cars, from sleek Porsches to boxy Plymouth Dusters. The more nuanced the data, the more nuanced the AI’s understanding of a car.
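To make the idea of learning from tagged examples concrete, here is a minimal sketch in Python. It is a toy, not how real image models are trained: instead of photos, it assumes each vehicle is described by two made-up measurements (length in meters and number of seats), and the data, the train function, and the predict function are all hypothetical names chosen for illustration. The "training" step simply averages the examples for each tag, and the "prediction" step picks the tag whose average is closest.

```python
from collections import defaultdict
from math import dist

# Hypothetical tagged examples: (length in meters, passenger seats) -> tag.
# Real systems learn from images; these two numbers stand in for them.
TRAINING_DATA = [
    ((4.2, 4), "car"),
    ((4.6, 5), "car"),
    ((4.4, 2), "car"),
    ((12.0, 40), "bus"),
    ((11.5, 50), "bus"),
    ((13.0, 45), "bus"),
]


def train(examples):
    """Average the measurements for each tag to get one 'typical' point per tag."""
    grouped = defaultdict(list)
    for features, tag in examples:
        grouped[tag].append(features)
    return {
        tag: tuple(sum(column) / len(column) for column in zip(*rows))
        for tag, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the tag whose typical point is closest to the new measurements."""
    return min(centroids, key=lambda tag: dist(centroids[tag], features))


centroids = train(TRAINING_DATA)
print(predict(centroids, (4.5, 5)))    # measurements resembling the car examples
print(predict(centroids, (12.5, 48)))  # measurements resembling the bus examples
```

Adding more varied examples (say, minivans or minibuses) would move the typical points and change the borderline predictions, which is the toy version of "the more nuanced the data, the more nuanced the understanding."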

Garbage in, garbage out

Of course, if you want an AI to produce high-quality results, you’ll need to train it with high-quality information. If you aren’t careful, you can end up teaching your AI the wrong lessons. 

  • Gender bias in recruitment
    In 2018, Amazon had to scrap the AI it had created to evaluate candidates for software developer positions. The problem? It was discriminating against women. This occurred because it was trained using resumes Amazon had received over a 10-year period. But since men have long dominated the tech industry, the pool of resumes was heavily skewed in favor of men, which in turn colored the AI’s perceptions.
  • Problematic image categorization
    In 2019, researchers discovered that much of the material in ImageNet, a vast library of images widely used to train image-recognition AI, had been tagged with some highly problematic descriptions. A smiling woman in a bikini was tagged with the words “slattern, slut, slovenly woman, trollop” while a kid in sunglasses was labeled “failure, loser, non-starter, unsuccessful person.” There were entire categories with names like “bad person,” “closet queen,” and “pervert.” ImageNet ultimately addressed the problem, but there are still plenty of other image libraries that use similarly problematic categorization.
  • Stereotyping in image generation
    In 2022, Stable Diffusion was criticized after users noticed that, when the AI was asked to create images of Latinas, it often depicted them in revealing outfits and provocative poses. Although the developers reined in the AI, the Washington Post noted that the original training set had included a good deal of pornography.
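The common thread in these examples is that a model reproduces whatever patterns its training data contains. The Python sketch below illustrates that with entirely invented numbers. It is not a reconstruction of any real recruiting or image system; the HISTORY data and the learn_keyword_scores function are hypothetical. The "model" scores a resume keyword by how often past resumes containing it led to a hire, so any skew in the history becomes a skew in the scores.

```python
from collections import Counter

# Hypothetical hiring history: (keyword found on a resume, was the candidate hired?).
# The skew is deliberate: far fewer past resumes mention a women's club,
# and fewer of those led to a hire.
HISTORY = (
    [("men's chess club", True)] * 40
    + [("men's chess club", False)] * 40
    + [("women's chess club", True)] * 2
    + [("women's chess club", False)] * 8
)


def learn_keyword_scores(history):
    """Score each keyword by the share of past resumes with it that led to a hire."""
    seen = Counter(keyword for keyword, _ in history)
    hired = Counter(keyword for keyword, was_hired in history if was_hired)
    return {keyword: hired[keyword] / seen[keyword] for keyword in seen}


scores = learn_keyword_scores(HISTORY)
for keyword, score in scores.items():
    print(f"{keyword}: {score:.2f}")
```

Neither keyword says anything about a candidate's skill, yet the toy model ends up rating one lower than the other purely because of who was hired in the past.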

How to fight bias in AI training

IBM has suggested some strategies for combating bias in AI training, including:

  • Selecting the correct learning model.
  • Using the right data.
  • Ensuring development teams reflect a diverse pool of individuals.
  • Being mindful of bias at all stages of the training process.
  • Evaluating methodologies to ensure they’re still appropriate for the job.
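As one small illustration of what "using the right data" and staying mindful of bias can look like in practice, the Python sketch below checks how well each group is represented in a dataset before any training happens. This is an assumption about one possible check, not a method taken from IBM's guidance; the underrepresented_groups function, the "gender" field, the sample records, and the 20% threshold are all hypothetical.

```python
from collections import Counter


def underrepresented_groups(records, field, min_share=0.2):
    """Return each group whose share of the records falls below min_share."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {
        group: count / total
        for group, count in counts.items()
        if count / total < min_share
    }


# Hypothetical training set: 85 records from one group and 15 from another.
records = [{"gender": "male"}] * 85 + [{"gender": "female"}] * 15

flagged = underrepresented_groups(records, "gender")
print(flagged)  # groups making up less than 20% of the data
```

A check like this only catches imbalances in the fields you think to measure, so it complements, rather than replaces, the human review described in the list above.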

The human element is key

Throughout our discussions of AI, we’ve always stressed that AI is just a tool, and one that must always be used in an ethical manner. Just as parents must carefully educate their children to help them reach their full potential, developers must be prepared to take a hands-on approach when training their AIs. They need to make sure they’re training their AIs with the right data and that they’ve selected the best training model for the job. Moreover, they need to continually monitor their training system and make improvements as needed. By doing these things, developers can ensure that their AIs are responsible actors within the digital realm.
