The challenges creatives face in opting out of AI training datasets

A surreal image of a person in a maze

There's been a lot of concern over the widespread use of copyrighted works to train generative AI. In response, many developers now offer ways for creatives to keep their work out of training sets. But is this actually an effective strategy?

We recently discussed how many creatives are unhappy that their work is being used to train generative AI tools. To combat what they see as the illicit appropriation of their work, some have attempted to turn the tables through ‘data poisoning’: by subtly manipulating their source material, they hope to trick these AIs into producing highly distorted output. The controversy over AI training has also led a number of developers to offer opt-outs that theoretically prevent content from being used to train AI. But as we’ll see, there’s more to these opt-outs than meets the eye.

What is generative AI?

First, a refresher. Although there are many different types of artificial intelligence (AI) out there, generative AI has become one of the most familiar thanks to the popularity of tools such as ChatGPT and Midjourney. This kind of AI can create new content (text, images, and more) in response to user prompts.

How does generative AI work?

At the most basic level, generative AI works by making predictions. If you ask Midjourney to create a scene of Egyptian pyramids, the AI has to make a number of educated guesses to figure out what you’re looking for. For example, because you indicated that you wanted ‘Egyptian’ pyramids, it can deduce that you want it to create a desert scene. 

How is generative AI trained?

Generative AI is able to make these predictions because it’s been trained on vast amounts of data. By showing it a large number of images tagged with the word ‘Egyptian,’ it learns to associate that term with monumental stone structures, sand-swept vistas, and a narrow river winding through the desert.
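
To make this concrete, here is a deliberately tiny sketch (in Python) of the underlying idea: counting which visual concepts co-occur with which caption words. Real generative models learn these associations with neural networks trained on enormous datasets; everything below, including the data, is invented purely for illustration.

```python
from collections import Counter, defaultdict

# Toy 'training set': caption words paired with the visual concepts
# present in each (imaginary) tagged image.
training_data = [
    (("egyptian", "pyramids"), ("sand", "stone", "desert sky")),
    (("egyptian", "temple"), ("stone", "columns", "desert sky")),
    (("mayan", "pyramids"), ("stone", "jungle", "steep steps")),
]

# 'Training': count how often each caption word co-occurs with each
# visual concept across the whole dataset.
associations: defaultdict[str, Counter] = defaultdict(Counter)
for caption_words, visual_concepts in training_data:
    for word in caption_words:
        associations[word].update(visual_concepts)

# 'Prompting': which concepts does the word 'egyptian' suggest?
print(associations["egyptian"].most_common(3))
# -> [('stone', 2), ('desert sky', 2), ('sand', 1)]
```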

It takes massive amounts of data to train these tools. Even a ‘small’ dataset can include thousands of examples, and the datasets behind today’s major image generators run to billions of scraped images. While some developers use carefully curated datasets, others simply scrape material from the open Web. As a result, there’s a good chance that copyrighted content is being used for AI training.
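
Site owners do have one widely recognized signal for scrapers: the decades-old robots.txt convention. OpenAI, for example, has said its GPTBot crawler honors robots.txt rules, though compliance is entirely voluntary. Here is a minimal sketch using Python’s standard library (example.com is just a placeholder domain):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt file, the same way a
# well-behaved crawler would before scraping any pages.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # placeholder domain
parser.read()

# Would a crawler identifying itself as 'GPTBot' be allowed to
# fetch this page? Nothing *enforces* the answer; a scraper that
# ignores robots.txt can simply fetch the page anyway.
print(parser.can_fetch("GPTBot", "https://example.com/portfolio/"))
```

A site that wants to keep GPTBot out entirely can add a ‘User-agent: GPTBot’ / ‘Disallow: /’ block to its robots.txt, but once again, that only stops crawlers that choose to listen.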

Why is using copyrighted material in AI training problematic?

An AI developer might argue that using copyrighted material doesn’t hurt the rights holder, since the work produced by these tools doesn’t directly replicate their content. In their view, it’s like someone being inspired to write a story about wizards after reading Harry Potter. But many creators argue that this isn’t the whole story. Imagine you’re an artist with a very distinctive style. If generative AI is trained on your work, it will be able to create images that look like you made them. Even if they don’t directly duplicate any one of your pieces, they can still hurt you by driving away business. After all, why should a potential client pay you to create something for them when they can get something similar by typing words into a text box?

The perils of opting out

Given these complaints, it’s little wonder that there have been a lot of conversations about the ethics of AI training. This has led some developers to allow individuals or businesses to opt out of having their content used in this manner.

The problem with this approach is that vast amounts of data have already been scraped from the Internet and used in AI training. Opting out at that point is like closing the barn door after the horse has bolted and is already several miles down the road. While it’s theoretically possible for an AI to ‘unlearn’ something it was taught during training, machine unlearning is still an immature research area, and developers are rarely transparent about whether, or how, they apply it.

Many developers also make the opt-out process as labor-intensive as possible. For example, OpenAI allows creators to remove owned or copyrighted images from its training datasets, but they must submit a separate request for each image, along with a description. As Business Insider pointed out, the Georgia O’Keeffe Museum would have to submit 2,000 separate requests to remove all of the artist’s works. Even then, the opt-out only applies to future training.

Unfortunately, some of the best strategies for keeping your content away from AI training datasets can have major downsides. Setting your Instagram account to private will stop Meta and others from using the contents of your feed, but as The New Yorker points out, this comes at the cost of limiting your account’s reach. If Instagram is a vital component of your marketing efforts, that might not be a feasible option for you. 

When in doubt, it’s always a good idea to check the settings of each platform you use. If you look hard enough, there may be an option that allows you to opt out of having your content used for AI training. Be careful, though: some companies use subtle tricks to make these settings harder to apply. Meta, for example, deliberately tries to steer you away from clicking the necessary link, and even if you find it, you’ll have to provide personal information (and an explanation) in order to opt out.

Future pathways: towards transparent and consent-driven AI

Many of these issues could be addressed through regulation, but that will be a tall order. It’s hard for governments to impose meaningful rules on a technology that is evolving this rapidly. And the borderless nature of the Web makes national regulations difficult to enforce, since AI can still be built and deployed by bad actors in other jurisdictions.

Some have hoped that AI developers can be persuaded to regulate themselves. For example, the British government is looking to establish a voluntary AI Cyber Security Code of Practice to promote best practices in the industry. But given how many different companies operate in the AI sector, getting everyone to adhere to a voluntary code of practice may well be like herding cats. 

Other stakeholders have come up with their own strategies to make AI training more ethical, including:

Coalition for Content Provenance and Authenticity (C2PA):

  • C2PA is developing technical standards to certify the source and history (provenance) of media content, aiming to combat misinformation and ensure transparency in digital media.
  • Website: c2pa.org

DECORAIT – DECentralized Opt-in/out Registry for AI Training:

  • DECORAIT proposes a decentralized registry that lets content creators assert their right to opt in or out of AI training, promoting transparency and consent in AI data usage (a simplified sketch of the registry idea follows below).
  • Research Paper: DECORAIT on arXiv
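
To illustrate the registry concept, here is a purely hypothetical Python sketch. All names and structures in it are invented: the real DECORAIT design is decentralized and would need content fingerprinting that survives re-encoding and edits, whereas this toy version uses an exact SHA-256 hash and a plain in-memory set.

```python
import hashlib

# Invented in-memory stand-in for what would really be a
# decentralized, shared registry.
opt_out_registry: set[str] = set()

def fingerprint(content: bytes) -> str:
    # Exact hash for simplicity; it breaks if even one byte changes,
    # which is why a real registry needs robust perceptual hashing.
    return hashlib.sha256(content).hexdigest()

def register_opt_out(content: bytes) -> None:
    """A creator asserts that this content must not be used for training."""
    opt_out_registry.add(fingerprint(content))

def may_train_on(content: bytes) -> bool:
    """A crawler-side check before adding content to a training set."""
    return fingerprint(content) not in opt_out_registry

artwork = b"...bytes of an image file..."
register_opt_out(artwork)
print(may_train_on(artwork))  # False: the creator has opted out
```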

A problem with few good solutions

AI training presents a major ethical problem. Generative AI is embedding itself in our lives, yet it relies on massive amounts of existing content in order to function. Inevitably, this means that copyrighted or otherwise protected works are being used for training purposes even though their creators haven’t given consent and may end up being harmed by this use. And while some companies have created ways for people to opt out of AI training, these aren’t surefire solutions. Opting out won’t matter if your content has already been used, and the process can be an exercise in frustration, as companies make it as difficult and time-consuming as possible. Other solutions, such as setting one’s account to private, can have serious drawbacks that make them unrealistic in many situations.

Illustration of colorful books on a shelf against a dark background.