Is (/should?) Training an AI Model on Copyrighted Works Fair Use?
Two in one week, somebody stop him!
OpenAI submitted their proposal to the US government for what they think an “AI Action Plan” should entail, and unsurprisingly they suggest codifying that training AI models on copyrighted works is fair use.
Unfortunately it doesn’t look like any of these news articles link to the actual document, or if they do, they do it so confusingly that I can’t find it, so I can’t actually see what the text of the proposal is beyond its suggesting an approach that features “preserving American AI models’ ability to learn from copyrighted material.” So I don’t know the specifics beyond that.
But this is a good peg for talking about fair use and AI training, which I’ve been wanting to poke at for a while as someone with an amateur interest in IP law and policy, inspired by the fact that the only radical position I have in my otherwise completely normie brain is my way-to-the-far-edge-of-the-bell-curve skepticism of IP regimes and US Copyright in particular.
In general, across the board I’m in favor of things that loosen the strength of IP monopolies, from things like shortening copyright terms, to replacing patents with challenge prizes1, to introducing more compulsory licenses2, and so on.
But the biggest thing I generally support is twofold:
better codification of fair use defenses and principles, and
expansion of what constitutes fair use.
So that’s my bias coming into this conversation!
Fair use is a doctrine that provides a defense against claims of copyright infringement. Meaning if you do something that would otherwise be infringing and get sued over it, you can claim fair use and it’s up to the court to use a set of factors to determine whether your defense is successful.
So an interesting first step in the question of whether AI training on copyrighted works is fair use is to first ask, is it even infringement? For instance, if you buy a book, it’s not “fair use” to sell that physical book to someone else, because you’re not making a copy and therefore not infringing. It’s not “fair use” to use the same title as another work, or to use someone else’s recipe in your restaurant, because neither titles nor recipes are copyrightable. You also can’t bring a fair use defense if you use someone’s patent or trademark, because that’s a violation of the patent/trademark, not a violation of their copyright.
I am not a lawyer, I didn’t even finish my Coursera course on IP law, but my sketchy amateur understanding of copyright is that it is really focused on copying. If you aren’t making a copy, it is not covered. What counts as a “copy” can certainly be broad, such as when you make a movie based on a book: even if you never use a single line of text from the book, you will be “copying” the plot, characters, names, etc.
The act of training an AI model involves the model reading through its input, breaking that input down into tokens that represent some minimal snippet of text3, and then working through that input to create billions of probabilistic vectors, arranged so that complicated vector math can take in a series of tokens and output the best next token. You can have vectors representing everything from concrete things like “dog” or “The Golden Gate Bridge” to more abstract things like “truth-telling” or “Title case capitalization rules”. The math can do interesting things like (hypothetically) adding the “dog” and “child” vectors together to land you close to the “puppy” vector, while combining “dog” and “title case” gets you “Dog.” Obviously there are way more technical details than that, but that’s the high-level view of what happens.
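To make that “vector math” idea concrete, here’s a toy sketch of the analogy arithmetic described above. Everything here is made up for illustration: the embeddings are tiny hand-written 3-dimensional vectors, whereas real models learn vectors with thousands of dimensions from enormous amounts of data. But the core trick (add vectors together, then look for the nearest known word) is the same in spirit:

```python
import math

# Hypothetical, hand-made "embeddings" -- real ones are learned, not written by hand.
embeddings = {
    "dog":   [0.9, 0.1, 0.8],
    "child": [0.1, 0.9, 0.7],
    "puppy": [0.8, 0.8, 0.9],
    "truck": [0.2, 0.1, 0.1],
}

def cosine(a, b):
    """Cosine similarity: how closely two vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(vector, vocab):
    """Return the vocabulary word whose embedding is most similar to `vector`."""
    return max(vocab, key=lambda word: cosine(vector, vocab[word]))

# Combine "dog" and "child" by adding their vectors element-wise...
combined = [d + c for d, c in zip(embeddings["dog"], embeddings["child"])]

# ...and ask which remaining word lands closest to that combination.
candidates = {w: v for w, v in embeddings.items() if w not in ("dog", "child")}
print(nearest(combined, candidates))  # -> puppy
```

Note that the only thing stored here is geometry, numbers describing how concepts relate to each other, which is part of why the “is this even a copy?” question is genuinely murky.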
I do think there is a plausible argument that in that case, training may not constitute “copying” the work. At the very least, I don’t think it’s obvious either way.
But if OpenAI is pursuing the argument that it’s fair use, then they’re implicitly conceding that they are copying the works, but that the copying is and should be allowed. That, or they’re running a multi-level strategy: arguing it’s not copying, while keeping in reserve “okay, even if you decide it is copying, we still think the copying is acceptable.” I haven’t read their court filings, so I don’t know.
But assuming it is copying, and therefore could be considered infringement: is it fair use?
As mentioned before, fair use (legally) is a defensive claim that you make when and if you get sued. You can’t “apply” for fair use or anything (which is one of many reasons why all of those “I’m claiming fair use under USC blah blah blah” disclaimers in YouTube descriptions of straight up copyrighted videos are not the protection people think they are!)
Fair use is also a factors test, meaning that US copyright law provides four factors that are to be used holistically by the court in deciding if this particular instance of infringement can be considered fair use. Fair use is also explicitly always case-by-case, and one case’s fair use determination does not establish a binding precedent on any other.
Throughout this I’m liberally paraphrasing from Stanford’s Fair Use guidelines (which, I believe is in and of itself a fair use of their work! 😅)
The four factors are:
The purpose and character of your use
The nature of the copyrighted work
The amount and substantiality of the portion taken
The effect of the use upon the potential market
Again, the act of training an AI model is consuming a work and mysteriously turning the distributions of tokens within the work into convoluted numerical vector representations that can be mathed together into a prediction for the next token. In this case, again let’s assume that this encoding of tokens, concepts, and their relationships into complicated mathematical vectors is “copying” the protected work. So is it fair use? Only a court can decide, but let’s put on our pretend judge wig and go through the factors.
1 - The purpose and character of your use
This is getting at the question of whether your work is “transformative” or not. Generally speaking, the more you are creating something new out of the work, the better (for fair use purposes.) This includes things like parody, commentary, criticism, news reporting, education, etc. If you’re just reproducing the work wholesale, that’s not great.
This raises a question about where our boundaries are when talking about training: are we just looking at the model itself, or are we looking at the model and its outputs?
If just the model, then the question is whether going from text -> vectors is considered transformative. I can see plausible arguments both ways. On one side: “obviously it’s transformative, because it’s an entirely different format and it’s really learning meanings and concepts from the work, not directly reproducing the work itself.” On the flip side: “just because you convert an image from JPEG -> PNG doesn’t make it transformative, and the same goes for text -> vectors.” In this case I buy the former more, but I don’t know how a judge would think.
If however you’re saying the thing under scrutiny is the model and its outputs, then again I can see it both ways! On one side, “it’s clearly transformative because we’re going from a specific text to a general-purpose remix of tons of specific texts,” while on the other hand, “I can get a near word-for-word recreation of the copyrighted text if I prompt it in a specific way.”
Again, I lean towards the former and also I think lean towards considering just the model, separate from its outputs, so all in I find myself on the side of “yes transformative.” But put a pin in the “word-for-word recreation of the copyrighted text” bit, we’ll come back to it later.
2 - The nature of the copyrighted work
Again, paraphrasing from Stanford, but this is referring to whether you’re primarily copying factual and informative works vs creative works (with factual works being better for fair use) and whether you’re copying published vs unpublished works (with published being better.)
In this case, presumably OpenAI and all these mega models are copying both factual and creative works, so no real help there, but by definition if they’re scraping the public internet they’re copying almost exclusively published works, so minor help.
3 - The amount and substantiality of the work taken
How much of the work are you copying, and are you copying the “heart” of the work (vs. something more extraneous)? Less of the work, and less of its heart, is better for fair use.
If we’re just looking at the model, well this is no good, because you’re copying the whole work into your vectors!
If we’re looking at the model and the output, then it varies, because the output will in some cases be a lot of substantial copying of a particular work, but in other cases be a very diffuse mix of millions of works contributing to the new output.
4 - The effect of the use on the potential market
This tests whether the copying has the effect (or potential effect) of depriving the copyright holder of income. And this, I think, is the worst factor for OpenAI, as they explicitly market these tools as being capable of replacing the need for (some) work and workers!
So yeah, no help whatsoever on this front!
So all four together:
Is it transformative? I think yes, though I can see plausible arguments both ways (good for fair use)
What’s the nature of the copyrighted work? All published, but mix of factual and creative (mixed bag for fair use)
What’s the amount and substantiality? Either all of it if just looking at the model, or a mix of a lot and a little if looking at the output (bad to mixed bag for fair use)
What’s the effect on the market? Lol, just bad!
So again, not a lawyer, but I can see a plausible argument both for and against this being a fair use of the copyrighted material, mostly coming down to how transformative is it and whether that overcomes the market effects. But it really all comes down to whatever a judge or jury thinks in the moment.
Separate from the question of whether it is currently fair use, should it be?
And here I unambiguously think yes.
Remember that I’m a weird copyleft radical, so my general tenor is “Fair use in the US is way too restrictive and there should be way more deference to even moderately transformative works.” And in this context I do think that this is a genuinely novel and transformative use of existing works and should be allowed.
But remember back to the “recreation of copyrighted text” bit. While I will tend to side with anything that reduces the strength of copyright protections, even if you disagree with that take, I don’t think the question of whether the training is infringing is the only ballgame. The output of a model can still be infringing, even if the training of the model is not.
If a model produces for you a picture of (non-Steamboat Willie) Mickey Mouse, or creates a paragraph of Harry Potter fanfic, or reproduces a NYT article, that output itself is absolutely 100% copyright infringement. There is no defense for “it’s not infringing because I used a tool to make it for me,” the existence of the infringing artifact is in and of itself infringement.
You can apply a fair use test for it, but remember there is no factor of “did you do this yourself or use a tool to help you.” The factors only examine the work itself and how it compares to the alleged infringed original work.
Similarly, if you use a model and it ends up reproducing enough copyrighted text, even if you didn’t tell it to do that, it’s still infringing. There’s no fair use defense for “I didn’t know it was a copy from somewhere else.”
Again again again, not a lawyer, but like… the question of whether AI output can be infringing feels 1000% settled to me? The unsettled part (to my knowledge) is who is liable for that infringement. Is it the person who used the service to create the infringing work? Is it the service that provided the interface to the model that created the work? Is it the developer of the model?
This is a part of copyright law I’m least familiar with, but to my common sense thinking I would imagine either the person who used the service or the service itself should primarily be in the hot seat? (In the case of OpenAI they’re both the service and the model developer, but you could imagine that being different in say a third party app that incorporates Meta’s open-source Llama models.)
I don’t know if this is the case, but I would imagine if the big apps and model shops are smart, they would probably be putting a lot of indemnification clauses in their terms of service to make sure they’re passing the buck on any infringement liability to their users. And if so (or if it’s determined in the future that the liability already is solely with the user of an AI tool), then it’s pretty risky to use AI for anything substantial! You could be opening yourself up to some pretty hefty liability risk, and again there’s no “but I didn’t create it myself, I used an AI tool” defense for copyright infringement! I feel like this is drastically underappreciated by AI boosters lol!!!!!!!4
And so I think this is my overall take on how copyright both may (caveat it’s all dependent on how judges and juries rule) and should work in relation to AI tools:
Training an AI model on copyrighted work may be allowed (either because it’s not copying, or it’s transformative fair use) but definitely should be allowed
Generating infringing work using AI tools definitely is not allowed, and (with the asterisk I still want to expand fair use more broadly) still should not be allowed.
And to me that actually gets the harms right: the model is not what harms, say, the NYT, it’s the output of the model, and therefore it’s the output that both is currently and should continue to be subject to copyright claims.
The idea that, rather than granting the developer of a drug or scientific or technical process a 15-20 year monopoly over their development (aka a patent), you instead give them a big cash prize as a reward, and then put the development in the public domain for everyone else to use
The idea that you still need a license to use a copyrighted work, but the base rate of that license is set in advance and the copyright holder can’t deny you the license if you pay it (which is how cover songs work in the US: the songwriter is compelled to grant a license to the performer of the cover, provided they get paid their required royalty.)
You can play around with how different models tokenize text with this playground!
The other tangential but equally drastically underappreciated relationship between IP law and AI tools is that human authorship is required in order for something to be copyrightable, meaning the output of an AI model is non-copyrightable and therefore public domain. The output of an AI model can contribute to a human-authored copyrightable work, but the most substantial portion of the creation needs to happen by the human themselves. If I were a CEO I would be wary of, say, having all the code my company develops, or all the marketing we create, be public domain! I think this is also an underappreciated aspect of why Hollywood is reticent to jump feet-first into AI screenwriting and the like; the fact that it’s generally worse quality, and that there are unions, are obviously huge factors, but also… if you can’t own the copyright to the story, that’s pretty bad for your monetization model!!!
That said, remember I’m a weird copyleft radical, so if the proliferation of AI tools accidentally unleashes a new public domain golden age, I can’t say I’m strictly opposed, lol.