The hits keep coming! A day after Judge Alsup in the Northern District of California ruled (partially) in favor of Anthropic’s fair use claim in a lawsuit about AI training, now Judge Chhabria (also in the Northern District of California, makes sense, this is where the companies are lol) ruled in favor of Facebook’s fair use claim!
What is interesting is the way these two rulings overlap and differ. That’s to be expected, as fair use is necessarily a fact-based, case-by-case defense, and so it’s one of the areas of law where different cases are most likely to come to very different results. But it also highlights how unsettled a lot of this still is!
To recap the Anthropic case, plaintiffs (a collection of book authors) argued that Anthropic had infringed on their copyright by:
Pirating a bunch of books off the internet
Digitizing a bunch of books they’d purchased legally
Using both in order to train the LLM
And Alsup found that the first was not fair use, while the latter two were.
In this case the plaintiffs (a different set of book authors) alleged infringement by Facebook in training its Llama models on a bunch of torrented books. Separately, there’s an interesting question of whether Facebook committed additional copyright infringement by torrenting the books and by not successfully preventing “leeching” (aka when your torrent client re-uploads pieces of the torrent to other downloaders while you’re still downloading), but that was not part of the summary judgment requests.
The main arguments that plaintiffs make are that training Llama cannot be considered fair use because:
It can reproduce verbatim snippets of their original text, and
It denies them the opportunity to license their works for AI training
Market for Licensing
First, the second argument, because Chhabria reaches the same conclusion as Alsup: the authors are not entitled to this market, at least in this case. Chhabria, though, explains it in a way that I now understand!!
I had assumed from Alsup’s ruling that it was something to do with this being a new market not contemplated by the Copyright Act, or something along those lines, but in fact it’s to do with circular reasoning. Quoting Chhabria:
“In every fair use case, the ‘plaintiff suffers a loss of a potential market if that potential [market] is defined as the theoretical market for licensing’ the use at issue in the case. Therefore, to prevent the fourth factor analysis from becoming circular and favoring the rights holder in every case, harm from the loss of fees paid to license a work for a transformative purpose is not cognizable.”
Aka the whole thing at question in this case is whether training an AI on copyrighted works is fair use. So you can’t say “it’s not fair use because I should be able to charge for it,” because being able to charge for it presupposes that it is not fair use, and that’s what we’re trying to figure out. This makes sense to me! Whether or not this market is one the authors are entitled to is a potential outcome of cases like these, not the premise, and therefore it can’t be taken as a given.
The Rest of the Ruling
Going through the rest of the analysis, Chhabria similarly applies the four-factor test to Facebook1. Like Alsup, he rules that training an LLM is absolutely, obviously transformative and decides the first factor for Facebook; that the works in question definitely cut to the heart of what is copyrightable and decides the second factor for the authors; and that, given the nature of the transformative use (training an LLM), the amount of copyrighted work used is reasonable and proportional, in that the more text you use the better your LLM will be, and decides the third factor for Facebook.
That then leaves the question of the fourth factor: what is the impact of the copying on the market for the original works?
Despite the fact that both cases rule for the LLM companies and say there is not a significant impact on the market, they come at it from different and potentially contradictory directions, in part due to the facts of the specific cases in front of them.
In Anthropic’s case the authors argue that the market is harmed because Claude lets people create other new works (works that are not substantially similar or infringing in and of themselves) that could compete with their originals. They do not make any claims regarding direct substitution, i.e. that Claude can reproduce their books directly. Alsup rules that the authors are not entitled to protection from that potential market dilution (memorably comparing it to teaching kids to write well by having them read good books, which also creates competition for the original books). Alsup does strongly hint that it might be different if they could claim or prove any sort of direct copying coming from the models, but that’s not at issue in the case, and so he rules for Anthropic.
In Facebook’s case, it’s the opposite: the authors focus (once the claims around the market for AI training data are cast aside) on the fact that Llama can reproduce verbatim text, and say that is damaging to the market for their work. But the evidence from both sides shows that Llama can only produce on the order of <50 words of verbatim text, even with pretty aggressive adversarial prompting, and Chhabria dismisses that as woefully insufficient to prove any sort of direct substitution harm.
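As a sidebar, “longest verbatim overlap” is a pretty mechanical thing to measure, which is presumably how both sides could put a number like <50 words in front of the court. Here’s a minimal Python sketch of the general idea, a word-level longest-common-substring check; the texts and the function name are toy illustrations of mine, not anything from the filings.

```python
# Illustrative sketch only: measuring the longest run of words a model
# output shares verbatim with a source text.

def longest_verbatim_run(source: str, output: str) -> int:
    """Length, in words, of the longest word sequence shared verbatim."""
    a, b = source.split(), output.split()
    best = 0
    prev = [0] * (len(b) + 1)  # DP row: match length ending at each position
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

book = "it was the best of times it was the worst of times"
reply = "as the saying goes it was the best of times for llamas"
print(longest_verbatim_run(book, reply))  # -> 6 ("it was the best of times")
```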
On the question of dilution of the market through new works that aren’t direct reproductions, Facebook introduces some evidence showing no market impact after the release of Llama, and the authors barely respond to it. They do not introduce any other factual evidence or do anything to make this significant enough to litigate as a factual dispute. They attempt to claim that it should be taken as a given that the market will be diluted, but Chhabria rebukes that, citing case law saying that where the defendant provides evidence and the question is market harm through indirect substitution/competition, you cannot just infer that harm or take it as given; you have to actually prove it. Through gritted teeth, Chhabria then rules for Facebook, but with almost half of the opinion saying that, basically, the plaintiffs fucked up by not anchoring their claim on the market dilution argument. Chhabria in fact directly argues with Alsup, saying that his teaching-a-child-to-write simile is wrong for this situation, and seemingly putting much more weight on the idea that market dilution is a legitimate harm, and one that’s likely to occur at a greater rate.
So you end up having:
one case (Anthropic) saying that:
the dilution argument doesn’t matter
but maybe you could win on direct substitution, and
another case (Facebook) saying that:
the direct substitution doesn’t matter
but maybe you could win on dilution.
Interesting! Obviously that should continue to make the big AI labs nervous!
There are some ways to square that circle, though: on Alsup’s side, if he were presented with the same evidence, where the model only produced <50-word snippets, would he find, as Chhabria did, that that’s insufficient to rise to the level of market harm through direct substitution? And vice versa, on Chhabria’s side, he does take pains to say you would have to actually prove market harm through indirect substitution and market dilution, and that may be difficult! In particular, Chhabria points out that the thing that needs to be proven is not whether a world with LLMs trained on the copyrighted works in question would create substantially more market dilution than a world without LLMs at all, but whether it would create substantially more market dilution than a world of LLMs trained without the copyrighted works in question. That may be hard to prove, and in fact may not actually be true!
Other odds and ends: Chhabria also makes an interesting argument re: Facebook’s claim that this must be fair use because ruling otherwise risks preventing an important transformational technology from existing. Chhabria’s response is basically: no it doesn’t, you’d just have to license the works, and that’s fine because you’re making so much money. There’s a view that these lawsuits are loaded guns just waiting to go off and destroy the AI companies, but Chhabria disagrees and thinks the companies could survive an adverse ruling just fine. I’d be curious to see if that holds true in other cases, and obviously whether it holds true in reality.
Also, Alsup finds Anthropic’s piracy of books to be infringing regardless of the fact that the books were later used (in part) for training, rejecting Anthropic’s argument that because the LLM training is fair use, so too was the downloading. Chhabria appears to take the other side, excusing the piracy2 based on the fact that the piracy was in service of acquiring works to train the LLM. Again, the challenge of fact-specific, case-by-case analyses!
My Thoughts
The key difference between these rulings seems to be the opinions of the judges on the question of market dilution, with Alsup dismissing it and Chhabria basically begging for someone to make the argument for it. Now I am a weird copyleft libertarian radical, so on the question of market dilution I think I come down pretty skeptically, at least in how I want it to work. I mentioned before that when analyzing an IP regime I bias towards “does this enable the creation of more art” over “does this properly reward artists” and so I think it makes sense that I’m much more on the Alsup side of the line than the Chhabria side, at least in personal preference for how copyright and fair use should work.
The piracy angle is the other big difference, not because it’s as interesting to the central question, but because the damages for piracy can be enormous, so it can have a very large practical impact on the AI labs. Even in a world where the training is near-universally considered fair use, if it comes with a hefty set of piracy damages, that’s still not a great place for these companies to be.
Having these two cases in mind, if I were to wave a magic wand and settle everything I think my version would look something like:
Anyone that pirated to train their models is absolutely liable for that and has to pay damages (or settle)
But the training of the LLM is generally considered fair use, and therefore you don’t necessarily need to license works in order to train on them
Implying an Alsup-style view that indirect substitution of otherwise non-infringing output is not a harm that can be claimed
Which also implies three choices for AI labs that train on non-synthetic data, all of which require paying money to someone eventually:
Pirate and pay money later in damages/settlements
Pay money upfront to buy the works legally and train on those (like Anthropic did with physical books)
Pay money upfront to license the works
Outputs are judged on their own terms as to whether or not they’re infringing, with liability on the user
And as a matter of PR, or as part of some licensing deals or settlements, or because users would be scared to use a tool that might open them up to ruinous copyright liability, the AI labs provide tools to help prevent infringing output and/or hunt down infringing output out in the wild (I imagine something like YouTube ContentID on steroids; a rough sketch of what that matching could look like is below)
Aka piracy is bad, training is fine, infringing output is bad, and everyone making money in the system is incentivized to prevent the bad stuff as the reward for allowing the fine stuff. I don’t think that’s exactly what we would get out of these rulings, but because the authors didn’t successfully prove harm from indirect substitution, who knows!
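To make that ContentID-on-steroids idea a little more concrete, here’s a minimal Python sketch of one way such a matcher could work: fingerprint word n-grams from a protected corpus, then flag model outputs that hit the index. Everything here (the function names, the n-gram size, the toy corpus) is my own illustration, not a description of ContentID or of any tool the labs actually run.

```python
# Illustrative sketch only: a toy n-gram fingerprint index, gesturing at
# what a ContentID-style filter for model outputs could look like.
from collections import defaultdict

N = 8  # fingerprint size in words; a real system would tune this carefully

def ngrams(text: str, n: int = N):
    """Yield every run of n consecutive words, lowercased."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
    """Map each word n-gram to the set of works it appears in."""
    index = defaultdict(set)
    for work_id, text in corpus.items():
        for gram in ngrams(text):
            index[gram].add(work_id)
    return index

def flag_output(output: str, index: dict[str, set[str]]) -> dict[str, int]:
    """Count how many indexed n-grams from each work show up in an output."""
    hits = defaultdict(int)
    for gram in ngrams(output):
        for work_id in index.get(gram, ()):
            hits[work_id] += 1
    return dict(hits)

corpus = {"toy-novel": "call me ishmael some years ago never mind how long precisely"}
index = build_index(corpus)
print(flag_output("the model said call me ishmael some years ago never mind how long", index))
# -> {'toy-novel': 3}
```

A real system would hash the fingerprints, tolerate paraphrase and near-misses, and run at an entirely different scale, but the basic shape (index the protected text, scan the output) is the same.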
The four factors being: what’s the nature of the copying (is it transformative?); what’s the nature of the copyrighted work (is it stuff that’s meant to be copyrightable, like expressions of creativity, or not?); what’s the amount and substantiality of the work being used (and is it reasonable and proportional for the use in question?); and what’s the impact on the market for the original work?
With the exception of the question of whether the “leeching” was separately infringing