Brief update on AI model training stuff

because the news keeps newsing

Mar 21, 2025

Yesterday The Atlantic published a piece + tool talking about LibGen, a compendium of pirated copyrighted books, and one allegedly used by Facebook when training their Llama AI models. And I’ve seen many a tweet[1] on Bluesky and whatnot about how this obviously destroys any fair use claim Facebook has regarding training their models.

Caveat always, not a lawyer, didn’t even finish my Coursera class on IP law, but given that last week I wrote a thing about fair use and AI training, let’s look at that! For the sake of argument, take as fact that Facebook downloaded this collection, that this collection contains pirated copyrighted works, and that Facebook used those copyrighted works in its training.

The basic gist of last week’s piece was:

1. Is training “copying”? I lean no, but I think it’s plausible the answer could be yes.
2. If it’s “copying”, is the copying fair use?
   - Is it transformative? I think yes, though I could see a plausible argument no
   - Is the work factual or creative? Both! And is it published or unpublished? All published, one assumes
   - How much was taken? Either all of it or a mix of a lot of it or some of it, depending on whether you’re just looking at the model or the model + its outputs
   - What is the effect on the market? Presumably bad!
To me that adds up to plausible cases both for it being fair use and for it not being fair use under current doctrine, but I strongly believe it should be fair use, in part because I’m a weird copyleft radical, and in part because I think the better place to police infringement is at the model output level (“is it actually creating infringing Mickey Mouse pictures?” and if so that’s a crime in and of itself, rather than asking “was it trained on Mickey Mouse pictures?”).

Now add in the allegation that Facebook used pirated materials in the training set: does that change anything with regard to the training itself?

1. Is training on pirated materials now more clearly “copying” than training on other stuff downloaded from the internet? I don’t think so?
2. If it’s copying:
   - Does this impact how transformative (or not) model training is? Not really?
   - Does it change the mix of factual vs creative, or published vs unpublished? No, though I’m not sure if fair use analyses make any distinction between works published for sale, like a book, vs published for wide consumption, like an internet article; if so then maybe this is actually different.
   - Does it change how much of the work was taken? No
   - Does it change the effect on the market? No

So all in all I think this has either no impact or a slight impact on the fair use analysis around the act of training a model on copyrighted material.

That said, my understanding of the allegation is that Facebook employees torrented this LibGen dataset, meaning they actively participated in the downloading of all the pirated materials. That, in and of itself, is potentially infringing, in that it’s obviously “copying” the works in question. So I think there is an intellectually consistent read that says it’s fair use to train an AI model on copyrighted works, but still illegal to download those copyrighted works in the first place.

Alternatively, you could have a (I think less consistent) read that if it’s fair use to train the model, then the copying done specifically for the purpose of model training is also fair use, or a (I think more consistent, though I disagree with it) read that it’s infringing both to pirate the books and to train the model on them.

And there’s also the realpolitik that fair use is decided by courts made up of humans, there’s a ton of discretion granted to the court, and this is a bad look for Facebook! If this were to be determinative, it would be neither the first nor the last time a plausibly legally sound but icky fair use defense was rejected more for the ickiness than for the legality!

Again, copyleft weirdo, but in my ideal design of the system I think I would still say:

1. The act of pirating the books is illegal
2. The act of training the model is either not infringing, or fair use (I prefer not infringing, but I’ll settle for fair use)
3. The act of using the model to create infringing content is illegal

So yeah, TL;DR not a lawyer take, ¯\_(ツ)_/¯

[1] They’re always called tweets. That’s the style guide for dottxt: even on Bluesky or Threads or Mastodon, you write tweets.

dot txt