USCO report/draft on copyright status of generative AI

Bramblethorn

I apologise for spawning yet another AI thread, but this one seems significant. The US Copyright Office is in the process of publishing a report on copyright and AI tech. A draft version of the final part has just been released:

https://www.copyright.gov/ai/
https://www.copyright.gov/ai/Copyri...I-Training-Report-Pre-Publication-Version.pdf

It includes an analysis of the "fair use" status of generative AI, examining each of the four factors of fair use, e.g.:

...the Office rejects two common arguments about the transformative nature of AI training. As noted above, some argue that the use of copyrighted works to train AI models is inherently transformative because it is not for expressive purposes. We view this argument as mistaken. Language models are trained on examples that are hundreds of thousands of tokens in length, absorbing not just the meaning and parts of speech of words, but how they are selected and arranged at the sentence, paragraph, and document level—the essence of linguistic expression.

The conclusion (excerpted):

The Office expects that some uses of copyrighted works for generative AI training will qualify as fair use, and some will not. On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use. Many uses, however, will fall somewhere in between.

This is an opinion, not legal precedent, but until precedents are established it's likely the best available indicator of how they may lean (and may well influence them).

It doesn't really give a clear-cut answer to the question of whether using AI to generate stories here would be a copyright violation. But given that many LLMs depend on broad sources like Common Crawl, which inevitably include a large amount of pirated material, and given that AI-generated fiction obviously competes with human-written material, I think this suggests that authors shouldn't assume that AI-generated stories will end up in the "fair use" category.
 
As far as whether using AI here would be infringement, I think the key would be whether the end product was a recognizable derivative of the source material. If you fed source material into an AI generator and the result merely used broad themes and ideas from the source, then it probably wouldn't be infringement, in the same way a human-generated story wouldn't. I'm just guessing, of course.
 
President Trump just fired the head of the Copyright Office. Coincidence?
So he did - the day after this draft was published, if I have the dates right.

https://www.cbsnews.com/news/trump-fires-director-of-u-s-copyright-office-shira-perlmutter-sources/

As far as whether using AI here would be infringement, I think the key would be whether the end product was a recognizable derivative of the source material. If you fed source material into an AI generator and the result merely used broad themes and ideas from the source, then it probably wouldn't be infringement, in the same way a human-generated story wouldn't. I'm just guessing, of course.

The report discusses that question under the "Transformation" section. After presenting various arguments for and against transformative status, it offers its own analysis:

In the Office’s view, training a generative AI foundation model on a large and diverse dataset will often be transformative. The process converts a massive collection of training examples into a statistical model that can generate a wide range of outputs across a diverse array of new situations. It is hard to compare individual works in the training data—for example, copies of The Big Sleep in various languages—with a resulting language model capable of translating emails, correcting grammar, or answering natural language questions about 20th-century literature, without perceiving a transformation. The purpose of creating works of authorship is to disseminate them for human enjoyment and education. Many AI models, however, are meant to perform a variety of functions, some of which may be distinct from the purpose of the copyrighted works they are trained on. For example, a language model can be used to help learn a foreign language by chatting with users on diverse topics and offering corrective feedback.

But transformativeness is a matter of degree, and how transformative or justified a use is will depend on the functionality of the model and how it is deployed. On one end of the spectrum, training a model is most transformative when the purpose is to deploy it for research, or in a closed system that constrains it to a non-substitutive task. For example, training a language model on a large collection of data, including social media posts, articles, and books, for deployment in systems used for content moderation does not have the same educational purpose as those papers and books.

On the other end of the spectrum is training a model to generate outputs that are substantially similar to copyrighted works in the dataset. For example, a foundation image model might be further trained on images from a popular animated series and deployed to generate images of characters from that series. Unlike cases where copying computer programs to access their functional elements was necessary to create new, interoperable works, using images or sound recordings to train a model that generates similar expressive outputs does not merely remove a technical barrier to productive competition. In such cases, unless the original work itself is being targeted for comment or parody, it is hard to see the use as transformative.

Many uses fall somewhere in between. The use of a model may share the purpose and character of the underlying copyrighted works without producing substantially similar content. Where a model is trained on specific types of works in order to produce content that shares the purpose of appealing to a particular audience, that use is, at best, modestly transformative. Training an audio model on sound recordings for deployment in a system to generate new sound recordings aims to occupy the same space in the market for music and satisfy the same consumer desire for entertainment and enjoyment.

By my reading, they seem to be saying that even if it's not identifiably reproducing content from one specific source, doing something like training an AI on stories and then using it to write new stories would be "at best, modestly transformative".

Before the analysis, they also mention this argument:

Others rejected the claim that AI training uses only the ideas or facts embodied in a work. In the words of the Authors Guild, “AI companies seek out published books for [training] precisely because of their expressive content, as high-quality, professionally authored works are vital to enabling an LLM to produce outputs that mimic human language, story structure, character development, and themes.” AAP asserted that “Gen AI training . . . does not extract the ideas, facts, or concepts being conveyed by an author, it solely extracts the exact expressive choices made to convey those ideas—i.e., the words an author used, and the order in which they were placed.”

That part is reporting submissions made to them, so not necessarily endorsing those arguments. But it doesn't offer counter-arguments to them, and I think they are well grounded in fact - LLMs are very much a word-based technology rather than an idea-based technology.

(Of course, it's entirely possible that a new agency head will be found who is much more AI-friendly...)
 
Since LLMs have already learned to 'speak' and create 'art', all that they need to 'know' is contained in their weights. They don't have to copy any more, just as a human who has learned doesn't have to copy any more. The adjustment they need to make is in updating their knowledge of the world. An LLM does know which of the materials it has remembered are copyrighted, and what the rules of copyright are. That's why, e.g., Copilot will sometimes tell you that it can't access some work because it's subject to copyright. It knows the work exists, it knows its contents, it knows the rules of fair use, and it will ask if there is a particular passage you wish to discuss or criticise. Appropriate guardrails solve many of the copyright problems. Guardrails for factual knowledge about the world, i.e. newspapers, are likely to differ in significant ways from those for creative works.
 
I've read thousands of books over my lifetime. They have all influenced my writing. In fact, I have trained myself to write based on the knowledge and skills I obtained through reading the works of others.

My written word is then the derivative of these thousands of books most of which remain under copyright.

As long as I am not plagiarizing anyone else's work, am I guilty of copyright infringement because I learned from the copyrighted works of Albert Camus, Harper Lee, Aldous Huxley, F. Scott Fitzgerald, George Orwell, Vladimir Nabokov, Aleksandr Solzhenitsyn, JD Salinger, Kurt Vonnegut, Franz Kafka, Tom Clancy, Anthony Burgess, Isaac Asimov, Truman Capote, Carl Sagan, Doug Adams, Ray Bradbury, etc.?

Why then would an LLM be guilty of the same, so long as plagiarism doesn't occur?
 
I've read thousands of books over my lifetime. They have all influenced my writing. In fact, I have trained myself to write based on the knowledge and skills I obtained through reading the works of others.

My written word is then the derivative of these thousands of books most of which remain under copyright.

As long as I am not plagiarizing anyone else's work, am I guilty of copyright infringement because I learned from the copyrighted works of Albert Camus, Harper Lee, Aldous Huxley, F. Scott Fitzgerald, George Orwell, Vladimir Nabokov, Aleksandr Solzhenitsyn, JD Salinger, Kurt Vonnegut, Franz Kafka, Tom Clancy, Anthony Burgess, Isaac Asimov, Truman Capote, Carl Sagan, Doug Adams, Ray Bradbury, etc.?

Why then would an LLM be guilty of the same, so long as plagiarism doesn't occur?
You don't memorize everything you have read and then just pseudo-randomly intermix pieces from them. LLMs have no sense of story or character. This is not just a screed that they are not human. They really are just plagiarizing machines at this point: what Mitchell and Gebru correctly called stochastic parrots. I do not rule out that an AI may some day be a truly creative writer. But it won't be one built just on LLMs. This is a dead-end technology being over-hyped because of the many billions that have been invested.
 
By my reading, they seem to be saying that even if it's not identifiably reproducing content from one specific source, doing something like training an AI on stories and then using it to write new stories would be "at best, modestly transformative".

I finally read it, and I read it the same way. I wonder how workable that would be, however.
 
Here's an example.

Suppose I'm an American university professor of English. I teach a class on Online Erotic Literature. The vast majority of this work, being fairly recent, would be "copyrightable subject matter," even if most of it was not actually copyrighted. Generative AI tools could be extremely useful to me to collect information about erotic stories, analyze the data, and generate information that could be very useful for criticizing, commenting on, or teaching the subject of erotic stories. This would seem to me to be a legitimate fair use. Although "reproduction" would occur, making it presumptively infringement, it strikes me as transformative in the way the cited Report discusses it. It has a legitimate purpose that does not compete with the author's intended use and therefore poses no risk of diminishing the incentive created by giving him or her exclusive rights in the works.

But suppose, using Gen AI tools in almost exactly the same way to start with, I go a step further and use the data and the analysis to generate "my own" erotic stories (we'll skip over the legitimacy of the "my own" claim for now). Let's suppose further that the stories generated do not, upon close scrutiny, appear to infringe the content of any particular stories in the database. Is THAT a fair use/transformative use? Intuitively, it seems much less likely that it would be, but how as a practical matter do we treat these two cases differently? At what point is it no longer fair use, and is the entire process "contaminated" because the ultimate purpose makes the fair use call questionable?

My answer: I don't know.
 
Here's an example.

Suppose I'm an American university professor of English. I teach a class on Online Erotic Literature. The vast majority of this work, being fairly recent, would be "copyrightable subject matter," even if most of it was not actually copyrighted. Generative AI tools could be extremely useful to me to collect information about erotic stories, analyze the data, and generate information that could be very useful for criticizing, commenting on, or teaching the subject of erotic stories. This would seem to me to be a legitimate fair use. Although "reproduction" would occur, making it presumptively infringement, it strikes me as transformative in the way the cited Report discusses it. It has a legitimate purpose that does not compete with the author's intended use and therefore poses no risk of diminishing the incentive created by giving him or her exclusive rights in the works.

But suppose, using Gen AI tools in almost exactly the same way to start with, I go a step further and use the data and the analysis to generate "my own" erotic stories (we'll skip over the legitimacy of the "my own" claim for now). Let's suppose further that the stories generated do not, upon close scrutiny, appear to infringe the content of any particular stories in the database. Is THAT a fair use/transformative use? Intuitively, it seems much less likely that it would be, but how as a practical matter do we treat these two cases differently? At what point is it no longer fair use, and is the entire process "contaminated" because the ultimate purpose makes the fair use call questionable?

My answer: I don't know.
I think there's a very clear cut divide. Visually, it's right there in the split between your first paragraph and the second. That last phrase is you just hedging your bets.

I'm surprised that you think there's a sliding scale on this issue. "Fair use" is a construct for academia, surely, not fiction?

"Transformative" can only apply if there's a new work constructed. If all AI is doing is regurgitating content from the stuff it was taught on, there is no new work.

It's black and white to me. AI isn't thinking, it's regurgitating.

Is vomit food? Unless you're a dog, no it's not.

And with that cheerful image, a dog eating its own vomit,

Carry on :).
 
As far as whether using AI here would be infringement, I think the key would be whether the end product was a recognizable derivative of the source material. If you fed source material into an AI generator and the result merely used broad themes and ideas from the source, then it probably wouldn't be infringement, in the same way a human-generated story wouldn't. I'm just guessing, of course.

My analysis tracks with Simon's. I think this is less a concern for written sources as it is for artistic ones, though. For example, the last paragraph quoted talks about "copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available," and that sounds to me like cracking down on people using generative AI to create their own versions of licensed characters - either from anime, comics or something similar.

That's not as big a deal in writing - realistically, very few people are licensing authorized versions of their characters for people to use to write commercial stories (although I'm sure somebody will tell me this happens all the time and I'm just unaware of it).

In the end, USCO seems to be doing their best to thread their way through a topic that's filled with minefields and for which statute is lagging woefully behind the technology.
 
I think there's a very clear cut divide. Visually, it's right there in the split between your first paragraph and the second. That last phrase is you just hedging your bets.

Simon's a lawyer. Our favorite thing to do is hedge our bets. That's why 99% of the time when somebody asks me a legal question, my answer is "it depends."

It has taken decades to deprogram my writing from all the bad habits I learned.
 
I'm surprised that you think there's a sliding scale on this issue. "Fair use" is a construct for academia, surely, not fiction?

"Transformative" can only apply if there's a new work constructed. If all AI is doing is regurgitating content from the stuff it was taught on, there is no new work.

This isn't true under US Copyright law. "Fair use" isn't just academic use. The Warhol and 2LiveCrew cases both involve claims of fair use by artists/musicians who incorporated copyrighted works of others in their own creative works. 2LiveCrew convinced the US Supreme Court that the use of portions of Roy Orbison's song "Pretty Woman" in their own song (a form of sampling) was a transformative use/fair use. In the Warhol case, however, the Supreme Court ruled that Andy Warhol's incorporation of a copyrighted image of the artist Prince was NOT a fair use.

Transformative use is a specific type of fair use. It's a subset of it, akin to, or analogous to, parody, which is considered a permissible fair use under section 107, the fair use statute.

The Report cited by Bramblethorn goes into some length to describe why there are multiple layers of possible infringement. A prima facie case of infringement is made by showing that someone engages in the use of another's work that is one of the exclusive rights covered under copyright law. That includes the right of reproduction. If I make a photocopy of your painting, I haven't created anything "new" in the copyright sense, but I have created a reproduction, and that's presumptively infringement. If I direct an AI tool at the gathering of online copyrighted materials, then the gathering probably results in something that would constitute "reproduction" of the copyrighted text, somewhat analogous to the way web crawlers gather information or the way Google Books makes reproductions of copyrighted works to create searchable databases of them. Since this process would entail acts that presumptively would be "infringement," the question then arises whether "fair use" is an appropriate defense to the infringement claim.

The second potential layer of infringement is the actual report or work generated by the AI. That could obviously be infringement if it clearly reproduced the copyrightable expression of an original work, e.g., reproducing exact words of the source text or the names and personality traits of characters in the source text. It's much less obviously infringement if the generated work merely takes the non-copyrightable ideas from the source text, something that would not be infringement if I did it myself, personally, without AI help. But according to some of the commentators in this report, such works perhaps should be considered infringing, and not fair use, where they compete in the same space as the original source works.

A third potential layer of infringement could be where I take the AI-generated work and create a work of my own, but by using the generated work I have in one sense or another "used" and in the process "reproduced" portions of the original source material.

I don't think it's clear at all. I think it's very complicated and the answer is unclear. There will be enormous pressure to enable the use of generative AI tools because of their enormous power and the obvious public benefits of being able to share information and ideas more efficiently. But there will be enormous push back by vested copyright owners to prevent this process from intruding upon their exclusive rights. It's not at all clear where the lines will be drawn.
 
Simon's a lawyer. Our favorite thing to do is hedge our bets. That's why 99% of the time when somebody asks me a legal question, my answer is "it depends."

It has taken decades to deprogram my writing from all the bad habits I learned.

Yes. The devil is always in the details and in the particulars.

Law is not engineering. Bridges do not fall down when lawyers disagree on fundamental principles. I know from experience that you can put 100 lawyers experienced in a field in a room and ask them a tough question and they'll split down the middle on the answer. It's just the way it is.
 
There's another fascinating dimension to this, which is using AI tools to create a work that apes an artist's "likeness." It's not exactly a copyright issue. The first of the three Copyright Office reports discusses this issue -- digital replication.

For instance, two years ago there was the case of the "Fake Drake" song, which was made to sound like a Drake/Weeknd song and became enormously popular until the record label got it removed from the streaming services.

What if I were to use AI tools to analyze Bramblethorn's stories and create a "fake" Bramblethorn story that didn't actually contain any clearly copyrighted material but was so like a Bramblethorn story that people thought the author was Bramblethorn? That's an interesting case. Unless it's a parody, I think most of us would think it's sleazy to do that, and some people would object to a parody. But it's less clear what the law is.
 
Here's an example.

Suppose I'm an American university professor of English. I teach a class on Online Erotic Literature. The vast majority of this work, being fairly recent, would be "copyrightable subject matter," even if most of it was not actually copyrighted. Generative AI tools could be extremely useful to me to collect information about erotic stories, analyze the data, and generate information that could be very useful for criticizing, commenting on, or teaching the subject of erotic stories. This would seem to me to be a legitimate fair use. Although "reproduction" would occur, making it presumptively infringement, it strikes me as transformative in the way the cited Report discusses it. It has a legitimate purpose that does not compete with the author's intended use and therefore poses no risk of diminishing the incentive created by giving him or her exclusive rights in the works.

But suppose, using Gen AI tools in almost exactly the same way to start with, I go a step further and use the data and the analysis to generate "my own" erotic stories (we'll skip over the legitimacy of the "my own" claim for now). Let's suppose further that the stories generated do not, upon close scrutiny, appear to infringe the content of any particular stories in the database. Is THAT a fair use/transformative use? Intuitively, it seems much less likely that it would be, but how as a practical matter do we treat these two cases differently? At what point is it no longer fair use, and is the entire process "contaminated" because the ultimate purpose makes the fair use call questionable?

My answer: I don't know.
Suppose you taught a standard English Literature course based on the Canon at a mid-west university, with a modern literature module (books in copyright); fair use?

Suppose you used the knowledge you acquired as a student and in academia to write a work of modern fiction; don't know? Really?

Take it from me, using your own knowledge and learning would amount to fair use.
 
Yes. The devil is always in the details and in the particulars.

Law is not engineering. Bridges do not fall down when lawyers disagree on fundamental principles. I know from experience that you can put 100 lawyers experienced in a field in a room and ask them a tough question and they'll split down the middle on the answer. It's just the way it is.

I must respectfully disagree. Having worked in a law office for a few years now, I can tell you that if you put 100 lawyers in a room and ask a tough question, you'll get 120 answers (possibly more), and most of them will be couched in weasel words that x, y, and z are also possible depending on...

And that assumes you are paying them; if you aren't...
 
If you ask two lawyers their opinion and they agree, neither gets paid; if they disagree, both get paid.
 
There's another fascinating dimension to this, which is using AI tools to create a work that apes an artist's "likeness." It's not exactly a copyright issue. The first of the three Copyright Office reports discusses this issue -- digital replication.

For instance, two years ago there was the case of the "Fake Drake" song, which was made to sound like a Drake/Weeknd song and became enormously popular until the record label got it removed from the streaming services.

What if I were to use AI tools to analyze Bramblethorn's stories and create a "fake" Bramblethorn story that didn't actually contain any clearly copyrighted material but was so like a Bramblethorn story that people thought the author was Bramblethorn? That's an interesting case. Unless it's a parody, I think most of us would think it's sleazy to do that, and some people would object to a parody. But it's less clear what the law is.
The art world provides the answer to your thought experiment. You can paint, stroke for stroke, your own 'van Gogh', in the style of the man, but it's only if you sign 'Vincent' and try to pass it off as genuine that you commit a crime. If you make it clear that it is your own piece of work, you are golden.
 
My analysis tracks with Simon's. I think this is less a concern for written sources as it is for artistic ones, though. For example, the last paragraph quoted talks about "copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available," and that sounds to me like cracking down on people using generative AI to create their own versions of licensed characters - either from anime, comics or something similar.

Licensing gets discussed quite a bit in the report, and it's not restricted to characters. Most of the discussion is about music and images, but it does also refer to text, e.g.:
Recent public reporting reflects AI licensing for images and audio-visual works, academic and nonfiction publishing, and news publishing, as well as various content aggregators offering or facilitating collective licensing of training materials.
One of the areas where this has come up previously is with LLMs ingesting stories from news outlets, e.g. AP and major newspapers, and then generating text very similar to the stories they're trained on.

The mention of "piracy" seems intended to include text. Elsewhere they write:

Some developers have also turned to well-known pirate sources, such as shadow libraries with large collections of full, published books

I didn't see much about text fiction in the discussion, although there are a couple of mentions of submissions from the Authors Guild who would presumably have an interest there.
 
I've always been a little unclear as to why the legal arguments for "human-produced" material and "human-using-AI" material should be any different. What tool was used, if any, is surely unimportant; it's the material produced, and how it's used, that should be at issue. The fact that certain things are now easier to do with AI seems to me just a fact of life. AI doesn't fool people; people fool people.
 
"The Office expects that some uses of copyrighted works for generative AI training will qualify as fair use, and some will not. On one end of the spectrum, uses for purposes of noncommercial research or analysis that do not enable portions of the works to be reproduced in the outputs are likely to be fair. On the other end, the copying of expressive works from pirate sources in order to generate unrestricted content that competes in the marketplace, when licensing is reasonably available, is unlikely to qualify as fair use. Many uses, however, will fall somewhere in between."

From this excerpt, I see the regulations being considered as focusing on the capture of copyrighted works for AI training. They don't speak directly to the use of generative AI for the creation of works, other than the source data possibly being an infringement.
 
I don't understand the possible connection between the report and the firing, if any. I'm not from the US, so probably missing something
You're not reading closely enough. As an outside observer (Australian), it's pretty easy to see that with the current US administration there is no such thing as coincidence. That's all I'll say, so this thread doesn't get shifted to the Political Board.

The issue on the table is of importance to any creative writer (as distinct from any academic writer), so I think the AH is the right place for this discussion.

Noting @SimonDoom's comments above, here in Oz the fair use application is probably less open to interpretation - which influences my observation up above. Different country, different law.
 