Interesting AI story in the Telegraph

Funny, that's not what the New York Times have said in their lawsuit - there was a post on that not long ago, where the plagiarism was pretty obvious. Blatant, in fact. Perhaps you missed it?
The lawsuit, filed in Federal District Court in Manhattan, contends that millions of articles published by The Times were used to train automated chatbots that now compete with the news outlet as a source of reliable information.

The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.”
I would point out that it is not about "copying", no matter what the article says. It is about allowing users to access the knowledge inside those articles without visiting the NYT website, thus denying the NYT the revenue that would come from it. The system itself would not copy the article in a way that would warrant a copyright claim. It would be more akin to a very, extremely, incredibly well-read person being able to summarize for you what the NYT wrote on certain topics. Since the NYT is reporting on events around us, they rarely "own" the rights to the content of their stories, as what they are reporting on is usually public knowledge.

I believe this is more of a SLAPP lawsuit, trying to silence a competitor, or a blatant attempt at grabbing money from a cash cow. The US legal system is weird in this sense. It will be interesting to see where these X-versus-AI lawsuits go, but I believe most of them are simply about testing the legal waters, while some are desperate attempts to stay relevant in a changing landscape.

Below is a small excerpt I tried. I'm not sure if this was always the case or just something they explicitly added in response to the lawsuit, but I would assume the former, as that makes more sense in terms of how LLMs work.

Me:
What is the date of the last articles you have processed from the New York Times

AI:
As of the last update before my training data was cut off in April 2023, I had access to information up to that point, but I don't have real-time access or specific "last processed" dates for sources like The New York Times. For the most current news and articles, I would recommend visiting The New York Times website directly or accessing a reliable news aggregation service.

Me:
quote me an article from January, 2023 as published in the New York Times

AI:
I don't have the ability to access or quote specific articles from The New York Times or any other publications directly. However, for articles published in January 2023 or any other date, you can visit The New York Times' official website or consult a library database that archives newspaper articles. Many libraries and educational institutions offer access to databases like ProQuest or LexisNexis where you can find past articles from a wide range of publications, including The New York Times.

Me:
Please provide a short summary of what the New York Times wrote about Joe Biden in the US elections in 2020.

AI:
During the 2020 U.S. elections, The New York Times covered Joe Biden extensively, focusing on his campaign, policies, debates, and eventual victory. Here's a short summary of the key points in their coverage:

Campaign Focus: Biden's campaign was portrayed as a direct contrast to the incumbent, President Donald Trump. His message centered on unity, healing the nation, and restoring the "soul of America." Biden emphasized his experience, his plans to handle the COVID-19 pandemic more effectively, and his commitment to addressing issues like climate change and racial injustice.

Primary Elections: Coverage included Biden's initial struggles in the early primaries, followed by a remarkable comeback starting in South Carolina, largely attributed to his support among African American voters. This momentum carried him through Super Tuesday and beyond, eventually leading to his nomination.

Pandemic Impact: The NYT reported on how the COVID-19 pandemic transformed traditional campaigning. Biden's campaign adapted by moving many events online and emphasizing the importance of health and safety, in contrast to Trump's in-person rallies.

Debates: The presidential and vice-presidential debates were significant events. Biden's performances were analyzed in the context of the aggressive debating style of Trump, with discussions on policy taking a back seat to broader concerns about the tone and conduct of the debates.

Election Results: The NYT covered the prolonged vote counting, owing to the large number of mail-in ballots. Biden's victory was characterized by a reclaiming of key states in the "Blue Wall" and a substantial popular vote margin. His win was depicted as a repudiation of Trump's style of governance, with a call for a more empathetic and unifying leadership approach.

Challenges and Transition: Post-election coverage included Trump's refusal to concede, false claims of election fraud, and the challenges Biden faced in initiating a smooth transition. Despite these hurdles, Biden began announcing his picks for key cabinet positions and outlining his plans for the first days in office.

The New York Times provided comprehensive coverage of the 2020 elections, highlighting Biden's strategies, the dynamics of his campaign in the unique context of the pandemic, and the broader political and social implications of his victory.
As can be seen, you can use AI to sum up (with questionable reliability) the coverage of certain topics in a news outlet, maybe even get a summary of a particular article, if you know its date and author (I don't normally read NYT), but that's about it. It is a retelling of the article's contents, in the AI's own words. No copyright infringement there.
 
I believe it's inevitable that there will be novels generated completely by AI, just as most action movies today are about half CGI. It's not that difficult for a human to tell the difference. The CGI is too "perfect".
How do you know? Do you study computer graphics? I've seen graphics up close where I was unable to tell if it was a real human on the screen or a well-animated and rendered 3D model. Please don't give me the "we can tell" line. You can't.

You can tell if it's obvious, but that isn't, and never was, the issue. The issue is that there are technologies and approaches out there where you simply could not tell. I urge you to actually dig a little into the world of CGI; you will be shocked to realize how much of a scene you might have taken for "real" was actually just computer rendered.

Same goes for AI-written text. We might be able to "suspect" a text was made either by an AI or by someone who writes like one (again, you can't really tell the difference, as there is none), but once the generated text gets good enough, it becomes, by definition, impossible to differentiate them. Why? Because if both a human and an AI could reasonably write the same sequence of words in response to the same prompt, how do you decide, solely based on the identical sequence of words, which one was written by which?

I fail to comprehend how this concept is so hard to grasp for some. In order to spot the difference, there has to be a difference. The sad fact is, there already is none. AI can write better than most mediocre writers. It cannot have real original thoughts, but guess what, that's also true for a big majority of humanity. So how, pray tell, are we going to tell them apart?
 
I would point out that it is not about "copying", no matter what the article says. It is about allowing users to access the knowledge inside those articles without visiting the NYT website, thus denying the NYT the revenue that would come from it. The system itself would not copy the article in a way that would warrant a copyright claim. It would be more akin to a very, extremely, incredibly well-read person being able to summarize for you what the NYT wrote on certain topics. Since the NYT is reporting on events around us, they rarely "own" the rights to the content of their stories, as what they are reporting on is usually public knowledge.

I believe this is more of a SLAPP lawsuit, trying to silence a competitor, or a blatant attempt at grabbing money from a cash cow. The US legal system is weird in this sense. It will be interesting to see where these X-versus-AI lawsuits go, but I believe most of them are simply about testing the legal waters, while some are desperate attempts to stay relevant in a changing landscape.

Below is a small excerpt I tried. I'm not sure if this was always the case or just something they explicitly added in response to the lawsuit, but I would assume the former, as that makes more sense in terms of how LLMs work.


As can be seen, you can use AI to sum up (with questionable reliability) the coverage of certain topics in a news outlet, maybe even get a summary of a particular article, if you know its date and author (I don't normally read NYT), but that's about it. It is a retelling of the article's contents, in the AI's own words. No copyright infringement there.
The NYT asks for a jury trial so they may be fishing for a home town jury verdict, as Apple got with icons (remember Gem). Either way, expect nothing to be determined 'til it reaches SCOTUS several years from now.

An amusing detail is that NYT claims that the voice of the NYT is the voice of ChatGPT and that it is indirectly trained on the NYT Stylebook in fine-tuning. That's not an allegation any self-respecting journal would make.

PS: The significance of 'several years from now' is that AI will have evolved. It should be possible to train out the matters of which the NYT complains. The NYT has no copyright on news, history or information generally, and its economic model depends on its internet content being indexed by crawlers, to which it consents.
 
That question depends on a fallacy about both human and computer thinking.
Does it? I believe that question depends on pure logic and nothing else. You cannot differentiate two identical sequences of words, so as long as both the AI and the human are capable of (and likely to) produce the same sequence for whatever reasons, you cannot really differentiate between them.

It really has nothing to do with how AI thinks on a fundamental level. It only has to do with how AI works and what it tries to emulate. Us. See my previous reply on the "but we can tell" fallacy. #82

If you're familiar with the ML/AI landscape, you'll know that there's a fair bit of interest in "interpretable ML" - that is, being able to articulate why a ML/AI model makes the choices it does in terms a human can understand.
...
I know my partner's face well enough to pick her out of billions of people. I couldn't describe her to you well enough to allow you to reliably pick her out of ten thousand. That doesn't mean I'm lying about being able to recognise her, it just means that some knowledge is not easily condensed.
Interesting read, but beside the point we are discussing. You are only talking about your own limitation in expressing some knowledge your brain has acquired. Our brains are tailored to recognize shapes, and especially faces that are familiar to us. Our vocabulary is not. An interesting observation, but it has nothing to do with what you are trying to prove. All you've proven is that words are ill-suited to accurately convey certain types of information.

Even for rules that are straightforward enough to be expressed simply, many people absorb and apply these unconsciously without ever noticing that they've just learned a rule. This happens often in language; most native speakers have a far better instinctive understanding of the written and unwritten rules of English than they could easily articulate.

For instance, take the following passage:

...
Now you are talking about a living and evolving language and its rules. I didn't live back in Tolkien's days, but I know for a fact that just 100 years ago, language was used quite differently. Go back 200 or 300 years and it could start to feel like a foreign language, or at least a very different dialect from what you are used to. This is just how languages evolve over time. Don't show me text written a hundred years ago and tell me how it has a unique structure. Show me contemporary pieces that break the same rules we take for granted today.

So, no, I reject the idea that inability to succinctly list all the "tells" of AI-generated text disproves the possibility that humans might be able to spot some things that software can't.
Sooo... you cannot tell me what makes something feel like it's been written by AI, but you can tell if it's written by AI and not by someone who is not a native speaker of the language, or someone who is too meticulous when it comes to grammar and proper sentence structure.

Don't get me wrong, but what you are saying is basically the equivalent of "it is, because I said so". That doesn't fly in my world. You either have tangible, quantifiable metrics, or you have feelings. The latter are subjective, and as such, when it comes to passing judgement, I would prefer to ignore them.

I propose that such questions can't always be dealt with in terms of simply-expressed "markers or indicators".
That is where you are wrong. If you cannot put well-defined markers or indicators on something, that just means that you cannot really tell. At the very least I would expect some indicators that make it likely that a work is AI, but then, unless those indicators are a 100% sure tell, I would also expect the acknowledgement that we cannot be sure; we can at best suspect, probably even with a high degree of certainty, but only suspect.
 
As can be seen, you can use AI to sum up (with questionable reliability) the coverage of certain topics in a news outlet, maybe even get a summary of a particular article, if you know its date and author (I don't normally read NYT), but that's about it. It is a retelling of the article's contents, in the AI's own words. No copyright infringement there.
You've obviously not read the detail either. There's a side-by-side comparison of prompt-derived text and the actual New York Times articles. Roughly 90% of the prompted text is word-for-word identical with the original content. I don't know what your definition of plagiarism is, but that sure meets any definition I've ever seen. The prompted AI output is not a summary, it's not a gloss, it's not a "retelling", it's a direct lift of the exact same words.

You AI apologists are so keen to defend the tool, you can't even see the stolen words when they're right in front of you. Oh, sorry, you couldn't be bothered reading the article, my mistake.

"Retelling" though, that's a novel euphemism. Well done. It's still bullshit, though. :)
 
You've obviously not read the detail either. There's a side-by-side comparison of prompt-derived text and the actual New York Times articles. Roughly 90% of the prompted text is word-for-word identical with the original content. I don't know what your definition of plagiarism is, but that sure meets any definition I've ever seen. The prompted AI output is not a summary, it's not a gloss, it's not a "retelling", it's a direct lift of the exact same words.

You AI apologists are so keen to defend the tool, you can't even see the stolen words when they're right in front of you. Oh, sorry, you couldn't be bothered reading the article, my mistake.

"Retelling" though, that's a novel euphemism. Well done. It's still bullshit, though. :)
No, I never bothered to open the legal document linked in the article, as:
a) I am not a lawyer
b) I do not trust lawyers
c) I am not qualified to tell, if the information provided in that document is even legit
d) I don't normally bother reading 60+ pages of stuff I don't care about.

In response to your post, I did read into that, and I see what you mean. I can only refer to points b and c of what I wrote above. If that is true, it might be grounds for a copyright claim. I would add, though, that Google, Bing and virtually all search engines also index (read: store, archive, analyze) copyrighted content and provide you with results based on that, along with excerpts from said results. I know the news outlets have won a court case against the search engines in the past on similar grounds, which is probably why they are trying again.

For all I know, they might win again, but let's make one thing clear. This is not about copyright; that's just tacked on for show and is questionable at best, given the prompts used.

Like, who is infringing on copyright here? If you are not asking for the very specific exact words written in the article, you will, with a very high degree of likelihood, not get them. I still maintain that the AI is but a tool, and what it is used for is the user's responsibility. Unfortunately, as has been the case so many times in the past, while it is impossible to go after individual bad actors, it is very well possible to go after rich institutions in hopes of a big payday.

Time will tell what the outcome of the court cases will be. Everything else is pure speculation.

edit: for the sake of completeness, I did try one of the prompts out for myself and it does in fact return the sequence as stated in the legal document, but then the prompt was:

Are you familiar with Pete Wells famously scathing review of Guy Fieri's restaurant from 2012?
What was the first paragraph of that review?

It did in fact return the first paragraph, like the good database it is. I asked it for a piece of specific information and it provided me with that information. I guess, in the strict sense of the word, given how those exact words were published in the NYT before, it committed copyright infringement. To me, however, that only shows how utterly broken and outdated copyright concepts are.

As I wrote above, we'll see if justice prevails and I guess it is up to individual interpretation what we consider justice here.
 
You've obviously not read the detail either. There's a side-by-side comparison of prompt-derived text and the actual New York Times articles. Roughly 90% of the prompted text is word-for-word identical with the original content. I don't know what your definition of plagiarism is, but that sure meets any definition I've ever seen. The prompted AI output is not a summary, it's not a gloss, it's not a "retelling", it's a direct lift of the exact same words.

You AI apologists are so keen to defend the tool, you can't even see the stolen words when they're right in front of you. Oh, sorry, you couldn't be bothered reading the article, my mistake.

"Retelling" though, that's a novel euphemism. Well done. It's still bullshit, though. :)
If you write down the lyrics you copy, how different are they from those of the creative artists who've copyrighted them?

When do you breach copyright?
- When you memorise the words?
- When you sing them in your head?
- When you sing them in the shower?
- When people bang on the door and ask you to stop?
- When you include them in a story you're writing?
- When a person asks you to recite the lyrics for him?
- When two people have asked you to recite the lyrics for them?
- When 100 people have asked you to recite the lyrics for them?
- When you publish a story in which you've written them down?
- When someone you've told the lyrics writes them down and publishes them?
- In the latter case, is it you or the other person who breaches copyright?

If ChatGPT has a better memory than you, does its accurate recollection and recounting of stories it's memorised in answer to an individual's question (prompt) become a breach of copyright? That's NYT's case.
 
Providing a way to distribute information originally posted by another source, while allowing people to avoid going to that source, thus depriving that source of visitors and potential income is called ... piracy.
 
Providing a way to distribute information originally posted by another source, while allowing people to avoid going to that source, thus depriving that source of visitors and potential income is called ... piracy.
Really. Don't tell the NYT, they'll try and shut down the WaPo, Reuters etc.

And let's not forget Forbes, they're a little ahead of the Times:
Adelaide by Forbes.
 
Right now, medicine is one of the last places I'd want to apply it to. You appear to have forgotten about AI hallucinations, where it quite happily goes and makes shit up. Not what you want in a diagnosis.

Same with the law - didn't you read the case of the lawyer who used AI to "research" case history? Again, the tool he used made up cases that never existed. Luckily, the judge checked, but the lawyer didn't.

Any search tool that presents material that may or may not be true isn't much use without fact checking everything. Where's the labour saving/benefit in that?
I do agree with you. My statement about applications of AI was assuming that no responsible person would take a "decision" by AI as absolutely correct. I don't view human verification of any output of AI any differently than if I have a junior engineer massage a thousand data points into a statistical analysis of that data and then present that analysis to me for approval in graphical or written form. That verification will always be required unless the user is a complete fool.

The labor savings is the grunt work of the statistics I don't have to do. Instead, I can use my knowledge to determine if the analysis makes sense based upon my experience. In medicine, instead of convening a panel of doctors to discuss a particularly difficult diagnosis, AI would present one doctor with several possible causes for the symptoms observed. Some would be ridiculous and some would be possible. AI would just narrow the field and reduce the work of several other people. In the legal field, the same would be true. AI could list similar cases and present them to the lawyer for review. That would enable the lawyer to handle more cases.

Having read all the replies to my statement about the potential uses of AI, one thing stands out very clearly. Both those who defend AI and those who fear it as undetectable seem to project their own hopes or fears, as the case may be, onto what AI will become, without recognizing that AI is just software: software written by humans and subject to all the mistakes humans continue to make. I've written enough software over the past 30 years or so to know that there are always unintended actions or conclusions generated by any piece of software. The more complex the software, the more likely those errors will occur. The multitude of "patches" issued for Windows over the years is proof of that. Computers still continue to "lock up" occasionally. That's not caused intentionally. It's the result of the software misbehaving in some way.

Because of complexity, software today tends to be the melding of several smaller pieces of software into one, in the same way I can write a subroutine that takes data from another subroutine, does some calculations, and sends the result to yet another subroutine. There will always be some interactions between those pieces of software that are unpredictable, and eventually, they will create errors. Those errors will be discovered by humans reviewing the results and deciding if those results are valid, just as the judge checked the lawyer's case history. The biggest mistake any programmer can make is to be 100% confident that his or her software is perfect, because that usually isn't the case. The biggest mistake any user can make is believing that software is infallible. At some point, some set of data will send the software off on a path the programmer didn't see as possible. Humans will have to continue to use their judgement and knowledge to verify the results of AI.

My biggest fear of AI is that it will become very adept at doing things currently done by humans and those humans it replaces will discover there is no place to go to earn a living. It happens with any technology advance. Accounting departments used to employ lots of people sitting at adding machines and writing entries into ledgers. Now, most businesses have a computer that takes information read by barcode readers, does all the calculations and prints the balance sheets, and a couple accountants that check the results.

Contrary to what some people believe, there are a lot of people in this world who don't have the ability to write software, and some don't have the ability to do much besides repeat the same simple task over and over. I worry about what will become of them.
 
And how do you propose the site vet what would be a deluge of junk for all the content compliance requirements? At this point in time, I reckon it's better to keep saying no.
See my comment #64 above, this thread.
I don't see how this addresses my suggestion that content would have to pass a plagiarism test. I suspect we already have the capability of running a piece of writing through some software to find out if it contains plagiarism. If we don't, we will. Sounds like a one step process. Just as quick as running it through some software (or the human brain) to try to identify something as written by AI. And a lot easier to make it valid.
 
I don't see how this addresses my suggestion that content would have to pass a plagiarism test. I suspect we already have the capability of running a piece of writing through some software to find out if it contains plagiarism. If we don't, we will. Sounds like a one step process. Just as quick as running it through some software (or the human brain) to try to identify something as written by AI. And a lot easier to make it valid.
Do you even realize the magnitude of what you are talking about? Plagiarism testing software compares writings to maybe a few thousand, at most a few hundred thousand, scholarly articles published in well-established publications.

I urge you to consider the impossible task of cataloging and indexing even just the majority (not all) of English literature out there from, say, just the past 100 years. Good luck. And we have not even touched other languages or fringe areas, like smut.

It's the kind of idea that looks great in a campaign speech, then dies the moment it lands on an engineer's desk and is actually forced to meet reality for the first time.

I guess it would be possible to create a narrow, genre-specific database from the most popular sources, but that wouldn't prevent you from stealing from Tolkien in your work, or from any other famous author an AI might "copy" by accident.
 
Do you even realize the magnitude of what you are talking about? Plagiarism testing software compares writings to maybe a few thousand, at most a few hundred thousand, scholarly articles published in well-established publications.

I urge you to consider the impossible task of cataloging and indexing even just the majority (not all) of English literature out there from, say, just the past 100 years. Good luck. And we have not even touched other languages or fringe areas, like smut.
I'm guessing the day will come. I think they've already digitized the contents of Harvard's Widener Library. Or at least such a project is underway. Anyway, one would think that such plagiarism detecting software could be given access to the same data that the AI was using.
I guess it would be possible to create a narrow, genre-specific database from the most popular sources, but that wouldn't prevent you from stealing from Tolkien in your work, or from any other famous author an AI might "copy" by accident.
I think using such a database is a fine idea. As for catching people or AI plagiarising Tolkien, I suggest that we keep on doing whatever we're doing.
 
I'm guessing the day will come. I think they've already digitized the contents of Harvard's Widener Library. Or at least such a project is underway. Anyway, one would think that such plagiarism detecting software could be given access to the same data that the AI was using.

I think using such a database is a fine idea. As for catching people or AI plagiarising Tolkien, I suggest that we keep on doing whatever we're doing.
Eventually, in a few years, maybe a few decades, sure. Right now, building a database that could reliably compare a decent-sized, say 40-50k-word, text to the entire written body of English-language smut would require insane amounts of resources. The data alone we are talking about is measured in the tens if not hundreds of gigabytes, likely billions of words.

Having that data digitized is of no use if you cannot do the comparison. To give you an idea of the scale, GPT-4 was trained for 6 months on some 24 thousand GPUs. Just off the top of my head, I'd say it would not require this much hardware to do a thorough comparison, but it would still likely require immense resources. I would deem that financially unfeasible for at least the next couple of years.
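Just to put rough numbers on it, here is a back-of-envelope sketch in Python. Every figure is a purely illustrative assumption based on the ballpark above, not a measurement, and a real system would use an index rather than a naive scan, but it gives a feel for the raw volume involved:

```python
# Back-of-envelope only: every figure here is an illustrative assumption.
manuscript_words = 50_000            # a "decent sized" 40-50k word text
corpus_words = 5_000_000_000         # "likely billions of words" of prior text

# A completely naive scan would compare the manuscript against every
# position in the corpus, word by word:
naive_comparisons = manuscript_words * corpus_words
print(f"naive word comparisons: {naive_comparisons:.2e}")   # ~2.50e+14

# Even at a generous billion comparisons per second on a single core,
# that is days of raw scanning per manuscript:
seconds = naive_comparisons / 1_000_000_000
print(f"roughly {seconds / 86_400:.1f} days of single-core scanning")
```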
 
Do you even realize the magnitude of what you are talking about? Plagiarism testing software compares writings to maybe a few thousand, at most a few hundred thousand, scholarly articles published in well-established publications.

I urge you to consider the impossible task of cataloging and indexing even just the majority (not all) of English literature out there from, say, just the past 100 years. Good luck. And we have not even touched other languages or fringe areas, like smut.

It's the kind of idea that looks great in a campaign speech, then dies the moment it lands on an engineer's desk and is actually forced to meet reality for the first time.

I guess it would be possible to create a narrow, genre-specific database from the most popular sources, but that wouldn't prevent you from stealing from Tolkien in your work, or from any other famous author an AI might "copy" by accident.

Wait…hold on. Another oxymoronic AI dilemma. 🤪

Two major conflicts:


* The theory is that plagiarism detecting AI must have access to copyrighted material in order to protect the copyright of the content yet the AI must not be able to be influenced by that material?

* AI is being asked to produce content that is indistinguishable from human content while also being asked to flawlessly differentiate between human and AI content?
 
Eventually, in a few years, maybe a few decades, sure. Right now, building a database that could reliably compare a decent-sized, say 40-50k-word, text to the entire written body of English-language smut would require insane amounts of resources. The data alone we are talking about is measured in the tens if not hundreds of gigabytes, likely billions of words.

Having that data digitized is of no use if you cannot do the comparison. To give you an idea of the scale, GPT-4 was trained for 6 months on some 24 thousand GPUs. Just off the top of my head, I'd say it would not require this much hardware to do a thorough comparison, but it would still likely require immense resources. I would deem that financially unfeasible for at least the next couple of years.
Why can't the AI people give plagiarism detectors access to their LLMs?
 
Wait…hold on. Another oxymoronic AI dilemma. 🤪

Two major conflicts:


* The theory is that plagiarism detecting AI must have access to copyrighted material in order to protect the copyright of the content yet the AI must not be able to be influenced by that material?

* AI is being asked to produce content that is indistinguishable from human content while also being asked to flawlessly differentiate between human and AI content?
Nobody can prevent you from storing, indexing and using copyrighted material that is otherwise freely accessible to everyone. What the law prevents you from doing is reproducing that material. Which is why the NYT claim might have some merit, given the AI can reproduce (although only if directly prompted to do so) parts of their work verbatim.

If it were only the case that the AI "uses" their work but doesn't reproduce it, then that would not violate copyright. Just like you cannot prevent me from taking a photo of you while you are walking down the sidewalk. You can, however, prevent me from using your likeness in any way, shape or form if I make it accessible to others. (And I guess it would be on me to prove how that is not possible, or on you to prove how it is... I am not sure which one would be the standard of proof.)

Technically, AI could also be trained to detect plagiarism, but to my knowledge, current solutions are simple database-matching applications that scan the content against the text in their database.

Why can't the AI people give plagiarism detectors access to their LLMs?
Because LLMs are not relational databases, but neural networks. While you could train a neural network to try and detect if a given text matches a set of patterns it has seen already, LLMs are not built to do that. I never really bothered with this idea before, so I cannot really tell off the top of my head which one would be more efficient, an AI trained for detection or straight-up scanning through a relational database. The AI would likely be faster with fewer resources used, once it has been trained, but it would likely have to be an enormous neural network to have any sort of reliability and could take ages to train. The scanning solution is 100% certain, as it simply compares binary data, and needs a lot less preparation other than uploading and indexing the data, but then it would require immense resources to run effectively.
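To make the "scanning" side of that concrete, below is a minimal, deliberately naive sketch in Python of how such a database-style checker could work: break texts into word n-grams, index the reference corpus, and report how much of a suspect text appears verbatim. The n-gram length and the in-memory dictionary are assumptions for illustration only; a real service would shard this across huge databases.

```python
# Deliberately naive sketch of database-style plagiarism scanning.
from collections import defaultdict

def ngrams(text, n=8):
    """All runs of n consecutive words in the text (lower-cased)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(corpus, n=8):
    """Map every n-gram in the reference corpus to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index

def verbatim_overlap(suspect, index, n=8):
    """Fraction of the suspect text's n-grams that appear verbatim in the corpus."""
    grams = ngrams(suspect, n)
    hits = [g for g in grams if g in index]
    return len(hits) / max(len(grams), 1)

# Usage: index the reference texts once, then scan each suspect text against it.
# corpus = {"doc-1": "...", "doc-2": "..."}
# index = build_index(corpus)
# print(verbatim_overlap(suspect_text, index))
```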

To maybe make the difference easier to understand, here is a very over-simplified example.
A traditional database entry would store data like this:
1 | 2024-01-21 | For the Love of God | Caleb Zhass | Then he started scratching his head as he pondered on how to explain the topic... (just a bunch of text stored in the database in binary form)

The LLM's neural network on the other hand is made up of a mesh of calculators, with each calculator taking several (hundreds or thousands or more) inputs, doing some math on them and then spitting out a value that then gets fed to the next layer of calculators.

In front of the whole thing is a text analyzer (a tokenizer) that breaks the input text down into small constituent pieces (words and fragments of words), encodes those pieces as numbers, and feeds those numbers into the different inputs of the calculator mesh. Then at the end, the outputs of the calculators go through the process in reverse to become text again.

The model itself is just the mesh of calculators with their preset mathematical functions and the constant adjustments that determine the end result. That a bunch of numbers going through billions of calculations can turn a question into an answer that happens to be an exact replication of an article's text is exactly the black-box nature of neural networks. Since we are talking about billions of calculators in the mesh, no single person has even a remote chance of understanding the flow of information and why it turns the question into the answer. We just "train" the mesh until it does, by feeding it data, checking the output and, if the output is bad, adjusting the mesh and trying again. There is obviously some logic to the adjustments, so that with every iteration of the training the mesh gets a little smarter, but that, in a nutshell, is the idea of training neural networks.
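If it helps, here is a toy sketch of that "mesh of calculators" and the adjust-and-retry training loop, in Python with NumPy. The layer sizes and the update rule are made-up assumptions; a real LLM has billions of weights and a far more sophisticated training procedure, but the spirit (numbers in, weighted sums through layers, nudge the weights whenever the output is wrong) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mesh of calculators": two layers of weighted sums with a squashing function.
# The sizes are made up for illustration; a real LLM has billions of these weights.
W1 = rng.normal(size=(16, 8))   # first layer of little calculators
W2 = rng.normal(size=(8, 4))    # second layer

def forward(x):
    h = np.tanh(x @ W1)         # each column of W1 is one "calculator"
    return h @ W2               # its outputs feed the next layer

def train_step(x, target, lr=0.01):
    """Nudge the weights a little whenever the output is wrong."""
    global W1, W2
    h = np.tanh(x @ W1)
    out = h @ W2
    err = out - target                                    # how wrong were we?
    grad_W2 = np.outer(h, err)                            # adjustment for layer 2
    grad_W1 = np.outer(x, (err @ W2.T) * (1.0 - h ** 2))  # adjustment for layer 1
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return float((err ** 2).mean())

# "Training" in a nutshell: repeat on example input/output pairs
# until the outputs stop being wrong.
x = rng.normal(size=16)
target = np.array([1.0, 0.0, 0.0, 1.0])
for _ in range(200):
    loss = train_step(x, target)
print(f"error after training: {loss:.4f}")
print("output now:", forward(x).round(2))
```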

The article's text is not stored in a form that could be identified in ChatGPT's database, as it is not really a database in the traditional sense. It is a neural network that has this information encoded in its billions of values. Just like your brain doesn't store the text of a poem you memorized, yet it can still reproduce it when asked.
 
Nobody can prevent you from storing, indexing and using copyrighted material that is otherwise freely accessible to everyone. What the law prevents you from doing is reproducing that material. Which is why the NYT claim might have some merit, given the AI can reproduce (although only if directly prompted to do so) parts of their work verbatim.

If it were only the case that the AI "uses" their work but doesn't reproduce it, then that would not violate copyright. Just like you cannot prevent me from taking a photo of you while you are walking down the sidewalk. You can, however, prevent me from using your likeness in any way, shape or form if I make it accessible to others. (And I guess it would be on me to prove how that is not possible, or on you to prove how it is... I am not sure which one would be the standard of proof.)

Technically, AI could also be trained to detect plagiarism, but to my knowledge, current solutions are simple database-matching applications that scan the content against the text in their database.


Because LLMs are not relational databases, but neural networks. While you could train a neural network to try and detect if a given text matches a set of patterns it has seen already, LLMs are not built to do that. I never really bothered with this idea before, so I cannot really tell off the top of my head which one would be more efficient, an AI trained for detection or straight-up scanning through a relational database. The AI would likely be faster with fewer resources used, once it has been trained, but it would likely have to be an enormous neural network to have any sort of reliability and could take ages to train. The scanning solution is 100% certain, as it simply compares binary data, and needs a lot less preparation other than uploading and indexing the data, but then it would require immense resources to run effectively.

To maybe make the difference easier to understand, here is a very over-simplified example.
A traditional database entry would store data like this:
1 | 2024-01-21 | For the Love of God | Caleb Zhass | Then he started scratching his head as he pondered on how to explain the topic... (just a bunch of text stored in the database in binary form)

The LLM's neural network on the other hand is made up of a mesh of calculators, with each calculator taking several (hundreds or thousands or more) inputs, doing some math on them and then spitting out a value that then gets fed to the next layer of calculators.

In front of the whole thing is a text analyzer (a tokenizer) that breaks the input text down into small constituent pieces (words and fragments of words), encodes those pieces as numbers, and feeds those numbers into the different inputs of the calculator mesh. Then at the end, the outputs of the calculators go through the process in reverse to become text again.

The model itself is just the mesh of calculators with their preset mathematical functions and the constant adjustments that determine the end result. That a bunch of numbers going through billions of calculations can turn a question into an answer that happens to be an exact replication of an article's text is exactly the black-box nature of neural networks. Since we are talking about billions of calculators in the mesh, no single person has even a remote chance of understanding the flow of information and why it turns the question into the answer. We just "train" the mesh until it does, by feeding it data, checking the output and, if the output is bad, adjusting the mesh and trying again. There is obviously some logic to the adjustments, so that with every iteration of the training the mesh gets a little smarter, but that, in a nutshell, is the idea of training neural networks.

The article's text is not stored in a form that could be identified in ChatGPT's database, as it is not really a database in the traditional sense. It is a neural network that has this information encoded in its billions of values. Just like your brain doesn't store the text of a poem you memorized, yet it can still reproduce it when asked.
Why can't the AI developers give plagiarism detectors access to the same inputs that were used to develop the neural network? In other words, why would you have to search the whole universe of written language? Why not just the universe that the AI developers used?
 
Why can't the AI developers give plagiarism detectors access to the same inputs that were used to develop the neural network? In other words, why would you have to search the whole universe of written language? Why not just the universe that the AI developers used?
I guess that would be possible, sure. Not a bad idea, actually, as the training input was surely well documented. Even so, I'm fairly certain we would still be talking about an immense dataset, far larger than what academic plagiarism detectors currently have to contend with.
 
Ultimately, what I assume will happen is that the AI developers will be forced to implement some sort of self-exclusion option, where authors of publicly available but copyrighted work would be able to tell the AI that they never want to see their work reproduced verbatim, or any work produced in their "recognizable style". Alternatively, they could maybe even go the extra mile: if you were to ask the AI for a particular article, it could instead try to provide a link to the article for you, like a traditional search engine would. Though I'm not sure if they have that information in the model (the source URL of the text).

I mean, they cannot really tell people how their work is used once they have made it public, so I doubt we would see exclusion of works from training datasets, but they can certainly expect the other party to "protect" them to the best of their ability. If for nothing else, then to limit legal liability.
 
Ultimately, what I assume will happen is that the AI developers will be forced to implement some sort of self-exclusion option, where authors of publicly available but copyrighted work would be able to tell the AI that they never want to see their work reproduced verbatim, or any work produced in their "recognizable style".
That eventually will follow years of litigation and by then may be entirely overcome by events. The vast history of the written word will be sufficient for nearly all fiction writing.
 
Why can't the AI developers give plagiarism detectors access to the same inputs that were used to develop the neural network? In other words, why would you have to search the whole universe of written language? Why not just the universe that the AI developers used?
They have access to the same material, it's open source. They don't have access to the billions of dollars necessary to train their detectors:

Common Crawl.
 
They have access to the same material, it's open source. They don't have access to the billions of dollars necessary to train their detectors:

Common Crawl.
The plagiarism people need information about WHICH open source material was used. They don't need the training access. They just need to search the inputs. Right?

OR are you telling me, via the link, that all AI systems are trained on the same set of inputs??? That's not the impression I've gotten.
 
The plagiarism people need information about WHICH open source material was used.
All of it, hence the massive cost: $10 billion plus raised in capital in the case of ChatGPT-4 to train the base model.

Anyone can train a base model using a smaller dataset, but then it's not much good for checking for plagiarism.
 