NYT vs OpenAI: The “LLMs Do Not Copy Training Data” Argument

Ahti Ahde
12 min read · Jan 7, 2024


My own generation…

As an engineer and free researcher, I personally hope that all this hype will be over soon and my regular feeds will again be full of technical and interesting articles about machine learning and quantum computing, instead of being spammed with all the uninteresting drama around OpenAI. The scene has become a soap opera.

Anyway, every now and then I try to contribute when I see a contribution is due.

Yesterday I read this very well written backgrounder on the pending lawsuit filed against OpenAI by the New York Times. In the comment section someone claims to be very experienced in machine learning and to have consulted lawyers who, according to him, state that using copyrighted materials in training data is totally okay. When you are walking in gray areas and talking with lawyers, it is a good idea to make sure you ask the right questions rather than beg for the answers you want to hear, and when you pass the information forward, it is a good idea to confirm that what you say is still relevant.

I first consulted lawyers about this same issue in 2010, with my own machine learning startup (back then we didn’t call it AI or machine learning; we were just doing “heavy algorithms”). It seems I got somewhat better advice, because Google actually won their case based on the same ideas.

In this article you will learn about transformativeness as a legal concept, copyright around generative AI, how all this relates to machine learning, and how modern LLMs relate to the lossy compression lawsuits around .mp3s.

Original image

The Basic Frame of Copyright and Generative AI

Considering the whole path from training data to trained model, and the copyrights of the artifacts produced via the generative process, I will try to give you the full view of the roles, responsibilities and grounds of dispute.

In general, most modern legal systems are based on the idea that previous cases ought to make rulings predictable whenever it can be established that similar facts are present in a new case. This comes from the principle that everyone is equal before the law: if someone was earlier convicted on some ground, everyone after him ought to be convicted too. In that sense many of the issues relevant to generative AI were already dealt with when the music industry digitalized. However, there are some genuinely new things here.

Generative AI involves three different legal entities: the party gathering training data and training the model, the model provider, and the users. In a general sense, gathering training data and training any kind of model is legal, unless it touches clearly restricted domains covered by other parts of the law, such as autonomous weaponry or generally prohibited content. Model providers tend to be more or less responsible for the produced artifacts and their legality, and as far as I know, the users of generative models have not been sued for crimes.

There have been multiple cases where startups have been convicted because of a lazy interpretation of their responsibilities. The typical mistake is that a startup buys some copyrighted material and then offers it in its original form within some kind of apparatus that is claimed to be AI, while in fact they have just offered “search over intellectual property they do not own”. For example, many national statistics institutions sell their copyrighted datasets under licences for analysts. The idea is that the licence fees finance the collection of the data. It is understandable that if someone buys the licence and then resells the data to their own users, it damages the original author of the data while adding no new value to the dataset. This is clearly criminal.

From that perspective, if it is true that NYT articles can be reproduced via OpenAI models, then it seems to be illegal in this particular sense. It is illegal to show unaltered data if the original copyright holder doesn’t benefit, and it is also illegal even if the original data is only slightly altered (more about that in the lossy compression section).

There is one exception though. Artistic usage gives the author of remixed data a copyright of its own.

It is already established that generating content doesn’t automatically give a person copyright to that specific content. Many people were annoyed when they realized that simple prompting doesn’t give them any copyright protection for the artifacts they have created. Some mistakenly assume that complex prompting fulfills the “artistic effort” requirement; however, the whole truth is a bit more complex. In parody and satire of visual arts, copying the original is not rare at all, and those works are within the protection of copyright. However, copying the Mona Lisa is illegal no matter how much effort you put in. “Artistic effort” can be determined by… well, artistic effort. What it means to be an artist, in this sense, is that you have a specific art-philosophical method that you utilize in creation. If you can formally describe in court the method you used, and it is recognizable through the philosophy of art, then you are most likely safe, unless the artistic process included a criminal step.

At this point it seems quite clear that OpenAI’s defense will not be based on any kind of artistic effort claim, because this has never been part of their product idea.

While generative AI applications such as ChatGPT (which is powered by the GPT-3.5 model) seem non-deterministic, Transformer models are very much deterministic. The illusion of non-determinism comes from the application side, which uses system prompts: invisible additional text that is shown to the model but not to the user. A common misconception about ChatGPT is that it reacts only to the text you said last, but in fact it always evaluates the whole discussion, or some paraphrased part of it if the context limit was exceeded.
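To make that concrete, here is a minimal sketch of my own of how a chat application might assemble what the model actually sees: the hidden system prompt plus the whole conversation history, flattened into one text sequence. The roles, template and truncation rule are illustrative placeholders, not OpenAI’s real chat format.

```python
# Illustrative sketch only: not OpenAI's actual chat template.
def build_model_input(system_prompt, history, max_chars=8000):
    """history is a list of (role, text) tuples, oldest turn first."""
    lines = [f"[system] {system_prompt}"]
    lines += [f"[{role}] {text}" for role, text in history]
    prompt = "\n".join(lines)
    # If the context limit is exceeded, the application drops or summarizes
    # older turns; here we simply truncate from the start as a crude stand-in.
    return prompt[-max_chars:]

history = [
    ("user", "Summarize today's front page."),
    ("assistant", "Sure, which newspaper?"),
    ("user", "The New York Times."),
]
print(build_model_input("You are a helpful assistant.", history))
```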

Transformer based models are able to give seemingly random generations from the same prompt, but this is actually caused by temperature and the sampling strategy used for decoding. I will not explain them in detail here, but the basic idea is that when the decoder produces the next token, it draws it from a probability distribution computed from the earlier tokens. If we use zero temperature and a deterministic decoding method, such as greedy (argmax) decoding, we will always get the same next token deterministically.
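As a rough sketch of what deterministic decoding means in practice, the snippet below uses the open Hugging Face transformers library with GPT-2 as a stand-in (an assumption on my part, since OpenAI’s production models are closed): with sampling disabled, the continuation is the argmax token at every step, and repeated runs give identical output.

```python
# pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a closed model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")

# do_sample=False picks the highest-probability token at every step (greedy
# decoding), which is the zero-temperature limit: no randomness is left.
out1 = model.generate(**inputs, do_sample=False, max_new_tokens=20)
out2 = model.generate(**inputs, do_sample=False, max_new_tokens=20)
assert (out1 == out2).all()   # identical continuation on every run
print(tokenizer.decode(out1[0]))
```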

This fact makes the NYT lawsuit serious business. If you wanted to reverse engineer the training set used by a Transformer based language model, you would turn on the deterministic generation parameters and give the model a prompt that starts with the content you want to test for membership in the training set. NYT claims that this is what they did and that they were able to recover their copyrighted materials.
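The probing idea itself fits in a few lines. This is my own toy version, not NYT’s actual methodology: feed the opening of a known article as the prompt, decode greedily, and measure how much of the true continuation comes back verbatim (reusing the model and tokenizer objects from the previous snippet).

```python
# Toy memorization probe; `model` and `tokenizer` are the Hugging Face objects
# from the previous snippet, `article_text` is a known reference text.
def probe_memorization(model, tokenizer, article_text,
                       prefix_tokens=100, gen_tokens=200):
    ids = tokenizer(article_text, return_tensors="pt").input_ids
    prefix = ids[:, :prefix_tokens]
    reference = ids[0, prefix_tokens:prefix_tokens + gen_tokens]

    # Deterministic (greedy) continuation of the article's opening.
    out = model.generate(prefix, do_sample=False, max_new_tokens=gen_tokens)
    generated = out[0, prefix_tokens:]

    # Fraction of generated tokens that exactly match the real continuation.
    n = min(len(generated), len(reference))
    return (generated[:n] == reference[:n]).float().mean().item()
```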

Some OpenAI fans have vocally attacked NYT when they have not been able to replicate the results with ChatGPT. They are wrong, and the reasons should be obvious to you now: ChatGPT has system prompts, and NYT and OpenAI had been negotiating for months before the lawsuit came around. It is more than likely that these articles have since been blocked from ChatGPT programmatically, and the models might have gone through minor updates.

While updating the full pre-trained model is very expensive and you would not do that even for a lawsuit, most fine-tuning methods only touch the last layers. For example, GPT-3.5 might share part of the GPT-3 pre-trained core with some replaced layers on top of it. Data science with language models is often about adjusting some of the late layers to work a bit differently, or removing them in order to access lower level features of the model instead of the final embeddings. This is quite standard stuff that any machine learning engineer with modest experience can do.
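A hedged sketch of what that looks like in code, again with GPT-2 from the transformers library standing in for a proprietary core: freeze the pre-trained blocks and leave only the last couple of layers trainable, or read intermediate hidden states instead of the final embeddings.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for a proprietary core

# Freeze everything, then unfreeze only the last two Transformer blocks:
# fine-tuning now touches the late layers while the pre-trained core stays fixed.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")

# Lower-level features can also be read out directly by requesting hidden
# states on a forward pass, e.g. model(**inputs, output_hidden_states=True).
```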

In my opinion it seems likely that OpenAI has not gone through their full training data and checked whether copyrighted material appears there. That would cost too much money. I also do not believe that they simply used Common Crawl and hoped to get away with it because nobody can replicate or reverse engineer their models due to their size, as some seem to claim. However, it might be that they uncritically analyzed the terms of usage and left some datasets in their training data after they went from non-profit to for-profit operations with ChatGPT.

In the following section I describe my theory of how that uncritical usage of training data might have arisen inside OpenAI. In a nutshell, people change jobs and some less vocal but important information doesn’t always get handed over properly.

Original image by Image Flip

Language Models vs Large Language Models

Before GPT-3, we didn’t talk about “Large Language Models” (LLMs). There are some quite obvious reasons why we are facing this lawsuit today and why Google didn’t make the first mover attempt on LLMs; they had already been to court and learned their lesson. In 2018, when Transformers first came out big with BERT, it had been trained on clearly Creative Commons licensed datasets; the encoder-decoder paradigm had been optimized for what can be learned from Wikipedia style text (which is the reason why Wikipedia also started ranking high in Google Search).

Everyone wants to beat Google in this game, but when you start to fight with the old titans, it becomes tempting to avoid the restrictions they have imposed on themselves and to look for an advantage in that direction. OpenAI, being a non-profit at the time, had the privilege of moving past some of the obstacles Google had with scaling language models (technological resources, and the fines they had probably paid). However, OpenAI later moved to a for-profit structure and started profiting from the model via ChatGPT, and it seems that unfortunately some of the original ideas and restrictions from their foundation were lost from the company’s internal history books.

In the language modeling scene of roughly 2016–2020, overfitting to the training data was seen as a bad thing. We wanted to build language models that learned reasoning skills instead of parroting the training data. However, Few-Shot Learning took everyone a bit by surprise, and OpenAI started to maximize that direction. Indeed, it was very interesting to find out what the end result might be.

However, problems appeared pretty early on. It seems that in the internal lingo of OpenAI, overfitting started to mean “inability to learn from the training data” as opposed to “parroting answers from the training data”. For example, in this paper OpenAI seems to have already adopted this new idea of theirs, while in this paper Facebook Research observed that 60–70% of the LLM performance is perhaps not direct overfitting but overlap between answers in the training data and the test data. This excellent article shows commentary from Santa Fe Institute researcher Melanie Mitchell about how this new paradigm has made scientific criticism of new models impossible and how, instead of linguistic reasoning, their main focus might have become pattern memorization.
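To illustrate what “overlap between training data and test data” means in practice, here is a toy n-gram contamination check of my own; it is not the methodology of the cited paper, just the general flavor of such analyses.

```python
def ngram_overlap(train_texts, test_texts, n=8):
    """Toy contamination check: fraction of test n-grams that also occur
    somewhere in the training texts."""
    def ngrams(texts):
        grams = set()
        for text in texts:
            tokens = text.lower().split()
            grams.update(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        return grams

    train_grams, test_grams = ngrams(train_texts), ngrams(test_texts)
    return len(test_grams & train_grams) / max(len(test_grams), 1)
```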

While all this is interesting, how does it relate to the new lawsuit?

If a lawyer in 2018 said that using copyrighted material for language modeling is risk-free, a lot has changed since. If the memorization rate of LLMs far smaller than GPT-4 was already around 60–70%, it is probably much higher today. Also, in 2018 the context size was very short compared to today, which meant that it was impossible to store a whole article via memorization. And the general use case was classification or something similar, as opposed to generation. Generation is not transformative. The paradigm back then was to optimize for generalization (avoiding any kind of overfit) instead of memorization based chaining of patterns (which is exactly what decoder-only language models do, in practice).

Now that the use case is generative, the main paradigm is memorization, and the context length has grown to the scale of a typical piece of copyrighted journalistic material, the risk has become real. Because large language models are non-transformative, the lawsuit becomes all about whether or not GPT-4 (or did they sue over GPT-3.5?) is compressing NYT articles.

Original image Reddit Programmer Humor

Neural Networks as Lossy Compression and .mp3s

If OpenAI has to defend GPT-4 with a lossy compression argument, they have already lost, because there is no fundamental information theoretic difference from the “is an .mp3 a copy of a .wav music file” lawsuits, which were already lost. In a computational sense, and from the perspective of neural networks, there is no essential difference between music and text: both are time sequences of tokens that represent something in real-life physics and carry deeper meaning for human beings. In the case of both music and text, our ability to recover the original from a less-than-perfect replica is there.

Many of you probably know that parameter count matters when it comes to language models, but many probably do not know what this thing called a parameter actually means.

It all comes back to the memorization vs generalization issue. With modern computers it is very easy to do polynomial fitting. If you have a curve and historical data, you can build a perfectly overfitted model by giving the polynomial fit as many parameters as you have samples. When we do predictive analytics, we instead want to use fewer parameters, in the hope that the model will also predict future events. This is achievable if the data is good and there is only some small generic error, which causes minor disturbances but doesn’t mess too much with the long term trends.
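A small numpy sketch makes the point: give the polynomial as many coefficients as there are samples and it reproduces the data exactly, noise included (memorization); give it fewer and it is forced to smooth over the noise (generalization).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # trend + small noise

# Degree 9 means 10 coefficients for 10 samples: the curve passes (up to
# numerical error) through every point, i.e. it memorizes the data.
memorizing = np.polyfit(x, y, deg=9)

# Degree 3 has far fewer parameters, so it must average out the noise and
# keep only the underlying trend, which is what we want for prediction.
generalizing = np.polyfit(x, y, deg=3)

print(np.abs(np.polyval(memorizing, x) - y).max())     # ~0: exact interpolation
print(np.abs(np.polyval(generalizing, x) - y).max())   # larger residuals
```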

The history of machine learning runs from simple perceptrons, to delta rules, to feed-forward networks, to back-propagation, hidden layers and deep learning. What all these historical phases add is the ability to do more and more complex “fitting”. The first models were able to do linear regression, then non-linear regression, and now with deep learning we are able to learn weights from complex multidimensional data whose geometry the human mind has difficulty grasping.

However, this is the same cross-section of information theory, math and computer science that .mp3s are based on. In the original lawsuits of the 1990s and early 2000s it was claimed that an .mp3 file is not a copy of the original .wav file, but because it was functionally similar, that argument lost. This is similar to the Mona Lisa argument: even if the copy is lesser than the Mona Lisa, it is still a copy.

For this reason, if the evidence holds in court, which at this stage I believe is more likely than not, OpenAI has already lost, because there is no mathematical way of establishing that neural networks and machine learning paradigms are not part of the lossy compression family. The fact that increasing the parameter count seems to make the compression less lossy is very problematic for OpenAI’s defense. This might be one reason why Google went in the direction of reducing the number of parameters in their own language model research, in the spirit of the 2019 distillation work around DistilBERT.

Conclusions

While it is perfectly legal to gather training data and build models (as long as possession of that data is not prohibited by other laws), that doesn’t mean you are free of consequences for serving the model as a platform, if the platform enables consumption of copyrighted materials without the permission of the original copyright holder. Using prompts to replicate copyrighted materials is clearly outside fair use or artistic effort use cases, but those are still possible ways of using the OpenAI model.

NYT has quite well established a reasonable case that their copyrighted articles have been in OpenAI’s training materials, because they used exactly the methods one would use, as a computer scientist, to reverse engineer the training data from a Transformer based model; a similar approach has been described as a privacy risk.

The reason for the mistake might have been internal turmoil at OpenAI; a lot has changed in the field of language modeling, and some earlier facts might not have been well translated to new employees during hand-over processes. There is still the possibility that OpenAI has followed the terms of usage of their training materials and might be able to pass the lawsuit on to the data providers. However, such a move might hurt their reputation, as some of the data providers are small non-profit groups; it would merely destroy the lives of some people, and NYT would still see no money. For me it is really hard to imagine how OpenAI could avoid fines if the evidence gathered by NYT holds in court.



Ahti Ahde

Passionate writer of fiction and philosophy disrupting the modern mental model of the socio-capitalistic system. Make People Feel Less Worthless.