OpenAI is asking a federal court in the Southern District of New York to dismiss several of the counts against it in the lawsuit filed late last year by The New York Times accusing the tech company of threatening its business model with copyright violations.
The company wants the court to throw out a claim of infringement involving older NYT articles, which it says is time-barred, and a claim that it removed copyright management information, in violation of the Digital Millennium Copyright Act, when it scraped articles from the internet. It also asks the court to dismiss a claim that it misappropriated copyrighted works under New York law, on the grounds that the claim is preempted by federal law, and to nix a claim of contributory infringement that would hold it liable for infringement by others who use the ChatGPT tool.
What would be left is mainly the broader question that’s at the heart of a number of lawsuits involving generative AI — whether the use of copyrighted work scraped from the internet to train the large language models central to generative AI is protected by fair use.
“There is a genuinely important issue at the heart of this lawsuit,” OpenAI said in its motion, “critical not just to OpenAI, but also to countless start-ups and other companies innovating in this space: whether it is fair use under copyright law to use publicly accessible content to train generative AI models to learn about language, grammar, and syntax, and to understand the facts that constitute humans’ collective knowledge.”
Revenue losses
The New York Times late last year sued OpenAI and Microsoft, which built much of the genAI training infrastructure, for generating revenue at its expense.
“Defendants seek to free-ride on the Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the Times said in its complaint.
To make its case, the Times provided dozens of examples of ChatGPT answering prompts with large amounts of text that mimic the newspaper’s stories, almost to the word, removing people’s incentive to pay the Times for its content.
The technology also puts the Times’s reputation at risk by making up content that it attributes to the newspaper.
“Defendants’ models are … causing the Times commercial and competitive injury by misattributing content to the Times that it did not, in fact, publish,” the Times said. “In AI parlance, this is called a ‘hallucination.’ In plain English, it’s misinformation.”
Hired hand
In its motion to dismiss the counts, OpenAI accused the Times of hiring a hacker to generate the dozens of examples showing wholesale copying of its work.
“The allegations in the Times’s Complaint do not meet its famously rigorous journalistic standards,” the company said. “The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results [in] the Complaint.”
OpenAI says NYT’s hacker exploited a bug in which the tool regurgitates long passages of text. “Put simply, a model trained on the same block of text multiple times will be more likely to complete that text verbatim when prompted to do so,” the company said.
That’s a bug that it’s been trying to correct, the company said. “Training data regurgitation — sometimes referred to as unintended ‘memorization’ or ‘overfitting’ — is a problem that researchers at OpenAI and elsewhere work hard to address,” it said.
The company accused the hacker of exploiting this bug by, among other things, feeding large amounts of article texts into the prompt to help induce the tool to respond by regurgitating the text.
“They were able to [create the long passages of identical text] by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use,” the company said. “Normal people do not use OpenAI’s products in this way.”
In an emailed statement to Bloomberg Law, Ian Crosby of Susman Godfrey, an attorney representing the newspaper, said hacking the tool wasn't necessary to generate the damning results.
“What OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works,” Crosby said. “OpenAI’s response also shows that it is tracking users’ queries and outputs, which is particularly surprising given that they claimed not to do so. We look forward to exploring that issue in discovery.”
Fair use claim
On the larger question of fair use, the company says it expects to prevail because factual information belongs to everyone.
“Established copyright doctrine will dictate that The Times cannot prevent AI models from acquiring knowledge about facts, any more than another news organization can prevent the Times itself from re-reporting stories it had no role in investigating,” the company said.