The New York Times (NYT) is suing OpenAI and Microsoft (MSFT) for allegedly using millions of its articles without permission to train their AI chatbots. MIT Initiative on the Digital Economy Director Sinan Aral and Newsroom Robots Podcast host Nikita Roy discuss the details of the copyright lawsuit and potential implications for the publishing industry with Yahoo Finance Live.
Roy says she was "completely expecting" this scenario, suggesting the Times is "helping" smaller publishers lacking resources to take legal action. However, she notes accusing chatbots of infringement is "complex," hinging on whether courts deem AI a tool or if liability falls on the user. Still, Roy stresses that "we are facing a very ethical issue" regarding how creators' work gets utilized.
"This is a debate about whether the companies training large language models on content from the web... is fair use or infringement of copyright," Aral states. If courts mandate payments to original data producers, costs could rise significantly for AI firms. However, Aral believes "this was expected," and the only question that remains is "where does the price point lie?"
Video Transcript
- One, Nikita, were you surprised that the Times brought this lawsuit? And two, give us your thoughts on it. What did you make of it?
NIKITA ROY: Yeah. I was completely expecting this to happen, actually, for a long time. And I think the New York Times doing this is really helping a lot of the media organizations who probably don't have the ability to go out and take on these tech giants.
But the issue of copyright is really so complex, and it comes down to how the courts are going to define generative AI and specifically large language models because who is liable in this case? Is it considered a tool, or is it the user? And the problem is that all of these tech companies, like OpenAI, and Anthropic with Claude, they are completely shifting the way they are thinking about these large language models.
So one of the things that I think really is important to take note of is last month, OpenAI CEO Sam Altman said in his keynote speech that they would defend their customers and pay the costs incurred if they face legal claims around copyright. And so I think that really shows how confident maybe the tech companies are regarding their claims to make sure that these are considered just as tools and push that liability over to users. But at the end of the day, we are really facing a very ethical issue in terms of how we are going to be using people's work, taking that away, and then being a competitor to them in that space as well.
- And Sinan, I want to bring you into this because as someone who both studies these issues but also invests in startups that use AI, I'm curious. You know, when Nikita talks about that some of these companies are ready to face the costs, how substantial do you think they could potentially be for both smaller startups and really big ones like OpenAI?
SINAN ARAL: Well, I mean, I think that the costs could be very large, Julie. It's great to see you. Happy Holidays. This is a debate about whether the companies training large language models on content from the web, the New York Times, and content from lots of other places is fair use or infringement of copyright.
And if the courts or a settlement determine that there need to be payments made to the original producers of the training data, that could increase the costs for AI, generative AI companies, generative AI startups, and that makes a big difference for how the industry runs.
However, this was expected, as Nikita said. So these costs in large part have been planned for, have been thought about, and it's not something that is brand new. This has been a debate that's been ongoing for months, if not the better part of a year. And this is just the first opening of the conversation.
And really the determination will be where does the price point lie? Is it going to be a very large transfer to those who are creating the content that the models are trained on, or is it going to be smaller? Is it going to be settled, or is it going to go to court and have judicial case law created about copyright infringement based on this particular case? We will see. However, this was inevitable. And therefore, I think a lot of these costs have already been thought about in the long run.
- But Sinan, let me just get your take because tech companies have made the case before, right? And it seems like their argument is this is publicly available information we're scraping from the public internet. It's just oceans of data, oceans of text. So it is fair use. Do you buy that argument?
SINAN ARAL: No. I don't buy that argument, and it depends on how you use it. So if I were to take New York Times articles and start a website, sinanaral.com, and post New York Times articles on my website and charge for them, that would be copyright infringement, even if I scraped them from the web. That is not considered fair use.
If I were to, however, copy a single New York Times article for one class at MIT for educational purposes, that would likely be considered fair use. And training very large language models on millions, as the lawsuit indicates, of pieces of content by The New York Times is a new use. In other words, it hasn't been considered in the past as being a traditional use of copyrighted material. And therefore, we have not decided as a society whether this is fair use or not. And that's why this case is so important, because it will decide, either through case law or through settlement, how much we believe content producers deserve in terms of this type of use of their content, and that's what's new about this case.