The New York Times is suing OpenAI for copyright infringement, arguing that the way generative AI models such as ChatGPT are built and trained consistently violates its rights in a way that does the media serious economic harm.
At this point, given that I write in the media for a living, I should probably be writing a riotous defence of the NYT action, standing up against nasty big tech and protecting our information ecosystem. The problem is that the suit is largely nonsense, obviously self-serving, and based on a misunderstanding of how these models work. It risks stymieing all of the potential benefits of the technology, just to prop up profits (which at the NYT are pretty good as it is).
There are legitimate fights to be had between the media and tech over AI. The desire for AIs to give you an answer directly, rather than link you to a site, shatters the compact between search engines and sites – the former would scrape your content, yes, but they would send users your way (who would then see ads and maybe even sign up).
So there are grounds for media companies to want compensation if their real-time newsgathering is being monetised by big tech. But the NYT’s actual lawsuit is a much bigger power grab than that: when a human journalist reads a few articles from a rival publication and then puts together their own version of an article on that topic, it is totally legal providing they put it into their own words.
By and large, that is what an AI is at least trying to do as well. It might be “trained” on a huge dataset, but contrary to how most of us imagine it works, it does not retain that data and look it up when asked a question. Instead, modern generative AIs are best thought of as “spicy autocomplete” – a souped-up version of the way messaging and email apps suggest how to finish your sentences.
The AI creates weightings for the best or likeliest combinations of words with which to respond to the series of words that makes up a query, and it is these weightings – not stored copies of text – that are consulted. Trying to stop it mashing together things that don’t fit is how AI developers try to minimise “hallucinations”: circumstances in which AIs give a convincing but completely false answer.
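To see why “spicy autocomplete” is a fair description, here is a deliberately crude sketch in Python: a toy model that only counts which word tends to follow which in its training text, then predicts the likeliest continuation. Real systems learn vastly richer weightings over much longer contexts, but the principle – predict the next word, rather than look up a stored article – is the same. The corpus and function names here are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy bigram "autocomplete": count which word follows which in a tiny
# training corpus. This is an illustrative sketch, not how any real
# large language model is implemented.
corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

def predict_next(word):
    # Return the continuation seen most often in training.
    # Note: nothing here stores or retrieves whole sentences --
    # only frequency weightings survive the "training".
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" -- it followed "the" most often
```

The point of the toy is that after training, the original text is gone; all that remains is a table of likelihoods. That is also why such a model, scaled up, can confidently produce word sequences it was never shown.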
To ensure that AIs never generate particular sentences, you would need some database against which their output could be checked – and no such database exists, as there is no central registry of copyrighted material. Nor could one ever be built: if you have ever written a few notes to yourself to jog your memory, or briefly kept a diary, that is copyrighted too.
The NYT’s main examples in its case seem to come from very well-known articles that have already been replicated across the web – one sentence from Snow Fall, a multimedia story about an avalanche it published to great acclaim in 2012, has ended up heavily weighted in ChatGPT because it has been copied so widely across the internet.
These lawsuits don’t help the media’s coverage either, because tech already has a habit of assuming that coverage is self-serving – outsiders underestimate how independent reporters in the newsroom are from their corporate parents. And by over-focusing on our own battles, we make our coverage of tech weaker: reporting on social media platforms was hampered by the relentless (and often silly) battles between newsrooms’ corporate parents and those same platforms.
So while it’s the New York Times case that has taken up most of the oxygen of coverage, there’s a much more important one that has been ignored – and it starts with online fudge recipes.
The issue, discovered by the writer Zoah Hedges-Stocks, is that a lot of recipe content on the internet is now written by low-quality bots that churn out any old regurgitated crap. What she had noticed was that alongside posts comparing the process of making fudge to making toffee, there were also posts comparing the process of fudge making to making “scrimgeour”, another apparent Scottish treat of which she had never heard.
What had happened was that AI had confused the use of the word “fudge” in different contexts – “Cornelius Fudge” is the name of the Minister of Magic in JK Rowling’s Harry Potter series, until he is replaced by the more malign “Rufus Scrimgeour”. There is no candy product called “scrimgeour”, but shoddy AIs collapsed the context gap and generated nonsense – now replicated across numerous sites (and referenced here, too). Other examples abound – X user Will Rayner noted that when looking up the weather recently he had seen errant sentences referring to characters from The Hunger Games and the video game Baldur’s Gate 3, both called “Gale”.
By far the most entertaining to date came from Donald Trump’s former lawyer Michael Cohen, who had to apologise to a US court after citing non-existent cases that had been found for him (and not checked at all) by an AI assistant. It is generally bad form to do that in court.
Current generations of AI are already “trained”, and they were trained on largely pre-AI internet content. But keeping future generations of AI products current – and updating their features – will rely on scraping the internet as it is now.
The problem is that a growing share of the internet is polluted by the lowest-quality AI content, and then further muddied by articles like this one (of which I predict there will be many more in years to come) that try to explain the mess but strengthen the associations between the misleading words as they do so.
That creates the potential for a very dangerous cycle, in which ever-degrading inputs make AI’s outputs worse even as the technology gets cleverer, leading to a spiral of ever-worse content on the internet (a process dubbed “enshittification” by Cory Doctorow) until the joy and creativity of the internet are reduced to an algorithmic grey goo – information mulched into wasted words.
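The degradation loop described above can be sketched in a few lines of Python: a “model” that only learns word frequencies, and whose output becomes the next generation’s training set. This is a crude, invented illustration – not a claim about how real systems are retrained – but it shows the mechanism: common words get reinforced, rare ones vanish, and diversity can only shrink.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the toy run is repeatable

# Illustrative sketch of the "grey goo" loop: each generation is trained
# purely on a sample of the previous generation's output. Once a word
# fails to be sampled, it is gone for good.
vocab = [f"word{i}" for i in range(100)]
corpus = Counter({w: 10 for w in vocab})  # generation 0: 100 distinct words

diversity = []
for generation in range(10):
    words, counts = zip(*corpus.items())
    # "Generate" the next training set by sampling from the current model.
    sample = random.choices(words, weights=counts, k=500)
    corpus = Counter(sample)
    diversity.append(len(corpus))

print(diversity)  # the distinct-word count can only shrink, never recover
```

Because each generation samples only from words the previous one produced, the count of distinct words is monotonically non-increasing – a toy version of the input-degradation spiral.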
Today it’s fudge and Harry Potter. It will be affecting news content in the very near future, if it isn’t already.
No one’s quite sure how significant a risk the grey goo outcome is, and there are no certain plans to prevent it. Media companies need to make sure we’re looking beyond our own backyards – because if we’re not, we might miss the disaster that takes us all out.