Word Count 53: The state of AI and the Goodreads fiasco
Just two main topics this week but one is important and the other is mind-boggling.
Lots of articles about so-called AI have crossed my field of vision over the last two weeks, so I thought now might be a good time to do a round-up. It’s really, really hard to keep up with all that’s going on, and I’m not even going to pretend this is a comprehensive look at everything that’s happening, but it’s a taster.
Is the wave of AI shite breaking?
Generative computing company OpenAI is being sued over copyright infringement for training its first large language model (LLM), GPT-1, on unpublished but still copyrighted novels hosted on Smashwords. The suit also claims that GPT-3 was trained on “‘notorious shadow library websites,’ like Library Genesis, Z-Library, Sci-Hub and Bibliotik”. Lawyers are “seeking to represent a nationwide class of hundreds of thousands of authors in the U.S.”.
This is only one of several challenges for generative computing companies. OpenAI has also been “hit with a lawsuit alleging the company stole and misappropriated vast swaths of peoples’ data from the internet to train its AI tools”. That suit claims that OpenAI scraped as much personal data as it could from the internet, thus putting everyone affected at risk (I assume at risk of identity theft), and seeks an injunction preventing commercial use of OpenAI’s products.
Artists are suing generative art companies Stability AI, Midjourney and DeviantArt for copyright infringement after the companies downloaded “billions of images” from the internet. And Getty Images is suing Stability AI in the UK for “illegally copying and processing millions of its copyrighted images”. Stability AI is claiming that their scraping was fair use.
Surprising literally no one who’s been paying attention, it’s now possible to use an LLM to generate a whole novel. There have been plenty of guides to generating a novel chapter by chapter, but now there’s a GitHub script that generates an entire book in one go. Emphasis original author’s own.
This project utilizes a chain of GPT-4 and Stable Diffusion API calls to generate an original fantasy novel. Users can provide an initial prompt and enter how many chapters they'd like it to be, and the AI then generates an entire novel, outputting an EPUB file compatible with e-book readers.
A 15-chapter novel can cost as little as $4 to produce, and is written in just a few minutes.
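To give you an idea of just how little is involved, here’s a rough sketch of the sort of pipeline the project describes. To be clear, this is not the project’s actual code: it’s a minimal illustration that assumes OpenAI’s official Python SDK and the ebooklib package, skips the Stable Diffusion cover-art step entirely, and makes up its own prompts, title and chapter count.

```python
# Minimal, hypothetical sketch of a "generate a whole novel" pipeline.
# Not the GitHub project's code. Assumes: pip install openai ebooklib,
# and an OPENAI_API_KEY in the environment. No cover art, no retries,
# no quality control -- which is rather the point.
from openai import OpenAI
from ebooklib import epub

client = OpenAI()

def generate_chapter(premise: str, chapter_num: int, total: int) -> str:
    """Ask the model to write one chapter, conditioned on the premise."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a fantasy novelist."},
            {"role": "user", "content": (
                f"Premise: {premise}\n"
                f"Write chapter {chapter_num} of {total} of this novel."
            )},
        ],
    )
    return response.choices[0].message.content

def build_epub(title: str, chapters: list[str], path: str) -> None:
    """Bundle the generated chapters into an EPUB file."""
    book = epub.EpubBook()
    book.set_identifier("ai-novel-sketch")  # placeholder identifier
    book.set_title(title)
    book.set_language("en")
    items = []
    for i, text in enumerate(chapters, start=1):
        chap = epub.EpubHtml(title=f"Chapter {i}", file_name=f"chap_{i}.xhtml")
        chap.content = f"<h1>Chapter {i}</h1><p>{text}</p>"
        book.add_item(chap)
        items.append(chap)
    book.toc = items
    book.add_item(epub.EpubNcx())
    book.add_item(epub.EpubNav())
    book.spine = ["nav"] + items
    epub.write_epub(path, book)

premise = "A librarian discovers the card catalogue is a map to other worlds."
chapters = [generate_chapter(premise, n, 15) for n in range(1, 16)]
build_epub("The Catalogue of Doors", chapters, "novel.epub")
```

Fifteen API calls, a bit of glue, and out pops an EPUB ready to upload. That’s the entire barrier to entry.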
With a novel costing just $4 to create, you don’t need to sell many copies to turn a profit. It’s quite affordable to create and upload 10 books, or even 100 books, and flood Kindle with crap in the hope that people are stupid enough to buy a few copies.
It’s an extremely simple grift, but we’ll have to wait and see if it becomes a probl… oh, hang on, no, we don’t have to wait at all.
Amazon is drowning in LLM-generated shite. Last week, Vice reported that:
Amazon’s Kindle Unlimited young adult romance bestseller list was filled with dozens of AI-generated books of nonsense on Monday and Tuesday. As of Wednesday morning, Amazon appeared to have taken action against the books, but the episode shows that people are spamming AI-generated nonsense to the platform and are finding a way to monetize it.
“The AI bots have broken Amazon,” wrote Caitlyn Lynch, an indie author, in a Tweet on Monday. “Take a look at the Best Sellers in Teen & Young Adult Contemporary Romance eBooks top 100 chart. I can see 19 actual legit books. The rest are AI nonsense clearly there to click farm.”
I hate to say I told you so, but I told you so.
Amazon appear to have taken most of these books off the bestseller lists, but they remain available to buy. The company has a long, long history of not giving a rat’s arse about quality, either of content or reviews, but this flood of nonsense books might force them to take the issue a bit more seriously. If customers get stung too often by LLM-generated crap, they’ll start to look elsewhere for their cheap ebooks.
If only they’d try their local library.
Meanwhile, Google’s automated display ad system is placing ads for “top brands” on “spammy, fake, and chatbot-written sites that egregiously violate Google’s own policies”. Programmatic ad systems automatically put ads on websites that supposedly fit parameters defined by the advertisers, but the lack of humans in the process means that there’s nothing to stop LLM-generated “news” sites signing up to earn money from “a wide variety of blue chip advertisers”. Not a good look for Google.
On the other side of the Atlantic, German tabloid Bild, which is owned by Axel Springer, is making 200 people redundant as part of a reorganisation, and is replacing some editorial jobs, including "editors, print production staff, subeditors, proofreaders and photo editors," with AI. This is despite the fact that LLMs are horribly unreliable, frequently making stuff up out of whole cloth, as CNET found out the hard way earlier in the year. Even when LLM-generated content isn’t wrong, it’s dull. Witness BuzzFeed’s travel section, which is chock-full of places that are “hidden gems”.
And as if the current unreliability of generative computing models wasn’t bad enough, it’s getting worse as new models are being trained on an internet already polluted with junk from previous models. Researchers from the UK and Canada are warning that the “use of model-generated content in training causes irreversible defects in the resulting models.”
That basically means that as more and more content on the internet is generated by computers, more and more of it seeps into the training data used by new models, with the obvious result that those newer models degrade horribly.
learning from data produced by other models causes model collapse — a degenerative process whereby, over time, models forget the true underlying data distribution … this process is inevitable, even for cases with almost ideal conditions for long-term learning. … We were surprised to observe how quickly model collapse happens: Models can rapidly forget most of the original data from which they initially learned.
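To get a feel for why this happens, you don’t need an LLM at all. The toy simulation below is my own illustration, not the researchers’ experiment: it repeatedly fits a normal distribution to samples drawn from the previous generation’s fit, which is the simplest possible version of “training on model-generated data”.

```python
# Toy illustration of model collapse (not the paper's experiment).
# Each "generation" is trained (fitted) on data sampled from the
# previous generation's model, so finite-sample errors compound.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100        # size of each generation's "training set"
n_generations = 1000   # how many times we train on model output

mu, sigma = 0.0, 1.0   # the original, human-generated data distribution
for gen in range(n_generations):
    data = rng.normal(mu, sigma, n_samples)    # sample from the current model
    mu, sigma = data.mean(), data.std(ddof=1)  # refit on model-generated data
    if gen % 200 == 0:
        print(f"generation {gen:4d}: mean={mu:+.3f}, std={sigma:.3f}")

print(f"final:           mean={mu:+.3f}, std={sigma:.3f}")
# The spread tends towards zero: later generations can no longer produce
# the rare values present in the original data -- they have, in the
# researchers' words, forgotten the true underlying data distribution.
```

Run it a few times with different seeds and the exact numbers change, but the direction of travel doesn’t: the distribution narrows and the tails disappear.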
The researchers suggest that the only way to prevent model collapse is to avoid pulling computer-generated content into training sets (duh), or to refresh models on datasets that are human-generated. The question is whether so-called AI companies care enough to do that.
If you want to read more, The Verge’s look at the impact of generative computing on the web has enough links to keep you busy over lunch. It’s a good piece, charting the degradation of the web from a “place where individuals made things” to one of “slick and feature-rich platforms” made by companies that chase scale to make money, and talking about how parasitical computer-generated websites could “potentially overrun or outcompete the platforms we rely on for news, information, and entertainment”.
Indeed, Reddit, Wikipedia and Stack Overflow are all being polluted by low-quality LLM-generated content. Reddit and Stack Overflow’s management are both at odds with their users about how to deal with this problem, with management keen to find a solution that involves them monetising access to their APIs and users going on strike.
However these situations resolve, as Wired says, the internet will never be the same. The log has been removed from users’ eyes and they can now clearly see just how much valuable work they’ve been doing for free.
AI’s rise has caused a revaluation of what people put on the internet. Artists who feel their work was scraped by AI without credit or compensation are seeking recourse. Fan fiction writers who shared their work freely to entertain fellow fans now find their niche sex tropes on AI-assisted writing tools. Hollywood screenwriters are currently on strike to make sure AI systems aren’t enlisted to do their work for them. No, TV and film writers don’t write for the internet, but so much of what they create ends up online anyway, ready to be plucked.
To go back to Google for a moment: They are now pushing to use AI in their search engine, a move that has been described as turning Google search into a “plagiarism engine”. Google controls 91 per cent of the search market, making them by far the most powerful force on the internet – if your website doesn’t show up in Google, it might as well not exist.
Tom’s Hardware reports that Google is:
testing a major change to its interface that replaces the chorus of Internet voices with its own robotic lounge singer. Instead of highlighting links to content from expert humans, the “Search Generative Experience” (SGE) uses an AI plagiarism engine that grabs facts and snippets of text from a variety of sites, cobbles them together (often word-for-word) and passes off the work as its creation. If Google makes SGE the default mode for search, the company will seriously damage if not destroy the open web while providing a horrible user experience.
Could the tsunami of AI shite turn out to be a flash flood? Might the models rapidly degrade into uselessness or soon be sued or blocked out of existence? Will users rebel as their experience of the internet is degraded?
In my most optimistic moments, I find myself hoping that the whole AI edifice will come tumbling down as tools disintegrate, people realise how unreliable they are, and how valuable human-generated and curated information really is. But it’s not a safe bet.
Indeed, the UK’s Society of Authors has drafted advice for authors and audiobook narrators on how to protect themselves from the uncompensated exploitation of their work by companies running LLMs, and how to use computer-generated content responsibly.
They suggest that authors explicitly state in their contracts with publishers how their work may or may not be used with respect to machine learning and AI. It’s well worth talking to your agent/publisher about this, because if you don’t explicitly forbid stuff, you may as well be giving them the green light.
In other shitshow news
Goodreads has been a toxic dumpster fire for a while, as have other corners of the book reviewing world, but it’s now turning into a key weapon in what appear to be co-ordinated harassment campaigns.
Cecilia Rabess’s debut novel, Everything’s Fine, was review-bombed with one-star reviews six months before publication because people didn’t like the premise. Says the NY Times, “The story centers on a young Black woman working at Goldman Sachs who falls in love with a conservative white co-worker with bigoted views.”
Now, that might not sound like a fabulous basis for a relationship, fictional or otherwise, but in reality it is possible to fall in love with someone with whom you do not always agree. The whole point of fiction is to take these ideas and push them to an extreme and see what happens. The idea of review-bombing a book because you personally don’t like the premise is ludicrous.
This comes not long after Elizabeth Gilbert pulled her upcoming book, The Snow Forest, because of a flood of negative Goodreads reviews. Once again, the book hadn’t been published and reviewers weren’t interested in the book itself, but were objecting to the fact that it is set in Russia in the middle of the last century.
Again, this is utterly ludicrous. The reviewers’ cover story is that they feel it’s inappropriate to write anything featuring Russia whilst its invasion of Ukraine is ongoing. But I’m pretty sure that the folks in Ukraine who are doing the fighting don’t really care about a book. The idea that pulling The Snow Forest in any way helps Ukraine is facile, and it’s disappointing that Gilbert caved to the pressure.
Goodreads is, of course, owned by Amazon who, again, don’t care at all about this kind of furore. As Lincoln Michel says, Goodreads has no incentive to be good.
I think the fundamental problem with Goodreads is the same as social media in general: they care about engagement not accuracy.
Just as Twitter is happy to be filled with ragebait trolls and Facebook is fine being flooded with misinformation that generates engagement, Goodreads is presumably happy to have fake reviews. Hell, what better way to get engagement than having one-star review campaigns provoke fan campaigns to bring up the rating and so on and so forth. The more clicks the better.
Amazon/Goodreads could, of course, take steps. But they won’t.
Obligatory cat photo
I wish I could be this relaxed when I’m at my desk.
Finally, an update on Grabbity’s eyes. We went to the vet again yesterday and whilst her corneal ulcers haven’t improved, they haven’t got any worse either. So we’re moving to steroid drops to see if that helps. She’s a real stoic and put up with the vet sticking swabs in her eye without complaining once. Next vet visit is in about a month, so fingers crossed!
Right, that’s it for now. See you again in a couple of weeks!
All the best,
Suw
PS…. As I haven’t really been able to produce a lot of content for premium members of late, I’ve decided to make all existing content free. I’ll revisit premium content when I have a few more subscribers, but in the meantime, a huge, huge thank you to everyone who has upgraded to paid. You are all marvellous human beings!