A big test of AI summaries
Plus, why chatbots sound so human
Issue 86
On today’s quest:
— Can AI write usable scientific summaries?
— Word watch: the bitter lesson
— Why chatbots sound so human …
— … but still fall short
— How good are LLMs at text-based adventure games?
— Chatbots versus search
Can AI write usable scientific summaries?
The short answer is no.
Writers at the journal Science ran a year-long experiment to see whether AI could write usable summaries of its research articles for their press packages (PDF). Over 12 months, they generated 64 AI-written summaries, taking ChatGPT’s first output without edits, and compared each one to the staff-written summary of the same article. They found that although the system was adequate for less technical articles (perspectives, policy forums, and editorials), it could not provide the detail and big-picture implications needed for research article summaries.
The more complicated answer is that we probably still don’t know. The study ended in December 2024, and LLMs have significantly improved since then.
I’m delighted to see such a long-running test and have lots of questions about the details. I hope to have more info for you in the future.
Word watch: the bitter lesson
I’ve recently seen multiple AI people refer to the “bitter lesson” as though it’s something everyone would recognize.
The phrase comes from a 2019 post by Richard Sutton, whose primary work is in reinforcement learning. The lesson is that in AI, approaches that prioritize scale and computation have consistently beaten approaches that rely on built-in human knowledge. For example, in the early days, when computers beat humans at chess and Go, it was brute-force computing that made the difference, not systems built to encode human knowledge of the games.
The lesson is “bitter” because it suggests that human expertise doesn’t matter, and many researchers have wasted years trying to build systems where it does, only to be beaten when more computational resources are applied to the problem.
However, Sutton’s thinking isn’t inherently anti-human. He ends by saying knowledge-based approaches don’t work because our own thinking is immeasurably complex and can’t be turned into code. He says, “We want AI agents that can discover like we can, not which [only] contain what we have discovered.”
Why chatbots sound so human …
After being addressed with “Hey, you” by a chatbot, linguist John McWhorter wrote about a concept linguists call “pragmatics”: the hard-to-define social meaning carried by words that might seem informal or extraneous in a sentence, or by the context of a statement.
Pragmatics is what the word “even” is doing in a sentence like “I turned in my story, and it was even on time.” It adds more of a feeling than a meaning.
Pragmatics is why “Could you maybe turn the music down?” is different from “Turn the music down.”
McWhorter argues that LLMs have mastered pragmatics, and that’s what makes them feel so human.
… but still fall short
Meanwhile, psychologist George Gaskell described a study in which ChatGPT regularly violated what linguists call Grice’s maxims of communication: be accurate, relevant, clear, and give the right amount of information.
For example, ChatGPT ignored the main point of a question about cultural norms, answered with a Western perspective for users who clearly lived elsewhere, and lied/hallucinated.
Gaskell argues, “The failure of ChatGPT to follow the cooperative principles of communication in addressing mundane commonsense issues, suggests that the use of ‘digital human’ or ‘human talk’ is inappropriate.”
How good are LLMs at text-based adventure games?
Researchers created an AI leaderboard based on how well LLMs perform in text-based games that were popular in the 1980s, including Sherlock, Seastalker, Enchanter, and Zork I.
Meant to be a test of reasoning ability, the challenge requires LLMs to take hundreds of precise actions to complete the games. To succeed, the LLMs must make long multi-step plans, stay focused, and learn from experience.
The researchers found that the longer the game went on, the more likely the LLMs were to hallucinate, and that all models struggled with spatial reasoning. For example, “In Wishbringer, most LLMs struggled to navigate back down a cliff after climbing it. The solution simply required reversing the sequence of directions used to ascend — information available in the context history.”
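If you’re curious what that reversal looks like, here’s a toy sketch in Python (the directions are made up, not taken from the benchmark): retracing a path means reversing the order of the moves and swapping each one for its opposite.

```python
# Toy sketch of the reversal the models missed (hypothetical directions,
# not from the paper): to go back, reverse the order of the moves AND
# replace each move with its opposite.
OPPOSITE = {
    "north": "south", "south": "north",
    "east": "west", "west": "east",
    "up": "down", "down": "up",
}

def retrace(ascent):
    """Return the moves that walk a path back to its starting point."""
    return [OPPOSITE[move] for move in reversed(ascent)]

print(retrace(["up", "north", "up", "east"]))
# -> ['west', 'down', 'south', 'down']
```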
GPT-5 currently tops the leaderboard, followed by Claude Opus 4.1. Grok 4 comes in third but also has the highest “harm” score — a measure of harmful actions taken in the games.
Chatbots versus search
In his blog post “Comparing the wrong things,” Mike Caulfield gets at something I’ve had in the back of my mind for a while — that we’re holding chatbots to a higher standard than a Google search (or a human source).
You’ll find all kinds of inaccurate information when you do a web search. This morning, I tried to figure out if it would be a good idea to make yogurt from milk that had gone sour, and after searching Google and clicking through five articles, I still wasn’t sure and just poured the milk down the drain.
Human sources and full articles written by journalists can also provide wrong or misleading information. It’s up to us — always — to determine the truth of what we hear or read. Why should a chatbot be any different?
The counterargument, which has a lot of merit, is that we should hold chatbots to a higher standard because they present their replies confidently, in a way that plays on our instincts to take them as fact. No small “ChatGPT can make mistakes” disclaimer at the bottom overcomes most people’s tendency to see the output as real and true. Therefore, we should hold their feet to the fire and complain about hallucinations.
As an aside, I wonder about this with the energy comparisons too. When people compare the cost of getting an answer from an LLM to doing “a Google search,” they are only looking at the cost of submitting your query to Google and getting back the 10 blue links, not the energy involved in clicking through to any of the sources. In 2009, Google reported that a search used 0.3 Wh of energy. My guess is that it’s lower now, but I haven’t seen a number.
Also, I am not confident in the following numbers because they come from ChatGPT and I don’t know how to validate them. But it tells me that if you add visiting two pages from a Google search to the calculation, and you’re working on a laptop, the total energy cost of seeking an answer looks like this, depending on the kind of pages you visit:
Typical web pages: 0.55 Wh
Image-heavy web pages: 2 Wh
Regardless of the accuracy of the exact numbers, the point stands that if you click through to investigate the results of your search, you’re using more energy than the commonly cited 0.3 Wh, and how much depends on the type of sites you’re visiting. (The most recent estimate for doing a single average ChatGPT query is also 0.3 Wh.)
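To make the arithmetic explicit, here’s a minimal sketch. The per-page figures are back-calculated from the totals above assuming two page visits, so they are assumptions that inherit all the uncertainty of the ChatGPT-sourced numbers:

```python
# Back-of-the-envelope sketch, not a measurement. QUERY_WH is Google's
# 2009 figure; the per-page costs are back-calculated from the unverified
# totals above (two page visits per search) and are assumptions.
QUERY_WH = 0.3  # one Google search, per Google (2009)
PAGE_WH = {"typical": 0.125, "image_heavy": 0.85}  # derived, hypothetical

def search_energy(pages_visited, kind):
    """Total Wh for one search plus clicking through some results."""
    return QUERY_WH + pages_visited * PAGE_WH[kind]

print(search_energy(2, "typical"))      # 0.55 Wh, matching the list above
print(search_energy(2, "image_heavy"))  # 2.0 Wh
```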
It may not matter for long, though: in March, 27% of U.S. Google searches resulted in zero clicks, and that number is growing. Most observers attribute the rise to AI Overviews giving people answers directly on the results page.
Quick Hits
Using AI
Tips for getting the best image generation and editing in the Gemini app — Google Blog
How I used ChatGPT-o3 to plan an entire marketing campaign during one plane ride (very detailed) — HubSpot
Unlocking significant improvements with better prompting — Umut Eser
Security
New attack on ChatGPT research agent pilfers secrets from Gmail inboxes. People considering connecting LLM agents to their inboxes, documents, and other private resources should think long and hard about doing so. — Ars Technica
Psychology
ChatGPT Is Blowing Up Marriages as Spouses Use AI to Attack Their Partners — Futurism
AI Is Making Online Dating Even Worse. What happens when users are inundated with machine-generated profiles and pick-up lines? — The Cut
Legal
California issues historic $10,000 fine over lawyer’s ChatGPT fabrications (21 of 23 quotes in an opening brief submitted in 2023 were fake) — CalMatters
Bad stuff
My daughter used ChatGPT as a therapist, then took her own life — The Sunday Times (UK)
The U.S. government is annoyed that Anthropic won’t let it use Claude to surveil citizens — Semafor
We set out to craft the perfect phishing scam. Major AI chatbots were happy to help. — Reuters
Bias in LLM peer review: identical papers attributed to low-prestige affiliations face a significantly higher risk of rejection — arXiv
Medicine
Why is AI struggling to discover new drugs? “So far, the problem has proved beyond the algorithms. We know surprisingly little about our own biology. … Darren Green, a veteran chemist who spent more than 30 years at GSK, says drug discovery is ‘probably the hardest thing mankind tries to do’. ‘We get these great new tools, which is fantastic. And then you just find another problem,’ he says.” — Financial Times
AI medical tools downplay symptoms in women and ethnic minorities — Financial Times
Education
I’m a High Schooler. AI Is Demolishing My Education. — The Atlantic
Oxford becomes first UK university to offer ChatGPT Edu to all staff and students — University of Oxford
The business of AI
Is AI a bubble? A practical framework to answer the biggest question in tech — Exponential View
Religion
Pope nixes 'virtual pope' idea, explains concerns about AI — National Catholic Reporter
Finding God in the App Store. Millions are turning to chatbots for guidance from on high. — New York Times
What is AI Sidequest?
Are you interested in the intersection of AI with language, writing, and culture? With maybe a little consumer business thrown in? Then you’re in the right place!
I’m Mignon Fogarty: I’ve been writing about language for almost 20 years and was the chair of media entrepreneurship in the School of Journalism at the University of Nevada, Reno. I became interested in AI back in 2022 when articles about large language models started flooding my Google alerts. AI Sidequest is where I write about stories I find interesting. I hope you find them interesting too.
If you loved the newsletter, share your favorite part on social media and tag me so I can engage! [LinkedIn — Facebook — Mastodon]
Written by a human (except the examples below, obviously)