HOT TAKE: Can AI do 2.5% of your work or 48% of your work?

A new study reports dramatically different results from a previous one. I look at what's up.

Issue 93

If you’d rather listen to my hot take on this new study, you can find the podcast version at Apple Podcasts or most other podcast platforms.

AI succeeds at only 2.5% of Upwork-like tasks

Researchers from the Center for AI Safety and Scale AI ran a study of AI on real work projects that, at first glance, seemed similar to the recent OpenAI study that found AI could successfully complete 48% of real-world, economically valuable human work, but this time the researchers found AI could do only 2.5% of the work. I was completely intrigued! What was the difference?

Background: the Remote Labor Index

The "Remote Labor Index" (RLI) study tested 240 different projects that had already been completed by freelancers on Upwork. The researchers got the project briefs and supporting files the freelancers had used to do the work, lightly structured them, and gave them to various LLMs with a simple prompt and access to helper tools such as speech-generation and image-generation technology.

Tasks were more complex

The first difference is that the projects in the new RLI study were much more complex than the tasks in the OpenAI (GDPval) study. For example, they took more than twice as long for human workers to complete and required producing a more diverse set of file types, such as 3D designs and CAD files. The RLI paper says, "These projects exhibit significantly higher complexity than tasks found in existing agent benchmarks." And the projects did look bigger to me. Here are some examples:

  • Build an interactive dashboard for exploring data from the World Happiness Project.

  • Develop architectural plans and a 3D model for a container home based on an existing PDF design.

  • Format a paper using the provided [very messy] figures and equations for an IEEE conference.

Prompts weren't as tailored

A second difference is that the more successful GDPval study put significantly more effort into creating the prompt for the LLMs. In that study, the prompts were written by experts with an average of 14 years of experience in the task and went through an average of five rounds of revisions. 

In the RLI study, the core prompt was the same for every task. The system received the same project brief and materials that had been given to the freelancer in the original project, along with a prompt whose core was: "Read the brief attached and create only the deliverables described. Do not ask any questions..."

The briefs were screened to make sure they contained all the information needed to complete the project, and the researchers added technical instructions (such as specifying what format of output files was required or how to use the speech-generation tools). Each brief was also structured into "Work description," "Provided material," and "Deliverables" sections, but otherwise it was kept as close as possible to the raw material that had been sent to the freelancers.

This makes the RLI setup much more realistic about the way people are likely to hand projects to AI, but it also means the prompts were not optimized for AI success the way the GDPval prompts were.
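To make that contrast concrete, here's a minimal sketch, in Python, of what a fixed, one-size-fits-all prompt wrapper like RLI's might look like. This is my own illustration, not code from the study; the section names come from the paper, but the function and variable names are hypothetical.

    # Hypothetical sketch of an RLI-style fixed prompt wrapper.
    # The section headers mirror the paper's brief structure; the names
    # and exact wording are my own, not the study's actual code.

    CORE_INSTRUCTION = (
        "Read the brief attached and create only the deliverables described. "
        "Do not ask any questions."
    )

    def build_rli_style_prompt(work_description: str,
                               provided_material: str,
                               deliverables: str) -> str:
        """Wrap a freelancer-style brief in the same core prompt for every task."""
        return (
            f"{CORE_INSTRUCTION}\n\n"
            f"Work description\n{work_description}\n\n"
            f"Provided material\n{provided_material}\n\n"
            f"Deliverables\n{deliverables}\n"
        )

    # The wrapper never changes; only the brief does. Contrast this with
    # GDPval's expert-written, task-specific prompts.
    print(build_rli_style_prompt(
        work_description="Build an interactive dashboard for exploring data "
                         "from the World Happiness Project.",
        provided_material="happiness_data.csv (attached)",
        deliverables="A working dashboard plus a short README.",
    ))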

This feels like the most significant difference to me. Writing a prompt for a system such as ChatGPT is not the same as writing a project brief for a freelancer. Further, the prompts in the GDPval study were written by experts with AI in mind, but the project briefs in the RLI study were written by Upwork clients (presumably of varying levels of expertise) specifically for human freelancers. When I saw this, my gut reaction was, "Of course it failed."

One thing that has concerned me about the GDPval study is how much work was put into the prompts. Although it shows what is possible with good prompting, I've also wondered how realistic it is to expect people to have the expertise and patience to give such good instructions. The RLI study looks more like a test of how people may actually approach projects.

Content writing tasks were excluded

A third difference, which I think will be of particular interest to many readers here, is that the RLI study specifically excluded content writing tasks because the researchers believe "today's models already perform relatively well" on those tasks, so there is nothing new to learn from testing them.

The authors specifically note that the older GDPval study included a much higher percentage of research and writing tasks than their study did, and they say their mix is more aligned with the distribution of real-world projects available on Upwork.

According to the RLI paper, more than 50% of the tasks in the GDPval study were "research and writing" tasks (which they consider easy), whereas only ~5% of the tasks in the RLI study fell into this category.

Both tests were one-shot

Both GDPval and RLI are essentially one-shot tests. Nobody worked with the models to improve the output if it wasn't right the first time, which doesn't seem like a real-world scenario. 

For example, in the RLI study, 36% of the failures were because the project was incomplete. (In one example that made me laugh, the system produced an 8-second video instead of the required 8-minute video.) In the real world, a user would likely reply, "You didn't do X. Try again," and in some percentage of cases, they'd then get what they wanted. 

Further, a prompt that includes "Do not ask any questions" also doesn't seem like a real-world scenario and seems almost designed to lead to failure. Long-time readers may recall that back in July, I wrote about the one line Christopher Penn calls "magic" for improving your LLM's responses: “Ask me one question at a time until you have enough information to complete the task.”

The bottom line

I went into the RLI paper wondering how the results could be so dramatically different from the GDPval results and came away feeling like it is obvious: the projects were much more complex, and the prompts were much worse.

The goal of the GDPval study was to see whether it is possible to get the systems to succeed at common real-world tasks, and the goal of the RLI study was to see how the systems perform on more difficult, messier, complete real-world projects. That's the difference between 48% success and 2.5% success.

The real-world takeaway: if you want to use AI for real work, choose your tasks carefully and put effort into the context and prompts you give the system.

Quick Hits

Climate & Energy

How data center power demand could help lower electricity prices [Especially interesting because it’s counterintuitive.] — PBS News

We’ve mastered harvesting the sun, but storage is the game changer [An interesting comparison of refrigeration in food distribution to the growth of battery storage for electricity.] — The Electrotech Revolution

Job market

55 percent of employers regret laying off workers because of AI. “More people in charge of AI investment expect it to increase headcount (57 percent) than to decrease it (15 percent) over the next year.” — The Register

Science & Medicine

The first pregnancy via AI-detected sperm: the system found 5 motile sperm in 2.5 million images from a 39-year-old man with infertility. — The Lancet, via Eric Topol

Music

How Many AI Artists Have Debuted on Billboard’s Charts? [At least six AI or AI-assisted artists in the last few months.] — Billboard

The business of AI

OpenAI sued for trademark infringement over Sora's 'Cameo' feature [I had been wondering if this would happen. I immediately thought of the company Cameo when OpenAI used “cameo” in Sora.] — Engadget

Resources

AI Competencies for Academic Library Workers — Association of College and Research Libraries

What is AI Sidequest?

Are you interested in the intersection of AI with language, writing, and culture? With maybe a little consumer business thrown in? Then you’re in the right place!

I’m Mignon Fogarty: I’ve been writing about language for almost 20 years and was the chair of media entrepreneurship in the School of Journalism at the University of Nevada, Reno. I became interested in AI back in 2022 when articles about large language models started flooding my Google alerts. AI Sidequest is where I write about stories I find interesting. I hope you find them interesting too.

If you loved the newsletter, share your favorite part on social media and tag me so I can engage! [LinkedIn | Facebook | Mastodon]

Written by a human.