HOT TAKE: Can AI do 2.5% of your work or 48% of your work?

A new study reports dramatically different results from a previous one. I look at what's up.

Issue 93

If you’d rather listen to my hot take on this new study, you can find the podcast version at Apple Podcasts or most other podcast platforms.

AI succeeds at only 2.5% of Upwork-like tasks

Researchers from the Center for AI Safety and Scale AI ran a study of AI on real work projects that, at first glance, seemed similar to the recent OpenAI study that found AI could successfully complete 48% of real-world, economically valuable human work, but this time the researchers found AI could do only 2.5% of the work. I was completely intrigued! What was the difference?

Background: the Remote Labor Index

The "Remote Labor Index" (RLI) study tested 240 different projects that had already been completed by freelancers on Upwork. The researchers got the project briefs and supporting files the freelancers had used to do the work, lightly structured them, and gave them to various LLMs with a simple prompt and access to helper tools such as speech-generation and image-generation technology.

Tasks were more complex

The first difference is that the projects in the new RLI study were much more complex than the tasks in the OpenAI (GDPval) study. For example, they took more than twice as long for human workers to complete and required producing a more diverse set of file types, such as 3D designs and CAD files. The RLI paper says, "These projects exhibit significantly higher complexity than tasks found in existing agent benchmarks." And the projects did look bigger to me. Here are some examples:

  • Build an interactive dashboard for exploring data from the World Happiness Project.

  • Develop architectural plans and a 3D model for a container home based on an existing PDF design.

  • Format a paper using the provided [very messy] figures and equations for an IEEE conference.

Prompts weren't as tailored

A second difference is that the more successful GDPval study put significantly more effort into creating the prompt for the LLMs. In that study, the prompts were written by experts with an average of 14 years of experience in the task and went through an average of five rounds of revisions. 

In the RLI study, the core prompt was the same for every task. The system received the same project brief and materials that had been given to the freelancer in the original project, along with a prompt whose core was: "Read the brief attached and create only the deliverables described. Do not ask any questions..."

The briefs were screened to make sure they contained all the information needed to complete the project, and the researchers added technical instructions (such as specifying what format of output files was required or how to use the speech-generation tools). Each brief was also structured into "Work description," "Provided material," and "Deliverables" sections, but otherwise it was kept as close as possible to the raw material that had been sent to the freelancers.

This makes the RLI setup much more realistic about the way people are likely to hand projects to AI, but it also means the prompts were not optimized for AI success the way the GDPval prompts were.
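To make that contrast concrete, here's a minimal sketch, in Python, of what a fixed, one-size-fits-all prompt wrapper like RLI's might look like. This is my own illustration, not code from the study; the section names come from the paper, but the function and variable names are hypothetical.

    # Hypothetical sketch of an RLI-style fixed prompt wrapper.
    # The section headers mirror the paper's brief structure; the names
    # and exact wording are my own, not the study's actual code.

    CORE_INSTRUCTION = (
        "Read the brief attached and create only the deliverables described. "
        "Do not ask any questions."
    )

    def build_rli_style_prompt(work_description: str,
                               provided_material: str,
                               deliverables: str) -> str:
        """Wrap a freelancer-style brief in the same core prompt for every task."""
        return (
            f"{CORE_INSTRUCTION}\n\n"
            f"Work description\n{work_description}\n\n"
            f"Provided material\n{provided_material}\n\n"
            f"Deliverables\n{deliverables}\n"
        )

    # The wrapper never changes; only the brief does. Contrast this with
    # GDPval's expert-written, task-specific prompts.
    print(build_rli_style_prompt(
        work_description="Build an interactive dashboard for exploring data "
                         "from the World Happiness Project.",
        provided_material="happiness_data.csv (attached)",
        deliverables="A working dashboard plus a short README.",
    ))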

This feels like the most significant difference to me. Writing a prompt for a system such as ChatGPT is not the same as writing a project brief for a freelancer. Further, the prompts in the GDPval study were written by experts with AI in mind, but the project briefs in the RLI study were written by Upwork clients (presumably of varying levels of expertise) specifically for human freelancers. When I saw this, my gut reaction was, "Of course it failed."

One thing that has concerned me about the GDPval study is how much work was put into the prompts. Although it shows what is possible with good prompting, I've also wondered how realistic it is to expect people to have the expertise and patience to give such good instructions. The RLI study looks more like a test of how people may actually approach projects.

Content writing tasks were excluded

A third difference, which I think will be of particular interest to many readers here, is that the RLI study specifically excluded content writing tasks because the researchers believe "today's models already perform relatively well" on those tasks, so there is nothing new to learn from testing them.

The authors specifically note that the older GDPval study included a much higher percentage of research and writing tasks than their study did, and they say their mix is more aligned with the distribution of real-world projects available on Upwork.

According to the RLI paper, more than 50% of the tasks in the GDPval study were "research and writing" tasks (which they consider easy), whereas only ~5% of the tasks in the RLI study fell into this category.

Both tests were one-shot

Both GDPval and RLI are essentially one-shot tests. Nobody worked with the models to improve the output if it wasn't right the first time, which doesn't seem like a real-world scenario. 

For example, in the RLI study, 36% of the failures were because the project was incomplete. (In one example that made me laugh, the system produced an 8-second video instead of the required 8-minute video.) In the real world, a user would likely reply, "You didn't do X. Try again," and in some percentage of cases, they'd then get what they wanted. 

Further, a prompt that includes "Do not ask any questions" also doesn't seem like a real-world scenario and seems almost designed to lead to failure. Long-time readers may recall that back in July, I wrote about the one line Christopher Penn calls "magic" for improving your LLM's responses: “Ask me one question at a time until you have enough information to complete the task.”

The bottom line

I went into the RLI paper wondering how the results could be so dramatically different from the GDPval results and came away feeling like it is obvious: the projects were much more complex, and the prompts were much worse.

The goal of the GDPval study was to see whether it is possible to get the systems to succeed at common real-world tasks, and the goal of the RLI study was to see how the systems perform on more difficult, messier, complete real-world projects. That's the difference between 48% success and 2.5% success.

The real-world takeaway: if you want to use AI for real work, choose your tasks carefully and put effort into the context and prompts you give the system.

Quick Hits

Climate & Energy

How data center power demand could help lower electricity prices [Especially interesting because it’s counterintuitive.] — PBS News

We’ve mastered harvesting the sun, but storage is the game changer [An interesting comparison of refrigeration in food distribution to the growth of battery storage for electricity.] — The Electrotech Revolution

Job market

55 percent of employers regret laying off workers because of AI. “More people in charge of AI investment expect it to increase headcount (57 percent) than to decrease it (15 percent) over the next year.” — The Register

Science & Medicine

The first pregnancy via AI-detected sperm: the system found 5 motile sperm in 2.5 million images from a 39-year-old man with infertility. — The Lancet, via Eric Topol

Music

How Many AI Artists Have Debuted on Billboard’s Charts? [At least six AI or AI-assisted artists in the last few months.] — Billboard

The business of AI

OpenAI sued for trademark infringement over Sora's 'Cameo' feature [I had been wondering if this would happen. I immediately thought of the company Cameo when OpenAI used “cameo” in Sora.] — Engadget

Resources

AI Competencies for Academic Library Workers — Association of College and Research Libraries

What is AI Sidequest?

Are you interested in the intersection of AI with language, writing, and culture? With maybe a little consumer business thrown in? Then you’re in the right place!

I’m Mignon Fogarty: I’ve been writing about language for almost 20 years and was the chair of media entrepreneurship in the School of Journalism at the University of Nevada, Reno. I became interested in AI back in 2022 when articles about large language models started flooding my Google alerts. AI Sidequest is where I write about stories I find interesting. I hope you find them interesting too.

If you loved the newsletter, share your favorite part on social media and tag me so I can engage! [LinkedIn | Facebook | Mastodon]

Written by a human.