Guest contributor: Deepanshu Sharma is an independent analyst, researcher, and writer decoding AI capabilities, evaluating benchmarks, and simplifying the societal impacts of these emerging technologies.

I was recently watching a Demis Hassabis interview, and one line of his has stuck with me. Stanford President Jonathan Levin asked, “Demis, if you were back in school, what would you be studying, and what would be your advice to students about what to study in their careers?”

I want to stop here for a moment.

This question relates to a specific pattern that has emerged among university students in the last month, and it shows what is on their minds about the impact of AI on their professional lives.

Several prominent speakers faced backlash from students on stage for discussing and using artificial intelligence in their commencement speeches. It reflects a growing perception among students that AI will wipe out all jobs.

Current evidence suggests otherwise. Take GDPval, an evaluation framework introduced by OpenAI in September 2025 that covers 44 occupations across the top 9 industries. It evaluates the latest AI models on real-world tasks, but its results also tell us why a human is still required in the loop.

GDPval measures the performance of an LLM on 1,320 specialized tasks. It is presented as a benchmark that represents the real capabilities of LLMs by comparing their output directly to the everyday work people actually do.

In the original paper, Claude Opus 4.1 achieved 47.6%, which was presented as almost reaching the level of an industry expert. On the surface, it looks as if LLMs are good enough to compete with experts across many domains at once, while human experts are limited to their own domain.

Initial Assessment of Frontier Models from the Original Paper

The reality is more nuanced, and there are reasons for that.

First is the limited scope of work that can actually be done by LLMs. This benchmark evaluates the performance of LLMs solely on “well-specified digital tasks”. If I were to summarize it:

  • This benchmark does not cover all the tasks of any single occupation, only a part of them.

  • Then, from that subset, it only covers tasks that can be done digitally.

  • Even then, it only includes tasks that can be clearly specified from the start, meaning it cannot cover any scenario where information changes midway through.

Limitations from the Original GDPval Paper.

Second, I want to look closely at the speed and cost improvements.

If these systems can complete a task faster than a human, then it makes sense to use them. However, it completely defeats the purpose if reviewing their work ends up taking more time overall.

According to the initial paper, only GPT-5 shows some meaningful speed improvements. In fact, looking at the data, using GPT-4o actually slowed the work down.

Furthermore, this evaluation ignores the impact of catastrophic mistakes: instances where the cost of a single error can wipe out any savings gained from using AI in the first place. The paper highlights examples such as insulting a customer, giving a wrong medical diagnosis, recommending fraud, or suggesting actions that could cause physical harm. According to their own evaluation, this occurs 2.7% of the time.

Speed & Cost Improvement Table from the Original GDPval Paper.
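The trade-off described above (time saved versus time spent reviewing, plus the rare catastrophic error) can be sketched as a small expected-time calculation. This is only an illustration, not the method used in the GDPval paper. The class, the function names, and every number in it are hypothetical, and it assumes the 2.7% figure can be read as a per-task probability, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class TaskEconomics:
    """Hypothetical inputs for one task; none of these come from the paper."""
    human_minutes: float              # expert does the task unaided
    review_minutes: float             # expert reviews the model's draft
    accept_rate: float                # share of drafts usable as-is
    catastrophic_rate: float          # share of drafts with a catastrophic error
    catastrophic_cost_minutes: float  # cleanup cost of one such error, in minutes


def expected_assisted_minutes(t: TaskEconomics) -> float:
    """Expected expert time per task when a model drafts first.

    The expert always pays the review time. When the draft is rejected,
    the expert redoes the task by hand. Catastrophic errors add their
    expected cleanup cost on top.
    """
    redo = (1.0 - t.accept_rate) * t.human_minutes
    damage = t.catastrophic_rate * t.catastrophic_cost_minutes
    return t.review_minutes + redo + damage


def speedup(t: TaskEconomics) -> float:
    """Values above 1.0 mean the assisted workflow saves time overall."""
    return t.human_minutes / expected_assisted_minutes(t)


# Made-up scenarios: a strong model, a weak model, and the strong model
# once a rare but expensive failure is priced in (assumed 2.7% per task).
strong = TaskEconomics(400, 100, 0.5, 0.0, 0.0)
weak = TaskEconomics(400, 100, 0.1, 0.0, 0.0)
strong_with_risk = TaskEconomics(400, 100, 0.5, 0.027, 10_000)

for name, case in [("strong", strong), ("weak", weak), ("strong + risk", strong_with_risk)]:
    print(f"{name}: speedup = {speedup(case):.2f}x")
```

With these invented inputs, the strong model comes out faster than working unaided, the weak model comes out slower because most drafts are reviewed and then redone, and pricing in a rare expensive failure is enough to flip the strong model below break-even as well.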

Two weeks ago, Sam Altman clarified this when an interviewer asked, “When you introduced GPT-5.2, you said it outperforms professionals across 44 occupations. You can understand why there may be a backlash when people hear things like that.”

Sam said:

“What I wish we had said then is that it outperforms professionals at small tasks in 44 occupations, which is, I think, a more accurate thing.”

Excerpts of GDPval from GPT-5.2 release notes.

This discussion of GDPval and related claims (like the one made for GPT-5.2) matters because we still use this benchmark to test the latest models, and it was one of the reasons for the prevailing misconception that AI will soon wipe out all jobs. People rarely remember the actual benchmark; instead, the emotional narrative spreads that AI is already beating experts in every domain.

I wanted to clear the air. But in AI, one month feels like one year.

So much has changed in just the last couple of days, and I want to shift focus to the latest models like Fable 5 and how they handle real-world complexities.

On June 19th, Artificial Analysis introduced a new benchmark called AA-Briefcase. It tests models on agentic knowledge work involving practical complexities, such as multi-week projects requiring thousands of inputs (company documents, meeting transcripts, large-scale data exports, Slack messages, and emails).

Fable 5 performed better than the other models, yet it successfully completed only 3% of the tasks. Furthermore, on 31 of the 91 tasks (roughly a third), not a single model scored above 50% on the rubric criteria.

This shows how limited these models still are in real-world settings. The reasons for the low scores were:

  • Strong models fail by missing nuanced requirements hidden within the task or source files.

  • Less capable models struggle with basic execution, often ignoring input files, delivering unusable work, or producing nothing at all.

It’s true that AI is far from eliminating professionals. However, every profession has many sub-tasks where AI significantly outperforms us, and failing to automate those will be a costly mistake.

I know AI can summarize multiple academic papers faster than I can, extract data from long documents, find insights I might otherwise miss, and help me understand new perspectives.

The list goes on.

AI is the exact reason I am excited about learning new things. Now, I don’t have to worry about not knowing the basics of a subject. Research is only one prompt away.

In my opinion, two types of people will win with AI. The first are those who are already experts in their domain (or intend to become experts); they know exactly which tasks to delegate to AI, freeing them up to work on the rest and innovate. The second are those who will use AI to rapidly learn new things, specifically focusing on vertical integration to expand their footprint into new domains.

I want to go back to Demis’s answer to that question. I have never heard a better explanation of this sensitive subject from anyone else. Demis said:

“I would be really excited, if I was back at college now. Those of you doing science, STEM subjects, mathematics and computer science, still do those things. You'll be able to take better advantage of these tools if you understand how they are put together and what they're capable of. That's gonna be true for the next period, the next 10 years at least.”

He went on to say, “I would also lean in, though, to not wish it away. The genie's not going back in the bottle.”

It’s true that we can’t go back to the pre-AI era, whether in our personal or professional lives. AI will seep in more and more as time passes. The only way to tackle this generational challenge is to get this technology on our side. We must find better ways to execute our next big task so that it takes substantially less time than before, and hand off entirely the tasks that AI is already good at.

I hope I was able to add a little value to your day. Thank you so much for your time, and have an absolutely wonderful day!
