The meaning of artificial general intelligence remains unclear


When Chinese AI startup DeepSeek burst onto the scene in January, it sparked intense chatter about its efficient and cost-effective approach to generative AI. But like its U.S. competitors, DeepSeek’s main goal is murkier than just efficiency: The company aims to create the first true artificial general intelligence, or AGI.

For years, AI developers — from small startups to big tech companies — have been racing toward this elusive endpoint. AGI, they say, would mark a critical turning point, enabling computer systems to replace human workers, making AI more trustworthy than human expertise and positioning artificial intelligence as the ultimate tool for societal advancement.

Yet, years into the AI race, AGI remains a poorly defined and contentious concept. Some computer scientists and companies frame it as a threshold for AI’s potential to transform society. Tech advocates suggest that once we have superintelligent computers, day-to-day life could fundamentally change, affecting work, governance and the pace of scientific discovery.

But many experts are skeptical about how close we are to an AI-powered utopia and about the practical utility of AGI. There’s limited agreement about what AGI means, and no clear way to measure it. Some argue that AGI functions as little more than a marketing term, offering no concrete guidance on how best to use AI models or on what their societal impact will be.

In tech companies’ quest for AGI, the public is tasked with navigating a landscape filled with marketing hype, science fiction and actual science, says Ben Recht, a computer scientist at the University of California, Berkeley. “It becomes very tricky. That’s where we get stuck.” Continuing to focus on claims of imminent AGI, he says, could muddle our understanding of the technology at hand and obscure AI’s current societal effects.

The definition of AGI is unclear

The idea behind “artificial general intelligence” dates to the mid-20th century. Initially, it denoted an autonomous computer capable of performing any task a human could, including physical activities like making a cup of coffee or fixing a car.

But as advancements in robotics lagged behind the rapid progress of computing, most in the AI field shifted to narrower definitions of AGI: initially, AI systems that could autonomously perform any task a human could do at a computer, and more recently, machines capable of executing most of the “economically valuable” tasks a human could handle at a computer, such as coding and writing accurate prose. Others think AGI should encompass flexible reasoning ability and autonomy when tackling a number of unspecified tasks.

“The problem is that we don’t know what we want,” says Arseny Moskvichev, a machine learning engineer at Advanced Micro Devices and computer scientist at the Santa Fe Institute. “Because the goal is so poorly defined, there’s also no roadmap for reaching it, nor reliable way to identify it.”

To address this uncertainty, researchers have been developing benchmark tests, similar to student exams, to evaluate how close systems are to achieving AGI.

For example, in 2019, French computer scientist and former Google engineer François Chollet released the Abstraction and Reasoning Corpus for Artificial General Intelligence, or ARC-AGI. In this test, an AI model is given a few example grids of colored squares arranged in different patterns; for each set of examples, it must then generate a new grid that completes the visual pattern, a task intended to assess flexible reasoning and the model’s ability to acquire new skills outside of its training. The setup is similar to Raven’s Progressive Matrices, a test of human reasoning.
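
To make that format concrete, here is a minimal sketch of how an ARC-style task can be represented, using a toy task with a made-up hidden rule (“mirror the grid left to right”); the grid values, the dictionary layout and the `apply_rule` helper are illustrative assumptions, not Chollet’s exact data format.

```python
# Toy ARC-style task: grids are 2-D lists of color indices.
# Hidden rule in this made-up example: mirror each grid left to right.
task = {
    "train": [  # demonstration pairs the solver gets to see
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0], [0, 4, 0]], "output": [[0, 3, 3], [0, 4, 0]]},
    ],
    "test": [  # the solver must produce the output grid for this input
        {"input": [[5, 0, 0], [0, 6, 0]]},
    ],
}

def apply_rule(grid):
    """The transformation a solver would have to infer from the examples."""
    return [list(reversed(row)) for row in grid]

print(apply_rule(task["test"][0]["input"]))  # [[0, 0, 5], [0, 6, 0]]
```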

The test results are part of what OpenAI and other tech companies use to guide model development and assessment. Recently, OpenAI’s soon-to-be-released o3 model achieved a vast improvement on ARC-AGI compared with previous AI models, leading some researchers to view it as a breakthrough in AGI. Others disagree.

“There’s nothing about ARC that’s general. It’s so specific and weird,” Recht says.

Computer scientist José Hernández-Orallo of the Universitat Politècnica de València in Spain says that it’s possible ARC-AGI just assesses a model’s ability to recognize images. Previous generations of language models could solve similar problems with high accuracy if the visual grids were described using text, he says. That context makes o3’s results seem less novel.

Plus, there’s a limited number of grid configurations, and some AI models with tons of computing power at their disposal can “brute force” their way to correct responses simply by generating all possible answers and selecting the one that fits best — effectively reducing the task to a multiple-choice problem rather than one of novel reasoning.
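
As a rough illustration of that brute-force framing (a deliberately simplistic sketch, not how o3 or any real system works), a solver could generate a pool of candidate answers by applying a small catalog of transformations, score each transformation against the demonstration pairs, and keep the answer produced by the best-scoring one, effectively turning the task into multiple choice. The catalog and scoring function below are hypothetical, and the code reuses the `task` dictionary from the earlier sketch.

```python
# Toy "multiple choice" brute force over the task dictionary defined above.
# CANDIDATE_RULES and rule_score are illustrative stand-ins, not a real method.
CANDIDATE_RULES = {
    "identity":  lambda g: [row[:] for row in g],
    "mirror_lr": lambda g: [list(reversed(row)) for row in g],
    "flip_ud":   lambda g: [row[:] for row in reversed(g)],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def rule_score(rule, train_pairs):
    """Fraction of demonstration pairs the rule reproduces exactly."""
    return sum(rule(p["input"]) == p["output"] for p in train_pairs) / len(train_pairs)

def brute_force_answer(task):
    best = max(CANDIDATE_RULES.values(), key=lambda r: rule_score(r, task["train"]))
    return best(task["test"][0]["input"])

print(brute_force_answer(task))  # [[0, 0, 5], [0, 6, 0]]
```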

To tackle each ARC-AGI task, o3 uses an enormous amount of computing power (and money) at test time. Operating in an efficient mode, it costs about $30 per task, Chollet says. In a less-efficient setting, one task can cost about $3,000. Just because the model can solve the problem doesn’t mean it’s practical or feasible to routinely use it on similarly challenging tasks.

AI tests don’t capture real-world complexity

It’s not just ARC-AGI that’s contentious. Determining whether an AI model counts as AGI is complicated by the fact that every available test of AI ability is flawed. Just as Raven’s Progressive Matrices and other IQ tests are imperfect measures of human intelligence and face constant criticism for their biases, AGI evaluations are flawed too, says Amelia Hardy, a computer scientist at Stanford University. “It’s really hard to know that we’re measuring [what] we care about.”

OpenAI’s o3, for example, correctly responded to more than a quarter of the questions in a collection of exceptionally difficult problems called the FrontierMath benchmark, says company spokesperson Lindsay McCallum. These problems take professional mathematicians hours to solve, according to the benchmark’s creators. On its face, o3 seems successful. But this success may be partly due to OpenAI funding the benchmark’s development and having access to the testing dataset while developing o3. Such data contamination is a continual difficulty in assessing AI models, especially for AGI, where the ability to generalize and abstract beyond training data is considered crucial.
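
One simplified way to picture what a contamination audit looks for (a hypothetical sketch; real audits rely on far more sophisticated near-duplicate and paraphrase detection) is a check for benchmark questions that appear verbatim in a model’s training text. All of the data below is made up.

```python
# Naive contamination check: flag benchmark items that occur verbatim in the
# training corpus. Both lists are fabricated examples for illustration only.
def flag_contaminated(benchmark_questions, training_corpus):
    corpus_text = " ".join(training_corpus).lower()
    return [q for q in benchmark_questions if q.lower() in corpus_text]

training_corpus = [
    "...scraped web text...",
    "what is the smallest prime greater than 100? answer: 101",
]
benchmark_questions = [
    "What is the smallest prime greater than 100?",
    "Prove there are infinitely many primes of the form 4k + 3.",
]

print(flag_contaminated(benchmark_questions, training_corpus))
# ['What is the smallest prime greater than 100?']
```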

AI models can also seem to perform very well on complex tasks, like accurately responding to Ph.D.-level science questions, while failing on more basic ones, like counting the number of r’s in “strawberry.” This discrepancy indicates a fundamental misalignment in how these computer systems process queries and understand problems.

Yet AI developers aren’t collecting and sharing the sort of information that might help researchers gauge why such discrepancies arise, Hernández-Orallo says. Many developers provide only a single accuracy value for each benchmark, as opposed to a detailed breakdown of which types of questions a model answered correctly and incorrectly. Without that detail, it’s impossible to determine where a model is struggling, why it’s succeeding, or whether any single test result demonstrates a breakthrough in machine intelligence, experts say.

Even if a model passes a specific, quantifiable test with flying colors, such as the bar exam or medical boards, there are few guarantees that those results will translate to expert-level human performance in messy, real-world conditions, says David Rein, a computer scientist at the nonprofit Model Evaluation and Threat Research based in Berkeley, Calif.

For instance, when asked to write legal briefs, generative AI models still routinely fabricate information. Although one study of GPT-4 suggested that the chatbot could outperform human physicians in diagnosing patients, more detailed research has found that comparable AI models perform far worse than actual doctors when faced with tests that mimic real-world conditions. And no study or benchmark result indicates that current AI models should be making major governance decisions over expert humans.

The benchmarks that OpenAI, DeepSeek and other companies report results from “do not tell us much about capabilities in the real world,” Rein says, although they can provide reasonable information for comparing models to one another.

So far, researchers have tested AI models largely by providing them with discrete problems that have known answers. However, humans don’t always have the luxury of knowing what the problem before them is, whether it’s solvable or in what time frame. People can identify key problems, prioritize tasks and, crucially, know when to give up. It’s not yet clear that machines can or do. Even the most advanced “autonomous” agents struggle with tasks as simple as ordering pizza or groceries online.

General intelligence doesn’t dictate impact

Large language models and neural networks have improved dramatically in recent months and years. “They’re definitely useful in a lot of different ways,” Recht says, pointing to the ability of newer models to summarize and digest data or produce serviceable computer code with few mistakes. But attempts like ARC-AGI to measure general ability don’t necessarily clarify what AI models can and can’t be used for. “I don’t think it matters whether or not they’re artificially generally intelligent,” he says.

What might matter far more, based on the recent DeepSeek news, are traditional metrics like cost per task. Utility is determined both by the quality of a tool and by whether that tool is affordable enough to use at scale. Intelligence is only part of the equation.
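
A back-of-the-envelope way to see that trade-off (using made-up numbers, not benchmark results) is to compare models by correct answers per dollar rather than by accuracy alone:

```python
# Hypothetical comparison: a cheaper, less accurate model can still deliver
# more correct answers per dollar than a costly, more accurate one.
def correct_per_dollar(accuracy, cost_per_task):
    return accuracy / cost_per_task

models = {
    "pricey_model": {"accuracy": 0.85, "cost_per_task": 30.00},  # made up
    "budget_model": {"accuracy": 0.60, "cost_per_task": 0.05},   # made up
}

for name, m in models.items():
    print(name, round(correct_per_dollar(m["accuracy"], m["cost_per_task"]), 3))
# pricey_model 0.028, budget_model 12.0
```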

AGI is supposed to serve as a guiding light for AI developers. If achieved, it’s meant to herald a major turning point for society, beyond which machines will function independently, on par with or above humans. But so far, AI has had major societal impacts, both good and bad, without any consensus on whether we’re nearing (or have already surpassed) this turning point, Recht, Hernández-Orallo and Hardy say.

For example, scientists are using AI tools to create new, potentially lifesaving molecules. Yet in classrooms worldwide, generative chatbots have disrupted assessments. A recent Pew Research Center survey found that more and more U.S. teens are outsourcing assignments to ChatGPT. And a 2023 study in Nature reported that growing AI assistance in university courses has made cheating harder to detect.

To say that AI will become transformative once we reach AGI ignores all the trees for the forest.


