
LLMs have a HUGE flaw according to new Apple study

[Cover image: OpenAI logo]

Published: Oct 12, 2024, 7:35PM

A new study by six AI researchers at Apple argues that Large Language Models, such as OpenAI's ChatGPT or Google's Gemini, are not capable of genuine logical reasoning:

There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a bit of irrelevant info can give you a different answer

OpenAI benchmarks its LLMs on GSM8K, a dataset of grade-school math word problems. Three years ago, GPT-3 scored 35% on the test, while today's LLMs score more than 95%. The researchers asked themselves the following question:

has model 'reasoning' really improved? How much of this is genuine logical/symbolic reasoning? vs. pattern recognition, inadvertent data contamination, or overfitting?
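
For context, scores like the 35% and 95% figures above are typically computed by extracting the final number the model produces and comparing it to the reference answer. The sketch below is a rough, simplified illustration of that kind of scoring loop; the `problems` format and the `ask_model` call are hypothetical placeholders, not OpenAI's or the paper's actual evaluation harness.

```python
# A rough sketch of GSM8K-style scoring: compare the final number in the
# model's answer to the reference answer. `problems` and `ask_model` are
# hypothetical placeholders, not an actual evaluation harness.
import re

def final_number(text: str) -> str | None:
    """Return the last number appearing in the text, commas stripped."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_accuracy(problems, ask_model) -> float:
    """Fraction of problems whose final numeric answer the model gets right."""
    correct = sum(
        final_number(ask_model(p["question"])) == p["answer"] for p in problems
    )
    return correct / len(problems)
```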

To explore this, the group introduced a new LLM benchmark called "GSM-Symbolic", which reuses GSM8K questions but changes the names and numeric values:

The GSM-Symbolic benchmark revealed a decline in accuracy of roughly 10%!

Would a grade-school student's math test score vary by ~10% if we only changed the names?
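
The basic idea behind GSM-Symbolic is easy to picture: turn a GSM8K-style question into a template and re-instantiate it with different names and numbers while keeping the underlying reasoning identical. The sketch below illustrates that idea; the template and value ranges are invented for illustration, not the paper's actual generation pipeline.

```python
# A made-up GSM-Symbolic-style template: the reasoning is identical in every
# variant, only the name and the numbers change. The template and value
# ranges are invented for illustration; the paper uses its own templates.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")

def instantiate(seed: int) -> tuple[str, int]:
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    x, y = rng.randint(2, 40), rng.randint(2, 40)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

# A model that actually reasons should score the same on every variant.
for seed in range(3):
    question, answer = instantiate(seed)
    print(question, "->", answer)
```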

Okay, but do LLMs truly understand mathematical concepts? To test this, the group also developed a benchmark named "GSM-NoOp": "We add a single clause that seems relevant but doesn't contribute to the overall reasoning"
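
The GSM-NoOp trick can be sketched the same way: take a question, add a sentence that sounds relevant but changes nothing about the math, and check whether the model's answer moves. The distractor below is an invented example in the spirit of the paper, not one of its actual items.

```python
# A sketch of the GSM-NoOp idea: insert a clause that sounds relevant but has
# no bearing on the answer. The distractor sentence is an invented example,
# not one taken from the paper.
def add_noop_clause(question: str) -> str:
    distractor = ("Five of the apples picked on Tuesday were slightly "
                  "smaller than average. ")
    # Slip the irrelevant detail in right before the final question sentence.
    body, _, final_question = question.rpartition(". ")
    return f"{body}. {distractor}{final_question}"

question = ("Sophie picks 5 apples on Monday and 3 apples on Tuesday. "
            "How many apples does Sophie have in total?")
print(add_noop_clause(question))
# The correct answer is unchanged, so any accuracy drop on such questions
# points to pattern matching rather than genuine reasoning.
```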

The result: a massive decline in accuracy.

Conclusion:

Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question.

The researchers hypothesize that this is because LLMs are not performing genuine logical reasoning, but instead attempting to reproduce the reasoning steps seen in their training data.