GCC AI Research

LLMs tackle math word problems

MBZUAI · Notable

Summary

MBZUAI researchers presented a study at NAACL 2024 analyzing the errors open-source LLMs make when solving math word problems. The study, led by Ekaterina Kochmar and KV Aditya Srivatsa, investigates the characteristics that make math word problems difficult for machines. The researchers used Llama2-70B to test how well LLMs solve these problems and found that a model can perform the math operations correctly yet still give the wrong answer. Why it matters: The research aims to improve AI's ability to understand and solve math word problems, potentially leading to better educational applications and teaching methods.

Related

Solving complex problems with LLMs: A new prompting strategy presented at NeurIPS

MBZUAI

Researchers from MBZUAI and King's College London have developed a new prompting strategy called self-guided exploration to improve LLM performance on combinatorial problems. The method was tested on complex challenges like the traveling salesman problem. The findings will be presented at the 38th Annual Conference on Neural Information Processing Systems (NeurIPS) in Vancouver. Why it matters: This research could lead to practical applications of LLMs in industries like logistics, planning, and scheduling by offering new approaches to computationally complex problems.

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

arXiv

A new method is proposed to reduce the verbosity of LLMs in step-by-step reasoning by retaining moderately easy problems during Reinforcement Learning with Verifiable Rewards (RLVR) training. Keeping these problems acts as an implicit length regularizer, preventing the model from excessively increasing output length on harder problems. In experiments with Qwen3-4B-Thinking-2507, the model matches baseline accuracy with solutions nearly half as long.

AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

arXiv

This paper benchmarks reasoning-focused LLMs, especially DeepSeek models, on fifteen Arabic NLP tasks using zero-shot, few-shot, and fine-tuning strategies. Key findings: three in-context examples improve F1 scores by over 13 points on classification tasks; DeepSeek outperforms GPT-4-mini by 12 F1 points on complex inference tasks in the zero-shot setting; and LoRA fine-tuning yields up to an additional 8 points in F1 and BLEU. Why it matters: The systematic evaluation shows how LLMs perform on Arabic NLP and which strategies improve that performance most, contributing to the development of more capable Arabic language models.