Math Word Problem (MWP) solving involves understanding mathematical questions expressed in natural language and deriving the appropriate mathematical equations. Traditional approaches heavily rely on simple lexical pattern matching, limiting their flexibility in diverse real-world scenarios. Our co-authored paper introduces ATHENA (Attention-based THought Expansion Network Architecture), designed to mimic human cognitive processes for more generalized and robust mathematical reasoning.
Let's explore how ATHENA achieves robust performance and why this matters for mathematical reasoning in AI.
Research Background
Math word problem (MWP) solving involves translating complex linguistic descriptions into mathematical expressions. Traditional models tend to memorize lexical patterns rather than understand mathematical principles and procedures, limiting their ability to generalize to unseen or slightly varied problems.
Consider the following cases: calculating the area of a rectangle and determining how many items can be evenly distributed across containers. While both involve multiplicative reasoning, they call for different kinds of conceptual understanding.
Previous methods have struggled with two key aspects of human-level understanding:

• Conceptual knowledge: understanding how mathematical principles apply in various contexts.
• Procedural knowledge: the ability to deduce answers step by step through logical reasoning.
ATHENA is specifically designed to bridge this gap, enabling the model to expand its reasoning capabilities by mimicking human cognitive processes.
Methodology: ATHENA
ATHENA employs a two-step reasoning process (a high-level code sketch follows shortly):

• Candidate Thought Generation: at each reasoning stage, ATHENA generates multiple potential "thoughts", that is, possible mathematical expressions derived from previous steps.
• Reasonable Thought Selection: it then evaluates these candidates based on contextual relevance, progressively narrowing toward a correct mathematical expression.
This iterative method mirrors human cognitive expansion, generating diverse reasoning pathways and selecting the most contextually relevant and mathematically sound options.
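To make the loop concrete, here is a minimal, hedged sketch of the expansion-selection cycle in Python. The function names `generate` and `select` are placeholders for the two stages detailed in the following sections, not the paper's actual API.

```python
def thought_expansion(initial_thoughts, max_depth, generate, select):
    """High-level sketch of ATHENA's loop: expand candidates, then filter.

    `generate` and `select` stand in for the candidate-generation and
    reasonable-thought-selection stages described in the next sections.
    """
    thoughts = initial_thoughts
    for depth in range(1, max_depth + 1):
        candidates = generate(thoughts, depth)           # step 1: expand
        thoughts, confident = select(candidates, depth)  # step 2: filter
        if confident:            # a confident final thought ends the search early
            break
    return thoughts
```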
Preliminaries
Before we get into the details of ATHENA, let's first clarify what a thought is, and introduce the roles of the premise vector and goal vector, which guide the reasoning process.
Thought
In this study, a thought ($t$) is defined as an embedding of a possible mathematical expression ($e$) derived from the quantities in a problem, representing the contextual meaning of the expression. The objective of the model is to find a thought whose expression matches the ground-truth expression ($e^*$).
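As a mental model, a thought can be pictured as an expression string paired with its contextual embedding. The following dataclass is purely illustrative; the paper does not prescribe this structure.

```python
from dataclasses import dataclass
import torch

@dataclass
class Thought:
    """A candidate mathematical expression and its contextual embedding."""
    expression: str          # e.g. "40 + 15"
    embedding: torch.Tensor  # contextual vector t representing the expression
```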
Premise Vector
A premise vector ($p_d$) encodes previously inferred thoughts and is used to assess and filter candidate thoughts at each reasoning depth $d$. The initial premise vector ($p_0$) is initialized with the embedding of the [CLS] token from the problem description.
Goal Vector
A goal vector ($g$) serves as a reference for determining whether a thought is an appropriate answer to the given question. It is defined as the token embedding of the final punctuation mark (e.g., the question mark at the end of the sentence) in the problem description.
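Both vectors can be read directly off a pretrained encoder's output. The sketch below assumes a Hugging Face BERT-style encoder; the checkpoint name and the example problem are placeholders, not the paper's actual configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder encoder
enc = AutoModel.from_pretrained("bert-base-uncased")

problem = "A playground was 40 m wide. Later it was widened by 15 m. How much larger is the area?"
inputs = tok(problem, return_tensors="pt")
hidden = enc(**inputs).last_hidden_state                   # (1, seq_len, dim)

p0 = hidden[:, 0:1, :]                  # [CLS] embedding -> initial premise p_0
ids = inputs["input_ids"][0]
qpos = int((ids == tok.convert_tokens_to_ids("?")).nonzero()[-1])
g = hidden[:, qpos:qpos + 1, :]         # question-mark embedding -> goal vector g
```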
Thought Expansion
At each reasoning depth, the thought expansion process generates candidate thoughts and filters them to obtain the reasonable thoughts that serve as intermediate steps toward the final answer.
Candidate Thought Generation
Initial Thoughts ($T_0$):
Initial thoughts are embeddings that represent each quantity mentioned in the problem context or question. These embeddings encode the contextual semantics of the quantities and serve as the starting point for reasoning.
At each reasoning depth $d$, the model generates a set of possible new thoughts $\tilde{T}_d$ from the previously selected reasonable thoughts $T_{d-1}$ as the candidates. A new candidate thought is obtained by combining previous thoughts with an arithmetic operation:

$t_{\mathrm{new}} = t_i \circ t_j$, where $t_i, t_j \in T_{d-1}$ and $\circ \in \{+, -, \times, \div\}$.
To enable this composition process, ATHENA introduces two operation layers, merge ($f_{\mathrm{merge}}$) and transform ($f_{\mathrm{trans}}$), to model the fundamental properties of arithmetic operations.
Merge Layer ($f_{\mathrm{merge}}$)
The merge layer takes a pair of thoughts $(t_i, t_j)$ and combines them into a new thought $t_{i \oplus j} = f_{\mathrm{merge}}(t_i, t_j)$ to model operations like addition or multiplication.
This layer is implemented using a feed-forward network ($\mathrm{FF}$), multi-head self-attention ($\mathrm{MHA}$), and layer normalization ($\mathrm{LN}$).
Transform Layer ($f_{\mathrm{trans}}$)
The transform layer takes a single thought $t$ and applies an inverse operation, such as subtraction or division, to produce a new thought $t' = f_{\mathrm{trans}}(t)$.
Its implementation is simpler than that of the merge layer.
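A minimal PyTorch sketch of the two operation layers is shown below. The exact wiring of attention, feed-forward, and normalization in the paper may differ; treat the module structure, pooling, and hyperparameters here as assumptions.

```python
import torch
import torch.nn as nn

class MergeLayer(nn.Module):
    """Combines a pair of thoughts into one (models operations like + or x)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, t_i: torch.Tensor, t_j: torch.Tensor) -> torch.Tensor:
        pair = torch.stack([t_i, t_j], dim=1)     # (B, 2, dim): pair as a 2-token sequence
        attended, _ = self.mha(pair, pair, pair)  # self-attention over the pair
        merged = attended.mean(dim=1)             # pool the two positions into one thought
        return self.ln(merged + self.ff(merged))  # (B, dim)

class TransformLayer(nn.Module):
    """Maps a single thought to an 'inverse' counterpart (models - or /)."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln = nn.LayerNorm(dim)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.ln(t + self.ff(t))            # residual feed-forward, one simple reading
```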
Layer Scheduling:
The two layers are applied alternately during the reasoning process: the transform layer is applied at odd depths ($d = 1, 3, \ldots$), while the merge layer is applied at even depths ($d = 2, 4, \ldots$). At the initial depth ($d = 0$), the initial thoughts $T_0$ are used directly as the candidates.
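Combining candidate generation with this schedule, a hedged sketch follows. The expression strings and the generic "⊕"/"inv" labels are illustrative bookkeeping only; the actual model distinguishes the concrete operations.

```python
from itertools import combinations

def generate_candidates(thoughts, depth, merge, transform):
    """Candidate thoughts at a given depth from (expression, embedding) pairs.

    Odd depths apply the transform layer to single thoughts; even depths
    merge pairs of thoughts. Depth 0 passes the initial thoughts through.
    """
    if depth == 0:
        return list(thoughts)
    if depth % 2 == 1:                            # odd depth: transform singles
        return [(f"inv({e})", transform(t)) for e, t in thoughts]
    candidates = []                               # even depth: merge pairs
    for (e_i, t_i), (e_j, t_j) in combinations(thoughts, 2):
        candidates.append((f"({e_i} ⊕ {e_j})", merge(t_i, t_j)))
    return candidates
```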
Reasonable Thought Selection
After obtaining the candidate thoughts $\tilde{T}_d$, the model selects the reasonable thoughts $T_d$ from which the final thought $\hat{t}$ will eventually be drawn. At each reasoning depth $d$, this selection is performed by the inference layer ($\mathrm{Inf}$), guided by the premise vector $p_d$.
Inference and Premise Update
Given the set of candidate thoughts $\tilde{T}_d$ generated at depth $d$, ATHENA evaluates each thought by computing its correlation with the current premise vector $p_d$. This correlation indicates how compatible a new thought is with the prior reasoning. To score each candidate, the model applies multi-head attention ($\mathrm{MHA}$) and a feed-forward network ($\mathrm{FF}$).
A thought is considered reasonable if its correlation score exceeds a predefined threshold $\delta$.
Once the reasonable thoughts are selected, ATHENA updates the premise vector to reflect the new reasoning context: the updated premise vector $p_{d+1}$ is constructed by concatenating all reasonable thoughts, after multi-head attention, to the current premise using the parameters of the inference layer $\mathrm{Inf}$.
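A hedged sketch of the inference step and premise update, assuming batch size 1. The scoring head, sigmoid gating, and the exact way selected thoughts are appended to the premise are assumptions, not the paper's verified design.

```python
import torch
import torch.nn as nn

class InferenceLayer(nn.Module):
    """Scores candidate thoughts against the premise and selects the reasonable ones."""
    def __init__(self, dim: int, heads: int = 4, threshold: float = 0.5):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.threshold = threshold               # the delta from the text

    def forward(self, candidates: torch.Tensor, premise: torch.Tensor):
        # candidates: (1, N, dim); premise: (1, P, dim)
        attended, _ = self.mha(candidates, premise, premise)   # attend candidates to premise
        scores = torch.sigmoid(self.ff(attended)).squeeze(-1)  # (1, N) correlation scores
        keep = scores > self.threshold                         # reasonable-thought mask
        return scores, keep, attended

def update_premise(premise, attended, keep):
    """Append the attended embeddings of the selected thoughts to the premise (batch size 1)."""
    selected = attended[0][keep[0]]              # (n_selected, dim)
    return torch.cat([premise, selected.unsqueeze(0)], dim=1)
```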
Termination and Final Thought Selection
The final thought $\hat{t}$ represents the model’s answer to the question. ATHENA terminates the thought expansion process based on one of two criteria:
(1) when the reasoning depth reaches the predefined maximum $D$, or
(2) when any reasonable thought achieves a confidence score exceeding a predefined threshold $\tau$.
To determine the final answer, the model computes a score $s(t)$ for each reasonable thought $t$ by evaluating its alignment with the goal vector $g$, using a feed-forward network ($\mathrm{FF}$) and multi-head attention ($\mathrm{MHA}$).
The thought with the highest score is selected as the final thought: $\hat{t} = \operatorname{argmax}_{t \in T_d}\, s(t)$.
This final thought encapsulates the reasoning path most aligned with the goal, and the model outputs the mathematical expression associated with $\hat{t}$ as the final answer.
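The termination check and final selection might look as follows; the goal-alignment head here is an assumption consistent with the description above, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GoalScorer(nn.Module):
    """Scores each reasonable thought by its alignment with the goal vector g."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, thoughts: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # thoughts: (1, N, dim); goal: (1, 1, dim)
        attended, _ = self.mha(thoughts, goal, goal)           # attend thoughts to the goal
        return torch.sigmoid(self.ff(attended)).squeeze(-1)    # (1, N) alignment scores

def pick_final(expressions, scores, tau: float = 0.9):
    """Return the best expression and whether its score clears the confidence threshold tau."""
    best = int(scores.argmax(dim=-1))
    return expressions[best], float(scores[0, best]) > tau     # criterion (2): early stop
```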
Experiment and Analysis
To evaluate ATHENA's performance, the study used standard benchmark datasets such as MAWPS, ASDiv-A, and Math23k, which cover a diverse range of math word problem types and linguistic styles. Additionally, to assess the model's ability to generalize across contextually related but lexically varied problems, the authors included SVAMP (a variant of ASDiv-A) and UnbiasedMWP, both of which are designed to measure performance free from biases memorized during training.
The baselines used for comparison include representative MWP approaches such as Transformer, GTS (Goal-driven Tree-Structured model), Graph-to-Tree, and the more recent reasoning-based model DeductReasoner, all of which were evaluated alongside ATHENA.
One-to-Many Test (1:N)
The One-to-Many test evaluates a model's ability to generalize mathematical reasoning across multiple questions that share the same context. Specifically, only one question from each context group is used for training, while the remaining questions are used for testing.
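The protocol is easy to state in code. In this sketch, the `context` and `question` fields and the grouping key are hypothetical stand-ins for however the dataset encodes context groups.

```python
import random
from collections import defaultdict

def one_to_many_split(problems, seed=0):
    """1:N split: train on one question per shared context, test on the rest."""
    groups = defaultdict(list)
    for p in problems:
        groups[p["context"]].append(p)    # group questions that share a context
    rng = random.Random(seed)
    train, test = [], []
    for variants in groups.values():
        rng.shuffle(variants)
        train.append(variants[0])         # exactly one question per group for training
        test.extend(variants[1:])         # the remaining variants are held out
    return train, test
```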
ATHENA demonstrates strong answer accuracy across multiple MWP benchmarks, setting new state-of-the-art results and significantly outperforming previous models, including DeductReasoner, especially on challenging one-to-many generalization tests such as UnbiasedMWP (1:N). Remarkably, ATHENA also shows substantial improvements even with minimal additional training data, highlighting its generalization capability and data efficiency. Moreover, ATHENA's incorrect predictions depend far less on memorized training examples than those of the baselines, indicating that it learns underlying mathematical principles rather than merely replicating learned patterns.
Thought Visualization
To better interpret ATHENA’s reasoning process, the authors visualized attention scores between each reasonable thought and the input problem text. As shown in the figure, most of the initial thoughts are closely linked to terms like “playground”, while thoughts carrying the meaning of increased sizes strongly attend to “later”. The thoughts about width, such as [15] or [40+15] show high attention on the word “width”. Similarly, area-related thoughts focus on words like “square meter” or “area”, and the final thought shows strong alignment with “compared”, indicating that the model correctly associates it with computing a difference. This visualization provides insight into how ATHENA aligns its intermediate reasoning with key semantic cues in the problem.
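A similar inspection can be reproduced with any thought and token embeddings. The paper visualizes attention scores; the sketch below substitutes cosine similarity as a simple stand-in, so treat it as an approximation rather than the authors' procedure.

```python
import torch

def top_attended_tokens(thought_emb, token_embs, tokens, top_k=3):
    """Rank problem tokens by similarity to a thought embedding.

    thought_emb: (dim,); token_embs: (seq_len, dim); tokens: list of strings.
    """
    sims = torch.cosine_similarity(token_embs, thought_emb.unsqueeze(0), dim=-1)
    top = sims.topk(min(top_k, len(tokens)))
    return [(tokens[int(i)], float(sims[int(i)])) for i in top.indices]
```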
Conclusion and Limitations
This study presents ATHENA, a novel reasoning framework that leverages thought expansion to achieve robust performance on diverse and previously unseen math word problems. By explicitly modeling intermediate reasoning steps, ATHENA demonstrates strong generalization capabilities in contexts where mathematical operations are expressed in varied linguistic forms. Nonetheless, the current work is limited to arithmetic problems involving single-equation reasoning. Although the architecture is extensible to multi-equation settings, such evaluation was excluded to ensure comparability with existing baselines. Furthermore, comparisons with large language models (LLMs) were not conducted, as the primary objective was to assess reasoning performance under limited-data conditions.
References
Kim, J., Kim, H., Hahn, J., & Han, Y. S. (2023). ATHENA: Mathematical Reasoning with Thought Expansion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 16315–16327).
Co-author: Hyunji Kim (Hazel Kim)
She received her master's degree in Artificial Intelligence from Yonsei University and worked as an AI Researcher at Classting. She is currently a PhD student at the University of Oxford. Her research interests include natural language processing, learning with limited data, and the uncertainty and controllability of language models.