
ATHENA: Mathematical Reasoning with Thought Expansion

Updated: 2025/07/07 06:24
Keywords: NLP, Math Word Problem, ML
Math Word Problem (MWP) solving involves understanding mathematical questions expressed in natural language and deriving the appropriate mathematical equations. Traditional approaches heavily rely on simple lexical pattern matching, limiting their flexibility in diverse real-world scenarios. Our co-authored paper introduces ATHENA (Attention-based THought Expansion Network Architecture), designed to mimic human cognitive processes for more generalized and robust mathematical reasoning.
Let's explore how ATHENA achieves robust performance and why this matters for mathematical reasoning in AI.

Research Background

Math word problem (MWP) solving involves translating complex linguistic descriptions into mathematical expressions. Traditional models tend to memorize lexical patterns rather than understand mathematical principles and procedures, limiting their ability to generalize to unseen or slightly varied problems.
Consider two cases: calculating the area of a rectangle, and determining how many items go into each container when a collection is distributed evenly. Both reduce to simple arithmetic, yet they call for different kinds of conceptual understanding.
Previous methods have struggled with two key aspects in reaching human-level understanding:
Conceptual knowledge: Understanding how mathematical principles apply in various contexts.
Procedural knowledge: The ability to deduce answers step by step through logical reasoning.
ATHENA is specifically designed to bridge this gap, enabling the model to expand its reasoning capabilities by mimicking human cognitive processes.

Methodology: ATHENA

ATHENA employs an innovative two-step reasoning process:
Candidate Thought Generation: At each reasoning stage, ATHENA generates multiple potential "thoughts"—possible mathematical expressions derived from previous steps.
Reasonable Thought Selection: It then evaluates these candidates based on context relevance, progressively narrowing toward a correct mathematical expression.
This iterative method mirrors human cognitive expansion, generating diverse reasoning pathways and selecting the most contextually relevant and mathematically sound options.
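To make this loop concrete, here is a toy, purely symbolic sketch in Python. The string expressions and the `keep` predicate are illustrative stand-ins for ATHENA's learned components; the actual model expands and scores embeddings, not strings.

```python
from itertools import combinations

def expand(thoughts):
    """Step 1: generate candidate thoughts from pairs of previous thoughts."""
    candidates = set()
    for a, b in combinations(thoughts, 2):
        for op in "+-*/":
            candidates.add(f"({a}{op}{b})")
    return candidates

def select(candidates, keep):
    """Step 2: keep only the contextually reasonable candidates.
    `keep` is a stand-in predicate for the learned inference layer."""
    return {c for c in candidates if keep(c)}

quantities = ["40", "15"]  # initial thoughts: the quantities in the problem
step1 = select(expand(quantities), lambda c: "*" not in c and "/" not in c)
print(step1)  # {'(40+15)', '(40-15)'}
```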

Preliminaries

Before we get into the details of ATHENA, let's first clarify what a thought is, and introduce the roles of the premise vector and goal vector, which guide the reasoning process.
Thought
In this study, a thought ($\theta \in \mathbb{R}^H$) is defined as an embedding of a possible mathematical expression ($\mathcal{E}(\theta)$), derived from the quantities in a problem, that represents the contextual meaning of the expression. The objective of the model is to find a thought $\theta^*$ that satisfies the ground-truth expression $\mathcal{E}^*$.
Premise Vector
A premise vector ($\text{P}_d$) encodes previously inferred thoughts and is used to assess and filter candidate thoughts at each reasoning depth $d$. The initial premise vector ($\text{P}_0$) is initialized with the embedding of the [CLS] token from the problem description.
Goal Vector
A goal vector ($G$) serves as a ground-truth reference to determine whether a thought is an appropriate answer to the given question. It is defined using the token embedding of the punctuation mark (e.g., the question mark at the end of the sentence) in the problem description.
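Under assumed shapes and a stand-in encoder, these three ingredients could be represented as in the sketch below; `H = 768` and the random `token_embs` are placeholders rather than values taken from the paper.

```python
import torch

H = 768  # assumed hidden size of the underlying PLM encoder

class Thought:
    """A thought: an embedding paired with the expression it denotes."""
    def __init__(self, embedding: torch.Tensor, expression: str):
        self.embedding = embedding    # theta in R^H
        self.expression = expression  # E(theta), e.g. "(40+15)"

# Stand-in for real encoder outputs over the problem text (length 12 here).
token_embs = torch.randn(12, H)
premise = token_embs[0]    # P_0: the [CLS] token embedding
goal = token_embs[-1]      # G: the question-mark token embedding (last token here)
```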

Thought Expansion

At each reasoning depth, the thought expansion process generates candidate thoughts $\Theta_d$ and filters them to obtain the reasonable thoughts $\Theta^*_d$, which serve as intermediate steps toward the final answer.
Candidate Thought Generation
Initial Thoughts ($\Theta_0$):
Initial thoughts are embeddings that represent each quantity mentioned in the problem context or question. These embeddings encode the contextual semantics of the quantities and serve as the starting point for reasoning.
At each reasoning depth $d$, the model generates a set of possible new thoughts $\Theta_d$, the candidates, from the previously selected reasonable thoughts $\Theta^*_{d-1}$. A new candidate thought $\theta'$ is obtained by combining two previous thoughts $\theta_i, \theta_j \in \Theta^*_{d-1}$ with an arithmetic operation:

$$\mathcal{E}(\theta') = \mathcal{E}(\theta_i) \circ \mathcal{E}(\theta_j), \quad \text{where } \circ \in \{+, -, \times, \div\}$$
To enable this composition process, ATHENA introduces two operation layers, merge ($\text{M}$) and transform ($\text{T}$), to model the fundamental properties of arithmetic operations.
Merge Layer ($\text{M}$)
The merge layer takes a pair of thoughts ($\theta_i, \theta_j$) and combines them into a new thought $\theta'$ to model operations like addition or multiplication:

$$\stackrel{\text{op}}{\text{M}} : \theta_i, \theta_j \mapsto \theta' \quad \text{s.t. } \mathcal{E}(\theta') = \text{op}(\mathcal{E}(\theta_i), \mathcal{E}(\theta_j)), \ \text{where } \text{op} \in \{+, \times\}$$
This layer is implemented using a feed-forward network ($\text{FF}$), multi-head self-attention ($\mathop{\text{A}}\limits_{\text{self}}$), and layer normalization ($\ell$):

$$\text{M}(\theta_i, \theta_j) = \text{FF}\Big(\theta_i + \theta_j + \ell\big(\mathbf{1}^{\text{T}}_2 \mathop{\text{A}}\limits_{\text{self}}([\theta_i; \theta_j])\big)\, W + b\Big), \quad \text{where } W \in \mathbb{R}^{H \times H},\ b \in \mathbb{R}^H$$
Transform Layer ($\text{T}$)
The transform layer takes a single thought $\theta$ and applies a unary inverse operation, negation or reciprocal (which realize subtraction and division when combined with the merge layer), to produce a new thought $\theta'$:

$$\stackrel{\text{op}}{\text{T}} : \theta \mapsto \theta' \quad \text{s.t. } \mathcal{E}(\theta') = \text{op}(\mathcal{E}(\theta)), \ \text{where } \text{op} \in \{-\cdot,\ \cdot^{-1}\}$$
This layer is simply implemented as:
$$\text{T}(\theta) = \text{FF}(\theta)$$
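Putting the two operation layers together, the following is a minimal PyTorch sketch that follows the equations above. The hidden size, head count, and the inner structure of $\text{FF}$ are assumptions; only the overall composition (pairwise self-attention, summation, layer normalization, feed-forward) mirrors the formulas.

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions, not taken from the paper

class Merge(nn.Module):
    """Merge layer M: combines two thoughts (addition / multiplication)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.norm = nn.LayerNorm(h)
        self.proj = nn.Linear(h, h)  # the W, b of the merge equation
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, ti, tj):
        pair = torch.stack([ti, tj]).unsqueeze(0)   # (1, 2, H)
        attended, _ = self.attn(pair, pair, pair)   # self-attention over the pair
        pooled = self.norm(attended.sum(dim=1)[0])  # 1^T_2: sum the two rows
        return self.ff(ti + tj + self.proj(pooled))

class Transform(nn.Module):
    """Transform layer T: unary inverse operations (negation / reciprocal)."""
    def __init__(self, h=H):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, t):
        return self.ff(t)  # T(theta) = FF(theta)
```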
Layer Scheduling:
The two layers are applied alternately during the reasoning process: the transform layer is applied at odd depths ($d = 2n - 1$), while the merge layer is applied at even depths ($d = 2n$). At the initial depth $d = 0$, the initial thoughts $\Theta_0$ are used directly as candidates.
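A small scheduling sketch, assuming the `Merge` and `Transform` modules above (or any callables over the same shapes); whether a thought may be merged with itself is a detail omitted here:

```python
from itertools import combinations

def candidates_at(depth, prev_thoughts, merge, transform):
    """Generate the candidate set for one reasoning depth (illustrative)."""
    if depth == 0:
        return list(prev_thoughts)                  # Theta_0: initial thoughts as-is
    if depth % 2 == 1:                              # odd depth d = 2n - 1: transform
        return [transform(t) for t in prev_thoughts]
    return [merge(a, b)                             # even depth d = 2n: merge
            for a, b in combinations(prev_thoughts, 2)]
```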
Reasonable Thought Selection
After obtaining the candidate thoughts $\Theta_d$, the model selects the reasonable thoughts $\Theta^*_d$ that will eventually constitute the final thought $\theta^*$. At each reasoning depth $d$, it filters the candidates through the inference layer ($\text{infer}$), guided by the premise vector $\text{P}_d$.
Inference and Update Premise
Given the set of candidate thoughts $\Theta_d$ generated at depth $d$, ATHENA evaluates each thought $\theta \in \Theta_d$ by computing its correlation with the current premise vector $\text{P}_d$. This correlation indicates how compatible a new thought is with the prior reasoning. To score each candidate, the model applies multi-head attention $\text{A}(Q, K{=}V)$ and a feed-forward network $\text{FF}$:

$$\text{infer}(\text{P}_d, \theta) = \sigma\big(\text{A}(\text{FF}(\theta), \text{P}_d)\, W_r + b_r\big), \quad \text{where } W_r \in \mathbb{R}^{H \times 1},\ b_r \in \mathbb{R}$$
A thought $\theta$ is considered reasonable if its correlation score $\text{infer}(\text{P}_d, \theta)$ exceeds a predefined threshold $t_r = 0.5$.
Once the reasonable thoughts $\Theta^*_d$ are selected, ATHENA updates the premise vector to reflect the new reasoning context: the updated premise $\text{P}_{d+1}$ concatenates $\text{P}_d$ with the attended representation of all reasonable thoughts, reusing the parameters of the inference layer $\text{infer}$:

$$\text{P}_{d+1} = \text{P}_d \,\Vert\, \text{A}(\text{FF}([\Theta^*_d]), \text{P}_d)$$
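A hedged PyTorch sketch of the inference step and premise update follows; the module layout (head count, FF structure) is assumed, while the sigmoid scoring and the concatenation along the premise sequence follow the equations above.

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions

class Inference(nn.Module):
    """Scores a candidate thought against the premise sequence (sketch)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.score = nn.Linear(h, 1)  # W_r, b_r

    def forward(self, premise, thought):
        q = self.ff(thought).view(1, 1, -1)      # query: the candidate thought
        ctx, _ = self.attn(q, premise, premise)  # attend over the premise vectors
        return torch.sigmoid(self.score(ctx)), ctx

infer = Inference()
premise = torch.randn(1, 1, H)  # P_0: the [CLS] embedding as a length-1 sequence
thought = torch.randn(H)

score, ctx = infer(premise, thought)
if score.item() > 0.5:  # t_r = 0.5: the thought is reasonable
    premise = torch.cat([premise, ctx], dim=1)  # P_{d+1} = P_d || A(FF(theta), P_d)
```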
Termination and Final Thought Selection
The final thought θ\theta^* represents the model’s answer to the question. ATHENA terminates the thought expansion process based on one of two criteria:
(1) when the reasoning depth reaches the predefined maximum $D$, or
(2) when any reasonable thought achieves a confidence score exceeding a predefined threshold $t_f$.
To determine the final answer, the model computes a score for each reasonable thought $\theta \in \Theta^*_d$ by evaluating its alignment with the goal vector $G$ using a feed-forward network ($\text{FF}$) and multi-head attention ($\text{A}$):

$$\text{answer}(G, \theta) = \sigma\big(\text{A}(\text{FF}(\theta), G)\, W_a + b_a\big), \quad \text{where } W_a \in \mathbb{R}^{H \times 1},\ b_a \in \mathbb{R}$$
The thought with the highest score is selected as the final thought:
$$\theta^* = \mathop{\text{arg max}}\limits_{\theta \in \Theta^*_d} \big(\text{answer}(G, \theta)\big)$$
This final thought encapsulates the reasoning path most aligned with the goal, allowing the model to generate the final answer:
$$\mathcal{E}^* = \mathcal{E}(\theta^*)$$
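The answer scorer mirrors the inference layer, with the goal vector in place of the premise. A brief sketch under the same assumptions:

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions

class Answer(nn.Module):
    """Scores a thought's alignment with the goal vector (sketch)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.score = nn.Linear(h, 1)  # W_a, b_a

    def forward(self, goal, thought):
        q = self.ff(thought).view(1, 1, -1)
        ctx, _ = self.attn(q, goal, goal)  # align the thought with the goal
        return torch.sigmoid(self.score(ctx)).item()

scorer = Answer()
goal = torch.randn(1, 1, H)                    # G as a length-1 sequence
thoughts = [torch.randn(H) for _ in range(3)]  # stand-ins for Theta*_d
theta_star = max(thoughts, key=lambda t: scorer(goal, t))  # arg max of answer(G, .)
```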

Experiment and Analysis

To evaluate ATHENA's performance, the study used standard benchmark datasets such as MAWPS, ASDiv-A, and Math23k, which cover a diverse range of math word problem types and linguistic styles. Additionally, to assess the model's ability to generalize across contextually related but lexically varied problems, the authors included SVAMP, a challenge set derived from ASDiv-A by varying its problems, and UnbiasedMWP, both designed to evaluate performance without bias from memorized training data.
The baselines used for comparison include representative MWP approaches such as Transformer, GTS (Goal-driven Tree-Structured model), Graph-to-Tree, and the more recent reasoning-based model DeductReasoner, all of which were evaluated alongside ATHENA.
One-to-Many Test (1:N)
The One-to-Many test evaluates a model's ability to generalize mathematical reasoning across multiple questions that share the same context. Specifically, only one question from each context group is used for training, while the remaining questions are used for testing.
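As a toy illustration of the 1:N split (the context IDs and question labels below are hypothetical, not the benchmark's actual format):

```python
from collections import defaultdict

# Each problem is (context_id, question): variants sharing one context.
problems = [("c1", "q1a"), ("c1", "q1b"), ("c2", "q2a"), ("c2", "q2b")]

groups = defaultdict(list)
for ctx, question in problems:
    groups[ctx].append(question)

train = [qs[0] for qs in groups.values()]             # one question per context
test = [q for qs in groups.values() for q in qs[1:]]  # the remaining variants
print(train, test)  # ['q1a', 'q2a'] ['q1b', 'q2b']
```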
ATHENA demonstrates strong overall performance (answer accuracy) across multiple MWP benchmarks, setting new state-of-the-art results and significantly outperforming previous models, including DeductReasoner, especially in challenging one-to-many generalization tests such as UnbiasedMWP (1:N). Remarkably, ATHENA also shows substantial improvements even with minimal additional training data—highlighting its exceptional generalization capabilities and efficiency. Moreover, ATHENA's incorrect predictions are significantly less dependent on memorized training examples compared to baselines, indicating its robust ability to learn underlying mathematical principles rather than merely replicating learned patterns.
Thought Visualization
To better interpret ATHENA's reasoning process, the authors visualized attention scores between each reasonable thought and the input problem text. In the paper's figure, most of the initial thoughts are closely linked to terms like “playground”, while thoughts carrying the meaning of increased size attend strongly to “later”. Thoughts about width, such as [15] or [40+15], show high attention on the word “width”. Similarly, area-related thoughts focus on words like “square meter” or “area”, and the final thought aligns strongly with “compared”, indicating that the model correctly associates it with computing a difference. This visualization provides insight into how ATHENA grounds its intermediate reasoning in key semantic cues of the problem.

Conclusion and Limitations

This study presents ATHENA, a novel reasoning framework that leverages thought expansion to achieve robust performance on diverse and previously unseen math word problems. By explicitly modeling intermediate reasoning steps, ATHENA demonstrates strong generalization capabilities in contexts where mathematical operations are expressed in varied linguistic forms. Nonetheless, the current work is limited to arithmetic problems involving single-equation reasoning. Although the architecture is extensible to multi-equation settings, such evaluation was excluded to ensure comparability with existing baselines. Furthermore, comparisons with large-scale language models (LLMs) were not conducted, as the primary objective was to assess reasoning performance under limited-data conditions.

References

Kim, J., Kim, H., Hahn, J., & Han, Y. S. (2023, December). ATHENA: Mathematical Reasoning with Thought Expansion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 16315–16327).
Co-author: Hyunji Kim (Hazel Kim)
She received her master's degree in Artificial Intelligence from Yonsei University and worked as an AI Researcher at Classting. She is currently a PhD student at the University of Oxford. Her research interests include natural language processing, learning with limited data, and the uncertainty and controllability of language models.