
ATHENA: Mathematical Reasoning with Thought Expansion

Updated: 2025/07/07 06:24
Keywords: NLP, Math Word Problem, ML
Math Word Problem (MWP) solving involves understanding mathematical questions expressed in natural language and deriving the appropriate mathematical equations. Traditional approaches heavily rely on simple lexical pattern matching, limiting their flexibility in diverse real-world scenarios. Our co-authored paper introduces ATHENA (Attention-based THought Expansion Network Architecture), designed to mimic human cognitive processes for more generalized and robust mathematical reasoning.
Let's explore how ATHENA achieves robust performance and why this matters for mathematical reasoning in AI.

Research Background

Math word problem (MWP) solving involves translating complex linguistic descriptions into mathematical expressions. Traditional models tend to memorize lexical patterns rather than understand mathematical principles and procedures, limiting their ability to generalize to unseen or slightly varied problems.
Consider two cases: calculating the area of a rectangle, and determining how many items go into each container when a collection is distributed evenly. Both reduce to simple arithmetic, yet they call for different kinds of conceptual understanding.
Previous methods have struggled with two key aspects in reaching human-level understanding:
Conceptual knowledge: Understanding how mathematical principles apply in various contexts.
Procedural knowledge: The ability to deduce answers step by step through logical reasoning.
ATHENA is specifically designed to bridge this gap, enabling the model to expand its reasoning capabilities by mimicking human cognitive processes.

Methodology: ATHENA

ATHENA employs an innovative two-step reasoning process:
Candidate Thought Generation: At each reasoning stage, ATHENA generates multiple potential "thoughts"—possible mathematical expressions derived from previous steps.
Reasonable Thought Selection: It then evaluates these candidates based on context relevance, progressively narrowing toward a correct mathematical expression.
This iterative method mirrors human cognitive expansion, generating diverse reasoning pathways and selecting the most contextually relevant and mathematically sound options.
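To make this loop concrete, here is a toy, purely symbolic sketch in Python. The string expressions and the `keep` predicate are illustrative stand-ins for ATHENA's learned components; the actual model expands and scores embeddings, not strings.

```python
from itertools import combinations

def expand(thoughts):
    """Step 1: generate candidate thoughts from pairs of previous thoughts."""
    candidates = set()
    for a, b in combinations(thoughts, 2):
        for op in "+-*/":
            candidates.add(f"({a}{op}{b})")
    return candidates

def select(candidates, keep):
    """Step 2: keep only the contextually reasonable candidates.
    `keep` is a stand-in predicate for the learned inference layer."""
    return {c for c in candidates if keep(c)}

quantities = ["40", "15"]  # initial thoughts: the quantities in the problem
step1 = select(expand(quantities), lambda c: "*" not in c and "/" not in c)
print(step1)  # {'(40+15)', '(40-15)'}
```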

Preliminaries

Before we get into the details of ATHENA, let's first clarify what a thought is, and introduce the roles of the premise vector and goal vector, which guide the reasoning process.
Thought
In this study, a thought ($\theta \in \mathbb{R}^H$) is defined as an embedding of a possible mathematical expression ($\mathcal{E}(\theta)$), derived from the quantities in a problem, that represents the contextual meaning of the expression. The objective of the model is to find a thought $\theta^*$ that satisfies the ground-truth expression $\mathcal{E}^*$.
Premise Vector
A premise vector ($\text{P}_d$) encodes previously inferred thoughts and is used to assess and filter candidate thoughts at each reasoning depth $d$. The initial premise vector ($\text{P}_0$) is initialized with the embedding of the [CLS] token from the problem description.
Goal Vector
A goal vector ($G$) serves as a ground-truth reference to determine whether a thought is an appropriate answer to the given question. It is defined using the token embedding of the punctuation mark (e.g., the question mark at the end of the sentence) in the problem description.
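Under assumed shapes and a stand-in encoder, these three ingredients could be represented as in the sketch below; `H = 768` and the random `token_embs` are placeholders rather than values taken from the paper.

```python
import torch

H = 768  # assumed hidden size of the underlying PLM encoder

class Thought:
    """A thought: an embedding paired with the expression it denotes."""
    def __init__(self, embedding: torch.Tensor, expression: str):
        self.embedding = embedding    # theta in R^H
        self.expression = expression  # E(theta), e.g. "(40+15)"

# Stand-in for real encoder outputs over the problem text (length 12 here).
token_embs = torch.randn(12, H)
premise = token_embs[0]    # P_0: the [CLS] token embedding
goal = token_embs[-1]      # G: the question-mark token embedding (last token here)
```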

Thought Expansion

At each reasoning depth, the thought expansion process generates candidate thoughts $\Theta_d$ and filters them to obtain the reasonable thoughts $\Theta^*_d$, which serve as intermediate steps toward the final answer.
Candidate Thought Generation
Initial Thoughts ($\Theta_0$):
Initial thoughts are embeddings that represent each quantity mentioned in the problem context or question. These embeddings encode the contextual semantics of the quantities and serve as the starting point for reasoning.
At each reasoning depth $d$, the model generates a set of possible new thoughts $\Theta_d$, the candidates, from the previously selected reasonable thoughts $\Theta^*_{d-1}$. A new candidate thought $\theta'$ is obtained by combining two previous thoughts $\theta_i, \theta_j \in \Theta^*_{d-1}$ with an arithmetic operation:

$$\mathcal{E}(\theta') = \mathcal{E}(\theta_i) \circ \mathcal{E}(\theta_j), \quad \text{where } \circ \in \{+, -, \times, \div\}$$
To enable this composition process, ATHENA introduces two operation layers, merge ($\text{M}$) and transform ($\text{T}$), to model the fundamental properties of arithmetic operations.
Merge Layer ($\text{M}$)
The merge layer takes a pair of thoughts ($\theta_i, \theta_j$) and combines them into a new thought $\theta'$ to model operations like addition or multiplication:

$$\stackrel{\text{op}}{\text{M}} : \theta_i, \theta_j \mapsto \theta' \quad \text{s.t. } \mathcal{E}(\theta') = \text{op}(\mathcal{E}(\theta_i), \mathcal{E}(\theta_j)), \ \text{where } \text{op} \in \{+, \times\}$$
This layer is implemented using a feed-forward network ($\text{FF}$), multi-head self-attention ($\mathop{\text{A}}\limits_{\text{self}}$), and layer normalization ($\ell$):

$$\text{M}(\theta_i, \theta_j) = \text{FF}\Big(\theta_i + \theta_j + \ell\big(\mathbf{1}^{\text{T}}_2 \mathop{\text{A}}\limits_{\text{self}}([\theta_i; \theta_j])\big)\, W + b\Big), \quad \text{where } W \in \mathbb{R}^{H \times H},\ b \in \mathbb{R}^H$$
Transform Layer ($\text{T}$)
The transform layer takes a single thought $\theta$ and applies a unary inverse operation, negation or reciprocal (which realize subtraction and division when combined with the merge layer), to produce a new thought $\theta'$:

$$\stackrel{\text{op}}{\text{T}} : \theta \mapsto \theta' \quad \text{s.t. } \mathcal{E}(\theta') = \text{op}(\mathcal{E}(\theta)), \ \text{where } \text{op} \in \{-\cdot,\ \cdot^{-1}\}$$
This layer is simply implemented as:
$$\text{T}(\theta) = \text{FF}(\theta)$$
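Putting the two operation layers together, the following is a minimal PyTorch sketch that follows the equations above. The hidden size, head count, and the inner structure of $\text{FF}$ are assumptions; only the overall composition (pairwise self-attention, summation, layer normalization, feed-forward) mirrors the formulas.

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions, not taken from the paper

class Merge(nn.Module):
    """Merge layer M: combines two thoughts (addition / multiplication)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.norm = nn.LayerNorm(h)
        self.proj = nn.Linear(h, h)  # the W, b of the merge equation
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, ti, tj):
        pair = torch.stack([ti, tj]).unsqueeze(0)   # (1, 2, H)
        attended, _ = self.attn(pair, pair, pair)   # self-attention over the pair
        pooled = self.norm(attended.sum(dim=1)[0])  # 1^T_2: sum the two rows
        return self.ff(ti + tj + self.proj(pooled))

class Transform(nn.Module):
    """Transform layer T: unary inverse operations (negation / reciprocal)."""
    def __init__(self, h=H):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))

    def forward(self, t):
        return self.ff(t)  # T(theta) = FF(theta)
```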
Layer Scheduling:
The two layers are applied alternately during the reasoning process: the transform layer is applied at odd depths ($d = 2n - 1$), while the merge layer is applied at even depths ($d = 2n$). At the initial depth $d = 0$, the initial thoughts $\Theta_0$ are used directly as candidates.
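A small scheduling sketch, assuming the `Merge` and `Transform` modules above (or any callables over the same shapes); whether a thought may be merged with itself is a detail omitted here:

```python
from itertools import combinations

def candidates_at(depth, prev_thoughts, merge, transform):
    """Generate the candidate set for one reasoning depth (illustrative)."""
    if depth == 0:
        return list(prev_thoughts)                  # Theta_0: initial thoughts as-is
    if depth % 2 == 1:                              # odd depth d = 2n - 1: transform
        return [transform(t) for t in prev_thoughts]
    return [merge(a, b)                             # even depth d = 2n: merge
            for a, b in combinations(prev_thoughts, 2)]
```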
Reasonable Thought Selection
After obtaining the candidate thoughts $\Theta_d$, the model selects the reasonable thoughts $\Theta^*_d$ that will eventually constitute the final thought $\theta^*$. At each reasoning depth $d$, it filters the candidates through the inference layer ($\text{infer}$), guided by the premise vector $\text{P}_d$.
Inference and Update Premise
Given the set of candidate thoughts $\Theta_d$ generated at depth $d$, ATHENA evaluates each thought $\theta \in \Theta_d$ by computing its correlation with the current premise vector $\text{P}_d$. This correlation indicates how compatible a new thought is with the prior reasoning. To score each candidate, the model applies multi-head attention $\text{A}(Q, K{=}V)$ and a feed-forward network $\text{FF}$:

$$\text{infer}(\text{P}_d, \theta) = \sigma\big(\text{A}(\text{FF}(\theta), \text{P}_d)\, W_r + b_r\big), \quad \text{where } W_r \in \mathbb{R}^{H \times 1},\ b_r \in \mathbb{R}$$
A thought $\theta$ is considered reasonable if its correlation score $\text{infer}(\text{P}_d, \theta)$ exceeds a predefined threshold $t_r = 0.5$.
Once the reasonable thoughts $\Theta^*_d$ are selected, ATHENA updates the premise vector to reflect the new reasoning context: the updated premise $\text{P}_{d+1}$ concatenates $\text{P}_d$ with the attended representation of all reasonable thoughts, reusing the parameters of the inference layer $\text{infer}$:

$$\text{P}_{d+1} = \text{P}_d \,\Vert\, \text{A}(\text{FF}([\Theta^*_d]), \text{P}_d)$$
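A hedged PyTorch sketch of the inference step and premise update follows; the module layout (head count, FF structure) is assumed, while the sigmoid scoring and the concatenation along the premise sequence follow the equations above.

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions

class Inference(nn.Module):
    """Scores a candidate thought against the premise sequence (sketch)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.score = nn.Linear(h, 1)  # W_r, b_r

    def forward(self, premise, thought):
        q = self.ff(thought).view(1, 1, -1)      # query: the candidate thought
        ctx, _ = self.attn(q, premise, premise)  # attend over the premise vectors
        return torch.sigmoid(self.score(ctx)), ctx

infer = Inference()
premise = torch.randn(1, 1, H)  # P_0: the [CLS] embedding as a length-1 sequence
thought = torch.randn(H)

score, ctx = infer(premise, thought)
if score.item() > 0.5:  # t_r = 0.5: the thought is reasonable
    premise = torch.cat([premise, ctx], dim=1)  # P_{d+1} = P_d || A(FF(theta), P_d)
```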
Termination and Final Thought Selection
The final thought θ\theta^* represents the model’s answer to the question. ATHENA terminates the thought expansion process based on one of two criteria:
(1) when the reasoning depth reaches the predefined maximum $D$, or
(2) when any reasonable thought achieves a confidence score exceeding a predefined threshold $t_f$.
To determine the final answer, the model computes a score for each reasonable thought $\theta \in \Theta^*_d$ by evaluating its alignment with the goal vector $G$ using a feed-forward network ($\text{FF}$) and multi-head attention ($\text{A}$):

$$\text{answer}(G, \theta) = \sigma\big(\text{A}(\text{FF}(\theta), G)\, W_a + b_a\big), \quad \text{where } W_a \in \mathbb{R}^{H \times 1},\ b_a \in \mathbb{R}$$
The thought with the highest score is selected as the final thought:
$$\theta^* = \mathop{\text{arg max}}\limits_{\theta \in \Theta^*_d} \big(\text{answer}(G, \theta)\big)$$
This final thought encapsulates the reasoning path most aligned with the goal, allowing the model to generate the final answer:
$$\mathcal{E}^* = \mathcal{E}(\theta^*)$$
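The answer scorer mirrors the inference layer, with the goal vector in place of the premise. A brief sketch under the same assumptions:

```python
import torch
import torch.nn as nn

H, HEADS = 768, 8  # assumed dimensions

class Answer(nn.Module):
    """Scores a thought's alignment with the goal vector (sketch)."""
    def __init__(self, h=H, heads=HEADS):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(h, h), nn.ReLU(), nn.Linear(h, h))
        self.attn = nn.MultiheadAttention(h, heads, batch_first=True)
        self.score = nn.Linear(h, 1)  # W_a, b_a

    def forward(self, goal, thought):
        q = self.ff(thought).view(1, 1, -1)
        ctx, _ = self.attn(q, goal, goal)  # align the thought with the goal
        return torch.sigmoid(self.score(ctx)).item()

scorer = Answer()
goal = torch.randn(1, 1, H)                    # G as a length-1 sequence
thoughts = [torch.randn(H) for _ in range(3)]  # stand-ins for Theta*_d
theta_star = max(thoughts, key=lambda t: scorer(goal, t))  # arg max of answer(G, .)
```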

Experiment and Analysis

To evaluate ATHENA's performance, the study used standard benchmark datasets such as MAWPS, ASDiv-A, and Math23k, which cover a diverse range of math word problem types and linguistic styles. Additionally, to assess the model's ability to generalize across contextually related but lexically varied problems, the authors included SVAMP, a challenge set derived from ASDiv-A by varying its problems, and UnbiasedMWP, both designed to evaluate performance without bias from memorized training data.
The baselines used for comparison include representative MWP approaches such as Transformer, GTS (Goal-driven Tree-Structured model), Graph-to-Tree, and the more recent reasoning-based model DeductReasoner, all of which were evaluated alongside ATHENA.
One-to-Many Test (1:N)
The One-to-Many test evaluates a model's ability to generalize mathematical reasoning across multiple questions that share the same context. Specifically, only one question from each context group is used for training, while the remaining questions are used for testing.
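As a toy illustration of the 1:N split (the context IDs and question labels below are hypothetical, not the benchmark's actual format):

```python
from collections import defaultdict

# Each problem is (context_id, question): variants sharing one context.
problems = [("c1", "q1a"), ("c1", "q1b"), ("c2", "q2a"), ("c2", "q2b")]

groups = defaultdict(list)
for ctx, question in problems:
    groups[ctx].append(question)

train = [qs[0] for qs in groups.values()]             # one question per context
test = [q for qs in groups.values() for q in qs[1:]]  # the remaining variants
print(train, test)  # ['q1a', 'q2a'] ['q1b', 'q2b']
```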
ATHENA demonstrates strong overall performance (answer accuracy) across multiple MWP benchmarks, setting new state-of-the-art results and significantly outperforming previous models, including DeductReasoner, especially in challenging one-to-many generalization tests such as UnbiasedMWP (1:N). Remarkably, ATHENA also shows substantial improvements even with minimal additional training data—highlighting its exceptional generalization capabilities and efficiency. Moreover, ATHENA's incorrect predictions are significantly less dependent on memorized training examples compared to baselines, indicating its robust ability to learn underlying mathematical principles rather than merely replicating learned patterns.
Thought Visualization
To better interpret ATHENA's reasoning process, the authors visualized attention scores between each reasonable thought and the input problem text. In the paper's figure, most of the initial thoughts are closely linked to terms like “playground”, while thoughts carrying the meaning of increased size attend strongly to “later”. Thoughts about width, such as [15] or [40+15], show high attention on the word “width”. Similarly, area-related thoughts focus on words like “square meter” or “area”, and the final thought aligns strongly with “compared”, indicating that the model correctly associates it with computing a difference. This visualization provides insight into how ATHENA grounds its intermediate reasoning in key semantic cues of the problem.

Conclusion and Limitations

This study presents ATHENA, a novel reasoning framework that leverages thought expansion to achieve robust performance on diverse and previously unseen math word problems. By explicitly modeling intermediate reasoning steps, ATHENA demonstrates strong generalization capabilities in contexts where mathematical operations are expressed in varied linguistic forms. Nonetheless, the current work is limited to arithmetic problems involving single-equation reasoning. Although the architecture is extensible to multi-equation settings, such evaluation was excluded to ensure comparability with existing baselines. Furthermore, comparisons with large-scale language models (LLMs) were not conducted, as the primary objective was to assess reasoning performance under limited-data conditions.

References

Kim, J., Kim, H., Hahn, J., & Han, Y. S. (2023, December). ATHENA: Mathematical Reasoning with Thought Expansion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 16315–16327).
Co-author: Hyunji Kim (Hazel Kim)
She received her master's degree in Artificial Intelligence from Yonsei University and worked as an AI Researcher at Classting. She is currently a PhD student at the University of Oxford. Her research interests include natural language processing, learning with limited data, and the uncertainty and controllability of language models.