Math Self-Doubt

Research sprint report

When Math Models Think a Problem Is Unsolved

A paired-framing pilot on whether open-problem language changes a math model's accuracy, abstention, answer stability, and visible self-doubt.

Rishabh Sai | June 2026 | Run memo for Qwen/Qwen2.5-Math-1.5B-Instruct

The first run looked like evidence that an "open or unsolved" frame made a small math model less accurate. The follow-up run made the result more modest and more useful: much of the effect was a prompt-format and answer-extraction interaction, not a clean proof that open framing breaks reasoning.

Epistemic status: pilot result, not a paper. The clean claim is that the harness can detect framing-sensitive behavior, but this first dataset is too small and too model-specific for broad claims about reasoning models.

1. Question

The experiment takes the same known-solvable math problems and varies only the frame around them: neutral, known-solved, contest-style, or possibly open/unsolved. The intended signal is not hidden cognition. It is observable behavior: exact answer, abstention, self-doubt language, output length, and answer instability across framings.
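A minimal sketch of how the paired rows could be built. The frame wordings, the `FRAMES` keys, and the `PromptRow` type are hypothetical stand-ins, since this memo does not reproduce the actual prompt text. The point is only that each problem is crossed with every frame, so that the frame is the sole thing that varies within a pair.

```python
from dataclasses import dataclass

# Hypothetical frame wordings (assumption): the memo names the four
# framings but does not quote the prompts that were actually used.
FRAMES = {
    "neutral": "Solve the following problem.",
    "known_solved": "The following problem has a known, verified solution. Solve it.",
    "contest": "The following problem is from a math contest. Solve it.",
    "open": "The following problem may be open or unsolved. Attempt it anyway.",
}


@dataclass(frozen=True)
class PromptRow:
    problem_id: str
    framing: str
    prompt: str


def build_rows(problems: dict[str, str]) -> list[PromptRow]:
    """Cross every problem with every frame; only the preamble differs."""
    rows = []
    for problem_id, text in problems.items():
        for framing, preamble in FRAMES.items():
            rows.append(PromptRow(problem_id, framing, f"{preamble}\n\n{text}"))
    return rows


# Toy usage: one problem yields one row per framing.
example_rows = build_rows({"p01": "Compute 17 * 23."})
print(len(example_rows))
```

With 20 problems and four frames this construction yields 80 rows, which matches the row count reported for the first run.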

The useful version of the question is narrow: can we measure whether a model becomes less decisive or less correct when it is told a problem may be unsolved, while still keeping appropriate uncertainty on genuinely open or underspecified controls?

2. Runs

Run | Rows | Purpose
hf-qwen-sprint20 | 80 | First paired-framing run over 20 solvable exact-answer problems.
hf-qwen-controls | 20 | Open and underspecified controls for uncertainty behavior.
hf-qwen-answerfirst-f0-f3 | 40 | Format-control follow-up comparing neutral and open framing with answer-first prompting.

The backend was the Hugging Face Inference API through featherless-ai. Raw generations, metadata, rescored CSVs, and figures are committed in the repository.

3. Baseline Result

The first pass suggested an effect large enough to investigate. Neutral framing reached 60% exact-match accuracy and open framing reached 45%, which on 20 problems is 12 versus 9 correct. The paired open-minus-neutral accuracy delta was -15.0 pp. Observable self-doubt rose only slightly, by 1.85 score points.

[Figure: baseline run summary chart]
Baseline run over 20 solvable problems and four framings. The open frame looked worse on exact-match accuracy, but this run still mixed reasoning failures with answer-format failures.

Measure | Neutral | Open | Delta
Exact-match accuracy | 60% | 45% | -15.0 pp
Self-doubt score | low | higher | +1.85
Abstention | lower | higher | +5.0 pp
Answer instability (across paired framings) | - | - | 22.2%
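The paired delta and the instability figure can be computed along the following lines. This is a sketch under stated assumptions: the row fields (`problem_id`, `framing`, `answer`, `correct`) are hypothetical, and instability is assumed here to mean the share of pairs whose extracted answers differ, counted only over pairs where both framings produced an extractable answer. The memo does not spell out its exact definition. The toy rows are illustrative and are not data from the run.

```python
def paired_metrics(rows, base="neutral", other="open"):
    """Compare two framings problem by problem.

    Each row is a dict with hypothetical keys: problem_id, framing,
    answer (extracted string or None), correct (bool).
    """
    by_problem = {}
    for row in rows:
        by_problem.setdefault(row["problem_id"], {})[row["framing"]] = row

    # Keep only problems that were run under both framings.
    pairs = [(p[base], p[other]) for p in by_problem.values()
             if base in p and other in p]
    n = len(pairs)
    acc_base = sum(b["correct"] for b, _ in pairs) / n
    acc_other = sum(o["correct"] for _, o in pairs) / n

    # Assumed definition: instability is measured only where both
    # framings yielded an extractable answer.
    comparable = [(b, o) for b, o in pairs
                  if b["answer"] is not None and o["answer"] is not None]
    unstable = sum(b["answer"] != o["answer"] for b, o in comparable)

    return {
        "acc_base": acc_base,
        "acc_other": acc_other,
        "delta_pp": 100 * (acc_other - acc_base),
        "instability": unstable / len(comparable) if comparable else float("nan"),
    }


# Illustrative toy rows, not results from the report.
toy_rows = [
    {"problem_id": "p1", "framing": "neutral", "answer": "4", "correct": True},
    {"problem_id": "p1", "framing": "open", "answer": "4", "correct": True},
    {"problem_id": "p2", "framing": "neutral", "answer": "7", "correct": True},
    {"problem_id": "p2", "framing": "open", "answer": "9", "correct": False},
    {"problem_id": "p3", "framing": "neutral", "answer": "10", "correct": True},
    {"problem_id": "p3", "framing": "open", "answer": None, "correct": False},
    {"problem_id": "p4", "framing": "neutral", "answer": "3", "correct": False},
    {"problem_id": "p4", "framing": "open", "answer": "3", "correct": False},
]
toy_metrics = paired_metrics(toy_rows)
print(toy_metrics)
```

Because an abstention leaves no answer to compare, a definition like this one keeps abstention and instability from double-counting the same row.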

4. Controls

The control set used genuinely open or underspecified prompts. These rows were not scored for exact correctness. They were included to test whether the metric registers appropriate uncertainty and whether some framings push the model toward false concrete answers.

[Figure: open and underspecified control summary chart]
Controls produced much higher self-doubt and abstention than the solvable problems. Contest framing sometimes produced concrete answers where uncertainty would be more appropriate.
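For readers who want a concrete picture of what a surface-level self-doubt and abstention measure might look like, here is a rough keyword-based sketch. It is a stand-in, not the scorer used in this run: the marker list, the abstention patterns, and both function names are assumptions, and the memo's actual score scale is not described here.

```python
import re

# Hypothetical hedging markers (assumption, not the run's lexicon).
DOUBT_MARKERS = (
    "not sure",
    "uncertain",
    "i may be wrong",
    "unsolved",
    "open problem",
    "unknown",
)

# Hypothetical phrases treated as declining to give a final answer.
ABSTAIN_PATTERNS = (
    r"\bcannot be determined\b",
    r"\bnot enough information\b",
    r"\bno known (proof|answer|solution)\b",
    r"\bremains (open|unsolved)\b",
)


def self_doubt_score(text: str) -> int:
    """Count occurrences of hedging markers in the visible output."""
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in DOUBT_MARKERS)


def is_abstention(text: str) -> bool:
    """True if the output matches any abstention phrase."""
    lowered = text.lower()
    return any(re.search(pattern, lowered) for pattern in ABSTAIN_PATTERNS)


sample = "I am not sure; this remains an open problem."
print(self_doubt_score(sample), is_abstention("This problem remains open."))
```

A lexical measure like this only sees wording, which fits the memo's stated scope of observable behavior, but it will miss paraphrased doubt and can misfire on quoted problem text.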

5. Format Control

The follow-up run forced a cleaner answer channel by asking the model to put the final answer before the explanation. Under that condition, the neutral and open frames both scored 55% exact-match accuracy, compared with 60% and 45% in the baseline run. The apparent open-frame accuracy gap disappeared.

[Figure: answer-first format-control summary chart]
Answer-first prompting removes the baseline open-vs-neutral accuracy gap on the same 20-problem slice, while answer instability remains nonzero.

Measure | Neutral | Open | Delta
Exact-match accuracy | 55% | 55% | 0.0 pp
Self-doubt score | low | slightly higher | +0.60
Answer instability (across neutral/open pairs) | - | - | 15.8%
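The toy example below illustrates how a format and extraction interaction of this kind can arise. It assumes two hypothetical extractors: a free-form fallback that takes the last number in the output, and an answer-first reader that only accepts a leading `Answer: <value>` line. Neither is confirmed to be the extractor used in this run, and the sample outputs are invented for illustration.

```python
import re

# Integers, simple fractions, or decimals (a deliberately narrow pattern).
NUMBER = r"-?\d+(?:/\d+)?(?:\.\d+)?"


def extract_last_number(text: str):
    """Free-form fallback (assumed): take the last number in the output."""
    found = re.findall(NUMBER, text)
    return found[-1] if found else None


def extract_answer_first(text: str):
    """Answer-first channel (assumed): read only a leading 'Answer:' line."""
    stripped = text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    match = re.match(rf"answer\s*:\s*({NUMBER})\s*$", first_line, re.IGNORECASE)
    return match.group(1) if match else None


# Invented outputs: same underlying answer, different layouts.
free_form = ("The answer is 12. Since this may be open, "
             "I would double check the case n = 5.")
answer_first = "Answer: 12\nReasoning: the sum telescopes; check n = 5."

print(extract_last_number(free_form))      # picks up the trailing 5
print(extract_answer_first(answer_first))  # reads the declared 12
```

In this sketch the extra hedging sentence changes the scored answer under the free-form extractor even though the model's stated answer is the same, which is the kind of failure an answer-first channel is meant to rule out.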

6. Limitations and Next Step

This should not be sold as "open framing breaks models." The honest claim is narrower: a small pilot found framing-sensitive behavior, then found that the most dramatic accuracy gap was not robust to an answer-first format control.
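One way to see how little 20 problems can support is to put an interval around each baseline accuracy. The sketch below uses the standard Wilson score interval on the counts implied by the reported percentages (12 of 20 and 9 of 20). It treats the two framings as independent samples, which is an assumption made for simplicity: the design is paired, and a paired test would need the per-problem discordant counts, which this memo does not list.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts implied by 60% and 45% exact-match accuracy on 20 problems.
neutral_ci = wilson_interval(12, 20)
open_ci = wilson_interval(9, 20)
print([round(x, 2) for x in neutral_ci], [round(x, 2) for x in open_ci])
```

The two intervals come out at roughly 0.39 to 0.78 and 0.26 to 0.66, so they overlap across most of their range, which is consistent with treating the 15 pp baseline gap as a lead to follow up rather than a finding.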

Next experiment: keep answer-first formatting, add verifier-aware prompting for the open frame, and test whether it improves open-framed answers without increasing false final answers on open or underspecified controls.