Research sprint report

When Math Models Think a Problem Is Unsolved

A paired-framing pilot on whether open-problem language changes a math model's accuracy, abstention, answer stability, and visible self-doubt.

Rishabh Sai | June 2026 | Run memo for Qwen/Qwen2.5-Math-1.5B-Instruct


The first run looked like evidence that an "open or unsolved" frame made a small math model less accurate. The follow-up run made the result more modest and more useful: much of the effect was a prompt-format and answer-extraction interaction, not a clean proof that open framing breaks reasoning.

Epistemic status: pilot result, not a paper. The clean claim is that the harness can detect framing-sensitive behavior, but this first dataset is too small and too model-specific for broad claims about reasoning models.

1. Question

The experiment takes the same known-solvable math problems and varies only the frame around them: neutral, known-solved, contest-style, or possibly open/unsolved. The intended signal is not hidden cognition. It is observable behavior: exact answer, abstention, self-doubt language, output length, and answer instability across framings.

The useful version of the question is narrow: can we measure whether a model becomes less decisive or less correct when it is told a problem may be unsolved, while still keeping appropriate uncertainty on genuinely open or underspecified controls?
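As a minimal sketch of the paired design described above, the snippet below wraps one problem in the four frames. The frame names follow the memo, but the template wording, the `FRAMES` dictionary, and the `build_paired_prompts` helper are illustrative assumptions, not the prompts used in the runs.

```python
# Hypothetical frame templates; the actual wording used in the runs is not shown in this memo.
FRAMES = {
    "neutral": "{problem}",
    "known_solved": "This problem has a known solution.\n\n{problem}",
    "contest": "This is a contest problem with a single exact answer.\n\n{problem}",
    "open": "This problem may be open or unsolved.\n\n{problem}",
}


def build_paired_prompts(problem_id, problem_text):
    """Return one row per frame; only the framing text differs between rows."""
    return [
        {
            "problem_id": problem_id,
            "frame": frame,
            "prompt": template.format(problem=problem_text),
        }
        for frame, template in FRAMES.items()
    ]
```

Under this layout, 20 problems times four frames gives the 80 rows of the first run.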

2. Runs

Run	Rows	Purpose
`hf-qwen-sprint20`	80	First paired-framing run over 20 solvable exact-answer problems.
`hf-qwen-controls`	20	Open and underspecified controls for uncertainty behavior.
`hf-qwen-answerfirst-f0-f3`	40	Format-control follow-up comparing neutral and open framing with answer-first prompting.

The backend was the Hugging Face Inference API through featherless-ai. Raw generations, metadata, rescored CSVs, and figures are committed in the repository.

3. Baseline Result

The first pass suggested an effect large enough to investigate. Neutral framing reached 60% exact-match accuracy; open framing reached 45%, a paired open-minus-neutral accuracy delta of -15.0 pp. Observable self-doubt rose only slightly, by 1.85 score points.

Figure (baseline run summary chart): Baseline run over 20 solvable problems and four framings. The open frame looked worse on exact-match accuracy, but this run still mixed reasoning failures with answer-format failures.

Measure	Neutral	Open	Delta
Exact-match accuracy	60%	45%	-15.0 pp
Self-doubt score	low	higher	+1.85
Abstention	lower	higher	+5.0 pp
Answer instability (across paired framings)	n/a	n/a	22.2%
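The paired metrics in the table can be sketched as follows. The row fields (`problem_id`, `frame`, `correct`, `answer`) are assumed names, and the instability definition here (share of pairs whose extracted answers differ, skipping pairs with a missing answer) is one plausible reading, not a confirmed description of the scoring code.

```python
from collections import defaultdict


def _pair_rows(rows, base, other):
    """Group rows by problem and keep problems that have both framings."""
    by_problem = defaultdict(dict)
    for row in rows:
        by_problem[row["problem_id"]][row["frame"]] = row
    return [(f[base], f[other]) for f in by_problem.values() if base in f and other in f]


def paired_delta_pp(rows, base="neutral", other="open"):
    """Accuracy of `other` minus accuracy of `base`, in percentage points."""
    pairs = _pair_rows(rows, base, other)
    if not pairs:
        raise ValueError("no paired rows found")
    base_correct = sum(bool(b["correct"]) for b, _ in pairs)
    other_correct = sum(bool(o["correct"]) for _, o in pairs)
    return 100 * (other_correct - base_correct) / len(pairs)


def answer_instability(rows, base="neutral", other="open"):
    """Share of pairs whose extracted answers differ (assumed definition)."""
    pairs = [
        (b, o)
        for b, o in _pair_rows(rows, base, other)
        if b["answer"] is not None and o["answer"] is not None
    ]
    if not pairs:
        return None
    changed = sum(b["answer"] != o["answer"] for b, o in pairs)
    return changed / len(pairs)
```

With 20 paired problems, 12 correct under the neutral frame and 9 under the open frame, `paired_delta_pp` returns -15.0, which matches the reported delta.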

4. Controls

The control set used genuine open or underspecified prompts. These rows were not scored for exact correctness. They were included to test whether the metric sees appropriate uncertainty and whether some framings push the model toward false concrete answers.

Figure (open and underspecified control summary chart): Controls produced much higher self-doubt and abstention than the solvable problems. Contest framing sometimes produced concrete answers where uncertainty would be more appropriate.
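To make the "text features" idea concrete, here is a rough sketch of marker-based scoring for self-doubt and abstention. The marker lists and the simple count are assumptions for illustration; the memo does not specify the real marker set or how the self-doubt score is scaled.

```python
# Hypothetical marker phrases; the actual lists used for scoring are not given in this memo.
DOUBT_MARKERS = [
    "i am not sure",
    "i'm not sure",
    "might be",
    "possibly",
    "unclear",
]
ABSTAIN_MARKERS = [
    "cannot be determined",
    "not enough information",
    "no known solution",
    "open problem",
]


def self_doubt_score(text):
    """Count occurrences of doubt phrases in the visible output (surface feature only)."""
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in DOUBT_MARKERS)


def is_abstention(text):
    """Flag outputs that decline to give a concrete answer."""
    lowered = text.lower()
    return any(marker in lowered for marker in ABSTAIN_MARKERS)
```

A score like this only measures wording in the output, so it should be read alongside a manual audit of the control rows.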

5. Format Control

The follow-up run forced a cleaner answer channel by asking the model to put the final answer before the explanation. Under that condition, the neutral and open frames both scored 55% exact-match accuracy. Relative to the baseline, neutral accuracy fell by 5 pp and open accuracy rose by 10 pp, so the apparent open-frame accuracy gap disappeared.

Figure (answer-first format-control summary chart): Answer-first prompting removes the baseline open-vs-neutral accuracy gap on the same 20-problem slice, while answer instability remains nonzero.

Measure	Neutral	Open	Delta
Exact-match accuracy	55%	55%	0.0 pp
Self-doubt score	low	slightly higher	+0.60
Answer instability (across neutral/open pairs)	n/a	n/a	15.8%
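One way an answer-first channel could be implemented is sketched below. The prompt suffix, the `Final answer:` prefix, and the `extract_answer_first` helper are assumptions; the exact instruction and extraction rule used in the follow-up run are not reproduced in this memo.

```python
import re

# Hypothetical instruction appended to each prompt to force the answer onto the first line.
ANSWER_FIRST_SUFFIX = (
    "\n\nWrite the final answer on the first line as 'Final answer: <value>', "
    "then give the explanation."
)

_FIRST_LINE = re.compile(r"^\s*final answer\s*:\s*(.+?)\s*$", re.IGNORECASE)


def extract_answer_first(generation):
    """Read the answer only from the first non-empty line; return None if it is missing."""
    for line in generation.splitlines():
        if line.strip():
            match = _FIRST_LINE.match(line)
            return match.group(1) if match else None
    return None
```

Reading only the first non-empty line is deliberately strict: a generation that buries its answer in later reasoning comes back as `None`, so it is counted as a format failure instead of being guessed at.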

6. Limitations and Next Step

This should not be sold as "open framing breaks models." The honest claim is narrower: a small pilot found framing-sensitive behavior, then found that the most dramatic accuracy gap was not robust to an answer-first format control.

Only one model was tested.
The solvable run used 20 of the 50 prepared problems.
Exact-match accuracy still mixes math errors with extraction failures.
The requested confidence line was often ignored, so calibration is not usable yet.
Self-doubt markers are text features, not direct evidence of hidden cognition.
The open-control set is manually audited, not a scored benchmark.
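Because exact-match accuracy mixes math errors with extraction failures, a rescoring pass could label each row separately, as in the sketch below. The `answer` and `gold` field names and the normalization rules are hypothetical and would need to match the real rescored CSVs.

```python
from collections import Counter


def normalize(answer):
    """Light normalization (assumed): trim, drop '$' and spaces, strip a trailing period."""
    cleaned = answer.strip().replace("$", "").replace(" ", "")
    return cleaned.rstrip(".").lower()


def classify_row(extracted, gold):
    """Separate 'no answer found' from 'answer found but wrong'."""
    if extracted is None or not extracted.strip():
        return "extraction_failure"
    if normalize(extracted) == normalize(gold):
        return "correct"
    return "wrong_answer"


def failure_breakdown(rows):
    """Tally outcome labels over rows that carry 'answer' and 'gold' fields."""
    return Counter(classify_row(row["answer"], row["gold"]) for row in rows)
```

Reporting the three counts per framing would show whether a framing changes the math or only the chance that an answer is extractable.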

Next experiment: keep answer-first formatting, add verifier-aware prompting for the open frame, and test whether it improves open-framed answers without increasing false final answers on open or underspecified controls.