Research sprint report
When Math Models Think a Problem Is Unsolved
A paired-framing pilot on whether open-problem language changes a math model's accuracy, abstention, answer stability, and visible self-doubt.
The first run looked like evidence that an "open or unsolved" frame made a small math model less accurate. The follow-up run made the result more modest and more useful: much of the effect was a prompt-format and answer-extraction interaction, not a clean proof that open framing breaks reasoning.
Epistemic status: pilot result, not a paper. The clean claim is that the harness can detect framing-sensitive behavior, but this first dataset is too small and too model-specific for broad claims about reasoning models.
1. Question
The experiment takes the same known-solvable math problems and varies only the frame around them: neutral, known-solved, contest-style, or possibly open/unsolved. The intended signal is not hidden cognition. It is observable behavior: exact answer, abstention, self-doubt language, output length, and answer instability across framings.
The useful version of the question is narrow: can we measure whether a model becomes less decisive or less correct when it is told a problem may be unsolved, while still keeping appropriate uncertainty on genuinely open or underspecified controls?
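The paired design can be sketched as a small prompt builder in which only the frame sentence changes between rows. This is a minimal illustration, not the harness's actual code: the frame wording, the `FRAMES` dictionary, the field names, and `build_paired_prompts` are all assumptions made for the example.

```python
# Hypothetical frame wording; the pilot's real prompt text is not reproduced here.
FRAMES = {
    "neutral": "Solve the following problem.",
    "known_solved": "The following problem has a known solution. Solve it.",
    "contest": "The following is a contest problem. Solve it.",
    "open": "The following problem may be open or unsolved. Attempt it.",
}


def build_paired_prompts(problem_id, problem_text):
    """Return one prompt row per frame; only the frame sentence varies."""
    return [
        {
            "problem_id": problem_id,
            "frame": name,
            "prompt": f"{prefix}\n\n{problem_text}",
        }
        for name, prefix in FRAMES.items()
    ]


# Example: one solvable problem expands into four paired rows.
rows = build_paired_prompts("p01", "What is 17 * 23?")
```

With four frames, 20 problems expand into 80 rows, which is consistent with the row count reported for the first run.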
2. Runs
| Run | Rows | Purpose |
|---|---|---|
| hf-qwen-sprint20 | 80 | First paired-framing run over 20 solvable exact-answer problems. |
| hf-qwen-controls | 20 | Open and underspecified controls for uncertainty behavior. |
| hf-qwen-answerfirst-f0-f3 | 40 | Format-control follow-up comparing neutral and open framing with answer-first prompting. |
The backend was the Hugging Face Inference API through featherless-ai. Raw generations, metadata, rescored CSVs, and figures are committed in the repository.
3. Baseline Result
The first pass suggested an effect large enough to investigate. Neutral framing reached 60% exact-match accuracy; open framing reached 45%. The paired open-minus-neutral accuracy delta was -15.0 pp. Observable self-doubt rose only slightly, by 1.85 score points.
| Measure | Neutral | Open | Delta |
|---|---|---|---|
| Exact-match accuracy | 60% | 45% | -15.0 pp |
| Self-doubt score | low | higher | +1.85 |
| Abstention | lower | higher | +5.0 pp |

Answer instability across paired framings: 22.2%.
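The paired accuracy delta and the instability rate are both per-pair quantities. A minimal sketch of how they might be computed follows, assuming each row carries a `problem_id`, a `frame`, a boolean `correct` flag, and an extracted `answer` (hypothetical field names, not the repository's schema), and assuming that instability means the two frames produced different extracted answers for the same problem.

```python
def paired_summary(rows, base="neutral", other="open"):
    """Compare two frames over the problems that have a row for both."""
    by_problem = {}
    for r in rows:
        by_problem.setdefault(r["problem_id"], {})[r["frame"]] = r

    # Keep only complete pairs so both accuracies share one denominator.
    pairs = [(p[base], p[other]) for p in by_problem.values()
             if base in p and other in p]
    n = len(pairs)
    acc_base = sum(a["correct"] for a, _ in pairs) / n
    acc_other = sum(b["correct"] for _, b in pairs) / n
    # Assumed definition: a pair is unstable if the extracted answers differ.
    unstable = sum(a["answer"] != b["answer"] for a, b in pairs) / n
    return {
        "n_pairs": n,
        "delta_pp": round((acc_other - acc_base) * 100, 1),
        "instability": unstable,
    }


# Toy data for illustration only; these are not the pilot's rows.
toy_rows = [
    {"problem_id": "p1", "frame": "neutral", "correct": True, "answer": "4"},
    {"problem_id": "p1", "frame": "open", "correct": True, "answer": "4"},
    {"problem_id": "p2", "frame": "neutral", "correct": True, "answer": "7"},
    {"problem_id": "p2", "frame": "open", "correct": False, "answer": "8"},
    {"problem_id": "p3", "frame": "neutral", "correct": False, "answer": "1"},
    {"problem_id": "p3", "frame": "open", "correct": False, "answer": "1"},
    {"problem_id": "p4", "frame": "neutral", "correct": True, "answer": "9"},
    {"problem_id": "p4", "frame": "open", "correct": True, "answer": "9"},
]
summary = paired_summary(toy_rows)
```

On the toy data, one of the four pairs flips from correct to wrong, so the delta is -25.0 pp and the instability rate is 0.25.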
4. Controls
The control set used genuine open or underspecified prompts. These rows were not scored for exact correctness. They were included to test whether the metric sees appropriate uncertainty and whether some framings push the model toward false concrete answers.
5. Format Control
The follow-up run forced a cleaner answer channel by asking the model to put the final answer before the explanation. Under that condition, the neutral and open frames both scored 55% exact-match accuracy. The apparent open-frame accuracy gap disappeared.
| Measure | Neutral | Open | Delta |
|---|---|---|---|
| Exact-match accuracy | 55% | 55% | 0.0 pp |
| Self-doubt score | low | slightly higher | +0.60 |

Answer instability across neutral/open pairs: 15.8%.
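One reason an answer-first format can change measured accuracy is that it gives the scorer a fixed place to look, so a missing answer can be separated from a wrong one. The sketch below illustrates that idea only; the `Answer:` label, the regular expression, and the `normalize` and `score` helpers are assumptions for this example, not the extraction rules used in the runs.

```python
import re

# Assumed label format: "Answer: ..." or "Final answer = ..." on the first line.
ANSWER_LINE = re.compile(r"^\s*(?:final\s+)?answer\s*[:=]\s*(.+?)\s*$", re.IGNORECASE)


def extract_answer_first(generation):
    """Return the labeled answer on the first non-empty line, else None."""
    for line in generation.splitlines():
        if line.strip():
            match = ANSWER_LINE.match(line)
            return match.group(1) if match else None
    return None


def normalize(text):
    """Light normalization before exact match (hypothetical rules)."""
    return text.strip().rstrip(".").replace(" ", "").lower()


def score(generation, gold):
    """Separate extraction failures from genuinely wrong answers."""
    answer = extract_answer_first(generation)
    if answer is None:
        return "extraction_failure"
    return "correct" if normalize(answer) == normalize(gold) else "wrong"


good = score("Answer: 391\nBecause 17 * 23 = 391.", "391")
late = score("Let me think about this.\nAnswer: 391", "391")
bad = score("Answer: 390.", "391")
```

Here `late` is counted as an extraction failure rather than a math error, even though the generation contains the right number. Reporting the three outcomes separately is one way to keep format effects from being read as reasoning effects.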
6. Limitations and Next Step
This should not be sold as "open framing breaks models." The honest claim is narrower: a small pilot found framing-sensitive behavior, then found that the most dramatic accuracy gap was not robust to an answer-first format control.
- Only one model was tested.
- The solvable run used 20 of the 50 prepared problems.
- Exact-match accuracy still mixes math errors with extraction failures.
- The requested confidence line was often ignored, so calibration is not usable yet.
- Self-doubt markers are text features, not direct evidence of hidden cognition.
- The open-control set is manually audited, not a scored benchmark.
Next experiment: keep answer-first formatting, add verifier-aware prompting for the open frame, and test whether it improves open-framed answers without increasing false final answers on open or underspecified controls.