We had a presentation to plan. Background research, structure, decisions about what to include, copy for the slides. Opus 4.7 had just dropped with improvements in reasoning and agentic tasks, so switching our co-work session on claude.ai over to it felt like the obvious move.

Two hours in, we had nothing.

What Co-Work Sessions Are

Claude has a co-work mode on claude.ai, a collaborative session type built for planning and thinking through problems before you write code or produce anything. The idea is that you bring a capable model into the planning phase, work through the structure and approach together, and arrive at your next step with a clear picture of what you are building. We use it for presentations, strategy documents, and anything where real decisions need to be made before production starts. No output gets produced in co-work, just direction.

We use Opus for these sessions because planning is where you want the model to actually think through the problem, not pattern-match on similar requests. When 4.7 dropped with reasoning improvements, putting it into co-work felt obvious.

What Actually Happened

It stalled. Not broken, not errored. Stalled.

Responses came back flat and passive, like the model was waiting to be told exactly what to do at each step rather than working through the problem with us. No research happened. No structure came together. We gave it the task, we gave it context, we gave it time. We got back something that felt like the model had decided not to move.

We ruled out a bad prompt by trying variations. Same result. After two hours we stopped and switched models.

What Opus 4.6 Did With the Same Information

We took the full conversation from that session, pasted it into a new co-work session on Opus 4.6, and gave it the same brief. That was the only change.

It moved immediately. Research got done. The structure came together. Copy came back usable. Within an hour we had a working presentation.

Same task. Same brief. Different model version. Completely different result.

The Distinction That Matters

This is not a verdict on Opus 4.7 overall. We use Claude Code separately for coding sessions and Opus 4.7 works correctly there, including for planning work within a coding context. The problem is specific to co-work sessions in claude.ai.

Co-work and Claude Code are different environments with different session structures and different ways the model handles open-ended input. Something about how Opus 4.7 behaves inside co-work specifically caused the stall. In a Claude Code context it did not. That distinction matters because the fix is not to stop using 4.7. It is to use 4.6 in co-work until this resolves.

We also wrote a separate post covering what Opus 4.7 actually improved across coding and agentic tasks, if you want the full picture before deciding where to use each version.

Why This Probably Happened

We are not Anthropic engineers, but some things are worth considering.

Opus 4.7 is optimised for structured agentic tasks. The improvements show up in coding performance measured by benchmarks like SWE-bench, in computer use, and in multi-step reasoning under defined conditions. Open-ended planning in co-work, where the model needs to make directional judgment calls without a clear success condition, is a different type of problem. The benchmarks do not capture that.

Context changed the outcome completely. When we gave 4.6 the previous conversation, it had something 4.7 never had in that first session: a picture of where we had been, what we were trying to do, and what had already been tried. Starting cold into a vague planning brief is harder for any model than picking up a conversation with background. That context probably did as much work as the model version switch itself.
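That confound is testable, at least roughly. The sketch below shows one way we might separate the two variables, model version versus prior context, by running the same brief through the Anthropic Messages API in a two-by-two comparison. The model IDs, the brief, and the transcript placeholder are all illustrative rather than real values, and the raw API is not the co-work environment, so treat anything it shows as a rough signal, not a reproduction of what happened in claude.ai.

```python
# Rough 2x2 check: model version x presence of the earlier transcript.
# Model IDs below are placeholders; substitute the IDs your account exposes.
# The Messages API is not the claude.ai co-work environment, so this only
# approximates the comparison described in the post.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BRIEF = "Plan a 20-minute presentation: research angles, structure, slide copy."
PRIOR_TURNS = [
    # The transcript of the stalled session would go here, alternating roles:
    # {"role": "user", "content": "..."},
    # {"role": "assistant", "content": "..."},
]

def run(model_id: str, with_context: bool) -> str:
    # Either start cold with the brief, or prepend the earlier conversation.
    messages = (PRIOR_TURNS if with_context else []) + [
        {"role": "user", "content": BRIEF}
    ]
    response = client.messages.create(
        model=model_id,
        max_tokens=2048,
        messages=messages,
    )
    return response.content[0].text

for model_id in ("claude-opus-4-6", "claude-opus-4-7"):  # placeholder IDs
    for with_context in (False, True):
        label = "with prior transcript" if with_context else "cold brief"
        print(f"--- {model_id}, {label} ---")
        print(run(model_id, with_context)[:500])
```

Reading the four outputs side by side would show whether the flat, waiting-for-instructions behaviour follows the model version, the missing context, or both.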

What We Are Doing Now

For co-work sessions in claude.ai, we are staying on Opus 4.6. For Claude Code, Opus 4.7 is in the mix. We will keep testing 4.7 in co-work as Anthropic ships updates, but right now 4.6 is doing the job there and 4.7 is not, so the choice is clear.

The broader point: model versions and session environments interact in ways that benchmarks do not predict. A model that performs well in one context can stall in another. The only way to know is to test in your actual workflow, not read the release notes and assume.

We build and integrate AI tools into business workflows, and model selection for specific tasks is something we think about constantly. If you are trying to work out the right setup for your team, we are happy to talk through it.