Anthropic released Opus 4.7 on April 16. We wrote about the benchmark improvements at the time. Then we actually tried it in a real session. The experience was different enough from what we expected that it is worth talking about separately, because the problem is not the model itself. It is where you use it.
Co-Work for Code: What It Is
Claude has a feature called co-work for code. It is a collaborative session mode built specifically for planning and thinking through problems before you write a single line. The idea is that you bring a capable model into the planning phase, work through the structure and approach together, and then move into your actual coding environment with a clear picture of what you are building. We use it for exactly that: no code gets written in co-work, just decisions.
We use Opus for these sessions because planning is where you want the model to actually think, not just pattern-match on previous answers. When 4.7 dropped with improvements in reasoning and agentic tasks, switching it into co-work felt like an obvious test.
What Opus 4.7 Did in Co-Work
It was a train smash. That is the honest description.
The task was a presentation: background research, structure, decisions about what to include, copy for the slides. The kind of open-ended planning work co-work is built for. No research happened. The model did not move. Responses came back flat, like it was waiting to be told exactly what to do at every step rather than thinking through the problem with us.
We gave it the task. We gave it context. We got back something that felt like the model had stalled. We sat with it long enough to rule out a bad prompt, then stopped and switched models.
What Opus 4.6 Did With the Same Information
We took the conversation from that session, pasted it into a new co-work session on Opus 4.6, and gave it the same task. Same brief, same information, with the transcript of the stalled session included as context, and a different model version.
It moved immediately. Research got done. The structure came together. Copy came back usable. Within an hour we had a working presentation.
This is not a marginal difference. This is the difference between a session that produces something and a session that produces nothing.
The Important Distinction
Before drawing the wrong conclusion: this is not a verdict on Opus 4.7 overall. We use Claude Code separately for coding sessions, and Opus 4.7 works fine there for planning within code. The issue is specific to co-work.
Co-work and Claude Code are different environments with different session structures. Something about how Opus 4.7 behaves inside a co-work session specifically caused the problem. In a Claude Code planning context it does not. That distinction matters because the fix is not to downgrade globally. It is to use 4.6 in co-work until this sorts itself out.
Why This Probably Happened
We are not Anthropic engineers, but a few things are worth considering.
Opus 4.7 is optimised for structured agentic tasks. The headline improvements are in coding benchmarks such as SWE-bench, computer use, and reasoning under defined conditions. Open-ended planning in co-work, where the model needs to make judgment calls about direction without a clear success condition, is a different type of problem. The benchmarks do not capture that.
Context changed the outcome completely. When we gave 4.6 the previous chat, it had something 4.7 never had: a picture of where we had been, what we were after, and what had already been tried. That context almost certainly did as much work as the model version itself. Starting cold into a vague planning session is harder for any model than picking up a conversation with real background.
What We Are Doing
For co-work sessions, we are staying on Opus 4.6. For Claude Code, Opus 4.7 stays in the mix. We will keep testing 4.7 in co-work as Anthropic ships updates, but right now 4.6 is doing the job and 4.7 is not, so the choice is obvious.
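The policy above reduces to a trivial routing rule: pin the model per session environment rather than globally. A minimal sketch in Python, where the environment keys and model names are illustrative placeholders, not real API identifiers:

```python
# Pin the model per session environment rather than globally.
# Environment names and model strings are hypothetical placeholders.
MODEL_BY_ENVIRONMENT = {
    "co-work": "opus-4.6",      # stays on 4.6 until co-work behaviour improves
    "claude-code": "opus-4.7",  # 4.7 works fine for planning within code
}

def select_model(environment: str) -> str:
    """Return the pinned model for an environment, defaulting to 4.6."""
    return MODEL_BY_ENVIRONMENT.get(environment, "opus-4.6")
```

The point of the default is the same as the broader lesson: until an environment has been tested with a newer version, the known-good model wins.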
The broader lesson is that model versions and session environments interact in ways that benchmarks do not predict. A model that performs well in one context can stall in another. The only way to know is to test in your actual workflow, not read the release notes and assume.
We build and integrate AI tools for businesses across different platforms, and model selection for specific tasks is something we think about constantly. If you are trying to figure out the right setup for your workflow, we are happy to talk through it.