Claude Extended Thinking Mode: Developer Guide to the GA Release

Introduction

Anthropic made Extended Thinking generally available for the Claude API on June 26, 2026. Extended Thinking is a mode where Claude works through a problem step by step before generating a final answer — thinking through intermediate reasoning that is visible to developers but not included in the response token count in the same way as standard output.

For developers building on Claude, this changes what the API can handle reliably. Tasks that previously required careful prompt engineering to coax structured reasoning out of the model — complex analysis, multi-constraint optimization, mathematical reasoning, long-document synthesis — become more tractable by design rather than by prompt craft.

This article explains what Extended Thinking is, how it works at the API level, when to use it versus standard mode, and what the cost and latency tradeoffs look like.

Claude Extended Thinking Mode Is Now Generally Available: What Developers Need to Know

What Happened

Anthropic moved Claude Extended Thinking from limited preview to general availability on June 26, 2026, for the Claude API.

Extended Thinking was first introduced in Claude 3.7 Sonnet in early 2025 as a feature that allowed the model to reason through problems before answering. The GA release makes it available across Claude models that support the feature, removes preview access restrictions, and stabilizes the API interface for production use.

With Extended Thinking enabled, a Claude API call returns both a thinking block — the model’s internal reasoning process — and the final response. The thinking block is visible to the developer. It contains the model’s step-by-step reasoning: how it approached the problem, what it considered, how it resolved ambiguity, and how it arrived at its conclusion.

The thinking block is not shown to end users by default; it is an inspection and debugging surface for developers. The final response is what users see.

API-level changes:

Extended Thinking is enabled by passing thinking: { type: "enabled", budget_tokens: N } in the API request, where budget_tokens sets the maximum number of tokens the model can use for internal reasoning before generating the final response. A higher budget allows more complex reasoning at higher cost; a lower budget reduces cost and latency.

Why It Matters

Reasoning quality improves measurably on hard tasks. The performance gap between Extended Thinking enabled and disabled is most pronounced on tasks that require working through multiple constraints, resolving ambiguity, or synthesizing information from long contexts. For routine tasks — summarization, simple Q&A, standard code generation — the difference is smaller and the cost increase is often not justified.

The thinking block is a debugging surface. When a Claude API call produces a surprising or incorrect answer, standard mode offers no visibility into how the model arrived at it. Extended Thinking exposes the reasoning process. Developers can inspect the thinking block, identify where the model’s reasoning diverged from correct, and adjust their prompt or system design accordingly. This is a significant operational advantage for teams building reliability into AI applications.

Budget control gives developers cost flexibility. The budget_tokens parameter is a direct handle on the reasoning depth vs. cost tradeoff. For a task that benefits from deep reasoning, set a high budget. For a task where quick responses matter more than maximum reasoning quality, set a low budget or disable Extended Thinking. This is more precise than the alternatives — retry logic, temperature adjustment, or elaborate chain-of-thought prompting — because it directly controls how much computation the model applies to the problem.

Industry Impact

Extended Thinking raises the baseline for what API-accessible reasoning can do. Until recently, getting a language model to reason carefully through a complex problem required elaborate prompt engineering — chain-of-thought instructions, few-shot examples, decomposition into sub-problems. Extended Thinking makes systematic reasoning a parameter you set, not a technique you engineer. This lowers the barrier for developers who want reasoning quality without deep prompt optimization expertise.

It changes the competitive calculus for AI coding assistants and analytical tools. Products built on Claude that handle tasks requiring careful reasoning — code review, security analysis, complex document synthesis — get a capability upgrade without product changes. For independent software vendors building on the Claude API, GA availability means they can now ship Extended Thinking-dependent features with confidence in API stability.

OpenAI’s o-series models and Google’s Gemini thinking models established the category. Extended Thinking GA positions Claude as fully competitive in the reasoning model space rather than lagging on a capability that enterprise buyers now routinely evaluate. From an enterprise procurement perspective, the question “does it support extended reasoning?” now has the same answer across the major model providers.

Claude Extended Thinking Mode API diagram showing thinking block reasoning process and budget tokens parameter for developer applications

Developer Impact

When to use Extended Thinking:

Use it when:

The task requires working through multiple constraints simultaneously — pricing optimization, scheduling, multi-criteria evaluation
The problem involves ambiguity that needs to be reasoned through — unclear requirements, contradictory information, edge cases
Long-context synthesis is required — analyzing a large codebase, summarizing a lengthy document while maintaining logical consistency
Accuracy matters more than latency — legal analysis, financial modeling, security review
You want to inspect how the model is reasoning about a problem, not just what it concludes

Do not use it when:

The task is routine and latency matters — live chat, autocomplete, standard summarization
The query is simple enough that step-by-step reasoning adds no value
Cost per call is the primary constraint and the task does not benefit measurably from deeper reasoning

Setting the budget_tokens parameter:

A low budget (1,000–4,000 tokens) is appropriate for moderately complex reasoning where some structured thinking helps but full depth is unnecessary. A medium budget (4,000–16,000 tokens) handles most analytical tasks that benefit from Extended Thinking. A high budget (16,000–32,000 tokens) is for the most complex reasoning tasks — deep technical analysis, multi-document synthesis, complex constraint satisfaction. Start low and increase only if output quality justifies the cost.

Cost implications:

Extended Thinking tokens are billed. The thinking block tokens are not free — they are billed at the input token rate for the thinking process. A request with a 16,000-token thinking budget that uses all of it costs substantially more than a standard request. For applications where Extended Thinking is enabled on every call, factor this into unit economics before production deployment.

Latency implications:

Extended Thinking adds latency proportional to how many thinking tokens the model uses. A request with a 32,000-token budget takes longer to return than a standard request. For latency-sensitive applications, disable Extended Thinking or use a low budget. For applications where answer quality matters more than response time, the tradeoff is favorable.

Business Impact

Enterprise AI applications get a reasoning quality upgrade. Enterprise buyers using Claude API through their existing contracts gain Extended Thinking without renegotiation. Applications built on Claude that handle analytical work — contract review, due diligence, technical specification analysis — can now enable Extended Thinking on appropriate task types and deliver demonstrably better results on hard cases.

The inspection capability changes compliance conversations. One concern enterprise legal and compliance teams have about AI systems is the opacity of reasoning — the model produces a conclusion with no visible justification. Extended Thinking’s thinking block does not fully solve this problem (the thinking block is the model’s internal monologue, not a formal proof), but it provides more transparency than a standard API response. For organizations evaluating AI-assisted workflows in regulated industries, this visibility matters.

ISVs and AI product companies need to re-evaluate their reasoning tier strategy. Products that previously used prompt engineering workarounds to simulate structured reasoning should evaluate whether Extended Thinking produces better results at acceptable cost. In many cases it will, and the migration from prompt workaround to Extended Thinking is straightforward at the API level.

Future Outlook

Budget_tokens will become a standard API parameter pattern. The explicit compute budget model — telling the API how much reasoning to apply — is more flexible and predictable than implicit reasoning depth. Expect this pattern to generalize: future API features will likely expose similar budget controls for other compute-intensive capabilities.

Extended Thinking will influence how agentic systems are designed. In multi-step agent loops, Extended Thinking can be applied selectively — use it for the planning step where the agent decides what to do next, and disable it for the execution steps where the action is straightforward. This selective application of reasoning depth is a more sophisticated agent design pattern than applying the same reasoning depth to every step.

The thinking block will evolve into a richer observability surface. The current thinking block is a text stream of internal reasoning. Future iterations may provide more structured inspection — tagging specific reasoning steps, flagging uncertainty, or identifying which parts of the context the model drew on for specific conclusions. This would significantly improve debugging and verification capabilities for production AI applications.

FAQ

Q: What is Claude Extended Thinking Mode? Extended Thinking is a Claude API feature where the model reasons through a problem step by step before generating its final answer. The reasoning process is visible to developers in a separate “thinking block” returned alongside the response.

Q: How do I enable Extended Thinking in the Claude API? Add thinking: { type: "enabled", budget_tokens: N } to your API request, where N is the maximum number of tokens the model can use for internal reasoning. The Anthropic documentation at docs.anthropic.com has the complete parameter reference.

Q: Does Extended Thinking cost more? Yes. Thinking block tokens are billed. The additional cost depends on how many thinking tokens the model uses, which is bounded by your budget_tokens setting. For tasks that benefit significantly from deeper reasoning, the quality improvement typically justifies the cost. For routine tasks, it often does not.

Q: Does Extended Thinking add latency? Yes. More thinking tokens means more time before the final response is returned. For latency-sensitive applications, use a low budget or disable Extended Thinking. For accuracy-critical applications, the latency tradeoff is often acceptable.

Q: Can end users see the thinking block? The thinking block is returned to the developer in the API response. Whether end users see it is a product decision — most applications do not expose it to users. It is primarily a debugging and inspection surface for developers.

Q: When should I not use Extended Thinking? For routine, low-complexity tasks where latency and cost matter — live chat responses, simple summarization, standard code completion, straightforward Q&A. Extended Thinking adds value where the problem genuinely requires working through multiple steps or constraints.

Q: How does Extended Thinking compare to OpenAI’s o-series models? Both expose explicit reasoning processes before generating final answers. The architectural approaches differ — OpenAI’s o-series models are distinct models trained with reinforcement learning for reasoning; Claude Extended Thinking is a mode available on the standard Claude models. The practical comparison depends on the specific task and models being compared.

Q: Does Extended Thinking work with all Claude models? Extended Thinking is available on Claude models that support the feature. Check Anthropic’s documentation at docs.anthropic.com/en/docs/extended-thinking for the current list of supported models.

Q: Can I use Extended Thinking in agentic workflows? Yes, and this is one of the most valuable use cases. Applying Extended Thinking to the planning and decision-making steps of an agent loop — where the agent determines what to do next — while disabling it for straightforward execution steps is an effective design pattern.

Q: What is the maximum budget_tokens value? Check current Anthropic documentation for the latest limits, as these may be updated over time. The practical limit is typically in the range of 32,000–100,000 tokens depending on the model.

Analyst Perspective

The Extended Thinking GA release is less interesting as a product announcement than as a signal about how reasoning capability is being productized.

The budget_tokens parameter is the important part. It reframes “how smart should this AI response be?” from a qualitative question answered by model selection or prompt engineering into a quantitative parameter with direct cost and latency implications. This is the right abstraction. It gives developers explicit control over a tradeoff they were previously managing indirectly through model choice, temperature, and prompt structure.

The second-order effect: as reasoning budget becomes an explicit API parameter, the economics of AI applications become more predictable. You can model “for our most complex queries, we allocate X thinking tokens at Y cost, and for routine queries, we allocate Z tokens at lower cost.” This is enterprise-grade cost modeling, not “we pay per token and hope the results are good enough.”

The thinking block as a debugging and compliance surface is underappreciated in the GA coverage. Enterprise AI deployments in regulated industries — finance, healthcare, legal — face consistent pushback on AI opacity. “How did it reach this conclusion?” is a question that currently has no good answer for most AI API responses. The thinking block does not fully answer that question, but it is meaningfully better than nothing. For organizations piloting Claude API in compliance-sensitive workflows, Extended Thinking GA may accelerate timelines that were blocked on auditability concerns.

Key Takeaways

Claude Extended Thinking Mode is generally available as of June 26, 2026 — it enables explicit reasoning before final response generation, visible to developers in a thinking block.
Enable it with thinking: { type: "enabled", budget_tokens: N } in the API request. Higher budget = more reasoning depth, higher cost, higher latency.
Use it for complex analytical tasks, multi-constraint problems, long-context synthesis, and cases where reasoning transparency matters. Skip it for routine, latency-sensitive tasks.
The thinking block is a debugging surface — it shows how the model reasoned toward its conclusion, making unexpected outputs traceable.
Cost is higher than standard mode due to thinking token billing. Model unit economics before enabling on all calls in a production application.
The budget_tokens parameter is the key design decision: start low, measure output quality, increase only where the improvement justifies the cost.

Continue Learning

About GAVIHOS

GAVIHOS helps developers, founders and technology enthusiasts understand AI, software engineering and emerging technologies through practical guides, tutorials and industry analysis.

Stay Updated

Follow GAVIHOS for practical AI, technology and developer-focused insights.