THE BRIEF
Inference now owns the bill — and most infrastructure strategies are still organized around training. In 2026, inference accounts for roughly two-thirds of all AI compute demand, up from approximately one-third in 2023, according to Deloitte's 2026 TMT Predictions. The model training race that drove headlines from 2022 through 2024 produced the systems you're now running at scale. The question for every CIO is whether your infrastructure reflects that shift — because your cloud vendor's pricing model certainly does.
The per-token cost story is real. The total spend story contradicts it entirely. Per-token prices have collapsed. Yet enterprises are reporting monthly AI compute bills in the tens of millions of dollars, according to Deloitte's 2026 Tech Trends analysis — despite dramatic unit cost reductions in the same period. Agentic workflows, longer context windows, and internal demand expansion are outpacing every efficiency gain. CFOs expecting cost relief from AI productivity are waiting for a reversal that isn't coming. The conversation you need to be having is about volume, not rates.
Enterprise AI budgets are systematically wrong — and not by a small margin. IDC's research, as reported by CIO.com, forecasts that Global 1000 companies will underestimate AI infrastructure costs by approximately 30% through 2027. The 2025 State of AI Cost Management report corroborates this: 85% of companies miss their AI infrastructure forecasts by more than 10%, with nearly one in four missing by 50% or more. This isn't a failure of individual planning. It's a structural problem with how traditional IT budgeting models handle AI workloads — which behave less like predictable systems and more like living organisms that expand to fill whatever capacity is available.
Shadow AI is an infrastructure problem, not an HR problem. Over 80% of workers use unapproved AI tools, according to UpGuard's State of Shadow AI report, and fewer than one in five employees use exclusively company-approved tools. Every unauthorized model call is a data governance failure that sits outside your infrastructure perimeter. For regulated industries, the compliance exposure this creates isn't addressable through policy alone. It requires visibility at the infrastructure layer. If you don't know where your data is going, your stack isn't as compliant as your documentation suggests.
The build-vs.-buy math has materially shifted for sustained inference workloads. Lenovo's 2026 Total Cost of Ownership analysis found that on-premises AI infrastructure reaches cost parity with cloud in as few as four months for organizations running GPU utilization above 20%, with a potential 18x cost advantage per million tokens compared to API consumption at production scale. This isn't a universal conclusion — it depends heavily on workload profile and utilization rates — but it signals that the conventional wisdom of "cloud is always cheaper" has expired for enterprises running predictable, high-volume inference in production.
Hyperscaler capex isn't just supply — it's leverage. The five largest hyperscalers have committed a combined $630 to $690 billion in AI infrastructure capital expenditure for 2026, according to Futurum Group. That capital is purchasing GPU clusters, networking fabric, and power contracts that will be operational for the next decade. For enterprise customers, the implication is direct: every quarter of delay in negotiating multi-year cloud commitments or diversifying to alternative providers is a quarter in which the hyperscalers' installed base grows and your negotiating position weakens.
THE REALITY CHECK
The per-token cost of running a GPT-4-equivalent model has fallen from approximately $30 to $40 per million tokens in early 2023 to under $1 per million tokens by 2025 and 2026, according to Epoch.ai's LLM inference price tracking data. That decline has not produced proportional savings for enterprise AI budgets — because inference volume has grown faster than unit costs have fallen. The real infrastructure challenge in 2026 isn't finding cheaper compute. It's determining which infrastructure commitments you'll still want when the models change again in 18 months — and which ones will still be billing you whether you want them or not.
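The arithmetic behind that gap can be sketched in a few lines. The unit prices below sit within the range cited above; the monthly token volumes are hypothetical, chosen only to illustrate how volume growth can swamp a unit-price decline.

```python
def monthly_spend(price_per_million_tokens: float, tokens_per_month: float) -> float:
    """Total monthly spend in dollars: unit price times volume."""
    return price_per_million_tokens * tokens_per_month / 1_000_000

# Unit prices follow the range cited in the text; volumes are hypothetical.
spend_2023 = monthly_spend(30.0, 2e9)     # 2 billion tokens/month at $30 per million
spend_2026 = monthly_spend(1.0, 500e9)    # 500 billion tokens/month at $1 per million

price_change = 1.0 / 30.0 - 1             # fractional change in unit price
volume_multiple = 500e9 / 2e9             # growth in monthly token volume
spend_multiple = spend_2026 / spend_2023  # growth in total monthly spend

print(f"2023 spend: ${spend_2023:,.0f}")
print(f"2026 spend: ${spend_2026:,.0f}")
print(f"Unit price {price_change:.0%}, volume {volume_multiple:.0f}x, spend {spend_multiple:.1f}x")
```

Under these assumed volumes, a price cut of roughly 97% still leaves the bill more than eight times larger. Total spend rises whenever the volume multiple exceeds the price reduction factor, which is 30x in this sketch.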
THE SIGNAL
The gap between what CIOs know and what they're actually deciding is the infrastructure story of 2026.
Every major CIO survey published in the last six months confirms the same tension: AI infrastructure investments are accelerating, costs are being underestimated, and the decisions being made now will be significantly harder to reverse in 2028 than they appear today.
Here is the uncomfortable position most enterprise technology leaders currently occupy: they are being asked to make 5-year capital commitments in a market where the technical landscape shifts materially every 12 to 18 months. The model that's frontier today is commodity next year. The GPU architecture that's state-of-the-art today is being superseded by a successor that draws more power and requires different cooling infrastructure to deploy. And yet the data center decisions, cloud agreements, colocation contracts, and on-premises hardware purchases being finalized in Q2 and Q3 of 2026 will still be operational constraints in 2029.
This is the infrastructure trap: decisions that look tactical carry strategic duration.
The supply constraint is structural — and it's working in the hyperscalers' favor.
NVIDIA's Blackwell GPU lineup — the current performance benchmark for large-scale AI inference — carried a reported backlog of approximately 3.6 million units through mid-2026, according to Spheron's GPU shortage analysis. This is not a temporary supply disruption. It reflects multi-year purchase agreements that the largest hyperscalers entered into well before most enterprise buyers were evaluating Blackwell. Amazon, Google, Microsoft, and Meta collectively secured the capacity that enterprise customers now need to access — and they're renting it back through cloud services with margins attached.
This dynamic isn't unusual in technology markets. Hyperscalers can absorb the capital risk of hardware at scale that enterprises cannot. But it creates an asymmetry in negotiating power that compounds every quarter. The enterprise buyer who didn't lock in capacity in 2024 is now paying spot or on-demand rates, or renting from someone who did.
The pricing signal from cloud providers is not what it appears.
AWS implemented approximately a 15% price increase on EC2 Capacity Blocks for ML in early 2026, with the p5e.48xlarge instance — one of the primary GPU compute options for enterprise AI workloads — rising from approximately $34.61 per hour to $39.80 per hour, according to usage.ai's AWS pricing tracker. This matters not because a 15% increase is catastrophic in isolation, but because it represents a reversal of the multi-year deflationary trend in cloud compute pricing that most enterprise infrastructure strategies had been built to assume would continue.
The cloud compute narrative of the last five years was unit cost deflation. The emerging narrative for AI-constrained compute is selective repricing upward on high-demand capacity — while spot prices for less-constrained GPU models continue to decline. Enterprises that haven't segmented their workloads by price sensitivity and planned for this divergence are operating on assumptions that no longer hold across the board.
Who is winning and who is losing in enterprise AI infrastructure.
The hyperscalers are winning the scale race. AWS, Azure, and Google Cloud have the GPU inventory, the managed service layers, and the enterprise account relationships to remain the default infrastructure choice for the majority of organizations in 2026. Enterprises with genuinely elastic, unpredictable, or early-stage AI workloads — prototyping, burst training, experimentation — are correctly in the cloud.
On-premises infrastructure vendors — Dell, HPE, and Lenovo, with NVIDIA's DGX platform at the center — are winning regulated industries and sustained-inference workloads. Healthcare organizations managing patient data under HIPAA, financial institutions operating under DORA and GDPR Article 9, and government agencies with data residency mandates face a compliance calculus that most hyperscaler sovereign cloud offerings still don't fully satisfy. For these organizations, the sovereign AI infrastructure question has a clear answer: keep the data behind your perimeter.
The category quietly gaining enterprise interest is the neocloud: specialized GPU cloud providers who offer H100 and Blackwell access at materially lower rates than hyperscalers, without the proprietary managed-service lock-in. They aren't a complete enterprise solution — governance, SLA coverage, and integration depth are real gaps — but for the inference tier of a hybrid architecture, they're becoming competitively relevant.
The bottom-line stakes.
The organizations that built reversible infrastructure architectures in 2024 and early 2025 — those that standardized on open APIs, maintained multi-cloud optionality, and resisted the pull of proprietary managed AI services — are entering H2 2026 with negotiating leverage. The organizations that consolidated to a single hyperscaler's AI stack because it was the path of least resistance are now discovering that their exit costs are real and substantial.
Switching costs in enterprise AI infrastructure are not abstract. They include data migration fees, egress charges, workload retraining on new hardware, API integration rewrites, and the rebuilding of observability and governance tooling tuned to the original environment. These costs don't appear in procurement models that were written before 2025. They need to appear in the ones being written now.
The organizations that will maintain genuine leverage at contract renewal in 2027 and 2028 are not the ones that optimized hardest for this quarter's per-token rate. They're the ones that protected optionality — the ability to credibly threaten to walk.
THE DEEP DIVE
Thesis: Enterprise AI infrastructure decisions made in 2026 will determine organizational AI leverage through 2030 — and most organizations are optimizing for the wrong variable.
The dominant frame in enterprise AI infrastructure discussions is cost per token. It is the wrong frame.
Cost per token is a spot market metric. What CIOs are actually purchasing when they make infrastructure commitments is durability — the capacity for a given architectural decision to remain a competitive asset as the surrounding technology stack evolves, rather than becoming a liability that compounds with every passing quarter.
The organizations that executed well on enterprise cloud adoption in the 2010s were not the ones who found the cheapest compute. They were the ones who built abstraction layers early enough to switch providers, negotiate renewals from a position of genuine optionality, and upgrade infrastructure without wholesale platform migrations. The infrastructure question in 2026 is structurally identical to the one those organizations answered correctly a decade ago — but with larger financial stakes, a faster technology cycle, and significantly more complex hidden costs.
The Hidden Cost Stack
GPU purchase price is the visible number. It is not the important number.
Introl's 5-year GPU infrastructure total cost of ownership modeling finds that hardware acquisition represents approximately 35% of total infrastructure cost over a 5-year deployment period. The remaining 65% consists of power, cooling, networking, storage, software licensing, and operational talent — costs that are consistently underestimated or excluded from initial infrastructure business cases.
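A quick way to sanity-check a business case against that ratio is to back out the implied total from the hardware line. The sketch below assumes the 35% hardware share cited above; the $10 million hardware budget is a hypothetical input, and the function name is illustrative.

```python
def implied_tco(hardware_cost: float, hardware_share: float = 0.35) -> dict:
    """Back out total and non-hardware cost from hardware's share of TCO."""
    if not 0 < hardware_share <= 1:
        raise ValueError("hardware_share must be in (0, 1]")
    total = hardware_cost / hardware_share
    return {
        "hardware": hardware_cost,
        "total": total,
        "non_hardware": total - hardware_cost,
    }

estimate = implied_tco(10_000_000)  # hypothetical $10M hardware budget
print(f"Implied 5-year TCO: ${estimate['total']:,.0f}")
print(f"Non-hardware costs: ${estimate['non_hardware']:,.0f}")
```

If a proposal's total is close to its hardware quote, the operating costs are probably missing rather than small.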
The power numbers are not intuitive at scale. NVIDIA's B200 GPU, the current performance standard for large-scale inference, draws approximately 1,000 watts per unit, according to Verticaldata.io's infrastructure cost analysis. A 100-GPU cluster running continuously draws about 100 kilowatts at the GPUs alone, and total facility load can reach the hundreds of kilowatts once host servers, networking, and cooling overhead are included. At standard commercial electricity rates and accounting for typical data center power efficiency ratios, annual energy costs for a mid-sized GPU cluster can reach well into the hundreds of thousands of dollars, before hardware, staffing, or software costs are counted.
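A rough estimate makes the scale concrete. In the sketch below, only the 1,000-watt GPU draw comes from the figure cited above. The server overhead factor, the power usage effectiveness (PUE), and the electricity rate are assumptions to replace with site-specific numbers.

```python
HOURS_PER_YEAR = 24 * 365

def annual_energy_cost(
    gpu_count: int,
    watts_per_gpu: float = 1000.0,   # per-GPU draw cited in the text
    server_overhead: float = 1.5,    # assumed: CPUs, memory, NICs, fans per GPU
    pue: float = 1.4,                # assumed facility power usage effectiveness
    usd_per_kwh: float = 0.12,       # assumed commercial electricity rate
) -> tuple[float, float]:
    """Return (facility load in kW, annual energy cost in dollars)."""
    it_load_kw = gpu_count * watts_per_gpu * server_overhead / 1000.0
    facility_kw = it_load_kw * pue
    annual_kwh = facility_kw * HOURS_PER_YEAR
    return facility_kw, annual_kwh * usd_per_kwh

facility_kw, cost = annual_energy_cost(100)
print(f"Facility load: {facility_kw:.0f} kW")
print(f"Annual energy cost: ${cost:,.0f}")
```

With these assumptions a 100-GPU cluster lands at roughly 210 kW and a little over $220,000 per year. The result scales linearly with each input, so a higher electricity rate or a less efficient facility moves the figure proportionally.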
The cooling problem is distinct from the power problem, and they compound each other. Traditional enterprise data centers are engineered for rack densities of 5 to 15 kilowatts per rack, according to IEA energy and AI demand analysis. AI-ready infrastructure commonly operates at 40 to 100 or more kilowatts per rack — and less-efficient enterprise data centers already direct more than 30% of total energy consumption to cooling before AI workloads are factored in. Liquid cooling — increasingly mandatory at Blackwell-class GPU densities — adds $50,000 to $200,000 or more per rack in capital cost, according to Verticaldata.io. For an organization deploying at any meaningful scale, this is not a rounding error. It is a capital project.
The talent line item is the one most consistently excluded from infrastructure business cases — and the most consequential. GPU infrastructure engineers capable of deploying, optimizing, and maintaining large-scale NVIDIA clusters command approximately $275,000 annually in competitive markets, per Introl's infrastructure modeling. This is not a skill set that scales with training programs, and it is not a hire that organizations can execute quickly in a constrained labor market. The cloud alternative to this cost — managed services — solves the talent problem but introduces the lock-in problem. There is no version of this tradeoff without a cost.
The Reversibility Framework
The most useful decision framework for CIO infrastructure planning in 2026 is not cost per token. It is reversibility.
Every infrastructure decision sits in one of four categories:
Category one — high reversibility, low cost: Cloud instances on-demand, spot capacity, API-layer integrations using open standards. These decisions can be unwound in weeks with manageable cost. Experimentation and prototyping belong here.
Category two — high reversibility, moderate cost: Reserved cloud capacity on 1-to-3-year terms, colocation agreements with standard exit clauses, SaaS AI tooling without proprietary data accumulation. These decisions can be reversed, but carry a cost measured in contract penalties and migration effort. Steady-state inference that isn't yet proven at production scale belongs here.
Category three — low reversibility, high cost: Proprietary managed AI services with data gravity (every month of data ingested into a vendor's managed system increases the migration cost), custom hardware installations optimized for a specific GPU generation, on-premises deployments without established hardware upgrade paths. These decisions can be reversed, but the cost is substantial and takes quarters to execute. High-volume, production inference workloads at verified economics belong here.
Category four — effectively irreversible: Long-term hyperscaler commitments at scale that embed AI services deeply into core operational systems — generating egress lock-in, proprietary data formats, and deeply coupled service dependencies where the switching cost exceeds any available competitive benefit. Nothing belongs in category four by choice.
The strategic imperative is to remain in categories one and two for as long as possible, enter category three only when unit economics are definitively superior over a verifiable time horizon, and avoid category four entirely unless regulatory mandates leave no alternative.
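The four categories can be turned into a simple triage rule for an inventory of commitments. This is a minimal sketch, not a scoring standard: the two-month and six-month thresholds are assumptions standing in for "weeks" and "quarters", and the field names, example commitments, and dollar figures are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum

class Reversibility(IntEnum):
    HIGH_LOW_COST = 1
    HIGH_MODERATE_COST = 2
    LOW_HIGH_COST = 3
    EFFECTIVELY_IRREVERSIBLE = 4

@dataclass(frozen=True)
class Commitment:
    name: str
    months_to_exit: float         # estimated time to unwind the decision
    exit_cost_usd: float          # penalties, egress, migration, rewrites
    switching_benefit_usd: float  # best estimated benefit from moving away

def classify(c: Commitment) -> Reversibility:
    """Assign a category using assumed time thresholds (2 and 6 months)."""
    if c.months_to_exit < 2:
        return Reversibility.HIGH_LOW_COST
    if c.months_to_exit < 6:
        return Reversibility.HIGH_MODERATE_COST
    # Slow to exit: irreversible in practice if leaving costs more than it gains.
    if c.exit_cost_usd > c.switching_benefit_usd:
        return Reversibility.EFFECTIVELY_IRREVERSIBLE
    return Reversibility.LOW_HIGH_COST

portfolio = [
    Commitment("on-demand GPU instances", 0.5, 5_000, 0),
    Commitment("1-year reserved capacity", 4, 250_000, 400_000),
    Commitment("on-prem cluster, single GPU generation", 9, 2_000_000, 5_000_000),
    Commitment("deeply embedded managed AI platform", 18, 12_000_000, 4_000_000),
]
for c in portfolio:
    print(f"{c.name}: category {int(classify(c))}")
```

The value of the exercise is less in the thresholds than in forcing an exit-time and exit-cost estimate for every commitment on the list.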
The Abstraction Layer Question
Gartner projects that 70% of organizations running multiple large language models will implement AI gateway or abstraction layer solutions by 2028. The adoption rationale is straightforward: an abstraction layer — a unified API sitting between enterprise applications and multiple underlying model providers — converts a low-reversibility architecture (direct API dependencies on a single vendor) into a high-reversibility one (swap the underlying model without rewriting the application layer).
This is not a novel concept. It is the enterprise architecture pattern that governed database abstraction in the 1990s and API gateway adoption in the 2010s. Both debates ended the same way: the organizations that implemented abstraction early maintained optionality through vendor transitions. The organizations that argued it was unnecessary overhead paid for that judgment during every subsequent migration.
The practical implementation requires choosing between purpose-built AI gateway solutions and building a lightweight routing layer internally. Purpose-built solutions (including LiteLLM, Kong AI Gateway, and others) offer faster implementation and lower maintenance overhead. An internally built layer offers tighter organizational control and avoids adding a new vendor dependency. The right answer depends on internal engineering capacity — but the question of whether to implement an abstraction layer at all is no longer meaningfully debatable.
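For the internally built option, the core of the pattern is small. The sketch below is illustrative only: it does not use the API of LiteLLM, Kong AI Gateway, or any vendor SDK, and the class names and stub providers are hypothetical. A real adapter would wrap each vendor's client behind the same `complete` method.

```python
from typing import Protocol

class ChatProvider(Protocol):
    """The only contract application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

class EchoProvider:
    """Stand-in for a vendor adapter; a real one would call the vendor SDK."""
    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

class FailingProvider:
    """Stand-in for a provider that is down or has restricted access."""
    def complete(self, prompt: str) -> str:
        raise RuntimeError("provider unavailable")

class Gateway:
    """Routes each request through providers in priority order."""
    def __init__(self, providers: dict[str, ChatProvider], order: list[str]) -> None:
        self.providers = providers
        self.order = order

    def complete(self, prompt: str) -> str:
        errors = []
        for name in self.order:
            try:
                return self.providers[name].complete(prompt)
            except Exception as exc:  # fall through to the next provider
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))

gateway = Gateway(
    {"primary": FailingProvider(), "secondary": EchoProvider("vendor-b")},
    order=["primary", "secondary"],
)
reply = gateway.complete("Summarize Q2 inference spend")
print(reply)
```

Because applications call only `Gateway.complete`, swapping or reordering vendors is a configuration change rather than an application rewrite. Logging, cost attribution, and policy checks can be added at the same choke point.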
The Failure Modes
Four failure modes are appearing reliably in enterprise AI infrastructure programs at scale in 2026.
The cooling surprise: Organizations acquire GPU hardware through colocation without fully accounting for power density requirements. The colocation provider's standard rack allotment is insufficient for the GPU load. Retrofitting or liquid cooling installation delays the deployment by months and doubles the infrastructure capital budget. This failure is preventable through pre-deployment site assessment — and is consistently skipped in programs that treat GPU procurement as an IT acquisition rather than a facilities project.
The spot price trap: Organizations build production inference workloads on cloud spot capacity, correctly benefiting from substantially lower rates. When demand spikes or capacity becomes unavailable, production workloads fail. The migration to reserved capacity is rushed and expensive. The organizations that encounter this failure are not naive — they understood spot pricing was variable. What they didn't model was the business cost of availability interruption at production scale.
The shadow AI governance failure: Security and compliance teams implement AI governance frameworks at the application and API layer while employees route sensitive data through personal accounts on consumer AI tools. The infrastructure perimeter does not match the compliance documentation. This failure is increasingly common — and increasingly expensive when it surfaces during regulatory audits. The gap between acceptable use policy and actual network telemetry is where the liability accumulates.
The architecture lock-in through inertia: Organizations default to the managed AI services of their existing cloud vendor because integration is easiest in the short term. Over 12 to 18 months, data accumulates in proprietary formats, workflows are tuned to vendor-specific APIs, and the switching cost quietly reaches a level where it exceeds any available competitive benefit from moving. This is not a dramatic failure event. It is a slow reduction in negotiating leverage — invisible in quarterly budget reviews, fully visible at contract renewal.
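The spot price trap described above comes down to a term that is usually missing from the comparison: the business cost of interrupted hours. A minimal sketch follows, in which every rate, interruption estimate, and outage cost is hypothetical.

```python
def expected_monthly_cost(
    hourly_rate: float,
    interrupted_hours: float,
    outage_cost_per_hour: float,
    hours_per_month: float = 730.0,
) -> float:
    """Compute paid for available hours plus the business cost of outages."""
    compute = hourly_rate * (hours_per_month - interrupted_hours)
    outage = interrupted_hours * outage_cost_per_hour
    return compute + outage

# Hypothetical inputs: spot is cheaper per hour but loses 10 hours a month.
spot = expected_monthly_cost(12.0, interrupted_hours=10.0, outage_cost_per_hour=5_000.0)
reserved = expected_monthly_cost(30.0, interrupted_hours=0.0, outage_cost_per_hour=5_000.0)

print(f"Spot, including outage cost: ${spot:,.0f}")
print(f"Reserved: ${reserved:,.0f}")
```

In this example the spot rate is 60% lower, yet the spot option costs more than twice as much once outages are priced in. The comparison flips back in favor of spot only when the workload can tolerate interruption at low business cost.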
The Consequence
The infrastructure decisions being made in Q2 and Q3 of 2026 are not IT operations decisions. They are strategic decisions about which vendors will hold leverage over your organization's AI capabilities in 2028 and 2029.
The difference between an organization that maintains genuine infrastructure optionality and one that does not is invisible in the quarterly budget review. It is fully visible in the contract renewal conversation — when one organization has a credible alternative and the other is negotiating from a position of operational dependency.
The organizations that understood this reality when they were making cloud decisions in 2013 and 2014 are the ones that successfully renegotiated their cloud contracts from a position of strength in 2019 and 2020. The ones that didn't are the ones whose cloud costs are now a fixed operational constraint, not a variable they control.
The 2026 infrastructure decision is the 2013 cloud decision. The organizations that treat it that way will have a different set of options available to them in 2030.
THE PLAYBOOK
C-Suite: Three Decisions Before Q3
- Require a 5-year total cost of ownership model for any AI infrastructure commitment exceeding 12 months, including power, cooling, talent, and governance costs — because the hidden operating costs of AI infrastructure consistently exceed the hardware acquisition budget, and the business cases being built today are systematically excluding the costs that will drive budget variance.
- Ask your CIO which infrastructure decisions made in the last 18 months are reversible and at what cost — because abstraction layers that enable model-layer flexibility must be built before vendor dependencies accumulate, and the organizations that haven't started that conversation yet are accumulating switching costs every quarter without knowing it.
- Require explicit data portability terms and exit cost estimates in any managed AI service contract before signing — because proprietary data gravity is the primary infrastructure lock-in mechanism of the current AI cycle, and the cost to exit a managed AI platform accelerates with every month of data accumulation in a vendor's proprietary format.
CIO/CTO: Three Implementation Priorities
- Implement workload segmentation before the next infrastructure procurement cycle: classify every AI workload by utilization predictability and data sensitivity, then route accordingly. Use cloud for bursty, experimental, or low-sensitivity workloads, and on-premises or colocation for sustained, high-volume inference with regulated data. Most organizations currently operate the other way around, running regulated data through cloud-based inference while using on-premises infrastructure for experimental workloads, and inverting that pattern is where the cost and compliance efficiency sits.
- Deploy an AI gateway or abstraction layer this quarter in any environment running more than one LLM provider — because direct API dependencies on individual model vendors are how model-layer lock-in accumulates, and the conversion from a reversible architecture to a low-reversibility one happens quietly, through the steady accumulation of integrations that assume a specific vendor's API contract.
- Audit shadow AI exposure at the infrastructure layer through network monitoring and data egress analysis, not policy review — because policy frameworks applied only at the application layer do not capture the data flows that determine compliance reality, and the gap between your acceptable use documentation and your actual network telemetry is where the regulatory liability lives.
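The workload segmentation rule in the first priority above can be expressed as a small routing function. This is a sketch under stated assumptions: the two boolean attributes are a simplification of "utilization predictability" and "data sensitivity", the workload names are hypothetical, and mixed cases are flagged for review because the rule as stated only covers the two clear-cut combinations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Workload:
    name: str
    sustained: bool       # predictable, high-volume demand
    regulated_data: bool  # touches data under privacy or residency mandates

def placement(w: Workload) -> str:
    """Route the clear-cut cases; flag mixed profiles for manual review."""
    if w.sustained and w.regulated_data:
        return "on-premises or colocation"
    if not w.sustained and not w.regulated_data:
        return "cloud"
    return "review"  # assumed handling: mixed profiles need a case-by-case decision

workloads = [
    Workload("claims-triage inference", sustained=True, regulated_data=True),
    Workload("marketing copy prototype", sustained=False, regulated_data=False),
    Workload("ad hoc analysis on patient records", sustained=False, regulated_data=True),
]
for w in workloads:
    print(f"{w.name}: {placement(w)}")
```

Running an inventory through even a rule this crude tends to surface the inverted placements quickly.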
Department Lead / AI Initiative Owner: Three Operational Steps
- Establish a cost-per-outcome metric for every AI deployment this quarter, because enterprise AI spend is coming under increased CFO scrutiny, and the programs that survive the next budget cycle will be those tied to measurable business outputs rather than raw compute consumption metrics that don't connect directly to revenue or cost avoidance.
- Document every model API your team calls, what data you're sending, and the business process consequence if that provider restricts access or raises prices significantly — because cloud AI compute pricing is not stable across all capacity types in 2026, and the programs without a documented dependency map are the ones that scramble reactively when a provider changes terms or restricts access.
- Identify one high-volume production inference workflow and run a 3-year on-premises TCO comparison — because the economics of sustained inference have shifted materially in the last 18 months, and the programs that can demonstrate infrastructure cost efficiency improvements will carry substantially stronger positioning in the next budget cycle than those arguing for continued cloud spend at scale.
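Two of the steps above reduce to short calculations: a cost-per-outcome metric and an on-premises break-even point. The sketch below uses entirely hypothetical dollar figures and outcome counts. A real comparison should also fold in the power, cooling, and talent costs discussed in the Hidden Cost Stack.

```python
import math
from typing import Optional

def cost_per_outcome(monthly_cost: float, outcomes_per_month: float) -> float:
    """Dollars spent per completed business outcome (ticket, claim, document)."""
    if outcomes_per_month <= 0:
        raise ValueError("outcomes_per_month must be positive")
    return monthly_cost / outcomes_per_month

def breakeven_months(
    capex: float, onprem_monthly_opex: float, cloud_monthly_cost: float
) -> Optional[int]:
    """Months until cumulative on-premises cost drops below cumulative cloud cost."""
    monthly_saving = cloud_monthly_cost - onprem_monthly_opex
    if monthly_saving <= 0:
        return None  # on-premises never catches up at these rates
    return math.ceil(capex / monthly_saving)

# Hypothetical workflow: $400K/month in cloud, 80,000 outcomes per month.
unit_cost = cost_per_outcome(400_000, 80_000)
months = breakeven_months(capex=3_000_000, onprem_monthly_opex=150_000,
                          cloud_monthly_cost=400_000)
onprem_3yr = 3_000_000 + 36 * 150_000
cloud_3yr = 36 * 400_000

print(f"Cost per outcome in cloud: ${unit_cost:.2f}")
print(f"Break-even month: {months}")
print(f"3-year totals: on-prem ${onprem_3yr:,} vs cloud ${cloud_3yr:,}")
```

The break-even month is highly sensitive to utilization. If the on-premises cluster sits idle for part of the month, the effective cloud cost it displaces shrinks and the payback period stretches.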
THE NUMBERS
$2.59 trillion
Gartner's May 2026 projection for worldwide AI spending in 2026, a 47% increase year-over-year. Infrastructure accounts for the single largest segment of that figure — more than software and services combined.
Gartner, May 2026
$487 billion
IDC's 2026 projection for global AI infrastructure hardware spending — the accelerated servers, networking fabric, and storage purpose-built for AI workloads. A 53% increase from 2025. Hyperscalers account for approximately 80 to 87% of that total.
IDC AI Infrastructure Tracker, 2026
>90%
Gartner's March 2026 prediction for the reduction in inference cost for a 1-trillion-parameter model by 2030 compared to 2025. Unit costs will continue to fall. The organizations managing total infrastructure spend — not unit rates — will be the ones positioned to capture that deflation rather than watch it absorbed by volume growth.
Gartner, March 2026
81–94%
The range of IT leaders expressing concern about AI vendor lock-in, according to a February 2026 Parallels cloud survey. Nearly universal awareness. Materially lower action rate.
Parallels Cloud Survey, February 2026
~88%
The decline in AWS H100 spot GPU pricing over one period spanning 2024 to 2025, according to Cast.ai data. The spot market is deflationary. Reserved and on-demand capacity for constrained, high-demand models is not.
Cast.ai, 2026
325 to 580 TWh
The range of projected US data center electricity consumption by 2028, up from 183 TWh in 2024, according to the U.S. Department of Energy. AI workloads are the primary driver. Power availability is becoming a site selection constraint before GPU availability is.
U.S. Department of Energy, 2025
The IT leaders who say they're worried about vendor lock-in, at least 81% of them, are correct. The organizations that do something about it this quarter will negotiate from a position of strength in 2028. The ones that don't will negotiate from a position of dependency, and everyone in that room will know it.
WHAT'S NEXT + WHAT'S COMING
The signal gaining traction across the most reliable infrastructure-focused channels — CTO conversations on X, Datacenter Dynamics coverage, and The Register's infrastructure beat — is power delivery, not GPU availability. The bottleneck in AI infrastructure is moving from silicon supply to electrical capacity. Analog Devices' $1.5 billion acquisition of Empower Semiconductor, announced May 19, 2026 (WSJ), targets exactly this constraint: Empower specializes in high-density power management and integrated voltage regulators optimized for AI compute systems. This is the first major M&A signal that the infrastructure industry has identified power delivery as the next primary constraint after GPU supply. Expect similar acquisitions in power delivery and thermal management over the next 60 days.
One thing to watch before next Tuesday: Google Cloud implemented a list price increase on A3 Ultra instances in Europe and Asia effective May 1, 2026 — the first instance of a hyperscaler raising prices on a specific AI compute tier in those regions. Watch whether AWS or Azure follow with comparable adjustments in European markets over the next several weeks. If they do, regional AI compute pricing divergence becomes a live variable in on-premises vs. cloud economics for multinational organizations — particularly those with EU data residency requirements.
M&A and investments to watch:
- Analog Devices acquires Empower Semiconductor for $1.5 billion (announced May 19, 2026) — power delivery infrastructure for AI compute. The signal: the hardware supply chain is investing in power, not just silicon.
- Neoclouds reaching $20 billion — Forrester projects specialized GPU cloud providers to collectively reach $20 billion in revenue in 2026, driven by enterprise demand for lower-cost inference compute outside hyperscaler pricing structures. The category is gaining enterprise credibility faster than the hyperscalers would prefer.
- Hyperscaler data center pipeline — new facility announcements from AWS, Google, and Microsoft continue across North America and Europe. Watch for utility partnership announcements specifically — they are the leading indicator of where constrained power capacity is being allocated.
On the horizon:
- Enterprise cloud contract renewals are accelerating through Q3 2026 as committed-use agreements from the 2023 and 2024 AI adoption wave come up for renegotiation — the infrastructure decisions of that period are becoming budget line items now.
- Forrester projects enterprises will defer approximately 25% of planned AI infrastructure spend to 2027 as CFO scrutiny of AI ROI intensifies. The programs that survive the deferral cycle will be those that tied AI costs to documented business outcomes — not those arguing for continued spend based on capability potential alone.
This report was produced with AI assistance and human editorial review.
Vol. 02, No. 04 · May 2026 · Confidential – Subscriber Use Only