Commoditization of the base layer · Q1 2026
Over the last six quarters, frontier-equivalent inference has fallen ≈ 94% on a dollars-per-megatoken basis. Six providers now ship within a narrow capability band. We argue the consequence is a relocation of margin from weights to runtime — and we sketch where it is accruing.
§ 01 · What we believe, in a paragraph
The base layer — pretrained frontier-grade models accessible by API — is commoditizing on every dimension that matters to an operator: price, latency, availability, and capability spread. This is not a claim about quality ceilings; labs still differentiate at the frontier. It is a claim about the median call, which is where economics lives. For the workloads that ninety percent of agent products actually run, choice of model is rapidly becoming a procurement decision, not a strategic one.
A product shipping today can assume three or more providers within ±5% capability on its use case. Portability is no longer aspirational; it is table stakes. The strategic question is where the harness accrues value when the model does not.
§ 02 · Price collapse is compounding, not linear
Blended input+output cost for frontier-equivalent inference has dropped from $7.10 / M tokens in Q3 2024 to $0.42 / M tokens in Q1 2026 — a 94% decline over six quarters. The decline is not uniform across providers, but the floor has moved everywhere.
Fig. 1 · Blended inference price · frontier-equivalent tier
Two notes on the curve. First, the cliff from Q2'25 forward coincides with the third-cohort open-weight releases; margin compression at the low end pulled paid APIs down with it. Second, the Q1'26 point is a blended median — the cheapest frontier-equivalent provider is meaningfully below it.
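The headline number is easy to sanity-check, and checking it makes the "compounding, not linear" claim concrete: a 94% drop over six quarters implies a roughly constant quarter-over-quarter decline, not a constant dollar step. A minimal sketch using only the two endpoints from Fig. 1:

```python
# Endpoints of the blended $/M-token curve (Fig. 1).
start, end = 7.10, 0.42   # Q3 2024 and Q1 2026
quarters = 6              # quarter-over-quarter steps between them

total_decline = 1 - end / start                           # the ~94% headline
quarterly_decline = 1 - (end / start) ** (1 / quarters)   # constant-rate equivalent

print(f"total decline: {total_decline:.1%}")              # 94.1%
print(f"implied quarterly decline: {quarterly_decline:.1%}")  # 37.6%
```

The per-quarter figure is what "compounding" means here: each quarter the floor fell by roughly the same fraction of the prior quarter's price, so the absolute dollar drops shrink even as the relative pace holds.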
§ 03 · The field, ranked on four axes
| Provider | Model · tier | Eval (Priors) | Latency p50 | $/M tok | Notes |
|---|---|---|---|---|---|
| Atlas Labs | Atlas-4 · reasoning | 87.2 | 680 ms | $0.31 | Best price-to-quality on long-context. |
| Kite | Kite-Ultra · 09 | 86.8 | 540 ms | $0.44 | Fastest at median. Tool-call reliability leads. |
| Meridian | Mer-3 · pro | 86.1 | 710 ms | $0.52 | Strong on code, weaker on retrieval-heavy tasks. |
| Nth Research | Nth-L2 | 85.4 | 820 ms | $0.28 | Cheapest at tier. Open-weight fork available. |
| Cardinal | C-Prime · 26 | 84.9 | 620 ms | $0.48 | Best structured-output adherence in class. |
| Orbit | Orbit-Max | 84.1 | 900 ms | $0.36 | Enterprise distribution; slower but sticky. |
Eval scores use the Priors internal suite (v3.2), which weights long-horizon tool use, structured output fidelity, and retrieval fidelity above contest-style reasoning. The spread between provider 1 and provider 6 is 3.1 points. For comparison, the equivalent spread in Q3 2024 was 14.6.
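The Eval column is a weighted composite. The suite's exact weights are not published here beyond the task counts in § 07; assuming components are weighted in proportion to task count (an illustrative assumption, not the suite's documented formula), and using placeholder component scores, the aggregation would look like:

```python
# Hypothetical aggregation for the Priors suite (v3.2).
# Task counts come from the methodology section; the per-component
# scores below are illustrative placeholders, not published numbers.
components = {
    "long_horizon_tool_use":  {"n": 240, "score": 88.0},
    "structured_output":      {"n": 180, "score": 86.5},
    "long_context_retrieval": {"n": 120, "score": 85.0},
}

total_n = sum(c["n"] for c in components.values())
suite_score = sum(c["score"] * c["n"] / total_n for c in components.values())
print(f"weighted suite score: {suite_score:.1f}")
```

Weighting by task count is one plausible scheme; the point is only that the single published number folds three distinct operator-visible capabilities into one scalar, so two providers with the same score can fail in different places.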
§ 04 · Capabilities are converging faster than interfaces
Below: the same six providers scored on four operator-visible axes. The bars show how tightly the field is clustering. A year ago this chart had daylight between providers; today, on most axes, it does not.
The interesting story is the two unresolved axes: latency and price. Providers still differentiate meaningfully here, which is why procurement teams are reopening multi-vendor contracts. But the axes that determine whether a product can be built — reasoning, tools, structured output — are now essentially solved problems at the base layer.
§ 05 · Five consequences we're underwriting to
1. **Model-portability becomes a product feature.** Customers are asking for it in diligence calls. Startups that cannot swap providers with a config change lose procurement deals to ones that can.
2. **Margin relocates to the harness.** Context, memory, tool-surface, and evaluation are where differentiation is now built — and where the next generation of platform companies will sit.
3. **Evaluation is the new observability.** Shipping without evals is shipping on faith. The category is about two years behind where Datadog was in 2014; the opportunity is correspondingly large.
4. **Vertical runtimes beat horizontal ones, early.** Taste compounds at the domain level. A legal-agent runtime built by lawyers-turned-engineers will beat a horizontal platform for at least the next four years.
5. **Open-weight pressure sets the floor, not the ceiling.** The frontier remains a lab story. But the median call — and therefore the unit economics of the products above — is governed by what Nth or its successors ship next.
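The portability consequence ("swap providers with a config change") reduces, in practice, to keeping provider identity in data rather than code. A minimal sketch — the provider names and blended prices come from the table above, but the config shape and the `cost_estimate` helper are hypothetical, not any vendor's actual SDK:

```python
from dataclasses import dataclass

@dataclass
class ProviderConfig:
    name: str
    model: str
    price_per_mtok: float  # blended $/M tokens, from the comparison table

# Entries mirror the table; switching providers is a data change,
# not a code change.
PROVIDERS = {
    "atlas": ProviderConfig("Atlas Labs", "Atlas-4", 0.31),
    "kite":  ProviderConfig("Kite", "Kite-Ultra-09", 0.44),
    "nth":   ProviderConfig("Nth Research", "Nth-L2", 0.28),
}

def cost_estimate(provider_key: str, tokens: int) -> float:
    """Dollar cost of a call totalling `tokens` tokens on the given provider."""
    cfg = PROVIDERS[provider_key]
    return tokens / 1_000_000 * cfg.price_per_mtok

# Moving a 50k-token workload from Kite to Nth is a one-key change:
print(f"${cost_estimate('kite', 50_000):.3f}")  # $0.022
print(f"${cost_estimate('nth', 50_000):.3f}")   # $0.014
```

The same indirection is what diligence teams probe for: if the model string and pricing live in config, the switching cost is a redeploy; if they are threaded through prompt templates and tool schemas, it is a rewrite.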
§ 06 · Companies we are tracking, not yet priced
- **contextd** · Drop-in retrieval runtime for agents with reversible writes. Spun out of a search lab. Watching for first enterprise design partners.
- **Evalry** · Eval-as-code for agent pipelines. Open-source core with managed cloud. Adoption curve looks like early Datadog.
- **Harness.law** · Vertical runtime for legal research & drafting agents. Founded by ex-partners at a litigation firm. Distribution is real.
We expect to underwrite two of the three within the year. We will publish a follow-up note on vertical runtimes in Q2.
§ 07 · Methodology & sources
Pricing data aggregated from published provider sheets and rate-limited API probes against identical prompt batches (n=12,400 per provider per quarter). Evaluation suite v3.2 combines internal tool-use tasks (n=240), structured-output fidelity (n=180), and long-context retrieval (n=120). Latency measured from us-east-1 and us-west-2 endpoints, p50 of 1000-token completions over 14 consecutive days.
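The reported latency is a plain median over probe samples. A sketch of that reduction with illustrative values (the real runs use n=12,400 probes per provider per quarter, from two regions over 14 days):

```python
import statistics

# Illustrative per-call latencies (ms) for 1000-token completions;
# these are made-up samples, not measured data.
samples = [540, 610, 505, 980, 560, 530, 700, 515, 545, 620]

p50 = statistics.median(samples)  # 552.5 for this sample set
print(p50)
```

The median is deliberately insensitive to tail outliers — the 980 ms sample above barely moves it — which is why p50 rather than the mean is the operator-facing number in the table.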
Scores are model-level, not provider-level — a provider's best-in-class model is used for the table. Internal+ distribution: LPs and active portfolio companies.