- Best overall agent (May 2026): Claude 4 Opus — wins on coding, long-document reasoning, and tool-use reliability; loses only on real-time web freshness.
- Best free tier: Gemini 2.5 Flash — generous limits, strong multimodality, weak personality.
- Best for live research: Perplexity Pro — citation hygiene that the others still don't match.
- Best autonomous agent: ChatGPT's "operator" mode — most polished long-running task execution despite a higher hallucination rate.
- Most disappointing: Grok 3 — entertaining, but unreliable for anything that has to be right the first time.
Two years ago, "AI agent" meant a chatbot that could write a poem. In May 2026 it means a piece of software that opens browser tabs, edits your codebase, files your expense report and books your flight — sometimes well, sometimes catastrophically. We tested the five products that, between them, account for roughly 91% of paid consumer agent subscriptions worldwide.[1]
The five contenders
These are the agents we benchmarked. We deliberately excluded enterprise-only offerings (Microsoft Copilot for M365, AWS Q Developer, Vertex Enterprise) — they deserve their own review. All five products were tested on their highest paid consumer tier as of April 30, 2026.
- Claude 4 Opus (Anthropic): Methodical, citation-aware, the strongest reasoning model in our suite.
- ChatGPT (OpenAI): The broadest ecosystem. Operator mode is the most ambitious agentic feature shipping today.
- Gemini 2.5 Pro (Google): Fastest of the group, native Workspace integration, weaker on creative writing.
- Perplexity Pro (Perplexity): Less an agent than a search-first researcher, but unbeatable on factual queries with citations.
- Grok 3 (xAI): Real-time X access is genuinely useful; everything else lags the field.
How we tested
We built a fixed test harness of 184 tasks across six categories: factual research, long-document analysis, coding (greenfield and refactor), autonomous multi-step tasks, creative writing, and voice/multimodal. Each task was run three times per agent on different days. Outputs were graded blind by two evaluators against a rubric; disagreements went to a third reviewer.
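For readers who want to poke at the methodology, the grading flow compresses to something like the sketch below. It is illustrative only: the `agent`, `evaluators`, and `tiebreaker` objects are placeholders, and the disagreement threshold is an assumed value, not the published rubric's.

```python
from statistics import mean

RUNS_PER_TASK = 3        # each task ran three times, on different days
DISAGREEMENT_GAP = 1.0   # assumed rubric gap that triggers the third reviewer

def grade(output, rubric, evaluators, tiebreaker):
    """Blind-grade one output; escalate when the two graders disagree."""
    a, b = (e.score(output, rubric) for e in evaluators)
    if abs(a - b) > DISAGREEMENT_GAP:
        return tiebreaker.score(output, rubric)
    return (a + b) / 2

def score_agent(agent, tasks, rubric, evaluators, tiebreaker):
    """Average rubric score per category across all runs of all tasks."""
    by_category = {}
    for task in tasks:
        for _ in range(RUNS_PER_TASK):
            result = agent.run(task)
            by_category.setdefault(task.category, []).append(
                grade(result, rubric, evaluators, tiebreaker))
    return {cat: round(mean(vals), 1) for cat, vals in by_category.items()}
```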
Disclosure: Stratum has no commercial relationship with any of the five vendors. All subscriptions were paid out of editorial budget. Full task list and grading rubric: see Appendix A.
Overall scores by category
No agent wins everything. Scores are rubric averages on a 10-point scale; where one model edges another by less than 0.3 points we treat it as a tie (the sketch after the table applies that rule).
| Category | Claude 4 | ChatGPT | Gemini 2.5 | Perplexity | Grok 3 |
|---|---|---|---|---|---|
| Factual research | 8.4 | 8.2 | 8.6 | 9.3 | 6.8 |
| Long-document reasoning | 9.4 | 8.5 | 9.1 | 7.7 | 6.4 |
| Code (greenfield) | 9.5 | 9.0 | 8.4 | 7.1 | 7.6 |
| Code (refactor existing) | 9.7 | 8.6 | 8.0 | 6.4 | 7.0 |
| Autonomous multi-step | 8.8 | 9.2 | 7.9 | 7.5 | 6.5 |
| Creative writing | 8.9 | 9.1 | 7.6 | 7.0 | 8.4 |
| Voice / multimodal | 8.0 | 9.4 | 9.0 | 7.9 | 7.0 |
| Hallucination rate (lower = better) | 3.1% | 5.4% | 4.6% | 3.9% | 9.8% |
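To make the tie rule concrete, here is how it plays out on one row of the table. A sketch, not our scoring code; the numbers are the factual-research row above.

```python
# Factual-research row from the table above.
scores = {"Claude 4": 8.4, "ChatGPT": 8.2, "Gemini 2.5": 8.6,
          "Perplexity": 9.3, "Grok 3": 6.8}

TIE_MARGIN = 0.3

def winners(scores, margin=TIE_MARGIN):
    """Everyone within `margin` of the top score shares the win."""
    top = max(scores.values())
    return sorted(name for name, s in scores.items() if top - s < margin)

print(winners(scores))  # ['Perplexity'] -- a clear win, no tie
```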
Side-by-side: pricing, capability, infrastructure
Sticker price isn't the whole story — Gemini's "free" tier ships with quotas that most users won't hit, while ChatGPT's $20 plan throttles agentic runs on weekday afternoons. Numbers below reflect each plan's actual performance on our test days, not the marketing page.
| | Claude 4 | ChatGPT | Gemini 2.5 | Perplexity | Grok 3 |
|---|---|---|---|---|---|
| Monthly price (paid) | $20 | $20 | $20 | $20 | $8 |
| Free tier usable? | Limited | Yes | Yes | Limited | Yes |
| Context window | 200K | 128K | 2M | varies by routed model | 128K |
| Tool / browser use | ✓ | ✓ | ✓ | ✓ | ~ |
| Code execution | ✓ | ✓ | ✓ | ✗ | ~ |
| Voice mode | Beta | Excellent | Strong | ✗ | ~ |
| Image generation | ✗ | ✓ (GPT-Image) | ✓ (Imagen 3) | ✗ | ✓ (Aurora) |
| API for developers | ✓ (Anthropic) | ✓ (OpenAI) | ✓ (Vertex) | Beta | ✓ (xAI) |
| SOC 2 / GDPR | SOC 2 Type II, GDPR | SOC 2 Type II, GDPR | SOC 2 Type II, GDPR | SOC 2 Type II, GDPR | SOC 2 Type I |
| Data retention default | 30 days | 30 days | 18 months | 30 days | Indefinite |

✓ = full support; ~ = partial or limited; ✗ = not available.
Claude 4 Opus — Anthropic
Claude is the model we kept reaching for when something had to actually work. Across 56 coding tasks it produced fewer regressions than any competitor, and on long-document reasoning (300+ pages) it remained coherent where ChatGPT and Grok started fabricating quotes. The "thinking" trace is also the most useful in the category — you can see why a refactor was made, not just what was changed.
It's the only one of the five we trusted with an unsupervised one-hour coding agent run. We came back to a working pull request, not a swamp of TODOs.
Where it falls short: there is no native image generation, voice is still in beta, and real-time web access goes through a tool that can rate-limit on heavy days. Anthropic's safety tuning is also occasionally over-eager — Claude refused 4% of our benign red-team prompts, double the rate of ChatGPT.
Strengths
- Best coding agent in the test, by a wide margin
- Lowest hallucination rate on factual tasks
- Excellent at preserving voice in editorial rewriting
- Transparent reasoning trace
Weaknesses
- No native image generation
- Voice mode still rough
- Occasional over-cautious refusals
- $20 tier hits message caps faster than ChatGPT
ChatGPT — OpenAI
OpenAI's product surface is staggering: voice that interrupts politely, a serious operator mode that opens its own browser sessions, native image and video, deep connections into third-party tools (Slack, Drive, Notion). If you need one agent to live on your Mac dock, ChatGPT is the most plausible candidate.
The catch is reliability. GPT-5o's hallucination rate on numeric questions (currency conversions, date arithmetic, citation lookups) was 5.4% — high enough that you cannot trust it for finance or compliance work without verification. The operator mode, while impressive, is also where most of those errors compound.
Strengths
- Most ambitious agentic feature set on the market
- Voice mode is genuinely conversational
- Image & video generation built in
- Largest plug-in ecosystem
Weaknesses
- Higher hallucination rate than peers
- Operator mode chains errors quickly when unsupervised
- Recent UI overhaul still feels unfinished
- Throttling on Pro tier during US peak hours
Gemini 2.5 Pro — Google
Gemini's two-million-token context window is not a marketing trick — we fed it a full SEC 10-K (~330 pages) plus three years of earnings transcripts and asked cross-document questions. It answered correctly 91% of the time, better than Claude managed when the same material had to be chunked into its 200K window.
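Why does the smaller window cost accuracy? Squeezing an oversized corpus through it means chunk-and-stitch, roughly as sketched below. The `ask` callable is a stand-in for whatever model API you use, and the chunk size is an assumption; the point is that every stitch can drop cross-references a single large window keeps intact.

```python
CHUNK_CHARS = 400_000  # crude proxy for ~100K tokens; real code would count tokens

def chunks(text, size=CHUNK_CHARS):
    """Split a long corpus into window-sized pieces."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def cross_document_answer(ask, corpus, question):
    """Answer per chunk, then have the model reconcile its own notes."""
    notes = [ask(f"{question}\n\nContext:\n{part}") for part in chunks(corpus)]
    return ask(f"{question}\n\nReconcile these partial answers:\n"
               + "\n---\n".join(notes))
```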
Native Workspace integration is also under-discussed. Asking Gemini to draft a reply that references your last three Drive folders and last week's Calendar actually works without copy-paste gymnastics. The tradeoff is personality: Gemini writes like a competent assistant who doesn't want to bother you with opinions.
Strengths
- 2M-token context handles real-world documents
- Best free tier of the five
- Native Google Workspace integration
- Lowest latency in our trials
Weaknesses
- Bland creative output
- Refactor quality lags Claude / ChatGPT
- 18-month default data retention concerns enterprises
Grok 3 — xAI
Grok shines in exactly one place: live discussion on X. It can summarize a breaking news thread with attribution faster than any other agent because it is, in effect, watching the firehose. Outside of that, the model lags. Refactor quality is mediocre, the hallucination rate is the highest in the group at 9.8%, and the "irreverent" tuning that was charming in 2024 now reads as careless.
At $8/month it is also the cheapest option, and for a particular kind of user — social-media power user, journalist tracking real-time events — that price is defensible. For everyone else, the cheapness reflects the gap.
Strengths
- Real-time X firehose access
- Cheapest paid plan
- Good creative tone for casual writing
Weaknesses
- Highest hallucination rate measured
- Weak coding and reasoning
- Indefinite data retention by default
- Tool use feels bolted on
Perplexity Pro
Calling Perplexity an "AI agent" stretches the term — it is closer to a research front-end that routes your question to whichever frontier model best fits. But the product is so good at one thing — citing its sources — that we kept it in the comparison. On factual research with cited evidence, it scored 9.3, beating every general-purpose model.
Where it disappoints is anything past research: there's no code execution, no agentic browsing of your local files, no voice. But if you live in browser tabs and your writing leans on cited research, it earns its $20.
Strengths
- Best-in-class citation hygiene
- Routes to multiple frontier models per query
- Clean, distraction-free UI
Weaknesses
- Not really an agent — no autonomous task execution
- No code execution, no voice
- API still in beta
So which one should you actually buy?
We resisted writing one of those "it depends" conclusions, because the people asking us this question want a concrete answer. So:
- If you write code or work on long documents: Claude 4 Opus. It is meaningfully ahead, not marginally.
- If you want one assistant for everything (writing, voice, image, agentic browsing): ChatGPT. Accept the hallucination tax.
- If you live in Google Workspace or work with very large documents: Gemini 2.5 Pro.
- If you mostly do online research and need citations: Perplexity Pro.
- If you obsessively follow news on X: Grok 3 — and only then.
Two of these vendors are likely to release new flagships in the next ninety days (Anthropic and OpenAI both have well-known release cadences). Our verdict will likely update before this article rotates off the front page; we will mark every change in the changelog below.
References
1. Stratum subscriber survey, March 2026 (n = 4,812). Methodology: link.
2. "Frontier Model Capability Report Q1 2026," Open Philanthropy, March 2026.
3. Anthropic Model Card, Claude 4 family, April 2026.
4. OpenAI System Card, GPT-5o, April 2026.
5. Google DeepMind, "Gemini 2.5 technical report," March 2026.
6. Perplexity Engineering Blog, "Routing the Agent Stack," February 2026.
7. xAI release notes, Grok 3, January 2026.