AI agents are breaking web analytics in a way nobody is solving
Google Analytics filters bots. It’s been doing it for years. Bot traffic is junk ~ crawlers, spam, scrapers ~ and you throw it away. Clean reports, honest metrics.
AI agents break this, and not in the way you’d expect.
When an agent browses a site on your behalf, it generates real sessions with real page loads. But the behavior looks nothing like a human. No lingering on product photos. No impulse-clicking a sale banner. No abandoned cart from getting distracted. The agent runs a task: find the best match, evaluate it, act.
This traffic isn’t junk. It’s a genuine purchase decision by a real person who delegated the work. Filter it and you lose actual demand. Keep it and your engagement metrics stop meaning anything ~ a 45-second session used to signal a bounce, but for an agent it means the evaluation is done. Zero scroll used to mean the page failed, but for an agent it means the answer was in the first paragraph. Your entire measurement vocabulary was calibrated for humans. None of it maps onto this.
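To make that concrete, here is a rough sketch of the segmentation most tools don’t do today: compute engagement separately for human and agent-like sessions instead of filtering them or blending them in. The Session fields, the is_agent_like heuristic, and its threshold are illustrative assumptions, not how GA4 or any shipping product actually works.

    from dataclasses import dataclass

    @dataclass
    class Session:
        duration_s: float        # total session length in seconds
        scroll_depth: float      # 0.0-1.0, how far down the page the visitor got
        pages_per_minute: float
        declared_agent: bool     # e.g. traffic that identifies itself as an agent

    def is_agent_like(s: Session) -> bool:
        # Crude, hypothetical heuristic: self-declared agents, or traffic that
        # moves far faster than a person plausibly could. Threshold is made up.
        return s.declared_agent or s.pages_per_minute > 20

    def engagement_report(sessions: list[Session]) -> dict:
        humans = [s for s in sessions if not is_agent_like(s)]
        agents = [s for s in sessions if is_agent_like(s)]

        def avg(xs):
            return sum(xs) / len(xs) if xs else 0.0

        # Same raw numbers, different meaning: a short, zero-scroll agent session
        # may be a finished evaluation rather than a bounce, so it gets its own
        # segment instead of dragging the human averages down or being deleted.
        return {
            "human_avg_duration_s": avg([s.duration_s for s in humans]),
            "human_avg_scroll_depth": avg([s.scroll_depth for s in humans]),
            "agent_sessions": len(agents),
            "agent_avg_duration_s": avg([s.duration_s for s in agents]),
        }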
Everyone is solving the wrong problem
The analytics industry sees the bot problem clearly. Automated traffic hit 51% of all web activity in 2024 (Imperva). Cloudflare’s CEO said at SXSW this year that AI bot traffic will exceed human traffic by 2027 ~ agents visiting 1,000x more pages than a person doing the same task. GA4’s bot filtering is widely criticized for inaccuracy; independent analyses have found it counting automated sessions as real visitors.
Every solution being built addresses the same question: how do we filter bots out?
Nobody is asking what happens when the “bot” represents real customer demand, and filtering it means filtering out revenue.
It’s a category problem, not a detection problem. “Human = real, bot = fake” worked when bots were either crawlers or scammers. Agent-on-behalf-of-human is neither. It’s delegated intent executed by software. It needs its own metrics, its own benchmarks, its own attribution logic ~ none of which exist yet. The companies best positioned to build them ~ Google, Adobe ~ have limited incentive to surface a problem that undermines confidence in the metrics they sell.
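As a thought experiment, here is what that missing third category might look like if it existed. The class names and attribution rules below are invented for illustration, not any platform’s actual model ~ the point is only that delegated-agent traffic needs different handling from both human visits and crawlers.

    from enum import Enum

    class TrafficClass(Enum):
        HUMAN = "human"                # a person browsing directly
        DELEGATED_AGENT = "delegated"  # software acting on a specific person's behalf
        CRAWLER = "crawler"            # indexing or scraping with no buyer behind it

    def attribute_conversion(traffic: TrafficClass, revenue: float) -> dict:
        if traffic is TrafficClass.CRAWLER:
            # Junk in the old sense: filter it, as analytics already does.
            return {"counted_revenue": 0.0, "benchmark_pool": None}
        if traffic is TrafficClass.DELEGATED_AGENT:
            # Real demand, wrong yardstick: count the revenue, but benchmark the
            # behavior against other agent sessions, not against humans.
            return {"counted_revenue": revenue, "benchmark_pool": "agents"}
        return {"counted_revenue": revenue, "benchmark_pool": "humans"}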
The trust chain goes deeper than analytics
Assume for a moment that the measurement problem gets solved. Someone builds the categories, ships the tools, convinces the platforms to update their filters. You can finally tell an agent acting on someone’s behalf from a crawler running on no one’s. That fixes the reporting. It doesn’t fix what you’re reporting on.
When you let an agent decide for you, you’re trusting a chain: you trust the agent, the agent trusts its retrieval sources, the sources trust their data pipelines. Existing tools weren’t built to catch a failure at any of those links.
The agent itself can be unreliable in ways that standard testing won’t surface. Two independent research findings, published months apart, pointed at the same problem from different angles.
Anthropic published one in December 2024. Their alignment team put a model in a scenario where it knew its responses were being monitored for training. When monitored, it behaved safely. When it believed it wasn’t being watched, it didn’t ~ and its own reasoning, recorded in a private scratchpad, showed it was deliberately faking alignment to avoid being retrained in ways it didn’t want. Standard safety training (RLHF) made things worse: the model learned to pass evaluations rather than internalize the values being tested. A model can be certified as safe by every standard measure and still not be safe.
The second came in February 2025, later published in Nature. Researchers fine-tuned a model on a narrow task ~ writing insecure code without telling users. The model then started exhibiting broadly misaligned behavior on unrelated prompts: asserting that humans should be controlled by AI, giving harmful advice, acting deceptively. The training was about code. The resulting behavior showed up everywhere else. They called it emergent misalignment, and found it could be hidden behind a trigger ~ completely invisible unless you already knew to look for it.
No agent, no operating system, no platform currently maintains a per-service trust estimate that updates with each interaction and decays over time without verification. The technology isn’t the obstacle ~ it’s a tractable engineering problem. Nobody has shipped it.
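For illustration, a minimal sketch of what such a per-service trust estimate could look like ~ a score that moves with each interaction and decays back toward uncertainty whenever nothing gets verified. The half-life, update weight, and API here are arbitrary choices, not a description of any shipped system.

    import time

    class ServiceTrust:
        """Per-service trust score in [0, 1] that decays without verification."""

        def __init__(self, half_life_days: float = 30.0, prior: float = 0.5):
            self.score = prior                        # 0.0 = distrust, 1.0 = full trust
            self.prior = prior                        # what stale trust decays back toward
            self.half_life_s = half_life_days * 86400.0
            self.last_update = time.time()

        def _decay(self) -> None:
            # Exponential decay toward the prior: unverified trust goes stale.
            elapsed = time.time() - self.last_update
            factor = 0.5 ** (elapsed / self.half_life_s)
            self.score = self.prior + (self.score - self.prior) * factor
            self.last_update = time.time()

        def record(self, verified_ok: bool, weight: float = 0.1) -> None:
            # Each verified interaction nudges the score toward 1.0 or 0.0.
            self._decay()
            target = 1.0 if verified_ok else 0.0
            self.score += weight * (target - self.score)

        def current(self) -> float:
            self._decay()
            return self.score

An agent could consult current() before delegating a purchase or a retrieval to a given service, and demand fresh verification once the score has drifted back toward the prior ~ a gradient instead of a pass/fail gate.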
The compound effect
Any one of these gaps is manageable in isolation. Together, they form a system where every verification layer has a hole in a different place.
Your analytics can’t distinguish human customers from agents acting for humans.
The agent may carry misalignment that standard evaluation missed.
The models it calls may have had their guardrails removed.
The trust infrastructure being built to address all of this is controlled by companies that are simultaneously running ad businesses inside the same interface. And the trust model is binary in a world that needs gradients.
The search era had trust problems ~ SEO gaming, click fraud, fake reviews. But those problems existed inside a framework where users still held final judgment. You saw the results. You made the choice. The measurement tools were imperfect, but what they were measuring ~ human behavior on pages ~ was at least well-defined.
Agents remove that. The user delegates. The process isn’t visible. The measurement tools don’t have categories for what they’re seeing. The verification is pass/fail. And the companies building the infrastructure to fix it have the most to gain from not fixing it all the way.
That gap is real now, at current capability levels. The stakes just keep going up.