The SIEM Cost Trap — Why Your Data Lake + AI Agents Will Win

If you’ve ever sat across from your CFO, your VP of Engineering, or your board and tried to explain why your SIEM costs what it costs — you already know how this conversation goes. The short version of what I’m about to lay out: there’s an architecture shift happening right now that breaks the cost trap, and the organizations that move on it are going to have a fundamentally different budget conversation than the ones that don’t.

Here’s what that shift looks like: hot data in the SIEM for detection, everything else in a data lake that’s searchable, dashboardable, and accessible through AI agents. The security outcome stays the same. The cost trajectory changes completely.

But first — the conversation that got us here.


I’ve had this discussion more times than I can count. Maybe you have too.

“Why does security cost this much? The SIEM solution is how much? How long are we keeping logs — forever? What logs?”

And here’s the honest answer I’ve given every time — honest, but not satisfying:

“Because the SIEM vendor charges us per gigabyte of ingestion, and we’re generating a lot of gigabytes.”

They nod. It’s a reality we’ve all learned to live with. The budget gets approved — or cut — and we move on. Then we make the painful decisions about which data sources to exclude from our SIEM because we literally cannot afford to ingest everything. Or even some of it.

I think that conversation is about to change. Not because vendors got more generous. Because the economics of security data architecture are being rewritten from the ground up — and if you understand what’s happening, you can walk into your next budget conversation with a fundamentally different argument.


The SIEM Pricing Model Was Built for a Different World

When SIEM pricing was designed, security data volumes were measured in gigabytes per day. Now they’re measured in terabytes. The per-gigabyte ingestion model — which seemed reasonable in 2008 — has become a compounding cost problem that gets worse every year as your environment grows.

Think about what that means in practice. Every new SaaS application generates logs. Every cloud workload generates logs. Every endpoint security tool, every identity platform, every network device — all of it produces more data than the year before. Your attack surface grows. Your data volume grows. Your SIEM bill grows with it, whether or not you got more security value from the additional data.

This is the cost trap: the thing that’s supposed to protect you becomes more expensive in direct proportion to how much your organization grows. No economy of scale. No efficiency gain from maturity. You just pay more.

The result is that most organizations — even large, well-resourced ones — are making active decisions to not ingest certain data sources because they can’t justify the cost. They’re running partially blind by financial necessity, not by choice.

If you’ve ever had to explain to an auditor why you don’t have 90 days of DNS logs, you know exactly what I mean.


The Math That Used to Work Doesn’t Anymore

The numbers tell the story when you put them side by side.

Microsoft Sentinel’s analytics tier — full query capability, real-time alerting — runs approximately $4.30 per GB on a pay-as-you-go basis. The data lake tier, which Microsoft made generally available in October 2025, runs roughly $0.15 per GB effective (combining storage and processing costs). That’s not an incremental difference. That’s a 97% gap — for the same raw data, stored differently.
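To make that gap concrete, here's a back-of-envelope comparison using the per-GB rates above. The 1 TB/day volume is an illustrative assumption, not a benchmark — plug in your own ingestion numbers.

```python
# Back-of-envelope monthly cost comparison between the analytics (hot)
# tier and the data lake tier, using the per-GB rates quoted above.
# The 1 TB/day volume is an illustrative assumption.

HOT_RATE = 4.30    # USD per GB, analytics tier (pay-as-you-go)
LAKE_RATE = 0.15   # USD per GB, effective data lake tier

def monthly_cost(gb_per_day: float, rate: float, days: int = 30) -> float:
    """Cost of ingesting a steady daily volume for one month."""
    return gb_per_day * rate * days

volume = 1_000  # 1 TB/day, expressed in GB

hot = monthly_cost(volume, HOT_RATE)    # $129,000/month
lake = monthly_cost(volume, LAKE_RATE)  # $4,500/month
gap = 1 - lake / hot                    # ~0.97

print(f"Hot tier:  ${hot:,.0f}/month")
print(f"Lake tier: ${lake:,.0f}/month")
print(f"Gap:       {gap:.0%}")
```

At 1 TB/day, the difference is roughly $124,500 per month for the data you move out of the hot tier — before accounting for query costs in either direction.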

That gap is the opportunity. It’s also the problem spelled out in pricing: organizations have been storing everything in the most expensive tier by default, because that’s where SIEM put it.

Data infrastructure companies understood this before most of the security industry did. Cribl — which routes, filters, and transforms security data before it hits storage — surpassed $300 million in annual recurring revenue in 2025, with roughly 35% of the Fortune 500 and half the Fortune 100 as customers. (Cribl, February 2026) That growth isn’t an accident. Their entire value proposition is built on one premise: not all data belongs in the expensive tier, and most organizations are paying premium prices for data they almost never touch.


The Architecture That Changes the Conversation

Here’s the thesis I’ve landed on — and that I’m starting to use in my budget conversations:

Hot data in the SIEM for active detection — only detections. Years of data in a data lake — searchable, dashboardable, reportable. AI agents connecting them at query time — triage, context, analysis, threat hunting, metrics, compliance reporting.

That’s the model that breaks the cost trap. It’s also a platform that scales for the next five years, not just the next renewal cycle.

The hot tier — your SIEM or security analytics platform — handles real-time detection, alerting, and the investigations you’re actively running. It needs to be fast, it needs full query capability, and it’s expensive. It should be. That’s where the live security work happens. Some data sources — behavioral analytics especially — may need a wider window in the hot tier. A 30, 60, or 90-day bubble for UBA/UEBA correlation. But that’s selective. Not every data type needs that depth, and paying hot-tier prices across the board for it is where the economics break down.

The cold tier — a data lake built on object storage — holds everything else. Historical logs, high-volume network telemetry, data you need for breach forensics, legal holds, regulatory compliance, and the operational reporting and dashboarding that doesn’t require real-time correlation. Your monthly metrics, your compliance posture reports, your board-ready dashboards — all of that can run against the lake.

Databricks launched Data Intelligence for Cybersecurity in September 2025 and published results from real security teams that made this shift:

| Organization | Result | Timeframe |
|---|---|---|
| Rivian (10 TB/day, 100+ sources) | 60% SIEM cost reduction | Under 4 months |
| Barracuda Networks | 75% reduction in daily processing and storage costs | Real-time alerting under 5 minutes |
| SAP Enterprise Cloud Services | Up to 80% reduction in engineering time | Detection rule deployment 5x faster |

(Databricks Data Intelligence for Cybersecurity, September 2025)

These aren’t fringe experiments. These are real security teams making pragmatic economic decisions about where data lives and why.

The bridge between tiers — the part that makes this work for a security team rather than just a data engineering team — is AI. Not AI as a magic answer to everything, but AI as the reasoning layer that connects hot and cold storage operationally. An analyst investigating a suspicious account doesn’t need two years of authentication history in real-time hot storage. But when they need to pull that history — to confirm whether this is the first time this account behaved this way, or the tenth — an AI agent can retrieve and analyze it on demand, against cold storage, at a fraction of what it would cost to keep it hot. The same applies to threat hunting across historical data, building compliance reports from months of logs, or generating executive dashboards from data that doesn’t need to be hot to be useful.

The security outcome doesn’t change. The cost trajectory does.


Not Everything Runs on the Same Clock

“Hot data in the SIEM” is the right model. But it’s not a single number.

The real question for every data type is more specific: does this data need to support active correlation — running continuously against detection rules, building behavioral baselines, connecting patterns across sources — or does it need to be searchable, dashboardable, and reportable without requiring real-time correlation? Those are fundamentally different requirements, and they don’t deserve the same price tag.

Correlation-heavy data needs genuine historical depth in the hot tier. A user behavior detection rule flagging deviation from baseline can’t do its job with only two weeks of history. But operational dashboards, compliance reports, and ad-hoc investigations? A search against the data lake handles those fine.

Here’s how I think about it in practice:

| Data Type | Hot Tier Window | Lake Tier Use |
|---|---|---|
| Identity & authentication logs | 120 days | Long-term access audits, compliance reporting, insider threat investigation |
| Endpoint & EDR telemetry | 90 days | Threat hunting, forensic deep-dives, historical lateral movement analysis |
| Network / firewall / proxy | 30 days | Traffic analysis dashboards, retrospective investigation, capacity reporting |
| DNS, CDN, application logs | 15 days | Bulk search, incident scoping, operational metrics and trend dashboards |

Everything beyond those hot-tier windows lives in the data lake — not deleted, not archived in the traditional sense, but queryable by AI agents and available for dashboarding, reporting, and hunting when you need it. The cost difference between “15 days hot” and “15 days hot + three years in the lake” is dramatic. The security coverage gap is minimal, because a flat lake search — or an AI agent pulling context — answers the vast majority of retrospective questions your team will actually ask.
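The tiering scheme above reduces to configuration plus one decision function. The category names and windows mirror the table; in practice this logic lives in a routing tool (Cribl, an ETL job), not in application code — this is a sketch of the policy, not a product.

```python
# A sketch of hot-tier windows as policy configuration. tier_for()
# decides which tier should serve a query for an event of a given age.
# Categories and windows mirror the tiering table in this article.

from datetime import datetime, timedelta, timezone

HOT_WINDOWS = {
    "identity": timedelta(days=120),
    "endpoint": timedelta(days=90),
    "network": timedelta(days=30),
    "high_volume": timedelta(days=15),  # DNS, CDN, application logs
}

def tier_for(category: str, event_time: datetime) -> str:
    """Return 'hot' or 'lake' for an event, by category and age.
    Unknown categories default to the cheapest (shortest) window."""
    window = HOT_WINDOWS.get(category, timedelta(days=15))
    age = datetime.now(timezone.utc) - event_time
    return "hot" if age <= window else "lake"
```

The useful property: changing a retention decision becomes a one-line config change with a visible cost implication, rather than a SIEM onboarding project.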

Where the lake falls short is sustained correlation across time. If a detection rule needs to run continuously against 18 months of data, you’re paying hot-tier prices for a cold-tier use case. The answer isn’t to shorten the window blindly — it’s to be honest about which rules genuinely need that depth, and architect around that reality.

(How you structure the lake to make retrieval fast and trustworthy — schema normalization, indexing strategy, OCSF adoption — is its own conversation. We’ll cover that next in this series.)


This Is a Business Decision, Not Just a Technical One

SIEMs come up for renewal. They always do. The question your leadership is going to ask — because good executives ask this — is: “Why are we paying more than last year for the same capability?”

The old answer lands badly: “Because we’re generating more data.” That’s a cost curve that never flattens and a budget conversation that never improves.

The new answer is a strategy: “We’ve been storing everything in the most expensive tier by default. That changes. We keep the data that powers real-time detection hot — tiered by what each source actually needs. Everything else lives in the lake at a fraction of the cost — still searchable, still dashboardable, still available for compliance and reporting. AI agents make it accessible when an investigation or a report needs it. The security outcome stays the same. The cost trajectory changes fundamentally.”

IBM’s 2024 Cost of a Data Breach Report gives you the return side of that equation. Organizations using AI and automation extensively in their security operations saved an average of $2.2 million per breach compared to those that didn’t — the largest cost-reduction factor in the study. They also identified and contained breaches nearly 100 days faster on average. (IBM Cost of a Data Breach Report 2024)

You don’t need to sell AI hype to make this argument. You’re selling a data architecture decision with real cost benchmarks, documented case studies, and a clear security rationale. That’s a budget conversation. That’s a strategy conversation. And it’s one security leaders can win.


What This Architecture Doesn’t Solve

I want to be clear about limitations — because if you’ve been doing this long enough, you know that overselling an architecture change is the fastest way to lose credibility in the room.

It doesn’t solve the data quality problem. Moving data between tiers requires knowing what you have and what it’s worth. If your SIEM is full of duplicate events and misconfigured sources, a data lake gives you cheap storage of garbage. The unglamorous work of data inventory and source validation still has to happen first.

It doesn’t replace SIEM — it extends it. Real-time detection, behavioral analytics, and active alerting still belong in purpose-built security analytics platforms. The argument isn’t “replace your SIEM with a data lake.” It’s “stop putting everything in the SIEM when only some of it belongs there.”

It doesn’t work without the AI reasoning layer. Cold storage is only valuable if you can reach it when you need it — for investigations, for dashboards, for compliance pulls — without requiring every analyst to become a data engineer first. The AI piece isn’t optional. It’s what makes the economics work operationally.

Vendor lock-in is a real risk, and it just moved. If your cold storage uses a proprietary schema, you haven’t escaped SIEM lock-in — you’ve traded it for data lake lock-in. The open standard to watch is OCSF (Open Cybersecurity Schema Framework), now under the Linux Foundation. Normalize against an open schema upfront. That’s what keeps your options open as this market consolidates.
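What "normalize against an open schema upfront" looks like in miniature: mapping a vendor-specific login record into OCSF-shaped fields before it lands in the lake. The OCSF class identifiers below are real (class_uid 3002 is the Authentication class), but the vendor input format is invented for illustration — your actual mapping depends on the source.

```python
# A minimal sketch of normalizing a vendor-specific auth log into an
# OCSF-shaped Authentication event before it lands in the lake. The
# raw input format here is a hypothetical vendor record.

def to_ocsf_auth(raw: dict) -> dict:
    """Map a hypothetical vendor login record to OCSF-style fields."""
    return {
        "class_uid": 3002,                    # OCSF Authentication class
        "activity_id": 1,                     # 1 = Logon
        "time": raw["timestamp"],
        "user": {"name": raw["username"]},
        "src_endpoint": {"ip": raw["client_ip"]},
        "status": "Success" if raw["result"] == "ok" else "Failure",
        "metadata": {"product": {"name": raw["product"]}},
    }
```

Do this at ingest and every downstream consumer — dashboards, AI agents, a future replacement lake — queries one schema instead of one per vendor. That's the lock-in escape hatch.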


The Conversation You’re Now Ready to Have

The next time your leadership asks why security storage costs what it costs, you have a different answer than “because we generate a lot of data.”

“We tier our hot data by what each source actually needs for detection and correlation — 120 days for identity, 90 for endpoints, 30 for network, 15 for high-volume telemetry. Everything else lives in the lake at a fraction of the cost — searchable, dashboardable, reportable. AI agents pull from it for investigations, threat hunting, and compliance. Rivian did this and cut SIEM costs 60% in four months. Barracuda cut storage costs 75%. We’re not doing anything experimental. We’re getting ahead of where the industry is already going.”

That’s not a security argument. That’s a business argument grounded in security reality.

The SIEM cost trap is real. The exit exists. The organizations finding it now are going to look a lot smarter in three years than the ones still explaining per-gigabyte pricing to their executives — and still deciding which data sources they can afford to see.


Next in this series: What AI is actually doing in your SOC — and the five things it shouldn’t be doing yet.

David O’Neil is a CISO with 20+ years in cybersecurity leadership. He writes about the practical realities of security operations, AI adoption, and building security programs that survive budget season at cisoexpert.com.
