Your Data Lake Is Only as Useful as Its Ability to Answer a Question
- David O'Neil
- Cybersecurity
- 09 Apr, 2026
You moved your security data out of the SIEM and into a data lake. Costs dropped. For the first time in years, you had budget to spare.
Then an investigation hit — and your team spent two weeks finding what should have taken hours. If that investigation triggers a regulatory notification, you just blew your SLA. Not because the data was missing. Because the data couldn’t answer a question fast enough.
This is the distinction that most vendors skip when they sell you on the economics of tiered storage: a cheap data lake and a useful data lake are not the same thing. Cheap is about storage cost. Useful is about whether your analysts — and increasingly, your AI agents — can actually get answers from it when a question is being asked under pressure.
The difference comes down to three architecture decisions — normalize at ingest, partition deliberately, and use a catalog layer. In my experience — and the industry data bears this out — most organizations building data lakes today have made zero of them. They’ve solved the storage economics and stopped there. This article is about what makes the difference.
The Four Jobs Your Lake Has to Do
Before getting into the technical specifics, it helps to be clear about what you’re actually asking a security data lake to do — because it’s not what your SIEM does.
Your SIEM’s job is detection: correlation rules, behavioral analytics, real-time alerting on hot data. It’s optimized for speed on recent events.
Your data lake’s job is different. It’s the long-term memory — the place where months or years of security data lives, queryable, at a fraction of SIEM pricing. Detection stays in the SIEM. Investigation, compliance reporting, and AI-driven analysis move to the lake.
With that distinction clear, here are the four jobs your lake needs to do:
| Job | What It Requires | Who’s Doing It Well |
|---|---|---|
| 1. Store data cost-effectively | Object storage (S3, ADLS, GCS) | Most organizations |
| 2. Support reporting and compliance | Queryable structure, dashboarding, metrics, audit trail | Some organizations |
| 3. Return results to a direct query | Schema consistency, indexing, reasonable query speed | Fewer organizations |
| 4. Support AI agent investigation | Catalog layer, normalized schema, sub-second cross-source queries | Almost no one — yet |
Job 1 is the easy part. Object storage is cheap. If this is all you’ve built, you’ve saved money but you haven’t built a security capability.
Job 2 is where most organizations think they’re done. Your CISO needs a quarterly compliance report. Your SOC manager wants a dashboard showing mean time to detect over the last six months. Your auditors want proof you retained firewall logs for the required period.
These are legitimate requirements — and a well-structured lake handles them. But they’re batch queries on known data, not investigation under pressure.
Job 3 is where things get interesting. An analyst says “show me all authentication events from this user between these dates.” That’s a flat retrieval question. If your lake is searchable at all, it should handle this — but the speed and cost of that query depend entirely on how the data is organized and indexed. A poorly structured lake can make even a simple retrieval expensive and slow.
Job 4 is the hard one. An AI agent is working an active incident. It needs to pull behavioral history, cross-reference threat intel, and surface patterns across multiple data sources — on demand, in seconds, without human intervention. This is where most lakes fail. Not because the data isn’t there, but because the AI doesn’t have a reliable way to find and interpret it.
The gap between doing Job 1 and doing all four is what the rest of this article addresses.
Why Raw Storage Isn’t Enough
Here’s what happens when you dump security logs into object storage without a plan:
Every source lands in its own format. Firewall logs look like firewall logs. CloudTrail looks like CloudTrail. Endpoint telemetry looks like endpoint telemetry.
Each has different field names for the same concepts — source IP might be src_ip in one source and source_address in another and remote_addr in a third. Timestamps might be in different time zones, different formats, different precision.
When an analyst or an AI agent asks “show me all network connections from this endpoint,” the system has to understand that those three field names mean the same thing — and it has to resolve that understanding across potentially dozens of source types. If that normalization hasn’t been done at ingest time, every query has to do it at query time. That’s slow, expensive, and error-prone. And it’s nearly impossible for an AI agent to do reliably when it has no schema to reason from.
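To see the query-time tax in concrete terms, here is a minimal sketch, assuming DuckDB over Parquet in object storage; the bucket path and field names are hypothetical:

```python
# Hypothetical: three sources landed raw, each with its own name for
# "source IP". Every query has to know every alias, forever.
import duckdb

query_time_tax = """
SELECT COALESCE(src_ip, source_address, remote_addr) AS source_ip,
       count(*) AS events
FROM   read_parquet('s3://lake/raw/*/*.parquet', union_by_name = true)
WHERE  COALESCE(src_ip, source_address, remote_addr) = '10.0.0.5'
GROUP  BY 1
"""
# union_by_name stitches the mismatched schemas together at read time,
# which is exactly the work you pay for on every single query.
results = duckdb.sql(query_time_tax)
```

Miss one alias in one source and the query silently returns incomplete results, which is precisely the failure mode that breaks AI agents.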
This is why the SIEM seemed to solve the problem. It enforced normalization at ingest — everything got mapped to a common schema before it was stored. The cost was high because that schema was proprietary and everything had to be processed upfront. The benefit was that queries worked reliably.
The security industry has been trying to solve this normalization problem for two decades. It’s worth a quick tour of where that effort has gone — because the schemas that came before OCSF deserve credit, and understanding their limitations explains why a new approach matters.
The Schema Landscape
| Schema | Origin | Strengths | Limitations |
|---|---|---|---|
| CEF (Common Event Format) | ArcSight, early 2000s | Simple, readable, widely adopted. Moved the industry toward normalization. | Built for a network-centric, syslog world. Extending to identity, cloud, or app logs is a force-fit. |
| CIM (Common Information Model) | Splunk | Mature, well-documented. Works well inside Splunk’s ecosystem. | Stops at the platform boundary. CIM doesn’t travel. |
| ASIM (Advanced Security Information Model) | Microsoft Sentinel | Powerful within the Microsoft security stack. Heavy investment in cross-product coverage. | OCSF support on the roadmap but not yet native. |
| UDM (Unified Data Model) | Google Chronicle | One of the more sophisticated implementations — sub-second searches across petabytes. | Google’s schema. Doesn’t leave Chronicle without translation work. |
| ECS (Elastic Common Schema) | Elastic | Most mature open alternative pre-OCSF. Donated to OpenTelemetry — ECS and OTel Semantic Conventions are converging. | Still evolving post-merger with OTel. Tooling ecosystem advantage may take time to materialize. |
| OCSF (Open Cybersecurity Schema Framework) | AWS, Cisco, IBM, Splunk + Linux Foundation (2024) | Platform-neutral by design. 900+ contributors, 200+ orgs. Native in AWS Security Lake. | Still maturing (v1.8). Records are larger than raw logs. Most adopters fork it to fit. |
Each of these schemas solved real problems for real organizations. The first five share the same limitation — none were designed to be platform-neutral by default. OCSF was built to fix that. Whether it delivers is the next question.
The Standard That’s Trying to Change This: OCSF
The Open Cybersecurity Schema Framework was founded in 2022 by AWS, Cisco, IBM, Splunk, and others — notably, derived from schema work originally done by Broadcom (Symantec). In November 2024 it joined the Linux Foundation, which is the signal that a project has grown beyond what any single vendor should control. It now has over 900 contributors and 200 participating organizations.
OCSF does what CEF, CIM, ASIM, and UDM each tried to do within their platforms — define a common language for security events — but as an open standard that any tool can adopt. Authentication events always have the same fields in the same places. Network activity events follow a consistent structure. Process activity, file system events, findings from security tools — all organized into a taxonomy that different tools, different platforms, and different AI agents can reason from consistently.
AWS Security Lake adopted OCSF as its native schema. Every log that enters Security Lake gets automatically converted to OCSF format and stored as Apache Parquet — a columnar storage format optimized for analytical queries at scale. Major security vendors including Datadog, SentinelOne, Rapid7, CrowdStrike, and Palo Alto Networks have implemented OCSF support in varying degrees. The ecosystem is real and growing.
But OCSF isn’t a finished product — and this article would be dishonest if it implied otherwise.
OCSF is still maturing. Version 1.8 is the current release, and in December 2025 the ITU endorsed OCSF for ratification as an international standard — a significant credibility milestone. Many vendors who built OCSF integrations for the launch of AWS Security Lake have been inconsistent about keeping those integrations updated as the schema has evolved. And here’s a practical reality most case studies skip: raw OCSF records are significantly larger than raw logs — the richer schema carries overhead. That’s a storage cost tradeoff worth understanding before committing.
Perhaps most tellingly: most enterprise security teams that have adopted OCSF have forked it to fit their specific needs. That’s not a failure. It’s actually a reasonable strategy — OCSF gives you a well-designed starting point, and even a forked OCSF implementation is dramatically easier to query and extend than raw vendor logs. But it does mean “we adopted OCSF” means different things in different organizations.
The practical implication for your architecture: OCSF-normalized data, even imperfectly, is far more queryable than raw logs. The normalization cost is paid once, at ingest. Every query, and every AI agent investigation, benefits from it. That math holds even if your OCSF implementation is a fork.
The Multi-Cloud Reality
Most enterprise security teams aren’t running on one cloud. They’re running on two or three — with different native schemas in each.
| Cloud | Lake Architecture | Native Schema | OCSF Status | Key Strengths |
|---|---|---|---|---|
| AWS | Security Lake + S3 | OCSF (native) | Production-ready | Parquet storage, Apache Iceberg catalog, well-documented |
| Azure | Sentinel Data Lake | ASIM | On the roadmap | Up to 12 years retention at <15% of analytics tier pricing — strong economics |
| GCP | SecOps (Chronicle) + BigQuery | UDM | Third-party mapping required | Sub-second search, Gemini natural language queries |
Here’s the point that matters for most enterprises: if you’re running AWS, Azure, and GCP — and most large organizations are — you have three different native schemas that don’t speak to each other. OCSF’s value multiplies in exactly this environment, because it’s the only schema designed to travel across all three without re-normalization at every boundary.
Your AWS security events are in OCSF. Your Azure events are in ASIM. Your GCP events are in UDM. An AI agent investigating an incident that spans all three environments has to translate between three schemas in real time — or you’ve done that normalization upstream, once, into OCSF, and the agent just works.
That’s the multi-cloud argument for OCSF. Not just “it’s an open standard.” It’s the only cross-cloud normalization layer that currently exists with meaningful vendor backing.
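To make the translation burden concrete, here is roughly what "source IP of a sign-in" looks like across the three native schemas. The field names are approximate and version-dependent; the point is that an agent needs a table like this for every field it touches, unless you normalize upstream:

```python
# Approximate cross-schema equivalents for one concept (illustrative;
# exact field names vary by schema version).
SIGNIN_SOURCE_IP = {
    "ocsf": "src_endpoint.ip",   # OCSF Authentication class
    "asim": "SrcIpAddr",         # Microsoft Sentinel ASIM
    "udm":  "principal.ip",      # Google Chronicle UDM
}
# Multiply by every field, every event class, and every schema revision,
# and you have the real-time translation problem described above.
```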
Tools like Cribl and Datadog’s Observability Pipelines are building exactly this capability — pipeline-level OCSF normalization that works regardless of which cloud the data originates from, and routes it to whichever lake or SIEM is the destination. That’s the architecture bridge for organizations that can’t wait for native OCSF support from every platform in their stack.
The Three Technical Decisions That Determine Searchability
Getting to a searchable lake that supports AI agent investigation comes down to three decisions made early in the architecture. Change them later and you’re re-ingesting data.
Decision 1: Normalize at ingest, not at query.
The temptation is to ingest raw logs fast and cheap, and normalize later when you need to query. This feels efficient. It isn’t. It means every query carries the normalization cost, every query requires knowledge of the source format, and AI agents working at query time have no schema to reason from.
Normalize at ingest using your pipeline layer — Cribl, Datadog Observability Pipelines, AWS Glue, or similar — mapping events to OCSF (or ECS/OTel if that’s your path) before they hit storage. This costs more compute upfront and requires mapping work for each source type. It pays back every time you run a query, and it’s the prerequisite for AI agent investigation.
Decision 2: Partition by time and event type.
Partitioning is how a data lake knows where to look before it starts scanning. A well-partitioned lake stores data in a hierarchy — by year, month, day, and event class — so a query for “authentication events from last Tuesday” doesn’t have to scan two years of firewall logs to find them.
Without partitioning, every query is a full scan. At terabyte scale, full scans are expensive and slow. With proper partitioning aligned to your common query patterns — which for security investigations are almost always time-bounded and event-type-specific — queries that would take minutes take seconds, and queries that would cost dollars cost cents.
Decision 3: Use an open table format for the catalog.
Apache Iceberg and Delta Lake are the two leading open table formats for large-scale data lakes. Both solve a problem that flat Parquet files don’t: they give the lake a catalog — a structured registry of what data exists, where it lives, how it’s partitioned, and what schema version it was written against.
For AI agent investigation, the catalog is critical. An agent navigating a data lake without a catalog is like a detective searching a library with no index — the books are there but finding the right one is guesswork. With an Iceberg or Delta Lake catalog, an agent can efficiently discover what data is available, query specific subsets, and join across sources without scanning everything.
AWS Security Lake uses Apache Iceberg. Databricks Data Intelligence for Cybersecurity is built on Delta Lake. Both are production-proven at security-relevant scale.
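As a sketch of what the catalog buys you, here is PyIceberg against a hypothetical catalog named "lake" with a table security.auth_events; the names and configuration are assumptions, and the discovery-then-targeted-scan pattern is the point:

```python
# Minimal PyIceberg sketch (catalog, namespace, and table are hypothetical).
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import And, EqualTo, GreaterThanOrEqual

catalog = load_catalog("lake")

# Discovery: the catalog answers "what exists?" without scanning anything.
print(catalog.list_tables("security"))

table = catalog.load_table("security.auth_events")
print(table.schema())  # a schema an agent can actually reason from

# Targeted scan: Iceberg metadata prunes partitions and files before
# a single row is read.
failures = table.scan(
    row_filter=And(EqualTo("status_id", 2),
                   GreaterThanOrEqual("time", 1743984000000)),
    selected_fields=("time", "user", "src_endpoint"),
).to_arrow()
```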

What “AI-Queryable” Actually Means
The phrase “AI agents can query your data lake” gets used a lot right now. It’s worth being specific about what that actually requires.
An AI agent needs to be able to: discover what data is available, formulate a query in a language the lake understands, execute it efficiently, interpret the results against a schema it understands, and connect those results to the context of the current investigation.
Every one of those steps has a dependency on the architecture decisions above. Discover what’s available → requires a catalog. Formulate a query → requires schema knowledge (OCSF, ECS, or a well-documented fork). Execute efficiently → requires partitioning. Interpret the results → requires schema consistency across sources.
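One way to see the dependency chain: sketch the minimal tool surface you would hand a function-calling agent. The tool names and shapes below are hypothetical; what matters is that each tool is only implementable if the corresponding architecture decision was made:

```python
# Illustrative tool definitions for a function-calling agent. Each one
# exists only if the matching architecture decision was made.
AGENT_TOOLS = [
    {   # Discover -> requires a catalog layer
        "name": "list_tables",
        "description": "List datasets available in the security lake catalog.",
        "parameters": {"namespace": "string"},
    },
    {   # Formulate -> requires a consistent, documented schema
        "name": "get_schema",
        "description": "Return the OCSF-aligned schema for one table.",
        "parameters": {"table": "string"},
    },
    {   # Execute efficiently -> requires time/event-type partitioning
        "name": "run_query",
        "description": "Run a filtered, time-bounded query against one table.",
        "parameters": {"table": "string", "filter": "string",
                       "start_time": "string", "end_time": "string"},
    },
]
```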
If any of those elements is missing, the AI agent either fails silently — returning incomplete or incorrect results — or requires a human data engineer to bridge the gap. Either outcome defeats the purpose.
The organizations getting real AI-agent investigation capability from their data lakes today built the architecture right from the start. The 60% and 75% cost reductions cited in earlier articles are the visible part. The less visible part is that those organizations now have data that AI can actually use for investigation. The cost savings funded the capability. The capability is what matters long-term.
The Honest Assessment: Where Most Organizations Are
I’ve talked to dozens of security leaders over the past two years about their data lake strategies. The pattern is consistent — and the practitioner data confirms it: most organizations building security data lakes today are at Decision 0: cheap storage, raw logs. A few are at Decision 1: normalized data, probably in a vendor-specific schema — CIM inside Splunk, ASIM in Sentinel, UDM in Chronicle. Fewer still have made all three decisions deliberately with an open, cross-platform schema, proper partitioning, and a catalog layer. The Gurucul 2025 Pulse of the AI SOC survey found that only 9% of security practitioners are “very confident” in their AI tooling — and data architecture is a major reason why.
That’s not a criticism — this architecture has only been practically achievable for the last 18–24 months. But it does mean most organizations claiming “we have a data lake” are describing something closer to cheap archive storage than a queryable security asset.
The gap matters because it’s the gap between “we saved money on storage” and “we gave our AI agents a memory.”
Your Next Move
Good (this week): Audit your current data lake. Can an analyst answer “show me all authentication failures for this user in the last 90 days” in under 60 seconds? If not, you have a storage tier, not a security capability. Know which of the three decisions you’ve made and which you haven’t.
Better (this quarter): Pick one high-value log source — identity provider logs are a strong starting point — and normalize it to OCSF (or ECS/OTel) at ingest. Build the partition structure. Run the same query against normalized and raw data and measure the difference (a measurement sketch follows these steps). That delta is your business case for the rest.
Best (this half): Deploy a pipeline layer (Cribl, Datadog Observability Pipelines, or AWS Glue) that normalizes all inbound security data to a common schema before it hits storage. Add an Apache Iceberg or Delta Lake catalog. Then point an AI agent at it and ask it an investigation question. The answer will tell you whether you’ve built a lake or an archive.
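For the "Better" measurement above, a rough timing harness is enough. The paths, field names, and the choice of DuckDB below are all assumptions; the side-by-side comparison is the point:

```python
# Same question, raw vs. normalized. Hypothetical paths and fields.
import time
import duckdb

def seconds(sql: str) -> float:
    start = time.perf_counter()
    duckdb.sql(sql).fetchall()
    return time.perf_counter() - start

raw = seconds("""
    SELECT count(*) FROM read_parquet('s3://lake/raw/idp/*.parquet')
    WHERE COALESCE(userName, user_name, subject) = 'jdoe'
""")
normalized = seconds("""
    SELECT count(*)
    FROM read_parquet('s3://lake/ocsf/class=authentication/*/*/*/*.parquet',
                      hive_partitioning = true)
    WHERE "user".name = 'jdoe' AND status_id = 2
""")  # "user" is quoted because it can collide with a reserved word
print(f"raw: {raw:.1f}s  normalized: {normalized:.1f}s")
```

If the normalized query is not dramatically faster and cheaper, revisit the partition structure before blaming the schema.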
David O’Neil is a CISO with 20+ years in cybersecurity leadership. He writes about the practical realities of security operations, AI adoption, and building security programs that survive budget season at cisoexpert.com.