Everything Is Green… Until It Isn’t
For years, software teams learned to trust their monitoring: latency graphs, error rates, distributed traces. If the dashboards were green, the system was probably healthy.
Data teams live in a weirder reality.
In the latest episode of בול בדאטה, the data/AI podcast hosted by Hetz partner Guy Fighel, Guy sat down with Harel Shein, who currently leads a development group at Datadog focused on data observability, to unpack why data breaks differently, why the pain scales exponentially, and how an open standard like OpenLineage became a missing layer in modern data infrastructure.
What follows is a distillation of the core themes, expanded for founders and data leaders building systems where “green” is not good enough.
The thing that makes data harder than software
Guy framed a point we’ve heard from a lot of experienced leaders but rarely stated so cleanly: data systems aren’t just distributed - they’re implicitly connected.
In software engineering, relationships tend to be explicit: APIs, message queues, contracts between services. Teams argue about interfaces, version things, and eventually, the organization learns where the seams are.
“In data pipelines, everything is much more implicit. Someone sees a table or a file in S3 and thinks, ‘Great, I can use this.’ There’s no contract.”
Data pipelines don’t offer those explicit seams.
As Harel described it, a data team can “publish” a table or a file in S3 and suddenly it becomes infrastructure. Someone else will discover it, depend on it, and build on top of it, often without talking to the producer, without writing requirements, and without creating a contract.
A small org can survive that because everyone shares tribal knowledge. But at scale, the complexity becomes exponential: more pipelines, more consumers, more silent dependencies, more places for things to break in ways nobody sees.
Which brings us to the most familiar failure mode in data…
“Everything is green” but the business is broken
“This is the classic case where everything is green — but the result is garbage.”
Harel gave a very real example of what “data failure” looks like in practice:
- A field quietly changes type (say, float becomes string) because it solves a constraint in an application.
- A downstream job in Python converts it back “successfully” (no type safety, no hard failure).
- The pipeline stays green.
- The output becomes garbage.
- And only at the end (in ranking, recommendations, revenue attribution, or dashboards) does someone notice the world is wrong.
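The steps above can be sketched in a few lines of Python. Everything here (the job name, the data) is invented for illustration; the point is that nothing raises an error when the upstream type quietly changes:

```python
def rank_items(rows):
    # Downstream job: rank items by score, highest first.
    # Nothing here checks that "score" is actually numeric.
    return sorted(rows, key=lambda r: r["score"], reverse=True)

# Before the upstream change: score is a float. Ranking is correct.
before = [{"id": "a", "score": 9.5}, {"id": "b", "score": 10.2}]
print([r["id"] for r in rank_items(before)])   # ['b', 'a'] - correct

# After the upstream change: score quietly became a string.
# sorted() still "succeeds" - it just compares lexicographically,
# so "9.5" now outranks "10.2". Green pipeline, garbage ranking.
after = [{"id": "a", "score": "9.5"}, {"id": "b", "score": "10.2"}]
print([r["id"] for r in rank_items(after)])    # ['a', 'b'] - wrong, no error
```

No exception, no failed run, no red dashboard: the only symptom is an output that means the wrong thing.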
This is the core reason data observability exists as its own category. In classic observability, the goal is to detect failures of the system. In data, the goal is often to detect failures of meaning. Your pipeline can ‘run,’ your jobs can ‘succeed,’ your SLAs can look fine - and your results can still be nonsense.
Why OpenLineage happened: making the implicit explicit
A big part of Harel’s story starts earlier, at WeWork, during the era when the company was trying to grow like a tech company at massive scale.
When you build a serious data platform, you always discover the same missing piece: the platform can move and compute data, but the organization can’t explain it. Who produces this dataset? Who consumes it? What changed? What broke? Where did this number come from?
“People used to manage lineage by interviewing teams and maintaining spreadsheets. It was outdated the moment they finished.”
Historically, companies attempted ‘lineage’ in the most painful way possible: manual mapping. Interviews. Docs. Spreadsheets that were stale the moment they shipped. Harel described the transition from ‘we should have lineage’ to ‘we need a systematic way to capture it,’ which ultimately led to:
- a metadata repository effort (‘Marquez’)
- and then the insight that really matters: lineage needs to be vendor-agnostic to scale
That’s where OpenLineage was born. OpenLineage is to data workflows what OpenTelemetry is to observability. A shared spec that many systems can emit, many systems can consume, and no single vendor controls. Not because open standards are ‘nice’, but because without them, every integration becomes a political fight.
A design lesson founders should steal: start primitive, stay extensible
One of my favorite parts of the conversation was surprisingly practical: how you design a standard that can survive. Harel described the philosophy behind OpenLineage’s early choices:
- start with the most basic units (job, dataset, run)
- use a simple, human-readable format (JSON)
- avoid stuffing every “nice-to-have” into the core spec
- add extensibility so vendors can customize without breaking the commons
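Those design choices are visible in the events themselves. Below is a minimal OpenLineage-style run event built from the three core primitives (job, dataset, run) as plain JSON. The field names follow the published OpenLineage spec, but treat this as a sketch with invented namespace and job names; openlineage.io hosts the authoritative JSON Schema:

```python
import json
import uuid
from datetime import datetime, timezone

# A minimal run event: one job, one run, its inputs and outputs.
# Extensibility lives in "facets" (omitted here), so the core stays small.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_revenue_rollup"},
    "inputs": [{"namespace": "s3://raw-bucket", "name": "orders"}],
    "outputs": [{"namespace": "warehouse", "name": "revenue_daily"}],
    "producer": "https://example.com/my-pipeline",  # who emitted this event
}

payload = json.dumps(event, indent=2)
print(payload)  # plain JSON: human-readable, vendor-agnostic, easy to emit
```

Because the payload is just JSON, any scheduler, engine, or catalog can emit or consume it without adopting anyone’s runtime.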
It’s also why OpenLineage is now adopted across very different ecosystems: Airflow DAGs, Spark jobs, Flink checkpoints, query history crawls, catalogs, and more. The primitives stay stable while the edges evolve.
“We started from the most basic units and resisted adding everything we knew was ‘missing.’”
For anyone building platform products: this is a masterclass in restraint. Most standards fail because they try to be complete too early.
Governance and compliance aren’t policy problems but instrumentation problems
When Guy asked about governance and regulation (especially with AI regulation rising), Harel’s answer turned on a simple analogy about documentation.
Data governance is often treated as a top-down initiative: define rules, write policies, hold committees. But at scale, it’s operational. You can’t govern what you can’t see.
“Metadata is like documentation. If it’s not generated automatically, it will always drift.”
Harel’s view was clear: the only way to do metadata at scale is to instrument it as the data moves, inside the actual processes that transform and publish it. And once you have that, compliance becomes tangible:
- You can tag sensitive sources (PII).
- You can observe propagation through downstream datasets.
- You can answer: Where did this training data come from? Where is it used? Did we pull something we shouldn’t?
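Once lineage is captured automatically, those questions reduce to a graph traversal. Here is a small sketch with a hypothetical lineage graph (the dataset names are invented); in practice the edges would come from collected lineage events, not a hand-maintained dict:

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets
# built directly on top of it.
edges = {
    "raw.users": ["staging.users_clean"],
    "staging.users_clean": ["marts.churn_features", "marts.marketing_emails"],
    "marts.churn_features": ["ml.churn_training_set"],
}

def downstream_of(source, edges):
    """Breadth-first walk: every dataset reachable from `source`."""
    seen, queue = set(), deque([source])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Tag raw.users as PII, then answer: where did it propagate?
print(sorted(downstream_of("raw.users", edges)))
# ['marts.churn_features', 'marts.marketing_emails',
#  'ml.churn_training_set', 'staging.users_clean']
```

The same walk, run in reverse, answers the provenance question: where did this training data come from?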
Most organizations today still answer those questions by interviewing humans. That’s slow, expensive, error-prone, and, under regulation, increasingly indefensible.
The agentic future: metadata as the ‘context hub’
I didn’t expect the most forward-looking segment to be the most concrete, but it was. When Guy asked about AI and agents, Harel didn’t go for hype. He framed OpenLineage and metadata as a context hub - a map of what runs where, what depends on what, and what resources are consumed. That context unlocks two kinds of automation:
1) Operational agents (the ‘nanobots’ model)
Agents that can identify bottlenecks, propose small changes, and iteratively improve pipeline efficiency:
- CPU/memory optimization
- bottleneck detection
- reliability improvements
- closed-loop measurement (“what was the impact?”)
2) Business agents (ask questions without dashboards)
Agents that can answer business questions using the workflow context, not just raw data. If you know the pipeline logic, ownership, and semantics, you can accelerate real decision-making.
Guy added an important point: text-based specs like OpenLineage are unusually model-friendly. LLMs can ‘eat’ this structure easily, making it a natural substrate for agentic analysis. In other words: metadata isn’t just governance infrastructure. It may become the substrate for automation.
A playbook (in short)
- Assume data complexity scales exponentially. Pain does not grow linearly with pipeline count.
- Treat green pipelines as a weak signal. In data, meaning can break without errors.
- Make the implicit explicit. Lineage is the missing layer between storage/compute and trust.
- Prefer open standards where integration is political. Vendor-agnostic specs remove friction.
- Governance starts with instrumentation. Metadata must be generated in the workflow, not documented afterward.
- Think of metadata as a context hub for AI. Agents will be as good as the operational map they can see.
Ending on a human note: Harel says the data community is unusually analytical and unusually friendly, and it’s a space he wants to keep building in. If you’re curious about OpenLineage, check out the project at openlineage.io, where the community and resources are open to anyone who wants to dig into the ‘metal layer’ behind modern data systems.
If you build data products, you already know the truth: you don’t just need better pipelines. You need a system that can explain itself.
Subscribe to Hetz Ventures on YouTube to get updated when new episodes release, or follow along on Geektime.


