"More sunshine to come": We're only at the beginning of data & privacy engineering

A Q&A on data privacy with Nishant Bhajaria, Uber's Head of Privacy Engineering
November 10, 2021

What is privacy engineering? How do major companies shape their privacy engineering teams? What should we expect over the next decade of data privacy? Nishant Bhajaria, Uber's Global Head of Privacy Engineering, has led privacy architecture and engineering efforts at Netflix, Google and Nike. He also wrote the book on it - due next month. In this conversation between Hetz Ventures General Partner Pavel Livshiz and Nishant, hear about:

👉 What are data privacy and data engineering? 00:27

👉 How engineering teams work with legal teams successfully 02:14

👉 Building the data engineering team 05:25

👉 What does the future data department look like in an organization? 07:53

👉 How do you start building the infrastructure for this? 10:00

👉 Where are the opportunities to build technology in this space? 14:21

👉 What's the right approach to a company for a startup founder? 21:00

👉 How do you view the approach to internal data sharing (vs external)? 23:22

👉 Preview of Nishant's book, Data Privacy: A runbook for engineers 25:54

The following interview has been edited lightly for clarity based on our interview with Nishant Bhajaria.

Can you define data privacy and data engineering? What is the actual difference?

It’s interesting because these terms are highly contextual and they mean what people think they mean sometimes. But when it comes to data privacy, what I think it means is that sweet spot where what you're supposed to do from a regulatory customer trust perspective - what the law wants you to do - you combine that with ‘how would I like this data to be handled if it were my data?’ How would I do it if it were my dad's data or my mom's data? How would I protect it? How would I make sure that it doesn't get used incorrectly, it doesn't get accessed in an inappropriate fashion, doesn't get shared with shady third parties?

Data privacy is a combination of the legal arm and the more ethical arm in terms of data handling. The law always has to play catch up, because as engineers, we tend to move a lot faster. We tend to break new ground, but the law looks at existing ground and tries to fix it. There is an asynchronous relationship between the two, on the best of days.

Data management is a much bigger, much more complicated issue because it involves the actual tooling, the actual metrics collection, onboarding, verification, things like that. Data privacy is like poetry, where everybody can agree that doing the right thing is important. Data management and data protection is a bit like prose where you actually have to understand the details and verify what works and what doesn't. The two are separate but very connected.

You mentioned working with legal is a very integral part of data privacy and data engineering. How does this interaction look and what is the general day to day relationship with the legal department? Is there a constant dialogue?

It's a very constant dialogue. Honestly, every time I've run a privacy engineering team, the people who report to me are folks who are traditional software engineers and in this case they're learning about privacy as they go along and they repurpose their technical knowledge to build privacy solutions. I also have architects on my team whose job it is to be advocates, product managers, system experts, to design these at a more holistic level.

The engineers build point solutions, the architects help divine company-wide solutions across the board. People who report to me also happen to be data analysts, data scientists who actually look at it from the data perspective. What I don't have reporting to me, and for good reason, are attorneys, because there needs to be this line, this veil, not a wall, between the engineering arm and the legal arm, where the two need to work together but they need to bring their separate perspectives.

Engineering is about how can we accomplish this and what can we do? On the legal side, it's more about what do we not want to do and what outcomes do we want in terms of customer trust? So they have a separate perspective in terms of how to get to the same place. So the dialogue with legal tends to be very continuous. My recommendation for people is build an internal abstraction that unifies things like GDPR, CCPA, et cetera, all the best practices into something that the company can consume and that engineering can then use as a requirements doc to build solutions across the board.

You want to make sure that that abstraction is the outcome of legal and engineering, on my side, talking on a continuous basis, because it is always going to be a living, breathing document. We saw some of the challenges with GDPR where GDPR was built to protect customers, but in the last three years, so much has changed already.
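
To make the idea of an internal abstraction concrete, here is a minimal Python sketch of a single list of internal controls that records which regulations motivate each one. The control names, the `Control` class and the `controls_for` helper are hypothetical, and the regulation-to-control mapping is deliberately simplified for illustration; it is not a complete reading of either law.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Control:
    """One internal requirement that engineering builds against."""
    name: str
    description: str
    sources: Tuple[str, ...]  # regulations that motivate this control


# Hypothetical, simplified catalogue agreed between legal and engineering.
CONTROLS = [
    Control("delete_on_request",
            "Delete a user's personal data when they ask for it.",
            ("GDPR", "CCPA")),
    Control("access_and_export_on_request",
            "Give a user a copy of their personal data in a usable format.",
            ("GDPR", "CCPA")),
    Control("opt_out_of_sale",
            "Let a user opt out of the sale of their personal data.",
            ("CCPA",)),
]


def controls_for(regulation: str) -> List[str]:
    """Return the names of the internal controls a given regulation maps to."""
    return [c.name for c in CONTROLS if regulation in c.sources]


if __name__ == "__main__":
    for reg in ("GDPR", "CCPA"):
        print(reg, "->", controls_for(reg))
```

Engineering teams would build against the control names rather than against each regulation separately, and the catalogue would be revised as legal and engineering keep talking.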

Who would you say drives the needs of data privacy within the organization? Legal or engineering?

It's got to be both.

In traditional companies, more legacy type companies, the legal arm is a lot stronger. Just like the IT arm is, so IT is very traditional. I don't want to say the word bottleneck, but they have the authority to block something before it ships out. They have the ability to clamp down on shadow IT. So those companies typically also have a very powerful legal team that can decide what ships and what doesn't.

That's the traditional model. In the newer, more agile model, I'm not suggesting that legal and IT are not important or powerful. They are, but the power is a lot more decentralized. You cannot micromanage every engineer in terms of how they do their key management, how they manage their encryption algorithms, et cetera. Those decisions are broken into smaller and smaller pieces and are then given out to a bunch of engineers.

And engineers make decisions every single day, every single moment with their APIs, their automation platforms. So it comes down to education, evangelism, continuous training, accountability, and tooling all over the place because it's not legal or engineering, it's going to be a bit of both. The question is going to be how do we make sure that the two stay in reasonable sync? And secondly, how do we make sure that there is enough tooling, accountability and auditing to verify the things that engineering and legal agree upon as first principles?

Data privacy is a new department. Can you talk about how the org chart changed over the last five years and how you think the org chart will look in the next five years?

When I first started working in this space, whether it's security or privacy, I ended up reporting to the head of engineering, not directly but several layers down, when I was a junior engineer. The understanding was there was no place for security and privacy to report into that did not conform to the engineering tech stack, because at the end of the day, what are you trying to secure?

You are securing the tech stack and the data and the artifacts that are attached to it, and who knows it best: the engineers. Over time though, in the beginning of the 2010s, there was an understanding that there's a conflict of interest there, because the engineers are essentially the cause of privacy issues. I'm not saying that in a pejorative way; it's the innovation that the engineers try to do that causes privacy challenges.

You can't have them reporting to the same person because if an escalation goes to the SVP of engineering, who are they going to weigh in favor of? Is it the engineers who are only doing what the SVP prioritized, or the privacy engineers who want to protect the company from privacy harm? There is a conflict of interest here. Over time, the CISO role became a role to go to, and then there was understanding that security and privacy are not the same thing.

Now you have a challenge where people like me would prefer to report to the CISO because it keeps you out of engineering, but connected to engineering. If you are on the legal side, then it often sends the message that you only speak for legal. This incarnation, where somebody like me reports to the CISO directly, gives me the credibility and the seat at the table, and helps me build an organization that can then align with engineering on an ongoing basis. But at the same time, I can get directions from legal very explicitly about how much I need to accomplish purely from a compliance legal perspective.

So that's the upside here. So I feel if you report to the chief legal officer, that's fine as well. But at least currently in high powered engineering companies, it's good to have this CISO role which combines the best of the legal aspect of privacy and the engineering aspect of privacy.

If you look forward, and as data becomes a more visible asset, do you think the data department becomes a separate department that will essentially report to the C level?

It's one of those things where, and you'll notice that there's a theme in all my responses where I'm not deliberately talking about a specific person, my general sense is how do you make sure that the data is understood as quickly as possible?

I was a product manager back in the day - when we build a feature we can use engagement, product reach, lifetime value per customer and come up with a metric for each feature. How much is the upside? We've been able to quantify and make data driven decisions around product development the whole time. And that does not come down just to the chief product officer or the head of engineering. So in privacy and security, it's roughly the same thing. I would want to make sure that we quantify how much data are we talking about? What is the customer perspective? Where are we located? What is this feature? How much security do we have in place? And we can quantify the risk as well.

My sense is the sooner you make that calculation, the sooner you can apply that calculation to other data activities downstream. If you automate, if you categorize, if you do this at the ingest layer, it becomes a lot easier to make the decision. And I think that is a conversation the chief privacy officer, the chief data officer and the head of engineering should have with privacy in the room at the same time. Because it's all about quantifying. Because at the end of the day, it's very hard to know what will happen to the data.

You can do the right thing all day long, but unless you actually know what you're getting yourself into, what you're subjecting your customer's data to, it's very hard to make those decisions at scale because you can do everything right across the board, but that one mistake will bring you down and that'll permeate all across your infrastructure. So that's how my perspective shapes on this.

How do you start building an infrastructure within a company?

When it comes to infrastructure, you have two options at this point. The first is you build something separately for privacy on the right hand side and you hope that people will onboard it. The second thing is you build solutions that are available on demand, some of which are easy to onboard automatically, and some of which are essentially embedded into the organization's infrastructure from the bottom up.

In other words, you don't have to build everything, you don't have to buy everything. The more you embed these solutions into the company's infrastructure, the better. So you can build solutions to encrypt data, to delete data, to transform data, to export data, those are key GDPR expectations. So having some centralization in terms of servicing is very important because that way, every team gets the same data policy enforced, every team has the same guidelines in terms of what they can share, how you throttle APIs, things like that.

You need that level of standardization because unlike innovation, unlike personalization, unlike UI development for different regions, you cannot just personalize privacy based on user data. You have to understand the regulatory trust implications as well as potential future users of the data. So on the one hand, having some sort of centralized data resources is pretty critical. But then you also want facilitators, engineers and architects who can tell people how to build stuff.

How do we build a central pipeline that checks for data tagging across the board? That's not something privacy can build. You want to work with a pipeline platform team to do that. How do you make sure that data at rest is always encrypted if it meets certain criteria? You can't just do it for privacy. You have to make sure that people who own the services and the actual data stores build out those services as well. Because there is something about asset discovery, about actual tagging, about actual control application, about verification, things like that.
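
As a rough illustration of the two checks mentioned above (is every field tagged, and is sensitive data encrypted at rest), here is a minimal Python sketch. The schema layout, the tag names in `SENSITIVE_TAGS` and the `check_dataset` function are all assumptions made for this example, not a description of any particular company's pipeline.

```python
from typing import Dict, List, Optional

# Hypothetical set of tags that trigger the encryption-at-rest requirement.
SENSITIVE_TAGS = {"pii", "location", "financial"}


def check_dataset(name: str,
                  fields: Dict[str, Optional[str]],
                  encrypted_at_rest: bool) -> List[str]:
    """Return a list of human-readable problems found for one dataset.

    `fields` maps each field name to its privacy tag, or None if untagged.
    """
    problems = []

    # Check 1: every field must carry a tag.
    for field_name, tag in fields.items():
        if tag is None:
            problems.append(f"{name}.{field_name}: missing tag")

    # Check 2: any sensitive tag requires encryption at rest.
    tags_present = {tag for tag in fields.values() if tag is not None}
    if tags_present & SENSITIVE_TAGS and not encrypted_at_rest:
        problems.append(f"{name}: sensitive data not encrypted at rest")

    return problems


if __name__ == "__main__":
    issues = check_dataset(
        "trips",
        {"rider_email": "pii", "pickup_point": None, "fare_tier": "internal"},
        encrypted_at_rest=False,
    )
    for issue in issues:
        print(issue)
```

A check like this would run inside the platform team's pipeline, while the owners of each data store remain responsible for fixing what it reports.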

Infrastructure is a sum total of several activities. There is terminology people use, things like compliance, things like strategy. These things have a very fancy name and everybody wants to work on them. But really it's the incremental superset of several small decisions and techniques you build gradually and measure their effectiveness, and that's what ends up as architecture. My general sense is people shouldn't go into building architecture, people should go to solve problems and the sum total of those on an iterative basis is what I call architecture.

Would you say there is a single path that everyone takes, when they build infrastructure, or there are several pathways, and do you think there is a playbook to build right infrastructure within the company?

There is not an actual playbook per se, but I would approach it like this: you first build governance, which is a combination of legal policy and the engineering interpretation of the controls required to enforce that policy. Then you build point solutions: solutions for deletion, extraction, anonymization, etc. They are essentially the vertical version of the horizontal governance.

So the governance is going to say, "Here's how the company's supposed to operate horizontally," but then you need vertical solutions that will go pretty deep. So your next step basically is going to be building those point solutions. Then you figure out how to scale those solutions by building maturity models.

We have a capability maturity model that was more business friendly that has been adapted to software engineering. You can now build a framework that is very, very similar for maturity for privacy and security as well. Then you scale different solutions in terms of how do you enforce applicability in terms of onboarding new services, apply privacy solutions to it. You're about to buy a new company, onboard privacy solutions to it. So first is building broad governance, then is building deep solutions, and then is applying it to new use cases.

On and on you go in that circle and I think that's probably the best way to go because the other thing that gives you, if you will, is a sense of auditability. That is how quickly are you adapting to privacy changes? Remember when there is a breach, you have an SLA for response. You have to notify customers within a certain period. You want to apply the same discipline, the same data driven, metric driven, dashboard driven approach for privacy adoption, not just enforcement. We do a very good job as leaders on the enforcement side when something bad happens, because our backs are against the wall. My sense is let's do the same thing when it's actually time to architect and design these solutions because that'll cut back on the enforcement burden at the same time.

When you think about the next five years, where do you think are the opportunities in the space? Where are large companies going to come from?

A few things. I would say the biggest opportunity is in data categorization, because everything else you have to do depends on it. For example, if you want to decide how to protect this data: do I do encryption? Do I come up with some other kind of access control? Do I do some sort of access management where people are audited after the fact? You cannot make those decisions unless you have some level of understanding of where the data is. It's a bit like choosing a grocery store depending on where you can find the most items on your list. You don't just show up to the store and start building the list.

But that's what companies do. They end up collecting a chunk of data and then they throw tools and engineers at that data, hoping that somehow they end up with some sort of privacy protection. What I want to do is at the point of ingest, when a service is being conceived, when data is being collected, when a connection to an API is being made, right at that point get a sense of what you have, where is it coming from? What use might it be put to? Where is it going to end up? And you apply tags to the data that carry all this information so that when an engineer downstream tries to make changes to the data, tries to use the data, share the data, et cetera, they know what they're doing.
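
The tagging-at-ingest idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `TaggedRecord` class, the `FIELD_CATEGORIES` lookup and the example source and purpose strings are all hypothetical, and a real system would likely use a richer classifier than a field-name lookup.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, FrozenSet

# Hypothetical lookup from field name to data category.
FIELD_CATEGORIES = {
    "email": "contact",
    "ip_address": "network_identifier",
    "lat": "location",
    "lon": "location",
}


@dataclass(frozen=True)
class TaggedRecord:
    """A payload plus the answers to: what is it, where from, what for?"""
    payload: Dict[str, Any]
    source: str                 # where the data is coming from
    purpose: str                # what use it is expected to be put to
    categories: FrozenSet[str]  # what kinds of data it contains
    ingested_at: datetime


def ingest(payload: Dict[str, Any], source: str, purpose: str) -> TaggedRecord:
    """Attach tags at the moment of collection, before anything else uses the data."""
    categories = frozenset(
        FIELD_CATEGORIES.get(field_name, "uncategorized") for field_name in payload
    )
    return TaggedRecord(payload, source, purpose, categories,
                        datetime.now(timezone.utc))


if __name__ == "__main__":
    record = ingest(
        {"email": "rider@example.com", "lat": 37.77, "lon": -122.42, "note": "hi"},
        source="signup_api",
        purpose="account_creation",
    )
    print(sorted(record.categories))
```

Fields that fall through to `uncategorized` are the ones worth surfacing for review, since nobody downstream can make an informed decision about them.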

Every time you have had a breach, it comes down to, "We didn't know this could happen." The Colonial Pipeline, they didn't realize that they had an account that was meant for remote access that had VPN privileges but that didn't have MFA. And then you end up with the East Coast of the US where gas is not available, which brings back the memories of the late 1970s.

You don't want a situation where big tech that is supposed to be disruptors ends up creating gas lines reminiscent of inflation days. So you want to have a situation where you can ingest data and make sure that categorization is applied at the get-go. Then the other opportunity is building a sum total of privacy vertical solutions that can feed off of that categorization. Where I know that this data is very sensitive. It contains IP address, email address, past behavior.

Now you can build a mobility profile of the user, so you can apply a very short time to live and very strict access control enforcement, with only limited key management access and a tokenized database. So building point solutions off of the categorization is another opportunity. You can have one company do both or you can have two different companies work very closely. There is room for some consolidation in this space as well.
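
A point solution that feeds off categorization might look like the following minimal Python sketch, which picks a handling policy from the categories attached to a dataset. The category names, the retention periods, the role names and the "two or more identifying categories" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import FrozenSet, Iterable

# Hypothetical categories that, in combination, allow profiling a user.
IDENTIFYING = frozenset({"ip_address", "email", "behavior"})


@dataclass(frozen=True)
class HandlingPolicy:
    ttl: timedelta                 # how long the data may be retained
    allowed_roles: FrozenSet[str]  # who may read it
    tokenize: bool                 # whether values are stored tokenized


# Illustrative values only.
DEFAULT_POLICY = HandlingPolicy(timedelta(days=365),
                                frozenset({"analyst", "engineer"}),
                                tokenize=False)
STRICT_POLICY = HandlingPolicy(timedelta(days=30),
                               frozenset({"privacy_reviewed_service"}),
                               tokenize=True)


def policy_for(categories: Iterable[str]) -> HandlingPolicy:
    """Apply the strict policy when two or more identifying categories co-occur."""
    if len(IDENTIFYING & set(categories)) >= 2:
        return STRICT_POLICY
    return DEFAULT_POLICY


if __name__ == "__main__":
    print(policy_for({"ip_address", "email", "behavior"}))
    print(policy_for({"email"}))
```

The point of the design is that the policy decision reads only the tags, so it can be applied uniformly to any dataset that was categorized at ingest.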

What has happened now is there are a lot of companies in privacy tech doing roughly the same thing, but nobody provides the end-to-end coverage and a lot of companies will either give up because the ownership costs are too high, integration costs are too many, or they'll try to build it internally. So having some sort of aggregation in terms of building the governance and the point solutions is pretty critical. The other opportunity lies for companies in terms of auditability.

There are a lot of businesses that don't know what they need to do in GDPR or CCPA. Like when you file taxes, you use a tax preparation software, at least I do. I'm not smart enough to do my own taxes, but I'm sure in other countries some software already exists. You have some degree of confidence that if you use this tax preparation software, the laws that are being applied are the right ones and you have inoculation if something goes wrong afterwards.

Having some sort of model where you can say that if, as company XYZ, you build these solutions, you build these tools, you categorize it this way, there is some... not guarantee, but confidence in GDPR compliance, that would be very helpful because at that point, you are basically helping your engineers feel certain that they have compliance taken care of so that they can innovate without creating unnecessary bureaucracy. So there is a lot of opportunity at the front end and the back end across the board, which are the two ends that are currently pretty neglected.

If you look into the future, do you think there's going to be a one winner that takes it all or maybe even a few multi-billion-dollar companies that win a category, or do you see smaller point solutions that will be acquired by others aiming to expand their offerings?

There is enough room for everybody. The challenge is not about one person who wins and everybody else loses. My concern is the exact opposite. There is so much demand in this space and especially once all the enforcement for CPRA, CCPA, LGPD starts, there will be a ton of need for these tools. And what happens is buying a privacy tool or a solution or a platform when you have a privacy issue is like going grocery shopping when you're hungry.

You end up buying a lot of stuff you don't need and you'll end up buying stuff that you probably shouldn't need. So the same thing happens: companies get in trouble and then they go about buying tools. To give you an example, if you need a tool like BigID that does crown jewel discovery governance, that is a very different need than buying OneTrust, which is more of a compliance-based privacy program out of a box, a legal checklist.

These are two different tools. I'm not commenting on the usefulness or the applicability of these tools. I've used them both. I've known people on both sides. But to buy one tool when you need the other is a significant miss because they will not solve your problem. And then it creates this dogma where privacy tooling is useless, it doesn't work, we should all build it in house. There is a way to make an argument about a third-party tool that adheres very closely to a more comprehensive solution, but also fixes the most urgent need.

My sense is the market is really big. If you were to compare it to a 24-hour day, we are looking at 8:30 in the morning right now. There is a lot of sunshine still to come. Where there is a weakness is in terms of differentiation in the product pitch for these companies, because they don't do a very good job of explaining it, they assume that everybody thinks like them and uses the same vocabulary.

And sometimes when I talk to these founders and these VCs, I feel like crossing to the other side of the table and saying, "No, this is how you sell it to me," and I can't do that because I'm supposed to be the customer. So the real weakness here is in the product pitch and the explanation and the product vision aspect of the house. And the other weakness is in the company side, on the customer side, where the people who want to buy these tools don't always get a seat at the table.

The engineers, the associate general counsels, the data scientists, the platform engineers, these are the people who see the data. And yet the people who make the decision always tend to be the attorneys or the engineering executives who are not impacted by their day to day engineering decisions. So my role exists because I bring both perspectives.

I can still do a whiteboard solution with an engineer and I can talk to C-level executives, but people like me had to learn the hard way. There are not that many people like us in the industry. So I think that's the other weakness.

So there is a lot of money, a lot of opportunity… TL;DR, but what is missing is the connecting tissue.

When there is a startup building a data privacy/data engineering solution, what is the right way to approach a company? Is it through a CISO, engineering department, data privacy team?

I've had two conversations already with founders this morning alone, so that question hits home in a very direct way. If I were to approach a company and if I were in your shoes, what I would do is two things. First is I would reach out to the executive leadership so they know we exist, what we do, how our tool is different. You need to get visibility at the executive layer, but you need to get buy-in at the engineering bottom-up layer.

So visibility is top down, buy-in is bottom up. People need to understand that. If the engineers don't believe your tool will work, they will not adopt the tool, no matter how many executives make it a priority. I have been in meetings with executives at the C level, they say, "Security before features, privacy before features," all of their directs nod very, very approvingly, and then nothing changes.

The reason is the engineers don't have bandwidth, they don't have time, they don't know how to integrate things like that. So you have a two-tier approach, get the executive visibility, but then with the engineers, give them a free license for 30 days, create a community on Stack Overflow, open source some of your code, come up with an API that they can tap and use with synthetic data. If the engineers see that your tool is working, the word of mouth that'll happen within the engineering community and the company will become your best customer.

Because what you want is awareness at the exec layer, so when the engineers get excited and surface this tool to them, the executives, there will be more of an appetite to buy. And what you want to do is convert the engineers who operate in the trenches and make them your de facto sales staff.

Because they're inside the company, they have credibility, they can make the argument that your tool will help their productivity, will help their visibility, will help their efficacy, and you want them to sell your tool to the executive layer because the execs, people like myself, they don't want the engineers to leave. They don't want the engineers to get bored and they definitely don't want the engineers to worry about security and privacy slowing them down. So you are much more likely to get a yes from the executives quickly if engineers inside the company are already rooting for you.

In terms of internal vs. external data privacy issues, what’s more important to you and where do you see most vulnerabilities coming from?

I don't get to decide what's important for me, the data and the customer and the combined ambiance collectively decide. And I try to make sure that I'm foresighted enough to make sure that these issues don't become a big deal.

Given how interconnected we are, I would say both [internal and external] because I don't see a distinction between the two.

When I got into engineering, my favorite professor in college gave me a book written by Malcolm Gladwell, The Tipping Point. It's a pretty famous book. It talks about how small changes reach an ecosystem or a tipping point at some point, and then big changes happen right afterwards. And then when people wonder how did this happen, they have to go back and reconnect all the dots looking backwards like Steve Jobs would say.

For me, interactions between engineers, the tools together, the data stores, how that connects to archiving databases, how that connects with third party ad networks, how that connects with third party APIs, that continuous data flow, the inferences that you draw, this is all part of one big ecosystem that is connected. My first advice would be don't think of internal and external privacy as different, because it could be something direct and visible like data or it could be business practices or assumptions.

Those will flow throughout the system. And your next weak link could be exposed as one or the other, internal or external or some combination thereof. If you have an internal vulnerability, you might have that manifest itself in a much more dangerous fashion when you connect to a third-party vendor, or if you have a third-party vendor that is misusing data, it is possible that they are able to do that because somebody on the inside has an API that's open for them without being throttled or limited.

I would think of the end-to-end connection rather than distinction between internal and external. And I have no way of proving that my way is better than somebody else's except to say that over 10 years, I've run multiple privacy programs and my job has been to make privacy boring again. That is, I don't want my company's name in the newspapers because of a domain that I own. And I feel comfortable with my record in that regard. So my sense would be combine the two and think of it holistically.

Tell us about your book, Data Privacy: A runbook for engineers, and what to expect.

People like me don't exist easily in this domain because you have to have combined security, privacy, and engineering experience together. My book basically is aimed at founders, people who run SMBs or even large companies and who don't have the ability to hire top privacy or security talent. How do you train your existing engineers and maybe build a team within the engineering talent you already have? How do you teach them to classify data, to inventory it, to delete data, to obfuscate data, to export data, to share data, fix security gaps, come up with a maturity model?

How do you build an internal privacy team, an engineering team that can also do other engineering work for you? How do you motivate engineers to work on this stuff and how do you make your business and your customers safe again? This book is a runbook for engineers to make it easier for them to do privacy at scale without hurting the business. It's a book aimed at engineers, but also at founders, attorneys, regulators, media, so that they can understand how privacy orchestration can actually work within a company that doesn't have the muscle for it just yet.


Check out Nishant’s book, Data Privacy: A runbook for engineers.