Contain Your Toxic Waste: Keep Prod Out of Dev

I completed an impossible hack the other day. A simple authorization bypass led me to a few hundred thousand fullz. I’m talking Social Security numbers, names, addresses, the whole deal. My big payday shoulda been impossible; I was testing in a development (dev) environment. That data is not supposed to be there. Yet, there it was.

From a developer’s perspective, dev and prod look similar, but only one can safely contain toxic data

I’ve noticed this increasingly common yet disturbing trend while testing web applications—they’re populated with production data. At some level, it makes sense. Testing instances should process realistic workloads to assess performance and outlier conditions, and production (prod) data is right there to copy and paste into a “secure” internal environment. And it’s so convenient when it’s suddenly time for integration testing!

Too bad there’s a critical and obvious problem with this approach. There’s a huge reason environments are kept separated into Development, Staging, and Production: security. These siloed phases keep data availability high, so that a change in test won’t take down prod. That separation upholds availability, the A cornerstone of the CIA triad (confidentiality, integrity, availability).


CIA represents the triad of information security concerns

You should use fake data sets

Putting prod data in dev means losing confidentiality for the sake of availability, which crumbles the CIA triad. Developers don’t usually need real production data to test the application thoroughly. What they need is a production-like set of data. While they take time to build, fake data sets can still have the outliers, edge cases, and other conundrums that make testing thorough. Those data sets should be part of internal system testing from the beginning, well before a team comes to pen test your environment. If you choose not to use fake data sets, well… then eventually I, or somebody worse, will come along and steal all your production data.

This is bad and it's common

While at first it seemed like my lucky day to get all those fullz, this easy exposure of sensitive data is unfortunately normal, not a rare jackpot. I’ve unearthed large troves of production data at least three times in the past year while working in a test or staging environment. I’ve found massive amounts of PII (including financial data and Social Security numbers), media files (the spoiler alert motherlode), and credentials (like access to back-end prod servers). Sometimes these machines are exposed to the entire internet, which makes them mere clicks away from a serious data breach.

Other times, these test machines are vulnerable to exposing data because of a superuser account that’s been provisioned for testing. It’s handed over to me, with a dead simple password (think admin:admin), because “only internal people” use this machine. But “internal people” typically includes people who have no right to be looking at that data, like contractors and interns.

The consequences of a real-world attack like my simple authorization bypass can be pretty severe. If I were a malicious man, I could have exposed their data, thereby violating their compliance protocols (PCI, HIPAA, what have you), their user agreements, their licenses, and their contracts. A massive data breach can also tank the stock, by something like 3% for a month.

Developers’ turbulent relationship with prod data

Developers need access to production data to debug new issues coming from customers, to find out trends in systems that inspire new features, and a whole host of other things. But how would you feel if the latest intern at a big tech company leaked your personal data? Developers really shouldn’t be burdened with custodianship of this data.

Developers shouldn’t even want access to it without a compelling reason. The best attitude is that the responsibility that comes with prod access outweighs the convenience and frankly sucks. The people who have access to prod need to think super carefully about their operational security, and likely get stuck with pager duty. It’s not a pleasant job role.

With strict regulations like the EU’s GDPR, the USA’s CLOUD Act, and Australia’s Assistance and Access Act, developers should be actively rejecting any sort of custody of production data. You don’t want to be on the receiving end of a subpoena to supply that info. It’s not just us security folks yelling at you anymore; the lawyers are coming, too.

In a small company, only one or two people should have access to the core data. In a larger organization, Ops or DevOps should have a process for accessing production data that logs who uses it and checks that access is authorized. Keep the number of people responsible for this stuff minimal, and audit them.
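To make "logs who uses it and checks that access is authorized" concrete, here is a minimal sketch of a gate in front of prod reads. Everything in it is hypothetical: the AUTHORIZED_USERS allowlist, the in-memory audit_log, and the access_prod function are stand-ins for whatever identity provider and log pipeline your organization actually runs.

```python
from datetime import datetime, timezone

# Hypothetical allowlist; a real setup would ask an identity provider.
AUTHORIZED_USERS = {"alice"}

# Hypothetical in-memory audit trail; a real one would be append-only
# and stored somewhere the person being audited cannot edit.
audit_log = []


def access_prod(user, reason, fetch):
    """Run fetch() only for authorized users, and record every attempt."""
    allowed = user in AUTHORIZED_USERS and bool(reason.strip())
    # Log before acting, so denied attempts leave a trace too.
    audit_log.append({
        "user": user,
        "reason": reason,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise PermissionError(f"{user} may not read production data")
    return fetch()
```

The point of the design is that the denial gets logged as well as the success. The intern poking around shows up in the audit trail even though they never see a row.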

So, what are you supposed to do instead if access is truly restricted? Developers still need access to a playground with realistic data sets.

Get into the driver's seat and test

Well pardon my spit, but proper prior planning prevents piss poor performance. Test-driven development (TDD) has a decent answer here. It’s no silver bullet, but with TDD you have to make time in the process to test the system before you launch a new feature. Ensure that each model in the application has one or more generators that can populate fake valid and invalid data. For valid data, there are tools that can generate realistic-looking sets of data (including credit card numbers and Social Security numbers) on the fly.
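Here is a rough sketch of what a per-model generator could look like, using only the Python standard library. The customer model, its field names, and the function names are all made up for illustration. The card numbers pass the Luhn checksum and the SSNs only follow the basic format rules (no 000, 666, or 9xx area), so they look right to a validator; because they are random, treat any resemblance to a real number as a coincidence.

```python
import random
import string

# Hypothetical name pools for the hypothetical customer model.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Example", "Sample", "Testcase", "Placeholder"]


def luhn_check_digit(partial):
    """Return the digit that makes partial + digit pass the Luhn checksum."""
    total = 0
    # Walk right to left; every other digit, starting with the last, is doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def luhn_valid(number):
    """True if number is all digits and its last digit is the Luhn check digit."""
    if len(number) < 2 or not number.isdigit():
        return False
    return luhn_check_digit(number[:-1]) == int(number[-1])


def fake_ssn(rng):
    """Format-valid SSN: area is never 000, 666, or 900-999."""
    area = rng.choice([a for a in range(1, 900) if a != 666])
    return f"{area:03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def fake_card(rng):
    """16-digit number with a '4' prefix and a correct Luhn check digit."""
    partial = "4" + "".join(rng.choice(string.digits) for _ in range(14))
    return partial + str(luhn_check_digit(partial))


def valid_customer(rng):
    return {
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "ssn": fake_ssn(rng),
        "card": fake_card(rng),
    }


def invalid_customer(rng):
    """Start from a valid record, then break exactly one field."""
    record = valid_customer(rng)
    defect = rng.choice(["ssn", "card", "name"])
    if defect == "ssn":
        record["ssn"] = "000" + record["ssn"][3:]  # area 000 is never valid
    elif defect == "card":
        wrong = (int(record["card"][-1]) + 1) % 10  # corrupt the check digit
        record["card"] = record["card"][:-1] + str(wrong)
    else:
        record["name"] = ""  # a required field left empty
    return record
```

Passing in a seeded random.Random(1234) makes the data set reproducible, so a failing test can be replayed with the exact same records.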

Use fake data, not anonymized data

While it might be tempting, you can’t just anonymize your prod data. Trying to make real data sufficiently anonymous is probably more work than just generating completely fake data. (Great minds have put in a lot of time figuring out how to cross-link data and undo anonymization.) If you’ve already taken the anonymization path, check your work to see how well you’ve truly transformed your anonymized data. Measure your data by k-anonymity, l-diversity, and t-closeness.
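If you want a quick sanity check on an anonymized table, the first two of those measures are simple to compute. This is a minimal sketch: rows are plain dicts, the column names in the usage note are invented, and l-diversity here means the simplest "distinct values" variant. t-closeness needs a distance between distributions and is left out.

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Bucket rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest bucket: each row hides among at least k rows."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min((len(g) for g in groups.values()), default=0)


def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any single bucket."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(
        (len({row[sensitive] for row in g}) for g in groups.values()),
        default=0,
    )
```

For example, with four rows where zip and age band are the quasi-identifiers and diagnosis is the sensitive column:

```python
rows = [
    {"zip": "021**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "021**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "021**", "age": "40-49", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["zip", "age"]))               # 2
print(l_diversity(rows, ["zip", "age"], "diagnosis"))  # 1
```

A k of 2 looks tolerable until you notice the l of 1: everyone in the 40-49 bucket has the same diagnosis, so knowing someone is in that bucket tells you their diagnosis.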

Being armed with a suite of fake and invalid data lets you do all sorts of other things too. The testing problems that your third-party tester hopes to include are the same ones you’ll want to manipulate in your everyday work. The realistic workloads you generate (with their built-in outliers and edge cases) can give you a nice, reproducible environment to test and debug those exceptions without endangering prod data.

Treat prod like it's dangerous

Developers should learn to treat production data with the same sensitivity they would toxic waste or a virus: anything it touches suddenly takes on a bunch of security and legal requirements that make it a pain in the ass. Contain the virus, quarantine the data. Keep prod data out of your test environment.