Managing Destructive Releases in Big Data Projects
When it comes to releasing software, there are, to simplify things, two camps.
One camp marches behind the process. The process guides them through the risky business of getting software out to customers.
The other camp is foraging through the forest with a tool belt of Continuous Integration and Continuous Deployment tools.
The Process camp releases software on a fixed cadence: every two weeks, but more likely monthly or even quarterly. People on the CI/CD side scoff at this practice and release daily, or even multiple times a day.
But both camps are utterly terrified of a destructive release.
What is a Destructive Release?
A destructive release is one that comes with permanent, non-reversible consequences. This is no news to those working on big data and similar systems. A release takes place and instantly affects incoming user data in a way that renders rolling back a Herculean task. That is, if a rollback is even possible.
For a good chunk of the past decade, I worked on search engines and data-related projects. Nothing smells more like trouble than a botched release corrupting the search index.
"Stop," you say. "Obviously, you can roll back, re-ingest data, and do whatever is necessary to tackle every scenario."
And that is a fair comment. In many instances, that is the case. Depending on the cost incurred, the fix can be quick, short-lived, or manageable by communicating with the customer.
Unless, of course, you take into consideration privacy concerns.
Why are Destructive Releases so hard?
Cue GDPR. Multiple articles of the GDPR stipulate how Personal Data can and cannot be handled, and also force organizations to clearly describe their data retention periods. The General Data Protection Regulation, in principle, protects users against indiscriminate data hoarding, so you must have a good reason to store data for a long period of time.
The ability to provide the service is, in many instances, a sufficient reason to keep the data, and that should be the end of the discussion. But it is in the interest of your organization, the Data Controller, to keep the data backlog to a minimum.
Software is written by people, and people make mistakes. Data breaches do happen. It is best to minimize the amount of data you hold, especially Personal and Sensitive Data, to decrease legal exposure should a data breach take place.
Lately, we have also seen the rise of privacy-first tools: tools that intentionally avoid gulping down vast troves of data. That is the segment, Privacy-First Web Analytics, where Wide Angle Analytics operates. Whenever we encounter potentially personal or identifiable information, we promptly anonymize it.
That instantly impacts your data-related rollback capability, or at least greatly complicates it.
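Anonymization being one-way is exactly what makes the rollback story harder. For illustration only (this is a generic sketch, not Wide Angle Analytics' actual implementation), such irreversible transforms might look like:

```python
import hashlib
import ipaddress

def anonymize_ip(ip: str) -> str:
    """Zero out the host portion so the address no longer identifies a user."""
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        # Keep the /24 network, drop the last octet.
        return str(ipaddress.ip_network(f"{ip}/24", strict=False).network_address)
    # For IPv6, keep only the /48 prefix.
    return str(ipaddress.ip_network(f"{ip}/48", strict=False).network_address)

def pseudonymize(value: str, salt: str) -> str:
    """One-way hash; with a rotating salt, even we cannot reverse it later."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

print(anonymize_ip("203.0.113.42"))  # -> 203.0.113.0
```

Once data has passed through steps like these, there is no "restore the original" button, no matter how good your release process is.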
Managing Impact of Destructive Releases
We have established that Destructive Releases are HARD. But rather than being paralysed by the threat, we need an approach to manage them in real life. What options do we have?
You have to keep releasing; software is likely the lifeblood of your business. Your first line of defence against a major F*CK UP is testing.
Test the living shit out of as many edge cases as you can think of. Struggling to cover all cases? Guess what: you are unlikely to cover them all.
If you seek refuge in Property-Based Testing, which can automate data generation, you can get more comprehensive coverage of edge cases.
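The core idea of property-based testing can be hand-rolled in a few lines; real tools such as Hypothesis or ScalaCheck add smarter generators and shrinking. A minimal sketch, with an invented `merge_events` step standing in for a real ingestion function:

```python
import random
import string

def merge_events(old: dict, new: dict) -> dict:
    """Hypothetical ingestion step: newer fields win, nothing is dropped."""
    return {**old, **new}

def random_event(rng: random.Random) -> dict:
    """Generate a small random event; the test data writes itself."""
    keys = [rng.choice(string.ascii_lowercase) for _ in range(rng.randint(0, 5))]
    return {k: rng.randint(0, 100) for k in keys}

# Instead of a handful of hand-picked cases, assert a *property*
# over a thousand generated inputs.
rng = random.Random(42)
for _ in range(1000):
    old, new = random_event(rng), random_event(rng)
    merged = merge_events(old, new)
    # Property: every key from either input survives the merge.
    assert set(merged) == set(old) | set(new), (old, new)
print("1000 generated cases passed")
```

The win is that you state an invariant once and let the generator hunt for the edge case you would never have written by hand.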
While testing is automated, tests are written by humans. And that means mistakes. We are human, after all.
Test, but don't stop there. It is just one of many tools at your disposal.
After Scala's strong type system, tests are the first line of defence for the Wide Angle Analytics code base.
Backup and Restore
In an ideal world, Destructive Releases are a non-issue. You have no regulatory retention concerns, storage costs are minuscule, and transferring terabytes of backed-up data happens in the blink of an eye.
If you ever find this kind of fantasy world, please let me know. I would like to consider moving there.
Back in the real world, you might face the prohibitively high cost of creating full backups that can be accessed quickly.
Likewise, backed-up data does not always share the same format as the live production data. The live data streams reflect an ever-changing product and are subject to change, while your backup might use a long-term storage format, assuring access by future generations. So while the data might be available, accessing it after a botched release might drag multiple engineers out of their lairs.
And then there is the time it takes to restore. In projects that involve model training, search indexes, and the like, the sheer effort and time it takes to re-ingest data can be prohibitive, creating a major business concern.
This does not mean you should skip backups. You must be prepared for the worst.
As for restore, one solution is to keep critical data history in hot storage, with a dedicated restore process that allows quickly dumping, say, the last 4 weeks of data back into the live system. Depending on your operational conditions and the nature of the product, this might be enough to keep your users going while you execute a holistic restore process.
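As a sketch of that idea, assuming the hot store is partitioned by day (the partition scheme and the `replay` call are hypothetical):

```python
from datetime import date, timedelta

def partitions_to_replay(today: date, weeks: int = 4) -> list[str]:
    """Daily partitions covering the hot-storage window, oldest first."""
    days = weeks * 7
    return [(today - timedelta(days=d)).isoformat() for d in range(days, 0, -1)]

# A restore run would then feed each partition back into the live system:
# for partition in partitions_to_replay(date.today()):
#     replay(partition)  # hypothetical re-ingestion call
```

Replaying oldest-first keeps the live system's "newest record wins" semantics intact, and a fixed window keeps the hot store small enough to stay cheap.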
Depending on your stack and application, another option is to create temporary snapshots of the storage devices or containers. The quality, price, and reliability of such a solution will vary depending on your cloud provider.
The drawback of this approach is the recovery itself: restoring from snapshots often means taking the whole application down. However, this will greatly depend on your use case.
Aside from encrypted, long-term backups in an object store, we leverage Kafka with a selective retention period per topic. We allow ourselves a day to handle any immediate issues with raw data. Once data has been anonymized, it is transferred to a long-lived topic.
As we permit processing of Personal Data that we cannot anonymize, the long-lived topic also has an expiration period: 14 days.
Depending on the issue introduced by a release, we have between 1 and 14 days to resolve it before it has an impact on customer data.
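Those two windows translate directly into per-topic retention settings. A small sketch that derives the Kafka `retention.ms` values (the topic names are invented for illustration):

```python
DAY_MS = 24 * 60 * 60 * 1000  # one day in milliseconds

# Per-topic retention, in the spirit of the setup described above.
retention = {
    "events.raw": 1 * DAY_MS,          # a day to catch immediate issues
    "events.anonymized": 14 * DAY_MS,  # long-lived, but still expiring
}

for topic, ms in retention.items():
    print(f"kafka-configs --alter --entity-type topics "
          f"--entity-name {topic} --add-config retention.ms={ms}")
```

The printed `kafka-configs` invocations show how such a policy would be applied as topic-level overrides; the broker then deletes expired segments for you, which makes the retention promise enforceable rather than aspirational.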
Tests, Backup, and Restore are engineering processes that rely on tools and software. We are used to writing code, so we like to start with these.
What is frequently overlooked is the importance of establishing a solid process. A process is a tool designed to help humans accomplish a task, so the exact implementation will greatly depend on your organization. The process for destructive releases will be influenced by office politics, team structure, and even office hours.
Here are some questions that should be asked when creating such a process:
- Do we have a pre-release test plan?
- Have the tests been reviewed? If yes, by whom?
- Are backups actually there? We have all experienced a false sense of safety, only to discover the backup process had been stuck for weeks.
- Are the right people available/online/on-call?
- Are the appropriate people in the proverbial room during the release?
- Is there someone who can update the public status page in case a system must be taken down for maintenance?
- Have stakeholders been notified? If you employ continuous deployment practices and release frequently, your business might be unaccustomed to the risk associated with a Destructive Release.
- Do you have to leave in an hour to pick up your kid from school? Make sure the people involved are in the right mindset.
These and many other questions will help you create a solid backbone of a checklist that you can reference later during release.
This human-intensive, process-driven approach creates friction in Agile environments. And that should be OK. A release that makes destructive changes to the database schema, modifies large search indexes, or updates a machine learning model with large training sets might be so infrequent in your organization that it warrants making it an important event.
The worst thing you could do is to trivialize such a release.
Four or Six Eye Principle
Another human-centric tool you can use is the Four or Six-Eye principle. Please allow me to explain.
In the past, when working in an organization facing some of the challenges listed above, with a prohibitively high cost of restoration in case of failure, I would orchestrate releases by having one or two colleagues shadow every action over a screen share.
Before pressing Enter to confirm a critical step, I would ask them to inspect the command and its parameters to make sure they looked good. I would also ask them to cross-check that the relevant services were in a state fit to proceed.
This is what you see before every flight departure:

Pilot: "Cabin crew, arm doors and cross-check."
Two staff members yank the levers and check each other's work.
Crew member: "Cross-check complete."
This trick can save you a weekend of recovery.
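The same cross-check can even be encoded in release tooling: a destructive command runs only once enough people have confirmed the exact command line being executed. A hypothetical sketch:

```python
# Four-eye gate: each approver must confirm the *exact* command,
# not just say "yes" - a mismatch means someone saw something different.
def four_eye_gate(command: str, approvals: dict[str, str],
                  required: int = 2) -> bool:
    """Allow execution only if `required` people confirmed this command."""
    confirmed = [who for who, seen in approvals.items() if seen == command]
    return len(confirmed) >= required
```

Requiring approvers to echo the full command back catches the classic failure mode: everyone nods at the plan, but nobody actually reads the parameters.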
Don't fear releasing on Friday. Build your continuous deployment process so that it is always safe to release whenever necessary, if that is what your business demands.
But, please, pause when you encounter a Destructive Release. Prepare. Make backups. Plan how to restore. Devise a process and stick to it. Have someone watch your back.