Without fail, when I talk to customers about their retention policies, the answer I get is "We keep everything forever." It's time to change that thinking, folks. Backup and archive admins, I realize I'm preaching to the choir here, but you have to put together a compelling talk track to convince executive management and your legal team. Below I break the challenges of infinite retention down into a few key categories you can use to summarize your argument for changing your policies and relieving the pressure on your poor data protection environment.
Legal Defensibility – Let's start here. If your data retention policy is to keep everything forever and a lawsuit lands on your organization, the scope of discovery is massive. If I'm good at my job as opposing counsel, I am going to use this to my advantage. First, I am going to force your staff to scramble to collect huge datasets into a format I can read. That means figuring out how to read data from those DLT tapes you have in storage even though you no longer have a DLT drive or the software that wrote them. You are also going to have to scour every corner of the environment for data some user stashed away without telling anyone. As opposing counsel, I won't actually use most of the data you are working so diligently to collect, but I will slap a ridiculous deadline on my discovery request that your team will never be able to meet. Now you are in breach, and I can start to make my case for a judgment in my favor. Consider instead a defined corporate policy that states you retain critical client information for 7 years to comply with regulation XYZ, and all other data is purged after 1 year from last access. The purge is performed by shredding, and it is audited using archiving software to validate that you did in fact destroy the data. You are now defensible against the discovery request, because the data doesn't exist. Period. You aren't hiding anything; you just aren't dragging that stuff around forever, and when someone wants to see what you have, you provide what's left. Further, because you archived this data using a tool, performing the discovery is as simple as entering a few search terms and exporting in a format that everyone agrees on and that meets everyone's needs.
Another side effect of this approach is that your primary storage stays nice and tidy, your backup stream is reduced so you can meet your windows, and users know from the beginning that data they don't touch or use gets thrown away, so they have no grounds to complain.
Media Pressures and Costs – It's not free to store data. At least not anything over the 5 GB in my cloud drive. So by keeping everything forever, you are spending thousands or millions of dollars on media just for the sake of consuming space in a warehouse or on the datacenter floor. Stop it! You may argue that long-term archive storage is cheap. It may be, but with data sets growing 44x, even the cheapest storage is money poorly spent on long-term archives of worthless data. Note that I said WORTHLESS data. There is some data you do in fact want to keep forever, and I understand that and encourage you to do so. But EVERYTHING FOREVER includes a lot of worthless data, because you aren't applying good filters. Think of how many MP3s and JPGs are probably in your environment today. How many of those are of your employees' families or other personal events? Why is it IT's responsibility to back up Susy's pictures? It's not, and if you start archiving them, you are going to start burning through media. Deduplication solves all of this, right? Wrong. Deduplication does relieve some of this pressure, but again, with data growth off the charts, even deduplication can't be the answer to everything. So stop spending money on media you don't need and redirect that money back into your training budget or an overhaul of your backup environment. You know they are probably being neglected.
Operational Recovery vs. Disaster Recovery, and Data Retention – First, let's start with a few key definitions. Backup is the process of making a complete COPY of a set of data that is stored on separate and distinct media from the original source. Archive is the process of MOVING a set of data to separate and distinct media from the original source. Operational recovery is the day-to-day act of restoring files, databases, applications, servers, or environments that are damaged as a result of any number of issues in the datacenter. Disaster recovery is standing up the entire datacenter in a new location due to loss of the original facility. OK, now that we have some basic definitions in place, let's see how they apply to some scenarios. For operational recovery, the hope is that the loss is realized within a reasonable time frame. In a perfect world, yesterday's backup would hold the data we need to recover. Worst case, a few weeks or even months may pass before the missing file is discovered. In this case, does infinite retention help? No. Consider an application like a CRM that holds customer information. The data in this app is never purged and is backed up every day. Someone left the company and, on their way out, maliciously deleted all of the accounts they worked on for the last two years. You have to go perform a recovery of this data, but you aren't going to look at the backup from two years ago; you're going to look at yesterday's backup. Did infinite retention help here? No. Your datacenter is destroyed as a result of a massive weather event and you have to bring everything back online immediately. Fortunately, all of your Tier I applications are already being replicated to a remote datacenter, so you are able to fail these over quickly. Now you begin recovering your Tier II and Tier III apps from backup. Does infinite retention help here? No. So what backup data retention policy DOES work here?
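To make the backup-versus-archive distinction concrete, here is a minimal Python sketch. The function names and directory layout are my own, purely for illustration; real backup and archive products do far more (cataloging, verification, retention enforcement), but the core difference is copy versus move:

```python
import shutil
from pathlib import Path

def backup(src: Path, backup_dir: Path) -> Path:
    """Backup COPIES the data to separate media; the original
    stays in place on primary storage."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, backup_dir / src.name))

def archive(src: Path, archive_dir: Path) -> Path:
    """Archive MOVES the data to separate media; it no longer
    exists at the source, freeing primary storage."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(src), str(archive_dir / src.name)))
```

After a backup, the file exists in two places; after an archive, it exists only on the archive media. That is exactly why archiving shrinks primary storage and the daily backup stream while backup alone does not.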
The best example I have heard accounts for a normal event in organizations and still allows for a reasonable amount of storage. Consider maternity/paternity leave. An employee may be gone for up to a year to spend time at home with a newborn. In many companies, their accounts are deactivated and their data removed while they are away, to be brought back online once they return to work. Since I may need to recover data from a year ago, I need to be able to roll back to that earlier point in time. If I don't have an archive program in place, then I need to rely on a backup taken before the data was deleted. If I kept a backup at the one-year mark, I would be able to get this data back. Does infinite retention help? No. BUT a retention policy that captures a copy of backup data at the quarter or year mark does.
A good working retention suggestion – As we saw above, once you start to delete data, you have to define your retention policy. I usually suggest something similar to the following. Take a daily backup and retain each copy for two weeks (14 copies). As the next two weeks roll off, retain a weekly copy at the end of week 3 and week 4 (2 more copies). Finally, continue rolling off daily and weekly copies and retain a monthly copy at the end of month 2 and month 3 (2 more copies, totaling 18). This gives you a rolling 90-day window from which you can recover data. If you need to accommodate a longer period, you could add quarterly backups in the 2nd, 3rd, and 4th quarters, or simply add a copy at the end of the year. This should allow you to recover nearly any data that is asked for while keeping the amount of retained media reasonable.
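As a sanity check on the arithmetic above, here is a small Python sketch that enumerates which copies the rolling schedule retains on any given day. The function name, parameters, and the 30-day month approximation are my own assumptions, not a vendor feature:

```python
from datetime import date, timedelta

def retained_backups(today: date, daily: int = 14,
                     weekly: int = 2, monthly: int = 2) -> list[date]:
    """Return the dates of all backup copies retained under the
    suggested rolling schedule: 14 dailies, then one weekly copy
    for each of the next 2 weeks, then one monthly copy for each
    of the next 2 months (months approximated as 30 days)."""
    kept = []
    # Daily copies: the most recent `daily` days, including today.
    for d in range(daily):
        kept.append(today - timedelta(days=d))
    # Weekly copies: one at the end of week 3 and week 4.
    for w in range(weekly):
        kept.append(today - timedelta(days=daily + 7 * w + 6))
    # Monthly copies: one at the end of month 2 and month 3.
    offset = daily + 7 * weekly
    for m in range(monthly):
        kept.append(today - timedelta(days=offset + 30 * m + 29))
    return sorted(kept)

copies = retained_backups(date(2024, 3, 31))
print(len(copies))                              # 18 retained copies
print((date(2024, 3, 31) - copies[0]).days)     # oldest copy is 87 days back
```

Running this confirms the totals in the text: 18 copies on hand at any time, with the oldest reaching back roughly 90 days.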
Bringing it all back together: I realize there are times we need to access old data, but data hoarding doesn't help anyone. Infinite retention is usually far more harmful than helpful, so take a look at your policies and think through your normal scenarios. Do they align?