Archival Storage Part 1: The Problems

All of us have data which has value beyond our own lives. My parents’ generation have little record of their childhoods, other than the occasional photo album, but what few records there are are cherished. My own childhood was well preserved, thanks to the efforts of my mother. Each of my brothers and I has a stack of photo albums, with dates and milestones meticulously documented.

Today, we are generating a massive amount of data. While the majority of it will not be of interest to future generations, I believe preserving a small, selective record of it, akin to the photo albums my mother created, would be immensely valuable to my relatives and descendants – think of your great-grandparents’ jewellery, a photo album of your childhood that your parents created, or the immigration papers of your ancestors.

Modern technology allows us to document our lives in vivid detail. The problem is that the data is transient by nature. For example, this blog is run on a Linode server – if I die, the bill doesn’t get paid and Linode deletes it. If Linode goes away, I have to be there to move it to a new server. If Flickr goes away, my online photos are lost. If Facebook goes away, all that history is lost. Laptops and computers are replaced regularly, and the backups created by previous computers may not be readable by future ones, unless we carry over all the data each time.

In part one of this series (this article) I document the problems of common backup solutions for archival storage, with reference to my own set-up. In part two, I’ll detail my “internet research” into optical BD-R media and how it solves these problems, and in part 3 I’ll deal with checksums and managing data for archival (links will be added when done).

Part 1 is fairly technical, so if you just want safe long-term storage, install and configure Crashplan, and skip to part 2.

Backups Evolved

My backups in my early university days were on CD-R discs, which eventually progressed to DVD-R. Hard drives at the time were many times more expensive per gigabyte than they are today, and optical media represented a cost-effective and semi-permanent method of saving critical data.

However, some time around 2006-2008, hard drives got cheaper, while optical media technology, as far as low-cost writeable media was concerned, stayed largely static. I started using external hard drives for backup instead of blank DVDs because they were cheaper, easier, faster, and took less physical space.

The optical media of the day was also not particularly durable. DVD-R and CD-R media at the time were almost exclusively made with organic dyes, which break down over time – both optically when exposed to sunlight, and chemically even when they are not. Keeping them in a cool, dry, dark place will make them last a reasonable amount of time, but it is difficult to predict. The Optical Storage Technology Association states that, generally, manufacturers aim for 30-100 years, in presumably favourable conditions.

Anecdotally, however, a lifespan of less than 10 years is not uncommon, and it is safe to assume that many of these DVDs have not been properly stored. The discs I burned before 2006 are sitting high up in my parents’ uninsulated garage, and while they won’t be exposed to any light, it’s fairly safe to assume large daily fluctuations in temperature, particularly in summer. And then there are potential manufacturing defects and sub-optimal burns to account for (who knew that using a cheap burner and recording at maximum speed was a bad idea?).

The hard drive based backups from 2006 onwards are equally precarious today. According to a physicist on Reddit, the industry standard for a bit lifetime on a magnetic disk is 10 years, which means that you should really read and write back all the data before that time. Presumably, the clock starts when that bit is written, so if you stored a photo, didn’t touch it for years and then archived the disk, the expiration date could be sooner than you expect.

Today I do not burn optical discs or archive physical hard drives, which is a step back in some respects. My data storage system is essentially a NAS (Network Attached Storage, in the form of a desktop PC) with a 4TB btrfs + LVM RAID5 array, a 2TB external hard drive for local photo backup, Crashplan to back up photos from the NAS to the cloud, and Crashplan backups from my laptop and desktop to the NAS and cloud. I would have used btrfs’ built-in RAID, but at the time drive replacement wasn’t implemented!

Many people might say it’s sufficient – it’s certainly much further than most people go, and is sufficient for my needs today. But for archival it falls short in many ways.

The three problems I’ve identified with respect to archival storage are access, longevity and integrity. My backup solution solves all 3 problems up to a point – it protects me fairly well from hardware failure and natural disasters (longevity), would detect but not prevent or correct integrity problems, and enables access so long as I am alive. But when it comes to archival storage beyond my own life, the solution falls far short.

Far short, even, of an old photo album.

The Access Problem

The access problem deals with accessing the data in the future.

The first part of the access problem is what is colloquially known as the “bus factor” in the tech industry – the number of people required to be hit by a bus before the data is difficult or impossible to retrieve. In my case it’s one – me.

If I were to be hit by a bus (or, you know, die from old age), it would take a non-trivial amount of knowledge to boot my custom-built NAS, reset the root password, mount the volumes on a remote computer and extract the data. That’s if my family bothered to do so, although I’d like to think they would given the memories within the photos!

The USB drive is ext4, so once again, Linux user or knowledgeable techie required for data extraction. Crashplan is also useless in this scenario, as they’d need my account details and knowledge of the service (although being consumer focused, it is well documented).

The other problem to think about is reading it in the distant future. While SATA interfaces are ubiquitous today, they will not be in the future. I would expect SATA interfaces to be built in to most desktop motherboards for at least the next 10-20 years, but within the foreseeable future you’ll be using an external adaptor to connect them. It is perhaps informative to observe how obtainable IDE (predecessor to SATA) interface adaptors remain – as you can be sure they’ll be unobtainable long before SATA adaptors ever are. Eventually though, as the number of serviceable drives dwindles, they will be uneconomic to produce and you’ll be relying on the second-hand market, which is basically the point at which you’ll need to move to a new medium.

Going back before IDE, reading from an old MFM drive would already be very challenging today, and that technology was first introduced in 1970, or 46 years ago. That is easily within my parents’ living memory and still some way short of my expected lifespan. To be honest though, magnetic degradation would probably be a problem long before lack of an interface for IDE or SATA would be – we’re already past that point for IDE.

Thus, to solve the access problem, data storage should be readable by a non-technical user, and stored on media which does not require electronics from a particular era.

The Longevity Problem

The longevity problem deals with natural degradation over time.

Bit rot, or data degradation, is well documented in the industry. Essentially, all storage media degrades over time due to various factors depending on the media – the magnetic bits in magnetic media such as hard drives lose their orientation, flash memory degrades due to electrical leakage and the materials that make up optical media break down over time.

While the underlying storage that cloud solutions use is also susceptible to degradation, the providers should (assuming they are competent) be able to detect and correct any errors. However, the cloud is not really an answer to the longevity problem, as the data ceases to exist there if you stop paying!

Thus a solution that has longevity should ideally not be subject to degradation, or at least have such a low level of degradation that it can be stored for a very long time before it needs to be re-read and written again. It should also be offline, and not require electricity or payment to keep it alive, and be readable by technology in the future.

The Integrity Problem

The integrity problem deals with risks to the data other than natural degradation over time. While the longevity problem can also be thought of as an integrity problem, here we consider factors unrelated to the natural environment – bad hardware, electrical problems, cosmic radiation and malware or human error.

While cosmic radiation risk is incredibly low for a single bit, the more data you move around, the more likely it is to happen somewhere. A study by IBM in the 90s estimated the error rate at one bit flip per 256MB of memory per month. If you just happen to be writing data when that happens, the chance of corruption is high. The effect is more pronounced at high altitudes where there is less atmosphere to filter out radiation, but it can happen at sea level too.
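To put that figure in perspective, here is a rough back-of-the-envelope calculation. It takes the quoted rate at face value, reads it as one flip per 256 megabytes of RAM per month, and assumes the number of flips scales linearly with memory size and time. Those are simplifying assumptions for illustration, and the function name is made up for this sketch.

```python
def expected_bit_flips(ram_gb: float, months: float = 1.0,
                       mb_per_flip_per_month: float = 256.0) -> float:
    """Expected bit flips, assuming the quoted rate scales linearly."""
    ram_mb = ram_gb * 1024  # treat 1 GB as 1024 MB
    return ram_mb / mb_per_flip_per_month * months


# Compare a few hypothetical machines over one month.
for ram_gb in (4, 16, 64):
    flips = expected_bit_flips(ram_gb)
    print(f"{ram_gb} GB of RAM: roughly {flips:.0f} flips per month")
```

Even if the real rate on current hardware is very different, the shape of the result is the point: the more memory your data passes through, and the longer it sits there, the less you can treat a flip as something that only happens to other people.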

But in my experience, the most common cause of corruption is bad hardware that flips occasional bits without failing outright. Given that electronics degrade over time, it could happen to any of the components handling your data (memory, CPU, disk controller, disk), at any time. If you have a laptop that is constantly crashing or resetting itself, it’s likely the cause of that problem will corrupt your data too! Remove the hard drive, put it in an external drive enclosure and copy your important data somewhere else!

Cloud vendors such as Crashplan, presumably, handle data integrity by checksumming what they write to disk, like modern file systems do. However, as I’m constantly uploading new backups, it’s the integrity of my local data that is more important, as it’s the source of truth, and any corruption would be stored as a new version by Crashplan. Thus, a measure that detects corruption 100% of the time is better than one that prevents corruption less than 100% of the time. Otherwise, you wouldn’t know to restore an earlier version!

Active, online systems also have another problem – they’re susceptible to accidental deletion, or malware such as CryptoLocker, which encrypts your files so you can’t read them, then demands a ransom payment for the decryption key. Using a service or underlying storage that allows you to restore previous versions of files can guard against this, but in the case of underlying storage it’s not infeasible for malware to detect this and deliberately remove the previous versions.

Thus, to solve the integrity problem, our data needs to be checksummed on read and write, the checksums should be stored alongside the data, and it should be guarded against tampering by malware and humans (accidental or otherwise). Offline systems inherently guard against tampering, while for online systems, multiple versions should be stored, out of reach of anything running on your PC.
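As a minimal sketch of what “checksums stored alongside the data” could look like in practice, the Python below writes a SHA-256 manifest into a folder and later re-checks every file against it. The manifest name `SHA256SUMS` and the function names are hypothetical choices for this example, and the one-line-per-file layout only loosely mirrors the output of the `sha256sum` tool. It is not what Crashplan or btrfs do internally.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "SHA256SUMS"  # hypothetical manifest file name


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large photos or videos fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(folder: Path) -> Path:
    """Record one '<hash>  <relative path>' line per file in the folder."""
    manifest = folder / MANIFEST_NAME
    lines = []
    for path in sorted(folder.rglob("*")):
        if path.is_file() and path != manifest:
            name = path.relative_to(folder).as_posix()
            lines.append(f"{sha256_of(path)}  {name}\n")
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest


def verify_manifest(folder: Path) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    damaged = []
    text = (folder / MANIFEST_NAME).read_text(encoding="utf-8")
    for line in text.splitlines():
        expected, name = line.split("  ", 1)
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(name)
    return damaged
```

You would call `write_manifest(Path("photos-2015"))` once when the folder is finalised, then `verify_manifest(Path("photos-2015"))` whenever the data is copied or read back; an empty list means every file still matches. Note that this only detects corruption. It cannot repair anything, and a manifest that lives on the same writable disk is no defence against tampering.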

The Effect of Corruption and Bit-Flipping

When bits are flipped in photos, you will probably see strange colours appearing part of the way down an image, or part of the image becoming grey. In severe circumstances, the image can be unreadable without specialised recovery software.

This Ars Technica article about next-gen file-systems (ZFS and BTRFS) demonstrates about halfway down the first page what happens to a photo when you flip a single bit. It’s why I really want btrfs RAID, and should probably have gone with ZFS for my current array, but I wanted Linux for media PC duties and thought btrfs would mature quicker. As it stands, my btrfs volume would detect an integrity error but be unable to correct it, because the underlying replication is done with LVM. It is still preferable to silent corruption though.
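A single flipped bit is easy to simulate. The sketch below corrupts one bit in a block of bytes standing in for a photo, then shows that a checksum notices the difference. The helper `flip_bit` is a made-up name for this example, and SHA-256 is used only to illustrate the principle; it is not a description of how btrfs or ZFS implement their checksums.

```python
import hashlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    byte_index, offset = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset
    return bytes(corrupted)


original = bytes(range(256)) * 4  # stand-in for the bytes of a photo
damaged = flip_bit(original, 1000)

# Count how many bits differ between the two copies.
changed = sum(bin(a ^ b).count("1") for a, b in zip(original, damaged))
same_checksum = (hashlib.sha256(original).digest()
                 == hashlib.sha256(damaged).digest())

print("bits changed:", changed)
print("checksums match:", same_checksum)
```

The two copies are the same length and differ in one bit out of several thousand, so nothing short of comparing them, or comparing their checksums, will tell you which one is the good copy.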

The integrity problem is basically solvable in online systems if you use a next-gen file-system and ECC RAM. While ECC RAM is cheap enough, the motherboard and CPU cost to run it escalates quickly, which is just price-gouging in my opinion. The benefit of ECC far outweighs the performance cost, and an informed consumer would definitely want more data safety over a little more memory bandwidth. Intel, however, deems it an enterprise feature! Thus simply adding ECC RAM would raise the cost of my £230 NAS to well over £400, despite the fact that ECC is highly desirable for anything dealing with data storage.

Conclusion

ECC RAM, a next-gen filesystem and cloud backups solve all the immediate problems. Data corruption will be detected, and as long as you continue to pay Crashplan or the cloud vendor of your choice and maintain your server, you will be protected against any foreseeable problems during your lifetime. But access to the data and its longevity are still dependent on you.

In part 2 (not yet published, will edit when done), I’ll detail my “internet research” into Blu-ray for archival storage.
