More on Long Term Data Archiving

by Mike Shea on 24 September 2005

I've become fascinated with the challenges of preserving our knowledge. It's probably some psychological arm-wrestle with mortality, but whatever, I'm having fun.

There was a good article in Fairfax Digital called The Digital Dark Age that talks about a lot of the issues with preserving digital data. I also found a lot of interesting reports from various sources on archiving digital data and the challenges it presents.

At the same time, I continually look at preserving my own information both in paper and in digital formats. Right now I have a pretty good process that follows Arthur C. Clarke's Rule of Three: preserve in three formats on three media types in three places.

There are a few problems faced when considering the longevity of digital archives:

Most people only consider the last one. I bought some CD-ROMs and DVDs that are rated for 100 years, but I'm pretty sure no one will have a CD-ROM drive that far in the future. Also, what good is a CD with your lost novel if it's written in a binary Microsoft Word file that no one can read?

Here are some more random thoughts:

Archival Formats:

Text: UTF-8 or ASCII plain text, HTML (minimal markup, strict compliance, style completely removed or isolated from structure), XML

Images: GIF, JPEG, PNG, BMP, TIFF

Audio: MP3, WAV, OGG

Video: MPEG-1, AVI
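Checking a folder against the format lists above is easy to automate. Here is a minimal Python sketch, assuming file extensions are a good enough stand-in for the real format. The extension table and the function names `classify` and `non_archival_files` are my own illustrative choices, not part of any standard.

```python
from pathlib import Path

# Extensions derived from the format lists above. The exact mapping is an
# assumption made for illustration (for example, treating .htm like .html).
ARCHIVAL_EXTENSIONS = {
    "text": {".txt", ".html", ".htm", ".xml"},
    "images": {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"},
    "audio": {".mp3", ".wav", ".ogg"},
    "video": {".mpg", ".mpeg", ".avi"},
}


def classify(path):
    """Return the category for a file name, or None if it is not on the list."""
    suffix = Path(path).suffix.lower()
    for category, extensions in ARCHIVAL_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None


def non_archival_files(root):
    """List files under root whose extension is not on the archival list."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and classify(p) is None
    )
```

Running `non_archival_files` over an archive folder gives a to-do list of files, such as a stray `.doc`, that should be converted before they go onto long-term media.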

Archival Media:
- Hard drives (40 GB+ size, lifespan of ten years or so; degrade with use and under changes in climate)
- USB drives (1 GB to 4 GB size, lifespan of five to ten years; degrade as they are written to, survive extreme climates)
- CD-ROMs (600 MB size, 100-year media lifespan, ubiquitous, open, cross-platform)
- DVD-ROMs (4 GB size, 100-year lifespan, not as ubiquitous as CD, open, cross-platform)

National Archives rules:
- Open standards
- Ubiquity - is it used often?
- Stability - is it a stable format with a proven history?
- Metadata support - does it allow for an open source of metadata?
- Feature set - can it do what you need it to do?
- Interoperability - does it work on a variety of systems and platforms?
- Viability - does it do any sort of internal data check?

Mike's additional rules:
- Durability - can the data be restored or preserved if damaged?
- Migratability - can it be easily shifted from one media, format, or system to another?
- Ten year rule - has this format remained ubiquitous over a long period of time?
- Your system is more important than technical detail.
- Do not compress - uncompressed data is easier to fix on damaged media.
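Plain text and HTML files have no internal data check of their own, so one way to cover the viability and durability questions is to keep a checksum list next to the archive. Here is a minimal Python sketch, assuming SHA-256 from the standard `hashlib` module and a simple "digest, two spaces, file name" text layout. The function names and the manifest layout are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }


def write_manifest(manifest, manifest_path):
    """Save the manifest as plain text. Keep it outside the archive folder."""
    lines = [f"{digest}  {name}" for name, digest in sorted(manifest.items())]
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_manifest(manifest_path):
    """Load a manifest written by write_manifest."""
    manifest = {}
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            digest, name = line.split("  ", 1)
            manifest[name] = digest
    return manifest


def damaged_files(root, manifest):
    """Return names from the manifest that are now missing or changed."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

An empty list from `damaged_files` means every file recorded in the manifest still matches. Files added after the manifest was written are not reported, so the manifest needs rebuilding whenever the archive changes.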

What is my backup procedure? Let's take a look:

Preserve in Three Places:
- On my website
- On my home machine HDs
- At Michelle's House
- On my person

Formats:
- HTML, single file per site
- HTML, one file per article
- Single-file SQL ASCII database dump
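To show what a single-file SQL text dump looks like in practice, here is a small Python sketch. It uses the standard library's `sqlite3` module purely as a stand-in, which is an assumption on my part: a site database on another engine would use that engine's own dump tool instead. The function names are illustrative.

```python
import sqlite3


def dump_to_sql_text(connection):
    """Return the whole database as one plain-text SQL script."""
    # Connection.iterdump() yields the SQL statements that recreate the data.
    return "\n".join(connection.iterdump()) + "\n"


def restore_from_sql_text(sql_text):
    """Rebuild an in-memory database from a dump, to prove it is recoverable."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(sql_text)
    return restored
```

The dump is ordinary text, so it can be read in any editor and repaired by hand if part of it is damaged. Restoring it into a scratch database now and then is a cheap test that the recovery path actually works.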

Media:
- Website HDs
- Home HDs
- USB Flash Drives

Rules of Thumb:
- Don't compress
- Use standards
- Keep it simple
- Document it well
- Keep it automated
- Single volume per copy
- Document the recovery procedure well
- Keep it platform, format, application, and media independent
- Don't go overboard. Organization and documentation are more important than chaotic mass backup.
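The "keep it automated" and "single volume per copy" rules can be sketched in a few lines of Python. This is only an illustration, assuming each place is reachable as an ordinary folder (a mounted USB drive, a synced directory, and so on). The function name `copy_to_places` is hypothetical.

```python
import filecmp
import shutil
from pathlib import Path


def copy_to_places(source, destinations):
    """Copy one archive file to each destination folder and verify the copy.

    Returns a dict mapping each destination to True if the copy matches the
    source byte for byte, or False if copying or verification failed.
    """
    source = Path(source)
    results = {}
    for dest_dir in destinations:
        dest_dir = Path(dest_dir)
        try:
            dest_dir.mkdir(parents=True, exist_ok=True)
            target = dest_dir / source.name
            shutil.copy2(source, target)  # copies the data plus timestamps
            # shallow=False forces a full comparison of file contents.
            results[str(dest_dir)] = filecmp.cmp(source, target, shallow=False)
        except OSError:
            results[str(dest_dir)] = False
    return results
```

Any destination that comes back `False` needs attention before the backup counts as done. Run from a scheduler, a script like this keeps the process boring, which is the point.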

Procedures:

- mikeshea.net/articles
- liquidtheater.com/
- loralciriclight.com/
- mobhunter.com/
- bobshea.net/
- vrenna_and_other_tales.html
- notes
- stories

Here are some other resources I've found:

http://www.tessella.com/Services/Capabilities/e_Digital%20Preservation.pdf

http://www.si.umich.edu/CAMILEON/reports/reports.html

http://www.dpconline.org/graphics/digpres/presissues.html

http://www.columbia.edu/acis/dl/imagespec.html

http://www.columbia.edu/cu/lweb/services/preservation/dlpolicy.html

http://www.nationalarchives.gov.uk/preservation/advice/digital.htm

http://www.nationalarchives.gov.uk/preservation/advice/pdf/selecting_file_formats.pdf

http://en.wikipedia.org/wiki/Backups