Tips for Preserving your Data

by Mike Shea on 27 July 2006

I continue to obsess about archiving data. I'd probably be better off writing a journal like a normal person. I have a great notebook and I have a great pen so why would I bother to try the same thing in the digital world? Why not cast aside these machines that last little more than five years and go back to a technology known to last for thousands of years?

I could write my thoughts and stories, print the articles worthy of saving, wrap them in archival paper and ziploc bags, store them in a large polyethylene bottle (the same used to store pretzels), and keep them the hell out of the attic or the basement. For a great book on preserving your stuff, read Saving Stuff by Don Williams, a Smithsonian archivist. It has a lot of practical and required information for preserving stuff.

But this article isn't about storing your written shit in plastic bottles, it's about saving digital data. So let's take a look at some maxims:

Text is better than binary stuff. Text is easy to save, it's smaller, it's easy to search, it degrades gracefully, it's about as platform independent as you can get, and it's easy to translate. If a picture is worth a thousand words, write a thousand words.

ASCII (or Unicode) is better than binary formats. Save your stuff in ASCII or Unicode. Don't hole it up in some proprietary binary format. I can't count the ways an MS Word document can be destroyed while text files live on. Perhaps I'll write a whole article about the evils of Word later.

Still images are better than moving ones. We've had images on computers for a lot longer than video. The formats are pretty standard even if the legal issues are a bit hairy. JPEGs, GIFs, bitmaps, TIFFs, and even the newer PNG format are probably OK, but prefer them in the order I just gave. JPEGs are probably the best format if you don't mind some loss in quality. Bitmaps or TIFFs are probably best for lossless if you have a lot of room to spare.

Audio and video files are big. As our drives get bigger we have a better ability to store them, but right now, text and flat images are best and even images require a lot. For example, the cover of my book, Vrenna and the Red Stone, takes up more space than the entire text of my father's book, All Things Are Lights. I could do without the cover of my book but I can't live without the text of my father's book.

Text is easy to search and transform. Binary objects including audio, video, and still images are much harder to search or transform.

Store your digital life on media that can hold the entire thing. Spanning stuff across discs asks for trouble. If one piece of media goes bad, your collection is corrupt. Your odds of corruption increase with the number of media you must span. Pick a medium that can contain your entire dataset, and define a dataset that fits on the most reliable single piece of media.
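To put rough numbers on the spanning problem: if each disc independently has some chance p of going bad, a set spanned across n discs survives only if every disc survives, so the chance of losing it is 1 - (1 - p)^n. The sketch below just does that arithmetic. The function name and the 5% failure rate are made up for illustration, not measured figures.

```shell
#!/bin/sh
# loss_probability P N
# Hypothetical helper: prints the chance that at least one of N discs fails,
# assuming each disc fails independently with probability P.
loss_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}

# Example with an assumed 5% per-disc failure rate:
#   loss_probability 0.05 1   # one disc
#   loss_probability 0.05 5   # a set spanned across five discs
```

Under that assumption, five discs are more than four times as likely to cost you the collection as one disc.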

Reliable media include archival CD-ROMs, archival DVD-ROMs, flash drives, and hard drives in that order.

Store your archive on multiple pieces of media with multiple interfaces. Two are good, three are better. For example, my archive sits on CD-ROM, flash drives, and USB hard drives.

The metaphors about lost data on floppy disks and DAT tapes don't hold much water anymore. We now understand things like cross-platform formats, universal interfaces, and backwards compatibility. Blu-ray drives are still able to read CD-ROMs from 15 years ago. Still, we would do well to remember that digital preservation is a lot harder than physical preservation.

If you can, recycle your data every two to five years by copying it onto fresh media. This isn't the ideal for data preservation (500 to 5000 years should be possible), but for now, with what we have today, we have to pay constant attention to what we do.

Build data archiving into your daily system. Save your stuff to My Documents and have that backed up all over the place. Use blogging systems that let you download the entire text of your websites. Use systems that output text and HTML; don't trust binary databases to continue to work years from now. A little time up-front will save you or your descendants eons later on.

Automate archiving. Use solid scheduled scripts to perform your backup and archiving. Ensure you establish a system simple enough that other people can understand it. You're not going to be around forever.
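As one possible shape for a "solid scheduled script", here is a minimal sketch of a wrapper that runs a backup command and leaves a plain-text log line behind, so someone else can see whether last night's run worked. The function name, the log file, and the crontab path and schedule in the comments are all assumptions for illustration.

```shell
#!/bin/sh
# Example crontab entry (hypothetical path and schedule), nightly at 03:15:
#   15 3 * * * /home/you/bin/run_archive.sh

# run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appends a timestamped OK/FAILED line to LOGFILE, and
# returns the command's exit status so the scheduler can see failures.
run_logged() {
    logfile=$1
    shift
    if "$@"; then
        rc=0
        result=OK
    else
        rc=$?
        result="FAILED($rc)"
    fi
    printf '%s %s: %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$result" "$*" >> "$logfile"
    return "$rc"
}
```

A call might look like run_logged "$HOME/archive.log" tar -czf backup.tar.gz stories/ and the log stays readable in any text editor.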

Write a "Readme.txt" file on every device so people can understand what's on the media and what they should do with it. Think of it as a digital will.
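The Readme can be generated along with the backup so it never goes stale. This is only a sketch: the function name and the wording of the note are placeholders, and a real one would say who the archive belongs to and what matters most in it.

```shell
#!/bin/sh
# write_readme DIR
# Hypothetical helper: writes DIR/Readme.txt with the date and a list of
# the top-level entries in DIR (excluding the Readme itself).
write_readme() {
    dir=$1
    {
        echo "LIFEBACKUP ARCHIVE"
        echo "Written on: $(date '+%Y-%m-%d')"
        echo
        echo "Everything here is plain text, HTML, or a common image format."
        echo "Please copy the whole folder to new media every few years."
        echo
        echo "Top-level contents:"
        ls -1 "$dir" | grep -v '^Readme\.txt$' | sed 's/^/  /'
    } > "$dir/Readme.txt"
}
```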

Store your archive off-site. Today I got a safety deposit box and I stored a portable hard drive, two archival CD-ROMs, and a thumb drive of my lifebackup. Now if my new house turns into a smoking hole, I still have my digital stuff.

Being able to store all your images on Flickr is great but can you download all of them if you find out the site is going down? What if the internets break down or the series of tubes gets all clogged? Storing a copy online is wonderful, but keeping a copy on-hand is just as important. Always be in control of all of your data. Don't give it all to another person or another company to take care of for you.

Gordon Bell and the MyLifeBits guys have a good concept and they see the real issues, but capturing sleep habits, paths upon which people walk, and other random nonsense takes away from the core of the Lifebackup: don't capture shit that doesn't matter. My thoughts matter, not my height-weight distribution or how it changes.

Separate your data from the tools that can search it. Google Desktop and Picasa are great for searching the lifebackup and I don't care that they're binary and proprietary as long as they don't affect my data. Keep the data pure and wrap the tools around it.

Get used to separating your data into "stuff you want to keep" and "stuff you don't care about". Make sure the stuff you want to keep is getting backed up everywhere. Keep the stuff you don't care about elsewhere.

Organize your data into a structure that makes sense. Organize it so other people can understand it.

So how do you build your own personal Lifebackup? Here is how I built mine.

On my webserver I created a new directory called "backup". Within this directory I built a whole bunch of symbolic links to each of my website directories including mikesheanet, liquidtheater, loralciriclight, mobhunter, scripts, bobshea, notes, pictures, stories, vrenna, and databases. Each of these links points to the primary web directory for each site or a subdirectory where I keep other valuable data including scripts, pictures, notes, and databases.

Each directory contains raw HTML files built from my blogging software. Loralciriclight and Mobhunter both use Movable Type. Mikesheanet uses my own custom PHP script, and liquidtheater is now just the raw HTML files. The point is, every site has raw HTML sitting in a directory. This makes it easy to suck up all the HTML files and also speeds up the site.

Every night a cronjobbed batch script will tar up selected folders in the backup directory with the following line in a shell script:

tar -cvzf /usr/home/mshea/public_html/mikeshea/lifebackup.tar.gz pictures/ liquidtheater/*.html loral/*.html loral/everquest.css mikesheanet/*.html mikeshea.net/*.xml mikesheanet/notes.txt mikesheanet/vrenna_icon.jpg mikesheanet/vrenna_cover_med.png mikesheanet/style.css *.html stories/*.txt bobshea/*.html bobshea/*.jpg scripts/

This single line gathers all of the data I consider valuable on my website, about 20 megs' worth. All of these data go into lifebackup.tar.gz, which is stored at mikeshea.net.
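A nightly tar job can fail quietly and leave a truncated file behind, so it may be worth checking the archive right after it is written. The sketch below is one way to do that and is not part of the script described above; the function name is hypothetical.

```shell
#!/bin/sh
# verify_archive FILE
# Hypothetical check: succeeds only if FILE can be read as a
# gzip-compressed tar archive and lists at least one entry.
verify_archive() {
    listing=$(tar -tzf "$1" 2>/dev/null) || return 1
    [ -n "$listing" ]
}

# Example use after the nightly tar line:
#   verify_archive lifebackup.tar.gz || echo "lifebackup is damaged" >&2
```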

On my primary home machine, a script runs "wget" to fetch the lifebackup and then untars it with 7zip into "My Documents\lifebackup" with the following scheduled batch script:

del lifebackup.tar.gz
wget http://mikeshea.net/lifebackup.tar.gz
7za.exe x lifebackup.tar.gz -y
7za.exe x lifebackup.tar -olifebackup -y
copy lifebackup.tar.gz lifebackup
del lifebackup.tar

So that gets all of my web writings to my local machine but that's only part of the data I want to save.

Any draft documents that aren't sent to the webserver go into "My Documents". "My Documents" is the primary directory full of data I wish to save. If I want it backed up, it goes in My Documents. This way I know exactly what directory on my computer is important. If I don't want it, it doesn't go in there. Keeping the directory clean of unimportant data is half of the challenge.

All music goes in "My Music" within "My Documents". This includes CD ripped albums and audio books that I compiled from CD myself or built using Goldwave. The audiobooks took about two to four hours to build so I don't want to lose them. This takes up about 22gb of data with maybe 2gb more a year or so. This directory also contains the Robert Shea Interviews - an mp3 of my father's talks at Starwood. They're priceless to me.

All of my pictures are saved to "My Pictures". I will also upload them to Flickr to share them but the primary source is the "My Pictures" directory in "My Documents".

I use Gmail as my primary email account but I don't trust Gmail to always take care of my mail. I set up POP access with the "keep my mail on the server" setting (very important). Then I installed Thunderbird, an email client that keeps mail in "mbox" format, and once a week or so I fetch all of the mail from that week. I set up Thunderbird to save mail to "My Mail" within "My Documents". I also set it to truncate messages to about 50k to keep out the huge attachments; it's a sacrifice I have to make. My inbox contains about 9400 messages, both those I receive and those I send out. It's a total of 60 megs in text right now and will grow by about 30 megs a year. I may have to split it up into archives to keep the files manageable.
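Because mbox is plain text, splitting it into yearly archives takes very little. Here is a rough sketch, assuming each message starts with a "From " separator line whose last field is a four-digit year, which is how mbox files normally look. The function name and the output file names are made up, and it works on a copy of the mailbox, never the live file.

```shell
#!/bin/sh
# split_mbox_by_year MBOX OUTDIR
# Hypothetical helper: copies each message in MBOX into
# OUTDIR/mail-YYYY.mbox, using the year on its "From " separator line.
# Existing output files are overwritten on each run.
split_mbox_by_year() {
    mbox=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    awk -v outdir="$outdir" '
        # A "From " line ending in a year starts a new message.
        /^From / {
            year = $NF
            sub(/\r$/, "", year)      # tolerate Windows line endings
            if (year ~ /^[0-9][0-9][0-9][0-9]$/)
                out = outdir "/mail-" year ".mbox"
        }
        # Every line goes to the file of the message it belongs to.
        out != "" { print > out }
    ' "$mbox"
}
```

Body lines that happen to start with "From " but do not end in a year stay with the current message.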

I use memhole.com to save interesting web pages and I just wrote a script on that site that gives me a simple index of each page I saved. On my local machine I run the following command once a week:

wget -r -l1 -nd -N -P "My Saved Pages" http://memhole.com/saved_page_index_mshea.html

This fetches all of the pages I've saved and puts them in the "My Saved Pages" directory. Each of these is a nice simple HTML page that contains the text from the page I saved. All of this, along with all of the other text in My Documents, is indexed by Google Desktop so it's easy to search.

The rest of "My Documents" contains other documents, images, stories, e-books, and other data I have collected that I find interesting and wish to keep.

The entire "My Documents" directory is backed up, with weekly and monthly versioning, to two external hard drives every night. Once a quarter I will take the portable external drive and swap it with one I keep in a safety deposit box. Other than that and the email, the rest of it is automated with cronjobs on my webserver and scheduled batch scripts on my primary home machine.
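Weekly and monthly versioning can be approximated with nothing more than tar and date. In the sketch below, every run refreshes a "latest" archive, and the first run in each week and each month also keeps a dated copy. The function name, the file naming, and the use of date's %G and %V (ISO year and week, available in GNU and BSD date) are assumptions; this is not the backup tool described above.

```shell
#!/bin/sh
# versioned_backup SRC DEST
# Hypothetical helper: archives directory SRC into DEST/latest.tar.gz and
# keeps one extra copy per ISO week and per calendar month.
versioned_backup() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    latest=$dest/latest.tar.gz
    # -C switches to the parent directory so the archive stores relative paths.
    tar -czf "$latest" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    for label in "weekly-$(date +%G-W%V)" "monthly-$(date +%Y-%m)"; do
        # Only the first run in a given week or month creates the dated copy.
        [ -e "$dest/$label.tar.gz" ] || cp "$latest" "$dest/$label.tar.gz" || return 1
    done
}
```

Old weekly copies still have to be pruned by hand or by another script, since nothing here deletes anything.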

This whole system isn't totally foolproof but it covers most of what I can cover. In time, as bandwidth increases and we get more reliable storage above 40GB, I will improve it even further. For now, this should do.

So using the tips and techniques above, you can build your very own digital lifebackup. You are probably still better off writing your thoughts in a journal and ensuring that it gets passed along after your death. Nothing beats paper.