by Mike Shea on 6 February 2006
Over the past month or so I became very interested in the concept of saving one's entire life into data. It's an abstract concept, I know, but something to which I think we are getting closer and closer.
For a few years now I've saved all of my writings on all of my websites to huge single files, often XML. Anyone can download a single file with the entire contents of my websites in a well-structured format. They can chop it up into smaller pieces, run it through a text processor, convert it into other formats, or do anything they want. I also released all of the work under a Creative Commons license so people have the legal freedom as well as the technical.
More recently I wrote a Python script that takes all of these large XML files and combines them into one huge XML file called the "lifebackup". With my recent focus on Atom 1.0 as a content syndication format, I structured this file as a fully valid Atom 1.0 feed. All of my blog entries, movie reviews, home theater articles, Everquest fan fiction, and Everquest editorials were stored in a single four-megabyte file. Another Python script can chop it back up into individual entries in about fifty lines of code.
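As a rough illustration of the round trip described above, here is a minimal sketch using only Python's standard library. It is not the actual lifebackup script: the function names, the dictionary layout for entries, and the choice of plain-text content are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)  # serialize Atom as the default namespace

ENTRY_FIELDS = ("title", "id", "updated", "content")


def q(name):
    """Return the namespace-qualified tag name ElementTree expects."""
    return "{%s}%s" % (ATOM_NS, name)


def build_feed(title, feed_id, updated, author, entries):
    """Combine a list of entry dicts into one Atom 1.0 feed string.

    Each entry dict is assumed (hypothetically) to carry the keys
    'title', 'id', 'updated', and 'content'.
    """
    feed = ET.Element(q("feed"))
    ET.SubElement(feed, q("title")).text = title
    ET.SubElement(feed, q("id")).text = feed_id
    ET.SubElement(feed, q("updated")).text = updated
    author_el = ET.SubElement(feed, q("author"))
    ET.SubElement(author_el, q("name")).text = author
    for item in entries:
        entry = ET.SubElement(feed, q("entry"))
        ET.SubElement(entry, q("title")).text = item["title"]
        ET.SubElement(entry, q("id")).text = item["id"]
        ET.SubElement(entry, q("updated")).text = item["updated"]
        content = ET.SubElement(entry, q("content"))
        content.set("type", "text")
        content.text = item["content"]
    return ET.tostring(feed, encoding="unicode")


def split_feed(feed_xml):
    """Chop a feed string back up into a list of entry dicts."""
    root = ET.fromstring(feed_xml)
    return [
        {field: entry.findtext(q(field)) for field in ENTRY_FIELDS}
        for entry in root.findall(q("entry"))
    ]
```

Calling split_feed on the output of build_feed gives back the same list of entry dicts, which is the property that makes one big file as useful as many small ones.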
I joked about how this single file was a UTF-8 XML version of my soul. I joked that this single file might be used as an intellectual DNA strand to recreate my mind as a single hair might recreate my body. I joked about all of that, but the gears kept turning in the back of my mind.
I stumbled upon an article about a Microsoft research program called "My Life Bits" where a strange fellow and a couple of technical boys (a term I'm stealing from Stephenson's Diamond Age) got together and built a system that captures this guy's voice, pictures, email, visited websites, writings, chat logs, and even scanned hardcopies into a huge database. This database can be indexed or searched to find any part of his life. The strange fellow even wore a little camera around his neck that recorded voice, took pictures, and monitored the temperature.
That seemed excessive to me, but it did keep those wheels turning in my head. What if we had a system that could capture text, images, notes, and web pages? What if we could save all these data to an individual database - a lifebackup? What if we could let others subscribe to this person's feed so they can see, up to the hour, what that person is reading or writing?
A few years ago I worked with a fellow on the concept of "tacit knowledge", a fancy schmancy term for all of the knowledge we have that we don't write down in clean articles. It's a big problem in business and the government where the intellectual corporate or government property is stored in the fatty gray matter in the skulls of millions of workers. When that fatty gray matter stops firing neurons or begins firing for another company, all of that information is lost. Businesses and governments want to capture that information and use it to help cultivate the thoughts of everyone else. Some companies index email to try to glean information out of it but the concept got bogged down with privacy problems. Yet the original problem remained. How do we capture someone else's thought, assuming, of course, that they want it captured at all?
The result was the Memory Hole: a website that captures notes, web pages, and pictures. It stores these in a database and outputs them in small fifteen item Atom 1.0 feeds so others can subscribe or larger self-contained Atom files that include all of the data saved from those sites. Update from 26 November 2013: this project is now no longer running. Go use Evernote instead.
The Memory Hole has a few problems. First, while it saves binary data, there is no good way to package the binary data up in the single XML file. XML isn't very good at holding binary data. Second, saved web pages are usually broken due to relative links and other style elements. Third, these saved pages are very large with a low signal-to-noise ratio. 80k of saved web page HTML gives us about 10k of actual usable text. And then we get to tagging.
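The signal-to-noise problem can be made concrete with a small sketch. This is not how the Memory Hole works; it is a hypothetical helper built on Python's standard html.parser that pulls the readable text out of a saved page and reports what fraction of the page it represents. The names TextExtractor, extract_text, and signal_ratio are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the readable text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def signal_ratio(html):
    """Fraction of the page (by character count) that is readable text."""
    if not html:
        return 0.0
    return len(extract_text(html)) / len(html)
```

Storing only the extracted text would shrink the archive considerably, at the cost of losing the page's layout and images.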
I recently read Vannevar Bush's article "As We May Think" and was particularly interested in his concept of linking like objects. It sounded a lot like "tagging", a concept becoming more and more popular with the web 2.0 boys and articulated in applications like Digg, Flickr, Amazon.com, and others. I don't think tagging has a lot of merit for text-based objects. Why does Amazon.com have to tag a book when it has about 100,000 tags already embedded in the book itself? Search engines get smarter every day. Why do we have to bother tagging anything that already has lots of big nouns? Tagging makes a lot of sense for Flickr, where the primary object is an image. Properly searching images simply can't be done without appropriate metadata. Still, the inherent lack of standards found with the ad-hoc tagging of Flickr shows the faults. Flickr recently implemented a tag cloud that brings up tags related to yours to help make up for the chaotic tags often applied to things.
There are two fundamental problems with user-assigned metadata or "tags":
If a strict, highly articulated vocabulary is used to tag data you will get the best results on searches or other retrieval methods. However, you will have the most user error since users suck at writing in the right word for the right object and you will have the most expensive, long, bureaucratic process for coming up with your tags. Consider Cory Doctorow's "Metacrap" article for examples of the absurdity of strict vocabularies.
If you use ad-hoc tagging, you have to expect that users will type in "asdfasdf" just to get past the form. One person will tag something "leather purse" while another tags it "hot momma". People don't think exactly alike, and when they are inputting data, they are very lazy, so they are likely to put in words that make little sense to others and in some cases even to themselves. When you consider that the Flickr bulk-upload assigns the same tags to all uploaded images even though the images often deserve individual tags, you begin to see the problem. Ad-hoc tagging leads to chaos.
Yet still I return to Vannevar Bush's article on "Memex", the surrogate brain that records and stores images, moving pictures, books, documents, articles, letters, handwritten notes, and other memoranda. Bush's Memex was a desk-sized device that recorded things to microfilm, but the concept works better today.
Then we hit this:
"All this is conventional, except for the projection forward of present-day mechanisms and gadgetry. It affords an immediate step, however, to associative indexing, the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing."
Bush goes on to discuss how two items can be linked together to build entire paths of thought within the memex. Well, links are easy, right? We have the hyperlink and we have tags. So just tag the stuff.
Then we go back to the problems before. How do we ensure that tags remain consistent? Del.icio.us seems to have a good tagging structure, but I'm a reasonably smart, web-savvy guy and I still screwed up a lot of my tags. I have "everquest" and "Everquest". I have "writing" and "Writing blogs". Is one a subset of the other? I don't know. Should "Everquest" be part of "gaming"? Is "gaming" pen-and-paper gaming or video gaming or both? Do I want to tag at very high levels like "stuff" or low levels like "pilot vanishing point fine point fountain pen"?
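The easy half of this mess, tags that differ only by case or spacing, can at least be detected mechanically. The sketch below is a hypothetical helper, not part of Del.icio.us or the Memory Hole; the names normalize_tag and find_variants are invented for the example. It does nothing for the harder questions of whether one tag is a subset of another.

```python
from collections import defaultdict


def normalize_tag(tag):
    """Fold case and collapse whitespace so trivial variants compare equal."""
    return " ".join(tag.casefold().split())


def find_variants(tags):
    """Group tags that normalize to the same form.

    Returns a dict mapping each normalized form to the sorted list of
    distinct spellings, keeping only forms with more than one spelling.
    """
    groups = defaultdict(set)
    for tag in tags:
        groups[normalize_tag(tag)].add(tag)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }
```

Run over the tags mentioned above, this would flag "everquest" and "Everquest" as the same tag but would leave "writing" and "Writing blogs" alone, since a program cannot tell from spelling whether those should be merged.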
What if we let a computer do our "tagging" (always wrap things in quotes to keep your feet out of the horseshit)? What if we just index the whole set of text data and search for things we're interested in? Surely Vannevar Bush didn't think we'd have full-text searches like we do now, right?
The Memory Hole fully indexes every item entered using MySQL's built-in full-text index. It's a pretty lame engine when we come right down to it. It works well with single words, but phrase matching and stemming don't exist as far as I can tell. A real search engine would do a much better job and there are a lot of them out there for cheap. I played with Autonomy's "Ultraseek" and I had a lot of luck with it. It costs real money, however.
I don't know if a full-text search solves the problem of relating two or more items together into a set. I don't know if "tagging" (watch out for the horseshit) is the answer either. It seems to have a lot of smart people really excited but so did the Semantic Web. I'm still waiting for my toaster to notify my refrigerator to get an egg ready like Tim Berners-Lee promised.
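For readers curious what a full-text index does at its simplest, here is a toy inverted index in Python. It is a sketch of the general technique only and says nothing about how MySQL or Ultraseek work internally; the function names and the document-id-to-text dictionary are assumptions for the example. Like the engine described above, it matches single words and has no stemming or phrase matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it.

    `documents` is assumed to be a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```

A query for several words returns only the items that contain all of them, which is a crude way of relating items to one another without anyone typing in a tag.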
So where do we end up? I've been using the Memory Hole to save web pages and notes. I rigged my Lifebackup Python script so my Memory Hole archive is saved in my Lifebackup Atom file. With only about a month's worth of saved pages, my Memory Hole archive is already over a megabyte of text - that's a lot of text. I fear what it will become in a few years.
I would like to take the concept of the Lifebackup further. I would like to find a better way to store digital media. I would like to find a better medium that can store tons and tons of data without losing the organization or lifespan. I wrote before about the lack of longevity with digital media these days and it still holds true. If I wrote my life down in a book it might last 2000 years but if I store it on a CD or a hard drive or a USB memory stick it only lasts ten. I need to make sure that whatever format or storage it is in is future-proof, complete, and simple enough that someone can understand it and manipulate it well after I am gone.
Are we at the point where someone can store all of their intellectual thought onto a hard drive? Not quite yet, but we're getting closer. In the meantime, the concept is sure a lot of fun to think about.