More on OpenDocument, RTF, and File Format Longevity

by Mike Shea on 20 December 2005

A note from a reader:

"One last question, this is about the different formats you use for your stories. What are your feelings about the RTF file? It's long been my choice for saving documents with special formatting for years, since it can (more or less) be opened with any word processing software, regardless or program age or OS. I love Open Office, but I'm not sure how long the ability to open its documents will last. Of course, I like to keep to hard copies of all my important files, but it's nice to have a computer back up too, in case the hard copy gets wet."

My reply:

Actually, RTF is the format I use when I need highly formatted documents that Word users can read. This whole topic raises a number of questions, and none of them has a clear answer.

The format I trust the most for a moderately well marked up document is HTML or XHTML. It's been around for fifteen years, it's ubiquitous, it's standards-compliant and not copyrighted by anyone, there are hundreds of editors for it, and it is easily written by hand. If I had to pick the format I thought would last the longest, it would be XHTML.

However, XHTML isn't good for documents meant to be printed on paper. For those documents, the choice becomes less clear. You can use .DOC, except not everyone has Word and the format is owned by Microsoft. In two hundred years, who knows what the legal status of that format will be or whether there will be a reader at all. Microsoft's new XML format is copyrighted, and even though they just announced that the format will be a "standard" approved by an outside group, Ecma International, it is still the property of Microsoft. They alone hold the rights to decide who can use this format and what they can use it for. It is not an open standard even though it uses the Orwellian name "Office Open XML".

One could still recover data from this format easily enough, just as easily as from an OpenDocument file, but the legal issues become the concern. There will possibly be fewer tools available for a copyrighted, patented format than for an open standard. Larger organizations will be much less likely to attempt the recovery of these data because doing so may violate Microsoft's license agreements. This is why open formats are ideal.

There is another problem, one faced by both OpenDocument and Microsoft's XML format: both are compressed. If a compressed file is damaged, it is much more difficult to recover data from it. If you ever want to keep a file for a long time, either in a backup or in an archive, you want to keep the document in a format that is uncompressed as well as open. Again, I go back to XHTML as the ideal. It is uncompressed and easily migrated to other formats. For example, I wrote a 20 line Python script that generates fully compliant RSS 2.0 and Atom 1.0 XML files from the XHTML version of Vrenna and the Red Stone.
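To give a feel for how easily XHTML migrates, here is a rough sketch of that kind of conversion. This is not the script mentioned above; it is a minimal illustration that only produces RSS 2.0. The function name `xhtml_to_rss` is made up for this example, and it assumes (hypothetically) that each `<h2>` directly inside `<body>` starts a new feed item and that the `<p>` elements after it form the item's description. It also assumes the XHTML is well-formed and uses no named entities such as `&nbsp;`, which Python's standard XML parser will not resolve on its own.

```python
import xml.etree.ElementTree as ET

# Namespace that well-formed XHTML documents declare on <html>.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_rss(xhtml_text, link):
    """Build an RSS 2.0 feed string from an XHTML document.

    Assumption for this sketch: every <h2> directly under <body> starts a
    new item, and the <p> elements that follow it are its description.
    """
    root = ET.fromstring(xhtml_text)
    title = root.findtext(f"{XHTML_NS}head/{XHTML_NS}title", default="Untitled")
    body = root.find(f"{XHTML_NS}body")

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title

    description = None
    for element in body:
        if element.tag == XHTML_NS + "h2":
            # A heading opens a new item.
            item = ET.SubElement(channel, "item")
            ET.SubElement(item, "title").text = "".join(element.itertext()).strip()
            description = ET.SubElement(item, "description")
            description.text = ""
        elif element.tag == XHTML_NS + "p" and description is not None:
            # Paragraphs are appended to the current item, blank line between.
            text = "".join(element.itertext()).strip()
            description.text = (description.text + "\n\n" + text).strip()

    return ET.tostring(rss, encoding="unicode")
```

Because the source is plain, well-formed markup, the whole conversion needs nothing beyond the standard library, which is the point: a format like this can be moved somewhere else with very little effort.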

Back to your original question: RTF. RTF is actually an excellent choice. It is uncompressed ASCII, human readable, easily migrated, and ubiquitous. However, I am not sure about the licensing issues. I don't know if the format is owned by any one company. If it is, it is most likely Microsoft that keeps control over the format and its revisions.
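A small experiment shows why "uncompressed" matters so much for recovery. The sketch below damages a single byte in a plain ASCII copy of some text and in a compressed copy of the same text. It uses Python's `zlib` module as a stand-in for the compression inside OpenDocument and Office Open XML packages, which is an assumption made for illustration; the helper names `flip_byte` and `damage_experiment` are made up for this example.

```python
import zlib


def flip_byte(data, index):
    """Return a copy of data with one byte inverted, simulating damage."""
    damaged = bytearray(data)
    damaged[index] ^= 0xFF
    return bytes(damaged)


def damage_experiment():
    """Damage one byte in a plain copy and in a compressed copy of a text.

    Returns (bytes_lost_in_plain_copy, compressed_copy_recovered).
    """
    text = ("RTF and XHTML keep their words as plain characters. " * 20).encode("ascii")

    # Plain copy: count how many bytes differ from the original afterwards.
    plain_damaged = flip_byte(text, 100)
    bytes_lost = sum(a != b for a, b in zip(text, plain_damaged))

    # Compressed copy: try to get the original text back after the damage.
    packed_damaged = flip_byte(zlib.compress(text), 10)
    try:
        recovered = zlib.decompress(packed_damaged) == text
    except zlib.error:
        # The decompressor rejects the damaged stream outright.
        recovered = False

    return bytes_lost, recovered
```

In the plain copy the damage stays confined to the one byte that was hit, so a human can read around it. In the compressed copy the same small damage generally leaves the decompressor unable to return the text at all without specialized repair tools.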

The following document had some nasty things to say about RTF:

http://www.cv.nrao.edu/~pmurphy/doc-interchange.shtml#docrtf

However, you have to take such things with a grain of salt.

This whole topic is a strange, confusing, but interesting one. This is the first point in history where such things have become issues. The only thing that comes close is the translation of the Latin Bible into German by Martin Luther and the printing of the Gutenberg Bible on the first printing press. That led to a war.

Other than things like that, we have never had a period where the formats in which our words are kept are owned by companies. It isn't as if anyone ever owned a language before. Imagine if Microsoft owned the English word on paper and demanded that no one read it without talking to them.

Yes, it is a stretch to say that everything we've written in DOC or any proprietary, commercially controlled format is lost, but the truth is we really don't know what will happen 200 years down the line.

So, the shorter answer to your question: RTF is a fine and simple format for document storage, but save a TXT and an XHTML version as well, just to be safe.