Thu, 02 Jan 2003
On text data formats

Jarno Virtanen presents a couple of links on text data formats. The first, a rant about the evils of XML, frequently goes off the tracks of reasonable discussion, but its main point is worth quoting:

SGML is a good idea when the markup overhead is less than 2%. Even attributes is a good idea when the textual element contents is the "real meat" of the document and attributes only aid processing, so that the printed version of a fully marked-up document has the same characters as the document sans tags. Explicit end-tags is a good idea when the distance between start- and end-tag is more than the 20-line terminal the document is typed on. Minimization is a good idea in an already sparsely tagged document, both because tags are hard to keep track of and because clusters of tags are so intrusive. Character entities is a good idea when your entire character set is EBCDIC or ASCII. Validating the input prior to processing is a good idea when processing would take minutes, if not hours, and consume costly resources, only to abend.
When the markup overhead exceeds 200%, when attributes values and element contents compete for the information, when the distance between 99% of the "tags" is /zero/, when the character set is Unicode, and when validation takes more time than processing, not to mention the sorry fact that information longevity is more /threatened/ by XML than by any other data representation in the history of computing, then SGML has gone from good kid, via bad teenager, to malfunctioning, evil adult as XML.
A brief summary, then: Remove the syntactic mess that is attributes. (You will then find that you do not need them at all.) Enclose the /element/ in matching delimiters, not the tag. These simple things makes people think differently about how they use the language. Contrary to the foolish notion that syntax is immaterial, people optimize the way they express themselves, and so express themselves differently with different syntaxes. Next, introduce macros that look exactly like elements, but that are expanded in place between the reader and the "object model". Then, remove the obnoxious character entities and escape special characters with a single character, like \, and name other entities with letters following the same character. If you need a rich set of publishing symbols, discover Unicode.
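
To make the quoted summary a little more concrete, here is a minimal sketch of a reader for a syntax in which the whole element, not the tag, sits inside matching delimiters and a backslash escapes special characters. The parentheses, the `parse` function, and the `(name, children)` tuple shape are all my own assumptions for illustration; the rant does not prescribe a concrete notation, and this sketch covers neither the macros nor the named entities it mentions.

```python
def parse(source):
    """Parse one element such as (p Hello (em world)) into (name, children)."""
    node, pos = _parse_element(source, 0)
    if source[pos:].strip():
        raise ValueError("unexpected text after the root element")
    return node


def _parse_element(s, i):
    # Hypothetical syntax: "(" name, optional content, ")".
    if i >= len(s) or s[i] != "(":
        raise ValueError(f"expected '(' at offset {i}")
    i += 1
    # The element name runs up to the first whitespace or delimiter.
    start = i
    while i < len(s) and not s[i].isspace() and s[i] not in "()":
        i += 1
    name = s[start:i]
    if not name:
        raise ValueError(f"missing element name at offset {start}")
    # Skip the single separator between the name and the content.
    if i < len(s) and s[i].isspace():
        i += 1
    children, text = [], []
    while i < len(s):
        ch = s[i]
        if ch == "\\":
            # A backslash makes the next character literal text.
            if i + 1 >= len(s):
                raise ValueError("dangling escape character")
            text.append(s[i + 1])
            i += 2
        elif ch == "(":
            # A nested element starts; flush any pending text first.
            if text:
                children.append("".join(text))
                text = []
            child, i = _parse_element(s, i)
            children.append(child)
        elif ch == ")":
            # The closing delimiter ends the element itself.
            if text:
                children.append("".join(text))
            return (name, children), i + 1
        else:
            text.append(ch)
            i += 1
    raise ValueError(f"element '{name}' is never closed")


print(parse(r"(p Hello, (em world)\)!)"))
# prints ('p', ['Hello, ', ('em', ['world']), ')!'])
```

Note that there is no separate end tag to keep in sync with the start tag, and the escaped `\)` comes through as ordinary text without any entity table.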

Jarno's second link is to "A proposal: Universal Text Data format (UTD)", which has a lot in common with the above rant.
