Creating a (Unified!) Vendor-Neutral Markup Standard

Take a look at any blog, wiki, forum, etc. Specifically, look at how posts are created, filtered, and displayed. There are dozens of different ways for authors to specify the formatting and content of their articles/posts, and hundreds of ways to render the results. Some blogs rely on now-famous 3rd-party markup implementations like Textile and Markdown, some use bbCode, and quite a few still rely on plain old HTML. Then you have vendor-specific proprietary implementations and many more, popping up as the need arises.

We’re not trying to standardize markup formats. On the contrary, there is no real benefit in doing so, and the web can always do with a bit more diversity. What does need standardization is how post-markup data (the article text) is stored in the database and later rendered. In the past, this wasn’t a problem: each “platform” had its own markup format and stored the marked-up text straight in the database. The platform then ran the markup language’s bundled HTML renderer to convert the database contents to HTML for display.

But with the current generation of blogging and posting tools, it’s become possible (via plugins and extensions) for users to install a markup language of their choice on top of the platform and use that instead of the original markup language. There are also posting engines that have built-in support for several markup languages at once. A good example is WordPress: by default it accepts non-HTML plain text [1], and at the same time allows authors to use full-fledged HTML to format their articles.

Many WordPress users have complained about their HTML entities being converted to HTML-encoded characters and extra paragraphs appearing out of nowhere, without their even installing any additional plugins [2]. But those who went a step further and installed 3rd-party WYSIWYG implementations as plugins, or used the once-bundled Markdown and Textile plugins, found out (to their severe disadvantage) that all previous posts lost their formatting. And when they uninstall the plugins…

This isn’t a WordPress-exclusive problem, however. It’s a universal issue that needs to be addressed: formatted text, once sent to the database, needs to be in a universal standardized format. It needs to be something that can afterwards be rendered as HTML or converted back to the arbitrary markup language of choice for future editing. It needs to support any and all possible formatting styles, and it needs to be, above all, fail-proof. HTML isn’t an ideal system because, while it may be easy to convert the various markup languages to HTML, HTML’s (very) loose structure makes it almost impossible to convert back safely.

Obviously storing a hundred posts here in Markdown, a hundred posts there in Textile, and a couple of hundred posts in HTML in the database isn’t the answer. Especially when it comes to the oft-discussed semantic web, these formats provide virtually no support for exporting data to other systems or re-syndicating it in a universal format.

The ideal answer would come in the form of a two-part system. The first part converts whatever end-user markup language is being employed to a standardized, strongly-typed, and well-formed machine-readable format and stores the output in the database for later retrieval. The second part provides a means to convert back from the standard markup storage format to either HTML (for rendering) or the user-selectable markup language (for editing purposes).
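To make the two-part idea a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Span` type, the `SYNTAX` delimiter tables, and the function names are hypothetical, and it handles only bold and emphasis inside a single paragraph, which is a tiny fraction of what real Markdown or Textile support. It is not a proposal for the actual storage format.

```python
import html
import json
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Span:
    kind: str  # "text", "strong", or "emphasis"
    text: str


# Hypothetical delimiter tables; real Markdown and Textile are far richer.
SYNTAX = {
    "markdown": {"strong": "**", "emphasis": "*"},
    "textile": {"strong": "*", "emphasis": "_"},
}
HTML_TAGS = {"strong": "strong", "emphasis": "em"}


def parse(source, language):
    """Part one: user markup -> neutral list of spans."""
    delims = SYNTAX[language]
    # Longer delimiters go first so "**" is tried before "*".
    ordered = sorted(delims.items(), key=lambda item: -len(item[1]))
    pattern = "|".join(
        "(?P<{k}>{d}(?P<{k}_body>.+?){d})".format(k=kind, d=re.escape(delim))
        for kind, delim in ordered
    )
    spans, pos = [], 0
    for match in re.finditer(pattern, source):
        if match.start() > pos:
            spans.append(Span("text", source[pos:match.start()]))
        for kind in delims:
            if match.group(kind) is not None:
                spans.append(Span(kind, match.group(kind + "_body")))
                break
        pos = match.end()
    if pos < len(source):
        spans.append(Span("text", source[pos:]))
    return spans


def serialize(spans):
    """What would be written to the database (JSON is an assumption)."""
    return json.dumps([asdict(span) for span in spans])


def deserialize(data):
    return [Span(**item) for item in json.loads(data)]


def render(spans, language):
    """Part two (a): neutral spans -> a markup language, for editing."""
    delims = SYNTAX[language]
    return "".join(
        s.text if s.kind == "text" else delims[s.kind] + s.text + delims[s.kind]
        for s in spans
    )


def to_html(spans):
    """Part two (b): neutral spans -> HTML, for display."""
    parts = []
    for s in spans:
        escaped = html.escape(s.text)
        if s.kind == "text":
            parts.append(escaped)
        else:
            tag = HTML_TAGS[s.kind]
            parts.append("<{0}>{1}</{0}>".format(tag, escaped))
    return "<p>" + "".join(parts) + "</p>"
```

With this in place, a post written in Markdown could be stored with `serialize(parse(text, "markdown"))` and later reopened for editing in Textile with `render(deserialize(row), "textile")`, since neither the database nor the HTML renderer needs to know which language the author originally typed in.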

We could go into detail about an automated system that employs XML (or JSON for that matter) module files to add support for the various markup languages (basically a definition file that allows for complete and automated translation between the unified markup storage system and the human-readable markup language), but that’s what we consider an “additional detail” that can’t happen before a standard is agreed upon, developed, published, and used by some.
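Purely as an illustration of what such a module file might look like, here is a small Python sketch that loads a JSON definition for bbCode and uses it to turn stored content back into markup. The JSON layout, the `load_module` and `render_with_module` names, and the `(kind, text)` representation of stored content are all hypothetical; no such format has been agreed upon.

```python
import json

# A hypothetical module file: one open/close pair per formatting kind.
BBCODE_MODULE = """
{
  "name": "bbcode",
  "spans": {
    "strong":   {"open": "[b]", "close": "[/b]"},
    "emphasis": {"open": "[i]", "close": "[/i]"}
  }
}
"""


def load_module(definition):
    """Parse a module definition and check that every rule is complete."""
    module = json.loads(definition)
    for kind, rule in module["spans"].items():
        if not rule.get("open") or not rule.get("close"):
            raise ValueError("incomplete rule for " + kind)
    return module


def render_with_module(spans, module):
    """Convert stored (kind, text) pairs to the module's markup language."""
    out = []
    for kind, text in spans:
        if kind == "text":
            out.append(text)
        else:
            rule = module["spans"][kind]
            out.append(rule["open"] + text + rule["close"])
    return "".join(out)


stored = [("text", "Hello "), ("strong", "world")]
bbcode = load_module(BBCODE_MODULE)
print(render_with_module(stored, bbcode))
```

The point of the sketch is only that adding a new markup language would mean shipping a data file rather than writing new conversion code.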

NeoSmart Technologies isn’t just about talk: we’re willing to create this standard (and even begin implementing the automated conversion system), but it all depends on whether or not this comes across as a good idea. From our point of view, it’s high time for such a unified, vendor-neutral, universal markup language. If you’re a developer of a blog, forum, wiki, CMS, or other online content platform, and you would be interested in using such a system, please leave a comment below and let us know.

  1. While technically not a markup language per se, by automagically converting certain non-HTML entries (like several adjacent spaces and carriage returns) to their HTML counterparts, it accomplishes the same basic purpose.

  2. This applies both to those who use the WYSIWYG editor and to those who don’t…
