Proposal: Encoded inline text formatting

Introduction

This proposal does not discuss general formatting requirements, which we believe in the context of exchange of biological data to be most adequately dealt with by software creating reports or user interfaces.

However, it is believed that some text content in databases or xml elements requires minimal text formatting to preserve the correct semantics, or to allow the degree of expressiveness considered necessary by content authors. An example of required semantics is the use of superscript and subscript markup: "G2", "G₂", and "G²" may be different symbols referring to different concepts. An example of a social requirement is the tradition in biology to write scientific taxon names in italics. Biologists usually find a system that does not support this inadequate to their needs.

Existing solutions

Email and Wikis

In "email-style", simple superscript expressions may be found as "mm^2". No equivalent for subscript or italics is similarly widespread.

Studying existing wiki software (which faces similar problems), members of the TDWG-SDD found that either (x)html is used, or a large and inconsistent variety of text-based markup is employed.

Usage in the DELTA standard

The experience with DELTA shows that text formatting is a significant issue. DELTA underwent changes from "[Itext]" to "[I]text['I]" to RTF typesetting marks. The DELTA User guide (Edition 4.12, 2000) explicitly accepts codes for: italics, bold, superscript, subscript, font size, default font appearance, En dash, Left/Right quote, new paragraph, default paragraph attributes, space before paragraph, space after paragraph, line indentation, first line indentation. Of these, the en dash and quotes are already covered by Unicode. Italics, bold, superscript, and subscript are proposed to be supported through a special convention detailed below. The paragraph-level formatting is considered problematic, since it may conflict with report generation styles, and may in fact cause invalid html to be created. As a compromise, support of the break tag (<br />) is proposed. Font size or style in general is not considered desirable. However, support of the <small> and <big> tags may be introduced if requested.

UBIF proposals up to version 1.0 beta 18

UBIF/SDD tried to use xhtml (rather than inventing proprietary codes). Versions of UBIF prior to 1.0 beta 18 used mixed content markup, closely modeled on a minimized set of xhtml inline-formatting elements. Use of mixed content may also be found in ATOM 1.0 (for text having the attribute type="xhtml").

However, this use of mixed content was received critically and was considered to pose a significant burden on implementations. For example, a product like Altova Mapforce (allowing graphical xslt creation to map to and from databases) cannot handle elements with mixed content. Furthermore, the element validation that is implicit in using xhtml-style markup to format label or definitional text creates an impedance problem between database and xml: the database most likely uses ANSI or Unicode to encode &, <, >, rather than natively storing character entities (&amp;, &lt;, &gt;). A converter would thus have to distinguish between passing these characters through unencoded when they form part of one of the few recognized markup tags, and encoding them otherwise.

Example: the database content "A<sub>1</sub> > A<sub>2</sub>" would have to be recoded into "A<sub>1</sub> &gt; A<sub>2</sub>" when creating SDD xml content. This is especially problematic, since some validation may have to occur in the conversion process to avoid ill-formed xml or non-valid SDD. For example, unbalanced markup like "A<sub>1" should be converted to "A&lt;sub&gt;1".

Starting with UBIF 1.0 beta 18, it is proposed to change this to a new concept based on encoded markup, avoiding mixed content.

Proposal

Support for a limited list of formatting symbols based on xhtml is recommended (see table below). String content in UBIF xml elements should not be mixed content. To avoid this, all occurrences of "<", ">", or "&" should be encoded (i. e. to "&lt;", "&gt;", or "&amp;"). Processors may then choose to recognize and recover the limited set of encoded formatting symbols. Thus a text that is literally "H&lt;sub&gt;2&lt;/sub&gt;O" in the UBIF document may be treated by report-generating software as the xhtml mixed content markup "H<sub>2</sub>O". When using xhtml as the reporting format, the UBIF-encoded formatting may simply be recovered and displayed as "H₂O". The recovery process should be limited to expressions producing well-formed xml (e. g. "H&lt;sub&gt;2O" should be left unchanged).
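The encode-and-recover cycle described above might be sketched in Python as follows. This is an illustration only, not the proof-of-concept xslt mentioned in this proposal. The function names encode_text and recover_encoded_formatting are hypothetical, and the sketch assumes that the input to the recovery step is the string as delivered by an XML parser (i. e. with character entities already decoded). It also makes an all-or-nothing choice: if the recovered text would not be well-formed, the whole string is left encoded.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Paired tags from the table in this proposal; br is handled separately.
_PAIRED = re.compile(r"&lt;(/?)(strong|em|b|i|sub|sup)&gt;")
_BREAK = re.compile(r"&lt;br\s*/?&gt;")


def encode_text(text: str) -> str:
    """Encode &, <, > so the string can be stored as plain element content."""
    return escape(text)


def recover_encoded_formatting(text: str) -> str:
    """Return an xhtml fragment in which only recognized tags are markup.

    Assumption: `text` is the parsed element content, e.g. "H<sub>2</sub>O".
    """
    encoded = escape(text)
    candidate = _PAIRED.sub(r"<\1\2>", encoded)
    candidate = _BREAK.sub("<br/>", candidate)
    try:
        # Wrap in a dummy element to test that the fragment is well-formed.
        ET.fromstring("<span>" + candidate + "</span>")
    except ET.ParseError:
        return encoded  # unbalanced markup: leave everything encoded
    return candidate
```

With these definitions, recover_encoded_formatting("H<sub>2</sub>O") yields markup, while the unbalanced "H<sub>2O" is returned fully encoded. A processor could instead recover only the balanced parts of a string; the proposal does not prescribe either behaviour.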

The proposed encoded formatting is very similar to the formatting used in ATOM 1.0 (for text having the attribute type="html").

The content of this recommendation is not validated by UBIF schemata. It is intended to be an agreement between content authors and processors (e. g. a routine creating SDD natural language documents). Processors need not implement this if it is not relevant for their purposes. However, they may wish to realize that the string "xyz" and the same string carrying encoded formatting symbols are more similar than a plain text comparison may indicate. An example xslt script to strip encoded formatting symbols is provided for such purposes.
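For processors that only need such a comparison, stripping could look roughly like the following Python sketch. It is not the xslt script referred to above. The name strip_encoded_formatting is hypothetical, the input is assumed to be the parsed element content (entities already decoded), and, as a simplifying assumption, recognized tags are removed even when they are unbalanced.

```python
import re

# Recognized tags from this proposal, including the three br forms.
_FORMATTING = re.compile(r"</?(?:strong|em|b|i|sub|sup)>|<br\s*/?>")


def strip_encoded_formatting(text: str) -> str:
    """Remove recognized formatting tags; all other text is kept as is."""
    return _FORMATTING.sub("", text)


def same_plain_text(a: str, b: str) -> bool:
    """Compare two strings while ignoring recognized formatting tags."""
    return strip_encoded_formatting(a) == strip_encoded_formatting(b)
```

Tags outside the recognized list, and literal characters such as ">" in "A1 > A2", pass through untouched.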

The following xhtml tags are proposed to be recognized:

	Examples:
Tag name	Content editor or database view	XML encoded string in UBIF document	Browser view after RecoverEncodedFormatting	Plain text after StripEncodedFormatting
strong	<strong>Strongly emphasized</strong>	&lt;strong&gt;Strongly emphasized&lt;/strong&gt;	Strongly emphasized	Strongly emphasized
em	This is <em>emphasized</em>.	This is &lt;em&gt;emphasized&lt;/em&gt;.	This is emphasized.	This is emphasized.
b	Using <b>bold</b> text.	Using &lt;b&gt;bold&lt;/b&gt; text.	Using bold text.	Using bold text.
i	This is <i>italics</i>.	This is &lt;i&gt;italics&lt;/i&gt;.	This is italics.	This is italics.
sub	H<sub>2</sub>O needs subscript.	H&lt;sub&gt;2&lt;/sub&gt;O needs subscript.	H₂O needs subscript.	H2O needs subscript.
sup	cm<sup>3</sup> needs superscript	cm&lt;sup&gt;3&lt;/sup&gt; needs superscript	cm³ needs superscript	cm3 needs superscript
br	line break (3 forms):<br> (1),<br/> (2),<br /> (3)	line break (3 forms):&lt;br&gt; (1),&lt;br/&gt; (2),&lt;br /&gt; (3)	line break (3 forms): (1), (2), (3)	line break (3 forms): (1), (2), (3)

Strong and emphasis are usually rendered bold and italics, respectively. They are logical markup which leaves the exact rendering to the processor. Thus, emphasized words should be marked with em, whereas explicitly required italics (e. g. for taxonomic names) should use i.

The following xhtml tags are not yet proposed, but discussion is encouraged.

small - <small>smaller font size</small> = smaller font size.
big - <big>larger font size</big> = larger font size.
ins - inserted text revision marker, example: <ins datetime="yyyy-MM-ddThh:mm:ss+hh:mm">...</ins>
del - deleted text revision marker, example: <del datetime="yyyy-MM-ddThh:mm:ss+hh:mm">...</del>

Support for the u/underline tag was already rejected in a previous SDD discussion.

Applicable elements

In UBIF/SDD, any element named "Text", "Details", "Definition", "Abbreviation", or "InternationalAbbreviation" should be treated, when creating formatted reports, as potentially containing encoded formatting.

Specific to SDD natural language generation are the following additional element names: "TextBefore", "TextAfter", "SingleDelimiterText", "RepeatedDelimiterText", and "LastDelimiterText".
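As an illustration of how a processor might select these elements, the following Python sketch walks a document and returns the plain text of every element whose name is in the lists above. The function names and the sample document (including its root element and the "Note" element) are hypothetical and not taken from the UBIF or SDD schemata. Namespaces are ignored by comparing local names only, which is an assumption made for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Element names listed in this proposal as potentially carrying encoded formatting.
FORMATTED_ELEMENTS = {
    "Text", "Details", "Definition", "Abbreviation", "InternationalAbbreviation",
    "TextBefore", "TextAfter", "SingleDelimiterText", "RepeatedDelimiterText",
    "LastDelimiterText",
}

_FORMATTING = re.compile(r"</?(?:strong|em|b|i|sub|sup)>|<br\s*/?>")


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix as produced by ElementTree, if present."""
    return tag.rsplit("}", 1)[-1]


def plain_text_of_formatted_elements(xml_source: str) -> list[tuple[str, str]]:
    """Return (element name, text without formatting tags) in document order."""
    result = []
    for element in ET.fromstring(xml_source).iter():
        name = local_name(element.tag)
        if name in FORMATTED_ELEMENTS and element.text:
            result.append((name, _FORMATTING.sub("", element.text)))
    return result


# Hypothetical sample; "Note" is not in the list and is therefore skipped.
SAMPLE = """<Sample>
  <Text>H&lt;sub&gt;2&lt;/sub&gt;O</Text>
  <Note>&lt;i&gt;untouched&lt;/i&gt;</Note>
  <Details>&lt;i&gt;Quercus&lt;/i&gt; sp.</Details>
</Sample>"""

if __name__ == "__main__":
    for name, text in plain_text_of_formatted_elements(SAMPLE):
        print(name, "->", text)
```

Keeping the element names in a single set makes it easy to adjust the list if further elements are agreed upon.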

Example code

Short sample xslt scripts and data have been created as proof of concept. These are listed in the resource directory corresponding to this proposal.

Gregor Hagedorn 18.8.2004, revised 2.6.2006