Unicode, Rich Text, and Mathematics

Murray Sargent III

Microsoft Corporation

30-Aug-95

In this paper, we discuss ways in which Unicode can be used together with rich-text formatting to represent documents ranging from simple enhancements of plain text to sophisticated technical documents. Things of interest include the interplay of document structure and rich-text formatting, exchange versus editing formats, and the tradeoffs between embedded codes and parallel text runs. As our rich text exchange format, we use RTF, and discuss its advantages and disadvantages and ways of generalizing it to handle Unicode. We close with discussions on plain and rich-text encodings of mathematical expressions.

1. Unicode and Rich Text

2. Plain Text Plus Some Embedded Codes

3. Rich Text Exchange Format

4. Formatting Versus Content

5. Making Editing Easier: Parallel Text Runs

6. Glyph Variants and Unicode

7. Round-tripping DBCS through Unicode

8. Unicode Plain-Text Encoding of Mathematics

9. Recognizing Mathematical Expressions

10. Operator Summary

11. Export to programming languages and TEX

12. Conclusions

1 Unicode and Rich Text

Unicode is a great step forward in the computer representation of the world’s text, but it was never intended to represent all aspects of text. In addition to Unicode, we need to be able to handle:

1) Language attribute: sort order, word breaks

2) Rich (fancy) text formatting: bold, italic, underline, font changes, sub/superscripts

3) Content tags: SGML, HTML: headings, abstract, author, figure

4) Glyph variants: the Poetica font has 58 different ampersands, the Mantinia font has many novel ligatures (TT, TE, etc.)

2 Plain Text Plus Some Embedded Codes

You can do more than you might think with plain Unicode text. Unicode defines semantics for its characters above and beyond simple characteristics like upper and lower case. For example, the code U+2029 is defined to be the paragraph separator character. This differs from ASCII’s CR (U+000d) and LF (U+000a), which are teletype codes to return the carriage to the left margin and to feed a line, respectively. In the PC world, the combination CRLF has come to mean a paragraph mark, but this usage has never been formally defined and certainly isn’t universally accepted. (Furthermore, it’s a simple example of DBCS, that is, two bytes are used to represent a single character, while other characters are represented by one byte.) Many other interesting punctuation characters are defined in the range U+2000 to U+204f. Microsoft’s Win32 collects much of this character-type information into the CT_CTYPEx tables, which have bits that identify upper and lower case, decimal and hexadecimal digits, left-to-right and right-to-left, punctuation, and much more (see the Win32 GetStringTypeW() function for a discussion).
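
For instance, the following minimal C sketch (the choice of test characters is mine, purely for illustration) queries these character-type bits through GetStringTypeW():

#include <windows.h>
#include <stdio.h>

/* Minimal sketch: ask Win32 for the CT_CTYPE1 classification bits of a few
   Unicode characters: a letter, a digit, punctuation, and U+2029. */
int main(void)
{
    WCHAR text[] = { L'A', L'9', L'.', 0x2029 };
    WORD types[4];

    if (GetStringTypeW(CT_CTYPE1, text, 4, types)) {
        for (int i = 0; i < 4; i++)
            printf("U+%04X: upper=%d digit=%d punct=%d space=%d\n",
                   (unsigned)text[i],
                   (types[i] & C1_UPPER) != 0, (types[i] & C1_DIGIT) != 0,
                   (types[i] & C1_PUNCT) != 0, (types[i] & C1_SPACE) != 0);
    }
    return 0;
}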

There is a gray zone between rich and plain text: embedded codes. In fact, general rich text can be represented using plain text with embedded fields, as illustrated by Hewlett-Packard’s PCL5 print format. The problem with this approach is that it’s hard to edit, since cursor movement involves skipping over embedded fields, and the text can confuse various text scanning programs, such as spelling and grammar checkers. Unicode defines a BiDi (bidirectional) algorithm for mixing left-to-right and right-to-left text that does use a few embedded codes, such as U+200e (left-to-right mark) and U+200f (right-to-left mark). In the last part of this paper, we discuss the addition of a few characters and a mathematics character type that let most mathematical expressions be represented using plain text with a couple of embedded symbols.

3 Rich Text Exchange Format

The binary formats of most word processors can express sophisticated rich text. Such formats may be efficient for editing and display purposes, but they aren’t good exchange formats because they are proprietary, poorly documented or undocumented, and not easily extensible. Some printer control languages, such as Hewlett-Packard’s PCL5, can also represent rich text, but they are clearly hard to edit and are unfriendly to spelling checkers. Unicode plain text qualifies as a well-documented exchange format, but explicitly excludes rich-text enhancements. The rich-text exchange format closest to my work is Microsoft Word’s RTF (rich-text format). RTF was explicitly developed to allow people to transfer formatted text to and from Word, but it is used more generally as a kind of “default” rich text format. It has good points and bad points. We look at them after defining it briefly and giving some simple examples.

Traditional RTF consists of appropriate sequences of ASCII characters in the range U+0020 through U+007e. To be recognized as RTF text, the text must start with {\rtf (here we use a sans-serif font to represent program code or RTF examples), and close with a matching }. Here \rtf is an example of a control word, which more generally is a backslash (\) followed by one or more lower-case ASCII letters, followed by zero or more decimal digits. The matched brace pair {} defines a group, here comprising the whole RTF text, but groups can appear nested inside to delimit the scope of things. Does this syntax remind you a bit of TEX? So the RTF text

{\rtf This is some plain text.}

is the way you’d represent the plain text “This is some plain text.” in RTF. Since no formatting is specified, an RTF reader would use the formatting active wherever it is supposed to insert the text.

To do something more complicated, you use additional control words such as in

{\rtf This is some \b bold\b0 text.}

which gives “This is some bold text” (without the quotes, of course). Notice here that a blank follows the control word \b. This blank terminates the control word and is not displayed. You don’t have to put in this terminator blank if the character that directly follows the control word cannot be part of the control word, e.g., something other than an alphabetic, numeric, or blank character. Alternatively, you could use a group to delimit the boldface as in

{\rtf This is some {\b bold} text.}

This takes a bit longer to process since the state active at the start of the group has to be restored when the group terminates. But it has the advantage that you know the scope of the \b control word, even if you didn’t recognize that control word (more about this later).

Needless to say, there are a lot more control words, many having to do with character and paragraph formatting, and many more with document properties and relevant word-processing facilities. If you’re interested in a complete list, see the Microsoft Developer Network (MSDN) CD-ROMs, which document RTF along with a myriad of other things. You can also generate lots of examples by saving Word files in RTF and looking at the resulting text.

The question immediately arises as to what a reader should do when it doesn’t recognize a control word. The RTF specification says that in general, an RTF reader should just ignore control words it doesn’t recognize, but it should keep any plain text that follows. There is a category of special control words that deviates from this rule, namely that for alternate text destinations. The main text destination is the text proper, but many other destinations exist, such as page headers and footers, footnotes, style definitions, font-handle definitions, and color-table entries. All alternate destinations are defined within group braces. A bunch of these destinations were defined in the original RTF spec and RTF readers should either handle these or know to skip the text that follows the destination control word within the enclosing group braces. Destination control words defined after the original spec must be preceded by \*, which explicitly identifies such control words as destination control words. So readers know that if they don’t recognize the control word \annotation in the group

{\*\annotation this text could certainly be clearer...}

they can skip to the end of the group without inserting the text “this text could certainly be clearer...” Hopefully you get the idea. I have to admit it took me a while to appreciate it.

The crucial point is that this mechanism for handling unrecognized control words allows RTF to be used as an extensible rich-text exchange format. If the control words you need have been defined, you use them, but if you need other control words, you can add them without preventing RTF readers “out there” from reading your RTF text. It’s true that those readers won’t know about your neat new features (unless they’ve been programmed to handle them), but at least you can read the lion’s share of another RTF writer’s rich text and you can read and write all of your own.
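
To make these rules concrete, here is a minimal C++ sketch of a reader’s scanning loop. It is nowhere near a full RTF parser (no \'hh escapes, no table of known control words, and the function name is hypothetical), but it shows the group tracking and \* skipping just described:

#include <cctype>
#include <cstdio>
#include <string>

// Minimal sketch of an RTF scanning loop; not a full parser.
// Echoes the plain text, ignores control words, skips \* destinations.
void ScanRtf(const std::string& rtf)
{
    for (size_t i = 0; i < rtf.size(); ) {
        char ch = rtf[i];
        if (ch == '{' || ch == '}') { i++; continue; }       // group nesting
        if (ch != '\\') { putchar(ch); i++; continue; }      // keep plain text

        if (i + 1 < rtf.size() && rtf[i + 1] == '*') {       // \*: destination:
            int depth = 0;                                   // skip to end of group
            while (i < rtf.size()) {
                if (rtf[i] == '{') depth++;
                else if (rtf[i] == '}') {
                    if (depth == 0) break;                   // enclosing group's close
                    depth--;
                }
                i++;
            }
            continue;
        }
        i++;                                                 // control word: skip '\'
        while (i < rtf.size() && islower((unsigned char)rtf[i])) i++;  // letters
        while (i < rtf.size() && isdigit((unsigned char)rtf[i])) i++;  // parameter digits
        if (i < rtf.size() && rtf[i] == ' ') i++;            // and the terminator blank
    }
}

A real reader would, of course, dispatch recognized control words such as \b to formatting code rather than discarding them all; the point here is only the skipping discipline.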

3.1 Unicode RTF

An important question is how to extend rich-text formats to handle Unicode. Nominally Unicode characters are represented by unsigned 16-bit codes, but rich-text formats like RTF are based on 7-bit codes that pass easily through the international networks. Once again, to be concrete, I consider RTF. At first glance, you might conclude that you just make all characters 16-bit, and maybe take advantage of some of the Unicode semantics to simplify RTF. However, with a little time spent actually writing RTF converters, you see a different picture emerging. For one thing, a substantial part of RTF files consists of ASCII control words, so it would be nice to use smaller characters to represent them.

Two such variable-length encodings have received a lot of attention recently: UTF-8 and UTF-7. The main reasons for these byte encodings of Unicode are to be endian-independent (file-system safe) and to pass over the Internet. In addition to these advantages, UTF-8 fits easily into existing RTF converter code, but uses the high bit of a byte to represent Unicodes above ASCII, and UTF-7 fits well into the 7-bit world in general. Neither would be backward compatible with current readers, so they would have to be alternative RTF formats, but converters could be made readily available to translate to formats understandable to older readers. There's also the possibility of supporting UTF-8/7 as code pages 1208/7 in the Win32 MultiByteToWideChar() and WideCharToMultiByte() APIs. This would enable the RTF readers currently being developed to support UTF-8/7 with no additional changes, thanks to a proposed new control word \ansicpgN, which specifies which code page to use for an RTF document.

To tantalize you with just how simple things can become, the following is a brief definition of UTF-8. UTF-7 is somewhat more complicated, but it is still pretty easy to program. Note that you can “land” on any byte in UTF-8 and, with no more than an adjacent-byte check, know which byte of a multibyte sequence it is. Multibyte encodings often don’t have this nice feature. With numbers written in binary, a 16-bit Unicode is converted to a multibyte UTF-8 sequence as given in the following table:

|Type |16-Bit Form        |UTF-8 Multibyte Form         |
|1    |00000000.0bbbbbbb  |0bbbbbbb                     |
|2    |00000bbb.bbbbbbbb  |110bbbbb, 10bbbbbb           |
|3    |bbbbbbbb.bbbbbbbb  |1110bbbb, 10bbbbbb, 10bbbbbb |

Conversion type 1 provides a one-byte sequence that spans the ASCII character set in a compatible way. Conversion types 2 and 3 represent higher-valued characters as sequences of two or three bytes with the high bit set. The number of adjacent high-bit 1’s in the leading byte gives the number of bytes for the encoding. So 110bbbbb implies that 2 bytes are used, while 1110bbbb implies that three are used. The full UTF-8 specification continues this idea beyond the three cases in the table to encode up to the 31-bit characters defined in ISO 10646. When there are multiple ways to encode a value, for example the code 0, the shortest encoding is used. In the inverse mapping, any sequence except those described above is incorrect and is converted to Unicode U+0080.
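
As a sketch, the forward conversion is just a few lines of C (the inverse mapping and its error handling are left out):

/* Sketch: encode one 16-bit Unicode value as UTF-8 per the table above.
   Returns the number of bytes written to out (1 to 3). */
int ToUtf8(unsigned ch, unsigned char *out)
{
    if (ch < 0x80) {                                   /* Type 1: 0bbbbbbb */
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {                                  /* Type 2: 110bbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (ch >> 12));       /* Type 3: 1110bbbb 10bbbbbb 10bbbbbb */
    out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (ch & 0x3F));
    return 3;
}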

Since UTF-8 and UTF-7 encodings for Unicode-enabled RTF are not compatible with existing readers, it’s also desirable to have a “fat” format that works with old and new readers. Such a format includes Unicode control-word sequences that Unicode-enabled readers understand, along with the corresponding multibyte text sequences that older readers can use and the Unicode readers can ignore. This approach is an important one, but it is relatively difficult to program and leads to file sizes several times as large as UTF-8 RTF. Hence I prefer using the latter unless old readers need the former, in which case a simple converter can translate between the two.

Armed with this introduction, let’s summarize RTF’s good and bad points.

3.2 RTF Good Points

1. Can represent common rich-text character attributes, e.g., bold, italic, underline, font changes, language

2. Can represent common paragraph attributes, e.g., alignment, spacing, borders

3. Can represent named styles, such as headings

4. Is totally expressible using a subset of ASCII

5. Has a simple syntax that handles multiple text destinations

6. Is general enough to support anything in a Word document

7. Has well-defined rules for ignoring control words that you don’t understand

8. Is easy to extend to Unicode (unless full backward compatibility is required)

3.3 RTF Bad Points

1. Difficult control-word scope: unless a control word occurs within a group {}, the scope potentially runs to the end of the RTF text. However it may be delimited by another control word and the syntax doesn’t reveal this. This is a problem if you want to preserve unrecognized RTF. E.g., \b turns on boldface and \b0 turns it off. All RTF readers and writers understand this particular control word, since it’s so basic to rich text. But suppose someone defines a new character attribute \glitter, which turns on “glitter” text, which is supposed to be displayed in a scintillating way. Just as \b0 turns off boldface, \glitter0 is defined to turn off the glitter attribute, but that’s a control-word semantic, not a property of RTF syntax. So a reader that doesn’t know what \glitter means can’t know the scope of \glitter. A more general syntax could be implemented using a header that describes basic scope and feature characteristics of all control words used in a rich-text document. With such a header, even though you might not know what \glitter means, you’d have enough information to know that it’s a character-format attribute that is turned off by \glitter0. If someone then deletes the rich text that includes \glitter through \glitter0, you’d know that you can delete these control words. RTF doesn’t have such headers, although we probably ought to include them in a future rich-text format.

2. Is general enough to support anything in a Word document: unless representing a Word document is the goal, this generality is too great, too little, or both, i.e., there’s a mismatch. If too great, then you have to ignore information or write a lot of code to handle stuff you don’t care about (e.g., to round-trip Word files through a different editing environment).

3. Examples where Word’s generality may be insufficient: font weight (Word has normal and bold, but TrueType and PostScript both define fonts with many more possibilities), or new character attributes such as \glitter or \fire, for which Word may not have anything similar.

4 Formatting Versus Content

When you display some text in boldface, you do so for a reason, e.g., to emphasize the text or to mark it as a keyword. In electronic publications, it’s very valuable to expose such reasons to browsers. There’s far too much information out there to read it all, so exposing the content of documents can greatly increase the amount of text that people can navigate through. This is even more true for blind people. RTF per se, like TEX per se, doesn’t expose content very well. The content information may be implied, but it’s hidden behind specific formatting. On the other hand, if RTF styles are used, the content can be exposed sufficiently well to map into HTML and SGML. Microsoft has Word add-ons for both targets (Internet Assistant and SGML Author, respectively). Styles require end-user training and they complicate RTF files, but they are incredibly valuable.

Similarly in math, knowing that some text comprises the numerator of a fraction is more useful than just being told to display it at a raised and shifted location. The former could be used as input to symbolic manipulation and graphing programs, while the latter would require sophisticated pattern recognition to be used for such purposes. (We’ll need such pattern-recognition software to intelligently encode the mathematical expressions in most of the archived literature.) Math formats such as TEX’s or eqn/troff’s retain the appropriate content information, while many typesetters and word processors use codes that do not.

5 Making Editing Easier: Parallel Text Runs

RTF, Hewlett-Packard PCL5, SGML (with associated DTDs), TEX, and eqn/troff are examples of document formats with embedded rich-text formatting. These formats are fine for display, but are difficult to use for editing and information access, such as for spelling and grammar checkers.

To facilitate editing, Microsoft text programs such as Word and the system RichEdit control use multiple, parallel text runs to represent rich text. For example, the RichEdit 2.0 control has three sets of text runs:

1) Unicode plain-text runs

2) Character-format runs

3) Paragraph-format runs

Any kind of run is characterized by a count of characters and an index into an array of text-run descriptors that describe the runs. The plain-text runs divide the text into a set of text blocks to speed up insertions and deletions. These blocks contain the plain text. The character-format runs identify text runs with constant character formatting, and the paragraph-format runs identify text runs with constant paragraph formatting. For example, in the text “This is bold”, there are two character-format runs, the first with a count of eight and the second with a count of four.

By separating the plain text out into its own separate runs, editing is facilitated, since you don’t have to “step over” embedded rich-text fields. This also makes the text easier to use for applications that aren’t concerned with rich-text formatting, such as spell checkers, database programs, and search engines. However since multiple blocks are used for the plain text, direct access to this “backing store” isn’t permitted; special text-object-model interfaces exist to access the text, whether rich or plain.
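
A sketch of the idea follows; the structure and names here are illustrative, not RichEdit’s actual internals:

// Illustrative sketch of parallel format runs over a shared backing store.
struct Run {
    long cch;       // count of characters covered by this run
    long iFormat;   // index into an array of format descriptors
};

// "This is bold": one plain-text backing store, two character-format runs:
//   runs[0] = { 8, iPlainFormat };   // "This is "
//   runs[1] = { 4, iBoldFormat };    // "bold"

// Map a character position to the run that formats it.
int RunFromCp(const Run *runs, int crun, long cp)
{
    for (int i = 0; i < crun; i++) {
        if (cp < runs[i].cch)
            return i;
        cp -= runs[i].cch;    // advance past this run
    }
    return crun - 1;          // past the end: last run
}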

6 Glyph Variants and Unicode

At first thought, you might choose to use codes in the Unicode Private Use Area as handles that index into glyph structures that define the glyphs in detail. Such handles would be volatile; they’d be allocated as requested on loading a file and freed when the file is closed. The definition of a glyph could be made with an RTF-like syntax in a group that starts with the control word \glyph.

However if you have character-format runs, a better way is to put the base character(s) directly into the backing store and associate them with a character-format run that identifies the special information needed to display the desired glyph variant. So for the many Poetica ampersands, an ampersand would be stored into the backing store and the associated single-character character-format run would identify the font (Poetica) as well as information that identifies which glyph to use. In this way, sorting and spelling programs work with a plain-text backing store that contains the nominal character values instead of strange glyph-variant codes.

The same approach is useful for ligature substitutions that aren’t rendered algorithmically. With English letters, it’s easy to call out ligature substitutions for the combinations ff, fi, fl, ffi, and ffl. But no algorithm tells us to use the TT and TE ligatures in the Mantinia font; such choices are made on artistic grounds by the end user. Having made such a choice, the end user still wants spell checkers to get TT and TE, rather than special codes for the TT and TE ligatures.

7 Round-tripping DBCS through Unicode

In the discussion of rich text and glyph variants, I’ve avoided using the Unicode Private Use Area, since other approaches appear to work better. But one problem has arisen that can make good use of the Private Use Area: character handles for DBCS characters that have no Unicode counterparts. This problem has arisen because we use Unicode internally in our applications and yet the world out there still may store files in a multibyte character system. So we merrily use the Win32 MultiByteToWideChar() and WideCharToMultiByte() APIs to translate to and from Unicode. The problem is that some characters won’t “round trip” through Unicode, that is, they have no Unicode counterparts. If our handy APIs would assign Private-Use-Area handles for these cases, these characters could round trip through Unicode. It would even be possible to have new APIs that would let you display these characters using fonts with the appropriate DBCS mapping tables. In general, using Private-Use-Area codepoints as volatile character handles seems to be the best approach. If you assign stored values to these codepoints, you run the risk of conflicting meanings.
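
The following C++ sketch shows what such volatile handle assignment might look like; the table layout and function names are hypothetical, not an actual API:

#include <map>

// Sketch: volatile Private-Use-Area handles for DBCS characters that have
// no Unicode counterpart, so they survive a round trip. Per-session tables.
static std::map<unsigned, wchar_t> handleForDbcs;  // DBCS value -> PUA handle
static std::map<wchar_t, unsigned> dbcsForHandle;  // PUA handle -> DBCS value
static wchar_t nextHandle = 0xE000;                // first Private Use Area code

wchar_t HandleFromDbcs(unsigned dbcs)              // called when no mapping exists
{
    auto it = handleForDbcs.find(dbcs);
    if (it != handleForDbcs.end())
        return it->second;                         // reuse this session's handle
    handleForDbcs[dbcs] = nextHandle;
    dbcsForHandle[nextHandle] = dbcs;
    return nextHandle++;                           // volatile: freed with the file
}

unsigned DbcsFromHandle(wchar_t handle)            // inverse mapping on write-out
{
    auto it = dbcsForHandle.find(handle);
    return it != dbcsForHandle.end() ? it->second : 0;
}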

8 Unicode Plain-Text Encoding of Mathematics

Onward to one of my most favorite topics (please pardon the double superlative!). Given the power of Unicode relative to ASCII, how much better can a plain-text encoding of mathematical expressions look using Unicode? The most well-known plain-text ASCII encoding of such expressions is that of TEX, so I use it for comparison. Notwithstanding TEX’s phenomenal success in the science and engineering communities, a casual glance at its representation of mathematical expressions reveals that they don’t look very much like the expressions they represent. Only the most daring would ever try to make algebraic calculations using TEX’s notation. With Unicode, we can do much better, and the resulting plain text can be used directly for such calculations.

For example, a TEX fraction numerator consists of the expression that follows a { up to the keyword \over and the denominator consists of what follows the \over up to the matching }. In both the fraction and subscript/superscript cases, the { } are not printed. These simple rules immediately give a “plain text” that is unambiguous, but looks quite different from the corresponding mathematical notation, thereby making it hard to read.

Instead, suppose we define a simple operand to consist of all consecutive non-operator characters. We call this sequence of one or more characters a span of non-operators. As such, a simple numerator or denominator is terminated by any operator, including, for example, arithmetic operators, the blank operator, all Unicode characters with codes U+22xx, and a special argument “break” operator consisting of a small raised dot. The fraction operator is given by the Unicode fraction slash operator U+2044, which we depict with the glyph //. So the simple built-up fraction

$\frac{abc}{d}$

appears in plain text as abc//d.

For more complicated operands, such as those that include operators, parentheses ( ), brackets [ ], or { } can be used to enclose the desired character combinations. If parentheses are used and the outermost parenthesis set is preceded and followed by operators, that set is not displayed in built-up form, since usually one doesn’t want to see such parentheses. So the plain text (a + b)//c displays as

$\frac{a+b}{c}$.

In practice, this approach leads to plain text that is significantly easier to read than TEX’s, e.g., {a + c \over d}, since in many cases, outermost parentheses are not needed, while TEX requires { }’s. To force the display of an outermost parenthesis set, one encloses the set, in turn, within parentheses, which then become the outermost set. A really neat feature of this notation is that the plain text is, in fact, a legitimate mathematical notation in its own right, so it’s relatively easy to read.

Nature isn’t so kind with subscripts and superscripts, but they’re still quite readable. Specifically, we introduce a subscript by a subscript operator with its own special glyph that resembles a subscripted down arrow, depicted here as ↓. The subscript itself can be any operand as defined above. Compound subscripts include expressions within parentheses, as well as subscripted subscripts, which work using right-to-left associativity, e.g., a↓b↓c means $a_{b_c}$. Similarly a↑b↑c means $a^{b^c}$.

As a slightly more complicated example, consider the expression $W^{3\beta}_{\delta_1\rho_1\sigma_2}$, which has the plain-text format W↑3β↓δ₁ρ₁σ₂. In contrast, for TEX, you type

$W^{3\beta}_{\delta_1\rho_1\sigma_2}$,

which is hard to read. The TEX version looks distinctly better using Unicode for the symbols, namely $W^{3β}_{δ_1ρ_1σ_2}$ or $W^{3β}_{δ₁ρ₁σ₂}$, since Unicode has a full set of decimal subscripts and superscripts. However the need to use the {}, not to mention the $’s, makes even the last of these harder to read than the plain-text version W↑3β↓δ₁ρ₁σ₂.

For the ratio

$\frac{\alpha_2^3}{\beta_2^3 + \gamma_2^3}$,

the Unicode plain text reads α₂³//(β₂³ + γ₂³), while the standard TEX version reads as

${\alpha^3_2 \over \beta^3_2 + \gamma^3_2}$.

The Unicode plain text is a legitimate mathematical expression, while the TEX version bears no resemblance to a mathematical expression.

TEX becomes very cumbersome for longer equations such as

$W^{3\beta}_{\delta_1\rho_1\sigma_2} = U^{3\beta}_{\delta_1\rho_1} + \frac{1}{8\pi^2}\int_{\alpha_1}^{\alpha_2} d\alpha_2' \left[\frac{U^{2\beta}_{\delta_1\rho_1} - \alpha_2' U^{1\beta}_{\rho_1\sigma_2}}{U^{0\beta}_{\rho_1\sigma_2}}\right].$

The Unicode plain-text version of this reads as

W↑3β↓δ₁ρ₁σ₂ = U↑3β↓δ₁ρ₁ + 1/8π² ∫↓α₁↑α₂ dα₂′ [(U↑2β↓δ₁ρ₁ - α₂′U↑1β↓ρ₁σ₂)/U↑0β↓ρ₁σ₂]

while the standard TEX version reads as

${W^{3\beta}_{\delta_1\rho_1\sigma_2}

= U^{3\beta}_{\delta_1\rho_1} + {1 \over 8\pi^2}

\int_{\alpha_1}^{\alpha_2} d\alpha_2\prime \left[

{U^{2\beta}_{\delta_1\rho_1} - \alpha_2\prime

U^{1\beta}_{\rho_1\sigma_2} \over

U^{0\beta}_{\rho_1\sigma_2}} \right] }$ .

In a “Unicoded” TEX, it could read as

${W^{3β}_{δ₁ρ₁σ₂} = U^{3β}_{δ₁ρ₁} + {1 / 8π²}

∫_{α₁}^{α₂} dα₂′ \left[{U^{2β}_{δ₁ρ₁} - α₂′U^{1β}_{ρ₁σ₂}

/ U^{0β}_{ρ₁σ₂}} \right] }$,

which is significantly easier to read than the ASCII TEX version, although still much harder to read than the Unicode plain-text version.

Brackets [ ], braces { }, and parentheses ( ) represent themselves in the Unicode plain text, and a word processing system capable of displaying built-up formulas should expand them to fit around what’s inside them. Here we use U+2032 for \prime and U+2044 for \over.

9 Recognizing Mathematical Expressions

Unicode plain-text encoded mathematical expressions can be used “as is” for simple documentation purposes. Use in more elegant documentation and in programming languages requires knowledge of the underlying mathematical structure. This section describes some of the heuristics that can distill the structure out of the plain text.

Many mathematical expressions patently identify themselves as mathematical, obviating the need to declare them explicitly as such. One of TEX’s greatest limitations is its inability to detect expressions that are obviously mathematical, but that are not enclosed within $’s. To complicate matters, the popular TEX dialects use the $ as a toggle, which is a poor choice as a myriad TEX users will loudly testify! There’s nothing as frustrating as leaving out a $ by mistake and thereby receiving a slew of error messages because TEX interprets subsequent text in the wrong mode. An advantage of recognizing mathematical expressions without math-on/math-off syntax is that it is much more tolerant of user errors of this sort. Resyncing is automatic, while in TEX you basically have to start up again from the omission in question. Furthermore, this approach should be useful in an important related endeavor, namely recognizing the mathematical literature that’s not yet available in an object-oriented, machine-readable form and converting it into that form. A similar recognition problem exists for pen entry of equations.

The PS Technical Word Processor uses a number of heuristics for identifying mathematical expressions and treating them accordingly. These heuristics are not foolproof, but they lead to the most popular choices. Special commands are available to overrule these choices. With Unicode, the approach can be significantly improved. Ultimately it could be used as an autoformat style wizard that tags expressions with a rich-text math style. The user could then override cases that were tagged incorrectly. A math style would connect in a straightforward way to an SGML tag.

The basic idea is that math characters identify themselves as such and potentially identify their surrounding characters as math characters as well. For example, the fraction (U+2044) and ASCII slashes, symbols in the range U+2200 through U+22ff, and the symbol combining marks (U+20d0 - U+20ff) identify the characters immediately surrounding them as parts of math expressions. The symbol set of “math characters” includes Greek letters, which would have to be reassessed for doing math in the Greek language.

Except for “a”, “A”, and “I”, single ASCII letters are automatically treated as math characters and italicized. Various if statements are used to figure out if “a”, “A”, and “I” are math characters, but especially with “a”, the benefit of the doubt is given to the article. An on-line English-language dictionary would be of help in resolving such ambiguities. The user can force the math interpretation by typing an italic a explicitly, and, in fact, all mathematical variable names can be entered in italic. Two problems occur with this: 1) it’s more effort to type italics, and 2) italic is used for emphasis as well as for mathematical symbols, and it’s genuinely useful to know which is which.

As described above, a simple subscript operand consists of the string of all non-operators that follow the subscript operator. Compound subscripts include expressions within parentheses, square brackets, and curly braces. In addition it’s worthwhile to treat two more operators, the comma and the period, in special ways. Specifically, if a subscript operand is followed directly by a comma or a period that is, in turn, followed by whitespace, then the comma or period appears on line, i.e., is treated as the operator that terminates the subscript. However a comma or period followed by a non-operator is treated as part of the subscript. This refinement obviates the need for many overriding parentheses, thereby yielding a more readable plain text.
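
In code, this refinement amounts to one extra test in the scan for the end of an operand. A sketch follows, with an abbreviated operator set and hypothetical function names; the full operator set appears in Sec. 10:

#include <cwchar>
#include <cwctype>
#include <string>

// Sketch: is ch in the (abbreviated) operator set of Sec. 10?
bool IsOperator(wchar_t ch)
{
    return wcschr(L" \t\"=,.-+/*[](){}|\x2191\x2193\x00B7", ch) != nullptr;
}

// Return the position just past a simple subscript operand starting at i.
// A ',' or '.' terminates the operand only when followed by whitespace.
size_t EndOfOperand(const std::wstring& s, size_t i)
{
    for (; i < s.size(); i++) {
        wchar_t ch = s[i];
        if (ch == L',' || ch == L'.') {
            if (i + 1 == s.size() || iswspace(s[i + 1]))
                break;              // punctuation goes back on line
            continue;               // part of the subscript, e.g., a decimal point
        }
        if (IsOperator(ch))
            break;                  // any other operator ends the span
    }
    return i;
}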

ASCII letter pairs surrounded by whitespace are often mathematical expressions, and as such should be italicized in print. If a letter pair fails to appear in a list of common English and European two-letter words, it is treated as a mathematical expression and italicized. Many Unicode characters are not mathematical in nature and suggest that their neighbors are not parts of mathematical expressions.

Strings of characters containing no whitespace but containing one or more unambiguous mathematical characters are generally treated as mathematical expressions. Certain two-, three-, and four-letter words inside such expressions are not italicized. These include trigonometric function names like sin and cos, as well as ln, cosh, etc. Words or abbreviations often used as subscripts (see the program in Sec. 11) also should not be italicized, even when they clearly appear inside mathematical expressions.

Special cases will always be needed, such as in documenting the syntax itself. One needs a symbol that causes the character that follows it to be treated as an ordinary character. This allows the printing of characters without modification that by default are considered to be mathematical and thereby subject to a changed display. Similarly, mathematical expressions that the algorithms treat as ordinary text can be sandwiched between math-on and math-off symbols. Such “overhead” symbols clutter up the text and hopefully will be rarely needed in Unicode plain text. The method I’ve used up to now is to introduce a special override symbol to force the behavior desired. This does complicate the preparation of technical documents and although you can get very good at it, it’s not the most user-friendly way of doing things. On the other hand, identifying the beginning and end of math expressions using $’s isn’t user friendly either.

9.1 Some Minimal Rich Text

If you’re willing to introduce a minimal amount of rich text, mathematical expressions can be marked as such, just as boldface text is marked as boldface. The heuristics for recognizing math can be used to provide the initial math run marking, and the user can touch up exceptions by selecting misidentified text and applying the appropriate style. It’s also possible to dive deeper into mathematical expressions, identifying all the component pieces with text runs at the correct nesting level. But since most rich-text engines don’t know about mathematical expressions, it’s probably better to leave the expressions in the plain-text form, which can be displayed correctly and recognized in a content-sensitive fashion so that formulas can be graphed, manipulated symbolically, and used as input into program code. I’m hoping to sneak this capability into the system one of these days so that WYSIWYG editing of text with built-up formulas can be done by any application that uses an edit control.

9.2 Input of Mathematical and Other Unicode Characters

This leads to the important problem of input ease. The ASCII math symbols are easy to find, e.g., + - / * [ ] ( ) { }, but often need to be used as themselves. From a syntax point of view, the official Unicode minus sign (U+2212) is certainly preferable to the ASCII hyphen-minus (U+002D) and the prime (U+2032) is preferable to the ASCII apostrophe (U+0027), but users may find the ASCII characters more easily. Similarly it’s easier to type ASCII letters than italic letters, but when used as mathematical variables, such letters are traditionally italicized in print. Other post-entry enhancements include automatic-ligature and left-right quote substitutions, which can be done automatically by some word processors. Suffice it to say that intelligent input algorithms can dramatically simplify the entry of mathematical symbols.

A special math shift facility for keyboard entry could bring up proper math symbols. The values chosen can be displayed on an on-screen keyboard. For example, the left Alt key can access the most common mathematical characters and Greek letters, the right Alt key could access italic characters plus a variety of arrows, and the right Ctrl key could access script characters and other mathematical symbols. The numeric keypad offers locations for a variety of symbols, such as sub/superscript digits using the left Alt key. Left Alt CapsLock could lock into the left-Alt symbol set, etc. Other possibilities involve the NumLock and ScrollLock keys in combinations with the left/right Ctrl/Alt keys. Pretty soon you realize that this approach offers literally billions of combinations, i.e., several orders of magnitude more than Unicode can handle!

The autocorrect feature of Microsoft Word 95 (and later) offers another way of entering mathematical characters for people familiar with TEX. For example, you type \alpha and shazaam! It changes to α.

Pull-down menus are a popular method for handling large character sets, but they are notoriously slow. A better approach is the symbol box, which is an array of symbols either chosen by the user or displaying the characters in a font. Symbols in symbol boxes can be dragged and dropped onto key combinations on the on-screen keyboard(s), or directly into applications. On-screen keyboards and symbol boxes are valuable for entry of mathematical expressions and of Unicode text in general.

10 Operator Summary

Operands in subscripts, superscripts, fractions, roots, boxes, etc. are defined in part in terms of operators and operator precedence. While such notions are very familiar to mathematically oriented people, some of the symbols that we define as operators might surprise one at first. Most notably, the space (ASCII 32) is an important operator in the plain-text encoding of mathematics. A minimal list of operators is

FF CR \

( [ {

) ] } |

Space " . , = - + LF Tab

/ * × · •

∫ Σ Π

↓ ↑

where LF = U+000A, FF = U+000C, and CR = U+000D.

As in arithmetic, operators have precedence, which streamlines the interpretation of operands. The operators are grouped above in order of increasing precedence, with equal precedence values on the same line. For example, in arithmetic, 3+1/2 = 3.5, not 2. Similarly the plain-text expression α + β/γ means

$\alpha + \frac{\beta}{\gamma}$, not $\frac{\alpha + \beta}{\gamma}$.

As in arithmetic, precedence can be overruled, so (α + β)/γ gives the latter.

The following gives a list of the syntax for a variety of mathematical constructs.

exp1/exp2 Create a built-up fraction with numerator exp1 and denominator exp2. Numerator and denominator expressions are terminated by operators such as / * ] ) ↑ ↓ · and blank (can be overruled by enclosing in parentheses). The “/” is given by U+2044.

↑exp1 Superscript expression exp1. The superscripts 0 - 9 + - ( ) exist as Unicode symbols. Sub/superscript expressions are terminated by / * ] ) · ↑ ↓ and blank. Sub/superscript operators associate right to left.

↓exp1 Subscript expression exp1. The subscripts 0 - 9 + - ( ) exist as Unicode symbols.

[exp1] Surround exp1 with built-up brackets. Similarly for { } and ( ).

[exp1]↑exp2 Surround exp1 with built-up brackets followed by superscripted exp2 (moved up high enough). Similarly for { } and ( ).

√exp1 Square root of exp1.

· Small raised dot that is not intended to print. It is used to terminate an operand, such as in a subscript, superscript, numerator, or denominator, when other operators cannot be used for this purpose. Similar raised dots like • and · also terminate operands, but they are intended to print.

Σ↓exp1↑exp2 Summation from exp1 to exp2. ↓exp1 and ↑exp2 are optional.

Π↓exp1↑exp2 Product from exp1 to exp2.

∫↓exp1↑exp2 Integral from exp1 to exp2.

exp1¦exp2 Align exp1 over exp2 (like a fraction without the bar)

Diacritics are handled using Unicode combining marks (U+0300 - U+036F, U+20D0 - U+20FF).

11 Export to programming languages and TEX

Getting computers to understand human languages is important in increasing the utility of computers. Natural-language translation, speech recognition and generation, and programming are typical ways in which such machine comprehension plays a role. The better this comprehension, the more useful the computer, and hence considerable effort has been devoted to these areas since the early 1960s.

Ironically, one truly international human language that tends to be neglected in this connection is mathematics itself. In the mid-1950s, the authors of FORTRAN named their computer language after FORmula TRANslation, but they only went halfway. Arithmetic expressions in Fortran and other current high-level languages still don’t look like mathematical formulas, and considerable human coding effort is needed to translate formulas into their machine-comprehensible counterparts. Whitehead once said that 90% of mathematics is notation and that a perfect notation would be a substitute for thought. From this point of view, modern computer languages are badly lacking.

Using real mathematical expressions in computer programs would be far superior in terms of readability, reduced coding times, program maintenance, and streamlined documentation. In studying computers we have been taught that this ideal is unattainable, and that we must be content with the arithmetic expression as it is or some other non-mathematical notation such as TEX’s. It is time to reexamine this premise. Whereas true mathematical notation clearly used to be beyond the capabilities of machine recognition, we feel it no longer is.

In general, mathematics has a very wide variety of notations, none of which look like the arithmetic expressions of programming languages. Although ultimately it would be desirable to be able to teach computers how to understand all mathematical expressions, we start with our Unicode plain-text format.

In raw form, these expressions look very much like traditional mathematical expressions. With use of the heuristics of Sec. 9, they can be printed or displayed in traditional built-up form. On disk, they can be stored in pure-ASCII program files accepted by standard compilers and symbolic manipulation programs like Derive, Mathematica, and Macsyma. The translation between Unicode symbols and the ASCII names needed by ASCII-based compilers and symbolic manipulation programs is carried out via table-lookup (on writing to disk) and hashing (on reading from disk) techniques. We have found that such translation increases the disk-access times of typical programs by only about 10%.
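
A sketch of the write-out direction follows; the table contents and names here are illustrative, not the actual tables of any product:

#include <cstddef>

// Sketch: map a Unicode math symbol to an ASCII alias on writing a program
// file to disk; reading back would hash the aliases. Table is illustrative.
struct SymbolName { wchar_t ch; const char *name; };

static const SymbolName table[] = {
    { 0x03B1, "alpha" }, { 0x03B2, "beta"  }, { 0x03B3, "gamma"    },
    { 0x0393, "Gamma" }, { 0x221A, "sqrt"  }, { 0x222B, "integral" },
};

const char *NameFromSymbol(wchar_t ch)     // table lookup on write
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].ch == ch)
            return table[i].name;
    return nullptr;                        // ASCII or unmapped: write as-is
}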

Hence formulas can be at once printable in manuscripts and computable, either numerically or analytically. The expressions can contain standard arithmetic operations and special characters, such as Greek, italics, script, and various mathematical symbols like the square root. Two levels of implementation are envisaged: scalar and vector. Scalar operations can be performed on traditional compilers such as those for C and Fortran. The scalar multiply operator is represented by a raised dot, a legitimate mathematical symbol, instead of the asterisk. To keep auxiliary code to a minimum, the vector implementation requires an object-oriented language such as C++.

The advantages of using the Unicode plain text are at least threefold: 1) Many formulas in document files can be programmed simply by copying them into a program file and inserting appropriate multiplication dots. This dramatically reduces coding time and errors. 2) The use of the same notation in programs and the associated journal articles and books leads to an unprecedented level of self-documentation. In fact, since many programmers document their programs poorly or not at all, this enlightened choice of notation can immediately change nearly useless or nonexistent documentation into excellent documentation. 3) In addition to providing useful tools for the present, these proposed initial steps should help us figure out how to accomplish the ultimate goal of teaching computers to understand and use arbitrary mathematical expressions. Such machine comprehension would greatly facilitate future computations as well as the conversion of the existing paper literature and Pen-Windows input into machine-usable form.

The concept is portable to any environment that supports a large character set, preferably Unicode, and it takes advantage of the fact that high-level languages like C and Fortran accept an “escape” character (“_” and “$”, respectively) that can be used to access extended symbol sets in a fashion similar to TEX. In addition, the built-in C preprocessor allows niceties such as aliasing the asterisk with a raised dot, which is a legitimate mathematical symbol for multiplication. Of course, if we could convince our compiler friends to allow us to use Unicode for program-variable names, we’d really have it made! Compatibility with unenlightened ASCII-only compilers could be maintained via an ASCII representation of Unicode characters.

To get an idea as to the differences between the standard way of programming mathematical formulas and the proposed way, compare the following versions of a C++ routine entitled IHBMWM (inhomogeneously broadened multiwave mixing)

void IHBMWM(void)
{
    gammap = gamma*sqrt(1 + I2);
    upsilon = cmplx(gamma + gamma1, Delta);
    alphainc = alpha0*(1 - (gamma*gamma*I2/gammap)/(gammap + upsilon));
    if (!gamma1 && fabs(Delta*T1) < 0.01)
        alphacoh = -half*alpha0*I2*pow(gamma/gammap, 3);
    else
    {
        Gamma = 1/T1 + gamma1;
        I2sF = (I2/T1)/cmplx(Gamma, Delta);
        betap2 = upsilon*(upsilon + gamma*I2sF);
        beta = sqrt(betap2);
        alphacoh = 0.5*gamma*alpha0*(I2sF*(gamma + upsilon)
                       /(gammap*gammap - betap2))
                   *((1 + gamma/beta)*(beta - upsilon)/(beta + upsilon)
                   - (1 + gamma/gammap)*(gammap - upsilon)/
                     (gammap + upsilon));
    }
    alpha1 = alphainc + alphacoh;
}

void IHBMWM(void)
{
    γ′ = γ•√(1 + I2);
    υ = γ + γ1 + i•Δ;
    αinc = α0•(1 - (γ•γ•I2/γ′)/(γ′ + υ));
    if (!γ1 && fabs(Δ•T1) < 0.01)
        αcoh = -.5•α0•I2•pow(γ/γ′, 3);
    else
    {
        Γ = 1/T1 + γ1;
        I2F = (I2/T1)/(Γ + i•Δ);
        β = √(β2 = υ•(υ + γ•I2F));
        αcoh = .5•γ•α0•(I2F•(γ + υ)/(γ′•γ′ - β2))
               ×((1 + γ/β)•(β - υ)/(β + υ) - (1 + γ/γ′)•(γ′ - υ)/(γ′ + υ));
    }
    α1 = αinc + αcoh;
}

The above function runs fine with current C++ compilers, but C++ does impose some serious restrictions based on its limited operator table. For example, vectors can be multiplied together using dot, cross, and outer products, but there’s only one asterisk to overload in C++. In built-up form, the function looks even more like mathematics, namely

void IHBMWM(void)
{
    γ′ = γ•√(1 + I2);
    υ = γ + γ1 + i•Δ;
    αinc = α0•(1 - (γ•γ•I2/γ′)/(γ′ + υ));
    if (!γ1 && fabs(Δ•T1) < 0.01)
        αcoh = -.5•α0•I2•pow(γ/γ′, 3);
    else
    {
        Γ = 1/T1 + γ1;
        I2F = (I2/T1)/(Γ + i•Δ);
        β = √(β2 = υ•(υ + γ•I2F));
        αcoh = .5•γ•α0•(I2F•(γ + υ)/(γ′•γ′ - β2))
               ×((1 + γ/β)•(β - υ)/(β + υ) - (1 + γ/γ′)•(γ′ - υ)/(γ′ + υ));
    }
    α1 = αinc + αcoh;
}

The ability to use the second and third versions of the program is built into the PS Technical Word Processor. With it we already come much closer to true formula translation on input, and the output is displayed in standard mathematical notation. Lines of code can be previewed in built-up format, complete with fraction bars, square roots, and large parentheses. To code a formula, you copy (cut and paste) it from a technical document into a program file, insert appropriate raised dots for multiplication, and compile. No change of variable names is needed. Call that 70% of true formula translation! In this way, the C++ function above compiles without modification. The code appears nearly the same as the formulas in print [see Chaps. 5 and 8 of P. Meystre and M. Sargent III (1991), Elements of Quantum Optics, Springer-Verlag].

Questions remain, such as whether subscript expressions in the Unicode plain text should be treated as part of program-variable names, or whether they should be translated to subscript expressions in the target programming language. Similarly, it would be straightforward to automatically insert an asterisk (indicating multiplication) between adjacent symbols, rather than have the user do it. However, here there is a major difference between mathematics and computation: symbolically, multiplication is infinitely precise and infinitely fast, while numerically, it takes time and is restricted to a binary subset of the rationals with very limited (although often adequate) precision. Consequently, for the moment at least, it seems wiser to consider adjacent symbols as part of a single variable name, just as adjacent ASCII letters are part of a variable name in current programming languages. Perhaps intelligent algorithms will be developed that decide when multiplication should be performed and insert the asterisks optimally.

Export to TEX is similar to that to programming languages, but has a modified set of requirements. With current programs, comments are distilled out with distinct syntax. This same syntax can be used in the Unicode plain-text encoding, although it’s interesting to think about submitting a mathematical document to a preprocessor that can recognize and separate out programs for a compiler. In this connection, compiler comment syntax isn’t particularly pretty; ruled boxes around comments and vertical dividing lines between code and comments are noticeably more readable. So some refinement of the ways that comments are handled would be very desirable. For example, it would be nice to have a vertical window-pane facility with synchronous window-pane scrolling and the ability to display C code in the left pane and the corresponding // comments in the right pane. Then if you want to see the comments, you widen the right pane accordingly. On the other hand, to view lines with many characters of code, the // comments needn’t get in the way. Such a dual-pane facility would also be great for working with assembly-language programs.

With TEX, the text surrounding the mathematics is part and parcel of the technical document, and TEX needs its $’s to distinguish the two. These can be included in the plain text, but we have repeatedly pointed out how ugly this solution is. The heuristics described in Sec. 9 go a long way in determining what is mathematics and what is natural language. Accordingly, the export method consists of identifying the mathematical expressions and enclosing them in $’s. The special symbols are translated to and from the standard TEX ASCII names via table lookup and hashing, as for the program translations. Better yet, TEX should be recompiled to use Unicode.

12 Conclusions

We have shown how Unicode can be used together with rich-text formatting to represent documents ranging from simple enhancements of plain text to sophisticated mathematical documents. Such documents can be stored in a Unicode version of Microsoft Word’s RTF format. We have discussed how Unicode rich text can be easily edited using parallel text runs, which are also flexible enough to represent glyph variants.

This leads into the discussion of Unicode and mathematical expressions. With a few additions to Unicode, mathematical expressions can be represented with a remarkably readable Unicode plain-text format. The text consists of combinations of operators and operands. A simple operand consists of a span of non-operators, a definition that dramatically reduces the number of parenthesis-override pairs and thereby increases the readability of the plain text. The only disadvantage of this approach versus TEX’s ubiquitous { } pairs is that the user needs to know which characters are operators. To reveal the operators, operator-aware editors could be instructed to display operators in a different color or with some other attribute. Heuristics can be applied to the Unicode plain text to recognize what parts of a document are mathematical expressions. This allows the Unicode plain text to be used in a variety of ways, including in technical document preparation, symbolic manipulation, and numerical computation.

The heuristics given for recognizing mathematical expressions work surprisingly well, but they are not infallible. An effective use of the heuristics would be as an autoformatting wizard that marks what it thinks is mathematics with a mathematics rich-text style. The user could then overrule incorrect choices. Once marked unequivocally as mathematics (an elegant alternative to TEX’s $’s), export to SGML, compilers, and other consumers of mathematical expressions is straightforward.

The additions to Unicode that we recommend include subscript and superscript operator symbols, a “literal” operator symbol, a non-math quoting symbol, and math-on/math-off symbols (analogous to the left-to-right U+200E and right-to-left U+200F bidirectional marks). The special operator symbols could go together with general punctuation symbols (U+2000 - U+206F). It would also be nice to have subscript and superscript commas and periods included with the subs and sups (U+2070 - U+209F). Finally it’s important to identify all mathematical operators as such with a special bit, much as right-to-left characters are marked as right-to-left. Furthermore, these operators need precedence values that control the association of operands with operators unless overruled by parentheses. With these additions, we have a workable plain-text encoding of mathematics that looks remarkably like mathematics even with the most limited display capabilities. Appropriate display software can unambiguously make it look like the real thing.
