Convert pinyin tone numbers to tone marks

text to convert: (use v for ü)

add HTML coding for Web pages?
yes
no

output style:
cleanest, clearest code (combining diacritical marks only; but only partial compliance in most browsers)
cross-browser version (safe combining diacritical marks + individually coded characters)
code soup (individually coded characters only)

tag style:
put all special characters in <span> tags
put coding in styled <span> tags
put coding in <span> tags with a declared class
no <span> tags

Explanations

text to convert

This program will convert pinyin with tone numbers (e.g., Han4zi4 bu4 mie4, Zhong1guo2 bi4 wang2! -- Lu Xun) to pinyin with tone marks (e.g., Hànzì bù miè, Zhōngguó bì wáng! -- Lu Xun).
Here's a more complete sample, if you want to try out the full range of the converter:
A1A2A3A4 E1E2E3E4 I1I2I3I4 O1O2O3O4 U1U2U3U4 V1V2V3V4 NA1NA2NA3NA4N NE1NE2NE3NE4N NI1NI2NI3NI4N NO1NO2NO3NO4N NU1NU2NU3NU4N NV1NV2NV3NV4N NNA1NNA2NNA3NNA4NN NNE1NNE2NNE3NNE4NN NNI1NNI2NNI3NNI4NN NNO1NNO2NNO3NNO4NN NNU1NNU2NNU3NNU4NN NNV1NNV2NNV3NNV4NN a1a2a3a4 e1e2e3e4 i1i2i3i4 o1o2o3o4 u1u2u3u4 v1v2v3v4 na1na2na3na4n ne1ne2ne3ne4n ni1ni2ni3ni4n no1no2no3no4n nu1nu2nu3nu4n nv1nv2nv3nv4n nna1nna2nna3nna4nn nne1nne2nne3nne4nn nni1nni2nni3nni4nn nno1nno2nno3nno4nn nnu1nnu2nnu3nnu4nn nnv1nnv2nnv3nnv4nn
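
The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the code behind this page: the names `convert` and `mark_syllable` are hypothetical, and the placement rule (mark a or e if present, the o in ou, otherwise the last vowel) is the usual pinyin convention, assumed here rather than taken from the converter itself. Neutral-tone numbers (0 or 5) are not handled.

```python
import re
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030C", "4": "\u0300"}
VOWELS = "aeiouüAEIOUÜ"

def mark_syllable(letters: str, tone: str) -> str:
    """Place one tone mark on a syllable (assumed placement rule)."""
    # The page's convention: v stands in for ü.
    letters = letters.replace("v", "ü").replace("V", "Ü")
    lower = letters.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("o")
    else:
        # Otherwise the mark goes on the last vowel of the syllable.
        positions = [i for i, ch in enumerate(letters) if ch in VOWELS]
        if not positions:
            return letters + tone  # no vowel found: leave the input untouched
        pos = positions[-1]
    marked = letters[: pos + 1] + TONE_MARKS[tone] + letters[pos + 1 :]
    # NFC folds letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)

def convert(text: str) -> str:
    """Convert every run of letters followed by a tone digit 1-4."""
    return re.sub(
        r"([A-Za-zÜü]+)([1-4])",
        lambda m: mark_syllable(m.group(1), m.group(2)),
        text,
    )

print(convert("Han4zi4 bu4 mie4, Zhong1guo2 bi4 wang2!"))
```

Words without a tone number, such as "Lu Xun", pass through unchanged because the pattern requires a digit directly after the letters.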

add HTML for Web pages?

yes
The best character set to declare for Web pages with hanyu pinyin is undoubtedly UTF-8. The following needs to be in the head of your document's HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Also essential is a CSS declaration; for this, see the explanation below for tag style.
Overall, at least on Windows systems, Microsoft's Internet Explorer handles tonal hanyu pinyin better than any other browser to date (June 2003). Although the ubiquitous Times New Roman font works fine in Internet Explorer, it does not work properly for hanyu pinyin in other browsers. So don't add it to your font-family list, at least until such time as the other browser companies get their acts together. (I suppose it could be that Microsoft is cheating on the standards in some way; but either way, it's a problem.) Indeed, there is no standard serif font on Microsoft systems that works for tonal hanyu pinyin across different browsers; even Bitstream's Cyberbit Unicode font fails. (This problem also occurs with other special characters, which is why it is best not to use curly quotation marks, em dashes and other such typographical niceties on Web pages.)
So we must turn to the sans-serifs. Interestingly, Verdana works so-so in Opera 7 but fails in IE 6. The font that works best is Arial Unicode MS. Although I understand Microsoft's reasons for removing it from the free download area of the company site, it is an essential font. (I think Microsoft should donate it to the United Nations, in exchange for a tax write-off, for free download by all.) Those of you who don't have it should by all means try to acquire a copy. (It comes with most Microsoft Office products, though it isn't always installed automatically.)
no

output style:

cleanest, clearest code
This makes use of Unicode's combining diacritical marks. With this, just one code entry per diacritical mark is necessary. The code &#780;, for example, will place a third-tone mark over the letter that precedes it, regardless of what that letter is. Thus, only four code segments are necessary for all of hanyu pinyin, which is wonderfully simple. Unfortunately, however, browser support is inadequate and even buggy. Microsoft's Internet Explorer does a good job of handling these marks; but even such otherwise excellent browsers as Opera 7, Netscape 7, and Mozilla Firebird 0.6 place the marks over i's, ü's and all capital letters incorrectly. (See my test page for pinyin and Unicode.) So for now it's best to use the next option.
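
As a rough illustration of the "one code per mark" idea, the Python sketch below appends the decimal character reference for a tone's combining mark to any base letter. The function name `combining_entity` is hypothetical. The four code points (772 macron, 769 acute, 780 caron, 768 grave) are the standard Unicode combining marks, but their assignment to tones 1-4 is stated here as an assumption about how the converter uses them.

```python
import html
import unicodedata

# Decimal code points of the combining marks for tones 1-4.
COMBINING = {1: 772, 2: 769, 3: 780, 4: 768}

def combining_entity(letter: str, tone: int) -> str:
    """Return the letter followed by the reference for its tone mark."""
    return f"{letter}&#{COMBINING[tone]};"

coded = combining_entity("a", 3)
print(coded)  # a&#780;

# A browser decodes the reference and draws the mark over the letter;
# NFC normalization shows the equivalent single precomposed character.
decoded = html.unescape(coded)
print(unicodedata.normalize("NFC", decoded) == "\u01ce")  # True
```

The same four references work after any vowel, which is what makes this style so compact.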
cross-browser version
This uses only those combining diacritical marks that work correctly in all the major browsers, with individually coded characters (such as &#474; for ǚ, rather than ü&#780;) being used for all problematic characters.
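
A minimal Python sketch of this mixed strategy follows. The set of "safe" base letters (lowercase a, e, o, u) is an assumption inferred from the problem cases named above (i, ü and capitals); the converter's actual list is not given here, and the name `cross_browser` is hypothetical.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}

# Assumed safe bases: everything except i, ü and capital letters.
SAFE_BASES = set("aeou")

def cross_browser(letter: str, tone: int) -> str:
    """Code one toned letter, choosing the style by base letter."""
    mark = COMBINING[tone]
    if letter in SAFE_BASES:
        # Safe case: plain letter plus the combining-mark reference.
        return f"{letter}&#{ord(mark)};"
    # Problem case: fold into a precomposed character and code that.
    composed = unicodedata.normalize("NFC", letter + mark)
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in composed
    )

print(cross_browser("a", 3))  # a&#780;
print(cross_browser("ü", 3))  # &#474;
```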
code soup
This uses no combining diacritical marks at all. Using a text editor to work on a Web page with a lot of this sort of coding is a real headache.
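
For comparison, here is a small Python sketch of what "code soup" output looks like: every non-ASCII character becomes its own decimal character reference. The function name `code_soup` is hypothetical, and the sketch assumes the input already carries tone marks.

```python
import unicodedata

def code_soup(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference."""
    # Normalize first so letter + combining mark becomes one character.
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in composed
    )

print(code_soup("H\u00e0nz\u00ec"))  # H&#224;nz&#236;
```

Even a two-syllable word picks up two opaque numbers, which is why long passages coded this way are hard to read in a text editor.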

tag style:

no <span> tags
Although this should be no problem, use with caution. When used on double-byte operating systems (such as the traditional Chinese version of Windows 98), most browsers (other than IE) get confused by the presence of special characters, such as those with diacritical marks. This causes some characters not to be shown at all, or even to be rendered mistakenly as Chinese characters.
put all special characters in <span> tags
Placing special characters within <span> tags can clear up the double-byte system problem. (They needn't be <span> tags; any in-line tag will do. The <b> tag would be a good choice because of its brevity, as long as it's hacked in the style sheet to match normal text by using b { font-weight: normal;}.) The problem with this approach, other than its sloppy code, is that it can lead to line breaks in unwanted places, such as
... Tō
kyo ...
If you choose this option, you need to put the following in the head of the HTML of your Web page:
<style type="text/css">
</style>
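
To show what wrapping each special character looks like in practice, here is a hedged Python sketch. It assumes the text has already been coded with numeric character references and wraps each run of them in a bare <span>; the name `wrap_specials` is hypothetical, and the converter's own wrapping logic may differ.

```python
import re

def wrap_specials(coded: str) -> str:
    """Wrap each run of numeric character references in <span> tags."""
    return re.sub(r"(?:&#\d+;)+", r"<span>\g<0></span>", coded)

print(wrap_specials("H&#224;nz&#236;"))
# H<span>&#224;</span>nz<span>&#236;</span>
```

Because a tag now sits in the middle of each word, the output also shows where a browser might break a line in an unwanted place.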
put coding in <span> tags with a declared class
If you choose this option, you need to put the following in the head of the HTML of your Web page:
<style type="text/css">
.pinyin   {
    font-family: "arial unicode ms", "lucida sans unicode", sans-serif;
    }
</style>
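
A minimal Python sketch of the class-based variant follows. It wraps a whole coded passage in one <span> carrying the pinyin class declared above; whether the converter wraps whole passages or individual characters is not stated here, so treat this as an assumption. The name `wrap_with_class` is hypothetical.

```python
def wrap_with_class(coded: str, css_class: str = "pinyin") -> str:
    """Wrap already-coded pinyin in a <span> with the given class."""
    return f'<span class="{css_class}">{coded}</span>'

print(wrap_with_class("H&#224;nz&#236;"))
# <span class="pinyin">H&#224;nz&#236;</span>
```

Keeping the font list in one class rule means a later font change only touches the style sheet, not every tag.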
put coding in styled <span> tags
This uses in-line CSS to set the required fonts for the hanyu pinyin text. If you choose this option, you need to put the following in the head of the HTML of your Web page:
<style type="text/css">
</style>

Last updated: December 3, 2003