Conversion between Hong Kong Supplementary Character Set (HKSCS) and Unicode

LINUS TOSHIHIRO TANAKA

Principal Technical Staff

Oracle Corporation

email: Linus.Tanaka@

phone: +1-650-506-8049

Abstract

There are two written Chinese languages well recognized in the computer industry. They are Simplified Chinese used primarily in Mainland China, and Traditional Chinese used primarily in Hong Kong and Taiwan. Hong Kong's written Chinese language is normally treated as Traditional Chinese, but there are many characters used in Hong Kong and Mainland China but not frequently used in Taiwan. Therefore, Hong Kong's written Chinese language may be somewhere between Traditional Chinese and Simplified Chinese but much closer to Traditional Chinese than Simplified Chinese. There are also many characters used in Hong Kong that are not used or not frequently used in other Chinese speaking countries and regions. Some of these Hong Kong specific characters were not included even in Unicode3.0.

To address these issues, the Hong Kong government (now the government of the Hong Kong Special Administrative Region, or Hong Kong S.A.R.) defined the Government Common Character Set (GCCS), based on the Big5 encoded character set. GCCS included around 3,000 extra characters on top of the standard Big5. About half of them are included in the GBK encoded character set, and thus also in Unicode1.1. The remaining half were not included in Big5, GBK, or Unicode1.1. Some of these Hong Kong specific characters have been included in Unicode3.0, but others are still missing from Unicode3.0. (Many of them are now included in Unicode3.1.)

In September 1999, the Hong Kong S.A.R. government defined the Hong Kong Supplementary Character Set (HKSCS), the successor of GCCS. Unlike GCCS, HKSCS defines a precise mapping between HKSCS and Unicode1.1, and also between HKSCS and Unicode3.0.

Oracle has implemented HKSCS in Oracle8i Release 3 (8.1.7). The implementation handles the mapping between HKSCS and Unicode3.0, as well as the compatibility mapping between HKSCS and Unicode1.1. Although HKSCS is very carefully defined by the Hong Kong S.A.R. government, there are a small number of implementation-dependent issues. In this paper, I explain the specific issues that arise when implementing HKSCS, and what Oracle has done about them.

1 Introduction

1.1 What is HKSCS? Why is it necessary?

More than 20 years ago, people started handling East Asian languages on computers. Although it’s now possible to handle some spoken languages on computers, discussions here are only for written languages. Many issues that are specific to East Asian languages are for written languages because of the number of different characters they use, and one of the East Asian written languages is the focus of this paper.

In the computer industry, we normally recognize four East Asian written languages. They are Japanese, Korean, Simplified Chinese, and Traditional Chinese. (Historical Vietnamese could be added in this group of written languages.) Hong Kong’s written Chinese language and Taiwan’s written Chinese language are both normally treated as Traditional Chinese, but there seem to be many differences between them. One of the biggest differences between Hong Kong’s written Chinese language and Taiwan’s written Chinese language is that there are many Chinese characters used in Hong Kong that are not used or not frequently used in Taiwan.

When we handle a written language on computers, there needs to be an encoded character set. To handle Hong Kong’s written Chinese language, most people have been using the Big5 encoded character set. However, Big5 is missing many Chinese characters used in Hong Kong. Some of these characters are included in the GBK encoded character set, and thus also in Unicode1.1, but others are included in neither GBK nor Unicode1.1.

There are a few different approaches to solve this issue. The first approach is to use Unicode. With Unicode1.1, 2.0, or 2.1, Hong Kong probably needs around 2,000 characters that are not included in these versions of Unicode. Luckily, Unicode has the Private Use Area (PUA) that can be used for this purpose. PUA in the Basic Multilingual Plane (BMP) could encode up to 6,400 characters, which is probably big enough for Hong Kong. If we use Unicode3.0, the number of characters that need to be encoded in PUA is reduced from around 2,000 to around 1,500. This is because Unicode3.0 added around 500 characters that Hong Kong needs. If we use Unicode3.1, the use of PUA is further reduced because more than 1,000 characters that Hong Kong needs (but not included in Unicode3.0) are added in Unicode3.1.
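The capacity argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes the BMP Private Use Area is the range U+E000 through U+F8FF, and it reuses the rough "around 2,000" and "around 1,500" figures quoted above, which are estimates rather than exact counts.

```python
# BMP Private Use Area, assumed here to be U+E000 through U+F8FF.
PUA_FIRST = 0xE000
PUA_LAST = 0xF8FF

pua_capacity = PUA_LAST - PUA_FIRST + 1

# Rough numbers of Hong Kong characters missing from each Unicode version,
# as quoted in the text above (estimates, not exact counts).
missing_before_unicode30 = 2000
missing_with_unicode30 = 1500

print("PUA capacity:", pua_capacity)
print("fits before Unicode3.0:", missing_before_unicode30 <= pua_capacity)
print("fits with Unicode3.0:", missing_with_unicode30 <= pua_capacity)
```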

The second approach is to use Big5. Characters that are not included in Big5 need to be encoded using the User Defined Character (UDC) mechanism. When we take this approach, we need to make sure that UDC is supported, because vendor-specific implementations of Big5 may or may not support it. We could encode up to around 6,000 characters as UDC on top of the standard Big5, which is probably enough for Hong Kong.

The third approach is to use GBK. Characters included in the standard Big5 are also included in GBK. In addition, Hong Kong uses some characters included in GBK but not included in Big5, so this is one advantage of using GBK. We could encode up to around 2,000 characters (as the UDC) on top of the standard GBK. Unfortunately, this number (around 2,000 characters as the UDC) is probably not big enough for Hong Kong even if GBK has one advantage mentioned above. Therefore, this approach probably doesn’t work for Hong Kong.

Although two of the three approaches mentioned above (Unicode with PUA, and Big5 with UDC) should be able to handle the characters that Hong Kong needs, there is still one big problem. If many people independently define characters using PUA or UDC, the same character codes have different meanings for different people. This interoperability problem can be solved by a standard encoded character set that defines all the characters that Hong Kong needs.

The Hong Kong government defined the Government Common Character Set (GCCS), which partly solved the interoperability problem. GCCS is based on Big5. GCCS included around 3,000 characters (as UDC) on top of the standard Big5. About half of them are included in GBK, and thus also in Unicode1.1. The remaining half were not included in Big5, GBK, or Unicode1.1. The table below shows the relationship among these character sets. Although many GCCS characters are included in Unicode1.1, it was common practice to map all GCCS characters (around 3,000 characters encoded as Big5’s UDC) to the PUA when converting to Unicode. Using Unicode’s PUA for characters that are already in Unicode was a problem.

| |Big5 |GCCS |GBK |Unicode1.1 |
|1,471 characters | |included | | |
|Big5 characters |included |included |included |included |
|1,578 characters | |included |included |included |
|other GBK characters | | |included |included |
|other Unicode1.1 characters | | | |included |

In September 1999, the Hong Kong S.A.R. government defined the Hong Kong Supplementary Character Set (HKSCS). HKSCS defines three ways to encode characters. The first is to use Big5 encoding with UDC. This Big5 encoding of HKSCS is the successor of GCCS. The second is to use Unicode1.1 (or ISO/IEC 10646-1:1993). The third is to use Unicode3.0 (or ISO/IEC 10646-1:2000). One of the greatest strengths of HKSCS is that it defines a precise mapping between Unicode1.1 and the Big5 encoding of HKSCS, and also between Unicode3.0 and the Big5 encoding of HKSCS. HKSCS also added 1,759 characters on top of GCCS, so it provides much better coverage than GCCS for the characters used in Hong Kong. In HKSCS, Unicode’s PUA is used only for the characters that are not yet included in Unicode. HKSCS therefore solved the problem mentioned above for GCCS. We should not forget that some characters in HKSCS are not included in Unicode (1,648 characters when using Unicode3.0, or 2,175 characters when using Unicode1.1, 2.0, or 2.1). These characters need to be mapped to the PUA until Unicode includes them. The good news is that more than 1,000 of the remaining 1,648 characters are included in Unicode3.1. The following table shows the relationship among Big5, GCCS, HKSCS, Unicode1.1, and Unicode3.0.

| |Big5 |GCCS |HKSCS |Unicode 1.1 |Unicode 3.0 |
|106 characters | |included | | | |
|1,648 characters | |(*1) |included | | |
|527 characters | |(*1) |included | |included |
|Big5 characters |included |included |included |included |included |
|1,578 characters | |included |included |included |included |
|949 characters | | |included |included |included |
|other Unicode1.1 characters | | | |included |included |
|other Unicode3.0 characters | | | | |included |

(*1) Among these 1,648 + 527 = 2,175 characters, 1,365 characters are included in GCCS.

In the table above, the total number of GCCS characters on top of Big5 is ( 106 + 1,365 ) + 1,578 = 1,471 + 1,578 = 3,049 characters. The total number of HKSCS characters on top of Big5 is 1,648 + 527 + 1,578 + 949 = 4,702 characters. HKSCS added ( 1,648 + 527 - 1,365 ) + 949 = 810 + 949 = 1,759 characters, and removed 106 characters from GCCS.
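The totals above can be reproduced directly from the table rows. The sketch below only restates the figures given in this paper; the variable names are mine and purely illustrative.

```python
# Row counts taken from the table above.
gccs_only = 106              # in GCCS, removed in HKSCS
not_in_unicode30 = 1648      # in HKSCS, in neither Unicode1.1 nor Unicode3.0
new_in_unicode30 = 527       # in HKSCS, in Unicode3.0 but not in Unicode1.1
in_gccs_footnote = 1365      # footnote (*1): part of the 1,648 + 527 rows
shared_with_gbk = 1578       # in GCCS and HKSCS, also in Unicode1.1 and 3.0
hkscs_new_in_unicode11 = 949 # in HKSCS only, already in Unicode1.1

gccs_total = (gccs_only + in_gccs_footnote) + shared_with_gbk
hkscs_total = (not_in_unicode30 + new_in_unicode30
               + shared_with_gbk + hkscs_new_in_unicode11)
hkscs_added = ((not_in_unicode30 + new_in_unicode30 - in_gccs_footnote)
               + hkscs_new_in_unicode11)
hkscs_removed = gccs_only

print(gccs_total, hkscs_total, hkscs_added, hkscs_removed)
```

As a cross-check, the GCCS total plus the added characters minus the removed characters should equal the HKSCS total.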

With HKSCS, it is now much easier to handle all the characters that Hong Kong needs. Moreover, since HKSCS clarified which characters Hong Kong needs, which of them are included in Unicode3.0, and which are not yet included, it is a very important standard from which future versions of Unicode (and ISO/IEC 10646) could benefit.

1.2 What kinds of characters are in HKSCS?

GCCS is based on Big5. Big5 encoding of HKSCS is the successor of GCCS. Therefore, we can say that the starting point of HKSCS was Big5. We need to handle all characters of Big5 (in Big5 encoding, or in one of the Unicode encoding forms) before handling HKSCS.

HKSCS includes roughly the following characters. I listed characters of Big5 because HKSCS doesn’t make sense without them.

• Standard Big5 characters (not explicitly defined in HKSCS, but they are necessary for HKSCS to be meaningful).

• Characters that have been used in various vendor-specific implementations of Big5 (Cyrillic, Hiragana, Katakana, enclosed numbers, etc.). The definitions of these characters in HKSCS minimized the implementation dependency of HKSCS.

• Traditional Chinese characters used in Hong Kong and Taiwan but not included in the standard Big5.

• Traditional Chinese characters used in Hong Kong but not used in Taiwan.

• Simplified Chinese characters used in Hong Kong and Mainland China.

• Some other symbols which are not included in the standard Big5 (Pinyin, IPA, radicals, etc.).

Although the characters in HKSCS are mostly Traditional Chinese characters, there are some Simplified Chinese characters included in HKSCS. The next section lists some of them.

1.3 Simplified Chinese characters in HKSCS

Below are some of the Simplified Chinese characters in HKSCS. Since the standard Big5 includes the Traditional counterparts of these characters, HKSCS effectively includes both the Simplified and Traditional forms of these characters.

The fact that Big5 doesn’t include these Simplified Chinese characters while HKSCS does suggests that Hong Kong's written Chinese language may be somewhere between Traditional Chinese and Simplified Chinese (but much closer to Traditional Chinese than Simplified Chinese).

HKSCS includes only a small number of Simplified Chinese characters, though. Big5 encoding of HKSCS doesn’t have enough space for the full range of Simplified Chinese characters.

| |Character |HKSCS Big5 encoding |Mapping to Unicode |
|Simplified |业 |0x9EB2 |U+4E1A |
|Traditional |業 |0xB77E |U+696D |
|Simplified |东 |0x9DD6 |U+4E1C |
|Traditional |東 |0xAA46 |U+6771 |
|Simplified |亚 |0x9DBA |U+4E9A |
|Traditional |亞 |0xA8C8 |U+4E9E |
|Simplified |亿 |0x89D2 |U+4EBF |
|Traditional |億 |0xBBF5 |U+5104 |
|Simplified |仪 |0x9DA9 |U+4EEA |
|Traditional |儀 |0xBBF6 |U+5100 |
|Simplified |侨 |0x8950 |U+4FA8 |
|Traditional |僑 |0xB9B4 |U+50D1 |
|Simplified |兴 |0x8952 |U+5174 |
|Traditional |興 |0xBFB3 |U+8208 |
|Simplified |农 |0x8953 |U+519C |
|Traditional |農 |0xB941 |U+8FB2 |

2 Implementing HKSCS

HKSCS is an excellent standard, and implementing HKSCS is quite straightforward. In particular, the precise mapping between Unicode (1.1 and 3.0) and the Big5 encoding of HKSCS is really useful when implementing HKSCS.

From the point of view of HKSCS, Unicode1.1, 2.0, and 2.1 are almost the same, because the differences among these versions of Unicode don’t affect HKSCS. The only exception could be “euro”, which is a new character in Unicode2.1. Some implementations of Big5 might include “euro”.

The sections below discuss how to implement HKSCS for each category of characters. The main topic is how to implement the conversion between Unicode and the Big5 encoding of HKSCS.

2.1 Standard Big5 characters

Although HKSCS doesn’t explicitly define the standard Big5 characters, they are necessary when handling HKSCS. One of the implementation dependencies of HKSCS comes from the standard Big5 characters, because the Unicode mapping of some of them differs among Big5 implementations.

For example, Big5 0xA145 is mapped to Unicode U+2022 for some implementations, but the same Big5 character is mapped to Unicode U+2027 for some other implementations. Another example is “euro”, which has the very clear mapping to Unicode (2.1 or later), but it may or may not exist in a specific implementation of Big5. We need to decide how to map standard Big5 characters that have implementation dependency. Luckily, very few characters have this issue.
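One quick way to observe this kind of implementation dependency is to decode the same byte pair with two independent Big5 implementations. The sketch below uses the "big5" and "cp950" codecs from the Python standard library purely as two readily available implementations. They are not Oracle's ZHT16BIG5 or ZHT16MSWIN950, and I make no assumption here about which code point each codec returns.

```python
# Decode Big5 0xA145 with two codecs from the Python standard library.
# Which code point each codec produces is deliberately not assumed.
raw = bytes([0xA1, 0x45])

results = {name: raw.decode(name) for name in ("big5", "cp950")}

for name, char in results.items():
    print(f"{name}: U+{ord(char):04X}")

if len(set(results.values())) > 1:
    print("The two implementations disagree on Big5 0xA145.")
```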

2.2 Characters not in Big5 but in both Unicode1.1 and 3.0

Since these characters are included in both Unicode1.1 and 3.0, mapping from the Big5 encoding of HKSCS to Unicode is simple. We can just follow the mapping described in the HKSCS standard. Mapping from Unicode to the Big5 encoding of HKSCS has two paths: one from the non-PUA codepoint, and the other from the PUA codepoint that was previously used for the same character. An example is shown below.

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

[pic] 0xFA41  <--------------------->  U+92DB

      0xFA41  <----------------------  U+E001
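The diagram above can be read as one two-way table plus one extra one-way table. The following is a minimal sketch of that reading, using only the single example character. The dictionary and function names are hypothetical and do not reflect Oracle's internal data structures.

```python
# Two-way mapping (Big5 encoding of HKSCS <-> Unicode) plus an extra
# one-way mapping (Unicode PUA -> Big5), for the one example above.
DECODE = {0xFA41: 0x92DB}                             # Big5 -> Unicode
ENCODE = {uni: big5 for big5, uni in DECODE.items()}  # Unicode -> Big5
ENCODE_FROM_PUA = {0xE001: 0xFA41}                    # PUA -> Big5, one-way

def to_unicode(big5_code: int) -> int:
    """Convert a Big5 (HKSCS) code to a Unicode codepoint."""
    return DECODE[big5_code]

def to_big5(code_point: int) -> int:
    """Convert a Unicode codepoint to Big5, accepting the legacy PUA value."""
    if code_point in ENCODE:
        return ENCODE[code_point]
    return ENCODE_FROM_PUA[code_point]

# Data that arrives with the legacy PUA value leaves with the real codepoint.
print(hex(to_unicode(to_big5(0xE001))))
```

With this arrangement, text that still carries the PUA value is normalized to the non-PUA codepoint after one round trip through the Big5 encoding of HKSCS.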

2.3 Characters not in Big5 nor Unicode1.1 but in Unicode3.0

For these characters, the implementation depends on whether Unicode3.0 is supported in the system or not. Examples below show how to map these characters.

If Unicode3.0 is supported:

Big5 encoding of HKSCS          Unicode3.0, or 3.1

[pic] 0xFA45  <--------------------->  U+42B5

      0xFA45  <----------------------  U+E005

If Unicode3.0 is not yet supported:

Big5 encoding of HKSCS          Unicode1.1, 2.0, or 2.1

[pic] 0xFA45  <--------------------->  U+E005

Since Oracle8i Release 3 (8.1.7) supports all codepoints of Unicode3.0, we took the first choice shown above.
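The two cases above differ only in which tables are loaded. As a rough sketch (the function name and the boolean flag are hypothetical, and only the one example character is included), the choice could be expressed like this:

```python
def build_tables(unicode30_supported: bool):
    """Return (decode, encode) tables for the single example character."""
    if unicode30_supported:
        decode = {0xFA45: 0x42B5}
        # The PUA value is accepted on input only, for older data.
        encode = {0x42B5: 0xFA45, 0xE005: 0xFA45}
    else:
        decode = {0xFA45: 0xE005}
        encode = {0xE005: 0xFA45}
    return decode, encode

for supported in (True, False):
    decode, encode = build_tables(supported)
    print(supported, {hex(k): hex(v) for k, v in decode.items()})
```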

2.4 Characters not in Big5, Unicode1.1, nor Unicode3.0

As long as we use HKSCS based on Unicode3.0, the only possible mapping for these characters is to the PUA. An example is shown below. (This mapping will have to change once the Hong Kong government publishes a new version of HKSCS based on a new version of Unicode, as explained in the “Unicode3.1 and later versions” section of this paper.)

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, or 3.0

[pic] 0xFA40  <--------------------->  U+E000

2.5 Unified characters (84 cases in total)

This is probably the most difficult case. There are two possible approaches. Treating unified characters as “not unified” could be better for compatibility. Treating them as “unified” could be better for migrating to Unicode. The examples below show the differences between these approaches.

“Compatibility” approach

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

[pic] 0xADC5  <--------------------->  U+5029

[pic] 0xFA5F  <--------------------->  U+E01F

“Unicode migration” approach

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

[pic] 0xADC5  <--------------------->  U+5029

      0xADC5  <----------------------  U+E01F

[pic] 0xFA5F  ---------------------->  U+5029

We believe that the industry is moving toward Unicode. Therefore, Oracle8i Release 3 (8.1.7) took the “Unicode migration” approach. Once a future version of Unicode includes all HKSCS characters, people can stop using the PUA except for truly private-use characters, and achieve full interoperability for all HKSCS characters. The “Unicode migration” approach shown above is an important step toward this goal.

For people who really want to take the “Compatibility” approach, Oracle provides a way: a modified mapping can be defined through the new Locale Builder tool in Oracle9i Database.
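The trade-off between the two approaches shows up when a Big5 code is converted to Unicode and back. The sketch below encodes both diagrams as small tables for the one example pair. The names are hypothetical, and the real mapping covers 84 cases rather than one.

```python
# Tables for the single example pair (0xADC5 and 0xFA5F) shown above.
COMPATIBILITY = {
    "decode": {0xADC5: 0x5029, 0xFA5F: 0xE01F},
    "encode": {0x5029: 0xADC5, 0xE01F: 0xFA5F},
}
UNICODE_MIGRATION = {
    "decode": {0xADC5: 0x5029, 0xFA5F: 0x5029},
    "encode": {0x5029: 0xADC5, 0xE01F: 0xADC5},
}

def round_trip(tables: dict, big5_code: int) -> int:
    """Convert Big5 -> Unicode -> Big5 with the given pair of tables."""
    return tables["encode"][tables["decode"][big5_code]]

for name, tables in (("compatibility", COMPATIBILITY),
                     ("unicode migration", UNICODE_MIGRATION)):
    print(name, hex(round_trip(tables, 0xFA5F)))
```

Under the “Compatibility” tables the code 0xFA5F survives the round trip unchanged but remains in the PUA on the Unicode side. Under the “Unicode migration” tables it is folded into 0xADC5, which gives up the distinction in Big5 in exchange for a standard Unicode codepoint.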

2.6 Characters that are not verifiable (22 cases in total)

For these 22 characters, there is no way for us to map them to any existing Unicode characters. Therefore, the only mapping we can use is to the PUA. An example is shown below.

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

[pic] 0x9EAC  <--------------------->  U+ED2B

2.7 Reserved codepoints of the Big5 encoding of HKSCS

When implementing an encoded character set and conversion to Unicode, we could choose not to map the reserved codepoints to any valid Unicode codepoints so that we can be sure that these reserved codepoints don’t have any meaning. However, HKSCS probably needs a different approach. Since all GCCS codepoints were mapped to Unicode PUA, we probably should keep this PUA mapping even for the reserved codepoints of the Big5 encoding of HKSCS. Here is an example.

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

      0x8540  <--------------------->  U+F12C

2.8 User defined characters of HKSCS

HKSCS itself makes heavy use of Big5 UDC and Unicode PUA, but it also provides a UDC area in the Big5 encoding (0x8140 through 0x84FE), which is mapped to Unicode PUA (U+EEB8 through U+F12B), so that users can define their own characters. An example is shown below.

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, 3.0, or 3.1

      0x8140  <--------------------->  U+EEB8
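The endpoints of the UDC range suggest a simple linear formula. The sketch below assumes that Big5 trail bytes are 0x40 through 0x7E and 0xA1 through 0xFE (157 codes per lead byte), and that the UDC area is mapped to the PUA sequentially in Big5 code order. Both are my assumptions, although they agree with the two endpoints given above, since 4 lead bytes times 157 codes is 628, the size of U+EEB8 through U+F12B.

```python
CODES_PER_LEAD = 157  # assumed: 63 trail bytes in 0x40-0x7E, 94 in 0xA1-0xFE
UDC_PUA_BASE = 0xEEB8

def trail_index(trail: int) -> int:
    """Position of a trail byte within its lead-byte row (0 to 156)."""
    if 0x40 <= trail <= 0x7E:
        return trail - 0x40
    if 0xA1 <= trail <= 0xFE:
        return trail - 0xA1 + 63
    raise ValueError(f"invalid Big5 trail byte: {trail:#04x}")

def udc_to_pua(big5_code: int) -> int:
    """Map a code in the HKSCS UDC area (0x8140-0x84FE) to the Unicode PUA."""
    lead, trail = big5_code >> 8, big5_code & 0xFF
    if not 0x81 <= lead <= 0x84:
        raise ValueError(f"not in the HKSCS UDC area: {big5_code:#06x}")
    return UDC_PUA_BASE + (lead - 0x81) * CODES_PER_LEAD + trail_index(trail)

print(hex(udc_to_pua(0x8140)), hex(udc_to_pua(0x84FE)))
```

The same arithmetic would place the next lead byte, 0x85, at U+F12C, which matches the reserved-codepoint example (0x8540 mapped to U+F12C) in the previous section.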

2.9 Unicode3.1 and later versions

If some of the characters that are not in Big5, Unicode1.1, nor Unicode3.0 are included in Unicode3.1 or later, then those characters need a new mapping between Unicode3.1 (or later) and the Big5 encoding of HKSCS. However, before changing the mapping, we should wait until the Hong Kong government publishes a new version of HKSCS based on a new version of Unicode (or ISO/IEC 10646). An example of changing the mapping is shown below.

Before the change:

Big5 encoding of HKSCS          Unicode1.1, 2.0, 2.1, or 3.0

[pic] 0xFA40  <--------------------->  U+E000

After the change:

Big5 encoding of HKSCS          Unicode3.1

[pic] 0xFA40  <--------------------->  U+20547

      0xFA40  <----------------------  U+E000

Once a future version of HKSCS provides the mapping to the new Unicode codepoints, Oracle will change the mapping as shown above.

Please note that additional Chinese characters in Unicode3.1 and later versions are outside of BMP, which means that surrogate pairs in UTF-16 and 4-byte values in UTF-8 will be necessary. Oracle9i Database will support both of them through the AL16UTF16 and AL32UTF8 character sets.
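To make the encoding consequences concrete, the sketch below takes U+20547, the example codepoint from the mapping above, and shows its UTF-16 and UTF-8 forms using Python's built-in codecs. The manual surrogate calculation follows the standard UTF-16 algorithm. Nothing here is specific to Oracle's AL16UTF16 or AL32UTF8 implementations.

```python
code_point = 0x20547  # outside the BMP
char = chr(code_point)

utf16 = char.encode("utf-16-be")  # two 16-bit code units (a surrogate pair)
utf8 = char.encode("utf-8")       # a 4-byte sequence

# Standard UTF-16 surrogate computation for codepoints above U+FFFF.
offset = code_point - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)

print("UTF-16:", utf16.hex(), f"(surrogates {high:04X} {low:04X})")
print("UTF-8: ", utf8.hex())
```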

2.10 New characters in HKSCS

We need to be prepared for new characters in HKSCS. In fact, there seem to be at least 31 characters that will be included in HKSCS in the future. New characters in HKSCS could belong to any of the categories below.

• Characters not in Big5 but in both Unicode1.1 and 3.0

• Characters not in Big5 nor Unicode1.1 but in Unicode3.0

• Characters not in Big5, Unicode1.1, nor Unicode3.0

All of these cases have already been explained for the existing HKSCS characters, and implementing new HKSCS characters can be handled in the same way as existing HKSCS characters (as long as we can get precise mapping between Unicode and Big5 encoding of HKSCS).

If you want, you can define a new mapping through the new Locale Builder tool in Oracle9i Database and include the new HKSCS characters in the mapping. Or, if adding those characters is not urgent for you, you can just wait for the new version of the product that includes the new HKSCS characters.

3 Added feature in Oracle’s HKSCS implementation

In addition to the mapping explained above, Oracle’s implementation of HKSCS has one more feature. When Oracle converts from Unicode to Big5 encoding of HKSCS, or from a Simplified Chinese character set to Big5 encoding of HKSCS, if HKSCS doesn’t include a given Simplified Chinese character, Oracle tries to map it to the corresponding Traditional Chinese character.

This feature is activated only if a target character set (Big5 encoding of HKSCS, in this case) doesn’t include a given Simplified Chinese character, and it’s effective only if the target character set includes the corresponding Traditional Chinese character. Without this feature, the given character would have been mapped to a replacement character such as a question mark. This feature is useful when trying to read Simplified Chinese with Big5 encoding of HKSCS.

The other direction of this feature is also supported. For example, when Oracle converts from Big5 encoding of HKSCS to the GB2312-80 encoded character set, Traditional Chinese characters that are not included in GB2312-80 will be converted to the corresponding Simplified Chinese characters (if the corresponding Simplified Chinese characters are included in GB2312-80). This is useful when trying to read Hong Kong’s Chinese with GB2312-80 or other Simplified Chinese character sets.
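The reverse direction described above can be sketched with Python's built-in "gb2312" codec standing in for a GB2312-80 implementation. The function name and the three-entry fallback table are hypothetical. The pairs themselves are taken from the table of Simplified and Traditional forms earlier in this paper, and a real implementation would need a far larger table.

```python
# Traditional -> Simplified pairs from the table earlier in this paper.
TRADITIONAL_TO_SIMPLIFIED = {
    "\u696D": "\u4E1A",
    "\u6771": "\u4E1C",
    "\u4E9E": "\u4E9A",
}

def encode_with_fallback(text: str, codec: str = "gb2312") -> bytes:
    """Encode text, falling back to the Simplified form when the
    Traditional character is missing from the target character set."""
    out = bytearray()
    for char in text:
        try:
            out += char.encode(codec)
        except UnicodeEncodeError:
            fallback = TRADITIONAL_TO_SIMPLIFIED.get(char)
            if fallback is None:
                out += b"?"  # replacement character, as without the feature
            else:
                out += fallback.encode(codec, errors="replace")
    return bytes(out)

print(encode_with_fallback("\u6771\u4E9E").decode("gb2312"))
```

The fallback is consulted only after the direct conversion fails, which mirrors the rule that the feature is activated only when the target character set lacks the given character.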

4 Database character set migration

So far, the discussions in this paper have mostly been about the conversions between Unicode and Big5 encoding of HKSCS. Those conversions are very important when a database needs to be migrated from one encoding to another. The diagram below shows some migration paths. There are seven migrations shown in the diagram. These seven migrations are explained from #1 to #7 as shown in the diagram.

[pic]

4.1 From Big5 to CodePage 950

If the characters in the database are encoded in GCCS and the database (encoded) character set is configured as Big5 (Oracle’s name is ZHT16BIG5), the database needs to be migrated to CodePage 950 (Oracle’s name is ZHT16MSWIN950), which is compatible with GCCS, with the help of the “Character Set Scanner” utility in Oracle8i Release 3 (8.1.7). This is because Oracle’s implementation of Big5 (ZHT16BIG5) is not compatible with GCCS.

4.2 From Big5 to Big5 encoding of HKSCS

If the characters in the database are encoded in GCCS and the database (encoded) character set is configured as Big5 (Oracle’s name is ZHT16BIG5), and if you wish to migrate directly to the Big5 encoding of HKSCS, then the database needs to be migrated to Oracle’s ZHT16HKSCS, with the help of the “Character Set Scanner” utility in Oracle8i Release 3 (8.1.7). The “Character Set Scanner” utility is necessary for this migration because Oracle’s implementation of Big5 (ZHT16BIG5) is not compatible with GCCS.

4.3 From CodePage 950 to Big5 encoding of HKSCS

If the characters in the database are encoded in GCCS and the database (encoded) character set is configured as CodePage 950 (Oracle’s name is ZHT16MSWIN950), and if you wish to migrate to the Big5 encoding of HKSCS, then the database needs to be migrated to Oracle’s ZHT16HKSCS.

4.4 From CodePage 950 to UTF-8 (Unicode3.0)

If the characters in the database are encoded in GCCS and the database (encoded) character set is configured as CodePage 950 (Oracle’s name is ZHT16MSWIN950), and if you wish to migrate to Unicode, then the migration is to move to Oracle’s UTF8. However, this migration converts all GCCS characters to Unicode’s PUA. Therefore this migration path is not recommended for GCCS characters.

4.5 From Big5 encoding of HKSCS to UTF-8 (Unicode3.0)

If the characters in the database are encoded in the Big5 encoding of HKSCS and the database (encoded) character set is configured as Oracle’s ZHT16HKSCS, and if you wish to migrate to Unicode, then the migration is to move to Oracle’s UTF8. This migration is possible because of the conversions explained in this paper. Please note, however, that there are 1,648 characters in HKSCS that will be mapped to Unicode’s PUA.
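After such a migration it can be useful to know which PUA codepoints actually occur in the migrated text. The helper below is a hypothetical, application-level sketch. It assumes the BMP Private Use Area is U+E000 through U+F8FF and operates on ordinary Python strings rather than on database columns.

```python
def pua_code_points(text: str) -> list[int]:
    """Return the distinct BMP Private Use Area codepoints found in text."""
    return sorted({ord(ch) for ch in text if 0xE000 <= ord(ch) <= 0xF8FF})

# U+92DB is a regular codepoint; U+E000 and U+ED2B are the PUA values used
# as examples earlier in this paper.
sample = "\u92DB\uE000\uED2B\uE000"
print([f"U+{cp:04X}" for cp in pua_code_points(sample)])
```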

4.6 From Big5 encoding of HKSCS to UTF-8 (Unicode3.1 or later)

If the characters in the database are encoded in the Big5 encoding of HKSCS and the database (encoded) character set is configured as Oracle’s ZHT16HKSCS, and if you wish to migrate to Unicode3.1 or later, then the migration is to move to Oracle’s AL32UTF8 (which supports 4-byte values of UTF-8). It would be ideal if a given version of Unicode included all HKSCS characters. Note that this migration is not possible until Oracle supports Unicode3.1 or later.

4.7 From UTF-8 (Unicode3.0) to UTF-8 (Unicode3.1 or later)

Since Unicode3.0 doesn’t include all HKSCS characters, there are 1,648 characters in HKSCS that are stored using Unicode’s PUA. When moving from Unicode3.0 to Unicode3.1 or later, HKSCS characters (that are stored using Unicode’s PUA) may need to be converted to non-PUA with some application programs because the database may not know which PUA values are for Hong Kong (Taiwan and Japan also use PUA).
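An application-level conversion of this kind could be as simple as a translation table applied to data that is known to come from HKSCS. The sketch below contains a single entry, taken from the 0xFA40 example earlier in this paper. The table and function names are hypothetical, and a complete table would have to come from a future version of HKSCS.

```python
# One entry from this paper: Big5 0xFA40 was mapped to U+E000 and is
# expected to move to U+20547. A real table would hold many more entries.
HKSCS_PUA_TO_STANDARD = {0xE000: 0x20547}

def migrate_pua(text: str, table: dict = HKSCS_PUA_TO_STANDARD) -> str:
    """Replace known HKSCS PUA codepoints with their standard codepoints."""
    return text.translate(table)

before = "\uE000\uED2B"
after = migrate_pua(before)
print([f"U+{ord(ch):04X}" for ch in after])
```

Because the same PUA values may carry other meanings in data from Taiwan or Japan, a conversion like this should be applied only to data known to hold HKSCS text. Codepoints without an entry in the table, such as U+ED2B here, are left untouched.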

5 Conclusion

HKSCS is an excellent standard. It clarified and solved many issues related to Chinese characters used in Hong Kong. Implementing HKSCS is relatively easy since HKSCS provides precise mapping between Unicode (1.1 and 3.0) and Big5 encoding of HKSCS. Because of this excellent standard, there are very few implementation dependent areas. This paper tried to clarify them, and explained the approach that Oracle8i Release 3 (8.1.7) took. I hope the information in this paper is useful for those who implement HKSCS.

With HKSCS, people in Hong Kong can more comfortably read and write their Chinese language on computers and communicate effectively through the Internet. Oracle supports HKSCS in both Big5 encoding and Unicode3.0 so that all characters in Hong Kong can be stored in Oracle databases that are often connected to the Internet.

The only remaining concern is the number of products in the computer industry (hardware and software) that support HKSCS. Hopefully the number will soon increase dramatically.
