Shift-JIS text to UTF-8

I searched for a way to convert Japanese Shift-JIS text to UTF-8 but couldn’t find a suitable answer, so here is my proposed method.

I first encountered Japanese Shift-JIS when cutting and pasting Japanese text that my wife had typed on a Windows computer. I thought it started out as UTF-8, which is best for the web, but somewhere during the cut and paste in the Windows environment it was converted to Shift-JIS.

Shift-JIS is a horrible way of encoding Japanese text. It is only used for Japanese, and there are several versions of it. For us PC programmers, the one of interest is Microsoft Code Page 932. Its mapping table pairs every Japanese character supported on a PC with a Unicode code point, and all of those code points fit in a 16-bit (word) value.

Shift-JIS character codes are either 8 bits or 16 bits long. So in code we need to check whether each byte we process is the lead byte of a double-byte character or a complete single-byte character.

Then we can look the code up in the table to get its 16-bit Unicode value.
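
A minimal Python sketch of that byte check might look like the following. The lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC) are the ones used by Code Page 932; the function names are made up for illustration.

```python
def is_lead_byte(b: int) -> bool:
    """Return True if b starts a double-byte character in Code Page 932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_codes(data: bytes) -> list:
    """Split raw Shift-JIS bytes into 8-bit or 16-bit character codes."""
    codes = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b):
            if i + 1 >= len(data):
                raise ValueError("input ends in the middle of a double-byte character")
            # Combine lead byte and trail byte into one 16-bit code.
            codes.append((b << 8) | data[i + 1])
            i += 2
        else:
            # ASCII or half-width katakana: the byte is the whole code.
            codes.append(b)
            i += 1
    return codes
```

Each code that comes out of `split_codes` is what would then be looked up in the Code Page 932 table.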

UTF-8 is becoming the most popular way to encode text for web pages since it is backwards compatible with ASCII, which I think the majority of text on the web uses, but it can also handle extended character sets such as Kanji, removing the need for many diverse character encoding schemes.

Once we have converted the original Shift-JIS to Unicode, the next step is to convert to UTF-8. This will result in 1 to 3 bytes of data per character for our Japanese text. Kanji and full-width kana take 3 bytes in UTF-8 compared with 2 in Shift-JIS, so it is actually less efficient in terms of file size, but it makes the web a simpler and more compatible place.

We will convert each 16-bit Unicode value to a 1, 2 or 3 byte sequence. The algorithm picks the length from the range the character code falls into: values below 0x80 take 1 byte, values below 0x800 take 2 bytes, and everything else up to 0xFFFF takes 3 bytes.

Then we can write the output to a file, which should be the original text converted to UTF-8.
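
As a rough sketch, the range switch could be written like this in Python. The function name `utf8_encode` is hypothetical, and the sketch only covers 16-bit values, which is all the Code Page 932 table needs.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one 16-bit Unicode value as a 1, 2 or 3 byte UTF-8 sequence."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles 16-bit values")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid characters")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

Almost all Japanese characters land in the last branch, which is why the converted text tends to grow.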

To summarize: convert the Shift-JIS bytes or words to Unicode word values via a lookup table (based on Code Page 932), e.g. an associative array. Then convert each 16-bit value to UTF-8. Or do it in one step if you create a direct character-mapping table.
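
Put together, the two-step version might be sketched like this in Python. The table here is a tiny three-entry excerpt for illustration only; a real one would be loaded from the full Code Page 932 mapping. The names `CP932_TO_UNICODE` and `sjis_to_utf8` are made up, and the UTF-8 step leans on Python's built-in encoder rather than hand-written bit shifting.

```python
# Illustrative excerpt only: Shift-JIS code -> Unicode value.
CP932_TO_UNICODE = {
    0x93FA: 0x65E5,  # 日
    0x967B: 0x672C,  # 本
    0x8CEA: 0x8A9E,  # 語
}


def sjis_to_utf8(data: bytes) -> bytes:
    """Convert Shift-JIS bytes to UTF-8 via a lookup table."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = data[i]
        i += 1
        # Lead byte of a double-byte character: pull in the trail byte.
        if 0x81 <= code <= 0x9F or 0xE0 <= code <= 0xFC:
            if i >= len(data):
                raise ValueError("truncated double-byte character")
            code = (code << 8) | data[i]
            i += 1
        if code < 0x80:
            unicode_value = code  # ASCII maps to itself in Code Page 932
        else:
            # Raises KeyError for anything missing from the excerpt.
            unicode_value = CP932_TO_UNICODE[code]
        out += chr(unicode_value).encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    sample = b"\x93\xfa\x96\x7b\x8c\xea OK"
    print(sjis_to_utf8(sample).decode("utf-8"))
```

For the one-step variant, the dictionary values would be the finished UTF-8 byte strings instead of Unicode values.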

Useful links:
Code Page 932
UTF-8

Disclaimer:
I haven’t actually written the code to do this yet, just researched how I would do it.
In my case I need to convert forum posts to WordPress blog posts, but it is a low-priority task right now.

If there is an easier way, please share in the comments. I was thinking that there may be a way to cut and paste that avoids the conversion, or a fancy PHP function for multi-byte text.
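
One easier route, for what it's worth, is to let a language's built-in codecs do both steps. Python ships a `cp932` codec, so a minimal sketch could look like the following; the function names and file paths are placeholders. PHP's mbstring extension offers `mb_convert_encoding` for the same kind of job.

```python
def convert_sjis_to_utf8(data: bytes) -> bytes:
    """Decode Code Page 932 bytes to text, then encode the text as UTF-8."""
    text = data.decode("cp932")  # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read a Shift-JIS file and write a UTF-8 copy of it."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(convert_sjis_to_utf8(data))
```

Calling `convert_file("post_sjis.txt", "post_utf8.txt")` would convert a whole file in one go, assuming the input really is Code Page 932.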
