Charset and encoding issues

UTF-8EE

U++ does not use pure UTF-8 but an extension of it that we have named "UTF-8EE", where "EE" stands for "Error Escaped".

The problem is that not every sequence of bytes is valid UTF-8. The question is how to react to invalid input when loading a file into TheIDE (or any other editor). Reporting an error is one solution, but we sometimes have to process text files that contain several sections with different encodings, UTF-8 being only one of them. It is of course useful to have an editor capable of dealing with such files.

So let us introduce UTF-8EE. The idea is this: when an invalid input sequence is encountered, it is "escaped" into the Unicode private use area (0xE000 - 0xF8FF). Each byte of the invalid sequence is escaped as the Unicode character 0xEExx, where xx is the value of the byte (another source of that "EE" in the name). For this to work, valid UTF-8 sequences that would yield values in the range 0xEE00 - 0xEEFF are also considered invalid in UTF-8EE and are escaped too (as this is the private use area, no real characters can be there). When the Unicode text is converted back to UTF-8, the escaped characters are simply written out as bytes with their original values. This basically means that any input can be converted from UTF-8EE to 16-bit Unicode (UCS-2) and back, and the result is equal to the original.
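
The scheme described above can be sketched in a few lines. This is only an illustration written in Python, not the actual U++ code: the names `utf8ee_decode`, `utf8ee_encode` and `ESCAPE_BASE` are made up for this example, and the sketch works with Python code points instead of 16-bit units, so it accepts valid 4-byte sequences as they are (an assumption that goes beyond the text above). It also assumes that after escaping one byte the decoder simply tries again at the next byte.

```python
ESCAPE_BASE = 0xEE00  # hypothetical name: an escaped byte 0xXX becomes U+EEXX


def _sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence announced by a lead byte, or 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte, or a byte that can never start a sequence


def utf8ee_decode(data: bytes) -> str:
    """Decode bytes, escaping every byte that is not part of valid UTF-8."""
    out = []
    i = 0
    while i < len(data):
        n = _sequence_length(data[i])
        ch = None
        if n and i + n <= len(data):
            try:
                # Strict decoding rejects overlong forms and surrogates.
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                ch = None
        if ch is not None and not (ESCAPE_BASE <= ord(ch) <= ESCAPE_BASE + 0xFF):
            out.append(ch)
            i += n
        else:
            # Invalid input, or a valid sequence that lands in U+EE00..U+EEFF:
            # escape this single byte and resume at the next one.
            out.append(chr(ESCAPE_BASE + data[i]))
            i += 1
    return "".join(out)


def utf8ee_encode(text: str) -> bytes:
    """Encode text, turning escaped characters back into their raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

With these two functions, `utf8ee_decode(b"\xff")` gives the single character U+EEFF, and `utf8ee_encode(utf8ee_decode(data))` returns `data` unchanged for any input, because every decoded character is either an escaped byte or a strictly valid UTF-8 sequence that encodes back to the same bytes.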

Last edit by cxl on 07/17/2009.