
I need to read a string that is supposed to be UTF-8 encoded, but because its characters use a variable-length encoding I have trouble decoding it into a String, and I get weird characters when printing it. The characters seem to be Korean. This is the code I used, without success:

// Needs: import java.nio.charset.StandardCharsets;
public static String byteToUTF8(byte[] bytes) {
    // StandardCharsets.UTF_8 is always available, so there is no
    // UnsupportedEncodingException to catch and no fallback path needed.
    return new String(bytes, StandardCharsets.UTF_8);
}

I also tried UTF-16 and got slightly better results, but it still gave me strange characters, and according to the doc linked above I should use UTF-8.

Thanks in advance for helping.

EDIT:

Base64 value: S0QtOTI2IEdHMDA2AAAAAA==\n

  • Just a thought, the text might also be encoded improperly on the other end. Just for reference, here are Java's supported encodings and their internal names. Commented Nov 3, 2016 at 15:43
  • I don't understand the page you linked to. Is that XML document the content you're trying to decode? Commented Nov 3, 2016 at 15:47
  • @SotiriosDelimanolis it is the Bluetooth document; I am trying to read the model number string from a BLE service and its encoding has a problem. Commented Nov 3, 2016 at 15:54
  • If I decode S0QtOTI2IEdHMDA2AAAAAA== from Base64, I get KD-926 GG006; I don't see any Korean characters. Commented Nov 7, 2016 at 8:51
  • Please note that I believe you misunderstand the doc: UTF-8 is a variable-length encoding because, depending on the character, it is encoded in 1 to 4 bytes. Commented Nov 7, 2016 at 10:28
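
The Base64 check described in the comments can be reproduced with a short sketch. It assumes the posted value (without the trailing \n) is the raw payload read from the device, and that the four trailing zero bytes are padding rather than text; the class and method names are made up for illustration. The last line shows, as a guess at what happened in the UTF-16 attempt, how the same ASCII bytes pair up into unrelated code units: the first pair 0x4B 0x44 becomes U+4B44, which is a CJK ideograph.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ModelNumberDecoder {

    /** Decodes UTF-8 bytes, ignoring everything from the first NUL (0x00) byte onward. */
    static String decodeNulPadded(byte[] raw) {
        int length = 0;
        while (length < raw.length && raw[length] != 0) {
            length++;
        }
        return new String(raw, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = Base64.getDecoder().decode("S0QtOTI2IEdHMDA2AAAAAA==");

        // 12 ASCII bytes followed by 4 NUL padding bytes
        System.out.println(raw.length);

        // Plain UTF-8 decoding of the non-padding bytes
        System.out.println(decodeNulPadded(raw));

        // Misreading the same bytes as UTF-16 merges every two bytes into one code unit
        System.out.println(new String(raw, StandardCharsets.UTF_16BE));
    }
}
```

If the padding is not stripped, the decoded String still contains four U+0000 characters at the end, which some views render as boxes or question marks.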

2 Answers


Bluetooth name display issue:

If you check the documentation for BluetoothAdapter.setName(), you will find the following:

https://developer.android.com/reference/android/bluetooth/BluetoothAdapter.html#setName

Valid Bluetooth names are a maximum of 248 bytes using UTF-8 encoding, although many remote devices can only display the first 40 characters, and some may be limited to just 20.

Android Supported Versions:

If you check the link https://stackoverflow.com/a/7989085/2293534, you will get the list of supported Android versions.

Supported (Y) and unsupported (N) conversions between Korean codesets are given in the table:

-----------------------------------------------------------------------------------------------------
             | DEC Korean | Korean EUC | ISO-2022-KR | KSC5601/cp949 | UCS-2/UTF-16 | UCS-4 | UTF-8 |
-----------------------------------------------------------------------------------------------------
 DEC Korean  |      -     |      Y     |     N       |      Y        |        Y     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
 Korean EUC  |      Y     |      -     |     Y       |      N        |        N     |   N   |   N   |
-----------------------------------------------------------------------------------------------------
 ISO-2022-KR |      N     |      Y     |     -       |      Y        |        N     |   N   |   N   |
-----------------------------------------------------------------------------------------------------
KSC5601/cp949|      Y     |      N     |     Y       |      -        |        Y     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
 UCS-2/UTF-16|      Y     |      N     |     N       |      Y        |        -     |   Y   |   Y   |
-----------------------------------------------------------------------------------------------------
    UCS-4    |      Y     |      N     |     N       |      Y        |        Y     |   -   |   Y   |
-----------------------------------------------------------------------------------------------------
    UTF-8    |      Y     |      N     |     N       |      Y        |        Y     |   Y   |   -   |
-----------------------------------------------------------------------------------------------------

Possible solutions:

Solution#1:

Michael has given a great example for conversion. For more you can check https://stackoverflow.com/a/40070761/2293534

When you call getBytes(), you are getting the raw bytes of the string encoded under your system's native character encoding (which may or may not be UTF-8). Then, you are treating those bytes as if they were encoded in UTF-8, which they might not be.

A more reliable approach would be to read the ko_KR-euc file into a Java String. Then, write out the Java String using UTF-8 encoding.

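// Needs: import java.io.*;
InputStream in = ...
// Java's charset name for Korean EUC is "EUC-KR"; "ko_KR-euc" is a locale name, not a Java charset name
Reader reader = new InputStreamReader(in, "EUC-KR");
StringBuilder sb = new StringBuilder();
int read;
// Read one decoded character at a time until the end of the stream
while ((read = reader.read()) != -1){
  sb.append((char)read);
}
reader.close();

String string = sb.toString();

OutputStream out = ...
// Write the decoded text back out as UTF-8
Writer writer = new OutputStreamWriter(out, "UTF-8");
writer.write(string);
writer.close();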

N.B: You should, of course, use the correct encoding name
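
The point about treating bytes as if they were in a different encoding can be seen in a small self-contained sketch. It uses UTF-8 and ISO-8859-1 only because both are guaranteed to exist on every Java platform; the class and method names are hypothetical, and the Korean sample text is just an illustration.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    /** Encodes text as UTF-8 and decodes the bytes with the same charset. */
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Encodes text as UTF-8 but decodes the bytes as ISO-8859-1 (the wrong charset). */
    static String misreadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAE00"; // two Hangul syllables

        // Each of these two characters takes 3 bytes in UTF-8
        System.out.println(korean.getBytes(StandardCharsets.UTF_8).length);

        // Same charset on both sides: the text survives
        System.out.println(roundTripUtf8(korean).equals(korean));

        // Wrong charset on the decoding side: every byte becomes its own character
        System.out.println(misreadAsLatin1(korean).length());
    }
}
```

Passing the charset explicitly on both the encoding and the decoding side avoids depending on the platform default.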

Solution#2:

You can also do it using StringUtils: https://stackoverflow.com/a/30170431/2293534

Solution#3:

You can use Apache Commons IO for conversion. A very good example is given here: http://www.utdallas.edu/~lmorenoc/research/icse2015/commons-io-2.4/examples/toString_49.html

// Needs: import org.apache.commons.io.IOUtils;
String resource; // name of the classpath resource to read
// getClass().getResourceAsStream(resource) -> the InputStream to read from
// "UTF-8" -> the encoding to use, null means platform default
String text = IOUtils.toString(getClass().getResourceAsStream(resource), "UTF-8");
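
If adding Commons IO is not an option, roughly the same result can be had with the JDK alone. This is a minimal sketch, assuming the stream really contains text in the charset you pass in; StreamToString and readFully are hypothetical names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamToString {

    /** Reads the whole stream and decodes it with the given charset. Closes the stream. */
    static String readFully(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "KD-926 GG006".getBytes(StandardCharsets.UTF_8);
        System.out.println(readFully(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
    }
}
```

Decoding through a Reader, rather than converting each chunk of bytes separately, keeps multi-byte characters intact even when they straddle a buffer boundary.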

Resource Links:

  1. Korean Codesets and Codeset Conversion
  2. Korean Localization
  3. Changing the Default Locale
  4. Byte Encodings and Strings

1 Comment

Thanks, I'll check and let you know, though the solution shouldn't be locale-specific.

I suggest you use StringUtils from the Apache Commons Codec library. I believe the methods you need are documented here:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/StringUtils.html

3 Comments

I've seen these utils before and neglected them because of the overhead of adding a library, but I'll give it a try and let you know the result. Note that the source should be a byte[], and I guess converting it to Base64 or something else before decoding it as UTF-8 probably ruins everything.
Then your string is not UTF-8
It should be, at least according to the docs, but there may be a problem at the company that used it to feed the BLE device.
