Encoding variable-length utf8 byte array in Java

Question

I need to read a string that is supposed to be UTF-8 encoded, which is a variable-length encoding. I have trouble decoding the bytes into a String: I get weird characters when printing it, and they seem to be Korean. This is the code I used, without success:

import java.nio.charset.StandardCharsets;

public static String byteToUTF8(byte[] bytes) {
    // StandardCharsets.UTF_8 is always available, so there is no
    // UnsupportedEncodingException to catch and no fallback path is needed.
    return new String(bytes, StandardCharsets.UTF_8);
}

I also tried UTF-16 and got slightly better results, but it still gave me strange characters, and according to the doc provided above I should use UTF-8.

Thanks in advance for helping.

EDIT:

Base64 value: S0QtOTI2IEdHMDA2AAAAAA==\n

Just a thought, the text might also be encoded improperly on the other end. Just for reference, here are Java's supported encodings and their internal names.
– Mena, Commented Nov 3, 2016 at 15:43
I don't understand the page you linked to. Is that XML document the content you're trying to decode?
– Sotirios Delimanolis, Commented Nov 3, 2016 at 15:47
@SotiriosDelimanolis It is the Bluetooth documentation. I am trying to read the model number string from a BLE service, and its encoding has a problem.
– M. Erfan Mowlaei, Commented Nov 3, 2016 at 15:54
If I decode S0QtOTI2IEdHMDA2AAAAAA== as Base64, I get KD-926 GG006. I don't see any Korean characters.
– Nicolas Filotto, Commented Nov 7, 2016 at 8:51
Please note that I believe you misunderstand the doc: UTF-8 is a variable-length encoding because, depending on the character, it is encoded in 1 to 4 bytes.
– Nicolas Filotto, Commented Nov 7, 2016 at 10:28
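
For reference, the Base64 sample from the question can be checked with a few lines of plain Java. This is only a sketch: the class and method names are made up for illustration, and it assumes the trailing zero bytes in the sample are padding of a fixed-size field rather than part of the text. It uses java.util.Base64, which needs Java 8 or later; on older Android API levels android.util.Base64 would be the counterpart.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class ModelNumberDecoder {

    // Decodes UTF-8 bytes, ignoring any trailing 0x00 padding bytes.
    static String decodeNulPadded(byte[] bytes) {
        int end = bytes.length;
        while (end > 0 && bytes[end - 1] == 0) {
            end--;
        }
        return new String(bytes, 0, end, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = Base64.getDecoder().decode("S0QtOTI2IEdHMDA2AAAAAA==");
        System.out.println(raw.length);            // 16 bytes in total
        System.out.println(decodeNulPadded(raw));  // KD-926 GG006
    }
}
```

If the zero bytes are left in, the decoded String ends with four U+0000 characters, which some consoles and views render as odd glyphs.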

Community · Accepted Answer · 2017-05-23 12:08:51Z

Bluetooth name display issue:

The documentation for BluetoothAdapter.setName() says:

https://developer.android.com/reference/android/bluetooth/BluetoothAdapter.html#setName

Valid Bluetooth names are a maximum of 248 bytes using UTF-8 encoding, although many remote devices can only display the first 40 characters, and some may be limited to just 20.

Android Supported Versions:

The answer at https://stackoverflow.com/a/7989085/2293534 gives the list of supported Android versions.

Supported (Y) and unsupported (N) combinations of encodings are given in the table:

-------------------------------------------------------------------------------------------------------
               | DEC Korean | Korean EUC | ISO-2022-KR | KSC5601/cp949 | UCS-2/UTF-16 | UCS-4 | UTF-8 |
-------------------------------------------------------------------------------------------------------
 DEC Korean    |     -      |     Y      |      N      |       Y       |      Y       |   Y   |   Y   |
-------------------------------------------------------------------------------------------------------
 Korean EUC    |     Y      |     -      |      Y      |       N       |      N       |   N   |   N   |
-------------------------------------------------------------------------------------------------------
 ISO-2022-KR   |     N      |     Y      |      -      |       Y       |      N       |   N   |   N   |
-------------------------------------------------------------------------------------------------------
 KSC5601/cp949 |     Y      |     N      |      Y      |       -       |      Y       |   Y   |   Y   |
-------------------------------------------------------------------------------------------------------
 UCS-2/UTF-16  |     Y      |     N      |      N      |       Y       |      -       |   Y   |   Y   |
-------------------------------------------------------------------------------------------------------
 UCS-4         |     Y      |     N      |      N      |       Y       |      Y       |   -   |   Y   |
-------------------------------------------------------------------------------------------------------
 UTF-8         |     Y      |     N      |      N      |       Y       |      Y       |   Y   |   -   |
-------------------------------------------------------------------------------------------------------

Possible solutions:

Solution#1:

Michael has given a great example for conversion. For more you can check https://stackoverflow.com/a/40070761/2293534

When you call getBytes(), you are getting the raw bytes of the string encoded under your system's native character encoding (which may or may not be UTF-8). Then, you are treating those bytes as if they were encoded in UTF-8, which they might not be.
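
To make the point about getBytes() concrete, here is a small sketch that names the charset explicitly on both sides and then shows what happens when the two sides disagree. The class name, the helper and the sample text are illustrative only and do not come from the linked answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAE00"; // two Hangul syllables, written as escapes

        // Same charset on both sides: the text survives.
        System.out.println(korean.equals(
                roundTrip(korean, StandardCharsets.UTF_8, StandardCharsets.UTF_8)));      // true

        // Mismatched charsets: the bytes are misread and the text is garbled.
        System.out.println(korean.equals(
                roundTrip(korean, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))); // false

        // UTF-8 is variable-length: 1 byte for 'A', 3 bytes for each syllable here.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println(korean.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}
```

Passing the charset explicitly to both getBytes(...) and new String(...) removes the dependency on the platform default.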

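A more reliable approach would be to read the Korean EUC (ko_KR-euc) file into a Java String, then write that String out using UTF-8 encoding:

import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

InputStream in = ...
// "EUC-KR" is the Java charset name for Korean EUC; use the encoding your input really has
Reader reader = new InputStreamReader(in, "EUC-KR");
StringBuilder sb = new StringBuilder();
int read;
// read() returns one decoded character at a time, or -1 at end of stream
while ((read = reader.read()) != -1) {
  sb.append((char) read);
}
reader.close();

String string = sb.toString();

OutputStream out = ...
Writer writer = new OutputStreamWriter(out, "UTF-8");
writer.write(string);
writer.close();

N.B.: You should, of course, use the correct encoding name.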
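
The same read-then-write idea can be sketched as a reusable helper that takes Charset objects instead of encoding names and closes both streams via try-with-resources. The class and method names are hypothetical, and UTF-16BE merely stands in for the legacy source encoding in the demo, since it is guaranteed to exist on every JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcoder {

    // Reads characters from 'in' using 'from' and writes them to 'out' using 'to'.
    static void transcode(InputStream in, Charset from, OutputStream out, Charset to)
            throws IOException {
        try (Reader reader = new InputStreamReader(in, from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // In-memory convenience wrapper around the stream version.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(input), from, out, to);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] source = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_16BE); // 4 bytes
        byte[] utf8 = transcode(source, StandardCharsets.UTF_16BE, StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
    }
}
```

For a real Korean EUC input, Charset.forName("EUC-KR") could be passed as the source charset, provided the runtime ships that charset.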

Solution#2:

You can do the conversion with StringUtils; see https://stackoverflow.com/a/30170431/2293534

Solution#3:

You can use Apache Commons IO for the conversion. A good example is given here: http://www.utdallas.edu/~lmorenoc/research/icse2015/commons-io-2.4/examples/toString_49.html

import org.apache.commons.io.IOUtils;

String resource; // name of the classpath resource to read
// getClass().getResourceAsStream(resource) -> the InputStream to read from
// "UTF-8" -> the encoding to use; null means platform default
String text = IOUtils.toString(getClass().getResourceAsStream(resource), "UTF-8");
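
If pulling in a library is a concern, reading a whole stream as UTF-8 can also be sketched with the standard library alone. This is not the Commons IO implementation, only a rough equivalent under the assumption that the input fits in memory; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamText {

    // Reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // Decode once at the end so multi-byte sequences are never split.
            return new String(buffer.toByteArray(), charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "KD-926 GG006".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));
    }
}
```

Collecting the bytes first and decoding once avoids cutting a multi-byte UTF-8 sequence in half at a chunk boundary.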

Thanks I'll check and inform you, though the solution shouldn't be locale specific.

Nikolaj Hansen · Accepted Answer · 2016-11-07 01:31:37Z

I suggest you use StringUtils from the Apache Commons Codec library. I believe the methods you need are documented here:

https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/StringUtils.html

3 Comments

M. Erfan Mowlaei Over a year ago

I've seen this utility class before and neglected it because of the overhead of adding a library, but I'll give it a try and let you know the result. Note that the source has to be a byte[], and I guess converting it to Base64 or something else first, before decoding it as UTF-8, would probably ruin everything.

Nikolaj Hansen Over a year ago

Then your string is not UTF-8

M. Erfan Mowlaei Over a year ago

It should be, at least according to the docs, but there may be a problem at the company that used it to feed the BLE device.
