Encoding and decoding a UTF-8 string and its bytes in Java

Question

I'm working on a project in which I need to encode and decode a string in Java. My string is a UTF-8 string consisting of Persian characters. I simply want to XOR every byte with a static character and then XOR it again with the same static character to get the original back.

I wrote the code below, but it gives completely wrong results. I checked it with English characters and it works.

How can I fix this problem?

String str = "س";
char key = 'N';
byte bKey = (byte) key;

byte[] b = str.getBytes();

for (int i = 0; i < b.length; i++)
{
    b[i] = Byte.valueOf((byte) (b[i] ^ bKey));
}

String str1 = new String(b);
b = str1.getBytes();

for (int i = 0; i < b.length; i++)
{
    b[i] = (byte) (b[i] ^ bKey);
}

String str2 = new String(b);

jas · Accepted Answer · 2014-05-06 15:52:17Z

The problem arises when you create str1 from the mutated bytes. Assuming your default encoding is UTF-8, String str1 = new String(b); says: here are some bytes in UTF-8 encoding, please build a string from them. But because you XORed the bytes, they are no longer valid UTF-8, and the decoder substitutes a replacement character for each malformed sequence. If you look at the bytes retrieved from str1 with b = str1.getBytes(); you'll see they differ from the bytes you created the string with!

Really, you shouldn't be creating a string from "nonsense" bytes. Do you really need to store the XORed bytes back in a string?
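
To make the byte mismatch visible, here is a minimal sketch. The class and method names (LossyDecodeDemo, xor, throughUtf8String) are made up for illustration, and UTF-8 is passed explicitly rather than relying on the default charset. The letter is written as the escape \u0633 so the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // XOR every byte with the key; applying it twice restores the input.
    static byte[] xor(byte[] data, byte key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key);
        }
        return out;
    }

    // Decode the bytes as UTF-8 and encode the resulting String again.
    static byte[] throughUtf8String(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = "\u0633".getBytes(StandardCharsets.UTF_8); // the letter from the question
        byte[] mutated = xor(original, (byte) 'N');
        byte[] roundTripped = throughUtf8String(mutated);

        System.out.println("XORed bytes:       " + Arrays.toString(mutated));
        System.out.println("After String trip: " + Arrays.toString(roundTripped));
        System.out.println("Same bytes? " + Arrays.equals(mutated, roundTripped));
    }
}
```

For this input the two arrays should differ, because the XORed bytes are not a well-formed UTF-8 sequence and are replaced during decoding.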

If you really want to do that, you can trick the system by using a single-byte encoding where all the possible byte values are valid. Then you can be sure that the bytes you put into the string will be the same ones you get out. Here's an example that's working for me:

public class B {
    public static void main(String[] args) throws Exception {
        String str = "س";
        System.out.println(str);
        char key = 'N';
        byte bKey = (byte) key;

        byte[] b = str.getBytes("UTF8");

        System.out.println("Original bytes from str:");
        for (int i = 0; i < b.length; i++) {
            System.out.println(b[i]);
        }

        System.out.println("Bytes used to create str1:");
        for (int i = 0; i < b.length; i++) {
            b[i] = (byte) (b[i] ^ bKey);
            System.out.println(b[i]);
        }

        String str1 = new String(b, "Cp1256");

        b = str1.getBytes("Cp1256");

        System.out.println("Bytes retrieved from str1:");
        for (int i = 0; i < b.length; i++) {
            System.out.println(b[i]);
            b[i] = (byte) (b[i] ^ bKey);
        }

        System.out.println("Bytes used to create str2:");
        for (int i = 0; i < b.length; i++) {
            System.out.println(b[i]);
        }

        String str2 = new String(b, "UTF8");
        System.out.println(str2);
    }
}

The output I get is:

س
Original bytes from str:
-61
-65
-30
-119
-91
Bytes used to create str1:
-115
-15
-84
-57
-21
Bytes retrieved from str1:
-115
-15
-84
-57
-21
Bytes used to create str2:
-61
-65
-30
-119
-91
س
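
A variant of the same trick, as a sketch only: it swaps Cp1256 for ISO-8859-1, which every JVM provides through StandardCharsets and which maps each byte value to exactly one char in the range U+0000 to U+00FF. The names Latin1Carrier, scramble and unscramble are hypothetical. The text is written with \u escapes so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class Latin1Carrier {
    // XOR every byte with the key; applying it twice restores the input.
    static byte[] xor(byte[] data, byte key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key);
        }
        return out;
    }

    // UTF-8 bytes -> XOR -> String in which each char carries one byte.
    static String scramble(String plain, byte key) {
        byte[] mutated = xor(plain.getBytes(StandardCharsets.UTF_8), key);
        return new String(mutated, StandardCharsets.ISO_8859_1);
    }

    // Reverse of scramble: recover the bytes, XOR again, decode as UTF-8.
    static String unscramble(String carrier, byte key) {
        byte[] mutated = carrier.getBytes(StandardCharsets.ISO_8859_1);
        return new String(xor(mutated, key), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte key = (byte) 'N';
        String plain = "\u0633\u0644\u0627\u0645"; // four Persian/Arabic letters
        String carrier = scramble(plain, key);
        System.out.println(plain.equals(unscramble(carrier, key)));
    }
}
```

The carrier string can contain control characters, so whether it survives a trip through a particular database column is a separate question that this sketch does not answer.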

Thanks! It works perfectly. Actually, I use XOR to simply encrypt some data: I want to store the XORed data in my database and decrypt it in my program. But here is the question: when I XOR the characters themselves with my key, it works without using any encoding. Why? I mean char c = str.charAt(i) ^ key and collecting the results into a string. Why does it work? And again, thanks a lot for your explanation :))
To be honest, I'd have to see exactly what you're doing and play around a bit to really understand it, but there are basically two possibilities. It may just be luck that after you XOR the character you're left with a value that is still valid UTF-16 (the in-memory encoding Java uses for chars). Or it may be that Java doesn't validate the value in this case, since it only needs to "understand" the character when it's time to convert to or from a specific encoding. In either case you would get out of the string exactly what you put in, which is the real requirement here.
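
For reference, a small sketch of the char-level XOR described in the comment above. The names CharXorDemo and xorChars are invented here. Building a String from a char[] copies the UTF-16 code units as they are, without a charset conversion step, which is why the in-memory round trip returns the original text. Note that the cast to char is needed, because ^ on two chars yields an int.

```java
class CharXorDemo {
    // XOR each UTF-16 code unit with the key; no byte encoding is involved.
    static String xorChars(String s, char key) {
        char[] chars = s.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            chars[i] = (char) (chars[i] ^ key); // cast required: ^ produces an int
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        String plain = "\u0633\u0644\u0627\u0645"; // written with escapes to avoid source-encoding issues
        String scrambled = xorChars(plain, 'N');
        String restored = xorChars(scrambled, 'N');
        System.out.println(plain.equals(restored));
    }
}
```

This only shows the round trip inside the JVM. Once the scrambled string is encoded to bytes for a file or a database, the charset used there matters again.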

kuporific · Accepted Answer · 2014-05-06 15:52:47Z

The problem occurs when you try to create a new String with the XORed bytes:

String str1 = new String(b);
b = str1.getBytes();

Since the XORed bytes do not form valid UTF-8 sequences, the decoded String does not preserve them, and getBytes() does not return the bytes you put in.

If you skip translating back into a String, your code will work fine.
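
A minimal sketch of that approach, keeping the XORed data as a byte[] from start to finish. ByteXorRoundTrip, xor and roundTrip are illustrative names, and UTF-8 is named explicitly instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;

class ByteXorRoundTrip {
    // XOR every byte with the key; applying it twice restores the input.
    static byte[] xor(byte[] data, byte key) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key);
        }
        return out;
    }

    // Encode, XOR, XOR again, decode: no String is built from the XORed bytes.
    static String roundTrip(String plain, byte key) {
        byte[] encrypted = xor(plain.getBytes(StandardCharsets.UTF_8), key);
        byte[] decrypted = xor(encrypted, key);
        return new String(decrypted, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String plain = "\u0633"; // the letter from the question, as an escape
        System.out.println(plain.equals(roundTrip(plain, (byte) 'N')));
    }
}
```

In a real program the encrypted array would be stored or sent as raw bytes between the two xor calls.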

pinxue · Accepted Answer · 2014-05-06 14:19:35Z

Firstly, str.getBytes() converts characters to bytes using the default charset, and String str1 = new String(b); uses the default charset, too. Nothing here is tied to UTF-8 unless that happens to be the default.

Also, bit operations on bytes in Java are a bit tricky; try changing every b[i] to (b[i] & 0xff).
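
A quick way to check what the mask changes, as a sketch with made-up names (MaskDemo, xorPlain, xorMasked, agreeForAllBytes). Masking matters when a byte is read as an unsigned number, but when the XOR result is cast straight back to byte, only the low 8 bits survive, so both variants should agree.

```java
class MaskDemo {
    // XOR without masking: both operands are sign-extended to int first.
    static byte xorPlain(byte b, byte key) {
        return (byte) (b ^ key);
    }

    // XOR with masking: operands are widened as unsigned values 0..255.
    static byte xorMasked(byte b, byte key) {
        return (byte) ((b & 0xff) ^ (key & 0xff));
    }

    // True if both variants give the same byte for every possible input.
    static boolean agreeForAllBytes(byte key) {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            if (xorPlain((byte) v, key) != xorMasked((byte) v, key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xD8;                 // first UTF-8 byte of U+0633
        System.out.println(b);                // signed view: -40
        System.out.println(b & 0xff);         // unsigned view: 216
        System.out.println(agreeForAllBytes((byte) 'N')); // true
    }
}
```

That matches the asker's report that adding the mask made no difference to the result.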

strings95 Over a year ago

Hmm! I added UTF-8 to the title because this problem occurs only with UTF-8 encoding. As I said, it works fine with English characters. And your answer did not work :(( I changed it to b[i] & 0xff, but the result was the same.
