2

I am trying to read in the content of a file to any readable form. I am using a FileInputStream to read from the file to a byte array, and then am trying to convert that byte array into a String.

So far, I have tried 3 different ways:

FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);

String s = IOUtils.toString(inputStream); //first way

String str = new String(clearTextBytes, "UTF-8"); //second way

String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for(int i=0, len=bytes.length; i<len; i++){
   bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);

When I print out each of the Strings: 1) prints out nothing, and 2 & 3) print out a lot of weird characters, such as: PK!�Q���[Content_Types].xml �(���MO�@��&��f��]���pP<*���v �ݏ�,_��i�I�(zi�N��}fڝ���h�5)�&��6Sf����c|�"�d��R�d���Eo�r�� �l�������:0Tɭ�"Э�p'䧘��tn��&� q(=X����!.���,�_�WF�L8W......

I would love any advice on how to properly convert my byte array to a String.

11
  • 7
    I'd guess your byte array does not contain a string in the first place. From the look of things you have given I'd say that's a Word document, not a txt. For reading the contents of a Word document you'd need some library like Apache POI Commented Dec 1, 2015 at 13:14
  • 3
    Are you sure the file is not a zip file ? Typically this happens when you try to read directly from a zip file and do not unzip it. Commented Dec 1, 2015 at 13:16
  • 2
    I'd guess that "first way" doesn't print anything because you've already read everything from inputStream into clearTextBytes, so there are no more bytes to read. Commented Dec 1, 2015 at 13:16
  • 1
    @StackFlowed ... and the file starts PK ;) Commented Dec 1, 2015 at 13:22
  • 1
    but that might be decrypted zip or decrypted docx Commented Dec 1, 2015 at 13:26
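The point made in the comments about the "first way" can be reproduced in isolation. The following is a minimal sketch, not the asker's code: it assumes Java 9+ (for `InputStream.readAllBytes`), uses an in-memory stream in place of the `FileInputStream`, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ExhaustedStreamDemo {
    // Reads whatever is still left in the stream and decodes it as UTF-8.
    static String readRemaining(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for the FileInputStream from the question.
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        String first = readRemaining(in);  // consumes all five bytes
        String second = readRemaining(in); // nothing is left to read
        System.out.println("first=[" + first + "] second=[" + second + "]");
        // prints: first=[hello] second=[]
    }
}
```

A stream is a one-way cursor, so once the bytes have been read into an array, any later read on the same stream starts at the end and returns nothing.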

4 Answers

4

As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data suggests it could be compressed. http://www.garykessler.net/library/file_sigs.html Try renaming the file to have .ZIP at the end and see if you can open it in the file explorer.

From the link above, the start of a DOCX file looks as follows.

50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX

Microsoft Office Open XML Format (OOXML) Document

NOTE: There is no subheader for MS OOXML files as there is with
DOC, PPT, and XLS files. To better understand the format of these files,
rename any OOXML file to have a .ZIP extension and then unZIP the file;
look at the resultant file named [Content_Types].xml to see the content
types. In particular, look for the <Override PartName= tag, where you
will find word, ppt, or xl, respectively.

Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes
at the end of the file.
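The signature above can also be checked in code before attempting any text decoding. This is only a sketch: the class and method names are hypothetical, and it tests just the first four bytes (50 4B 03 04), which tells you the data looks like a ZIP container, not which kind of document it holds.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ZipSignature {
    // Local file header signature "PK\3\4" (50 4B 03 04), as in the listing above.
    private static final byte[] PK_HEADER = {0x50, 0x4B, 0x03, 0x04};

    // Returns true when the data starts with the four-byte PK header.
    static boolean startsWithPk(byte[] data) {
        return data.length >= PK_HEADER.length
                && Arrays.equals(Arrays.copyOf(data, PK_HEADER.length), PK_HEADER);
    }

    public static void main(String[] args) {
        byte[] docxLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        byte[] plainText = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(startsWithPk(docxLike));  // true
        System.out.println(startsWithPk(plainText)); // false
    }
}
```

With the code from the question, passing `clearTextBytes` to `startsWithPk` would be enough to decide whether converting the bytes to a String makes sense at all.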

Assuming you do have text data, most likely the character encoding is neither your platform default nor UTF-8. You need to a) check what the encoding actually is, and b) check that the corruption is not introduced when you output the string rather than when you read the input.

You can try brute force to find a character set which doesn't produce any unknown characters.

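import java.nio.charset.Charset;
import java.util.LinkedHashSet;
import java.util.Set;

// Returns every installed charset that decodes the bytes without
// producing U+FFFD, the Unicode replacement character.
public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        if (!new String(bytes, charset).contains("\uFFFD"))
            charsets.add(charset);
    }
    return charsets;
}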
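As an alternative to scanning for the replacement character as `possibleCharsets` does, a decoder can be told to fail on the first invalid byte sequence. The sketch below is one way to do that with `CharsetDecoder`; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true only if every byte sequence is valid in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] validUtf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] invalidUtf8 = {(byte) 0xC3, 0x28}; // lead byte without a continuation byte
        System.out.println(decodesCleanly(validUtf8, StandardCharsets.UTF_8));    // true
        System.out.println(decodesCleanly(invalidUtf8, StandardCharsets.UTF_8));  // false
        System.out.println(decodesCleanly(invalidUtf8, StandardCharsets.ISO_8859_1)); // true
    }
}
```

The last line shows the limit of either approach: a single-byte charset such as ISO-8859-1 accepts every byte value, so a clean decode only means the bytes are valid in that charset, not that the file actually contains text.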

3 Comments

Great - I have made it into a zip and have opened it. However, I'm a bit confused as to what you mean by checking what the encoding is and checking the corruption is not when I output the string instead of in the input. So is that possibleCharsets function of yours supposed to return all the sets of chars that don't include �, and then I create a new String out of that? Sorry I'm fairly new to bytes/binary data/ascii stuff.
(also, initially the Word document I was trying to read in was a simple .docx)
@KevinDonahoe there is nothing simple about the docx file format ;) You need a library designed to read such a document to have any chance of reading it. As it's a binary format, character encoding doesn't apply.
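To see the container structure without any extra library, the entries of a ZIP archive can be listed with `java.util.zip`. This is a rough sketch with hypothetical names: `sampleZip` only builds a tiny stand-in archive in memory so the example runs on its own, and its entry names mimic what a .docx typically contains rather than coming from a real document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntryLister {
    // Returns the names of all entries found in the given ZIP data.
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Builds a small in-memory archive (illustrative stand-in for a real .docx).
    static byte[] sampleZip(String... names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleZip("[Content_Types].xml", "word/document.xml");
        System.out.println(entryNames(archive));
        // prints: [[Content_Types].xml, word/document.xml]
    }
}
```

For a real file, the bytes read from the .docx would be passed to `entryNames` instead. Listing the entries shows the structure only; extracting readable text from the XML parts is still best left to a library built for the format.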
0

UTF-8 can encode about 2,097,152 different code points; for those that have no glyph, you see a question mark. Try the classic DOS code page (code page 437, which Java names IBM437) instead:

new String(clearTextBytes, "IBM437");


0

To get the text contents of a Word file, you need the Apache POI libraries:

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

[...]

   // Open the .docx with POI and print its plain text
   XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));
   XWPFWordExtractor we = new XWPFWordExtractor(docx);
   System.out.println(we.getText());


0

I've written a very basic program to read the contents of a file and to print each string on a new line in the console. Here is the content of the file:

File1.txt

Here is the program I wrote:

import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);

        while (input.hasNext()) {
            System.out.println(input.next());
        }

        input.close();

    } // main()
} // class Test

This is the output to the console:

apples
pears
1
2
3
oranges
carrots
bananas
pineapples
