
I have a dataset of product reviews, and I want to extract the text between two tags in that file and print it. How can I do that? The data file contains data in the following format:

<review> id 
<reviewer></reviewer> 
<start word></end word> 
</review>

My code looks like this:

    File file = new File("D://Data/Dataset/unlabeled.review");
    FileInputStream fis = new FileInputStream(file);
    byte[] bytes = new byte[(int) file.length()];
    fis.read(bytes);
    fis.close();
    String text = new String(bytes, "UTF-8");
    System.out.println(text.substring(text.indexOf("<start word>"), text.lastIndexOf("</end word>")));
  • With some code.. What did you try? Commented Mar 1, 2016 at 12:21
  • see stackoverflow.com/questions/34129040/… for example Commented Mar 1, 2016 at 12:26

1 Answer

Your extraction code is this:

    text.substring(text.indexOf("<review_text>"), 
                   text.lastIndexOf("</review_text>"));

There are three problems with this code:

  1. The indexOf and lastIndexOf methods return the offset of the first character of some occurrence of the argument string. But you need to extract from the first character after "<review_text>", so the start index has to be advanced by the length of that tag.

  2. If there are multiple "<review_text>" / "</review_text>" pairs, then your code doesn't extract the text between each pair. It returns everything from the first opening tag up to the last closing tag.

  3. If there is no "<review_text>" or no "</review_text>", then one or both of the index-of calls will return -1, and that will lead to an exception in the substring call.
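
One way to address all three points is sketched below. This is only an illustration, not the only possible fix: the class name `ReviewTextExtractor` and the method `extractAll` are hypothetical, the tag strings are passed in as parameters and assumed to be non-empty, and the sample input in `main` is made up. It relies only on the standard `String.indexOf(String, int)` and `String.substring(int, int)` methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here only for illustration.
class ReviewTextExtractor {

    // Returns the text between every openTag/closeTag pair, in order of appearance.
    // Assumes both tags are non-empty strings.
    static List<String> extractAll(String text, String openTag, String closeTag) {
        List<String> results = new ArrayList<>();
        int searchFrom = 0;
        while (true) {
            int open = text.indexOf(openTag, searchFrom);
            if (open == -1) {
                break; // problem 3: no further opening tag, so stop
            }
            // Problem 1: start after the opening tag, not at its first character.
            int contentStart = open + openTag.length();
            int close = text.indexOf(closeTag, contentStart);
            if (close == -1) {
                break; // problem 3: unterminated tag, never pass -1 to substring
            }
            results.add(text.substring(contentStart, close));
            // Problem 2: continue searching after this pair to find the next one.
            searchFrom = close + closeTag.length();
        }
        return results;
    }

    public static void main(String[] args) {
        // Made-up sample input; in practice this would be the file contents.
        String text = "<review_text>Great product</review_text>"
                + "<review_text>Stopped working</review_text>";
        for (String review : extractAll(text, "<review_text>", "</review_text>")) {
            System.out.println(review.trim());
        }
    }
}
```

With the file already read into `text` as in the question, the call would be `extractAll(text, "<review_text>", "</review_text>")`. If either tag is missing, the method returns an empty list instead of throwing.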
