2

I want to parse an HTML file and store the bold text (the text inside <b> tags). One solution is to read the file line by line and use split or a regex. Does this mean I have to store the entire page in a String variable? If I don't, I have no guarantee that a tag's start and end are on the same line.

What solution do you suggest?

3
  • Possible duplication of this! Commented May 20, 2013 at 17:39
  • 1
    Attempting to parse HTML with a regex is generally a bad idea and will lead to nothing but tears. But if you insist, yes, if you need to match across lines that's one way to do it. You can also deal with just reading it line by line if you keep track of a state. Commented May 20, 2013 at 17:40
  • possible duplicate of HTML/XML Parser for Java Commented May 20, 2013 at 18:05
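
The "keep track of a state" idea from the comment above could look roughly like the following. This is only a minimal sketch under strong assumptions: the class and method names (LineByLineBoldScanner, scan) are made up for illustration, and it only recognizes the exact lowercase strings <b> and </b>, with no attributes, no nesting, and no tag split across two lines. A real HTML parser handles all of those cases; this only shows that the whole page does not need to sit in one String.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scans line by line and remembers whether we are inside <b>...</b>.
public class LineByLineBoldScanner {

    public static List<String> scan(Reader source) throws IOException {
        List<String> found = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideBold = false; // the state carried from one line to the next

        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            int pos = 0;
            while (pos < line.length()) {
                if (!insideBold) {
                    int open = line.indexOf("<b>", pos);
                    if (open < 0) {
                        break; // no more opening tags on this line
                    }
                    insideBold = true;
                    pos = open + 3; // skip past "<b>"
                } else {
                    int close = line.indexOf("</b>", pos);
                    if (close < 0) {
                        // The tag closes on a later line: keep the rest of this line.
                        current.append(line, pos, line.length());
                        break;
                    }
                    current.append(line, pos, close);
                    found.add(current.toString());
                    current.setLength(0);
                    insideBold = false;
                    pos = close + 4; // skip past "</b>"
                }
            }
            if (insideBold) {
                current.append(' '); // stand-in for the line break inside the bold text
            }
        }
        return found;
    }
}
```

Only the text of the currently open <b> element is buffered, so memory use stays small even for large files.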

2 Answers

5

Use jsoup to parse the contents:

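import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";

// Parse the whole string into a DOM, so tags spanning several lines are not a problem.
Document doc = Jsoup.parse(html);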

5 Comments

Why would you not want to use a reliable third-party library? That's like saying "I want to connect to a database, but I don't want to use JDBC".
I would argue that there are reasons/scenarios for not wanting to have external dependencies but ... this isn't one of them.
@david99world because it is a project I have for university :)
@Andrew Could you elaborate a bit on why universities don't like third-party libraries? Is it about politics? I don't get it, but I'm keen to know.
It's probably because they want you to learn the specific process, and the best way to learn is by doing.
0

it is a project I have for university

Use HTMLEditorKit.ParserCallback. It ships with the JDK (package javax.swing.text.html), so no third-party library is needed.
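
A minimal sketch of how this might be wired up, assuming the goal is simply to collect the text of each <b> element. The class name BoldTextExtractor and the method extractBold are made up for illustration; ParserDelegator, HTMLEditorKit.ParserCallback and HTML.Tag.B are the JDK's own Swing classes. Text is trimmed before it is stored, and a nested <b> is folded into its outermost one.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text found inside <b> elements.
public class BoldTextExtractor {

    public static List<String> extractBold(Reader reader) throws IOException {
        final List<String> result = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // how many <b> elements are currently open
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B) {
                    if (depth == 0) {
                        current.setLength(0); // start a fresh buffer for the outermost <b>
                    }
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data); // only keep text while inside <b>
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B && depth > 0) {
                    depth--;
                    if (depth == 0) {
                        result.add(current.toString().trim());
                    }
                }
            }
        };

        // The parser reads from the Reader as a stream, so the caller does not
        // have to build one big String, and tags may span several lines.
        // 'true' tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(reader, callback, true);
        return result;
    }
}
```

For a file, something like extractBold(new FileReader("page.html")) should work, where page.html is a placeholder path.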
