Java: Parse HTML file and extract text

Question

I want to parse an HTML file and store the bold text (the text inside <b> tags). One solution is to read the file line by line and use split or a regex. Does this mean I have to store the entire page in a String variable? If I don't, I have no guarantee that the start of a tag and its end are on the same line.

What solution do you suggest?

Attempting to parse HTML with a regex is generally a bad idea and will lead to nothing but tears. But if you insist: yes, if you need to match across lines, reading the whole page into a String is one way to do it. You can also read it line by line if you keep track of state.
– Brian Roach, Commented May 20, 2013 at 17:40
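
To illustrate the "keep track of state" idea from the comment above, here is a minimal sketch. It is not a real HTML parser: it assumes the tags appear exactly as lowercase <b> and </b>, with no attributes and no nesting. The class name LineByLineBoldScanner and the method extractBold are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineByLineBoldScanner {

    /** Reads line by line and collects the text between <b> and </b>. */
    static List<String> extractBold(Reader source) throws IOException {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideBold = false; // state carried over from one line to the next

        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int pos = 0;
                while (true) {
                    if (!insideBold) {
                        int open = line.indexOf("<b>", pos);
                        if (open < 0) {
                            break; // no more bold text starts on this line
                        }
                        insideBold = true;
                        pos = open + "<b>".length();
                    } else {
                        int close = line.indexOf("</b>", pos);
                        if (close < 0) {
                            // The tag is still open at the end of the line:
                            // keep the rest and restore the line break.
                            current.append(line, pos, line.length()).append('\n');
                            break;
                        }
                        current.append(line, pos, close);
                        result.add(current.toString());
                        current.setLength(0);
                        insideBold = false;
                        pos = close + "</b>".length();
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The first bold run spans two lines.
        String html = "<p>one <b>first\nsecond</b> two <b>third</b></p>";
        for (String bold : extractBold(new StringReader(html))) {
            System.out.println("[" + bold + "]");
        }
    }
}
```

Only one line is held in memory at a time, plus whatever bold text is currently open. Anything the sketch does not anticipate (uppercase tags, attributes, comments, <b> inside a script) will be handled wrongly, which is the usual argument for a real parser.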

David · Accepted Answer · 2013-05-20 17:36:18Z

5

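Use jsoup to parse the contents:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";

// Build a DOM tree; doc.select("b") then returns the <b> elements
Document doc = Jsoup.parse(html);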


5 Comments

David Over a year ago

Why would you not want to use a reliable third-party library? That's like saying "I want to connect to a database, but I don't want to use JDBC".

Brian Roach Over a year ago

I would argue that there are reasons/scenarios for not wanting to have external dependencies but ... this isn't one of them.

Andrew Over a year ago

@david99world because it is a project I have for university :)

kakacii Over a year ago

@Andrew Could you elaborate a bit on why universities don't like third-party libraries? Is it about politics? I don't get it, but I'm keen to know.

rockstardev Over a year ago

It's probably because they want you to learn the specific process, and the best way to learn is by doing.

camickr · Accepted Answer · 2013-05-20 19:27:30Z

0

"it is a project I have for university"

Use HTMLEditorKit.ParserCallback. It is part of the JDK (package javax.swing.text.html), so it needs no third-party library.
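
A minimal sketch of how that callback could be used to collect bold text, assuming the parser is driven through javax.swing.text.html.parser.ParserDelegator. The class name BoldTextExtractor and the method extractBold are made up for this example, and the Swing parser targets an old HTML dialect, so treat it as a starting point rather than a robust solution.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class BoldTextExtractor {

    /** Collects the text found inside <b> elements. */
    static List<String> extractBold(Reader reader) throws IOException {
        final List<String> boldText = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0; // greater than 0 while inside a <b> element
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B) {
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Text may arrive in several chunks, so accumulate it.
                if (depth > 0) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B && depth > 0 && --depth == 0) {
                    boldText.add(current.toString());
                    current.setLength(0);
                }
            }
        };

        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(reader, callback, true);
        return boldText;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><p>Some <b>bold</b> and <b>more</b>.</p></body></html>";
        System.out.println(extractBold(new StringReader(html)));
    }
}
```

Because the parser works on a Reader, a java.io.FileReader can be passed instead of the StringReader, and the question of whether a tag opens and closes on the same line no longer matters.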
