
I am currently developing an application that will request some information from websites. What I'm looking to do is parse the HTML files over an online connection. I was just wondering: will parsing a website put any strain on its server? Will it have to download any excess information, or will it simply connect to the site as I would through my browser and then scan the source?

If this puts extra strain on the websites, then I'm going to have to make a special request to some of the companies I'm scanning. However, if not, then I already have permission to do this.

I hope this made some sort of sense. Kind regards, Jamie.

6 Answers


No extra strain on other people's servers. The server just sees your plain HTTP GET request; it won't even be aware that you're then parsing the page/HTML.

Have you checked this: JSoup?
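As a rough sketch of what that single request looks like (the URL here is just a placeholder), fetching and parsing a page with jsoup is one HTTP GET followed by purely local work in your own JVM:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class SingleFetch {
        public static void main(String[] args) throws Exception {
            // One HTTP GET: the same request a browser sends for the page's HTML.
            Document doc = Jsoup.connect("http://example.com/some-page").get();

            // Everything from here on happens locally; the server never sees the parsing.
            System.out.println(doc.title());
        }
    }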




Consider doing the parsing and the crawling/scraping in separate steps. If you do that, you can probably use an existing open-source crawler such as crawler4j that already has support for politeness delays, robots.txt, etc. If you just blindly go grabbing content from somebody's site with a bot, the odds are good that you're going to get banned (or worse, if the admin is feeling particularly vindictive or creative that day).
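If you go the crawler4j route, a minimal setup might look something like the sketch below (class and method signatures vary a little between crawler4j versions, and the seed URL, storage folder, and crawler class are all placeholders); the politeness delay and robots.txt handling are what keep you off the admin's radar:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class PoliteCrawl {

        // Minimal crawler that stays on one (placeholder) site.
        public static class SiteCrawler extends WebCrawler {
            @Override
            public boolean shouldVisit(Page referringPage, WebURL url) {
                return url.getURL().startsWith("http://www.example.com/");
            }

            @Override
            public void visit(Page page) {
                System.out.println("Fetched: " + page.getWebURL().getURL());
                // Hand the downloaded HTML to your parser (e.g. jsoup) here.
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl-data"); // placeholder folder
            config.setPolitenessDelay(1000);                 // wait at least 1s between requests to a host
            config.setMaxPagesToFetch(100);                  // keep the load modest

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.example.com/");
            controller.start(SiteCrawler.class, 1);          // a single crawler thread is plenty here
        }
    }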

2 Comments

This probably depends on how much content you are scraping from the site. If you get a couple of pages a day, the admins likely won't mind or even notice. However, if you try to crawl a very large site without respecting robots.txt, then they may ban you.
Thanks very much for the recommendation. I'm currently using JSoup and opening a connection to the URL to get access to the source. Would this be classed as crawling? Only the page that someone requests is being scanned; no unnecessary scanning and processing is done. Thanks again for the quick response.

Depends on the website. If you do this to Google then most likely you will be put on hold for a day. If you parse Wikipedia (which I have done myself) it won't be a problem, because it's already a huge, huge website.

If you want to do it the right way, first respect robots.txt, then try to scatter your requests. Also try to do it when traffic is low, such as around midnight, rather than at 8 AM or 6 PM when people are at their computers.
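If you don't want a full crawler library, here is a deliberately simplistic sketch of checking robots.txt by hand before fetching a page (host and path are placeholders; a real parser also handles Allow rules, per-bot sections, wildcards, and Crawl-delay):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {

        // Collects Disallow rules for the "User-agent: *" section only.
        public static boolean isAllowed(String host, String path) throws Exception {
            List<String> disallowed = new ArrayList<>();
            boolean inWildcardSection = false;

            // A missing robots.txt throws here; real code would treat that as "everything allowed".
            URL robots = new URL("http://" + host + "/robots.txt");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(robots.openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        inWildcardSection = line.substring(11).trim().equals("*");
                    } else if (inWildcardSection && line.toLowerCase().startsWith("disallow:")) {
                        String rule = line.substring(9).trim();
                        if (!rule.isEmpty()) {
                            disallowed.add(rule);
                        }
                    }
                }
            }

            for (String rule : disallowed) {
                if (path.startsWith(rule)) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) throws Exception {
            // Host and path are placeholders.
            System.out.println(isAllowed("www.example.com", "/books/12345"));
        }
    }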

4 Comments

Thanks very much for the answer. It's not looking to download all of the data off a website. It's essentially a GoCompare for books: someone will put in an ISBN and it will go directly to the page and get the price from a few websites. Would this violate rules if I were to use something like JSoup?
If this is a huge company like Amazon then yes, you are violating something for sure. Find a web service somewhere and use that instead.
Make sure you set your client to a known browser and you may be able to avoid getting into trouble for this. However, if you're starting a company and this is a legitimate job, I would recommend not spoofing a bigger company, because basically you are trying to see what their prices are and display them to the user.
Yes, that's the purpose of the application: to get the price and display it to the user. It won't make any unnecessary requests. I tried to get in contact with Amazon but they simply said they can't provide any details of the inner workings of their company. JSoup would parse the page, get the price, and then provide a link to the user.

Besides Hank Gay's recommendation, I can only suggest that you also re-use an open-source HTML parser, such as Jsoup, for parsing/processing the downloaded HTML files.
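For example, once the crawler has saved a page to disk, pulling a value out of it with Jsoup is a few lines; the file path and the CSS selector below are purely hypothetical and depend on the page you're actually scraping:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.File;

    public class ParseDownloadedPage {
        public static void main(String[] args) throws Exception {
            // Parse an HTML file that the crawler has already downloaded.
            Document doc = Jsoup.parse(new File("/tmp/crawl-data/product-page.html"), "UTF-8");

            // Hypothetical selector; inspect the real page to find where the price lives.
            Element price = doc.select("span.price").first();
            if (price != null) {
                System.out.println("Price: " + price.text());
            }
        }
    }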

4 Comments

I am currently using JSoup to process the files and have been able to get the prices. I was just wondering if it would download any more data than a usual browser request to the page would. Thanks very much for the fast response.
@Jamie McLeish: well, it's a good idea to use an open-source crawler to download the files, because that way you get support for politeness delays, robots.txt, etc.
Thanks very much. Would JSoup respect robots.txt or support politeness delays? Again, thank you for your help.
@Jamie McLeish: I don't think so: JSoup is essentially an HTML parser, not a Web crawler

You could use HtmlUnit. It gives you a virtual, GUI-less (headless) browser.
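A minimal sketch, assuming a fairly recent 2.x HtmlUnit release (the URL is a placeholder, and a couple of method names differ between versions):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HeadlessFetch {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();

            // Skip JavaScript if you only need the static HTML; it's faster and
            // puts even less load on the server.
            webClient.getOptions().setJavaScriptEnabled(false);

            HtmlPage page = webClient.getPage("http://example.com/some-page");
            System.out.println(page.getTitleText());

            webClient.close(); // closeAllWindows() on older HtmlUnit versions
        }
    }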



Your Java program hitting other people's servers to download the content of a URL won't put any more strain on the server than a web browser doing so; essentially, they're precisely the same operation. In fact, you'll probably put less strain on them, because your program probably won't bother downloading the images, scripts, etc. that a web browser would.

BUT:

  • if you start bombarding the server of a company with moderate resources with downloads, or start exhibiting obvious "robot" patterns (e.g. downloading precisely every second), they'll probably block you; so put some sensible constraints on what you do (e.g. make every consecutive download to the same server happen at a random interval of between 10 and 20 seconds);
  • when you make your request, you probably want to set the "User-Agent" request header either to mimic an actual browser or to be open about what your program is (invent a name for your "robot", create a page explaining what it does, and include a URL to that page in the header). Many server owners will let legitimate, well-behaved robots through, but block "suspicious" ones where it's not clear what they're doing;
  • on a similar note, if you're doing things "legally", don't fetch pages that the site's "robots.txt" file prohibits you from fetching.

Of course, within some bounds of "non-malicious activity", it's in general perfectly legal for you to make whatever request you want, whenever you want, to whatever server. But equally, that server has the right to serve or deny you that page. So to prevent yourself from being blocked, one way or another, you need either to get approval from the server owners or to "keep a low profile" in your requests.
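Putting the first two bullet points together, a polite fetch with jsoup might look roughly like this (the bot name, its info URL, and the delay bounds are all assumptions you'd tune yourself):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.Random;

    public class PoliteFetcher {
        private static final Random RANDOM = new Random();

        // Hypothetical identifying User-Agent: a made-up bot name plus a URL
        // to a page explaining what the robot does.
        private static final String BOT_AGENT =
                "BookPriceBot/0.1 (+http://example.com/about-this-bot)";

        public static Document fetch(String url) throws Exception {
            Document doc = Jsoup.connect(url)
                    .userAgent(BOT_AGENT) // be open about who you are
                    .timeout(10000)
                    .get();               // one GET, same as a browser request

            // Wait a random 10-20 seconds before the next request to the same
            // server, so the pattern doesn't look machine-regular.
            Thread.sleep(10000 + RANDOM.nextInt(10000));
            return doc;
        }
    }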

2 Comments

Thank you very much for the detailed response. I've been in contact with some of the companies and they've referred me to their respective technical departments; I just wanted to be sure that it wouldn't put any more strain on the web page. The program works by a user inputting an item, and then the program goes to the respective pages on the different services that sell this item. It then gets only the price from the website. This is done using JSoup, and it then provides a link to the website which you can follow through with a browser. I hope this makes sense. Thanks again.
Yes, it makes sense and it's quite a common thing to do, and not illegal or bad per se. But equally, a company is within their rights to decide to block page requests they don't like. Also bear in mind that some companies like Amazon have an actual API for fetching such data, and if they do, they'll probably prefer you to use their API "properly" rather than scraping their pages.
