  • Why don't you use a hashmap to check for duplicate links? Commented May 8, 2019 at 9:40
  • It will run out of memory, as there could be a billion links. Commented May 8, 2019 at 9:43
  • You need to run a profiler to see why your webcrawler is so slow. Commented May 8, 2019 at 9:53
  • The suggestions of a hashmap of the addresses and of combining the database queries into a single call are good ones (see the sketch after this list). If you're finding only a few thousand links and your starting source is something big like stackexchange, then there is probably some other error. You are ensuring that the link is new before going through its own links, right? And you're not pruning too much from the address? Commented Jun 7, 2019 at 13:48
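
Below is a minimal sketch, not the original poster's code, of how the two suggestions in the comments could look in Python: an in-memory set of URL fingerprints for duplicate checks, and database writes batched into a single call. The names `url_fingerprint`, `seen_hashes`, `store_links`, and the `links` table are assumptions for illustration; at a billion links even hashed entries may not fit in RAM, so a disk-backed index or a Bloom filter might be needed instead.

```python
import hashlib
import sqlite3

def url_fingerprint(url):
    # Store a short, fixed-size digest instead of the full URL string
    # so each remembered link costs a predictable, small amount of memory.
    return hashlib.blake2b(url.encode("utf-8"), digest_size=16).digest()

seen_hashes = set()

def is_new(url):
    # True only the first time a URL (by fingerprint) is encountered.
    fp = url_fingerprint(url)
    if fp in seen_hashes:
        return False
    seen_hashes.add(fp)
    return True

def store_links(conn, urls):
    # Batch all inserts into one executemany() call and one commit,
    # instead of issuing a separate query (and round trip) per link.
    conn.executemany(
        "INSERT OR IGNORE INTO links (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()

# Example usage with a throwaway in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT PRIMARY KEY)")
batch = [u for u in ["https://example.com/a",
                     "https://example.com/a",
                     "https://example.com/b"] if is_new(u)]
store_links(conn, batch)
print(conn.execute("SELECT COUNT(*) FROM links").fetchone()[0])  # -> 2
```

The design choice being illustrated: checking the fingerprint set before crawling a page avoids revisiting links, and `executemany` with a single commit replaces one database call per link with one call per batch, which is where much of the slowdown in a naive crawler typically comes from.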