I'm creating a tool for scraping links from multiple URLs. I want to store this information, then test the scraped links for their status.

I expect to have to test a lot of links, about 60,000, so my problem is deciding how to store the links to test.

What I'm thinking of doing is creating text files for the URLs I'll be scraping. I'd have to create about 40 text files, one for each URL I'm scraping (they are the same URL, just regionalised).

  • Would creating lots of text files cause performance issues?
  • Would I be best off storing the URLs in an array and then writing the array to the text file, or should I just write the URL to the text file as I go? Or is there a better way?
  • Is there a better method than storing in text files? (I don't really want to use a database but if there is a good case for it I could be convinced)
  • Why do you need to store them? How is the stored data going to be used? Have you considered small database engines, for example SQLite? Commented Apr 20, 2012 at 12:42
  • It's not completely necessary to store them. I'm fairly new to programming, and I was thinking that if I stored the data in, say, an array, it would use a good bit of memory and so cause performance issues. Would storing so many strings in arrays use a lot of memory? Commented Apr 20, 2012 at 13:02
  • Java uses UTF-16, about 2 bytes per character. Guess at 200 chars per URL and you get: 60000*200*2 = 24 MB. Should be easy to fit in RAM. Commented Apr 20, 2012 at 16:04
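
The estimate in the last comment can be checked with a few lines of Java. This is only a rough sketch: the class and method names are hypothetical, the 200-character average is the commenter's guess, and the figure counts character data only, ignoring per-object overhead.

```java
public class UrlMemoryEstimate {

    // Rough estimate of the character data only (no object headers,
    // no collection overhead). All inputs are assumptions.
    static long estimateBytes(long urlCount, long avgCharsPerUrl, long bytesPerChar) {
        return urlCount * avgCharsPerUrl * bytesPerChar;
    }

    public static void main(String[] args) {
        // 60,000 URLs, guessed 200 chars each, 2 bytes per UTF-16 char
        long bytes = estimateBytes(60_000, 200, 2);
        System.out.println(bytes / 1_000_000 + " MB"); // prints "24 MB"
    }
}
```

Even if the real figure is a few times larger once object overhead is included, it should still fit comfortably in a default-sized heap.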

1 Answer

IMHO the easiest approach is to use serialization to save your information. For example, serialize a Map<String, Set<String>> of URLs. Multiple files should work too, without any serious performance impact, but that takes slightly longer to implement.
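
A minimal sketch of what serializing such a map could look like, using standard Java object streams. The class name `LinkStore`, the method names, the example URLs, and the choice of key (the scraped page) and value (the links found on it) are assumptions for illustration, not something the question specifies.

```java
import java.io.*;
import java.util.*;

public class LinkStore {

    // Write the whole map to disk in one go. HashMap and HashSet are
    // Serializable, so no extra code is needed for them.
    static void save(Map<String, Set<String>> links, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeObject(links);
        }
    }

    // Read the map back, e.g. after an application restart.
    @SuppressWarnings("unchecked")
    static Map<String, Set<String>> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (Map<String, Set<String>>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Set<String>> links = new HashMap<>();
        // Hypothetical data: scraped page -> links found on it
        links.computeIfAbsent("http://example.com/uk", k -> new HashSet<>())
             .add("http://example.com/uk/about");

        File file = File.createTempFile("links", ".ser");
        save(links, file);
        System.out.println(load(file).equals(links)); // prints "true"
    }
}
```

Using a Set for the values also drops duplicate links automatically, which may reduce the number of links that need testing.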

Another approach: register on mongolab and use a free account. (It's not advertising, I just like this service.) You don't need to install anything; just download the mongo driver and go ahead.

5 Comments

If I am reading the serialization article correctly (and I'm probably not), does that mean I can store information in memory and recall it later on? Would using this method to store a large number of strings (~60,000) use up a lot of memory and cause performance issues? I'm pretty new to programming :/
You are right about recalling it later. 60,000 strings is not that many. In any case, you can tune the JVM to allocate more memory for your program. And it's not about performance, it's about memory consumption. You should not worry about that.
This sounds almost exactly what I need. One last question. When you serialize an object in a class, can you deserialize it anywhere else in your application?
The purpose of serialization is to survive an application restart. For example, you've got an array: you serialize it, turn off the computer, and later deserialize it. If you are not going to save (serialize) it to disk, you don't need serialization. Just let the different parts of your application access the array in memory.
Ah right, I didn't realize it saved to disk. I think I'll probably go down this route. Need to learn about serialization I guess! Thank you for your help!
