
I have a web scraper that runs great: I can pass it a list of domain names, and it will scrape each site to see which ones are missing SEO data. I realize there are tools for this (like Screaming Frog), but I am trying to create a PHP script that does this for me and writes the results to a Google Sheet.

I've got a Google Sheet with a list of 300 sites. My script pulls from this Google Sheet like so:

 $domain_batch = getBatchOfFive(1,5);

This returns the first 5 sites from the Google Sheet; I then pass the returned array to a function that scrapes each site, like so:

 foreach ($domain_batch as $site) {
     $seo = getAllPagesSEO($site);
     // then logic to add the results to the spreadsheet
 }

Then when I run it again, I change that to:

 $domain_batch = getBatchOfFive(6,10);

And so on until I get through all of the sites in the Google Sheet.

To run the script, I just pull it up in my browser:

https://example.com/seo-scraper.php

The problem is that I can only scrape about 5 sites at a time before the script times out, so I'm wondering if it would be possible to run this script incrementally somehow.

Is there any way I could programmatically run the script for the first 5 sites and, once it finishes, have it automatically run again with the next 5 sites, repeating until every site in the sheet has been processed?

That way I don't have to go into seo-scraper.php after each run and change the values here:

 $domain_batch = getBatchOfFive(6,10);
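
To make the goal concrete, here's a rough sketch of the behavior I'm after. The ?offset= query parameter and the meta refresh are just guesses at a mechanism; getBatchOfFive and getAllPagesSEO are my existing functions:

 // seo-scraper.php?offset=1 - a hypothetical self-continuing version
 $batch_size = 5;
 $offset = isset($_GET['offset']) ? (int) $_GET['offset'] : 1;

 $domain_batch = getBatchOfFive($offset, $offset + $batch_size - 1);

 foreach ($domain_batch as $site) {
     $seo = getAllPagesSEO($site);
     // then logic to add the results to the spreadsheet
 }

 if (count($domain_batch) === $batch_size) {
     // a full batch came back, so there may be more rows: send the
     // browser to the next batch (a meta refresh rather than a
     // Location header, since browsers cap consecutive redirects)
     echo '<meta http-equiv="refresh" content="0;url=seo-scraper.php?offset='
         . ($offset + $batch_size) . '">';
 } else {
     echo 'All sites processed.';
 }

The idea being that each request only ever scrapes 5 sites, so no single request hits the timeout.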

I'm thinking this might not be possible, but I'm looking for any ideas!

  • Sure, you could check whether it's an XHR or a POST request, or else serve some JS which in turn calls back with a POST or XHR request and then runs your main code; you could even use server-sent events or long polling. But if you just want it not to time out, perhaps just add ignore_user_abort so it carries on even if your browser gives up, and set_time_limit so PHP doesn't time out. Also, you should loop over 0 to n records rather than manually shifting along by five (see the first sketch below). Commented Aug 16, 2022 at 21:23
  • It seems that most of the time I am getting a 504 Gateway Timeout. I'm wondering if ignore_user_abort would help bypass that issue? Commented Aug 16, 2022 at 21:49
  • 3
    You could have a look at How to run a large PHP script. Some solutions might work for you, like running it on a command line instead of a browser or via a cron-job. Commented Aug 16, 2022 at 22:50
  • Both of these comments were VERY helpful. Thank you both! Commented Aug 18, 2022 at 19:49
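
For reference, here is a minimal sketch of the first comment's suggestion, assuming getBatchOfFive() accepts any row range and returns an empty array once the sheet is exhausted:

 // keep running even if the browser disconnects, and lift
 // PHP's own execution time limit for this request
 ignore_user_abort(true);
 set_time_limit(0);

 $batch_size = 5;

 // loop over all records instead of manually shifting the range
 for ($start = 1; ; $start += $batch_size) {
     $domain_batch = getBatchOfFive($start, $start + $batch_size - 1);

     if (empty($domain_batch)) {
         break; // no rows left in the sheet
     }

     foreach ($domain_batch as $site) {
         $seo = getAllPagesSEO($site);
         // then logic to add the results to the spreadsheet
     }
 }

Note that a 504 comes from the gateway or proxy, not from PHP, so set_time_limit() alone won't prevent it; with ignore_user_abort(true) the PHP process can keep working after the gateway gives up (depending on the server configuration), but the browser won't show any output.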
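
And a sketch of the second suggestion: run the same script from the command line, where browser and gateway timeouts don't apply (the file path, schedule, and log path are illustrative):

 # run once from a shell
 php /path/to/seo-scraper.php

 # or schedule it with cron, e.g. every night at 02:00
 0 2 * * * php /path/to/seo-scraper.php >> /var/log/seo-scraper.log 2>&1

The CLI build of PHP has no execution time limit by default, so the full loop from the previous sketch could run in a single invocation.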
