Simple image scraping

Question

I wrote this code over the last few days and I learned a lot, and it works as I expect. However I suspect it's woefully inefficient:

import requests, bs4, os

os.chdir('../../Desktop')
wallpaperFolder = os.makedirs('Wallpapers', exist_ok=True)

pageCount = 1
GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)

linkList = []
while not GalleryUrl.endswith('page=3'):
    # Download the page
    GalleryUrl = 'https://wall.alphacoders.com/search.php?search=Game+Of+Thrones&page=%s' % (pageCount)
    print('Opening gallary... %s' % (GalleryUrl))
    res = requests.get(GalleryUrl)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # find urls to image page
    section = soup.find_all('div', attrs={'class': 'boxgrid'})

    for images in section:
        imageLink = images.find_all('a')
        for a in imageLink:
            imageUrl = a.get('href')
            linkList.append(imageUrl)
            print('Image found at... ' + imageUrl)
    linkCount = 0
# Follow links and download images on page.
    while linkCount != len(linkList):
        print('Downloading image at... %s' % (str(linkList[linkCount])))
        imagePageUrl = ('https://wall.alphacoders.com/' + str(linkList[linkCount]))
        res = requests.get(imagePageUrl)

        wallpaperSoup = bs4.BeautifulSoup(res.text, 'html.parser')

        link = wallpaperSoup.find(id='main_wallpaper')
        imageSrc = link['src']

        # Save image to folder

        res = requests.get(imageSrc)
        imageFile = open(os.path.join(os.getcwd(), 'Wallpapers', os.path.basename(imageSrc)), 'wb')
        for chunk in res.iter_content(12000):
            imageFile.write(chunk)
        imageFile.close()
        linkCount += 1

    # Move to next page and empty link list
    pageCount += 1
    del linkList[:]       

print('Done')

Aside from cutting the print statements and simply letting the program do its thing, how could I optimize this code to run more efficiently?

Mast · Accepted Answer · 2016-07-14 20:29:42Z

os.chdir('../../Desktop')

Halt. Stop right there. This assumes the script is located at some specific location. Desktop happens to be always in the same place, but this assumes the script to be in a place relative to Desktop. On what OS do you intent to run this? If it's Linux-only, you could replace it by:

os.chdir('~/Desktop')

Now it will work regardless of where the script is located. Of-course, you just made the problem worse for non-Linux systems.

Even better would be to ask where the user wants to drop his files, possibly by using arguments (take a look at argparse).

But why don't you simply drop it instead? A map named Wallpapers will simply be created in the current directory. This would be obvious and expected behaviour.

for images in section:
    imageLink = images.find_all('a')
    for a in imageLink:
        imageUrl = a.get('href')
        linkList.append(imageUrl)
        print('Image found at... ' + imageUrl)

This isn't your bottleneck (not by a long shot), but that second for reeks of bad time complexity for something simple as telling the user where the images are.

One of the major time-consumers appears to be the downloading itself.

I ran cProfile against your code to find the bottleneck(s):

python3 -m cProfile -s tottime CR_133564.py

This automatically sorts the output on the total time taken per function, highest first. Don't forget to import cProfile. Let's take a look at everything with a total time higher than 0.5 seconds:

         8989953 function calls (8984594 primitive calls) in 108.880 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    20393   59.314    0.003   59.314    0.003 {method 'read' of '_ssl._SSLSocket' objects}
      183   28.853    0.158   28.853    0.158 {method 'do_handshake' of '_ssl._SSLSocket' objects}
      183    9.689    0.053    9.689    0.053 {method 'connect' of '_socket.socket' objects}
      183    1.547    0.008    1.548    0.008 {built-in method _socket.getaddrinfo}
      183    1.005    0.005    1.005    0.005 {method 'load_verify_locations' of '_ssl._SSLContext' objects}
    78129    0.621    0.000    3.383    0.000 parser.py:301(parse_starttag)
   585660    0.580    0.000    0.580    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
       93    0.549    0.006    5.750    0.062 parser.py:134(goahead)

That's responsible for 102.2 seconds out of the 108.9 it takes. If you want to optimize, do it here. The rest is peanuts.

Do you notice anything? You're wasting half a minute on handshakes. That's almost 30 seconds you could be doing something useful instead. Unless you find a trick to carry all files over one handshake, there's not much you can do about it. It also takes almost 60 seconds simply to read (AKA: download) the data. That's only slightly more than I'd expect with my internet connection, so there's not much you can do about that either.

Thanks Mast. This introduces a few concepts I'm not yet familiar with, but now I know what direction to look. — Lachy Vass
– Lachy Vass, Commented Jul 15, 2016 at 2:09

Stack Exchange Network

Simple image scraping

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Simple image scraping

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions