Return to Revisions

2 of 2

Clarifications based on comments

edited Jul 30, 2015 at 7:51

1.2k
7
13

You have an awful lot of excess whitespace. Use one or two blank lines between functions, and one blank line to break up different parts of each function (especially, just below the end of a loop or if block). Don't put blank lines immediately between a line ending in : and the start of the indented block beneath it.

The while loop in info uses a class attribute as a loop counter, which actually seems to cause a potential bug here: the counter persists, so on subsequent calls you will skip the loop, never append to the outer list, and it will be returned empty. If page is a local variable instead, and it will work fine no matter how many calls you make.

Instead of keeping a loop counter like this, and incrementing it manually, preferred Python style is to use a for loop. The equivalent for loop to your current while loop would be:

for page in range(1, 2):
   ...

Although it would be good to store the maximum page in a variable. But in light of your clarification in the comments that this would normally be a while True: loop, which only breaks once it hits a page with no interesting links, then the equivalent is to use itertools from the stdlib:

for page in it.count(1):
    ...

You have a variable links to capture the links on each iteration of the loop, and linkz to accumulate them across all the pages you parse (which is only one at the moment anyway, but I'm assuming that could change later). Those names are quite confusing; it would be good to differentiate better - call them links for the one that gathers all the links, and page_links for the one that gets the links for just this page.

Your email function parses what I would call your 'main' data structure. Since that is a lot more than just email-related information, email isn't the best name for it.

The data structure that email creates seems awkward for your data. If you think of people as 'records', and the information you capture about them to be 'fields', then it would make more sense to have it as a list of namedtuples, or or even a pandas dataframe if you're working with a lot of data. That way, your code to parse it is a little bit simpler:

person = namedtuple(...)
people = []
for link in links:
   this_person = person()
   person.phone = #parse phone number
   etc.

return people

But this is the major data structure in your class. It contains all the information that this class is designed to parse. So, I think all of this should be done in __init__, with this structure - and only this structure - captured as an attribute.

The only other attribute you store is the filename that gets passed to __init__. I think it would make more sense as an argument to your csv method - the only method that uses it - because the filename to write it to is a decision that goes with "I want to write this data to disk" more than "I want to scrape some data from a website".

So, you're left with a class with one attribute, and two methods - one of which is __init__, and the other is fairly trivial. In fact, csv becomes even more trivial with a list of namedtuples, since you wouldn't need to zip it. Similarly if you use a pandas DataFrame, it has it's own CSV routines that you can use.

That means there is no need for a class. Instead, use what your __init__ has grown to as a standalone function - call it parse_yellowpages(url). Have it return the list or DataFrame. Then drop the csv function entirely and just put that in the main line of your program.

Your current code is a little schizophrenic about whether it allows multi-valued fields: eg, can a person have more than one phone number?

The code parsing it looks like it does. You parse out a list of data, then you extend it onto the accumulated list. But then when you write out your CSV, you assume (by your use of zip to regroup them) that all fields have exactly one value.

In the comments you clarified that you only want single-valued fields. That being the case, don't include a full list of every match you found in your data structure. If you get three phone numbers for a person when you're only expecting one, you want to either consider it an error and bail out, or forget all but one of them. Your current code would silently remember all of them, forget that it did that, and end up with corrupt data.

To take the 'signal an error' path, do something like this:

phone_numbers = soup2.findall(...)
if len(phone_numbers) != 1:
    raise RuntimeError('Empty or multivalued field: ' + url)
else:
   this_person.phone = phone_numbers[0]

If you prefer to instead ignore any surplus values, use find instead of find_all - they take the same arguments, but find will just return one result instead of a list of them.

answered Jul 29, 2015 at 7:07

lvc

1.2k
7
13