
I have created a small program that reads a search_id from a CSV file, uses it to query a web form, and scrapes the results (in this example, product name and price) into another CSV file.

import csv

with open("file1.csv", "rb+a") as file1_read:
    r = csv.reader(file1_read, delimiter=",")
    for search_id in r:
        # -- Logic for web scraping here, omitted --
        with open("file2.csv", "a") as file2_write:
            wr = csv.writer(file2_write, delimiter=",")
            wr.writerow(search_id)
            wr.writerow(name)
            wr.writerow(price)

Example – for 2 rows of search_id, gives 6 rows of data in 3 columns:

id01
Coffee    
$4          $5          $3
id02
Soda    
$2          $3          $4

The reason I get three cells for "price" is because I'm scraping for a price series.

Now, I would like to output this as so instead:

Coffee      id01      $4
Coffee      id01      $5
Coffee      id01      $3
Soda        id02      $2
Soda        id02      $3
Soda        id02      $4

Any ideas on how I can rework the code to produce the format above?

Update: Here is more of the code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
import csv
from time import sleep
from itertools import *

# Omitted code here, navigates to search form on web page.

with open("file1.csv", "rb+a") as file1_read: # Open file1.csv containing column with searchkey (id).
    r = csv.reader(file1_read, delimiter = ",")
    for id in r:
        searchkey = driver.find_element_by_name("searchkey")
        searchkey.clear()
        searchkey.send_keys(id)
        print "\nSearching for %s ..." % id
        sleep(9) # Sleep timer, allow for retrieving results
        search_reply = driver.find_element_by_class_name("ac_results") # Get list, variable how many items and what their contents may be. Price and product names are extracted through regex.
        price = re.findall("((?<=\()[0-9]*)", search_reply.text)
        product_name = re.findall("(.*?)\(", search_reply.text)
        with open("file2.csv", "a") as file2_write: # Write results to file2.csv.
            wr = csv.writer(file2_write, delimiter=",")
            wr.writerow(id)
            wr.writerow(product_name)
            wr.writerow(price)

driver.quit()

1 Answer

The first thing you need to do is group your input. Assuming your format is always exactly what you've shown, one easy way to do this is with the grouper recipe from the itertools docs. Assuming you've copied that recipe into your code (or installed more_itertools and imported it from there), and that your scraped data is in some iterable named rows, each row being an iterable of columns:

for group in grouper(rows, 3):
    search_id = group[0][0]
    name = group[1][0]
    prices = group[2]

Now, you just need to write those all out as separate rows:

    for price in prices:
        wr.writerow([name, search_id, price])
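Putting those pieces together, here is a minimal runnable sketch. The grouper recipe is copied from the itertools docs (with a fallback so it works on both Python 2 and 3), and the rows list is hypothetical sample data standing in for the scraped input:

```python
try:
    from itertools import zip_longest                   # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks: grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Hypothetical stand-in for the scraped rows (3 rows per product).
rows = [
    ["id01"], ["Coffee"], ["$4", "$5", "$3"],
    ["id02"], ["Soda"],   ["$2", "$3", "$4"],
]

output_rows = []
for group in grouper(rows, 3):
    search_id = group[0][0]
    name = group[1][0]
    prices = group[2]
    for price in prices:
        output_rows.append([name, search_id, price])
# output_rows now holds one [name, id, price] row per price,
# ready for csv.writer(...).writerows(output_rows)
```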

One minor thing: Your desired output appears to be tab-delimited, but your code explicitly specifies delimiter=','. Obviously one of those is wrong. If the output is correct, use delimiter='\t'.


Now that you've shown some code, it seems to be very different from what you described. You don't actually get grouped data at all; for each search_id you do a separate query, and get back a single product_name and a single list of prices. If that's the case, you don't even need grouper; just do this:

search_reply = driver.find_element_by_class_name("ac_results")
product_name = re.match("(.*?)\(", search_reply.text).group(1)
prices = re.findall("((?<=\()[0-9]*)", search_reply.text)
with open("file2.csv", "a") as file2_write:
    wr = csv.writer(file2_write, delimiter=",")
    for price in prices:
        wr.writerow([product_name, id, price])

If, however, ac_results actually does return something with multiple groups as your question originally implied, then you can't just find all the product names and all the prices separately and try to merge them back together; you have to first split the text into groups, then find the product name and list of prices within each group. I don't think grouper will help for that, but there's probably some fairly easy way to do it. Nobody can figure that out for you without seeing your input, but my guess is that there's a tag with a class or id you can search for within ac_results instead of using a regex, or at least a tag you can find by table structure; if not, you'll need a more complicated regex.
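For illustration only, if the raw text hypothetically looked like one "Name(p)(p)(p)" entry per line, then each line would be one group, and you could extract the name and price list per line instead of across the whole text:

```python
import re

# Purely hypothetical input shape -- adjust to whatever ac_results really contains.
text = "Coffee(4)(5)(3)\nSoda(2)(3)(4)"

groups = []
for line in text.splitlines():
    name_match = re.match(r"(.*?)\(", line)
    if name_match is None:
        continue                                # skip lines that don't fit the pattern
    name = name_match.group(1)
    prices = re.findall(r"(?<=\()[0-9]+", line)  # all "(digits" groups on this line
    groups.append((name, prices))
```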


One last thing: you seem to be throwing random file modes at your open calls. "rb+a" isn't valid, and it's not even clear what you want it to mean. Assuming you're using Python 2.x, you pick one of r, w, or a; then optionally a U (meaning universal newlines); then optionally a + (meaning read-write instead of read-only or write-only); then optionally a b (meaning binary rather than text). Asking for both "read+update" and "append+update" at once doesn't make any sense, and I'm not sure why you want file1.csv in any writable mode when you're never writing to it.

As for file2.csv, I doubt you want to append to it (it's presumably full of incorrect garbage from your previous runs) rather than re-create it, so why use a? Finally, opening one file in binary and the other in text when they're supposed to be equivalent also doesn't make sense. So, I think your two modes should be rb for the first and wb for the second, but please read the docs, decide what you want, and write that.
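A self-contained sketch of the corrected modes (the file names are the ones from the question, but written into a temporary directory here; on Python 2 you would use "rb"/"wb", while this sketch uses Python 3 text mode with newline=""):

```python
import csv
import os
import tempfile

# Work in a temporary directory so the sketch is self-contained.
workdir = tempfile.mkdtemp()
infile = os.path.join(workdir, "file1.csv")
outfile = os.path.join(workdir, "file2.csv")

with open(infile, "w") as f:    # create sample input ("wb" on Python 2)
    f.write("id01\nid02\n")

# "r" (read-only) for the input; "w" (re-create, don't append) for the output.
# Python 2: "rb"/"wb".  Python 3: text mode with newline="".
with open(infile, "r", newline="") as file1_read, \
     open(outfile, "w", newline="") as file2_write:
    r = csv.reader(file1_read, delimiter=",")
    wr = csv.writer(file2_write, delimiter=",")
    for row in r:
        wr.writerow(row)
```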


9 Comments

Thank you! Looks like a neat solution. But I get this error when copying the grouper recipe: line 42, in grouper return zip_longest(*args, fillvalue=fillvalue) NameError: global name 'zip_longest' is not defined
@Winterflags: Do you not understand how imports and globals work in Python? If you want to use the recipe exactly as written, you'd need to from itertools import zip_longest (or from itertools import *). But the recipe is meant to be understood, not just copied and pasted. If you don't want to do that, just install and import more_itertools instead.
Actually I have done from itertools import * but still get that error. I also have tried with more_itertools. But there may be something I'm missing, I have only been coding for less than a week so I usually understand the logic when I see it but I'm not aware of all basics. Do I need to copy the definition for itertools.zip_longest() as well?
@Winterflags: OK, then you're using Python 2.x, not 3.x, so you need to copy the version from the 2.x docs, not the version from the 3.x docs. (But the only change should be izip_longest instead of zip_longest.) Apologies for missing the tag; I'll update the link in the answer.
I tried a bit with your suggestion but wasn't able to implement it. Don't know if I'm misunderstanding the components in the recipe. But actually, since it seems to be intended for fixed-length columns, and the scraped content will be of variable length, maybe it wouldn't have worked anyway?
