I built a small web-scraping application in Ruby that scrapes data from a website and stores it in a CSV file. The scraping and storing work, but I can't get the CSV file into a table format with two columns and multiple rows. The file should have a name column and a price column, with one row per product holding its name and price. This is my code:

require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

    def whey_scrapper
        company = 'Body+%26+fit'
        url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
        unparsed_page = open(url).read
        parsed_page = Nokogiri::HTML(unparsed_page)
        product_names = parsed_page.css('div.product-primary')
        name = Array.new
        product_names.each do |product_name| 
            name << product_name.css('h2.product-name').text
        end
        product_prices = parsed_page.css('div.price-box')
        price = Array.new
        product_prices.each do |product_price|
            price << product_price.css('span.price').text
        end
        headers = ["name", "price"]
        item = [name, price]
        CSV.open('data/wheyprotein.csv', 'w', :col_sep => "\t|", :headers => true) do |csv|
            csv << headers
            item.each {|row| csv << row }
        end
        byebug
    end   
    whey_scrapper

I add a row on each iteration over the items, but the CSV file is still unstructured and messy.

This is how my csv file looks:

name	|price
-----------------
"
                            
                                Whey Perfection                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection® bestseller box                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection - Special Series                                Body & fit
                            
                        "	|"
                            
                                Isolaat Perfection                                Body & fit
                            
                        "	|"
                            
                                Perfect Protein                                Body & fit
                            
                        "	|"
                            
                                Whey Isolaat XP                                Body & fit
                            
                        "	|"
                            
                                Micellar Casein Perfection                                Body & fit
                            
                        "	|"
                            
                                Low Calorie Meal                                Body & fit
                            
                        "	|"
                            
                                Whey Breakfast                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection - Flavour Box                                 Body & fit
                            
                        "	|"
                            
                                Protein Breakfast                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection Summer Box                                Body & fit
                            
                        "	|"
                            
                                Puur Whey                                Body & fit
                            
                        "	|"
                            
                                Whey Isolaat Crispy                                Body & fit
                            
                        "	|"
                            
                                Vegan Protein voordeel                                Body & fit vegan
                            
                        "	|"
                            
                                Whey Perfection Winter Box                                Body & fit
                            
                        "	|"
                            
                                Sports Breakfast                                Body & fit
                            
                        "
€ 7,90	|€ 9,90	|€ 11,90	|€ 17,90	|€ 31,90	|€ 18,90	|€ 12,90	|€ 6,90	|€ 6,90	|€ 10,90	|€ 15,90	|€ 9,90	|€ 26,90	|€ 6,90	|€ 24,90	|€ 9,90	|€ 20,90

1 Answer

First, the product names. You are extracting too much from the HTML: the h2 element contains surrounding whitespace and a nested span element, both of which should probably be ignored. You can do it like this:

product_names.each do |product_name| 
  name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
end
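
To see what the `gsub(/\s{2,}/, '')` call does, here is a minimal sketch that runs it on a hard-coded string instead of a live page. The `raw` value is an assumption modelled on the padding visible in the question's CSV output, and `clean_name` is a hypothetical helper, not part of the original code.

```ruby
# Hypothetical helper: removes every run of two or more whitespace characters.
def clean_name(raw)
  raw.gsub(/\s{2,}/, '')
end

# Assumed shape of the link's first text node: the product name padded
# by a newline and indentation on both sides.
raw = "\n                                Whey Perfection                                "

cleaned = clean_name(raw)
puts cleaned.inspect
```

Single spaces inside the name survive because the pattern only matches runs of two or more whitespace characters. If a name could itself contain a double space, `raw.strip` would probably be the safer choice, since it only trims the ends.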

Next, CSV expects each row to be passed as an array of fields. In your case there should be many arrays with two items each (product name and price). To build them, you can simply zip the two arrays:

items = name.zip(price)
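
As a quick illustration of what `zip` returns, here is a sketch with two short arrays. The values are copied from the question's output; in the real script both arrays come from the scraper.

```ruby
# Sample values taken from the question's output (assumed to line up by position).
name  = ["Whey Perfection", "Whey Perfection® bestseller box"]
price = ["€ 7,90", "€ 9,90"]

# Each element of the result is a [name, price] pair, i.e. one CSV row.
items = name.zip(price)
p items

# When the argument is shorter than the receiver, zip pads with nil,
# so a missing price shows up as an empty field rather than shifting rows.
short = name.zip(["€ 7,90"])
p short
```

Because `zip` pairs by position, this only gives correct rows if both selectors return the products in the same order and in equal numbers, which is worth checking with `name.size == price.size`.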

Then create the CSV file:

CSV.open('data/wheyprotein.csv', 'w') do |csv|
  csv << headers
  items.each {|row| csv << row }
end
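
To preview the rows without touching the file system, the same block can be passed to `CSV.generate`, which builds the text in memory. This is only a sketch: the `items` array is hard-coded sample data from the question's output, not scraped values.

```ruby
require 'csv'

headers = ["name", "price"]
# Hard-coded sample rows (assumption); the real ones come from name.zip(price).
items = [
  ["Whey Perfection", "€ 7,90"],
  ["Whey Perfection® bestseller box", "€ 9,90"]
]

# CSV.generate yields the same kind of writer as CSV.open,
# but returns the result as a String instead of writing a file.
output = CSV.generate do |csv|
  csv << headers
  items.each { |row| csv << row }
end

puts output
```

Note that the prices contain a comma, so with the default separator the CSV library wraps them in double quotes. That is valid CSV and is undone again when the file is parsed.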

The full method looks like this:

require 'open-uri'
require 'nokogiri'
require 'csv'

def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    unparsed_page = URI.open(url).read
    parsed_page = Nokogiri::HTML(unparsed_page)

    # Take only the first text node of the link, skipping the nested span.
    product_names = parsed_page.css('div.product-primary')
    name = Array.new
    product_names.each do |product_name|
        name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
    end

    product_prices = parsed_page.css('div.price-box')
    price = Array.new
    product_prices.each do |product_price|
        price << product_price.css('span.price').text
    end

    headers = ["name", "price"]
    # Pair each name with its price so that every CSV row has two fields.
    items = name.zip(price)
    # The data/ directory must already exist.
    CSV.open('data/wheyprotein.csv', 'w') do |csv|
        csv << headers
        items.each {|row| csv << row }
    end
end

3 Comments

I don't think you need headers: true when writing the CSV. I tested it and you can remove it with no effect. It can be used as an option to CSV.parse to get back hashes instead of just arrays of values.
Yes, you are right, the headers: true is unnecessary. Thanks!
Thanks a lot, my friend. @MrShemek
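
The `headers: true` behaviour mentioned in the comments can be sketched as follows, using an inline string as assumed sample data. Strictly speaking, `CSV.parse` with that option returns a `CSV::Table` of `CSV::Row` objects, which are converted to plain hashes here with `to_h`.

```ruby
require 'csv'

# Assumed sample data in the shape the answer's code writes.
data = "name,price\nWhey Perfection,\"€ 7,90\"\n"

# Without the option, every line (including the header) is a plain array.
rows_as_arrays = CSV.parse(data)
p rows_as_arrays

# With headers: true, the first line becomes the keys of each row.
table = CSV.parse(data, headers: true)
rows_as_hashes = table.map(&:to_h)
p rows_as_hashes
```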
