1

I am currently using the below code to web scrape data and then store it in a CSV file.

from bs4 import BeautifulSoup 
import requests

url='https://www.business-standard.com/rss/companies-101.rss'
soup = BeautifulSoup(requests.get(url).content, 'xml')

news_items = []

for item in soup.findAll('item'):
    news_item = {}
    news_item['title'] = item.title.text
    news_item['excerpt'] = item.description.text

    print(item.link.text)
    s = BeautifulSoup(requests.get(item.link.text).content, 'html.parser')

    news_item['text'] = s.select_one('.p-content').get_text(strip=True, separator=' ')
    news_item['link'] = item.link.text
    news_item['pubDate'] = item.pubDate.text
    news_item['Category'] = 'Company'
    news_items.append(news_item)

import pandas as pd
df = pd.DataFrame(news_items)
df.to_csv('company_data.csv',index = False)

When displaying the data frame, the results look fine as attached.enter image description here But while opening the csv file, the columns are not as expected. enter image description hereCan anyone tell me the reason.

4
  • 1
    This is a problem with Excel. Try going through Data/New Query/From File/From CSV. Commented Jul 15, 2020 at 12:57
  • @divingTobi Thanks. It is working now. Is there a way where I can directly open it? Commented Jul 15, 2020 at 13:02
  • 1
    No, not aware of any way. But why don't you save the file as a xlsx to begin with? df.to_excel('file.xlsx'). Commented Jul 15, 2020 at 13:16
  • Yes, will do that. Thanks! Commented Jul 15, 2020 at 13:24

2 Answers 2

2

The issue is that your data contains commas and the default seperator for to_csv is "," So each comma in your data set is treated as a seperate column.

If you perform df.to_excel('company_data.xlsx', index=False) you won't have this issue since it is not comma seperated.

Sign up to request clarification or add additional context in comments.

1 Comment

NO! By default the CSV delimiter of MS Excel is locale dependant, and is the semicolon ; on some West European locales
0

A csv file is not an Excel file despite what Microsoft pretends but is a text file containing records and fields. The records are separated with sequences of \r\n and the fields are separated with a delimiter, normally the comma. Fields can contain new lines or delimiters provided they are enclosed in quotation marks.

But Excel is known to have a very poor csv handling module. More exactly it can read what it has written, or what is formatted the way it would have written it. To be locale friendly, MS folks decided that they will use the locale decimal separator (which is the comma , in some West European locales) and will use another separator (the semicolon ; when the comma is the decimal separator). As a result, using Excel to read CSV files produced by other tools is a nightmare with possible workarounds like changing the separator when writing the CSV file, but no clean way. LibreOffice is far behind MS Office for most features except CSV handling. So my advice is to avoid using Excel for CSV files but use LibreOffice Calc instead.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.