Find a specific header in a CSV file using python 3 code

Question

Right now I have Python 3 code that takes a column of data in a CSV file, splits the phrase in each cell into individual words on spaces, then exports the data to a new CSV file.

Is there a way to tell Python to apply the formatting code only to the column with a particular header?

Here is what my source data looks like

Keyword              Source       Number 
Lions Tigers Bears     US          3
Dogs Zebra            Canada       5
Sharks Guppies         US          2

and here is my code which delimits the phrases in each cell into individual words based on a space

with open(r'C:\Users\jk\Desktop\helloworld.csv', 'r') as datafile:
    data = []
    for row in datafile:
        data.extend(item.strip() for item in row.split())
with open('test.csv', 'w') as a_file:
    for result in data:
        result = ''.join(result)
        a_file.write(result + '\n')
        print(result)

so that the source data becomes

 Keyword          Source         Number
 Lions            US              3
 Tigers
 Bears
 Dogs             Canada          5

etc

In this case, I only need all of this code to apply to the one column with the heading Keyword. Ideally, what I am trying to do is also extend the data found in the "Source" and "Number" to these newly created rows (Lions US 3 -- Tigers US 3 -- Bears US 3 etc) but I haven't really figured out that part yet!

I've been poking around the forum for a while trying to find an answer, and I know you can tell Python to read the first line of the CSV file, where the headers are placed (headers = file.readline()), but beyond that I am lost. Would this be an easier task using the CSV reader?

Hi, Martijn -- the file is in CSV format so I do not believe so
– user3682157, Commented Aug 16, 2014 at 15:06
The C stands for Character; both commas and tabs are common. I'll assume you have comma-separated data then; your sample data doesn't give much of a hint.
– Martijn Pieters, Commented Aug 16, 2014 at 15:07
@JonClements: so many people load the CSV into Excel and then show the results from that rather than the actual file contents.
– Martijn Pieters, Commented Aug 16, 2014 at 15:09

Martijn Pieters · Accepted Answer · 2014-08-17 22:53:36Z

Use the csv module to split your data into columns. Use the csv.DictReader() object to make it easier to select a column by the header:

import csv

source = r'C:\Users\jk\Desktop\helloworld.csv'
dest = 'test.csv'

with open(source, newline='') as inf, open(dest, 'w', newline='') as outf:
    reader = csv.DictReader(inf)
    writer = csv.DictWriter(outf, fieldnames=reader.fieldnames)
    for row in reader:
        words = row['Keyword'].split()
        row['Keyword'] = words[0]
        writer.writerow(row)
        writer.writerows({'Keyword': w} for w in words[1:])

The DictReader() will read the first row from your file and use it as the keys for the dictionaries produced for each row; so a row looks like:

{'Keyword': 'Lions Tigers Bears', 'Source': 'US', 'Number': '3'}

Now you can address each column individually, and update the dictionary with just the first word of the Keyword column before producing additional rows for the remaining words.
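Since the question is about finding one particular header, it may help to check that the header exists before using it. The following is only a sketch: column_values is a hypothetical helper name, and the StringIO object stands in for the real file, using the sample data from the question.

```python
import csv
from io import StringIO

# Stand-in for an open file; the contents mirror the question's sample data.
sample = StringIO(
    "Keyword,Source,Number\n"
    "Lions Tigers Bears,US,3\n"
    "Dogs Zebra,Canada,5\n"
)


def column_values(fileobj, header):
    """Return every value in the column named `header` (hypothetical helper)."""
    reader = csv.DictReader(fileobj)
    # fieldnames is taken from the first row; it is None for an empty file.
    if reader.fieldnames is None or header not in reader.fieldnames:
        raise KeyError(f"no column named {header!r}")
    return [row[header] for row in reader]


keywords = column_values(sample, "Keyword")
print(keywords)
```

Failing early with a clear message is a design choice here; without the check, a misspelled header would only surface as a KeyError on the first row lookup.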

I'm assuming here that your files are comma separated. If a different delimiter is needed, then set the delimiter argument to that character:

reader = csv.DictReader(inf, delimiter='\t')

for a tab-separated format. See the module documentation for the various options, including pre-defined format combinations called dialects.
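As an illustration of a dialect, the csv module ships one named excel-tab, which can be passed instead of an explicit delimiter. This sketch assumes tab-separated input and uses an in-memory sample rather than the real file.

```python
import csv
from io import StringIO

# In-memory stand-in for a tab-separated file (assumed layout).
tab_sample = StringIO(
    "Keyword\tSource\tNumber\n"
    "Lions Tigers Bears\tUS\t3\n"
)

# 'excel-tab' is one of the dialects registered by the csv module.
print(csv.list_dialects())

reader = csv.DictReader(tab_sample, dialect="excel-tab")
first = next(reader)
print(first["Keyword"], first["Source"], first["Number"])
```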

Demo:

>>> import sys
>>> import csv
>>> from io import StringIO
>>> sample = StringIO('''\
... Keyword,Source,Number
... Lions Tigers Bears,US,3
... Dogs Zebra,Canada,5
... Sharks Guppies,US,2
... ''')
>>> output = StringIO()
>>> reader = csv.DictReader(sample)
>>> writer = csv.DictWriter(output, fieldnames=reader.fieldnames)
>>> for row in reader:
...     words = row['Keyword'].split()
...     row['Keyword'] = words[0]
...     writer.writerow(row)
...     writer.writerows({'Keyword': w} for w in words[1:])
... 
12
15
13
>>> print(output.getvalue())
Lions,US,3
Tigers,,
Bears,,
Dogs,Canada,5
Zebra,,
Sharks,US,2
Guppies,,
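The demo leaves Source and Number empty on the extra rows and writes no header line. The question also asks about carrying those two values onto every new row (Lions US 3, Tigers US 3, and so on). One possible variation, offered as a sketch rather than as part of the original answer, copies the whole row for each word and calls writeheader() first:

```python
import csv
from io import StringIO

sample = StringIO(
    "Keyword,Source,Number\n"
    "Lions Tigers Bears,US,3\n"
    "Dogs Zebra,Canada,5\n"
    "Sharks Guppies,US,2\n"
)
output = StringIO()

reader = csv.DictReader(sample)
writer = csv.DictWriter(output, fieldnames=reader.fieldnames)
writer.writeheader()  # emit the Keyword,Source,Number line

for row in reader:
    # One output row per word; every other column is copied unchanged.
    # Note: a row whose Keyword cell is empty produces no output rows.
    for word in row["Keyword"].split():
        writer.writerow({**row, "Keyword": word})

result = output.getvalue()
print(result)
```

With real files, the two StringIO objects would be replaced by the open(source, newline='') and open(dest, 'w', newline='') handles from the answer.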

row['Keyword'] = row['Keyword'].replace(' ','') looks better to me than the split/join. I have a feeling the OP could be after an output row of 3 columns though for the Lions Tigers Bears (but not sure)
@JonClements: I was trying to 'reuse' a concept from the OP code; simply showing how to apply the same transformation to just one column. Since the original code uses ''.join() I think it is fair to assume that that is the goal but for just one column.
@PadraicCunningham: no, the point was to remove spaces from that column. The column is split on whitespace, then re-joined without.
Hi Martijn, thank you for the response and the commentary! Very useful, especially the DictReader! The only issue is that my original code doesn't just remove the whitespace within the Keyword column (i.e. lionstigersbears). It splits each of those "phrases" so that the result is a stacked column where each cell holds only one word. I have updated my original post to make this clearer.
