
I am having problems exporting data to CSV with a Python script. Some array data needs to be exported from MongoDB to CSV, but the script below does not export it properly: the three subfields are dumped into a single column. I want to separate the three fields (order, text, answerId) under the answers field into three different columns in the CSV.

A sample of the MongoDB document:

"answers": [
        {
            "order": 0,
            "text": {
                "en": "Yes"
            },
            "answerId": "527d65de7563dd0fb98fa28c"
        },
        {
            "order": 1,
            "text": {
                "en": "No"
            },
            "answerId": "527d65de7563dd0fb98fa28b"
        }
    ]

The Python script:

import csv

cursor = db.questions.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})
cursor = list(cursor)

with open('answer_2.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor:
        for y, v in x.iteritems():
            if y == 'answers':
                print(y, v)
                write.writerow(v)
                write.writerow(x)
  • Could you edit your question to show properly formatted code? It would also help if you added a sample of what the data you're getting from Mongo looks like. Commented Oct 25, 2016 at 17:23
  • Please help me out. Commented Oct 25, 2016 at 17:36

1 Answer


So... The problem is that the csv writer doesn't understand the concept of "subdictionaries" the way Mongo returns them.

If I understood correctly, when you query Mongo, you get a dictionary like this:

{
   "_id": "a hex ID that correspond with the record that contains several answers",
   "answers": [ ... a list with a bunch of dicts in it... ]
}

So when the csv.DictWriter tries to write that, it only writes one dictionary (the topmost one). It doesn't know (or care) that answers is a list containing dictionaries whose values need to be written in columns as well. Accessing fields in nested dictionaries using dot notation, such as answers.order, is only understood by Mongo, not by the csv writer.
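
To see why, note that to Python a key like answers.order is just a flat string, not a path into nested dictionaries. A minimal illustration (with a made-up record, not from the original data):

record = {'_id': 'abc123', 'answers': [{'order': 0}]}

print(record['answers'][0]['order'])  # 0 -- explicit nested indexing works
print(record.get('answers.order'))    # None -- the dotted name is just a flat string key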

What you should do, as I understand it, is "walk" the list of answers and create one flat dictionary out of each record (each dictionary) in that list. Once you have a list of "flattened" dictionaries, you can pass those to the writer and write them to your csv file:

import csv

from pymongo import MongoClient

client = MongoClient()  # assumes a MongoDB instance on the default localhost:27017

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)

Watch out for the use of plurals: answers_record is different from answer_record.

That creates a file like this:

$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
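
Note that the answers.text column still contains the whole subdocument ({u'en': u'Yes'}). If you only want the English string, you could flatten that level as well; a small sketch, assuming every text subdocument has an en key (replace the flattened_record construction above):

flattened_record = {
    '_id': answers_record_id,
    'answers.order': answer_record['order'],
    # .get('en') extracts just the English string (None if 'en' is missing)
    'answers.text': answer_record['text'].get('en'),
    'answers.answerId': answer_record['answerId']
}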

EDIT:

Your query (the one that does cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})) will return all the entries in the questions collection. If this collection is very large, you might want to use the cursor as an iterator.

As you might have already realized, the first for loop in my code above puts all the records in a list (the flattened_records list). You can load lazily instead by iterating through the cursor directly: instead of loading all the items in memory at once, you fetch one, do something with it, get the next, and so on.

It's slightly slower, but more memory efficient.

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)

It will produce the same .csv file as shown above.
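
A side note: this code is written for Python 2 (iteritems(), the u'...' strings in the output). On Python 3 you would use items() instead of iteritems(), and the csv docs recommend opening the file with newline='' so the csv module controls line endings. A small sketch of the file handling on Python 3:

import csv

fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']

# newline='' lets the csv module handle line endings itself, per the csv module docs
with open('stack_039.csv', 'w', newline='') as outfile:
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    # ... then write the flattened records exactly as shown above ...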


6 Comments

It worked. It is awesome! I got four separate fields in the csv after running the script. Thanks so much.
Glad I was able to help! By the way, if my answer solved your problem, feel free to click the big checkbox to its left to accept it as the answer (SO netiquette: meta.stackexchange.com/questions/14994/…)
If anyone wants to use the script, make sure to put a space between step 1 and step 2. It works perfectly. Thanks for your help.
@user7070824, I added a "bonus" in my answer (how to make it more memory efficient)
The second script worked too; I tested it. @BorrajaX, I already posted another problem. Could you have a look if you have time? The issue is that the type field cannot be added to a list in the Python script, because another collection, stages, has a number of field names which do not match in every record. Thanks again.
