I've streamed tweets with Tweepy and stored them in a text file. Now I want to convert this file into a pandas DataFrame, but I don't know how. I've looked for similar posts here on Stack Overflow and in the pandas documentation, but I'm still not sure how to start parsing all of this data.

Answer: I solved this by reading the JSON file line by line into a list, which I could then turn into a DataFrame. Thank you to everyone who helped.

    import json
    import pandas as pd

    tweets = []
    with open('tweets.txt', 'r') as f:
        for line in f:  # one JSON object per line
            tweets.append(json.loads(line))

    df = pd.DataFrame(tweets)
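
If the file contains blank lines between tweets (an assumption; it depends on how the stream was written to disk), `json.loads` will raise on them. A minimal sketch that skips such lines, where the helper name `load_tweets` is hypothetical and not part of the original solution:

```python
import json
import pandas as pd

def load_tweets(path):
    """Read one JSON object per non-blank line (hypothetical helper)."""
    tweets = []
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of passing them to json.loads
                tweets.append(json.loads(line))
    return pd.DataFrame(tweets)
```

Calling `load_tweets('tweets.txt')` would then give the same kind of DataFrame as the loop above.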

I can't tell whether the data makes sense once read into pandas. However, the suggestion from @chargingupfor does seem to work. Commented Nov 28, 2019 at 19:18

2 Answers

You don't have to convert your text file to JSON in order to read it as a pandas DataFrame. Just do:

pd.read_json('yourfile.txt')

and it should work. This assumes that your format is:

{"name": "first json"}

and not:

{"name": "first json"}{"name": "second json"}

However, if you do have the second format, you can use any of these methods (there are many more):

Iterate through the file -> track the open brackets -> create json objects on the go -> append them to a list -> feed the list into pandas.

def parseMultipleJSON(lines):
    # Track brace depth; a depth of 0 after "}" marks the end of one object.
    # Note: braces inside string values are counted too, so this can mis-split.
    depth = prev = 0
    data = []
    text = ''.join(lines)
    for idx, char in enumerate(text):
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                data.append(json.loads(text[prev:idx+1]))
                prev = idx+1
    return data

Or use split as such and add removed brackets:

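def parseMultipleJSON2(lines):
    # Note: this also splits on any '}{' that occurs inside a string value.
    parts = ''.join(lines).split('}{')
    data = []
    for idx, part in enumerate(parts):
        if idx < len(parts) - 1:  # split removed this part's closing brace
            part += '}'
        if idx > 0:               # split removed this part's opening brace
            part = '{%s' % part
        data.append(json.loads(part))
    return data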

This is the same as the second solution but abbreviated:

def parseMultipleJSON3(lines):
    parts = ''.join(lines).split('}{')
    last = len(parts) - 1
    # Restore the braces that split() removed between adjacent objects.
    return [json.loads(('{' if idx > 0 else '') + part + ('}' if idx < last else ''))
            for idx, part in enumerate(parts)]

Then call whichever one you prefer, like so:

import pandas as pd
import json

with open('yourfile.txt','r') as json_file:
    lines = json_file.readlines()
    lines = [line.strip("\n") for line in lines]
    #data = parseMultipleJSON(lines)
    #data = parseMultipleJSON2(lines)
    data = parseMultipleJSON3(lines)

df = pd.DataFrame(data)
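
As an alternative to the hand-written parsers above, the standard library's `json.JSONDecoder.raw_decode` can pull consecutive objects out of a single string, and it is not confused by braces inside string values. A minimal sketch, where the function name `parse_concatenated_json` and the sample data are hypothetical and not part of the original answer:

```python
import json

def parse_concatenated_json(text):
    """Decode back-to-back JSON values such as '{"a": 1}{"b": 2}' (hypothetical helper)."""
    decoder = json.JSONDecoder()
    data = []
    pos = 0
    while pos < len(text):
        # Skip whitespace (including newlines) between objects.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        data.append(obj)
    return data

# Made-up sample: the first tweet contains '}{' inside its text.
sample = '{"tweet": "braces }{ in text"}{"tweet": "second", "user": {"id": 1}}\n'
records = parse_concatenated_json(sample)
```

The resulting list can be passed to `pd.DataFrame(records)` in the same way as the output of the functions above.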
4 Comments

I'm getting a trailing data error. Any suggestions?
Hey, I'm guessing you get this when reading multiple tweets, right? It should not be a problem with a single JSON object. If so, you need to convert them to a list of JSON objects, such as data = [{"tweet": "the first tweet"}, {"tweet": "the second tweet"}], and then do df = pd.DataFrame(data). The problem is that right now your tweets are back to back without a delimiter: {"tweet": "the first tweet"}{"tweet": "the second tweet"}.
Thanks. I have a huge number of tweets in the file. How would you go about converting them to a list?
@PhilipLiu Oh, I didn't see that you had updated the question. I guess you won't be needing my updated answer, but glad I could help.
If your JSON file (yourfile.txt) contains multiple tweets, one JSON object per line, you can read them all into a DataFrame with:

df = pd.read_json('yourfile.txt', lines=True)
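
A small self-contained sketch of the same call, using an in-memory buffer in place of yourfile.txt (the sample tweets are made up for illustration). With `lines=True`, pandas expects one JSON object per line:

```python
import io
import pandas as pd

# Two made-up tweets in JSON Lines form: one object per line.
sample = io.StringIO(
    '{"id": 1, "text": "the first tweet"}\n'
    '{"id": 2, "text": "the second tweet"}\n'
)

df = pd.read_json(sample, lines=True)
```

Each line becomes one row, and the keys of the objects become the columns.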
