Creating new dataframe with .txt file using Pandas

Question

I have a text file with data displayed like this:

{"created_at":"Mon Jun 02 00:04:00 +0000 2018","id":870430762953920,"id_str":"87043076220","text":"Hello there","source":"\u003ca href=\"http:\/\/tapbots.com\/software\/tweetbot\/mac\" rel=\"nofollow\"\u003eTweetbot for Mac\u003c\/a\u003e","truncated":false,"in_reply_to_status_id"}

The data is Twitter posts, and I have hundreds of these in one text file. I want to get the key-value pair "text":"Hello there" and turn that into its own dataframe with a third column named target. I don't need any of the other columns. I'm doing some sensitivity analysis.

What would be the most Pythonic way to go about this? I thought about using df = pd.read_csv('test.txt', sep=r'"'), but then I don't know how to get rid of all the other columns I don't need and select the column with the text in it.

Any help would be much appreciated!

Just a heads up: there are some errors in the data. The value false will need to be capitalized, and the last key doesn't have a value. This should raise errors when you try to process it.
– Joseph Rajchwald, Commented Nov 26, 2019 at 18:57
You are aware that your text file is a JSON file? Note that Pandas can read JSON files.
– Parfait, Commented Nov 26, 2019 at 20:08
I did not! How can I create a dataframe with 2 columns from the JSON file?
– says, Commented Nov 26, 2019 at 21:13
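
A minimal sketch of what the JSON suggestion could look like, assuming every line of the file is one complete, valid JSON object (which the record shown in the question is not, since its last key has no value). The sample records and the empty target column are made up for illustration; with a real file, the path would be passed to pd.read_json in place of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical sample: two complete, valid JSON records, one per line.
sample = io.StringIO(
    '{"id":1,"text":"Hello there","truncated":false}\n'
    '{"id":2,"text":"Second tweet","truncated":false}\n'
)

# lines=True tells pandas to parse one JSON object per line.
tweets = pd.read_json(sample, lines=True)

# Keep only the text column and add an empty target column to fill in later.
df = tweets[["text"]].copy()
df["target"] = None
print(df)
```

Selecting with a list of column names (tweets[["text"]]) returns a DataFrame instead of a Series, which makes it straightforward to add the target column next to it.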

Joseph Rajchwald · Accepted Answer · 2019-11-26 19:03:23Z

I had to modify the last two key/value pairs in your data to get it to work. You may want to check whether you're getting the data correctly or whether you copied and pasted it properly, because you should be getting errors with the data as displayed in your post.

"truncated":False,"in_reply_to_status_id":1

Then this worked well for me:

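import ast
import pandas as pd

with open('test.txt', 'r') as inf1:
    # Parse the file contents as a Python dict literal; literal_eval avoids running arbitrary code
    d = ast.literal_eval(inf1.read())
index = range(len(d))  # one index entry per key in d
df = pd.DataFrame(d, index=index)  # an index is required because every value in d is a scalar
text = df.pop('text')  # pull the text column out as a Series
print(text)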

Returns

0    Hello there
1    Hello there
2    Hello there
3    Hello there
4    Hello there
5    Hello there
6    Hello there
Name: text, dtype: object
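
Note that the seven rows above are the same single record repeated once per key, because the index is built from the number of keys in the dict. A minimal sketch of an alternative, assuming the record has already been parsed into a dict: wrapping it in a list gives one row per record without passing an index. The shortened dict and the empty target column are illustrative only.

```python
import pandas as pd

# Hypothetical, shortened version of the parsed record.
d = {"id": 870430762953920, "text": "Hello there", "truncated": False}

# A list of dicts is read as one row per dict, so no explicit index is needed.
df = pd.DataFrame([d])

# Keep only the text column and add an empty target column.
text_df = df[["text"]].copy()
text_df["target"] = None
print(text_df)
```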

4 Comments

says Over a year ago

Is there a way to just ignore those two key:value pairs? I don't need them

Joseph Rajchwald Over a year ago

Not that I know of. Reading in the data did not work for me using either read_csv or eval because of them.

says Over a year ago

If that's the case, I'd need to change these key:value pairs for hundreds of records. Could you suggest the best way?

Joseph Rajchwald Over a year ago

Nothing immediately comes to mind. But that would require another post / round of research and testing.
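
One possible direction, offered as a sketch and not a tested solution for the real file: since only the text field is needed, it could be pulled out of each line with a regular expression, so the malformed key/value pairs never have to be parsed. This assumes one record per line and that the first "text" key on a line is the tweet text; the helper name extract_text and the sample lines are hypothetical.

```python
import json
import re

# Matches "text":"..." and captures the raw JSON string body, allowing escaped quotes.
TEXT_FIELD = re.compile(r'"text"\s*:\s*"((?:[^"\\]|\\.)*)"')


def extract_text(line):
    """Return the decoded value of the first "text" field in line, or None."""
    match = TEXT_FIELD.search(line)
    if match is None:
        return None
    # Re-wrap the captured body in quotes so json.loads decodes escapes such as \u003c.
    return json.loads('"' + match.group(1) + '"')


# Hypothetical lines, malformed like the record in the question (last key has no value).
lines = [
    '{"id":1,"text":"Hello there","truncated":false,"in_reply_to_status_id"}',
    '{"id":2,"text":"She said \\"hi\\"","truncated":false,"in_reply_to_status_id"}',
    'not a tweet',
]

# Lines without a text field are skipped; target is left empty for later labeling.
records = [
    {"text": t, "target": None}
    for t in map(extract_text, lines)
    if t is not None
]
print(records)
```

Passing records to pd.DataFrame would then give a frame with the text and target columns. If a record contains a nested object that also has a "text" key before the tweet text, this pattern would pick up the wrong value, so the results are worth spot-checking.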
