1

I'm trying to parse a text file of tweets captured by another researcher. The first two lines are:

    {"firstpost_date": 1435805238, "title": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 https://t.co/0B8c0h1PwS", "url": "http://twitter.com/DANIELCP3/status/616437964219023360", "tweet": {"contributors": null, "truncated": false, "text": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 https://t.co/0B8c0h1PwS", "in_reply_to_status_id": null, "id": 616437964219023360, "favorite_count": 0, "source": "<a href=\"http://instagram.com\" rel=\"nofollow\">Instagram</a>", "retweeted": false, "coordinates": null, "timestamp_ms": "1435805238190", "entities": {"symbols": [], "user_mentions": [], "trends": [], "hashtags": [{"indices": [0, 10], "text": "Jetlounge"}, {"indices": [12, 29], "text": "100Days100Nights"}, {"indices": [30, 35], "text": "NOLA"}, {"indices": [37, 43], "text": "CWMB3"}], "urls": [{"indices": [44, 67], "url": "https://t.co/0B8c0h1PwS", "expanded_url": "https://instagram.com/p/4nhZSDRV5W/", "display_url": "instagram.com/p/4nhZSDRV5W/"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 0, "id_str": "616437964219023360", "favorited": false, "user": {"follow_request_sent": null, "profile_use_background_image": true, "geo_enabled": true, "description": null, "verified": false, "profile_image_url_https": "https://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 106863509, "profile_text_color": "333333", "followers_count": 566, "profile_sidebar_border_color": "C0DEED", "id_str": "106863509", "default_profile_image": false, "location": "NEW ORLEANS", "utc_offset": -18000, "statuses_count": 817, "profile_background_color": "C0DEED", "friends_count": 1354, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "notifications": null, "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "name": "internationaldaniel", "lang": "en", "profile_background_tile": false, "favourites_count": 211, "screen_name": "DANIELCP3", "url": "http://www.datpiff.com/B3-The-Set-Up-mixtape.687448.html", "created_at": "Wed Jan 20 22:56:10 +0000 2010", "contributors_enabled": false, "time_zone": "Central Time (US & Canada)", "protected": false, "default_profile": true, "following": null, "listed_count": 5}, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "und", "created_at": "Thu Jul 02 02:47:18 +0000 2015", "filter_level": "low", "in_reply_to_status_id_str": null, "place": null}, "author": {"author_img": "http://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "name": "internationaldaniel", "url": "http://twitter.com/danielcp3", "nick": "danielcp3", "followers": 555.0, "image_url": "http://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "type": "twitter", "influence_level": 1.0, "description": ""}, "original_author": {"author_img": "http://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "description": "", "url": "http://twitter.com/danielcp3", "nick": "danielcp3", "followers": 555.0, "image_url": "http://pbs.twimg.com/profile_images/555269587395026946/agpaj4CS_normal.jpeg", "type": "twitter", "influence_level": 1.0, "name": "internationaldaniel"}, "citation_date": 1435805238, "metrics": {"acceleration": 0, "ranking_score": 8.222051, "citations": {"influential": 1, "total": 2, "data": [{"timestamp": 1435777199, "citations": 0}, {"timestamp": 1435780799, "citations": 0}, {"timestamp": 1435784399, "citations": 0}, {"timestamp": 1435787999, "citations": 0}, {"timestamp": 1435791599, "citations": 0}, {"timestamp": 1435795199, "citations": 0}, {"timestamp": 1435798799, "citations": 0}, {"timestamp": 1435802399, "citations": 0}, {"timestamp": 1435805999, "citations": 0}, {"timestamp": 1435809599, "citations": 0}, {"timestamp": 1435813199, "citations": 0}, {"timestamp": 1435816799, "citations": 0}, {"timestamp": 1435820399, "citations": 0}, {"timestamp": 1435823999, "citations": 0}, {"timestamp": 1435827599, "citations": 0}, {"timestamp": 1435831199, "citations": 0}, {"timestamp": 1435834799, "citations": 0}, {"timestamp": 1435838399, "citations": 0}, {"timestamp": 1435841999, "citations": 0}, {"timestamp": 1435845599, "citations": 0}, {"timestamp": 1435849199, "citations": 0}, {"timestamp": 1435852799, "citations": 0}, {"timestamp": 1435856399, "citations": 0}, {"timestamp": 1435859999, "citations": 0}], "matching": 2, "replies": 0}, "peak": 0, "impressions": 7187, "momentum": 0}, "highlight": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 https://t.co/0B8c0h1PwS", "type": "tweet", "citation_url": "http://twitter.com/DANIELCP3/status/616437964219023360"}
    {"firstpost_date": 1435806666, "title": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 by internationalcorporation3 http://t.co/jiibcs21ho http://t.co/Ci2MkoKgMC", "url": "http://twitter.com/instaNewOrleans/status/616443954825981958", "tweet": {"contributors": null, "truncated": false, "text": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 by internationalcorporation3 http://t.co/jiibcs21ho http://t.co/Ci2MkoKgMC", "in_reply_to_status_id": null, "id": 616443954825981958, "favorite_count": 0, "source": "<a href=\"http://ifttt.com\" rel=\"nofollow\">IFTTT</a>", "retweeted": false, "coordinates": null, "timestamp_ms": "1435806666462", "entities": {"symbols": [], "media": [{"expanded_url": "http://twitter.com/instaNewOrleans/status/616443954825981958/photo/1", "sizes": {"large": {"h": 640, "resize": "fit", "w": 640}, "small": {"h": 340, "resize": "fit", "w": 340}, "medium": {"h": 600, "resize": "fit", "w": 600}, "thumb": {"h": 150, "resize": "crop", "w": 150}}, "url": "http://t.co/Ci2MkoKgMC", "media_url_https": "https://pbs.twimg.com/media/CI4MgsNWcAAO7Ry.jpg", "id_str": "616443954758840320", "indices": [96, 118], "media_url": "http://pbs.twimg.com/media/CI4MgsNWcAAO7Ry.jpg", "type": "photo", "id": 616443954758840320, "display_url": "pic.twitter.com/Ci2MkoKgMC"}], "hashtags": [{"indices": [0, 10], "text": "Jetlounge"}, {"indices": [12, 29], "text": "100Days100Nights"}, {"indices": [30, 35], "text": "NOLA"}, {"indices": [37, 43], "text": "CWMB3"}], "user_mentions": [], "trends": [], "urls": [{"indices": [73, 95], "url": "http://t.co/jiibcs21ho", "expanded_url": "http://ift.tt/1LW6KFA", "display_url": "ift.tt/1LW6KFA"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "retweet_count": 0, "id_str": "616443954825981958", "favorited": false, "user": {"follow_request_sent": null, "profile_use_background_image": true, "geo_enabled": false, "description": "Latest pics from New Orleans via Instagram", "verified": false, "profile_image_url_https": "https://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "id": 2957041439, "profile_text_color": "333333", "followers_count": 970, "profile_sidebar_border_color": "C0DEED", "id_str": "2957041439", "default_profile_image": false, "location": "", "utc_offset": null, "statuses_count": 154998, "profile_background_color": "C0DEED", "friends_count": 98, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "notifications": null, "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2957041439/1420223502", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "name": "Pics from NewOrleans", "lang": "ru", "profile_background_tile": false, "favourites_count": 0, "screen_name": "instaNewOrleans", "url": null, "created_at": "Fri Jan 02 18:24:44 +0000 2015", "contributors_enabled": false, "time_zone": null, "protected": false, "default_profile": true, "following": null, "listed_count": 385}, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Thu Jul 02 03:11:06 +0000 2015", "filter_level": "low", "in_reply_to_status_id_str": null, "place": null, "extended_entities": {"media": [{"expanded_url": "http://twitter.com/instaNewOrleans/status/616443954825981958/photo/1", "sizes": {"large": {"h": 640, "resize": "fit", "w": 640}, "small": {"h": 340, "resize": "fit", "w": 340}, "medium": {"h": 600, "resize": "fit", "w": 600}, "thumb": {"h": 150, "resize": "crop", "w": 150}}, "url": "http://t.co/Ci2MkoKgMC", "media_url_https": "https://pbs.twimg.com/media/CI4MgsNWcAAO7Ry.jpg", "id_str": "616443954758840320", "indices": [96, 118], "media_url": "http://pbs.twimg.com/media/CI4MgsNWcAAO7Ry.jpg", "type": "photo", "id": 616443954758840320, "display_url": "pic.twitter.com/Ci2MkoKgMC"}]}}, "author": {"author_img": "http://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "name": "InstaNewOrleans", "url": "http://twitter.com/instaneworleans", "nick": "instaneworleans", "followers": 29.0, "image_url": "http://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "type": "twitter", "description": "Latest pics from New Orleans via Instagram"}, "original_author": {"author_img": "http://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "description": "Latest pics from New Orleans via Instagram", "url": "http://twitter.com/instaneworleans", "nick": "instaneworleans", "followers": 29.0, "image_url": "http://pbs.twimg.com/profile_images/551083419187564545/5SaxR6d9_normal.jpeg", "type": "twitter", "name": "InstaNewOrleans"}, "citation_date": 1435806666, "metrics": {"acceleration": 48, "ranking_score": 8.218007, "citations": {"influential": 1, "total": 3, "data": [{"timestamp": 1435777199, "citations": 0}, {"timestamp": 1435780799, "citations": 0}, {"timestamp": 1435784399, "citations": 1}, {"timestamp": 1435787999, "citations": 0}, {"timestamp": 1435791599, "citations": 0}, {"timestamp": 1435795199, "citations": 0}, {"timestamp": 1435798799, "citations": 0}, {"timestamp": 1435802399, "citations": 0}, {"timestamp": 1435805999, "citations": 0}, {"timestamp": 1435809599, "citations": 0}, {"timestamp": 1435813199, "citations": 1}, {"timestamp": 1435816799, "citations": 0}, {"timestamp": 1435820399, "citations": 0}, {"timestamp": 1435823999, "citations": 0}, {"timestamp": 1435827599, "citations": 0}, {"timestamp": 1435831199, "citations": 0}, {"timestamp": 1435834799, "citations": 0}, {"timestamp": 1435838399, "citations": 0}, {"timestamp": 1435841999, "citations": 0}, {"timestamp": 1435845599, "citations": 0}, {"timestamp": 1435849199, "citations": 0}, {"timestamp": 1435852799, "citations": 0}, {"timestamp": 1435856399, "citations": 0}, {"timestamp": 1435859999, "citations": 0}], "matching": 3, "replies": 0}, "peak": 1435827599, "impressions": 8377, "momentum": 2}, "highlight": "#Jetlounge. #100Days100Nights #NOLA. #CWMB3 by internationalcorporation3 http://t.co/jiibcs21ho http://t.co/Ci2MkoKgMC", "type": "tweet", "citation_url": "http://twitter.com/instaNewOrleans/status/616443954825981958"}

All the ways I've tried to parse the text file have given me some version of this error:

    JSONDecodeError: Expecting value: line 2 column 1 (char 2)

My current code is simple but I've tried a few different versions:

    tweets = []
    for line in open('tweets.txt', 'r'):
    tweets.append(json.loads(line))

Utf-8 encoding didn't fix it and there's nothing obviously hinky with line 2 column 1 that I can see. I'm at a loss for what's causing the error so I'm not sure what to try to fix it.

5
  • If you are reading this line by line there is not line 2. It means line 1 is incomplete and the parser is looking for more. Note that the parser cannot see anything but that one line! It could be line 328 in your file, and it'd still just see it as line 1. Commented Aug 12, 2015 at 21:09
  • Provided the leading whitespace is just a posting error here, your 2 sample lines are valid JSON and load without issues. Commented Aug 12, 2015 at 21:12
  • So, does that mean one of the lines somewhere in the txt file are incomplete? How do I get around that? Commented Aug 12, 2015 at 21:18
  • You could start by logging the line that fails to load. Use try..except json.JSONDecodeError: to catch the exception, then report on the line and continue. Commented Aug 12, 2015 at 21:20
  • I tried ` tweets = [] try: for line in open('tweets.txt', 'r'): tweets.append(json.loads(line)) print line except ValueError as detail: # includes simplejson.decoder.JSONDecodeError print 'Decoding JSON has failed', detail ` and it's failing on the last line. So none of the data are flawed, it's just freaking out at the end when it runs out of data. Commented Aug 12, 2015 at 21:32

1 Answer 1

1

I think it's probably caused by a trailing newline at the end of the file.

for line in open('tweets.txt', 'r'):
    if line.strip():
        tweets.append(json.loads(line))
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.