I want to clean up my data so that I can understand it and sift through it more easily. So far I have downloaded a public Google Sheets document and tried to save it as a CSV file, but when I print the data it is messy and hard to read. The data comes from a website, and in the browser's developer tools I can see that it is neatly organized.

Like this: [image: website data in the browser's inspect view]

But when I print it in a Jupyter notebook, it looks messy, like this:

b'/O_o/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights 2019 (Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"% vs 2019 (Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights (7-day moving average)","type":"number","pattern":"General"},{"id":"H","label":"% vs 2019 (7-day Moving Average)","type":"number","pattern":"General"},{"id":"I","label":"Day 2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day Previous Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights Previous Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},

Is there a pandas way to clean this data up?

Essentially what I am trying to do is extract three variables from the data: country, date, and a number.

Here you can see that the row data starts at the key "rows":

[image: Jupyter output showing where the row data starts]

Each row gives a country, a date, and then a series of associated numbers.

What I want to get is the country name, a specific date, and a specific number.

For example, here is one section; this structure is repeated throughout the data:

{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},

From this section I only want the country name (Albania), the date ("2020-09-01"), and the number -0.5038.

Here is the code I used to grab the google spreadsheet data and save it as a csv:

import requests
import pandas as pd 

r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')

data = r.content

print(data)

Any and all advice would be amazing.

Thank you

1 Answer

I'm not sure how you arrived at this csv file, but the easiest way would be to get the JSON directly with requests, load it as a dict, and process it. Nonetheless, a solution for the current response would be:

import json

import pandas as pd
import requests

r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')

# The body is JSON wrapped in google.visualization.Query.setResponse(...);
# keep only the text between the first "(" and the last ")".
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0])

# Each row looks like {"c": [cell, ...]}: cell 0 is the country ("Entity"),
# cell 2 is the day (formatted value "f"), cell 5 is "% vs 2019 (Daily)".
d = [[i['c'][0]['v'], i['c'][2]['f'], i['c'][5]['v']] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['country', 'date', 'number'])
Output:
|    | country   | date       |        number |
|---:|:----------|:-----------|--------------:|
|  0 | Albania   | 2020-09-01 |     -0.503876 |
|  1 | Albania   | 2020-09-02 |     -0.358696 |
|  2 | Albania   | 2020-09-03 |     -0.302083 |
|  3 | Albania   | 2020-09-04 |     -0.135922 |
|  4 | Albania   | 2020-09-05 |     -0.43617  |
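
If more than three columns are needed, the same idea can be generalized by taking the column names from `table.cols` instead of hard-coding cell positions. The following is only a sketch under assumptions: `strip_jsonp`, `table_to_frame`, and `SAMPLE` are hypothetical names, and `SAMPLE` is a trimmed copy of the response shown in the question (first two rows, first six columns), with the `/*O_o*/` prefix assumed. It has not been run against the live spreadsheet.

```python
import json

import pandas as pd

# Trimmed copy of the response from the question: two rows, six columns.
# This is illustrative sample data, not the full live response.
SAMPLE = (
    '/*O_o*/\ngoogle.visualization.Query.setResponse('
    '{"version":"0.6","status":"ok","table":{"cols":['
    '{"id":"A","label":"Entity","type":"string"},'
    '{"id":"B","label":"Week","type":"number"},'
    '{"id":"C","label":"Day","type":"date"},'
    '{"id":"D","label":"Flights 2019 (Reference)","type":"number"},'
    '{"id":"E","label":"Flights","type":"number"},'
    '{"id":"F","label":"% vs 2019 (Daily)","type":"number"}],'
    '"rows":['
    '{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},'
    '{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},'
    '{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"}]},'
    '{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},'
    '{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},'
    '{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"}]}'
    ']}});'
)


def strip_jsonp(text):
    """Return the JSON payload between the first '(' and the last ')'."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


def table_to_frame(payload):
    """Build a DataFrame whose column names come from table.cols labels."""
    cols = payload["table"]["cols"]
    records = []
    for row in payload["table"]["rows"]:
        record = []
        for col, cell in zip(cols, row["c"]):
            if cell is None:
                # Assumption: empty cells may arrive as null.
                record.append(None)
            elif col["type"] == "date":
                # Use the formatted value ("2020-09-01"), not "Date(2020,8,1)".
                record.append(cell.get("f"))
            else:
                record.append(cell.get("v"))
        records.append(record)
    frame = pd.DataFrame(records, columns=[col["label"] for col in cols])
    for col in cols:
        if col["type"] == "date":
            frame[col["label"]] = pd.to_datetime(frame[col["label"]])
    return frame


frame = table_to_frame(strip_jsonp(SAMPLE))
print(frame[["Entity", "Day", "% vs 2019 (Daily)"]])
```

Note that the raw `"v"` value for dates, such as `Date(2020,8,1)`, appears to use a zero-based month (8 corresponds to September in the formatted value `2020-09-01`), which is why the sketch reads the formatted `"f"` value for date columns.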
6 Comments

You can also slice data to data = json.loads(data[47:-2])
@RJ Adriaansen, thank you! Is there a way to pull out a specific country's name and then grab its specific date and number? I need to extract specific countries and their associated data points.
@RJ Adriaansen, also, here is the website I am scraping the data from: eurocontrol.int/Economics/DailyTrafficVariation-States.html. I inspect the page, go into the XHR tab, and see where the GET requests are coming from. I am not sure how to get it as JSON and would love to know how.
No, now I see that the site itself loads it in this format from a Google spreadsheet, so you can settle for my code. Filtering by country can easily be done in pandas: df[df['country'] == 'France']
@RJ Adriaansen, I am sorry for my question; I now see that pandas prints only the first 5 rows. This is an amazing answer, thank you so much.
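
The filtering mentioned in the comments can be sketched as follows. This is a minimal example under assumptions: the Albania rows are copied from the answer's output table, while the "Examplia" row is an invented placeholder added only so that the filters have something to exclude; the variable names are hypothetical as well.

```python
import pandas as pd

# Albania rows come from the answer's output; "Examplia" is a made-up
# placeholder country, not real data.
df = pd.DataFrame(
    {
        "country": ["Albania", "Albania", "Albania", "Examplia"],
        "date": ["2020-09-01", "2020-09-02", "2020-09-03", "2020-09-01"],
        "number": [-0.503876, -0.358696, -0.302083, 0.0],
    }
)

# Real datetimes make date comparisons and sorting reliable.
df["date"] = pd.to_datetime(df["date"])

# All rows for one country.
albania = df[df["country"] == "Albania"]

# All rows for several countries at once.
several = df[df["country"].isin(["Albania", "Examplia"])]

# One country on one specific date, then the single number for that row.
mask = (df["country"] == "Albania") & (df["date"] == pd.Timestamp("2020-09-02"))
value = df.loc[mask, "number"].iloc[0]

print(albania)
print(len(albania), "rows for Albania; value on 2020-09-02:", value)
```

Printing `len(albania)` is a quick way to confirm how many rows matched when the displayed frame is truncated.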