Just a few quick considerations:
- You have `import os` twice
- You are not using `matplotlib` and `numpy`, so the imports can go
- The line `tweet = tweets[0]` is useless
- You're not closing the files you open, you should use the `with` keyword
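
To illustrate the last point, here is a minimal sketch using a throwaway temporary file (the file name and its contents are made up for the demonstration). The file is closed as soon as the `with` block ends, even if an exception is raised inside it:

```python
import os
import tempfile

# Hypothetical sample file, created only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w') as output_file:
    output_file.write('{"id": 1}\n')

with open(path) as input_file:
    first_line = input_file.readline()

# No explicit close() needed: leaving the with block closed the file
print(input_file.closed)  # True
```
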
Two optimizations:
- I'd remove the `print(file)`. This is probably the single best optimization you can do
- You're already looping once, so why do you loop another five times?
What about something like this (not tested!):

    import json
    import os
    from collections import defaultdict

    import pandas as pd

    elements_keys = ['id', 'text', 'lang', 'geo', 'place']
    elements = defaultdict(list)

    for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
        for file in files:
            if file.endswith('.json'):
                # os.walk yields bare file names, so join them with their directory
                with open(os.path.join(dirs, file), 'r') as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            values = [tweet[key] for key in elements_keys]
                        except (ValueError, KeyError):
                            continue  # skip malformed lines and tweets missing a field
                        for key, value in zip(elements_keys, values):
                            elements[key].append(value)

    df = pd.DataFrame({'Ids': elements['id'],
                       'Text': elements['text'],
                       'Lang': elements['lang'],
                       'Geo': elements['geo'],
                       'Place': elements['place']})
    df
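
Since the code above is untested, here is a small sketch that runs the same single-pass idea on a few made-up lines held in memory, so it can be tried without any files on disk. The sample lines are invented, and skipping records that lack one of the fields is my assumption about what you want, not something taken from your code:

```python
import json
from collections import defaultdict

# Made-up sample lines standing in for the contents of one tweet file
lines = [
    '{"id": 1, "text": "hi", "lang": "en", "geo": null, "place": null}',
    'not json at all',
    '{"delete": {"status": {"id": 2}}}',
    '{"id": 3, "text": "hola", "lang": "es", "geo": null, "place": null}',
]

elements_keys = ['id', 'text', 'lang', 'geo', 'place']
elements = defaultdict(list)

for line in lines:
    try:
        tweet = json.loads(line)
        # Read every field before appending, so the lists stay the same length
        values = [tweet[key] for key in elements_keys]
    except (ValueError, KeyError):
        continue  # invalid JSON, or a record without all five fields
    for key, value in zip(elements_keys, values):
        elements[key].append(value)

print(elements['id'])    # [1, 3]
print(elements['lang'])  # ['en', 'es']
```

Reading all five values before appending any of them keeps the lists aligned, which matters once they become DataFrame columns.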