ChatterOne

Just a few quick considerations:

  • You have import os twice
  • You are not using matplotlib and numpy, so the imports can go
  • The line tweet = tweets[0] is useless
  • You're not closing the files you open, you should use the with keyword
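As a minimal sketch of that last point: a file opened in a with block is closed as soon as the block ends, even if an exception is raised inside it. The temporary file and its contents here are made up purely for illustration.

```python
import os
import tempfile

# Hypothetical throwaway file, created only so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w') as output_file:
    output_file.write('{"id": 1}\n')

with open(path, 'r') as input_file:
    first_line = input_file.readline()

# Leaving the with block has already closed the file; no explicit close() needed.
print(input_file.closed)
```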

Two optimizations:

  • I'd remove the print(file). This is probably the single best optimization you can do
  • You're already looping once, why do you loop another five times?

What about something like this (not tested!):

import json
import os
from collections import defaultdict

import pandas as pd

elements_keys = ['id', 'text', 'lang', 'geo', 'place']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Dir'):
    for file in files:
        if file.endswith('.json'):
            # os.walk yields bare file names, so join them with their directory
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        # Read every field first, so a bad line appends nothing
                        values = [tweet[key] for key in elements_keys]
                    except (ValueError, KeyError, TypeError):
                        continue
                    for key, value in zip(elements_keys, values):
                        elements[key].append(value)

df = pd.DataFrame({'Ids': pd.Index(elements['id']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo']),
                   'Place': pd.Index(elements['place'])})
df
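
If you want to check the single-pass idea without touching the file system, the inner loop can be pulled out into a small function that takes any iterable of lines. This is only a sketch: collect_fields and FIELDS are names I made up, and recording missing keys as None (instead of skipping the line) is an assumption about what you want, not what the code above does.

```python
import json
from collections import defaultdict

# Assumed field names; adjust to whatever your tweets actually contain.
FIELDS = ['id', 'text', 'lang', 'geo', 'place']


def collect_fields(lines, fields=FIELDS):
    """Parse each line as JSON once and gather the requested fields."""
    elements = defaultdict(list)
    for line in lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            # Not valid JSON (json.JSONDecodeError is a ValueError): skip it
            continue
        if not isinstance(tweet, dict):
            continue
        for key in fields:
            # .get() stores None for a missing key, so all lists stay aligned
            elements[key].append(tweet.get(key))
    return elements


sample = [
    '{"id": 1, "text": "a", "lang": "en"}',
    'not json',
    '{"id": 2, "text": "b", "geo": null, "place": "x"}',
]
result = collect_fields(sample)
print(result['id'])
```

Because every list ends up with the same length, the result can be passed to pd.DataFrame without a length mismatch.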

