2

I am attempting to ingest an entire directory of .txt files into a pandas DataFrame, such that each row in the DataFrame holds the content of one file.

The text files, as far as I can tell, are not delimited; they are the bodies of email messages. Every file but one gets split across many rows, so instead of twenty-something rows (one per file) I end up with over 500. I cannot tell how the one file differs from the rest; they are all plain text.

The code I am using is:

import csv
import pandas as pd

list_ = []
for i in files:  # files: the list of file names in the directory
    list_.append(pd.read_csv('//directory' + i, sep="\t", quoting=csv.QUOTE_NONE,
                             header=None, names=["message", "label"]))

I've set the separator to tab as I thought it would not affect the ingestion of the text at all. Any ideas what the problem is here?

1
  • How about the white space "\s+" as the separator argument? Commented Nov 25, 2015 at 9:36

2 Answers

6

You are reading the emails as CSV files, so the file contents will be:

  1. Split at every tab separator to create a column; whatever separator you choose, I suspect it will be a bad choice, since any character is likely to appear in the body of an email;

  2. Split at every newline to create a new row (which probably explains your 500 rows).
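The second point is easy to demonstrate; here is a minimal sketch, using an in-memory buffer in place of an actual email file:

```python
import io
import pandas as pd

# A three-line "email" body containing no tabs at all
body = "Hi Bob,\nSee the attached report.\nThanks, Alice\n"

# read_csv still produces one row per newline, regardless of the separator
df = pd.read_csv(io.StringIO(body), sep="\t", header=None)
print(len(df))  # 3 rows for a single "email"
```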

Since emails are not CSV files, why not just write your own code to read each file individually into a string, then create a data frame out of all of those strings? For example, to read all the files in the current directory as strings:

import os
import pandas as pd

data = []
path = '.'
files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
for f in files:
    with open(os.path.join(path, f), "r") as myfile:
        data.append(myfile.read())

df = pd.DataFrame(data)

Here is an example of this in action as it were:

$ ls .
test1.txt  test2.txt  load_files.py

$ cat load_files.py 

import pandas as pd
import os

data = []
path = '.'
files = [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
for f in files:
    with open(os.path.join(path, f), "r") as myfile:
        data.append(myfile.read())

df = pd.DataFrame(data)
print(df)


$ cat test1.txt 
asdasd
ada
adasd

$ cat test2.txt 
sasdad
asd
dadaadad

$ python load_files.py 
                                                   0
0                               asdasd\nada\nadasd\n
1                          sasdad\nasd\ndadaadad\n\n
2  import pandas as pd\nimport os\n\ndata = []\np...

3 Comments

Yeah, you are correct with your first assessment. Thanks for your help, it works out great. I did a little manipulation though: `path = '//directory-path'; data = []; for f in [f for f in os.listdir(path) if not f.startswith('.')]: print(f); with open(path + f, "r") as myfile: data.append(myfile.read().replace('\n', '')); df = pd.DataFrame(data); print(df)`
Sorry about the formatting of the code; I can never work out how to format code in comments.
Glad it helped. I removed newline chars initially, but I suppose this is application specific. Sometimes you may want to keep them (e.g. what if later on you want to know the average length of an email line). Also, I think you may want to replace them with ' ' rather than '', since as it stands you merge the last word on every line with the first word on the next line, which doesn't seem a good idea. If it helped, can you also accept the answer please?
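The merge that comment warns about, as a quick sketch:

```python
text = "line one ends here\nand line two starts"

# Stripping the newline entirely glues adjacent words together
print(text.replace('\n', ''))   # "...ends hereand line two..."

# Replacing it with a space keeps the words separate
print(text.replace('\n', ' '))  # "...ends here and line two..."
```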
4

After reading the answer by @paul-g, I decided to go about it a little differently. For context, my application is part of an NLP project. My files had unique identifiers, so the list approach wasn't quite what I was looking for; instead I went with a dictionary, using the file name as the unique identifier. Note that you may have to do additional cleaning if your directory has files beyond the ones you want to load; mine contained only my text files. Unlike the ls example in @paul-g's answer, my Python files were in a different directory, so the Python file was not included in my data frame.

import pandas as pd
import os

file_names = os.listdir('<folder file path here>')

# Create a dictionary mapping file name to text
file_name_and_text = {}
for file in file_names:
    with open('<folder file path here>' + file, "r") as target_file:
        file_name_and_text[file] = target_file.read()

file_data = (pd.DataFrame.from_dict(file_name_and_text, orient='index')
             .reset_index()
             .rename(index=str, columns={'index': 'file_name', 0: 'text'}))

This will give you a data frame as follows:

   file_name                      text
0  file1.txt  This is text from file 1
1  file2.txt  This is text from file 2

Edit: If you have a lot of small text files, this can be adjusted to use Python's `ThreadPool` (from `multiprocessing.pool`) to speed up the load time.
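One way that edit might look: a sketch of the same dictionary approach driven by a `ThreadPool`. Note this writes two sample files into a temporary directory purely for the demo; in practice you would point `folder` at your own directory.

```python
import os
import tempfile
import pandas as pd
from multiprocessing.pool import ThreadPool

def read_file(path):
    """Return (file_name, contents) for one text file."""
    with open(path, "r") as f:
        return os.path.basename(path), f.read()

# Demo setup: write a couple of sample files into a temp directory.
folder = tempfile.mkdtemp()
for name, text in [("file1.txt", "This is text from file 1"),
                   ("file2.txt", "This is text from file 2")]:
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)

paths = [os.path.join(folder, name) for name in os.listdir(folder)]

# Read the files concurrently; 8 worker threads, tune for your workload
with ThreadPool(8) as pool:
    file_name_and_text = dict(pool.map(read_file, paths))

file_data = (pd.DataFrame.from_dict(file_name_and_text, orient='index')
             .reset_index()
             .rename(index=str, columns={'index': 'file_name', 0: 'text'}))
print(file_data.sort_values('file_name'))
```

Since reading small files is I/O-bound, threads (rather than processes) are usually enough to overlap the disk waits.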

Comments
