0

I have about 1000 txt files in the format below. The identifier is separated from the data by ':' . However the data it self can contains ':' I am working on a script that can allow me to transpose and import the data into an excel, csv, microsoft acess or mssql format. in a structured way. Given that the identifiers will be the same for all the txt files.

Author: sharkespare
Content-Length: 15200098
Content-Type: application/pdf
Creation-Date: 2015-02-05T07:27:34Z
Last-Modified: 2015-02-05T07:28:38Z
Last-Save-Date: 2015-02-05T07:28:38Z
created: Thu Feb 05 08:27:34 CET 2015
creator: Williams spear
date: 2015-02-05T07:27:34Z
dc:creator: Library of congress
2
  • did you consider using excel macros? Commented Sep 25, 2015 at 12:53
  • Yes that will be an option. I am also working on some tool to do the job.. Commented Sep 26, 2015 at 8:53

3 Answers 3

0

What about:

Open txt file and split by ": " (with space after the colon), put it into a list.

text=[]
with open('read_file.txt','r') as f:
    for line in f:
        for word in line.split(": "):
            text.append(word)

Now every even element is an identifier and the next odd element ist the corresponding data. Rebuild the list putting every identifier and its data together:

sorted = [text[i:i+2] for i in range(0, len(text), 2)]

Finally write everything into a csv file in colums using zip:

import csv
with open('save_file.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(zip(*sorted))

If you don't want to transpose it, just replace zip(*sorted) with sorted. After saving the text from the first file, loop over the other ones and add only the data to the csv file. You can achieve that by opening with the 'a' parameter, allowing you to append stuff at the end of the file, and using [zip(*sorted)[1]]:

files = ["file1.txt", "file2.txt", "file3.txt", etc]
for each in  files:
    text=[]
    with open(str(each),'r') as f:
        for line in f:
            for word in line.split(": "):
                text.append(word)

    sorted = [lst[i:i+2] for i in range(0, len(lst), 2)]

    with open('save_file.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerows([zip(*sorted)[1]])
Sign up to request clarification or add additional context in comments.

3 Comments

Hi a.smiet.. Thanks for the input. I will give it a try.. and get back to you
Hi Smiet .. it is a good code.. have some typos.. will optimise it and paste my result later..
Fixed it using 'import glob path = '/home/download/*.txt' files=glob.glob(path) for file in files: '
0

You can loop thought the string until you encounter the ":" symbol and than look at the next letter if it is space( " " ) than the data you want is separated from the whole string

1 Comment

I can use sed or awk and find to clean the data. The main issue is how to transpose the data.. and auto import the data "files" into a csv format or database
0

You do not state what technique is preferred, but since python and Excel are tagged I assume you are OK using both. The following is pretty easy to implement in python. I am assuming the file format is static.

Create a header line string, e.g. "Author, Content-Length, ...". Open an output file and write the header to the file.

Loop through the directory of input files. For each file read the lines in the file and split of the identifier (you know the names of identifiers so a split or string replace should work, you could split by colon, but then you will have to take special care of the dc:creator identifier). Append the isolated data to a string for that file, append a ','. When done with the file append that string to the output file. When done with all files the output file will be a csv that can be opened with Excel.

Some pointers that might help with the implementation:

How can I iterate over files in a given directory?

How do I read a file line-by-line into a list?

Writing to CSV with Python adds blank lines

2 Comments

Dear ikkjo.. Thanks for the input.. .. I might be looking for a sample code. a link or .. something.. Will work on the pseudocode
Added some pointers to the answer.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.