2

I'm having difficulty coming up with a simple solution to make a nice simple dataframe to work with from text in the format below:

    Dose [Gy]   Relative dose [%]    Structure Volume [cm³]
            0                   0                   45888.7
          0.1            0.166667                   27061.7
          0.2            0.333333                   18911.6
          0.3                 0.5                   14907.6
          0.4            0.666667                   12602.7
          0.5            0.833333                   11127.8
          0.6                   1                   10041.9
          0.7             1.16667                   9184.75
          0.8             1.33333                   8480.96
          0.9                 1.5                   7885.19
            1             1.66667                   7382.82
          1.1             1.83333                   6947.77
          1.2                   2                   6570.69
          1.3             2.16667                   6242.93
          1.4             2.33333                   5959.37
          1.5                 2.5                   5713.12
          1.6             2.66667                   5497.12
          1.7             2.83333                   5305.86
          1.8                   3                    5135.8
          1.9             3.16667                   4983.65
            2             3.33333                   4846.38
          2.1                 3.5                    4720.5
          2.2             3.66667                   4604.54
          2.3             3.83333                    4496.7
          2.4                   4                   4396.11
          2.5             4.16667                   4303.21

What I was doing was directly indexing the value on each line, like:

  for line in lines:
       value1 = line[10:20]
       value3 = line[55:70]

However, its not very pythonic, and not robust at all.

Now I am trying to let pandas do the heavy lifting and am struggling to get the data to come out correctly. For example:

df = pd.read_csv(StringIO.StringIO(data), sep="          ",engine='python')

Which outputs something that still includes new lines "\n" and "'" along with the numbers.

Is there a smarter way to tackle this? Or do I need to do a lot of pre-processing before pandas can work with it?

Thanks for any help/advice!

2
  • Can you post the raw text file as is, not as a list as you've done as this confuses things, thanks Commented Oct 25, 2016 at 8:03
  • Done, hopefully that is what you mean? Commented Oct 25, 2016 at 8:06

3 Answers 3

2

I think you need separator s{2,} - 2 or more whitespaces:

import pandas as pd
import numpy as np
from pandas.compat import StringIO

temp=u"""Dose [Gy]   Relative dose [%]    Structure Volume [cm³]
            0                   0                   45888.7
          0.1            0.166667                   27061.7
          0.2            0.333333                   18911.6
          0.3                 0.5                   14907.6
          0.4            0.666667                   12602.7
          0.5            0.833333                   11127.8
          0.6                   1                   10041.9
          0.7             1.16667                   9184.75
          0.8             1.33333                   8480.96
          0.9                 1.5                   7885.19
            1             1.66667                   7382.82
          1.1             1.83333                   6947.77
          1.2                   2                   6570.69
          1.3             2.16667                   6242.93
          1.4             2.33333                   5959.37
          1.5                 2.5                   5713.12
          1.6             2.66667                   5497.12
          1.7             2.83333                   5305.86
          1.8                   3                    5135.8
          1.9             3.16667                   4983.65
            2             3.33333                   4846.38
          2.1                 3.5                    4720.5
          2.2             3.66667                   4604.54
          2.3             3.83333                    4496.7
          2.4                   4                   4396.11
          2.5             4.16667                   4303.21"""
#after testing replace StringIO(temp) to filename
df = pd.read_csv(StringIO(temp),sep=r'\s{2,}', engine='python')
print (df)
    Dose [Gy]  Relative dose [%]  Structure Volume [cm³]
0         0.0           0.000000                45888.70
1         0.1           0.166667                27061.70
2         0.2           0.333333                18911.60
3         0.3           0.500000                14907.60
4         0.4           0.666667                12602.70
5         0.5           0.833333                11127.80
6         0.6           1.000000                10041.90
7         0.7           1.166670                 9184.75
8         0.8           1.333330                 8480.96
9         0.9           1.500000                 7885.19
10        1.0           1.666670                 7382.82
11        1.1           1.833330                 6947.77
12        1.2           2.000000                 6570.69
13        1.3           2.166670                 6242.93
14        1.4           2.333330                 5959.37
15        1.5           2.500000                 5713.12
16        1.6           2.666670                 5497.12
17        1.7           2.833330                 5305.86
18        1.8           3.000000                 5135.80
19        1.9           3.166670                 4983.65
20        2.0           3.333330                 4846.38
21        2.1           3.500000                 4720.50
22        2.2           3.666670                 4604.54
23        2.3           3.833330                 4496.70
24        2.4           4.000000                 4396.11
25        2.5           4.166670                 4303.21
Sign up to request clarification or add additional context in comments.

Comments

1

use read_fwf as it's a fixed width file and pass the column positions as a list of tuple pairs:

In [63]:
t="""    Dose [Gy]   Relative dose [%]    Structure Volume [cm³]
            0                   0                   45888.7
          0.1            0.166667                   27061.7
          0.2            0.333333                   18911.6
          0.3                 0.5                   14907.6
          0.4            0.666667                   12602.7
          0.5            0.833333                   11127.8
          0.6                   1                   10041.9
          0.7             1.16667                   9184.75
          0.8             1.33333                   8480.96
          0.9                 1.5                   7885.19
            1             1.66667                   7382.82
          1.1             1.83333                   6947.77
          1.2                   2                   6570.69
          1.3             2.16667                   6242.93
          1.4             2.33333                   5959.37
          1.5                 2.5                   5713.12
          1.6             2.66667                   5497.12
          1.7             2.83333                   5305.86
          1.8                   3                    5135.8
          1.9             3.16667                   4983.65
            2             3.33333                   4846.38
          2.1                 3.5                    4720.5
          2.2             3.66667                   4604.54
          2.3             3.83333                    4496.7
          2.4                   4                   4396.11
          2.5             4.16667                   4303.21"""

You can see that the final df is formatted correctly:

df = pd.read_fwf(io.StringIO(t), colspecs=[(0,13),(14,33),(34,59)])
df

Out[63]:
    Dose [Gy]  Relative dose [%]  Structure Volume [cm³]
0         0.0           0.000000                45888.70
1         0.1           0.166667                27061.70
2         0.2           0.333333                18911.60
3         0.3           0.500000                14907.60
4         0.4           0.666667                12602.70
5         0.5           0.833333                11127.80
6         0.6           1.000000                10041.90
7         0.7           1.166670                 9184.75
8         0.8           1.333330                 8480.96
9         0.9           1.500000                 7885.19
10        1.0           1.666670                 7382.82
11        1.1           1.833330                 6947.77
12        1.2           2.000000                 6570.69
13        1.3           2.166670                 6242.93
14        1.4           2.333330                 5959.37
15        1.5           2.500000                 5713.12
16        1.6           2.666670                 5497.12
17        1.7           2.833330                 5305.86
18        1.8           3.000000                 5135.80
19        1.9           3.166670                 4983.65
20        2.0           3.333330                 4846.38
21        2.1           3.500000                 4720.50
22        2.2           3.666670                 4604.54
23        2.3           3.833330                 4496.70
24        2.4           4.000000                 4396.11
25        2.5           4.166670                 4303.21

Comments

0

Although the other solutions might be more pythonic, I would propose to first convert the file such that it doesn't contain multiple blank spaces anymore. Then you can easily read it into a pandas dataframe:

import pandas as pd

infile = open('test.txt', 'r')
outfile = open('testout.txt', 'w')

for eachrow in infile:
    stripped = '#'.join(filter(None,eachrow.split('  ')))
    outfile.write(stripped)

infile.close()
outfile.close()

df = pd.read_csv('testout.txt', encoding = 'latin1', sep='#', engine='python')
df.head()

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.