 AC=126;AC_AFR=0;AC_AMR=0;AC_Adj=126;AC_EAS=120;AC_FIN=0;AC_Het=112;
 AC=12683;AC_AFR=4578;AC_AMR=559;AC_Adj=12680;AC_EAS=2104;AC_FIN=501;AC_Het=91966

I have data where one of the columns looks like this, i.e. key=value pairs. I would like to transform selected fields into columns, with the header being the key and the values in the column below it.

Not all the lines have the same data. Some lines would not have fields that appear in other lines.

output wanted:

AC      AC_AFR    AC_AMR and so on
126     0         0
12683   4578      559

Not sure how to do this or where to start.

  • Welcome to Unix.stackexchange! I recommend you take the tour. Commented Jan 23, 2017 at 5:08
  • I can see a few ways this could be done. Can you show us what you got so far and where you're stuck? Commented Jan 23, 2017 at 5:10
  • This command would do the trick: echo -e "AC\tAC_AFR\tAC_AMR\tAC_Adj\tAC_EAS\tAC_FIN\tAC_Het" && sed -e 's/[Aa-Zz _=]*//g' datacolum | sed -e 's/;/\t/g', but it is required that in your original file all the lines have the same fields. As I have read in your comments that some lines wouldn't have fields that are in other ones, I have edited your question with that relevant info. Commented Jan 23, 2017 at 8:23
  • How many rows do you have to deal with? Commented Jan 23, 2017 at 15:46
  • can get up to 50 million rows Commented Jan 24, 2017 at 3:38

4 Answers


The challenge here is that the data is not a simple CSV-type file, where the first line holds the column names and the remaining lines hold the column data by row.

Here you have column_name=column_data, delimited by ; characters. My solution would be to use a language like Python to read the file line by line, create a dict() from each line with a K:V pair for each field, and append each dict to a list() of all the lines.

Once I had that, I could process the list. If I'm on the first line, I'll print the column names and then the values; otherwise I will only print the values.

I think the method would be similar whatever language you're using, but it's definitely doable.

Here's a quick example in Python that uses OrderedDicts to preserve "column" order:

#!/usr/bin/python3
''' a quick example of a script to parse '=' delimited fields in
    ';' delimited columns of a text file.
    prints tab delimited columnar data with headers to STDOUT
'''
from collections import OrderedDict

with open('data') as infile:
    # split into lines, not on all whitespace
    FLINES = infile.read().splitlines()

DATA = []
for line in FLINES:
    fields = line.split(';')
    d = OrderedDict()
    for field in fields:
        if '=' in field:
            col, value = field.split('=', 1)
            d[col] = value
    DATA.append(d)

for L, D in enumerate(DATA):
    if L == 0:
        print('\t'.join(D.keys()))
    print('\t'.join(D.values()))
  • This example assumes that all your lines will have the same columns, because it will only print the col_names for the first entry it gets out of the list.
  • hi the challenge is that not all lines have the same data. Some lines would not have fields that are present in other lines. Commented Jan 23, 2017 at 6:16
  • @JanShamsani: Does your challenge allow you to do two passes? Commented Jan 23, 2017 at 6:24
  • Just extend the example. Read through all the lines in your file, extract all the column names, and then process all the lines, inserting zeros or nulls where the line doesn't have a certain column. The example I gave is just an example. Commented Jan 23, 2017 at 17:26
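The extension described in that comment can be sketched in Python like this (the `lines` list is inlined here for illustration in place of reading the file; a plain dict with first-seen key order stands in for the OrderedDict):

```python
#!/usr/bin/python3
# Sketch: two passes over the parsed lines, so rows missing some
# keys still line up under the right headers.
# Sample records inlined in place of reading a file:
lines = [
    'AC=126;AC_AFR=0;AC_Het=112;',
    'AC=12683;AC_AFR=4578;AC_AMR=559;',
]

def parse(line):
    '''Return a dict of key -> value for one ';'-delimited line.'''
    d = {}
    for field in line.split(';'):
        if '=' in field:
            k, v = field.split('=', 1)
            d[k] = v
    return d

rows = [parse(line) for line in lines]

# First pass: collect every key seen anywhere, in first-seen order.
keys = []
for row in rows:
    for k in row:
        if k not in keys:
            keys.append(k)

# Second pass: print headers, then each row with '' for missing keys.
print('\t'.join(keys))
for row in rows:
    print('\t'.join(row.get(k, '') for k in keys))
```

Missing fields come out as empty cells; substituting `row.get(k, '0')` would fill them with zeros instead.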

Quick and dirty solution with perl:

#!/usr/bin/env perl
use strict;
use warnings;

my %cache;
while (<>) {
    chomp;
    for my $pair ( split /;/ ) {
        $pair =~ s/=.*//;
        $cache{$pair} = 1;
    }
}
continue {
    last if eof;
}

my @keys = sort keys %cache;

print +( join "\t", @keys ), "\n";

while (<>) {
    chomp;
    my %h = map { m/([^=]+)=(\S+)/; ( $1, $2 ) } split /;/;
    print +( join "\t", map { $h{$_} // '' } @keys ), "\n";
}

Use it like this:

perl script.pl input.txt input.txt

This scans the input file twice, first to get the keys, then to format the columns. It's dirty because it should probably use Text::CSV and Array::Unique.

  • this kinda works but i do not have the skills to write perl, though i can understand what each command does. The data shown above is actually in column 71; there are other columns in the file. How can I change it to do this just for column 71? My basic idea would be to cut column 71, run the script, and paste it back. Commented Jan 23, 2017 at 8:33
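The cut-then-paste plumbing that comment describes can be sketched as follows (all filenames are hypothetical; a tiny 3-column stand-in file is built inline, with column 2 playing the role of column 71, and a sed one-liner standing in for the reshaping script, which likewise emits a header line first):

```shell
# Sketch of the cut / reshape / paste round trip on a stand-in file.
printf 'a\tAC=1\tx\nb\tAC=2\ty\n' > input.tsv

cut -f2 input.tsv > col.txt       # extract the key=value column
cut -f1,3 input.tsv > rest.tsv    # keep all the other columns

# Reshape col.txt; this sed stand-in plays the role of script.pl
# and, like it, prints a header line before the values.
{ echo 'AC'; sed 's/.*=//' col.txt; } > wide.tsv

# wide.tsv has one extra (header) line the other columns lack,
# so drop it before pasting the pieces back together:
tail -n +2 wide.tsv > values.tsv
paste rest.tsv values.tsv > combined.tsv
```

combined.tsv then holds the untouched columns with the reshaped values appended; to keep the header row, one would first prepend matching header fields to rest.tsv.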

Using GNU awk

gawk -F '[=;]' '
    {for (i=1; i<NF; i+=2) values[$i][NR] = $(i+1)}
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (key in values) printf "%s\t", key
        print ""
        for (line=1; line<=NR; line++) {
            for (key in values) printf "%s\t", values[key][line]
            print ""
        }
    }
' filename
AC      AC_AFR  AC_AMR  AC_Adj  AC_EAS  AC_FIN  AC_Het  
126     0       0       126     120     0       112 
12683   4578    559     12680   2104    501     91966   

I'm using 2 field separator characters here, so all the odd-numbered fields are the keys, and all the even-numbered fields are the values.

  • This reads the entire file into memory. You can avoid that the same way I did in my answer, by passing over the file twice: detect the keys on the first pass, then format the entries on the second. Commented Jan 23, 2017 at 15:16

The following pipeline of sed and mlr (Miller) removes a trailing ; from any line that has one, then reads each line as a record of ;-delimited fields, where = separates a field's name from its value. The ordering of the fields within each record does not matter. The output is TSV.

$ sed 's/;$//' file | mlr --ifs ';' --otsv unsparsify
AC      AC_AFR  AC_AMR  AC_Adj  AC_EAS  AC_FIN  AC_Het
126     0       0       126     120     0       112
12683   4578    559     12680   2104    501     91966

The unsparsify sub-command of mlr will assign empty values to missing fields (fields present in some records but missing in others).
