 AC=126;AC_AFR=0;AC_AMR=0;AC_Adj=126;AC_EAS=120;AC_FIN=0;AC_Het=112;
 AC=12683;AC_AFR=4578;AC_AMR=559;AC_Adj=12680;AC_EAS=2104;AC_FIN=501;AC_Het=91966

I have data where one of the columns looks like this, i.e. key=value pairs. I would like to transform selected fields into columns, with the header being the key and the values in the column below it.

Not all the lines have the same data. Some lines would not have fields that appear in other lines.

output wanted:

AC      AC_AFR    AC_AMR and so on
126     0         0
12683   4578      559

Not sure how to do this or where to start.

  • Welcome to Unix.stackexchange! I recommend you take the tour. Commented Jan 23, 2017 at 5:08
  • I can see a few ways this could be done. Can you show us what you got so far and where you're stuck? Commented Jan 23, 2017 at 5:10
  • This command would do the trick: echo -e "AC\tAC_AFR\tAC_AMR\tAC_Adj\tAC_EAS\tAC_FIN\tAC_Het" && sed -e 's/[Aa-Zz _=]*//g' datacolum | sed -e 's/;/\t/g', but it is required that in your original file all the lines have the same fields. As I have read in your comments that some lines wouldn't have fields that are in other ones, I have edited your question with that relevant info. Commented Jan 23, 2017 at 8:23
  • How many rows do you have to deal with? Commented Jan 23, 2017 at 15:46
  • can get up to 50 million rows Commented Jan 24, 2017 at 3:38

4 Answers


The challenge here is that the data is not a simple CSV-type file, where the first line holds the column names and the remaining lines hold the column data by row.

Here you have column_name=column_data, delimited by ; characters. My solution would be to use a language like Python to read the file line by line, create a dict() from each line with a K:V pair for each field, and append each dict to a list() of all the lines.

Once I had that, I could process the list. If I'm on the first line, I'll print the column names and then the values; otherwise I will only print the values.

I think the method would be similar whatever language you're using, but it's definitely doable.

Here's a quick example in Python that uses OrderedDicts to preserve "column" order:

#!/usr/bin/python3
''' a quick example of a script to parse '=' delimited fields in
    ';' delimited columns of a text file.
    prints tab delimited columnar data with headers to STDOUT
'''
from collections import OrderedDict

with open('data') as infile:
    # split into lines, not on all whitespace
    FLINES = infile.read().splitlines()

DATA = []
for line in FLINES:
    fields = line.split(';')
    d = OrderedDict()
    for field in fields:
        if '=' in field:
            col, value = field.split('=', 1)
            d[col] = value
    DATA.append(d)

for L, D in enumerate(DATA):
    if L == 0:
        print('\t'.join(D.keys()))
    print('\t'.join(D.values()))
  • This example assumes that all your lines will have the same columns, because it will only print the col_names for the first entry it gets out of the list.
  • hi the challenge is that not all lines have the same data. Some lines would not have fields that are present in other lines. Commented Jan 23, 2017 at 6:16
  • @JanShamsani: Does your challenge allow you to do two passes? Commented Jan 23, 2017 at 6:24
  • Just extend the example. Read through all the lines in your file, extract all the column names, and then process all the lines, inserting zeros or nulls where the line doesn't have a certain column. The example I gave is just an example. Commented Jan 23, 2017 at 17:26
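The extension described in that comment can be sketched in Python like this (the `lines` list is inlined here for illustration in place of reading the file; a plain dict with first-seen key order stands in for the OrderedDict):

```python
#!/usr/bin/python3
# Sketch: two passes over the parsed lines, so rows missing some
# keys still line up under the right headers.
# Sample records inlined in place of reading a file:
lines = [
    'AC=126;AC_AFR=0;AC_Het=112;',
    'AC=12683;AC_AFR=4578;AC_AMR=559;',
]

def parse(line):
    '''Return a dict of key -> value for one ';'-delimited line.'''
    d = {}
    for field in line.split(';'):
        if '=' in field:
            k, v = field.split('=', 1)
            d[k] = v
    return d

rows = [parse(line) for line in lines]

# First pass: collect every key seen anywhere, in first-seen order.
keys = []
for row in rows:
    for k in row:
        if k not in keys:
            keys.append(k)

# Second pass: print headers, then each row with '' for missing keys.
print('\t'.join(keys))
for row in rows:
    print('\t'.join(row.get(k, '') for k in keys))
```

Missing fields come out as empty cells; substituting `row.get(k, '0')` would fill them with zeros instead.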

Quick and dirty solution with perl:

#!/usr/bin/env perl
use strict;
use warnings;

my %cache;
while (<>) {
    chomp;
    for my $pair ( split /;/ ) {
        $pair =~ s/=.*//;
        $cache{$pair} = 1;
    }
}
continue {
    last if eof;
}

my @keys = sort keys %cache;

print +( join "\t", @keys ), "\n";

while (<>) {
    chomp;
    my %h = map { m/([^=]+)=(\S+)/; ( $1, $2 ) } split /;/;
    print +( join "\t", map { $h{$_} // '' } @keys ), "\n";
}

Use it like this:

perl script.pl input.txt input.txt

This scans the input file twice, first to get the keys, then to format the columns. It's dirty because it should probably use Text::CSV and Array::Unique.

  • this kinda works but i do not have the skills to write perl, though i can understand what each command does. The data shown above is actually in column 71; there are other columns in the file. How can I change it to do this just for column 71? My basic idea would be to cut column 71, run the script, and paste it back. Commented Jan 23, 2017 at 8:33
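The cut-then-paste plumbing that comment describes can be sketched as follows (all filenames are hypothetical; a tiny 3-column stand-in file is built inline, with column 2 playing the role of column 71, and a sed one-liner standing in for the reshaping script, which likewise emits a header line first):

```shell
# Sketch of the cut / reshape / paste round trip on a stand-in file.
printf 'a\tAC=1\tx\nb\tAC=2\ty\n' > input.tsv

cut -f2 input.tsv > col.txt       # extract the key=value column
cut -f1,3 input.tsv > rest.tsv    # keep all the other columns

# Reshape col.txt; this sed stand-in plays the role of script.pl
# and, like it, prints a header line before the values.
{ echo 'AC'; sed 's/.*=//' col.txt; } > wide.tsv

# wide.tsv has one extra (header) line the other columns lack,
# so drop it before pasting the pieces back together:
tail -n +2 wide.tsv > values.tsv
paste rest.tsv values.tsv > combined.tsv
```

combined.tsv then holds the untouched columns with the reshaped values appended; to keep the header row, one would first prepend matching header fields to rest.tsv.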

Using GNU awk

gawk -F '[=;]' '
    {for (i=1; i<NF; i+=2) values[$i][NR] = $(i+1)}
    END {
        PROCINFO["sorted_in"] = "@ind_str_asc"
        for (key in values) printf "%s\t", key
        print ""
        for (line=1; line<=NR; line++) {
            for (key in values) printf "%s\t", values[key][line]
            print ""
        }
    }
' filename
AC      AC_AFR  AC_AMR  AC_Adj  AC_EAS  AC_FIN  AC_Het  
126     0       0       126     120     0       112 
12683   4578    559     12680   2104    501     91966   

I'm using 2 field separator characters here, so all the odd-numbered fields are the keys, and all the even-numbered fields are the values.

  • This reads the entire file into memory. You can avoid that the same way I did in my answer, by passing over the file twice: detect the keys on the first pass, then format the entries on the second. Commented Jan 23, 2017 at 15:16

The following pipeline of sed and mlr (Miller) removes a trailing ; from any line that has one, then reads each line as a record of ;-delimited fields, where = separates a field's name from its value. The ordering of the fields within each record does not matter. The output is TSV.

$ sed 's/;$//' file | mlr --ifs ';' --otsv unsparsify
AC      AC_AFR  AC_AMR  AC_Adj  AC_EAS  AC_FIN  AC_Het
126     0       0       126     120     0       112
12683   4578    559     12680   2104    501     91966

The unsparsify sub-command of mlr will assign empty values to missing fields (fields present in some records but missing in others).
