Data pivot with awk

Question

I'm trying to pivot a file using awk. This is an example of my input File:

VarName;TimeString;VarValue;Validity;Time_ms
A;23.11.201215:03:53;1;1;41236627696,7593
C;23.11.201215:03:53;2;1;41236627696,7593
D;23.11.201215:03:53;3;1;41236627696,7593
A;23.11.201215:04:53;31;1;41236628391,2037
B;23.11.201215:04:53;12;1;41236628391,2037
C;23.11.201215:04:53;1;1;41236628391,2037
D;23.11.201215:05:53;8;1;41236629097,2222
B;23.11.201215:05:53;7;1;41236629097,2222
C;23.11.201215:05:53;15;1;41236629097,2222

And this is my desired output:

TimeString;Time_ms;A;B;C;D
23.11.201215:03:53;41236627696,7593;1;;2;3
23.11.201215:04:53;41236628391,2037;31;12;1;
23.11.201215:05:53;41236629097,2222;;7;15;8

Which is the best way to obtain this result?

Solved using the script writed by @steve

Could you explain what a data pivot is, for those of us who don't grok databases? Does it have to be awk — this may be easier to do in perl/python/ruby? — Gilles 'SO- stop being evil'
– Gilles 'SO- stop being evil', Commented Jan 11, 2013 at 23:39
I'm new to programming... I performed my first data management with sed and now im looking to awk. I don't now if this may be easier to do in perl/python/ruby but I think that awk can do this. Any Help is usefull thank you — Ludovico
– Ludovico, Commented Jan 12, 2013 at 7:00
in And this is my desired output: 2nd line I see an A value of 31.Where does this come from? How does that derive from your example? - sorry for the noise - I've got it — sparkie
– sparkie, Commented Jan 13, 2013 at 6:59
Please don't mark your question as "solved": this is a wiki, not a forum. Accept the correct answer so it is marked as such. — jasonwryan
– jasonwryan, Commented Jan 16, 2013 at 19:53

Steve · Accepted Answer · 2013-01-16 04:53:28Z

Here's one way using gawk. Run like:

awk -f script.awk file

Contents of script.awk:

BEGIN {
    FS=OFS=";"
}

NR==1 {
    r = $2 FS $5
    next
}

{
    !x[$1]
    a[$2,$5][$1]=$3
}

END {

    m = asorti(x,y)
    for (k=1;k<=m;k++) {
        r = r FS y[k]
    }
    print r

    n = asorti(a,b)
    for (i=1;i<=n;i++) {
        for (j=1;j<=m;j++) {
            for (k in a[b[i]]) {
                if (k == y[j]) {
                    var = a[b[i]][k]
                }
            }

            line = line FS var
            var = ""
        }
        sub(SUBSEP, FS, b[i])

        print b[i] line
        line = ""
    }
}

Alternatively, here's the one liner:

awk 'BEGIN { FS=OFS=";" } NR==1 { r = $2 FS $5; next } { !x[$1]; a[$2,$5][$1]=$3 } END { m = asorti(x,y); for (k=1;k<=m;k++) { r = r FS y[k] } print r; n = asorti(a,b); for (i=1;i<=n;i++) { for (j=1;j<=m;j++) { for (k in a[b[i]]) { if (k == y[j]) { var = a[b[i]][k] } } line = line FS var; var = "" } sub(SUBSEP, FS, b[i]); print b[i] line; line = "" } }' file

Results:

TimeString;Time_ms;A;B;C;D
23.11.201215:03:53;41236627696,7593;1;;2;3
23.11.201215:04:53;41236628391,2037;31;12;1;
23.11.201215:05:53;41236629097,2222;;7;15;8

You need to run dos2unix on your file first. i.e:

dos2unix Flussi0.csv

Alternatively, change the record separator to \r\n so that awk knows what a windows newline ending looks like. You can do this in the BEGIN block:

BEGIN {
    FS=OFS=";"
    RS="\r\n"
}

Results with the input file posted in the comments below:

"TimeString";"Time_ms";"FIT01";"FIT02";"FIT03";"FIT04";"FIT05";"FIT06"
"22.06.2012 09:31:33";41082396909,7222;1,157408E-02;5,787041E-03;2,507718E-02;2,89352E-03;2,314816E-02;5,787035E-04
"22.06.2012 09:32:34";41082397615,7407;1,157408E-02;5,787041E-03;2,314816E-02;2,89352E-03;2,713479E-02;5,787035E-04
"22.06.2012 09:33:35";41082398321,7593;1,157408E-02;5,787041E-03;2,314816E-02;2,89352E-03;2,314816E-02;5,787035E-04
"22.06.2012 09:34:35";41082399016,2037;1,157408E-02;5,787041E-03;2,314816E-02;2,89352E-03;2,535274E-02;5,787035E-04
"22.06.2012 09:35:36";41082399722,2222;;;;;2,314816E-02;

The script fully work whit the example but I tried on an other file and fails. I tried whit a real file that I must edit whit this script... I don't know why but something goes wrong (unexpected line break after column 2)... What's the difference between my example and this file? Did I something wrong? Here there is the output — Ludovico
– Ludovico, Commented Jan 14, 2013 at 20:37
Update: script fails when numbers are present in this format: 1E-02 — Ludovico
– Ludovico, Commented Jan 14, 2013 at 21:33
@Ludovico: There's nothing wrong with the script. You're just not using it correctly. Please see the update. HTH. — Steve
– Steve, Commented Jan 16, 2013 at 4:54
Thank you @steve I modified the awk script as you write. It's work very well. I hope this discussion will help other people — Ludovico
– Ludovico, Commented Jan 16, 2013 at 19:46

Scott - Слава Україні · Accepted Answer · 2013-01-12 00:35:29Z

1

The best way? I don’t know. Here’s a way. I assumed that the code didn’t really need to look at the header line of the input data, and could just hard-code TimeString;Time_ms;.

(line > /dev/null; sort) < input_file > tmp0    # Discard the header line; sort the data.
        # Here lies the basic pivot:
awk -F";" '
    {
        print $1 > "tmp1"
        print $2 > "tmp2"
        print $5 > "tmp5"
    }' tmp0
echo "TimeString;Time_ms;\c"
tr "\n" ";" < tmp1; echo
tr "\n" ";" < tmp2; echo
tr "\n" ";" < tmp5; echo

This will end each line of the output with a semicolon (;). It wasn’t clear whether you wanted that. If you don’t want it, you can probably figure out a way to eliminate it.

answered Jan 12, 2013 at 0:35

Scott - Слава Україні

10.9k10 gold badges42 silver badges70 bronze badges

Thank you for your help... It's a good start but I should do it's a little bit different... My example was not so clear... I give you an other... I Modified original post. The semicolon at the end of the indicate the missing value (see the new example row two)

Ludovico
– Ludovico

2013-01-12 07:22:40 +00:00
Commented Jan 12, 2013 at 7:22
So, are you saying that you can handle it from here? Or do you still need help? I suggest that you work harder at making you question(s) understandable. For example, TimeString and Time_ms go together –– couldn’t you have rearranged your input to put them in adjacent columns? Also, every one of your TimeString values begins with 23.11.201215:0, and every one of your Time_ms values begins with 4123662. This makes it very difficult for a person to look at the values and see which ones are the same and which ones are different. I suggest that you remove the duplication.

Scott - Слава Україні
– Scott - Слава Україні

2013-01-13 20:54:22 +00:00
Commented Jan 13, 2013 at 20:54
You're right, I apologize

Ludovico
– Ludovico

2013-01-14 18:59:08 +00:00
Commented Jan 14, 2013 at 18:59

Add a comment |

Kusalananda · Accepted Answer · 2023-01-25 07:28:53Z

Using the reshape sub-command of Miller (mlr, a CSV-aware tool), followed by unsparsify:

$ mlr --csv --fs ';' reshape -s VarName,VarValue then unsparsify file
TimeString;Validity;Time_ms;A;C;D;B
23.11.201215:03:53;1;41236627696,7593;1;2;3;
23.11.201215:04:53;1;41236628391,2037;31;1;;12
23.11.201215:05:53;1;41236629097,2222;;15;8;7

The -s option of the reshape sub-command takes a comma-delimited pair consisting of a key-field name and a value-field name. It then does a long-to-wide pivot operation, creating the fields in the key field with the data from the value field.

The unsparsify operation is necessary to add the empty fields in the records that do not have all fields.

To also reorder the four last fields in the correct order and remove the Validity field:

$ mlr --csv --fs ';' reshape -s VarName,VarValue then unsparsify then reorder -e -f A,B,C,D then cut -x -f Validity file
TimeString;Time_ms;A;B;C;D
23.11.201215:03:53;41236627696,7593;1;;2;3
23.11.201215:04:53;41236628391,2037;31;12;1;
23.11.201215:05:53;41236629097,2222;;7;15;8

Or, shorter using only the cut sub-command for rearranging and selecting the output fields:

$ mlr --csv --fs ';' reshape -s VarName,VarValue then unsparsify then cut -o -f TimeString,Time_ms,A,B,C,D file
TimeString;Time_ms;A;B;C;D
23.11.201215:03:53;41236627696,7593;1;;2;3
23.11.201215:04:53;41236628391,2037;31;12;1;
23.11.201215:05:53;41236629097,2222;;7;15;8

In the general case, we may not know the names of the newly generated fields, so we may want to sort the fields by their names, remove the Validity field, and then put the two time fields first without mentioning the new fields by name:

$ mlr --csv --fs ';' reshape -s VarName,VarValue then unsparsify then sort-within-records then cut -x -f Validity then reorder -f TimeString,Time_ms file
TimeString;Time_ms;A;B;C;D
23.11.201215:03:53;41236627696,7593;1;;2;3
23.11.201215:04:53;41236628391,2037;31;12;1;
23.11.201215:05:53;41236629097,2222;;7;15;8

Stack Exchange Network

Data pivot with awk

3 Answers 3

You must log in to answer this question.

Hot Network Questions

Data pivot with awk

3 Answers 3

You must log in to answer this question.

Related

Hot Network Questions