Revisions to match column1-5 in both files and print matching column1-5 along with column6&7 of file1 and column6 of file2

added 2 characters in body

edited Dec 16, 2021 at 14:45

Also - don't use all upper case variable names (e.g. A), so you avoid clashing with builtin variable names, and use parentheses to make your code clearer if nothing else. E.g. I don't know offhand if $1 FS $2 in A means ($1 FS) ($2 in A) (i.e. the concatenation of $1 FS with the result of looking up $2 in A) or ($1 FS $2) in A (i.e. the result of looking up the concatenation of $1 FS $2 in A) or something else, but if you simply write it as ($1 FS $2) in A then it's clear and unambiguous.
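As a rough illustration of the parenthesized form, here is a minimal sketch using made-up two-field keys (the file contents and the array name seen are assumptions for the demo, not taken from the question). For what it's worth, POSIX awk gives concatenation higher precedence than in, so the unparenthesized form should parse the same way, but the parentheses save every reader from having to look that up.

```shell
# Hypothetical inputs: keys stored from the first file, looked up in the second.
tmp=$(mktemp -d)
printf 'a b\nc d\n' > "$tmp/file1"
printf 'a b\nx y\n' > "$tmp/file2"

out=$(awk '
    # First file: referencing the element is enough to create the key.
    FNR==NR { seen[$1 FS $2]; next }
    # Second file: the parentheses make the lookup key explicit.
    { print $0, ( ($1 FS $2) in seen ? "found" : "missing" ) }
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$out"
# a b found
# x y missing
rm -rf "$tmp"
```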

added 533 characters in body

edited Dec 16, 2021 at 14:17

Ed Morton

$ cat tst.awk
{ key = $1 OFS $2 OFS $3 OFS $4 OFS $5 }
FNR==NR {
    file1[key] = $6 OFS $7
    next
}
{
    print $0, (key in file1 ? file1[key] : "NA" OFS "NA")
    delete file1[key]
}
END {
    for ( key in file1 ) {
        print key, "NA", file1[key]
    }
}

Note the use of a variable to hold the key index for both files. When you have a key that's a combination of values, don't write it out multiple times in your code, as it's easy to get that wrong. That's what happened in your code, where you forgot to include $2 in the lookup of the map when processing file2: you used $1 FS $3 FS $4 FS $5 in A instead of $1 FS $2 FS $3 FS $4 FS $5 in A, and did the same again in A[$1 FS $3 FS $4 FS $5].

Also - don't use all upper case variable names (e.g. A), so you avoid clashing with builtin variable names, and use parentheses to make your code clearer if nothing else. E.g. I don't know offhand if $1 FS $2 in A means ($1 FS) ($2 in A) (i.e. the concatenation of $1 FS with the result of looking up $2 in A) or ($1 FS $2) in A (i.e. the result of looking up the concatenation of $1 FS $2 in A) or something else, but if you simply write it as ($1 FS $2) in A then it's clear and unambiguous.

Finally, note that I used OFS instead of FS as the array index separator. That's because the code prints that index in the END section, and the thing that should separate output fields is OFS, not FS. In this case they are both the same thing, a blank char, so you won't see a difference, but if you wanted to output CSV instead then all you need to do for the keys printed in END is add -v OFS=',' and that part would work as-is (the lines printed from $0 keep their input separators unless the record is rebuilt, e.g. with $1=$1), whereas if you used FS between the array index subscripts then you'd need to change the code.
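To make the OFS point concrete, here is a small sketch that runs only the END logic against a single made-up file1 row (the row is an assumption modelled on the last line of the sample output), once with the key joined by OFS and once with FS. Only file1 is read, so every key is still in the array when END runs.

```shell
tmp=$(mktemp -d)
# Hypothetical 7-field row: columns 1-5 form the key, 6 and 7 are the values.
printf '18 1400000 1600000 33 33 3 3\n' > "$tmp/file1"

# Key joined with OFS: the key itself comes out comma-separated.
with_ofs=$(awk -v OFS=',' '
    { file1[$1 OFS $2 OFS $3 OFS $4 OFS $5] = $6 OFS $7 }
    END { for ( key in file1 ) print key, "NA", file1[key] }
' "$tmp/file1")

# Key joined with FS: the blanks inside the key survive into the output.
with_fs=$(awk -v OFS=',' '
    { file1[$1 FS $2 FS $3 FS $4 FS $5] = $6 OFS $7 }
    END { for ( key in file1 ) print key, "NA", file1[key] }
' "$tmp/file1")

printf '%s\n%s\n' "$with_ofs" "$with_fs"
# 18,1400000,1600000,33,33,NA,3,3
# 18 1400000 1600000 33 33,NA,3,3
rm -rf "$tmp"
```

Lines printed with print $0, ... are a separate matter: awk only rejoins $0 with OFS after a field is assigned, so blank-separated input would need something like $1=$1 before the print to come out fully comma-separated.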

added 533 characters in body

edited Dec 16, 2021 at 14:10

Ed Morton

Your code doesn't match your description, e.g. it's storing $3 from file1 to print later but you say you want to print $4 and $6, and then your expected output doesn't match either of those and shows $6 and $7 from file1 instead. So, just going by your expected output, I think this is what you want:

awk '
    { key=$1 FS $2 FS $3 FS $4 FS $5 }
    FNR==NR{ map[key] = $6 OFS $7; next }
    { print $0, (key in map ? map[key] : "NA" OFS "NA") }
' file1 file2

Your code doesn't match your description, e.g. it's storing $3 from file1 to print later but you say you want to print $4 and $6, and then your expected output doesn't match either of those and shows $6 and $7 from file1 instead. So, just going by your expected output, I think this is what you want:

$ cat tst.awk
{ key = $1 FS $2 FS $3 FS $4 FS $5 }
FNR==NR {
    file1[key] = $6 OFS $7
    next
}
{
    print $0, (key in file1 ? file1[key] : "NA" OFS "NA")
    delete file1[key]
}
END {
    for ( key in file1 ) {
        print key, "NA", file1[key]
    }
}

$ awk -f tst.awk file1 file2
12  800000  900000  66  73  145(28.12) 0 0
12  900000  1000000 73  48  703(51.17) 2 2
13  1000000 1100000 11  11  545(43.99) 0 0
12  1100000 1200000 12  12  699(45.30) 0 0
14  16100000    16200000    0   0   11(14.50) NA NA
14  16200000    16300000    0   0   0 NA NA
18 1400000 1600000 33 33 NA 3 3

Note the use of a variable to hold the key index for both files. When you have a key that's a combination of values, don't write it out multiple times in your code, as it's easy to get that wrong. That's what happened in your code, where you forgot to include $2 in the lookup of the map when processing file2: you used $1 FS $3 FS $4 FS $5 in A instead of $1 FS $2 FS $3 FS $4 FS $5 in A.
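A quick sketch of why the single key variable matters. The one-row inputs are made up, and the "buggy" script is only modelled on the mistake described above (the question's actual code isn't reproduced here); map is just an illustrative array name.

```shell
tmp=$(mktemp -d)
# Hypothetical rows: file1 has 7 fields, file2 has 6, sharing columns 1-5.
printf '12 800000 900000 66 73 0 0\n' > "$tmp/file1"
printf '12 800000 900000 66 73 145(28.12)\n' > "$tmp/file2"

# Key written out twice, with $2 dropped the second time: the lookup never matches.
buggy=$(awk '
    FNR==NR { map[$1 FS $2 FS $3 FS $4 FS $5] = $6 OFS $7; next }
    { print $0, ( ($1 FS $3 FS $4 FS $5) in map ? map[$1 FS $3 FS $4 FS $5] : "NA" OFS "NA" ) }
' "$tmp/file1" "$tmp/file2")

# Key built once per record and reused for both the store and the lookup.
fixed=$(awk '
    { key = $1 FS $2 FS $3 FS $4 FS $5 }
    FNR==NR { map[key] = $6 OFS $7; next }
    { print $0, ( key in map ? map[key] : "NA" OFS "NA" ) }
' "$tmp/file1" "$tmp/file2")

printf '%s\n%s\n' "$buggy" "$fixed"
# 12 800000 900000 66 73 145(28.12) NA NA
# 12 800000 900000 66 73 145(28.12) 0 0
rm -rf "$tmp"
```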

answered Dec 16, 2021 at 14:04

Ed Morton
