Comparing two files using Awk in Linux

Question

I have two files, File A and File B. The structure of File A is shown below:

3314530275|76|1|20240422045006|
3335984469|64|2|20150804235959|
3367892381|203|3|20141025235959|
3369039388|203|4|20131219235959|

The contents of the second File B are given below:

3314530275|2000|999000000073101614|0|20370101000000|76|
3314530275|2000|999000000073101614|0|20370101000000|76|
3369039388|2000|812000002628721|-112|20360101235959|203|
3335984469|5037|5210367877660|180|20150213000000|64|
3335984469|5048|5210367877661|6|20150213000000|64|
3335984469|2000|812000002629182|1913|20360101235959|64|
3367892381|5014|5210365185964|419430400|20150308000000|203|
3367892381|5044|5210365185965|226020|20150308000000|203|
3367892381|2000|817000102009605|0|20360101235959|203|

The script should first read File A. If the third field ($3) is equal to 2, it should store the values of the first ($1) and fourth ($4) fields.

Afterwards it should check whether the $1 values of the second file are present among the values stored in the first step.

If the value is present and the second field is equal to 2000, it should print $1, $2, $4 and the fourth-field value stored from the first file.
If the value is present and the second field is not equal to 2000, it should print $1, $2, $4, $5.

Sample Output in the above mentioned case:

3335984469|5037|180|20150213000000|
3335984469|5048|6|20150213000000|
3335984469|2000|1913|20150804235959|

This is what I have so far:

awk -F \| 'FNR==NR {if($3 == 2) a[$1] = $4; next} ($1 in a) {if($2==2000) print$1"|"$2"|"$4"|"a[$1]"|"} ($1 in a) {if($2!=2000) print$1"|"$2"|"$4"|"$5"|"} ' FileA FileB > Output_File

Any help will be greatly appreciated.

I have come up with this so far, but I am not sure I am using the code correctly because the output seems to be missing a lot of values that should be present: awk -F \| 'FNR==NR {if($3 == 2) a[$1] = $4; next} ($1 in a) {if($2==2000) print$1"|"$2"|"$4"|"a[$1]"|"} ($1 in a) {if($2!=2000) print$1"|"$2"|"$4"|"$5"|"} ' FileA FileB > Output_File
– Muhammad Abdullah, Commented Feb 13, 2015 at 13:43
What I am looking for is an alternative way to achieve the same thing! My script works fine for a sample of values, but when I use it on large files the result is not the same.
– Muhammad Abdullah, Commented Feb 13, 2015 at 13:49
It looks like it should work, unless you have duplicate $1 in file A. Do you have duplicate first fields in file A?
– Wintermute, Commented Feb 13, 2015 at 14:18
@MuhammadAbdullah, looks right. The only change I'd make is to fold the if and else into the same block: $1 in a {if ($2 == 2000) print $1,$2,$4,a[$1],""; else print $1,$2,$4,$5,""} -- implies OFS="|"
– glenn jackman, Commented Feb 13, 2015 at 14:28
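
Wintermute's question about duplicate first fields can be checked directly. The following is a minimal sketch, assuming the pipe-delimited layout shown in the question. The file name FileA.sample, the variable dups, and the repeated third line are made up for illustration.

```shell
# Build a small sample in the File A layout. The third line repeats a key
# on purpose (hypothetical data, not taken from the question).
tmp=$(mktemp -d)
printf '%s\n' \
  '3314530275|76|1|20240422045006|' \
  '3335984469|64|2|20150804235959|' \
  '3335984469|64|2|20160101000000|' > "$tmp/FileA.sample"

# Print a first field the second time it is seen, i.e. once per duplicated key.
dups=$(awk -F'|' 'seen[$1]++ == 1 { print $1 }' "$tmp/FileA.sample")
echo "$dups"
rm -rf "$tmp"
```

If this prints anything for the real File A, then a[$1] = $4 keeps only the last $4 seen for that key, which could account for differences on larger files.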

Ed Morton · Accepted Answer · 2015-02-13 19:52:03Z

1

Your script will work as-is given correct contents of fileA (335984469 in FileA should be 3335984469, i.e. with one more leading 3), but it can be simplified to:

$ cat tst.awk
BEGIN{ FS=OFS="|" }
FNR==NR { if ($3==2) a[$1] = $4; next }
$1 in a { print $1, $2, $4, ($2==2000 ? a[$1] : $5), "" }

$ awk -f tst.awk fileA fileB
3335984469|5037|180|20150213000000|
3335984469|5048|6|20150213000000|
3335984469|2000|1913|20150804235959|

Feel free to cram it all back onto one line if you find that useful.
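
For reference, here is a sketch of the same logic on one line, run against File A and a subset of the File B rows from the question. The temporary file paths and the variable one_line are arbitrary choices for this example.

```shell
# Recreate the sample input from the question in a scratch directory.
tmp=$(mktemp -d)
cat > "$tmp/fileA" <<'EOF'
3314530275|76|1|20240422045006|
3335984469|64|2|20150804235959|
3367892381|203|3|20141025235959|
3369039388|203|4|20131219235959|
EOF
cat > "$tmp/fileB" <<'EOF'
3314530275|2000|999000000073101614|0|20370101000000|76|
3369039388|2000|812000002628721|-112|20360101235959|203|
3335984469|5037|5210367877660|180|20150213000000|64|
3335984469|5048|5210367877661|6|20150213000000|64|
3335984469|2000|812000002629182|1913|20360101235959|64|
3367892381|5014|5210365185964|419430400|20150308000000|203|
EOF

# First pass (FNR==NR): remember $4 for keys whose $3 is 2.
# Second pass: for matching keys, print a[$1] when $2 is 2000, else $5.
one_line=$(awk 'BEGIN{FS=OFS="|"} FNR==NR{if($3==2)a[$1]=$4;next} $1 in a{print $1,$2,$4,($2==2000?a[$1]:$5),""}' "$tmp/fileA" "$tmp/fileB")
echo "$one_line"
rm -rf "$tmp"
```

With this input only the three 3335984469 rows are printed, and the 2000 row carries the date stored from File A.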

If the above doesn't work, check for control characters in both of your input files, the most likely being control-Ms (carriage returns), as generously donated by Microsoft whenever their tools create files. You can check for them using cat -v, which shows them as ^M, and remove them with dos2unix or similar.
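
As a rough illustration of that check, the sketch below counts lines ending in a carriage return and strips them with awk itself, which may help where dos2unix is not installed. The file name FileA.dos and the variables cr_lines and cleaned are assumptions made for this example.

```shell
# Create a one-line sample with DOS (CRLF) line endings.
tmp=$(mktemp -d)
printf '3335984469|64|2|20150804235959|\r\n' > "$tmp/FileA.dos"

# Count lines that end in a carriage return; a non-zero count means the
# file contains DOS line endings.
cr_lines=$(awk '/\r$/ { n++ } END { print n+0 }' "$tmp/FileA.dos")
echo "lines ending in CR: $cr_lines"

# Remove a trailing carriage return from every line and print the result.
cleaned=$(awk '{ sub(/\r$/, "") } 1' "$tmp/FileA.dos")
echo "$cleaned"
rm -rf "$tmp"
```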

edited Feb 13, 2015 at 19:52

answered Feb 13, 2015 at 19:46

Ed Morton


4 Comments

Muhammad Abdullah Over a year ago

Thanks Ed Morton. There was a typo in file A as you pointed out. I corrected my mistake. Can you kindly explain how this part of the code works: ($2==2000 ? a[$1] : $5)?

tripleee Over a year ago

If $2==2000 is true, the value a[$1] is returned, otherwise the value $5

Ed Morton Over a year ago

@tripleee is correct and it's just a ternary expression, common to many languages - google "ternary expression".

Muhammad Abdullah Over a year ago

Thanks @tripleee and Ed Morton!
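
The ternary discussed in these comments can be tried in isolation. This is a small sketch with two made-up records in the File B layout. The key "k", the placeholder values FROM_A, B_DATE1 and B_DATE2, and the variable out are hypothetical.

```shell
# Two records sharing the key "k": one with $2 == 2000, one without.
# a["k"] stands in for the date that would have been stored from File A.
out=$(printf '%s\n' 'k|2000|x|1|B_DATE1|' 'k|5037|x|2|B_DATE2|' |
  awk 'BEGIN { FS = OFS = "|"; a["k"] = "FROM_A" }
       { print $1, $2, ($2 == 2000 ? a[$1] : $5) }')
echo "$out"
```

The first record prints the stored value and the second prints its own $5, matching tripleee's description.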

repzero · Answer · 2015-02-14 14:20:26Z

0

awk 'BEGIN{FS=OFS="|"};FNR==NR{if($3==2){a[$1]=$4};next};{if( $1 in a && $2==2000 ){print $1,$2,$4,a[$1]}else if ($1 in a && $2!=2000){print $1,$2,$4,$5}}' 'fileA' 'fileB'

Adjustments that I made to your command line to get the command line above:

if( $1 in a && $2==2000 ){print $1,$2,$4,a[$1]}

else if ($1 in a && $2!=2000){print $1,$2,$4,$5}

results

3335984469|5037|180|20150213000000
3335984469|5048|6|20150213000000
3335984469|2000|1913|20150804235959

answered Feb 14, 2015 at 14:20

repzero


1 Comment

Muhammad Abdullah Over a year ago

This works as well, but sorry @Xorg, I can only select one as the correct answer! :(
